WO2020113673A1 - Cancer subtype classification method employing multiomics integration - Google Patents

Cancer subtype classification method employing multiomics integration

Info

Publication number
WO2020113673A1
Authority
WO
WIPO (PCT)
Prior art keywords
omics
matrix
patient
data
target
Prior art date
Application number
PCT/CN2018/121838
Other languages
French (fr)
Chinese (zh)
Inventor
杨超
殷鹏
蒋佳新
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2020113673A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The present application provides a cancer subtype classification method employing multiomics integration. The method comprises: acquiring target multiomics data for each patient in a target cancer patient group; calculating an omics similarity matrix for each omics; predicting, for each omics similarity matrix, a corresponding predicted similarity matrix; correcting the predicted similarity matrix using the omics similarity matrix to obtain a corrected matrix; performing weighted fusion of the corrected matrices to obtain a fusion matrix; and performing spectral clustering on the fusion matrix to establish a cancer subtype category label corresponding to the fusion matrix of each patient. The present application improves the accuracy of the classification and evaluation of cancer subtypes, classifies patients through a more flexible integration method, improves the efficiency of data analysis, and facilitates research on cancer subtypes.

Description

A cancer subtype classification method based on multi-omics integration
Cross-reference to related applications
This application claims priority to Chinese patent application No. 201811496363.3, entitled "A cancer subtype classification method based on multi-omics integration", filed with the Chinese Patent Office on December 07, 2018, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of cancer subtype classification and evaluation, and more specifically to a cancer subtype classification method based on multi-omics integration.
Background
The identification of cancer subtypes is essential for cancer diagnosis and treatment. Classifying cancer subtypes using single-omics information alone tends to produce unbalanced categories, and the resulting subtypes often differ greatly in survival rate. Therefore, in recent years many methods have been proposed for identifying cancer subtypes by integrating target multi-omics data.
Common methods for integrating cancer target multi-omics data include feature extraction, dimensionality reduction, and similarity matrix calculation; feature extraction and dimensionality reduction are generally used together, as in latent variable factorization. Common clustering methods include K-means, mean-shift clustering, density-based clustering, and spectral clustering.
However, existing methods consider neither the similarity bias between samples nor the weights of the different omics data during integration, so the cancer subtype classification results for patients have poor accuracy and large errors.
Summary of the invention
In view of this, the present application provides a cancer subtype classification method based on multi-omics integration, including:
Acquiring target multi-omics data for each patient in a target cancer patient group, and calculating the omics similarity matrix corresponding to each omics in the target multi-omics data;
Predicting each omics similarity matrix using linear regression to obtain the predicted similarity matrix corresponding to each omics similarity matrix;
Correcting the predicted similarity matrix using the omics similarity matrix to obtain a corrected matrix;
Performing weighted fusion on the corrected matrices corresponding to the respective omics to obtain a fusion matrix;
Performing spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient.
Preferably, the step of "predicting each omics similarity matrix using linear regression to obtain the predicted similarity matrix corresponding to each omics similarity matrix" includes:
Based on linear regression, taking in turn the omics similarity matrix of each omics in the target multi-omics data as the target matrix, and performing linear regression prediction on the target matrix using the omics similarity matrices of the other omics, so as to obtain a predicted value for each entry of each target matrix and thereby the predicted similarity matrix, containing the predicted values, that corresponds to each omics similarity matrix.
Preferably, the linear regression prediction is performed using the following formula:
Figure PCTCN2018121838-appb-000001
where β_0 is a hyperparameter, β_t are the parameters learned by the linear regression model, and r'_{k,ij} is the predicted value.
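The regression step can be sketched as follows. This is one reading of the text, not the patent's own implementation: the formula image is not reproduced here, so the sketch assumes the usual form of an intercept β_0 plus one coefficient β_t per other omics, fitted by ordinary least squares over all off-diagonal patient pairs. The text calls β_0 a hyperparameter, whereas this sketch simply fits it as an intercept. The function names `solve_linear_system` and `predict_similarity` and the toy matrices are illustrative only.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def predict_similarity(matrices, k):
    """Predict the similarity matrix of omics k from the other omics (assumed OLS)."""
    n = len(matrices[k])
    others = [m for t, m in enumerate(matrices) if t != k]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    # One design row per patient pair: [1, r_{t1,ij}, r_{t2,ij}, ...]
    rows = [[1.0] + [m[i][j] for m in others] for i, j in pairs]
    targets = [matrices[k][i][j] for i, j in pairs]
    p = len(rows[0])
    # Normal equations (X^T X) beta = X^T y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * y for r, y in zip(rows, targets)) for a in range(p)]
    beta = solve_linear_system(xtx, xty)
    predicted = [row[:] for row in matrices[k]]  # diagonal is left unchanged
    for (i, j), r in zip(pairs, rows):
        predicted[i][j] = sum(b * v for b, v in zip(beta, r))
    return beta, predicted


# Toy data: off-diagonal entries of m_a equal 0.5 + 2 * m_b exactly.
m_b = [[0.0, 0.1, 0.4], [0.1, 0.0, 0.7], [0.4, 0.7, 0.0]]
m_a = [[0.0, 0.7, 1.3], [0.7, 0.0, 1.9], [1.3, 1.9, 0.0]]
beta, predicted = predict_similarity([m_a, m_b], k=0)
```

With more than two omics, the same call fits one coefficient per remaining omics; the pairs are not statistically independent, which this sketch ignores.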
Preferably, the step of "correcting the predicted similarity matrix using the omics similarity matrix to obtain a corrected matrix" includes:
Averaging the omics similarity matrix of each omics in the target multi-omics data with its corresponding predicted similarity matrix, to obtain the corrected matrix of each omics in the target multi-omics data.
Preferably, the average is calculated by the following formula:
Figure PCTCN2018121838-appb-000002
where k indexes the omics in the target multi-omics data, W_k is the corrected matrix, M_k is the omics similarity matrix, and M'_k is the predicted similarity matrix.
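Read literally, "summing and averaging" the two matrices gives an element-wise mean. The formula image is not reproduced in this text, so the following is a reconstruction from the symbol definitions above rather than a copy of the original:

```latex
W_k = \frac{M_k + M'_k}{2},
\qquad\text{i.e.}\qquad
(W_k)_{ij} = \tfrac{1}{2}\bigl( (M_k)_{ij} + (M'_k)_{ij} \bigr)
```

For example, an observed similarity of 0.8 and a predicted similarity of 0.6 for the same patient pair would give a corrected value of 0.7.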
Preferably, the step of "acquiring target multi-omics data for each patient in the target cancer patient group, and calculating the omics similarity matrix corresponding to each omics in the target multi-omics data" includes:
Determining the target multi-omics data of each patient in the target cancer patient group, and performing mean imputation on the data of any missing omics;
Performing similarity calculation on the target multi-omics data to obtain the similarity matrix corresponding to each omics in the target multi-omics data, where the similarity is calculated by the following formula:
Figure PCTCN2018121838-appb-000003
where x_{k,it} is the value of feature t for cancer patient i in the k-th omics, and the symbol shown in Figure PCTCN2018121838-appb-000004 is the mean value of cancer patient i in the k-th omics.
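A minimal sketch of these two steps, under stated assumptions. The similarity formula image is not reproduced here; because the definitions centre each patient's values on that patient's mean and a Fisher transformation follows, the sketch assumes a Pearson correlation between patients. The imputation is assumed to be per feature, using the mean over patients who have that feature. The names `impute_missing_with_mean` and `pearson_similarity` and the toy data are illustrative.

```python
import math


def impute_missing_with_mean(data):
    """Replace None entries with the mean of that feature over the other patients."""
    filled = [row[:] for row in data]
    for t in range(len(data[0])):
        present = [row[t] for row in data if row[t] is not None]
        mean_t = sum(present) / len(present)
        for row in filled:
            if row[t] is None:
                row[t] = mean_t
    return filled


def pearson_similarity(data):
    """Patient-by-patient Pearson correlation for one omics.

    data[i][t] is the value of feature t for patient i; each patient is
    assumed to have non-constant features (otherwise the norm is zero).
    """
    n = len(data)
    means = [sum(row) / len(row) for row in data]
    centered = [[v - m for v in row] for row, m in zip(data, means)]
    norms = [math.sqrt(sum(v * v for v in row)) for row in centered]
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy omics table: 3 patients x 3 features, one value missing.
raw = [[1.0, None, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
filled = impute_missing_with_mean(raw)
similarity = pearson_similarity(filled)
```

The function would be applied once per omics, giving one patient-by-patient matrix for each.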
Preferably, after the step of "performing similarity calculation on the target multi-omics data to obtain the similarity matrix corresponding to each omics in the target multi-omics data", the method further includes:
Processing the similarity matrix by the Fisher transformation to obtain the processed similarity matrix, where the Fisher transformation is given by:
Figure PCTCN2018121838-appb-000005
where r_{k,ij} is the similarity after the transformation, and the similarity matrix M_{k,ij} is the matrix formed by the S_{k,ij}.
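The standard Fisher z-transformation of a correlation s is 0.5 · ln((1 + s) / (1 − s)). The sketch below applies it element-wise, assuming this is the form shown in the omitted image. The clipping parameter `eps` is an addition made here, not something the source describes; it keeps the diagonal (where s = 1) finite.

```python
import math


def fisher_transform(similarity, eps=1e-6):
    """Element-wise Fisher z-transformation of a correlation matrix.

    Values are clipped to [-1 + eps, 1 - eps] first (an assumption of this
    sketch) so that entries equal to +/-1 do not produce infinities.
    """
    transformed = []
    for row in similarity:
        new_row = []
        for s in row:
            s = max(-1.0 + eps, min(1.0 - eps, s))
            new_row.append(0.5 * math.log((1.0 + s) / (1.0 - s)))
        transformed.append(new_row)
    return transformed


s_matrix = [[1.0, 0.5], [0.5, 1.0]]
r_matrix = fisher_transform(s_matrix)
```

The transformation stretches correlations near ±1, so differences between highly similar patient pairs become larger after this step.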
Preferably, the step of "performing weighted fusion on the corrected matrices corresponding to the respective omics to obtain a fusion matrix" includes:
Determining, by a differential search method, the omics weight corresponding to each omics in the target multi-omics data;
Performing, according to the omics weight of each omics, weighted fusion on the corrected matrices of the respective omics of each patient to obtain the fusion matrix, where the weighted fusion is performed by the following formula:
Figure PCTCN2018121838-appb-000006
where W_k is the corrected similarity matrix of the k-th omics, ω_k is the weight corresponding to W_k, and W is the final matrix after weighted fusion.
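Given the symbol definitions, the fusion is presumably a weighted sum of the corrected matrices, W = Σ_k ω_k W_k. The sketch below implements that reading. The weights are passed in as given, because the text does not describe the differential search procedure in enough detail to reproduce it; the name `fuse_matrices` and the example weights are illustrative.

```python
def fuse_matrices(corrected, weights):
    """Weighted element-wise sum of per-omics corrected matrices.

    corrected: list of n x n matrices, one per omics.
    weights:   one weight per omics (how they are found is outside this sketch).
    """
    if len(corrected) != len(weights):
        raise ValueError("need exactly one weight per omics")
    n = len(corrected[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, corrected)) for j in range(n)]
        for i in range(n)
    ]


w_1 = [[1.0, 0.0], [0.0, 1.0]]
w_2 = [[1.0, 0.8], [0.8, 1.0]]
fused = fuse_matrices([w_1, w_2], weights=[0.25, 0.75])
```

Whether the weights are required to sum to 1 is not stated in the text; the sketch does not normalize them.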
Preferably, after the step of "performing spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient", the method further includes:
Calculating, for each cancer subtype category of the target cancer patient group, the mean of each omics over all patients in that category, as the subtype means;
Calculating, for each cancer subtype category of the target cancer patient group, the subtype group center point of all the subtype means in that category;
Acquiring multi-omics data to be tested for the patients in a patient group to be analyzed, where the omics categories in the multi-omics data to be tested are the same as those of the target multi-omics data; and calculating the center point of the multi-omics data to be tested of each patient in the patient group to be analyzed, as the center point to be analyzed;
Calculating, based on the Euclidean distance, the distance between the center point to be analyzed of each patient in the patient group to be analyzed and each subtype group center point, as a detection distance value;
Selecting, among all the detection distance values of each patient in the patient group to be analyzed, the cancer subtype category corresponding to the smallest detection distance value as the cancer subtype category of that patient.
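The steps above amount to a nearest-centroid rule. The text leaves open exactly how a "center point" is formed, so the sketch below takes one simple interpretation: each patient is a single feature vector (for instance, the omics features concatenated), each subtype's center is the feature-wise mean of its patients, and a new patient gets the subtype whose center is closest in Euclidean distance. The names `subtype_centroids` and `assign_subtype` and the toy vectors are illustrative.

```python
import math


def subtype_centroids(features, labels):
    """Feature-wise mean vector of the patients in each subtype."""
    groups = {}
    for vec, lab in zip(features, labels):
        groups.setdefault(lab, []).append(vec)
    return {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in groups.items()
    }


def assign_subtype(vec, centroids):
    """Return the subtype label whose centroid is nearest to vec."""
    return min(centroids, key=lambda lab: math.dist(vec, centroids[lab]))


# Already-clustered patients (labels as produced by the clustering step).
features = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
labels = ["C1", "C1", "C2", "C2"]
centroids = subtype_centroids(features, labels)

# New patients to be analyzed.
label_a = assign_subtype([1.0, 1.0], centroids)
label_b = assign_subtype([9.0, 9.0], centroids)
```

This lets new patients be labelled without rerunning the clustering, at the cost of assuming the subtypes are roughly compact in feature space.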
In addition, to solve the above problems, the present application further provides a cancer subtype classification device based on multi-omics integration, including:
An acquisition module, configured to acquire target multi-omics data for each patient in the target cancer patient group, and to calculate the omics similarity matrix corresponding to each omics in the target multi-omics data;
A prediction module, configured to predict each omics similarity matrix using linear regression to obtain the predicted similarity matrix corresponding to each omics similarity matrix;
A correction module, configured to correct the predicted similarity matrix using the omics similarity matrix to obtain a corrected matrix;
A fusion module, configured to perform weighted fusion on the corrected matrices corresponding to the respective omics to obtain a fusion matrix;
A clustering module, configured to perform spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient.
The present application provides a cancer subtype classification method based on multi-omics integration. The omics similarity matrix corresponding to each omics in the target multi-omics data of each patient in the target patient group is calculated; linear regression is used to obtain the predicted similarity matrix corresponding to each omics similarity matrix; the omics similarity matrix and the predicted similarity matrix are then combined to obtain a corrected matrix; the corrected matrices are fused according to their weights; and spectral clustering is performed, so that each patient is assigned a cancer subtype category number based on the preset cancer subtype category labels. Building on similarity matrices, the present application proposes a simple and effective similarity fusion model for integrating target multi-omics data to identify cancer subtypes. The similarity bias that exists between samples within each omics dataset is corrected using the similarities predicted by a linear model, and the corrected matrices from the target multi-omics data are then integrated by weight, so that patient samples are clustered into different subtype groups. The present application improves the accuracy of the classification and evaluation of cancer subtypes, classifies patients through a more flexible integration method, improves the efficiency of data analysis, and facilitates research on cancer subtypes.
Brief description of the drawings
FIG. 1 is a schematic structural diagram of the hardware operating environment involved in the embodiments of the cancer subtype classification method based on multi-omics integration of the present application;
FIG. 2 is a schematic flowchart of the first embodiment of the cancer subtype classification method based on multi-omics integration of the present application;
FIG. 3 is a schematic flowchart of the second embodiment of the cancer subtype classification method based on multi-omics integration of the present application;
FIG. 4 is a schematic flowchart of the third embodiment of the cancer subtype classification method based on multi-omics integration of the present application;
FIG. 5 is a schematic flowchart of another implementation of the third embodiment of the cancer subtype classification method based on multi-omics integration of the present application;
FIG. 6 is a schematic flowchart of the steps after step S50 in the fourth embodiment of the cancer subtype classification method based on multi-omics integration of the present application;
FIG. 7 is a plot of survival rate against survival time for the glioblastoma subtypes obtained with the cancer subtype classification method based on multi-omics integration of the present application;
FIG. 8 is a schematic diagram of the functional modules of the cancer subtype classification device based on multi-omics integration of the present application.
The realization of the objectives, the functional characteristics, and the advantages of the present application will be further described in conjunction with the embodiments and with reference to the drawings.
Detailed description
Embodiments of the present application are described in detail below, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality of" means two or more, unless otherwise specifically defined.
In the present application, unless otherwise expressly specified and defined, the terms "installed", "connected", "coupled", and "fixed" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be an internal communication between two elements or an interaction between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present application according to the specific circumstances.
It should be understood that the specific embodiments described herein are intended only to explain the present application and not to limit it.
FIG. 1 is a schematic structural diagram of the hardware operating environment of the terminal involved in the embodiments of the present application.
The computer device in the embodiments of the present application may be a PC, or a mobile terminal device such as a smartphone, a tablet computer, or a portable computer. As shown in FIG. 1, the computer device may include a processor 1001 (for example, a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is configured to implement connection and communication between these components. The user interface 1003 may include a display screen and an input unit such as a keyboard or a remote control; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a stable (non-volatile) memory such as a disk memory. The memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001. Optionally, the terminal may further include an RF (Radio Frequency) circuit, an audio circuit, a WiFi module, and so on. In addition, the computer device may be equipped with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described further here.
Those skilled in the art will understand that the computer device shown in FIG. 1 does not limit the computer device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. As shown in FIG. 1, the memory 1005, as a computer-readable storage medium, may include an operating system, a data interface control program, a network connection program, and a cancer subtype classification program based on multi-omics integration.
The present application provides a cancer subtype classification method based on multi-omics integration. The method improves the accuracy of the classification and evaluation of cancer subtypes, classifies patients through a more flexible integration method, improves the efficiency of data analysis, and facilitates research on cancer subtypes.
Embodiment 1:
Referring to FIG. 2, the first embodiment of the present application provides a cancer subtype classification method based on multi-omics integration, including:
Step S10: acquiring target multi-omics data for each patient in the target cancer patient group, and calculating the omics similarity matrix corresponding to each omics in the target multi-omics data;
Here, the omics similarity matrix is M_k, which can be expressed in the following form:
Figure PCTCN2018121838-appb-000007
The target cancer patient group is the set of patients that requires data analysis and on whose members batch cancer subtype classification is to be performed. It contains pathological data (physical and chemical index data, biochemical test results, and so on) of patients who have the same type of cancer but the same and/or different conditions.
The target cancer patient group contains multiple patients with the same type of cancer, and each patient has target multi-omics data comprising multiple omics.
The target multi-omics data is the combination of multiple omics that every patient in the target cancer patient group has and that requires data analysis.
For example, for lung cancer, a target cancer patient group for lung cancer is established. All patients in the group are lung cancer patients, 400 in total. Based on the importance of the different omics and their relevance to lung cancer, three omics (mRNA, methylation, and gene expression level) are selected for analysis and together form the target multi-omics data of each patient; mRNA, methylation, and gene expression level are each a single omics within the target multi-omics data.
Step S20: predicting each omics similarity matrix using linear regression to obtain the predicted similarity matrix corresponding to each omics similarity matrix;
Step S30: correcting the predicted similarity matrix using the omics similarity matrix to obtain a corrected matrix;
Existing data classification techniques for cancer subtypes generally follow this approach:
(1) integrate the target multi-omics data;
(2) perform clustering;
(3) perform survival analysis on the clustering results;
(4) evaluate the clustering results.
It can be seen that existing data classification methods for cancer subtypes consider neither the similarity bias between samples nor the weights of the different omics data during integration, which is a common shortcoming of existing classification methods. Moreover, when there are too many feature dimensions, the quality of feature selection affects the quality of the clustering results, greatly reducing their credibility and accuracy.
This embodiment takes these shortcomings into account. It uses simple regression and linear fusion to integrate the different omics data with weights, avoiding feature selection and dimensionality reduction. It considers not only the similarity between patients across different data types but also the weight carried by each type, and finally applies spectral clustering. Because the model builds on similarity matrices, it can classify cancer subtypes simply and effectively. This improves the quality of cancer subtype classification and gives stronger consistency within subtypes, which better supports follow-up research on the subtypes and cancer treatment.
In this embodiment, after the multi-omics similarity matrices of each patient in the target cancer patient group are obtained, a prediction matrix corresponding to each similarity matrix is built by linear regression, and the similarity matrix is corrected with the prediction matrix; that is, the measured values and the predicted values are combined. This yields classification results with stronger consistency, higher accuracy, and higher data reliability.
It should be noted that linear regression is a statistical analysis method that uses regression analysis from mathematical statistics to determine the quantitative relationship of interdependence between two or more variables. When a regression analysis includes two or more independent variables and the relationship between the dependent variable and the independent variables is linear, it is called multiple linear regression analysis. In this embodiment, multiple linear regression analysis is used to predict each entry of an omics similarity matrix, giving a predicted similarity matrix corresponding to that omics similarity matrix.
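For reference, the general multiple linear regression model described above, and one way it could map onto the similarity prediction in this embodiment, can be written as follows. The mapping is an assumption: the corresponding formula image is not reproduced in this text, and the symbols y, x_1, ..., x_p, and ε are introduced here only for illustration.

```latex
y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon
\qquad\Longrightarrow\qquad
r'_{k,ij} = \beta_0 + \sum_{t \neq k} \beta_t \, r_{t,ij}
```

Under this reading, the dependent variable is the similarity of patients i and j in the target omics k, and the independent variables are the similarities of the same patient pair in each of the other omics.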
Step S40: performing weighted fusion on the corrected matrices corresponding to the respective omics to obtain a fusion matrix;
According to the weight corresponding to each omics, the corrected matrices of the multiple omics in the target multi-omics data are fused by weight, giving the fusion matrix corresponding to each patient.
Step S50: performing spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient.
The preset cancer subtype category label is a marker for the classification of the target cancer type. For example, after clustering, lung cancer is divided into three subtypes, type 1, type 2, and type 3, which serve as the category labels for lung cancer. Correspondingly, the cancer subtype category numbers are C1 for type 1, C2 for type 2, and C3 for type 3. In this embodiment, the category number is used as the corresponding category label, enabling classification based on that label.
It should be noted that the spectral clustering algorithm is built on spectral graph theory. Compared with traditional clustering algorithms, it has the advantages of being able to cluster on sample spaces of arbitrary shape and of converging to a globally optimal solution. The algorithm first defines, from the given sample data set, an affinity matrix describing the similarity of pairs of data points, computes the eigenvalues and eigenvectors of that matrix, and then selects suitable eigenvectors to cluster the data points.
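By way of non-limiting illustration, the eigenvector-based procedure described above may be sketched for the special case of two clusters. This is a minimal sketch under stated assumptions, not the implementation of this application: the affinity matrix is assumed symmetric with non-negative entries, only a two-way split is produced (more clusters would need several eigenvectors followed by a grouping step such as k-means), and the function name `spectral_bipartition` and its parameters are hypothetical.

```python
import math
import random


def spectral_bipartition(w, iters=500, seed=0):
    """Split samples into two groups from a symmetric, non-negative affinity matrix."""
    n = len(w)
    deg = [sum(row) for row in w]
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    # B = (I + D^{-1/2} W D^{-1/2}) / 2 has all eigenvalues in [0, 1].
    b = [[(w[i][j] * inv_sqrt[i] * inv_sqrt[j] + (i == j)) / 2.0
          for j in range(n)] for i in range(n)]
    # The trivial top eigenvector of B is proportional to sqrt(degree).
    norm = math.sqrt(sum(deg))
    v1 = [math.sqrt(d) / norm for d in deg]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        # Remove the trivial component, then apply one power-iteration step.
        dot = sum(a * c for a, c in zip(v, v1))
        v = [a - dot * c for a, c in zip(v, v1)]
        v = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # The sign pattern of the second eigenvector defines the two groups.
    return [0 if a < 0 else 1 for a in v]
```

The returned list holds one group index (0 or 1) per sample; which group receives which index is arbitrary.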
As described above, the target cancer is classified by spectral clustering, and cancer subtype category labels (for example, C1, C2 and C3) are established for the resulting classes. This establishes the relationship between each patient's fusion matrix and a cancer subtype category label, so that patients with different cancer subtypes can be classified by those labels.
The cancer subtype classification method based on multi-omics integration provided by this embodiment computes an omics similarity matrix for each omics in the target multi-omics data of each patient in the target patient group, uses linear regression to obtain a predicted similarity matrix for each omics similarity matrix, combines the omics similarity matrix with the predicted similarity matrix to obtain a correction matrix, performs weighted fusion according to the weights, and then performs spectral clustering, thereby establishing for each patient a cancer subtype category number based on the preset cancer subtype category labels. On the basis of similarity matrices, this embodiment proposes a simple and effective similarity fusion model configured to integrate target multi-omics data in order to identify cancer subtypes. To address the similarity bias between samples within each omics data set, a linear model is used to predict and correct the similarity between samples; weights are then used to integrate the correction matrices derived from the target multi-omics data, and the patient samples are clustered into different subtype groups. This embodiment improves the accuracy of cancer subtype classification, classifies patients through a more flexible integration method, improves the efficiency of data analysis, and facilitates research on cancer subtypes.
Embodiment 2:
Referring to FIG. 3, the second embodiment of the present application provides a cancer subtype classification method based on multi-omics integration. Based on the first embodiment shown in FIG. 2, step S20, "predicting each omics similarity matrix by linear regression to obtain a predicted similarity matrix corresponding to each omics similarity matrix", includes:
Step S21: based on linear regression, taking in turn the omics similarity matrix of each omics in each patient's target multi-omics data as the target matrix, and performing linear regression prediction on the target matrix using the omics similarity matrices of the other omics, to obtain a predicted value for each entry of each target matrix and thereby the predicted similarity matrix, containing those predicted values, corresponding to each omics similarity matrix;
The linear regression prediction is performed using the following formula:
Figure PCTCN2018121838-appb-000008
where β_0 is a hyperparameter, β_t are the parameters learned by the linear regression model, and r'_{k,ij} is the predicted value.
As described above, each patient's target multi-omics data includes multiple omics, and a corresponding omics similarity matrix is obtained for each omics through similarity calculation. Linear regression is then used to make a prediction for each omics, yielding the predicted similarity matrices.
Specifically, for the omics contained in the target multi-omics data, the omics similarity matrix of one omics is taken as the target matrix, and the omics similarity matrices of the other omics are used to perform linear regression prediction on it. This gives the predicted values of the entries of the target matrix, that is, the predicted similarity matrix corresponding to the target matrix. The same procedure is then applied with each of the other matrices as the target, so that a predicted similarity matrix is obtained for every omics similarity matrix.
For example, a patient's target multi-omics data contains three omics, M1, M2 and M3. The linear regression prediction proceeds as follows:
M2 and M3 are used to perform linear regression prediction on M1, giving M1';
M1 and M3 are used to perform linear regression prediction on M2, giving M2';
M1 and M2 are used to perform linear regression prediction on M3, giving M3'. M1', M2' and M3' are the predicted similarity matrices, obtained by linear regression prediction, corresponding to M1, M2 and M3 respectively.
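By way of non-limiting illustration, the prediction of one similarity matrix from the others may be sketched as follows. The prediction formula is available here only as an image, so this sketch assumes an ordinary least-squares model with an intercept, fitted over the off-diagonal patient pairs, in which each entry of the target matrix is predicted from the corresponding entries of the other matrices. The function names `solve_linear` and `predict_matrix`, and the choice to leave the diagonal unchanged, are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def predict_matrix(target, others):
    """Predict the entries of `target` from the other omics' similarity matrices."""
    n = len(target)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # One design row per patient pair: [1, entry in 1st other matrix, ...].
    rows = [[1.0] + [m[i][j] for m in others] for i, j in pairs]
    y = [target[i][j] for i, j in pairs]
    p = len(rows[0])
    # Normal equations (X^T X) beta = X^T y.
    xtx = [[sum(r[a] * r[c] for r in rows) for c in range(p)] for a in range(p)]
    xty = [sum(r[a] * yv for r, yv in zip(rows, y)) for a in range(p)]
    beta = solve_linear(xtx, xty)
    pred = [row[:] for row in target]  # diagonal left unchanged (assumption)
    for (i, j), r in zip(pairs, rows):
        value = sum(bv * rv for bv, rv in zip(beta, r))
        pred[i][j] = pred[j][i] = value
    return pred, beta
```

With three omics, calling `predict_matrix(M1, [M2, M3])`, `predict_matrix(M2, [M1, M3])` and `predict_matrix(M3, [M1, M2])` would give M1', M2' and M3' in turn.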
Step S30, "correcting the predicted similarity matrix using the omics similarity matrix to obtain a correction matrix", includes:
Step S31: averaging the omics similarity matrix of each omics in the target multi-omics data with the corresponding predicted similarity matrix, to obtain the correction matrix of each omics in the target multi-omics data;
The average is calculated by the following formula:
Figure PCTCN2018121838-appb-000009
where k denotes an omics in the target multi-omics data, W_k is the correction matrix, M_k is the omics similarity matrix, and M'_k is the predicted similarity matrix.
As described above, after the omics similarity matrices of each patient are obtained, each omics similarity matrix is predicted by linear regression to obtain its corresponding predicted similarity matrix. The omics similarity matrix and the corresponding predicted similarity matrix are then averaged, that is, the similarity values already obtained are corrected by the predicted values, which improves their accuracy. In this way, the similarity bias between samples within each omics data set is taken into account and a linear model is used to predict the similarity between samples, compensating for errors in inter-patient similarity and making the values of the resulting similarity matrix more accurate and more credible.
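By way of non-limiting illustration, the averaging in step S31 may be sketched as follows, assuming that "summing and averaging" means the element-wise mean of the two matrices, W_k = (M_k + M'_k) / 2. The function name `correct_matrix` is hypothetical.

```python
def correct_matrix(m_k, m_pred):
    """Element-wise mean of an omics similarity matrix and its predicted matrix."""
    return [[(a + b) / 2.0 for a, b in zip(row_m, row_p)]
            for row_m, row_p in zip(m_k, m_pred)]


# Example: an observed similarity of 0.2 and a predicted similarity of 0.6
# are averaged into a corrected similarity of about 0.4.
w_k = correct_matrix([[1.0, 0.2], [0.2, 1.0]], [[1.0, 0.6], [0.6, 1.0]])
```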
Embodiment 3:
Referring to FIGS. 4-5, the third embodiment of the present application provides a cancer subtype classification method based on multi-omics integration. Based on the first embodiment shown in FIG. 2, step S10, "obtaining target multi-omics data of each patient in the target cancer patient group, and calculating the omics similarity matrix corresponding to each omics in the target multi-omics data", includes:
Step S11: determining the target multi-omics data of each patient in the target cancer patient group, and performing mean imputation on the data corresponding to any missing omics;
As described above, the target multi-omics data contains multiple omics, but because there are many patients, not every patient has necessarily undergone every omics test. Some tests may be missing, so that some patients lack a certain omics and the calculation cannot be carried out. In that case, the omics missing for such a patient is imputed with the mean of that item over all other patients, completing the data and preserving its statistical significance without altering the observed values.
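By way of non-limiting illustration, the mean imputation of step S11 may be sketched as follows. The sketch assumes that one omics is stored as a mapping from patient ID to a list of feature values, with `None` marking a patient for whom that omics was not measured, and that the mean is taken feature by feature over the patients who were measured. The function name `impute_missing` and the data layout are hypothetical.

```python
def impute_missing(omics):
    """Fill in patients lacking this omics with the per-feature mean of the others."""
    observed = [values for values in omics.values() if values is not None]
    n_features = len(observed[0])
    means = [sum(values[t] for values in observed) / len(observed)
             for t in range(n_features)]
    # Observed patients are copied unchanged; missing ones receive the means.
    return {pid: (list(values) if values is not None else means[:])
            for pid, values in omics.items()}


filled = impute_missing({"p1": [1.0, 4.0], "p2": None, "p3": [3.0, 8.0]})
```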
Step S12: performing similarity calculation on the target multi-omics data to obtain the similarity matrix corresponding to each omics in the target multi-omics data; the similarity is calculated by the following formula:
Figure PCTCN2018121838-appb-000010
where x_{k,it} is the value of feature t for cancer patient i in the k-th omics, and x_{k,i} is the mean value for cancer patient i in the k-th omics.
As described above, the similarity matrix is obtained by tabulating the data of one omics for all patients, for example with gene expression levels along the horizontal axis and patient names or numbers along the vertical axis, and computing from this table the similarity between each patient's gene expression and that of every other patient. An omics similarity matrix of gene expression corresponding to the patients is thereby established.
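By way of non-limiting illustration, the patient-by-patient similarity calculation may be sketched as follows. The similarity formula is available here only as an image; because its symbols are a feature value and a per-patient mean, the sketch assumes the Pearson correlation coefficient between two patients' feature vectors. The function names `pearson` and `similarity_matrix` are hypothetical, and a patient whose features are all equal would cause a division by zero in this simple form.

```python
import math


def pearson(x, y):
    """Pearson correlation between two patients' feature vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                    * sum((b - mean_y) ** 2 for b in y))
    return num / den


def similarity_matrix(features):
    """features[i] holds the feature values of patient i in one omics."""
    n = len(features)
    return [[pearson(features[i], features[j]) for j in range(n)]
            for i in range(n)]


s = similarity_matrix([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]])
```

In this toy input the first two patients vary together and the third varies in the opposite direction, so the first two are highly similar to each other and dissimilar to the third.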
In another embodiment, after step S10, "performing similarity calculation on the target multi-omics data to obtain the similarity matrix corresponding to each omics in the target multi-omics data", the method further includes:
Step S60: processing the similarity matrix by the Fisher transformation to obtain the processed similarity matrix, where the Fisher transformation formula is:
Figure PCTCN2018121838-appb-000011
where r_{k,ij} is an entry of the transformed similarity matrix, and the similarity matrix M_k is the matrix formed by the values S_{k,ij}.
As described above, after the omics similarity matrix of each omics of each patient is obtained, it is preprocessed by the Fisher transformation. The preprocessing normalizes the data in the patients' omics similarity matrix, so that the preprocessed matrix can be handled more efficiently in subsequent data processing.
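By way of non-limiting illustration, the Fisher transformation of step S60 may be sketched as follows. The formula is available here only as an image, so the sketch assumes the standard Fisher z-transformation, r = 0.5 * ln((1 + S) / (1 - S)), which equals the inverse hyperbolic tangent of S. Because a similarity of exactly 1 (for example on the diagonal) would map to infinity, the values are clipped first; the clipping, the `eps` parameter and the function name `fisher_transform` are assumptions made for this sketch.

```python
import math


def fisher_transform(s, eps=1e-6):
    """Apply the Fisher z-transformation to every entry of a similarity matrix."""
    limit = 1.0 - eps
    # Clip to (-1, 1) so that atanh stays finite, then transform.
    return [[math.atanh(max(-limit, min(limit, value))) for value in row]
            for row in s]


z = fisher_transform([[1.0, 0.5], [0.5, 1.0]])
```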
Step S40, "performing weighted fusion on the correction matrices corresponding to the respective omics to obtain a fusion matrix", includes:
Step S41: determining, by a differential search method, the omics weight corresponding to each omics in the target multi-omics data;
As described above, a differential search with a step size of 0.05 is used to determine the optimal weight for each omics of the cancer.
Step S42: performing weighted fusion on the correction matrices corresponding to the omics of each patient according to the omics weight of each omics, to obtain the fusion matrix, where the weighted fusion is performed by the following formula:
Figure PCTCN2018121838-appb-000012
where W_k is the corrected similarity matrix of the k-th omics as described in this application, ω_k is the weight corresponding to W_k, and W is the final matrix after weighted fusion.
As described above, the correction matrices of all omics of each patient are fused according to the optimal weight of each omics, giving the fusion matrix of each patient.
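By way of non-limiting illustration, steps S41 and S42 may be sketched as follows. The sketch assumes that the fusion is the weighted sum W = Σ ω_k W_k, that the weights are non-negative and sum to 1, and that the search with step size 0.05 enumerates every such weight vector on that grid. The criterion used to pick the best vector is not specified in this passage and is left to the caller. The function names `weight_grid` and `fuse` are hypothetical.

```python
def weight_grid(n_omics, step=0.05):
    """Yield every weight vector on a `step` grid whose entries sum to 1."""
    units = round(1 / step)

    def split(k, remaining):
        # Distribute `remaining` grid units over k omics.
        if k == 1:
            yield (remaining,)
            return
        for u in range(remaining + 1):
            for rest in split(k - 1, remaining - u):
                yield (u,) + rest

    for combo in split(n_omics, units):
        yield tuple(u / units for u in combo)


def fuse(matrices, weights):
    """Weighted element-wise sum of the correction matrices."""
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices))
             for j in range(n)] for i in range(n)]


candidates = list(weight_grid(3))
fused = fuse([[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]], (0.25, 0.75))
```

A caller would compute `fuse(...)` for each candidate, cluster the result, score it with whatever criterion is chosen, and keep the best-scoring weight vector.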
For example, in this embodiment weighted fusion was compared with the individual omics; see Table 1:
Table 1. Cox-log P-values in subtype survival analysis: single omics versus weighted fusion
Data | Gene expression | DNA methylation | miRNA expression | Weighted fusion
GBM | 2.49×10^-3 | 5.71×10^-3 | 1.50×10^-3 | 2.66×10^-4
As the table shows, weighted fusion yields a smaller P-value, indicating more reliable subtype classification. This embodiment therefore uses weighted fusion to combine the data from multiple omics, giving analysis results that are statistically more reliable and more accurate.
Embodiment 4:
Referring to FIG. 6, the fourth embodiment of the present application provides a cancer subtype classification method based on multi-omics integration. Based on the first embodiment shown in FIG. 2, after step S50, "performing spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient", the method further includes:
Step S70: calculating, for each cancer subtype category of the target cancer patient group, the mean of each omics over all patients in that category, as the subtype means;
After the cancer subtype category of each patient in the target cancer patient group has been determined, a general rule can be constructed from the determined categories. This general rule serves as a data analysis model for analyzing the case data of other individual patients or patient groups, enabling rapid subtyping.
In addition, for a data analysis model that is to serve as a general rule, the number of patients in the target cancer patient group needs to reach a certain size; the larger the number, the more accurate the model is as a general rule. A preset threshold can therefore be set, and only when the number of patients in the target cancer patient group reaches this threshold is the group used as a data analysis model for cancer subtyping of other patients' omics data. For example, the preset threshold is 300 cases, that is, the target cancer patient group must contain at least 300 patients.
As described above, the target cancer patient group contains multiple patients, each corresponding to one cancer subtype category; that is, after the subtyping analysis, all patients in the target cancer patient group are divided into groups according to their cancer subtype categories.
The mean of each omics over all patients in each cancer subtype category of the target cancer patient group is calculated as the subtype mean;
The feature values of each omics are averaged over all patients in each cancer subtype category of the target cancer patient group to obtain the subtype means. The number of subtype means is the same as the number of cancer subtype categories.
Step S80: calculating, for each cancer subtype category of the target cancer patient group, the subtype group center point from all the subtype means of that category;
After step S70, multiple subtype means are available for each cancer subtype in the target cancer patient group; averaging the subtype means of each cancer subtype gives the subtype group center point.
Step S90: obtaining the multi-omics data to be tested of the patients in the patient group to be analyzed, where the omics categories in the multi-omics data to be tested are the same as those of the target multi-omics data; and calculating the center point of the multi-omics data to be tested of each patient in the patient group to be analyzed, as the center point to be analyzed;
As described above, the step "obtaining the multi-omics data to be tested of the patients in the patient group to be analyzed, where the omics categories in the multi-omics data to be tested are the same as those of the target multi-omics data" may be performed before or at the same time as step S70, provided that it is completed before "calculating the center point of the multi-omics data to be tested of each patient in the patient group to be analyzed, as the center point to be analyzed" is performed.
As described above, the patient group to be analyzed is a set of patients distinct from the target cancer patient group, and may contain one patient or several. The omics categories of the multi-omics data to be tested of each patient in the patient group to be analyzed must be consistent with those of the target multi-omics data of each patient in the target cancer patient group. For example, if the target multi-omics data in the target cancer patient group includes mutation, methylation and mRNA data, every patient in the patient group to be analyzed must also have mutation, methylation and mRNA data; comparison and analysis are possible only when the multi-omics data are consistent.
As described above, the center point to be analyzed is obtained by taking the mean of all omics in the multi-omics data to be tested of each patient in the patient group to be analyzed.
Step S100: calculating, based on the Euclidean distance, the distance between the center point to be analyzed of each patient's multi-omics data to be tested in the patient group to be analyzed and each subtype group center point, as the detection distance values;
It should be noted that the Euclidean distance, also called the Euclidean metric, is a commonly used definition of distance: the true distance between two points in m-dimensional space. In two-dimensional space, the Euclidean distance is the length of the straight line segment between two points.
Using the Euclidean distance, the distance between the center point to be analyzed of each patient's multi-omics data to be tested in the patient group to be analyzed and each subtype group center point is calculated as the detection distance value.
Step S110: selecting, among all the detection distance values of each patient in the patient group to be analyzed, the cancer subtype category corresponding to the smallest detection distance value as the cancer subtype category of that patient.
After all the detection distance values of a patient are obtained, they are compared, and the cancer subtype category corresponding to the numerically smallest detection distance value is selected as the patient's cancer subtype category. In this way, once all patients in the target cancer patient group have been subtyped, the result serves as a general-rule data analysis model for rapidly subtyping other patients.
For example, for one or more cancer patients newly added to the patient group to be analyzed, the original cluster label data can be used to classify a single sample or a group of samples and directly determine the cancer subtype category.
Suppose the target cancer patient group contains 500 patients, each with data for three omics, O1, O2 and O3, and the method of steps S10-S50 divides this group into two subtypes, C1 and C2. Let the patients in a newly added batch (the patient group to be analyzed) be n1, n2, ..., nk. The procedure is as follows:
1. Calculate, for each cancer subtype category (C1, C2) of the target cancer patient group, the mean of each omics (O1, O2, O3) over all patients in that category as the subtype means, denoted X_{1,1}, X_{1,2}, X_{1,3} and X_{2,1}, X_{2,2}, X_{2,3}. In the subscript of X, a 1 before the comma corresponds to C1 and a 2 before the comma to C2; the 1, 2 and 3 after the comma correspond to O1, O2 and O3 respectively.
2. Calculate the subtype group center point from all the subtype means of each cancer subtype category (C1, C2) of the target cancer patient group:
X1 = (X_{1,1} + X_{1,2} + X_{1,3})/3, corresponding to C1;
X2 = (X_{2,1} + X_{2,2} + X_{2,3})/3, corresponding to C2.
3. Obtain the multi-omics data to be tested of the patients in the patient group to be analyzed, where the omics categories in the multi-omics data to be tested are the same as those of the target multi-omics data; and calculate the center point of the multi-omics data to be tested of each patient in the patient group to be analyzed, as the center point to be analyzed;
Calculate the center points of the new samples n1, n2, ..., nk:
new1 = (n_{1,1} + n_{1,2} + n_{1,3})/3;
new2 = (n_{2,1} + n_{2,2} + n_{2,3})/3;
......
newk = (n_{k,1} + n_{k,2} + n_{k,3})/3;
where n_{k,1}, n_{k,2} and n_{k,3} are the values of the k-th patient of the new sample (multiple patients in this example) in omics O1, O2 and O3 respectively.
4. Classify the new samples by subtype: among all the detection distance values of each patient in the patient group to be analyzed, select the cancer subtype category corresponding to the smallest detection distance value as the cancer subtype category of that patient.
Using the Euclidean distance formula:
Figure PCTCN2018121838-appb-000013
all the detection distance values of each patient are obtained, where i indexes the cancer subtype categories. The detection distance values d_{1,k} and d_{2,k} between new sample k and the center of each subtype group are calculated (two cancer subtype categories, C1 and C2, are determined in this embodiment, so two detection distance values are calculated).
If d_{1,k} < d_{2,k}, new sample k belongs to subtype C1; if d_{1,k} > d_{2,k}, it belongs to subtype C2. If there are more cancer subtype categories, for example five, the cancer subtype category corresponding to the smallest detection distance value is selected as the patient's cancer subtype category.
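By way of non-limiting illustration, the four numbered steps above may be sketched as follows. The sketch follows the worked example, in which every center point is a single number obtained by averaging over the omics, so the Euclidean distance reduces to an absolute difference; the distance formula itself is available here only as an image. The data layout (one list of feature values per omics per patient) and the function names `mean`, `subtype_centers` and `classify` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def subtype_centers(patients, labels):
    """patients: ID -> one feature list per omics; labels: ID -> subtype."""
    centers = {}
    for subtype in set(labels.values()):
        members = [patients[p] for p in patients if labels[p] == subtype]
        n_omics = len(members[0])
        # Subtype means X_{i,o}: mean of omics o over all patients of subtype i.
        omics_means = [mean([v for m in members for v in m[o]])
                       for o in range(n_omics)]
        # Subtype group center X_i: mean of the subtype means.
        centers[subtype] = mean(omics_means)
    return centers


def classify(new_patient, centers):
    """Assign the subtype whose center is closest to the patient's center."""
    center = mean([mean(omics) for omics in new_patient])
    return min(centers, key=lambda s: abs(center - centers[s]))


patients = {
    "a": [[1.0, 1.0], [2.0]],
    "b": [[1.0, 3.0], [2.0]],
    "c": [[9.0, 9.0], [10.0]],
}
labels = {"a": "C1", "b": "C1", "c": "C2"}
centers = subtype_centers(patients, labels)
first = classify([[2.0, 2.0], [3.0]], centers)
second = classify([[8.0, 8.0], [9.0]], centers)
```

In this toy data the first new patient lies near the C1 center and the second near the C2 center. A tie between two equal distances is not addressed in the text; `min` simply returns the first subtype encountered.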
In this embodiment, for one or more cancer patients newly added to the patient group to be analyzed, the original cluster label data can be used to classify a single sample or a group of samples and directly determine the cancer subtype category. A general rule established by the cancer subtype classification method thus serves as a data analysis model for analyzing other patients' data, which facilitates rapid subtyping and data analysis of target patients or patient groups in clinical research. In addition, the subtyping data of each subsequently added patient can be incorporated into the model, continually correcting the model and improving its accuracy and statistical credibility.
Statistical application experiment on glioblastoma:
To better illustrate the cancer subtype classification method based on multi-omics integration provided in this application, an application comparison experiment was carried out.
First, a cohort of 215 glioblastoma patients was classified by the cancer subtype classification method based on multi-omics integration, giving the classification results shown in Table 2. Table 2 summarizes the sex, age and survival time of the patients in the three subtypes obtained by clustering. The analysis indicates that subtypes C1 and C2 have markedly different pathogenesis. Follow-up studies can compare the treatment effect of clinical drugs across the subtypes experimentally, in order to identify the drugs and treatment methods suited to patients of each subtype.
Further, the classification results were plotted against the corresponding survival rates, where Subtype1, Subtype2 and Subtype3 correspond to C1, C2 and C3. Based on the analysis, the survival rates of the three subtypes in the glioblastoma patient group were compared over survival time, with the result shown in FIG. 7. FIG. 7 shows significant differences in survival rate among the three subtypes, which demonstrates that the cancer subtype classification method based on multi-omics integration provided in this embodiment is accurate and effective, and that the results are statistically significant and credible.
Table 2. Comparison of clinical characteristics of the glioblastoma subtypes
Subtype ID | C1 (N=42) | C2 (N=112) | C3 (N=61)
Patients (male:female) | 24:18 | 69:43 | 41:20
Mean age (years) | 46.4 | 58.8 | 54.8
Mean survival time (days) | 931.9 | 402.5 | 504.9
In addition, referring to FIG. 8, the present application further provides a cancer subtype classification apparatus based on multi-omics integration, including:
an acquisition module 10, configured to obtain the target multi-omics data of each patient in the target cancer patient group, and to calculate the omics similarity matrix corresponding to each omics in the target multi-omics data;
a prediction module 20, configured to predict each omics similarity matrix by linear regression to obtain the predicted similarity matrix corresponding to each omics similarity matrix;
a correction module 30, configured to correct the predicted similarity matrix using the omics similarity matrix to obtain a correction matrix;
a fusion module 40, configured to perform weighted fusion on the correction matrices corresponding to the respective omics to obtain a fusion matrix;
a clustering module 50, configured to perform spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient.
In addition, the present application further provides a computer device including a memory and a processor, the memory being configured to store a cancer subtype classification program based on multi-omics integration, and the processor running that program so that the computer device performs the cancer subtype classification method based on multi-omics integration described above.
In addition, the present application further provides a computer-readable storage medium storing a cancer subtype classification program based on multi-omics integration which, when executed by a processor, implements the cancer subtype classification method based on multi-omics integration described above.
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。The sequence numbers of the above embodiments of the present application are for description only, and do not represent the advantages and disadvantages of the embodiments.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在如上所述的一个存储介质(如ROM/RAM、磁碟和光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。以上仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform, and of course, can also be implemented by hardware, but in many cases the former is better Implementation. Based on this understanding, the technical solution of the present application can be embodied in the form of a software product in essence or part that contributes to the existing technology, and the computer software product is stored in a storage medium (such as ROM/RAM as described above) , Magnetic disks and optical disks), including several instructions to enable a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to perform the method described in each embodiment of the present application. The above are only the preferred embodiments of the present application, and do not limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made by using the description and drawings of this application, or directly or indirectly used in other related technical fields , The same reason is included in the scope of patent protection in this application.

Claims (10)

  1. A cancer subtype classification method based on multi-omics integration, characterized by comprising:
    acquiring target multi-omics data of each patient in a target cancer patient group, and calculating an omics similarity matrix corresponding to each omics in the target multi-omics data;
    predicting each omics similarity matrix by linear regression to obtain a predicted similarity matrix corresponding to each omics similarity matrix;
    correcting the predicted similarity matrix using the omics similarity matrix to obtain a correction matrix;
    performing weighted fusion on the correction matrices corresponding to the respective omics to obtain a fusion matrix;
    performing spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient.
  2. The cancer subtype classification method based on multi-omics integration according to claim 1, characterized in that said "predicting each omics similarity matrix by linear regression to obtain a predicted similarity matrix corresponding to each omics similarity matrix" comprises:
    based on linear regression, taking in turn the omics similarity matrix of each omics in the target multi-omics data of each patient as a target matrix, performing linear regression prediction on the target matrix using the omics similarity matrices of the other omics, obtaining a predicted value for each entry of each target matrix, and obtaining, for each omics similarity matrix, the predicted similarity matrix containing those predicted values.
  3. The cancer subtype classification method based on multi-omics integration according to claim 2, characterized in that the linear regression prediction is performed using the following formula:
    Figure PCTCN2018121838-appb-100001
    where $\beta_0$ is a hyperparameter, $\beta_t$ is a parameter learned by the linear regression model, and $r'_{k,ij}$ is the predicted value.
  4. The cancer subtype classification method based on multi-omics integration according to claim 1, characterized in that said "correcting the predicted similarity matrix using the omics similarity matrix to obtain a correction matrix" comprises:
    summing and averaging the omics similarity matrix of each omics in the target multi-omics data with the corresponding predicted similarity matrix, to obtain the correction matrix of each omics in the target multi-omics data.
  5. The cancer subtype classification method based on multi-omics integration according to claim 4, characterized in that the sum-and-average is calculated by the following formula:
    Figure PCTCN2018121838-appb-100002
    where k denotes an omics in the target multi-omics data, $W_k$ is the correction matrix, $M_k$ is the omics similarity matrix, and $M'_k$ is the predicted similarity matrix.
  6. The cancer subtype classification method based on multi-omics integration according to claim 1, characterized in that said "acquiring target multi-omics data of each patient in a target cancer patient group, and calculating an omics similarity matrix corresponding to each omics in the target multi-omics data" comprises:
    determining the target multi-omics data of each patient in the target cancer patient group, and performing mean imputation on the data corresponding to any missing omics;
    performing a similarity calculation on the target multi-omics data to obtain the similarity matrix corresponding to each omics in the target multi-omics data, wherein the similarity $S_{k,ij}$ is calculated as:
    Figure PCTCN2018121838-appb-100003
    where $x_{k,it}$ is the value of feature t for cancer patient i in the k-th omics, and
    Figure PCTCN2018121838-appb-100004
    is the mean value of cancer patient i in the k-th omics.
  7. The cancer subtype classification method based on multi-omics integration according to claim 6, characterized in that, after said "performing a similarity calculation on the target multi-omics data to obtain the similarity matrix corresponding to each omics in the target multi-omics data", the method further comprises:
    processing the similarity matrix by a Fisher transformation to obtain the processed similarity matrix, wherein the Fisher transformation is given by:
    Figure PCTCN2018121838-appb-100005
    where $r_{k,ij}$ is the similarity after the transformation, and the similarity matrix $M_{k,ij}$ is the matrix formed by the $S_{k,ij}$.
  8. The cancer subtype classification method based on multi-omics integration according to claim 1, characterized in that said "performing weighted fusion on the correction matrices corresponding to the respective omics to obtain a fusion matrix" comprises:
    determining, by a differential search method, the omics weight corresponding to each omics in the target multi-omics data;
    performing, according to the omics weight of each omics, weighted fusion on the correction matrices corresponding to the respective omics of each patient to obtain the fusion matrix, wherein the weighted fusion is performed by the following formula:
    Figure PCTCN2018121838-appb-100006
    where $W_k$ is the similarity matrix of the k-th omics as corrected in this patent, $\omega_k$ is the weight corresponding to $W_k$, and $W$ is the final matrix after weighted fusion.
  9. The cancer subtype classification method based on multi-omics integration according to claim 1, characterized in that, after said "performing spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient", the method further comprises:
    calculating, for each cancer subtype category of the target cancer patient group, the mean of each omics over all patients in that category, as a subtype mean;
    calculating, for each cancer subtype category of the target cancer patient group, a subtype cluster center point from all the subtype means of that category;
    acquiring multi-omics data to be tested of the patients in a patient group to be analyzed, wherein the omics categories in the multi-omics data to be tested are the same as those of the target multi-omics data; and calculating the center point of the multi-omics data to be tested of each patient in the patient group to be analyzed, as a center point to be analyzed;
    calculating, based on the Euclidean distance, the relative distance between the center point to be analyzed of each patient in the patient group to be analyzed and each subtype cluster center point, as a detection distance value;
    selecting, for each patient in the patient group to be analyzed, the cancer subtype category corresponding to the smallest of all that patient's detection distance values, as the cancer subtype category of that patient.
  10. A cancer subtype classification device based on multi-omics integration, characterized by comprising:
    an acquisition module, configured to acquire target multi-omics data of each patient in a target cancer patient group, and to calculate an omics similarity matrix corresponding to each omics in the target multi-omics data;
    a prediction module, configured to predict each omics similarity matrix by linear regression to obtain a predicted similarity matrix corresponding to each omics similarity matrix;
    a correction module, configured to correct the predicted similarity matrix using the omics similarity matrix to obtain a correction matrix;
    a fusion module, configured to perform weighted fusion on the correction matrices corresponding to the respective omics to obtain a fusion matrix;
    a clustering module, configured to perform spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient.
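The nearest-center assignment recited in claim 9 can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions rather than the claimed procedure itself: each patient is represented by a single feature vector (for example, the omics concatenated), the subtype center is the mean of the training patients assigned to that subtype, and a new patient's own vector stands in for the "center point to be analyzed". The function names and toy values are hypothetical.

```python
import numpy as np

def subtype_centers(X, labels):
    # Mean feature vector of the already-clustered patients in each subtype.
    classes = np.unique(labels)
    centers = np.vstack([X[labels == c].mean(axis=0) for c in classes])
    return classes, centers

def assign_subtypes(X_new, classes, centers):
    # Euclidean distance from every new patient to every subtype center;
    # each patient receives the subtype whose center is closest.
    dist = np.linalg.norm(X_new[:, None, :] - centers[None, :, :], axis=2)
    return classes[dist.argmin(axis=1)]

# Toy example: four clustered patients in two subtypes, two new patients.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
classes, centers = subtype_centers(X, labels)
new_labels = assign_subtypes(np.array([[0.1, 0.3], [4.8, 5.2]]), classes, centers)
```

Because the assignment only needs the stored centers, new patients can be classified without re-running the similarity, fusion, and clustering steps on the whole cohort.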
PCT/CN2018/121838 2018-12-07 2018-12-18 Cancer subtype classification method employing multiomics integration WO2020113673A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811496363.3 2018-12-07
CN201811496363.3A CN111291777B (en) 2018-12-07 2018-12-07 Cancer subtype classification method based on multigroup chemical integration

Publications (1)

Publication Number Publication Date
WO2020113673A1 true WO2020113673A1 (en) 2020-06-11

Family

ID=70974918

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/121838 WO2020113673A1 (en) 2018-12-07 2018-12-18 Cancer subtype classification method employing multiomics integration

Country Status (2)

Country Link
CN (1) CN111291777B (en)
WO (1) WO2020113673A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537358B (en) * 2021-07-19 2023-09-01 华南理工大学 Cancer subtype identification method and system based on multiple sets of mathematical data sets
CN115171779B (en) * 2022-07-13 2023-09-22 浙江大学 Cancer driving gene prediction device based on graph attention network and multiple groups of chemical fusion
CN115064266B (en) * 2022-07-21 2024-04-26 山东大学 Incomplete multi-set data-based cancer diagnosis system, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105243300A (en) * 2015-08-31 2016-01-13 合肥工业大学 Approximation spectral clustering algorithm based method for predicting cancer metastasis and recurrence
CN106529165A (en) * 2016-10-28 2017-03-22 合肥工业大学 Method for identifying cancer molecular subtype based on spectral clustering algorithm of sparse similar matrix
CN107506617A (en) * 2017-09-29 2017-12-22 杭州电子科技大学 The half local disease-associated Forecasting Methodologies of social information miRNA
CN108171012A (en) * 2018-01-17 2018-06-15 河南师范大学 A kind of gene sorting method and device
CN108563660A (en) * 2017-12-29 2018-09-21 温州大学 service recommendation method, system and server

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050079524A1 (en) * 2000-01-21 2005-04-14 Shaw Sandy C. Method for identifying biomarkers using Fractal Genomics Modeling
US7213023B2 (en) * 2000-10-16 2007-05-01 University Of North Carolina At Charlotte Incremental clustering classifier and predictor
WO2017176423A1 (en) * 2016-04-08 2017-10-12 Biodesix, Inc. Classifier generation methods and predictive test for ovarian cancer patient prognosis under platinum chemotherapy

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816259A (en) * 2020-07-07 2020-10-23 西安电子科技大学 Incomplete omics data integration method based on network representation learning
CN111816259B (en) * 2020-07-07 2024-02-09 西安电子科技大学 Incomplete multi-study data integration method based on network representation learning
CN112687327A (en) * 2020-12-28 2021-04-20 中山依数科技有限公司 Cancer survival analysis system based on multitask and multi-mode
CN112687327B (en) * 2020-12-28 2024-04-12 中山依数科技有限公司 Cancer survival analysis system based on multitasking and multi-mode
WO2022222230A1 (en) * 2021-04-23 2022-10-27 平安科技(深圳)有限公司 Indicator prediction method and apparatus based on machine learning, and device and storage medium
CN113420802A (en) * 2021-06-04 2021-09-21 桂林电子科技大学 Alarm data fusion method based on improved spectral clustering
CN115631847A (en) * 2022-10-19 2023-01-20 哈尔滨工业大学 Early lung cancer diagnosis system based on multiple mathematical characteristics, storage medium and equipment
CN115631847B (en) * 2022-10-19 2023-07-14 哈尔滨工业大学 Early lung cancer diagnosis system, storage medium and equipment based on multiple groups of chemical characteristics
CN115985513A (en) * 2023-01-05 2023-04-18 徐州医科大学科技园发展有限公司 Data processing method, device and equipment based on multigroup cancer typing
CN115985513B (en) * 2023-01-05 2023-11-03 徐州医科大学科技园发展有限公司 Data processing method, device and equipment based on multiple groups of chemical cancer typing
CN116741397A (en) * 2023-08-15 2023-09-12 数据空间研究院 Cancer typing method, system and storage medium based on multi-group data fusion
CN116741397B (en) * 2023-08-15 2023-11-03 数据空间研究院 Cancer typing method, system and storage medium based on multi-group data fusion

Also Published As

Publication number Publication date
CN111291777A (en) 2020-06-16
CN111291777B (en) 2023-04-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18942218

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18942218

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02/11/2021)