CN115620808A - Cancer gene prognosis screening method and system based on improved Cox model - Google Patents
- Publication number
- CN115620808A (application number CN202211631423.4A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- cox
- message
- patient
- regression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B35/00—ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
- G16B35/20—Screening of libraries
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Description
Technical field
The present invention relates to the technical field of Cox-model regression in survival analysis, and more specifically to a cancer gene prognosis screening method and system based on an improved Cox model.
Background
DNA microarray technology makes it possible to monitor the expression levels of thousands of genes simultaneously, and so to study how particular treatments, diseases and developmental stages affect gene expression. A typical workflow measures gene expression in the cancerous cells of many cancer patients, obtains the patients' survival data through follow-up, analyzes the collected data statistically with survival-analysis methods, and finally screens out the genes related to prognosis. Studying the relationship between prognostic genes and tumors can inform the prediction of prognosis, recurrence and metastasis, and can even guide treatment. The ultimate goal is to support individualized treatment of patients and to contribute to breakthroughs in cancer treatment.
The collected survival data and gene expression levels must undergo a systematic survival analysis that screens roughly a dozen key prognostic genes out of tens of thousands of genes. This step is indispensable to the whole prognostic analysis: the resulting gene set can be used to assess the risk of cancer patients and to provide additional treatment information.
The Cox regression model is widely used in medical follow-up studies and is so far the most widely applied multivariate method in survival analysis. It is a semi-parametric model based on a linear combination of covariates. It takes survival outcome and survival time as dependent variables, can analyze the influence of many factors on survival time simultaneously, can handle data with censored survival times, and does not require the type of survival distribution to be estimated. Because of these properties, the Cox regression model plays a pivotal role in screening cancer prognosis genes.
According to the published literature, the most commonly used method for solving the Cox regression model is the one proposed by Noah Simon et al., which fits the model by coordinate descent with warm starts along a regularization path (with the $\ell_1$ and $\ell_2$ norms as penalty terms). However, the penalty coefficients are determined by cross-validation, so they cannot be solved for automatically and precisely. Because the fit is computed by an optimization method, it is a point estimate: it yields no posterior distribution and therefore cannot be combined with the expectation-maximization (EM) algorithm to solve for the prior parameters (that is, the penalty coefficients) automatically. As a result, the prognostic genes finally screened out by the algorithm may not be well associated with the cancer.
Cox regression is a survival-analysis method and an important part of prognostic gene screening. Each regression coefficient obtained by solving the Cox regression model is a risk weight for the corresponding gene, so the subsequent risk calculation for each patient is accurate only if the regression coefficients are accurate. A more accurate method for solving the Cox regression model is therefore needed.
To meet this need and address the defects of the prior art, the present application proposes a cancer gene prognosis screening method and system based on an improved Cox model.
Summary of the invention
The present invention provides a cancer gene prognosis screening method and system based on an improved Cox model. In the regression part, the regression accuracy is improved by introducing a prior and updating its parameters automatically. The genes whose regression coefficients have large absolute values are screened out as prognostic genes, providing information for the subsequent prediction of prognosis, recurrence and metastasis, and even for guiding treatment.
The primary purpose of the present invention is to solve the above technical problems. The technical solution of the present invention is as follows:
A first aspect of the present invention provides a cancer gene prognosis screening method based on an improved Cox model, comprising the following steps:
S1. Measure the expression levels of different genes in the cancer cells of cancer patients and collect the patients' survival data; organize the gene expression levels and the patient information into a first matrix, and preprocess the first matrix to obtain a second matrix X.
S2. Input the survival data obtained in step S1 and the second matrix X into a preset Cox regression model, and solve the model to obtain the regression coefficients.
S3. Using the patient's risk function, evaluate the patient risk associated with the gene corresponding to each regression coefficient, and screen out the prognostic gene set corresponding to high patient risk.
S4. Use the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.
In the first matrix, each row represents a patient and each column represents a gene fragment of the cancer cells; an element of the first matrix is the expression level, in the patient of the corresponding row, of the gene of the corresponding column.
The survival data comprise the covariate matrix (that is, the second matrix X), the survival time y and the censoring indicator c.
The genes corresponding to regression-coefficient components with larger absolute values have a greater influence on patient survival time, so the prognostic gene set corresponding to high patient risk can be screened out by evaluating the regression coefficients.
The preprocessing in step S1 removes irrelevant genes by bioinformatic statistical methods, yielding a second matrix X with fewer columns.
Further, in step S2, the survival data and the second matrix are first combined into a third matrix, which is input into the preset Cox regression model. The third matrix is written [X, y, c], where X is the covariate matrix (the second matrix), y is the survival time and c is the censoring indicator; the survival data of the i-th patient is $(x_i, y_i, c_i)$.
Further, the risk (hazard) function of the i-th patient is:
$$h_i(t) = h_0(t)\,\exp\left(x_i^{\top}\beta\right)$$
where $h_0(t)$ is the shared baseline hazard function, $\beta$ is the vector of regression coefficients obtained by solving the Cox regression model, and $x_i$ is the gene expression level of the i-th patient.
Once the regression coefficients $\beta$ have been fitted with the Cox regression model, a patient's risk can be assessed from that patient's gene expression level $x_i$. The components of $\beta$ with larger absolute values have a greater influence on patient survival time, and the genes corresponding to these components form the prognostic gene set to be screened out.
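As an illustration of how fitted coefficients could be turned into patient risks and a gene ranking, the following is a minimal sketch rather than the patented implementation. The function names, the toy matrix, the gene labels `G1` to `G4` and the cut-off `top_k` are all hypothetical; only the relative hazard $\exp(x_i^{\top}\beta)$ and the ranking by absolute coefficient come from the text above.

```python
import numpy as np

def relative_risk(X, beta):
    """Relative hazard exp(x_i^T beta) for each patient (one row of X each)."""
    return np.exp(X @ beta)

def select_prognostic_genes(beta, gene_names, top_k):
    """Return the top_k gene names ranked by |beta|, largest first."""
    order = np.argsort(-np.abs(beta), kind="stable")[:top_k]
    return [gene_names[j] for j in order]

# Toy data (hypothetical): 3 patients, 4 genes.
X = np.array([[1.0, 0.0, 2.0, 0.5],
              [0.0, 1.0, 0.0, 0.5],
              [1.0, 1.0, 1.0, 0.0]])
beta = np.array([0.8, 0.0, -1.2, 0.1])
genes = ["G1", "G2", "G3", "G4"]

risk = relative_risk(X, beta)                       # one value per patient
selected = select_prognostic_genes(beta, genes, 2)  # genes with largest |beta|
```

The baseline hazard $h_0(t)$ is shared by all patients, so it cancels when patients are compared and only the exponential term is needed for ranking.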
Further, solving the Cox regression model in step S2 to obtain the regression coefficients comprises the following steps:
S21. Merge the existing survival data into the third matrix and sort it by survival time; build the Cox regression model from the sorted data; initialize the prior parameters and the message-passing parameters.
S22. On the split vector factor graph of the Cox regression model, apply the expectation propagation algorithm: project the high-dimensional messages onto independent Gaussian distributions using the moment-matching rule, solve the model iteratively, and output the regression coefficients and their approximate posterior probability.
S23. Input the regression coefficients and the approximate posterior probability into the expectation-maximization algorithm and update the prior parameters.
S24. Check whether the regression coefficients satisfy the preset iteration end condition. If they do, output the regression coefficients obtained in the current iteration; if not, return to step S22 for the next iteration.
The third matrix is [X, y, c], where X is the covariate matrix, y is the survival time and c is the censoring indicator.
The regression coefficients are estimated with a fully Bayesian analysis: the penalized maximum-likelihood estimate is replaced by a minimum mean-square-error estimate in the Bayesian sense. Using a factor graph as the tool, the messages passed between nodes are computed with a message-passing method based on expectation propagation, which yields the approximate posterior probability of the regression coefficients. In essence, this approximately infers the probability distribution that the regression coefficients follow.
Further, the prior parameters comprise a mean, a variance and a sparsity rate, and the message-passing parameters comprise the mean and variance of the forward-direction message. Step S21 is specifically: normalize the covariate matrix X; sort the third matrix [X, y, c] in descending order of survival time y; substitute the sorted third matrix [X, y, c] into the Cox partial likelihood function; and initialize the prior parameters and the message-passing function.
The regression coefficients are given a Gauss-Bernoulli prior, governed by the prior parameters, which makes them sparse.
The projection at the likelihood-function node is approximated and simplified using Laplace's method and the moment generating function. This simplifies an otherwise complex calculation and yields fairly accurate regression coefficients at a small loss.
Further, the covariate matrix X is normalized using mean(X), the mean of all elements of X, and var(X), the variance of all elements of X.
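The normalization and sorting in step S21 can be sketched as follows. The source text names only mean(X) and var(X), so dividing by the square root of var(X) is an assumption, as are the function names and the encoding of the censoring indicator (1 for an observed event).

```python
import numpy as np

def normalize_global(X):
    """Standardize X with the mean and variance of ALL its elements.

    Assumption: the scale factor is sqrt(var(X)); the source only states
    that mean(X) and var(X) are used.
    """
    return (X - X.mean()) / np.sqrt(X.var())

def sort_by_survival_desc(X, y, c):
    """Reorder the rows of [X, y, c] so that survival time y is descending."""
    order = np.argsort(-y, kind="stable")
    return X[order], y[order], c[order]

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([2.0, 9.0, 5.0])   # survival times
c = np.array([1, 0, 1])         # 1 = event observed (assumed encoding)

Xn = normalize_global(X)
Xs, ys, cs = sort_by_survival_desc(Xn, y, c)
```

Sorting once up front means that the risk set of each patient is simply a prefix of the sorted rows, which keeps the later likelihood computation cheap.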
The Cox partial likelihood function is treated as a transition probability from the linear predictor to the observed survival data, where the transition-probability notation indicates normalization with respect to the output variable. The Cox partial likelihood itself is left unnormalized, so it is specified only up to proportionality ($\propto$). The function takes the linear predictor as its variable; the i-th element of the linear predictor is formed from the regression coefficients and the i-th row of X.
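For reference, the textbook Cox partial likelihood can be evaluated as in the sketch below. This is the standard form without tied survival times, not necessarily the exact expression used in the patent, whose formula is not reproduced in the text. The symbol `z` for the linear predictor and the event encoding (1 for an observed event) are assumptions.

```python
import numpy as np

def cox_partial_loglik(z, c):
    """Log of the standard Cox partial likelihood (no ties assumed).

    z : linear predictors x_i^T beta, sorted by DESCENDING survival time
    c : event indicators, 1 = event observed, 0 = censored (assumed encoding)
    """
    # With descending survival times the risk set of patient i is rows 0..i,
    # so a running log-sum-exp gives log(sum over the risk set of exp(z_j)).
    log_risk = np.logaddexp.accumulate(z)
    return float(np.sum(c * (z - log_risk)))

z = np.zeros(3)
c = np.array([1, 0, 1])
loglik = cox_partial_loglik(z, c)
```

Working in the log domain with a running log-sum-exp avoids overflow when some linear predictors are large.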
The prior parameters are initialized by letting the regression coefficients follow a Gauss-Bernoulli distribution. In its expression, $\delta(\cdot)$ denotes the Dirac delta function and $\mathcal{N}(\cdot)$ denotes a Gaussian distribution with the prior mean and the prior variance; the function takes the regression coefficient as its variable. The three prior parameters (mean, variance and sparsity rate) are then given initial values.
The message-passing function is initialized by initializing the message-passing function of the forward-direction message. In its expression, $\mathbf{0}_n$ is the n-dimensional column vector whose elements are all 0 and $\mathbf{1}_n$ is the n-dimensional column vector whose elements are all 1, the subscript giving the dimension of the vector. The message variable is a random variable following a multidimensional Gaussian distribution with independent, equal-variance components, and its parameters are then given initial values.
In the split vector factor graph of the Cox regression model, four multidimensional random variables represent the messages passed on the factor graph; that is, each message is regarded as a multidimensional Gaussian probability density function. The moment-matching process requires every message to follow a multidimensional Gaussian distribution with independent, equal-variance components. Here $\mathbf{1}_n$ is the n-dimensional column vector of ones and $\mathbf{1}_p$ is the p-dimensional column vector of ones, the subscript giving the dimension of the vector. When the elements of a multidimensional Gaussian random variable are mutually independent, that is, when the off-diagonal elements of the covariance matrix are 0, the diagonal covariance matrix can be represented by a vector.
Further, step S22 passes messages on the split vector factor graph of the Cox regression model according to the moment-matching rule, and comprises the following steps:
S221. Update the first message according to the moment-matching rule of the split vector factor graph of the Cox regression model, specifically:
At the corresponding node, the incoming message is multiplied by the node's factor and the product is projected onto a multidimensional Gaussian distribution with independent, equal-variance components; the projection is then divided by the incoming message to obtain the outgoing message.
Here the projection operation computes the mean vector and the variance vector of the product. Because the target is a multidimensional Gaussian with independent, equal-variance components, all elements of the variance vector are equal and the off-diagonal elements of the covariance are 0. The projection outputs the resulting Gaussian.
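The division step described above has a closed form when both the projected result and the incoming message are Gaussian. The sketch below shows the standard Gaussian division used in expectation propagation; the function and variable names are hypothetical, and the patent's own update expressions are not reproduced in the text.

```python
import numpy as np

def gaussian_divide(m_post, v_post, m_in, v_in):
    """Divide N(m_post, v_post) by N(m_in, v_in), element-wise.

    Precisions and precision-weighted means subtract. The result is a
    proper Gaussian only where v_post < v_in.
    """
    v_out = 1.0 / (1.0 / v_post - 1.0 / v_in)
    m_out = v_out * (m_post / v_post - m_in / v_in)
    return m_out, v_out

# Check against a product: N(1, 2) * N(3, 4) has precision 0.75.
v_prod = 1.0 / (1.0 / 2.0 + 1.0 / 4.0)
m_prod = v_prod * (1.0 / 2.0 + 3.0 / 4.0)
m_back, v_back = gaussian_divide(m_prod, v_prod, 3.0, 4.0)
```

Dividing the product by one of its factors recovers the other factor, which is why the outgoing message carries only information that is new to the receiving node.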
S222. Update the second message according to the moment-matching rule of the split vector factor graph of the Cox regression model, specifically:
At the corresponding node, the incoming message is multiplied by the node's factor, which is a Dirac delta function, and one variable is integrated out. The result is projected onto a multidimensional Gaussian distribution with independent, equal-variance components, and the projection is then divided by the incoming message to obtain the outgoing message.
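When the delta factor ties a linear predictor to the regression coefficients through X, and both incoming messages are Gaussian with equal variances, integrating out one variable gives a Gaussian whose moments follow the usual linear-Gaussian formulas. The sketch below assumes that the constraint is `z = X @ beta` and shows only the direction toward `beta`; all names are hypothetical, and the source text does not reproduce the patent's expressions.

```python
import numpy as np

def linear_node_posterior(X, m_z, v_z, m_b, v_b):
    """Moments of beta given messages N(z; m_z, v_z*I) and N(beta; m_b, v_b*I),
    under the assumed hard constraint z = X @ beta.

    Returns the mean vector and the single shared variance obtained by
    averaging the diagonal of the covariance (the equal-variance projection).
    """
    p = X.shape[1]
    precision = X.T @ X / v_z + np.eye(p) / v_b
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ m_z / v_z + m_b / v_b)
    return mean, float(np.trace(cov) / p)

X = np.eye(2)
mean_b, var_b = linear_node_posterior(
    X, m_z=np.array([2.0, 4.0]), v_z=1.0, m_b=np.zeros(2), v_b=1.0)
```

Averaging the diagonal of the covariance is what keeps every message describable by one mean vector and one scalar variance.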
S223. Update the third message according to the moment-matching rule of the split vector factor graph of the Cox regression model, specifically:
At the corresponding node, the incoming message is multiplied by the node's factor and the product is projected onto a multidimensional Gaussian distribution with independent, equal-variance components; the projection is divided by the incoming message to obtain the outgoing message. The mean obtained by this projection is the vector of Cox regression coefficients that is output as the result.
S224. Update the fourth message according to the moment-matching rule of the split vector factor graph of the Cox regression model, specifically:
At the corresponding node, the incoming message is multiplied by the node's factor and one variable is integrated out; the result is projected onto a multidimensional Gaussian distribution with independent, equal-variance components, and the projection is then divided by the incoming message to obtain the outgoing message.
Because the likelihood factor has an extremely complex form, the cumulant generating function and Laplace's method are used in place of a direct calculation to carry out the projection.
Further, the projection in step S223 yields the approximate posterior probability of the regression coefficients; the mean obtained by the projection is the vector of Cox regression coefficients output by the model.
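For a Gauss-Bernoulli prior combined with a Gaussian incoming message, the posterior mean and variance needed by the projection are available in closed form. The following sketch uses the standard spike-and-slab moment formulas; the parameter names `rho`, `mu`, `sigma2` (prior sparsity rate, mean, variance) and `r`, `tau` (message mean and variance) are assumptions, since the patent's symbols are not preserved in the text.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def bg_posterior_moments(r, tau, rho, mu, sigma2):
    """Posterior moments of beta_j under the prior
    (1 - rho) * delta(beta_j) + rho * N(beta_j; mu, sigma2)
    and a Gaussian message N(beta_j; r_j, tau)."""
    slab = rho * gauss_pdf(r, mu, sigma2 + tau)      # evidence of "non-zero"
    spike = (1.0 - rho) * gauss_pdf(r, 0.0, tau)     # evidence of "zero"
    pi = slab / (slab + spike)                       # P(beta_j != 0 | message)
    v = 1.0 / (1.0 / sigma2 + 1.0 / tau)             # slab posterior variance
    m = v * (mu / sigma2 + r / tau)                  # slab posterior mean
    mean = pi * m
    var = pi * (v + m ** 2) - mean ** 2
    return mean, var, pi, m, v

r = np.array([2.0, 0.0])
mean_g, var_g, pi_g, _, _ = bg_posterior_moments(r, 1.0, rho=1.0, mu=0.0, sigma2=1.0)
_, _, pi_s, _, _ = bg_posterior_moments(np.array([0.0, 5.0]), 1.0,
                                        rho=0.1, mu=0.0, sigma2=1.0)
shared_var = float(var_g.mean())   # equal-variance projection
```

With `rho = 1` the prior reduces to a plain Gaussian, which gives a quick sanity check; with a small `rho`, entries whose message mean is far from zero receive a high probability of being non-zero.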
Further, step S23 is specifically: the regression coefficients and the approximate posterior probability output by step S22 are combined with the expectation-maximization algorithm to update the prior parameters automatically. The update expressions involve two auxiliary quantities that are functions of a common variable, together with element-wise vector division and element-wise vector multiplication.
Because the prior parameters are learned automatically and updated at every iteration of the overall algorithm, no manual tuning is needed, and the uncertainty introduced by cross-validation is avoided.
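The source text does not reproduce the patent's update expressions. As a hedged illustration, the textbook EM updates for a Gauss-Bernoulli prior are sketched below; the inputs `pi`, `m` and `v` (per-coefficient posterior probability of being non-zero, and the mean and variance of the non-zero component) and the function name are hypothetical.

```python
import numpy as np

def em_update(pi, m, v):
    """Textbook EM update of a Gauss-Bernoulli prior (an assumption, not
    necessarily the patent's exact expressions).

    pi : posterior probability that each coefficient is non-zero
    m  : posterior mean of the non-zero (slab) component
    v  : posterior variance of the non-zero (slab) component
    """
    rho = float(np.mean(pi))                       # new sparsity rate
    weight = float(np.sum(pi))
    mu = float(np.sum(pi * m) / weight)            # new prior mean
    sigma2 = float(np.sum(pi * (v + (m - mu) ** 2)) / weight)  # new variance
    return rho, mu, sigma2

pi = np.array([1.0, 1.0, 0.0, 0.0])
m = np.array([1.0, 3.0, 0.0, 0.0])
v = np.array([0.5, 0.5, 0.5, 0.5])
rho_new, mu_new, sigma2_new = em_update(pi, m, v)
```

Each update is a weighted average over the coefficients, with weights given by how likely each coefficient is to be non-zero.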
Further, the preset iteration end condition in step S24 is based on a criterion value Crit, which is defined using the 1-norm. Whether to end the iteration is decided by checking whether Crit has begun to rise: if it has, the iterative process stops and the regression coefficients of the final iteration are output; if it has not, the iteration continues.
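A stopping rule of this kind can be sketched as follows. The exact definition of Crit is not reproduced in the source text, so the relative 1-norm change between successive iterates is used here as an assumption; `step` stands for one hypothetical round of steps S22 and S23, and returning the last iterate before the rise is a design choice of this sketch.

```python
import numpy as np

def crit(beta_new, beta_old, eps=1e-12):
    """Assumed criterion: relative change between iterates in the 1-norm."""
    return float(np.sum(np.abs(beta_new - beta_old))
                 / (np.sum(np.abs(beta_old)) + eps))

def run_until_crit_rises(step, beta0, max_iter=100):
    """Iterate `step` until the criterion starts to rise or max_iter is hit."""
    beta_prev, prev_crit = beta0, np.inf
    for _ in range(max_iter):
        beta_next = step(beta_prev)
        current = crit(beta_next, beta_prev)
        if current > prev_crit:      # Crit started to rise: stop
            break
        beta_prev, prev_crit = beta_next, current
    return beta_prev

# Scripted iterates (hypothetical): the third update makes Crit rise.
seq = iter([np.array([1.5]), np.array([1.75]), np.array([3.0])])
result = run_until_crit_rises(lambda b: next(seq), np.array([1.0]))
```

In the scripted example the criterion falls from 0.5 to about 0.17 and then jumps, so the loop stops and keeps the iterate 1.75.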
A second aspect of the present invention provides a cancer gene prognosis screening system based on an improved Cox model, comprising a memory and a processor. The memory stores a cancer gene prognosis screening program based on the improved Cox model which, when executed by the processor, implements the following steps:
S1. Measure the expression levels of different genes in the cancer cells of cancer patients and collect the patients' survival data; organize the gene expression levels and the patient information into a first matrix, and preprocess the first matrix to obtain a second matrix X.
S2. Input the survival data obtained in step S1 and the second matrix X into a preset Cox regression model, and solve the model to obtain the regression coefficients.
S3. Using the patient's risk function, evaluate the patient risk associated with the gene corresponding to each regression coefficient, and screen out the prognostic gene set corresponding to high patient risk.
S4. Use the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.
Compared with the prior art, the technical solution of the present invention has the following beneficial effects:
The present invention provides a cancer gene prognosis screening method and system based on an improved Cox model. Using a factor graph as the tool, the approximate posterior probability of the Cox regression coefficients is inferred by a moment-matching message-passing method based on expectation propagation. Minimum mean-square-error estimation gives an accurate estimate of the regression coefficients. The prior parameters are solved for automatically with the expectation-maximization algorithm, which removes the need for cross-validation and makes the coefficient estimates more precise. In the implementation, simplification by Laplace's method and the cumulant generating function allows the product of the complex likelihood factor and a Gaussian to be projected, so that the iteration can proceed. This addresses the problem of regression accuracy. The genes whose regression coefficients have large absolute values are screened out as prognostic genes, providing information for the subsequent prediction of prognosis, recurrence and metastasis, and even for guiding treatment.
Description of drawings
Fig. 1 is a flow chart of the cancer gene prognosis screening method based on an improved Cox model according to the present invention.
Fig. 2 is a flow chart of solving the Cox model in the cancer gene prognosis screening method based on an improved Cox model according to the present invention.
Fig. 3 is a flow chart of one embodiment of solving the Cox model according to the present invention.
Fig. 4 is a schematic diagram of the split vector factor graph in one embodiment of the present invention.
Fig. 5 is a schematic diagram of the moment-matching message-passing method based on expectation propagation in one embodiment of the present invention.
Fig. 6 shows the regression performance on simulated data in one embodiment of the present invention.
Fig. 7 is a schematic structural diagram of the cancer gene prognosis screening system based on an improved Cox model according to the present invention.
Detailed description
In order that the above objects, features and advantages of the present invention may be understood more clearly, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. Where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to facilitate a full understanding of the present invention. However, the invention may also be implemented in ways other than those described here, and its scope of protection is therefore not limited by the specific embodiments disclosed below.
Example 1
As shown in Fig. 1, the present invention provides a cancer gene prognosis screening method based on an improved Cox model, comprising the following steps:
S1. Collect the expression levels of different genes in the cancer cells of cancer patients and collect the patients' survival data; organize the gene expression levels and patient information into a first matrix, and preprocess the first matrix to obtain a second matrix X.
S2. Input the survival data obtained in step S1 and the second matrix X into a preset Cox regression model, and solve it to obtain the regression coefficients.
S3. According to the patient's hazard function, evaluate the patient risk associated with the genes corresponding to the regression coefficients, and screen out the prognostic gene set corresponding to high patient risk.
S4. Use the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.
In the first matrix, the rows represent patients and the columns represent gene fragments of the cancer cells; each element of the first matrix is the expression level of the gene of the corresponding column in the patient of the corresponding row.
The survival data comprise the covariate matrix (the second matrix X), the survival time y and the censoring indicator c.
Genes corresponding to components of the regression coefficients with large absolute values have a large effect on patient survival time, so evaluating the regression coefficients makes it possible to screen out the prognostic gene set corresponding to high patient risk.
The preprocessing in step S1 is specifically: irrelevant genes are removed by bioinformatics statistical methods, giving a second matrix X with fewer columns.
Further, in step S2 the survival data and the second matrix are first combined into a third matrix, which is input into the preset Cox regression model. The third matrix is denoted [X, y, c], where X is the covariate matrix (the second matrix), y is the survival time and c is the censoring indicator; the survival data of the i-th patient are $(x_i, y_i, c_i)$.
Further, the hazard function of the i-th patient is specifically:
$$h_i(t) = h_0(t)\,\exp\!\left(x_i^{\top}\beta\right)$$
where $h_0(t)$ is the shared baseline hazard function, $\beta$ is the vector of regression coefficients obtained by solving the Cox regression model, and $x_i$ is the gene expression level of the i-th patient.
Once the regression coefficients $\beta$ have been fitted with the Cox regression model, the patient's risk can be assessed from the patient's gene expression levels $x_i$. Components of $\beta$ with large absolute values have a large effect on patient survival time, and the genes corresponding to these components form the prognostic gene set to be screened out.
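The screening idea above can be sketched in a few lines, assuming the standard proportional-hazards form in which patient i has relative hazard exp(x_i^T β) once the shared baseline is dropped. The function names `relative_risk` and `top_genes`, the cutoff `k`, and the toy values are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def relative_risk(X, beta):
    """Relative hazard exp(x_i^T beta) for every patient (row of X)."""
    return np.exp(X @ beta)

def top_genes(beta, k):
    """Indices of the k coefficients with the largest absolute value."""
    order = np.argsort(-np.abs(beta), kind="stable")
    return order[:k]

# Toy data: 3 patients, 4 genes (hypothetical values).
X = np.array([[1.0, 0.0, 2.0, 0.5],
              [0.0, 1.0, 0.0, 0.5],
              [1.0, 1.0, 1.0, 0.0]])
beta = np.array([0.0, -1.5, 0.8, 0.1])

risk = relative_risk(X, beta)      # one relative hazard per patient
selected = top_genes(beta, k=2)    # candidate prognostic genes
```

A fixed `k` is only one possible selection rule; a threshold on |β| would serve the same purpose.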
Further, solving the Cox regression model in step S2 to obtain the regression coefficients, as shown in Fig. 2, specifically comprises the following steps:
S21. Merge the existing survival data into the third matrix and sort it by survival time; construct the Cox regression model from the sorted data and initialize the prior parameters and the message-passing parameters.
S22. On the split vector factor graph of the Cox regression model, use the expectation propagation algorithm to project the high-dimensional messages onto independent Gaussian distributions by the moment-matching rule; solve the model iteratively and output the regression coefficients and the approximate posterior probability.
S23. Input the regression coefficients and the approximate posterior probability into the EM algorithm and update the prior parameters.
S24. Determine whether the regression coefficients satisfy the preset iteration end condition. If so, output the regression coefficients obtained in the current iteration; if not, return to step S22 for the next iteration.
The third matrix is [X, y, c], where X is the covariate matrix, y is the survival time and c is the censoring indicator.
The problem of estimating the regression coefficients is solved by a fully Bayesian analysis: the penalized maximum likelihood estimate is converted into a minimum mean squared error estimate from the Bayesian viewpoint. With a factor graph as the tool, the messages passed between nodes are computed by a message-passing method based on expectation propagation, which yields the approximate posterior probability of the regression coefficients. In essence, this approximately infers the probability distribution that the regression coefficients follow.
Further, the prior parameters comprise the mean, the variance and the sparsity rate; the message-passing parameters comprise the mean and variance of the forward messages. Step S21 is specifically: normalize the covariate matrix X, sort the third matrix [X, y, c] in descending order of survival time y, substitute the sorted third matrix [X, y, c] into the Cox partial likelihood function, and initialize the prior parameters and the message-passing function.
In a specific embodiment, the covariate matrix may be a gene expression matrix in which each row represents a different patient, each column represents a different gene, and each element is the expression level of a given gene in a given patient.
The regression coefficients follow a Gauss-Bernoulli distribution governed by the prior parameters and are therefore sparse.
The Laplace method and the moment generating function are used to approximate and simplify the projection operation at the likelihood function node, so that the otherwise complex computation becomes tractable and reasonably accurate regression coefficients are obtained with little loss.
Further, the normalization of the covariate matrix X is specifically:
$$X \leftarrow \frac{X - \operatorname{mean}(X)}{\sqrt{\operatorname{var}(X)}}$$
where mean(X) is the mean of all elements of X and var(X) is the variance of all elements of X.
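As a small illustration of this normalization, the sketch below shifts and scales X with the mean and variance taken over all of its elements. Dividing by the square root of the variance is an assumption here, since the source text defines only mean(X) and var(X); `normalize_global` is a hypothetical helper name.

```python
import numpy as np

def normalize_global(X):
    """Shift and scale X using the mean and variance of all its elements."""
    return (X - X.mean()) / np.sqrt(X.var())

X = np.array([[1.0, 2.0],
              [3.0, 6.0]])
Xn = normalize_global(X)   # overall mean 0, overall variance 1
```

Unlike per-gene standardization, this keeps the relative scale of the columns unchanged.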
The Cox partial likelihood function is specifically:
$$p(y \mid z) \;\propto\; \prod_{i=1}^{n}\left[\frac{\exp(z_i)}{\sum_{j=1}^{i}\exp(z_j)}\right]^{c_i}$$
where $p(y \mid z)$ denotes the transition probability from $z$ to $y$, which indicates that it is normalized with respect to $y$; the right-hand side is the Cox partial likelihood function, which is not normalized, hence the proportionality. The function takes $z$ as its variable, whose i-th element is $z_i = x_i^{\top}\beta$, and $c_i$ is the i-th element of $c$; the rows are sorted in descending order of $y$.
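The sorting step and the partial likelihood can be sketched together. The code assumes the standard Cox form written in terms of z = Xβ, rows sorted by descending survival time so that the risk set of row i is rows 1..i, c_i = 1 marking an observed event, and no special handling of tied times. Both function names are hypothetical.

```python
import numpy as np

def sort_by_time_desc(X, y, c):
    """Reorder the rows of [X, y, c] by descending survival time y."""
    order = np.argsort(-y, kind="stable")
    return X[order], y[order], c[order]

def cox_log_partial_likelihood(z, c):
    """Log partial likelihood for z already sorted by descending time."""
    log_risk = np.logaddexp.accumulate(z)   # log sum_{j<=i} exp(z_j)
    return float(np.sum(c * (z - log_risk)))

# Toy data (hypothetical values): 3 patients, 1 covariate.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([2.0, 5.0, 3.0])
c = np.array([1.0, 1.0, 0.0])

Xs, ys, cs = sort_by_time_desc(X, y, c)
value = cox_log_partial_likelihood(np.zeros(3), np.ones(3))
```

Working with running log-sums rather than raw exponentials keeps the computation stable when some z_i are large.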
The initialization of the prior parameters is specifically: let the regression coefficients follow a Gauss-Bernoulli distribution, whose mathematical expression is:
$$p(\beta) = \prod_{j=1}^{p}\left[(1-\rho)\,\delta(\beta_j) + \rho\,\mathcal{N}(\beta_j;\,\mu,\,\sigma^2)\right]$$
where $\delta(\cdot)$ denotes the Dirac delta function and $\mathcal{N}(\cdot;\,\mu,\,\sigma^2)$ denotes the Gaussian distribution with mean $\mu$ and variance $\sigma^2$; the function takes $\beta$ as its variable. The prior parameters $\rho$, $\mu$ and $\sigma^2$ are then initialized.
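To make the sparsity of this prior concrete, the sketch below draws coefficients from a Gauss-Bernoulli distribution. It assumes the sparsity rate `rho` is the probability that a coefficient is nonzero; the parameter values are illustrative and are not the initial values used in the patent.

```python
import numpy as np

def sample_gauss_bernoulli(p, rho, mu, sigma2, rng):
    """Draw p coefficients: N(mu, sigma2) with probability rho, else exactly 0."""
    active = rng.random(p) < rho
    values = rng.normal(mu, np.sqrt(sigma2), size=p)
    return np.where(active, values, 0.0)

rng = np.random.default_rng(0)
beta = sample_gauss_bernoulli(10_000, rho=0.2, mu=0.0, sigma2=1.0, rng=rng)
nonzero_fraction = np.mean(beta != 0.0)   # close to rho
```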
The initialization of the message-passing function is specifically: initialize the message-passing function of the forward messages, whose mathematical expression is: [equation image not reproduced]
where $\mathbf{0}_n$ is the n-dimensional column vector whose elements are all 0 and $\mathbf{1}_n$ is the n-dimensional column vector whose elements are all 1; the message is a random variable following an independent, homoscedastic multivariate Gaussian distribution. The mean and variance of the forward messages are then initialized.
In a specific embodiment, the split vector factor graph of the Cox regression model is shown in Fig. 4.
In the split vector factor graph of the Cox regression model, as shown in Fig. 5, four multidimensional random variables represent the messages passed on the factor graph; that is, each message is treated as a multivariate Gaussian probability density function. The moment-matching procedure requires the messages to follow the distributions below: [equation image not reproduced]
where each message is a random variable following an independent, homoscedastic multivariate Gaussian distribution; $\mathbf{1}_n$ is the n-dimensional column vector of ones and $\mathbf{1}_p$ is the p-dimensional column vector of ones, the subscript giving the dimension of the vector. When the elements of a multivariate Gaussian random variable are mutually independent, i.e., the off-diagonal elements of the covariance matrix are 0, the diagonal matrix can be represented by a vector.
In a specific embodiment, the prior parameters, namely the sparsity, mean and variance parameters of the prior distribution, are given initial values and are subsequently updated automatically by the EM algorithm.
Further, step S22 is specifically: message passing is carried out on the split vector factor graph of the Cox regression model based on the moment-matching rule, comprising the following steps:
S221. According to the moment-matching rule of the split vector factor graph of the Cox regression model, update the first of the four messages, specifically:
At the node, the incoming message is multiplied by the likelihood factor and projected onto an independent, homoscedastic multivariate Gaussian distribution; the projection result is then divided by the incoming message to give the updated message.
Here the projection operation computes the mean vector and the variance vector of its argument. Because the target is an independent, homoscedastic multivariate Gaussian, all elements of the variance vector are equal and the off-diagonal elements of the covariance are 0; the resulting Gaussian is the output.
S222. According to the moment-matching rule of the split vector factor graph of the Cox regression model, update the second message, specifically:
At the node, the incoming messages are multiplied by the Dirac delta factor and the corresponding variable is integrated out; the result is projected onto an independent, homoscedastic multivariate Gaussian distribution, and the projection result is then divided by the incoming message to give the updated message.
S223. According to the moment-matching rule of the split vector factor graph of the Cox regression model, update the third message, specifically:
At the node, the incoming message is multiplied by the prior and the result is projected onto an independent, homoscedastic multivariate Gaussian distribution; the projection result is divided by the incoming message to give the updated message. The mean obtained by this projection is the Cox regression coefficient that is output as the result.
S224. According to the moment-matching rule of the split vector factor graph of the Cox regression model, update the fourth message, specifically:
At the node, the incoming messages are multiplied by the Dirac delta factor and the corresponding variable is integrated out; the result is projected onto an independent, homoscedastic multivariate Gaussian distribution, and the projection result is then divided by the incoming message to give the updated message.
Because the likelihood factor has an extremely complex form, the cumulant generating function and the Laplace method are used to carry out its projection in place of a direct evaluation.
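Each of the four updates above ends with the same two operations: a projection onto an independent, homoscedastic Gaussian and a division by the incoming Gaussian message. A minimal sketch of both, with hypothetical helper names, is given here. It relies on two standard facts: the KL projection onto N(m, vI) keeps the means and averages the marginal variances, and dividing Gaussian densities subtracts their natural parameters.

```python
import numpy as np

def project_iid(mean_vec, var_vec):
    """Moment-match to N(mean_vec, v * I): keep the means, average the variances."""
    return mean_vec, float(np.mean(var_vec))

def divide_gaussian(m_num, v_num, m_den, v_den):
    """Mean and variance of N(m_num, v_num) / N(m_den, v_den), up to a constant."""
    precision = 1.0 / v_num - 1.0 / v_den
    v_out = 1.0 / precision
    m_out = v_out * (m_num / v_num - m_den / v_den)
    return m_out, v_out

m_post, v_post = project_iid(np.array([1.0, 2.0]), np.array([0.5, 1.5]))
m_ext, v_ext = divide_gaussian(np.array([1.0]), 0.5, np.array([0.0]), 1.0)
```

The division is only meaningful when the resulting precision is positive; a practical implementation would guard against the opposite case, which this sketch does not.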
Further, in step S223 the projection operation is specifically: [equation image not reproduced]
where the argument of the projection is the approximate posterior probability of the regression coefficients, and the mean obtained by the projection is the Cox regression coefficient output by the model.
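For a Gauss-Bernoulli prior, the posterior moments needed by this projection have a closed form. The sketch below assumes an incoming Gaussian message N(β_j; r_j, τ) and the prior (1-ρ)δ(β_j) + ρN(β_j; μ, σ²). The symbols `r` and `tau` and the function names are assumptions introduced for illustration, and the formulas are the textbook spike-and-slab result rather than the patent's own expressions.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def bg_posterior(r, tau, rho, mu, sigma2):
    """Posterior mean, variance and support probability of each beta_j."""
    v = 1.0 / (1.0 / tau + 1.0 / sigma2)        # variance of the Gaussian part
    m = v * (r / tau + mu / sigma2)             # mean of the Gaussian part
    on = rho * gauss_pdf(r, mu, tau + sigma2)   # evidence for beta_j != 0
    off = (1.0 - rho) * gauss_pdf(r, 0.0, tau)  # evidence for beta_j == 0
    pi = on / (on + off)
    mean = pi * m
    var = pi * (v + m ** 2) - mean ** 2
    return mean, var, pi, m, v

r = np.array([1.0, -0.2])
mean, var, pi, m, v = bg_posterior(r, tau=1.0, rho=1.0, mu=0.0, sigma2=1.0)
```

With `rho=1.0` the prior is purely Gaussian, so the result reduces to the usual product of two Gaussians, which is a convenient sanity check.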
Further, step S23 is specifically: the regression coefficients and the approximate posterior probability output by step S22 are used with the EM algorithm to update the prior parameters automatically. The update expressions are specifically: [equation image not reproduced]
where the two intermediate quantities are both functions of the same argument, with the following expressions: [equation image not reproduced]
where the two operators denote element-wise vector division and element-wise vector multiplication, respectively.
Because the prior parameters are learned in this way, they are updated automatically as the overall algorithm iterates, with no manual tuning, which further avoids the uncertainty of cross-validation.
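The patent's own update expressions are not reproduced in this text. As a hedged illustration only, the sketch below shows the textbook EM update for a Gauss-Bernoulli prior, given per-coefficient posterior quantities: `pi` (probability that β_j is nonzero) and `m`, `v` (mean and variance of the Gaussian part). These names are assumptions, and the patent's formulas may differ.

```python
import numpy as np

def em_update(pi, m, v):
    """One EM step for the sparsity rate, mean and variance of the prior."""
    rho = float(np.mean(pi))
    weight = np.sum(pi)
    mu = float(np.sum(pi * m) / weight)
    sigma2 = float(np.sum(pi * (v + (m - mu) ** 2)) / weight)
    return rho, mu, sigma2

pi = np.array([1.0, 1.0, 0.0, 0.0])
m = np.array([1.0, 3.0, 0.0, 0.0])
rho, mu, sigma2 = em_update(pi, m, v=0.5)
```

Each parameter is a posterior-weighted average, so no grid search or cross-validation over the prior parameters is needed.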
Further, the preset iteration end condition in step S24 is specifically: [equation image not reproduced]
Whether to end the iteration is decided by checking whether the value of Crit has started to rise. If Crit has started to rise, the iteration stops and the regression coefficients of the final iteration are output; if Crit has not started to rise, the iteration continues. The norm used in Crit is the 1-norm.
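The stopping rule can be sketched as a loop that halts as soon as the criterion rises. The exact definition of Crit is not reproduced in this text, so the code uses a stand-in, the 1-norm of the change in the coefficients between iterations, which is an assumption. `update` is a hypothetical callable that performs one full iteration.

```python
import numpy as np

def iterate_until_crit_rises(update, beta0, max_iter=100):
    """Run update() until the stand-in criterion starts to rise."""
    beta = np.asarray(beta0, dtype=float)
    prev_crit = np.inf
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        new_beta = update(beta)
        crit = np.sum(np.abs(new_beta - beta))   # hypothetical Crit (1-norm of the change)
        beta = new_beta
        if crit > prev_crit:                     # Crit started to rise: stop
            break
        prev_crit = crit
    return beta, n_iter

# Toy update whose step sizes shrink (1, 0.5, 0.25) and then grow (1.0).
seq = iter([1.0, 1.5, 1.75, 2.75, 3.0])
beta, n_iter = iterate_until_crit_rises(lambda b: np.array([next(seq)]), [0.0])
```

Following the text, the coefficients of the final iteration are returned; keeping the iterate from just before the rise would be an equally simple variant.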
In a specific embodiment, the regression performance on simulated data in a single experiment is shown in Fig. 6, where the black line is the true value and the asterisks are the estimates.
The simulated data are generated as follows:
The covariate matrix is generated by independent standard normal sampling.
The censoring indicators are sampled independently from the binomial distribution B(1, 0.8), giving a censoring rate of 0.2.
The regression coefficients are generated by Laplace-Bernoulli sampling with a sparsity rate of 0.2.
For a sample i that is not censored, the survival time is: [equation image not reproduced]
where the auxiliary variable is sampled independently from U(0,1). For a sample i that is censored, the survival time is: [equation image not reproduced]
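The first three generation steps can be sketched as below. The survival-time formulas are not reproduced in this text, so they are deliberately left out rather than guessed. The Laplace scale of 1.0, the reading of the sparsity rate as the fraction of nonzero coefficients, and the function name `simulate_inputs` are assumptions.

```python
import numpy as np

def simulate_inputs(n, p, rng, event_prob=0.8, sparsity=0.2):
    """Generate covariates X, censoring indicators c and sparse coefficients beta."""
    X = rng.standard_normal((n, p))            # independent standard normal entries
    c = rng.binomial(1, event_prob, size=n)    # 1 = event observed, 0 = censored
    active = rng.random(p) < sparsity          # which coefficients are nonzero
    beta = np.where(active, rng.laplace(0.0, 1.0, size=p), 0.0)
    return X, c, beta

rng = np.random.default_rng(1)
X, c, beta = simulate_inputs(n=5000, p=50, rng=rng)
event_fraction = c.mean()   # close to 0.8, i.e. a censoring rate near 0.2
```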
Example 2
Based on Example 1 above and with reference to Fig. 3, this example describes in detail the specific process of solving the Cox model in the present invention.
In a specific embodiment, as shown in Fig. 3, the known data are X, y and c, and the coefficients to be regressed are β.
Step 1:
S 1.1: Initialize X:
$$X \leftarrow \frac{X - \operatorname{mean}(X)}{\sqrt{\operatorname{var}(X)}}$$
where mean(X) is the mean of all elements of X and var(X) is the variance of all elements of X.
S 1.2: Merge the existing survival data (covariate matrix X, survival time y, censoring indicator c) into one matrix [X, y, c] and sort it in descending order of y.
S 1.3: Substitute the sorted [X, y, c] into the Cox partial likelihood function:
$$p(y \mid z) \;\propto\; \prod_{i=1}^{n}\left[\frac{\exp(z_i)}{\sum_{j=1}^{i}\exp(z_j)}\right]^{c_i}$$
Here $p(y \mid z)$ denotes the transition probability from $z$ to $y$, which implies that it is normalized with respect to $y$ (a property of probability density functions), whereas the right-hand side is the Cox partial likelihood function, which is not normalized, hence the proportionality. The function takes $z$ as its variable, whose i-th element is $z_i = x_i^{\top}\beta$, and $c_i$ is the i-th element of $c$.
S 1.4: Assume the prior follows a Gauss-Bernoulli distribution:
$$p(\beta) = \prod_{j=1}^{p}\left[(1-\rho)\,\delta(\beta_j) + \rho\,\mathcal{N}(\beta_j;\,\mu,\,\sigma^2)\right]$$
The function takes $\beta$ as its variable; the prior parameters $\rho$, $\mu$ and $\sigma^2$ are then initialized.
S 1.5: Initialize the forward messages: [equation image not reproduced]
where the message means and variances are initialized; $\mathbf{0}_n$ is the n-dimensional column vector whose elements are all 0 and $\mathbf{1}_n$ is the n-dimensional column vector whose elements are all 1, the subscript giving the dimension of the vector.
Step 2: Message passing on the factor graph based on the moment-matching rule, i.e., the expectation propagation algorithm
S 2.1: Update the first message: at the node, the incoming message is multiplied by the likelihood factor and projected onto an independent, homoscedastic multivariate Gaussian distribution, after which the incoming message is divided out: [equation image not reproduced]
Here the projection operation computes the mean vector and the variance vector (the diagonal of the covariance matrix) of its argument. Because the target is an independent, homoscedastic multivariate Gaussian, all elements of the variance vector are equal and the off-diagonal elements are 0; the resulting Gaussian is the output.
Simplifying the projection with the Laplace method and the moment generating function finally gives: [equation image not reproduced]
where the variance is expressed through the Hessian matrix of the objective, i.e., its second-order gradient.
The diagonal operator has the following meaning: applied to a matrix, it extracts the diagonal; applied to a vector, it expands the vector into a diagonal matrix.
The averaging operator takes the mean of a vector; the two remaining operators denote element-wise vector division and element-wise vector multiplication.
The mode required by the Laplace method is found by making a quadratic approximation of the objective and solving it with the coordinate ascent algorithm:
First take the Taylor expansion of the objective: [equation image not reproduced]
where the first-order term is the gradient of the objective at the expansion point and the second-order term is its Hessian matrix at the expansion point. After rewriting: [equation image not reproduced]
the objective is finally simplified to: [equation image not reproduced]
where the subscript i denotes the i-th element. The coordinate ascent algorithm is then applied:
S 2.1.1: Initialize the iterate.
S 2.1.2: Update the gradient at the current iterate; for its k-th element: [equation image not reproduced]
S 2.1.3: Update the Hessian matrix at the current iterate; for its element in row k, column k (to speed up the computation, only the diagonal elements are kept to approximate the whole matrix): [equation image not reproduced]
S 2.1.4: Update: [equation image not reproduced]
S 2.1.5: Update: [equation image not reproduced]
S 2.1.6: Check the change in the iterate. If the change is sufficiently small, output the result; if the change is still large, return to S 2.1.2 and continue iterating.
Finally, compute the division part and output the updated message: [equation image not reproduced]
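The gradient and diagonal Hessian used in S 2.1.2 and S 2.1.3 can be sketched as follows. The code assumes the standard Cox log partial likelihood in z, with rows sorted by descending survival time and c_i = 1 for an observed event. The patent's own expressions are not reproduced in this text, and the full objective would also include the quadratic term contributed by the incoming Gaussian message. Function names are hypothetical.

```python
import numpy as np

def cox_loglik(z, c):
    """Cox log partial likelihood: sum_i c_i * (z_i - log sum_{j<=i} exp(z_j))."""
    return float(np.sum(c * (z - np.logaddexp.accumulate(z))))

def cox_grad_hess_diag(z, c):
    """Gradient and Hessian diagonal of cox_loglik with respect to z."""
    e = np.exp(z)
    S = np.cumsum(e)                           # S_i = sum_{j<=i} exp(z_j)
    a = np.cumsum((c / S)[::-1])[::-1]         # a_k = sum_{i>=k} c_i / S_i
    b = np.cumsum((c / S ** 2)[::-1])[::-1]    # b_k = sum_{i>=k} c_i / S_i^2
    grad = c - e * a
    hess_diag = -e * a + e ** 2 * b            # non-positive: the function is concave
    return grad, hess_diag

z = np.array([0.3, -0.2, 0.5, 0.1])
c = np.array([1.0, 0.0, 1.0, 1.0])
grad, hess_diag = cox_grad_hess_diag(z, c)
```

The reversed cumulative sums bring the cost down to O(n) per evaluation, which matters because these quantities are recomputed at every coordinate ascent pass.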
S 2.2: Update the second message: at the node, the incoming messages are multiplied by the Dirac delta factor, the corresponding variable is integrated out, and the result is projected onto an independent, homoscedastic multivariate Gaussian distribution, after which the incoming message is divided out: [equation image not reproduced]
where the projection evaluates to: [equation image not reproduced]
Here $\mathbf{1}_n$ is the n-dimensional column vector of ones, the subscript giving the dimension of the vector. The diagonal operator extracts the diagonal when applied to a matrix and expands a vector into a diagonal matrix when applied to a vector; the averaging operator takes the mean of a vector. The projection computes the mean vector and variance vector of its argument and outputs the resulting Gaussian. The superscript $-1$ denotes matrix inversion and the superscript $\top$ denotes matrix transposition.
Finally, compute the division part and output the updated message: [equation image not reproduced]
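The projection in S 2.2 involves a matrix inverse, a transpose, a diagonal extraction and an average, which matches the usual Gaussian computation for a linear constraint. The sketch below assumes the constraint z = Xβ with isotropic Gaussian messages N(z; m_z, v_z I) and N(β; m_b, v_b I). These symbols and the function name are assumptions, and the final division step is omitted.

```python
import numpy as np

def linear_module_to_beta(X, m_z, v_z, m_b, v_b):
    """Gaussian belief on beta under z = X beta, projected to one shared variance."""
    p = X.shape[1]
    cov = np.linalg.inv(X.T @ X / v_z + np.eye(p) / v_b)
    mean = cov @ (X.T @ m_z / v_z + m_b / v_b)
    shared_var = float(np.mean(np.diag(cov)))   # average of the marginal variances
    return mean, shared_var

X = np.eye(2)
mean, shared_var = linear_module_to_beta(
    X, m_z=np.array([2.0, 4.0]), v_z=1.0, m_b=np.zeros(2), v_b=1.0)
```

For large gene panels the p-by-p inverse dominates the cost; solving the linear system instead of forming the inverse explicitly is a common alternative.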
S 2.3: Update the third message: at the node, the incoming message is multiplied by the prior, the result is projected onto an independent, homoscedastic multivariate Gaussian distribution, and the incoming message is then divided out: [equation image not reproduced]
where the projection evaluates to: [equation image not reproduced]
where the two intermediate quantities are both functions of the same argument, with the following expressions: [equation image not reproduced]
Finally, compute the division part and output the updated message: [equation image not reproduced]
The approximate posterior of the regression coefficients is: [equation image not reproduced]
The mean obtained by the projection operation is exactly the Cox regression coefficient to be output.
S 2.4: Update the fourth message: at the node, the incoming messages are multiplied by the Dirac delta factor, the corresponding variable is integrated out, and the result is projected onto an independent, homoscedastic multivariate Gaussian distribution, after which the incoming message is divided out: [equation image not reproduced]
where the projection evaluates to: [equation image not reproduced]
Finally, compute the division part and output the updated message: [equation image not reproduced]
Step 3: Using the approximate posterior probability output by S 2.3 together with the expectation-maximization (EM) algorithm, update the prior parameters automatically.
S 3.1: Update the first prior parameter: [equation image not reproduced]
S 3.2: Update the second prior parameter: [equation image not reproduced]
S 3.3: Update the third prior parameter: [equation image not reproduced]
Step 4: Determine whether the preset iteration end condition has been reached.
The end condition is: [equation image not reproduced]
Check whether Crit has started to rise. If it has, stop the iteration and output the final regression coefficients (from S 2.3). The norm used in Crit is the 1-norm.
Example 3
Based on Example 1 and Example 2 above and with reference to Fig. 7, this example describes in detail the second aspect of the present invention, a cancer gene prognosis screening system based on an improved Cox model.
In a specific embodiment, as shown in Fig. 7, the present invention further provides a cancer gene prognosis screening system based on an improved Cox model. The system comprises a memory and a processor; the memory contains a cancer gene prognosis screening program based on the improved Cox model which, when executed by the processor, implements the following steps:
S1. Collect the expression levels of different genes in the cancer cells of cancer patients and collect the patients' survival data; organize the gene expression levels and patient information into a first matrix, and preprocess the first matrix to obtain a second matrix X.
S2. Input the survival data obtained in step S1 and the second matrix X into a preset Cox regression model, and solve it to obtain the regression coefficients.
S3. According to the patient's hazard function, evaluate the patient risk associated with the genes corresponding to the regression coefficients, and screen out the prognostic gene set corresponding to high patient risk.
S4. Use the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.
The icons in the drawings that describe structural and positional relationships are for illustration only and should not be construed as limiting this patent.
Obviously, the above embodiments of the present invention are merely examples given to illustrate the invention clearly and do not limit its implementation. A person of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all implementations exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211631423.4A CN115620808B (en) | 2022-12-19 | 2022-12-19 | Cancer gene prognosis screening method and system based on improved Cox model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115620808A | 2023-01-17 |
CN115620808B | 2023-03-31 |
Family
ID=84879866
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211631423.4A Active CN115620808B (en) | 2022-12-19 | 2022-12-19 | Cancer gene prognosis screening method and system based on improved Cox model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115620808B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN115620808B (en) | 2023-03-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||