CN115620808A - Cancer gene prognosis screening method and system based on improved Cox model - Google Patents

Info

Publication number
CN115620808A
Authority
CN
China
Prior art keywords
matrix
cox
message
patient
regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211631423.4A
Other languages
Chinese (zh)
Other versions
CN115620808B (en)
Inventor
张善书
张浩川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN202211631423.4A
Publication of CN115620808A
Application granted
Publication of CN115620808B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20 Screening of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Optimization (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Chemical & Material Sciences (AREA)
  • Computational Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Analytical Chemistry (AREA)
  • Algebra (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a cancer gene prognosis screening method and system based on an improved Cox model, comprising the following steps: S1, collecting the expression levels of different genes in the cancer cells of cancer patients, collecting the patients' survival data, organizing the gene expression levels and patient information into a first matrix, and preprocessing the first matrix to obtain a second matrix; S2, inputting the survival data and the second matrix into a preset Cox regression model and solving it to obtain regression coefficients; S3, evaluating, according to the patient risk function, the patient risk associated with the gene corresponding to each regression coefficient, and screening out the prognostic gene set corresponding to high patient risk; and S4, using the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis. Compared with conventional techniques, the regression step improves accuracy by adding a prior and updating its parameters automatically, and the method provides guidance for predicting prognosis, recurrence and metastasis.

Description

Cancer gene prognosis screening method and system based on improved Cox model

Technical field

The present invention relates to the technical field of Cox model regression in survival analysis, and more specifically to a cancer gene prognosis screening method and system based on an improved Cox model.

Background

DNA microarray technology can monitor the expression levels of thousands of genes simultaneously, making it possible to study the effects of particular treatments, diseases and developmental stages on gene expression. A common scenario is to measure gene expression in the cancerous cells of multiple cancer patients, obtain the patients' survival data through follow-up, analyze the collected data statistically with survival analysis, and finally screen out the genes related to prognosis. Studying the relationship between prognostic genes and tumors provides information for predicting prognosis, recurrence and metastasis, and even for guiding treatment. The ultimate goal is to support individualized treatment of patients and further provide breakthroughs in cancer treatment.

The collected survival data and gene expression levels must undergo a systematic survival analysis that screens a dozen or so key prognostic genes out of tens of thousands of genes. This step is an indispensable part of the whole prognostic analysis: the gene set formed by these dozen or so genes can be used to assess the risk of cancer patients and provide more information for treatment.

The Cox regression model is widely used in medical follow-up studies and is the most widely used multivariate analysis method in survival analysis to date. It is a semi-parametric model based on a linear combination of covariates. The model takes survival outcome and survival time as dependent variables, can analyze the influence of many factors on survival time at the same time, can handle data with censored survival times, and does not require the type of survival distribution of the data to be estimated. Because of these properties, the model plays a pivotal role in screening cancer prognosis genes.

According to the published literature, the most commonly used method for solving the Cox regression model is the one proposed by Noah Simon et al., which fits the model by coordinate descent, using warm starts along the regularization path (with the $\ell_1$ norm and the $\ell_2$ norm as penalty terms). However, its penalty coefficients are determined by cross-validation, so they cannot be solved for automatically and precisely. Because the fit is computed by an optimization method, it is a point estimate: no posterior distribution is obtained, so the prior parameters (that is, the penalty coefficients) cannot be solved for automatically in combination with the expectation-maximization (EM) algorithm. As a result, the prognostic genes finally screened by the algorithm are not well associated with cancer.

Cox regression is a survival analysis method and an important part of prognostic gene screening. Each regression coefficient obtained by solving the Cox regression model is the risk weight of the corresponding gene; only when the regression coefficients are accurate will the subsequent risk calculation for each patient be accurate. A more accurate method for solving the Cox regression model is therefore needed.

To this end, in view of the above needs and the shortcomings of the prior art, the present application proposes a cancer gene prognosis screening method and system based on an improved Cox model.

Summary of the invention

The present invention provides a cancer gene prognosis screening method and system based on an improved Cox model. In the regression step, adding a prior and updating its parameters automatically improves the regression accuracy, and the genes whose regression coefficients have large absolute values are screened out as prognostic genes, providing information for subsequent prediction of prognosis, recurrence and metastasis, and even for guiding treatment.

The primary purpose of the present invention is to solve the above technical problems. The technical solution of the present invention is as follows:

A first aspect of the present invention provides a cancer gene prognosis screening method based on an improved Cox model, the method comprising the following steps:

S1. Collect the expression levels of different genes in the cancer cells of cancer patients, collect the patients' survival data, organize the gene expression levels and patient information into a first matrix, and preprocess the first matrix to obtain a second matrix X.

S2. Input the survival data obtained in step S1 and the second matrix X into the preset Cox regression model, and solve it to obtain the regression coefficients.

S3. Evaluate, according to the patient risk function, the patient risk associated with the gene corresponding to each regression coefficient, and screen out the prognostic gene set corresponding to high patient risk.

S4. Use the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.

In the first matrix, each row represents a patient and each column represents a gene fragment of the cancer cells; an element of the first matrix is the expression level, in the patient of the corresponding row, of the gene of the corresponding column.

The survival data comprise the covariate matrix (that is, the second matrix X), the survival time y and the censoring indicator c.

The genes corresponding to the components of the regression coefficients with larger absolute values have a greater influence on patient survival time, so the prognostic gene set corresponding to high patient risk can be screened out by evaluating the regression coefficients.
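The screening rule just described can be sketched as follows. This is a minimal illustration that assumes the regression coefficients are already available as a list; the function name `select_prognostic_genes`, the gene labels and the cutoff `top_k` are hypothetical and do not come from the patent, which does not state a specific threshold.

```python
def select_prognostic_genes(gene_names, beta, top_k):
    """Return the top_k (gene, coefficient) pairs ranked by |coefficient|.

    gene_names, beta: equal-length sequences, one coefficient per gene.
    top_k is an illustrative cutoff; the source does not fix one.
    """
    if len(gene_names) != len(beta):
        raise ValueError("gene_names and beta must have the same length")
    ranked = sorted(zip(gene_names, beta), key=lambda gb: abs(gb[1]), reverse=True)
    return ranked[:top_k]


# Hypothetical coefficients for four made-up gene labels.
genes = ["g1", "g2", "g3", "g4"]
beta = [0.02, -1.30, 0.75, 0.0]
selected = select_prognostic_genes(genes, beta, top_k=2)
```

Ranking by absolute value keeps both risk-increasing (positive) and protective (negative) genes, which matches the text's emphasis on the magnitude of each component.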

The preprocessing in step S1 is specifically: removing irrelevant genes by bioinformatics statistical methods to obtain a second matrix X with fewer columns.

Further, in step S2, the survival data and the second matrix are first combined into a third matrix, which is input into the preset Cox regression model. The third matrix is denoted [X, y, c], where X is the covariate matrix (the second matrix), y is the survival time and c is the censoring indicator; the survival data of the i-th patient are [image 007].

Further, the risk (hazard) function of the i-th patient is specifically:

$$h_i(t) = h_0(t)\exp\left(\boldsymbol{x}_i^{T}\boldsymbol{\beta}\right)$$

where $h_0(t)$ is the shared baseline hazard function, $\boldsymbol{\beta}$ is the vector of regression coefficients obtained by solving the Cox regression model, and $\boldsymbol{x}_i$ is the gene expression level of the i-th patient.
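As a small worked example, the sketch below evaluates the relative risk of two patients, assuming the standard Cox proportional hazards form $h_i(t) = h_0(t)\exp(\boldsymbol{x}_i^{T}\boldsymbol{\beta})$. The coefficient and expression values are made up for illustration, and the helper names are hypothetical.

```python
import math


def linear_predictor(x_i, beta):
    """Inner product x_i^T beta for one patient."""
    return sum(x * b for x, b in zip(x_i, beta))


def relative_risk(x_i, beta):
    """exp(x_i^T beta): the patient's hazard relative to the baseline h_0(t)."""
    return math.exp(linear_predictor(x_i, beta))


# Illustrative values only.
beta = [0.5, -0.2]
patient_a = [1.0, 0.0]
patient_b = [0.0, 1.0]

# The shared baseline hazard cancels in the ratio of two patients' hazards.
ratio = relative_risk(patient_a, beta) / relative_risk(patient_b, beta)
```

Because the baseline hazard is shared, patients can be ranked by risk without ever estimating $h_0(t)$.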

Once the regression coefficients $\boldsymbol{\beta}$ have been fitted with the Cox regression model, patient risk can be evaluated from the patient's gene expression levels $\boldsymbol{x}_i$. The components of $\boldsymbol{\beta}$ with larger absolute values have a greater influence on patient survival time, and the genes corresponding to these components form the prognostic gene set to be screened out.

Further, solving the Cox regression model in step S2 to obtain the regression coefficients specifically comprises the following steps:

S21. Merge the available survival data into the third matrix and sort it by survival time, construct the Cox regression model from the sorted data, and initialize the prior parameters and the message-passing parameters.

S22. On the split vector factor graph of the Cox regression model, use the expectation propagation algorithm to project high-dimensional messages onto independent Gaussian distributions by the moment-matching rule, solve the model by loop iteration, and output the regression coefficients and the approximate posterior probability.

S23. Input the regression coefficients and the approximate posterior probability into the expectation-maximization algorithm and update the prior parameters.

S24. Determine whether the regression coefficients satisfy the preset iteration termination condition; if so, output the regression coefficients obtained in the current iteration; if not, return to step S22 for the next iteration.

The third matrix is [X, y, c], where X is the covariate matrix, y is the survival time and c is the censoring indicator.

Here the problem of estimating the regression coefficients is solved with a fully Bayesian analysis: the penalized maximum likelihood estimate is recast as a minimum mean square error estimate from the Bayesian viewpoint. Using the factor graph as a tool, the messages passed between nodes are computed by a message-passing method based on expectation propagation, yielding the approximate posterior probability of the regression coefficients; in essence, this approximately infers the probability distribution that the regression coefficients follow.

Further, the prior parameters comprise a mean [image 012], a variance [image 013] and a sparsity rate [image 014]; the message-passing parameters comprise the mean and variance of the forward messages. Step S21 is specifically: normalize the covariate matrix X, sort the third matrix [X, y, c] in descending order of survival time y, substitute the sorted third matrix [X, y, c] into the Cox partial likelihood function, and initialize the prior parameters and the message-passing function.

The prior parameters and the regression coefficients both follow the Gauss-Bernoulli distribution and are sparse.

The Laplace method and the moment generating function are used to approximate and simplify the projection operation at the likelihood-function node, which simplifies the otherwise complex computation and yields reasonably accurate regression coefficients with little loss.

Further, the normalization of the covariate matrix X is specifically:

[formula image 015]

where mean(X) is the mean of all elements of X and var(X) is the variance of all elements of X.
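The two data-preparation operations named in step S21, global normalization of X and the descending sort by survival time, can be sketched as below. Dividing by the standard deviation sqrt(var(X)) is an assumption about the normalization formula, and the function names are illustrative rather than taken from the patent.

```python
import math


def normalize_global(X):
    """Center and scale X using the mean and variance of ALL its elements.

    Assumption: scaling divides by the standard deviation sqrt(var(X)).
    X must not be constant, otherwise the variance is zero.
    """
    flat = [v for row in X for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    std = math.sqrt(var)
    return [[(v - mean) / std for v in row] for row in X]


def sort_by_survival_desc(X, y, c):
    """Reorder patients (rows of X, entries of y and c) by descending y."""
    order = sorted(range(len(y)), key=lambda i: y[i], reverse=True)
    return [X[i] for i in order], [y[i] for i in order], [c[i] for i in order]


# Three hypothetical patients with two genes each.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [5.0, 12.0, 8.0]
c = [1, 0, 1]
Xn = normalize_global(X)
Xs, ys, cs = sort_by_survival_desc(Xn, y, c)
```

Sorting in descending order of survival time means that, for any patient, everyone still at risk at that patient's time sits at an earlier index, which simplifies the partial likelihood computation.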

The Cox partial likelihood function is specifically:

[formula image 016]

where [image 017] indicates that the function is the transition probability from [image 018] to [image 019], that is, [image 017] is normalized with respect to [image 019]; [image 020] is the Cox partial likelihood function, which is unnormalized and expresses a proportional relationship. The function takes [image 018] as its variable, whose i-th element is [image 021], where [image 022] is the i-th element of [image 023].
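For orientation, the sketch below computes the textbook Cox partial log-likelihood (no tie correction) on data sorted by descending survival time. It is not a transcription of the patent's formula: the argument names `z` and `event`, and the convention that `event[i] = 1` marks an observed event, are assumptions.

```python
import math


def cox_partial_loglik(z, event):
    """Cox partial log-likelihood without tie correction.

    z:     linear predictors z_i = x_i^T beta, with patients sorted by
           DESCENDING survival time, so the risk set of patient i is {0..i}.
    event: 1 if the event was observed for patient i, 0 if censored
           (assumed convention for the censoring indicator).
    """
    shift = max(z)      # subtract the max before exponentiating, for stability
    running = 0.0       # sum of exp(z_j - shift) over the current risk set
    total = 0.0
    for z_i, e_i in zip(z, event):
        running += math.exp(z_i - shift)
        if e_i:
            total += z_i - (shift + math.log(running))
    return total


# With all predictors equal, each event contributes -log(size of its risk set).
z = [0.0, 0.0, 0.0]
event = [1, 1, 1]
ll = cox_partial_loglik(z, event)
```

The running sum works only because of the descending sort: each patient's risk set is exactly the prefix of patients already visited.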

The prior parameters are initialized as follows: let the regression coefficients follow the Gauss-Bernoulli distribution, whose mathematical expression is:

[formula image 024]

where [image 025] denotes the Dirac delta function; [image 026] denotes a Gaussian distribution with mean [image 027] and variance [image 028]; the function takes [image 029] as its variable; and the prior parameters are initialized as [image 030], [image 031], [image 032].
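To make the sparsity of this prior concrete, the sketch below draws coefficients from the commonly used Gauss-Bernoulli (spike-and-slab) form $p(\beta_j) = (1-\rho)\,\delta(\beta_j) + \rho\,\mathcal{N}(\beta_j;\mu,\sigma^2)$. The symbols $\rho$, $\mu$, $\sigma^2$, the reading of the sparsity rate as the probability of a nonzero coefficient, and the numerical values are all assumptions made for illustration.

```python
import random


def sample_gauss_bernoulli(p, rho, mu, sigma2, seed=0):
    """Draw p coefficients: N(mu, sigma2) with probability rho, else exactly 0.

    rho is taken to be the probability that a coefficient is nonzero
    (an assumed reading of the "sparsity rate").
    """
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    return [rng.gauss(mu, sd) if rng.random() < rho else 0.0 for _ in range(p)]


# Illustrative values: about 5% of 1000 coefficients are expected to be nonzero.
beta = sample_gauss_bernoulli(p=1000, rho=0.05, mu=0.0, sigma2=1.0)
nonzero_fraction = sum(b != 0.0 for b in beta) / len(beta)
```

Most draws are exactly zero, which is the behavior one wants when only a dozen or so genes out of thousands are expected to matter.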

The message-passing function is initialized as follows: initialize the message-passing function of the forward messages, whose mathematical expression is:

[formula image 033]

where [image 034] is an n-dimensional column vector whose elements are all 0; [image 035] is an n-dimensional column vector whose elements are all 1, the subscript indicating the dimension of the vector; [image 036] is a random variable following an independent, homoscedastic multivariate Gaussian distribution; and the initial values are [image 037], [image 038], [image 039].

In the split vector factor graph of the Cox regression model, four multidimensional random variables represent the messages passed on the factor graph; that is, each message is treated as a multivariate Gaussian probability density function. The moment-matching process requires the messages to follow these distributions:

[formula image 040]

where [image 041] are random variables following independent, homoscedastic multivariate Gaussian distributions; [image 042] is an n-dimensional column vector whose elements are all 1 and [image 043] is a p-dimensional column vector whose elements are all 1, the subscripts indicating the dimensions of the vectors. When the elements of a multivariate Gaussian random variable are mutually independent, that is, when the off-diagonal elements of the covariance matrix are 0, a vector can be used to represent the diagonal matrix.
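Because each message is a Gaussian with diagonal covariance, it can be stored as a mean vector plus a variance vector, and the multiply and divide operations used in the updates of step S22 reduce to element-wise arithmetic on precisions. The sketch below shows these two generic Gaussian identities; the function names are illustrative and the code is not specific to the patent's factor graph.

```python
def gaussian_multiply(m1, v1, m2, v2):
    """Product of N(m1, diag(v1)) and N(m2, diag(v2)), up to a constant.

    Precisions add; the mean is the precision-weighted average.
    """
    v = [1.0 / (1.0 / a + 1.0 / b) for a, b in zip(v1, v2)]
    m = [vi * (ma / a + mb / b) for vi, ma, a, mb, b in zip(v, m1, v1, m2, v2)]
    return m, v


def gaussian_divide(m1, v1, m2, v2):
    """Quotient N(m1, diag(v1)) / N(m2, diag(v2)), up to a constant.

    Precisions subtract, so v1 < v2 is required element-wise for the
    result to be a proper Gaussian.
    """
    v = [1.0 / (1.0 / a - 1.0 / b) for a, b in zip(v1, v2)]
    m = [vi * (ma / a - mb / b) for vi, ma, a, mb, b in zip(v, m1, v1, m2, v2)]
    return m, v


# Multiplying by a message and then dividing it out recovers the original.
m_prod, v_prod = gaussian_multiply([1.0, 2.0], [1.0, 2.0], [0.0, 1.0], [4.0, 4.0])
m_back, v_back = gaussian_divide(m_prod, v_prod, [0.0, 1.0], [4.0, 4.0])
```

Storing only the diagonal keeps every operation linear in the vector length, which matters when the number of genes is in the thousands.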

Further, step S22 is specifically: message passing is performed on the split vector factor graph of the Cox regression model based on the moment-matching rule, comprising the following steps:

S221. According to the moment-matching rule on the split vector factor graph of the Cox regression model, update [image 044], specifically:

[formula image 045]

At node [image 046], the message [image 047] is multiplied by [image 048] and projected onto an independent, homoscedastic multivariate Gaussian distribution; the projection result is then divided by the message [image 049] to obtain the message [image 050].

Here [image 051] is the projection operation, which computes the mean vector [image 054] and the variance vector [image 055] of [image 052] with respect to [image 053]. Because the target is an independent, homoscedastic multivariate Gaussian, all elements of the vector [image 055] are equal and the off-diagonal elements of the covariance are 0; the operation outputs [image 056].
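One common way to realize such a projection, shown here as an assumption rather than as the patent's exact rule, is moment matching in the Kullback-Leibler sense: keep the per-element means and replace the per-element variances by their average, so that a single shared variance remains. The sketch assumes the per-element means and variances of the distribution being projected are already available; computing them is the hard part and is not shown.

```python
def project_to_iid_gaussian(means, variances):
    """Project per-element moments onto a Gaussian with one shared variance.

    Minimizing KL(p || q) over q = N(m, v * I) gives m_i = E[x_i] and
    v = average of the per-element variances.
    """
    n = len(means)
    shared = sum(variances) / n
    return list(means), [shared] * n


# Illustrative per-element moments.
m_proj, v_proj = project_to_iid_gaussian([1.0, 2.0, 3.0], [0.1, 0.2, 0.3])
```

Returning the shared variance as a constant vector keeps the output compatible with code that expects one variance per element.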

S222. According to the moment-matching rule on the split vector factor graph of the Cox regression model, update [image 057], specifically:

[formula image 058]

At node [image 059], the message [image 060] is multiplied by [image 061], the variable [image 062] is integrated out, and the result is projected onto an independent, homoscedastic multivariate Gaussian distribution; the projection result is then divided by the message [image 063] to obtain the message [image 064], where [image 065] is the Dirac delta function.

S223. According to the moment-matching rule on the split vector factor graph of the Cox regression model, update [image 063], specifically:

[formula image 066]

At node [image 067], the message [image 068] is multiplied by [image 069] and the product is projected onto an independent, homoscedastic multivariate Gaussian distribution; the projection result is then divided by the message [image 070] to obtain the message [image 063]. The mean [image 071] obtained by the projection operation is the vector of Cox regression coefficients that is output as the result.

S224. According to the moment-matching rule on the split vector factor graph of the Cox regression model, update [image 072], specifically:

[formula image 073]

At node [image 074], the message [image 075] is multiplied by [image 074], the variable [image 076] is integrated out, and the result is projected onto an independent, homoscedastic multivariate Gaussian distribution; the projection result is then divided by the message [image 077] to obtain the message [image 072].

Because [image 078] has an extremely complicated form, the cumulant generating function and the Laplace method are used in place of [image 078] for the projection operation.

Further, in step S223, the projection operation is specifically:

[formula image 079]

where [image 080] denotes the approximate posterior probability of the regression coefficients; the mean [image 081] obtained by the projection is the vector of Cox regression coefficients output by the model.
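If, as the text suggests, this projection combines the Gauss-Bernoulli prior with an incoming Gaussian message, the per-coefficient posterior mean and variance have a closed form. The sketch below uses the standard spike-and-slab result under the assumed prior $(1-\rho)\,\delta(\beta_j) + \rho\,\mathcal{N}(\beta_j;\mu,\sigma^2)$ and an incoming message $\mathcal{N}(\beta_j; r, v)$. All symbols and names here are assumptions and may differ from the patent's expressions.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gb_posterior_moments(r, v, rho, mu, sigma2):
    """Posterior moments of one coefficient under a Gauss-Bernoulli prior
    combined with an incoming Gaussian message N(beta_j; r, v).

    Returns (mean, var, pi, m, s), where pi is the posterior probability
    that the coefficient is nonzero and (m, s) are the mean and variance
    of the nonzero (slab) component. Not hardened against underflow.
    """
    slab = rho * normal_pdf(r, mu, sigma2 + v)     # evidence if beta_j != 0
    spike = (1.0 - rho) * normal_pdf(r, 0.0, v)    # evidence if beta_j == 0
    pi = slab / (slab + spike)
    s = 1.0 / (1.0 / sigma2 + 1.0 / v)
    m = s * (mu / sigma2 + r / v)
    mean = pi * m
    var = pi * (m * m + s) - mean * mean
    return mean, var, pi, m, s


# With rho = 1 the prior is a plain Gaussian and the usual formulas result.
mean, var, pi, m, s = gb_posterior_moments(r=2.0, v=1.0, rho=1.0, mu=0.0, sigma2=1.0)
```

The posterior mean shrinks small incoming values strongly towards zero while leaving large ones almost untouched, which is how sparsity enters the estimate.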

Further, step S23 is specifically: the regression coefficient
Figure 672804DEST_PATH_IMAGE082
and the approximate posterior probability
Figure 424334DEST_PATH_IMAGE083
output by step S22 are passed to the expectation-maximization (EM) algorithm, which automatically updates the prior parameters
Figure 537784DEST_PATH_IMAGE084
. The update expressions are:

Figure 994173DEST_PATH_IMAGE085

Figure 749640DEST_PATH_IMAGE086

Figure 358607DEST_PATH_IMAGE087

where
Figure 174116DEST_PATH_IMAGE088
and
Figure 117801DEST_PATH_IMAGE089
are both functions of
Figure 676958DEST_PATH_IMAGE090
, given by:

Figure 140432DEST_PATH_IMAGE091

where
Figure 126842DEST_PATH_IMAGE092
denotes element-wise vector division and
Figure 557824DEST_PATH_IMAGE093
denotes element-wise vector multiplication.
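The update expressions above survive only as image placeholders, so the following is a minimal sketch of one EM step for a Gauss-Bernoulli prior, not the patent's exact formulas. It assumes the prior has the form (1 - lam) * delta(b) + lam * N(b; mu, sigma2), that the message entering the prior node is Gaussian with mean vector `r` and scalar variance `tau`, and that `lam` is the probability of a nonzero coefficient. The names `em_update_prior`, `r`, `tau`, `pi` and `gamma` are hypothetical.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Density of N(mean, var) evaluated element-wise at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_update_prior(r, tau, lam, mu, sigma2):
    """One EM step for the assumed prior (1-lam)*delta(b) + lam*N(b; mu, sigma2).

    r, tau: mean vector and scalar variance of the Gaussian message
    arriving at the prior node (an assumption of this sketch).
    """
    # Posterior probability that each coefficient is nonzero.
    slab = lam * gauss_pdf(r, mu, sigma2 + tau)
    spike = (1.0 - lam) * gauss_pdf(r, 0.0, tau)
    pi = slab / (slab + spike)
    # Posterior mean and variance of a coefficient given that it is nonzero.
    gamma = (r * sigma2 + mu * tau) / (sigma2 + tau)
    nu = sigma2 * tau / (sigma2 + tau)
    # M-step: re-estimate sparsity rate, mean and variance.
    lam_new = pi.mean()
    mu_new = np.sum(pi * gamma) / np.sum(pi)
    sigma2_new = np.sum(pi * ((gamma - mu_new) ** 2 + nu)) / np.sum(pi)
    return lam_new, mu_new, sigma2_new
```

All products and quotients here are element-wise, which matches the element-wise division and multiplication mentioned in the text.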

Because the prior parameters are learned automatically and updated at every iteration of the overall algorithm, no manual tuning is needed, which also avoids the uncertainty introduced by cross-validation.

Further, the preset iteration end condition in step S24 is:

Figure 655093DEST_PATH_IMAGE094

Whether to end the iteration is decided by checking whether the Crit value has started to rise. If it has, the iteration stops and the regression coefficient of the final iteration,
Figure 973073DEST_PATH_IMAGE095
, is output; otherwise the iteration continues. Here
Figure 130385DEST_PATH_IMAGE096
denotes the 1-norm.
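The stopping rule can be sketched as follows. The exact definition of Crit is given only as an image, so this sketch assumes, purely for illustration, that Crit is the 1-norm of the change in the regression coefficients between consecutive iterations. The function name and the `step` callback (one full S22/S23 iteration) are hypothetical.

```python
import numpy as np

def iterate_until_crit_rises(step, beta0, max_iter=100):
    """Run `step` repeatedly and stop as soon as the criterion increases.

    step: callable mapping the current coefficients to the next ones.
    Crit is assumed here to be ||beta_new - beta_old||_1.
    Returns the coefficients of the last iteration and the iteration count.
    """
    beta = np.asarray(beta0, dtype=float)
    prev_crit = np.inf
    it = 0
    for it in range(1, max_iter + 1):
        beta_new = step(beta)
        crit = np.linalg.norm(beta_new - beta, ord=1)
        beta = beta_new
        if crit > prev_crit:  # Crit has started to rise: stop.
            break
        prev_crit = crit
    return beta, it
```

Following the text, the sketch returns the coefficients of the last iteration performed; returning the iterate from just before the rise would be an equally plausible design.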

A second aspect of the present invention provides a cancer gene prognosis screening system based on the improved Cox model. The system comprises a memory and a processor; the memory stores a cancer gene prognosis screening program based on the improved Cox model which, when executed by the processor, implements the following steps:

S1. Measure the expression levels of different genes in the cancer cells of cancer patients, collect the patients' survival data, and arrange the gene expression levels and patient information into a first matrix
Figure 251924DEST_PATH_IMAGE097
; preprocess the first matrix
Figure 887305DEST_PATH_IMAGE097
to obtain a second matrix
Figure 574638DEST_PATH_IMAGE005
.

S2. Input the survival data obtained in step S1 and the second matrix X into the preset Cox regression model and solve it to obtain the regression coefficients.

S3. Using the patient's hazard function, evaluate the patient risk associated with the gene corresponding to each regression coefficient, and screen out the prognostic gene set associated with high patient risk.

S4. Use the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.

Compared with the prior art, the technical solution of the present invention has the following beneficial effects:

The present invention provides a cancer gene prognosis screening method and system based on an improved Cox model. Using the factor graph as a tool, it infers the approximate posterior probability of the Cox regression coefficients through a moment-matching message passing method based on expectation propagation. It uses minimum mean square error estimation to estimate the regression coefficients accurately. The prior parameters are solved automatically by the expectation-maximization algorithm, which removes the need for cross-validation and makes the coefficient estimates more accurate. In the implementation, simplification by Laplace's method and the cumulant generating function makes it possible to project the product of the complicated
Figure 656514DEST_PATH_IMAGE078
and a Gaussian, so that the iteration can proceed. This addresses the problem of regression accuracy, and the genes whose regression coefficients have large absolute values are selected as prognostic genes, providing information for subsequent prediction of prognosis, recurrence and metastasis, and even for guiding treatment.

Description of drawings

Fig. 1 is a flow chart of a cancer gene prognosis screening method based on the improved Cox model according to the present invention.

Fig. 2 is a flow chart of solving the Cox model in the cancer gene prognosis screening method based on the improved Cox model.

Fig. 3 is a flow chart of an embodiment of solving the Cox model.

Fig. 4 is a schematic diagram of the split vector factor graph in an embodiment of the present invention.

Fig. 5 is a schematic diagram of the moment-matching message passing method based on expectation propagation in an embodiment of the present invention.

Fig. 6 shows the regression performance on simulated data in an embodiment of the present invention.

Fig. 7 is a schematic structural diagram of a cancer gene prognosis screening system based on the improved Cox model according to the present invention.

Detailed description

To make the above objects, features and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other.

Many specific details are set forth in the following description to facilitate a full understanding of the present invention. However, the invention can also be implemented in ways other than those described here, so its scope of protection is not limited by the specific embodiments disclosed below.

Example 1

As shown in Fig. 1, the present invention provides a cancer gene prognosis screening method based on an improved Cox model. The method comprises the following steps:

S1. Measure the expression levels of different genes in the cancer cells of cancer patients, collect the patients' survival data, and arrange the gene expression levels and patient information into a first matrix
Figure 796508DEST_PATH_IMAGE097
; preprocess the first matrix
Figure 501159DEST_PATH_IMAGE097
to obtain a second matrix
Figure 42998DEST_PATH_IMAGE005
.

S2. Input the survival data obtained in step S1 and the second matrix X into the preset Cox regression model and solve it to obtain the regression coefficients.

S3. Using the patient's hazard function, evaluate the patient risk associated with the gene corresponding to each regression coefficient, and screen out the prognostic gene set associated with high patient risk.

S4. Use the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.

In the first matrix
Figure 558425DEST_PATH_IMAGE006
, each row represents a patient and each column represents a gene fragment of the cancer cells; an element of the first matrix
Figure 185715DEST_PATH_IMAGE006
is the expression level, in the patient of the corresponding row, of the gene of the corresponding column.

The survival data comprise the covariate matrix (that is, the second matrix X), the survival time y and the censoring indicator c.

The genes corresponding to components of the regression coefficient vector with larger absolute values have a greater influence on patient survival time, so evaluating the regression coefficients makes it possible to screen out the prognostic gene set associated with high patient risk.
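The screening by absolute coefficient value can be illustrated with a short sketch. The cutoff `k` and the gene names are hypothetical; the source does not state how many genes are kept or how the threshold is chosen.

```python
import numpy as np

def top_genes(beta, gene_names, k):
    """Return the k genes whose regression coefficients are largest in absolute value."""
    beta = np.asarray(beta, dtype=float)
    order = np.argsort(-np.abs(beta))[:k]  # indices sorted by descending |beta|
    return [(gene_names[j], float(beta[j])) for j in order]
```

For example, with coefficients `[0.1, -2.0, 0.0, 1.5]` and `k = 2`, the second and fourth genes are selected, regardless of the sign of their coefficients.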

The preprocessing in step S1 is specifically: irrelevant genes are removed by bioinformatic statistical methods, giving a second matrix
Figure 162898DEST_PATH_IMAGE005
with fewer columns.

Further, in step S2 the survival data and the second matrix are first combined into a third matrix, which is input into the preset Cox regression model. The third matrix is written [X, y, c], where X is the covariate matrix (the second matrix), y is the survival time and c is the censoring indicator; the survival data of the i-th patient are
Figure 824824DEST_PATH_IMAGE007
.

Further, the hazard function of the i-th patient is:

Figure 245572DEST_PATH_IMAGE098

where
Figure 360159DEST_PATH_IMAGE099
is the shared baseline hazard function,
Figure 344295DEST_PATH_IMAGE100
is the regression coefficient vector obtained by solving the Cox regression model, and
Figure 657465DEST_PATH_IMAGE101
is the gene expression level of the i-th patient.
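The hazard formula itself is available only as an image. Assuming it takes the standard proportional-hazards form, in which the baseline hazard is multiplied by exp(x_i^T beta), the patient-specific factor can be computed as below. The function name `relative_risk` is hypothetical.

```python
import numpy as np

def relative_risk(X, beta):
    """Per-patient multiplier exp(x_i^T beta) applied to the shared baseline hazard.

    X: (n_patients, n_genes) expression matrix; beta: (n_genes,) coefficients.
    Assumes the standard Cox proportional-hazards form.
    """
    return np.exp(np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float))
```

Because the baseline hazard is shared by all patients, ranking patients by this factor ranks them by hazard at any time point.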

Once the regression coefficient
Figure 701644DEST_PATH_IMAGE102
has been fitted with the Cox regression model, patient risk can be assessed from the patient's gene expression level
Figure 54259DEST_PATH_IMAGE103
. The components of
Figure 638824DEST_PATH_IMAGE102
with larger absolute values have a greater influence on patient survival time, and the genes corresponding to these components form the prognostic gene set to be screened out.

Further, solving the Cox regression model in step S2 to obtain the regression coefficients, as shown in Fig. 2, comprises the following steps:

S21. Merge the existing survival data into the third matrix and sort it by survival time; use the sorted data to construct the Cox regression model, and initialize the prior parameters and the message passing parameters.

S22. On the split vector factor graph of the Cox regression model, use the expectation propagation algorithm to project high-dimensional messages onto independent Gaussian distributions by the moment-matching rule, solve the model iteratively, and output the regression coefficients and the approximate posterior probability.

S23. Input the regression coefficients and the approximate posterior probability into the expectation-maximization algorithm and update the prior parameters.

S24. Determine whether the regression coefficients satisfy the preset iteration end condition. If so, output the regression coefficients obtained in the current iteration; otherwise return to step S22 for the next iteration.

The third matrix is [X, y, c], where X is the covariate matrix, y is the survival time and c is the censoring indicator.
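Building the third matrix and sorting it by survival time, as step S21 requires, might look like the following sketch. The descending order follows the description of step S21 elsewhere in this document; the function name is hypothetical.

```python
import numpy as np

def build_sorted_third_matrix(X, y, c):
    """Stack [X, y, c] column-wise and sort the rows by survival time, longest first."""
    M = np.column_stack([X, y, c])
    order = np.argsort(-np.asarray(y, dtype=float), kind="stable")
    return M[order]
```

With rows in descending order of survival time, the risk set of patient i (everyone still at risk at time y_i) is simply rows 0 through i, which makes the partial likelihood a running sum.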

The regression coefficient estimation problem is solved with a fully Bayesian analysis: the penalized maximum likelihood estimate is recast, from the Bayesian viewpoint, as a minimum mean square error estimate. Using the factor graph as a tool, the messages passed between nodes are computed by a message passing method based on expectation propagation, which yields the approximate posterior probability of the regression coefficients; in essence, this approximately infers the probability distribution that the regression coefficients follow.

Further, the prior parameters comprise the mean
Figure 275342DEST_PATH_IMAGE104
, the variance
Figure 287160DEST_PATH_IMAGE105
and the sparsity rate
Figure 858563DEST_PATH_IMAGE106
; the message passing parameters comprise the mean and variance of the forward messages. Step S21 is specifically: normalize the covariate matrix X, sort the third matrix [X, y, c] in descending order of survival time y, substitute the sorted third matrix [X, y, c] into the Cox partial likelihood function, and initialize the prior parameters and the message passing functions.

In a specific embodiment, the covariate matrix can be a gene expression matrix in which each row represents a patient, each column represents a gene, and each element is the expression level of one gene in one patient.

The prior parameters and the regression coefficients both follow the Gauss-Bernoulli distribution, which is sparse.

Laplace's method and the moment generating function are used to approximate and simplify the projection operation at the likelihood function node. This simplifies an otherwise complex computation and yields reasonably accurate regression coefficients at a small loss.

Further, the covariate matrix X is normalized as:

Figure 246819DEST_PATH_IMAGE015

where mean(X) is the mean of all elements of X and var(X) is the variance of all elements of X.
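The normalization formula is shown only as an image. A minimal sketch, assuming it subtracts the global mean and divides by the global standard deviation (the square root of var(X)), is given below; if the original divides by var(X) itself, the last line changes accordingly.

```python
import numpy as np

def normalize_global(X):
    """Standardize X using the mean and variance of all its elements.

    Assumption: X <- (X - mean(X)) / sqrt(var(X)), computed over every
    element rather than per column.
    """
    X = np.asarray(X, dtype=float)
    return (X - X.mean()) / np.sqrt(X.var())
```

Note that this differs from the more common per-gene (per-column) standardization: a single mean and a single variance are used for the whole matrix, as the text specifies.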

The Cox partial likelihood function is:

Figure 472264DEST_PATH_IMAGE016

where
Figure 654983DEST_PATH_IMAGE017
denotes that the function is the transition probability from
Figure 982190DEST_PATH_IMAGE107
to
Figure 908558DEST_PATH_IMAGE108
, which indicates that
Figure 722930DEST_PATH_IMAGE017
is normalized with respect to
Figure 342131DEST_PATH_IMAGE108
;
Figure 891055DEST_PATH_IMAGE020
is the Cox partial likelihood function, which is not normalized, hence the proportionality. The function takes
Figure 621113DEST_PATH_IMAGE107
as its variable, whose i-th element is
Figure 821150DEST_PATH_IMAGE021
, and
Figure 611252DEST_PATH_IMAGE022
is the i-th element of
Figure 381893DEST_PATH_IMAGE023
.
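For reference, the standard Cox partial log-likelihood can be evaluated as in the sketch below. It assumes the usual form without tied event times, rows already sorted in descending order of survival time, and c_i = 1 marking an observed (non-censored) event; the image-only formula in the source may be parameterized differently.

```python
import numpy as np

def cox_partial_loglik(beta, X_sorted, c_sorted):
    """Log of the Cox partial likelihood for data sorted by descending survival time.

    With that ordering, the risk set of row i is rows 0..i, so the log of the
    risk-set sum is a running log-sum-exp of the linear predictor.
    """
    z = np.asarray(X_sorted, dtype=float) @ np.asarray(beta, dtype=float)
    log_risk = np.logaddexp.accumulate(z)  # log(sum_{j<=i} exp(z_j))
    return float(np.sum(np.asarray(c_sorted) * (z - log_risk)))
```

Using a running log-sum-exp avoids overflow when some linear predictors are large.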

The prior parameters are initialized as follows. The regression coefficients are assumed to follow a Gauss-Bernoulli distribution:

Figure 915642DEST_PATH_IMAGE109

where
Figure 970186DEST_PATH_IMAGE025
denotes the Dirac delta function and
Figure 665610DEST_PATH_IMAGE026
denotes a Gaussian distribution with mean
Figure 641656DEST_PATH_IMAGE027
and variance
Figure 455461DEST_PATH_IMAGE110
; the function takes
Figure 364511DEST_PATH_IMAGE111
as its variable. The prior parameters are initialized to
Figure 496415DEST_PATH_IMAGE030
,
Figure 490916DEST_PATH_IMAGE031
,
Figure 117200DEST_PATH_IMAGE032
.
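To make the Gauss-Bernoulli prior concrete, the sketch below draws coefficients from it. It assumes the sparsity rate `lam` is the probability that a coefficient is nonzero; the source's convention (and its initial values, which are image-only) may differ.

```python
import numpy as np

def sample_gauss_bernoulli(p, lam, mu, sigma2, rng):
    """Draw p coefficients: zero with probability 1-lam, else N(mu, sigma2).

    Assumption: lam is the probability of a NONZERO coefficient.
    """
    support = rng.random(p) < lam                      # which entries are nonzero
    values = rng.normal(mu, np.sqrt(sigma2), size=p)   # Gaussian part
    return np.where(support, values, 0.0)
```

The point mass at zero is what makes the prior sparse: most sampled coefficients are exactly zero, which corresponds to most genes having no effect on survival.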

The message passing functions are initialized as follows. The message passing functions of the forward messages are initialized as:

Figure 146336DEST_PATH_IMAGE033

where
Figure 183562DEST_PATH_IMAGE034
is the n-dimensional column vector of zeros,
Figure 930938DEST_PATH_IMAGE035
is the n-dimensional column vector of ones, and
Figure 360914DEST_PATH_IMAGE036
is a random variable following an independent, equal-variance multidimensional Gaussian distribution. The initial values are
Figure 452684DEST_PATH_IMAGE112
,
Figure 421777DEST_PATH_IMAGE113
,
Figure 389864DEST_PATH_IMAGE114
.

In a specific embodiment, the split vector factor graph of the Cox regression model is shown in Fig. 4.

As shown in Fig. 5, four multidimensional random variables represent the messages passed on the split vector factor graph of the Cox regression model; that is, each message is treated as a multidimensional Gaussian probability density function. The moment-matching procedure requires the messages to follow the distributions:

Figure 128013DEST_PATH_IMAGE033

where
Figure 772621DEST_PATH_IMAGE041
is a random variable following an independent, equal-variance multidimensional Gaussian distribution;
Figure 229010DEST_PATH_IMAGE115
is the n-dimensional column vector of ones and
Figure 732279DEST_PATH_IMAGE043
is the p-dimensional column vector of ones, the subscript giving the dimension. When the elements of a multidimensional Gaussian random variable are mutually independent, that is, when the off-diagonal elements of the covariance matrix are zero, the diagonal covariance matrix can be represented by a vector.
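Because every message is a Gaussian with a diagonal covariance stored as a vector, the multiplication and division of messages used in the update steps reduce to element-wise arithmetic on precisions. The sketch below shows these two standard operations; the function names are hypothetical and this is not the patent's specific update.

```python
import numpy as np

def gauss_multiply(m1, v1, m2, v2):
    """Mean and variance of the (normalized) product N(m1, v1) * N(m2, v2), element-wise."""
    prec = 1.0 / v1 + 1.0 / v2          # precisions add under multiplication
    mean = (m1 / v1 + m2 / v2) / prec
    return mean, 1.0 / prec

def gauss_divide(m1, v1, m2, v2):
    """Mean and variance of the ratio N(m1, v1) / N(m2, v2), element-wise.

    The resulting precision can be non-positive if v2 <= v1; practical
    implementations usually guard against that case.
    """
    prec = 1.0 / v1 - 1.0 / v2          # precisions subtract under division
    mean = (m1 / v1 - m2 / v2) / prec
    return mean, 1.0 / prec
```

Dividing a product by one of its factors recovers the other factor, which is why expectation propagation can remove an incoming message from a projected belief to obtain the outgoing message.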

In a specific embodiment, the prior parameters of the prior distribution
Figure 590513DEST_PATH_IMAGE116
, namely the sparsity parameter
Figure 406023DEST_PATH_IMAGE117
, the mean parameter
Figure 287391DEST_PATH_IMAGE118
and the variance parameter
Figure 846548DEST_PATH_IMAGE119
, are given the initial values
Figure 310022DEST_PATH_IMAGE030
,
Figure 296432DEST_PATH_IMAGE031
,
Figure 727414DEST_PATH_IMAGE032
respectively, and are subsequently updated automatically by the expectation-maximization algorithm.

Further, step S22 passes messages on the split vector factor graph of the Cox regression model according to the moment-matching rule, and comprises the following steps:

S221. According to the moment-matching rule of the split vector factor graph of the Cox regression model, update
Figure 824683DEST_PATH_IMAGE120
as follows:

Figure 142663DEST_PATH_IMAGE121

At node
Figure 565554DEST_PATH_IMAGE122
, the message of
Figure 218252DEST_PATH_IMAGE123
is multiplied by
Figure 119212DEST_PATH_IMAGE124
and projected onto an independent, equal-variance multidimensional Gaussian distribution; the projected result is then divided by the message of
Figure 557278DEST_PATH_IMAGE049
to obtain the message of
Figure 885491DEST_PATH_IMAGE120
.

where
Figure 025485DEST_PATH_IMAGE125
is the projection operation: for
Figure 464557DEST_PATH_IMAGE052
with respect to
Figure 494479DEST_PATH_IMAGE053
, it finds the mean vector
Figure 993594DEST_PATH_IMAGE126
and the variance vector
Figure 620884DEST_PATH_IMAGE055
. Because the target is an independent, equal-variance multidimensional Gaussian, all elements of the vector
Figure 863647DEST_PATH_IMAGE055
are equal and the off-diagonal covariance elements are zero. The operation outputs
Figure 276305DEST_PATH_IMAGE056
.

S222. According to the moment-matching rule of the split vector factor graph of the Cox regression model, update
Figure 946320DEST_PATH_IMAGE057
as follows:

Figure 795328DEST_PATH_IMAGE127

At node
Figure 841781DEST_PATH_IMAGE128
, the message of
Figure 108946DEST_PATH_IMAGE129
is multiplied by
Figure 949863DEST_PATH_IMAGE061
and the variable
Figure 082904DEST_PATH_IMAGE062
is integrated out; the result is projected onto an independent, equal-variance multidimensional Gaussian distribution, and the projected result is then divided by the message of
Figure 418201DEST_PATH_IMAGE063
to obtain the message of
Figure 789140DEST_PATH_IMAGE130
. Here
Figure 800958DEST_PATH_IMAGE065
is the Dirac delta function.

S223. According to the moment-matching rule of the split vector factor graph of the Cox regression model, update
Figure 890137DEST_PATH_IMAGE063
as follows:

Figure 760616DEST_PATH_IMAGE131

At the
Figure 251640DEST_PATH_IMAGE067
node, the message of
Figure 434360DEST_PATH_IMAGE132
is multiplied by
Figure 745256DEST_PATH_IMAGE069
and the product is projected onto an independent, equal-variance multidimensional Gaussian distribution; the projected result is divided by the message of
Figure 422356DEST_PATH_IMAGE133
to obtain the message of
Figure 767887DEST_PATH_IMAGE063
. The mean obtained by the projection operation,
Figure 121508DEST_PATH_IMAGE071
, is the Cox regression coefficient that is output.

S224. According to the moment-matching rule of the split vector factor graph of the Cox regression model, update
Figure 919699DEST_PATH_IMAGE072
as follows:

Figure 400490DEST_PATH_IMAGE134

At the
Figure 600527DEST_PATH_IMAGE074
node, the message of
Figure 328312DEST_PATH_IMAGE075
is multiplied by
Figure 410538DEST_PATH_IMAGE074
and the variable
Figure 695019DEST_PATH_IMAGE076
is integrated out; the result is projected onto an independent, equal-variance multidimensional Gaussian distribution, and the projected result is then divided by the message of
Figure 749563DEST_PATH_IMAGE135
to obtain the message of
Figure 648249DEST_PATH_IMAGE072
.

Because
Figure 421033DEST_PATH_IMAGE136
has an extremely complicated form, the cumulant generating function and Laplace's method are used in place of
Figure 492894DEST_PATH_IMAGE136
to perform the projection operation.
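The idea behind Laplace's method can be shown on a one-dimensional toy density. This is a generic sketch, not the patent's derivation (which is image-only and also involves the cumulant generating function): the density is approximated by a Gaussian centred at its mode, with variance equal to the negative inverse of the second derivative of the log-density at the mode. The function name and the callbacks `grad` and `hess` are hypothetical.

```python
def laplace_approx(grad, hess, x0, n_iter=50, tol=1e-10):
    """Gaussian (mean, variance) approximation of a 1-D density by Laplace's method.

    grad, hess: first and second derivative of the log-density.
    The mode is located by Newton's method starting from x0.
    """
    x = float(x0)
    for _ in range(n_iter):
        step = grad(x) / hess(x)
        x -= step                  # Newton update towards the mode
        if abs(step) < tol:
            break
    return x, -1.0 / hess(x)       # variance from the curvature at the mode

# Toy example: log f(x) = a*log(x) - b*x, whose mode is a/b.
a, b = 4.0, 2.0
mode, var = laplace_approx(lambda x: a / x - b, lambda x: -a / x**2, x0=1.0)
```

The resulting mean and variance are exactly the two moments a Gaussian projection needs, which is why the method can stand in for an intractable integral over the likelihood factor.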

Further, in step S223 the projection operation is:

Figure 979825DEST_PATH_IMAGE079

where
Figure 111729DEST_PATH_IMAGE080
denotes the approximate posterior probability of the regression coefficients, and the mean obtained by the projection,
Figure 902968DEST_PATH_IMAGE137
, is the Cox regression coefficient output by the model.
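The projection onto an independent, equal-variance Gaussian used in steps S221 to S224 can be sketched as follows, assuming the element-wise mean and variance of the distribution being projected are already available. Minimizing the Kullback-Leibler divergence to such a Gaussian keeps the mean vector and replaces the individual variances by their average. The function name is hypothetical.

```python
import numpy as np

def project_homoscedastic(mean_vec, var_vec):
    """Moment-match to N(mean, v*I): keep the means, average the variances."""
    mean_vec = np.asarray(mean_vec, dtype=float)
    shared_var = float(np.mean(var_vec))
    return mean_vec, shared_var * np.ones_like(mean_vec)
```

Storing a single shared variance per message keeps the cost of each update low, at the price of discarding per-coordinate uncertainty.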

Further, step S23 is specifically: the regression coefficient
Figure 981782DEST_PATH_IMAGE138
and the approximate posterior probability
Figure 496071DEST_PATH_IMAGE139
output by step S22 are passed to the expectation-maximization (EM) algorithm, which automatically updates the prior parameters
Figure 798877DEST_PATH_IMAGE084
. The update expressions are:

Figure 280674DEST_PATH_IMAGE085

Figure 694337DEST_PATH_IMAGE086

Figure 328712DEST_PATH_IMAGE087

where
Figure 802419DEST_PATH_IMAGE140
and
Figure 37091DEST_PATH_IMAGE141
are both functions of
Figure 988867DEST_PATH_IMAGE142
, given by:

Figure 743327DEST_PATH_IMAGE143

where
Figure 387935DEST_PATH_IMAGE144
denotes element-wise vector division and
Figure 844324DEST_PATH_IMAGE093
denotes element-wise vector multiplication.

Because the prior parameters are learned automatically and updated at every iteration of the overall algorithm, no manual tuning is needed, which also avoids the uncertainty introduced by cross-validation.

Further, the preset iteration end condition in step S24 is:

Figure 599791DEST_PATH_IMAGE145

Whether to end the iteration is decided by checking whether the Crit value has started to rise. If it has, the iteration stops and the regression coefficient of the final iteration,
Figure 205828DEST_PATH_IMAGE146
, is output; otherwise the iteration continues. Here
Figure 755758DEST_PATH_IMAGE096
denotes the 1-norm.

In a specific embodiment, the regression performance on simulated data in a single experiment is shown in Fig. 6, where the black line is the true value and the asterisks are the estimates.

The simulated data are generated as follows:

Figure 699443DEST_PATH_IMAGE147
is generated by independent standard normal sampling.

Figure 258600DEST_PATH_IMAGE148
is sampled independently from the binomial distribution B(1, 0.8),
Figure 722074DEST_PATH_IMAGE149
, so that the censoring rate is 0.2.

Figure 911747DEST_PATH_IMAGE150
is generated by sampling from a Laplace-Bernoulli distribution with a sparsity rate of 0.2.

When
Figure 873887DEST_PATH_IMAGE151
and sample i is not censored:

Figure 971156DEST_PATH_IMAGE152

where
Figure 554715DEST_PATH_IMAGE153
is sampled independently from U(0,1). When
Figure 915289DEST_PATH_IMAGE154
and sample i is censored:

Figure 567987DEST_PATH_IMAGE155
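
A minimal sketch of such a simulation follows. The sampling of the covariates, the B(1, 0.8) indicator and the Laplace-Bernoulli coefficients follows the description; the survival-time formulas exist only as figures, so the exponential baseline hazard, the rule for censored times, and the reading of "sparsity rate" as the probability of a nonzero coefficient are all assumptions made for illustration. The function name and defaults are hypothetical.

```python
import numpy as np

def simulate_survival(n=200, p=50, censor_rate=0.2, sparsity=0.2, seed=0):
    rng = np.random.default_rng(seed)
    # Covariates: independent standard normal entries.
    X = rng.standard_normal((n, p))
    # Indicator drawn from B(1, 0.8): 1 = event observed, 0 = censored.
    c = rng.binomial(1, 1.0 - censor_rate, size=n)
    # Laplace-Bernoulli coefficients; ASSUMPTION: nonzero with probability `sparsity`.
    support = rng.random(p) < sparsity
    beta = np.where(support, rng.laplace(0.0, 1.0, size=p), 0.0)
    # ASSUMPTION: exponential baseline hazard with inverse-transform sampling,
    # t = -log(1 - u) / exp(x^T beta), u ~ U(0, 1).
    u = rng.uniform(size=n)
    t_event = -np.log1p(-u) / np.exp(X @ beta)
    # ASSUMPTION: a censored sample is observed at a uniform fraction of its event time.
    y = np.where(c == 1, t_event, rng.uniform(size=n) * t_event)
    return X, y, c, beta
```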

Example 2

Based on Example 1 above and with reference to FIG. 3, this example describes in detail the specific process of solving the Cox model in the present invention.

In a specific embodiment, as shown in FIG. 3, the known data are
Figure 468947DEST_PATH_IMAGE156
,
Figure 156280DEST_PATH_IMAGE157
,
Figure 235226DEST_PATH_IMAGE158
, and the coefficients to be regressed are
Figure 375220DEST_PATH_IMAGE159
.

Step 1:

S1.1: Initialize X

Figure 79871DEST_PATH_IMAGE015

where mean(X) is the mean of all elements of the matrix X, and var(X) is the variance of all elements of X.

S1.2: Merge the existing survival data (covariate matrix X, survival time y, censoring indicator c) into one matrix [X, y, c] and sort its rows in descending order of y;
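
The normalization and sorting just described can be sketched as follows. The normalization formula itself is only available as a figure; the sketch assumes the usual form (X - mean(X)) / sqrt(var(X)) with statistics taken over all matrix elements, and the function name is hypothetical.

```python
import numpy as np

def prepare_survival_data(X, y, c):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    c = np.asarray(c)
    # ASSUMED form of the normalization: centre and scale with the mean and
    # variance of ALL elements of X (not per column).
    X = (X - X.mean()) / np.sqrt(X.var())
    # Sort the rows of [X, y, c] by survival time, longest first.
    order = np.argsort(-y, kind="stable")
    return X[order], y[order], c[order]
```

Sorting in descending order means that, for the sample in row i, every sample still at risk at its survival time sits in rows 0 to i, which makes the risk-set sums of the partial likelihood simple cumulative sums.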

S1.3: Substitute the sorted [X, y, c] into the Cox partial likelihood function:

Figure 621711DEST_PATH_IMAGE016

Figure 874488DEST_PATH_IMAGE017
denotes that the function is the transition probability from
Figure 501778DEST_PATH_IMAGE107
to
Figure 744540DEST_PATH_IMAGE108
, which implies that
Figure 140887DEST_PATH_IMAGE017
is normalized with respect to
Figure 561635DEST_PATH_IMAGE108
(a property of probability density functions), whereas
Figure 879484DEST_PATH_IMAGE020
is the Cox partial likelihood function, which is not normalized, hence the proportionality. The function takes
Figure 925937DEST_PATH_IMAGE107
as its variable, whose i-th element is
Figure 176790DEST_PATH_IMAGE021
;
Figure 17707DEST_PATH_IMAGE022
is the i-th element of
Figure 370322DEST_PATH_IMAGE023
.
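
For orientation, the standard Cox log partial likelihood (no tie handling) can be evaluated on data sorted in descending order of survival time as below. The exact expression used in the patent is in the figure, so this textbook form, the function name, and the convention that c = 1 marks an observed event are assumptions.

```python
import numpy as np

def cox_log_partial_likelihood(z, c):
    """Standard Cox log partial likelihood.

    z : linear predictors X @ beta, rows sorted by DESCENDING survival time
    c : 1 if the event was observed, 0 if censored (assumed convention)
    """
    z = np.asarray(z, dtype=float)
    c = np.asarray(c, dtype=float)
    # After the descending sort, the risk set of sample i is samples 0..i,
    # so log(sum_{j<=i} exp(z_j)) is a running log-sum-exp.
    log_risk = np.logaddexp.accumulate(z)
    return float(np.sum(c * (z - log_risk)))
```

With three uncensored samples and all linear predictors equal to zero, the value is -(log 1 + log 2 + log 3) = -log 6.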

S1.4: Assume that the prior follows a Gauss-Bernoulli distribution:

Figure 954887DEST_PATH_IMAGE160

The function takes
Figure 591405DEST_PATH_IMAGE161
as its variable. Initialize the prior parameters
Figure 337644DEST_PATH_IMAGE030
,
Figure 443134DEST_PATH_IMAGE031
,
Figure 565811DEST_PATH_IMAGE162
.
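
A Gauss-Bernoulli (spike-and-slab) prior of this kind can be illustrated by sampling from it. The sketch assumes the common parameterization in which each coefficient is exactly zero with probability 1 - rho and Gaussian with mean mu and variance sigma2 with probability rho; the names `rho`, `mu`, `sigma2` and the function name are hypothetical labels for the three prior parameters.

```python
import numpy as np

def sample_gauss_bernoulli(p, rho, mu, sigma2, rng):
    # ASSUMED parameterization: P(beta_j != 0) = rho.
    active = rng.random(p) < rho
    # Gaussian ("slab") values, used only where the coefficient is active.
    values = rng.normal(mu, np.sqrt(sigma2), size=p)
    return np.where(active, values, 0.0)
```

A small rho gives a sparse coefficient vector, which matches the aim of selecting only a few prognostic genes.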

S1.5: Initialize the forward message:

Figure 791256DEST_PATH_IMAGE033

where the initial values are
Figure 239555DEST_PATH_IMAGE163
,
Figure 753713DEST_PATH_IMAGE164
,
Figure 427883DEST_PATH_IMAGE165
;
Figure 773414DEST_PATH_IMAGE034
is the n-dimensional column vector whose elements are all 0;
Figure 127035DEST_PATH_IMAGE035
is the n-dimensional column vector whose elements are all 1, the subscript indicating the dimension of the vector.

Step 2: Pass messages on the factor graph according to the moment-matching rule, i.e. the expectation propagation (EP) algorithm

S2.1: Update
Figure 925227DEST_PATH_IMAGE166
: at node
Figure 406018DEST_PATH_IMAGE167
, multiply the message
Figure 871634DEST_PATH_IMAGE166
by
Figure 396156DEST_PATH_IMAGE122
, project the product onto a multidimensional Gaussian distribution with independent, equal variances, and then divide out the message
Figure 681644DEST_PATH_IMAGE168
:

Figure 700547DEST_PATH_IMAGE121

where
Figure 755091DEST_PATH_IMAGE169
is the projection operation: for
Figure 716093DEST_PATH_IMAGE052
with respect to
Figure 488877DEST_PATH_IMAGE053
, it computes the mean vector
Figure 498422DEST_PATH_IMAGE126
and the variance vector
Figure 423783DEST_PATH_IMAGE055
(the diagonal of the covariance matrix). Because the target is a multidimensional Gaussian with independent, equal variances, all elements of
Figure 290108DEST_PATH_IMAGE055
are equal and the off-diagonal elements of the covariance matrix are 0. The operation outputs
Figure 550188DEST_PATH_IMAGE056
.

Simplifying
Figure 160161DEST_PATH_IMAGE170
with the Laplace method and the moment generating function finally gives:

Figure 931240DEST_PATH_IMAGE171

where
Figure 234046DEST_PATH_IMAGE172
is the variance of
Figure 715843DEST_PATH_IMAGE173
,
Figure 395086DEST_PATH_IMAGE174
, and
Figure 29461DEST_PATH_IMAGE175
is the Hessian matrix of
Figure 503167DEST_PATH_IMAGE176
(the second-order gradient of
Figure 472260DEST_PATH_IMAGE176
with respect to
Figure 424036DEST_PATH_IMAGE177
).

Figure 178496DEST_PATH_IMAGE178
has the following meaning: when
Figure 823104DEST_PATH_IMAGE179
is a matrix, it extracts the diagonal; when
Figure 279493DEST_PATH_IMAGE179
is a vector, it expands the vector into a diagonal matrix.

Figure 34960DEST_PATH_IMAGE180
takes the mean of a vector,
Figure 830877DEST_PATH_IMAGE181
is element-wise vector division, and
Figure 397119DEST_PATH_IMAGE182
is element-wise vector multiplication.
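
These operators map directly onto array operations. In the sketch below the name `d` is a hypothetical label for the diagonal operator; in NumPy, `np.diag` already behaves in both ways described.

```python
import numpy as np

def d(a):
    """Matrix -> its diagonal (as a vector); vector -> diagonal matrix."""
    a = np.asarray(a)
    if a.ndim not in (1, 2):
        raise ValueError("d() expects a vector or a matrix")
    return np.diag(a)

u = np.array([2.0, 4.0, 6.0])
w = np.array([1.0, 2.0, 3.0])
vector_mean = u.mean()   # mean of a vector
quotient = u / w         # element-wise division
product = u * w          # element-wise multiplication
```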

where
Figure 340804DEST_PATH_IMAGE183
is solved by forming a quadratic approximation of
Figure 899962DEST_PATH_IMAGE184
and then applying the coordinate ascent algorithm:

First, take the Taylor expansion of
Figure 612703DEST_PATH_IMAGE185
:

Figure 81337DEST_PATH_IMAGE186

where
Figure 512318DEST_PATH_IMAGE187
is the gradient of
Figure 875166DEST_PATH_IMAGE188
at
Figure 707993DEST_PATH_IMAGE189
, and
Figure 350458DEST_PATH_IMAGE190
is the Hessian matrix of
Figure 471998DEST_PATH_IMAGE191
at
Figure 169696DEST_PATH_IMAGE189
. Rewriting gives:

Figure 794712DEST_PATH_IMAGE192

where
Figure 670395DEST_PATH_IMAGE193
. Finally,
Figure 810390DEST_PATH_IMAGE194
simplifies to:

Figure 515040DEST_PATH_IMAGE195

where
Figure 56880DEST_PATH_IMAGE196
is the i-th element of
Figure 759257DEST_PATH_IMAGE197
. The coordinate ascent algorithm is then applied:

S2.1.1: Initialize
Figure 137280DEST_PATH_IMAGE198
;

S2.1.2: Update the gradient
Figure 446404DEST_PATH_IMAGE201
of
Figure 380042DEST_PATH_IMAGE199
at
Figure 776389DEST_PATH_IMAGE200
; for the k-th element
Figure 95527DEST_PATH_IMAGE202
of
Figure 314653DEST_PATH_IMAGE201
:

Figure 877538DEST_PATH_IMAGE203

S2.1.3: Update the Hessian matrix
Figure 655636DEST_PATH_IMAGE190
of
Figure 718455DEST_PATH_IMAGE204
at
Figure 71070DEST_PATH_IMAGE189
; for the element
Figure 38392DEST_PATH_IMAGE205
in row k, column k of
Figure 26574DEST_PATH_IMAGE190
(to speed up the computation, only the diagonal elements are kept to approximate the whole matrix):

Figure 612724DEST_PATH_IMAGE206

S2.1.4: Update
Figure 204243DEST_PATH_IMAGE207
:

Figure 226425DEST_PATH_IMAGE208

S2.1.5: Update
Figure 409145DEST_PATH_IMAGE209
:

Figure 736352DEST_PATH_IMAGE210

S2.1.6: Compute the change in
Figure 662720DEST_PATH_IMAGE209
; if the change is sufficiently small, output
Figure 945934DEST_PATH_IMAGE209
;

Figure 565134DEST_PATH_IMAGE211

If the change is still large, return to S2.1.2 and continue iterating.
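
The loop in S2.1.1 to S2.1.6 can be illustrated with a diagonal-Hessian Newton iteration. The update formulas of the patent are only available as figures, so the sketch below makes several assumptions: the objective is the standard Cox log partial likelihood plus a Gaussian message term -(z - m)^2 / (2v); the iterate starts at the message mean m; all coordinates are updated at once rather than one at a time; and a step-halving safeguard is added. All names are hypothetical.

```python
import numpy as np

def objective(z, c, m, v):
    # Cox log partial likelihood (rows sorted by descending survival time)
    # plus the log of an assumed Gaussian message N(z; m, v I).
    return np.sum(c * (z - np.logaddexp.accumulate(z))) - np.sum((z - m) ** 2) / (2.0 * v)

def diag_newton_mode(c, m, v, tol=1e-8, max_iter=200):
    c = np.asarray(c, dtype=float)
    m = np.asarray(m, dtype=float)
    z = m.copy()                                   # initialise (assumed start point)
    for _ in range(max_iter):
        S = np.exp(np.logaddexp.accumulate(z))     # risk-set sums S_i = sum_{j<=i} exp(z_j)
        a = np.cumsum((c / S)[::-1])[::-1]         # sum_{i>=k} c_i / S_i
        b = np.cumsum((c / S ** 2)[::-1])[::-1]    # sum_{i>=k} c_i / S_i^2
        e = np.exp(z)
        grad = c - e * a - (z - m) / v             # gradient of the objective
        hess = -e * a + e ** 2 * b - 1.0 / v       # diagonal of the Hessian (always negative)
        step = -grad / hess                        # Newton step using the diagonal only
        f_old = objective(z, c, m, v)
        # Safeguard not taken from the source: halve the step while it lowers the objective.
        while objective(z + step, c, m, v) < f_old and np.max(np.abs(step)) > tol:
            step *= 0.5
        z_new = z + step
        if np.max(np.abs(z_new - z)) < tol:        # stop when the change is small
            return z_new
        z = z_new
    return z
```

Keeping only the diagonal of the Hessian makes each sweep cost O(n) instead of requiring an n-by-n solve, which is the speed-up the description refers to.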

Finally, compute the division part and output
Figure 363326DEST_PATH_IMAGE212
:

Figure 841187DEST_PATH_IMAGE213

Figure 775645DEST_PATH_IMAGE214
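
Dividing one Gaussian by another, as in this "division part", amounts to subtracting natural parameters. The sketch below shows the generic rule for Gaussians with a shared scalar variance; it is the standard expectation propagation identity, not a transcription of the figures, and the function name is hypothetical.

```python
import numpy as np

def gaussian_divide(m_post, v_post, m_in, v_in):
    """Mean and variance of N(m_post, v_post I) / N(m_in, v_in I).

    Requires v_post < v_in so that the resulting variance is positive.
    """
    m_post = np.asarray(m_post, dtype=float)
    m_in = np.asarray(m_in, dtype=float)
    v_ext = 1.0 / (1.0 / v_post - 1.0 / v_in)          # precisions subtract
    m_ext = v_ext * (m_post / v_post - m_in / v_in)    # precision-weighted means subtract
    return m_ext, v_ext
```

For example, the product of N(0, 1) and N(2, 1) is proportional to N(1, 0.5), so dividing N(1, 0.5) by N(0, 1) recovers N(2, 1).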

S2.2: Update
Figure 565746DEST_PATH_IMAGE215
: at node
Figure 116813DEST_PATH_IMAGE216
, multiply
Figure 135716DEST_PATH_IMAGE217
by
Figure 190260DEST_PATH_IMAGE216
, integrate out the variable
Figure 151263DEST_PATH_IMAGE218
, project the result onto a multidimensional Gaussian distribution with independent, equal variances, and then divide out the message
Figure 658467DEST_PATH_IMAGE219
:

Figure 481061DEST_PATH_IMAGE127

where
Figure 858953DEST_PATH_IMAGE220
is computed as:

Figure 725278DEST_PATH_IMAGE221

Figure 985358DEST_PATH_IMAGE222

Figure 595330DEST_PATH_IMAGE223

where
Figure 375199DEST_PATH_IMAGE224
is the n-dimensional column vector whose elements are all 1, the subscript indicating the dimension of the vector;
Figure 412425DEST_PATH_IMAGE225
means: when
Figure 425380DEST_PATH_IMAGE226
is a matrix, extract its diagonal; when
Figure 839044DEST_PATH_IMAGE227
is a vector, expand it into a diagonal matrix;
Figure 488067DEST_PATH_IMAGE228
takes the mean of a vector;
Figure 961774DEST_PATH_IMAGE229
computes, for
Figure 930867DEST_PATH_IMAGE230
with respect to
Figure 882643DEST_PATH_IMAGE227
, the mean vector
Figure 89633DEST_PATH_IMAGE231
and the variance vector
Figure 219394DEST_PATH_IMAGE232
, and outputs
Figure 675783DEST_PATH_IMAGE233
;
Figure 431250DEST_PATH_IMAGE234
denotes matrix inversion, and
Figure 289484DEST_PATH_IMAGE235
denotes matrix transpose.
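
The ingredients listed here (matrix inverse, transpose, diagonal extraction and vector mean) are those of a linear-Gaussian posterior. As a hedged illustration, the sketch below computes the posterior of the coefficients for a deterministic linear relation z = X beta with Gaussian messages on both sides. It is the textbook result for that setting and is assumed, not confirmed, to correspond to the expressions in the figures; the function and argument names are hypothetical.

```python
import numpy as np

def linear_node_posterior(X, m_z, v_z, m_b, v_b):
    """Posterior of beta when z = X @ beta, z ~ N(m_z, v_z I), beta ~ N(m_b, v_b I)."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    # Posterior covariance: inverse of the summed precisions.
    cov = np.linalg.inv(X.T @ X / v_z + np.eye(p) / v_b)
    # Posterior mean: covariance times the precision-weighted evidence.
    mean = cov @ (X.T @ np.asarray(m_z, dtype=float) / v_z
                  + np.asarray(m_b, dtype=float) / v_b)
    # Projection onto equal variances: average the diagonal of the covariance.
    var = float(np.mean(np.diag(cov)))
    return mean, var
```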

Finally, compute the division part and output
Figure 855726DEST_PATH_IMAGE236
:

Figure 64990DEST_PATH_IMAGE237

Figure 358568DEST_PATH_IMAGE238

S2.3: Update
Figure 71310DEST_PATH_IMAGE239
: at node
Figure 808452DEST_PATH_IMAGE240
, multiply the message
Figure 239434DEST_PATH_IMAGE241
by
Figure 336703DEST_PATH_IMAGE240
, project the product onto a multidimensional Gaussian distribution with independent, equal variances, and then divide out the message
Figure 107213DEST_PATH_IMAGE242
:

Figure 264525DEST_PATH_IMAGE243

where
Figure 930605DEST_PATH_IMAGE244
is computed as:

Figure 565985DEST_PATH_IMAGE245

Figure 253319DEST_PATH_IMAGE246

where
Figure 581532DEST_PATH_IMAGE247
and
Figure 737838DEST_PATH_IMAGE248
are both functions of
Figure 442489DEST_PATH_IMAGE249
, with the following expressions:

Figure 984328DEST_PATH_IMAGE250

Finally, compute the division part and output
Figure 483443DEST_PATH_IMAGE251
:

Figure 861466DEST_PATH_IMAGE252

Figure 838649DEST_PATH_IMAGE253

The approximate posterior of the regression coefficients is:

Figure 500575DEST_PATH_IMAGE254

The mean
Figure 170590DEST_PATH_IMAGE255
obtained by the projection operation is exactly the Cox regression coefficient vector to be output.
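
For a Gauss-Bernoulli prior combined with a Gaussian message, the posterior mean and variance have a closed form. The sketch below gives the standard spike-and-slab result under assumed notation: the incoming message on each coefficient is N(r, v), and the prior is zero with probability 1 - rho and N(mu, sigma2) with probability rho. It is offered as an illustration of this kind of computation, not as a transcription of the figures.

```python
import numpy as np

def gauss_bernoulli_posterior(r, v, rho, mu, sigma2):
    r = np.asarray(r, dtype=float)
    # Posterior of the Gaussian ("slab") component: product of two Gaussians.
    s = 1.0 / (1.0 / v + 1.0 / sigma2)
    m = s * (r / v + mu / sigma2)
    # Log evidence of the zero ("spike") and Gaussian ("slab") components;
    # the shared -0.5*log(2*pi) term cancels in the ratio.
    log_spike = np.log1p(-rho) - 0.5 * np.log(v) - r ** 2 / (2.0 * v)
    log_slab = np.log(rho) - 0.5 * np.log(v + sigma2) - (r - mu) ** 2 / (2.0 * (v + sigma2))
    # Posterior probability that the coefficient is nonzero.
    pi = 1.0 / (1.0 + np.exp(log_spike - log_slab))
    mean = pi * m
    var = pi * (m ** 2 + s) - mean ** 2
    return mean, var, pi, m, s
```

With r = 0, v = 1, rho = 0.5, mu = 0 and sigma2 = 1, the nonzero probability is 1 / (1 + sqrt(2)), the posterior mean is 0, and the posterior variance is half that probability.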

S2.4: Update
Figure 35909DEST_PATH_IMAGE256
: at node
Figure 20046DEST_PATH_IMAGE257
, multiply
Figure 536478DEST_PATH_IMAGE258
by
Figure 174132DEST_PATH_IMAGE257
, integrate out the variable
Figure 529677DEST_PATH_IMAGE259
, project the result onto a multidimensional Gaussian distribution with independent, equal variances, and then divide out the message
Figure 114242DEST_PATH_IMAGE260
:

Figure 688443DEST_PATH_IMAGE261

where
Figure 496999DEST_PATH_IMAGE262
is computed as:

Figure 71331DEST_PATH_IMAGE263

Figure 459587DEST_PATH_IMAGE264

Figure 685032DEST_PATH_IMAGE265

Finally, compute the division part and output
Figure 867752DEST_PATH_IMAGE266
:

Figure 194959DEST_PATH_IMAGE267

Figure 121327DEST_PATH_IMAGE268

Step 3: Using the approximate posterior probability
Figure 201278DEST_PATH_IMAGE269
output by S2.3, together with the expectation-maximization (EM) algorithm, automatically update the prior parameters
Figure 820478DEST_PATH_IMAGE270
.

S3.1: Update
Figure 369402DEST_PATH_IMAGE271
:

Figure 833882DEST_PATH_IMAGE272

S3.2: Update
Figure 237181DEST_PATH_IMAGE273
:

Figure 27283DEST_PATH_IMAGE274

S3.3: Update
Figure 47191DEST_PATH_IMAGE275
:

Figure 328744DEST_PATH_IMAGE276
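
The three updates can be illustrated with the usual EM rules for a Gauss-Bernoulli prior. The patent's own expressions are in the figures, so the sketch below is an assumption based on the standard result. Its inputs are hypothetical names for per-coefficient posterior quantities: `pi` is the posterior probability that a coefficient is nonzero, and `m` and `s` are the posterior mean and variance of its Gaussian component.

```python
import numpy as np

def em_update_prior(pi, m, s):
    pi = np.asarray(pi, dtype=float)
    m = np.asarray(m, dtype=float)
    total = np.sum(pi)
    # Sparsity ratio: average posterior probability of being nonzero.
    rho = float(np.mean(pi))
    # Mean of the Gaussian component: responsibility-weighted average.
    mu = float(np.sum(pi * m) / total)
    # Variance of the Gaussian component: weighted second central moment.
    sigma2 = float(np.sum(pi * ((m - mu) ** 2 + s)) / total)
    return rho, mu, sigma2
```

Because the updates use only quantities already produced during message passing, the prior is re-estimated in every outer iteration without cross-validation.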

Step 4: Determine whether the preset iteration end condition has been reached:

The end condition is:

Figure 383287DEST_PATH_IMAGE277

Check whether
Figure 78711DEST_PATH_IMAGE278
has started to rise. If it has, stop the iteration and output the final regression coefficients
Figure 851495DEST_PATH_IMAGE279
(from S2.3). Here
Figure 674088DEST_PATH_IMAGE280
denotes the 1-norm.

Example 3

Based on Examples 1 and 2 above and with reference to FIG. 7, this example describes in detail the second aspect of the present invention, a cancer gene prognosis screening system based on an improved Cox model.

In a specific embodiment, as shown in FIG. 7, the present invention further provides a cancer gene prognosis screening system based on an improved Cox model, comprising a memory and a processor. The memory stores a cancer gene prognosis screening program based on the improved Cox model which, when executed by the processor, implements the following steps:

S1. Collect the expression levels of different genes in the cancer cells of cancer patients and collect the patients' survival data; organize the gene expression levels and patient information into a first matrix
Figure 583139DEST_PATH_IMAGE281
, and preprocess the first matrix
Figure 715043DEST_PATH_IMAGE281
to obtain a second matrix
Figure 709544DEST_PATH_IMAGE282
.

S2. Input the survival data obtained in step S1 and the second matrix X into the preset Cox regression model, and solve it to obtain the regression coefficients.

S3. Using the patient risk function, evaluate the patient risk associated with the gene corresponding to each regression coefficient, and screen out the prognostic gene set associated with high patient risk.

S4. Use the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.
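
The screening in S3 can be sketched as follows, assuming the standard Cox form in which a patient's hazard is a shared baseline multiplied by exp(x^T beta). Treating genes with the largest positive coefficients as those associated with high patient risk, the `top_k` cutoff, and the function names are illustrative assumptions, not the patent's stated rule.

```python
import numpy as np

def relative_risk(X, beta):
    # Hazard relative to the shared baseline: exp(x_i^T beta) for each patient.
    return np.exp(np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float))

def screen_prognostic_genes(beta, gene_names, top_k=10):
    """Return up to top_k (gene, coefficient) pairs with the largest positive coefficients."""
    beta = np.asarray(beta, dtype=float)
    order = np.argsort(-beta, kind="stable")
    return [(gene_names[j], float(beta[j])) for j in order[:top_k] if beta[j] > 0]
```

A coefficient of zero leaves the hazard unchanged, so a sparse coefficient vector directly yields a short list of candidate genes.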

The icons describing structural and positional relationships in the drawings are for illustration only and shall not be construed as limiting this patent.

Obviously, the above embodiments of the present invention are merely examples given to illustrate the invention clearly, and do not limit its implementation. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all implementations exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A cancer gene prognosis screening method based on an improved Cox model, characterized by comprising the following steps:
S1, collecting the expression levels of different genes in the cancer cells of cancer patients, collecting survival data of the patients, organizing the expression levels of the different genes and the patient information into a first matrix, and preprocessing the first matrix to obtain a second matrix;
S2, inputting the survival data obtained in step S1 and the second matrix into a preset Cox regression model, and solving it to obtain regression coefficients;
S3, evaluating, according to the patient risk function, the patient risk associated with the gene corresponding to each regression coefficient, and screening out the prognostic gene set associated with high patient risk;
S4, using the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.
2. The method of claim 1, wherein in step S2 the survival data and the second matrix are combined into a third matrix, and the third matrix is input into the preset Cox regression model; the third matrix is denoted [X, y, c], where X is the covariate matrix, i.e. the second matrix, y is the survival time, and c is the censoring indicator; the survival data of the i-th patient are
Figure 39106DEST_PATH_IMAGE001
3. The method of claim 2, wherein the risk function of the i-th patient is specifically:
Figure 68242DEST_PATH_IMAGE002
wherein
Figure 652938DEST_PATH_IMAGE003
is the shared baseline risk function;
Figure 400314DEST_PATH_IMAGE004
is the regression coefficient vector obtained by solving the Cox regression model; and
Figure 813978DEST_PATH_IMAGE005
denotes the gene expression levels of the i-th patient.
4. The method of claim 3, wherein solving the Cox regression model in step S2 to obtain the regression coefficients comprises the following steps:
S21, combining the existing survival data into the third matrix, sorting it by survival time, constructing the Cox regression model from the sorted data, and initializing the prior parameters and the message passing parameters;
S22, according to the determinant vector factor graph of the Cox regression model, using the expectation propagation algorithm to project high-dimensional messages onto independent Gaussian distributions through the moment-matching rule, iterating to solve the model, and outputting the regression coefficients and the approximate posterior probability;
S23, inputting the regression coefficients and the approximate posterior probability into the expectation-maximization algorithm, and updating the prior parameters;
S24, judging whether the regression coefficients satisfy the preset iteration end condition; if so, outputting the regression coefficients obtained in the current iteration; if not, returning to step S22 for the next iteration.
5. The method of claim 4, wherein the prior parameters comprise: the mean
Figure 697620DEST_PATH_IMAGE006
, the variance
Figure 656480DEST_PATH_IMAGE007
, and the sparsity ratio
Figure 359994DEST_PATH_IMAGE008
; the message passing parameters comprise: the mean and variance of the forward message; step S21 is specifically: normalizing the covariate matrix X, sorting the third matrix [X, y, c] in descending order of survival time y, substituting the sorted third matrix [X, y, c] into the Cox partial likelihood function, and initializing the prior parameters and the message passing function.
6. The method of claim 4, wherein the normalization process of the X matrix of the covariate matrix is as follows:
Figure 577348DEST_PATH_IMAGE009
wherein mean (m)X) Is composed ofXMean of the whole elements of the matrix, var: (X) Is composed ofXThe variance of the whole elements of the matrix;
the Cox partial likelihood function is specifically:
Figure 315497DEST_PATH_IMAGE010
wherein,
Figure 507575DEST_PATH_IMAGE011
expressing the function as
Figure 963965DEST_PATH_IMAGE012
Is transferred to
Figure 719431DEST_PATH_IMAGE013
For representing transition probabilities of
Figure 65748DEST_PATH_IMAGE011
About
Figure 881258DEST_PATH_IMAGE013
Is normalized;
Figure 356101DEST_PATH_IMAGE014
the partial likelihood function of Cox is not normalized and represents a direct proportion relation; the function is as follows
Figure 665991DEST_PATH_IMAGE012
Is a variable, the firstiAn element
Figure 378732DEST_PATH_IMAGE015
Figure 365143DEST_PATH_IMAGE016
Is composed of
Figure 530545DEST_PATH_IMAGE017
To (1) aiAn element;
the initialization of the prior parameters is specifically: the regression coefficients follow a Gauss-Bernoulli distribution, with the mathematical expression:
Figure 644125DEST_PATH_IMAGE018
wherein
Figure 945794DEST_PATH_IMAGE019
denotes the Dirac delta function;
Figure 103106DEST_PATH_IMAGE020
denotes a Gaussian distribution with mean
Figure 755804DEST_PATH_IMAGE021
and variance
Figure 407496DEST_PATH_IMAGE022
; the function takes
Figure 94829DEST_PATH_IMAGE023
as its variable; the prior parameters are initialized as
Figure 219780DEST_PATH_IMAGE024
,
Figure 107577DEST_PATH_IMAGE025
,
Figure 281070DEST_PATH_IMAGE026
;
the initialization of the message passing function is specifically: initializing the message passing function of the forward message, with the mathematical expression:
Figure 822910DEST_PATH_IMAGE027
wherein
Figure 322024DEST_PATH_IMAGE028
is the n-dimensional column vector whose elements are all 0;
Figure 700047DEST_PATH_IMAGE029
is the n-dimensional column vector whose elements are all 1;
Figure 942809DEST_PATH_IMAGE030
is a random variable following a multidimensional Gaussian distribution with independent, equal variances;
Figure 604735DEST_PATH_IMAGE029
is the n-dimensional column vector whose elements are all 1; and the initial values are
Figure 822221DEST_PATH_IMAGE031
,
Figure 671228DEST_PATH_IMAGE032
,
Figure 717681DEST_PATH_IMAGE033
.
7. The method of claim 6, wherein step S22 passes messages on the determinant vector factor graph of the Cox regression model according to the moment-matching rule, and comprises the following steps:
S221, updating
Figure 234113DEST_PATH_IMAGE034
according to the moment-matching rule of the determinant vector factor graph of the Cox regression model, specifically:
Figure 560184DEST_PATH_IMAGE035
at node
Figure 896487DEST_PATH_IMAGE036
, multiplying the message
Figure 481052DEST_PATH_IMAGE037
by
Figure 851991DEST_PATH_IMAGE038
, projecting the product onto a multidimensional Gaussian distribution with independent, equal variances, and dividing the projection result by the message
Figure 605752DEST_PATH_IMAGE037
to obtain the message
Figure 694931DEST_PATH_IMAGE034
;
S222, updating
Figure 817608DEST_PATH_IMAGE039
according to the moment-matching rule of the determinant vector factor graph of the Cox regression model, specifically:
Figure 308632DEST_PATH_IMAGE040
at node
Figure 242084DEST_PATH_IMAGE041
, multiplying the message
Figure 552980DEST_PATH_IMAGE042
by
Figure 276085DEST_PATH_IMAGE043
, integrating out the variable
Figure 824878DEST_PATH_IMAGE044
, projecting the result onto a multidimensional Gaussian distribution with independent, equal variances, and dividing the projection result by the message
Figure 929232DEST_PATH_IMAGE045
to obtain the message
Figure 727423DEST_PATH_IMAGE039
; wherein
Figure 254220DEST_PATH_IMAGE046
is the Dirac delta function;
S223, updating
Figure 204989DEST_PATH_IMAGE045
according to the moment-matching rule of the determinant vector factor graph of the Cox regression model, specifically:
Figure 729511DEST_PATH_IMAGE047
at node
Figure 218262DEST_PATH_IMAGE048
, multiplying the message
Figure 752011DEST_PATH_IMAGE049
by
Figure 806555DEST_PATH_IMAGE048
, projecting the product onto a multidimensional Gaussian distribution with independent, equal variances, and dividing the projection result by the message
Figure 249781DEST_PATH_IMAGE050
to obtain the message
Figure 22565DEST_PATH_IMAGE045
; wherein the mean
Figure 94426DEST_PATH_IMAGE051
obtained by the projection operation is the Cox regression coefficient vector output as the result;
S224, updating
Figure 3476DEST_PATH_IMAGE052
according to the moment-matching rule of the determinant vector factor graph of the Cox regression model, specifically:
Figure 151692DEST_PATH_IMAGE053
at node
Figure 146193DEST_PATH_IMAGE054
, multiplying the message
Figure 21745DEST_PATH_IMAGE055
by
Figure 785302DEST_PATH_IMAGE054
, integrating out the variable
Figure 838839DEST_PATH_IMAGE056
, projecting the result onto a multidimensional Gaussian distribution with independent, equal variances, and dividing the projection result by the message
Figure 320636DEST_PATH_IMAGE057
to obtain the message
Figure 734300DEST_PATH_IMAGE052
.
8. The cancer gene prognosis screening method based on an improved Cox model according to claim 4, wherein step S23 is specifically: using the regression coefficients
Figure 617943DEST_PATH_IMAGE058
and the approximate posterior probability
Figure 842382DEST_PATH_IMAGE059
output by step S22, together with the expectation-maximization algorithm, automatically updating the prior parameters
Figure 811475DEST_PATH_IMAGE060
; the update expressions are specifically:
Figure 28829DEST_PATH_IMAGE061
Figure 766978DEST_PATH_IMAGE062
Figure 899669DEST_PATH_IMAGE063
wherein
Figure 621637DEST_PATH_IMAGE064
and
Figure 377104DEST_PATH_IMAGE065
are both functions of
Figure 235338DEST_PATH_IMAGE066
, expressed as follows:
Figure 536001DEST_PATH_IMAGE067
wherein
Figure 479686DEST_PATH_IMAGE068
is element-wise vector division, and
Figure 38843DEST_PATH_IMAGE069
is element-wise vector multiplication.
9. The cancer gene prognosis screening method based on an improved Cox model according to any one of claims 4-8, wherein the iteration end condition preset in step S24 is specifically:
Figure 751584DEST_PATH_IMAGE070
whether to end the iteration is decided by checking whether the Crit value has started to rise; if it has, the iteration stops and the regression coefficients of the final iteration
Figure 488727DEST_PATH_IMAGE071
are output; if it has not, the iteration continues; wherein
Figure 654130DEST_PATH_IMAGE072
denotes the 1-norm.
10. A cancer gene prognosis screening system based on an improved Cox model, comprising a memory and a processor, wherein the memory stores a cancer gene prognosis screening program based on the improved Cox model which, when executed by the processor, implements the following steps:
S1, collecting the expression levels of different genes in the cancer cells of cancer patients, collecting the survival data of the patients, collating the expression levels of the different genes and the patient information into a first matrix, and preprocessing the first matrix to obtain a second matrix;
S2, inputting the survival data obtained in step S1 and the second matrix into a preset Cox regression model, and solving to obtain regression coefficients;
S3, evaluating, according to the patient risk function, the patient risk associated with the gene corresponding to each regression coefficient, and screening out the prognostic genome corresponding to high patient risk;
S4, using the screened prognostic genome, interpreted through biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.
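Step S3 can be sketched as follows, assuming the standard Cox proportional hazards form, in which a patient's relative hazard is exp(β·x) and the baseline hazard cancels when patients are compared. The exact risk function used in the patent is not reproduced in this section, so the function names, the gene labels and the `threshold` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def risk_score(expression: Sequence[float], beta: Sequence[float]) -> float:
    """Relative hazard exp(beta . x) of one patient under a Cox model."""
    return math.exp(sum(b * x for b, x in zip(beta, expression)))


def screen_genes(
    gene_names: Sequence[str],
    beta: Sequence[float],
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Keep genes whose coefficient exceeds `threshold`.

    A positive coefficient means that higher expression raises the hazard.
    The result is sorted from the largest coefficient to the smallest.
    """
    selected = [(g, b) for g, b in zip(gene_names, beta) if b > threshold]
    return sorted(selected, key=lambda gb: gb[1], reverse=True)


# Illustrative values only, not data from the patent.
genes = ["GENE_A", "GENE_B", "GENE_C"]
coefficients = [0.8, 0.0, -0.3]
high_risk_genes = screen_genes(genes, coefficients)
patient_risk = risk_score([1.0, 2.0, 0.0], coefficients)
```

With a sparse solver, most coefficients are zero, so the positive threshold mainly separates genes that increase the hazard from genes that are inactive or protective.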

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211631423.4A CN115620808B (en) 2022-12-19 2022-12-19 Cancer gene prognosis screening method and system based on improved Cox model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211631423.4A CN115620808B (en) 2022-12-19 2022-12-19 Cancer gene prognosis screening method and system based on improved Cox model

Publications (2)

Publication Number Publication Date
CN115620808A (en) 2023-01-17
CN115620808B (en) 2023-03-31

Family

ID=84879866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211631423.4A Active CN115620808B (en) 2022-12-19 2022-12-19 Cancer gene prognosis screening method and system based on improved Cox model

Country Status (1)

Country Link
CN (1) CN115620808B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110320390A1 (en) * 2009-03-10 2011-12-29 Kuznetsov Vladimir A Method for identification, prediction and prognosis of cancer aggressiveness
US20170024529A1 (en) * 2015-07-26 2017-01-26 Macau University Of Science And Technology Semi-Supervised Learning Framework based on Cox and AFT Models with L1/2 Regularization for Patient's Survival Prediction
CN106407689A (en) * 2016-09-27 2017-02-15 牟合(上海)生物科技有限公司 Stomach cancer prognostic marker screening and classifying method based on gene expression profile
CN112117003A (en) * 2020-09-03 2020-12-22 中国科学院深圳先进技术研究院 Tumor risk grading method, system, terminal and storage medium
WO2022048071A1 (en) * 2020-09-03 2022-03-10 中国科学院深圳先进技术研究院 Tumor risk grading method and system, terminal, and storage medium
CN113409946A (en) * 2021-07-02 2021-09-17 中山大学 System and method for predicting cancer prognosis risk under high-dimensional deletion data

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116321620A (en) * 2023-05-11 2023-06-23 杭州行至云起科技有限公司 Intelligent lighting switch control system and method thereof
CN116321620B (en) * 2023-05-11 2023-08-11 杭州行至云起科技有限公司 Intelligent lighting switch control system and method thereof
CN118710146A (en) * 2024-06-27 2024-09-27 东营曜康医药科技有限公司 A method for detecting abnormal process behavior at a chemical production safety site

Also Published As

Publication number Publication date
CN115620808B (en) 2023-03-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant