CN115620808B

CN115620808B - Cancer gene prognosis screening method and system based on improved Cox model

Info

Publication number: CN115620808B
Application number: CN202211631423.4A
Authority: CN
Inventors: 张善书; 张浩川
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2022-12-19
Filing date: 2022-12-19
Publication date: 2023-03-31
Anticipated expiration: 2042-12-19
Also published as: CN115620808A

Abstract

The invention discloses a cancer gene prognosis screening method and system based on an improved Cox model, comprising the following steps: S1, collecting the expression levels of different genes in the cancer cells of cancer patients, collecting the patients' survival data, organizing the gene expression levels and patient information into a first matrix, and preprocessing the first matrix to obtain a second matrix; S2, inputting the survival data and the second matrix into a preset Cox regression model and solving to obtain the regression coefficients; S3, evaluating the patient risk associated with the genes corresponding to the regression coefficients according to the patient's risk function, and screening out the prognostic gene set corresponding to high patient risk; and S4, using the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis. Compared with conventional techniques, the invention improves accuracy in the regression step by adding a prior and updating its parameters automatically, and provides guidance for predicting prognosis, recurrence and metastasis.

Description

Cancer gene prognosis screening method and system based on improved Cox model

Technical Field

The invention relates to the technical field of survival analysis Cox model regression, in particular to a cancer gene prognosis screening method and system based on an improved Cox model.

Background

DNA microarray technology makes it possible to monitor the expression levels of thousands of genes simultaneously, in order to study how particular treatments, diseases and developmental stages affect gene expression. A common scenario is to measure gene expression in the cancer cells of many cancer patients, obtain the patients' survival data through follow-up, analyze the collected data with survival analysis methods, and finally screen out the genes relevant to prognosis. Studying the relationship between prognostic genes and tumors can provide information for predicting prognosis, recurrence and metastasis, and even for guiding treatment. The ultimate aim is to support individualized treatment of patients and to open up new breakthroughs in cancer therapy.

The collected survival data and gene expression levels must undergo systematic survival analysis, in which a dozen or so key prognostic genes are screened out of tens of thousands of genes. This step is an indispensable link in the overall prognostic analysis: the gene set formed by these genes can be used to assess the risk of cancer patients and thereby provide more information for treatment.

The Cox regression model is widely used in medical follow-up studies and is, to date, the most frequently used multivariate method in survival analysis. It is a semi-parametric model based on a linear combination of covariates that takes the survival outcome and the survival time as dependent variables. It can analyze the influence of multiple factors on survival time simultaneously, can handle data with censored survival times, and does not require the survival distribution of the data to be specified. These properties make it highly important in the screening of cancer prognostic genes.

According to the published literature, the most commonly used way of solving the Cox regression model is the coordinate descent method proposed by Noah Simon et al., which fits the model along a regularization path using warm starts, with the L1 norm and the L2 norm as penalty terms. However, the penalty coefficient is determined by cross-validation, so it cannot be solved for automatically and accurately. Moreover, because the fit is computed by an optimization method, it yields only a point estimate: no posterior distribution is obtained, so the prior parameters (i.e., the penalty coefficient) cannot be solved for automatically in combination with an expectation-maximization algorithm. As a result, the prognostic genes finally screened by this algorithm may not be well associated with the cancer.

Cox regression is a survival analysis method that forms an important link in prognostic gene screening. The regression coefficients solved from the Cox regression model act as risk weights for the corresponding genes, and the subsequent risk calculation for each patient is accurate only if the regression coefficients are accurate. A more accurate method for solving the Cox regression model is therefore required.

To this end, in combination with the above needs and deficiencies of the prior art, the present application proposes a method and system for cancer gene prognosis screening based on an improved Cox model.

Disclosure of Invention

The invention provides a cancer gene prognosis screening method and system based on an improved Cox model, which improve accuracy in the regression step by adding a prior and automatically updating its parameters, screen out the genes whose regression coefficients have large absolute values as prognostic genes, and provide information for subsequently predicting prognosis, recurrence and metastasis, and even for guiding treatment.

The primary objective of the present invention is to solve the above technical problems. The technical solution of the present invention is as follows:

The first aspect of the invention provides a cancer gene prognosis screening method based on an improved Cox model, which comprises the following steps:

S1, collecting the expression levels of different genes in the cancer cells of cancer patients, collecting the patients' survival data, and organizing the gene expression levels and patient information into a first matrix; preprocessing the first matrix to obtain a second matrix X.

S2, inputting the survival data obtained in step S1 and the second matrix X into a preset Cox regression model, and solving to obtain the regression coefficients.

S3, evaluating the patient risk associated with the genes corresponding to the regression coefficients according to the patient's risk function, and screening out the prognostic gene set corresponding to high patient risk.

S4, using the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.

Wherein, in the first matrix, the rows represent patients and the columns represent gene segments of the cancer cells; each element of the first matrix indicates the expression level of the gene of the corresponding column in the patient of the corresponding row.

Wherein, the survival data comprise: the covariate matrix (i.e., the second matrix) X, the survival time y and the censoring indicator c.

The genes corresponding to the components with larger absolute values in the regression coefficients have larger influence on the survival time of the patient, and the prognostic gene set corresponding to high patient risk can be screened out by evaluating the regression coefficients.

The preprocessing in step S1 specifically comprises: removing irrelevant genes by bioinformatics statistical methods to obtain a second matrix X with fewer columns.

Further, in step S2, a third matrix formed by combining the survival data with the second matrix is first input into the preset Cox regression model. The third matrix is denoted [X, y, c], where X is the covariate matrix (i.e., the second matrix), y is the survival time and c is the censoring indicator; the survival data of the i-th patient are $(x_i, y_i, c_i)$.

Further, the risk function (hazard function) of the i-th patient is specifically:

$$h_i(t) = h_0(t)\exp(x_i^{T}\beta)$$

where $h_0(t)$ is the shared baseline risk function, $\beta$ is the regression coefficient vector obtained by solving the Cox regression model, and $x_i$ denotes the gene expression levels of the i-th patient.

Wherein, once the regression coefficient vector $\beta$ has been fitted with the Cox regression model, the patient's risk can be assessed from the patient's gene expression levels $x_i$. The components of $\beta$ with larger absolute values have a greater influence on patient survival time, and the genes corresponding to those components form the prognostic gene set to be screened out.
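As a rough illustration of this screening step, the following Python sketch computes each patient's relative risk exp(x_i^T β) and ranks genes by the absolute value of their coefficient. The function names, the toy data and the cut-off k are hypothetical and are not taken from the patent.

```python
import numpy as np

def relative_risk(X, beta):
    """Relative risk exp(x_i^T beta) of each patient (one row of X per patient)."""
    return np.exp(X @ beta)

def top_genes(beta, k):
    """Column indices of the k coefficients with the largest absolute value."""
    order = np.argsort(-np.abs(beta))
    return order[:k]

# Toy data: 2 patients, 3 genes (illustrative values only).
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])
beta = np.array([0.5, -2.0, 0.0])

risk = relative_risk(X, beta)
selected = top_genes(beta, 2)
```

The baseline risk function is shared by all patients, so it cancels when patients are compared and only the exponential term is needed for ranking.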

Further, the step S2 of solving the Cox regression model to obtain the regression coefficient specifically includes the following steps:

S21, combining the existing survival data into a third matrix, sorting it by survival time, constructing the Cox regression model from the sorted data, and initializing the prior parameters and the message passing parameters.

S22, using the expectation propagation algorithm on the determinant vector factor graph of the Cox regression model, projecting the high-dimensional messages onto independent Gaussian distributions by the moment matching rule, solving the model by cyclic iteration, and outputting the regression coefficients and the approximate posterior probability.

S23, inputting the regression coefficients and the approximate posterior probability into the expectation maximization algorithm and updating the prior parameters.

S24, judging whether the regression coefficients satisfy a preset iteration ending condition; if the condition is satisfied, outputting the regression coefficients obtained in the current iteration; if not, returning to step S22 for the next iteration.

Wherein the third matrix is [X, y, c], where X represents the covariate matrix, y the survival time and c the censoring indicator.

The regression coefficient estimation problem is solved by a fully Bayesian approach: the penalized maximum likelihood estimate is replaced by a minimum mean square error estimate from the Bayesian point of view. Using a factor graph as the tool, the messages passed between nodes are computed by a message passing method based on expectation propagation, yielding the approximate posterior probability of the regression coefficients, which is in essence the approximately inferred probability distribution that the regression coefficients obey.

Further, the prior parameters include a mean, a variance and a sparsity ratio; the message passing parameters include the mean and variance of the positive direction messages. Step S21 is specifically: normalizing the covariate matrix X, sorting the third matrix [X, y, c] in descending order of survival time y, substituting the sorted third matrix [X, y, c] into the Cox partial likelihood function, and initializing the prior parameters and the message passing function.

Wherein, the regression coefficients are assigned a Gaussian-Bernoulli prior distribution, which is sparse.

The projection operation at the likelihood function node is approximated by the Laplace method together with a moment generating function, which simplifies the otherwise complex computation and allows a more accurate regression coefficient to be solved with little loss.

Further, the normalization of the covariate matrix X is specifically:

$$X \leftarrow \frac{X - \mathrm{mean}(X)}{\sqrt{\mathrm{var}(X)}}$$

where mean(X) is the mean of all elements of the matrix X, and var(X) is the variance of all elements of the matrix X.
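A minimal Python sketch of this normalization is given here. It assumes that the lost formula subtracts the mean of all matrix elements and divides by the square root of the variance of all matrix elements, as the definitions of mean(X) and var(X) suggest; the function name `normalize_global` is hypothetical.

```python
import numpy as np

def normalize_global(X):
    """Standardize X using the mean and variance taken over ALL elements.

    Assumption: statistics are global (one mean, one variance for the whole
    matrix), not computed per column.
    """
    X = np.asarray(X, dtype=float)
    return (X - X.mean()) / np.sqrt(X.var())

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Xn = normalize_global(X)
```

After this step the elements of the matrix, taken together, have zero mean and unit variance.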

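The Cox partial likelihood function is specifically:

$$p(y \mid z) \propto \prod_{i=1}^{n}\left[\frac{\exp(z_i)}{\sum_{j=1}^{i}\exp(z_j)}\right]^{c_i}$$

where the rows of [X, y, c] have been sorted in descending order of y, so that the risk set of the i-th patient consists of patients 1 to i, and $c_i = 1$ marks an observed (uncensored) event. $p(y \mid z)$ denotes the transition probability from z to y, a notation indicating that the function is normalized with respect to y; the Cox partial likelihood function itself is not normalized, so the relation is one of proportionality. The function takes z as its variable, with i-th element $z_i = x_i^{T}\beta$, where $x_i^{T}$ is the i-th row of X.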
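The sorting step and the partial likelihood can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: rows are sorted in descending order of survival time so that the risk set of row i is rows 1 to i, c_i = 1 is assumed to mark an observed event, ties in survival time are ignored, and the function names are hypothetical.

```python
import numpy as np

def sort_descending(X, y, c):
    """Sort the rows of [X, y, c] by survival time y, longest first."""
    order = np.argsort(-y, kind="stable")
    return X[order], y[order], c[order]

def cox_partial_loglik(beta, X, y, c):
    """Log of the Cox partial likelihood for data sorted by descending y.

    Assumptions: c_i = 1 means the event was observed, and there are no ties.
    """
    z = X @ beta
    # Running log-sum-exp: log(exp(z_1) + ... + exp(z_i)) is the risk set of row i.
    log_risk = np.logaddexp.accumulate(z)
    return float(np.sum(c * (z - log_risk)))

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 5.0, 3.0])
c = np.array([1, 1, 0])

Xs, ys, cs = sort_descending(X, y, c)
ll = cox_partial_loglik(np.zeros(1), Xs, ys, cs)
```

Sorting once up front turns every risk-set sum into a cumulative sum, which is why the descending order is convenient.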

The initialization of the prior parameters is specifically as follows. The regression coefficients obey a Gaussian-Bernoulli distribution: a mixture, weighted by the sparsity ratio, of a point mass at zero (represented by a Dirac delta function) and a Gaussian distribution with the prior mean and the prior variance. The distribution takes the regression coefficient as its variable, and the three prior parameters (sparsity ratio, mean and variance) are assigned initial values.

The initialization of the message passing function is specifically: initializing the positive direction messages as multidimensional Gaussian distributions whose elements are independent and share the same variance. In the expression, $0_n$ is an n-dimensional column vector whose elements are all 0 and $1_n$ is an n-dimensional column vector whose elements are all 1, the subscript denoting the dimension of the vector; the mean and variance of the positive direction messages are assigned initial values.

In the determinant vector factor graph of the Cox regression model, four multidimensional random variables are used to represent the messages passed on the factor graph; that is, each message is regarded as a multidimensional Gaussian probability density function, and the moment matching process requires each message to be a Gaussian distribution with independent elements. In the expressions, $1_n$ is an n-dimensional column vector whose elements are all 1 and $1_p$ is a p-dimensional column vector whose elements are all 1, the subscript denoting the dimension of the vector. When the elements of a multidimensional Gaussian random variable are mutually independent, i.e., the off-diagonal elements of the covariance matrix are 0, the diagonal covariance matrix can be represented by a vector.

Further, step S22 specifically performs message passing on the determinant vector factor graph of the Cox regression model based on the moment matching rule, and includes the following steps:

S221, updating the first message according to the moment matching rule of the determinant vector factor graph of the Cox regression model, specifically: at the corresponding node, the node function is multiplied by the incoming message, the product is projected onto a multidimensional Gaussian distribution with independent covariance, and the projection result is divided by the incoming message to obtain the outgoing message. The projection operation computes the mean vector and the variance vector of the product with respect to its variable; because the target is a multidimensional Gaussian with independent covariance, the elements of the variance vector are equal and the off-diagonal elements of the covariance matrix are 0, and the resulting mean and variance are output.

S222, updating the second message according to the moment matching rule of the determinant vector factor graph of the Cox regression model, specifically: at the corresponding node, the node function, which is a Dirac delta function, is multiplied by the incoming message, the variable is integrated out, the result is projected onto a multidimensional Gaussian distribution with independent covariance, and the projection result is divided by the incoming message to obtain the outgoing message.

S223, updating the third message according to the moment matching rule of the determinant vector factor graph of the Cox regression model, specifically: at the corresponding node, the node function is multiplied by the incoming message, the product is projected onto a multidimensional Gaussian distribution with independent covariance, and the projection result is divided by the incoming message to obtain the outgoing message; the mean obtained by this projection operation is the Cox regression coefficient output as the result.

S224, updating the fourth message according to the moment matching rule of the determinant vector factor graph of the Cox regression model, specifically: at the corresponding node, the node function is multiplied by the incoming message, the variable is integrated out, the result is projected onto a multidimensional Gaussian distribution with independent covariance, and the projection result is divided by the incoming message to obtain the outgoing message.

Wherein, because the likelihood function has an extremely complex form, the cumulant generating function and the Laplace method are used to approximate it when carrying out the projection operation.
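Each of the four updates ends by dividing the projected Gaussian by the incoming Gaussian message. For Gaussians with independent elements this division is a subtraction of natural parameters, which the following Python sketch illustrates. The function name and the argument names (`m_q`, `v_q` for the projection result, `m_in`, `v_in` for the incoming message) are hypothetical, and the sketch does not guard against the non-positive precision that can arise in practice.

```python
import numpy as np

def gaussian_divide(m_q, v_q, m_in, v_in):
    """Element-wise division N(m_q, v_q) / N(m_in, v_in), up to a constant.

    Precisions subtract, and so do precision-weighted means. The result is a
    valid Gaussian only where 1/v_q > 1/v_in.
    """
    precision = 1.0 / v_q - 1.0 / v_in
    v_out = 1.0 / precision
    m_out = v_out * (m_q / v_q - m_in / v_in)
    return m_out, v_out

# N(1, 2) * N(3, 4) is proportional to N(5/3, 4/3); dividing N(3, 4) back out
# should recover N(1, 2).
m_out, v_out = gaussian_divide(np.array([5.0 / 3.0]), np.array([4.0 / 3.0]),
                               np.array([3.0]), np.array([4.0]))
```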

Further, in step S223, the projection operation computes the mean and the variance of the approximate posterior probability of the regression coefficients; the mean obtained by the projection is the Cox regression coefficient output by the model.
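For a Gaussian-Bernoulli prior combined with a Gaussian incoming message, the projected mean and variance have a closed form, sketched here in Python for scalar or element-wise use. This is an illustrative derivation under stated assumptions, not the patent's expression: the names `r` and `tau` (mean and variance of the incoming message), `lam` (prior probability that a coefficient is nonzero), `mu` and `phi` (prior mean and variance) are all hypothetical.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def project_prior(r, tau, lam, mu, phi):
    """Posterior moments of beta under the prior
    (1 - lam) * delta(beta) + lam * N(beta; mu, phi)
    and a Gaussian message N(beta; r, tau)."""
    # Posterior of the Gaussian (nonzero) component.
    nu = 1.0 / (1.0 / tau + 1.0 / phi)
    gamma = nu * (r / tau + mu / phi)
    # Evidence of each mixture component given the message.
    slab = lam * normal_pdf(r, mu, tau + phi)
    spike = (1.0 - lam) * normal_pdf(r, 0.0, tau)
    pi = slab / (slab + spike)           # posterior probability of being nonzero
    mean = pi * gamma
    var = pi * (nu + gamma ** 2) - mean ** 2
    return mean, var, pi, gamma, nu

# With lam = 1 the prior is a plain Gaussian, so the result is the usual
# product of two Gaussians.
mean, var, pi, gamma, nu = project_prior(r=2.0, tau=1.0, lam=1.0, mu=0.0, phi=1.0)
```

The posterior mean is the minimum mean square error estimate of the coefficient, which is why it serves as the output of the regression.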

Further, step S23 specifically includes: using the regression coefficient and the approximate posterior probability output from step S22, together with the expectation maximization algorithm, to update the prior parameters (sparsity ratio, mean and variance) automatically. The intermediate quantities appearing in the update expressions are all functions of the regression coefficient variable, and the division and multiplication of vectors in these expressions are element-wise (point division and point multiplication).

The prior parameters are self-learned, and are automatically updated along with iteration of the whole algorithm without manual adjustment, so that the uncertainty of cross validation can be further avoided.
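One common form of such an expectation maximization update for a Gaussian-Bernoulli prior is sketched here in Python. It is offered as an assumption-laden illustration, not as the patent's own update expressions, which are not recoverable from this text. The inputs are hypothetical per-coefficient posterior quantities: `pi` (probability of being nonzero), `gamma` and `nu` (mean and variance of the nonzero component).

```python
import numpy as np

def em_update(pi, gamma, nu):
    """M-step for the sparsity ratio, mean and variance of a Gaussian-Bernoulli prior.

    Each parameter is the maximizer of the expected log-prior given the
    per-coefficient posterior quantities.
    """
    pi = np.asarray(pi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    nu = np.asarray(nu, dtype=float)
    weight = pi.sum()
    lam = pi.mean()                                   # new sparsity ratio
    mu = np.sum(pi * gamma) / weight                  # new prior mean
    phi = np.sum(pi * ((gamma - mu) ** 2 + nu)) / weight  # new prior variance
    return lam, mu, phi

lam, mu, phi = em_update(pi=[1.0, 1.0, 0.0, 0.0],
                         gamma=[1.0, 3.0, 0.0, 0.0],
                         nu=[0.5, 0.5, 0.0, 0.0])
```

Because the update uses only quantities already produced by the message passing step, no held-out data or cross-validation grid is needed.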

Further, the preset iteration ending condition in step S24 is specifically: whether to end the iteration is decided by checking whether the Crit value has started to rise. If the Crit value starts to rise, the iteration process is stopped and the regression coefficient of the final iteration is output; if the Crit value has not started to rise, the iteration continues. The Crit value is defined in terms of a norm.
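The stopping rule can be sketched generically in Python as follows. The definition of Crit is not recoverable from this text, so `crit` is a caller-supplied function here, and `step` stands in for one pass of steps S22 and S23; both names are hypothetical. Returning the last iterate before the rise is also an assumption of this sketch.

```python
def run_until_crit_rises(step, crit, beta0, max_iter=100):
    """Iterate beta <- step(beta) and stop as soon as crit starts to rise.

    Returns the last iterate obtained before the criterion increased.
    """
    beta = beta0
    previous = crit(beta)
    for _ in range(max_iter):
        candidate = step(beta)
        current = crit(candidate)
        if current > previous:
            break                      # criterion started to rise: stop
        beta, previous = candidate, current
    return beta

# Toy usage: walk down from 10 in unit steps; the criterion is the distance
# to 3.2, which starts to rise once the iterate passes 3.
result = run_until_crit_rises(step=lambda b: b - 1.0,
                              crit=lambda b: abs(b - 3.2),
                              beta0=10.0)
```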

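The second aspect of the present invention provides a cancer gene prognosis screening system based on an improved Cox model, which comprises a memory and a processor, wherein the memory stores a cancer gene prognosis screening program based on the improved Cox model, and the processor, when running the program, executes the following steps:

S1, collecting the expression levels of different genes in the cancer cells of cancer patients, collecting the patients' survival data, and organizing the gene expression levels and patient information into a first matrix; preprocessing the first matrix to obtain a second matrix X.

S2, inputting the survival data obtained in step S1 and the second matrix X into a preset Cox regression model, and solving to obtain the regression coefficients.

S3, evaluating the patient risk associated with the genes corresponding to the regression coefficients according to the patient's risk function, and screening out the prognostic gene set corresponding to high patient risk.

S4, using the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.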

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

The invention provides a cancer gene prognosis screening method and system based on an improved Cox model. Using a factor graph as the tool, the approximate posterior probability of the Cox regression coefficients is inferred by a moment matching message passing method based on expectation propagation. Minimum mean square error estimation is adopted to estimate the regression coefficients accurately. The prior parameters are solved automatically by an expectation maximization algorithm, which dispenses with cross-validation and makes the regression coefficient estimate more accurate. In terms of implementation, the Laplace method and the cumulant generating function are used to simplify the complex form of the likelihood function, so that it can be multiplied by a Gaussian and projected during the iteration. The invention thereby addresses the problem of regression accuracy, screens out the genes whose regression coefficients have large absolute values as prognostic genes, and provides information for subsequently predicting prognosis, recurrence and metastasis, and even for guiding treatment.

Drawings

FIG. 1 is a flow chart of the cancer gene prognosis screening method based on the improved Cox model of the present invention.

FIG. 2 is a flow chart of solving a Cox model in the cancer gene prognosis screening method based on the improved Cox model.

FIG. 3 is a flow chart of an embodiment of the invention for solving the Cox model.

FIG. 4 is a diagram of a determinant vector factor graph in an embodiment of the present invention.

FIG. 5 is a schematic diagram of the moment matching message passing method based on expectation propagation in an embodiment of the present invention.

FIG. 6 is a graph illustrating performance of regression performed on simulated data in an embodiment of the present invention.

FIG. 7 is a schematic structural diagram of a cancer gene prognosis screening system based on an improved Cox model according to the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.

Example 1

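As shown in FIG. 1, the present invention provides a cancer gene prognosis screening method based on an improved Cox model, which comprises the following steps:

S1, collecting the expression levels of different genes in the cancer cells of cancer patients, collecting the patients' survival data, and organizing the gene expression levels and patient information into a first matrix; preprocessing the first matrix to obtain a second matrix X.

S2, inputting the survival data obtained in step S1 and the second matrix X into a preset Cox regression model, and solving to obtain the regression coefficients.

S3, evaluating the patient risk associated with the genes corresponding to the regression coefficients according to the patient's risk function, and screening out the prognostic gene set corresponding to high patient risk.

S4, using the screened prognostic gene set, together with biological theory, to provide guidance for predicting prognosis, recurrence and metastasis.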

Wherein, in the first matrix, the rows represent patients and the columns represent gene segments of the cancer cells; each element of the first matrix indicates the expression level of the gene of the corresponding column in the patient of the corresponding row.

Further, in step S2, a third matrix formed by combining the survival data with the second matrix is first input into the preset Cox regression model. The third matrix is denoted [X, y, c], where X is the covariate matrix (i.e., the second matrix), y is the survival time and c is the censoring indicator; the survival data of the i-th patient are $(x_i, y_i, c_i)$.

Further, the risk function (hazard function) of the i-th patient is specifically:

$$h_i(t) = h_0(t)\exp(x_i^{T}\beta)$$

where $h_0(t)$ is the shared baseline risk function, $\beta$ is the regression coefficient vector obtained by solving the Cox regression model, and $x_i$ denotes the gene expression levels of the i-th patient.

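Wherein, once the regression coefficient vector $\beta$ has been fitted with the Cox regression model, the patient's risk can be assessed from the patient's gene expression levels $x_i$ and the regression coefficient vector $\beta$.

Further, in step S2, solving the Cox regression model to obtain the regression coefficients, as shown in FIG. 2, specifically includes the following steps:

S21, combining the existing survival data into a third matrix, sorting it by survival time, constructing the Cox regression model from the sorted data, and initializing the prior parameters and the message passing parameters.

S22, using the expectation propagation algorithm on the determinant vector factor graph of the Cox regression model, projecting the high-dimensional messages onto independent Gaussian distributions by the moment matching rule, solving the model by cyclic iteration, and outputting the regression coefficients and the approximate posterior probability.

S23, inputting the regression coefficients and the approximate posterior probability into the expectation maximization algorithm and updating the prior parameters.

S24, judging whether the regression coefficients satisfy a preset iteration ending condition; if the condition is satisfied, outputting the regression coefficients obtained in the current iteration; if not, returning to step S22 for the next iteration.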

Wherein the third matrix is [X, y, c], where X represents the covariate matrix, y the survival time and c the censoring indicator.

Further, the prior parameters include a mean, a variance and a sparsity ratio; the message passing parameters include the mean and variance of the positive direction messages. Step S21 is specifically: normalizing the covariate matrix X, sorting the third matrix [X, y, c] in descending order of survival time y, substituting the sorted third matrix [X, y, c] into the Cox partial likelihood function, and initializing the prior parameters and the message passing function.

In a specific embodiment, the covariate matrix can be a gene expression matrix, in which each row represents a different patient, each column represents a different gene, and each element represents the expression level of one gene in one patient.

The projection operation at the likelihood function node is approximated by the Laplace method together with a moment generating function, which simplifies the otherwise complex computation and allows a more accurate regression coefficient to be solved with little loss.

Further, the normalization of the covariate matrix X is specifically:

$$X \leftarrow \frac{X - \mathrm{mean}(X)}{\sqrt{\mathrm{var}(X)}}$$

where mean(X) is the mean of all elements of the matrix X, and var(X) is the variance of all elements of the matrix X.

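The Cox partial likelihood function is specifically:

$$p(y \mid z) \propto \prod_{i=1}^{n}\left[\frac{\exp(z_i)}{\sum_{j=1}^{i}\exp(z_j)}\right]^{c_i}$$

where the rows of [X, y, c] have been sorted in descending order of y, so that the risk set of the i-th patient consists of patients 1 to i, and $c_i = 1$ marks an observed (uncensored) event. $p(y \mid z)$ denotes the transition probability from z to y, a notation indicating that the function is normalized with respect to y; the Cox partial likelihood function itself is not normalized, so the relation is one of proportionality. The function takes z as its variable, with i-th element $z_i = x_i^{T}\beta$, where $x_i^{T}$ is the i-th row of X.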

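The initialization of the prior parameters is specifically as follows. The regression coefficients obey a Gaussian-Bernoulli distribution: a mixture, weighted by the sparsity ratio, of a point mass at zero (represented by a Dirac delta function) and a Gaussian distribution with the prior mean and the prior variance. The distribution takes the regression coefficient as its variable, and the three prior parameters (sparsity ratio, mean and variance) are assigned initial values.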

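The initialization of the message passing function is specifically: initializing the positive direction messages as multidimensional Gaussian distributions whose elements are independent and share the same variance. In the expression, $0_n$ is an n-dimensional column vector whose elements are all 0 and $1_n$ is an n-dimensional column vector whose elements are all 1; the mean and variance of the positive direction messages are assigned initial values.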

In a specific embodiment, the determinant vector factor graph of the Cox regression model is shown in fig. 4.

In the determinant vector factor graph of the Cox regression model, as shown in fig. 5, four multidimensional random variables are used to represent messages passing through the factor graph, i.e., the messages are regarded as a multidimensional gaussian probability density function, and the moment matching process requires that the messages obey the following distribution:

In the expressions, $1_n$ is an n-dimensional column vector whose elements are all 1 and $1_p$ is a p-dimensional column vector whose elements are all 1, the subscript denoting the dimension of the vector. When the elements of a multidimensional Gaussian random variable are mutually independent, i.e., the off-diagonal elements of the covariance matrix are 0, the diagonal covariance matrix can be represented by a vector.

In a specific embodiment, the prior parameters, i.e., the sparsity parameter, the mean parameter and the variance parameter of the prior distribution, are assigned initial values and are then updated automatically by the expectation maximization algorithm.

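S221, updating the first message according to the moment matching rule of the determinant vector factor graph of the Cox regression model, specifically: at the corresponding node, the node function is multiplied by the incoming message, the product is projected onto a multidimensional Gaussian distribution with independent covariance, and the projection result is divided by the incoming message to obtain the outgoing message. The projection operation computes the mean vector and the variance vector of the product with respect to its variable; because the target is a multidimensional Gaussian with independent covariance, the elements of the variance vector are equal and the off-diagonal elements of the covariance matrix are 0, and the resulting mean and variance are output.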

S222, updating the second message according to the moment matching rule of the determinant vector factor graph of the Cox regression model, specifically: at the corresponding node, the node function, which is a Dirac delta function, is multiplied by the incoming message, the variable is integrated out, the result is projected onto a multidimensional Gaussian distribution with independent covariance, and the projection result is divided by the incoming message to obtain the outgoing message.

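S223, updating the third message according to the moment matching rule of the determinant vector factor graph of the Cox regression model, specifically: at the corresponding node, the node function is multiplied by the incoming message, the product is projected onto a multidimensional Gaussian distribution with independent covariance, and the projection result is divided by the incoming message to obtain the outgoing message; the mean obtained by this projection operation is the Cox regression coefficient output as the result.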

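S224, updating the fourth message according to the moment matching rule of the determinant vector factor graph of the Cox regression model, specifically: at the corresponding node, the node function is multiplied by the incoming message, the variable is integrated out, the result is projected onto a multidimensional Gaussian distribution with independent covariance, and the projection result is divided by the incoming message to obtain the outgoing message.

Wherein, because the likelihood function has an extremely complex form, the cumulant generating function and the Laplace method are used to approximate it when carrying out the projection operation.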

Further, in step S223, the projection operation computes the mean and the variance of the approximate posterior probability of the regression coefficients; the mean obtained by the projection is the Cox regression coefficient output by the model.

Further, step S23 specifically includes: using the regression coefficient and the approximate posterior probability output from step S22, together with the expectation maximization algorithm, to update the prior parameters (sparsity ratio, mean and variance) automatically. The intermediate quantities appearing in the update expressions are all functions of the regression coefficient variable, and the division and multiplication of vectors in these expressions are element-wise (point division and point multiplication).

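Further, the preset iteration ending condition in step S24 is specifically: whether to end the iteration is decided by checking whether the Crit value has started to rise. If the Crit value starts to rise, the iteration process is stopped and the regression coefficient of the final iteration is output; if the Crit value has not started to rise, the iteration continues. The Crit value is defined in terms of a norm.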

In a specific embodiment, the performance of regression on simulated data in a single experiment is shown in FIG. 6, where the black line is the true value and the asterisk is the estimated value.

The simulated data are generated as follows:

The covariate matrix X is generated from independent standard normal samples.

The censoring indicator $c_i$ of each sample is drawn independently from the binomial distribution B(1, 0.8), so that the censoring rate is 0.2.

The regression coefficient vector $\beta$ is generated from Laplace-Bernoulli samples with a sparsity ratio of 0.2.

The survival time $y_i$ is then generated from a random number drawn independently from the uniform distribution U(0, 1), using one expression when the i-th sample is not censored and another when the i-th sample is censored.
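A data generator along these lines can be sketched in Python as follows. The expressions for the survival time are not recoverable from this text, so the sketch substitutes a standard construction and labels it as an assumption: with a unit baseline risk, an uncensored time is drawn by inverse transform as -log(u) / exp(x_i^T β), and a censored sample is observed at a uniformly drawn fraction of that time. The function name, the unit Laplace scale, and the reading of the sparsity ratio as the fraction of nonzero coefficients are likewise hypothetical.

```python
import numpy as np

def simulate(n, p, sparsity=0.2, event_prob=0.8, seed=0):
    """Simulate (X, y, c, beta) for a sparse Cox model (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))               # independent standard normal
    c = rng.binomial(1, event_prob, size=n)       # 1 = event observed (assumed)
    support = rng.random(p) < sparsity            # which coefficients are nonzero
    beta = np.where(support, rng.laplace(0.0, 1.0, size=p), 0.0)
    u = 1.0 - rng.random(n)                       # uniform on (0, 1], avoids log(0)
    t = -np.log(u) / np.exp(X @ beta)             # assumed: unit baseline risk
    fraction = rng.random(n)
    y = np.where(c == 1, t, fraction * t)         # assumed censoring mechanism
    return X, y, c, beta

X, y, c, beta = simulate(n=50, p=10)
```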

Example 2

Based on the above embodiment 1, with reference to fig. 3, this embodiment describes in detail a specific process of solving the Cox model in the present invention.

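In one particular embodiment, as shown in FIG. 3, the known data are the covariate matrix X, the survival time y and the censoring indicator c, and the quantity to be solved is the regression coefficient vector $\beta$.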

Step 1:

S1.1: normalizing X;

S1.2: merging the existing survival data (covariate matrix X, survival time y, censoring indicator c) into a matrix [X, y, c] and sorting it in descending order of y;

S1.3: substituting the sorted [X, y, c] into the Cox partial likelihood function:

$$p(y \mid z) \propto \prod_{i=1}^{n}\left[\frac{\exp(z_i)}{\sum_{j=1}^{i}\exp(z_j)}\right]^{c_i}$$

where $c_i = 1$ marks an observed (uncensored) event. $p(y \mid z)$ denotes the transition probability from z to y, which implies that it is normalized with respect to y (a property of a probability density function), whereas the Cox partial likelihood function is not normalized, so the relation is one of proportionality. The function takes z as its variable, with i-th element $z_i = x_i^{T}\beta$, where $x_i^{T}$ is the i-th row of X.

S1.4: assuming that the prior obeys a Gaussian-Bernoulli distribution, which takes the regression coefficient as its variable, and initializing the prior parameters (sparsity ratio, mean and variance);

S1.5: initializing the positive direction messages, i.e., assigning initial values to their mean and variance; here $0_n$ is an n-dimensional column vector whose elements are all 0 and $1_n$ is an n-dimensional column vector whose elements are all 1, the subscript denoting the dimension of the vector.

Step 2: Perform message passing on the factor graph using the expectation propagation algorithm, which is based on the moment matching rule.

S2.1: Update the message. At the corresponding node, the two incoming factors are multiplied, the product is projected onto a multidimensional Gaussian distribution with independent covariance, and the incoming message is then removed:

Here the projection operation evaluates the mean vector and the variance vector (the diagonal of the covariance matrix) of its argument. Because the target is a multidimensional Gaussian with independent covariance, the elements of the variance vector are equal and the off-diagonal elements of the covariance matrix are 0. The projection outputs this Gaussian.
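The "project, then remove the incoming message" pattern relies on dividing one Gaussian by another, which for independent coordinates amounts to subtracting precisions. The helper below is a generic sketch of that operation; the name `gaussian_divide` and the list-based interface are assumptions of this illustration.

```python
def gaussian_divide(mean_num, var_num, mean_den, var_den):
    """Divide N(mean_num, var_num) by N(mean_den, var_den), element by element.

    Returns the mean and variance lists of the resulting (unnormalized) Gaussian.
    """
    means, variances = [], []
    for mn, vn, md, vd in zip(mean_num, var_num, mean_den, var_den):
        # Under division the precisions (inverse variances) subtract.
        precision = 1.0 / vn - 1.0 / vd
        if precision <= 0.0:
            raise ValueError("division does not yield a proper Gaussian")
        v = 1.0 / precision
        # The precision-weighted means subtract as well.
        means.append(v * (mn / vn - md / vd))
        variances.append(v)
    return means, variances
```

For example, the product of N(1, 2) and N(3, 2) is proportional to N(2, 1), so dividing N(2, 1) by N(3, 2) recovers N(1, 2).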

Applying the Laplace method and the moment generating function to simplify the projection finally gives:

where the mean is taken at the maximizer of the function being approximated, and the variance is obtained from the Hessian matrix (the second-order gradient) of that function.

The diagonal operator has the following meaning: when its argument is a matrix, it extracts the diagonal; when its argument is a vector, it expands the vector into a diagonal matrix.

The averaging operator takes the mean of the elements of a vector, and the two remaining operators denote element-wise vector division and element-wise vector multiplication.

The maximizer is obtained by taking a quadratic approximation of the objective and then solving with a coordinate ascent algorithm.

First, the objective is expanded in a Taylor series:

where the first-order term uses the gradient of the objective at the expansion point, and the second-order term uses the Hessian matrix of the objective at the same point. After rewriting, the following is obtained:

The objective is finally simplified into:

where the subscript i denotes the i-th element of a vector. The coordinate ascent algorithm is then applied:

S2.1.1: Initialize the iterate;

S2.1.2: Update the gradient of the objective at the current iterate; its k-th element is:

S2.1.3: Update the Hessian matrix of the objective at the current iterate; its element in row k and column k is as follows (to accelerate the calculation, only the diagonal elements are kept to approximate the entire matrix):

S2.1.4: Update:

S2.1.5: Update:

S2.1.6: Update the iterate. If the change is sufficiently small, output the result;

if the change is still large, return to S2.1.2 and continue the iteration.
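The loop in S2.1.1 to S2.1.6 can be illustrated with a small coordinate-wise solver. This is a sketch under several assumptions: the objective is taken to be the Cox log partial likelihood in the linear predictor `z` minus a Gaussian message term with hypothetical mean `m` and variance `v`, `c[i] == 1` marks an observed event, samples are already sorted by descending survival time, and the intermediate updates of S2.1.4 and S2.1.5 are replaced by a plain one-dimensional Newton step using the diagonal Hessian entry.

```python
import math

def log_pl_grad_hess_diag(z, c):
    """Gradient and diagonal Hessian of the Cox log partial likelihood in z."""
    n = len(z)
    e = [math.exp(val) for val in z]
    risk, acc = [], 0.0
    for val in e:
        acc += val
        risk.append(acc)              # risk[i] = sum of exp(z_j) for j <= i
    grad, hess = [0.0] * n, [0.0] * n
    for k in range(n):
        g, h = float(c[k]), 0.0
        for i in range(k, n):         # sample k belongs to the risk set of i >= k
            if c[i] == 1:
                w = e[k] / risk[i]
                g -= w
                h -= w * (1.0 - w)
        grad[k], hess[k] = g, h
    return grad, hess

def objective(z, c, m, v):
    """Log partial likelihood minus the Gaussian message penalty."""
    total, risk = 0.0, 0.0
    for zi, ci in zip(z, c):
        risk += math.exp(zi)
        if ci == 1:
            total += zi - math.log(risk)
    return total - sum((zi - mi) ** 2 / (2.0 * vi) for zi, mi, vi in zip(z, m, v))

def coordinate_ascent(c, m, v, tol=1e-8, max_sweeps=100):
    """Maximize `objective` one coordinate at a time."""
    z = list(m)                                   # start at the message mean
    for _ in range(max_sweeps):
        biggest = 0.0
        for k in range(len(z)):
            grad, hess = log_pl_grad_hess_diag(z, c)
            g = grad[k] - (z[k] - m[k]) / v[k]    # gradient of the full objective
            h = hess[k] - 1.0 / v[k]              # diagonal curvature, always negative
            step = -g / h                         # one-dimensional Newton step
            z[k] += step
            biggest = max(biggest, abs(step))
        if biggest < tol:                         # change is small enough: stop
            break
    return z
```

Recomputing the gradient for every coordinate keeps the sketch short at the cost of speed; a practical implementation would update the running risk sums incrementally.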

Finally, compute the division part and output the resulting message:

S2.2: Update the message. At the corresponding node, the two incoming factors are multiplied, the variable is integrated out, the result is projected onto a multidimensional Gaussian distribution with independent covariance, and the incoming message is then removed:

where

the calculation gives:

where $\mathbf{1}_n$ is the n-dimensional column vector whose elements are all 1, with the subscript denoting the dimension of the vector. The diagonal operator extracts the diagonal when its argument is a matrix and expands its argument into a diagonal matrix when it is a vector. The averaging operator takes the mean of the elements of a vector. The projection operation evaluates the mean vector and the variance vector of its argument and outputs the corresponding Gaussian. The superscript $-1$ denotes matrix inversion, and the superscript $\mathsf{T}$ denotes matrix transposition.

Finally, compute the division part and output the resulting message:

S2.3: Update the message. At the corresponding node, the two incoming factors are multiplied, the product is projected onto a multidimensional Gaussian distribution with independent covariance, and the incoming message is then removed:

where

the calculation gives:

where the two auxiliary quantities are both functions of the same variable and are expressed as follows:

Finally, compute the division part and output the resulting message:

The approximate posterior of the regression coefficients is as follows:

The mean obtained by the projection operation is the vector of Cox regression coefficients to be output.
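The projection at the prior node can be illustrated by the standard moment computation for one coefficient under a Gaussian-Bernoulli prior. This is a sketch, not the embodiment's formula: the incoming message is assumed Gaussian with hypothetical mean `r` and variance `tau`, and `rho`, `mu`, `sigma2` stand for the sparsity ratio, mean, and variance of the prior.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gauss_bernoulli_posterior(r, tau, rho, mu, sigma2):
    """Posterior moments of one coefficient given message N(r, tau)."""
    # Posterior of the Gaussian (nonzero) component.
    slab_var = 1.0 / (1.0 / tau + 1.0 / sigma2)
    slab_mean = slab_var * (r / tau + mu / sigma2)
    # Evidence of the nonzero and zero components.
    w_slab = rho * normal_pdf(r, mu, tau + sigma2)
    w_spike = (1.0 - rho) * normal_pdf(r, 0.0, tau)
    pi = w_slab / (w_slab + w_spike)      # posterior probability of being nonzero
    # Moment matching: mean and variance of the two-component mixture.
    mean = pi * slab_mean
    var = pi * (slab_var + slab_mean ** 2) - mean ** 2
    return mean, var, pi, slab_mean, slab_var
```

The returned mean is shrunk toward zero relative to `slab_mean` whenever the zero component has non-negligible posterior weight, which is how the prior promotes sparse coefficients.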

S2.4: Update the message. At the corresponding node, the two incoming factors are multiplied and the variable is integrated out to obtain the message:

The calculation gives:

Finally, compute the division part and output the resulting message:

Step 3: Using the approximate posterior probability output in S2.3 together with the expectation maximization algorithm, the prior parameters are updated automatically.

S3.1: updating

：

S3.2: updating

：

S3.3: updating

：
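The prior-parameter updates of S3.1 to S3.3 can be illustrated with the usual expectation maximization updates for a Gaussian-Bernoulli prior. This is a sketch under assumptions: `pi[j]` is the posterior probability that coefficient j is nonzero, `slab_mean[j]` and `slab_var[j]` are its posterior mean and variance given that it is nonzero, and the order of the three updates is chosen here for convenience.

```python
def em_update_prior(pi, slab_mean, slab_var):
    """One EM update of (sparsity ratio, mean, variance) of the prior."""
    p = len(pi)
    weight = sum(pi)                  # expected number of nonzero coefficients
    rho = weight / p                  # sparsity ratio
    # Posterior-weighted mean of the nonzero component.
    mu = sum(w * m for w, m in zip(pi, slab_mean)) / weight
    # Posterior-weighted second central moment of the nonzero component.
    sigma2 = sum(
        w * ((m - mu) ** 2 + s) for w, m, s in zip(pi, slab_mean, slab_var)
    ) / weight
    return rho, mu, sigma2
```

The function assumes at least one coefficient has nonzero posterior weight; otherwise `weight` is 0 and the mean and variance updates are undefined.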

Step 4: Determine whether the preset iteration end condition has been reached.

The end condition is as follows:

Determine whether the Crit value starts to rise. If it starts to rise, stop the iteration and output the regression coefficients (from S2.3). Here $\|\cdot\|$ denotes a norm.
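The stopping rule can be sketched as an outer loop that halts as soon as a norm-based criterion starts to rise. The criterion used below (norm of the change divided by the norm of the new iterate) and the choice to return the iterate computed before the rise are assumptions of this illustration; `step` stands for one pass of Steps 2 and 3.

```python
import math

def relative_change(new, old):
    """Assumed criterion: norm of the change relative to the new iterate."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(new, old)))
    size = math.sqrt(sum(a * a for a in new))
    return diff / size if size > 0.0 else diff

def iterate_until_rise(step, beta0, max_iter=100):
    """Call `step` repeatedly and stop once the criterion starts to rise."""
    beta, prev_crit = beta0, float("inf")
    for _ in range(max_iter):
        candidate = step(beta)
        crit = relative_change(candidate, beta)
        if crit > prev_crit:          # criterion started to rise: stop
            break
        beta, prev_crit = candidate, crit
    return beta
```

Stopping on the first rise guards against the divergence that message passing iterations can show after an initial phase of convergence.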

Example 3

Based on the above example 1 and example 2, and with reference to fig. 7, this example illustrates a cancer gene prognosis screening system based on an improved Cox model in the second aspect of the present invention.

In a specific embodiment, as shown in fig. 7, the present invention further provides a cancer gene prognosis screening system based on an improved Cox model, which includes a memory and a processor, wherein the memory includes a cancer gene prognosis screening program based on the improved Cox model, and the cancer gene prognosis screening program based on the improved Cox model implements the following steps when executed by the processor:

The first matrix is preprocessed to obtain a second matrix X.

S2, the survival data obtained in step S1 and the second matrix X are input into a preset Cox regression model, which is solved to obtain the regression coefficient.

The drawings depicting the positional relationship of the structures are for illustrative purposes only and are not to be construed as limiting the present patent.

It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims

1. A cancer gene prognosis screening method based on an improved Cox model is characterized by comprising the following steps:

S1, collecting the expression levels of different genes of cancer cells of a cancer patient, collecting survival data of the patient, collating the expression levels of the different genes of the cancer cells and patient information into a first matrix, and preprocessing the first matrix to obtain a second matrix;

S2, inputting the survival data obtained in step S1 and the second matrix into a preset Cox regression model, and solving to obtain a regression coefficient; the specific solving method is as follows:

S21, combining the existing survival data into a third matrix, sorting it by survival time, constructing a Cox regression model from the sorted data, and initializing prior parameters and message passing parameters;

S22, according to the determinant vector factor graph of the Cox regression model, using an expectation propagation algorithm to project the high-dimensional messages onto independent Gaussian distributions through a moment matching rule, iteratively solving the model, and outputting a regression coefficient and an approximate posterior probability;

S23, inputting the regression coefficient and the approximate posterior probability into an expectation maximization algorithm, and updating the prior parameters;

S24, judging whether the regression coefficient reaches a preset iteration ending condition; if the preset iteration ending condition is reached, outputting the regression coefficient obtained by the current iteration; if the preset iteration ending condition is not reached, returning to step S22 for the next iteration;

S3, evaluating the patient risk of the corresponding gene in the regression coefficient according to the risk function of the patient, and screening a prognostic genome corresponding to high patient risk;

2. The method of claim 1, wherein in step S2 the survival data and the second matrix are combined to form a third matrix, and the third matrix is input into the preset Cox regression model; wherein the third matrix is denoted as [X, y, c], where X represents the covariate matrix, namely the second matrix, y represents the survival time, and c represents the censoring indicator; and wherein the survival data of the i-th patient is

。

3. The method of claim 2, wherein the risk function of the i-th patient is specifically:

$$h_i(t) = h_0(t)\,\exp\!\left(\mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta}\right)$$

wherein $h_0(t)$ is a shared baseline risk function; $\boldsymbol{\beta}$ is the regression coefficient obtained by solving the Cox regression model; and $\mathbf{x}_i$ represents the gene expression levels of the i-th patient.

4. The method of claim 1, wherein the prior parameters comprise: a mean, a variance, and a sparsity ratio.

5. The method of claim 1, wherein the covariate matrix X is normalized as follows:

wherein mean(X) is the mean of all elements of the matrix X, and var(X) is the variance of all elements of the matrix X;

the Cox partial likelihood function is specifically:

the transition probability needs to be normalized and uses the partial likelihood function, specifically:

wherein the expression denotes a transition probability that is normalized with respect to its output variable; the Cox partial likelihood function is not normalized; the symbol $\propto$ is the proportionality sign and represents a direct proportion relation; the function takes a vector as its variable, and the subscript i denotes the i-th element of a vector;

the initialization of the prior parameters specifically comprises: the regression coefficients follow a Gaussian-Bernoulli distribution, whose mathematical expression is as follows:

wherein the delta symbol represents the Dirac delta function; the Gaussian notation represents a Gaussian distribution with the stated mean and variance; the function takes the regression coefficients as its variable; and the prior parameters, namely the mean, the variance, and the sparsity ratio, are initialized;

the initialization of the message passing function is specifically: initializing the message passing function of the forward-direction messages, whose mathematical expression is as follows:

wherein, in the initialization parameters, $\mathbf{0}_n$ is an n-dimensional column vector whose elements are all 0; $\mathbf{0}_p$ is a p-dimensional column vector whose elements are all 0; $\mathbf{1}_n$ is an n-dimensional column vector whose elements are all 1; and $\mathbf{1}_p$ is a p-dimensional column vector whose elements are all 1;

the message passing function of the backward-direction messages is specifically:

wherein each message variable is a random variable obeying a multidimensional Gaussian distribution with independent, equal variances; the symbol $\sim$ denotes "is distributed as"; the Gaussian notation represents a multidimensional Gaussian distribution with the stated mean vector and a diagonal covariance matrix with the stated diagonal elements; and each of the four messages is parameterized by the mean and the variance of its variable.

6. The method for screening cancer gene prognosis based on an improved Cox model as claimed in claim 5, wherein step S22 performs message passing on the determinant vector factor graph of the Cox regression model based on the moment matching rule, comprising the following steps:

S221, updating a message, specifically: at the corresponding node, dividing the two stated quantities to obtain the message;

S222, updating a message, specifically: at the corresponding node, multiplying the two stated factors, integrating out the variable, projecting the result onto a multidimensional Gaussian distribution with independent covariance, and dividing the projected result by the incoming message to obtain the message; wherein the delta symbol is the Dirac delta function;

S223, updating a message, specifically: at the corresponding node, dividing the two stated quantities to obtain the message; wherein the result evaluated in S223 is output as the regression coefficient;

S224, updating a message, specifically: at the corresponding node, multiplying the two stated factors, integrating out the variable, and dividing to obtain the message.

7. The method of claim 1, wherein step S23 is specifically: using the regression coefficient and the approximate posterior probability output from step S22 together with the expectation maximization algorithm, automatically updating the prior parameters; the update expressions are specifically:

wherein the two auxiliary quantities are both functions of the same variable and are expressed as follows:

wherein the two operators denote element-wise vector division and element-wise vector multiplication, respectively.

8. The method for screening cancer gene prognosis based on the improved Cox model according to any one of claims 1 to 7, wherein the iteration ending condition preset in step S24 is specifically:

if the Crit value does not start to rise, the iteration continues; wherein $\|\cdot\|$ represents a norm.

9. A cancer gene prognosis screening system based on an improved Cox model comprises a memory and a processor, wherein the memory comprises a cancer gene prognosis screening program based on the improved Cox model, and the cancer gene prognosis screening program based on the improved Cox model realizes the following steps when being executed by the processor:

S21, combining the existing survival data into a third matrix, sorting it by survival time, constructing a Cox regression model from the sorted data, and initializing prior parameters and message passing parameters;

S24, judging whether the regression coefficient reaches a preset iteration ending condition; if the preset iteration ending condition is reached, outputting the regression coefficient obtained by the current iteration; if the preset iteration ending condition is not reached, returning to step S22 for the next iteration;