CN116386881A

CN116386881A - Method and system for predicting low-frequency poor prognosis outcome of early colorectal cancer patient

Info

Publication number: CN116386881A
Application number: CN202310211362.4A
Authority: CN
Inventors: 何亚舟; 罗志鹏; 王自强; 许川; 舒驰; 吴清彬; 周燕虹
Original assignee: Individual
Current assignee: Individual
Priority date: 2023-03-07
Filing date: 2023-03-07
Publication date: 2023-07-04

Abstract

The invention discloses a method and a system for predicting low-frequency poor prognosis outcomes in early colorectal cancer patients, and belongs to the technical field of neural networks. The method comprises the following steps: first, selecting risk features for predicting patient survival; then building a neural-network-based model M, and training and optimizing the model M using observed patient survival data and a suitably defined loss function; finally, inputting the risk features of the patient under test into the model M to obtain a prediction of that patient's survival time. Targeting the low frequency of poor outcomes among early-stage tumor patients, the method fills a technical gap in the field: through the proposed neural network model, a suitable loss function and stochastic gradient descent, it overcomes the technical difficulties of small samples, low-probability outcome events and prediction of a two-dimensional "time-to-event" composite outcome. Compared with the traditional linear COX regression method, the model's prediction accuracy is further improved.

Description

Method and system for predicting low-frequency poor prognosis outcome of early colorectal cancer patient

Technical Field

The invention belongs to the field of neural networks, and particularly relates to a method and a system for predicting low-frequency poor prognosis outcomes in early colorectal cancer patients using a neural network.

Background

The burden of malignant tumors in China is increasing year by year, mainly reflected in the yearly rise in new cases and tumor-related deaths. Colorectal cancer ranks second worldwide in cancer-related mortality, and its incidence and mortality in China are also rising year by year. Screening the risk factors that influence the prognosis of tumor patients based on prior knowledge, stratifying patients with different characteristics by risk, and accurately predicting their survival probability over a specific period can help clinicians formulate individualized disease monitoring and treatment strategies, and is therefore of important clinical value.

In recent years, with the wide implementation at home and abroad of early colorectal cancer screening programs such as fecal occult blood testing and colonoscopy, and with the emergence of new noninvasive early screening technologies such as circulating tumor DNA (ctDNA), more and more colorectal cancer cases are found at an early stage (stage I). Statistics indicate that about 30% of current colorectal cancer cases are diagnosed at stage I. Although stage I patients generally have good survival outcomes after radical surgery, some patients still have a poor prognosis. Literature data show that about 5%-10% of patients develop adverse outcomes such as tumor recurrence, metastasis or death within five years. How to accurately identify, at the time of early tumor diagnosis, the patients at risk of a poor outcome is therefore an important problem to be addressed.

Disclosure of Invention

In view of the above, the present invention provides a method and a system for predicting low-frequency poor prognosis outcomes in early colorectal cancer patients, which can accurately identify, at the time of early tumor diagnosis, the patient population at risk of a poor outcome.

To solve the above technical problem, the technical scheme of the invention is a method for predicting low-frequency poor prognosis outcomes in early colorectal cancer patients, which comprises the following steps:

selecting risk features for prediction; the risk features include basic features and clinicopathological features; the basic features include age, sex and tumor size; the clinicopathological features include tumor T stage, tumor grade, perineural invasion, number of lymph nodes examined and preoperative carcinoembryonic antigen;

building a model M based on a neural network, and training and optimizing the model M by utilizing a training data set and a properly defined loss function;

and inputting the risk characteristics of the patient into the model M to obtain a prediction result.

As an improvement, the model M is a fully-connected neural network comprising H hidden layers, the input of the model M is the risk characteristic x of the patient, and the output is the probability distribution y of the occurrence of the adverse event of the patient;

let the input vector of hidden layer $k$ be $z_{k-1}$ and its output vector be $z_k$, where $1 \le k \le H$ and, when $k = 1$, the input vector is $z_0 = x$; with model parameters $W_k$ and activation function $\sigma$, we have

$z_k = \sigma(z_{k-1}^{T} W_k)$;

when $k = H$, a Softmax probability transformation is applied to the output vector $z_H$ to obtain the probability distribution $y$ of occurrence of the adverse event.

As a further improvement, the Softmax probability transformation is performed using the formula

$y_r = \dfrac{\exp(z_H[r])}{\sum_{j=1}^{m} \exp(z_H[j])}$,

where $y = [y_1, y_2, \ldots, y_m]$ and $1 \le r \le m$.
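As a rough illustration of the layer recursion and Softmax step described above, the following Python sketch uses only the standard library. The ReLU activation, the layer widths, the random weights and the example input vector are assumptions made for illustration; the source does not specify them.

```python
import math
import random

def relu(v):
    # Assumed activation function sigma; the source leaves sigma unspecified.
    return [max(0.0, a) for a in v]

def vec_mat(z, W):
    # Computes z^T W, where z has length d_in and W is a d_in x d_out list of rows.
    return [sum(z[i] * W[i][j] for i in range(len(z))) for j in range(len(W[0]))]

def softmax(v):
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(v)
    exps = [math.exp(a - top) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, weights):
    # z_k = sigma(z_{k-1}^T W_k) for k = 1..H, then Softmax on z_H.
    z = x
    for W in weights:
        z = relu(vec_mat(z, W))
    return softmax(z)

random.seed(0)
sizes = [8, 16, 16, 5]  # hypothetical: 8 risk features, H = 3 layers, m = 5 time bins
weights = [
    [[random.gauss(0.0, 0.1) for _ in range(sizes[k + 1])] for _ in range(sizes[k])]
    for k in range(len(sizes) - 1)
]
x = [0.58, 1.0, 0.3, 0.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical encoded risk features
y = forward(x, weights)
print(y, sum(y))
```

The output `y` is a probability vector of length `m`, so its entries are non-negative and sum to 1.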

As an improvement, training the model M using stochastic gradient descent includes:

let the loss function be $L = L(\theta; D)$, where $\theta \in \Theta$ is a parameter in the parameter space and $D$ is the training dataset, i.e. the observed patient survival time data;

the trained optimal model is $M^{*} = M^{\theta^{*}}$, where

$\theta^{*} = \arg\min_{\theta \in \Theta} L(\theta; D)$;

the optimal parameter $\theta^{*}$ is obtained iteratively by stochastic gradient descent.

As a further development, the loss function is

$L = \alpha L_1 + (1 - \alpha) L_2$;

where, in $L_1$ and $L_2$, $n$ is the total number of patients in the training dataset, the subscript $i$ denotes the $i$-th patient, $x_i$ is the patient's risk feature vector, $k_i = 1$ indicates that the patient has died and $k_i = 0$ that the patient is still alive; $y_i[s_i]$ is the predicted probability that the survival time equals $s_i$; $S(\cdot \mid \cdot)$ is the survival function, and $S(s_i \mid x_i)$ is the model-predicted probability that patient $i$'s survival time is greater than $s_i$; $\rho_1$ and $\rho_2$ are the weights of samples in which the event has and has not occurred, respectively, with $\rho_1 > \rho_2 > 0$; and the hyperparameter $\alpha \in [0, 1]$ balances the influence of $L_1$ and $L_2$ on model optimization.

As an improvement, the formula

$C^{td} = P\{\, S(s_i \mid x_i) < S(s_i \mid x_j) \mid s_i < s_j,\ k_i = 1 \,\}$

is used to evaluate the output accuracy of the model M, where $P(\cdot \mid \cdot)$ is a conditional probability, $x_i$ is the patient's risk feature vector, $s_i$ is the patient's survival time, $S(\cdot \mid \cdot)$ is the survival function, $S(s_i \mid x_i)$ is the model-predicted probability that patient $i$'s survival time is greater than $s_i$, and $k_i$ is the label indicating whether the patient has died.

The invention also provides a system for predicting low-frequency poor prognosis outcomes in early colorectal cancer patients, comprising:

a risk feature selection module for selecting risk features; the risk features include basic features and clinicopathological features; the basic features include age, sex and tumor size; the clinicopathological features include tumor T stage, tumor grade, perineural invasion, number of lymph nodes examined and preoperative carcinoembryonic antigen;

and the prediction module is used for predicting probability distribution of occurrence of adverse events of the patient by using the input risk characteristics of the patient.

As an improvement, the prediction module includes:

the model building module is used for building a model M; the model M is a fully-connected neural network comprising H hidden layers, the input of the model M is the risk characteristic x of a patient, and the output is the probability distribution y of the occurrence of adverse events of the patient;

$z_k = \sigma(z_{k-1}^{T} W_k)$;

when $k = H$, a Softmax probability transformation is applied to the output vector $z_H$ to obtain the probability distribution $y$ of occurrence of adverse events;

a training optimization module for training and optimizing the model M, which includes a loss function $L = L(\theta; D)$, where $\theta \in \Theta$ is a parameter in the parameter space and $D$ is the training dataset, i.e. the observed patient survival time data;

the trained optimal model is $M^{*} = M^{\theta^{*}}$, where

$\theta^{*} = \arg\min_{\theta \in \Theta} L(\theta; D)$;

the optimal parameter $\theta^{*}$ is obtained iteratively by stochastic gradient descent.

As an improvement, the training optimization module includes a loss function definition module for defining a loss function L, the loss function being

$L = \alpha L_1 + (1 - \alpha) L_2$,

where $L_1$ and $L_2$ are the two component loss terms.

As an improvement, the prediction module further includes:

a model evaluation module for evaluating the output accuracy of the model M using the formula

$C^{td} = P\{\, S(s_i \mid x_i) < S(s_i \mid x_j) \mid s_i < s_j,\ k_i = 1 \,\}$.

The invention has the advantages that:

First, targeting the low frequency of poor outcomes among early-stage tumor patients, the invention fills a technical gap in the field: through the proposed neural network model and stochastic gradient descent, it overcomes the technical difficulties of small samples, low-probability outcome events and prediction of a two-dimensional "time-to-event" composite outcome. Compared with the traditional linear COX regression method, the model's prediction accuracy is further improved.

Second, being based on a neural network, the invention has the advantage of nonlinear fitting and can fit the data better than traditional generalized linear models (such as the Cox model), thereby improving prediction accuracy.

Third, the time cost is low. Given the input risk factor parameters, the patient's survival probability at different time nodes can be calculated rapidly.

Fourth, operation is simple. Risk features are screened on the basis of prior knowledge, and those finally included are patient characteristics that clinicians routinely encounter. The feature set can be updated promptly as new research evidence emerges, and the calculation results are easy for medical workers to read, which makes the method convenient to use.

Drawings

FIG. 1 is a flow chart of the present invention.

Fig. 2 is a schematic structural diagram of the present invention.

FIG. 3 is a schematic diagram of the prediction result.

Detailed Description

In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the following specific embodiments.

The main characteristic of early (stage I) colorectal tumors is that most patients have a good prognosis after standardized surgical treatment; that is, adverse outcomes after treatment such as recurrence, metastasis and death are relatively low-probability events. Statistical data indicate that the probability of these adverse outcomes within five years after surgery for stage I colorectal cancer patients is approximately 5%-10%. In addition, in terms of variable type, the prognostic outcome is a two-dimensional time-to-event composite variable: the outcome of a given patient consists of both whether the adverse outcome occurs and the specific time at which it occurs. This further increases the technical difficulty of predicting it accurately.

Current prediction tools for tumor prognosis mainly address tumors of all stages at a particular anatomical site, covering early, intermediate and late diagnoses. For example, patent publication CN112011616A describes a prognostic model that predicts post-operative survival for hepatocellular carcinoma of all stages based on immune-related gene expression markers. Similar methods have also been applied to colorectal cancer prognosis: patent publication CN111778337A describes a method for calculating a prognostic risk score for colorectal cancer of all stages, using the expression levels of 12 known tumor-stemness-related genes as risk factors. Since tumor stage is one of the main factors affecting prognosis, not using stage as a main predictor in the design of such tools may impair prediction accuracy. Other tools do include tumor stage as a predictor, but they still target tumors of all stages and therefore lack specificity for the characteristics of tumors at different stages. For example, a study from Japan reports a method for predicting the overall survival of colorectal cancer patients using clinical information such as tumor stage, body mass index and history of diabetes as risk factors. Because these tools do not exploit the specific characteristics of patients at different stages, and because tumor stage cannot serve as an effective predictor among patients of the same stage, their predictions for patients at a specific stage are often poor.

In order to solve the above technical problems, as shown in fig. 1, the present invention provides a method for predicting low-frequency poor prognosis outcome of early colorectal cancer patients, specifically comprising:

S1, selecting risk features for prediction; the risk features include basic features and clinicopathological features; the basic features include age, sex and tumor size; the clinicopathological features include tumor T stage, tumor grade, perineural invasion, number of lymph nodes examined and preoperative carcinoembryonic antigen.

Because poor outcome events are rare in early-stage tumor patients (often less than 10%), too many risk features can seriously degrade prediction accuracy. The invention therefore adopts a two-layer screening strategy of "basic variables + prior knowledge" and selects a total of 8 risk features for predicting the probability of a poor outcome in stage I colorectal cancer. First, the colorectal cancer prognostic risk factors recommended by the National Comprehensive Cancer Network (NCCN) and the European Society for Medical Oncology (ESMO) are selected: tumor T stage (T1 vs. T2), tumor grade (G1 or G2 vs. G3 or G4), perineural invasion (PNI), number of lymph nodes examined (≥12 vs. <12) and preoperative carcinoembryonic antigen (CEA <5 ng/ml vs. ≥5 ng/ml). The basic variables are age at tumor diagnosis, sex and tumor diameter.

In addition, the adverse outcomes predicted by the invention are all-cause death (overall death) and colorectal cancer-related death (CRC-specific death). All-cause death is defined as death of a subject from any cause during the observation period, whereas colorectal cancer-related death is death directly attributable to colorectal cancer. The corresponding overall survival and colorectal cancer-specific survival (CRC-specific survival) are defined as "1 - mortality". The prediction probabilities generated by the invention are the overall survival rate and the colorectal cancer-specific survival rate, at a given time node (e.g. 5 years after tumor diagnosis), of individuals with a given risk profile.
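The "1 - mortality" relation can be made concrete with a short Python sketch that turns a predicted distribution over event times into a survival curve. The yearly time bins, the extra final bin for events beyond the five-year horizon, and the numbers in `y` are hypothetical values chosen for illustration, not figures from the source.

```python
def survival_curve(y):
    """Return S[t] = 1 - (cumulative probability of the event up to bin t)."""
    surv, cumulative = [], 0.0
    for p in y:
        cumulative += p
        surv.append(max(0.0, 1.0 - cumulative))  # clamp tiny negative rounding errors
    return surv

# Hypothetical distribution: bins 0..4 are years 1..5, bin 5 is "after year 5".
y = [0.01, 0.02, 0.02, 0.03, 0.02, 0.90]
S = survival_curve(y)
five_year_survival = S[4]  # probability of surviving beyond year 5
print(S, five_year_survival)
```

Under these assumed numbers the cumulative five-year mortality is 0.10, so the five-year survival probability is 0.90.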

S2, building a model M, and training and optimizing the model M by using a training data set and a properly defined loss function.

In this embodiment, the model M is a fully connected neural network including H hidden layers, the input of the model M is the risk feature x of the patient, and the output is the probability distribution y of occurrence of adverse events of the patient;

$z_k = \sigma(z_{k-1}^{T} W_k)$;

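Specifically, the Softmax probability transformation is performed using the formula $y_r = \exp(z_H[r]) / \sum_{j=1}^{m} \exp(z_H[j])$, where $y = [y_1, y_2, \ldots, y_m]$ and $1 \le r \le m$.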

After the prediction model is built, it must be trained and optimized to improve prediction accuracy. The invention trains the model M by stochastic gradient descent, as follows:

the trained optimal model is $M^{*} = M^{\theta^{*}}$, where

$\theta^{*} = \arg\min_{\theta \in \Theta} L(\theta; D)$.

The optimal parameter $\theta^{*}$ is obtained iteratively by stochastic gradient descent. Specifically, let $\theta_0$ be randomly initialized model parameters; after a finite but sufficient number of iterations, $\theta_t$ approaches $\theta^{*}$. The iteration rule is:

$\theta_{t+1} = \theta_t - \beta \nabla_{\theta} L(\theta_t; D)$,

where $\beta$ is the learning step size and $\nabla_{\theta} L(\theta_t; D)$ is the gradient of the loss function with respect to the model parameters.
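The iterative update can be sketched in a few lines of Python. This is a toy example under stated assumptions: the one-parameter least-squares loss, the noise-free data, the step size and the iteration count are all hypothetical and stand in for the patent's actual loss and network parameters.

```python
import random

def sgd(grad_fn, theta0, data, beta=0.05, steps=2000, seed=0):
    """Repeat theta <- theta - beta * gradient, using one random sample per step."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(steps):
        sample = rng.choice(data)  # stochastic: gradient from one observation
        theta = theta - beta * grad_fn(theta, sample)
    return theta

# Toy loss per sample: (theta * x - y)^2, whose gradient is 2 * (theta * x - y) * x.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]  # generated with theta = 3

def grad(theta, sample):
    x, y = sample
    return 2.0 * (theta * x - y) * x

theta_star = sgd(grad, theta0=0.0, data=data)
print(theta_star)
```

Because the toy data are noise-free, the iterates contract toward the generating value of the parameter at every step; with real survival data the iterates would only hover near a minimizer.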

In the present invention, the loss function $L$ consists of two parts, $L_1$ and $L_2$. The first part, $L_1$, is based on a negative log-likelihood function and requires a more accurate fit to the low-probability adverse events that have occurred.

The second part, $L_2$, is a contrastive ranking loss over known, determined survival times: for two different patients $(x_i, s_i, k_i)$ and $(x_j, s_j, k_j)$, if $k_i = 1$ (i.e. $s_i$ is a determined survival time), then regardless of whether $k_j$ is 1, as long as $s_i < s_j$, the survival probability $S(s_i \mid x_i)$ should also be less than $S(s_j \mid x_j)$.

The final loss function combines the two, controlled by the hyperparameter $\alpha \in [0, 1]$, namely:

$L = \alpha L_1 + (1 - \alpha) L_2$.
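Since the exact expressions for the two loss terms are not reproduced in this text, the following Python sketch shows one plausible form only. It assumes a weighted negative log-likelihood for the first term and an exponential pairwise ranking penalty, compared at time $s_i$ as in the $C^{td}$ formula, for the second; the function names, the discrete time-bin indexing and the default weights are all hypothetical.

```python
import math

EPS = 1e-12  # avoids log(0)

def surv(y, s):
    # S(s | x): probability that the event time is later than bin s.
    return max(0.0, 1.0 - sum(y[: s + 1]))

def combined_loss(Y, s, k, alpha=0.5, rho1=2.0, rho2=1.0):
    """Y[i]: predicted distribution for patient i; s[i]: time bin; k[i]: 1 = died."""
    n = len(Y)
    # Assumed L1: weighted negative log-likelihood, with rho1 > rho2 > 0.
    l1 = 0.0
    for yi, si, ki in zip(Y, s, k):
        if ki == 1:
            l1 -= rho1 * math.log(yi[si] + EPS)        # event observed at s_i
        else:
            l1 -= rho2 * math.log(surv(yi, si) + EPS)  # still alive at s_i
    l1 /= n
    # Assumed L2: penalize pairs where patient i died first but is ranked safer.
    l2, pairs = 0.0, 0
    for i in range(n):
        if k[i] != 1:
            continue
        for j in range(n):
            if s[i] < s[j]:
                l2 += math.exp(-(surv(Y[j], s[i]) - surv(Y[i], s[i])))
                pairs += 1
    l2 = l2 / pairs if pairs else 0.0
    return alpha * l1 + (1.0 - alpha) * l2

s, k = [0, 1], [1, 0]
good = combined_loss([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]], s, k)
bad = combined_loss([[0.1, 0.2, 0.7], [0.7, 0.2, 0.1]], s, k)
print(good, bad)
```

In this two-patient example, predictions that place probability mass on the observed death time and rank the earlier death as higher risk yield the smaller loss.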

Finally, the output accuracy of the model M is evaluated after training and optimization. The invention adopts the time-dependent C-index ($C^{td}$) as the main metric of model performance. For any two comparable samples, $C^{td}$ gives the probability that the ordering of the two model-predicted survival times is consistent with the ordering of the true survival times, namely:

$C^{td} = P\{\, S(s_i \mid x_i) < S(s_i \mid x_j) \mid s_i < s_j,\ k_i = 1 \,\}$

where $P(\cdot \mid \cdot)$ is a conditional probability, $x_i$ is the patient's risk feature vector, $s_i$ is the patient's survival time, $S(\cdot \mid \cdot)$ is the survival function, $S(s_i \mid x_i)$ is the model-predicted probability that patient $i$'s survival time is greater than $s_i$, and $k_i$ is the label indicating whether the patient has died.
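An empirical estimate of this index can be computed by counting concordant pairs among comparable pairs. The Python sketch below assumes discrete time bins and a survival function derived from the predicted distribution; the helper names and the two-patient data are hypothetical.

```python
def surv(y, s):
    # S(s | x): probability that the event time is later than bin s.
    return max(0.0, 1.0 - sum(y[: s + 1]))

def c_td(Y, s, k):
    """Fraction of comparable pairs (k_i = 1, s_i < s_j) with S(s_i|x_i) < S(s_i|x_j)."""
    concordant, comparable = 0, 0
    for i in range(len(Y)):
        if k[i] != 1:
            continue  # pairs are anchored on a patient whose death was observed
        for j in range(len(Y)):
            if s[i] < s[j]:
                comparable += 1
                if surv(Y[i], s[i]) < surv(Y[j], s[i]):
                    concordant += 1
    return concordant / comparable if comparable else float("nan")

# Patient 0 died in bin 0; patient 1 was still alive in bin 1.
Y = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
s, k = [0, 1], [1, 0]
score = c_td(Y, s, k)
print(score)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of comparable pairs.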

Data from 9015 patients with stage I colorectal cancer in the national cancer registry and surveillance system (Surveillance, Epidemiology, and End Results, SEER) were randomly divided into a training set and a test set at a ratio of 3:1. The risk predictors were the 8 risk factors mentioned above (age, sex, tumor diameter, tumor grade, T stage, CEA, PNI and number of lymph nodes examined). The predicted outcome events were all-cause death (OS) and colorectal cancer-related death (CRCD). The comparison method was the classical COX proportional hazards regression model. The comparative experimental results are shown in the table below.

Type of adverse event	Event incidence	COX model	Model M
OS	17.6%	0.7382	0.7475
CRCD	4.2%	0.6725	0.6846

The table above shows the $C^{td}$ index achieved for adverse events with different incidence rates (higher is better).

It can be seen that the model M provided by the invention outperforms the COX model on both low-frequency events. Notably, in the prediction of CRCD the adverse event rate is only 4.2%, yet the proposed model still works normally and performs better than the COX model (at such low event rates, even a 0.1% improvement in accuracy is very difficult to achieve). The proposed model therefore has a clear advantage in predicting low-probability poor outcomes.
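The size of the improvement can be read directly from the table. The short Python sketch below only recomputes the absolute and relative $C^{td}$ gains from the four reported values; the dictionary layout and variable names are illustrative.

```python
# C^td values copied from the results table above.
results = {
    "OS":   {"incidence": 0.176, "cox": 0.7382, "model_m": 0.7475},
    "CRCD": {"incidence": 0.042, "cox": 0.6725, "model_m": 0.6846},
}

# Absolute gain in C^td and gain relative to the COX baseline, per event type.
absolute_gain = {name: round(r["model_m"] - r["cox"], 4) for name, r in results.items()}
relative_gain = {name: (r["model_m"] - r["cox"]) / r["cox"] for name, r in results.items()}

for name in results:
    print(name, absolute_gain[name], f"{100 * relative_gain[name]:.2f}%")
```

The absolute gain is 0.0093 for OS and 0.0121 for CRCD, so the larger gain occurs on the rarer event.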

S3, inputting the risk characteristics of the patient into the model M to obtain a prediction result.

Suppose a newly diagnosed colon cancer patient has the risk features $x$ = [age = 58 years, sex = male, tumor grade = G3 or G4, tumor size = 3 cm, PNI = yes, lymph nodes examined = 14, CEA = 10 ng/ml, T stage = T1]. Feeding $x$ into the model $M^{*}$ gives $y = M^{*}(x)$, whose distribution is shown in Fig. 3; from it, one can infer that the patient's most likely survival time is around five years.
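For illustration, the example patient can be encoded as a numeric feature vector and the most likely time bin read off a predicted distribution. In the Python sketch below the dichotomization thresholds follow the risk factor definitions given earlier, while the scaling of age and tumor size, the dictionary keys, the yearly bins and the distribution `y_hypothetical` are assumptions; the latter is a placeholder, not the distribution shown in Fig. 3.

```python
patient = {
    "age": 58, "sex": "male", "tumor_size_cm": 3.0, "grade_g3_g4": True,
    "t_stage": "T1", "pni": True, "lymph_nodes_examined": 14, "cea_ng_ml": 10.0,
}

def encode(p):
    """Map the 8 risk features to numbers (scaling factors are assumed)."""
    return [
        p["age"] / 100.0,                                  # assumed scaling
        1.0 if p["sex"] == "male" else 0.0,
        p["tumor_size_cm"] / 10.0,                         # assumed scaling
        1.0 if p["grade_g3_g4"] else 0.0,                  # G3/G4 vs. G1/G2
        1.0 if p["t_stage"] == "T2" else 0.0,              # T2 vs. T1
        1.0 if p["pni"] else 0.0,
        1.0 if p["lymph_nodes_examined"] >= 12 else 0.0,   # >=12 vs. <12
        1.0 if p["cea_ng_ml"] >= 5.0 else 0.0,             # >=5 vs. <5 ng/ml
    ]

def most_likely_bin(y):
    # Index of the time bin with the highest predicted probability.
    return max(range(len(y)), key=lambda r: y[r])

x = encode(patient)
y_hypothetical = [0.05, 0.05, 0.10, 0.20, 0.60]  # placeholder for y = M*(x), yearly bins
print(x, most_likely_bin(y_hypothetical) + 1, "years")
```

With a trained model, `y_hypothetical` would be replaced by the model output for `x`.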

As shown in Fig. 2, the present invention further provides a system for predicting low-frequency poor prognosis outcomes in early colorectal cancer patients, comprising:

a prediction module for predicting a probability distribution of occurrence of a patient adverse event using an input patient risk feature, comprising:

$z_k = \sigma(z_{k-1}^{T} W_k)$;

the trained optimal model is $M^{*} = M^{\theta^{*}}$, where

$\theta^{*} = \arg\min_{\theta \in \Theta} L(\theta; D)$;

the optimal parameter $\theta^{*}$ is obtained iteratively by stochastic gradient descent. The training optimization module comprises a loss function definition module for defining the loss function $L$, the loss function being

$L = \alpha L_1 + (1 - \alpha) L_2$,

where $L_1$ and $L_2$ are the two component loss terms.

a model evaluation module for evaluating the output accuracy of the model M using the formula

$C^{td} = P\{\, S(s_i \mid x_i) < S(s_i \mid x_j) \mid s_i < s_j,\ k_i = 1 \,\}$.

The foregoing is merely a preferred embodiment of the present invention, and it should be noted that the above-mentioned preferred embodiment should not be construed as limiting the invention, and the scope of the invention should be defined by the appended claims. It will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the spirit and scope of the invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims

1. A method for predicting low frequency poor prognosis outcome in an early stage colorectal cancer patient, comprising:

building a model M based on a neural network, and training and optimizing the model M by utilizing a training data set and a loss function;

2. A method of predicting low frequency poor prognosis outcome for patients with early colorectal cancer according to claim 1, wherein:

the model M is a fully-connected neural network comprising H hidden layers, the input of the model M is the risk characteristic x of a patient, and the output is the probability distribution y of the occurrence time of adverse events of the patient;

let the input vector of hidden layer $k$ be $z_{k-1}$ and its output vector be $z_k$, where $1 \le k \le H$ and, when $k = 1$, the input vector is $z_0 = x$; with model parameters $W_k$ and activation function $\sigma$, we have

$z_k = \sigma(z_{k-1}^{T} W_k)$;

when $k = H$, a Softmax probability transformation is applied to the output vector $z_H$ to obtain the probability distribution $y$ of occurrence of the adverse event.

3. A method of predicting low frequency poor prognosis outcome for patients with early colorectal cancer according to claim 2, wherein:

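the Softmax probability transformation is performed using the formula $y_r = \exp(z_H[r]) / \sum_{j=1}^{m} \exp(z_H[j])$, where $y = [y_1, y_2, \ldots, y_m]$ and $1 \le r \le m$.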

4. A method for predicting low frequency poor prognosis outcome for patients with early colorectal cancer according to claim 1, characterized in that the training of model M with a stochastic gradient descent method comprises:

the trained optimal model is $M^{*} = M^{\theta^{*}}$, where $\theta^{*} = \arg\min_{\theta \in \Theta} L(\theta; D)$.

5. A method of predicting low frequency poor prognosis outcome for patients with early colorectal cancer according to claim 1, wherein:

the loss function is

$L = \alpha L_1 + (1 - \alpha) L_2$;

where, in $L_1$ and $L_2$, $n$ is the total number of patients in the training dataset, the subscript $i$ denotes the $i$-th patient, $x_i$ is the patient's risk feature vector, $k_i = 1$ indicates that the patient has died and $k_i = 0$ that the patient is still alive; $y_i[s_i]$ is the predicted probability that the survival time equals $s_i$; $S(\cdot \mid \cdot)$ is the survival function, and $S(s_i \mid x_i)$ is the model-predicted probability that patient $i$'s survival time is greater than $s_i$; $\rho_1$ and $\rho_2$ are the weights of samples in which the event has and has not occurred, respectively, with $\rho_1 > \rho_2 > 0$; and the hyperparameter $\alpha \in [0, 1]$ balances the influence of $L_1$ and $L_2$ on model optimization.

6. A method of predicting low frequency poor prognosis outcome for patients with early colorectal cancer according to claim 1, wherein:

the output accuracy of the model M is evaluated using the formula

$C^{td} = P\{\, S(s_i \mid x_i) < S(s_i \mid x_j) \mid s_i < s_j,\ k_i = 1 \,\}$.

7. A system for predicting low-frequency poor prognosis outcomes in early colorectal cancer patients, comprising:

8. The system for predicting low-frequency poor prognosis outcomes in early colorectal cancer patients according to claim 7, characterized in that the prediction module comprises:

$z_k = \sigma(z_{k-1}^{T} W_k)$;

the trained optimal model is $M^{*} = M^{\theta^{*}}$, where $\theta^{*} = \arg\min_{\theta \in \Theta} L(\theta; D)$.

9. The system for predicting low-frequency poor prognosis outcomes in early colorectal cancer patients according to claim 8, wherein the training optimization module comprises a loss function definition module for defining a loss function $L$, the loss function being

$L = \alpha L_1 + (1 - \alpha) L_2$;

where, in $L_1$ and $L_2$, $n$ is the total number of patients in the training dataset, the subscript $i$ denotes the $i$-th patient, $x_i$ is the patient's risk feature vector, $k_i = 1$ indicates that the patient has died and $k_i = 0$ that the patient is still alive; $y_i[s_i]$ is the predicted probability that the survival time equals $s_i$; $S(\cdot \mid \cdot)$ is the survival function, and $S(s_i \mid x_i)$ is the model-predicted probability that patient $i$'s survival time is greater than $s_i$; $\rho_1$ and $\rho_2$ are the weights of samples in which the event has and has not occurred, respectively, satisfying $\rho_1 > \rho_2 > 0$; and the hyperparameter $\alpha \in [0, 1]$ balances the influence of $L_1$ and $L_2$ on model optimization.

10. The system for predicting low-frequency poor prognosis outcomes in early colorectal cancer patients according to claim 8, wherein the prediction module further comprises:

a model evaluation module for evaluating the output accuracy of the model M using the formula

$C^{td} = P\{\, S(s_i \mid x_i) < S(s_i \mid x_j) \mid s_i < s_j,\ k_i = 1 \,\}$.