CN113355426B

CN113355426B - Evaluation gene set and kit for predicting liver cancer prognosis

Info

Publication number: CN113355426B
Application number: CN202110916132.9A
Authority: CN
Inventors: 王维锋; 张欣; 王丛茂
Original assignee: Shanghai Zhiben Medical Laboratory Co ltd; Origimed Technology Shanghai Co ltd
Current assignee: Shanghai Zhiben Medical Laboratory Co ltd; Origimed Technology Shanghai Co ltd
Priority date: 2021-08-11
Filing date: 2021-08-11
Publication date: 2021-11-09
Anticipated expiration: 2041-08-11
Also published as: CN113355426A

Abstract

The invention relates to a liver cancer prognosis evaluation gene set. Specifically, the invention uses a gene set consisting of 62 specific immune related genes to predict and evaluate the prognosis of the liver cancer patient and provides scientific basis for medical decision. The invention also relates to a kit, a computing device and a storage medium for predicting liver cancer prognosis.

Description

Evaluation gene set and kit for predicting liver cancer prognosis

Technical Field

The present invention relates to a method for assessing the prognosis of a patient with liver cancer using a specific set of immune-related genes. Specifically, the invention relates to a characteristic gene set of 62 immune-related genes, which can be used for evaluating the prognosis of a liver cancer patient and providing scientific basis for medical decision.

Background

Tumors are incurable diseases, and the goal of treatment is clear: to allow patients to live longer once the tumor has been diagnosed. Liver cancer is one of the most common cancer types in China, with high morbidity and mortality. Treatment generally combines surgery, radiotherapy, chemotherapy and traditional Chinese medicine. Nevertheless, the recurrence rate and the rate of poor outcomes of liver cancer remain high, and the 5-year recurrence-free survival rate and overall survival rate remain low. Existing clinical prognostic indicators, such as alpha-fetoprotein level and TNM staging, cannot comprehensively reflect tumor characteristics or accurately assess a patient's survival risk. There is therefore an urgent need for reliable prognostic indicators with which to establish reliable prognostic models, make accurate survival predictions, and guide targeted therapeutic intervention in combination with the corresponding molecular characteristics. Providing doctors and patients with liver cancer prognosis information early in clinical practice is an urgent problem to be solved; doing so helps in formulating individualized treatment plans and has important clinical significance for improving postoperative survival and achieving precision treatment of liver cancer.

The activities of tumor cells and immune cells in the tumor microenvironment are involved in the initiation and development of tumors, so tumor immunology has attracted increasing attention. Tumor-infiltrating immune cells are key cellular components of the host immune response and important members of the tumor microenvironment. Many studies have demonstrated that tumor-infiltrating immune cells are associated with therapeutic response and prognosis in a variety of cancers.

The advantage of using the expression values of a characteristic gene set to evaluate patient prognosis is objectivity: there is no subjective bias from the researcher. The disadvantage is that the observation time is long and the occurrence of all events, i.e. the deaths of all patients, must be recorded. Published immune-related gene markers usually involve only a single immune gene or a small number of immune cell types. However, an immune response in vivo involves multiple kinds of immune cells, so evaluating prognosis from a single immune gene or a small number of immune cells is incomplete. Therefore, there remains a need for more accurate and efficient models that can predict the prognosis of cancer patients.

Disclosure of Invention

The invention is based on TCGA hepatocellular carcinoma samples. The samples are randomly divided into a training set and a test set, and, by combining gene expression data with a screen of immune-related genes, an evaluation gene set is selected that can predict the prognosis of hepatocellular carcinoma from gene expression values.

In a first aspect, the present invention relates to an evaluation gene set for predicting the prognosis of a liver cancer patient, the evaluation gene set comprising 62 genes, which are listed in Table 1 below:

Table 1: Genes of the selected characteristic evaluation gene set (the table body is not reproduced in this text).

In another aspect, the present invention also relates to a kit for predicting prognosis of a patient with liver cancer, comprising a reagent that can specifically detect a gene expression value; wherein the genes are 62 genes in Table 1.

In this context, the terms "expression level" and "expression value" of a gene are used interchangeably to refer to the value of a parameter that measures the degree of expression of a given gene. The expression value can be determined by measuring the level of mRNA encoded by the gene of interest or by measuring the amount of protein encoded by the gene.

In some embodiments, the kit comprises one or more of nucleic acid extraction reagents, PCR reagents, genome/transcriptome sequencing reagents, gene-specific primers or probes, antibodies specific for gene expression products.

In some embodiments, the reagent is any reagent known in the art that can be used to detect the level of gene expression; in particular embodiments, the reagents are reagents for performing one or more of the following methods: real-time fluorescent quantitative PCR, northern blotting, western blotting, genome sequencing, transcriptome sequencing, biological mass spectrometry or specific antibody detection.

In some embodiments, the kit further comprises sample processing reagents, such as sample lysis reagents, sample purification reagents, and nucleic acid extraction reagents, among others.

Transcriptome sequencing can rapidly and comprehensively obtain almost all transcripts and gene sequences of a specific cell or tissue of a certain species in a certain state through a second-generation sequencing platform, and can be used for researching gene expression quantity, gene function, structure, alternative splicing, prediction of new transcripts and the like. In addition, by designing appropriate primers, the transcription expression level of a gene can be determined by PCR such as reverse transcription PCR. The protein expression level of each gene can also be measured by an immunoassay such as immunohistochemistry, ELISA, or the like using an antibody specific to the gene protein.

Preferably, the gene expression value is a value obtained by annotating transcriptome sequencing data.

In another aspect, the present invention also relates to a method for predicting the prognosis of a patient with liver cancer, comprising the steps of:

a) sample collection and data detection: collecting a sample of the patient, determining their expression values for 62 genes in the evaluation gene set in table 1;

b) calculating the risk score: calculating the total expression value of the 62 genes of the evaluation gene set for the liver cancer patient, namely the risk score (Risk Score); the risk score calculation formula is as follows:

$$\text{Risk Score} = \sum_{i=1}^{n} E_i \times \mathrm{Coef}_i$$

wherein E_i is the expression value of each gene, Coef_i is the weight coefficient of each gene, and n is the number of genes in the characteristic gene set, namely 62;

c) predicting the prognosis of the patient according to the calculated risk score of the liver cancer patient: the lower the risk score of the patient, the better the prognosis; the risk score is compared with a defined value, and if the risk score is higher than the defined value, the prognosis is predicted to be poor, and if the risk score is lower than the defined value, the prognosis is predicted to be good.

In some embodiments, the defined value is about 4.14.
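Steps b) and c) can be illustrated with a minimal Python sketch. It assumes the expression values and weight coefficients are available as gene-keyed mappings; the function names, the three placeholder genes and their numbers are hypothetical and are not taken from Table 1 or Table 2. How a score exactly equal to the defined value is handled is not stated in the text, so the sketch arbitrarily treats it as "good".

```python
from typing import Mapping

CUTOFF = 4.14  # defined value stated in the text


def risk_score(expression: Mapping[str, float], coef: Mapping[str, float]) -> float:
    """Sum of E_i * Coef_i over every gene in the coefficient table."""
    missing = [g for g in coef if g not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(expression[g] * coef[g] for g in coef)


def predict_prognosis(score: float, cutoff: float = CUTOFF) -> str:
    """Scores above the cutoff predict poor prognosis; others predict good.

    Assumption: a score exactly equal to the cutoff is reported as "good".
    """
    return "poor" if score > cutoff else "good"


# Hypothetical three-gene toy example (placeholder names and values).
toy_coef = {"GENE_A": 0.5, "GENE_B": -0.25, "GENE_C": 1.0}
toy_expr = {"GENE_A": 4.0, "GENE_B": 2.0, "GENE_C": 3.0}

score = risk_score(toy_expr, toy_coef)  # 4.0*0.5 + 2.0*(-0.25) + 3.0*1.0
print(score, predict_prognosis(score))
```

In a real application the coefficient mapping would hold the 62 genes and weights of the evaluation gene set, and the expression values would have to be measured and normalized in the same way as the data on which the weights were fitted.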

In some embodiments, the patient sample is from a tissue of the patient, including tumor tissue, which is a primary lesion or a metastatic lesion.

As used herein, "about" when used in reference to a numerical value indicates that the calculation or measurement allows the value to encompass some approximation of the exact numerical value, or a reasonably close numerical value; "about" herein means at least the variation in value that can result from the usual methods of measuring or using such parameters; it should be understood that the presence or absence of "about" does not affect the interpretation of its numerical value; preferably, all values within the range of plus or minus 10% of the subsequent value are indicated.

Those skilled in the art will appreciate that all or part of the functions of the above-described method steps may be implemented by hardware, or may be implemented by a computer program.

When all or part of the functions of the above method steps are implemented by means of a computer program, the program may be stored in a computer-readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc., and the program is executed by a computer to realize the above functions. For example, the program may be stored in a memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above may be implemented. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied to a memory of a local device, or may be version-updated in a system of the local device, and when the program in the memory is executed by a processor, all or part of the functions in the above embodiments may be implemented.

In another aspect, the invention also relates to a system for predicting prognosis of a patient with liver cancer, comprising the following modules:

a) a data collection module: collecting a sample of the patient, determining the expression values of 62 genes in the evaluation gene set in table 1, and inputting the expression value data of each gene to a model calculation module;

b) a model calculation module: calculating the total expression value of 62 genes in the evaluation gene set of the liver cancer patient, namely a Risk Score (Risk Score); the risk score calculation formula is as described above;

c) an output prediction module: predicting the prognosis of the patient according to the risk score data of the liver cancer patient, wherein the lower the risk score of the patient, the better the prognosis; the risk score data is compared with a defined value, and if the risk score data is higher than the defined value, a poor predicted prognosis is output, and if the risk score data is lower than the defined value, a good predicted prognosis is output.

In some embodiments, the defined value is about 4.14.

In another aspect, the present invention also relates to the use of the reagent for detecting the expression value of the genes described in table 1 in the preparation of kits and systems for predicting liver cancer prognosis.

In some embodiments, the kit or system is the kit or system of the invention as described above.

In some embodiments, the reagent for detecting expression values is selected from one or more of nucleic acid extraction reagents, PCR reagents, genome/transcriptome sequencing reagents, gene-specific primers or probes, and antibodies specific for gene expression products. In some embodiments, the reagent is a reagent for performing one or more of the following: real-time fluorescent quantitative PCR, northern blotting, western blotting, genome sequencing, transcriptome sequencing, biological mass spectrometry or specific antibody detection.

In another aspect, the invention also relates to a computing device comprising:

at least one processing unit; and

at least one memory coupled to the processing unit and storing instructions for execution by the processing unit, the instructions, when executed, enabling the device to predict the prognosis of a liver cancer patient, the prediction comprising the steps of:

a) calculating a risk score for the patient based on the collected and determined expression values for 62 genes in the evaluation set of genes in table 1 for the patient sample; the risk score calculation formula is as described above;

b) predicting the prognosis condition of the patient according to the risk score data of the liver cancer patient, wherein the lower the risk score of the patient is, the better the prognosis is; comparing the risk score data with a defined value, if the risk score data is higher than the defined value, the prognosis is predicted to be poor, and if the risk score data is lower than the defined value, the prognosis is predicted to be good.

Preferably, the defined value is about 4.14.

In another aspect, the present invention also relates to a computer readable storage medium storing a computer program executable by a machine to perform the steps of predicting a prognosis for a patient with liver cancer, the steps comprising:

Preferably, the defined value is about 4.14.

The beneficial effects of the invention are as follows:

The invention provides an evaluation gene set for liver cancer prognosis prediction and a corresponding kit, which can be applied more reliably in clinical practice. The characteristic gene set comprises 62 immune-related genes covering 22 types of immune cells, and its predictive performance was verified in the test set. Compared with methods that correlate prognosis with the mutation of a single gene, the method of the invention is less constrained by mutation frequency in the population and by the effect of the collected samples on the stability of survival analysis results, and can predict the prognosis of liver cancer patients more accurately; it can also be applied in clinical testing and provide a scientific basis for medical decision making.

Drawings

FIG. 1: Flow chart for screening the characteristic gene set.

FIG. 2: Results for the selected characteristic gene set. Patients were grouped into a high risk group (high group) and a low risk group (low group) according to the median of their risk scores for the 62 selected genes. FIG. 2 shows the survival probability of the high risk group and the low risk group in the training set.

FIG. 3: Results for the selected characteristic gene set. Patients were grouped into a high risk group (high group) and a low risk group (low group) according to the median of their risk scores for the 62 selected genes. FIG. 3 shows the survival probability of the high risk group and the low risk group in the test set.

FIG. 4: Results for the randomly selected gene set. Patients were grouped into a high risk group (high group) and a low risk group (low group) according to the median of their risk scores for the 62 randomly selected genes. FIG. 4 shows the survival probability of the high risk group and the low risk group in the training set.

FIG. 5: Results for the randomly selected gene set. Patients were grouped into a high risk group (high group) and a low risk group (low group) according to the median of their risk scores for the 62 randomly selected genes. FIG. 5 shows the survival probability of the high risk group and the low risk group in the test set.

Detailed Description

The following describes embodiments of the present invention with reference to the drawings. For the specific methods or materials used in the embodiments, those skilled in the art can make routine substitutions based on the technical idea of the present invention and existing technology; the invention is not limited to the specific embodiments described.

Example 1: establishing a model by a Lasso regression method to obtain a selected characteristic gene set

Data processing, screening immune gene related to liver cancer prognosis

Gene expression data for hepatocellular carcinoma, together with clinical data such as the overall survival time and survival endpoint of the patients, were downloaded from The Cancer Genome Atlas (TCGA); the gene expression data comprise 363 hepatocellular carcinoma samples and 60483 genes. To construct a liver cancer prognosis prediction model, 547 immune-related genes were selected from the 60483 genes for the subsequent screening of a gene set for predicting patient prognosis.

Construction of liver cancer prognosis model

363 liver cancer patient samples in the TCGA dataset were randomly divided into 80% training set (290 samples) and 20% testing set (73 samples) with reference to clinical staging. Using training set samples and 547 immune-related genes, Least Absolute Shrinkage and Selection Operator (LASSO) regression analysis was performed in the training set:

LASSO regression analysis and the establishment of the multi-risk prediction model were carried out with the R package glmnet. The cv.glmnet function was applied to the training set, with the lasso regression model and the Cox model selected and with the C-index as the evaluation metric of the model, to obtain the penalty coefficient for screening the characteristic gene set. The penalty coefficient is 0.033. The model was validated using 20-fold cross validation. Finally, the 62 genes whose weight values were not 0 were selected as the final characteristic gene set. A multi-risk prediction model for predicting patient prognosis was thus established; see Table 2.
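The random 80%/20% division with reference to clinical staging, performed before the LASSO fit, might be sketched as follows. This is a minimal illustration assuming that "with reference to clinical staging" means stratified sampling within each stage; the function name, the seed, the sample identifiers and the stage counts are all hypothetical, and Python is used here purely for illustration although the analysis in the text was carried out in R.

```python
import random
from collections import defaultdict


def stratified_split(sample_ids, stages, train_fraction=0.8, seed=0):
    """Randomly split samples into train/test, stage by stage."""
    by_stage = defaultdict(list)
    for sid, stage in zip(sample_ids, stages):
        by_stage[stage].append(sid)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for stage in sorted(by_stage):
        members = by_stage[stage]
        rng.shuffle(members)
        n_train = round(len(members) * train_fraction)
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


# Hypothetical cohort: 363 sample IDs with made-up stage labels.
ids = [f"S{i:03d}" for i in range(363)]
stage_labels = ["I"] * 170 + ["II"] * 85 + ["III"] * 85 + ["IV"] * 23

train_ids, test_ids = stratified_split(ids, stage_labels)
print(len(train_ids), len(test_ids))  # 290 73 for these made-up stage counts
```

With other stage distributions the per-stage rounding can shift the totals by a sample or two, so the exact 290/73 division reported in the text is not guaranteed by this scheme.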

Table 2: The 62 genes of the selected characteristic evaluation gene set and their weight coefficients (the table body is not reproduced in this text).
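As a hedged illustration of how weight coefficients of this kind arise, a lasso-penalized Cox regression is commonly written as the optimization problem below. The notation is introduced here for illustration and does not appear in the original text: x_i is the vector of expression values of the p candidate genes for patient i, t_i the follow-up time, δ_i the death indicator, N the number of training samples, β the coefficient vector and λ the penalty coefficient. The 2/N scaling is one common software convention and may differ between implementations.

```latex
\hat{\beta} = \arg\max_{\beta}\left\{
  \frac{2}{N}\sum_{i:\,\delta_i = 1}\left[
    x_i^{\top}\beta - \log\sum_{j:\,t_j \ge t_i}\exp\!\left(x_j^{\top}\beta\right)
  \right]
  - \lambda\sum_{k=1}^{p}\lvert\beta_k\rvert
\right\}
```

Under this reading, the genes with a nonzero fitted coefficient form the selected set, their coefficients play the role of Coef_i in the risk score, and λ corresponds to the reported penalty coefficient of 0.033.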

The weight coefficient of each gene was used to calculate the risk score of the characteristic gene set, i.e. its total expression value, which is the sum of the products of the expression values of the 62 genes and their respective weights.

The total expression value of the characteristic gene set, namely the risk score (Risk Score), is calculated as follows:

$$\text{Risk Score} = \sum_{i=1}^{n} E_i \times \mathrm{Coef}_i$$

i.e. the sum of the products of the individual gene expression values and the corresponding weight coefficients, wherein E_i is the expression value of each gene, Coef_i is the weight coefficient of each gene, and n is the number of genes in the characteristic gene set, which is 62 in the present invention; the weight value corresponding to each gene is shown in Table 2.

Verifying model accuracy

After model training was completed, the established model and the selected gene set were used to predict the test set by means of a prediction function, in order to test the predictive ability of the model and the selected gene set on the test set data.

The total expression values (risk scores) of the patients in the training set were calculated according to the above formula and sorted by size, and the patients in the training set and the test set were divided by the median value, 4.14, into a high risk group (high group) and a low risk group (low group). The survival probabilities of the high and low groups were compared by plotting patient survival time (days) on the abscissa and survival probability on the ordinate.
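The grouping and the survival curves described above can be sketched in plain Python. The sketch assumes the curves are Kaplan-Meier estimates, which the text does not name explicitly; the function names and all toy numbers are hypothetical, and a score equal to the cutoff is arbitrarily placed in the low group.

```python
from statistics import median


def split_by_cutoff(scores, cutoff):
    """Label each score 'high' if above the cutoff, otherwise 'low'."""
    return ["high" if s > cutoff else "low" for s in scores]


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival probability).

    times  -- follow-up time in days
    events -- 1 if the patient died at that time, 0 if censored
    """
    curve, survival = [], 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical toy data (not from the TCGA cohort).
scores = [5.2, 3.1, 4.8, 2.9, 6.0, 3.5]
days = [100, 900, 250, 1200, 60, 700]
died = [1, 0, 1, 0, 1, 1]

cutoff = median(scores)  # the text reports a training-set median of 4.14
groups = split_by_cutoff(scores, cutoff)
for label in ("high", "low"):
    t = [d for d, g in zip(days, groups) if g == label]
    e = [x for x, g in zip(died, groups) if g == label]
    print(label, kaplan_meier(t, e))
```

Plotting each returned curve as a step function, with time on the abscissa and survival probability on the ordinate, gives figures of the kind described for FIG. 2 and FIG. 3.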

Multi-risk model prediction was performed using the coxph function in the R package survival. The inputs to the function are the patient group and the patient survival time and status. The results were then tested using the log-rank test. For the training set, the p value was less than 0.0001, the 95% CI was [0.066-0.18], and the hazard ratio of the low risk group was 0.11. For the test set, the p value was 0.006, the 95% CI was [0.13-0.75], and the hazard ratio of the low risk group was 0.3.
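For readers unfamiliar with the significance test used here, the following is a minimal pure-Python sketch of a two-group log-rank test. It illustrates the test only and does not reproduce the hazard ratios or confidence intervals, which in the text come from the coxph function in R. The function name and the toy data are hypothetical.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value), 1 df."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]

    observed_minus_expected, variance = 0.0, 0.0
    for t in sorted({t for t, e, _ in data if e == 1}):
        n = sum(1 for x, _, _ in data if x >= t)                 # at risk, both groups
        n_a = sum(1 for x, _, g in data if x >= t and g == 0)    # at risk, group A
        d = sum(1 for x, e, _ in data if x == t and e == 1)      # deaths, both groups
        d_a = sum(1 for x, e, g in data if x == t and e == 1 and g == 0)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0  # no information to separate the groups
    chi2 = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail probability, 1 df
    return chi2, p_value


# Hypothetical toy groups: (days, death indicator).
chi2, p = logrank_test([60, 100, 250], [1, 1, 1], [700, 900, 1200], [1, 0, 0])
print(round(chi2, 3), round(p, 4))
```

A small p-value indicates that the survival curves of the two groups differ more than would be expected by chance.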

Fig. 2 shows the probability of survival situation for the high risk group and the low risk group of the training set. It can be seen that there are significant differences between the 2 groups of the training set, and the high risk group has a significantly lower probability of survival than the low risk group (P < 0.0001).

The C-index of the training set was calculated to be 0.83. The C-index, or concordance index, is used to evaluate the predictive power of the model; it is the proportion, among all pairs of patients, of pairs for which the predicted result is consistent with the actual result.
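The definition above can be made concrete with a short sketch of Harrell's concordance index. The handling of censoring and tied scores shown here (only pairs whose earlier time is a death are comparable; tied scores count one half) is a common convention and an assumption, since the text does not spell it out. The function name and toy data are hypothetical.

```python
def concordance_index(times, events, scores):
    """Fraction of comparable patient pairs that the risk score orders correctly."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: patient i died strictly before patient j's last follow-up.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0   # higher risk score died earlier
                elif scores[i] == scores[j]:
                    concordant += 0.5   # tie in score counts half
    return concordant / comparable


# Hypothetical toy data: one pair of patients is ranked in the wrong order.
days = [60, 100, 250, 700, 900, 1200]
died = [1, 1, 1, 1, 0, 0]
risk = [6.0, 4.8, 5.2, 3.5, 2.9, 3.1]

c_index = concordance_index(days, died, risk)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which puts the reported training-set value of 0.83 in context.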

Test set data verification of liver cancer prognosis model

To verify the constructed liver cancer prognosis model, the risk scores of the liver cancer patients in the test set were calculated by a similar procedure using the same formula and weight coefficients, and the test set was likewise divided into a high group and a low group using the same cutoff value, in order to verify the accuracy of the prognosis model based on the evaluation gene set of 62 genes. Fig. 3 shows the survival probability of the high risk group and the low risk group of the test set. As can be seen from Fig. 3, the survival probability of the high risk group is significantly lower than that of the low risk group (p = 0.006), i.e. the test set data confirm that the prognostic model is highly reliable. The C-index of the test set was calculated to be 0.7.

Example 2: predictive power comparison of selected signature gene sets to random gene sets

To further verify the validity of the selected evaluation gene set of 62 genes, another 62 genes were randomly selected from the 547 genes (excluding the 62 genes selected above) to form a "random gene set", which was compared with the selected "evaluation gene set"; the genes of the random gene set and their weight coefficients are shown in Table 3.

Table 3: The 62 genes of the random gene set and their weight coefficients (the table body is not reproduced in this text).
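Drawing such a control set can be sketched as follows. The gene identifiers are placeholders, the function name and seed are hypothetical, and the sketch covers only the random draw; how the weight coefficients of the random gene set were obtained is not shown here.

```python
import random


def draw_random_gene_set(candidate_genes, selected_genes, size=62, seed=0):
    """Draw `size` genes at random from the candidates, excluding the selected genes."""
    excluded = set(selected_genes)
    pool = sorted(g for g in set(candidate_genes) if g not in excluded)
    if len(pool) < size:
        raise ValueError("not enough remaining genes to draw from")
    return random.Random(seed).sample(pool, size)


# Placeholder identifiers standing in for the 547 immune-related genes.
all_genes = [f"GENE_{i:03d}" for i in range(547)]
selected = all_genes[:62]  # stand-in for the 62 genes of the evaluation gene set

random_set = draw_random_gene_set(all_genes, selected)
print(len(random_set), len(set(random_set) & set(selected)))  # 62 0
```

Excluding the selected genes before sampling keeps the control set disjoint from the evaluation gene set, so any predictive power it shows cannot come from shared genes.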

The patients were likewise divided into a training set (80%) and a test set (20%) according to the procedure described in Example 1, and the risk scores of the patients in the test set under the random model were calculated using each gene in the random gene set and its weight coefficient in Table 3. The risk score of the random gene set was calculated in a manner similar to that in Example 1. The C-index was calculated for the training set and the test set: the training set C-index was 0.86 and the test set C-index was 0.55.

The training set patients were likewise grouped into a high risk group and a low risk group by the median risk score of the training set (1.66 by computational analysis). FIG. 4 shows the survival probability of the high risk group and the low risk group of the training set for the "random gene set". It can be seen that there is a significant difference between the two groups of the training set, with the high risk group having a significantly lower survival probability than the low risk group (p < 0.0001).

However, when the same risk scoring formula and weight coefficients were used to calculate the risk scores of the liver cancer patients in the test set for the random gene set, and the test set was likewise divided into high and low groups by the median value (1.66) obtained from the above training set in order to verify the accuracy of the liver cancer prognosis model of the 62-gene "random gene set", the result was different. Fig. 5 shows the survival probability of the high risk group and the low risk group of the test set for the "random gene set"; it can be seen that the survival probability of the high risk group is not significantly different from that of the low risk group (p = 0.48). The verification on the test set shows that the random gene set cannot effectively predict the prognosis of liver cancer patients.

As shown by the comparison between the selected gene set and the random gene set, the evaluation gene set of 62 specific genes constructed by the invention can effectively predict the prognosis of liver cancer patients, whereas the randomly selected gene set cannot.

In order to accurately predict the prognostic risk of liver cancer patients, 62 immune-related genes were identified for predicting liver cancer prognosis. They can effectively distinguish the high risk group and the low risk group of liver cancer patients and can be developed into potential in vitro diagnostic products for predicting and detecting the prognosis of liver cancer patients, thereby enabling preventive medication or treatment and providing an accurate basis for further adjuvant treatment of liver cancer patients.

The above-mentioned embodiments express only several embodiments of the present invention, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. An evaluation gene set for predicting prognosis of a liver cancer patient, wherein the evaluation gene set comprises 62 genes as follows:

。

2. A kit for predicting prognosis of a patient with liver cancer, comprising a reagent for detecting the expression level of a gene; wherein the genes are 62 genes in the evaluation gene set of claim 1.

3. The kit of claim 2, further comprising one or more of nucleic acid extraction reagents, PCR reagents, genomic/transcriptome sequencing reagents, gene-specific primers or probes, antibodies specific for gene expression products.

4. A system for predicting prognosis in a patient with liver cancer, comprising the following modules:

a) a data collection module: collecting the sample of the patient, measuring the gene expression value of the sample, and outputting the expression value data of each gene to a model calculation module; wherein the genes are 62 genes in the evaluation gene set of claim 1;

b) a model calculation module: calculating the total expression value of 62 genes of the liver cancer patient, namely a Risk Score (Risk Score); the risk score calculation formula is as follows:

$$\text{Risk Score} = \sum_{i=1}^{n} E_i \times \mathrm{Coef}_i$$

wherein E_i is the expression value of each gene, Coef_i is the weight coefficient corresponding to each gene, and n is the total number of genes, namely 62;

wherein, each gene and the corresponding weight coefficient are as follows:

；

c) an output prediction module: predicting the prognosis of the patient according to the risk score data of the liver cancer patient, wherein the lower the risk score of the patient is, the better the prognosis is; and comparing the risk score with a defined value, and if the risk score is higher than the defined value, outputting that the prognosis is not good, and if the risk score is lower than the defined value, outputting that the prognosis is good.

5. The system of claim 4, the defined value being 4.14.

6. A computing device, comprising:

at least one processing unit; and

a) calculating a risk score for the patient based on the collected and determined expression values of the genes in the patient sample; the genes are 62 genes in the evaluation gene set of claim 1; the risk score calculation formula is as follows:

$$\text{Risk Score} = \sum_{i=1}^{n} E_i \times \mathrm{Coef}_i$$

wherein E_i is the expression value of each gene, Coef_i is the weight coefficient corresponding to each gene, and n is the total number of genes, namely 62; wherein each gene and the corresponding weight coefficient are shown in claim 4;

b) predicting the prognosis condition of the patient according to the risk score of the liver cancer patient, wherein the lower the risk score of the patient is, the better the prognosis is; and comparing the risk score with a defined value, if the risk score is higher than the defined value, the prognosis is predicted to be poor, and if the risk score is lower than the defined value, the prognosis is predicted to be good.

7. The computing device of claim 6, wherein the defined value is 4.14.

8. A computer-readable storage medium storing a computer program executable by a machine to perform steps for predicting prognosis of a patient with liver cancer, the steps comprising:

a) calculating a risk score for the patient based on the collected and determined gene expression values for the patient sample; wherein the genes are 62 genes in the evaluation gene set of claim 1; the risk score calculation formula is as follows:

9. The computer-readable storage medium of claim 8, wherein the defined value is 4.14.

10. Use of a reagent for detecting gene expression levels in the preparation of a kit or system for predicting prognosis in a patient with liver cancer; wherein the genes are 62 genes in the evaluation gene set of claim 1.

11. The use according to claim 10, wherein the kit is a kit according to claim 2 or 3; the system is the system of claim 4 or 5.