CN114496097A - Gastric cancer metabolic gene prognosis prediction method and device - Google Patents


Info

Publication number
CN114496097A
Authority
CN
China
Prior art keywords
gene
exp
gastric cancer
genes
prognosis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210104897.7A
Other languages
Chinese (zh)
Inventor
王鑫
许蜜蝶
盛伟琪
常瑾嘉
谭聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University Shanghai Cancer Center
Original Assignee
Fudan University Shanghai Cancer Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University Shanghai Cancer Center
Priority to CN202210104897.7A
Publication of CN114496097A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a method and a device for predicting gastric cancer prognosis from metabolic genes. The method comprises: obtaining RNA from a sample to be tested and detecting the expression levels of a plurality of genes, the plurality of genes comprising the DYNC1I1, GPER1, MFAP2, ARRB1, C3 and GLI1 genes; and calculating a gastric cancer prognosis risk score from the detected expression levels. Compared with the prior art, the invention is the first to predict the clinical prognosis of gastric cancer patients using the metabolism-related genes DYNC1I1, GPER1, MFAP2, ARRB1, C3 and GLI1, and it offers good predictive performance, few detection targets and high robustness.

Description

Gastric cancer metabolic gene prognosis prediction method and device
Technical Field
The invention relates to the technical field of gene detection, in particular to a gastric cancer metabolic gene prognosis prediction method and device.
Background
Gastric cancer is one of the most common malignancies of the digestive system. In recent decades, the incidence of gastric cancer has been decreasing in some areas due to effective preventive measures and early diagnostic strategies. However, cases of inoperable gastric cancer diagnosed at an advanced stage still have a poor prognosis. According to the data of GLOBOCAN 2018, gastric cancer ranks third in global cancer mortality, second only to lung cancer and colorectal cancer. Therefore, accurate prediction of clinical outcome of gastric cancer patients is still urgently needed for more individualized management.
Reprogrammed metabolism has long been recognized as a hallmark of cancer. Compared with normal cells, tumor cells may acquire and consume nutrients differently in order to gain and maintain malignant characteristics. The best-known feature of cancer metabolism is increased glycolysis and lactate production even in oxygen-rich microenvironments, known as the "Warburg effect". To date, glucose is widely recognized as the primary energy source for cancer cells. However, it is increasingly appreciated that the metabolic phenotype of cancer cells is highly heterogeneous. Some tumor cells rely primarily on glycolysis, while others rely on oxidative phosphorylation (OXPHOS). There is increasing evidence of a metabolic symbiosis between glycolysis and the OXPHOS pathway in tumor cells. For example, lactate and pyruvate produced by glycolysis can serve as substrates for the tricarboxylic acid (TCA) cycle and help produce adenosine triphosphate (ATP) in neighboring cells. In addition, non-glucose nutrients (e.g., free fatty acids and amino acids) can be used as alternative fuels to meet the energy demand of tumor cells. Since the complex metabolic profile of tumor cells can greatly affect a patient's clinical outcome, a more thorough understanding of cancer metabolism may be crucial for developing new therapies and identifying prognostic predictors.
The granted patent CN107586852B discloses a 22-gene prediction model for gastric cancer peritoneal metastasis and its application. The gene model comprises PCLO, UGGT1, ZNF714, KIAA0825, COL23A1, MED1, NPAS2, TTC14, RPS27A, ASPH, ARHGEF12, SIK1, PAPA, HHIPL1, MYO9B, ITPKB, ZNF862, MKNK1, MUC6, TRRAP, DUOX1 and KRRAP 52. An SVM classifier with a positive decision threshold of 0.5 is selected, and the peritoneal metastasis risk is predicted effectively and specifically on that basis.
In that scheme, the number of samples carrying SNVs in each gene is counted in the peritoneal metastasis group and the non-peritoneal metastasis group to obtain the ratio of positive rates between the two groups for each gene. Genes whose positive rates differ significantly between the two groups (p < 0.05) are screened by hypothesis testing, yielding 22 genes for predicting gastric cancer peritoneal metastasis. However, the number of genes selected by that scheme is still large, and the metabolic characteristics of gastric cancer are not explored in depth.
Disclosure of Invention
The present invention aims to overcome the above drawbacks of the prior art and to provide a gastric cancer metabolic gene prognosis prediction method and device with good predictive performance and a small number of genes to detect.
The purpose of the invention can be achieved by the following technical scheme:
a gastric cancer metabolic gene prognosis prediction method, comprising the following steps:
detecting the RNA expression levels of a plurality of genes in a sample to be tested, wherein the plurality of genes comprise a DYNC1I1 gene, a GPER1 gene, an MFAP2 gene, an ARRB1 gene, a C3 gene and a GLI1 gene;
and calculating the gastric cancer prognosis risk score according to the detected expression levels of the plurality of genes.
Further, the calculation expression of the gastric cancer prognosis risk score is as follows:
$$\mathrm{RiskScore}_6 = 0.38585\,\mathrm{exp}_{\mathrm{DYNC1I1}} + 0.10411\,\mathrm{exp}_{\mathrm{GPER1}} + 0.04476\,\mathrm{exp}_{\mathrm{MFAP2}} - 0.70386\,\mathrm{exp}_{\mathrm{ARRB1}} + 0.09187\,\mathrm{exp}_{\mathrm{C3}} + 0.21797\,\mathrm{exp}_{\mathrm{GLI1}}$$
in the formula, $\mathrm{RiskScore}_6$ is the gastric cancer prognosis risk score, and $\mathrm{exp}_{\mathrm{DYNC1I1}}$, $\mathrm{exp}_{\mathrm{GPER1}}$, $\mathrm{exp}_{\mathrm{MFAP2}}$, $\mathrm{exp}_{\mathrm{ARRB1}}$, $\mathrm{exp}_{\mathrm{C3}}$ and $\mathrm{exp}_{\mathrm{GLI1}}$ are the expression level results, taken with the natural constant e as the base, of the DYNC1I1, GPER1, MFAP2, ARRB1, C3 and GLI1 genes, respectively.
Further, the method also comprises: inputting the detected RNA expression levels of the plurality of genes into a pre-established and trained classifier, calculating the gastric cancer prognosis risk score, and assigning a sample to be tested to the high-risk group if its gastric cancer prognosis risk score is greater than a preset risk threshold, and otherwise to the low-risk group.
The present invention also provides a gastric cancer metabolic gene prognosis prediction device, including:
a data acquisition module configured to: obtaining RNA expression levels of a plurality of genes in a sample to be detected, wherein the plurality of genes comprise a DYNC1I1 gene, a GPER1 gene, an MFAP2 gene, an ARRB1 gene, a C3 gene and a GLI1 gene;
a gastric cancer prognosis risk score calculation module configured to: and calculating the gastric cancer prognosis risk score according to the detected expression levels of the multiple genes.
Further, the calculation expression of the gastric cancer prognosis risk score is as follows:
$$\mathrm{RiskScore}_6 = 0.38585\,\mathrm{exp}_{\mathrm{DYNC1I1}} + 0.10411\,\mathrm{exp}_{\mathrm{GPER1}} + 0.04476\,\mathrm{exp}_{\mathrm{MFAP2}} - 0.70386\,\mathrm{exp}_{\mathrm{ARRB1}} + 0.09187\,\mathrm{exp}_{\mathrm{C3}} + 0.21797\,\mathrm{exp}_{\mathrm{GLI1}}$$
in the formula, $\mathrm{RiskScore}_6$ is the gastric cancer prognosis risk score, and $\mathrm{exp}_{\mathrm{DYNC1I1}}$, $\mathrm{exp}_{\mathrm{GPER1}}$, $\mathrm{exp}_{\mathrm{MFAP2}}$, $\mathrm{exp}_{\mathrm{ARRB1}}$, $\mathrm{exp}_{\mathrm{C3}}$ and $\mathrm{exp}_{\mathrm{GLI1}}$ are the expression level results, taken with the natural constant e as the base, of the DYNC1I1, GPER1, MFAP2, ARRB1, C3 and GLI1 genes, respectively.
Further, the apparatus further comprises:
a risk classification module configured to: input the detected expression levels of the plurality of genes into a pre-established and trained classifier, calculate the gastric cancer prognosis risk score through the gastric cancer prognosis risk score calculation module, and assign a sample to be tested to the high-risk group if its gastric cancer prognosis risk score is greater than a preset risk threshold, and otherwise to the low-risk group.
Compared with the prior art, the invention has the following advantages:
(1) The invention is the first to predict the clinical prognosis of gastric cancer patients using the metabolism-related genes DYNC1I1, GPER1, MFAP2, ARRB1, C3 and GLI1.
The 6-gene model is robust and delivers stable predictive performance across data sets from different platforms. It achieves a good AUC in both the training and validation sets and is a clinically independent model, suggested for use as a molecular diagnostic test to assess prognostic risk in gastric cancer patients.
(2) The invention selects key prognostic factors of gastric cancer from 587 energy metabolism genes and constructs a robust 6-gene metabolism-related model for survival prediction in gastric cancer patients. The model was trained and validated on 339 gastric adenocarcinoma samples from The Cancer Genome Atlas (TCGA) data set and externally validated on 300 tumor samples from the GSE62254 data set in the GEO database. Finally, the model was compared with previously published gastric cancer transcriptome prognosis prediction models built on other research backgrounds, and it showed the best prognostic performance.
Drawings
FIG. 1 is a schematic flow chart of a method for predicting prognosis of gastric cancer metabolic gene according to an embodiment of the present invention;
FIG. 2 is a schematic diagram showing the results of a process for identifying molecular subtypes using the NMF algorithm according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the process results of a WGCNA co-expression analysis provided in an embodiment of the invention;
FIG. 4 is an enlarged view of a portion of the area E in FIG. 3;
FIG. 5 is a schematic diagram illustrating the results of training-set risk model construction and ROC analysis of the risk model provided in an embodiment of the present invention;
FIG. 6 is a diagram illustrating a result of a process for verifying the robustness of a 6-gene signature by an internal data set according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating a result of a process for validating the robustness of a 6-gene signature by an external data set according to an embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating a comparison result between a risk model and other models provided in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings or the orientations or positional relationships that the products of the present invention are conventionally placed in use, and are only used for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the devices or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
It should be noted that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Furthermore, the terms "horizontal", "vertical" and the like do not imply that the components are required to be absolutely horizontal or pendant, but rather may be slightly inclined. For example, "horizontal" merely means that the direction is more horizontal than "vertical" and does not mean that the structure must be perfectly horizontal, but may be slightly inclined.
Example 1
As shown in fig. 1, the present embodiment provides a method for predicting prognosis of gastric cancer metabolic gene, comprising the following steps:
S1: obtaining a sample to be tested, and detecting the RNA expression levels of a plurality of genes, wherein the plurality of genes comprise a DYNC1I1 gene, a GPER1 gene, an MFAP2 gene, an ARRB1 gene, a C3 gene and a GLI1 gene;
S2: calculating the gastric cancer prognosis risk score according to the detected expression levels of the plurality of genes.
Most preferably, the calculated expression for the gastric cancer prognosis risk score is:
$$\mathrm{RiskScore}_6 = 0.38585\,\mathrm{exp}_{\mathrm{DYNC1I1}} + 0.10411\,\mathrm{exp}_{\mathrm{GPER1}} + 0.04476\,\mathrm{exp}_{\mathrm{MFAP2}} - 0.70386\,\mathrm{exp}_{\mathrm{ARRB1}} + 0.09187\,\mathrm{exp}_{\mathrm{C3}} + 0.21797\,\mathrm{exp}_{\mathrm{GLI1}}$$
in the formula, $\mathrm{RiskScore}_6$ is the gastric cancer prognosis risk score, and $\mathrm{exp}_{\mathrm{DYNC1I1}}$, $\mathrm{exp}_{\mathrm{GPER1}}$, $\mathrm{exp}_{\mathrm{MFAP2}}$, $\mathrm{exp}_{\mathrm{ARRB1}}$, $\mathrm{exp}_{\mathrm{C3}}$ and $\mathrm{exp}_{\mathrm{GLI1}}$ are the expression level results, taken with the natural constant e as the base, of the DYNC1I1, GPER1, MFAP2, ARRB1, C3 and GLI1 genes, respectively.
As an optional implementation, the method further comprises: and loading the detected expression levels of the multiple genes into a pre-established and trained classifier, calculating gastric cancer prognosis risk scores, and dividing the samples to be detected with the gastric cancer prognosis risk scores larger than a preset risk threshold value into high risk groups, or else, dividing the samples into low risk groups.
As a preferred embodiment, the method further comprises:
in this example, after RiskScore6 is calculated, a Z-score transformation (data normalization) is applied to RiskScore6 to obtain the risk score; samples with a risk score greater than zero are assigned to the high-risk group and samples with a risk score less than zero to the low-risk group.
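The scoring and Z-score grouping described above can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation: the function names are hypothetical, the use of the population standard deviation is an assumption (the text does not name an estimator), and a sample whose Z-score is exactly zero is placed in the low-risk group by assumption, since the text only covers the cases above and below zero.

```python
from statistics import mean, pstdev

# Coefficients copied from the RiskScore6 formula in the text.
COEFFICIENTS = {
    "DYNC1I1": 0.38585,
    "GPER1": 0.10411,
    "MFAP2": 0.04476,
    "ARRB1": -0.70386,
    "C3": 0.09187,
    "GLI1": 0.21797,
}


def risk_score6(expression):
    """Weighted sum of the six expression values for one sample."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def stratify(samples):
    """Z-score the raw scores across a cohort and split the cohort at zero.

    `samples` maps a sample ID to a dict of per-gene expression values.
    Returns {sample_id: (z_score, "high" or "low")}.
    """
    raw = {sid: risk_score6(expr) for sid, expr in samples.items()}
    mu = mean(raw.values())
    sigma = pstdev(raw.values())  # assumed estimator (population SD)
    if sigma == 0:
        raise ValueError("all samples have the same raw score")
    groups = {}
    for sid, score in raw.items():
        z = (score - mu) / sigma
        # z == 0 is not covered by the text; treated as low risk here.
        groups[sid] = (z, "high" if z > 0 else "low")
    return groups
```

Because the Z-score is computed within a cohort, the zero cut-off is relative to that cohort's mean, so the same raw score may fall in different groups in different data sets.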
The present embodiment also provides a gastric cancer metabolic gene prognosis prediction apparatus, including:
a data acquisition module configured to: obtain the RNA expression levels of a plurality of genes in a sample to be tested, wherein the plurality of genes comprise a DYNC1I1 gene, a GPER1 gene, an MFAP2 gene, an ARRB1 gene, a C3 gene and a GLI1 gene;
a gastric cancer prognosis risk score calculation module configured to: and calculating the gastric cancer prognosis risk score according to the detected expression levels of the multiple genes.
Most preferably, the calculated expression for the gastric cancer prognosis risk score is:
$$\mathrm{RiskScore}_6 = 0.38585\,\mathrm{exp}_{\mathrm{DYNC1I1}} + 0.10411\,\mathrm{exp}_{\mathrm{GPER1}} + 0.04476\,\mathrm{exp}_{\mathrm{MFAP2}} - 0.70386\,\mathrm{exp}_{\mathrm{ARRB1}} + 0.09187\,\mathrm{exp}_{\mathrm{C3}} + 0.21797\,\mathrm{exp}_{\mathrm{GLI1}}$$
in the formula, $\mathrm{RiskScore}_6$ is the gastric cancer prognosis risk score, and $\mathrm{exp}_{\mathrm{DYNC1I1}}$, $\mathrm{exp}_{\mathrm{GPER1}}$, $\mathrm{exp}_{\mathrm{MFAP2}}$, $\mathrm{exp}_{\mathrm{ARRB1}}$, $\mathrm{exp}_{\mathrm{C3}}$ and $\mathrm{exp}_{\mathrm{GLI1}}$ are the expression level results, taken with the natural constant e as the base, of the DYNC1I1, GPER1, MFAP2, ARRB1, C3 and GLI1 genes, respectively.
As an optional implementation, the apparatus further comprises:
a risk classification module configured to: and loading the detected expression levels of the multiple genes into a pre-established and trained classifier, calculating a gastric cancer prognosis risk score through a gastric cancer prognosis risk score calculation module, and dividing the sample to be detected with the gastric cancer prognosis risk score larger than a preset risk threshold into a high risk group, or else, dividing the sample into a low risk group.
In this example, the process of screening the six genes was as follows:
1. Data download
The latest clinical follow-up information was downloaded using the TCGA GDC API, and the GSE62254 microarray expression data in MINiML format was downloaded from NCBI. GSE62254 contains 300 samples with clinical characteristics.
Sources of expression profile data: TCGA RNA-Seq data and clinical follow-up information. GEO validation data: GSE62254.
Sources of energy metabolism-related genes: 11 human metabolism-related pathways were downloaded from Reactome (https://reactome.org/), from which a total of 587 genes involved in energy metabolism were compiled.
2. Data pre-processing
2.1 Preprocessing of TCGA data
The TCGA RNA-seq data was preprocessed in the following steps:
1) removing samples with no clinical data or with PFS < 30 days;
2) removing normal tissue sample data;
3) removing genes whose FPKM is 0 in 30% of the samples;
4) retaining the expression profiles of the energy metabolism-related genes.
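The four filtering steps above can be sketched in plain Python. The data layout (nested dictionaries), the function name and the parameter names are hypothetical, and two points are assumptions: the 30% rule is read as "zero FPKM in at least 30% of the retained samples", and a missing value is counted as zero.

```python
def preprocess_tcga(expression, clinical, normal_ids, metabolism_genes,
                    min_pfs_days=30, max_zero_fraction=0.30):
    """Filter an FPKM table following the four steps in the text.

    expression:       {gene: {sample_id: FPKM}}
    clinical:         {sample_id: PFS in days}; samples lacking clinical
                      data are simply absent from this dict
    normal_ids:       set of sample IDs taken from normal tissue
    metabolism_genes: set of energy metabolism-related gene symbols
    """
    all_samples = set()
    for values in expression.values():
        all_samples.update(values)

    # Steps 1 and 2: keep tumor samples with clinical data and PFS >= 30 days.
    kept = sorted(
        s for s in all_samples
        if s in clinical
        and clinical[s] >= min_pfs_days
        and s not in normal_ids
    )
    if not kept:
        return {}

    filtered = {}
    for gene, values in expression.items():
        # Step 4: keep only energy metabolism-related genes.
        if gene not in metabolism_genes:
            continue
        # Step 3: drop genes with zero FPKM in too many retained samples
        # (threshold interpretation is an assumption, see the lead-in).
        zeros = sum(1 for s in kept if values.get(s, 0.0) == 0.0)
        if zeros / len(kept) >= max_zero_fraction:
            continue
        filtered[gene] = {s: values.get(s, 0.0) for s in kept}
    return filtered
```

Steps 3 and 4 are applied per gene, so running them in either order gives the same table.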
2.2 GEO data preprocessing
The GSE62254 data was preprocessed in the following steps:
1) removing normal tissue sample data;
2) converting PFS data recorded in years or months into days;
3) removing samples with PFS < 30 days;
4) mapping the microarray probes to human gene symbols using a Bioconductor package.
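Steps 2) and 3) can be illustrated with a short Python sketch. The conversion factors (30 days per month, 365 days per year) are assumptions, since the text does not state which factors were used, and the function names are hypothetical.

```python
# Assumed conversion factors; the text does not specify them.
DAYS_PER_UNIT = {"day": 1.0, "month": 30.0, "year": 365.0}


def pfs_in_days(value, unit):
    """Convert a PFS value recorded in days, months or years into days."""
    return value * DAYS_PER_UNIT[unit]


def keep_samples(pfs_records, min_days=30):
    """Return {sample_id: PFS in days} for samples with PFS >= min_days.

    pfs_records maps a sample ID to a (value, unit) pair.
    """
    days = {sid: pfs_in_days(v, u) for sid, (v, u) in pfs_records.items()}
    return {sid: d for sid, d in days.items() if d >= min_days}
```

For example, a sample recorded as 0.5 months becomes 15 days under these factors and is removed by the 30-day cut-off.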
3. Molecular typing based on energy metabolism genes
3.1 Identification of molecular subtypes using the NMF algorithm
In this embodiment, the expression levels of the 587 energy metabolism genes were extracted from the TCGA expression profile data. One gene was not found, and after the 30%-of-samples expression filter was applied, 584 genes were used for subsequent analysis. Next, univariate Cox analysis was performed with the coxph function in R, yielding 86 genes associated with gastric cancer PFS (p < 0.05), and the STAD samples were clustered by non-negative matrix factorization (NMF) using the "brunet" method with 50 iterations. The number of clusters k was varied from 2 to 10, the average silhouette width of the consensus membership matrix was determined with the R package "NMF", and the minimum number of members per subclass was set to 10. The optimal number of clusters was determined from indices such as cophenetic and rss, and was selected as 2 (FIG. 2A).
Further analysis of the prognostic relationship between the two clusters showed that the prognosis of the Cluster1 samples was poor and differed significantly from that of Cluster2 (FIG. 2B, log-rank p = 0.025). The expression of the energy metabolism-related genes in the two subclasses is shown in FIG. 2C: most energy metabolism-related genes are expressed at higher levels in Cluster2 than in Cluster1.
4. Analysis of co-expressed genes between subtypes
4.1 WGCNA Co-expression analysis
The WGCNA co-expression algorithm was used to mine co-expressed coding genes and co-expression modules from the expression profiles of the 584 coding genes. First, the expression profiles of the protein-coding genes were extracted from the TPM data in the TCGA database, and the samples were hierarchically clustered to remove 1 outlier sample (FIG. 3A). The Pearson correlation coefficient was then used to calculate the distance between each pair of genes, and the R package WGCNA was used to construct a weighted co-expression network and screen co-expression modules. Studies have shown that co-expression networks conform to a scale-free topology, that is, the logarithm log(k) of the connectivity k of a node is negatively correlated with the logarithm log(P(k)) of the probability that such a node occurs, with a correlation coefficient greater than 0.8. To ensure that the network is scale-free, this embodiment selects the soft threshold β = 8 (FIG. 3B-C). Next, the expression matrix was converted into an adjacency matrix, and the adjacency matrix was converted into a topological overlap matrix (TOM). Based on the TOM, the genes were clustered by average-linkage hierarchical clustering following the hybrid dynamic tree cut criterion, with the minimum number of genes per module set to 30. After the gene modules were determined by dynamic tree cutting, the eigengene of each module was calculated, the modules were clustered, and modules close to one another were merged into new modules, with height = 0.25, deepSplit = 2 and minModuleSize = 30. A total of 29 modules were obtained (FIG. 3D); the grey module is the set of genes that could not be assigned to any other module.
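The step from Pearson correlation to a soft-thresholded adjacency can be sketched in Python as below. This is only an illustration of the idea, not the WGCNA package itself: it assumes an unsigned network, in which the adjacency of two genes is the absolute correlation raised to the power β, and the text does not say whether a signed or unsigned network was used. The function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("constant profile has no defined correlation")
    return cov / sqrt(vx * vy)


def soft_threshold_adjacency(profiles, beta=8):
    """Return {(gene_a, gene_b): |cor|**beta} for every gene pair a < b.

    profiles maps a gene symbol to its expression values across samples.
    beta=8 is the soft threshold reported in the text; the unsigned form
    |cor|**beta is an assumption.
    """
    genes = sorted(profiles)
    adjacency = {}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            adjacency[(g, h)] = abs(pearson(profiles[g], profiles[h])) ** beta
    return adjacency
```

Raising the correlation to a power larger than 1 suppresses weak correlations much more than strong ones (a correlation of 0.8 becomes roughly 0.17 at β = 8), which is what pushes the network toward a scale-free degree distribution.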
This embodiment further analyzed the correlation of each module with patient gender, age, T, N, M, stage, Cluster1 and Cluster2, as shown in FIG. 3B. The modules significantly correlated with Cluster1 and Cluster2 are yellow and brown, respectively; the yellow module contains 667 genes and the brown module contains 1046 genes. Finally, the module-phenotype correlations were compared with the correlations between the genes within each module and the phenotype in order to infer the relationship between genes and phenotype: the higher the correlation, the more closely the phenotype is related to the module. The analysis shows that the yellow module correlates well with Cluster1 and the brown module correlates well with Cluster2 (FIG. 3E and FIG. 4).
4.3 Module gene PPI network construction and network topology property analysis
Because studying protein interaction networks helps to mine core key genes, and in order to identify potential regulatory genes of gastric cancer, this embodiment maps the gene sets of the subtype-related modules obtained by screening onto the human PPI network and extracts their interaction relationships. The STRING database (https://string-db.org/) contains a comprehensive protein interaction network to date. To observe the relationships among the genes, this embodiment maps the genes to the STRING database, keeps interactions with a score greater than 0.9, and visualizes them with Cytoscape. The 1713 co-expressed genes map to 3585 interactions. Hub nodes were then identified with the cytoHubba plugin in Cytoscape using three methods, Degree, Closeness and Betweenness; the hub genes obtained by the three methods are largely consistent.
This embodiment further examines the topological properties of the network. First, the degree distribution of the network follows a power law, and most genes have a degree smaller than 5. Next, the Closeness of the network was calculated, and the Closeness of most nodes is high overall, essentially above 100. Finally, the Betweenness distribution of the network was calculated, with most nodes at 0. Nodes with high Degree, Closeness or Betweenness are considered important nodes in the network. In this embodiment, nodes whose Degree, Closeness and Betweenness all exceed the respective median values were selected as the hub genes of the pathway network. These comprise 220 genes, which are considered closely related to the occurrence and development of gastric cancer and may serve as prognostic markers of STAD.
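The median-based hub selection rule can be written compactly in Python. The sketch below assumes the three centrality values have already been computed (for example, exported from cytoHubba) and are available as dictionaries keyed by gene symbol; the function name and input layout are hypothetical.

```python
from statistics import median


def select_hub_genes(degree, closeness, betweenness):
    """Return genes whose three centralities all exceed their medians.

    Each argument maps a gene symbol to one precomputed centrality value.
    Only genes present in all three dictionaries are considered.
    """
    genes = set(degree) & set(closeness) & set(betweenness)
    d_med = median(degree[g] for g in genes)
    c_med = median(closeness[g] for g in genes)
    b_med = median(betweenness[g] for g in genes)
    return sorted(
        g for g in genes
        if degree[g] > d_med
        and closeness[g] > c_med
        and betweenness[g] > b_med
    )
```

Requiring all three conditions at once is stricter than ranking on any single measure, which is consistent with only 220 of the mapped genes being retained.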
5. Construction of a hub genes-based prognostic risk model
5.1 training set risk model construction and ROC analysis of risk models
First, 50% of the 339 preprocessed TCGA samples were randomly selected as a training set for model construction. A univariate Cox proportional hazards regression model was fitted for each hub gene against the survival data using the coxph function of the R package survival, with log-rank p < 0.05 as the threshold, which yielded 8 genes with significant prognostic differences. To further narrow the gene range while maintaining high accuracy, this embodiment used the R package glmnet to perform lasso Cox regression. The coefficient trajectory of each independent variable was analyzed first (FIG. 5A): as lambda gradually increases, the number of coefficients shrinking to 0 gradually increases. The model was built with 10-fold cross-validation and the confidence interval under each lambda was analyzed; FIG. 5B shows that the model is optimal at lambda = 0.01876213. The 6 genes retained at lambda = 0.01876213 were selected as the final model. The final 6-mRNA signature formula is as follows:
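$$\mathrm{RiskScore}_6 = 0.38585\,\mathrm{exp}_{\mathrm{DYNC1I1}} + 0.10411\,\mathrm{exp}_{\mathrm{GPER1}} + 0.04476\,\mathrm{exp}_{\mathrm{MFAP2}} - 0.70386\,\mathrm{exp}_{\mathrm{ARRB1}} + 0.09187\,\mathrm{exp}_{\mathrm{C3}} + 0.21797\,\mathrm{exp}_{\mathrm{GLI1}}$$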
The risk score of each sample was calculated from its expression levels, and the RiskScore distribution of the samples was plotted (FIG. 5C). The PFS of samples with a high risk score is markedly shorter than that of samples with a low risk score, indicating that samples with a high risk score have a worse prognosis. The change in expression of the 6 signature genes with increasing risk value shows that high expression of DYNC1I1, GPER1, MFAP2, C3 and GLI1 is associated with high risk, making them risk factors, whereas ARRB1 is a protective factor. Further, the R package timeROC was used for ROC analysis of the prognostic classification by RiskScore, and the classification performance at one, three and five years was analyzed; the model has a high area under the curve (AUC), above 0.70 (FIG. 5D). Finally, RiskScore was z-score transformed, samples with a transformed RiskScore greater than zero were assigned to the high-risk group and samples with a value less than zero to the low-risk group, and the KM curves were plotted (FIG. 5E). The two groups differ very significantly, with log-rank p = 2E-04 and HR = 2.663 (1.558-4.551); 79 samples were assigned to the high-risk group and 91 to the low-risk group.
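For readers unfamiliar with KM curves, the Kaplan-Meier product-limit estimator behind such plots can be sketched in a few lines of Python. This is a generic textbook estimator written for illustration, not the R code used in this embodiment; the function name and input layout are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate at each distinct event time.

    times:  follow-up durations (for example, PFS in days)
    events: 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival_probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # events at time t
        leaving = 0   # events plus censored samples at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

Computing one curve for the high-risk group and one for the low-risk group gives the two step functions that a KM plot compares; the log-rank test and hazard ratio reported above are separate calculations not shown here.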
5.2, verifying the robustness of the 6-gene signature by an internal data set
To determine the robustness of the model, this embodiment applied the same model and the same coefficients as in the training set to the internal test set. The risk score of each sample was calculated from its expression levels (fig. 6A), and the R package timeROC was used to perform ROC analysis of the prognostic classification by RiskScore in the internal test set, analyzing the classification efficiency of prognosis prediction at one year, two years and three years (fig. 6B). The model has a high AUC: the 1-year AUC in the internal test set is 0.71, and the 5-year AUC in the full TCGA data set is 0.72. Finally, this embodiment also applied a z-score transformation to the RiskScore of the data set, assigned samples with a transformed RiskScore greater than zero to the high-risk group and samples with a value less than zero to the low-risk group, and plotted the KM curves (fig. 6C). In the internal test set, log-rank p = 0.01 and HR = 2.04 (1.169-3.561); 82 samples were assigned to the high-risk group and 87 samples to the low-risk group.
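The internal test set is the half of the 339 TCGA samples that was not drawn for training. A minimal sketch of such a random split follows, in Python rather than the R workflow of this embodiment. The sample identifiers, the seed and the function name are hypothetical, and rounding the training half up is an assumption chosen because it reproduces the reported group sizes (79 + 91 = 170 training samples, 82 + 87 = 169 test samples).

```python
import random


def split_half(sample_ids, seed=0):
    """Randomly assign half of the samples (rounded up) to training, the rest to test."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # assumption: fixed seed for reproducibility
    n_train = (len(ids) + 1) // 2     # assumption: round the training half up
    return ids[:n_train], ids[n_train:]


# Hypothetical sample identifiers standing in for the 339 TCGA samples.
sample_ids = [f"S{i:03d}" for i in range(339)]
train_ids, test_ids = split_half(sample_ids)
```

The coefficients fitted on `train_ids` are then applied unchanged to `test_ids`, which is what keeps the internal validation independent of model fitting.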
5.3, verifying the robustness of the 6-gene signature by an external data set
The same model and the same coefficients as in the training set were applied to the GEO external validation set. The risk score of each sample was calculated from its expression levels (fig. 7A), and the RiskScore distribution of the samples was plotted. The figure shows that the PFS of samples with a high risk score is significantly shorter than that of samples with a low risk score, which suggests that samples with a high RiskScore have a worse prognosis. The change in expression of the 6 model genes with increasing risk value shows that high expression of DYNC1I1, GPER1, MFAP2, C3 and GLI1 is associated with high risk, so these genes are risk factors, whereas high expression of ARRB1 is associated with low risk, so it is a protective factor. Further, this embodiment used the R package timeROC to perform ROC analysis of the prognostic classification by RiskScore and analyzed the classification efficiency of prognosis prediction at one year, three years and five years. The model has a high AUC, with a five-year AUC of 0.70 (fig. 7B). Finally, this embodiment also applied a z-score transformation to the RiskScore, assigned samples with a transformed RiskScore greater than zero to the high-risk group and samples with a value less than zero to the low-risk group, and plotted the KM curves (fig. 7C). The two groups differ very significantly, with log-rank p < 0.0001 and HR = 2.358 (1.693-3.283); 152 samples were assigned to the high-risk group and 148 samples to the low-risk group.
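The KM curves compared between the high-risk and low-risk groups are Kaplan-Meier survival estimates. The following is a minimal sketch of the estimator itself, written in Python for illustration rather than taken from the R workflow of this embodiment; the function name and the toy follow-up data are hypothetical, and the log-rank test and hazard ratio are not reproduced here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if progression or death was observed at times[i],
    and 0 if the sample was censored at that time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0
        leaving = 0
        # Group all samples that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            survival *= 1.0 - observed / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving  # events and censored samples both leave the risk set
    return curve


# Hypothetical follow-up data for one risk group (not study data).
times = [1, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
```

Running the estimator once per risk group gives the two step curves whose separation the log-rank test then quantifies.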
6. Comparison of risk models with other models
Through a review of the literature, three prognosis-related risk models were ultimately selected in this embodiment: the 5-gene model of Wang et al [ PMID:23912700DOI:10.1007/s 12032-013-. To make the models comparable, this embodiment calculated the risk score of each STAD sample in TCGA by the same method, using the corresponding genes of these 3 models, evaluated the ROC of each model, divided the samples into Risk-H and Risk-L groups at the median risk score, and calculated the difference in PFS prognosis between the two groups. The ROC and PFS KM curves of the 3 models show that all three perform worse than the 6-gene model of this embodiment, as shown in fig. 8A-C. The corrected mean survival curves of the models were further compared, as shown in fig. 8D: the model of this embodiment has the highest C-index among the 4 models and is more advantageous in long-term survival prediction. This embodiment also compared the predictive performance of the 6-gene model with that of the 3 other models using DCA curves; the results show that the model of this embodiment outperforms the other 3 models, as shown in fig. 8E.
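Two of the comparison steps above, the median split into Risk-H and Risk-L groups and the C-index, can be illustrated with a short sketch. It is written in Python and uses Harrell's definition of the C-index, which is an assumption since the text does not say which variant was computed; the function names, the rule for samples exactly at the median, and the toy data are likewise hypothetical and are not from the study.

```python
from statistics import median


def concordance_index(times, events, scores):
    """Harrell's C-index: fraction of comparable pairs that are ranked correctly.

    A pair (i, j) is comparable when sample i has an observed event
    (events[i] == 1) at a time strictly earlier than times[j]. It is
    concordant when sample i also has the higher risk score; tied
    scores count as half.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if events[i] != 1:
            continue
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def median_split(scores):
    """Label samples 'Risk-H' above the median score, otherwise 'Risk-L'."""
    cutoff = median(scores)
    # Assumption: samples exactly at the median are placed in Risk-L.
    return ["Risk-H" if s > cutoff else "Risk-L" for s in scores]


# Hypothetical toy data (PFS time, event flag, risk score); not study data.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
scores = [3.0, 2.0, 1.5, 0.5]
c_index = concordance_index(times, events, scores)
labels = median_split(scores)
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so computing it with the same function for each of the 4 models puts them on a common scale.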
To summarize:
1. The 339 STAD samples of TCGA were typed based on 584 energy metabolism-related genes and divided into 2 subtypes, which differ significantly in prognosis.
2. WGCNA analysis showed that the yellow and brown modules have the highest correlation with the Cluster1 and Cluster2 subtypes, respectively. A PPI interaction network was constructed based on the co-expressed genes, the topological properties of the network were analyzed, and hub genes were identified.
3. A 6-gene signature PFS prognosis model was constructed based on the hub genes.
4. The 6-gene signature is robust and gives stable predictions in data sets from different platforms.
5. In summary, this embodiment developed a 6-gene signature prognostic stratification system that has a good AUC in both the training and validation sets and is independent of clinical features. This embodiment therefore suggests using this classifier as a molecular diagnostic test to assess the prognostic risk of gastric cancer patients.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (6)

1. A gastric cancer metabolic gene prognosis prediction method is characterized by comprising the following steps:
obtaining genome RNA of a sample to be detected, and detecting the expression levels of a plurality of genes, wherein the plurality of genes comprise a DYNC1I1 gene, a GPER1 gene, an MFAP2 gene, an ARRB1 gene, a C3 gene and a GLI1 gene;
and calculating the gastric cancer prognosis risk score according to the detected expression levels of the multiple genes.
2. The method for predicting gastric cancer metabolic gene prognosis according to claim 1, wherein the expression for calculating the gastric cancer prognosis risk score is as follows:
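$$\mathrm{RiskScore}_{6} = 0.38585\,\mathrm{exp}_{\mathrm{DYNC1I1}} + 0.10411\,\mathrm{exp}_{\mathrm{GPER1}} + 0.04476\,\mathrm{exp}_{\mathrm{MFAP2}} - 0.70386\,\mathrm{exp}_{\mathrm{ARRB1}} + 0.09187\,\mathrm{exp}_{\mathrm{C3}} + 0.21797\,\mathrm{exp}_{\mathrm{GLI1}}$$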
in the formula, $\mathrm{RiskScore}_{6}$ is the gastric cancer prognosis risk score, $\mathrm{exp}_{\mathrm{DYNC1I1}}$ is the expression level result of the DYNC1I1 gene with the natural constant e as base, $\mathrm{exp}_{\mathrm{GPER1}}$ is the expression level result of the GPER1 gene with the natural constant e as base, $\mathrm{exp}_{\mathrm{MFAP2}}$ is the expression level result of the MFAP2 gene with the natural constant e as base, $\mathrm{exp}_{\mathrm{ARRB1}}$ is the expression level result of the ARRB1 gene with the natural constant e as base, $\mathrm{exp}_{\mathrm{C3}}$ is the expression level result of the C3 gene with the natural constant e as base, and $\mathrm{exp}_{\mathrm{GLI1}}$ is the expression level result of the GLI1 gene with the natural constant e as base.
3. The method for predicting gastric cancer metabolic gene prognosis according to claim 1, further comprising: loading the detected expression levels of the multiple genes into a pre-established and trained classifier, calculating the gastric cancer prognosis risk score, and assigning a sample to be detected whose gastric cancer prognosis risk score is greater than a preset risk threshold to the high-risk group, and otherwise to the low-risk group.
4. A gastric cancer metabolic gene prognosis prediction device is characterized by comprising:
a data acquisition module configured to: obtain RNA expression levels of a plurality of genes in a sample to be detected, wherein the plurality of genes comprise a DYNC1I1 gene, a GPER1 gene, an MFAP2 gene, an ARRB1 gene, a C3 gene and a GLI1 gene;
a gastric cancer prognosis risk score calculation module configured to: calculate the gastric cancer prognosis risk score according to the detected expression levels of the multiple genes.
5. The gastric cancer metabolic gene prognosis prediction device according to claim 4, wherein the calculation expression of the gastric cancer prognosis risk score is as follows:
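$$\mathrm{RiskScore}_{6} = 0.38585\,\mathrm{exp}_{\mathrm{DYNC1I1}} + 0.10411\,\mathrm{exp}_{\mathrm{GPER1}} + 0.04476\,\mathrm{exp}_{\mathrm{MFAP2}} - 0.70386\,\mathrm{exp}_{\mathrm{ARRB1}} + 0.09187\,\mathrm{exp}_{\mathrm{C3}} + 0.21797\,\mathrm{exp}_{\mathrm{GLI1}}$$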
in the formula, $\mathrm{RiskScore}_{6}$ is the gastric cancer prognosis risk score, $\mathrm{exp}_{\mathrm{DYNC1I1}}$ is the expression level result of the DYNC1I1 gene with the natural constant e as base, $\mathrm{exp}_{\mathrm{GPER1}}$ is the expression level result of the GPER1 gene with the natural constant e as base, $\mathrm{exp}_{\mathrm{MFAP2}}$ is the expression level result of the MFAP2 gene with the natural constant e as base, $\mathrm{exp}_{\mathrm{ARRB1}}$ is the expression level result of the ARRB1 gene with the natural constant e as base, $\mathrm{exp}_{\mathrm{C3}}$ is the expression level result of the C3 gene with the natural constant e as base, and $\mathrm{exp}_{\mathrm{GLI1}}$ is the expression level result of the GLI1 gene with the natural constant e as base.
6. The gastric cancer metabolic gene prognosis prediction device according to claim 5, wherein the device further comprises:
a risk classification module configured to: load the detected expression levels of the multiple genes into a pre-established and trained classifier, calculate the gastric cancer prognosis risk score through the gastric cancer prognosis risk score calculation module, and assign a sample to be detected whose gastric cancer prognosis risk score is greater than a preset risk threshold to the high-risk group, and otherwise to the low-risk group.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210104897.7A CN114496097A (en) 2022-01-28 2022-01-28 Gastric cancer metabolic gene prognosis prediction method and device

Publications (1)

Publication Number Publication Date
CN114496097A 2022-05-13

Family

ID=81477509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210104897.7A Pending CN114496097A (en) 2022-01-28 2022-01-28 Gastric cancer metabolic gene prognosis prediction method and device

Country Status (1)

Country Link
CN (1) CN114496097A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116798632A (en) * 2023-07-13 2023-09-22 山东第一医科大学附属省立医院(山东省立医院) Stomach cancer molecular typing and prognosis prediction model construction method based on metabolic genes and application
CN116798632B (en) * 2023-07-13 2024-04-30 山东第一医科大学附属省立医院(山东省立医院) Stomach cancer molecular typing and prognosis prediction model construction method based on metabolic genes and application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination