CN114283885A - Method for constructing diagnosis model of prostate cancer - Google Patents

Method for constructing diagnosis model of prostate cancer

Info

Publication number
CN114283885A
Authority
CN
China
Prior art keywords
pca
genes
model
constructing
gene expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111603645.0A
Other languages
Chinese (zh)
Inventor
罗艺灵
佟延秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Medical University
Original Assignee
Chongqing Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Medical University
Priority to CN202111603645.0A
Publication of CN114283885A
Legal status: Pending

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for constructing a diagnosis model of prostate cancer (PCa), comprising the following steps: step1) acquiring gene expression profile data of PCa; step2) performing differential expression analysis on the PCa gene expression profile data and screening out the differential genes in PCa; step3) screening key genes from the differential genes in PCa with the GAE (graph autoencoder) algorithm, a machine learning method; step4) obtaining 10 highly expressed genes and 6 lowly expressed genes by PPI analysis of the GAE results; step5) establishing a prognosis model through single-factor and multi-factor regression analysis; step6) constructing a PCa diagnosis model from the prognosis model parameters; step7) validating the PCa diagnosis model. The invention constructs and verifies a PCa diagnosis model built from 4 genes, which provides a basis for the personalized, precise treatment of PCa patients.

Description

Method for constructing diagnosis model of prostate cancer
Technical Field
The invention relates to the field of medicine, in particular to a method for constructing a diagnosis model of prostate cancer.
Background
Prostate cancer (PCa) is the second most frequently diagnosed malignancy in men in Western countries. According to data from the International Agency for Research on Cancer of the World Health Organization (WHO), about 1.1 million men worldwide were diagnosed with prostate cancer in 2012, accounting for 15% of all cancers diagnosed in men. In China, according to the latest national cancer statistics published by the National Cancer Center in January 2019, the incidence of prostate cancer has shown a marked upward trend in recent years, and prostate cancer ranks 6th among cancers in men. PCa is considered a heterogeneous disease in which multiple genes and cellular pathways jointly drive development and progression. Tumors resulting from epigenetic mutations in cells may grow and multiply uncontrollably.
Disclosure of Invention
In view of the above, it is an object of the present invention to provide a prostate cancer diagnosis model, which can realize prediction of prostate cancer.
The invention solves the technical problems by the following technical means:
a method for constructing a diagnostic model of prostate cancer, comprising the steps of:
step1) acquiring gene expression profile data of PCa (prostate cancer);
step2) performing differential expression analysis on the PCa gene expression profile data, and screening out the differential genes in PCa;
step3) screening key genes from the differential genes in PCa with the GAE (graph autoencoder) algorithm, a machine learning method;
step4) obtaining 10 highly expressed genes and 6 lowly expressed genes among the key genes by PPI analysis of the GAE results;
step5) establishing a prognosis model through single-factor regression analysis and multi-factor regression analysis;
step6) constructing a PCa diagnosis model from the prognosis model parameters;
step7) validating the PCa diagnosis model.
The differentially expressed genes of PCa are analyzed with the GAE method in machine learning. The highly expressed genes screened out in PCa are: UBE2C, CCNB1, TOP2A, TPX2, CENPM, KIAA0101, F5, APOE, NPY and TRIM36, and the lowly expressed genes are: MYH11, FLNA, ACTA2, MYL9, TAGLN and ACTG2.
Key genes related to PCa prognosis are identified with the single-factor Cox proportional hazards model, and a diagnosis model based on 4 genes is constructed with the multi-factor Cox proportional hazards model. The diagnostic model is calculated by the following formula:
prognostic risk indicator = (0.3153 × TOP2A gene expression level) + (0.2987 × UBE2C gene expression level) + (-0.7064 × MYL9 gene expression level) + (-0.4628 × FLNA gene expression level)
The invention has the beneficial effects that:
The invention discovers and verifies a diagnostic model consisting of 4 key genes relevant to PCa prognosis. In addition, the key genes used to construct the prediction model are verified by integrating multi-omics databases. The results obtained by the invention provide a new direction for research on PCa biomarkers and a new possibility for the personalized, precise treatment of PCa patients.
Drawings
The invention is further illustrated with reference to the following figures and examples;
FIG. 1 is an expression profile of two data sets, GSE6919 and GSE30174;
FIG. 2 shows the results of differential expression profiling of GSE6919 and GSE30174 data sets;
FIG. 3 shows key genes selected by the GAE algorithm;
FIG. 4 shows the significantly up- and down-regulated genes obtained by PPI analysis;
FIG. 5 shows the expression of high- and low-risk genes in the GEO training set;
FIG. 6 is a ROC curve in the GEO training set;
FIG. 7 is a multifactor Cox analysis of a predictive model;
FIG. 8 is a multifactor Cox analysis of the prediction model and age, pathology staging;
FIG. 9 is a ROC curve for a prediction model;
FIG. 10 is validation of key genes by the GEPIA database;
FIG. 11 is the validation of key genes by the Oncomine database;
FIG. 12 is the validation of key genes by the GTEx database;
FIG. 13 is the validation of key genes by the Human Protein Atlas database.
Detailed Description
The invention is described in detail below with reference to specific experiments:
The invention relates to a method for constructing a diagnosis model of prostate cancer, which comprises the following steps:
Step one: Data collection and analysis
1) Collecting patient data
Two data sets, GSE6919 and GSE30174, were selected from the Gene Expression Omnibus (GEO) database as training data sets.
The GEO database is a public genomics database whose data come from published papers. Created in 2000, it collects high-throughput gene expression data submitted by research institutions around the world, so the gene expression data associated with a published paper can generally be found in it. Data drawn from this database can therefore be regarded as highly reliable.
The invention selects the two datasets GSE6919 and GSE30174 as the training data source for machine learning. The GSE6919 dataset is based on the Affymetrix GPL92, GPL93 and GPL8300 platforms (Affymetrix Human Genome U95 Version 2 Array) and was submitted by Federico Alberto Monzon in 2018. There were 504 samples in the GSE30174 dataset, including 233 normal prostate tissues and 271 metastatic prostate tumors. GSE30174 was submitted by Jennifer Barb in 2019. The expression profile data of the training set are shown in FIG. 1.
The GSE16560 dataset was used as the validation dataset. The GSE16560 dataset contained 80 samples, including 10 healthy peripheral blood samples and 70 non-metastatic prostate tumors. GSE16560, which is based on the GPL5474 platform (human 6k transcriptome for DASL) and was submitted in 2013 by Andrea Sboner, contained 281 samples, including primary prostate tumors grouped by Gleason score.
2) Screening for differentially expressed genes in prostate cancer
To screen for differential genes in PCa, the invention performed differential expression analysis on the GSE6919 and GSE30174 datasets with the limma package in R and obtained 6269 differential genes. The screening criteria were false discovery rate (FDR) < 0.05 and |log2 fold change (FC)| > 1.5.
The results of differential expression profiling of the GSE6919 and GSE30174 datasets are shown in figure 2.
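The thresholding step of the screening can be illustrated with a minimal Python sketch. It only applies the two stated cutoffs to an already computed result table; it does not reproduce the limma analysis itself. The function name, the dictionary keys and the gene names `GENE_A` to `GENE_D` are hypothetical and serve only as an illustration.

```python
def screen_differential_genes(results, fdr_cutoff=0.05, log2_fc_cutoff=1.5):
    """Keep genes with FDR < cutoff and |log2FC| > cutoff, split by direction."""
    up, down = [], []
    for row in results:
        if row["fdr"] < fdr_cutoff and abs(row["log2_fc"]) > log2_fc_cutoff:
            # Positive log2 fold change means higher expression in tumor samples.
            (up if row["log2_fc"] > 0 else down).append(row["gene"])
    return up, down


# Hypothetical result rows (illustrative values, not data from the patent).
rows = [
    {"gene": "GENE_A", "fdr": 0.001, "log2_fc": 2.3},
    {"gene": "GENE_B", "fdr": 0.010, "log2_fc": -1.8},
    {"gene": "GENE_C", "fdr": 0.200, "log2_fc": 3.0},   # fails the FDR cutoff
    {"gene": "GENE_D", "fdr": 0.001, "log2_fc": 0.4},   # fails the fold-change cutoff
]
up_genes, down_genes = screen_differential_genes(rows)
```

In this toy table only `GENE_A` (up-regulated) and `GENE_B` (down-regulated) pass both criteria.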
Further, GO analysis shows that these differential genes are significantly enriched in Biological Processes (BP), including signal transduction and positive regulation of transcription from the RNA polymerase II promoter. Cellular Component (CC) analysis showed that these differential genes were significantly enriched in the cytoplasmic vesicle membrane, integral components of the membrane and the plasma membrane. For Molecular Function (MF), these differential genes are enriched in protein binding, protein homodimerization activity and calcium ion binding.
The KEGG analysis showed that all up-regulated genes were significantly enriched in dilated cardiomyopathy, Hypertrophic Cardiomyopathy (HCM), ECM-receptor interactions, Arrhythmogenic Right Ventricular Cardiomyopathy (ARVC), focal adhesions, and TGF-β signaling pathways.
3) Further screening of PCa key genes using the GAE machine learning algorithm
GAE (Graph Autoencoder) is an unsupervised learning model. The relevant variables for GAE are as follows:
The graph G may be represented by $G = (V, E)$, where V represents the set of nodes and E represents the set of edges.
A: represents the adjacency matrix, whose diagonal elements are assumed to be 1
D: represents the degree matrix
N: represents the number of nodes
d: represents the feature dimension of a node
$X \in \mathbb{R}^{N \times d}$: represents the feature matrix of the nodes
f: represents the embedding dimension
$Z \in \mathbb{R}^{N \times f}$: represents the embeddings of the nodes
In the encoding process, GAE uses a GCN (graph convolutional network) as the encoder to obtain the latent representations (or embeddings) of the nodes. This process can be expressed as follows:
Z=GCN(X,A)
The GCN is regarded as a function that takes X and A as input and outputs $Z \in \mathbb{R}^{N \times f}$, that is, the latent representations (embeddings) of all nodes. The function of the GCN is defined as follows:
Figure BDA0003433580260000051
As the formula shows, the whole encoder has only two layers, and each layer uses the first-order approximation of the Chebyshev polynomial as its convolution kernel. Apart from the initial input X, i.e. the feature matrix of the nodes, the remaining parameters are the objects to be learned. In short, the GCN is a function that takes the node features and the adjacency matrix as input and outputs the node embeddings; its purpose is to obtain the embeddings.
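The formula image for the GCN function is not reproduced in this text. As a hedged reconstruction, assuming the encoder follows the standard two-layer GCN form used in the graph autoencoder literature and that D denotes the degree matrix of A, the function described above would read as follows. The symbols $\tilde{A}$, $W_0$ and $W_1$ are assumptions introduced for illustration and do not appear in the text.

```latex
\mathrm{GCN}(X, A) = \tilde{A}\,\mathrm{ReLU}\!\left(\tilde{A} X W_0\right) W_1,
\qquad
\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}
```

Under this reading, $W_0$ and $W_1$ are the learnable weight matrices of the two layers, which matches the statement that everything except the input X is learned, and the output has shape $N \times f$ when $W_1$ has f columns.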
In the decoding process, GAE uses the inner product as the decoder to reconstruct the original graph:
Figure BDA0003433580260000052
This yields the reconstructed adjacency matrix, and the loss function can be constructed from it together with the structural information of the original graph.
As for the loss function of GAE: the adjacency matrix determines the structure of the graph, so the reconstructed adjacency matrix should be as similar as possible to the original adjacency matrix. GAE therefore uses cross entropy as the loss function during training:
Figure BDA0003433580260000053
In the above formula, y represents the value (0 or 1) of an element in the adjacency matrix A, and $\hat{y}$ represents the value (between 0 and 1) of the corresponding element in the reconstructed adjacency matrix $\hat{A}$. As the loss function shows, the closer the reconstructed adjacency matrix (or reconstructed graph) is to the original adjacency matrix (or original graph), the better.
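The decoding and loss steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the TensorFlow implementation used by the invention: the decoder is taken to be the sigmoid of the inner product of node embeddings, and the loss is taken to be the binary cross entropy averaged over all matrix entries (the averaging is an assumption). The function names and the toy embeddings are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(Z):
    """Inner-product decoder: A_hat[i][j] = sigmoid(z_i . z_j)."""
    return [[sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in Z] for zi in Z]


def reconstruction_loss(A, A_hat, eps=1e-12):
    """Mean binary cross entropy between A (entries 0 or 1) and A_hat."""
    total, count = 0.0, 0
    for row, row_hat in zip(A, A_hat):
        for y, y_hat in zip(row, row_hat):
            y_hat = min(max(y_hat, eps), 1.0 - eps)  # guard against log(0)
            total -= y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
            count += 1
    return total / count


# Toy example: two connected nodes with hypothetical 2-dimensional embeddings.
A = [[1, 1], [1, 1]]
Z = [[2.0, 0.0], [2.0, 0.0]]
A_hat = decode(Z)
loss = reconstruction_loss(A, A_hat)
```

Embeddings that point in the same direction give reconstructed entries close to 1, so the loss is small when the corresponding entries of A are 1, which is the behavior the loss function is meant to reward.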
After obtaining the differential genes for PCa, the invention implemented the GAE algorithm in TensorFlow and used it to screen key genes from the 6269 differential genes. GAE uses the encoder to learn the network embedding and the decoder to reconstruct the network, so that the topology information of the nodes carried by the adjacency matrix is preserved:
Figure BDA0003433580260000061
where $v_1, v_2 \in V$,
Figure BDA0003433580260000062
and the count(·) function returns the frequency with which nodes v and/or u occur or co-occur in the random sampling.
The key genes selected by the GAE algorithm are shown in FIG. 3.
After obtaining the key genes, all of them were uploaded to the STRING database for PPI (protein-protein interaction) analysis (FIG. 4). The STRING database covers 14094 organisms, 67.6 million proteins and more than 20 billion interactions in total, which provides an important basis for studying the interactions between key genes. The PPI results of the invention comprise 6475 nodes presented in topological form. Among the top 100 genes generated by GAE, the significantly up-regulated genes were: UBE2C, CCNB1, TOP2A, TPX2, CENPM, KIAA0101, F5, APOE, NPY and TRIM36; the significantly down-regulated genes were: MYH11, FLNA, ACTA2, MYL9, TAGLN and ACTG2.
Step two: model construction and model verification
4) Construction and validation of predictive models
First, univariate Cox analysis was used to study the relationship between patient overall survival (OS) and the expression level of each key gene. The analysis used the survminer package in R. Genes with P < 0.01 in the univariate Cox regression analysis were considered significant.
Second, a multivariate Cox proportional hazards analysis was performed to assess the contribution of multiple genes as independent prognostic factors affecting patient survival.
Finally, a stepwise method was used to select the optimal model. A PCa risk score prediction model was constructed using the coefficients of the multifactor Cox regression as weights. The risk score is calculated as follows:
risk score = Σ_i (risk coefficient of gene i × expression level of gene i)
The prognostic risk indicator constructed by the invention is: (0.3153 × TOP2A gene expression level) + (0.2987 × UBE2C gene expression level) + (-0.7064 × MYL9 gene expression level) + (-0.4628 × FLNA gene expression level)
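As a worked illustration of the four-gene formula above, the following Python sketch computes the prognostic risk indicator as a weighted sum. The coefficients are those stated in the formula; the function name, the dictionary layout and the sample expression values are hypothetical, and the sketch makes no assumption about how expression levels are normalized.

```python
# Multifactor Cox coefficients as stated in the four-gene formula.
COEFFICIENTS = {
    "TOP2A": 0.3153,
    "UBE2C": 0.2987,
    "MYL9": -0.7064,
    "FLNA": -0.4628,
}


def prognostic_risk_indicator(expression):
    """Return the weighted sum of the four gene expression levels."""
    missing = [gene for gene in COEFFICIENTS if gene not in expression]
    if missing:
        raise KeyError(f"missing expression levels for: {missing}")
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


# Hypothetical expression levels for one sample (illustrative values only).
sample = {"TOP2A": 2.0, "UBE2C": 1.0, "MYL9": 1.0, "FLNA": 1.0}
score = prognostic_risk_indicator(sample)
```

Because TOP2A and UBE2C carry positive weights while MYL9 and FLNA carry negative weights, higher expression of the first pair raises the indicator and higher expression of the second pair lowers it.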
The results of high-risk and low-risk gene expression constructed using the GEO training set in the present invention are shown in fig. 5.
For the performance evaluation of the prognostic risk model constructed by the present invention, the ROC curve is used to evaluate the prediction performance and the GSE16560 data set in the GEO database is used for verification.
The ROC curves for the GEO training set are shown in fig. 6.
5) Verifying independence between predictive models and clinically relevant information
Univariate and multivariate Cox regression analyses were used to evaluate the independent predictive value of the four-gene prognostic model in the TCGA-PRAD cohort and the GSE16560 cohort. Clinical information, including Gleason score and pathological stage, was analyzed by univariate Cox regression. Because age and Gleason score almost reached statistical significance, age, Gleason score and the prognostic model were incorporated into a multivariate Cox regression analysis (FIG. 7), which showed that the prognostic model was an independent predictor of OS.
The multifactor Cox analysis in FIG. 7 shows that the risk model constructed by the invention has P = 0.0073 (P < 0.01) and differs significantly from the other factors influencing PCa diagnosis.
In addition, the GSE16560 dataset was used to assess the predictive value of the prognostic model (FIG. 8). The 280 patients in the GSE16560 dataset were classified into a high-risk group (n = 190) and a low-risk group (n = 90) using the optimal risk cutoff. Time-dependent ROC analysis of survival prediction for the prognostic model yielded AUCs of 0.69 at 1 year, 0.58 at 3 years and 0.61 at 5 years (FIG. 9).
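The grouping and the AUC idea can be sketched in plain Python. This is a simplified illustration: it computes an ordinary AUC from binary labels and ignores censoring, whereas the analysis described above is a time-dependent ROC analysis. The cutoff value, scores and labels below are hypothetical, since the optimal cutoff is not stated in the text.

```python
def split_by_cutoff(scores, cutoff):
    """Return index lists of the high-risk (score > cutoff) and low-risk samples."""
    high = [i for i, s in enumerate(scores) if s > cutoff]
    low = [i for i, s in enumerate(scores) if s <= cutoff]
    return high, low


def auc(scores, labels):
    """AUC as the probability that a positive sample outscores a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical risk scores and event labels (1 = event by the chosen horizon).
scores = [0.9, 0.4, 0.6, 0.1]
labels = [1, 1, 0, 0]
high_risk, low_risk = split_by_cutoff(scores, cutoff=0.5)
example_auc = auc(scores, labels)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, which gives a scale for reading the reported values of 0.69, 0.58 and 0.61.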
6) Verification of the 4 key genes used to construct the prediction model
The 4 key genes used to construct the prediction model were verified using the GEPIA, Oncomine, GTEx and Human Protein Atlas databases, respectively.
The GEPIA database can be used to verify whether the expression of the 4 key genes of this prediction model differs significantly in prostate cancer (FIG. 10).
The Oncomine database can be used to verify the expression of the 4 key genes of the prediction model in various tumors (FIG. 11).
The GTEx database can be used to verify the expression of the 4 key genes of the prediction model in normal tissues (FIG. 12).
The Human Protein Atlas database can be used to validate the pathological expression of the 4 key genes of this prediction model (FIG. 13).
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims. The techniques, shapes, and configurations not described in detail in the present invention are all known techniques.

Claims (3)

1. A method for constructing a diagnosis model of prostate cancer, which is characterized by comprising the following steps:
step1) acquiring gene expression profile data of PCa (prostate cancer);
step2) performing differential expression analysis on the PCa gene expression profile data, and screening out the differential genes in PCa;
step3) screening key genes from the differential genes in PCa with the GAE (graph autoencoder) algorithm, a machine learning method;
step4) obtaining 10 highly expressed genes and 6 lowly expressed genes among the key genes by PPI analysis of the GAE results;
step5) establishing a prognosis model through single-factor regression analysis and multi-factor regression analysis;
step6) constructing a PCa diagnosis model from the prognosis model parameters;
step7) validating the PCa diagnosis model.
2. The method for constructing a diagnostic model for prostate cancer according to claim 1, characterized in that: the PCa differentially expressed genes obtained in step2 are analyzed by the GAE method in machine learning, and the highly expressed genes screened out in PCa are: UBE2C, CCNB1, TOP2A, TPX2, CENPM, KIAA0101, F5, APOE, NPY and TRIM36, and the lowly expressed genes are: MYH11, FLNA, ACTA2, MYL9, TAGLN and ACTG2.
3. The method for constructing a diagnostic model for prostate cancer according to claim 2, wherein the specific method for establishing the prognostic model in step5 is as follows: key genes related to PCa prognosis are identified with the single-factor Cox proportional hazards model, and a diagnosis model based on 4 genes is constructed with the multi-factor Cox proportional hazards model. The diagnostic model is calculated by the following formula:
prognostic risk indicator = (0.3153 × TOP2A gene expression level) + (0.2987 × UBE2C gene expression level) + (-0.7064 × MYL9 gene expression level) + (-0.4628 × FLNA gene expression level).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111603645.0A CN114283885A (en) 2021-12-25 2021-12-25 Method for constructing diagnosis model of prostate cancer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111603645.0A CN114283885A (en) 2021-12-25 2021-12-25 Method for constructing diagnosis model of prostate cancer

Publications (1)

Publication Number Publication Date
CN114283885A true CN114283885A (en) 2022-04-05

Family

ID=80875429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111603645.0A Pending CN114283885A (en) 2021-12-25 2021-12-25 Method for constructing diagnosis model of prostate cancer

Country Status (1)

Country Link
CN (1) CN114283885A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115394358A (en) * 2022-08-31 2022-11-25 西安理工大学 Single cell sequencing gene expression data interpolation method and system based on deep learning
CN117747093A (en) * 2024-02-20 2024-03-22 神州医疗科技股份有限公司 Method for constructing idiopathic pulmonary fibrosis diagnosis model and diagnosis system
CN117747093B (en) * 2024-02-20 2024-06-07 神州医疗科技股份有限公司 Method for constructing idiopathic pulmonary fibrosis diagnosis model and diagnosis system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination