CN114283885A

CN114283885A - Method for constructing diagnosis model of prostate cancer

Info

Publication number: CN114283885A
Application number: CN202111603645.0A
Authority: CN
Inventors: 罗艺灵; 佟延秋
Original assignee: Chongqing Medical University
Current assignee: Chongqing Medical University
Priority date: 2021-12-25
Filing date: 2021-12-25
Publication date: 2022-04-05

Abstract

The invention discloses a method for constructing a diagnosis model of prostate cancer (PCa), which comprises the following steps: step1) acquiring gene expression profile data of PCa; step2) performing differential expression analysis on the gene expression profile data of PCa and screening out the differential genes in PCa; step3) screening key genes from the differential genes in PCa using the GAE (Graph Autoencoder) machine learning algorithm; step4) obtaining 10 highly expressed genes and 6 lowly expressed genes by PPI analysis based on the GAE results; step5) establishing a prognosis model through univariate and multivariate regression analysis; step6) constructing a PCa diagnosis model from the parameters of the prognosis model; step7) validating the PCa diagnosis model. The invention constructs and validates a PCa diagnosis model based on 4 genes, which provides a basis for the personalized precision treatment of PCa patients.

Description

Method for constructing diagnosis model of prostate cancer

Technical Field

The invention relates to the field of medicine, in particular to a method for constructing a diagnosis model of prostate cancer.

Background

Prostate cancer (PCa) is the second most frequently diagnosed malignancy in men in western countries. According to data from the International Agency for Research on Cancer of the World Health Organization (WHO), about 1.1 million men worldwide were diagnosed with prostate cancer in 2012, accounting for 15% of all cancers diagnosed in men. In China, according to the latest national cancer statistics published by the National Cancer Center in January 2019, the incidence of prostate cancer has risen markedly in recent years, and prostate cancer ranks 6th among cancers in men. PCa is considered a heterogeneous disease, and multiple genes and cellular pathways are jointly involved in its development and progression. Tumors resulting from epigenetic mutations in the cells may grow and multiply uncontrollably.

Disclosure of Invention

In view of the above, it is an object of the present invention to provide a prostate cancer diagnosis model, which can realize prediction of prostate cancer.

The invention solves the technical problems by the following technical means:

a method for constructing a diagnostic model of prostate cancer, comprising the steps of:

step1) acquiring gene expression profile data of PCa (prostate cancer);

step2) carrying out differential expression profile analysis on the gene expression profile data of the PCa, and screening out differential genes in the PCa;

step3) screening key genes from the differential genes in PCa through the GAE (Graph Autoencoder) algorithm, a machine learning method;

step4) obtaining 10 highly expressed genes and 6 lowly expressed genes among the key genes by PPI analysis based on the GAE results;

step5) establishing a prognosis model through univariate and multivariate regression analysis;

step6) constructing a PCa diagnosis model from the parameters of the prognosis model;

step7) validating the PCa diagnosis model.

Based on the differentially expressed genes of PCa, analysis by the GAE machine learning method identifies the following highly expressed genes in PCa: UBE2C, CCNB1, TOP2A, TPX2, CENPM, KIAA0101, F5, APOE, NPY and TRIM36; and the following lowly expressed genes: MYH11, FLNA, ACTA2, MYL9, TAGLN, and ACTG2.

Key genes related to PCa prognosis are identified with a univariate Cox proportional hazards model, and a diagnosis model based on 4 genes is constructed with a multivariate Cox proportional hazards model. The diagnosis model is calculated by the following formula:

prognostic risk indicator = (0.3153 × TOP2A gene expression level) + (0.2987 × UBE2C gene expression level) + (-0.7064 × MYL9 gene expression level) + (-0.4628 × FLNA gene expression level)

The invention has the beneficial effects that:

The invention discovers and validates a diagnosis model consisting of 4 key genes related to PCa prognosis. In addition, the key genes used to construct the prediction model are validated by integrating multiple omics databases. The results obtained by the invention provide a new direction for research on PCa biomarkers and a new possibility for the personalized precision treatment of PCa patients.

Drawings

The invention is further illustrated with reference to the following figures and examples;

FIG. 1 is an expression profile of two data sets, GSE6919 and GSE30174;

FIG. 2 shows the results of differential expression profiling of GSE6919 and GSE30174 data sets;

FIG. 3 shows key genes selected by the GAE algorithm;

FIG. 4 shows the significantly up- and down-regulated genes obtained by PPI analysis;

FIG. 5 shows the expression of high- and low-risk genes in the GEO training set;

FIG. 6 is a ROC curve in the GEO training set;

FIG. 7 is a multifactor Cox analysis of a predictive model;

FIG. 8 is a multifactor Cox analysis of the prediction model and age, pathology staging;

FIG. 9 is a ROC curve for a prediction model;

FIG. 10 is validation of key genes by the GEPIA database;

FIG. 11 is the validation of key genes by the Oncomine database;

FIG. 12 is a validation of key genes by the GTEx database;

FIG. 13 is a validation of key genes by the Human Protein Atlas database.

Detailed Description

The invention is described in detail below with reference to specific experiments:

The method of the invention for constructing a diagnosis model of prostate cancer comprises the following steps:

Step one: data collection and analysis

1) Collecting patient data

Two data sets, GSE6919 and GSE30174, were selected from the Gene Expression Omnibus (GEO) database as training data sets.

The GEO database is a public genomics database whose data come from published papers. It was created in 2000 and collects high-throughput gene expression data submitted by research institutions around the world; in other words, once a paper has been published, the gene expression data associated with it can be found through the database. The database is therefore a highly credible data source.

The invention selects the two data sets GSE6919 and GSE30174 as the training data source for machine learning. The GSE6919 dataset was based on Agilent GPL92, GPL93, and GPL8300 platforms (Affymetrix Human Genome U95 Version 2 Array) and was submitted by Federico Alberto Monzon in 2018. There were 504 samples in the GSE30174 dataset, which included 233 normal prostate tissues and 271 metastatic prostate tumors. GSE30174 was submitted by Jennifer Barb in 2019. The expression profile data of the training set is shown in FIG. 1.

The GSE16560 dataset was used as the validation dataset. The GSE16560 dataset contained 80 samples, including 10 healthy peripheral blood samples and 70 non-metastatic prostate tumors. GSE16560, as a validation dataset, is based on the GPL5474 platform (human 6k transcriptome for DASL); it was submitted in 2013 by Andrea Sboner and contained 281 samples, including primary prostate tumors sorted by Gleason score.

2) Screening for differentially expressed genes in prostate cancer

To screen for differential genes in PCa, the present invention performed differential expression analysis with the limma package in the R language and screened 6269 differential genes from the GSE6919 and GSE30174 datasets. The screening criteria were false discovery rate (FDR) < 0.05 and |log2 fold change (FC)| > 1.5.

The results of differential expression profiling of the GSE6919 and GSE30174 datasets are shown in figure 2.
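The screening criteria above can be sketched in a few lines of Python. This is only an illustration, not the limma pipeline used by the invention: the function names are hypothetical, the Benjamini-Hochberg adjustment is an assumption about how the FDR is obtained, and the gene names and numbers in the usage example are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (FDR), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, log2_fc, p_values, fdr_cutoff=0.05, lfc_cutoff=1.5):
    """Keep genes with FDR < fdr_cutoff and |log2 FC| > lfc_cutoff."""
    fdr = benjamini_hochberg(p_values)
    selected = []
    for gene, lfc, q in zip(genes, log2_fc, fdr):
        if q < fdr_cutoff and abs(lfc) > lfc_cutoff:
            selected.append((gene, "up" if lfc > 0 else "down"))
    return selected


if __name__ == "__main__":
    # Hypothetical genes and statistics, for illustration only.
    genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
    log2_fc = [2.1, -2.4, 0.3, 1.8]
    p_values = [0.001, 0.002, 0.0005, 0.4]
    print(select_degs(genes, log2_fc, p_values))
```

A gene must pass both thresholds: GENE_C in the toy data has a small p-value but a small fold change, and GENE_D has a large fold change but a large p-value, so neither is retained.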

Further, GO analysis showed that these differential genes were significantly enriched in Biological Processes (BP) including signal transduction and positive regulation of transcription from the RNA polymerase II promoter. Cellular Component (CC) analysis showed that they were significantly enriched in the cytoplasmic vesicle membrane, integral components of membrane, and the plasma membrane. For Molecular Function (MF), they were enriched in protein binding, protein homodimerization activity, and calcium ion binding.

The KEGG analysis showed that all up-regulated genes were significantly enriched in dilated cardiomyopathy, Hypertrophic Cardiomyopathy (HCM), ECM-receptor interactions, Arrhythmogenic Right Ventricular Cardiomyopathy (ARVC), focal adhesions, and TGF-β signaling pathways.

3) Further screening of PCa Key genes Using GAE machine learning Algorithm

GAE (Graph Autoencoder) is an unsupervised learning model. The relevant variables for GAE are as follows:

The graph G may be represented by $G = (V, E)$, where V represents the set of nodes and E represents the set of edges.

A: represents the adjacency matrix (its diagonal elements are assumed to be 1)

D: represents the degree matrix of A

N: represents the number of nodes

d: represents the feature dimension of a node

$X \in \mathbb{R}^{N \times d}$: represents the feature matrix of the nodes

f: represents the embedding dimension

$Z \in \mathbb{R}^{N \times f}$: represents the embedding of the nodes

Encoding process of the GAE: the GAE uses a GCN (graph convolutional network) as the encoder to obtain the latent representations (or embeddings) of the nodes. This process can be expressed as follows:

$Z = \mathrm{GCN}(X, A)$

The GCN is regarded as a function that takes X and A as input and outputs $Z \in \mathbb{R}^{N \times f}$, namely the latent representations (embeddings) of all nodes. The function of the GCN is defined as follows:

As shown in the formula, the whole encoder has only two layers, and each layer uses the first-order approximation of the Chebyshev polynomial as the convolution kernel to process the data. Except for the initial input X, i.e., the feature matrix of the nodes, the remaining parameters are the objects to be learned. In short, the GCN is a function that takes the node features and the adjacency matrix as input and outputs the node embeddings; its purpose is to obtain the embeddings.
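For reference, the two-layer GCN encoder of the standard GAE formulation is usually written as below. The weight matrices $W_0$ and $W_1$, the hidden dimension $h$ and the normalized adjacency matrix $\tilde{A}$ are symbols assumed here; they are not named in the text above.

```latex
\tilde{A} = D^{-1/2} A D^{-1/2}, \qquad
\mathrm{GCN}(X, A) = \tilde{A}\,\mathrm{ReLU}\!\left(\tilde{A} X W_0\right) W_1,
\qquad W_0 \in \mathbb{R}^{d \times h},\; W_1 \in \mathbb{R}^{h \times f}
```

Under this form, $W_0$ and $W_1$ are the only learned parameters, which matches the statement that everything except the input X is learned.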

Decoding process of the GAE: the GAE uses an inner product as the decoder to reconstruct the original graph:

The reconstructed adjacency matrix is thus obtained, and the loss function can be constructed from the reconstructed adjacency matrix and the structural information of the original graph.
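In the standard GAE formulation, the inner-product decoder is commonly written as follows, where $\sigma$ denotes the logistic sigmoid function (a symbol assumed here) and $\hat{A}$ is the reconstructed adjacency matrix:

```latex
\hat{A} = \sigma\!\left(Z Z^{\top}\right), \qquad
\hat{A}_{ij} = \frac{1}{1 + \exp\!\left(-z_i^{\top} z_j\right)}
```

Each entry $\hat{A}_{ij}$ lies between 0 and 1 and can be read as the predicted probability of an edge between nodes $i$ and $j$.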

Loss function of the GAE: the adjacency matrix determines the structure of the graph, so the reconstructed adjacency matrix should be as similar as possible to the original adjacency matrix. GAE therefore uses cross entropy as the loss function during training:

In the above formula, y represents the value (0 or 1) of an element in the adjacency matrix A, and $\hat{y}$ represents the value (between 0 and 1) of the corresponding element in the reconstructed adjacency matrix $\hat{A}$. As the loss function shows, the closer the reconstructed adjacency matrix (or reconstructed graph) is to the original adjacency matrix (or original graph), the better.
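A common way to write this cross-entropy loss, using the symbols y and $\hat{y}$ defined above, is given below. Averaging over all $N^2$ entries of the adjacency matrix is an assumption; other normalizations and edge re-weighting schemes are also used in practice.

```latex
\mathcal{L} = -\frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N}
\left[ y_{ij} \log \hat{y}_{ij} + \left(1 - y_{ij}\right) \log\!\left(1 - \hat{y}_{ij}\right) \right]
```

The loss is zero only when every $\hat{y}_{ij}$ equals the corresponding $y_{ij}$, and it grows as the reconstruction moves away from the original graph.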

After obtaining the differential genes for PCa, the present invention implemented the GAE algorithm using TensorFlow and used it to screen the 6269 differential genes for key genes. The GAE uses an encoder to learn the network embedding and a decoder to reconstruct the network from that embedding, so that the topology information of the nodes is preserved through the adjacency matrix:

wherein v₁, v₂ ∈ V,

and the count(·) function returns the frequency of occurrence/co-occurrence of the nodes v and/or u in the random sampling.

The key genes selected by the GAE algorithm are shown in FIG. 3.
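The encoder, decoder and loss described above can be sketched as a single forward pass in plain Python. This is a minimal illustration under stated assumptions, not the TensorFlow implementation used by the invention: it follows the standard two-layer GCN form, the weight matrices `w0` and `w1` are hypothetical parameters supplied by the caller, and no training loop is included.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^{-1/2} A D^{-1/2}, where A has its diagonal set to 1."""
    n = len(adj)
    a = [[1.0 if i == j else float(adj[i][j]) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a]
    return [[inv_sqrt_deg[i] * a[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]


def gcn_encode(x, adj, w0, w1):
    """Two-layer GCN encoder: Z = A_norm ReLU(A_norm X W0) W1."""
    a_norm = normalize_adjacency(adj)
    hidden = [[max(0.0, v) for v in row] for row in matmul(matmul(a_norm, x), w0)]
    return matmul(matmul(a_norm, hidden), w1)


def decode(z):
    """Inner-product decoder: A_hat = sigmoid(Z Z^T)."""
    z_t = [list(col) for col in zip(*z)]
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in matmul(z, z_t)]


def reconstruction_loss(adj, a_hat, eps=1e-12):
    """Mean binary cross-entropy between A (diagonal set to 1) and A_hat."""
    n = len(adj)
    total = 0.0
    for i in range(n):
        for j in range(n):
            y = 1.0 if i == j else float(adj[i][j])
            p = min(max(a_hat[i][j], eps), 1.0 - eps)
            total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / (n * n)


if __name__ == "__main__":
    # Toy 3-node path graph with hypothetical features and weights.
    adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    w0 = [[0.5, -0.2], [0.1, 0.4]]
    w1 = [[0.3], [-0.6]]
    z = gcn_encode(x, adj, w0, w1)
    print(reconstruction_loss(adj, decode(z)))
```

Training would adjust `w0` and `w1` by gradient descent to reduce the reconstruction loss; the resulting rows of Z are the node embeddings on which gene ranking can then be based.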

After obtaining the key genes, all the key genes were uploaded to the STRING database for PPI (Protein-Protein Interaction) analysis (FIG. 4). The STRING database contains 14094 organisms, 6.76 million proteins, and over 20 million interactions in total. This provides an important basis for studying the interactions between key genes. The PPI network obtained in the present invention contains 6475 nodes in total, presented as a network topology. Among the top 100 genes ranked by GAE, the significantly up-regulated genes were: UBE2C, CCNB1, TOP2A, TPX2, CENPM, F5, APOE, NPY and TRIM36; the significantly down-regulated genes were: MYH11, FLNA, ACTA2, MYL9, TAGLN, and ACTG2.

Step two: model construction and model verification

4) Construction and validation of predictive models

First, univariate Cox analysis was used to study the relationship between patient overall survival (OS) and the expression level of each key gene. The analysis was carried out with the survminer package in the R language. Genes with P < 0.01 in the univariate Cox regression analysis were considered significant.
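The univariate Cox step can be illustrated with a small Newton-Raphson fit of the partial likelihood for a single gene. This is a sketch under stated assumptions rather than the R workflow used by the invention: the function names are hypothetical, ties are handled in the Breslow manner, the P value is a Wald test, and the survival data in the usage example are made up.

```python
import math


def cox_gradient_hessian(beta, times, events, x):
    """First and second derivatives of the Cox partial log-likelihood (one covariate)."""
    grad, hess = 0.0, 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored samples contribute only through risk sets
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        w = [math.exp(beta * x[j]) for j in risk]
        s0 = sum(w)
        s1 = sum(wj * x[j] for wj, j in zip(w, risk))
        s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk))
        grad += x[i] - s1 / s0
        hess -= s2 / s0 - (s1 / s0) ** 2
    return grad, hess


def fit_univariate_cox(times, events, x, iterations=50, tol=1e-10):
    """Return (beta, two-sided Wald p-value) for a single covariate."""
    beta = 0.0
    for _ in range(iterations):
        grad, hess = cox_gradient_hessian(beta, times, events, x)
        if hess == 0.0:
            raise ValueError("no information: need events and a varying covariate")
        step = grad / hess
        beta -= step  # Newton-Raphson update on a concave objective
        if abs(step) < tol:
            break
    _, hess = cox_gradient_hessian(beta, times, events, x)
    z = beta * math.sqrt(-hess)  # beta divided by its standard error
    return beta, math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical follow-up times, event indicators and expression values.
    times = [1, 2, 3, 4, 5, 6]
    events = [1, 1, 1, 1, 1, 1]
    expression = [2.0, 1.0, 2.5, 0.5, 1.5, 0.2]
    beta, p_value = fit_univariate_cox(times, events, expression)
    print(beta, math.exp(beta), p_value)
```

A positive coefficient (hazard ratio above 1) indicates that higher expression is associated with shorter survival; a gene would pass the screen described above only if its P value were below 0.01.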

Second, a multivariate Cox proportional hazards analysis was performed to assess the contribution of multiple genes as independent prognostic factors affecting patient survival.

Finally, the optimal model was selected by a stepwise method. A PCa risk score prediction model was constructed by using the coefficients of the multivariate Cox regression as weights. The risk score is calculated as follows:

$\text{risk score} = \sum_{i} \left(\text{risk coefficient of gene}_i \times \text{expression level of gene}_i\right)$

The prognostic risk indicator constructed by the invention is: prognostic risk indicator = (0.3153 × TOP2A gene expression level) + (0.2987 × UBE2C gene expression level) + (-0.7064 × MYL9 gene expression level) + (-0.4628 × FLNA gene expression level)
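The four-gene formula can be evaluated directly. In the sketch below the coefficients are those given in the text, while the function names, the patient expression values and the cutoff passed to `risk_group` are hypothetical and serve only as an illustration.

```python
# Coefficients of the four-gene model, taken from the formula in the text.
COEFFICIENTS = {
    "TOP2A": 0.3153,
    "UBE2C": 0.2987,
    "MYL9": -0.7064,
    "FLNA": -0.4628,
}


def risk_score(expression):
    """Weighted sum of the four gene expression levels."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def risk_group(score, cutoff):
    """Label a patient 'high' or 'low' risk relative to a caller-supplied cutoff."""
    return "high" if score > cutoff else "low"


if __name__ == "__main__":
    # Hypothetical expression values for one patient.
    patient = {"TOP2A": 2.0, "UBE2C": 1.0, "MYL9": 0.5, "FLNA": 0.5}
    score = risk_score(patient)
    print(round(score, 4), risk_group(score, cutoff=0.0))
```

Because TOP2A and UBE2C carry positive weights and MYL9 and FLNA carry negative weights, higher expression of the former pair raises the score and higher expression of the latter pair lowers it.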

The expression of the high-risk and low-risk genes in the GEO training set of the present invention is shown in fig. 5.

To evaluate the performance of the prognostic risk model constructed by the present invention, the ROC curve was used to assess its predictive performance, and the GSE16560 data set in the GEO database was used for validation.

The ROC curves for the GEO training set are shown in fig. 6.
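The area under a ROC curve can be computed without plotting, as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below covers only the plain binary-outcome case; the function name and the toy labels and scores are hypothetical, and the time-dependent ROC analysis used for survival data is not reproduced here.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical outcomes (1 = event) and risk scores.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.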

5) Verifying independence between predictive models and clinically relevant information

Univariate and multivariate Cox regression analyses were used to evaluate the independent predictive value of the four-gene prognosis model in the TCGA-PRAD cohort and the GSE16560 cohort. Clinical information, including Gleason score and pathological stage, was analyzed by univariate Cox regression analysis. Since age and Gleason score almost reached statistical significance, age, Gleason score and the prognosis model were incorporated into the multivariate Cox regression analysis (fig. 7), which showed that the prognosis model was an independent predictor of OS.

The multivariate Cox analysis in fig. 7 shows that, for the risk model constructed by the invention, P = 0.0073 (P < 0.01), which differs significantly from the other factors influencing PCa diagnosis.

In addition, the GSE16560 dataset was used to assess the predictive value of the prognostic model (fig. 8). The 280 patients in the GSE16560 dataset were classified into a high-risk group (n = 190) and a low-risk group (n = 90) using the optimal risk cutoff. Time-dependent ROC analysis of survival prediction for the prognostic model yielded an AUC of 0.69 at 1 year, 0.58 at 3 years and 0.61 at 5 years (fig. 9).

6) Verification of the 4 key genes used to construct the prediction model

The 4 key genes used to construct the prediction model were verified using the GEPIA, Oncomine, GTEx and Human Protein Atlas databases, respectively.

The GEPIA database can be used to verify whether the expression of 4 key genes in prostate cancer is significant in this predictive model (fig. 10).

The Oncomine database can be used to verify the expression of 4 key genes in various tumors in the prediction model (FIG. 11).

The GTEx database can be used to verify the expression of the 4 key genes of the prediction model in normal tissues (FIG. 12).

The Human Protein Atlas database can be used to validate the pathological expression of the 4 key genes of the prediction model (fig. 13).

Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims. The techniques, shapes, and configurations not described in detail in the present invention are all known techniques.

Claims

1. A method for constructing a diagnosis model of prostate cancer, which is characterized by comprising the following steps:

step1) acquiring gene expression profile data of PCa (Prostate Cancer);

step7) validating the PCa diagnostic model.

2. The method for constructing a diagnostic model for prostate cancer according to claim 1, characterized in that: the differentially expressed genes of PCa in step2 are analyzed by the GAE machine learning method, and the highly expressed genes screened out in PCa are: UBE2C, CCNB1, TOP2A, TPX2, CENPM, KIAA0101, F5, APOE, NPY and TRIM36, and the lowly expressed genes are: MYH11, FLNA, ACTA2, MYL9, TAGLN, and ACTG2.

3. The method for constructing a diagnostic model for prostate cancer according to claim 2, wherein the specific method for establishing a prognostic model in step5 is as follows: key genes related to PCa prognosis are identified with a univariate Cox proportional hazards model, and a diagnosis model based on 4 genes is constructed with a multivariate Cox proportional hazards model. The diagnostic model is calculated by the following formula:

prognostic risk indicator = (0.3153 × TOP2A gene expression level) + (0.2987 × UBE2C gene expression level) + (-0.7064 × MYL9 gene expression level) + (-0.4628 × FLNA gene expression level).