CN114360642A

CN114360642A - Cancer transcriptome data processing method based on gene co-expression network analysis

Info

Publication number: CN114360642A
Application number: CN202210040488.5A
Authority: CN
Inventors: 付聪; 梁磊; 张彦; 易星丞; 许彤
Original assignee: Jilin Puchuan Bio Medicine Co ltd
Current assignee: Jilin Puchuan Bio Medicine Co ltd
Priority date: 2022-01-14
Filing date: 2022-01-14
Publication date: 2022-04-15

Abstract

A cancer transcriptome data processing method based on gene co-expression network analysis, relating to the field of data processing, comprises: acquiring a raw data set; preprocessing the raw data; identifying differentially expressed genes; constructing a gene co-expression network; mining gene modules; analyzing the correlation between the gene modules and clinical indices; performing enrichment analysis on the gene modules; identifying key genes; exploring the functions of the key genes; and performing survival analysis on the key genes. The enrichment analysis results show that the gene modules obtained by the method are biologically meaningful, and verification of the key genes against the DisGeNET database shows that most of the key genes identified by the method are associated with tumor diseases. The method therefore performs well in gene module mining and key gene identification. It can serve as a tool for analyzing cancer transcriptome data, and its application offers a new direction for further understanding the pathogenesis of cancer.

Description

Cancer transcriptome data processing method based on gene co-expression network analysis

Technical Field

The invention relates to a gene data processing method, in particular to a cancer transcriptome data processing method based on gene coexpression network analysis.

Background

In recent years, the prevalence of cancer has risen, and research on cancer has become increasingly important because the disease is difficult to treat and prone to relapse. If the functional gene modules of a cancer can be mined by bioinformatics methods and its key genes identified, the pathogenesis of the cancer can be better understood, which in turn can support its clinical treatment.

With the rapid development of next-generation sequencing, gene expression data have grown explosively, and extracting hidden knowledge from large volumes of data has become one of the important tasks of the post-genomic era. Meanwhile, as research has progressed, it has become clear that biological factors do not act in isolation but cooperate to carry out complex biological functions in the cellular environment. Converting biological data into biological networks by a suitable method, and then analyzing and mining those networks with graph theory and complex network theory, has therefore become an effective way to process massive biological data. A biological network is a network built from the biological elements involved in a scientific problem in biology: nodes represent biological elements such as proteins and genes, and edges represent biochemical, physical or functional interactions between those elements. The gene co-expression network is a common type of biological network, and its emergence has opened up a new direction for genomics.

Disclosure of Invention

In order to effectively process cancer transcriptome data, the invention provides a cancer transcriptome data processing method based on gene coexpression network analysis.

The technical scheme adopted by the invention for solving the technical problem is as follows:

The invention relates to a cancer transcriptome data processing method based on gene co-expression network analysis, which mainly comprises the following steps:

step one, acquiring a raw data set;

step two, preprocessing the raw data;

step three, identifying differentially expressed genes;

step four, constructing a gene co-expression network;

step five, mining gene modules;

step six, performing correlation analysis on the gene modules and clinical indices;

step seven, performing enrichment analysis on the gene modules;

step eight, identifying key genes;

step nine, exploring the functions of the key genes;

step ten, performing survival analysis on the key genes.

Further, in the first step, the raw data set is derived from a TCGA database or a GEO database; the raw data set includes gene expression data in cancer tissue samples, gene expression data in paracancerous tissue samples, and corresponding clinical data for each sample.

Further, in the second step, low expression genes are filtered out, then hierarchical clustering is carried out on the samples, and outlier samples are deleted.

Further, in step three, all differentially expressed genes satisfying the defined condition are identified by using the FC-t algorithm.

Further, in step four, Pearson correlation analysis is carried out on every pair of genes based on the expression data of the differentially expressed genes in the samples; a limiting condition is set to screen all the resulting relationships, and two genes satisfying the limiting condition are regarded as having a co-expression relationship; all genes with co-expression relationships, together with those relationships, are represented as a graph to obtain the gene co-expression network.

Further, in the fifth step, carrying out network clustering on nodes in the gene co-expression network by using 4 community detection algorithms to obtain communities formed by genes with similar functions, namely gene modules; and selecting the optimal module mining result by using the 'modularity' as an evaluation criterion.

Further, in the sixth step, principal component analysis is performed on all gene expression data in one gene module, and the first principal component is defined as a module characteristic gene of the gene module; and carrying out Pearson correlation analysis on the module characteristic genes of each gene module and different clinical indexes to obtain a correlation matrix of the gene module and the clinical indexes.

Further, in step seven, the genes in each gene module of interest are subjected to enrichment analysis against the biological process, cellular component and molecular function terms provided by the GO database, and against the signaling pathways provided by the Reactome database.

Further, in step eight, the importance of every node in the gene co-expression network is scored with the PageRank algorithm, the score being based on network topology; the more important nodes in the network are thereby identified, and the genes corresponding to these nodes are the key genes.

Further, in step nine, diseases related to the key genes are searched in the DisGeNET database to explore the functions of the key genes.

Further, in step ten, survival analysis is carried out on the key genes using the online tool OncoLnc, and survival curves are plotted.

The invention has the beneficial effects that:

Complex network theory plays an important role in many disciplines, and in recent years its application in computer science, physics, sociology and other fields has been widely studied. An organism is a highly complex system: each of its biological processes requires the joint participation of many substances, so studying a single gene or protein rarely reveals the underlying molecular mechanism as a whole. Because of the complexity of cancer, existing bioinformatics methods have difficulty analyzing and mining cancer transcriptome data effectively. The invention therefore applies complex network theory to biological research, specifically to the processing and analysis of cancer transcriptome data.

The cancer transcriptome data processing method based on gene co-expression network analysis provided by the invention mainly comprises: acquiring a raw data set; preprocessing the raw data; identifying differentially expressed genes; constructing a gene co-expression network; mining gene modules; analyzing the correlation between the gene modules and clinical indices; performing enrichment analysis on the gene modules; identifying key genes; exploring the functions of the key genes; and performing survival analysis on the key genes. The GO/Reactome enrichment analysis results show that the gene modules obtained by the method are biologically meaningful, and verification of the key genes against the DisGeNET database shows that most of the key genes identified by the method are associated with tumor diseases. The method therefore performs well in gene module mining and key gene identification. It can serve as a tool for analyzing cancer transcriptome data, and its application offers a new direction for further understanding the pathogenesis of cancer.

Drawings

FIG. 1 is a flow chart of a cancer transcriptome data processing method based on gene co-expression network analysis according to the present invention.

FIG. 2 is a flow chart of data acquisition and preprocessing in example 1.

FIG. 3 is a hierarchical clustering tree of cancer tissue samples in example 1.

FIG. 4 is a flowchart of the identification of differentially expressed genes in example 1.

FIG. 5 is a volcano plot of differentially expressed genes in example 1.

FIG. 6 is a flow chart of the construction of the gene co-expression network in example 1.

FIG. 7 shows the gene co-expression network and several small sub-networks in example 1.

FIG. 8 is a flowchart of gene module mining in example 1.

FIG. 9 shows the results of the module mining of the eigenvector algorithm in example 1.

FIG. 10 is a flowchart of the correlation analysis between the gene modules and clinical markers in example 1.

FIG. 11 is a correlation matrix of gene modules and clinical indices in example 1.

FIG. 12 is a flow chart of GO/Reactome enrichment analysis of the gene module in example 1.

FIG. 13 shows the result of BP enrichment of the gene module m1 in example 1.

FIG. 14 shows the result of CC enrichment of gene module m1 in example 1.

FIG. 15 shows the MF enrichment result of gene module m1 in example 1.

FIG. 16 is a flowchart showing the identification of key genes in example 1.

FIG. 17 is a survival curve of the gene NAA40 in example 1.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

The invention relates to a cancer transcriptome data processing method based on gene coexpression network analysis, which specifically comprises the following steps as shown in figure 1:

Step one, acquiring the raw data set

The raw data set is obtained from the TCGA database (https://cancergenome.nih.gov/) or the GEO database (https://www.ncbi.nlm.nih.gov/geo/). It mainly includes gene expression data from cancer tissue samples, gene expression data from paracancerous tissue samples, and the clinical data corresponding to each sample.

Step two, preprocessing the raw data

First, low-expression genes are filtered out: genes whose maximum expression level (FPKM) across the cancer and paracancerous tissue samples is below a threshold are deleted. The samples are then hierarchically clustered on the expression levels of the remaining genes, and outlier samples are deleted, yielding the data set used for further mining.
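
The low-expression filter described above can be sketched as follows. This is a minimal illustration rather than the inventors' implementation: the function name, the gene names and the FPKM values are hypothetical, and the default cutoff of 1 FPKM is the value reported in Example 1. The hierarchical clustering used to find outlier samples is not covered here.

```python
def filter_low_expression(expr, min_max_fpkm=1.0):
    """Keep genes whose maximum FPKM over all samples reaches the cutoff.

    expr maps a gene ID to its FPKM values across the cancer and
    paracancerous samples (hypothetical layout).
    """
    return {gene: values for gene, values in expr.items()
            if max(values) >= min_max_fpkm}


# Hypothetical FPKM values for three genes in three samples.
expr = {
    "geneA": [0.0, 0.2, 0.9],    # maximum below 1 -> removed
    "geneB": [0.1, 3.5, 2.0],    # maximum is 3.5 -> kept
    "geneC": [12.0, 8.4, 15.1],  # kept
}
kept = filter_low_expression(expr)
print(sorted(kept))
```

A gene is removed only if it is low in every sample, so genes expressed in just one tissue type survive the filter.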

Step three, identifying the differentially expressed genes

All differentially expressed genes that satisfy the limiting condition are identified using the FC-t algorithm. The limiting condition is: (FC >= threshold || FC <= threshold) && P <= threshold, where each threshold is set separately, FC denotes the fold change in expression and P denotes the statistical significance of the t-test.
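
As a hedged illustration of how the limiting condition is applied once FC and P have been computed, the sketch below classifies genes as up-regulated, down-regulated or not significant. The function name and the third input row are hypothetical; the default cutoffs (2, 0.5 and 0.05) are those used in Example 1, and the first two rows are taken from Table 2. How the FC-t algorithm computes FC and P is not shown.

```python
def classify_gene(fc, p, fc_up=2.0, fc_down=0.5, p_max=0.05):
    """Apply (FC >= fc_up or FC <= fc_down) and P <= p_max to one gene."""
    if p > p_max:
        return "not significant"
    if fc >= fc_up:
        return "up"
    if fc <= fc_down:
        return "down"
    return "not significant"


rows = [
    ("ENSG00000146083", 2.700996521, 1.13e-40),  # from Table 2
    ("ENSG00000166391", 0.302473358, 1.03e-17),  # from Table 2
    ("hypothetical_gene", 1.2, 0.30),            # made-up negative case
]
calls = {gene: classify_gene(fc, p) for gene, fc, p in rows}
for gene, call in calls.items():
    print(gene, call)
```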

Step four, constructing a gene co-expression network

Pearson correlation analysis is carried out on every pair of genes based on the expression data of the differentially expressed genes in the samples, giving a Pearson correlation coefficient (PCC) and a P value for each pair. The limiting condition |PCC| >= threshold && P < threshold is then set to screen all the resulting relationships, and two genes satisfying it are regarded as having a co-expression relationship. Finally, all genes with co-expression relationships, together with those relationships, are represented as a graph to obtain the gene co-expression network.
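
The pairwise screening can be sketched as below. This is an illustrative sketch only: the function names and expression values are hypothetical, the cutoff 0.65 is the one used in Example 1, and the P-value filter is deliberately left out (in practice it would come from a statistics library), so only the |PCC| part of the condition is applied.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def coexpression_edges(expr, pcc_threshold=0.65):
    """Return (gene1, gene2, pcc) for every pair with |PCC| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= pcc_threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression values of four genes in five samples.
expr = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10.5],
    "g3": [5, 3, 4, 1, 2],
    "g4": [3, 1, 4, 1, 5],
}
edges = coexpression_edges(expr)
for g1, g2, r in edges:
    print(g1, g2, round(r, 3))
```

Both strongly positive and strongly negative correlations produce an edge, because the condition is on the absolute value of the PCC.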

Step five, mining gene modules

Network clustering is carried out on the nodes (genes) of the gene co-expression network using 4 community detection algorithms (eigenvector, label-propagation, infomap and edge-betweenness) to obtain communities composed of genes with similar functions, namely gene modules. The optimal module mining result is then selected using "modularity" as the evaluation criterion.

The 4 community detection algorithms (eigenvector, label-propagation, infomap, edge-betweenness) are implemented respectively with the functions leading.eigenvector.community(), label.propagation.community(), infomap.community() and edge.betweenness.community() of the R package "igraph".
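
To make the "modularity" criterion concrete, the sketch below computes the standard Newman modularity of a partition of an undirected, unweighted graph and picks the partition with the larger value. It is a standalone Python illustration with hypothetical names and a toy graph, not the igraph implementation referred to above, and it assumes the graph has at least one edge.

```python
def modularity(edges, membership):
    """Q = sum over communities of (L_c / m) - (d_c / (2m))^2.

    edges: list of (u, v) pairs; membership: node -> community label.
    L_c is the number of edges inside community c, d_c its total degree.
    """
    m = len(edges)
    intra = {}
    degree = {}
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
candidates = {
    "two_modules": {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"},
    "one_module": {node: "a" for node in range(6)},
}
scores = {name: modularity(edges, part) for name, part in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))
```

In the method itself the candidate partitions would be the outputs of the four community detection algorithms rather than hand-written dictionaries.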

Step six, correlation analysis of the gene modules and clinical indices

Principal component analysis is performed on all gene expression data in a gene module, and the first principal component is defined as the module eigengene (ME) of that gene module. Pearson correlation analysis is then carried out between the module eigengene (ME) of each gene module and the different clinical indices, and the absolute value of the Pearson correlation coefficient (PCC) is taken to obtain the correlation matrix of gene modules and clinical indices.
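
A minimal sketch of this step, under stated assumptions: the first principal component is obtained by power iteration on gene-centered data (no variance scaling, which the source does not specify), and the function names, the toy module and the clinical index values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def module_eigengene(module_expr, iterations=200):
    """First principal component scores (one per sample) of a gene module.

    module_expr is a list of rows, one per gene, each holding that gene's
    expression across samples.
    """
    centered = []
    for row in module_expr:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    n_genes, n_samples = len(centered), len(centered[0])

    def project(loadings):
        return [sum(loadings[g] * centered[g][s] for g in range(n_genes))
                for s in range(n_samples)]

    loadings = [1.0 / sqrt(n_genes)] * n_genes
    for _ in range(iterations):
        scores = project(loadings)
        updated = [sum(centered[g][s] * scores[s] for s in range(n_samples))
                   for g in range(n_genes)]
        norm = sqrt(sum(v * v for v in updated))
        loadings = [v / norm for v in updated]
    return project(loadings)


# Hypothetical module of three genes measured in four samples.
module = [[1, 2, 3, 4], [2, 4, 6, 8], [5, 4, 3, 2]]
clinical_index = [0, 1, 2, 3]  # e.g. a hypothetical numeric stage
me = module_eigengene(module)
abs_pcc = abs(pearson(me, clinical_index))
print(round(abs_pcc, 6))
```

Taking the absolute value, as the text specifies, removes the arbitrary sign of the principal component.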

Step seven, GO/Reactome enrichment analysis of the gene modules

The genes in each gene module of interest are subjected to enrichment analysis against the biological process (BP), cellular component (CC) and molecular function (MF) terms provided by the GO database (http://geneontology.org/), and against the signaling pathways provided by the Reactome database (https://reactome.org/).
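
The source does not say which statistical test underlies the enrichment P values. As one common possibility, the sketch below computes an over-representation P value with the one-sided hypergeometric test; the function name, parameter names and numbers are hypothetical and chosen only for illustration.

```python
from math import comb


def enrichment_p_value(n_background, n_annotated, n_module, n_overlap):
    """P(X >= n_overlap) when n_module genes are drawn from n_background
    genes, of which n_annotated carry the term or pathway annotation."""
    total = comb(n_background, n_module)
    upper = min(n_annotated, n_module)
    tail = sum(comb(n_annotated, i)
               * comb(n_background - n_annotated, n_module - i)
               for i in range(n_overlap, upper + 1))
    return tail / total


# Tiny hypothetical case: 10 background genes, 5 annotated,
# a module of 5 genes that are all annotated.
p = enrichment_p_value(10, 5, 5, 5)
print(p)
```

A small P value means the module contains more annotated genes than expected by chance; terms or pathways would then be ranked by this value.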

Step eight, identifying key genes

The importance of every node in the gene co-expression network is scored with the PageRank algorithm, the score being based on network topology; the more important nodes in the network are thereby identified, and the genes corresponding to these nodes are the key genes. The n genes with the highest scores can be selected as key genes for the disease, with n chosen according to the situation.
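
A hedged sketch of PageRank scoring on an undirected co-expression graph follows. The damping factor of 0.85 and the fixed number of iterations are conventional choices assumed here, not values given in the source, and the node names and edges are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected, unweighted graph."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    n = len(neighbors)
    scores = {node: 1.0 / n for node in neighbors}
    for _ in range(iterations):
        # Each node receives an equal share of every neighbor's score.
        scores = {
            node: (1 - damping) / n
            + damping * sum(scores[nb] / len(neighbors[nb])
                            for nb in neighbors[node])
            for node in neighbors
        }
    return scores


# Hypothetical network: one hub gene connected to four other genes.
edges = [("hub", "g1"), ("hub", "g2"), ("hub", "g3"), ("hub", "g4")]
scores = pagerank(edges)
top_n = sorted(scores, key=scores.get, reverse=True)[:1]
print(top_n)
```

Sorting the scores in descending order and taking the first n entries corresponds to the key-gene selection described above.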

Step nine, exploring the function of key genes

The functions of the key genes are explored by searching the DisGeNET database (http://www.disgenet.org/) for diseases associated with each key gene.

Step ten, survival analysis of key genes

Survival analysis is performed on the key genes using the online tool OncoLnc (http://www.oncolnc.org/), and survival curves are plotted.
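
The source relies on an online tool for this step and does not describe the estimator behind the curve. Assuming the survival curve is a Kaplan-Meier estimate, a minimal sketch looks like this; the function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times: follow-up time per patient; events: 1 = death, 0 = censored.
    Returns (time, survival probability) at each time with a death.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve


# Hypothetical follow-up of three patients (the second one is censored).
curve = kaplan_meier([1, 2, 3], [1, 0, 1])
print(curve)
```

For a key gene, patients would typically be split into high- and low-expression groups and one such curve drawn per group.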

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1 Breast invasive carcinoma transcriptome data analysis

The cancer transcriptome data processing method based on gene co-expression network analysis of the invention is used to analyze the transcriptome data of breast invasive carcinoma, and specifically comprises the following steps:

(1) acquisition and preprocessing of raw data sets

As shown in fig. 2, the method specifically includes the following steps:

① Expression profile data of breast invasive carcinoma (BRCA) tissues and paracancerous tissues are downloaded from the TCGA database (https:// cancer.

② Low-expression genes whose maximum expression level (FPKM) in the breast invasive carcinoma tissues or paracancerous tissues is less than 1 are deleted, leaving 14,129 genes.

③ The samples are hierarchically clustered on the expression levels, in the breast invasive carcinoma tissues or paracancerous tissues, of all genes remaining after filtering; the hierarchical clustering tree is shown in FIG. 3. As can be seen from FIG. 3, there are 3 outlier samples: TCGA-DD-AAEB, TCGA-CC-5259 and TCGA-FV-A4ZP. These are deleted to give the data set used for further analysis. Part of the preprocessed data (gene expression levels) is shown in Table 1.

TABLE 1 Gene expression data (FPKM) of breast invasive carcinoma tissue after preprocessing

Gene ID	Sample 1	Sample 2	Sample 3	Sample 4	Sample 5
ENSG00000167578	2.982962631	2.426924178	2.180554626	1.704487843	1.574196206
ENSG00000078237	1.511416409	2.962567928	3.496769794	2.590545901	0.977175405
ENSG00000146083	15.42361467	34.18583752	7.12327477	6.727115362	4.062698315
ENSG00000198242	105.7124415	207.8535728	193.8028654	113.9189313	112.3564048
ENSG00000134108	17.91677888	19.23785333	34.42522038	26.4881835	14.67476604
ENSG00000167700	25.30139043	15.73939839	12.30061718	5.993611251	58.71591007
ENSG00000060642	5.208718456	6.704560923	6.293990173	6.504866534	11.32913478
ENSG00000166391	0	0.268748468	0.194918684	0.157939948	20.1794968
ENSG00000070087	2.812958236	6.293035032	17.21057879	11.38258759	0.345172284
ENSG00000153561	8.757826693	6.304640789	15.13777725	11.0352232	15.45512232

(2) Identification of differentially expressed genes

As shown in fig. 4, the method specifically includes the following steps:

① The FC-t algorithm is used to calculate the FC value and P value of every gene; part of the results are shown in Table 2.

② The limiting condition (FC >= 2 || FC <= 0.5) && P <= 0.05 is set to identify differentially expressed genes; in total, 4130 up-regulated genes and 471 down-regulated genes are identified.

③ A volcano plot of the differentially expressed genes is drawn with the R package "ggplot2" to display the screening result visually, as shown in FIG. 5.

As can be seen from Table 2 and FIG. 5, a large number of genes are significantly differentially expressed in breast invasive carcinoma tissue compared with normal tissue.

TABLE 2 FC-t Algorithm calculation results

Gene ID	FC	P
ENSG00000146083	2.700996521	1.13E-40
ENSG00000198242	2.360743008	3.45E-33
ENSG00000167700	2.481208784	4.38E-28
ENSG00000166391	0.302473358	1.03E-17
ENSG00000127511	2.283066186	1.44E-46
ENSG00000064601	2.93550767	3.52E-58
ENSG00000227766	2.444028443	3.37E-09
ENSG00000008517	2.397033874	2.05E-13
ENSG00000070081	2.184524979	2.20E-27
ENSG00000275479	3.451166828	1.91E-23

(3) Construction of Gene Co-expression networks

As shown in fig. 6, the method specifically includes the following steps:

① For each differentially expressed gene, the Pearson correlation coefficient (PCC) and P value with every other differentially expressed gene are calculated; part of the results are shown in Table 3.

② The limiting condition |PCC| >= 0.65 && P < 0.05 is set to screen the resulting relationships, and two genes satisfying it are regarded as having a co-expression relationship.

③ All genes with co-expression relationships, together with those relationships, are imported into Cytoscape for visualization, as shown in FIG. 7.

④ Based on the visualization, the genes belonging to the small sub-networks are deleted; the remaining large network is the gene co-expression network.
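
In the example the small sub-networks are removed by inspecting the Cytoscape visualization. One programmatic reading of that step, offered here only as an assumption, is to keep the largest connected component of the graph; the function name and the edges below are hypothetical.

```python
def largest_component(edges):
    """Return the node set of the largest connected component."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    seen = set()
    best = set()
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        component = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in neighbors[node]:
                if nb not in component:
                    component.add(nb)
                    stack.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical edge list: one large network and one small two-gene network.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
main_genes = largest_component(edges)
network = [(u, v) for u, v in edges if u in main_genes and v in main_genes]
print(sorted(main_genes))
```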

TABLE 3 Pearson correlation analysis results

(4) Excavation of Gene modules

As shown in fig. 8, the method specifically includes the following steps:

① The functions leading.eigenvector.community(), label.propagation.community(), infomap.community() and edge.betweenness.community() of the R package "igraph" are used to carry out network clustering on the nodes (genes) of the gene co-expression network, giving communities composed of genes with similar functions, namely gene modules.

② The modularity of the clustering results of the 4 community detection algorithms (eigenvector, label-propagation, infomap and edge-betweenness) is calculated, and the result with the largest modularity is selected for further study. In this embodiment the eigenvector algorithm has the highest modularity, so its module mining result is selected.

③ Communities containing too few genes are deleted (in this embodiment, communities with fewer than 50 genes), leaving 9 communities, which correspond to 9 gene modules. The module mining result of the eigenvector algorithm is shown in FIG. 9.

(5) Correlation analysis of gene modules and clinical indicators

As shown in fig. 10, the method specifically includes the following steps:

① Principal component analysis is performed on all gene expression data in each gene module to obtain the module eigengene (ME) of each gene module.

② Pearson correlation analysis is carried out between the module eigengene (ME) of each gene module and the 4 clinical indices event, T, N and M (where event represents the survival status of the patient, and T, N and M represent the tumor stage), and the absolute value of the Pearson correlation coefficient (PCC) is taken to obtain the correlation matrix of gene modules and clinical indices, as shown in FIG. 11. As can be seen from FIG. 11, gene modules m1, m2, m3 and m7 are highly correlated with the clinical indices.

(6) GO/Reactome enrichment analysis of gene modules

As shown in fig. 12, the method specifically includes the following steps:

① The genes contained in gene modules m1, m2, m3 and m6 are each subjected to enrichment analysis against the biological process (BP), cellular component (CC) and molecular function (MF) terms provided by the GO database, and the 10 terms with the smallest P values are selected for study. The enrichment results for gene module m1 are shown in FIGS. 13-15.

② The genes contained in gene modules m1, m2, m3 and m6 are each subjected to enrichment analysis against the signaling pathways provided by the Reactome database, and the 10 pathways with the smallest P values are selected for study. The enrichment results for gene module m1 are shown in Table 4.

The enrichment results of the gene modules are highly specific and mostly related to tumor diseases, which supports the reliability of the module mining results.

TABLE 4 Reactome enrichment results for Gene Module m1

Pathway ID	Pathway name	Enriched gene count	P
R-HSA-69278	Cell Cycle, Mitotic	64	1.11E-16
R-HSA-1640170	Cell Cycle	71	1.11E-16
R-HSA-453279	Mitotic G1 phase and G1/S transition	24	2.57E-12
R-HSA-73886	Chromosome Maintenance	19	6.22E-12
R-HSA-69205	G1/S-Specific Transcription	13	3.12E-11
R-HSA-69206	G1/S Transition	20	3.50E-10
R-HSA-68886	M Phase	32	4.48E-10
R-HSA-69190	DNA strand elongation	11	1.62E-09
R-HSA-73894	DNA Repair	28	2.67E-08
R-HSA-69242	S Phase	19	3.58E-08

(7) Identification of key genes

As shown in fig. 16, the method specifically includes the following steps:

① The importance of every gene in the gene co-expression network is scored with the PageRank algorithm, based on the network topology.

② All genes are sorted in descending order of score.

③ The top 20 genes are selected as the key genes of breast invasive carcinoma. The 20 key genes in this embodiment are: FABP7, CXCL3, LOC284578, CAPN6, NRG2, HCFC1, ILF3, KANSL1, NAA40, NCOA6, PCDHB2, GRIK2, FRMD7, CCSER1, PCDHGA1, PCDHA1, LRRC37A6P, PCDHGA12, ZNF486 and PCDHGB5.

(8) Functional study of key genes

Each key gene in turn is entered into the DisGeNET database (http://www.disgenet.org/) to search for related diseases. The search results for gene NAA40 are shown in Table 5.

The DisGeNET search results show that most of the 20 key (hub) genes are related to tumor diseases, which supports the reliability of the cancer transcriptome data processing method based on gene co-expression network analysis provided by the invention.

TABLE 5 search results for Gene NAA40

(9) Survival analysis of Key genes

Survival analysis is performed on all key genes using the online tool OncoLnc (http://www.oncolnc.org/), and survival curves are plotted (with "BRCA" selected as the cancer). The survival curve of gene NAA40 is shown in FIG. 17.

The survival analysis shows that the 20 key genes are significantly correlated with patient survival, which further supports the effectiveness of the cancer transcriptome data processing method based on gene co-expression network analysis provided by the invention in identifying key genes.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. The cancer transcriptome data processing method based on gene coexpression network analysis is characterized by comprising the following steps:

step one, acquiring a raw data set;

step two, preprocessing the raw data;

step three, identifying differentially expressed genes;

step four, constructing a gene co-expression network;

step five, mining gene modules;

step six, performing correlation analysis on the gene modules and clinical indices;

step seven, performing enrichment analysis on the gene modules;

step eight, identifying key genes;

step nine, exploring the functions of the key genes;

step ten, performing survival analysis on the key genes.

2. The method for processing cancer transcriptome data based on gene co-expression network analysis of claim 1, wherein in step one, the raw data set is derived from TCGA database or GEO database; the raw data set includes gene expression data in cancer tissue samples, gene expression data in paracancerous tissue samples, and corresponding clinical data for each sample.

3. The method for processing cancer transcriptome data based on gene coexpression network analysis as claimed in claim 2, wherein in step two, low-expression genes are filtered out first, then the samples are hierarchically clustered, and outlier samples are deleted.

4. The method for processing cancer transcriptome data based on gene co-expression network analysis of claim 3, wherein in step three, all differentially expressed genes satisfying the defined condition are identified by using FC-t algorithm.

5. The method for processing cancer transcriptome data based on gene co-expression network analysis as claimed in claim 4, wherein in step four, Pearson correlation analysis is carried out on every pair of genes based on the expression data of the differentially expressed genes in the samples; a limiting condition is set to screen all the resulting relationships, and two genes satisfying the limiting condition are regarded as having a co-expression relationship; and all genes with co-expression relationships, together with those relationships, are represented as a graph to obtain the gene co-expression network.

6. The cancer transcriptome data processing method based on gene coexpression network analysis as claimed in claim 5, wherein in step five, 4 kinds of community detection algorithms are used to perform network clustering on nodes in the gene coexpression network to obtain communities consisting of similar-function genes, i.e. gene modules; and selecting the optimal module mining result by using the 'modularity' as an evaluation criterion.

7. The method for processing cancer transcriptome data based on gene coexpression network analysis as claimed in claim 6, wherein in step six, principal component analysis is performed on all gene expression data in a gene module, and the first principal component is defined as a module characteristic gene of the gene module; and carrying out Pearson correlation analysis on the module characteristic genes of each gene module and different clinical indexes to obtain a correlation matrix of the gene module and the clinical indexes.

8. The method for processing cancer transcriptome data based on gene coexpression network analysis as claimed in claim 7, wherein in step seven, the gene in the gene module of interest is enriched and analyzed with the biological process, cell components and molecular functions provided by GO database, and simultaneously the gene is enriched and analyzed with the signal pathway provided by Reactome database.

9. The cancer transcriptome data processing method based on gene coexpression network analysis of claim 8, wherein in step eight, importance of all nodes in the gene coexpression network is scored by using PageRank algorithm, scoring standard is based on topological principle, and then more important nodes in the gene coexpression network are identified, and genes corresponding to these nodes are key genes.

10. The method for processing cancer transcriptome data based on gene co-expression network analysis as claimed in claim 9, wherein in step nine, diseases related to the key genes are searched in the DisGeNET database to explore the functions of the key genes; and in step ten, survival analysis is carried out on the key genes using the online tool OncoLnc, and survival curves are plotted.