CN107203704B

CN107203704B - Method for identifying gene pathway based on GSA

Info

Publication number: CN107203704B
Application number: CN201710300928.5A
Authority: CN
Inventors: 刘文斌; 沈良忠; 昝乡镇
Original assignee: Guangzhou University
Current assignee: Guangzhou University
Priority date: 2017-05-02
Filing date: 2017-05-02
Publication date: 2020-08-25
Anticipated expiration: 2037-05-02
Also published as: CN107203704A

Abstract

The embodiment of the invention discloses a method for identifying a gene pathway based on GSA. The method comprises: obtaining a sample, determining the signaling pathways of the sample and the genes they contain, and ranking the genes contained in all the signaling pathways; determining the total number of genes in each signaling pathway and the positive and negative score averages of each gene, and calculating the pathway score of each signaling pathway; obtaining the out-degree of each gene, finding the maximum and minimum out-degrees, and calculating the out-degree weight of each gene; and selecting the out-degree weights of the genes in the same signaling pathway, revising the pathway score of that signaling pathway according to those weights, ranking the revised pathway scores, and determining that the signaling pathway with the largest revised score is the most likely to have changed. By taking into account that a gene regulating a large number of genes in a pathway is more important than a gene regulating only a few, the invention improves the accuracy of pathway identification.

Description

Method for identifying gene pathway based on GSA

Technical Field

The invention relates to the technical field of systems biology research, and in particular to a method for identifying a gene pathway based on GSA.

Background

Microarray-based high-throughput technologies generate large amounts of gene expression data. How to gain insight from these data and thereby understand the mechanisms of life phenomena remains a serious challenge for scientists around the world. A biological pathway is a set of interactions among a group of genes that together fulfill a specific function; the main types are signaling pathways and metabolic pathways. In a signaling pathway, a node represents a gene (or gene product) and an edge represents a signal transduced from one gene to another. In a metabolic pathway, nodes represent biochemical compounds and edges represent biochemical reactions between compounds, catalyzed by enzymes that are encoded by genes. Common pathway databases include KEGG and Reactome, which provide visual representations of the interactions between genes.

From the perspective of systems biology, the interactions between genes and changes in their dynamics are the main causes of various diseases and cancers. Since the topological features of a pathway reflect the position and importance of the genes in the pathway and the interactions between them, pathway identification should take into account as much information as possible about the genes contained in the pathway, such as their upstream and downstream positions, the number of genes they regulate, and the interactions between genes.

In 2005, PNAS published two important approaches to pathway analysis. One is the function-based significant pathway analysis method proposed by Tian et al., which jointly considers the significance of the difference between the expression of genes inside a gene set and that of genes outside the set (row permutation) and the significance of the correlation between the expression of the gene set and the phenotype (column permutation). The other is the well-known gene set enrichment analysis (GSEA) method proposed by Subramanian et al., whose main idea is to rank all genes by the correlation between their expression and a given phenotype, and then to score a given pathway P by how close its Kolmogorov-Smirnov statistic is to the extreme in the ranked list. In this method, the significance of the Kolmogorov-Smirnov statistic is determined by column permutation of the samples. In 2006, Zahn et al. used the Van der Waerden statistic in place of the Kolmogorov-Smirnov statistic and replaced the permutation test with bootstrap sampling, which takes into account the correlation between the expression levels of pairs of genes in the pathway and the correlation with other factors. In the same year, Efron et al. used the max-mean statistic in place of the Kolmogorov-Smirnov statistic to calculate the pathway score, then normalized the score by row permutation, and finally tested the significance of the pathway score by column permutation; this is the well-known GSA method.

Building on the gene set enrichment analysis method GSEA and the gene set analysis method GSA, researchers have also proposed the signaling pathway impact analysis method SPIA and the overlapping-gene down-weighting method PADOG. SPIA considers only the influence of the upstream and downstream positions of genes on the propagation of a perturbation signal; it ignores the fact that a gene regulating a large number of genes in a pathway is more important than a gene regulating only a few, and that a difference in such a gene has a greater influence on the function of the pathway. PADOG, which builds on GSA, reduces the influence of "common genes" that appear frequently in many pathways, but it likewise does not consider that a gene regulating a large number of genes in a pathway is more important than a gene regulating only a few.

Therefore, it is necessary to take into account that genes regulating a large number of genes in a pathway are more important than genes regulating only a few, and to improve the accuracy of pathway identification on that basis.

Disclosure of Invention

An object of the embodiments of the present invention is to provide a method for identifying a gene pathway based on GSA that improves the accuracy of pathway identification by taking into account that genes regulating a large number of genes in a pathway are more important than genes regulating only a few.

In order to solve the above technical problem, an embodiment of the present invention provides a method for identifying a gene pathway based on GSA, the method including:

a. obtaining a sample, determining the signaling pathways of the sample and the genes contained in each signaling pathway, and ranking the genes contained in all the signaling pathways according to the correlation between each gene and a phenotype;

b. determining the total number of genes contained in each signaling pathway, determining the positive score average and the negative score average of each gene in the corresponding signaling pathway according to the ranked genes, and calculating the pathway score of each signaling pathway from the total number of genes it contains and these positive and negative score averages;

c. obtaining the out-degree of each gene, finding the maximum out-degree and the minimum out-degree, and calculating the out-degree weight of each gene from its out-degree and the maximum and minimum out-degrees; wherein the out-degree of a gene is the number of genes it regulates downstream in the determined signaling pathway;

d. selecting the out-degree weights of the genes contained in the same signaling pathway, revising the calculated pathway score of that signaling pathway according to those out-degree weights, ranking the revised pathway scores of all the signaling pathways, and determining that the signaling pathway with the largest revised pathway score is the most likely to have changed.

Wherein the pathway score of each signaling pathway in step b is calculated by a formula [formula not reproduced in this text], in which ES₀(S) is the pathway score of the signaling pathway S containing the ranked gene g_j; m is the total number of genes contained in S; and the two remaining terms are, respectively, the positive score average and the negative score average of the ranked genes g_j in S.
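The formula is not reproduced in this text. Because the background identifies GSA with the max-mean statistic, one plausible reading is sketched below. This is an assumption for illustration, not the expression confirmed by the source: the per-gene score s_j (for example the correlation r or a t statistic) and the symbols \bar{s}^{+} and \bar{s}^{-} are introduced here and do not appear in the original text.

```latex
\bar{s}^{+}(S) = \frac{1}{m} \sum_{g_j \in S} \max(s_j, 0), \qquad
\bar{s}^{-}(S) = -\frac{1}{m} \sum_{g_j \in S} \min(s_j, 0), \qquad
ES_0(S) = \max\bigl( \bar{s}^{+}(S),\ \bar{s}^{-}(S) \bigr)
```

Under this reading both averages are non-negative and are taken over all m genes of S, so a pathway scores highly when its genes are, on average, strongly associated with the phenotype in one direction.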

Wherein, the step c specifically comprises:

obtaining the out-degree of each gene, and finding the maximum out-degree max(d) and the minimum out-degree min(d);

obtaining the out-degree weight of each gene according to a formula [formula not reproduced in this text], in which d(g_j) is the out-degree of the ranked gene g_j and w_d(g_j) is the out-degree weight of the ranked gene g_j.

Wherein the out-degree weight of each gene lies in the range [1, 2].
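The weight formula is not reproduced in this text. A form consistent with the stated inputs (d(g_j), max(d), min(d)) and with the stated range [1, 2] is a min-max scaling shifted by one, shown below as an assumption only; other monotone maps onto [1, 2] would satisfy the same description.

```latex
w_d(g_j) = 1 + \frac{d(g_j) - \min(d)}{\max(d) - \min(d)}
```

With this form, a gene with the minimum out-degree receives weight 1 and a gene with the maximum out-degree receives weight 2. The case max(d) = min(d) would need separate handling, which the text does not address.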

Wherein, the step d specifically comprises:

selecting the out-degree weights of the genes contained in the same signaling pathway, and multiplying together all the selected out-degree weights of the genes contained in that signaling pathway, the resulting product being used as the correction coefficient of that signaling pathway;

and multiplying the correction coefficient of each signaling pathway by the pathway score of the corresponding signaling pathway, the product being the revised pathway score of that signaling pathway, ranking the revised pathway scores of all the signaling pathways, and determining that the signaling pathway with the largest revised pathway score is the most likely to have changed.

The embodiment of the invention has the following beneficial effects:

In the embodiment of the invention, genes are ranked according to their correlation with the phenotype and the pathway score of each signaling pathway is calculated. The importance of regulatory genes is then taken into account: the pathway score of each signaling pathway is revised using the out-degree of each gene, and the importance of each pathway is judged from its revised score, thereby improving the accuracy of pathway identification.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings obtained from them by those skilled in the art without inventive effort still fall within the scope of the present invention.

Fig. 1 is a flowchart of a method for identifying a gene pathway based on GSA according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, in an embodiment of the present invention, a method for identifying a gene pathway based on GSA is provided, the method comprising:

Step S1: obtaining a sample, determining the signaling pathways of the sample and the genes contained in each signaling pathway, and ranking the genes contained in all the signaling pathways according to the correlation between each gene and the phenotype.

Specifically, a sample is obtained, the signaling pathways of the sample and the genes contained in each signaling pathway are determined, and the genes contained in all the signaling pathways are ranked according to the correlation between each gene and the phenotype.

As an example, let N be the total number of genes and let S be a given signaling pathway containing M genes. The N genes are ranked by the correlation r (or the t statistic) between each gene g and the phenotype, giving the ranked list L = [g₁, ..., g_j, ..., g_N].
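As a non-limiting illustration of this ranking step, the following Python sketch ranks genes by their Pearson correlation with a binary phenotype. The function names `pearson` and `rank_genes`, the toy expression values, and the descending sort order are assumptions made for illustration; the text specifies only that the genes are ranked by the correlation r (or a t statistic).

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / math.sqrt(vx * vy)

def rank_genes(expression, phenotype):
    """Return [(gene, r)] sorted from the most positive to the most negative r."""
    scores = {g: pearson(values, phenotype) for g, values in expression.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: 3 genes x 4 samples; phenotype 0 = control, 1 = case.
expression = {
    "gA": [1.0, 1.2, 3.1, 3.3],
    "gB": [2.0, 2.1, 2.0, 2.1],
    "gC": [4.0, 3.8, 1.1, 0.9],
}
phenotype = [0, 0, 1, 1]
ranked = rank_genes(expression, phenotype)  # the ranked list L
```

In this toy data `gA` rises with the phenotype and `gC` falls, so they land at opposite ends of the ranked list, while `gB` sits in between.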

Step S2: determining the total number of genes contained in each signaling pathway, determining the positive score average and the negative score average of each gene in the corresponding signaling pathway according to the ranked genes, and calculating the pathway score of each signaling pathway from the total number of genes it contains and these positive and negative score averages.

Specifically, the total number of genes contained in each signaling pathway is determined, and the positive score average and the negative score average of each gene in the corresponding signaling pathway are determined according to the ranked genes.

The pathway score of each signaling pathway is then calculated by a formula [formula not reproduced in this text], in which ES₀(S) is the pathway score of the signaling pathway S containing the ranked gene g_j; m is the total number of genes contained in S; and the two remaining terms are, respectively, the positive score average and the negative score average of the ranked genes g_j in S.
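As a non-limiting illustration, the pathway score can be sketched in Python under the assumption that it follows the max-mean statistic that the background attributes to GSA. The function name `max_mean_score`, the toy scores, and the choice to return an unsigned magnitude are assumptions; the exact formula is not reproduced in this text.

```python
def max_mean_score(gene_scores, pathway_genes):
    """Max-mean statistic over the genes of one pathway (assumed form).

    gene_scores   -- dict mapping gene name to its per-gene score
                     (for example the correlation r or a t statistic)
    pathway_genes -- iterable of the gene names contained in pathway S
    """
    scores = [gene_scores[g] for g in pathway_genes if g in gene_scores]
    m = len(scores)  # total number of scored genes in the pathway
    if m == 0:
        raise ValueError("pathway contains no scored genes")
    # Positive parts and negative parts are each averaged over all m genes.
    pos_mean = sum(max(s, 0.0) for s in scores) / m
    neg_mean = sum(max(-s, 0.0) for s in scores) / m
    # Assumption: return the larger average as a non-negative magnitude.
    return max(pos_mean, neg_mean)

# Hypothetical per-gene scores and two hypothetical pathways.
gene_scores = {"g1": 2.0, "g2": -1.0, "g3": 0.5, "g4": -3.0}
score_p1 = max_mean_score(gene_scores, ["g1", "g2", "g3"])  # positive side dominates
score_p2 = max_mean_score(gene_scores, ["g2", "g4"])        # negative side dominates
```

Averaging over all m genes, rather than only over the genes with a given sign, keeps a single extreme gene from dominating a large pathway.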

Step S3: obtaining the out-degree of each gene, finding the maximum out-degree and the minimum out-degree, and calculating the out-degree weight of each gene from its out-degree and the maximum and minimum out-degrees; wherein the out-degree of a gene is the number of genes it regulates downstream in the determined signaling pathway.

Specifically, the out-degree of a gene is the number of downstream genes it regulates, so a gene with a larger out-degree has a greater influence on the pathway.

The out-degree of each gene is obtained, and the maximum out-degree max(d) and the minimum out-degree min(d) are found from the obtained out-degrees.

The out-degree weight of each gene is then obtained by a formula [formula not reproduced in this text], in which d(g_j) is the out-degree of the ranked gene g_j and w_d(g_j) is the out-degree weight of the ranked gene g_j. This value reflects the importance of the gene in the pathway: the larger the value, the more important the gene is in the pathway, and the smaller the value, the less important it is. w_d(g_j) lies in the range [1, 2]; that is, the out-degree weight of every gene is in [1, 2].
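As a non-limiting illustration, the following Python sketch counts out-degrees from a directed edge list and maps them onto [1, 2]. The min-max form of the weight, the counting of direct targets only, the handling of the case where all out-degrees are equal, and all names (`out_degrees`, `out_degree_weights`, the toy edges) are assumptions; the text gives only the inputs and the range of the weight.

```python
from collections import defaultdict

def out_degrees(edges, genes):
    """Count the distinct direct downstream targets of each gene."""
    targets = defaultdict(set)
    for src, dst in edges:
        targets[src].add(dst)
    return {g: len(targets[g]) for g in genes}

def out_degree_weights(degrees):
    """Map out-degrees onto [1, 2] by min-max scaling (assumed form)."""
    d_min, d_max = min(degrees.values()), max(degrees.values())
    if d_max == d_min:
        # Assumption: with no spread in out-degree, no gene is up-weighted.
        return {g: 1.0 for g in degrees}
    span = d_max - d_min
    return {g: 1.0 + (d - d_min) / span for g, d in degrees.items()}

# Hypothetical pathway graph: an edge (a, b) means gene a regulates gene b.
edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g2", "g4"), ("g3", "g4")]
genes = ["g1", "g2", "g3", "g4"]
degrees = out_degrees(edges, genes)
weights = out_degree_weights(degrees)
```

Here `g1` regulates three genes and receives the maximum weight, while `g4` regulates none and receives the minimum weight.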

Step S4: selecting the out-degree weights of the genes contained in the same signaling pathway, revising the calculated pathway score of that signaling pathway according to the selected out-degree weights, ranking the revised pathway scores of all the signaling pathways, and determining that the signaling pathway with the largest revised pathway score is the most likely to have changed.

Specifically, the out-degree weights of the genes contained in the same signaling pathway are selected and multiplied together, and the resulting product is used as the correction coefficient of that signaling pathway.

The correction coefficient of each signaling pathway is multiplied by the pathway score of the corresponding signaling pathway, and the product is the revised pathway score of that signaling pathway. The revised pathway scores of all the signaling pathways are ranked, and the signaling pathway with the largest revised pathway score is determined to be the most likely to have changed; that is, the higher a signaling pathway ranks by revised score, the more it merits study.
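As a non-limiting illustration of this revision step, the following Python sketch multiplies the out-degree weights of the genes in each pathway to obtain the correction coefficient, multiplies that coefficient by the pathway score, and ranks the results. The function name `revise_and_rank`, the default weight of 1.0 for a gene with no recorded weight, and the toy values are assumptions.

```python
import math

def revise_and_rank(pathway_scores, pathway_genes, weights):
    """Return [(pathway, revised_score)] sorted from largest to smallest."""
    revised = {}
    for name, score in pathway_scores.items():
        # Correction coefficient: product of the out-degree weights of all
        # genes in the pathway (assumption: a missing weight counts as 1.0).
        coeff = math.prod(weights.get(g, 1.0) for g in pathway_genes[name])
        revised[name] = coeff * score
    return sorted(revised.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical pathway scores, gene memberships, and out-degree weights.
pathway_scores = {"P1": 0.8, "P2": 1.0}
pathway_genes = {"P1": ["g1", "g2"], "P2": ["g3", "g4"]}
weights = {"g1": 2.0, "g2": 1.5, "g3": 1.0, "g4": 1.0}
ranking = revise_and_rank(pathway_scores, pathway_genes, weights)
```

In this example `P1` starts with the lower raw score but overtakes `P2` after revision, because its genes carry larger out-degree weights. Since every weight is at least 1, the coefficient never lowers a score.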


It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by relevant hardware instructed by a program, and the program may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A method for identifying a gene pathway based on GSA, the method comprising:

d. selecting the out-degree weights of the genes contained in the same signaling pathway, revising the calculated pathway score of that signaling pathway according to those out-degree weights, ranking the revised pathway scores of all the signaling pathways, and determining that the signaling pathway with the largest revised pathway score is the most likely to have changed;

wherein the pathway score of each signaling pathway in step b is calculated by the formula

the negative score average of the ranked genes g_j in signaling pathway S.

2. The method according to claim 1, wherein said step c specifically comprises:

according to the formula

3. The method of claim 2, wherein the out-degree weight of each gene is in the range [1, 2].

4. The method according to claim 1, wherein said step d specifically comprises: