CN108763864B - Method for evaluating state of biological pathway sample

Info

Publication number: CN108763864B
Application number: CN201810420756.XA
Authority: CN
Inventors: 沈良忠; 刘文斌; 昝乡镇
Original assignee: Wenzhou University
Current assignee: Wenzhou University
Priority date: 2018-05-04
Filing date: 2018-05-04
Publication date: 2021-06-29
Anticipated expiration: 2038-05-04
Also published as: CN108763864A

Abstract

The invention provides a method for evaluating the state of a biological pathway sample. The method comprises: obtaining samples, the pathway network corresponding to each sample, and the expression values of all genes, and weighting the expression value of each gene using the t value of the gene's t-test and its Pearson correlation coefficient as the gene weights; obtaining, from the topology of the pathway network of each sample, the interaction types and corresponding strengths of genes in the same pathway, and computing the interaction expression values between genes in each sample from the strengths of the interaction types and the weighted gene expression values; integrating the gene expression values and the interaction expression values of each sample, analyzing them by principal component analysis, and defining the first principal component obtained for each sample as the activity score of the pathway; and constructing a classifier to evaluate the pathway state of each sample. By considering both the importance of genes and the importance of the interactions between genes when inferring pathway activity, the invention enables evaluation of the state of a biological pathway sample.

Description

Method for evaluating state of biological pathway sample

Technical Field

The invention relates to the technical field of gene detection, in particular to a method for evaluating the state of a biological pathway sample.

Background

Many recent studies propose searching for more robust biomarkers at the functional level to overcome the instability of single-gene signatures. Genes do not take part in biological processes in isolation; gene products usually act synergistically as functional modules, signaling cascades, and the like. Functional modules that are dysregulated at a higher level may therefore be more stable biomarkers than single genes, and they are less affected by noise. Biomarkers at the functional level can effectively reduce the influence of tissue heterogeneity and of genetic heterogeneity among samples, while also helping to analyze the relationship between important functional pathways and disease. Integrating the expression profiles of functionally related genes and extracting classification features at the functional level is therefore beneficial for obtaining more robust biomarkers. Functional modules are often embedded in classical pathways and protein networks, and this high-throughput information can be obtained from Gene Ontology, the KEGG database, or other gene sets defined in microarray expression profiling experiments, such as the molecular signatures database MSigDB.

Because pathway information closely reflects the chemical interactions and functional relationships between genes, the expression levels of the genes in a pathway are inseparable from the function the pathway carries out: once the expression of significant genes in a pathway is disturbed, part of the pathway's function is also disrupted. Classification experiments are therefore performed by analyzing the gene expression profiles within a pathway to define the pathway's activity and thereby obtain accurate biomarkers. For example, to address the overlap of genes among different pathways, Su et al. designed a log-likelihood function to search for linear sub-pathways with discriminative power; the resulting linear sub-pathways classify better and further improve classification performance. Breslin et al. infer pathway activity from the sum of the expression values of the pathway's member genes. Guo et al. infer pathway activity from the mean or median of the member genes' expression values. Bild et al. analyze the expression profiles of pathway member genes by principal component analysis and use the first principal component as the pathway activity; this method can also identify dysregulated pathway patterns and oncogenic pathway signatures, providing an important basis for the targeted treatment of related cancer subtypes. Lee et al. proposed that the condition-responsive genes (CORGs) in a pathway, rather than all of its genes, play the major role in pathway activity. These results indicate that considering functional modules of genes can identify more stable biomarkers and achieve more accurate classification.

However, the above methods for inferring pathway activity use only the significant genes in a pathway and do not consider interaction information between genes. They treat a pathway as a simple set of individual genes and ignore the topology of the pathway network, losing much important information about communication between genes.

Disclosure of Invention

The technical problem to be solved by the embodiments of the present invention is to provide a method for evaluating the state of a biological pathway sample, which can estimate the activity of the pathway by considering the importance of genes and the importance of interactions between genes, thereby realizing the evaluation of the state of the biological pathway sample.

In order to solve the above technical problem, an embodiment of the present invention provides a method for evaluating a state of a biological pathway sample, including the steps of:

Step S1: obtaining samples and their corresponding pathway networks, obtaining the expression value of each gene in the pathway network of each sample, and weighting those expression values, taking as the gene weights the t value of a t-test on the gene's expression values between two different phenotypes and the Pearson correlation coefficient between the gene's expression values and the sample phenotypes;

Step S2: obtaining, from the topology of the pathway network of each sample, the interaction type and corresponding strength of genes in the same pathway, and obtaining the interaction expression values between genes in the pathway network of each sample from the strength corresponding to each interaction type and the weighted expression values of the genes;

Step S3: integrating the expression values of the genes in the pathway network of each sample with the interaction expression values between genes, analyzing them by principal component analysis, and defining the first principal component obtained for each sample as the activity score of the pathway;

Step S4: constructing a classifier from the pathway activity scores obtained for the samples, and evaluating the pathway state of each sample.

In step S1, the expression value of each gene in the pathway network of each sample is standardized by the formula

$$z_{ij} = \frac{g_{ij} - \operatorname{mean}(g_i)}{\operatorname{std}(g_i)}$$

where $g_{ij}$ is the expression value of gene $i$ in sample $j$, and $\operatorname{mean}(g_i)$ and $\operatorname{std}(g_i)$ are the mean and standard deviation of the expression value of the gene over all samples, respectively.

In step S1, the weighted expression value of a gene is $z'_{ij} = t_{score}(g_i)^2 \cdot \rho(g_i) \cdot z_{ij}$, where $z'_{ij}$ is the weighted expression value of gene $i$ in sample $j$; $t_{score}(g_i)$ is the statistic obtained for gene $g_i$ by a two-tailed t-test on its expression values between the two phenotypes; and $\rho(g_i)$ is the Pearson correlation coefficient between the expression values of the gene in all samples and the sample phenotypes.

Wherein, in the step S2, the interaction expression value between the genes is

where $e_{hj}$ is the expression value of the interaction between gene $g_{ij}$ and gene $g_{kj}$, i.e. of the $h$-th interaction in sample $j$; $\beta_{ik}$ is the $\beta$ value corresponding to the type of interaction between gene $g_i$ and gene $g_k$; $\rho_{ik}$ is the Pearson correlation coefficient of the expression values of gene $g_i$ and gene $g_k$; $z'_{ij}$ is the weighted expression value of gene $i$ in sample $j$; and $z'_{kj}$ is the weighted expression value of gene $k$ in sample $j$.

Wherein, in the step S3, the calculation formula of the activity score of each gene pathway is:

$$a(P_j) = w_{1j} z'_{1j} + w_{2j} z'_{2j} + \dots + w_{ij} z'_{ij} + \dots + w_{nj} z'_{nj} + w_{(n+1)j} e_{1j} + \dots + w_{(n+h)j} e_{hj} + \dots + w_{(n+l)j} e_{lj}$$

where $a(P_j)$ is the pathway activity score of sample $j$, $w_{1j}$ is the weight of the first gene of sample $j$ in the first principal component, $w_{ij}$ is the weight of gene $i$ of sample $j$ in the first principal component, $w_{(n+1)j}$ is the weight of the first gene-gene interaction of sample $j$ in the first principal component, $n$ is the total number of genes, and $l$ is the number of gene-gene interactions.

The embodiment of the invention has the following beneficial effects:

The invention integrates the expression value of each gene and the interaction expression values between genes in the pathway network of each sample, analyzes them by principal component analysis, and defines the first principal component obtained for each sample as the activity score of the pathway, from which the pathway state of each sample is evaluated. Pathway activity is thus inferred from both the importance of the genes and the importance of the interactions between them, which gives the method wide practicability.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is within the scope of the present invention for those skilled in the art to obtain other drawings based on the drawings without inventive exercise.

FIG. 1 is a flow chart of a method for evaluating a status of a sample of a biological pathway according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating an application scenario of the method for evaluating the status of a sample of a biological pathway according to an embodiment of the present invention;

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.

Referring to fig. 1, a method for evaluating a state of a sample of a biological pathway according to an embodiment of the present invention includes the following steps:

Specifically, in step S1, the gene expression values are first brought onto a common scale, because expression values of different magnitudes would lead to unreasonable classification results. The expression value of each gene in the pathway network of each sample is standardized as follows:

$$z_{ij} = \frac{g_{ij} - \operatorname{mean}(g_i)}{\operatorname{std}(g_i)} \tag{1}$$

In formula (1), $g_{ij}$ is the expression value of gene $i$ in sample $j$, and $\operatorname{mean}(g_i)$ and $\operatorname{std}(g_i)$ are the mean and standard deviation of the expression value of the gene over all samples, respectively. If the expression value of a gene is missing in sample $j$, the missing value is filled with the mean of the gene's expression values in the other samples.
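
The standardization and missing-value rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `standardize`, the use of `None` to mark a missing value, and the choice of the population standard deviation (dividing by the number of samples) are assumptions, since the text does not say which form of standard deviation is used.

```python
import math

def standardize(expr):
    """Z-score each gene (row) across samples; None marks a missing value."""
    z = []
    for row in expr:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed)
        # Fill each missing value with the gene's mean over the other samples.
        filled = [mean if v is None else v for v in row]
        # Assumption: population standard deviation (divide by sample count).
        std = math.sqrt(sum((v - mean) ** 2 for v in filled) / len(filled))
        z.append([(v - mean) / std if std > 0 else 0.0 for v in filled])
    return z

# Hypothetical expression matrix: rows are genes, columns are samples.
expr = [
    [2.0, 4.0, None, 6.0],  # gene 1, missing in sample 3
    [1.0, 1.0, 3.0, 3.0],   # gene 2
]
z = standardize(expr)
```

Because a missing value is replaced by the gene's own mean, it is mapped to zero after standardization and contributes nothing to later weighted sums.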

The t-test value directly characterizes the difference in a gene's expression between two phenotypes: the larger the absolute t-test value of a gene, the more pronounced the difference in its expression between the two phenotypes. This property is used to weight the gene expression values, which amplifies the difference in a gene's expression between phenotypes.

The weighted expression value of each gene in each sample is:

$$z'_{ij} = t_{score}(g_i)^2 \cdot \rho(g_i) \cdot z_{ij} \tag{2}$$

In formula (2), $z'_{ij}$ is the weighted expression value of gene $i$ in sample $j$; $t_{score}(g_i)$ is the statistic obtained for gene $g_i$ by a two-tailed t-test on its expression values between the two phenotypes; and $\rho(g_i)$ is the Pearson correlation coefficient between the expression values of the gene in all samples and the sample phenotypes.
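
A minimal sketch of formula (2) for a single gene is given below. Several things are assumptions made for illustration: the phenotypes are encoded as labels 0 and 1, the t statistic is the pooled-variance (Student) form, and the helper names and data are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def student_t(a, b):
    """Two-sample t statistic with pooled variance (assumed form)."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    va = sum((v - ma) ** 2 for v in a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

def weight_gene(z_row, labels):
    """Apply z'_ij = t_score(g_i)^2 * rho(g_i) * z_ij to one gene."""
    pos = [v for v, y in zip(z_row, labels) if y == 1]
    neg = [v for v, y in zip(z_row, labels) if y == 0]
    t = student_t(pos, neg)
    rho = pearson(z_row, labels)
    return [t ** 2 * rho * v for v in z_row]

# Hypothetical standardized values of one gene and the sample phenotypes.
labels = [0, 0, 0, 1, 1, 1]
z_row = [-1.2, -0.9, -0.6, 0.7, 0.9, 1.1]
weighted = weight_gene(z_row, labels)
```

Since the t value enters squared, the sign of the weight comes only from the Pearson coefficient; every sample of the gene is scaled by the same factor.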

It should be noted that t-tests can be divided into one-sample (single-population) tests and two-sample (double-population) tests. The one-sample t-test examines whether the difference between the mean of a sample and the mean of the population is significant. The statistic of the one-sample t-test is:

$$t = \frac{\bar{X} - \mu}{\sigma_X / \sqrt{n}}$$

where $\bar{X}$ is the sample mean, $\mu$ is the population mean, $n$ is the sample size, and $\sigma_X$ is the standard deviation of the sample.

The two-sample t-test examines whether the populations represented by two samples differ. It can be subdivided into the independent-samples t-test and the paired-samples t-test. The independent-samples t-test is commonly used in cancer classification experiments, where the difference in a gene's expression between two phenotypes is described by the gene's t-test value between those phenotypes. The t statistic of a gene between two phenotypes is computed from $n_1$ and $n_2$, the numbers of positive and negative samples respectively; the variances of the gene's expression values in the two groups; and the means of the gene's expression values in the two groups. The null hypothesis is that the normal distributions followed by the two groups have the same mean and variance. The test is properly called Student's t-test only when the variances of the two populations are equal; when that assumption does not hold, the method is sometimes called Welch's t-test. A t-test can also be used to test whether the difference between two measurements of the same statistic is zero, in which case it is often called a "paired" or "repeated-measures" t-test.
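
The distinction between Student's and Welch's t-test can be made concrete with the standard textbook forms of the two statistics. The sketch below is an illustration only: it does not establish which form the embodiment uses, and the function names and expression values are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    """Unbiased sample variance (divide by n - 1)."""
    m = mean(xs)
    return sum((v - m) ** 2 for v in xs) / (len(xs) - 1)

def student_t(a, b):
    """Pooled-variance statistic; assumes equal population variances."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * sample_var(a) + (nb - 1) * sample_var(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def welch_t(a, b):
    """Welch statistic; does not assume equal variances."""
    se = math.sqrt(sample_var(a) / len(a) + sample_var(b) / len(b))
    return (mean(a) - mean(b)) / se

# Hypothetical expression values of one gene in two phenotypes.
tumor = [5.1, 5.8, 6.4, 6.0]
normal_equal_n = [3.9, 4.2, 4.0, 4.3]
normal_unequal_n = [3.9, 4.2, 4.0, 4.3, 4.1, 4.0]

same_n = (student_t(tumor, normal_equal_n), welch_t(tumor, normal_equal_n))
diff_n = (student_t(tumor, normal_unequal_n), welch_t(tumor, normal_unequal_n))
```

When the two groups have the same size, the two statistics coincide and only the degrees of freedom differ; with unequal group sizes and unequal variances the statistics themselves diverge.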

It should be noted that the Pearson correlation coefficient is commonly used to characterize both the correlation between gene expression values and sample phenotypes and the correlation between two genes. When two genes interact, their Pearson correlation coefficient gives a direct measure of the strength of the interaction. The Pearson correlation coefficient of interacting genes $i$ and $k$ is:

$$\rho_{ik} = \frac{\operatorname{cov}(g_i, g_k)}{\sigma_{g_i}\,\sigma_{g_k}}$$

where $\operatorname{cov}(g_i, g_k)$ is the covariance of the expression values of the two genes across samples, and $\sigma_{g_i}$ and $\sigma_{g_k}$ are their standard deviations. The value of the Pearson correlation coefficient lies between -1 and 1. A coefficient of 1 means the two genes are perfectly positively linearly correlated, a strong correlation; a coefficient of 0 means the two genes have no linear correlation; and a coefficient of -1 means the two genes are perfectly negatively linearly correlated, which is also a strong correlation.

The Pearson correlation coefficient is symmetric, i.e. $\operatorname{corr}(X, Y) = \operatorname{corr}(Y, X)$. A key mathematical property of the Pearson correlation coefficient is that it is invariant under changes in the location and scale of the two variables. That is, X can be transformed to a + bX and Y to c + dY, where a, b, c, and d are constants with b > 0 and d > 0, without changing the correlation coefficient between them.
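
The symmetry and invariance properties can be checked numerically. The following sketch uses hypothetical expression values for two genes; the function name `pearson` and the constants of the affine transformation are chosen only for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Hypothetical expression values of genes i and k across four samples.
g_i = [1.0, 2.0, 4.0, 7.0]
g_k = [2.0, 1.0, 5.0, 6.0]

r = pearson(g_i, g_k)
r_swapped = pearson(g_k, g_i)  # symmetry: corr(X, Y) = corr(Y, X)
# Affine changes a + bX and c + dY with b, d > 0 leave the coefficient unchanged.
r_affine = pearson([3 + 2 * v for v in g_i], [-1 + 0.5 * v for v in g_k])
```

If b or d were negative, the magnitude would be preserved but the sign of the coefficient would flip, which is why the property is stated for positive scale factors.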

In step S2, if there is an interaction relationship between gene i and gene k in the pathway, the expression value of the interaction between the two genes can be defined based on the expression values of the two genes. The gene interactions are weighted by the strength and type of interaction between them. Thus, the interaction expression between gene i and gene k is expressed as:

In formula (3), $e_{hj}$ is the expression value of the interaction between gene $g_{ij}$ and gene $g_{kj}$, i.e. of the $h$-th interaction in sample $j$; $\beta_{ik}$ is the $\beta$ value corresponding to the type of interaction between gene $g_i$ and gene $g_k$; $\rho_{ik}$ is the Pearson correlation coefficient of the expression values of gene $g_i$ and gene $g_k$; $z'_{ij}$ is the weighted expression value of gene $i$ in sample $j$; and $z'_{kj}$ is the weighted expression value of gene $k$ in sample $j$.

By analogy, the interaction expression value between the genes in the pathway network of each sample can be determined.

In step S3, the activity score of each gene pathway is calculated by the formula:

$$a(P_j) = w_{1j} z'_{1j} + w_{2j} z'_{2j} + \dots + w_{ij} z'_{ij} + \dots + w_{nj} z'_{nj} + w_{(n+1)j} e_{1j} + \dots + w_{(n+h)j} e_{hj} + \dots + w_{(n+l)j} e_{lj} \tag{4}$$

In formula (4), $a(P_j)$ is the pathway activity score of sample $j$, $w_{1j}$ is the weight of the first gene of sample $j$ in the first principal component, $w_{ij}$ is the weight of gene $i$ of sample $j$ in the first principal component, $w_{(n+1)j}$ is the weight of the first gene-gene interaction of sample $j$ in the first principal component, $n$ is the total number of genes, and $l$ is the number of gene-gene interactions.

It should be noted that principal component analysis (PCA) is an important dimensionality-reduction algorithm in machine learning. Its basic principle is to project the original data onto the directions of the eigenvectors of the covariance matrix.

The algorithm for PCA roughly comprises the following steps:

1: standardizing all sample data, i.e. mean normalization;

2: calculating the covariance matrix C of the sample data:

$$C = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^{T} \tag{4-1}$$

where $m$ is the number of samples and $n$ is the number of data values (the dimension) of each sample $x^{(i)}$;

3: performing singular value decomposition on the covariance matrix obtained in the previous step:

$$[U, S, V] = \operatorname{svd}(C) \tag{4-2}$$

4: forming the projection matrix P from the eigenvectors corresponding to the selected (largest) eigenvalues;

5: projecting the original data onto the projection matrix to obtain:

$$Z = P^{T} X \tag{4-3}$$
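
The five steps can be sketched in plain Python for the case where only the first principal component is needed, as in step S3. Note the substitution: instead of a full singular value decomposition, the sketch uses power iteration to find the leading eigenvector of the covariance matrix, which is sufficient for a single component. The function name, the iteration count, and the data (one row per sample, one column per gene or interaction feature) are hypothetical.

```python
import math

def first_principal_component(samples, iters=500):
    """Return (loading vector, per-sample scores) for the first component."""
    m, n = len(samples), len(samples[0])
    # Step 1: mean-center every feature.
    means = [sum(s[d] for s in samples) / m for d in range(n)]
    X = [[s[d] - means[d] for d in range(n)] for s in samples]
    # Step 2: covariance matrix C = (1/m) * sum_i x_i x_i^T (n x n).
    C = [[sum(x[a] * x[b] for x in X) / m for b in range(n)] for a in range(n)]
    # Step 3 (substitute for SVD): power iteration for the leading eigenvector.
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    # Steps 4-5: the projection matrix is the single vector v; project X on it.
    scores = [sum(vd * xd for vd, xd in zip(v, x)) for x in X]
    return v, scores

# Hypothetical feature matrix: 4 samples, 3 features.
samples = [
    [2.0, 1.9, 0.1],
    [1.0, 1.1, -0.1],
    [-1.0, -0.9, 0.0],
    [-2.0, -2.1, 0.0],
]
loading, scores = first_principal_component(samples)
```

The entries of `loading` play the role of the weights in formula (4), and each entry of `scores` is the corresponding activity score of one sample. Power iteration assumes the starting vector is not orthogonal to the leading eigenvector; a library SVD routine avoids that caveat.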

PCA is widely used across research fields and goes by different names in different fields; for example, it is called spectral decomposition in noise and vibration analysis and empirical modal analysis in structural dynamics. Classification problems in machine learning often involve a feature-selection step. In a classification experiment with a limited number of samples, using tens of thousands of genes as features is clearly undesirable, because it greatly reduces the performance of the classifier, so dimensionality reduction of the biological data is a practical approach. Gene data reduced by PCA retain much of the information in the original data; the first principal component has the largest variance and is often selected as an important classification feature.

In step S4, a classifier is constructed from the pathway activity scores to predict the pathway state of each biological sample, and the classification performance obtained with different pathway networks is thereby verified. A total of 19 significant pathways were selected, as shown in Table 1 below. The degree in Table 1 is the number of other pathways connected to a given pathway; the higher the degree of a pathway, the more pathways are connected to it and the more important the pathway is.

Pathway name	Degree
MAPK signaling pathway	69
Adherens junction	36
Pathways in cancer	31
ECM-receptor interaction	26
Tight junction	22
Adipocytokine signaling pathway	19
Regulation of actin cytoskeleton	18
p53 signaling pathway	17
Calcium signaling pathway	14
Endocytosis	13
PPAR signaling pathway	12
Progesterone-mediated oocyte maturation	10
Proteasome	10
Focal adhesion	8
Wnt signaling pathway	8
Insulin signaling pathway	4
Axon guidance	3
RNA transport	3
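
The degree in Table 1 can be read as a node degree in a graph whose nodes are pathways. The sketch below assumes that pathway-to-pathway links are available as pairs of names; the links listed are toy values for illustration only and are not the data behind Table 1, and the function name is hypothetical.

```python
from collections import defaultdict

def pathway_degrees(links):
    """Count, for each pathway, the distinct other pathways linked to it."""
    neighbors = defaultdict(set)
    for a, b in links:
        if a != b:  # ignore self-links
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {name: len(others) for name, others in neighbors.items()}

# Toy links between pathways (hypothetical, for illustration only).
links = [
    ("MAPK signaling pathway", "Wnt signaling pathway"),
    ("MAPK signaling pathway", "p53 signaling pathway"),
    ("MAPK signaling pathway", "Focal adhesion"),
    ("Wnt signaling pathway", "Focal adhesion"),
    ("Wnt signaling pathway", "MAPK signaling pathway"),  # repeated link, reversed
]
degrees = pathway_degrees(links)
ranked = sorted(degrees, key=degrees.get, reverse=True)
```

Storing neighbors in sets means that a link listed twice, in either direction, is counted once.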

Notably, the MAPK signaling pathway is connected to 69 other pathways and is involved in a variety of biological processes, which suggests that it carries a large amount of biological information. Because the MAPK signaling pathway is so important, this explains why the pathway activity inferred for it by the method provided by the embodiment of the invention has good classification performance, and indicates that this performance is not an accidental result.

Fig. 2 illustrates an application scenario of the method for evaluating the state of a biological pathway sample according to an embodiment of the present invention. First, the gene expression values are normalized. Second, gene interactions are established from the gene expression data and the pathway; in a pathway network, each node represents a gene and each edge represents an interaction between two genes. Third, a pathway activity score is calculated for each sample by principal component analysis, and the pathway state of each sample is evaluated from that score.
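
The description does not name a particular classifier for step S4. As a minimal sketch, assuming a nearest-centroid rule on the one-dimensional activity score, the final evaluation step could look like the following; the scores, labels, and function names are hypothetical.

```python
def fit_centroids(scores, labels):
    """Mean activity score per phenotype label."""
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def predict(centroids, score):
    """Assign the label whose centroid is closest to the score."""
    return min(centroids, key=lambda label: abs(score - centroids[label]))

# Hypothetical activity scores of training samples and their phenotypes.
train_scores = [-2.1, -1.7, -1.9, 1.8, 2.2, 2.0]
train_labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
centroids = fit_centroids(train_scores, train_labels)

# Hypothetical activity scores of new samples to be evaluated.
predicted = [predict(centroids, s) for s in [-1.5, 0.4, 2.5]]
```

Any classifier that accepts a numeric feature could be used in place of this rule; the nearest-centroid choice only keeps the example short.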

While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A method of assessing the status of a biological pathway sample, comprising the steps of:

step S1, obtaining samples and their corresponding pathway networks, obtaining the expression values of the genes contained in the pathway network of each sample, and weighting the expression values of the genes in the pathway network of each sample by taking, as the weights of the genes, the t value of the t-test of the expression values of the genes between two different phenotypes and the Pearson correlation coefficient between the gene expression values and the sample phenotypes;

step S4, constructing a classifier from the pathway activity scores obtained for the samples to evaluate the pathway state of each sample;

Wherein, g_ijRepresenting the expression value of the gene i in the sample j, mean and std respectively represent the average value and standard deviation of the expression value of the gene in all samples;

in step S1, the weighted expression value of a gene is $z'_{ij} = t_{score}(g_i)^2 \cdot \rho(g_i) \cdot z_{ij}$; wherein $z'_{ij}$ is the weighted expression value of gene $i$ in sample $j$; $t_{score}(g_i)$ is the statistic obtained for gene $g_i$ by a two-tailed t-test on its expression values between the two phenotypes; and $\rho(g_i)$ is the Pearson correlation coefficient between the expression values of the gene in all samples and the sample phenotypes;

wherein in the step S2, the value of the expression of the interaction between the genes is

2. The method for evaluating the status of a biological pathway sample according to claim 1, wherein in step S3, the activity score of each genetic pathway is calculated by the formula:

$$a(P_j) = w_{1j} z'_{1j} + w_{2j} z'_{2j} + \dots + w_{ij} z'_{ij} + \dots + w_{nj} z'_{nj} + w_{(n+1)j} e_{1j} + \dots + w_{(n+h)j} e_{hj} + \dots + w_{(n+l)j} e_{lj}$$

wherein $a(P_j)$ is the pathway activity score of sample $j$, $w_{1j}$ is the weight of the first gene of sample $j$ in the first principal component, $w_{ij}$ is the weight of gene $i$ of sample $j$ in the first principal component, $w_{(n+1)j}$ is the weight of the first gene-gene interaction of sample $j$ in the first principal component, $n$ is the total number of genes, and $l$ is the number of gene-gene interactions.