CN109817337B

CN109817337B - Method for evaluating pathway activation degree of single disease sample and method for distinguishing similar diseases

Info

Publication number: CN109817337B
Application number: CN201910091441.XA
Authority: CN
Inventors: 李敏; 李幸一; 王建新
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2019-01-30
Filing date: 2019-01-30
Publication date: 2020-09-08
Anticipated expiration: 2039-01-30
Also published as: CN109817337A

Abstract

The invention discloses a method for evaluating the pathway activation degree of a single disease sample and a method for distinguishing similar diseases. A fully-connected network is constructed for each pathway; the original edges of the pathway are taken as its important edges, and the added edges as its background edges. Genes present in the pathway are taken as important genes and other genes as background genes. For each edge in the fully-connected network, the difference value between the disease sample and the normal samples is calculated, together with the significance of that difference value. For each gene, the fold difference between its expression value in the disease sample and in the normal samples is calculated. The enrichment of the important nodes and edges in the node ranking and edge ranking of each fully-connected network is then calculated as the activation degree of the corresponding pathway. Similar diseases are distinguished by pathway activation degree. The invention can effectively calculate the activation degree of each pathway in a single disease sample and converts the high-dimensional, small-sample gene expression matrix of the disease samples into a matrix of pathway activation degrees, so that similar diseases can be distinguished with high accuracy.

Description

Method for evaluating pathway activation degree of single disease sample and method for distinguishing similar diseases

Technical Field

The invention belongs to the field of bioinformatics and relates to a method for evaluating the pathway activation degree of a single disease sample and a method for distinguishing similar diseases.

Background

Studies have shown that genes and gene products do not act individually but synergistically, by participating in complex, interrelated networks. Common network-form biological structures include pathways, gene transcription regulatory networks, and protein interaction networks. Pathways reflect biological processes in cells, such as metabolism, signal transduction, and the growth cycle, so mining biological information in combination with pathway data is important for revealing the molecular mechanisms of organisms from a functional perspective.

The occurrence and development of diseases are often closely related to the disorder of important pathways, and the identification of these dysregulated pathways and quantification of the extent of their dysregulation are of great interest for disease research.

Pathway activation degree can be used to measure the degree of dysregulation of a pathway. Furthermore, although similar complex diseases present similar clinical symptoms, the mechanisms by which different diseases develop differ, so the activation state of pathways can serve as an indicator for distinguishing similar diseases. Several models and methods exist for assessing pathway activation during disease development; they differ in how they define and calculate pathway activation. For example, Han et al. [1] proposed a method called PROPS that calculates pathway activation degree using a Gaussian Bayesian network. Young and Craft [2] provided three methods: PCA, NTC and GED. PCA applies principal component analysis to the gene expression data of each pathway and takes the extracted principal components as the activation degree of the pathway; NTC calculates the Euclidean distance between a disease sample and the normal samples based on the gene expression data of each pathway and takes it as the activation degree; GED scores the genes of each pathway whose expression is distributed differently between normal and disease samples and defines the pathway activation feature from these gene scores. Although considering the specific state of a single disease sample from the pathway perspective is crucial for revealing the molecular mechanisms of complex diseases at the system level, none of the current models and methods does so.

In addition, several models and methods are available for distinguishing similar diseases. For example, Winter et al. [3] proposed a modified PageRank algorithm in which genes are ranked according to the ranking of their neighbor nodes in a network, and the top-ranked genes are extracted as features for distinguishing similar diseases. Cun and a co-author [4] proposed stSVM, a feature selection method based on a support vector machine, which extracts effective gene markers as features for distinguishing similar diseases. Zhang et al. [5] proposed CNS, a framework for extracting functional features, which uses a flow balance model to aggregate genes enriched for the same function, thereby obtaining functional modules that best distinguish two similar diseases and extracting them as features. However, the classification accuracy achieved with the features extracted by these methods still needs to be improved.

Therefore, there is a need to provide a method for assessing the degree of activation of a single disease sample pathway and effectively distinguishing between similar diseases.

[1] Han L, et al. A probabilistic pathway score (PROPS) for classification with applications to inflammatory bowel disease[J]. Bioinformatics, 2017, 34(6): 985-993.

[2] Young M R, Craft D L. Pathway-informed classification system (PICS) for cancer analysis using gene expression data[J]. Cancer Informatics, 2016, 15: 151-161.

[3] Winter C, Kristiansen G, Kersting S, et al. Google goes cancer: improving outcome prediction for cancer patients by network-based ranking of marker genes[J]. PLoS Computational Biology, 2012, 8(5): e1002511.

[4] Cun Y, […] H. Network and data integration for biomarker signature discovery via network smoothed t-statistics[J]. PLoS ONE, 2013, 8(9): e73074.

[5] Zhang C, Liu J, Shi Q, et al. Comparative network stratification analysis for identifying functional interpretable network biomarkers[J]. BMC Bioinformatics, 2017, 18(3): 48.

Disclosure of Invention

The technical problem to be solved by the invention is to address the shortcomings of the prior art by providing a method for evaluating the pathway activation degree of a single disease sample and a method for distinguishing similar diseases. The pathway activation degree of a disease sample serves as a feature that effectively distinguishes similar diseases, and classification of similar diseases based on this feature achieves high accuracy.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

A method for evaluating the pathway activation degree of a single disease sample, wherein the activation degree of each pathway comprises an edge activation degree and a gene activation degree, and for each pathway in a disease sample, the evaluation method comprises the following steps:

Step 1, for all genes in the pathway, if two genes have no edge between them, adding an edge, thereby constructing the pathway into a fully-connected network (namely a network with an edge between every two nodes);

taking the original edges in the pathway as important edges, and the added edges as background edges;

taking genes present in the pathway as important genes and genes not present in the pathway (genes present in other pathways in the disease sample) as background genes;

Step 2, for each edge in the fully-connected network, calculating, based on n normal samples, the Pearson correlation coefficient of the expression values of the two genes connected by the edge in the n samples, denoted $PCC_n$; adding the single disease sample to the n normal samples and calculating the Pearson correlation coefficient of the expression values of the two genes in the n+1 samples, denoted $PCC_{n+1}$; taking the difference $\Delta PCC_n = PCC_{n+1} - PCC_n$ as the difference value of the edge between the disease sample and the normal samples; and evaluating the significance of the difference value;

Step 3, calculating, for each gene in the fully-connected network, the fold difference between its expression value in the disease sample and in the n normal samples;

Step 4, ranking all edges in the fully-connected network by the significance of their difference values, and ranking all genes in the fully-connected network by their fold differences;

Step 5, according to the ranking results, calculating the enrichment of the important edges/genes (positive labels) in the edge/gene ranking of the fully-connected network, and taking this enrichment as the edge/gene activation degree of the corresponding pathway.

Further, in step 2, the Pearson correlation coefficient $PCC_n$ of the expression values in the n samples of the two genes connected by each edge is calculated as:

$$PCC_n = \frac{\mathrm{cov}_n(x_1, x_2)}{\sigma_{x_1}\,\sigma_{x_2}}$$

wherein $x_1$ and $x_2$ respectively represent the expression values in the n samples of the two genes connected by the edge, $\mathrm{cov}_n(x_1, x_2)$ denotes the covariance of $x_1$ and $x_2$, and $\sigma_{x_1}$ and $\sigma_{x_2}$ respectively represent the standard deviations of $x_1$ and $x_2$.
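As an illustrative sketch only, the calculation of $PCC_n$, $PCC_{n+1}$ and $\Delta PCC_n$ for one edge could look as follows in Python. The function names `pearson` and `delta_pcc` and the toy expression values are assumptions introduced here for illustration; the sketch also assumes that neither gene has constant expression, since the standard deviation would then be zero.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations.

    The 1/n factors of the covariance and the standard deviations cancel,
    so plain sums of centered products are used.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def delta_pcc(normal_x1, normal_x2, disease_x1, disease_x2):
    """Return (PCC_n, PCC_{n+1}, PCC_{n+1} - PCC_n) for one edge.

    normal_x1 / normal_x2: expression values of the two genes in the n normal samples.
    disease_x1 / disease_x2: expression values of the two genes in the single disease sample.
    """
    pcc_n = pearson(normal_x1, normal_x2)
    pcc_n1 = pearson(list(normal_x1) + [disease_x1], list(normal_x2) + [disease_x2])
    return pcc_n, pcc_n1, pcc_n1 - pcc_n

# Toy data: the two genes are perfectly correlated in four normal samples,
# and the disease sample breaks that correlation.
pcc_n, pcc_n1, delta = delta_pcc([1, 2, 3, 4], [2, 4, 6, 8], 5, 0)
```

In this toy case the correlation drops from 1 in the normal samples to 0 once the disease sample is added, so the edge receives a large negative difference value.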

Further, in step 2, the significance of the difference value is evaluated by a Z-test, wherein the Z value represents the significance of $\Delta PCC_n$.
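The text here does not spell out the Z statistic. A hedged sketch follows, assuming the statistic has the form used in the sample-specific network literature, $Z = \Delta PCC_n / \big((1 - PCC_n^2)/(n-1)\big)$, and converting it to a two-sided P value under a standard normal distribution. Both the form of the statistic and the names `z_score` and `two_sided_p` are assumptions, not taken from the patent.

```python
import math

def z_score(delta_pcc, pcc_n, n):
    """Assumed statistic: delta_pcc / ((1 - pcc_n**2) / (n - 1))."""
    denom = (1.0 - pcc_n ** 2) / (n - 1)
    if denom == 0.0:
        # Perfect correlation among the normal samples: treat any nonzero
        # change as maximally significant (a choice made for this sketch).
        return math.copysign(math.inf, delta_pcc) if delta_pcc else 0.0
    return delta_pcc / denom

def two_sided_p(z):
    """Two-sided P value of z under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical edge: PCC_n = 0.5 over n = 11 normal samples, delta = -0.1.
z = z_score(-0.1, 0.5, 11)
p = two_sided_p(z)
```

Edges can then be ranked by this significance, as step 4 requires.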

Further, in step 3, for any gene, the fold difference FC between its expression value in the disease sample and its expression values in the n normal samples is calculated as:

$$FC = \frac{b}{\bar{a}}$$

wherein b represents the expression value of the gene in the disease sample, and $\bar{a}$ represents the mean of the expression values of the gene in the n normal samples.

Further, in step 5, for any pathway, the activation degree of its edges/genes is calculated by the following formula:

$$AUC = \frac{\sum_{i \in I} rank_i - \frac{M(M+1)}{2}}{M \times N}$$

wherein I represents the set of all important edges/genes in the fully-connected network, $rank_i$ represents the rank of the i-th edge/gene of I in the ascending ranking of step 4, M represents the total number of important edges/genes in the fully-connected network, and N represents the total number of background edges/genes. The formula uses the AUC to measure, from the perspective of edges/genes, the enrichment of the important edges/genes (positive labels) in the edge/gene ranking of the fully-connected network, and this enrichment is taken as the activation degree of the pathway.
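A minimal sketch of this rank-based enrichment, assuming the usual rank-sum form of the AUC, $\big(\sum_{i \in I} rank_i - M(M+1)/2\big)/(M \times N)$, with 1-based ascending ranks. Giving tied scores their average rank is an assumption of this sketch, as is the function name `rank_auc`; the text does not say how ties are handled.

```python
def rank_auc(scores, is_important):
    """Enrichment of important items in the ascending ranking of `scores`.

    scores: one score per edge (e.g. significance) or per gene (e.g. fold difference).
    is_important: parallel list of booleans, True for important edges/genes.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1  # 1-based average rank of the tied block
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    m = sum(1 for flag in is_important if flag)   # M: number of important items
    n = len(scores) - m                            # N: number of background items
    rank_sum = sum(r for r, flag in zip(ranks, is_important) if flag)
    return (rank_sum - m * (m + 1) / 2) / (m * n)

# Two important items occupy the two highest ranks among five: full enrichment.
activation = rank_auc([0.1, 0.2, 0.3, 0.9, 1.5], [False, False, False, True, True])
```

The value is 1 when all important items rank above all background items, 0 in the opposite case, and 0.5 when the two groups are interleaved at random.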

A method of distinguishing between similar diseases comprising the steps of:

first, calculating the pathway activation degree of each disease sample according to the above method for evaluating the pathway activation degree of a single disease sample, and concatenating the edge activation degrees and gene activation degrees of all pathways of a single disease sample into a vector that serves as the feature vector of the disease sample; features in the same dimension of all disease-sample feature vectors correspond to one another, namely to the edge activation degree or the gene activation degree of the same pathway;

second, training a classifier with the feature vectors of known disease samples as input and their classification labels as output;

finally, inputting the feature vector of an unknown disease sample into the trained classifier to obtain its classification label.

Further, the classifier is a random forest classifier.
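The requirement that the same dimension always refers to the same pathway can be sketched as follows. The helper `feature_vector`, the pathway labels and the activation values are hypothetical. Whether edge and gene activation degrees are interleaved per pathway (as here) or stored in two blocks is not specified in the text, so the layout is an assumption; any fixed layout shared by all samples would satisfy the stated condition.

```python
def feature_vector(activations, pathway_order):
    """Concatenate (edge activation, gene activation) pairs in a fixed pathway order.

    activations: dict mapping pathway id -> (edge_activation, gene_activation)
                 for ONE disease sample.
    pathway_order: list of pathway ids shared by all samples, so that each
                   dimension means the same thing in every feature vector.
    """
    vector = []
    for pathway_id in pathway_order:
        edge_activation, gene_activation = activations[pathway_id]
        vector.extend([edge_activation, gene_activation])
    return vector

# Hypothetical pathway labels and activation degrees for two disease samples.
order = ["pathway_A", "pathway_B"]
sample_1 = {"pathway_B": (0.40, 0.55), "pathway_A": (0.91, 0.72)}
sample_2 = {"pathway_A": (0.35, 0.60), "pathway_B": (0.80, 0.49)}
features = [feature_vector(s, order) for s in (sample_1, sample_2)]
```

The resulting rows can be passed, together with the disease labels, to any random forest implementation for training.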

Advantageous effects:

The method can effectively calculate the activation degree of each pathway in a single disease sample and converts the high-dimensional, small-sample gene expression matrix of the disease samples into a matrix of pathway activation degrees, which addresses the failure of other feature extraction methods to consider the specificity of a single disease sample. The calculated pathway activation degrees can be used to distinguish similar diseases with high accuracy.

Drawings

FIG. 1 is a block diagram of the present invention (PASS);

FIG. 2 compares the ROC curves and the areas under them (AUC) for the method of the present invention (PASS) and for NetRank, stSVM, CNS, PCA, NTC, GED and PROPS;

FIG. 3 is a significance analysis of the differences in pathways between two similar disease samples, based on the pathway activation degrees extracted by the present invention;

FIG. 4 is an enrichment analysis of known disease genes in significantly differentially expressed pathways, based on the pathway activation degrees extracted by the present invention.

Detailed Description

As shown in FIG. 1, the invention provides a method for evaluating the pathway activation degree of a single disease sample, wherein the activation degree of each pathway comprises an edge activation degree and a gene activation degree, and the evaluation method for each pathway in the disease sample comprises the following steps:

First, preprocessing of pathway data

For all genes in a pathway, if no edge exists between two genes, an edge is added, so that the pathway is constructed into a fully-connected network (namely a network with an edge between every two nodes).
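The preprocessing step can be sketched as follows, under the assumption that pathway edges are treated as undirected for the purpose of completing the network; the text does not address edge direction. The function name `build_full_network` and the gene labels are illustrative only.

```python
from itertools import combinations

def build_full_network(pathway_genes, pathway_edges):
    """Complete a pathway into a fully-connected network.

    Returns (important_edges, background_edges), each a set of frozenset gene pairs:
    important edges are the pathway's original edges, background edges are the
    edges added to connect every remaining pair of genes.
    """
    original = {frozenset(edge) for edge in pathway_edges}
    all_pairs = {frozenset(pair) for pair in combinations(sorted(pathway_genes), 2)}
    important = all_pairs & original      # drops self-loops and edges to unknown genes
    background = all_pairs - original     # edges added during completion
    return important, background

# Toy pathway with four genes and two original edges.
genes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C")]
important, background = build_full_network(genes, edges)
```

For a pathway with g genes the completed network has g(g-1)/2 edges, so the toy pathway above yields 2 important and 4 background edges.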

Second, calculating the significance of edge differences

For each edge in the fully-connected network, the Pearson correlation coefficient of the expression values of the two genes connected by the edge is calculated over the n normal samples and denoted $PCC_n$; the single disease sample is then added to the n normal samples, and the Pearson correlation coefficient of the expression values of the two genes over the n+1 samples is calculated and denoted $PCC_{n+1}$; the difference $\Delta PCC_n = PCC_{n+1} - PCC_n$ is taken as the difference value of the edge between the disease sample and the normal samples, and the significance of $\Delta PCC_n$ is evaluated.

The Pearson correlation coefficient $PCC_n$ of the expression values in the n samples of the two genes connected by each edge is calculated as:

$$PCC_n = \frac{\mathrm{cov}_n(x_1, x_2)}{\sigma_{x_1}\,\sigma_{x_2}}$$

wherein $x_1$ and $x_2$ respectively represent the expression values of the two genes in the n samples, $\mathrm{cov}_n(x_1, x_2)$ denotes their covariance, and $\sigma_{x_1}$ and $\sigma_{x_2}$ respectively represent the standard deviations of $x_1$ and $x_2$.

The significance of $\Delta PCC_n$ is assessed by a Z-test.

Third, calculating the significance of node differences

For each gene, the fold difference between its expression value in the single disease sample and in the normal samples is:

$$FC = \frac{b}{\bar{a}}$$

wherein b represents the expression value of the gene in the disease sample, and $\bar{a}$ represents the mean of the expression values of the gene in the n normal samples.
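A minimal sketch of the fold-difference calculation, assuming a plain ratio of the disease-sample value to the mean of the normal samples (no logarithm) and a nonzero normal mean. The function name `fold_change` and the gene labels and values are hypothetical.

```python
def fold_change(disease_value, normal_values):
    """Ratio of the disease-sample expression value to the mean over the normal samples."""
    mean_normal = sum(normal_values) / len(normal_values)
    return disease_value / mean_normal

# Hypothetical expression values: gene -> (value in disease sample, values in normal samples).
expression = {
    "gene_1": (6.0, [1.0, 2.0, 3.0]),
    "gene_2": (2.0, [4.0, 4.0, 4.0]),
    "gene_3": (5.0, [5.0, 5.0, 5.0]),
}
fc = {gene: fold_change(b, normals) for gene, (b, normals) in expression.items()}

# Genes ordered by ascending fold difference, as used for the node ranking.
ranked_genes = sorted(fc, key=fc.get)
```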

Fourth, evaluation of pathway activation degree

The activation degree of the pathway is calculated by the following formula:

$$AUC = \frac{\sum_{i \in I} rank_i - \frac{M(M+1)}{2}}{M \times N}$$

wherein I represents the set of all important edges/genes in the fully-connected network, $rank_i$ represents the rank of the i-th edge/gene of I after the edges/genes are sorted in ascending order by the significance of the difference value/the fold difference, M represents the total number of important edges/genes in the fully-connected network, and N represents the total number of background edges/genes. The formula uses the AUC to measure, from the perspective of edges and of genes (nodes) respectively, the enrichment of the important edges/genes in the edge ranking and gene ranking of each fully-connected network, and this enrichment is taken as the activation degree of the pathway.

Because the pathway activation degree is evaluated on a single disease sample, the method yields the activation degree of each pathway in that individual sample, which addresses the failure of other feature extraction methods to consider the specificity of each disease sample.

The invention also provides a method for distinguishing similar diseases, which comprises the following steps:

first, calculating the pathway activation degree of each disease sample, and concatenating the edge activation degrees and gene activation degrees of all pathways of a single disease sample into a vector that serves as the feature vector of the disease sample;

The classifier may employ a random forest classifier.

Fifth, experimental verification

To verify the effectiveness of the method, validation was performed on four data sets covering two similar diseases within inflammatory bowel disease: regional enteritis and ulcerative colitis. The four data sets, GSE9686, GSE3365, GSE36807 and GSE71730, were obtained from the GEO database (https://www.ncbi.nlm.nih.gov/GEO/) and contain a total of 61 ulcerative colitis samples and 105 regional enteritis samples. The human pathway data come from the KEGG database (https://www.kegg.jp/), with 294 pathways in total.

To evaluate the classification accuracy and functional interpretability of the method, the following three analyses were performed:

(1) Analysis of classification accuracy

This part analyzes all samples of the four data sets together. For each of the methods PASS (the present invention), NetRank, stSVM, CNS, PCA, NTC, GED and PROPS, a random forest classifier was constructed on the extracted features and evaluated by three-fold cross-validation: the sample set was divided into 3 subsets, each subset served once as the validation set with the remaining 2 subsets as the training set, giving 3 classifiers, and each classifier classified the samples of its validation set. The three-fold cross-validation was repeated 500 times (with a different division of the sample set each time), the true positive rate (TPR) and false positive rate (FPR) were calculated from all classification results, and the ROC curve was drawn. The classification results were evaluated by ROC and AUC, where the AUC is the area under the ROC curve. The results are shown in FIG. 2, from which it can be seen that the AUC of the present invention is higher than that of the other methods.
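The evaluation protocol can be sketched as follows, leaving the classifier itself to the reader. The sketch assumes simple random (non-stratified) division into three folds, which the text does not specify, and the helper names `three_fold_splits`, `roc_points` and `area_under` are illustrative. `roc_points` expects both classes to be present among the labels.

```python
import random

def three_fold_splits(n_samples, repeats, seed=0):
    """Yield (train_indices, validation_indices) for repeated three-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                      # a different division each repeat
        folds = [indices[k::3] for k in range(3)]
        for k in range(3):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, folds[k]

def roc_points(scores, labels):
    """(FPR, TPR) points from sweeping a threshold over classifier scores (label 1 = positive)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(pairs):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

# 166 samples as in the experiment; 2 repeats here instead of 500 to keep the example short.
splits = list(three_fold_splits(166, repeats=2, seed=1))
auc = area_under(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

In a full run, the scores produced on every validation fold across all repeats would be pooled before `roc_points` is called.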

(2) Analysis of significance of differences in pathways in two similar disease samples

This part analyzes the samples of the four data sets separately. For each pathway, a t-test was used to determine whether its activation degree differs significantly between the two similar diseases in each data set. Specifically, the activation degree of the pathway in each disease sample is calculated by the present method, and a t value representing the degree of difference in the pathway's activation degree between the samples of the two similar diseases is calculated with the t-value formula. The t critical-value table is then consulted: the row (degrees of freedom) is the total number of samples of the two diseases in the data set minus 2, and the column heading P of the cell corresponding to the t value is read off. If P ≤ 0.05, the activation degree of the pathway differs significantly between the two similar diseases. The P values of all pathways are summarized in FIG. 3, which shows that the P values of most pathways are less than or equal to 0.05, indicating that the activation degrees of most pathways differ significantly between the two similar diseases.
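The text does not give the t-value formula. Since the degrees of freedom are stated as the total number of samples minus 2, a pooled-variance two-sample Student's t statistic is a natural reading, and the sketch below assumes it. The function name `pooled_t` and the activation values are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom), where degrees_of_freedom = n1 + n2 - 2.
    Each group must contain at least two values.
    """
    n1, n2 = len(group_a), len(group_b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# Hypothetical activation degrees of one pathway in samples of the two diseases.
t_value, dof = pooled_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])

# With 4 degrees of freedom the two-sided critical value at P = 0.05 is 2.776,
# so |t_value| above that threshold marks the pathway as significantly different.
significant = abs(t_value) > 2.776
```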

(3) The degree of enrichment of known disease genes in differentially expressed pathways in two similar diseases.

This part analyzes the samples of the four data sets separately. The pathways with P ≤ 0.05 obtained in (2) are taken as differentially expressed pathways, and the enrichment of the known disease genes of the two similar diseases in these pathways is determined.

The P value for the enrichment of known disease genes in the differentially expressed pathways is calculated by a hypergeometric test:

$$P = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$

wherein N is the number of genes in the pathways, M is the number of known disease genes, n is the number of genes in the differentially expressed pathways, and m is the number of known disease genes in the differentially expressed pathways. The smaller the P value, the higher the enrichment of known disease genes in the differentially expressed pathways. The $-\log_{10} P$ values obtained on the four data sets are shown in FIG. 4, from which it can be seen that the $-\log_{10} P$ values are all greater than or equal to 1.3, that is, the P values are less than or equal to 0.05, indicating that known disease genes are highly enriched in the differentially expressed pathways.
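A hedged sketch of such an enrichment test, assuming the usual upper-tail convention: the P value is the probability of observing at least m known disease genes when n genes are drawn from N genes of which M are known disease genes. The function name `hypergeom_enrichment_p` and the toy counts are assumptions for illustration.

```python
from math import comb, log10

def hypergeom_enrichment_p(N, M, n, m):
    """Upper-tail hypergeometric probability P(X >= m).

    N: total number of genes, M: known disease genes among them,
    n: genes in the differentially expressed pathways,
    m: known disease genes among those n genes.
    """
    upper = min(n, M)
    favourable = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return favourable / comb(N, n)

# Toy counts: 10 genes, 4 disease genes, 3 genes drawn, 2 of them disease genes.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
score = -log10(p_value)   # compared against 1.3, i.e. -log10(0.05), in the text
```

With these toy counts the P value is 1/3, so the score is well below the 1.3 threshold and the enrichment would not be called significant.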

The results of fig. 3 and fig. 4 show that the pathway activation degree of a single disease sample extracted by the method of the present invention can effectively reflect the difference between similar diseases, and the two similar diseases can be effectively distinguished by the pathway activation degree calculation method provided by the present invention.

Experimental results show that the method has good classification accuracy and stability.

Claims

1. A method for evaluating the pathway activation degree of a single disease sample, characterized in that the activation degree of each pathway comprises an edge activation degree and a gene activation degree, and for each pathway in a disease sample, the evaluation method comprises the following steps:

step 1, for all genes in the pathway, if two genes have no edge between them, adding an edge, thereby constructing the pathway into a fully-connected network;

taking genes present in the pathway as important genes and genes not present in the pathway as background genes;

in step 2, the significance of the difference value is evaluated by a Z-test, wherein the Z value represents the significance of $\Delta PCC_n$;

in step 3, for any gene, the fold difference FC between its expression value in the disease sample and in the n normal samples is calculated as:

$$FC = \frac{b}{\bar{a}}$$

wherein b represents the expression value of the gene in the disease sample, and $\bar{a}$ represents the mean of the expression values of the gene in the n normal samples;

step 5, according to the ranking results, calculating the enrichment of the important edges/genes in the edge/gene ranking of the fully-connected network, and taking this enrichment as the edge/gene activation degree of the corresponding pathway;

in step 5, for any pathway, the enrichment of the important edges/genes in the edge/gene ranking of the fully-connected network is calculated by the following formula:

$$AUC = \frac{\sum_{i \in I} rank_i - \frac{M(M+1)}{2}}{M \times N}$$

wherein I represents the set of all important edges/genes in the fully-connected network, $rank_i$ represents the rank of the i-th edge/gene of I in the ascending ranking of step 4, M represents the total number of important edges/genes in the fully-connected network, and N represents the total number of background edges/genes; the formula uses the AUC to measure, from the perspective of edges/genes, the enrichment of the important edges/genes in the edge/gene ranking of the fully-connected network, and this enrichment is taken as the activation degree of the pathway.

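2. The method for evaluating the pathway activation degree of a single disease sample according to claim 1, wherein in step 2, the Pearson correlation coefficient $PCC_n$ of the expression values in the n samples of the two genes connected by each edge is calculated as:

$$PCC_n = \frac{\mathrm{cov}_n(x_1, x_2)}{\sigma_{x_1}\,\sigma_{x_2}}$$

wherein $x_1$ and $x_2$ respectively represent the expression values of the two genes in the n samples, $\mathrm{cov}_n(x_1, x_2)$ denotes their covariance, and $\sigma_{x_1}$ and $\sigma_{x_2}$ respectively represent the standard deviations of $x_1$ and $x_2$.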

3. A method for distinguishing between similar diseases, comprising the steps of:

first, calculating the pathway activation degree of each disease sample according to the method for evaluating the pathway activation degree of a single disease sample as claimed in claim 1, and concatenating the edge activation degrees and gene activation degrees of all pathways of a single disease sample into a vector that serves as the feature vector of the disease sample;

4. A similar disease differentiating method according to claim 3 wherein said classifier is a random forest classifier.