CN104182654B - Protein-protein interaction network based gene set identification method

Info

Publication number: CN104182654B
Application number: CN201410370730.0A
Authority: CN
Inventors: 吴康; 黄家颖; 范小勇
Original assignee: SHANGHAI PUBLIC HEALTH CLINICAL CENTER
Current assignee: SHANGHAI PUBLIC HEALTH CLINICAL CENTER
Priority date: 2014-07-30
Filing date: 2014-07-30
Publication date: 2017-04-12
Anticipated expiration: 2034-07-30
Also published as: CN104182654A

Abstract

The invention relates to a protein-protein interaction network based gene set identification method, and belongs to the technical field of genes. The identification method comprises the following steps: finding out genes/proteins in direct interaction with a 'set A' from a 'set B', and naming as a 'node set B'; counting the number of each gene/protein in the 'node set B' in direct interaction with the 'set A', and naming as a dimensionality 'i'; calling out the interactive genes/proteins from the 'set A' through the 'node set B [i]' with different minimal dimensionalities 'i', and naming as a 'set A [i]'; counting the aggregate z-score of the 'set A [i]'; adopting the 'set A [i]' with the maximal aggregate z-score as the obtained gene set. The identification method can identify the gene sets more relevant to the biological processes, and is helpful for relevant researchers to carry out correlational research work.

Description

Gene set identification method based on a protein-protein interaction network

Technical field

The invention belongs to the field of gene technology, and in particular relates to a gene set identification method based on a protein-protein interaction network.

Background

Dynamic changes in the transcriptome/proteome lead to changes in cell function. Genes/proteins do not act independently; they function through interactions with other proteins in the protein-protein interaction network. Mining omics data on the basis of the protein-protein interaction network can therefore reveal new biological information. Accordingly, if omics data are analyzed with the aid of protein-protein interaction information, the results will have greater biological relevance.

At present, interaction network analysis of significantly modulated genes/proteins relies on direct interaction information among those genes/proteins. However, the expression of multiple genes/proteins may indicate that they interact with a key node gene/protein that is not itself significantly modulated. The key node gene/protein may also interact with many other genes/proteins. Analysis based only on direct interactions among significantly modulated genes/proteins may therefore miss significantly modulated genes/proteins that interact indirectly through key node genes/proteins. Omics data analysis based on the protein-protein interaction network thus cannot ignore these key node genes/proteins.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a gene set identification method based on a protein-protein interaction network. The method can identify gene sets that are more relevant to a biological process, which helps researchers carry out related studies.

The invention is realized by the following technical scheme. The invention relates to a gene set identification method based on a protein-protein interaction network, comprising the following steps:

Step 1: from "data set B", find the genes/proteins that interact directly with "set A", and name them "node set B"; the genes/proteins in "node set B" come from "data set B" and share no genes/proteins with "set A";

Step 2: for each gene/protein in "node set B", count the number of genes/proteins in "set A" with which it interacts directly; this number is named the dimension "i" of that gene/protein, and the genes/proteins in "node set B" have different dimensions;

Step 3: use "node set B[i]" with different minimum dimensions "i" to retrieve the interacting genes/proteins from "set A", and name them "set A[i]"; the remaining genes/proteins in "set A" are named "set A[i] remainder";

Step 4: calculate the aggregate z-score of "set A[i]";

Step 5: the "set A[i]" with the maximum aggregate z-score is the gene set identified on the basis of the protein-protein interaction network.

Preferably, in step 1, data set B is protein-protein interaction data from a public database.

Preferably, in step 1, set A is a gene set obtained from the relevant whole-genome transcriptome data that is significantly modulated and enriched for a biological function.

Preferably, in step 4, the aggregate z-score is calculated as follows:

a) calculate the expression significance of each gene/protein, i.e. the corrected p-value of the significance comparison between the treatments of interest applied to the biological samples;

b) subtract the corrected p-value from 1 and pass the result through the inverse of the normal cumulative distribution function to generate a z-value;

c) sum the z-values of all genes/proteins in "set A[i]" and divide by the square root of the number of genes/proteins in "set A[i]" to obtain the aggregate z-score. The aggregate z-score allows the relative expression changes of "set A[i]" sets containing different numbers of genes/proteins to be compared; the higher the aggregate z-score, the more significant the expression change of "set A[i]".
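Steps a) to c) can be sketched in Python as follows. This is a minimal illustration rather than the inventors' implementation: the function names `z_from_p` and `aggregate_z` and the clamping constant `eps` are assumptions introduced here, and step b) is read as the inverse normal CDF applied to 1 minus the corrected p-value, in line with the aggregate z-score of reference 1.

```python
from math import sqrt
from statistics import NormalDist


def z_from_p(p_adj, eps=1e-12):
    """Step b): convert a corrected p-value into a z-value.

    The value 1 - p is clamped to the open interval (0, 1) because the
    inverse normal CDF is undefined at exactly 0 and 1 (eps is an assumption).
    """
    q = min(max(1.0 - p_adj, eps), 1.0 - eps)
    return NormalDist().inv_cdf(q)


def aggregate_z(p_values):
    """Step c): sum the z-values and divide by the square root of the set size."""
    if not p_values:
        raise ValueError("the gene set must contain at least one gene")
    z = [z_from_p(p) for p in p_values]
    return sum(z) / sqrt(len(z))
```

Dividing by the square root of the set size is what makes sets of different sizes comparable: for four genes that each have a corrected p-value of 0.025, the aggregate is twice the single-gene z-value rather than four times.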

Compared with the prior art, the invention has the following beneficial effects:

The technical scheme takes the biological process itself into account: when functionally closely related genes respond to a signal stimulus, their modulation may be driven by a key gene (the "key node gene/protein") that may not itself be significantly modulated. The importance of a key node gene/protein is reflected by the number of significantly modulated genes/proteins it interacts with, i.e. its dimension "i"; the larger the dimension "i", the more critical the node. The overall expression modulation of the identified genes, i.e. the aggregate z-score, is also taken into account; the larger the aggregate z-score, the more significant the modulation of the gene set. Both the dimension "i" of the key node genes/proteins and the aggregate z-score of the gene set are important indicators that objectively and directly reflect the biological process, and both make use of natural laws of biology.

The gene set identified by the method has the following effects: it is more relevant to the biological process, and the node genes/proteins that interact with the gene set also have important biological functions. The gene set and/or the node genes/proteins help researchers carry out follow-up studies such as gene function analysis, disease diagnosis, and treatment prognosis.

Description of the drawings

Other features, objects and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the drawings:

Fig. 1 shows the analysis workflow for gene set identification based on the protein-protein interaction network.

Fig. 2 shows the main strategy for identifying gene sets from the THP-1 cell transcriptome, based on protein-protein interaction data, after Mycobacterium tuberculosis (Mtb) infection of THP-1 cells.

Fig. 3 shows the aggregate z-scores of the identified gene sets THP1r2Mtb-iNet[i] when nodes with different minimum dimensions are used (A), and box plots of the expression of the gene set TMtb-iNet identified with nodes of minimum dimension 14, the corresponding remainder gene set TMtb-iEx, and the original gene set THP1r2Mtb-induced (B).

Fig. 4 shows the transcription factor binding site enrichment analysis of the gene promoter regions of THP1r2Mtb-iNet[i] and THP1r2Mtb-iEx[i] (A-C), and of the gene promoter regions of the gene set TMtb-iNet identified with nodes of minimum dimension 14 and the corresponding remainder gene set TMtb-iEx (D).

Fig. 5 shows the biological pathway analysis of THP1r2Mtb-induced and TMtb-iNet.

Fig. 6 shows the gene overlap of THP1r2Mtb-induced (A), TMtb-iNet (B) and TMtb-iEx (C) with the interferon module genes (M3.1).

Fig. 7 shows the correlation of THP1r2Mtb-induced, TMtb-iNet and TMtb-iEx with expression profile data from pulmonary tuberculosis patients.

Specific embodiment

The invention is further illustrated below with reference to specific embodiments. These embodiments serve only to illustrate the invention and do not limit its scope. Experimental methods for which no specific conditions are given in the following examples generally follow conventional conditions, for example those described in Sambrook et al., Molecular Cloning: A Laboratory Manual (New York: Cold Spring Harbor Laboratory Press, 1989), or those recommended by the manufacturer.

Interaction network analysis of significantly modulated genes/proteins relies on direct interaction information among those genes/proteins. However, the expression of multiple genes/proteins may indicate that they interact with a key node gene/protein that is not itself significantly modulated. The invention considers both the degree of modulation of the genes/proteins and their interactions with key node genes/proteins, and identifies from high-throughput data the gene set that is more relevant to the biological process. Specifically, gene set identification considers the degree of interaction between key node genes/proteins and significantly modulated genes/proteins (the dimension) and the overall expression of the identified gene set (the aggregate z-score). The gene set with the most pronounced expression change (the maximum aggregate z-score) is the identified gene set.

By integrating protein-protein interaction information with transcriptome gene modulation information, the invention considers the interactions between significantly modulated genes/proteins and key node genes/proteins (which are not significantly modulated), and identifies the gene set that is more relevant to the biological process.

Before the technical scheme is implemented, the following must be obtained: 1) the gene expression modulation information from the relevant whole-genome transcriptome data, from which a gene set that is significantly modulated and enriched for a biological function is derived. This gene set is named "set A", and its genes/proteins are either consistently up-regulated or consistently down-regulated under the state of interest (a specific time point, a specific treatment, etc.). An example is "THP1r2Mtb-induced" in the embodiment, whose genes are significantly up-regulated 18 h after Mtb infection (relative to 4 h); 2) protein-protein interaction data from a public database, named "data set B", such as the "STRING protein-protein interaction data" in the embodiment.

The analysis workflow for gene set identification based on the protein-protein interaction network (Fig. 1) is as follows:

1) From "data set B", find the genes/proteins that interact directly with "set A", i.e. the protein-protein interaction pairs in "data set B" in which exactly one protein comes from "set A", and name them "node set B". The genes/proteins in "node set B" come from "data set B" and share no genes/proteins with "set A".

2) For each gene/protein in "node set B", count how many genes/proteins in "set A" it interacts with directly; this number is named the dimension "i" of that gene/protein. The genes/proteins in "node set B" have different dimensions.

3) Use "node set B[i]" with different minimum dimensions "i" to retrieve the interacting genes/proteins from "set A", and name them "set A[i]", e.g. "THP1r2Mtb-iNet[i]" in the embodiment. The genes/proteins in "set A[i]" may interact with one another directly, or indirectly through "node set B[i]" with the given minimum dimension "i". Correspondingly, the remaining genes/proteins in "set A" are named "set A[i] remainder", e.g. "THP1r2Mtb-iEx[i]" in the embodiment.

4) Calculate the aggregate z-score of "set A[i]"¹. Specifically, the aggregate z-score is calculated as follows: a) calculate the expression significance of each gene/protein, i.e. the corrected p-value of the significance comparison between the treatments of interest applied to the biological samples; b) subtract the corrected p-value from 1 and pass the result through the inverse of the normal cumulative distribution function (normal CDF) to generate a z-value; c) sum the z-values of all genes/proteins in "set A[i]" and divide by the square root of the number of genes/proteins in "set A[i]" to obtain the aggregate z-score. The aggregate z-score allows the relative expression changes of "set A[i]" sets containing different numbers of genes/proteins to be compared. The higher the aggregate z-score, the more significant the expression change of "set A[i]", and vice versa.

5) The "set A[i]" with the maximum aggregate z-score is the gene set identified on the basis of the protein-protein interaction network.
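Steps 1) to 3) of the workflow can be sketched in Python as follows. This is an illustration under stated assumptions, not the inventors' code: the function names `node_partners` and `recall_set_a` are hypothetical, interactions are assumed to arrive as undirected pairs of identifiers, and only the node-mediated retrieval of "set A[i]" is shown.

```python
from collections import defaultdict


def node_partners(interactions, set_a):
    """Steps 1-2: map each node (a protein outside set A) to its set-A partners.

    The dimension "i" of a node is the size of its partner set.
    """
    partners = defaultdict(set)
    for u, v in interactions:
        # Keep only pairs in which exactly one member belongs to set A.
        if (u in set_a) != (v in set_a):
            gene, node = (u, v) if u in set_a else (v, u)
            partners[node].add(gene)
    return partners


def recall_set_a(partners, min_dimension):
    """Step 3: set A[i], the set-A genes reached through nodes of dimension >= i."""
    recalled = set()
    for genes in partners.values():
        if len(genes) >= min_dimension:
            recalled |= genes
    return recalled
```

The "set A[i] remainder" is then simply `set_a - recall_set_a(partners, i)`. Storing partners as sets means that a pair listed in both directions in the interaction data is counted once.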

The details are as follows. The data in the following example are based on the interferon-related gene set of host macrophages (THP-1 cells) after Mycobacterium tuberculosis infection (THP1r2Mtb-induced, i.e. "set A" in the claims)². By combining it with STRING protein-protein interaction data ("data set B" in the claims)³,⁴, a gene set based on the protein-protein interaction network, TMtb-iNet, was derived and then validated.

Embodiment

1 Methods

1.1 protein-protein interaction data

Protein-protein interaction data were taken from the STRING database³,⁴. The STRING database contains physical and functional protein-protein interaction data for multiple species. The inventors extracted the human-specific protein-protein interactions with a combined score of at least 0.7. This criterion ensures both high coverage and high quality of the data.
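The filtering step can be sketched as follows. The sketch assumes the whitespace-separated layout of STRING's downloadable link files, in which identifiers carry an NCBI taxon prefix (9606 for human) and the combined score is an integer on a 0 to 1000 scale, so that 0.7 corresponds to 700. The function name `filter_string_links` is hypothetical, and the exact file the inventors used is not stated in the text.

```python
def filter_string_links(lines, min_score=0.7, taxon="9606"):
    """Yield (protein1, protein2) pairs for one species at or above a score cut-off.

    Assumes each data row reads: protein1 protein2 ... combined_score,
    with the combined score stored as an integer between 0 and 1000.
    """
    threshold = int(round(min_score * 1000))
    prefix = taxon + "."
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or not fields[-1].isdigit():
            continue  # skip the header row and malformed rows
        p1, p2, score = fields[0], fields[1], int(fields[-1])
        if p1.startswith(prefix) and p2.startswith(prefix) and score >= threshold:
            yield p1, p2
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly without loading the whole file into memory.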

1.2 Deriving the gene set based on the protein-protein interaction network from THP1r2Mtb-induced

First, the genes/proteins that interact directly with THP1r2Mtb-induced were found in the STRING protein-protein interaction data and named the "node set", i.e. the aforementioned "node set B". The genes/proteins in the node set come from the protein-protein interaction data and share no genes/proteins with THP1r2Mtb-induced. Second, for each gene/protein in the node set, the number of genes/proteins in THP1r2Mtb-induced with which it interacts directly was counted and named the dimension "i" of that gene/protein. Of the two nodes shown in Fig. 2, one has dimension 3 and the other has dimension 4. The genes/proteins in the node set have different dimensions. Node set [i], with different minimum dimensions "i", was then used to retrieve the interacting genes/proteins from THP1r2Mtb-induced; these were named "THP1r2Mtb-iNet[i]", i.e. "set A[i]" in the claims. The genes/proteins in THP1r2Mtb-iNet[i] may interact with one another directly, or indirectly through node set [i] with the given minimum dimension "i". Correspondingly, the remaining genes/proteins in THP1r2Mtb-induced were named "THP1r2Mtb-iEx[i]", i.e. "set A[i] remainder" in the claims.

The aggregate z-score of THP1r2Mtb-iNet[i] was then calculated¹. Specifically: a) calculate the expression significance of each gene/protein, i.e. the corrected p-value; b) subtract the corrected p-value from 1 and pass the result through the inverse of the normal cumulative distribution function (normal CDF) to generate a z-value; c) sum the z-values of all genes/proteins in THP1r2Mtb-iNet[i] and divide by the square root of the number of genes/proteins in THP1r2Mtb-iNet[i] to obtain the aggregate z-score. The aggregate z-score allows the relative expression changes of THP1r2Mtb-iNet[i] sets containing different numbers of genes/proteins to be compared. The higher the aggregate z-score, the more significant the expression change of THP1r2Mtb-iNet[i], and vice versa. The THP1r2Mtb-iNet[i] with the maximum aggregate z-score is the gene set identified on the basis of the protein-protein interaction network.
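The final selection, scanning the minimum dimension "i" and keeping the gene set with the largest aggregate z-score, can be sketched as follows. The function name `best_threshold`, the input dictionaries and the clamping constant `eps` are assumptions made for illustration; `partners` maps each node to the set of set-A genes it interacts with, and `p_adj` maps each set-A gene to its corrected p-value.

```python
from math import sqrt
from statistics import NormalDist


def best_threshold(partners, p_adj, eps=1e-12):
    """Return (i, aggregate z-score, gene set) for the best minimum dimension i.

    Returns None when there are no nodes to scan.
    """
    inv_cdf = NormalDist().inv_cdf
    best = None
    # Only dimensions that actually occur can change the retrieved set.
    for i in sorted({len(genes) for genes in partners.values()}):
        recalled = set().union(*(g for g in partners.values() if len(g) >= i))
        # z = inverse normal CDF of (1 - p), clamped away from 0 and 1.
        z = [inv_cdf(min(max(1.0 - p_adj[g], eps), 1.0 - eps)) for g in recalled]
        score = sum(z) / sqrt(len(z))
        if best is None or score > best[1]:
            best = (i, score, recalled)
    return best
```

In a toy case where a low-dimension node pulls in one weakly modulated gene, raising the threshold drops that gene and increases the aggregate z-score, which is the trade-off the scan over "i" explores.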

1.3 Transcription factor binding site enrichment analysis

PRomoter Integration in Microarray Analysis (PRIMA) was used for transcription factor binding site (TFBS) enrichment analysis of the relevant gene sets⁵. The promoter region analyzed spanned from 2000 bp upstream to 200 bp downstream of the transcription start site, with all genes in the genome as background. A Bonferroni-corrected p-value < 0.01 was considered statistically significant.
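For reference, the Bonferroni correction named above multiplies each raw p-value by the number of tests and caps the result at 1. The sketch below shows only this general rule; it does not reproduce PRIMA's internal calculations, and the function name `bonferroni` is introduced here for illustration.

```python
def bonferroni(p_values):
    """Bonferroni-adjust a list of raw p-values (multiply by the test count, cap at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With three tests and raw p-values of 0.001, 0.02 and 0.5, only the first remains below the 0.01 cut-off after correction.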

1.4 KEGG signaling pathway enrichment analysis

Signaling pathway enrichment analysis was performed with the online Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.7⁶. Statistical analysis was based on the false discovery rate (FDR) with Benjamini-Hochberg correction.
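The Benjamini-Hochberg adjustment can be sketched as follows. This is the general step-up procedure, shown for orientation only; it is not DAVID's own code, and the function name `benjamini_hochberg` is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each raw p-value is scaled by the number of tests divided by its rank, so the adjustment is milder than Bonferroni for all but the smallest p-value.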

1.5 Gene set enrichment analysis (GSEA) for pulmonary tuberculosis (PTB)

GSEA determines whether a gene set is distributed mainly toward the top or toward the bottom of a ranked data set (sorted from high to low expression)⁷. The inventors downloaded the transcriptome data set GSE19491 from NCBI GEO⁸.

GSE19491 contains whole-blood expression profile data from a large number of PTB patients, individuals with latent tuberculosis infection (LTB), and healthy controls (HC). The volunteers are divided into several groups: 1) the training set, comprising PTB, LTB and HC, all from London; 2) the test set, comprising PTB, LTB and HC, also from London; 3) the validation set, comprising PTB and LTB, from Cape Town, South Africa; 4) the separated test set (test set_seperated), comprising neutrophils (neut), monocytes (mono), CD4+ (CD4) and CD8+ (CD8) T cells isolated from PTB and HC; 5) the longitudinal set, comprising PTB before treatment, 2 months after the start of drug treatment (PTB_2m), 12 months after the start of drug treatment (PTB_12m), and HC.

GSEA results were judged by the normalized enrichment score (NES) and the false discovery rate (FDR). A positive NES indicates that the gene set is enriched toward the top of the expression profile data set, i.e. the gene set is positively correlated with the data set and is mainly up-regulated in it. A negative NES indicates that the gene set is enriched toward the bottom of the expression profile data set, i.e. the gene set is negatively correlated with the data set and is mainly down-regulated in it. An FDR <= 0.05 indicates that the NES is statistically significant⁷.
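The sign convention of the enrichment score can be illustrated with a simplified running-sum statistic. The sketch below is an unweighted, Kolmogorov-Smirnov style score only: the published GSEA method additionally weights hits by their ranking metric and derives the NES and FDR from permutations, none of which is reproduced here. The function name `enrichment_score` is introduced for illustration.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for a ranked gene list.

    Walks the list from top to bottom, stepping up at members of the gene set
    and down at non-members, and returns the maximum deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need at least one gene inside and one outside the set")
    running, extreme = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme
```

A set concentrated at the top of the ranking yields a positive score, and a set concentrated at the bottom yields a negative one, matching the interpretation of positive and negative NES given above.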

2 Results

2.1 The gene set identified from THP1r2Mtb-induced on the basis of the protein-protein interaction network retains the main features of THP1r2Mtb-induced

Genes/proteins function within molecular networks, and perturbation of a molecular network can affect the phenotype of the cell⁹. THP1r2Mtb-induced can therefore be refined further by integrating protein-protein interaction data. As shown in Fig. 2, the inventors extracted from THP1r2Mtb-induced the genes/proteins that interact with one another directly or indirectly through the node set. The interacting gene set, the remainder gene set, and THP1r2Mtb-induced were then used for GSEA against patient expression profile data (Fig. 2). The genes/proteins that interact with THP1r2Mtb-induced, i.e. the node set, were selected from the protein-protein interaction database. For each gene/protein in the node set, the number of genes/proteins in THP1r2Mtb-induced with which it interacts was named the dimension of the node, i.e. i. The genes/proteins in THP1r2Mtb-induced that interact with one another directly, or indirectly through node set [i] with minimum dimension i, were named THP1r2Mtb-iNet[i]. The remaining genes in THP1r2Mtb-induced were named THP1r2Mtb-iEx[i].

Because different nodes have different dimensions, the inventors calculated the aggregate z-score for each of the resulting THP1r2Mtb-iNet[i] sets. As shown in Fig. 3A, the aggregate z-score of THP1r2Mtb-iNet[i] was highest when the minimum dimension of the node set was 14, i.e. node set [i=14] and THP1r2Mtb-iNet[i=14]. The inventors refer to THP1r2Mtb-iNet[i=14] as TMtb-iNet and to the corresponding THP1r2Mtb-iEx[i=14] as TMtb-iEx. TMtb-iNet was more strongly up-regulated than TMtb-iEx (Fig. 3B).

The gene promoter regions of THP1r2Mtb-induced are significantly enriched for three interferon-related transcription factor binding sites: ISRE (IFN-stimulated response element), IRF-1 (interferon regulatory factor 1) and IRF-7². Accordingly, the inventors analyzed in detail the enrichment of these three transcription factor binding sites in the gene promoter regions of THP1r2Mtb-iNet[i] and THP1r2Mtb-iEx[i]. As shown in Fig. 4A, 4B and 4D, regardless of the minimum dimension of the node set used, ISRE and IRF-7 were more significantly enriched in the gene promoter regions of THP1r2Mtb-iNet[i]. In contrast, IRF-1 was significantly enriched in the gene promoter regions of both THP1r2Mtb-iNet[i] and THP1r2Mtb-iEx[i], independent of the dimension of the node set (Fig. 4C and 4D).

Compared with THP1r2Mtb-induced, TMtb-iNet was more significantly enriched for the cytokine-cytokine receptor interaction, chemokine signaling, and NOD-like receptor signaling pathways (Fig. 5). TMtb-iEx was not enriched for any signaling pathway.

In summary, by using the node set with minimum dimension 14, the inventors identified a gene set based on the protein-protein interaction network, TMtb-iNet. TMtb-iNet shows the most significant expression modulation (the highest aggregate z-score), and its gene promoter regions are significantly enriched for the three transcription factor binding sites ISRE, IRF-7 and IRF-1.

2.2 TMtb-iNet contains more interferon-related genes than TMtb-iEx

THP1r2Mtb-induced is related to the interferon response². TMtb-iNet in turn inherits the main biological features of THP1r2Mtb-induced (Fig. 4 and Fig. 5). On this basis, the inventors analyzed whether TMtb-iNet contains more interferon-related genes than TMtb-iEx. Chaussabel D et al. constructed a series of gene modules from expression profile data of peripheral blood mononuclear cells from patients with multiple diseases. These gene modules show distinctive, coherent expression across multiple diseases. Based on literature research, the authors functionally annotated several of the gene modules, including an interferon-related module, M3.1¹⁰,¹¹. THP1r2Mtb-induced contains nearly half of the genes in the interferon gene module, 44 of 95 genes². Comparison showed that TMtb-iNet contains 33 of these genes, whereas TMtb-iEx contains only 11 (p = 4.32 × 10^-6) (Fig. 6). This result shows that the gene set identified on the basis of the protein-protein interaction network, TMtb-iNet, contains more interferon-related genes than TMtb-iEx, and supports the soundness of the gene set identification method based on the protein-protein interaction network.

2.3 Compared with THP1r2Mtb-induced or TMtb-iEx, TMtb-iNet shows a comparable positive correlation with PTB patients but a higher positive correlation with specific cell populations isolated from PTB patients

As shown in PTB_1&2 of Fig. 7, whether the PTB samples came from the training set or the test set, the positive correlations of TMtb-iNet and THP1r2Mtb-induced with PTB were broadly comparable, whereas the positive correlation of TMtb-iEx with PTB was lower. This result shows that TMtb-iNet, identified on the basis of the protein-protein interaction network, is up-regulated in the blood of PTB patients to a degree similar to THP1r2Mtb-induced.

The inventors further analyzed the positive correlation of TMtb-iNet with neutrophils, monocytes, CD4+ and CD8+ T cells isolated from PTB patients. The results show that TMtb-iNet is also significantly positively correlated with these four cell types.

TMtb-iNet showed a higher positive correlation with CD4+ and CD8+ T cells than THP1r2Mtb-induced did. Because the positive correlation of TMtb-iNet with neutrophils and monocytes was similar to that of THP1r2Mtb-induced, the higher positive correlation of TMtb-iNet with CD4+ and CD8+ T cells is specific. TMtb-iEx showed a lower positive correlation with neutrophils and monocytes, and no significant correlation with CD4+ and CD8+ T cells (PTB_3-6 of Fig. 7).

In summary, compared with THP1r2Mtb-induced and TMtb-iEx, the gene set TMtb-iNet identified on the basis of the protein-protein interaction network shows a comparable positive correlation with PTB patients, but a higher positive correlation with specific cell populations isolated from PTB patients.

2.4 During PTB treatment, TMtb-iNet declines faster than THP1r2Mtb-induced or TMtb-iEx

As shown in PTB_7-9 of Fig. 7, two months after the start of treatment the positive correlation of TMtb-iNet with PTB had declined but remained significant. Twelve months after the start of treatment, the correlation of TMtb-iNet with PTB was no longer significant. In contrast, the positive correlations of THP1r2Mtb-induced and TMtb-iEx with PTB remained statistically significant before treatment, two months after the start of treatment, and 12 months after the start of treatment. These results indicate that the gene set TMtb-iNet, identified on the basis of the protein-protein interaction network, is more responsive to PTB treatment.

In summary, the invention takes the biological process itself into account: when functionally closely related genes respond to a signal stimulus, their modulation may be driven by a key gene (the "key node gene/protein") that may not itself be significantly modulated. The importance of a key node gene/protein is reflected by the number of significantly modulated genes/proteins it interacts with, i.e. its dimension "i"; the larger the dimension "i", the more critical the node. The overall expression modulation of the identified genes, i.e. the aggregate z-score, is also taken into account; the larger the aggregate z-score, the more significant the modulation of the gene set. Both the dimension "i" of the key node genes/proteins and the aggregate z-score of the gene set are important indicators that objectively and directly reflect the biological process.

The references cited in the present invention are listed below:

1. Ideker T, Ozier O, Schwikowski B, Siegel AF. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 2002; 18 Suppl 1: S233-S240.

2. Wu K, Dong D, Fang H et al. An interferon-related signature in the transcriptional core response of human macrophages to Mycobacterium tuberculosis infection. PLoS One 2012; 7(6): e38367.

3. Snel B, Lehmann G, Bork P, Huynen MA. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res 2000; 28(18): 3442-3444.

4. Franceschini A, Szklarczyk D, Frankild S et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2013; 41(Database issue): D808-D815.

5. Ulitsky I, Maron-Katz A, Shavit S et al. Expander: from expression microarrays to networks and functions. Nat Protoc 2010; 5(2): 303-322.

6. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009; 4(1): 44-57.

7. Subramanian A, Tamayo P, Mootha VK et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005; 102(43): 15545-15550.

8. Berry MP, Graham CM, McNab FW et al. An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis. Nature 2010; 466(7309): 973-977.

9. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011; 144(6): 986-998.

10. Chaussabel D, Quinn C, Shen J et al. A modular analysis framework for blood genomics studies: application to systemic lupus erythematosus. Immunity 2008; 29(1): 150-164.

11. Chaussabel D, Sher A. Mining microarray expression data by literature profiling. Genome Biol 2002; 3(10): RESEARCH0055.

Specific embodiments of the invention have been described above. It is to be understood that the invention is not limited to these particular embodiments; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A gene set identification method based on a protein-protein interaction network, characterized in that it comprises the following steps:

Step 1: from "data set B", find the genes/proteins that interact directly with "set A", and name them "node set B"; the genes/proteins in "node set B" come from "data set B" and share no genes/proteins with "set A";

Step 2: for each gene/protein in "node set B", count the number of genes/proteins in "set A" with which it interacts directly; this number is named the dimension "i" of that gene/protein, and the genes/proteins in "node set B" have different dimensions;

Step 3: use "node set B[i]" with different minimum dimensions "i" to retrieve the interacting genes/proteins from "set A", and name them "set A[i]"; the remaining genes/proteins in "set A" are named "set A[i] remainder";

Step 4: calculate the aggregate z-score of "set A[i]";

Step 5: the "set A[i]" with the maximum aggregate z-score is the gene set identified on the basis of the protein-protein interaction network;

wherein, in step 1, set A is a gene set obtained from the relevant whole-genome transcriptome data that is significantly modulated and enriched for a biological function.

2. The gene set identification method based on a protein-protein interaction network according to claim 1, characterized in that, in step 1, data set B is protein-protein interaction data from a public database.

3. The gene set identification method based on a protein-protein interaction network according to claim 1, characterized in that, in step 4, the aggregate z-score is calculated as follows:

a) calculate the expression significance of each gene/protein, i.e. the corrected p-value of the significance comparison between the treatments of interest applied to the biological samples;

b) subtract the corrected p-value from 1 and pass the result through the inverse of the normal cumulative distribution function to generate a z-value;

c) sum the z-values of all genes/proteins in "set A[i]" and divide by the square root of the number of genes/proteins in "set A[i]" to obtain the aggregate z-score; the aggregate z-score allows the relative expression changes of "set A[i]" sets containing different numbers of genes/proteins to be compared, and the higher the aggregate z-score, the more significant the expression change of "set A[i]".