Protein-protein interaction network based gene set identification method
Publication number: CN104182654B
Application number: CN201410370730.0A
Authority: CN (China)
Prior art keywords: genes, proteins, protein, gene, node
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention relates to a gene set identification method based on a protein-protein interaction network, and belongs to the technical field of genes. The identification method comprises the following steps: finding, in 'data set B', the genes/proteins that interact directly with 'set A', and naming them 'node set B'; counting, for each gene/protein in 'node set B', the number of genes/proteins in 'set A' with which it interacts directly, and naming this number the dimension 'i'; using 'node set B [i]' with different minimum dimensions 'i' to recall the interacting genes/proteins from 'set A', and naming them 'set A [i]'; calculating the aggregate z-score of each 'set A [i]'; and taking the 'set A [i]' with the maximum aggregate z-score as the identified gene set. The identification method can identify gene sets that are more relevant to biological processes, and helps researchers carry out related research work.
Description
Technical field
The invention belongs to the technical field of genes, and in particular relates to a gene set identification method based on a protein-protein interaction network.
Background
Dynamic changes in the transcriptome/proteome lead to changes in cell function. Genes/proteins do not act independently; they act through interactions with other proteins in the protein-protein interaction network. Mining omics data on the basis of the protein-protein interaction network can therefore reveal new biological information. Accordingly, if omics data are analyzed with the aid of protein-protein interaction information, the results will be more biologically relevant.
At present, interaction network analysis of significantly modulated genes/proteins relies on direct interaction information among these genes/proteins. However, the expression of multiple genes/proteins may indicate that they interact with a key node gene/protein that is not itself significantly modulated. That key node gene/protein may also interact with multiple other genes/proteins. Analysis based only on direct interactions among significantly modulated genes/proteins may therefore miss the significantly modulated genes/proteins that interact indirectly through key node genes/proteins. Consequently, omics data analysis based on protein-protein interaction networks cannot ignore these key node genes/proteins.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a gene set identification method based on a protein-protein interaction network. The method can identify gene sets that are more relevant to biological processes, which helps researchers carry out related research work.
The invention is realized through the following technical solution. The invention relates to a gene set identification method based on a protein-protein interaction network, comprising the following steps:
Step one: find, in "data set B", the genes/proteins that interact directly with "set A", and name them "node set B"; the genes/proteins in "node set B" come from "data set B", and "node set B" has genes/proteins in common with "set A";
Step two: count, for each gene/protein in "node set B", the number of genes/proteins in "set A" with which it interacts directly; this number is named the dimension "i" of that gene/protein, and the genes/proteins in "node set B" have different dimensions;
Step three: using "node set B [i]" with different minimum dimensions "i", recall from "set A" the genes/proteins that interact with it, and name them "set A [i]"; the remaining genes/proteins in "set A" are named "set A [i] remainder";
Step four: calculate the aggregate z-score of "set A [i]";
Step five: the "set A [i]" with the maximum aggregate z-score is the gene set identified on the basis of the protein-protein interaction network.
Preferably, in step one, the data set B is protein-protein interaction data from a public database.
Preferably, in step one, the set A is a gene set that is significantly modulated and enriched for a biological function, obtained from the relevant genome-wide transcriptome data.
Preferably, in step four, the calculation of the aggregate z-score comprises the following steps:
a) calculate the expression significance of each gene/protein, i.e., the corrected p value from the significance comparison between the different treatments of the biological samples of interest;
b) subtract the corrected p value from 1, and convert the result to a z value through the inverse normal cumulative distribution function;
c) sum the z values of all genes/proteins in "set A [i]" and divide by the square root of the number of genes/proteins in "set A [i]" to obtain the aggregate z-score. The aggregate z-score allows the relative expression changes of "set A [i]" with different numbers of genes/proteins to be compared: the higher the aggregate z-score, the more significant the expression change of "set A [i]".
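Steps a) to c) can be written compactly as follows. This is a sketch that reads step b) as applying the inverse of the standard normal cumulative distribution function, which is how the aggregate z-score is usually defined; the symbols p_g (corrected p value of gene g), z_g, k and Φ are notation introduced here for illustration and do not appear in the original text.

```latex
z_g = \Phi^{-1}\!\left(1 - p_g\right), \qquad
z_{A[i]} = \frac{1}{\sqrt{k}} \sum_{g \in A[i]} z_g, \qquad
k = \lvert A[i] \rvert
```

Dividing by the square root of k keeps the score on a comparable scale for sets of different sizes: if the individual z values were independent standard normal variables, the aggregate would again be standard normal.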
Compared with the prior art, the invention has the following beneficial effects:
The technical solution takes the biological process itself into account: when functionally closely related genes respond to a signal stimulus, their modulation may be influenced by a key gene (the "key node gene/protein"), and that key gene may not itself be significantly modulated. How critical a key node gene/protein is is reflected by the number of significantly modulated genes/proteins with which it interacts, i.e., its dimension "i". The larger the dimension "i", the more critical the node. The overall expression modulation of the identified genes, i.e., the aggregate z-score, is also taken into account. The larger the aggregate z-score, the more significant the modulation of the gene set. Both the dimension "i" of the key node genes/proteins and the aggregate z-score of the gene set are important indicators that objectively and directly reflect the biological process, and they make use of natural laws of biology.
The gene set identified by the method of the invention has the following effects: it is a gene set that is more relevant to the biological process, and the node genes/proteins that interact with the gene set also have important biological functions. The gene set and/or the node genes/proteins help researchers carry out the next stage of related research, such as gene function analysis, disease diagnosis, and disease treatment and prognosis.
Description of the drawings
Other features, objects and advantages of the invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:
Fig. 1 is the analysis workflow for gene set identification based on a protein-protein interaction network.
Fig. 2 is the main strategy for gene set identification based on protein-protein interaction data, applied to the transcriptional profile of THP-1 cells after infection with Mycobacterium tuberculosis (Mtb).
Fig. 3 shows the aggregate z-scores of the identified gene set THP1r2Mtb-iNet [i] when nodes with different minimum dimensions are used (A), and boxplots of expression for the identified gene set TMtb-iNet, the corresponding remaining gene set TMtb-iEx, and the original gene set THP1r2Mtb-induced when nodes with a minimum dimension of 14 are used (B).
Fig. 4 shows the transcription factor binding site enrichment analysis of the gene promoter regions of THP1r2Mtb-iNet [i] and THP1r2Mtb-iEx [i] (A-C), and the transcription factor binding site enrichment analysis of the gene promoter regions of the identified gene set TMtb-iNet and the corresponding remaining gene set TMtb-iEx when nodes with a minimum dimension of 14 are used (D).
Fig. 5 is the biological pathway analysis of THP1r2Mtb-induced and TMtb-iNet.
Fig. 6 is the gene overlap analysis of THP1r2Mtb-induced (A), TMtb-iNet (B) and TMtb-iEx (C) with the interferon module genes (M3.1).
Fig. 7 is the correlation analysis of THP1r2Mtb-induced, TMtb-iNet and TMtb-iEx with expression profile data related to pulmonary tuberculosis patients.
Specific embodiments
The invention is further illustrated below with reference to specific embodiments. These embodiments serve only to illustrate the invention and not to limit its scope. Experimental methods for which no specific conditions are given in the following embodiments generally follow conventional conditions, for example those described in Sambrook et al., Molecular Cloning: A Laboratory Manual (New York: Cold Spring Harbor Laboratory Press, 1989), or the conditions recommended by the manufacturer.
Interaction network analysis of significantly modulated genes/proteins relies on direct interaction information among these genes/proteins. However, the expression of multiple genes/proteins may indicate that they interact with a key node gene/protein, even though that node is not itself significantly modulated. The invention takes into account both the degree of modulation of genes/proteins and their interactions with key node genes/proteins in order to identify, from high-throughput data, gene sets that are more relevant to the biological process. Specifically, gene set identification considers the extent of interaction (dimension) between key node genes/proteins and significantly modulated genes/proteins, together with the overall expression (aggregate z-score) of the identified gene set. The gene set with the most pronounced expression (maximum aggregate z-score) is the identified gene set.
By integrating protein-protein interaction information with transcriptome gene modulation information, the invention considers the interactions between significantly modulated genes/proteins and key node genes/proteins (which are not themselves significantly modulated), and identifies gene sets that are more relevant to the biological process.
Before the technical solution is implemented, the following must be obtained: 1) gene expression modulation information from the relevant genome-wide transcriptome data, from which a gene set that is significantly modulated and enriched for a certain biological function is derived. This gene set is named "set A", and its genes/proteins are either consistently up-regulated or consistently down-regulated under the state of interest (a specific time point, a specific treatment, etc.). An example is "THP1r2Mtb-induced" in the embodiment, whose genes are significantly up-regulated 18 h after Mtb infection (relative to 4 h); 2) protein-protein interaction data from a public database, named "data set B", for example the "STRING protein-protein interaction data" in the embodiment.
Fig. 1 is the analysis workflow for gene set identification based on a protein-protein interaction network:
1) Find, in "data set B", the genes/proteins that interact directly with "set A", i.e., take the protein-protein interaction pairs in "data set B" in which one protein comes from "set A", and name the result "node set B". The genes/proteins in "node set B" come from "data set B", and "node set B" has genes/proteins in common with "set A".
2) Count, for each gene/protein in "node set B", the number of direct interactions with "set A", i.e., how many genes/proteins in "set A" a given gene/protein in "node set B" interacts with directly. This number is named the dimension "i" of that gene/protein. The genes/proteins in "node set B" have different dimensions.
3) Using "node set B [i]" with different minimum dimensions "i", recall from "set A" the genes/proteins that interact with it, and name them "set A [i]", for example "THP1r2Mtb-iNet [i]" in the embodiment. The genes/proteins in "set A [i]" may interact with one another directly, or indirectly through "node set B [i]" with the given minimum dimension "i". Correspondingly, the remaining genes/proteins in "set A" are named "set A [i] remainder", for example "THP1r2Mtb-iEx [i]" in the embodiment.
4) Calculate the aggregate z-score of "set A [i]" [1]. Specifically, the aggregate z-score is calculated as follows: a) calculate the expression significance of each gene/protein, i.e., the corrected p value from the significance comparison between the different treatments of the biological samples of interest; b) subtract the corrected p value from 1, and convert the result to a z value through the inverse normal cumulative distribution function (normal CDF); c) sum the z values of all genes/proteins in "set A [i]" and divide by the square root of the number of genes/proteins in "set A [i]" to obtain the aggregate z-score. The aggregate z-score allows the relative expression changes of "set A [i]" with different numbers of genes/proteins to be compared. The higher the aggregate z-score, the more significant the expression change of "set A [i]", and vice versa.
5) The "set A [i]" with the maximum aggregate z-score is the gene set identified on the basis of the protein-protein interaction network.
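Steps 1) to 3) of the workflow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the inventors' implementation: the function names `node_partners` and `split_set_a` are hypothetical, interactions are assumed to be undirected pairs of identifiers, and any protein with a direct partner in "set A" is treated as a candidate node whether or not it belongs to "set A" itself (the text is ambiguous on this point).

```python
from collections import defaultdict


def node_partners(interactions, set_a):
    """Steps 1-2: map each node to the set A members it interacts with directly.

    interactions: iterable of (protein_1, protein_2) pairs from 'data set B'.
    The dimension of a node is len() of its partner set.
    Assumption: a node may itself be a member of set A.
    """
    set_a = set(set_a)
    partners = defaultdict(set)
    for p, q in interactions:
        if p == q:
            continue  # ignore self-interactions
        if q in set_a:
            partners[p].add(q)
        if p in set_a:
            partners[q].add(p)
    return dict(partners)


def split_set_a(interactions, set_a, min_dimension):
    """Step 3: return ('set A[i]', 'set A[i] remainder') for a minimum dimension i."""
    set_a = set(set_a)
    recalled = set()
    for members in node_partners(interactions, set_a).values():
        if len(members) >= min_dimension:
            recalled |= members  # recall the set A genes touched by this node
    return recalled, set_a - recalled


# Toy example with made-up identifiers.
toy_pairs = [("N1", "a"), ("N1", "b"), ("N1", "c"),
             ("N2", "c"), ("N2", "d"), ("a", "b")]
toy_set_a = {"a", "b", "c", "d", "e"}
kept, remainder = split_set_a(toy_pairs, toy_set_a, min_dimension=3)
```

Raising `min_dimension` keeps only the genes reached through highly connected nodes, so "set A [i]" shrinks as "i" grows; the aggregate z-score is then used to choose among the candidate sets.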
The method is described in detail below. The data in the following embodiment are based on the interferon-related gene set of host macrophages (THP-1 cells) after Mycobacterium tuberculosis infection (THP1r2Mtb-induced, i.e., "set A" in the claims) [2]. By combining it with STRING protein-protein interaction data (i.e., "data set B" in the claims) [3,4], a gene set based on the protein-protein interaction network, TMtb-iNet, was mined and then validated.
Embodiment
1 Methods
1.1 Protein-protein interaction data
The protein-protein interaction data come from the STRING database [3,4]. The STRING database contains physical and functional protein-protein interaction data for multiple species. The inventors extracted the human-specific protein-protein interaction data with a combined score of at least 0.7. This threshold ensures high coverage of the data while also ensuring high data quality.
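The filtering step can be sketched as follows. The sketch assumes a whitespace-separated STRING links file whose rows are `protein1 protein2 ... combined_score`, with identifiers prefixed by the NCBI taxon (9606 for human) and scores stored as integers from 0 to 1000, so that a combined score of 0.7 corresponds to 700. These file details and the function name `filter_string_links` are assumptions for illustration; the original text does not describe how the extraction was done.

```python
def filter_string_links(lines, min_score=700, taxon="9606"):
    """Keep human interaction pairs whose combined score is at least min_score.

    lines: iterable of text rows, e.g. an open file handle.
    Returns a set of unordered pairs, each stored as a sorted tuple.
    """
    prefix = taxon + "."
    pairs = set()
    for line in lines:
        fields = line.split()
        # Skip the header row and any malformed row.
        if len(fields) < 3 or not fields[-1].isdigit():
            continue
        p1, p2, score = fields[0], fields[1], int(fields[-1])
        if not (p1.startswith(prefix) and p2.startswith(prefix)):
            continue  # not a human-human pair
        if score >= min_score:
            pairs.add(tuple(sorted((p1, p2))))  # A-B and B-A collapse to one pair
    return pairs


# Toy rows with made-up identifiers.
toy_rows = [
    "protein1 protein2 combined_score",
    "9606.ENSP1 9606.ENSP2 900",
    "9606.ENSP2 9606.ENSP1 900",
    "9606.ENSP1 9606.ENSP3 650",
    "9606.ENSP3 9606.ENSP4 700",
    "10090.ENSMUSP1 10090.ENSMUSP2 999",
]
high_confidence = filter_string_links(toy_rows)
```

Storing each pair as a sorted tuple treats the network as undirected, which matches how the node dimension is counted later.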
1.2 Deriving the gene set based on the protein-protein interaction network from THP1r2Mtb-induced
First, the genes/proteins that interact directly with THP1r2Mtb-induced were found in the STRING protein-protein interaction data and named the "node set", i.e., the "node set B" described above. The genes/proteins in the node set come from the protein-protein interaction data, and the node set has genes/proteins in common with THP1r2Mtb-induced. Second, for each gene/protein in the node set, the number of direct interactions with THP1r2Mtb-induced was counted, and this number was named the dimension "i" of that gene/protein. Of the two nodes shown in Fig. 2, one has a dimension of 3 and the other a dimension of 4. The genes/proteins in the node set have different dimensions.
Using node set [i] with different minimum dimensions "i", the interacting genes/proteins were recalled from THP1r2Mtb-induced and named "THP1r2Mtb-iNet [i]", i.e., "set A [i]" in the claims. The genes/proteins in THP1r2Mtb-iNet [i] may interact with one another directly, or indirectly through node set [i] with the given minimum dimension "i". Correspondingly, the remaining genes/proteins in THP1r2Mtb-induced were named "THP1r2Mtb-iEx [i]", i.e., "set A [i] remainder" in the claims.
The aggregate z-score of THP1r2Mtb-iNet [i] was then calculated [1]. Specifically, the aggregate z-score is calculated as follows: a) calculate the expression significance of each gene/protein, i.e., the corrected p value; b) subtract the corrected p value from 1, and convert the result to a z value through the inverse normal cumulative distribution function (normal CDF); c) sum the z values of all genes/proteins in THP1r2Mtb-iNet [i] and divide by the square root of the number of genes/proteins in THP1r2Mtb-iNet [i] to obtain the aggregate z-score. The aggregate z-score allows the relative expression changes of THP1r2Mtb-iNet [i] with different numbers of genes/proteins to be compared. The higher the aggregate z-score, the more significant the expression change of THP1r2Mtb-iNet [i], and vice versa. The THP1r2Mtb-iNet [i] with the maximum aggregate z-score is the gene set identified on the basis of the protein-protein interaction network.
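The scoring and selection steps can be sketched with the Python standard library. The sketch reads step b) as z = Φ⁻¹(1 − p), the inverse of the standard normal CDF. The function names `aggregate_z` and `best_min_dimension`, the clamping constant `eps` (added so that corrected p values of exactly 0 or 1 do not break the inverse CDF), and the toy p values are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def aggregate_z(p_values, eps=1e-12):
    """Sum of per-gene z values divided by the square root of the gene count."""
    p_values = list(p_values)
    if not p_values:
        raise ValueError("aggregate z-score needs at least one p value")
    z_values = []
    for p in p_values:
        # Clamp 1 - p into the open interval (0, 1) required by inv_cdf.
        q = min(max(1.0 - p, eps), 1.0 - eps)
        z_values.append(_STD_NORMAL.inv_cdf(q))
    return sum(z_values) / sqrt(len(z_values))


def best_min_dimension(sets_by_i, p_value):
    """Score every non-empty candidate set and return (best i, all scores).

    sets_by_i: dict mapping a minimum dimension i to its recalled gene set.
    p_value:   dict mapping each gene to its corrected p value.
    """
    scores = {
        i: aggregate_z(p_value[g] for g in genes)
        for i, genes in sets_by_i.items()
        if genes
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy example: two strongly modulated genes and two unmodulated ones.
toy_p = {"a": 0.001, "b": 0.001, "c": 0.5, "d": 0.5}
toy_sets = {1: {"a", "b", "c", "d"}, 2: {"a", "b"}}
best_i, all_scores = best_min_dimension(toy_sets, toy_p)
```

In the toy example the smaller set wins because dropping the two unmodulated genes removes terms that add nothing to the sum while still counting toward the square-root denominator.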
2.3 Transcription factor binding site enrichment analysis
PRomoter Integration in Microarray Analysis (PRIMA) was used for transcription factor binding site (TFBS) enrichment analysis of the relevant gene sets [5]. The promoter region analyzed spans from 2000 bp upstream to 200 bp downstream of the transcription start site. All genes in the genome were used as background. A Bonferroni-corrected p value < 0.01 was considered statistically significant.
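For reference, the Bonferroni correction multiplies each raw p value by the number of tests and caps the result at 1. The short sketch below is a generic illustration of that rule, not PRIMA's own code; the function name `bonferroni` and the example p values are hypothetical.

```python
def bonferroni(p_values):
    """Return Bonferroni-corrected p values (raw p times number of tests, capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Three hypothetical binding-site tests, judged at the 0.01 threshold used above.
corrected = bonferroni([0.001, 0.02, 0.5])
significant = [p < 0.01 for p in corrected]
```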
2.4 KEGG signaling pathway enrichment analysis
Signaling pathway enrichment analysis was carried out with the online Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.7 [6]. Statistical analysis was based on the false discovery rate (FDR) corrected by the Benjamini and Hochberg method.
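The Benjamini and Hochberg adjustment ranks the p values, scales each by the number of tests divided by its rank, and then enforces monotonicity from the largest p value downward. The sketch below is a generic implementation of that step-up procedure, not DAVID's own code; the function name `benjamini_hochberg` and the example values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest so that adjusted
    # values never increase as the raw p value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```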
2.5 Gene set enrichment analysis (GSEA) for pulmonary tuberculosis (PTB)
GSEA determines whether a gene set is distributed mainly toward the top or mainly toward the bottom of a ranked data set (sorted from high to low by expression) [7]. The inventors downloaded the transcriptional profile data set GSE19491 from NCBI GEO [8].
GSE19491 contains whole-blood expression profile data from a large number of individuals with PTB, individuals with latent infection (latent tuberculosis, LTB), and healthy controls (HC). These volunteers are divided into several groups: 1) training set, including PTB, LTB and HC, all from London; 2) test set, including PTB, LTB and HC, also from London; 3) validation set, including PTB and LTB, from Cape Town, South Africa; 4) separated test set (test set_separated), including neutrophils (neut), monocytes (mono), CD4+ (CD4) and CD8+ (CD8) T cells isolated from PTB and HC; 5) longitudinal treatment set (longitudinal), including PTB before treatment, 2 months after the start of drug treatment (PTB_2m), 12 months after the start of drug treatment (PTB_12m), and HC.
GSEA results are judged by the NES (normalized enrichment score) and the FDR (false discovery rate). A positive NES indicates that the gene set is enriched at the top of the expression profile data set, i.e., the gene set is positively correlated with the data set and is mainly up-regulated in it. A negative NES indicates that the gene set is enriched at the bottom of the expression profile data set, i.e., the gene set is negatively correlated with the data set and is mainly down-regulated in it. FDR <= 0.05 indicates that the NES is statistically significant [7].
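The meaning of enrichment at the top or bottom of a ranked list can be made concrete with a simplified running-sum statistic. The sketch below is an unweighted, Kolmogorov-Smirnov style enrichment score only; it is not the weighted statistic used by the GSEA software, and it does not compute the permutation-based NES or FDR. The function name `enrichment_score` and the toy gene names are illustrative assumptions.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for a gene set in a ranked list.

    ranked_genes: genes sorted from highest to lowest expression change.
    Returns the running-sum value with the largest absolute deviation from zero;
    positive means the set sits near the top, negative near the bottom.
    """
    hits = set(gene_set) & set(ranked_genes)
    n_total, n_hits = len(ranked_genes), len(hits)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    hit_step = 1.0 / n_hits                # step up at each member of the set
    miss_step = 1.0 / (n_total - n_hits)   # step down at each non-member
    running, extreme = 0.0, 0.0
    for gene in ranked_genes:
        running += hit_step if gene in hits else -miss_step
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_heavy = enrichment_score(ranked, {"g1", "g2"})
bottom_heavy = enrichment_score(ranked, {"g5", "g6"})
```

In a full analysis the observed score is compared with scores from permuted data to obtain the NES and its FDR, which is why the sign alone is not enough to call a gene set significant.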
2 Results
2.1 The gene set identified from THP1r2Mtb-induced on the basis of the protein-protein interaction network reflects the main features of THP1r2Mtb-induced
Genes/proteins act within molecular networks, and perturbation of a molecular network can affect the phenotype of the cell [9]. THP1r2Mtb-induced can therefore be further refined by integrating protein-protein interaction data. As shown in Fig. 2, the inventors extracted from THP1r2Mtb-induced the genes/proteins that interact with one another, or that interact indirectly through the node set. The interacting gene set, the remaining gene set, and THP1r2Mtb-induced were then used for GSEA against patient-related expression profile data (Fig. 2). The genes/proteins that interact with THP1r2Mtb-induced, i.e., the node set, were selected from the protein-protein interaction database. For each gene/protein in the node set, the number of genes/proteins in THP1r2Mtb-induced with which it interacts was named the dimension of the node, i.e., i. The genes/proteins in THP1r2Mtb-induced that interact with one another, or that interact indirectly through node set [i] with minimum dimension i, were named THP1r2Mtb-iNet [i].
The remaining genes in THP1r2Mtb-induced were named THP1r2Mtb-iEx [i]. Because different nodes have different dimensions, the inventors calculated the aggregate z-score for each of the resulting series of THP1r2Mtb-iNet [i]. As shown in Fig. 3A, when the minimum dimension of the node set is 14, i.e., node set [i=14], the aggregate z-score of the corresponding THP1r2Mtb-iNet [i=14] is the largest. The inventors refer to THP1r2Mtb-iNet [i=14] as TMtb-iNet and to the corresponding THP1r2Mtb-iEx [i=14] as TMtb-iEx. Compared with TMtb-iEx, TMtb-iNet is more significantly up-regulated (Fig. 3B).
The gene promoter regions of THP1r2Mtb-induced are significantly enriched for three interferon-related transcription factor binding sites, namely ISRE (IFN-stimulated response element), IRF-1 (interferon regulatory factor 1) and IRF-7 [2]. Accordingly, the inventors also analyzed in detail the enrichment of these three transcription factor binding sites in the gene promoter regions of THP1r2Mtb-iNet [i] and THP1r2Mtb-iEx [i]. As shown in Fig. 4A, 4B and 4D, regardless of the minimum dimension of the node set used, ISRE and IRF-7 are more significantly enriched in the gene promoter regions of THP1r2Mtb-iNet [i]. In contrast, IRF-1 is significantly enriched in the gene promoter regions of both THP1r2Mtb-iNet [i] and THP1r2Mtb-iEx [i], independent of the dimension of the node set (Fig. 4C and 4D).
Compared with THP1r2Mtb-induced, TMtb-iNet is more significantly enriched for the cytokine-cytokine receptor interaction, chemokine signalling and NOD-like receptor signalling pathways (Fig. 5). TMtb-iEx is not enriched for any signaling pathway.
In summary, by using the node set with a minimum dimension of 14, the inventors identified a gene set based on the protein-protein interaction network, namely TMtb-iNet. TMtb-iNet shows the most significant expression modulation (the highest aggregate z-score), and its gene promoter regions are also significantly enriched for the three transcription factor binding sites ISRE, IRF-7 and IRF-1.
2.2 TMtb-iNet contains more interferon-related genes than TMtb-iEx
THP1r2Mtb-induced is related to the interferon process [2]. TMtb-iNet, in turn, inherits the main biological features of THP1r2Mtb-induced (Fig. 4 and Fig. 5). On this basis, the inventors further analyzed whether TMtb-iNet contains more interferon-related genes than TMtb-iEx. Chaussabel D et al. constructed a series of gene modules from an analysis of peripheral blood mononuclear cell expression profile data from patients with multiple diseases. These gene modules show specific, coherent expression across multiple diseases. Based on a literature survey, the authors functionally annotated several of the gene modules, including an interferon-related module, M3.1 [10,11]. THP1r2Mtb-induced contains nearly half of the genes in the interferon gene module, namely 44 of 95 genes [2]. A comparison shows that TMtb-iNet contains 33 of these genes, whereas TMtb-iEx contains only 11 (p = 4.32 × 10⁻⁶) (Fig. 6). This result shows that the gene set identified on the basis of the protein-protein interaction network, TMtb-iNet, contains more interferon-related genes than TMtb-iEx. It also supports the validity of the gene set identification method based on the protein-protein interaction network.
2.3 Compared with THP1r2Mtb-induced or TMtb-iEx, TMtb-iNet shows a comparable degree of positive correlation with PTB patients, but a higher degree of positive correlation with specific cell populations isolated from PTB patients
As shown in PTB_1&2 of Fig. 7, whether the PTB samples come from the training set or the test set, the degree of positive correlation with PTB is essentially the same for TMtb-iNet and THP1r2Mtb-induced, whereas the degree of positive correlation between TMtb-iEx and PTB is lower. This result shows that TMtb-iNet, identified on the basis of the protein-protein interaction network, is up-regulated in the blood of PTB patients to a degree similar to that of THP1r2Mtb-induced.
The inventors further analyzed the degree of positive correlation between TMtb-iNet and neutrophils, monocytes, CD4+ and CD8+ cells isolated from PTB patients. The results show that TMtb-iNet is also significantly positively correlated with these four cell types.
The degree of positive correlation of TMtb-iNet with CD4+ and CD8+ T cells is higher than that of THP1r2Mtb-induced. Because the degree of positive correlation of TMtb-iNet with neutrophils and monocytes is similar to that of THP1r2Mtb-induced, the higher positive correlation of TMtb-iNet with CD4+ and CD8+ cells is specific. TMtb-iEx shows a lower degree of positive correlation with neutrophils and monocytes, and no significant correlation with CD4+ and CD8+ T cells (PTB_3-6 of Fig. 7).
In summary, compared with THP1r2Mtb-induced and TMtb-iEx, the gene set TMtb-iNet identified on the basis of the protein-protein interaction network shows a comparable degree of positive correlation with PTB patients, but a higher degree of positive correlation with specific cell populations isolated from PTB patients.
2.4 During PTB treatment, TMtb-iNet declines faster than THP1r2Mtb-induced or TMtb-iEx
As shown in PTB_7-9 of Fig. 7, two months after the start of treatment the positive correlation between TMtb-iNet and PTB had declined but was still significant. Twelve months after the start of treatment, however, the correlation between TMtb-iNet and PTB was no longer significant. By contrast, the positive correlation of THP1r2Mtb-induced and TMtb-iEx with PTB remained statistically significant throughout, whether before treatment, two months after the start of treatment, or twelve months after the start of treatment. These results indicate that the gene set TMtb-iNet identified on the basis of the protein-protein interaction network is more responsive to PTB treatment.
In summary, the invention takes the biological process itself into account: when functionally closely related genes respond to a signal stimulus, their modulation may be influenced by a key gene (the "key node gene/protein"), and that key node gene/protein may not itself be significantly modulated.
How critical a key node gene/protein is is reflected by the number of significantly modulated genes/proteins with which it interacts, i.e., its dimension "i". The larger the dimension "i", the more critical the node. The overall expression modulation of the identified genes, i.e., the aggregate z-score, is also taken into account. The larger the aggregate z-score, the more significant the modulation of the gene set. Both the dimension "i" of the key node genes/proteins and the aggregate z-score of the gene set are important indicators that objectively and directly reflect the biological process.
The gene set identified by the method of the invention has the following effects: it is a gene set that is more relevant to the biological process, and the node genes/proteins that interact with the gene set also have important biological functions. The gene set and/or the node genes/proteins help researchers carry out the next stage of related research, such as gene function analysis, disease diagnosis, and disease treatment and prognosis.
The references cited in the present invention are listed below:
1. Ideker T, Ozier O, Schwikowski B, Siegel AF. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 2002; 18 Suppl 1: S233-S240.
2. Wu K, Dong D, Fang H et al. An interferon-related signature in the transcriptional core response of human macrophages to Mycobacterium tuberculosis infection. PLoS One 2012; 7(6): e38367.
3. Snel B, Lehmann G, Bork P, Huynen MA. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res 2000; 28(18): 3442-3444.
4. Franceschini A, Szklarczyk D, Frankild S et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2013; 41(Database issue): D808-D815.
5. Ulitsky I, Maron-Katz A, Shavit S et al. Expander: from expression microarrays to networks and functions. Nat Protoc 2010; 5(2): 303-322.
6. Huang dW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009; 4(1): 44-57.
7. Subramanian A, Tamayo P, Mootha VK et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005; 102(43): 15545-15550.
8. Berry MP, Graham CM, McNab FW et al. An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis. Nature 2010; 466(7309): 973-977.
9. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011; 144(6): 986-998.
10. Chaussabel D, Quinn C, Shen J et al. A modular analysis framework for blood genomics studies: application to systemic lupus erythematosus. Immunity 2008; 29(1): 150-164.
11. Chaussabel D, Sher A. Mining microarray expression data by literature profiling. Genome Biol 2002; 3(10): RESEARCH0055.
Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to the particular embodiments described; those skilled in the art may make various changes or modifications within the scope of the claims without affecting the substance of the invention.
Claims (3)
1. a kind of gene set authentication method based on protein-protein interaction network, it is characterised in that comprise the steps:
Step one, finds out from " data set B " and " set A " occurs the genes/proteins of direct interaction, and is named as " section
Point set B ";Genes/proteins in " set of node B " come from " data set B ", and and " set A " total genes/proteins;
Step 2, counts the number that each genes/proteins and " set A " in " set of node B " occur direct interaction, the number
It is named as the dimension " i " of genes/proteins in " set of node B ", the genes/proteins in " set of node B " have different dimensions;
Step 3, with " set of node B [i] " with different smallest dimensions " i " base of those interactions is recalled from " set A "
Cause/albumen, and it is named as " set A [i] ", remaining genes/proteins are named as " set A [i] is remaining " in " set A ";
Step 4, calculates the aggregation z values of " set A [i] ";
Step 5, " set A [i] " with maximum aggregation z values is the base based on protein-protein interaction network identified
Because of collection;
In step one, the set A is the notable modulation obtained from related full genome transcript profile data, and with biological work(
The gene set that can be enriched with.
2. the gene set authentication method of protein-protein interaction network is based on as claimed in claim 1, it is characterised in that
In step one, the data set B is protein-protein interaction data in public database.
3. The gene set identification method based on a protein-protein interaction network according to claim 1, characterized in that, in step 4, the calculation of the aggregate z-score comprises the following steps:
a) calculating the expression significance of each gene/protein, i.e., the corrected p-value of the significance comparison between the different treatments of the biological samples of interest;
b) subtracting the corrected p-value from 1, and then dividing by the normal cumulative distribution function, to generate a z-score;
c) summing the z-scores of all genes/proteins in "set A[i]" and dividing by the square root of the number of genes/proteins in "set A[i]" to obtain the aggregate z-score; the aggregate z-score allows the relative expression changes of "set A[i]" containing different numbers of genes/proteins to be compared: the higher the aggregate z-score, the more significant the expression of "set A[i]".
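The calculation in claim 3 can be sketched as follows. The sketch rests on an assumption about step b): it reads the step as passing 1 - p through the inverse of the standard normal cumulative distribution function, which is the scoring used in the Ideker et al. paper listed among the non-patent citations; the claim wording itself does not confirm this. The function names `z_from_p` and `aggregate_z` and the clamping constant `eps` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_from_p(p, eps=1e-12):
    """Convert a corrected p-value into a z-score.

    Assumption: step b) is read as z = inverse_normal_cdf(1 - p).
    The argument is clamped into (0, 1) because inv_cdf rejects 0 and 1.
    """
    q = min(max(1.0 - p, eps), 1.0 - eps)
    return NormalDist().inv_cdf(q)


def aggregate_z(p_values):
    """Step c): sum the z-scores and divide by the square root of their count."""
    zs = [z_from_p(p) for p in p_values]
    return sum(zs) / sqrt(len(zs))
```

Dividing by the square root of the set size is what makes sets with different numbers of genes/proteins comparable: for example, four genes that each have p = 0.025 give an aggregate z-score twice that of a single such gene, not four times.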
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410370730.0A CN104182654B (en) | 2014-07-30 | 2014-07-30 | Protein-protein interaction network based gene set identification method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410370730.0A CN104182654B (en) | 2014-07-30 | 2014-07-30 | Protein-protein interaction network based gene set identification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104182654A (en) | 2014-12-03 |
CN104182654B (en) | 2017-04-12 |
Family
ID=51963689
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410370730.0A Active CN104182654B (en) | 2014-07-30 | 2014-07-30 | Protein-protein interaction network based gene set identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104182654B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929282A (en) * | 2019-12-05 | 2020-03-27 | 武汉深佰生物科技有限公司 | Protein interaction-based biological characteristic information early warning method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102086473A (en) * | 2010-05-12 | 2011-06-08 | 天津市泌尿外科研究所 | Undirected network screening method for key genes of human polygenic disease |
CN102375840A (en) * | 2010-08-19 | 2012-03-14 | 浙江中医药大学附属第一医院 | Method for screening micro ribonucleic acid (microRNA) target gene based on natural language processing system |
CN103065066A (en) * | 2013-01-22 | 2013-04-24 | 四川大学 | Drug combination network based drug combined action predicting method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050197783A1 (en) * | 2004-03-04 | 2005-09-08 | Kuchinsky Allan J. | Methods and systems for extension, exploration, refinement, and analysis of biological networks |
Non-Patent Citations (3)
Title |
---|
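Discovering regulatory and signalling circuits in molecular interaction networks; Trey Ideker et al.; Bioinformatics; 2002-12-31; Vol. 18; pp. S233-S240 * |
Predicting cancer-causing genes based on protein-protein interaction networks; 袁芳 et al.; 《计算机应用研究》; 2012-09-30; Vol. 29, No. 9; pp. 3221-3223 * |
Identification of risk gene modules for complex diseases and study of their regulatory mechanisms; 李冬果 et al.; 《中国优生与遗传杂志》; 2013-12-31; Vol. 21, No. 10; pp. 3-6 * |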
Also Published As
Publication number | Publication date |
---|---|
CN104182654A (en) | 2014-12-03 |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |