CN103279690A - Method for ordering medical information - Google Patents


Publication number
CN103279690A
Authority
CN
China
Prior art keywords
protein
confidence
degree
network
medical information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102376664A
Other languages
Chinese (zh)
Inventor
代涛
李姣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Medical Information CAMS
Original Assignee
Institute of Medical Information CAMS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Medical Information CAMS

Landscapes

  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a method for ranking medical information. Given a set R of seed genes for a particular phenotype, the seed genes are expanded to their neighbours according to a preset confidence interval, yielding a protein interaction sub-network NET that can be regarded as the phenotype-related protein interaction sub-network. The nodes and edges of NET are partitioned by connectivity into several islands LAND, each island being a network of mutually connected nodes and edges. For each protein p contained in an island LAND, its importance I_p within the island network is computed; I_p measures how strongly the protein is related to the phenotype, and the proteins within each island are ranked accordingly. The computing principle is to traverse all nodes of the island and let the fused confidence associated with each protein contribute to the corresponding I_p, with a higher fused confidence contributing more.

Description

A medical information ranking method
Technical field:
The present invention relates to a medical information ranking method, in particular a medical information ranking method based on the protein interaction network.
Technical background:
The biomedical significance of relationships between molecules: as early as 1962, Linus Pauling, Nobel laureate and a founder of modern biological evolution theory, pointed out in "Molecular Disease, Evolution, and Genic Heterogeneity", an article co-written with the biologist Emile Zuckerkandl, how much intermolecular relationships matter to life and to phenotype: "Life is a relationship between molecules, not a property of any one molecule. So is therefore disease, which endangers life." Nearly half a century later, understanding the molecular mechanisms of phenotypes remains a major challenge in the life sciences. With the rapid development of experimental methods and measurement techniques in molecular biology, genome-wide and proteome-wide data on intermolecular relationships can now be obtained for a sample (phenotype), and these large volumes of molecular relationship data constitute biomolecular networks. On the basis of such networks, scientists have studied phenotype-related molecular mechanisms extensively from a systems perspective. Network-based methods for predicting phenotype genes fall broadly into three groups: methods based on network distance, network methods based on phenotype similarity, and methods based on network centrality.
The biomedical significance of molecular marker identification: with the rapid development of high-throughput biomolecular technology, genetic testing has come into use in biology and medicine. Three steps are vital in the research and development of genetic testing: (1) mapping the whole human genome; (2) screening and identifying phenotype-related gene markers; (3) the maturation and commercialization of genetic testing techniques, represented by the gene chip. Sequencing of the human genome dates back to the Human Genome Project, launched in 1990. The project lasted 13 years, used a small number of individuals as samples, determined the whole human genome sequence (3 billion base pairs), and identified about 3.7 million single nucleotide polymorphism sites (SNPs) that reflect individual diversity. To map the human genome in finer detail, another landmark genome project, the international 1000 Genomes Project, was launched in 2008. Using next-generation sequencing, it was expected to complete whole-genome sequencing of 2,500 individuals of different ethnicities from different countries and regions by the end of 2012, and then to draw a genetic polymorphism map covering almost the whole human genome. At present this map contains about 15 million SNPs, 1 million insertions/deletions, and 20,000 structural variants [1,2]. The vast majority of these genetic variants are newly discovered, and any of them may become a gene marker for phenotype diagnosis, treatment, or prevention. Meanwhile, as gene chip technology has matured, genetic testing based on known phenotype-related gene markers has reached the market. The US company 23andMe offers a genetic testing service to the public, and at present (2013) a single test costs only 99 US dollars (about 600 RMB). The genetic test report it provides covers: disease risk for 118 phenotypes such as senile dementia and breast cancer; possible responses to 20 drugs such as the anticoagulant warfarin; carrier status for the mutant genes of 48 familial hereditary diseases such as cystic fibrosis; and 57 traits such as eye colour and flushing after drinking alcohol. However, because effective gene markers are lacking, the content covered by genetic testing is still limited, and not everything already within its scope has high confidence. On one side are millions of precisely located potential markers in the genome; on the other is a genetic testing service that is moving rapidly to market. Identifying and screening effective phenotype-related gene markers has therefore become the key to bringing genetic testing successfully into clinical practice.
Existing methods for identifying protein molecular markers: protein molecular markers are identified through intermolecular relationships. Taking the method in "Construction and Mining of Disease-Related Protein-Drug Associations" as an example, such methods rest on the following assumption: genes related to a specific phenotype lie close to one another in the biomolecular network. The basic idea and framework of these methods are shown in Fig. 1. Given a set of candidate genes $G = \{g_1, g_2, g_3, \ldots, g_N\}$, map it onto the human protein interaction network to obtain the corresponding candidate protein set $T = \{t_1, t_2, t_3, \ldots, t_N\}$; likewise, map the known phenotype genes onto the protein interaction network to obtain the known phenotype protein set $R = \{r_1, r_2, r_3, \ldots, r_M\}$. By computing the distance between each candidate protein $t_i$ and the known phenotype proteins $R$, measure the importance $I(t_i/R)$ of $t_i$ relative to $R$, then rank the candidate proteins in $T$ by $I(t_i/R)$ from high to low. A good prediction method should rank the phenotype proteins as near the top as possible.
In the body, a gene is transcribed into RNA and then translated into protein, which is a complex biological process. Prediction of phenotype genes from the protein interaction network does not consider this complexity and instead maps each gene directly onto the protein it encodes. Unless otherwise stated, this document does not strictly distinguish between genes and proteins. Nor does it strictly distinguish between the terms molecular network, molecular interaction network, protein molecular interaction network, and protein interaction network.
In mathematical modelling, the protein interaction network is usually formalized as an unweighted undirected graph $G = (V, E)$, where $V$ is the set of nodes, representing the proteins in the network, and $E$ is the set of edges connecting the nodes, representing the interactions between proteins. When this unweighted undirected graph is represented by an adjacency matrix $A$, $A_{ij} = 1$ means that protein i and protein j interact; otherwise $A_{ij} = 0$. In more accurate modelling, a protein interaction is represented by a number between 0 and 1 instead of the binary value (1 or 0), as follows:
Human protein interaction relationships are currently obtained in the following five ways:
(1) High-throughput yeast two-hybrid (HT-Y2H, High-Throughput Yeast Two-Hybrid). Representative data resources obtained with this technique include CCSB-HI1, published in "Nature" in 2005, which contains 2,800 interactions among 1,549 human proteins, and the 3,186 interactions among 1,705 human proteins published by Stelzl et al. in "Cell" in 2005.
(2) Large-scale immunoprecipitation based on mass spectrometry. A representative data resource obtained with this technique is the 6,463 interactions among 2,235 human proteins published by Ewing et al. in "Molecular Systems Biology" in 2007.
(3) Manual curation of low-throughput experimental data. HPRD (Human Protein Reference Database) is a database built in this way; it was first published in "Genome Research" in 2003 and at present (October 2009) contains 38,806 interactions among 27,081 human proteins. A series of databases of this type has also been published in the database issues of "Nucleic Acids Research", such as BIND, IntAct, DIP, and MIPS.
(4) Prediction from the protein interactions of other species. This approach assumes that physical interactions between proteins are conserved across species, and uses the known protein interactions of lower eukaryotes to predict and infer human protein interactions. Lehner et al. used lower eukaryotes to predict 70,000 interactions among 6,200 human proteins; this result was published in "Genome Biology" in 2004.
(5) Biomedical literature mining. A large number of protein interactions have not yet been indexed in databases but are scattered across reports in the biomedical literature. Ramani et al., in work published in "Genome Biology" in 2005, extracted 31,609 interactions among 7,748 human proteins from biomedical literature abstracts. A 2008 special issue of "Genome Biology" also discussed in depth the techniques and evaluation methods for literature mining of protein interactions.
Drawing on these five data sources, "Construction and Mining of Disease-Related Protein-Drug Associations" (Li Jiao, 2009) proposed a method that generates a confidence for each relationship between proteins, uses the confidence to fuse the data from the five sources, and forms a non-binary relationship network of protein molecules.
The marker identification algorithm based on this network is as follows: choose the seed proteins (phenotype proteins) R; place R in the human protein interaction network and, according to a chosen confidence conf(r, t), expand it into a sub-network NET; score each protein p ∈ P in NET. The top-ranked proteins are the phenotype-related proteins (see the left side of Fig. 2).
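The expansion step of this algorithm can be illustrated with a minimal Python sketch. It is not taken from the cited thesis: the function name `expand_seeds`, the toy protein names, and the rule that only edges touching a seed protein are added are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Edge = FrozenSet[str]


def expand_seeds(
    conf: Dict[Edge, float],
    seeds: Iterable[str],
    threshold: float,
) -> Tuple[Set[str], Dict[Edge, float]]:
    """One-step neighbour expansion of the seed set R (illustrative).

    Keeps every interaction whose fused confidence is >= threshold and
    that touches at least one seed protein. The endpoints of those
    edges, together with the seeds, form the sub-network NET.
    """
    seed_set = set(seeds)
    nodes = set(seed_set)
    edges: Dict[Edge, float] = {}
    for edge, c in conf.items():
        if c >= threshold and edge & seed_set:
            edges[edge] = c
            nodes |= edge
    return nodes, edges


# Toy data: protein names and confidences are hypothetical.
toy_conf = {
    frozenset({"R1", "A"}): 0.9,
    frozenset({"R1", "B"}): 0.3,   # below the threshold, dropped
    frozenset({"R2", "C"}): 0.8,
    frozenset({"A", "C"}): 0.95,   # no seed endpoint, not added here
}
net_nodes, net_edges = expand_seeds(toy_conf, ["R1", "R2"], threshold=0.75)
print(sorted(net_nodes))
```

Whether edges between two non-seed neighbours should also enter NET is not stated in the text; the sketch leaves them out, and including them would only require relaxing the `edge & seed_set` test to a membership test on the expanded node set.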
This method flow also involves the following data resources:
OMIM (Online Mendelian Inheritance in Man) is a manually annotated database of human genes and phenotypes from the U.S. National Library of Medicine, containing information on more than 11,000 genes and more than 6,000 phenotypes. The phenotype information includes descriptions related to each phenotype. Taking senile dementia as an example, "Construction and Mining of Disease-Related Protein-Drug Associations" used OMIM to look up the list of genes related to senile dementia and obtained 49 seed proteins by mapping.
HAPPI (Human Annotated and Predicted Protein Interaction Database) is a secondary protein database built by integrating several protein databases, including HPRD, BIND, MINT, STRING, and OPHID; it contains 142,952 interactions among 10,592 human proteins.
Deficiencies of existing methods: extensive analysis and abstraction of the mathematical models of existing methods on graphs, together with observation of how these models and graphs change as parameters are varied, lead to the following conclusion. Whether they compute distance, similarity, or centrality, these network-based or sub-network-based methods all treat the phenotype-related genes as a single network. In fact, the density of these networks is uneven, that is, the edge weights differ in size: as observed above, a given phenotype corresponds to a fairly large number of proteins, and the density of the relationships among them varies. By analogy with phenomena in the physical world, one can infer that these different protein molecules tend, or may tend, to form several "clusters" (also called islands). Members within a cluster are closely related, relationships between a cluster and outside members are comparatively loose, and clusters are relatively independent of, or isolated from, one another. As a result, the genes of a given phenotype are hard to mine fully with existing methods. Earlier approaches score all nodes under a given confidence (a node here meaning a gene or protein) and rank all of them together, so the extreme value of one cluster may be eliminated or masked in the ranking over the full set of nodes, just as a competitor drawn into a "group of death" in a sports tournament loses the chance to advance. Yet because the cluster is relatively independent, it could by itself serve as a sufficient condition for the phenotype, or the phenotype could serve as a sufficient condition for that node. It is therefore worth restricting the scope of ranking to within each cluster, which prevents the extreme value of one cluster from being masked by the extreme values, or even merely larger values, of other clusters. Traditional methods do not take this into account; an improvement in this respect can be expected to raise the quality of molecular marker ranking considerably in some scenarios and to improve subsequent molecular marker mining.
Summary of the invention:
Object of the invention: to reflect in the ranking the possible isolation of sub-networks within the molecular network, the present invention provides the following schemes:
Method schemes:
Scheme 1: a medical information ranking method, characterized by comprising the following steps:
a. Selecting seed proteins:
A seed protein is a protein known to be related to the phenotype, or a protein encoded by a gene known to be related to the phenotype;
b. Obtaining confidence data for protein relationships:
Integrate the data obtained by different experimental techniques and data-mining approaches. Protein interactions obtained in different ways differ in confidence, so the interactions obtained in each way are assigned a score $s_i(p, q)$ between 0 and 1: the more reliable the experimental technique and the more precise the computational method, the higher the confidence $s_i(p, q)$ of the protein interaction;
c. Computing the fused confidence conf(p, q) of the interaction between a pair of proteins (p, q):
The interaction between a pair of proteins (p, q) may appear in different data resources, or may appear in the same data resource annotated with different acquisition methods. The fusion principle is that every independent confidence contributes to the fused confidence, and a larger independent confidence contributes more;
e. Ranking the potential markers in the relevant sub-network:
Given a seed gene set R, i.e. the set of genes for a particular phenotype, expand the seed genes to their neighbours according to a preset confidence interval to obtain the protein interaction sub-network NET, which can be regarded as the phenotype-related protein interaction sub-network. Partition the nodes and edges of NET by connectivity into several islands LAND, each island being a network of mutually connected nodes and edges. For each protein p contained in an island LAND, compute its importance $I_p$ within the island network; $I_p$ measures how strongly the protein is related to the phenotype, and the proteins within the island are ranked accordingly. The computing principle is: traverse all nodes of the island and let the fused confidence of each protein contribute to $I_p$, with a larger fused confidence contributing more.
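The partition of NET into islands in step e corresponds to finding the connected components of a graph. The following Python sketch shows one way to do this with a depth-first search; the function name `split_into_islands` and the toy node names are hypothetical, and the patent text does not prescribe any particular traversal algorithm.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def split_into_islands(
    nodes: Iterable[str], edges: Iterable[Tuple[str, str]]
) -> List[Set[str]]:
    """Partition NET into islands (connected components) by depth-first search."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for p, q in edges:
        adjacency[p].add(q)
        adjacency[q].add(p)

    islands: List[Set[str]] = []
    seen: Set[str] = set()
    for start in nodes:
        if start in seen:
            continue
        island: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in island:
                continue
            island.add(node)
            # Visit every neighbour that is not yet part of this island.
            stack.extend(adjacency[node] - island)
        seen |= island
        islands.append(island)
    return islands


# Toy sub-network: two groups with no edge between them.
toy_nodes = ["A", "B", "C", "D", "E"]
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
toy_islands = split_into_islands(toy_nodes, toy_edges)
print(toy_islands)
```

Each returned set can then be ranked on its own, which is the point of the scheme: a protein competes only with the other members of its island.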
$I_p$ can also be computed with other formulas, for example by adding terms to or removing terms from formula (2) given below, or by changing some of its parameters.
As can be seen, this method differs from the prior art in that it ranks the nodes within each island LAND, and the rankings of different islands are independent of one another, whereas the prior art ranks across all islands at once, that is, over all nodes of the sub-network NET.
This improvement is based on the isolation of sub-networks and can be understood with the following model:
Represent the protein molecular network as a surface on which related nodes lie next to one another, the more closely related the nearer. If the height at each molecule is the maximum value of its relationships (its edges) with its neighbours, the surface shows one bulge after another, each bulge being a cluster of closely related molecules. Now immerse this surface in sea water whose level is the confidence chosen by the algorithm. Point sets that are interconnected are regarded as islands, and point sets that are not are regarded as the water surface. Clearly, only the edges and points above this confidence emerge from the "sea", just as only sufficiently high terrain can form an island. Each bulge (a cluster, i.e. a sub-network of nodes) is regarded as an island. When the confidence is lower, it is as if the sea level falls: the islands grow in area, and islands may even join one another. When the confidence is higher, the islands shrink and become more independent of one another, until finally all islands are submerged. Before that happens, the points of such a region can be regarded as a small, closely related group whose effect on the phenotype is larger. Accordingly, when ranking to select markers, the objects being ranked are also confined within these small groups. Fig. 3 and Fig. 4 resemble the projection of this surface in the vertical direction and show the boundaries of the islands.
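The sea-level picture can be tried out on toy data. The Python sketch below is an illustration under stated assumptions: the function `islands_at_level`, the protein names, and the confidences are hypothetical, and the two levels 0.5 and 0.2 are borrowed from the thresholds named for Fig. 3 and Fig. 4. It groups proteins with a small union-find structure and reports only proteins that keep at least one edge above the level.

```python
from typing import Dict, List, Set, Tuple


def islands_at_level(
    conf: Dict[Tuple[str, str], float], sea_level: float
) -> List[Set[str]]:
    """Islands formed by the edges whose confidence exceeds sea_level.

    Proteins with no surviving edge stay "under water" and are not reported.
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (p, q), c in conf.items():
        if c > sea_level:
            parent[find(p)] = find(q)  # merge the two groups

    groups: Dict[str, Set[str]] = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    # Sort for a deterministic order of the reported islands.
    return sorted(groups.values(), key=lambda s: sorted(s))


# Hypothetical fused confidences for a six-protein network.
toy = {
    ("A", "B"): 0.9,
    ("B", "C"): 0.6,
    ("D", "E"): 0.8,
    ("C", "D"): 0.3,
    ("E", "F"): 0.25,
}
high = islands_at_level(toy, 0.5)  # two separate islands
low = islands_at_level(toy, 0.2)   # the islands merge and extend
print(len(high), len(low))
```

On this toy network, lowering the level from 0.5 to 0.2 joins the two islands through the weak C-D edge and pulls in F, which mirrors the merging and extension of islands described for Fig. 4.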
Scheme 2: the medical information ranking method of Scheme 1, characterized in that, in step a, the seed proteins are obtained as follows: phenotype genes are retrieved from the OMIM database of the National Library of Medicine, and phenotype proteins are then obtained by mapping.
Scheme 3: the medical information ranking method of Scheme 1, characterized in that, in step a, the source of the seed proteins is GeneCards, G2D, SWISS-PROT, Orthodisease, PhenomicDB, PhenoGO, or PharmGKB.
Scheme 4: the medical information ranking method of Scheme 1, characterized in that, in step b, the confidence data for protein relationships are obtained from the HAPPI database.
Scheme 5: the medical information ranking method of Scheme 1, characterized in that, in step c, the independent confidences are fused into the final confidence score conf(p, q) using formula (1):
$$\mathrm{conf}(p, q) = 1 - \prod_{i=1}^{N} \bigl(1 - s_i(p, q)\bigr) \tag{1}$$
where N is the total number of different data resources and different acquisition methods;
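Formula (1) is straightforward to evaluate. The Python sketch below is illustrative only; the function name `fuse_confidence` and the example scores are assumptions, not part of the patent. It shows that every independent score raises the fused confidence and that a larger score raises it more.

```python
from math import prod
from typing import Iterable


def fuse_confidence(scores: Iterable[float]) -> float:
    """Fused confidence conf(p, q) = 1 - prod_i (1 - s_i(p, q)), formula (1)."""
    values = list(scores)
    if any(not 0.0 <= s <= 1.0 for s in values):
        raise ValueError("each s_i must lie in [0, 1]")
    return 1.0 - prod(1.0 - s for s in values)


# Hypothetical scores for one protein pair reported by several sources.
one_source = fuse_confidence([0.5])
two_sources = fuse_confidence([0.5, 0.5])
stronger_second = fuse_confidence([0.5, 0.8])
print(one_source, two_sources, stronger_second)
```

With these toy inputs a second source scoring 0.5 lifts the fused value from 0.5 to 0.75, and replacing it with a source scoring 0.8 lifts it to about 0.9, in line with the fusion principle stated in step c.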
Scheme 6: the medical information ranking method of Scheme 1, characterized in that, between steps c and e, there is a further step d: divide the fused confidence range from 0 to 1 into two or more contiguous intervals:
PPIn+ denotes the protein interactions whose confidence lies in level n or in any higher level, where n is a natural-number index; the larger n is, the more reliable the protein interactions and the smaller their coverage;
Scheme 7: the medical information ranking method of Scheme 6, characterized in that, in step d, the fused confidence range from 0 to 1 is divided into five contiguous intervals: PPI5 for confidence in [0.90, 1), PPI4 for confidence in [0.75, 0.90), PPI3 for confidence in [0.45, 0.75), PPI2 for confidence in [0.25, 0.45), and PPI1 for confidence in [0, 0.25). PPI4+, PPI3+, PPI2+, and PPI1+ denote the protein interactions whose confidence lies in the named level or in any higher level. From PPI5 to PPI1+, the reliability of the protein interactions decreases step by step and their coverage increases step by step.
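The five intervals of Scheme 7 can be turned into a small lookup. The following Python sketch is a hedged illustration: the names `ppi_tier` and `in_tier_or_above` are hypothetical, and assigning a confidence of exactly 1 to PPI5 is an assumption, since the intervals as listed are half-open and exclude 1.

```python
from bisect import bisect_right

# Lower bounds of PPI2, PPI3, PPI4, PPI5 as listed in Scheme 7.
_TIER_BOUNDS = [0.25, 0.45, 0.75, 0.90]


def ppi_tier(conf: float) -> int:
    """Return n (1..5) such that conf falls in the PPIn interval.

    Assumption: conf == 1.0 is treated as PPI5.
    """
    if not 0.0 <= conf <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return bisect_right(_TIER_BOUNDS, conf) + 1


def in_tier_or_above(conf: float, n: int) -> bool:
    """True if an interaction with this confidence belongs to PPIn+."""
    return ppi_tier(conf) >= n


print([ppi_tier(c) for c in (0.1, 0.25, 0.5, 0.8, 0.95)])
```

Filtering a network with `in_tier_or_above(conf, n)` for increasing n keeps fewer but more reliable interactions, which is the trade-off between reliability and coverage that the scheme describes.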
Scheme 8: the medical information ranking method of Scheme 1, characterized in that, in step e, the importance $I_p$ of the protein in the phenotype-related sub-network is further computed using formula (2):
$$I_p = \alpha \ln\Bigl(\sum_{q \in \mathrm{LAND}} \mathrm{conf}(p, q)\Bigr) - \ln\Bigl(\sum_{q \in \mathrm{LAND}} N(p, q)\Bigr) \tag{2}$$
where p and q denote two proteins in the island LAND, N(p, q) = 1 if p and q interact (and 0 otherwise), and α is a preset parameter.
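A minimal Python sketch of formula (2) follows, ranking the proteins of a single island. The function name `rank_island`, the data layout (a dictionary keyed by protein pairs), and the toy values are assumptions for illustration; proteins with no partner inside the island are skipped because the logarithms in formula (2) are undefined for them.

```python
from math import log
from typing import Dict, List, Set, Tuple


def rank_island(
    island: Set[str],
    conf: Dict[Tuple[str, str], float],
    alpha: float = 2.0,
) -> List[Tuple[str, float]]:
    """Rank the proteins of one island by I_p from formula (2), highest first."""
    conf_sum = {p: 0.0 for p in island}  # sum over q of conf(p, q)
    degree = {p: 0 for p in island}      # sum over q of N(p, q)
    for (p, q), c in conf.items():
        if p in island and q in island and c > 0:
            for node in (p, q):
                conf_sum[node] += c
                degree[node] += 1
    scores = [
        (p, alpha * log(conf_sum[p]) - log(degree[p]))
        for p in island
        if degree[p] > 0  # I_p is undefined without any partner
    ]
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Hypothetical island with two interactions.
toy_ranking = rank_island(
    {"A", "B", "C"},
    {("A", "B"): 0.9, ("B", "C"): 0.6},
)
print([name for name, _ in toy_ranking])
```

In this toy island, B has two partners with a combined confidence of 1.5 and ranks first; A and C each have one partner, and A ranks above C because its single interaction is more confident.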
Scheme 9: the medical information ranking method of Scheme 1, characterized in that, in step e, the parameter α = 2.
Scheme 10: the medical information ranking method of Scheme 1, characterized in that, in step e, the parameter α = 2 and the confidence of every protein interaction in the sub-network is 1.
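The special case of Scheme 10 can be worked through directly from formula (2). The short derivation below assumes that "every confidence is 1" means conf(p, q) = N(p, q) for every pair in the island, and it introduces the symbol $k_p$ (not used in the original text) for the number of interaction partners of p inside LAND.

```latex
\begin{align*}
\sum_{q \in \mathrm{LAND}} \mathrm{conf}(p, q)
  &= \sum_{q \in \mathrm{LAND}} N(p, q) = k_p, \\
I_p &= 2 \ln k_p - \ln k_p = \ln k_p .
\end{align*}
```

Under that assumption, and because the logarithm is increasing, ranking by $I_p$ within an island reduces to ordering the proteins by their number of partners in that island.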
Device schemes:
Scheme 11: a medical information ranking device, characterized by comprising the following devices:
a. A device for selecting seed proteins:
A seed protein is a protein known to be related to the phenotype, or a protein encoded by a gene known to be related to the phenotype;
b. A device for obtaining confidence data for protein relationships:
Integrate the data obtained by different experimental techniques and data-mining approaches. Protein interactions obtained in different ways differ in confidence, so the interactions obtained in each way are assigned a score $s_i(p, q)$ between 0 and 1: the more reliable the experimental technique and the more precise the computational method, the higher the confidence $s_i(p, q)$ of the protein interaction;
c. A device for computing the fused confidence conf(p, q) of the interaction between a pair of proteins (p, q):
The interaction between a pair of proteins (p, q) may appear in different data resources, or may appear in the same data resource annotated with different acquisition methods. The fusion principle is that every independent confidence contributes to the fused confidence, and a larger independent confidence contributes more;
e. A device for ranking the potential markers in the relevant sub-network:
Given a seed gene set R, i.e. the set of genes for a particular phenotype, expand the seed genes to their neighbours according to a preset confidence interval to obtain the protein interaction sub-network NET, which can be regarded as the phenotype-related protein interaction sub-network. Partition the nodes and edges of NET by connectivity into several islands LAND, each island being a network of mutually connected nodes and edges. For each protein p contained in an island LAND, compute its importance $I_p$ within the island network; $I_p$ measures how strongly the protein is related to the phenotype, and the proteins within the island are ranked accordingly. The computing principle is: traverse all nodes of the island and let the fused confidence of each protein contribute to $I_p$, with a larger fused confidence contributing more.
Scheme 12: the medical information ranking device of Scheme 11, characterized in that, in device a, the seed proteins are obtained as follows: phenotype genes are retrieved from the OMIM database of the National Library of Medicine, and phenotype proteins are then obtained by mapping.
Scheme 13: the medical information ranking device of Scheme 11, characterized in that, in device a, the source of the seed proteins is GeneCards, G2D, SWISS-PROT, Orthodisease, PhenomicDB, PhenoGO, or PharmGKB.
Scheme 14: the medical information ranking device of Scheme 11, characterized in that, in device b, the confidence data for protein relationships are obtained from the HAPPI database.
Scheme 15: the medical information ranking device of Scheme 11, characterized in that, in device c, the independent confidences are fused into the final confidence score conf(p, q) using formula (1):
$$\mathrm{conf}(p, q) = 1 - \prod_{i=1}^{N} \bigl(1 - s_i(p, q)\bigr) \tag{1}$$
where N is the total number of different data resources and different acquisition methods;
Scheme 16: the medical information ranking device of Scheme 11, characterized in that, between devices c and e, there is a further device d, which divides the fused confidence range from 0 to 1 into two or more contiguous intervals:
PPIn+ denotes the protein interactions whose confidence lies in level n or in any higher level, where n is a natural-number index; the larger n is, the more reliable the protein interactions and the smaller their coverage;
Scheme 17: the medical information ranking device of Scheme 16, characterized in that, in device d, the fused confidence range from 0 to 1 is divided into five contiguous intervals: PPI5 for confidence in [0.90, 1), PPI4 for confidence in [0.75, 0.90), PPI3 for confidence in [0.45, 0.75), PPI2 for confidence in [0.25, 0.45), and PPI1 for confidence in [0, 0.25). PPI4+, PPI3+, PPI2+, and PPI1+ denote the protein interactions whose confidence lies in the named level or in any higher level. From PPI5 to PPI1+, the reliability of the protein interactions decreases step by step and their coverage increases step by step.
Scheme 18: the medical information ranking device of Scheme 11, characterized in that, in device e, the importance $I_p$ of the protein in the phenotype-related sub-network is further computed using formula (2):
$$I_p = \alpha \ln\Bigl(\sum_{q \in \mathrm{LAND}} \mathrm{conf}(p, q)\Bigr) - \ln\Bigl(\sum_{q \in \mathrm{LAND}} N(p, q)\Bigr) \tag{2}$$
where p and q denote two proteins in the island LAND, N(p, q) = 1 if p and q interact (and 0 otherwise), and α is a preset parameter.
Scheme 19: the medical information ranking device of Scheme 11, characterized in that, in device e, the parameter α = 2.
Scheme 20: the medical information ranking device of Scheme 11, characterized in that, in device e, the parameter α = 2 and the confidence of every protein interaction in the sub-network is 1.
Advantages of the invention: because the ranking takes place within clusters, that is, within the neighbourhood of each extremum of the molecular network's scores, each extreme point is not masked or eliminated by points of other clusters and is therefore revealed. This helps uncover new, independent factors that cause a phenotype and helps uncover new, more hidden phenotypes (at least as expressed at the level of protein molecules). What originally appeared to be a single phenotype may thereby be subdivided into several types, giving biology and medicine a more specific basis.
Description of drawings:
Fig. 1: Framework for network-based disease gene prediction.
11: a seed gene; 12: a gene; 13: a relationship between genes; 14: a candidate gene brought in for ranking while the seed genes are expanded into a network; 15: the seed genes; 16: the molecular interaction network generated by expanding the seed genes; 17: ranking of the candidate genes. Reference: Köhler S, Bauer S, Horn D, et al. Walking the interactome for prioritization of candidate disease genes. The American Journal of Human Genetics, 2008, 82(4): 949-958.
Fig. 2: Method flow and performance evaluation for identifying disease proteins from the protein interaction network. Reference: "Construction and Mining of Disease-Related Protein-Drug Associations" (Li Jiao, 2009).
Fig. 3: Schematic of the islands in the molecular interaction network when the confidence is greater than 0.5.
31: island; 32: island; 33: island; 34: gene, i.e. a node of the protein (or gene) interaction network; 35: relationship between genes, i.e. an edge of the protein (or gene) interaction network; 36: confidence of the relationship between genes.
Fig. 4: Schematic of the islands in the molecular interaction network when the confidence is greater than 0.2.
41: island formed by the merging and extension of island 31 and island 33; 42: island formed by the extension of island 32.
Fig. 5: Flow chart of the ranking method of the invention.
Embodiments:
As shown in Fig. 5:
Step a: selection of seed proteins
Herein, a seed protein is a protein known to be associated with the phenotype, or a protein encoded by a gene known to be associated with the phenotype. Phenotype genes are obtained from the OMIM database of the U.S. National Library of Medicine, and the phenotype proteins are then obtained by mapping. Besides OMIM, data resources on phenotype-gene relationships commonly used by researchers include (see Table 1): GeneCards, G2D, SWISS-PROT, Orthodisease, PhenomicDB, PhenoGO and PharmGKB. Seed proteins can also be selected and collected from these resources.
Table 1. Widely used data resources covering phenotype-gene relationships
Data resource    URL
GeneCards        http://www.genecards.org/
G2D              http://www.ogic.ca/projects/g2d_2/
SWISS-PROT       http://uniprot.org/
Orthodisease     http://orthodisease.sbc.su.se/
PhenomicDB       http://www.phenomicdb.de/
PhenoGO          http://www.phenogo.org/
PharmGKB         http://www.pharmgkb.org/
Step b: confidence of protein interaction relationships
To cover as many human protein interaction relationships as possible, HAPPI integrates five complementary protein interaction databases: HPRD, BIND, MINT, STRING and OPHID. These data were obtained by different experimental techniques and data mining approaches, including: manual curation; high-throughput experimental measurement; prediction and inference from protein interaction data of other species (rat, mouse, fruit fly, worm, yeast, etc.); and literature mining of varying accuracy. Protein interaction relationships obtained in different ways accordingly differ in confidence. Based on existing research findings and prior biological knowledge, the HAPPI database assigns protein interaction relationships obtained in different ways different scores s_i(p, q) between 0 and 1 (see Table 2). The more reliable the experimental technique, the higher the confidence s_i(p, q) of the protein interaction relationship, and the higher the precision of the computational method.
Table 2. Confidence assignment rules for protein interaction relationships in the HAPPI database
[Table 2 appears as an image in the original publication.]
Step c: an interaction between a pair of proteins (p, q) may appear in different data resources, or may appear in the same data resource annotated with different acquisition methods. Formula 1 is used to fuse these scores into a final confidence score:
$$\mathrm{conf}(p, q) = 1 - \prod_{i=1}^{N} \bigl(1 - s_i(p, q)\bigr) \qquad (1)$$
where N is the total number of distinct data resources and distinct acquisition methods. When the interaction between proteins p and q is confirmed in more than one way, the final confidence of the interaction relationship is higher than the highest score obtained from any single way. For example, if a protein interaction relationship is obtained by a high-throughput experimental technique with s_1(p, q) = 0.75 and is also confirmed by manual curation with s_2(p, q) = 0.8, then the fused confidence of the interaction is conf(p, q) = 1 − (1 − 0.75)(1 − 0.8) = 0.95.
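The fusion in Formula 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fuse_confidence` and the input format (a plain list of per-source scores) are hypothetical and do not come from the patent or from HAPPI.

```python
from math import prod


def fuse_confidence(scores):
    """Fuse independent evidence scores s_i in [0, 1] as 1 - prod(1 - s_i).

    `scores` is a hypothetical input format: one score per data resource
    or acquisition method that reports the interaction (p, q).
    """
    if not scores:
        raise ValueError("at least one evidence score is required")
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {s}")
    return 1.0 - prod(1.0 - s for s in scores)


# The worked example from the text: high-throughput (0.75) plus manual curation (0.8).
fused = fuse_confidence([0.75, 0.8])
print(round(fused, 2))
```

Because every factor (1 − s_i) is at most 1, the fused value can never fall below the largest single score, which matches the statement that multiply confirmed interactions end up with higher confidence.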
Step d: based on statistical analysis of the fused data and cross-checking against gene expression data, the integrated protein interaction relationships are divided into five grades by confidence range: PPI5, confidence in [0.90, 1); PPI4, [0.75, 0.90); PPI3, [0.45, 0.75); PPI2, [0.25, 0.45); PPI1, [0, 0.25). PPI4+, PPI3+, PPI2+ and PPI1+ denote the protein interaction relationships whose confidence is at the given grade or higher. From PPI5 to PPI1+, the reliability of the protein interaction relationships decreases and their coverage increases.
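The grading in step d amounts to a lookup against the interval lower bounds. The sketch below assumes interactions are held in a dictionary keyed by protein pair; the names `grade` and `at_or_above` are hypothetical, and treating a confidence of exactly 1 as PPI5 is an assumption of this sketch, since the stated intervals leave 1 open.

```python
# Lower bounds of the five grades from step d, highest grade first.
GRADE_LOWER_BOUNDS = [(5, 0.90), (4, 0.75), (3, 0.45), (2, 0.25), (1, 0.0)]


def grade(conf):
    """Return n such that conf falls in the PPIn interval.

    Assumption: conf == 1.0 is mapped to PPI5 for convenience.
    """
    if not 0.0 <= conf <= 1.0:
        raise ValueError(f"confidence out of range [0, 1]: {conf}")
    for n, lower in GRADE_LOWER_BOUNDS:
        if conf >= lower:
            return n
    raise AssertionError("unreachable: the lowest bound is 0.0")


def at_or_above(edges, n):
    """Keep the PPIn+ set: interactions graded PPIn or higher.

    `edges` is a hypothetical format: {(p, q): fused confidence}.
    """
    return {pair: c for pair, c in edges.items() if grade(c) >= n}


toy_edges = {("A", "B"): 0.95, ("B", "C"): 0.50, ("C", "D"): 0.10}
print(sorted(at_or_above(toy_edges, 3)))
```

Lowering n admits more interactions (higher coverage) at the cost of including less reliable ones, which is the trade-off described in step d.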
Step e: ranking of potential markers in the related sub-network:
Given a seed gene set R, i.e. the set of genes for a certain phenotype, the seed genes are expanded to their neighbors according to the preset confidence grade to obtain a protein interaction sub-network NET, which can be regarded as the phenotype-related protein interaction sub-network. The nodes and edges of NET are divided into several islands LAND according to whether they are connected to one another; each island is a network formed by mutually connected nodes and edges. For a protein p contained in an island LAND, Formula 2 is used to compute the importance I_p of the protein within the island; that is, I_p measures the degree of association between the protein and the phenotype, and the proteins within the island are ranked accordingly:
$$I_p = \alpha \ln\Bigl(\sum_{q \in \mathrm{LAND}} \mathrm{conf}(p, q)\Bigr) - \ln\Bigl(\sum_{q \in \mathrm{LAND}} N(p, q)\Bigr) \qquad (2)$$
where p and q are two proteins in the island LAND; N(p, q) = 1 if p and q interact; the interaction confidence conf(p, q) is obtained from Formula 1; and α is a preset parameter (here α = 2). In one extreme case, the confidence of every protein interaction relationship in the sub-network is 1.
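Step e can be sketched end to end in Python: expand the seeds, split the sub-network into islands, and rank each island by Formula 2. Several details below are assumptions made for illustration and are not specified in the text: the expansion is taken to be one hop, the confidence grade is expressed as a numeric lower bound `min_conf`, a seed with no qualifying partner is scored as negative infinity, and all function names and toy data are hypothetical.

```python
import math
from collections import deque


def build_subnetwork(edges, seeds, min_conf):
    """One-hop neighbor expansion (assumed): seeds plus partners reached via kept edges."""
    strong = {pair: c for pair, c in edges.items() if c >= min_conf and c > 0}
    nodes = set(seeds)
    for p, q in strong:
        if p in seeds or q in seeds:
            nodes.update((p, q))
    adj = {n: {} for n in sorted(nodes)}
    for (p, q), c in strong.items():
        if p in nodes and q in nodes:
            adj[p][q] = c
            adj[q][p] = c
    return adj


def islands(adj):
    """Split the sub-network into connected components (islands) by breadth-first search."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, island = deque([start]), []
        while queue:
            node = queue.popleft()
            island.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        result.append(island)
    return result


def importance(adj, p, alpha=2.0):
    """Formula 2: alpha * ln(sum of conf(p, q)) - ln(number of partners of p)."""
    confs = list(adj[p].values())
    if not confs:  # isolated seed: scored as -inf in this sketch (assumption)
        return float("-inf")
    return alpha * math.log(sum(confs)) - math.log(len(confs))


def rank_islands(edges, seeds, min_conf, alpha=2.0):
    """Return one list per island, proteins sorted by descending I_p."""
    adj = build_subnetwork(edges, seeds, min_conf)
    return [sorted(island, key=lambda p: importance(adj, p, alpha), reverse=True)
            for island in islands(adj)]


# Toy data, not taken from the patent.
toy = {("S1", "A"): 0.9, ("S1", "B"): 0.6, ("A", "B"): 0.95,
       ("S2", "C"): 0.8, ("S2", "D"): 0.3}
for ranked in rank_islands(toy, {"S1", "S2"}, min_conf=0.45):
    print(ranked)
```

All partners of p lie in the same island as p, so summing over the adjacency of p is the same as summing over q ∈ LAND. In the extreme case where every confidence is 1, the first sum equals the number of partners d_p, and Formula 2 reduces to I_p = (α − 1) ln d_p, i.e. a pure degree ranking when α = 2.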
As can be seen from Fig. 3, in the protein molecular interaction network (this example constructs the network from an 8 × 9 set of nodes and the edges between them), when the confidence is greater than 0.5, the nodes corresponding to the qualifying protein molecules (or, equivalently, genes) form three "islands", also called "clusters". These three clusters make up the "sub-network", a term that is relative to the full set of nodes and edges. As can be seen from Fig. 4, when the confidence is greater than 0.2, more nodes and edges are brought into the sub-network: each island grows, and two of the islands join to form a larger one, much as two neighboring islands become connected when the water level falls. Conversely, when the confidence threshold is raised, the islands shrink and more islands may be resolved. Adjusting the confidence range therefore controls the size and number of the islands. Widening the confidence interval (lowering the lower bound) makes the islands larger but tends to reduce their number; narrowing the confidence interval (raising the lower bound) makes the islands smaller but tends to increase their number. The larger the islands, the more nodes are taken into consideration; the more islands there are, the better the markers that act independently can be identified. These two factors must be balanced, and the specific balance point can be found by experiment, the criterion being that it yields markers that can be verified and that have a clear meaning.
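The effect of the confidence lower bound on the number of islands can be illustrated with a small union-find sketch. The edge list below is toy data chosen to mimic the behavior described for Figs. 3 and 4 (three islands at 0.5, two at 0.2); it is not the network drawn in the figures, and the function name `count_islands` is hypothetical.

```python
def count_islands(edges, min_conf):
    """Count connected components among nodes joined by edges with conf > min_conf.

    Nodes that have no edge above the threshold are not part of the
    sub-network and are therefore not counted.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (p, q), conf in edges.items():
        if conf > min_conf:
            parent[find(p)] = find(q)
    return len({find(x) for x in list(parent)})


# Toy data: C-F bridges two islands and E-H extends a third once the bound drops.
toy_edges = {
    ("A", "B"): 0.9, ("B", "C"): 0.6,   # island 1
    ("D", "E"): 0.8,                    # island 2
    ("F", "G"): 0.7,                    # island 3
    ("C", "F"): 0.3, ("E", "H"): 0.25,  # only visible at the lower bound
}
print(count_islands(toy_edges, 0.5), count_islands(toy_edges, 0.2))
```

Sweeping `min_conf` over a range of values and recording the island count and sizes is one simple way to look for the balance point mentioned above.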
The present invention has a further significance: each island has its own markers, which may be called island markers. Once such island markers are found, the corresponding phenotype, which may be called the island phenotype, can be studied further. Earlier methods treated the phenotypes of different islands as one and could not tell them apart. The present invention provides a way to distinguish these island phenotypes at the level of the network model, and with it as a guide the phenotypic differences between islands can be explored further, creating favorable conditions for life science research.

Claims (10)

1. A method for ordering medical information, characterized by comprising the following steps:
a. Selection of seed proteins:
A seed protein is a protein known to be associated with the phenotype, or a protein encoded by a gene known to be associated with the phenotype;
b. Acquisition of confidence data for protein relationships:
Data obtained by different experimental techniques and data mining approaches are integrated; protein interaction relationships obtained in different ways accordingly differ in confidence and are assigned different scores s_i(p, q) between 0 and 1; the more reliable the experimental technique, the higher the confidence s_i(p, q) of the protein interaction relationship, and the higher the precision of the computational method;
c. Computation of the fused confidence conf(p, q) of the interaction between a pair of proteins (p, q):
An interaction between a pair of proteins (p, q) may appear in different data resources, or may appear in the same data resource annotated with different acquisition methods; the fusion principle is that every independent confidence contributes to the fused confidence, and the larger an independent confidence, the larger its contribution;
e. Ranking of potential markers in the related sub-network:
Given a seed gene set R, i.e. the set of genes for a certain phenotype, the seed genes are expanded to their neighbors according to a preset confidence interval to obtain a protein interaction sub-network NET, which can be regarded as the phenotype-related protein interaction sub-network; the nodes and edges of NET are divided into several islands LAND according to whether they are connected to one another, each island being a network formed by mutually connected nodes and edges; for a protein p contained in an island LAND, the importance I_p of the protein within the island network is computed, that is, I_p measures the degree of association between the protein and the phenotype, and the proteins within the island are ranked accordingly; the computation principle is that all nodes on the island are traversed, the fused confidence of each protein contributes to I_p, and the larger the fused confidence of a protein, the larger its contribution.
2. The method for ordering medical information according to claim 1, characterized in that, in said step a, the seed proteins are sourced as follows: phenotype genes are obtained from the OMIM database of the U.S. National Library of Medicine, and the phenotype proteins are then obtained by mapping.
3. The method for ordering medical information according to claim 1, characterized in that, in said step a, the source of the seed proteins is GeneCards, G2D, SWISS-PROT, Orthodisease, PhenomicDB, PhenoGO or PharmGKB.
4. The method for ordering medical information according to claim 1, characterized in that, in said step b, the confidence data for protein relationships are obtained from the HAPPI database.
5. The method for ordering medical information according to claim 1, characterized in that, in said step c, Formula 1 is used to fuse the scores into the final confidence score conf(p, q):
$$\mathrm{conf}(p, q) = 1 - \prod_{i=1}^{N} \bigl(1 - s_i(p, q)\bigr) \qquad (1)$$
where N is the total number of distinct data resources and distinct acquisition methods.
6. The method for ordering medical information according to claim 1, characterized in that a step d is further provided between said steps c and e: the fused confidence range from 0 to 1 is divided into two or more contiguous intervals;
PPIn+ denotes the protein interaction relationships whose confidence is at grade n or higher, where n is a natural-number index; the larger n is, the higher the reliability of the protein interaction relationships and the lower their coverage.
7. The method for ordering medical information according to claim 6, characterized in that, in said step d, the fused confidence range from 0 to 1 is divided into five contiguous intervals: PPI5, confidence in [0.90, 1); PPI4, [0.75, 0.90); PPI3, [0.45, 0.75); PPI2, [0.25, 0.45); PPI1, [0, 0.25); PPI4+, PPI3+, PPI2+ and PPI1+ denote the protein interaction relationships whose confidence is at the given grade or higher; from PPI5 to PPI1+, the reliability of the protein interaction relationships decreases and their coverage increases.
8. The method for ordering medical information according to claim 1, characterized in that, in said step e, Formula 2 is further used to compute the importance I_p of the protein within the phenotype-related sub-network:
$$I_p = \alpha \ln\Bigl(\sum_{q \in \mathrm{LAND}} \mathrm{conf}(p, q)\Bigr) - \ln\Bigl(\sum_{q \in \mathrm{LAND}} N(p, q)\Bigr) \qquad (2)$$
where p and q are two proteins in the island LAND; N(p, q) = 1 if p and q interact; and α is a preset parameter.
9. The method for ordering medical information according to claim 1, characterized in that, in said step e, the parameter α = 2.
10. The method for ordering medical information according to claim 1, characterized in that, in said step e, the parameter α = 2 and the confidence of every protein interaction relationship in the sub-network is 1.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102376664A CN103279690A (en) 2013-06-16 2013-06-16 Method for ordering medical information

Publications (1)

Publication Number Publication Date
CN103279690A true CN103279690A (en) 2013-09-04

Family

ID=49062205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013102376664A Pending CN103279690A (en) 2013-06-16 2013-06-16 Method for ordering medical information

Country Status (1)

Country Link
CN (1) CN103279690A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246520A (en) * 2008-03-18 2008-08-20 中南大学 Protein complex recognizing method based on range estimation
CN101344902A (en) * 2008-07-15 2009-01-14 北京科技大学 Secondary protein structure forecasting technique based on association analysis and association classification
CN101989297A (en) * 2009-07-30 2011-03-23 陈越 System for excavating medicine related with disease gene in computer
EP2600269A2 (en) * 2011-12-03 2013-06-05 Medeolinx, LLC Microarray sampling and network modeling for drug toxicity prediction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
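Li Jiao: "Establishment and mining of disease-associated protein-drug relationships", doctoral dissertation, Tsinghua University *
Mei Juan et al.: "Detection of functional modules in protein interaction networks based on graph clustering", Journal of Food Science and Biotechnology *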

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559426A (en) * 2013-11-06 2014-02-05 北京工业大学 Protein functional module excavating method for multi-view data fusion
CN109074425A (en) * 2016-05-11 2018-12-21 国际商业机器公司 Predict individuation metastasis of cancer approach, transfer biological media and transfer Block For Treating
WO2020258254A1 (en) * 2019-06-28 2020-12-30 北京哲源科技有限责任公司 Data mining method and electronic device
CN112567345A (en) * 2019-06-28 2021-03-26 北京哲源科技有限责任公司 Data mining method and electronic equipment
CN112567345B (en) * 2019-06-28 2024-06-04 北京哲源科技有限责任公司 Data mining method and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130904