CN106295249A - Method for predicting gene function by frequent pattern mining based on a set of complex biological networks - Google Patents


Info

Publication number
CN106295249A
Authority
CN
China
Prior art keywords
atlas
limit
edge
vector
subgraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610648165.9A
Other languages
Chinese (zh)
Inventor
沈良忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou University
Original Assignee
Wenzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou University
Priority to CN201610648165.9A
Publication of CN106295249A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the invention discloses a method for predicting gene function by frequent pattern mining over a set of complex biological networks. The method first obtains the initial graph set formed by converting multiple gene expression datasets into biological networks and coarsely filters it, deleting the irrelevant edges that contribute nothing to finding frequent dense vertex sets, to obtain a summary graph. It then finds possible candidate network subsets on the basis of the summary graph, returns to the initial graph set to extract each candidate network subset, and coarsely filters the extracted graph sets again to obtain a set of summary graphs. Finally, it searches each summary graph for dense subgraphs and takes the resulting frequent dense vertex sets as co-expressed gene clusters. Implementing the invention reduces computational complexity, improves the accuracy and efficiency of frequent pattern mining, and solves the pattern overlap problem.

Description

Method for predicting gene function by frequent pattern mining based on a set of complex biological networks
Technical field
The present invention relates to the technical field of systems biology research, and in particular to a method for predicting gene function by frequent pattern mining based on a set of complex biological networks.
Background art
With the emergence of biochip technology and the development of bioinformatics, genome sequence analysis has identified a large number of genes of unknown function. How to assign functions to these genes systematically and scientifically is a problem that scientists of the post-genome era need to solve, and the availability of large-scale gene expression data offers a new breakthrough for solving it.
In addition, within an organism a gene often carries out a function jointly with other genes. Such genes often have similar expression profiles, so mining co-expressed gene clusters from expression profile data has considerable research value in biology (for example, it allows the function of an unknown gene to be predicted or inferred). However, because high-throughput techniques are themselves noisy and biological systems are complex, the biological networks converted from microarray data contain a great deal of irrelevant "noise", and it is this noise that makes finding co-expressed gene clusters so difficult. If the irrelevant noise can be removed step by step, finding conserved co-expressed gene clusters becomes much simpler.
In the prior art, methods for predicting gene function by frequent pattern mining based on a set of complex biological networks fall into the following kinds. (1) Methods based on breadth-first traversal, which use an Apriori-like property to enumerate recurring subgraphs; the main representatives are AGM and FSG. AGM searches for all "induced" subgraphs in the graph set. The vertex set of an induced subgraph G' of a graph G satisfies V(G') ⊆ V(G), and the edges of G' are all the edges of G that join vertices of V(G'). FSG instead finds all frequent connected subgraphs in the graph set by growing them one edge at a time. (2) Methods based on depth-first search; the main representatives are gSpan, CloseSpan and FFSM. Their basic idea is to obtain frequent subgraphs by extending frequent edges step by step, and the algorithms differ mainly in how the graph is expanded. (3) Heuristic methods based on a summary graph; the main representatives are CODENSE and NeMo. Their main idea is first to integrate the information of every network in the network set, then to convert the problem into frequent pattern mining in a single graph, and finally to return to the original network set.
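The induced-subgraph definition used by AGM can be illustrated with a minimal Python sketch. The edge-set representation (pairs of vertex labels) and the function name `induced_subgraph` are assumptions made for illustration; they do not come from AGM or from this document.

```python
def induced_subgraph(edges, vertices):
    """Return the edges of G whose two endpoints both lie in `vertices`.

    `edges` is a set of (u, v) pairs describing G (hypothetical encoding);
    `vertices` plays the role of V(G').
    """
    keep = set(vertices)
    return {(u, v) for (u, v) in edges if u in keep and v in keep}


# A small graph G and the subgraph induced by {a, b, c}.
G = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")}
sub = induced_subgraph(G, {"a", "b", "c"})
```

Every edge of G between the chosen vertices is kept, which is what distinguishes an induced subgraph from an arbitrary subgraph on the same vertices.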
However, the inventor found that each of the above prediction methods has shortcomings: (1) in the first kind of method, the computational complexity is very high; (2) in the second kind, the computational complexity is also very high; (3) in the third kind, the quality of the summary graph directly determines the complexity of the algorithm, and pattern overlap further increases the complexity of the problem. In addition, to work around the scale limit of the spectral decomposition method (about 2000 nodes), the third kind of method can only process the summary graph in blocks, which raises the question of whether the partitioning is reasonable.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a method for predicting gene function by frequent pattern mining based on a set of complex biological networks that reduces computational complexity, improves the accuracy and efficiency of frequent pattern mining, and solves the pattern overlap problem.
To solve the above technical problem, an embodiment of the present invention provides a method for predicting gene function by frequent pattern mining based on a set of complex biological networks, the prediction method comprising:
First step: find the irrelevant edges that contribute nothing to finding frequent dense vertex sets:
Step S11: obtain the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) formed by converting multiple gene expression datasets into biological networks, and determine the values of the minimum density threshold δ, the minimum frequent support threshold k and the user-defined parameters f, p and q; the initial graph set consists of multiple subgraphs G_i that all share the same vertex set, and any two different subgraphs G_i differ in at least one edge;
Step S12: determine that every edge of every subgraph G_i occurs in the initial graph set more times than the smallest positive integer value of the product of the minimum frequent support threshold k and the graph-set size m;
Step S13: delete from each subgraph G_i of the initial graph set every edge whose density coefficient satisfies ED_e < δ/f;
Step S14: construct a summary graph with the same vertex set as the subgraphs G_i of the initial graph set, in which every edge must satisfy the density-coefficient condition [formula not preserved in the text];
Step S15: compute the edge clustering coefficient of every edge in the summary graph of the initial graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph;
Step S16: compare the updated summary graph edge by edge with each subgraph G_i, delete from each subgraph G_i the edges that are not present in the updated summary graph, and update the subgraphs;
Step S17: repeat steps S13 to S16 until the edges of the updated summary graph no longer change;
Second step: determine the candidate network subsets:
Step S21: assign a weight to the edge vector of every edge in the summary graph whose edges no longer change, determine the Hamming value of each edge's support vector, place the edge vectors whose Hamming value meets the screening condition into set A, and place those whose Hamming value does not meet it into set B;
Step S22: merge the edge vectors within set A and within set B separately, deleting repeated edge vectors so that only one copy remains and updating the weight of that copy;
Step S23: set seed vectors and adjust the edge vectors in set A and set B according to them, where a seed vector is the edge with the largest weight;
Step S24: following the criterion of maximum edge-vector similarity, map every edge vector in the adjusted set B to the adjusted set A and, once the mapping is complete, cluster the edge vectors in set A to form the set of cluster centers;
Step S25: delete from the set of cluster centers every cluster center whose number of 1s is less than the product of k and the graph-set size, rounded down to an integer;
Third step: obtain the set of summary graphs:
Step S31: according to the set of cluster centers formed, extract from the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) the subgraphs consistent with each vector in the set of cluster centers, forming multiple new graph sets;
Step S32: using the determined values of the minimum density threshold δ, the minimum frequent support threshold k and the user-defined parameters f, p and q, delete from each new graph set every edge whose density coefficient satisfies ED_e < δ/f;
Step S33: construct for each new graph set a summary graph with the same vertex set, in which every edge must satisfy the density-coefficient condition [formula not preserved in the text];
Step S34: compute the edge clustering coefficient of every edge in the summary graph of each new graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph;
Step S35: compare the updated summary graph of each new graph set edge by edge with the corresponding new graph set, delete from each new graph set the edges that are not present in its summary graph, and update the graph set;
Step S36: repeat steps S32 to S35 until the edges in the updated summary graph of each new graph set no longer change, which yields the set of summary graphs;
Fourth step: find dense subgraphs and determine the frequent dense vertex sets:
Step S41: in the set of summary graphs obtained, search, with reference to the updated summary graph of the initial graph set, for dense subgraphs whose edge sets are consistent with that summary graph; determine the frequent dense vertex sets from the dense subgraphs found; and, after merging the frequent dense vertex sets, output them as co-expressed gene clusters.
The user-defined parameter f takes values in [4, 10], the parameter p takes values in [0.1, 0.2], and the parameter q is 0.334.
Implementing the embodiments of the present invention has the following beneficial effects:
The embodiments first coarsely filter the initial graph set, deleting the irrelevant edges that contribute nothing to finding frequent dense vertex sets, to obtain a summary graph; then find possible candidate network subsets on the basis of the summary graph; then return to the initial graph set to extract each candidate network subset and coarsely filter the extracted graph sets again to obtain a set of summary graphs; and finally search each summary graph for dense subgraphs and take the resulting frequent dense vertex sets as co-expressed gene clusters. This reduces computational complexity, improves the accuracy and efficiency of frequent pattern mining, and solves the pattern overlap problem.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing them are briefly introduced below. Evidently, the drawings described below are only some embodiments of the invention; other drawings that a person of ordinary skill in the art obtains from them without creative effort still fall within the scope of the invention.
Fig. 1 is a block diagram of the working principle of the method for predicting gene function by frequent pattern mining based on a set of complex biological networks provided by an embodiment of the present invention;
Fig. 2 is an application scenario diagram of the first step of the method provided by an embodiment of the present invention, finding the irrelevant edges that contribute nothing to finding frequent dense vertex sets;
Fig. 3 is an application scenario diagram of the second step of the method provided by an embodiment of the present invention, determining the candidate network subsets.
Detailed description
To make the purpose, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
The invention builds on the CODENSE and NeMo algorithms and adopts a stepwise-refinement strategy, from summary graph to candidate network subsets, to find co-expressed gene clusters that are conserved across multiple datasets. For ease of description, the problem of finding conserved co-expressed gene clusters across multiple datasets is uniformly converted into the graph-theoretic problem of finding frequent dense vertex sets in a set of graphs. Under this formulation, a noise edge is an edge whose underlying relationship between two genes interferes with finding co-expressed gene clusters; noise edges also include the true-positive or false-positive edges between genes that arise from the experimental data itself.
The inventor found that the key to solving the problem is how to identify these irrelevant noise edges and remove them. Intuitively, noise edges can be characterized as follows:
(1) if an edge e is only sparsely connected to its neighboring edges, it necessarily contributes nothing to finding frequent dense vertex sets and is a noise edge;
(2) if an edge e shows strong connectivity with its neighboring edges in only a few graphs, it cannot meet the frequency requirement, so it contributes nothing to finding frequent dense vertex sets and becomes a noise edge;
(3) if an edge e is a "bridge" connecting two dense subgraphs in the summary graph, it necessarily contributes nothing to finding frequent dense vertex sets and becomes a noise edge;
(4) if an edge e does not appear in the summary graph but does appear in some graphs of the original graph set, it likewise contributes nothing to finding frequent dense vertex sets;
(5) if a vertex set V' is a frequent dense vertex set occurring in certain graphs of the original graph set, then the subgraphs it induces in the remaining graphs of the original graph set contribute nothing to the continued search for other frequent dense vertex sets.
Therefore, to delete the above five kinds of noise edge and mine frequent dense vertex sets, the inventor proposes in the embodiments of the present invention first to find the irrelevant edges that contribute nothing to finding frequent dense vertex sets, which removes the first four kinds of noise edge; next to cluster edge vectors, which removes the fifth kind and forms the candidate network subsets; then to return to the initial graph set to extract each candidate network subset and coarsely filter the extracted graph sets again to obtain a set of summary graphs; and finally to search each summary graph for dense subgraphs, giving the frequent dense vertex sets as co-expressed gene clusters. The working principle is shown in Fig. 1.
In summary, the method proposed by the inventor for predicting gene function by frequent pattern mining based on a set of complex biological networks specifically comprises:
First step: find the irrelevant edges that contribute nothing to finding frequent dense vertex sets (the FILTER algorithm):
Step S11: obtain the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) formed by converting multiple gene expression datasets into biological networks, and determine the values of the minimum density threshold δ, the minimum frequent support threshold k and the user-defined parameters f, p and q; the initial graph set consists of multiple subgraphs G_i that all share the same vertex set, and any two different subgraphs G_i differ in at least one edge;
Step S12: determine that every edge of every subgraph G_i occurs in the initial graph set more times than the smallest positive integer value of the product of the minimum frequent support threshold k and the graph-set size m;
Step S13: delete from each subgraph G_i of the initial graph set every edge whose density coefficient satisfies ED_e < δ/f; this step mainly deletes from each subgraph G_i the edges that are sparsely connected to the edges around them, while taking care not to delete relevant edges;
Step S14: construct a summary graph with the same vertex set as the subgraphs G_i of the initial graph set, in which every edge must satisfy the density-coefficient condition [formula not preserved in the text]; this step mainly builds the summary graph of the initial graph set, extracting the edges that meet the frequency requirement and placing them in the summary graph; here 0 < p < 1, which mainly prevents deleting relevant edges that contribute to finding frequent dense vertex sets;
Step S15: compute the edge clustering coefficient of every edge in the summary graph of the initial graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph; this step mainly deletes the edges that sparsely connect two dense subgraphs in the summary graph of the initial graph set;
Step S16: compare the updated summary graph edge by edge with each subgraph G_i, delete from each subgraph G_i the edges that are not present in the updated summary graph, and update the subgraphs; this step mainly updates the initial graph set according to its summary graph, so that the updated graph set has had part of its irrelevant edges filtered out and is better suited to finding frequent dense vertex sets;
Step S17: repeat steps S13 to S16 until the edges of the updated summary graph no longer change;
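The iterate-until-stable structure of steps S13 to S17 can be sketched in Python as follows. This is only an outline under stated assumptions: the text does not define the density coefficient ED_e or the edge clustering coefficient EC_e, so both tests are passed in as hypothetical callables (`dense_enough`, `clustered_enough`) that keep every edge by default, and the summary-graph condition of step S14 is replaced by an assumed rule that an edge must occur in at least `min_count` graphs.

```python
from collections import Counter


def always_keep(edge, edges):
    """Placeholder that keeps every edge (stand-in for the ED_e / EC_e tests)."""
    return True


def filter_graph_set(graphs, min_count,
                     dense_enough=always_keep, clustered_enough=always_keep):
    """Repeat S13-S16 until no graph loses an edge (assumed stopping rule)."""
    graphs = [set(g) for g in graphs]
    while True:
        before = [set(g) for g in graphs]
        # S13 (stand-in): drop edges that fail the per-graph density test.
        graphs = [{e for e in g if dense_enough(e, g)} for g in graphs]
        # S14 (assumption): summary graph = edges seen in >= min_count graphs.
        support = Counter(e for g in graphs for e in g)
        summary = {e for e, c in support.items() if c >= min_count}
        # S15 (stand-in): drop summary edges that fail the clustering test.
        summary = {e for e in summary if clustered_enough(e, summary)}
        # S16: keep in each graph only the edges still in the summary graph.
        graphs = [g & summary for g in graphs]
        # S17: stop once an entire pass removes nothing.
        if graphs == before:
            return graphs, summary


example = [
    {("a", "b"), ("b", "c")},
    {("a", "b")},
    {("a", "b"), ("c", "d")},
]
filtered, summary = filter_graph_set(example, min_count=2)
```

Because every pass can only remove edges, the loop is guaranteed to terminate; the stopping test compares the whole graph set rather than only the summary graph, which is a slightly stricter condition than the one stated in step S17.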
In one embodiment, Fig. 2 shows how a graph set made up of four graphs with the same vertex set changes as the algorithm runs. Suppose the goal is to find the frequent dense vertex sets in this graph set whose frequent support is at least 2 and whose density is at least 0.9. Clearly, the vertex sets {a, b, d}, {b, c, d} and {e, f, g, h} meet the requirement. Here f is 4 and p is 0.8. In each graph of the figure, a solid line is an edge actually present in that graph, and a dashed line is an edge that the previous step marked for deletion.
As Fig. 2 shows, each updated subgraph contains fewer noise edges than it did in the initial graph set. However, not all frequent dense vertex sets can be extracted directly from the summary graph in a single pass. For example, the dense subgraph {a, b, c, d} in the summary graph actually represents the two vertex sets {a, b, d} and {b, c, d}, because a frequent dense vertex set tends to occur in only certain graphs of the graph set. So if it were known which graphs a frequent dense vertex set occurs in, and a summary graph were built from just those graphs, the frequent dense vertex set would be easy to extract. In theory, for a graph set of size m and frequent dense vertex sets with support k, the search space is [formula not preserved in the text]. For a graph set of size 20 and dense vertex sets with frequent support 6, the search space is [formula not preserved in the text] possible candidate network subsets, which is clearly infeasible in practice.
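The search-space formula is not preserved in the text, so the following Python sketch only illustrates the scale of the problem under two assumed readings: choosing exactly k of the m graphs, or choosing any subset of at least k graphs. Neither reading is confirmed by the document.

```python
import math

m, k = 20, 6  # graph-set size and frequent support from the example above

# Assumed reading 1: candidate subsets are exactly the k-element subsets.
exact_k = math.comb(m, k)

# Assumed reading 2: candidate subsets are all subsets with at least k graphs.
at_least_k = sum(math.comb(m, j) for j in range(k, m + 1))

print(exact_k, at_least_k)
```

Under either reading the count grows combinatorially with m, which is the motivation for narrowing the candidates before building summary graphs.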
It is therefore necessary to determine the possible candidate network subsets so as to shrink the search space, as follows:
Second step: determine the candidate network subsets (the GCLUSTER algorithm):
Step S21: assign a weight to the edge vector of every edge in the summary graph whose edges no longer change, determine the Hamming value of each edge's support vector, place the edge vectors whose Hamming value meets the screening condition into set A, and place those whose Hamming value does not meet it into set B;
Step S22: merge the edge vectors within set A and within set B separately, deleting repeated edge vectors so that only one copy remains and updating the weight of that copy;
Step S23: set seed vectors and adjust the edge vectors in set A and set B according to them, where a seed vector is the edge with the largest weight;
Step S24: following the criterion of maximum edge-vector similarity, map every edge vector in the adjusted set B to the adjusted set A and, once the mapping is complete, cluster the edge vectors in set A to form the set of cluster centers;
Step S25: delete from the set of cluster centers every cluster center whose number of 1s is less than the product of k and the graph-set size, rounded down to an integer;
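The pruning rule of step S25 can be sketched in Python as below. The sketch assumes that cluster centers are 0/1 tuples with one position per graph and that k is given as a fraction of the graph-set size m, so that the floor of k·m is a count of graphs; the function name `prune_centers` is hypothetical.

```python
import math


def prune_centers(centers, k, m):
    """Keep only cluster centers with at least floor(k * m) ones (step S25)."""
    min_ones = math.floor(k * m)
    return [c for c in centers if sum(c) >= min_ones]


centers = [(1, 1, 0, 0), (1, 0, 0, 0), (1, 1, 1, 0)]
kept = prune_centers(centers, k=0.5, m=4)
```

A center with too few 1s describes a candidate subset of graphs that is too small to meet the frequency requirement, so it is dropped before any summary graph is built for it.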
In one embodiment, the input is: the summary graph, the graph-set size m, the minimum frequent support k, and the minimum Hamming distance threshold τ;
Output: the cluster centers C;
Step 1: for each edge e in the summary graph, set the weight of its edge vector v_e to w(v_e) = 1; place the edge vectors whose edge support vector has Hamming value k or k+1 into set A and the remaining edge vectors into set B; then merge the edge vectors within set A and within set B separately, deleting repeated edge vectors so that only one copy remains and updating the weight of that copy;
Step 2: for each edge vector v_e ∈ B do
move the edge vectors in set A whose Hamming value is k to set B;
Step 3: sort the edge vectors in A in descending order of weight;
The algorithm above initializes the weights of the edge vectors of the summary graph, then performs a simple merge of the edge vectors and updates the weights. After this step the remaining edge vectors are pairwise distinct, and the weight of each edge vector is the number of edges in the summary graph that have exactly that edge vector. In the following steps, the algorithm first places the edge vectors whose Hamming value equals the frequent support required by the user into a set A as seeds and the remaining edge vectors into another set B, then maps the edge vectors in B onto A according to the criterion of maximum edge-vector similarity, and finally clusters the edge vectors in A to form the set of cluster centers, that is, the set of candidate network subsets.
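The weighting, merging and mapping just described can be sketched in Python. Several points are assumptions made for illustration: an edge's support vector has a 1 in position i when the edge is present in graph i, the Hamming value is the number of 1s, and "maximum edge-vector similarity" is taken to mean minimum Hamming distance. All function names are hypothetical.

```python
from collections import Counter


def edge_vectors(graphs, summary_edges):
    """Support vector per summary edge: bit i is 1 if the edge is in graph i."""
    return {e: tuple(int(e in g) for g in graphs) for e in summary_edges}


def split_and_merge(vectors, k):
    """Merge identical vectors (weight = multiplicity) and split into A and B."""
    weights = Counter(vectors.values())
    A = {v: w for v, w in weights.items() if sum(v) in (k, k + 1)}
    B = {v: w for v, w in weights.items() if v not in A}
    return A, B


def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def map_b_to_a(A, B):
    """Attach each vector of B to its nearest seed in A (A must be non-empty)."""
    groups = {seed: [(seed, w)] for seed, w in A.items()}
    for vec, w in B.items():
        nearest = min(A, key=lambda s: hamming(s, vec))
        groups[nearest].append((vec, w))
    return groups


graphs = [
    {("a", "b"), ("a", "c"), ("g", "h")},
    {("a", "b"), ("a", "c")},
    {("d", "e")},
    {("d", "e")},
]
summary_edges = [("a", "b"), ("a", "c"), ("d", "e"), ("g", "h")]
A, B = split_and_merge(edge_vectors(graphs, summary_edges), k=2)
groups = map_b_to_a(A, B)
```

Here the edges (a, b) and (a, c) share the vector (1, 1, 0, 0) and are merged into one seed of weight 2, while the vector of (g, h) falls into B and is attached to that seed because it is the closest one.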
It remains to explain how a cluster center is formed once a set T is available. The value of the cluster center in the i-th graph is determined by comparing, over all edge vectors in T and their weights, the total weight of the 0s with the total weight of the 1s in the i-th position. If the total weight of the 1s is larger, the cluster center takes the value 1 in the i-th graph; otherwise it takes the value 0, as shown in Fig. 3.
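The weighted vote that forms a cluster center can be written as a short Python sketch. The representation of the set T as a list of (vector, weight) pairs and the function name `cluster_center` are assumptions made for illustration.

```python
def cluster_center(group):
    """Weighted majority vote per position over (vector, weight) pairs.

    Position i becomes 1 only when the total weight of vectors with a 1 there
    exceeds the total weight of vectors with a 0; ties give 0.
    """
    length = len(group[0][0])
    center = []
    for i in range(length):
        ones = sum(w for vec, w in group if vec[i] == 1)
        zeros = sum(w for vec, w in group if vec[i] == 0)
        center.append(1 if ones > zeros else 0)
    return tuple(center)


T = [((1, 1, 0, 0), 2), ((1, 0, 0, 0), 1)]
center = cluster_center(T)
```

In this example the second position has weight 2 voting for 1 and weight 1 voting for 0, so the center keeps a 1 there.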
Third step: obtain the set of summary graphs:
Step S31: according to the set of cluster centers formed, extract from the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) the subgraphs consistent with each vector in the set of cluster centers, forming multiple new graph sets;
Step S32: using the determined values of the minimum density threshold δ, the minimum frequent support threshold k and the user-defined parameters f, p and q, delete from each new graph set every edge whose density coefficient satisfies ED_e < δ/f;
Step S33: construct for each new graph set a summary graph with the same vertex set, in which every edge must satisfy the density-coefficient condition [formula not preserved in the text];
Step S34: compute the edge clustering coefficient of every edge in the summary graph of each new graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph;
Step S35: compare the updated summary graph of each new graph set edge by edge with the corresponding new graph set, delete from each new graph set the edges that are not present in its summary graph, and update the graph set;
Step S36: repeat steps S32 to S35 until the edges in the updated summary graph of each new graph set no longer change, which yields the set of summary graphs;
Fourth step: find dense subgraphs and determine the frequent dense vertex sets. A dense subgraph must satisfy two requirements: (1) the subgraph is a connected component; (2) the density of the subgraph is greater than the set density threshold. Specifically:
Step S41: in the set of summary graphs obtained, search, with reference to the updated summary graph of the initial graph set, for dense subgraphs whose edge sets are consistent with that summary graph; determine the frequent dense vertex sets from the dense subgraphs found; and, after merging the frequent dense vertex sets, output them as co-expressed gene clusters.
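The two requirements on a dense subgraph (a connected component whose density exceeds the threshold) can be sketched in Python. The text does not give a density formula, so the sketch assumes the usual definition for simple undirected graphs, 2|E| / (|V|(|V| - 1)); the function names are hypothetical, and each undirected edge is assumed to be listed once.

```python
def connected_components(edges):
    """Connected components of an undirected graph given as (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def density(vertex_set, edges):
    """Assumed density: edges inside the set divided by the possible pairs."""
    n = len(vertex_set)
    if n < 2:
        return 0.0
    inside = sum(1 for u, v in edges if u in vertex_set and v in vertex_set)
    return 2 * inside / (n * (n - 1))


def dense_components(edges, delta):
    """Connected components whose density is greater than delta."""
    return [c for c in connected_components(edges) if density(c, edges) > delta]


summary = [("a", "b"), ("b", "c"), ("a", "c"), ("e", "f"), ("f", "g")]
dense = dense_components(summary, delta=0.9)
```

The triangle {a, b, c} has density 1.0 and is kept, while the path e, f, g has density 2/3 and is rejected at a threshold of 0.9.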
Note that frequent dense vertex sets are merged as follows. First, for each frequent dense vertex set, a vector is built that records, for each graph of the original graph set, whether the density of the subgraph the vertex set induces in that graph exceeds the given threshold. Merging then follows three principles:
(1) identical frequent dense vertex sets are merged, keeping only one;
(2) if two frequent dense vertex sets have more than about 85% of their elements in common, or one is a subset of the other, and the vectors of the two frequent dense vertex sets are identical, the two are merged into one by taking their union;
(3) if two frequent dense vertex sets have more than about 85% of their elements in common, or one is a subset of the other, but the vectors of the two frequent dense vertex sets differ, the two are not merged, and the frequent dense vertex set with more elements is split.
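Principles (1) and (2) can be sketched in Python as a single greedy pass. The sketch rests on assumptions: the shared fraction is measured against the smaller of the two sets (so a subset always counts as fully shared), vertex sets are non-empty, and the splitting part of principle (3) is left out because the text does not say how the split is performed. The function names are hypothetical.

```python
def shared_fraction(a, b):
    """Fraction of the smaller set's elements that also lie in the other set."""
    return len(a & b) / min(len(a), len(b))


def merge_dense_sets(items, threshold=0.85):
    """Greedy merge of (vertex_set, occurrence_vector) pairs.

    Two sets are united only when they overlap by more than `threshold`
    (or one contains the other) and their occurrence vectors are identical.
    Sets with different vectors stay separate; splitting is not implemented.
    """
    merged = []
    for vertices, vector in items:
        vertices = set(vertices)
        for entry in merged:
            if vector == entry[1] and shared_fraction(vertices, entry[0]) > threshold:
                entry[0] |= vertices  # principle (2): take the union
                break
        else:
            merged.append([vertices, vector])
    return [(frozenset(v), vec) for v, vec in merged]


items = [
    ({"a", "b", "c", "d"}, (1, 1, 0)),
    ({"a", "b", "c"}, (1, 1, 0)),   # subset with the same vector: merged
    ({"a", "b", "c"}, (0, 1, 1)),   # same elements, other vector: kept apart
    ({"x", "y", "z"}, (1, 1, 0)),   # disjoint: kept apart
]
result = merge_dense_sets(items)
```

A single pass is a simplification: a union produced late in the pass is not re-compared with earlier entries, so a fuller implementation might repeat the pass until nothing changes.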
In embodiments of the present invention, the user-defined parameter f takes a value in [4, 10], the parameter p takes a value in [0.1, 0.2], and the parameter q is 0.334.
Implementing the embodiments of the present invention has the following beneficial effects:
The embodiment first coarsely filters the initial graph set, deleting irrelevant edges that do not contribute to finding frequent dense vertex sets, to obtain a summary graph. It then finds possible candidate network subsets on the basis of the summary graph, returns to the initial graph set to extract each candidate network subset, and coarsely filters the extracted graph sets again to obtain the summary graph set. Finally, it searches each summary graph for dense subgraphs and takes the resulting frequent dense vertex sets as co-expressed gene clusters. This reduces computational complexity, improves the accuracy and efficiency of frequent pattern mining, and addresses the pattern overlap problem.
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (2)

1. A method for predicting gene function by frequent pattern mining based on a complex biological network set, characterized in that the method comprises:
Step 1: filter out irrelevant edges that do not contribute to finding frequent dense vertex sets:
Step S11: obtain multiple gene expression data sets and convert them into biological networks to form the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m), and determine the values of the minimum density threshold δ, the minimum frequent support threshold k, and the user-defined parameters f, p, and q; wherein the initial graph set consists of multiple subgraphs G_i that all have the same vertex set, and any two different subgraphs G_i differ in at least one edge;
Step S12: determine that the number of times each edge of each subgraph G_i occurs in the initial graph set is greater than the smallest positive integer value of the product of the minimum frequent support threshold k and the graph-set size m;
Step S13: in each subgraph G_i of the initial graph set, delete every edge whose density coefficient satisfies ED_e < δ/f;
Step S14: construct a summary graph having the same vertex set as each subgraph G_i of the initial graph set, where every edge in the summary graph corresponding to the initial graph set must satisfy the density-coefficient condition;
Step S15: compute the edge clustering coefficient of every edge in the summary graph corresponding to the initial graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph;
Step S16: compare the summary graph corresponding to the updated initial graph set with the edges of each subgraph G_i one by one, delete from each subgraph G_i the edges that are not present in that summary graph, and update;
Step S17: repeat steps S13 to S16 until the edges in the summary graph corresponding to the updated initial graph set no longer change;
Step 2: determine the candidate network subsets:
Step S21: assign a weight to the edge vector of each edge in the summary graph whose edges no longer change, determine the Hamming value of the edge support vector corresponding to each weighted edge, and place the edge vectors whose Hamming value meets the screening condition into set A and those whose Hamming value does not meet it into set B;
Step S22: merge the edge vectors within set A and within set B separately, deleting repeated edge vectors so that only one copy remains, and update the weight of the corresponding edge vector;
Step S23: set a seed vector and adjust the edge vectors in set A and set B according to it; wherein the seed vector is the edge with the maximum weight;
Step S24: following the criterion of maximum edge-vector similarity, map the edge vectors in the adjusted set B onto the adjusted set A; after the mapping is complete, cluster the edge vectors in the mapped set A to form a cluster center set;
Step S25: delete from the cluster center set every cluster center in which the number of 1s is less than ⌊k·m⌋, that is, the product of k and the graph-set size m rounded down;
Step 3: obtain the summary graph set:
Step S31: according to the cluster center set formed, extract from the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) the subgraphs consistent with each vector in the cluster center set, forming multiple new graph sets;
Step S32: according to the determined values of the minimum density threshold δ, the minimum frequent support threshold k, and the user-defined parameters f, p, and q, delete from each new graph set every edge whose density coefficient satisfies ED_e < δ/f;
Step S33: for each new graph set, construct a summary graph having the same vertex set, where every edge in the summary graph of each new graph set must satisfy the density-coefficient condition;
Step S34: compute the edge clustering coefficient of every edge in the summary graph of each new graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph;
Step S35: compare the updated summary graph of each new graph set with the edges of the corresponding new graph set one by one, delete from each new graph set the edges that are not present in its summary graph, and update;
Step S36: repeat steps S32 to S35 until the edges in the updated summary graph of each new graph set no longer change, thereby obtaining the summary graph set;
Step 4: find dense subgraphs and determine the frequent dense vertex sets:
Step S41: in the obtained summary graph set, according to the summary graph corresponding to the updated initial graph set, search for the dense subgraphs whose edge sets are consistent with those in the summary graph corresponding to the updated initial graph set; determine the frequent dense vertex sets from the dense subgraphs found; and, after merging the determined frequent dense vertex sets, output them as co-expressed gene clusters.
2. The method of claim 1, characterized in that the user-defined parameter f takes a value in [4, 10], the parameter p takes a value in [0.1, 0.2], and the parameter q is 0.334.
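As an illustration of the edge support vectors referred to in steps S21 and S25 of claim 1, the following minimal sketch builds one binary vector per edge across the m graphs of a graph set and filters vectors by the number of 1s they contain. It assumes that the support vector of an edge has a 1 in position i when the edge occurs in G_i, and that the cut-off in step S25 is the product of k and m rounded down; the function names and the sample data are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # endpoints stored in sorted order, e.g. ("a", "b")


def edge_support_vectors(graphs: List[Set[Edge]]) -> Dict[Edge, Tuple[int, ...]]:
    """Map each edge to a 0/1 vector marking the graphs in which it occurs."""
    all_edges = set().union(*graphs)
    return {e: tuple(int(e in g) for g in graphs) for e in sorted(all_edges)}


def count_ones(vector: Tuple[int, ...]) -> int:
    """Number of 1s in a support vector."""
    return sum(vector)


def frequent_vectors(vectors: List[Tuple[int, ...]], k: float, m: int) -> List[Tuple[int, ...]]:
    """Keep vectors whose number of 1s is at least floor(k * m) (assumed cut-off)."""
    min_support = math.floor(k * m)
    return [v for v in vectors if count_ones(v) >= min_support]


graphs = [
    {("a", "b"), ("b", "c")},
    {("a", "b")},
    {("a", "b"), ("c", "d")},
    {("b", "c")},
]
vectors = edge_support_vectors(graphs)
kept = frequent_vectors(list(vectors.values()), k=0.5, m=len(graphs))
```

With four graphs and k = 0.5 the cut-off is 2, so the vectors of edges a-b and b-c are kept and the vector of edge c-d, which occurs in only one graph, is discarded.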
CN201610648165.9A 2016-08-08 2016-08-08 The Forecasting Methodology of Frequent Pattern Mining gene function based on complex biological network collection Pending CN106295249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610648165.9A CN106295249A (en) 2016-08-08 2016-08-08 The Forecasting Methodology of Frequent Pattern Mining gene function based on complex biological network collection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610648165.9A CN106295249A (en) 2016-08-08 2016-08-08 The Forecasting Methodology of Frequent Pattern Mining gene function based on complex biological network collection

Publications (1)

Publication Number Publication Date
CN106295249A true CN106295249A (en) 2017-01-04

Family

ID=57667262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610648165.9A Pending CN106295249A (en) 2016-08-08 2016-08-08 The Forecasting Methodology of Frequent Pattern Mining gene function based on complex biological network collection

Country Status (1)

Country Link
CN (1) CN106295249A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709275A (en) * 2017-02-04 2017-05-24 上海喆之信息科技有限公司 Restricted type cardiomyopathy gene data processing device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102855398A (en) * 2012-08-28 2013-01-02 中国科学院自动化研究所 Method for obtaining disease potentially-associated gene based on multi-source information fusion
CN103514381A (en) * 2013-07-22 2014-01-15 湖南大学 Protein biological network motif identification method integrating topological attributes and functions
CN104598748A (en) * 2015-01-29 2015-05-06 中国人民解放军军械工程学院 Calculating method of restrictive boolean network degeneracy

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAIYAN HU et al.: "Mining coherent dense subgraphs across massive biological networks for functional discovery", 《BIOINFORMATICS》 *
ZAN XIANGZHEN et al.: "A Graph-based Method to Mine Coexpression Clusters Across Multiple Datasets", 《CHINESE JOURNAL OF ELECTRONICS》 *
ZAN XIANGZHEN (昝乡镇): "Research on the development of software for frequent pattern mining in complex biological network sets", 《China Master's Theses Full-text Database, Basic Sciences》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170104