CN106295249A

CN106295249A - Method for predicting gene function by frequent pattern mining on complex biological network sets

Info

Publication number: CN106295249A
Application number: CN201610648165.9A
Authority: CN
Inventors: 沈良忠
Original assignee: Wenzhou University
Current assignee: Wenzhou University
Priority date: 2016-08-08
Filing date: 2016-08-08
Publication date: 2017-01-04

Abstract

An embodiment of the invention discloses a method for mining frequent patterns of gene function from a set of complex biological networks. The method first obtains the initial graph set formed by converting multiple gene expression datasets into biological networks and coarsely filters it, deleting the irrelevant edges that contribute nothing to finding frequent dense vertex sets, to obtain a summary graph. It then finds possible candidate network subsets on the basis of the summary graph, returns to the initial graph set to extract each candidate network subset, and coarsely filters the extracted graph sets again to obtain a set of summary graphs. Finally, it searches each summary graph for dense subgraphs and takes the resulting frequent dense vertex sets as co-expressed gene clusters. Implementing the invention reduces computational complexity, improves the accuracy and efficiency of frequent pattern mining, and resolves the pattern overlap problem.

Description

Method for predicting gene function by frequent pattern mining on complex biological network sets

Technical field

The present invention relates to the technical field of systems biology research, and in particular to a method for predicting gene function by frequent pattern mining on complex biological network sets.

Background art

With the emergence of biochip technology and the development of bioinformatics, genome sequence analysis has identified a large number of genes of unknown function. How to assign functions to these genes systematically and scientifically is a difficult problem that scientists must solve in the post-genomic era, and the generation of large-scale gene expression data has opened a new avenue for solving it.

In addition, a gene in an organism often carries out a function jointly with other genes, and such genes tend to have similar expression profiles. Mining co-expressed gene clusters from gene expression profile data therefore has considerable research value (in biology, for example, it can be used to predict or infer the function of an unknown gene). However, because high-throughput techniques are inherently noisy and biological systems are inherently complex, the biological networks converted from microarray data contain a great deal of irrelevant "noise", and it is precisely this noise that makes co-expressed gene clusters hard to find. If the irrelevant noise can be removed step by step, finding conserved co-expressed gene clusters becomes much simpler.

In the prior art, methods for predicting gene function by frequent pattern mining on complex biological network sets fall into the following categories: (1) Methods based on breadth-first traversal, which use an Apriori-like property to enumerate recurring subgraphs; the main representatives are AGM and FSG. AGM searches for all "induced" subgraphs in the graph set. The node set of an induced subgraph G' of a graph G satisfies V(G') ⊆ V(G), and the edges of G' are all the edges of G between the nodes in V(G'). FSG instead finds all frequent connected subgraphs in the graph set by growing them edge by edge. (2) Methods based on depth-first search; the main representatives are gSpan, CloseSpan and FFSM. Their basic idea is to obtain frequent subgraphs by extending frequent edges step by step, and the algorithms differ mainly in how they extend the graph. (3) Heuristic methods based on a summary graph; the main representatives are CODENSE and NeMo. Their main idea is to first integrate the information of every network in the network set, convert the problem into frequent pattern mining on a single graph, and finally return to the original network set.

However, the inventor has found that each of these prediction methods has shortcomings: (1) the first method has very high computational complexity; (2) the second method also has very high computational complexity; (3) in the third method, the quality of the summary graph directly determines the complexity of the algorithm, and pattern overlap further increases the complexity of the problem. In addition, to stay within the scale limit of spectral decomposition (about 2000 nodes), the third method can only process the summary graph in blocks, which raises the question of whether the partition is reasonable.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a method for predicting gene function by frequent pattern mining on complex biological network sets that reduces computational complexity, improves the accuracy and efficiency of frequent pattern mining, and resolves the pattern overlap problem.

To solve the above technical problem, the embodiments of the present invention provide a method for predicting gene function by frequent pattern mining on complex biological network sets, the method comprising:

Step 1: find the irrelevant edges that contribute nothing to finding frequent dense vertex sets:

Step S11: obtain the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) formed by converting multiple gene expression datasets into biological networks, and determine the values of the minimum density threshold δ, the minimum frequency support threshold k and the user-defined parameters f, p and q; the initial graph set consists of multiple subgraphs G_i that all share the same vertex set, and any two different subgraphs G_i differ in at least one edge;

Step S12: determine, for every edge of every subgraph G_i, that the number of times it occurs in the initial graph set exceeds the smallest positive integer value of the product of the minimum frequency support threshold k and the graph set size m;

Step S13: delete from each subgraph G_i of the initial graph set every edge whose density coefficient satisfies ED_e < δ/f;

Step S14: construct a summary graph that has the same vertex set as each subgraph G_i of the initial graph set; every edge in this summary graph must satisfy the density-coefficient condition [formula missing from the source text];

Step S15: compute the edge clustering coefficient of every edge in the summary graph of the initial graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph;

Step S16: compare the edges of the updated summary graph with the edges of each subgraph G_i one by one, delete from each subgraph G_i the edges that are not present in the updated summary graph, and update the subgraph;

Step S17: repeat steps S13 to S16 until the edges of the updated summary graph of the initial graph set no longer change;

Step 2: determine the candidate network subsets:

Step S21: in the summary graph whose edges no longer change, assign a weight to the edge vector of every edge and determine the Hamming weight of each edge's support vector; place the edge vectors whose Hamming weight meets the screening condition into a set A, and those whose Hamming weight does not into a set B;

Step S22: merge the edge vectors within set A and within set B separately, deleting repeated edge vectors so that only one copy remains, and update the weight of the retained edge vector;

Step S23: set the seed vectors and adjust the edge vectors in set A and set B accordingly; a seed vector is the edge vector with the largest weight;

Step S24: following the criterion of maximum edge-vector similarity, map every edge vector in the adjusted set B into the adjusted set A; once the mapping is complete, cluster the edge vectors in set A to form the set of cluster centers;

Step S25: delete from the set of cluster centers every cluster center in which the number of 1s is less than ⌊k·m⌋, the product of k and the graph set size rounded down to an integer;

Step 3: obtain the set of summary graphs:

Step S31: according to the set of cluster centers thus formed, extract from the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) the subgraphs consistent with each vector in the set of cluster centers, forming multiple new graph sets;

Step S32: using the values determined for the minimum density threshold δ, the minimum frequency support threshold k and the user-defined parameters f, p and q, delete from each new graph set every edge whose density coefficient satisfies ED_e < δ/f;

Step S33: for each new graph set, construct a summary graph with the same vertex set; every edge in the summary graph of each new graph set must satisfy the density-coefficient condition [formula missing from the source text];

Step S34: compute the edge clustering coefficient of every edge in the summary graph of each new graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph;

Step S35: compare the edges of the updated summary graph of each new graph set with the edges of the corresponding new graph set one by one, delete from each new graph set the edges that are not present in its summary graph, and update the graph set;

Step S36: repeat steps S32 to S35 until the edges of the updated summary graph of each new graph set no longer change, which yields the set of summary graphs;

Step 4: search for dense subgraphs and determine the frequent dense vertex sets:

Step S41: in the set of summary graphs thus obtained, and with reference to the updated summary graph of the initial graph set, search for the dense subgraphs whose edge sets are consistent with that summary graph; determine the frequent dense vertex sets from the dense subgraphs found, merge the frequent dense vertex sets so determined, and output them as co-expressed gene clusters.

The user-defined parameter f takes values in [4, 10]; the parameter p takes values in [0.1, 0.2]; the parameter q is 0.334.

Implementing the embodiments of the present invention has the following beneficial effects:

The embodiments of the present invention first coarsely filter the initial graph set, deleting the irrelevant edges that contribute nothing to finding frequent dense vertex sets, to obtain a summary graph. Possible candidate network subsets are then found on the basis of the summary graph, each candidate network subset is extracted from the initial graph set, and the extracted graph sets are coarsely filtered again to obtain the set of summary graphs. Finally, each summary graph is searched for dense subgraphs and the resulting frequent dense vertex sets are taken as co-expressed gene clusters. This reduces computational complexity, improves the accuracy and efficiency of frequent pattern mining, and resolves the pattern overlap problem.

Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present invention; other drawings that a person of ordinary skill in the art derives from them without creative effort still fall within the scope of the present invention.

Fig. 1 is a block diagram of the working principle of the method for predicting gene function by frequent pattern mining on complex biological network sets provided by an embodiment of the present invention;

Fig. 2 is an application scenario diagram of Step 1 of the method provided by an embodiment of the present invention, in which the irrelevant edges that contribute nothing to finding frequent dense vertex sets are found;

Fig. 3 is an application scenario diagram of Step 2 of the method provided by an embodiment of the present invention, in which the candidate network subsets are determined.

Detailed description of the embodiments

To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the present invention and do not limit it.

Building on the CODENSE and NeMo algorithms, the present invention adopts a strategy of stepwise refinement from the summary graph to candidate network subsets in order to find the co-expressed gene clusters that are conserved across multiple datasets. For ease of description, the problem of finding conserved co-expressed gene clusters across multiple datasets is uniformly recast as the graph-theoretic problem of finding frequent dense vertex sets in a set of graphs. In this formulation, a noise edge is an edge whose represented relationship between two genes interferes with finding co-expressed gene clusters; noise edges may also include true-positive or false-positive edges between genes that arise from the experimental data itself.

The inventor found that the key to solving the problem is how to identify these irrelevant noise edges and remove them. Intuitively, noise edges can be understood as follows:

(1) if an edge e is only sparsely connected to its neighboring edges, it contributes nothing to finding frequent dense vertex sets and is therefore a noise edge;

(2) if an edge e is strongly connected to its neighboring edges in only a few graphs, it cannot meet the frequency requirement, contributes nothing to finding frequent dense vertex sets, and is therefore a noise edge;

(3) if an edge e is a "bridge" connecting two dense subgraphs in the summary graph, it contributes nothing to finding frequent dense vertex sets and is therefore a noise edge;

(4) if an edge e does not appear in the summary graph but appears in some graphs of the original graph set, it likewise contributes nothing to finding frequent dense vertex sets;

(5) if a vertex set V' is a frequent dense vertex set that occurs in certain graphs of the original graph set, then the subgraphs it induces in the remaining graphs of the original graph set contribute nothing to finding further frequent dense vertex sets.

Therefore, to delete these five kinds of noise edges and mine the frequent dense vertex sets, the inventor proposes in the embodiments of the present invention to first delete the first four kinds of noise edges by finding the irrelevant edges that contribute nothing to frequent dense vertex sets, and then to delete the fifth kind by edge-vector clustering, which also forms the candidate network subsets. The method then returns to the initial graph set to extract each candidate network subset, coarsely filters the extracted graph sets again to obtain the set of summary graphs, and finally searches each summary graph for dense subgraphs to obtain the frequent dense vertex sets as co-expressed gene clusters. The working principle is shown in Fig. 1.

In summary, the method proposed by the inventor for predicting gene function by frequent pattern mining on complex biological network sets specifically comprises:

Step 1: find the irrelevant edges that contribute nothing to finding frequent dense vertex sets, i.e. the FILTER algorithm:

Step S13: delete from each subgraph G_i of the initial graph set every edge whose density coefficient satisfies ED_e < δ/f. This step mainly deletes, from each subgraph G_i, the edges that are sparsely connected to their surrounding edges; the threshold is relaxed to δ/f so that relevant edges are not deleted;

Step S14: construct a summary graph that has the same vertex set as each subgraph G_i of the initial graph set; every edge in this summary graph must satisfy the density-coefficient condition [formula missing from the source text]. This step mainly builds the summary graph of the initial graph set: the edges that meet the frequency requirement are extracted and placed in the summary graph. Here 0 < p < 1, which mainly prevents the deletion of relevant edges that contribute to finding frequent dense vertex sets;

Step S15: compute the edge clustering coefficient of every edge in the summary graph of the initial graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph. This step mainly deletes the edges that sparsely connect two dense subgraphs in the summary graph;

Step S16: compare the edges of the updated summary graph with the edges of each subgraph G_i one by one, delete from each subgraph G_i the edges that are not present in the updated summary graph, and update the subgraph. This step mainly updates the initial graph set according to its summary graph, so that the updated graph set has a portion of its irrelevant edges filtered out and is better suited to finding frequent dense vertex sets;
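A minimal Python sketch of the loop formed by steps S14 to S17 is given here for illustration only. It rests on assumptions that are not in the source: each graph is a set of edges written as sorted vertex pairs; the summary graph keeps the edges that occur in at least `min_count` graphs (a stand-in for the S14 condition, whose formula is not given); the edge clustering coefficient is taken to be the number of triangles through an edge divided by min(deg(u) - 1, deg(v) - 1); and the density filter of S13 is left out because ED_e is not defined in the text. All function names are hypothetical.

```python
def summary_graph(graphs, min_count):
    """Edges that occur in at least min_count graphs (assumed S14 stand-in)."""
    counts = {}
    for g in graphs:
        for e in g:
            counts[e] = counts.get(e, 0) + 1
    return {e for e, c in counts.items() if c >= min_count}


def edge_clustering(edges, e):
    """Assumed definition: triangles through e / min(deg(u) - 1, deg(v) - 1)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    u, v = e
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    if denom <= 0:
        return 0.0
    return len(adj[u] & adj[v]) / denom


def filter_graph_set(graphs, min_count, q=0.334):
    """Repeat S14-S16 until no graph loses an edge (S17)."""
    graphs = [set(g) for g in graphs]
    while True:
        summary = summary_graph(graphs, min_count)
        # S15: drop summary edges with a low edge clustering coefficient.
        kept = {e for e in summary if edge_clustering(summary, e) >= q}
        # S16: drop from every graph the edges absent from the summary graph.
        pruned = [g & kept for g in graphs]
        if pruned == graphs:
            return kept, graphs
        graphs = pruned
```

With two triangles joined by a single bridge edge, the bridge has no triangle through it, so its coefficient is 0 and it is removed in the first pass, which matches the intent stated for S15.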

In one embodiment, Fig. 2 shows how a graph set consisting of four graphs with the same vertex set changes as the algorithm runs. Suppose the goal is to find the frequent dense vertex sets in this graph set whose frequency support is at least 2 and whose density is at least 0.9. Clearly, the vertex sets {a, b, d}, {b, c, d} and {e, f, g, h} are frequent dense vertex sets that meet these requirements. Here f is 4 and p is 0.8. In each graph of the figure, a solid line represents an edge that actually exists in that graph, and a dashed line represents an edge that the previous step requires to be deleted.

As Fig. 2 shows, each updated subgraph contains fewer noise edges than in the initial graph set. However, not all frequent dense vertex sets can be extracted directly from the summary graph in one pass: for example, the dense subgraph {a, b, c, d} in the summary graph actually represents the two vertex sets {a, b, d} and {b, c, d}, because a frequent dense vertex set tends to occur in only some of the graphs of the graph set. If it were known in which graphs a frequent dense vertex set occurs, building a summary graph from just those graphs would make the set easy to extract. In theory, for a graph set of size m and a frequent dense vertex set with support k, the search space is [expression missing from the source text]. For a graph set of size 20 and a required frequency support of 6, the search space is [expression missing from the source text] possible candidate network subsets, which is clearly infeasible in practice.
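The size of this search space can be checked with a short calculation. The sketch below assumes that a candidate network subset is any choice of exactly k graphs out of the m graphs in the graph set, so that the count is the binomial coefficient C(m, k); this reading and the function name are assumptions, not statements from the source.

```python
import math


def candidate_subset_count(m, k):
    """Number of ways to choose exactly k graphs out of m (assumed reading)."""
    return math.comb(m, k)


# Graph set of size 20, required frequency support 6.
print(candidate_subset_count(20, 6))  # 38760
```

Under this assumption every one of these subsets would need its own summary graph and dense-subgraph search, which is why the candidate subsets are narrowed down first.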

It is therefore necessary to determine the possible candidate network subsets and thereby reduce their search space, as follows:

Step 2: determine the candidate network subsets, i.e. the GCLUSTER algorithm:

In one embodiment, the input is: the summary graph [symbol missing from the source text], the graph set size m, the minimum frequency support k, and the minimum Hamming distance threshold τ;

Output: the cluster centers C;

Step 1: for each edge e in the summary graph, set the weight of its edge vector v_e to w(v_e) = 1; place the edge vectors whose support-vector Hamming weight is k or k+1 into set A and the remaining edge vectors into set B; then merge the edge vectors within set A and within set B separately, deleting repeated edge vectors so that only one copy remains, and update the weight of the retained edge vector;

Step 2: for each edge vector v_e ∈ B do

move the edge vectors in set A whose Hamming weight is k into set B;

Step 3: sort the edge vectors in A in descending order of weight;
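Steps 1 and 3 above can be sketched in Python as follows. The sketch is illustrative and makes several assumptions: a graph is a set of edges, an edge vector is a 0/1 tuple with one entry per graph, the Hamming weight is the number of 1s in that tuple, and the weight of a merged edge vector is the number of edges that share it. Step 2 is not modeled because its wording in the source is unclear. The function names are hypothetical.

```python
from collections import Counter


def edge_support_vectors(graphs, summary_edges):
    """Map each summary edge to a 0/1 tuple: entry i is 1 if graph i has the edge."""
    return {e: tuple(int(e in g) for g in graphs) for e in summary_edges}


def split_and_merge(vectors, k):
    """Merge identical edge vectors, then split them by Hamming weight.

    Returns (ranked_A, B): ranked_A is a list of (vector, weight) pairs sorted
    by descending weight, B is a dict mapping each remaining vector to its weight.
    """
    weights = Counter(vectors.values())  # weight = number of edges with that vector
    A = {v: w for v, w in weights.items() if sum(v) in (k, k + 1)}
    B = {v: w for v, w in weights.items() if sum(v) not in (k, k + 1)}
    ranked_A = sorted(A.items(), key=lambda item: item[1], reverse=True)
    return ranked_A, B
```

Using a `Counter` performs the merge and the weight update in one pass, since every edge starts with weight 1.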

The algorithm above initializes the weights of the summary graph's edge vectors, performs a simple merge of the edge vectors, and updates the weights. After this step the remaining edge vectors are all distinct, and the weight of each edge vector is the number of edges in the summary graph that have exactly that edge vector. Next, the algorithm places the edge vectors whose Hamming weight equals the user-required frequency support into set A as seeds and the remaining edge vectors into set B, maps the edge vectors in B into A according to the criterion of maximum edge-vector similarity, and finally clusters the edge vectors in A to form the set of cluster centers, that is, the set of candidate network subsets.
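The mapping of B into A might look like the following sketch. The source does not define edge-vector similarity, so the sketch assumes that the most similar seed is the one at the smallest Hamming distance and that ties go to the first seed encountered; the threshold τ is not modeled. The function names and the dictionary layout are hypothetical.

```python
def hamming_distance(u, v):
    """Number of positions in which two 0/1 vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def map_to_seeds(seeds, others):
    """Attach every vector in `others` to its nearest seed.

    seeds, others: dicts mapping a 0/1 tuple to its weight (seeds must be non-empty).
    Returns {seed: [(vector, weight), ...]}, each list starting with the seed itself.
    """
    groups = {s: [(s, w)] for s, w in seeds.items()}
    for vec, w in others.items():
        nearest = min(groups, key=lambda s: hamming_distance(s, vec))
        groups[nearest].append((vec, w))
    return groups
```

Each resulting group plays the role of one cluster of edge vectors from which a cluster center is later derived.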

It remains to explain how a cluster center is formed once a set T is available. The value of the cluster center in the i-th graph is determined by comparing, over all edge vectors in T and their weights, the total weight of the vectors whose i-th entry is 0 with the total weight of those whose i-th entry is 1. For example, if the total weight of the 1s is larger, the cluster center takes the value 1 in the i-th graph; otherwise it takes the value 0, as shown in Fig. 3.
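This weighted vote can be written as a short Python sketch. It assumes that the set T is given as a list of (vector, weight) pairs with 0/1 tuples of equal length; the function name is hypothetical. Ties resolve to 0, following the "otherwise" branch of the description.

```python
def cluster_center(members):
    """Per-position weighted majority vote over (vector, weight) pairs."""
    length = len(members[0][0])
    center = []
    for i in range(length):
        ones = sum(w for vec, w in members if vec[i] == 1)
        zeros = sum(w for vec, w in members if vec[i] == 0)
        center.append(1 if ones > zeros else 0)
    return tuple(center)
```

For example, the members ((1, 1, 0, 0), weight 2) and ((1, 1, 1, 0), weight 1) give the center (1, 1, 0, 0), because in the third position the total weight of the 0s is 2 against 1 for the 1s.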

Step 3: obtain the set of summary graphs:
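The hand-over from the cluster centers to the new graph sets can be sketched as below. The sketch reads "consistent with a cluster-center vector" as "the graphs whose position in the vector is 1", and it takes the minimum number of 1s required of a center as a plain parameter; both are assumptions, and the function names are hypothetical.

```python
def drop_sparse_centers(centers, min_ones):
    """Keep only the cluster centers that contain at least min_ones ones."""
    return [c for c in centers if sum(c) >= min_ones]


def extract_graph_sets(graphs, centers):
    """For each center, collect the graphs whose position in the center is 1."""
    return [[g for g, bit in zip(graphs, c) if bit == 1] for c in centers]
```

Each extracted graph set would then be filtered in the same way as the initial graph set to produce its own summary graph.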

Step 4: search for dense subgraphs and determine the frequent dense vertex sets. A dense subgraph must satisfy two conditions: (1) the subgraph is a connected component; (2) the density of the subgraph exceeds the set density threshold.
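The two conditions on a dense subgraph can be checked with a sketch like the one below. The source does not define density, so the sketch assumes the usual ratio 2|E| / (n(n - 1)) for the subgraph induced by the vertex set, and it accepts a density equal to the threshold, following the "at least 0.9" wording of the embodiment. Edges are assumed to be a set of sorted vertex pairs; the function name is hypothetical.

```python
from collections import deque


def is_dense_subgraph(nodes, edges, delta):
    """True if the induced subgraph is connected and has density >= delta."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return False
    induced = {(u, v) for u, v in edges if u in nodes and v in nodes}
    adj = {x: set() for x in nodes}
    for u, v in induced:
        adj[u].add(v)
        adj[v].add(u)
    # Breadth-first search from an arbitrary vertex to test connectivity.
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in adj[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    density = 2 * len(induced) / (n * (n - 1))
    return len(seen) == n and density >= delta
```

With the edges ab, ad, bd, bc and cd, the vertex set {a, b, d} has density 1 and passes a threshold of 0.9, while {a, b, c, d} has density 5/6 and does not.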

Note that frequent dense vertex sets are merged as follows. First, for each frequent dense vertex set, a vector is built that records, for every graph of the original graph set, whether the density of the subgraph induced by the vertex set in that graph exceeds the given threshold. The sets are then merged according to the following three principles:

(1) identical frequent dense vertex sets are merged so that only one is retained;

(2) if two frequent dense vertex sets share more than about 85% of their elements, or one is a subset of the other, and their vectors are identical, the two sets are merged into one by taking their union;

(3) if two frequent dense vertex sets share more than about 85% of their elements, or one is a subset of the other, but their vectors differ, the two sets are not merged, and the frequent dense vertex set with more elements is split.
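Principles (1) and (2) can be sketched in Python as follows. The sketch is a simplification under stated assumptions: the shared fraction is measured against the larger of the two sets, merging is a single greedy pass rather than a repeated one, vertex sets are non-empty, and the splitting in principle (3) is not modeled because the source does not say how the split is made. The function names are hypothetical.

```python
def should_merge(a, b, vec_a, vec_b, ratio=0.85):
    """True if two vertex sets overlap enough and have identical vectors."""
    a, b = set(a), set(b)
    overlap = len(a & b) / max(len(a), len(b))
    similar = overlap >= ratio or a <= b or b <= a
    return similar and vec_a == vec_b


def merge_dense_sets(items, ratio=0.85):
    """Greedy merge of (vertex_set, vector) pairs by taking unions."""
    merged = []
    for nodes, vec in items:
        nodes = frozenset(nodes)
        for i, (other, other_vec) in enumerate(merged):
            if should_merge(nodes, other, vec, other_vec, ratio):
                merged[i] = (other | nodes, other_vec)
                break
        else:
            merged.append((nodes, vec))
    return merged
```

Pairs that overlap but carry different vectors are simply kept apart here, which covers the "not merged" half of principle (3).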

In the embodiments of the present invention, the user-defined parameter f takes values in [4, 10]; the parameter p takes values in [0.1, 0.2]; the parameter q is 0.334.


A person of ordinary skill in the art will understand that all or part of the steps of the method in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disc.

The foregoing describes only preferred embodiments of the present invention and does not limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for predicting gene function by frequent pattern mining on complex biological network sets, characterized in that the method comprises:

Step S11: obtain the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) formed by converting multiple gene expression datasets into biological networks, and determine the values of the minimum density threshold δ, the minimum frequency support threshold k and the user-defined parameters f, p and q; the initial graph set consists of multiple subgraphs G_i that all share the same vertex set, and any two different subgraphs G_i differ in at least one edge;

Step S12: determine, for every edge of every subgraph G_i, that the number of times it occurs in the initial graph set exceeds the smallest positive integer value of the product of the minimum frequency support threshold k and the graph set size m;

Step S14: construct a summary graph that has the same vertex set as each subgraph G_i of the initial graph set; every edge in this summary graph must satisfy the density-coefficient condition [formula missing from the source text];

Step S15: compute the edge clustering coefficient of every edge in the summary graph of the initial graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph;

Step S16: compare the edges of the updated summary graph with the edges of each subgraph G_i one by one, delete from each subgraph G_i the edges that are not present in the updated summary graph, and update the subgraph;

Step S17: repeat steps S13 to S16 until the edges of the updated summary graph of the initial graph set no longer change;

Step 2: determine the candidate network subsets:

Step S21: in the summary graph whose edges no longer change, assign a weight to the edge vector of every edge and determine the Hamming weight of each edge's support vector; place the edge vectors whose Hamming weight meets the screening condition into a set A, and those whose Hamming weight does not into a set B;

Step S22: merge the edge vectors within set A and within set B separately, deleting repeated edge vectors so that only one copy remains, and update the weight of the retained edge vector;

Step S23: set the seed vectors and adjust the edge vectors in set A and set B accordingly; a seed vector is the edge vector with the largest weight;

Step S24: following the criterion of maximum edge-vector similarity, map every edge vector in the adjusted set B into the adjusted set A; once the mapping is complete, cluster the edge vectors in set A to form the set of cluster centers;

Step S25: delete from the set of cluster centers every cluster center in which the number of 1s is less than ⌊k·m⌋, the product of k and the graph set size rounded down to an integer;

Step 3: obtain the set of summary graphs:

Step S31: according to the set of cluster centers thus formed, extract from the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) the subgraphs consistent with each vector in the set of cluster centers, forming multiple new graph sets;

Step S32: using the values determined for the minimum density threshold δ, the minimum frequency support threshold k and the user-defined parameters f, p and q, delete from each new graph set every edge whose density coefficient satisfies ED_e < δ/f;

Step S33: for each new graph set, construct a summary graph with the same vertex set; every edge in the summary graph of each new graph set must satisfy the density-coefficient condition [formula missing from the source text];

Step S34: compute the edge clustering coefficient of every edge in the summary graph of each new graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph;

Step S35: compare the edges of the updated summary graph of each new graph set with the edges of the corresponding new graph set one by one, delete from each new graph set the edges that are not present in its summary graph, and update the graph set;

Step S36: repeat steps S32 to S35 until the edges of the updated summary graph of each new graph set no longer change, which yields the set of summary graphs;

Step 4: search for dense subgraphs and determine the frequent dense vertex sets:

Step S41: in the set of summary graphs thus obtained, and with reference to the updated summary graph of the initial graph set, search for the dense subgraphs whose edge sets are consistent with that summary graph; determine the frequent dense vertex sets from the dense subgraphs found, merge the frequent dense vertex sets so determined, and output them as co-expressed gene clusters.

2. The method according to claim 1, characterized in that the user-defined parameter f takes values in [4, 10]; the parameter p takes values in [0.1, 0.2]; the parameter q is 0.334.