CN106295249A - Method for predicting gene function by frequent pattern mining on a set of complex biological networks - Google Patents
- Publication number
- CN106295249A (application CN201610648165.9A)
- Authority
- CN
- China
- Prior art keywords
- atlas
- limit
- edge
- vector
- subgraph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Genetics & Genomics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioethics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
An embodiment of the invention discloses a method for predicting gene function by mining frequent patterns from a set of complex biological networks. The method first obtains an initial graph set, formed by converting multiple gene expression data sets into biological networks, and coarsely filters it: irrelevant edges that contribute nothing to finding frequent dense vertex sets are deleted, which yields a summary graph. Possible candidate network subsets are then found on the basis of the summary graph. The method then returns to the initial graph set, extracts each candidate network subset, and coarsely filters the extracted graph sets again to obtain a summary graph set. Finally, a dense-subgraph search is run on each summary graph, and the resulting frequent dense vertex sets are taken as co-expressed gene clusters. Implementing the invention reduces computational complexity, improves the accuracy and efficiency of frequent pattern mining, and addresses the pattern overlap problem.
Description
Technical field
The present invention relates to the technical field of systems biology research, and in particular to a method for predicting gene function by frequent pattern mining on a set of complex biological networks.
Background art
With the advent of biochip technology and the development of bioinformatics, genome sequence analysis has identified a large number of genes of unknown function. How to assign functions to these genes systematically and scientifically is a difficult problem for scientists in the post-genomic era, and the arrival of large-scale gene expression data offers a new route to solving it.
In addition, within an organism a gene often carries out a function jointly with other genes, and such genes tend to have similar expression profiles. Mining clusters of co-expressed genes from gene expression profile data is therefore of considerable research value in biology (for example, to predict or infer the function of an unknown gene). However, because high-throughput techniques are noisy and biological systems are themselves complex, the biological networks converted from microarray data contain a great deal of irrelevant "noise", and it is precisely this noise that makes co-expressed gene clusters hard to find. If the irrelevant noise could be removed step by step, finding conserved co-expressed gene groups would become much simpler.
In the prior art, methods for predicting gene function by frequent pattern mining on a set of complex biological networks fall into the following categories. (1) Breadth-first methods use an Apriori-like property to enumerate recurring subgraphs; the main representatives are AGM and FSG. AGM searches for all "induced" subgraphs in the graph set, where the node set of an induced subgraph G' of a graph G satisfies V(G') ⊆ V(G) and the edges of G' are all the edges of G between nodes of V(G'). FSG instead finds all frequent connected subgraphs in the graph set by growing them one edge at a time. (2) Depth-first methods, mainly represented by gSpan, CloseSpan and FFSM, obtain frequent subgraphs by extending frequent edges step by step; the algorithms differ mainly in how the graph is extended. (3) Heuristic methods based on a summary graph, mainly represented by CODENSE and NeMo, first integrate the information of every network in the network set, then convert the problem into frequent pattern mining in a single graph, and finally return to the original network set.
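The induced-subgraph definition used by AGM can be illustrated with a minimal Python sketch. The function name `induced_subgraph` and the toy graph are illustrative assumptions, not part of AGM or of the patent; edges are treated as undirected.

```python
def induced_subgraph(edges, nodes):
    """Return the edges of the subgraph of G induced by `nodes`.

    `edges` is an iterable of 2-tuples (undirected); an edge is kept
    only when both of its endpoints lie in `nodes`.
    """
    keep = set(nodes)
    return {frozenset(e) for e in edges if set(e) <= keep}

# Toy graph G (hypothetical data): a triangle a-b-c plus a pendant edge c-d.
G = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
sub = induced_subgraph(G, {"a", "b", "c"})
```

Here `sub` holds exactly the three triangle edges: the edge c-d is dropped because d is outside the chosen node set.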
However, the inventors found that each of these methods has shortcomings: (1) the first kind of method has very high computational complexity; (2) the second kind also has very high computational complexity; (3) in the third kind, the quality of the summary graph directly determines the complexity of the algorithm, and pattern overlap increases the complexity of the problem further. Moreover, to stay within the size limit of spectral decomposition (about 2000 nodes), the third kind of method can only process the summary graph in blocks, which raises the question of whether the partition is reasonable.
Summary of the invention
The purpose of the embodiments of the invention is to provide a method for predicting gene function by frequent pattern mining on a set of complex biological networks that reduces computational complexity, improves the accuracy and efficiency of frequent pattern mining, and addresses the pattern overlap problem.
To solve the above technical problem, an embodiment of the invention provides a method for predicting gene function by frequent pattern mining on a set of complex biological networks, the method comprising:
First step: find the irrelevant edges that contribute nothing to frequent dense vertex sets:
Step S11: obtain the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) formed by converting multiple gene expression data sets into biological networks, and determine the values of the minimum density threshold δ, the minimum frequency support threshold k and the user-defined parameters f, p and q; the initial graph set consists of multiple subgraphs G_i that all share the same vertex set, and any two different subgraphs G_i differ in at least one edge;
Step S12: determine that the number of times each edge of each subgraph G_i occurs in the initial graph set is greater than the smallest positive integer value of the product of the minimum frequency support threshold k and the graph set size m;
Step S13: in each subgraph G_i of the initial graph set, delete the edges whose density coefficient satisfies ED_e < δ/f;
Step S14: construct a summary graph having the same vertex set as each subgraph G_i of the initial graph set, in which every edge must satisfy the density-coefficient condition [formula missing in source];
Step S15: compute the edge clustering coefficient of every edge in the summary graph of the initial graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph;
Step S16: compare the updated summary graph of the initial graph set with the edges of each subgraph G_i one by one, delete from each subgraph G_i the edges that are absent from the updated summary graph, and update the subgraphs;
Step S17: repeat steps S13 to S16 until the edges of the updated summary graph of the initial graph set no longer change;
Second step: determine the candidate network subsets:
Step S21: assign a weight to the edge vector of every edge in the summary graph whose edges no longer change, determine the Hamming value of the edge support vector of each weighted edge, and place the edge vectors whose Hamming value meets the screening condition in set A and those whose Hamming value does not in set B;
Step S22: merge the edge vectors within set A and within set B separately: duplicate edge vectors are deleted so that only one copy remains, and the weight of the remaining edge vector is updated;
Step S23: set a seed vector and adjust the edge vectors in set A and set B according to it, the seed vector being the edge with the largest weight;
Step S24: according to the criterion of maximum edge-vector similarity, map all edge vectors of the adjusted set B into the adjusted set A, and once the mapping is complete, cluster the edge vectors of set A to form a set of cluster centres;
Step S25: delete from the set of cluster centres every cluster centre in which the number of 1s is less than ⌊k·m⌋, the product of k and the graph set size rounded down;
Third step: obtain the summary graph set:
Step S31: according to the set of cluster centres formed, extract from the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) the subgraphs consistent with each vector in the set of cluster centres, forming multiple new graph sets;
Step S32: using the determined values of the minimum density threshold δ, the minimum frequency support threshold k and the user-defined parameters f, p and q, delete from each new graph set the edges whose density coefficient satisfies ED_e < δ/f;
Step S33: for each new graph set, construct a summary graph having the same vertex set, in which every edge must satisfy the density-coefficient condition [formula missing in source];
Step S34: compute the edge clustering coefficient of every edge in the summary graph of each new graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph;
Step S35: compare the updated summary graph of each new graph set with the edges of the corresponding new graph set one by one, delete from each new graph set the edges that are absent from its summary graph, and update the graph set;
Step S36: repeat steps S32 to S35 until the edges of the updated summary graph of each new graph set no longer change, which yields the summary graph set;
Fourth step: search for dense subgraphs and determine the frequent dense vertex sets:
Step S41: in the summary graph set obtained, using the updated summary graph of the initial graph set, search for the dense subgraphs whose edge sets are consistent with that summary graph, determine the frequent dense vertex sets from the dense subgraphs found, merge the determined frequent dense vertex sets, and output them as co-expressed gene clusters.
The user-defined parameter f takes values in [4, 10], the parameter p takes values in [0.1, 0.2], and the parameter q is 0.334.
Implementing the embodiments of the invention has the following beneficial effects:
The embodiments of the invention first coarsely filter the initial graph set, deleting the irrelevant edges that contribute nothing to finding frequent dense vertex sets, to obtain a summary graph. Possible candidate network subsets are then found on the basis of the summary graph. The method then returns to the initial graph set, extracts each candidate network subset, and coarsely filters the extracted graph sets again to obtain the summary graph set. Finally, a dense-subgraph search is run on each summary graph, and the resulting frequent dense vertex sets are taken as co-expressed gene clusters. This reduces computational complexity, improves the accuracy and efficiency of frequent pattern mining, and addresses the pattern overlap problem.
Brief description of the drawings
To describe the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for that description are briefly introduced below. The drawings described below are merely some embodiments of the invention; other drawings that a person of ordinary skill in the art derives from them without creative effort still fall within the scope of the invention.
Fig. 1 is a block diagram of the working principle of the method for predicting gene function by frequent pattern mining on a set of complex biological networks provided by an embodiment of the invention;
Fig. 2 is an application scenario diagram of the first step of the method, which finds the irrelevant edges that contribute nothing to frequent dense vertex sets;
Fig. 3 is an application scenario diagram of the second step of the method, which determines the candidate network subsets.
Detailed description of the invention
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.
Building on the CODENSE and NeMo algorithms, the invention adopts a strategy of stepwise refinement from summary graph to candidate network subsets to find co-expressed groups that are conserved across multiple data sets. For ease of description, the problem of finding conserved co-expressed groups across multiple data sets is restated as the graph-theoretic problem of finding frequent dense vertex sets in multiple graph sets. In this setting, a noise edge is an edge for which the relationship between the two genes it represents interferes with finding co-expressed gene clusters; noise edges also include the true-positive or false-positive edges between genes that arise from the experimental data itself.
The inventors found that the key to solving the problem is to find these irrelevant noise edges and remove them. Intuitively, noise edges can be characterised as follows:
(1) if an edge e is only sparsely connected to its adjacent edges, it contributes nothing to finding frequent dense vertex sets and is a noise edge;
(2) if an edge e shows strong connectivity with its neighbouring edges in only a few graphs, it cannot meet the frequency requirement, so it contributes nothing to finding frequent dense vertex sets and is a noise edge;
(3) if an edge e is a "bridge" connecting two dense subgraphs in the summary graph, it contributes nothing to finding frequent dense vertex sets and is a noise edge;
(4) if an edge e does not occur in the summary graph but does occur in some graphs of the original graph set, it also contributes nothing to finding frequent dense vertex sets;
(5) if a vertex set V' is a frequent dense vertex set that occurs in certain graphs of the original graph set, then the subgraphs induced by V' in the remaining graphs of the original graph set contribute nothing to the continued search for other frequent dense vertex sets.
Therefore, to delete these five kinds of noise edges and mine the frequent dense vertex sets, in the embodiments of the invention the inventors first propose finding the irrelevant edges that contribute nothing to frequent dense vertex sets, which deletes the first four kinds of noise edges. Next, edge-vector clustering deletes the fifth kind of noise edge and forms the candidate network subsets. The method then returns to the initial graph set, extracts each candidate network subset, and coarsely filters the extracted graph sets again to obtain the summary graph set. Finally, a dense-subgraph search is run on each summary graph, giving the frequent dense vertex sets as co-expressed gene clusters. The working principle is shown in Fig. 1.
In summary, the method proposed by the inventors for predicting gene function by frequent pattern mining on a set of complex biological networks specifically comprises:
First step: find the irrelevant edges that contribute nothing to frequent dense vertex sets, i.e. the FILTER algorithm:
Step S11: obtain the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) formed by converting multiple gene expression data sets into biological networks, and determine the values of the minimum density threshold δ, the minimum frequency support threshold k and the user-defined parameters f, p and q; the initial graph set consists of multiple subgraphs G_i that all share the same vertex set, and any two different subgraphs G_i differ in at least one edge;
Step S12: determine that the number of times each edge of each subgraph G_i occurs in the initial graph set is greater than the smallest positive integer value of the product of the minimum frequency support threshold k and the graph set size m;
Step S13: in each subgraph G_i of the initial graph set, delete the edges whose density coefficient satisfies ED_e < δ/f. This step mainly deletes the edges in each subgraph G_i that are only sparsely connected to the edges around them, and is designed to avoid deleting relevant edges;
Step S14: construct a summary graph having the same vertex set as each subgraph G_i of the initial graph set, in which every edge must satisfy the density-coefficient condition [formula missing in source]. This step mainly builds the summary graph of the initial graph set by extracting the edges that meet the frequency requirement and placing them in the summary graph; here 0 < p < 1, which mainly prevents the deletion of relevant edges that contribute to finding frequent dense vertex sets;
Step S15: compute the edge clustering coefficient of every edge in the summary graph of the initial graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph. This step mainly deletes the edges that sparsely connect two dense subgraphs in the summary graph of the initial graph set;
Step S16: compare the updated summary graph of the initial graph set with the edges of each subgraph G_i one by one, delete from each subgraph G_i the edges that are absent from the updated summary graph, and update the subgraphs. This step mainly updates the initial graph set according to its summary graph, so that the updated graph set has already had some of its irrelevant edges filtered out, which makes the frequent dense vertex sets easier to find;
Step S17: repeat steps S13 to S16 until the edges of the updated summary graph of the initial graph set no longer change;
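The iterate-until-stable structure of steps S14 to S17 can be sketched in Python as below. This is an illustration under stated assumptions, not the patented FILTER algorithm: the density coefficient ED_e of step S13 and the exact summary-graph condition of step S14 are not given in this text, so step S13 is omitted and step S14 is replaced by a plain occurrence count `min_count`; the edge clustering coefficient is assumed to be the number of common neighbours of the two endpoints divided by the largest possible number of such neighbours. All function and variable names are hypothetical.

```python
from collections import Counter

def adjacency(edges):
    """Neighbour sets for an undirected edge set (edges are 2-element frozensets)."""
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def edge_clustering(edge, adj):
    """ASSUMED definition of EC_e: shared neighbours / maximum possible shared neighbours."""
    u, v = tuple(edge)
    possible = min(len(adj[u]), len(adj[v])) - 1
    if possible <= 0:
        return 0.0
    return len(adj[u] & adj[v]) / possible

def filter_graph_set(graphs, min_count, q):
    """Repeat the S14-S16 analogue until the summary graph stops changing (S17)."""
    graphs = [set(g) for g in graphs]
    previous = None
    while True:
        # S14 analogue (assumption): keep edges occurring in >= min_count graphs.
        support = Counter(e for g in graphs for e in g)
        summary = {e for e, c in support.items() if c >= min_count}
        # S15: drop summary-graph edges whose clustering coefficient is below q.
        adj = adjacency(summary)
        summary = {e for e in summary if edge_clustering(e, adj) >= q}
        # S16: remove from every graph the edges absent from the summary graph.
        graphs = [g & summary for g in graphs]
        if summary == previous:
            return summary, graphs
        previous = summary

def E(*pairs):
    """Helper: build an edge set from two-letter strings such as "ab"."""
    return {frozenset(p) for p in pairs}

# Toy data: two triangles joined by a bridge c-d, plus one stray edge a-f.
tri1, tri2 = E("ab", "bc", "ac"), E("de", "ef", "df")
graphs = [tri1 | tri2 | E("cd"), tri1 | tri2 | E("cd"), tri1 | E("af")]
summary, pruned = filter_graph_set(graphs, min_count=2, q=0.334)
```

In this toy run the stray edge a-f fails the occurrence count and the bridge c-d has no common neighbours, so both are removed and only the two triangles remain in the summary graph. The loop terminates because the graphs can only lose edges from one pass to the next.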
In one embodiment, Fig. 2 shows how a graph set made up of four graphs with the same vertex set changes as the procedure runs. Suppose the aim is to find the frequent dense vertex sets in this graph set whose frequency support is at least 2 and whose density is at least 0.9. Clearly, the vertex sets {a, b, d}, {b, c, d} and {e, f, g, h} meet the requirement. Here f is 4 and p is 0.8; in each graph of the figure, a solid line denotes an edge actually present in that graph and a dotted line denotes an edge to be deleted in the preceding step.
As Fig. 2 shows, each updated subgraph contains fewer noise edges than it did in the initial graph set. However, not all frequent dense vertex sets can be extracted directly from the dense subgraphs of the summary graph in one pass: for example, {a, b, c, d} in the summary graph actually represents two vertex sets, {a, b, d} and {b, c, d}. This is because a frequent dense vertex set tends to occur only in certain graphs of the graph set. So if it were known in which graphs a frequent dense vertex set occurs, a summary graph could be built from just those graphs, and the frequent dense vertex set would be easy to extract. In theory, for a graph set of size m and a frequent dense vertex set with support k, the search space is [formula missing in source]. For a graph set of size 20 and dense vertex sets required to have frequency support 6, the search space is [formula missing in source] possible candidate network subsets, which is clearly infeasible in practice.
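The exact expressions for the search space are not reproduced in this text. As a rough sense of scale only, and assuming the search space is the number of ways to choose which graphs a pattern occurs in, the two most natural readings (exactly k graphs, or at least k graphs) can be evaluated for m = 20 and k = 6:

```python
from math import comb

m, k = 20, 6

# Reading 1 (assumption): subsets of exactly k graphs.
exactly_k = comb(m, k)

# Reading 2 (assumption): subsets of at least k graphs.
at_least_k = sum(comb(m, i) for i in range(k, m + 1))

print(exactly_k)   # 38760
print(at_least_k)  # 1026876
```

Under either reading, every candidate subset would need its own summary graph and dense-subgraph search, which is why exhaustive enumeration is impractical.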
It is therefore necessary to determine the possible candidate network subsets and so reduce the search space, as follows:
Second step: determine the candidate network subsets, i.e. the GCLUSTER algorithm:
Step S21: assign a weight to the edge vector of every edge in the summary graph whose edges no longer change, determine the Hamming value of the edge support vector of each weighted edge, and place the edge vectors whose Hamming value meets the screening condition in set A and those whose Hamming value does not in set B;
Step S22: merge the edge vectors within set A and within set B separately: duplicate edge vectors are deleted so that only one copy remains, and the weight of the remaining edge vector is updated;
Step S23: set a seed vector and adjust the edge vectors in set A and set B according to it, the seed vector being the edge with the largest weight;
Step S24: according to the criterion of maximum edge-vector similarity, map all edge vectors of the adjusted set B into the adjusted set A, and once the mapping is complete, cluster the edge vectors of set A to form a set of cluster centres;
Step S25: delete from the set of cluster centres every cluster centre in which the number of 1s is less than ⌊k·m⌋, the product of k and the graph set size rounded down;
In one embodiment, the input is: the summary graph [symbol missing in source], the graph set size m, the minimum frequency support k and the minimum Hamming distance threshold τ;
Output: the cluster centres C;
Step 1: for each edge e in the summary graph, set the weight of its edge vector v_e to w(v_e) = 1; place in set A the edge vectors whose edge support vector has Hamming value k or k+1, and place the remaining ones in set B; then merge the edge vectors within set A and within set B separately, deleting duplicates so that only one copy remains and updating the weight of the remaining edge vector;
Step 2: for each edge v_e ∈ B do
move the edge vectors in set A whose Hamming value is k to set B;
Step 3: sort the edge vectors in A in descending order of weight;
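Steps 1 and 3 of this listing can be sketched in Python as follows. The sketch assumes that the edge support vector of an edge has one bit per graph (1 when the graph contains the edge) and that its Hamming value is the number of 1s; Step 2 is left out because its loop body is unclear in this text. Function names and the toy data are hypothetical.

```python
from collections import Counter

def edge_vectors(summary_edges, graphs):
    """Support vector of each summary-graph edge: bit i is 1 if graph i contains it."""
    return {e: tuple(int(e in g) for g in graphs) for e in summary_edges}

def merge_and_split(vectors, k):
    """Merge identical vectors (weight = number of edges sharing the vector),
    then put vectors with k or k+1 ones in A and the rest in B."""
    weights = Counter(vectors.values())
    set_a = {v: w for v, w in weights.items() if sum(v) in (k, k + 1)}
    set_b = {v: w for v, w in weights.items() if v not in set_a}
    return set_a, set_b

def E(*pairs):
    """Helper: build an edge set from two-letter strings such as "ab"."""
    return {frozenset(p) for p in pairs}

# Hypothetical graph set of size m = 4 and its summary-graph edges.
graphs = [E("ab", "bc", "cd"), E("ab", "bc", "cd"), E("cd", "de"), E("cd")]
vectors = edge_vectors(E("ab", "bc", "cd", "de"), graphs)
set_a, set_b = merge_and_split(vectors, k=2)

# Step 3: order the vectors in A by descending weight.
ordered_a = sorted(set_a, key=set_a.get, reverse=True)
```

Edges a-b and b-c share the vector (1, 1, 0, 0), so after merging that vector carries weight 2 and, with k = 2, lands in set A; the vectors of c-d and d-e go to set B.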
The algorithm above initialises the weights of the edge vectors of the summary graph, then merges the edge vectors and updates the weights. After this step the remaining edge vectors are pairwise distinct, and the weight of each edge vector is the number of edges in the summary graph that have exactly that edge vector. In the following steps, the algorithm first places the edge vectors whose Hamming value equals the frequency support required by the user in a set A as seeds and the remaining edge vectors in another set B, then maps the edge vectors in B into A according to the criterion of maximum edge-vector similarity, and finally clusters the edge vectors in A to form the set of cluster centres, that is, the set of candidate network subsets.
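The mapping of set B onto the seeds in set A can be sketched as a nearest-seed assignment. The similarity measure is not defined in this text; the sketch assumes it is the number of positions in which two 0/1 vectors agree (equivalently, m minus their Hamming distance). The names `similarity` and `map_to_seeds` are hypothetical.

```python
def similarity(u, v):
    """ASSUMED similarity: number of positions where two 0/1 vectors agree."""
    return sum(a == b for a, b in zip(u, v))

def map_to_seeds(set_a, set_b):
    """Assign every vector of B to its most similar seed in A.

    Returns a dict: seed -> list of vectors grouped with it (the seed itself first).
    Ties go to the seed that comes first in `set_a`.
    """
    groups = {a: [a] for a in set_a}
    for b in set_b:
        best = max(set_a, key=lambda a: similarity(a, b))
        groups[best].append(b)
    return groups

# Hypothetical seeds and remaining vectors over m = 4 graphs.
seeds = [(1, 1, 0, 0), (0, 0, 1, 1)]
others = [(1, 1, 1, 0), (0, 1, 1, 1), (1, 0, 0, 0)]
groups = map_to_seeds(seeds, others)
```

Each group produced this way plays the role of one cluster from which a cluster centre is then derived.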
It remains to explain how a cluster centre is formed once a set T is available. The value of the cluster centre in the i-th graph is decided by comparing, over all edge vectors in T and their weights, the sum of the weights of the vectors with a 0 in the i-th graph against the sum of the weights of those with a 1. If the weights of the 1s sum to more, the cluster centre takes the value 1 in the i-th graph; otherwise it takes 0, as shown in Fig. 3.
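The weighted vote described above, together with the filter of step S25, can be written as a small Python sketch. Function names and the sample vectors are hypothetical; k is treated as a fraction of the graph set size here, as the product ⌊k·m⌋ in step S25 suggests.

```python
from math import floor

def cluster_centre(vectors, weights):
    """Per-position weighted vote: 1 if the weights of the 1s outweigh the 0s, else 0."""
    centre = []
    for i in range(len(vectors[0])):
        ones = sum(w for v, w in zip(vectors, weights) if v[i] == 1)
        zeros = sum(w for v, w in zip(vectors, weights) if v[i] == 0)
        centre.append(1 if ones > zeros else 0)
    return tuple(centre)

def drop_infrequent(centres, k, m):
    """Step S25: keep only centres whose number of 1s is at least floor(k * m)."""
    return [c for c in centres if sum(c) >= floor(k * m)]

# Hypothetical set T with weights 3, 1 and 1.
T = [(1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0)]
centre = cluster_centre(T, [3, 1, 1])
kept = drop_infrequent([centre, (1, 0, 0, 0)], k=0.5, m=4)
```

In the second position the 1s carry weight 4 against 1, so the centre is 1 there; in the third position they carry weight 1 against 4, so the centre is 0. A tie resolves to 0, matching the "otherwise" branch of the rule.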
Third step: obtain the summary graph set:
Step S31: according to the set of cluster centres formed, extract from the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) the subgraphs consistent with each vector in the set of cluster centres, forming multiple new graph sets;
Step S32: using the determined values of the minimum density threshold δ, the minimum frequency support threshold k and the user-defined parameters f, p and q, delete from each new graph set the edges whose density coefficient satisfies ED_e < δ/f;
Step S33: for each new graph set, construct a summary graph having the same vertex set, in which every edge must satisfy the density-coefficient condition [formula missing in source];
Step S34: compute the edge clustering coefficient of every edge in the summary graph of each new graph set, delete the edges whose edge clustering coefficient satisfies EC_e < q, and update the summary graph;
Step S35: compare the updated summary graph of each new graph set with the edges of the corresponding new graph set one by one, delete from each new graph set the edges that are absent from its summary graph, and update the graph set;
Step S36: repeat steps S32 to S35 until the edges of the updated summary graph of each new graph set no longer change, which yields the summary graph set;
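Step S31 can be sketched in a few lines of Python, assuming that "consistent with a cluster-centre vector" means taking the graphs at the positions where the vector is 1. That reading, the function name and the placeholder graph labels are assumptions.

```python
def extract_graph_sets(graphs, centres):
    """For each 0/1 cluster centre, collect the graphs at the positions where it is 1."""
    return [[g for g, bit in zip(graphs, c) if bit] for c in centres]

# Placeholder labels stand in for the graphs G_1..G_4 of the initial graph set.
graphs = ["G1", "G2", "G3", "G4"]
centres = [(1, 1, 0, 0), (0, 1, 1, 1)]
new_graph_sets = extract_graph_sets(graphs, centres)
```

Each resulting graph set is then filtered on its own in steps S32 to S36, in the same way as the initial graph set was filtered in the first step.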
Fourth step: search for dense subgraphs and determine the frequent dense vertex sets. A dense subgraph must satisfy two conditions: (1) it is a connected component; (2) its density exceeds the set density threshold. Specifically:
Step S41: in the summary graph set obtained, using the updated summary graph of the initial graph set, search for the dense subgraphs whose edge sets are consistent with that summary graph, determine the frequent dense vertex sets from the dense subgraphs found, merge the determined frequent dense vertex sets, and output them as co-expressed gene clusters.
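The two conditions for a dense subgraph (connected component, density above the threshold) can be sketched in Python as follows. Density is assumed to be the number of edges divided by the number of possible vertex pairs; the function names and the toy summary graph are hypothetical.

```python
def connected_components(edges):
    """Vertex sets of the connected components of an undirected edge set."""
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

def dense_components(edges, delta):
    """Components whose density (edges / possible pairs, assumed) exceeds delta."""
    result = []
    for comp in connected_components(edges):
        n = len(comp)
        inside = sum(1 for e in edges if e <= comp)
        if n > 1 and inside / (n * (n - 1) / 2) > delta:
            result.append(comp)
    return result

def E(*pairs):
    """Helper: build an edge set from two-letter strings such as "ab"."""
    return {frozenset(p) for p in pairs}

# Hypothetical summary graph: a triangle a-b-c and a path d-e-f.
summary = E("ab", "bc", "ac", "de", "ef")
dense = dense_components(summary, delta=0.9)
```

The triangle has density 1.0 and is returned; the path has density 2/3 and is rejected.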
It should be noted that frequent dense vertex sets are merged as follows. First, for each frequent dense vertex set, a vector is built according to whether the density of its induced subgraph in each graph of the original graph set exceeds the given threshold. The merge then follows three rules:
(1) identical frequent dense vertex sets are merged and only one is kept;
(2) if two frequent dense vertex sets share more than about 85% of their elements, or one is a subset of the other, and their vectors are identical, the two are merged into one by taking their union;
(3) if two frequent dense vertex sets share more than about 85% of their elements, or one is a subset of the other, but their vectors differ, the two are not merged, and the frequent dense vertex set with more elements is split
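Rules (1) and (2) can be sketched for a single pair of vertex sets as below. The text does not say how the 85% overlap is measured; the sketch assumes it is the number of shared elements relative to the smaller set, so that a subset always has overlap 1. The splitting in rule (3) is not detailed in this text, so the sketch only reports that the pair is not merged. All names are hypothetical.

```python
def overlap(s, t):
    """ASSUMED overlap measure: shared elements relative to the smaller set."""
    return len(s & t) / min(len(s), len(t))

def merge_pair(s, vec_s, t, vec_t, threshold=0.85):
    """Return the merged vertex set, or None when the pair must stay separate."""
    if s == t:
        return set(s)          # rule (1): identical sets, keep one copy
    if overlap(s, t) > threshold and vec_s == vec_t:
        return s | t           # rule (2): same vector, take the union
    return None                # rule (3) or unrelated sets: no merge

s, t = {"a", "b", "c"}, {"a", "b", "c", "d"}
merged = merge_pair(s, (1, 1, 0), t, (1, 1, 0))
not_merged = merge_pair(s, (1, 1, 0), t, (1, 0, 1))
```

Because `s` is a subset of `t`, the pair merges into the union when the two vectors agree and stays separate when they differ.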
In the embodiments of the invention, the user-defined parameter f takes values in [4, 10], the parameter p takes values in [0.1, 0.2], and the parameter q is 0.334.
Implementing the embodiments of the present invention has the following beneficial effects:
In the embodiments of the present invention, the initial graph set is first coarsely filtered: irrelevant edges that contribute
nothing to finding frequent dense vertex sets are deleted to obtain a summary graph. Possible candidate network subsets are then
found on the basis of the summary graph, and the candidate network subsets are extracted from the initial graph set.
Each extracted graph set is coarsely filtered again to obtain the summary graph set. Finally, a dense-subgraph
search is performed on each summary graph, and the frequent dense vertex sets obtained are taken as co-expressed gene clusters.
This reduces computational complexity, improves the accuracy and efficiency of frequent pattern mining, and resolves the pattern overlap problem.
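One pass of the coarse filtering described above might look like the sketch below. It covers only the edge-support part (an edge is kept if it occurs in enough graphs of the set); the density and edge clustering coefficient filters are omitted because their formulas are not reproduced in this text. The names `Edge`, `edge`, `frequent_edges` and `coarse_filter` are hypothetical, and reading the support requirement as "at least ceil(k * m) occurrences" is an assumption.

```python
import math
from collections import Counter
from typing import FrozenSet, List, Set

Edge = FrozenSet[str]  # undirected edge stored as an unordered vertex pair


def edge(u: str, v: str) -> Edge:
    return frozenset((u, v))


def frequent_edges(graphs: List[Set[Edge]], k: float) -> Set[Edge]:
    """Edges that occur in at least ceil(k * m) of the m graphs (assumed reading)."""
    m = len(graphs)
    min_support = math.ceil(k * m)
    counts = Counter(e for g in graphs for e in g)
    return {e for e, c in counts.items() if c >= min_support}


def coarse_filter(graphs: List[Set[Edge]], k: float) -> List[Set[Edge]]:
    """Restrict every graph to the frequent edges (a single filtering pass)."""
    keep = frequent_edges(graphs, k)
    return [g & keep for g in graphs]
```

All graphs share one vertex set, so each graph is represented here by its edge set alone.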
A person of ordinary skill in the art will understand that all or part of the steps of the method in the above embodiments can be
carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium,
such as ROM/RAM, a magnetic disk, or an optical disc.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification,
equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (2)
1. A gene function prediction method based on frequent pattern mining in a set of complex biological networks, characterized in that the
prediction method comprises:
Step 1, delete the irrelevant edges that contribute nothing to finding frequent dense vertex sets:
Step S11, obtain the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m) formed by converting multiple gene expression datasets into biological networks,
and determine the values of the minimum density threshold δ, the minimum frequent support threshold k, and the user-defined parameters f, p and q;
wherein the initial graph set consists of multiple subgraphs G_i that all share the same vertex set, and any two different subgraphs G_i
differ in at least one edge;
Step S12, determine that the number of times each edge of each subgraph G_i occurs in the initial graph set is greater than the minimum
positive integer value of the product of the minimum frequent support threshold k and the graph-set size m;
Step S13, delete from each subgraph G_i of the initial graph set the edges that satisfy density coefficient ED_e < δ/f;
Step S14, construct a summary graph that has the same vertex set as each subgraph G_i of the initial graph set, where every edge in the
summary graph corresponding to the initial graph set must satisfy the density coefficient condition [formula missing];
Step S15, compute the edge clustering coefficient of every edge in the summary graph corresponding to the initial graph set, delete from
that summary graph the edges whose edge clustering coefficient EC_e < q, and update the summary graph;
Step S16, compare the updated summary graph of the initial graph set one-to-one with the edges of each subgraph G_i,
delete from each subgraph G_i the edges that are not present in the updated summary graph, and update the subgraphs;
Step S17, repeat steps S13 to S16 until the edges in the updated summary graph of the initial graph set no longer
change;
Step 2, determine the candidate network subsets:
Step S21, assign a weight to the edge vector of every edge in the summary graph whose edges no longer change, determine
the Hamming value of the edge support vector corresponding to each weighted edge, and place the edge vectors whose Hamming value satisfies the screening condition
into set A and the edge vectors whose Hamming value does not satisfy the screening condition into set B;
Step S22, merge the edge vectors within set A and within set B separately: duplicate edge vectors are deleted so that
only one is retained, and the weight of the retained edge vector is updated;
Step S23, set seed vectors, and adjust the edge vectors in set A and set B according to the seed vectors that were set;
wherein a seed vector is the edge with the largest weight;
Step S24, according to the maximum edge-vector similarity criterion, map the edge vectors in the adjusted set B onto
the adjusted set A, and, once the mapping is complete, cluster the edge vectors in the mapped set A
to form a cluster-centre set;
Step S25, delete from the cluster-centre set every cluster centre in which the number of 1s is less than the product of k and the graph-set size, rounded down;
Step 3, obtain the summary graph set:
Step S31, according to the cluster-centre set formed, extract from the initial graph set D = {G_i = (V, E_i)} (1 ≤ i ≤ m)
the subgraphs consistent with each vector in the cluster-centre set, forming multiple new graph sets;
Step S32, according to the determined values of the minimum density threshold δ, the minimum frequent support threshold k, and the user-defined parameters
f, p and q, delete from each new graph set the edges that satisfy density coefficient ED_e < δ/f;
Step S33, construct for each new graph set a summary graph with the same vertex set, where every edge in the summary graph of each
new graph set must satisfy the density coefficient condition [formula missing];
Step S34, compute the edge clustering coefficient of every edge in the summary graph of each new graph set, delete from each
summary graph the edges whose edge clustering coefficient EC_e < q, and update the summary graph;
Step S35, compare the updated summary graph of each new graph set one-to-one with the edges of the corresponding new graph set,
delete from each new graph set the edges that are not present in its summary graph, and update the graph set;
Step S36, repeat steps S32 to S35 until the edges in the updated summary graph of each new graph set no longer
change, which yields the summary graph set;
Step 4, search for dense subgraphs and determine the frequent dense vertex sets:
Step S41, in the summary graph set obtained, according to the summary graph corresponding to the updated initial graph set, search for
the dense subgraphs whose edge sets are consistent with that summary graph; determine the frequent dense vertex sets
from the dense subgraphs found; and, after further merging the determined frequent dense vertex sets, output them as co-expressed
gene clusters.
2. The prediction method as claimed in claim 1, characterized in that the user-defined parameter f takes values in [4,
10]; the parameter p takes values in [0.1, 0.2]; and the parameter q is 0.334.
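The edge-vector handling in the second step of claim 1 can be illustrated with the sketch below. It is written under several assumptions: an edge support vector is taken to be a 0/1 vector with one entry per graph, the Hamming value is taken to be the number of 1s, the weight of a merged vector is taken to be the number of edges that share it, and the screening condition is modelled as a plain threshold because the claim does not state it. All function names are hypothetical, and the adjustment, mapping and clustering of steps S23 and S24 are not sketched.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = FrozenSet[str]
Vector = Tuple[int, ...]


def support_vectors(graphs: List[Set[Edge]]) -> Dict[Edge, Vector]:
    """Map each edge to a 0/1 vector; entry i is 1 if graph i contains the edge."""
    all_edges = set().union(*graphs)
    return {e: tuple(int(e in g) for g in graphs) for e in all_edges}


def weighted_patterns(vectors: Dict[Edge, Vector]) -> Counter:
    """S22 (assumed reading): keep one copy of each vector, weight = its multiplicity."""
    return Counter(vectors.values())


def split_by_weight(patterns: Counter, threshold: int):
    """S21 (assumed screening condition): Hamming value >= threshold goes to A, else B."""
    a = {v: w for v, w in patterns.items() if sum(v) >= threshold}
    b = {v: w for v, w in patterns.items() if sum(v) < threshold}
    return a, b


def seed_vector(patterns: Dict[Vector, int]) -> Vector:
    """S23: the vector with the largest weight."""
    return max(patterns, key=patterns.get)


def prune_centres(centres: List[Vector], k: float) -> List[Vector]:
    """S25: drop centres whose number of 1s is below floor(k * m)."""
    return [c for c in centres if sum(c) >= math.floor(k * len(c))]
```

Here m is read off the vector length, since every vector has one entry per graph in the set.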
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610648165.9A CN106295249A (en) | 2016-08-08 | 2016-08-08 | The Forecasting Methodology of Frequent Pattern Mining gene function based on complex biological network collection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106295249A (en) | 2017-01-04 |
Family
ID=57667262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610648165.9A Pending CN106295249A (en) | 2016-08-08 | 2016-08-08 | The Forecasting Methodology of Frequent Pattern Mining gene function based on complex biological network collection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106295249A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709275A (en) * | 2017-02-04 | 2017-05-24 | 上海喆之信息科技有限公司 | Restricted type cardiomyopathy gene data processing device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102855398A (en) * | 2012-08-28 | 2013-01-02 | 中国科学院自动化研究所 | Method for obtaining disease potentially-associated gene based on multi-source information fusion |
CN103514381A (en) * | 2013-07-22 | 2014-01-15 | 湖南大学 | Protein biological network motif identification method integrating topological attributes and functions |
CN104598748A (en) * | 2015-01-29 | 2015-05-06 | 中国人民解放军军械工程学院 | Calculating method of restrictive boolean network degeneracy |
Non-Patent Citations (3)
Title |
---|
HAIYAN HU et al.: "Mining coherent dense subgraphs across massive biological networks for functional discovery", 《BIOINFORMATICS》 * |
ZAN XIANGZHEN et al.: "A Graph-based Method to Mine Coexpression Clusters Across Multiple Datasets", 《CHINESE JOURNAL OF ELECTRONICS》 * |
ZAN XIANGZHEN (昝乡镇): "Research on the development of software for frequent pattern mining in complex biological network sets", 《China Master's Theses Full-text Database, Basic Sciences》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20170104 |