CN107784196A - Method for identifying essential proteins based on an artificial fish swarm optimization algorithm - Google Patents


Info

Publication number
CN107784196A
Authority
CN
China
Prior art keywords
protein
node
fish
formula
artificial fish
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710912037.5A
Other languages
Chinese (zh)
Other versions
CN107784196B (en
Inventor
雷秀娟
杨晓琴
代才
程适
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN201710912037.5A priority Critical patent/CN107784196B/en
Publication of CN107784196A publication Critical patent/CN107784196A/en
Application granted granted Critical
Publication of CN107784196B publication Critical patent/CN107784196B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Genetics & Genomics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Physiology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for identifying essential proteins based on an artificial fish swarm optimization algorithm. The method converts the protein-protein interaction network into an undirected graph; builds a refined protein-protein interaction network; obtains the gene expression values and GO annotation information of the proteins and the degree of each protein within known complexes; processes the edges and nodes of the refined network; selects known essential proteins as the initial artificial fish; and has the artificial fish perform foraging, random, following, and swarming behaviors to produce the essential proteins. The method identifies essential proteins accurately. Simulation results show better performance on indices such as sensitivity, specificity, positive predictive value, and negative predictive value. Compared with other essential-protein identification methods, it combines the optimization characteristics of the artificial fish swarm with the topological features of the protein-protein interaction network in the identification process, which improves the accuracy of essential-protein identification.

Description

Method for identifying essential proteins based on an artificial fish swarm optimization algorithm
Technical field
The invention belongs to the field of bioinformatics, and in particular relates to a method for identifying essential proteins based on an artificial fish swarm optimization algorithm.
Background art
Essential proteins are the products of essential genes and are indispensable for an organism to sustain life. The loss of an essential protein disrupts normal life activity and can even cause the organism to die. Predicting and identifying essential proteins is therefore significant research: it helps in studying the regulation of cell growth, and it has far-reaching significance for disease diagnosis and drug design. Initially, essential proteins were identified mainly by biological experiments such as single-gene knockout and RNA interference. These experimental techniques identify essential proteins accurately and effectively, but they are costly and inefficient. Identifying essential proteins by computational methods has therefore become a focus of research in bioinformatics.
At present, computational identification of essential proteins falls into two main categories: node centrality methods based on network topology, and methods that combine PPI networks with biological data.
The "centrality-lethality" rule proposed by Jeong et al. in 2001 states that the essentiality of a protein is closely related to its topological properties in the protein-protein interaction network: the loss of a protein with more neighbor nodes is more likely to affect the topology of the whole network. In short, nodes of higher degree in a protein network tend to be essential, and the loss of such proteins more easily causes loss of function and lethality. This rule laid the foundation for identifying essential proteins from network topology. A series of topology-based centrality methods followed, including degree centrality (DC), betweenness centrality (BC), closeness centrality (CC), eigenvector centrality (EC), information centrality (IC), and subgraph centrality (SC). Each of these methods scores all protein nodes in the protein-protein interaction network by some centrality value, ranks them, and then identifies essential proteins. However, these centrality methods depend heavily on the reliability of the protein-protein interaction network. Because the network is obtained from high-throughput biological experiments, it contains many false positives, which largely reduces the accuracy of essential-protein identification.
To address the shortcomings of centrality methods, researchers have proposed new methods to improve identification accuracy. For example, PeC combines the protein-protein interaction network with gene expression profiles; ION combines protein orthology information with the network; UDoNC combines protein domains with the network; and SCP combines subcellular localization information with the network. In addition, some methods identify essential proteins from prior knowledge. CPPK and CEPPK, for example, take some known essential proteins as prior knowledge and judge the essentiality of a protein by how closely it is connected in the network to the proteins in the prior knowledge.
Numerous studies show that protein essentiality is closely linked to protein complexes. Hart et al. found experimentally that the essentiality of a protein is not determined by the individual protein alone but often depends on the function of the protein complex, and experimental data show that essential proteins tend to be enriched in certain complexes. Many essential-protein identification methods based on protein complexes and functional modules have therefore been proposed.
Although researchers have studied essential-protein identification in depth as bioinformatics has developed, the accuracy of current network-topology-based methods remains relatively low, and most methods analyze essential proteins using a few parameters or features in isolation, without a global, holistic view of the nodes. Moreover, because protein interaction data obtained by high-throughput techniques contain many false positives, they cannot represent the real protein network. Building a protein-protein interaction network that more faithfully reflects the organism can therefore further improve the accuracy of essential-protein identification.
In summary, existing essential-protein identification methods have the following main defects: they do not consider the reliability of the protein-protein interaction network, they consider only partial features without a global, holistic view, and their identification accuracy is relatively low.
Summary of the invention
The object of the invention is to overcome the shortcomings and deficiencies of the prior art by providing a method for identifying essential proteins based on an artificial fish swarm optimization algorithm. The method builds a refined protein-protein interaction network and identifies essential proteins with high accuracy.
To achieve this object, the invention adopts the following technical solution,
which comprises the following steps:
(1) Convert the protein-protein interaction network into an undirected graph:
The protein-protein interaction network is converted into an undirected graph G = (V, E), where V = {v_i, i = 1, 2, ..., n} is the set of nodes v_i and E is the set of edges e; a node v_i represents a protein, and an edge e represents an interaction between proteins;
(2) Build the refined protein-protein interaction network:
At time point t, if the gene expression value Ep_it of node v_i is greater than the gene expression activity threshold Active_Th(i), node v_i is considered active at time point t; otherwise the node is considered inactive at time point t. If two different nodes v and u in V are both active at time point t, then v and u are considered co-expressed at time point t. All edges of the undirected graph whose protein interactions are not co-expressed at any time point are deleted, which yields a refined protein-protein interaction network;
(3) Process the edges and nodes of the refined protein-protein interaction network: calculate the edge clustering coefficient (ECC), the Pearson correlation coefficient (PCC), and the GO functional similarity of each edge, and the degree of each node inside protein complexes;
(4) Select known essential proteins to form the initial artificial fish:
Let N be the size of the artificial fish population and m the number of known essential proteins contained in each artificial fish. For each artificial fish, m known essential proteins are randomly selected from the currently known essential proteins to form an artificial fish that carries prior knowledge. Fish(k) denotes the set of known essential proteins contained in the k-th initial artificial fish, k = 1, 2, ..., N; Cn is the number of candidate essential proteins;
(5) Foraging behavior:
For each artificial fish, all neighbor proteins of the proteins in the fish are found to form the neighbor protein node set Neighbor(k), such that the proteins in Neighbor(k) and Neighbor(l) are mutually distinct, k = 1, 2, ..., N, l = 1, 2, ..., N, k ≠ l. For each node v_i in Neighbor(k), the possibility of merging it into artificial fish Fish(k) is determined by score1(i) = fitness1(v_i, Fish(k)). The nodes in Neighbor(k) are sorted in descending order of score1, and the protein node with the highest score1 is added to Fish(k) and, at the same time, to the set Add(k). The foraging behavior is repeated Tn times, so that Tn protein nodes are added to each initial artificial fish;
(6) Following behavior:
After the foraging behavior has been performed, Score2(k) = fitness2(Add(k)) is calculated for every artificial fish to determine which fish is in the optimal state. All artificial fish are sorted in descending order of Score2; the fish with the highest Score2 is the optimal artificial fish Fish(p), p ∈ [1, N], and the protein nodes in its set Add(p) are added to the set Candidate;
(7) Swarming behavior:
Apart from the set Add(p) of the optimal artificial fish Fish(p), a score is calculated for each node v_i in the sets Add(k) of the remaining artificial fish Fish(k), k ≠ p, by Score3(i) = fitness3(v_i). All v_i are sorted in descending order of Score3. Let δ be the crowding factor; the top δ protein nodes are added to the set Candidate;
(8) Produce the essential proteins:
The protein nodes in the set Candidate obtained in step (7) are output as essential proteins.
Further, the gene expression activity threshold Active_Th(i) is obtained by formula (1):
$$\mathrm{Active\_Th}(i) = \mu(i) + 3\sigma(i)\,\bigl(1 - F(i)\bigr) \qquad (1)$$
In formula (1), μ(i) is the mean gene expression value of node v_i, σ(i) is the standard deviation of its gene expression values, and $F(i) = 1/(1 + \sigma^2(i))$ is a weight function.
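Formula (1) can be illustrated with a short Python sketch. The function names are hypothetical, and the use of the population variance is an assumption, since the text does not say which variance estimator is meant.

```python
import math

def active_threshold(expr):
    """Activity threshold of formula (1) for one gene's expression profile."""
    n = len(expr)
    mu = sum(expr) / n
    # Assumption: population variance (divide by n); the text does not specify.
    var = sum((x - mu) ** 2 for x in expr) / n
    sigma = math.sqrt(var)
    f = 1.0 / (1.0 + var)  # weight function F(i) = 1 / (1 + sigma^2)
    return mu + 3.0 * sigma * (1.0 - f)

def active_time_points(expr):
    """Indices of the time points at which the expression exceeds the threshold."""
    th = active_threshold(expr)
    return [t for t, x in enumerate(expr) if x > th]
```

For a flat profile the variance is zero, F(i) = 1, and the threshold reduces to the mean, so no time point counts as active.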
Further, in step (3), the edge clustering coefficient is calculated by formula (2):
In the formula, N_i and N_j denote the neighbor node sets of nodes v_i and v_j, respectively;
The Pearson correlation coefficient of an edge is calculated by formula (3):
In the formula, Ep_it and Ep_jt denote the gene expression values of nodes v_i and v_j at time point t, μ(i) and μ(j) are the mean gene expression values of nodes v_i and v_j, and T is the maximum time point;
The GO functional similarity of an edge is calculated by formula (4):
In the formula, GO_i and GO_j denote the GO terms annotating nodes v_i and v_j, respectively;
The degree of node v_i inside protein complexes is calculated by formula (5):
In the formula, V(|C|) denotes the set of nodes contained in the protein complexes, Cv_i denotes a protein complex containing node v_i, D_in(v_i, Cv_i) denotes the degree of node v_i within protein complex Cv_i, and v_j is a neighbor node of v_i.
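As an illustration of the edge-level Pearson correlation, the following Python sketch computes the standard Pearson coefficient between two expression profiles over the same T time points. It assumes that formula (3) is this standard coefficient; the function name and the choice to return 0.0 for a constant profile are assumptions made for the example.

```python
import math

def pearson(x, y):
    """Standard Pearson correlation between two equally long expression profiles."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("profiles must be non-empty and of equal length")
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        # Assumption: a constant profile carries no co-expression signal.
        return 0.0
    return cov / (sx * sy)
```

The normalisation factors cancel, so the result is the same whether the deviations are averaged over T or T − 1 time points.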
Further, in step (5), the possibility fitness1 of adding node v_i of the set Neighbor(k) to artificial fish Fish(k) is obtained by formula (6):
In the formula, v_j is a protein node inside artificial fish Fish(k), ECC is the edge clustering coefficient of the edge between nodes v_i and v_j, PCC is the Pearson correlation coefficient of that edge, and GO_sim is the functional similarity between nodes v_i and v_j.
Further, in step (5), if no suitable protein node can be added to the artificial fish during the foraging behavior, the random behavior is performed: a randomly selected protein node is added to the neighbor protein node set Neighbor(k).
Further, in step (6), the possibility fitness2 that an artificial fish is in the optimal state is obtained by formula (7):
In the formula, Add(k) denotes the set of protein nodes added to the k-th artificial fish by the Tn foraging behaviors.
Further, in step (7), the score fitness3 of node v_i in the set Add(k), k ≠ p, is obtained by formula (8):
$$W(v_i, v_j) = \mathrm{ECC}(v_i, v_j) \times \bigl(\mathrm{PCC}(v_i, v_j) + \mathrm{GO\_sim}(v_i, v_j)\bigr) \qquad (9)$$
In formula (8), a and b are coefficients satisfying a + b = 1, Nei(v_i) denotes the set of neighbor nodes of node v_i, and DIC(v_i) denotes the degree of node v_i inside protein complexes.
Further, in step (7), δ = Cn − Tn.
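The way steps (6) to (8) assemble the output can be sketched in Python as follows. The sketch takes the Score2 and Score3 values as precomputed inputs rather than evaluating formulas (7) and (8); the function name, argument layout, and tie-breaking by node name are assumptions made for the example.

```python
def select_candidates(add_sets, score2, score3, cn):
    """Assemble Candidate from the following and swarming behaviors.

    add_sets: list of lists, add_sets[k] holds the Tn nodes added to fish k
    score2:   list of Score2 values, one per fish (precomputed)
    score3:   dict node -> Score3 value for nodes of the non-optimal fish
    cn:       number of candidate essential proteins to output
    """
    # Following behavior: the fish with the highest Score2 is the optimal fish.
    p = max(range(len(add_sets)), key=lambda k: score2[k])
    candidate = list(add_sets[p])
    tn = len(add_sets[p])
    delta = cn - tn  # crowding factor: delta = Cn - Tn

    # Swarming behavior: rank the nodes added to all other fish by Score3.
    rest = set()
    for k, nodes in enumerate(add_sets):
        if k != p:
            rest.update(nodes)
    rest -= set(candidate)  # assumption: guard against duplicates
    ranked = sorted(rest, key=lambda v: (-score3[v], v))
    candidate.extend(ranked[:max(delta, 0)])
    return p, candidate
```

With δ = Cn − Tn, the Tn nodes of the optimal fish plus the top δ remaining nodes give Cn candidates in total, provided the other fish contribute at least δ distinct nodes.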
Compared with existing methods, the invention has the following advantages:
1. The invention takes some known essential proteins as prior knowledge. Because essential proteins tend to be connected to one another, the foraging behavior of the artificial fish searches the neighbor nodes of the proteins that form each fish to predict essential proteins, which takes full account of the topological properties of essential proteins in the network.
2. When the artificial fish perform the swarming behavior to score protein nodes, the invention uses the edge clustering coefficient (ECC), the Pearson correlation coefficient (PCC), and the GO functional similarity (GO_sim), which together account for how tightly two interacting proteins are connected, how similar their gene expression is, and how related their functions are. It also uses the participation of a protein inside complexes (DIC), which accounts for the relation between protein essentiality and complexes. Fusing these characteristics makes the identification of essential proteins more accurate.
3. The invention identifies essential proteins by simulating how an artificial fish swarm forages and seeks companions. It builds a reliable protein-protein interaction network; considers the topological properties of the network, the gene expression values of the proteins, GO semantic similarity, protein complex information, and prior knowledge; and adds the optimization mechanism of the artificial fish swarm. Using features from multiple aspects makes the essential proteins identified by the invention more accurate than those identified by other current essential-protein identification methods.
4. The method identifies essential proteins accurately. Simulation results show better performance on indices such as sensitivity, specificity, positive predictive value, and negative predictive value. Compared with other essential-protein identification methods, it combines the optimization characteristics of the artificial fish swarm with the topological features of the protein-protein interaction network in the identification process, which improves identification accuracy.
5. The invention can effectively identify essential proteins from a protein-protein interaction network. This helps in understanding the regulation of cell growth and the mechanisms of life activity, and it also has important theoretical value for precise drug development and for the diagnosis and treatment of disease.
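The four indices named in advantage 4 have standard definitions in terms of true and false positives and negatives. The Python sketch below computes them for a predicted set against a reference set of essential proteins; the function name and the convention of returning 0.0 for an empty denominator are assumptions, and the text does not describe the evaluation code actually used.

```python
def classification_indices(predicted, essential, universe):
    """Sensitivity, specificity, PPV and NPV of a predicted essential-protein set."""
    universe = set(universe)
    predicted = set(predicted) & universe
    essential = set(essential) & universe
    tp = len(predicted & essential)             # essential and predicted
    fp = len(predicted - essential)             # predicted but not essential
    fn = len(essential - predicted)             # essential but missed
    tn = len(universe - predicted - essential)  # correctly left out

    def ratio(a, b):
        # Assumption: report 0.0 when the denominator is empty.
        return a / b if b else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```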
Brief description of the drawings
Fig. 1 is the process flow chart of Embodiment 1 of the invention.
Fig. 2 is a schematic diagram of part of the whole protein-protein interaction network, showing the essential proteins obtained by Embodiment 1.
Fig. 3 shows the essential proteins in the standard set corresponding to Fig. 2.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
As shown in Fig. 1, the method of the invention for identifying essential proteins based on an artificial fish swarm optimization algorithm comprises the following steps:
(1) Convert the protein-protein interaction network into an undirected graph
The protein-protein interaction network is converted into an undirected graph G = (V, E), where V = {v_i, i = 1, 2, ..., n} is the set of nodes v_i and E is the set of edges e; a node v_i represents a protein, and an edge e represents an interaction between proteins;
(2) Build the refined protein-protein interaction network
At time point t, if the gene expression value Ep_it of node v_i is greater than the gene expression activity threshold Active_Th(i), node v_i is considered active at time point t; otherwise the node is considered inactive at time point t. If two different nodes v and u in V are both active at time point t, then v and u are considered co-expressed at time point t. All edges of the undirected graph whose protein interactions are not co-expressed at any time point are deleted, which yields a new protein-protein interaction network, namely the refined protein network;
Ep_it is the gene expression value of node v_i at time point t;
The gene expression activity threshold Active_Th(i) is obtained by formula (1):
$$\mathrm{Active\_Th}(i) = \mu(i) + 3\sigma(i)\,\bigl(1 - F(i)\bigr) \qquad (1)$$
In formula (1), μ(i) is the mean gene expression value of node v_i, σ(i) is the standard deviation of its gene expression values, and $F(i) = 1/(1 + \sigma^2(i))$ is a weight function;
(3) Process the edges and nodes of the refined protein-protein interaction network
The edge clustering coefficient is calculated by formula (2):
In the formula, N_i and N_j denote the neighbor node sets of nodes v_i and v_j, respectively;
The Pearson correlation coefficient of an edge is calculated by formula (3):
In the formula, Ep_it and Ep_jt denote the gene expression values of nodes v_i and v_j at time point t, μ(i) and μ(j) are the mean gene expression values of nodes v_i and v_j, and T is the maximum time point;
The GO functional similarity of an edge is calculated by formula (4):
In the formula, GO_i and GO_j denote the GO terms annotating nodes v_i and v_j, respectively.
Preprocessing of node v_i: the degree of node v_i inside protein complexes is calculated by formula (5):
In the formula, V(|C|) denotes the set of protein nodes contained in the protein complexes, Cv_i denotes a protein complex containing node v_i, D_in(v_i, Cv_i) denotes the degree of node v_i within protein complex Cv_i, and v_j is a neighbor node of v_i;
(4) Select known essential proteins as the initial artificial fish
Let N be the size of the artificial fish population and m the number of known essential proteins contained in each artificial fish. For each artificial fish, m known essential proteins are randomly selected from the standard set (the currently known essential proteins) to form an artificial fish that carries prior knowledge. Fish(k) denotes the set of known essential proteins contained in the k-th initial artificial fish, k = 1, 2, ..., N; Cn is the number of candidate essential proteins;
(5) Foraging behavior
An artificial fish searches for food within its visual range, that is, it looks for proteins that interact directly with the proteins in the fish. For each artificial fish, all neighbor proteins of the proteins in the fish are found to form the neighbor protein node set Neighbor(k), such that the nodes in Neighbor(k) and Neighbor(l) are mutually distinct (k = 1, 2, ..., N, l = 1, 2, ..., N, k ≠ l). For each node v_i in Neighbor(k), the possibility of merging it into artificial fish Fish(k) is determined by score1(i) = fitness1(v_i, Fish(k)). The nodes in Neighbor(k) are sorted in descending order of score1, and the node with the highest score1 is added to Fish(k) and, at the same time, to the set Add(k);
Random behavior: if no suitable protein can be added to the artificial fish during the foraging behavior, the random behavior is performed, and a randomly selected protein node is added to the set Neighbor(k);
The foraging behavior is repeated Tn times, that is, Tn protein nodes are added to each initial artificial fish;
(6) Following behavior
After the foraging behavior has been performed, Score2(k) = fitness2(Add(k)) is calculated for every artificial fish to determine which fish is in the optimal state. All artificial fish are sorted in descending order of Score2; the fish with the highest Score2 is the optimal artificial fish Fish(p), p ∈ [1, N], and the protein nodes in its set Add(p) are added to the set Candidate;
(7) Swarming behavior
Apart from the set Add(p) of the optimal artificial fish Fish(p), a score is calculated for each node v_i in the sets Add(k) of the remaining artificial fish Fish(k), k ≠ p, by Score3(i) = fitness3(v_i). All v_i are sorted in descending order of Score3. Let δ be the crowding factor, δ = Cn − Tn; the top δ protein nodes are added to the set Candidate;
(8) Produce the essential proteins
The proteins in the set Candidate are output as essential proteins.
In step (5) of the invention, the possibility fitness1 of adding node v_i of the set Neighbor(k) to artificial fish Fish(k) is obtained by formula (6):
In the formula, v_j is a node inside artificial fish Fish(k); ECC is the edge clustering coefficient of the edge between nodes v_i and v_j, obtained by formula (2); PCC is the Pearson correlation coefficient of that edge, obtained by formula (3); GO_sim is the functional similarity between nodes v_i and v_j, obtained by formula (4).
In step (6) of the invention, the possibility fitness2 that an artificial fish is in the optimal state is obtained by formula (7):
In the formula, Add(k) denotes the set of protein nodes added to the k-th artificial fish by the Tn foraging behaviors, and fitness1(v_i, Fish(k)) is as given by formula (6).
In step (7) of the invention, the score fitness3 of a protein node in the set Add(k) (k ≠ p) is obtained by formula (8):
$$W(v_i, v_j) = \mathrm{ECC}(v_i, v_j) \times \bigl(\mathrm{PCC}(v_i, v_j) + \mathrm{GO\_sim}(v_i, v_j)\bigr) \qquad (9)$$
In formula (8), a and b are coefficients satisfying a + b = 1, Nei(v_i) denotes the set of neighbor nodes of node v_i, and DIC(v_i) denotes the degree of node v_i inside complexes, obtained by formula (5).
The invention is described in more detail below through a specific embodiment:
Embodiment 1
Taking a protein network as an example, the steps of the method for identifying essential proteins based on an artificial fish swarm optimization algorithm are as follows:
This embodiment uses the yeast dataset from the DIP database (release 20160114) as the simulation dataset; the DIP data contain 5028 proteins and 22303 interactions. The gene expression dataset is the yeast metabolic expression dataset GSE3431 from the GEO database, which includes 9336 genes with expression values at 36 time points over 3 cycles and covers 95% of the proteins in DIP. The GO data include the annotation profiles and SGD. The known protein complex information comes from CYC2008, which includes 408 protein complexes covering 1492 proteins. The essential protein data were obtained by integrating data from four databases, MIPS, SGD, DEG, and SGDP, and contain 1285 essential proteins in total; 1152 of the 5028 proteins are essential, and the rest are regarded as non-essential. The experimental platform is the Windows 10 operating system with an Intel Core i5-6600 dual-core 3.31 GHz processor and 8 GB of physical memory, and the method of the invention is implemented in Matlab R2014a.
1. Convert the protein-protein interaction network into an undirected graph
The protein-protein interaction network containing 5028 proteins and 22303 interactions is converted into an undirected graph G = (V, E), where V = {v_i, i = 1, 2, ..., 5028} is the set of nodes v_i and E is the set of edges e; a node v_i represents a protein, and an edge e represents an interaction between proteins.
2. Build the refined protein-protein interaction network
At time point t, if the gene expression value Ep_it of node v_i is greater than the gene expression activity threshold Active_Th(i), node v_i is considered active at time point t; otherwise the node is considered inactive at time point t. If two different nodes v and u in V are both active at time point t, then v and u are considered co-expressed at time point t. The gene expression activity threshold Active_Th(i) is obtained by formula (1):
$$\mathrm{Active\_Th}(i) = \mu(i) + 3\sigma(i)\,\bigl(1 - F(i)\bigr) \qquad (1)$$
In formula (1), μ(i) is the mean gene expression value of node v_i, σ(i) is the standard deviation of its gene expression values, and $F(i) = 1/(1 + \sigma^2(i))$ is a weight function. After this processing, the protein interactions that are not co-expressed at any time point are deleted from the original protein-protein interaction network, which forms a new protein-protein interaction network with 5028 protein nodes and 9576 edges, namely the refined protein-protein interaction network.
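The edge filtering of this step can be sketched in Python as follows, assuming the active time points of each protein have already been derived from formula (1). The function name and the data layout (an edge list plus a mapping from protein to its set of active time points) are assumptions made for the example.

```python
def refine_network(edges, active):
    """Keep only the edges whose two proteins are co-expressed at some time point.

    edges:  iterable of (u, v) protein pairs from the undirected graph
    active: dict protein -> set of time points at which the protein is active
    """
    kept = []
    for u, v in edges:
        # Co-expressed means both proteins are active at the same time point.
        shared = active.get(u, set()) & active.get(v, set())
        if shared:
            kept.append((u, v))
    return kept
```

A protein with no expression record is treated here as never active, so all of its edges are dropped; that handling is also an assumption.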
3. Process the edges and nodes of the refined protein-protein interaction network
The edge clustering coefficient is calculated by formula (2):
In the formula, N_i and N_j denote the neighbor node sets of nodes v_i and v_j, respectively, and d_i and d_j are the degrees of nodes v_i and v_j.
The Pearson correlation coefficient of an edge is calculated by formula (3):
In the formula, Ep_it and Ep_jt denote the gene expression values of nodes v_i and v_j at time point t, μ(i) and μ(j) are the mean gene expression values of nodes v_i and v_j, and T is the maximum time point.
The GO functional similarity of an edge is calculated by formula (4):
In the formula, GO_i and GO_j denote the GO terms annotating proteins v_i and v_j, respectively.
Preprocessing of node v_i: for i = 1, 2, ..., 5028, the participation of node v_i inside protein complexes can be calculated for each given i. The degree of node v_i inside protein complexes is calculated by formula (5):
In the formula, V(|C|) denotes the set of protein nodes contained in the protein complexes, Cv_i denotes a protein complex containing protein v_i, D_in(v_i, Cv_i) denotes the degree of protein v_i within protein complex Cv_i, and v_j is a neighbor node of v_i.
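Two of the quantities in this step can be sketched in Python. Formula (2) does not appear in this text, so the edge clustering coefficient below uses a common definition from the literature (shared neighbors divided by min(d_i − 1, d_j − 1)), which is an assumption and may differ from the patent's formula. The in-complex degree follows the stated meaning of D_in(v_i, Cv_i); how formula (5) aggregates it over complexes is not shown here. All function names are hypothetical.

```python
def edge_clustering_coefficient(adj, u, v):
    """ECC of edge (u, v) under an assumed, commonly used definition.

    adj: dict node -> set of neighbor nodes in the refined network
    """
    common = len(adj[u] & adj[v])  # neighbors shared by both endpoints
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    # Assumption: an edge with a degree-1 endpoint gets coefficient 0.
    return common / denom if denom > 0 else 0.0

def degree_in_complex(adj, v, complex_members):
    """D_in(v, C): number of neighbors of v that belong to complex C."""
    return len(adj[v] & set(complex_members))
```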
4. Select known essential proteins to form the initial artificial fish
Let N be the size of the artificial fish population and m the number of known essential proteins contained in each artificial fish. For each artificial fish, 100 known essential proteins are randomly selected from the 1152 essential proteins in the standard set to form an artificial fish that carries prior knowledge; Fish(k) denotes the set of proteins contained in the k-th initial artificial fish. In this example N = 100 and m = 100; Cn is the number of candidate essential proteins.
5. Foraging behavior
An artificial fish searching for food within its visual range corresponds to finding proteins that interact directly with the proteins in the artificial fish. For each artificial fish, all neighbor proteins of the proteins in the fish are found and collected in the set Neighbor(k), where the proteins in Neighbor(k) and Neighbor(l) differ (k = 1, 2, ..., 100; l = 1, 2, ..., 100; k ≠ l). For each protein v_i in Neighbor(k), score1(i) = fitness1(v_i, Fish(k)) determines how likely it is to be merged into artificial fish Fish(k). The nodes in Neighbor(k) are sorted in descending order of score1, and the node with the highest score1 is added to Fish(k) and, at the same time, to the set Add(k). Here score1(i) is the cohesion between protein v_i and all proteins in the artificial fish, obtained by formula (6):
In the formula, v_j is a protein node inside artificial fish Fish(k); ECC is the edge clustering coefficient of the edge between v_i and v_j, obtained by formula (2); PCC is the Pearson correlation coefficient of that edge, obtained by formula (3); and GO_sim is the functional similarity between v_i and v_j, obtained by formula (4).
If no suitable protein can be added to the artificial fish during foraging, the random behavior is performed: a randomly selected protein node is added to the set Neighbor(k). The foraging behavior (or random behavior) is repeated Tn times, so that Tn protein nodes are added to each initial artificial fish.
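The foraging loop for a single fish can be sketched as below. Because formula (6) is not reproduced in this text, the cohesion score is passed in as a callable `fitness1(v, fish)` instead of being hard-coded. The sketch also assumes that "no suitable protein" means the neighbor set is empty, and it does not enforce the condition that the neighbor sets of different fish differ. The function name `forage` is hypothetical.

```python
import random


def forage(fish, adj, fitness1, tn, rng=None):
    """Run up to Tn foraging rounds on one fish; return (grown fish, added nodes)."""
    rng = rng or random.Random()
    fish = set(fish)
    added = []
    for _ in range(tn):
        # Neighbor(k): proteins that interact directly with the fish but are not in it.
        neighbors = set()
        for protein in fish:
            neighbors |= adj.get(protein, set())
        neighbors -= fish
        if not neighbors:
            # Random behavior (assumed trigger): pick one protein outside the fish.
            outside = sorted(set(adj) - fish)
            if not outside:
                break  # the whole network is already inside the fish
            neighbors = {rng.choice(outside)}
        # Highest score1 wins; sorting first makes ties deterministic.
        best = max(sorted(neighbors), key=lambda v: fitness1(v, fish))
        fish.add(best)
        added.append(best)
    return fish, added
```

The returned list plays the role of Add(k). Any scoring function can be plugged in, for example one that sums a per-edge weight over the proteins already in the fish.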
6. Following behavior
After the foraging behavior (or random behavior) has been performed, Score2(k) = fitness2(Add(k)) is computed for every artificial fish to determine the fish in the optimal state. All artificial fish are sorted in descending order of Score2. The fish with the highest Score2 is the optimal artificial fish Fish(p) (p ∈ [1, 100]), and the protein nodes in its set Add(p) are added to the set Candidate. fitness2 denotes the fitness of each artificial fish after the proteins have been added and is obtained by formula (7):
In the formula, Add(k) denotes the set of protein nodes added to the k-th artificial fish by the Tn rounds of foraging behavior, and fitness1(v_i, Fish(k)) is as defined in formula (6).
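Selecting the optimal fish can be sketched as follows. Formula (7) is not reproduced in this text, so the fish-level fitness is supplied as a callable `fitness2(added)`; the function name `follow` and the 0-based index convention are assumptions of this sketch.

```python
def follow(add_list, fitness2):
    """Return (p, candidate): the 0-based index of the best fish and the nodes of Add(p)."""
    scores = [fitness2(added) for added in add_list]
    # The fish with the highest Score2 is the optimal fish; ties go to the lowest index.
    p = max(range(len(scores)), key=scores.__getitem__)
    return p, set(add_list[p])
```

The returned set seeds Candidate, and the index p is needed afterwards to exclude Add(p) from the swarming step.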
7. Swarming behavior
Apart from the set Add(p) of the optimal artificial fish Fish(p), each protein node v_i in the sets Add(k) (k ≠ p) of the remaining artificial fish Fish(k) is scored by Score3(i) = fitness3(v_i). All v_i are sorted in descending order of Score3. Let δ (δ = Cn - Tn) be the crowding factor; the top δ protein nodes are added to the set Candidate. fitness3 denotes the score of a protein node in the sets Add(k) (k ≠ p) and is obtained by formula (8):
$$W(v_i, v_j) = \mathrm{ECC}(v_i, v_j) \times \bigl(\mathrm{PCC}(v_i, v_j) + \mathrm{GO\_sim}(v_i, v_j)\bigr) \tag{9}$$
In the formulas, a and b are coefficients with a = 0.8 and b = 0.2, Nei(v_i) denotes the set of neighbor nodes of v_i, and DIC(v_i) denotes the degree of node v_i inside protein complexes, obtained by formula (5).
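The swarming selection can be sketched as below. The edge weight follows formula (9) directly. Formula (8) is not reproduced in this text, so the node score is passed in as a callable `fitness3(v)` and the coefficients a and b do not appear in the sketch. Skipping nodes that are already in Add(p) is an assumption made so that Candidate can reach Cn distinct entries. The function names are hypothetical.

```python
def edge_weight(ecc_value, pcc_value, go_value):
    """Formula (9): W = ECC * (PCC + GO_sim)."""
    return ecc_value * (pcc_value + go_value)


def swarm(add_list, p, fitness3, cn, tn):
    """Pick the top delta = Cn - Tn nodes from every Add(k) with k != p."""
    delta = cn - tn  # crowding factor
    pool = set()
    for k, added in enumerate(add_list):
        if k != p:
            pool |= set(added)
    # Assumption: nodes already contributed by the best fish are not counted twice.
    pool -= set(add_list[p])
    # Descending score; the node name breaks ties deterministically.
    ranked = sorted(pool, key=lambda v: (-fitness3(v), v))
    return ranked[:delta]
```

Candidate is then the union of Add(p) and the list returned by `swarm`.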
8. Output the key proteins
The proteins in the set Candidate are output as the key proteins.
To verify the effectiveness of the invention, the inventors applied the artificial fish swarm optimization method of Embodiment 1 to identify key proteins in the protein network of the DIP database, and analyzed the correctly identified key proteins for candidate numbers (Cn) of 100, 200, 300, 400, 500 and 600. In this experiment each artificial fish takes 100 known key proteins as prior knowledge. Because these known key proteins are selected at random, the experiment was run 50 times and the average of the 50 results was taken as the final result. The results are shown in Table 1, Fig. 2 and Fig. 3. Table 1 compares the identification accuracy with that of other current methods for identifying key proteins. Fig. 2 shows the distribution in the network of part of the key proteins identified by the invention, and Fig. 3 shows the corresponding part of the standard library.
Table 1. Comparison of the accuracy of the present invention and other methods in identifying key proteins
Table 1 gives the identification accuracy obtained when the top 100, 200, 300, 400, 500 and 600 proteins identified by the invention are taken as candidate key proteins and compared with the key proteins in the standard library, together with the results of other current key protein identification methods. For the top 600 key proteins, the invention shows a higher prediction accuracy than the other 8 key protein identification methods. Table 1 shows that the invention identifies key proteins effectively: for every number of candidate key proteins from 100 to 600, the invention has the highest identification accuracy. Fig. 2 shows the positions in the protein interaction network of part of the key proteins identified by the invention. In Fig. 2, nodes with a dark background are key proteins correctly identified by the invention, nodes with a light background are proteins wrongly identified as key proteins, and nodes with a white background are non-key proteins. Fig. 3 shows the key proteins in the standard library that correspond to Fig. 2. Comparing Fig. 2 with Fig. 3 shows that the proteins wrongly identified by the invention are "YDR283C" and "YPL246C", and that the key protein missed is "YBR152W". If some known key proteins are used as prior knowledge, the method of the invention can correctly identify most of the key proteins around that prior knowledge.
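The evaluation protocol can be sketched as follows. The text does not define the accuracy measure explicitly, so the sketch assumes it is the fraction of the Cn candidate proteins that appear in the standard library, averaged over repeated runs. The function names are hypothetical.

```python
def accuracy(candidates, standard):
    """Fraction of candidate proteins that appear in the standard library (assumed measure)."""
    candidates = set(candidates)
    if not candidates:
        return 0.0
    return len(candidates & set(standard)) / len(candidates)


def mean_accuracy(runs, standard):
    """Average accuracy over repeated runs (the experiment above averages 50 runs)."""
    return sum(accuracy(run, standard) for run in runs) / len(runs)
```

Each element of `runs` is the Candidate set produced by one run with freshly sampled prior knowledge.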
In the method of the invention for identifying key proteins based on the artificial fish swarm optimization algorithm, the protein interaction network is converted into an undirected graph; a purified protein interaction network is built; the gene expression values and GO annotation information corresponding to the proteins, and the degree of each protein within known complexes, are obtained; the edges and nodes of the purified network are processed; known key proteins are selected as the initial artificial fish; and the artificial fish perform the foraging, random, following and swarming behaviors to produce the key proteins. The method can identify key proteins accurately. Simulation results show better performance on indices such as sensitivity, specificity, positive predictive value and negative predictive value. Compared with other key protein identification methods, it combines the optimization characteristics of the artificial fish swarm with the topological characteristics of the protein interaction network to carry out the identification of key proteins, and it improves the identification accuracy.
The above is a preferred embodiment of the invention. Based on the above description, persons skilled in the art can make various improvements and substitutions without departing from the technical principle of the invention, and these improvements and substitutions shall also be regarded as falling within the protection scope of the invention.

Claims (8)

1. A method for identifying key proteins based on an artificial fish swarm optimization algorithm, characterized by comprising the following steps:
(1) Converting the protein interaction network into an undirected graph:
The protein interaction network is converted into an undirected graph G = (V, E), where V = {v_i, i = 1, 2, ..., n} is the set of nodes v_i and E is the set of edges e; a node v_i represents a protein, and an edge e represents an interaction between proteins;
(2) Building the purified protein interaction network:
At time point t, if the gene expression value Ep_it of node v_i is greater than the gene expression activity threshold Active_Th(i), node v_i is considered active at time point t; otherwise the node is considered inactive at time point t. If two different nodes v and u in V are both active at time point t, v and u are considered co-expressed at time point t. All edges of the undirected graph that correspond to protein interactions not co-expressed at any time point are deleted, which builds a purified protein interaction network;
(3) Processing the edges and nodes of the purified protein interaction network: calculating the edge clustering coefficient ECC of each edge, the Pearson correlation coefficient PCC of each edge, the GO functional similarity of each edge, and the degree of each node inside protein complexes;
(4) Selecting known key proteins to form the initial artificial fish:
Let N be the size of the artificial fish population and m the number of known key proteins contained in each artificial fish; m known key proteins are randomly selected from the currently known key proteins to form an artificial fish that carries prior knowledge; Fish(k) denotes the set of known key proteins contained in the k-th initial artificial fish, k = 1, 2, ..., N; Cn is the number of candidate key proteins;
(5) Foraging behavior:
All neighbor proteins of the proteins in each artificial fish are found to form the neighbor protein node set Neighbor(k), where the proteins in Neighbor(k) and Neighbor(l) differ, k = 1, 2, ..., N, l = 1, 2, ..., N, k ≠ l; for each node v_i in Neighbor(k), score1(i) = fitness1(v_i, Fish(k)) determines how likely it is to be merged into artificial fish Fish(k); the nodes in Neighbor(k) are sorted in descending order of score1, and the protein node with the highest score1 is added to Fish(k) and, at the same time, to the set Add(k); the foraging behavior is repeated Tn times, so that Tn protein nodes are added to the initial artificial fish;
(6) Following behavior:
After the foraging behavior has been performed, Score2(k) = fitness2(Add(k)) is computed for every artificial fish to determine the fish in the optimal state; all artificial fish are sorted in descending order of Score2, the fish with the highest Score2 is the optimal artificial fish Fish(p), p ∈ [1, N], and the protein nodes in its set Add(p) are added to the set Candidate;
(7) Swarming behavior:
Apart from the set Add(p) of the optimal artificial fish Fish(p), each node v_i in the sets Add(k) of the remaining artificial fish Fish(k), where k ≠ p, is scored by Score3(i) = fitness3(v_i); all v_i are sorted in descending order of Score3; let δ be the crowding factor; the top δ protein nodes are added to the set Candidate;
(8) Outputting the key proteins:
The protein nodes in the set Candidate obtained in step (7) are output as the key proteins.
2. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that:
The gene expression threshold Active_Th(i) is obtained by formula (1):
$$\mathrm{Active\_Th}(i) = \mu(i) + 3\sigma(i)\,\bigl(1 - F(i)\bigr) \tag{1}$$
In formula (1), μ(i) is the mean gene expression value of node v_i, σ(i) is the standard deviation of its gene expression values, and F(i) = 1/(1 + σ²(i)) is a weight function.
3. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that in step (3) the edge clustering coefficient of an edge is calculated by formula (2):
In the formula, N_i and N_j denote the neighbor sets of nodes v_i and v_j, respectively;
The Pearson correlation coefficient of an edge is calculated by formula (3):
In the formula, Ep_it and Ep_jt denote the gene expression values of nodes v_i and v_j at time point t, μ(i) and μ(j) are the mean gene expression values of v_i and v_j, and T is the largest time point index;
The GO functional similarity of an edge is calculated by formula (4):
In the formula, GO_i and GO_j denote the sets of GO terms that annotate node v_i and node v_j, respectively;
The degree of node v_i inside protein complexes is calculated by formula (5):
In the formula, V(|C|) denotes the set of nodes contained in the protein complexes, C_vi denotes a protein complex that contains node v_i, D_in(v_i, C_vi) denotes the degree of node v_i within protein complex C_vi, and v_j is a neighbor node of v_i.
4. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that in step (5) the likelihood fitness1 that a node v_i in the set Neighbor(k) is added to artificial fish Fish(k) is obtained by formula (6):
In the formula, v_j is a protein node inside artificial fish Fish(k), ECC is the edge clustering coefficient of the edge between v_i and v_j, PCC is the Pearson correlation coefficient of the edge between v_i and v_j, and GO_sim is the functional similarity between v_i and v_j.
5. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that in step (5), if no suitable protein node can be added to the artificial fish during foraging, the random behavior is performed: a randomly selected protein node is added to the neighbor protein node set Neighbor(k).
6. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that in step (6) the likelihood fitness2 that an artificial fish is in the optimal state is obtained by formula (7):
In the formula, Add(k) denotes the set of protein nodes added to the k-th artificial fish by the Tn rounds of foraging behavior.
7. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that in step (7) the score fitness3 of a node v_i in the sets Add(k), k ≠ p, is obtained by formula (8):
$$W(v_i, v_j) = \mathrm{ECC}(v_i, v_j) \times \bigl(\mathrm{PCC}(v_i, v_j) + \mathrm{GO\_sim}(v_i, v_j)\bigr) \tag{9}$$
In formula (8), a and b are coefficients satisfying a + b = 1, Nei(v_i) denotes the set of neighbor nodes of v_i, and DIC(v_i) denotes the degree of node v_i inside protein complexes.
8. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that in step (7) δ = Cn - Tn.
CN201710912037.5A 2017-09-29 2017-09-29 Method for identifying key protein based on artificial fish school optimization algorithm Active CN107784196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710912037.5A CN107784196B (en) 2017-09-29 2017-09-29 Method for identifying key protein based on artificial fish school optimization algorithm

Publications (2)

Publication Number Publication Date
CN107784196A true CN107784196A (en) 2018-03-09
CN107784196B CN107784196B (en) 2021-07-09

Family

ID=61433970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710912037.5A Active CN107784196B (en) 2017-09-29 2017-09-29 Method for identifying key protein based on artificial fish school optimization algorithm

Country Status (1)

Country Link
CN (1) CN107784196B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant