CN111599406B - Global multi-network comparison method combined with network clustering method - Google Patents

Info

Publication number
CN111599406B
Authority
CN
China
Prior art keywords
network
nodes
node
comparison
sequence similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010449468.4A
Other languages
Chinese (zh)
Other versions
CN111599406A (en)
Inventor
陈璟
黄佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Application filed by Jiangnan University
Priority to CN202010449468.4A
Publication of CN111599406A
Application granted
Publication of CN111599406B

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a global multi-network comparison method combined with a network clustering method, which comprises the following steps: step 1, reading protein interaction network data and sequence similarity data among the proteins of different species, and setting parameters α and β, to construct a protein interaction network G and a sequence similarity network S; step 2, preprocessing the sequence similarity data by using the parameter β to delete edges with low sequence similarity scores, obtaining a filtered network S_β; step 3, calculating the weight of each node in all networks; and step 4, generating candidate clusters in the corresponding search graphs by using a network clustering algorithm. The beneficial effects of the invention are that the method achieves a good comparison effect and produces comparison results that are good in both the topological and the biological-function sense.

Description

Global multi-network comparison method combined with network clustering method
Technical Field
The invention relates to the field of global multi-network comparison, in particular to a global multi-network comparison method combined with a network clustering method.
Background
Since the beginning of the 21st century, the amount of data available in research areas such as social networks, commercial transactions, and molecular biology has grown explosively. Representing proteins and their interactions as networks (graphs), and developing new strategies for analyzing them, is a current research hotspot. In a protein interaction network, a node represents a protein and an edge represents an interaction between two proteins.
Comparison research on protein interaction networks is extensive and falls mainly into pairwise (two-network) comparison and multi-network (three or more networks) comparison. Early network comparison algorithms were mostly pairwise, aiming to find the node mapping with the highest similarity between two networks. Since 2008, research has gradually shifted to multi-network comparison algorithms, which obtain the mapping relations of nodes among several networks simultaneously and can therefore yield deeper biological insight.
Because the network comparison problem can be viewed as a subgraph isomorphism problem, it is NP-complete and is therefore approached with heuristic methods. Early work used greedy algorithms; over the years more methods have been developed, for example the Hungarian algorithm, seed-and-extend matching, simulated annealing, and genetic algorithms, and in recent years deep learning methods have also been applied to network comparison.
The prior art has the following technical problems:
1. Global multi-network comparison of protein interaction networks originated in 2009 with the IsoRankN algorithm proposed by Liao et al. (IsoRankN: spectral methods for global alignment of multiple protein networks [J]. Bioinformatics, 25(12): i253-i258). The method builds similarity score matrices for nodes across different networks and generates a multi-network comparison result by spectral clustering. However, as an early algorithm, it is mostly outperformed in both topology and biological function by the multi-network comparison algorithms proposed in recent years. Because it is a classical multi-network comparison algorithm, many later algorithms still use it as a baseline.
2. The SMETANA algorithm was proposed in 2013 by Sahraeian et al. (SMETANA: Accurate and Scalable Algorithm for Probabilistic Alignment of Large-Scale Biological Networks [J]. PLoS ONE, 8(7): e67995). It first computes a similarity score matrix between nodes of different networks with a semi-Markov random walk model, then refines the alignment probabilities between nodes with two probabilistic consistency transformations, and finally constructs the alignment with a greedy seed-and-extend method. Experiments on many data sets show that SMETANA obtains good topological results, but this topological quality (a higher proportion of conserved edges) comes at the cost of functional similarity among some nodes; that is, SMETANA yields comparison results with good topological significance but poor biological-function significance.
3. The BEAMS algorithm, proposed in 2014 by Ferhat A. et al. (BEAMS: backbone extraction and merge strategy for the global many-to-many alignment of multiple PPI networks, Bioinformatics, 2014, 30(4), 531-539), is a global multi-network comparison algorithm built on a seed-and-extend framework. It produces an alignment by searching the network for weighted maximum cliques (backbones), and consists mainly of two parts: backbone extraction and backbone merging. BEAMS obtains good biological-function significance, but the topological significance of its results is poor (the proportion of conserved edges is low), and it relies excessively on sequence similarity information between nodes during comparison, so it cannot balance topological and biological-function significance well.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a global multi-network comparison method combined with a network clustering method that addresses the following: the influence of incomplete sequence similarity data on network comparison accuracy; the complexity of selecting nodes to search and compare at large data scales; the influence on comparison accuracy of the multiple similar nodes that gene duplication produces within each network; and the imbalance between topological quality and biological-function quality in multi-network comparison.
In order to solve the above technical problems, the present invention provides a global multi-network comparison method combined with a network clustering method, including:
step 1, reading protein interaction network data and sequence similarity data among different species, and setting parameters alpha and beta to construct a protein interaction network G and a sequence similarity network S;
step 2, preprocessing the sequence similarity data by using the parameter β to delete edges with low sequence similarity scores, obtaining a filtered network S_β;
step 3, calculating the weight of each node in all the networks, and transferring the weights of low-degree nodes and their edges to higher-degree nodes and edges according to the degrees of the nodes and their neighbor nodes;
step 4, generating candidate clusters in the corresponding searched graphs by adopting a network clustering algorithm;
step 5, calculating an objective function score of the current candidate comparison cluster, and selecting the candidate comparison cluster with the highest score as a comparison result;
step 6, outputting the comparison result and analyzing it.
In one embodiment, step 1 is specifically as follows:
first, reading the user-set parameter α, which determines the relative proportion of topology and sequence similarity, and the user-set parameter β, which is used to filter the sequence similarity information; second, determining the number k of input networks, reading the protein interaction networks in batch, and constructing the protein interaction network set G = {G_1, G_2, …, G_k}; then reading the sequence similarity information among the different networks and constructing the sequence information network S.
In one embodiment, step 2 is specifically as follows:
filtering the sequence similarity information according to the read threshold coefficient β; first, the volume of sequence similarity information is huge: it grows exponentially with the number of networks, and the computational difficulty grows with it; second, the sequence similarity information of many real species is currently incomplete, and some sequence information can impair the accuracy of the comparison result; therefore, when sequence similarity is used to assist the comparison process, the sequence information needs to be filtered; any edge (u, v) in the sequence information network S whose score is smaller than the threshold associated with that edge, that is, any edge satisfying inequality (1), is deleted from S, finally yielding the filtered network S_β:
w(u,v)<β×max(u,v) (1)
In one embodiment, step 3 is specifically as follows:
(a) Setting an initial value for a node in a network and an edge between the nodes;
(b) Transferring the weight of the node with the node degree of 1 to the neighbor nodes and edges of the node;
(c) Transferring node weights with node degrees greater than 1 and less than 10 to neighboring nodes and edges thereof;
(d) Calculating importance scores of the nodes in the network according to the weights of the nodes and the weights of the edges connected with the nodes, wherein lambda represents the influence of the weight scores of the related edges on the weights of the nodes when the weights of the nodes are calculated:
calculating a sequence homology score associated with each node;
(e) Calculating a final weight score for each node in the network;
Weight(u)=α×importance(u)+(1-α)×B(v) (8)
(f) Searching and comparing by combining a clustering method:
a weighted k-partite graph S can be constructed from the sequence similarity information, in which the nodes represent the nodes of the corresponding networks and the weight of an edge between two nodes from different species networks is their sequence similarity bit score; in the similarity graph S, similar nodes are grouped into a cluster by a network clustering method.
In one embodiment, in step 3, the clustering method based on the seed and expansion method is divided into the following steps:
(a) Calculating the weighting degree of nodes in the current search network, and selecting the node with the largest weighting degree as a first seed to be added into the set S;
(b) Normalizing the weight of the first seed neighbor, and selecting the highest weight as the second seed to be added into S;
(c) Expanding in the network from the seeds generated in the previous two steps: selecting the node that has the greatest weight and the greatest edge weight connecting it to the nodes in S, adding the node if both constraint conditions are met, and ending the expansion if they are not; when a new candidate cluster is generated, if the number of source networks of the nodes contained in the cluster is smaller than the number of input networks, the current cluster is expanded further and the constraint conditions are raised, namely the values of T_d and T_s are increased.
In one embodiment, step 5 is specifically as follows:
for each iteration, calculating the objective function score of the candidate cluster generated in the previous step, and selecting the comparison cluster with the highest objective function score from the candidate clusters as the comparison cluster generated in the iteration; the objective function formula is as follows:
AS(A)=α×CIQ(A)+(1-α)ICQ(A) (11)
where α is a parameter balancing the proportions of topology and sequence similarity information, usually set to 0.5; CIQ measures the topological quality between clusters, and ICQ measures the sequence similarity among the nodes within clusters:
where the edge set between two different clusters Cl_m and Cl_n consists of the edges joining a node of Cl_m to a node of Cl_n, and cs(m, n) denotes the proportion of conserved edges between clusters Cl_m and Cl_n, calculated as follows:
where s'_{m,n} denotes the number of networks that contain an edge of that edge set, and s_{m,n} denotes the number of networks whose nodes are contained in clusters Cl_m and Cl_n; here, when s'_{m,n} = 1, cs(m, n) = 0; otherwise cs(m, n) is calculated by equation (13).
In one embodiment, ICQ(Cl_i) denotes the score when only the nodes within a single cluster Cl_i are considered, and ICQ(A) denotes the score of the comparison result A over all clusters:
where w_max(u) denotes the maximum weight among the edges incident to node u, and E(Cl_i) denotes the set of edges in S_β that join nodes of cluster Cl_i.
Based on the same inventive concept, the present application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, said processor implementing the steps of any one of the methods when executing said program.
Based on the same inventive concept, the present application also provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of any of the methods.
Based on the same inventive concept, the present application also provides a processor for running a program, wherein the program runs to perform any one of the methods.
The invention has the beneficial effects that:
the method adopted by the invention achieves a good comparison effect and produces comparison results that are good in both the topological and the biological-function sense. The protein coverage at k equal to the number of input networks shows that the clustering method adopted by the invention places as many similar nodes as possible in the same cluster, and that nodes placed in the same cluster share the same biological function, which demonstrates the beneficial effects of the method.
Drawings
FIG. 1 is a flow chart of a global multi-network alignment method in combination with a network clustering method of the present invention.
Fig. 2 is an exemplary graph of the effect of computing node weights in the global multi-network comparison method of the present invention in combination with the network clustering method.
Fig. 3 is a schematic diagram of experimental results of different algorithms in the global multi-network comparison method combining with the network clustering method on a synthetic network data set.
Fig. 4 (a), fig. 4 (b), fig. 4 (c) are topology and biological index results of comparison results in a synthetic network in the global multi-network comparison method combining the network clustering method of the present invention.
Fig. 5 is an experimental result of different comparison algorithms under a real network in the global multi-network comparison method combining with the network clustering method.
Fig. 6 is a topology and biological index result of different comparison algorithms under a real network in the global multi-network comparison method combining with the network clustering method of the present invention.
FIG. 7 is a comparison of topology and biological index products of different alignment algorithms in a global multi-network alignment method combining with a network clustering method of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the invention and practice it.
The technical scheme of the invention is illustrated by the flow chart shown in FIG. 1:
a global multi-network comparison algorithm ACCMNA combining a network clustering method comprises the following steps:
step 1, reading protein interaction network data and sequence similarity data among different species, and setting parameters alpha and beta to construct a protein interaction network G and a sequence similarity network S;
step 2, preprocessing the sequence similarity data by using the parameter β to delete edges with low sequence similarity scores, obtaining a filtered network S_β;
step 3, calculating the weight of each node in all the networks, and transferring the weights of low-degree nodes and their edges to higher-degree nodes and edges according to the degrees of the nodes and their neighbor nodes;
step 4, generating candidate clusters in the corresponding searched graphs by adopting a network clustering algorithm;
step 5, calculating an objective function score of the current candidate comparison cluster, and selecting the candidate comparison cluster with the highest score as a comparison result;
step 6, outputting the comparison result and analyzing it.
(1) Reading input data:
First, the user-set parameter α, which determines the relative proportion of topology and sequence similarity, and the user-set parameter β, which is used to filter the sequence similarity information, are read. Second, the number k of input networks is determined, the protein interaction networks are read in batch, and the protein interaction network set G = {G_1, G_2, …, G_k} is constructed. The sequence similarity information among the different networks is then read and the sequence information network S is constructed.
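As a non-limiting illustration of this reading step, the following minimal Python sketch builds G and S from whitespace-separated text lines. The input layout ("u v" per interaction, "u v bit_score" per similarity record), the function names `read_ppi` and `read_similarity`, and the protein identifiers are assumptions made for the example; the text does not specify a file format.

```python
from collections import defaultdict

def read_ppi(lines):
    """Build one undirected network as {protein: set(neighbours)} from 'u v' lines."""
    adj = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or parts[0] == parts[1]:
            continue  # skip blank or malformed lines and self-loops
        u, v = parts[0], parts[1]
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def read_similarity(lines):
    """Build the similarity network S as {(u, v): bit_score}, with u < v as the key order."""
    S = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        key = tuple(sorted((parts[0], parts[1])))
        # keep the larger score if a pair is listed more than once (assumed policy)
        S[key] = max(float(parts[2]), S.get(key, 0.0))
    return S

# Hypothetical toy input: two species networks and two similarity records.
alpha, beta = 0.5, 0.4
G = [read_ppi(["a1 a2", "a2 a3"]), read_ppi(["b1 b2"])]
S = read_similarity(["a1 b1 120.0", "b2 a2 35.5"])
k = len(G)  # number of input networks
```

Here `G` plays the role of {G_1, …, G_k} and `S` the role of the sequence information network.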
(2) Filtering the sequence similarity data:
The sequence similarity information is filtered according to the read threshold coefficient β. First, the volume of sequence similarity information is huge: it grows exponentially with the number of networks, and the computational difficulty grows with it. Second, the sequence similarity information of many real species is currently incomplete, and some sequence information can impair the accuracy of the comparison result. Therefore, when sequence similarity is used to assist the comparison process, the sequence information needs to be filtered. Any edge (u, v) in the sequence information network S whose score is smaller than the threshold associated with that edge, that is, any edge satisfying inequality (1), is deleted from S, finally yielding the filtered network S_β:
w(u,v)<β×max(u,v) (1)
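A minimal sketch of the filter in inequality (1) is given below for illustration. The text does not define max(u,v); the sketch assumes it is the larger of the maximum edge weights incident to u and to v in S. The function name `filter_similarity` and the toy scores are likewise hypothetical.

```python
def filter_similarity(S, beta):
    """Return S_beta: drop every edge (u, v) with w(u, v) < beta * max(u, v)."""
    # w_max[x]: largest similarity score on any edge incident to node x
    w_max = {}
    for (u, v), w in S.items():
        w_max[u] = max(w_max.get(u, 0.0), w)
        w_max[v] = max(w_max.get(v, 0.0), w)
    return {
        (u, v): w
        for (u, v), w in S.items()
        # ASSUMPTION: max(u, v) is taken as max(w_max[u], w_max[v])
        if w >= beta * max(w_max[u], w_max[v])
    }

# Hypothetical toy similarity network
S = {("a1", "b1"): 100.0, ("a1", "b2"): 30.0, ("a2", "b2"): 50.0}
S_beta = filter_similarity(S, beta=0.4)
```

With β = 0.4, the edge (a1, b2) is removed because 30 is below 0.4 × 100, while the other two edges survive. Because the threshold is relative to each edge's end nodes, a weak edge can be kept when its end nodes have no stronger alternative.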
(3) Calculating the weight of all nodes in the network:
Because the network data contain a large number of nodes, the weight of each node is calculated first and the node with the largest weight is preferentially selected for comparison, which reduces the complexity of searching for nodes to compare. The degree of a node is the number of nodes connected to it in the network and can be used to measure the node's importance. FIG. 2 is an example of the effect after the node weights are calculated: the greater the degree of a node, the higher its importance in the network, so nodes of larger degree end up with larger weights.
FIG. 2 is a schematic diagram of node weights. The sizes of the nodes and the thicknesses of the edges in the figure represent the weights of the nodes and edges respectively, illustrating the effect achieved after the node weights are calculated (the greater the degree of a node, the greater its weight).
The method for calculating the node weight comprises the following steps:
(a) Setting an initial value for a node in a network and an edge between the nodes;
(b) Transferring the weight of the node with the node degree of 1 to the neighbor nodes and edges of the node;
(c) Transferring node weights with node degrees greater than 1 and less than 10 to neighboring nodes and edges thereof;
(d) Calculating importance scores of the nodes in the network according to the weights of the nodes and the weights of the edges connected with the nodes, wherein lambda represents the influence of the weight scores of the related edges on the weights of the nodes when the weights of the nodes are calculated;
calculating a sequence homology score associated with each node;
(e) Calculating a final weight score for each node in the network;
Weight(u)=α×importance(u)+(1-α)×B(v) (8)
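Steps (a) to (e) can be sketched as follows for illustration only. Equations (2) to (7) are not reproduced in this text, so the transfer rule used here (a low-degree node hands its own weight plus the weight of its incident edges, split equally, to neighbours of larger degree) is an assumption, as are the function name `node_weights` and the argument `seq_score`, which stands for the sequence score B of each node. Initial weights follow the application scenario (nodes 0, edges 1), and λ and α enter as in step (d) and equation (8).

```python
def node_weights(adj, seq_score, alpha=0.5, lam=0.2, max_degree=10):
    """Illustrative degree-based node weighting; the transfer rule is an assumption."""
    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    node_w = {u: 0.0 for u in adj}  # (a) node weights start at 0
    edge_w = {frozenset((u, v)): 1.0 for u in adj for v in adj[u]}  # (a) edges start at 1

    # (b), (c): visit nodes of degree below max_degree from small to large degree
    for u in sorted(adj, key=lambda x: (deg[x], x)):
        if deg[u] >= max_degree:
            break
        targets = [v for v in adj[u] if deg[v] > deg[u]]
        if not targets:
            continue
        # ASSUMED rule: pass on own weight plus incident edge weight, split equally
        total = node_w[u] + sum(edge_w[frozenset((u, v))] for v in adj[u])
        for v in targets:
            node_w[v] += total / len(targets)
        node_w[u] = 0.0

    weights = {}
    for u in adj:
        # (d) importance: node weight plus lam times the weight of its incident edges
        importance = node_w[u] + lam * sum(edge_w[frozenset((u, v))] for v in adj[u])
        # (e) equation (8): blend topology with the sequence score of the node
        weights[u] = alpha * importance + (1 - alpha) * seq_score.get(u, 0.0)
    return weights

# Hypothetical star network: hub "h" with three leaves and no sequence scores
star = {"h": {"a", "b", "c"}, "a": {"h"}, "b": {"h"}, "c": {"h"}}
w = node_weights(star, seq_score={})
```

In this toy star network the hub collects the weight of all three leaves, which matches the stated intent that nodes of larger degree end up with larger weights.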
(4) Searching and comparing by combining a clustering method:
A weighted k-partite graph S can be constructed from the sequence similarity information, in which the nodes represent the nodes of the corresponding networks and the weight of an edge between two nodes from different species networks is their sequence similarity bit score. In the similarity graph S, similar nodes are grouped into a cluster by a network clustering method. The clustering method based on the seed-and-extend approach comprises the following steps:
(a) Calculating the weighting degree of nodes in the current search network, and selecting the node with the largest weighting degree as a first seed to be added into the set S;
(b) Normalizing the weight of the first seed neighbor, and selecting the highest weight as the second seed to be added into S;
(c) Expanding in the network from the seeds generated in the previous two steps: selecting the node that has the greatest weight and the greatest edge weight connecting it to the nodes in S, adding the node if both constraint conditions are met, and ending the expansion if they are not; when a new candidate cluster is generated, if the number of source networks of the nodes contained in the cluster is smaller than the number of input networks, the current cluster is expanded further and the constraint conditions are raised, namely the values of T_d and T_s are increased.
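The seed-and-extend steps (a) to (c) can be sketched as below, purely as an illustration. The text does not define the two constraint conditions, so the sketch assumes that T_d bounds the fraction of cluster members the candidate is connected to and that T_s bounds the share of the candidate's weighted degree that points into the cluster; the normalisation in step (b) is also assumed. The follow-up expansion for clusters covering fewer source networks than the number of input networks is omitted. The function name `seed_and_extend` and the toy graph are hypothetical.

```python
def seed_and_extend(graph, t_d=0.5, t_s=0.5):
    """graph: symmetric {node: {neighbour: weight}}. Returns one candidate cluster."""
    wdeg = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    # (a) first seed: node with the largest weighted degree (sorted for stable ties)
    first = max(sorted(graph), key=wdeg.get)
    if not graph[first]:
        return [first]
    # (b) second seed: neighbour with the highest normalised edge weight
    #     ASSUMPTION: normalise by the neighbour's own weighted degree
    second = max(sorted(graph[first]), key=lambda v: graph[first][v] / wdeg[v])
    cluster = [first, second]
    while True:
        # (c) score every outside node by its total edge weight into the cluster
        score = {}
        for u in cluster:
            for v, w in graph[u].items():
                if v not in cluster:
                    score[v] = score.get(v, 0.0) + w
        if not score:
            break
        best = max(sorted(score), key=score.get)
        links = sum(1 for u in cluster if best in graph[u])
        density_ok = links / len(cluster) >= t_d        # ASSUMED meaning of T_d
        strength_ok = score[best] / wdeg[best] >= t_s   # ASSUMED meaning of T_s
        if not (density_ok and strength_ok):
            break  # constraints not met: end the expansion
        cluster.append(best)
    return cluster

# Hypothetical toy graph: a tight triangle a-b-c with a weakly attached node d
g = {
    "a": {"b": 10.0, "c": 8.0},
    "b": {"a": 10.0, "c": 9.0},
    "c": {"a": 8.0, "b": 9.0, "d": 1.0},
    "d": {"c": 1.0},
}
cluster = seed_and_extend(g)
```

In the toy graph, b is the first seed, a the second, c is added, and d is rejected because it connects to only one of the three cluster members.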
(5) Calculating the objective function score and selecting the generated cluster:
In each iteration, the objective function score of the candidate clusters generated in the previous step is calculated, and the candidate cluster with the highest objective function score is selected as the comparison cluster generated in that iteration. The objective function formula is as follows:
AS(A)=α×CIQ(A)+(1-α)ICQ(A) (11)
where α is a parameter balancing the proportions of topology and sequence similarity information, typically 0.5. CIQ measures the topological quality between clusters, and ICQ measures the sequence similarity among the nodes within clusters:
where the edge set between two different clusters Cl_m and Cl_n consists of the edges joining a node of Cl_m to a node of Cl_n, and cs(m, n) denotes the proportion of conserved edges between clusters Cl_m and Cl_n, calculated as follows:
where s'_{m,n} denotes the number of networks that contain an edge of that edge set, and s_{m,n} denotes the number of networks whose nodes are contained in clusters Cl_m and Cl_n. Here, when s'_{m,n} = 1, cs(m, n) = 0; otherwise cs(m, n) is calculated by equation (13).
In the following formula for calculating ICQ, ICQ(Cl_i) denotes the score when only the nodes within a single cluster Cl_i are considered, and ICQ(A) denotes the score of the comparison result A over all clusters:
where w_max(u) denotes the maximum weight among the edges incident to node u, and E(Cl_i) denotes the set of edges in S_β that join nodes of cluster Cl_i.
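For illustration, the objective in equation (11) can be sketched as below. The equation images for CIQ and cs(m, n) are not reproduced in this text; the sketch assumes the forms used for the CIQ measure in the BEAMS work cited in the Background, namely cs(m, n) = s'_{m,n} / s_{m,n} and CIQ as the edge-count-weighted average of cs over cluster pairs. ICQ is passed in as a precomputed value because its formula is not recoverable from the text. All function names are hypothetical.

```python
def cs(s_prime, s):
    """Conserved-edge proportion between two clusters."""
    if s_prime <= 1 or s == 0:
        return 0.0  # the text sets cs(m, n) = 0 when s' = 1
    return s_prime / s  # ASSUMED form of equation (13)

def ciq(cluster_pairs):
    """cluster_pairs: list of (edge_count, s_prime, s), one entry per cluster pair."""
    total_edges = sum(e for e, _, _ in cluster_pairs)
    if total_edges == 0:
        return 0.0
    # ASSUMED form of equation (12): edge-count-weighted average of cs
    return sum(e * cs(sp, s) for e, sp, s in cluster_pairs) / total_edges

def alignment_score(ciq_value, icq_value, alpha=0.5):
    """Equation (11): AS(A) = alpha * CIQ(A) + (1 - alpha) * ICQ(A)."""
    return alpha * ciq_value + (1 - alpha) * icq_value

# Hypothetical example: two cluster pairs and a made-up ICQ value of 0.7
pairs = [(4, 3, 4), (2, 1, 3)]
score = alignment_score(ciq(pairs), icq_value=0.7)
```

In the example the second cluster pair contributes nothing because its edges occur in only one network, so it counts toward the denominator of CIQ but not the numerator.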
The technical scheme of the invention has the beneficial effects that:
To verify the above, tests were performed on real networks and on synthetic networks. The real networks are five eukaryotes from IsoBase; Table 1 gives the number of nodes and edges for each species in the real networks. The synthetic networks come from three data sets generated under the three NAPAbench growth models CG, DMC and DMR. Each data set contains 8 networks of 1000 nodes each. The per-network counts reported are: 3985 for each network of the CG model; 1919, 1853, 1923, 1840, 1867, 1848, 1818 and 1867 for the eight networks of the DMC model; and 2031, 2092, 1967, 1977, 1959, 1998, 2030 and 2056 for the eight networks of the DMR model.
Table 1 details of the real network
FIG. 3 is a statistical comparison, for four different algorithms on the synthetic networks, of protein coverage and of the coverage of biologically consistent proteins. Protein coverage is the number of proteins contained in the comparison result. Biological consistency means that if the proteins in a cluster are all annotated with the same GO term, the cluster and the proteins in it are considered consistent; consistent protein coverage is the number of consistent proteins contained in the comparison result. Both values are usually broken down by the number k of source networks represented in a cluster, and results with larger k indicate more meaningful comparison results. FIG. 3 shows that, apart from IsoRankN, the algorithms differ little in total protein coverage and consistent protein coverage, but ACCMNA produces the largest protein coverage and consistent protein coverage at k=8 among the four algorithms, which indicates that the comparison results produced by ACCMNA are of greater research value and achieve good biological results.
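The two coverage counts described above can be illustrated with a short sketch. It treats a cluster as consistent when all of its proteins share at least one GO term, which is one reading of the definition in the text; the function name `coverage_by_k`, the protein identifiers and the GO labels are placeholders invented for the example.

```python
from collections import Counter

def coverage_by_k(clusters, network_of, go_terms):
    """Return ({k: protein coverage}, {k: consistent protein coverage})."""
    total, consistent = Counter(), Counter()
    for cluster in clusters:
        # k: number of distinct source networks represented in the cluster
        k = len({network_of[p] for p in cluster})
        total[k] += len(cluster)
        annotations = [go_terms.get(p, set()) for p in cluster]
        shared = set.intersection(*annotations) if annotations else set()
        if shared:  # every protein in the cluster carries a common GO term
            consistent[k] += len(cluster)
    return dict(total), dict(consistent)

# Hypothetical comparison result over three networks A, B, C
clusters = [["a1", "b1", "c1"], ["a2", "b2"]]
network_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
go_terms = {
    "a1": {"GO:x", "GO:y"}, "b1": {"GO:x"}, "c1": {"GO:x", "GO:z"},
    "a2": {"GO:p"}, "b2": {"GO:q"},
}
total, consistent = coverage_by_k(clusters, network_of, go_terms)
```

Here the three-network cluster is consistent and contributes three proteins at k=3, while the two-network cluster counts toward coverage at k=2 but not toward consistent coverage.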
FIG. 4(a), FIG. 4(b) and FIG. 4(c) show the values of several metrics of topological and functional consistency on the synthetic network data sets. CIQ measures the proportion of conserved edges in the comparison result, SPE measures its biological specificity, Sen measures its biological sensitivity, and MNE and nGOC measure its biological consistency. For all of these metrics larger is better, except MNE, for which smaller is better.
FIG. 5 is a statistical comparison of protein coverage and consistent protein coverage for the four algorithms on the real network data set; the results of the algorithms do not differ greatly.
FIG. 6 shows the values of several metrics of topological and functional consistency on the real network data set; the results show that the ACCMNA algorithm compares favorably with the other three algorithms on each index.
FIG. 7 shows the product of a topological index and a biological consistency index. Producing a result that balances topology and biology is a difficult point of network comparison, and most existing algorithms obtain a good result on only one of the two; therefore, to highlight the advantage of the ACCMNA algorithm, the two indexes are multiplied together to show the improvement over the prior art.
The above experiments demonstrate that the method adopted by the invention achieves a good comparison effect and produces comparison results that are good in both the topological and the biological-function sense. The protein coverage at k equal to the number of input networks shows that the clustering method adopted by the invention places as many similar nodes as possible in the same cluster, and that nodes placed in the same cluster share the same biological function, which demonstrates the beneficial effect of the method.
One specific application scenario of the present invention is given below:
taking the IsoBase derived eukaryotic dataset as an example, five real-network comparisons were made.
(1) Reading an input protein interaction network to construct a network G, reading sequence similarity data among different species to construct a similarity weighted network S, and setting parameters alpha to be 0.5 and beta to be 0.4;
(2) Filtering the sequence similarity data, deleting each edge whose score is below the threshold associated with its end nodes, and obtaining the filtered similarity weighted network S_β;
(3) According to the degree of the nodes in the network and the neighbor nodes thereof, calculating the weight of each node:
firstly, initializing weights of all nodes and edges in a network, wherein the weights of all nodes are 0, if edges exist between the nodes, the weights are 1, otherwise, the weights are 0;
(1) Traversing nodes with the node degree smaller than 10 from small to large, and transferring the weights of the nodes and the edges to neighbor nodes and edges with larger corresponding degrees;
(2) Adding weights on the relevant edges of all the nodes to the nodes by taking lambda as a parameter, wherein lambda is set to be 0.2;
(3) Calculating a sequence similarity score associated with each node;
(4) Combining the topological weights with the sequence scores gives the final weight for each node, controlling the scale at α=0.5.
(4) Generating a comparison cluster by combining a clustering method:
(1) The first iteration needs to preferentially select the node with the heavy weight calculated in the step 3 to generate a neighborhood graph;
(2) Calculating the weighting degree of all nodes in the neighborhood graph, taking the largest weighting degree as a first Seed, and selecting the node with the largest weighting degree as a second Seed from the neighbor nodes of the seeds, wherein the two seeds form a Seed set Seed;
(3)selecting the node with the maximum edge weight and the maximum edge weight related to the Seed set from the neighbor nodes of the nodes in the Seed, and judging whether the node meets the added limiting condition, wherein the limiting condition is T s ,T d Are all set to 0.5; if yes, adding, otherwise ending the search;
(4) Generating a neighborhood graph according to the current candidate cluster in each iteration, and repeating the steps 2-3 to generate a new candidate cluster;
(5) Calculating the objective function score of the candidate cluster generated in step 4, selecting the candidate cluster with high score as a comparison result, and iterating steps 4 and 5 continuously until no new candidate cluster is formed in the network, wherein alpha in the objective function is 0.5.
(6) Outputting the comparison result and analyzing the result.
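The parameter values used in this scenario can be gathered into one configuration object. The following is only an illustrative sketch: the class name `AlignmentParams`, its field names, and the check that the ratio-like parameters lie in [0, 1] are assumptions made here for illustration and are not part of the described method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentParams:
    """Hypothetical container for the values quoted in the scenario."""
    alpha: float = 0.5        # balance between topology and sequence similarity
    beta: float = 0.4         # filtering coefficient for sequence similarity edges
    lam: float = 0.2          # share of incident edge weight added to a node
    degree_cutoff: int = 10   # nodes below this degree pass their weight on
    t_s: float = 0.5          # expansion threshold T_s
    t_d: float = 0.5          # expansion threshold T_d

    def __post_init__(self):
        # Assumption: all ratio-like parameters are meant to lie in [0, 1].
        for name in ("alpha", "beta", "lam", "t_s", "t_d"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


params = AlignmentParams()
```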
The global multi-network comparison method combined with the network clustering method provided by the invention has been described in detail above; the following points should be noted:
1. In the initial stage of the comparison, the importance of the different nodes in the network is estimated from the degree of each node, its adjacent nodes, and the sequence similarity information among nodes. Combining topological information with sequence similarity information in this way can effectively address the imbalance between topological quality and biological-function quality in network comparison.
2. The algorithm generates the comparison in combination with a clustering method. The clustering method is used to search for similar nodes, so that as many similar nodes as possible are compared into the same cluster, and the effect of gene duplication and similar events on the accuracy of the network comparison can be avoided.
The above-described embodiments are merely preferred embodiments for fully explaining the present invention, and the scope of the present invention is not limited thereto. Equivalent substitutions and modifications will occur to those skilled in the art based on the present invention, and are intended to be within the scope of the present invention. The protection scope of the invention is subject to the claims.

Claims (9)

1. A global multi-network comparison method in combination with a network clustering method, comprising:
step 1, reading protein interaction network data and sequence similarity data among different species, and setting parameters alpha and beta to construct a protein interaction network G and a sequence similarity network S;
step 2, preprocessing the sequence similarity data, deleting the edges with smaller sequence similarity scores by using the parameter β to obtain a filtered network S_β;
step 3, calculating the weight of each node in all the networks, and transferring the weights of the nodes and edges with small degrees to the nodes and edges with larger degrees according to the degrees of the nodes and their neighbor nodes;
the step 3 is specifically as follows:
(a) Setting an initial value for a node in a network and an edge between the nodes;
(b) Transferring the weight of the node with the node degree of 1 to the neighbor nodes and edges of the node;
(c) Transferring node weights with node degrees greater than 1 and less than 10 to neighboring nodes and edges thereof;
(d) Calculating importance scores of the nodes in the network according to the weights of the nodes and the weights of the edges connected with the nodes, wherein lambda represents the influence of the weight scores of the related edges on the weights of the nodes when the weights of the nodes are calculated;
calculating a sequence homology score associated with each node;
(e) Calculating a final weight score for each node in the network;
Weight(u) = α × importance(u) + (1 − α) × B(u) (8)
(f) Searching and comparing by combining a clustering method:
a weighted complete k-partite graph S can be constructed according to the sequence similarity information, wherein the nodes represent the nodes in the corresponding networks, and the weight of the edge between two nodes from different species networks is the sequence similarity bit score; in the similarity graph S, similar nodes are grouped into one cluster by a network clustering method;
step 4, generating candidate clusters in the corresponding searched graphs by adopting a network clustering algorithm;
step 5, calculating an objective function score of the current candidate comparison cluster, and selecting the candidate comparison cluster with the highest score as a comparison result;
step 6, outputting the comparison result, and analyzing the comparison result.
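The weight combination in sub-steps (d) and (e) of step 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the importance score is assumed to be the node weight plus λ times the sum of the weights of its incident edges, and the topological and sequence scores are assumed to be on comparable scales.

```python
def importance(node_weight, incident_edge_weights, lam=0.2):
    # Assumed form: node weight plus lambda times the total incident edge weight.
    return node_weight + lam * sum(incident_edge_weights)


def final_weight(importance_u, seq_score_u, alpha=0.5):
    # Formula (8): alpha balances the topological importance of node u
    # against its sequence similarity score B(u).
    return alpha * importance_u + (1.0 - alpha) * seq_score_u


# Example with made-up numbers: a node of weight 1.0 with two incident edges.
imp = importance(1.0, [1.0, 2.0], lam=0.2)
weight = final_weight(imp, 0.4, alpha=0.5)
```

With α = 0.5 the two sources of evidence contribute equally; a larger α favors topology.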
2. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the step 1 is specifically as follows:
firstly, reading a parameter α, set by the user, that determines the proportion of topology and sequence similarity, and a parameter β, set by the user, that is used for filtering the sequence similarity information; secondly, determining the number k of input networks, reading the protein interaction networks in batch, and constructing the protein interaction network G = {G_1, G_2, …, G_k}; then reading the sequence similarity information among the different networks and constructing a sequence information network S.
3. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the step 2 is specifically as follows:
filtering the sequence similarity information according to the read threshold coefficient β; firstly, because the amount of sequence similarity information is huge, the data volume increases exponentially with the number of networks, and the computational difficulty increases accordingly; secondly, because the sequence similarity information of many real species is currently incomplete, some sequence information can affect the accuracy of the alignment result; therefore, when sequence similarity is used to assist the alignment process, the sequence information needs to be filtered; for any edge (u, v) in the sequence information network S, if the score of the edge is smaller than the threshold associated with that edge, as expressed by formula (1), the edge is deleted from the network S, finally obtaining the filtered network S_β:
w(u,v) < β × max(u,v) (1).
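The filtering rule of formula (1) can be sketched as follows. The text does not define max(u,v) precisely; this sketch assumes it is the larger of the maximum edge weights incident to u and to v. The function name and the dictionary-based edge representation are likewise assumptions for illustration.

```python
def filter_similarity(edges, beta=0.4):
    """Return the edges of S kept in S_beta.

    edges maps a node pair (u, v) to its sequence similarity score.
    """
    # Largest score seen on any edge incident to each node.
    best = {}
    for (u, v), w in edges.items():
        best[u] = max(best.get(u, w), w)
        best[v] = max(best.get(v, w), w)

    kept = {}
    for (u, v), w in edges.items():
        # Assumed reading of max(u, v) in formula (1).
        threshold = beta * max(best[u], best[v])
        if w >= threshold:  # edges with w < threshold are deleted
            kept[(u, v)] = w
    return kept


scores = {("a1", "b1"): 100.0, ("a1", "b2"): 30.0, ("a2", "b2"): 50.0}
filtered = filter_similarity(scores, beta=0.4)
```

In this toy input, the edge (a1, b2) falls below 0.4 × 100 and is removed, while the other two edges are kept.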
4. The global multi-network comparison method combining the network clustering method according to claim 1, wherein in the step 3, the clustering method based on the seed and the expansion method is divided into the following steps:
(a) Calculating the weighting degree of nodes in the current search network, and selecting the node with the largest weighting degree as a first seed to be added into the set S;
(b) Normalizing the weight of the first seed neighbor, and selecting the highest weight as the second seed to be added into S;
(c) Expanding in the network from the seeds generated in the previous two steps: selecting the node with the greatest weight and the greatest edge weight connecting it to the nodes in S, adding the node if the two constraint conditions are met, and ending the expansion otherwise; when a new candidate cluster is generated, if the number of source networks of the nodes contained in the cluster is smaller than the number of input networks, expanding the current cluster and adjusting the constraint conditions, namely increasing the values of T_d and T_s.
5. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the step 5 is specifically as follows:
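The seed selection in sub-steps (a) and (b) can be sketched as follows. This is a simplified illustration: the graph is assumed to be a dictionary of dictionaries holding edge weights, the second seed is chosen by weighted degree without the normalization mentioned in (b), and the expansion step (c) is omitted because the constraint conditions on T_s and T_d are not defined in this section. The function names are hypothetical.

```python
def weighted_degree(adj, u):
    # Sum of the weights of all edges incident to u.
    return sum(adj[u].values())


def pick_seeds(adj):
    """Return the seed set as a list: [first_seed] or [first_seed, second_seed]."""
    # First seed: node with the largest weighted degree in the search graph.
    first = max(adj, key=lambda u: weighted_degree(adj, u))
    neighbors = adj[first]
    if not neighbors:
        return [first]
    # Second seed: neighbor of the first seed with the largest weighted degree.
    second = max(neighbors, key=lambda v: weighted_degree(adj, v))
    return [first, second]


graph = {
    "a": {"b": 3.0, "c": 1.0},
    "b": {"a": 3.0, "c": 2.0},
    "c": {"a": 1.0, "b": 2.0},
}
seeds = pick_seeds(graph)
```

Here node b has the largest weighted degree (5.0), and among its neighbors node a (4.0) outranks node c (3.0).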
for each iteration, calculating the objective function score of the candidate cluster generated in the previous step, and selecting the comparison cluster with the highest objective function score from the candidate clusters as the comparison cluster generated in the iteration; the objective function formula is as follows:
AS(A)=α×CIQ(A)+(1-α)ICQ(A) (11)
wherein α is a parameter balancing the proportion of topological and sequence similarity information, and its value is usually 0.5; CIQ measures the topological quality between clusters, and ICQ measures the sequence similarity between nodes within clusters:
wherein E(Cl_m, Cl_n) denotes the set of edges between nodes in the two different clusters Cl_m and Cl_n, and cs(m, n) denotes the proportion of conserved edges between the clusters Cl_m and Cl_n, calculated as follows:
wherein s'_{m,n} denotes the number of networks containing edges in E(Cl_m, Cl_n), and s_{m,n} denotes the number of networks that contain nodes of the clusters Cl_m and Cl_n; here, when s'_{m,n} = 1, cs(m, n) = 0; otherwise, it is calculated by equation 13.
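The objective function of formula (11) can be sketched as follows. Equations 12 and 13 are not reproduced in this text, so two forms are assumed here, following the way CIQ is commonly defined for multiple network alignment: cs(m, n) is taken as the ratio s'_{m,n} / s_{m,n}, and CIQ as the average of cs(m, n) over cluster pairs weighted by the number of edges between them. All function names are hypothetical.

```python
def conserved_ratio(s_edge, s_node):
    """cs(m, n), with s_edge = s'_{m,n} and s_node = s_{m,n}."""
    # The text sets cs(m, n) = 0 when s' = 1; s' = 0 is treated the same
    # way here as an assumption.
    if s_edge <= 1 or s_node == 0:
        return 0.0
    return s_edge / s_node  # assumed form of equation 13


def ciq(cluster_pairs):
    """cluster_pairs: iterable of (num_edges, s_edge, s_node) per cluster pair."""
    pairs = list(cluster_pairs)
    total_edges = sum(n for n, _, _ in pairs)
    if total_edges == 0:
        return 0.0
    # Assumed form of equation 12: edge-count-weighted average of cs(m, n).
    return sum(n * conserved_ratio(se, sn) for n, se, sn in pairs) / total_edges


def alignment_score(ciq_value, icq_value, alpha=0.5):
    # Formula (11): AS(A) = alpha * CIQ(A) + (1 - alpha) * ICQ(A).
    return alpha * ciq_value + (1.0 - alpha) * icq_value


score = alignment_score(ciq([(10, 2, 4), (10, 1, 3)]), 0.75, alpha=0.5)
```

In the example, the first cluster pair has edges in 2 of 4 networks (cs = 0.5) and the second in only 1 network (cs = 0), giving CIQ = 0.25.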
6. The global multi-network comparison method combining network clustering methods according to claim 5, wherein, in the following formula for calculating ICQ, ICQ(Cl_i) denotes the score considering only the nodes within one cluster Cl_i, and ICQ(A) denotes the score of the comparison result A considering all clusters:
wherein w_max(u) denotes the maximum weight among the edges incident to node u, and E(Cl_i) denotes the set of edges in S_β that join nodes in the cluster Cl_i.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 6 when the program is executed.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
9. A processor for running a program, wherein the program when run performs the method of any one of claims 1 to 6.
CN202010449468.4A 2020-05-25 2020-05-25 Global multi-network comparison method combined with network clustering method Active CN111599406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010449468.4A CN111599406B (en) 2020-05-25 2020-05-25 Global multi-network comparison method combined with network clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010449468.4A CN111599406B (en) 2020-05-25 2020-05-25 Global multi-network comparison method combined with network clustering method

Publications (2)

Publication Number Publication Date
CN111599406A CN111599406A (en) 2020-08-28
CN111599406B true CN111599406B (en) 2023-08-04

Family

ID=72187760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010449468.4A Active CN111599406B (en) 2020-05-25 2020-05-25 Global multi-network comparison method combined with network clustering method

Country Status (1)

Country Link
CN (1) CN111599406B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant