CN111599406A - Global multi-network comparison method combined with network clustering method - Google Patents

Global multi-network comparison method combined with network clustering method

Info

Publication number
CN111599406A
Authority
CN
China
Prior art keywords
network
node
weight
sequence similarity
comparison
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010449468.4A
Other languages
Chinese (zh)
Other versions
CN111599406B (en)
Inventor
陈璟
黄佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN202010449468.4A
Publication of CN111599406A
Application granted
Publication of CN111599406B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a global multi-network comparison method combined with a network clustering method, comprising: step 1, reading protein interaction network data and sequence similarity data among different species, setting parameters α and β, and constructing a protein interaction network G and a sequence similarity network S; step 2, preprocessing the sequence similarity data by using the parameter β to delete edges with low sequence similarity scores, obtaining a filtered network S_β; step 3, calculating the weight of each node in all the networks; and step 4, generating candidate clusters in the corresponding search graph by using a network clustering algorithm. The beneficial effects of the invention are that the method achieves a good comparison effect and produces comparison results that are good in both topological and biological-function terms.

Description

Global multi-network comparison method combined with network clustering method
Technical Field
The invention relates to the field of global multi-network comparison, in particular to a global multi-network comparison method combined with a network clustering method.
Background
Since the beginning of the 21st century, the amount of data available in different research areas, such as social networking, commercial transactions and molecular biology, has grown explosively. Representing proteins and their interactions as networks (graphs) and analyzing them with new strategies is a current research focus. In a protein interaction network, nodes represent proteins and edges represent interactions between two proteins.
Comparison studies based on protein interaction networks are extensive and fall mainly into pairwise (two-network) comparison and multi-network (three or more networks) comparison. Most early network comparison algorithms are pairwise and aim to find, across two networks, the node mapping with the highest similarity. Since 2008, research has gradually shifted to multi-network comparison algorithms, which obtain the mapping relations of nodes among multiple networks simultaneously and can therefore reveal deeper biological significance.
Because the network alignment problem can be regarded as a subgraph isomorphism problem, it is NP-complete, and it has therefore been tackled with heuristic methods. Early work used greedy algorithms; after years of development, a growing range of methods has been applied to network comparison, for example the Hungarian algorithm, seed-and-extend matching, simulated annealing and genetic algorithms, and in recent years deep learning methods as well.
The traditional technology has the following technical problems:
1. Global multi-network alignment of protein interaction networks originated in 2009 with the IsoRankN algorithm proposed by Liao et al. (IsoRankN: spectral methods for global alignment of multiple protein networks [J]. Bioinformatics, 25(12): i253-i258). The method builds a similarity score matrix of nodes across different networks and generates the multi-network comparison result by spectral clustering. As an early algorithm, however, it is far weaker in both topological and biological-function terms than the multi-network comparison algorithms proposed in recent years. Because it is a classical multi-network comparison algorithm, many later algorithms are benchmarked against it.
2. The SMETANA algorithm was proposed in 2013 by Sahraeian et al. (SMETANA: Accurate and Scalable Algorithm for Probabilistic Alignment of Large-Scale Biological Networks [J]. PLoS ONE, 8(7): e67995). It first computes a similarity score matrix between the nodes of different networks using a semi-Markov random walk model, then refines the alignment probabilities between nodes through two probabilistic consistency transformations, and finally constructs the alignment with a greedy seed-and-extend method. Experimental results on many data sets show that SMETANA obtains good topological results, but the better topological significance (a higher proportion of conserved edges) comes at the cost of functional similarity among some nodes; that is, SMETANA mostly yields comparison results with good topological significance and poor biological-function significance.
3. The BEAMS algorithm, proposed in 2014 by Ferhat Alkan et al. (BEAMS: backbone extraction and merge strategy for the global many-to-many alignment of multiple PPI networks. Bioinformatics, 2014, 30(4): 531-539), is a global multi-network alignment algorithm built on the seed-and-extend framework. It generates alignment results by searching for weighted maximum cliques (backbones) in the network and consists of two main parts, backbone extraction and backbone merging. BEAMS obtains alignments with good biological-function significance, but their topological significance is poor (the proportion of conserved edges is low), and the alignment process relies excessively on sequence similarity information between nodes, so topological and biological-function significance are not well balanced.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a global multi-network comparison method combined with a network clustering method that addresses the following: the effect of incomplete sequence similarity data on network comparison accuracy; the complexity of selecting nodes to search and compare at larger data scales; the effect on network comparison accuracy of the many similar nodes that gene duplication produces within each network; and the imbalance between the topological and biological-function quality of multi-network alignments.
In order to solve the above technical problem, the present invention provides a global multi-network comparison method combining a network clustering method, including:
step 1, reading protein interaction network data and sequence similarity data among different species, setting parameters α and β, and constructing a protein interaction network G and a sequence similarity network S;
step 2, preprocessing the sequence similarity data, and deleting edges with low sequence similarity scores by using the parameter β to obtain the filtered network S_β;
step 3, calculating the weight of each node in all the networks, and transferring the weights of low-degree nodes and their edges to higher-degree nodes and edges according to the degrees of each node and its neighbor nodes;
step 4, generating candidate clusters in the corresponding search graph by using a network clustering algorithm;
step 5, calculating the objective function score of each current candidate comparison cluster, and selecting the candidate comparison cluster with the highest score as the comparison result;
step 6, outputting the comparison result and analyzing it.
In one embodiment, step 1 is specifically as follows:
first, reading the user-set parameter α, which determines the relative proportion of topology and sequence similarity, and the parameter β, which is used to filter the sequence similarity information; second, determining the number k of input networks, reading the protein interaction networks in batches, and constructing the protein interaction network set G = {G_1, G_2, …, G_k}; then reading the sequence similarity information among the different networks and constructing the sequence information network S.
In one embodiment, step 2 is specifically as follows:
filtering the sequence similarity information according to the read threshold coefficient β. First, the amount of sequence similarity information is huge: the data volume grows exponentially with the number of networks, and the computational difficulty grows with it. Second, because the sequence similarity information of many real species is currently incomplete, some sequence information can impair the accuracy of the comparison result. Therefore, when sequence similarity is used to assist the comparison process, the sequence information needs to be filtered. For any edge (u, v) in the sequence information network S, if the score of the edge is smaller than the threshold associated with that edge, that is, if condition (1) holds, the edge is deleted from S, finally yielding the filtered network S_β:
w(u,v)<β×max(u,v) (1)
In one embodiment, step 3 is specifically as follows:
(a) setting initial values for nodes in the network and edges among the nodes;
Figure BDA0002507156250000041
Figure BDA0002507156250000042
(b) transferring the weight of the node with the node degree of 1 to the neighbor node and the edge of the node;
Figure BDA0002507156250000043
(c) transferring the node weight with the node degree larger than 1 and smaller than 10 to the neighbor nodes and edges of the node weight;
Figure BDA0002507156250000044
(d) calculating the importance score of the node in the network according to the weight of the node and the weight of the edge connected with the node, wherein lambda represents the influence of the weight score of the related edge on the weight of the node when the weight of the node is calculated;
Figure BDA0002507156250000045
calculating the sequence homology score related to each node;
Figure BDA0002507156250000046
(e) calculating the final weight score of each node in the network;
Weight(u)=α×importance(u)+(1-α)×B(v) (8)
(f) searching and comparing in combination with a clustering method:
a weighted complete k-partite graph S can be constructed from the sequence similarity information, in which the nodes represent the nodes of the corresponding networks and the weight of an edge between two nodes from networks of different species is the sequence similarity bit score; similar nodes are grouped into clusters in the similarity graph S by a network clustering method.
In one embodiment, step 4, which is a clustering method based on the seed-and-extend approach, is divided into the following steps:
(a) calculating the weighting degree of the nodes in the current search network, and selecting the node with the maximum weighting degree as a first seed to be added into the set S;
Figure BDA0002507156250000051
(b) normalizing the weight of the first seed neighbor, and selecting the seed with the highest weight as a second seed to be added into S;
(c) extending in the network from the seeds generated in the first two steps: selecting the node of maximum weight that is connected to a node in S, adding it if the two constraint conditions are met, and ending the extension otherwise; when a new candidate cluster is generated, if the number of source networks of the nodes in the cluster is less than the number of input networks, the current cluster is extended further and the constraint thresholds are raised, i.e. the values of T_d and T_s are increased;
Figure BDA0002507156250000052
in one embodiment, step 5 is specifically as follows:
for the candidate clusters generated in the previous step of each iteration, calculating their objective function scores and selecting the candidate cluster with the highest score as the comparison cluster produced by that iteration; the objective function is formulated as follows:
AS(A)=α×CIQ(A)+(1-α)×ICQ(A) (11)
where α is the parameter that balances topology against sequence similarity information and is usually set to 0.5; CIQ measures the topological quality between clusters, and ICQ measures the sequence similarity between the nodes within a cluster:
Figure BDA0002507156250000061
where
Figure BDA0002507156250000062
lies respectively in the different clusters Cl_m and Cl_n; cs(m, n) denotes the proportion of conserved edges between clusters Cl_m and Cl_n and is calculated by the following formula:
Figure BDA0002507156250000063
where s'_{m,n} denotes the number of networks that contain an edge in
Figure BDA0002507156250000064
and s_{m,n} denotes the number of networks whose nodes are contained in clusters Cl_m and Cl_n; when s'_{m,n} = 1, cs(m, n) = 0; otherwise cs(m, n) is calculated by equation (13).
In one embodiment, in the following formulas for calculating ICQ, ICQ(Cl_i) denotes the score that considers only the nodes within a single cluster Cl_i, and ICQ(A) denotes the score that considers all clusters in the alignment A:
Figure BDA0002507156250000065
Figure BDA0002507156250000066
where w_max(u) denotes the maximum weight of the edges incident to node u, and E(Cl_i) denotes the set of edges in S_β that connect the nodes in cluster Cl_i.
Based on the same inventive concept, the present application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when executing the program.
Based on the same inventive concept, the present application also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of any of the methods.
Based on the same inventive concept, the present application further provides a processor for executing a program, wherein the program executes to perform any one of the methods.
The invention has the beneficial effects that:
the method adopted by the invention achieves a good comparison effect and produces comparison results that are good in both topological and biological-function terms. k-coverage is the protein coverage of the clusters that span all k input networks; it shows that the clustering method adopted by the invention aligns as many similar nodes as possible into the same cluster, and that nodes aligned into the same cluster share the same biological function, which demonstrates the beneficial effect of the method.
Drawings
FIG. 1 is a flowchart of a global multi-network comparison method in combination with a network clustering method according to the present invention.
FIG. 2 is a diagram illustrating an effect of calculating node weights in the global multi-network comparison method combined with the network clustering method according to the present invention.
FIG. 3 is a schematic diagram of experimental results of different algorithms in the global multi-network comparison method combined with the network clustering method on a synthetic network data set according to the present invention.
FIG. 4 shows the topology and biological index results of the comparison results in the synthetic network in the global multi-network comparison method combined with the network clustering method according to the present invention.
FIG. 5 is an experimental result of different comparison algorithms under a real network in the global multi-network comparison method combined with the network clustering method of the present invention.
FIG. 6 shows the topology and biological index results of different comparison algorithms under the real network in the global multi-network comparison method combined with the network clustering method of the present invention.
FIG. 7 is a comparison result of products of topology and biological indexes of different comparison algorithms in the global multi-network comparison method combined with the network clustering method according to the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
The technical solution of the present invention is illustrated by the flowchart shown in FIG. 1:
a global multi-network comparison algorithm, ACCMNA, combined with a network clustering method, comprises the following steps:
step 1, reading protein interaction network data and sequence similarity data among different species, setting parameters α and β, and constructing a protein interaction network G and a sequence similarity network S;
step 2, preprocessing the sequence similarity data, and deleting edges with low sequence similarity scores by using the parameter β to obtain the filtered network S_β;
step 3, calculating the weight of each node in all the networks, and transferring the weights of low-degree nodes and their edges to higher-degree nodes and edges according to the degrees of each node and its neighbor nodes;
step 4, generating candidate clusters in the corresponding search graph by using a network clustering algorithm;
step 5, calculating the objective function score of each current candidate comparison cluster, and selecting the candidate comparison cluster with the highest score as the comparison result;
step 6, outputting the comparison result and analyzing it.
(1) Reading input data:
First, the user-set parameter α, which determines the relative proportion of topology and sequence similarity, and the parameter β, which is used to filter the sequence similarity information, are read. Second, the number k of input networks is determined, the protein interaction networks are read in batches, and the protein interaction network set G = {G_1, G_2, …, G_k} is constructed. The sequence similarity information among the different networks is then read and the sequence information network S is constructed.
(2) Filtering sequence similarity data:
The sequence similarity information is filtered according to the read threshold coefficient β. First, the amount of sequence similarity information is huge: the data volume grows exponentially with the number of networks, and the computational difficulty grows with it. Second, because the sequence similarity information of many real species is currently incomplete, some sequence information can impair the accuracy of the comparison result. Therefore, when sequence similarity is used to assist the alignment process, the sequence information needs to be filtered. For any edge (u, v) in the sequence information network S, if the score of the edge is smaller than the threshold associated with that edge, that is, if condition (1) holds, the edge is deleted from S, finally yielding the filtered network S_β:
w(u,v)<β×max(u,v) (1)
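The filtering rule of condition (1) can be sketched as follows. This is only an illustration under stated assumptions: the text does not define max(u,v) precisely, so the sketch takes it to be the larger of the best similarity scores incident to u and to v; the function name `filter_similarity` and the dictionary layout are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def filter_similarity(edges, beta):
    """Return the edges of S that survive condition (1).

    edges: dict mapping (u, v) -> sequence similarity score w(u, v).
    Assumption: max(u, v) is the larger of the best scores seen at u and at v.
    """
    best = defaultdict(float)  # best score incident to each node
    for (u, v), w in edges.items():
        best[u] = max(best[u], w)
        best[v] = max(best[v], w)

    kept = {}
    for (u, v), w in edges.items():
        threshold = beta * max(best[u], best[v])
        if w >= threshold:  # delete the edge when w(u, v) < beta * max(u, v)
            kept[(u, v)] = w
    return kept


# Toy similarity network with made-up scores.
scores = {("a1", "b1"): 100.0, ("a1", "b2"): 30.0, ("a2", "b2"): 50.0}
filtered = filter_similarity(scores, beta=0.4)
```

With β = 0.4 the edge (a1, b2) is dropped because 30 is below 0.4 × 100, while the other two edges are kept.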
(3) calculating the weight of all nodes in the network:
Because the network data contain a large number of nodes, the weight of each node is calculated first, and the node with the largest weight is preferentially selected for comparison; this reduces the complexity of searching for and comparing nodes. The degree of a node is the number of nodes connected to it in the network and measures its importance: the larger the degree, the more important the node. FIG. 2 shows an example of the effect after the node weights are calculated: the greater the degree of a node, the higher its importance in the network and, in the end, the greater its weight.
FIG. 2 is a schematic diagram of node weights. The size of a node and the thickness of an edge represent the weight of the node and of the edge, respectively.
The method for calculating the node weight comprises the following steps:
(a) setting initial values for nodes in the network and edges among the nodes;
Figure BDA0002507156250000091
Figure BDA0002507156250000092
(b) transferring the weight of the node with the node degree of 1 to the neighbor node and the edge of the node;
Figure BDA0002507156250000093
(c) transferring the node weight with the node degree larger than 1 and smaller than 10 to the neighbor nodes and edges of the node weight;
Figure BDA0002507156250000094
(d) calculating the importance score of the node in the network according to the weight of the node and the weight of the edge connected with the node, wherein lambda represents the influence of the weight score of the related edge on the weight of the node when the weight of the node is calculated;
Figure BDA0002507156250000095
calculating the sequence homology score related to each node;
Figure BDA0002507156250000101
(e) calculating the final weight score of each node in the network;
Weight(u)=α×importance(u)+(1-α)×B(v) (8)
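Equation (8) can be illustrated with a short sketch. It assumes that the importance score and the sequence homology score have already been computed for every node (their formulas appear only as images in this text) and that both are applied to the same node u; the function name `final_weight` and the input dictionaries are hypothetical.

```python
def final_weight(importance, homology, alpha=0.5):
    """Combine topology and sequence scores per node as in equation (8).

    importance: dict node -> topological importance score.
    homology:   dict node -> sequence homology score (taken as 0 if missing).
    """
    return {
        u: alpha * importance[u] + (1 - alpha) * homology.get(u, 0.0)
        for u in importance
    }


# Made-up scores for two proteins.
importance = {"p1": 0.8, "p2": 0.2}
homology = {"p1": 0.4, "p2": 0.6}
weights = final_weight(importance, homology, alpha=0.5)

# Nodes are then visited from the largest weight to the smallest.
order = sorted(weights, key=weights.get, reverse=True)
```

Setting α closer to 1 makes the ranking depend mostly on topology, while α closer to 0 makes it depend mostly on sequence similarity.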
(f) searching and comparing by combining a clustering method:
A weighted complete k-partite graph S can be constructed from the sequence similarity information, in which the nodes represent the nodes of the corresponding networks and the weight of an edge between two nodes from networks of different species is the sequence similarity bit score. Similar nodes are grouped into clusters in the similarity graph S by a network clustering method. The clustering method, which is based on the seed-and-extend approach, comprises the following steps:
(a) calculating the weighting degree of the nodes in the current search network, and selecting the node with the maximum weighting degree as a first seed to be added into the set S;
Figure BDA0002507156250000102
(b) normalizing the weight of the first seed neighbor, and selecting the seed with the highest weight as a second seed to be added into S;
(c) extending in the network from the seeds generated in the first two steps: selecting the node of maximum weight that is connected to a node in S, adding it if the two constraint conditions are met, and ending the extension otherwise; when a new candidate cluster is generated, if the number of source networks of the nodes in the cluster is less than the number of input networks, the current cluster is extended further and the constraint thresholds are raised, i.e. the values of T_d and T_s are increased.
Figure BDA0002507156250000103
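The seed-and-extend steps (a) to (c) can be sketched as below. Several points are assumptions rather than the patented procedure: the weighted degree is taken to be the sum of the weights of the incident edges, the neighbor weights are normalised by their maximum, and the two constraints involving T_d and T_s (given only as an image in this text) are replaced by a caller-supplied predicate `accept`. The toy predicate `one_per_network` is purely hypothetical.

```python
def weighted_degree(graph, node):
    """Sum of the weights of the edges incident to node (assumed definition)."""
    return sum(graph[node].values())


def pick_seeds(graph, node_weight):
    """Steps (a) and (b): choose the first and second seed."""
    first = max(sorted(graph), key=lambda u: weighted_degree(graph, u))
    neighbours = sorted(graph[first])
    if not neighbours:
        return [first]
    top = max(node_weight[v] for v in neighbours) or 1.0
    normalised = {v: node_weight[v] / top for v in neighbours}
    second = max(neighbours, key=lambda v: normalised[v])
    return [first, second]


def extend(graph, seeds, node_weight, accept):
    """Step (c): grow the cluster until the stand-in constraint fails."""
    cluster = list(seeds)
    while True:
        frontier = {v for u in cluster for v in graph[u]} - set(cluster)
        if not frontier:
            break
        best = max(sorted(frontier), key=lambda v: node_weight[v])
        if not accept(cluster, best):
            break
        cluster.append(best)
    return cluster


def one_per_network(cluster, candidate):
    """Hypothetical constraint: the first character names the source network."""
    return candidate[0] not in {member[0] for member in cluster}


# Toy similarity graph: node -> {neighbour: edge weight}, symmetric.
graph = {
    "a1": {"b1": 90.0, "c1": 80.0},
    "b1": {"a1": 90.0, "c1": 70.0},
    "c1": {"a1": 80.0, "b1": 70.0, "a2": 10.0},
    "a2": {"c1": 10.0},
}
node_weight = {"a1": 0.9, "b1": 0.7, "c1": 0.8, "a2": 0.1}

seeds = pick_seeds(graph, node_weight)
cluster = extend(graph, seeds, node_weight, one_per_network)
```

Passing the constraint in as a function keeps the sketch independent of the exact thresholds, so a predicate implementing the real T_d and T_s conditions could be substituted without changing the loop.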
(5) Calculating the objective function score and selecting the generated cluster:
For the candidate clusters generated in the previous step of each iteration, their objective function scores are calculated, and the candidate cluster with the highest score is selected as the comparison cluster produced by that iteration. The objective function is formulated as follows:
AS(A)=α×CIQ(A)+(1-α)×ICQ(A) (11)
where α is the parameter that balances topology against sequence similarity information and is usually set to 0.5. CIQ measures the topological quality between clusters, and ICQ measures the sequence similarity between the nodes within a cluster:
Figure BDA0002507156250000111
where
Figure BDA0002507156250000112
lies respectively in the different clusters Cl_m and Cl_n; cs(m, n) denotes the proportion of conserved edges between clusters Cl_m and Cl_n and is calculated by the following formula:
Figure BDA0002507156250000113
where s'_{m,n} denotes the number of networks that contain an edge in
Figure BDA0002507156250000114
and s_{m,n} denotes the number of networks whose nodes are contained in clusters Cl_m and Cl_n. When s'_{m,n} = 1, cs(m, n) = 0; otherwise cs(m, n) is calculated by equation (13).
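The conserved-edge proportion cs(m, n) can be sketched as follows. Equation (13) is available only as an image here, so the sketch assumes the ratio s'_{m,n} / s_{m,n} that is commonly used for the CIQ measure, and it assumes that s_{m,n} counts the networks with nodes in both clusters. The function name `conserved_edge_ratio` and the data layout are hypothetical.

```python
def conserved_edge_ratio(cluster_m, cluster_n, ppi_edges):
    """Assumed form of cs(m, n): 0 if s' <= 1, otherwise s' / s.

    cluster_m, cluster_n: sets of (network, protein) pairs.
    ppi_edges: dict network -> set of frozenset({protein, protein}) edges.
    """
    nets_m = {net for net, _ in cluster_m}
    nets_n = {net for net, _ in cluster_n}
    s = len(nets_m & nets_n)  # networks with nodes in both clusters

    with_edge = set()  # networks that have an edge between the two clusters
    for net_u, u in cluster_m:
        for net_v, v in cluster_n:
            if net_u == net_v and frozenset((u, v)) in ppi_edges.get(net_u, set()):
                with_edge.add(net_u)
    s_prime = len(with_edge)

    if s == 0 or s_prime <= 1:
        return 0.0
    return s_prime / s


# Toy example: an edge between the clusters exists in G1 and G2 but not in G3.
cluster_m = {("G1", "a"), ("G2", "b"), ("G3", "c")}
cluster_n = {("G1", "x"), ("G2", "y"), ("G3", "z")}
ppi_edges = {
    "G1": {frozenset(("a", "x"))},
    "G2": {frozenset(("b", "y"))},
    "G3": set(),
}
ratio = conserved_edge_ratio(cluster_m, cluster_n, ppi_edges)
```

Here two of the three shared networks conserve the edge, so the ratio is 2/3; an edge present in only one network would count as not conserved and give 0.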
In the following formulas for calculating ICQ, ICQ(Cl_i) denotes the score that considers only the nodes within a single cluster Cl_i, and ICQ(A) denotes the score that considers all clusters in the alignment A:
Figure BDA0002507156250000115
Figure BDA0002507156250000116
where w_max(u) denotes the maximum weight of the edges incident to node u, and E(Cl_i) denotes the set of edges in S_β that connect the nodes in cluster Cl_i.
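The selection rule built on equation (11) can be sketched as follows. The sketch assumes that the CIQ and ICQ values of each candidate have already been computed; the names `alignment_score` and `best_candidate` and the candidate records are hypothetical.

```python
def alignment_score(ciq, icq, alpha=0.5):
    """Objective of equation (11): AS = alpha * CIQ + (1 - alpha) * ICQ."""
    return alpha * ciq + (1 - alpha) * icq


def best_candidate(candidates, alpha=0.5):
    """Return the candidate with the highest objective function score."""
    return max(
        candidates,
        key=lambda c: alignment_score(c["ciq"], c["icq"], alpha),
    )


# Two made-up candidates with precomputed quality values.
candidates = [
    {"name": "candidate_1", "ciq": 0.6, "icq": 0.3},
    {"name": "candidate_2", "ciq": 0.4, "icq": 0.7},
]
chosen = best_candidate(candidates, alpha=0.5)
```

With α = 0.5 the second candidate wins (0.55 against 0.45); with α = 1.0 only topology counts and the first candidate would be chosen instead.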
The technical scheme of the invention has the following beneficial effects:
To verify the above work, tests were performed on real networks and on synthetic networks. The real networks come from the five eukaryotes in IsoBase; Table 1 gives the number of nodes and edges for each species in the real networks. The synthetic networks come from three data sets generated under the three growth models of NAPAbench: CG, DMC and DMR. Each data set comprises 8 networks of 1000 nodes each. Each network of the CG model has 3985 edges; the eight networks of the DMC model have 1919, 1853, 1923, 1840, 1867, 1848, 1818 and 1867 edges, respectively; and the eight networks of the DMR model have 2031, 2092, 1967, 1977, 1959, 1998, 2030 and 2056 edges, respectively.
TABLE 1 details of the real network
Figure BDA0002507156250000121
FIG. 3 is a statistical comparison, on the synthetic networks, of protein coverage and consistent protein coverage for four different algorithms. Protein coverage is the number of proteins contained in the comparison result. Biological consistency means that if all proteins in a cluster are annotated with the same GO term, the cluster and the proteins in it are considered consistent; consistent protein coverage is the number of consistent proteins contained in the comparison result. Coverage values are usually broken down by the number k of source networks of the proteins in a cluster, with a larger k indicating a more meaningful result. FIG. 3 shows that, apart from IsoRankN, the algorithms differ little in total protein coverage and consistent protein coverage, but ACCMNA produces the highest protein coverage and consistent protein coverage at k = 8 compared with the other three algorithms, which indicates that ACCMNA produces alignments of greater research value and obtains good biological results.
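The coverage statistics described above can be sketched as follows. The sketch assumes that a cluster is consistent when all of its proteins share at least one annotation term, and that coverage is tallied per k, the number of distinct source networks in a cluster; the function names, the protein identifiers and the annotation labels are placeholders, not data from the experiments.

```python
from collections import Counter


def is_consistent(cluster, go_terms):
    """True if every protein in the cluster shares at least one annotation term."""
    annotated = [go_terms.get(p, set()) for p in cluster]
    return bool(annotated) and bool(set.intersection(*annotated))


def coverage_by_k(clusters, network_of, go_terms):
    """Tally protein coverage and consistent protein coverage for each k."""
    total, consistent = Counter(), Counter()
    for cluster in clusters:
        k = len({network_of[p] for p in cluster})  # number of source networks
        total[k] += len(cluster)
        if is_consistent(cluster, go_terms):
            consistent[k] += len(cluster)
    return total, consistent


# Placeholder alignment with two clusters.
clusters = [["a1", "b1", "c1"], ["a2", "b2"]]
network_of = {"a1": "A", "b1": "B", "c1": "C", "a2": "A", "b2": "B"}
go_terms = {
    "a1": {"term_1", "term_2"},
    "b1": {"term_1"},
    "c1": {"term_1", "term_3"},
    "a2": {"term_4"},
    "b2": {"term_5"},
}
total, consistent = coverage_by_k(clusters, network_of, go_terms)
```

In this toy case the three-network cluster is consistent because all members carry term_1, so it contributes 3 proteins to both counts at k = 3, while the two-network cluster contributes only to total coverage at k = 2.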
FIG. 4 shows the values of several metrics used to measure topological and functional consistency on the synthetic network data sets. CIQ measures the proportion of conserved edges in the comparison result, SPE measures its biological specificity, Sen measures its biological sensitivity, and MNE and nGOC measure its biological consistency. For all of these metrics larger is better, except MNE, for which smaller is better.
FIG. 5 is a statistical comparison of protein coverage and consistent protein coverage for the four algorithms on the real network data set; the results of the algorithms differ little.
FIG. 6 shows the values of several metrics used to measure topological and functional consistency on the real network data set; the results in the figure show that ACCMNA is superior among the four algorithms.
FIG. 7 shows the product of a topological index and a biological consistency index. Because balancing topological and biological results is difficult in network comparison, most existing algorithms do well on only one of the two; the product of the two indexes is therefore used to highlight the advantage of ACCMNA and to show its improvement over the prior art.
These experiments demonstrate that the method adopted by the invention achieves a good comparison effect and produces comparison results that are good in both topological and biological-function terms. k-coverage is the protein coverage of the clusters that span all k input networks; it shows that the clustering method adopted by the invention aligns as many similar nodes as possible into the same cluster, and that nodes aligned into the same cluster share the same biological function, which demonstrates the beneficial effect of the method.
A specific application scenario of the present invention is given below.
Taking the eukaryotic data set from IsoBase as an example, a comparison among five real networks is performed:
(1) Read the input protein interaction networks to construct the network G, read the sequence similarity data among the different species to construct the similarity-weighted network S, and set the parameters α = 0.5 and β = 0.4.
(2) Filter the sequence similarity data by deleting every edge whose score is below the threshold associated with the edges of its nodes, which gives the filtered similarity-weighted network Sβ.
(3) Calculate the weight of each node from its degree in the network and from its neighbor nodes:
(a) initialize the weights of all nodes and edges in the network: the weight of every node is 0, and the weight between two nodes is 1 if an edge exists between them and 0 otherwise;
(b) traverse the nodes whose degree is smaller than 10 in ascending order of degree, and transfer the weights of each node and its edges to the corresponding neighbor nodes and edges of larger degree;
(c) add the weights on the edges incident to each node to that node, scaled by the parameter λ, where λ is set to 0.2;
(d) calculate the sequence similarity score associated with each node;
(e) combine the topology weight with the sequence score to obtain the final weight of each node, with the proportion controlled by α = 0.5.
(4) Generate comparison clusters in combination with a clustering method:
(a) in the first iteration, preferentially select the nodes with large weights calculated in step 3 to generate a neighborhood graph;
(b) calculate the weighted degree of every node in the neighborhood graph; the node with the largest weighted degree is the first Seed, the largest node in the neighborhood of that Seed is the second Seed, and the two seeds form the Seed set;
(c) from the neighbors of the nodes in the Seed set, select the node with the largest sum of edge weights to the Seed set, and check whether it satisfies the conditions for being added, where the thresholds Ts and Td of the conditions are both set to 0.5; if it does, add it, otherwise end the search;
(d) in each iteration, generate a neighborhood graph from the current candidate cluster and repeat sub-steps (b) and (c) to generate a new candidate cluster.
(5) Calculate the objective function score of the candidate clusters generated in step 4 and select the highest-scoring one as the comparison result; iterate steps 4 and 5, with α = 0.5 in the objective function, until no new candidate cluster is formed in the network.
(6) Output the comparison result and analyze it.
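The seed selection described in step (4) above can be illustrated with a minimal sketch. It assumes that the weighted degree of a node is the sum of the weights of its incident edges and that the second seed is the neighbor of the first seed with the largest weighted degree; the function name `pick_seeds` and the toy graph are hypothetical and are not part of the invention.

```python
def pick_seeds(graph):
    """Pick up to two seeds from a non-empty weighted neighborhood graph.

    graph: dict mapping node -> {neighbor: edge weight}, stored symmetrically.
    Assumption: weighted degree = sum of the weights of incident edges.
    """
    weighted_degree = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    # First seed: the node with the largest weighted degree.
    first = max(weighted_degree, key=weighted_degree.get)
    if not graph[first]:
        return [first]  # isolated node: no second seed is available
    # Second seed (assumed rule): the neighbor of the first seed
    # with the largest weighted degree.
    second = max(graph[first], key=weighted_degree.get)
    return [first, second]


# Hypothetical toy neighborhood graph with four nodes.
graph = {
    "a": {"b": 0.9, "c": 0.4},
    "b": {"a": 0.9, "d": 0.8},
    "c": {"a": 0.4},
    "d": {"b": 0.8},
}
seeds = pick_seeds(graph)
print(seeds)
```

In this toy graph node "b" has the largest weighted degree, and among its neighbors "a" has a larger weighted degree than "d", so the seed set is formed from "b" and "a".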
The global multi-network comparison method combined with the network clustering method provided by the invention has been described in detail above. The following points should be noted:
1. In the initial stage of the comparison, the importance of each node in the network is evaluated from the degree of the node, its neighbor nodes, and the sequence similarity information among nodes. Combining topology information with sequence similarity information in this way effectively addresses the imbalance between topological quality and biological-function quality in network comparison.
2. The algorithm generates the comparison result in combination with a clustering method. The clustering method searches for similar nodes, so that similar nodes are placed in the same cluster as far as possible, which avoids the loss of comparison accuracy caused by gene duplication and similar events.
The above embodiments are merely preferred embodiments given to fully illustrate the present invention, and the scope of protection of the present invention is not limited to them. Equivalent substitutions or changes made by those skilled in the art on the basis of the present invention all fall within the scope of protection of the present invention. The scope of protection of the present invention is defined by the claims.

Claims (10)

1. A global multi-network comparison method combined with a network clustering method, characterized by comprising the following steps:
step 1, reading protein interaction network data and sequence similarity data among different species, setting parameters α and β, and constructing a protein interaction network G and a sequence similarity network S;
step 2, preprocessing the sequence similarity data, and deleting the edges with smaller sequence similarity scores by using the parameter β to obtain a filtered network Sβ;
step 3, calculating the weight of each node in all the networks, and transferring the weights of nodes and edges of small degree to nodes and edges of larger degree according to the degrees of the node and its neighbor nodes;
step 4, generating candidate clusters in the corresponding search graph by using a network clustering algorithm;
step 5, calculating the objective function score of the current candidate comparison clusters, and selecting the candidate comparison cluster with the highest score as the comparison result;
step 6, outputting the comparison result and analyzing the comparison result.
2. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the step 1 is as follows:
first, the parameters set by the user are read, where α is the parameter that determines the proportions of topology and sequence similarity and β is the parameter for filtering the sequence similarity information; k is determined as the number of input networks; the protein interaction networks are read in batches to construct G = {G1, G2, …, Gk}; the sequence similarity information among the different networks is read; and the sequence information network S is constructed.
3. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the step 2 is as follows:
the sequence similarity information is filtered according to the threshold coefficient β that has been read; first, because the amount of sequence similarity information is huge, the amount of data grows exponentially as the number of networks increases, and the computational difficulty grows with it; second, because the sequence similarity information of many real species is currently incomplete, some sequence information can reduce the accuracy of the comparison result; therefore, when sequence similarity is used to assist the comparison process, the sequence information needs to be filtered; for any edge (u, v) in the sequence information network S, if the score of the edge is smaller than the threshold associated with that edge, as expressed by formula (1), the edge is deleted from the network S, and the filtered network Sβ is finally obtained:
w(u,v) < β × max(u,v) (1)
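The filtering rule of formula (1) can be illustrated with a minimal sketch. The text does not define max(u,v); the sketch assumes that it is the largest score on any edge of S incident to u or to v. The function name `filter_similarity` and the toy scores are hypothetical.

```python
from collections import defaultdict


def filter_similarity(edges, beta):
    """Return the edges of S that are not deleted by formula (1).

    edges: dict mapping a node pair (u, v) to its sequence similarity score.
    Assumption: max(u, v) is the largest score on any edge touching u or v.
    """
    best = defaultdict(float)  # largest score seen on the edges of each node
    for (u, v), w in edges.items():
        best[u] = max(best[u], w)
        best[v] = max(best[v], w)
    kept = {}
    for (u, v), w in edges.items():
        threshold = beta * max(best[u], best[v])
        if w >= threshold:  # formula (1) deletes edges with w < threshold
            kept[(u, v)] = w
    return kept


# Hypothetical scores between proteins of two species "a" and "b".
S = {("a1", "b1"): 100.0, ("a1", "b2"): 30.0, ("a2", "b2"): 50.0}
S_beta = filter_similarity(S, beta=0.4)
print(S_beta)
```

With β = 0.4 the edge (a1, b2) is deleted because its score of 30 is below 0.4 × 100, while the other two edges are kept.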
4. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the step 3 is as follows:
(a) setting initial values for the nodes in the network and for the edges among the nodes;
[Formula (2): image not reproduced]
[Formula (3): image not reproduced]
(b) transferring the weight of each node whose degree is 1 to its neighbor node and edge;
[Formula (4): image not reproduced]
(c) transferring the weight of each node whose degree is larger than 1 and smaller than 10 to its neighbor nodes and edges;
[Formula (5): image not reproduced]
(d) calculating the importance score of each node in the network from the weight of the node and the weights of the edges connected to it, wherein λ represents the influence of the weight scores of the related edges on the weight of the node;
[Formula (6): image not reproduced]
calculating the sequence homology score related to each node;
[Formula (7): image not reproduced]
(e) calculating the final weight score of each node in the network;
Weight(u)=α×importance(u)+(1-α)×B(u) (8)
(f) searching and comparing in combination with a clustering method:
a weighted complete k-partite graph S can be constructed from the sequence similarity information, in which each node represents a node of the corresponding network, and the weight of an edge between two nodes from the networks of different species is the bit score of their sequence similarity; similar nodes are clustered into one cluster in the similarity graph S by the network clustering method.
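The combination of the topological and sequence scores in formula (8) can be illustrated with a minimal sketch. Formula (6) is not reproduced in this text, so the form of `importance` below (node weight plus λ times the summed weight of the incident edges) is only an assumption based on the description of λ; the function names and numbers are hypothetical.

```python
def importance(node_weight, edge_weights, lam=0.2):
    """Assumed form of the importance score (formula (6) is not reproduced):
    node weight plus lambda times the summed weight of the incident edges."""
    return node_weight + lam * sum(edge_weights)


def final_weight(importance_score, sequence_score, alpha=0.5):
    """Formula (8): blend the topological and sequence scores with alpha."""
    return alpha * importance_score + (1 - alpha) * sequence_score


# Hypothetical node with weight 3.0 and three incident edges.
imp = importance(node_weight=3.0, edge_weights=[1.0, 2.0, 2.0])
weight = final_weight(imp, sequence_score=0.6)
print(imp, weight)
```

Setting α closer to 1 lets the topological importance dominate the final weight, while α closer to 0 favors the sequence score.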
5. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the clustering method used in step 3 is based on a seed-and-expansion method and is divided into the following steps:
(a) calculating the weighted degree of the nodes in the current search network, and selecting the node with the largest weighted degree as the first seed to be added to the set S;
[Formula (9): image not reproduced]
(b) normalizing the weights of the neighbors of the first seed, and selecting the neighbor with the highest weight as the second seed to be added to S;
(c) expanding in the network from the seeds generated in the first two steps: selecting the node that is connected to the nodes in S and has the largest weight, adding the node if the two constraint conditions are met, and ending the expansion otherwise; when a new candidate cluster is generated, if the number of source networks of the nodes contained in the cluster is smaller than the number of input networks, the current cluster is expanded and the constraint conditions are raised, that is, the values of Td and Ts are increased;
[Formula (10): image not reproduced]
6. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the step 5 is as follows:
for the candidate clusters generated by the previous step in each iteration, the objective function scores of the candidate clusters are calculated, and the candidate cluster with the highest objective function score is selected as the comparison cluster generated in that iteration; the objective function is formulated as follows:
AS(A)=α×CIQ(A)+(1-α)×ICQ(A) (11)
wherein α is the parameter that balances the proportions of topology and sequence similarity information, and is usually 0.5; CIQ is a measure of the topological quality between clusters, and ICQ is a measure of the sequence similarity between the nodes within a cluster:
[Formula (12): image not reproduced]
wherein [symbol: image not reproduced] denotes the edges whose two end nodes lie respectively in the different clusters Cl_m and Cl_n, and cs(m,n) denotes the proportion of conserved edges between the clusters Cl_m and Cl_n, which is calculated according to the following formula:
[Formula (13): image not reproduced]
wherein s'_{m,n} denotes the number of edges contained in [symbol: image not reproduced], and s_{m,n} denotes the number of networks of the nodes contained in the clusters Cl_m and Cl_n; here, when s'_{m,n} = 1, cs(m,n) = 0; otherwise, cs(m,n) is calculated by formula (13).
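The selection of the highest-scoring candidate cluster by formula (11) can be illustrated with a minimal sketch. It takes the CIQ and ICQ values as already computed. Formula (13) is not reproduced in this text, so the ratio used in `cs` is only an assumption; the text fixes only the special case cs(m,n) = 0 when s'_{m,n} = 1. All function names and numbers are hypothetical.

```python
def cs(s_prime, s):
    """Conserved-edge proportion between two clusters.

    Assumption: formula (13) is the ratio s'/s. The guard for s == 0 is
    added only to avoid a division by zero in this sketch.
    """
    if s_prime == 1 or s == 0:
        return 0.0
    return s_prime / s


def alignment_score(ciq, icq, alpha=0.5):
    """Formula (11): AS(A) = alpha * CIQ(A) + (1 - alpha) * ICQ(A)."""
    return alpha * ciq + (1 - alpha) * icq


def best_candidate(candidates, alpha=0.5):
    """candidates: list of (name, ciq, icq) tuples with precomputed scores.
    Returns the tuple with the highest objective function score."""
    return max(candidates, key=lambda c: alignment_score(c[1], c[2], alpha))


# Two hypothetical candidate clusters with precomputed CIQ and ICQ values.
candidates = [("cluster_1", 0.6, 0.3), ("cluster_2", 0.4, 0.7)]
chosen = best_candidate(candidates)
print(chosen[0])
```

With α = 0.5 the second candidate scores higher than the first, so it is the one kept for that iteration.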
7. The global multi-network comparison method combined with the network clustering method according to claim 6, wherein ICQ is calculated by the following formulas, in which ICQ(Cl_i) denotes the score when only the nodes within a single cluster Cl_i are considered, and ICQ(A) denotes the score when all the clusters in the comparison A are considered:
[Formula (14): image not reproduced]
[Formula (15): image not reproduced]
wherein w_max(u) denotes the maximum weight of the edges incident to the node u, and E(Cl_i) denotes the set of edges in Sβ that connect the nodes in the cluster Cl_i.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the program is executed by the processor.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 7.
CN202010449468.4A 2020-05-25 2020-05-25 Global multi-network comparison method combined with network clustering method Active CN111599406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010449468.4A CN111599406B (en) 2020-05-25 2020-05-25 Global multi-network comparison method combined with network clustering method

Publications (2)

Publication Number Publication Date
CN111599406A true CN111599406A (en) 2020-08-28
CN111599406B CN111599406B (en) 2023-08-04

Family

ID=72187760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010449468.4A Active CN111599406B (en) 2020-05-25 2020-05-25 Global multi-network comparison method combined with network clustering method

Country Status (1)

Country Link
CN (1) CN111599406B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832583A (en) * 2017-11-08 2018-03-23 武汉大学 A kind of across species biological pathways based on figure matching find method
CN109767809A (en) * 2019-01-16 2019-05-17 中南大学 The alignment schemes of protein-protein interaction network

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446492A (en) * 2020-12-14 2021-03-05 江南大学 Biological network comparison method based on genetic algorithm
CN112446492B (en) * 2020-12-14 2024-04-05 江南大学 Biological network comparison method based on genetic algorithm
CN113033669A (en) * 2021-03-29 2021-06-25 东北大学 Visual scene recognition method based on learnable feature map filtering and map attention network
CN113033669B (en) * 2021-03-29 2024-05-14 东北大学 Visual scene recognition method based on learnable feature map filtering and graph annotation meaning network
CN113113083A (en) * 2021-04-09 2021-07-13 山东大学 Tumor driving pathway prediction system for collective cell mutation data and protein network
CN113066524A (en) * 2021-05-19 2021-07-02 江南大学 Multi-protein interaction network comparison method based on simulated annealing
CN113066524B (en) * 2021-05-19 2022-12-20 江南大学 Multi-protein interaction network comparison method based on simulated annealing

Also Published As

Publication number Publication date
CN111599406B (en) 2023-08-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant