CN111599406A

CN111599406A - Global multi-network comparison method combined with network clustering method

Info

Publication number: CN111599406A
Application number: CN202010449468.4A
Authority: CN
Inventors: 陈璟; 黄佳
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2020-05-25
Filing date: 2020-05-25
Publication date: 2020-08-28
Anticipated expiration: 2040-05-25
Also published as: CN111599406B

Abstract

The invention discloses a global multi-network comparison method combined with a network clustering method, comprising: step 1, reading protein interaction network data and sequence similarity data among different species, setting parameters α and β, and constructing a protein interaction network G and a sequence similarity network S; step 2, preprocessing the sequence similarity data and deleting edges with smaller sequence similarity scores by using the parameter β to obtain a filtered network S_β; step 3, calculating the weight of each node in all the networks; and step 4, generating candidate clusters in the corresponding searched graphs by adopting a network clustering algorithm. The beneficial effects of the invention are that the method achieves a good comparison effect and produces comparison results that are good in both topological and biological-function terms.

Description

Global multi-network comparison method combined with network clustering method

Technical Field

The invention relates to the field of global multi-network comparison, in particular to a global multi-network comparison method combined with a network clustering method.

Background

Since the beginning of the 21st century, the amount of data available in different research areas, such as social networking, commercial transactions and molecular biology, has grown explosively. Representing proteins and their interactions as networks (graphs) and developing new strategies to analyze them is a current research focus. In a protein interaction network, nodes represent proteins and edges represent interactions between two proteins.

Alignment studies based on protein interaction networks are extensive and fall mainly into pairwise (two-network) alignment and multiple (three or more networks) alignment. Most early network alignment algorithms are pairwise and aim to find the node mapping with the highest similarity between two networks. Since 2008, research has gradually shifted toward multi-network alignment algorithms, which obtain the mapping of nodes across several networks simultaneously and can therefore yield deeper biological insight.

Because the network alignment problem can be regarded as a subgraph isomorphism problem, it is NP-complete, and heuristic methods have therefore been used to solve it. Early work used greedy algorithms; after years of development a growing range of methods has been applied to network alignment, for example the Hungarian algorithm, seed-and-extend matching, simulated annealing and genetic algorithms, and in recent years deep learning has also been applied to the problem.

The traditional technology has the following technical problems:

1. Global multi-network alignment of protein interaction networks dates back to 2009, with the IsoRankN algorithm proposed by Liao et al. (IsoRankN: spectral methods for global alignment of multiple protein networks [J]. Bioinformatics, 25(12): i253-i258). The method builds a similarity score matrix for nodes across different networks and generates a multi-network alignment using spectral clustering. As an early algorithm, however, it falls well short of more recently proposed multi-network alignment algorithms in both topological and biological-function quality. Because it is a classical multi-network alignment algorithm, many later algorithms use it as a baseline for comparison.

2. The SMETANA algorithm was proposed in 2013 by Sahraeian et al. (SMETANA: Accurate and Scalable Algorithm for Probabilistic Alignment of Large-Scale Biological Networks [J]. PLoS ONE, 8(7): e67995). It first computes a similarity score matrix between the nodes of different networks using a semi-Markov random walk model, then refines the alignment probabilities between nodes through two rounds of probabilistic consistency transformation, and finally constructs the alignment with a greedy seed-and-extend method. Experimental results on many data sets show that SMETANA obtains good topological results, but this better topological quality (a higher proportion of conserved edges) comes at the cost of functional similarity among some of the aligned nodes; that is, SMETANA mostly produces alignments with good topological quality and poor biological-function quality.

3. The BEAMS algorithm, proposed in 2014 by Ferhat Alkan et al. (BEAMS: backbone extraction and merge strategy for the global many-to-many alignment of multiple PPI networks, Bioinformatics, 2014, 30(4), 531-539), is a global multi-network alignment algorithm built on a seed-and-extend framework. It generates alignments by searching for maximum-weight cliques (backbones) in the network and consists of two main parts, backbone extraction and backbone merging. BEAMS yields alignments with good biological-function quality, but their topological quality is poor (the proportion of conserved edges is low), and the alignment process relies too heavily on sequence similarity information between nodes, so topological and biological-function quality are not well balanced.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a global multi-network comparison method combined with a network clustering method that addresses the following: the influence of incomplete sequence similarity data on comparison accuracy; the complexity of selecting nodes to search and compare at larger data scales; the influence on comparison accuracy of the many similar nodes that gene duplication produces within each network; and the imbalance between topological and biological-function quality in multi-network comparison.

In order to solve the above technical problem, the present invention provides a global multi-network comparison method combining a network clustering method, including:

Step 1, reading protein interaction network data and sequence similarity data among different species, setting parameters α and β, and constructing a protein interaction network G and a sequence similarity network S;

Step 2, preprocessing the sequence similarity data, and deleting edges with smaller sequence similarity scores by using the parameter β to obtain a filtered network S_β;

Step 3, calculating the weight of each node in all the networks, and transferring the weights of nodes and edges with small degree to nodes and edges with larger degree according to the degrees of each node and its neighbor nodes;

Step 4, generating candidate clusters in the corresponding searched graphs by adopting a network clustering algorithm;

Step 5, calculating the objective function score of the current candidate comparison clusters, and selecting the candidate comparison cluster with the highest score as the comparison result;

Step 6, outputting the comparison result and analyzing it.

In one embodiment, step 1 is specifically as follows:

First, the user-set parameter α, which determines the proportion of topology versus sequence similarity, and the parameter β, which filters sequence similarity information, are read. Second, the number k of input networks is determined, the protein interaction networks are read in batches, and the protein interaction network set G = {G₁, G₂, …, G_k} is constructed. The sequence similarity information among the different networks is then read, and a sequence information network S is constructed.

In one embodiment, step 2 is specifically as follows:

The sequence similarity information is filtered according to the threshold coefficient β that was read. First, the volume of sequence similarity information is huge: the amount of data grows exponentially with the number of networks, and so does the computational difficulty. Second, the sequence similarity information currently available for many real species is incomplete, and some of it can harm the accuracy of the comparison result. Therefore, when sequence similarity is used to assist the comparison process, the sequence information needs to be filtered. For any edge (u, v) in the sequence information network S, if the score of the edge is smaller than the threshold associated with that edge, i.e. if it satisfies inequality (1), the edge is deleted from S, finally yielding the filtered network S_β:

w(u，v)＜β×max(u，v) (1)
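The filtering rule in inequality (1) can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation of the invention: it assumes that max(u, v) denotes the larger of the highest edge scores incident to u and to v, which the text describes only as the threshold associated with the edge. The function name `filter_similarity_edges` and the dictionary layout are hypothetical.

```python
from collections import defaultdict


def filter_similarity_edges(edges, beta):
    """Drop edges of S whose score is below beta * max(u, v).

    edges: dict mapping (u, v) pairs to a similarity score (e.g. a bit score).
    Assumption: max(u, v) is the larger of the best scores incident to u and v.
    """
    best = defaultdict(float)  # best score seen on any edge touching a node
    for (u, v), w in edges.items():
        best[u] = max(best[u], w)
        best[v] = max(best[v], w)
    # Keep an edge only if it is not below its own threshold.
    return {
        (u, v): w
        for (u, v), w in edges.items()
        if w >= beta * max(best[u], best[v])
    }


# Toy scores between proteins of two hypothetical species "a" and "b".
scores = {("a1", "b1"): 100.0, ("a1", "b2"): 30.0, ("a2", "b2"): 50.0}
kept = filter_similarity_edges(scores, beta=0.4)
```

With β = 0.4, the edge (a1, b2) is dropped because its score of 30 is below 0.4 × 100, while the other two edges survive.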

In one embodiment, step 3 is specifically as follows:

(a) setting initial values for nodes in the network and edges among the nodes;

(b) transferring the weight of the node with the node degree of 1 to the neighbor node and the edge of the node;

(c) transferring the node weight with the node degree larger than 1 and smaller than 10 to the neighbor nodes and edges of the node weight;

(d) calculating the importance score of the node in the network according to the weight of the node and the weight of the edge connected with the node, wherein lambda represents the influence of the weight score of the related edge on the weight of the node when the weight of the node is calculated;

calculating the sequence homology score related to each node;

(e) calculating the final weight score of each node in the network;

Weight(u)＝α×importance(u)+(1-α)×B(v) (8)

(f) searching and comparing by combining a clustering method:

a weighted complete k-partite graph S can be constructed from the sequence similarity information, in which the nodes represent the nodes of the corresponding networks, and the weight of an edge between two nodes from networks of different species is the sequence similarity bit score; similar nodes are grouped into one cluster in the similarity graph S by a network clustering method.
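Equation (8) and the rule of searching from the heaviest node first can be illustrated with a short Python sketch. It assumes that the importance scores and the sequence homology scores are already computed and lie on comparable scales, and it treats the B(v) term as the sequence score of the same node u; both points are assumptions, since the intermediate formulas are not reproduced in this text. The names `final_weights` and `search_order` are hypothetical.

```python
def final_weights(importance, seq_score, alpha=0.5):
    """Weight(u) = alpha * importance(u) + (1 - alpha) * B(u), as in equation (8).

    importance: dict node -> topological importance score.
    seq_score:  dict node -> sequence homology score (missing nodes count as 0).
    """
    return {
        u: alpha * importance[u] + (1 - alpha) * seq_score.get(u, 0.0)
        for u in importance
    }


def search_order(weights):
    """Nodes sorted so that the heaviest node is searched first (ties by name)."""
    return sorted(weights, key=lambda u: (-weights[u], u))


# Toy values for three hypothetical proteins.
importance = {"p1": 0.9, "p2": 0.4, "p3": 0.1}
seq_score = {"p1": 0.2, "p2": 0.8}
weights = final_weights(importance, seq_score, alpha=0.5)
order = search_order(weights)
```

In this toy case p2 is searched before p1 even though p1 has the higher topological importance, because α = 0.5 gives the sequence score equal influence.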

In one embodiment, the clustering method referred to in step 3 is based on the seed-and-extend method and is divided into the following steps:

(a) calculating the weighted degree of the nodes in the current search network, and selecting the node with the maximum weighted degree as the first seed to be added to the set S;

(b) normalizing the weights of the neighbors of the first seed, and selecting the neighbor with the highest weight as the second seed to be added to S;

(c) expanding in the network from the seeds generated in the first two steps: selecting the node with the maximum weight that is connected to a node in S, adding it if the two constraint conditions are met, and ending the expansion otherwise; when a new candidate cluster is generated, if the number of source networks of the nodes contained in the cluster is less than the number of input networks, the current cluster is expanded further with the constraint conditions raised, i.e. with the values of T_d and T_s increased;
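Steps (a) to (c) can be sketched in Python as follows. The text does not define the two constraint conditions, so the sketch substitutes two hypothetical ones: a candidate must be linked to at least a fraction T_d of the current cluster members, and its strongest link into the cluster must reach at least a fraction T_s of its strongest link overall. The normalization in step (b) is omitted on the assumption that it is a common rescaling that does not change which neighbor ranks highest, and the rule for raising T_d and T_s is left out. The function name `seed_and_extend` is hypothetical.

```python
def seed_and_extend(graph, t_d=0.5, t_s=0.5):
    """Grow one candidate cluster in a weighted undirected graph.

    graph: dict with graph[u][v] = positive edge weight, stored symmetrically,
           with every node present as a key.
    t_d, t_s: thresholds of the two (assumed) constraint conditions.
    """
    if not graph:
        return set()
    # (a) first seed: node with the largest weighted degree
    weighted_degree = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    first = max(sorted(graph), key=weighted_degree.get)
    cluster = {first}
    # (b) second seed: neighbor of the first seed with the heaviest edge
    if graph[first]:
        cluster.add(max(sorted(graph[first]), key=graph[first].get))
    # (c) extension: repeatedly try the best-connected frontier node
    while True:
        frontier = {v for u in cluster for v in graph[u]} - cluster
        if not frontier:
            break
        link = {
            v: sum(w for u, w in graph[v].items() if u in cluster)
            for v in frontier
        }
        best = max(sorted(frontier), key=link.get)
        into_cluster = [w for u, w in graph[best].items() if u in cluster]
        dense_enough = len(into_cluster) / len(cluster) >= t_d
        similar_enough = max(into_cluster) / max(graph[best].values()) >= t_s
        if not (dense_enough and similar_enough):
            break  # constraints violated: stop extending
        cluster.add(best)
    return cluster


# Toy similarity graph: a tight triple plus a weakly attached pair.
toy = {
    "a1": {"b1": 90, "c1": 80},
    "b1": {"a1": 90, "c1": 70},
    "c1": {"a1": 80, "b1": 70, "d9": 5},
    "d9": {"c1": 5, "e9": 60},
    "e9": {"d9": 60},
}
cluster = seed_and_extend(toy)
```

Here the expansion stops at d9, which is linked to only one of the three cluster members and therefore fails the assumed T_d condition.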

In one embodiment, step 5 is specifically as follows:

for the candidate clusters generated in the previous step in each iteration, the objective function score of each candidate cluster is calculated, and the candidate cluster with the highest objective function score is selected as the comparison cluster generated in that iteration; the objective function is formulated as follows:

AS(A)＝α×CIQ(A)+(1-α)ICQ(A) (11)

wherein α is a parameter balancing the proportion of topology and sequence similarity information, usually set to 0.5; CIQ measures the topological quality between clusters, and ICQ measures the sequence similarity between nodes within a cluster:

[formula for CIQ(A): image not reproduced]

wherein the edge set in the formula consists of edges whose endpoints lie respectively in the different clusters Cl_m and Cl_n, and cs(m, n) represents the proportion of conserved edges between clusters Cl_m and Cl_n, calculated according to the following formula:

[formula (13) for cs(m, n): image not reproduced]

wherein s'_{m,n} denotes the number of networks containing edges in that edge set, and s_{m,n} denotes the number of networks that contribute nodes to clusters Cl_m and Cl_n; here, when s'_{m,n} = 1, cs(m, n) = 0; otherwise it is calculated by equation (13).

In one embodiment, in the following formulas for calculating ICQ, ICQ(Cl_i) denotes the score when only the nodes inside a single cluster Cl_i are considered, and ICQ(A) denotes the score over all clusters in alignment A:

[formulas for ICQ(Cl_i) and ICQ(A): images not reproduced]

wherein w_max(u) denotes the maximum weight among the edges incident to node u, and E(Cl_i) denotes the set of edges in S_β that are incident to nodes in cluster Cl_i.
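The selection rule built on equation (11) can be illustrated with a small Python sketch. It takes CIQ and ICQ values as given inputs, because their formulas are not reproduced in this text. The form cs(m, n) = s'_{m,n} / s_{m,n} used for equation (13) is an assumption, as are the function names `conserved_share`, `alignment_score` and `pick_best`.

```python
def conserved_share(s_prime, s):
    """cs(m, n) for a pair of clusters.

    s_prime: number of networks that have an edge between the two clusters.
    s:       number of networks that contribute nodes to the two clusters.
    Returns 0 when the edges come from a single network (or from none).
    """
    if s_prime <= 1 or s == 0:
        return 0.0
    return s_prime / s  # assumed form of equation (13)


def alignment_score(ciq, icq, alpha=0.5):
    """AS(A) = alpha * CIQ(A) + (1 - alpha) * ICQ(A), as in equation (11)."""
    return alpha * ciq + (1 - alpha) * icq


def pick_best(candidates, alpha=0.5):
    """Return the candidate (name, ciq, icq) with the highest objective score."""
    return max(candidates, key=lambda c: alignment_score(c[1], c[2], alpha))


# Two hypothetical candidate clusters with made-up CIQ and ICQ values.
candidates = [("A", 0.6, 0.5), ("B", 0.4, 0.9)]
best = pick_best(candidates, alpha=0.5)
```

With α = 0.5, candidate B wins because its higher ICQ outweighs its lower CIQ; a larger α would shift the choice toward the topologically stronger candidate A.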

Based on the same inventive concept, the present application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when executing the program.

Based on the same inventive concept, the present application also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of any of the methods.

Based on the same inventive concept, the present application further provides a processor for executing a program, wherein the program executes to perform any one of the methods.

The invention has the beneficial effects that:

The method adopted by the invention achieves a good comparison effect and produces comparison results that are good in both topological and biological-function terms. The k-coverage, i.e. the protein coverage when k equals the number of input networks, shows that the clustering method adopted by the invention places as many similar nodes as possible into the same cluster and that nodes placed in the same cluster share the same biological function, which demonstrates the beneficial effect of the method.

Drawings

FIG. 1 is a flowchart of a global multi-network comparison method in combination with a network clustering method according to the present invention.

FIG. 2 is a diagram illustrating an effect of calculating node weights in the global multi-network comparison method combined with the network clustering method according to the present invention.

FIG. 3 is a schematic diagram of experimental results of different algorithms in the global multi-network comparison method combined with the network clustering method on a synthetic network data set according to the present invention.

FIG. 4 shows the topology and biological index results of the comparison results in the synthetic network in the global multi-network comparison method combined with the network clustering method according to the present invention.

FIG. 5 is an experimental result of different comparison algorithms under a real network in the global multi-network comparison method combined with the network clustering method of the present invention.

FIG. 6 shows the topology and biological index results of different comparison algorithms under the real network in the global multi-network comparison method combined with the network clustering method of the present invention.

FIG. 7 is a comparison result of products of topology and biological indexes of different comparison algorithms in the global multi-network comparison method combined with the network clustering method according to the present invention.

Detailed Description

The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.

The technical solution of the present invention is illustrated by the flowchart shown in FIG. 1:

a global multi-network comparison algorithm ACCMNA combined with a network clustering method comprises the following steps:

Step 2, preprocessing the sequence similarity data, and deleting edges with smaller sequence similarity scores by using the parameter β to obtain a filtered network S_β;

Step 4, generating candidate clusters in the corresponding searched graphs by adopting a network clustering algorithm;

Step 6, outputting the comparison result and analyzing it.

(1) Reading input data:

First, the user-set parameter α, which determines the proportion of topology versus sequence similarity, and the parameter β, which filters sequence similarity information, are read. Second, the number k of input networks is determined, the protein interaction networks are read in batches, and the protein interaction network set G = {G₁, G₂, …, G_k} is constructed. The sequence similarity information among the different networks is then read, and a sequence information network S is constructed.

(2) Filtering sequence similarity data:

The sequence similarity information is filtered according to the threshold coefficient β that was read. First, the volume of sequence similarity information is huge: the amount of data grows exponentially with the number of networks, and so does the computational difficulty. Second, the sequence similarity information currently available for many real species is incomplete, and some of it can harm the accuracy of the comparison result. Therefore, when sequence similarity is used to assist the comparison process, the sequence information needs to be filtered. For any edge (u, v) in the sequence information network S, if the score of the edge is smaller than the threshold associated with that edge, i.e. if it satisfies inequality (1), the edge is deleted from S, finally yielding the filtered network S_β:

w(u，v)＜β×max(u，v) (1)

(3) calculating the weight of all nodes in the network:

Because the number of nodes in the network data is large, the weight of each node is calculated first in order to reduce the complexity of searching for nodes to compare, and the node with the largest weight is selected for comparison first. The degree of a node is the number of nodes connected to it in the network and can be used to measure its importance: the larger the degree, the more important the node. FIG. 2 is an example of the effect obtained after the node weights are calculated: the greater the degree of a node, the higher its importance in the network and, ultimately, the greater its weight.

FIG. 2 is a schematic diagram of node weights. The size of a node and the thickness of an edge in the figure represent the weight of the node and of the edge, respectively; the figure shows the effect achieved after the node weights are calculated (the larger the degree of a node, the larger its weight).

The method for calculating the node weight comprises the following steps:

(a) setting initial values for nodes in the network and edges among the nodes;

calculating the sequence homology score related to each node;

(e) calculating the final weight score of each node in the network;

Weight(u)＝α×importance(u)+(1-α)×B(v) (8)

(f) searching and comparing by combining a clustering method:

a weighted complete k-partite graph S can be constructed from the sequence similarity information, in which the nodes represent the nodes of the corresponding networks, and the weight of an edge between two nodes from networks of different species is the sequence similarity bit score. Similar nodes are grouped into one cluster in the similarity graph S by a network clustering method. The clustering method, which is based on the seed-and-extend method, comprises the following steps:

(c) expanding in the network from the seeds generated in the first two steps: selecting the node with the maximum weight that is connected to a node in S, adding it if the two constraint conditions are met, and ending the expansion otherwise; when a new candidate cluster is generated, if the number of source networks of the nodes contained in the cluster is less than the number of input networks, the current cluster is expanded further with the constraint conditions raised, i.e. with the values of T_d and T_s increased.

(1) Calculating the objective function score and selecting the generated cluster:

For the candidate clusters generated in the previous step in each iteration, the objective function score of each candidate cluster is calculated, and the candidate cluster with the highest objective function score is selected as the comparison cluster generated in that iteration. The objective function is formulated as follows:

AS(A)＝α×CIQ(A)+(1-α)ICQ(A) (11)

wherein α is a parameter balancing the proportion of topology and sequence similarity information, usually set to 0.5. CIQ measures the topological quality between clusters, and ICQ measures the sequence similarity between nodes within a cluster:

[formulas for CIQ(A) and cs(m, n): images not reproduced]

wherein s'_{m,n} denotes the number of networks containing edges between clusters Cl_m and Cl_n, and s_{m,n} denotes the number of networks that contribute nodes to clusters Cl_m and Cl_n. Here, when s'_{m,n} = 1, cs(m, n) = 0; otherwise it is calculated by equation (13).

In the following formulas for calculating ICQ, ICQ(Cl_i) denotes the score when only the nodes inside a single cluster Cl_i are considered, and ICQ(A) denotes the score over all clusters in alignment A:

[formulas for ICQ(Cl_i) and ICQ(A): images not reproduced]

The technical scheme of the invention has the following beneficial effects:

To verify the above work, tests were performed on real networks and on synthetic networks. The real networks come from the five eukaryotes in IsoBase; Table 1 gives the number of nodes and edges for each species in the real networks. The synthetic networks come from three data sets in NAPAbench generated under three growth models, CG, DMC and DMR. Each data set comprises 8 networks of 1000 nodes each. Each network of the CG model has 3985 edges; the eight networks of the DMC model have 1919, 1853, 1923, 1840, 1867, 1848, 1818 and 1867 edges, respectively; and the eight networks of the DMR model have 2031, 2092, 1967, 1977, 1959, 1998, 2030 and 2056 edges, respectively.

TABLE 1 details of the real network

FIG. 3 is a statistical comparison of protein coverage and consistent protein coverage for four different algorithms on the synthetic networks. Protein coverage is the number of proteins contained in the comparison result. Biological consistency means that if the proteins in a cluster are all annotated with the same GO term, the cluster and its proteins are considered consistent; consistent protein coverage is the number of consistent proteins contained in the comparison result. Coverage values are usually broken down by the number k of source networks of the proteins in a cluster, and results with larger k are more meaningful. FIG. 3 shows that, apart from IsoRankN, the other three algorithms differ little in total protein coverage and consistent protein coverage, but ACCMNA produces the highest protein coverage and consistent protein coverage at k = 8 compared with the other three algorithms, which indicates that ACCMNA produces comparisons of greater research value and obtains good biological results.
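The coverage statistics described above can be sketched in Python. The sketch assumes that a cluster is a list of (network id, protein) pairs and that GO annotations are given as a dictionary of sets; it also treats a cluster containing an unannotated protein as inconsistent, which is an assumption rather than something the text states. The function name `coverage_by_k` and the identifiers in the toy data are hypothetical.

```python
from collections import Counter


def coverage_by_k(clusters, go_terms):
    """Protein coverage and consistent protein coverage, broken down by k.

    clusters: list of clusters, each a list of (network_id, protein) pairs.
    go_terms: dict protein -> set of GO terms annotating it.
    k is the number of distinct source networks represented in a cluster.
    """
    coverage, consistent = Counter(), Counter()
    for cluster in clusters:
        if not cluster:
            continue
        k = len({net for net, _ in cluster})
        coverage[k] += len(cluster)
        # A cluster is consistent if one GO term annotates all of its proteins.
        annotations = [go_terms.get(p, set()) for _, p in cluster]
        if set.intersection(*annotations):
            consistent[k] += len(cluster)
    return coverage, consistent


# Toy alignment over three hypothetical networks with placeholder GO ids.
clusters = [
    [(1, "p1"), (2, "q1"), (3, "r1")],
    [(1, "p2"), (2, "q2")],
]
go_terms = {
    "p1": {"GO:1", "GO:2"},
    "q1": {"GO:1"},
    "r1": {"GO:1", "GO:3"},
    "p2": {"GO:4"},
    "q2": {"GO:5"},
}
coverage, consistent = coverage_by_k(clusters, go_terms)
```

In this toy case the first cluster counts toward both statistics at k = 3, while the second counts only toward plain coverage at k = 2 because its two proteins share no GO term.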

FIG. 4 shows the values of several metrics used to measure topological and functional consistency on the synthetic network data sets. CIQ measures the proportion of conserved edges in the comparison result, SPE measures its biological specificity, Sen measures its biological sensitivity, and MNE and nGOC measure its biological consistency. For all of these metrics larger values are better, except MNE, for which smaller values are better.

FIG. 5 is a statistical comparison of protein coverage and consistent protein coverage for the four algorithms on the real network data set; the results of the algorithms differ little.

FIG. 6 shows the values of several metrics used to measure topological and functional consistency on the real network data set; the results in the figure show that ACCMNA performs comparatively well among the four algorithms.

FIG. 7 shows the result of multiplying a topological index by a biological consistency index. Because it is difficult to balance topological and biological results in network comparison, most existing algorithms do well on only one of the two kinds of metric. To highlight the advantage of ACCMNA, the two indexes are therefore multiplied together to show the improvement of the method over the prior art.

The experiments above demonstrate that the method adopted by the invention achieves a good comparison effect and produces comparison results that are good in both topological and biological-function terms. The protein coverage when k equals the number of input networks shows that the clustering method adopted by the invention places as many similar nodes as possible into the same cluster and that nodes placed in the same cluster share the same biological function, which demonstrates the beneficial effect of the method.

A specific application scenario of the present invention is given below:

Taking the eukaryotic data set from IsoBase as an example, a comparison among five real networks was performed.

(1) Reading the input protein interaction networks to construct the network G, reading the sequence similarity data among different species to construct the weighted similarity network S, and setting the parameters α = 0.5 and β = 0.4;

(2) Filtering the sequence similarity data by deleting edges whose scores are below the threshold of the relevant edges of each node, obtaining the filtered weighted similarity network S_β;

(3) Calculating the weight of each node according to the degree of the node in the network and the neighbor nodes thereof:

firstly, initializing the weights of all nodes and edges in a network, wherein the weights of all nodes are 0, if the edges exist among the nodes, the weights are 1, and if not, the weights are 0;

(1) traversing nodes with the node degree smaller than 10 from small to large, and transferring the weights of the nodes and the edges to neighbor nodes and edges with larger corresponding degrees;

(2) adding the weight on the relevant edges of all nodes to the nodes by taking lambda as a parameter, wherein lambda is set to be 0.2;

(3) calculating a sequence similarity score associated with each node;

(4) combining the topology weight with the sequence score to obtain the final weight of each node, with the proportion controlled by α = 0.5.
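The topological part of this weight calculation can be sketched in Python. The exact transfer formulas are not reproduced in this text, so the sketch is only one possible reading: node weights start at 0 and edge weights at 1, nodes of degree below 10 are visited in order of increasing degree, and each passes its own weight (split equally) plus the weight of the connecting edge to its neighbors of strictly larger degree. Edge weights are left unchanged for simplicity. The function name `importance_scores` is hypothetical.

```python
def importance_scores(adj, lam=0.2, max_degree=10):
    """Degree-driven importance scores under the assumptions stated above.

    adj: dict node -> set of neighbours (undirected; every node is a key).
    lam: share of incident edge weight added to the node weight.
    """
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    node_w = {u: 0.0 for u in adj}  # node weights start at 0
    edge_w = {frozenset((u, v)): 1.0 for u in adj for v in adj[u]}  # edges at 1
    # Visit low-degree nodes first and push weight towards bigger neighbours.
    for u in sorted(adj, key=lambda n: (degree[n], str(n))):
        if degree[u] >= max_degree:
            break  # remaining nodes all have degree >= max_degree
        bigger = [v for v in adj[u] if degree[v] > degree[u]]
        for v in bigger:
            node_w[v] += node_w[u] / len(bigger) + edge_w[frozenset((u, v))]
        if bigger:
            node_w[u] = 0.0  # the weight has been handed over
    # Add a lambda-weighted share of the incident edge weights to each node.
    return {
        u: node_w[u] + lam * sum(edge_w[frozenset((u, v))] for v in adj[u])
        for u in adj
    }


# Toy network: a hub with three neighbours, one of which has a pendant node.
toy = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub", "d"},
    "d": {"c"},
}
scores = importance_scores(toy)
```

In the toy network the hub ends up with the largest score, matching the intended effect that nodes of larger degree receive larger weights.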

(4) Generating an alignment cluster by combining a clustering method:

(1) the nodes with large weights calculated in the step 3 need to be selected preferentially for the first iteration to generate a neighborhood graph;

(2) calculating the weighting degrees of all nodes in the neighborhood graph, wherein the largest node is used as a first Seed, the largest node in the neighborhood of the Seed is selected as a second Seed, and the two seeds form a Seed set;

(3) from the neighbor nodes of the nodes in the Seed set, selecting the node whose edges into the Seed set have the maximum weight, and judging whether it meets the conditions for being added, the thresholds T_s and T_d of these conditions both being set to 0.5; if so, the node is added; otherwise the search ends;

(4) generating a neighborhood map according to the current candidate cluster in each iteration, and repeating the steps of 2-3 to generate a new candidate cluster;

(5) Calculating the objective function score of the candidate clusters generated in step 4 and selecting the highest-scoring one as the comparison result; steps 4 and 5 are iterated until no new candidate cluster is formed in the network, with α in the objective function set to 0.5.

(6) And outputting a comparison result, and analyzing the result.

The global multi-network comparison method combined with the network clustering method provided by the invention is described in detail above, and the following points need to be explained:

1. In the initial stage of comparison, the importance of the different nodes in the network is evaluated from the degree of each node, its neighboring nodes, and the sequence similarity information among nodes. Combining topology information with sequence similarity information in this way effectively addresses the imbalance between topological and biological-function quality in network comparison.

2. The algorithm generates the alignment in combination with a clustering method. Using clustering to search for similar nodes places similar nodes in the same cluster as far as possible and avoids the loss of comparison accuracy caused by gene duplication and similar effects.

The above-mentioned embodiments are merely preferred embodiments for fully illustrating the present invention, and the scope of the present invention is not limited thereto. The equivalent substitution or change made by the technical personnel in the technical field on the basis of the invention is all within the protection scope of the invention. The protection scope of the invention is subject to the claims.

Claims

1. A global multi-network comparison method combined with a network clustering method is characterized by comprising the following steps:

step 6, outputting the comparison result and analyzing the comparison result.

2. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the step 1 is as follows:

first, the user-set parameter α, which determines the proportion of topology versus sequence similarity, and the parameter β, which filters sequence similarity information, are read; second, the number k of input networks is determined, the protein interaction networks are read in batches, and the protein interaction network set G = {G₁, G₂, …, G_k} is constructed; the sequence similarity information among the different networks is read; and a sequence information network S is constructed.

3. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the step 2 is as follows:

w(u，v)＜β×max(u，v) (1)

4. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the step 3 is as follows:

(a) setting initial values for nodes in the network and edges among the nodes;

calculating the sequence homology score related to each node;

(e) calculating the final weight score of each node in the network;

Weight(u)＝α×importance(u)+(1-α)×B(v) (8)

(f) searching and comparing by combining a clustering method:

a weighted complete k-partite graph S can be constructed from the sequence similarity information, in which the nodes represent the nodes of the corresponding networks, and the weight of an edge between two nodes from networks of different species is the sequence similarity bit score; similar nodes are grouped into one cluster in the similarity graph S by a network clustering method.

5. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the clustering method in step 3 is based on the seed-and-extend method and is divided into the following steps:

6. The global multi-network comparison method combined with the network clustering method according to claim 1, wherein the step 5 is as follows:

AS(A)＝α×CIQ(A)+(1-α)ICQ(A) (11)

wherein

wherein s'_m，nThe representation comprises

7. The global multi-network comparison method combined with the network clustering method according to claim 6, wherein, in the following formulas for calculating ICQ, ICQ(Cl_i) denotes the score when only the nodes inside a single cluster Cl_i are considered, and ICQ(A) denotes the score over all clusters in alignment A:

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the program is executed by the processor.

9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

10. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 7.