CN112669907A

CN112669907A - Pairwise protein interaction network alignment method based on divide-and-conquer integration strategy

Info

Publication number: CN112669907A
Application number: CN202011528447.8A
Authority: CN
Inventors: 陈璟; 刘晓
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2021-04-16

Abstract

The invention discloses a pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy, comprising the following steps. Step 1: reading a source network, a target network and a BLAST similarity file. Step 2: calculating similarity scores for the nodes within each of the two networks using a method that combines node and path information, and dividing each network into modules based on those scores. Step 3: obtaining homologous protein pairs, calculating the similarity between modules of the two networks from the homologous protein pairs and the BLAST similarity, and matching modules from the two networks one-to-one according to that similarity. The beneficial effect of the invention is that the node-and-path-based similarity calculation replaces self-similarity files, which removes the dependence on such files.

Description

Pairwise protein interaction network alignment method based on divide-and-conquer integration strategy

Technical Field

The invention relates to the field of protein interaction network alignment, and in particular to a pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy.

Background

With the development of bioinformatics, research has focused on proteins, DNA and other biological macromolecules. Protein molecules perform many important tasks in organisms, and protein interactions are the basis for maintaining cell structure and function, so the study of protein interaction networks is of great significance.

The traditional technology has the following technical problems:

The algorithm of "SPINAL: scalable protein interaction network alignment" (Bioinformatics, 2013, 4, (29): 917-) works in a coarse-grained stage followed by a fine-grained stage. The coarse-grained stage iteratively refines a matrix P of matching-confidence estimates for each pair of nodes, taking into account the confidence of matches between neighboring nodes calculated in the previous iteration. After P converges, the fine-grained stage uses a seed-and-extend algorithm to construct the alignment. In each iteration of the seed-and-extend process, a local search is also performed to directly increase the number of conserved edges. The problem with this algorithm is that it relies too heavily on topological information, so the biological-function quality of the final alignment is poor.

The algorithm of "module-based homology alignment of protein-protein interaction networks" (Bioinformatics, 2016, 32(17): 658-664) proposes a homology score function that calculates the homology scores of nodes from the similarity of modules and solves the assignment with the dynamic Hungarian algorithm. The problem with this algorithm is that its module-division algorithm is poorly chosen and its module similarity calculation is complicated and inappropriate, which yields incorrect biological similarity scores and an alignment of poor biological-function quality.

The algorithm of "alignment of protein-protein interaction networks" (BMC Bioinformatics, 2020, 21(Suppl 6): 1-22) adopts a modular approach: it first divides each network into several modules, then enumerates and aligns the modules, and finally merges all alignment results into a final alignment. The problem with this algorithm is that all modules must be enumerated and aligned, which greatly increases the time complexity.

The algorithm of "HubAlign: an accurate and efficient method for global alignment of protein-protein interaction networks" (Bioinformatics, 2014, 30(17): 438-444) holds that proteins acting as hubs in a PPI network are more important in function and topology. It proposes an importance-based centrality and uses a greedy seed-and-extend algorithm that ranks proteins by a combination of their importance scores and sequence similarity. The problem with this algorithm is that it randomly selects nodes as starting points for edge splitting, and different starting points may yield alignments of different quality.

The algorithm of "MAGNA++: maximizing accuracy in global network alignment via both node and edge conservation" (Bioinformatics, 2015, 31(14): 2409-) effectively avoids becoming trapped in a local optimum. The problem with this algorithm is that it requires thousands of iterations and therefore takes a long time.

The algorithm of "INDEX: incremental depth extension for protein-protein interaction networks" (BioSystems, 2017, 162(2017): 24-34) proposes a new alignment strategy that considers the increase of both alignment scores and alignment cores, so that the resulting common connected subgraph has more edges than those of earlier methods. The problem with this algorithm is that the biological quality of the alignment is poor and a good balance between biological quality and topological quality is not achieved.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy that removes the dependence on similarity files between nodes of the same network; that predicts the matching relation between modules from the matching relation of existing protein pairs, thereby calculating module similarity and solving the module similarity calculation problem; and that uses degree centrality and eigenvector centrality to capture the topological characteristics of nodes, improving both the biological quality and the topological quality of the alignment.

To solve the above technical problem, the invention provides a pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy, comprising the following steps:

Step 1: reading a source network, a target network and a BLAST similarity file;

Step 2: calculating similarity scores for the nodes within each of the two networks using a method that combines node and path information, and dividing each network into modules based on the similarity scores;

Step 3: obtaining homologous protein pairs, and calculating the similarity between modules of the two networks from the homologous protein pairs and the BLAST similarity; matching modules from the two networks one-to-one according to that similarity;

Step 4: calculating the similarity between nodes in each pair of matched modules from the eigenvector centrality and the BLAST similarity, performing intra-module alignment, and merging the resulting sub-alignments into a candidate result set;

Step 5: applying a hypergraph matching algorithm to the candidate result set to obtain the final one-to-one alignment.

In one embodiment, in step 2, the similarity score is calculated as follows:

The similarity between two nodes is measured using their degrees and the shortest path length between them; the node similarity is calculated by formula (1)

where G is the network, u and v are nodes in G, deg_u is the degree of node u, deg_G is the maximum degree in graph G, D(G) is the diameter of graph G, and d_G(u, v) is the shortest path length between nodes u and v.
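As an illustration only, the quantities that formula (1) takes as input can be computed as in the following minimal Python sketch. The adjacency-dictionary representation and the names `bfs_distances` and `similarity_inputs` are assumptions made for this example. The sketch deliberately stops short of combining the quantities, because that combination is defined by formula (1) itself.

```python
from collections import deque

def bfs_distances(adj, start):
    """Shortest path lengths d_G(start, v) in an unweighted graph."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def similarity_inputs(adj):
    """Return (deg, deg_G, dist, diameter) for an undirected graph.

    Assumption: for a disconnected graph the diameter is taken over
    reachable pairs only.
    """
    deg = {u: len(nbs) for u, nbs in adj.items()}      # deg_u
    deg_G = max(deg.values())                          # maximum degree in G
    dist = {u: bfs_distances(adj, u) for u in adj}     # d_G(u, v)
    diameter = max(d for row in dist.values() for d in row.values())  # D(G)
    return deg, deg_G, dist, diameter

# Toy network with edges a-b, b-c, b-d, d-e.
adj = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"},
       "d": {"b", "e"}, "e": {"d"}}
deg, deg_G, dist, diameter = similarity_inputs(adj)
```

For the toy network, node b has the maximum degree and the longest shortest path runs from a (or c) to e.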

In one embodiment, in step 2, the module division step is as follows:

(1) calculating the similarity of the source network G using formula (1) to obtain a similarity matrix S;

(2) for each row of matrix S, the nodes whose similarity values rank in the top 75% form a module, and the center of the module is the node named by that row.
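The two module division steps can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes that "top 75%" means the 75% of the other nodes with the highest similarity to the row's node (rounded up), and that S is stored as a dictionary of rows. The function name `divide_into_modules` and the toy values are hypothetical.

```python
import math

def divide_into_modules(S, keep_fraction=0.75):
    """Build one module per row of S; the row's own node is the module center."""
    modules = {}
    for center, row in S.items():
        # Rank the other nodes by similarity to the center, highest first
        # (ties broken by name so the result is deterministic).
        others = sorted((n for n in row if n != center),
                        key=lambda n: (-row[n], n))
        k = math.ceil(keep_fraction * len(others))
        modules[center] = [center] + others[:k]
    return modules

names = ["a", "b", "c", "d", "e"]
rows = [
    [1.0, 0.8, 0.5, 0.2, 0.1],
    [0.8, 1.0, 0.6, 0.3, 0.2],
    [0.5, 0.6, 1.0, 0.7, 0.4],
    [0.2, 0.3, 0.7, 1.0, 0.9],
    [0.1, 0.2, 0.4, 0.9, 1.0],
]
S = {u: dict(zip(names, r)) for u, r in zip(names, rows)}
modules = divide_into_modules(S)
```

With five nodes, each module keeps its center plus the three most similar of the four remaining nodes.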

In one embodiment, step 3 is specifically as follows:

Homologous protein pairs are generated, and the homology similarity of modules is calculated from the collective behavior of the protein pairs within the modules. The homologous protein pair file is converted into a homology matrix π using formula (2), where i and j are proteins from the two networks respectively;

the homology similarity score of modules m1 and m2, obtained from the matrix π, is then:

formula (3) is subject to formula (4):

the module similarity is calculated as follows:

S(m1, m2) = HS(m1, m2) + BLAST(c1, c2)    (5)

where c1 and c2 are the cluster centers of modules m1 and m2, respectively, and BLAST is the sequence similarity;

an inter-module similarity matrix S is obtained from formula (5), and S is solved with the Hungarian algorithm to obtain a one-to-one module matching.
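The module matching step can be illustrated with the sketch below, which adds the two terms of formula (5) and then solves the resulting assignment problem with a textbook O(n^3) Hungarian algorithm. It assumes that HS and the BLAST scores of the module centers are already available as matrices indexed by module position; the names `hungarian_min` and `match_modules` and all numeric values are hypothetical.

```python
def hungarian_min(cost):
    """Minimum-cost assignment of every row to a distinct column (rows <= columns)."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (m + 1)          # column potentials
    p = [0] * (m + 1)            # p[j] = row currently assigned to column j (1-based)
    way = [0] * (m + 1)
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j]}

def match_modules(HS, blast_centers):
    """S = HS + BLAST(c1, c2) as in formula (5); maximize the total similarity."""
    n, m = len(HS), len(HS[0])
    S = [[HS[i][j] + blast_centers[i][j] for j in range(m)] for i in range(n)]
    if n <= m:
        return hungarian_min([[-x for x in row] for row in S])
    # More source modules than target modules: solve the transposed problem.
    transposed = [[-S[i][j] for i in range(n)] for j in range(m)]
    return {i: j for j, i in hungarian_min(transposed).items()}

HS = [[6.0, 5.0], [5.0, 0.0]]
blast_centers = [[4.0, 4.0], [4.0, 1.0]]
matching = match_modules(HS, blast_centers)   # source module index -> target module index
```

In this example a greedy choice of the single best pair (modules 0 and 0, score 10) would leave a total of 11, whereas the optimal assignment pairs module 0 with 1 and module 1 with 0 for a total of 18.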

In one embodiment, step 4 is specifically as follows: for each pair of matched modules obtained in the module alignment stage, a similarity matrix of the nodes in the modules is first calculated, and the nodes of the two modules are then aligned; in more detail:

the similarity between two nodes in different modules is calculated from the eigenvector centrality and the sequence similarity, see formula (6)

where T(u, v) is the eigenvector centrality similarity score of nodes u and v, calculated by formula (7):

where c_u is the eigenvector centrality of node u;

the sub-alignments generated by all matched module pairs are merged into a candidate set; at this point a node in the candidate set may be aligned with several nodes from the other network, so the candidate set is a many-to-many matching set.

In one embodiment, the intra-module alignment according to formula (6) is as follows: (1) first, the module centers c1 and c2 of modules m1 and m2 are aligned; (2) the neighbors of c1 and c2, denoted Deg(c1) and Deg(c2), are obtained; (3) the submatrix whose row names and column names are Deg(c1) and Deg(c2) is extracted from the similarity matrix F, and the nodes in Deg(c1) and Deg(c2) are aligned using the Hungarian algorithm; (4) the already expanded pair (c1, c2) is removed, and steps (2) and (3) are repeated for the remaining aligned node pairs.

In one embodiment, step 5 is specifically as follows: the candidate set is abstracted into a hypergraph, in which the nodes of the source network are the source nodes of the hypergraph, the nodes of the target network are its target nodes, and each sub-alignment corresponds to one hyper-arc; a weighted bipartite hypergraph matching algorithm then reduces the hypergraph to a bipartite graph containing only one-to-one alignment relations, which gives the final node matching.

Based on the same inventive concept, the present application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when executing the program.

Based on the same inventive concept, the present application also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of any of the methods.

Based on the same inventive concept, the present application further provides a processor for executing a program, wherein the program executes to perform any one of the methods.

The invention has the beneficial effects that:

The node-and-path-based similarity calculation replaces self-similarity files, which removes the dependence on such files; homologous protein pairs are obtained from PrimAlign, and the matching relation between modules is predicted from the matching relation of these existing protein pairs, so that module similarity can be calculated and the module similarity calculation problem is solved; the eigenvector centrality measure is further refined to capture both the centrality of each node and the centrality difference between two nodes, which better captures the topological characteristics of nodes and improves the quality of the alignment.

Drawings

FIG. 1 is a flow chart of the pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy.

FIG. 2 shows the EC, ICS, S3 and FC scores of different algorithms on the ISOBASE dataset.

FIG. 3 shows the EC, ICS, S3 and FC scores of different algorithms on the synthetic datasets (DMC, DMR).

FIG. 4 shows the EC, ICS, S3 and FC scores of the present invention when homologous pairs are obtained from existing algorithms, compared with the scores of those original algorithms.

FIG. 5 shows the time required by the present invention when homologous pairs are obtained from different existing algorithms.

Detailed Description

The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.

Referring to FIG. 1 to FIG. 5, a pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy includes:

step 1: the source and target networks and the BLAST similarity files are read.
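A minimal sketch of step 1 is given below. The patent does not specify the file layouts, so the sketch assumes whitespace-separated text: one interaction "u v" per line for a network and one triple "source_protein target_protein score" per line for the BLAST similarities. The function names `read_network` and `read_blast` are hypothetical, and in-memory strings stand in for real files.

```python
import io
from collections import defaultdict

def read_network(handle):
    """Parse an undirected edge list into an adjacency dictionary."""
    adj = defaultdict(set)
    for line in handle:
        parts = line.split()
        if len(parts) < 2 or line.startswith("#"):
            continue                      # skip blank, malformed or comment lines
        u, v = parts[0], parts[1]
        if u != v:                        # ignore self-interactions
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)

def read_blast(handle):
    """Parse cross-network sequence similarity triples into a dictionary."""
    scores = {}
    for line in handle:
        parts = line.split()
        if len(parts) != 3 or line.startswith("#"):
            continue
        scores[(parts[0], parts[1])] = float(parts[2])
    return scores

source = read_network(io.StringIO("a\tb\nb\tc\n"))
target = read_network(io.StringIO("x\ty\ny\tz\n"))
blast = read_blast(io.StringIO("a\tx\t52.0\nb\ty\t310.5\n"))
```

Passing an open file object instead of `io.StringIO` would read the same layout from disk.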

Step 2: Similarity scores are calculated for the nodes within each of the two networks using a method that combines node and path information, and each network is divided into modules based on the similarity scores.

To mine the similarity information of nodes within the same network more fully, the similarity between nodes is calculated by a method that combines node and path information; that is, the similarity between two nodes is measured using their degrees and the shortest path length between them, and the node similarity is calculated by formula (1).

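The module division steps are as follows:

(1) the similarity of the source network G is calculated using formula (1) to obtain a similarity matrix S;

(2) for each row of matrix S, the nodes whose similarity values rank in the top 75% form a module, and the center of the module is the node named by that row.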

Step 3: Homologous protein pairs are obtained, and the similarity between modules of the two networks is calculated from the homologous protein pairs and the BLAST similarity; modules from the two networks are then matched one-to-one according to that similarity.

Homologous protein pairs are generated using PrimAlign, and the homology similarity of modules is calculated from the collective behavior of the pairs within the modules. Formula (2) converts the homologous protein pair file into the homology matrix π, where i and j are proteins from the two networks respectively.

formula (3) is subject to formula (4):

the module similarity calculation formula is as follows:

S(m1, m2) = HS(m1, m2) + BLAST(c1, c2)    (5)

where c1 and c2 are the cluster centers of modules m1 and m2, respectively, and BLAST is the sequence similarity.

Step 4: For each pair of matched modules obtained in the module alignment stage, a similarity matrix of the nodes in the modules is first calculated, and the nodes of the two modules are then aligned. The specific process is as follows:

where c_u is the eigenvector centrality of node u. Formula (6) considers not only the difference between the eigenvector centralities of nodes u and v but also the centrality value of each node itself; this design favors aligning first the proteins in a module that have high centrality and similar centrality values.
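For reference, the eigenvector centrality c_u used above can be approximated by power iteration, as in the following sketch. The scaling (largest centrality set to 1), the iteration limits and the function name `eigenvector_centrality` are assumptions for this example; the sketch does not evaluate formula (6) or formula (7), and it presumes a connected, undirected network.

```python
def eigenvector_centrality(adj, iterations=200, tol=1e-9):
    """Approximate the principal eigenvector of the adjacency matrix."""
    nodes = sorted(adj)
    c = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # Multiply by (A + I) instead of A: the eigenvectors are unchanged,
        # but the shift prevents oscillation on bipartite graphs.
        new = {u: c[u] + sum(c[v] for v in adj[u]) for u in nodes}
        norm = max(new.values())
        new = {u: x / norm for u, x in new.items()}
        if max(abs(new[u] - c[u]) for u in nodes) < tol:
            return new
        c = new
    return c

# Star network: hub h linked to a, b and c.
adj = {"h": {"a", "b", "c"}, "a": {"h"}, "b": {"h"}, "c": {"h"}}
centrality = eigenvector_centrality(adj)
```

In the star network the hub receives centrality 1 and each leaf converges to 1/sqrt(3), reflecting that a node is central when its neighbors are central.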

The intra-module alignment according to formula (6) is performed as follows:

(1) first, the module centers c1 and c2 of modules m1 and m2 are aligned;

(2) the neighbors of c1 and c2, denoted Deg(c1) and Deg(c2), are obtained;

(3) the submatrix whose row names and column names are Deg(c1) and Deg(c2) is extracted from the similarity matrix F, and the nodes in Deg(c1) and Deg(c2) are aligned using the Hungarian algorithm;

(4) the already expanded pair (c1, c2) is removed, and steps (2) and (3) are repeated for the remaining aligned node pairs.
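Steps (1) to (4) amount to a seed-and-extend loop, sketched below under stated assumptions. To keep the example short and dependency-free, the optimal assignment in step (3) is found by exhaustive search over permutations rather than by the Hungarian algorithm; both return a maximum-weight one-to-one assignment, but the exhaustive search is only practical for small neighborhoods. The names `best_assignment` and `align_modules`, the dictionary form of F and the toy scores are hypothetical.

```python
from collections import deque
from itertools import permutations

def best_assignment(rows, cols, F):
    """Maximum-weight one-to-one assignment (exhaustive stand-in for Hungarian)."""
    swap = len(rows) > len(cols)
    small, large = (cols, rows) if swap else (rows, cols)
    def score(s, l):
        return F.get((l, s), 0.0) if swap else F.get((s, l), 0.0)
    best, best_total = (), float("-inf")
    for perm in permutations(large, len(small)):
        total = sum(score(s, l) for s, l in zip(small, perm))
        if total > best_total:
            best, best_total = tuple(zip(small, perm)), total
    return [(l, s) if swap else (s, l) for s, l in best]

def align_modules(adj1, adj2, m1, m2, c1, c2, F):
    aligned = {c1: c2}                      # step (1): align the module centers
    used2 = {c2}
    queue = deque([(c1, c2)])
    while queue:
        u, v = queue.popleft()              # step (4): next aligned pair to expand
        # Step (2): unaligned neighbors that belong to the two modules.
        n1 = sorted(x for x in adj1[u] if x in m1 and x not in aligned)
        n2 = sorted(y for y in adj2[v] if y in m2 and y not in used2)
        # Step (3): align the two neighborhoods using the scores in F.
        for x, y in best_assignment(n1, n2, F):
            aligned[x] = y
            used2.add(y)
            queue.append((x, y))
    return aligned

adj1 = {"c1": {"a", "b"}, "a": {"c1", "d"}, "b": {"c1"}, "d": {"a"}}
adj2 = {"c2": {"x", "y"}, "x": {"c2", "z"}, "y": {"c2"}, "z": {"x"}}
F = {("a", "x"): 0.9, ("a", "y"): 0.2, ("b", "x"): 0.3,
     ("b", "y"): 0.8, ("d", "z"): 0.5}
aligned = align_modules(adj1, adj2, set(adj1), set(adj2), "c1", "c2", F)
```

The queue plays the role of "the remaining aligned node pairs": each pair is expanded once and then dropped.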

Step 5: The candidate set is abstracted into a hypergraph, in which the nodes of the source network are the source nodes of the hypergraph, the nodes of the target network are its target nodes, and each sub-alignment corresponds to one hyper-arc. A weighted bipartite hypergraph matching algorithm then reduces the hypergraph to a bipartite graph containing only one-to-one alignment relations, which gives the final node matching.
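The data flow of this step can be illustrated as follows. The sketch is not the weighted bipartite hypergraph matching algorithm itself, whose details are not given here; it substitutes a simple greedy rule, purely to show how a many-to-many candidate set of weighted hyper-arcs can be reduced to a one-to-one alignment. The hyper-arc weights, the support-summing rule and the name `reduce_candidates` are all assumptions.

```python
from collections import defaultdict

def reduce_candidates(sub_alignments):
    """Greedy reduction of weighted hyper-arcs to a one-to-one alignment.

    sub_alignments: list of (weight, {source_node: target_node}) hyper-arcs.
    """
    # Accumulate, for every candidate pair, the weight of the hyper-arcs containing it.
    support = defaultdict(float)
    for weight, pairs in sub_alignments:
        for u, v in pairs.items():
            support[(u, v)] += weight
    final, used_targets = {}, set()
    # Accept pairs in order of decreasing support, skipping conflicts.
    for (u, v), _ in sorted(support.items(), key=lambda kv: (-kv[1], kv[0])):
        if u not in final and v not in used_targets:
            final[u] = v
            used_targets.add(v)
    return final

candidates = [
    (1.0, {"a": "x", "b": "y"}),
    (0.6, {"a": "y", "c": "z"}),
    (0.5, {"b": "y"}),
]
final = reduce_candidates(candidates)
```

Here node a appears in two hyper-arcs with different partners; the pair (a, x) has the higher support and is kept, while (a, y) is discarded because both a and y are already matched.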

A specific application scenario of the present invention is given below:

Take the SCE and HSA networks in the ISOBASE database as an example:

1. Read the SCE network, the HSA network and the BLAST similarity file;

2. Calculate the similarity values of the nodes within the SCE network and within the HSA network according to formula (1);

3. Divide the SCE and HSA networks into modules;

4. Obtain homologous protein pairs using PrimAlign;

5. Calculate the similarity between modules of the two networks according to formula (5);

6. Match modules from the two networks one-to-one according to the similarity;

7. Calculate the similarity between nodes in each pair of matched modules according to formula (6), and perform intra-module alignment;

8. Merge the sub-alignments obtained in step 7 into a candidate result set;

9. Apply a hypergraph matching algorithm to the candidate result set to obtain the final one-to-one alignment.

The above embodiments are merely preferred embodiments given to fully illustrate the present invention, and the scope of protection of the invention is not limited to them. Equivalent substitutions or changes made by those skilled in the art on the basis of the invention all fall within its scope of protection. The scope of protection of the invention is defined by the claims.

Claims

1. A pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy, characterized by comprising the following steps:

step 1: reading a source network, a target network and a BLAST similarity file;

2. The pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy according to claim 1, wherein in step 2 the similarity score is calculated as follows:

3. The pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy according to claim 1, wherein in step 2 the module division steps are as follows:

4. The pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy according to claim 1, wherein step 3 is specifically as follows:

formula (3) is subject to formula (4):

the module similarity is calculated as follows:

S(m1, m2) = HS(m1, m2) + BLAST(c1, c2)    (5)

5. The pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy according to claim 1, wherein step 4 is specifically as follows: for each pair of matched modules obtained in the module alignment stage, a similarity matrix of the nodes in the modules is first calculated, and the nodes of the two modules are then aligned; in more detail:

where c_u is the eigenvector centrality of node u;

6. The pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy according to claim 5, wherein the intra-module alignment according to formula (6) is as follows: (1) first, the module centers c1 and c2 of modules m1 and m2 are aligned; (2) the neighbors of c1 and c2, denoted Deg(c1) and Deg(c2), are obtained; (3) the submatrix whose row names and column names are Deg(c1) and Deg(c2) is extracted from the similarity matrix F, and the nodes in Deg(c1) and Deg(c2) are aligned using the Hungarian algorithm; (4) the already expanded pair (c1, c2) is removed, and steps (2) and (3) are repeated for the remaining aligned node pairs.

7. The pairwise protein interaction network alignment method based on a divide-and-conquer integration strategy according to claim 1, wherein step 5 is specifically as follows: the candidate set is abstracted into a hypergraph, in which the nodes of the source network are the source nodes of the hypergraph, the nodes of the target network are its target nodes, and each sub-alignment corresponds to one hyper-arc; a weighted bipartite hypergraph matching algorithm then reduces the hypergraph to a bipartite graph containing only one-to-one alignment relations, which gives the final node matching.

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the program is executed by the processor.

9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

10. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 7.