CN107231252B

CN107231252B - Link prediction method based on Bayesian estimation and seed node neighbor set

Info

Publication number: CN107231252B
Application number: CN201710366159.9A
Authority: CN
Inventors: 杨旭华; 项旗立; 张海丰; 肖杰
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2017-05-23
Filing date: 2017-05-23
Publication date: 2020-05-05
Anticipated expiration: 2037-05-23
Also published as: CN107231252A

Abstract

A link prediction method based on Bayesian estimation and a seed node neighbor set. A network model is established, and two nodes that are not directly connected are selected as seed nodes. The probability that an edge exists between the two nodes and the probability that no edge exists between them are calculated. The probability that an edge is generated between the two nodes and the probability that no edge is generated are then calculated from the degree information of each intermediate node on the paths of length 2 or 3 between the two nodes. A likelihood value is calculated for each such intermediate node according to Bayesian estimation and the seed node neighbor set, and the similarity score of the pair is the sum of the likelihood values of all the intermediate nodes. The network is traversed to obtain the similarity score of every pair of seed nodes, all seed node pairs are sorted in descending order of similarity score, and the node pairs corresponding to the first B scores are taken as the predicted connecting edges. By combining Bayesian estimation with the seed node neighbor set, the method distinguishes the different intermediate nodes on the local paths between two nodes and assigns them different importance, which gives a good prediction effect.

Description

Link prediction method based on Bayesian estimation and seed node neighbor set

Technical Field

The invention relates to the field of network science and link prediction, in particular to a link prediction method based on Bayesian estimation and a seed node neighbor set.

Background

Complex systems in real life can be studied as complex networks, in which nodes represent the individuals in the system and connecting edges represent the relationships among them. Link prediction is one of the important research fields of complex networks: it can predict links that may form between nodes as the network evolves, so the evolution trend of the network can be anticipated, and it can identify spurious "ghost edges" that do not actually exist in the network, which helps researchers study the internal rules of the network.

The link prediction problem is of great interest to researchers. Link prediction algorithms based on network structure are generally more reliable and accurate than those based on node attribute information. The Common Neighbors (CN) algorithm is a classical structure-based link prediction algorithm, also called a structural equivalence algorithm: the more common neighbors two nodes have, the more similar the two nodes are. Link prediction algorithms derived from the CN algorithm include the Salton algorithm, the Jaccard algorithm, the Sørensen algorithm, the HPI (Hub Promoted Index), the HDI (Hub Depressed Index), the LHN-I algorithm, the AA (Adamic-Adar) algorithm, and the RA (Resource Allocation) algorithm. The Salton algorithm is also called the cosine similarity algorithm; the Sørensen algorithm is often used to study ecological data; the HPI algorithm is often used to analyze the topological similarity of metabolic networks; the idea of the AA algorithm is that a common neighbor with small degree contributes more than a common neighbor with large degree; and the RA algorithm was proposed on the basis of the AA algorithm, inspired by a resource allocation process. Path-based similarity algorithms mainly include the Local Path (LP) index, the Katz algorithm, and the LHN-II algorithm. They overcome the drawback that the CN algorithm uses too little of the network's information and exploit the information of the network from a global perspective, which improves the accuracy of link prediction to a certain extent.
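As a rough illustration of the neighbor-based indices named above, the following Python sketch computes the CN, Jaccard, AA, and RA scores for one unconnected node pair. The adjacency-dictionary representation, the function names, and the toy edge list are assumptions made for illustration and do not come from the patent.

```python
import math

def build_adjacency(edge_list):
    """Build an undirected adjacency dictionary {node: set of neighbors}."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def common_neighbors(adj, x, y):
    """CN: number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])

def jaccard(adj, x, y):
    """Jaccard: shared neighbors divided by the union of both neighborhoods."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0

def adamic_adar(adj, x, y):
    """AA: each common neighbor z contributes 1 / log(degree of z).
    A common neighbor of two distinct nodes has degree >= 2, so log > 0."""
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])

def resource_allocation(adj, x, y):
    """RA: each common neighbor z contributes 1 / (degree of z)."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])

# Hypothetical toy network used only for this example.
toy_edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)]
adj = build_adjacency(toy_edges)

# Nodes 1 and 4 are not directly connected; their common neighbors are 2 and 3.
for name, fn in [("CN", common_neighbors), ("Jaccard", jaccard),
                 ("AA", adamic_adar), ("RA", resource_allocation)]:
    print(name, fn(adj, 1, 4))
```

All four indices look only at paths of length 2, which is the limitation the background goes on to discuss.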

The classical algorithms above mainly consider the topological structure of the network: the more similar the structural features of two nodes, the more likely the two nodes are to form a link. Simulations on many networks have shown these methods to be effective. However, most of them consider only the degree information of the intermediate nodes on paths of length two between a pair of nodes that has no direct edge, and do not consider the attributes of the intermediate nodes on paths longer than two, even though these attributes in fact have a great effect on link formation between the pair. The traditional link prediction algorithm based on the seed node neighbor set considers only the intermediate nodes on paths of length 2 between the seed nodes and merely counts them without differentiating among them, so the importance of the individual intermediate nodes cannot be distinguished.

Disclosure of Invention

To overcome the low prediction accuracy of the existing link prediction method based on the seed node neighbor set, which only considers the intermediate nodes on paths of length 2 and 3 and only considers the degrees of those nodes, the invention provides a more accurate link prediction method based on Bayesian estimation and the seed node neighbor set.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A link prediction method based on Bayesian estimation and a seed node neighbor set comprises the following steps:

Step one: establish a network model G(V, E), where V is the set of nodes in the network and E is the set of connecting edges in the network. The total number of nodes in the network is denoted N, U denotes the set of node pairs in the network, and |U| = N(N-1)/2 is the total number of node pairs in the network;
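A minimal sketch of the network model in step one, assuming an undirected simple graph stored as an adjacency dictionary. The helper names and the toy edge list are hypothetical and are not part of the patent.

```python
def build_network(edge_list):
    """Return the model G(V, E) as {node: set of neighbors}; self-loops are skipped."""
    adj = {}
    for u, v in edge_list:
        if u == v:
            continue  # assumption: the network is a simple graph
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def node_count(adj):
    """N: total number of nodes."""
    return len(adj)

def edge_count(adj):
    """|E|: each undirected edge appears in two neighbor sets."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2

def node_pair_count(adj):
    """|U| = N(N-1)/2: total number of node pairs."""
    n = node_count(adj)
    return n * (n - 1) // 2

# Hypothetical toy network used only for this example.
adj = build_network([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)])
print(node_count(adj), edge_count(adj), node_pair_count(adj))
```

Isolated nodes would have to be added to the dictionary explicitly, since an edge list alone cannot represent them.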

Step two: arbitrarily select two nodes x and y in the network as seed nodes, and calculate the probability that a direct connecting edge exists between the two nodes:

P(A₁) = |E| / |U|

where |E| is the total number of edges actually present in the network, and A₁ denotes that a direct connecting edge exists between nodes x and y;

Step three: calculate the probability that no direct connecting edge exists between any two nodes x and y in the network:

P(A₀) = 1 - |E| / |U|

where A₀ denotes that no direct connecting edge exists between nodes x and y;
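The two probabilities in steps two and three can be sketched as follows, assuming they are estimated by the edge density |E|/|U| of the network and its complement. That estimate, the function name, and the toy numbers are assumptions for illustration.

```python
def edge_priors(num_edges, num_nodes):
    """Return (P(A1), P(A0)) under the assumed edge-density estimate.

    P(A1) = |E| / |U| with |U| = N(N-1)/2, and P(A0) = 1 - P(A1).
    """
    if num_nodes < 2:
        raise ValueError("at least two nodes are needed to form a node pair")
    num_pairs = num_nodes * (num_nodes - 1) // 2
    if not 0 <= num_edges <= num_pairs:
        raise ValueError("edge count must lie between 0 and |U|")
    p_a1 = num_edges / num_pairs
    return p_a1, 1.0 - p_a1

# Hypothetical network with N = 5 nodes and |E| = 6 edges, so |U| = 10.
p_a1, p_a0 = edge_priors(num_edges=6, num_nodes=5)
print(p_a1, p_a0)
```

Both values depend only on N and |E|, so they are the same for every node pair and need to be computed only once per network.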

step four: an intermediate node V according to a path of length 2 or 3 between nodes x and y_wCalculating the probability of generating a connecting edge between the nodes x and y:

P(A₁|V_w)＝C_w

wherein, C_w＝2E_w/k_w(k_w-1),k_wRepresents a node V_wDegree of (E)_wRepresents a node V_wK of (a)_wThe number of edges actually existing between the neighbor nodes;

step five: an intermediate node V according to a path of length 2 or 3 between nodes x and y_wCalculating the probability that no connecting edge is generated between the nodes x and y:

P(A₀|V_w)＝1-C_w；
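The quantity C_w used in steps four and five is the local clustering coefficient of node V_w. The sketch below computes it directly from the definition C_w = 2E_w / (k_w(k_w-1)). Returning 0 for nodes of degree below 2, where the formula is undefined, is an assumption of this sketch, as are the function names and the toy network.

```python
def clustering_coefficient(adj, w):
    """C_w = 2 * E_w / (k_w * (k_w - 1)) for node w in adjacency dict adj."""
    nbrs = sorted(adj[w])
    k_w = len(nbrs)
    if k_w < 2:
        return 0.0  # assumption: undefined case treated as 0
    # E_w: number of edges that actually exist among the k_w neighbors of w.
    e_w = sum(1 for i in range(k_w) for j in range(i + 1, k_w)
              if nbrs[j] in adj[nbrs[i]])
    return 2.0 * e_w / (k_w * (k_w - 1))

def p_link_given_node(adj, w):
    """P(A1 | V_w) = C_w."""
    return clustering_coefficient(adj, w)

def p_no_link_given_node(adj, w):
    """P(A0 | V_w) = 1 - C_w."""
    return 1.0 - clustering_coefficient(adj, w)

# Hypothetical toy network used only for this example.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}

# Node 2 has neighbors 1, 3, 4; edges 1-3 and 3-4 exist among them, so E_w = 2.
print(p_link_given_node(adj, 2), p_no_link_given_node(adj, 2))
```

An intermediate node whose neighbors are densely interconnected therefore counts as stronger evidence for a new edge than one whose neighbors are mostly unconnected.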

step six: calculating any one intermediate node V of the path with the length of 2 and 3 between the nodes x and y according to a Bayesian estimation method_wLikelihood value of

Step seven: repeat steps four to six for each intermediate node on the paths of length 2 and 3 between nodes x and y, and calculate the likelihood value of each intermediate node;

Step eight: calculate the similarity score for nodes x and y:

where Q is the number of intermediate nodes on all paths of length 2 and 3 between nodes x and y; M_x is the sum of the numbers of first-order neighbors and second-order neighbors of node x, where the first-order neighbors of node x are the nodes at distance 1 from node x and the second-order neighbors of node x are the nodes at distance 2 from node x; and M_y is the sum of the numbers of first-order and second-order neighbors of node y;
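The quantities Q, M_x, and M_y defined above can be gathered as in the following sketch. It assumes simple paths in an undirected graph and counts each intermediate node once even if it lies on several paths; the text does not state whether repeats are counted, so that choice is an assumption, as are the function names and the toy network.

```python
def intermediate_nodes(adj, x, y):
    """Distinct intermediate nodes on paths of length 2 and 3 between x and y."""
    mids = set(adj[x] & adj[y])            # length 2: x - w - y
    for u in adj[x]:                       # length 3: x - u - v - y
        if u == y:
            continue
        for v in adj[u]:
            if v != x and v != y and y in adj[v]:
                mids.update((u, v))
    return mids

def first_and_second_order_count(adj, x):
    """M_x: number of nodes at distance 1 from x plus number at distance 2."""
    first = set(adj[x])
    second = set()
    for u in first:
        second |= adj[u]
    second -= first        # nodes at distance 1 are not second-order neighbors
    second.discard(x)      # x itself is not its own neighbor
    return len(first) + len(second)

# Hypothetical toy network used only for this example.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}

mids = intermediate_nodes(adj, 1, 5)   # paths 1-2-4-5 and 1-3-4-5
q = len(mids)
m_x = first_and_second_order_count(adj, 1)
m_y = first_and_second_order_count(adj, 5)
print(sorted(mids), q, m_x, m_y)
```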

Step nine: traverse the whole network, repeat steps two to eight for every pair of unconnected nodes to calculate the similarity scores of all unconnected node pairs, sort the pairs by similarity score from high to low, and take the node pairs corresponding to the first B similarity scores as the predicted connecting edges, where B is a preset positive integer, B ≤ D, and D is the number of unconnected node pairs in the network.
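The ranking in step nine can be sketched as below. The scoring function is passed in as a parameter, and the demonstration plugs in a plain common-neighbor count as a stand-in, not the similarity score of the method. The function names, the tie-breaking rule, and the toy network are assumptions for illustration.

```python
from itertools import combinations

def unconnected_pairs(adj):
    """All node pairs (x, y) with x < y that have no direct connecting edge."""
    return [(x, y) for x, y in combinations(sorted(adj), 2) if y not in adj[x]]

def predict_links(adj, score_fn, b):
    """Return the B unconnected pairs with the highest similarity scores."""
    candidates = unconnected_pairs(adj)
    d = len(candidates)                    # D: number of unconnected pairs
    if not 0 < b <= d:
        raise ValueError("B must be a positive integer with B <= D")
    # Sort by score from high to low; ties are broken by node labels
    # (an assumption made here only to keep the output deterministic).
    ranked = sorted(candidates, key=lambda pair: (-score_fn(adj, *pair), pair))
    return ranked[:b]

def common_neighbor_score(adj, x, y):
    """Stand-in score used only for this demonstration."""
    return len(adj[x] & adj[y])

# Hypothetical toy network used only for this example.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}
print(predict_links(adj, common_neighbor_score, b=2))
```

Scoring every unconnected pair costs on the order of N² score evaluations, so on large sparse networks it is common to restrict the candidates to pairs within a few hops of each other.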

The beneficial effects of the invention are as follows: by considering the local paths of length 2 or 3 between two unconnected nodes in the network and distinguishing the contributions that the intermediate nodes make to link generation according to their degree information, the link prediction method based on Bayesian estimation and the seed node neighbor set achieves higher link prediction accuracy.

Drawings

Fig. 1 shows the effect that different intermediate nodes between a pair of nodes with no direct connecting edge have on the generation of a link between that pair.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to Fig. 1, a link prediction method based on Bayesian estimation and a seed node neighbor set includes the following steps:

Step two: arbitrarily select two nodes x and y in the network as seed nodes (the black dots in Fig. 1), and calculate the probability that a direct connecting edge exists between the two nodes:

Step three: calculate the probability that no direct connecting edge exists between any two nodes x and y in the network, as shown in Fig. 1:

step four: an intermediate node V according to a path of length 2 or 3 between nodes x and y_wDegree information (shown in fig. 1), calculating the probability of generating a connecting edge between nodes x and y:

P(A₁|V_w)＝C_w

step five: an intermediate node V according to a path of length 2 or 3 between nodes x and y_wDegree information (shown in fig. 1), calculating the probability that no connecting edge is generated between nodes x and y:

P(A₀|V_w)＝1-C_w；

Step eight: calculate the similarity score for nodes x and y:

The specific implementation steps described above make the invention clearer. Any modification or variation made within the spirit of the invention and the scope of the claims falls within the scope of protection of the invention.

Claims

1. A link prediction method based on Bayesian estimation and a seed node neighbor set, characterized in that the method comprises the following steps:

Step two: arbitrarily select two unconnected nodes x and y in the network as seed nodes, and calculate the probability that a direct connecting edge exists between the two unconnected nodes x and y:

Step three: calculate the probability that no direct connecting edge exists between any two unconnected nodes x and y in the network:

P(A₁|V_w) = C_w

P(A₀|V_w) = 1 - C_w;

Step eight: calculate the similarity score for nodes x and y: