CN106817251B

CN106817251B - Link prediction method and device based on node similarity

Info

Publication number: CN106817251B
Application number: CN201611207950.7A
Authority: CN
Inventors: 刘大伟; 柯枫; 刘玮; 隋雪青; 程学旗
Original assignee: Yantai Branch Institute Of Computing Technology Chinese Academy Of Science
Current assignee: Yantai Branch Institute Of Computing Technology Chinese Academy Of Science
Priority date: 2016-12-23
Filing date: 2016-12-23
Publication date: 2020-05-19
Anticipated expiration: 2036-12-23
Also published as: CN106817251A

Abstract

The invention relates to a link prediction method based on node similarity, comprising the following steps: representing a network to be analyzed as nodes and links; acquiring two nodes in the network that are not directly connected by a link; reading the neighbor node set of each of the two nodes; taking the intersection of the two neighbor node sets to obtain a common neighbor set; taking the common neighbor set as a subnet, and calculating the global cluster coefficient of the subnet and the cluster coefficient of any common neighbor node in the subnet; calculating the similarity of the two nodes according to the global cluster coefficient of the subnet and the cluster coefficient of any common neighbor node in the subnet; and performing link prediction according to the calculated node similarity. When analyzing the interrelation of nodes in the local structure of a complex network, the invention defines, from the viewpoint of the cluster coefficient, a new index for calculating node similarity based on the local cluster coefficient.

Description

Link prediction method and device based on node similarity

Technical Field

The present invention relates to the field of link prediction, and in particular, to a link prediction method and apparatus based on node similarity.

Background

Pattern and evolution analysis of complex networks has important research and application value, and calculating node similarity in a network is the basis for solving many practical problems such as link prediction and attribute inference. Methods for calculating node similarity in complex networks can support applications such as biological network analysis, friend recommendation in social networks, personalized content recommendation, attribute inference in information networks, and label classification.

In link prediction research, the undirected, unweighted network is the most basic form of representation, and similarity-based methods are the simple and effective methods that were proposed and validated earliest. On the one hand, current network data is far richer in description than in the past. Taking a social network as an example, a user represented by a node carries more attribute information, including personalized information such as interests, preferences, region and behavior; edges can represent static relations among users, such as friendship or following, and can also represent dynamic relations such as interaction and spatio-temporal association. Rich network attribute information and the physical meaning of the topology provide a wide range of research objects for defining similarity. On the other hand, the scale of network data is growing rapidly, and practical applications place stricter demands on algorithm performance. Although some complex modeling methods and probabilistic models, such as likelihood analysis, outperform similarity methods on accuracy metrics, their research value is limited to understanding the structural characteristics of networks, and the limited node scale they can process, together with their high time complexity, makes them poorly suited to practical use.

Existing similarity-based algorithms mainly include similarity indexes based on local information, on paths, and on random walks, of which the local-information index based on common neighbors is the most commonly used. In the common neighbor index, the similarity of two nodes is characterized by the number of their common neighbor nodes, and the index has been shown to be effective in empirical networks. Some improved methods operate on the set of common neighbors of the two nodes, such as the Salton index, which uses cosine similarity. Other optimizations take the respective degrees of the two nodes into account and give indexes that favor large-degree or small-degree nodes. The Adamic-Adar index takes the degree of each common neighbor node into account: a common neighbor of small degree contributes more than one of large degree, with the contribution decreasing as the reciprocal of the logarithm of the node degree. The network resource allocation index, inspired by the resource allocation process in networks, is similar in nature to the Adamic-Adar index but decreases as the reciprocal of the node degree. The node gravity index takes the interrelation among the common neighbor nodes into account.

Disclosure of Invention

The technical problem to be solved by the invention is that conventional methods for calculating node similarity have high time complexity and low accuracy; to address this, the invention provides a link prediction method and device based on node similarity.

The technical scheme for solving the technical problems is as follows: a link prediction method based on node similarity comprises the following steps:

step 1: carrying out node and link representation on a network to be analyzed;

step 2: acquiring two nodes in the network which are not directly connected by a link;

step 3: reading the respective neighbor node sets of the two nodes without direct link connection;

step 4: taking the intersection of the neighbor node sets of the two nodes without direct link connection to obtain a common neighbor set;

step 5: taking the common neighbor set as a subnet, and calculating the global cluster coefficient of the subnet and the cluster coefficient of any common neighbor node in the subnet;

step 6: calculating the similarity of the two nodes without direct link connection according to the global cluster coefficient of the subnet and the cluster coefficient of any common neighbor node in the subnet;

step 7: performing link prediction according to the calculated node similarity.
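As an illustration only, steps 1 to 4 can be sketched in Python. The adjacency-map representation and the function names `build_adjacency`, `unlinked_pairs` and `common_neighbors` are assumptions made for this sketch and do not appear in the patent text.

```python
from itertools import combinations

def build_adjacency(edges):
    """Step 1 (assumed representation): map each node to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops carry no information in an undirected, unweighted graph
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def unlinked_pairs(adj):
    """Step 2: yield every pair of nodes that has no direct link."""
    for x, y in combinations(sorted(adj), 2):
        if y not in adj[x]:
            yield x, y

def common_neighbors(adj, x, y):
    """Steps 3 and 4: read both neighbor sets and take their intersection."""
    return adj[x] & adj[y]

# Toy network, invented for this example.
edges = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
adj = build_adjacency(edges)
pairs = list(unlinked_pairs(adj))
shared = common_neighbors(adj, "a", "b")
```

In this toy network the pair ("a", "b") has no direct link and shares the neighbors "c" and "d", so it is a candidate for the later steps.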

The beneficial effects of the invention are as follows: when analyzing the interrelation of nodes in the local structure of a complex network, the invention defines, from the perspective of the cluster coefficient, a new node similarity index based on the local cluster coefficient, and thereby provides a way of calculating node similarity from a new perspective.

On the basis of the technical scheme, the invention can be further improved as follows.

Further, the step 5 specifically includes:

step 5.1: for the local cluster coefficient c_z of any one common neighbor node z in the subnet, randomly taking two common neighbor nodes v, w different from z in the common neighbor set, and when the degree of z is greater than 1, calculating the local cluster coefficient c_z according to a formula in A and B,

wherein A is the number of links among the three nodes z, v and w, and B is the number of links between z and v, between z and w and between v and w; when the degree of z is 1 or less, c_z is 0;

step 5.2: calculating the global cluster coefficient C of the common neighbor set according to a formula in t and s,

wherein t is the number of triangles formed by the nodes in the common neighbor set subnet, and s is the number of V shapes formed by the nodes in the common neighbor set subnet.
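The patent's own formulas for c_z and C are not reproduced in this text, so the following Python sketch falls back on the standard textbook definitions as an assumption: the local coefficient is the fraction of a node's neighbor pairs that are themselves linked, and the global coefficient (transitivity) is the fraction of V shapes that are closed into triangles. It also assumes that "subnet" means the subgraph induced by the common neighbor set. The function names are hypothetical, and the patented formulas may differ.

```python
from itertools import combinations

def induced_subnet(adj, nodes):
    """Restrict an adjacency map to `nodes` (assumed meaning of 'subnet')."""
    nodes = set(nodes)
    return {n: adj[n] & nodes for n in nodes}

def local_cluster_coefficient(sub, z):
    """Standard local clustering: linked neighbor pairs / all neighbor pairs.

    Returns 0 when the degree of z is 1 or less, as the text specifies.
    """
    k = len(sub[z])
    if k <= 1:
        return 0.0
    closed = sum(1 for v, w in combinations(sub[z], 2) if w in sub[v])
    return closed / (k * (k - 1) / 2)

def global_cluster_coefficient(sub):
    """Standard transitivity: closed V shapes / all V shapes.

    Each triangle closes three V shapes, one centered on each of its nodes.
    """
    closed = 0
    v_shapes = 0
    for z in sub:
        for v, w in combinations(sub[z], 2):
            v_shapes += 1
            if w in sub[v]:
                closed += 1
    return closed / v_shapes if v_shapes else 0.0

# Toy subnet, invented for this example: triangle 1-2-3 plus node 4 attached to node 1.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
sub = induced_subnet(adj, {1, 2, 3, 4})
c_1 = local_cluster_coefficient(sub, 1)
C = global_cluster_coefficient(sub)
```

Here node 1 has three neighbor pairs of which one is linked, and the subnet has five V shapes of which three are closed.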

Further, the step 6 specifically includes:

representing the two nodes without direct link connection by x and y, and calculating their similarity according to a formula in which C is the global cluster coefficient of the common neighbor set, c_z is the local cluster coefficient of the common neighbor node z, and Γ(x) ∩ Γ(y) is the common neighbor set of nodes x and y.

Further, the step 7 specifically includes:

step 7.1: repeating steps 2 to 6 to obtain the similarity of every pair of nodes in the complex network that are not directly connected by a link;

step 7.2: sorting the node similarities in descending order.
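Steps 7.1 and 7.2 amount to scoring every unlinked pair and sorting the scores. A minimal Python sketch follows, in which the similarity index is passed in as a function. The name `rank_unlinked_pairs` is hypothetical, and the common neighbor count is used only as a stand-in scorer, since the patent's similarity formula is not reproduced in this text.

```python
from itertools import combinations

def rank_unlinked_pairs(adj, score):
    """Score every pair without a direct link, then sort in descending order."""
    scored = []
    for x, y in combinations(sorted(adj), 2):
        if y not in adj[x]:
            scored.append(((x, y), score(adj, x, y)))
    # Python's sort is stable, so tied pairs keep their enumeration order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored

def common_neighbor_count(adj, x, y):
    """Stand-in scorer (not the patented index): number of common neighbors."""
    return len(adj[x] & adj[y])

# Toy network, invented for this example.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d"},
}
ranking = rank_unlinked_pairs(adj, common_neighbor_count)
```

The pairs at the head of `ranking` are the ones predicted as most likely to form a link; swapping in a different `score` function changes the index without touching the rest of the procedure.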

Further, the network is an undirected and unweighted complex network.

Corresponding to the above method, the technical solution of the present invention further includes a link prediction apparatus based on node similarity, comprising a representation module, a calculation module, a traversal module and a prediction module;

the representation module is used for representing nodes and links of the complex network to be analyzed;

the calculation module is used for acquiring two nodes in the complex network that are not connected by a direct link, obtaining the respective neighbor node sets of the two nodes, taking the intersection of the two neighbor node sets to obtain a common neighbor set, taking the common neighbor set as a subnet, calculating the global cluster coefficient of the subnet and the cluster coefficient of any one common neighbor node in the subnet, and calculating the node similarity according to the global cluster coefficient and the cluster coefficient of the any one common neighbor node;

the traversal module is used for traversing all nodes of the complex network by means of the calculation module to obtain the similarity of every pair of nodes in the complex network that are not directly connected by a link;

the prediction module is used for performing link prediction according to the similarity obtained by the traversal module and the calculation module.

Further, calculating the node similarity includes:

calculating the similarity of nodes x and y according to a formula in which C is the global cluster coefficient of the common neighbor set, c_z is the local cluster coefficient of a common neighbor node z, and Γ(x) ∩ Γ(y) is the common neighbor set of nodes x and y;

for the local cluster coefficient c_z of any one common neighbor node z in the subnet, randomly taking two common neighbor nodes v, w different from z in the common neighbor set, and when the degree of z is greater than 1, calculating the local cluster coefficient c_z according to a formula;

for the global cluster coefficient C of the common neighbor set, calculating according to a formula.

Further, the prediction module sorts the node similarities in descending order.

Further, the network is an undirected and unweighted complex network.

Drawings

Fig. 1 is a flowchart illustrating a method and an apparatus for link prediction based on similarity according to the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.

As shown in fig. 1, a link prediction method based on node similarity according to an embodiment of the present invention includes the following steps:

step 1: carrying out node and link representation on a network to be analyzed;

Wherein, the step 5 specifically comprises:

step 5.1: for the local cluster coefficient c_z of any one common neighbor node z in the subnet, randomly taking two common neighbor nodes v, w different from z in the common neighbor set, and when the degree of z is greater than 1, calculating the local cluster coefficient c_z according to a formula in A and B,

wherein A is the number of links among the three nodes z, v and w, and B is the number of links between z and v, between z and w and between v and w; when the degree of z is 1 or less, c_z is 0.

Wherein, the step 6 specifically comprises:

Wherein, the step 7 specifically comprises:

step 7.2: sorting the node similarities in descending order.

Wherein the network is an undirected and unweighted complex network.

The embodiment of the invention also provides a link prediction device based on node similarity, comprising a representation module, a calculation module, a traversal module and a prediction module;

Wherein the calculating the node similarity comprises:

for the local cluster coefficient c_z of any one common neighbor node z in the subnet, randomly taking two common neighbor nodes v, w different from z in the common neighbor set, and when the degree of z is greater than 1, calculating the local cluster coefficient c_z according to a formula.

The prediction module sorts the node similarities in descending order.

Wherein the network is an undirected and unweighted complex network.

TABLE 1 Similarity indexes comparable to that of the present invention

As shown in Table 1, x and y are the two nodes whose similarity is to be calculated, Γ(x) is the neighbor node set of node x, Γ(y) is the neighbor node set of node y, and the local structure is the common neighbor set of the two nodes, namely Γ(x) ∩ Γ(y); k_x is the degree of node x, k_y is the degree of node y, k_z is the degree of a common neighbor node z, and e_z is the number of links of z within the common neighbors.

Except for the node gravity index, the indexes in Table 1 use only the number of common neighbor nodes and their respective degrees. The common neighbor index and the Salton index treat all nodes in the common neighbor set of a node pair as equivalent. However, a common neighbor node is also an ordinary individual in the network with its own characteristics, which show up as differences within the network. On this basis, the Adamic-Adar index and the network resource allocation index model this difference as the node degree, that is, the number of links of the node across the whole network. According to observation and empirical evidence, the degree of a common neighbor node is inversely related to node similarity, so the corresponding indexes take the form of a reciprocal or a reciprocal of the logarithm. The node gravity index, in addition to the degree of the common neighbor node, adds a positively correlated factor e_z, the number of links between node z and the other common neighbors and the nodes x and y, which is equivalent to the degree of z within the local set Γ(x) ∩ Γ(y) ∪ {x, y}; in essence, nodes with a large local degree have a greater influence on the similarity.
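For orientation, four of the comparison indexes can be sketched in Python using their widely used textbook forms: the common neighbor count, the Salton cosine form, the Adamic-Adar sum of 1/log k_z, and the resource allocation sum of 1/k_z. These forms are assumed to match Table 1, which is not reproduced in this text; the node gravity index is left out because its formula is not given here. The function names are hypothetical.

```python
import math

def cn_index(adj, x, y):
    """Common neighbor index: |Γ(x) ∩ Γ(y)|."""
    return len(adj[x] & adj[y])

def salton_index(adj, x, y):
    """Salton (cosine) index: |Γ(x) ∩ Γ(y)| / sqrt(k_x * k_y)."""
    k_x, k_y = len(adj[x]), len(adj[y])
    if k_x == 0 or k_y == 0:
        return 0.0
    return cn_index(adj, x, y) / math.sqrt(k_x * k_y)

def adamic_adar_index(adj, x, y):
    """Adamic-Adar index: sum of 1 / log(k_z) over common neighbors z.

    A common neighbor is linked to both x and y, so k_z >= 2 and log(k_z) > 0.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])

def resource_allocation_index(adj, x, y):
    """Resource allocation index: sum of 1 / k_z over common neighbors z."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])

# Toy network, invented for this example.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d"},
}
cn = cn_index(adj, "a", "b")
sa = salton_index(adj, "a", "b")
aa = adamic_adar_index(adj, "a", "b")
ra = resource_allocation_index(adj, "a", "b")
```

For the pair ("a", "b") the common neighbors are "c" (degree 3) and "d" (degree 4), which shows how the Adamic-Adar and resource allocation forms weight the lower-degree neighbor more heavily.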

The complex network to be analyzed is represented by an undirected, unweighted graph G = (N, E), where N is the set of nodes and E is the set of links. Whether two nodes form a link is taken to be positively related to their similarity: the greater the similarity of two nodes, the more likely a link is to form between them. For a given network data set G, the similarity value between every pair of nodes is calculated according to the similarity index (forming a similarity matrix S), and ranking the similarity values gives a ranking of the likelihood that a link forms between two nodes. In this embodiment, node similarity is measured from an analysis of the network topology alone; other node attributes arising from the particular nature of a network, such as the personalized interests and behavioral attributes of nodes representing users in a social network, are not considered.

The similarity of two nodes x and y without a direct link is calculated according to a formula in which C is the global cluster coefficient of the common neighbor set, c_z is the local cluster coefficient of a common neighbor node z, and Γ(x) ∩ Γ(y) is the common neighbor set of nodes x and y. For the local cluster coefficient c_z, two common neighbor nodes v, w different from z are randomly taken in the common neighbor set, and when the degree of z is greater than 1, c_z is calculated according to a second formula. The global cluster coefficient C is calculated according to a third formula,

where t is the number of triangles formed by the nodes in the common neighbor set subnet, and s is the number of V shapes formed by the nodes in the common neighbor set subnet. The similarity between nodes x and y can be calculated from these three formulas. The advantages of the new index, which takes the perspective of the cluster coefficient, over other indexes for node similarity calculation are compared through actual computation.

Six data sets, covering biological, social and information networks among others, are selected, and the invention is compared with five other similarity calculation methods. The selected data sets are: the US aviation network (UA), consisting of 332 airports and 2126 routes of the US air transportation system; the US political blog network (PB), consisting of 1222 blogs and 19021 hyperlinks among them; the Sina Weibo network (WB), consisting of 8735 following relations among 500 Sina Weibo VIP users, collected through the API provided by Sina Weibo; the nematode neural network (CE), consisting of 297 neurons and 2345 neural connections of the nematode; the Wikipedia network (WK), consisting of 7066 Wikipedia users and 103663 votes; and the e-mail network (EM), consisting of 1133 e-mail accounts at a Spanish university and 5451 mail exchanges among them. Common local similarity indexes are selected for comparison: the common neighbor index (CN) and its improvements, namely the Salton index (SA), which uses cosine similarity; the Adamic-Adar index (AA), which favors common neighbor nodes of small degree; the index based on network resource allocation (RA); and the recently proposed node gravity index (IA). Note that the framework of the link prediction steps is the same in every case; only the node similarity calculation differs.

TABLE 2 predicted AUC evaluation results of different similarity calculation methods

AUC	CN	SA	AA	RA	IA	The invention
UA	0.9509	0.9304	0.9619	0.9665	0.9662	0.9723
PB	0.9206	0.8728	0.9228	0.9238	0.9251	0.9528
CE	0.8581	0.7993	0.8768	0.8787	0.8794	0.8909
WB	0.8122	0.8005	0.8111	0.8089	0.8100	0.8766
WK	0.8637	0.7935	0.8641	0.8585	0.8660	0.8788
EM	0.8498	0.8510	0.8548	0.8539	0.8556	0.8598

The AUC evaluation of link prediction performance is computed as AUC = (N₀ + 0.5N_e)/N. The network is divided into a training set and a test set (in the ratio 9:1). After the similarity score between every pair of nodes has been calculated on the training set, one edge is randomly selected from the test set and one from the set of nonexistent edges each time, and the two are compared: if the similarity score of the test-set edge is greater than that of the nonexistent edge, 1 point is added (N₀ times in total); if the two similarity scores are equal, 0.5 points are added (N_e times in total). This comparison is carried out independently N times. If all scores were generated at random, the resulting AUC would be 0.5, so the extent to which an algorithm's AUC exceeds 0.5 measures how much more accurate it is than random selection. According to the results in Table 2, the node similarity method provided by the invention performs better than the comparison algorithms on the link prediction problem on all six data sets. Link prediction is performed according to the node similarity calculated by the invention: the greater the node similarity, the more likely the two nodes are to form a link.
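The sampling procedure behind AUC = (N₀ + 0.5N_e)/N can be sketched in Python as follows. The function names, the use of `random.Random` for reproducibility, and the way the 9:1 split is rounded are assumptions of this sketch; the scores passed in are taken to be the similarity values already computed on the training set.

```python
import random

def split_edges(edges, test_fraction, rng):
    """Randomly split edges into (training, test); 0.1 gives the 9:1 ratio."""
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def auc_by_sampling(test_scores, nonexistent_scores, n_comparisons, rng):
    """Estimate AUC = (N0 + 0.5 * Ne) / N by repeated random comparisons.

    test_scores: similarity scores of edges held out in the test set.
    nonexistent_scores: similarity scores of node pairs with no edge at all.
    """
    n0 = 0  # comparisons won by the test-set edge
    ne = 0  # comparisons that end in a tie
    for _ in range(n_comparisons):
        s_test = rng.choice(test_scores)
        s_none = rng.choice(nonexistent_scores)
        if s_test > s_none:
            n0 += 1
        elif s_test == s_none:
            ne += 1
    return (n0 + 0.5 * ne) / n_comparisons

rng = random.Random(0)
train, test = split_edges(list(range(20)), 0.1, rng)
perfect = auc_by_sampling([0.9, 0.8], [0.1, 0.2], 100, rng)
tied = auc_by_sampling([0.5], [0.5], 100, rng)
```

A scorer that always ranks held-out edges above nonexistent ones reaches an AUC of 1, while a scorer that cannot tell them apart stays at 0.5, which is the baseline the text compares against.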

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A link prediction method based on node similarity is characterized by comprising the following steps:

step 1: carrying out node and link representation on a network to be analyzed;

step 6: calculating the node similarity of the two nodes without direct link connection according to the global cluster coefficient of the subnet and the cluster coefficient of any one common neighbor node in the subnet, specifically, representing the two nodes without direct link connection by x and y, the similarity of which is calculated according to a formula in which C is the global cluster coefficient of the common neighbor set, c_z is the local cluster coefficient of any one common neighbor node z, and Γ(x) ∩ Γ(y) is the common neighbor set of nodes x and y;

2. The method for link prediction based on node similarity as claimed in claim 1, wherein the step 5 comprises:

step 5.1: for the local cluster coefficient c_z of any one common neighbor node z in the subnet, randomly taking two common neighbor nodes v, w different from z in the common neighbor set, and when the degree of z is greater than 1, calculating the local cluster coefficient c_z according to a formula in A and B,

wherein A is the number of links among the three nodes z, v and w, B is the number of links between z and v, between z and w and between v and w, and when the degree of z is less than or equal to 1, c_z is 0;

3. The method for link prediction based on node similarity according to any of claims 1-2, wherein the step 7 comprises:

step 7.2: sorting the node similarities in descending order.

4. The method according to claim 1, wherein the network is an undirected and unweighted complex network.

5. A link prediction apparatus based on node similarity, comprising a representation module, a calculation module, a traversal module and a prediction module;

the representation module is used for representing nodes and links of a network to be analyzed;

the calculation module is used for acquiring two nodes in the network that are not connected by a direct link, obtaining the respective neighbor node sets of the two nodes, taking the intersection of the two neighbor node sets to obtain a common neighbor set, taking the common neighbor set as a subnet, calculating the global cluster coefficient of the subnet and the cluster coefficient of any one common neighbor node in the subnet, and calculating the node similarity according to the global cluster coefficient and the cluster coefficient of the any one common neighbor node; specifically, the two nodes not connected by a direct link are represented by x and y, and their similarity is calculated according to a formula;

the traversal module is used for traversing all nodes of the network by means of the calculation module to obtain the similarity of every pair of nodes in the network that are not directly connected by a link;

6. The link prediction apparatus based on node similarity as claimed in claim 5, characterized in that, for the local cluster coefficient c_z of any one common neighbor node z in the subnet, two common neighbor nodes v, w different from z are randomly taken in the common neighbor set, and when the degree of z is greater than 1, the local cluster coefficient c_z is calculated according to a formula.

7. The apparatus of claim 5, wherein the prediction module sorts the node similarities in descending order.

8. The apparatus according to claim 5, wherein the network is an undirected and unweighted complex network.