CN110569885A

CN110569885A - Multi-order motif directed network link prediction method based on naive Bayes

Info

Publication number: CN110569885A
Application number: CN201910764249.2A
Authority: CN
Inventors: 刘亚芳; 许小可; 肖婧
Original assignee: Dalian Nationalities University
Current assignee: Dalian Minzu University
Priority date: 2019-08-19
Filing date: 2019-08-19
Publication date: 2019-12-13

Abstract

The invention discloses a multi-order motif directed network link prediction method based on naive Bayes, comprising the following steps: S1, counting the overall structures in which two nodes can form a motif structure with a test edge; S2, calculating the number of closed motif structures that every edge can form with other nodes in the network; S3, calculating the number of unclosed motif structures that every edge can form with other nodes in the network; S4, calculating, for each edge, the role function value of the motif to be formed; and S5, computing the total role function value of each edge. The application provides a naive Bayes-based link prediction algorithm for four-node motifs of a directed network, which makes full use of the structural characteristics of the directed network and greatly improves the accuracy of link prediction.

Description

Multi-order motif directed network link prediction method based on naive Bayes

Technical Field

The invention relates to a link prediction method, in particular to a multi-order motif directed network link prediction method based on naive Bayes.

Background

Link prediction is an important research direction in the field of complex networks. Its basic problem is to predict the likelihood of a link between any two nodes in a network from known information such as the network nodes and the network structure. Link prediction can estimate the likelihood that edges absent from the network will appear in the future, and can also identify spurious edges among the existing edges and recover missing edges.

Among link prediction methods based on network structure, common neighbor similarity is the most widely used. The node-based common neighbor method studied by Liben-Nowell and Kleinberg is among the methods with the best prediction accuracy. However, the common neighbor index does not consider the direction of links between nodes and cannot be applied directly to a directed network. The predicted edge and a common neighbor form a closed triangular structure; taking direction into account on top of this triangular structure yields a local structure of the directed network.

A common link prediction algorithm is based on the common neighbor index. Its idea is that the more common neighbors two nodes have, the more likely an edge is to form between them. The algorithm assumes that every common neighbor contributes equally to the formation of a link. In many real networks, however, this assumption is not reasonable. For example, people can form a new friendship through common friends. But when two people both follow the same celebrity, the number of celebrities they follow in common has little bearing on whether the two are connected: because a celebrity is so widely followed, two of its followers usually do not know each other. Such a common neighbor clearly has little influence on whether a link exists between the two people, whereas if the common neighbor is a friend of both, a link between them forms easily. It is therefore necessary to take the influence of individual nodes into account when performing link prediction on a directed network.

Disclosure of Invention

The application provides a multi-order motif directed network link prediction method based on naive Bayes, and the prediction accuracy is improved by adding a role function.

In order to achieve the purpose, the technical scheme of the application is as follows: a multi-order motif directed network link prediction method based on naive Bayes comprises the following steps:

S1, counting the overall structures in which two nodes can form a motif structure with a test edge;

S2, calculating the number N_Δw of closed motif structures that every edge can form with other nodes in the network;

S3, calculating the number N_Λw of unclosed motif structures that every edge can form with other nodes in the network;

S4, calculating, for each edge, the role function value of the motif to be formed;

S5, computing the total role function value of each edge, where w denotes any node (for a three-node motif) or edge (for a four-node motif) that can be combined with the edge (x, y) to form the specified motif shape, and O_xy is the overlap between the neighbors of node x in the specified direction and the neighbors of node y in the specified direction.

Further, for a directed unweighted network, Γ(x) denotes the neighbors of node x in the specified direction, and Γ(y) denotes the neighbors of node y in the specified direction;

O_xy = |Γ(x) ∩ Γ(y)|

that is, O_xy is the number of neighbors of node x in the specified direction that are also neighbors of node y in the specified direction.
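The overlap defined above can be illustrated with a minimal Python sketch. It assumes the network is stored as a set of directed (source, target) pairs; the function names `neighbors` and `overlap` and the direction labels "out" and "in" are hypothetical and stand in for whatever "specified direction" a given motif fixes.

```python
def neighbors(edges, node, direction):
    """Return the neighbors of `node` in the given direction.

    direction == "out": nodes that `node` points to.
    direction == "in":  nodes that point to `node`.
    """
    if direction == "out":
        return {v for (u, v) in edges if u == node}
    if direction == "in":
        return {u for (u, v) in edges if v == node}
    raise ValueError("direction must be 'out' or 'in'")


def overlap(edges, x, y, dir_x, dir_y):
    """Count nodes that are neighbors of x in dir_x and of y in dir_y."""
    return len(neighbors(edges, x, dir_x) & neighbors(edges, y, dir_y))


# Hypothetical toy network.
edges = {("a", "c"), ("b", "c"), ("a", "d"), ("d", "b"), ("a", "b")}
common_successors = overlap(edges, "a", "b", "out", "out")  # a -> w <- b
two_step_paths = overlap(edges, "a", "b", "out", "in")      # a -> w -> b
print(common_successors, two_step_paths)
```

Each choice of the two directions corresponds to a different directed three-node pattern around the candidate edge (x, y).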

Further, the link prediction score of each test edge is calculated as follows:

A. Counting, for each edge, the number |O_xy| of motifs it can form;

B. Calculating the number M^F of possible edges in the network;

C. Counting the number M of edges that actually exist in the network;

D. Calculating the probability that a node pair in the network is connected;

E. Calculating the probability that a node pair in the network is not connected;

F. Calculating the link prediction score of each edge.
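Steps A to F can be sketched in Python as follows. The formulas for steps D to F are not reproduced in this text, so the sketch assumes the form commonly used in local naive Bayes link prediction: a connection probability of M / M^F, its complement for non-connection, and a score of |O_xy| log s plus the summed log role values, with s the ratio of the two probabilities. The function names are hypothetical, and M^F = |V|(|V| - 1)/2 follows the definition given later in the description.

```python
import math


def link_probabilities(num_nodes, num_train_edges):
    """Steps B to E: prior probability that a node pair is linked or not."""
    m_f = num_nodes * (num_nodes - 1) / 2  # M^F, number of possible edges
    p_link = num_train_edges / m_f         # assumed form of step D
    p_no_link = 1.0 - p_link               # assumed form of step E
    return p_link, p_no_link


def nb_score(overlap_count, role_values, num_nodes, num_train_edges):
    """Step F (assumed form): |O_xy| * log(s) + sum of log role values."""
    p_link, p_no_link = link_probabilities(num_nodes, num_train_edges)
    s = p_no_link / p_link
    return overlap_count * math.log(s) + sum(math.log(r) for r in role_values)


# Hypothetical numbers: 5 nodes, 4 training edges, an edge with |O_xy| = 2
# whose two motif partners have role values 2.0 and 0.5.
score = nb_score(2, [2.0, 0.5], num_nodes=5, num_train_edges=4)
print(round(score, 4))
```

A role value above 1 raises the score and a value below 1 lowers it, which is how the contribution of each motif partner is weighted instead of being counted equally.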

Due to the adoption of the technical scheme, the invention can obtain the following technical effects: the application provides a link prediction algorithm of a four-node motif of a directed network based on naive Bayes, which fully applies the structural characteristics of the directed network and greatly improves the accuracy of link prediction.

Drawings

FIG. 1 is a graph comparing the accuracy of a link prediction based on the number of motifs to a single-motif link prediction based on naive Bayes;

FIG. 2 is a structural diagram of role function calculations for the third-order motif and the fourth-order motif.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention. It should be understood that the described examples are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

The application provides a multi-order motif directed network link prediction method based on naive Bayes, which comprises the following steps:

S4, calculating, for each edge, the role function value of the motif to be formed;

S5, computing the total role function value of each edge, where w denotes any node or edge that can be combined with the edge (x, y) to form the specified motif shape, and O_xy is the overlap between the neighbors of node x in the specified direction and the neighbors of node y in the specified direction.

The link prediction score of each test edge is calculated as follows:

A. Counting, for each edge, the number |O_xy| of motifs it can form;

B. Calculating the number M^F of possible edges in the network;

C. Counting the number M of edges that actually exist in the network;

D. Calculating the probability that a node pair in the network is connected;

F. Calculating the link prediction score of each edge.

During link prediction, accuracy is measured with the AUC. The score of an edge from the test set positive sample is compared with the score of an edge from the test set negative sample: 1 point is added if the positive-sample score is greater, 0.5 point if the two scores are equal, and 0 points if the positive-sample score is smaller. The accuracy of link prediction is evaluated from the final score.

Experiments were carried out according to the algorithm on a number of networks, and the results are shown in FIG. 1, where I-shaped markers denote the results of the naive Bayes-based link prediction method and circles denote the results of the traditional link prediction method.

Example 2

This embodiment provides an application of the naive Bayes-based multi-order motif directed network link prediction method of Example 1. A network is represented by G(V, E), where V is the set of nodes and E is the set of edges. E is divided into two parts, a training set E^T and a test set E^P, with E^T ∩ E^P = ∅ and E^T ∪ E^P = E. 10% of the edges are selected at random as the test set positive sample E^P, and the remaining 90% are used as the training set E^T. A set of node pairs equal in size to the test set positive sample is selected from the non-existent edges as the test set negative sample. The specific steps are as follows:

S1, acquiring the original directed network data, constructing the initial network, and obtaining the list of node pairs that have no edge;

S2, randomly selecting 10% of the edges in the network data as the test set positive sample, using the remaining 90% of the edges as the training set, and selecting from the list of node pairs without an edge a set equal in size to the test set positive sample as the test set negative sample;

S3, obtaining the role function value of each individual in the network with the naive Bayes model algorithm, as described in Example 1;

S4, obtaining the r'_xy list of the node pairs from the number of common neighbors of each node pair and the role functions of those common neighbors;

S5, feeding the r'_xy lists obtained from the different predictors into the machine learning method XGBoost to obtain a new score list.

A similarity index is calculated for node x and node y to measure how likely an edge between them is to exist; a higher score means a higher likelihood of connection. When all non-existent edges are sorted by score in descending order, the edges at the top of the list are the most likely to exist.
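The data preparation in steps S1 and S2 can be sketched in Python as follows. This is a minimal illustration that assumes the network fits in memory as a set of directed (source, target) pairs; the function name `split_edges`, its parameters and the fixed random seed are hypothetical.

```python
import random


def split_edges(edges, nodes, test_fraction=0.1, seed=0):
    """Split edges into training set, test positives and test negatives."""
    rng = random.Random(seed)
    edge_list = sorted(edges)
    rng.shuffle(edge_list)

    # S2: hold out a fraction of the existing edges as test positives.
    n_test = max(1, round(test_fraction * len(edge_list)))
    test_pos = edge_list[:n_test]
    train = edge_list[n_test:]

    # S1: list the ordered node pairs that have no edge in the original data.
    existing = set(edges)
    ordered_nodes = sorted(nodes)
    non_edges = [(u, v) for u in ordered_nodes for v in ordered_nodes
                 if u != v and (u, v) not in existing]

    # S2: sample as many non-edges as there are test positives.
    test_neg = rng.sample(non_edges, n_test)
    return train, test_pos, test_neg


# Hypothetical toy network: 5 nodes, 10 directed edges.
nodes = range(5)
edges = {(u, v) for u in nodes for v in nodes if u < v}
train, test_pos, test_neg = split_edges(edges, nodes)
print(len(train), len(test_pos), len(test_neg))
```

Motif counts and role function values would then be computed on the training edges only, so that the held-out positives do not leak into the scores.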

The r'_xy value is calculated as follows. A role function value is obtained for each node or edge. Then, from the set of nodes or edges with which each edge can form the given motif, the total role function value of each edge is obtained. This total is added to the term based on the number of common neighbors in the specified direction for that edge, and the sum is taken as the r'_xy value of the edge.

In the formula, M^F = |V|(|V| - 1)/2 is the number of all possible edges in the network, and M = |E^T| is the number of edges that actually exist in the network. |V| is the total number of nodes, and E is the set of all edges in the network.

The r_xy value of a double motif is calculated as follows:

The formula has two parts. The r'_xy value is obtained from each of the two single motifs, and the two single motifs are then combined into a whole to give the double-motif result, which equals the direct sum of the results of the two motifs. Here |O^1_xy| is the number of overlapping neighbors of node x₁ and node y₁ in their specified directions, |O^2_xy| is the number of overlapping neighbors of node x₂ and node y₂ in their specified directions, R_v is the role function value of one motif, and R_w is the role function value of the other motif.
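The double-motif formula itself is not reproduced in this text. Under the assumption that each single-motif score has the usual local naive Bayes form, and with s = M^F/M - 1 also an assumption, the sum of the two single-motif results described above could be written as:

```latex
r_{xy}
  = \underbrace{\lvert O^{1}_{xy}\rvert \log s + \sum_{v \in O^{1}_{xy}} \log R_{v}}_{\text{first motif}}
  + \underbrace{\lvert O^{2}_{xy}\rvert \log s + \sum_{w \in O^{2}_{xy}} \log R_{w}}_{\text{second motif}},
\qquad s = \frac{M^{F}}{M} - 1 .
```

Only the additive structure and the symbols |O^1_xy|, |O^2_xy|, R_v and R_w come from the description; the exact form of each term is an illustration.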

The score for multiple motifs is predicted by machine learning as follows:

Using XGBoost, the r'_xy lists already obtained for the individual motifs on the test set and on the training set are brought into the machine learning framework; the model learns from the training-set r'_xy lists and produces a new score list for the test set.

In addition, the correlation among different motifs can be obtained through the XGBoost model. Different multi-motif combinations can be selected according to this correlation, and a score list can be obtained for each combination.
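As a rough illustration of how the per-motif r'_xy lists might be arranged before being handed to a learner, the Python sketch below builds one feature row per edge and one column per motif predictor. The function name `build_feature_matrix`, the motif names and all score values are hypothetical; the learner itself is not invoked here.

```python
def build_feature_matrix(score_lists, edges):
    """One row per edge, one column per motif predictor (its r'_xy value)."""
    motif_names = sorted(score_lists)
    rows = [[score_lists[name][edge] for name in motif_names]
            for edge in edges]
    return motif_names, rows


# Hypothetical r'_xy values produced by two single-motif predictors.
score_lists = {
    "motif_A": {(1, 2): 0.8, (2, 3): -0.1, (3, 1): 0.4},
    "motif_B": {(1, 2): 1.5, (2, 3): 0.2, (3, 1): -0.6},
}
train_edges = [(1, 2), (2, 3), (3, 1)]
train_labels = [1, 0, 1]  # 1 = existing edge, 0 = sampled non-edge

names, X_train = build_feature_matrix(score_lists, train_edges)
print(names)
print(X_train)
```

Rows and labels in this shape could then be passed to a gradient-boosting classifier such as `xgboost.XGBClassifier` (`fit` on the training rows, `predict_proba` on the test rows) to obtain the new score list. Restricting `score_lists` to a subset of motifs gives the feature matrix for a particular multi-motif combination.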

To explore the influence of the motif nodes of a directed network on link prediction, a naive Bayes model is used to calculate the role function value of a node, and the node role function is added to the traditional link prediction algorithm. The traditional node role function was proposed for three-node motifs and does not cover the four-node case. Diagram a of FIG. 2 shows a three-node motif, where the influence of node C, the node other than the predicted edge AB, on the formation of the motif must be considered. The present method extends this idea and calculates the role function of a four-node motif. As shown in diagram b of FIG. 2, a four-node motif contains two nodes, C and D, in addition to the predicted edge AB, so there are three possible ways to define its role function. The first considers only the role function of node C, the second considers only the role function of node D, and the third takes the whole formed by node C, node D and the edge CD between them. The first two options consider only part of the structure outside the predicted edge and therefore capture less than the whole structure does. For this reason, the role function of a four-node motif is calculated with the third option, that is, by considering the influence of the structure inside the frame of the four-node motif in diagram b on the formation of the motif.
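For the three-node case, a role function can be sketched in Python as follows. The formula is not reproduced in this text, so the sketch assumes the add-one ratio R_w = (N_Δw + 1) / (N_Λw + 1) used in local naive Bayes link prediction, and it picks one hypothetical motif direction: paths a → w → b through node w, counted as closed when the edge a → b exists and unclosed otherwise.

```python
def role_function(edges, w):
    """Assumed role value of node w: (closed + 1) / (unclosed + 1)."""
    preds = {u for (u, v) in edges if v == w}  # nodes pointing to w
    succs = {v for (u, v) in edges if u == w}  # nodes w points to
    closed = unclosed = 0
    for a in preds:
        for b in succs:
            if a == b:
                continue  # a reciprocal pair is not a three-node structure
            if (a, b) in edges:
                closed += 1    # a -> w -> b closed by a -> b
            else:
                unclosed += 1  # a -> w -> b left open
    return (closed + 1) / (unclosed + 1)


# Hypothetical toy network: one closed and one open path through "w".
edges = {("a", "w"), ("w", "b"), ("a", "b"), ("c", "w")}
print(role_function(edges, "w"))

# Closing the second path raises the role value of "w".
print(role_function(edges | {("c", "b")}, "w"))
```

For a four-node motif the same ratio would be computed for the unit made of nodes C and D together with the edge CD, rather than for a single node.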

AUC stands for the area under the receiver operating characteristic (ROC) curve.

AUC measures the accuracy of link prediction as a whole. It is the probability that a randomly selected edge from the test set positive sample scores higher than a randomly selected edge from the test set negative sample. That is, each time one edge is drawn at random from E^P and one from the test set negative sample; if the score of the E^P edge is greater than that of the negative-sample edge, 1 point is added; if the two scores are equal, 0.5 point is added; otherwise no point is added. This process is carried out independently n times. If the score of the E^P edge is greater X times, equal Y times and smaller Z times, the AUC may be defined as:

AUC = (X + 0.5Y) / n

An AUC of 0.5 indicates that the scores are equivalent to randomly generated ones, and an AUC of 1 indicates that the algorithm predicts the edges entirely correctly. The larger the AUC, the more accurate the prediction; its value reflects how much more accurate the algorithm is than a random one.
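The sampling procedure described above can be sketched in Python as follows. The function name `sampled_auc`, the default of 10,000 comparisons and the fixed random seed are hypothetical choices.

```python
import random


def sampled_auc(pos_scores, neg_scores, n=10000, seed=0):
    """Estimate AUC from n independent positive/negative comparisons."""
    rng = random.Random(seed)
    points = 0.0
    for _ in range(n):
        p = rng.choice(pos_scores)  # score of a random test positive
        q = rng.choice(neg_scores)  # score of a random test negative
        if p > q:
            points += 1.0
        elif p == q:
            points += 0.5
    return points / n


# Hypothetical score lists.
print(sampled_auc([2.0, 3.0], [0.0, 1.0]))  # positives always score higher
print(sampled_auc([1.0], [1.0]))            # every comparison is a tie
```

When the score lists are small, comparing every positive with every negative gives the same quantity exactly; sampling is mainly useful for large test sets.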

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. It should be noted that those skilled in the art can make many modifications and variations without departing from the technical principle of the present invention, and these modifications and variations shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A multi-order motif directed network link prediction method based on naive Bayes, characterized by comprising the following steps:

S4, calculating, for each edge, the role function value of the motif to be formed;

S5, computing the total role function value of each edge, where w denotes any node or edge that can be combined with the edge (x, y) to form the specified motif shape, and O_xy is the overlap between the neighbors of node x in the specified direction and the neighbors of node y in the specified direction.

2. The naive Bayes-based multi-order motif directed network link prediction method of claim 1, wherein, for a directed unweighted network, Γ(x) denotes the neighbors of node x in the specified direction, and Γ(y) denotes the neighbors of node y in the specified direction;

O_xy = |Γ(x) ∩ Γ(y)|

3. The naive Bayes-based multi-order motif directed network link prediction method of claim 1, wherein the link prediction score of each test edge is calculated as follows:

A. Counting, for each edge, the number |O_xy| of motifs it can form;

B. Calculating the number M^F of possible edges in the network;

C. Counting the number M of edges that actually exist in the network;

D. Calculating the probability that a node pair in the network is connected;

F. Calculating the link prediction score of each edge.