CN116192650B

CN116192650B - Link prediction method based on sub-graph features

Info

Publication number: CN116192650B
Application number: CN202310142106.4A
Authority: CN
Inventors: 谭少林; 方志宏; 方遒; 李哲
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2023-02-21
Filing date: 2023-02-21
Publication date: 2024-04-30
Anticipated expiration: 2043-02-21
Also published as: CN116192650A

Abstract

The invention discloses a link prediction method based on subgraph features. The method comprises: obtaining a network to be predicted and preprocessing it into an undirected, unweighted network, then taking edges from that network to form training set samples and test set samples; extracting two target nodes from a sample at a time, determining the first-order neighbor node set of each, and taking the union of the two sets to obtain the first-order neighborhood subgraph corresponding to the edge formed by the two target nodes; extracting the features of the first-order neighborhood subgraph; repeating the extraction until all target nodes in the training set samples or the test set samples have been processed, which yields a plurality of first-order neighborhood subgraph features; and finally presetting and training a fully connected neural network binary classifier, and inputting the first-order neighborhood subgraph features into the trained classifier to obtain a prediction score for each corresponding link. The method has stronger feature extraction capability and requires less computation time.

Description

Link prediction method based on sub-graph features

Technical Field

The invention relates to the technical field of data mining, in particular to a link prediction method based on sub-graph features.

Background

With the rise of the Internet and the development of cloud computing, massive amounts of data have emerged that can be modeled as networks. A network takes real-world entities as nodes and the interactions between them as edges. Many different types of network data exist in the real world. For example, if each account is regarded as a node and actions between users such as adding friends or giving likes are regarded as edges, social platforms such as Weibo, WeChat, QQ and Douyin can each be regarded as a large social network; if the neurons of a nervous system are taken as nodes and edges are constructed from relations such as gap junctions and synapses, the nervous system can be represented as a network; if each computer is taken as a node and optical cables, coaxial cables and the like are taken as edges, computers can be modeled as a network; and train stations and train tracks can be modeled as nodes and edges respectively, so a railway system is also a network. Networks built by mining the associations between real-world entities of the same or different kinds each have unique characteristics yet share many commonalities, and are commonly referred to as complex networks.

Link prediction is a sub-problem of complex network research whose main purpose is to predict missing edges and potential future edges. As technology develops, interactions in many real networks become more and more frequent, and the evolution mechanism of a real network can be understood by studying the relationships between its entities. However, because of the cost of constructing a network topology that approximates a complex social system, the constructed network often contains missing or redundant information. In addition, new connections may form as the real network changes dynamically. These hidden or future links therefore need to be detected from the current network topology. The link prediction problem focuses on the connections between nodes rather than on the nodes themselves: from the observed link information, it predicts the likelihood of a connection between any two nodes that are not currently connected.

At present, heuristic algorithms are widely used because they are simple to compute and require no learning. Their biggest disadvantage is that the basic evolution rules of the network must be assumed in advance before link prediction is performed, so their generalization ability is poor. For example, the common-neighbor heuristic predicts friendships in social networks or collaborators in collaboration networks well, but predicts edges in power networks and biological networks less accurately. This is because, in a protein-protein interaction network, the number of common neighbors of two proteins does not significantly promote an interaction between them. Similarly, in a power network, two nodes having many common neighbors actually reduces the probability that a new edge forms between them.

The above analysis shows that many heuristics exist for link prediction in different fields, but selecting an appropriate heuristic for a new network remains a major challenge. To overcome this limited generalization ability, most recently proposed methods are learning-based: they build training data from the known topology and use a learning model to predict links. Among existing learning-based methods, subgraph-pattern encoding methods can automatically construct a suitable heuristic and achieve better link prediction performance than graph embedding methods. The earliest subgraph-pattern encoding method uses the adjacency matrix of the enclosing subgraph for link prediction, but this subgraph extraction process has several drawbacks: first, because a fully connected neural network only accepts tensors of a fixed size as input, a fixed subgraph size is required; second, when the number of one-hop neighbors is smaller than the predefined subgraph size, the fixed-size subgraph may have to take in higher-order neighbor information, and when the number of one-hop neighbors is larger than the predefined size, some one-hop neighbor information may be discarded; finally, the method requires an additional node ordering process, which increases computational complexity.

At present, graph neural networks are commonly used to learn heuristics from enclosing subgraphs: a one-hop or two-hop enclosing subgraph is extracted around the two target nodes, and whether a link exists is predicted from the topology of that subgraph. The link prediction task is thereby converted into a graph classification problem: the enclosing subgraph of the target nodes is input, and the model predicts whether an edge exists between the central node pair of the subgraph. This subgraph extraction process has two disadvantages: first, a graph pooling layer must be designed to obtain a fixed-size feature vector for the subgraph, and information is often lost in the process; second, using a graph neural network for graph classification incurs high time and space complexity.

Disclosure of Invention

To address the problem that link prediction methods based on subgraph-pattern encoding incur high time and space complexity when computing subgraph embeddings, the invention provides a link prediction method based on subgraph features, which comprises the following steps:

S1, acquiring a network to be predicted, preprocessing it to obtain an undirected, unweighted network to be predicted, and taking edges from the undirected, unweighted network to form training set samples and test set samples;

S2, extracting two target nodes from the training set samples or the test set samples at a time, determining the first-order neighbor node set of each of the two target nodes, and taking the union of the two first-order neighbor node sets to obtain the first-order neighborhood subgraph corresponding to the edge formed by the two target nodes;

S3, dividing the first-order neighborhood subgraph into three communities, and extracting the number of nodes and the number of edges of each community as the first-order neighborhood subgraph features;

S4, extracting two further target nodes from the training set samples or the test set samples until all target nodes in the training set samples or the test set samples have been extracted, and applying S2 and S3 each time to obtain a plurality of first-order neighborhood subgraph features;

S5, presetting and training a fully connected neural network binary classifier to obtain a trained classifier, inputting the plurality of first-order neighborhood subgraph features into the trained classifier, which outputs a link prediction score for each, and obtaining from the link prediction score the prediction result for the edge formed by the two corresponding target nodes.

Preferably, S1 comprises:

S11, acquiring a network to be predicted, and removing the direction and weight of its edges to obtain an undirected, unweighted network to be predicted that has no isolated nodes;

S12, deleting a preset proportion of edges from the undirected, unweighted network to be predicted, taking the undeleted edges as training set positive samples and the deleted edges as test set positive samples;

S13, treating node pairs that have no edge in the undirected, unweighted network to be predicted as candidate edges, randomly sampling from them two non-overlapping groups equal in number to the training set positive samples and the test set positive samples respectively, and taking these as the training set negative samples and the test set negative samples; the training set positive samples and negative samples form the training set samples, and the test set positive samples and negative samples form the test set samples.

Preferably, S2 specifically includes:

S21, extracting two target nodes x and y from the training set samples or the test set samples at a time;

S22, extracting the first-order neighbor nodes of target node x to form the first-order neighbor node set of target node x;

S23, extracting the first-order neighbor nodes of target node y to form the first-order neighbor node set of target node y;

S24, taking the union of the first-order neighbor node set of target node x and the first-order neighbor node set of target node y to obtain the first-order neighborhood subgraph corresponding to the edge formed by the two target nodes x and y.

Preferably, the first-order neighborhood subgraphs corresponding to the edges formed by the two target nodes x and y in S24 are formed by the following node sets:

Γ(x,y)＝{v∈V|min(d(v,x),d(v,y))≤1}

where Γ(x,y) is the set of first-order neighbor nodes corresponding to target nodes x and y, V is the set of all nodes in the network to be predicted, d(v,x) is the path length (distance) between node v in the network to be predicted and target node x, and d(v,y) is the path length (distance) between node v and target node y.

Preferably, in S3, the first-order neighborhood sub-graph is divided into three communities, and the node number and the edge number of each community are extracted as the first-order neighborhood sub-graph feature, which specifically includes:

S31, dividing the first-order neighborhood subgraphs into an x community, a y community and a CN community;

S32, respectively calculating the node number and the edge number of the x community, the y community and the CN community, and taking the calculated node number and edge number as the characteristics of the corresponding communities;

s33, arranging the features of the corresponding communities according to a preset sequence to obtain first-order neighborhood sub-graph features.

Preferably, the x-community in S31 is formed by the following node set:

V_x＝Γ(x)-{y}

where V_x is the node set of the x community, Γ(x) is the first-order neighbor node set of target node x, that is, the set of nodes directly connected to node x, and {y} is the set containing target node y.

Preferably, the y-community in S31 is formed by the following node set:

V_y＝Γ(y)-{x}

where V_y is the node set of the y community, Γ(y) is the first-order neighbor node set of target node y, that is, the set of nodes directly connected to node y, and {x} is the set containing target node x.

Preferably, the CN community in S31 is formed by the following node set:

V_CN＝Γ(x)∩Γ(y)

where V_CN is the node set of the CN community, Γ(x) is the first-order neighbor node set of target node x, and Γ(y) is the first-order neighbor node set of target node y.

Preferably, in S33, the features of the corresponding communities are arranged according to a preset sequence, so as to obtain features of a first-order neighborhood subgraph, which may be specifically expressed as:

(|V_x|,|E_x|,|V_CN|,|E_CN|,|V_y|,|E_y|)

In the formula, |V_x| is the number of nodes of the x community, |E_x| is the number of edges of the x community, |V_y| is the number of nodes of the y community, |E_y| is the number of edges of the y community, |V_CN| is the number of nodes of the CN community, and |E_CN| is the number of edges of the CN community.

Preferably, in S12, the preset proportion of edges deleted from the undirected, unweighted network to be predicted is 20%.

In the link prediction method based on subgraph features, a network to be predicted is first obtained and preprocessed into an undirected, unweighted network, from which edges are taken to form training set samples and test set samples; two target nodes are then extracted from the training set samples or the test set samples at a time, the first-order neighbor node set of each is determined, and the union of the two sets gives the first-order neighborhood subgraph corresponding to the edge formed by the two target nodes; the first-order neighborhood subgraph is divided into three communities, and the number of nodes and the number of edges of each community are extracted as the first-order neighborhood subgraph features; further pairs of target nodes are extracted and processed in the same way until all target nodes in the training set samples or the test set samples have been extracted, which yields a plurality of first-order neighborhood subgraph features; finally, a fully connected neural network binary classifier is preset and trained, the first-order neighborhood subgraph features are input into the trained classifier, which outputs a link prediction score for each, and whether the edge formed by the two corresponding target nodes exists is judged from the prediction score.
The method uses the one-hop neighborhood (that is, the first-order neighborhood) directly as the enclosing subgraph of the target node pair, so it needs neither a preset subgraph size nor higher-order neighbor information. By extracting a few key features from the enclosing subgraph as the input of the fully connected neural network binary classifier, it avoids the information loss and heavy computation caused by subgraph size limits, pooling operations and node ordering. The method also obtains discriminative subgraph information from the decomposition of different heuristics and uses the fitting ability of the neural network to construct a suitable heuristic automatically for each graph dataset, so that it has stronger feature extraction capability and requires less computation time than existing link prediction techniques.

Drawings

FIG. 1 is a flow chart of a link prediction method based on sub-graph features in an embodiment of the invention;

FIG. 2 is an algorithm frame diagram of a link prediction method based on sub-graph features according to an embodiment of the present invention;

FIG. 3 is an exemplary diagram of an undirected, unweighted network to be predicted in an embodiment of the present invention;

FIG. 4 is an exemplary diagram of an undirected, unweighted network to be predicted in another embodiment of the present invention;

FIG. 5 is a first-order neighborhood subgraph corresponding to an edge composed of target nodes x and y in an embodiment of the present invention;

FIG. 6 is a schematic diagram of performing community division on a first-order neighborhood subgraph and extracting first-order neighborhood subgraph features according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of computing community nodes in an embodiment of the invention.

Detailed Description

In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings.

A link prediction method based on sub-graph features specifically comprises the following steps:

Specifically, referring to fig. 1 and fig. 2, fig. 1 is a flowchart of a link prediction method based on sub-graph features according to an embodiment of the present invention, and fig. 2 is an algorithm frame diagram of a link prediction method based on sub-graph features according to an embodiment of the present invention.

In the link prediction method based on subgraph features, a network to be predicted is first obtained and preprocessed into an undirected, unweighted network, from which edges are taken to form training set samples and test set samples; two target nodes are then extracted from the training set samples or the test set samples at a time, the first-order neighbor node set of each is determined, and the union of the two sets gives the first-order neighborhood subgraph corresponding to the edge (also called a link) formed by the two target nodes; the first-order neighborhood subgraph is divided into three communities, and the number of nodes and the number of edges of each community are extracted as the first-order neighborhood subgraph features; further pairs of target nodes are extracted until all target nodes in the training set samples or the test set samples have been extracted, and S2 and S3 are applied each time to obtain a plurality of first-order neighborhood subgraph features; finally, a fully connected neural network binary classifier is preset and trained, the first-order neighborhood subgraph features are input into the trained classifier, which outputs a link prediction score (a value between 0 and 1) for each, and the prediction result for the edge formed by the two corresponding target nodes is obtained from the link prediction score.
For example, if the link prediction score is greater than or equal to 0.5, the sample is a positive edge, that is, the edge (link) formed by the two corresponding target nodes exists; if the link prediction score is less than 0.5, the sample is a negative edge, that is, the edge (link) formed by the two corresponding target nodes does not exist.

In one embodiment, S1 comprises:

Specifically, referring to FIG. 3 and FIG. 4, FIG. 3 is an exemplary diagram of an undirected, unweighted network to be predicted in an embodiment of the present invention, and FIG. 4 is an exemplary diagram of an undirected, unweighted network to be predicted in another embodiment of the present invention.

First, the direction and weight of the edges in the network to be predicted are removed to obtain an undirected, unweighted network without isolated nodes. The network in FIG. 3 has 9 nodes, 10 edges and 9×8/2 = 36 node pairs. FIG. 4 shows the undirected, unweighted network to be predicted obtained by deleting 20% of the edges of FIG. 3; the network in FIG. 4 has 9 nodes, 8 edges and 36 node pairs.

The undeleted edges of the undirected, unweighted network to be predicted in FIG. 3 are taken as training set positive samples. These are the edges of the network in FIG. 4, and the training set positive samples comprise the following 8 edges:

(2,3),(1,3),(3,5),(1,4),(4,6),(6,7),(6,8),(8,9)；

The edges deleted from the undirected, unweighted network to be predicted in FIG. 3 are taken as test set positive samples, comprising the following 2 edges:

(3,4),(5,6)；

From the node pairs that have no edge in the undirected, unweighted network to be predicted in FIG. 3, a number of pairs equal to the number of training set positive samples is randomly sampled as training set negative samples, that is, 8 of the 26 node pairs without an edge are selected at random, for example:

(1,2),(1,5),(1,6),(1,7),(1,8),(1,9),(2,4),(2,5)；

From the node pairs that have no edge in the undirected, unweighted network to be predicted in FIG. 3, a number of pairs equal to the number of test set positive samples is randomly sampled as test set negative samples, such that the training set negative samples and the test set negative samples do not overlap, for example: (2,6), (2,7).

The training set positive samples and the training set negative samples form training set samples together, and the testing set positive samples and the testing set negative samples form testing set samples together.
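As a non-limiting illustration, the sample construction of S12 and S13 on the 9-node example network can be sketched as follows. The function name `split_samples`, the fixed random seed, and the choice to delete edges at random are assumptions of this sketch (the worked example above deletes the specific edges (3,4) and (5,6)); the sketch also does not check whether a deletion leaves a node isolated.

```python
import random
from itertools import combinations

# The 10 edges of the example network in FIG. 3 (8 kept + 2 deleted edges listed above).
EDGES = [(2, 3), (1, 3), (3, 5), (1, 4), (4, 6), (6, 7), (6, 8), (8, 9), (3, 4), (5, 6)]
NODES = range(1, 10)


def split_samples(edges, nodes, ratio=0.2, seed=0):
    """Return (train, test) lists of (u, v, label) tuples in the spirit of S12 and S13."""
    rng = random.Random(seed)
    edges = [tuple(sorted(e)) for e in edges]
    edge_set = set(edges)

    # S12: delete a preset proportion of edges; deleted edges become test positives.
    n_test = round(len(edges) * ratio)
    test_pos = rng.sample(edges, n_test)
    train_pos = [e for e in edges if e not in test_pos]

    # S13: node pairs without an edge are the candidate negative samples.
    non_edges = [p for p in combinations(sorted(nodes), 2) if p not in edge_set]
    # One draw without replacement keeps train and test negatives disjoint.
    negatives = rng.sample(non_edges, len(edges))
    train_neg = negatives[:len(train_pos)]
    test_neg = negatives[len(train_pos):]

    train = [(u, v, 1) for u, v in train_pos] + [(u, v, 0) for u, v in train_neg]
    test = [(u, v, 1) for u, v in test_pos] + [(u, v, 0) for u, v in test_neg]
    return train, test


train, test = split_samples(EDGES, NODES)
print(len(train), len(test))  # 16 training samples and 4 test samples
```

With 10 edges and a 20% proportion, this gives 8 positive and 8 negative training samples and 2 positive and 2 negative test samples, matching the counts in the example above.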

In FIG. 3, edges are deleted at a preset proportion. The preset proportion here is 20%, but it may also be 10%, 30% and so on; the lower the preset proportion, the fewer edges are deleted and the higher the prediction accuracy.

In one embodiment, S2 specifically includes:

Taking the training set samples as an example, two target nodes x and y are first extracted from the training set samples at a time; the first-order neighbor node set of target node x, that is, all nodes directly connected to target node x, is then extracted; the first-order neighbor node set of target node y is extracted in the same way; and the union of the two first-order neighbor node sets forms the first-order neighborhood subgraph corresponding to the edge formed by target node x and target node y.

The method of extracting two target nodes from the test set sample each time and determining the first-order neighborhood subgraphs corresponding to the edges formed by the two target nodes is the same as that described above, and will not be repeated here.

In one embodiment, the first-order neighborhood subgraphs corresponding to the edges formed by the two target nodes x and y in S24 are formed by the following node sets:

Γ(x,y)＝{v∈V|min(d(v,x),d(v,y))≤1}

Specifically, referring to fig. 5, fig. 5 is an exemplary diagram of a first-order neighborhood subgraph of a target node x and a target node y in an embodiment of the present invention.

In FIG. 5, the first-order neighborhood subgraph of the target nodes x and y is formed by the set of nodes directly connected to target node x or y. For example, the nodes directly connected to target node x are 1, 2, 3, 4, 5 and 6, each separated from x by exactly one edge; similarly, the nodes directly connected to target node y are 4, 5, 6, 7, 8, 9 and 10, each separated from y by exactly one edge.

The first-order neighborhood subgraphs corresponding to the edges formed by the target nodes x and y are formed by the following node sets:

Γ(x,y)＝{v∈V|min(d(v,x),d(v,y))≤1}

where Γ(x,y) is the set of first-order neighbor nodes corresponding to target nodes x and y, V is the set of all nodes in the network to be predicted, d(v,x) is the path length (distance) between node v in the network to be predicted and target node x, d(v,y) is the path length (distance) between node v and target node y, and min(d(v,x),d(v,y))≤1 means that node v is directly connected to target node x or y.

In FIG. 5, Γ(x,y) = {1,2,3,4,5,6,7,8,9,10}.
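As a non-limiting illustration, the union of the two first-order neighbor node sets can be computed from an adjacency map as sketched below. The helper names `build_adjacency` and `first_order_neighborhood` are hypothetical, and the edge list only encodes the adjacencies to x and y that the text states for FIG. 5. The sketch leaves x and y themselves out of the result, following the node set listed for FIG. 5; a literal reading of the set-builder formula with d(x,x) = 0 would also admit the two target nodes.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def first_order_neighborhood(adj, x, y):
    """Union of the first-order neighbor sets of x and y, without x and y themselves."""
    return (adj.get(x, set()) | adj.get(y, set())) - {x, y}


# FIG. 5 as described in the text: x is adjacent to 1..6 and y is adjacent to 4..10.
edges = [("x", v) for v in range(1, 7)] + [("y", v) for v in range(4, 11)]
adj = build_adjacency(edges)
print(sorted(first_order_neighborhood(adj, "x", "y")))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```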

In one embodiment, in S3, the first-order neighborhood sub-graph is divided into three communities, and the node number and the edge number of each community are extracted as the first-order neighborhood sub-graph feature, which specifically includes:

Specifically, referring to fig. 6, fig. 6 is a schematic diagram illustrating community division of a first-order neighborhood subgraph and extraction of first-order neighborhood subgraph features according to an embodiment of the present invention.

In FIG. 6, the first-order neighborhood subgraph of the target nodes x and y is first divided into an x community, a y community and a CN community, where:

The x community is the following set of nodes:

V_x＝{1,2,3,4,5,6}

The edges of the x community are:

E_x＝{(1,2),(1,4),(5,6)}

The y community is the following set of nodes:

V_y＝{4,5,6,7,8,9,10}

The edges of the y community are:

E_y＝{(5,6),(7,8)}

The CN community is the following set of nodes:

V_CN＝{4,5,6}

The edges of the CN community are:

E_CN＝{(5,6)}

Then the number of nodes and the number of edges of the x community, the y community and the CN community are calculated:

The x community has |V_x|＝6 nodes and |E_x|＝3 edges;

The y community has |V_y|＝7 nodes and |E_y|＝2 edges;

The CN community has |V_CN|＝3 nodes and |E_CN|＝1 edge.

These node and edge counts are the features of the corresponding communities. Arranging them in a preset order gives the first-order neighborhood subgraph features, expressed as:

(|V_x|,|E_x|,|V_CN|,|E_CN|,|V_y|,|E_y|)＝(6,3,3,1,7,2)
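As a non-limiting illustration, the six-dimensional feature vector of the FIG. 6 example can be reproduced as sketched below. Two points are assumptions of this sketch: the edges of a community are taken to be the edges with both endpoints inside that community's node set (which is consistent with the edge sets listed above), and x and y are taken not to be directly connected. The helper names are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def induced_edge_count(adj, nodes):
    """Count edges whose two endpoints both lie in `nodes` (each edge is seen twice)."""
    return sum(len(adj[u] & nodes) for u in nodes) // 2


def subgraph_features(adj, x, y):
    """Return (|V_x|, |E_x|, |V_CN|, |E_CN|, |V_y|, |E_y|) for the node pair (x, y)."""
    gx, gy = adj.get(x, set()), adj.get(y, set())
    v_x = gx - {y}   # x community
    v_y = gy - {x}   # y community
    v_cn = gx & gy   # common-neighbor (CN) community
    return (len(v_x), induced_edge_count(adj, v_x),
            len(v_cn), induced_edge_count(adj, v_cn),
            len(v_y), induced_edge_count(adj, v_y))


# FIG. 6 as described in the text: neighbors of x and y plus the listed community edges.
edges = ([("x", v) for v in range(1, 7)] + [("y", v) for v in range(4, 11)]
         + [(1, 2), (1, 4), (5, 6), (7, 8)])
features = subgraph_features(build_adjacency(edges), "x", "y")
print(features)  # (6, 3, 3, 1, 7, 2)
```

Because every sample is reduced to the same six counts, the classifier input has a fixed size regardless of how many neighbors the two target nodes have.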

In one embodiment, the x-community in S31 is formed by the following node set:

V_x＝Γ(x)-{y}

Specifically, when the first-order neighbor node set of target node x contains target node y, that is, when target node y is directly connected to target node x, target node y must be removed when computing the nodes of the x community.

In one embodiment, the y-community in S31 is formed by the collection of the following nodes:

V_y＝Γ(y)-{x}

Specifically, when the first-order neighbor node set of target node y contains target node x, that is, when target node x is directly connected to target node y, target node x must be removed when computing the nodes of the y community.

Referring to fig. 7, fig. 7 is a schematic diagram of a computing community node according to an embodiment of the present invention.

In FIG. 7, taking the subgraph extracted for (1,4) as an example, target node x is 1 and target node y is 4:

The nodes in the community of target node 1 are:

V₁＝Γ(1)-{4}＝{3,4}-{4}＝{3},

The nodes in the community of target node 4 are:

V₄＝Γ(4)-{1}＝{1,6}-{1}＝{6}.

In FIG. 7, taking the subgraph extracted for (3,6) as an example, target node x is 3 and target node y is 6:

The nodes in the community of target node 3 are:

V₃＝Γ(3)-{6}＝{1,2,5}-{6}＝{1,2,5},

The nodes in the community of target node 6 are:

V₆＝Γ(6)-{3}＝{4,7,8}-{3}＝{4,7,8}.
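As a non-limiting illustration, the two FIG. 7 calculations can be checked with a few lines of set arithmetic. The neighbor sets quoted for FIG. 7 match the eight training set positive edges listed earlier, so this sketch assumes that FIG. 7 shows that network; the function names are hypothetical.

```python
# The 8 training-set positive edges listed for the example network.
TRAIN_EDGES = [(2, 3), (1, 3), (3, 5), (1, 4), (4, 6), (6, 7), (6, 8), (8, 9)]


def neighbors(edges, node):
    """First-order neighbor set of `node` in an undirected edge list."""
    return {v for u, v in edges if u == node} | {u for u, v in edges if v == node}


def community_nodes(edges, x, y):
    """Return (V_x, V_y, V_CN) for the target node pair (x, y)."""
    gx, gy = neighbors(edges, x), neighbors(edges, y)
    return gx - {y}, gy - {x}, gx & gy


print(community_nodes(TRAIN_EDGES, 1, 4))  # ({3}, {6}, set())
print(community_nodes(TRAIN_EDGES, 3, 6))  # ({1, 2, 5}, {4, 7, 8}, set())
```

In the first case the two target nodes are adjacent, so each is removed from the other's community; in the second case they are not adjacent, so the subtraction leaves the neighbor sets unchanged.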

In one embodiment, the CN community in S31 is formed by the following node set:

V_CN＝Γ(x)∩Γ(y)

Specifically, the nodes in the CN community are the nodes common to the first-order neighbor node sets of target node x and target node y.

In one embodiment, in S33, the features of the corresponding communities are arranged according to a preset sequence to obtain first-order neighborhood sub-graph features, which may be specifically expressed as:

(|V_x|,|E_x|,|V_CN|,|E_CN|,|V_y|,|E_y|)

Specifically, the first-order neighborhood subgraph features corresponding to the edge formed by target node x and target node y are the node counts and edge counts of the three communities into which the first-order neighborhood subgraph is divided.

Further, experimental verification is performed on the prediction effect of the link prediction method based on the subgraph characteristics.

The experiments use eight real-world networks as datasets: USAir, NS, PB, Yeast, C.ele, Power, Router and E.coli. USAir is a network of US airlines, NS is a scientist collaboration network constructed from papers published in the field of network science, PB is a blog network, Yeast is a protein-protein interaction network in yeast, C.ele is the neural network of Caenorhabditis elegans, Power is a power grid in the western United States, Router is a router-level Internet network, and E.coli is a network of metabolite reactions in Escherichia coli.

To compare the performance of the method of the present invention with existing methods, it is compared with classical topology-based heuristic methods, graph representation learning methods and subgraph-pattern encoding methods. The heuristic methods include CN, PA and Katz; the graph representation learning methods include the stochastic block model (SBM), node2vec (N2V) and the variational graph autoencoder (VGAE); the subgraph-pattern encoding methods include the Weisfeiler-Lehman graph kernel (WLK), WLNM and SEAL. The neural network in the invention is a fully connected neural network with 3 hidden layers of 32, 32 and 16 neurons respectively. The hidden layers use the ReLU activation function and the output layer uses softmax. Optimization uses the Adam update rule with a learning rate of 0.001 and a batch size of 128, and training runs for 100 epochs. Each dataset is randomly divided into training samples and test samples, the experiment is repeated 10 times, and the average AUC over the 10 runs is recorded. The experiments use Python as the programming language, and the neural network is implemented with the sklearn machine learning framework.
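As a non-limiting illustration of the classifier shape described above, the following standard-library sketch runs one forward pass through a 6-32-32-16-2 fully connected network with ReLU hidden layers and a softmax output, and applies the 0.5 decision threshold. The weights are random and untrained, so the printed score carries no meaning; training with Adam is omitted, and the initialization scale and the choice of output index 1 as the "link exists" class are assumptions of this sketch.

```python
import math
import random

# 6 input features, hidden layers of 32, 32 and 16 neurons, 2 output classes.
LAYER_SIZES = [6, 32, 32, 16, 2]


def init_params(sizes, seed=0):
    """Random (untrained) weights and zero biases for each layer; the scale is an assumption."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def link_score(params, features):
    """Forward pass: ReLU on hidden layers, softmax on the output layer."""
    h = list(features)
    last = len(params) - 1
    for i, (weights, biases) in enumerate(params):
        z = [sum(w * a for w, a in zip(row, h)) + b for row, b in zip(weights, biases)]
        h = softmax(z) if i == last else [max(0.0, v) for v in z]
    return h[1]  # probability of the class assumed to mean "link exists"


params = init_params(LAYER_SIZES)
score = link_score(params, (6, 3, 3, 1, 7, 2))
print(score, "edge predicted" if score >= 0.5 else "no edge predicted")
```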

Table 1 shows AUC values of link prediction performed by different algorithms after deleting 10% of edges in different data sets, wherein NNESF is a link prediction method in the technical scheme of the present invention.

TABLE 1

In Table 1, each column represents a link prediction method: the first 9 columns are existing link prediction methods, and the 10th column is the link prediction method (NNESF) according to an embodiment of the present invention. Each row in Table 1 gives the experimental results for one network under the AUC evaluation metric. For each network, the best link prediction performance is marked in bold. As can be seen, NNESF performs better on the USAir and NS datasets than all heuristic methods, all graph representation learning methods, and the subgraph-pattern encoding methods WLK, WLNM and SEAL.
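As a non-limiting illustration of the AUC evaluation metric, the following sketch computes AUC as the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, counting ties as one half. The scores in the usage line are hypothetical and are not taken from Table 1.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive sample outranks a negative one (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores for 2 test positives and 2 test negatives.
print(auc([0.9, 0.6], [0.7, 0.2]))  # 0.75
```

An AUC of 0.5 corresponds to random scoring and 1.0 to every positive sample outranking every negative sample.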

The fully connected neural network used in NNESF contains fewer parameters than the graph neural network in the SEAL model, which means that NNESF requires less computation time and memory during training.

Table 2 shows the relevant parameters of the three algorithms. In Table 2, the WLNM, SEAL, and NNESF algorithms are compared in terms of sub-graph feature extraction time, training time, and peak memory, as follows:

TABLE 2

The neural networks of the three algorithms are implemented using the Torch machine learning framework. As can be seen from Table 2, first, WLNM and SEAL require additional time to complete node ordering during sub-graph feature extraction, while NNESF only needs to count the nodes and edges in the closed sub-graph, so the sub-graph feature extraction time of NNESF is much less than that of WLNM and SEAL; second, the neural network models of NNESF and WLNM are both fully connected neural networks, while SEAL uses a graph neural network, so the training time and peak memory of NNESF and WLNM are much smaller than those of SEAL; finally, in terms of the overall time required, WLNM and SEAL take significantly longer than NNESF.

According to the link prediction method based on sub-graph features, the single-hop neighborhood (namely the first-order neighborhood) is directly used as the closed sub-graph of the target node pair, so there is no need to preset a sub-graph size or to use higher-order neighbor information. A plurality of key features are extracted from the closed sub-graph as the input of the fully connected neural network binary classifier, which avoids the problems of sub-graph size limitation, information loss, and heavy computation caused by pooling and ordering operations. The method can also obtain distinguishable sub-graph information from the decomposition of different heuristics, and uses the fitting capability of the neural network to automatically construct a suitable heuristic for each graph data set, so that, compared with existing link prediction techniques, it has stronger feature extraction capability and requires less computation time.
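The feature extraction summarized above can be sketched as follows, using the community definitions given in the claims (V_x = Γ(x) - {y}, V_y = Γ(y) - {x}, V_CN = Γ(x) ∩ Γ(y)) and the feature order (|V_x|, |E_x|, |V_CN|, |E_CN|, |V_y|, |E_y|). Two points are assumptions made for this sketch and are not confirmed by the text shown here: Γ(x) is taken to be the set of neighbors of x excluding x itself, and the edge number of a community is taken to be the number of edges in the sub-graph induced by that community's node set. All function and variable names are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected, unweighted adjacency sets built from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def induced_edge_count(adj, nodes):
    """Number of edges with both endpoints in `nodes` (assumed meaning of |E|)."""
    return sum(1 for u, v in combinations(nodes, 2) if v in adj[u])


def first_order_features(adj, x, y):
    """Six-dimensional feature vector of the target node pair (x, y)."""
    gamma_x = set(adj.get(x, ()))  # assumed: neighbors of x, without x itself
    gamma_y = set(adj.get(y, ()))
    v_x = gamma_x - {y}            # x community
    v_y = gamma_y - {x}            # y community
    v_cn = gamma_x & gamma_y       # common-neighbor (CN) community
    return (
        len(v_x), induced_edge_count(adj, v_x),
        len(v_cn), induced_edge_count(adj, v_cn),
        len(v_y), induced_edge_count(adj, v_y),
    )


# Toy graph (illustrative only): the target pair is x = 1, y = 2.
toy_adj = build_adjacency([(1, 2), (1, 3), (1, 4), (2, 3), (2, 5), (3, 4), (3, 5)])
toy_features = first_order_features(toy_adj, 1, 2)
```

Only set operations and counts over the first-order neighborhood are needed; no node ordering or pooling step appears, which is consistent with the short feature extraction time reported for NNESF.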

The link prediction method based on the sub-graph features provided by the invention is described in detail above. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the core concepts of the invention. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and these modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.

Claims

1. A method of link prediction based on sub-graph features, the method comprising:

S1, acquiring a network to be predicted, preprocessing the network to be predicted to obtain an undirected and unweighted network to be predicted, and respectively taking edges from the undirected and unweighted network to be predicted to form a training set sample and a test set sample;

S2, extracting two target nodes from the training set sample or the test set sample each time, determining a first-order neighbor node set corresponding to each of the two target nodes, and obtaining a first-order neighborhood subgraph corresponding to an edge formed by the two target nodes by merging the first-order neighbor node sets;

2. The link prediction method based on sub-graph features according to claim 1, wherein the S1 includes:

S11, acquiring a network to be predicted, and eliminating the direction and the weight of the edges in the network to be predicted to obtain an undirected and unweighted network to be predicted which does not contain isolated nodes;

S12, deleting a preset proportion of edges from the undirected and unweighted network to be predicted, taking the undeleted edges as training set positive samples, and taking the deleted edges as test set positive samples;

S13, forming candidate edges from node pairs that have no edge between them in the undirected and unweighted network to be predicted, randomly sampling from them edges that are equal in number to, and do not coincide with, the training set positive samples and the test set positive samples respectively, taking the sampled edges as training set negative samples and test set negative samples, forming the training set samples from the training set positive samples and the training set negative samples, and forming the test set samples from the test set positive samples and the test set negative samples.

3. The link prediction method based on sub-graph features according to claim 2, wherein S2 specifically comprises:

S21, extracting two target nodes x and y from the training set sample or the test set sample each time;

4. The link prediction method based on sub-graph features according to claim 3, wherein the first-order neighborhood sub-graph corresponding to the edge formed by the two target nodes x and y in S24 is formed by the following node set:

Γ(x,y) = {v ∈ V | min(d(v,x), d(v,y)) ≤ 1}

5. The link prediction method based on sub-graph features according to claim 4, wherein in S3, the first-order neighborhood sub-graph is divided into three communities, and the node number and the edge number of each community are respectively extracted as the first-order neighborhood sub-graph features, specifically comprising:

S32, respectively calculating the node numbers and the edge numbers of the x community, the y community and the CN community, and taking the calculated node numbers and edge numbers as the characteristics of the corresponding communities;

S33, arranging the features of the corresponding communities according to a preset sequence to obtain the first-order neighborhood sub-graph features.

6. The link prediction method based on sub-graph features according to claim 5, wherein the x community in S31 is formed by the following node set:

V_x＝Γ(x)-{y}

7. The link prediction method based on sub-graph features according to claim 6, wherein the y community in S31 is formed by the following node set:

V_y＝Γ(y)-{x}

8. The link prediction method based on sub-graph features according to claim 7, wherein the CN community in S31 is formed by the following node set:

V_CN＝Γ(x)∩Γ(y)

9. The link prediction method based on sub-graph features according to claim 8, wherein in S33, the features of the corresponding communities are arranged according to a preset sequence, so as to obtain the first-order neighborhood sub-graph features, which can be specifically expressed as:

(|V_x|,|E_x|,|V_CN|,|E_CN|,|V_y|,|E_y|)

10. The link prediction method based on sub-graph features according to claim 2, wherein the preset proportion of edges deleted from the undirected and unweighted network to be predicted in S12 is 20%.
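For illustration, the sample construction of steps S11 to S13 in claim 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function name split_samples, the seed handling, and the use of rejection sampling for negative edges are hypothetical choices, and the input is assumed to be a list of (u, v) node pairs with sortable node identifiers.

```python
import random


def split_samples(edges, ratio, seed=0):
    """Build training and test samples from an edge list.

    Returns ((train_pos, train_neg), (test_pos, test_neg)).
    """
    rng = random.Random(seed)
    # S11: drop direction, weights, duplicates, and self-loops. Nodes are
    # taken from the remaining edges, so no isolated node is included.
    pos = sorted({tuple(sorted(e)) for e in edges if e[0] != e[1]})
    nodes = sorted({u for e in pos for u in e})
    # S12: delete a preset proportion of edges. Deleted edges become test
    # positives and the remaining edges become training positives.
    rng.shuffle(pos)
    n_test = int(round(ratio * len(pos)))
    test_pos, train_pos = pos[:n_test], pos[n_test:]
    # S13: sample as many unconnected node pairs as there are positives.
    existing = set(pos)
    max_non_edges = len(nodes) * (len(nodes) - 1) // 2 - len(existing)
    if len(pos) > max_non_edges:
        raise ValueError("not enough unconnected node pairs to sample")
    negatives = set()
    while len(negatives) < len(pos):
        u, v = rng.sample(nodes, 2)
        pair = (min(u, v), max(u, v))
        if pair not in existing:
            negatives.add(pair)
    negatives = sorted(negatives)
    rng.shuffle(negatives)
    train_neg = negatives[:len(train_pos)]
    test_neg = negatives[len(train_pos):]
    return (train_pos, train_neg), (test_pos, test_neg)


# Toy network (illustrative only): an 8-node ring plus two chords, 10 edges.
toy_edges = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (2, 6)]
(train_pos, train_neg), (test_pos, test_neg) = split_samples(toy_edges, 0.2, seed=1)
```

With a preset proportion of 20% and 10 edges, this yields 8 training positives and 2 test positives, each matched by the same number of non-overlapping negative samples.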