CN114596473A - Network embedding pre-training method based on graph neural network hierarchical loss function - Google Patents

Network embedding pre-training method based on graph neural network hierarchical loss function

Info

Publication number
CN114596473A
CN114596473A
Authority
CN
China
Prior art keywords
node
graph
neighbor
nodes
neighbors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210174674.8A
Other languages
Chinese (zh)
Inventor
陈俊扬
伍楷舜
戴志江
巩志国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202210174674.8A priority Critical patent/CN114596473A/en
Publication of CN114596473A publication Critical patent/CN114596473A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a network embedding pre-training method based on a graph neural network hierarchical loss function. The method comprises the following steps: representing a network as a graph, using network embedding to learn a low-dimensional representation and a set of shared weight matrices for aggregating information from node neighbors; for the input graph, constructing a set of walk sequences S, where each sequence s ∈ S contains a set of nodes {v_1, …, v_{|s|}}; for the constructed walk sequences, constructing a unit graph based on nodes that co-occur within a sliding window, the unit graph being generated based on the proximity of each target node to its one-hop and two-hop neighbors; and, using the generated unit graphs, training the graph neural network model with the objective of minimizing the designed hierarchical loss. The invention combines a graph attention network with an effective hierarchical loss, preserves the margin between the target node and its neighbor nodes, and significantly improves model performance.

Description

Network embedding pre-training method based on graph neural network hierarchical loss function
Technical Field
The invention relates to the technical field of data mining and machine learning, in particular to a network embedding pre-training method based on a graph neural network hierarchical loss function.
Background
In a graph composed of nodes and edges, network-embedding pre-training reasonably encodes unlabeled nodes into a low-dimensional space in which each node is close to its neighbors and far from negative samples, so that the nodes perform well in downstream tasks such as classification and clustering.
Graph structures are ubiquitous in real-world scenarios such as social networks and citation networks, and have attracted considerable attention from the research community over the past few years. Network-embedding pre-training aims to map node relationships into a low-dimensional space without any supervisory information (e.g., node labels). The learned embeddings can then be used for subsequent tasks such as node classification and link prediction. Pre-training of network embedding is intended to maintain the proximity of unlabeled nodes in the low-dimensional space. To achieve this goal, current work can be divided into direct network embedding and deep neural network embedding.
For direct network embedding schemes, before the emergence of deep neural networks, direct embedding learning methods such as DeepWalk realized node embedding by applying the SkipGram model to generated random walks. Similar methods, such as Line, node2vec, ACNE-ST, Adv-gaming and HNS, design biased random-walk strategies or negative-sampling models to explore different node relationships. In addition, Line and MNMF employ matrix factorization to learn node embeddings. However, all direct embedding methods suffer from computational inefficiency, since there are no shared parameters between nodes in the encoder. Furthermore, these methods often lack the ability to generalize to new graphs.
For deep neural network embedding schemes, GNN (graph neural network) based methods have recently been proposed to aggregate information from graph-structured data. For example, GCN (graph convolutional network) and its variants DGI and GAT show breakthrough performance on many tasks, including pre-training for network embedding. In contrast to traditional models, the GCN can jointly consider local connectivity and global consistency on the graph. Based on the basic theory of GCN, more variants have been proposed, including GraphSAGE, FastGCN, DGCN and HAN for large-scale datasets. PinSage, built on GraphSAGE, is specially designed for web-scale recommendation systems and represents one of the largest applications of deep graph embedding. Furthermore, MEIRec leverages meta-path-guided GNNs to model complex interactions of items for search-intent recommendation.
Many network-embedding pre-training methods have been proposed. For example, DeepWalk is a graph embedding method that applies representation learning to pairs of nodes generated by random walks. Subsequently, similar methods such as Line, node2vec and TADW have been proposed, but these methods are computationally inefficient since there are no shared parameters between nodes in the encoder. Recently, graph neural networks (GNNs) have shown excellent performance in modeling graph-structured data; typical work includes GCN, GAT and GraphSAGE. In general, the key to the success of GNNs lies in the parameter sharing of the encoder and the multi-layer structure used to capture importance weights between the target node and its neighbors. However, due to their inherent information aggregation mechanism, these GNN-based approaches tend to fall into sub-optimal results during pre-training on unlabeled nodes, thereby affecting the prediction efficiency and accuracy of the model.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a network embedding pre-training method based on a graph neural network hierarchical loss function. The method comprises the following steps:
representing the network as a graph G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges; for each v ∈ V, network embedding is used to learn a low-dimensional representation v ∈ R^d and a set of shared weight matrices W^l for aggregating information from node neighbors, where d denotes the dimension of the embedding space, d ≪ |V|, and l is the layer index of the graph neural network;
for an input graph, constructing a set of walk sequences S, where each sequence s ∈ S contains a set of nodes {v_1, …, v_{|s|}};
for the constructed walk sequences, constructing a unit graph based on nodes that co-occur within a sliding window, wherein the unit graph is generated based on the proximity of each target node to its one-hop and two-hop neighbors;
using the generated unit graphs, training the graph neural network model with the objective of minimizing the designed hierarchical loss, wherein the hierarchical loss is expressed as:

L = L_d(v_t, v_p) + L_h(v_t, N(v_p)) + L_n(v_t, v_n, N(v_n))

where v_t is the target node, v_p is a direct neighbor in the unit graph, N(v_p) denotes the neighbors of node v_p, v_n is a negative sample drawn from the entire vertex set, N(v_n) denotes the neighbors of v_n, h ∈ R^d denotes the hidden representation of a node with dimension d, and L_d, L_h and L_n denote the direct neighbor loss, the hierarchical neighbor loss and the negative sampling loss, respectively.
Compared with the prior art, the present invention fills the research gap concerning the poor performance of GNNs in pre-training unlabeled nodes and addresses the problem that the margins between nodes cannot be distinguished during the inherent node-information aggregation process; it combines a graph attention network with an effective hierarchical loss to preserve the margin between a target node and its neighbors, and treats the two-hop pattern as a unit graph, thereby significantly improving model performance.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of learned embedding spaces for a target node, its direct neighbors and its two-hop neighbors, according to one embodiment of the invention;
FIG. 2 is a process diagram of a network embedding pre-training method based on a graph neural network hierarchical loss function according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of clustering results using ACC and NMI metrics on different data sets, according to one embodiment of the present invention;
FIG. 4 is a schematic illustration of an ablation study using the Micro-F1 metric, in accordance with one embodiment of the present invention;
FIG. 5 is a schematic illustration of an ablation study using the Macro-F1 metric according to another embodiment of the present invention;
FIG. 6 is a diagram illustrating the classification results with respect to the random walk parameter on Cora according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating the classification results with respect to the random walk parameters on Citeseer according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
To facilitate an understanding of the present invention, the drawback of the prior art is first explained, namely that, due to their inherent information aggregation mechanism, GNN-based approaches tend to fall into sub-optimal results during the pre-training of unlabeled nodes. Specifically, referring to FIG. 1, the left diagram is an example of an input graph, the upper-right diagram is an embedding space learned by existing GNNs, and the lower-right diagram is an embedding space learned by the method proposed by the present invention, where node 1 is the target node, node 2 is a direct neighbor, and nodes 3-5 are two-hop neighbors. The main problem with existing GNN-based methods is that they model the distance between the target node and its neighbors in a coarse-grained manner. The margins between nodes are often difficult to distinguish, resulting in poor performance in downstream applications. It has been experimentally verified that, in certain downstream classification and clustering tasks, direct embedding methods (e.g., DeepWalk) may even outperform GNN-based methods (e.g., GCN, GAT and GraphSAGE). We observe that the more discriminative the representation learned in the embedding space, the higher the performance the subsequent task can achieve. In general, one desirable result is that the target node should be closer to its direct neighbors than to any two-hop node. To address this problem, the present invention proposes a simple and efficient graph attention network with hierarchical loss, called LlossNet, to preserve the margins between nodes in the embedding space.
Specifically, the present invention explicitly models the margin constraints imposed by the direct neighbor loss (e.g., nodes 1 and 2 in FIG. 1), the multi-level loss (nodes 1 and 3-5), and the combination of both. Such explicit modeling helps learn more discriminative node representations. However, it is not straightforward to determine the number of hops to be considered, and an overly fine multi-hop margin design may compromise the ability of the model to generalize. Preferably, preserving neighbors within two hops is the best choice for the effectiveness and efficiency of LlossNet in the experiments. In the present invention, this two-hop pattern is referred to as a unit graph, and multi-hop margins can also be considered recursively.
Hereinafter, the pre-training of the existing network embedding, the network embedding pre-training method of the present invention, and the experimental verification result will be described in detail.
1. Preliminaries
Before introducing the proposed hierarchical loss, we first review the graph-based loss function currently used in the pre-training of network embedding. To learn predictive representations in a completely unsupervised setting, following the literature (W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems, 2017, pp. 1024-1034), a graph-based loss function is applied to the output node representations:

L(h_{v_t}) = -log(σ_s(h_{v_t}^T h_{v_p})) - N · E_{v_n ~ P_NS(v)} log(σ_s(-h_{v_t}^T h_{v_n}))    (1)

where v_t is the target node, v_p is its direct neighbor, v_n is a negative sample randomly drawn from the entire vertex set, P_NS(v) denotes the negative-sample distribution, h denotes the output representation in the embedding space, σ_s is the sigmoid function, and N defines the number of negative samples. Note that, given a node v, h_v is a feature aggregated from its neighbors. For simplicity, the generation process of the node representation at layer l is as follows:

h_v^l = σ(W^l · CONCAT(h_v^{l-1}, AGGREGATE({h_u^{l-1}, ∀u ∈ N(v)})))    (2)

where W^l is a trainable weight matrix, σ is a nonlinear activation function, N(v) denotes the neighbors of node v, CONCAT denotes the concatenation operation, and AGGREGATE denotes the aggregation function, whose types include mean, LSTM (long short-term memory network), maxpool (max pooling), meanpool (mean pooling) and attention. From equations (1) and (2), it can be concluded that the current graph-based loss only brings the output representations of the target node and its multi-hop neighbors close in the embedding space. Thus, it cannot preserve the margins between these nodes, even though the target node should be closer to its direct neighbors than to a two-hop node.
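To make the above concrete, the following is a minimal NumPy sketch of the graph-based loss of equation (1) and one aggregation step of equation (2); the mean aggregator, the ReLU nonlinearity and the function names are illustrative assumptions rather than the exact implementation of the cited work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_based_loss(h_t, h_p, h_negs):
    """Eq. (1): pull the target towards a co-occurring neighbor and push it
    away from N sampled negatives (one row of h_negs per negative)."""
    pos_term = -np.log(sigmoid(h_t @ h_p))
    neg_term = -np.sum(np.log(sigmoid(-(h_negs @ h_t))))
    return pos_term + neg_term

def sage_aggregate(h_v, neighbor_h, W):
    """Eq. (2) with the mean aggregator: concatenate the node's own state with
    the mean of its neighbors' states, project with W, apply a nonlinearity."""
    agg = neighbor_h.mean(axis=0)               # AGGREGATE (mean variant)
    z = W @ np.concatenate([h_v, agg])          # CONCAT, then linear map
    return np.maximum(z, 0.0)                   # ReLU as the nonlinearity
```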
2. The model provided by the present invention
A. Problem formulation and notation
Since the goal of the present invention is to learn representations of unlabeled nodes, the network is represented as G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges. For each v ∈ V, network embedding aims to learn a low-dimensional representation v ∈ R^d that preserves network proximity, where d denotes the dimension of the embedding space and d ≪ |V|. In addition, a set of shared weight matrices W^l is learned for aggregating information from node neighbors, where l is the layer index of the GNN.
B. Stage one: unit graph generation
At this stage, the present invention introduces unit graph generation for the input data, as shown in FIG. 2, where FIG. 2(a) is the input graph; FIG. 2(b) performs random walks to convert the graph into sequences of nodes; FIG. 2(c) constructs a new graph based on the set of positive context pairs within a sliding window; FIG. 2(d) preserves, in the embedding space, the proximity of target node v_1 to its direct neighbor v_2 and its two-hop neighbors v_3, v_4, v_5; FIG. 2(e) illustrates the losses proposed by the present invention, including the direct neighbor loss, the hierarchical neighbor loss and the negative sampling loss; v_n denotes a negative sample and h denotes the hidden representation.
Specifically, preprocessing is first performed by running random walks to form a set of walk sequences S, where each sequence s ∈ S contains a set of nodes {v_1, …, v_{|s|}}. A new graph can then be constructed based on the nodes that co-occur within a sliding window. The positive context pairs in s are defined by {(v_i, v_j) : |i − j| ≤ c, v_i, v_j ∈ s}, where c denotes the window size. Taking FIG. 2(c) and FIG. 2(d) as an example, node 2 in the sliding window is connected to nodes 3, 4 and 5, respectively. Note that only the proximity of each target node (e.g., node 1) to its one-hop and two-hop neighbors is considered in this embodiment. With node 1 as the target node, within the sliding window, node 2 is its direct neighbor and nodes 3, 4 and 5 are its two-hop neighbors. The graphs generated in FIGS. 2(c) to 2(d) are referred to herein as unit graphs. Based on this approach, multi-hop margins can be considered recursively. Random walks enrich the node relationships, which benefits pre-training performance. For a fair comparison, random walks were used as preprocessing for all baselines in the experiments, so that the differences lie mainly in the subsequent models.
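A minimal sketch of this preprocessing stage is given below, assuming an unbiased random walk and the walk-length, walks-per-node and window-size values reported later in the experiments; the adjacency structure is a plain dict from node to neighbor list, and all names are illustrative.

```python
import random
from collections import defaultdict

def random_walks(adj, walk_length=30, walks_per_node=5, seed=0):
    """Unbiased random walks: each node starts several walks, and the next hop
    is chosen uniformly among the current node's neighbors."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

def unit_graph(walks, window=5):
    """Connect nodes that co-occur within the sliding window; in the resulting
    graph, a target's one-hop neighbors act as its direct neighbors and their
    neighbors as its two-hop neighbors of the unit graph."""
    edges = defaultdict(set)
    for walk in walks:
        for i, v in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if j != i:
                    edges[v].add(walk[j])
    return edges
```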
C. Stage two: hierarchical loss optimization
As shown in FIGS. 2(d) and 2(e), given the generated unit graphs, the proposed hierarchical loss is defined as follows:

L = L_d(v_t, v_p) + L_h(v_t, N(v_p)) + L_n(v_t, v_n, N(v_n))    (3)

where v_t is the target node (e.g., v_1), v_p is a direct neighbor in the unit graph (e.g., v_2), N(v_p) denotes the neighbors of node v_p (e.g., v_3, v_4, v_5), v_n is a negative sample drawn from the entire vertex set using the alias-table method (A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola, "Reducing the sampling complexity of topic models," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 891-900), which takes only O(1) time, N(v_n) denotes the neighbors of v_n, h ∈ R^d denotes the hidden representation of a node with dimension d, and L_d, L_h and L_n denote the direct neighbor loss, the hierarchical neighbor loss and the negative sampling loss, respectively. More details are presented below.
The direct neighbor loss (the first part of equation (3)) encourages the target node to be close to its direct neighbor in the embedding space and, without loss of generality, is defined as follows:

L_d(v_t, v_p) = -log(σ_s(h_{v_t}^T h_{v_p}))    (4)

where σ_s is the sigmoid function.
the hierarchical neighbor penalty (second part of equation 3) is to bring the target node close to its two-hop neighbor while keeping the margins of the two-hop neighbor and the direct neighbor in the embedding space, defined for example as follows:
Figure BDA0003518615030000077
where Agg denotes an aggregation function where information of a central node and messages passed from neighbors are aggregated, and a condition (s.t.) denotes that a target node (i.e., v) is encouragedt) Than its two-hop neighbor (i.e., v)i) Closer to its immediate neighbors (i.e. v)p). If the condition is not satisfied, it will
Figure BDA0003518615030000078
Is set to 0. In other words, if the target node is closer to its two-hop neighbor than the direct neighbor, the second partial optimization may be suspended by setting the hierarchical neighbor penalty to 0, while keeping the direct neighbor penalty to bring the target node closer to its direct neighbor. Furthermore, inBefore the message aggregation step of the AGG is performed, a message propagation step needs to be performed to process the messages delivered from the neighboring nodes, see the following procedure.
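Below is a minimal NumPy sketch of the direct neighbor loss of equation (4) and the hierarchical neighbor loss of equation (5); the use of inner products in the margin condition and the pluggable `agg` callable (a mean over the two-hop representations by default) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def direct_neighbor_loss(h_t, h_p):
    """Eq. (4): encourage the target to stay close to its direct neighbor."""
    return -np.log(sigmoid(h_t @ h_p))

def hierarchical_neighbor_loss(h_t, h_p, h_two_hop,
                               agg=lambda hs: np.mean(hs, axis=0)):
    """Eq. (5): pull the target towards the aggregated two-hop neighbors, but only
    while the target is closer to its direct neighbor than to every two-hop one;
    otherwise the term is suspended by returning 0."""
    if not all(h_t @ h_p > h_t @ h_i for h_i in h_two_hop):
        return 0.0
    return -np.log(sigmoid(h_t @ agg(h_two_hop)))
```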
1) Message propagation step: message propagation takes place between a central node v_i and its neighbor node v_j. For example, encoding the locally structured neighbor, the passed message can be formulated as:

M_{v_j→v_i} = M_v · h_{v_j},  with  M_v = g(τ_{N(v)}, h_{v_j})    (6)

where M_{v_j→v_i} denotes the message passed from node v_j to v_i, with message dimension ∈ R^d, M_v ∈ R^{d×d} is a transformation matrix, τ_{N(v)} is the neighbor type (e.g., one-hop or multi-hop neighbor), h_v ∈ R^d is the hidden representation of node v, and g(·) takes the neighbor type τ_{N(v)} and the neighbor representation h_{v_j} as input and outputs the transformation matrix M_v. Note that the neighbor type τ_{N(v)} is a one-hot code. By concatenating τ_{N(v)} and h_{v_j}, the mapping is learned using a multi-layer perceptron (MLP). The detailed definition of g(·) is as follows:

g(τ_{N(v)}, h_{v_j}) = MLP(τ_{N(v)} ⊕ h_{v_j})    (7)

where ⊕ denotes the concatenation operation.
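A minimal sketch of the message propagation step of equations (6)-(7) follows; the two-layer MLP, its parameterization, and the reshape of the MLP output into a d × d matrix are assumptions made for illustration.

```python
import numpy as np

def propagate_message(h_j, neighbor_type_onehot, mlp_params, d):
    """Eqs. (6)-(7): an MLP maps the concatenation of the one-hot neighbor type
    and the neighbor state h_j to a d x d transformation matrix M_v; the message
    passed to the central node is then M_v @ h_j."""
    W1, b1, W2, b2 = mlp_params                      # assumed two-layer MLP weights
    x = np.concatenate([neighbor_type_onehot, h_j])  # tau (+) h_j
    hidden = np.maximum(W1 @ x + b1, 0.0)            # ReLU hidden layer
    M_v = (W2 @ hidden + b2).reshape(d, d)           # output reshaped into M_v
    return M_v @ h_j
```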
2) Message aggregation step: in this step, the goal is to aggregate the information of the central node (e.g., node 2) and the messages passed from its neighbors (e.g., nodes 3-5), as shown in FIG. 2(d). To reduce the noisy information that may propagate from neighbors, the importance weights of nodes are learned by introducing an attention aggregator based on the attention mechanism. The weight coefficient α_{ij} between two nodes can be expressed by the following equation:

α_{ij} = exp(σ_r(a^T [W h_{v_i} ⊕ W h_{v_j}])) / Σ_{v_k ∈ N(v_i)} exp(σ_r(a^T [W h_{v_i} ⊕ W h_{v_k}]))    (8)

where W ∈ R^{d×d} is a shared weight matrix used to map nodes into the same embedding space, a denotes a weight vector for learning the relationship between the central node and its neighbors, N(v_i) is the neighbor set of node v_i, and σ_r is the ReLU activation function.

Then, using the learned weight coefficients and neighbors, the aggregated representation of node v_i is given by:

Agg(v_i) = σ_r(Σ_{v_j ∈ N(v_i)} α_{ij} · M_{v_j→v_i})    (9)
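A minimal sketch of the attention-based aggregation of equations (8)-(9) follows, written in the GAT style suggested by the text; treating the passed messages as the values being weighted, and the ReLU-softmax form of the scores, are assumptions.

```python
import numpy as np

def attention_aggregate(h_i, messages, W, a):
    """Eqs. (8)-(9): score each neighbor message against the central node with a
    shared projection W and weight vector a, normalize with a softmax, and return
    the weighted sum of the messages."""
    scores = np.array([
        max(a @ np.concatenate([W @ h_i, W @ m]), 0.0)   # ReLU of a^T[Wh_i (+) Wm_j]
        for m in messages
    ])
    alpha = np.exp(scores) / np.exp(scores).sum()        # softmax over the neighbors
    weighted = sum(w * m for w, m in zip(alpha, messages))
    return np.maximum(weighted, 0.0)                     # final nonlinearity
```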
in summary, how to formulate the direct neighbor loss and the hierarchical neighbor loss of equation (3) has been introduced, and the aggregation function of the node, including the message propagation and aggregation steps, has been introduced. Next, the third part of equation 3, namely the negative sample loss or referred to as negative sample loss, will be described.
The negative sampling loss aims at pushing the target node v_t away from the negative samples and their neighbors in the embedding space, and is defined as follows:

L_n(v_t, v_n, N(v_n)) = -log(σ_s(-h_{v_t}^T h_{v_n})) - log(σ_s(-h_{v_t}^T · Agg({h_u, ∀u ∈ N(v_n)}))),  v_n ~ P_NS(v)    (10)

where P_NS(v) denotes the negative-sample distribution and v_n is the negative sample drawn from it. Specifically, P_NS(v) is defined by the following formula:

P_NS(v) = f_v^β / Σ_{u ∈ V} f_u^β    (11)

where f_v is the degree of node v in the graph and β is an empirical power that may, for example, be set to 3/4.
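A minimal sketch of the negative sampling components of equations (10)-(11) is given below; routing the negative's neighbors through the same aggregation callable is an assumption, and the total loss of equation (3) is then simply the sum of the three terms.

```python
import numpy as np

def negative_sampling_distribution(degrees, beta=0.75):
    """Eq. (11): P_NS(v) is proportional to the node degree raised to beta (3/4)."""
    p = np.asarray(degrees, dtype=float) ** beta
    return p / p.sum()

def negative_sampling_loss(h_t, h_n, h_n_neighbors,
                           agg=lambda hs: np.mean(hs, axis=0)):
    """Eq. (10): push the target away from a sampled negative and from the
    aggregation of that negative's neighbors."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    return (-np.log(sigmoid(-(h_t @ h_n)))
            - np.log(sigmoid(-(h_t @ agg(h_n_neighbors)))))

# The hierarchical loss of Eq. (3) then sums the three terms:
# L = direct_neighbor_loss + hierarchical_neighbor_loss + negative_sampling_loss.
```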
D. The LlossNet algorithm provided by the invention
For the sake of clarity, the solution proposed by the present invention is summarized in the following algorithm (Algorithm 1). Specifically, a random walk is performed on the input graph to obtain node sequences, and a new graph is generated using a sliding window (line 2). Then, at each epoch, a batch is sampled from the generated positive context pairs (line 5). Next, for each target node in the batch, its direct neighbor loss, two-hop (hierarchical) neighbor loss and negative sampling loss are constructed according to equations (4), (5) and (10), respectively (lines 7-9). Finally, the above losses are combined via equation (3) and optimized using stochastic gradient descent (line 10). These steps (lines 5-10) are repeated until convergence is reached (i.e., the loss value becomes stable).
[Algorithm 1 (the LlossNet pre-training procedure) is rendered as an image in the original document.]
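Since Algorithm 1 is only available as a figure, the following is a hedged Python sketch of the training loop as described above; `model` is a hypothetical object exposing the three loss terms and a gradient step, and `random_walks`/`unit_graph` refer to the earlier preprocessing sketch.

```python
import random

def train_llossnet(adj, model, epochs=100, batch_size=512):
    """Sketch of Algorithm 1: build the unit graph once, then repeatedly sample
    batches of positive context pairs, assemble the three losses per target node
    and take a stochastic gradient step until the loss stabilizes."""
    walks = random_walks(adj)                          # line 2: random walk preprocessing
    edges = unit_graph(walks)                          # line 2: unit graph via sliding window
    pairs = [(t, p) for t, nbrs in edges.items() for p in nbrs]
    for _ in range(epochs):
        batch = random.sample(pairs, min(batch_size, len(pairs)))   # line 5
        total = 0.0
        for v_t, v_p in batch:                         # lines 7-9: the three loss terms
            total += model.direct_loss(v_t, v_p)
            total += model.hierarchical_loss(v_t, edges[v_p])
            total += model.negative_loss(v_t)
        model.sgd_step(total)                          # line 10: gradient update
    return model
```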
The pre-trained graph neural network can be used in various scenarios, such as recommendation. Taking article content recommendation as an example, the articles include news, social network topics, academic papers and the like; each article is taken as a data node, and the relationships between articles (such as citations and links) are taken as edges. The data nodes are input into the pre-trained graph neural network, and the output is a unique embedding vector learned for each node, which reflects the correlation between nodes. When nodes related to a given node need to be found, the TOP-K related nodes can be retrieved according to the magnitude of the inner products computed between the nodes, as sketched below.
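The retrieval step described above can be implemented directly on the learned embedding matrix; a minimal sketch (function and argument names are illustrative) follows.

```python
import numpy as np

def top_k_related(embeddings, query_index, k=10):
    """Return the indices of the TOP-K nodes most related to the query node,
    ranked by the inner product of the pre-trained embedding vectors."""
    scores = embeddings @ embeddings[query_index]   # inner product with every node
    scores[query_index] = -np.inf                   # exclude the query itself
    return np.argsort(-scores)[:k]
```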
Further, in order to verify the effect of the present invention, simulation experiments were performed. In the experiments, downstream tasks including node classification and clustering were used to evaluate the performance of the pre-trained node representations on real-world data sets.
A. Data set
Four widely used network data sets were tested, and their statistics are shown in Table 1.
Table 1: data set
Figure BDA0003518615030000101
In Table 1, Cora is a research citation network constructed in the prior art, which contains 2708 machine learning articles with 7 labels; Citeseer is another widely used research corpus containing 3264 publications and 6 labels; Wiki is a language network containing 2405 web pages from 17 groups and 12761 edges between them, and this data set has been widely used for the evaluation of vertex classification tasks; DBLP is a computer science bibliography constructed in the prior art. In the experiments, lists of conference papers were selected from 4 research fields: database, data mining, AI and CV.
B. Baseline model
In the experiments, several state-of-the-art methods designed for the pre-training of network embedding were used as baselines, including direct embedding and GNN-based embedding models. The selected baselines are described as follows:
DeepWalk is the first graph embedding method to apply representation learning to node pairs generated by random walks. It is a direct embedding learning method in which no parameters are shared among nodes, and it trains node embeddings using the Skip-gram method.
GCN introduces an efficient convolutional neural network variant that operates directly on graph-structured data and shows breakthrough performance in network embedding learning. To accommodate large-scale networks, an improved version of the GCN method was derived; the present invention adopts this variant for a fairer comparison.
DGI introduces a general approach to unsupervised network representation learning. DGI relies on maximizing the mutual information between the graph-augmented representation and the extracted summary of the current graph, both derived using established graph convolutional network architectures. For the graph-augmented representation, a subgraph is generated from the target nodes. In general, it can be seen as an improved version of the GCN method.
GraphSAGE proposes an inductive way to compute vertex representations that achieves impressive performance on several large-scale benchmarks. For each vertex, GraphSAGE first samples a fixed-size neighborhood and then applies an aggregation function over it to obtain the vertex representation. There are four types of aggregation functions, namely SAGE-mean, SAGE-LSTM, SAGE-maxpool and SAGE-meanpool (a variant of the maxpool aggregator). One major difference between meanpool and mean is that meanpool requires the vector of each neighbor to be fed independently through a fully connected neural network.
GAT proposes a graph attention network that incorporates the attention mechanism into its propagation step. It follows a self-attention strategy to learn the importance weights of the neighbors of the central node, computing a hidden representation of each node by attending over its neighbors.
C. Parameter setting
In the experiments, the present invention was implemented in TensorFlow. For all models, the embedding size was uniformly set to 128, the model parameters were randomly initialized from a Gaussian distribution, and the models were optimized using mini-batch Adam. The learning rate was set to 1e-3, the batch size to 512, and the maximum sequence length to 50. For the GNN variants and the method of the invention, the number of hidden layers was set to 2, following GraphSAGE. Further, for a fair comparison, random walks were employed as preprocessing for all methods, with the walk length, the number of walks per node and the window size set to 30, 5 and 5, respectively.
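For reference, the reported experimental settings can be collected as follows; the dictionary itself is a simple illustrative sketch, while the values are those stated above.

```python
# Hyper-parameter values reported in the experimental setup above.
LLOSSNET_CONFIG = {
    "embedding_size": 128,
    "optimizer": "mini-batch Adam",
    "learning_rate": 1e-3,
    "batch_size": 512,
    "max_sequence_length": 50,
    "num_hidden_layers": 2,   # following GraphSAGE
    "walk_length": 30,
    "walks_per_node": 5,
    "window_size": 5,
}
```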
D. Evaluation metrics
In order to evaluate the pre-training performance of the method on the downstream classification task, node classifiers were built using the Liblinear package with default settings and measured with Micro-F1 and Macro-F1. Furthermore, for the downstream clustering task, two widely used metrics are used to report the results, namely clustering accuracy (ACC) and normalized mutual information (NMI). When the clustering result completely matches the ground-truth labels, the values of ACC and NMI are both 1; otherwise they approach zero.
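The four metrics can be computed with standard tooling; below is a minimal sketch using scikit-learn and SciPy, where clustering accuracy uses the usual Hungarian matching between predicted clusters and ground-truth labels (the encoding of labels as consecutive integers is an assumption).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import f1_score, normalized_mutual_info_score

def classification_scores(y_true, y_pred):
    """Micro-F1 and Macro-F1 for the downstream classification task."""
    return (f1_score(y_true, y_pred, average="micro"),
            f1_score(y_true, y_pred, average="macro"))

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one mapping between predicted clusters and labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                              # co-occurrence counts
    rows, cols = linear_sum_assignment(-cost)        # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)

# NMI is available directly as normalized_mutual_info_score(y_true, y_pred).
```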
E. Evaluating classification performance
In this section, the performance of the various models on the downstream classification task is verified on the four real-world data sets. Specifically, 20%, 40%, 60% and 80% of the nodes were randomly selected as the training set, with the rest used as the test set. The Micro-F1 and Macro-F1 results are shown in Tables 2, 3, 4 and 5.
Table 2: node classification result of Cora
Figure BDA0003518615030000121
Table 3: node classification result of Citeser
Figure BDA0003518615030000122
Table 4: node classification results for Wiki
Figure BDA0003518615030000123
Table 5: node classification result of DBLP
Figure BDA0003518615030000131
The highest score is indicated in bold and the second-highest score is underlined. From the above tables, the following observations can be made:
1) In general, the LlossNet proposed by the present invention outperforms the state-of-the-art models on all data sets under different training ratios, which demonstrates the effectiveness of preserving the margins between the target node and its neighbors.
2) The overall performance of the baselines follows the order LlossNet > DeepWalk > DGI ≈ SAGE-meanpool ≈ SAGE-LSTM ≈ GAT > SAGE-maxpool ≈ SAGE-mean ≈ GCN. More specifically, the LlossNet of the present invention achieves 4.44%, 5.77%, 4.43% and 3.59% Micro-F1 improvements over the second-best model, averaged over the training ratios, on Cora, Citeseer, Wiki and DBLP, respectively. Furthermore, with regard to Macro-F1, LlossNet outperforms the second-best model by 5.26%, 6.27%, 16.69% and 5.07% on the four data sets, respectively. These improvements indicate that the present invention is better able to learn discriminative embeddings in the pre-training of graph-structured data.
3) Among the baseline results, DeepWalk outperforms the GNN-based methods in most cases, for example on the Cora, Citeseer and Wiki data sets. This reveals the problem that GNNs do not perform well in the pre-training of unlabeled nodes. Although GAT achieves the second-best results on the DBLP data set, its self-attention mechanism only considers the importance weights of neighbors, which favors embedding learning but still ignores the margins between multi-hop nodes.
It should be noted that the present invention trains all algorithms in an unsupervised manner, using equation (3) as the objective function of the present invention and equation (1) for the baselines. To evaluate the performance of all methods after pre-training, classifiers are built for the downstream classification task by uniformly using the Liblinear package with default settings, where the stopping criterion is a tolerance of 1e-4 or a maximum of 100 iterations. The test nodes are not further divided into validation and test sets, because the validation setting does not have much impact on the general performance trends of the downstream classification.
F. Clustering performance assessment
To further evaluate the pre-training performance of the present invention, downstream clustering tasks were performed, and all baselines were compared according to the ACC and NMI metrics.
In particular, K-means is run uniformly with default settings on the pre-trained representations of the four data sets. The results are shown in FIG. 3, and the observations are as follows:
1) First, as can be seen from FIG. 3(a), LlossNet outperforms the baselines in ACC, which verifies the effectiveness of the present invention. In general, the order of ACC performance among the baselines is similar to the classification results, with DeepWalk being superior to the GNN-based models except for DGI on Cora. The LlossNet of the present invention improves upon DeepWalk by 26.11%, 3.42%, 3.28% and 3.38% on Cora, Citeseer, Wiki and DBLP, respectively. Furthermore, LlossNet improves upon the ACC performance of DGI by 2.88%.
2) Further, as can be seen from FIG. 3(b), LlossNet also achieves consistent improvements over the other models, which further demonstrates the superiority of the present invention in the pre-training of unlabeled nodes. More specifically, LlossNet improves the NMI performance over the DeepWalk model by 14.66%, 13.64%, 4.21% and 3.54% on the four data sets. Furthermore, LlossNet achieves performance gains of 7.60%, 1.17%, 80.94% and 11.02% compared with the DGI method on Cora, Citeseer, Wiki and DBLP, respectively.
3) The differences in ACC and NMI performance on DBLP across all methods are relatively small, because the number of labels is small (only 4), and thus the differences are not significant.
In addition, the precision and recall performance was also verified. In this section, the goal is to evaluate the precision and recall results of these methods. Two baselines, DeepWalk and GAT, which performed well in the previous classification experiments, were selected. The reported metrics follow scikit-learn, and 50% of the nodes were randomly selected as the training set, with the remaining nodes used as the test set, on Cora and Citeseer. The results are reported in Table 6.
Table 6: cola and Citeser Performance
Figure BDA0003518615030000141
From table 6, the following observations can be made:
1) The present invention achieves better performance than the other baselines. Specifically, LlossNet improves on average by 4.78% and 19.03% over DeepWalk and GAT on Cora, respectively, and by 7.55% and 19.52% over DeepWalk and GAT on Citeseer.
2) Note that the Precision-micro and Recall-micro metrics give the same results. In the single-label classification problem, if there is a false positive there is always a corresponding false negative, and vice versa, since exactly one class is always predicted. Therefore, precision and recall are always the same under the micro-averaging scheme.
Further, ablation experiments with different losses and numbers of layers (hops) were also performed. In this section, the impact of the different losses is studied, i.e., the direct neighbor loss of equation (4) and the hierarchical neighbor loss of equation (5). The performance of different numbers of layers (hops) on the downstream classification task is also studied according to the Micro-F1 and Macro-F1 metrics. To compare the different losses, FIGS. 4 and 5 illustrate the classification performance of LlossNet using the direct loss, the two-hop hierarchical loss and the three-hop hierarchical loss under training ratios of {10%, 30%, 50%, 70%, 90%} on the four data sets, where FIGS. 4(a) to 4(d) correspond to Cora, Wiki, Citeseer and DBLP, respectively, and FIGS. 5(a) to 5(d) also correspond to Cora, Wiki, Citeseer and DBLP, respectively. From these results, it can be observed that:
1) Compared with the direct loss, both the two-hop hierarchical loss and the three-hop hierarchical loss of LlossNet improve the classification performance, which verifies the effectiveness of the method. More specifically, for Micro-F1, the two-hop hierarchical loss improves over the direct loss by 5.60%, 5.14%, 6.51% and 4.27% on Cora, Wiki, Citeseer and DBLP, respectively. The corresponding Macro-F1 gains are 6.50%, 8.23%, 22.23% and 6.54% on these data sets.
2) LlossNet with the three-hop hierarchical loss generally performs better than the two-hop hierarchical loss on Citeseer and DBLP, with improvements of 0.46% and 0.23% for Micro-F1, respectively, but performs worse on Cora and Wiki, with reductions of 0.18% and 2.42%, respectively. Furthermore, for Macro-F1, the three-hop hierarchical loss is superior to the two-hop hierarchical loss on Citeseer and DBLP, yielding gains of 2.15% and 0.38%, respectively, while performing worse on Cora and Wiki, with reductions of 1.09% and 3.81%, respectively. This is because an overly fine multi-hop margin design may compromise the ability of the model to generalize.
3) It can also be observed that the proposed solution shows stable performance across different training ratios. Specifically, the Micro-F1 performance of the two-hop hierarchical loss remains around 78.80%, 54.60%, 62.68% and 79.91% on Cora, Citeseer, Wiki and DBLP, respectively, and the corresponding Macro-F1 performance remains around 77.78%, 49.45%, 50.84% and 74.13% on the four data sets.
In addition, the effectiveness of the preprocessing was also verified. When the network is too complex or not large enough to be analyzed, random walks on the graph can extract node semantics. In one embodiment, unbiased random walks are used to explore the network, where the selection probabilities of the next-hop nodes are equal. This process does not introduce bias into the network formation. To evaluate the impact of the random walk strategy on downstream tasks, the parameters of walk length, window size and number of walks were varied in the preprocessing. The classification results are shown in FIGS. 6 and 7.
In general, the walk length is varied over {10, 15, 20, 25, 30}, the number of walks per node over {2, 3, 4, 5}, and the window size over {2, 3, 4, 5}. From FIGS. 6 and 7, it can be observed that:
1) The classification performance increases with the walk length and becomes stable when the length is between 25 and 30.
2) The performance fluctuates with the number of walks on Cora, while more walks yield better results on Citeseer.
3) For the window size, the classification performance improves when the size increases from 2 to 3; the larger the window size, the better the performance.
From these results, it can be concluded that the random walk preprocessing is effective for exploring the network. It should be understood that the invention is extensible to other preprocessing techniques. The unbiased random walk preprocessing aims to explore the network by searching for more potential node pairs using the parameter settings described above. The method proposed by the invention is independent of the preprocessing and is able to process the original node pairs. Furthermore, when the window size is set to 2, the preprocessing only considers the direct neighbors of each node, which is the same as the original network.
In summary, the present invention proposes a graph attention network with a well-designed hierarchical loss, called LlossNet, for pre-training unlabeled nodes. Although GCNs and their variants have demonstrated effective results in semi-supervised network embedding learning, they achieve inferior performance compared with direct embedding learning (e.g., DeepWalk) on the pre-training of unlabeled nodes, which is carried out in a completely unsupervised manner. The present invention recognizes that, during the inherent information aggregation of GNN-based approaches, the margins between nodes may not be distinguishable. To address this problem, it is proposed to maintain the margin between the target node and its neighbors by constructing three types of losses, namely the direct neighbor loss, the hierarchical neighbor loss and the negative sampling loss. A graph attention network embedding these losses is then presented. Furthermore, the invention can easily be applied to other GNN-based models. The pre-training performance against the baselines is evaluated through downstream classification and clustering tasks, which show that the proposed LlossNet achieves impressive improvements over the state-of-the-art models, verifying the effectiveness of the proposed method.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A network embedding pre-training method based on a graph neural network hierarchical loss function comprises the following steps:
the network is represented as a graph G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges; for each v ∈ V, network embedding is used to learn a low-dimensional representation v ∈ R^d and a set of shared weight matrices W^l for aggregating information from node neighbors, where d denotes the dimension of the embedding space, d ≪ |V|, and l is the layer index of the graph neural network;
for an input graph, a set of walk sequences S is constructed, where each sequence s ∈ S contains a set of nodes {v_1, ..., v_{|s|}};
for the constructed walk sequences, a unit graph is constructed based on nodes that co-occur within a sliding window, wherein the unit graph is generated based on the proximity of each target node to its one-hop and two-hop neighbors;
using the generated unit graphs, the graph neural network model is trained with the objective of minimizing the designed hierarchical loss, wherein the hierarchical loss is expressed as:

L = L_d(v_t, v_p) + L_h(v_t, N(v_p)) + L_n(v_t, v_n, N(v_n))

wherein v_t is the target node, v_p is a direct neighbor in the unit graph, N(v_p) denotes the neighbors of node v_p, v_n is a negative sample sampled from the entire vertex set, N(v_n) denotes the neighbors of v_n, h ∈ R^d denotes the hidden representation of a node with dimension d, and L_d, L_h and L_n denote the direct neighbor loss, the hierarchical neighbor loss and the negative sampling loss, respectively.
2. The method of claim 1, wherein the direct neighbor loss is used to encourage the target node to approach its direct neighbor in the embedding space, and is expressed as:

L_d(v_t, v_p) = -log(σ_s(h_{v_t}^T h_{v_p}))

wherein σ_s is the sigmoid function.
3. The method of claim 1, wherein the hierarchical neighbor loss is used to bring the target node close to its two-hop neighbors and to maintain the margin between the two-hop neighbors and the direct neighbor in the embedding space, and is expressed as:

L_h(v_t, N(v_p)) = -log(σ_s(h_{v_t}^T · Agg({h_{v_i}, ∀v_i ∈ N(v_p)}))),  s.t. h_{v_t}^T h_{v_p} > h_{v_t}^T h_{v_i}

wherein σ_s is the sigmoid function, Agg denotes an aggregation function, and the condition denotes that the target node v_t is encouraged to be closer to its direct neighbor v_p than to its two-hop neighbor v_i; if the condition is not satisfied, L_h is set to 0.
4. The method of claim 1, wherein the negative sampling loss is used to push the target node v_t away from the negative samples and their neighbors in the embedding space, and is expressed as:

L_n(v_t, v_n, N(v_n)) = -log(σ_s(-h_{v_t}^T h_{v_n})) - log(σ_s(-h_{v_t}^T · Agg({h_u, ∀u ∈ N(v_n)}))),  v_n ~ P_NS(v)

P_NS(v) = f_v^β / Σ_{u ∈ V} f_u^β

wherein P_NS(v) denotes the negative-sample distribution, v_n is the negative sample drawn from it, f_v is the degree of node v in the graph, and β is an empirical power.
5. The method of claim 3, further comprising performing message propagation to process messages passed from neighboring nodes prior to message aggregation, wherein the message propagation takes place between a central node v_i and its neighbor node v_j, and the passed message is formulated by:

M_{v_j→v_i} = M_v · h_{v_j},  M_v = g(τ_{N(v)}, h_{v_j})

g(τ_{N(v)}, h_{v_j}) = MLP(τ_{N(v)} ⊕ h_{v_j})

wherein MLP denotes a multi-layer perceptron, ⊕ denotes the concatenation operation, M_{v_j→v_i} denotes the message passed from node v_j to v_i, with message dimension ∈ R^d, M_v ∈ R^{d×d} is a transformation matrix, τ_{N(v)} is the neighbor type, h_v ∈ R^d is the hidden representation of node v, g(·) takes the neighbor type τ_{N(v)} and the neighbor node representation h_{v_j} as inputs and outputs the transformation matrix M_v, and the neighbor type τ_{N(v)} is a one-hot code.
6. The method of claim 5, wherein the message aggregation aims to aggregate the information of the central node and the messages passed from its neighbors, wherein the weight coefficient α_{ij} between two nodes is expressed as:

α_{ij} = exp(σ_r(a^T [W h_{v_i} ⊕ W h_{v_j}])) / Σ_{v_k ∈ N(v_i)} exp(σ_r(a^T [W h_{v_i} ⊕ W h_{v_k}]))

and, using the learned weight coefficients and neighbors, the aggregated representation of node v_i is expressed as:

Agg(v_i) = σ_r(Σ_{v_j ∈ N(v_i)} α_{ij} · M_{v_j→v_i})

wherein W ∈ R^{d×d} is a shared weight matrix used to map nodes into the same embedding space, a denotes a weight vector for learning the relationship between the central node and its neighbors, N(v_i) is the neighbor set of node v_i, and σ_r is the ReLU activation function.
7. The method of claim 1, wherein the set of walk sequences S is obtained by performing random walks on the input graph.
8. The method of claim 1, wherein the unit graph is constructed based on nodes that co-occur within a sliding window, and wherein the positive context pairs in s are defined by {(v_i, v_j) : |i − j| ≤ c, v_i, v_j ∈ s}, where c denotes the window size.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
10. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the steps of the method of any of claims 1 to 8 are implemented when the processor executes the program.
CN202210174674.8A 2022-02-24 2022-02-24 Network embedding pre-training method based on graph neural network hierarchical loss function Pending CN114596473A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210174674.8A CN114596473A (en) 2022-02-24 2022-02-24 Network embedding pre-training method based on graph neural network hierarchical loss function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210174674.8A CN114596473A (en) 2022-02-24 2022-02-24 Network embedding pre-training method based on graph neural network hierarchical loss function

Publications (1)

Publication Number Publication Date
CN114596473A true CN114596473A (en) 2022-06-07

Family

ID=81804572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210174674.8A Pending CN114596473A (en) 2022-02-24 2022-02-24 Network embedding pre-training method based on graph neural network hierarchical loss function

Country Status (1)

Country Link
CN (1) CN114596473A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117648623A (en) * 2023-11-24 2024-03-05 成都理工大学 Network classification algorithm based on pooling comparison learning

Similar Documents

Publication Publication Date Title
Hu et al. Gpt-gnn: Generative pre-training of graph neural networks
Dong et al. Heterogeneous network representation learning.
Chien et al. Node feature extraction by self-supervised multi-scale neighborhood prediction
Chen et al. A tutorial on network embeddings
CN108388651B (en) Text classification method based on graph kernel and convolutional neural network
CN109753589A (en) A kind of figure method for visualizing based on figure convolutional network
Shi et al. Effective decoding in graph auto-encoder using triadic closure
US20240233877A1 (en) Method for predicting reactant molecule, training method, apparatus, and electronic device
CN113378913A (en) Semi-supervised node classification method based on self-supervised learning
CN110264372B (en) Topic community discovery method based on node representation
CN108052683B (en) Knowledge graph representation learning method based on cosine measurement rule
Xiao et al. Link prediction based on feature representation and fusion
CN113255895A (en) Graph neural network representation learning-based structure graph alignment method and multi-graph joint data mining method
CN109960755B (en) User privacy protection method based on dynamic iteration fast gradient
Huang et al. Learning social image embedding with deep multimodal attention networks
CN115858812A (en) Embedded alignment method constructed by computer
CN112256870A (en) Attribute network representation learning method based on self-adaptive random walk
CN112487110A (en) Overlapped community evolution analysis method and system based on network structure and node content
CN115577283A (en) Entity classification method and device, electronic equipment and storage medium
CN114596473A (en) Network embedding pre-training method based on graph neural network hierarchical loss function
Song et al. Spammer detection using graph-level classification model of graph neural network
Xie et al. L-BGNN: Layerwise trained bipartite graph neural networks
CN114842247B (en) Characteristic accumulation-based graph convolution network semi-supervised node classification method
CN116226547A (en) Incremental graph recommendation method based on stream data
Sun et al. Abnormal entity-aware knowledge graph completion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination