CN114596473A - Network embedding pre-training method based on graph neural network hierarchical loss function - Google Patents

Network embedding pre-training method based on graph neural network hierarchical loss function

Info

Publication number
CN114596473A
CN114596473A
Authority
CN
China
Prior art keywords
node
graph
neighbor
nodes
neighbors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210174674.8A
Other languages
Chinese (zh)
Inventor
陈俊扬
伍楷舜
戴志江
巩志国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202210174674.8A priority Critical patent/CN114596473A/en
Publication of CN114596473A publication Critical patent/CN114596473A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a network embedding pre-training method based on a graph neural network hierarchical loss function. The method comprises the following steps: representing a network as a graph, using network embedding to learn a low-dimensional representation and a set of shared weight matrices for aggregating information from node neighbors; for the input graph, constructing a set of walk sequences S, where each sequence s ∈ S contains a set of nodes {v_1, …, v_{|s|}}; for the constructed walk sequences, constructing a unit graph based on nodes that co-occur within a sliding window, the unit graph being generated based on the proximity of each target node to its one-hop and two-hop neighbors; and, using the generated unit graphs, training the graph neural network model with the objective of minimizing the designed hierarchical loss. The invention combines a graph attention network with an effective hierarchical loss, preserves the margin between the target node and its neighbor nodes, and significantly improves model performance.

Description

Network embedding pre-training method based on graph neural network hierarchical loss function
Technical Field
The invention relates to the technical field of data mining and machine learning, in particular to a network embedding pre-training method based on a graph neural network hierarchical loss function.
Background
In a graph composed of nodes and edges, network-embedding pre-training reasonably encodes unlabeled nodes into a low-dimensional space in which each node is close to its neighbors and far from negative samples, so that the nodes perform well in downstream tasks such as classification and clustering.
Graph structures are ubiquitous in real-world scenarios such as social networks and citation networks, and have attracted considerable attention from the research community over the past few years. Network-embedding pre-training aims to map node relationships into a low-dimensional space without any supervisory information (e.g., node labels). The learned embeddings can then be used for subsequent tasks such as node classification and link prediction. Pre-training of network embedding is intended to maintain the proximity of unlabeled nodes in the low-dimensional space. To achieve this goal, current work can be divided into direct network embedding and deep neural network embedding.
For direct network embedding schemes, before the emergence of deep neural networks, direct embedding learning methods such as DeepWalk realized node embedding by applying the SkipGram model to generated random walks. Similar methods, such as Line, node2vec, ACNE-ST, Adv-gaming and HNS, design biased random-walk strategies or negative-sampling models to explore different node relationships. In addition, Line and MNMF employ matrix factorization to learn node embeddings. However, all direct embedding methods suffer from computational inefficiency, since there are no shared parameters between nodes in the encoder. Furthermore, these methods often lack the ability to generalize to new graphs.
For deep neural network embedding schemes, GNN (graph neural network) based methods have recently been proposed to aggregate information from graph-structured data. For example, GCN (graph convolutional network) and its variants DGI and GAT show breakthrough performance on many tasks, including pre-training for network embedding. In contrast to traditional models, the GCN can jointly consider local connectivity and global consistency on the graph. Based on the basic theory of GCN, more variants have been proposed, including GraphSAGE, FastGCN, DGCN and HAN for large-scale datasets. PinSage, built on GraphSAGE, is specially designed for web-scale recommendation systems and represents one of the largest applications of deep graph embedding. Furthermore, MEIRec leverages meta-path-guided GNNs to model complex interactions of items for search-intent recommendation.
Many network-embedding pre-training methods have been proposed. For example, DeepWalk is a graph embedding method that applies representation learning to pairs of nodes generated by random walks. Subsequently, similar methods such as Line, node2vec and TADW have been proposed, but these methods are computationally inefficient since there are no shared parameters between nodes in the encoder. Recently, graph neural networks (GNNs) have shown excellent performance in modeling graph-structured data; typical work includes GCN, GAT and GraphSAGE. In general, the key to the success of GNNs lies in the parameter sharing of the encoder and the multi-layer structure used to capture importance weights between the target node and its neighbors. However, due to their inherent information aggregation mechanism, these GNN-based approaches tend to fall into sub-optimal results during pre-training on unlabeled nodes, thereby affecting the prediction efficiency and accuracy of the model.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a network embedding pre-training method based on a graph neural network hierarchical loss function. The method comprises the following steps:
representing the network as a graph G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges; for each v ∈ V, network embedding is used to learn a low-dimensional representation v ∈ R^d and a set of shared weight matrices W^l for aggregating information from node neighbors, where d denotes the dimension of the embedding space, d ≪ |V|, and l is the layer index of the graph neural network;
for an input graph, constructing a set of walk sequences S, where each sequence s ∈ S contains a set of nodes {v_1, …, v_{|s|}};
for the constructed walk sequences, constructing a unit graph based on nodes that co-occur within a sliding window, wherein the unit graph is generated based on the proximity of each target node to its one-hop and two-hop neighbors;
using the generated unit graphs, training the graph neural network model with the objective of minimizing the designed hierarchical loss, wherein the hierarchical loss is expressed as:

L = L_d(v_t, v_p) + L_h(v_t, N(v_p)) + L_n(v_t, v_n, N(v_n))

where v_t is the target node, v_p is a direct neighbor in the unit graph, N(v_p) denotes the neighbors of node v_p, v_n is a negative sample drawn from the entire vertex set, N(v_n) denotes the neighbors of v_n, h ∈ R^d denotes the hidden representation of a node with dimension d, and L_d, L_h and L_n denote the direct neighbor loss, the hierarchical neighbor loss and the negative sampling loss, respectively.
Compared with the prior art, the present invention fills the research gap concerning the poor performance of GNNs in pre-training unlabeled nodes and addresses the problem that the margins between nodes cannot be distinguished during the inherent node-information aggregation process; it combines a graph attention network with an effective hierarchical loss to preserve the margin between a target node and its neighbors, and treats the two-hop pattern as a unit graph, thereby significantly improving model performance.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of learned embedding spaces for a target node, its direct neighbors and its two-hop neighbors, according to one embodiment of the invention;
FIG. 2 is a process diagram of a network embedding pre-training method based on a graph neural network hierarchical loss function according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of clustering results using ACC and NMI metrics on different data sets, according to one embodiment of the present invention;
FIG. 4 is a schematic illustration of an ablation study using the Micro-F1 metric, in accordance with one embodiment of the present invention;
FIG. 5 is a schematic illustration of an ablation study using the Macro-F1 metric according to another embodiment of the present invention;
FIG. 6 is a diagram illustrating the classification results with respect to the random walk parameter on Cora according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating the classification results with respect to the random walk parameters on Citeseer according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
To facilitate an understanding of the present invention, the drawback of the prior art is first explained, namely that, due to their inherent information aggregation mechanism, GNN-based approaches tend to fall into sub-optimal results during the pre-training of unlabeled nodes. Specifically, referring to FIG. 1, the left diagram is an example of an input graph, the upper-right diagram is an embedding space learned by existing GNNs, and the lower-right diagram is an embedding space learned by the method proposed by the present invention, where node 1 is the target node, node 2 is a direct neighbor, and nodes 3-5 are two-hop neighbors. The main problem with existing GNN-based methods is that they model the distance between the target node and its neighbors in a coarse-grained manner. The margins between nodes are often difficult to distinguish, resulting in poor performance in downstream applications. It has been experimentally verified that, in certain downstream classification and clustering tasks, direct embedding methods (e.g., DeepWalk) may even outperform GNN-based methods (e.g., GCN, GAT and GraphSAGE). We observe that the more discriminative the representation learned in the embedding space, the higher the performance the subsequent task can achieve. In general, one desirable result is that the target node should be closer to its direct neighbors than to any two-hop node. To address this problem, the present invention proposes a simple and efficient graph attention network with hierarchical loss, called LlossNet, to preserve the margins between nodes in the embedding space.
Specifically, the present invention explicitly models the margin constraints imposed by the direct neighbor loss (e.g., nodes 1 and 2 in FIG. 1), the multi-level loss (nodes 1 and 3-5), and the combination of both. Such explicit modeling helps learn more discriminative node representations. However, it is not straightforward to determine the number of hops to be considered, and an overly fine multi-hop margin design may compromise the ability of the model to generalize. Preferably, preserving neighbors within two hops is the best choice for the effectiveness and efficiency of LlossNet in the experiments. In the present invention, this two-hop pattern is referred to as a unit graph, and multi-hop margins can also be considered recursively.
Hereinafter, the pre-training of the existing network embedding, the network embedding pre-training method of the present invention, and the experimental verification result will be described in detail.
1. Preliminaries
Before introducing the proposed hierarchical loss, we first review the graph-based loss function currently used in the pre-training of network embedding. To learn predictive representations in a completely unsupervised setting, following the literature (W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems, 2017, pp. 1024-1034), a graph-based loss function is applied to the output node representations:

L(h_{v_t}) = -log(σ_s(h_{v_t}^T h_{v_p})) - N · E_{v_n ~ P_NS(v)} log(σ_s(-h_{v_t}^T h_{v_n}))    (1)

where v_t is the target node, v_p is its direct neighbor, v_n is a negative sample randomly drawn from the entire vertex set, P_NS(v) denotes the negative-sample distribution, h denotes the output representation in the embedding space, σ_s is the sigmoid function, and N defines the number of negative samples. Note that, given a node v, h_v is a feature aggregated from its neighbors. For simplicity, the generation process of the node representation at layer l is as follows:

h_v^l = σ(W^l · CONCAT(h_v^{l-1}, AGGREGATE({h_u^{l-1}, ∀u ∈ N(v)})))    (2)

where W^l is a trainable weight matrix, σ is a nonlinear activation function, N(v) denotes the neighbors of node v, CONCAT denotes the concatenation operation, and AGGREGATE denotes the aggregation function, whose types include mean, LSTM (long short-term memory network), maxpool (max pooling), meanpool (mean pooling) and attention. From equations (1) and (2), it can be concluded that the current graph-based loss only brings the output representations of the target node and its multi-hop neighbors close in the embedding space. Thus, it cannot preserve the margins between these nodes, even though the target node should be closer to its direct neighbors than to a two-hop node.
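To make the above concrete, the following is a minimal NumPy sketch of the graph-based loss of equation (1) and one aggregation step of equation (2); the mean aggregator, the ReLU nonlinearity and the function names are illustrative assumptions rather than the exact implementation of the cited work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_based_loss(h_t, h_p, h_negs):
    """Eq. (1): pull the target towards a co-occurring neighbor and push it
    away from N sampled negatives (one row of h_negs per negative)."""
    pos_term = -np.log(sigmoid(h_t @ h_p))
    neg_term = -np.sum(np.log(sigmoid(-(h_negs @ h_t))))
    return pos_term + neg_term

def sage_aggregate(h_v, neighbor_h, W):
    """Eq. (2) with the mean aggregator: concatenate the node's own state with
    the mean of its neighbors' states, project with W, apply a nonlinearity."""
    agg = neighbor_h.mean(axis=0)               # AGGREGATE (mean variant)
    z = W @ np.concatenate([h_v, agg])          # CONCAT, then linear map
    return np.maximum(z, 0.0)                   # ReLU as the nonlinearity
```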
2. The model provided by the present invention
A. Problem formulation and notation
Since the goal of the present invention is to learn representations of unlabeled nodes, the network is represented as G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges. For each v ∈ V, network embedding aims to learn a low-dimensional representation v ∈ R^d that preserves network proximity, where d denotes the dimension of the embedding space and d ≪ |V|. In addition, a set of shared weight matrices W^l is learned for aggregating information from node neighbors, where l is the layer index of the GNN.
B. Stage one: unit graph generation
At this stage, the present invention introduces unit graph generation for the input data, as shown in FIG. 2, where FIG. 2(a) is the input graph; FIG. 2(b) performs random walks to convert the graph into sequences of nodes; FIG. 2(c) constructs a new graph based on the set of positive context pairs within a sliding window; FIG. 2(d) preserves, in the embedding space, the proximity of target node v_1 to its direct neighbor v_2 and its two-hop neighbors v_3, v_4, v_5; FIG. 2(e) illustrates the losses proposed by the present invention, including the direct neighbor loss, the hierarchical neighbor loss and the negative sampling loss; v_n denotes a negative sample and h denotes the hidden representation.
Specifically, preprocessing is first performed by running random walks to form a set of walk sequences S, where each sequence s ∈ S contains a set of nodes {v_1, …, v_{|s|}}. A new graph can then be constructed based on the nodes that co-occur within a sliding window. The positive context pairs in s are defined by {(v_i, v_j) : |i − j| ≤ c, v_i, v_j ∈ s}, where c denotes the window size. Taking FIG. 2(c) and FIG. 2(d) as an example, node 2 in the sliding window is connected to nodes 3, 4 and 5, respectively. Note that only the proximity of each target node (e.g., node 1) to its one-hop and two-hop neighbors is considered in this embodiment. With node 1 as the target node, within the sliding window, node 2 is its direct neighbor and nodes 3, 4 and 5 are its two-hop neighbors. The graphs generated in FIGS. 2(c) to 2(d) are referred to herein as unit graphs. Based on this approach, multi-hop margins can be considered recursively. Random walks enrich the node relationships, which benefits pre-training performance. For a fair comparison, random walks were used as preprocessing for all baselines in the experiments, so that the differences lie mainly in the subsequent models.
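A minimal sketch of this preprocessing stage is given below, assuming an unbiased random walk and the walk-length, walks-per-node and window-size values reported later in the experiments; the adjacency structure is a plain dict from node to neighbor list, and all names are illustrative.

```python
import random
from collections import defaultdict

def random_walks(adj, walk_length=30, walks_per_node=5, seed=0):
    """Unbiased random walks: each node starts several walks, and the next hop
    is chosen uniformly among the current node's neighbors."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

def unit_graph(walks, window=5):
    """Connect nodes that co-occur within the sliding window; in the resulting
    graph, a target's one-hop neighbors act as its direct neighbors and their
    neighbors as its two-hop neighbors of the unit graph."""
    edges = defaultdict(set)
    for walk in walks:
        for i, v in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if j != i:
                    edges[v].add(walk[j])
    return edges
```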
C. Stage two: hierarchical loss optimization
As shown in FIGS. 2(d) and 2(e), given the generated unit graphs, the proposed hierarchical loss is defined as follows:

L = L_d(v_t, v_p) + L_h(v_t, N(v_p)) + L_n(v_t, v_n, N(v_n))    (3)

where v_t is the target node (e.g., v_1), v_p is a direct neighbor in the unit graph (e.g., v_2), N(v_p) denotes the neighbors of node v_p (e.g., v_3, v_4, v_5), v_n is a negative sample drawn from the entire vertex set using the alias-table method (A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola, "Reducing the sampling complexity of topic models," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 891-900), which takes only O(1) time, N(v_n) denotes the neighbors of v_n, h ∈ R^d denotes the hidden representation of a node with dimension d, and L_d, L_h and L_n denote the direct neighbor loss, the hierarchical neighbor loss and the negative sampling loss, respectively. More details are presented below.
The direct neighbor loss (the first part of equation (3)) encourages the target node to be close to its direct neighbor in the embedding space and, without loss of generality, is defined as follows:

L_d(v_t, v_p) = -log(σ_s(h_{v_t}^T h_{v_p}))    (4)

where σ_s is the sigmoid function.
the hierarchical neighbor penalty (second part of equation 3) is to bring the target node close to its two-hop neighbor while keeping the margins of the two-hop neighbor and the direct neighbor in the embedding space, defined for example as follows:
Figure BDA0003518615030000077
where Agg denotes an aggregation function where information of a central node and messages passed from neighbors are aggregated, and a condition (s.t.) denotes that a target node (i.e., v) is encouragedt) Than its two-hop neighbor (i.e., v)i) Closer to its immediate neighbors (i.e. v)p). If the condition is not satisfied, it will
Figure BDA0003518615030000078
Is set to 0. In other words, if the target node is closer to its two-hop neighbor than the direct neighbor, the second partial optimization may be suspended by setting the hierarchical neighbor penalty to 0, while keeping the direct neighbor penalty to bring the target node closer to its direct neighbor. Furthermore, inBefore the message aggregation step of the AGG is performed, a message propagation step needs to be performed to process the messages delivered from the neighboring nodes, see the following procedure.
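Below is a minimal NumPy sketch of the direct neighbor loss of equation (4) and the hierarchical neighbor loss of equation (5); the use of inner products in the margin condition and the pluggable `agg` callable (a mean over the two-hop representations by default) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def direct_neighbor_loss(h_t, h_p):
    """Eq. (4): encourage the target to stay close to its direct neighbor."""
    return -np.log(sigmoid(h_t @ h_p))

def hierarchical_neighbor_loss(h_t, h_p, h_two_hop,
                               agg=lambda hs: np.mean(hs, axis=0)):
    """Eq. (5): pull the target towards the aggregated two-hop neighbors, but only
    while the target is closer to its direct neighbor than to every two-hop one;
    otherwise the term is suspended by returning 0."""
    if not all(h_t @ h_p > h_t @ h_i for h_i in h_two_hop):
        return 0.0
    return -np.log(sigmoid(h_t @ agg(h_two_hop)))
```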
1) Message propagation step: message propagation takes place between a central node v_i and its neighbor node v_j. For example, encoding the locally structured neighbor, the passed message can be formulated as:

M_{v_j→v_i} = M_v · h_{v_j},  with  M_v = g(τ_{N(v)}, h_{v_j})    (6)

where M_{v_j→v_i} denotes the message passed from node v_j to v_i, with message dimension ∈ R^d, M_v ∈ R^{d×d} is a transformation matrix, τ_{N(v)} is the neighbor type (e.g., one-hop or multi-hop neighbor), h_v ∈ R^d is the hidden representation of node v, and g(·) takes the neighbor type τ_{N(v)} and the neighbor representation h_{v_j} as input and outputs the transformation matrix M_v. Note that the neighbor type τ_{N(v)} is a one-hot code. By concatenating τ_{N(v)} and h_{v_j}, the mapping is learned using a multi-layer perceptron (MLP). The detailed definition of g(·) is as follows:

g(τ_{N(v)}, h_{v_j}) = MLP(τ_{N(v)} ⊕ h_{v_j})    (7)

where ⊕ denotes the concatenation operation.
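A minimal sketch of the message propagation step of equations (6)-(7) follows; the two-layer MLP, its parameterization, and the reshape of the MLP output into a d × d matrix are assumptions made for illustration.

```python
import numpy as np

def propagate_message(h_j, neighbor_type_onehot, mlp_params, d):
    """Eqs. (6)-(7): an MLP maps the concatenation of the one-hot neighbor type
    and the neighbor state h_j to a d x d transformation matrix M_v; the message
    passed to the central node is then M_v @ h_j."""
    W1, b1, W2, b2 = mlp_params                      # assumed two-layer MLP weights
    x = np.concatenate([neighbor_type_onehot, h_j])  # tau (+) h_j
    hidden = np.maximum(W1 @ x + b1, 0.0)            # ReLU hidden layer
    M_v = (W2 @ hidden + b2).reshape(d, d)           # output reshaped into M_v
    return M_v @ h_j
```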
2) Message aggregation step: in this step, the goal is to aggregate the information of the central node (e.g., node 2) and the messages passed from its neighbors (e.g., nodes 3-5), as shown in FIG. 2(d). To reduce the noisy information that may propagate from neighbors, the importance weights of nodes are learned by introducing an attention aggregator based on the attention mechanism. The weight coefficient α_{ij} between two nodes can be expressed by the following equation:

α_{ij} = exp(σ_r(a^T [W h_{v_i} ⊕ W h_{v_j}])) / Σ_{v_k ∈ N(v_i)} exp(σ_r(a^T [W h_{v_i} ⊕ W h_{v_k}]))    (8)

where W ∈ R^{d×d} is a shared weight matrix used to map nodes into the same embedding space, a denotes a weight vector for learning the relationship between the central node and its neighbors, N(v_i) is the neighbor set of node v_i, and σ_r is the ReLU activation function.

Then, using the learned weight coefficients and neighbors, the aggregated representation of node v_i is given by:

Agg(v_i) = σ_r(Σ_{v_j ∈ N(v_i)} α_{ij} · M_{v_j→v_i})    (9)
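A minimal sketch of the attention-based aggregation of equations (8)-(9) follows, written in the GAT style suggested by the text; treating the passed messages as the values being weighted, and the ReLU-softmax form of the scores, are assumptions.

```python
import numpy as np

def attention_aggregate(h_i, messages, W, a):
    """Eqs. (8)-(9): score each neighbor message against the central node with a
    shared projection W and weight vector a, normalize with a softmax, and return
    the weighted sum of the messages."""
    scores = np.array([
        max(a @ np.concatenate([W @ h_i, W @ m]), 0.0)   # ReLU of a^T[Wh_i (+) Wm_j]
        for m in messages
    ])
    alpha = np.exp(scores) / np.exp(scores).sum()        # softmax over the neighbors
    weighted = sum(w * m for w, m in zip(alpha, messages))
    return np.maximum(weighted, 0.0)                     # final nonlinearity
```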
in summary, how to formulate the direct neighbor loss and the hierarchical neighbor loss of equation (3) has been introduced, and the aggregation function of the node, including the message propagation and aggregation steps, has been introduced. Next, the third part of equation 3, namely the negative sample loss or referred to as negative sample loss, will be described.
The negative sampling loss aims at pushing the target node v_t away from the negative samples and their neighbors in the embedding space, and is defined as follows:

L_n(v_t, v_n, N(v_n)) = -log(σ_s(-h_{v_t}^T h_{v_n})) - log(σ_s(-h_{v_t}^T · Agg({h_u, ∀u ∈ N(v_n)}))),  v_n ~ P_NS(v)    (10)

where P_NS(v) denotes the negative-sample distribution and v_n is the negative sample drawn from it. Specifically, P_NS(v) is defined by the following formula:

P_NS(v) = f_v^β / Σ_{u ∈ V} f_u^β    (11)

where f_v is the degree of node v in the graph and β is an empirical power that may, for example, be set to 3/4.
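A minimal sketch of the negative sampling components of equations (10)-(11) is given below; routing the negative's neighbors through the same aggregation callable is an assumption, and the total loss of equation (3) is then simply the sum of the three terms.

```python
import numpy as np

def negative_sampling_distribution(degrees, beta=0.75):
    """Eq. (11): P_NS(v) is proportional to the node degree raised to beta (3/4)."""
    p = np.asarray(degrees, dtype=float) ** beta
    return p / p.sum()

def negative_sampling_loss(h_t, h_n, h_n_neighbors,
                           agg=lambda hs: np.mean(hs, axis=0)):
    """Eq. (10): push the target away from a sampled negative and from the
    aggregation of that negative's neighbors."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    return (-np.log(sigmoid(-(h_t @ h_n)))
            - np.log(sigmoid(-(h_t @ agg(h_n_neighbors)))))

# The hierarchical loss of Eq. (3) then sums the three terms:
# L = direct_neighbor_loss + hierarchical_neighbor_loss + negative_sampling_loss.
```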
D. The LlossNet algorithm provided by the invention
For the sake of clarity, the solution proposed by the present invention is summarized in the following algorithm (Algorithm 1). Specifically, a random walk is performed on the input graph to obtain node sequences, and a new graph is generated using a sliding window (line 2). Then, at each epoch, a batch is sampled from the generated positive context pairs (line 5). Next, for each target node in the batch, its direct neighbor loss, two-hop (hierarchical) neighbor loss and negative sampling loss are constructed according to equations (4), (5) and (10), respectively (lines 7-9). Finally, the above losses are combined via equation (3) and optimized using stochastic gradient descent (line 10). These steps (lines 5-10) are repeated until convergence is reached (i.e., the loss value becomes stable).
[Algorithm 1 (the LlossNet pre-training procedure) is rendered as an image in the original document.]
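Since Algorithm 1 is only available as a figure, the following is a hedged Python sketch of the training loop as described above; `model` is a hypothetical object exposing the three loss terms and a gradient step, and `random_walks`/`unit_graph` refer to the earlier preprocessing sketch.

```python
import random

def train_llossnet(adj, model, epochs=100, batch_size=512):
    """Sketch of Algorithm 1: build the unit graph once, then repeatedly sample
    batches of positive context pairs, assemble the three losses per target node
    and take a stochastic gradient step until the loss stabilizes."""
    walks = random_walks(adj)                          # line 2: random walk preprocessing
    edges = unit_graph(walks)                          # line 2: unit graph via sliding window
    pairs = [(t, p) for t, nbrs in edges.items() for p in nbrs]
    for _ in range(epochs):
        batch = random.sample(pairs, min(batch_size, len(pairs)))   # line 5
        total = 0.0
        for v_t, v_p in batch:                         # lines 7-9: the three loss terms
            total += model.direct_loss(v_t, v_p)
            total += model.hierarchical_loss(v_t, edges[v_p])
            total += model.negative_loss(v_t)
        model.sgd_step(total)                          # line 10: gradient update
    return model
```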
The pre-trained graph neural network can be used in various scenarios, such as recommendation. Taking article content recommendation as an example, the articles include news, social network topics, academic papers and the like; each article is taken as a data node, and the relationships between articles (such as citations and links) are taken as edges. The data nodes are input into the pre-trained graph neural network, and the output is a unique embedding vector learned for each node, which reflects the correlation between nodes. When nodes related to a given node need to be found, the TOP-K related nodes can be retrieved according to the magnitude of the inner products computed between the nodes, as sketched below.
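The retrieval step described above can be implemented directly on the learned embedding matrix; a minimal sketch (function and argument names are illustrative) follows.

```python
import numpy as np

def top_k_related(embeddings, query_index, k=10):
    """Return the indices of the TOP-K nodes most related to the query node,
    ranked by the inner product of the pre-trained embedding vectors."""
    scores = embeddings @ embeddings[query_index]   # inner product with every node
    scores[query_index] = -np.inf                   # exclude the query itself
    return np.argsort(-scores)[:k]
```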
Further, in order to verify the effect of the present invention, simulation experiments were performed. In the experiments, downstream tasks including node classification and clustering were used to evaluate the performance of the pre-trained node representations on real-world data sets.
A. Data set
Four widely used network data sets were tested, and their statistics are shown in Table 1.
Table 1: data set
Figure BDA0003518615030000101
In Table 1, Cora is a research citation network constructed in the prior art, which contains 2708 machine learning articles with 7 labels; Citeseer is another widely used research corpus containing 3264 publications and 6 labels; Wiki is a language network containing 2405 web pages from 17 groups and 12761 edges between them, and this data set has been widely used for the evaluation of vertex classification tasks; DBLP is a computer science bibliography constructed in the prior art. In the experiments, lists of conference papers were selected from 4 research fields: database, data mining, AI and CV.
B. Baseline model
In the experiments, several state-of-the-art methods designed for the pre-training of network embedding were used as baselines, including direct embedding and GNN-based embedding models. The selected baselines are described as follows:
DeepWalk is the first graph embedding method to apply representation learning to node pairs generated by random walks. It is a direct embedding learning method in which no parameters are shared among nodes, and it trains node embeddings using the Skip-gram method.
GCN introduces an efficient convolutional neural network variant that operates directly on graph-structured data and shows breakthrough performance in network embedding learning. To accommodate large-scale networks, an improved version of the GCN method was derived; the present invention adopts this variant for a fairer comparison.
DGI introduces a general approach to unsupervised network representation learning. DGI relies on maximizing the mutual information between the graph-augmented representation and the extracted summary of the current graph, both derived using established graph convolutional network architectures. For the graph-augmented representation, a subgraph is generated from the target nodes. In general, it can be seen as an improved version of the GCN method.
GraphSAGE proposes an inductive way to compute vertex representations that achieves impressive performance on several large-scale benchmarks. For each vertex, GraphSAGE first samples a fixed-size neighborhood and then applies an aggregation function over it to obtain the vertex representation. There are four types of aggregation functions, namely SAGE-mean, SAGE-LSTM, SAGE-maxpool and SAGE-meanpool (a variant of the maxpool aggregator). One major difference between meanpool and mean is that meanpool requires the vector of each neighbor to be fed independently through a fully connected neural network.
GAT proposes a graph attention network that incorporates the attention mechanism into its propagation step. It follows a self-attention strategy to learn the importance weights of the neighbors of the central node, computing a hidden representation of each node by attending over its neighbors.
C. Parameter setting
In the experiments, the present invention was implemented in TensorFlow. For all models, the embedding size was uniformly set to 128, the model parameters were randomly initialized from a Gaussian distribution, and the models were optimized using mini-batch Adam. The learning rate was set to 1e-3, the batch size to 512, and the maximum sequence length to 50. For the GNN variants and the method of the invention, the number of hidden layers was set to 2, following GraphSAGE. Further, for a fair comparison, random walks were employed as preprocessing for all methods, with the walk length, the number of walks per node and the window size set to 30, 5 and 5, respectively.
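For reference, the reported experimental settings can be collected as follows; the dictionary itself is a simple illustrative sketch, while the values are those stated above.

```python
# Hyper-parameter values reported in the experimental setup above.
LLOSSNET_CONFIG = {
    "embedding_size": 128,
    "optimizer": "mini-batch Adam",
    "learning_rate": 1e-3,
    "batch_size": 512,
    "max_sequence_length": 50,
    "num_hidden_layers": 2,   # following GraphSAGE
    "walk_length": 30,
    "walks_per_node": 5,
    "window_size": 5,
}
```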
D. Evaluation metrics
In order to evaluate the pre-training performance of the method on the downstream classification task, node classifiers were built using the Liblinear package with default settings and measured with Micro-F1 and Macro-F1. Furthermore, for the downstream clustering task, two widely used metrics are used to report the results, namely clustering accuracy (ACC) and normalized mutual information (NMI). When the clustering result completely matches the ground-truth labels, the values of ACC and NMI are both 1; otherwise they approach zero.
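The four metrics can be computed with standard tooling; below is a minimal sketch using scikit-learn and SciPy, where clustering accuracy uses the usual Hungarian matching between predicted clusters and ground-truth labels (the encoding of labels as consecutive integers is an assumption).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import f1_score, normalized_mutual_info_score

def classification_scores(y_true, y_pred):
    """Micro-F1 and Macro-F1 for the downstream classification task."""
    return (f1_score(y_true, y_pred, average="micro"),
            f1_score(y_true, y_pred, average="macro"))

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one mapping between predicted clusters and labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                              # co-occurrence counts
    rows, cols = linear_sum_assignment(-cost)        # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)

# NMI is available directly as normalized_mutual_info_score(y_true, y_pred).
```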
E. Evaluating classification performance
In this section, the performance of the various models on the downstream classification task is verified on the four real-world data sets. Specifically, 20%, 40%, 60% and 80% of the nodes were randomly selected as the training set, with the rest used as the test set. The Micro-F1 and Macro-F1 results are shown in Tables 2, 3, 4 and 5.
Table 2: node classification result of Cora
Figure BDA0003518615030000121
Table 3: node classification result of Citeser
Figure BDA0003518615030000122
Table 4: node classification results for Wiki
Figure BDA0003518615030000123
Table 5: node classification result of DBLP
Figure BDA0003518615030000131
The highest score is indicated in bold and the second-highest score is underlined. From the above tables, the following observations can be made:
1) In general, the LlossNet proposed by the present invention outperforms the state-of-the-art models on all data sets under different training ratios, which demonstrates the effectiveness of preserving the margins between the target node and its neighbors.
2) The overall performance of the baselines follows the order LlossNet > DeepWalk > DGI ≈ SAGE-meanpool ≈ SAGE-LSTM ≈ GAT > SAGE-maxpool ≈ SAGE-mean ≈ GCN. More specifically, the LlossNet of the present invention achieves 4.44%, 5.77%, 4.43% and 3.59% Micro-F1 improvements over the second-best model, averaged over the training ratios, on Cora, Citeseer, Wiki and DBLP, respectively. Furthermore, with regard to Macro-F1, LlossNet outperforms the second-best model by 5.26%, 6.27%, 16.69% and 5.07% on the four data sets, respectively. These improvements indicate that the present invention is better able to learn discriminative embeddings in the pre-training of graph-structured data.
3) Among the baseline results, DeepWalk outperforms the GNN-based methods in most cases, for example on the Cora, Citeseer and Wiki data sets. This reveals the problem that GNNs do not perform well in the pre-training of unlabeled nodes. Although GAT achieves the second-best results on the DBLP data set, its self-attention mechanism only considers the importance weights of neighbors, which favors embedding learning but still ignores the margins between multi-hop nodes.
It should be noted that the present invention trains all algorithms in an unsupervised manner, using equation (3) as the objective function of the present invention and equation (1) for the baselines. To evaluate the performance of all methods after pre-training, classifiers are built for the downstream classification task by uniformly using the Liblinear package with default settings, where the stopping criterion is a tolerance of 1e-4 or a maximum of 100 iterations. The test nodes are not further divided into validation and test sets, because the validation setting does not have much impact on the general performance trends of the downstream classification.
F. Clustering performance assessment
To further evaluate the pre-training performance of the present invention, downstream clustering tasks were performed, and all baselines were compared according to the ACC and NMI metrics.
In particular, K-means is run uniformly with default settings on the pre-trained representations of the four data sets. The results are shown in FIG. 3, and the observations are as follows:
1) First, as can be seen from FIG. 3(a), LlossNet outperforms the baselines in ACC, which verifies the effectiveness of the present invention. In general, the order of ACC performance among the baselines is similar to the classification results, with DeepWalk being superior to the GNN-based models except for DGI on Cora. The LlossNet of the present invention improves upon DeepWalk by 26.11%, 3.42%, 3.28% and 3.38% on Cora, Citeseer, Wiki and DBLP, respectively. Furthermore, LlossNet improves upon the ACC performance of DGI by 2.88%.
2) Further, as can be seen from FIG. 3(b), LlossNet also achieves consistent improvements over the other models, which further demonstrates the superiority of the present invention in the pre-training of unlabeled nodes. More specifically, LlossNet improves the NMI performance over the DeepWalk model by 14.66%, 13.64%, 4.21% and 3.54% on the four data sets. Furthermore, LlossNet achieves performance gains of 7.60%, 1.17%, 80.94% and 11.02% compared with the DGI method on Cora, Citeseer, Wiki and DBLP, respectively.
3) The differences in ACC and NMI performance on DBLP across all methods are relatively small, because the number of labels is small (only 4), and thus the differences are not significant.
In addition, the precision and recall performance was also verified. In this section, the goal is to evaluate the precision and recall results of these methods. Two baselines, DeepWalk and GAT, which performed well in the previous classification experiments, were selected. The reported metrics follow scikit-learn, and 50% of the nodes were randomly selected as the training set, with the remaining nodes used as the test set, on Cora and Citeseer. The results are reported in Table 6.
Table 6: cola and Citeser Performance
Figure BDA0003518615030000141
From table 6, the following observations can be made:
1) The present invention achieves better performance than the other baselines. Specifically, LlossNet improves on average by 4.78% and 19.03% over DeepWalk and GAT on Cora, respectively, and by 7.55% and 19.52% over DeepWalk and GAT on Citeseer.
2) Note that the Precision-micro and Recall-micro metrics give the same results. In the single-label classification problem, if there is a false positive there is always a corresponding false negative, and vice versa, since exactly one class is always predicted. Therefore, precision and recall are always the same under the micro-averaging scheme.
Further, ablation experiments with different losses and numbers of layers (hops) were also performed. In this section, the impact of the different losses is studied, i.e., the direct neighbor loss of equation (4) and the hierarchical neighbor loss of equation (5). The performance of different numbers of layers (hops) on the downstream classification task is also studied according to the Micro-F1 and Macro-F1 metrics. To compare the different losses, FIGS. 4 and 5 illustrate the classification performance of LlossNet using the direct loss, the two-hop hierarchical loss and the three-hop hierarchical loss under training ratios of {10%, 30%, 50%, 70%, 90%} on the four data sets, where FIGS. 4(a) to 4(d) correspond to Cora, Wiki, Citeseer and DBLP, respectively, and FIGS. 5(a) to 5(d) also correspond to Cora, Wiki, Citeseer and DBLP, respectively. From these results, it can be observed that:
1) Compared with the direct loss, both the two-hop hierarchical loss and the three-hop hierarchical loss of LlossNet improve the classification performance, which verifies the effectiveness of the method. More specifically, for Micro-F1, the two-hop hierarchical loss improves over the direct loss by 5.60%, 5.14%, 6.51% and 4.27% on Cora, Wiki, Citeseer and DBLP, respectively. The corresponding Macro-F1 gains are 6.50%, 8.23%, 22.23% and 6.54% on these data sets.
2) LlossNet with the three-hop hierarchical loss generally performs better than the two-hop hierarchical loss on Citeseer and DBLP, with improvements of 0.46% and 0.23% for Micro-F1, respectively, but performs worse on Cora and Wiki, with reductions of 0.18% and 2.42%, respectively. Furthermore, for Macro-F1, the three-hop hierarchical loss is superior to the two-hop hierarchical loss on Citeseer and DBLP, yielding gains of 2.15% and 0.38%, respectively, while performing worse on Cora and Wiki, with reductions of 1.09% and 3.81%, respectively. This is because an overly fine multi-hop margin design may compromise the ability of the model to generalize.
3) It can also be observed that the proposed solution shows stable performance across different training ratios. Specifically, the Micro-F1 performance of the two-hop hierarchical loss remains around 78.80%, 54.60%, 62.68% and 79.91% on Cora, Citeseer, Wiki and DBLP, respectively, and the corresponding Macro-F1 performance remains around 77.78%, 49.45%, 50.84% and 74.13% on the four data sets.
In addition, the effectiveness of the preprocessing was also verified. When the network is too complex or not large enough to be analyzed, random walks on the graph can extract node semantics. In one embodiment, unbiased random walks are used to explore the network, where the selection probabilities of the next-hop nodes are equal. This process does not introduce bias into the network formation. To evaluate the impact of the random walk strategy on downstream tasks, the parameters of walk length, window size and number of walks were varied in the preprocessing. The classification results are shown in FIGS. 6 and 7.
In general, the walk length is varied over {10, 15, 20, 25, 30}, the number of walks per node over {2, 3, 4, 5}, and the window size over {2, 3, 4, 5}. From FIGS. 6 and 7, it can be observed that:
1) The classification performance increases with the walk length and becomes stable when the length is between 25 and 30.
2) The performance fluctuates with the number of walks on Cora, while more walks yield better results on Citeseer.
3) For the window size, the classification performance improves when the size increases from 2 to 3; the larger the window size, the better the performance.
From these results, it can be concluded that the random walk preprocessing is effective for exploring the network. It should be understood that the invention is extensible to other preprocessing techniques. The unbiased random walk preprocessing aims to explore the network by searching for more potential node pairs using the parameter settings described above. The method proposed by the invention is independent of the preprocessing and is able to process the original node pairs. Furthermore, when the window size is set to 2, the preprocessing only considers the direct neighbors of each node, which is the same as the original network.
In summary, the present invention proposes a graph attention network with a well-designed hierarchical loss, called LlossNet, for pre-training unlabeled nodes. Although GCNs and their variants have demonstrated effective results in semi-supervised network embedding learning, they achieve inferior performance compared with direct embedding learning (e.g., DeepWalk) on the pre-training of unlabeled nodes, which is carried out in a completely unsupervised manner. The present invention recognizes that, during the inherent information aggregation of GNN-based approaches, the margins between nodes may not be distinguishable. To address this problem, it is proposed to maintain the margin between the target node and its neighbors by constructing three types of losses, namely the direct neighbor loss, the hierarchical neighbor loss and the negative sampling loss. A graph attention network embedding these losses is then presented. Furthermore, the invention can easily be applied to other GNN-based models. The pre-training performance against the baselines is evaluated through downstream classification and clustering tasks, which show that the proposed LlossNet achieves impressive improvements over the state-of-the-art models, verifying the effectiveness of the proposed method.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A network embedding pre-training method based on a graph neural network hierarchical loss function comprises the following steps:
the network is represented as a graph G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges; for each v ∈ V, network embedding is used to learn a low-dimensional representation v ∈ R^d and a set of shared weight matrices W^l for aggregating information from node neighbors, where d denotes the dimension of the embedding space, d ≪ |V|, and l is the layer index of the graph neural network;
for an input graph, a set of walk sequences S is constructed, where each sequence s ∈ S contains a set of nodes {v_1, ..., v_{|s|}};
for the constructed walk sequences, a unit graph is constructed based on nodes that co-occur within a sliding window, wherein the unit graph is generated based on the proximity of each target node to its one-hop and two-hop neighbors;
using the generated unit graphs, the graph neural network model is trained with the objective of minimizing the designed hierarchical loss, wherein the hierarchical loss is expressed as:

L = L_d(v_t, v_p) + L_h(v_t, N(v_p)) + L_n(v_t, v_n, N(v_n))

wherein v_t is the target node, v_p is a direct neighbor in the unit graph, N(v_p) denotes the neighbors of node v_p, v_n is a negative sample sampled from the entire vertex set, N(v_n) denotes the neighbors of v_n, h ∈ R^d denotes the hidden representation of a node with dimension d, and L_d, L_h and L_n denote the direct neighbor loss, the hierarchical neighbor loss and the negative sampling loss, respectively.
2. The method of claim 1, wherein the direct neighbor loss is used to encourage the target node to approach its direct neighbor in the embedding space, and is expressed as:

L_d(v_t, v_p) = -log(σ_s(h_{v_t}^T h_{v_p}))

wherein σ_s is the sigmoid function.
3. The method of claim 1, wherein the hierarchical neighbor loss is used to bring the target node close to its two-hop neighbors and to maintain the margin between the two-hop neighbors and the direct neighbor in the embedding space, and is expressed as:

L_h(v_t, N(v_p)) = -log(σ_s(h_{v_t}^T · Agg({h_{v_i}, ∀v_i ∈ N(v_p)}))),  s.t. h_{v_t}^T h_{v_p} > h_{v_t}^T h_{v_i}

wherein σ_s is the sigmoid function, Agg denotes an aggregation function, and the condition denotes that the target node v_t is encouraged to be closer to its direct neighbor v_p than to its two-hop neighbor v_i; if the condition is not satisfied, L_h is set to 0.
4. The method of claim 1, wherein the negative sampling loss is used to push the target node v_t away from the negative samples and their neighbors in the embedding space, and is expressed as:

L_n(v_t, v_n, N(v_n)) = -log(σ_s(-h_{v_t}^T h_{v_n})) - log(σ_s(-h_{v_t}^T · Agg({h_u, ∀u ∈ N(v_n)}))),  v_n ~ P_NS(v)

P_NS(v) = f_v^β / Σ_{u ∈ V} f_u^β

wherein P_NS(v) denotes the negative-sample distribution, v_n is the negative sample drawn from it, f_v is the degree of node v in the graph, and β is an empirical power.
5. The method of claim 3, further comprising performing message propagation to process messages passed from neighboring nodes prior to message aggregation, wherein the message propagation takes place between a central node v_i and its neighbor node v_j, and the passed message is formulated by:

M_{v_j→v_i} = M_v · h_{v_j},  M_v = g(τ_{N(v)}, h_{v_j})

g(τ_{N(v)}, h_{v_j}) = MLP(τ_{N(v)} ⊕ h_{v_j})

wherein MLP denotes a multi-layer perceptron, ⊕ denotes the concatenation operation, M_{v_j→v_i} denotes the message passed from node v_j to v_i, with message dimension ∈ R^d, M_v ∈ R^{d×d} is a transformation matrix, τ_{N(v)} is the neighbor type, h_v ∈ R^d is the hidden representation of node v, g(·) takes the neighbor type τ_{N(v)} and the neighbor node representation h_{v_j} as inputs and outputs the transformation matrix M_v, and the neighbor type τ_{N(v)} is a one-hot code.
6. The method of claim 5, wherein the message aggregation aims to aggregate the information of the central node and the messages passed from its neighbors, wherein the weight coefficient α_{ij} between two nodes is expressed as:

α_{ij} = exp(σ_r(a^T [W h_{v_i} ⊕ W h_{v_j}])) / Σ_{v_k ∈ N(v_i)} exp(σ_r(a^T [W h_{v_i} ⊕ W h_{v_k}]))

and, using the learned weight coefficients and neighbors, the aggregated representation of node v_i is expressed as:

Agg(v_i) = σ_r(Σ_{v_j ∈ N(v_i)} α_{ij} · M_{v_j→v_i})

wherein W ∈ R^{d×d} is a shared weight matrix used to map nodes into the same embedding space, a denotes a weight vector for learning the relationship between the central node and its neighbors, N(v_i) is the neighbor set of node v_i, and σ_r is the ReLU activation function.
7. The method of claim 1, wherein the set of walk sequences S is obtained by performing random walks on the input graph.
8. The method of claim 1, wherein the unit graph is constructed based on nodes that co-occur within a sliding window, and wherein the positive context pairs in s are defined by {(v_i, v_j) : |i − j| ≤ c, v_i, v_j ∈ s}, where c denotes the window size.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
10. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the steps of the method of any of claims 1 to 8 are implemented when the processor executes the program.
CN202210174674.8A 2022-02-24 2022-02-24 Network embedding pre-training method based on graph neural network hierarchical loss function Pending CN114596473A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210174674.8A CN114596473A (en) 2022-02-24 2022-02-24 Network embedding pre-training method based on graph neural network hierarchical loss function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210174674.8A CN114596473A (en) 2022-02-24 2022-02-24 Network embedding pre-training method based on graph neural network hierarchical loss function

Publications (1)

Publication Number Publication Date
CN114596473A true CN114596473A (en) 2022-06-07

Family

ID=81804572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210174674.8A Pending CN114596473A (en) 2022-02-24 2022-02-24 Network embedding pre-training method based on graph neural network hierarchical loss function

Country Status (1)

Country Link
CN (1) CN114596473A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117648623A (en) * 2023-11-24 2024-03-05 成都理工大学 Network classification algorithm based on pooling comparison learning

Similar Documents

Publication Publication Date Title
Hu et al. Gpt-gnn: Generative pre-training of graph neural networks
Dong et al. Heterogeneous network representation learning.
Chien et al. Node feature extraction by self-supervised multi-scale neighborhood prediction
Chen et al. A tutorial on network embeddings
CN108388651B (en) Text classification method based on graph kernel and convolutional neural network
CN109753589A (en) A kind of figure method for visualizing based on figure convolutional network
Shi et al. Effective decoding in graph auto-encoder using triadic closure
US20240233877A1 (en) Method for predicting reactant molecule, training method, apparatus, and electronic device
CN113378913A (en) Semi-supervised node classification method based on self-supervised learning
CN110264372B (en) Topic community discovery method based on node representation
CN108052683B (en) Knowledge graph representation learning method based on cosine measurement rule
Xiao et al. Link prediction based on feature representation and fusion
CN113255895A (en) Graph neural network representation learning-based structure graph alignment method and multi-graph joint data mining method
CN109960755B (en) User privacy protection method based on dynamic iteration fast gradient
Huang et al. Learning social image embedding with deep multimodal attention networks
CN115858812A (en) Embedded alignment method constructed by computer
CN112256870A (en) Attribute network representation learning method based on self-adaptive random walk
CN112487110A (en) Overlapped community evolution analysis method and system based on network structure and node content
CN115577283A (en) Entity classification method and device, electronic equipment and storage medium
CN114596473A (en) Network embedding pre-training method based on graph neural network hierarchical loss function
Song et al. Spammer detection using graph-level classification model of graph neural network
Xie et al. L-BGNN: Layerwise trained bipartite graph neural networks
CN114842247B (en) Characteristic accumulation-based graph convolution network semi-supervised node classification method
CN116226547A (en) Incremental graph recommendation method based on stream data
Sun et al. Abnormal entity-aware knowledge graph completion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination