WO2023087303A1 - Method and apparatus for classifying nodes of a graph - Google Patents

Method and apparatus for classifying nodes of a graph

Info

Publication number
WO2023087303A1
Authority
WO
WIPO (PCT)
Prior art keywords
nodes
node
batch
graph
adjacency
Prior art date
Application number
PCT/CN2021/132082
Other languages
French (fr)
Inventor
Evgeny Kharlamov
Jie Tang
Wenzheng FENG
Original Assignee
Robert Bosch Gmbh
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Robert Bosch Gmbh, Tsinghua University filed Critical Robert Bosch Gmbh
Priority to PCT/CN2021/132082 priority Critical patent/WO2023087303A1/en
Publication of WO2023087303A1 publication Critical patent/WO2023087303A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282Rating or review of business operators or products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Definitions

  • aspects of the present disclosure relate generally to artificial intelligence, and more particularly, to performing a task of classifying nodes of a graph using a Graph Neural Network (GNN) model based on semi-supervised learning.
  • GNN Graph Neural Network
  • Graph data has been widely used in many real-world applications, such as social networks, financial systems, biological networks, citation networks, recommendation systems, etc.
  • Node classification is one of the most important tasks on graphs.
  • deep learning models for graphs, such as GNN models, have achieved good results in the task of node classification on graphs. Given a graph with labels associated with a subset of nodes, the GNN model may predict the labels for the rest of the nodes.
  • Despite the great success achieved by GNNs, there are two limitations in GNN-based semi-supervised learning methods. The first one is limited generalization. Conventional GNNs only use a supervised objective function for model training, which makes the GNN model prone to overfit the limited labeled samples, thereby degrading the prediction performance on unlabeled samples and leading to poor reliability of the GNN model. The other one is weak scalability: conventional GNNs adopt a full-graph training method with an expensive recursive feature propagation procedure, inducing enormous time and memory overhead when processing large graphs.
  • a computer-implemented method for training a Graph Neural Network (GNN) model to perform a task of classifying nodes of a graph based on semi-supervised learning comprises: sampling a batch of labeled nodes and a batch of unlabeled nodes from the nodes of the graph, wherein the graph comprises nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes of the graph being represented by a corresponding feature vector of the feature matrix; obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node; obtaining a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented feature vectors of the node to the GNN model; obtaining a loss based on the classification predictions of the nodes in the batch of labeled nodes and the batch of unlabeled nodes; and updating parameters of the GNN model based on the loss.
  • a computer-implemented method for classifying a node of a graph comprises: training a Graph Neural Network (GNN) model by using the method as mentioned above as well as the method according to aspects of the disclosure; predicting a classification label for the node of the graph by applying the feature matrix of the graph to the trained GNN model.
  • GNN Graph Neural Network
  • a computer-implemented method for training a Graph Neural Network (GNN) model to perform a task of classifying accounts on a social network or a financial network comprises: training the GNN model for classifying the accounts on the social network or the financial network using the method as mentioned above as well as the method according to aspects of the disclosure, wherein the nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
  • GNN Graph Neural Network
  • a computer-implemented method for classifying an account on a social network or a financial network comprises: training a GNN model for classifying the accounts on the social network or the financial network using the method as mentioned above as well as the method according to aspects of the disclosure; and predicting a classification label for the account on the social network or the financial network with the trained GNN model, wherein the nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
  • a computer system which comprises one or more processors and one or more storage devices storing computer-executable instructions that, when executed, cause the one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.
  • there provides one or more computer readable storage media storing computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.
  • a computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.
  • each node in the batch is enabled to be insensitive to specific neighborhoods, increasing the reliability of the classification prediction of the model framework while decreasing computational resource requirements such as time and memory cost.
  • Fig. 1 illustrates an exemplary GCN model according to an embodiment.
  • Fig. 2 illustrates an exemplary schematic diagram for training a GNN model according to an embodiment.
  • Fig. 3 illustrates an exemplary process for training a GNN model to perform a task of classifying nodes of a graph based on semi-supervised learning according to an embodiment.
  • Fig. 4 illustrates an exemplary process for classifying a node of a graph according to an embodiment.
  • Fig. 5 illustrates an exemplary computing system according to an embodiment.
  • the present disclosure describes a method and a system, implemented as computer programs executed on one or more computers, which are used to train a GNN model to perform a task of classifying nodes of a graph or to classify a node of a graph.
  • the GNN model may be implemented as a graph convolution network (GCN) model, and may perform a machine learning task of classifying nodes in a graph, which may for example represent a social network, biological network, citation network, recommendation system, financial system, etc.
  • GCN graph convolution network
  • the aspects of the disclosure may be applied in these fields such as the social network, biological network, citation network, recommendation system, financial system and so on to improve the security and robustness of these systems.
  • Fig. 1 illustrates an exemplary GCN model 10 according to an embodiment.
  • a graph is fed as input 110 of the GCN model 10.
  • the graph may be a dataset that contains nodes and edges.
  • the nodes in the graphs may represent entities, and the edges represent the connections between nodes.
  • a social network is a graph in which users or particularly user accounts in the network are nodes in the graph.
  • An edge exists when two users are connected in some way.
  • the two users are friends, share one’s posts, have similar interests, have similar profiles, or the like, then the two users may have a connection which is represented by the edge.
  • a financial network is a graph in which users or particularly user accounts in the network are nodes in the graph.
  • An edge exists when two users are connected in some way. For example, the two users have remittance transfer relation, employee relation, similar deposit, similar investment preference, similar profiles, or the like, then the two users may have a connection which is represented by the edge.
  • the graph G with added self-loop connections may be denoted as G̃, which may be referred to as the self-loop augmented graph.
  • the adjacency matrix of the graph G̃ is Ã = A + I and the degree matrix of the graph G̃ is D̃ = D + I, where I denotes the unit matrix. X ∈ R^(|V|×h_0) represents the feature matrix of the graph G, where |V| stands for the number of nodes and h_0 stands for the dimension of the node features.
  • the feature of a node may include multiple feature components, the number of which is defined as the dimension of the node feature.
  • the feature components of a node may include age, gender, hobby, career, various actions such as shopping, reading, listening to music, and so on. It is appreciated that aspects of the disclosure are not limited to specific values of the elements of the adjacency matrix and the feature matrix.
  • a label matrix Y ∈ {0, 1}^(|V|×C) may denote the labels of the nodes of the graph, where C stands for the dimension of the classification or the number of classification labels. Therefore, each node s of a graph is associated with a feature vector X_s and a label vector Y_s ∈ {0, 1}^C.
  • for semi-supervised classification, a limited number of nodes L ⊂ V (0 < |L| << |V|) have their observed labels Y_L, and the remaining nodes U = V - L do not have the observed labels.
  • the objective of semi-supervised learning is to infer the missing labels Y U for the unlabeled nodes U based on graph structure G, node features X and the observed labels Y L .
  • for a matrix M, M_i denotes its i-th row vector and M (i, j) denotes the element of the i-th row and the j-th column.
  • the GCN model 10 is an exemplary GNN and may be used to learn and implement the predictive function.
  • the GCN model may include one or multiple hidden layers 120, which are also referred to as graph convolutional layers 120.
  • Each hidden layer 120 may receive and process a graph-structured data.
  • the hidden layer 120 may perform convolution operation on the data.
  • the weights of the convolution operations in the hidden layer 120 may be trained with training data. It is appreciated that other operations may be included in the hidden layer 120 in addition to the convolution operation.
  • Each activation engine 130 may apply an activation function (e.g., ReLU) to the output from a hidden layer 120 and send the output to the next hidden layer 120.
  • ReLU activation function
  • a fully-connected layer or a softmax engine 140 may provide an output 150 based on the output of the previous hidden layer.
  • the output 150 of the GCN model 10 may be classification labels or particularly classification probabilities for nodes in the graph.
  • the node classification task of the GCN model 10 is to determine the classification labels of nodes of the graph based on their neighbors. Particularly, given a subset of labeled nodes in the graph, the goal of the classification task of the GCN model 10 is to predict the labels of the remaining unlabeled nodes in the graph.
  • Fig. 2 illustrates an exemplary schematic diagram 20 for training a GNN model according to an embodiment.
  • the graph 210 includes nodes represented by a feature matrix X and edges represented by an adjacency matrix A.
  • the values shown in the graph are examples of features of the nodes. It is appreciated that the disclosure is not limited to the specific node features or feature values. For the sake of illustration, seven nodes are included in the graph. It is appreciated that a graph G may include many more nodes in practical applications.
  • in the example graph G, there are two labeled nodes 1 and 6 and five unlabeled nodes 2-5 and 7.
  • the task is to train the GNN model 250 by means of semi-supervised learning so that the trained GNN model 250 can predict the classification labels of the unlabeled nodes 2-5 and 7.
  • a propagation matrix may be used to propagate the features of the nodes of the graph.
  • the feature propagation for the features of the nodes may be performed based on the adjacency matrix A or the self-loop augmented adjacency matrix Ã.
  • a mixed-order propagation may be employed to perform the feature propagation in order to exploit and incorporate more local information of the adjacency relations among nodes, reducing the risk of over-smoothing.
  • a mixed-order adjacency matrix Π may be utilized as shown in equation (1): Π = Σ_(t=0)^T w_t · P^t   (1), where P^t denotes the t-order normalized adjacency matrix of the self-loop augmented graph G̃ and w_t denotes the tunable weight of order t.
  • the t-order normalized adjacency matrix P^t is also the t-order random walk reverse transition matrix of the graph G̃, where the element P^t (s, v) denotes the probability that a t-step random walk goes from source node s to target node v.
  • the generalized mixed-order matrix Π uses a set of tunable weights {w_t | 0 ≤ t ≤ T} to combine the different orders of neighborhood information.
  • the framework of the embodiment illustrated in Fig. 2 can flexibly manipulate the importance of different orders of neighborhoods to suit diverse graphs in the real world.
  • the weight w_t may be set as α(1-α)^t, where α is a decay factor, and accordingly equation (1) becomes Π_ppr = Σ_(t=0)^T α(1-α)^t · P^t, which may be referred to as a truncated personalized page-rank (PPR) matrix.
  • PPR personalized page-rank
  • the weight w_t may be set as 1/(T + 1) and accordingly equation (1) becomes Π_avg = (1/(T + 1)) · Σ_(t=0)^T P^t, which may be referred to as an average pooling matrix.
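  • as an illustration (a minimal sketch that is not part of the original disclosure; T and alpha are assumed hyperparameters), the two weight settings above may be computed as follows:

        import numpy as np

        def ppr_weights(T, alpha):
            # truncated personalized page-rank weights: w_t = alpha * (1 - alpha)^t
            return np.array([alpha * (1.0 - alpha) ** t for t in range(T + 1)])

        def avg_pool_weights(T):
            # average pooling weights: w_t = 1 / (T + 1)
            return np.full(T + 1, 1.0 / (T + 1))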
  • calculating the mixed-order adjacency matrix as shown in equation (1) may be computationally inefficient because a large amount of memory capacity, computing capacity and computing time is required.
  • the size of the graph may influence the requirements on computing resources such as memory capacity, computing capacity and so on; therefore a computing-resource problem may arise for large graphs.
  • an efficient push-flow method may be employed to generate an error-bounded approximation for each row vector π_s of the mixed-order adjacency matrix Π.
  • the push-flow method may be referred to as a Generalized Forward Push (GFPush) method, which has the ability to approximate the generalized mixed-order random walk transition vector π_s of a node s.
  • GFPush Generalized Forward Push
  • the core idea of GFPush is to simulate a T-step random walk probability diffusion process from node s with a series of pruning operations on probability masses.
  • a pair of vectors at each step t (0 ≤ t ≤ T) is maintained.
  • one of the pair of vectors may be a reserve vector r^(t) denoting the probability masses reserved at step t.
  • the other one of the pair of vectors may be a residue vector q^(t) representing the probability masses to be diffused beyond step t.
  • Table 1 shows the pseudo-code of the GFPush method.
  • r^(0) and q^(0) are set to the indicator vector e^(s), where e^(s) (s) = 1 and e^(s) (v) = 0 for v ≠ s, representing that the random walk starts from s with a probability mass of 1.
  • other reserve and residue vectors (r^(t) and q^(t), t ≥ 1) are set to zero vectors. Then the method repeats multiple iterations from step 0 to step T-1.
  • at step t, the GFPush method conducts a push operation (Lines 5–9 of the pseudo-code) on each node v that satisfies q^(t) (v) > d_v · r_max, where d_v is the degree of node v in the self-loop augmented graph G̃ and r_max is a predefined threshold.
  • in the procedure of the GFPush method, 1/d_v is the conditional probability that a random walk moves from node v to a neighboring node u, conditioned on it reaching v with probability q^(t) (v) at step t.
  • the push operation on node v can be seen as a one-step random walk probability diffusion from v to its one-hop neighbor nodes.
  • the GFPush method may conduct push operations only for nodes v whose residue value is greater than d_v · r_max. It is appreciated that the GFPush method may also conduct push operations for every node v in the graph, that is, the node v's residue value is not limited to be greater than d_v · r_max.
  • after the last iteration, the GFPush method returns π̂_s = Σ_(t=0)^T w_t · r^(t) as the result, which is an approximation of the mixed-order adjacency vector π_s for node s.
  • the mixed-order adjacency vector π̂_s of each node s may be obtained using the GFPush method, and the set of the mixed-order adjacency vectors for the set of nodes may be used as an approximation of the mixed-order adjacency matrix Π, and thus may be referred to as the approximated mixed-order adjacency matrix Π̂.
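  • the following Python sketch (an illustration under assumed data structures, not the pseudo-code of Table 1) shows how the GFPush approximation of the mixed-order adjacency vector of a node s could be implemented, given the neighbor lists of the self-loop augmented graph, the order weights w_t and the threshold r_max:

        def gfpush(s, neighbors, weights, r_max):
            # neighbors: dict mapping each node to the list of its neighbors in the
            #            self-loop augmented graph (assumed input format)
            # weights:   order weights w_0, ..., w_T of the mixed-order matrix
            # r_max:     predefined residue threshold controlling the approximation
            T = len(weights) - 1
            reserve = [dict() for _ in range(T + 1)]   # r^(t): reserved probability masses
            residue = [dict() for _ in range(T + 1)]   # q^(t): residue probability masses
            reserve[0][s] = 1.0                        # random walk starts at s ...
            residue[0][s] = 1.0                        # ... with a probability mass of 1
            for t in range(T):
                for v, mass in list(residue[t].items()):
                    d_v = len(neighbors[v])
                    if mass <= d_v * r_max:            # prune residues below the threshold
                        continue
                    share = mass / d_v                 # one-step diffusion to each neighbor
                    for u in neighbors[v]:
                        residue[t + 1][u] = residue[t + 1].get(u, 0.0) + share
                        reserve[t + 1][u] = reserve[t + 1].get(u, 0.0) + share
                    residue[t][v] = 0.0                # mass of v has been pushed onward
            # weighted sum of the reserved masses approximates the mixed-order vector
            pi_s = dict()
            for t in range(T + 1):
                for v, mass in reserve[t].items():
                    pi_s[v] = pi_s.get(v, 0.0) + weights[t] * mass
            return pi_s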
  • a top-k sparsification may be performed for the mixed-order adjacency vector of each node s.
  • the resultant sparsified mixed-order adjacency vector has at most k non-zero elements.
  • according to the theory of the escaping mass of lazy random walks, which says that the probability mass of a T-hop lazy random walk starting from node s will concentrate around a local cluster of node s, the sparsified vector is still expected to be effective for feature propagation because it preserves most of the local neighborhood nodes of node s.
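  • a small illustrative helper (assumed, not from the disclosure) for the top-k sparsification of such an approximated adjacency vector could be:

        def topk_sparsify(pi_s, k):
            # keep only the k largest entries of the approximated adjacency vector
            top = sorted(pi_s.items(), key=lambda kv: kv[1], reverse=True)[:k]
            return dict(top)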
  • the mixed-order adjacency vectors for different nodes may be calculated in parallel, for example, by using multi-thread programming, and thus the computing time of the mixed-order adjacency matrix for a set of nodes may be reduced.
  • the graph 210 includes a labeled node set L and an unlabeled node set U.
  • the labeled node set L includes nodes 1 and 6 and the unlabeled node set U includes nodes 2-5 and 7.
  • a subset of nodes may be sampled from the graph 210, and the subset of nodes may include at least a part of the labeled nodes and at least a part of the unlabeled nodes. As the number of labeled nodes in a graph is typically limited, all the labeled nodes of the graph are typically included in the subset. A part of the unlabeled nodes of the graph, especially of a large graph, are sampled and included in the subset.
  • the subset of nodes may include the labeled node set L and an unlabeled node subset U′.
  • the unlabeled node subset U′ may include 10000 unlabeled nodes in practical application. It is appreciated that the disclosure is not limited to the specific number of nodes in the unlabeled node subset U′.
  • the unlabeled node subset U′ includes the sampled nodes 3 and 7.
  • the subset of nodes of graph 210 includes nodes 1 and 6 in the labeled node set L and nodes 3 and 7 in the unlabeled node subset U′.
  • the mixed-order adjacency vector of each node s in the subset of nodes 1, 3, 6 and 7 may be obtained using the GFPush method, and accordingly a mixed-order adjacency matrix 220 may be obtained for the subset of nodes 1, 3, 6 and 7.
  • the top-k sparsification may be performed on the mixed-order adjacency vector of each node s in the subset of nodes 1, 3, 6 and 7 so as to obtain sparsified mixed-order adjacency vector of each node s, and accordingly a sparsified mixed-order adjacency matrix 230 may be obtained for the subset of nodes 1, 3, 6 and 7.
  • the mixed-order adjacency matrix 220 and the sparsified mixed-order adjacency matrix 230 are sub-matrices compared to the adjacency matrix A since they only include adjacency vectors of a part of the nodes of the graph.
  • the value k is set to 3 in the illustrated example.
  • since the GFPush method has the ability to generate the mixed-order sub-matrix for only a part of the nodes of the graph, the efficiency of generating the propagation matrix can be improved significantly, especially for a large graph including a large number of nodes.
  • the sparsified sub-matrix 230 may be used to perform random propagation of node features in a mini-batch manner.
  • in a training step or training iteration, which may be generally denoted as the n-th training step or iteration, a batch of labeled nodes L_n may be sampled with batch size b_l, and a batch of unlabeled nodes U_n may be sampled with batch size b_u.
  • the batch of labeled nodes L_n includes node 1 and the batch of unlabeled nodes U_n includes node 7.
  • the augmented feature vector X̄_s of node s ∈ L_n ∪ U_n is calculated via equation (2): X̄_s = Σ_v π̂_s (v) · z_v · X_v   (2), where π̂_s (v) is the element of the sparsified mixed-order adjacency vector of node s corresponding to neighboring node v, z_v is a random drop mask element, and X_v is the feature vector of node v.
  • M augmented feature vectors are generated for node s by repeating this procedure shown in equation (2) for M times.
  • M augmented feature vectors 240-1 to 240-M for each of nodes 1 and 7 in the batch are obtained.
  • the time complexity and memory complexity of each batch are both bounded by O(k · h_0 · (b_l + b_u)), which is independent of the graph size.
  • a drop-node method is used to perform the random propagation of node feature. Specifically, some nodes’ entire feature vectors are dropped or removed from the feature vectors of the neighboring nodes of node s by randomly setting entire feature vectors of some of the neighboring nodes to 0.
  • the element z_i in the drop-node mask z takes the value 0 with a probability equal to the dropping rate δ and takes the value 1 with a probability of 1-δ, that is, z_i ~ Bernoulli(1-δ).
  • the augmented feature vectors of node s may be scaled with a factor of 1/(1-δ) so as to make the augmented feature vectors in expectation equal to the input feature vectors of the feature matrix X.
  • the drop-node method enables each node to aggregate information only from a subset of its neighbors by completely ignoring some neighboring nodes' features, which reduces its dependency on particular neighbors and thus helps increase the model's robustness against adversarial actions. It is appreciated that the neighbors of a node may include one-hop neighbors and multi-hop neighbors.
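  • a minimal sketch of the drop-node random propagation described above (illustrative only; the variable names and the dense feature-matrix format are assumptions) could look as follows:

        import numpy as np

        def random_propagate(pi_s, X, delta, rng):
            # pi_s:  dict mapping neighboring node v to the sparsified weight pi_s(v)
            # X:     feature matrix of the graph, shape (num_nodes, h0)
            # delta: dropping rate; each neighbor is dropped with probability delta
            x_bar = np.zeros(X.shape[1])
            for v, weight in pi_s.items():
                z_v = rng.binomial(1, 1.0 - delta)       # z_v ~ Bernoulli(1 - delta)
                # scale kept features by 1 / (1 - delta) so that the expectation equals
                # the propagation of the original, unperturbed features
                x_bar += weight * z_v * X[v] / (1.0 - delta)
            return x_bar

        # e.g., M augmentations for node s:
        # augmented = [random_propagate(pi_s, X, delta, np.random.default_rng(m))
        #              for m in range(M)]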
  • the perturbation of the feature vectors of neighboring nodes of a node s may be performed in other ways.
  • the perturbation of the feature vectors of neighboring nodes such as neighboring nodes 1, 2 and 4 of node 1 may be performed by using a dropout method.
  • the dropout method may perturb the feature vectors of neighboring nodes by randomly setting some elements of the feature vectors of neighboring nodes to 0.
  • the dropout method uses a drop mask in which each element z_ij takes the value 0 with a probability of δ and takes the value 1 with a probability of 1-δ.
  • the z_ij is obtained from a Bernoulli distribution, that is, z_ij ~ Bernoulli(1-δ).
  • the drop node mask vector shown in Fig. 2 is replaced with the dropout mask matrix for performing the random dropout of some elements of the feature vectors of neighboring nodes.
  • each feature vector X_v as shown in equation (2) and Fig. 2 may be transformed into a low-dimensional hidden vector H_v = X_v · W^(0) with a linear transformation layer before random feature propagation, where W^(0) is a learnable transformation matrix. Then the augmented feature vector of node s may be obtained by performing random propagation with the low-dimensional feature vectors H_v as shown in equation (3): X̄_s = Σ_v π̂_s (v) · z_v · H_v   (3).
  • each of the augmented feature vectors may be fed into the GNN model 250, such as the illustrated two-layer MLP model 250, to get respective outputs ỹ_s^(m), where 1 ≤ m ≤ M.
  • the output ỹ_s^(m) denotes the classification prediction probabilities of node s on the m-th augmented feature vector, and θ denotes the parameters of the model 250.
  • the dimension C of the illustrated outputs is one; it is appreciated that the dimension C of the output may be larger depending on the number of classifications to be predicted by the model 250. It is appreciated that the two-layer MLP model 250 is exemplary, and other numbers of layers of the MLP model are applicable.
  • a loss L 260 may be obtained based on the classification predictions ỹ_s^(m). Then the parameter weights of the GNN model 250 may be updated based on the loss L 260. For example, the weights of the GNN model 250 may be updated by back-propagating the loss L 260 through the GNN model 250.
  • the weights of the GNN 250 and the weights of learnable transformation matrix W (0) may be updated based on the loss L.
  • the loss function is a combination of the supervised loss on the labeled nodes and the graph regularization loss. Given the M data augmentations of each node generated through the random propagation 240, a consistency regularized loss may be employed for the semi-supervised learning.
  • the supervised loss of the graph node classification task is defined as the average cross-entropy loss over the M augmentations: L_sup = -(1/(M·|L_n|)) · Σ_(s∈L_n) Σ_(m=1)^M Y_s^T · log ỹ_s^(m)   (4).
  • the prediction consistency among M augmentations may be optimized for unlabeled data by using a consistency regularization loss.
  • for each unlabeled node s, the distribution center may be calculated by taking the average of its M prediction probabilities, i.e., ȳ_s = (1/M) · Σ_(m=1)^M ỹ_s^(m). Then a sharpening process may be performed over the average prediction probability to obtain a pseudo label for node s.
  • the probability on the j-th classification of node s is obtained via the sharpening operation: ȳ′_s (j) = ȳ_s (j)^(1/τ) / Σ_(c=1)^C ȳ_s (c)^(1/τ)   (5).
  • 0 < τ ≤ 1 is a hyperparameter to control the sharpness of the obtained pseudo label ȳ′_s. As the value of τ decreases, ȳ′_s is enforced to become sharper and eventually converges to a one-hot distribution.
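  • as an illustration of the averaging and sharpening steps (a sketch with assumed variable names), the pseudo label of an unlabeled node could be computed as:

        import numpy as np

        def sharpen(predictions, tau):
            # predictions: array of shape (M, C) with the M prediction probability vectors
            y_bar = np.mean(predictions, axis=0)    # distribution center over M augmentations
            powered = y_bar ** (1.0 / tau)
            return powered / powered.sum()          # sharpened pseudo-label distribution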
  • the consistency loss on the unlabeled node batch U_n may be obtained via: L_con = (1/(M·|U_n|)) · Σ_(s∈U_n) Σ_(m=1)^M D(ȳ′_s, ỹ_s^(m))   (6), where D(·, ·) is a distance function.
  • a confidence-aware consistency loss may be employed to further improve effectiveness of the consistency loss.
  • the confidence-aware consistency loss on the unlabeled node batch U_n may be obtained via: L_con = (1/(M·|U_n|)) · Σ_(s∈U_n) I(max_j ȳ_s (j) > γ) · Σ_(m=1)^M D(ȳ′_s, ỹ_s^(m))   (7), where I(·) is an indicator function and γ is a confidence threshold.
  • the distance function may be a function for calculating L2 distance.
  • the distance function may be a function for calculating the KL divergence. In the embodiment shown in equation (7), only highly confident unlabeled nodes whose maximum value of the prediction center ȳ_s exceeds γ are considered when obtaining the consistency loss. In this way, the potential training noise induced by uncertain pseudo-labels may be reduced and thus the effectiveness of the consistency loss may be improved.
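  • the following sketch combines the supervised cross-entropy loss and the confidence-aware consistency loss (illustrative only; the formulas are reconstructed from the description above and a squared L2 distance is assumed as the distance function):

        import numpy as np

        def supervised_loss(preds_labeled, labels):
            # preds_labeled: shape (M, b_l, C), prediction probabilities for the labeled batch
            # labels:        shape (b_l, C), one-hot classification labels
            eps = 1e-12
            return -np.mean(np.sum(labels * np.log(preds_labeled + eps), axis=-1))

        def consistency_loss(preds_unlabeled, tau, gamma):
            # preds_unlabeled: shape (M, b_u, C), prediction probabilities for the unlabeled batch
            # tau: sharpening temperature, gamma: confidence threshold
            y_bar = preds_unlabeled.mean(axis=0)                      # (b_u, C) prediction centers
            sharpened = y_bar ** (1.0 / tau)
            sharpened /= sharpened.sum(axis=-1, keepdims=True)        # pseudo labels
            confident = (y_bar.max(axis=-1) > gamma).astype(float)    # keep only confident nodes
            dist = ((preds_unlabeled - sharpened) ** 2).sum(axis=-1)  # squared L2 distance, (M, b_u)
            return np.mean(confident * dist)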
  • the final loss for model optimization may be obtained based on the supervised classification loss and the consistency regularization loss: L = L_sup + λ · L_con   (8).
  • λ is a weight that controls the balance of the two losses.
  • the parameters θ of the GNN model 250 may be updated by gradient descent: θ ← θ - η · ∇_θ L, where η denotes the learning rate.
  • a dynamic scheduling strategy may be used to determine the value of λ in equation (8).
  • λ is obtained via a linear warm-up schedule: λ linearly increases from λ_min to λ_max during the first training steps and remains constant in the following training steps.
  • the weight λ may be limited to a small value in the early stage of training when the generated pseudo-labels are not very reliable, which may help the model converge.
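  • a simple linear warm-up (an assumed concrete form of the dynamic scheduling strategy described above) could be:

        def dynamic_lambda(step, warmup_steps, lam_min, lam_max):
            # linearly increase lambda from lam_min to lam_max, then keep it constant
            if step >= warmup_steps:
                return lam_max
            return lam_min + (lam_max - lam_min) * step / warmup_steps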
  • in the inference stage, predictions for all unlabeled nodes of the graph may be inferred at the same time by applying the propagated feature matrix of the graph to the trained model.
  • the feature matrix X is rescaled with (1 - ⁇ ) so that the rescaled features are identical with the expectation of the perturbed features with dropping rate ⁇ in the model training. Unlike training, the above process only needs to be performed once for each node of the graph during model inference, and the computational cost is acceptable in practice. In another embodiment, the rescale operation with (1 - ⁇ ) may not be performed if the corresponding scaling operation has been performed during training.
  • the process of training the GNN model for performing a classification task on graph data and obtaining the classification prediction for the nodes of the graph may be illustrated as the following pseudocode in table 2.
  • although the drop-node method is illustrated in Fig. 2 as the specific method for performing the perturbation of the feature vectors of the neighboring nodes, other perturbation methods such as dropout, random perturbing methods and so on may be employed.
  • the MLP model 250 may be employed as the GNN model, and other GNN models such as the GCN, the GAT and so on may also be employed as the GNN model.
  • the supervised classification loss via equation (4) and the consistency regularization loss via equation (6) or (7) are employed as the loss function.
  • the specific methods for calculating the supervised classification loss and the consistency regularization loss are not limited to the specific equations.
  • the loss function is not limited to the combination of the supervised classification loss and the consistency regularization loss; at least one of them may be employed as the loss function under different circumstances.
  • Fig. 3 illustrates an exemplary process 30 for training a Graph Neural Network (GNN) model to perform a task of classifying nodes of a graph based on semi-supervised learning according to an embodiment.
  • GNN Graph Neural Network
  • the exemplary process 30 may be implemented by a computer, in which programs are executed by one or more processors to perform the process 30.
  • a batch of labeled nodes and a batch of unlabeled nodes are sampled from the nodes of the graph.
  • the graph comprises the nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes being represented by a corresponding feature vector of the feature matrix. It is appreciated that the batch is a subset of the nodes of the graph, and the number of nodes in the batch is far less than the number of nodes of the graph in practical applications.
  • a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node.
  • the neighboring nodes of the node are indicated by the adjacency vector of the node.
  • a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by respectively applying the plurality of augmented feature vectors of the node to the GNN model.
  • a loss is obtained based on the classification predictions of the respective nodes in the batch of labeled nodes and the batch of unlabeled nodes.
  • parameters of the GNN model are updated based on the loss.
  • a mixed-order adjacency matrix corresponding to a plurality of adjacency orders is generated for at least a part of the nodes of the graph based on the adjacency matrix of the graph.
  • the adjacency vector of each node in the batch of labeled nodes and the batch of unlabeled nodes is a corresponding mixed-order adjacency vector of the mixed-order adjacency matrix.
  • the mixed-order adjacency matrix corresponding to a plurality of adjacency orders is generated for a part or subset of the nodes of the graph based on the adjacency matrix of the graph.
  • the adjacency vector of each node in the batch of labeled nodes and the batch of unlabeled nodes is a corresponding mixed-order adjacency vector of the mixed-order adjacency matrix.
  • the mixed-order adjacency matrix is actually a sub-matrix for the sampled nodes, which is generated by using the above mentioned GFPush method.
  • the mixed-order adjacency matrix may be generated by performing a weighted sum of a plurality of per-order adjacency matrices corresponding to the plurality of adjacency orders, wherein one of the plurality of per-order adjacency matrices corresponds to an adjacency order t, and an element in the per-order adjacency matrix represents a probability that a t-step random walk goes from a source node associated with the element to a target node associated with the element.
  • the mixed-order adjacency matrix may be generated by generating a mixed-order adjacency vector for each of the at least a part of the nodes of the graph by iteratively performing one-step random walk based on the adjacency matrix of the graph.
  • the generated respective mixed-order adjacency vectors for the at least a part of nodes constitute the mixed-order adjacency matrix.
  • the batch of labeled nodes and the batch of unlabeled nodes are sampled from a subset of the nodes of the graph.
  • a mixed-order adjacency vector for each of the subset of nodes may be generated by iteratively performing one-step random walk based on the adjacency matrix of the graph.
  • the generated respective mixed-order adjacency vectors for the subset of nodes constitute the mixed-order adjacency matrix.
  • the mixed-order adjacency vector for each of the at least a part of the nodes of the graph may be generated by iteratively, from the lowest to the highest of the plurality of adjacency orders, calculating a plurality of reserved probability mass vectors and a plurality of residue probability mass vectors corresponding to the plurality of adjacency orders for the node, and generating the mixed-order adjacency vector for the node by performing weighted sum of the plurality of reserved probability mass vectors for the node.
  • the calculating of the plurality of reserved probability mass vectors and the plurality of residue probability mass vectors is performed based on an adjacency degree matrix of the graph and/or a predefined threshold.
  • the generating of the mixed-order adjacency vector for each of the at least a part of the nodes of the graph may further comprise: retaining a predefined number of largest elements of the mixed-order adjacency vector of the node while setting the other elements of the mixed-order adjacency vector to zero.
  • the weighted sum of the plurality of reserved probability mass vectors for the node may be performed by applying weights α(1-α)^t to the plurality of reserved probability mass vectors, wherein α is a decay factor and t is one of the plurality of adjacency orders; or by using an averaged weight corresponding to the plurality of adjacency orders; or by using a single-order weight in which one of the weights is set to 1 and the others are set to 0.
  • a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by randomly propagating feature vectors of neighboring nodes of the node based on the adjacency vector of the node and a dropping mask.
  • the dropping mask is configured to randomly drop out at least partial features of at least a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes.
  • the dropping mask is configured to randomly drop out entire features of a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes.
  • the dropping mask is configured to randomly drop out the entire features of a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes based on a dropping rate, and the features of the remaining neighboring nodes are scaled up based on the dropping rate.
  • the loss comprises a supervised classification loss and a consistency regularization loss.
  • the supervised classification loss is obtained based on classification labels of the batch of labeled nodes
  • the consistency regularization loss is obtained based on classification predictions of the batch of unlabeled nodes.
  • the supervised classification loss is a cross-entropy loss based on the classification labels and the classification predictions corresponding to the batch of labeled nodes
  • the consistency regularization loss is a distance loss based on the classification predictions of the batch of unlabeled nodes.
  • the consistency regularization loss is set to zero if the maximum of the classification predictions of the batch of unlabeled nodes is less than a threshold or if the maximum of respective averaged classification predictions of the batch of unlabeled nodes is less than a threshold.
  • the method 30 is performed repetitively in each of a plurality of training steps, which may be referred to as training loops or training iterations, wherein the loss is obtained by performing weighted sum of the supervised classification loss and a consistency regularization loss with dynamic weights in each of the plurality of training steps.
  • the feature vectors of neighboring nodes of the node are transformed with a linear transformation matrix to respective lower-dimension feature vectors having a lower dimension than the feature vectors of the neighboring nodes of the node.
  • the plurality of augmented lower-dimension feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by randomly propagating the lower-dimension feature vectors of neighboring nodes of the node
  • the plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by respectively applying the plurality of augmented lower-dimension feature vectors of the node to the GNN model
  • parameters of the GNN model and parameters of the linear transformation matrix are updated based on the loss.
  • the GNN model comprises one of a Multilayer Perceptron (MLP) model, a Graph Convolutional Network (GCN) model, and a Graph Attention Network (GAT) model.
  • MLP Multilayer Perceptron
  • GCN Graph Convolutional Network
  • GAT Graph Attention Network
  • the process 30 illustrated in Fig. 3 may be implemented as a method for training a Graph Neural Network (GNN) model to perform a task of classifying accounts on a social network or a financial network.
  • GNN Graph Neural Network
  • the GNN model for classifying the accounts on the social network or the financial network may be trained using the method as described above with reference to Figs. 1-3, wherein the nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
  • Fig. 4 illustrates an exemplary process 40 for classifying a node of a graph according to an embodiment. It is appreciated that the exemplary process 40 may be implemented by a computer, in which programs are executed by one or more processors to perform the process 40.
  • a Graph Neural Network (GNN) model may be trained by using the method as described above with reference to Figs. 1-3.
  • a classification label for the node of the graph may be predicted by applying the feature matrix of the graph to the trained GNN model.
  • the method 40 may also include raising an alarm if the label of the node is a predefined label.
  • the predefined label of the node may be a label indicating that the node is an adversarial node, a fraud node or the like, such as an adversarial node in a social network, a financial network or the like.
  • the process 40 illustrated in Fig. 4 may be implemented as a method for classifying an account on a social network or a financial network.
  • the GNN model for classifying the accounts on the social network or the financial network may be trained using the method as described above with reference to Figs. 1-3, and a classification label for the account on the social network or the financial network may be predicted with the trained GNN model.
  • the nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
  • Fig. 5 illustrates an exemplary computing system 50 according to an embodiment.
  • the computing system 50 may comprise at least one processor 510.
  • the computing system 50 may further comprise at least one storage device 520.
  • the storage device 520 may store computer-executable instructions that, when executed, cause the processor 510 to: sample a batch of labeled nodes and a batch of unlabeled nodes from the nodes of the graph, wherein the graph comprises the nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes being represented by a corresponding feature vector of the feature matrix; obtain a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node, wherein the neighboring nodes of the node are indicated by the adjacency vector of the node; obtain a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented feature vectors of the node to the GNN model; obtain a loss based on the classification predictions of the nodes in the batch of labeled nodes and the batch of unlabeled nodes; and update parameters of the GNN model based on the loss.
  • the storage device 520 may store computer-executable instructions that, when executed, cause the processor 510 to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-4.
  • the embodiments of the present disclosure may be embodied in a computer-readable medium such as non-transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-4.
  • the embodiments of the present disclosure may be embodied in a computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-4.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Technology Law (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method for training a Graph Neural Network (GNN) model to perform a task of classifying nodes of a graph based on semi-supervised learning. The method comprises: sampling a batch of labeled nodes and a batch of unlabeled nodes from the nodes of the graph, wherein the graph comprises nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes of the graph being represented by a corresponding feature vector of the feature matrix; obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node; obtaining a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented feature vectors of the node to the GNN model; obtaining a loss based on the classification predictions of the nodes in the batch of labeled nodes and the batch of unlabeled nodes; and updating parameters of the GNN model based on the loss.

Description

METHOD AND APPARATUS FOR CLASSIFYING NODES OF A GRAPH
FIELD
Aspects of the present disclosure relate generally to artificial intelligence, and more particularly, to performing a task of classifying nodes of a graph using a Graph Neural Network (GNN) model based on semi-supervised learning.
BACKGROUND
Graph data has been widely used in many real-world applications, such as social networks, financial systems, biological networks, citation networks, recommendation systems, etc. Node classification is one of the most important tasks on graphs. Deep learning models for graphs, such as GNN models, have achieved good results in the task of node classification on graphs. Given a graph with labels associated with a subset of nodes, the GNN model may predict the labels for the rest of the nodes.
Despite the great success achieved by GNNs, there are two limitations in GNN-based semi-supervised learning methods. The first one is limited generalization. Conventional GNNs only use a supervised objective function for model training, which makes the GNN model prone to overfit the limited labeled samples, thereby degrading the prediction performance on unlabeled samples and leading to poor reliability of the GNN model. The other one is weak scalability: conventional GNNs adopt a full-graph training method with an expensive recursive feature propagation procedure, inducing enormous time and memory overhead when processing large graphs.
Enhancements are needed to improve the generalization and scalability of the GNN model.
SUMMARY
In order to address problems of semi-supervised learning on graphs such as limited generalization and weak scalability, a novel GNN framework is proposed in this disclosure in an effort to alleviate the two limitations of generalization and scalability simultaneously.
According to an embodiment, there provides a computer-implemented method for training a Graph Neural Network (GNN) model to perform a task of classifying nodes of a graph based on semi-supervised learning. The method comprises: sampling a batch of labeled nodes and a batch of unlabeled nodes from the nodes of the graph, wherein the graph comprises nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes of the graph being represented by a corresponding feature vector of the feature matrix; obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node; obtaining a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented feature vectors of the node to the GNN model; obtaining a loss based on the classification predictions of the nodes in the batch of labeled nodes and the batch of unlabeled nodes; and updating parameters of the GNN model based on the loss.
According to an embodiment, there provides a computer-implemented method for classifying a node of a graph. The method comprises: training a Graph Neural Network (GNN) model by using the method as mentioned above as well as the method according to aspects of the disclosure; predicting a classification label for the node of the graph by applying the feature matrix of the graph to the trained GNN model.
According to an embodiment, there provides a computer-implemented method for training a Graph Neural Network (GNN) model to perform a task of classifying accounts on a social network or a financial network. The method comprises: training the GNN model for classifying the accounts on the social network or the financial network using the method as mentioned above as well as the method according to aspects of the disclosure, wherein the nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
According to an embodiment, there provides a computer-implemented method for classifying an account on a social network or a financial network. The method comprises: training a GNN model for classifying the accounts on the social network or the financial network using the method as mentioned above as well as the method according to aspects of the disclosure; and predicting a classification label for the account on the social network or the financial network with the trained GNN model, wherein the nodes represent the accounts, the edges represent the relation among the accounts, and the graph represents the social network or the financial network.
According to an embodiment, there provides a computer system, which comprises one or more processors and one or more storage devices storing computer-executable instructions that, when executed, cause the one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.
According to an embodiment, there provides one or more computer readable storage media storing computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.
According to an embodiment, there provides a computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.
By utilizing the batch-based random propagation of the node features, each node in the batch is enabled to be insensitive to specific neighborhoods, increasing the reliability of the classification prediction of the model framework while decreasing computational resource requirements such as time and memory cost. Other advantages and enhancements are explained in the description hereafter.
BRIEF DESCRIPTION OF THE DRAWINGS
The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
Fig. 1 illustrates an exemplary GCN model according to an embodiment.
Fig. 2 illustrates an exemplary schematic diagram for training a GNN model according to an embodiment.
Fig. 3 illustrates an exemplary process for training a GNN model to perform a task of classifying nodes of a graph based on semi-supervised learning according to an embodiment.
Fig. 4 illustrates an exemplary process for classifying a node of a graph according to an embodiment.
Fig. 5 illustrates an exemplary computing system according to an embodiment.
DETAILED DESCRIPTION
The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
Various embodiments will be described in detail with reference to the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. References made to particular examples and embodiments are for illustrative purposes, and are not intended to limit the scope of the disclosure.
The present disclosure describes a method and a system, implemented as computer programs executed on one or more computers, which are used to train a GNN model to perform a task of classifying nodes of a graph or to classify a node of a graph. As an example, the GNN model may be implemented as a graph convolution network (GCN) model, and may perform a machine learning task of classifying nodes in a graph, which may for example represent a social network, biological network, citation network, recommendation system, financial system, etc. The aspects of the disclosure may be applied in fields such as social networks, biological networks, citation networks, recommendation systems, financial systems and so on to improve the security and robustness of these systems.
Fig. 1 illustrates an exemplary GCN model 10 according to an embodiment.
A graph is fed as input 110 of the GCN model 10. The graph may be a dataset that contains nodes and edges. The nodes in the graphs may represent entities, and the edges represent the connections between nodes. For example, a social network is a graph in which users or particularly user accounts in the network are nodes in the graph. An edge exists when two users are connected in some way. For example, the two users are friends, share one’s posts, have similar interests, have similar profiles, or the like, then the two users may have a connection which is represented by the edge. For example, a financial network is a graph in which users or particularly user accounts in the network are nodes in the graph. An edge exists when two users are connected in some way. For example, the two users have remittance transfer relation, employee relation, similar deposit, similar investment preference, similar profiles, or the like, then the two users may have a connection which is represented by the edge.
A graph may be denoted as G = (V, E), where V is a set of |V| nodes, v ∈ V refers to a data sample corresponding to a node, and E ⊆ V × V is a set of |E| edges between nodes. In an example, the graph as the input 110 may be formulated as G = (A, X), where A ∈ {0, 1}^(|V|×|V|) represents the adjacency matrix of the graph G, with each element A (s, v) = 1 indicating that there exists an edge between nodes s and v and A (s, v) = 0 indicating that there is no edge between nodes s and v. D is the diagonal degree matrix where D (s, s) = ∑_v A (s, v), representing the number of connections the node s has in the graph. The graph G with added self-loop connections may be denoted as G̃, which may be referred to as the self-loop augmented graph. The adjacency matrix of the graph G̃ is Ã = A + I and the degree matrix of the graph G̃ is D̃ = D + I, where I denotes the unit matrix. X ∈ R^(|V|×h_0) represents the feature matrix of the graph G, where |V| stands for the number of nodes of the graph G and h_0 stands for the dimension of the feature of the nodes. Therefore, the adjacency matrix A or Ã may represent the connections among the nodes in the graph G or G̃, and the feature matrix X may represent the features of the respective nodes in the graph. The feature of a node may include multiple feature components, the number of which is defined as the dimension of the node feature. For example, for a graph of a social network, the feature components of a node may include age, gender, hobby, career, various actions such as shopping, reading, listening to music, and so on. It is appreciated that aspects of the disclosure are not limited to specific values of the elements of the adjacency matrix and the feature matrix.
A label matrix Y ∈ {0, 1}^(|V|×C) may denote the labels of the nodes of the graph, where C stands for the dimension of the classification or the number of classification labels. Therefore, each node s of a graph is associated with a feature vector X_s and a label vector Y_s ∈ {0, 1}^C. For semi-supervised classification, a limited number of nodes L ⊂ V (0 < |L| << |V|) have their observed labels Y_L, and the remaining nodes U = V - L do not have the observed labels. The objective of semi-supervised learning is to infer the missing labels Y_U for the unlabeled nodes U based on the graph structure G, the node features X and the observed labels Y_L. In the disclosure, for the sake of description, for a matrix M, M_i denotes its i-th row vector and M (i, j) denotes the element of the i-th row and the j-th column.
Graph neural networks (GNNs) have emerged as a powerful approach for semi-supervised graph learning the predictive function. The GCN model 10 is an exemplary GNN and may be used to learn and implement the predictive function. The GCN model may include one or multiple hidden layers 120, which are also referred to as graph convolutional layers 120. Each hidden layer 120 may receive and process a graph-structured data. For example, the hidden layer 120 may perform convolution operation on the data. The weights of the convolution operations in the hidden layer 120 may be trained with training data. It is appreciated that other operations may be included in the hidden layer 120 in addition to the convolution operation. Each activation engine 130 may apply an activation function (e.g., ReLU) to the output from a hidden layer 120 and send the output to the next hidden layer 120. A fully-connected layer or a softmax engine 140 may provide an output 150 based on the output of the previous hidden layer. In the node classification task, the output 150 of the GCN model 10 may be classification labels or particularly classification probabilities for nodes in the graph. For example, the GCN model 10 may employ a propagation rule
Figure PCTCN2021132082-appb-000011
where
Figure PCTCN2021132082-appb-000012
is the symmetric normalized adjacency matrix of
Figure PCTCN2021132082-appb-000013
σ (. ) denotes the activation function such as the ReLU function, W  (l) is the weight matrix of the layer l and H  (l) is the hidden node representation in the layer l with H  (0) = X.
The node classification task of the GCN model 10 is to determine the classification labels of nodes of the graph based on their neighbors. Particularly, given a subset of labeled nodes in the graph, the goal of the classification task of the GCN model 10 is to predict the labels of the remaining unlabeled nodes in the graph.
In an example, the GCN model 10 may be a two-layer GCN model: 
Figure PCTCN2021132082-appb-000014
where
Figure PCTCN2021132082-appb-000015
is a normalized adjacency matrix, 
Figure PCTCN2021132082-appb-000016
is a degree matrix of adjacency matrix A or
Figure PCTCN2021132082-appb-000017
and
Figure PCTCN2021132082-appb-000018
are parameter matrices of two hidden layers 120, where d H denotes the dimension of the hidden layer, C denotes the number of the categories of the classification labels, and σ (x) is the activation function 130, for example, σ (x) = ReLU (x) . 
Figure PCTCN2021132082-appb-000019
is the output matrix 150, representing the probability of each node to each classification label in the graph.
Fig. 2 illustrates an exemplary schematic diagram 20 for training a GNN model according to an embodiment.
Taking the graph G= (A, X) as shown in Fig. 2 as example. As illustrated, the graph 210 includes nodes represented by feature matrix X and edges represented by adjacent matrix A. The values shown in the graph are examples of features of the nodes. It is known that the disclosure is not limited to the specific node features or feature values. For sake of illustration, seven nodes are included in the graph. It is appreciated that a graph G may include much more nodes in practical application. In the exampled graph G, there are two labeled  nodes  1 and 6 and five unlabeled nodes 2-5 and 7. The task is to training the GNN model 250 by means of semi-supervised learning so that the trained GNN model 250 can predict the classification labels of the unlabeled nodes 2-5 and 7.
A propagation matrix may be used to propagate the features of the nodes of the graph. In an example, the feature propagation for the features of the nodes may be performed based on the adjacent matrix A or
Figure PCTCN2021132082-appb-000020
In the embodiment of the Fig. 2, a mixed-order propagation may be employed to perform the feature propagation in order to exploit and incorporate more local information of the adjacency relations among nodes, reducing the risk of over-smoothing. In order to perform the mixed-order propagation, a mixed-order adjacency matrix Π may be utilized as shown in equation (1) :
Figure PCTCN2021132082-appb-000021
where
Figure PCTCN2021132082-appb-000022
and w t≥ 0. In an example, row normalization may be used for
Figure PCTCN2021132082-appb-000023
The t-order normalized adjacency matrix
Figure PCTCN2021132082-appb-000024
is also the t-order random walk reverse transition matrix of the graph
Figure PCTCN2021132082-appb-000025
where the element P t (s, v) denotes the probability that a t-step random walk goes from source node s to target node v.
In the example shown in equation (1) , the generalized mixed-order matrix Π uses a set of tunable weights {w t |0 ≤ t ≤ T} to fuse different orders of adjacency matrices. By adjusting w t, the framework of the embodiment illustrated in Fig. 2 can flexibly manipulate the importance of different orders of neighborhoods to suit diverse graphs in the real world. In an example, the weight w t may be set as α (1-α)  t and accordingly the equation (1) becomes
Figure PCTCN2021132082-appb-000026
which may be referred to as a truncated personalized page-rank (PPR) matrix. It is appreciated that
Figure PCTCN2021132082-appb-000027
approaches 1 as more as the largest order T is bigger, this situation also conform substantially to the above condition
Figure PCTCN2021132082-appb-000028
In another example, the weight w t may be set as 1/ (T + 1) and accordingly the equation (1) becomes
Figure PCTCN2021132082-appb-000029
which may be referred to as a average pooling matrix. In another example, the weight w t may be set as w t=1 when t = T and w t= 0 otherwise, and accordingly the equation (1) becomes Π single=P T, which may be referred to as a single order matrix.
It is appreciated that calculating the mixed-order adjacency matrix as  shown in equation (1) may be computationally inefficient because large amount of memory capacity, computing capacity and computing time are required. On the other hand, when all the samples of the graph such as graph 210 are taken as the input of the GNN mode 250, the size of the graph may influence the requirements of the computing resources such as the memory capacity, computing capacity and so on, therefore a problem for computing resources may be risen for large size of graphs.
In an embodiment, an efficient push-flow method may be employed to generate an error-bounded approximation for each row vector Π s of the mixed-order adjacency matrix Π. The push-flow method may be referred to as a Generalized Forward Push (GFPush) method, which has the ability to approximate the generalized mixed-order random walk transition vector Π s of a node s. The core idea of GFPush is to simulate a T-step random walk probability diffusion process from node s with a series of pruning operations on probability masses. In an implementation of the GFPush, a pair of vectors at each step t (0 ≤ t ≤ T ) is maintained. One of the pair of vectors may be a reserve vector
Figure PCTCN2021132082-appb-000030
denoting the probability masses reserved at step t. The other one of the pair of vectors may be a residue vector
Figure PCTCN2021132082-appb-000031
representing the probability masses to be diffused beyond step t.
Table 1
Figure PCTCN2021132082-appb-000032
Table 1 shows the pseudo-code of the GFPush method. At beginning, r  (0) and q  (0) are set to the indicator vector e  (s) , where
Figure PCTCN2021132082-appb-000033
and
Figure PCTCN2021132082-appb-000034
for v ≠ s, representing that the random walk starts from s with the probability mass of 1. Other  reserve and residue vectors (r  (t) and q  (t) , t ≥ 1) are set to
Figure PCTCN2021132082-appb-000035
Then the method repeats multiple iterations from step 0 to step T-1. At step t, the GFPush method conducts push operation (Line 5–9 of the pseudo-code) on each node v that satisfies
Figure PCTCN2021132082-appb-000036
Figure PCTCN2021132082-appb-000037
where
Figure PCTCN2021132082-appb-000038
is the degree of node v in self-loop augmented graph
Figure PCTCN2021132082-appb-000039
r max is a predefined threshold. In each push operation, the current residue
Figure PCTCN2021132082-appb-000040
of node v is uniformly spread to node v’s neighbor nodes and the results are stored into the residue vector r  (t+1) of the next step t+1. And each updated residue
Figure PCTCN2021132082-appb-000041
is assigned to
Figure PCTCN2021132082-appb-000042
After that, 
Figure PCTCN2021132082-appb-000043
is reset to 0.
Intuitively, in the procedure of the GFPush method, 
Figure PCTCN2021132082-appb-000044
is the conditional probability that a random walk moves from node v to a neighboring node u, conditioned on it reaching v with probability
Figure PCTCN2021132082-appb-000045
at step t. Thus, the push operation on node v can be seen as a one-step random walk probability diffusion from v to its one-hop neighbor nodes. In an embodiment, in order for efficiency, the GFPush method can only conducts push operations for node v whose residue value is greater than d v·r max. It is appreciated that the GFPush method can also conduct push operations for every neighbor node v in the graph
Figure PCTCN2021132082-appb-000046
that is, the neighboring node v’s residue value is not limited to be greater than d v·r max.
After the last iteration, the GFPush method returns
Figure PCTCN2021132082-appb-000047
as result, which is an approximation of the mixed-order adjacency vector Π s for node s. For a set of nodes, the mixed-order adjacency vector Π s of each node s may be obtained using the GFPush method and the set of the mixed-order adjacency vectors for the set of nodes may be used as an approximation of the mixed-order adjacency matrix Π, and thus may be referred to as mixed-order adjacency matrix
Figure PCTCN2021132082-appb-000048
In an embodiment, in order to further reduce training cost, a top-k sparsification may be performed for the mixed-order adjacency vector
Figure PCTCN2021132082-appb-000049
of each node s. In this procedure, only the top-k largest elements of the mixed-order adjacency vector
Figure PCTCN2021132082-appb-000050
are preserved and other elements of the mixed-order adjacency vector
Figure PCTCN2021132082-appb-000051
are set to 0. Hence the resultant sparsified mixed-order adjacency vector
Figure PCTCN2021132082-appb-000052
has at most k non-zero elements. According to the theory of Escaping Mass of lazy random walk, which says the probability that a T-hop lazy random walk starting from node s will concentrate around a local cluster of node s, the sparsified vector
Figure PCTCN2021132082-appb-000053
is still expected to be effective for feature propagation by preserving most of local neighborhood nodes for node s. It is appreciated that the mixed-order adjacency vectors
Figure PCTCN2021132082-appb-000054
for different nodes may be calculated in parallel, for example, by using multi-thread programming, and thus the computing time of the mixed-order adjacency matrix
Figure PCTCN2021132082-appb-000055
for a set of nodes may be reduced.
Return to Fig. 2, the graph 210 includes a labeled node set L and an unlabeled node set U. In the shown example, the labeled node set L includes  nodes  1  and 6 and the unlabeled node set U includes nodes 2-5 and 7. A subset of nodes may be sampled from the graph 210, the subset of nodes may include at least a part of the labeled nodes and at least a part of the labeled nodes. As the number of labeled nodes in a graph is typically limited, all the labeled nodes of the graph are typically included in the subset. A part of the unlabeled nodes of the graph, especially a large graph, are sampled and included in the subset. The subset of nodes may include the labeled node set L and an unlabeled node subset U′. For example, the unlabeled node subset U′ may include 10000 unlabeled nodes in practical application. It is appreciated that the disclosure is not limited to the specific number of nodes in the unlabeled node subset U′. In the illustrated example, the unlabeled node subset U′ includes the sampled  nodes  3 and 7. Then the subset of nodes of graph 210 includes  nodes  1 and 6 in the labeled node set L and  nodes  3 and 7 in the unlabeled node subset U′.
The mixed-order adjacency vector
Figure PCTCN2021132082-appb-000056
of each node s in the subset of  nodes  1, 3, 6 and 7 may be obtained using the GFPush method, and accordingly a mixed-order adjacency matrix
Figure PCTCN2021132082-appb-000057
220 may be obtained for the subset of  nodes  1, 3, 6 and 7. The top-k sparsification may be performed on the mixed-order adjacency vector
Figure PCTCN2021132082-appb-000058
of each node s in the subset of  nodes  1, 3, 6 and 7 so as to obtain sparsified mixed-order adjacency vector
Figure PCTCN2021132082-appb-000059
of each node s, and accordingly a sparsified mixed-order adjacency matrix
Figure PCTCN2021132082-appb-000060
230 may be obtained for the subset of  nodes  1, 3, 6 and 7. The mixed-order adjacency matrix
Figure PCTCN2021132082-appb-000061
220 and the sparsified mixed-order adjacency matrix 
Figure PCTCN2021132082-appb-000062
230 is a sub-matrix compared to the adjacency matrix A since they only include adjacency vectors of a part of nodes of the graph. The value k is set to 3 in the illustrated example. As the GFPush method has the ability to generate the mixed-order sub-matrix for only a part of nodes of the graph, the efficiency for generating the propagation matrix can be improved significantly, especially for a large size graph including a large amount of nodes.
The sparsified sub-matrix
Figure PCTCN2021132082-appb-000063
230 may be used to perform random propagation of node features in a mini-batch manner. At a training step or training iteration, which may be generally denoted as a n-th training step or iteration, a batch of labeled nodes
Figure PCTCN2021132082-appb-000064
may be sampled with batch size |L n| = b l, and a batch of unlabeled nodes
Figure PCTCN2021132082-appb-000065
may be sampled with batch size |U n| = b u. In the illustrated example in Fig. 2, the batch of labeled nodes L n includes node 1 and the batch of unlabeled nodes U n includes node 7. Then the augmented feature vector
Figure PCTCN2021132082-appb-000066
of node s∈L n∪U n is calculated via equation (2) :
Figure PCTCN2021132082-appb-000067
where
Figure PCTCN2021132082-appb-000068
denotes neighboring nodes of node s, and particularly denotes the indices of the non-zero elements of
Figure PCTCN2021132082-appb-000069
is the feature vector of node v. At each  training step n, M augmented feature vectors
Figure PCTCN2021132082-appb-000070
are generated for node s by repeating this procedure shown in equation (2) for M times. As shown in Fig. 2, M augmented feature vectors 240-1 to 240-M for each of  nodes  1 and 7 in the batch are obtained. The time complexity and memory complexity of each batch are both bounded by O (k·h 0· (b l+b u) ) , which is independent of the graph size.
In an example as illustrated in Fig. 2, a drop-node method is used to perform the random propagation of node feature. Specifically, some nodes’ entire feature vectors are dropped or removed from the feature vectors of the neighboring nodes of node s by randomly setting entire feature vectors of some of the neighboring nodes to 0. Taking the node 1 as an example, its neighboring nodes includes  nodes  1, 2 and 4 as indicated in its mixed-order adjacency vector
Figure PCTCN2021132082-appb-000071
the drop node mask z = [0, 1, 1] for  nodes  1, 2 and 4 is obtained for the first augmentation branch 240-1 based on the Bernoulli distribution, similarly the drop node mask z = [1, 0, 0] for  nodes  1, 2 and 4 is obtained for the M-th augmentation branch 240-M based on the Bernoulli distribution. In the Bernoulli distribution, the element z i in the drop node mask z takes the value 0 with a probability of dropping rate δ and takes the value 1 with a probability of 1-δ, that is, z i ~ Bernoulli (1 -δ) .
In an embodiment, the augmented feature vectors
Figure PCTCN2021132082-appb-000072
of node s may be scaled with a factor of
Figure PCTCN2021132082-appb-000073
so as to make the augmented feature vectors in expectation equal to the input feature vectors of the feature matrix X.
The drop-node method enables each node to aggregate information only from a subset of its neighbors by completely ignoring some neighboring nodes’ features, which reduces its dependency on particular neighbors and thus helps increase the model’s robustness over adversarial action. It is appreciated that the neighbors of a node may include one-hop neighbor and multi-hop neighbor.
Although the drop-node method is illustrated in Fig. 2, it is appreciated that the perturbation of the feature vectors of neighboring nodes of a node s may be performed in other ways. For example, the perturbation of the feature vectors of neighboring nodes such as  neighboring nodes  1, 2 and 4 of node 1 may be performed by using a dropout method. Specifically, the dropout method may perturb the feature vectors of neighboring nodes by randomly setting some elements of the feature vectors of neighboring nodes to 0. In this example, the drop mask
Figure PCTCN2021132082-appb-000074
in which each element z ij takes the value 0 with a probability of δ and takes the value 1 with a probability of 1-δ. In an example, the z ij is obtained from a Bernoulli distribution, that is, z ij ~Bernoulli (1 -δ) . In the example of employing the dropout method to perturb the feature vectors of neighboring nodes, the drop node mask vector shown in Fig. 2 is replaced with the dropout mask matrix for performing the random dropout of some elements of the feature vectors of neighboring nodes.
It is appreciated that there may be different ways to perturb the feature  vectors of neighboring nodes, and the disclosure is not limited to the drop-node method and dropout method.
In practical applications, the feature dimension h 0 of a node of the graph may be extremely large, this may incur huge computational resource cost to calculate the augmented feature vector
Figure PCTCN2021132082-appb-000075
of node s. In an embodiment, in order to decrease the resource requirement for performing random propagation for high-dimensional features, each feature vector X v as shown in equation (2) and Fig. 2 may be transform to a low dimensional hidden vector
Figure PCTCN2021132082-appb-000076
with a linear transformation layer before random feature propagation. Then the augmented feature vector
Figure PCTCN2021132082-appb-000077
of node s may be obtained by performing random propagation with the low dimensional feature vector H v as shown in equation (3)
Figure PCTCN2021132082-appb-000078
where
Figure PCTCN2021132082-appb-000079
denotes a learnable transformation matrix. In this way, the computational complexity of this procedure is reduced to O (k·h· (b l+b u) ) , where h << h 0.
After obtaining the multiple augmented feature vectors
Figure PCTCN2021132082-appb-000080
of the node s, such as
Figure PCTCN2021132082-appb-000081
to
Figure PCTCN2021132082-appb-000082
by performing the random feature propagation for the nodes s for multiple times such as M times, each of the augmented feature vectors may be fed into the GNN model 250 such as the illustrated two-layer MLP model 250 to get respective outputs
Figure PCTCN2021132082-appb-000083
where 1≤m≤M. The outputs
Figure PCTCN2021132082-appb-000084
Figure PCTCN2021132082-appb-000085
denotes the classification prediction probabilities of the node s on the augmented feature vector
Figure PCTCN2021132082-appb-000086
Θ are the parameters of the model 250. Although the dimension C of the illustrated outputs
Figure PCTCN2021132082-appb-000087
is one, it is appreciated that dimension C of the output
Figure PCTCN2021132082-appb-000088
may be bigger depending on the number of classifications to be predicted by the model 250. It is appreciated that the two-layer MLP model 250 is exemplary, other number of layers of the MLP model is applicable.
After obtaining the multiple outputs
Figure PCTCN2021132082-appb-000089
of each node s in the batch for the current training step or training iteration, a loss L 260 may be obtained based on the classification predictions
Figure PCTCN2021132082-appb-000090
Then the parameter weights of the GNN model 250 may be updated based on the loss L 260. For example, the weights of the GNN model 250 may be updated by back propagating the loss L 260 along the GNN 250. In the above embodiment in which the learnable transformation matrix
Figure PCTCN2021132082-appb-000091
is used, the weights of the GNN 250 and the weights of learnable transformation matrix W  (0) may be updated based on the loss L.
In an embodiment, the loss function is a combination of the supervised loss on the labeled nodes and the graph regularization loss. Given the M data  augmentations of each node generated though the random propagation 240, a consistency regularized loss may be employed for the semi-supervised learning.
As a batch of b l labeled nodes L n are sampled for this training step, the supervised loss of the graph node classification task is defined as the average cross-entropy loss over M augmentations:
Figure PCTCN2021132082-appb-000092
where Y s is the observed label of the node s.
In the semi-supervised learning, the prediction consistency among M augmentations may be optimized for unlabeled data by using a consistency regularization loss. In an embodiment, for each node of the batch of unlabeled nodes U n, i.e., for node s ∈U n, the distribution center may be calculated by taking the average of its M prediction probabilities, i.e., 
Figure PCTCN2021132082-appb-000093
Then a sharpening process may be performed over the average prediction probability
Figure PCTCN2021132082-appb-000094
to obtain a pseudo label
Figure PCTCN2021132082-appb-000095
for node s. Formally, the probability on the j-th classification of node s is obtained via:
Figure PCTCN2021132082-appb-000096
where 0 < τ ≤ 1 is a hyperparameter to control the sharpness of the obtained pseudo label. As decreasing the value of τ, 
Figure PCTCN2021132082-appb-000097
is enforced to become sharper and converges to a one-hot distribution eventually.
In an embodiment, the consistency loss on unlabeled node batch U n may be obtained via:
Figure PCTCN2021132082-appb-000098
In another embodiment, a confidence-aware consistency loss may be employed to further improve effectiveness of the consistency loss. The confidence-aware consistency loss on unlabeled node batch U n may be obtained via:
Figure PCTCN2021132082-appb-000099
where
Figure PCTCN2021132082-appb-000100
is an indicator function which outputs 1 if
Figure PCTCN2021132082-appb-000101
holds, and outputs 0 otherwise. 0 ≤ γ ≤ 1 is a predefined threshold. 
Figure PCTCN2021132082-appb-000102
is a distance function which measures the distribution discrepancy between p and q. For example, the distance function may be a function for calculating L2 distance. In another example, the distance function may be a function for calculating KL divergence. In this embodiment shown in equation (7) , only highly confident  unlabeled nodes whose maximum value of prediction center
Figure PCTCN2021132082-appb-000103
exceeds γ are considered when obtaining the consistence loss. In this way, the potential training noise induced by uncertain pseudo-labels may be reduced and thus the effectiveness of the consistency loss may be improved.
The final loss
Figure PCTCN2021132082-appb-000104
for model optimization may be obtained based on the supervised classification loss and the consistency regularization loss:
Figure PCTCN2021132082-appb-000105
where λ is a weight that controls the balance of the two losses.
After obtaining the loss
Figure PCTCN2021132082-appb-000106
the parameters Θ of the GNN model 250 may be updated by gradients descending:
Figure PCTCN2021132082-appb-000107
In an embodiment, a dynamic scheduling strategy may be used to determine the value of λ in equation (8) . Particularly, at the n-th training step or training iteration, λ is obtained via:
λ=min (λ max, λ min+ (λ maxmin) ·n/n max)     (10)
In the first n max training steps, λ linearly increases from λ min to λ max, and remains constant in the following training steps. By using this dynamic loss weight scheduling, the weight λ may be limited to a small value in the early stage of training when the generated pseudo-labels
Figure PCTCN2021132082-appb-000108
are not much reliable, which may help model converge.
After the training of the GNN model, predictions for all unlabeled nodes of the graph may be inferred at the same time using the trained model:
Figure PCTCN2021132082-appb-000109
The feature matrix X is rescaled with (1 -δ) so that the rescaled features are identical with the expectation of the perturbed features with dropping rate δ in the model training. Unlike training, the above process only needs to be performed once for each node of the graph during model inference, and the computational cost is acceptable in practice. In another embodiment, the rescale operation with (1 -δ) may not be performed if the corresponding scaling operation has been performed during training.
The process of training the GNN model for performing a classification task on graph data and obtaining the classification prediction for the nodes of the graph may be illustrated as the following pseudocode in table 2.
Table 2
Figure PCTCN2021132082-appb-000110
It is appreciated that the above process is just illustrative rather than limitative to the scope of the disclosure, the various variant embodiments of the process may be possible.
For example, although the drop-node method is illustrated in Fig. 2 as the specific method for performing the perturbation of the feature vectors of the neighboring nodes, other perturbation method such as dropout, random perturbing method and so on may be employed.
For example, the MLP model 250 may be employed as the GNN model, and other GNN model such as the GCN, the GAT and so on may be employed as the GNN model.
For example, although the supervised classification loss
Figure PCTCN2021132082-appb-000111
via Equation (4) and consistency regularization loss
Figure PCTCN2021132082-appb-000112
via Equation (6) or (7) are employed as the loss function, the specific method for calculating the supervised classification loss 
Figure PCTCN2021132082-appb-000113
and the consistency regularization loss
Figure PCTCN2021132082-appb-000114
are not limited to the specific Equations. And the loss function is not limited to the combination of the supervised  classification loss
Figure PCTCN2021132082-appb-000115
and the consistency regularization loss
Figure PCTCN2021132082-appb-000116
at least one of which may be employed as the loss function at different circumstances.
Fig. 3 illustrates an exemplary process 30 for training a Graph Neural Network (GNN) model to perform a task of classifying nodes of a graph based on semi-supervised learning according to an embodiment. It is appreciated that the exemplary process 30 may be implemented by a computer, in which programs are executed by one or more processor to perform the process 30.
At step 310, a batch of labeled nodes and a batch of unlabeled nodes are sampled from the nodes of the graph. The graph comprises the nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes being represented by a corresponding feature vector of the feature matrix. It is appreciated that the batch is a subset of the nodes of graph, the number of the batch of nodes is far less than the number of nodes of the graph in practical application.
At step 320, a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node. The neighboring nodes of the node are indicated by the adjacency vector of the node.
At step 330, a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by respectively applying the plurality of augmented feature vectors of the node to the GNN model.
At step 340, a loss is obtained based on the classification predictions of the respective nodes in the batch of labeled nodes and the batch of unlabeled nodes.
At step 350, parameters of the GNN model are updated based on the loss.
In an embodiment, in the method 30, a mixed-order adjacency matrix corresponding to a plurality of adjacency orders is generated for at least a part of the nodes of the graph based on the adjacency matrix of the graph. The adjacency vector of each node in the batch of labeled nodes and the batch of unlabeled nodes is a corresponding mixed-order adjacency vector of the mixed-order adjacency matrix. In an embodiment, the mixed-order adjacency matrix corresponding to a plurality of adjacency orders is generated for a part or subset of the nodes of the graph based on the adjacency matrix of the graph. The adjacency vector of each node in the batch of labeled nodes and the batch of unlabeled nodes is a corresponding mixed-order adjacency vector of the mixed-order adjacency matrix. In this embodiment, the mixed-order adjacency matrix is actually a sub-matrix for the sampled nodes, which is generated by using the above mentioned GFPush method. By using the efficient GFPush method to generate the mixed-order adjacency sub-matrix, the efficiency for performing the classification of node for a large size graph, which includes a huge amount of nodes, can be improved significantly, as the GFPush method is able to generate the mixed-order adjacency sub-matrix for only a subset of all the nodes of the graph.
In an embodiment, the mixed-order adjacency matrix may be generated by performing weighted sum of a plurality of per-order adjacency matrixes corresponding to the plurality of adjacency orders, wherein one of the plurality of per-order adjacency matrixes corresponding to an adjacency order t, and an element in the per-order adjacency matrix representing a probability that a t-step random walk goes from a source node associated with the element to a target node associated with the element.
In an embodiment, the mixed-order adjacency matrix may be generated by generating a mixed-order adjacency vector for each of the at least a part of the nodes of the graph by iteratively performing one-step random walk based on the adjacency matrix of the graph. The generated respective mixed-order adjacency vectors for the at least a part of nodes constitute the mixed-order adjacency matrix.
In an embodiment, the batch of labeled nodes and the batch of unlabeled nodes are sampled from a subset of the nodes of the graph. In this embodiment, a mixed-order adjacency vector for each of the subset of nodes may be generated by iteratively performing one-step random walk based on the adjacency matrix of the graph. The generated respective mixed-order adjacency vectors for the subset of nodes constitute the mixed-order adjacency matrix.
In an embodiment, the mixed-order adjacency vector for each of the at least a part of the nodes of the graph may be generated by iteratively, from the lowest to the highest of the plurality of adjacency orders, calculating a plurality of reserved probability mass vectors and a plurality of residue probability mass vectors corresponding to the plurality of adjacency orders for the node, and generating the mixed-order adjacency vector for the node by performing weighted sum of the plurality of reserved probability mass vectors for the node. In an embodiment, the calculating of the plurality of reserved probability mass vectors and the plurality of residue probability mass vectors is performed based on an adjacency degree matrix of the graph and/or a predefined threshold.
In an embodiment, the generating the mixed-order adjacency vector for each of the at least a part of the nodes of the graph may further comprises: remaining a predefined number of largest elements of the mixed-order adjacency vector of the node while setting other elements of the mixed-order adjacency vector to zero.
In an embodiment, the weighted sum of the plurality of reserved probability mass vectors for the node may performed by using weights α (1-α)  t to the plurality of reserved probability mass vectors, wherein α being a decay factor and t being one of the plurality of adjacency orders, or by using an averaged weight corresponding to the plurality of adjacency orders, or by using a single order weight in which one of the weights being set to 1 and the other of the weights being set to 0.
In an embodiment, at step 320, a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by randomly propagating feature vectors of neighboring nodes of the node based on  the adjacency vector of the node and a dropping mask. The dropping mask is configured to randomly drop out at least partial features of at least a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes. In an embodiment, the dropping mask is configured to randomly drop out entire features of a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes. In an embodiment, the dropping mask is configured to randomly drop out entire features of a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes based on a dropping rate, features of the remained part of the neighboring nodes are scaled up based on the dropping rate.
In an embodiment, the loss comprises a supervised classification loss and a consistency regularization loss. In an embodiment, the supervised classification loss is obtained based on classification labels of the batch of labeled nodes, and the consistency regularization loss is obtained based on classification predictions of the batch of unlabeled nodes. In an embodiment, the supervised classification loss is a cross-entropy loss based on the classification labels and the classification predictions corresponding to the batch of labeled nodes, and the consistency regularization loss is a distance loss based on the classification predictions of the batch of unlabeled nodes. In an embodiment, the consistency regularization loss is set to zero if the maximum of the classification predictions of the batch of unlabeled nodes is less than a threshold or if the maximum of respective averaged classification predictions of the batch of unlabeled nodes is less than a threshold.
In an embodiment, the method 30 is performed repetitively in each of a plurality of training steps, which may be referred to as training loops or training iterations, wherein the loss is obtained by performing weighted sum of the supervised classification loss and a consistency regularization loss with dynamic weights in each of the plurality of training steps.
In an embodiment, in the method 30, the feature vectors of neighboring nodes of the node are transformed with a linear transformation matrix to respective lower-dimension feature vectors having a lower dimension than the feature vectors of the neighboring nodes of the node. In this embodiment, the plurality of augmented lower-dimension feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by randomly propagating the lower-dimension feature vectors of neighboring nodes of the node, the plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by respectively applying the plurality of augmented lower-dimension feature vectors of the node to the GNN model, and at step 350 parameters of the GNN model and parameter of the linear transformation matrix are updated based on the loss.
In an embodiment, the GNN model comprises one of a Multilayer Perception (MLP) model, a Graph Convolutional Network (GCN) model, a Graph Attention Network (GAT) model.
In an embodiment, the process 30 illustrated in Fig. 3 may be implemented as a method for training a Graph Neural Network (GNN) model to perform a task of classifying accounts on a social network or a financial network. In this embodiment, the GNN model for classifying the accounts on the social network or the financial network may be trained using the method as described above with reference to Figs. 1-3, wherein the nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
Fig. 4 illustrates an exemplary process 40 for classifying a node of a graph according to an embodiment. It is appreciated that the exemplary process 40 may be implemented by a computer, in which programs are executed by one or more processor to perform the process 40.
At step 410, a Graph Neural Network (GNN) model may be trained by using the method as described above with reference to Figs. 1-3.
At step 420, a classification label for the node of the graph may be predicted by applying the feature matrix of the graph to the trained GNN model.
In an embodiment, the method 40 may also include identify an alarm if the label of the node is a predefined label. In an example, the predefined label of the node may be a label indicating the node is an adversarial node, a fraud node or the like, such as an adversarial node in social network, financial network or the like.
In an embodiment, the process 40 illustrated in Fig. 4 may be implemented as a method for classifying an account on a social network or a financial network. In this embodiment, the GNN model for classifying the accounts on the social network or the financial network may be trained using the method as described above with reference to Figs. 1-3, and a classification label for the account on the social network or the financial network may be predicted with the trained GNN model. The nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
Fig. 5 illustrates an exemplary computing system 50 according to an embodiment. The computing system 50 may comprise at least one processor 510. The computing system 50 may further comprise at least one storage device 520. The storage device 520 may store computer-executable instructions that, when executed, cause the processor 510 to sample a batch of labeled nodes and a batch of unlabeled nodes from the nodes of the graph, wherein the graph comprising the nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes being represented by a corresponding feature vector of the feature matrix; obtain a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node, wherein the neighboring nodes of the node being indicated by the adjacency vector of the node; obtain a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of  augmented feature vectors of the node to the GNN model; obtain a loss based on the classification predictions of the nodes in the batch of labeled nodes and the batch of unlabeled nodes; and update parameters of the GNN model based on the loss.
It should be appreciated that the storage device 520 may store computer-executable instructions that, when executed, cause the processor 510 to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-4.
The embodiments of the present disclosure may be embodied in a computer-readable medium such as non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-4.
The embodiments of the present disclosure may be embodied in a computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-4.
It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims (20)

  1. A computer-implemented method for training a Graph Neural Network (GNN) model to perform a task of classifying nodes of a graph based on semi-supervised learning, the method comprising:
    sampling a batch of labeled nodes and a batch of unlabeled nodes from the nodes of the graph, wherein the graph comprising nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes of graph being represented by a corresponding feature vector of the feature matrix;
    obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node;
    obtaining a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented feature vectors of the node to the GNN model;
    obtaining a loss based on the classification predictions of the nodes in the batch of labeled nodes and the batch of unlabeled nodes; and
    updating parameters of the GNN model based on the loss.
  2. The method of claim 1, further comprising:
    generating a mixed-order adjacency matrix corresponding to a plurality of adjacency orders for at least a part of the nodes of the graph based on the adjacency matrix of the graph, wherein the adjacency vector of each node in the batch of labeled nodes and the batch of unlabeled nodes being a corresponding mixed-order adjacency vector of the mixed-order adjacency matrix.
  3. The method of claim 2, wherein the generating the mixed-order adjacency matrix further comprising:
    generating a mixed-order adjacency vector for each of the at least a part of the nodes of the graph by iteratively performing one-step random walk based on the adjacency matrix of the graph.
  4. The method of claim 3, wherein the generating the mixed-order adjacency vector for each of the at least a part of the nodes of the graph further comprising:
    iteratively, from the lowest to the highest of the plurality of adjacency orders, calculating a plurality of reserved probability mass vectors and a plurality of residue probability mass vectors corresponding to the plurality of adjacency orders for the node; and
    generating the mixed-order adjacency vector for the node by performing weighted sum of the plurality of reserved probability mass vectors for the node.
  5. The method of claim 4, wherein the calculating of the plurality of reserved probability mass vectors and the plurality of residue probability mass vectors being based on an adjacency degree matrix of the graph and/or a predefined threshold.
  6. The method of claim 3, wherein the generating the mixed-order adjacency vector for each of the at least a part of the nodes of the graph further comprising:
    remaining a predefined number of largest elements of the mixed-order adjacency vector of the node while setting other elements of the mixed-order adjacency vector to zero.
  7. The method of claim 4, wherein the performing weighted sum of the plurality of reserved probability mass vectors for the node further comprising:
    performing weighted sum of the plurality of reserved probability mass vectors for the node by using weights α (1-α)  t to the plurality of reserved probability mass vectors, wherein α being a decay factor and t being one of the plurality of adjacency orders, or by using an averaged weight corresponding to the plurality of adjacency orders, or by using a single order weight in which one of the weights being set to 1 and the other of the weights being set to zero.
  8. The method of claim 1, wherein the obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes further comprising:
    obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on the adjacency vector of the node and a dropping mask, wherein the dropping mask being configured to randomly drop out partial features of at least a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes, or being configured to randomly drop out entire features of a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes.
  9. The method of claim 1, wherein the loss comprising a supervised classification loss and a consistency regularization loss, wherein the supervised classification loss being obtained based on classification labels of the batch of labeled nodes, and the consistency regularization loss being obtained based on classification predictions of the batch of unlabeled nodes.
  10. The method of claim 9, wherein the supervised classification loss being a cross-entropy loss based on the classification labels and the classification predictions corresponding to the batch of labeled nodes, and the consistency  regularization loss being a distance loss based on the classification predictions of the batch of unlabeled nodes.
  11. The method of claim 10, wherein the consistency regularization loss being set to zero if the maximum of the classification predictions of the batch of unlabeled nodes is less than a threshold or if the maximum of respective averaged classification predictions of the batch of unlabeled nodes is less than a threshold.
  12. The method of claim 11, wherein the method being repetitively performed in each of a plurality of training steps, wherein the loss being obtained by performing weighted sum of the supervised classification loss and the consistency regularization loss with dynamic weights in each of the plurality of training steps.
  13. The method of claim 1, further comprising transforming the feature vectors of neighboring nodes of the node with a linear transformation matrix to respective lower-dimension feature vectors having a lower dimension than the feature vectors of the neighboring nodes,
    wherein the obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node further comprising: obtaining a plurality of augmented lower-dimension feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating the lower-dimension feature vectors of neighboring nodes of the node,
    wherein the obtaining a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented feature vectors of the node to the GNN model further comprising: obtaining the plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented lower-dimension feature vectors of the node to the GNN model.
  14. A computer-implemented method for classifying a node of a graph, comprising:
    training a Graph Neural Network (GNN) model by using the method of one of claims 1-13;
    predicting a classification label for the node of the graph by applying the feature matrix of the graph to the trained GNN model.
  15. A computer-implemented method for training a Graph Neural Network (GNN) model to perform a task of classifying accounts on a social network or a financial network, comprising:
    training the GNN model for classifying the accounts on the social network or the financial network using the method of one of claims 1-13, wherein the nodes representing the accounts, the edges representing the relation among the accounts, the graph representing the social network or the financial network.
  16. A computer-implemented method for classifying an account on a social network or a financial network, comprising:
    training a Graph Neural Network (GNN) model by using the method of claim 15;
    predicting a classification label for the account on the social network or the financial network with the trained GNN model, wherein the nodes representing the accounts, the edges representing the relation among the accounts, the graph representing the social network or the financial network.
  17. The method of claim 16, further comprising:
    identify an alarm if the label of the account is a predefined label.
  18. The method of claim 17, wherein the predefined label indicating the account as adversarial account.
  19. A computer system, comprising:
    one or more processors; and
    one or more storage devices storing computer-executable instructions that, when executed, cause the one or more processors to perform the operations of the method of one of claims 1-18.
  20. One or more computer readable storage media storing computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method of one of claims 1-18.
PCT/CN2021/132082 2021-11-22 2021-11-22 Method and apparatus for classifying nodes of a graph WO2023087303A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/132082 WO2023087303A1 (en) 2021-11-22 2021-11-22 Method and apparatus for classifying nodes of a graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/132082 WO2023087303A1 (en) 2021-11-22 2021-11-22 Method and apparatus for classifying nodes of a graph

Publications (1)

Publication Number Publication Date
WO2023087303A1 true WO2023087303A1 (en) 2023-05-25

Family

ID=78821489

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/132082 WO2023087303A1 (en) 2021-11-22 2021-11-22 Method and apparatus for classifying nodes of a graph

Country Status (1)

Country Link
WO (1) WO2023087303A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994073A (en) * 2023-09-27 2023-11-03 江西师范大学 Graph contrast learning method and device for self-adaptive positive and negative sample generation
CN117828514A (en) * 2024-03-04 2024-04-05 清华大学深圳国际研究生院 User network behavior data anomaly detection method based on graph structure learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANPENG CHEN ET AL: "Graph Attention Networks with LSTM-based Path Reweighting", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 21 June 2021 (2021-06-21), XP081992926 *
WENZHENG FENG ET AL: "Graph Random Neural Network for Semi-Supervised Learning on Graphs", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 21 September 2021 (2021-09-21), XP091042321 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994073A (en) * 2023-09-27 2023-11-03 江西师范大学 Graph contrast learning method and device for self-adaptive positive and negative sample generation
CN116994073B (en) * 2023-09-27 2024-01-26 江西师范大学 Graph contrast learning method and device for self-adaptive positive and negative sample generation
CN117828514A (en) * 2024-03-04 2024-04-05 清华大学深圳国际研究生院 User network behavior data anomaly detection method based on graph structure learning
CN117828514B (en) * 2024-03-04 2024-05-03 清华大学深圳国际研究生院 User network behavior data anomaly detection method based on graph structure learning

Similar Documents

Publication Publication Date Title
Zheng et al. Layer-wise learning based stochastic gradient descent method for the optimization of deep convolutional neural network
US9390383B2 (en) Method for an optimizing predictive model using gradient descent and conjugate residuals
CN110766142A (en) Model generation method and device
WO2023087303A1 (en) Method and apparatus for classifying nodes of a graph
Lin et al. Temporal convolutional attention neural networks for time series forecasting
Yang et al. Channel pruning based on convolutional neural network sensitivity
Koklu et al. Estimation of credit card customers payment status by using kNN and MLP
US20220335303A1 (en) Methods, devices and media for improving knowledge distillation using intermediate representations
Gil et al. Quantization-aware pruning criterion for industrial applications
WO2022019913A1 (en) Systems and methods for generation of machine-learned multitask models
Liu et al. Gflowout: Dropout with generative flow networks
CN110633417A (en) Web service recommendation method and system based on service quality
Xie et al. Distributed semi-supervised learning algorithms for random vector functional-link networks with distributed data splitting across samples and features
CN111445032B (en) Method and device for decision processing by using business decision model
Kazemi et al. A novel evolutionary-negative correlated mixture of experts model in tourism demand estimation
Zhang et al. Musings on deep learning: Properties of sgd
WO2023000165A1 (en) Method and apparatus for classifying nodes of a graph
Cao et al. Lstm network based traffic flow prediction for cellular networks
WO2022166125A1 (en) Recommendation system with adaptive weighted baysian personalized ranking loss
Nakamura et al. Stochastic batch size for adaptive regularization in deep network optimization
Zhang et al. A novel hybrid framework based on temporal convolution network and transformer for network traffic prediction
Lee et al. Adaptive network sparsification via dependent variational beta-bernoulli dropout
Salehinejad et al. A framework for pruning deep neural networks using energy-based models
Yang et al. Pruning Convolutional Neural Networks via Stochastic Gradient Hard Thresholding
US11676037B1 (en) Disparity mitigation in machine learning-based predictions for distinct classes of data using derived indiscernibility constraints during neural network training

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21819327

Country of ref document: EP

Kind code of ref document: A1