WO2023087303A1 - Method and apparatus for classifying nodes of a graph - Google Patents

Method and apparatus for classifying nodes of a graph

Info

Publication number
WO2023087303A1
Authority
WO
WIPO (PCT)
Prior art keywords
nodes
node
batch
graph
adjacency
Prior art date
Application number
PCT/CN2021/132082
Other languages
French (fr)
Inventor
Evgeny Kharlamov
Jie Tang
Wenzheng FENG
Original Assignee
Robert Bosch Gmbh
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Robert Bosch Gmbh, Tsinghua University filed Critical Robert Bosch Gmbh
Priority to PCT/CN2021/132082 priority Critical patent/WO2023087303A1/en
Publication of WO2023087303A1 publication Critical patent/WO2023087303A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282Rating or review of business operators or products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Definitions

  • aspects of the present disclosure relate generally to artificial intelligence, and more particularly, to performing a task of classifying nodes of a graph using a Graph Neural Network (GNN) model based on semi-supervised learning.
  • GNN Graph Neural Network
  • Graph data has been widely used in many real-world applications, such as social networks, financial systems, biological networks, citation networks, recommendation systems, etc.
  • Node classification is one of the most important tasks on graphs.
  • deep learning models for graphs, such as GNN models, have achieved good results in the task of node classification on graphs. Given a graph with labels associated with a subset of nodes, the GNN model may predict the labels for the rest of the nodes.
  • Despite the great success achieved by GNNs, there are two limitations in GNN-based semi-supervised learning methods. The first one is limited generalization. Conventional GNNs only use a supervised objective function for model training, which makes the GNN model prone to overfit the limited labeled samples, thereby degrading the prediction performance on unlabeled samples and leading to poor reliability of the GNN model. The other one is weak scalability: conventional GNNs adopt a full-graph training method with an expensive recursive feature propagation procedure, inducing enormous time and memory overhead when processing large graphs.
  • a computer-implemented method for training a Graph Neural Network (GNN) model to perform a task of classifying nodes of a graph based on semi-supervised learning comprises: sampling a batch of labeled nodes and a batch of unlabeled nodes from the nodes of the graph, wherein the graph comprises nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes of the graph being represented by a corresponding feature vector of the feature matrix; obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node; obtaining a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented feature vectors of the node to the GNN model; obtaining a loss based on the classification predictions of the nodes in the batch of labeled nodes and the batch of unlabeled nodes; and updating parameters of the GNN model based on the loss.
  • a computer-implemented method for classifying a node of a graph comprises: training a Graph Neural Network (GNN) model by using the method as mentioned above as well as the method according to aspects of the disclosure; predicting a classification label for the node of the graph by applying the feature matrix of the graph to the trained GNN model.
  • GNN Graph Neural Network
  • a computer-implemented method for training a Graph Neural Network (GNN) model to perform a task of classifying accounts on a social network or a financial network comprises: training the GNN model for classifying the accounts on the social network or the financial network using the method as mentioned above as well as the method according to aspects of the disclosure, wherein the nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
  • GNN Graph Neural Network
  • a computer-implemented method for classifying an account on a social network or a financial network comprises: training a GNN model for classifying the accounts on the social network or the financial network using the method as mentioned above as well as the method according to aspects of the disclosure; and predicting a classification label for the account on the social network or the financial network with the trained GNN model, wherein the nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
  • a computer system which comprises one or more processors and one or more storage devices storing computer-executable instructions that, when executed, cause the one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.
  • there provides one or more computer readable storage media storing computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.
  • a computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.
  • each node in the batch is enabled to be insensitive to specific neighborhoods, increasing the reliability of the classification prediction of the model framework while decreasing computational resource requirements such as time and memory cost.
  • Fig. 1 illustrates an exemplary GCN model according to an embodiment.
  • Fig. 2 illustrates an exemplary schematic diagram for training a GNN model according to an embodiment.
  • Fig. 3 illustrates an exemplary process for training a GNN model to perform a task of classifying nodes of a graph based on semi-supervised learning according to an embodiment.
  • Fig. 4 illustrates an exemplary process for classifying a node of a graph according to an embodiment.
  • Fig. 5 illustrates an exemplary computing system according to an embodiment.
  • the present disclosure describes a method and a system, implemented as computer programs executed on one or more computers, which are used to train a GNN model to perform a task of classifying nodes of a graph or to classify a node of a graph.
  • the GNN model may be implemented as a graph convolution network (GCN) model, and may perform a machine learning task of classifying nodes in a graph, which may for example represent a social network, biological network, citation network, recommendation system, financial system, etc.
  • GCN graph convolution network
  • the aspects of the disclosure may be applied in these fields such as the social network, biological network, citation network, recommendation system, financial system and so on to improve the security and robustness of these systems.
  • Fig. 1 illustrates an exemplary GCN model 10 according to an embodiment.
  • a graph is fed as input 110 of the GCN model 10.
  • the graph may be a dataset that contains nodes and edges.
  • the nodes in the graphs may represent entities, and the edges represent the connections between nodes.
  • a social network is a graph in which users or particularly user accounts in the network are nodes in the graph.
  • An edge exists when two users are connected in some way.
  • the two users are friends, share one’s posts, have similar interests, have similar profiles, or the like, then the two users may have a connection which is represented by the edge.
  • a financial network is a graph in which users or particularly user accounts in the network are nodes in the graph.
  • An edge exists when two users are connected in some way. For example, the two users have remittance transfer relation, employee relation, similar deposit, similar investment preference, similar profiles, or the like, then the two users may have a connection which is represented by the edge.
  • the graph G with added self-loop connections may be denoted as G̃, which may be referred to as the self-loop augmented graph.
  • the adjacency matrix of the graph G̃ is Ã = A + I and the degree matrix of the graph G̃ is D̃ = D + I, where I denotes the unit matrix. X ∈ R^(|V|×h_0) represents the feature matrix of the graph G, where |V| stands for the number of nodes and h_0 stands for the dimension of the node features.
  • the feature of a node may include multiple feature components, the number of which is defined as the dimension of the node feature.
  • the feature components of a node may include age, gender, hobby, career, various actions such as shopping, reading, listening to music, and so on. It is appreciated that aspects of the disclosure are not limited to specific values of the elements of the adjacency matrix and the feature matrix.
  • a label matrix Y ∈ {0, 1}^(|V|×C) may denote the labels of the nodes of the graph, where C stands for the dimension of the classification or the number of classification labels. Therefore, each node s of a graph is associated with a feature vector X_s and a label vector Y_s ∈ {0, 1}^C.
  • for semi-supervised classification, a limited number of nodes L ⊂ V (0 < |L| << |V|) have their observed labels Y_L, and the remaining nodes U = V - L do not have the observed labels.
  • the objective of semi-supervised learning is to infer the missing labels Y U for the unlabeled nodes U based on graph structure G, node features X and the observed labels Y L .
  • for a matrix M, M_i denotes its i-th row vector and M (i, j) denotes the element of the i-th row and the j-th column.
  • the GCN model 10 is an exemplary GNN and may be used to learn and implement the predictive function.
  • the GCN model may include one or multiple hidden layers 120, which are also referred to as graph convolutional layers 120.
  • Each hidden layer 120 may receive and process a graph-structured data.
  • the hidden layer 120 may perform convolution operation on the data.
  • the weights of the convolution operations in the hidden layer 120 may be trained with training data. It is appreciated that other operations may be included in the hidden layer 120 in addition to the convolution operation.
  • Each activation engine 130 may apply an activation function (e.g., ReLU) to the output from a hidden layer 120 and send the output to the next hidden layer 120.
  • ReLU activation function
  • a fully-connected layer or a softmax engine 140 may provide an output 150 based on the output of the previous hidden layer.
  • the output 150 of the GCN model 10 may be classification labels or particularly classification probabilities for nodes in the graph.
  • the node classification task of the GCN model 10 is to determine the classification labels of nodes of the graph based on their neighbors. Particularly, given a subset of labeled nodes in the graph, the goal of the classification task of the GCN model 10 is to predict the labels of the remaining unlabeled nodes in the graph.
  • Fig. 2 illustrates an exemplary schematic diagram 20 for training a GNN model according to an embodiment.
  • the graph 210 includes nodes represented by a feature matrix X and edges represented by an adjacency matrix A.
  • the values shown in the graph are examples of features of the nodes. It is appreciated that the disclosure is not limited to the specific node features or feature values. For the sake of illustration, seven nodes are included in the graph. It is appreciated that a graph G may include many more nodes in practical applications.
  • in the example graph G, there are two labeled nodes 1 and 6 and five unlabeled nodes 2-5 and 7.
  • the task is to train the GNN model 250 by means of semi-supervised learning so that the trained GNN model 250 can predict the classification labels of the unlabeled nodes 2-5 and 7.
  • a propagation matrix may be used to propagate the features of the nodes of the graph.
  • the feature propagation for the features of the nodes may be performed based on the adjacency matrix A or the self-loop augmented adjacency matrix Ã.
  • a mixed-order propagation may be employed to perform the feature propagation in order to exploit and incorporate more local information of the adjacency relations among nodes, reducing the risk of over-smoothing.
  • a mixed-order adjacency matrix Π may be utilized as shown in equation (1): Π = Σ_(t=0)^T w_t · P^t   (1), where P^t denotes the t-order normalized adjacency matrix of the self-loop augmented graph G̃ and w_t denotes the tunable weight of order t.
  • the t-order normalized adjacency matrix P^t is also the t-order random walk reverse transition matrix of the graph G̃, where the element P^t (s, v) denotes the probability that a t-step random walk goes from source node s to target node v.
  • the generalized mixed-order matrix Π uses a set of tunable weights {w_t | 0 ≤ t ≤ T} to combine the different orders of neighborhood information.
  • the framework of the embodiment illustrated in Fig. 2 can flexibly manipulate the importance of different orders of neighborhoods to suit diverse graphs in the real world.
  • the weight w_t may be set as α(1-α)^t, where α is a decay factor, and accordingly equation (1) becomes Π_ppr = Σ_(t=0)^T α(1-α)^t · P^t, which may be referred to as a truncated personalized page-rank (PPR) matrix.
  • PPR personalized page-rank
  • the weight w_t may be set as 1/(T + 1) and accordingly equation (1) becomes Π_avg = (1/(T + 1)) · Σ_(t=0)^T P^t, which may be referred to as an average pooling matrix.
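  • as an illustration (a minimal sketch that is not part of the original disclosure; T and alpha are assumed hyperparameters), the two weight settings above may be computed as follows:

        import numpy as np

        def ppr_weights(T, alpha):
            # truncated personalized page-rank weights: w_t = alpha * (1 - alpha)^t
            return np.array([alpha * (1.0 - alpha) ** t for t in range(T + 1)])

        def avg_pool_weights(T):
            # average pooling weights: w_t = 1 / (T + 1)
            return np.full(T + 1, 1.0 / (T + 1))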
  • calculating the mixed-order adjacency matrix as shown in equation (1) may be computationally inefficient because a large amount of memory capacity, computing capacity and computing time is required.
  • the size of the graph may influence the requirements on computing resources such as memory capacity, computing capacity and so on; therefore a computing-resource problem may arise for large graphs.
  • an efficient push-flow method may be employed to generate an error-bounded approximation for each row vector π_s of the mixed-order adjacency matrix Π.
  • the push-flow method may be referred to as a Generalized Forward Push (GFPush) method, which has the ability to approximate the generalized mixed-order random walk transition vector π_s of a node s.
  • GFPush Generalized Forward Push
  • the core idea of GFPush is to simulate a T-step random walk probability diffusion process from node s with a series of pruning operations on probability masses.
  • a pair of vectors at each step t (0 ≤ t ≤ T) is maintained.
  • one of the pair of vectors may be a reserve vector r^(t) denoting the probability masses reserved at step t.
  • the other one of the pair of vectors may be a residue vector q^(t) representing the probability masses to be diffused beyond step t.
  • Table 1 shows the pseudo-code of the GFPush method.
  • r^(0) and q^(0) are set to the indicator vector e^(s), where e^(s) (s) = 1 and e^(s) (v) = 0 for v ≠ s, representing that the random walk starts from s with a probability mass of 1.
  • other reserve and residue vectors (r^(t) and q^(t), t ≥ 1) are set to zero vectors. Then the method repeats multiple iterations from step 0 to step T-1.
  • at step t, the GFPush method conducts a push operation (Lines 5–9 of the pseudo-code) on each node v that satisfies q^(t) (v) > d_v · r_max, where d_v is the degree of node v in the self-loop augmented graph G̃ and r_max is a predefined threshold.
  • in the procedure of the GFPush method, 1/d_v is the conditional probability that a random walk moves from node v to a neighboring node u, conditioned on it reaching v with probability q^(t) (v) at step t.
  • the push operation on node v can be seen as a one-step random walk probability diffusion from v to its one-hop neighbor nodes.
  • the GFPush method may conduct push operations only for nodes v whose residue value is greater than d_v · r_max. It is appreciated that the GFPush method may also conduct push operations for every node v in the graph, that is, the node v's residue value is not limited to be greater than d_v · r_max.
  • after the last iteration, the GFPush method returns π̂_s = Σ_(t=0)^T w_t · r^(t) as the result, which is an approximation of the mixed-order adjacency vector π_s for node s.
  • the mixed-order adjacency vector π̂_s of each node s may be obtained using the GFPush method, and the set of the mixed-order adjacency vectors for the set of nodes may be used as an approximation of the mixed-order adjacency matrix Π, and thus may be referred to as the approximated mixed-order adjacency matrix Π̂.
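  • the following Python sketch (an illustration under assumed data structures, not the pseudo-code of Table 1) shows how the GFPush approximation of the mixed-order adjacency vector of a node s could be implemented, given the neighbor lists of the self-loop augmented graph, the order weights w_t and the threshold r_max:

        def gfpush(s, neighbors, weights, r_max):
            # neighbors: dict mapping each node to the list of its neighbors in the
            #            self-loop augmented graph (assumed input format)
            # weights:   order weights w_0, ..., w_T of the mixed-order matrix
            # r_max:     predefined residue threshold controlling the approximation
            T = len(weights) - 1
            reserve = [dict() for _ in range(T + 1)]   # r^(t): reserved probability masses
            residue = [dict() for _ in range(T + 1)]   # q^(t): residue probability masses
            reserve[0][s] = 1.0                        # random walk starts at s ...
            residue[0][s] = 1.0                        # ... with a probability mass of 1
            for t in range(T):
                for v, mass in list(residue[t].items()):
                    d_v = len(neighbors[v])
                    if mass <= d_v * r_max:            # prune residues below the threshold
                        continue
                    share = mass / d_v                 # one-step diffusion to each neighbor
                    for u in neighbors[v]:
                        residue[t + 1][u] = residue[t + 1].get(u, 0.0) + share
                        reserve[t + 1][u] = reserve[t + 1].get(u, 0.0) + share
                    residue[t][v] = 0.0                # mass of v has been pushed onward
            # weighted sum of the reserved masses approximates the mixed-order vector
            pi_s = dict()
            for t in range(T + 1):
                for v, mass in reserve[t].items():
                    pi_s[v] = pi_s.get(v, 0.0) + weights[t] * mass
            return pi_s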
  • a top-k sparsification may be performed for the mixed-order adjacency vector of each node s.
  • the resultant sparsified mixed-order adjacency vector has at most k non-zero elements.
  • according to the theory of the escaping mass of lazy random walks, which says that the probability mass of a T-hop lazy random walk starting from node s will concentrate around a local cluster of node s, the sparsified vector is still expected to be effective for feature propagation because it preserves most of the local neighborhood nodes of node s.
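  • a small illustrative helper (assumed, not from the disclosure) for the top-k sparsification of such an approximated adjacency vector could be:

        def topk_sparsify(pi_s, k):
            # keep only the k largest entries of the approximated adjacency vector
            top = sorted(pi_s.items(), key=lambda kv: kv[1], reverse=True)[:k]
            return dict(top)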
  • the mixed-order adjacency vectors for different nodes may be calculated in parallel, for example, by using multi-thread programming, and thus the computing time of the mixed-order adjacency matrix for a set of nodes may be reduced.
  • the graph 210 includes a labeled node set L and an unlabeled node set U.
  • the labeled node set L includes nodes 1 and 6 and the unlabeled node set U includes nodes 2-5 and 7.
  • a subset of nodes may be sampled from the graph 210, and the subset of nodes may include at least a part of the labeled nodes and at least a part of the unlabeled nodes. As the number of labeled nodes in a graph is typically limited, all the labeled nodes of the graph are typically included in the subset. A part of the unlabeled nodes of the graph, especially of a large graph, are sampled and included in the subset.
  • the subset of nodes may include the labeled node set L and an unlabeled node subset U′.
  • the unlabeled node subset U′ may include 10000 unlabeled nodes in practical application. It is appreciated that the disclosure is not limited to the specific number of nodes in the unlabeled node subset U′.
  • the unlabeled node subset U′ includes the sampled nodes 3 and 7.
  • the subset of nodes of graph 210 includes nodes 1 and 6 in the labeled node set L and nodes 3 and 7 in the unlabeled node subset U′.
  • the mixed-order adjacency vector of each node s in the subset of nodes 1, 3, 6 and 7 may be obtained using the GFPush method, and accordingly a mixed-order adjacency matrix 220 may be obtained for the subset of nodes 1, 3, 6 and 7.
  • the top-k sparsification may be performed on the mixed-order adjacency vector of each node s in the subset of nodes 1, 3, 6 and 7 so as to obtain sparsified mixed-order adjacency vector of each node s, and accordingly a sparsified mixed-order adjacency matrix 230 may be obtained for the subset of nodes 1, 3, 6 and 7.
  • the mixed-order adjacency matrix 220 and the sparsified mixed-order adjacency matrix 230 are sub-matrices compared to the adjacency matrix A since they only include adjacency vectors of a part of the nodes of the graph.
  • the value k is set to 3 in the illustrated example.
  • since the GFPush method has the ability to generate the mixed-order sub-matrix for only a part of the nodes of the graph, the efficiency of generating the propagation matrix can be improved significantly, especially for a large graph including a large number of nodes.
  • the sparsified sub-matrix 230 may be used to perform random propagation of node features in a mini-batch manner.
  • in a training step or training iteration, which may be generally denoted as the n-th training step or iteration, a batch of labeled nodes L_n may be sampled with batch size b_l, and a batch of unlabeled nodes U_n may be sampled with batch size b_u.
  • the batch of labeled nodes L_n includes node 1 and the batch of unlabeled nodes U_n includes node 7.
  • the augmented feature vector X̄_s of node s ∈ L_n ∪ U_n is calculated via equation (2): X̄_s = Σ_v π̂_s (v) · z_v · X_v   (2), where π̂_s (v) is the element of the sparsified mixed-order adjacency vector of node s corresponding to neighboring node v, z_v is a random drop mask element, and X_v is the feature vector of node v.
  • M augmented feature vectors are generated for node s by repeating this procedure shown in equation (2) for M times.
  • M augmented feature vectors 240-1 to 240-M for each of nodes 1 and 7 in the batch are obtained.
  • the time complexity and memory complexity of each batch are both bounded by O(k · h_0 · (b_l + b_u)), which is independent of the graph size.
  • a drop-node method is used to perform the random propagation of node feature. Specifically, some nodes’ entire feature vectors are dropped or removed from the feature vectors of the neighboring nodes of node s by randomly setting entire feature vectors of some of the neighboring nodes to 0.
  • the element z_i in the drop-node mask z takes the value 0 with a probability equal to the dropping rate δ and takes the value 1 with a probability of 1-δ, that is, z_i ~ Bernoulli(1-δ).
  • the augmented feature vectors of node s may be scaled with a factor of 1/(1-δ) so as to make the augmented feature vectors in expectation equal to the input feature vectors of the feature matrix X.
  • the drop-node method enables each node to aggregate information only from a subset of its neighbors by completely ignoring some neighboring nodes' features, which reduces its dependency on particular neighbors and thus helps increase the model's robustness against adversarial actions. It is appreciated that the neighbors of a node may include one-hop neighbors and multi-hop neighbors.
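  • a minimal sketch of the drop-node random propagation described above (illustrative only; the variable names and the dense feature-matrix format are assumptions) could look as follows:

        import numpy as np

        def random_propagate(pi_s, X, delta, rng):
            # pi_s:  dict mapping neighboring node v to the sparsified weight pi_s(v)
            # X:     feature matrix of the graph, shape (num_nodes, h0)
            # delta: dropping rate; each neighbor is dropped with probability delta
            x_bar = np.zeros(X.shape[1])
            for v, weight in pi_s.items():
                z_v = rng.binomial(1, 1.0 - delta)       # z_v ~ Bernoulli(1 - delta)
                # scale kept features by 1 / (1 - delta) so that the expectation equals
                # the propagation of the original, unperturbed features
                x_bar += weight * z_v * X[v] / (1.0 - delta)
            return x_bar

        # e.g., M augmentations for node s:
        # augmented = [random_propagate(pi_s, X, delta, np.random.default_rng(m))
        #              for m in range(M)]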
  • the perturbation of the feature vectors of neighboring nodes of a node s may be performed in other ways.
  • the perturbation of the feature vectors of neighboring nodes such as neighboring nodes 1, 2 and 4 of node 1 may be performed by using a dropout method.
  • the dropout method may perturb the feature vectors of neighboring nodes by randomly setting some elements of the feature vectors of neighboring nodes to 0.
  • the dropout method uses a drop mask in which each element z_ij takes the value 0 with a probability of δ and takes the value 1 with a probability of 1-δ.
  • the z_ij is obtained from a Bernoulli distribution, that is, z_ij ~ Bernoulli(1-δ).
  • the drop node mask vector shown in Fig. 2 is replaced with the dropout mask matrix for performing the random dropout of some elements of the feature vectors of neighboring nodes.
  • each feature vector X_v as shown in equation (2) and Fig. 2 may be transformed into a low-dimensional hidden vector H_v = X_v · W^(0) with a linear transformation layer before random feature propagation, where W^(0) is a learnable transformation matrix. Then the augmented feature vector of node s may be obtained by performing random propagation with the low-dimensional feature vectors H_v as shown in equation (3): X̄_s = Σ_v π̂_s (v) · z_v · H_v   (3).
  • each of the augmented feature vectors may be fed into the GNN model 250, such as the illustrated two-layer MLP model 250, to get respective outputs ỹ_s^(m), where 1 ≤ m ≤ M.
  • the output ỹ_s^(m) denotes the classification prediction probabilities of node s on the m-th augmented feature vector, and θ denotes the parameters of the model 250.
  • the dimension C of the illustrated outputs is one; it is appreciated that the dimension C of the output may be larger depending on the number of classifications to be predicted by the model 250. It is appreciated that the two-layer MLP model 250 is exemplary, and other numbers of layers of the MLP model are applicable.
  • a loss L 260 may be obtained based on the classification predictions ỹ_s^(m). Then the parameter weights of the GNN model 250 may be updated based on the loss L 260. For example, the weights of the GNN model 250 may be updated by back-propagating the loss L 260 through the GNN model 250.
  • the weights of the GNN 250 and the weights of learnable transformation matrix W (0) may be updated based on the loss L.
  • the loss function is a combination of the supervised loss on the labeled nodes and the graph regularization loss. Given the M data augmentations of each node generated through the random propagation 240, a consistency regularized loss may be employed for the semi-supervised learning.
  • the supervised loss of the graph node classification task is defined as the average cross-entropy loss over the M augmentations: L_sup = -(1/(M·|L_n|)) · Σ_(s∈L_n) Σ_(m=1)^M Y_s^T · log ỹ_s^(m)   (4).
  • the prediction consistency among M augmentations may be optimized for unlabeled data by using a consistency regularization loss.
  • for each unlabeled node s, the distribution center may be calculated by taking the average of its M prediction probabilities, i.e., ȳ_s = (1/M) · Σ_(m=1)^M ỹ_s^(m). Then a sharpening process may be performed over the average prediction probability to obtain a pseudo label for node s.
  • the probability on the j-th classification of node s is obtained via the sharpening operation: ȳ′_s (j) = ȳ_s (j)^(1/τ) / Σ_(c=1)^C ȳ_s (c)^(1/τ)   (5).
  • 0 < τ ≤ 1 is a hyperparameter to control the sharpness of the obtained pseudo label ȳ′_s. As the value of τ decreases, ȳ′_s is enforced to become sharper and eventually converges to a one-hot distribution.
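  • as an illustration of the averaging and sharpening steps (a sketch with assumed variable names), the pseudo label of an unlabeled node could be computed as:

        import numpy as np

        def sharpen(predictions, tau):
            # predictions: array of shape (M, C) with the M prediction probability vectors
            y_bar = np.mean(predictions, axis=0)    # distribution center over M augmentations
            powered = y_bar ** (1.0 / tau)
            return powered / powered.sum()          # sharpened pseudo-label distribution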
  • the consistency loss on the unlabeled node batch U_n may be obtained via: L_con = (1/(M·|U_n|)) · Σ_(s∈U_n) Σ_(m=1)^M D(ȳ′_s, ỹ_s^(m))   (6), where D(·, ·) is a distance function.
  • a confidence-aware consistency loss may be employed to further improve effectiveness of the consistency loss.
  • the confidence-aware consistency loss on the unlabeled node batch U_n may be obtained via: L_con = (1/(M·|U_n|)) · Σ_(s∈U_n) I(max_j ȳ_s (j) > γ) · Σ_(m=1)^M D(ȳ′_s, ỹ_s^(m))   (7), where I(·) is an indicator function and γ is a confidence threshold.
  • the distance function may be a function for calculating L2 distance.
  • the distance function may be a function for calculating the KL divergence. In the embodiment shown in equation (7), only highly confident unlabeled nodes whose maximum value of the prediction center ȳ_s exceeds γ are considered when obtaining the consistency loss. In this way, the potential training noise induced by uncertain pseudo-labels may be reduced and thus the effectiveness of the consistency loss may be improved.
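  • the following sketch combines the supervised cross-entropy loss and the confidence-aware consistency loss (illustrative only; the formulas are reconstructed from the description above and a squared L2 distance is assumed as the distance function):

        import numpy as np

        def supervised_loss(preds_labeled, labels):
            # preds_labeled: shape (M, b_l, C), prediction probabilities for the labeled batch
            # labels:        shape (b_l, C), one-hot classification labels
            eps = 1e-12
            return -np.mean(np.sum(labels * np.log(preds_labeled + eps), axis=-1))

        def consistency_loss(preds_unlabeled, tau, gamma):
            # preds_unlabeled: shape (M, b_u, C), prediction probabilities for the unlabeled batch
            # tau: sharpening temperature, gamma: confidence threshold
            y_bar = preds_unlabeled.mean(axis=0)                      # (b_u, C) prediction centers
            sharpened = y_bar ** (1.0 / tau)
            sharpened /= sharpened.sum(axis=-1, keepdims=True)        # pseudo labels
            confident = (y_bar.max(axis=-1) > gamma).astype(float)    # keep only confident nodes
            dist = ((preds_unlabeled - sharpened) ** 2).sum(axis=-1)  # squared L2 distance, (M, b_u)
            return np.mean(confident * dist)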
  • the final loss for model optimization may be obtained based on the supervised classification loss and the consistency regularization loss: L = L_sup + λ · L_con   (8).
  • λ is a weight that controls the balance of the two losses.
  • the parameters θ of the GNN model 250 may be updated by gradient descent: θ ← θ - η · ∇_θ L, where η denotes the learning rate.
  • a dynamic scheduling strategy may be used to determine the value of λ in equation (8).
  • λ is obtained via a linear warm-up schedule: λ linearly increases from λ_min to λ_max during the first training steps and remains constant in the following training steps.
  • the weight λ may be limited to a small value in the early stage of training when the generated pseudo-labels are not very reliable, which may help the model converge.
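  • a simple linear warm-up (an assumed concrete form of the dynamic scheduling strategy described above) could be:

        def dynamic_lambda(step, warmup_steps, lam_min, lam_max):
            # linearly increase lambda from lam_min to lam_max, then keep it constant
            if step >= warmup_steps:
                return lam_max
            return lam_min + (lam_max - lam_min) * step / warmup_steps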
  • in the inference stage, predictions for all unlabeled nodes of the graph may be inferred at the same time by applying the propagated feature matrix of the graph to the trained model.
  • the feature matrix X is rescaled with (1 - ⁇ ) so that the rescaled features are identical with the expectation of the perturbed features with dropping rate ⁇ in the model training. Unlike training, the above process only needs to be performed once for each node of the graph during model inference, and the computational cost is acceptable in practice. In another embodiment, the rescale operation with (1 - ⁇ ) may not be performed if the corresponding scaling operation has been performed during training.
  • the process of training the GNN model for performing a classification task on graph data and obtaining the classification prediction for the nodes of the graph may be illustrated as the following pseudocode in table 2.
  • although the drop-node method is illustrated in Fig. 2 as the specific method for performing the perturbation of the feature vectors of the neighboring nodes, other perturbation methods such as dropout, random perturbing methods and so on may be employed.
  • the MLP model 250 may be employed as the GNN model, and other GNN models such as the GCN, the GAT and so on may also be employed as the GNN model.
  • the supervised classification loss via equation (4) and the consistency regularization loss via equation (6) or (7) are employed as the loss function.
  • the specific methods for calculating the supervised classification loss and the consistency regularization loss are not limited to the specific equations.
  • the loss function is not limited to the combination of the supervised classification loss and the consistency regularization loss; at least one of them may be employed as the loss function under different circumstances.
  • Fig. 3 illustrates an exemplary process 30 for training a Graph Neural Network (GNN) model to perform a task of classifying nodes of a graph based on semi-supervised learning according to an embodiment.
  • GNN Graph Neural Network
  • the exemplary process 30 may be implemented by a computer, in which programs are executed by one or more processors to perform the process 30.
  • a batch of labeled nodes and a batch of unlabeled nodes are sampled from the nodes of the graph.
  • the graph comprises the nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes being represented by a corresponding feature vector of the feature matrix. It is appreciated that the batch is a subset of the nodes of the graph, and the number of nodes in the batch is far less than the number of nodes of the graph in practical applications.
  • a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node.
  • the neighboring nodes of the node are indicated by the adjacency vector of the node.
  • a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by respectively applying the plurality of augmented feature vectors of the node to the GNN model.
  • a loss is obtained based on the classification predictions of the respective nodes in the batch of labeled nodes and the batch of unlabeled nodes.
  • parameters of the GNN model are updated based on the loss.
  • a mixed-order adjacency matrix corresponding to a plurality of adjacency orders is generated for at least a part of the nodes of the graph based on the adjacency matrix of the graph.
  • the adjacency vector of each node in the batch of labeled nodes and the batch of unlabeled nodes is a corresponding mixed-order adjacency vector of the mixed-order adjacency matrix.
  • the mixed-order adjacency matrix corresponding to a plurality of adjacency orders is generated for a part or subset of the nodes of the graph based on the adjacency matrix of the graph.
  • the adjacency vector of each node in the batch of labeled nodes and the batch of unlabeled nodes is a corresponding mixed-order adjacency vector of the mixed-order adjacency matrix.
  • the mixed-order adjacency matrix is actually a sub-matrix for the sampled nodes, which is generated by using the above mentioned GFPush method.
  • the mixed-order adjacency matrix may be generated by performing a weighted sum of a plurality of per-order adjacency matrices corresponding to the plurality of adjacency orders, wherein one of the plurality of per-order adjacency matrices corresponds to an adjacency order t, and an element in the per-order adjacency matrix represents a probability that a t-step random walk goes from a source node associated with the element to a target node associated with the element.
  • the mixed-order adjacency matrix may be generated by generating a mixed-order adjacency vector for each of the at least a part of the nodes of the graph by iteratively performing one-step random walk based on the adjacency matrix of the graph.
  • the generated respective mixed-order adjacency vectors for the at least a part of nodes constitute the mixed-order adjacency matrix.
  • the batch of labeled nodes and the batch of unlabeled nodes are sampled from a subset of the nodes of the graph.
  • a mixed-order adjacency vector for each of the subset of nodes may be generated by iteratively performing one-step random walk based on the adjacency matrix of the graph.
  • the generated respective mixed-order adjacency vectors for the subset of nodes constitute the mixed-order adjacency matrix.
  • the mixed-order adjacency vector for each of the at least a part of the nodes of the graph may be generated by iteratively, from the lowest to the highest of the plurality of adjacency orders, calculating a plurality of reserved probability mass vectors and a plurality of residue probability mass vectors corresponding to the plurality of adjacency orders for the node, and generating the mixed-order adjacency vector for the node by performing weighted sum of the plurality of reserved probability mass vectors for the node.
  • the calculating of the plurality of reserved probability mass vectors and the plurality of residue probability mass vectors is performed based on an adjacency degree matrix of the graph and/or a predefined threshold.
  • the generating of the mixed-order adjacency vector for each of the at least a part of the nodes of the graph may further comprise: retaining a predefined number of largest elements of the mixed-order adjacency vector of the node while setting the other elements of the mixed-order adjacency vector to zero.
  • the weighted sum of the plurality of reserved probability mass vectors for the node may be performed by applying weights α(1-α)^t to the plurality of reserved probability mass vectors, wherein α is a decay factor and t is one of the plurality of adjacency orders; or by using an averaged weight corresponding to the plurality of adjacency orders; or by using a single-order weight in which one of the weights is set to 1 and the others are set to 0.
  • a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by randomly propagating feature vectors of neighboring nodes of the node based on the adjacency vector of the node and a dropping mask.
  • the dropping mask is configured to randomly drop out at least partial features of at least a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes.
  • the dropping mask is configured to randomly drop out entire features of a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes.
  • the dropping mask is configured to randomly drop out the entire features of a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes based on a dropping rate, and the features of the remaining neighboring nodes are scaled up based on the dropping rate.
  • the loss comprises a supervised classification loss and a consistency regularization loss.
  • the supervised classification loss is obtained based on classification labels of the batch of labeled nodes
  • the consistency regularization loss is obtained based on classification predictions of the batch of unlabeled nodes.
  • the supervised classification loss is a cross-entropy loss based on the classification labels and the classification predictions corresponding to the batch of labeled nodes
  • the consistency regularization loss is a distance loss based on the classification predictions of the batch of unlabeled nodes.
  • the consistency regularization loss is set to zero if the maximum of the classification predictions of the batch of unlabeled nodes is less than a threshold or if the maximum of respective averaged classification predictions of the batch of unlabeled nodes is less than a threshold.
  • the method 30 is performed repetitively in each of a plurality of training steps, which may be referred to as training loops or training iterations, wherein the loss is obtained by performing weighted sum of the supervised classification loss and a consistency regularization loss with dynamic weights in each of the plurality of training steps.
  • the feature vectors of neighboring nodes of the node are transformed with a linear transformation matrix to respective lower-dimension feature vectors having a lower dimension than the feature vectors of the neighboring nodes of the node.
  • the plurality of augmented lower-dimension feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by randomly propagating the lower-dimension feature vectors of neighboring nodes of the node
  • the plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by respectively applying the plurality of augmented lower-dimension feature vectors of the node to the GNN model
  • parameters of the GNN model and parameters of the linear transformation matrix are updated based on the loss.
  • the GNN model comprises one of a Multilayer Perceptron (MLP) model, a Graph Convolutional Network (GCN) model, and a Graph Attention Network (GAT) model.
  • MLP Multilayer Perceptron
  • GCN Graph Convolutional Network
  • GAT Graph Attention Network
  • the process 30 illustrated in Fig. 3 may be implemented as a method for training a Graph Neural Network (GNN) model to perform a task of classifying accounts on a social network or a financial network.
  • GNN Graph Neural Network
  • the GNN model for classifying the accounts on the social network or the financial network may be trained using the method as described above with reference to Figs. 1-3, wherein the nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
  • Fig. 4 illustrates an exemplary process 40 for classifying a node of a graph according to an embodiment. It is appreciated that the exemplary process 40 may be implemented by a computer, in which programs are executed by one or more processors to perform the process 40.
  • a Graph Neural Network (GNN) model may be trained by using the method as described above with reference to Figs. 1-3.
  • a classification label for the node of the graph may be predicted by applying the feature matrix of the graph to the trained GNN model.
  • the method 40 may also include raising an alarm if the label of the node is a predefined label.
  • the predefined label of the node may be a label indicating that the node is an adversarial node, a fraud node or the like, such as an adversarial node in a social network, a financial network or the like.
  • the process 40 illustrated in Fig. 4 may be implemented as a method for classifying an account on a social network or a financial network.
  • the GNN model for classifying the accounts on the social network or the financial network may be trained using the method as described above with reference to Figs. 1-3, and a classification label for the account on the social network or the financial network may be predicted with the trained GNN model.
  • the nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
  • Fig. 5 illustrates an exemplary computing system 50 according to an embodiment.
  • the computing system 50 may comprise at least one processor 510.
  • the computing system 50 may further comprise at least one storage device 520.
  • the storage device 520 may store computer-executable instructions that, when executed, cause the processor 510 to: sample a batch of labeled nodes and a batch of unlabeled nodes from the nodes of the graph, wherein the graph comprises the nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes being represented by a corresponding feature vector of the feature matrix; obtain a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node, wherein the neighboring nodes of the node are indicated by the adjacency vector of the node; obtain a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented feature vectors of the node to the GNN model; obtain a loss based on the classification predictions of the nodes in the batch of labeled nodes and the batch of unlabeled nodes; and update parameters of the GNN model based on the loss.
  • the storage device 520 may store computer-executable instructions that, when executed, cause the processor 510 to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-4.
  • the embodiments of the present disclosure may be embodied in a computer-readable medium such as non-transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-4.
  • the embodiments of the present disclosure may be embodied in a computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-4.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Technology Law (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method for training a Graph Neural Network (GNN) model to perform a task of classifying nodes of a graph based on semi-supervised learning. The method comprises: sampling a batch of labeled nodes and a batch of unlabeled nodes from the nodes of the graph, wherein the graph comprises nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes of the graph being represented by a corresponding feature vector of the feature matrix; obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node; obtaining a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented feature vectors of the node to the GNN model; obtaining a loss based on the classification predictions of the nodes in the batch of labeled nodes and the batch of unlabeled nodes; and updating parameters of the GNN model based on the loss.

Description

METHOD AND APPARATUS FOR CLASSIFYING NODES OF A GRAPH
FIELD
Aspects of the present disclosure relate generally to artificial intelligence, and more particularly, to performing a task of classifying nodes of a graph using a Graph Neural Network (GNN) model based on semi-supervised learning.
BACKGROUND
Graph data has been widely used in many real-world applications, such as social networks, financial systems, biological networks, citation networks, recommendation systems, etc. Node classification is one of the most important tasks on graphs. Deep learning models for graphs, such as GNN models, have achieved good results in the task of node classification on graphs. Given a graph with labels associated with a subset of nodes, the GNN model may predict the labels for the rest of the nodes.
Despite the great success achieved by GNNs, there are two limitations in GNN-based semi-supervised learning methods. The first one is limited generalization. Conventional GNNs only use a supervised objective function for model training, which makes the GNN model prone to overfit the limited labeled samples, thereby degrading the prediction performance on unlabeled samples and leading to poor reliability of the GNN model. The other one is weak scalability: conventional GNNs adopt a full-graph training method with an expensive recursive feature propagation procedure, inducing enormous time and memory overhead when processing large graphs.
Enhancements are needed to improve the generalization and scalability of the GNN model.
SUMMARY
In order to address problems of semi-supervised learning on graphs such as limited generalization and weak scalability, a novel GNN framework is proposed in this disclosure in an effort to alleviate the two limitations of generalization and scalability simultaneously.
According to an embodiment, there provides a computer-implemented method for training a Graph Neural Network (GNN) model to perform a task of classifying nodes of a graph based on semi-supervised learning. The method comprises: sampling a batch of labeled nodes and a batch of unlabeled nodes from the nodes of the graph, wherein the graph comprises nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes of the graph being represented by a corresponding feature vector of the feature matrix; obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node; obtaining a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented feature vectors of the node to the GNN model; obtaining a loss based on the classification predictions of the nodes in the batch of labeled nodes and the batch of unlabeled nodes; and updating parameters of the GNN model based on the loss.
According to an embodiment, there provides a computer-implemented method for classifying a node of a graph. The method comprises: training a Graph Neural Network (GNN) model by using the method as mentioned above as well as the method according to aspects of the disclosure; predicting a classification label for the node of the graph by applying the feature matrix of the graph to the trained GNN model.
According to an embodiment, there provides a computer-implemented method for training a Graph Neural Network (GNN) model to perform a task of classifying accounts on a social network or a financial network. The method comprises: training the GNN model for classifying the accounts on the social network or the financial network using the method as mentioned above as well as the method according to aspects of the disclosure, wherein the nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
According to an embodiment, there provides a computer-implemented method for classifying an account on a social network or a financial network. The method comprises: training a GNN model for classifying the accounts on the social network or the financial network using the method as mentioned above as well as the method according to aspects of the disclosure; and predicting a classification label for the account on the social network or the financial network with the trained GNN model, wherein the nodes represent the accounts, the edges represent the relation among the accounts, and the graph represents the social network or the financial network.
According to an embodiment, there provides a computer system, which comprises one or more processors and one or more storage devices storing computer-executable instructions that, when executed, cause the one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.
According to an embodiment, there provides one or more computer readable storage media storing computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.
According to an embodiment, there provides a computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method as mentioned above as well as to perform the operations of the method according to aspects of the disclosure.
By utilizing the batch-based random propagation of the node features, each node in the batch is enabled to be insensitive to specific neighborhoods, increasing the reliability of the classification prediction of the model framework while decreasing computational resource requirements such as time and memory cost. Other advantages and enhancements are explained in the description hereafter.
BRIEF DESCRIPTION OF THE DRAWINGS
The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
Fig. 1 illustrates an exemplary GCN model according to an embodiment.
Fig. 2 illustrates an exemplary schematic diagram for training a GNN model according to an embodiment.
Fig. 3 illustrates an exemplary process for training a GNN model to perform a task of classifying nodes of a graph based on semi-supervised learning according to an embodiment.
Fig. 4 illustrates an exemplary process for classifying a node of a graph according to an embodiment.
Fig. 5 illustrates an exemplary computing system according to an embodiment.
DETAILED DESCRIPTION
The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
Various embodiments will be described in detail with reference to the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. References made to particular examples and embodiments are for illustrative purposes, and are not intended to limit the scope of the disclosure.
The present disclosure describes a method and a system, implemented as computer programs executed on one or more computers, which are used to train a GNN model to perform a task of classifying nodes of a graph or to classify a node of a graph. As an example, the GNN model may be implemented as a graph convolution network (GCN) model, and may perform a machine learning task of classifying nodes in a graph, which may for example represent a social network, biological network, citation network, recommendation system, financial system, etc. The aspects of the disclosure may be applied in fields such as social networks, biological networks, citation networks, recommendation systems, financial systems and so on to improve the security and robustness of these systems.
Fig. 1 illustrates an exemplary GCN model 10 according to an embodiment.
A graph is fed as input 110 of the GCN model 10. The graph may be a dataset that contains nodes and edges. The nodes in the graphs may represent entities, and the edges represent the connections between nodes. For example, a social network is a graph in which users or particularly user accounts in the network are nodes in the graph. An edge exists when two users are connected in some way. For example, the two users are friends, share one’s posts, have similar interests, have similar profiles, or the like, then the two users may have a connection which is represented by the edge. For example, a financial network is a graph in which users or particularly user accounts in the network are nodes in the graph. An edge exists when two users are connected in some way. For example, the two users have remittance transfer relation, employee relation, similar deposit, similar investment preference, similar profiles, or the like, then the two users may have a connection which is represented by the edge.
A graph may be denoted as G = (V, E), where V is a set of |V| nodes, v ∈ V refers to a data sample corresponding to a node, and E ⊆ V × V is a set of |E| edges between nodes. In an example, the graph as the input 110 may be formulated as G = (A, X), where A ∈ {0, 1}^(|V|×|V|) represents the adjacency matrix of the graph G, with each element A (s, v) = 1 indicating that there exists an edge between nodes s and v and A (s, v) = 0 indicating that there is no edge between nodes s and v. D is the diagonal degree matrix where D (s, s) = ∑_v A (s, v), representing the number of connections the node s has in the graph. The graph G with added self-loop connections may be denoted as G̃, which may be referred to as the self-loop augmented graph. The adjacency matrix of the graph G̃ is Ã = A + I and the degree matrix of the graph G̃ is D̃ = D + I, where I denotes the unit matrix. X ∈ R^(|V|×h_0) represents the feature matrix of the graph G, where |V| stands for the number of nodes of the graph G and h_0 stands for the dimension of the feature of the nodes. Therefore, the adjacency matrix A or Ã may represent the connections among the nodes in the graph G or G̃, and the feature matrix X may represent the features of the respective nodes in the graph. The feature of a node may include multiple feature components, the number of which is defined as the dimension of the node feature. For example, for a graph of a social network, the feature components of a node may include age, gender, hobby, career, various actions such as shopping, reading, listening to music, and so on. It is appreciated that aspects of the disclosure are not limited to specific values of the elements of the adjacency matrix and the feature matrix.
A label matrix Y ∈ {0, 1}^(|V|×C) may denote the labels of the nodes of the graph, where C stands for the dimension of the classification or the number of classification labels. Therefore, each node s of a graph is associated with a feature vector X_s and a label vector Y_s ∈ {0, 1}^C. For semi-supervised classification, a limited number of nodes L ⊂ V (0 < |L| << |V|) have their observed labels Y_L, and the remaining nodes U = V - L do not have the observed labels. The objective of semi-supervised learning is to infer the missing labels Y_U for the unlabeled nodes U based on the graph structure G, the node features X and the observed labels Y_L. In the disclosure, for the sake of description, for a matrix M, M_i denotes its i-th row vector and M (i, j) denotes the element of the i-th row and the j-th column.
Graph neural networks (GNNs) have emerged as a powerful approach for semi-supervised graph learning the predictive function. The GCN model 10 is an exemplary GNN and may be used to learn and implement the predictive function. The GCN model may include one or multiple hidden layers 120, which are also referred to as graph convolutional layers 120. Each hidden layer 120 may receive and process a graph-structured data. For example, the hidden layer 120 may perform convolution operation on the data. The weights of the convolution operations in the hidden layer 120 may be trained with training data. It is appreciated that other operations may be included in the hidden layer 120 in addition to the convolution operation. Each activation engine 130 may apply an activation function (e.g., ReLU) to the output from a hidden layer 120 and send the output to the next hidden layer 120. A fully-connected layer or a softmax engine 140 may provide an output 150 based on the output of the previous hidden layer. In the node classification task, the output 150 of the GCN model 10 may be classification labels or particularly classification probabilities for nodes in the graph. For example, the GCN model 10 may employ a propagation rule
Figure PCTCN2021132082-appb-000011
where
Figure PCTCN2021132082-appb-000012
is the symmetric normalized adjacency matrix of
Figure PCTCN2021132082-appb-000013
σ (. ) denotes the activation function such as the ReLU function, W  (l) is the weight matrix of the layer l and H  (l) is the hidden node representation in the layer l with H  (0) = X.
The node classification task of the GCN model 10 is to determine the classification labels of nodes of the graph based on their neighbors. Particularly, given a subset of labeled nodes in the graph, the goal of the classification task of the GCN model 10 is to predict the labels of the remaining unlabeled nodes in the graph.
In an example, the GCN model 10 may be a two-layer GCN model: 
Figure PCTCN2021132082-appb-000014
where
Figure PCTCN2021132082-appb-000015
is a normalized adjacency matrix, 
Figure PCTCN2021132082-appb-000016
is a degree matrix of adjacency matrix A or
Figure PCTCN2021132082-appb-000017
and
Figure PCTCN2021132082-appb-000018
are parameter matrices of two hidden layers 120, where d H denotes the dimension of the hidden layer, C denotes the number of the categories of the classification labels, and σ (x) is the activation function 130, for example, σ (x) = ReLU (x) . 
Figure PCTCN2021132082-appb-000019
is the output matrix 150, representing the probability of each node to each classification label in the graph.
Fig. 2 illustrates an exemplary schematic diagram 20 for training a GNN model according to an embodiment.
Taking the graph G= (A, X) as shown in Fig. 2 as example. As illustrated, the graph 210 includes nodes represented by feature matrix X and edges represented by adjacent matrix A. The values shown in the graph are examples of features of the nodes. It is known that the disclosure is not limited to the specific node features or feature values. For sake of illustration, seven nodes are included in the graph. It is appreciated that a graph G may include much more nodes in practical application. In the exampled graph G, there are two labeled  nodes  1 and 6 and five unlabeled nodes 2-5 and 7. The task is to training the GNN model 250 by means of semi-supervised learning so that the trained GNN model 250 can predict the classification labels of the unlabeled nodes 2-5 and 7.
A propagation matrix may be used to propagate the features of the nodes of the graph. In an example, the feature propagation for the features of the nodes may be performed based on the adjacent matrix A or
Figure PCTCN2021132082-appb-000020
In the embodiment of the Fig. 2, a mixed-order propagation may be employed to perform the feature propagation in order to exploit and incorporate more local information of the adjacency relations among nodes, reducing the risk of over-smoothing. In order to perform the mixed-order propagation, a mixed-order adjacency matrix Π may be utilized as shown in equation (1) :
Figure PCTCN2021132082-appb-000021
where
Figure PCTCN2021132082-appb-000022
and w t≥ 0. In an example, row normalization may be used for
Figure PCTCN2021132082-appb-000023
The t-order normalized adjacency matrix
Figure PCTCN2021132082-appb-000024
is also the t-order random walk reverse transition matrix of the graph
Figure PCTCN2021132082-appb-000025
where the element P t (s, v) denotes the probability that a t-step random walk goes from source node s to target node v.
In the example shown in equation (1) , the generalized mixed-order matrix Π uses a set of tunable weights {w t |0 ≤ t ≤ T} to fuse different orders of adjacency matrices. By adjusting w t, the framework of the embodiment illustrated in Fig. 2 can flexibly manipulate the importance of different orders of neighborhoods to suit diverse graphs in the real world. In an example, the weight w t may be set as α (1-α)  t and accordingly the equation (1) becomes
Figure PCTCN2021132082-appb-000026
which may be referred to as a truncated personalized page-rank (PPR) matrix. It is appreciated that
Figure PCTCN2021132082-appb-000027
approaches 1 as more as the largest order T is bigger, this situation also conform substantially to the above condition
Figure PCTCN2021132082-appb-000028
In another example, the weight w t may be set as 1/ (T + 1) and accordingly the equation (1) becomes
Figure PCTCN2021132082-appb-000029
which may be referred to as a average pooling matrix. In another example, the weight w t may be set as w t=1 when t = T and w t= 0 otherwise, and accordingly the equation (1) becomes Π single=P T, which may be referred to as a single order matrix.
It is appreciated that calculating the mixed-order adjacency matrix as  shown in equation (1) may be computationally inefficient because large amount of memory capacity, computing capacity and computing time are required. On the other hand, when all the samples of the graph such as graph 210 are taken as the input of the GNN mode 250, the size of the graph may influence the requirements of the computing resources such as the memory capacity, computing capacity and so on, therefore a problem for computing resources may be risen for large size of graphs.
In an embodiment, an efficient push-flow method may be employed to generate an error-bounded approximation for each row vector Π s of the mixed-order adjacency matrix Π. The push-flow method may be referred to as a Generalized Forward Push (GFPush) method, which has the ability to approximate the generalized mixed-order random walk transition vector Π s of a node s. The core idea of GFPush is to simulate a T-step random walk probability diffusion process from node s with a series of pruning operations on probability masses. In an implementation of the GFPush, a pair of vectors at each step t (0 ≤ t ≤ T ) is maintained. One of the pair of vectors may be a reserve vector
Figure PCTCN2021132082-appb-000030
denoting the probability masses reserved at step t. The other one of the pair of vectors may be a residue vector
Figure PCTCN2021132082-appb-000031
representing the probability masses to be diffused beyond step t.
Table 1
Figure PCTCN2021132082-appb-000032
Table 1 shows the pseudo-code of the GFPush method. At beginning, r  (0) and q  (0) are set to the indicator vector e  (s) , where
Figure PCTCN2021132082-appb-000033
and
Figure PCTCN2021132082-appb-000034
for v ≠ s, representing that the random walk starts from s with the probability mass of 1. Other  reserve and residue vectors (r  (t) and q  (t) , t ≥ 1) are set to
Figure PCTCN2021132082-appb-000035
Then the method repeats multiple iterations from step 0 to step T-1. At step t, the GFPush method conducts push operation (Line 5–9 of the pseudo-code) on each node v that satisfies
Figure PCTCN2021132082-appb-000036
Figure PCTCN2021132082-appb-000037
where
Figure PCTCN2021132082-appb-000038
is the degree of node v in self-loop augmented graph
Figure PCTCN2021132082-appb-000039
r max is a predefined threshold. In each push operation, the current residue
Figure PCTCN2021132082-appb-000040
of node v is uniformly spread to node v’s neighbor nodes and the results are stored into the residue vector r  (t+1) of the next step t+1. And each updated residue
Figure PCTCN2021132082-appb-000041
is assigned to
Figure PCTCN2021132082-appb-000042
After that, 
Figure PCTCN2021132082-appb-000043
is reset to 0.
Intuitively, in the procedure of the GFPush method, 
Figure PCTCN2021132082-appb-000044
is the conditional probability that a random walk moves from node v to a neighboring node u, conditioned on it reaching v with probability
Figure PCTCN2021132082-appb-000045
at step t. Thus, the push operation on node v can be seen as a one-step random walk probability diffusion from v to its one-hop neighbor nodes. In an embodiment, in order for efficiency, the GFPush method can only conducts push operations for node v whose residue value is greater than d v·r max. It is appreciated that the GFPush method can also conduct push operations for every neighbor node v in the graph
Figure PCTCN2021132082-appb-000046
that is, the neighboring node v’s residue value is not limited to be greater than d v·r max.
After the last iteration, the GFPush method returns
Figure PCTCN2021132082-appb-000047
as result, which is an approximation of the mixed-order adjacency vector Π s for node s. For a set of nodes, the mixed-order adjacency vector Π s of each node s may be obtained using the GFPush method and the set of the mixed-order adjacency vectors for the set of nodes may be used as an approximation of the mixed-order adjacency matrix Π, and thus may be referred to as mixed-order adjacency matrix
Figure PCTCN2021132082-appb-000048
In an embodiment, in order to further reduce training cost, a top-k sparsification may be performed for the mixed-order adjacency vector
Figure PCTCN2021132082-appb-000049
of each node s. In this procedure, only the top-k largest elements of the mixed-order adjacency vector
Figure PCTCN2021132082-appb-000050
are preserved and other elements of the mixed-order adjacency vector
Figure PCTCN2021132082-appb-000051
are set to 0. Hence the resultant sparsified mixed-order adjacency vector
Figure PCTCN2021132082-appb-000052
has at most k non-zero elements. According to the theory of Escaping Mass of lazy random walk, which says the probability that a T-hop lazy random walk starting from node s will concentrate around a local cluster of node s, the sparsified vector
Figure PCTCN2021132082-appb-000053
is still expected to be effective for feature propagation by preserving most of local neighborhood nodes for node s. It is appreciated that the mixed-order adjacency vectors
Figure PCTCN2021132082-appb-000054
for different nodes may be calculated in parallel, for example, by using multi-thread programming, and thus the computing time of the mixed-order adjacency matrix
Figure PCTCN2021132082-appb-000055
for a set of nodes may be reduced.
Return to Fig. 2, the graph 210 includes a labeled node set L and an unlabeled node set U. In the shown example, the labeled node set L includes  nodes  1  and 6 and the unlabeled node set U includes nodes 2-5 and 7. A subset of nodes may be sampled from the graph 210, the subset of nodes may include at least a part of the labeled nodes and at least a part of the labeled nodes. As the number of labeled nodes in a graph is typically limited, all the labeled nodes of the graph are typically included in the subset. A part of the unlabeled nodes of the graph, especially a large graph, are sampled and included in the subset. The subset of nodes may include the labeled node set L and an unlabeled node subset U′. For example, the unlabeled node subset U′ may include 10000 unlabeled nodes in practical application. It is appreciated that the disclosure is not limited to the specific number of nodes in the unlabeled node subset U′. In the illustrated example, the unlabeled node subset U′ includes the sampled  nodes  3 and 7. Then the subset of nodes of graph 210 includes  nodes  1 and 6 in the labeled node set L and  nodes  3 and 7 in the unlabeled node subset U′.
The mixed-order adjacency vector
Figure PCTCN2021132082-appb-000056
of each node s in the subset of  nodes  1, 3, 6 and 7 may be obtained using the GFPush method, and accordingly a mixed-order adjacency matrix
Figure PCTCN2021132082-appb-000057
220 may be obtained for the subset of  nodes  1, 3, 6 and 7. The top-k sparsification may be performed on the mixed-order adjacency vector
Figure PCTCN2021132082-appb-000058
of each node s in the subset of  nodes  1, 3, 6 and 7 so as to obtain sparsified mixed-order adjacency vector
Figure PCTCN2021132082-appb-000059
of each node s, and accordingly a sparsified mixed-order adjacency matrix
Figure PCTCN2021132082-appb-000060
230 may be obtained for the subset of  nodes  1, 3, 6 and 7. The mixed-order adjacency matrix
Figure PCTCN2021132082-appb-000061
220 and the sparsified mixed-order adjacency matrix 
Figure PCTCN2021132082-appb-000062
230 is a sub-matrix compared to the adjacency matrix A since they only include adjacency vectors of a part of nodes of the graph. The value k is set to 3 in the illustrated example. As the GFPush method has the ability to generate the mixed-order sub-matrix for only a part of nodes of the graph, the efficiency for generating the propagation matrix can be improved significantly, especially for a large size graph including a large amount of nodes.
The sparsified sub-matrix
Figure PCTCN2021132082-appb-000063
230 may be used to perform random propagation of node features in a mini-batch manner. At a training step or training iteration, which may be generally denoted as a n-th training step or iteration, a batch of labeled nodes
Figure PCTCN2021132082-appb-000064
may be sampled with batch size |L n| = b l, and a batch of unlabeled nodes
Figure PCTCN2021132082-appb-000065
may be sampled with batch size |U n| = b u. In the illustrated example in Fig. 2, the batch of labeled nodes L n includes node 1 and the batch of unlabeled nodes U n includes node 7. Then the augmented feature vector
Figure PCTCN2021132082-appb-000066
of node s∈L n∪U n is calculated via equation (2) :
Figure PCTCN2021132082-appb-000067
where
Figure PCTCN2021132082-appb-000068
denotes neighboring nodes of node s, and particularly denotes the indices of the non-zero elements of
Figure PCTCN2021132082-appb-000069
is the feature vector of node v. At each  training step n, M augmented feature vectors
Figure PCTCN2021132082-appb-000070
are generated for node s by repeating this procedure shown in equation (2) for M times. As shown in Fig. 2, M augmented feature vectors 240-1 to 240-M for each of  nodes  1 and 7 in the batch are obtained. The time complexity and memory complexity of each batch are both bounded by O (k·h 0· (b l+b u) ) , which is independent of the graph size.
In an example as illustrated in Fig. 2, a drop-node method is used to perform the random propagation of node feature. Specifically, some nodes’ entire feature vectors are dropped or removed from the feature vectors of the neighboring nodes of node s by randomly setting entire feature vectors of some of the neighboring nodes to 0. Taking the node 1 as an example, its neighboring nodes includes  nodes  1, 2 and 4 as indicated in its mixed-order adjacency vector
Figure PCTCN2021132082-appb-000071
the drop node mask z = [0, 1, 1] for  nodes  1, 2 and 4 is obtained for the first augmentation branch 240-1 based on the Bernoulli distribution, similarly the drop node mask z = [1, 0, 0] for  nodes  1, 2 and 4 is obtained for the M-th augmentation branch 240-M based on the Bernoulli distribution. In the Bernoulli distribution, the element z i in the drop node mask z takes the value 0 with a probability of dropping rate δ and takes the value 1 with a probability of 1-δ, that is, z i ~ Bernoulli (1 -δ) .
In an embodiment, the augmented feature vectors
Figure PCTCN2021132082-appb-000072
of node s may be scaled with a factor of
Figure PCTCN2021132082-appb-000073
so as to make the augmented feature vectors in expectation equal to the input feature vectors of the feature matrix X.
The drop-node method enables each node to aggregate information only from a subset of its neighbors by completely ignoring some neighboring nodes’ features, which reduces its dependency on particular neighbors and thus helps increase the model’s robustness over adversarial action. It is appreciated that the neighbors of a node may include one-hop neighbor and multi-hop neighbor.
Although the drop-node method is illustrated in Fig. 2, it is appreciated that the perturbation of the feature vectors of neighboring nodes of a node s may be performed in other ways. For example, the perturbation of the feature vectors of neighboring nodes such as  neighboring nodes  1, 2 and 4 of node 1 may be performed by using a dropout method. Specifically, the dropout method may perturb the feature vectors of neighboring nodes by randomly setting some elements of the feature vectors of neighboring nodes to 0. In this example, the drop mask
Figure PCTCN2021132082-appb-000074
in which each element z ij takes the value 0 with a probability of δ and takes the value 1 with a probability of 1-δ. In an example, the z ij is obtained from a Bernoulli distribution, that is, z ij ~Bernoulli (1 -δ) . In the example of employing the dropout method to perturb the feature vectors of neighboring nodes, the drop node mask vector shown in Fig. 2 is replaced with the dropout mask matrix for performing the random dropout of some elements of the feature vectors of neighboring nodes.
It is appreciated that there may be different ways to perturb the feature  vectors of neighboring nodes, and the disclosure is not limited to the drop-node method and dropout method.
In practical applications, the feature dimension h 0 of a node of the graph may be extremely large, this may incur huge computational resource cost to calculate the augmented feature vector
Figure PCTCN2021132082-appb-000075
of node s. In an embodiment, in order to decrease the resource requirement for performing random propagation for high-dimensional features, each feature vector X v as shown in equation (2) and Fig. 2 may be transform to a low dimensional hidden vector
Figure PCTCN2021132082-appb-000076
with a linear transformation layer before random feature propagation. Then the augmented feature vector
Figure PCTCN2021132082-appb-000077
of node s may be obtained by performing random propagation with the low dimensional feature vector H v as shown in equation (3)
Figure PCTCN2021132082-appb-000078
where
Figure PCTCN2021132082-appb-000079
denotes a learnable transformation matrix. In this way, the computational complexity of this procedure is reduced to O (k·h· (b l+b u) ) , where h << h 0.
After obtaining the multiple augmented feature vectors
Figure PCTCN2021132082-appb-000080
of the node s, such as
Figure PCTCN2021132082-appb-000081
to
Figure PCTCN2021132082-appb-000082
by performing the random feature propagation for the nodes s for multiple times such as M times, each of the augmented feature vectors may be fed into the GNN model 250 such as the illustrated two-layer MLP model 250 to get respective outputs
Figure PCTCN2021132082-appb-000083
where 1≤m≤M. The outputs
Figure PCTCN2021132082-appb-000084
Figure PCTCN2021132082-appb-000085
denotes the classification prediction probabilities of the node s on the augmented feature vector
Figure PCTCN2021132082-appb-000086
Θ are the parameters of the model 250. Although the dimension C of the illustrated outputs
Figure PCTCN2021132082-appb-000087
is one, it is appreciated that dimension C of the output
Figure PCTCN2021132082-appb-000088
may be bigger depending on the number of classifications to be predicted by the model 250. It is appreciated that the two-layer MLP model 250 is exemplary, other number of layers of the MLP model is applicable.
After obtaining the multiple outputs
Figure PCTCN2021132082-appb-000089
of each node s in the batch for the current training step or training iteration, a loss L 260 may be obtained based on the classification predictions
Figure PCTCN2021132082-appb-000090
Then the parameter weights of the GNN model 250 may be updated based on the loss L 260. For example, the weights of the GNN model 250 may be updated by back propagating the loss L 260 along the GNN 250. In the above embodiment in which the learnable transformation matrix
Figure PCTCN2021132082-appb-000091
is used, the weights of the GNN 250 and the weights of learnable transformation matrix W  (0) may be updated based on the loss L.
In an embodiment, the loss function is a combination of the supervised loss on the labeled nodes and the graph regularization loss. Given the M data  augmentations of each node generated though the random propagation 240, a consistency regularized loss may be employed for the semi-supervised learning.
As a batch of b l labeled nodes L n are sampled for this training step, the supervised loss of the graph node classification task is defined as the average cross-entropy loss over M augmentations:
Figure PCTCN2021132082-appb-000092
where Y s is the observed label of the node s.
In the semi-supervised learning, the prediction consistency among M augmentations may be optimized for unlabeled data by using a consistency regularization loss. In an embodiment, for each node of the batch of unlabeled nodes U n, i.e., for node s ∈U n, the distribution center may be calculated by taking the average of its M prediction probabilities, i.e., 
Figure PCTCN2021132082-appb-000093
Then a sharpening process may be performed over the average prediction probability
Figure PCTCN2021132082-appb-000094
to obtain a pseudo label
Figure PCTCN2021132082-appb-000095
for node s. Formally, the probability on the j-th classification of node s is obtained via:
Figure PCTCN2021132082-appb-000096
where 0 < τ ≤ 1 is a hyperparameter to control the sharpness of the obtained pseudo label. As decreasing the value of τ, 
Figure PCTCN2021132082-appb-000097
is enforced to become sharper and converges to a one-hot distribution eventually.
In an embodiment, the consistency loss on unlabeled node batch U n may be obtained via:
Figure PCTCN2021132082-appb-000098
In another embodiment, a confidence-aware consistency loss may be employed to further improve effectiveness of the consistency loss. The confidence-aware consistency loss on unlabeled node batch U n may be obtained via:
Figure PCTCN2021132082-appb-000099
where
Figure PCTCN2021132082-appb-000100
is an indicator function which outputs 1 if
Figure PCTCN2021132082-appb-000101
holds, and outputs 0 otherwise. 0 ≤ γ ≤ 1 is a predefined threshold. 
Figure PCTCN2021132082-appb-000102
is a distance function which measures the distribution discrepancy between p and q. For example, the distance function may be a function for calculating L2 distance. In another example, the distance function may be a function for calculating KL divergence. In this embodiment shown in equation (7) , only highly confident  unlabeled nodes whose maximum value of prediction center
Figure PCTCN2021132082-appb-000103
exceeds γ are considered when obtaining the consistence loss. In this way, the potential training noise induced by uncertain pseudo-labels may be reduced and thus the effectiveness of the consistency loss may be improved.
The final loss
Figure PCTCN2021132082-appb-000104
for model optimization may be obtained based on the supervised classification loss and the consistency regularization loss:
Figure PCTCN2021132082-appb-000105
where λ is a weight that controls the balance of the two losses.
After obtaining the loss
Figure PCTCN2021132082-appb-000106
the parameters Θ of the GNN model 250 may be updated by gradients descending:
Figure PCTCN2021132082-appb-000107
In an embodiment, a dynamic scheduling strategy may be used to determine the value of λ in equation (8) . Particularly, at the n-th training step or training iteration, λ is obtained via:
λ=min (λ max, λ min+ (λ maxmin) ·n/n max)     (10)
In the first n max training steps, λ linearly increases from λ min to λ max, and remains constant in the following training steps. By using this dynamic loss weight scheduling, the weight λ may be limited to a small value in the early stage of training when the generated pseudo-labels
Figure PCTCN2021132082-appb-000108
are not much reliable, which may help model converge.
After the training of the GNN model, predictions for all unlabeled nodes of the graph may be inferred at the same time using the trained model:
Figure PCTCN2021132082-appb-000109
The feature matrix X is rescaled with (1 -δ) so that the rescaled features are identical with the expectation of the perturbed features with dropping rate δ in the model training. Unlike training, the above process only needs to be performed once for each node of the graph during model inference, and the computational cost is acceptable in practice. In another embodiment, the rescale operation with (1 -δ) may not be performed if the corresponding scaling operation has been performed during training.
The process of training the GNN model for performing a classification task on graph data and obtaining the classification prediction for the nodes of the graph may be illustrated as the following pseudocode in table 2.
Table 2
Figure PCTCN2021132082-appb-000110
It is appreciated that the above process is just illustrative rather than limitative to the scope of the disclosure, the various variant embodiments of the process may be possible.
For example, although the drop-node method is illustrated in Fig. 2 as the specific method for performing the perturbation of the feature vectors of the neighboring nodes, other perturbation method such as dropout, random perturbing method and so on may be employed.
For example, the MLP model 250 may be employed as the GNN model, and other GNN model such as the GCN, the GAT and so on may be employed as the GNN model.
For example, although the supervised classification loss
Figure PCTCN2021132082-appb-000111
via Equation (4) and consistency regularization loss
Figure PCTCN2021132082-appb-000112
via Equation (6) or (7) are employed as the loss function, the specific method for calculating the supervised classification loss 
Figure PCTCN2021132082-appb-000113
and the consistency regularization loss
Figure PCTCN2021132082-appb-000114
are not limited to the specific Equations. And the loss function is not limited to the combination of the supervised  classification loss
Figure PCTCN2021132082-appb-000115
and the consistency regularization loss
Figure PCTCN2021132082-appb-000116
at least one of which may be employed as the loss function at different circumstances.
Fig. 3 illustrates an exemplary process 30 for training a Graph Neural Network (GNN) model to perform a task of classifying nodes of a graph based on semi-supervised learning according to an embodiment. It is appreciated that the exemplary process 30 may be implemented by a computer, in which programs are executed by one or more processor to perform the process 30.
At step 310, a batch of labeled nodes and a batch of unlabeled nodes are sampled from the nodes of the graph. The graph comprises the nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes being represented by a corresponding feature vector of the feature matrix. It is appreciated that the batch is a subset of the nodes of graph, the number of the batch of nodes is far less than the number of nodes of the graph in practical application.
At step 320, a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node. The neighboring nodes of the node are indicated by the adjacency vector of the node.
At step 330, a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by respectively applying the plurality of augmented feature vectors of the node to the GNN model.
At step 340, a loss is obtained based on the classification predictions of the respective nodes in the batch of labeled nodes and the batch of unlabeled nodes.
At step 350, parameters of the GNN model are updated based on the loss.
In an embodiment, in the method 30, a mixed-order adjacency matrix corresponding to a plurality of adjacency orders is generated for at least a part of the nodes of the graph based on the adjacency matrix of the graph. The adjacency vector of each node in the batch of labeled nodes and the batch of unlabeled nodes is a corresponding mixed-order adjacency vector of the mixed-order adjacency matrix. In an embodiment, the mixed-order adjacency matrix corresponding to a plurality of adjacency orders is generated for a part or subset of the nodes of the graph based on the adjacency matrix of the graph. The adjacency vector of each node in the batch of labeled nodes and the batch of unlabeled nodes is a corresponding mixed-order adjacency vector of the mixed-order adjacency matrix. In this embodiment, the mixed-order adjacency matrix is actually a sub-matrix for the sampled nodes, which is generated by using the above mentioned GFPush method. By using the efficient GFPush method to generate the mixed-order adjacency sub-matrix, the efficiency for performing the classification of node for a large size graph, which includes a huge amount of nodes, can be improved significantly, as the GFPush method is able to generate the mixed-order adjacency sub-matrix for only a subset of all the nodes of the graph.
In an embodiment, the mixed-order adjacency matrix may be generated by performing weighted sum of a plurality of per-order adjacency matrixes corresponding to the plurality of adjacency orders, wherein one of the plurality of per-order adjacency matrixes corresponding to an adjacency order t, and an element in the per-order adjacency matrix representing a probability that a t-step random walk goes from a source node associated with the element to a target node associated with the element.
In an embodiment, the mixed-order adjacency matrix may be generated by generating a mixed-order adjacency vector for each of the at least a part of the nodes of the graph by iteratively performing one-step random walk based on the adjacency matrix of the graph. The generated respective mixed-order adjacency vectors for the at least a part of nodes constitute the mixed-order adjacency matrix.
In an embodiment, the batch of labeled nodes and the batch of unlabeled nodes are sampled from a subset of the nodes of the graph. In this embodiment, a mixed-order adjacency vector for each of the subset of nodes may be generated by iteratively performing one-step random walk based on the adjacency matrix of the graph. The generated respective mixed-order adjacency vectors for the subset of nodes constitute the mixed-order adjacency matrix.
In an embodiment, the mixed-order adjacency vector for each of the at least a part of the nodes of the graph may be generated by iteratively, from the lowest to the highest of the plurality of adjacency orders, calculating a plurality of reserved probability mass vectors and a plurality of residue probability mass vectors corresponding to the plurality of adjacency orders for the node, and generating the mixed-order adjacency vector for the node by performing weighted sum of the plurality of reserved probability mass vectors for the node. In an embodiment, the calculating of the plurality of reserved probability mass vectors and the plurality of residue probability mass vectors is performed based on an adjacency degree matrix of the graph and/or a predefined threshold.
In an embodiment, the generating the mixed-order adjacency vector for each of the at least a part of the nodes of the graph may further comprises: remaining a predefined number of largest elements of the mixed-order adjacency vector of the node while setting other elements of the mixed-order adjacency vector to zero.
In an embodiment, the weighted sum of the plurality of reserved probability mass vectors for the node may performed by using weights α (1-α)  t to the plurality of reserved probability mass vectors, wherein α being a decay factor and t being one of the plurality of adjacency orders, or by using an averaged weight corresponding to the plurality of adjacency orders, or by using a single order weight in which one of the weights being set to 1 and the other of the weights being set to 0.
In an embodiment, at step 320, a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by randomly propagating feature vectors of neighboring nodes of the node based on  the adjacency vector of the node and a dropping mask. The dropping mask is configured to randomly drop out at least partial features of at least a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes. In an embodiment, the dropping mask is configured to randomly drop out entire features of a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes. In an embodiment, the dropping mask is configured to randomly drop out entire features of a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes based on a dropping rate, features of the remained part of the neighboring nodes are scaled up based on the dropping rate.
In an embodiment, the loss comprises a supervised classification loss and a consistency regularization loss. In an embodiment, the supervised classification loss is obtained based on classification labels of the batch of labeled nodes, and the consistency regularization loss is obtained based on classification predictions of the batch of unlabeled nodes. In an embodiment, the supervised classification loss is a cross-entropy loss based on the classification labels and the classification predictions corresponding to the batch of labeled nodes, and the consistency regularization loss is a distance loss based on the classification predictions of the batch of unlabeled nodes. In an embodiment, the consistency regularization loss is set to zero if the maximum of the classification predictions of the batch of unlabeled nodes is less than a threshold or if the maximum of respective averaged classification predictions of the batch of unlabeled nodes is less than a threshold.
In an embodiment, the method 30 is performed repetitively in each of a plurality of training steps, which may be referred to as training loops or training iterations, wherein the loss is obtained by performing weighted sum of the supervised classification loss and a consistency regularization loss with dynamic weights in each of the plurality of training steps.
In an embodiment, in the method 30, the feature vectors of neighboring nodes of the node are transformed with a linear transformation matrix to respective lower-dimension feature vectors having a lower dimension than the feature vectors of the neighboring nodes of the node. In this embodiment, the plurality of augmented lower-dimension feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by randomly propagating the lower-dimension feature vectors of neighboring nodes of the node, the plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes are obtained by respectively applying the plurality of augmented lower-dimension feature vectors of the node to the GNN model, and at step 350 parameters of the GNN model and parameter of the linear transformation matrix are updated based on the loss.
In an embodiment, the GNN model comprises one of a Multilayer Perception (MLP) model, a Graph Convolutional Network (GCN) model, a Graph Attention Network (GAT) model.
In an embodiment, the process 30 illustrated in Fig. 3 may be implemented as a method for training a Graph Neural Network (GNN) model to perform a task of classifying accounts on a social network or a financial network. In this embodiment, the GNN model for classifying the accounts on the social network or the financial network may be trained using the method as described above with reference to Figs. 1-3, wherein the nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
Fig. 4 illustrates an exemplary process 40 for classifying a node of a graph according to an embodiment. It is appreciated that the exemplary process 40 may be implemented by a computer, in which programs are executed by one or more processor to perform the process 40.
At step 410, a Graph Neural Network (GNN) model may be trained by using the method as described above with reference to Figs. 1-3.
At step 420, a classification label for the node of the graph may be predicted by applying the feature matrix of the graph to the trained GNN model.
In an embodiment, the method 40 may also include identify an alarm if the label of the node is a predefined label. In an example, the predefined label of the node may be a label indicating the node is an adversarial node, a fraud node or the like, such as an adversarial node in social network, financial network or the like.
In an embodiment, the process 40 illustrated in Fig. 4 may be implemented as a method for classifying an account on a social network or a financial network. In this embodiment, the GNN model for classifying the accounts on the social network or the financial network may be trained using the method as described above with reference to Figs. 1-3, and a classification label for the account on the social network or the financial network may be predicted with the trained GNN model. The nodes represent the accounts, the edges represent the relation among the accounts, the graph represents the social network or the financial network.
Fig. 5 illustrates an exemplary computing system 50 according to an embodiment. The computing system 50 may comprise at least one processor 510. The computing system 50 may further comprise at least one storage device 520. The storage device 520 may store computer-executable instructions that, when executed, cause the processor 510 to sample a batch of labeled nodes and a batch of unlabeled nodes from the nodes of the graph, wherein the graph comprising the nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes being represented by a corresponding feature vector of the feature matrix; obtain a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node, wherein the neighboring nodes of the node being indicated by the adjacency vector of the node; obtain a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of  augmented feature vectors of the node to the GNN model; obtain a loss based on the classification predictions of the nodes in the batch of labeled nodes and the batch of unlabeled nodes; and update parameters of the GNN model based on the loss.
It should be appreciated that the storage device 520 may store computer-executable instructions that, when executed, cause the processor 510 to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-4.
The embodiments of the present disclosure may be embodied in a computer-readable medium such as non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-4.
The embodiments of the present disclosure may be embodied in a computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-4.
It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims (20)

  1. A computer-implemented method for training a Graph Neural Network (GNN) model to perform a task of classifying nodes of a graph based on semi-supervised learning, the method comprising:
    sampling a batch of labeled nodes and a batch of unlabeled nodes from the nodes of the graph, wherein the graph comprising nodes represented by a feature matrix and edges represented by an adjacency matrix, each of the nodes of graph being represented by a corresponding feature vector of the feature matrix;
    obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on an adjacency vector of the node;
    obtaining a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented feature vectors of the node to the GNN model;
    obtaining a loss based on the classification predictions of the nodes in the batch of labeled nodes and the batch of unlabeled nodes; and
    updating parameters of the GNN model based on the loss.
  2. The method of claim 1, further comprising:
    generating a mixed-order adjacency matrix corresponding to a plurality of adjacency orders for at least a part of the nodes of the graph based on the adjacency matrix of the graph, wherein the adjacency vector of each node in the batch of labeled nodes and the batch of unlabeled nodes being a corresponding mixed-order adjacency vector of the mixed-order adjacency matrix.
  3. The method of claim 2, wherein the generating the mixed-order adjacency matrix further comprising:
    generating a mixed-order adjacency vector for each of the at least a part of the nodes of the graph by iteratively performing one-step random walk based on the adjacency matrix of the graph.
  4. The method of claim 3, wherein the generating the mixed-order adjacency vector for each of the at least a part of the nodes of the graph further comprising:
    iteratively, from the lowest to the highest of the plurality of adjacency orders, calculating a plurality of reserved probability mass vectors and a plurality of residue probability mass vectors corresponding to the plurality of adjacency orders for the node; and
    generating the mixed-order adjacency vector for the node by performing weighted sum of the plurality of reserved probability mass vectors for the node.
  5. The method of claim 4, wherein the calculating of the plurality of reserved probability mass vectors and the plurality of residue probability mass vectors being based on an adjacency degree matrix of the graph and/or a predefined threshold.
  6. The method of claim 3, wherein the generating the mixed-order adjacency vector for each of the at least a part of the nodes of the graph further comprising:
    remaining a predefined number of largest elements of the mixed-order adjacency vector of the node while setting other elements of the mixed-order adjacency vector to zero.
  7. The method of claim 4, wherein the performing weighted sum of the plurality of reserved probability mass vectors for the node further comprising:
    performing weighted sum of the plurality of reserved probability mass vectors for the node by using weights α (1-α)  t to the plurality of reserved probability mass vectors, wherein α being a decay factor and t being one of the plurality of adjacency orders, or by using an averaged weight corresponding to the plurality of adjacency orders, or by using a single order weight in which one of the weights being set to 1 and the other of the weights being set to zero.
  8. The method of claim 1, wherein the obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes further comprising:
    obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node based on the adjacency vector of the node and a dropping mask, wherein the dropping mask being configured to randomly drop out partial features of at least a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes, or being configured to randomly drop out entire features of a part of the neighboring nodes of the node from the feature vectors of the neighboring nodes.
  9. The method of claim 1, wherein the loss comprising a supervised classification loss and a consistency regularization loss, wherein the supervised classification loss being obtained based on classification labels of the batch of labeled nodes, and the consistency regularization loss being obtained based on classification predictions of the batch of unlabeled nodes.
  10. The method of claim 9, wherein the supervised classification loss being a cross-entropy loss based on the classification labels and the classification predictions corresponding to the batch of labeled nodes, and the consistency  regularization loss being a distance loss based on the classification predictions of the batch of unlabeled nodes.
  11. The method of claim 10, wherein the consistency regularization loss being set to zero if the maximum of the classification predictions of the batch of unlabeled nodes is less than a threshold or if the maximum of respective averaged classification predictions of the batch of unlabeled nodes is less than a threshold.
  12. The method of claim 11, wherein the method being repetitively performed in each of a plurality of training steps, wherein the loss being obtained by performing weighted sum of the supervised classification loss and the consistency regularization loss with dynamic weights in each of the plurality of training steps.
  13. The method of claim 1, further comprising transforming the feature vectors of neighboring nodes of the node with a linear transformation matrix to respective lower-dimension feature vectors having a lower dimension than the feature vectors of the neighboring nodes,
    wherein the obtaining a plurality of augmented feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating feature vectors of neighboring nodes of the node further comprising: obtaining a plurality of augmented lower-dimension feature vectors for each node in the batch of labeled nodes and the batch of unlabeled nodes by randomly propagating the lower-dimension feature vectors of neighboring nodes of the node,
    wherein the obtaining a plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented feature vectors of the node to the GNN model further comprising: obtaining the plurality of classification predictions for each node in the batch of labeled nodes and the batch of unlabeled nodes by respectively applying the plurality of augmented lower-dimension feature vectors of the node to the GNN model.
  14. A computer-implemented method for classifying a node of a graph, comprising:
    training a Graph Neural Network (GNN) model by using the method of one of claims 1-13;
    predicting a classification label for the node of the graph by applying the feature matrix of the graph to the trained GNN model.
  15. A computer-implemented method for training a Graph Neural Network (GNN) model to perform a task of classifying accounts on a social network or a financial network, comprising:
    training the GNN model for classifying the accounts on the social network or the financial network using the method of one of claims 1-13, wherein the nodes representing the accounts, the edges representing the relation among the accounts, the graph representing the social network or the financial network.
  16. A computer-implemented method for classifying an account on a social network or a financial network, comprising:
    training a Graph Neural Network (GNN) model by using the method of claim 15;
    predicting a classification label for the account on the social network or the financial network with the trained GNN model, wherein the nodes representing the accounts, the edges representing the relation among the accounts, the graph representing the social network or the financial network.
  17. The method of claim 16, further comprising:
    identify an alarm if the label of the account is a predefined label.
  18. The method of claim 17, wherein the predefined label indicating the account as adversarial account.
  19. A computer system, comprising:
    one or more processors; and
    one or more storage devices storing computer-executable instructions that, when executed, cause the one or more processors to perform the operations of the method of one of claims 1-18.
  20. One or more computer readable storage media storing computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method of one of claims 1-18.
PCT/CN2021/132082 2021-11-22 2021-11-22 Method and apparatus for classifying nodes of a graph WO2023087303A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/132082 WO2023087303A1 (en) 2021-11-22 2021-11-22 Method and apparatus for classifying nodes of a graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/132082 WO2023087303A1 (en) 2021-11-22 2021-11-22 Method and apparatus for classifying nodes of a graph

Publications (1)

Publication Number Publication Date
WO2023087303A1 true WO2023087303A1 (en) 2023-05-25

Family

ID=78821489

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/132082 WO2023087303A1 (en) 2021-11-22 2021-11-22 Method and apparatus for classifying nodes of a graph

Country Status (1)

Country Link
WO (1) WO2023087303A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994073A (en) * 2023-09-27 2023-11-03 江西师范大学 Graph contrast learning method and device for self-adaptive positive and negative sample generation
CN117828514A (en) * 2024-03-04 2024-04-05 清华大学深圳国际研究生院 User network behavior data anomaly detection method based on graph structure learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANPENG CHEN ET AL: "Graph Attention Networks with LSTM-based Path Reweighting", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 21 June 2021 (2021-06-21), XP081992926 *
WENZHENG FENG ET AL: "Graph Random Neural Network for Semi-Supervised Learning on Graphs", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 21 September 2021 (2021-09-21), XP091042321 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994073A (en) * 2023-09-27 2023-11-03 江西师范大学 Graph contrast learning method and device for self-adaptive positive and negative sample generation
CN116994073B (en) * 2023-09-27 2024-01-26 江西师范大学 Graph contrast learning method and device for self-adaptive positive and negative sample generation
CN117828514A (en) * 2024-03-04 2024-04-05 清华大学深圳国际研究生院 User network behavior data anomaly detection method based on graph structure learning
CN117828514B (en) * 2024-03-04 2024-05-03 清华大学深圳国际研究生院 User network behavior data anomaly detection method based on graph structure learning

Similar Documents

Publication Publication Date Title
Zheng et al. Layer-wise learning based stochastic gradient descent method for the optimization of deep convolutional neural network
US9390383B2 (en) Method for an optimizing predictive model using gradient descent and conjugate residuals
CN110766142A (en) Model generation method and device
WO2023087303A1 (en) Method and apparatus for classifying nodes of a graph
Lin et al. Temporal convolutional attention neural networks for time series forecasting
Yang et al. Channel pruning based on convolutional neural network sensitivity
Koklu et al. Estimation of credit card customers payment status by using kNN and MLP
US20220335303A1 (en) Methods, devices and media for improving knowledge distillation using intermediate representations
Gil et al. Quantization-aware pruning criterion for industrial applications
WO2022019913A1 (en) Systems and methods for generation of machine-learned multitask models
Liu et al. Gflowout: Dropout with generative flow networks
CN110633417A (en) Web service recommendation method and system based on service quality
Xie et al. Distributed semi-supervised learning algorithms for random vector functional-link networks with distributed data splitting across samples and features
CN111445032B (en) Method and device for decision processing by using business decision model
Kazemi et al. A novel evolutionary-negative correlated mixture of experts model in tourism demand estimation
Zhang et al. Musings on deep learning: Properties of sgd
WO2023000165A1 (en) Method and apparatus for classifying nodes of a graph
Cao et al. Lstm network based traffic flow prediction for cellular networks
WO2022166125A1 (en) Recommendation system with adaptive weighted baysian personalized ranking loss
Nakamura et al. Stochastic batch size for adaptive regularization in deep network optimization
Zhang et al. A novel hybrid framework based on temporal convolution network and transformer for network traffic prediction
Lee et al. Adaptive network sparsification via dependent variational beta-bernoulli dropout
Salehinejad et al. A framework for pruning deep neural networks using energy-based models
Yang et al. Pruning Convolutional Neural Networks via Stochastic Gradient Hard Thresholding
US11676037B1 (en) Disparity mitigation in machine learning-based predictions for distinct classes of data using derived indiscernibility constraints during neural network training

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21819327

Country of ref document: EP

Kind code of ref document: A1