WO2023060563A1 - Adaptive diffusion in graph neural networks - Google Patents

Adaptive diffusion in graph neural networks

Info

Publication number
WO2023060563A1
Authority
WO
WIPO (PCT)
Prior art keywords
gnn
neighborhood radius
parameters
feature
loss function
Prior art date
Application number
PCT/CN2021/124130
Other languages
French (fr)
Inventor
Jialin Zhao
Ming Ding
Jie Tang
Evgeny Kharlamov
Original Assignee
Robert Bosch Gmbh
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Robert Bosch Gmbh, Tsinghua University filed Critical Robert Bosch Gmbh
Priority to PCT/CN2021/124130 priority Critical patent/WO2023060563A1/en
Priority to CN202180103443.1A priority patent/CN118140231A/en
Publication of WO2023060563A1 publication Critical patent/WO2023060563A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning

Definitions

  • the method proceeds to block 304, with calculating the neighborhood radius based on the updated one or more neighborhood radius related parameters.
  • the neighborhood radius is calculated for all layers and feature dimensions of the GNN uniformly, refer to Eq. (3) or (7) . That is, all the GNN layers and feature dimensions should use the same neighborhood radius for feature propagation.
  • the neighborhood radius is calculated for each layer and each feature dimension of the GNN respectively. That is, the GNN layers and feature dimensions can use respective learned neighborhood radius for feature propagation, refer to Eq. (12) or (13) .
  • since propagating on the input dimensions can generate better results than propagating on the output dimensions, the feature propagation of the message passing is performed before the feature transformation of the message passing.
  • ADC is able to enhance any graph-based model, particularly GNNs.
  • neighborhood radius can be learned automatically for datasets. Specifically, learning unique neighborhood radius for each feature channel in each GNN layer can further improve the performance for downstream graph mining tasks.
  • Fig. 4 illustrates an exemplary computing system, in accordance with various aspects of the present disclosure.
  • the computing system may comprise at least one processor 410.
  • the computing system may further comprise at least one storage device 420. It should be appreciated that the storage device 420 may store computer-executable instructions that, when executed, cause the processor 410 to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-3.
  • the embodiments of the present disclosure may be embodied in a computer-readable medium such as non-transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform a method for training a graph neural network (GNN) to learn a neighborhood radius for feature propagation of message passing to perform node classification.
  • GNN graph neural Network
  • the method comprises: inputting data of a training set into the GNN; updating trainable parameters of the GNN based at least partially on a reduction of a first loss function of the training set, wherein the trainable parameters comprise at least weight and bias parameters of the GNN; updating one or more neighborhood radius related parameters based at least partially on the reduction of the first loss function of the training set, wherein the one or more neighborhood radius related parameters comprise at least influence weights of all the neighbor nodes with different steps away, and wherein a sum of the influence weights of all the neighbor nodes on each layer of the GNN equals 1; and calculating the neighborhood radius to be used in the feature propagation of message passing based on the updated one or more neighborhood radius related parameters.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-3.
  • the embodiments of the present disclosure may be embodied in a computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-3.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.


Abstract

A method for training a graph neural network (GNN) to learn a neighborhood radius for feature propagation of message passing to perform node classification is disclosed. The method comprises: inputting data of a training set into the GNN; updating trainable parameters of the GNN based at least partially on a reduction of a first loss function of the training set, wherein the trainable parameters comprise at least weight and bias parameters of the GNN; updating one or more neighborhood radius related parameters based at least partially on the reduction of the first loss function of the training set, wherein the one or more neighborhood radius related parameters comprise at least influence weights of all the neighbor nodes with different steps away, and wherein a sum of the influence weights of all the neighbor nodes on each layer of the GNN equals 1; and calculating the neighborhood radius to be used in the feature propagation of message passing based on the updated one or more neighborhood radius related parameters. Numerous other aspects are provided.

Description

ADAPTIVE DIFFUSION IN GRAPH NEURAL NETWORKS
FIELD
Aspects of the present disclosure relate generally to artificial intelligence, and more particularly, to a method and an apparatus for adaptive diffusion in graph neural networks.
BACKGROUND
Graph neural networks (GNNs) are a type of neural network that can be directly coupled with graph-structured data. Specifically, graph convolution networks (GCNs) generalize the convolution operation to local graph structures, offering attractive performance for various graph mining tasks. The graph convolution operation is designed to aggregate information from immediate neighboring nodes into the central node, which is also referred to as message passing. To propagate information between nodes that are further away, multiple neural layers can be stacked to go beyond the immediate hop of neighbors. To directly collect high-order information, spectral based GNNs leverage graph spectral properties to collect signals from global neighbors.
Though generating promising results, both strategies are limited to a pre-determined and fixed neighborhood for passing and receiving messages. Essentially, these methods have an implicit assumption that all graph datasets share the same size of receptive field during the message passing process. To break this, graph diffusion convolution (GDC) was recently proposed to extend the discrete message passing process in GCN to a diffusion process, enabling it to aggregate information from a larger neighborhood. However, for each input graph, GDC hand-tunes the best neighborhood size for feature aggregation by grid-searching the parameters on the validation set, making its practical application limited and sensitive.
SUMMARY
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
The success of graph neural networks (GNNs) largely relies on the process of aggregating information from neighbors defined by the input graph structures. Notably, message passing based GNNs, e.g., graph convolutional networks, leverage the immediate neighbors of each node during the aggregation process, and recently,  graph diffusion convolution (GDC) is proposed to expand the propagation neighborhood by leveraging generalized graph diffusion. However, the neighborhood size in GDC is manually tuned for each graph by conducting grid search over the validation set, making its generalization practically limited.
To eliminate the manual search process of the optimal propagation neighborhood in GDC, it is disclosed in the present disclosure the adaptive diffusion convolution (ADC) strategy that supports learning the optimal neighborhood from the data automatically. ADC achieves this by formalizing the task as a bi-level optimization problem, enabling the customized learning of one optimal propagation neighborhood size for each dataset. In other words, all GNN layers and feature channels (dimensions) share the same neighborhood size during message passing on each graph.
It is further disclosed in the present disclosure that ADC is allowed to automatically learn a customized neighborhood size for each GNN layer and each feature dimension from data. By learning a unique propagation neighborhood for each layer, ADC can empower GNNs to capture neighbors’ information from diverse graph structures, which is fully dependent on the data and downstream learning objective. Similarly, by learning a distinct neighborhood size for each feature channel, GNNs are then capable of selectively modeling each neighbor’s multiple feature signals. Altogether, ADC makes GNNs fully coupled with the graph structures and all feature channels.
The ADC disclosed in the present disclosure is a general plugin that can be directly applied to existing GNN models. By plugging it into GNNs, the upgraded GNNs can offer significant performance advances over their vanilla versions across a wide range of datasets. Furthermore, by learning the propagation neighborhood size automatically, ADC can consistently outperform GDC, which customizes this for each dataset by grid search. Finally, it is demonstrated that GNNs’ model capacity can benefit from the better coupling between its architecture, graph structures, and feature channels, that is, by learning a dedicated neighborhood size for each GNN layer and feature dimension.
According to an aspect, a method for training a graph neural network (GNN) to learn a neighborhood radius for feature propagation of message passing to perform node classification is disclosed. The method comprises inputting data of a training set into the GNN; updating trainable parameters of the GNN based at least partially on a reduction of a first loss function of the training set, wherein the trainable parameters comprise at least weight and bias parameters of the GNN; updating one or more neighborhood radius related parameters based at least partially on the reduction of the first loss function of the training set, wherein the one or more neighborhood radius related parameters comprise at least influence weights of all the neighbor nodes with different steps away, and wherein a sum of the influence weights of all the neighbor nodes on each layer of the GNN equals 1; and calculating the neighborhood radius to be used in the feature propagation of message passing based on the updated one or more neighborhood radius related parameters.
According to a further aspect, wherein the trainable parameters of the GNN and the one or more neighborhood radius related parameters are updated jointly based at least partially on the reduction of the first loss function of the training set.
According to a further aspect, updating the trainable parameters of the GNN based at least partially on the reduction of the first loss function of the training set further comprises updating the trainable parameters of the GNN by a first gradient on the training set to reduce the first loss function.
According to a further aspect, the method further comprises inputting data of validation set into the GNN; and wherein updating the one or more neighborhood radius related parameters based at least partially on the reduction of the first loss function of the training set further comprises updating the one or more neighborhood radius related parameters by a second gradient on the validation set to reduce a second loss function of the validation set, wherein the second loss function of the validation set is calculated with the updated trainable parameters of the GNN.
According to a further aspect, wherein the one or more neighborhood radius related parameters are updated based on the updated trainable parameters of the GNN that minimize the first loss function of the training set.
According to a further aspect, wherein the one or more neighborhood radius related parameters are updated based on the updated trainable parameters of the GNN each epoch.
According to a further aspect, wherein the neighborhood radius is calculated for all layers and feature dimensions of the GNN uniformly.
According to a further aspect, wherein the neighborhood radius is calculated for each layer and each feature dimension of the GNN respectively.
According to a further aspect, wherein the feature propagation of the message passing is performed before feature transformation of the message passing with the updated neighborhood radius for each layer and each feature dimension of the GNN respectively.
According to a further aspect, wherein the influence weights of all the neighbor nodes with different steps away are generated based on the heat kernel as $\theta_k = e^{-t}\frac{t^k}{k!}$, wherein k represents a step number away from a central node, and t is a diffusion time.
According to a further aspect, wherein the step number away from the central node is truncated to a constant instead of infinity.
According to a further aspect, wherein the influence weights of all the neighbor nodes with different steps away are generated based on the PageRank as $\alpha(1-\alpha)^k$, wherein k represents a step number away from a central node, and $\alpha$ is a probability of a user staying in a current page.
The models to which the plugin in the present disclosure is applied can focus on the problem of semi-supervised node classification, the input of which may include an undirected network containing multiple nodes and edges therebetween. Given the input feature and a set of labelled nodes, the task is to predict the labels of the remaining nodes. As examples but not limiting, node classification may be image classification, speech recognition or anomaly detection, etc.
BRIEF DESCRIPTION OF THE DRAWINGS
The disclosed aspects will be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
Fig. 1 illustrates an exemplary schematic diagram of adaptive diffusion convolution (ADC) , in accordance with various aspects of the present disclosure.
Fig. 2 illustrates another exemplary schematic diagram of adaptive diffusion convolution (ADC) , in accordance with various aspects of the present disclosure.
Fig. 3 illustrates an exemplary flow chart of adaptive diffusion convolution (ADC) , in accordance with various aspects of the present disclosure.
Fig. 4 illustrates an exemplary computing system, in accordance with various aspects of the present disclosure.
DETAILED DESCRIPTION
The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
The success of GNNs largely relies on the process of aggregating information from neighbors defined by the input graph structures, which is generally named message passing. Notably, message passing based GNNs, e.g., GCNs leverage the immediate neighbors of each node during the aggregation process, and recently, GDC is proposed to expand the propagation neighborhood by leveraging generalized graph diffusion. However, the neighborhood size in GDC is manually tuned for each graph by conducting grid search over the validation set, making its generalization practically limited.
To address this issue, the adaptive diffusion convolution (ADC) strategy is proposed to automatically learn the optimal neighborhood size from the data. Furthermore, the conventional assumption that all GNN layers and feature channels (or dimensions) should use the same neighborhood size for propagation can be broken in the present disclosure. It is designed to enable ADC to learn a dedicated propagation neighborhood for each GNN layer and each feature channel, making the GNN architecture fully coupled with graph structures, which is the unique property that differentiates GNNs from traditional neural networks. By directly plugging ADC in the present disclosure into existing GNNs, consistent and significant outperformance over both GDC and their vanilla versions across various datasets may be obtained, realizing improved model capacity brought by automatically learning a unique neighborhood size per layer and per channel in GNNs.
In the context of GNNs, the present disclosure focuses on the problem of semi-supervised node classification. The input may include an undirected network G = (V, E), where the node set V contains n nodes $\{v_1, \ldots, v_n\}$, E is the edge set, and $A \in \mathbb{R}^{n \times n}$ is the symmetric adjacency matrix of graph G. Given the input feature matrix X and a subset of node labels Y, the task is to predict the labels of the remaining nodes.
In an aspect, the task of node classification may be image classification, in which each node represents an image, an edge may exist if two images are determined to be in a same class, and features may be derived from the pixels as a probability distribution. In another aspect, the task of node classification may be speech recognition, in which each node represents a waveform of a sound record, an edge may exist if two sound records are determined to be in a same class, and features may be derived from the waveform of the sound as a probability distribution over the discrete states of a Hidden Markov Model. In yet another aspect, the task of node classification may be anomaly detection, including but not limited to, fraud recognition, dataset preprocessing, detection of online review spams, fake users and rumors in social media, fake news, etc. The tasks referred to in the present disclosure are merely used as examples, and the present disclosure can be applied to any scenario in which inputs can be learned as node representations and connections between the nodes can be learned as edges.
Back to the GNNs, the convolution operation on graphs can be described as the process of neighborhood feature aggregation or message passing. The message passing graph convolutional networks can be simply defined as below:
$$H^{(l+1)} = \gamma\big(f(H^{(l)})\big) \qquad (1)$$
where $H^{(l)}$ denotes the hidden feature of layer l with $H^{(0)} = X$ and X as the input feature matrix, $f(\cdot)$ denotes feature transformation and $\gamma(\cdot)$ denotes feature propagation. Taking GCN as an example, the feature transformation and feature propagation functions correspond to $f(H^{(l)}) = \sigma(H^{(l)} W^{(l)})$ and $\gamma(Z^{(l)}) = T Z^{(l)}$, respectively, in which D is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$ and $Z^{(l)}$ denotes the hidden feature after transformation. Note that GCN uses the adjacency matrix A with self-loop, so it actually uses $\tilde{A} = A + I_n$ with the corresponding degree matrix $\tilde{D}$. To simplify the notations, T is used herein to denote $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$. Straightforwardly, the feature transformation function $f(\cdot)$ describes how features transform inside each node and the feature propagation function $\gamma(\cdot)$ describes how features propagate between nodes. Essentially, how good a GNN model can utilize graph structures heavily depends on the design of the feature propagation function.
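For concreteness, the following is a minimal numpy sketch of Eq. (1) under the reconstruction above, i.e., feature transformation $f(H) = \sigma(HW)$ followed by propagation $\gamma(Z) = TZ$ with $T = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$; the function names and the toy graph are illustrative assumptions, not part of the disclosure.

```python
# Minimal sketch of one message-passing layer: transform inside each node, then
# propagate along the normalized adjacency with self-loops. Illustrative only.
import numpy as np

def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """T = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def message_passing_layer(H: np.ndarray, W: np.ndarray, T: np.ndarray) -> np.ndarray:
    """H^{(l+1)} = gamma(f(H^{(l)})) with f(H) = relu(H W) and gamma(Z) = T Z."""
    Z = np.maximum(H @ W, 0.0)   # feature transformation inside each node
    return T @ Z                 # feature propagation between nodes

# toy usage: 4-node path graph, 3-dimensional input features, 2 hidden units
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
X = np.random.rand(4, 3)
W = np.random.rand(3, 2)
H1 = message_passing_layer(X, W, normalized_adjacency(A))
```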
Most graph-based models can be represented as $\gamma(H^{(l)}) = f(T)\, H^{(l)}$, where f(T) is a matrix that can be generated by T. So f(T) can be represented as $f(T) = \sum_{k=0}^{\infty} \theta_k T^k$. To quantify how far each node could aggregate features from, the neighborhood radius of a node as r is defined as below:
$$r = \frac{\sum_{k=0}^{\infty} \theta_k \cdot k}{\sum_{k=0}^{\infty} \theta_k} \qquad (2)$$
Here, $\theta_k$ denotes the influence from k-step-away nodes. For a large r, it means that the model puts more emphasis on long distance nodes, i.e., global information. For a small r, it means that the model amplifies local information.
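As a small illustrative check (not part of the disclosure), the Python snippet below evaluates the radius of Eq. (2) for a few hand-picked weight sequences; the weight values are arbitrary examples.

```python
# r = sum_k theta_k * k / sum_k theta_k for a given influence-weight sequence.
def neighborhood_radius(theta):
    return sum(k * w for k, w in enumerate(theta)) / sum(theta)

print(neighborhood_radius([0.0, 1.0]))                # 1.0: all weight on 1-hop neighbors
print(neighborhood_radius([0.0, 0.0, 1.0]))           # 2.0: all weight on 2-hop neighbors
print(neighborhood_radius([0.25, 0.25, 0.25, 0.25]))  # 1.5: weight spread toward distant nodes
```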
For GCN, the neighborhood radius is fixed to r=1, which is just the range of nodes directly connected to it. To collect information beyond direct connections, it is required to stack multiple GCN layers to reach high-order neighborhoods.
There are attempts to improve GCN's feature propagation function from first-hop neighborhood to multi-hop neighborhood, such as MixHop, JKNet, and SGC. For example, SGC uses the feature propagation function $\gamma(H^{(l)}) = T^K H^{(l)}$, where $f(T) = T^K$, i.e., $\theta_k = 1$ for $k = K$ and $\theta_k = 0$ otherwise. In other words, the neighborhood radius r = K for SGC, which is the range of neighborhoods to collect information from each GNN layer. However, for all multi-hop models, the discrete nature of hop numbers makes r non-differentiable, which is unfavourable for subsequent calculation.
A line of work has been focused on generalizing feature propagation from discrete hops to continuous graph diffusion. Notably, graph diffusion convolution (GDC) addresses this by the propagation setup as below:
$$\gamma(H^{(l)}) = \sum_{k=0}^{\infty} \theta_k T^k H^{(l)} \qquad (3)$$
where k is summed from 0 to infinity, making each node aggregate information from the whole graph. In Eq. (3), the weight coefficients should satisfy $\sum_{k=0}^{\infty} \theta_k = 1$ such that the signal strength is neither amplified nor reduced through the propagation. In an aspect, the set of weight coefficients can be generated from personalized PageRank as $\theta_k = \alpha(1-\alpha)^k$, wherein k represents a step number away from a central node, and $\alpha$ is a probability of a user staying in a current page. In another aspect, the set of weight coefficients can be generated from the heat kernel as $\theta_k = e^{-t}\frac{t^k}{k!}$, wherein k represents a step number away from a central node, and t is a diffusion time.
The representations of the set of weight coefficients referred to in the present disclosure are merely used as examples. The heat kernel is taken as an example hereinafter and would not limit the scope of the present disclosure.
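As a hedged illustration of the two example weighting schemes above, the following snippet generates truncated heat-kernel and personalized PageRank coefficients; the parameter values (t = 2.0, alpha = 0.15, truncation at 40 steps) are arbitrary choices for demonstration, not values prescribed by the disclosure.

```python
# Generate truncated influence-weight sequences theta_k for the two examples above.
import math

def heat_kernel_weights(t: float, K: int):
    """theta_k = e^{-t} t^k / k!, truncated at K steps."""
    return [math.exp(-t) * t**k / math.factorial(k) for k in range(K + 1)]

def ppr_weights(alpha: float, K: int):
    """theta_k = alpha * (1 - alpha)^k, truncated at K steps."""
    return [alpha * (1.0 - alpha)**k for k in range(K + 1)]

theta_heat = heat_kernel_weights(t=2.0, K=40)
theta_ppr = ppr_weights(alpha=0.15, K=40)
# both sums are close to 1, so the signal strength is preserved during propagation
print(sum(theta_heat), sum(theta_ppr))
```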
Heat kernel incorporates prior knowledge into the GNN model, which means the feature propagation between nodes follows Newton’s law of cooling, i.e., the feature propagation speed between two nodes is proportional to the difference between their features. Formally, this prior knowledge can be described as below:
$$\frac{d x_i(t)}{dt} = \sum_{j \in N(i)} T_{ij} \big(x_j(t) - x_i(t)\big) \qquad (4)$$
where N(i) denotes the neighborhood of node i and $x_i(t)$ represents the feature of node i after diffusion time t. This differential equation can be solved as below:
$$X(t) = H_t X(0) \qquad (5)$$
where X(t) denotes the feature matrix after diffusion time t, and $H_t = e^{-(I-T)t}$ is the heat kernel.
The heat kernel version of the GDC has r as below:
$$r = \sum_{k=0}^{\infty} e^{-t}\frac{t^k}{k!}\, k = t \qquad (6)$$
This suggests that t is the neighborhood radius r for the heat kernel based GDC, that is, t becomes a perfect continuous substitute for the hop number in multi-hop models.
Recall that the heat kernel version of graph diffusion convolution (GDC) has the following feature propagation function as below:
$$\gamma(H^{(l)}) = e^{-tL} H^{(l)} = \sum_{k=0}^{\infty} e^{-t}\frac{t^k}{k!}\, T^k H^{(l)} \qquad (7)$$
where the Laplacian matrix L = I - T. For each graph dataset, GDC requires a manual grid search step to determine the neighborhood radius related parameter t. Moreover, t is fixed for all feature channels and propagation layers in each dataset.
In the present disclosure, a method called adaptive diffusion convolution (ADC) is disclosed for adaptively learning the neighborhood radius from data for each graph, and it is further disclosed how to generalize it for different feature channels and GNN layers.
GDC enables replacing GNNs’ discrete feature propagation function with the continuous heat kernel. Moving forward from this, instead of hand-tuning t, an optimal neighborhood radius r can be obtained by calculating the gradient with respect to t and updating t until convergence.
In an aspect, the training process of learning t can be the same as learning other weight and bias parameters in the model. Specifically, the training process of learning t and other weight and bias parameters of the GNN is performed jointly and directly on the training set by considering t as one of the trainable parameters, via minimizing the loss function of the training set using the gradient of t along with other weight and bias parameters of the GNN on the training set.
However, learning t directly on the training set may cause overfitting in certain cases. To address the issue, it is further disclosed to train t on the validation set instead of the training set, by using the gradient of t on the validation set. The goal for the model is to find $t^*$ that minimizes the loss function of the validation set $\mathcal{L}_{val}$, wherein w denotes all the other trainable parameters in the feature transformation function and $w^*$ denotes the set of parameters that minimize the loss function of the training set $\mathcal{L}_{train}$. This strategy can be formalized as a bi-level optimization problem as below:
$$t^* = \arg\min_{t} \mathcal{L}_{val}\big(w^*(t), t\big) \qquad (8)$$
$$\text{s.t.}\quad w^*(t) = \arg\min_{w} \mathcal{L}_{train}(w, t) \qquad (9)$$
With Eq. (8) and (9), all the other trainable parameters w, including at least all the weight and bias parameters of the GNN, are firstly learned on the training set. $w^*$ is obtained when the loss function of the training set $\mathcal{L}_{train}$ is minimized after certain training epochs. Then t is learned on the validation set with the learned $w^*$, and $t^*$ is obtained when the loss function of the validation set $\mathcal{L}_{val}$ is minimized after certain training epochs. As the learned $t^*$ would change the value of $w^*$ every time t is updated, w needs to be made to converge to the optimal value again, making it too expensive to train.
For the purpose of decreasing the training cost, it is further disclosed an approximation method to update t every time w is updated, which can be as below:
$$w^{(e+1)} = w^{(e)} - \alpha_1 \nabla_{w} \mathcal{L}_{train}\big(w^{(e)}, t^{(e)}\big) \qquad (10)$$
$$t^{(e+1)} = t^{(e)} - \alpha_2 \nabla_{t} \mathcal{L}_{val}\big(w^{(e+1)}, t^{(e)}\big) \qquad (11)$$
where e denotes the number of training epochs, and $\alpha_1$ and $\alpha_2$ denote the learning rate on the training and validation sets, respectively.
With Eq. (10) and (11), all the other trainable parameters w, including at least all the weight and bias parameters of the GNN, are firstly learned on the training set. For each epoch, $w^{(e)}$ is updated to $w^{(e+1)}$ by using the gradient of w on the training set. Then t is learned on the validation set with the updated $w^{(e+1)}$ during the same epoch, and $t^{(e)}$ is updated to $t^{(e+1)}$ by using the gradient of t on the validation set. After all the training epochs, the optimal $t^*$ and $w^*$ may be obtained. This method could help avoid overfitting and thus offers better generalization.
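The following PyTorch sketch illustrates one possible reading of the per-epoch alternating updates of Eq. (10) and (11): the weights w take a gradient step on the training loss, then the diffusion time t takes a gradient step on the validation loss computed with the just-updated weights. The tiny linear model, random data, truncation depth, and learning rates are placeholders introduced for illustration, not the architecture of the disclosure.

```python
# Alternating per-epoch updates: w on the training set, t on the validation set.
import torch
import torch.nn.functional as F

n, d, c = 8, 5, 3
A = torch.rand(n, n)
T = A / A.sum(dim=1, keepdim=True)               # crude stand-in for a normalized adjacency
X = torch.rand(n, d)
y_train = torch.randint(0, c, (n,))
y_val = torch.randint(0, c, (n,))

W = torch.nn.Parameter(torch.randn(d, c) * 0.1)  # "other trainable parameters" w
t = torch.nn.Parameter(torch.tensor(1.0))        # neighborhood radius related parameter

def forward(X):
    # gamma(X) = sum_{k=0..K} e^{-t} t^k / k! * T^k X, differentiable w.r.t. t
    K = 5
    theta = torch.exp(-t)                        # theta_0
    Tk_X = X                                     # T^0 X
    out = theta * Tk_X
    for k in range(1, K + 1):
        Tk_X = T @ Tk_X
        theta = theta * t / k                    # theta_k = e^{-t} t^k / k!
        out = out + theta * Tk_X
    return out @ W

opt_w = torch.optim.SGD([W], lr=0.1)
opt_t = torch.optim.SGD([t], lr=0.01)
for epoch in range(50):
    opt_w.zero_grad()
    F.cross_entropy(forward(X), y_train).backward()  # Eq. (10): update w on the training set
    opt_w.step()
    opt_t.zero_grad()
    F.cross_entropy(forward(X), y_val).backward()    # Eq. (11): update t on the validation set
    opt_t.step()
```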
Conventional GNNs use a predetermined neighborhood radius for feature propagation. As described above, GDC proposes to use a different neighborhood radius t for different datasets by hand-tuning the values. The disclosed methods described above advance GDC's direction by automatically learning the radius from the given graph. This still implies one t for one dataset, that is, the same t for all GNN layers and all feature channels (dimensions).
Fig. 1 illustrates an exemplary schematic diagram of adaptive diffusion convolution (ADC), in accordance with various aspects of the present disclosure. As can be seen in Fig. 1, a same learned t (shown as 2 for example) is applied to all layers of the GNN and all feature channels (dimensions). When t is large, the contributions from close and distant neighbors would have little difference, and when t is small, the contributions from close neighbors would be much more significant than those from distant neighbors, shown as the greyscale of the circles.
It is expected that for each layer and feature dimension, a unique r may be learned and used, making them adaptive for the final learning objective. The obstacle that prevents prior art from achieving this lies in the infeasible challenge of hand-tuning or grid-searching the propagation function separately for each feature channel and GNN layer, given that as the number of parameters increases, the time complexity increases exponentially.
The aforementioned strategy for updating t during the training of the model empowers the method to adaptively learn specific t for all layers and all feature channels.
It is disclosed that to learn a unique r for each layer and each feature channel, the method described above based on the heat kernel can be evolved as below by extending the feature propagation function in Eq. (7) for each layer and feature channel, that is, from t to $t^{(l)}_i$:
$$\gamma^{(l)}_i\big(H^{(l)}_{:,i}\big) = e^{-t^{(l)}_i L} H^{(l)}_{:,i} = \sum_{k=0}^{\infty} e^{-t^{(l)}_i}\, \frac{\big(t^{(l)}_i\big)^k}{k!}\, T^k H^{(l)}_{:,i} \qquad (12)$$
where $t^{(l)}_i$ denotes the neighborhood radius t for the l-th layer and i-th channel, $H^{(l)}_{:,i}$ represents the i-th column of the hidden feature $H^{(l)}$, i.e., the feature on channel or dimension i, and $\gamma^{(l)}_i(\cdot)$ denotes the feature propagation function on the l-th layer and i-th channel. This feature propagation function enables the GNN to train a separate t for each feature channel and layer.
Fig. 2 illustrates another exemplary schematic diagram of Adaptive Diffusion Convolution (ADC) , in accordance with various aspects of the present disclosure.
As can be seen in Fig. 2, for the hidden feature $H^{(l)}_{:,i}$ of feature channel i in layer l, a separate feature propagation function $\gamma^{(l)}_i(\cdot)$ with a unique neighborhood radius $t^{(l)}_i$ is trained. When t is large (e.g., t = 3), the contributions from close (e.g., 1-hop) and distant (e.g., 3-hop) neighbors have little difference (shown as the relatively similar color shading across different hops). When t is small (e.g., t = 1), the contributions from close neighbors are much more significant than those from distant neighbors (shown as darker color concentrated around the center).
The method of ADC is disclosed herein based on the heat kernel as an example. Without loss of generality, the method of ADC can be a generalized ADC (GADC), in other words, not limiting the weight coefficients $\theta_k$ to the heat kernel or other specific examples. The feature propagation of GADC can be described with Eq. (3), and further, to learn $\theta^{(l)}_{k,i}$ for each layer and feature channel or dimension, the feature propagation of GADC is disclosed as below:
$$\gamma^{(l)}_i\big(H^{(l)}_{:,i}\big) = \sum_{k=0}^{\infty} \theta^{(l)}_{k,i}\, T^k H^{(l)}_{:,i} \qquad (13)$$
where $\theta^{(l)}_{k,i}$ denotes the weight coefficient for k-hop neighbors on the l-th layer and i-th channel/dimension. The constraint $\sum_{k=0}^{\infty} \theta^{(l)}_{k,i} = 1$ is enforced during training, that is, the sum of the influence weights of all the neighbor nodes on each layer of the GNN equals 1.
As it operates differently on each channel, whether to propagate before or after the feature transformation function actually matters. Empirically, it is found that propagating on the input channels generates better results than propagating on the output channels. Therefore, the feature propagation and transformation steps in the original message passing networks from Eq. (1) are swapped as below:
$$H^{(l+1)} = f\big(\gamma(H^{(l)})\big) \qquad (14)$$
Additionally, calculating $e^{-Lt}$ directly is infeasible for large graphs. Practically, it is needed to use the top-K truncation to approximate the heat kernel, making ADC (in Eq. (12)) and GADC (in Eq. (13)) respectively updated as below:
$$\gamma^{(l)}_i\big(H^{(l)}_{:,i}\big) = \sum_{k=0}^{K} e^{-t^{(l)}_i}\, \frac{\big(t^{(l)}_i\big)^k}{k!}\, T^k H^{(l)}_{:,i} \qquad (15)$$
$$\gamma^{(l)}_i\big(H^{(l)}_{:,i}\big) = \sum_{k=0}^{K} \theta^{(l)}_{k,i}\, T^k H^{(l)}_{:,i} \qquad (16)$$
ADC and GADC in the present disclosure are flexible components that can be directly plugged into existing GNN models, enabling them to adaptively learn the neighborhood radius for feature aggregation.
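As an illustration of how such a plugin might look in code, the sketch below implements a truncated per-channel heat-kernel propagation in the spirit of Eq. (14) and (15), with one learnable diffusion time per input channel and propagation applied before the linear transformation; the class name, initialization, and hyperparameters are assumptions made for this example, not the patent's reference implementation.

```python
# Per-channel truncated heat-kernel propagation followed by feature transformation.
import torch

class AdaptiveDiffusionLayer(torch.nn.Module):
    def __init__(self, in_dim: int, out_dim: int, K: int = 5):
        super().__init__()
        self.K = K
        self.t = torch.nn.Parameter(torch.ones(in_dim))  # one diffusion time per input channel
        self.lin = torch.nn.Linear(in_dim, out_dim)

    def propagate(self, H: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
        # sum_{k=0..K} e^{-t_i} t_i^k / k! * (T^k H)_{:, i} for every channel i
        theta = torch.exp(-self.t)       # theta_0, shape (in_dim,)
        Tk_H = H
        out = theta * Tk_H               # broadcast over channels
        for k in range(1, self.K + 1):
            Tk_H = T @ Tk_H
            theta = theta * self.t / k   # per-channel theta_k
            out = out + theta * Tk_H
        return out

    def forward(self, H: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.lin(self.propagate(H, T)))  # propagate first, then transform

# usage with random stand-ins for the normalized adjacency T and node features X
n, d = 6, 4
A = torch.rand(n, n)
T = A / A.sum(dim=1, keepdim=True)
X = torch.rand(n, d)
layer = AdaptiveDiffusionLayer(in_dim=d, out_dim=3)
out = layer(X, T)   # shape (6, 3)
```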
Fig. 3 illustrates an exemplary flow chart of adaptive diffusion convolution (ADC) , in accordance with various aspects of the present disclosure. As described below, some or all illustrated features may be omitted in a particular implementation within the scope of the present disclosure, and some illustrated features may not be required for  implementation of all embodiments. In some examples, the method may be carried out by any suitable apparatus or means for carrying out the functions or algorithm described below.
Generally, the approach of ADC is discussed in the context of the task of classification, including but not limited to, image classification, in which each node represents an image, an edge may exist if two images are determined to be in a same class, and features may be derived from the pixels as a probability distribution; speech recognition, in which each node represents a waveform of a sound record, an edge may exist if two sound records are determined to be in a same class, and features may be derived from the waveform of the sound as a probability distribution over the discrete states of a Hidden Markov Model; and anomaly detection, such as fraud recognition, dataset preprocessing, detection of online review spams, fake users and rumors in social media, fake news, etc. The tasks referred to in the present disclosure are merely used as examples, and the present disclosure can be applied to any scenario in which inputs can be learned as node representations and connections between the nodes can be learned as edges.
The method is for training a graph neural network (GNN) to learn a neighborhood radius for feature propagation of message passing, and may begin at block 301, with inputting data of a training set into the GNN.
The method proceeds to block 302, with updating trainable parameters of the GNN based at least partially on a reduction of a first loss function of the training set, wherein the trainable parameters comprise at least weight and bias parameters of the GNN. In an aspect, the trainable parameters of the GNN are updated by a first gradient on the training set to reduce the first loss function of the training set with one or more epochs. As an example, the trainable parameters of the GNN can be updated based on Eq. (9) or Eq. (10) .
The method proceeds to block 303, with updating one or more neighborhood radius related parameters based at least partially on the reduction of the first loss function of the training set, wherein the one or more neighborhood radius related parameters comprise at least influence weights of all the neighbor nodes with different steps away. The sum of the influence weights of all the neighbor nodes on each layer of the GNN equals 1.
In an aspect, the trainable parameters of the GNN and the one or more neighborhood radius related parameters are updated jointly based at least partially on the reduction of the first loss function of the training set. As an example, the one or more neighborhood radius related parameters can be learned on the training set, as other trainable parameters besides the weight and bias parameters, based on the reduction of the first loss function of the training set by using gradient descent, as in the sketch below.
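By way of illustration only, a minimal sketch of this joint (one-level) update in PyTorch is shown below; it assumes the neighborhood radius related parameters (for example, per-channel diffusion times) are registered as ordinary trainable parameters of the model, and the function name, masks, and hyperparameters are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def train_jointly(model, X, T, y, train_mask, epochs=200, lr=0.01):
    # One optimizer over all parameters: GNN weights/biases and the
    # neighborhood-radius related parameters are updated together by
    # gradient descent on the training loss.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(model(X, T)[train_mask], y[train_mask])
        loss.backward()
        opt.step()
    return model
```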
In another aspect, the trainable parameters of the GNN and the one or more neighborhood radius related parameters are updated as a bi-level optimization. As an example, data of a validation set is input into the GNN, and the one or more neighborhood radius related parameters are updated by a second gradient on the validation set to reduce a second loss function of the validation set, wherein the second loss function of the validation set is calculated with the updated trainable parameters of the GNN. For example, the one or more neighborhood radius related parameters can be updated based on Eq. (8) or Eq. (11).
In an aspect, the one or more neighborhood radius related parameters are updated based on the updated trainable parameters of the GNN that minimize the first loss function of the training set, refer to Eq. (8) and (9) . All the trainable parameters, including at least all the weight and bias parameters of the GNN, are firstly learned on the training set. The updated trainable parameters are obtained when the loss function of the training set is minimized after certain training epochs. Then the one or more neighborhood radius related parameters are learned on the validation set with the updated trainable parameters, and the updated one or more neighborhood radius related parameters are obtained when the loss function of the validation set is minimized after certain training epochs.
In another aspect, the one or more neighborhood radius related parameters are updated based on the updated trainable parameters of the GNN each epoch, refer to Eq. (10) and (11) . All the trainable parameters, including at least all the weight and bias parameters of the GNN, are firstly learned on the training set. For each epoch, the trainable parameters are updated by using the gradient on the training set. Then the one or more neighborhood radius related parameters are learned on the validation set with the updated trainable parameters during the same epoch. The one or more neighborhood radius related parameters are updated by using the gradient on the validation set. After all the training epochs, the updated trainable parameters and one or more neighborhood radius related parameters may be obtained.
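By way of illustration only, the following sketch shows the epoch-wise alternating variant in PyTorch: in each epoch the weight and bias parameters take a gradient step on the training loss, and the neighborhood radius related parameters then take a gradient step on the validation loss computed with the just-updated weights. The rule used to separate the two parameter groups (leaf parameters named "t") is purely an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def train_bilevel(model, X, T, y, train_mask, val_mask, epochs=200, lr=0.01):
    # Assumption: radius-related parameters are the leaves named "t" in the model.
    radius_params = [p for n, p in model.named_parameters() if n.split(".")[-1] == "t"]
    weight_params = [p for n, p in model.named_parameters() if n.split(".")[-1] != "t"]
    opt_w = torch.optim.Adam(weight_params, lr=lr)
    opt_r = torch.optim.Adam(radius_params, lr=lr)
    for _ in range(epochs):
        # Step 1: update the weight and bias parameters on the training set.
        opt_w.zero_grad()
        F.cross_entropy(model(X, T)[train_mask], y[train_mask]).backward()
        opt_w.step()
        # Step 2: update the radius-related parameters on the validation set,
        # using the trainable parameters just updated in this epoch.
        opt_r.zero_grad()
        F.cross_entropy(model(X, T)[val_mask], y[val_mask]).backward()
        opt_r.step()
    return model
```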
The influence weights of all the neighbor nodes with different steps away can be generated in different ways; this does not limit the scope of the disclosure. As an example, the influence weights of all the neighbor nodes with different steps away are generated based on the heat kernel as
$$e^{-t}\,\frac{t^{k}}{k!}$$
wherein k represents a step number away from a central node, and t is a diffusion time. Besides, the step number away from the central node can be truncated to a constant instead of infinity to keep the computation feasible when the heat kernel is used. As another example, the influence weights of all the neighbor nodes with different steps away are generated based on the PageRank as α(1-α)^k, wherein k represents a step number away from a central node, and α is a probability of a user staying in a current page.
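By way of illustration only, a minimal Python sketch of these two ways of generating the influence weights is shown below, with the infinite sum truncated at a constant K; the helper names and the example values of t and α are assumptions.

```python
import math

def heat_kernel_weights(t, K):
    # e^{-t} * t^k / k! for k = 0..K (truncated at K instead of infinity).
    return [math.exp(-t) * t ** k / math.factorial(k) for k in range(K + 1)]

def pagerank_weights(alpha, K):
    # alpha * (1 - alpha)^k for k = 0..K.
    return [alpha * (1 - alpha) ** k for k in range(K + 1)]

print(heat_kernel_weights(3.0, 5))   # illustrative diffusion time t = 3
print(pagerank_weights(0.15, 5))     # illustrative alpha = 0.15
```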
The method proceeds to block 304, with calculating the neighborhood radius based on the updated one or more neighborhood radius related parameters. In an aspect, the neighborhood radius is calculated for all layers and feature dimensions of the GNN uniformly, refer to Eq. (3) or (7). That is, all the GNN layers and feature dimensions use the same neighborhood radius for feature propagation. In another aspect, the neighborhood radius is calculated for each layer and each feature dimension of the GNN respectively. That is, the GNN layers and feature dimensions can each use a respectively learned neighborhood radius for feature propagation, refer to Eq. (12) or (13).
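By way of illustration only, and assuming the neighborhood radius is the weighted average hop count implied by the influence weights (the exact definition is given by Eq. (3) of the disclosure), a minimal sketch of this calculation is:

```python
import math

def neighborhood_radius(theta):
    # Weighted average hop count implied by non-negative influence weights theta_k.
    return sum(k * w for k, w in enumerate(theta)) / sum(theta)

# Illustrative: heat-kernel weights e^{-t} t^k / k! with t = 3, truncated at K = 20.
theta = [math.exp(-3.0) * 3.0 ** k / math.factorial(k) for k in range(21)]
print(neighborhood_radius(theta))    # approximately 3.0, i.e. close to t
```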
In an aspect, because the diffusion operates differently on each dimension, propagating on the input dimensions can generate better results than propagating on the output dimensions; therefore, the feature propagation of the message passing is performed before the feature transformation of the message passing.
As discussed above with Fig. 1-3, ADC is able to enhance any graph-based model, particularly GNNs. By directly plugging ADC into existing GNNs, the neighborhood radius can be learned automatically for each dataset. Specifically, learning a unique neighborhood radius for each feature channel in each GNN layer can further improve the performance of downstream graph mining tasks.
Fig. 4 illustrates an exemplary computing system, in accordance with various aspects of the present disclosure. The computing system may comprise at least one processor 410. The computing system may further comprise at least one storage device 420. It should be appreciated that the storage device 420 may store computer-executable instructions that, when executed, cause the processor 410 to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-3.
The embodiments of the present disclosure may be embodied in a computer-readable medium such as a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform a method for training a graph neural network (GNN) to learn neighborhood radius for feature propagation of message passing to perform node classification. The method comprises: inputting data of a training set into the GNN; updating trainable parameters of the GNN based at least partially on a reduction of a first loss function of the training set, wherein the trainable parameters comprise at least weight and bias parameters of the GNN; updating one or more neighborhood radius related parameters based at least partially on the reduction of the first loss function of the training set, wherein the one or more neighborhood radius related parameters comprise at least influence weights of all the neighbor nodes with different steps away, and wherein the sum of the influence weights of all the neighbor nodes on each layer of the GNN equals 1; and calculating the neighborhood radius to be used in the feature propagation of message passing based on the updated one or more neighborhood radius related parameters.
The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-3.
The embodiments of the present disclosure may be embodied in a computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-3.
It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims (15)

  1. A method for training a graph neural network (GNN) to learn neighborhood radius for feature propagation of message passing to perform node classification, comprising:
    inputting data of training set into the GNN;
    updating trainable parameters of the GNN based at least partially on a reduction of a first loss function of the training set, wherein the trainable parameters comprise at least weight and bias parameters of the GNN;
    updating one or more neighborhood radius related parameters based at least partially on the reduction of the first loss function of the training set, wherein the one or more neighborhood radius related parameters comprise at least influence weights of all the neighbor nodes with different steps away, and wherein the sum of the influence weights of all the neighbor nodes on each layer of the GNN equals 1; and
    calculating the neighborhood radius to be used in the feature propagation of message passing based on the updated one or more neighborhood radius related parameters.
  2. The method of claim 1, wherein the trainable parameters of the GNN and the one or more neighborhood radius related parameters are updated jointly based at least partially on the reduction of the first loss function of the training set.
  3. The method of claim 1, wherein updating the trainable parameters of the GNN based at least partially on the reduction of the first loss function of the training set further comprises:
    updating the trainable parameters of the GNN by a first gradient on the training set to reduce the first loss function.
  4. The method of claim 3, further comprising:
    inputting data of validation set into the GNN; and
    wherein updating the one or more neighborhood radius related parameters based at least partially on the reduction of the first loss function of the training set further comprises:
    updating the one or more neighborhood radius related parameters by a second gradient on the validation set to reduce a second loss function of the validation set, wherein the second loss function of the validation set is calculated with the updated trainable parameters of the GNN.
  5. The method of claim 4, wherein the one or more neighborhood radius related parameters are updated based on the updated trainable parameters of the GNN that minimize the first loss function of the training set.
  6. The method of claim 4, wherein the one or more neighborhood radius related parameters are updated based on the updated trainable parameters of the GNN each epoch.
  7. The method of claim 4, wherein the neighborhood radius is calculated for all layers and feature dimensions of the GNN uniformly.
  8. The method of claim 4, wherein the neighborhood radius is calculated for each layer and each feature dimension of the GNN respectively.
  9. The method of claim 8, wherein the feature propagation of the message passing is performed before feature transformation of the message passing with the updated neighborhood radius for each layer and each feature dimension of the GNN respectively.
  10. The method of claim 1, wherein the influence weights of all the neighbor nodes with different steps away are generated based on the heat kernel as
    $e^{-t}\,\frac{t^{k}}{k!}$
    wherein k represents a step number away from a central node, and t is a diffusion time.
  11. The method of claim 10, wherein the step number away from the central node is truncated to a constant instead of infinity.
  12. The method of claim 1, wherein the influence weights of all the neighbor nodes with different steps away are generated based on the PageRank as α(1-α)^k, wherein k represents a step number away from a central node, and α is a probability of a user staying in a current page.
  13. A computer system, comprising:
    one or more processors; and
    one or more storage devices storing computer-executable instructions that, when executed, cause the one or more processors to perform the operations of the method of one of claims 1-12.
  14. One or more computer readable storage media storing computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method of one of claims 1-12.
  15. A computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method of one of claims 1-12.
PCT/CN2021/124130 2021-10-15 2021-10-15 Adaptive diffusion in graph neural networks WO2023060563A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/124130 WO2023060563A1 (en) 2021-10-15 2021-10-15 Adaptive diffusion in graph neural networks
CN202180103443.1A CN118140231A (en) 2021-10-15 2021-10-15 Adaptive diffusion in a graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/124130 WO2023060563A1 (en) 2021-10-15 2021-10-15 Adaptive diffusion in graph neural networks

Publications (1)

Publication Number Publication Date
WO2023060563A1 true WO2023060563A1 (en) 2023-04-20

Family

ID=85987960

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/124130 WO2023060563A1 (en) 2021-10-15 2021-10-15 Adaptive diffusion in graph neural networks

Country Status (2)

Country Link
CN (1) CN118140231A (en)
WO (1) WO2023060563A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541794A (en) * 2023-07-06 2023-08-04 中国科学技术大学 Sensor data anomaly detection method based on self-adaptive graph annotation network
CN117633635A (en) * 2024-01-23 2024-03-01 南京信息工程大学 Dynamic rumor detection method based on space-time propagation diagram

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151288A1 (en) * 2018-11-09 2020-05-14 Nvidia Corp. Deep Learning Testability Analysis with Graph Convolutional Networks
US20200285944A1 (en) * 2019-03-08 2020-09-10 Adobe Inc. Graph convolutional networks with motif-based attention

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151288A1 (en) * 2018-11-09 2020-05-14 Nvidia Corp. Deep Learning Testability Analysis with Graph Convolutional Networks
US20200285944A1 (en) * 2019-03-08 2020-09-10 Adobe Inc. Graph convolutional networks with motif-based attention

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
INDRO SPINELLI; SIMONE SCARDAPANE; AURELIO UNCINI: "Adaptive Propagation Graph Convolutional Network", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 28 September 2020 (2020-09-28), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081771656, DOI: 10.1109/TNNLS.2020.3025110 *
YANG ZEYU: "Master's Thesis", 16 December 2020, NANJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS, CN, article YANG, ZEYU: "Research on GCN with Neighborhood Selection Strategy", pages: 1 - 50, XP009545019, DOI: 10.27251/d.cnki.gnjdc.2020.000754 *
ZHAO JIALIN, DING MING, KHARLAMOV EVGENY: "Adaptive Diffusion in Graph Neural Networks", 35TH CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS (NEURIPS 2021), 20 April 2023 (2023-04-20), pages 1 - 3, XP093058146 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541794A (en) * 2023-07-06 2023-08-04 中国科学技术大学 Sensor data anomaly detection method based on self-adaptive graph annotation network
CN116541794B (en) * 2023-07-06 2023-10-20 中国科学技术大学 Sensor data anomaly detection method based on self-adaptive graph annotation network
CN117633635A (en) * 2024-01-23 2024-03-01 南京信息工程大学 Dynamic rumor detection method based on space-time propagation diagram
CN117633635B (en) * 2024-01-23 2024-04-16 南京信息工程大学 Dynamic rumor detection method based on space-time propagation diagram

Also Published As

Publication number Publication date
CN118140231A (en) 2024-06-04

Similar Documents

Publication Publication Date Title
WO2023060563A1 (en) Adaptive diffusion in graph neural networks
US20230169140A1 (en) Graph convolutional networks with motif-based attention
Lee et al. Complex-valued neural networks: A comprehensive survey
Stutz et al. Learning optimal conformal classifiers
US20240112090A1 (en) Concurrent optimization of machine learning model performance
US11042802B2 (en) System and method for hierarchically building predictive analytic models on a dataset
WO2022166115A1 (en) Recommendation system with adaptive thresholds for neighborhood selection
US11475543B2 (en) Image enhancement using normalizing flows
US20230306723A1 (en) Systems, methods, and apparatuses for implementing self-supervised domain-adaptive pre-training via a transformer for use with medical image classification
JP6851634B2 (en) Feature conversion module, pattern identification device, pattern identification method, and program
CN114610897A (en) Medical knowledge map relation prediction method based on graph attention machine mechanism
Luo et al. Multinomial Bayesian extreme learning machine for sparse and accurate classification model
Wang et al. Fully hyperbolic graph convolution network for recommendation
WO2023000165A1 (en) Method and apparatus for classifying nodes of a graph
CN114756694A (en) Knowledge graph-based recommendation system, recommendation method and related equipment
Wu et al. JPEG steganalysis based on denoising network and attention module
US20220253688A1 (en) Recommendation system with adaptive weighted baysian personalized ranking loss
CN113326884A (en) Efficient learning method and device for large-scale abnormal graph node representation
Liu et al. GDST: Global Distillation Self-Training for Semi-Supervised Federated Learning
Wang et al. Variance of the gradient also matters: Privacy leakage from gradients
Yang et al. Confidence-based and sample-reweighted test-time adaptation
Dyer et al. Gradient-assisted calibration for financial agent-based models
Zhong et al. Lightweight Federated Graph Learning for Accelerating Classification Inference in UAV-assisted MEC Systems
Mavromatis et al. SemPool: Simple, robust, and interpretable KG pooling for enhancing language models
Sierra et al. Global and local neural network ensembles

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21960294

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE