CN115860061A - Graph neural network optimization method and graph neural network inference system

Graph neural network optimization method and graph neural network inference system

Info

Publication number
CN115860061A
Authority
CN
China
Prior art keywords
graph
operator
neural network
statement
strategy
Prior art date
Legal status
Pending
Application number
CN202211353528.8A
Other languages
Chinese (zh)
Inventor
周杨杰
冷静文
李超
过敏意
沈雯婷
肖文聪
艾宝乐
李永
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202211353528.8A
Publication of CN115860061A


Abstract

A graph neural network optimization method and a graph neural network inference system are disclosed. The method comprises the following steps: representing a graph operator of a graph neural network computing task as a loop nesting statement based on a preset abstract format; selecting an optimized parallel strategy from a parallel strategy library based on the graph operator information extracted from the loop nesting statement and the graph data information of the operator; and executing the graph neural network computing task according to the selected optimized parallel strategy. The invention provides a unified abstraction with complete semantic representation for various graph operators, so that the graph operators in a graph neural network are expressed uniformly; by describing operator computation, graph data and parallelization strategy separately, the optimal strategy for the computation scenarios of different operators and graph structures is determined, realizing automatic exploration and efficient execution of dynamic parallelization strategies.

Description

Graph neural network optimization method and graph neural network inference system
Technical Field
The disclosure relates to the field of deep learning, and in particular to a graph neural network optimization method and a graph neural network inference system.
Background
As a graph representation learning method, graph neural network learning is widely applied in many fields such as intelligent transportation, recommendation systems, knowledge graphs and molecular science, and has been put into practical use at industrial scale in numerous graph scenarios.
Compared with the regular Euclidean-space data studied by traditional convolutional neural networks, graph neural networks learn on irregular graph structures. The irregularity of the graph structure makes the memory accesses of graph operators on the graph random and complex, which introduces different computation and memory-access patterns into the execution process, so that the parallel schemes of other neural networks cannot be applied directly. Existing computation methods for graph operators in graph neural networks mainly rely on hand-written static kernels and lack flexibility and dynamism.
Therefore, in order to support efficient execution of graph neural networks, a graph neural network computation scheme capable of adaptive parallelism is required.
Disclosure of Invention
The technical problem to be solved by the present disclosure is to provide a graph neural network optimization method and a graph neural network inference system, implemented as a high-performance computation optimization scheme oriented to graph operators in graph neural networks. Based on a unified abstraction capable of providing complete semantic representation for various graph operators, the graph operators in a graph neural network are uniformly expressed according to a preset nested format, thereby realizing operator-by-operator parallel optimization under the unified expression, and realizing automatic exploration and efficient execution of dynamic parallelization strategies.
According to a first aspect of the present disclosure, there is provided a graph neural network optimization method, including: representing a graph operator of the graph neural network computing task into loop nesting statements based on a preset abstract format; selecting an optimized parallel strategy from a parallel strategy library based on the graph operator information extracted from the loop nested statement and the graph data information of the operator; and executing the graph neural network computing task according to the selected optimized parallel strategy.
Optionally, the loop nesting statement based on the preset abstract format includes: an outer loop statement for traversing all vertices in the graph; a middle-layer loop statement for traversing each incoming edge of a vertex; and an inner loop statement for representing the specific operator operation.
Optionally, the inner loop statement is used for traversing in the feature dimension, and includes: a first operation statement defined by a first operator operating on each edge feature; and a second operation statement defined by a second operator that reduces the transformed features after the edge-wise operation.
Optionally, the preset abstract format includes a first input embedding tensor, a second input embedding tensor, and a third input embedding tensor, wherein in the first operation statement the first input embedding tensor and the second input embedding tensor are operated on using the first operator, and in the second operation statement the third input embedding tensor is operated on using the second operator, wherein each of the first input embedding tensor, the second input embedding tensor, and the third input embedding tensor is one of a source vertex embedding tensor, a target vertex embedding tensor, an edge embedding tensor, and NULL.
Optionally, the parallel policy repository includes: a thread-edge policy in which one thread performs all operations of one edge; a thread-vertex policy in which one thread performs all operations of one vertex; a thread bundle-edge policy in which one thread bundle performs all operations of one edge; and a thread bundle-vertex policy in which one thread bundle performs all operations of one vertex; the optimized parallel policy is selected from among these.
Optionally, based on the graph operator information extracted from the loop nesting statement and the graph data information of the operator, selecting an optimized parallel policy from the parallel policy library further comprises defining the selected parallel policy by introducing one of the following parameters: a grouping parameter for causing one thread or thread bundle in the selected parallel policy to process a plurality of edges or vertices; a tiling parameter for causing multiple threads or thread bundles in the selected parallel policy to process an edge or vertex.
Optionally, selecting an optimized parallel policy from the parallel policy library based on the graph operator information extracted from the loop nesting statement and the graph data information of the operator includes one of: sending the graph operator information and the graph data as input into a trained optimization strategy prediction model, and taking the output of the optimization strategy prediction model as the optimized parallel strategy; and sending the graph operator information and the graph data into an optimization strategy decision tree, and determining the optimized parallel strategy based on the decision of the optimization strategy decision tree.
Optionally, executing the graph neural network computational task according to the selected optimized parallel strategy includes: generating an executable code according to the optimized parallel strategy and a loop nesting statement based on the preset abstract format; and executing the executable code by neural network computing-specific hardware.
Optionally, generating executable code according to the optimized parallel policy and the loop nesting statement based on the preset abstract format includes: when the first operator or the second operator is empty, fusing the first operation statement and the second operation statement, wherein the inner-layer loop statement of the loop nesting statement comprises the first operation statement, defined by the first operator operating on each edge feature, and the second operation statement, defined by the second operator that reduces the transformed features after the edge-wise operation.
According to a second aspect of the present disclosure, there is provided a graph neural network inference system, comprising: a compiler to: representing a graph operator of the graph neural network computing task into a loop nesting statement based on a preset abstract format; selecting an optimized parallel strategy from a parallel strategy library based on the graph operator information extracted from the loop nested statement and the graph data information of the operator; generating an executable code according to the optimized parallel strategy and a loop nesting statement based on the preset abstract format; and the execution unit is to execute the executable code using neural network computing specific hardware.
According to a third aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described in the first aspect above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method as described in the first aspect above.
Therefore, the invention provides a unified abstraction with complete semantic representation for various graph operators, so that the graph operators in the graph neural network are expressed uniformly; by describing operator computation, graph data and parallelization strategy separately, the optimal strategy for the computation scenarios of different operators and graph structures is determined, and automatic exploration and efficient execution of dynamic parallelization strategies are realized.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 shows an example of typical image and language neural network processing.
Fig. 2 shows an example of an application pattern of a general graph neural network.
Fig. 3 shows the performance indicators for different operators under different data sets.
FIG. 4 shows a schematic flow diagram of a graph neural network optimization method in accordance with one embodiment of the present invention.
FIG. 5 illustrates an operational overview of a unified graph operator interface, according to one embodiment of the invention.
FIG. 6 shows a schematic diagram of the components of a graph neural network inference system in accordance with one embodiment of the present invention.
FIG. 7 is a schematic structural diagram of a computing device that can be used to implement the graph neural network optimization method described above according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
With the development of machine learning and deep learning, speech, image and natural language processing have gradually made great breakthroughs. However, speech, images and text are all structured data presented as a sequence or a grid, and existing deep learning models are good at processing this type of data. FIG. 1 shows an example of typical image and language neural network processing. As shown in the upper part of FIG. 1, an image can be regarded as a fixed grid because its structure does not change. Desired features can then be extracted from the image through the input layer and the plurality of hidden layers (for example, by enlarging the receptive field, the extracted features may progress from local contrast patterns to facial features and then to the entire face), and the extracted features can be classified by the output layer, thereby implementing face recognition. The speech shown in the lower part of FIG. 1 can be regarded as a fixed sequence of text and fed into the LSTM shown on the right for the corresponding semantic understanding task.
However, not all things in the real world can be represented as a sequence or a grid, e.g., social networks, knowledge graphs, complex file systems, etc. are unstructured. This network-type of unstructured data corresponds to a more general graph structure than simple text and images. In a typical graph, each node has a different number of edges, i.e., different nodes are adjacent to it. The graph structure has rich problem expression capacity, and can overcome the problems of data redundancy, information loss and the like caused by the traditional grid structure, so that people can efficiently extract more useful information from the data.
However, the processing for the graph is very complex, and the difficulties include: (1) The size of the graph is arbitrary, the topological structure of the graph is complex, and the graph has no spatial locality like an image; (2) The graph has no fixed node order, or no reference node; (3) The graph is often a dynamic graph and contains features that are multi-dimensional, even multi-modal.
The complexity of graph data is a challenge for conventional machine learning. The graph neural network, as a graph representation learning method, has received wide attention and rapid development in recent years and occupies an important position in the above problem scenarios. Its main idea is to combine the end-to-end learning process of deep learning with the message passing of graph computation to form a new computation paradigm; a graph neural network can capture the relationships of irregular graph structures and thus extract effective graph embedding representations. Specific algorithms for downstream tasks can use these embedding vectors for efficient and effective computation to achieve the task goals.
A graph is a data structure that models a set of objects (nodes) and their relationships (edges). A Graph Neural Network (GNN) is a generalized neural network based on graph structure, a deep learning model architecture that can run on graph data. Graph neural networks typically take the topology of the graph as the computational input of the model and learn neural network primitives by passing, transforming and aggregating node/edge feature information across the entire graph to generate per-node embedding vectors. The generated node embedding vectors can be used as the input of any differentiable prediction layer for node classification, edge link prediction or graph structure identification, and the complete model can be trained in an end-to-end manner.
Fig. 2 shows an example of an application pattern of a general graph neural network. As shown, the input may be a graph that includes nodes (circles and triangles) and edges that characterize the relationships between the nodes. The input graph is subjected to various operations such as multilayer graph convolution and the like and activation functions, and finally the representation of each node is obtained, so that tasks such as node classification, link prediction, graph and sub-graph generation and the like can be conveniently carried out.
The continuously developing graph neural network models have a huge architectural space. At the same time, the variability and complexity of the graph operators used by graph neural networks have increased rapidly. For example, the number of graph operators increases significantly from the early GCN (graph convolutional network) model to the later GAT (graph attention network) and GIN (graph isomorphism network) models. Accordingly, the computations to which these new graph operators correspond also become more complex, which makes high-performance implementation of graph neural networks more challenging. In addition to the large design space of graph operators, a graph neural network model can also run on graph-structure datasets with different characteristics (e.g., different balance, density, and cluster locality). This requires the computing system to explore adaptive parallelization patterns in different scenarios to achieve high performance. Unlike conventional graph algorithms, graph neural networks do not have the complex control flow imposed by frontiers, but instead involve traversal of the feature dimension and more complex computations when traversing the graph.
The existing computation methods for graph operators in graph neural network computation frameworks depend on hand-written static kernels and lack flexibility and dynamism. These frameworks use fixed execution strategies for different graph operators and input graphs. However, achieving optimal parallel performance for different graph neural network models and input graph structures is extremely challenging, requiring dynamic trade-offs between locality, parallelism, and work efficiency. As inputs of graph neural network models, graphs differ significantly in the number of vertices, the number of edges, sparsity, the size of the input features, and the distribution of edges within the graph. Meanwhile, the graph operators in different graph neural network models have unique computation and memory-access characteristics. The existing static execution modes make current frameworks perform well only for specific graph neural network models and input graph datasets. When the graph neural network model and its input graph dataset change, their performance is not ideal.
Before describing the graph neural network optimization of the present invention, the background of existing execution frameworks for graph neural networks (GNNs) is first briefly described, with emphasis on their programming interfaces and execution strategies, and the inefficient execution of the existing frameworks is demonstrated by experiments on a GPU.
Graph neural network
In recent years, GNNs have received a great deal of attention from both academia and industry due to their powerful learning capability and their ability to reason about graph structures in non-Euclidean spaces. The output of a GNN model is a d-dimensional embedding vector for each node in the input graph. For vertices or subgraph structures with similar attributes, their embeddings are also close to each other, so that graph-related problems can be quickly inferred.
To achieve these embeddings, a GNN combines DNN-based feature transformation with graph-based operations that propagate and aggregate information along the graph structure. Due to this mix of DNN operations and graph operations, existing GNN frameworks such as DGL and PyTorch Geometric (PyG) extend existing DNN frameworks (such as TensorFlow and PyTorch) using the key concept of "messages". A message can be viewed as an intermediate feature embedding associated with each edge. Message-centric graph operations can be formalized using the following equations. For any operation on a graph G = (V, E), there may be three phases, i.e., message creation, message aggregation and feature update, depending on the attributes of the data and the direction of data movement:
$$m_e^{(k)} = \psi\big(h_u^{(k-1)},\ h_v^{(k-1)},\ m_e^{(k-1)}\big), \qquad \forall (u,v,e)\in E \tag{1}$$

$$\hat{h}_v^{(k)} = \rho\big(\{\, m_e^{(k)} \mid (u,v,e)\in E \,\}\big), \qquad \forall v\in V \tag{2}$$

$$h_v^{(k)} = \phi\big(h_v^{(k-1)},\ \hat{h}_v^{(k)}\big), \qquad \forall v\in V \tag{3}$$

where u and v are vertex (or node) indices, e is the index of the edge between u and v, h_v denotes the feature embedding of vertex v, m_e is the message associated with edge e, and the superscript (k) marks the k-th propagation step.
In equation (1), each edge creates its message m_e by applying an edge-wise message function to its own edge features and the features of its associated vertices. In equation (2), each vertex aggregates the messages from its incoming edges using an aggregation function. In equation (3), each vertex updates its features using a vertex-wise combination function. In a GNN, the set of feature embeddings of all vertices is called the vertex embedding tensor, and the set of feature embeddings of all edges is called the edge embedding tensor.
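For concreteness, the following minimal Python sketch walks through the three phases of equations (1) to (3) over a toy edge list; the names psi, rho and phi mirror the symbols above and are placeholders for model-specific functions, not code from any GNN framework.
def message_passing(V, E, h, m, psi, rho, phi):
    # (1) message creation: one new message per edge (u, v, e)
    m_new = {e: psi(h[u], h[v], m[e]) for (u, v, e) in E}
    # (2) message aggregation: each vertex reduces the messages of its incoming edges
    agg = {v: rho([m_new[e] for (_, w, e) in E if w == v]) for v in V}
    # (3) feature update: combine the old vertex feature with the aggregate
    return {v: phi(h[v], agg[v]) for v in V}

# Example: aggregation-sum style usage with scalar features.
V = [0, 1, 2]
E = [(0, 2, "e0"), (1, 2, "e1")]              # edges point from source to destination
h = {0: 1.0, 1: 2.0, 2: 0.0}
m = {"e0": 0.0, "e1": 0.0}
out = message_passing(V, E, h, m,
                      psi=lambda hu, hv, me: hu,  # copy the source feature onto the edge
                      rho=sum,                    # sum the incoming messages
                      phi=lambda hv, a: a)        # replace the vertex feature with the aggregate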
Definition of graph operators
Graph operators are defined as operators that need to traverse the input graph structure. Message creation and message aggregation, as explained above, are two types of graph operators. When the message creation operator is a simple copy operation, it can be merged into the message aggregation operator to avoid redundant accesses (both DGL and PyG apply this fusion). There is thus a third type of graph operator, called a fused aggregation operator, which fuses the original message creation operator and the message aggregation operator (herein, if not explicitly specified, the aggregation operator refers to the fused aggregation operator).
In brief, graph operators include message creation and message aggregation, as well as the fused aggregation operator. They involve irregular memory-access behavior caused by the graph structure as well as complex arithmetic computations, and therefore pose a serious challenge to high-performance GNN computation. Optimizing the computation of graph operators is the technical problem to be solved by the invention.
Complexity of graph operators
Different GNN models use different graph operators, and the design space is large. Table 1 classifies the 160 graph operators supported by DGL according to their input and output tensor types. As can be seen from Table 1, different graph operators differ in their input types even within the three broad categories of message creation, message aggregation and fused aggregation. Moreover, graph operators can perform different modes of computation even when the same input/output tensor types are used. Therefore, providing practical high-performance support for all of these operations is challenging and requires a systematic and automated solution.
TABLE 1 (classification of the graph operators supported by DGL by input and output tensor types; reproduced as an image in the original document)
Variability of graph data
Real-world graph datasets also vary greatly. As shown in Table 2 below, 15 common graph datasets were selected for analysis. In particular, the numbers of vertices and edges of the different graphs are collected to reflect their size, and the standard deviation of the number of non-zero values in the rows of the adjacency matrix (the "std of nnz" column) is also derived, which reflects the degree of graph balance (a sketch of this metric follows Table 2). Different graph data also have different feature and class sizes, which affect the memory usage and computational complexity of some graph operators. As can be seen from Table 2, these attributes vary greatly from graph to graph.
TABLE 2 (the 15 graph datasets analyzed, with vertex and edge counts, std of nnz, feature and class sizes; reproduced as an image in the original document)
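The "std of nnz" balance metric mentioned above can be computed directly from a sparse adjacency matrix; the following is a minimal sketch assuming a SciPy CSR representation (an illustration, not code from the patented system).
import numpy as np
import scipy.sparse as sp

def std_of_nnz(adj: sp.csr_matrix) -> float:
    row_nnz = np.diff(adj.indptr)   # non-zeros per row = degree of each vertex
    return float(np.std(row_nnz))   # larger value = less balanced graph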
Execution efficiency analysis on a GPU
Here, a GPU programmed with the CUDA language is selected as the execution hardware. The GPU architecture is highly parallel and contains many streaming multiprocessors (SMs). An SM executes threads in SIMT (single instruction, multiple thread) mode, with the 32 threads of a warp (thread bundle) running simultaneously. The enormous computational and memory resources make GPUs increasingly important for deep learning acceleration. Due to the lack of systematic optimization methods, the inventors have found that the underlying CUDA kernels used by existing GNN frameworks suffer from inefficiency and inflexibility. The DGL framework is used as an example for explanation, but the kernels in the PyG framework have similar problems.
DGL calls static CUDA kernels to support its message-passing programming interface, which cannot adapt to different computational scenarios. Here, two graph operators commonly used in GNNs were selected for quantitative analysis. The first is the weighted-aggregate-sum operator in GCN and GAT, and the other is the unweighted-aggregate-max operator in SageMax; the former is more computationally intensive than the latter due to the addition of edge weights. The datasets AR and SO are used as representatives of unbalanced graphs and the datasets PR and DD as representatives of balanced graphs, and their occupancy indicators are collected by nvprof. In addition, the datasets CO and CI are taken as representatives of small graphs and the datasets SW and OV as examples of large graphs, and their SM (streaming multiprocessor) usage and L2 cache hit rates under the different operators are collected.
The results are shown in FIG. 3. FIG. 3 shows the performance indicators for the different operators under different datasets. The two operators show some similar result patterns. The occupancy on the unbalanced graphs is significantly lower than on the balanced graphs. In addition, smaller graphs achieve higher L2 cache hit rates but lower SM usage than larger graphs. There are also differences between the operators: for the lightweight unweighted-aggr-max operator, the occupancy gap between unbalanced and balanced graphs is large, while the gaps in SM usage and cache hit rate between small and large graphs are small.
These results indicate that the low GPU occupancy leads to under-utilization of hardware resources when executing unbalanced graphs. When executing small graphs, GPU performance is typically limited by insufficient utilization of hardware resources due to limited parallelism, while when executing large graphs the memory-access bandwidth becomes the bottleneck due to low locality. Moreover, these indicators differ from operator to operator.
Existing GNN frameworks rely on hand-written kernels with fixed execution strategies. However, a fixed execution strategy is inefficient for executing GNN models because of the diversity of graph-related operations and of real-world graph structures. This prompted the inventors to design a unified interface (called μGrapher) to support existing GNN frameworks. The unified interface of the present invention captures the complete semantic representation of all the frequently used graph operators in GNNs and enables different dynamic and flexible execution strategies for input graph data and graph operators with different characteristics.
Unified graph operator abstraction of the present invention
Previous work only decomposed GNNs into different phases for static optimization. The inventors instead realize a unified abstraction over all graph operators in current GNN models, modeling and abstracting the underlying graph-related operators, and adopt nested sparse-dense loops to separate data access from computation so as to enable optimization for different graph datasets. Furthermore, because a graph operator in the preset abstract form decouples the scheduling strategy from the computation, and the variable part of the unified abstraction can be used as the input of a prediction model or a decision tree as described below, the optimized parallel strategy can be obtained directly from the output of the model or the decision of the decision tree.
FIG. 4 shows a schematic flow diagram of a graph neural network optimization method in accordance with one embodiment of the present invention. The optimization method can in particular be executed by a graph neural network compiler, which generates executable code for the underlying dedicated hardware, e.g., a GPU, based on the unified abstraction and the optimized parallel strategy.
In step S410, a graph operator of the graph neural network computing task is represented as a loop nesting statement based on a preset abstract format. Subsequently, in step S420, an optimized parallel policy can be selected from the parallel policy library based on the graph operator information extracted from the loop nesting statement and the graph data information of the operator. In step S430, the graph neural network computational task is performed according to the selected optimized parallel strategy.
The loop nesting statement based on the preset abstract format can comprise three nested loop statements, specifically: an outer loop statement for traversing all vertices in the graph; a middle-layer loop statement for traversing each incoming edge of the vertex; and an inner loop statement for representing the specific operator operation. The inventors thus abstract all GNN computation into three execution phases: moving data from vertices to edges, performing edge-wise computations on all edges, and performing a reduction function from edges to their associated vertices. Different operators perform different edge computations and reduction computations and may skip certain phases. Accordingly, the inner statements of the loop nest may differ to represent different operators.
The inner loop statement traverses the feature dimension and includes: a first operation statement defined by a first operator (e.g., edge_op below) operating on each edge feature; and a second operation statement defined by a second operator (e.g., gather_op below) that reduces the transformed features after the edge-wise operation.
For ease of understanding, the abstraction method of the present invention is first described by taking the aggregation-sum operator as an example. Its generalization capability to represent all graph operators is then described.
Nested For loop representation of graph operators
The same aggregation-sum operator example as described above is used here to illustrate the graph operator abstraction method of the present invention. This operator is widely used in GNNs; for each vertex in the graph, it traverses the neighboring vertices and accumulates their feature embeddings.
An example of code for representing the aggregation-sum operator using nested loops is given below. The graph operator abstraction consists of three nested loops, where line 5 (the outer loop) traverses all vertices in the graph, line 6 (the middle loop) traverses each incoming edge of a vertex, and line 8 traverses the feature dimension. The innermost statement (line 9, the inner loop) then implements the accumulation of data from the source vertex into the target vertex.
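The listing itself is reproduced only as an image in the original document; the following Python sketch reconstructs its likely shape from the line-by-line description above (the comments mark the referenced line numbers, and helper names such as feat_len are assumptions):
# Input:  graph G = (V, E), vertex embedding tensor X
# Output: vertex embedding tensor Y
def aggregation_sum(G, X, Y):
    feat_len = len(X[0])                # feature dimension (name assumed)
    for v in G.V:                       # line 5: outer loop over all vertices in the graph
        for e in v.get_inputs():        # line 6: middle loop over each incoming edge of the vertex
            src, dst = e.src_v, e.dst_v # source and destination vertex of the edge
            for k in range(feat_len):   # line 8: traversal in the feature dimension
                Y[dst][k] += X[src][k]  # line 9: accumulate the source feature into the target vertex
    return Y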
The preset unified abstract format can also include several GNN-specific data structures that capture the graph-level semantics of the operator. In the abstract representation above, the input of the aggregation-sum operator is the graph G and the vertex embedding tensor X (VertexEmbeddingTensor X), and the output is a new vertex embedding tensor Y (VertexEmbeddingTensor Y). The graph is a pair of two sets, where V and E represent the set of all vertices and the set of all edges in the graph. Each element of the set V represents a vertex, and each vertex can obtain its incoming and outgoing edges through the get_inputs() and get_outputs() interfaces, respectively. Each element of the set E represents an edge; each edge corresponds to a pair of vertices, and the corresponding source vertex and destination vertex can be obtained through src_v and dst_v.
Design of the preset abstract format
As previously described, the inventors abstract all GNN computation into three execution phases: moving data from vertices to edges, performing edge-wise computations on all edges, and performing a reduction function from edges to their associated vertices. Different operators perform different edge computations and reduction computations, and may skip certain phases.
For example, the aggregation-sum operator in the SageSum model simply copies the source vertex feature of each edge to form the edge feature and performs no edge computation. Then, for each vertex, it reduces all of its edge features into one new vertex feature. In contrast, the GAT model contains several graph operators with other, different computation patterns. Its first message creation operator is very lightweight: it adds the features of the source vertex and the destination vertex of each edge as the edge feature for computing the attention weight, and skips the final reduction phase. Its second aggregation-sum operator, by contrast, involves all three computation phases: it first copies the feature from the source vertex, then performs an edge-wise multiplication with the previously generated edge weights, and finally reduces the transformed edge features into vertex features. The second operator is therefore more computationally intensive than the first.
In view of the dissimilarity of these graph operators, the present invention bases the graph operator abstraction on nested loops and allows users to customize the input tensors and the element-wise operations to represent different operators. Details of the unified abstraction are given below. Compared with the representation of the aggregation-sum above, the nested loops of the unified abstraction remain the same, but the innermost code block (the inner loop) introduces two additional dynamic operators, edge_op (corresponding to the first operator) and gather_op (corresponding to the second operator), which may be defined by the user.
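The unified-abstraction listing likewise appears only as an image in the original document; the sketch below reconstructs its shape from the surrounding description (the addr helper stands for the tensor-type-dependent addressing of lines 10 to 12, and names such as e.id are assumptions):
def unified_graph_op(G, Tensor_A, A_Type, Tensor_B, B_Type, Tensor_C, C_Type,
                     edge_op, gather_op, feat_len):
    def addr(tensor_type, src, dst, eid):    # lines 10-12: addressing depends on the tensor type
        return {"Src_V": src, "Dst_V": dst, "Edge": eid}.get(tensor_type)
    for v in G.V:                            # outer loop: all vertices
        for e in v.get_inputs():             # middle loop: incoming edges of the vertex
            src, dst, eid = e.src_v, e.dst_v, e.id
            for k in range(feat_len):        # inner loop: feature dimension
                a = Tensor_A[addr(A_Type, src, dst, eid)][k]
                b = Tensor_B[addr(B_Type, src, dst, eid)][k] if B_Type != "NULL" else None
                tmp = edge_op(a, b)          # edge-wise computation (first operation statement)
                c_idx = addr(C_Type, src, dst, eid)
                Tensor_C[c_idx][k] = gather_op(Tensor_C[c_idx][k], tmp)  # edge-to-vertex reduction
    return Tensor_C
For the aggregation-sum example, edge_op would copy the source-vertex value and gather_op would accumulate it, reproducing the earlier listing.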
edge_op implements the edge-wise computation for each edge, while gather_op implements the edge-to-vertex reduction operation. For example, to represent the aggregation-sum in the example above, the two functions edge_op and gather_op can be set to copy_lhs (i.e., copy from the left-hand side) and sum, respectively.
In addition to edge_op, gather_op and the graph structure G, the unified abstraction also requires three additional input embedding tensors. To maintain the flexibility of representing different graph operators, the type of each of these three embedding tensors can be any of the following: the source vertex embedding tensor (Src_V), the target vertex embedding tensor (Dst_V), the edge embedding tensor (Edge), or NULL. The different tensor types also determine different addressing modes in the loop computation (lines 10 to 12). For example, the output tensor Y of the aggregation-sum in the example above corresponds to the C tensor with the target-vertex feature type, so the addressing in its line 9 is always based on dst.
In summary, the combination of edge_op and gather_op, together with the types of the tensors A, B and C, can capture the full semantics of a graph operator, including its computation and memory-movement patterns. The following equation defines the unified abstraction of the present invention, where ψ is the edge_op function and ρ is the gather_op function.
$$\mathrm{Tensor\_C}[idx_C] \;=\; \mathop{\rho}_{(u,v,e)\in E}\Big(\psi\big(\mathrm{Tensor\_A}[idx_A],\ \mathrm{Tensor\_B}[idx_B]\big)\Big)$$

where the indices idx_A, idx_B and idx_C are determined by the respective tensor types.
The complete realization of all graph operation semantics and the corresponding parameter configurations are shown in Table 3 below. The unified abstraction of the present invention can therefore support message creation and message aggregation as well as fused graph operator semantics, providing the basis for flexible optimization of different graph operators.
TABLE 3 (parameter configurations — edge_op, gather_op and the types of tensors A, B, C — realizing each graph operator semantic; reproduced as an image in the original document)
As can be seen in connection with Table 3, for the different graph operators the edge_op operation can be copy_lhs (i.e., copy from the left), copy_rhs (i.e., copy from the right), mul (multiply), add, sub (subtract) or div (divide), and the gather_op operation can be copy_lhs, copy_rhs, sum, max (maximum), min (minimum) or mean (average); these are defined as edge_op_list and gather_op_list above. Further, the tensor types involved in graph operator computation include Src_V (source vertex embedding tensor), Dst_V (destination vertex embedding tensor) and Edge (edge embedding tensor), as defined by tensor_type_list. Since the edge_op or gather_op operation of some operators can be empty, the type_idx_dict (type index dictionary) defines the values Src_V: src, Dst_V: dst, Edge: edge, and NULL.
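Written out as plain Python, the lists just described would look roughly as follows (a transcription of the description, not the framework's actual source; the value stored for NULL is an assumption):
edge_op_list     = ["copy_lhs", "copy_rhs", "mul", "add", "sub", "div"]
gather_op_list   = ["copy_lhs", "copy_rhs", "sum", "max", "min", "mean"]
tensor_type_list = ["Src_V", "Dst_V", "Edge"]
type_idx_dict    = {"Src_V": "src", "Dst_V": "dst", "Edge": "edge", "NULL": None}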
Thus, the preset abstract format of the present invention includes a first input embedding tensor (Tensor A), a second input embedding tensor (Tensor B), and a third input embedding tensor (Tensor C), wherein in the first operation statement the first input embedding tensor and the second input embedding tensor are operated on using the first operator, and in the second operation statement the third input embedding tensor is operated on using the second operator, and wherein each of the first, second and third input embedding tensors is one of a source vertex embedding tensor, a target vertex embedding tensor, an edge embedding tensor, and NULL.
As described previously, after an operator is represented as a loop nesting statement with the preset abstract format, an optimized parallel policy can be selected from the parallel policy repository based on the graph operator information extracted from the loop nesting statement and the graph data information of the operator.
The optimized parallel strategy needs to be explored in an optimization space. How to determine the optimization space, which is critical to achieving high-performance graph execution, is described below. In particular, the trade-off space of different parallelization strategies for executing graph operators on the GPU is explored, and it is demonstrated that the optimal strategies for different datasets and different graph operators differ.
Trade-off space
The trade-off space that affects the performance of graph operators on the GPU involves a three-dimensional optimization space: locality, parallelism, and work efficiency.
Locality describes the amount of spatial and temporal reuse in a program. Better locality improves cache hit rates and, potentially, program performance. The GPU contains a per-SM L1 cache and a shared L2 cache. To improve the locality of graph operators, tiling or blocking (e.g., the grouping and tiling parameters below) can be applied to the nested loops, which limits the working set of each SM.
Parallelism refers to the amount of computation that can be performed simultaneously. Modern GPUs typically contain thousands of compute units, so higher parallelism can increase hardware resource utilization, hide memory access latency, and thus improve program performance. The simplest way to increase parallelism of graph operators is to start more threads (threads), bundles (warp) or blocks (threaded blocks).
Work efficiency is expressed as the inverse of overhead. Different execution strategies for the same operator may introduce additional computation, such as address computation. Meanwhile, to execute a graph operator on the GPU, atomic instructions are required when there are write conflicts, which introduces locking overhead and thus reduces work efficiency. For example, each edge may be mapped to one thread; since different edges may share the same vertex, an atomic addition instruction is required when performing the accumulating reduction from edge features to vertex features.
Locality, parallelism, and work efficiency form an impossible triangle, meaning that no single strategy can improve these three metrics simultaneously. Different parallelization strategies have a positive and negative impact on various indicators in the trade-off space. Given the diversity of graph operators and graph dataset features, it can be demonstrated that a fixed parallelization strategy leads to optimal performance in only a few cases.
The aggregation-sum operator from the earlier example is used as a representative case to illustrate the effect of the various parallelization strategies on the three trade-off metrics. Two classical parallelization strategies used in existing graph processing systems are considered first: vertex-parallel and edge-parallel, whose GPU implementations mean that one thread handles all computations of one vertex or one edge. We therefore define them as the thread-vertex and thread-edge policies, in which different threads execute in parallel. Because the number of edges in a graph is usually much larger than the number of vertices, the thread-vertex policy reduces parallelism compared with the thread-edge policy, but improves the reuse of output data and thus locality. Meanwhile, the thread-edge policy reduces work efficiency because multiple threads may update the same vertex, which requires atomic update operations.
Meanwhile, because vertex/edge features in GNNs are vectors, whereas traditional graph processing algorithms such as PageRank use scalar values, GNN-specific feature-dimension parallelization strategies are introduced, called the warp-vertex (thread bundle-vertex) and warp-edge (thread bundle-edge) strategies. In these policies, each warp (a set of 32 threads in the GPU) processes only one vertex or edge at a time, and different threads in the thread bundle process different feature elements. The thread bundle-vertex and thread bundle-edge policies can launch more threads than the thread-vertex and thread-edge policies, thereby increasing parallelism. However, they also compromise locality because the cache capacity available per thread bundle is reduced.
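The four coarse-grained strategies can be summarized by how a thread index is mapped to work; the small sketch below is an illustration under a warp-size-32 assumption, not the kernels' actual index arithmetic.
def work_of_thread(tid, strategy, feat_len, warp_size=32):
    """Return (work-item kind, work-item index, feature indices) handled by thread `tid`."""
    if strategy == "thread-edge":      # one thread: all features of one edge
        return ("edge", tid, range(feat_len))
    if strategy == "thread-vertex":    # one thread: all features of one vertex
        return ("vertex", tid, range(feat_len))
    lane, unit = tid % warp_size, tid // warp_size
    if strategy == "warp-edge":        # one warp: one edge, lanes split the feature dimension
        return ("edge", unit, range(lane, feat_len, warp_size))
    if strategy == "warp-vertex":      # one warp: one vertex, lanes split the feature dimension
        return ("vertex", unit, range(lane, feat_len, warp_size))
    raise ValueError(strategy)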
For the above four strategies, two fine-grained parameters are introduced to further explore the trade-off between locality and parallelism. The first parameter, which we call the V/E grouping parameter, combines multiple edges or vertices into one group. For example, for the thread-edge policy, setting this parameter to 4 means that one thread handles four edges instead of the original one, which improves locality but reduces parallelism. It also reduces work efficiency due to the additional grouping computation overhead.
The second parameter is feature tiling, which enables more threads by exploiting parallelism in the feature dimension. For example, for a feature size of 64 and a thread bundle size of 32, setting the feature tiling parameter to 2 maps one vertex/edge to two thread bundles, rather than the single bundle used when feature tiling is not applied. Compared with V/E grouping, this strategy increases parallelism but reduces locality. It also reduces work efficiency due to the extra address computation of feature tiling.
Thus, the parallel policy repository of the present invention may comprise: a thread-edge policy in which one thread performs all operations of one edge; a thread-vertex policy in which one thread performs all operations of one vertex; a thread bundle-edge policy in which one thread bundle performs all operations of one edge; and a thread bundle-vertex policy in which one thread bundle performs all operations of one vertex. Further, derived strategies may be defined based on the parameters described above. To this end, selecting an optimized parallel strategy based on the graph operator information extracted from the unified abstract format and the graph data information of the operator further comprises introducing one of the following parameters: a grouping parameter (V/E grouping) for causing one thread or thread bundle in the selected optimized parallel policy to process multiple edges or vertices; and a tiling parameter (feature tiling) for causing multiple threads or thread bundles in the selected optimized parallel policy to process one edge or vertex. A small example of how these parameters affect the launch configuration is given below.
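A back-of-the-envelope helper illustrating the launch-shape effect of the two fine-grained parameters, under the warp-size-32 assumption (not the scheduler's real formula):
def num_threads(strategy, num_vertices, num_edges, grouping=1, tiling=1, warp_size=32):
    work_items = num_edges if strategy.endswith("edge") else num_vertices
    units = -(-work_items // grouping)   # V/E grouping: one thread/warp handles `grouping` items
    per_unit = warp_size if strategy.startswith("warp") else 1
    return units * per_unit * tiling     # feature tiling: `tiling` threads/warps per item
For instance, with the warp-vertex strategy and tiling set to 2, each vertex is served by two warps, matching the feature-tiling example above; with the thread-edge strategy and grouping set to 4, only one quarter as many threads are launched, each covering four edges.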
As previously described, the search for the optimized parallel strategy is based on the graph operator information as well as on the graph data information of the operator. The optimal execution strategy for a graph operator varies with the dataset and the feature size, i.e., different strategies achieve the best results under different circumstances. In other words, the optimized parallel policy needs to be determined from both the graph operator and the graph data information.
Further, the present invention proposes μGrapher, a unified and high-performance graph operator interface for GNNs, which adopts the unified abstraction described above and incorporates the parallelization strategies. FIG. 5 illustrates an operational overview of the unified graph operator interface according to one embodiment of the invention. FIG. 5 shows its two main features, namely the ability to provide a complete semantic representation for various graph operators, and the ability to achieve efficient execution by automatically exploring flexible and dynamic parallelization strategies. μGrapher can uniformly express various operators, including the Scatter operator, the Gather operator, message creation operators, message aggregation operators and fused graph operators shown in the figure. The dynamic parallel policy can be selected from the thread-edge, thread-vertex, thread bundle-edge and thread bundle-vertex policies, and the policy can be modified using the grouping parameter and the tiling parameter. Existing GNN frameworks can call the unified API of μGrapher, or call graph operators rewritten in the unified format. The code generated by μGrapher can be executed efficiently on dedicated hardware such as a GPU.
Thus, μGrapher can provide specialized and optimized kernels for all GNN graph operators on different GPU architectures and different graph datasets. Based on the unified abstraction and the decoupled parallelization strategies, an example of the μGrapher API is shown below:
op_info = [edge_op, gather_op, Tensor_A, A_Type, Tensor_B, B_Type, Tensor_C, C_Type]
parallel_info = [parallel_strategy, Grouping_Param, Tiling_Param]
uGrapher(Graph_Tensor, op_info, parallel_info)
The μGrapher API contains three arguments: Graph_Tensor, i.e., the graph data; op_info (operator information), which conveys the computation information about edge_op, gather_op and the input tensors; and parallel_info (parallelization information), which specifies the parallelization strategy.
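A hypothetical call expressing the aggregation-sum operator of the earlier example through this API might look as follows, assuming X and Y are the vertex embedding tensors from that listing (the concrete strategy string and parameter encoding are assumptions):
op_info = ["copy_lhs", "sum",          # edge_op, gather_op
           X, "Src_V",                 # Tensor_A: source vertex embedding tensor
           None, "NULL",               # Tensor_B: unused when edge_op only copies
           Y, "Dst_V"]                 # Tensor_C: output, addressed by the destination vertex
parallel_info = ["warp-vertex", 1, 2]  # parallel_strategy, Grouping_Param, Tiling_Param
uGrapher(Graph_Tensor, op_info, parallel_info)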
The API describes operator computation, graph data and parallelization strategy separately, so that users can devise their own heuristics to determine the optimal strategies for different operators and graph structures. Meanwhile, when the user does not specify any parallelization policy, the μGrapher interface can perform automatic tuning to find the optimal parallelization policy (e.g., the model- or decision-tree-based selection described below).
How the CUDA kernel is generated for an operator defined through the μGrapher API is described below. At a high level, the CUDA code generator of the invention also follows the μGrapher design principle and completely decouples the scheduling strategy of an operator from its computation.
To provide complete scheduling support for various graph operators, a CUDA kernel template is implemented manually, using template-based programming, for each parallelization strategy described above. One device-function interface is then retained in each template to support the various graph operators.
Code generation is an automatic end-to-end process that ensures correctness and optimizes the generated CUDA kernel for different graph operators. The whole process consists of two code traversals and is flexible and extensible enough to support future operators. The first code traversal fuses the two innermost code statements when a member of op_info (e.g., edge_op or gather_op) is NULL, reducing register usage and read/write overhead. The second code traversal generates the final device-function code, which may choose to use atomic operations by analyzing whether different threads will compete for the same data.
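A rough sketch of what these two traversals decide for the innermost block, assuming C-style statements are emitted as strings (the function and argument names here are illustrative, not the generator's actual API):
def emit_inner_block(edge_op, gather_op, needs_atomic):
    # Traversal 1: fuse the two innermost statements when edge_op or gather_op is NULL,
    # saving registers and read/write traffic.
    if edge_op is None:
        return f"C[idx_C][k] = {gather_op}(C[idx_C][k], A[idx_A][k]);"
    if gather_op is None:
        return f"C[idx_C][k] = {edge_op}(A[idx_A][k], B[idx_B][k]);"
    # Traversal 2: switch to an atomic reduction when different threads may write
    # the same output element (e.g. edge-parallel strategies reducing into a vertex).
    reduce_stmt = ("atomicAdd(&C[idx_C][k], tmp);"
                   if needs_atomic and gather_op == "sum"
                   else f"C[idx_C][k] = {gather_op}(C[idx_C][k], tmp);")
    return f"tmp = {edge_op}(A[idx_A][k], B[idx_B][k]);\n" + reduce_stmt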
Through the free combination of the CUDA kernel functions (global functions) and device functions, this design brings a flexible and efficient realization for different operators: the former provides support for the different parallelization strategies, and the latter provides support for the different arithmetic operations in graph operators.
Finding the optimal parallelization strategy can be challenging and time consuming, since in μGrapher there are in total on the order of 10^4 effective strategies for one graph operator, and an exhaustive grid search takes several days. Therefore, a prediction model can be trained with the gradient boosting framework LightGBM to select the optimal strategy in the parallelization space. In one embodiment, features of the graph data and of the operator information can be combined for model training. Introducing such an optimization strategy prediction model almost completely eliminates the overhead of searching for an optimized scheduling strategy. In another embodiment, the graph operator information and the graph data can be fed into an optimization strategy decision tree, and the optimized parallel strategy is determined based on the decision of the decision tree.
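A hedged sketch of what such a LightGBM-based selector could look like; the feature layout and the runtime-regression formulation are assumptions, not the patented system's exact design.
import lightgbm as lgb
import numpy as np

def train_strategy_predictor(samples):
    """samples: list of (graph_feats, op_feats, strategy_id, measured_runtime)."""
    X = np.array([g + o + [s] for (g, o, s, _) in samples], dtype=np.float32)
    y = np.array([t for (_, _, _, t) in samples], dtype=np.float32)
    model = lgb.LGBMRegressor(n_estimators=200)   # regress runtime of a (workload, strategy) pair
    model.fit(X, y)
    return model

def pick_strategy(model, graph_feats, op_feats, candidate_ids):
    X = np.array([graph_feats + op_feats + [s] for s in candidate_ids], dtype=np.float32)
    return candidate_ids[int(np.argmin(model.predict(X)))]  # lowest predicted runtime wins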
According to another aspect of the invention, a graph neural network inference system is provided. FIG. 6 illustrates a schematic diagram of the components of a graph neural network inference system in accordance with one embodiment of the present invention. System 600 includes compiler 610 and execution unit 620. The compiler 610 is configured to: represent a graph operator in the graph computation graph corresponding to a graph neural network computing task as a loop nesting statement with the preset abstract format; search for an optimized parallel strategy based on the graph operator information extracted from the unified abstract format and the graph data information of the operator; and generate executable code according to the optimized parallel strategy and the loop nesting statement with the unified abstract format. The execution unit 620 is configured to execute the executable code using neural-network-computing-specific hardware. In a computing platform implementing the graph neural network, the compiler and execution unit described above may be arranged on each computing node.
FIG. 7 is a schematic structural diagram of a computing device that can be used to implement the graph neural network optimization method described above according to an embodiment of the present invention.
Referring to fig. 7, computing device 700 includes memory 710 and processor 720.
Processor 720 may be a multi-core processor or may include multiple processors. In some embodiments, processor 720 may include a general-purpose host processor and one or more special-purpose coprocessors such as a Graphics Processor (GPU), a Digital Signal Processor (DSP), or the like. For example, processor 720 may include a GPU dedicated to performing parallel neural network computations. In some embodiments, processor 720 may be implemented using or include custom circuits, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 710 may include various types of storage units, such as system memory, Read-Only Memory (ROM), and permanent storage. The ROM may store, among other things, static data or instructions for processor 720 or other modules of the computer. The persistent storage device may be a read-write storage device. The persistent storage may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, the persistent storage employs a mass storage device (e.g., magnetic or optical disk, flash memory). In other embodiments, the permanent storage may be a removable storage device (e.g., floppy disk, optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as a dynamic random access memory. The system memory may store instructions and data that some or all of the processors require at runtime. Further, the memory 710 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic and/or optical disks may also be employed. In some embodiments, memory 710 may include a removable storage device that is readable and/or writable, such as a Compact Disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, micro-SD card, etc.), a magnetic floppy disc, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 710 has stored thereon executable code that, when processed by the processor 720, causes the processor 720 to perform the graph neural network optimization method described above.
The graph operators in existing graph neural network frameworks are highly dynamic. The invention abstracts all graph operators in the graph neural network into the following three execution phases: moving data from vertices to edges, performing edge computations on all edges, and performing a reduction function from edges to their associated vertices.
Different operators perform different edge computations and reduction computations, and may skip certain phases. In view of the dissimilarity of these graph operators, the present invention, on the basis of using nested loops as the graph operator scheduling expression, allows the user to customize the input tensors and the function operations at the different phases to represent different operators.
edge_op implements the functional representation of memory access and computation on the edges, while gather_op implements the functional representation of the edge-to-vertex reduction. The preset abstract format also requires the type information of the other three input embedding tensors (Tensors A, B and C). To retain the flexibility of representing different graph operators, these three embedding tensors may each be of any of the following types: the source vertex embedding tensor (Src_V), the destination vertex embedding tensor (Dst_V), the edge embedding tensor (Edge), or NULL.
The different data types also determine different addressing modes in the loop computation. The unified abstraction of the invention supports the semantics of the different graph operators of message creation, message aggregation and fusion optimization, which provides the basis for a unified computation optimization interface.
The unified high-performance computing interface design of μGrapher can provide specialized and optimal computation scheduling for the graph operators of all graph neural networks on different GPU architectures and graph datasets. On the basis of interfacing with an upper-layer graph neural network computing framework, the unified high-performance computing interface can realize computational support for different graph operators, including Scatter, Gather, message creation, message aggregation and fusion optimization, and can also provide support for parallel strategies, including 4 coarse-grained parallel modes and 2 fine-grained parallel control and tuning parameters.
The unified high-performance computing interface contains three parameters: graph_label, which carries the graph data; op_info, which conveys the computation information about edge_op, gather_op, and the input tensors; and parallel_info, which specifies the parallelization strategy.
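For illustration only, the three parameters could be organized as in the sketch below; only the parameter names graph_label, op_info, and parallel_info come from the description above, while the struct layouts and the dispatch function name gnn_compute are assumptions.

// 4 coarse-grained parallel modes (see the parallel strategy description).
enum class ParallelMode { ThreadEdge, ThreadVertex, WarpEdge, WarpVertex };

// graph_label: the graph data (topology plus embeddings); field layout assumed.
struct GraphLabel {
    const int*   row_ptr;
    const int*   col_idx;
    const float* src_embed;   // Src_V
    const float* dst_embed;   // Dst_V
    const float* edge_embed;  // Edge; may be nullptr (NULL in the abstraction)
    int num_vertices, num_edges, feat_dim;
};

// op_info: computation information about edge_op, gather_op and the input tensors.
struct OpInfo {
    int edge_op_id;                    // e.g. add / mul / copy
    int gather_op_id;                  // e.g. sum / max / mean
    int tensor_a, tensor_b, tensor_c;  // which embedding each of Tensor A/B/C refers to
};

// parallel_info: the parallelization strategy (coarse-grained mode + fine-grained knobs).
struct ParallelInfo {
    ParallelMode mode;
    int group;   // edges or vertices handled per thread / thread bundle
    int tile;    // threads or thread bundles cooperating on one edge / vertex
};

// Hypothetical entry point that would dispatch to a CUDA kernel specialized
// for the given op_info and parallel_info.
void gnn_compute(const GraphLabel& graph_label,
                 const OpInfo& op_info,
                 const ParallelInfo& parallel_info);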
The interface design describes operator computation, graph data, and the parallelization strategy separately, so that the system can use heuristic methods to determine the optimal strategy for the computation scenarios of different operators and graph structures. The invention provides comprehensive scheduling support for various graph operators by means of template-based high-performance programming. First, a CUDA-level compute template is manually implemented for each coarse-grained parallelization strategy. Then, a device function parameter interface is reserved in each template to support various graph operators. Meanwhile, the invention implements an automatic computation code generation process oriented to different graph operators and optimizes the generated CUDA kernels while ensuring correctness. The unified high-performance computing interface is invoked as the underlying interface without changing user code. As can be seen from the code segment, the graph operators used by existing graph neural network frameworks, in their different scenarios, can be docked to the unified high-performance computing interface with only simple replacements.
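As a minimal CUDA sketch (assumed, not the patent's actual templates), one of the coarse-grained templates, the thread-vertex strategy in which each thread handles all edges of one destination vertex, could look as follows; EdgeOp and GatherOp play the role of the device function parameter interface reserved in the template, and the commented launch shows a hypothetical instantiation for a copy-and-sum operator.

// Thread-vertex template: one thread per destination vertex; EdgeOp / GatherOp
// are the reserved device-function parameters plugged in per graph operator.
template <typename EdgeOp, typename GatherOp>
__global__ void thread_vertex_kernel(const int* row_ptr, const int* col_idx,
                                     const float* src_feat, float* dst_feat,
                                     int num_vertices, int dim,
                                     EdgeOp edge_op, GatherOp gather_op) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices) return;
    for (int k = 0; k < dim; ++k) {                    // feature dimension
        const float dval = dst_feat[v * dim + k];      // destination embedding (Dst_V)
        float acc = dval;                              // reduce into the current value
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
            const int u = col_idx[e];
            const float msg = edge_op(src_feat[u * dim + k], dval);  // edge computation
            acc = gather_op(acc, msg);                               // edge-to-vertex merge
        }
        dst_feat[v * dim + k] = acc;
    }
}

// Placeholder functors: copy the source embedding on the edge, sum at the vertex.
struct CopySrc { __device__ float operator()(float s, float) const { return s; } };
struct SumRed  { __device__ float operator()(float a, float m) const { return a + m; } };

// Hypothetical launch for a sum-aggregation operator:
// thread_vertex_kernel<<<(num_vertices + 255) / 256, 256>>>(
//     row_ptr, col_idx, src_feat, dst_feat, num_vertices, dim, CopySrc{}, SumRed{});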
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the steps defined in the above-described method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While embodiments of the present invention have been described above, the above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A graph neural network optimization method, comprising:
representing a graph operator of the graph neural network computing task into a loop nesting statement based on a preset abstract format;
selecting an optimized parallel strategy from a parallel strategy library based on the graph operator information extracted from the loop nesting statement and the graph data information of the operator; and
executing the graph neural network computing task according to the selected optimized parallel strategy.
2. The method of claim 1, wherein the loop nesting statement represented based on the preset abstract format comprises:
an outer-layer loop statement for traversing all vertices in the graph;
a middle-layer loop statement for traversing the respective edges of each vertex; and
an inner-layer loop statement for representing the specific operation of a specific operator.
3. The method of claim 2, wherein the inner-layer loop statement is used for traversing along the feature dimension and comprises:
a first operation statement defined by a first operator operating on each edge feature; and
a second operation statement defined by a second operator that reduces the transformed features after the edge operation.
4. The method of claim 3, wherein the loop nesting statement includes a first input embedding tensor, a second input embedding tensor, and a third input embedding tensor, wherein in the first operation statement the first input embedding tensor and the second input embedding tensor are operated on using the first operator, and in the second operation statement the third input embedding tensor is operated on using the second operator, and wherein the first input embedding tensor, the second input embedding tensor, and the third input embedding tensor are each one of a source vertex embedding tensor, a destination vertex embedding tensor, an edge embedding tensor, and NULL (empty).
5. The method of claim 1, wherein the optimized parallel strategy is selected from a parallel strategy library that includes at least two of:
a thread-edge strategy in which one thread performs all operations of one edge;
a thread-vertex strategy in which one thread performs all operations of one vertex;
a thread bundle-edge strategy in which one thread bundle (warp) performs all operations of one edge; and
a thread bundle-vertex strategy in which one thread bundle (warp) performs all operations of one vertex.
6. The method of claim 5, wherein selecting an optimized parallel strategy from a parallel strategy library based on the graph operator information extracted from the loop nesting statement and the graph data information of the operator further comprises defining the selected parallel strategy by introducing one of the following parameters:
a grouping parameter for causing one thread or thread bundle in the selected parallel strategy to process a plurality of edges or vertices; and
a tiling parameter for causing a plurality of threads or thread bundles in the selected parallel strategy to process one edge or vertex.
7. The method of claim 1, wherein selecting an optimized parallel strategy from a parallel strategy library based on the graph operator information extracted from the loop nesting statement and the graph data information of the operator comprises:
feeding the graph operator information and the graph data information as input into a trained optimization strategy prediction model, and taking the output of the optimization strategy prediction model as the optimized parallel strategy; or
feeding the graph operator information and the graph data information into an optimization strategy decision tree, and determining the optimized parallel strategy based on the decision of the optimization strategy decision tree.
8. The method of claim 1, wherein performing the graph neural network computational task according to the selected optimized parallel strategy comprises:
generating an executable code according to the optimized parallel strategy and a loop nesting statement based on the preset abstract format; and
executing the executable code by dedicated neural network computing hardware.
9. The method of claim 8, wherein generating executable code from the optimized parallel policy and the loop nesting statement based on the preset abstract format comprises:
when the first operator or the second operator is empty, fusing the first operation statement and the second operation statement, wherein the inner-layer loop statement of the loop nesting statement comprises the first operation statement defined by the first operator operating on each edge feature and the second operation statement defined by the second operator that reduces the transformed features after the edge operation.
10. A graph neural network inference system, comprising a compiler and an execution unit, wherein the compiler is configured to:
representing a graph operator of the graph neural network computing task into a loop nesting statement based on a preset abstract format;
selecting an optimized parallel strategy from a parallel strategy library based on the graph operator information extracted from the loop nesting statement and the graph data information of the operator;
generating an executable code according to the optimized parallel strategy and a loop nesting statement based on the preset abstract format; and
and the execution unit is configured to:
execute the executable code using dedicated neural network computing hardware.
11. A computing device, comprising:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method of any one of claims 1 to 9.
12. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-9.
CN202211353528.8A 2022-11-01 2022-11-01 Graph neural network optimization method and graph neural network inference system Pending CN115860061A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211353528.8A CN115860061A (en) 2022-11-01 2022-11-01 Graph neural network optimization method and graph neural network inference system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211353528.8A CN115860061A (en) 2022-11-01 2022-11-01 Graph neural network optimization method and graph neural network inference system

Publications (1)

Publication Number Publication Date
CN115860061A true CN115860061A (en) 2023-03-28

Family

ID=85662171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211353528.8A Pending CN115860061A (en) 2022-11-01 2022-11-01 Graph neural network optimization method and graph neural network inference system

Country Status (1)

Country Link
CN (1) CN115860061A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116932092A (en) * 2023-09-18 2023-10-24 之江实验室 Method, device, medium and equipment for automatically generating operator calling code
CN116932092B (en) * 2023-09-18 2024-01-09 之江实验室 Method, device, medium and equipment for automatically generating operator calling code
CN117993426A (en) * 2024-02-02 2024-05-07 中科弘云科技(北京)有限公司 Method and device for automatically optimizing graph neural network

Similar Documents

Publication Publication Date Title
Zeng et al. GraphACT: Accelerating GCN training on CPU-FPGA heterogeneous platforms
Dave et al. Hardware acceleration of sparse and irregular tensor computations of ml models: A survey and insights
US20220383082A1 (en) Neural network processing method and apparatus, computer device and storage medium
Zhang et al. BoostGCN: A framework for optimizing GCN inference on FPGA
EP4036803A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
US20220391665A1 (en) Method for splitting neural network model by using multi-core processor, and related product
Liu et al. Pudiannao: A polyvalent machine learning accelerator
Neto et al. LSOracle: A logic synthesis framework driven by artificial intelligence
CN115543639B (en) Optimization method for performing deep learning tasks in distributed mode and distributed system
Ediger et al. Massive streaming data analytics: A case study with clustering coefficients
Cano et al. High performance evaluation of evolutionary-mined association rules on GPUs
US11205050B2 (en) Learning property graph representations edge-by-edge
CN115860061A (en) Graph neural network optimization method and graph neural network inference system
CN114503125A (en) Structured pruning method, system and computer readable medium
US11144291B1 (en) Loop-oriented neural network compilation
CN114661480B (en) Deep learning task resource allocation method and system
US20220058052A1 (en) Data processing management methods for imaging applications
US11960982B1 (en) System and method of determining and executing deep tensor columns in neural networks
Jing et al. Efficient parallel algorithm for computing rough set approximation on GPU
Cunningham High performance algorithms for quantum gravity and cosmology
Wang et al. A novel parallel learning algorithm for pattern classification
Yasar et al. PGAbB: A Block-Based Graph Processing Framework for Heterogeneous Platforms
Liu et al. Software Defined Chips: Volume II
Trostanovsky Vertex-and-edge ordering for faster parallel graph processing
US11809981B1 (en) Performing hardware operator fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination