CN112767230A - GPU graph neural network optimization method and device

Info

Publication number: CN112767230A
Application number: CN202110222831.3A
Authority: CN
Inventors: 翟季冬; 黄可钊; 陈文光; 郑纬民
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2021-02-26
Filing date: 2021-02-26
Publication date: 2021-05-07

Abstract

A GPU graph neural network optimization method and a computer-readable medium are provided. The optimization method comprises the following steps: generating, from the definition of a GPU graph neural network model, a computation graph comprising tensors and operations; obtaining a plurality of equivalent computation graphs for the computation graph; comparing the computation amount of each computation graph and selecting the computation graph with the smallest computation amount; and generating corresponding GPU code for the selected computation graph. The selected computation graph can further be analyzed to obtain, for each operation, the data visibility range required for its input and the data visibility range of its output; mismatches between the data visibility ranges of dependent operations are resolved; and operations whose data visibility ranges match are merged into the same GPU kernel. The graph neural network optimization method reduces mismatches of data visibility ranges between operators in the graph neural network, so that operators can be merged to reduce memory access; at the same time, it can find an equivalent computation graph without redundant computation, thereby reducing redundant computation.

Description

GPU graph neural network optimization method and device

Technical Field

The present invention relates generally to graph neural networks, and more particularly, to a graph neural network optimization method and a graphics processing unit.

Background

In the past few years, Graph Neural Networks (GNNs) have become a major method of machine learning on graph data. They achieve state-of-the-art performance in prediction tasks on various graphs (e.g., node classification, graph classification, and link prediction). GNNs also outperform other methods (e.g., DeepWalk, Node2Vec) in representation learning tasks, providing better node representations for downstream tasks.

As a result, GNNs account for over 90% of the top-performing results on graph-related tasks in the Open Graph Benchmark (OGB) and CogDL. The influence of GNNs extends to many areas, from biology to medicine, social networking, personalized recommendation, knowledge graph processing, and the like. A GNN is a machine learning method that combines graph-related operations with neural network operations: it encodes relationships between entities into node features through graph operations, transforms node features through neural network operations, and updates training parameters.

Disclosure of Invention

The performance (speed) of GNNs is critical for many of their applications. Researchers are finding that deeper GNN models achieve better accuracy, and are beginning to study models on larger graph data to better model real-world problems. However, the computational efficiency of existing graph neural network systems limits the application and study of GNNs, leading some researchers to sacrifice model accuracy in exchange for speed.

Although GNNs are primarily composed of graph-related operations and neural network operations, simply putting together a graph computation framework and a DNN (deep neural network) framework is not sufficient to support GNN execution efficiently, because such a combination cannot efficiently handle the complex interactions between graph-related operations and neural network operations, which are critical to GNN performance.

For example, graph-related operations and neural network operations are interleaved in GNN computation graphs, and the complex data dependencies between them complicate kernel fusion, memory performance optimization, task scheduling, and load balancing.

Some GNN models perform neural network operations even within the graph structure, and it is difficult for existing frameworks to achieve high efficiency there because the dense computation of the neural network operations does not match the sparse patterns of the graph operations.

Existing GNN systems include PyTorch Geometric (PyG) (see reference [1]), Deep Graph Library (DGL) (see reference [2]) and NeuGraph. However, none of these exploits or addresses the interplay between graph-related operations and neural network operations in GNNs, so their implementations carry significant overhead, such as redundant computation and memory access. The inventors of the present invention found that on the Nvidia Tesla V100 GPU these frameworks achieve less than 10% of peak throughput, and that on some more complex GNN models about 30% of the time is spent on redundant memory access and computation.

The related documents are:

[1] Matthias Fey and Jan Eric Lenssen. "Fast Graph Representation Learning with PyTorch Geometric." CoRR abs/1903.02428 (2019).

[2] Minjie Wang, et al. "Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs." CoRR abs/1909.01315 (2019).

The present invention has been made in view of the above circumstances.

According to one aspect of the invention, a GPU graph neural network optimization method is provided, which includes: generating, from the definition of a GPU graph neural network model, a computation graph comprising tensors and operations; obtaining a plurality of equivalent computation graphs for the computation graph; comparing the computation amount of each computation graph and selecting the computation graph with the smallest computation amount; and generating corresponding GPU code for the selected computation graph.

Optionally, obtaining a plurality of equivalent computation graphs for the computation graph includes: obtaining the tensor information and the operation types on the computation graph; and, according to the type of each operation and the information of its input tensors, applying the commutative law, the associative law, and the distributive law to change the computation order and manner of the computation graph.

Optionally, the information of the tensor comprises a shape and a position.

Optionally, comparing the computation amounts of the respective computation graphs and selecting the computation graph with the smallest computation amount includes: for each computation graph, obtaining the number of floating-point operations required by each operation, and selecting the computation graph that requires the fewest floating-point operations.

Optionally, the GPU graph neural network optimization method may further include, after selecting the computation graph with the smallest computation amount: analyzing the selected computation graph to obtain, for each operation, the data visibility range required for its input and the data visibility range of its output; resolving mismatches between the data visibility ranges of operations that have a dependency relationship; and merging operations whose data visibility ranges match into the same GPU kernel.

Optionally, resolving mismatches between the data visibility ranges of dependent operations includes: using a data visibility range adapter to resolve the data visibility range mismatch between interdependent operations, wherein the data visibility range adapter is operable to detect the data visibility ranges of the preceding and subsequent operations, derive the minimum data visibility range the operations require, and then insert a code segment which, when run, shares the data through inter-thread communication or shared memory to change the data visibility range, so that the data visibility range satisfies the requirement for merging the preceding and subsequent operations into one kernel function.

Optionally, if the required minimum data visibility range is sharing within a warp, the data visibility range adapter uses GPU warp shuffle primitives to share the thread-private data; if the required minimum data visibility range is sharing within a thread block, the data visibility range adapter uses the GPU's shared memory, storing the thread-private data in shared memory to share it within the thread block.

According to another aspect of the present invention, there is provided a GPU graph neural network optimization method, including: generating, from the definition of a GPU graph neural network model, a computation graph comprising tensors and operations; analyzing the computation graph to obtain, for each operation, the data visibility range required for its input and the data visibility range of its output; resolving mismatches between the data visibility ranges of operations that have a dependency relationship; merging operations whose data visibility ranges match into the same GPU kernel function; and generating corresponding GPU code for the modified computation graph.

Optionally, resolving mismatches between the data visibility ranges of dependent operations includes: using a data visibility range adapter to resolve the data visibility range mismatch between interdependent operations, wherein the data visibility range adapter is a code segment that detects the data visibility ranges of the preceding and subsequent operations and derives the minimum data visibility range the operations require, after which a code segment is inserted which, when run, shares the data through inter-thread communication or shared memory to change the data visibility range, so that the data visibility range satisfies the requirement for merging the preceding and subsequent operations into one kernel function.

According to another aspect of the present invention, there is provided a graphics processing unit GPU comprising a processor and a memory having stored thereon computer executable code operable, when executed by the processor, to perform the aforementioned GPU graph neural network optimization method.

According to another aspect of the present invention, there is provided a computer readable medium having stored thereon computer executable code, which when executed by a processor is operable to perform the GPU graph neural network optimization method of any of claims 1 to 8.

With the GPU graph neural network optimization technique according to embodiments of the present invention, one or more of the following advantages may be achieved:

(1) By detecting the data visibility ranges of the operations in the graph neural network and resolving any mismatch between them in a fine-grained manner through the adapter, a plurality of operations can be executed in the same GPU kernel function, which reduces the number of GPU kernel launches and the amount of global memory access.

(2) Meanwhile, for the computation graph of the graph neural network, the invention can recursively apply equivalent linear transformations to generate a large number of equivalent computation graphs, automatically find the version of the computation graph that avoids redundant computation, and apply that version to the training process.

Drawings

FIG. 1 shows a general flow diagram 100 of a GPU graph neural network optimization method according to an embodiment of the invention.

FIG. 2 shows an overall flowchart 200 of a GPU graph neural network optimization method according to another embodiment of the invention.

FIG. 3 is a schematic workflow diagram of a GPU graph neural network optimization method according to an embodiment of the present invention.

FIG. 4 shows an example of a computation graph according to an embodiment of the invention.

FIG. 5 illustrates one example of a data visibility range adapter in accordance with one embodiment of the present invention.

FIG. 6 illustrates an example of redundant computation elimination according to an embodiment of the present invention.

Detailed Description

Various embodiments of the present invention are described below with reference to the accompanying drawings.

As shown in FIG. 1, in step S110, a computation graph including tensors and operations is generated from the definition of the GPU graph neural network model.

In step S120, a plurality of equivalent computation graphs are obtained for the computation graph.

In step S130, the computation amounts of the respective computation graphs are compared, and the computation graph with the smallest computation amount is selected.

In step S140, corresponding GPU code is generated for the selected computation graph.

The GPU graph neural network optimization method according to the embodiment of the present invention described in conjunction with fig. 1 exploits the combination of graph-related operations and neural network-related operations in a graph neural network, automatically detecting and reducing redundant computations by transforming the computational graph with linear properties.

As shown in FIG. 2, in step S210, a computation graph including tensors and operations is generated from the definition of the GPU graph neural network model.

In step S220, the computation graph is analyzed to obtain, for each operation, the data visibility range required for its input and the data visibility range of its output.

In step S230, mismatches between the data visibility ranges of operations that have a dependency relationship are resolved.

In step S240, operations whose data visibility ranges match are merged into the same GPU kernel.

In step S250, corresponding GPU code is generated for the modified computation graph.

In this embodiment of the invention, the data visibility ranges of the operations in the graph neural network are detected and any mismatch between them is resolved in a fine-grained manner through the adapter, so that a plurality of operations can be executed in the same GPU kernel function, which reduces the number of GPU kernel launches and the amount of global memory access.

FIG. 3 is a schematic workflow diagram of a GPU graph neural network optimization method according to an embodiment of the present invention, which combines the GPU graph neural network optimization methods shown in FIG. 1 and FIG. 2.

First, a user inputs code describing the graph neural network computation (item 1 in FIG. 3). Classical parsing techniques from compiler technology are applied to the user code to generate a computation graph containing tensors and operations (item 2 in FIG. 3, and FIG. 4, in which each circle with a gray background represents a tensor, each circle with a white background represents an operation, and the connections between circles represent the dependencies between operations and tensors). After the computation graph is obtained, a plurality of equivalent candidate computation graphs (item 3 in FIG. 3) are obtained through the computation graph transformation step of "redundant computation elimination". Floating-point operation count analysis is then applied to these equivalent candidates to select the computation graph with the fewest floating-point operations, i.e., the one without computational redundancy (item 4 in FIG. 3). On the computation graph without computational redundancy, mismatches of the data visibility range between operations are resolved by data visibility range adapters, which completes the merging of kernel functions (item 5 in FIG. 3: data at different stages is shared through the adapter so that the data visibility ranges match). Finally, code generation and compilation produce an efficient executable program that performs the graph neural network computation. The two steps, redundant computation elimination and the data visibility range adapter, are described in detail below.

First, redundant computation elimination

The redundant computation elimination method of this embodiment parses the code input by the user, which describes the computation of the graph neural network, and generates a computation graph composed of the computation operations.

After the computation graph is generated, the method derives the properties of each tensor and operation, such as tensor shape, tensor location, and operation type. The operation types are: element-wise operations, reduction operations, activation function operations, and other operations. The tensor locations are: edges, vertices (source, destination), and global. According to preset rules, whether an operation satisfies the associative, distributive, and commutative laws is determined from its operation type and tensor locations. For operations with the corresponding property, the related transformation (such as exchanging the computation order) is applied and a new computation graph is generated. On the new computation graph, the search for and application of these transformations continue, so that new equivalent computation graphs keep being generated.

For each of the newly generated equivalent computation graphs, the number of floating-point operations required by each operation is obtained from the tensor shapes, the tensor locations, and the operation types. Summing these gives the total number of floating-point operations required by the whole computation graph, and the computation graph that requires the fewest floating-point operations is selected.
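The selection step can be illustrated with a simple cost model. The sketch below assumes that each operation is summarized by its type, the number of rows it processes (the edge count for edge tensors, the vertex count for vertex tensors), and its feature widths. The FLOP formulas are common conventions rather than figures from this document, and the graph sizes are made up.

```python
# Hypothetical cost model; each op record is (kind, rows, f_in, f_out).

def op_flops(kind, rows, f_in, f_out):
    if kind == "matmul":        # [rows, f_in] @ [f_in, f_out]
        return 2 * rows * f_in * f_out
    if kind == "add":           # element-wise on [rows, f_out]
        return rows * f_out
    if kind == "gather":        # pure data movement, no arithmetic
        return 0
    raise ValueError(f"unknown op kind: {kind}")


def graph_flops(ops):
    """Total floating-point operations of one candidate graph."""
    return sum(op_flops(*op) for op in ops)


def select_cheapest(candidates):
    """Return the name of the candidate graph with the fewest FLOPs."""
    return min(candidates, key=lambda name: graph_flops(candidates[name]))


V, E, F_IN, F_OUT = 1_000, 20_000, 64, 32   # made-up graph and feature sizes

candidates = {
    # gather to edges, add on edges, then transform on edges
    "original": [("gather", E, F_IN, F_IN), ("gather", E, F_IN, F_IN),
                 ("add", E, F_IN, F_IN), ("matmul", E, F_IN, F_OUT)],
    # transform source and destination features on vertices,
    # then gather to edges and add on edges
    "transformed": [("matmul", V, F_IN, F_OUT), ("matmul", V, F_IN, F_OUT),
                    ("gather", E, F_OUT, F_OUT), ("gather", E, F_OUT, F_OUT),
                    ("add", E, F_OUT, F_OUT)],
}

best = select_cheapest(candidates)
```

With these assumed sizes the edge-side matrix multiplication dominates the original graph, so the vertex-side variant is selected even though it performs two matrix multiplications.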

As shown in FIG. 6, in the gray-background box of the computation graph before transformation (left side of FIG. 6), the feature vector of the source node (src.feat) and the feature vector of the destination node (dst.feat) are first added to obtain the sum of the feature vectors (src.feat + dst.feat), and a matrix multiplication is then applied to this sum to transform it ((src.feat + dst.feat) * weight). After redundant computation elimination is applied to the original computation graph, the transformed computation graph (right side of FIG. 6) is obtained. It first transforms the feature vector of the source node (src.feat) and the feature vector of the destination node (dst.feat), and then sums the transformed feature vector of the source node (src.feat_weight) and that of the destination node (dst.feat_weight) to obtain (src.feat_weight + dst.feat_weight). In the transformed computation graph, the matrix multiplication is moved from the edges to the vertices; because the number of edges in graph data is far larger than the number of vertices, the computation amount is greatly reduced.
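The equivalence used in FIG. 6 is the distributive law of matrix multiplication over addition. The following check on a toy graph is a sketch with made-up values; the helper names are hypothetical, and integer features are used so that both sides compare exactly.

```python
def matvec(row, weight):
    """Multiply a row vector [f_in] by a weight matrix [f_in][f_out]."""
    return [sum(row[k] * weight[k][j] for k in range(len(row)))
            for j in range(len(weight[0]))]


def vadd(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


feat = [[1, 2], [3, 4], [5, 6]]                   # 3 vertices, 2 features
weight = [[1, 0, 2], [0, 1, 3]]                   # maps 2 features to 3
edges = [(0, 1), (1, 2), (2, 0), (0, 2), (1, 0)]  # (src, dst) pairs

# Before: add on every edge, then one transform per edge.
before = [matvec(vadd(feat[s], feat[d]), weight) for s, d in edges]

# After: one transform per vertex, then add on every edge.
feat_weight = [matvec(row, weight) for row in feat]
after = [vadd(feat_weight[s], feat_weight[d]) for s, d in edges]

transforms_before = len(edges)   # row-by-matrix products on edges
transforms_after = len(feat)     # row-by-matrix products on vertices
```

Both orderings produce the same per-edge result, while the number of row-by-matrix products drops from the edge count to the vertex count.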

Second, data visibility range adapter

On the basis of the final computation graph, the required input data visibility range and the output data visibility range of each operation are obtained from the operation type and the shapes of its input and output tensors. The data visibility range is the scope of threads among which a piece of data is shared. There are four data visibility ranges: thread-private, shared within a warp, shared within a thread block, and globally shared. When the data visibility ranges of two operations do not match, the two operations must be executed in different kernel functions, and the resulting reads and writes of global memory cause redundant memory access overhead.
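The cost of a mismatch can be sketched by grouping a chain of dependent operations into kernels when no adapter is available. The `Scope` enum, the operation names, and the matching rule (the producer's output range must be at least as wide as the consumer's required input range) are assumptions made for this illustration.

```python
from enum import IntEnum


class Scope(IntEnum):
    """The four data visibility ranges, ordered from narrowest to widest."""
    THREAD = 0   # thread-private
    WARP = 1     # shared within a warp
    BLOCK = 2    # shared within a thread block
    GLOBAL = 3   # globally shared


def split_into_kernels(ops):
    """Group a dependency chain into kernels when no adapter is available.

    ops is a list of (name, input_scope, output_scope). A new kernel starts
    whenever the producer's output range is narrower than the consumer's
    required input range, because the data must then pass through global
    memory between two kernel launches.
    """
    kernels = [[ops[0][0]]]
    for (_, _, out_scope), (name, in_scope, _) in zip(ops, ops[1:]):
        if out_scope >= in_scope:
            kernels[-1].append(name)      # ranges match: same kernel
        else:
            kernels.append([name])        # mismatch: kernel boundary
    return kernels


# Hypothetical chain: two per-thread operations feeding a block-level one.
chain = [
    ("edge_score", Scope.THREAD, Scope.THREAD),
    ("leaky_relu", Scope.THREAD, Scope.THREAD),
    ("row_softmax", Scope.BLOCK, Scope.THREAD),
]
kernels = split_into_kernels(chain)
```

In this example the block-level operation ends up in its own kernel, which is the situation the adapter is meant to avoid.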

In this embodiment of the invention, after dependency analysis is performed on the computation graph, if a data visibility range mismatch exists between operations that have a dependency relationship, a data visibility range adapter is used to resolve the mismatch between the interdependent operations.

The data visibility range adapter may be, for example, a piece of code that, when executed, detects the data visibility ranges of the preceding and subsequent operations, derives the minimum data visibility range the operations require, and then changes the data visibility range by sharing the data through an appropriate mechanism (inter-thread communication or shared memory), so that the data visibility range satisfies the requirement for merging the preceding and subsequent operations into one kernel function. If the required minimum data visibility range is sharing within a warp, the adapter uses GPU warp shuffle primitives to share the thread-private data; if the required minimum data visibility range is sharing within a thread block, the adapter uses the GPU's shared memory, storing the thread-private data in shared memory to share it within the thread block.
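The adapter's decision rule can be sketched as a planning function that picks a sharing mechanism for each producer and consumer pair. This is an illustration under stated assumptions, not the adapter of the invention: the `Scope` enum, the adapter tags, the operation names, and the choice to split the kernel when global sharing is required are all hypothetical. The sketch only plans the fusion; the actual sharing would be emitted as GPU code (for example, warp shuffle intrinsics or a shared-memory store followed by a block barrier).

```python
from enum import IntEnum


class Scope(IntEnum):
    THREAD = 0   # thread-private
    WARP = 1     # shared within a warp
    BLOCK = 2    # shared within a thread block
    GLOBAL = 3   # globally shared


def choose_adapter(produced, required):
    """Pick the mechanism that widens `produced` to `required`.

    Returns None when no adapter is needed, and "split_kernel" when only
    global memory can satisfy the consumer (an assumption of this sketch).
    """
    if produced >= required:
        return None
    if required == Scope.WARP:
        return "warp_shuffle"      # inter-thread communication within a warp
    if required == Scope.BLOCK:
        return "shared_memory"     # stage the data in GPU shared memory
    return "split_kernel"


def plan_fusion(ops):
    """Return kernels as lists of op names with adapter tags in between."""
    kernels = [[ops[0][0]]]
    for (_, _, out_scope), (name, in_scope, _) in zip(ops, ops[1:]):
        adapter = choose_adapter(out_scope, in_scope)
        if adapter == "split_kernel":
            kernels.append([name])
            continue
        if adapter is not None:
            kernels[-1].append(f"<{adapter}>")
        kernels[-1].append(name)
    return kernels


# Hypothetical chain of (name, input_scope, output_scope).
chain = [
    ("edge_score", Scope.THREAD, Scope.THREAD),
    ("warp_max", Scope.WARP, Scope.THREAD),
    ("row_softmax", Scope.BLOCK, Scope.THREAD),
    ("global_norm", Scope.GLOBAL, Scope.THREAD),
]
plan = plan_fusion(chain)
```

Here the first three operations are fused into one kernel with a warp shuffle adapter and a shared memory adapter inserted between them, and only the operation that needs globally shared data starts a second kernel.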

As shown on the left side of FIG. 5, the data visibility range required by an operation is at the thread-block level, but existing systems such as DGL and PyG ignore the data visibility range of operations: even if data only needs to be shared within a thread block, DGL and PyG by default place the operations in separate kernels and share the data globally. The data visibility range adapter proposed by this method determines the minimum visibility range the operation requires and uses shared memory to widen the visibility range from thread-private to sharing within the thread block. As shown on the right side of FIG. 5, the thread block of the next operation can then continue computing without waiting for other blocks or accessing global memory. The two kernel functions can therefore be merged using this method, which reduces redundant memory access.

FIG. 3 illustrates a graph neural network optimization method for eliminating redundant memory access and computation overhead according to an embodiment of the present invention. The method reduces mismatches of data visibility ranges between operators in the graph neural network, so that operators can be merged to reduce memory access; at the same time, it can find an equivalent computation graph without redundant computation, thereby reducing redundant computation.

According to another embodiment of the present invention, there is provided a graphics processing unit GPU comprising a processor and a memory having stored thereon computer executable code operable when executed by the processor to perform the aforementioned GPU graph neural network optimization method.

According to another embodiment of the present invention, there is provided a computer-readable medium having stored thereon computer-executable code, which when executed by a processor, is operable to perform the GPU graph neural network optimization method of any of claims 1 to 8.

The invention can be implemented in various forms of software, hardware or a combination thereof, and can be distributed or centralized.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A GPU graph neural network optimization method, comprising:

generating, from the definition of a GPU graph neural network model, a computation graph comprising tensors and operations;

obtaining a plurality of equivalent computation graphs for the computation graph;

comparing the computation amount of each computation graph, and selecting the computation graph with the smallest computation amount;

and generating corresponding GPU code for the selected computation graph.

2. The GPU graph neural network optimization method of claim 1, wherein obtaining a plurality of equivalent computation graphs for the computation graph comprises:

obtaining the tensor information and the operation types on the computation graph;

and, according to the type of each operation and the information of its input tensors, applying the commutative law, the associative law, and the distributive law to change the computation order and manner of the computation graph.

3. The GPU graph neural network optimization method of claim 1, wherein comparing the computation amounts of the respective computation graphs and selecting the computation graph with the smallest computation amount comprises:

for each computation graph, obtaining the number of floating-point operations required by each operation, and selecting the computation graph that requires the fewest floating-point operations.

4. The GPU graph neural network optimization method of claim 1, further comprising, after selecting the computation graph with the smallest computation amount:

analyzing the selected computation graph to obtain, for each operation, the data visibility range required for its input and the data visibility range of its output;

resolving mismatches between the data visibility ranges of operations that have a dependency relationship;

and merging operations whose data visibility ranges match into the same GPU kernel.

5. The GPU graph neural network optimization method of claim 4, wherein resolving mismatches between the data visibility ranges of dependent operations comprises:

using a data visibility range adapter to resolve the data visibility range mismatch between interdependent operations, wherein the data visibility range adapter is operable to detect the data visibility ranges of the preceding and subsequent operations, derive the minimum data visibility range the operations require, and then insert a code segment which, when run, shares the data through inter-thread communication or shared memory to change the data visibility range, so that the data visibility range satisfies the requirement for merging the preceding and subsequent operations into one kernel function.

6. The GPU graph neural network optimization method of claim 5, wherein if the required minimum data visibility range is sharing within a warp, the data visibility range adapter uses GPU warp shuffle primitives to share the thread-private data; and if the required minimum data visibility range is sharing within a thread block, the data visibility range adapter uses the GPU's shared memory, storing the thread-private data in shared memory to share it within the thread block.

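7. A GPU graph neural network optimization method, comprising:

generating, from the definition of a GPU graph neural network model, a computation graph comprising tensors and operations;

analyzing the computation graph to obtain, for each operation, the data visibility range required for its input and the data visibility range of its output;

resolving mismatches between the data visibility ranges of operations that have a dependency relationship;

merging operations whose data visibility ranges match into the same GPU kernel function;

and generating corresponding GPU code for the modified computation graph.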

8. The GPU graph neural network optimization method of claim 7, wherein resolving mismatches between the data visibility ranges of dependent operations comprises:

using a data visibility range adapter to resolve the data visibility range mismatch between interdependent operations, wherein the data visibility range adapter is a code segment that detects the data visibility ranges of the preceding and subsequent operations and derives the minimum data visibility range the operations require, after which a code segment is inserted which, when run, shares the data through inter-thread communication or shared memory to change the data visibility range, so that the data visibility range satisfies the requirement for merging the preceding and subsequent operations into one kernel function.

9. A GPU graph neural network optimization device, comprising a processor and a memory, the memory having computer-executable code stored thereon, the code, when executed by the processor, being operable to perform the GPU graph neural network optimization method of any of claims 1 to 8.

10. A computer readable medium having computer executable code stored thereon, the code, when executed by a processor, being operable to perform the GPU graph neural network optimization method of any of claims 1 to 8.