CN117993426A - Method and device for automatically optimizing graph neural network - Google Patents


Info

Publication number
CN117993426A
Authority
CN
China
Prior art keywords
operator
operators
optimization
model
graph
Prior art date
Legal status
Pending
Application number
CN202410153195.7A
Other languages
Chinese (zh)
Inventor
李鸣一
肖俊敏
谭光明
曹连雨
Current Assignee
Western Research Institute Of China Science And Technology Computing Technology
Hyperai Cloud Technology Beijing Co ltd
Original Assignee
Western Research Institute Of China Science And Technology Computing Technology
Hyperai Cloud Technology Beijing Co ltd
Priority date
Filing date
Publication date
Application filed by Western Research Institute Of China Science And Technology Computing Technology and Hyperai Cloud Technology Beijing Co ltd
Priority to CN202410153195.7A
Publication of CN117993426A

Abstract

The embodiments of the application provide a method and a device for automatically tuning a graph neural network. The method comprises the following steps. In the i-th computational graph optimization stage: selecting at least one group of operators to be fused from the original operator model corresponding to the graph neural network model to be optimized, or from the operator optimization model obtained in the previous operator optimization stage; and replacing each group of operators to be fused with a fused operator node to obtain the i-th computational graph optimization model corresponding to the graph neural network to be optimized. In the i-th operator optimization stage: upon determining that a first operator in the i-th computational graph optimization model satisfies a fission condition, splitting the first operator into a plurality of operators during kernel function generation to obtain the i-th operator optimization model; mapping all of the fissioned operators in the i-th operator optimization model to the corresponding hardware units through data mapping to complete the writing of the corresponding kernel functions; and incrementing i. The embodiments of the application reduce the number of floating-point operations, the intermediate results, and the number of memory accesses while reducing the latency of graph neural network training and inference.

Description

Method and device for automatically optimizing graph neural network
Technical Field
The application relates to the field of graph neural networks, in particular to a method and a device for automatically tuning a graph neural network.
Background
The effectiveness of graph neural networks (GNNs) in learning graph-structured data has given them great impact in various fields. As with conventional dense neural networks, performance optimization of graph neural networks has become a focus of attention in both industry and science. However, due to the irregularity of real-world graph data, no universal sparse operator optimization approach can achieve a performance improvement on arbitrary graph data; on the other hand, the irregularity of the graph data also significantly affects the effectiveness of computational-graph-level optimization means such as operator fusion. Together, these problems pose challenges to the performance optimization of graph neural networks.
In pursuit of higher performance, many GNN frameworks have emerged recently. Recent studies tend to investigate performance optimizations at the computational-graph level and at the operator level separately. Graph optimization and operator optimization play a vital role in achieving high efficiency and high performance for GNNs. However, existing approaches typically first rewrite the computational graph and then design the operator implementations, which results in a separation between the two optimization levels. Furthermore, current graph optimization and operator optimization strategies are largely fixed, rely heavily on human expertise, and are limited to a restricted search space. For example, many operator fusion strategies are based on empirically predefined fusion patterns and their replacement. Because of these factors, the performance of recent GNN frameworks often does not reach the optimal level and is sensitive to various inputs, including both the graph data and the GNN model, so that graph neural networks obtained with these methods suffer from technical defects such as low running speed when executed on hardware.
Disclosure of Invention
The embodiments of the application aim to provide a method and a device for automatically tuning a graph neural network. Through the rule-based computational graph and operator co-optimization technique provided by the embodiments of the application, the execution efficiency of each operator (described by performance indices such as execution time and throughput) is guaranteed so as to obtain performance improvements for GNN inference and training, and redundant computation and unnecessary memory accesses on the GPU (graphics processing unit) are effectively reduced, thereby improving performance.
In a first aspect, an embodiment of the present application provides a method for automatically tuning a graph neural network, where the method includes: in the i-th computational graph optimization stage: selecting at least one group of operators to be fused from the original operator model corresponding to the graph neural network model to be optimized, or from the operator optimization model obtained in the previous operator optimization stage, where each group of operators to be fused comprises a plurality of consecutive operator nodes and each operator node is either a low-overhead dense operator or a sparse operator; and replacing each group of operators to be fused with a fused operator node to obtain the i-th computational graph optimization model corresponding to the graph neural network to be optimized; in the i-th operator optimization stage: if a first operator in the i-th computational graph optimization model satisfies the fission condition, splitting the first operator into a plurality of operators during kernel function generation to obtain the i-th operator optimization model, where the i-th operator optimization model serves as the input to the next computational graph optimization stage; and mapping all of the fissioned operators in the i-th operator optimization model to the corresponding hardware units through data mapping to complete the writing of the corresponding kernel functions, and incrementing i, where the hardware units comprise the corresponding computing units on the GPU and one sub-operator corresponds to one split operator; and repeating the above process until the target operator model corresponding to the graph neural network model and the code corresponding to the target operator model are obtained; where the original operator model, the i-th computational graph optimization model, and the i-th operator optimization model all adopt fine-grained computational graphs, the fine-grained computational graphs comprise the operator operation mode, the operator state description, and the operator mathematical attributes corresponding to each operator, and the operator state description is used to describe that the corresponding operator is one of the following: an original operator, a fused operator, or a split operator.
Some embodiments of the present application provide a collaborative optimization strategy that spans the computational-graph and operator levels and generates high-performance GNN code directly from the graph and the model, thereby achieving co-optimization and improving hardware processing speed.
In some embodiments, the operator operation modes include: Scatter, ApplyEdge, Gather, and ApplyVertex; the operator mathematical attributes are used to describe whether the respective operator is associative, commutative, and/or distributive.
In some embodiments, the selecting at least one group of operators to be fused from the original operator model corresponding to the graph neural network model to be optimized, or from the operator optimization model obtained in the previous operator optimization stage, includes: selecting at least one starting-point operator from the original operator model corresponding to the graph neural network model to be optimized or from the operator optimization model obtained in the previous operator optimization stage; and propagating in a first direction and/or a second direction from each starting-point operator to obtain the operators to be fused corresponding to that starting-point operator, where the first direction is the successor-operator direction, i.e., the direction defined by the operators located after the starting-point operator in the i-th graph neural network model, and the second direction is the predecessor-operator direction, i.e., the direction defined by the operators located before the starting-point operator in the i-th graph neural network model.
Some embodiments of the application search for all fusible operators forward and backward from the selected starting-point operator, thereby improving the effect of computational graph optimization.
In some embodiments, the starting-point operator is a dense operator, i.e., an operator whose output data and input data are both vertex or edge data.
In some embodiments of the application, a dense operator is selected as the starting-point operator because its input and output data are vertex or edge data; fusing such operators does not affect the input and output position information of the new operator but can reduce intermediate results, so that the implementation of the fused operator can realize the semantics of the fused operators, avoid unnecessary global memory accesses on the GPU, and improve hardware execution efficiency.
In some embodiments, the propagating in the first direction and/or the second direction from the starting-point operators to obtain the operators to be fused corresponding to each starting-point operator includes: obtaining the operators to be fused at least according to the operation modes of the relevant operators in the original operator model corresponding to the graph neural network model to be optimized or in the operator optimization model obtained in the previous operator optimization stage.
According to some embodiments of the application, whether two operators can be fused is determined by their operation modes; fusing operators that satisfy the rules reduces accesses to the global memory of the GPU and further improves performance, provided that the fused operator utilizes hardware computing resources efficiently.
In some embodiments, the obtaining the operators to be fused at least according to the operation modes of the relevant operators in the original operator model corresponding to the graph neural network model to be optimized or in the operator optimization model obtained in the previous operator optimization stage includes: if the m-th operator lies in the first direction of the k-th starting-point operator and is adjacent to the k-th starting-point operator, and it is further confirmed that the m-th operator and the k-th starting-point operator have the same operation mode, taking the m-th operator as an operator to be fused in the group corresponding to the k-th starting-point operator.
Some embodiments of the application preferentially fuse operators with the same operation mode, which serves as one of the rules for computational graph optimization and improves its effect.
In some embodiments, the obtaining the operators to be fused at least according to the operation modes of the relevant operators in the original operator model corresponding to the graph neural network model to be optimized or in the operator optimization model obtained in the previous operator optimization stage includes: if it is confirmed that the m-th operator lies in the second direction of the k-th starting-point operator and that the m-th operator and the k-th starting-point operator have different operation modes, determining whether the m-th operator is taken as an operator to be fused in the group corresponding to the k-th starting-point operator according to the runtime performance analysis result of the operator to be evaluated.
For operators having different operation modes, some embodiments of the present application further determine whether to fuse them by comparing the runtime performance of the operator to be evaluated.
In some embodiments, the determining that the first operator in the i-th computational graph optimization model satisfies the fission condition comprises: if it is confirmed that the performance of a plurality of kernel functions adopted in the code generation process of an operator meets the set criterion, confirming that the operator satisfies the fission condition for performing operator splitting.
Some embodiments of the present application provide an optimization strategy for operator fission for the operator optimization phase.
In some embodiments, the mapping of all of the fissioned operators in the i-th operator optimization model to the corresponding hardware units through data mapping to complete the writing of the corresponding kernel functions includes: constructing a scheduling policy for the sparse kernel function corresponding to the operator to be fissioned, completing the mapping of data to hardware, and constructing the corresponding operations for the computation tasks; constructing a kernel function skeleton, where the kernel function skeleton is used to represent the control flow of the sparse kernel function; and automatically completing the writing of the first kernel function code based on the kernel function skeleton and the scheduling policy of the sparse kernel function.
Some embodiments of the present application write the corresponding kernel function code by constructing the sparse kernel function control flow and a kernel function skeleton.
In some embodiments, the constructing a scheduling policy for the sparse kernel functions corresponding to the operators to be fissioned includes: in a load distribution stage, partitioning the adjacency matrix corresponding to the operator to be fissioned to obtain a plurality of submatrices and thereby realize operator splitting, where one submatrix corresponds to one kernel function; in a data mapping stage, mapping each of the plurality of submatrices to a different thread block of the graphics processor (GPU) programming model to obtain the thread-block data mapping, and determining the number of threads and the data distribution; and in a computation task implementation stage, implementing element-wise computation tasks and reduction computation tasks for kernel functions of the different task types based on the mapping results completed in the data mapping stage, so as to obtain the task implementation design of each thread and one or more design process graphs of sequentially connected operations, where one design process graph corresponds to the kernel function of one submatrix.
Some embodiments of the application split the adjacency matrix and map it to the corresponding thread blocks through the load distribution stage, thereby improving the speed at which the thread blocks process data.
In some embodiments, the automatically completing the writing of the first kernel function code based on the kernel function skeleton and the scheduling policy of the sparse kernel function includes: each kernel function fragment reads the metadata corresponding to the kernel function skeleton, where the metadata comprises the DPG and/or the kernel function semantics, the metadata corresponding to the DPG comprises the data mapping, the thread mapping, the caching mode, the reduction mode, and the vectorization and loop-unrolling modes, and the metadata corresponding to the kernel function semantics comprises the read/write indices of the different tensors and the operations occurring between tensors; automatically instantiating each kernel function fragment according to the metadata to obtain instantiated kernel function fragments; and taking the kernel function skeleton and the instantiated kernel function fragments as the kernel function code corresponding to the first kernel function.
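The following Python sketch illustrates how kernel fragments could be instantiated from the two kinds of metadata just described. All class and function names (DPGMeta, KernelSemantics, instantiate_fragment, assemble_kernel) are hypothetical placeholders introduced for illustration and are not the patent's implementation.

from dataclasses import dataclass

@dataclass
class DPGMeta:
    data_mapping: dict      # e.g. which submatrix / rows a thread block handles
    thread_mapping: dict    # e.g. threads per group
    cache_mode: str         # e.g. "shared"
    reduction_mode: str     # e.g. "warp_shuffle"
    vectorize: int          # vector width
    unroll: int             # loop-unroll factor

@dataclass
class KernelSemantics:
    tensor_indices: dict    # read/write index expression per tensor
    edge_op: str            # operation between tensors, e.g. "mul"

def instantiate_fragment(template: str, dpg: DPGMeta, sem: KernelSemantics) -> str:
    """Fill one kernel-fragment template with the metadata it reads."""
    return template.format(vec=dpg.vectorize, unroll=dpg.unroll,
                           cache=dpg.cache_mode, reduce=dpg.reduction_mode,
                           op=sem.edge_op, **sem.tensor_indices)

def assemble_kernel(skeleton: str, fragments: list, dpg: DPGMeta,
                    sem: KernelSemantics) -> str:
    """Kernel code = skeleton (control flow) + instantiated fragments."""
    return skeleton.format(*[instantiate_fragment(f, dpg, sem) for f in fragments])

In this sketch the skeleton carries the control flow and positional slots, while each fragment template carries the per-operator computation and is filled in from the DPG and semantics metadata.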
In a second aspect, some embodiments of the present application provide an apparatus for automatically tuning a graph neural network, the apparatus comprising: a computational graph tuning module configured to, in the i-th computational graph optimization stage: select at least one group of operators to be fused from the original operator model corresponding to the graph neural network model to be optimized, or from the operator optimization model obtained in the previous operator optimization stage, where each group of operators to be fused comprises a plurality of consecutive operator nodes and each operator node is either a low-overhead dense operator or a sparse operator; and replace each group of operators to be fused with a fused operator node to obtain the i-th computational graph optimization model corresponding to the graph neural network to be optimized; and an operator tuning module configured to, in the i-th operator optimization stage: if a first operator in the i-th computational graph optimization model satisfies the fission condition, split the first operator into a plurality of operators during kernel function generation to obtain the i-th operator optimization model, where the i-th operator optimization model serves as the input to the next computational graph optimization stage; and map all of the fissioned operators in the i-th operator optimization model to the corresponding hardware units through data mapping to complete the writing of the corresponding kernel functions, and increment i, where the hardware units comprise the corresponding computing units on the GPU and one sub-operator corresponds to one split operator; where multiple iterations are performed through the computational graph tuning module and the operator tuning module until the target operator model corresponding to the graph neural network model and the code corresponding to the target operator model are obtained; and where the original operator model, the i-th computational graph optimization model, and the i-th operator optimization model all adopt fine-grained computational graphs, the fine-grained computational graphs comprise the operator operation mode, the operator state description, and the operator mathematical attributes corresponding to each operator, and the operator state description is used to describe that the corresponding operator is one of the following: an original operator, a fused operator, or a split operator.
In a third aspect, some embodiments of the application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs a method according to any of the embodiments of the first aspect.
In a fourth aspect, some embodiments of the present application provide an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor is capable of implementing a method according to any embodiment of the first aspect when executing the program.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of graph neural network tuning based on the collaboration of computational graph optimization and operator optimization provided by an embodiment of the application;
FIG. 2 is a flowchart of a method for automatically tuning a graph neural network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an operator fusion process in computational graph optimization according to an embodiment of the present application;
FIG. 4 is a second flowchart of a method for automatically tuning a graph neural network according to an embodiment of the present application;
FIG. 5 is a schematic diagram of defining semantics provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a process for obtaining a sparse kernel function scheduling policy through a hardware mapping process according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of obtaining kernel function codes according to a kernel function skeleton according to an embodiment of the present application;
FIG. 8 is a block diagram of an apparatus for automatically tuning a graph neural network according to an embodiment of the present application;
Fig. 9 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
The inventors have found through research that the methods for automatically tuning a graph neural network provided by the related art have the following technical problems. On the one hand, existing operator-level optimization techniques adapt a series of existing optimization strategies, drawn from the optimization experience of sparse linear algebra kernels, to the sparse operators of the graph neural network; on the other hand, existing computational-graph-level optimization techniques rely on customized graph neural network computational graphs and on expert-knowledge-driven rewrite strategies over those computational graphs. That is, the related-art methods for improving the automatic tuning of graph neural networks can be divided into two main categories: computational graph optimization and operator/kernel optimization. Graph optimization for GNNs involves rewriting the computational graph by identifying operators in specific patterns and replacing them with more efficient alternatives, such as operator reordering and operator fusion. It is similar to conventional loop or pipeline tuning in compilers, with the aim of reducing redundant computation and optimizing memory access. Operator optimization, on the other hand, focuses mainly on designing efficient kernel functions, such as optimizing sparse matrix-matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM) by adaptively tiling the sparse matrix into dense and sparse blocks.
The inventors of the present application analyzed the performance of existing GNN computational graph optimization strategies on different models and different graph data, and found that the model performance under different computational graphs is sensitive to the input graph data; further analysis showed that, when kernel function auto-tuning is applied, the performance ranges of different computational graphs on different input graph data differ, i.e., the optimal performance obtainable by automatically tuning the kernel functions differs. Based on this, the inventors of the present application analyzed graph optimization and operator optimization in detail and found that operator/kernel optimization often produces multiple efficient kernels, resulting in operator splitting and thus changing the structure of the computational graph. By using this as a bridge connecting operator optimization and computational graph optimization, the collaboration of graph optimization and operator optimization can be realized, which is expected to further promote performance improvement. Driven by this basic idea, the inventors of the present application propose in the embodiments of the present application a collaborative optimization strategy that spans the computational-graph and operator levels, and generate high-performance GNN code directly from the graph and the model. To achieve collaborative optimization, some embodiments of the present application propose rule-based computational graph optimization and performance-driven operator optimization strategies that are automated and adaptive. These strategies contrast with the mainstream pattern-based computational graph optimization and manual kernel function design methods, effectively freeing GNN optimization from the limitations of human expertise.
That is, the purpose of the embodiments of the present application is to solve the prior-art problem that, because computational graph optimization and operator optimization are isolated from each other, the optimization means cannot obtain the optimal performance, and to propose a method for automatically tuning a graph neural network that realizes an input-adaptive, automatically code-generating joint computational graph-operator tuning strategy. For example, the rule-based computational graph and operator co-optimization technique proposed in some embodiments of the present application can effectively reduce redundant computation, reduce memory accesses to hardware devices on the GPU, and reduce memory accesses for intermediate results while guaranteeing performance. The performance-driven operator tuning technique provided by some embodiments of the present application expands the tuning search space of typical sparse operators and connects to the computational graph optimization technique at the operator level. In some embodiments of the present application, a template-based automatic code generation technique (i.e., the constructed kernel function skeletons and control flows) generates high-performance kernel functions for the computational graphs searched by the tuning engine by combining customized code fragments with kernel function skeletons.
It is easy to understand that the method for automatically tuning a graph neural network provided by the embodiments of the present application yields a technique for efficient joint graph-operator tuning and can realize effective coordination between computational graph optimization and operator optimization, thereby finding the optimal training/inference performance of the graph neural network in a larger tuning search space. To make collaborative optimization possible and efficient, the embodiments of the present application adopt rule-based computational graph optimization and performance-driven operator optimization strategies, respectively. For example, on CUDA (Compute Unified Device Architecture), the general-purpose computing platform and programming model developed for GPU products, some embodiments of the application can achieve performance improvements of up to 12 times over existing GNN frameworks on mainstream GNN models.
Referring to FIG. 1, FIG. 1 is a schematic diagram of the process of tuning the original operator model corresponding to the graph neural network model to be optimized through the collaboration of computational graph optimization and operator optimization. In FIG. 1, the first computational graph tuning is performed on the original operator model to obtain a tuned result, the first operator tuning is performed on that result, the second computational graph tuning is performed on the result of the first operator tuning, the second operator tuning is performed on the result of the second computational graph tuning, and so on; after several iterations the termination condition is reached, yielding the target operator model corresponding to the graph neural network model to be optimized and the kernel function code corresponding to the operator models obtained in each round of tuning.
It can be understood that, unlike prior-art solutions that perform only computational graph tuning or only operator tuning on the operator model, the embodiments of the application jointly tune both, so that the obtained graph neural network consumes fewer computer system resources and runs faster.
As can be seen from FIG. 1, joint tuning needs to be performed multiple times on the graph neural network to be optimized. The method for automatically tuning a graph neural network provided by some embodiments of the present application is exemplarily described below with reference to FIG. 2, taking the i-th round of joint tuning as an example.
As shown in fig. 2, an embodiment of the present application provides a method for automatically tuning a graph neural network, where the method includes:
In the i-th computational graph optimization stage:
S101, selecting at least one group of operators to be fused from the original operator model corresponding to the graph neural network model to be optimized, or from the operator optimization model obtained in the previous operator optimization stage.
It should be noted that each group of operators to be fused comprises a plurality of consecutive operator nodes, and each operator node is either a low-overhead dense operator or a sparse operator. A low-overhead dense operator is an operator with complexity O(n), such as an element-wise operation; a sparse operator is an operator in the GNN framework that carries message-passing semantics and relies on the graph data structure (a sparse matrix), hence the name.
S102, replacing each group of operators to be fused with a fused operator node to obtain the i-th computational graph optimization model corresponding to the graph neural network to be optimized.
In the i-th operator optimization stage:
S103, upon determining that a first operator in the i-th computational graph optimization model satisfies the fission condition, splitting the first operator into a plurality of operators during kernel function generation to obtain the i-th operator optimization model, where the i-th operator optimization model serves as the input to the next computational graph optimization stage;
S104, mapping all of the operators obtained by fission in the i-th operator optimization model to the corresponding hardware units through data mapping to complete the writing of the corresponding kernel functions, and incrementing i, where the hardware units comprise the corresponding computing units (also called streaming multiprocessors) on the GPU and one sub-operator corresponds to one split operator.
S101-S104 are repeated until the target operator model corresponding to the graph neural network model and the code corresponding to the target operator model are obtained. The original operator model, the i-th computational graph optimization model, and the i-th operator optimization model all adopt fine-grained computational graphs; the fine-grained computational graphs comprise the operator operation mode, the operator state description, and the operator mathematical attributes corresponding to each operator, and the operator state description is used to describe that the corresponding operator is one of the following: an original operator, a fused operator, or a split operator. The operator operation modes include: Scatter, ApplyEdge, Gather, and ApplyVertex; the operator mathematical attributes are used to describe whether the respective operator is associative, commutative, and/or distributive. It will be appreciated that in some embodiments of the application the loop terminates when no new transformation that could improve the overall performance of the model can be found by the computational graph optimization of the above procedure.
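The control flow of the alternating S101-S104 loop can be sketched as follows. This is only an outline under assumed interfaces: the strategy callables (select_fusion_groups, fuse, satisfies_fission, split, generate_kernel) are hypothetical placeholders supplied by the caller, not the patent's implementation.

def co_optimize(model, select_fusion_groups, fuse, satisfies_fission, split,
                generate_kernel):
    kernels = {}
    while True:
        # i-th computational graph optimization stage (S101, S102)
        groups = select_fusion_groups(model)
        if not groups:
            break  # no transformation improving overall performance was found
        for group in groups:
            model = fuse(model, group)
        # i-th operator optimization stage (S103, S104)
        for op in list(model.operators):
            if satisfies_fission(op):          # e.g. a multi-kernel implementation wins
                model = split(model, op)
        for op in model.operators:
            kernels[op] = generate_kernel(op)  # data mapping + kernel function writing
    return model, kernels

The two stages feed each other: fusion changes which operators are candidates for fission, and fission changes which nodes the next round of graph optimization can fuse.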
Some embodiments of the present application provide a collaborative optimization strategy that spans the computational-graph and operator levels and generates high-performance GNN code directly from the graph and the model, thereby achieving co-optimization and improving hardware processing speed.
It should be noted that the embodiments of the present application provide a program representation suitable for graph neural networks, namely the fine-grained computational graph (FCG), and design two computational graph optimization mechanisms, namely a graph transformation mechanism based on mathematical principles and an operator fusion mechanism, from the two angles of redundant computation and the load balancing of sparse kernel functions.
The definition of fine-grained computational graphs, graph transformations based on mathematical principles, and operator fusion, to which embodiments of the application relate, is exemplarily set forth below.
Compared with conventional dense neural networks, a GNN layer involves 4 graph-related operator operation modes, namely: Scatter, ApplyEdge, Gather, and ApplyVertex. Define a graph G = (V, E), where V and E are the vertex set and the edge set, respectively. These operation modes can be expressed as:
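The formulas below are a plausible rendering of these four modes in standard message-passing notation; the specific symbols (phi_s, phi_e, psi_v, and the reduction operator) are assumptions introduced for illustration rather than the patent's own notation.

% Hedged reconstruction of the four operation modes; the symbols are illustrative.
\begin{aligned}
\text{Scatter:}     \quad & y_e = \varphi_s(x_u, x_v), && \forall\, e=(u,v)\in E\\
\text{ApplyEdge:}   \quad & y_e' = \varphi_e(y_e),      && \forall\, e\in E\\
\text{Gather:}      \quad & z_v = \bigoplus_{e\in \mathcal{N}_{\mathrm{in}}(v)} y_e', && \forall\, v\in V\\
\text{ApplyVertex:} \quad & x_v' = \psi_v(x_v),          && \forall\, v\in V
\end{aligned}

Here x_u and x_v are vertex features, y_e are edge features, and the reduction is a commutative, associative operation over the incoming edges of vertex v.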
Fine-grained computational graphs (FCGs) extend the computational graph concept used in conventional deep learning frameworks. In contrast, FCGs contain more GNN-related information, including the operation mode of each operator (Scatter, ApplyEdge, Gather, and ApplyVertex), the state description of the operator (e.g., original, fused, and fissioned), and mathematical attributes describing whether the operator is associative, commutative, and/or distributive.
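A minimal sketch of what one FCG operator node could record is given below; the field names are assumptions chosen for illustration, not the patent's data structure.

from dataclasses import dataclass, field
from enum import Enum, auto

class OpMode(Enum):
    SCATTER = auto()
    APPLY_EDGE = auto()
    GATHER = auto()
    APPLY_VERTEX = auto()

class OpState(Enum):
    ORIGINAL = auto()
    FUSED = auto()
    FISSIONED = auto()

@dataclass
class FCGNode:
    name: str                       # e.g. "leakyrelu", "sum"
    mode: OpMode                    # one of the 4 GNN operation modes
    state: OpState = OpState.ORIGINAL
    associative: bool = False       # mathematical attributes used by the
    commutative: bool = False       # rule-based graph transformations
    distributive: bool = False
    predecessors: list = field(default_factory=list)
    successors: list = field(default_factory=list)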
The graph transformation mechanism based on mathematical principles in the application is used to avoid redundant computation. Compared with computational graph rewriting techniques that focus purely on dense operators, the graph transformation mechanism of the embodiments of the application integrates sparse and dense operators and is based on 3 basic mathematical laws: the associative, distributive, and commutative laws. Examples of specific transformation strategies are shown in Table 1, in which the equivalent computational graph structures on the left and right sides of each row are derived by applying the above 3 basic mathematical laws. The mechanism analyzes the floating-point operation count required by different computational graph variants and reduces it as much as possible; in the scenario combining sparse and dense operators, the mechanism can explore the computational graph transformation space more fully.
TABLE 1
It should be noted that Table 1 shows the graph transformation rules of some embodiments of the present application. In this table, x, y, z represent feature vectors of shape [f_in, k]. v_*, u_*, e_* represent the features of the source node, the target node, and the edge, each of length f_in. W_i represents a linear operator of shape [f_in, f_out], and φ represents any operation satisfying the commutative and distributive laws. s_* and g_* represent operators in the Scatter and Gather modes, respectively; the subscripts + and || denote element-wise addition and vector concatenation, and ⊕ denotes any operator (i.e., any vector operation satisfying commutativity, associativity, and distributivity). |E| and |V| represent the number of edges and the number of vertices, respectively. These rules are all applied in the graph transformation phase, corresponding to the first transformation step of FIG. 4, before the iterative optimization of S101-S104 starts. A substitution rule is applied whenever it can reduce the floating-point operation count, until no such rule can be found.
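As a concrete illustration of the kind of rule Table 1 captures, consider the distributive law applied to a linear operator W and a Gather-mode reduction; this particular instance is chosen for illustration and is not copied from the table.

% Moving a linear operator across a Gather via the distributive law.
\sum_{i \in \mathcal{N}(j)} (x_i W) \;=\; \Bigl(\sum_{i \in \mathcal{N}(j)} x_i\Bigr) W

The left-hand side applies W once per edge, costing on the order of |E| * f_in * f_out multiply-accumulates, while the right-hand side applies W once per destination vertex, costing on the order of |V| * f_in * f_out; since |E| typically far exceeds |V| in real-world graphs, the right-hand form reduces the floating-point operation count.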
The operator fusion mechanism of the embodiments of the application is used to avoid redundant memory reads/writes and intermediate results; however, blindly fusing does not necessarily bring a performance improvement. This is because, under the combined influence of the irregularity of the input graph data and the model parameter sizes, different operator semantics do not necessarily achieve the highest overall performance under a uniform thread mapping. Accordingly, the inventors of the present application consider that operator fusion beneficial to performance should follow the following principle: high-overhead dense operators (e.g., dense matrix multiplication) are not fused, while all neighboring low-overhead dense operators and sparse operators are fused as much as possible. It should be noted that dense operators with computational complexity greater than O(n) are all high-overhead dense operators, where n corresponds to the dimensions of the vertex and edge features. Representative high-overhead dense operators are dense matrix multiplication and convolution; representative low-overhead dense operators are element-wise operations. The embodiments of the present application follow this principle, and the process of operator fusion on an FCG can be summarized as selecting a series of consecutive operator nodes in the initial computational graph and replacing them with a new fused operator node.
That is, in some embodiments of the present application, the process in S101 of selecting at least one group of operators to be fused from the original operator model corresponding to the graph neural network model to be optimized, or from the operator optimization model obtained in the previous operator optimization stage, includes: selecting at least one starting-point operator from the original operator model corresponding to the graph neural network model to be optimized or from the operator optimization model obtained in the previous operator optimization stage; and propagating in a first direction and/or a second direction from each starting-point operator to obtain the operators to be fused corresponding to that starting-point operator, where the first direction is the successor-operator direction, i.e., the direction defined by the operators located after the starting-point operator in the i-th graph neural network model, and the second direction is the predecessor-operator direction, i.e., the direction defined by the operators located before the starting-point operator in the i-th graph neural network model. Some embodiments of the application search for all fusible operators forward and backward from the selected starting-point operator, thereby improving the effect of computational graph optimization.
For example, in some embodiments of the application, the starting-point operator is a dense operator, i.e., an operator whose output data and input data are both vertex or edge data. In some embodiments of the application, a dense operator is selected as the starting-point operator because its input and output data are vertex or edge data; fusing such operators does not affect the input and output position information of the new operator but can reduce intermediate results, so that the implementation of the fused operator can realize the semantics of the fused operators, avoid unnecessary global memory accesses on the GPU, and improve hardware execution efficiency.
For example, in some embodiments of the present application, the propagating in the first direction and/or the second direction from the starting-point operators to obtain the operators to be fused corresponding to each starting-point operator includes: obtaining the operators to be fused at least according to the operation modes of the relevant operators in the original operator model corresponding to the graph neural network model to be optimized or in the operator optimization model obtained in the previous operator optimization stage. According to some embodiments of the application, whether two operators can be fused is determined by their operation modes; fusing operators that satisfy the rules reduces accesses to the global memory of the GPU and further improves performance, provided that the fused operator utilizes hardware computing resources efficiently.
For example, in some embodiments of the present application, the obtaining the operators to be fused at least according to the operation modes of the relevant operators in the original operator model corresponding to the graph neural network model to be optimized or in the operator optimization model obtained in the previous operator optimization stage includes: if the m-th operator lies in the first direction of the k-th starting-point operator and is adjacent to the k-th starting-point operator, and it is further confirmed that the m-th operator and the k-th starting-point operator have the same operation mode, taking the m-th operator as an operator to be fused in the group corresponding to the k-th starting-point operator. Some embodiments of the application preferentially fuse operators with the same operation mode, which serves as one of the rules for computational graph optimization and improves its effect.
For example, in some embodiments of the present application, the obtaining the operators to be fused at least according to the operation modes of the relevant operators in the original operator model corresponding to the graph neural network model to be optimized or in the operator optimization model obtained in the previous operator optimization stage includes: if it is confirmed that the m-th operator lies in the second direction of the k-th starting-point operator and that the m-th operator and the k-th starting-point operator have different operation modes, determining whether the m-th operator is taken as an operator to be fused in the group corresponding to the k-th starting-point operator according to the runtime performance analysis result of the operator to be evaluated (i.e., the fused operator node obtained by the fusion). For operators having different operation modes, some embodiments of the present application further determine whether to fuse them by comparing the runtime performance of the operator to be evaluated.
The implementation of S101 and S102 is exemplarily described below in connection with fig. 3.
The fill patterns of the different operators in FIG. 3 indicate which operation mode each operator belongs to, covering the four operation modes ApplyVertex, Scatter, ApplyEdge, and Gather. The arrows in FIG. 3 exemplarily indicate the propagation direction used when determining the fusible operators, and the dashed boxes in FIG. 3 frame the fusible operators found along the corresponding directions.
Step 1: select the operator fusion starting points, i.e., select the starting-point operators.
The diagram in the box of step 1 in FIG. 3 is an original operator model, i.e., a plurality of operators linked by arrows, where each operator corresponds to an operation (e.g., summation, the ReLU activation function, etc.) and each operation belongs to one of the four operation modes. In some embodiments of the present application, operator fusion selects element-wise dense operators as the starting points (in FIG. 3 the activation function LeakyReLU operator and the multiplication mul operator are selected as the starting-point operators). The reason is that the input and output data of a dense operator are vertex or edge data, so fusing such operators does not affect the input and output position information of the new operator but can reduce intermediate results. There may be multiple candidate starting points in this process; some embodiments of the application arbitrarily pick one of them to start the fusion, and if unfused candidate operators still remain after one round of fusion, the fusion algorithm is run again with those candidates as starting points, until every candidate starting-point operator has been absorbed into some fused operator.
Step 2: propagate from the starting-point operator in the successor-operator direction (i.e., the first direction).
In the operator fusion process, the embodiments of the application first fuse the fusible operators that have the same operation mode and are located among the successor nodes of the starting point. In step 2 of FIG. 3, the operator LeakyReLU and its successor Exp are fused together (indicated by the arrow) because they share the operation mode ApplyEdge. For adjacent operators that are fusible but have different operation modes, such as ApplyEdge and Gather, performance analysis is required to determine whether fusion brings a performance benefit, for example the fusion of the operators Exp and Sum in FIG. 3.
Step 3: after completing the processing in the successor direction of the starting point, the embodiment of the present application performs the same fusion process as step 2 along the predecessors of the starting point (indicated by the arrow in step 3 of FIG. 3). The difference is that if a predecessor is an operator in the Gather operation mode, performance analysis is required to determine whether fusion brings a performance benefit.
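The three steps can be sketched as a one-hop search around each starting point. The FCG node/graph interfaces and the profile() cost callable below are hypothetical placeholders; the sketch only illustrates the decision rules, not the patent's algorithm.

def grow_fusion_group(start, fcg, profile):
    group = [start]                          # step 1: element-wise dense starting point
    # step 2: propagate toward successors
    for succ in fcg.successors_of(start):
        if succ.mode == start.mode:
            group.append(succ)               # same operation mode: fuse directly
        elif profile(group + [succ]) < profile(group):
            group.append(succ)               # different modes: fuse only if faster
    # step 3: propagate toward predecessors; a Gather-mode predecessor always
    # requires performance analysis before it may join the group
    for pred in fcg.predecessors_of(start):
        if pred.mode == start.mode and pred.mode != "Gather":
            group.append(pred)
        elif profile(group + [pred]) < profile(group):
            group.append(pred)
    return group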
It should be noted that some embodiments of the present application further provide a performance-driven operator optimization strategy: at the operator level, some embodiments of the application design an operator fission mechanism to connect with the computational graph optimization technique on the one hand, and a template-based automatic code generation mechanism (i.e., automatically generating kernel function code in cooperation with the hardware mapping) on the other hand.
For example, in some embodiments of the application, the determining that the first operator in the i-th computational graph optimization model satisfies the fission condition comprises: if it is confirmed that the performance of a plurality of kernel functions adopted in the code generation process of an operator meets the set criterion, confirming that the operator satisfies the fission condition for performing operator splitting. Some embodiments of the present application provide an optimization strategy of operator fission for the operator optimization stage.
As described above, the technical solution of the embodiments of the present application adopts the collaboration of computational graph optimization and operator optimization; an example of the collaborative optimization process and of operator fission is exemplarily described below in connection with FIG. 4.
The first computational graph optimization in FIG. 4 operates on the original operator model corresponding to the graph neural network model to be optimized. The original operator model links the operators executed in turn by arrows; each operator in FIG. 4 is labeled with its specific algorithm (for example, algorithm names include exp, div, and the like; their specific meanings can be found in the related literature and are not repeated here) and with its operation mode (the fill pattern of each operator in FIG. 4 indicates the operation mode it belongs to). Four operation modes are given in total, ApplyVertex, Scatter, ApplyEdge, and Gather, and the fill pattern corresponding to each mode is given in the top row of FIG. 4. Differently from FIG. 3, fused operators and split operators are additionally represented in FIG. 4 by further fill patterns, where fused operators are labeled Fused and split operators are labeled Fissioned. The dashed boxes in FIG. 4 frame the fusible operators found in the computational graph optimization stage; these operators are represented as a single fused operator in the next operator optimization stage.
In order to pursue the highest performance, the operator splitting mechanism of the embodiments of the present application consists of two rules. The first rule: during code generation for the current operator, if an implementation using multiple kernel functions can achieve higher performance, operator splitting should be performed. The second rule: as mentioned above in the operator fusion mechanism, operators in the Scatter and Gather operation modes prefer edge-centric and vertex-centric parallelization, respectively. At the operator level, if either of them is merged into a fused operator that uses its non-preferred parallelization, operator fission should be considered based on the performance analysis results.
As shown in FIG. 4, two groups of operators to be fused are identified by the computational graph optimization in stage 1 (each dashed box in FIG. 4 marks one group of operators to be fused), and operator fusion yields the 2 fused operators Fused 1 and Fused 2 shown in FIG. 4 at the operator optimization module. In the operator optimization of stage 2, operator fission during kernel function generation introduces operator splitting, yielding a first split operator Fissioned 1.1, a second split operator Fissioned 1.2, a third split operator Fissioned 2.1, and a fourth split operator Fissioned 2.2 to obtain higher performance (according to the first rule). Stage 3 is a new round of computational graph optimization, in which Fissioned 1.1 and Fissioned 2.1 are fused into a third fused operator Fused 3. Stage 4 is a new round of operator optimization, in which Fissioned 1.2 is split into Fissioned 1.2.1 and a sum operator to obtain higher performance (according to the second rule).
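The two fission rules can be sketched as the following decision procedure. The code generation and profiling interfaces passed in are hypothetical placeholders used only to make the rules concrete; this is not the patent's implementation.

def maybe_fission(fused_op, best_single_kernel, best_multi_kernel,
                  preferred_parallelization, profile_split):
    # Rule 1: during code generation, if an implementation built from several
    # kernel functions is faster than the best single kernel, split the operator.
    single = best_single_kernel(fused_op)
    multi = best_multi_kernel(fused_op)       # one kernel per resulting sub-operator
    if multi.runtime < single.runtime:
        return multi.sub_operators            # operator fission
    # Rule 2: Scatter prefers edge-centric and Gather prefers vertex-centric
    # parallelization; if such an operator was fused under its non-preferred
    # parallelization, consider splitting it out based on profiling.
    for sub in fused_op.sub_operators:
        if sub.mode in ("Scatter", "Gather") and \
                fused_op.parallelization != preferred_parallelization(sub):
            if profile_split(fused_op, sub) < single.runtime:
                return [sub] + [s for s in fused_op.sub_operators if s is not sub]
    return [fused_op]                         # keep the fused operator as is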
For example, in some embodiments of the application the automatic code generation mechanism illustratively includes:
To support flexible computational graph transformations, the embodiments of the present application devise a compatible high-performance automatic code generation technique. The inventors observed the results of operator fusion and found that every fusion result contains at most one operator in the Gather operation mode. Based on this observation, the application targets kernel function generation for operators containing at most one Gather operation mode, and when processing kernel function fusion, the embodiments of the application adopt a data visibility range adapter to handle the data dependencies between operators.
Some embodiments of the application provide a unified abstraction describing post-fusion operator semantics in order to accomplish automatic code generation.
Some embodiments of the application propose a unified set of abstractions describing post-fusion operator semantics. As shown in FIG. 5, the fused operator contains at most one operator g in the Gather operation mode; given the vertex feature matrix X (containing the target and source vertex features), the edge feature matrix Y, and the adjacency matrix A, the unified abstraction of the fused operator semantics can be expressed as:
Z=f(A,X,Y)
where Z is the computation result and f is the operator. It should be noted that any operator (including fused and fissioned operators) in FIGS. 3 and 4 can use this unified abstraction to represent its computation semantics.
If Z represents the updated vertex feature matrix, then its j-th row vector z_j can be expressed as follows:
where x_i and x_j are the i-th and j-th row vectors of X, respectively, and y_{h_{i,j}} is the h_{i,j}-th row vector of Y, where h_{i,j} maps the vertex index pair (i, j) to the corresponding edge index. ψ_v denotes a function of two row vectors used to update X, while φ_e is used to update Y, and the reduction function is determined by the corresponding row of A. Furthermore, if Z represents the updated edge feature matrix, then its h_{i,j}-th row vector z_{h_{i,j}} can be derived as follows:
Lemma 1: Z = f(A, X, Y) can represent any computation in a GNN that contains at most one Gather-mode operator.
Proof: without loss of generality, we prove the result for the case where the computation process (fused operator) f contains a Gather-mode operator. Before the Gather-mode operator, the computation process may involve ApplyVertex-mode operators (denoted ψ_v), Scatter-mode operators, and ApplyEdge-mode operators (abstracted by φ_e). If the output of f is a vertex feature matrix, then after the Gather-mode operator is executed the intermediate results contain no edge feature data, because in a GNN updating vertex features (the output) from edge features requires a Gather-mode operator; similarly, there is no Scatter-mode operator after the Gather. Therefore only ApplyVertex-mode operators may follow, and Z can be written in the form of the unified abstraction. If the output of f is an edge feature matrix, then after the Gather-mode operator there may be ApplyVertex-mode operators as well as Scatter-mode and ApplyEdge-mode operators (abstracted by ω_e), and the edge rows of Z can likewise be written in the form of the unified abstraction.
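A reference (non-optimized) sketch of evaluating Z = f(A, X, Y) for a vertex-valued output is given below. The exact functional form of the patent's formula is not reproduced here; psi_v, phi_e, and reduce_op are illustrative placeholders consistent with the surrounding definitions, not the patent's notation.

import numpy as np

def fused_gather(edges, X, Y, psi_v, phi_e, reduce_op=np.add):
    """edges: iterable of (i, j, h) with h the edge index of edge (i, j).
    Returns Z with one row per destination vertex j (same shape as X)."""
    Z = np.zeros_like(X)
    for i, j, h in edges:
        msg = phi_e(psi_v(X[i], X[j]), Y[h])   # per-edge message
        Z[j] = reduce_op(Z[j], msg)            # Gather: reduce into row j
    return Z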
Based on the above semantic definition, the following illustrates the procedure provided by some embodiments of the present application for writing a kernel function to obtain the kernel function code.
In some embodiments of the present application, the mapping in S104 of all the fissioned operators in the i-th operator optimization model to the corresponding hardware units through data mapping to complete the writing of the corresponding kernel functions illustratively includes:
first, constructing a scheduling policy for the sparse kernel function corresponding to the operator to be fissioned, completing the mapping of data to hardware, and constructing the corresponding operations for the computation tasks.
For example, in some embodiments of the present application, the first step of constructing the sparse kernel function of the kernel function corresponding to the operator to be fissile includes: in the load distribution stage, dividing an adjacent matrix corresponding to an operator to be fissioned to obtain a plurality of submatrices to realize operator splitting, wherein one submatrix corresponds to one nuclear function; in the data mapping stage, mapping each submatrix in the plurality of submatrices onto different thread blocks of a GPU programming model of the graphics processor to obtain thread block data mapping, and determining the number of threads and data distribution (for example, 128 total threads in a computing unit, grouping the 128 threads to obtain each thread group, wherein the number of threads refers to the number of threads in the thread group, and the data distribution refers to the data responsible for computing by each computing unit); in the calculation task implementation stage, element-by-element calculation tasks and reduction calculation tasks are implemented for kernel functions of different types of tasks based on the mapping results completed in the data mapping stage, so as to obtain task implementation designs of each thread, and one or more design process diagrams for sequential connection operation are obtained, wherein one design process diagram corresponds to the kernel function of one submatrix. Some embodiments of the present application split and map adjacency matrices to different thread blocks of a graphics processor GPU programming model through a load distribution stage, enhancing data processing speed.
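By way of illustration only, the following Python sketch shows the stage-1 binning idea under the stated assumptions: rows of a SciPy CSR adjacency matrix are grouped by their non-zero count, each group forming a sub-matrix that would be handled by its own kernel function; the bin boundaries and the function name bin_rows_by_nnz are illustrative choices, not values prescribed by the embodiments.

import numpy as np
import scipy.sparse as sp

def bin_rows_by_nnz(A_csr, boundaries=(8, 32, 128)):
    # A_csr: scipy.sparse.csr_matrix adjacency matrix.
    # Returns a list of (row_ids, sub_matrix) pairs, one per bin.
    nnz_per_row = np.diff(A_csr.indptr)          # non-zeros in each row
    bin_ids = np.digitize(nnz_per_row, boundaries)
    groups = []
    for b in range(len(boundaries) + 1):
        rows = np.where(bin_ids == b)[0]
        if rows.size:
            groups.append((rows, A_csr[rows]))   # sub-matrix kept in CSR format
    return groups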
The process of constructing a scheduling policy for the sparse kernel function is exemplarily set forth below in connection with fig. 6.
Some embodiments of the application propose a method of constructing a set of scheduling policies for sparse kernel functions, wherein a design process graph (DPG) is used to characterize the scheduling policy of a kernel function. The method can introduce existing, commonly used sparse kernel function optimization strategies into the computation semantics defined by the unified abstraction. For a given computation process Z = f(A, X, Y), the scheduling policy construction process can be divided into three stages: (1) determining a compression format of the matrix A, which determines the workload distribution among hardware units at different parallel levels; (2) mapping A and f onto hardware units at different parallel levels; (3) designing the kernel function implementation of f, which mainly concerns task granularity and reduction strategies. Each stage contains a number of design or optimization strategies, called operations; table 2 below lists the operations used in the stages described above. Next, the 3 stages of constructing a scheduling policy for a sparse kernel function are exemplarily described:
TABLE 2
FIG. 6 is an example in which the construction of scheduling policies outputs design process graphs (e.g., the final output of fig. 6 includes the two scheduling policy design process graphs scheduling policy design 0 and scheduling policy design 1). The construction includes a graph processing part and a feature processing part, each of which may be divided into the 3 stages, and one or more design process graphs DPG are generated, each DPG corresponding to the scheduling policy of one kernel function (i.e., scheduling policy design 1 and scheduling policy design 0 of fig. 6). The scheduling policy of each kernel function contains all scheduling details of that kernel function; the arrows in the DPG represent applications of operations, the characters on the arrows are the operation names, and the diagrams at both ends of an operation represent the states before and after the operation, illustrating its effect. The effect of the operations in the load distribution corresponding to stage 1 is to group vertices and feature vectors; the effect of the data mapping operations corresponding to stage 2 is to distribute the grouped vertices and features to computing units, thread groups and threads; the effect of the operations corresponding to the computation task implementation of stage 3 is to add caching and reduction policies that do not affect semantics under the current configuration. The three stages are exemplarily described below in connection with fig. 6.
Stage 1: load distribution. At this stage, the operation of the scheduling policy building focuses on transforming the matrix a in the unified abstraction.
As shown in the scheduling policy building diagram of fig. 6, the operation format_csr stores a as CSR FORMAT. It is worth mentioning that the operation BIN divides the whole matrix into a plurality of sub-matrices in the row direction according to the number of non-zero elements per row, thereby achieving splitting of kernel functions/operators at FCG level. This stage can process each sub-matrix separately while constructing various scheduling strategies for f (a, X, Y) for sparse data of different features. The adjacency matrix is grouped into sub-graph 0 and sub-graph 1 in fig. 6 and CSR storage format is used on the target sub-graph.
Stage 2: and (5) data mapping.
At this stage, as shown in FIG. 6, the operations of programming focus on mapping A and f onto hardware units. To establish an efficient thread map, the operations for A and the operations for f are coupled and determined simultaneously. For example, in fig. 6, row_block is coupled with result_block. Other operations at this stage further determine the number of threads and the data allocation of the threads.
The _BLOCK operations in fig. 6 (e.g., ROW_BLOCK(2), RESULT_BLOCK(64) or RESULT_BLOCK(128) of fig. 6) assign a number of sparse-matrix rows, non-zero elements, or vertex/edge features to one compute unit (e.g., compute unit 0, compute unit 1, compute unit 2 and compute unit 3 of fig. 6), and thus to one streaming multiprocessor (Streaming Multiprocessor, SM) of the GPU. The THREAD_GROUP operations of fig. 6 (e.g., THREAD_GROUP(32, 4) and THREAD_GROUP(64, 2) of fig. 6) are used to define the grouping of threads within a compute unit: thread groups process different sparse-matrix rows and non-zero elements in parallel, and correspond to thread bundles (warps), the basic scheduling unit of an SM in GPU hardware; the threads within a thread group process different feature dimensions in parallel. It will be appreciated that the data mapping stage of stage 2, or the construction flow depicted in fig. 6 of the present application, does not focus on f but on A, X and Y, i.e. on which part of the data the streaming multiprocessors SM, thread bundles warp and threads on the GPU specifically need to process, and not on the operations defined by the operators. Thus, splitting occurs in the stage-1 load distribution, and operator splitting is triggered if the subsequent automatic tuning finds that, after stage 1 partitions the subgraphs, the performance of multiple operators is better than the performance of a single operator.
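By way of illustration only, the following Python sketch mirrors the stage-2 data mapping described above: sparse-matrix rows are assigned to compute units, thread groups are defined within each unit, and feature columns are distributed over the threads of a group; the parameter names and default values (rows_per_block, group_size, groups_per_block) are assumptions for illustration.

def map_rows_to_hardware(num_rows, feat_dim, rows_per_block=2, group_size=32,
                         groups_per_block=4):
    # Returns, per compute unit, the rows it owns and the feature columns
    # each thread of a thread group is responsible for.
    mapping = []
    num_blocks = (num_rows + rows_per_block - 1) // rows_per_block
    for block in range(num_blocks):
        rows = list(range(block * rows_per_block,
                          min((block + 1) * rows_per_block, num_rows)))
        # thread t of a group handles feature columns t, t + group_size, ...
        cols_per_thread = {t: list(range(t, feat_dim, group_size))
                           for t in range(group_size)}
        mapping.append({"compute_unit": block, "rows": rows,
                        "thread_groups": groups_per_block,
                        "cols_per_thread": cols_per_thread})
    return mapping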
Stage 3: and (5) realizing a computing task. The specific implementation of the operations is not of interest in this stage 3, but rather is used to determine caching and reduction processing of the data.
At this stage, as shown in fig. 6, the operations of programming focus on the kernel implementation of different types of tasks in the computing task f, namely element-by-element computing tasks and reduction computing tasks. Embodiments of the present application provide efficient operations for each type of computing task, such as element-by-element operations with _elem suffixed and reduce operations with _red suffixed. In addition, the embodiment of the present application defines a design process diagram (Design Process Graph, abbreviated as DPG) of the sequential connection operation, for example, the scheduling policy design 0 and the scheduling policy design 1 obtained in fig. 6 are used for presenting the scheduling policy design of f (a, X, Y).
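By way of illustration only, a design process graph can be pictured as a sequence of named operations spanning the three stages; in the following Python sketch the operation names reuse the conventions mentioned above (FORMAT_CSR, BIN, ROW_BLOCK, THREAD_GROUP, and the _elem and _red suffixes), but the concrete parameter values are examples only.

# One possible DPG expressed as an ordered list of (operation, parameters) pairs.
dpg_design_0 = [
    ("FORMAT_CSR", {}),                         # stage 1: load distribution
    ("BIN",          {"boundaries": (8, 32)}),
    ("ROW_BLOCK",    {"rows_per_block": 2}),    # stage 2: data mapping
    ("THREAD_GROUP", {"group_size": 32, "groups": 4}),
    ("mul_elem",     {"vectorize": True}),      # stage 3: task implementation
    ("sum_red",      {"use_shared_memory": True}),
]

def describe(dpg):
    # render the sequentially connected operations of a DPG as a single line
    return " -> ".join(name for name, _ in dpg)

print(describe(dpg_design_0))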
Second, constructing a kernel function skeleton.
It should be noted that the kernel function skeleton is a kernel function framework that contains only control flow, i.e. loop structures, branch statements and control-variable assignments, and contains no data operation statements; it is the concrete expression of the scheduling policy of the sparse kernel function. An example of a kernel function skeleton is given in fig. 7.
Third, automatically completing the writing of the first kernel function code according to the kernel function skeleton, the scheduling policy of the sparse kernel function and the kernel function semantics. It should be noted that in the embodiment of the present application the kernel function code is written by instantiating the kernel function skeleton according to the design process graph DPG and the kernel function semantics, and the operator semantics carry the corresponding kernel function fragments.
Fig. 7 shows one example of kernel function code generation. The kernel function skeleton consists of configurable code segments and represents all possible variants of the control flow of the kernel function; the kernel function semantics include tensor read-write information (including the tensor read-write semantics and tensor write-back semantics of fig. 7) and operation information (i.e., the computation semantics of fig. 7), which are contained in the operator attributes; the design process graph contains the scheduling policy and implementation details of the kernel function, including data mapping, thread mapping, caching, reduction, loop unrolling and vectorization. The code generation (or code writing) process is the process of instantiating each code segment, and the arrows between the kernel function skeleton and the design process graph in fig. 7 represent the influence of the metadata on the instantiation of the code segments. In the kernel function skeleton of fig. 7, solid-line boxes represent code segments that must appear in the kernel function (e.g., traversing vertices and edges), and dashed-line boxes represent code segments that may appear in the kernel function (e.g., edge index preloading, reduction, etc.).
In some embodiments of the application, three parts participate in constructing a kernel function: the kernel function skeleton, the design process graph DPG and the kernel function semantics. The kernel function skeleton carries the control flow of the sparse kernel function and is composed of configurable code segments; the code segments can be roughly divided into vertex traversal, edge traversal, data loading, data reduction and data write-back, and the instantiation of the code segments is controlled by the scheduling policy and the operator semantics.
For example, in some embodiments of the application, the third step illustratively includes: each kernel function fragment reads the metadata corresponding to the kernel function skeleton, wherein the metadata includes at least one of the DPG and the kernel function semantics, the metadata corresponding to the DPG includes the data mapping, thread mapping, caching mode, reduction mode, vectorization and loop-unrolling mode, and the metadata corresponding to the kernel function semantics includes the read-write indices of the different tensors and the operations occurring between tensors; each kernel function fragment is automatically instantiated according to this metadata to obtain an instantiated kernel function fragment; and the kernel function skeleton together with the instantiated kernel function fragments is taken as the kernel function code corresponding to the first kernel function. In other words, some embodiments of the application write the kernel function code by having the kernel function skeleton, the kernel function scheduling policy and the kernel function semantics participate together.
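By way of illustration only, the following Python sketch shows how a kernel function fragment might be instantiated from its metadata: a text template is filled from DPG-side metadata (thread mapping) and semantics-side metadata (tensor names, indices and the reduction operation). The template content and all names are assumptions for illustration and not the C++ templates used by the embodiments.

from string import Template

# a toy reduction fragment with slots for the thread step, reduction operator,
# and the tensors/indices it reads
REDUCTION_FRAGMENT = Template(
    "for (int k = row_start; k < row_end; k += $step) {\n"
    "    acc $red_op= $lhs[$lhs_idx] * $rhs[$rhs_idx];\n"
    "}\n")

def instantiate_fragment(dpg_meta, semantics):
    return REDUCTION_FRAGMENT.substitute(
        step=dpg_meta["threads_per_group"],   # thread mapping read from the DPG
        red_op=semantics["reduce_op"],        # e.g. '+' for a sum reduction
        lhs=semantics["lhs_tensor"], lhs_idx=semantics["lhs_index"],
        rhs=semantics["rhs_tensor"], rhs_idx=semantics["rhs_index"])

fragment = instantiate_fragment(
    {"threads_per_group": 32},
    {"reduce_op": "+", "lhs_tensor": "edge_val", "lhs_idx": "k",
     "rhs_tensor": "x_src", "rhs_idx": "col[k] * dim + d"})
print(fragment)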
That is, S104 provided in the embodiment of the present application exemplarily includes:
Step 1: semantics are defined. Based on commonalities derived from the unified abstraction, the semantics of any computational process involving at most one Gather mode operator can be unified by a core skeleton, which is a root symbol template made up of multiple blocks and nested loops. Notably, the skeleton template is applicable to a parallel approach with edge-centric and vertex-centric. As shown in FIG. 7, lines 3, 11, 18, and 12 correspond to the function ψ v、φe, the function ψ v、φe in the unified abstraction, And omega e. The reduction R A along the row direction of a is achieved by implementing two loops on rows 1 and 6 and a Gather calculation block on row 14. Each block in the core skeleton may be selectively left blank according to the actual workflow. For example, if z h(i,j)=ωe(xi,xj,yhi,j), lines 13-19 in the core skeleton may be safely ignored.
Step 2: and constructing a kernel function. In the kernel skeleton, each block contains a slot for kernel fragments, which are pre-prepared C++ templates, specific to updating specific computations of vertex or edge features. During the kernel construction process, all of these slots are connected with corresponding kernel fragments, and each kernel fragment is instantiated based on the DPG. Each kernel function segment reads at least one of DPG and kernel function semantics, and the information read from the DPG comprises data mapping, thread mapping, caching mode, reduction mode, vectorization and cyclic expansion mode, and the information read from the kernel function semantics comprises read-write indexes of different tensors and operations occurring between the tensors, and is automatically materialized according to the information. Thus, the kernel skeleton, together with the materialized kernel fragments, forms a kernel code.
Step 3: and (5) automatic tuning. To improve the performance of kernel functions, autotune is mainly focused on enumerating DPGs and selecting the optimal DPG. In order to handle the huge search space consisting of parameters and structures of DPG, embodiments of the present application employ a multi-level search strategy, gradually optimizing searches to a fine-grained level starting from coarse-grained exploration.
The search strategy in the automatic tuning process consists of three steps (levels). In a first step, the graph structure is randomly enumerated by selecting a null operation and connecting them together. In the second step, the operations (nodes) are searched and determined in a coarse-grained manner. The performance of each DPG is obtained by directly running the corresponding program. In the third step, the high performance results of the second step are further tuned in a fine-grained parameter space using a Machine Learning (ML) model (e.g., XGBoost). The ML model may accelerate fine-grained searching because the overhead of the ML model is negligible compared to the cost of executing the program. In addition, heuristic methods, such as simulated annealing, are applied to further increase the execution speed of the first two steps.
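By way of illustration only, the following Python sketch outlines the three-level search under the stated assumptions: DPG structures are enumerated, coarse-grained candidates are timed by direct execution, and the best candidates are refined in a fine-grained parameter space with an XGBoost cost model. The callables enumerate_structures, coarse_candidates, run_and_time, encode_params and fine_grid are injected placeholders, not components defined by the embodiments.

import numpy as np
import xgboost as xgb

def tune(enumerate_structures, coarse_candidates, run_and_time,
         encode_params, fine_grid, top_k=5):
    # level 1: random structure enumeration
    structures = enumerate_structures()
    # level 2: coarse-grained search, measured by directly running each candidate
    measured = [(s, p, run_and_time(s, p))
                for s in structures for p in coarse_candidates(s)]
    measured.sort(key=lambda t: t[2])
    best = measured[:top_k]
    # level 3: fine-grained search guided by a learned cost model
    X = np.array([encode_params(s, p) for s, p, _ in measured])
    y = np.array([t for _, _, t in measured])
    model = xgb.XGBRegressor(n_estimators=100).fit(X, y)
    refined = []
    for s, p, _ in best:
        cands = fine_grid(s, p)
        preds = model.predict(np.array([encode_params(s, c) for c in cands]))
        c_best = cands[int(np.argmin(preds))]          # predicted-fastest variant
        refined.append((s, c_best, run_and_time(s, c_best)))
    return min(refined, key=lambda t: t[2])            # best (structure, params, time)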
Referring to fig. 8, fig. 8 shows an apparatus for automatically tuning a graph neural network according to an embodiment of the present application. It should be understood that the apparatus corresponds to the method embodiment of fig. 2 and is capable of executing the steps of that method embodiment; the specific functions of the apparatus may be found in the description above, and a detailed description is omitted here to avoid repetition. The apparatus comprises at least one software functional module that can be stored in a memory in the form of software or firmware or embedded in the operating system of the device, and the apparatus for automatically tuning the graph neural network comprises: the computational graph tuning module 101 and the operator tuning module 102.
A computational graph tuning module configured to: in the ith calculation map optimizing stage: selecting at least one group of operators to be fused from an original operator model corresponding to the neural network model of the graph to be optimized or an operator optimization model obtained through the last operator optimization stage, wherein each group of operators to be fused comprises a plurality of continuous operator nodes, and each operator node belongs to a low-cost dense operator or belongs to a sparse operator; and replacing each group of operators to be fused with one fusion operator node to obtain an ith computational graph optimization model corresponding to the neural network of the graph to be optimized.
An operator tuning module configured to: in the ith operator optimization stage: if the first operator in the ith computational graph optimization model meets the fission condition, the first operator is decomposed into a plurality of operators in the kernel function generation process to obtain the ith operator optimization model, wherein the ith operator optimization model is used as the input of the next computational graph optimization stage; mapping all the operators obtained by fission in the ith operator optimization model to corresponding hardware units through data mapping to complete the writing of the corresponding kernel functions, and performing a self-increment operation on i, wherein the hardware units comprise corresponding computing units (also called streaming multiprocessors) on a GPU processor, and one sub-operator corresponds to one split operator.
It should be noted that, in the embodiment of the present application, multiple iterations are performed through the computation graph tuning module and the operator tuning module until a target operator model corresponding to the graph neural network model is obtained and a code corresponding to the target operator model is obtained; the original operator model, the i-th calculation map optimization model and the i-th operator optimization model all adopt fine-granularity calculation maps, the fine-granularity calculation maps comprise operator operation modes, operator state descriptions and operator mathematical attributes corresponding to each operator, and the operator state descriptions are used for describing that the corresponding operator belongs to one of the following operators: original operator, fusion operator, or split operator.
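By way of illustration only, the following Python sketch shows how the two modules could alternate across iterations, with fusion performed in the computational-graph stage and fission performed during kernel generation; every helper passed into auto_tune is a placeholder standing in for the corresponding module behaviour described above, not an interface defined by the embodiments.

def auto_tune(model, select_fusable_groups, replace_with_fused_node,
              operators, meets_fission_condition, split_operator,
              generate_and_map_kernels, converged, max_iters=10):
    for i in range(max_iters):
        # i-th computational-graph optimization stage: fuse consecutive operators
        for group in select_fusable_groups(model):
            model = replace_with_fused_node(model, group)   # one fused node per group
        # i-th operator optimization stage: split operators that meet the fission condition
        for op in operators(model):
            if meets_fission_condition(op):                 # e.g. multiple kernels run faster
                model = split_operator(model, op)
        generate_and_map_kernels(model)                     # data mapping + kernel code writing
        if converged(model):
            break
    return model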
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method for the specific working procedure of the apparatus described above, and this will not be repeated here.
Some embodiments of the application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs a method as described in any of the embodiments above.
As shown in fig. 9, some embodiments of the present application provide an electronic device including a memory 710, a processor 720, and a computer program stored on the memory 710 and executable on the processor 720, wherein the processor 720 can implement the method according to any of the embodiments described above when reading the program and executing the program through a bus 730.
The processor 720 may process digital signals and may include various computing architectures, such as a complex instruction set computer architecture, a reduced instruction set computer architecture, or an architecture implementing a combination of instruction sets. In some examples, the processor 720 may be a microprocessor.
The memory 710 may be used for storing instructions to be executed by the processor 720 or data related to the execution of the instructions. Such instructions and/or data may include code to implement some or all of the functions of one or more of the modules described in the embodiments of the present application. The processor 720 of the disclosed embodiments may be used to execute the instructions in the memory 710 to implement the method shown in fig. 2. The memory 710 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, for example, of the flowcharts and block diagrams in the figures that illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application. It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (14)

1. A method for automatically tuning a graph neural network, the method comprising:
In the ith calculation map optimizing stage:
Selecting at least one group of operators to be fused from an original operator model corresponding to the neural network model of the graph to be optimized or an operator optimization model obtained through the last operator optimization stage, wherein each group of operators to be fused comprises a plurality of continuous operator nodes, and each operator node belongs to a low-cost dense operator or belongs to a sparse operator;
Replacing each group of operators to be fused with a fusion operator node to obtain an ith computational graph optimization model corresponding to the neural network of the graph to be optimized;
In the ith operator optimization stage:
If the first operator in the ith computational graph optimization model meets the fission condition, the first operator is decomposed into a plurality of operators in the kernel function generation process to obtain the ith operator optimization model, wherein the ith operator optimization model is used as the input of the next computational graph optimization stage;
Mapping all operators obtained by fission in the ith operator optimization model to corresponding hardware units through data mapping to complete the corresponding kernel function writing, and performing a self-increment operation on i, wherein the hardware units comprise corresponding computing units on a GPU processor, and different operators obtained by fission correspond to different computing units;
repeating the above process until a corresponding target operator model of the graph neural network model is obtained and a code corresponding to the target operator model is obtained;
The original operator model, the i-th calculation map optimization model and the i-th operator optimization model all adopt fine-granularity calculation maps, the fine-granularity calculation maps comprise operator operation modes, operator state descriptions and operator mathematical attributes corresponding to each operator, and the operator state descriptions are used for describing that the corresponding operator belongs to one of the following operators: original operator, fusion operator, or split operator.
2. The method of claim 1, wherein,
The operator operation modes comprise: Scatter, ApplyEdge, Gather and ApplyVertex;
the operator mathematical attributes are used to describe whether the respective operators are combinative, swappable, and/or distributive.
3. The method of claim 1, wherein selecting at least one set of operators to be fused from an original operator model corresponding to a graph neural network model to be optimized or from an operator optimization model obtained through a previous operator optimization stage comprises:
selecting at least one starting point operator from the original operator model corresponding to the neural network model of the graph to be optimized or from the operator optimization model obtained by the last operator optimization stage;
And carrying out first direction propagation and/or second direction propagation from the starting point operators to obtain operators to be fused corresponding to each starting point operator, wherein the first direction is a subsequent operator direction, the subsequent operator direction is a direction limited by operators positioned behind the starting point operator in the original operator model or the operator optimization model obtained from the last operator optimization stage, the second direction belongs to a precursor operator direction, and the precursor operator direction is a direction limited by operators positioned in front of the starting point operator in the original operator model or the operator optimization model obtained from the last operator optimization stage.
4. A method as claimed in claim 3, wherein the start operator belongs to a dense operator, wherein the dense operator belongs to an operator where both the output data and the input data are vertex or edge data.
5. A method according to claim 3, wherein said propagating in a first direction and/or propagating in a second direction from said origin operators, obtaining a set of operators to be fused corresponding to each origin operator, comprises:
And obtaining the operator to be fused at least according to an original operator model corresponding to the neural network model of the graph to be optimized or an operator operation mode of a related operator in the operator optimization model obtained from the last operator optimization stage.
6. The method of claim 5, wherein the obtaining the operator to be fused based at least on an original operator model corresponding to the neural network model of the graph to be optimized or an operator operation pattern of a related operator in an operator optimization model obtained through a previous operator optimization stage, comprises:
If the mth operator belongs to the first direction of the kth starting point operator and the mth operator is adjacent to the kth starting point operator, further confirming that the mth operator and the kth starting point operator have the same operation mode, and taking the mth operator as an operator to be fused, wherein the operator is in a group corresponding to the kth starting point operator.
7. A method according to claim 3, wherein the obtaining the operator to be fused at least from an original operator model corresponding to the neural network model of the graph to be optimized or from an operator operation pattern of a related operator in an operator optimization model obtained through a previous operator optimization stage comprises:
if the mth operator is confirmed to be positioned in the second direction of the kth starting point operator and the mth operator and the kth starting point operator are confirmed to have different operation modes, whether the mth operator is used as an operator to be fused of a group corresponding to the kth starting point operator or not is confirmed according to an operation performance analysis result of the operator to be evaluated.
8. The method of claim 1, wherein the determining that the first operator in the ith computational graph optimization model satisfies a fission condition comprises:
And if it is confirmed that the performance of the plurality of kernel functions adopted in the process of generating the code of the first operator meets the set standard, confirming that the first operator meets the fission condition for executing operator splitting.
9. The method of claim 1, wherein mapping all the fissionable operators in the ith operator optimization model to corresponding hardware units by data mapping to complete the corresponding kernel function writing comprises:
constructing a scheduling policy of the sparse kernel function corresponding to the operator to be fissioned, completing the mapping of data onto hardware and constructing corresponding operations for the computation task;
constructing a kernel function skeleton, wherein the kernel function skeleton is used for representing the control flow of the sparse kernel function;
and automatically completing the writing of the first kernel function code based on the kernel function skeleton and the scheduling policy of the sparse kernel function.
10. The method of claim 9, wherein constructing the scheduling policy of the sparse kernel function corresponding to the operator to be fissioned comprises:
In the load distribution stage, dividing the adjacency matrix corresponding to the operator to be fissioned to obtain a plurality of sub-matrices so as to realize operator splitting, wherein one sub-matrix corresponds to one kernel function;
In the data mapping stage, mapping each of the plurality of sub-matrices onto different thread blocks of the graphics processor GPU programming model to obtain the thread-block data mapping, and determining the number of threads and the data distribution;
In the computation task implementation stage, implementing element-wise computation tasks and reduction computation tasks for the kernel functions of the different task types based on the mapping results completed in the data mapping stage, so as to obtain the task implementation design of each thread and one or more design process graphs of sequentially connected operations, wherein one design process graph corresponds to the kernel function of one sub-matrix.
11. The method of claim 9, wherein the automatically completing the first kernel code writing based on the kernel skeleton and the sparse kernel scheduling policy comprises:
Each kernel function fragment reads the metadata corresponding to the kernel function skeleton, wherein the metadata comprises the DPG and/or the kernel function semantics, the metadata corresponding to the DPG comprises the data mapping, thread mapping, caching mode, reduction mode, vectorization and loop-unrolling mode, and the metadata corresponding to the kernel function semantics comprises the read-write indices of the different tensors and the operations occurring between tensors;
Automatically instantiating each kernel function segment according to the metadata information to obtain an instantiated kernel function segment;
And taking the kernel function skeleton and the instantiated kernel function fragment as kernel function codes corresponding to the first kernel function.
12. An apparatus for automatically tuning a graphic neural network, the apparatus comprising:
A computational graph tuning module configured to:
In the ith calculation map optimizing stage:
Selecting at least one group of operators to be fused from an original operator model corresponding to the neural network model of the graph to be optimized or an operator optimization model obtained through the last operator optimization stage, wherein each group of operators to be fused comprises a plurality of continuous operator nodes, and each operator node belongs to a low-cost dense operator or belongs to a sparse operator;
Replacing each group of operators to be fused with a fusion operator node to obtain an ith computational graph optimization model corresponding to the neural network of the graph to be optimized;
An operator tuning module configured to:
In the ith operator optimization stage:
If the first operator in the ith computational graph optimization model meets the fission condition, the first operator is decomposed into a plurality of operators in the kernel function generation process to obtain the ith operator optimization model, wherein the ith operator optimization model is used as the input of the next computational graph optimization stage;
Mapping all operators obtained by fission in the ith operator optimization model to corresponding hardware units through data mapping to complete the corresponding kernel function writing, and performing a self-increment operation on i, wherein the hardware units comprise corresponding computing units on a GPU processor, and one sub-operator corresponds to one split operator;
performing multiple iterations through the calculation graph tuning module and the operator tuning module until a target operator model corresponding to the graph neural network model is obtained and codes corresponding to the target operator model are obtained;
The original operator model, the i-th calculation map optimization model and the i-th operator optimization model all adopt fine-granularity calculation maps, the fine-granularity calculation maps comprise operator operation modes, operator state descriptions and operator mathematical attributes corresponding to each operator, and the operator state descriptions are used for describing that the corresponding operator belongs to one of the following operators: original operator, fusion operator, or split operator.
13. A computer readable storage medium having stored thereon a computer program, which when executed by a processor, is adapted to carry out the method of any of claims 1-11.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor is operable to implement a method as claimed in any one of claims 1 to 11 when the program is executed by the processor.
CN202410153195.7A 2024-02-02 2024-02-02 Method and device for automatically optimizing graph neural network Pending CN117993426A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410153195.7A CN117993426A (en) 2024-02-02 2024-02-02 Method and device for automatically optimizing graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410153195.7A CN117993426A (en) 2024-02-02 2024-02-02 Method and device for automatically optimizing graph neural network

Publications (1)

Publication Number Publication Date
CN117993426A true CN117993426A (en) 2024-05-07

Family

ID=90887098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410153195.7A Pending CN117993426A (en) 2024-02-02 2024-02-02 Method and device for automatically optimizing graph neural network

Country Status (1)

Country Link
CN (1) CN117993426A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination