CN114936015A - Deep learning compiler based on hardware computation graph - Google Patents

Deep learning compiler based on hardware computation graph

Info

Publication number: CN114936015A
Application number: CN202210625568.7A (filed 2022-06-02 by Nanjing University; priority date 2022-06-02)
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 李武军, 李俊, 王炜
Current assignee: Nanjing University
Original assignee: Nanjing University
Prior art keywords: computation graph, hardware, software, node, graph
Legal status: Pending (an assumption, not a legal conclusion; no legal analysis has been performed)
Publication date: 2022-08-23

Classifications

    • G06F8/37 Compiler construction; Parser generation
    • G06F8/433 Dependency analysis; Data or control flow analysis
    • G06F8/443 Optimisation
    • G06F8/447 Target code generation
    • G06F8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/457 Communication
    • G06N3/10 Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a deep learning compiler based on a hardware computation graph. The compiler comprises a software computation graph generation module, a hardware computation graph generation module, a software computation graph optimization module, a hardware computation graph optimization module, and a code generation module. The software computation graph generation module converts a deep learning model defined by an external framework into an internal software computation graph. The hardware computation graph generation module generates a basic node set under the guidance of the software computation graph, adjusts the node set with greedy search until the computation requirements of the whole deep learning model are met, and finally generates the edges of the hardware computation graph. The code generation module generates the hardware code and simulation code corresponding to the deep learning model. The invention can deploy models with complex topologies efficiently and with high performance while reducing deployment cost, and can be applied to various hardware platforms such as CPUs, GPUs, FPGAs, and dedicated artificial-intelligence chips.

Description

Deep learning compiler based on hardware computation graph
Technical Field
The invention relates to a deep learning compiler based on a hardware computation graph. It belongs to the technical field of hardware resource utilization by deep learning compilers and realizes automatic deployment of deep learning models.
Background
In recent years, deep learning has been widely used in many scenarios, such as image recognition, object detection, and semantic segmentation. Efficient acceleration of deep learning models is therefore receiving increasing attention from both academia and industry, and automatically deploying models efficiently and with high performance has become especially important. Deep learning compilers were proposed for this purpose. The prevailing software-architecture scheme is to convert a model defined by an external framework into a compiler-internal representation, optimize it against the resource budget of the target hardware platform, and then generate hardware code.
However, as the application scenarios of deep learning keep expanding and the community keeps pursuing higher model accuracy, model topologies are becoming more and more complex. Existing compilers can offer only limited support for complex topologies, typically by resorting to hardware platforms with more, and more expensive, hardware resources, which greatly increases deployment cost.
Disclosure of Invention
The purpose of the invention is as follows: current deep learning compilers perform poorly in hardware resource utilization and can improve performance only by using hardware platforms with more, and more expensive, resources. As the complexity of model topologies keeps rising, such approaches cannot even guarantee that deployment succeeds. The invention provides a deep learning compiler based on a hardware computation graph that can deploy complex topology models with fewer hardware resources and effectively improves hardware resource utilization.
The technical scheme is as follows: a deep learning compiler based on a hardware computation graph comprises a software computation graph generation module, a hardware computation graph generation module, a software computation graph optimization module, a hardware computation graph optimization module and a code generation module;
the software computation graph generation module is used for converting the deep learning model defined by the external framework into an internal software computation graph;
the software computation graph optimization module is used for optimizing the software computation graph;
the hardware computation graph generation module generates a basic node set under the guidance of the software computation graph, adjusts the node set by greedy search until the computation requirements of the whole deep learning model are met, and finally generates the edges of the hardware computation graph;
the hardware computation graph optimization module is used for optimizing the hardware computation graph;
the code generation module is used for generating hardware codes and simulation codes corresponding to the deep learning model.
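The patent discloses no source code; as a reading aid, the following minimal Python sketch shows how the five modules could be chained into one compile flow. Every class, method, and return value here is hypothetical, not taken from the patent.

```python
class HardwareGraphCompiler:
    """Sketch of the five-module pipeline; each stage is a stub."""

    def __init__(self, target_resources: int):
        self.target_resources = target_resources  # total resources of the target platform

    # Stubs standing in for the modules described above (illustrative only).
    def generate_software_graph(self, model_file):
        return {"nodes": [], "source": model_file}

    def optimize_software_graph(self, sw_graph):
        return sw_graph

    def generate_hardware_graph(self, sw_graph):
        return {"nodes": [], "edges": []}

    def optimize_hardware_graph(self, hw_graph):
        return hw_graph

    def generate_code(self, sw_graph, hw_graph):
        return ("hardware_code", "simulation_code")

    def compile(self, model_file):
        sw = self.generate_software_graph(model_file)  # software computation graph generation
        sw = self.optimize_software_graph(sw)          # software computation graph optimization
        hw = self.generate_hardware_graph(sw)          # hardware computation graph generation
        hw = self.optimize_hardware_graph(hw)          # hardware computation graph optimization
        return self.generate_code(sw, hw)              # hardware code and simulation code
```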
In the software computation graph generation module, node types and edges in the software computation graph are defined according to hardware implementation efficiency, and the model file is parsed and converted. The nodes in the software computation graph are the operation units of a deep learning model and comprise operators such as padding, convolution, max pooling, average pooling, global pooling, addition, and concatenation; the edges in the software computation graph represent data passed between nodes.
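A minimal sketch of the node and edge representation this paragraph implies, assuming an adjacency-list layout; the OpType enum and SoftwareNode class are illustrative names, not from the patent.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List

class OpType(Enum):
    # The operator kinds named in this paragraph.
    PAD = auto()
    CONV = auto()
    MAX_POOL = auto()
    AVG_POOL = auto()
    GLOBAL_POOL = auto()
    ADD = auto()
    CONCAT = auto()
    DUPLICATE = auto()  # broadcast operator introduced by the optimization module

@dataclass
class SoftwareNode:
    op: OpType
    # Edges are the data passed between nodes, stored as predecessor/successor lists.
    inputs: List["SoftwareNode"] = field(default_factory=list)
    outputs: List["SoftwareNode"] = field(default_factory=list)
```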
In the hardware computation graph generation module, node types in the hardware computation graph correspond one-to-one to node types of the software computation graph, but one node in the hardware computation graph represents one processing unit on the hardware platform; that is, nodes and edges respectively represent modules and hard wiring inside a chip. In a practical implementation, a multiplexer and a demultiplexer are added at each input port and each output port of a node, respectively: the multiplexer selects the predecessor node of the current hardware node, and the demultiplexer selects its successor node. In the hardware computation graph generation module, software computation graph nodes are mapped to hardware computation graph nodes as follows. Based on the idea of a multi-core processor, the invention performs one-to-many mapping between the two. For a convolution node in the software computation graph, when no convolution node in the existing hardware computation graph can meet its computation requirement, the output-channel dimension of the convolution node's weights is partitioned (output-channel partitioning, OCP), and the weight data are distributed across several hardware nodes (nodes in the hardware computation graph) for computation. For a pooling node in the software computation graph, when no pooling node in the existing hardware computation graph can meet its computation requirement, the width dimension of the pooling node's input data is partitioned (input-width partitioning, IWP). Together with the case in which no node partitioning (NP) is required, 9 types of connection modes between hardware computation graph nodes are proposed, each written as predecessor → current:
NP → NP: neither the current software computation graph node nor its predecessor needs to be partitioned;
NP → IWP: the current node needs to be partitioned along the width dimension of its input data, while the predecessor does not need to be partitioned;
NP → OCP: the predecessor does not need to be partitioned, while the current node needs to be partitioned along the output-channel dimension of its weight data;
IWP → NP: the current node does not need to be partitioned, while the predecessor needs to be partitioned along the width dimension of its input data;
IWP → IWP: both the current node and the predecessor need to be partitioned along the width dimension of their input data;
IWP → OCP: the current node is partitioned along the output-channel dimension of its weight data, while the predecessor is partitioned along the width dimension of its input data;
OCP → NP: the current node does not need to be partitioned, while the predecessor needs to be partitioned along the output-channel dimension of its weight data;
OCP → IWP: the predecessor is partitioned along the output-channel dimension of its weight data, while the current node is partitioned along the width dimension of its input data;
OCP → OCP: both the current node and the predecessor need to be partitioned along the output-channel dimension of their weight data.
Each of the 9 connection modes is realized in 2 steps. For example, for OCP → IWP, an OCP → IWP back-end module is added to the output ports of all hardware modules required by the predecessor software computation graph node, and an OCP → IWP front-end module is added to the input ports of the hardware modules required by the current layer; the other connection modes add a front-end or back-end module at the input or output ports of the corresponding hardware modules by the same mechanism, as sketched below.
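The sketch below shows one way a compiler could name the back-end/front-end adapter pair for a given connection mode; the Partition enum, the naming scheme, and the return convention are all assumptions.

```python
from enum import Enum

class Partition(Enum):
    NP = "no partition"
    IWP = "input-width partition"
    OCP = "output-channel partition"

def adapter_names(pred: Partition, cur: Partition):
    """Return the (hypothetical) back-end and front-end adapter modules for one of
    the 9 connection modes, or None when NP -> NP needs only a plain hard wire."""
    if pred is Partition.NP and cur is Partition.NP:
        return None
    mode = f"{pred.name}->{cur.name}"
    # Back-end adapter goes on every output port of the predecessor's hardware
    # modules; front-end adapter goes on every input port of the current node's.
    return (f"{mode}-backend", f"{mode}-frontend")
```

For example, `adapter_names(Partition.OCP, Partition.IWP)` yields the back-end/front-end pair for the OCP → IWP mode described above.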
The specific steps of converting the model defined by the external framework into the internal software computation graph are as follows: input a trained model defined by an external framework and parse the operators in the model; then reorder the nodes based on the depth-first principle.
The software computation graph optimization module optimizes the software computation graph as follows: each node with more than 2 input ports in the software computation graph is implemented by, and replaced with, several 2-input nodes; a Duplicate operator for broadcasting data is appended to the tail of each node with more than 2 output ports; and each Duplicate operator with more than 2 output ports is implemented by, and replaced with, several 2-output Duplicate operators.
The steps of generating the hardware computation graph are specifically as follows. According to the types of nodes in the software computation graph and the amount of hardware resources they use, frequency statistics are computed over the software computation graph nodes, and hardware resources (e.g., memory size, computing resources) are allocated to nodes according to frequency. Then, based on the idea of greedy search and taking the total resources of the target hardware platform into account, a hardware computation graph node set that can meet the computation requirements of the whole model is generated under the guidance of the software computation graph; a hardware computation graph node is the hardware resource allocated to software computation graph nodes, represented as a data structure. Next, the edges of the hardware computation graph are generated from the software computation graph and the hardware node set; the edges reflect the interconnections between hardware computation graph nodes. When generating the hardware computation graph edges, the software computation graph is divided into several subgraphs, and data between subgraphs are communicated through off-chip memory; off-chip storage resources are allocated to the inter-subgraph data, with the allocation realized by dynamic programming. Each node in the hardware computation graph represents a processing unit, but on each run some processing units need to fetch data from off-chip memory.
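The patent states that off-chip allocation for inter-subgraph data is realized with dynamic programming but gives no recurrence; the sketch below substitutes a simple first-fit allocator over tensor live ranges just to make the interface concrete. The function name, tuple layout, and algorithm choice are all assumptions.

```python
def allocate_offchip(tensors):
    """Assign an off-chip byte offset to each inter-subgraph tensor.

    `tensors` is a list of (size, producer_subgraph, last_consumer_subgraph);
    two tensors may share memory only if their live ranges do not overlap.
    """
    placed = []                      # (offset, size, start, end) of placed tensors
    offsets = [0] * len(tensors)
    order = sorted(range(len(tensors)), key=lambda i: -tensors[i][0])  # big first
    for i in order:
        size, start, end = tensors[i]
        offset = 0
        for p_off, p_size, p_start, p_end in sorted(placed):
            live_overlap = not (end < p_start or p_end < start)
            if live_overlap and offset < p_off + p_size and offset + size > p_off:
                offset = p_off + p_size  # bump past the conflicting block
        placed.append((offset, size, start, end))
        offsets[i] = offset
    return offsets
```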
The specific steps of optimizing the hardware computation graph in the hardware computation graph optimization module are: traverse each edge in the hardware computation graph and prune the edges with duplicated functions.
Advantageous effects: compared with the prior art, the deep learning compiler based on the hardware computation graph can not only deploy models with complex topologies using fewer hardware resources, but also achieve better performance for models of low topological complexity. The invention can be applied to various hardware platforms such as CPUs, GPUs, FPGAs, and dedicated artificial-intelligence chips.
Drawings
FIG. 1 is a flow chart of the operation of an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the generation of a software computation graph according to an embodiment of the present invention;
FIG. 3 is a flow chart of the optimization of a software computation graph according to an embodiment of the present invention;
FIG. 4 is a flowchart of a first stage generation of a hardware computation graph according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a second stage of generation of a hardware computation graph according to an embodiment of the present invention;
FIG. 6 is an edge generation flowchart of a hardware computation graph according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, which is to be given the full breadth of the claims appended hereto.
When an FPGA is used as the target hardware platform, the deep learning compiler based on a hardware computation graph comprises a software computation graph generation module, a hardware computation graph generation module, a software computation graph optimization module, a hardware computation graph optimization module, and a code generation module. The working process is: software computation graph generation (FIG. 2), software computation graph optimization (FIG. 3), hardware computation graph generation (FIGS. 4-6), hardware computation graph optimization, and code generation.
The generation flow of the software computation graph realized by the software computation graph generation module is as follows:
(Step 2.0) Initialize the operator index cur in the model defined by the current external deep learning framework to 0.
(Step 2.1) Judge whether the compiler currently supports the type of the cur-th operator.
(Step 2.2) If the current operator is not supported, report an error and exit the program.
(Step 2.3) Otherwise, create a compiler-internal representation node (i.e., a node of the software computation graph) for the operator in the input model.
(Step 2.4) Increment cur to point to the next operator.
(Step 2.5) Repeat Steps 2.1-2.4 until all operators in the model have been traversed.
(Step 2.6) Under the guidance of the input model, determine the predecessor and successor nodes of each software computation graph node, thereby generating the edges of the software computation graph.
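A compact Python rendering of Steps 2.0-2.6, assuming the parsed model arrives as a list of operator dicts; the dict layout and field names are illustrative.

```python
def build_software_graph(model_ops, supported_ops):
    """model_ops: operators parsed from the externally defined model, in order;
    each is a dict like {"type": "conv", "input_ids": [0, 1]}."""
    nodes = []
    for op in model_ops:                                             # Steps 2.1-2.5
        if op["type"] not in supported_ops:
            raise SystemExit(f"unsupported operator: {op['type']}")  # Step 2.2
        nodes.append({"op": op["type"], "in": [], "out": []})        # Step 2.3
    for i, op in enumerate(model_ops):                   # Step 2.6: wire the edges
        for j in op.get("input_ids", []):
            nodes[i]["in"].append(nodes[j])
            nodes[j]["out"].append(nodes[i])
    return nodes
```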
In the software computation graph optimization module, the optimization flow of the software computation graph is as follows:
(Step 3.0) Initialize the operator index cur in the current software computation graph to 0.
(Step 3.1) Count the number n of input edges and the number m of output edges of the current node.
(Step 3.2) When n ≤ 2, do nothing.
(Step 3.3) When n > 2, split the input and replace the node with (n-1) 2-input nodes.
(Step 3.4) When m < 2, do nothing.
(Step 3.5) When m ≥ 2, insert a Duplicate operator at the tail of the current node; its input is the current node, its outputs are all output edges of the current node, and the original output edges of the current node are deleted.
(Step 3.6) Count the number k of output edges of the current Duplicate node.
(Step 3.7) When k = 2, do nothing.
(Step 3.8) When k > 2, replace it with (k-1) Duplicate nodes.
(Step 3.9) Increment cur to point to the next node.
(Step 3.10) Repeat Steps 3.1-3.9 until all nodes in the software computation graph have been traversed.
(Step 3.11) Reorder all nodes depth-first.
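A sketch of Steps 3.5-3.8 (the Duplicate broadcast tree), reusing the dict-based node layout from the sketch above; for brevity it omits rewiring the consumers' own input lists.

```python
def make_dup_tree(src, consumers):
    # Chain of 2-output Duplicates feeding `consumers` from `src`:
    # k consumers need exactly (k - 1) Duplicates.
    d = {"op": "Duplicate", "in": [src], "out": []}
    if len(consumers) <= 2:
        d["out"] = consumers
        return [d]
    rest = make_dup_tree(d, consumers[1:])
    d["out"] = [consumers[0], rest[0]]
    return [d] + rest

def binarize_fanout(node):
    """Give a node with m >= 2 consumers a broadcast tree of 2-output Duplicates."""
    consumers = list(node["out"])
    if len(consumers) < 2:
        return [node]                      # Step 3.4: nothing to do
    dups = make_dup_tree(node, consumers)  # Steps 3.5-3.8
    node["out"] = [dups[0]]                # node now feeds the first Duplicate
    return [node] + dups
```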
In the hardware computation graph generation module, the generation flow of the hardware computation graph comprises basic node generation, node set adjustment, edge generation, and off-chip storage allocation.
Basic node generation realizes the frequency-based allocation of hardware resources to nodes; its flow is as follows:
(Step 4.0) Initialize a frequency-statistics dictionary dict, whose keys are node descriptors and whose values are node frequencies (i.e., occurrence counts); initialize the current software node (a node in the software computation graph) index cur to 0.
(Step 4.1) Compute the on-chip memory size ram_size required by the cur-th software node.
(Step 4.2) Generate the key of the software node from ram_size and the type of the current software node.
(Step 4.3) Increment the frequency corresponding to that key.
(Step 4.4) Point cur to the next software node.
(Step 4.5) Repeat Steps 4.1-4.4 until all nodes in the software computation graph have been traversed.
(Step 4.6) Divide the counts in dict by the total number of nodes in the software computation graph to obtain the frequency dictionary dict_freq.
(Step 4.7) Initialize the current key index cur_k to 0.
(Step 4.8) Fetch the cur_k-th key k in dict_freq.
(Step 4.9) Multiply the frequency dict_freq[k] by the total resource count of the target hardware platform to obtain the hardware resources allocated to the current key.
(Step 4.10) Point cur_k to the next key.
(Step 4.11) Repeat Steps 4.8-4.10 until all keys in dict_freq have been traversed; all basic hardware nodes are thus obtained.
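Steps 4.0-4.11 amount to a frequency count over (node type, RAM size) keys followed by proportional resource allocation. A hedged sketch, with ram_size_of standing in for the unspecified per-node memory computation:

```python
from collections import Counter

def allocate_basic_nodes(sw_nodes, total_resources):
    """Cluster software nodes by (type, on-chip RAM need) and give each cluster a
    resource share proportional to its frequency. Node layout is illustrative."""
    def ram_size_of(node):              # stand-in for Step 4.1
        return node.get("ram", 0)

    freq = Counter((n["op"], ram_size_of(n)) for n in sw_nodes)  # Steps 4.1-4.5
    total = sum(freq.values())
    return {key: count / total * total_resources                 # Steps 4.6-4.11
            for key, count in freq.items()}
```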
The node set adjustment flow is as follows:
(Step 5.0) Initialize the current hardware computation graph node index cur to 0.
(Step 5.1) Starting from the current node, generate a set number (at most 20, to keep the program runtime bounded) of software computation subgraph candidates (i.e., consecutive node sets in the software computation graph).
(Step 5.2) According to the 9 connection modes between nodes proposed by the invention, compute the types (e.g., convolution, pooling), configurations, and numbers of hardware modules required by each subgraph candidate under the current hardware computation graph node set.
(Step 5.3) Using the results of Step 5.2, subtract the current hardware computation graph node set from the node set actually required by the input model to obtain the difference between the two node sets.
(Step 5.4) Among the candidates, select the longest subgraph that does not exceed the total resources of the target hardware platform.
(Step 5.5) Advance cur to the starting point of the next iteration.
(Step 5.6) Supplement the existing hardware node set: construct new hardware computation graph nodes according to the hardware nodes required by the subgraph candidate selected in Step 5.4 and add them to the existing set.
(Step 5.7) Repeat Steps 5.1-5.6 until the whole software computation graph has been traversed.
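A sketch of the greedy adjustment in Steps 5.0-5.7, assuming the hardware shortfall of a candidate subgraph can be summarized as a single number by a caller-supplied cost_of function (in the patent this is a set difference over module types, configurations, and counts):

```python
def adjust_node_set(sw_nodes, hw_nodes, budget, cost_of, max_candidates=20):
    """cost_of(subgraph, hw_nodes) -> extra resources the subgraph still needs."""
    cur = 0
    while cur < len(sw_nodes):
        best = None
        # Candidate subgraphs: consecutive runs starting at cur (Step 5.1).
        for length in range(1, min(max_candidates, len(sw_nodes) - cur) + 1):
            candidate = sw_nodes[cur:cur + length]
            extra = cost_of(candidate, hw_nodes)       # Steps 5.2-5.3
            if extra <= budget:
                best = (length, extra)       # longest fitting one wins (Step 5.4)
        if best is None:
            raise RuntimeError("no candidate fits the resource budget")
        length, extra = best
        hw_nodes.append(("new-hw-nodes-for", sw_nodes[cur:cur + length]))  # Step 5.6
        budget -= extra
        cur += length                                  # Step 5.5
    return hw_nodes
```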
The edge generation flow is as follows:
(Step 6.0) Initialize the current subgraph index cur_g in the software computation graph to 0 and the node index cur_n within the current subgraph to 0.
(Step 6.1) Find the hardware computation graph node set required by the cur_n-th node of the cur_g-th subgraph.
(Step 6.2) Initialize the input node index cur_input of the current software node to 0.
(Step 6.3) If the cur_input-th input software computation graph node is also contained in the current subgraph, add edges between the hardware computation graph nodes required by the two software computation graph nodes.
(Step 6.4) Otherwise, add an edge between the read port of the FPGA's off-chip storage communication module and the hardware computation graph nodes required by the current software computation graph node.
(Step 6.5) Increment cur_input to point to the next input port.
(Step 6.6) Repeat Steps 6.3-6.5 until all input ports of the current software computation graph node have been traversed.
(Step 6.7) Initialize the output node index cur_output of the current software computation graph node to 0.
(Step 6.8) If the cur_output-th output software computation graph node is also contained in the current subgraph, add edges between the hardware computation graph nodes required by the two software computation graph nodes.
(Step 6.9) Otherwise, add an edge between the write port of the off-chip storage communication module and the current node.
(Step 6.10) Increment cur_output to point to the next output port.
(Step 6.11) Repeat Steps 6.8-6.10 until all output ports of the current node have been traversed.
(Step 6.12) Increment cur_n to point to the next node of the software computation graph.
(Step 6.13) Repeat Steps 6.1-6.12 until all nodes in the software computation graph have been traversed.
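A sketch of the edge-generation rule, assuming hw_of maps (the id of) each software node to its hardware node and the off-chip storage communication module exposes one read port and one write port; the duplicate edges this produces for producer/consumer pairs inside one subgraph are exactly what the optimization module later prunes.

```python
def add_edges(subgraphs, hw_of):
    """subgraphs: lists of software-node dicts; hw_of: {id(sw_node): hw_node}."""
    DDR_RD, DDR_WR = "ddr-read-port", "ddr-write-port"  # off-chip module ports
    edges = []
    for sub in subgraphs:
        members = {id(n) for n in sub}
        for n in sub:
            for src in n["in"]:      # producer inside the subgraph, or off-chip
                producer = hw_of[id(src)] if id(src) in members else DDR_RD
                edges.append((producer, hw_of[id(n)]))
            for dst in n["out"]:     # consumer inside the subgraph, or off-chip
                if id(dst) in members:
                    edges.append((hw_of[id(n)], hw_of[id(dst)]))
                else:
                    edges.append((hw_of[id(n)], DDR_WR))
    return edges
```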
In the hardware computation graph optimization module, the optimization flow of the hardware computation graph is as follows:
(Step 7.0) Traverse all hardware nodes in the hardware computation graph.
(Step 7.1) Traverse each output edge of each hardware node.
(Step 7.2) Among all output edges of the current hardware node, identify edges with the same function.
(Step 7.3) Delete the output edges whose functions are duplicated.
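A minimal sketch of the pruning pass, assuming an edge's "function" can be keyed by its source and destination (the patent does not define the notion of function further):

```python
def prune_duplicate_edges(hw_nodes):
    """Drop output edges whose function repeats (Steps 7.0-7.3)."""
    for node in hw_nodes:
        seen, kept = set(), []
        for edge in node["out_edges"]:           # edge: {"src": ..., "dst": ...}
            key = (edge["src"], edge["dst"])     # stand-in for the edge's function
            if key not in seen:
                seen.add(key)
                kept.append(edge)                # keep the first edge per function
        node["out_edges"] = kept
```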

Claims (8)

1. A deep learning compiler based on a hardware computation graph is characterized by comprising a software computation graph generation module, a hardware computation graph generation module, a software computation graph optimization module, a hardware computation graph optimization module and a code generation module;
the software computation graph generation module is used for converting the deep learning model defined by the external framework into an internal software computation graph;
the software computation graph optimization module is used for optimizing the software computation graph;
the hardware computation graph generation module generates a basic node set under the guidance of the software computation graph, adjusts the node set by using greedy search until the computation requirements of the whole deep learning model are met, and finally generates the edges of the hardware computation graph;
the hardware computation graph optimization module is used for optimizing the hardware computation graph;
the code generation module is used for generating hardware codes and simulation codes corresponding to the deep learning model.
2. The hardware computation graph-based deep learning compiler according to claim 1, wherein in the software computation graph generation module, node types and edges in the software computation graph are defined according to hardware implementation efficiency; analyzing and converting the model file; the nodes in the software calculation graph are operation units of a deep learning model; edges in the software computation graph represent data passing between nodes.
3. The hardware computation graph-based deep learning compiler according to claim 1, wherein in the hardware computation graph generation module, node types in the hardware computation graph correspond to node types in the software computation graph one to one, but one node in the hardware computation graph represents one processing unit on the hardware platform, that is, the node and the edge represent a module and a hard wire inside a chip, respectively;
in an actual implementation, a multiplexer and a demultiplexer are respectively added at each input port and each output port of the node; the multiplexer is used for selecting the predecessor node of the current hardware node, and the demultiplexer is used for selecting the successor node of the current hardware node.
4. The hardware computation graph-based deep learning compiler of claim 1, wherein: nodes with more than 2 input ports in the software computation graph are each implemented by, and replaced with, a plurality of 2-input nodes; for nodes with more than 2 output ports, a broadcast node is added at the tail of the node; a broadcast node containing more than 2 output ports is replaced with a plurality of broadcast nodes having only 2 output ports; and the nodes are reordered based on the depth-first principle.
5. The hardware computation graph-based deep learning compiler of claim 1, wherein: the specific steps of generating the hardware computation graph according to the software computation graph are as follows: under the guidance of a software computation graph, a basic node set is generated first, then the node set is adjusted by using greedy search until the computation requirements required by the whole model are met, and finally edges of a hardware computation graph are generated.
6. The hardware computation graph-based deep learning compiler of claim 4, wherein: the specific steps of generating the basic node set are as follows: clustering the nodes according to the categories of the nodes in the software calculation graph and the number of the required resources on the chip; and preferentially allocating resources for the nodes with frequent occurrence.
7. The hardware computation graph-based deep learning compiler of claim 4, wherein: the specific steps of adjusting the node set by greedy search until the computation requirements of the whole model are met are as follows: inputting the software computation graph and the resource quantity of the target hardware platform; traversing the software computation graph nodes with a greedy search strategy to generate candidate sub-networks; computing the number and configuration of modules each sub-network requires under the existing hardware; computing the number and configuration of modules each sub-network lacks under the existing hardware; selecting the longest sub-network that does not exceed the resource quantity of the target hardware platform for output; and repeating the above steps until all nodes have been traversed.
8. The hardware computation graph-based deep learning compiler of claim 1, wherein: the specific steps of optimizing on the hardware computation graph are as follows: and traversing the edge of each node in the hardware computation graph, and cutting the edges with the same function in the hardware computation graph.
CN202210625568.7A (filed 2022-06-02, priority date 2022-06-02): Deep learning compiler based on hardware computation graph. Status: pending. Publication: CN114936015A (en)

Priority Applications (1)

• CN202210625568.7A (priority date 2022-06-02, filing date 2022-06-02): Deep learning compiler based on hardware computation graph


Publications (1)

CN114936015A, published 2022-08-23

Family

ID: 82865681

Family Applications (1)

• CN202210625568.7A (priority date 2022-06-02, filing date 2022-06-02): Deep learning compiler based on hardware computation graph (pending)

Country Status (1)

• CN: CN114936015A (en)


Cited By (4)

* Cited by examiner, † Cited by third party

• CN116306856A * (priority 2023-05-17, published 2023-06-23, 之江实验室): Deep learning model deployment method and device based on search
• CN116306856B * (priority 2023-05-17, published 2023-09-05, 之江实验室): Deep learning model deployment method and device based on search
• CN116301904A * (priority 2023-05-18, published 2023-06-23, 之江实验室): Operator optimization acceleration method and device for deep learning compiler
• CN116301904B * (priority 2023-05-18, published 2023-08-22, 之江实验室): Operator optimization acceleration method and device for deep learning compiler


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination