CN114936015A - Deep learning compiler based on hardware computation graph - Google Patents

Deep learning compiler based on hardware computation graph

Info

Publication number: CN114936015A
Application number: CN202210625568.7A (filed 2022-06-02 by Nanjing University; priority date 2022-06-02)
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 李武军, 李俊, 王炜
Current assignee: Nanjing University
Original assignee: Nanjing University
Prior art keywords: computation graph, hardware, software, node, graph
Legal status: Pending (an assumption, not a legal conclusion; no legal analysis has been performed)
Publication date: 2022-08-23

Classifications

    • G06F8/37 Compiler construction; Parser generation
    • G06F8/433 Dependency analysis; Data or control flow analysis
    • G06F8/443 Optimisation
    • G06F8/447 Target code generation
    • G06F8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/457 Communication
    • G06N3/10 Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a deep learning compiler based on a hardware computation graph. The compiler comprises a software computation graph generation module, a hardware computation graph generation module, a software computation graph optimization module, a hardware computation graph optimization module, and a code generation module. The software computation graph generation module converts a deep learning model defined by an external framework into an internal software computation graph. The hardware computation graph generation module generates a basic node set under the guidance of the software computation graph, adjusts the node set with greedy search until the computation requirements of the whole deep learning model are met, and finally generates the edges of the hardware computation graph. The code generation module generates the hardware code and simulation code corresponding to the deep learning model. The invention can deploy models with complex topologies efficiently and with high performance while reducing deployment cost, and can be applied to various hardware platforms such as CPUs, GPUs, FPGAs, and dedicated artificial-intelligence chips.

Description

Deep learning compiler based on hardware computation graph
Technical Field
The invention relates to a deep learning compiler based on a hardware computation graph. It belongs to the technical field of hardware resource utilization by deep learning compilers and realizes automatic deployment of deep learning models.
Background
In recent years, deep learning has been widely used in many scenarios, such as image recognition, object detection, and semantic segmentation. Efficient acceleration of deep learning models is therefore receiving increasing attention from both academia and industry, and automatically deploying models efficiently and with high performance has become especially important. Deep learning compilers were proposed for this purpose. The prevailing software-architecture scheme is to convert a model defined by an external framework into a compiler-internal representation, optimize it against the resource budget of the target hardware platform, and then generate hardware code.
However, as the application scenarios of deep learning keep expanding and the community keeps pursuing higher model accuracy, model topologies are becoming more and more complex. Existing compilers can offer only limited support for complex topologies, typically by resorting to hardware platforms with more, and more expensive, hardware resources, which greatly increases deployment cost.
Disclosure of Invention
The purpose of the invention is as follows: current deep learning compilers perform poorly in hardware resource utilization and can improve performance only by using hardware platforms with more, and more expensive, resources. As the complexity of model topologies keeps rising, such approaches cannot even guarantee that deployment succeeds. The invention provides a deep learning compiler based on a hardware computation graph that can deploy complex topology models with fewer hardware resources and effectively improves hardware resource utilization.
The technical scheme is as follows: a deep learning compiler based on a hardware computation graph comprises a software computation graph generation module, a hardware computation graph generation module, a software computation graph optimization module, a hardware computation graph optimization module and a code generation module;
the software computation graph generation module is used for converting the deep learning model defined by the external framework into an internal software computation graph;
the software computation graph optimization module is used for optimizing the software computation graph;
the hardware computation graph generation module generates a basic node set under the guidance of the software computation graph, adjusts the node set by greedy search until the computation requirements of the whole deep learning model are met, and finally generates the edges of the hardware computation graph;
the hardware computation graph optimization module is used for optimizing the hardware computation graph;
the code generation module is used for generating hardware codes and simulation codes corresponding to the deep learning model.
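The patent discloses no source code; as a reading aid, the following minimal Python sketch shows how the five modules could be chained into one compile flow. Every class, method, and return value here is hypothetical, not taken from the patent.

```python
class HardwareGraphCompiler:
    """Sketch of the five-module pipeline; each stage is a stub."""

    def __init__(self, target_resources: int):
        self.target_resources = target_resources  # total resources of the target platform

    # Stubs standing in for the modules described above (illustrative only).
    def generate_software_graph(self, model_file):
        return {"nodes": [], "source": model_file}

    def optimize_software_graph(self, sw_graph):
        return sw_graph

    def generate_hardware_graph(self, sw_graph):
        return {"nodes": [], "edges": []}

    def optimize_hardware_graph(self, hw_graph):
        return hw_graph

    def generate_code(self, sw_graph, hw_graph):
        return ("hardware_code", "simulation_code")

    def compile(self, model_file):
        sw = self.generate_software_graph(model_file)  # software computation graph generation
        sw = self.optimize_software_graph(sw)          # software computation graph optimization
        hw = self.generate_hardware_graph(sw)          # hardware computation graph generation
        hw = self.optimize_hardware_graph(hw)          # hardware computation graph optimization
        return self.generate_code(sw, hw)              # hardware code and simulation code
```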
In the software computation graph generation module, node types and edges in the software computation graph are defined according to hardware implementation efficiency, and the model file is parsed and converted. The nodes in the software computation graph are the operation units of a deep learning model and comprise operators such as padding, convolution, max pooling, average pooling, global pooling, addition, and concatenation; the edges in the software computation graph represent data passed between nodes.
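A minimal sketch of the node and edge representation this paragraph implies, assuming an adjacency-list layout; the OpType enum and SoftwareNode class are illustrative names, not from the patent.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List

class OpType(Enum):
    # The operator kinds named in this paragraph.
    PAD = auto()
    CONV = auto()
    MAX_POOL = auto()
    AVG_POOL = auto()
    GLOBAL_POOL = auto()
    ADD = auto()
    CONCAT = auto()
    DUPLICATE = auto()  # broadcast operator introduced by the optimization module

@dataclass
class SoftwareNode:
    op: OpType
    # Edges are the data passed between nodes, stored as predecessor/successor lists.
    inputs: List["SoftwareNode"] = field(default_factory=list)
    outputs: List["SoftwareNode"] = field(default_factory=list)
```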
In the hardware computation graph generation module, node types in the hardware computation graph correspond one-to-one to node types of the software computation graph, but one node in the hardware computation graph represents one processing unit on the hardware platform; that is, nodes and edges respectively represent modules and hard wiring inside a chip. In a practical implementation, a multiplexer and a demultiplexer are added at each input port and each output port of a node, respectively: the multiplexer selects the predecessor node of the current hardware node, and the demultiplexer selects its successor node. In the hardware computation graph generation module, software computation graph nodes are mapped to hardware computation graph nodes as follows. Based on the idea of a multi-core processor, the invention performs one-to-many mapping between the two. For a convolution node in the software computation graph, when no convolution node in the existing hardware computation graph can meet its computation requirement, the output-channel dimension of the convolution node's weights is partitioned (output-channel partitioning, OCP), and the weight data are distributed across several hardware nodes (nodes in the hardware computation graph) for computation. For a pooling node in the software computation graph, when no pooling node in the existing hardware computation graph can meet its computation requirement, the width dimension of the pooling node's input data is partitioned (input-width partitioning, IWP). Together with the case in which no node partitioning (NP) is required, 9 types of connection modes between hardware computation graph nodes are proposed, each written as predecessor → current:
NP → NP: neither the current software computation graph node nor its predecessor needs to be partitioned;
NP → IWP: the current node needs to be partitioned along the width dimension of its input data, while the predecessor does not need to be partitioned;
NP → OCP: the predecessor does not need to be partitioned, while the current node needs to be partitioned along the output-channel dimension of its weight data;
IWP → NP: the current node does not need to be partitioned, while the predecessor needs to be partitioned along the width dimension of its input data;
IWP → IWP: both the current node and the predecessor need to be partitioned along the width dimension of their input data;
IWP → OCP: the current node is partitioned along the output-channel dimension of its weight data, while the predecessor is partitioned along the width dimension of its input data;
OCP → NP: the current node does not need to be partitioned, while the predecessor needs to be partitioned along the output-channel dimension of its weight data;
OCP → IWP: the predecessor is partitioned along the output-channel dimension of its weight data, while the current node is partitioned along the width dimension of its input data;
OCP → OCP: both the current node and the predecessor need to be partitioned along the output-channel dimension of their weight data.
Each of the 9 connection modes is realized in 2 steps. For example, for OCP → IWP, an OCP → IWP back-end module is added to the output ports of all hardware modules required by the predecessor software computation graph node, and an OCP → IWP front-end module is added to the input ports of the hardware modules required by the current layer; the other connection modes add a front-end or back-end module at the input or output ports of the corresponding hardware modules by the same mechanism, as sketched below.
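The sketch below shows one way a compiler could name the back-end/front-end adapter pair for a given connection mode; the Partition enum, the naming scheme, and the return convention are all assumptions.

```python
from enum import Enum

class Partition(Enum):
    NP = "no partition"
    IWP = "input-width partition"
    OCP = "output-channel partition"

def adapter_names(pred: Partition, cur: Partition):
    """Return the (hypothetical) back-end and front-end adapter modules for one of
    the 9 connection modes, or None when NP -> NP needs only a plain hard wire."""
    if pred is Partition.NP and cur is Partition.NP:
        return None
    mode = f"{pred.name}->{cur.name}"
    # Back-end adapter goes on every output port of the predecessor's hardware
    # modules; front-end adapter goes on every input port of the current node's.
    return (f"{mode}-backend", f"{mode}-frontend")
```

For example, `adapter_names(Partition.OCP, Partition.IWP)` yields the back-end/front-end pair for the OCP → IWP mode described above.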
The specific steps of converting the model defined by the external framework into the internal software computation graph are as follows: input a trained model defined by an external framework and parse the operators in the model; then reorder the nodes based on the depth-first principle.
The software computation graph optimization module optimizes the software computation graph as follows: each node with more than 2 input ports in the software computation graph is implemented by, and replaced with, several 2-input nodes; a Duplicate operator for broadcasting data is appended to the tail of each node with more than 2 output ports; and each Duplicate operator with more than 2 output ports is implemented by, and replaced with, several 2-output Duplicate operators.
The steps of generating the hardware computation graph are specifically as follows. According to the types of nodes in the software computation graph and the amount of hardware resources they use, frequency statistics are computed over the software computation graph nodes, and hardware resources (e.g., memory size, computing resources) are allocated to nodes according to frequency. Then, based on the idea of greedy search and taking the total resources of the target hardware platform into account, a hardware computation graph node set that can meet the computation requirements of the whole model is generated under the guidance of the software computation graph; a hardware computation graph node is the hardware resource allocated to software computation graph nodes, represented as a data structure. Next, the edges of the hardware computation graph are generated from the software computation graph and the hardware node set; the edges reflect the interconnections between hardware computation graph nodes. When generating the hardware computation graph edges, the software computation graph is divided into several subgraphs, and data between subgraphs are communicated through off-chip memory; off-chip storage resources are allocated to the inter-subgraph data, with the allocation realized by dynamic programming. Each node in the hardware computation graph represents a processing unit, but on each run some processing units need to fetch data from off-chip memory.
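The patent states that off-chip allocation for inter-subgraph data is realized with dynamic programming but gives no recurrence; the sketch below substitutes a simple first-fit allocator over tensor live ranges just to make the interface concrete. The function name, tuple layout, and algorithm choice are all assumptions.

```python
def allocate_offchip(tensors):
    """Assign an off-chip byte offset to each inter-subgraph tensor.

    `tensors` is a list of (size, producer_subgraph, last_consumer_subgraph);
    two tensors may share memory only if their live ranges do not overlap.
    """
    placed = []                      # (offset, size, start, end) of placed tensors
    offsets = [0] * len(tensors)
    order = sorted(range(len(tensors)), key=lambda i: -tensors[i][0])  # big first
    for i in order:
        size, start, end = tensors[i]
        offset = 0
        for p_off, p_size, p_start, p_end in sorted(placed):
            live_overlap = not (end < p_start or p_end < start)
            if live_overlap and offset < p_off + p_size and offset + size > p_off:
                offset = p_off + p_size  # bump past the conflicting block
        placed.append((offset, size, start, end))
        offsets[i] = offset
    return offsets
```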
The specific steps of optimizing the hardware computation graph in the hardware computation graph optimization module are: traverse each edge in the hardware computation graph and prune the edges with duplicated functions.
Advantageous effects: compared with the prior art, the deep learning compiler based on the hardware computation graph can not only deploy models with complex topologies using fewer hardware resources, but also achieve better performance for models of low topological complexity. The invention can be applied to various hardware platforms such as CPUs, GPUs, FPGAs, and dedicated artificial-intelligence chips.
Drawings
FIG. 1 is a flow chart of the operation of an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the generation of a software computation graph according to an embodiment of the present invention;
FIG. 3 is a flow chart of the optimization of a software computation graph according to an embodiment of the present invention;
FIG. 4 is a flowchart of a first stage generation of a hardware computation graph according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a second stage of generation of a hardware computation graph according to an embodiment of the present invention;
FIG. 6 is an edge generation flowchart of a hardware computation graph according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, which is to be given the full breadth of the claims appended hereto.
When an FPGA is used as the target hardware platform, the deep learning compiler based on a hardware computation graph comprises a software computation graph generation module, a hardware computation graph generation module, a software computation graph optimization module, a hardware computation graph optimization module, and a code generation module. The working process is: software computation graph generation (FIG. 2), software computation graph optimization (FIG. 3), hardware computation graph generation (FIGS. 4-6), hardware computation graph optimization, and code generation.
The generation flow of the software computation graph realized by the software computation graph generation module is as follows:
(Step 2.0) Initialize the operator index cur in the model defined by the current external deep learning framework to 0.
(Step 2.1) Judge whether the compiler currently supports the type of the cur-th operator.
(Step 2.2) If the current operator is not supported, report an error and exit the program.
(Step 2.3) Otherwise, create a compiler-internal representation node (i.e., a node of the software computation graph) for the operator in the input model.
(Step 2.4) Increment cur to point to the next operator.
(Step 2.5) Repeat Steps 2.1-2.4 until all operators in the model have been traversed.
(Step 2.6) Under the guidance of the input model, determine the predecessor and successor nodes of each software computation graph node, thereby generating the edges of the software computation graph.
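A compact Python rendering of Steps 2.0-2.6, assuming the parsed model arrives as a list of operator dicts; the dict layout and field names are illustrative.

```python
def build_software_graph(model_ops, supported_ops):
    """model_ops: operators parsed from the externally defined model, in order;
    each is a dict like {"type": "conv", "input_ids": [0, 1]}."""
    nodes = []
    for op in model_ops:                                             # Steps 2.1-2.5
        if op["type"] not in supported_ops:
            raise SystemExit(f"unsupported operator: {op['type']}")  # Step 2.2
        nodes.append({"op": op["type"], "in": [], "out": []})        # Step 2.3
    for i, op in enumerate(model_ops):                   # Step 2.6: wire the edges
        for j in op.get("input_ids", []):
            nodes[i]["in"].append(nodes[j])
            nodes[j]["out"].append(nodes[i])
    return nodes
```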
In the software computation graph optimization module, the optimization flow of the software computation graph is as follows:
(Step 3.0) Initialize the operator index cur in the current software computation graph to 0.
(Step 3.1) Count the number n of input edges and the number m of output edges of the current node.
(Step 3.2) When n ≤ 2, do nothing.
(Step 3.3) When n > 2, split the input and replace the node with (n-1) 2-input nodes.
(Step 3.4) When m < 2, do nothing.
(Step 3.5) When m ≥ 2, insert a Duplicate operator at the tail of the current node; its input is the current node, its outputs are all output edges of the current node, and the original output edges of the current node are deleted.
(Step 3.6) Count the number k of output edges of the current Duplicate node.
(Step 3.7) When k = 2, do nothing.
(Step 3.8) When k > 2, replace it with (k-1) Duplicate nodes.
(Step 3.9) Increment cur to point to the next node.
(Step 3.10) Repeat Steps 3.1-3.9 until all nodes in the software computation graph have been traversed.
(Step 3.11) Reorder all nodes depth-first.
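A sketch of Steps 3.5-3.8 (the Duplicate broadcast tree), reusing the dict-based node layout from the sketch above; for brevity it omits rewiring the consumers' own input lists.

```python
def make_dup_tree(src, consumers):
    # Chain of 2-output Duplicates feeding `consumers` from `src`:
    # k consumers need exactly (k - 1) Duplicates.
    d = {"op": "Duplicate", "in": [src], "out": []}
    if len(consumers) <= 2:
        d["out"] = consumers
        return [d]
    rest = make_dup_tree(d, consumers[1:])
    d["out"] = [consumers[0], rest[0]]
    return [d] + rest

def binarize_fanout(node):
    """Give a node with m >= 2 consumers a broadcast tree of 2-output Duplicates."""
    consumers = list(node["out"])
    if len(consumers) < 2:
        return [node]                      # Step 3.4: nothing to do
    dups = make_dup_tree(node, consumers)  # Steps 3.5-3.8
    node["out"] = [dups[0]]                # node now feeds the first Duplicate
    return [node] + dups
```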
In the hardware computation graph generation module, the generation flow of the hardware computation graph comprises basic node generation, node set adjustment, edge generation, and off-chip storage allocation.
Basic node generation realizes the frequency-based allocation of hardware resources to nodes; its flow is as follows:
(Step 4.0) Initialize a frequency-statistics dictionary dict, whose keys are node descriptors and whose values are node frequencies (i.e., occurrence counts); initialize the current software node (a node in the software computation graph) index cur to 0.
(Step 4.1) Compute the on-chip memory size ram_size required by the cur-th software node.
(Step 4.2) Generate the key of the software node from ram_size and the type of the current software node.
(Step 4.3) Increment the frequency corresponding to that key.
(Step 4.4) Point cur to the next software node.
(Step 4.5) Repeat Steps 4.1-4.4 until all nodes in the software computation graph have been traversed.
(Step 4.6) Divide the counts in dict by the total number of nodes in the software computation graph to obtain the frequency dictionary dict_freq.
(Step 4.7) Initialize the current key index cur_k to 0.
(Step 4.8) Fetch the cur_k-th key k in dict_freq.
(Step 4.9) Multiply the frequency dict_freq[k] by the total resource count of the target hardware platform to obtain the hardware resources allocated to the current key.
(Step 4.10) Point cur_k to the next key.
(Step 4.11) Repeat Steps 4.8-4.10 until all keys in dict_freq have been traversed; all basic hardware nodes are thus obtained.
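Steps 4.0-4.11 amount to a frequency count over (node type, RAM size) keys followed by proportional resource allocation. A hedged sketch, with ram_size_of standing in for the unspecified per-node memory computation:

```python
from collections import Counter

def allocate_basic_nodes(sw_nodes, total_resources):
    """Cluster software nodes by (type, on-chip RAM need) and give each cluster a
    resource share proportional to its frequency. Node layout is illustrative."""
    def ram_size_of(node):              # stand-in for Step 4.1
        return node.get("ram", 0)

    freq = Counter((n["op"], ram_size_of(n)) for n in sw_nodes)  # Steps 4.1-4.5
    total = sum(freq.values())
    return {key: count / total * total_resources                 # Steps 4.6-4.11
            for key, count in freq.items()}
```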
The node set adjustment flow is as follows:
(Step 5.0) Initialize the current hardware computation graph node index cur to 0.
(Step 5.1) Starting from the current node, generate a set number (at most 20, to keep the program runtime bounded) of software computation subgraph candidates (i.e., consecutive node sets in the software computation graph).
(Step 5.2) According to the 9 connection modes between nodes proposed by the invention, compute the types (e.g., convolution, pooling), configurations, and numbers of hardware modules required by each subgraph candidate under the current hardware computation graph node set.
(Step 5.3) Using the results of Step 5.2, subtract the current hardware computation graph node set from the node set actually required by the input model to obtain the difference between the two node sets.
(Step 5.4) Among the candidates, select the longest subgraph that does not exceed the total resources of the target hardware platform.
(Step 5.5) Advance cur to the starting point of the next iteration.
(Step 5.6) Supplement the existing hardware node set: construct new hardware computation graph nodes according to the hardware nodes required by the subgraph candidate selected in Step 5.4 and add them to the existing set.
(Step 5.7) Repeat Steps 5.1-5.6 until the whole software computation graph has been traversed.
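A sketch of the greedy adjustment in Steps 5.0-5.7, assuming the hardware shortfall of a candidate subgraph can be summarized as a single number by a caller-supplied cost_of function (in the patent this is a set difference over module types, configurations, and counts):

```python
def adjust_node_set(sw_nodes, hw_nodes, budget, cost_of, max_candidates=20):
    """cost_of(subgraph, hw_nodes) -> extra resources the subgraph still needs."""
    cur = 0
    while cur < len(sw_nodes):
        best = None
        # Candidate subgraphs: consecutive runs starting at cur (Step 5.1).
        for length in range(1, min(max_candidates, len(sw_nodes) - cur) + 1):
            candidate = sw_nodes[cur:cur + length]
            extra = cost_of(candidate, hw_nodes)       # Steps 5.2-5.3
            if extra <= budget:
                best = (length, extra)       # longest fitting one wins (Step 5.4)
        if best is None:
            raise RuntimeError("no candidate fits the resource budget")
        length, extra = best
        hw_nodes.append(("new-hw-nodes-for", sw_nodes[cur:cur + length]))  # Step 5.6
        budget -= extra
        cur += length                                  # Step 5.5
    return hw_nodes
```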
The edge generation flow is as follows:
(Step 6.0) Initialize the current subgraph index cur_g in the software computation graph to 0 and the node index cur_n within the current subgraph to 0.
(Step 6.1) Find the hardware computation graph node set required by the cur_n-th node of the cur_g-th subgraph.
(Step 6.2) Initialize the input node index cur_input of the current software node to 0.
(Step 6.3) If the cur_input-th input software computation graph node is also contained in the current subgraph, add edges between the hardware computation graph nodes required by the two software computation graph nodes.
(Step 6.4) Otherwise, add an edge between the read port of the FPGA's off-chip storage communication module and the hardware computation graph nodes required by the current software computation graph node.
(Step 6.5) Increment cur_input to point to the next input port.
(Step 6.6) Repeat Steps 6.3-6.5 until all input ports of the current software computation graph node have been traversed.
(Step 6.7) Initialize the output node index cur_output of the current software computation graph node to 0.
(Step 6.8) If the cur_output-th output software computation graph node is also contained in the current subgraph, add edges between the hardware computation graph nodes required by the two software computation graph nodes.
(Step 6.9) Otherwise, add an edge between the write port of the off-chip storage communication module and the current node.
(Step 6.10) Increment cur_output to point to the next output port.
(Step 6.11) Repeat Steps 6.8-6.10 until all output ports of the current node have been traversed.
(Step 6.12) Increment cur_n to point to the next node of the software computation graph.
(Step 6.13) Repeat Steps 6.1-6.12 until all nodes in the software computation graph have been traversed.
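A sketch of the edge-generation rule, assuming hw_of maps (the id of) each software node to its hardware node and the off-chip storage communication module exposes one read port and one write port; the duplicate edges this produces for producer/consumer pairs inside one subgraph are exactly what the optimization module later prunes.

```python
def add_edges(subgraphs, hw_of):
    """subgraphs: lists of software-node dicts; hw_of: {id(sw_node): hw_node}."""
    DDR_RD, DDR_WR = "ddr-read-port", "ddr-write-port"  # off-chip module ports
    edges = []
    for sub in subgraphs:
        members = {id(n) for n in sub}
        for n in sub:
            for src in n["in"]:      # producer inside the subgraph, or off-chip
                producer = hw_of[id(src)] if id(src) in members else DDR_RD
                edges.append((producer, hw_of[id(n)]))
            for dst in n["out"]:     # consumer inside the subgraph, or off-chip
                if id(dst) in members:
                    edges.append((hw_of[id(n)], hw_of[id(dst)]))
                else:
                    edges.append((hw_of[id(n)], DDR_WR))
    return edges
```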
In the hardware computation graph optimization module, the optimization flow of the hardware computation graph is as follows:
(Step 7.0) Traverse all hardware nodes in the hardware computation graph.
(Step 7.1) Traverse each output edge of each hardware node.
(Step 7.2) Among all output edges of the current hardware node, identify edges with the same function.
(Step 7.3) Delete the output edges whose functions are duplicated.
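A minimal sketch of the pruning pass, assuming an edge's "function" can be keyed by its source and destination (the patent does not define the notion of function further):

```python
def prune_duplicate_edges(hw_nodes):
    """Drop output edges whose function repeats (Steps 7.0-7.3)."""
    for node in hw_nodes:
        seen, kept = set(), []
        for edge in node["out_edges"]:           # edge: {"src": ..., "dst": ...}
            key = (edge["src"], edge["dst"])     # stand-in for the edge's function
            if key not in seen:
                seen.add(key)
                kept.append(edge)                # keep the first edge per function
        node["out_edges"] = kept
```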

Claims (8)

1. A deep learning compiler based on a hardware computation graph is characterized by comprising a software computation graph generation module, a hardware computation graph generation module, a software computation graph optimization module, a hardware computation graph optimization module and a code generation module;
the software computation graph generation module is used for converting the deep learning model defined by the external framework into an internal software computation graph;
the software computation graph optimization module is used for optimizing the software computation graph;
the hardware computation graph generation module generates a basic node set under the guidance of the software computation graph, adjusts the node set by using greedy search until the computation requirements of the whole deep learning model are met, and finally generates the edges of the hardware computation graph;
the hardware computation graph optimization module is used for optimizing the hardware computation graph;
the code generation module is used for generating hardware codes and simulation codes corresponding to the deep learning model.
2. The hardware computation graph-based deep learning compiler according to claim 1, wherein in the software computation graph generation module, node types and edges in the software computation graph are defined according to hardware implementation efficiency; analyzing and converting the model file; the nodes in the software calculation graph are operation units of a deep learning model; edges in the software computation graph represent data passing between nodes.
3. The hardware computation graph-based deep learning compiler according to claim 1, wherein in the hardware computation graph generation module, node types in the hardware computation graph correspond to node types in the software computation graph one to one, but one node in the hardware computation graph represents one processing unit on the hardware platform, that is, the node and the edge represent a module and a hard wire inside a chip, respectively;
in an actual implementation, a multiplexer and a demultiplexer are respectively added at each input port and each output port of the node; the multiplexer is used for selecting the predecessor node of the current hardware node, and the demultiplexer is used for selecting the successor node of the current hardware node.
4. The hardware computation graph-based deep learning compiler of claim 1, wherein: nodes with more than 2 input ports in the software computation graph are each implemented by, and replaced with, a plurality of 2-input nodes; for nodes with more than 2 output ports, a broadcast node is added at the tail of the node; a broadcast node containing more than 2 output ports is replaced with a plurality of broadcast nodes having only 2 output ports; and the nodes are reordered based on the depth-first principle.
5. The hardware computation graph-based deep learning compiler of claim 1, wherein: the specific steps of generating the hardware computation graph according to the software computation graph are as follows: under the guidance of a software computation graph, a basic node set is generated first, then the node set is adjusted by using greedy search until the computation requirements required by the whole model are met, and finally edges of a hardware computation graph are generated.
6. The hardware computation graph-based deep learning compiler of claim 4, wherein: the specific steps of generating the basic node set are as follows: clustering the nodes according to the categories of the nodes in the software calculation graph and the number of the required resources on the chip; and preferentially allocating resources for the nodes with frequent occurrence.
7. The hardware computation graph-based deep learning compiler of claim 4, wherein: the specific steps of adjusting the node set by greedy search until the computation requirements of the whole model are met are as follows: inputting the software computation graph and the resource quantity of the target hardware platform; traversing the software computation graph nodes with a greedy search strategy to generate candidate sub-networks; computing the number and configuration of modules each sub-network requires under the existing hardware; computing the number and configuration of modules each sub-network lacks under the existing hardware; selecting the longest sub-network that does not exceed the resource quantity of the target hardware platform for output; and repeating the above steps until all nodes have been traversed.
8. The hardware computation graph-based deep learning compiler of claim 1, wherein: the specific steps of optimizing on the hardware computation graph are as follows: and traversing the edge of each node in the hardware computation graph, and cutting the edges with the same function in the hardware computation graph.
CN202210625568.7A (filed 2022-06-02, priority date 2022-06-02): Deep learning compiler based on hardware computation graph. Status: pending. Publication: CN114936015A (en)

Priority Applications (1)

• CN202210625568.7A (priority date 2022-06-02, filing date 2022-06-02): Deep learning compiler based on hardware computation graph


Publications (1)

CN114936015A, published 2022-08-23

Family

ID: 82865681

Family Applications (1)

• CN202210625568.7A (priority date 2022-06-02, filing date 2022-06-02): Deep learning compiler based on hardware computation graph (pending)

Country Status (1)

• CN: CN114936015A (en)


Cited By (4)

* Cited by examiner, † Cited by third party

• CN116306856A * (priority 2023-05-17, published 2023-06-23, 之江实验室): Deep learning model deployment method and device based on search
• CN116306856B * (priority 2023-05-17, published 2023-09-05, 之江实验室): Deep learning model deployment method and device based on search
• CN116301904A * (priority 2023-05-18, published 2023-06-23, 之江实验室): Operator optimization acceleration method and device for deep learning compiler
• CN116301904B * (priority 2023-05-18, published 2023-08-22, 之江实验室): Operator optimization acceleration method and device for deep learning compiler


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination