CN113934410A - Multi-hardware-target deep model optimized deployment framework supporting custom operators

Multi-hardware-target deep model optimized deployment framework supporting custom operators

Info

Publication number
CN113934410A
CN113934410A (application CN202111216615.4A)
Authority
CN
China
Prior art keywords
operator
hardware
unit
graph
optimization
Prior art date
Legal status
Pending
Application number
CN202111216615.4A
Other languages
Chinese (zh)
Inventor
姜宏旭
东东
胡宗琦
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202111216615.4A
Publication of CN113934410A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/30: Creation or generation of source code
    • G06F 8/37: Compiler construction; parser generation
    • G06F 8/40: Transformation of program code
    • G06F 8/41: Compilation
    • G06F 8/44: Encoding
    • G06F 8/443: Optimisation
    • G06F 8/60: Software deployment

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a multi-hardware-target deep model optimized deployment architecture supporting custom operators. A front-end import module converts a deep learning model file into a Relay computation graph representation; an operator conversion module converts the Relay OPs in that representation into hardware OPs and outputs a Relay computation graph representation carrying hardware OPs; a model optimization module performs graph optimization on the computation graph in that representation and outputs a Relay representation carrying the optimized computation graph and the optimized hardware OPs; a data flow optimization module performs hardware-aware optimization on the optimized computation graph to form a computation graph execution flow; an operator optimization module performs multi-modal representation and automatic tuning on the optimized hardware OPs and outputs operator code; and a runtime module loads the corresponding multi-hardware compilation library, executes the computation graph execution flow and the operator code, and generates a deployment file.

Description

Multi-hardware-target deep model optimized deployment framework supporting custom operators
Technical Field
The invention relates to the technical field of deep learning model and custom operator deployment, and in particular to a multi-hardware-target deep model optimized deployment framework supporting custom operators.
Background
In recent years, deep learning has been widely applied and has transformed technology across industries, including financial risk control, image recognition, and autonomous driving. Applying deep learning involves two main steps: (1) model training: fitting data with an algorithm and persisting a reusable model; (2) model deployment: exposing the model as an API (application programming interface) for other application systems to call. The former is studied intensively by academia, while the latter is an important and complicated link in industrial production.
With the rapid development of artificial intelligence, new algorithm models and structures keep emerging, and these new algorithms often rely on purpose-designed operators to achieve better accuracy. However, existing deployment architectures have the following problems and shortcomings:
first, model files from existing frameworks are mutually incompatible; second, the optimization pipeline does not optimize operators and computation graph data flow independently, so optimization is insufficient; third, the degree of coupling with the hardware is low, and model optimization is not adapted to the available hardware resources; fourth, few hardware targets are supported at runtime, forcing users to integrate third-party hardware deployment tool libraries by hand; fifth, custom operators are not supported, so users must hand-write operators into the model, extension is difficult, and operator execution is inefficient.
Therefore, how to provide a multi-hardware-target deep model optimized deployment architecture supporting custom operators is an urgent problem for those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a multi-hardware-target deep model optimized deployment architecture supporting custom operators, which realizes lightweight deployment, avoids a convoluted engineering process, shortens the development cycle, supports custom operators, and supports mainstream domestic and international embedded processing units such as FPGAs, Cambricon NPUs, and Huawei GPUs.
In order to achieve the purpose, the invention adopts the following technical scheme:
the multi-hardware target depth model optimized deployment architecture supporting the custom operator sequentially comprises the following steps: the system comprises a front end import module, an operator conversion module, a model optimization module, a data stream optimization module, an operator optimization module and a runtime module;
the front-end import module is used for converting a deep learning model file into a Relay computation graph representation;
the operator conversion module is used for converting the Relay OPs in the Relay computation graph into hardware OPs and outputting a Relay computation graph representation carrying hardware OPs;
the model optimization module is used for performing graph optimization on the computation graph in the Relay computation graph representation carrying hardware OPs, reducing memory accesses and invalid repeated computation during computation graph execution through dynamic and static optimization, and outputting the optimized computation graph and the optimized hardware OPs;
the data flow optimization module is used for optimizing the optimized computation graph representation to form a computation graph execution flow;
the operator optimization module is used for performing multi-modal representation and automatic tuning on the optimized hardware OPs and outputting operator code;
and the runtime module is used for loading the hardware operators corresponding to the computation graph execution flow and the operator code, and generating a deployment file.
Preferably, the front-end import module comprises a model access unit, a PyTorch-to-Relay unit, an ONNX-to-Relay unit, and a TensorFlow-to-Relay unit;
the model access unit is used for receiving a deep learning model file, which comprises hardware information and a network model;
the PyTorch-to-Relay, ONNX-to-Relay, and TensorFlow-to-Relay units are used for converting network models of different forms into a Relay computation graph representation.
Preferably, the operator conversion module comprises an operator conversion unit, a Relay-OP-to-hardware-OP unit, and a multi-modal operator repository;
the operator conversion unit is used for receiving the Relay computation graph representation;
the Relay-OP-to-hardware-OP unit is used for converting the Relay OPs in the Relay computation graph representation into hardware OPs based on the stored operator implementation modes;
the multi-modal operator repository is used for storing the operator implementation modes supported on a specified hardware platform.
Preferably, the model optimization module comprises a model optimization unit, a graph anesthesia Pass unit, a graph dissection Pass unit, a graph transplantation Pass unit, and a graph stitching Pass unit;
the model optimization unit is used for receiving the Relay computation graph representation carrying hardware OPs;
the graph anesthesia Pass unit is used for performing a post-order traversal of the Relay computation graph carrying hardware OPs and replacing each constant-foldable subgraph with a constant node;
the graph dissection Pass unit is used for performing inline optimization and deleting useless nodes;
the graph transplantation Pass unit is used for iterating over the dissected computation graph;
and the graph stitching Pass unit is used for identifying common subexpressions with a hash table, merging their topological structures, and outputting the optimized computation graph and the optimized hardware OPs.
Preferably, the data flow optimization module comprises a data flow optimization unit, a heterogeneous parallel unit, and a memory multiplexing unit;
the data flow optimization unit is used for receiving the optimized computation graph;
the heterogeneous parallel unit and the memory multiplexing unit are used for efficiently encoding convolution kernels and activation values in the data flow of the optimized computation graph, improving the efficiency of memory access and model inference under different hardware constraints.
Preferably, the operator optimization module comprises an operator optimization unit, a multi-modal preference unit, an operator substitution unit, an operator compilation unit, an automatic tuning unit, and a code generation unit;
the operator optimization unit is used for receiving the optimized hardware OPs;
the multi-modal preference unit is used for selecting an implementation for each optimized hardware OP: if a corresponding operator is integrated in the high-performance operator library, operator substitution is invoked, otherwise operator compilation is invoked;
the operator substitution unit is used for converting the optimized hardware OP into a library operator;
the operator compilation unit is used for compiling the optimized hardware OP;
the automatic tuning unit is used for automatically tuning the compiled operator, realizing automatic loop tiling, fusion, and reordering;
and the code generation unit is used for generating code for the tuned operator and outputting operator code.
Preferably, the runtime module comprises a runtime unit, a device/resource management unit, a computation graph executor, a high-performance operator library, and a compilation library;
the runtime unit is used for receiving the computation graph execution flow and the operator code;
the device/resource management unit is used for managing the computing resources, memory resources, and execution streams on the device;
the computation graph executor is used for controlling the execution of the computation graph;
the high-performance operator library is used for storing hardware operators;
and the compilation library is used for invoking the hardware operators corresponding to the computation graph execution flow and the operator code, deploying code onto the corresponding hardware, and generating the deployment file for the corresponding hardware unit.
Preferably, the system further comprises a model operator registration interface, a Relay interface, a hardware operator registration interface, a hardware operator DSL implementation registration interface, and a hardware operator library implementation registration interface.
Compared with the prior art, the multi-hardware-target deep model optimized deployment architecture supporting custom operators disclosed by the invention has the following advantages:
1) Unified model representation. By mapping intelligent models onto hardware resource elements, the architecture resolves the convoluted engineering process that multi-framework model support imposes on the inference side, such as keeping algorithm effects consistent across frameworks and the coupling of subsequent optimization to each framework, which otherwise make the deployment process long and hard to close the loop;
2) Hardware-aware automatic network tuning. Through precise adaptation of the model to hardware resources and collaborative, mutually aware optimization and compilation of the model and the hardware architecture, automatic development capability is provided, greatly reducing labor cost and shortening the development cycle;
3) Automatic compilation optimization of custom OPs. Beyond the OPs of conventional deep learning models, algorithms keep evolving; the invention realizes automatic compilation, optimization, and extension for newly developed operator OPs, greatly reducing labor cost, improving development efficiency, and shortening the development cycle;
4) Platform-independent inference runtime. The invention separates the computation graph execution flow from the inference runtime, realizing lightweight deployment. This solves the poor platform portability, maintainability, and reliability caused by the strong coupling of computation and execution in previous platforms' inference runtimes. Meanwhile, deployment tool libraries for various domestic and international embedded hardware platforms are integrated, supporting deep model deployment on multiple kinds of hardware;
5) Operators are optimized relatively independently of the computation graph data flow. This ensures thorough model optimization, covering both the model structure and the data computation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a schematic diagram of a multi-hardware target depth model optimized deployment architecture supporting custom operators.
Fig. 2 is a schematic diagram of a registration interface.
FIG. 3 is a hardware topology diagram of the MLU220 platform.
FIG. 4 illustrates the internal structure of the IPU calculation module of the MLU 220.
FIG. 5 is a schematic diagram of the position of the ARM and MicroBlaze processors on the Zynq chip.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a multi-hardware-target deep model optimized deployment architecture supporting custom operators, which, as shown in FIG. 1, comprises six modules:
(I) Front-end import module.
This module supports access of mainstream deep learning framework models such as ONNX, PyTorch, and TensorFlow, converting existing deep learning model files into a unified Relay computation graph representation (a sketch of this import step follows the unit list below).
It specifically comprises: a model access unit, a PyTorch-to-Relay unit, an ONNX-to-Relay unit, and a TensorFlow-to-Relay unit;
the model access unit is used for receiving a deep learning model file, which comprises hardware information and a network model;
the PyTorch-to-Relay, ONNX-to-Relay, and TensorFlow-to-Relay units are used for converting network models of different forms into a Relay computation graph representation.
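Since the framework represents all imported models in Relay, the import step can be pictured with Apache TVM's public Relay front-end API. The following is a minimal sketch under that assumption; the patent publishes no code, and the file name and input shapes are placeholders.

```python
# Minimal sketch of the front-end import step, assuming the framework builds
# on Apache TVM's Relay importers (an assumption; the patent publishes no code).
import onnx
from tvm import relay

def import_to_relay(model_path, input_shapes):
    """Convert an ONNX model file into a unified Relay computation graph."""
    onnx_model = onnx.load(model_path)                  # parse the serialized model
    mod, params = relay.frontend.from_onnx(onnx_model,  # Relay IRModule
                                           shape=input_shapes)
    return mod, params                                  # graph + trained weights

# Example (hypothetical file and shape):
# mod, params = import_to_relay("resnet18.onnx", {"input": (1, 3, 224, 224)})
```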
(II) Operator conversion module.
This module converts the operators ("Relay OPs", the operator part of Relay) in the Relay computation graph representation output by the previous layer into hardware OPs. To realize automatic conversion from Relay OPs to hardware OPs, a multi-modal operator repository for the hardware device to be deployed must be built in advance. The multi-modal operator repository enumerates the operator implementation modes supported on the specified hardware platform; more specifically, the operator conversion module is responsible for converting each Relay OP into a hardware OP supported on that platform and outputting a Relay computation graph representation carrying hardware OPs (a sketch of this conversion follows the unit list below).
It specifically comprises: an operator conversion unit, a Relay-OP-to-hardware-OP unit, and a multi-modal operator repository;
the operator conversion unit is used for receiving the Relay computation graph representation;
the Relay-OP-to-hardware-OP unit is used for converting the Relay OPs in the Relay computation graph representation into hardware OPs based on the stored operator implementation modes;
the multi-modal operator repository is used for storing the operator implementation modes supported on the specified hardware platform.
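To make the repository lookup concrete, here is a hypothetical mock-up of the Relay-OP-to-hardware-OP conversion; the dictionary layout and all names (HW_OP_REPO, convert_op) are our illustration, not the patent's data structures.

```python
# Hypothetical illustration of operator conversion: the multi-modal operator
# repository is modeled as a dictionary enumerating, per Relay OP, the
# implementation modes available on the target platform.
HW_OP_REPO = {
    "nn.conv2d": ["vendor_library", "dsl_autogen"],  # two modes available
    "nn.relu":   ["vendor_library"],
}

def convert_op(relay_op_name, platform_repo):
    """Map one Relay OP to a hardware OP descriptor, or flag it as custom."""
    modes = platform_repo.get(relay_op_name)
    if not modes:
        # absent from the repository: must arrive via a registration interface
        return {"op": relay_op_name, "modes": [], "custom": True}
    return {"op": relay_op_name, "modes": modes, "custom": False}
```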
(III) Model optimization module.
This module optimizes the Relay computation graph carrying hardware OPs.
It specifically comprises: a model optimization unit, a graph anesthesia Pass unit, a graph dissection Pass unit, a graph transplantation Pass unit, and a graph stitching Pass unit;
the model optimization unit is used for receiving the Relay computation graph representation carrying hardware OPs;
graph anesthesia Pass unit: performs a post-order traversal of the computation graph; a node is constant-foldable when the node itself and all of its child nodes can be folded into constants, and each constant-foldable subgraph is replaced with a single constant node;
graph dissection Pass unit: performs inline optimization, deletes useless nodes, and expands functional nodes;
graph transplantation Pass unit: to save memory, the transplantation operation releases the computation graph's memory after each iteration. It adds an identifier during the first backward pass, finds the network's final output gradient using the chain rule, and then optimizes the network;
graph stitching Pass unit: identifies common subexpressions using a hash table and merges their topological structures.
By applying horizontal fusion, vertical fusion, dead node elimination, constant preprocessing, coefficient folding, and similar operations to each graph unit of the Relay computation graph, memory accesses and invalid repeated computation during graph execution are reduced, and the execution efficiency of the model on the device is improved. Each Pass (graph anesthesia, graph dissection, graph transplantation, graph stitching) inherits base-class graph-traversal and graph-rewriting functionality and implements a graph adjustment strategy for its optimization purpose. Newly added graph optimization Passes are categorized by hardware platform, and each hardware platform can enable or disable the relevant Passes according to its actual conditions. The sketch below illustrates one possible composition of such Passes.
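As one reading of how these Passes map onto stock Relay transforms (graph anesthesia roughly corresponds to constant folding, graph dissection to dead-code elimination, graph stitching to common-subexpression elimination), a sketch assuming Apache TVM:

```python
# Sketch of a Pass pipeline under our reading of the four Pass units; the
# pairing of patent Pass names with TVM transforms is an assumption.
import tvm
from tvm import relay

def optimize_graph(mod):
    seq = tvm.transform.Sequential([
        relay.transform.FoldConstant(),            # "graph anesthesia": constant folding
        relay.transform.DeadCodeElimination(),     # "graph dissection": useless nodes
        relay.transform.EliminateCommonSubexpr(),  # "graph stitching": merge common exprs
        relay.transform.FuseOps(fuse_opt_level=2), # horizontal/vertical fusion
    ])
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)
```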
(IV) Data flow optimization module.
This module optimizes the computation graph representation part of the Relay representation to form a computation graph execution flow.
It specifically comprises: a data flow optimization unit, a heterogeneous parallel unit, and a memory multiplexing unit;
the data flow optimization unit is used for receiving the optimized computation graph;
the heterogeneous parallel unit and the memory multiplexing unit efficiently encode convolution kernels and activation values in the data flow of the optimized computation graph, improving the efficiency of memory access and model inference under different hardware constraints.
The module assigns compute units to different network segments and specifies the multiplexing relationships of the various computation memories during inference, so that the model can work normally within the limited resources of the device after deployment. A sketch of the memory-multiplexing idea follows.
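The sketch below illustrates memory multiplexing with a greedy lifetime-based buffer planner; the interval format and every name in it are our own illustration, not the patent's algorithm.

```python
# Hypothetical greedy memory-multiplexing planner: tensors whose lifetimes
# do not overlap share a buffer. All names and formats are ours.
def plan_memory(lifetimes):
    """lifetimes: list of (tensor, first_use, last_use, size); returns tensor -> buffer id."""
    buffers = []      # per buffer: (last_step_in_use, capacity)
    assignment = {}
    for name, start, end, size in sorted(lifetimes, key=lambda t: t[1]):
        for i, (busy_until, cap) in enumerate(buffers):
            if busy_until < start and cap >= size:   # lifetime gap: reuse this buffer
                buffers[i] = (end, cap)
                assignment[name] = i
                break
        else:                                        # no reusable buffer: allocate new
            buffers.append((end, size))
            assignment[name] = len(buffers) - 1
    return assignment

# Example: a conv output and a later activation with disjoint lifetimes share buffer 0:
# plan_memory([("conv_out", 0, 2, 1024), ("act", 3, 5, 512)]) -> {"conv_out": 0, "act": 0}
```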
(V) Operator optimization module.
The optimized hardware OPs in the Relay representation are input into this module, which mainly applies multi-modal operator representation and automatic tuning to automatically select the best execution mode for each operator on the specified hardware. Meanwhile, by instantiating efficient modules, individual operators or fixed structures formed by several operators are replaced, improving execution efficiency on a specific chip. Operator optimization generally has two implementation paths: 1. automatic code generation and optimization with a compiler; 2. compiling operators against a high-performance operator library developed or integrated in advance. For the first option, automatic code optimization under hardware constraints is realized: automatic loop tiling, fusion, and reordering, optimal adjustment of the multi-level cache, and automatic substitution of vectorized and tensorized instruction sets. For the second option, the high-performance operator library must be integrated inside the runtime module.
It specifically comprises: an operator optimization unit, a multi-modal preference unit, an operator substitution unit, an operator compilation unit, an automatic tuning unit, and a code generation unit;
the operator optimization unit is used for receiving the optimized hardware OPs;
the multi-modal preference unit is used for selecting an implementation for each optimized hardware OP: if a corresponding operator is integrated in the high-performance operator library, operator substitution is invoked, otherwise operator compilation is invoked (a sketch of this selection logic follows the list);
the operator substitution unit is used for converting the optimized hardware OP into a library operator;
the operator compilation unit is used for compiling the optimized hardware OP;
the automatic tuning unit is used for automatically tuning the compiled operator, realizing automatic loop tiling, fusion, and reordering;
the code generation unit is used for generating code for the tuned operator and outputting operator code.
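A hypothetical sketch of the multi-modal preference decision referenced above; the function and the profile-keyed lookup are our illustration of the library-versus-compile choice, not the patent's code.

```python
# Hypothetical selection logic: prefer a high-performance library kernel when
# one exists; with profiling data, pick whichever path measured faster.
def select_implementation(hw_op, hp_library, profile=None):
    op = hw_op["op"]
    in_library = op in hp_library
    if in_library and profile:
        # both paths exist: choose the lower measured latency
        return min(("library", "compile"),
                   key=lambda path: profile.get((op, path), float("inf")))
    return "library" if in_library else "compile"

# select_implementation({"op": "nn.conv2d"}, {"nn.conv2d"})       -> "library"
# select_implementation({"op": "my_custom_gelu"}, {"nn.conv2d"})  -> "compile"
```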
(VI) Runtime module.
The inputs to this module are the output of the fourth module (the computation graph execution flow) and the output of the fifth module (the automatically generated operator code). Its primary goal is to enable lightweight deployment on devices (a build-and-run sketch follows the unit list below).
It specifically comprises a runtime unit, a device/resource management unit, a computation graph executor, a high-performance operator library, and a compilation library;
the runtime unit is used for receiving the computation graph execution flow and the operator code;
the device/resource management unit is used for managing the computing resources, memory resources, and execution streams on the device;
the computation graph executor is used for controlling the execution of the computation graph;
the high-performance operator library is used for storing hardware operators;
and the compilation library is used for invoking the hardware operators corresponding to the computation graph execution flow and the operator code, deploying code onto the corresponding hardware, and generating the deployment file for the corresponding hardware unit.
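Assuming a TVM-style runtime flow, the build-and-run step could look like the following sketch; the target string, exported file name, and input shape are placeholder assumptions.

```python
# Sketch of the runtime step, assuming Apache TVM's graph executor; the
# exported "deploy.so" stands in for the patent's deployment file.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def build_and_run(mod, params, target="llvm",
                  input_name="input", shape=(1, 3, 224, 224)):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)  # compile the graph
    lib.export_library("deploy.so")                           # deployment artifact
    dev = tvm.device(target, 0)
    rt = graph_executor.GraphModule(lib["default"](dev))      # instantiate executor
    rt.set_input(input_name, np.random.rand(*shape).astype("float32"))
    rt.run()
    return rt.get_output(0).numpy()
```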
The multi-hardware-target deep model optimized deployment architecture supporting custom operators supports mainstream domestic and international embedded processing units such as FPGAs, Cambricon NPUs, and Huawei GPUs.
The deployment tool chain supports the OPs of conventional deep learning models, but algorithms are continuously upgraded and improved, and many companies and research institutions develop new OPs, so the tool chain must expose custom operator development externally. To support custom operator development, the tool chain of the invention provides five registration interfaces at different levels, structured as shown in FIG. 2: a model operator registration interface, a Relay interface, a hardware operator registration interface, a hardware operator DSL implementation registration interface, and a hardware operator library implementation registration interface.
For implementing a custom OP, the schemes for calling the registration interfaces under different conditions are shown in Table 1 below. Through combined calls of these five registration interfaces, custom operators in a network can be freely extended and implemented. When a new hardware OP needs to be developed, several implementation schemes are available for the user to choose:
1) when high-performance implementation code already exists for the operator, it can be provided through the operator library registration interface. The user registers the relevant implementation code into the runtime high-performance operator library; when inference executes on the chip, the original optimization effect of that tuned code is preserved;
2) when the operator needs to be optimized from scratch, the hardware operator DSL implementation registration interface allows rapid operator development. Using the compiler's automatic tuning and code generation inside the tool chain, this interface automatically translates the described custom operator logic into efficient hardware implementation code;
3) users seeking extreme performance may use both registration interfaces simultaneously. The operator optimization module in the framework of the invention evaluates both schemes against profile information and automatically selects the best implementation during deployment (a sketch of the registration idea follows Table 1).
Table 1 scheme for operator to call register interface
(Table 1 is reproduced only as an image in the original publication.)
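As an illustration of the registration idea only (the patent defines the interfaces but not their code), here is a hypothetical decorator-based registry in which a custom operator can be supplied either as a library kernel or as a DSL description; every name in it is ours.

```python
# Hypothetical mock-up of the two operator-facing registration interfaces:
# "library" registers prebuilt high-performance kernels, "dsl" registers
# descriptions to be compiled and auto-tuned. Names are illustrative only.
REGISTRY = {"dsl": {}, "library": {}}

def register(kind, op_name):
    def wrap(fn):
        REGISTRY[kind][op_name] = fn   # remember the implementation for lookup
        return fn
    return wrap

@register("library", "my_custom_gelu")
def gelu_library_kernel(x):
    """Would dispatch into a prebuilt high-performance library kernel."""
    raise NotImplementedError

@register("dsl", "my_custom_gelu")
def gelu_dsl(shape):
    """Would return a tensor-expression description for auto-tuning."""
    raise NotImplementedError
```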
The above describes the architecture of the embodiment of the invention; the specific deployment methods of the embodiment on multi-target hardware processors are detailed below.
(1) Rapid deployment on the Cambricon MLU platform
The MLU platform is a domestic intelligent chip platform released by Cambricon, covering cloud and edge application scenarios; the MLU220 platform, aimed at edge embedded intelligent applications, is selected here.
FIG. 3 is a hardware topology diagram of the MLU220 platform. The Cluster unit is the AI inference computing unit; it contains 4 IPU units, and the 4 IPUs can communicate data through shared memory. DDR memory and bandwidth are shared among the multimedia decoding unit, the ARM cores, and the Cluster.
FIG. 4 shows the internal structure of an IPU computing module of the MLU220, mainly divided into computing units and on-chip memory units. There are three types of computing units: NNCore is the matrix operation unit, responsible for matrix-multiplication-type computations such as convolution and fully connected layers; Vector is the vector computing unit; and Scalar is the scalar computing unit. The on-chip memory units are of two types: NRAM is the data memory, storing the input/output data of inference computation, and WRAM is the weight memory, storing model weights. The vector and scalar computing units can provide custom operator development capability to users, meeting the demand for tail operators arising from the long-tail effect of network model operators; this is the key technical point for extending the capability of the whole algorithm deployment chain.
Based on the general architecture of the invention, the MLU platform is adapted and supported as follows:
1) the input layer is consistent with other platforms, and the set of supported training frameworks is determined by the overall scheme;
2) model conversion is also consistent with other platforms, with Relay as the unified computation graph representation of the model;
3) operator conversion converts Relay OPs into the OP representation of the MLU hardware; the multi-modal operator repository consists of two parts: operators supported by the vendor software stack, and a repository of conventional operators implemented in advance for deployment on the MLU device;
4) the model optimization module implements the general graph optimization scheme reused across platforms, improving computational performance through horizontal fusion, vertical fusion, dead node elimination, constant preprocessing, coefficient folding, and similar optimizations;
5) the data flow optimization module adopts a multiplexing strategy to realize hybrid heterogeneous parallelism between the IPUs and the ARM cores of the MLU platform; memory multiplexing supports reuse of model weights and reuse of inference memory across models;
the data flow optimization module provides a linking mechanism with the back-end platform and compiles the computation graph into a computation representation that the MLU platform can recognize;
6) the operator optimization layer in the operator optimization module selects between the vendor's built-in operators and automatically tuned operators through a preference strategy;
the operator compilation layer connects to BANG C of the MLU platform through the Low-Level IR and uses code generation with automatic tuning to produce auto-optimized operators;
7) the runtime module adopts a multiplexing strategy; its main functions are to execute the compiled computation graph and obtain outputs, and to provide the MLU's device and resource management functions, such as the device memory management mechanism and computation stream synchronization control.
(2) Rapid deployment on FPGA platforms
An FPGA is a hardware-reconfigurable architecture. Through programming, its application scenario can be changed at any time, and it can emulate the various parallel operations of hardware such as CPUs and GPUs. Interconnected with the target hardware through high-speed interfaces, the FPGA can take over the parts that run inefficiently on the target hardware, achieving system-level acceleration. The rapid deployment scheme and related optimizations for SoC-series FPGA platforms are implemented based on the software/hardware architecture characteristics of the Zynq hardware platform, combined with the project's overall tool chain integration and optimization scheme. FIG. 5 shows the locations on the Zynq chip of the ARM processor, a dedicated hardened resource, and the MicroBlaze processor, which resides in the programmable logic. The PL part of Zynq can host a MicroBlaze soft processor that cooperates with the ARM processor. The MicroBlaze processor can take responsibility for coordinating certain low-level functions with the system, and can run the flow-processing portion of the framework of the invention together with the ARM Cortex-A9 processor, improving overall performance.
Based on the general framework of the invention, the Zynq platform is adapted and supported as follows:
1) the input layer is consistent with other platforms, and the set of supported training frameworks is determined by the overall scheme;
2) model conversion is also consistent with other platforms, with Relay as the unified computation graph representation of the model;
3) the operator conversion module converts Relay OPs into the OP representation of the Zynq hardware; the operator repository consists of two parts: the operator library built into the deployment framework, and a hardware operator repository implemented in advance for deployment on the Zynq device;
4) the model optimization module implements the general graph optimization scheme reused across platforms, improving computational performance through horizontal fusion, vertical fusion, dead node elimination, constant preprocessing, coefficient folding, and similar optimizations;
5) the data flow optimization module adopts a multiplexing strategy to realize hybrid heterogeneous parallelism between the PL side and the ARM side of the Zynq platform; a data flow optimization IP core, designed according to the data-flow optimization method for the hardware architecture characteristics, is placed on the PL side. The tool chain invokes the IP core on the PL side to realize memory multiplexing, supporting reuse of model weights and reuse of inference memory across models;
the data flow optimization module provides a linking mechanism with the back-end platform and compiles the computation graph into a computation representation that the Zynq platform can recognize;
6) the operator optimization layer in the operator optimization module selects between hardware operators designed for this hardware platform and automatically tuned operators through a preference strategy. The hardware operators are placed on the PL side, and the tool chain's operator preference logic decides when the better operator comes from the PL side;
the operator compilation layer connects to the Vitis AI deployment tool of the Zynq platform through the Low-Level IR and uses code generation with automatic tuning to produce auto-optimized operators;
7) the runtime module adopts a multiplexing strategy; its main functions are to execute the compiled computation graph and obtain outputs, and to provide Zynq device and resource management functions, such as the device memory management mechanism and computation stream synchronization control.
(3) Rapid deployment on CPU/GPU platforms
The selected platforms are the domestically produced Huawei Ascend 310 platform and the Rockchip RK3399 platform, both applied in edge embedded intelligent application scenarios.
The CPU/GPU follows the von Neumann architecture, with instruction decode-and-execute and shared memory. The tool chain uses OpenCL for GPU/CPU computation. Mapped onto the OpenCL model, each shader core within a GPU/CPU processor executes one or more workgroups; each shader core supports a maximum of 384 concurrently executing threads, and each work item is typically mapped to a single thread on the CPU/GPU. The CPU/GPU uses a VLIW (very long instruction word) architecture, with each instruction word containing multiple operations. The CPU/GPU also uses SIMD, so the tool chain applies most arithmetic instructions to multiple data elements simultaneously.
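Under the same TVM assumption as the earlier sketches, retargeting the unified Relay module to such an OpenCL device is mainly a change of target string; "mali" is given as an example device key for the RK3399's GPU, not as the patent's exact flags.

```python
# Sketch of building for an OpenCL-capable embedded GPU; target strings are
# example assumptions, not the patent's configuration.
import tvm
from tvm import relay

def build_for_opencl(mod, params):
    target = tvm.target.Target("opencl -device=mali",
                               host="llvm -mtriple=aarch64-linux-gnu")
    with tvm.transform.PassContext(opt_level=3):
        return relay.build(mod, target=target, params=params)
```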
Based on the overall tool chain architecture, the invention adapts and supports the Huawei CPU/GPU platform and the Rockchip CPU/GPU as follows:
1) the input layer is consistent with other platforms, and the set of supported training frameworks is determined by the overall scheme;
2) model conversion is also consistent with other platforms, with Relay as the unified computation graph representation of the model;
3) the operator conversion module, taking the Huawei platform as an example, converts Relay OPs into the CANN OP representation of the Huawei hardware; the operator repository consists of two parts: operators supported by the vendor software stack, and a repository of conventional operators implemented in advance for deployment on the Ascend device;
4) the model optimization module implements the general graph optimization scheme reused across platforms, improving computational performance through horizontal fusion, vertical fusion, dead node elimination, constant preprocessing, coefficient folding, and similar optimizations;
5) the data flow optimization module adopts a multiplexing strategy and, combined with the hardware characteristics of the CPU/GPU platform, accelerates convolution through the optimization library's spatial packing, tiling, unrolling, and vectorization;
the data flow optimization module provides a linking mechanism with the back-end platform and compiles the computation graph into a computation representation that the CPU/GPU platform can recognize;
6) the operator optimization layer in the operator optimization module selects between the vendor operators (the CANN operator library and the ARM-accelerated operator library) and automatically tuned operators through a preference strategy;
the operator compilation layer connects to the CPU/GPU platform through the Low-Level IR and uses code generation with automatic tuning to produce auto-optimized operators;
7) the runtime module adopts a multiplexing strategy; its main functions are to execute the compiled computation graph and obtain outputs, and to provide CPU/GPU device and resource management functions, such as the device memory management mechanism and computation stream synchronization control.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. The multi-hardware-target deep model optimized deployment architecture supporting custom operators, characterized by comprising, in sequence: a front-end import module, an operator conversion module, a model optimization module, a data flow optimization module, an operator optimization module, and a runtime module;
the front-end import module is used for converting a deep learning model file into a Relay computation graph representation;
the operator conversion module is used for converting the Relay OPs in the Relay computation graph into hardware OPs and outputting a Relay computation graph representation carrying hardware OPs;
the model optimization module is used for performing graph optimization on the computation graph in the Relay computation graph representation carrying hardware OPs, reducing memory accesses and invalid repeated computation during computation graph execution through dynamic and static graph optimization methods, and outputting the optimized computation graph and the optimized hardware OPs;
the data flow optimization module is used for optimizing the optimized computation graph representation to form a computation graph execution flow;
the operator optimization module is used for performing multi-modal representation and automatic tuning on the optimized hardware OPs and outputting operator code;
and the runtime module is used for loading the hardware operators corresponding to the operator code and the computation graph execution flow, and generating a deployment file.
2. The multi-hardware-target deep model optimized deployment architecture supporting custom operators according to claim 1, wherein the front-end import module comprises a model access unit, a PyTorch-to-Relay unit, an ONNX-to-Relay unit, and a TensorFlow-to-Relay unit;
the model access unit is used for receiving a deep learning model file, which comprises hardware information and a network model;
the PyTorch-to-Relay, ONNX-to-Relay, and TensorFlow-to-Relay units are used for converting network models of different forms into a Relay computation graph representation.
3. The architecture of claim 1, wherein the operator conversion module comprises an operator conversion unit, a Relay-OP-to-hardware-OP unit, and a multi-modal operator repository;
the operator conversion unit is used for receiving the Relay computation graph representation;
the Relay-OP-to-hardware-OP unit is used for converting the Relay OPs in the Relay computation graph representation into hardware OPs based on the stored operator implementation modes;
the multi-modal operator repository is used for storing the operator implementation modes supported on a specified hardware platform.
4. The multi-hardware-target deep model optimized deployment architecture supporting custom operators according to claim 1, wherein the model optimization module comprises a model optimization unit, a graph anesthesia Pass unit, a graph dissection Pass unit, a graph transplantation Pass unit, and a graph stitching Pass unit;
the model optimization unit is used for receiving the Relay computation graph representation carrying hardware OPs;
the graph anesthesia Pass unit is used for performing a post-order traversal of the Relay computation graph carrying hardware OPs and replacing each constant-foldable subgraph with a constant node;
the graph dissection Pass unit is used for performing inline optimization and deleting useless nodes;
the graph transplantation Pass unit is used for iterating over the dissected computation graph;
and the graph stitching Pass unit is used for identifying common subexpressions with a hash table, merging their topological structures, and outputting the optimized computation graph and the optimized hardware OPs.
5. The multi-hardware-target deep model optimized deployment architecture supporting custom operators according to claim 1, wherein the data flow optimization module comprises a data flow optimization unit, a heterogeneous parallel unit, and a memory multiplexing unit;
the data flow optimization unit is used for receiving the optimized computation graph;
the heterogeneous parallel unit and the memory multiplexing unit are used for efficiently encoding convolution kernels and activation values in the data flow of the optimized computation graph, improving the efficiency of memory access and model inference under different hardware constraints.
6. The architecture of claim 1, wherein the operator optimization module comprises an operator optimization unit, a multi-modal preference unit, an operator substitution unit, an operator compilation unit, an automatic tuning unit, and a code generation unit;
the operator optimization unit is used for receiving the optimized hardware OPs;
the multi-modal preference unit is used for selecting an implementation for each optimized hardware OP: if a corresponding operator is integrated in the high-performance operator library, operator substitution is invoked, otherwise operator compilation is invoked;
the operator substitution unit is used for converting the optimized hardware OP into a library operator;
the operator compilation unit is used for compiling the optimized hardware OP;
the automatic tuning unit is used for automatically tuning the compiled operator, realizing automatic loop tiling, fusion, and reordering;
and the code generation unit is used for generating code for the tuned operator and outputting operator code.
7. The architecture of claim 1, wherein the runtime module comprises a runtime unit, a device/resource management unit, a computation graph executor, a high-performance operator library, and a compilation library;
the runtime unit is used for receiving the computation graph execution flow and the operator code;
the device/resource management unit is used for managing the computing resources, memory resources, and execution streams on the device;
the computation graph executor is used for controlling the execution of the computation graph;
the high-performance operator library is used for storing hardware operators;
and the compilation library is used for invoking the hardware operators corresponding to the computation graph execution flow and the operator code, deploying code onto the corresponding hardware, and generating the deployment file for the corresponding hardware unit.
8. The architecture of claim 1, further comprising a model operator registration interface, a Relay interface, a hardware operator registration interface, a hardware operator DSL implementation registration interface, and a hardware operator library implementation registration interface.
CN202111216615.4A 2021-10-19 2021-10-19 Multi-hardware-target deep model optimized deployment framework supporting custom operators Pending CN113934410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111216615.4A CN113934410A (en) Multi-hardware-target deep model optimized deployment framework supporting custom operators


Publications (1)

Publication Number Publication Date
CN113934410A 2022-01-14

Family

ID=79280499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111216615.4A Pending CN113934410A (en) Multi-hardware-target deep model optimized deployment framework supporting custom operators

Country Status (1)

Country Link
CN (1) CN113934410A (en)


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091674A (en) * 2022-01-19 2022-02-25 北京华品博睿网络技术有限公司 Model reasoning acceleration method and system based on CPU equipment
CN114092313A (en) * 2022-01-19 2022-02-25 北京华品博睿网络技术有限公司 Model reasoning acceleration method and system based on GPU (graphics processing Unit) equipment
CN116126365A (en) * 2023-04-18 2023-05-16 之江实验室 Model deployment method, system, storage medium and electronic equipment
CN116954721A (en) * 2023-09-20 2023-10-27 天津南大通用数据技术股份有限公司 Asynchronous non-blocking splitting method for multi-modal operator of actuator
CN116954721B (en) * 2023-09-20 2023-12-15 天津南大通用数据技术股份有限公司 Asynchronous non-blocking splitting method for multi-modal operator of actuator
CN117075918A (en) * 2023-10-13 2023-11-17 之江实验室 Model deployment method and device, storage medium and electronic equipment
CN117075918B (en) * 2023-10-13 2024-01-09 之江实验室 Model deployment method and device, storage medium and electronic equipment
CN117455015A (en) * 2023-12-20 2024-01-26 摩尔线程智能科技(成都)有限责任公司 Model optimization method and device, storage medium and electronic equipment
CN117455015B (en) * 2023-12-20 2024-04-02 摩尔线程智能科技(成都)有限责任公司 Model optimization method and device, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination