WO2022007468A1 - Method, apparatus and terminal device for deploying a heterogeneous platform based on a TVM compiler - Google Patents

Method, apparatus and terminal device for deploying a heterogeneous platform based on a TVM compiler

Info

Publication number
WO2022007468A1
WO2022007468A1 (PCT/CN2021/088594)
Authority
WO
WIPO (PCT)
Prior art keywords
node
space
device block
target device
output
Prior art date
Application number
PCT/CN2021/088594
Other languages
English (en)
French (fr)
Inventor
吴金进
Original Assignee
深圳云天励飞技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳云天励飞技术股份有限公司
Priority to US17/623,902 (US20240036844A1)
Publication of WO2022007468A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/60Software deployment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Definitions

  • the present application belongs to the technical field of data processing, and in particular, relates to a method, apparatus and terminal device for deploying a heterogeneous platform based on a TVM compiler.
  • Heterogeneous platforms refer to operating platforms with computing units of different instruction-set types and architectures. Heterogeneous platforms support the architectures of different systems and can be composed of CPUs, GPUs, DSPs, ASICs, FPGAs and other processors.
  • Embodiments of the present application provide a method, apparatus and terminal device for deploying a heterogeneous platform based on a TVM compiler, which can implement the deployment of a deep learning network on a heterogeneous platform.
  • an embodiment of the present application provides a method for deploying a heterogeneous platform based on a TVM compiler, including:
  • the nodes in the topology diagram are divided into target device blocks corresponding to the device types;
  • the deep learning network is deployed on the heterogeneous platform based on the topology diagram, the network parameters, the function information, the target device block and the spatial information of the target device block.
  • an embodiment of the present application provides an apparatus for deploying a heterogeneous platform based on a TVM compiler, including:
  • the data acquisition module is used to acquire the topological structure diagram, network parameters and function information of each function of the deep learning network generated by the TVM compiler;
  • a node division module configured to divide the nodes in the topology diagram into target device blocks corresponding to the device types based on the information of the nodes in the topology diagram and the device types of the heterogeneous platforms;
  • a space allocation module configured to perform space allocation on the target device block to obtain space information of the target device block
  • a network deployment module configured to deploy the deep learning network onto the heterogeneous platform based on the topology diagram, the network parameters, the function information, the target device block and the spatial information of the target device block.
  • an embodiment of the present application provides a terminal device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for deploying a heterogeneous platform based on a TVM compiler according to any one of the above first aspects.
  • an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, wherein, when the computer program is executed by a processor, the method for deploying a heterogeneous platform based on a TVM compiler according to any one of the above first aspects is implemented.
  • an embodiment of the present application provides a computer program product that, when the computer program product runs on a terminal device, enables the terminal device to execute the TVM compiler-based heterogeneous platform deployment method according to any one of the above first aspects.
  • the embodiments of the present application have the following beneficial effects: the present application obtains the information of the nodes in the topology diagram generated by the TVM compiler and divides the nodes into target device blocks corresponding to the device types of the heterogeneous platform; it then performs space allocation on the target device blocks to obtain the space information of each target device block; finally, based on the acquired network parameters generated by the TVM compiler, the function information of each function, the topology diagram, the target device blocks and their space information, the information in each target device block is deployed to the devices of the heterogeneous platform, completing the deployment of the heterogeneous platform. By dividing the nodes into target device blocks of different device types, the deployment of the different devices in the heterogeneous platform is completed through the different types of target device blocks.
  • compared with the prior art, which can only deploy a single device platform, the present application ensures the integrity of the deep learning network; moreover, since the deployment of the deep learning network is based on the TVM compiler, deep learning networks obtained from different deep learning frameworks are supported.
  • FIG. 1 is a schematic diagram of an application scenario of a method for deploying a heterogeneous platform based on a TVM compiler provided by an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of a method for deploying a heterogeneous platform based on a TVM compiler provided by an embodiment of the present application;
  • FIG. 3 is a schematic flowchart of a method for dividing target device blocks in FIG. 2 according to an embodiment of the present application;
  • FIG. 4 is a schematic flowchart 1 of a specific node division method provided by an embodiment of the present application.
  • FIG. 5 is a second schematic flowchart of a specific node division method provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a node space allocation method provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a specific space allocation method provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a platform deployment method provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an apparatus for deploying a heterogeneous platform based on a TVM compiler according to an embodiment of the present application.
  • FIG. 1 is a schematic diagram of an application scenario of a method for deploying a heterogeneous platform based on a TVM (Tensor Virtual Machine) compiler provided by an embodiment of the present application.
  • the above-mentioned deployment method for a heterogeneous platform based on a TVM compiler can be used to deploy deep learning networks on a heterogeneous platform.
  • the TVM compiler 10 is used to generate the data of the deep learning network to be deployed; the terminal device 20 is used to obtain the data generated by the TVM compiler 10, divide the nodes in the obtained data into device blocks, and finally deploy the deep learning network on the heterogeneous platform based on the obtained data and the divided device blocks, thereby completing the deployment of the heterogeneous platform.
  • the heterogeneous platform may include a variety of processors. This application mainly describes a heterogeneous platform composed of an NPU (Neural-network Processing Unit, a neural network processor acceleration unit) and a DSP (Digital Signal Processing, digital signal processor), where the NPU specifically adopts an NNP (Neural Network Processor); the examples in the following specific embodiments all take the above heterogeneous platform as an example. It should be noted that the above heterogeneous platform is only an example and should not constitute any limitation on the method of the present application.
  • FIG. 2 shows a schematic flowchart of a deployment method for a heterogeneous platform based on a TVM compiler provided by the present application. Referring to FIG. 2 , the deployment method is described in detail as follows:
  • the TVM compiler can be used to compile deep learning networks generated by different deep learning network frameworks, and the deep learning network compiled by the TVM compiler can generate topology diagrams, network parameters, and function information of each function.
  • the topology diagram includes the information of the nodes; the nodes in the topology diagram are arranged layer by layer, and each node corresponds to a unique index code, where a node refers to a module that stores data or processes data during the operation of the deep learning network.
  • the information of the node may include the device type of the node, the node type of the node, the information of the output data of the node, the information of the input data of the node, the structure information of the node, the data type of the node, and the like.
  • the node type of a node can be a function type or a data type. A function-type node (a TVM OP, Tensor Virtual Machine Operator) refers to a node compiled by TVM that represents a function and can process data; a data-type node refers to a node that only stores data and performs no data processing.
  • the information of the output data of the node includes one or more output tensors, and an output tensor is a multi-dimensional array; the information of the input data of the node includes one or more input tensors, and an input tensor is a multi-dimensional array.
  • the structural information of a node may include scale information of the node, such as channel information, width information, and height information.
  • the data type of the node can include integer (int), short integer (short), and floating point (float32 or float16).
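  • as a concrete illustration of the node information described above, the following minimal Python sketch models such a node record; every field name here is an illustrative assumption, not a structure defined by TVM or by this application:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class NodeInfo:
    """Hypothetical node record mirroring the node information above."""
    index: int                    # unique index code of the node
    device_type: str              # e.g. "NNP" or "DSP", matching the platform
    node_type: str                # "function" (a TVM OP) or "data"
    inputs: List[int] = field(default_factory=list)  # index codes of input nodes
    shape: Tuple[int, ...] = ()   # scale info, e.g. (channels, height, width)
    dtype: str = "float32"        # int / short / float32 / float16
```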
  • the device type of a node corresponds to the device types of the heterogeneous platform: whatever device types the heterogeneous platform has, the nodes have the same set of device types. The device type of each node is set according to the device types of the heterogeneous platform when the TVM compiler generates the topology structure, so the kinds of node device types correspond to the kinds of device types of the heterogeneous platform to be deployed.
  • for example, if the heterogeneous platform is a platform of NNP and DSP, the device types of the heterogeneous platform include NNP and DSP, and the node device types likewise include NNP-type nodes and DSP-type nodes.
  • the network parameters refer to the weight data of the deep learning network.
  • the function information of a function exists in the form of a lib file and refers to the assembly code of the function or the information required for the function's functionality.
  • the division into target device blocks is performed mainly according to the node type, the device type, the input data of the node and the output data of the node in the node information; chiefly, the function-type nodes in the topology diagram are divided into different target device blocks.
  • step S102 may include:
  • S1021 Divide the node into candidate device blocks corresponding to each device type based on the information of the node, to obtain a target node included in each of the candidate device blocks, wherein each device type includes at least one candidate device block.
  • one device type may include one candidate device block or at least two candidate device blocks, and the specific number of candidate device blocks is mainly determined based on the classification of nodes.
  • the NNP device type may include one NNP candidate device block, or may include two or more NNP candidate device blocks.
  • step S1021 may include:
  • S10211 Determine whether each node in the topology structure graph satisfies the classification condition.
  • a node corresponds to a unique index code, and the node can be indexed according to the index code of the node.
  • when i > 1, the classification condition is that the node type of the i-th node is a function type and its input data is the output data of a node in the j-th candidate device block or the output data of an upper-layer node of the j-th candidate device block, wherein the nodes in the topology diagram are distributed layer by layer from top to bottom, and an upper-layer node of the j-th candidate device block refers to a node outside the j-th candidate device block, in the layer of nodes above and nearest to the j-th candidate device block.
  • for example, if it is the first node and the device type of the first node is DSP, the first node only needs its node type to be a function type to be put into a DSP candidate device block; because it is the first node, all DSP candidate device blocks are empty, and the first node can be placed in any DSP candidate device block. If it is the third node, it is first necessary to judge whether it is a function-type node; if not, no subsequent judgment is needed and the node is simply discarded. If it is a function-type node of DSP type, its input data must be examined: if the input data of the third node is the output data of a node in the first DSP candidate device block, the third node is included in the first DSP candidate device block; likewise, if the input data of the third node is the output data of a node in the layer outside and nearest above the first DSP candidate device block, the third node is also included in the first DSP candidate device block.
  • the candidate device block may be a preset empty device block, or may be a device block established according to the judgment of the node.
  • the device type of the first node is NNP. If there is a pre-created NNP candidate device block, the first node is included in the NNP candidate device block. If there is no pre-created NNP candidate device block, then A new NNP candidate device block can be created, and the first node can be included in the newly created NNP candidate device block.
  • if the i-th node does not meet the classification conditions, that is, the i-th node cannot be placed in any candidate device block that already contains nodes, the i-th node needs to be stored in an empty candidate device block; the (j+1)-th candidate device block can be a pre-existing empty candidate device block, or an empty candidate device block newly created when the i-th node fails the classification conditions.
  • for example, suppose three NNP candidate device blocks already store nodes of the NNP device type, the sixth node is a function-type node of NNP type, and the input data of the sixth node is neither the output data of the nodes in the above three NNP candidate device blocks nor the output data of an upper-layer node of any of them; then the sixth node is put into a fourth NNP candidate device block.
  • as an example, if there are nodes of two device types, NNP and DSP, then when a node comes in, first determine whether the node is an NNP function-type node;
  • if the node is of the NNP function type, check whether there is an NNP candidate device block A that can include it: if the input data of the node comes from a node in NNP candidate device block A or from an upper-layer node of NNP candidate device block A, the node can be placed in NNP candidate device block A; otherwise, the node is included in an empty NNP candidate device block containing no nodes, or a new NNP candidate device block is created and the node included in it;
  • if the node is not an NNP function-type node, determine whether the node is a DSP function-type node;
  • if the node is of the DSP function type, check whether there is a DSP candidate device block B that can include it: if the input data of the node comes from a node in DSP candidate device block B or from an upper-layer node of DSP candidate device block B, the node can be placed in DSP candidate device block B; otherwise, the node is included in an empty DSP candidate device block containing no nodes, or a new DSP candidate device block is created and the node included in it;
  • if the node is neither an NNP function-type node nor a DSP function-type node, no processing is required.
  • it should be noted that if the current node has at least two input data and the input data come from at least two candidate device blocks, the current node also needs to be divided, according to its device type, into the (j+1)-th candidate device block corresponding to that device type.
  • as an example, if a node is a function-type node of DSP type, one of its input data comes from a node in the third DSP candidate device block, and the other comes from a node in the second NNP candidate device block, the node needs to be put into an empty DSP candidate device block; if there is no empty DSP candidate device block, a new DSP candidate device block can be created.
  • in this embodiment, a function-type node is a node that needs to perform calculation or other processing, while a data-type node only stores data and does not process it; by classifying the nodes in this way, the candidate device blocks corresponding to each device type can be obtained.
  • a classification rule is set that not only divides nodes with input-output relationships into one block, but also divides nodes without any association into separate candidate device blocks, so that the unrelated nodes can operate in parallel.
  • the method of the present application can accurately and quickly classify all the function-type nodes while separating nodes of different device types, which facilitates the subsequent deployment of the different devices in the heterogeneous platform; deploying the heterogeneous platform by device block separates the data required by devices of different device types, and avoids the deployment errors that too many nodes or mixed node device types would otherwise cause during deployment.
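  • a minimal Python sketch of this classification flow follows, assuming the hypothetical NodeInfo record above and a topologically ordered node list; the upper-layer-node rule is simplified here to direct input-output dependence, so this illustrates the idea rather than the exact patented procedure:

```python
def classify_nodes(nodes, device_types=("NNP", "DSP")):
    """Divide function-type nodes into candidate device blocks per device type."""
    blocks = {dt: [] for dt in device_types}  # device type -> list of blocks
    placed = {}                               # node index -> (device type, block id)
    for node in nodes:
        if node.node_type != "function" or node.device_type not in blocks:
            continue                          # data-type nodes are never placed
        src_blocks = {placed[i] for i in node.inputs if i in placed}
        target = None
        if len(src_blocks) == 1:              # all placed inputs sit in one block
            dt, bid = next(iter(src_blocks))
            if dt == node.device_type:
                target = bid                  # join the block that feeds this node
        if target is None:                    # inputs span blocks, or none fit:
            blocks[node.device_type].append([])  # open an empty candidate block
            target = len(blocks[node.device_type]) - 1
        blocks[node.device_type][target].append(node.index)
        placed[node.index] = (node.device_type, target)
    return blocks
```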
  • in this embodiment, each node has an index code. All the candidate device blocks can be sorted in ascending order according to the index code of the first target node in each candidate device block; consecutive candidate device blocks of the same device type are then merged, and each candidate device block that does not need to be merged is itself a target device block. A target device block corresponds to a unique index code.
  • to judge whether candidate device blocks of the same device type are consecutive, the candidate device blocks of the same device type can first be found by searching, and it is then determined, based on their index codes, whether adjacent candidate device blocks of the same device type are consecutive: if the index codes are consecutive, the adjacent candidate device blocks of the same device type are consecutive; if not, they are not consecutive.
  • as an example, suppose candidate device block a is of NNP type and the index code of its first target node is 3; candidate device block b is of NNP type and the index code of its first target node is 2; candidate device block c is of NNP type and the index code of its first target node is 5; candidate device block d is of DSP type and the index code of its first target node is 4; and candidate device block e is of DSP type and the index code of its first target node is 1. Sorted by index code from small to large, the order is e, b, a, d, c; since b and a are consecutive and both of NNP type, b and a can be merged into one block f, giving the ordered sequence e, f, d, c, where e, f, d and c are each a target device block and each corresponds to an index code.
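  • continuing the sketch above, the sorting and merging of step S1022 can be illustrated as follows (the block representation is the assumed one returned by classify_nodes); applied to the example, it yields the sequence e, f, d, c:

```python
def sort_and_merge(blocks):
    """Sort candidate blocks by the index code of their first target node,
    then merge consecutive blocks of the same device type."""
    flat = [(dt, b) for dt, bs in blocks.items() for b in bs if b]
    flat.sort(key=lambda item: item[1][0])    # ascending first-node index code
    merged = []
    for dtype, node_ids in flat:
        if merged and merged[-1][0] == dtype: # consecutive and same device type
            merged[-1][1].extend(node_ids)    # e.g. blocks b and a merge into f
        else:
            merged.append((dtype, list(node_ids)))
    return merged                             # each entry is one target device block
```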
  • S1023 Determine the input node and the output node of each target device block according to the information of the target node in the target device block.
  • a target device block includes multiple target nodes.
  • the target device blocks can be indexed one by one according to their index codes, and the input nodes and output nodes of each target node in a target device block can be searched; after all the target nodes in each target device block have been indexed, the input nodes and output nodes of each target device block are determined.
  • the input nodes include only the nodes corresponding to input data from outside the target device block, and do not include nodes used for input-output transmission between internal nodes.
  • the output nodes include only the nodes corresponding to the output data that the target device block ultimately needs to output, and likewise do not include nodes used for input-output transmission between internal nodes.
  • step S1023 may include:
  • if an input node of a target node in the target device block satisfies the input-node inclusion condition, the input node is included in the current target device block and used as an input node of the target device block, wherein the input-node inclusion condition is that the input node of the target node is a data-type node whose corresponding data is the input data of the deep learning network, or the input node of the target node is a function-type node that is not included in the current target device block.
  • specifically, if the input data of a target node comes from a data-type node whose data is the input data of the deep learning network (that is, data input by the user and therefore known data), that node is an input node of the target device block and is included in the input node set of the target device block; if the input data of a target node is the output data of a function-type node outside the target device block, that node should likewise be included in the input node set of the target device block in which the target node is located.
  • the target node that satisfies the output-node inclusion condition in the target device block is an output node of the current target device block, wherein the output-node inclusion condition is that the target node is in the output list of the deep learning network, or the output data of the target node is the input data of a target node in a target device block other than the current one.
  • since the target nodes are all function-type nodes that output data after data processing, each target node corresponds to one output data; therefore, when determining the output nodes of a target device block, it is only necessary to judge whether each target node in the block satisfies the output-node inclusion condition.
  • if the current target node is in the output list of the deep learning network, it is an output node of the target device block; if the current target node is not in the output list but its output data is the input data of another target device block, the current target node should also be used as an output node of the target device block.
  • the target node, input node and output node stored in the target device block are all stored with the index code of the node, that is, only the index code of the node needs to be recorded when determining the target node, input node and output node.
  • the index code of the node in the topology diagram is stored in the target device block, and the parameters and data information of the relevant node can be obtained through the index code when the target device block is scheduled.
  • when determining the input nodes and output nodes of a target device block, inclusion conditions are set; according to these conditions, it can be quickly determined whether a node is an input node or an output node, and the determination of the input and output nodes lays the groundwork for the subsequent space allocation and hardware deployment.
  • in this application, the candidate device blocks are first arranged and merged in order, and the input nodes and output nodes of the target device blocks are determined afterwards. If the input and output nodes of the candidate device blocks were determined first, then after the candidate device blocks were merged, the input and output nodes used for transmission between the merged candidate device blocks would have to be removed, adding a cumbersome extra screening step; merging first and then determining the input and output nodes avoids incorporating nodes used for internal transmission into the input and output nodes, and is simpler.
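  • the inclusion conditions above can be sketched as follows; nodes_by_index, network_outputs, other_block_nodes and the simplification that every data-type input carries network input data are assumptions of this illustration:

```python
def io_nodes(block_nodes, other_block_nodes, nodes_by_index, network_outputs):
    """Determine the input and output nodes of one target device block."""
    inside = set(block_nodes)
    inputs, outputs = [], []
    for idx in block_nodes:
        for src in nodes_by_index[idx].inputs:
            node = nodes_by_index[src]
            is_net_input = node.node_type == "data"    # user-supplied, known data
            is_external_fn = node.node_type == "function" and src not in inside
            if (is_net_input or is_external_fn) and src not in inputs:
                inputs.append(src)                      # input node of the block
        feeds_other = any(idx in nodes_by_index[o].inputs
                          for o in other_block_nodes)
        if idx in network_outputs or feeds_other:
            outputs.append(idx)                         # output node of the block
    return inputs, outputs
```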
  • as an example, suppose the target device blocks include an NNP target device block and a DSP target device block, the NNP target device block includes target nodes A and B, and the DSP target device block includes target nodes C and D; the input nodes and output nodes of all the target device blocks are to be determined.
  • searching the input nodes of target node A finds node E and node F; both E and F are data-type nodes, and the output data of E and F are the input data of the deep learning network, so E and F are input nodes of the NNP target device block. Searching the input nodes of target node B finds only target node A; A does not satisfy the input-node inclusion condition, so A is not an input node of the NNP target device block. Target node A is not in the output list of the deep learning network, and no node in the DSP target device block uses the output data of A as input data, so A is not an output node of the target device block; target node B is not in the output list of the deep learning network either, but the output data of B is the input data of target node C in the DSP target device block, so B is an output node of the NNP target device block.
  • searching the input nodes of target node C finds node B and node G, which are included in the input node list of the current target device block as an input-node array, while node C itself is included in the output node list of the current target device block as an output-node array; searching the input nodes of target node D finds only target node C, so C is included in the input node list of the current target device block as the input-node array of D, while node D itself is included in the output node list of the current target device block as an output-node array.
  • finally, the input node list of the NNP target device block is [E, F] and its output node list is [B]; the input node list of the DSP target device block is [[B, G], [C]] and its output node list is [[C], [D]].
  • this is because the NNP target device block performs its calculation as a single layer, so its output is one set of output data, whereas each individual node in the DSP target device block performs its calculation as a layer, so the output is multiple sets of output data; for the DSP target device block, the input nodes and output nodes of each layer therefore need to be determined.
  • if the output data of an input node of a target device block is the input data of the deep learning network, it is data input by the user and requires no space allocation; if the input data of an input node is the output data of a target node of another target device block, the input node is an output node of that other target device block. Therefore, during space allocation, it is only necessary to allocate space to the output nodes in each target device block.
  • the space allocation includes storage address allocation; in this application, memory address allocation is based on a base address, and only offset addresses are allocated.
  • step S103 may include:
  • the space size is calculated as the product of the scale information and the size of the data type; with the space size, it is known how much space an output node should occupy, that is, how many addresses it takes up.
  • space allocation is performed on the output nodes according to their space sizes, so that the address space occupied by each output node can be accurately obtained; at the same time, allocating space only to the output nodes in the target device block reduces the occupied space and speeds up the allocation process.
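  • as a worked illustration of this product, assuming the data type names listed earlier and their usual byte widths:

```python
import numpy as np

DTYPE_BYTES = {"int": 4, "short": 2, "float32": 4, "float16": 2}  # assumed widths

def space_size(shape, dtype):
    """Space size = number of elements (scale info) x byte width of the data type."""
    return int(np.prod(shape)) * DTYPE_BYTES[dtype]

# e.g. a (3, 224, 224) float16 output tensor occupies 3*224*224*2 = 301056 bytes
assert space_size((3, 224, 224), "float16") == 301056
```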
  • step S1032 may include:
  • S10321 Sort the a output nodes according to the order of input and output, and obtain a sorted output node queue.
  • the target device blocks are indexed one by one through their index codes, and within each block the nodes are indexed in ascending order of their index codes, with the node whose index code comes first indexed first; this is exactly the input-output order, so all the output nodes are sorted in the order of input and output.
  • as an example, if the output node in target device block A is a, the output node in target device block B is b, and target device block A comes before B, then output node a is indexed first and output node b second, so that after sorting, output node a is in front and output node b behind.
  • S10322 Based on the space size of the first output node, allocate the first node space in the first storage space for the first output node.
  • since it is the first output node, none of the allocatable space in the first storage space is occupied, so space can be allocated for the first output node directly, yielding the first node space.
  • the first node space includes the first address and the last address for storing the first output node.
  • the storage space may include a data space and a global space.
  • the data space is a space that can be reused, that is, the same node space can be used by different output nodes at different times; the first storage space refers to the data space. The global space is a space that cannot be reused: once an output node occupies a node space in it, that node space can no longer be used by other output nodes; the second storage space refers to the global space.
  • when node space is allocated for the n-th output node, since the already-allocated output nodes occupy part of the space, it is first necessary to search the released node spaces, that is, the node spaces in an idle state, for a node space usable by the n-th output node; such a node space must be larger than the space size of the n-th output node.
  • S10324 If there is a target node space in the allocated node space, allocate an nth node space for the nth output node in the target node space according to the space size of the nth output node.
  • that is, the n-th node space is allocated for the n-th output node within the target node space, realizing reuse of the node space.
  • if there is no target node space, that is, no idle node space satisfies the space size of the n-th output node, a corresponding node space is newly allocated for the n-th output node in the first storage space.
  • the release condition is that the output node occupying the node space is neither an input node of the n-th output node nor an input node of any output node following the n-th output node.
  • after each allocation, the allocated node spaces are arranged in ascending order, and the currently existing node spaces are checked for any that need to be released; among the current node spaces, only the node spaces required for the calculation of the n-th output node are retained.
  • the node spaces required for the calculation of the n-th output node include the node space for the output data of the n-th output node itself; if its input data is provided by another output node, the node space of that output node must also be kept, and if the input data of the n-th output node is not provided by an output node, only the node space for the output data of the n-th output node needs to be kept.
  • in addition, the node spaces corresponding to the input nodes of output nodes that have not yet been allocated node space must also be kept, because they will be needed for the later calculation of those output nodes; if they were released, the data would not be found when needed later. The node spaces of other output nodes that will not be used later are released.
  • S10328 Allocate the a-th node space for the a-th output node in the second storage space according to the space size of the a-th output node.
  • the space of the first node to the space of the a-th node constitutes the space information of the target device block.
  • in this step, the node spaces corresponding to the output nodes that satisfy the release condition among the current node spaces can also be released.
  • after all output nodes have been allocated, the maximum total storage space that all output nodes need to occupy can be obtained; since some output nodes reuse the same storage space, this maximum storage space is smaller than the sum of the node spaces of all output nodes.
  • for example, if the first output node occupies 5 bytes, the second output node occupies 8 bytes, and the third output node occupies the 5-byte space freed by the first output node, the maximum storage space is 13 bytes rather than the 18-byte sum of the three node spaces.
  • the maximum storage space consists of the first maximum storage space occupied by the first a-1 output nodes and the second maximum storage space occupied by the a-th output node, where the second maximum storage space is the a-th node space.
  • allocating space for the target device blocks, on the one hand, yields the node space of each output node in a device block, which is convenient for calling; on the other hand, it yields the total space size of all the target device blocks corresponding to each device type, which is convenient for the heterogeneous platform to allocate space for the deep learning network during deployment.
  • releasing unused data after node space allocation means that the module storing the node spaces retains only the data currently in use, keeping the data of the deep learning network clean and uncontaminated during operation, guaranteeing the normal operation of the deep learning network, and making the node spaces reusable, which benefits space reuse.
  • as an example, suppose there are an NNP target device block A and a DSP target device block B, where A is connected to B and A comes before B.
  • A is an NNP target device block; the function implementation of an NNP target device block determines that it has only one layer of operations, so A has only one set of output data, and node space allocation is performed on the output nodes corresponding to that output data. After the allocation is completed, the index code of the target device block is incremented by 1, and space allocation proceeds to the next target device block.
  • B is a DSP target device block; the function implementation of a DSP target device block determines that it has multiple layers of calculation, so B has multiple sets of output data, and node space allocation is performed on the output nodes corresponding to each layer of output data. After the allocation is completed, the index code of the target device block is incremented by 1, and space allocation proceeds to the next target device block.
  • as an example of the allocation process: the first time, node space is allocated for the first output node, growing space 1 as the first node space; the second time, node space is allocated for the second output node, which has two output data, so spaces 2 and 3 are grown and together form the second node space; since the input nodes of the second output node include the first output node, space 1 does not need to be released.
  • the third time, node space is allocated for the third output node, which has one output data; space 4 is grown as the third node space. Because the input nodes of the third output node include the second output node, spaces 2 and 3 do not need to be released, while the first output node is not an input node of any output node still awaiting space, so space 1 is released.
  • the fourth time, node space is allocated for the fourth output node, which has one output data requiring space 5; the freed space 1 can meet the requirements of 5, so 5 occupies the released space of 1, and 5 is the fourth node space. Since the input nodes of the fourth output node include the third output node, space 4 does not need to be released; space 2 can be released if it is no longer used later, and space 3 must be kept for subsequent use.
  • the fifth time, node space is allocated for the fifth output node, which has one output data and obtains space 6 as the fifth node space; because the input nodes of the fifth output node include the fourth output node, space 5 does not need to be released, while space 4 is no longer used and can be released.
  • the sixth time, the sixth output node is the last output node and is allocated in the global space; since the figure shows only the dynamically allocated node spaces, this allocation is not shown there. Space is allocated for the sixth output node to obtain the sixth node space; in addition, in this step the node spaces of output nodes that are no longer used can be released, so the no-longer-used space 3 can be released.
  • since the fourth node space and the fifth node space occupy space released by previous output nodes, the maximum storage space occupied by the first through the sixth output nodes is the sum of the first node space, the second node space, the third node space and the sixth node space.
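  • the allocation-with-reuse procedure of steps S10322 to S10328 can be sketched as the following greedy allocator; the sizes and deps inputs and the modelling of the global space as one final grown region are assumptions of this illustration, not the patented implementation. Run on the six-output example above with suitable sizes and dependency sets, it reproduces the reuse of released space by later output nodes, while top tracks the maximum storage space:

```python
def allocate(output_nodes, sizes, deps):
    """Greedy data-space allocator with release and reuse of node spaces.

    output_nodes: index codes sorted in input-output order;
    sizes[n]: bytes needed by output node n;
    deps[n]: set of output nodes whose data node n consumes as input.
    """
    free, live, offsets, top = [], {}, {}, 0
    for pos, n in enumerate(output_nodes):
        fit = next((s for s in free if s[1] >= sizes[n]), None)
        is_last = pos == len(output_nodes) - 1     # last node -> global space
        if fit and not is_last:                    # reuse a released node space
            free.remove(fit)
            offsets[n] = fit[0]
        else:                                      # grow a fresh node space
            offsets[n], top = top, top + sizes[n]
        live[n] = (offsets[n], sizes[n])
        # release condition: not an input of the n-th or any later output node
        still_needed = set().union(*(deps[m] for m in output_nodes[pos:]))
        for m in [m for m in live if m not in still_needed and m != n]:
            free.append(live.pop(m))
    return offsets, top
```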
  • S104 Deploy the deep learning network on the heterogeneous platform based on the topology structure diagram, the network parameters, the function information, the target device block, and the spatial information.
  • each target device block can be deployed on the heterogeneous platform according to the order in which the target device blocks are arranged.
  • the target device blocks are also deployed one by one through the index codes of the target device blocks.
  • step S104 may include:
  • S1041: Based on the input data of the input nodes, the output data of the output nodes, the space information, the function information and the network parameters of the m-th target device block arranged according to the preset rules, obtain the deployment information of the m-th target device block.
  • the deployment information required by different device types is different, but the input data and output data of the target node in the target device block need to be collected before deployment.
  • the information collected for the input data includes data structure information, data space information and node information; the same information is collected for the output data, and the input data and output data are then assembled into structured input data and structured output data.
  • Deployment information includes structured input data, output data, function information, network parameters, and device configuration information of corresponding devices on heterogeneous platforms.
  • Device configuration information is pre-stored and set according to the device type.
  • Device configuration information may include process information such as the image number, the code length, and how the code is started.
  • for an NNP target device block, the function information is the assembly code compiled by TVM; the structured input data, the structured output data, the TVM-compiled assembly code and the network parameters need to be put into the NNP compilation tool and compiled to generate binary code, and the deployment information includes the binary code, the structured input data, the structured output data and the device configuration information.
  • for a DSP target device block, the function information is the function information of each layer's function in the DSP target device block; the deployment information includes the function information of each layer's function, the structured input data of each layer, the structured output data of each layer and the device configuration information.
  • in addition, overhead information needs to be set according to the device scheduling, which may include the number of layers of the target device block, the input data and output data of the target device block, and so on.
  • S1042 Determine whether the deployment information of the m th target device block satisfies a deployment condition, where the deployment condition is that the m th target device block does not need to wait for the output data of the m-1 th target device block as input data.
  • during deployment, the current target device block may need to use the output data of the previous target device block as input data; before the previous block's calculation is completed, the current target device block cannot obtain its input data and therefore cannot be deployed, and must wait until the data is obtained before continuing.
  • if the deployment condition is satisfied, the deployment information can be directly assembled into the corresponding sections and a message sent to the corresponding device.
  • if the deployment condition is not met, then after the output data of the (m-1)-th target device block is obtained, the deployment information of the m-th target device block is sent to the device in the heterogeneous platform whose device type is the same as that of the m-th target device block.
  • that is, when the deployment condition is not met and the m-th target device block must wait for the output data of the (m-1)-th target device block as input data, deployment is interrupted until the calculation of the (m-1)-th target device block is completed, after which the deployment information is assembled into the corresponding sections or messages are sent to the corresponding device.
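  • the deployment loop with its waiting condition can be sketched as follows; platform.send, platform.wait_output and the block methods are hypothetical stand-ins for the device-specific interfaces, not APIs defined by this application:

```python
def deploy(blocks, platform):
    """Deploy target device blocks one by one in index-code order (step S104)."""
    for m, block in enumerate(blocks):
        info = block.deployment_info()           # structured I/O, code, parameters
        needs_previous = m > 0 and block.reads_output_of(blocks[m - 1])
        if needs_previous:                       # deployment condition not met:
            platform.wait_output(blocks[m - 1])  # wait for previous block's output
        # assemble the sections and send the message to the device whose
        # device type matches that of the m-th target device block
        platform.send(device_type=block.device_type, payload=info)
```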
  • in summary, in this application the nodes in the topology diagram are divided into target device blocks corresponding to the device types of the heterogeneous platform; space allocation is then performed on the target device blocks to obtain the space information of each target device block; finally, based on the obtained network parameters generated by the TVM compiler, the function information of each function, the topology diagram, the target device blocks and their space information, the information in each target device block is deployed to the devices of the heterogeneous platform, completing the deployment of the heterogeneous platform. The present application divides the nodes into target device blocks of different device types and completes the deployment of the different devices in the heterogeneous platform through the different types of target device blocks.
  • compared with the prior art, which can only deploy a single device platform, the present application ensures the integrity of the deep learning network, and since the deployment of the deep learning network is based on the TVM compiler, deep learning networks obtained from different deep learning frameworks are supported.
  • this application is suitable not only for online deployment but also for offline deployment; for offline deployment, the deployment information only needs to be stored in the corresponding bin files, which generally include a network structure bin file and a network parameter bin file.
  • step S102 may further include:
  • the node is divided into candidate device blocks corresponding to each device type, and the target nodes included in each candidate device block are obtained, wherein each device type contains at least one candidate device block;
  • the candidate device blocks are arranged according to a preset rule, and the consecutive candidate device blocks belonging to the same device type are combined to obtain the target device block.
  • when candidate device blocks are merged, if the candidate device blocks are of NNP type, the input nodes and output nodes used for internal transmission between the merged candidate device blocks need to be removed, leaving only the input nodes and output nodes of the final target device block; if they are of DSP type, the input nodes and output nodes are simply combined.
  • as an example, the input nodes of candidate device block A are a and b, its output node is c, and c is a target node in candidate device block A; the input nodes of candidate device block B are c and d, its output node is e, and e is a target node in candidate device block B. When candidate device block A and candidate device block B are merged into target device block C, the input nodes of target device block C are a, b and d, and the output node is e.
  • FIG. 9 shows a structural block diagram of the TVM compiler-based heterogeneous platform deployment apparatus provided by the embodiment of the present application; for convenience of description, only the parts related to the embodiments of the present application are shown.
  • the apparatus 300 may include: a data acquisition module 310 , a node division module 320 , a space allocation module 330 and a network deployment module 340 .
  • the data acquisition module 310 is used to acquire the topological structure diagram, network parameters and function information of each function of the deep learning network generated by the TVM compiler;
  • a node division module 320 configured to divide the nodes in the topology diagram into target device blocks corresponding to the device types based on the information of the nodes in the topology diagram and the device types of the heterogeneous platforms;
  • a space allocation module 330 configured to perform space allocation on the target device block to obtain space information of the target device block
  • a network deployment module 340, configured to deploy the deep learning network onto the heterogeneous platform based on the topology diagram, the network parameters, the function information, the target device block and the space information of the target device block.
  • the node division module 320 may specifically include:
  • a module dividing unit configured to divide the node into candidate device blocks corresponding to each device type based on the information of the node and the device type, and obtain the target node included in each candidate device block, wherein each device Type contains at least one candidate device block;
  • a module sorting unit used for arranging the candidate device blocks according to preset rules, and merging the candidate device blocks belonging to the same device type and being continuous to obtain a target device block;
  • the input and output node determination unit is configured to determine the input node and the output node of each target device block according to the information of the target node in the target device block.
  • the information of the node includes the node type of the node, the device type of the node, the input data of the node and the output data of the node;
  • the module division unit can be used for:
  • the node type of the node is a function type
  • the input data of the i-th node is the output data of a node in the j-th candidate device block or the output data of a node in the upper layer of the j-th candidate device block, wherein the nodes in the topology diagram are distributed layer by layer from top to bottom, and an upper-layer node of the j-th candidate device block refers to a node outside the j-th candidate device block, in the layer of nodes above and nearest to the j-th candidate device block;
  • the module division unit can also be specifically used for:
  • if the i-th node does not meet the classification conditions, classify the i-th node, according to the device type of the i-th node, into the (j+1)-th candidate device block corresponding to that device type, wherein no node exists in the (j+1)-th candidate device block.
  • the input and output node determination unit can be specifically used for:
  • if an input node of a target node in the target device block satisfies the input-node inclusion condition, include the input node in the current target device block and use it as an input node of the target device block, wherein the input-node inclusion condition includes that the input node of the target node is a data-type node whose corresponding data is the input data of the deep learning network, or the input node of the target node is a function-type node not included in the current target device block;
  • the target node that satisfies the inclusion condition of the output node in the target device block is the output node of the current target device block, wherein the output node inclusion condition includes that the target node is in the output list of the deep learning network, or
  • the output data of the target node is the input data of the target node in the target device block other than the current target device block.
  • the space allocation module 330 may specifically include:
  • a space computing unit for calculating the space size of each output node based on the scale information of each output node in the target device block and the data type of each output node
  • the space allocation unit is configured to obtain the maximum storage space that all output nodes need to occupy based on the space size of each output node, and use the maximum storage space as the space information of the target device block.
  • the space allocation unit can be specifically used for:
  • the target node space is searched in the allocated node space, wherein the target node space is the node space that is currently in an idle state and is larger than the space size of the nth output node, 2 ⁇ n ⁇ a-1;
  • the node space of the output node that satisfies the release condition is released, wherein the released node space is in an idle state;
  • the maximum storage space occupied by the a output nodes is determined.
  • the network deployment module 340 can be specifically used for:
  • based on the input data of the input nodes, the output data of the output nodes, the space information, the function information and the network parameters of the m-th target device block arranged according to the preset rules, the deployment information of the m-th target device block is obtained;
  • the deployment condition is that the m th target device block does not need to wait for the output data of the m-1 th target device block as input data
  • the deployment information of the m th target device block is sent to a device of the same device type as the target device block of the m th target device block in the heterogeneous platform.
  • the network deployment module 340 can also be specifically used for:
  • if the deployment condition is not met, after the output data of the (m-1)-th target device block is obtained, the deployment information of the m-th target device block is sent to the device in the heterogeneous platform whose device type is the same as that of the m-th target device block.
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium; when the computer program is executed by a processor, the steps in the various embodiments of the above method for deploying a heterogeneous platform based on a TVM compiler can be implemented.
  • the embodiments of the present application provide a computer program product, when the computer program product runs on a mobile terminal, the mobile terminal can implement the steps in each embodiment of the above-mentioned method for deploying a heterogeneous platform based on a TVM compiler.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

Provided are a method, apparatus and terminal device for deploying a heterogeneous platform based on a TVM compiler. The method includes: acquiring the topology diagram, the network parameters and the function information of each function of a deep learning network generated by the TVM compiler (S101); dividing the nodes in the topology diagram into target device blocks corresponding to the device types based on the information of the nodes in the topology diagram and the device types of the heterogeneous platform (S102); performing space allocation on the target device blocks to obtain the space information of the target device blocks (S103); and deploying the deep learning network onto the heterogeneous platform based on the topology diagram, the network parameters, the function information, the target device blocks and the space information of the target device blocks (S104). By dividing the nodes into target device blocks of different device types, the deployment of the different devices in the heterogeneous platform is completed through the different types of target device blocks.

Description

Method, apparatus and terminal device for deploying a heterogeneous platform based on a TVM compiler

Technical Field
This application belongs to the technical field of data processing, and in particular relates to a method, apparatus and terminal device for deploying a heterogeneous platform based on a TVM compiler.
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 10, 2020, with application number 202010654954.X and invention title "Method, apparatus and terminal device for deploying a heterogeneous platform based on a TVM compiler", the entire contents of which are incorporated herein by reference.
 
Background Art
A heterogeneous platform refers to an operating platform with computing units of different instruction-set types and architectures. Heterogeneous platforms support the architectures of different systems and can be composed of CPUs, GPUs, DSPs, ASICs, FPGAs and other processors.
At present, most hardware platform deployment targets a single hardware platform with one instruction set and one type of architectural computing unit, and a deployment method for heterogeneous platforms is lacking.
Technical Solution
Embodiments of the present application provide a method, apparatus and terminal device for deploying a heterogeneous platform based on a TVM compiler, which can implement the deployment of a deep learning network on a heterogeneous platform.
In a first aspect, an embodiment of the present application provides a method for deploying a heterogeneous platform based on a TVM compiler, including:
acquiring the topology diagram, the network parameters and the function information of each function of a deep learning network generated by the TVM compiler;
dividing the nodes in the topology diagram into target device blocks corresponding to the device types based on the information of the nodes in the topology diagram and the device types of the heterogeneous platform;
performing space allocation on the target device blocks to obtain the space information of the target device blocks;
deploying the deep learning network onto the heterogeneous platform based on the topology diagram, the network parameters, the function information, the target device blocks and the space information of the target device blocks.
In a second aspect, an embodiment of the present application provides an apparatus for deploying a heterogeneous platform based on a TVM compiler, including:
a data acquisition module, configured to acquire the topology diagram, the network parameters and the function information of each function of a deep learning network generated by the TVM compiler;
a node division module, configured to divide the nodes in the topology diagram into target device blocks corresponding to the device types based on the information of the nodes in the topology diagram and the device types of the heterogeneous platform;
a space allocation module, configured to perform space allocation on the target device blocks to obtain the space information of the target device blocks;
a network deployment module, configured to deploy the deep learning network onto the heterogeneous platform based on the topology diagram, the network parameters, the function information, the target device blocks and the space information of the target device blocks.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for deploying a heterogeneous platform based on a TVM compiler according to any one of the above first aspects.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for deploying a heterogeneous platform based on a TVM compiler according to any one of the above first aspects.
In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the method for deploying a heterogeneous platform based on a TVM compiler according to any one of the above first aspects.
It can be understood that, for the beneficial effects of the second to fifth aspects above, reference may be made to the relevant description of the first aspect, which will not be repeated here.
Compared with the prior art, the embodiments of the present application have the following beneficial effects: the present application divides the nodes in the topology diagram generated by the TVM compiler into target device blocks corresponding to the device types of the heterogeneous platform according to the acquired information of those nodes; it then performs space allocation on the target device blocks to obtain the space information of each target device block; finally, based on the acquired network parameters generated by the TVM compiler, the function information of each function, the topology diagram, the target device blocks and their space information, the information in each target device block is deployed to the devices of the heterogeneous platform, completing the deployment of the heterogeneous platform. By dividing the nodes into target device blocks of different device types, the deployment of the different devices in the heterogeneous platform is completed through the different types of target device blocks; compared with the prior art, which can only deploy a single device platform, the present application ensures the integrity of the deep learning network, and since the deployment of the deep learning network is based on the TVM compiler, deep learning networks obtained from different deep learning frameworks are supported.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an application scenario of the method for deploying a heterogeneous platform based on a TVM compiler provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the method for deploying a heterogeneous platform based on a TVM compiler provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of the method for dividing target device blocks in FIG. 2 provided by an embodiment of the present application;
FIG. 4 is a first schematic flowchart of a specific node division method provided by an embodiment of the present application;
FIG. 5 is a second schematic flowchart of a specific node division method provided by an embodiment of the present application;
FIG. 6 is a schematic flowchart of a node space allocation method provided by an embodiment of the present application;
FIG. 7 is a schematic flowchart of a specific space allocation method provided by an embodiment of the present application;
FIG. 8 is a schematic flowchart of a platform deployment method provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an apparatus for deploying a heterogeneous platform based on a TVM compiler provided by an embodiment of the present application.
Embodiments of the Invention
FIG. 1 is a schematic diagram of an application scenario of the method for deploying a heterogeneous platform based on a TVM (Tensor Virtual Machine) compiler provided by an embodiment of the present application; the above method can be used to deploy deep learning networks on a heterogeneous platform. The TVM compiler 10 is used to generate the data of the deep learning network to be deployed; the terminal device 20 is used to acquire the data generated by the TVM compiler 10, divide the nodes in the acquired data into device blocks, and finally deploy the deep learning network onto the heterogeneous platform based on the acquired data and the divided device blocks, thereby completing the deployment of the heterogeneous platform.
In a specific application, the heterogeneous platform may include a variety of processors. This application mainly describes a heterogeneous platform composed of an NPU (Neural-network Processing Unit, a neural network processor acceleration unit) and a DSP (Digital Signal Processing, digital signal processor), where the NPU specifically adopts an NNP (Neural Network Processor); the examples in the following specific embodiments all take the above heterogeneous platform as an example. It should be noted that the above heterogeneous platform is only an example and should not constitute any limitation on the method of the present application.
The method for deploying a heterogeneous platform based on a TVM compiler according to the embodiments of the present application is described in detail below with reference to FIG. 1.
FIG. 2 shows a schematic flowchart of the method for deploying a heterogeneous platform based on a TVM compiler provided by the present application; referring to FIG. 2, the deployment method is detailed as follows:
S101,获取TVM编译器生成的深度学习网络的拓扑结构图、网络参数和各个函数的函数信息。
在本实施例中,使用TVM编译器可以编译不同深度学习网络框架生成的深度学习网络,经过TVM编译器编译后的深度学习网络可以生成拓扑结构图、网络参数和各个函数的函数信息。
其中,拓扑结构图中包括节点的信息,拓扑结构图中节点是一层一层排列的,且一个节点对应唯一的一个索引码,其中,节点指的是深度学习网络在运行过程中需要存储数据或数据处理的模块。
节点的信息可以包括节点的设备类型、节点的节点类型、节点的输出数据的信息、节点的输入数据的信息、节点的结构信息以及节点的数据类型等。其中,节点的节点类型可以包括函数类型和数据类型,函数类型(TVM OP(Tensor Virtual Machine Operator-TVM操作)的节点指的是节点指的是经过TVM编译后的函数类型的节点,可以是一个函数,可以对数据进行处理;数据类型的节点指的是节点只是存放数据,不对数据进行处理。节点的输出数据的信息包括一个或多个输出tensor,一个输出tensor是一个多维数组;节点的输入数据的信息包括一个或多个输入tensor,一个输入tensor是一个多维数组。节点的结构信息可以包括节点的尺度信息,例如:通道信息、宽度信息和高度信息等。节点的数据类型可以包括整型(int)、短整型(short)和单精度浮点(float32或者float16)等。
节点的设备类型与异构平台的设备类型相对应,异构平台有哪些设备类型,节点的设备类型就包括哪些,其中,节点的设备类型在进行TVM编译器生成拓扑结构的时候已经根据异构平台的设备类型设置完成,节点的设备类型的种类与需要部署的异构平台的设备类型的种类对应,例如,异构平台可以是NNP和DSP的平台,则异构平台的设备类型包括NNP和DSP,节点的设备类型也包括NNP类型的节点和DSP类型的节点。
网络参数指的是深度学习网络的加权(weight)数据。函数的函数信息是以lib文件形式存在的,指的是函数的汇编代码或功能函数所需的信息。
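For illustration only, the node information described above can be held in a small record type such as the following Python sketch. The JSON keys read here ("nodes", "attrs.shape", "attrs.dltype") follow the graph-JSON convention of TVM's graph executor; the per-node device_type attribute, the layer field and the helper name load_topology are assumptions of this sketch rather than anything prescribed by the present application.

```python
import json
from dataclasses import dataclass, field

@dataclass
class Node:
    index: int            # unique index code of the node
    name: str
    node_type: str        # "tvm_op" (function type) or "null" (data type)
    device_type: str      # e.g. "NNP" or "DSP"
    inputs: list          # index codes of the nodes supplying input data
    shape: list = field(default_factory=list)  # e.g. [N, C, H, W]
    dtype: str = "float32"
    layer: int = 0        # row of the node in the topology diagram (assumed annotated)

def load_topology(graph_json_path):
    """Parse a TVM-style graph JSON into Node records. 'nodes',
    'attrs.shape' and 'attrs.dltype' follow the TVM graph executor
    convention; 'device_type' is an assumed per-node annotation."""
    with open(graph_json_path) as f:
        g = json.load(f)
    shapes = g["attrs"]["shape"][1]
    dtypes = g["attrs"]["dltype"][1]
    return [Node(index=i,
                 name=n["name"],
                 node_type=n["op"],
                 device_type=n.get("attrs", {}).get("device_type", "NNP"),
                 inputs=[e[0] for e in n.get("inputs", [])],
                 shape=shapes[i],
                 dtype=dtypes[i])
            for i, n in enumerate(g["nodes"])]
```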
S102: dividing the nodes in the topology diagram into target device blocks corresponding to the device types based on the information of the nodes in the topology diagram and the device types of the heterogeneous platform.
In this embodiment, the division into target device blocks is performed mainly according to the node type, the device type, the input data and the output data in the node information; principally, the function-type nodes in the topology diagram are divided into different target device blocks.
As shown in FIG. 3, in a possible implementation, the implementation process of step S102 may include:
S1021: dividing the nodes into candidate device blocks corresponding to each device type based on the information of the nodes, to obtain the target nodes contained in each candidate device block, where each device type contains at least one candidate device block.
In this embodiment, one device type may include one candidate device block or at least two candidate device blocks; the exact number of candidate device blocks is determined by how the nodes are classified.
As an example, the NNP device type may include one NNP candidate device block, or two or more NNP candidate device blocks.
As shown in FIG. 4, in a possible implementation, the implementation process of step S1021 may include:
S10211: judging whether each node in the topology diagram satisfies a classification condition.
In this embodiment, each node corresponds to a unique index code, and a node can be indexed by its index code.
S10212: if the i-th node satisfies the classification condition, dividing the i-th node, according to its device type, into the j-th candidate device block corresponding to that device type, with i and j greater than or equal to 1. When i = 1, the classification condition of the first node is that its node type is the function type; when i > 1, the classification condition of the i-th node is that its node type is the function type and its input data is the output data of a node in the j-th candidate device block or the output data of a node in the layer above the j-th candidate device block, where the nodes in the topology diagram are distributed layer by layer from top to bottom, and the layer above the j-th candidate device block refers to the layer of nodes that lies outside the j-th candidate device block, above it, and closest to it.
In this embodiment, if the node is the first node and its device type is DSP, it only needs its node type to be the function type to be placed into a DSP candidate device block; since it is the first node, all DSP candidate device blocks are empty, and the first node can be placed into any one of them. If the node is the third node, it is first judged whether it is a function-type node; if not, no further judgment is needed and it is simply skipped. If it is a function-type node of the DSP type, its input data is examined: if the input data of the third node is the output data of a node in the first DSP candidate device block, the third node is included in the first DSP candidate device block; likewise, if the third node is a function-type node of the DSP type and its input data is the output data of a node in the layer that lies outside the first DSP candidate device block, above it, and closest to it, the third node is included in the first DSP candidate device block.
In this embodiment, a candidate device block may be an empty device block set up in advance, or a device block created as the nodes are judged.
As an example, if the device type of the first node is NNP and an NNP candidate device block has been created in advance, the first node is included in that NNP candidate device block; if no NNP candidate device block has been created in advance, a new NNP candidate device block may be created and the first node included in it.
S10213: if the i-th node does not satisfy the classification condition, dividing the i-th node, according to its device type, into the (j+1)-th candidate device block corresponding to that device type, where the (j+1)-th candidate device block contains no node.
In this embodiment, if the i-th node does not satisfy the classification condition, that is, the i-th node cannot be placed into any candidate device block that already contains nodes, the i-th node must be placed in an empty candidate device block. The (j+1)-th candidate device block may be an empty candidate device block that already exists, or a new empty candidate device block created when the i-th node fails the classification condition. For example, if three NNP candidate device blocks already hold nodes of the NNP device type, and the sixth node is a function-type node of the NNP type whose input data is neither the output data of a node in those three NNP candidate device blocks nor the output data of a node in the layer above any of them, the sixth node is placed into a fourth NNP candidate device block.
As an example, as shown in FIG. 5, suppose there are nodes of two device types, NNP and DSP. When a node comes in, it is first judged whether the node is an NNP function-type node;
if the node is of the NNP function type, it is searched whether there exists an NNP candidate device block A that can take the node in: if the input data of the node comes from a node in NNP candidate device block A or from the layer above NNP candidate device block A, the node can be placed into NNP candidate device block A; otherwise, the node is included in an empty NNP candidate device block containing no node, or a new NNP candidate device block is created and the node included in it;
if the node is not an NNP function-type node, it is judged whether the node is a DSP function-type node;
if the node is of the DSP function type, it is searched whether there exists a DSP candidate device block B that can take the node in: if the input data of the node comes from a node in DSP candidate device block B or from the layer above DSP candidate device block B, the node can be placed into DSP candidate device block B; otherwise, the node is included in an empty DSP candidate device block containing no node, or a new DSP candidate device block is created and the node included in it;
if the node is neither an NNP function-type node nor a DSP function-type node, no processing is needed.
It should be noted that, if the current node has at least two inputs and the input data comes from at least two candidate device blocks, the current node must likewise be divided, according to its device type, into the (j+1)-th candidate device block corresponding to that device type.
As an example, if a node is a function-type node of the DSP type, one of its inputs comes from a node in the third DSP candidate device block and the other input comes from a node in the second NNP candidate device block, the node must be placed into an empty DSP candidate device block; if there is no empty DSP candidate device block, a new one may be created.
In the embodiments of the present application, since function-type nodes are the nodes that perform computation or other processing, while data-type nodes merely store data without processing it and exist attached to function nodes, only the function-type nodes need to be classified, from which the candidate device block corresponding to each data node can then be obtained. The classification rules set for node classification both group nodes with input-output relationships together and allow nodes with no relationship at all to be placed in the same candidate device block, so that unrelated nodes within a candidate device block can be computed in parallel. With the method of the present application, all function-type nodes can be classified accurately and quickly, and nodes of different device types can be separated, which facilitates the subsequent deployment of the different devices of the heterogeneous platform. Deploying the heterogeneous platform by device blocks separates the data needed by devices of different device types, avoiding deployment errors caused by having too many nodes or nodes of different device types during deployment.
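As an illustrative sketch of steps S10211 to S10213 (not the claimed implementation), the greedy classification can be written as follows, reusing the Node records from the earlier sketch. The approximation of "the layer above a block" by the row just above the block's topmost member is our assumption:

```python
def classify_nodes(nodes):
    """Divide function-type nodes into candidate device blocks.
    Assumes nodes arrive in topological (index code) order."""
    blocks = []  # candidate device blocks: {"device": ..., "members": set()}

    def block_of(idx):
        return next((b for b in blocks if idx in b["members"]), None)

    for node in nodes:
        if node.node_type != "tvm_op":        # only function-type nodes are classified
            continue
        producer_blocks = {id(b) for i in node.inputs if (b := block_of(i))}
        placed = False
        if len(producer_blocks) <= 1:         # inputs from >= 2 blocks force a new block
            for b in blocks:
                if b["device"] != node.device_type:
                    continue
                top = min(nodes[m].layer for m in b["members"])
                # input comes from inside the block, or from the layer just above it
                if any(i in b["members"] or nodes[i].layer == top - 1
                       for i in node.inputs):
                    b["members"].add(node.index)
                    placed = True
                    break
        if not placed:                        # open a fresh, empty candidate block
            blocks.append({"device": node.device_type,
                           "members": {node.index}})
    return blocks
```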
S1022: arranging the candidate device blocks according to a preset rule and merging consecutive candidate device blocks of the same device type, to obtain the target device blocks.
In this embodiment, every node has an index code. When arranging the candidate device blocks, all candidate device blocks may be arranged in ascending order of the index code of the first target node in each block. If, after arrangement, candidate device blocks of the same device type are consecutive, the consecutive same-type candidate device blocks are merged into one, yielding a merged target device block. Naturally, every candidate device block that needs no merging is itself a target device block, and each target device block corresponds to a unique index code.
Specifically, whether candidate device blocks of the same device type are consecutive may be judged as follows: first find adjacent candidate device blocks of the same device type by searching, then judge, based on the index codes of the candidate device blocks, whether the adjacent same-type candidate device blocks are consecutive. If the index codes are consecutive, the adjacent same-type candidate device blocks are consecutive; if the index codes are not consecutive, they are not.
As an example: candidate device block a is of the NNP type, and the index code of its first target node is 3; candidate device block b is of the NNP type, and the index code of its first target node is 2; candidate device block c is of the NNP type, and the index code of its first target node is 5; candidate device block d is of the DSP type, and the index code of its first target node is 4; candidate device block e is of the DSP type, and the index code of its first target node is 1;
then the order of arrangement by ascending index code is e, b, a, d, c. Since b and a are consecutive and both of the NNP type, b and a can be merged into one block f, giving the ordered sequence e, f, d, c, where e, f, d and c are each a target device block and each target device block corresponds to an index code.
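A minimal sketch of step S1022, under the assumption that "consecutive" means the next block's first index code directly follows the previous block's last one:

```python
def merge_blocks(blocks):
    """Sort candidate device blocks by the index code of their first
    target node, then merge runs of same-type blocks whose index codes
    are continuous. The continuity test used here is one plausible
    reading of the patent's 'consecutive index codes'."""
    blocks.sort(key=lambda b: min(b["members"]))
    merged = []
    for b in blocks:
        prev = merged[-1] if merged else None
        if (prev and prev["device"] == b["device"]
                and min(b["members"]) == max(prev["members"]) + 1):
            prev["members"] |= b["members"]          # e.g. blocks b and a -> f
        else:
            merged.append({"device": b["device"],
                           "members": set(b["members"])})
    return merged   # each target device block then gets its own index code
```

Applied to the example above, blocks b and a merge into f, while e, d and c remain separate target device blocks.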
S1023: determining the input nodes and output nodes of each target device block according to the information of the target nodes in the target device block.
In this embodiment, when a target device block is deployed onto hardware, its inputs and outputs must be known before deployment can proceed. A target device block contains multiple target nodes. When determining the input nodes and output nodes of the target device blocks, the target device blocks can be indexed one by one by their index codes, and the input nodes and output nodes of every target node in each target device block searched, until all target nodes in every target device block have been indexed and the input nodes and output nodes of each target device block determined.
The input nodes include only the nodes corresponding to input data external to the target device block, not the input nodes of input-output transfers between internal nodes. Likewise, the output nodes are only the nodes corresponding to the output data that the target device block ultimately needs to output, not the nodes of input-output transfers between internal nodes.
In a possible implementation, the implementation process of step S1023 may include:
S10231: if an input node of a target node in the target device block satisfies an input-node inclusion condition, including that input node in the current target device block and taking it as an input node of the target device block, where the input-node inclusion condition includes: when the input node of the target node is a data-type node, the corresponding data is input data of the deep learning network; or the input node of the target node is a function-type node and is not contained in the current target device block.
In this embodiment, since only the externally input nodes of the target device block need to be determined, it suffices to check whether the input data of a target node comes from the output data of a node outside the target device block. If the input of the target node is a data-type node whose data is input data of the deep learning network, that node is an input node of the target device block and is included in the set of input nodes of the target device block; the input data of the deep learning network is data entered by the user and is known data. If the input of the target node is a function-type node that is not in the target device block where the target node resides, that is, the node belongs to another target device block, that node is also included in the set of input nodes of the target device block where the target node resides.
S10232: a target node in the target device block that satisfies an output-node inclusion condition is an output node of the current target device block, where the output-node inclusion condition includes: the target node is in the output list of the deep learning network; or the output data of the target node is input data of a target node in a target device block other than the current one.
In this embodiment, since the target nodes are all function-type nodes that process data and then output data, each target node is an output node and corresponds to one piece of output data. Therefore, when determining the output nodes of a target device block, it suffices to determine whether each target node in the block meets the output-node inclusion condition. If the target node is in the output list of the deep learning network, that is, the output list contains the current target node, the current target node is an output node of the target device block; if the current target node is not in the output list but its output data is input data of another target device block, the current target node is likewise taken as an output node of the target device block.
It should be noted that the target nodes, input nodes and output nodes of a target device block are all stored as node index codes; that is, when determining the target nodes, input nodes and output nodes, only the index codes of the nodes need to be recorded. The target device block stores the index codes of the nodes in the topology diagram, and through these index codes the parameters and data information of the relevant nodes can be obtained when the target device block is scheduled.
In the embodiments of the present application, node inclusion conditions are set when determining the input nodes and output nodes of a target device block; according to these conditions it can be quickly determined whether a node is an input node or an output node, and the determination of input nodes and output nodes lays the foundation for the subsequent space allocation and hardware deployment.
In the embodiments of the present application, the candidate device blocks are first arranged in order and merged, and only then are the input nodes and output nodes of the target device blocks determined. If the input nodes and output nodes of the candidate device blocks were determined first, the input and output nodes between candidate device blocks would have to be removed after merging, adding a cumbersome screening step; merging first and then determining the input and output nodes avoids the problem of internal-transfer nodes being included among the input nodes and output nodes, and the method is simpler.
As an example, the target device blocks include one NNP target device block and one DSP target device block; the NNP target device block contains target nodes A and B, and the DSP target device block contains target nodes C and D; the input nodes and output nodes of all target device blocks are to be determined.
It is judged whether the current target device block is the NNP target device block or the DSP target device block. If the current target device block is the NNP target device block: searching the input nodes of target node A finds nodes E and F, which are both data-type nodes whose output data is input data of the deep learning network, so nodes E and F are both input nodes of the NNP target device block; searching the input nodes of target node B finds only target node A, which does not satisfy the input-node inclusion condition, so A is not an input node of the NNP target device block. Target node A is not in the output list of the deep learning network, and no node in the DSP target device block takes A's output data as input data, so A is not an output node of the target device block; target node B is not in the output list of the deep learning network, but the output data of target node B is input data of target node C in the DSP target device block, so B is an output node of the NNP target device block.
If the current target device block is the DSP target device block: searching the input nodes of target node C finds nodes B and G, so the input nodes of target node C are nodes B and G, which are included as an input-node array in the input-node list of the current target device block; node C itself is included as an output-node array in the output-node list of the current target device block. Searching the input nodes of target node D finds only target node C, so node C, as the input-node array of target node D, is included in the input-node list of the current target device block; node D itself is included as an output-node array in the output-node list of the current target device block.
Thus, the input-node list of the NNP target device block is [E, F] and its output-node list is [B]; the input-node list of the DSP target device block is [[B, G], [C]] and its output-node list is [[C], [D]].
It should be noted that the structures of the NNP target device block and the DSP target device block differ: the NNP target device block computes as a single layer, so its output is one group of output data, while each single node in the DSP target device block can compute as a layer, so its output is multiple groups of output data, and the output nodes and input nodes of every layer must be determined.
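The input-node and output-node inclusion conditions of steps S10231 and S10232 can be sketched as below; net_inputs and net_outputs stand for the index sets of the network's input data nodes and of its output list, and all helper names are ours:

```python
def find_block_io(block, nodes, net_inputs, net_outputs):
    """Collect a target device block's external input nodes and its
    block-level output nodes, recording index codes only."""
    members = block["members"]
    in_nodes, out_nodes = [], []
    for m in sorted(members):                 # walk the block's target nodes
        for i in nodes[m].inputs:
            # external input: network input data, or a function-type node
            # belonging to another target device block
            if i in net_inputs or (nodes[i].node_type == "tvm_op"
                                   and i not in members):
                in_nodes.append(i)
        # output: listed as a network output, or consumed outside the block
        consumed_outside = any(m in n.inputs for n in nodes
                               if n.index not in members)
        if m in net_outputs or consumed_outside:
            out_nodes.append(m)
    return in_nodes, out_nodes
```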
S103: performing space allocation on the target device blocks to obtain the space information of the target device blocks.
In this embodiment, if the output data of an input node of a target device block is input data of the deep learning network, it is data entered by the user and needs no space allocation; if the input data of an input node is the output data of a target node of another target device block, the input node is an output node of that other target device block. Therefore, during space allocation only the output nodes in the target device blocks need to be allocated. Space allocation includes storage address allocation, and in the present application addresses are allocated relative to a base address: only offset addresses are allocated.
In a possible implementation, the implementation process of step S103 may include:
S1031: calculating the space size of each output node based on the dimension information of each output node in the target device block and the data type of each output node.
In this embodiment, the space size is the product of the dimension information and the data type. Only with the space size can it be known how much space, that is, how many addresses, an output node should occupy.
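As a concrete illustration of step S1031, assuming the byte widths listed in the table (the table itself is our assumption):

```python
from math import prod

# Byte widths for the data types named above (assumed values).
DTYPE_BYTES = {"int": 4, "short": 2, "float32": 4, "float16": 2}

def node_space(node):
    """Space size of one output node: product of its dimension info
    (e.g. channels * width * height) times the byte width of its
    data type."""
    return prod(node.shape) * DTYPE_BYTES[node.dtype]
```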
S1032: obtaining, based on the space size of each output node, the maximum storage space that all output nodes need to occupy, and taking the maximum storage space as the space information of the target device block.
In this embodiment, if a target device block has only one output node, node space is allocated only for the output data of that output node; if a target device block has multiple output nodes, node space must be allocated for every output node. After node space has been allocated for all output nodes in the target device block, the maximum storage space that all output nodes need to occupy can be obtained.
In the embodiments of the present application, allocating space for the output nodes according to their space sizes makes it possible to obtain exactly the address space each output node occupies; meanwhile, allocating space only for the output nodes in a target device block reduces the node space occupied and speeds up the allocation process.
As shown in FIG. 6, in a possible implementation, the implementation process of step S1032 may include:
S10321: sorting the a output nodes in input-output order to obtain a sorted output-node queue.
In this embodiment, when space is allocated for the output nodes, the target device blocks are indexed one by one by their index codes and the nodes indexed by their node index codes, nodes with smaller index codes first; that is, indexing follows the input-output order, which means all output nodes are sorted in input-output order.
As an example, the output node of target device block A is a and the output node of target device block B is b; target device block A precedes B, so output node a is indexed first and output node b afterwards, and the ordering of output nodes a and b places a first and b second.
S10322: allocating, based on the space size of the 1st output node, a 1st node space for the 1st output node in a first storage space.
In this embodiment, since it is the 1st output node and none of the allocatable space in the first storage space is occupied, the 1st output node can be allocated space directly, obtaining one node space. The 1st node space stores the start address and end address of the 1st output node.
In this embodiment, the space may include a data space and a global space. The data space is space that can be reused; that is, the same node space can be used by different output nodes at different times. In the present application, the first storage space refers to the data space. The global space is space that cannot be reused: once an output node occupies a node space in it, that node space can no longer be used by other output nodes. In the present application, the second storage space refers to the global space.
S10323: searching, according to the space size of the n-th output node, the allocated node spaces for a target node space, where the target node space is a node space that is currently in an idle state and larger than the space size of the n-th output node, 2 ≤ n ≤ a-1.
In this embodiment, when allocating node space for the n-th output node, since previously allocated output nodes already occupy part of the node space, it is first necessary to look, among the released node spaces, that is, the node spaces in the idle state, for a node space the n-th output node can use; such a node space must be larger than the space size of the n-th output node.
S10324: if a target node space exists among the allocated node spaces, allocating, according to the space size of the n-th output node, an n-th node space for the n-th output node within the target node space.
In this embodiment, if there is a node space that is idle and larger than the space size of the n-th output node, the n-th node space can be allocated for the n-th output node within it, achieving reuse of node space.
S10325: if no target node space exists among the allocated node spaces, allocating, according to the space size of the n-th output node, an n-th node space for the n-th output node after the (n-1)-th node space.
In this embodiment, if no target node space exists, a corresponding node space can be allocated for the n-th output node in the first storage space; if the first storage space has no space satisfying the space size of the n-th output node, the first storage space grows accordingly.
S10326: after allocating the n-th node space for the n-th output node, determining whether there is, among all current node spaces, an output node satisfying a release condition. In this embodiment, satisfying the release condition includes: not being an input node of any output node after the n-th output node, and not being an input node of the n-th output node.
S10327: if there is, releasing the node space of the output node satisfying the release condition, where the released node space is in the idle state.
In this embodiment, after the allocation for the n-th output node is completed, the allocated node spaces are arranged in ascending order, and the currently existing node spaces are checked for any node space that needs to be released; if there is, it is released, so that the current node spaces retain only the node spaces of the output nodes needed for the computation of the n-th output node.
In this embodiment, the current node spaces hold the node spaces needed for the computation of the n-th output node, including the node space of the n-th output node's output data; if the input data is supplied by an output node, the node space of that output node is also kept, and if the input data of the n-th output node is not supplied by an output node, only the node space of the n-th output node's output data needs to be kept.
The current node spaces must also retain the node spaces corresponding to the input nodes of output nodes that have not yet been allocated node space, because those output nodes will be used in the computation of subsequently allocated output nodes; keeping them in advance makes later use convenient, whereas if they were released, no usable data could be found when needed later. The node spaces of all other output nodes that will not be used later are released.
In this embodiment, if no output node satisfies the release condition, there is no node space to release.
S10328: allocating, according to the space size of the a-th output node, an a-th node space for the a-th output node in a second storage space.
In this embodiment, if it is the last output node, a global space must be allocated for it, rather than a node space within the dynamic space in which the 1st to (a-1)-th output nodes are allocated.
In this embodiment, the 1st node space through the a-th node space constitute the space information of the target device block.
It should be noted that, after allocating node space for the last output node, the node spaces corresponding to output nodes in the current node spaces that satisfy the release condition may also be released; if no node space corresponds to an output node satisfying the release condition, no release is performed.
S10329: determining, based on the 1st node space through the a-th node space, the maximum storage space jointly occupied by the a output nodes.
In this embodiment, after node space has been allocated for every output node, the maximum total storage space that all output nodes together need to occupy can be obtained; since some output nodes occupy the same storage space, the maximum storage space is smaller than the sum of the node spaces of all output nodes.
As an example, if the first output node occupies 5 bytes of space, the second output node occupies 8 bytes of space, and the third output node occupies the space released by the first output node, taking 3 bytes, then the maximum storage space occupied by the 3 output nodes is 5 + 8 = 13 bytes.
It should be noted that, since the a-th output node is allocated node space in the second storage space while all other output nodes are allocated node space in the first storage space, the maximum storage space includes a first maximum storage space jointly occupied by the first a-1 output nodes and a second maximum storage space occupied by the a-th output node, where the second maximum storage space is the a-th node space.
In the embodiments of the present application, allocating space for the target device blocks, on the one hand, yields the node spaces of the output nodes in the device blocks, which is convenient for invocation; on the other hand, the total space size of all target device blocks corresponding to each device type can be obtained, which makes it convenient for the heterogeneous platform to allocate space for the deep learning network during deployment. The present application releases unused data after node space allocation, so that the module storing the node spaces retains only the data currently in use; the data of the deep learning network remains clean and uncontaminated at run time, guaranteeing its normal operation, and node spaces can be reused, which is beneficial to space multiplexing.
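The allocation of steps S10321 to S10329 can be sketched as a free-list allocator over offsets; size_of and needed_at are assumed helpers (the latter encodes the keep side of the release condition), and the sketch is illustrative rather than the claimed implementation:

```python
def allocate(outputs, size_of, needed_at):
    """Offset-only allocator. `outputs` is the sorted queue of output-node
    index codes; `size_of(n)` gives a node's space size; `needed_at(m, k)`
    says whether node m is an input of the output node at position k or of
    any later one. All addresses are offsets from an unspecified base."""
    free = []            # released slots in the data space: (offset, size)
    live = {}            # node -> (offset, size) currently held
    cursor = 0           # growth point of the first (data) storage space
    peak = 0             # first maximum storage space
    for k, n in enumerate(outputs[:-1]):
        need = size_of(n)
        slot = next((s for s in free if s[1] >= need), None)
        if slot:                                   # reuse a released slot
            free.remove(slot)
            off = slot[0]
            if slot[1] > need:                     # keep the remainder idle
                free.append((off + need, slot[1] - need))
        else:                                      # grow after the last slot
            off, cursor = cursor, cursor + need
            peak = max(peak, cursor)
        live[n] = (off, need)
        # release every held node no current-or-later output still needs
        for m in [m for m in live if m != n and not needed_at(m, k)]:
            free.append(live.pop(m))
    last_space = size_of(outputs[-1])              # second (global) storage space
    return peak, last_space
```

Here peak corresponds to the first maximum storage space and last_space to the second; as in the example of FIG. 7 below, slots taken from the free list do not enlarge the total.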
As an example, consider NNP target device block A and DSP target device block B, where A connects to B and A precedes B.
Space is allocated for A. A is an NNP target device block; the way an NNP target device block implements its functionality dictates that it has only one layer of computation, so A has only one group of output data. Node space is allocated for the output node corresponding to the output data of A; after allocation, the index code of the target device block is incremented by 1 and space allocation proceeds to the next target device block.
Space is allocated for B. B is a DSP target device block; the way a DSP target device block implements its functionality dictates that it has multiple layers of computation, so B has multiple groups of output data. Node space is allocated for the output node corresponding to each layer of output data in B; after allocation, the index code of the target device block is incremented by 1 and space allocation proceeds to the next target device block.
As an example, a specific node space allocation method is shown in FIG. 7:
If a target device block has a total of 6 output nodes requiring node space allocation, then:
First: node space is allocated for the first output node by directly growing a space of size 1 (buffer space), obtaining the 1st node space.
Second: node space is allocated for the second output node. The second output node has two pieces of output data, so both must be allocated; spaces of sizes 2 and 3 are grown, and 2 and 3 form the 2nd node space. Since the input nodes of the second output node include the first output node, space 1 does not need to be released.
Third: node space is allocated for the third output node. The third output node has one piece of output data; a space of size 4 is grown, and 4 is the 3rd node space. Since the input nodes of the third output node include the second output node, neither 2 nor 3 needs to be released; the first output node is not an input node of any output node that has not yet been allocated space, so 1 is released.
Fourth: node space is allocated for the fourth output node. The fourth output node has one piece of output data; a space of size 5 is grown. The released space 1 can satisfy the needs of 5, so 5 can occupy the released space of 1, and 5 is the 4th node space. Since the input nodes of the fourth output node include the third output node, 4 does not need to be released; 2 is not used later and can be released, while 3 is still needed later and must be kept.
Fifth: node space is allocated for the fifth output node. The fifth output node has one piece of output data; a space of size 6 is grown, and 6 is the 5th node space. Since the input nodes of the fifth output node include the fourth output node, 5 does not need to be released; 4 is not used later and can be released.
Sixth: the sixth output node is the last output node, for which a global space must be allocated. Since the figure shows only dynamically allocated node spaces, it is not allocated or displayed in the figure; allocating space for the sixth output node yields the 6th node space. In addition, in this step the node spaces of output nodes no longer in use can be released: 3 is no longer used and can be released.
Based on the above node space allocation, it can be seen from FIG. 7 that the 4th node space and the 5th node space both occupy space released by earlier output nodes; therefore, the maximum storage space occupied by the first through sixth output nodes in total is the sum of the 1st node space, the 2nd node space, the 3rd node space and the 6th node space.
S104: deploying the deep learning network on the heterogeneous platform based on the topology diagram, the network parameters, the function information, the target device blocks and the space information.
In this embodiment, after the nodes have been divided into target device blocks and node space has been configured for the target device blocks, the target device blocks can be deployed onto the heterogeneous platform in their arranged order; during deployment the target device blocks are likewise deployed one by one through their index codes.
As shown in FIG. 8, in a possible implementation, the implementation process of step S104 may include:
S1041: obtaining the deployment information of the m-th target device block based on the input data of the input nodes in the m-th target device block arranged according to the preset rule, the output data of the output nodes, the space information, the function information and the network parameters.
In this embodiment, different device types require different deployment information, but all require the input data and output data of the target nodes in the target device block to be collected before deployment can proceed. The input data includes the structure information of the data, the space information of the data, and the input data information contained in the node information; the information collected for the output data is the same as for the input data. The input data and output-node data are then assembled into structured input data and output data.
The deployment information includes the structured input data, the output data, the function information, the network parameters, and the device configuration information of the corresponding device on the heterogeneous platform. The device configuration information is pre-stored and set according to the device type, and may include information such as the number of images processed, the code length, and how the code is started.
As an example, for an NNP target device block, the function information is the assembly code produced by TVM compilation. The NNP target device block requires the structured input data, the output data, the TVM-compiled assembly code and the network parameters to be put into the NNP compilation tool for compilation, generating binary code; the deployment information includes the binary code, the structured input data, the structured output data and the device configuration information.
For a DSP target device block, the function information is the functional information of each layer's function in the DSP target device block; the deployment information includes the functional information of each layer's function in the DSP target device block, the structured input data of each layer, the structured output data of each layer's function, and the device configuration information.
It should be noted that, in the case of online deployment, after the deployment information has been collected, overhead information must be added to it. The overhead information is set according to device scheduling and may include the number of layers of the target device block and the input data and output data of the target device block, etc.
S1042: judging whether the deployment information of the m-th target device block satisfies a deployment condition, where the deployment condition is that the m-th target device block does not need to wait for the output data of the (m-1)-th target device block as input data.
In this embodiment, since the target device blocks are deployed in order, the current target device block may need the output data of a preceding target device block as its input data; if the preceding target device block has not produced its output data, the current target device block cannot obtain its input data and therefore cannot be deployed, and must wait for the data before continuing.
S1043: if the deployment condition is satisfied, sending the deployment information of the m-th target device block to the device in the heterogeneous platform whose device type is the same as that of the m-th target device block.
In this embodiment, when the deployment condition is satisfied, the input data of the m-th target device block has all been collected and there is no need to wait for the output data of the (m-1)-th target device block, so the deployment information can be directly assembled into the corresponding structure or message and sent to the corresponding device.
S1044: if the deployment condition is not satisfied, sending the deployment information of the m-th target device block to the device in the heterogeneous platform whose device type is the same as that of the m-th target device block after the output data of the (m-1)-th target device block has been obtained.
In this embodiment, when the deployment condition is not satisfied, the m-th target device block needs to wait for the output data of the (m-1)-th target device block as input data, so deployment must be interrupted; after the computation of the (m-1)-th target device block is completed and its output data obtained, the deployment information is assembled into the corresponding structure or message and sent to the corresponding device.
In the embodiments of the present application, through the information of the nodes in the topology diagram generated by the TVM compiler, the nodes in the topology diagram are divided into target device blocks corresponding to the device types of the heterogeneous platform; space is then allocated for the target device blocks to obtain the space information of each target device block; finally, based on the acquired network parameters generated by the TVM compiler, the function information of each function, the topology diagram, the target device blocks and the space information, the information in each target device block is deployed onto the devices of the heterogeneous platform, completing the deployment of the heterogeneous platform. By dividing the nodes into target device blocks of different device types, the different devices of the heterogeneous platform are deployed through target device blocks of different types. Compared with the prior art, which can only deploy onto a single device platform, the present application preserves the integrity of the deep learning network; moreover, the deployment of the deep learning network is based on the TVM compiler, supporting deep learning networks obtained from different deep learning frameworks.
It should be noted that the present application is applicable not only to online deployment but also to offline deployment. For offline deployment, the deployment information only needs to be stored in the corresponding bin files, generally including a network structure bin file and a network parameter bin file; parsing of these bin files is then developed on the heterogeneous platform, and scheduling and deployment are performed according to the structure of the target device blocks.
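For illustration, the per-block deployment loop of steps S1041 to S1044 can be sketched as follows; collect_info, ready and the platform interface are assumptions of the sketch, not part of the patent:

```python
def deploy(blocks, platform, collect_info, ready):
    """Deploy target device blocks in their preset order. `collect_info`
    stands for assembling a block's deployment information (structured
    input/output data, space information, function information, network
    parameters and device configuration); `ready(block)` is the
    deployment condition of step S1042."""
    for m, block in enumerate(blocks):
        info = collect_info(block)
        if not ready(block):
            # wait until the (m-1)-th block has produced its output data
            platform.wait_for_output(blocks[m - 1])
        # send to the device whose type matches the block's device type
        platform.send(device=block["device"], payload=info)
```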
In a possible implementation, the implementation process of step S102 may also include:
dividing the nodes into candidate device blocks corresponding to each device type based on the information of the nodes and the device types, to obtain the target nodes contained in each candidate device block, where each device type contains at least one candidate device block;
determining the input nodes and output nodes of each candidate device block according to the information of the target nodes in the candidate device block;
arranging the candidate device blocks according to the preset rule and merging consecutive candidate device blocks of the same device type, to obtain the target device blocks.
In this embodiment, when candidate device blocks are merged, if the candidate device blocks are of the NNP type, the input nodes and output nodes used for internal transfer within the candidate device blocks to be merged must be removed, leaving only the input nodes and output nodes of the final target device block; if they are of the DSP type, the input nodes and output nodes simply need to be merged.
As an example, taking NNP candidate device blocks: candidate device block A has input nodes a and b and output node c, where c is a target node in candidate device block A; candidate device block B has input nodes c and d and output node e, where e is a target node in candidate device block B; candidate device block A and candidate device block B are merged into target device block C, whose input nodes are then a, b and d and whose output node is e.
It should be understood that the size of the sequence numbers of the steps in the above embodiments does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Corresponding to the method for deploying a heterogeneous platform based on a TVM compiler described in the above embodiments, FIG. 9 shows a structural block diagram of the apparatus for deploying a heterogeneous platform based on a TVM compiler provided by an embodiment of the present application; for convenience of description, only the parts relevant to the embodiments of the present application are shown.
Referring to FIG. 9, the apparatus 300 may include: a data acquisition module 310, a node division module 320, a space allocation module 330 and a network deployment module 340.
The data acquisition module 310 is configured to acquire a topology diagram, network parameters and function information of each function of a deep learning network generated by a TVM compiler;
the node division module 320 is configured to divide the nodes in the topology diagram into target device blocks corresponding to the device types based on the information of the nodes in the topology diagram and the device types of the heterogeneous platform;
the space allocation module 330 is configured to perform space allocation on the target device blocks to obtain the space information of the target device blocks;
the network deployment module 340 is configured to deploy the deep learning network on the heterogeneous platform based on the topology diagram, the network parameters, the function information, the target device blocks and the space information of the target device blocks.
In a possible implementation, the node division module 320 may specifically include:
a block division unit configured to divide the nodes into candidate device blocks corresponding to each device type based on the information of the nodes and the device types, to obtain the target nodes contained in each candidate device block, where each device type contains at least one candidate device block;
a block ordering unit configured to arrange the candidate device blocks according to a preset rule and merge consecutive candidate device blocks of the same device type, to obtain the target device blocks;
an input/output node determination unit configured to determine the input nodes and output nodes of each target device block according to the information of the target nodes in the target device block.
In a possible implementation, the information of a node includes the node type of the node, the device type of the node, the input data of the node and the output data of the node;
the block division unit may specifically be configured to:
judge whether each node in the topology diagram satisfies a classification condition;
if the i-th node satisfies the classification condition, divide the i-th node, according to its device type, into the j-th candidate device block corresponding to that device type, with i and j greater than or equal to 1, where, when i = 1, the classification condition of the first node is that its node type is the function type, and when i > 1, the classification condition of the i-th node is that its node type is the function type and its input data is the output data of a node in the j-th candidate device block or the output data of a node in the layer above the j-th candidate device block, where the nodes in the topology diagram are distributed layer by layer from top to bottom, and the layer above the j-th candidate device block refers to the layer of nodes that lies outside the j-th candidate device block, above it, and closest to it.
In a possible implementation, the block division unit may further specifically be configured to:
if the i-th node does not satisfy the classification condition, divide the i-th node, according to its device type, into the (j+1)-th candidate device block corresponding to that device type, where the (j+1)-th candidate device block contains no node.
In a possible implementation, the input/output node determination unit may specifically be configured to:
if an input node of a target node in the target device block satisfies the input-node inclusion condition, include the input node of the target node in the current target device block and take it as an input node of the target device block, where the input-node inclusion condition includes: when the input node of the target node is a data-type node, the corresponding data is input data of the deep learning network; or the input node of the target node is a function-type node and is not contained in the current target device block;
a target node in the target device block that satisfies the output-node inclusion condition is an output node of the current target device block, where the output-node inclusion condition includes: the target node is in the output list of the deep learning network; or the output data of the target node is input data of a target node in a target device block other than the current one.
In a possible implementation, the space allocation module 330 may specifically include:
a space calculation unit configured to calculate the space size of each output node based on the dimension information of each output node in the target device block and the data type of each output node;
a space allocation unit configured to obtain, based on the space size of each output node, the maximum storage space that all output nodes need to occupy, and take the maximum storage space as the space information of the target device block.
In a possible implementation, the space allocation unit may specifically be configured to:
sort the a output nodes in input-output order to obtain a sorted output-node queue;
allocate, based on the space size of the 1st output node, a 1st node space for the 1st output node in the first storage space;
search, according to the space size of the n-th output node, the allocated node spaces for a target node space, where the target node space is a node space that is currently in the idle state and larger than the space size of the n-th output node, 2 ≤ n ≤ a-1;
if a target node space exists among the allocated node spaces, allocate, according to the space size of the n-th output node, an n-th node space for the n-th output node within the target node space;
if no target node space exists among the allocated node spaces, allocate, according to the space size of the n-th output node, an n-th node space for the n-th output node after the (n-1)-th node space;
after allocating the n-th node space for the n-th output node, determine whether there is, among all current node spaces, an output node satisfying the release condition;
if there is, release the node space of the output node satisfying the release condition, where the released node space is in the idle state;
allocate, according to the space size of the a-th output node, an a-th node space for the a-th output node in the second storage space;
determine, based on the 1st node space through the a-th node space, the maximum storage space jointly occupied by the a output nodes.
In a possible implementation, the network deployment module 340 may specifically be configured to:
obtain the deployment information of the m-th target device block based on the input data of the input nodes in the m-th target device block arranged according to the preset rule, the output data of the output nodes, the space information, the function information and the network parameters;
judge whether the deployment information of the m-th target device block satisfies the deployment condition, where the deployment condition is that the m-th target device block does not need to wait for the output data of the (m-1)-th target device block as input data;
if the deployment condition is satisfied, send the deployment information of the m-th target device block to the device in the heterogeneous platform whose device type is the same as that of the m-th target device block.
In a possible implementation, the network deployment module 340 may further specifically be configured to:
if the deployment condition is not satisfied, send the deployment information of the m-th target device block to the device in the heterogeneous platform whose device type is the same as that of the m-th target device block after the output data of the (m-1)-th target device block has been obtained.
It should be noted that, since the information exchange and execution processes between the above apparatus/units are based on the same concept as the method embodiments of the present application, their specific functions and technical effects can be found in the method embodiment section and are not repeated here.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the division of the above functional units and modules is only used as an example; in practical applications, the above functions may be assigned to different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated in one processing unit, or each unit may physically exist alone, or two or more units may be integrated in one unit; the integrated units may be implemented in the form of hardware or in the form of software functional units. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from one another and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps in the various embodiments of the above method for deploying a heterogeneous platform based on a TVM compiler.
An embodiment of the present application provides a computer program product which, when run on a mobile terminal, causes the mobile terminal to implement the steps in the various embodiments of the above method for deploying a heterogeneous platform based on a TVM compiler.

Claims (11)

  1. A method for deploying a heterogeneous platform based on a TVM compiler, characterized by comprising:
    acquiring a topology diagram, network parameters and function information of each function of a deep learning network generated by a TVM compiler;
    dividing the nodes in the topology diagram into target device blocks corresponding to the device types based on the information of the nodes in the topology diagram and the device types of the heterogeneous platform;
    performing space allocation on the target device blocks to obtain space information of the target device blocks;
    deploying the deep learning network on the heterogeneous platform based on the topology diagram, the network parameters, the function information, the target device blocks and the space information of the target device blocks.
  2. The method for deploying a heterogeneous platform based on a TVM compiler according to claim 1, characterized in that the dividing the nodes in the topology diagram into target device blocks corresponding to the device types based on the information of the nodes in the topology diagram and the device types of the heterogeneous platform comprises:
    dividing the nodes into candidate device blocks corresponding to each device type based on the information of the nodes and the device types, to obtain target nodes contained in each candidate device block, wherein each device type contains at least one candidate device block;
    arranging the candidate device blocks according to a preset rule and merging consecutive candidate device blocks of the same device type, to obtain the target device blocks;
    determining input nodes and output nodes of each target device block according to the information of the target nodes in the target device block.
  3. The method for deploying a heterogeneous platform based on a TVM compiler according to claim 2, characterized in that the information of a node comprises the node type of the node, the device type of the node, the input data of the node and the output data of the node;
    the dividing the nodes into candidate device blocks corresponding to each device type based on the information of the nodes and the device types, to obtain the target nodes contained in each candidate device block, comprises:
    judging whether each node in the topology diagram satisfies a classification condition;
    if the i-th node satisfies the classification condition, dividing the i-th node, according to the device type of the i-th node, into the j-th candidate device block corresponding to the device type of the i-th node, i and j being greater than or equal to 1, wherein, when i = 1, the classification condition of the first node is that the node type is the function type, and when i > 1, the classification condition of the i-th node is that the node type of the i-th node is the function type and the input data of the i-th node is the output data of a node in the j-th candidate device block or the output data of a node in the layer above the j-th candidate device block, wherein the nodes in the topology diagram are distributed layer by layer from top to bottom, and the layer above the j-th candidate device block refers to the layer of nodes distributed outside the j-th candidate device block, above the j-th candidate device block and closest to the j-th candidate device block.
  4. The method for deploying a heterogeneous platform based on a TVM compiler according to claim 3, characterized in that, after the judging whether each node in the topology diagram satisfies a classification condition, the method further comprises:
    if the i-th node does not satisfy the classification condition, dividing the i-th node, according to the device type of the i-th node, into the (j+1)-th candidate device block corresponding to the device type of the i-th node, wherein no node exists in the (j+1)-th candidate device block.
  5. The method for deploying a heterogeneous platform based on a TVM compiler according to claim 2, characterized in that the determining the input nodes and output nodes of each target device block according to the information of the target nodes in the target device block comprises:
    if an input node of a target node in the target device block satisfies an input-node inclusion condition, including the input node of the target node in the current target device block and taking the input node as an input node of the target device block, wherein the input-node inclusion condition comprises: when the input node of the target node is a data-type node, the corresponding data is input data of the deep learning network; or the input node of the target node is a function-type node and is not contained in the current target device block;
    a target node in the target device block that satisfies an output-node inclusion condition is an output node of the current target device block, wherein the output-node inclusion condition comprises: the target node is in an output list of the deep learning network; or the output data of the target node is input data of a target node in a target device block other than the current target device block.
  6. The method for deploying a heterogeneous platform based on a TVM compiler according to any one of claims 1 to 5, characterized in that the performing space allocation on the target device blocks to obtain the space information of the target device blocks comprises:
    calculating the space size of each output node based on the dimension information of each output node in the target device block and the data type of each output node;
    obtaining, based on the space size of each output node, the maximum storage space that all output nodes need to occupy, and taking the maximum storage space as the space information of the target device block.
  7. The method for deploying a heterogeneous platform based on a TVM compiler according to claim 6, characterized in that the obtaining, based on the space size of each output node, the maximum storage space that all output nodes need to occupy comprises:
    sorting the a output nodes in input-output order to obtain a sorted output-node queue;
    allocating, based on the space size of the 1st output node, a 1st node space for the 1st output node in a first storage space;
    searching, according to the space size of the n-th output node, allocated node spaces for a target node space, wherein the target node space is a node space that is currently in an idle state and larger than the space size of the n-th output node, 2 ≤ n ≤ a-1;
    if the target node space exists among the allocated node spaces, allocating, according to the space size of the n-th output node, an n-th node space for the n-th output node within the target node space;
    if the target node space does not exist among the allocated node spaces, allocating, according to the space size of the n-th output node, an n-th node space for the n-th output node after the (n-1)-th node space;
    after allocating the n-th node space for the n-th output node, determining whether an output node satisfying a release condition exists among all current node spaces;
    if so, releasing the node space of the output node satisfying the release condition, wherein the released node space is in the idle state;
    allocating, according to the space size of the a-th output node, an a-th node space for the a-th output node in a second storage space;
    determining, based on the 1st node space through the a-th node space, the maximum storage space jointly occupied by the a output nodes.
  8. The method for deploying a heterogeneous platform based on a TVM compiler according to claim 2, characterized in that the deploying the deep learning network on the heterogeneous platform based on the topology diagram, the network parameters, the function information, the target device blocks and the space information of the target device blocks comprises:
    obtaining deployment information of the m-th target device block based on the input data of the input nodes in the m-th target device block arranged according to the preset rule, the output data of the output nodes, the space information, the function information and the network parameters;
    judging whether the deployment information of the m-th target device block satisfies a deployment condition, wherein the deployment condition is that the m-th target device block does not need to wait for the output data of the (m-1)-th target device block as input data;
    if the deployment condition is satisfied, sending the deployment information of the m-th target device block to a device in the heterogeneous platform having the same device type as the m-th target device block.
  9. The method for deploying a heterogeneous platform based on a TVM compiler according to claim 8, characterized in that, after the judging whether the deployment information of the m-th target device block satisfies a deployment condition, the method further comprises:
    if the deployment condition is not satisfied, sending the deployment information of the m-th target device block to the device in the heterogeneous platform having the same device type as the m-th target device block after the output data of the (m-1)-th target device block has been obtained.
  10. An apparatus for deploying a heterogeneous platform based on a TVM compiler, characterized by comprising:
    a data acquisition module configured to acquire a topology diagram, network parameters and function information of each function of a deep learning network generated by a TVM compiler;
    a node division module configured to divide the nodes in the topology diagram into target device blocks corresponding to the device types based on the information of the nodes in the topology diagram and the device types of the heterogeneous platform;
    a space allocation module configured to perform space allocation on the target device blocks to obtain space information of the target device blocks;
    a network deployment module configured to deploy the deep learning network on the heterogeneous platform based on the topology diagram, the network parameters, the function information, the target device blocks and the space information of the target device blocks.
  11. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method for deploying a heterogeneous platform based on a TVM compiler according to any one of claims 1 to 9.
  12. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method for deploying a heterogeneous platform based on a TVM compiler according to any one of claims 1 to 9.
PCT/CN2021/088594 2020-07-10 2021-04-21 Method, apparatus and terminal device for deploying a heterogeneous platform based on a TVM compiler WO2022007468A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/623,902 US20240036844A1 (en) 2020-07-10 2021-04-21 Deployment method and deployment device of heterogeneous platform based on tvm compiler

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010654954.X 2020-07-10
CN202010654954.XA CN111915016B (zh) 2020-07-10 2020-07-10 Deployment method and apparatus of heterogeneous platform based on TVM compiler

Publications (1)

Publication Number Publication Date
WO2022007468A1 true WO2022007468A1 (zh) 2022-01-13

Family

ID=73228010

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088594 2020-07-10 2021-04-21 Method, apparatus and terminal device for deploying a heterogeneous platform based on a TVM compiler WO2022007468A1 (zh)

Country Status (3)

Country Link
US (1) US20240036844A1 (zh)
CN (1) CN111915016B (zh)
WO (1) WO2022007468A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111915016B (zh) 2020-07-10 2022-03-25 深圳云天励飞技术股份有限公司 Deployment method and apparatus of heterogeneous platform based on TVM compiler

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239315A (zh) * 2017-04-11 2017-10-10 北京深鉴智能科技有限公司 Programming model for neural network heterogeneous computing platform
US20190391796A1 (en) * 2019-06-28 2019-12-26 Intel Corporation Control of scheduling dependencies by a neural network compiler
CN110766147A (zh) * 2018-07-25 2020-02-07 赛灵思公司 Neural network compiler architecture and compilation method
US20200106717A1 (en) * 2018-09-28 2020-04-02 International Business Machines Corporation Selective multicast delivery on a bus-based interconnect
CN111104120A (zh) * 2018-10-29 2020-05-05 赛灵思公司 Neural network compilation method and system, and corresponding heterogeneous computing platform
CN111915016A (zh) * 2020-07-10 2020-11-10 深圳云天励飞技术有限公司 Deployment method and apparatus of heterogeneous platform based on TVM compiler

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105678379B (zh) * 2016-01-12 2020-08-07 腾讯科技(深圳)有限公司 CNN processing method and apparatus
WO2018081654A1 (en) * 2016-10-28 2018-05-03 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing
US20180307987A1 (en) * 2017-04-24 2018-10-25 Intel Corporation Hardware ip optimized convolutional neural network
CN107678752B (zh) * 2017-08-31 2021-09-21 北京百度网讯科技有限公司 Task processing method and apparatus for heterogeneous clusters
CN108764487B (zh) * 2018-05-29 2022-07-08 北京百度网讯科技有限公司 Method and apparatus for generating a model, and method and apparatus for recognizing information
CN110764744B (zh) * 2018-07-25 2023-12-08 赛灵思公司 Intermediate representation generation method and apparatus for neural network computation
CN111258744A (zh) * 2018-11-30 2020-06-09 中兴通讯股份有限公司 Task processing method based on heterogeneous computing, and software and hardware framework system
CN110503195A (zh) * 2019-08-14 2019-11-26 北京中科寒武纪科技有限公司 Method for executing tasks using an artificial intelligence processor and related products
CN111290762B (zh) * 2020-01-19 2023-05-12 深圳云天励飞技术有限公司 Deployment method and apparatus of deep learning network, and terminal device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239315A (zh) * 2017-04-11 2017-10-10 北京深鉴智能科技有限公司 Programming model for neural network heterogeneous computing platform
CN110766147A (zh) * 2018-07-25 2020-02-07 赛灵思公司 Neural network compiler architecture and compilation method
US20200106717A1 (en) * 2018-09-28 2020-04-02 International Business Machines Corporation Selective multicast delivery on a bus-based interconnect
CN111104120A (zh) * 2018-10-29 2020-05-05 赛灵思公司 Neural network compilation method and system, and corresponding heterogeneous computing platform
US20190391796A1 (en) * 2019-06-28 2019-12-26 Intel Corporation Control of scheduling dependencies by a neural network compiler
CN111915016A (zh) * 2020-07-10 2020-11-10 深圳云天励飞技术有限公司 Deployment method and apparatus of heterogeneous platform based on TVM compiler

Also Published As

Publication number Publication date
CN111915016B (zh) 2022-03-25
CN111915016A (zh) 2020-11-10
US20240036844A1 (en) 2024-02-01

Similar Documents

Publication Publication Date Title
Khorasani et al. Scalable simd-efficient graph processing on gpus
CN101296114B (zh) Stream-based parallel pattern matching method and system
JP5425541B2 (ja) Method and apparatus for partitioning and sorting a data set on a multiprocessor system
US9286123B2 (en) Apparatus and method for managing stream processing tasks
CN110297699B (zh) Scheduling method, scheduler, storage medium and system
KR101110904B1 (ko) Method and apparatus for performing channel tree operations
JP6376865B2 (ja) Computer-implemented method, storage medium and computer system for parallel tree-based prediction
US20100031008A1 (en) Parallel sorting apparatus, method, and program
WO2015117565A1 (en) Methods and systems for dynamically allocating resources and tasks among database work agents in smp environment
JPH09171503A (ja) Parallel processing method and parallel processing apparatus
US20130227244A1 (en) Workload-aware distributed data processing apparatus and method for processing large data based on hardware acceleration
US7920282B2 (en) Job preempt set generation for resource management
US8869149B2 (en) Concurrency identification for processing of multistage workflows
CN106033442B (zh) Parallel breadth-first search method based on shared-memory architecture
CN108427602B (zh) Collaborative scheduling method and apparatus for distributed computing tasks
Schlag et al. Scalable edge partitioning
CN110231986A (zh) Dynamically reconfigurable multi-task scheduling and placement method based on multiple FPGAs
WO2022007468A1 (zh) Method, apparatus and terminal device for deploying a heterogeneous platform based on a TVM compiler
CN112015765B (zh) Spark cache eviction method and system based on cache value
CN105874426B (zh) Batch processing method and apparatus for system call commands
CN110021345A (zh) Gene data analysis method based on Spark platform
JPH11259433A (ja) Parallel execution system
CN108108242B (zh) Intelligent distribution control method for a storage layer based on big data
CN106980673A (zh) In-memory database table index updating method and system
CN113297537B (zh) High-performance implementation method and apparatus for solving sparse structured triangular equation systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21838875

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21838875

Country of ref document: EP

Kind code of ref document: A1