WO2023160236A1 - Slicing method and apparatus for multi-output neural network, and chip and storage medium - Google Patents

Slicing method and apparatus for multi-output neural network, and chip and storage medium

Info

Publication number
WO2023160236A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
slicing
output data
network
output
Prior art date
Application number
PCT/CN2022/143527
Other languages
French (fr)
Chinese (zh)
Inventor
尹长生
蔡万伟
陈宁
Original Assignee
深圳云天励飞技术股份有限公司
Priority date
Filing date
Publication date
Application filed by 深圳云天励飞技术股份有限公司
Publication of WO2023160236A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application belongs to the technical field of model processing, and in particular relates to a multi-output neural network slicing method, device, chip and storage medium.
  • the neural network compilation framework (Tensor Virtual Machine, TVM) is a unified software stack for different neural network frameworks and hardware platforms, and can deploy neural networks under different frameworks on the hardware platform.
  • a neural network processes input data in a smart chip to obtain output data.
  • the available space inside the smart chip is often relatively small, and for a neural network that requires a large amount of storage space to process data, the available space inside the smart chip is far from enough. Therefore, in the process of data processing, for any node included in the neural network, it is often necessary to first store the output data of the node in an external memory. Then, the node's input data currently held in the storage space and the output data it has generated are deleted. Finally, the output data of the node is fetched from the external memory and used as the input data of the next node to complete the data processing.
  • in order to reduce the available space required by the neural network for data processing, it is necessary to first perform graph segmentation on the neural network to obtain network subgraphs. Afterwards, the segmented network subgraphs are sliced, so that when a network subgraph processes data slice by slice, the available space required by each node in the network subgraph is reduced; in this way, the output data generated by each node can be stored in the internal storage space, reducing the number of accesses to the external memory.
  • the current TVM scheduling primitive cannot support the output data of the fork node to be processed in multiple network subgraphs at the same time.
  • in the prior art, there is no reasonable slicing method for slicing the network subgraphs of a multi-output neural network, so that when the existing multi-output neural network processes data, the output data of the fork node still needs to be stored in the external memory, and the number of accesses to the external memory cannot be reduced.
  • the embodiment of the present application provides a multi-output neural network graph slicing method, device, smart chip and storage medium, which can solve the problem that the existing multi-output neural network graph requires frequent access to external memory when processing input data.
  • the embodiment of the present application provides a method for slicing a multi-output neural network, the method is applied to a smart chip, and the method includes:
  • the multi-output neural network includes at least two end nodes and at least one fork node;
  • the output data volume of the output data of the end nodes of the network subgraph is sliced according to various preset slicing methods, and various slicing schemes of the network subgraph are obtained;
  • for any slicing scheme of the network subgraph, obtain the processing time of the network subgraph, and determine the target slicing scheme of the network subgraph according to the processing time;
  • the slicing scheme for the multi-output neural network is determined.
  • the embodiment of the present application provides a multi-output neural network slicing device, which is applied to a smart chip, and the device includes:
  • the segmentation module is used to divide each end node in the multi-output neural network from other nodes, and generate multiple network subgraphs containing end nodes;
  • the multi-output neural network includes at least two end nodes and at least one bifurcation node;
  • the slicing module is used for slicing the output data volume of the output data of the end nodes of the network subgraph according to various preset slicing methods for any network subgraph, so as to obtain various slicing schemes of the network subgraph;
  • the processing duration acquisition module is used for obtaining the processing duration of the network subgraph for any slicing scheme of the network subgraph, and determining the target slicing scheme of the network subgraph according to the processing duration;
  • the slicing scheme determination module is configured to determine the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
  • an embodiment of the present application provides a smart chip, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • when the processor executes the computer program, the method in the first aspect above is implemented.
  • an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method in the first aspect above is implemented.
  • the embodiment of the present application provides a computer program product, which causes the computer to execute the method of the first aspect when the computer program product is run on the computer.
  • the beneficial effect of the embodiments of the present application compared with the prior art is: the multi-output neural network is characterized in the form of a multi-output neural network graph; then, for each end node in the multi-output neural network graph, the end node is divided from the other nodes to obtain the network subgraph corresponding to each end node. Afterwards, in each network subgraph, the output data of the end node of the network subgraph is sliced according to various preset slicing methods, so as to control the amount of input and output data handled by the network subgraph when processing data, obtaining a variety of slicing schemes. Afterwards, the target slicing scheme of the network subgraph is determined according to the processing time of the network subgraph when processing data.
  • the overall slicing scheme of the multi-output neural network is obtained in this way.
  • the input and output data are processed according to the slicing scheme of the multi-output neural network, so that the output data passed between nodes can be stored in the available space of the smart chip, and each node can obtain data directly from the available space when processing data, thereby reducing the number of times the smart chip accesses the external memory.
  • Fig. 1 is the realization flowchart of a kind of multi-output neural network slicing method provided by an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a multi-output neural network graph and of its network subgraphs in a multi-output neural network slicing method provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of an implementation of S101 of a multi-output neural network slicing method provided by an embodiment of the present application
  • Fig. 4 is a flow chart of realizing a multi-output neural network slicing method provided by another embodiment of the present application.
  • FIG. 5 is a schematic diagram of a processing scenario in which a single-output neural network processes input data in a smart chip provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a processing scenario of a multi-output neural network graph processing input data in a smart chip provided by an embodiment of the present application;
  • Fig. 7 is a schematic structural diagram of a multi-output neural network graph slicing device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a smart chip provided by an embodiment of the present application.
  • the multi-output neural network slicing method provided in the embodiment of the present application can be applied to a smart chip.
  • the smart chip is a chip that loads each model parameter in the multi-output neural network graph, and processes the input data input to each node according to the model parameters under each node.
  • since the output data calculated by a single node of the multi-output neural network graph is usually too large to be stored in the internal memory of the smart chip, the output data needs to be exported to an external memory. Therefore, the output data of a single node of the multi-output neural network graph has to be saved by accessing the external memory.
  • since there are multiple nodes in the neural network, when the smart chip performs the neural network computation on the input data of each node, the smart chip needs to access the external memory frequently.
  • the above-mentioned smart chip includes, but is not limited to, a central processing unit, a graphics processing unit, a neural network processor, and the like.
  • the above-mentioned internal memory is the on-chip memory of the smart chip.
  • the above-mentioned external memory may be a double-rate synchronous dynamic random access memory. Usually the storage space of the external memory is much larger than the storage space of the on-chip memory.
  • FIG. 1 shows an implementation flow diagram of a multi-output neural network slicing method provided by an embodiment of the present application.
  • the nodes in the multi-output neural network can be merged so that the merged multi-output neural network can reduce the number of accesses to the external memory when processing data, and the method includes the following steps:
  • the smart chip separates each end node in the multi-output neural network from other nodes to generate multiple network subgraphs containing the end nodes; the multi-output neural network includes at least two end nodes and at least one fork node.
  • the above-mentioned multi-output neural network can be represented by a multi-output neural network graph.
  • FIG. 2 is a simple multi-output neural network diagram.
  • the multi-output neural network graph includes a plurality of nodes, and each node is used to represent a unit in the multi-output neural network that can be independently calculated on the smart chip.
  • the arrows in the multi-output neural network diagram are used to represent the transmission direction of the output data between each node.
  • FIG. 2X is a schematic structural diagram of a multi-output neural network
  • FIG. 2Y is a reference schematic diagram of network subgraphs after the multi-output neural network is divided.
  • the multi-output neural network graph includes 7 nodes. Among them, nodes 3, 5, and 7 are output nodes (that is, end nodes), nodes 1 and 2 are fork nodes, and nodes 4 and 6 are ordinary nodes.
  • the multiple nodes include at least two end nodes and one fork node. In this embodiment, there is no limit to the number of end nodes and fork nodes.
  • each divided network subgraph includes one end node; therefore, there are at least two divided network subgraphs.
  • dividing the end node from the other nodes can specifically be: the smart chip uses a post-order traversal to collect, according to the data transmission relationships between the nodes in the multi-output neural network graph, the other nodes that each end node needs, and performs segmentation to obtain the corresponding network subgraph.
  • the smart chip slices the output data of the end nodes of the network subgraph according to various preset slicing methods, and obtains various slicing schemes of the network subgraph.
  • the above slicing manner is specifically a manner of slicing the output data volume of the output data of the end nodes.
  • the size of the output data volume when the end node outputs data each time can be determined.
  • the network subgraph can determine the size of the input data each time the input data is processed, so that the smart chip can allocate enough available space for each node to process the input and output data.
  • the terminal node can output output data of any dimension size, therefore, when slicing the output data of the terminal node, there will also be correspondingly various slicing schemes, which are not limited.
  • the data processed by the multi-output neural network is usually image data, therefore, the output data will be described using image data as an example.
  • information of the image data is usually represented by N, C, H, W.
  • N represents the number of input images (the batch size)
  • H represents the number of pixels of the image data in the vertical direction
  • W represents the number of pixels of the image data in the horizontal direction
  • C represents the number of channels.
  • the smart chip can slice the output data under the node, so as to reduce the size of the input data handled by the end node and the other nodes in the network subgraph when they process input data, and the size of the output data when they output data.
  • the slicing is performed on the H axis, so that the size of the data processed by each node is N, C, H/2, W. Therefore, the node only takes half the image at a time as its input data. Moreover, since only half of the image is processed at a time, the output data of the final end node is also produced half an image at a time.
  • the smart chip can first store the output data of half the size of the image in the internal memory, and output it to the external memory if it is an end node; then, delete the data of half the size of the image processed by the fork node from the internal memory (Because it does not need to be used, the processed input data can be released); after that, the data of the other half of the image size is processed again until the output data of the other half of the image size is obtained and saved. Based on this, in the process of generating the overall output data of the node, the internal memory only needs to use half the storage space of the image size for storage. If the node directly processes the input data of the entire image, the smart chip needs to allocate the storage space required for the input of the overall image data and the storage space required for generating the output data of the overall image to the node.
  • the multiple ways of slicing the end node output data include, but are not limited to, slicing along C, H, and W respectively. That is to say, the above example is only one of the ways to slice along the H axis.
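  • As an illustrative sketch only (the tensor shape, element size and slice count below are assumptions, not values from this application), the reduction in per-node footprint obtained by slicing along the H axis can be estimated as follows:

        # Assumed example tensor in N, C, H, W layout, float32 elements, assumed slice count.
        N, C, H, W = 1, 16, 224, 224
        BYTES_PER_ELEMENT = 4          # float32
        H_SLICES = 2                   # slice the H axis in half, as in the N, C, H/2, W example

        def node_footprint(n, c, h, w):
            # A node must hold one input block and one output block at the same time;
            # input and output are assumed here to have the same shape.
            block = n * c * h * w * BYTES_PER_ELEMENT
            return block + block

        full = node_footprint(N, C, H, W)
        sliced = node_footprint(N, C, H // H_SLICES, W)
        print(f"unsliced: {full} bytes, sliced along H: {sliced} bytes")  # sliced is half of full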
  • the way of slicing the data can be as follows: the smart chip obtains the minimum dimension of the output data that the end node can output and the maximum dimension of the output data that it can output; the smart chip then determines the output of any dimension between the minimum dimension and the maximum dimension as one slicing method for slicing the output data of the end node.
  • the above-mentioned minimum dimension is the minimum dimension of the image data that can be output by the terminal node and the non-forking node
  • the above-mentioned maximum dimension is the maximum dimension of the image data that can be output by the terminal node and the non-forking node.
  • the image data of each dimension is used as output data that can be output, that is, a slicing method for slicing terminal nodes and non-forked nodes.
  • the minimum dimension of the image data that can be output by the terminal node is usually 1 pixel
  • the maximum dimension of the image data that can be output is usually the dimension of the entire image data.
  • assuming the number of images N is 1 and the number of channels C is also 1: if the pixel height H of the image data allows A slicings, that is, any dimension between [1, A] can be used to slice the image data in the vertical direction, and the pixel width W of the image data allows B slicings, that is, any dimension between [1, B] can be used to slice the image data in the horizontal direction, then there will finally be A*B slicing methods for slicing the output data of the end node.
  • the number C of channels may also be multiple, such as three. Therefore, when the image data is divided, the number of channels can also be divided. Based on this, multiple slicing modes in the network subgraph can be obtained, so that the smart chip can have more slicing schemes for selection.
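  • A minimal sketch of how such preset slicing methods could be enumerated (the maximum dimensions A, B and the channel count below are assumptions for illustration): every dimension between 1 and the maximum along H and W, and optionally C, is one candidate, giving A*B (or A*B*C) slicing methods in total.

        from itertools import product

        # Assumed maximum output dimensions of an end node (whole-image output).
        A, B, C_MAX = 8, 8, 3  # H candidates 1..A, W candidates 1..B, channel candidates 1..C_MAX

        def enumerate_slicing_methods(a, b, c_max=1):
            # Each (c, h, w) triple is one slicing method: the output data of the end node
            # is produced in blocks of at most c channels, h rows and w columns per step.
            return [(c, h, w) for c, h, w in product(range(1, c_max + 1),
                                                     range(1, a + 1),
                                                     range(1, b + 1))]

        methods = enumerate_slicing_methods(A, B, C_MAX)
        print(len(methods), "candidate slicing methods")  # A * B * C_MAX candidates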
  • the smart chip acquires a processing time of the network subgraph, and determines a target slicing scheme of the network subgraph according to the processing time.
  • the smart chip can determine the slicing scheme with the shortest processing time as the target slicing scheme of the network subgraph, so as to improve the efficiency of the network subgraph when processing input and output data .
  • the smart chip can enumerate all the valid slicing schemes in graph A and obtain, through the chip cost model, the optimal slicing scheme with the lowest cycle count and the shortest time consumption as the target slicing scheme of end node 3; likewise, it enumerates all the valid slicing schemes in graph B and obtains the optimal slicing scheme through the chip cost model as the target slicing scheme of end node 5; and it enumerates all the valid slicing schemes in graph C and obtains the optimal slicing scheme through the chip cost model as the target slicing scheme of end node 7.
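  • The selection just described can be sketched as follows (a hedged illustration: is_valid_scheme and cost_model_cycles are placeholder names standing in for the validity check and the chip cost model of this application, not a concrete API):

        def select_target_scheme(subgraph, candidate_schemes, is_valid_scheme, cost_model_cycles):
            # Keep only schemes for which every node fits in the chip's available space.
            valid = [s for s in candidate_schemes if is_valid_scheme(subgraph, s)]
            if not valid:
                return None
            # The scheme with the lowest estimated cycle count is the target slicing scheme.
            return min(valid, key=lambda s: cost_model_cycles(subgraph, s))

  • The same selection would then be repeated independently for graph A, graph B and graph C to obtain the three target slicing schemes.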
  • the smart chip determines the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
  • the target slicing schemes of the network subgraphs together constitute the overall slicing scheme of the multi-output neural network.
  • the multi-output neural network is represented in the form of a multi-output neural network graph; then, for each end node in the multi-output neural network graph, the end node is divided from the other nodes to obtain the network subgraph corresponding to each end node. Afterwards, in each network subgraph, the output data of the end node of the network subgraph is sliced according to various preset slicing methods, so as to control the amount of input and output data handled by the network subgraph when processing data, obtaining a variety of slicing schemes. Afterwards, the target slicing scheme of the network subgraph is determined according to the processing time of the network subgraph when processing data.
  • the overall slicing scheme of the multi-output neural network is obtained in this way.
  • the input and output data are processed according to the slicing scheme of the multi-output neural network, so that the output data passed between nodes can be stored in the available space of the smart chip, and each node can obtain data directly from the available space when processing data, thereby reducing the number of times the smart chip accesses the external memory.
  • FIG. 3 is a schematic diagram of an implementation of S101 of a method for slicing a multi-output neural network provided by an embodiment of the present application, which is described in detail as follows;
  • the smart chip uses the terminal node as the first current node.
  • the above-mentioned first current node is the node currently being processed.
  • when the post-order traversal is used to process the multi-output neural network graph, the first end node to be processed is taken as the initial first current node.
  • the smart chip judges whether the fork node has been divided by network subgraphs corresponding to other end nodes.
  • the smart chip divides each node between the first current node and the parent node to generate a network subgraph including the end nodes.
  • the smart chip determines the parent node as the new first current node, and repeats S2-S4, until a network subgraph containing terminal nodes is generated.
  • the first current node is the end node 3
  • the parent node is the fork node 2.
  • the fork node 2 is not divided by the network subgraphs corresponding to other end nodes at this time. Therefore, based on the step S3, it can be seen that the fork node 2 can be used as the new first current node, and the steps S2-S4 are repeated until a network subgraph including the terminal node 3 is generated.
  • the condition for finishing the generation of the network subgraph containing end node 3 may be the condition described in step S2, or it may be that the new first current node has no parent node in the multi-output neural network graph.
  • the smart chip can finish dividing the terminal node 3.
  • when dividing for end node 3, if the new first current node is node 1, then since node 1 is the starting node of the multi-output neural network graph, it has no parent node. Therefore, the segmentation for the subgraph containing end node 3 ends here, and a network subgraph including nodes 1, 2, and 3 is obtained.
  • the initial first current node is the terminal node 5
  • its parent node is the non-forked node 4
  • the smart chip can then use the non-forked node 4 as the new first current node.
  • the previous node (that is, the new parent node) of the new first current node 4 is the fork node 2 .
  • the smart chip then ends the division for end node 5, divides end node 5 and each node between end node 5 and the new parent node 2 (only node 4 in FIG. 2), and generates a network subgraph including end node 5.
  • the division method of the terminal node 7 is similar to the division method of the terminal node 5, which will not be described again.
  • the above example only uses terminal node 3 as the first terminal node to be processed.
  • terminal nodes 5 and 7 can also be respectively used as the first terminal node to be processed. This is not limited.
  • since each end node is segmented from the other nodes in turn through the post-order traversal, the network subgraph corresponding to each end node can be generated accurately and quickly, and the efficiency of the smart chip in splitting the multi-output neural network graph is improved.
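  • A rough sketch of this segmentation, under an assumed graph representation (a parent map and a set of fork nodes; neither is an API from this application): starting from each end node, walk towards the start of the graph, stop when a fork node has already been taken by another subgraph, and collect the visited nodes into that end node's subgraph.

        def split_into_subgraphs(end_nodes, parent_of, fork_nodes):
            """parent_of maps a node to its parent node (None for the starting node);
            fork_nodes is the set of nodes whose output feeds more than one node."""
            taken_forks = set()          # fork nodes already divided into some subgraph
            subgraphs = {}
            for end in end_nodes:
                nodes = [end]
                current = end
                while True:
                    parent = parent_of.get(current)
                    if parent is None:
                        break            # reached the starting node: this subgraph is complete
                    if parent in fork_nodes and parent in taken_forks:
                        break            # this fork was already divided by another end node
                    nodes.append(parent)
                    if parent in fork_nodes:
                        taken_forks.add(parent)
                    current = parent
                subgraphs[end] = nodes
            return subgraphs

        # Example with the 7-node graph of FIG. 2 (1 and 2 are fork nodes; 3, 5, 7 are end nodes).
        parent_of = {1: None, 2: 1, 3: 2, 4: 2, 5: 4, 6: 1, 7: 6}
        print(split_into_subgraphs([3, 5, 7], parent_of, fork_nodes={1, 2}))
        # -> {3: [3, 2, 1], 5: [5, 4], 7: [7, 6]}, matching graphs A, B and C in the text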
  • each network subgraph respectively includes some nodes in the multi-output neural network graph.
  • slicing the multi-output neural network graph is only to adjust the size of the input and output data processed by each node in the multi-output neural network graph, not to divide the multi-output neural network graph into multiple parts.
  • when the above step S102 is performed, each network subgraph usually has multiple slicing schemes. However, when the subsequent step S103 is performed, if the smart chip were to measure the processing time of the network subgraph under every slicing scheme, it would consume a lot of time. Based on this, before executing S103, the smart chip can also pre-determine whether a slicing scheme is a valid slicing scheme, and perform the step of obtaining the processing time of the network subgraph only when the slicing scheme is a valid slicing scheme, so as to reduce the time required by the smart chip.
  • FIG. 4 is a schematic diagram of an implementation of judging whether a slicing scheme is an effective slicing scheme in a multi-output neural network slicing method provided by an embodiment of the present application, as detailed below:
  • the smart chip reversely deduces the input data and output data of each node respectively according to the output data volume of the output data of the sliced end nodes.
  • the size of each piece of output data of the end node can be limited, so that, according to the calculation parameters and calculation formulas in the end node, the smart chip can back-derive the size of the end node's input data for each step. Afterwards, since the input data of the end node is the output data of the previous node, the smart chip can back-derive the input data and output data of every node in turn.
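  • As a hedged illustration of this back-derivation, for an assumed convolution-style node (the kernel size and stride below are illustrative and padding is ignored; this is not a formula stated in this application), the number of input rows required to produce a given number of output rows can be computed backwards from the sliced output:

        def rows_needed_by_conv(out_rows, kernel_h, stride_h):
            # For a valid (unpadded) convolution, producing out_rows output rows
            # requires (out_rows - 1) * stride_h + kernel_h input rows.
            return (out_rows - 1) * stride_h + kernel_h

        # If an end node's output is sliced to 16 rows per step and the node is a 3x3,
        # stride-1 convolution, each step needs 18 input rows from the previous node.
        print(rows_needed_by_conv(16, kernel_h=3, stride_h=1))  # -> 18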
  • the smart chip determines the target output data of the previous node of the node.
  • the smart chip determines the occupied space of the node during execution according to the target output data, the input data and the output data of the node.
  • the preceding node is a node preceding the node in the multi-output neural network, which can be determined according to the data transmission relationship between the various nodes.
  • the input data of a node is the output data of the previous node. Therefore, in order to determine whether a slicing scheme is a valid slicing scheme, the smart chip can examine how the network subgraph processes its input and output data, obtain the slicing schemes under which this processing fits in the internal memory, and determine them as valid slicing schemes.
  • if each node does not need to obtain input data from the external memory when the network subgraph processes input and output data, it means that under this slicing scheme, when each node processes input and output data, the internal memory of the smart chip can allocate enough free space for the node. Based on this, for any node of any network subgraph, the smart chip needs to calculate the occupied space required by the node when processing input and output data.
  • the smart chip can specifically determine the occupied space in the following manner: determine all the fork nodes preceding the node and the output data of each fork node; judge whether the output data of each fork node can be released; and use the output data of all fork nodes that cannot be released as the target output data.
  • the smart chip can judge whether the output data of the fork node can be released in the following way: if the output data of the fork node is also used by other network subgraphs, it is determined that the output data of the fork node cannot be released; otherwise, it is determined that the output data of the fork node can be released.
  • in order to reduce the number of interactions with the external memory, each node usually needs to store its output data in the internal memory when generating it, so that it can be called by the next node. Because the output data of a non-fork node is the input data of the next node, when calculating the occupied space of any such node, it is only necessary to sum the output data volume of the node's output data and the input data volume of its input data to determine the occupied space required by the node. That is, there is no target output data in this case.
  • the output data of the fork node needs to be used by other network subgraphs. That is, the output data of the fork node needs to be stored in the internal memory all the time. Therefore, when calculating the required occupied space of the node, it is also necessary to add the output data amounts of the output data of the non-releasable fork nodes to obtain the required occupied space of the node. That is, the output data of the forked nodes that have not been released at this time is the target output data.
  • the purpose of judging whether the output data of the bifurcated node can be released is to delete the output data from the internal memory when judging that the output data can be released, so as to increase the available space allocated by the internal memory for the next node as much as possible.
  • the next node does not need to store the output data in the external memory, reducing the number of interactions with the external memory.
  • the smart chip determines the slicing scheme as an effective slicing scheme.
  • the internal memory of the smart chip can have enough space to store the input data required by the node, and at the same time can store the output data produced by the node.
  • the above-mentioned available space may be the total space of the internal memory of the smart chip, or the space remaining in the internal memory after storing the calculation parameters when the node processes data, which is not limited.
  • the advantage of adopting a valid slicing scheme to slice the output data of the end node is that the amount of input data each time a node processes input data and the amount of output data it generates are reduced, so that the footprint required by each node when processing input and output data is reduced.
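  • A minimal sketch, under assumed data structures (the dictionary fields and byte values below are illustrative, not part of this application), of the validity check described here: for each node, the required space is its input plus its output plus the output of every earlier fork node that cannot yet be released because another subgraph still needs it, and the scheme is valid only if every node fits within the available space Q. This mirrors the Q5 + Q1b ≤ Q style of check discussed below.

        def is_valid_scheme(nodes, available_space):
            """nodes: list of dicts with 'input_bytes', 'output_bytes' and
            'retained_fork_output_bytes' (the target output data that cannot be released),
            all already back-derived for one candidate slicing scheme."""
            for node in nodes:
                occupied = (node["input_bytes"]
                            + node["output_bytes"]
                            + node["retained_fork_output_bytes"])
                if occupied > available_space:
                    return False   # this node would overflow the on-chip memory
            return True

        # Assumed sizes for one node; the scheme is valid only if the sum stays within Q.
        Q = 1024 * 1024  # assumed available on-chip space in bytes
        node = {"input_bytes": 200_000, "output_bytes": 100_000,
                "retained_fork_output_bytes": 150_000}
        print(is_valid_scheme([node], Q))  # -> True for these assumed values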
  • the smart chip can store the output data generated by each node in the internal memory, so as to reduce the number of interactions with the external memory.
  • the size of the internal memory of the smart chip is Q.
  • Q is the remaining space in the internal memory after removing the occupied space of the calculation parameters of the node in the internal memory.
  • the space occupied by each non-forked node when processing data generally includes: the space occupied by the input data and output data of the node.
  • the input data and output data of each node under the scheme are respectively obtained.
  • network subgraph A is a subgraph composed of end node 3, fork node 1 and fork node 2
  • the reason why the output data of fork nodes 1 and 2 cannot be released is that the output data of fork nodes 1 and 2 need to be reused (nodes in both graph A and graph C need the output data of fork node 1, and nodes in both graph A and graph B need the output data of fork node 2).
  • for each node in the multi-output neural network, when any node is back-derived from its output data, at the moment its input data is generated the node holds both the input data and the output data. Therefore, each node in the multi-output neural network needs to meet the most basic requirement: at the moment the node generates its input data, the sum of the input data volume and the output data volume is less than or equal to the available space allocated by the smart chip for the node.
  • graph B is a network subgraph including end node 5 and non-fork node 4.
  • after graph A is sliced, the output data Q1b of fork node 1 and Q2b of fork node 2 are not released, and the output data Q2b of fork node 2 is the input data Q4a of non-fork node 4.
  • the terminal node 5 can be calculated, otherwise the slicing scheme for the terminal node 5 is invalid.
  • for end node 5, since the input data of end node 5 is the output data of non-fork node 4, and at this time the input data Q4a (that is, Q2b) of non-fork node 4 can be released (it does not need to be used in graph C), the occupied space required by end node 5 needs to satisfy Q5+Q1b (the output data of fork node 1 still needs to be used by graph C and has not been released). If Q5+Q1b ≤ Q, the slicing method is valid; otherwise, the slicing method is invalid.
  • the above derivation from the output data is what determines whether a slicing scheme is a valid slicing scheme; it is a reverse process from output to input. After the reverse deduction is completed, if the output data sliced with slicing scheme X allowed the smart chip to allocate enough free space for each node during the reverse deduction, then, when the multi-output neural network uses the same slicing method to limit the amount of input data each node processes, the smart chip should likewise be able to allocate enough available space for each node to store the output data generated by each node.
  • after the slicing scheme of the multi-output neural network is determined, it is also necessary to configure the operators that process input and output data in the multi-output neural network, so that the operators process the data according to the slicing scheme.
  • each node includes a plurality of operators for processing input data and/or output data, and each operator needs to call a corresponding scheduling statement for implementation.
  • the slicing method of the multi-output neural network also includes:
  • the smart chip obtains the start operator that processes the input data of the node first among the multiple operators, and the end operator that processes the output data of the node last.
  • if the node is an end node:
  • when the smart chip runs the start operator, it calls the first scheduling statement, which performs the first operation; and when the end operator finishes running, it calls the second scheduling statement, which performs the second operation, and the third scheduling statement, which performs the third operation.
  • the first operation is used to read input data from the available space of the smart chip to the start operator; the second operation is used to write the output data generated by the end operator into the internal memory; and the third operation is used to write the output data generated by the end operator of the end node into the external memory.
  • if the node is the starting node in the multi-output neural network:
  • when the smart chip runs the start operator of the starting node, it calls the fourth scheduling statement, which performs the fourth operation, together with the first scheduling statement; and when running the end operator of the starting node, it calls the second scheduling statement. The fourth scheduling statement is used to read input data from the external memory to the start operator of the starting node.
  • the smart chip calls the first scheduling statement when running the start operator in the intermediate node; and calls the second scheduling statement when running the end operator in the intermediate node.
  • a node is a minimum processing unit in a multi-output neural network graph, and contains calculation formulas for input data. Among them, a node usually contains a large number of operators to process the input data.
  • a node when a node includes multiple operators, the first operator in the node that processes input data is used as the start operator, and the last operator that generates output data is used as the end operator.
  • the foregoing first operation may specifically be an operation performed by an initial operator.
  • the first operation is used to read the input data required by the initial operator in the node.
  • the first operation is the operation of reading the input data required by the initial operator from the internal memory. If it is a starting node, you need to perform the fourth operation of reading the required input data from the external memory; and when reading the input data, you need to store the input data in the internal memory first, and then read it from the internal memory Get the input data.
  • the above-mentioned second operation is used to write the output data of the end operator into the available space. It can be understood that if the node is an end node, the output data generated by the end operator not only needs to be written into the internal memory, but the third operation of writing the output data into the external memory also needs to be performed. However, for an intermediate node (a node that is neither a start node nor an end node), only the second operation of writing the output data generated by the end operator into the available space of the internal memory and the first operation of reading input data from the available space of the internal memory are required.
  • the start node is the fork node 1 in FIG. 2; the end nodes are nodes 3, 5, and 7 in FIG. 2; and the intermediate nodes are nodes 2, 4, and 6 in FIG. 2.
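  • The mapping just described can be summarized in the following illustrative sketch (the node kinds and statement labels simply restate the first to fourth scheduling statements above; this is not an API of any library):

        def scheduling_statements_for(node_kind):
            """Which scheduling statements are attached to a node's start operator and
            end operator, for node_kind in {"start", "end", "intermediate"}."""
            if node_kind == "start":
                # Starting node: read input from DDR (fourth) then from DM (first);
                # write the end operator's output into DM (second).
                return {"start_op": ["fourth", "first"], "end_op": ["second"]}
            if node_kind == "end":
                # End node: read input from DM (first); write output into DM (second)
                # and also into DDR (third).
                return {"start_op": ["first"], "end_op": ["second", "third"]}
            # Intermediate node: read from DM (first) and write back to DM (second) only.
            return {"start_op": ["first"], "end_op": ["second"]}

        # For FIG. 2: node 1 is the start node, nodes 3, 5, 7 are end nodes, 2, 4, 6 intermediate.
        for kind in ("start", "end", "intermediate"):
            print(kind, scheduling_statements_for(kind))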
  • the smart chip when generating a scheduling statement for each operator, it can integrate multiple operators in one node, so that multiple operators in the same node can be combined and put into the same core. In this way, the smart chip can use operator fusion without moving the data of the fork node to the external memory, which can effectively reduce the number of data transfers and save the processing time of the smart chip.
  • the graphs A, B, and C are respectively scheduled according to the searched optimal slicing manner in a post-order access manner.
  • END_OP: the last operator of each child node
  • STT_OP: the start operator
  • for graph A, the END_OP of end node 3 is used as the first root node 1, and the split primitive is called on the N, C, H, and W axes to slice them.
  • the compute_op that moves the output data to DM (internal memory) and DDR (external memory) is generated by applying cache_write (a storage write operation) to END_OP.
  • cache_read on START_OP generates the compute_op that moves DM data to the computing unit. If the compute_op requires parameters, cache_read is also needed to generate the compute_op that moves DDR data to DM and the computing unit, and all compute_ops are moved to the axis corresponding to root node 2 by compute_at.
  • fork node 1 uses a method similar to that of fork node 2, and the END_OP of fork node 1 is set as the second root node 3. Since fork node 1 is the first node of the multi-output neural network, cache_read on START_OP is needed to generate the compute_op that moves DDR data to DM and the computing unit.
  • the END_OP of output node 5, as the first root node 4, calls the split primitive to slice the N, C, H, and W axes. cache_write on END_OP generates the compute_op that moves the output data to DM and DDR. cache_read on START_OP generates the compute_op that moves DM data to the computing unit. If the compute_op requires parameters, cache_read is also needed to generate the compute_op that moves DDR data to DM and the computing unit, and all compute_ops are moved to the axis corresponding to the first root node 4 by compute_at.
  • node 4 generates the compute_op that moves its output data to DM through cache_write on END_OP.
  • cache_read on START_OP generates the compute_op that moves DM data to the computing unit. If the compute_op requires parameters, cache_read is also needed to generate the compute_op that moves DDR data to DM and the computing unit, and all compute_ops are moved to the axis corresponding to the first root node 4 through compute_at.
  • the END_OP of output node 7 is used as the first root node 5, and the split primitive is called to slice the N, C, H, and W axes.
  • cache_write on END_OP generates the compute_op that moves the output data to DM and DDR.
  • cache_read on START_OP generates the compute_op that moves DM data to the computing unit. If the compute_op requires parameters, cache_read is also needed to generate the compute_op that moves DDR data to DM and the computing unit, and all compute_ops are moved to the axis corresponding to the first root node 5 by compute_at.
  • node 6 generates the compute_op that moves its output data to DM through cache_write on END_OP, and cache_read on START_OP generates the compute_op that moves DM data to the computing unit. If the compute_op requires parameters, cache_read is also needed to generate the compute_op that moves DDR data to DM and the computing unit, and all compute_ops are moved to the axis corresponding to the first root node 5 through compute_at.
  • split, cache_write, cache_read, compute_at, etc. are all scheduling primitives of TVM.
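  • The following is a minimal, simplified TVM tensor-expression sketch of the primitives named above (split, cache_write, cache_read, compute_at). It uses a single element-wise stage standing in for a subgraph's END_OP, a generic "global" scope standing in for the chip's DM, and an assumed split factor; it is not the scheduling code of this application.

        import tvm
        from tvm import te

        # Assumed tensor shape in N, C, H, W layout and an assumed slice factor on the H axis.
        N, C, H, W = 1, 16, 64, 64
        data = te.placeholder((N, C, H, W), name="data")
        # A single element-wise stage standing in for a subgraph's END_OP.
        end_op = te.compute((N, C, H, W), lambda n, c, h, w: data[n, c, h, w] * 2.0, name="END_OP")

        s = te.create_schedule(end_op.op)

        # split: slice the H axis according to the chosen slicing scheme (factor is an assumption).
        ho, hi = s[end_op].split(end_op.op.axis[2], factor=32)

        # cache_write: stage END_OP's result in an on-chip buffer ("global" scope here stands
        # in for the chip's DM; a real target would use its device-specific memory scope).
        out_dm = s.cache_write(end_op, "global")

        # cache_read: stage the input in an on-chip buffer before the compute stage consumes it.
        data_dm = s.cache_read(data, "global", [out_dm])

        # compute_at: anchor both staging steps at the sliced H axis, so each slice is loaded,
        # computed and written back before the next slice begins.
        s[out_dm].compute_at(s[end_op], ho)
        s[data_dm].compute_at(s[out_dm], out_dm.op.axis[2])

        print(tvm.lower(s, [data, end_op], simple_mode=True))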
  • the above steps are mainly for improving the processing flow of input data and output data and the reading and writing flow. That is, the above method can reduce the number of times the smart chip reads input data from the external memory. However, when each operator is running, it may also need to move other calculation parameters from the external memory to process the input data. However, moving calculation parameters requires access to external memory. That is, the slicing method of the multi-output neural network graph described in this application can only reduce the number of times the multi-output neural network graph accesses input data from an external memory.
  • FIG. 5 is a schematic diagram of a processing scenario of processing input data by a single-output neural network in a smart chip.
  • the smart chip obtains the input data of the single-output neural network from the external memory, and stores it in the internal memory;
  • Step2 The smart chip reads the input data from the internal memory, and uses the model parameters in the calculation unit A to process and obtain the output data, and then store the output data in external memory.
  • Step3 The smart chip clears the output data and input data of node A;
  • Step4 Read the input data from the external memory again, and use the model parameters in calculation unit B to process it and obtain the output data;
  • Step5 Store the output data of node B to the external memory. It can be seen from FIG. 5 that for a single-output neural network model, when processing the input data in each node, it is necessary to perform an operation of accessing the external memory.
  • FIG. 6 is a schematic diagram of a processing scenario in which a multi-output neural network processes input data in a smart chip.
  • this example is only explained with the multi-output neural network generated by nodes 1, 2, 3, 4, and 5.
  • Step1 The smart chip obtains the input data of the multi-output neural network from the external memory, and stores it in the internal memory.
  • Step2 For node 2, the smart chip can directly read the input data from the internal memory, and delete the output data of node 1 from the internal memory; after that, it stores the output data of node 2 in the internal memory. At the same time, based on the above segmentation of the network subgraphs, even if the output data of node 2 is stored in the internal memory, the sum of the space occupied by the model parameters of the other nodes deployed in the internal memory and the space occupied by the generated output data will still be smaller than the available internal memory space.
  • node 4 and end node 3 can directly obtain the output data of node 2 from the internal memory, and node 4 will then also put its generated output data into the internal memory to be read and processed by end node 5; after that, the smart chip executes Step3 and Step4: the output data generated by end node 3 and end node 5 are respectively stored in the external memory.
  • each node in each network subgraph reads and writes the data stored in the internal memory to realize data interaction. In this way, the operation of accessing the external memory only needs to be performed once for the initial node and the output node of the multi-output neural network.
  • FIG. 7 is a structural block diagram of a multi-output neural network slicing device provided by an embodiment of the present application.
  • the modules included in the multi-output neural network slicing device in this embodiment are used to execute the steps in the embodiments corresponding to FIG. 1 , FIG. 3 to FIG. 4 .
  • for details, please refer to FIG. 1, FIG. 3 to FIG. 4 and the related descriptions in the embodiments corresponding to FIG. 1, FIG. 3 to FIG. 4. For ease of description, only the parts related to this embodiment are shown. Referring to FIG. 7,
  • the slicing device 700 for a multi-output neural network graph may include: a segmentation module 710, a slicing module 720, a processing duration acquisition module 730, and a slicing scheme determination module 740, wherein:
  • Segmentation module 710 for dividing each end node in the multi-output neural network from other nodes, generating multiple network subgraphs containing end nodes; the multi-output neural network includes at least two end nodes and at least one fork node.
  • the slicing module 720 is configured to slice the output data volume of the output data of the end nodes of the network subgraph according to various preset slicing methods for any network subgraph, so as to obtain various slicing schemes of the network subgraph.
  • the processing duration obtaining module 730 is configured to obtain the processing duration of the network subgraph for any slicing scheme of the network subgraph, and determine the target slicing scheme of the network subgraph according to the processing duration.
  • the slicing scheme determination module 740 is configured to determine the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
  • the segmentation module 710 is also used for:
  • the slicing module 720 is also used for:
  • the slicing device 700 of the multi-output neural network graph further includes:
  • the judging module is used for judging whether the slicing scheme is an effective slicing scheme for any slicing scheme of the network subgraph, and when the slicing scheme is an effective slicing scheme, execute the step of obtaining the processing time of the network subgraph.
  • the judging module is also used for:
  • the input data and output data of each node are respectively back-derived; for any node, the target output data of the previous node of the node is determined; according to the target output data, the input data and the output data of the node, the occupied space when the node is executed is determined; if the occupied space of any node is smaller than the available space allocated by the smart chip for the corresponding node, the slicing scheme is determined to be a valid slicing scheme.
  • the judging module is also used for:
  • the judging module is also used for:
  • if the output data of the fork node is also used by other network subgraphs, it is determined that the output data of the fork node cannot be released; otherwise, it is determined that the output data of the fork node can be released.
  • it should be noted that each module is used to execute the steps in the embodiments corresponding to FIG. 1 and FIG. 3 to FIG. 4.
  • each step in the embodiments corresponding to FIG. 1 and FIG. 3 to FIG. 4 has been explained in detail in the above embodiments; please refer to FIG. 1, FIG. 3 to FIG. 4 and the related descriptions in the corresponding embodiments, which will not be repeated here.
  • Fig. 8 is a structural block diagram of a smart chip provided by an embodiment of the present application.
  • the smart chip 800 of this embodiment includes: a processor 810 , a memory 820 , and a computer program 830 stored in the memory 820 and executable on the processor 810 , such as a program of a multi-output neural network slicing method.
  • the processor 810 executes the computer program 830 , the steps in the above embodiments of each multi-output neural network slicing method are implemented, such as S101 to S104 shown in FIG. 1 .
  • when the processor 810 executes the computer program 830, the functions of the modules in the above embodiment corresponding to FIG. 7 are realized, for example, the functions of modules 710 to 740 shown in FIG. 7.
  • the computer program 830 can be divided into one or more modules, and the one or more modules are stored in the memory 820 and executed by the processor 810, so as to implement the multi-output neural network slicing method provided by the embodiments of the present application.
  • One or more modules may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 830 in the smart chip 800 .
  • the computer program 830 may implement the multi-output neural network slicing method provided in the embodiment of the present application.
  • the smart chip 800 may include, but not limited to, a processor 810 and a memory 820 .
  • FIG. 8 is only an example of the smart chip 800, and does not constitute a limitation on the smart chip 800. It may include more or fewer components than shown in the figure, or combine certain components, or have different components.
  • the so-called processor 810 may be a central processing unit, or another general-purpose processor, a digital signal processor, an application-specific integrated circuit, an off-the-shelf programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and the like.
  • the memory 820 may be an internal storage unit of the smart chip 800 , such as a hard disk or memory of the smart chip 800 .
  • the memory 820 may also be an external storage device of the smart chip 800, such as a plug-in hard disk, smart memory card, flash memory card, etc. equipped on the smart chip 800. Further, the memory 820 may also include both an internal storage unit of the smart chip 800 and an external storage device.
  • An embodiment of the present application provides a smart chip, including a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • when the processor executes the computer program, the multi-output neural network slicing method in the above embodiments is implemented.
  • An embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the multi-output neural network slicing method in the above embodiments.
  • An embodiment of the present application provides a computer program product, which, when the computer program product is run on a computer, causes the computer to execute the multi-output neural network slicing method in each of the foregoing embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

A slicing method and apparatus for a multi-output neural network, and a chip and a storage medium, which are applicable to the technical field of model processing. The method is applied to an intelligent chip, and comprises: respectively segmenting each end node in a multi-output neural network from other nodes, so as to generate a plurality of network sub-graphs including end nodes; for any network sub-graph, slicing output data of the end nodes of the network sub-graph according to a plurality of preset slicing modes, so as to obtain a plurality of slicing schemes of the network sub-graph; for any slicing scheme of the network sub-graph, acquiring a processing duration of the network sub-graph, and determining a target slicing scheme of the network sub-graph according to the processing duration; and determining a slicing scheme of the multi-output neural network according to the target slicing scheme of each network sub-graph. By using the method, when a multi-output neural network graph processes input data, the number of times that an external memory is accessed can be reduced.

Description

Slicing method, device, chip and storage medium for a multi-output neural network
Technical Field
This application claims priority to the Chinese patent application No. 202210181249.1, entitled "Slicing method, device, chip and storage medium for a multi-output neural network", filed with the China Patent Office on February 25, 2022, the entire contents of which are incorporated herein by reference.
The present application belongs to the technical field of model processing, and in particular relates to a slicing method, device, chip and storage medium for a multi-output neural network.
Background
The neural network compilation framework (Tensor Virtual Machine, TVM) is a unified software stack for different neural network frameworks and hardware platforms, and can deploy neural networks built under different frameworks onto a hardware platform.
Usually, a neural network processes input data in a smart chip to obtain output data. However, the available space inside the smart chip is often relatively small, and for a neural network that requires a large amount of storage space to process data, the available space inside the smart chip is far from enough. Therefore, in the process of data processing, for any node included in the neural network, it is often necessary to first store the output data of the node in an external memory. Then, the node's input data currently held in the storage space and the output data it has generated are deleted. Finally, the output data of the node is fetched from the external memory and used as the input data of the next node to complete the data processing.
Therefore, in order to reduce the available space required by the neural network for data processing, it is necessary to first perform graph segmentation on the neural network to obtain network subgraphs. Afterwards, the segmented network subgraphs are sliced, so that when a network subgraph processes data slice by slice, the available space required by each node in the network subgraph is reduced; in this way, the output data generated by each node can be stored in the internal storage space, reducing the number of accesses to the external memory.
However, when the above slicing approach is used to slice the network subgraphs, the current TVM scheduling primitives cannot support giving the output data of a fork node to multiple network subgraphs for processing at the same time. As a result, in the prior art there is no reasonable slicing scheme for slicing the network subgraphs of a multi-output neural network, so that when the existing multi-output neural network processes data, the output data of the fork node still needs to be stored in the external memory, and the number of accesses to the external memory cannot be reduced.
Technical Solution

Embodiments of the present application provide a slicing method and apparatus for a multi-output neural network graph, a smart chip, and a storage medium, which can solve the problem that an existing multi-output neural network graph needs to access the external memory frequently when processing input data.
In a first aspect, an embodiment of the present application provides a slicing method for a multi-output neural network. The method is applied to a smart chip and includes:

splitting each end node of the multi-output neural network from the other nodes to generate a plurality of network subgraphs each containing an end node, the multi-output neural network including at least two end nodes and at least one fork node;

for any network subgraph, slicing the output data quantity of the output data of the end node of the network subgraph according to a plurality of preset slicing manners to obtain a plurality of slicing schemes for the network subgraph;

for any slicing scheme of the network subgraph, obtaining the processing duration of the network subgraph and determining a target slicing scheme for the network subgraph according to the processing duration; and

determining the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
In a second aspect, an embodiment of the present application provides a slicing apparatus for a multi-output neural network, applied to a smart chip, the apparatus including:

a splitting module configured to split each end node of the multi-output neural network from the other nodes to generate a plurality of network subgraphs each containing an end node, the multi-output neural network including at least two end nodes and at least one fork node;

a slicing module configured to, for any network subgraph, slice the output data quantity of the output data of the end node of the network subgraph according to a plurality of preset slicing manners to obtain a plurality of slicing schemes for the network subgraph;

a processing-duration acquisition module configured to, for any slicing scheme of the network subgraph, obtain the processing duration of the network subgraph and determine a target slicing scheme for the network subgraph according to the processing duration; and

a slicing-scheme determination module configured to determine the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
In a third aspect, an embodiment of the present application provides a smart chip including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of the first aspect when executing the computer program.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect.

In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on a computer, causes the computer to execute the method of the first aspect.

Compared with the prior art, the embodiments of the present application have the following beneficial effects. The multi-output neural network is represented in the form of a multi-output neural network graph, and each end node of the graph is split from the other nodes to obtain the network subgraph corresponding to that end node. Then, within each network subgraph, the output data of its end node is sliced according to a plurality of preset slicing manners to control the amount of input and output data the subgraph handles at a time, yielding a plurality of slicing schemes. The target slicing scheme of each network subgraph is then determined according to the processing duration of that subgraph, and from these the overall slicing scheme of the multi-output neural network is obtained. When the multi-output neural network processes data, the input and output data are handled according to this slicing scheme, so that the output data passed between nodes can be kept in the available space of the smart chip; each node can then read its data directly from that space, reducing the number of accesses the smart chip makes to the external memory.
Description of Drawings

The drawings used in the embodiments of the present application are introduced below.

Fig. 1 is an implementation flowchart of a slicing method for a multi-output neural network provided by an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a multi-output neural network graph, together with the network subgraphs obtained by partitioning it, in a slicing method for a multi-output neural network provided by an embodiment of the present application;

Fig. 3 is a schematic diagram of one implementation of S101 of a slicing method for a multi-output neural network provided by an embodiment of the present application;

Fig. 4 is an implementation flowchart of a slicing method for a multi-output neural network provided by another embodiment of the present application;

Fig. 5 is a schematic diagram of a processing scenario in which a single-output neural network processes input data in a smart chip provided by an embodiment of the present application;

Fig. 6 is a schematic diagram of a processing scenario in which a multi-output neural network graph processes input data in a smart chip provided by an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a slicing apparatus for a multi-output neural network graph provided by an embodiment of the present application;

Fig. 8 is a schematic structural diagram of a smart chip provided by an embodiment of the present application.
Embodiments of the Present Invention

The embodiments of the present application are described below with reference to the accompanying drawings.
The slicing method for a multi-output neural network provided by the embodiments of the present application can be applied to a smart chip. The smart chip loads the model parameters of each node of the multi-output neural network graph and processes the data fed into each node according to that node's model parameters. Because the output data computed by a single node of the multi-output neural network graph is usually too large to be held in the internal memory of the smart chip, the output data needs to be moved to an external memory; the output data of a single node under the multi-output neural network graph is therefore saved by accessing the external memory. Moreover, since the neural network contains many nodes, the smart chip has to access the external memory frequently when it performs the network computation on the input data of each node.

The smart chip includes, but is not limited to, a central processing unit, a graphics processing unit, a neural network processor, and the like. The internal memory is the on-chip memory of the smart chip, and the external memory may be a double data rate synchronous dynamic random access memory. The storage space of the external memory is usually far larger than that of the on-chip memory.

Based on this, please refer to Fig. 1, which shows an implementation flowchart of a slicing method for a multi-output neural network provided by an embodiment of the present application. The method can merge the nodes of the multi-output neural network so that the merged network reduces the number of accesses to the external memory when processing data, and includes the following steps:
S101: The smart chip splits each end node of the multi-output neural network from the other nodes to generate a plurality of network subgraphs each containing an end node; the multi-output neural network includes at least two end nodes and at least one fork node.

In an embodiment, the multi-output neural network can be represented by a multi-output neural network graph. Taking Fig. 2 as an example, Fig. 2 shows a simple multi-output neural network graph. The graph contains a plurality of nodes, each representing a unit of the multi-output neural network that can be computed independently on the smart chip, and the arrows in the graph represent the direction in which output data is passed between nodes.

Referring to Fig. 2, Fig. 2X is a schematic structural diagram of the multi-output neural network, and Fig. 2Y is a reference diagram of the network subgraphs after partitioning. As can be seen from Fig. 2, the multi-output neural network graph contains seven nodes, of which nodes 3, 5 and 7 are output nodes (that is, end nodes), nodes 1 and 2 are fork nodes, and nodes 4 and 6 are ordinary nodes.

It should be noted that, in this embodiment, the above method is used to partition a multi-output neural network graph, so the nodes include at least two end nodes and one fork node; the numbers of end nodes and fork nodes are not limited.

In an embodiment, each partitioned network subgraph contains one end node, so there are at least two subgraphs after partitioning. Splitting an end node from the other nodes may specifically be: the smart chip collects, using a post-order traversal according to the data transmission relationships between the nodes of the multi-output neural network graph, the other nodes that each end node requires, and performs the split to obtain the corresponding network subgraph.
S102: For any network subgraph, the smart chip slices the output data of the end node of the network subgraph according to a plurality of preset slicing manners to obtain a plurality of slicing schemes for the network subgraph.

In application, a slicing manner is specifically a way of slicing the output data quantity of the output data of the end node. After the output data quantity is sliced, the size of the data the end node outputs each time is determined. Based on this output size, the network subgraph can determine the amount of input data it handles each time, so that the smart chip can allocate enough available space for every node to process its input and output data.

Since the end node can output data of any dimension, slicing its output data correspondingly yields a plurality of slicing schemes, which are not limited here.
The data processed by a multi-output neural network is usually image data, so the output data is described below by taking image data as an example.

In a specific embodiment, image data is usually described by N, C, H and W, where N is the number of input images, C is the number of channels, H is the number of pixels of the image in the vertical direction, and W is the number of pixels in the horizontal direction. If an entire image is fed into a node for processing at once, the output data quantity will be very large. Therefore, to reduce the amount of data a node processes each time, the smart chip can slice the output data of the node so as to lower both the input data quantity and the output data quantity handled by the end node and the other nodes of the network subgraph.

For example, when the output data of the end node is sliced along the H axis, the data processed by the node each time has size N, C, 1/2H, W. The node therefore only needs to fetch data of half the image size as its input. Because only half of the image is processed at a time, the end node first produces output data of half the image size. The smart chip can store this half-image output in the internal memory (and, for an end node, also write it out to the external memory), then delete from the internal memory the half-image data the node has already processed (it is no longer needed, so the processed input data can be released), and then process the other half of the image until the remaining half-image output is obtained and saved. In this way, while generating the complete output, the internal memory only needs storage space of half the image size for this node. If the node processed the whole image directly, the smart chip would have to allocate the storage space needed to hold the entire input image as well as the space needed to hold the entire output image.
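The saving described above can be made concrete with a short calculation. The following is a minimal illustrative sketch; the feature-map shape and the one-byte element size are assumptions chosen for the example, not values taken from the application.

```python
# Footprint comparison for slicing an N,C,H,W feature map along the H axis.
N, C, H, W = 1, 16, 64, 64     # hypothetical feature-map shape (assumption)
ELEM_BYTES = 1                 # e.g. int8 activations (assumption)

def tensor_bytes(n, c, h, w):
    return n * c * h * w * ELEM_BYTES

# Whole-image processing: the node's full input and full output are live together.
whole_image = tensor_bytes(N, C, H, W) + tensor_bytes(N, C, H, W)

# H-axis slicing into two halves: only half of the input and half of the output
# are live per pass, because the processed input half is released before the
# second pass starts.
per_slice = tensor_bytes(N, C, H // 2, W) + tensor_bytes(N, C, H // 2, W)

print(whole_image, per_slice)  # the per-slice footprint is half the whole-image one
```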
Based on this, in an embodiment, the multiple ways of slicing the output data of the end node include, but are not limited to, slicing along C, H and W respectively; the above example is only one way of slicing along the H axis.

Specifically, the slicing manners may be obtained as follows: the smart chip obtains the minimum dimension and the maximum dimension of the output data the end node can produce, and any dimension of outputtable data between the minimum dimension and the maximum dimension is determined as one slicing manner for the output data of the end node.

The minimum dimension is the smallest dimension of image data the end node and the non-fork nodes can output, and the maximum dimension is the largest. Image data of each such dimension is one kind of outputtable data, i.e. one manner of slicing the end node and the non-fork nodes.

The minimum dimension of image data an end node can output is usually one pixel, and the maximum dimension is usually the dimension of the whole image.

For example, when the number of images N is 1 and the number of channels C is also 1, if the vertical pixel dimension H of the image admits A slicing manners (that is, the image can be sliced vertically at any dimension in [1, A]) and the horizontal pixel dimension W admits B slicing manners (that is, the image can be sliced horizontally at any dimension in [1, B]), then there are A*B slicing manners for the output data of the end node. In other embodiments the number of channels C may also be greater than one, for example 3, in which case the channel dimension can be sliced as well. In this way, a variety of slicing manners for the network subgraph are obtained, giving the smart chip more slicing schemes to choose from.
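As a sketch of how such candidates could be enumerated, the snippet below lists every (C, H, W) tile size between 1 and the full extent, which for C = 1 reproduces the A*B count above. The helper name and the shapes are illustrative assumptions.

```python
from itertools import product

def candidate_slicing_manners(c, h, w, slice_channels=False):
    """Yield (c_tile, h_tile, w_tile) candidates between 1 and the full extent."""
    c_tiles = range(1, c + 1) if slice_channels else [c]
    h_tiles = range(1, h + 1)      # A vertical choices
    w_tiles = range(1, w + 1)      # B horizontal choices
    yield from product(c_tiles, h_tiles, w_tiles)

# With C = 1, H = A and W = B there are exactly A * B candidates.
assert len(list(candidate_slicing_manners(1, 4, 3))) == 4 * 3
```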
S103: For any slicing scheme of the network subgraph, the smart chip obtains the processing duration of the network subgraph and determines the target slicing scheme of the network subgraph according to the processing duration.

In application, the amount of input data the network subgraph handles at a time differs between slicing schemes, so the processing duration of the subgraph usually also differs between schemes. Based on this, among the slicing schemes of any network subgraph, the smart chip can determine the scheme with the shortest processing duration as the target slicing scheme of that subgraph, improving the efficiency with which the subgraph processes its input and output data.

For example, referring to Fig. 2, for the network subgraphs obtained by partitioning the multi-output neural network, the smart chip can enumerate all the valid slicing schemes of graph A and use the chip cost model (costmodel) to select the optimal one, i.e. the scheme with the lowest cycle count and the shortest time, as the target slicing scheme of end node 3; likewise enumerate all the valid slicing schemes of graph B and select the optimal one as the target slicing scheme of end node 5; and enumerate all the valid slicing schemes of graph C and select the optimal one as the target slicing scheme of end node 7.
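A minimal sketch of this selection step is given below. The cost-model callable stands in for the chip cost model mentioned above; its interface is an assumption made for illustration.

```python
def pick_target_scheme(valid_schemes, cost_model_cycles):
    """Return the valid slicing scheme with the lowest estimated cycle count."""
    best_scheme, best_cycles = None, float("inf")
    for scheme in valid_schemes:
        cycles = cost_model_cycles(scheme)   # estimated cycles / processing duration
        if cycles < best_cycles:
            best_scheme, best_cycles = scheme, cycles
    return best_scheme
```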
S104: The smart chip determines the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.

It can be understood that, after the target slicing scheme corresponding to each network subgraph is obtained, these target slicing schemes together constitute the overall slicing scheme of the multi-output neural network.

In this embodiment, the multi-output neural network is represented in the form of a multi-output neural network graph, and each end node of the graph is split from the other nodes to obtain the network subgraph corresponding to that end node. Within each network subgraph, the output data of its end node is then sliced according to a plurality of preset slicing manners to control the amount of input and output data the subgraph handles at a time, yielding a plurality of slicing schemes. The target slicing scheme of each subgraph is determined according to its processing duration, and from these the overall slicing scheme of the multi-output neural network is obtained. When the multi-output neural network processes data according to this slicing scheme, the output data passed between nodes can be kept in the available space of the smart chip, so every node can read its data directly from that space, reducing the number of accesses the smart chip makes to the external memory.
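Putting S101 to S104 together, the overall flow can be sketched as below. Every callable passed in (the partitioner, the scheme enumerator, the validity check and the cost model) is a hypothetical placeholder for the steps described in this embodiment, not an existing API.

```python
def slice_multi_output_network(graph, split_into_subgraphs, enumerate_schemes,
                               is_valid_scheme, cost_model_cycles):
    """One pass over S101-S104; all helpers are supplied by the caller."""
    overall_scheme = {}
    for subgraph in split_into_subgraphs(graph):           # S101: one subgraph per end node
        valid = [s for s in enumerate_schemes(subgraph)    # S102: preset slicing manners
                 if is_valid_scheme(subgraph, s)]          # pre-filter (see Fig. 4 below)
        if not valid:
            raise RuntimeError("no valid scheme for this subgraph: slicing fails")
        overall_scheme[subgraph] = min(valid, key=cost_model_cycles)   # S103
    return overall_scheme                                  # S104: union of target schemes
```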
Please refer to Fig. 3, which is a schematic diagram of one implementation of S101 of the slicing method for a multi-output neural network provided by an embodiment of the present application, detailed as follows:

S1: The smart chip takes an end node as the first current node.

In an embodiment, the first current node is the node currently being processed. Since the multi-output neural network graph is processed by post-order traversal, the first end node to be processed is the first current node.

S2: If the parent node of the first current node is a fork node, the smart chip judges whether that fork node has already been partitioned into a network subgraph corresponding to another end node.

S3: If it is determined that the fork node has already been partitioned into a network subgraph corresponding to another end node, the smart chip splits off the nodes between the first current node and the parent node to generate the network subgraph containing the end node.

S4: If it is determined that the fork node has not been partitioned into a network subgraph corresponding to another end node, or if the parent node is a non-fork node, the smart chip takes the parent node as the new first current node and repeats S2 to S4 until the network subgraph containing the end node is generated.
In an embodiment, taking Fig. 2 as an example, when the first current node is end node 3, the data transmission relationships in the multi-output neural network graph show that its parent node is fork node 2, and fork node 2 has not yet been partitioned into a network subgraph of another end node. Therefore, according to step S4, fork node 2 is taken as the new first current node and steps S2 to S4 are repeated until the network subgraph containing end node 3 is generated.

The network subgraph containing end node 3 may be generated in the case of S2 and S3 (the parent fork node has already been claimed by another subgraph), or when the new first current node has no parent node in the multi-output neural network graph, at which point the smart chip ends the splitting for end node 3. For example, when splitting for end node 3, if the new first current node is node 1, node 1 is the starting node of the multi-output neural network graph and has no parent node. The partition for end node 3 therefore ends here, yielding the network subgraph containing nodes 1, 2 and 3.

Thereafter, when splitting for end node 5, the initial first current node is end node 5 and its parent node is non-fork node 4. According to step S4, the smart chip then takes non-fork node 4 as the new first current node. At this point the node above the new first current node 4 (i.e. its new parent node) is fork node 2. However, as described above, fork node 2 has already been partitioned into the network subgraph corresponding to end node 3. Therefore, the smart chip ends the splitting for end node 5 and splits off end node 5 together with the nodes between end node 5 and the new parent node 2 (only node 4 in Fig. 2), generating the network subgraph containing end node 5.

The splitting for end node 7 is similar to that for end node 5 and is not described again. It should be noted that the above example merely takes end node 3 as the first end node to be processed; in other embodiments, end node 5 or 7 may also be taken as the first end node to be processed, which is not limited here.

Based on this, in this embodiment, by splitting each end node and the other nodes in turn through post-order traversal, the network subgraph corresponding to each end node can be generated accurately and quickly, improving the efficiency with which the smart chip partitions the multi-output neural network graph.
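Under the assumption of a minimal node model in which every node records its single parent and whether it is a fork node (which matches the simple graph of Fig. 2 but is not the application's actual data structure), the S1-S4 walk can be sketched as follows.

```python
class Node:
    def __init__(self, name, parent=None, is_fork=False):
        self.name, self.parent, self.is_fork = name, parent, is_fork

def build_subgraph(end_node, claimed_forks):
    """Collect the nodes of one end node's subgraph, stopping at a fork node that
    has already been claimed by another end node's subgraph (S2/S3)."""
    nodes, current = [end_node], end_node
    while current.parent is not None:
        parent = current.parent
        if parent.is_fork and parent in claimed_forks:
            break                          # S3: fork already belongs to another subgraph
        nodes.append(parent)               # S4: take the parent as the new current node
        if parent.is_fork:
            claimed_forks.add(parent)
        current = parent
    return nodes

# Fig. 2 example: fork 1 -> fork 2 -> end 3, fork 2 -> 4 -> end 5, fork 1 -> 6 -> end 7
n1 = Node("1", is_fork=True)
n2 = Node("2", parent=n1, is_fork=True)
n3, n4 = Node("3", parent=n2), Node("4", parent=n2)
n5, n6 = Node("5", parent=n4), Node("6", parent=n1)
n7 = Node("7", parent=n6)

claimed = set()
print([n.name for n in build_subgraph(n3, claimed)])   # ['3', '2', '1']  (graph A)
print([n.name for n in build_subgraph(n5, claimed)])   # ['5', '4']       (graph B)
print([n.name for n in build_subgraph(n7, claimed)])   # ['7', '6']       (graph C)
```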
It should be noted that each network subgraph contains part of the nodes of the multi-output neural network graph. In addition, slicing the multi-output neural network graph only adjusts the size of the input and output data processed by each node of the graph; it does not divide the graph into multiple separate parts.

In another embodiment, when step S102 is performed, each network subgraph usually has a plurality of slicing schemes. If, when step S103 is then performed, the smart chip measured the processing duration of the network subgraph for every slicing scheme, a large amount of time would be consumed. Based on this, before performing S103, the smart chip may first judge whether a slicing scheme is a valid slicing scheme, and only perform the step of obtaining the processing duration of the network subgraph when the slicing scheme is valid, reducing the time required by the smart chip.

Please refer to Fig. 4, which is a schematic diagram of one implementation for judging whether a slicing scheme is a valid slicing scheme in a slicing method for a multi-output neural network provided by an embodiment of the present application, detailed as follows:
S41: The smart chip back-infers the input data and output data of every node from the output data quantity of the sliced output data of the end node.

In application, after the output data quantity of the end node is sliced, the size of the data the end node outputs each time is fixed; from the computation parameters and formulas of the end node, the smart chip can therefore back-infer the size of the input data the end node consumes each time. Since the input data of the end node is the output data of the previous node, the smart chip can back-infer the input data and output data of every node in this way.
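As a sketch of this back-inference, the snippet below propagates a sliced output height backwards through a chain of sliding-window nodes; the kernel, stride and padding values are illustrative assumptions, not parameters from the application.

```python
def required_input_height(out_h, kernel=3, stride=1, pad=0):
    """Input rows needed to produce out_h output rows for a sliding-window operator."""
    return (out_h - 1) * stride + kernel - 2 * pad

def back_infer_heights(end_out_h, node_params):
    """S41: from the end node's sliced output, walk backwards; each node's input
    size is the previous node's output size."""
    sizes = [end_out_h]
    for kernel, stride, pad in node_params:        # ordered from end node to start node
        sizes.append(required_input_height(sizes[-1], kernel, stride, pad))
    return sizes

print(back_infer_heights(16, [(3, 1, 0), (3, 2, 0)]))   # e.g. [16, 18, 37]
```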
S42: For any node, the smart chip determines the target output data of the nodes preceding that node.

S43: The smart chip determines the space occupied by the node during execution according to the target output data and the node's input data and output data.

In an embodiment, the preceding nodes are the nodes located before the node in the multi-output neural network, which can be determined from the data transmission relationships between the nodes.

In application, the input data of a node is the output data of the previous node. Therefore, to decide whether a slicing scheme is valid, the smart chip can determine as valid a slicing scheme under which, when the network subgraph processes its input and output data, no node needs to fetch its input data from the external memory.

If no node needs to fetch input data from the external memory when the network subgraph processes its input and output data, then under that slicing scheme the internal memory of the smart chip can allocate enough available space for every node while it processes its input and output data. Based on this, for any node of any network subgraph, the smart chip needs to compute the space that node occupies while processing its input and output data.
In a specific embodiment, the smart chip may determine the occupied space as follows: determine all the fork nodes before the node and the output data of each fork node; judge whether the output data of each fork node can be released; and take the output data of all the fork nodes that cannot be released as the target output data.

The smart chip may judge whether the output data of a fork node can be released as follows: if the output data of the fork node is still to be used by another network subgraph, the output data of the fork node cannot be released; otherwise it can be released.

In an embodiment, to reduce the number of interactions with the external memory, when a node produces output data it usually needs to store that output data in the internal memory so that it can be read by the next node. Since the output data of a non-fork node is exactly the input data of the next node, when computing the occupied space of such a node it suffices to take the sum of the output data quantity of its output data and the input data quantity of its input data as the space required by that node; in this case there is no target output data.

However, if there is a fork node before the node and the output data of that fork node is still needed by another network subgraph, the output data of the fork node has to remain in the internal memory. Therefore, when computing the space required by the node, the output data quantities of the fork node outputs that cannot be released must also be added to obtain the space the node requires; the outputs of the fork nodes that have not been released at that moment are the target output data. The purpose of judging whether the output data of a fork node can be released is that, when it can be released, it is deleted from the internal memory so as to increase as much as possible the available space the internal memory can allocate to the next node, so that the next node does not need to store its output data in the external memory, reducing the number of interactions with the external memory.
S44: If the occupied space of every node is smaller than the available space the smart chip allocates to the corresponding node, the smart chip determines the slicing scheme as a valid slicing scheme.

In application, when the space a node occupies while processing its input and output data is smaller than the available space the smart chip allocates to it, the internal memory of the smart chip has enough space to store the input data the node needs and, at the same time, the output data the node produces.

In application, the available space may be the total space of the internal memory of the smart chip, or the space remaining in the internal memory after the computation parameters the node uses to process data have been stored, which is not limited here.

It can be understood that the advantage of slicing the output data of the end node with a valid slicing scheme is that it reduces the amount of input data the node processes each time and the amount of output data it produces, so that the space each node needs while processing its input and output data is reduced. The smart chip can then keep the output data produced by every node in the internal memory, reducing the number of interactions with the external memory.
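A compact sketch of the S41-S44 check is given below. It processes the nodes of one subgraph in order, keeps the fork outputs that cannot be released as the live set, and requires every node's occupied space to be smaller than the available space Q. The per-node sizes are supplied directly as inputs here; in the method they come from the back-inference of S41. The dictionary layout is an assumption for illustration.

```python
def subgraph_fits(nodes, available_space, live):
    """
    nodes: ordered list of dicts with keys
        name, producer       -- this node and the node whose output feeds it
        in_bytes, out_bytes  -- sliced input/output sizes from the back-inference
        keep_output          -- True for a fork output still needed by another subgraph
        releases             -- names of live fork outputs with no later consumer
    live: dict mapping fork-node names to output sizes already resident on entry.
    """
    for node in nodes:
        # The producer's output is already counted as this node's input, so only the
        # other unreleased fork outputs contribute to the target output data.
        target_output = sum(b for name, b in live.items() if name != node["producer"])
        occupied = target_output + node["in_bytes"] + node["out_bytes"]
        if occupied >= available_space:
            return False                      # S44 fails for this node
        for name in node.get("releases", ()):
            live.pop(name, None)              # fork output no longer needed anywhere
        if node.get("keep_output", False):
            live[node["name"]] = node["out_bytes"]
    return True
```

Checking the subgraphs of Fig. 2 in the order A, B, C with the same `live` dictionary reproduces the conditions derived in the worked example below (Q1 < Q, Q2 < Q and Q3 + Q1b < Q for graph A, and so on).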
For example, assume that the space of the internal memory of the smart chip is Q, where Q is the space remaining in the internal memory after the space occupied by the node's computation parameters is removed. The space a non-fork node occupies while processing data generally includes the space occupied by its input data and its output data. This is described in detail below with reference to Fig. 2:

For any slicing scheme, the input data and output data of every node under that scheme are obtained. Referring to Fig. 2, for network subgraph A (the subgraph consisting of end node 3, fork node 1 and fork node 2), slicing the output data of end node 3 requires internal memory of input data = Q3a and output data = Q3b, written as Q3 = Q3a + Q3b. The output data of fork node 2 cannot be released or sliced, so it requires internal memory of input data = Q2a and output data = Q2b, written as Q2 = Q2a + Q2b. The output data of fork node 1 likewise cannot be released or sliced, requiring input data = Q1a and output data = Q1b, written as Q1 = Q1a + Q1b. The reason the outputs of fork nodes 1 and 2 cannot be released or sliced is that both outputs need to be reused (graphs A and C both contain nodes that need the output of fork node 1, and graphs A and B both contain nodes that need the output of fork node 2).

It should be noted that, for every node of the multi-output neural network, at the instant the input data is produced during the back-inference from the output data, the node holds both its input data and its output data. Therefore, the most basic requirement each node must satisfy is: at the moment the node generates its input data, the sum of its input data quantity and its output data quantity is less than or equal to the available space the smart chip allocates to the node.

Therefore, when computing the internal memory space needed by fork node 1, Q1 < Q must hold before the space required by fork node 2 is computed; otherwise the slicing scheme is invalid. Since the input data of fork node 2 is exactly the output data of fork node 1, and in actual processing the input data of fork node 1 no longer needs to be kept in the internal memory once fork node 1 has computed its output, only the output data of fork node 1 remains in the internal memory at that point. Because the input data of fork node 1 has been deleted and the output data of fork node 1 is the input data of fork node 2, when computing the space required by fork node 2 it suffices that Q2 < Q, after which end node 3 is computed; otherwise the slicing manner is invalid. When computing the space required by end node 3, the input data of end node 3 is the output data of fork node 2, and the output data of fork node 1 still has to be used by other nodes (node 6 in graph C uses it). Hence the space required by end node 3 is Q3 + Q1b: if Q3 + Q1b < Q, the slicing scheme is valid; otherwise it is invalid.
After graph A is sliced, the output data Q1b and Q2b of fork node 1 and fork node 2 will still be referenced by graphs B and C, so they cannot be released.

Graph B is the network subgraph containing end node 5 and non-fork node 4. For non-fork node 4, the internal memory it needs is input data = Q4a and output data = Q4b, written as Q4 = Q4a + Q4b. However, when graph A was sliced, Q1b of fork node 1 and Q2b of fork node 2 were not released, and the output data Q2b of fork node 2 is the input data Q4a of non-fork node 4. Therefore the space non-fork node 4 occupies while processing its input data must satisfy Q4b + Q1b + Q2b < Q before end node 5 can be computed; otherwise the slicing scheme for end node 5 is invalid. When computing end node 5, the input data of end node 5 is the output data of non-fork node 4, and at this point the input data Q4a of non-fork node 4 (i.e. Q2b) can be released (graph C does not need it). Hence the space required by end node 5 must satisfy Q5 + Q1b (the output of fork node 1 is still needed by graph C and has not been released): if Q5 + Q1b < Q, the slicing manner is valid; otherwise it is invalid.

Similarly, after graph B is sliced, the output data Q2b of fork node 2 has been released, but the output Q1b of fork node 1 still cannot be released because graph C still needs it; Q1b is therefore still used when slicing graph C. In graph C, judging whether the space required by each node fits within the available space of the smart chip is similar to the above and is not explained again.

It should be noted that if any one of the network subgraphs has no valid slicing scheme, slicing of the whole multi-output neural network fails. In addition, the above output data is the data used when determining whether a slicing scheme is valid, obtained by a back-inference process from output to input. After the back-inference, if slicing the output data with scheme X during the back-inference allows the smart chip to allocate enough available space for every node, then when the multi-output neural network limits, in the same slicing manner, the amount of input data each node processes, the smart chip should likewise be able to allocate enough available space for every node to store the output data it produces.
In another embodiment, after the slicing scheme of the multi-output neural network is determined, the operators that process the input and output data in the multi-output neural network also need to be configured so that they process the input and output data according to the slicing scheme.

Specifically, each node includes a plurality of operators that process its input data and/or output data, and each operator needs to call a corresponding scheduling statement to be implemented. The slicing method for the multi-output neural network further includes:

For any node of the multi-output neural network graph, the smart chip obtains, among the node's operators, the start operator that first processes the node's input data and the end operator that last processes the node's output data.

If the node is an end node, the smart chip calls, when running the start operator, a first scheduling statement that performs a first operation; and calls, when running the end operator, a second scheduling statement that performs a second operation and a third scheduling statement that performs a third operation. The first operation reads the input data from the available space of the smart chip into the start operator; the second operation writes the output data produced by the end operator into the internal memory; and the third operation writes the output data produced by the end operator of the end node into the external memory.

If the node is the starting node of the multi-output neural network, the smart chip calls, when running the start operator of the starting node, a fourth scheduling statement that performs a fourth operation as well as the first scheduling statement; and calls the second scheduling statement when running the end operator of the starting node. The fourth scheduling statement reads the input data from the external memory into the start operator of the starting node.

If the node is an intermediate node, the smart chip calls the first scheduling statement whenever it runs the start operator of the intermediate node, and calls the second scheduling statement whenever it runs the end operator of the intermediate node.
In an embodiment, a node is the smallest processing unit of the multi-output neural network graph and contains the computation formulas applied to the input data. A node usually contains a large number of operators that process the input data.

When a node contains multiple operators, the first operator that processes the input data is taken as the start operator and the last operator that produces the output data is taken as the end operator.

In an embodiment, the first operation is specifically the operation performed for the start operator; it reads the input data the start operator of the node needs. When the node is not the starting node of the multi-output neural network, the first operation reads the input data the start operator needs from the internal memory. If the node is the starting node, a fourth operation that reads the required input data from the external memory also has to be performed; when the input data is read, it is first stored in the internal memory and then read from the internal memory.

In an embodiment, the second operation writes the output data of the end operator into the available space. It can be understood that, if the node is an end node, the output data produced by the end operator not only needs to be written into the internal memory but also requires the third operation of writing the output data into the external memory. For an intermediate node (neither the starting node nor an end node), only the second operation of writing the output data of the end operator into the available space of the internal memory, and the first operation of reading the input data from the available space of the internal memory, need to be performed.

Referring to Fig. 2, the starting node is fork node 1; the end nodes are nodes 3, 5 and 7; and the intermediate nodes are nodes 2, 4 and 6.

It should be noted that, when the scheduling statements are generated for each operator, the multiple operators within one node can be fused so that they are combined and placed into the same kernel. In this way, through operator fusion, the smart chip does not move the data of fork nodes to the external memory, which effectively reduces the number of data transfers and saves processing time on the smart chip.
For example, graphs A, B and C are scheduled in post-order according to the optimal slicing manner found for each. Assume the last operator (compute_op) of each node is denoted the end operator (END_OP) and the first operator (compute_op) is denoted the start operator (START_OP). This is described below with reference to Fig. 2:

In graph A: the END_OP of end node 3 is taken as the first root node 1, and the split primitive is called on the N, C, H and W axes to slice. cache_write (store write) on END_OP generates the compute_op that moves the output data to DM (internal memory) and DDR (external memory). cache_read (store read) on START_OP generates the compute_op that moves DM data into the computation unit. If a compute_op contains parameters, cache_read also has to generate the compute_op that moves the DDR data into DM and the computation unit, and all compute_ops are moved via compute_at to the axis corresponding to root node 1.

The output data of fork node 2 is then obtained, the END_OP of fork node 2 is set as the second root node 2, and cache_write on END_OP generates the compute_op that moves the output data to DM. cache_read on START_OP generates the compute_op that moves DM data into the computation unit. If a compute_op contains parameters, cache_read also has to generate the compute_op that moves DDR data into DM and the computation unit, and all compute_ops are moved via compute_at to the axis corresponding to root node 2.

Fork node 1 is handled in the same way as fork node 2, setting the END_OP of fork node 1 as root node 3; since fork node 1 is the first node of the multi-output neural network, cache_read on START_OP needs to generate the compute_op that moves DDR data into DM and the computation unit.
图B中:输出节点5的END_OP为第一根节点4对N、C、H、W轴调用split原语进行切片。通过cache_write END_OP生成搬移输出数据到DM和DDR的compute_op。cache_read START_OP生成搬移DM数据到计算单元的compute_op,compute_op中所需的参数还需要cache_read生成搬移DDR数据到DM和计算单元的compute_op,并将所有的compute_op 通过compute_at移动到第一根节点4对应的轴。In Figure B: the END_OP of the output node 5 calls the split primitive for the first root node 4 to slice the N, C, H, and W axes. via cache_write END_OP generates compute_op that moves output data to DM and DDR. cache_read START_OP generates the compute_op that moves DM data to the computing unit. The parameters required in compute_op also require cache_read to generate the compute_op that moves DDR data to the DM and the computing unit, and all compute_op Move to the axis corresponding to the first root node 4 by compute_at.
节点4通过cache_write END_OP生成搬移输出数据到DM的compute_op。cache_read START_OP生成搬移DM数据到计算单元的compute_op。如果compute_op中所需的参数还需要cache_read生成搬移DDR数据到DM和计算单元的compute_op,并将所有的compute_op通过compute_at移动到第一根节点4对应的轴。Node 4 passes cache_write END_OP generates compute_op that moves output data to DM. cache_read START_OP generates compute_op that moves DM data to computing units. If the parameters required in compute_op also require cache_read to generate compute_op that moves DDR data to DM and computing unit, and move all compute_op to the axis corresponding to the first root node 4 through compute_at.
图C中:输出节点7的END_OP为第一根节点5,对N、C、H、W轴调用split原语进行切片。通过cache_write END_OP生成搬移输出数据到DM和DDR的compute_op。cache_read START_OP生成搬移DM数据到计算单元的compute_op。如果compute_op中所需的参数还需要cache_read生成搬移DDR数据到DM和计算单元的compute_op,并将所有的compute_op 通过compute_at移动到第一根节点5对应的轴。In Figure C: the END_OP of the output node 7 is the first root node 5, and the split primitive is called for the N, C, H, and W axes to slice. via cache_write END_OP generates compute_op that moves output data to DM and DDR. cache_read START_OP generates compute_op that moves DM data to computing units. If the parameters required in compute_op also require cache_read to generate compute_op that moves DDR data to DM and computing unit, and all compute_op Move to the axis corresponding to the first root node 5 by compute_at.
节点6通过cache_write END_OP生成搬移输出数据到DM的compute_op,cache_read START_OP生成搬移DM数据到计算单元的compute_op。如果compute_op中所需的参数还需要cache_read生成搬移DDR数据到DM和计算单元的compute_op,并将所有的compute_op通过compute_at移动到第一根节点5对应的轴。Node 6 passes cache_write END_OP generates a compute_op that moves output data to DM, and cache_read START_OP generates a compute_op that moves DM data to a computing unit. If the parameters required in compute_op also require cache_read to generate compute_op that moves DDR data to DM and computing unit, and move all compute_op to the axis corresponding to the first root node 5 through compute_at.
需要说明的是,split,cache_write,cache_read,compute_at等都是TVM的调度原语,上述步骤主要为针对输入数据和输出数据的处理流程和读写流程进行改进的步骤。即采用上述方法可以减少智能芯片从外部存储器中读取输入数据的次数。但是每个算子在运行时,可能还需要从外部存储器搬移其他计算参数,以对输入数据进行处理。然而,搬移计算参数是需要访问外部存储器的。也即本申请中所描述多输出神经网络图的切片方法,只是可以降低多输出神经网络图从外部存储器访问输入数据的次数。It should be noted that split, cache_write, cache_read, compute_at, etc. are all scheduling primitives of TVM. The above steps are mainly for improving the processing flow of input data and output data and the reading and writing flow. That is, the above method can reduce the number of times the smart chip reads input data from the external memory. However, when each operator is running, it may also need to move other calculation parameters from the external memory to process the input data. However, moving calculation parameters requires access to external memory. That is, the slicing method of the multi-output neural network graph described in this application can only reduce the number of times the multi-output neural network graph accesses input data from an external memory.
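For readers unfamiliar with these primitives, the following is a minimal sketch of the pattern described above, written against TVM's te schedule API. The operator, the tensor shape, the split factors and the use of the generic "local" scope as a stand-in for the chip's DM are illustrative assumptions, not the actual implementation of this application.

```python
# Minimal sketch: cache_write/cache_read create the data-moving compute_ops,
# split slices the root node's N/C/H/W axes, and compute_at attaches the
# moved ops to the sliced root axis. Shapes, factors and the "local" scope
# are placeholder assumptions.
import tvm
from tvm import te

N, C, H, W = 1, 64, 56, 56
A = te.placeholder((N, C, H, W), name="A")                      # input held in DDR
B = te.compute((N, C, H, W), lambda n, c, h, w: A[n, c, h, w] * 2.0, name="B")

s = te.create_schedule(B.op)                                    # B.op stands in for the root node's END_OP
B_dm = s.cache_write(B, "local")                                # compute_op that moves output data to DM
A_dm = s.cache_read(A, "local", [B_dm])                         # compute_op that moves DDR input to DM

no, ni = s[B].split(B.op.axis[0], factor=1)                     # slice N
co, ci = s[B].split(B.op.axis[1], factor=16)                    # slice C
ho, hi = s[B].split(B.op.axis[2], factor=14)                    # slice H
wo, wi = s[B].split(B.op.axis[3], factor=14)                    # slice W
s[B].reorder(no, co, ho, wo, ni, ci, hi, wi)

s[B_dm].compute_at(s[B], wo)                                    # attach the moved ops to the root axis
s[A_dm].compute_at(s[B], wo)

print(tvm.lower(s, [A, B], simple_mode=True))                   # inspect the resulting loop nest
```

On an actual chip the DM would be exposed as its own storage scope and the split factors would come from the selected slicing scheme; the generic scope above is used only for illustration.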
In an embodiment, referring to FIG. 5, FIG. 5 is a schematic diagram of a processing scenario in which a single-output neural network processes input data in a smart chip. Step 1: the smart chip obtains the input data of the single-output neural network from the external memory and stores it in the internal memory. Step 2: the smart chip reads the input data from the internal memory, processes it with the model parameters in computing unit A to obtain output data, and then stores the output data in the external memory. Step 3: the smart chip clears the output data and the input data of node A. Step 4: the input data is read from the external memory again and processed with the model parameters in computing unit B to obtain output data. Step 5: the output data of node B is stored in the external memory. It can be seen from FIG. 5 that, for a single-output neural network model, an access to the external memory has to be performed when processing the input data of every node.
However, for a multi-output neural network, FIG. 6 is a schematic diagram of a processing scenario in which a multi-output neural network processes input data in a smart chip. This example is explained only with a multi-output neural network formed by nodes 1, 2, 3, 4 and 5. Specifically, Step 1: the smart chip obtains the input data of the multi-output neural network from the external memory and stores it in the internal memory. Step 2: for node 2, the smart chip can read the input data directly from the internal memory and delete the output data of node 1 from the internal memory; afterwards, the output data of node 2 is stored in the internal memory. Meanwhile, it can be known from the above network subgraph segmentation operation that, even if the output data of node 2 is stored in the internal memory, the sum of the space occupied by the model parameters of the remaining nodes deployed in the internal memory and the space occupied by the generated output data will still be smaller than the available space of the internal memory. Therefore, node 4 and end node 3 can obtain the output data of node 2 directly from the internal memory, and node 4 then also places its generated output data in the internal memory to be read and processed by end node 5. Afterwards, the smart chip executes Step 3 and Step 4: the output data generated by end node 3 and end node 5 are respectively stored in the external memory.
It can be seen from FIG. 6 that, for a multi-output neural network, the nodes in each network subgraph exchange data by reading and writing the data stored in the internal memory. In this way, the operation of accessing the external memory only needs to be performed once for the start node and once for each output node of the multi-output neural network.
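To make the contrast concrete, the following toy count compares the external-memory accesses of the two scenarios; the "one write plus one read per edge" model and the chain lengths are assumptions for illustration, not measurements from the application.

```python
# Toy model of DDR accesses: in the single-output scenario every intermediate
# result is written to DDR and read back by the next node; in the sliced
# multi-output scenario only the network input and the end-node outputs touch DDR.
def ddr_accesses_single_output(num_nodes: int) -> int:
    return 1 + 2 * (num_nodes - 1) + 1    # input read, per-edge write+read, final output write

def ddr_accesses_sliced_multi_output(num_end_nodes: int) -> int:
    return 1 + num_end_nodes              # one input read plus one write per end node

print(ddr_accesses_single_output(5))        # 10 for a five-node chain
print(ddr_accesses_sliced_multi_output(2))  # 3 for the two-output example of FIG. 6
```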
Please refer to FIG. 7, which is a structural block diagram of a slicing apparatus for a multi-output neural network provided by an embodiment of the present application. The modules included in the slicing apparatus for a multi-output neural network in this embodiment are used to execute the steps in the embodiments corresponding to FIG. 1 and FIG. 3 to FIG. 4; for details, please refer to FIG. 1 and FIG. 3 to FIG. 4 and the related descriptions in the corresponding embodiments. For ease of description, only the parts related to this embodiment are shown. Referring to FIG. 7, the slicing apparatus 700 for a multi-output neural network graph may include: a segmentation module 710, a slicing module 720, a processing duration obtaining module 730 and a slicing scheme determination module 740, wherein:
The segmentation module 710 is configured to segment each end node in the multi-output neural network from the other nodes to generate multiple network subgraphs each containing an end node; the multi-output neural network includes at least two end nodes and at least one fork node.
The slicing module 720 is configured to, for any network subgraph, slice the output data volume of the output data of the end node of the network subgraph according to multiple preset slicing ways to obtain multiple slicing schemes of the network subgraph.
The processing duration obtaining module 730 is configured to, for any slicing scheme of a network subgraph, obtain the processing duration of the network subgraph and determine the target slicing scheme of the network subgraph according to the processing duration.
The slicing scheme determination module 740 is configured to determine the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
In an embodiment, the segmentation module 710 is further configured to:
S1: take the end node as a first current node. S2: if the parent node of the first current node is a fork node, judge whether the fork node has already been segmented into a network subgraph corresponding to another end node. S3: if it is determined that the fork node has already been segmented into a network subgraph corresponding to another end node, segment the nodes between the first current node and the parent node to generate a network subgraph containing the end node. S4: if it is determined that the fork node has not been segmented into a network subgraph corresponding to another end node, or if the parent node is a non-fork node, determine the parent node as the new first current node and repeat S2 to S4, as in the sketch below.
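A rough sketch of S1 to S4 in code form; the single-parent chain and the node attributes are simplifying assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GNode:
    name: str
    parent: Optional["GNode"] = None   # single parent assumed for this sketch
    is_fork: bool = False

def segment(end_nodes: List[GNode]) -> List[List[GNode]]:
    """Walk from each end node toward the network input; cut at the first fork
    node that another end node's subgraph has already claimed (S3)."""
    claimed_forks = set()
    subgraphs = []
    for end in end_nodes:
        sub, cur = [end], end                  # S1: the end node is the first current node
        while cur.parent is not None:
            parent = cur.parent
            if parent.is_fork and id(parent) in claimed_forks:
                break                          # S3: split between cur and parent
            if parent.is_fork:
                claimed_forks.add(id(parent))  # first subgraph to reach this fork keeps it
            sub.append(parent)                 # S4: the parent becomes the new current node
            cur = parent
        subgraphs.append(sub)
    return subgraphs
```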
In an embodiment, the slicing module 720 is further configured to:
obtain the minimum dimension of the data the end node can output and the maximum dimension of the data it can output; and, between the minimum dimension and the maximum dimension, determine the outputable data of any one dimension as one slicing way for slicing the output data of the end node.
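One possible reading of this enumeration, sketched in code; the per-axis doubling step and treating each candidate extent as a separate slicing way are assumptions, not details stated by the application.

```python
# Enumerate candidate slice extents for one axis of the end node's output,
# from the smallest outputable extent up to the full extent.
def candidate_slice_extents(full_extent: int, min_extent: int = 1):
    extent = min_extent
    while extent < full_extent:
        yield extent
        extent *= 2          # assumed step between candidates
    yield full_extent

print(list(candidate_slice_extents(56)))   # e.g. [1, 2, 4, 8, 16, 32, 56] for H = 56
```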
In an embodiment, the slicing apparatus 700 for a multi-output neural network graph further includes:
a judging module, configured to, for any slicing scheme of a network subgraph, judge whether the slicing scheme is a valid slicing scheme, and execute the step of obtaining the processing duration of the network subgraph only when the slicing scheme is a valid slicing scheme.
In an embodiment, the judging module is further configured to:
infer the input data and the output data of each node in reverse from the output data volume of the sliced output data of the end node; for any node, determine the target output data of the nodes preceding that node; determine the space occupied when that node is executed according to the target output data and the input data and output data of that node; and, if the occupied space of every node is smaller than the available space allocated by the smart chip for the corresponding node, determine the slicing scheme as a valid slicing scheme, as in the sketch below.
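A compact sketch of this validity check; the field names, the flat per-node DM budget and the assumption that the per-node sizes have already been back-inferred from the sliced end-node output are all illustrative choices.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SlicedNode:
    name: str
    in_bytes: int                            # input size implied by the sliced end-node output
    out_bytes: int                           # output size implied by the sliced end-node output
    is_fork: bool = False
    needed_by_other_subgraph: bool = False   # fork output that cannot be released yet

def is_valid_scheme(nodes: List[SlicedNode], available_bytes: int) -> bool:
    """nodes is one subgraph in execution order, with sizes already back-inferred."""
    for i, node in enumerate(nodes):
        # Target output data: earlier fork outputs that must stay resident in the DM.
        pinned = sum(n.out_bytes for n in nodes[:i]
                     if n.is_fork and n.needed_by_other_subgraph)
        occupied = pinned + node.in_bytes + node.out_bytes
        if occupied >= available_bytes:      # must stay below the allocated space
            return False
    return True
```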
In an embodiment, the judging module is further configured to:
determine all fork nodes preceding that node and the output data of each fork node; judge whether the output data of each fork node can be released, and take the output data of all fork nodes that cannot be released as the target output data.
In an embodiment, the judging module is further configured to:
if the output data of a fork node is also to be used by another network subgraph, determine that the output data of the fork node cannot be released; otherwise, determine that the output data of the fork node can be released.
It should be understood that, in the structural block diagram of the slicing apparatus for a multi-output neural network graph shown in FIG. 7, each module is used to execute the steps in the embodiments corresponding to FIG. 1 and FIG. 3 to FIG. 4, and these steps have been explained in detail in the above embodiments. For details, please refer to FIG. 1 and FIG. 3 to FIG. 4 and the related descriptions in the corresponding embodiments, which are not repeated here.
FIG. 8 is a structural block diagram of a smart chip provided by an embodiment of the present application. As shown in FIG. 8, the smart chip 800 of this embodiment includes: a processor 810, a memory 820, and a computer program 830 stored in the memory 820 and executable on the processor 810, for example a program of a slicing method for a multi-output neural network. When the processor 810 executes the computer program 830, the steps in the above embodiments of the slicing method for a multi-output neural network are implemented, for example S101 to S104 shown in FIG. 1. Alternatively, when the processor 810 executes the computer program 830, the functions of the modules in the embodiment corresponding to FIG. 7 are implemented, for example the functions of the modules 710 to 740 shown in FIG. 7; for details, please refer to the related descriptions in the embodiment corresponding to FIG. 7.
Exemplarily, the computer program 830 may be divided into one or more modules, which are stored in the memory 820 and executed by the processor 810 to implement the slicing method for a multi-output neural network provided by the embodiments of the present application. The one or more modules may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 830 in the smart chip 800. For example, the computer program 830 may implement the slicing method for a multi-output neural network provided by the embodiments of the present application.
The smart chip 800 may include, but is not limited to, the processor 810 and the memory 820. Those skilled in the art can understand that FIG. 8 is only an example of the smart chip 800 and does not constitute a limitation on the smart chip 800; it may include more or fewer components than shown, or combine certain components, or use different components.
The processor 810 may be a central processing unit, or another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 820 may be an internal storage unit of the smart chip 800, for example a hard disk or memory of the smart chip 800. The memory 820 may also be an external storage device of the smart chip 800, for example a plug-in hard disk, a smart memory card or a flash memory card equipped on the smart chip 800. Further, the memory 820 may include both an internal storage unit of the smart chip 800 and an external storage device.
An embodiment of the present application provides a smart chip, including a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the slicing method for a multi-output neural network in each of the above embodiments.
An embodiment of the present application provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the slicing method for a multi-output neural network in each of the above embodiments.
An embodiment of the present application provides a computer program product which, when run on a computer, causes the computer to execute the slicing method for a multi-output neural network in each of the above embodiments.
The above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.

Claims (10)

  1. A slicing method for a multi-output neural network, applied to a smart chip, the method comprising:
    segmenting each end node in the multi-output neural network from the other nodes to generate multiple network subgraphs containing the end nodes; the multi-output neural network comprising at least two end nodes and at least one fork node;
    for any network subgraph, slicing the output data of the end node of the network subgraph according to multiple preset slicing ways to obtain multiple slicing schemes of the network subgraph;
    for any slicing scheme of the network subgraph, obtaining a processing duration of the network subgraph, and determining a target slicing scheme of the network subgraph according to the processing duration;
    determining a slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
  2. The method according to claim 1, wherein the segmenting each end node in the multi-output neural network from the other nodes to generate a network subgraph containing the end node comprises:
    S1: taking the end node as a first current node;
    S2: if a parent node of the first current node is the fork node, judging whether the fork node has already been segmented into a network subgraph corresponding to another end node;
    S3: if it is determined that the fork node has already been segmented into a network subgraph corresponding to another end node, segmenting the nodes between the first current node and the parent node to generate a network subgraph containing the end node;
    S4: if it is determined that the fork node has not been segmented into a network subgraph corresponding to another end node, or if the parent node is a non-fork node, determining the parent node as a new first current node, and repeating S2 to S4.
  3. The method according to claim 1, wherein the slicing the output data of the end node of the network subgraph according to multiple preset slicing ways to obtain multiple slicing schemes of the network subgraph comprises:
    obtaining a minimum dimension of data the end node can output and a maximum dimension of the data it can output;
    between the minimum dimension and the maximum dimension, determining the outputable data of any one dimension as one slicing way for slicing the output data of the end node.
  4. The method according to any one of claims 1 to 3, further comprising, before the obtaining the processing duration of the network subgraph:
    for any slicing scheme of the network subgraph, judging whether the slicing scheme is a valid slicing scheme, and executing the step of obtaining the processing duration of the network subgraph only when the slicing scheme is a valid slicing scheme.
  5. The method according to claim 4, wherein the judging whether the slicing scheme is a valid slicing scheme comprises:
    inferring the input data and the output data of each node in reverse according to the output data volume of the sliced output data of the end node;
    for any node, determining target output data of the nodes preceding that node;
    determining the space occupied when that node is executed according to the target output data and the input data and the output data of that node;
    if the occupied space of every node is smaller than the available space allocated by the smart chip for the corresponding node, determining the slicing scheme as a valid slicing scheme.
  6. The method according to claim 5, wherein, for any node, determining the target output data of the nodes preceding that node comprises:
    determining all fork nodes preceding that node and the output data of each of the fork nodes;
    judging whether the output data of each of the fork nodes can be released, and taking the output data of all fork nodes that cannot be released as the target output data.
  7. The method according to claim 6, wherein the judging whether the output data of each of the fork nodes can be released comprises:
    for any one of the fork nodes, if the output data of the fork node is also to be used by another network subgraph, determining that the output data of the fork node cannot be released;
    otherwise, determining that the output data of the fork node can be released.
  8. A slicing apparatus for a multi-output neural network, applied to a smart chip, the apparatus comprising:
    a segmentation module, configured to segment each end node in the multi-output neural network from the other nodes to generate multiple network subgraphs containing the end nodes; the multi-output neural network comprising at least two end nodes and at least one fork node;
    a slicing module, configured to, for any network subgraph, slice the output data volume of the output data of the end node of the network subgraph according to multiple preset slicing ways to obtain multiple slicing schemes of the network subgraph;
    a processing duration obtaining module, configured to, for any slicing scheme of the network subgraph, obtain a processing duration of the network subgraph and determine a target slicing scheme of the network subgraph according to the processing duration;
    a slicing scheme determination module, configured to determine a slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
  9. A smart chip, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
PCT/CN2022/143527 2022-02-25 2022-12-29 Slicing method and apparatus for multi-output neural network, and chip and storage medium WO2023160236A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210181249.1 2022-02-25
CN202210181249.1A CN114648105A (en) 2022-02-25 2022-02-25 Slicing method, device, chip and storage medium of multi-output neural network

Publications (1)

Publication Number Publication Date
WO2023160236A1 true WO2023160236A1 (en) 2023-08-31

Family

ID=81993069

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/143527 WO2023160236A1 (en) 2022-02-25 2022-12-29 Slicing method and apparatus for multi-output neural network, and chip and storage medium

Country Status (2)

Country Link
CN (1) CN114648105A (en)
WO (1) WO2023160236A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648105A (en) * 2022-02-25 2022-06-21 深圳云天励飞技术股份有限公司 Slicing method, device, chip and storage medium of multi-output neural network


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190286972A1 (en) * 2018-03-14 2019-09-19 Microsoft Technology Licensing, Llc Hardware accelerated neural network subgraphs
WO2021012609A1 (en) * 2019-07-24 2021-01-28 华为技术有限公司 Neural network segmentation method, prediction method, and related apparatus
CN113994350A (en) * 2020-03-27 2022-01-28 华为技术有限公司 Generating parallel computing schemes for neural networks
CN114648105A (en) * 2022-02-25 2022-06-21 深圳云天励飞技术股份有限公司 Slicing method, device, chip and storage medium of multi-output neural network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117172289A (en) * 2023-09-01 2023-12-05 苏州亿铸智能科技有限公司 Tensor segmentation method and device and electronic equipment
CN117172289B (en) * 2023-09-01 2024-09-06 苏州亿铸智能科技有限公司 Tensor segmentation method and device and electronic equipment

Also Published As

Publication number Publication date
CN114648105A (en) 2022-06-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22928474

Country of ref document: EP

Kind code of ref document: A1