WO2023160236A1 - Slicing method and apparatus for multi-output neural network, and chip and storage medium - Google Patents

Slicing method and apparatus for multi-output neural network, and chip and storage medium

Info

Publication number
WO2023160236A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
slicing
output data
network
output
Prior art date
Application number
PCT/CN2022/143527
Other languages
French (fr)
Chinese (zh)
Inventor
尹长生
蔡万伟
陈宁
Original Assignee
深圳云天励飞技术股份有限公司
Priority date
Filing date
Publication date
Application filed by 深圳云天励飞技术股份有限公司
Publication of WO2023160236A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application belongs to the technical field of model processing, and in particular relates to a multi-output neural network slicing method, device, chip and storage medium.
  • the neural network compilation framework (Tensor Virtual Machine, TVM) is a unified software stack for different neural network frameworks and hardware platforms, and can deploy neural networks under different frameworks on the hardware platform.
  • a neural network processes input data in a smart chip to obtain output data.
  • the available space inside the smart chip is often relatively small, and for a neural network that requires a large amount of storage space to process data, the available space inside the smart chip is far from enough. Therefore, in the process of data processing, for any node included in the neural network, it is often necessary to first store the output data of the node in an external memory. Then, the node's input data currently held in the storage space and the output data it has generated are deleted. Finally, the output data of the node is fetched from the external memory and used as the input data of the next node to complete the data processing.
  • in order to reduce the available space required by the neural network for data processing, it is necessary to first perform graph segmentation on the neural network to obtain network subgraphs. Afterwards, the segmented network subgraphs are sliced, so that when a network subgraph processes data slice by slice, the available space required by each node in the network subgraph is reduced; in this way, the output data generated by each node can be stored in the internal storage space, reducing the number of accesses to the external memory.
  • the current TVM scheduling primitive cannot support the output data of the fork node to be processed in multiple network subgraphs at the same time.
  • in the prior art, there is no reasonable slicing method for slicing the network subgraphs of a multi-output neural network, so that when the existing multi-output neural network processes data, the output data of the fork node still needs to be stored in the external memory, and the number of accesses to the external memory cannot be reduced.
  • the embodiment of the present application provides a multi-output neural network graph slicing method, device, smart chip and storage medium, which can solve the problem that the existing multi-output neural network graph requires frequent access to external memory when processing input data.
  • the embodiment of the present application provides a method for slicing a multi-output neural network, the method is applied to a smart chip, and the method includes:
  • the multi-output neural network includes at least two end nodes and at least one fork node;
  • the output data volume of the output data of the end nodes of the network subgraph is sliced according to various preset slicing methods, and various slicing schemes of the network subgraph are obtained;
  • for any slicing scheme of the network subgraph, obtain the processing time of the network subgraph, and determine the target slicing scheme of the network subgraph according to the processing time;
  • the slicing scheme for the multi-output neural network is determined.
  • the embodiment of the present application provides a multi-output neural network slicing device, which is applied to a smart chip, and the device includes:
  • the segmentation module is used to divide each end node in the multi-output neural network from other nodes, and generate multiple network subgraphs containing end nodes;
  • the multi-output neural network includes at least two end nodes and at least one bifurcation node;
  • the slicing module is used for slicing the output data volume of the output data of the end nodes of the network subgraph according to various preset slicing methods for any network subgraph, so as to obtain various slicing schemes of the network subgraph;
  • the processing duration acquisition module is used for obtaining the processing duration of the network subgraph for any slicing scheme of the network subgraph, and determining the target slicing scheme of the network subgraph according to the processing duration;
  • the slicing scheme determination module is configured to determine the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
  • an embodiment of the present application provides a smart chip, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • when the processor executes the computer program, the method in the first aspect above is implemented.
  • an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method in the first aspect above is implemented.
  • the embodiment of the present application provides a computer program product, which causes the computer to execute the method of the first aspect when the computer program product is run on the computer.
  • the beneficial effect of the embodiments of the present application compared with the prior art is: the multi-output neural network is characterized in the form of a multi-output neural network graph; then, for each end node in the multi-output neural network graph, the end node is divided from the other nodes to obtain the network subgraph corresponding to each end node. Afterwards, in each network subgraph, the output data of the end node of the network subgraph is sliced according to various preset slicing methods, so as to control the amount of input and output data handled by the network subgraph when processing data, obtaining a variety of slicing schemes. Afterwards, the target slicing scheme of the network subgraph is determined according to the processing time of the network subgraph when processing data.
  • the overall slicing scheme of the multi-output neural network is obtained in this way.
  • the input and output data are processed according to the slicing scheme of the multi-output neural network, so that the output data passed between nodes can be stored in the available space of the smart chip, and each node can obtain data directly from the available space when processing data, thereby reducing the number of times the smart chip accesses the external memory.
  • Fig. 1 is the realization flowchart of a kind of multi-output neural network slicing method provided by an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a multi-output neural network graph and of its network subgraphs in a multi-output neural network slicing method provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of an implementation of S101 of a multi-output neural network slicing method provided by an embodiment of the present application
  • Fig. 4 is a flow chart of realizing a multi-output neural network slicing method provided by another embodiment of the present application.
  • FIG. 5 is a schematic diagram of a processing scenario in which a single-output neural network processes input data in a smart chip provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a processing scenario of a multi-output neural network graph processing input data in a smart chip provided by an embodiment of the present application;
  • Fig. 7 is a schematic structural diagram of a multi-output neural network graph slicing device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a smart chip provided by an embodiment of the present application.
  • the multi-output neural network slicing method provided in the embodiment of the present application can be applied to a smart chip.
  • the smart chip is a chip that loads each model parameter in the multi-output neural network graph, and processes the input data input to each node according to the model parameters under each node.
  • since the output data calculated by a single node of the multi-output neural network graph is usually too large to be stored in the internal memory of the smart chip, the output data needs to be exported to an external memory. Therefore, the output data of a single node of the multi-output neural network graph has to be saved by accessing the external memory.
  • since there are multiple nodes in the neural network, when the smart chip performs the neural network computation on the input data of each node, the smart chip needs to access the external memory frequently.
  • the above-mentioned smart chip includes, but is not limited to, a central processing unit, a graphics processing unit, a neural network processor, and the like.
  • the above-mentioned internal memory is the on-chip memory of the smart chip.
  • the above-mentioned external memory may be a double-rate synchronous dynamic random access memory. Usually the storage space of the external memory is much larger than the storage space of the on-chip memory.
  • FIG. 1 shows an implementation flow diagram of a multi-output neural network slicing method provided by an embodiment of the present application.
  • the nodes in the multi-output neural network can be merged so that the merged multi-output neural network can reduce the number of accesses to the external memory when processing data, and the method includes the following steps:
  • the smart chip separates each end node in the multi-output neural network from other nodes to generate multiple network subgraphs containing the end nodes; the multi-output neural network includes at least two end nodes and at least one fork node.
  • the above-mentioned multi-output neural network can be represented by a multi-output neural network graph.
  • FIG. 2 is a simple multi-output neural network diagram.
  • the multi-output neural network graph includes a plurality of nodes, and each node is used to represent a unit in the multi-output neural network that can be independently calculated on the smart chip.
  • the arrows in the multi-output neural network diagram are used to represent the transmission direction of the output data between each node.
  • FIG. 2X is a schematic structural diagram of a multi-output neural network
  • FIG. 2Y is a reference schematic diagram of network subgraphs after the multi-output neural network is divided.
  • the multi-output neural network graph includes 7 nodes. Among them, nodes 3, 5, and 7 are output nodes (that is, end nodes), nodes 1 and 2 are fork nodes, and nodes 4 and 6 are ordinary nodes.
  • the multiple nodes include at least two end nodes and one fork node. In this embodiment, there is no limit to the number of end nodes and fork nodes.
  • each divided network subgraph includes one end node; therefore, there are at least two divided network subgraphs.
  • dividing the end node from the other nodes can specifically be: the smart chip uses a post-order traversal to collect, according to the data transmission relationships between the nodes in the multi-output neural network graph, the other nodes that each end node needs, and performs segmentation to obtain the corresponding network subgraph.
  • the smart chip slices the output data of the end nodes of the network subgraph according to various preset slicing methods, and obtains various slicing schemes of the network subgraph.
  • the above slicing manner is specifically a manner of slicing the output data volume of the output data of the end nodes.
  • the size of the output data volume when the end node outputs data each time can be determined.
  • the network subgraph can determine the size of the input data each time the input data is processed, so that the smart chip can allocate enough available space for each node to process the input and output data.
  • the terminal node can output output data of any dimension size, therefore, when slicing the output data of the terminal node, there will also be correspondingly various slicing schemes, which are not limited.
  • the data processed by the multi-output neural network is usually image data, therefore, the output data will be described using image data as an example.
  • information of the image data is usually represented by N, C, H, W.
  • N represents the number of input images (the batch size)
  • H represents the number of pixels of the image data in the vertical direction
  • W represents the number of pixels of the image data in the horizontal direction
  • C represents the number of channels.
  • the smart chip can slice the output data under the node, so as to reduce the size of the input data handled by the end node and the other nodes in the network subgraph when they process input data, and the size of the output data when they output data.
  • the slicing is performed on the H axis, so that the size of the data processed by each node is N, C, H/2, W. Therefore, the node only takes half the image at a time as its input data. Moreover, since only half of the image is processed at a time, the output data of the final end node is also produced half an image at a time.
  • the smart chip can first store the output data of half the size of the image in the internal memory, and output it to the external memory if it is an end node; then, delete the data of half the size of the image processed by the fork node from the internal memory (Because it does not need to be used, the processed input data can be released); after that, the data of the other half of the image size is processed again until the output data of the other half of the image size is obtained and saved. Based on this, in the process of generating the overall output data of the node, the internal memory only needs to use half the storage space of the image size for storage. If the node directly processes the input data of the entire image, the smart chip needs to allocate the storage space required for the input of the overall image data and the storage space required for generating the output data of the overall image to the node.
  • the multiple ways of slicing the end node output data include, but are not limited to, slicing along C, H, and W respectively. That is to say, the above example is only one of the ways to slice along the H axis.
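  • As an illustrative sketch only (the tensor shape, element size and slice count below are assumptions, not values from this application), the reduction in per-node footprint obtained by slicing along the H axis can be estimated as follows:

        # Assumed example tensor in N, C, H, W layout, float32 elements, assumed slice count.
        N, C, H, W = 1, 16, 224, 224
        BYTES_PER_ELEMENT = 4          # float32
        H_SLICES = 2                   # slice the H axis in half, as in the N, C, H/2, W example

        def node_footprint(n, c, h, w):
            # A node must hold one input block and one output block at the same time;
            # input and output are assumed here to have the same shape.
            block = n * c * h * w * BYTES_PER_ELEMENT
            return block + block

        full = node_footprint(N, C, H, W)
        sliced = node_footprint(N, C, H // H_SLICES, W)
        print(f"unsliced: {full} bytes, sliced along H: {sliced} bytes")  # sliced is half of full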
  • the way of slicing the data can be as follows: the smart chip obtains the minimum dimension of the output data that the end node can output and the maximum dimension of the output data that it can output; the smart chip then determines the output of any dimension between the minimum dimension and the maximum dimension as one slicing method for slicing the output data of the end node.
  • the above-mentioned minimum dimension is the minimum dimension of the image data that can be output by the terminal node and the non-forking node
  • the above-mentioned maximum dimension is the maximum dimension of the image data that can be output by the terminal node and the non-forking node.
  • the image data of each dimension is used as output data that can be output, that is, a slicing method for slicing terminal nodes and non-forked nodes.
  • the minimum dimension of the image data that can be output by the terminal node is usually 1 pixel
  • the maximum dimension of the image data that can be output is usually the dimension of the entire image data.
  • assuming the number of images N is 1 and the number of channels C is also 1: if the pixel height H of the image data allows A slicings, that is, any dimension between [1, A] can be used to slice the image data in the vertical direction, and the pixel width W of the image data allows B slicings, that is, any dimension between [1, B] can be used to slice the image data in the horizontal direction, then there will finally be A*B slicing methods for slicing the output data of the end node.
  • the number C of channels may also be multiple, such as three. Therefore, when the image data is divided, the number of channels can also be divided. Based on this, multiple slicing modes in the network subgraph can be obtained, so that the smart chip can have more slicing schemes for selection.
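  • A minimal sketch of how such preset slicing methods could be enumerated (the maximum dimensions A, B and the channel count below are assumptions for illustration): every dimension between 1 and the maximum along H and W, and optionally C, is one candidate, giving A*B (or A*B*C) slicing methods in total.

        from itertools import product

        # Assumed maximum output dimensions of an end node (whole-image output).
        A, B, C_MAX = 8, 8, 3  # H candidates 1..A, W candidates 1..B, channel candidates 1..C_MAX

        def enumerate_slicing_methods(a, b, c_max=1):
            # Each (c, h, w) triple is one slicing method: the output data of the end node
            # is produced in blocks of at most c channels, h rows and w columns per step.
            return [(c, h, w) for c, h, w in product(range(1, c_max + 1),
                                                     range(1, a + 1),
                                                     range(1, b + 1))]

        methods = enumerate_slicing_methods(A, B, C_MAX)
        print(len(methods), "candidate slicing methods")  # A * B * C_MAX candidates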
  • the smart chip acquires a processing time of the network subgraph, and determines a target slicing scheme of the network subgraph according to the processing time.
  • the smart chip can determine the slicing scheme with the shortest processing time as the target slicing scheme of the network subgraph, so as to improve the efficiency of the network subgraph when processing input and output data .
  • the smart chip can enumerate all the valid slicing schemes in graph A and obtain, through the chip cost model, the optimal slicing scheme with the lowest cycle count and the shortest time consumption as the target slicing scheme of end node 3; likewise, it enumerates all the valid slicing schemes in graph B and obtains the optimal slicing scheme through the chip cost model as the target slicing scheme of end node 5; and it enumerates all the valid slicing schemes in graph C and obtains the optimal slicing scheme through the chip cost model as the target slicing scheme of end node 7.
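  • The selection just described can be sketched as follows (a hedged illustration: is_valid_scheme and cost_model_cycles are placeholder names standing in for the validity check and the chip cost model of this application, not a concrete API):

        def select_target_scheme(subgraph, candidate_schemes, is_valid_scheme, cost_model_cycles):
            # Keep only schemes for which every node fits in the chip's available space.
            valid = [s for s in candidate_schemes if is_valid_scheme(subgraph, s)]
            if not valid:
                return None
            # The scheme with the lowest estimated cycle count is the target slicing scheme.
            return min(valid, key=lambda s: cost_model_cycles(subgraph, s))

  • The same selection would then be repeated independently for graph A, graph B and graph C to obtain the three target slicing schemes.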
  • the smart chip determines the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
  • the target slicing schemes of the network subgraphs together constitute the overall slicing scheme of the multi-output neural network.
  • the multi-output neural network is represented in the form of a multi-output neural network graph; then, for each end node in the multi-output neural network graph, the end node is divided from the other nodes to obtain the network subgraph corresponding to each end node. Afterwards, in each network subgraph, the output data of the end node of the network subgraph is sliced according to various preset slicing methods, so as to control the amount of input and output data handled by the network subgraph when processing data, obtaining a variety of slicing schemes. Afterwards, the target slicing scheme of the network subgraph is determined according to the processing time of the network subgraph when processing data.
  • the overall slicing scheme of the multi-output neural network is obtained in this way.
  • the input and output data are processed according to the slicing scheme of the multi-output neural network, so that the output data passed between nodes can be stored in the available space of the smart chip, and each node can obtain data directly from the available space when processing data, thereby reducing the number of times the smart chip accesses the external memory.
  • FIG. 3 is a schematic diagram of an implementation of S101 of a method for slicing a multi-output neural network provided by an embodiment of the present application, which is described in detail as follows;
  • the smart chip uses the terminal node as the first current node.
  • the above-mentioned first current node is the node currently being processed.
  • when the post-order traversal is used to process the multi-output neural network graph, the first end node to be processed is taken as the initial first current node.
  • the smart chip judges whether the fork node has been divided by network subgraphs corresponding to other end nodes.
  • the smart chip divides each node between the first current node and the parent node to generate a network subgraph including the end nodes.
  • the smart chip determines the parent node as the new first current node, and repeats S2-S4, until a network subgraph containing terminal nodes is generated.
  • the first current node is the end node 3
  • the parent node is the fork node 2.
  • the fork node 2 is not divided by the network subgraphs corresponding to other end nodes at this time. Therefore, based on the step S3, it can be seen that the fork node 2 can be used as the new first current node, and the steps S2-S4 are repeated until a network subgraph including the terminal node 3 is generated.
  • the condition for finishing the generation of the network subgraph containing end node 3 may be the condition described in step S2, or it may be that the new first current node has no parent node in the multi-output neural network graph.
  • the smart chip can finish dividing the terminal node 3.
  • when dividing for end node 3, if the new first current node is node 1, then since node 1 is the starting node of the multi-output neural network graph, it has no parent node. Therefore, the segmentation for the subgraph containing end node 3 ends here, and a network subgraph including nodes 1, 2, and 3 is obtained.
  • the initial first current node is the terminal node 5
  • its parent node is the non-forked node 4
  • the smart chip can then use the non-forked node 4 as the new first current node.
  • the previous node (that is, the new parent node) of the new first current node 4 is the fork node 2 .
  • the smart chip then ends the division for end node 5, divides end node 5 and each node between end node 5 and the new parent node 2 (only node 4 in FIG. 2), and generates a network subgraph including end node 5.
  • the division method of the terminal node 7 is similar to the division method of the terminal node 5, which will not be described again.
  • the above example only uses terminal node 3 as the first terminal node to be processed.
  • terminal nodes 5 and 7 can also be respectively used as the first terminal node to be processed. This is not limited.
  • since each end node is segmented from the other nodes in turn through the post-order traversal, the network subgraph corresponding to each end node can be generated accurately and quickly, and the efficiency of the smart chip in splitting the multi-output neural network graph is improved.
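  • A rough sketch of this segmentation, under an assumed graph representation (a parent map and a set of fork nodes; neither is an API from this application): starting from each end node, walk towards the start of the graph, stop when a fork node has already been taken by another subgraph, and collect the visited nodes into that end node's subgraph.

        def split_into_subgraphs(end_nodes, parent_of, fork_nodes):
            """parent_of maps a node to its parent node (None for the starting node);
            fork_nodes is the set of nodes whose output feeds more than one node."""
            taken_forks = set()          # fork nodes already divided into some subgraph
            subgraphs = {}
            for end in end_nodes:
                nodes = [end]
                current = end
                while True:
                    parent = parent_of.get(current)
                    if parent is None:
                        break            # reached the starting node: this subgraph is complete
                    if parent in fork_nodes and parent in taken_forks:
                        break            # this fork was already divided by another end node
                    nodes.append(parent)
                    if parent in fork_nodes:
                        taken_forks.add(parent)
                    current = parent
                subgraphs[end] = nodes
            return subgraphs

        # Example with the 7-node graph of FIG. 2 (1 and 2 are fork nodes; 3, 5, 7 are end nodes).
        parent_of = {1: None, 2: 1, 3: 2, 4: 2, 5: 4, 6: 1, 7: 6}
        print(split_into_subgraphs([3, 5, 7], parent_of, fork_nodes={1, 2}))
        # -> {3: [3, 2, 1], 5: [5, 4], 7: [7, 6]}, matching graphs A, B and C in the text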
  • each network subgraph respectively includes some nodes in the multi-output neural network graph.
  • slicing the multi-output neural network graph is only to adjust the size of the input and output data processed by each node in the multi-output neural network graph, not to divide the multi-output neural network graph into multiple parts.
  • when the above step S102 is performed, each network subgraph usually has multiple slicing schemes. However, when the subsequent step S103 is performed, if the smart chip were to measure the processing time of the network subgraph under every slicing scheme, it would consume a lot of time. Based on this, before executing S103, the smart chip can also pre-determine whether a slicing scheme is a valid slicing scheme, and perform the step of obtaining the processing time of the network subgraph only when the slicing scheme is a valid slicing scheme, so as to reduce the time required by the smart chip.
  • FIG. 4 is a schematic diagram of an implementation of judging whether a slicing scheme is an effective slicing scheme in a multi-output neural network slicing method provided by an embodiment of the present application, as detailed below:
  • the smart chip reversely deduces the input data and output data of each node respectively according to the output data volume of the output data of the sliced end nodes.
  • the size of each piece of output data of the end node can be limited, so that, according to the calculation parameters and calculation formulas in the end node, the smart chip can back-derive the size of the end node's input data for each step. Afterwards, since the input data of the end node is the output data of the previous node, the smart chip can back-derive the input data and output data of every node in turn.
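  • As a hedged illustration of this back-derivation, for an assumed convolution-style node (the kernel size and stride below are illustrative and padding is ignored; this is not a formula stated in this application), the number of input rows required to produce a given number of output rows can be computed backwards from the sliced output:

        def rows_needed_by_conv(out_rows, kernel_h, stride_h):
            # For a valid (unpadded) convolution, producing out_rows output rows
            # requires (out_rows - 1) * stride_h + kernel_h input rows.
            return (out_rows - 1) * stride_h + kernel_h

        # If an end node's output is sliced to 16 rows per step and the node is a 3x3,
        # stride-1 convolution, each step needs 18 input rows from the previous node.
        print(rows_needed_by_conv(16, kernel_h=3, stride_h=1))  # -> 18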
  • the smart chip determines the target output data of the previous node of the node.
  • the smart chip determines the occupied space of the node during execution according to the target output data, the input data and the output data of the node.
  • the preceding node is a node preceding the node in the multi-output neural network, which can be determined according to the data transmission relationship between the various nodes.
  • the input data of a node is the output data of the previous node. Therefore, in order to determine whether a slicing scheme is a valid slicing scheme, the smart chip can examine how the network subgraph processes its input and output data, obtain the slicing schemes under which this processing fits in the internal memory, and determine them as valid slicing schemes.
  • if each node does not need to obtain input data from the external memory when the network subgraph processes input and output data, it means that under this slicing scheme, when each node processes input and output data, the internal memory of the smart chip can allocate enough free space for the node. Based on this, for any node of any network subgraph, the smart chip needs to calculate the occupied space required by the node when processing input and output data.
  • the smart chip can specifically determine the occupied space in the following manner: determine all the fork nodes preceding the node and the output data of each fork node; judge whether the output data of each fork node can be released; and use the output data of all fork nodes that cannot be released as the target output data.
  • the smart chip can judge whether the output data of the fork node can be released in the following way: if the output data of the fork node is also used by other network subgraphs, it is determined that the output data of the fork node cannot be released; otherwise, it is determined that the output data of the fork node can be released.
  • in order to reduce the number of interactions with the external memory, each node usually needs to store its output data in the internal memory when generating it, so that it can be called by the next node. Because the output data of a non-fork node is the input data of the next node, when calculating the occupied space of any such node, it is only necessary to sum the output data volume of the node's output data and the input data volume of its input data to determine the occupied space required by the node. That is, there is no target output data in this case.
  • the output data of the fork node needs to be used by other network subgraphs. That is, the output data of the fork node needs to be stored in the internal memory all the time. Therefore, when calculating the required occupied space of the node, it is also necessary to add the output data amounts of the output data of the non-releasable fork nodes to obtain the required occupied space of the node. That is, the output data of the forked nodes that have not been released at this time is the target output data.
  • the purpose of judging whether the output data of the bifurcated node can be released is to delete the output data from the internal memory when judging that the output data can be released, so as to increase the available space allocated by the internal memory for the next node as much as possible.
  • the next node does not need to store the output data in the external memory, reducing the number of interactions with the external memory.
  • the smart chip determines the slicing scheme as an effective slicing scheme.
  • the internal memory of the smart chip can have enough space to store the input data required by the node, and at the same time can store the output data produced by the node.
  • the above-mentioned available space may be the total space of the internal memory of the smart chip, or the space remaining in the internal memory after storing the calculation parameters when the node processes data, which is not limited.
  • the advantage of adopting a valid slicing scheme to slice the output data of the end node is that the amount of input data each time a node processes input data and the amount of output data it generates are reduced, so that the footprint required by each node when processing input and output data is reduced.
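  • A minimal sketch, under assumed data structures (the dictionary fields and byte values below are illustrative, not part of this application), of the validity check described here: for each node, the required space is its input plus its output plus the output of every earlier fork node that cannot yet be released because another subgraph still needs it, and the scheme is valid only if every node fits within the available space Q. This mirrors the Q5 + Q1b ≤ Q style of check discussed below.

        def is_valid_scheme(nodes, available_space):
            """nodes: list of dicts with 'input_bytes', 'output_bytes' and
            'retained_fork_output_bytes' (the target output data that cannot be released),
            all already back-derived for one candidate slicing scheme."""
            for node in nodes:
                occupied = (node["input_bytes"]
                            + node["output_bytes"]
                            + node["retained_fork_output_bytes"])
                if occupied > available_space:
                    return False   # this node would overflow the on-chip memory
            return True

        # Assumed sizes for one node; the scheme is valid only if the sum stays within Q.
        Q = 1024 * 1024  # assumed available on-chip space in bytes
        node = {"input_bytes": 200_000, "output_bytes": 100_000,
                "retained_fork_output_bytes": 150_000}
        print(is_valid_scheme([node], Q))  # -> True for these assumed values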
  • the smart chip can store the output data generated by each node in the internal memory, so as to reduce the number of interactions with the external memory.
  • the size of the internal memory of the smart chip is Q.
  • Q is the remaining space in the internal memory after removing the occupied space of the calculation parameters of the node in the internal memory.
  • the space occupied by each non-forked node when processing data generally includes: the space occupied by the input data and output data of the node.
  • the input data and output data of each node under the scheme are respectively obtained.
  • network subgraph A is a subgraph composed of end node 3, fork node 1 and fork node 2
  • the reason why the output data of fork nodes 1 and 2 cannot be released is that the output data of fork nodes 1 and 2 need to be reused (nodes in both graph A and graph C need the output data of fork node 1, and nodes in both graph A and graph B need the output data of fork node 2).
  • for each node in the multi-output neural network, when any node is back-derived from its output data, at the moment its input data is generated the node holds both the input data and the output data. Therefore, each node in the multi-output neural network needs to meet the most basic requirement: at the moment the node generates its input data, the sum of the input data volume and the output data volume is less than or equal to the available space allocated by the smart chip for the node.
  • graph B is a network subgraph including end node 5 and non-fork node 4.
  • after graph A is sliced, the output data Q1b of fork node 1 and Q2b of fork node 2 are not released, and the output data Q2b of fork node 2 is the input data Q4a of non-fork node 4.
  • the terminal node 5 can be calculated, otherwise the slicing scheme for the terminal node 5 is invalid.
  • for end node 5, since the input data of end node 5 is the output data of non-fork node 4, and at this time the input data Q4a (that is, Q2b) of non-fork node 4 can be released (it does not need to be used in graph C), the occupied space required by end node 5 needs to satisfy Q5+Q1b (the output data of fork node 1 still needs to be used by graph C and has not been released). If Q5+Q1b ≤ Q, the slicing method is valid; otherwise, the slicing method is invalid.
  • the above derivation from the output data is what determines whether a slicing scheme is a valid slicing scheme; it is a reverse process from output to input. After the reverse deduction is completed, if the output data sliced with slicing scheme X allowed the smart chip to allocate enough free space for each node during the reverse deduction, then, when the multi-output neural network uses the same slicing method to limit the amount of input data each node processes, the smart chip should likewise be able to allocate enough available space for each node to store the output data generated by each node.
  • after the slicing scheme of the multi-output neural network is determined, it is also necessary to configure the operators that process input and output data in the multi-output neural network, so that the operators process the data according to the slicing scheme.
  • each node includes a plurality of operators for processing input data and/or output data, and each operator needs to call a corresponding scheduling statement for implementation.
  • the slicing method of the multi-output neural network also includes:
  • the smart chip obtains the start operator that processes the input data of the node first among the multiple operators, and the end operator that processes the output data of the node last.
  • if the node is an end node:
  • when the smart chip runs the start operator, it calls the first scheduling statement, which performs the first operation; and when the end operator finishes running, it calls the second scheduling statement, which performs the second operation, and the third scheduling statement, which performs the third operation.
  • the first operation is used to read input data from the available space of the smart chip to the start operator; the second operation is used to write the output data generated by the end operator into the internal memory; and the third operation is used to write the output data generated by the end operator of the end node into the external memory.
  • if the node is the starting node in the multi-output neural network:
  • when the smart chip runs the start operator of the starting node, it calls the fourth scheduling statement, which performs the fourth operation, together with the first scheduling statement; and when running the end operator of the starting node, it calls the second scheduling statement. The fourth scheduling statement is used to read input data from the external memory to the start operator of the starting node.
  • the smart chip calls the first scheduling statement when running the start operator in the intermediate node; and calls the second scheduling statement when running the end operator in the intermediate node.
  • a node is a minimum processing unit in a multi-output neural network graph, and contains calculation formulas for input data. Among them, a node usually contains a large number of operators to process the input data.
  • a node when a node includes multiple operators, the first operator in the node that processes input data is used as the start operator, and the last operator that generates output data is used as the end operator.
  • the foregoing first operation may specifically be an operation performed by an initial operator.
  • the first operation is used to read the input data required by the initial operator in the node.
  • the first operation is the operation of reading the input data required by the initial operator from the internal memory. If it is a starting node, you need to perform the fourth operation of reading the required input data from the external memory; and when reading the input data, you need to store the input data in the internal memory first, and then read it from the internal memory Get the input data.
  • the above-mentioned second operation is used to write the output data of the end operator into the available space. It can be understood that if the node is an end node, the output data generated by the end operator not only needs to be written into the internal memory, but the third operation of writing the output data into the external memory also needs to be performed. However, for an intermediate node (a node that is neither a start node nor an end node), only the second operation of writing the output data generated by the end operator into the available space of the internal memory and the first operation of reading input data from the available space of the internal memory are required.
  • the start node is the fork node 1 in FIG. 2; the end nodes are nodes 3, 5, and 7 in FIG. 2; and the intermediate nodes are nodes 2, 4, and 6 in FIG. 2.
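  • The mapping just described can be summarized in the following illustrative sketch (the node kinds and statement labels simply restate the first to fourth scheduling statements above; this is not an API of any library):

        def scheduling_statements_for(node_kind):
            """Which scheduling statements are attached to a node's start operator and
            end operator, for node_kind in {"start", "end", "intermediate"}."""
            if node_kind == "start":
                # Starting node: read input from DDR (fourth) then from DM (first);
                # write the end operator's output into DM (second).
                return {"start_op": ["fourth", "first"], "end_op": ["second"]}
            if node_kind == "end":
                # End node: read input from DM (first); write output into DM (second)
                # and also into DDR (third).
                return {"start_op": ["first"], "end_op": ["second", "third"]}
            # Intermediate node: read from DM (first) and write back to DM (second) only.
            return {"start_op": ["first"], "end_op": ["second"]}

        # For FIG. 2: node 1 is the start node, nodes 3, 5, 7 are end nodes, 2, 4, 6 intermediate.
        for kind in ("start", "end", "intermediate"):
            print(kind, scheduling_statements_for(kind))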
  • the smart chip when generating a scheduling statement for each operator, it can integrate multiple operators in one node, so that multiple operators in the same node can be combined and put into the same core. In this way, the smart chip can use operator fusion without moving the data of the fork node to the external memory, which can effectively reduce the number of data transfers and save the processing time of the smart chip.
  • the graphs A, B, and C are respectively scheduled according to the searched optimal slicing manner in a post-order access manner.
  • END_OP: the last operator of each child node
  • STT_OP: the start operator
  • for graph A, the END_OP of end node 3 is used as the first root node 1, and the split primitive is called on the N, C, H, and W axes to slice them.
  • the compute_op that moves the output data to DM (internal memory) and DDR (external memory) is generated by applying cache_write (a storage write operation) to END_OP.
  • cache_read on START_OP generates the compute_op that moves DM data to the computing unit. If the compute_op requires parameters, cache_read is also needed to generate the compute_op that moves DDR data to DM and the computing unit, and all compute_ops are moved to the axis corresponding to root node 2 by compute_at.
  • fork node 1 uses a method similar to that of fork node 2, and the END_OP of fork node 1 is set as the second root node 3. Since fork node 1 is the first node of the multi-output neural network, cache_read on START_OP is needed to generate the compute_op that moves DDR data to DM and the computing unit.
  • the END_OP of output node 5, as the first root node 4, calls the split primitive to slice the N, C, H, and W axes. cache_write on END_OP generates the compute_op that moves the output data to DM and DDR. cache_read on START_OP generates the compute_op that moves DM data to the computing unit. If the compute_op requires parameters, cache_read is also needed to generate the compute_op that moves DDR data to DM and the computing unit, and all compute_ops are moved to the axis corresponding to the first root node 4 by compute_at.
  • node 4 generates the compute_op that moves its output data to DM through cache_write on END_OP.
  • cache_read on START_OP generates the compute_op that moves DM data to the computing unit. If the compute_op requires parameters, cache_read is also needed to generate the compute_op that moves DDR data to DM and the computing unit, and all compute_ops are moved to the axis corresponding to the first root node 4 through compute_at.
  • the END_OP of output node 7 is used as the first root node 5, and the split primitive is called to slice the N, C, H, and W axes.
  • cache_write on END_OP generates the compute_op that moves the output data to DM and DDR.
  • cache_read on START_OP generates the compute_op that moves DM data to the computing unit. If the compute_op requires parameters, cache_read is also needed to generate the compute_op that moves DDR data to DM and the computing unit, and all compute_ops are moved to the axis corresponding to the first root node 5 by compute_at.
  • node 6 generates the compute_op that moves its output data to DM through cache_write on END_OP, and cache_read on START_OP generates the compute_op that moves DM data to the computing unit. If the compute_op requires parameters, cache_read is also needed to generate the compute_op that moves DDR data to DM and the computing unit, and all compute_ops are moved to the axis corresponding to the first root node 5 through compute_at.
  • split, cache_write, cache_read, compute_at, etc. are all scheduling primitives of TVM.
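  • The following is a minimal, simplified TVM tensor-expression sketch of the primitives named above (split, cache_write, cache_read, compute_at). It uses a single element-wise stage standing in for a subgraph's END_OP, a generic "global" scope standing in for the chip's DM, and an assumed split factor; it is not the scheduling code of this application.

        import tvm
        from tvm import te

        # Assumed tensor shape in N, C, H, W layout and an assumed slice factor on the H axis.
        N, C, H, W = 1, 16, 64, 64
        data = te.placeholder((N, C, H, W), name="data")
        # A single element-wise stage standing in for a subgraph's END_OP.
        end_op = te.compute((N, C, H, W), lambda n, c, h, w: data[n, c, h, w] * 2.0, name="END_OP")

        s = te.create_schedule(end_op.op)

        # split: slice the H axis according to the chosen slicing scheme (factor is an assumption).
        ho, hi = s[end_op].split(end_op.op.axis[2], factor=32)

        # cache_write: stage END_OP's result in an on-chip buffer ("global" scope here stands
        # in for the chip's DM; a real target would use its device-specific memory scope).
        out_dm = s.cache_write(end_op, "global")

        # cache_read: stage the input in an on-chip buffer before the compute stage consumes it.
        data_dm = s.cache_read(data, "global", [out_dm])

        # compute_at: anchor both staging steps at the sliced H axis, so each slice is loaded,
        # computed and written back before the next slice begins.
        s[out_dm].compute_at(s[end_op], ho)
        s[data_dm].compute_at(s[out_dm], out_dm.op.axis[2])

        print(tvm.lower(s, [data, end_op], simple_mode=True))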
  • the above steps are mainly for improving the processing flow of input data and output data and the reading and writing flow. That is, the above method can reduce the number of times the smart chip reads input data from the external memory. However, when each operator is running, it may also need to move other calculation parameters from the external memory to process the input data. However, moving calculation parameters requires access to external memory. That is, the slicing method of the multi-output neural network graph described in this application can only reduce the number of times the multi-output neural network graph accesses input data from an external memory.
  • FIG. 5 is a schematic diagram of a processing scenario of processing input data by a single-output neural network in a smart chip.
  • the smart chip obtains the input data of the single-output neural network from the external memory, and stores it in the internal memory;
  • Step2 The smart chip reads the input data from the internal memory, and uses the model parameters in the calculation unit A to process and obtain the output data, and then store the output data in external memory.
  • Step3 The smart chip clears the output data and input data of node A;
  • Step4 Read the input data from the external memory again, and use the model parameters in calculation unit B to process it and obtain the output data;
  • Step5 Store the output data of node B to the external memory. It can be seen from FIG. 5 that for a single-output neural network model, when processing the input data in each node, it is necessary to perform an operation of accessing the external memory.
  • FIG. 6 is a schematic diagram of a processing scenario in which a multi-output neural network processes input data in a smart chip.
  • this example is only explained with the multi-output neural network generated by nodes 1, 2, 3, 4, and 5.
  • Step1 The smart chip obtains the input data of the multi-output neural network from the external memory, and stores it in the internal memory.
  • Step2 For node 2, the smart chip can directly read the input data from the internal memory, and delete the output data of node 1 from the internal memory; after that, it stores the output data of node 2 in the internal memory. At the same time, based on the above segmentation of the network subgraphs, even if the output data of node 2 is stored in the internal memory, the sum of the space occupied by the model parameters of the other nodes deployed in the internal memory and the space occupied by the generated output data will still be smaller than the available internal memory space.
  • node 4 and end node 3 can directly obtain the output data of node 2 from the internal memory, and node 4 will then also put its generated output data into the internal memory to be read and processed by end node 5; after that, the smart chip executes Step3 and Step4: the output data generated by end node 3 and end node 5 are respectively stored in the external memory.
  • each node in each network subgraph reads and writes the data stored in the internal memory to realize data interaction. In this way, the operation of accessing the external memory only needs to be performed once for the initial node and the output node of the multi-output neural network.
  • FIG. 7 is a structural block diagram of a multi-output neural network slicing device provided by an embodiment of the present application.
  • the modules included in the multi-output neural network slicing device in this embodiment are used to execute the steps in the embodiments corresponding to FIG. 1 , FIG. 3 to FIG. 4 .
  • for details, please refer to FIG. 1, FIG. 3 to FIG. 4 and the related descriptions in the embodiments corresponding to FIG. 1, FIG. 3 to FIG. 4. For ease of description, only the parts related to this embodiment are shown. Referring to FIG. 7,
  • the slicing device 700 for a multi-output neural network graph may include: a segmentation module 710, a slicing module 720, a processing duration acquisition module 730, and a slicing scheme determination module 740, wherein:
  • Segmentation module 710 for dividing each end node in the multi-output neural network from other nodes, generating multiple network subgraphs containing end nodes; the multi-output neural network includes at least two end nodes and at least one fork node.
  • the slicing module 720 is configured to slice the output data volume of the output data of the end nodes of the network subgraph according to various preset slicing methods for any network subgraph, so as to obtain various slicing schemes of the network subgraph.
  • the processing duration obtaining module 730 is configured to obtain the processing duration of the network subgraph for any slicing scheme of the network subgraph, and determine the target slicing scheme of the network subgraph according to the processing duration.
  • the slicing scheme determination module 740 is configured to determine the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
  • the segmentation module 710 is also used for:
  • the slicing module 720 is also used for:
  • the slicing device 700 of the multi-output neural network graph further includes:
  • the judging module is used for judging whether the slicing scheme is an effective slicing scheme for any slicing scheme of the network subgraph, and when the slicing scheme is an effective slicing scheme, execute the step of obtaining the processing time of the network subgraph.
  • the judging module is also used for:
  • the input data and output data of each node are respectively back-derived; for any node, the target output data of the previous node of the node is determined; according to the target output data, the input data and the output data of the node, the occupied space when the node is executed is determined; if the occupied space of any node is smaller than the available space allocated by the smart chip for the corresponding node, the slicing scheme is determined to be a valid slicing scheme.
  • the judging module is also used for:
  • the judging module is also used for:
  • if the output data of the fork node is also used by other network subgraphs, it is determined that the output data of the fork node cannot be released; otherwise, it is determined that the output data of the fork node can be released.
  • it should be noted that each module is used to execute the steps in the embodiments corresponding to FIG. 1 and FIG. 3 to FIG. 4.
  • each step in the embodiments corresponding to FIG. 1 and FIG. 3 to FIG. 4 has been explained in detail in the above embodiments; please refer to FIG. 1, FIG. 3 to FIG. 4 and the related descriptions in the corresponding embodiments, which will not be repeated here.
  • Fig. 8 is a structural block diagram of a smart chip provided by an embodiment of the present application.
  • the smart chip 800 of this embodiment includes: a processor 810 , a memory 820 , and a computer program 830 stored in the memory 820 and executable on the processor 810 , such as a program of a multi-output neural network slicing method.
  • the processor 810 executes the computer program 830 , the steps in the above embodiments of each multi-output neural network slicing method are implemented, such as S101 to S104 shown in FIG. 1 .
  • when the processor 810 executes the computer program 830, the functions of the modules in the above embodiment corresponding to FIG. 7 are realized, for example, the functions of modules 710 to 740 shown in FIG. 7.
  • the computer program 830 can be divided into one or more modules, and the one or more modules are stored in the memory 820 and executed by the processor 810, so as to implement the multi-output neural network slicing method provided by the embodiments of the present application.
  • One or more modules may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 830 in the smart chip 800 .
  • the computer program 830 may implement the multi-output neural network slicing method provided in the embodiment of the present application.
  • the smart chip 800 may include, but not limited to, a processor 810 and a memory 820 .
  • FIG. 8 is only an example of the smart chip 800, and does not constitute a limitation on the smart chip 800. It may include more or fewer components than shown in the figure, or combine certain components, or have different components.
  • the so-called processor 810 may be a central processing unit, or another general-purpose processor, a digital signal processor, an application-specific integrated circuit, an off-the-shelf programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and the like.
  • the memory 820 may be an internal storage unit of the smart chip 800 , such as a hard disk or memory of the smart chip 800 .
  • the memory 820 may also be an external storage device of the smart chip 800, such as a plug-in hard disk, smart memory card, flash memory card, etc. equipped on the smart chip 800. Further, the memory 820 may also include both an internal storage unit of the smart chip 800 and an external storage device.
  • An embodiment of the present application provides a smart chip, including a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • when the processor executes the computer program, the multi-output neural network slicing method in the above embodiments is implemented.
  • An embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the multi-output neural network slicing method in the above embodiments.
  • An embodiment of the present application provides a computer program product, which, when the computer program product is run on a computer, causes the computer to execute the multi-output neural network slicing method in each of the foregoing embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

A slicing method and apparatus for a multi-output neural network, and a chip and a storage medium, which are applicable to the technical field of model processing. The method is applied to an intelligent chip, and comprises: respectively segmenting each end node in a multi-output neural network from other nodes, so as to generate a plurality of network sub-graphs including end nodes; for any network sub-graph, slicing output data of the end nodes of the network sub-graph according to a plurality of preset slicing modes, so as to obtain a plurality of slicing schemes of the network sub-graph; for any slicing scheme of the network sub-graph, acquiring a processing duration of the network sub-graph, and determining a target slicing scheme of the network sub-graph according to the processing duration; and determining a slicing scheme of the multi-output neural network according to the target slicing scheme of each network sub-graph. By using the method, when a multi-output neural network graph processes input data, the number of times that an external memory is accessed can be reduced.

Description

Slicing method, device, chip and storage medium for a multi-output neural network
Technical Field
This application claims priority to the Chinese patent application No. 202210181249.1, entitled "Slicing method, device, chip and storage medium for a multi-output neural network", filed with the China Patent Office on February 25, 2022, the entire contents of which are incorporated herein by reference.
The present application belongs to the technical field of model processing, and in particular relates to a slicing method, device, chip and storage medium for a multi-output neural network.
Background
The neural network compilation framework (Tensor Virtual Machine, TVM) is a unified software stack for different neural network frameworks and hardware platforms, and can deploy neural networks built under different frameworks onto a hardware platform.
Usually, a neural network processes input data in a smart chip to obtain output data. However, the available space inside the smart chip is often relatively small, and for a neural network that requires a large amount of storage space to process data, the available space inside the smart chip is far from enough. Therefore, in the process of data processing, for any node included in the neural network, it is often necessary to first store the output data of the node in an external memory. Then, the node's input data currently held in the storage space and the output data it has generated are deleted. Finally, the output data of the node is fetched from the external memory and used as the input data of the next node to complete the data processing.
Therefore, in order to reduce the available space required by the neural network for data processing, it is necessary to first perform graph segmentation on the neural network to obtain network subgraphs. Afterwards, the segmented network subgraphs are sliced, so that when a network subgraph processes data slice by slice, the available space required by each node in the network subgraph is reduced; in this way, the output data generated by each node can be stored in the internal storage space, reducing the number of accesses to the external memory.
However, when the above slicing approach is used to slice the network subgraphs, the current TVM scheduling primitives cannot support giving the output data of a fork node to multiple network subgraphs for processing at the same time. As a result, in the prior art there is no reasonable slicing scheme for slicing the network subgraphs of a multi-output neural network, so that when the existing multi-output neural network processes data, the output data of the fork node still needs to be stored in the external memory, and the number of accesses to the external memory cannot be reduced.
Technical Solution

Embodiments of the present application provide a slicing method and apparatus for a multi-output neural network graph, a smart chip, and a storage medium, which can solve the problem that an existing multi-output neural network graph needs to access the external memory frequently when processing input data.
In a first aspect, an embodiment of the present application provides a slicing method for a multi-output neural network. The method is applied to a smart chip and includes:

splitting each end node of the multi-output neural network from the other nodes to generate a plurality of network subgraphs each containing an end node, the multi-output neural network including at least two end nodes and at least one fork node;

for any network subgraph, slicing the output data quantity of the output data of the end node of the network subgraph according to a plurality of preset slicing manners to obtain a plurality of slicing schemes for the network subgraph;

for any slicing scheme of the network subgraph, obtaining the processing duration of the network subgraph and determining a target slicing scheme for the network subgraph according to the processing duration; and

determining the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
In a second aspect, an embodiment of the present application provides a slicing apparatus for a multi-output neural network, applied to a smart chip, the apparatus including:

a splitting module configured to split each end node of the multi-output neural network from the other nodes to generate a plurality of network subgraphs each containing an end node, the multi-output neural network including at least two end nodes and at least one fork node;

a slicing module configured to, for any network subgraph, slice the output data quantity of the output data of the end node of the network subgraph according to a plurality of preset slicing manners to obtain a plurality of slicing schemes for the network subgraph;

a processing-duration acquisition module configured to, for any slicing scheme of the network subgraph, obtain the processing duration of the network subgraph and determine a target slicing scheme for the network subgraph according to the processing duration; and

a slicing-scheme determination module configured to determine the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
In a third aspect, an embodiment of the present application provides a smart chip including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of the first aspect when executing the computer program.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect.

In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on a computer, causes the computer to execute the method of the first aspect.

Compared with the prior art, the embodiments of the present application have the following beneficial effects. The multi-output neural network is represented in the form of a multi-output neural network graph, and each end node of the graph is split from the other nodes to obtain the network subgraph corresponding to that end node. Then, within each network subgraph, the output data of its end node is sliced according to a plurality of preset slicing manners to control the amount of input and output data the subgraph handles at a time, yielding a plurality of slicing schemes. The target slicing scheme of each network subgraph is then determined according to the processing duration of that subgraph, and from these the overall slicing scheme of the multi-output neural network is obtained. When the multi-output neural network processes data, the input and output data are handled according to this slicing scheme, so that the output data passed between nodes can be kept in the available space of the smart chip; each node can then read its data directly from that space, reducing the number of accesses the smart chip makes to the external memory.
Description of Drawings

The drawings used in the embodiments of the present application are introduced below.

Fig. 1 is an implementation flowchart of a slicing method for a multi-output neural network provided by an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a multi-output neural network graph, together with the network subgraphs obtained by partitioning it, in a slicing method for a multi-output neural network provided by an embodiment of the present application;

Fig. 3 is a schematic diagram of one implementation of S101 of a slicing method for a multi-output neural network provided by an embodiment of the present application;

Fig. 4 is an implementation flowchart of a slicing method for a multi-output neural network provided by another embodiment of the present application;

Fig. 5 is a schematic diagram of a processing scenario in which a single-output neural network processes input data in a smart chip provided by an embodiment of the present application;

Fig. 6 is a schematic diagram of a processing scenario in which a multi-output neural network graph processes input data in a smart chip provided by an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a slicing apparatus for a multi-output neural network graph provided by an embodiment of the present application;

Fig. 8 is a schematic structural diagram of a smart chip provided by an embodiment of the present application.
Embodiments of the Present Invention

The embodiments of the present application are described below with reference to the accompanying drawings.
The slicing method for a multi-output neural network provided by the embodiments of the present application can be applied to a smart chip. The smart chip loads the model parameters of each node of the multi-output neural network graph and processes the data fed into each node according to that node's model parameters. Because the output data computed by a single node of the multi-output neural network graph is usually too large to be held in the internal memory of the smart chip, the output data needs to be moved to an external memory; the output data of a single node under the multi-output neural network graph is therefore saved by accessing the external memory. Moreover, since the neural network contains many nodes, the smart chip has to access the external memory frequently when it performs the network computation on the input data of each node.

The smart chip includes, but is not limited to, a central processing unit, a graphics processing unit, a neural network processor, and the like. The internal memory is the on-chip memory of the smart chip, and the external memory may be a double data rate synchronous dynamic random access memory. The storage space of the external memory is usually far larger than that of the on-chip memory.

Based on this, please refer to Fig. 1, which shows an implementation flowchart of a slicing method for a multi-output neural network provided by an embodiment of the present application. The method can merge the nodes of the multi-output neural network so that the merged network reduces the number of accesses to the external memory when processing data, and includes the following steps:
S101: The smart chip splits each end node of the multi-output neural network from the other nodes to generate a plurality of network subgraphs each containing an end node; the multi-output neural network includes at least two end nodes and at least one fork node.

In an embodiment, the multi-output neural network can be represented by a multi-output neural network graph. Taking Fig. 2 as an example, Fig. 2 shows a simple multi-output neural network graph. The graph contains a plurality of nodes, each representing a unit of the multi-output neural network that can be computed independently on the smart chip, and the arrows in the graph represent the direction in which output data is passed between nodes.

Referring to Fig. 2, Fig. 2X is a schematic structural diagram of the multi-output neural network, and Fig. 2Y is a reference diagram of the network subgraphs after partitioning. As can be seen from Fig. 2, the multi-output neural network graph contains seven nodes, of which nodes 3, 5 and 7 are output nodes (that is, end nodes), nodes 1 and 2 are fork nodes, and nodes 4 and 6 are ordinary nodes.

It should be noted that, in this embodiment, the above method is used to partition a multi-output neural network graph, so the nodes include at least two end nodes and one fork node; the numbers of end nodes and fork nodes are not limited.

In an embodiment, each partitioned network subgraph contains one end node, so there are at least two subgraphs after partitioning. Splitting an end node from the other nodes may specifically be: the smart chip collects, using a post-order traversal according to the data transmission relationships between the nodes of the multi-output neural network graph, the other nodes that each end node requires, and performs the split to obtain the corresponding network subgraph.
S102: For any network subgraph, the smart chip slices the output data of the end node of the network subgraph according to a plurality of preset slicing manners to obtain a plurality of slicing schemes for the network subgraph.

In application, a slicing manner is specifically a way of slicing the output data quantity of the output data of the end node. After the output data quantity is sliced, the size of the data the end node outputs each time is determined. Based on this output size, the network subgraph can determine the amount of input data it handles each time, so that the smart chip can allocate enough available space for every node to process its input and output data.

Since the end node can output data of any dimension, slicing its output data correspondingly yields a plurality of slicing schemes, which are not limited here.
The data processed by a multi-output neural network is usually image data, so the output data is described below by taking image data as an example.

In a specific embodiment, image data is usually described by N, C, H and W, where N is the number of input images, C is the number of channels, H is the number of pixels of the image in the vertical direction, and W is the number of pixels in the horizontal direction. If an entire image is fed into a node for processing at once, the output data quantity will be very large. Therefore, to reduce the amount of data a node processes each time, the smart chip can slice the output data of the node so as to lower both the input data quantity and the output data quantity handled by the end node and the other nodes of the network subgraph.

For example, when the output data of the end node is sliced along the H axis, the data processed by the node each time has size N, C, 1/2H, W. The node therefore only needs to fetch data of half the image size as its input. Because only half of the image is processed at a time, the end node first produces output data of half the image size. The smart chip can store this half-image output in the internal memory (and, for an end node, also write it out to the external memory), then delete from the internal memory the half-image data the node has already processed (it is no longer needed, so the processed input data can be released), and then process the other half of the image until the remaining half-image output is obtained and saved. In this way, while generating the complete output, the internal memory only needs storage space of half the image size for this node. If the node processed the whole image directly, the smart chip would have to allocate the storage space needed to hold the entire input image as well as the space needed to hold the entire output image.
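The saving described above can be made concrete with a short calculation. The following is a minimal illustrative sketch; the feature-map shape and the one-byte element size are assumptions chosen for the example, not values taken from the application.

```python
# Footprint comparison for slicing an N,C,H,W feature map along the H axis.
N, C, H, W = 1, 16, 64, 64     # hypothetical feature-map shape (assumption)
ELEM_BYTES = 1                 # e.g. int8 activations (assumption)

def tensor_bytes(n, c, h, w):
    return n * c * h * w * ELEM_BYTES

# Whole-image processing: the node's full input and full output are live together.
whole_image = tensor_bytes(N, C, H, W) + tensor_bytes(N, C, H, W)

# H-axis slicing into two halves: only half of the input and half of the output
# are live per pass, because the processed input half is released before the
# second pass starts.
per_slice = tensor_bytes(N, C, H // 2, W) + tensor_bytes(N, C, H // 2, W)

print(whole_image, per_slice)  # the per-slice footprint is half the whole-image one
```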
Based on this, in an embodiment, the multiple ways of slicing the output data of the end node include, but are not limited to, slicing along C, H and W respectively; the above example is only one way of slicing along the H axis.

Specifically, the slicing manners may be obtained as follows: the smart chip obtains the minimum dimension and the maximum dimension of the output data the end node can produce, and any dimension of outputtable data between the minimum dimension and the maximum dimension is determined as one slicing manner for the output data of the end node.

The minimum dimension is the smallest dimension of image data the end node and the non-fork nodes can output, and the maximum dimension is the largest. Image data of each such dimension is one kind of outputtable data, i.e. one manner of slicing the end node and the non-fork nodes.

The minimum dimension of image data an end node can output is usually one pixel, and the maximum dimension is usually the dimension of the whole image.

For example, when the number of images N is 1 and the number of channels C is also 1, if the vertical pixel dimension H of the image admits A slicing manners (that is, the image can be sliced vertically at any dimension in [1, A]) and the horizontal pixel dimension W admits B slicing manners (that is, the image can be sliced horizontally at any dimension in [1, B]), then there are A*B slicing manners for the output data of the end node. In other embodiments the number of channels C may also be greater than one, for example 3, in which case the channel dimension can be sliced as well. In this way, a variety of slicing manners for the network subgraph are obtained, giving the smart chip more slicing schemes to choose from.
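As a sketch of how such candidates could be enumerated, the snippet below lists every (C, H, W) tile size between 1 and the full extent, which for C = 1 reproduces the A*B count above. The helper name and the shapes are illustrative assumptions.

```python
from itertools import product

def candidate_slicing_manners(c, h, w, slice_channels=False):
    """Yield (c_tile, h_tile, w_tile) candidates between 1 and the full extent."""
    c_tiles = range(1, c + 1) if slice_channels else [c]
    h_tiles = range(1, h + 1)      # A vertical choices
    w_tiles = range(1, w + 1)      # B horizontal choices
    yield from product(c_tiles, h_tiles, w_tiles)

# With C = 1, H = A and W = B there are exactly A * B candidates.
assert len(list(candidate_slicing_manners(1, 4, 3))) == 4 * 3
```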
S103: For any slicing scheme of the network subgraph, the smart chip obtains the processing duration of the network subgraph and determines the target slicing scheme of the network subgraph according to the processing duration.

In application, the amount of input data the network subgraph handles at a time differs between slicing schemes, so the processing duration of the subgraph usually also differs between schemes. Based on this, among the slicing schemes of any network subgraph, the smart chip can determine the scheme with the shortest processing duration as the target slicing scheme of that subgraph, improving the efficiency with which the subgraph processes its input and output data.

For example, referring to Fig. 2, for the network subgraphs obtained by partitioning the multi-output neural network, the smart chip can enumerate all the valid slicing schemes of graph A and use the chip cost model (costmodel) to select the optimal one, i.e. the scheme with the lowest cycle count and the shortest time, as the target slicing scheme of end node 3; likewise enumerate all the valid slicing schemes of graph B and select the optimal one as the target slicing scheme of end node 5; and enumerate all the valid slicing schemes of graph C and select the optimal one as the target slicing scheme of end node 7.
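A minimal sketch of this selection step is given below. The cost-model callable stands in for the chip cost model mentioned above; its interface is an assumption made for illustration.

```python
def pick_target_scheme(valid_schemes, cost_model_cycles):
    """Return the valid slicing scheme with the lowest estimated cycle count."""
    best_scheme, best_cycles = None, float("inf")
    for scheme in valid_schemes:
        cycles = cost_model_cycles(scheme)   # estimated cycles / processing duration
        if cycles < best_cycles:
            best_scheme, best_cycles = scheme, cycles
    return best_scheme
```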
S104: The smart chip determines the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.

It can be understood that, after the target slicing scheme corresponding to each network subgraph is obtained, these target slicing schemes together constitute the overall slicing scheme of the multi-output neural network.

In this embodiment, the multi-output neural network is represented in the form of a multi-output neural network graph, and each end node of the graph is split from the other nodes to obtain the network subgraph corresponding to that end node. Within each network subgraph, the output data of its end node is then sliced according to a plurality of preset slicing manners to control the amount of input and output data the subgraph handles at a time, yielding a plurality of slicing schemes. The target slicing scheme of each subgraph is determined according to its processing duration, and from these the overall slicing scheme of the multi-output neural network is obtained. When the multi-output neural network processes data according to this slicing scheme, the output data passed between nodes can be kept in the available space of the smart chip, so every node can read its data directly from that space, reducing the number of accesses the smart chip makes to the external memory.
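Putting S101 to S104 together, the overall flow can be sketched as below. Every callable passed in (the partitioner, the scheme enumerator, the validity check and the cost model) is a hypothetical placeholder for the steps described in this embodiment, not an existing API.

```python
def slice_multi_output_network(graph, split_into_subgraphs, enumerate_schemes,
                               is_valid_scheme, cost_model_cycles):
    """One pass over S101-S104; all helpers are supplied by the caller."""
    overall_scheme = {}
    for subgraph in split_into_subgraphs(graph):           # S101: one subgraph per end node
        valid = [s for s in enumerate_schemes(subgraph)    # S102: preset slicing manners
                 if is_valid_scheme(subgraph, s)]          # pre-filter (see Fig. 4 below)
        if not valid:
            raise RuntimeError("no valid scheme for this subgraph: slicing fails")
        overall_scheme[subgraph] = min(valid, key=cost_model_cycles)   # S103
    return overall_scheme                                  # S104: union of target schemes
```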
Please refer to Fig. 3, which is a schematic diagram of one implementation of S101 of the slicing method for a multi-output neural network provided by an embodiment of the present application, detailed as follows:

S1: The smart chip takes an end node as the first current node.

In an embodiment, the first current node is the node currently being processed. Since the multi-output neural network graph is processed by post-order traversal, the first end node to be processed is the first current node.

S2: If the parent node of the first current node is a fork node, the smart chip judges whether that fork node has already been partitioned into a network subgraph corresponding to another end node.

S3: If it is determined that the fork node has already been partitioned into a network subgraph corresponding to another end node, the smart chip splits off the nodes between the first current node and the parent node to generate the network subgraph containing the end node.

S4: If it is determined that the fork node has not been partitioned into a network subgraph corresponding to another end node, or if the parent node is a non-fork node, the smart chip takes the parent node as the new first current node and repeats S2 to S4 until the network subgraph containing the end node is generated.
In an embodiment, taking Fig. 2 as an example, when the first current node is end node 3, the data transmission relationships in the multi-output neural network graph show that its parent node is fork node 2, and fork node 2 has not yet been partitioned into a network subgraph of another end node. Therefore, according to step S4, fork node 2 is taken as the new first current node and steps S2 to S4 are repeated until the network subgraph containing end node 3 is generated.

The network subgraph containing end node 3 may be generated in the case of S2 and S3 (the parent fork node has already been claimed by another subgraph), or when the new first current node has no parent node in the multi-output neural network graph, at which point the smart chip ends the splitting for end node 3. For example, when splitting for end node 3, if the new first current node is node 1, node 1 is the starting node of the multi-output neural network graph and has no parent node. The partition for end node 3 therefore ends here, yielding the network subgraph containing nodes 1, 2 and 3.

Thereafter, when splitting for end node 5, the initial first current node is end node 5 and its parent node is non-fork node 4. According to step S4, the smart chip then takes non-fork node 4 as the new first current node. At this point the node above the new first current node 4 (i.e. its new parent node) is fork node 2. However, as described above, fork node 2 has already been partitioned into the network subgraph corresponding to end node 3. Therefore, the smart chip ends the splitting for end node 5 and splits off end node 5 together with the nodes between end node 5 and the new parent node 2 (only node 4 in Fig. 2), generating the network subgraph containing end node 5.

The splitting for end node 7 is similar to that for end node 5 and is not described again. It should be noted that the above example merely takes end node 3 as the first end node to be processed; in other embodiments, end node 5 or 7 may also be taken as the first end node to be processed, which is not limited here.

Based on this, in this embodiment, by splitting each end node and the other nodes in turn through post-order traversal, the network subgraph corresponding to each end node can be generated accurately and quickly, improving the efficiency with which the smart chip partitions the multi-output neural network graph.
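Under the assumption of a minimal node model in which every node records its single parent and whether it is a fork node (which matches the simple graph of Fig. 2 but is not the application's actual data structure), the S1-S4 walk can be sketched as follows.

```python
class Node:
    def __init__(self, name, parent=None, is_fork=False):
        self.name, self.parent, self.is_fork = name, parent, is_fork

def build_subgraph(end_node, claimed_forks):
    """Collect the nodes of one end node's subgraph, stopping at a fork node that
    has already been claimed by another end node's subgraph (S2/S3)."""
    nodes, current = [end_node], end_node
    while current.parent is not None:
        parent = current.parent
        if parent.is_fork and parent in claimed_forks:
            break                          # S3: fork already belongs to another subgraph
        nodes.append(parent)               # S4: take the parent as the new current node
        if parent.is_fork:
            claimed_forks.add(parent)
        current = parent
    return nodes

# Fig. 2 example: fork 1 -> fork 2 -> end 3, fork 2 -> 4 -> end 5, fork 1 -> 6 -> end 7
n1 = Node("1", is_fork=True)
n2 = Node("2", parent=n1, is_fork=True)
n3, n4 = Node("3", parent=n2), Node("4", parent=n2)
n5, n6 = Node("5", parent=n4), Node("6", parent=n1)
n7 = Node("7", parent=n6)

claimed = set()
print([n.name for n in build_subgraph(n3, claimed)])   # ['3', '2', '1']  (graph A)
print([n.name for n in build_subgraph(n5, claimed)])   # ['5', '4']       (graph B)
print([n.name for n in build_subgraph(n7, claimed)])   # ['7', '6']       (graph C)
```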
It should be noted that each network subgraph contains part of the nodes of the multi-output neural network graph. In addition, slicing the multi-output neural network graph only adjusts the size of the input and output data processed by each node of the graph; it does not divide the graph into multiple separate parts.

In another embodiment, when step S102 is performed, each network subgraph usually has a plurality of slicing schemes. If, when step S103 is then performed, the smart chip measured the processing duration of the network subgraph for every slicing scheme, a large amount of time would be consumed. Based on this, before performing S103, the smart chip may first judge whether a slicing scheme is a valid slicing scheme, and only perform the step of obtaining the processing duration of the network subgraph when the slicing scheme is valid, reducing the time required by the smart chip.

Please refer to Fig. 4, which is a schematic diagram of one implementation for judging whether a slicing scheme is a valid slicing scheme in a slicing method for a multi-output neural network provided by an embodiment of the present application, detailed as follows:
S41: The smart chip back-infers the input data and output data of every node from the output data quantity of the sliced output data of the end node.

In application, after the output data quantity of the end node is sliced, the size of the data the end node outputs each time is fixed; from the computation parameters and formulas of the end node, the smart chip can therefore back-infer the size of the input data the end node consumes each time. Since the input data of the end node is the output data of the previous node, the smart chip can back-infer the input data and output data of every node in this way.
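As a sketch of this back-inference, the snippet below propagates a sliced output height backwards through a chain of sliding-window nodes; the kernel, stride and padding values are illustrative assumptions, not parameters from the application.

```python
def required_input_height(out_h, kernel=3, stride=1, pad=0):
    """Input rows needed to produce out_h output rows for a sliding-window operator."""
    return (out_h - 1) * stride + kernel - 2 * pad

def back_infer_heights(end_out_h, node_params):
    """S41: from the end node's sliced output, walk backwards; each node's input
    size is the previous node's output size."""
    sizes = [end_out_h]
    for kernel, stride, pad in node_params:        # ordered from end node to start node
        sizes.append(required_input_height(sizes[-1], kernel, stride, pad))
    return sizes

print(back_infer_heights(16, [(3, 1, 0), (3, 2, 0)]))   # e.g. [16, 18, 37]
```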
S42: For any node, the smart chip determines the target output data of the nodes preceding that node.

S43: The smart chip determines the space occupied by the node during execution according to the target output data and the node's input data and output data.

In an embodiment, the preceding nodes are the nodes located before the node in the multi-output neural network, which can be determined from the data transmission relationships between the nodes.

In application, the input data of a node is the output data of the previous node. Therefore, to decide whether a slicing scheme is valid, the smart chip can determine as valid a slicing scheme under which, when the network subgraph processes its input and output data, no node needs to fetch its input data from the external memory.

If no node needs to fetch input data from the external memory when the network subgraph processes its input and output data, then under that slicing scheme the internal memory of the smart chip can allocate enough available space for every node while it processes its input and output data. Based on this, for any node of any network subgraph, the smart chip needs to compute the space that node occupies while processing its input and output data.
In a specific embodiment, the smart chip may determine the occupied space as follows: determine all the fork nodes before the node and the output data of each fork node; judge whether the output data of each fork node can be released; and take the output data of all the fork nodes that cannot be released as the target output data.

The smart chip may judge whether the output data of a fork node can be released as follows: if the output data of the fork node is still to be used by another network subgraph, the output data of the fork node cannot be released; otherwise it can be released.

In an embodiment, to reduce the number of interactions with the external memory, when a node produces output data it usually needs to store that output data in the internal memory so that it can be read by the next node. Since the output data of a non-fork node is exactly the input data of the next node, when computing the occupied space of such a node it suffices to take the sum of the output data quantity of its output data and the input data quantity of its input data as the space required by that node; in this case there is no target output data.

However, if there is a fork node before the node and the output data of that fork node is still needed by another network subgraph, the output data of the fork node has to remain in the internal memory. Therefore, when computing the space required by the node, the output data quantities of the fork node outputs that cannot be released must also be added to obtain the space the node requires; the outputs of the fork nodes that have not been released at that moment are the target output data. The purpose of judging whether the output data of a fork node can be released is that, when it can be released, it is deleted from the internal memory so as to increase as much as possible the available space the internal memory can allocate to the next node, so that the next node does not need to store its output data in the external memory, reducing the number of interactions with the external memory.
S44: If the occupied space of every node is smaller than the available space the smart chip allocates to the corresponding node, the smart chip determines the slicing scheme as a valid slicing scheme.

In application, when the space a node occupies while processing its input and output data is smaller than the available space the smart chip allocates to it, the internal memory of the smart chip has enough space to store the input data the node needs and, at the same time, the output data the node produces.

In application, the available space may be the total space of the internal memory of the smart chip, or the space remaining in the internal memory after the computation parameters the node uses to process data have been stored, which is not limited here.

It can be understood that the advantage of slicing the output data of the end node with a valid slicing scheme is that it reduces the amount of input data the node processes each time and the amount of output data it produces, so that the space each node needs while processing its input and output data is reduced. The smart chip can then keep the output data produced by every node in the internal memory, reducing the number of interactions with the external memory.
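A compact sketch of the S41-S44 check is given below. It processes the nodes of one subgraph in order, keeps the fork outputs that cannot be released as the live set, and requires every node's occupied space to be smaller than the available space Q. The per-node sizes are supplied directly as inputs here; in the method they come from the back-inference of S41. The dictionary layout is an assumption for illustration.

```python
def subgraph_fits(nodes, available_space, live):
    """
    nodes: ordered list of dicts with keys
        name, producer       -- this node and the node whose output feeds it
        in_bytes, out_bytes  -- sliced input/output sizes from the back-inference
        keep_output          -- True for a fork output still needed by another subgraph
        releases             -- names of live fork outputs with no later consumer
    live: dict mapping fork-node names to output sizes already resident on entry.
    """
    for node in nodes:
        # The producer's output is already counted as this node's input, so only the
        # other unreleased fork outputs contribute to the target output data.
        target_output = sum(b for name, b in live.items() if name != node["producer"])
        occupied = target_output + node["in_bytes"] + node["out_bytes"]
        if occupied >= available_space:
            return False                      # S44 fails for this node
        for name in node.get("releases", ()):
            live.pop(name, None)              # fork output no longer needed anywhere
        if node.get("keep_output", False):
            live[node["name"]] = node["out_bytes"]
    return True
```

Checking the subgraphs of Fig. 2 in the order A, B, C with the same `live` dictionary reproduces the conditions derived in the worked example below (Q1 < Q, Q2 < Q and Q3 + Q1b < Q for graph A, and so on).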
For example, assume that the space of the internal memory of the smart chip is Q, where Q is the space remaining in the internal memory after the space occupied by the node's computation parameters is removed. The space a non-fork node occupies while processing data generally includes the space occupied by its input data and its output data. This is described in detail below with reference to Fig. 2:

For any slicing scheme, the input data and output data of every node under that scheme are obtained. Referring to Fig. 2, for network subgraph A (the subgraph consisting of end node 3, fork node 1 and fork node 2), slicing the output data of end node 3 requires internal memory of input data = Q3a and output data = Q3b, written as Q3 = Q3a + Q3b. The output data of fork node 2 cannot be released or sliced, so it requires internal memory of input data = Q2a and output data = Q2b, written as Q2 = Q2a + Q2b. The output data of fork node 1 likewise cannot be released or sliced, requiring input data = Q1a and output data = Q1b, written as Q1 = Q1a + Q1b. The reason the outputs of fork nodes 1 and 2 cannot be released or sliced is that both outputs need to be reused (graphs A and C both contain nodes that need the output of fork node 1, and graphs A and B both contain nodes that need the output of fork node 2).

It should be noted that, for every node of the multi-output neural network, at the instant the input data is produced during the back-inference from the output data, the node holds both its input data and its output data. Therefore, the most basic requirement each node must satisfy is: at the moment the node generates its input data, the sum of its input data quantity and its output data quantity is less than or equal to the available space the smart chip allocates to the node.

Therefore, when computing the internal memory space needed by fork node 1, Q1 < Q must hold before the space required by fork node 2 is computed; otherwise the slicing scheme is invalid. Since the input data of fork node 2 is exactly the output data of fork node 1, and in actual processing the input data of fork node 1 no longer needs to be kept in the internal memory once fork node 1 has computed its output, only the output data of fork node 1 remains in the internal memory at that point. Because the input data of fork node 1 has been deleted and the output data of fork node 1 is the input data of fork node 2, when computing the space required by fork node 2 it suffices that Q2 < Q, after which end node 3 is computed; otherwise the slicing manner is invalid. When computing the space required by end node 3, the input data of end node 3 is the output data of fork node 2, and the output data of fork node 1 still has to be used by other nodes (node 6 in graph C uses it). Hence the space required by end node 3 is Q3 + Q1b: if Q3 + Q1b < Q, the slicing scheme is valid; otherwise it is invalid.
After graph A is sliced, the output data Q1b and Q2b of fork node 1 and fork node 2 will still be referenced by graphs B and C, so they cannot be released.

Graph B is the network subgraph containing end node 5 and non-fork node 4. For non-fork node 4, the internal memory it needs is input data = Q4a and output data = Q4b, written as Q4 = Q4a + Q4b. However, when graph A was sliced, Q1b of fork node 1 and Q2b of fork node 2 were not released, and the output data Q2b of fork node 2 is the input data Q4a of non-fork node 4. Therefore the space non-fork node 4 occupies while processing its input data must satisfy Q4b + Q1b + Q2b < Q before end node 5 can be computed; otherwise the slicing scheme for end node 5 is invalid. When computing end node 5, the input data of end node 5 is the output data of non-fork node 4, and at this point the input data Q4a of non-fork node 4 (i.e. Q2b) can be released (graph C does not need it). Hence the space required by end node 5 must satisfy Q5 + Q1b (the output of fork node 1 is still needed by graph C and has not been released): if Q5 + Q1b < Q, the slicing manner is valid; otherwise it is invalid.

Similarly, after graph B is sliced, the output data Q2b of fork node 2 has been released, but the output Q1b of fork node 1 still cannot be released because graph C still needs it; Q1b is therefore still used when slicing graph C. In graph C, judging whether the space required by each node fits within the available space of the smart chip is similar to the above and is not explained again.

It should be noted that if any one of the network subgraphs has no valid slicing scheme, slicing of the whole multi-output neural network fails. In addition, the above output data is the data used when determining whether a slicing scheme is valid, obtained by a back-inference process from output to input. After the back-inference, if slicing the output data with scheme X during the back-inference allows the smart chip to allocate enough available space for every node, then when the multi-output neural network limits, in the same slicing manner, the amount of input data each node processes, the smart chip should likewise be able to allocate enough available space for every node to store the output data it produces.
In another embodiment, after the slicing scheme of the multi-output neural network is determined, the operators that process the input and output data in the multi-output neural network also need to be configured so that they process the input and output data according to the slicing scheme.

Specifically, each node includes a plurality of operators that process its input data and/or output data, and each operator needs to call a corresponding scheduling statement to be implemented. The slicing method for the multi-output neural network further includes:

For any node of the multi-output neural network graph, the smart chip obtains, among the node's operators, the start operator that first processes the node's input data and the end operator that last processes the node's output data.

If the node is an end node, the smart chip calls, when running the start operator, a first scheduling statement that performs a first operation; and calls, when running the end operator, a second scheduling statement that performs a second operation and a third scheduling statement that performs a third operation. The first operation reads the input data from the available space of the smart chip into the start operator; the second operation writes the output data produced by the end operator into the internal memory; and the third operation writes the output data produced by the end operator of the end node into the external memory.

If the node is the starting node of the multi-output neural network, the smart chip calls, when running the start operator of the starting node, a fourth scheduling statement that performs a fourth operation as well as the first scheduling statement; and calls the second scheduling statement when running the end operator of the starting node. The fourth scheduling statement reads the input data from the external memory into the start operator of the starting node.

If the node is an intermediate node, the smart chip calls the first scheduling statement whenever it runs the start operator of the intermediate node, and calls the second scheduling statement whenever it runs the end operator of the intermediate node.
In an embodiment, a node is the smallest processing unit of the multi-output neural network graph and contains the computation formulas applied to the input data. A node usually contains a large number of operators that process the input data.

When a node contains multiple operators, the first operator that processes the input data is taken as the start operator and the last operator that produces the output data is taken as the end operator.

In an embodiment, the first operation is specifically the operation performed for the start operator; it reads the input data the start operator of the node needs. When the node is not the starting node of the multi-output neural network, the first operation reads the input data the start operator needs from the internal memory. If the node is the starting node, a fourth operation that reads the required input data from the external memory also has to be performed; when the input data is read, it is first stored in the internal memory and then read from the internal memory.

In an embodiment, the second operation writes the output data of the end operator into the available space. It can be understood that, if the node is an end node, the output data produced by the end operator not only needs to be written into the internal memory but also requires the third operation of writing the output data into the external memory. For an intermediate node (neither the starting node nor an end node), only the second operation of writing the output data of the end operator into the available space of the internal memory, and the first operation of reading the input data from the available space of the internal memory, need to be performed.

Referring to Fig. 2, the starting node is fork node 1; the end nodes are nodes 3, 5 and 7; and the intermediate nodes are nodes 2, 4 and 6.

It should be noted that, when the scheduling statements are generated for each operator, the multiple operators within one node can be fused so that they are combined and placed into the same kernel. In this way, through operator fusion, the smart chip does not move the data of fork nodes to the external memory, which effectively reduces the number of data transfers and saves processing time on the smart chip.
For example, graphs A, B and C are scheduled in post-order according to the optimal slicing manner found for each. Assume the last operator (compute_op) of each node is denoted the end operator (END_OP) and the first operator (compute_op) is denoted the start operator (START_OP). This is described below with reference to Fig. 2:

In graph A: the END_OP of end node 3 is taken as the first root node 1, and the split primitive is called on the N, C, H and W axes to slice. cache_write (store write) on END_OP generates the compute_op that moves the output data to DM (internal memory) and DDR (external memory). cache_read (store read) on START_OP generates the compute_op that moves DM data into the computation unit. If a compute_op contains parameters, cache_read also has to generate the compute_op that moves the DDR data into DM and the computation unit, and all compute_ops are moved via compute_at to the axis corresponding to root node 1.

The output data of fork node 2 is then obtained, the END_OP of fork node 2 is set as the second root node 2, and cache_write on END_OP generates the compute_op that moves the output data to DM. cache_read on START_OP generates the compute_op that moves DM data into the computation unit. If a compute_op contains parameters, cache_read also has to generate the compute_op that moves DDR data into DM and the computation unit, and all compute_ops are moved via compute_at to the axis corresponding to root node 2.

Fork node 1 is handled in the same way as fork node 2, setting the END_OP of fork node 1 as root node 3; since fork node 1 is the first node of the multi-output neural network, cache_read on START_OP needs to generate the compute_op that moves DDR data into DM and the computation unit.
图B中:输出节点5的END_OP为第一根节点4对N、C、H、W轴调用split原语进行切片。通过cache_write END_OP生成搬移输出数据到DM和DDR的compute_op。cache_read START_OP生成搬移DM数据到计算单元的compute_op,compute_op中所需的参数还需要cache_read生成搬移DDR数据到DM和计算单元的compute_op,并将所有的compute_op 通过compute_at移动到第一根节点4对应的轴。In Figure B: the END_OP of the output node 5 calls the split primitive for the first root node 4 to slice the N, C, H, and W axes. via cache_write END_OP generates compute_op that moves output data to DM and DDR. cache_read START_OP generates the compute_op that moves DM data to the computing unit. The parameters required in compute_op also require cache_read to generate the compute_op that moves DDR data to the DM and the computing unit, and all compute_op Move to the axis corresponding to the first root node 4 by compute_at.
节点4通过cache_write END_OP生成搬移输出数据到DM的compute_op。cache_read START_OP生成搬移DM数据到计算单元的compute_op。如果compute_op中所需的参数还需要cache_read生成搬移DDR数据到DM和计算单元的compute_op,并将所有的compute_op通过compute_at移动到第一根节点4对应的轴。Node 4 passes cache_write END_OP generates compute_op that moves output data to DM. cache_read START_OP generates compute_op that moves DM data to computing units. If the parameters required in compute_op also require cache_read to generate compute_op that moves DDR data to DM and computing unit, and move all compute_op to the axis corresponding to the first root node 4 through compute_at.
图C中:输出节点7的END_OP为第一根节点5,对N、C、H、W轴调用split原语进行切片。通过cache_write END_OP生成搬移输出数据到DM和DDR的compute_op。cache_read START_OP生成搬移DM数据到计算单元的compute_op。如果compute_op中所需的参数还需要cache_read生成搬移DDR数据到DM和计算单元的compute_op,并将所有的compute_op 通过compute_at移动到第一根节点5对应的轴。In Figure C: the END_OP of the output node 7 is the first root node 5, and the split primitive is called for the N, C, H, and W axes to slice. via cache_write END_OP generates compute_op that moves output data to DM and DDR. cache_read START_OP generates compute_op that moves DM data to computing units. If the parameters required in compute_op also require cache_read to generate compute_op that moves DDR data to DM and computing unit, and all compute_op Move to the axis corresponding to the first root node 5 by compute_at.
节点6通过cache_write END_OP生成搬移输出数据到DM的compute_op,cache_read START_OP生成搬移DM数据到计算单元的compute_op。如果compute_op中所需的参数还需要cache_read生成搬移DDR数据到DM和计算单元的compute_op,并将所有的compute_op通过compute_at移动到第一根节点5对应的轴。Node 6 passes cache_write END_OP generates a compute_op that moves output data to DM, and cache_read START_OP generates a compute_op that moves DM data to a computing unit. If the parameters required in compute_op also require cache_read to generate compute_op that moves DDR data to DM and computing unit, and move all compute_op to the axis corresponding to the first root node 5 through compute_at.
需要说明的是,split,cache_write,cache_read,compute_at等都是TVM的调度原语,上述步骤主要为针对输入数据和输出数据的处理流程和读写流程进行改进的步骤。即采用上述方法可以减少智能芯片从外部存储器中读取输入数据的次数。但是每个算子在运行时,可能还需要从外部存储器搬移其他计算参数,以对输入数据进行处理。然而,搬移计算参数是需要访问外部存储器的。也即本申请中所描述多输出神经网络图的切片方法,只是可以降低多输出神经网络图从外部存储器访问输入数据的次数。It should be noted that split, cache_write, cache_read, compute_at, etc. are all scheduling primitives of TVM. The above steps are mainly for improving the processing flow of input data and output data and the reading and writing flow. That is, the above method can reduce the number of times the smart chip reads input data from the external memory. However, when each operator is running, it may also need to move other calculation parameters from the external memory to process the input data. However, moving calculation parameters requires access to external memory. That is, the slicing method of the multi-output neural network graph described in this application can only reduce the number of times the multi-output neural network graph accesses input data from an external memory.
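For readers unfamiliar with these primitives, the following is a minimal sketch of the pattern described above, written against TVM's te schedule API. The operator, the tensor shape, the split factors and the use of the generic "local" scope as a stand-in for the chip's DM are illustrative assumptions, not the actual implementation of this application.

```python
# Minimal sketch: cache_write/cache_read create the data-moving compute_ops,
# split slices the root node's N/C/H/W axes, and compute_at attaches the
# moved ops to the sliced root axis. Shapes, factors and the "local" scope
# are placeholder assumptions.
import tvm
from tvm import te

N, C, H, W = 1, 64, 56, 56
A = te.placeholder((N, C, H, W), name="A")                      # input held in DDR
B = te.compute((N, C, H, W), lambda n, c, h, w: A[n, c, h, w] * 2.0, name="B")

s = te.create_schedule(B.op)                                    # B.op stands in for the root node's END_OP
B_dm = s.cache_write(B, "local")                                # compute_op that moves output data to DM
A_dm = s.cache_read(A, "local", [B_dm])                         # compute_op that moves DDR input to DM

no, ni = s[B].split(B.op.axis[0], factor=1)                     # slice N
co, ci = s[B].split(B.op.axis[1], factor=16)                    # slice C
ho, hi = s[B].split(B.op.axis[2], factor=14)                    # slice H
wo, wi = s[B].split(B.op.axis[3], factor=14)                    # slice W
s[B].reorder(no, co, ho, wo, ni, ci, hi, wi)

s[B_dm].compute_at(s[B], wo)                                    # attach the moved ops to the root axis
s[A_dm].compute_at(s[B], wo)

print(tvm.lower(s, [A, B], simple_mode=True))                   # inspect the resulting loop nest
```

On an actual chip the DM would be exposed as its own storage scope and the split factors would come from the selected slicing scheme; the generic scope above is used only for illustration.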
In an embodiment, referring to FIG. 5, FIG. 5 is a schematic diagram of a processing scenario in which a single-output neural network processes input data in a smart chip. Step 1: the smart chip obtains the input data of the single-output neural network from the external memory and stores it in the internal memory. Step 2: the smart chip reads the input data from the internal memory, processes it with the model parameters in computing unit A to obtain output data, and then stores the output data in the external memory. Step 3: the smart chip clears the output data and the input data of node A. Step 4: the input data is read from the external memory again and processed with the model parameters in computing unit B to obtain output data. Step 5: the output data of node B is stored in the external memory. It can be seen from FIG. 5 that, for a single-output neural network model, an access to the external memory has to be performed when processing the input data of every node.
However, for a multi-output neural network, FIG. 6 is a schematic diagram of a processing scenario in which a multi-output neural network processes input data in a smart chip. This example is explained only with a multi-output neural network formed by nodes 1, 2, 3, 4 and 5. Specifically, Step 1: the smart chip obtains the input data of the multi-output neural network from the external memory and stores it in the internal memory. Step 2: for node 2, the smart chip can read the input data directly from the internal memory and delete the output data of node 1 from the internal memory; afterwards, the output data of node 2 is stored in the internal memory. Meanwhile, it can be known from the above network subgraph segmentation operation that, even if the output data of node 2 is stored in the internal memory, the sum of the space occupied by the model parameters of the remaining nodes deployed in the internal memory and the space occupied by the generated output data will still be smaller than the available space of the internal memory. Therefore, node 4 and end node 3 can obtain the output data of node 2 directly from the internal memory, and node 4 then also places its generated output data in the internal memory to be read and processed by end node 5. Afterwards, the smart chip executes Step 3 and Step 4: the output data generated by end node 3 and end node 5 are respectively stored in the external memory.
It can be seen from FIG. 6 that, for a multi-output neural network, the nodes in each network subgraph exchange data by reading and writing the data stored in the internal memory. In this way, the operation of accessing the external memory only needs to be performed once for the start node and once for each output node of the multi-output neural network.
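To make the contrast concrete, the following toy count compares the external-memory accesses of the two scenarios; the "one write plus one read per edge" model and the chain lengths are assumptions for illustration, not measurements from the application.

```python
# Toy model of DDR accesses: in the single-output scenario every intermediate
# result is written to DDR and read back by the next node; in the sliced
# multi-output scenario only the network input and the end-node outputs touch DDR.
def ddr_accesses_single_output(num_nodes: int) -> int:
    return 1 + 2 * (num_nodes - 1) + 1    # input read, per-edge write+read, final output write

def ddr_accesses_sliced_multi_output(num_end_nodes: int) -> int:
    return 1 + num_end_nodes              # one input read plus one write per end node

print(ddr_accesses_single_output(5))        # 10 for a five-node chain
print(ddr_accesses_sliced_multi_output(2))  # 3 for the two-output example of FIG. 6
```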
Please refer to FIG. 7, which is a structural block diagram of a slicing apparatus for a multi-output neural network provided by an embodiment of the present application. The modules included in the slicing apparatus for a multi-output neural network in this embodiment are used to execute the steps in the embodiments corresponding to FIG. 1 and FIG. 3 to FIG. 4; for details, please refer to FIG. 1 and FIG. 3 to FIG. 4 and the related descriptions in the corresponding embodiments. For ease of description, only the parts related to this embodiment are shown. Referring to FIG. 7, the slicing apparatus 700 for a multi-output neural network graph may include: a segmentation module 710, a slicing module 720, a processing duration obtaining module 730 and a slicing scheme determination module 740, wherein:
The segmentation module 710 is configured to segment each end node in the multi-output neural network from the other nodes to generate multiple network subgraphs each containing an end node; the multi-output neural network includes at least two end nodes and at least one fork node.
The slicing module 720 is configured to, for any network subgraph, slice the output data volume of the output data of the end node of the network subgraph according to multiple preset slicing ways to obtain multiple slicing schemes of the network subgraph.
The processing duration obtaining module 730 is configured to, for any slicing scheme of a network subgraph, obtain the processing duration of the network subgraph and determine the target slicing scheme of the network subgraph according to the processing duration.
The slicing scheme determination module 740 is configured to determine the slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
In an embodiment, the segmentation module 710 is further configured to:
S1: take the end node as a first current node. S2: if the parent node of the first current node is a fork node, judge whether the fork node has already been segmented into a network subgraph corresponding to another end node. S3: if it is determined that the fork node has already been segmented into a network subgraph corresponding to another end node, segment the nodes between the first current node and the parent node to generate a network subgraph containing the end node. S4: if it is determined that the fork node has not been segmented into a network subgraph corresponding to another end node, or if the parent node is a non-fork node, determine the parent node as the new first current node and repeat S2 to S4, as in the sketch below.
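A rough sketch of S1 to S4 in code form; the single-parent chain and the node attributes are simplifying assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GNode:
    name: str
    parent: Optional["GNode"] = None   # single parent assumed for this sketch
    is_fork: bool = False

def segment(end_nodes: List[GNode]) -> List[List[GNode]]:
    """Walk from each end node toward the network input; cut at the first fork
    node that another end node's subgraph has already claimed (S3)."""
    claimed_forks = set()
    subgraphs = []
    for end in end_nodes:
        sub, cur = [end], end                  # S1: the end node is the first current node
        while cur.parent is not None:
            parent = cur.parent
            if parent.is_fork and id(parent) in claimed_forks:
                break                          # S3: split between cur and parent
            if parent.is_fork:
                claimed_forks.add(id(parent))  # first subgraph to reach this fork keeps it
            sub.append(parent)                 # S4: the parent becomes the new current node
            cur = parent
        subgraphs.append(sub)
    return subgraphs
```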
In an embodiment, the slicing module 720 is further configured to:
obtain the minimum dimension of the data the end node can output and the maximum dimension of the data it can output; and, between the minimum dimension and the maximum dimension, determine the outputable data of any one dimension as one slicing way for slicing the output data of the end node.
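One possible reading of this enumeration, sketched in code; the per-axis doubling step and treating each candidate extent as a separate slicing way are assumptions, not details stated by the application.

```python
# Enumerate candidate slice extents for one axis of the end node's output,
# from the smallest outputable extent up to the full extent.
def candidate_slice_extents(full_extent: int, min_extent: int = 1):
    extent = min_extent
    while extent < full_extent:
        yield extent
        extent *= 2          # assumed step between candidates
    yield full_extent

print(list(candidate_slice_extents(56)))   # e.g. [1, 2, 4, 8, 16, 32, 56] for H = 56
```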
In an embodiment, the slicing apparatus 700 for a multi-output neural network graph further includes:
a judging module, configured to, for any slicing scheme of a network subgraph, judge whether the slicing scheme is a valid slicing scheme, and execute the step of obtaining the processing duration of the network subgraph only when the slicing scheme is a valid slicing scheme.
In an embodiment, the judging module is further configured to:
infer the input data and the output data of each node in reverse from the output data volume of the sliced output data of the end node; for any node, determine the target output data of the nodes preceding that node; determine the space occupied when that node is executed according to the target output data and the input data and output data of that node; and, if the occupied space of every node is smaller than the available space allocated by the smart chip for the corresponding node, determine the slicing scheme as a valid slicing scheme, as in the sketch below.
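A compact sketch of this validity check; the field names, the flat per-node DM budget and the assumption that the per-node sizes have already been back-inferred from the sliced end-node output are all illustrative choices.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SlicedNode:
    name: str
    in_bytes: int                            # input size implied by the sliced end-node output
    out_bytes: int                           # output size implied by the sliced end-node output
    is_fork: bool = False
    needed_by_other_subgraph: bool = False   # fork output that cannot be released yet

def is_valid_scheme(nodes: List[SlicedNode], available_bytes: int) -> bool:
    """nodes is one subgraph in execution order, with sizes already back-inferred."""
    for i, node in enumerate(nodes):
        # Target output data: earlier fork outputs that must stay resident in the DM.
        pinned = sum(n.out_bytes for n in nodes[:i]
                     if n.is_fork and n.needed_by_other_subgraph)
        occupied = pinned + node.in_bytes + node.out_bytes
        if occupied >= available_bytes:      # must stay below the allocated space
            return False
    return True
```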
In an embodiment, the judging module is further configured to:
determine all fork nodes preceding that node and the output data of each fork node; judge whether the output data of each fork node can be released, and take the output data of all fork nodes that cannot be released as the target output data.
In an embodiment, the judging module is further configured to:
if the output data of a fork node is also to be used by another network subgraph, determine that the output data of the fork node cannot be released; otherwise, determine that the output data of the fork node can be released.
It should be understood that, in the structural block diagram of the slicing apparatus for a multi-output neural network graph shown in FIG. 7, each module is used to execute the steps in the embodiments corresponding to FIG. 1 and FIG. 3 to FIG. 4, and these steps have been explained in detail in the above embodiments. For details, please refer to FIG. 1 and FIG. 3 to FIG. 4 and the related descriptions in the corresponding embodiments, which are not repeated here.
FIG. 8 is a structural block diagram of a smart chip provided by an embodiment of the present application. As shown in FIG. 8, the smart chip 800 of this embodiment includes: a processor 810, a memory 820, and a computer program 830 stored in the memory 820 and executable on the processor 810, for example a program of a slicing method for a multi-output neural network. When the processor 810 executes the computer program 830, the steps in the above embodiments of the slicing method for a multi-output neural network are implemented, for example S101 to S104 shown in FIG. 1. Alternatively, when the processor 810 executes the computer program 830, the functions of the modules in the embodiment corresponding to FIG. 7 are implemented, for example the functions of the modules 710 to 740 shown in FIG. 7; for details, please refer to the related descriptions in the embodiment corresponding to FIG. 7.
Exemplarily, the computer program 830 may be divided into one or more modules, which are stored in the memory 820 and executed by the processor 810 to implement the slicing method for a multi-output neural network provided by the embodiments of the present application. The one or more modules may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 830 in the smart chip 800. For example, the computer program 830 may implement the slicing method for a multi-output neural network provided by the embodiments of the present application.
The smart chip 800 may include, but is not limited to, the processor 810 and the memory 820. Those skilled in the art can understand that FIG. 8 is only an example of the smart chip 800 and does not constitute a limitation on the smart chip 800; it may include more or fewer components than shown, or combine certain components, or use different components.
The processor 810 may be a central processing unit, or another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 820 may be an internal storage unit of the smart chip 800, for example a hard disk or memory of the smart chip 800. The memory 820 may also be an external storage device of the smart chip 800, for example a plug-in hard disk, a smart memory card or a flash memory card equipped on the smart chip 800. Further, the memory 820 may include both an internal storage unit of the smart chip 800 and an external storage device.
An embodiment of the present application provides a smart chip, including a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the slicing method for a multi-output neural network in each of the above embodiments.
An embodiment of the present application provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the slicing method for a multi-output neural network in each of the above embodiments.
An embodiment of the present application provides a computer program product which, when run on a computer, causes the computer to execute the slicing method for a multi-output neural network in each of the above embodiments.
The above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.

Claims (10)

  1. A slicing method for a multi-output neural network, applied to a smart chip, the method comprising:
    segmenting each end node in the multi-output neural network from the other nodes to generate multiple network subgraphs containing the end nodes; the multi-output neural network comprising at least two end nodes and at least one fork node;
    for any network subgraph, slicing the output data of the end node of the network subgraph according to multiple preset slicing ways to obtain multiple slicing schemes of the network subgraph;
    for any slicing scheme of the network subgraph, obtaining a processing duration of the network subgraph, and determining a target slicing scheme of the network subgraph according to the processing duration;
    determining a slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
  2. The method according to claim 1, wherein the segmenting each end node in the multi-output neural network from the other nodes to generate a network subgraph containing the end node comprises:
    S1: taking the end node as a first current node;
    S2: if a parent node of the first current node is the fork node, judging whether the fork node has already been segmented into a network subgraph corresponding to another end node;
    S3: if it is determined that the fork node has already been segmented into a network subgraph corresponding to another end node, segmenting the nodes between the first current node and the parent node to generate a network subgraph containing the end node;
    S4: if it is determined that the fork node has not been segmented into a network subgraph corresponding to another end node, or if the parent node is a non-fork node, determining the parent node as a new first current node, and repeating S2 to S4.
  3. The method according to claim 1, wherein the slicing the output data of the end node of the network subgraph according to multiple preset slicing ways to obtain multiple slicing schemes of the network subgraph comprises:
    obtaining a minimum dimension of data the end node can output and a maximum dimension of the data it can output;
    between the minimum dimension and the maximum dimension, determining the outputable data of any one dimension as one slicing way for slicing the output data of the end node.
  4. The method according to any one of claims 1 to 3, further comprising, before the obtaining the processing duration of the network subgraph:
    for any slicing scheme of the network subgraph, judging whether the slicing scheme is a valid slicing scheme, and executing the step of obtaining the processing duration of the network subgraph only when the slicing scheme is a valid slicing scheme.
  5. The method according to claim 4, wherein the judging whether the slicing scheme is a valid slicing scheme comprises:
    inferring the input data and the output data of each node in reverse according to the output data volume of the sliced output data of the end node;
    for any node, determining target output data of the nodes preceding that node;
    determining the space occupied when that node is executed according to the target output data and the input data and the output data of that node;
    if the occupied space of every node is smaller than the available space allocated by the smart chip for the corresponding node, determining the slicing scheme as a valid slicing scheme.
  6. The method according to claim 5, wherein, for any node, determining the target output data of the nodes preceding that node comprises:
    determining all fork nodes preceding that node and the output data of each of the fork nodes;
    judging whether the output data of each of the fork nodes can be released, and taking the output data of all fork nodes that cannot be released as the target output data.
  7. The method according to claim 6, wherein the judging whether the output data of each of the fork nodes can be released comprises:
    for any one of the fork nodes, if the output data of the fork node is also to be used by another network subgraph, determining that the output data of the fork node cannot be released;
    otherwise, determining that the output data of the fork node can be released.
  8. A slicing apparatus for a multi-output neural network, applied to a smart chip, the apparatus comprising:
    a segmentation module, configured to segment each end node in the multi-output neural network from the other nodes to generate multiple network subgraphs containing the end nodes; the multi-output neural network comprising at least two end nodes and at least one fork node;
    a slicing module, configured to, for any network subgraph, slice the output data volume of the output data of the end node of the network subgraph according to multiple preset slicing ways to obtain multiple slicing schemes of the network subgraph;
    a processing duration obtaining module, configured to, for any slicing scheme of the network subgraph, obtain a processing duration of the network subgraph and determine a target slicing scheme of the network subgraph according to the processing duration;
    a slicing scheme determination module, configured to determine a slicing scheme of the multi-output neural network according to the target slicing scheme of each network subgraph.
  9. A smart chip, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
PCT/CN2022/143527 2022-02-25 2022-12-29 Slicing method and apparatus for multi-output neural network, and chip and storage medium WO2023160236A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210181249.1 2022-02-25
CN202210181249.1A CN114648105A (en) 2022-02-25 2022-02-25 Slicing method, device, chip and storage medium of multi-output neural network

Publications (1)

Publication Number Publication Date
WO2023160236A1 true WO2023160236A1 (en) 2023-08-31

Family

ID=81993069

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/143527 WO2023160236A1 (en) 2022-02-25 2022-12-29 Slicing method and apparatus for multi-output neural network, and chip and storage medium

Country Status (2)

Country Link
CN (1) CN114648105A (en)
WO (1) WO2023160236A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648105A (en) * 2022-02-25 2022-06-21 深圳云天励飞技术股份有限公司 Slicing method, device, chip and storage medium of multi-output neural network


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190286972A1 (en) * 2018-03-14 2019-09-19 Microsoft Technology Licensing, Llc Hardware accelerated neural network subgraphs
WO2021012609A1 (en) * 2019-07-24 2021-01-28 华为技术有限公司 Neural network segmentation method, prediction method, and related apparatus
CN113994350A (en) * 2020-03-27 2022-01-28 华为技术有限公司 Generating parallel computing schemes for neural networks
CN114648105A (en) * 2022-02-25 2022-06-21 深圳云天励飞技术股份有限公司 Slicing method, device, chip and storage medium of multi-output neural network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117172289A (en) * 2023-09-01 2023-12-05 苏州亿铸智能科技有限公司 Tensor segmentation method and device and electronic equipment
CN117172289B (en) * 2023-09-01 2024-09-06 苏州亿铸智能科技有限公司 Tensor segmentation method and device and electronic equipment

Also Published As

Publication number Publication date
CN114648105A (en) 2022-06-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22928474

Country of ref document: EP

Kind code of ref document: A1