CN115860066A - Neural network reasoning pipeline multiplexing method based on batch processing - Google Patents

Neural network reasoning pipeline multiplexing method based on batch processing

Info

Publication number
CN115860066A
CN115860066A
Authority
CN
China
Prior art keywords
calculation
access
neural network
instruction
network
Prior art date
2022-12-19
Legal status
Pending
Application number
CN202211629670.0A
Other languages
Chinese (zh)
Inventor
许嘉帆
王炜
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
2022-12-19
Filing date
2022-12-19
Publication date
2023-03-28
Application filed by Nanjing University
Priority to CN202211629670.0A
Publication of CN115860066A

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a batch-processing-based neural network inference pipeline multiplexing method comprising the following steps: analyzing the different stages of the network; partitioning each computation layer of the network into slice units; generating the corresponding instructions from the slice units; deriving a subgraph partition-and-fusion scheme by analyzing the memory-access-to-computation ratios of the different stages of the network; reordering the instruction sequences of the matched subgraphs with a dynamic programming algorithm; and deploying the resulting instruction sequence on the target hardware. By overlapping consecutive tasks, the invention interleaves memory-access-intensive operators with compute-intensive operators so that the memory-access and computation loads are balanced; the utilization of hardware resources and the computational throughput are therefore improved without adding hardware resources, while the method retains good portability and extensibility.

Description

Neural network reasoning pipeline multiplexing method based on batch processing
Technical Field
The invention relates to the fields of neural networks, NPUs, FPGAs, software-hardware co-optimization systems and neural network compilation, and in particular to a neural network reasoning pipeline multiplexing method based on batch processing.
Background
Deep learning is widely applied in image recognition, recommendation systems, autonomous driving and other fields, and the broad deployment of neural networks in production and daily life poses new challenges to the computing power of cloud and edge devices. The traditional CPU architecture cannot satisfy the massive parallel-computing demand of neural network workloads, so GPUs, which provide the CUDA general-purpose parallel computing architecture, have been widely adopted for neural network training and inference. As networks evolve, their computation volume keeps growing and their structures become more complex, and this wide adoption in turn drives the evolution of hardware: Nvidia introduced Tensor Cores in the V100 (Volta) architecture, Google designed the TPU specifically for the TensorFlow framework, and several other companies have likewise invested in dedicated AI hardware.
A large share of architecture optimization for high-performance computing focuses on analyzing the memory-access and computation characteristics of the workload and the hardware: profiling the workload reveals whether the current bottleneck is the memory bandwidth or the compute units, and the architecture then evolves accordingly. However, once the hardware has been designed, its compute resources, memory-access paths and so on are fixed, whereas the demand for memory bandwidth and for compute resources changes across different phases of a computation. As a result, either compute resources or memory-access resources are wasted at any given time, and this problem is difficult to solve through hardware design alone.
An analysis of the convolutional neural networks run on current FPGA accelerators shows that, when such an accelerator executes a convolutional network, memory-access-intensive operators and compute-intensive operators alternate across different phases of the run. In the first layers of a convolutional network the feature maps are usually large while the numbers of input and output channels are small, so the weight data volume is small and the operators are compute-intensive; in the later part of the network the input and output channels expand rapidly, the weight data volume grows quickly, and the operators become memory-access-intensive.
When the network executes a compute-intensive operator the memory bandwidth is under-used, and when it executes a memory-access-intensive operator the compute units sit idle. At the same time, data dependences exist between the successive computation layers of a single task, which makes reordering within one task difficult. In practice, however, there are many scenarios in which a neural network processes a continuous stream of inputs. Exploiting this property, the invention designs a batch-processing-based pipeline multiplexing scheme over several consecutive tasks: after the operators are cut into instructions, the memory-access-intensive operators in the second half of the preceding task are interleaved with the compute-intensive operators in the first half of the following task, which raises the utilization of both the memory-access and the compute resources of the accelerator and thereby improves performance.
Disclosure of Invention
The invention provides a neural network reasoning pipeline multiplexing method based on batch processing, i.e. an instruction scheduling method for neural network inference acceleration hardware. The technical scheme aims to solve the problem that memory-access resources and compute resources are wasted because the memory-access-to-computation load is unbalanced across different phases of the computation.
In order to solve the problems, the invention is realized by the following technical scheme:
a neural network reasoning pipeline multiplexing method based on batch processing is characterized by comprising the following steps:
step 1) analyzing different stages of a network to obtain the access-storage-calculation ratio of each stage;
step 2) segmenting each calculation layer of the network to obtain slicing units which are similar in size and suitable for hardware execution;
step 3) generating a corresponding instruction according to the slicing unit in the step 2);
step 4) obtaining a subgraph segmentation fusion scheme by analyzing access and storage calculation ratios of different stages of the network;
step 5), rearranging the instruction sequence of the matched subgraph by using a dynamic programming algorithm;
and 6) deploying on the target hardware according to the instruction sequence.
The neural network reasoning pipeline multiplexing method based on batch processing is characterized in that the inference tasks of the neural network are processed continuously and are scheduled in a pipelined manner.
The neural network reasoning pipeline multiplexing method based on batch processing is characterized in that the subgraph partition-and-fusion scheme of step 4) partitions the subgraphs in order according to the memory-access-to-computation ratio of the whole network and ensures that the 2 subgraphs to be fused are respectively compute-intensive and memory-access-intensive.
The neural network reasoning pipeline multiplexing method based on batch processing is characterized in that the dynamic programming algorithm of step 5) uses the memory-access-to-computation ratio of the instructions already ordered as the basis for selecting the next instruction.
The neural network reasoning pipeline multiplexing method based on batch processing is characterized in that the instruction sequence finally obtained by the method, viewed from the perspective of the tasks, forms a pipeline arrangement in which consecutive tasks partially overlap.
By adopting the above technical scheme, the invention achieves the following beneficial effects:
Aiming at the unbalanced compute/memory-access load of existing neural network inference accelerators across different phases of a run, the invention provides a batch-processing-based pipeline multiplexing method; by overlapping consecutive tasks it interleaves memory-access-intensive operators with compute-intensive operators, so that the memory-access and compute load is balanced, the utilization of hardware resources is improved without adding hardware resources, and the computational throughput increases.
Drawings
Fig. 1 is a schematic diagram of a system architecture of a VTA deep learning accelerator in the prior art.
FIG. 2 is a hardware architecture diagram of an accelerator deployed by an embodiment of the present invention.
FIG. 3 is a diagram illustrating an acceleration hardware structure according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating partitioning subgraphs and results of subgraph fusion according to the embodiment of the invention.
FIG. 5 is a flowchart of instruction order scheduling according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the effect of using 24 instructions as a test according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the effect of using 16 instructions as a test according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the following detailed description of the invention is provided in conjunction with the accompanying drawings and specific embodiments.
[ example 1 ]
Fig. 1 shows a prior-art embodiment. The VTA is an open-source, instruction-driven, general-purpose neural network inference accelerator; fig. 1 is a schematic diagram of the system architecture of the VTA deep learning accelerator, whose execution flow is as follows:
step 1) the instruction-fetch component reads an instruction from the host memory, decodes it, splits it into three parts (fetch, compute and write-back), and dispatches them to the 3 execution components;
step 2) the fetch component obtains the data required for the computation from memory according to the instruction, stores it into the input buffer and the weight buffer, and passes a token to the downstream compute component;
step 3) once the compute component receives the token, the data required for the computation is already in the buffers and can be read for computation; after the computation finishes, the result is stored in the output buffer and a token is passed to the downstream write-back component;
step 4) once the write-back component receives the token, the data in the output buffer is valid and is stored back to memory.
Steps 1) to 4) are executed repeatedly, and the instructions are executed one by one. A rough sketch of this token-driven flow follows.
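As a runnable illustration of this decoupled, token-driven execution model (a sketch only: the thread-and-queue representation, the stage names and the queue depths are assumptions made for illustration, not VTA's actual implementation):

```python
# Sketch of a three-stage load / compute / write-back pipeline coupled by
# dependency tokens, in the spirit of the VTA flow described above.
import queue
import threading

def load_stage(tiles, ld2cmp):
    for t in tiles:
        # ... fetch tile t from host memory into the input/weight buffers ...
        ld2cmp.put(t)                      # token: "data for t is in the buffer"

def compute_stage(n, ld2cmp, cmp2wb):
    for _ in range(n):
        t = ld2cmp.get()                   # wait for the load token
        result = ("result", t)             # ... run the GEMM/ALU work on t ...
        cmp2wb.put(result)                 # token: "output buffer holds the result"

def writeback_stage(n, cmp2wb):
    for _ in range(n):
        cmp2wb.get()                       # wait for the compute token
        # ... store the result back to host memory ...

def run(tiles):
    ld2cmp, cmp2wb = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    stages = [threading.Thread(target=load_stage, args=(tiles, ld2cmp)),
              threading.Thread(target=compute_stage, args=(len(tiles), ld2cmp, cmp2wb)),
              threading.Thread(target=writeback_stage, args=(len(tiles), cmp2wb))]
    for s in stages:
        s.start()
    for s in stages:
        s.join()

run([f"tile{i}" for i in range(8)])
```

The bounded queues play the role of the token channels: the fetch stage can run ahead of the compute stage only as far as the queue depth allows, which is exactly the limited look-ahead that the buffer sizes impose in the real accelerator.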
The decoupled execution components of the VTA allow memory accesses to be issued ahead of time, which relieves local memory-access/compute load imbalance to some extent; however, the imbalance still exists at the level of the whole task.
Taking 8-bit quantized ResNet-18 as an example, most of the computation consists of convolutions, and the data volume and computation amount of each convolutional layer are shown in Table 1:
[Table 1: data volume and computation amount of each convolutional layer of 8-bit quantized ResNet-18; reproduced as an image in the original publication.]
Here, "64 3 224 224 7 2" denotes a layer with 64 output channels, 3 input channels, a 224×224 input feature map, a 7×7 convolution kernel, a stride of 2 and a padding of 3.
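A simple way to reproduce this kind of per-layer profile (a sketch that assumes 8-bit activations and weights and ignores partial-sum traffic) is to count the bytes moved and the multiply-accumulates directly from the layer parameters:

```python
# Rough data-volume / computation profile of a quantized conv layer, used to
# judge whether it is compute-intensive or memory-access-intensive.
def conv_profile(c_out, c_in, h, w, k, stride, pad):
    h_out = (h + 2 * pad - k) // stride + 1
    w_out = (w + 2 * pad - k) // stride + 1
    macs = c_out * c_in * k * k * h_out * w_out      # multiply-accumulates
    traffic = (c_in * h * w                          # 8-bit input activations
               + c_out * c_in * k * k                # 8-bit weights
               + c_out * h_out * w_out)              # 8-bit outputs
    return macs, traffic, traffic / macs             # bytes per MAC

# First layer of ResNet-18: 3 -> 64 channels, 224x224 input, 7x7 kernel, stride 2, pad 3
print(conv_profile(64, 3, 224, 224, 7, 2, 3))
# One of the last 3x3 layers: 512 -> 512 channels on a 7x7 feature map, stride 1, pad 1
print(conv_profile(512, 512, 7, 7, 3, 1, 1))
```

Under these simplified counts the late 512-channel layer moves roughly 2.5 times more bytes per multiply-accumulate than the first layer, almost all of it weight data, which matches the shift from compute-intensive to memory-access-intensive operators described above.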
In a fixed hardware design whose bandwidth-to-compute ratio is close to the memory-access-to-computation ratio of the whole network, some phases of the computation are compute-intensive and others are memory-access-intensive; when compute-intensive or access-intensive operators persist over a time window longer than the on-chip buffer can cover, compute resources or memory-access resources are wasted.
When the existing VTA scheme executes this network, waste of compute resources and waste of memory-access resources therefore alternate because of the unbalanced memory-access/compute load; this is a common phenomenon in current general-purpose neural network inference accelerators for edge devices.
[ example 2 ]
The application scenario of the invention is the continuous processing of different inputs by the same neural network task, which allows pipelined instruction reordering across adjacent tasks.
To support the invention, the deployed acceleration hardware needs to satisfy the following 3 requirements:
1. flexible data prefetching;
2. controllable on-chip buffer space;
3. a clean separation between memory access and computation.
Flexible data prefetching lets the accelerator prefetch data for later computation whenever spare memory-access resources are available; a controllable on-chip buffer guarantees that aggressively prefetched data is not evicted by the replacement policy of a conventional cache; and a clean memory-access/compute separation decouples the memory accesses and the computation of a task, so that both the memory-access components and the compute components achieve higher utilization.
Taking the accelerator deployed in the invention as an example:
the hardware architecture of the accelerator deployed in the invention is shown in fig. 2, and the architecture is mainly divided into a control module, an instruction fetching module, a transmitting module, an access module, a calculating module, a write-back module and an on-chip memory. The control module is connected with a host end by using an AXI-Lite interface, and the host is used for recording the initial address of a read instruction, the number of program instructions, starting an accelerator, acquiring monitoring information of some performances and the like by writing values into a control register in the control module; the instruction fetching module is mainly used for reading instructions, acquiring the instructions from the memory according to the address in the control module and transmitting the instructions to a subsequent transmitting module; the transmitting module is mainly used for decoding and transmitting the instructions, distinguishing the types of the current instructions and transmitting the current instructions to the corresponding execution components, each type of component is provided with an instruction queue, and when the transmitting conditions of the instructions at the top of the queue are met and the execution components are idle, the instructions can be transmitted, so that the vacuoles in the production line are reduced; the access module is connected with the memory of the host by using an AXI interface, and when the access module receives an instruction, a two-dimensional tensor is acquired from the memory through the AXI interface and is stored in one block of the on-chip memory; the on-chip memory component consists of a plurality of blocks, one block can store one tensor and can be occupied by one execution component at a time, and the on-chip memory component also comprises a scoreboard which is used for finely controlling the on-chip memory; the computation modules are divided into two types, one is used for computing instructions such as matrix multiplication and convolution which need a large number of multiplication and addition components, and the other is used for computing components of addition, relu, quantization and other element-wise operations; the write-back module is opposite to the access module, and can store the tensor in one block to one position in the memory.
In the deployed accelerator the execution components are decoupled through the scoreboard and the banked on-chip memory, and the instruction set is designed to match this hardware structure. The instructions fall into 3 types (fetch, write-back and compute), corresponding to the 3 components, similar to a RISC instruction set: a fetch instruction only reads a tensor, a write-back instruction only writes a tensor back, and a compute instruction only operates on tensors.
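A minimal sketch of such a three-type, RISC-like instruction stream (the field names, the block-index scheme and the per-instruction cost fields are assumptions for illustration, not the accelerator's actual encoding):

```python
# Illustrative encoding of the three instruction types handled by the fetch,
# compute and write-back components.
from dataclasses import dataclass
from enum import Enum

class Opcode(Enum):
    FETCH = 0       # read a 2-D tensor tile from host memory into a block
    COMPUTE = 1     # GEMM/conv or element-wise op on on-chip blocks
    WRITEBACK = 2   # write a block back to host memory

@dataclass
class Instr:
    opcode: Opcode
    block: int              # block produced (FETCH/COMPUTE) or written out (WRITEBACK)
    src_blocks: tuple = ()  # on-chip blocks read by a COMPUTE instruction
    dram_addr: int = 0      # host memory address (FETCH/WRITEBACK)
    bytes_moved: int = 0    # data volume, used later for access/compute ratios
    cycles: int = 0         # estimated compute time, used later for scheduling

prog = [
    Instr(Opcode.FETCH, block=0, dram_addr=0x1000, bytes_moved=256),
    Instr(Opcode.FETCH, block=1, dram_addr=0x2000, bytes_moved=256),
    Instr(Opcode.COMPUTE, block=2, src_blocks=(0, 1), cycles=64),
    Instr(Opcode.WRITEBACK, block=2, dram_addr=0x3000, bytes_moved=256),
]
```

The bytes_moved and cycles fields are what the scheduling passes described below need in order to compute memory-access-to-computation ratios over a window of instructions.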
To suit the batch-processing-based neural network inference pipeline multiplexing method, the deployed accelerator decouples the fetch, compute and memory-access components and their instructions so that aggressive prefetching happens automatically, while the scoreboard-controlled on-chip memory allows the on-chip storage to be managed at a fine granularity.
Fig. 3 shows an abstract structure of neural network inference hardware on which the invention can be deployed; it consists of three parts, namely a memory-access path, an on-chip memory and a compute core, and is used below to illustrate the deployment of the invention. The memory-access path automatically prefetches the data needed by later computation until the on-chip memory is full; the on-chip memory acts as a buffer that absorbs prefetched data when memory-access resources are in surplus and supplies data when compute resources are in surplus, until the prefetched data in the buffer is exhausted, thereby raising the utilization of the compute core. The buffer can absorb load imbalance within a small time window, but performance is still affected by the load imbalance over the whole task; the following description is based on this abstract structure.
Instruction compiling and scheduling:
In the deployment of neural network inference acceleration hardware on edge devices, the on-chip storage is limited, so inputs and weights have to be loaded from host memory onto the chip on the fly during computation; for flexibility, the computation of each layer is partitioned according to its shape and then loaded onto the chip block by block for computation.
Neural network computation is usually performed layer by layer in a linear structure: the output of one layer is the input of the next. Compilation is therefore also mainly layer by layer: each computation layer is converted or split into several operators, instructions are generated for each operator and arranged in order, and data reuse between adjacent instructions is exploited. Taking a convolutional neural network as an example, it mainly consists of convolutional layers, pooling layers and fully connected layers. With the batch size set to 1, the main parameters of a convolutional layer are the number of input channels I, the number of output channels O, the feature-map size H×W and the convolution-window size K, while the dedicated acceleration hardware natively supports general computations of certain fixed sizes, such as fixed-size matrix multiplication. To match this relatively fixed hardware structure, the compiler splits every dimension of the convolutional layer when it is deployed, dividing the computation into small blocks that correspond to the hardware implementation in the accelerator. Different degrees of data reuse exist among the resulting blocks: for example, input data can be reused along the O and K dimensions, weights can be reused along the H and W dimensions, and output data can be reused along the I and K dimensions; because convolutional layers have different shapes, the degree of reuse differs and the compute-to-memory-access ratio changes accordingly. The splitting and compilation of pooling layers and fully connected layers is similar.
After each computation layer of the network has been split according to the hardware structure, the data dependences between successive layers mean that the instruction sequences of the individual layers are still arranged in layer order. Within one layer, instructions that use the same data can be placed next to each other so that the spatial and temporal locality of the data is exploited, data reuse inside the layer is fully used, and the memory-access demand is reduced.
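A minimal sketch of this splitting step (the tile sizes TI, TO and TH are assumed stand-ins for whatever block shapes the accelerator natively supports, and the input halo of the convolution window is ignored):

```python
# Split one conv layer (I input channels, O output channels, HxW feature map,
# KxK kernel) into hardware-sized slice units and record each unit's
# computation amount and data volume.
from itertools import product

def tile_conv_layer(I, O, H, W, K, TI=16, TO=16, TH=14):
    tiles = []
    # keeping the spatial loop innermost keeps tiles that reuse the same
    # weight block (same input/output channel block) adjacent in the order
    for o0, i0 in product(range(0, O, TO), range(0, I, TI)):
        to, ti = min(TO, O - o0), min(TI, I - i0)
        for h0, w0 in product(range(0, H, TH), range(0, W, TH)):
            th, tw = min(TH, H - h0), min(TH, W - w0)
            macs = to * ti * K * K * th * tw
            traffic = ti * th * tw + to * ti * K * K + to * th * tw  # in + weights + out
            tiles.append({"o": o0, "i": i0, "h": h0, "w": w0,
                          "macs": macs, "bytes": traffic})
    return tiles

units = tile_conv_layer(I=64, O=64, H=56, W=56, K=3)
print(len(units), "slice units; first:", units[0])
```

Each slice unit then becomes one fetch/compute/write-back instruction group, and its macs and bytes entries are what the later passes aggregate into per-layer and per-subgraph memory-access-to-computation ratios.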
The memory-access and computation demands of the whole task are then counted from the generated instructions, and their average ratio is obtained.
Relative to this average, a computation layer with a higher ratio is called memory-access-intensive and one with a lower ratio is called compute-intensive; consecutive memory-access-intensive or compute-intensive layers are grouped into one subgraph. A subgraph is then taken from the front of the task in order, and another subgraph of the opposite type is selected from the middle of the task, such that the memory-access-to-computation ratio of the two fused subgraphs is as close as possible to the average of the whole network. The next subgraph is then taken in order and a partner with the opposite ratio is again selected from the remaining ones, and so on, until the whole task has been fused.
As shown in fig. 4, the original task is divided into 4 subgraphs according to their memory-access-to-computation ratios, and the ratio of the whole task is 1. Subgraph 1 is selected first from the front; its ratio is 1/2, so it is compute-intensive, and the opposite subgraph 3, whose ratio is 2 and which is memory-access-intensive, is selected from the middle and paired with it. Subgraph 2, which is memory-access-intensive, is selected next, and the compute-intensive subgraph 4 is chosen from the back as its fusion partner. After this pairing, the overall compute and memory-access ratios are more balanced than in the original structure.
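A simplified sketch of this partition-and-pairing step (illustrative only: `ratios` are the per-layer memory-access-to-computation ratios already obtained from the generated instructions, and the pairing below simply takes the next unpaired subgraph of the opposite type, whereas the method additionally prefers the partner whose fusion brings the combined ratio closest to the whole-network average):

```python
# Partition consecutive layers into subgraphs by comparing each layer's
# access/compute ratio with the whole-task average, then pair subgraphs of
# opposite type (compute-intensive with access-intensive) for fusion.
def split_subgraphs(layer_ratios, avg):
    subgraphs, cur, cur_kind = [], [], None
    for idx, r in enumerate(layer_ratios):
        kind = "access" if r > avg else "compute"
        if kind != cur_kind and cur:
            subgraphs.append((cur_kind, cur))
            cur = []
        cur_kind = kind
        cur.append(idx)
    if cur:
        subgraphs.append((cur_kind, cur))
    return subgraphs

def pair_subgraphs(subgraphs):
    pairs, used = [], set()
    for i, (kind, _) in enumerate(subgraphs):
        if i in used:
            continue
        # find the next unused subgraph of the opposite type
        partner = next((j for j in range(i + 1, len(subgraphs))
                        if j not in used and subgraphs[j][0] != kind), None)
        if partner is not None:
            used.add(partner)
        used.add(i)
        pairs.append((i, partner))
    return pairs

ratios = [0.5, 0.6, 2.0, 1.8, 1.9, 0.4]   # toy per-layer ratios, whole-task average ~1
subs = split_subgraphs(ratios, avg=1.0)
print(subs)               # [('compute', [0, 1]), ('access', [2, 3, 4]), ('compute', [5])]
print(pair_subgraphs(subs))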
Each pair of matched subgraphs then has to be fused, i.e. their instructions have to be reordered.
Fig. 5 is the flowchart of the instruction-order scheduling of the embodiment of the invention; it describes the scheduling process from graph partitioning to the final instruction sequence and determines the instruction order after the corresponding subgraphs are fused, using a dynamic programming approach (a code sketch of these steps is given after the list):
51) compute the average memory-access-to-computation ratio of the two subgraphs;
52) move one or several instructions from the first subgraph to the tail of instruction queue X;
53) compute the memory-access-to-computation ratio of the last N selected instructions (N can be tuned according to the observed effect);
54) if the ratio computed in step 53) is smaller than the average, or the instructions of the compute-intensive subgraph are exhausted, move one or several instructions from the instruction queue of the memory-access-intensive subgraph to the tail of instruction queue X; if the ratio of the currently selected instructions is greater than or equal to the average, or the instructions of the memory-access-intensive subgraph are exhausted, move one or several instructions from the instruction queue of the compute-intensive subgraph to the tail of instruction queue X;
55) jump back to step 53) as long as both subgraphs still have instructions in their queues;
56) instruction queue X is the fused instruction queue.
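A runnable sketch of steps 51) to 56) (assuming, as in fig. 4, that the first subgraph of the pair is the compute-intensive one, and that each instruction carries the data-volume and compute-time fields assumed in the earlier instruction sketch; the flow is written as a greedy pass over the two queues, which is how the window-ratio rule of steps 53)-54) reads, while the patent frames the reordering as dynamic programming):

```python
# Fuse a compute-intensive instruction queue with a memory-access-intensive one
# (steps 51-56 above). The access/compute ratio of the last N chosen
# instructions decides which queue supplies the next instruction.
from collections import deque, namedtuple

Ins = namedtuple("Ins", "bytes_moved cycles")   # data volume, compute time

def fuse_queues(compute_q, access_q, window_n=4):
    both = list(compute_q) + list(access_q)
    avg = sum(i.bytes_moved for i in both) / max(sum(i.cycles for i in both), 1)  # step 51
    compute_q, access_q = deque(compute_q), deque(access_q)
    fused, window = [], deque(maxlen=window_n)

    def window_ratio():                                                           # step 53
        return sum(i.bytes_moved for i in window) / max(sum(i.cycles for i in window), 1)

    fused.append(compute_q.popleft()); window.append(fused[-1])                   # step 52
    while compute_q and access_q:                                                 # step 55
        # step 54: recent picks compute-heavy -> take an access-intensive
        # instruction next, otherwise take a compute-intensive one
        nxt = access_q.popleft() if window_ratio() < avg else compute_q.popleft()
        fused.append(nxt); window.append(nxt)
    fused.extend(compute_q); fused.extend(access_q)   # drain whichever queue remains
    return fused                                      # step 56: queue X

# Shaped like Example 3 below: 12 compute-intensive then 12 access-intensive instructions
order = fuse_queues([Ins(1, 1)] * 12, [Ins(3, 1)] * 12, window_n=1)
print([i.bytes_moved for i in order[:8]])   # 1, 3, 1, 3, ... i.e. the "1212..." pattern
```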
In addition, if a sufficiently accurate performance simulator exists for the hardware, the fusion algorithm can be modified: the instruction sequence chosen so far is fed into the simulator, which returns the bandwidth and compute-unit usage of the most recent execution window. If the compute units are saturated while the memory-access path is idle, the recent window was compute-heavy and the next instruction is selected from the memory-access-intensive subgraph; otherwise it is selected from the compute-intensive subgraph.
Two examples are used next to show the practical effect of the invention. The deployed hardware structure is the simple abstract model shown above: the hardware automatically prefetches the data needed by later instructions into the on-chip buffer, releases the data used by completed instructions, and lets later computation find its data in the on-chip buffer automatically; the memory-access time axis and the computation time axis are analyzed.
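To make the two time axes concrete, the following sketch models the abstract hardware just as described: a bandwidth-limited access path prefetches instruction data in program order into a fixed-capacity buffer, the compute core executes instructions in order once their data is resident, and an instruction's data is released when it finishes. This is a simplified event model built on assumptions (in-order eviction, write-back traffic ignored), so the absolute totals it reports will not exactly match the figures quoted in Examples 3 and 4 below, but it shows the same qualitative gap between the plain and the interleaved orders:

```python
# instrs: list of (data_volume, compute_time). Returns the makespan under a
# simple model: prefetch in order when buffer space allows, compute in order
# when data is resident, free an instruction's data when it finishes.
def simulate(instrs, bandwidth, buf_cap):
    n = len(instrs)
    load_done, comp_done = [0.0] * n, [0.0] * n
    mem_free_at = 0.0                            # when the access path is next idle
    for k, (data, cycles) in enumerate(instrs):
        # loading k must wait until enough older data has been freed; since
        # instructions finish in order, it suffices to wait for the newest
        # instruction j whose data must be evicted to make room
        held, space_ready = 0.0, 0.0
        for j in range(k - 1, -1, -1):
            held += instrs[j][0]
            if held + data > buf_cap:
                space_ready = comp_done[j]
                break
        start = max(mem_free_at, space_ready)
        mem_free_at = start + data / bandwidth
        load_done[k] = mem_free_at
        comp_done[k] = max(load_done[k], comp_done[k - 1] if k else 0.0) + cycles
    return comp_done[-1]

plain = [(1, 1)] * 12 + [(3, 1)] * 12            # Example-3-shaped workload, original order
mixed = [t for pair in zip([(1, 1)] * 12, [(3, 1)] * 12) for t in pair]   # "1212..." order
print(simulate(plain, bandwidth=2, buf_cap=6), simulate(mixed, bandwidth=2, buf_cap=6))
# under this model the interleaved order finishes several time units earlier
```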
[ example 3 ]
As shown in fig. 6, 24 instructions are used as the test run. Assume that the runtime of each instruction is 1, the first 12 instructions each require a data volume of 1, the last 12 instructions each require a data volume of 3, the bandwidth is 2, and the on-chip buffer size is 6.
Without the invention, the 24 instructions are deployed on the hardware in their original order. In the first half of the run, once the buffer is full, many idle slots appear on the memory-access time axis; in the later part, when memory-access resources are insufficient, many idle slots appear on the computation time axis. Over the whole task, memory-access resources and compute resources are therefore wasted at the same time, the overall performance drops, and the total time spent is 29.
With the batch-processing-based pipeline multiplexing method, the first 12 instructions form the first subgraph and the last 12 instructions form the second subgraph, and the two are fused. The overall average access/compute ratio is 2; the first subgraph (12 instructions) is compute-intensive and the second is memory-access-intensive. Following the method:
an instruction is taken from the first subgraph; the access/compute ratio is 1, which is smaller than the overall average;
an instruction is taken from the second subgraph; the access/compute ratio is 2, equal to the overall average;
an instruction is taken from the first subgraph; the access/compute ratio is 5/3, which is smaller than the overall average;
……
The instructions of the two subgraphs are thus scheduled into a "1212…" interleaving. During execution, as shown in fig. 6, the idle slots of the memory-access axis and of the computation axis fill each other, and the relatively balanced load puts less pressure on the buffer; the overall performance improves, and the total time is about 24.5.
[ example 4 ]
As shown in fig. 7, 16 instructions are used as the test run. Assume that the first 8 instructions each have a computation time of 1 and require a data volume of 1; instructions 9-12 each have a computation time of 1 and require a data volume of 2; the last 4 instructions each have a computation time of 1 and require a data volume of 4; the on-chip buffer size is 6 and the bandwidth is 2.
Without the invention, the first 8 instructions are compute-intensive and the buffer keeps prefetching data for the following instructions; once the buffer is full, memory-access resources are wasted. Instructions 9-12 have balanced compute and memory access, and the buffer occupancy stays roughly constant. The last 4 instructions are memory-access-intensive: the prefetched data in the buffer is steadily consumed, and once it drops to 0 compute resources are wasted. One task takes 19.5 on average.
With the invention, the instructions are rescheduled: the first 8 instructions form subgraph 1, instructions 9-12 form subgraph 2, and the last 4 instructions form subgraph 3; subgraph 1 and subgraph 3 are fused. Following the method, the average access/compute ratio of the fused subgraphs is 2:
an instruction is taken from subgraph 1; the access/compute ratio is 1, smaller than the overall average;
an instruction is taken from subgraph 3; the access/compute ratio is 5/2, larger than the overall average;
an instruction is taken from subgraph 1; the access/compute ratio is 2, equal to the overall average;
an instruction is taken from subgraph 1; the access/compute ratio is 7/4, smaller than the overall average;
……
The instructions are thus scheduled into a "131131…" pattern; with the memory-access and compute load relatively balanced, the pressure on the buffer is greatly reduced, and one task takes 17 on average.
The key technical points and innovations of the invention are as follows:
1. balanced, batch-processing-based scheduling of the compute-intensive and memory-access-intensive parts of several tasks across the different hardware modules;
2. a pipeline formed by overlapping compute-intensive and memory-access-intensive tasks on the time line, which improves system performance;
3. the hardware design needs to support flexible data prefetching, controllable on-chip buffer space and a clean memory-access/compute separation mechanism, but needs no extra compute cores or bandwidth;
4. instruction scheduling uses a dynamic programming algorithm driven by the memory-access-to-computation ratio of the modules.
The above description only illustrates the preferred embodiments of the present invention, and the detailed description is not to be construed as limiting the invention. Any simple modification, equivalent change or variation of the above embodiments made according to the technical spirit of the present invention, without departing from its principle and spirit, shall fall within the protection scope of the present invention.

Claims (5)

1. A neural network reasoning pipeline multiplexing method based on batch processing, characterized by comprising the following steps:
step 1) analyzing the different stages of the network to obtain the memory-access-to-computation ratio of each stage;
step 2) partitioning each computation layer of the network into slice units of similar size that are suitable for hardware execution;
step 3) generating the corresponding instructions from the slice units of step 2);
step 4) deriving a subgraph partition-and-fusion scheme by analyzing the memory-access-to-computation ratios of the different stages of the network;
step 5) reordering the instruction sequences of the matched subgraphs using a dynamic programming algorithm;
step 6) deploying the resulting instruction sequence on the target hardware.
2. The batch-processing-based neural network inference pipeline multiplexing method of claim 1, wherein the inference tasks of the neural network are processed continuously and are scheduled in a pipelined manner.
3. The batch-processing-based neural network inference pipeline multiplexing method of claim 1, wherein the subgraph partition-and-fusion scheme of step 4) partitions the subgraphs in order according to the memory-access-to-computation ratio of the whole network and ensures that the 2 subgraphs to be fused are respectively compute-intensive and memory-access-intensive.
4. The batch-processing-based neural network inference pipeline multiplexing method of claim 1, wherein the dynamic programming algorithm of step 5) uses the memory-access-to-computation ratio of the instructions already ordered as the basis for selecting the next instruction.
5. The batch-processing-based neural network inference pipeline multiplexing method of claim 1, wherein the instruction sequence obtained in step 5), viewed from the perspective of the tasks, forms a pipeline arrangement in which consecutive tasks partially overlap.
Application CN202211629670.0A, priority date 2022-12-19, filing date 2022-12-19: Neural network reasoning pipeline multiplexing method based on batch processing (CN115860066A, status: Pending)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211629670.0A 2022-12-19 2022-12-19 Neural network reasoning pipeline multiplexing method based on batch processing

Publications (1)

Publication Number Publication Date
CN115860066A 2023-03-28

Family

ID=85673932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211629670.0A (Pending) 2022-12-19 2022-12-19 Neural network reasoning pipeline multiplexing method based on batch processing

Country Status (1)

Country Link
CN (1) CN115860066A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116992966A (en) * 2023-09-28 2023-11-03 深圳鲲云信息科技有限公司 Method and computing device for artificial intelligence model reasoning platform
CN116992966B (en) * 2023-09-28 2024-01-16 深圳鲲云信息科技有限公司 Method and computing device for artificial intelligence model reasoning platform
CN117407643A (en) * 2023-11-03 2024-01-16 上海无问芯穹智能科技有限公司 Optimization method, system, equipment and medium for general matrix multiplication
CN117407643B (en) * 2023-11-03 2024-05-10 上海无问芯穹智能科技有限公司 Optimization method, system, equipment and medium for general matrix multiplication


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination