CN112015472A - Sparse convolution neural network acceleration method and system based on data flow architecture - Google Patents

Sparse convolution neural network acceleration method and system based on data flow architecture

Info

Publication number
CN112015472A
CN112015472A (application CN202010685107.XA; granted as CN112015472B)
Authority
CN
China
Prior art keywords
instruction
data flow
neural network
value
sparse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010685107.XA
Other languages
Chinese (zh)
Other versions
CN112015472B (en)
Inventor
吴欣欣
范志华
轩伟
李文明
叶笑春
范东睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202010685107.XA priority Critical patent/CN112015472B/en
Publication of CN112015472A publication Critical patent/CN112015472A/en
Application granted granted Critical
Publication of CN112015472B publication Critical patent/CN112015472B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30145 - Instruction analysis, e.g. decoding, instruction word fields
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 - Register arrangements
    • G06F 9/30141 - Implementation provisions of register files, e.g. ports
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/32 - Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F 9/321 - Program or instruction counter, e.g. incrementing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/10 - Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention provides a method for detecting invalid instructions and skipping their execution in a dataflow architecture, suitable for accelerating sparse convolutional neural networks under that architecture. The sparse neural networks concerned comprise convolutional layers and fully connected layers. An instruction detection unit screens instructions according to instruction marking information and skips the execution of invalid instructions, thereby accelerating the sparse convolutional neural network.

Description

Sparse convolution neural network acceleration method and system based on data flow architecture
Technical Field
The invention relates to the technical field of computer system structures, in particular to a sparse convolution neural network acceleration method and system based on a data flow architecture.
Background
Neural networks deliver leading performance in image detection, speech recognition, and natural language processing. As applications grow more complex, so do the network models, which poses many challenges for traditional hardware; to relieve the pressure on hardware resources, sparse networks offer clear advantages in computation, storage, and power consumption. Many algorithms and accelerators for sparse networks have appeared, such as the Sparse BLAS library for CPUs and the cuSPARSE library for GPUs, which accelerate the execution of sparse networks to some extent; dedicated accelerators, in turn, show leading results in performance and power consumption.
The dataflow architecture is widely applied in big-data processing, scientific computing, and similar domains; because it decouples algorithm from structure, it offers good generality and flexibility. The natural parallelism of the dataflow architecture matches the parallel nature of neural network algorithms well.
As applications grow more complex, neural network models also become "large" and "deep", which poses many challenges for traditional hardware; to relieve the pressure on hardware resources, sparse networks offer clear advantages in computation, storage, and power consumption. However, CPUs, GPUs, and accelerators designed for dense networks cannot accelerate sparse networks effectively, while dedicated accelerators for sparse networks, whose algorithms are strongly coupled to their structure, lack architectural flexibility and generality and thus hinder algorithmic innovation.
Because the dataflow architecture decouples algorithm from structure, it offers good generality and flexibility. Under this architecture, the neural network algorithm is mapped onto a computing array (PE array) in the form of a dataflow graph; the graph comprises several nodes, each node contains several instructions, and the directed edges of the graph represent the dependencies between nodes.
In a sparse convolutional neural network, the main operations are multiply-accumulate operations. Pruning sets some weights in the network to 0, and since 0 multiplied by any number is 0, the dataflow graph produced by the compiler contains invalid instructions related to 0 values; when the sparse convolutional neural network executes, these invalid instructions and their data are loaded and executed. Executing invalid instructions and data not only occupies the accelerator's hardware resources, wasting them, but also increases the accelerator's power consumption and the execution time of the compute array (PE), degrading performance.
Disclosure of Invention
Aiming at the resource waste and performance degradation caused by loading and executing the invalid instructions and data of a sparse network under a dataflow architecture, the invention discloses a method and a device that detect invalid instructions in the sparse network by analyzing the characteristics of its data and instructions, skip their execution, and thereby accelerate the sparse network.
The operations of a convolutional neural network are mainly multiply-accumulate operations; for a sparse convolutional neural network, pruning sets some weights to 0, so instructions related to 0 values exist in the dataflow graph. Pruning compares the weights of the convolutional neural network against a set threshold: weights above the threshold retain their original values, and weights below the threshold are set to 0. Based on the redundancy of the weight data, the purpose of pruning is to set some weights to 0, turning the dense convolutional neural network into a sparse network and thereby compressing it. This operation occurs in the data pre-processing stage, i.e., before the convolutional neural network is executed.
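As an illustrative sketch only (not part of the claimed apparatus), the thresholding step of pruning can be expressed in a few lines; the function name, the threshold value, and the choice of keeping weights exactly at the threshold are assumptions for illustration:

```python
def prune(weights, threshold):
    """Keep weights whose magnitude is at or above the threshold;
    set all others to 0, turning a dense weight set into a sparse one."""
    return [w if abs(w) >= threshold else 0 for w in weights]

# Example: with an illustrative threshold of 0.5, small weights are zeroed.
print(prune([0.9, 0.05, -0.7, 0.1], 0.5))  # [0.9, 0, -0.7, 0]
```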
Taking convolution as an example, as shown in FIG. 1, to obtain one value of Ofmap, I0-I8 of Ifmap and W0-W8 of Filter require Load instructions (Inst1-Inst18) to load the data from memory into the PE's storage units, followed by Madd (multiply-add) instructions (Inst19-Inst27). Among these instructions, because W2, W3, W5, and W7 are all 0, Inst12, Inst13, Inst15, and Inst17 are invalid Load instructions, and the corresponding Madd instructions Inst21, Inst22, Inst24, and Inst26 are likewise invalid: their loading and execution have no influence on the final result, yet their execution occupies hardware resources and degrades performance. To remove the loading and computation of 0-value data, the instructions associated with 0 values need to be eliminated. The invention therefore provides a method and a device for detecting invalid instructions in a dataflow architecture and skipping their execution, saving computing resources and improving the performance of the sparse network.
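The instruction numbering of the FIG. 1 example can be checked with a short sketch; the index arithmetic below assumes, as in the figure, that Inst10-Inst18 load W0-W8 and Inst19-Inst27 are the corresponding Madd instructions:

```python
# Filter weights of the example: W2, W3, W5 and W7 are 0 after pruning.
filter_w = [1, 1, 0, 0, 2, 0, 1, 0, 4]

# Load instructions Inst10..Inst18 fetch W0..W8; Madd Inst19..Inst27 use them.
invalid_loads = [10 + i for i, w in enumerate(filter_w) if w == 0]
invalid_madds = [19 + i for i, w in enumerate(filter_w) if w == 0]

print(invalid_loads)  # [12, 13, 15, 17]
print(invalid_madds)  # [21, 22, 24, 26]
```

These match the invalid Load and Madd instructions named in the text.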
Aiming at the defects of the prior art, the invention provides a sparse convolution neural network acceleration method based on a data flow architecture, which comprises the following steps:
step 1, compiling the operation of a convolution layer and a full connection layer in a sparse convolution neural network into a data flow graph through a compiler, and generating instruction mark information for each instruction in the data flow graph according to the data characteristics of the data flow graph;
and 2, detecting the instruction through the instruction marking information, reserving the effective instruction in the data flow diagram, counting the distance between the two effective instructions, and directly skipping the execution of the invalid instruction by the sparse convolutional neural network according to the distance when the data flow diagram is executed until the processing result of the data flow diagram is obtained.
The sparse convolution neural network acceleration method based on the data flow architecture comprises the steps that the data flow graph comprises a plurality of nodes, the nodes comprise a plurality of instructions, and the directed edges of the data flow graph represent the dependency relationship of the nodes.
The sparse convolutional neural network acceleration method based on the data flow architecture is characterized in that the instruction marking information respectively represents the validity and the invalidity of an instruction by using 1 and 0.
The sparse convolutional neural network acceleration method based on the data stream architecture, wherein the step 2 specifically detects the instruction through an invalid instruction detection device, and the invalid instruction detection device comprises:
the instruction marking information module is used for caching instruction marking information;
the PC counter register is used for recording the interval between two effective instructions so as to directly skip the execution of the ineffective instruction;
the reference PC register is used for storing the PC value of the first effective instruction as a reference PC value, and when the fact that a certain subsequent instruction is effective is detected, the reference PC value is added with the interval value stored by the PC counter to obtain the PC value of the next effective instruction for the execution unit to execute;
the instruction cache module: for storing valid instructions to be executed by the execution unit.
The sparse convolutional neural network acceleration method based on the dataflow architecture, wherein the process of generating the instruction marking information in step 1 specifically comprises: marking the instructions related to weights of 0 in the convolutional layers and fully connected layers as invalid instructions, and marking the instructions related to non-0 values as valid instructions.
The invention also provides a sparse convolution neural network acceleration system based on the data flow architecture, which comprises the following steps:
the method comprises the following steps that a module 1 compiles the operation of a convolution layer and a full connection layer in a sparse convolution neural network into a data flow graph through a compiler, and generates instruction mark information for each instruction in the data flow graph according to the data characteristics of the data flow graph;
and the module 2 detects the instruction through the instruction marking information, reserves the effective instruction in the data flow diagram, counts the distance between the two effective instructions, and directly skips the execution of the invalid instruction according to the distance when executing the data flow diagram until obtaining the processing result of the data flow diagram.
The sparse convolution neural network acceleration system based on the data flow architecture comprises a data flow graph and a data flow graph, wherein the data flow graph comprises a plurality of nodes, the nodes comprise a plurality of instructions, and the directed edges of the data flow graph represent the dependency relationship of the nodes.
The sparse convolutional neural network acceleration system based on the data flow architecture is characterized in that the instruction marking information respectively represents the validity and the invalidity of an instruction by using 1 and 0.
The sparse convolution neural network acceleration system based on the data flow architecture is characterized in that the module 2 specifically detects an instruction through an invalid instruction detection device, and the invalid instruction detection device comprises:
the instruction marking information module is used for caching instruction marking information;
the PC counter register is used for recording the interval between two effective instructions so as to directly skip the execution of the ineffective instruction;
the reference PC register is used for storing the PC value of the first effective instruction as a reference PC value, and when the fact that a certain subsequent instruction is effective is detected, the reference PC value is added with the interval value stored by the PC counter to obtain the PC value of the next effective instruction for the execution unit to execute;
the instruction cache module: for storing valid instructions to be executed by the execution unit.
The sparse convolutional neural network acceleration system based on the dataflow architecture, wherein the process of generating the instruction marking information in module 1 specifically comprises: marking the instructions related to weights of 0 in the convolutional layers and fully connected layers as invalid instructions, and marking the instructions related to non-0 values as valid instructions.
According to the scheme, the invention has the advantages that:
the device reduces the loading and execution of invalid instructions and data in the sparse network, and improves the execution performance of the sparse network.
For sparse convolutional neural network applications, which comprise convolutional layers and fully connected layers, a set of invalid-instruction detection devices is designed in hardware to accelerate the sparse network. Instruction marking information is generated for the compiler-produced instructions according to the data characteristics; the invalid-instruction detection device then screens the instructions according to this marking information, counts the distance between two valid instructions, and directly skips the execution of invalid instructions, avoiding the loading and computation of 0-value data and thereby accelerating the sparse convolutional neural network.
Drawings
FIG. 1 is a schematic diagram illustrating a flow chart of a convolution operation instruction;
FIG. 2 is a diagram of an invalid instruction detection apparatus;
FIG. 3 is a flow diagram of invalid instruction detection;
FIG. 4 illustrates generation of instruction tag information;
FIG. 5 is a diagram of instruction screening results.
Detailed Description
The invention comprises the following key points:
the key point 1 generates marking information of the instruction;
the key point 2 is an invalid instruction detection unit which detects invalid instructions and skips execution;
in order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
The invention comprises the following steps:
i: generation of instruction marking information
After the compiler compiles the operations of the convolutional layers and fully connected layers into a dataflow graph, instruction marking information is generated according to the data characteristics to mark whether each instruction is valid; the marking information uses 1 and 0 to denote valid and invalid instructions, respectively. The operations of the convolutional and fully connected layers of a convolutional neural network are multiply-accumulate operations, which the compiler compiles into the dataflow graph. In these layers some weights are 0 and some are not; based on this data characteristic, each corresponding instruction is marked as valid or invalid: the marking information of an instruction related to a 0 value is 0, and that of an instruction related to a non-0 value is 1.
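A minimal sketch of this marking rule (the function name is assumed for illustration):

```python
def make_flags(weights):
    # 1 marks the instruction of a non-zero weight as valid, 0 as invalid.
    return [1 if w != 0 else 0 for w in weights]

# The Filter values of the later example, 1,1,0,0,2,0,1,0,4, yield:
print(make_flags([1, 1, 0, 0, 2, 0, 1, 0, 4]))  # [1, 1, 0, 0, 1, 0, 1, 0, 1]
```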
II: invalid command detection device
As shown in FIG. 2, the invalid-instruction detection apparatus comprises an instruction Flag information buffer module (Flag buffer), a PC counter register (PC counter), a reference PC register (PC Base), and an instruction buffer module (instruction buffer). Here the PC is the program counter, which stores the address of the next instruction, and the PC counter records the distance between the valid instruction pointed to by the PC and the next valid instruction. The purpose of and reason for each module's design are:
and the instruction marking information caching module is used for generating marking information of each instruction according to the data characteristics of the instruction.
PC counter register: for recording the interval between two valid instructions to directly skip execution of an invalid instruction.
Reference PC register: and the PC value is used for storing the PC value of the first effective instruction, and when the following certain instruction is detected to be effective, the PC value of the next effective instruction is directly obtained by adding the reference PC value and the value of the PC counter for the execution of the execution unit.
The instruction cache module: for storing instructions to be executed by the execution unit.
FIG. 3 is a flowchart of the process by which the instruction detection unit screens instructions. It obtains the interval between two valid instructions by computing the distance between two 1s in the instruction marking information (Flag), and adds this interval to the current PC to obtain the PC value of the next valid instruction, so that the PC jumps directly to the position of that valid instruction, skipping the invalid ones. In the flowchart, the parameter i records the interval between the instruction being examined and the instruction pointed to by the current PC, and Flag_id is the index of the marking information corresponding to that instruction. The instruction detection unit examines the marking information instruction by instruction until it either detects an instruction whose marking information is 1 or finishes examining all marking information, at which point the flow ends.
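A software sketch of this screening loop, assuming flags indexed from 0 and a PC expressed as an index into the same list (the names i and flag_id follow the flowchart; the function name is an assumption):

```python
def skip_to_next_valid(flags, pc):
    """From the instruction after `pc`, advance i (the interval) and flag_id
    together until a flag of 1 is found; the new PC is then pc + i.
    Returns None when every remaining flag is 0."""
    i, flag_id = 1, pc + 1
    while flag_id < len(flags):
        if flags[flag_id] == 1:
            return pc + i
        i += 1
        flag_id += 1
    return None

print(skip_to_next_valid([1, 1, 0, 0, 1], 1))  # 4
```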
Based on the design, the invention has the advantages that: the execution of invalid instructions is reduced, thereby reducing the number of times instructions are executed, as well as the access and execution of 0 value data. The computing resources are saved, and the performance of the sparse network is improved.
Specifically, the generation of instruction marking information is implemented in the compilation stage, and an invalid-instruction detection device is added in each PE to skip the execution of invalid instructions; this is described in further detail below in conjunction with the execution of a convolution.
(1) Generation of instruction marking information
FIG. 4 shows the convolution of a 5 × 5 Ifmap with a 3 × 3 Filter to produce one Ofmap result. In this convolution, the instruction flag information of the Ifmap values participating in the operation is all 1. The Filter values participating in the operation are 1, 1, 0, 0, 2, 0, 1, 0, 4, so the flag information (Filter_flag) of the corresponding Load instructions is 1, 1, 0, 0, 1, 0, 1, 0, 1. For the multiply-add operations, the instruction flag information is the AND of the flag information at the corresponding positions of Ifmap and Filter; since the Ifmap flags are all 1, the flag information of the multiply-add instructions is the same as that of the Filter, namely 1, 1, 0, 0, 1, 0, 1, 0, 1.
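The AND combination described above, sketched under the same assumed flag encoding:

```python
ifmap_flag  = [1] * 9                      # every Ifmap value participates
filter_flag = [1, 1, 0, 0, 1, 0, 1, 0, 1]  # from Filter values 1,1,0,0,2,0,1,0,4
madd_flag   = [a & b for a, b in zip(ifmap_flag, filter_flag)]
print(madd_flag)  # [1, 1, 0, 0, 1, 0, 1, 0, 1] -- identical to filter_flag
```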
(2) Detection of invalid instructions
FIG. 5 shows the valid instructions finally executed after the instructions of FIG. 4 have been screened by the instruction detection unit; Flag is the marking information of the corresponding instruction and corresponds one-to-one to the instructions in FIG. 4.
Step 1: initially, the Flag of Inst1 is 1, so the PC points to Inst1 and the instruction is executed;
Step 2: after execution, the PC is automatically incremented by 1 to point to Inst2; since the Flag of Inst2 is also 1, the PC stays on Inst2 and it is executed;
Step 3: execution proceeds in this way until Inst11 has finished;
Step 4: the PC is automatically incremented by 1 to point to Inst12; because the Flag of Inst12 is 0, both i and Flag_id in the instruction detection unit are incremented by 1 to check the validity of Inst13;
Step 5: the Flag of Inst13 is still 0, so i and Flag_id are incremented by 1 again;
Step 6: the Flag of Inst14 is 1, so the PC is updated to PC + 2, the detection ends, and the PC jumps to Inst14, which is executed;
Step 7: the instruction detection unit continues in this way until all instructions in the graph have been examined;
Step 8: the valid instructions finally executed by the PE are Inst1-Inst11, Inst14, Inst16, Inst18-Inst20, Inst23, Inst25, and Inst27.
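The eight steps above reduce, in effect, to keeping exactly the instructions whose flag is 1; a sketch reproducing the final list (the instruction numbering of FIG. 4 is assumed):

```python
filter_flag = [1, 1, 0, 0, 1, 0, 1, 0, 1]
# Inst1-Inst9 load the Ifmap (all valid); Inst10-Inst18 load the Filter;
# Inst19-Inst27 are the Madd instructions, sharing the Filter flags.
flags = [1] * 9 + filter_flag + filter_flag
valid = [i + 1 for i, f in enumerate(flags) if f == 1]
print(valid)
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16, 18, 19, 20, 23, 25, 27]
```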
The following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a sparse convolution neural network acceleration system based on the data flow architecture, which comprises the following steps:
the method comprises the following steps that a module 1 compiles the operation of a convolution layer and a full connection layer in a sparse convolution neural network into a data flow graph through a compiler, and generates instruction mark information for each instruction in the data flow graph according to the data characteristics of the data flow graph;
and the module 2 detects the instruction through the instruction marking information, reserves the effective instruction in the data flow diagram, counts the distance between the two effective instructions, and directly skips the execution of the invalid instruction according to the distance when executing the data flow diagram until obtaining the processing result of the data flow diagram.
The sparse convolution neural network acceleration system based on the data flow architecture comprises a data flow graph and a data flow graph, wherein the data flow graph comprises a plurality of nodes, the nodes comprise a plurality of instructions, and the directed edges of the data flow graph represent the dependency relationship of the nodes.
The sparse convolutional neural network acceleration system based on the data flow architecture is characterized in that the instruction marking information respectively represents the validity and the invalidity of an instruction by using 1 and 0.
The sparse convolution neural network acceleration system based on the data flow architecture is characterized in that the module 2 specifically detects an instruction through an invalid instruction detection device, and the invalid instruction detection device comprises:
the instruction marking information module is used for caching instruction marking information;
the PC counter register is used for recording the interval between two effective instructions so as to directly skip the execution of the ineffective instruction;
the reference PC register is used for storing the PC value of the first effective instruction as a reference PC value, and when the fact that a certain subsequent instruction is effective is detected, the reference PC value is added with the interval value stored by the PC counter to obtain the PC value of the next effective instruction for the execution unit to execute;
the instruction cache module: for storing valid instructions to be executed by the execution unit.
The sparse convolutional neural network acceleration system based on the data stream architecture, wherein the process of generating the instruction marking information in the module 1 specifically includes: and marking the instruction related to the weight value of 0 in the convolutional layer and the full-link layer as an invalid instruction, and marking the instruction related to the non-0 value as a valid instruction.

Claims (10)

1. A sparse convolution neural network acceleration method based on a data flow architecture is characterized by comprising the following steps:
step 1, compiling the operation of a convolution layer and a full connection layer in a sparse convolution neural network into a data flow graph through a compiler, and generating instruction mark information for each instruction in the data flow graph according to the data characteristics of the data flow graph;
and 2, detecting the instruction through the instruction marking information, reserving the effective instruction in the data flow diagram, counting the distance between the two effective instructions, and directly skipping the execution of the invalid instruction by the sparse convolutional neural network according to the distance when the data flow diagram is executed until the processing result of the data flow diagram is obtained.
2. The sparse convolutional neural network acceleration method based on a dataflow architecture of claim 1, wherein the dataflow graph includes a plurality of nodes, the nodes include a plurality of instructions, and the directed edges of the dataflow graph represent dependency relationships of the nodes.
3. The data-flow-architecture-based sparse convolutional neural network acceleration method of claim 1, wherein the instruction tag information uses 1 and 0 to represent validity and invalidity of the instruction, respectively.
4. The method as claimed in claim 1, wherein in step 2 the instructions are detected by a null-instruction detection apparatus, and the null-instruction detection apparatus comprises:
an instruction tag information module, configured to buffer the instruction tag information;
a PC counter register, configured to record the interval between two valid instructions, so that the execution of invalid instructions can be skipped directly;
a reference PC register, configured to store the PC value of the first valid instruction as the reference PC value; when a subsequent instruction is detected to be valid, the reference PC value is added to the interval value stored in the PC counter to obtain the PC value of the next valid instruction for the execution unit to execute;
an instruction cache module, configured to store the valid instructions to be executed by the execution unit.
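As a rough software model of the apparatus in this claim (the claim describes hardware registers; the class and field names below are ours, chosen for illustration), the interplay of the reference PC register, PC counter, and instruction cache can be simulated as a single scan over the tag bits:

```python
# Software model (assumption, for illustration only) of the null-instruction
# detection apparatus: tag bits are scanned once; the reference PC register
# holds the PC of the last valid instruction, the PC counter counts the
# invalid slots in between, and the instruction cache collects the PCs that
# the execution unit should actually run.

class NullInstructionDetector:
    def __init__(self, tag_bits):
        self.tag_bits = tag_bits        # instruction tag information buffer
        self.reference_pc = None        # reference PC register
        self.pc_counter = 0             # PC counter register (interval)
        self.instruction_cache = []     # valid PCs for the execution unit

    def scan(self):
        for pc, tag in enumerate(self.tag_bits):
            if tag == 0:
                self.pc_counter += 1    # invalid instruction: just count it
                continue
            if self.reference_pc is None:
                self.reference_pc = pc  # first valid instruction seen
            else:
                # reference PC + counted interval = PC of next valid instruction
                self.reference_pc += self.pc_counter + 1
            self.pc_counter = 0
            self.instruction_cache.append(self.reference_pc)
        return self.instruction_cache

NullInstructionDetector([1, 0, 0, 1, 0, 1]).scan()  # -> [0, 3, 5]
```

The execution unit then fetches only the PCs in the cache, so the runs of invalid instructions never occupy execution slots.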
5. The method as claimed in claim 1, wherein generating the instruction tag information in step 1 specifically includes: marking the instructions associated with weight values of 0 in the convolutional layers and fully connected layers as invalid instructions, and marking the instructions associated with non-zero weight values as valid instructions.
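Why this tagging is safe can be checked with a toy dot product (a stand-in we chose for one convolution output; not an example from the patent): terms with zero weights contribute nothing to the sum, so skipping their instructions leaves the result unchanged.

```python
# Illustrative check that zero-weight terms can be skipped without changing
# the result (a 1-D dot product stands in for one convolution output).

def dense_dot(weights, activations):
    """Full computation, including the zero-weight multiplies."""
    return sum(w * a for w, a in zip(weights, activations))

def sparse_dot(weights, activations):
    """Only the multiplies that a valid instruction would perform."""
    return sum(w * a for w, a in zip(weights, activations) if w != 0)

w = [0.5, 0.0, 0.0, -1.2, 0.0, 0.3]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
assert dense_dot(w, x) == sparse_dot(w, x)
```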
6. A sparse convolutional neural network acceleration system based on a dataflow architecture, comprising:
a module 1, configured to compile the operations of the convolutional layers and fully connected layers in a sparse convolutional neural network into a dataflow graph through a compiler, and to generate instruction tag information for each instruction in the dataflow graph according to the data characteristics of the dataflow graph;
a module 2, configured to detect each instruction through its instruction tag information, retain the valid instructions in the dataflow graph, and count the distance between two consecutive valid instructions; when the dataflow graph is executed, the execution of invalid instructions is skipped directly according to that distance, until the processing result of the dataflow graph is obtained.
7. The sparse convolutional neural network acceleration system based on a dataflow architecture of claim 6, wherein the dataflow graph includes a plurality of nodes, each node includes a plurality of instructions, and the directed edges of the dataflow graph represent the dependency relationships between the nodes.
8. The dataflow-architecture-based sparse convolutional neural network acceleration system of claim 6, wherein the instruction tag information uses 1 and 0 to mark an instruction as valid or invalid, respectively.
9. The sparse convolutional neural network acceleration system based on a dataflow architecture of claim 6, wherein the module 2 specifically detects the instructions through a null-instruction detection apparatus, and the null-instruction detection apparatus comprises:
an instruction tag information module, configured to buffer the instruction tag information;
a PC counter register, configured to record the interval between two valid instructions, so that the execution of invalid instructions can be skipped directly;
a reference PC register, configured to store the PC value of the first valid instruction as the reference PC value; when a subsequent instruction is detected to be valid, the reference PC value is added to the interval value stored in the PC counter to obtain the PC value of the next valid instruction for the execution unit to execute;
an instruction cache module, configured to store the valid instructions to be executed by the execution unit.
10. The sparse convolutional neural network acceleration system based on a dataflow architecture as claimed in claim 6, wherein generating the instruction tag information in the module 1 specifically includes: marking the instructions associated with weight values of 0 in the convolutional layers and fully connected layers as invalid instructions, and marking the instructions associated with non-zero weight values as valid instructions.
CN202010685107.XA 2020-07-16 2020-07-16 Sparse convolutional neural network acceleration method and system based on data flow architecture Active CN112015472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010685107.XA CN112015472B (en) 2020-07-16 2020-07-16 Sparse convolutional neural network acceleration method and system based on data flow architecture

Publications (2)

Publication Number Publication Date
CN112015472A true CN112015472A (en) 2020-12-01
CN112015472B CN112015472B (en) 2023-12-12

Family

ID=73499705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010685107.XA Active CN112015472B (en) 2020-07-16 2020-07-16 Sparse convolutional neural network acceleration method and system based on data flow architecture

Country Status (1)

Country Link
CN (1) CN112015472B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180210862A1 (en) * 2017-01-22 2018-07-26 Gsi Technology Inc. Sparse matrix multiplication in associative memory device
CN109472350A (en) * 2018-10-30 2019-03-15 南京大学 A kind of neural network acceleration system based on block circulation sparse matrix
CN110991631A (en) * 2019-11-28 2020-04-10 福州大学 Neural network acceleration system based on FPGA
CN111062472A (en) * 2019-12-11 2020-04-24 浙江大学 Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN111368988A (en) * 2020-02-28 2020-07-03 北京航空航天大学 Deep learning training hardware accelerator utilizing sparsity

Also Published As

Publication number Publication date
CN112015472B (en) 2023-12-12

Similar Documents

Publication Publication Date Title
US8302078B2 (en) Lazy evaluation of geometric definitions of objects within procedural programming environments
US20120030652A1 (en) Mechanism for Describing Values of Optimized Away Parameters in a Compiler-Generated Debug Output
US7539851B2 (en) Using register readiness to facilitate value prediction
CN101611380A (en) Speculative throughput calculates
WO2020073641A1 (en) Data structure-oriented data prefetching method and device for graphics processing unit
CN112015473B (en) Sparse convolutional neural network acceleration method and system based on data flow architecture
US10318261B2 (en) Execution of complex recursive algorithms
CN115809063B (en) Storage process compiling method, system, electronic equipment and storage medium
US20050071834A1 (en) Generating executable code based on code performance data
CN115936248A (en) Attention network-based power load prediction method, device and system
US10248814B2 (en) Memory integrity monitoring
CN112015472B (en) Sparse convolutional neural network acceleration method and system based on data flow architecture
CN108021563B (en) Method and device for detecting data dependence between instructions
CN111124694B (en) Deadlock detection and solution method for reachability graph based on petri network
CN112183744A (en) Neural network pruning method and device
US8381195B2 (en) Implementing parallel loops with serial semantics
Liu et al. A theoretical framework for value prediction in parallel systems
CN112215349B (en) Sparse convolutional neural network acceleration method and device based on data flow architecture
CN115130672A (en) Method and device for calculating convolution neural network by software and hardware collaborative optimization
CN110969259B (en) Processing core with data-dependent adaptive rounding
Li et al. Efficient microarchitectural vulnerabilities prediction using boosted regression trees and patient rule inductions
CN117561502A (en) Method and device for determining failure reason
CN111190644A (en) Embedded Flash on-chip read instruction hardware acceleration method and device
D’Alberto et al. Static analysis of parameterized loop nests for energy efficient use of data caches
CN113705800A (en) Processing unit, related device and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant