CN112015472B - Sparse convolutional neural network acceleration method and system based on data flow architecture - Google Patents
- Publication number: CN112015472B (application CN202010685107.XA)
- Authority: CN (China)
- Prior art keywords: instruction, data flow, instructions, value, neural network
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F9/30145 — Instruction analysis, e.g. decoding, instruction word fields
- G06F9/30141 — Implementation provisions of register files, e.g. ports
- G06F9/321 — Program or instruction counter, e.g. incrementing
- G06N3/045 — Combinations of networks
- G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G06N3/10 — Interfaces, programming languages or software development kits, e.g. for simulating neural networks
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a method for detecting invalid instructions and skipping their execution in a data flow architecture, suitable for accelerating a sparse convolutional neural network under such an architecture. It targets the convolutional and fully connected layers of a sparse neural network. Instruction tag information is generated from the data characteristics for the instructions compiled by the compiler; an instruction detection unit then checks each instruction against its tag information and skips the execution of invalid instructions, thereby accelerating the sparse convolutional neural network.
Description
Technical Field
The invention relates to the technical field of computer architecture, and in particular to a sparse convolutional neural network acceleration method and system based on a data flow architecture.
Background
Neural networks deliver advanced performance in image detection, speech recognition and natural language processing. As applications grow more complex, neural network models grow with them, which poses many challenges to traditional hardware; to relieve the pressure on hardware resources, sparse networks offer clear advantages in computation, storage and power consumption. Many algorithms and accelerators for sparse networks have appeared, such as the Sparse BLAS library for CPUs and the cuSPARSE library for GPUs. These accelerate the execution of sparse networks to some extent, and dedicated accelerators show advanced results in performance, power consumption and related metrics.
The data flow architecture is widely used in big data processing, scientific computing and similar domains, and the decoupling of its algorithms from its structure gives it good generality and flexibility. The natural parallelism of the data flow architecture matches the parallelism of neural network algorithms well.
As applications grow more complex, neural network models also become 'large' and 'deep', which poses many challenges to traditional hardware; to relieve the pressure on hardware resources, sparse networks offer clear advantages in computation, storage and power consumption. However, CPUs, GPUs and accelerators designed for dense networks cannot accelerate sparse networks well, while dedicated accelerators for sparse networks, owing to the tight coupling of algorithm and structure, lack architectural flexibility and generality and cannot keep pace with algorithmic innovation.
Because the data flow architecture decouples algorithm from structure, it enjoys good generality and flexibility. Under a data flow architecture, the neural network algorithm is mapped onto an architecture composed of a computing array (PE array) in the form of a data flow graph; the graph comprises several nodes, each node contains several instructions, and the directed edges of the graph represent the dependencies between the nodes.
In a sparse convolutional neural network the dominant operation is the multiply-add. Pruning sets some weights in the network to 0, and because 0 multiplied by any number is 0, the data flow graph produced by the compiler contains invalid instructions related to these 0 values; when the sparse convolutional neural network executes, those invalid instructions and their data are still loaded and executed. This not only occupies and wastes the accelerator's hardware resources, but also increases its power consumption, prolongs the execution time of the computing array (PE) and degrades performance.
Disclosure of Invention
Aiming at the resource waste and performance loss caused by loading and executing invalid instructions and data of a sparse network in a data flow architecture, the invention analyzes the data and instruction characteristics of sparse networks and provides a method and a device for detecting invalid instructions, so that their execution is skipped and the sparse network is accelerated.
Convolutional neural network computation is dominated by multiply-add operations, and for a sparse convolutional neural network the pruning operation resets some weights to 0, so 0-value-related instructions appear in the data flow graph. Pruning compares each weight of the convolutional neural network against a set threshold: weights above the threshold keep their original values, and weights below it are set to 0. Its purpose is to exploit the redundancy of the weight data to turn a dense convolutional neural network into a sparse network by zeroing some weights, thereby compressing the network. This operation occurs in the data preprocessing phase, i.e. before the convolutional neural network is executed.
Taking convolution as an example, as shown in FIG. 1, to obtain one value of Ofmap the convolution requires Load instructions (Inst1-Inst18) that bring I0-I8 of Ifmap and W0-W8 of Filter from memory into the storage of the PE, followed by Madd (multiply-add) instructions (Inst19-Inst27). Among these instructions, because the values of W2, W3, W5 and W7 are all 0, Inst12, Inst13, Inst15 and Inst17 are invalid Load instructions, and the corresponding Madd instructions Inst21, Inst22, Inst24 and Inst26 are likewise invalid: their loading and execution have no effect on the final result, yet they occupy hardware resources and degrade performance. To remove the loading and computation of 0-value data, the instructions related to 0 values must be eliminated. The invention therefore provides a method and a device for detecting invalid instructions in a data flow and skipping their execution, so as to save computing resources and improve the performance of the sparse network.
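The bookkeeping for this example can be sketched in a few lines of illustrative Python (a hypothetical software model, not the patent's hardware; the non-zero weight values are made up, only the zeros at W2, W3, W5, W7 come from the example):

```python
# Illustrative sketch: zero weights make the corresponding Load and Madd
# instructions invalid. W2, W3, W5, W7 are pruned to 0 as in FIG. 1.
weights = [3, 1, 0, 0, 2, 0, 1, 0, 4]          # W0..W8 after pruning (values assumed)

# Tag information: 1 = valid instruction, 0 = invalid instruction.
filter_flag = [0 if w == 0 else 1 for w in weights]

# Load instructions Inst10..Inst18 carry the Filter weights, and the
# Madd instructions Inst19..Inst27 reuse the same pattern, so each
# zero weight invalidates one Load and one Madd.
invalid_loads = [10 + i for i, f in enumerate(filter_flag) if f == 0]
invalid_madds = [19 + i for i, f in enumerate(filter_flag) if f == 0]

print(invalid_loads)   # Load instructions to skip
print(invalid_madds)   # Madd instructions to skip
```

With these weights the sketch reproduces the instruction numbers named in the text: the invalid Loads are Inst12, Inst13, Inst15, Inst17 and the invalid Madds are Inst21, Inst22, Inst24, Inst26.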
Aiming at the defects of the prior art, the invention provides a sparse convolutional neural network acceleration method based on a data flow architecture, comprising the following steps:
step 1, compiling the operations of the convolutional and fully connected layers of the sparse convolutional neural network into a data flow graph through a compiler, and generating instruction tag information for each instruction in the data flow graph according to the data characteristics of the data flow graph;
and step 2, checking each instruction against its tag information, retaining the valid instructions in the data flow graph, and recording the distance between consecutive valid instructions, so that when the data flow graph is executed the sparse convolutional neural network skips the execution of invalid instructions directly according to this distance, until the processing result of the data flow graph is obtained.
According to the sparse convolutional neural network acceleration method based on the data flow architecture, the data flow graph comprises a plurality of nodes, the nodes comprise a plurality of instructions, and the directed edges of the data flow graph represent the dependency relationship of the nodes.
According to the sparse convolutional neural network acceleration method based on the data flow architecture, the instruction tag information uses 1 and 0 to denote valid and invalid instructions, respectively.
The sparse convolutional neural network acceleration method based on the data flow architecture, wherein step 2 specifically detects the instructions through an invalid instruction detection device, the invalid instruction detection device comprising:
an instruction tag information module for caching the instruction tag information;
a PC counter register for recording the interval between two valid instructions so that the execution of invalid instructions can be skipped directly;
a reference PC register for storing the PC value of the first valid instruction as the reference PC value; when a subsequent instruction is detected to be valid, the reference PC value is added to the interval value held by the PC counter to obtain the PC value of the next valid instruction for the execution unit;
an instruction cache module for storing the valid instructions to be executed by the execution unit.
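A minimal software model of these components might look as follows (an illustrative sketch with assumed names and interfaces; the patent describes hardware registers, not Python objects):

```python
class InvalidInstructionDetector:
    """Illustrative model of the invalid instruction detection device."""

    def __init__(self, flags, instructions):
        self.flag_buffer = list(flags)          # instruction tag information cache
        self.instr_buffer = list(instructions)  # instructions for the execution unit
        self.pc_base = None                     # reference PC register

    def run(self):
        """Return only the valid instructions, skipping invalid ones by distance."""
        executed = []
        pc = 0
        while pc < len(self.flag_buffer):
            counter = 0                         # PC counter: interval to next valid tag
            while (pc + counter < len(self.flag_buffer)
                   and self.flag_buffer[pc + counter] == 0):
                counter += 1
            if pc + counter >= len(self.flag_buffer):
                break                           # all remaining instructions invalid
            pc = pc + counter                   # PC + interval -> next valid PC
            if self.pc_base is None:
                self.pc_base = pc               # remember the first valid instruction
            executed.append(self.instr_buffer[pc])
            pc += 1
        return executed
```

For example, with tags `[1, 1, 0, 0, 1]` the model executes the first, second and fifth instructions and records the first valid PC as the reference.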
The sparse convolutional neural network acceleration method based on the data flow architecture, wherein the process of generating the instruction tag information in step 1 specifically comprises: marking the instructions associated with 0-value weights in the convolution layer and the full connection layer as invalid instructions, and marking the instructions associated with non-0 values as valid instructions.
The invention also provides a sparse convolutional neural network acceleration system based on the data flow architecture, which comprises:
the method comprises the steps that a module 1 compiles operations of a convolution layer and a full connection layer in a sparse convolution neural network into a data flow graph through a compiler, and instruction marking information is generated for each instruction in the data flow graph according to data characteristics of the data flow graph;
and 2, detecting the instruction through the instruction marking information, reserving effective instructions in the data flow diagram, counting the distance between the two effective instructions, and directly skipping the execution of the ineffective instructions by the sparse convolutional neural network according to the distance when the data flow diagram is executed until a processing result of the data flow diagram is obtained.
The sparse convolutional neural network acceleration system based on the data flow architecture comprises a plurality of nodes, wherein the nodes comprise a plurality of instructions, and directed edges of the data flow graph represent the dependency relationship of the nodes.
The sparse convolutional neural network acceleration system based on the data flow architecture, wherein the instruction marking information indicates the validity and invalidity of an instruction by using 1 and 0 respectively.
The sparse convolutional neural network acceleration system based on the data flow architecture, wherein the module 2 specifically detects the instructions through an invalid instruction detection device, the invalid instruction detection device comprising:
an instruction tag information module for caching the instruction tag information;
a PC counter register for recording the interval between two valid instructions so that the execution of invalid instructions can be skipped directly;
a reference PC register for storing the PC value of the first valid instruction as the reference PC value; when a subsequent instruction is detected to be valid, the reference PC value is added to the interval value held by the PC counter to obtain the PC value of the next valid instruction for the execution unit;
an instruction cache module for storing the valid instructions to be executed by the execution unit.
The sparse convolutional neural network acceleration system based on the data flow architecture, wherein the process of generating the instruction tag information in the module 1 specifically comprises: marking the instructions associated with 0-value weights in the convolution layer and the full connection layer as invalid instructions, and marking the instructions associated with non-0 values as valid instructions.
The advantages of the invention are as follows:
the device reduces the loading and execution of invalid instructions and data in the sparse network, and improves the execution performance of the sparse network.
For sparse convolutional neural network applications, covering both the convolutional and the fully connected layers, the invention designs an invalid instruction detection device in hardware to accelerate the sparse network. Instruction tag information is generated from the data characteristics of the instructions produced by the compiler; the invalid instruction detection device then checks the instructions against this tag information, computes the distance between consecutive valid instructions, and skips the execution of invalid instructions directly, avoiding the loading and computation of 0-value data and thereby accelerating the sparse convolutional neural network.
Drawings
FIG. 1 is a schematic diagram of a convolution operation instruction execution flow;
FIG. 2 is a diagram of an invalid instruction detecting device;
FIG. 3 is a flow chart of invalid instruction detection;
FIG. 4 is a diagram illustrating the generation of instruction tag information;
fig. 5 is a diagram of instruction screening results.
Detailed Description
The invention comprises the following key points:
key point 1, generation of instruction tag information;
key point 2, the invalid instruction detection unit detects invalid instructions and skips their execution;
in order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.
The invention comprises the following steps:
i: generation of instruction tag information
After the compiler compiles the operations of the convolutional and fully connected layers into a data flow graph, instruction tag information is generated according to the data characteristics to mark whether each instruction is valid; the tag information uses 1 and 0 to denote valid and invalid instructions, respectively. The operations of the convolutional and fully connected layers of a convolutional neural network are multiply-add operations, which the compiler compiles into a data flow graph. In these layers the data characteristic of a weight is whether it is 0 or non-0, and the tag of the corresponding instruction follows that characteristic: the tag of an instruction associated with a 0 value is 0, and the tag of an instruction associated with a non-0 value is 1.
II: invalid instruction detection device
As shown in fig. 2, the invalid instruction detection device comprises an instruction tag information buffer (Flag buffer), a PC counter register (PC counter), a reference PC register (PC Base) and an instruction buffer module (Instruction buffer). The PC is the program counter, which holds the address of the next instruction, and the PC counter records the distance between the valid instruction pointed to by the PC and the next valid instruction. The purpose of each module is as follows:
Instruction tag information buffer: caches the tag information generated for each instruction from its data characteristics.
PC counter register: records the interval between two valid instructions so that the execution of invalid instructions can be skipped directly.
Reference PC register: stores the PC value of the first valid instruction; when a subsequent instruction is detected to be valid, the reference PC value is added to the value of the PC counter to obtain the PC value of the next valid instruction directly for the execution unit.
Instruction cache module: stores the instructions to be executed by the execution unit.
Fig. 3 is a flowchart of the instruction detection unit's screening process. It obtains the interval between two valid instructions by computing the distance between two 1s in the instruction tag information (Flag), and adds that interval to the current PC to obtain the PC value of the next valid instruction, so that the PC jumps directly to the position of the valid instruction, skipping the invalid ones. In the flowchart, the parameter i records the interval between the instruction being examined and the instruction pointed to by the current PC, and flag_id is the index of the tag information corresponding to that instruction. The instruction detection unit examines the tag bits one by one until it finds a tag equal to 1 or all tags have been examined, at which point the flow ends.
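The inner loop of this flowchart can be sketched as follows (a software analogue for illustration; `i` and `flag_id` follow the names used in FIG. 3, everything else is assumed):

```python
def next_valid_pc(flag, pc):
    """Return the PC of the next valid instruction after `pc`, or None.

    Mirrors the FIG. 3 loop: i counts the interval from the current PC,
    flag_id indexes the tag information bit being examined.
    """
    i = 1                       # interval between current PC and candidate
    flag_id = pc + 1            # index of the tag being examined
    while flag_id < len(flag):
        if flag[flag_id] == 1:  # found a valid instruction
            return pc + i       # current PC + interval
        i += 1
        flag_id += 1
    return None                 # all tags examined, no valid instruction left
```

With the tag pattern `[1, 1, 0, 0, 1, 0, 1]`, a PC at position 1 jumps to position 4, and a PC at position 4 jumps to position 6, skipping the 0-tagged instructions in between.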
Based on this design, the advantages of the invention are: the execution of invalid instructions is reduced, which lowers the number of instructions executed and eliminates the accessing and execution of 0-value data, saving computing resources and improving the performance of the sparse network.
Specifically, the invention generates the instruction tag information at the compilation stage and adds an invalid instruction detection device to each PE to skip the execution of invalid instructions. The execution process is described in further detail below in combination with the execution of a convolution.
(1) Generation of instruction tag information
Fig. 4 shows the convolution of a 5×5 Ifmap with a 3×3 Filter to produce an Ofmap result. In this convolution, the instruction tag information of the Ifmap values participating in the operation is all 1. The Filter values participating in the operation are 1, 1, 0, 0, 2, 0, 1, 0, 4, so the tag information (filter_flag) of the corresponding Load instructions is 1, 1, 0, 0, 1, 0, 1, 0, 1. For the multiply-add operations, the tag of each Madd instruction combines the tags of the Ifmap and the Filter at the corresponding position; since the Ifmap tags are all 1, the Madd tags are the same as the Filter tags, namely 1, 1, 0, 0, 1, 0, 1, 0, 1.
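Under the reading that the Madd tag combines the Ifmap and Filter tags per position (a logical AND, which reduces to the Filter tag whenever all Ifmap tags are 1), the FIG. 4 example can be checked with this illustrative sketch:

```python
filter_vals = [1, 1, 0, 0, 2, 0, 1, 0, 4]   # Filter values from FIG. 4
ifmap_flag  = [1] * 9                        # Ifmap tags are all 1

# Load tag: 1 for a non-0 weight, 0 for a pruned weight.
filter_flag = [1 if v != 0 else 0 for v in filter_vals]

# Madd tag per position: valid only if both operand tags are valid.
madd_flag = [a & b for a, b in zip(ifmap_flag, filter_flag)]

print(filter_flag)
print(madd_flag)   # equals filter_flag because every Ifmap tag is 1
```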
(2) Detection of invalid instructions
Fig. 5 shows the valid instructions finally executed after the instructions of fig. 4 are screened by the instruction detection unit; Flag is the tag information of the corresponding instruction and corresponds one-to-one with the instructions in fig. 4.
Step one: initially, the Flag value of Inst1 is 1, so the PC points to Inst1 and executes it;
Step two: after execution, the PC increments by 1 to point to Inst2; since the Flag of Inst2 is also 1, the PC stays at Inst2 and executes it;
Step three: this continues until Inst11 finishes execution;
Step four: the PC increments by 1 to point to Inst12; because the Flag of Inst12 is 0, both i and flag_id in the instruction detection unit increment by 1 to examine the validity of Inst13;
Step five: the Flag of Inst13 is still 0, so i and flag_id increment by 1 again;
Step six: the Flag of Inst14 is 1, so the PC is updated to PC+2, detection ends, and the PC jumps to Inst14 for execution;
Step seven: the instruction detection unit continues in this way until all instructions in the graph have been examined;
Step eight: the valid instructions finally executed by the PE are Inst1-Inst11, Inst14, Inst16, Inst18-Inst20, Inst23, Inst25 and Inst27.
The following is a system example corresponding to the above method example, and this embodiment mode may be implemented in cooperation with the above embodiment mode. The related technical details mentioned in the above embodiments are still valid in this embodiment, and in order to reduce repetition, they are not repeated here. Accordingly, the related technical details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a sparse convolutional neural network acceleration system based on the data flow architecture, which comprises:
the method comprises the steps that a module 1 compiles operations of a convolution layer and a full connection layer in a sparse convolution neural network into a data flow graph through a compiler, and instruction marking information is generated for each instruction in the data flow graph according to data characteristics of the data flow graph;
and 2, detecting the instruction through the instruction marking information, reserving effective instructions in the data flow diagram, counting the distance between the two effective instructions, and directly skipping the execution of the ineffective instructions by the sparse convolutional neural network according to the distance when the data flow diagram is executed until a processing result of the data flow diagram is obtained.
The sparse convolutional neural network acceleration system based on the data flow architecture comprises a plurality of nodes, wherein the nodes comprise a plurality of instructions, and directed edges of the data flow graph represent the dependency relationship of the nodes.
The sparse convolutional neural network acceleration system based on the data flow architecture, wherein the instruction marking information indicates the validity and invalidity of an instruction by using 1 and 0 respectively.
The sparse convolutional neural network acceleration system based on the data flow architecture, wherein the module 2 specifically detects the instructions through an invalid instruction detection device, the invalid instruction detection device comprising:
an instruction tag information module for caching the instruction tag information;
a PC counter register for recording the interval between two valid instructions so that the execution of invalid instructions can be skipped directly;
a reference PC register for storing the PC value of the first valid instruction as the reference PC value; when a subsequent instruction is detected to be valid, the reference PC value is added to the interval value held by the PC counter to obtain the PC value of the next valid instruction for the execution unit;
an instruction cache module for storing the valid instructions to be executed by the execution unit.
The sparse convolutional neural network acceleration system based on the data flow architecture, wherein the process of generating the instruction tag information in the module 1 specifically comprises: marking the instructions associated with 0-value weights in the convolution layer and the full connection layer as invalid instructions, and marking the instructions associated with non-0 values as valid instructions.
Claims (4)
1. A sparse convolutional neural network acceleration method based on a data flow architecture, characterized by comprising the following steps:
step 1, compiling the operation of a convolution layer and a full connection layer in a sparse convolution neural network into a data flow graph through a compiler, and generating instruction marking information for each instruction in the data flow graph according to the data characteristics of the data flow graph;
step 2, checking each instruction against its tag information, retaining the valid instructions in the data flow graph, and recording the distance between consecutive valid instructions, so that when the data flow graph is executed the sparse convolutional neural network skips the execution of invalid instructions directly according to this distance, until a processing result of the data flow graph is obtained;
the instruction tag information uses 1 and 0 to denote valid and invalid instructions, respectively; the process of generating the instruction tag information in step 1 specifically comprises: marking the instructions associated with 0-value weights in the convolution layer and the full connection layer as invalid instructions, and marking the instructions associated with non-0 values as valid instructions;
the step 2 specifically detects the instruction by an invalid instruction detecting device, which includes:
the instruction marking information module is used for caching instruction marking information;
a PC counter register for recording the interval between two valid instructions to directly skip the execution of invalid instructions;
the reference PC register is used for storing the PC value of the first effective instruction as a reference PC value, and when the fact that a subsequent instruction is effective is detected, the reference PC value is added with the interval value stored by the PC counter to obtain the PC value of the next effective instruction for execution by the execution unit;
an instruction cache module: for storing valid instructions to be executed by the execution unit.
2. The method for accelerating sparse convolutional neural network based on data flow architecture of claim 1, wherein the data flow graph comprises a plurality of nodes, the nodes comprise a plurality of instructions, and directed edges of the data flow graph represent dependency relationships of the nodes.
3. A sparse convolutional neural network acceleration system based on a data flow architecture, comprising:
a module 1, for compiling the operations of the convolutional layer and the fully connected layer in a sparse convolutional neural network into a data flow graph through a compiler, and generating instruction marking information for each instruction in the data flow graph according to the data characteristics of the data flow graph;
a module 2, for checking each instruction against the instruction marking information, retaining the valid instructions in the data flow graph and counting the distance between every two valid instructions, so that when the data flow graph is executed the execution of invalid instructions is skipped directly according to that distance, until the processing result of the data flow graph is obtained;
wherein the instruction marking information uses 1 and 0 to indicate that an instruction is valid or invalid, respectively; generating the instruction marking information in the module 1 specifically comprises: marking instructions whose weight value is 0 in the convolutional layer and the fully connected layer as invalid instructions, and marking instructions whose weight value is non-zero as valid instructions;
wherein the module 2 detects the instructions by means of an invalid instruction detection device, which comprises:
an instruction marking information module, for caching the instruction marking information;
a PC counter register, for recording the interval between two valid instructions so that the execution of invalid instructions is skipped directly;
a reference PC register, for storing the PC value of the first valid instruction as the reference PC value; when a subsequent instruction is detected to be valid, the interval value stored in the PC counter is added to the reference PC value to obtain the PC value of the next valid instruction for the execution unit; and
an instruction cache module, for storing the valid instructions to be executed by the execution unit.
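The four modules of the claimed invalid instruction detection device can be sketched as a single class (the attribute names mirror the claim language, but the code itself is an illustrative assumption, not the patented implementation): the reference PC plus the interval counted in the PC counter register yields each next valid PC, and only those instructions enter the instruction cache for the execution unit.

```python
class InvalidInstructionDetector:
    def __init__(self, instructions, marks):
        self.mark_buffer = marks   # instruction marking information module
        self.pc_counter = 0        # PC counter register: interval between valid PCs
        self.base_pc = None        # reference PC register
        self.inst_cache = []       # instruction cache module feeding the execution unit
        self._scan(instructions)

    def _scan(self, instructions):
        for pc, mark in enumerate(self.mark_buffer):
            if mark == 1:
                if self.base_pc is None:
                    self.base_pc = pc  # PC of the first valid instruction
                else:
                    # reference PC + counted interval = PC of the next valid instruction
                    self.base_pc = self.base_pc + self.pc_counter
                self.inst_cache.append(instructions[self.base_pc])
                self.pc_counter = 0    # restart the interval count
            self.pc_counter += 1       # keep counting across invalid slots

marks = [1, 0, 0, 1, 0, 1]
insts = ["i0", "i1", "i2", "i3", "i4", "i5"]
det = InvalidInstructionDetector(insts, marks)
print(det.inst_cache)  # ['i0', 'i3', 'i5']
```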
4. The sparse convolutional neural network acceleration system based on a data flow architecture according to claim 3, wherein the data flow graph comprises a plurality of nodes, each node comprises a plurality of instructions, and the directed edges of the data flow graph represent the dependency relationships among the nodes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010685107.XA CN112015472B (en) | 2020-07-16 | 2020-07-16 | Sparse convolutional neural network acceleration method and system based on data flow architecture |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112015472A CN112015472A (en) | 2020-12-01 |
CN112015472B true CN112015472B (en) | 2023-12-12 |
Family
ID=73499705
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010685107.XA Active CN112015472B (en) | 2020-07-16 | 2020-07-16 | Sparse convolutional neural network acceleration method and system based on data flow architecture |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112015472B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109472350A (en) * | 2018-10-30 | 2019-03-15 | 南京大学 | A kind of neural network acceleration system based on block circulation sparse matrix |
CN110991631A (en) * | 2019-11-28 | 2020-04-10 | 福州大学 | Neural network acceleration system based on FPGA |
CN111062472A (en) * | 2019-12-11 | 2020-04-24 | 浙江大学 | Sparse neural network accelerator based on structured pruning and acceleration method thereof |
CN111368988A (en) * | 2020-02-28 | 2020-07-03 | 北京航空航天大学 | Deep learning training hardware accelerator utilizing sparsity |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10489480B2 (en) * | 2017-01-22 | 2019-11-26 | Gsi Technology Inc. | Sparse matrix multiplication in associative memory device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3304363B1 (en) | System for reversible circuit compilation with space constraint, method and program | |
CN105022670A (en) | Heterogeneous distributed task processing system and processing method in cloud computing platform | |
US8789031B2 (en) | Software constructed strands for execution on a multi-core architecture | |
US20120030652A1 (en) | Mechanism for Describing Values of Optimized Away Parameters in a Compiler-Generated Debug Output | |
KR20150052350A (en) | Combined branch target and predicate prediction | |
US20210350230A1 (en) | Data dividing method and processor for convolution operation | |
Chu et al. | Precise cache timing analysis via symbolic execution | |
CN114416045A (en) | Method and device for automatically generating operator | |
US20160147516A1 (en) | Execution of complex recursive algorithms | |
CN112015473B (en) | Sparse convolutional neural network acceleration method and system based on data flow architecture | |
Wen et al. | A swap dominated tensor re-generation strategy for training deep learning models | |
CN107870862B (en) | Construction method, traversal testing method and computing device of new control prediction model | |
CN112015472B (en) | Sparse convolutional neural network acceleration method and system based on data flow architecture | |
Le et al. | Involving cpus into multi-gpu deep learning | |
CN112183744A (en) | Neural network pruning method and device | |
US8381195B2 (en) | Implementing parallel loops with serial semantics | |
CN114791865B (en) | Configuration item self-consistency detection method, system and medium based on relation diagram | |
WO2024000464A1 (en) | Blocking policy generation method and apparatus for tensor computation | |
CN112215349B (en) | Sparse convolutional neural network acceleration method and device based on data flow architecture | |
CN115130672A (en) | Method and device for calculating convolution neural network by software and hardware collaborative optimization | |
Nehmeh et al. | Integer word-length optimization for fixed-point systems | |
Kim et al. | System level power reduction for yolo2 sub-modules for object detection of future autonomous vehicles | |
D’Alberto et al. | Static analysis of parameterized loop nests for energy efficient use of data caches | |
CN118278468B (en) | Deep neural network reasoning method and device based on database management system | |
CN116301903B (en) | Compiler, AI network compiling method, processing method and executing system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||