CN112215349A - Sparse convolution neural network acceleration method and device based on data flow architecture - Google Patents

Sparse convolution neural network acceleration method and device based on data flow architecture

Info

Publication number
CN112215349A
Authority
CN
China
Prior art keywords
instruction
neural network
activation
positive
output activation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010972552.4A
Other languages
Chinese (zh)
Other versions
CN112215349B (en)
Inventor
吴欣欣
范志华
欧焱
李文明
叶笑春
范东睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN202010972552.4A
Publication of CN112215349A
Application granted
Publication of CN112215349B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a sparse convolution neural network acceleration method based on a data flow architecture, which comprises the following steps: obtaining positive and negative value flag information for the output activations by computing the operation of the input activation and the weight matrix; marking the instructions related to each output activation as valid or invalid according to that flag information, obtaining instruction flag information; screening out the instructions marked valid according to the instruction flag information; and skipping the instructions marked invalid, executing only the instructions marked valid.

Description

Sparse convolution neural network acceleration method and device based on data flow architecture
Technical Field
The invention relates to the technical field of computer architecture, and in particular to a sparse convolution neural network acceleration method and device based on a data flow architecture.
Background
Neural networks deliver leading performance in image detection, speech recognition, and natural language processing. As applications grow more complex, so do neural network models, which poses many challenges for traditional hardware; to relieve the pressure on hardware resources, sparse networks offer clear advantages in computation, storage, and power consumption. Many algorithms and accelerators for sparse networks have appeared, such as the Sparse BLAS library for CPUs and the cuSPARSE library for GPUs, which accelerate sparse-network execution to some extent, while dedicated accelerators show leading results in performance and power consumption. The data flow architecture is widely used in big data processing, scientific computing, and similar fields, and because its algorithm is decoupled from its structure it has good generality and flexibility; its natural parallelism matches the parallel character of neural network algorithms well. However, CPUs, GPUs, and accelerators aimed at dense networks cannot accelerate sparse networks, while dedicated sparse-network accelerators strongly couple algorithm and structure, so they lack architectural flexibility and generality and leave no room for algorithmic innovation.
In a data flow architecture, the neural network algorithm is mapped onto a computing (PE) array in the form of a dataflow graph. The graph contains multiple nodes, each node contains multiple instructions, and the directed edges of the graph represent the dependencies between nodes. The PE array executes the mapped instructions to carry out the operations of the neural network.
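For illustration only, the node-and-edge structure just described can be sketched as a small data type (the names DataflowNode, instructions, and successors are assumptions of this sketch, not terms from the patent):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataflowNode:
    node_id: int
    instructions: List[str]  # the instructions mapped onto one PE
    successors: List[int] = field(default_factory=list)  # directed edges: nodes that depend on this one

# A node fires once all of its predecessor nodes have produced their data;
# the PE array then executes the fired node's instructions.
```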
In most deep neural networks (DNNs), the rectified linear unit (ReLU) is widely used at the output of a network layer and forces negative activation values to 0. Meanwhile, for the network weights, methods such as pruning exploit the redundancy of the weight data to set some weights to 0. These methods produce a large number of 0-valued output activations and weights, so sparse networks exhibit both weight sparsity and activation sparsity, and modern DNN models are roughly 50% sparse. Neural network computation consists mainly of multiply and add operations, and multiplying 0 by any value yields 0, so such operations can be regarded as invalid. Executing them occupies computing resources, wastes compute and power, lengthens the network's execution time, and reduces its performance.
To remove these invalid calculations in a data flow architecture, an effective method is to generate flag information for the corresponding instructions in the dataflow graph according to the characteristics of the data; before execution, the PE array consults the flags, executes only the valid instructions, and skips the invalid ones, thereby saving computing resources, reducing power consumption, and improving performance.
However, weights and activations behave differently: weight data is static and does not change as the neural network runs, whereas activation data is dynamic, and the sparsity of a layer's output activations is unknown until that layer has been computed.
Because weights are static, instruction flag information for them can be generated at compile time, letting the PE array skip instructions related to 0-valued weights according to the flags; this approach, however, does not suit dynamic activation data. First, a neural network executes layer by layer: the input activations of the current layer come from the output activations of the previous layer, so activation information is unavailable at compile time. Second, before the current layer is computed, it is unknown which weights and input activations relate to the invalid output activations. Compile-time instruction flagging can therefore exploit only weight sparsity, not activation sparsity, so not all operations related to 0 values can be removed; this wastes computing resources, increases power consumption, and reduces the performance of the neural network.
Disclosure of Invention
To address the defects of the prior art, the invention provides a sparse convolution neural network acceleration method based on a data flow architecture, which comprises the following steps:
step 1, computing the operation of the input activation and the weight matrix to obtain positive and negative value flag information for the output activations;
step 2, marking the instructions related to each output activation as valid or invalid according to the positive and negative value flag information of the output activations, obtaining instruction flag information;
step 3, screening out the instructions marked valid according to the instruction flag information;
and step 4, skipping the instructions marked invalid and executing only the instructions marked valid, as sketched below.
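A minimal Python sketch of this four-step flow follows; the names accelerate_layer and execute_block, the low-rank weight factors U_approx/V_approx (described in the detailed description below), and the one-block-per-output-activation layout are all assumptions of this sketch, not from the patent:

```python
import numpy as np

def execute_block(block):
    """Stand-in for the PE array executing one instruction block."""
    print("executing", block)

def accelerate_layer(I, U_approx, V_approx, inst_blocks):
    # Step 1: cheap prediction pass using low-rank weight factors; the sign of
    # each predicted output activation becomes its flag (1 = positive).
    pred = (I @ U_approx) @ V_approx
    out_act_flag = (pred.ravel() > 0).astype(np.int8)

    # Step 2: each instruction block inherits the flag of the output
    # activation it produces (one block per output activation).
    inst_flag = out_act_flag[:len(inst_blocks)]

    # Step 3: screen out the blocks marked valid.
    valid_blocks = [b for b, f in zip(inst_blocks, inst_flag) if f == 1]

    # Step 4: execute only the valid blocks; the invalid ones are skipped.
    for block in valid_blocks:
        execute_block(block)
```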
The invention also provides a sparse convolution neural network acceleration device based on a data flow architecture, which comprises:
an instruction execution unit for computing the operation of the input activation and the weight matrix, obtaining the positive and negative value flag information of the output activations, and executing the instructions marked valid;
a prediction marking unit for marking the instructions related to each output activation as valid or invalid according to the positive and negative value flag information of the output activations, obtaining instruction flag information;
and an instruction selection unit for screening out the instructions marked valid according to the instruction flag information.
Because the sparsity of dynamically generated activation data cannot otherwise be exploited in a data flow architecture, the invention provides a prediction device for sparse activation data. At small time cost before a network layer is computed, it predicts the sparsity of the output activations and uses the predicted activation data to generate flag information for the corresponding network instructions, so that the computing array skips the instructions marked invalid. This removes the operations related to invalid output activations and accelerates the sparse neural network.
Drawings
Fig. 1 is a schematic diagram of the convolution implementation.
Fig. 2 is a schematic diagram of the prediction device for output activations.
FIG. 3 is a diagram illustrating the instructions that need to be executed to compute m+1 output activations.
FIG. 4 is a diagram illustrating a prediction process of output activation according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
In the data flow architecture, the PE array executes the neural-network instructions mapped onto it by the compiler; the generated output data passes through a ReLU activation unit to produce output activation data, and because the ReLU unit turns negative outputs into 0, the output activations contain many 0 values. The sparsity of each layer's output activations is known only after that layer has executed. Consequently, executing a sparse neural network on a data flow architecture involves invalid calculations, and these invalid calculations occupy computing resources and hinder performance improvement.
Fig. 1 shows the convolution process: the input activation (Ifmap, size 4×4) is convolved with the filter (Filter, size 3×3), and after 4 convolution operations, Conv1-Conv4, the output data (Output, size 2×2) is produced; negative outputs are forced to 0 by the ReLU unit. Since the results of Conv3 and Conv4 are negative and both become 0 after the ReLU activation function, the work performed by Conv3 and Conv4 is invalid.
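The effect in Fig. 1 can be reproduced with a short sketch; the concrete Ifmap and filter values below are invented for illustration, and only the shapes (4×4, 3×3, 2×2) match the figure:

```python
import numpy as np

# Values invented for illustration; only the shapes match Fig. 1.
ifmap = np.array([[1., 2., 0., 1.],
                  [0., 1., 3., 2.],
                  [2., 0., 1., 0.],
                  [1., 3., 0., 2.]])      # 4x4 input activation
filt = np.array([[1., 0., -1.],
                 [0., 1., 0.],
                 [-1., 0., 1.]])          # 3x3 filter

out = np.empty((2, 2))
for i in range(2):                        # the 4 sliding positions = Conv1..Conv4
    for j in range(2):
        out[i, j] = np.sum(ifmap[i:i+3, j:j+3] * filt)

relu_out = np.maximum(out, 0.0)           # ReLU forces negative outputs to 0
print(out)        # [[ 1.  4.] [-4. -1.]] -> the last two convolutions are negative
print(relu_out)   # [[ 1.  4.] [ 0.  0.]] -> their work was invalid
```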
Therefore, to exploit the sparsity of activation data, the sparsity of the output activations must be predicted in advance, so that invalid operations can be removed, the computation of neural network forward inference reduced, and performance and power consumption improved.
For sparse neural networks, the invention designs a prediction device for output activation data based on singular value decomposition (SVD) of the weights. It predicts whether each output activation will be 0 before the forward operation executes, generates flag information for the instructions related to that output activation to indicate whether they are valid, and the computing unit skips invalid instructions according to the flags, ultimately saving computing resources, improving performance, and reducing power consumption.
The invention comprises the following key points:
key point 1, a prediction method and device for output activation data;
key point 2, the marking process for invalid instructions related to sparse activation data;
key point 3, the PE array deciding whether to execute an instruction according to the instruction's flag information.
(1) Prediction method and device for output activation data
Assume the input activation of the l-th layer is I_l and the weight matrix is W_l, so that the activation of the (l+1)-th layer is I_{l+1} = ReLU(I_l W_l). Apply singular value decomposition to the weight matrix W_l, so that
W_l = U Σ V^T,
where, for W_l ∈ R^{h×w}, U ∈ R^{h×h} is the left singular matrix of W_l, Σ ∈ R^{h×w} is the diagonal matrix of singular values, and V ∈ R^{w×w} is the right singular matrix of W_l.
The low-rank approximation of W_l is
W_l ≈ U_r Σ_r V_r^T,
where U_r denotes the first r columns of the left singular matrix of W_l, Σ_r the diagonal matrix of the first r singular values, and V_r the first r columns of the right singular matrix of W_l.
Since r ≪ h and r ≪ w, the computational complexity O(r(h + w)) of the forward operation after low-rank approximation is much less than the complexity O(hw) of the original.
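For a sense of scale (dimensions chosen for illustration, not taken from the patent): with h = w = 512 and r = 16, predicting one input row costs about r(h + w) = 16 × 1024 = 16,384 multiply-accumulates, versus hw = 512 × 512 = 262,144 for the exact product, a 16× reduction.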
Applying the low-rank approximation of W_l to the network gives
I_{l+1} ≈ ReLU(I_l U′ V′) = ReLU(I_l W′),
where U′ = U_r Σ_r and V′ = V_r^T, so that W′ = U′ V′ approximates W_l.
If the result of the I_l · W′ operation (performed in convolution fashion) is less than 0, i.e., its sign is negative, the output becomes 0 after the ReLU activation unit. Based on such predictions, these operations can be skipped to reduce the computational load of the network and speed up its execution.
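A minimal NumPy sketch of this prediction step, assuming the split U′ = U_r Σ_r and V′ = V_r^T given above (function names and the example shapes are illustrative):

```python
import numpy as np

def build_predictor(W, r):
    """Offline: SVD of the weight matrix W (h x w), keeping rank r."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_prime = U[:, :r] * s[:r]   # U' = U_r @ Sigma_r (assumed split of W')
    V_prime = Vt[:r, :]          # V' = V_r^T
    return U_prime, V_prime

def predict_output_flags(I, U_prime, V_prime):
    """Online: two thin products, cost O(r(h+w)) per row instead of O(hw)."""
    approx = (I @ U_prime) @ V_prime        # approximates I @ W
    return (approx > 0).astype(np.int8)     # out_act_flag: 1 = positive, 0 = negative

# Example (shapes assumed): one 1x64 input row, a 64x32 weight matrix, rank 4.
W = np.random.randn(64, 32)
I = np.random.randn(1, 64)
Up, Vp = build_predictor(W, 4)
flags = predict_output_flags(I, Up, Vp)     # predicted signs of the 32 output activations
```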
Fig. 2 is a schematic diagram of the output activation predictor based on singular value decomposition (SVD). As shown in fig. 2, the prediction device consists of an output activation index (out_act_index) with its flag information (out_act_flag), and an instruction index (inst_index) with its flag information (inst_flag). out_act_flag stores the sign of each output activation, where 0 means the output activation is negative and 1 means it is positive; inst_index records the location of each instruction, and inst_flag stores whether each instruction is valid, e.g., 0 means invalid and 1 means valid.
After the instruction execution unit (inst execution unit) finishes the I_l U′V′ operation, the predicted positive and negative value flags of the output activations are obtained and stored in the out_act_flag field of the prediction device, and the related instructions are marked valid or invalid according to the output activation index out_act_index and the flag information out_act_flag. The instruction selection unit (inst selection unit) screens out the valid instructions (valid inst) according to the instruction flag information inst_flag, so that the instruction execution unit skips invalid instructions (inst_flag = 0) and executes only valid instructions (inst_flag = 1).
By predicting the output activations, invalid instruction executions related to 0 values are removed, reducing the number of executed instructions, shortening network execution time, improving network performance, and lowering energy consumption.
Specifically, as shown in fig. 4, the process is described in more detail for one PE array in combination with the convolution execution flow, assuming m+1 output activation data (out0-outm) are to be computed; the instructions required are shown in fig. 3.
Step one: perform singular value decomposition (SVD) offline on the weight matrix W (the weight matrix of a convolutional layer's filter or of a fully connected layer; among the parameters of a neural network, the weight information is known) to obtain the U and V matrices;
Step two: apply low-rank approximation to the weight matrix, i.e., take its first r ranks and approximate it as U′V′;
Step three: the instruction execution unit in the PE array executes instructions to compute the operation of the input activation I with the approximate matrix U′V′, generating predicted values for the m+1 output activations;
Step four: store the predicted output activation flag information (out_act_flag) in the predictor, indexed by the output activation indices (out_act_index); for example, out_act_index 0 has out_act_flag 0, indicating a negative output activation, while out_act_index 1, 2, ..., m have out_act_flag 1, indicating positive output activations;
Step five: mark the flag information of the instructions related to each output activation according to that activation's flag; the instructions related to out_act_index 0 form block0, and since that activation's flag is 0, all instructions in block0 are marked invalid; the instructions related to out_act_index 1, 2, ..., m form block1, block2, ..., blockm, and since those activations' flags are 1, all instructions in those blocks are marked valid.
Step six: the instruction selection unit (inst selection unit) screens out the valid instructions of block1, block2, ..., blockm according to the instruction flag information;
Step seven: after the valid instructions are screened out, the instruction execution unit (inst execution unit) executes only the valid instructions, thereby skipping the invalid ones; a compact sketch of steps four through seven follows.
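Putting steps four through seven together, here is a minimal sketch of the predictor tables and the selection step; the array layout is an assumption of this sketch, while the example flag values (out0 negative, out1-outm positive) mirror the walkthrough above:

```python
import numpy as np

m = 3                                                 # m+1 = 4 output activations (example size)
out_act_index = np.arange(m + 1)                      # indices of the output activations
out_act_flag = np.array([0, 1, 1, 1], dtype=np.int8)  # step four: out0 negative, out1..outm positive

inst_index = out_act_index                            # one instruction block per output activation
inst_flag = out_act_flag.copy()                       # step five: blocks inherit their activation's flag

valid_inst = inst_index[inst_flag == 1]               # step six: inst selection unit keeps flag == 1
print(valid_inst)                                     # step seven skips block0 -> [1 2 3]
```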
The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof, and it should be understood that various changes and modifications can be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A sparse convolutional neural network acceleration method based on a data flow architecture, characterized by comprising the following steps:
step 1, computing the operation of the input activation and the weight matrix to obtain positive and negative value flag information for the output activations;
step 2, marking the instructions related to each output activation as valid or invalid according to the positive and negative value flag information of the output activations, obtaining instruction flag information;
step 3, screening out the instructions marked valid according to the instruction flag information;
and step 4, skipping the instructions marked invalid and executing only the instructions marked valid.
2. The data flow architecture-based sparse convolutional neural network acceleration method of claim 1, wherein step 1 specifically comprises:
performing singular value decomposition on the weight matrix and applying low-rank approximation to obtain an approximate matrix; and computing the operation of the input activation with the approximate matrix to obtain the positive and negative value flag information of the output activations.
3. The method according to claim 1 or 2, wherein the weight matrix comprises the weight matrix of a convolutional layer's filter or of a fully connected layer.
4. The data flow architecture-based sparse convolutional neural network acceleration method of claim 1, wherein the positive and negative value flag information of the output activations uses 1 and 0 to indicate that an output activation is positive or negative, respectively.
5. The data flow architecture-based sparse convolutional neural network acceleration method of claim 1 or 4, wherein the instruction flag information uses 1 and 0 to indicate that an instruction is valid or invalid, respectively.
6. A sparse convolutional neural network acceleration device based on a data flow architecture, comprising:
an instruction execution unit for computing the operation of the input activation and the weight matrix, obtaining the positive and negative value flag information of the output activations, and executing the instructions marked valid;
a prediction marking unit for marking the instructions related to each output activation as valid or invalid according to the positive and negative value flag information of the output activations, obtaining instruction flag information;
and an instruction selection unit for screening out the instructions marked valid according to the instruction flag information.
7. The data flow architecture-based sparse convolutional neural network acceleration device of claim 6, wherein the instruction execution unit is specifically configured to:
perform singular value decomposition and low-rank approximation on the weight matrix to obtain an approximate matrix; and compute the operation of the input activation with the approximate matrix to obtain the positive and negative value flag information of the output activations.
8. The data flow architecture-based sparse convolutional neural network acceleration device of claim 6 or 7, wherein the weight matrix comprises the weight matrix of a convolutional layer's filter or of a fully connected layer.
9. The data flow architecture-based sparse convolutional neural network acceleration device of claim 6, wherein the positive and negative value flag information of the output activations uses 1 and 0 to indicate that an output activation is positive or negative, respectively.
10. The data flow architecture-based sparse convolutional neural network acceleration device of claim 6 or 9, wherein the instruction flag information uses 1 and 0 to indicate that an instruction is valid or invalid, respectively.
CN202010972552.4A 2020-09-16 2020-09-16 Sparse convolutional neural network acceleration method and device based on data flow architecture Active CN112215349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010972552.4A CN112215349B (en) 2020-09-16 2020-09-16 Sparse convolutional neural network acceleration method and device based on data flow architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010972552.4A CN112215349B (en) 2020-09-16 2020-09-16 Sparse convolutional neural network acceleration method and device based on data flow architecture

Publications (2)

Publication Number Publication Date
CN112215349A true CN112215349A (en) 2021-01-12
CN112215349B CN112215349B (en) 2024-01-12

Family

ID=74049599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010972552.4A Active CN112215349B (en) 2020-09-16 2020-09-16 Sparse convolutional neural network acceleration method and device based on data flow architecture

Country Status (1)

Country Link
CN (1) CN112215349B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505383A (en) * 2021-07-02 2021-10-15 中国科学院计算技术研究所 ECDSA algorithm execution system and method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106297778A (en) * 2015-05-21 2017-01-04 中国科学院声学研究所 The neutral net acoustic model method of cutting out based on singular value decomposition of data-driven
CN107239829A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of method of optimized artificial neural network
CN109472350A (en) * 2018-10-30 2019-03-15 南京大学 A kind of neural network acceleration system based on block circulation sparse matrix

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Song Han et al.: "Learning both Weights and Connections for Efficient Neural Networks", arXiv.org *

Also Published As

Publication number Publication date
CN112215349B (en) 2024-01-12

Similar Documents

Publication Publication Date Title
Lu et al. SpWA: An efficient sparse winograd convolutional neural networks accelerator on FPGAs
US10691996B2 (en) Hardware accelerator for compressed LSTM
US10402725B2 (en) Apparatus and method for compression coding for artificial neural network
Zhang et al. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
Heo et al. Real-time object detection system with multi-path neural networks
Li et al. Dynamic dataflow scheduling and computation mapping techniques for efficient depthwise separable convolution acceleration
Shi et al. E-LSTM: Efficient inference of sparse LSTM on embedded heterogeneous system
CN111368988A (en) Deep learning training hardware accelerator utilizing sparsity
Yang et al. S 2 Engine: A novel systolic architecture for sparse convolutional neural networks
Mirzaeian et al. NESTA: Hamming weight compression-based neural processing engine
CN112580793A (en) Neural network accelerator based on time domain memory computing and acceleration method
Kim et al. Nlp-fast: a fast, scalable, and flexible system to accelerate large-scale heterogeneous nlp models
CN112215349A (en) Sparse convolution neural network acceleration method and device based on data flow architecture
Fuketa et al. Image-classifier deep convolutional neural network training by 9-bit dedicated hardware to realize validation accuracy and energy efficiency superior to the half precision floating point format
Shivapakash et al. A power efficient multi-bit accelerator for memory prohibitive deep neural networks
US11551087B2 (en) Information processor, information processing method, and storage medium
Li et al. Mapping yolov4-tiny on fpga-based dnn accelerator by using dynamic fixed-point method
CN117151178A (en) FPGA-oriented CNN customized network quantification acceleration method
Turner et al. mlGeNN: accelerating SNN inference using GPU-enabled neural networks
Wang et al. A none-sparse inference accelerator that distills and reuses the computation redundancy in CNNs
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
Gao et al. FPGA-based accelerator for independently recurrent neural network
CN115130672A (en) Method and device for calculating convolution neural network by software and hardware collaborative optimization
Parashar et al. Processor pipelining method for efficient deep neural network inference on embedded devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant