CN113592072A - Sparse convolution neural network accelerator oriented to memory access optimization - Google Patents
Sparse convolution neural network accelerator oriented to memory access optimization
- Publication number
- CN113592072A (application CN202110845980.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- module
- convolution operation
- activation
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
A memory-access-optimization-oriented sparse convolutional neural network accelerator, comprising: a sparse activation value processing module SSG, used for removing zero-value activation data and screening out valid non-zero activation values; a buffer module CBUF, used for storing input neuron data and enabling the reuse of repeated activation data; a buffer module PB, used for storing weight data read in parallel; and an operation module CMAC, used for performing the multiply-add operations of the convolution. In the data-reading stage, the neuron data required by the current convolution operation are read into the buffer module CBUF and the weight data into the buffer module PB; in the screening-and-reuse stage, the sparse activation value processing module screens out the non-zero activation data in the buffer module and simultaneously checks whether reusable activation data exist; in the operation stage, the screened non-zero activation data are transmitted to the operation module for the convolution calculation. The invention has the advantages of a simple principle, easy implementation, and a marked improvement in computation and memory-access efficiency.
Description
Technical Field
The invention relates generally to the technical field of neural network applications, and in particular to a sparse convolutional neural network accelerator oriented to memory-access optimization.
Background
Deep learning technology is currently developing rapidly and is widely applied in many fields, while also remaining a hot area of academic research. Within deep learning, deep neural network models attract the most attention and perform well in many artificial intelligence applications, including computer vision, natural language processing, machine translation, and image recognition. However, the training and inference of neural network models place high demands on computing power, and ordinary CPUs and embedded processors can no longer supply the computing power these models require.
This raises the problem of rapidly growing computational complexity as neural network models develop. At present, those skilled in the art offer two mainstream schemes for supplying the computing power that neural network models need. One approach uses a GPU with a large number of parallel threads to perform the model computation; the other develops a dedicated neural network accelerator based on an FPGA or ASIC design. Although GPUs can provide intensive computational support for neural network models, their high power consumption is a concern. A low-power, high-throughput dedicated neural network accelerator therefore becomes an effective way to serve neural network model computation.
NVDLA is a representative dedicated neural network accelerator: an open-source accelerator platform introduced by NVIDIA for the deep learning inference process. However, two problems remain during its convolution operation:
First, a large amount of zero-valued activation data occupies considerable memory space and many operation units; any calculation in which a zero-valued activation participates is an invalid calculation. Because NVDLA does not support processing of sparse activation values, efficiently removing this zero-valued activation data can improve the efficiency of the convolution operation.
Second, a large amount of activation data is repeated across the sliding windows of a convolutional layer: data at the same positions in adjacent convolution operations must repeatedly participate in the calculation, which sharply increases the memory-access volume of the NVDLA accelerator.
In summary, a method is needed that can process sparse activation data within a convolution operation and reuse the activation data repeated across different convolution operations, thereby improving the computation and memory-access efficiency of the NVDLA convolution process.
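The scale of the redundancy behind the second problem is easy to quantify: two adjacent K × K windows at stride S share a fraction (K − S)/K of their activations along the sliding direction. A minimal illustrative sketch in Python (the function name is ours, not part of NVDLA):

```python
def shared_fraction(k: int, s: int) -> float:
    """Fraction of activations shared by two adjacent k x k convolution
    windows sliding with stride s (overlap along the sliding direction)."""
    return max(k - s, 0) / k

# A 4 x 4 window at stride 1 re-reads 75% of its activations between
# adjacent positions -- the reuse figure used in the embodiments below.
assert shared_fraction(4, 1) == 0.75
```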
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the technical problems in the prior art, the invention provides a memory-access-optimization-oriented sparse convolutional neural network accelerator that is simple in principle, easy to implement, and markedly improves computation and memory-access efficiency.
In order to solve the technical problems, the invention adopts the following technical scheme:
a memory-oriented optimization sparse convolutional neural network accelerator, comprising:
the sparse activation value processing module SSG is used for removing zero-value activation data and screening out effective non-zero activation values;
the buffer module CBUF is used for storing input neuron data and realizing repeated activation data multiplexing;
the cache module PB is used for storing the weight data read in parallel;
the operation module CMAC is used for completing the multiply-add operation of the convolution operation;
in the data reading stage, reading neuron data required by the current convolution operation into a buffer module CBUF, and reading weight data into a buffer module PB;
in the screening and multiplexing stage, the sparse activation value processing module SSG screens out the non-zero activation data in the cache module CBUF, and simultaneously checks whether multiplexed activation data exist or not;
and in the operation stage, transmitting the screened non-zero activation data to an operation module CMAC for convolution calculation.
As a further improvement of the invention: the sparse activation value processing module SSG includes:
an input neuron channel and a weight channel, used for storing the input neuron and weight data required by each convolution operation;
an index table, used for recording the positions of the non-zero activation value data in memory;
and a threshold setting module, used for setting the threshold parameter for data screening.
As a further improvement of the invention: the input neuron and weight data held in the input neuron and weight channels are each sized at 16 × 1 × 128 bytes.
As a further improvement of the invention: the data-screening threshold T in the threshold setting module is set to zero and is used to screen out the non-zero activation value data.
As a further improvement of the invention: the buffer module CBUF includes:
a counting module, used for determining when the accelerator initiates a memory-access operation; the counter is set to 2 bits to identify when the accelerator starts to execute a memory-access operation;
and an identification module, used for identifying whether the current convolution operation is finished: when the flag is 0, the current convolution operation is not finished; otherwise, it is finished.
As a further improvement of the invention: the counting module comprises a counting component Stripe Count used to implement the counting function across convolution operations, i.e., distinguishing the first convolution operation, the second convolution operation, and so on; the first convolution operation has no data reuse, and after it completes, the front portion of its data segment is discarded while the rear portion is retained for reuse by the next convolution operation, and so on.
as a further improvement of the invention: the identification module comprises an identification component C _ Flag, the convolution identification component is set to be 0 or 1, the value of 0 indicates that the current convolution operation is not completed due to the lack of data quantity, otherwise, the current convolution operation is completed; the convolution identification bit of the second convolution operation is set to be 0, and the convolution identification bit of the third convolution operation is also set to be 0 as well as the second convolution operation; until the fourth convolution operation, the total amount of the missing active data segments is accumulated to the required active data amount of one convolution operation; at the moment, a read operation is initiated, and the new data are all read into the Buffer; and the data is stored in the cache in the CBUF according to the sequence of use.
As a further improvement of the invention: the operation module CMAC is configured to execute the multiply-add operations of the convolution; there are 16 CMAC groups in total, each with a 64-bit data input; the input neuron data are the same for every group, while the input weight data differ.
As a further improvement of the invention: the operation module CMAC comprises a three-stage pipeline:
the first stage is a multiplier layer containing 16 multipliers, each with a 64-bit input;
the second stage is an adder layer with 16 input ports;
the third stage is a processing unit used for completing the non-linear processing of the activation data;
and the multiplier layer is mapped in a weight-stationary dataflow, i.e., a weight channel updates its weight elements only after all the activation data corresponding to that group of weights have been traversed.
As a further improvement of the invention: the buffer module PB stores the input weight data; 16 PB modules are provided in total, respectively distributed across different PE units; the weight storage totals 256 KB of SRAM, with each PB consisting of 16 banks of 1 KB; each bank consists of a 64-bit-wide, 128-entry dual-port SRAM.
Compared with the prior art, the invention has the following advantages:
The memory-access-optimization-oriented sparse convolutional neural network accelerator of the invention is simple in structure, easy to implement, and markedly improves computation and memory-access efficiency. The sparse activation value processing module lets the accelerator dynamically skip zero activation values, which raises the computational efficiency of the compute array during convolution. Further, by providing multiple weight storage modules PB and improving the CBUF, the accelerator can efficiently reuse the activation data repeated across adjacent convolution operations.
Drawings
Fig. 1 is a schematic diagram of the topology of the present invention.
Fig. 2 is a schematic diagram of the structure of a sparse activation value processing module in a specific application example of the present invention.
FIG. 3 is a schematic diagram of the workflow of the sparse activation value processing module in a specific application example of the present invention.
Fig. 4 is a schematic diagram of a concrete calculation example of the sparse activation value processing module in a specific application example of the present invention.
FIG. 5 is a schematic diagram of the structure of an arithmetic element in a specific application example of the present invention.
Fig. 6 is a schematic diagram illustrating a structure of a cache block PB in a specific application example of the present invention.
Fig. 7 is a schematic diagram of the structure of the buffer module CBUF in the embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and specific examples.
As shown in fig. 1, the sparse convolutional neural network accelerator oriented to memory-access optimization of the present invention includes:
the sparse activation value processing module SSG, used for removing zero-value activation data and screening out valid non-zero activation values;
the buffer module CBUF, used for storing input neuron data and enabling the reuse of repeated activation data;
the buffer module PB, used for storing weight data read in parallel;
and the operation module CMAC, used for performing the multiply-add operations of the convolution.
In the data-reading stage, the accelerator reads the neuron data required by the current convolution operation into the buffer module CBUF and reads the weight data into the buffer module PB.
In the screening-and-reuse stage, the sparse activation value processing module SSG screens out the non-zero activation data in the buffer module CBUF and simultaneously checks whether reusable activation data exist; and in the operation stage, the screened non-zero activation data are transmitted to the operation module CMAC for the convolution calculation.
Referring to fig. 2, in a specific application example, the sparse activation value processing module SSG includes:
an input neuron channel and a weight channel, which store the input neuron and weight data required by each convolution operation;
an index table, which records the positions of the non-zero activation value data in memory;
and a threshold setting module, used for setting the threshold parameter for data screening.
In a specific application example, the input neuron and weight channels store the input neuron and weight data required by each convolution operation, and their size is generally set to 16 × 1 × 128 bytes.
In a specific application example, the threshold setting module sets the threshold for data screening. The size of this threshold bears directly on the accuracy of the network model; here the threshold T is generally set to zero, so that the non-zero activation value data can be screened out.
As shown in fig. 3, in a specific application example, the accelerator completes the screening of non-zero activation values in the sparse activation value processing module as follows:
firstly, the neuron data and weight data required by the current convolution operation are read in;
secondly, the read input neuron data are compared with the threshold, and the positions of the neuron data whose values are greater than zero are recorded in the index table (Indexing Result);
and thirdly, by looking up the positions of the non-zero neuron data in the index table, the valid neuron data and weight data are transmitted to the operation module CMAC for calculation.
As shown in fig. 4, which illustrates the sparse activation value processing module on a concrete calculation example, let the neuron input of one convolution operation have dimension 4 × 4 and let the convolution kernel have dimension 1 × 16. In this example the accelerator needs three steps in total to complete the screening of the non-zero activation value data:
firstly, the input neuron data of dimension 4 × 4 are transmitted to the input neuron channels;
secondly, the input neuron data are compared with the threshold while the relative positions of the non-zero neuron data are recorded in the index table (Indexing Result); meanwhile, the corresponding weight data are screened out through the relative positions of the non-zero activation elements in the index table;
and thirdly, the non-zero neurons are transmitted in sequence, by broadcast, to the CMAC ports of the compute array; the input neuron data in every CMAC are the same, while the weight data differ.
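The three steps above can be condensed into a behavioral sketch (Python with NumPy; screen_nonzero and convolution_pass are illustrative names, and the model captures only the dataflow of the SSG and CMAC, not the hardware timing):

```python
import numpy as np

def screen_nonzero(activations, threshold=0.0):
    """SSG behavior: keep activations above the threshold (T = 0 here) and
    record their flat positions in an index table (Indexing Result)."""
    flat = activations.reshape(-1)
    index_table = np.flatnonzero(flat > threshold)
    return flat[index_table], index_table

def convolution_pass(activations, kernel):
    """One convolution window: screen out zero activations, then multiply-add
    only the surviving activation/weight pairs, as the CMAC array would."""
    values, index_table = screen_nonzero(activations)
    weights = kernel.reshape(-1)[index_table]  # weights picked via the index table
    return float(np.dot(values, weights))

# The Fig. 4 setting: a 4 x 4 neuron input against a 1 x 16 kernel.
act = np.array([[1., 0., 2., 0.],
                [0., 3., 0., 0.],
                [4., 0., 0., 5.],
                [0., 0., 6., 0.]])
ker = np.arange(16, dtype=float)
# Skipping zeros does not change the result, only the work performed.
assert convolution_pass(act, ker) == float(np.dot(act.reshape(-1), ker))
```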
In a specific application example, the buffer module CBUF includes: a counting module, used for determining when the accelerator initiates a memory-access operation; since the reuse of repeated activation data reduces the number of memory accesses the accelerator makes, the counting component is set to 2 bits to identify when the accelerator starts to execute a memory-access operation; and an identification module, used for identifying whether the current convolution operation is finished: when the flag is 0, the current convolution operation is not finished; otherwise, it is finished.
The buffer modules PB are used to store the weight data. There are 16 PB modules in total, distributed across the different CMAC units, so that the accelerator can read the weight values in parallel.
In a specific application example, the operation module CMAC is configured to execute the multiply-add operations of the convolution. There are 16 CMAC groups in total, each with a 64-bit data input; the input neuron data of every group are the same, while the input weight data differ.
As shown in fig. 5, in a specific application example, the operation module CMAC of the present invention comprises a three-stage pipeline:
the first stage is a multiplier layer containing 16 multipliers, each with a 64-bit input;
the second stage is an adder layer with 16 input ports;
the third stage is a processing unit that completes the non-linear processing of the activation data.
The multiplier layer is mapped in a weight-stationary dataflow: a weight channel updates its weight elements only after all the activation data corresponding to that group of weights have been traversed.
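The pipeline and its weight-stationary schedule can be sketched behaviorally as follows (Python with NumPy; treating each 64-bit group input as a vector of 16 values and using ReLU as the non-linear stage are modeling assumptions made for illustration, not statements about the hardware):

```python
import numpy as np

def cmac_group(activations, weights):
    """One CMAC group: 16 multipliers (stage 1) feed a 16-input adder
    (stage 2), whose sum passes through the non-linear unit (stage 3,
    ReLU assumed here)."""
    products = np.asarray(activations, dtype=float) * np.asarray(weights, dtype=float)
    acc = products.sum()
    return max(acc, 0.0)

def weight_stationary_schedule(activation_blocks, group_weights):
    """Weight-stationary mapping: each group's weights stay fixed while
    every activation block mapped to them streams past; only then would
    the weight channel update its elements."""
    return [[cmac_group(a, w) for w in group_weights]   # same activations,
            for a in activation_blocks]                 # different weights per group

# 16 groups share each broadcast activation block but hold different weights.
rng = np.random.default_rng(0)
blocks = rng.random((8, 16))                 # 8 activation blocks of 16 values
group_weights = list(rng.random((16, 16)))   # one 16-element vector per group
out = weight_stationary_schedule(blocks, group_weights)  # 8 x 16 results
```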
As shown in fig. 6, in a specific application example, 16 buffer modules PB are provided in total, distributed across the different PE units. The weight storage totals 256 KB of SRAM: each PB consists of 16 banks of 1 KB, and each bank is a 64-bit-wide, 128-entry dual-port SRAM. Setting the number of PB modules to 16 shortens the time needed to read the weight data, which in turn facilitates the reuse of the repeated activation data in adjacent convolution windows.
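Under this reading of the sizes (16 PB modules × 16 banks × 1 KB = 256 KB, with 1 KB = 128 entries × 64 bits per bank), the weight-address decomposition can be sketched as below; the word-interleaved ordering is an assumption made for illustration, since the description does not specify the mapping:

```python
BANK_WIDTH_BYTES = 8    # 64-bit-wide dual-port SRAM
BANK_ENTRIES = 128      # 128 entries  -> 1 KB per bank
BANKS_PER_PB = 16       # 16 x 1 KB    -> 16 KB per PB module
NUM_PB = 16             # 16 modules   -> 256 KB of weight SRAM in total

def pb_location(byte_addr: int) -> tuple[int, int, int]:
    """Decompose a flat weight byte address into (pb, bank, entry),
    assuming 64-bit words are interleaved first across PB modules and
    then across the banks inside each PB."""
    word = byte_addr // BANK_WIDTH_BYTES
    pb = word % NUM_PB
    bank = (word // NUM_PB) % BANKS_PER_PB
    entry = word // (NUM_PB * BANKS_PER_PB)
    assert entry < BANK_ENTRIES, "address beyond the 256 KB weight store"
    return pb, bank, entry

# Consecutive 64-bit words land in different PB modules, which is what
# lets the 16 CMAC groups fetch their weights in parallel.
assert [pb_location(a)[0] for a in range(0, 32, 8)] == [0, 1, 2, 3]
```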
As shown in fig. 7, in a specific application example, the buffer module CBUF of the present invention includes:
a Buffer unit, which stores the non-repeated input neuron data under data reuse; its size is set to 256 KB of SRAM, organized as 16 banks of 16 KB, each bank consisting of a 512-bit-wide, 256-entry dual-port SRAM;
a counting component Stripe Count, which implements the counting function across convolution operations, i.e., distinguishing the first convolution operation, the second, and so on. Taking the case where the activation elements repeated between adjacent convolution operations account for 75% of the activation data of a whole convolution: the first convolution operation has no data reuse, and after it completes, the first 25% of the data segment is discarded while the last 75% is retained for reuse by the next convolution operation, and so on. Adding the counter makes it possible to quantify the amount of data reuse in each convolution operation;
and an identification component C_Flag, set to 0 or 1; a value of 0 indicates that the current convolution operation cannot complete because data are missing, and otherwise the current convolution operation is completed. The second convolution operation has its flag set to 0 because it lacks the last 25% of its activation data, and, like the second, the third convolution operation also has its flag set to 0 because it lacks the last 25% of its data segment. By the fourth convolution operation, the total amount of missing activation data has accumulated to the activation data amount required by one convolution operation. At this moment a read operation is initiated to read all of the new data into the Buffer, where they are stored in the CBUF cache in order of use.
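For the 75%-reuse case, the interplay of Stripe Count and C_Flag can be modeled in a few lines (Python; which exact convolution the full read is charged to is a modeling choice, the description counting it as the fourth convolution operation):

```python
def cbuf_schedule(num_convs: int, reuse: float = 0.75):
    """Sketch of the Stripe Count / C_Flag bookkeeping: each convolution
    after a full read lacks a growing tail of fresh activations, and a new
    full read of the Buffer is initiated once the missing data accumulate
    to one convolution's worth (every 4th convolution at 75% reuse)."""
    convs_per_read = round(1 / (1 - reuse))       # 4 at 75% reuse
    schedule = []
    for conv in range(num_convs):
        stripe_count = conv % convs_per_read      # the 2-bit counter
        full_read = stripe_count == 0             # missing data == 1 window
        c_flag = 1 if full_read else 0            # 0: waiting on missing data
        schedule.append((conv, stripe_count, c_flag, full_read))
    return schedule

# Full Buffer reads land once per 4-convolution cycle; in between, the
# C_Flag stays 0 while only the reused 75% of the window is resident.
assert [c for c, _, _, r in cbuf_schedule(9) if r] == [0, 4, 8]
```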
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment; all technical solutions falling under the idea of the present invention belong to the protection scope of the present invention. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention.
Claims (10)
1. A memory-access-optimization-oriented sparse convolutional neural network accelerator, comprising:
the sparse activation value processing module SSG, used for removing zero-value activation data and screening out valid non-zero activation values;
the buffer module CBUF, used for storing input neuron data and enabling the reuse of repeated activation data;
the buffer module PB, used for storing weight data read in parallel;
and the operation module CMAC, used for performing the multiply-add operations of the convolution;
wherein, in the data-reading stage, the neuron data required by the current convolution operation are read into the buffer module CBUF, and the weight data are read into the buffer module PB;
in the screening-and-reuse stage, the sparse activation value processing module SSG screens out the non-zero activation data in the buffer module CBUF and simultaneously checks whether reusable activation data exist;
and in the operation stage, the screened non-zero activation data are transmitted to the operation module CMAC for the convolution calculation.
2. The memory-access-optimization-oriented sparse convolutional neural network accelerator of claim 1, wherein the sparse activation value processing module SSG comprises:
an input neuron channel and a weight channel, used for storing the input neuron and weight data required by each convolution operation;
an index table, used for recording the positions of the non-zero activation value data in memory;
and a threshold setting module, used for setting the threshold parameter for data screening.
3. The memory-access-optimization-oriented sparse convolutional neural network accelerator of claim 2, wherein the input neuron and weight data in the input neuron and weight channels are each sized at 16 × 1 × 128 bytes.
4. The memory-access-optimization-oriented sparse convolutional neural network accelerator of claim 2, wherein the data-screening threshold T in the threshold setting module is set to zero and is used to screen out the non-zero activation value data.
5. The memory-access-optimization-oriented sparse convolutional neural network accelerator of any one of claims 1-4, wherein the buffer module CBUF comprises:
a counting module, used for determining when the accelerator initiates a memory-access operation; the counter is set to 2 bits to identify when the accelerator starts to execute a memory-access operation;
and an identification module, used for identifying whether the current convolution operation is finished: when the flag is 0, the current convolution operation is not finished; otherwise, it is finished.
6. The memory-access-optimization-oriented sparse convolutional neural network accelerator of claim 5, wherein the counting module comprises a counting component Stripe Count used to implement the counting function across convolution operations, i.e., distinguishing the first convolution operation, the second convolution operation, and so on; the first convolution operation has no data reuse, and after it completes, the front portion of its data segment is discarded while the rear portion is retained for reuse by the next convolution operation, and so on.
7. The memory-access-optimization-oriented sparse convolutional neural network accelerator of claim 6, wherein the identification module comprises an identification component C_Flag set to 0 or 1; a value of 0 indicates that the current convolution operation cannot complete because data are missing, and otherwise the current convolution operation is completed; the convolution flag of the second convolution operation is set to 0, and, like the second, the convolution flag of the third convolution operation is also set to 0; by the fourth convolution operation, the total amount of missing activation data has accumulated to the activation data amount required by one convolution operation; at that moment a read operation is initiated and the new data are all read into the Buffer, where they are stored in the CBUF cache in order of use.
8. The memory-access-optimization-oriented sparse convolutional neural network accelerator of any one of claims 1-4, wherein the operation module CMAC is configured to perform the multiply-add operations of the convolution; there are 16 CMAC groups in total, each with a 64-bit data input; the input neuron data are the same for every group, while the input weight data differ.
9. The memory-access-optimization-oriented sparse convolutional neural network accelerator of claim 8, wherein the operation module CMAC comprises a three-stage pipeline:
the first stage is a multiplier layer containing 16 multipliers, each with a 64-bit input;
the second stage is an adder layer with 16 input ports;
the third stage is a processing unit used for completing the non-linear processing of the activation data;
and the multiplier layer is mapped in a weight-stationary dataflow, i.e., a weight channel updates its weight elements only after all the activation data corresponding to that group of weights have been traversed.
10. The memory-access-optimization-oriented sparse convolutional neural network accelerator of any one of claims 1-4, wherein the buffer module PB stores the input weight data; 16 PB modules are provided in total, distributed across different PE units; the weight storage totals 256 KB of SRAM, with each PB consisting of 16 banks of 1 KB; and each bank consists of a 64-bit-wide, 128-entry dual-port SRAM.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110845980.5A CN113592072B (en) | 2021-07-26 | 2021-07-26 | Sparse convolutional neural network accelerator for memory optimization |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110845980.5A CN113592072B (en) | 2021-07-26 | 2021-07-26 | Sparse convolutional neural network accelerator for memory optimization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113592072A true CN113592072A (en) | 2021-11-02 |
CN113592072B CN113592072B (en) | 2024-05-14 |
Family
ID=78250125
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110845980.5A Active CN113592072B (en) | 2021-07-26 | 2021-07-26 | Sparse convolutional neural network accelerator for memory optimization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113592072B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180046900A1 (en) * | 2016-08-11 | 2018-02-15 | Nvidia Corporation | Sparse convolutional neural network accelerator |
CN111126569A (en) * | 2019-12-18 | 2020-05-08 | 中电海康集团有限公司 | Convolutional neural network device supporting pruning sparse compression and calculation method |
CN112418396A (en) * | 2020-11-20 | 2021-02-26 | 北京工业大学 | Sparse activation perception type neural network accelerator based on FPGA |
Non-Patent Citations (1)
Title |
---|
Xu Rui, Ma Sheng, Guo Yang, Huang You, Li Yihuang: "Design and Research of a Convolutional Neural Network Accelerator Based on the Winograd Sparse Algorithm", Computer Engineering and Science, vol. 41, no. 9 *
Also Published As
Publication number | Publication date |
---|---|
CN113592072B (en) | 2024-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021004366A1 (en) | Neural network accelerator based on structured pruning and low-bit quantization, and method | |
JP7469407B2 (en) | Exploiting sparsity of input data in neural network computation units | |
JP6857286B2 (en) | Improved performance of neural network arrays | |
Yin et al. | An energy-efficient reconfigurable processor for binary-and ternary-weight neural networks with flexible data bit width | |
CN105843775B (en) | On piece data divide reading/writing method, system and its apparatus | |
Chen et al. | Persistent homology computation with a twist | |
WO2018205708A1 (en) | Processing system and method for binary weight convolutional network | |
CN112465110B (en) | Hardware accelerator for convolution neural network calculation optimization | |
CN110738308B (en) | Neural network accelerator | |
JP2024052988A5 (en) | ||
WO2022037257A1 (en) | Convolution calculation engine, artificial intelligence chip, and data processing method | |
JP2018142049A (en) | Information processing apparatus, image recognition apparatus and method of setting parameter for convolution neural network | |
CN107256424A (en) | Three value weight convolutional network processing systems and method | |
CN112734020B (en) | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network | |
CN110580519A (en) | Convolution operation structure and method thereof | |
CN111507465A (en) | Configurable convolutional neural network processor circuit | |
CN116720549A (en) | FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache | |
Li et al. | Winograd algorithm for addernet | |
CN112200310B (en) | Intelligent processor, data processing method and storage medium | |
CN113592072A (en) | Sparse convolution neural network accelerator oriented to memory access optimization | |
Yang et al. | A parallel processing cnn accelerator on embedded devices based on optimized mobilenet | |
CN110766136B (en) | Compression method of sparse matrix and vector | |
CN113592075B (en) | Convolution operation device, method and chip | |
TW202215300A (en) | Convolutional neural network operation method and device | |
CN112836793A (en) | Floating point separable convolution calculation accelerating device, system and image processing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |