CN112465110A - Hardware accelerator for convolution neural network calculation optimization

Info

Publication number
CN112465110A
Authority
CN
China
Prior art keywords
neural network
data
kernel
convolution
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011279360.1A
Other languages
Chinese (zh)
Other versions
CN112465110B
Inventor
曹学成
廖湘萍
丁永林
李炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 52 Research Institute
Original Assignee
CETC 52 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 52 Research Institute filed Critical CETC 52 Research Institute
Priority to CN202011279360.1A priority Critical patent/CN112465110B/en
Publication of CN112465110A publication Critical patent/CN112465110A/en
Application granted granted Critical
Publication of CN112465110B publication Critical patent/CN112465110B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a hardware acceleration device for convolutional neural network computation optimization, which comprises a parameter storage module, a scheduling control module, and a plurality of acceleration kernel modules, each acceleration kernel module comprising an input image cache unit, a weight cache unit, a zero-removal processing unit, a multiply-accumulate operation array unit, a rectified linear unit, and an output image cache unit. The invention keeps a simple pipelined and parallel hardware structure while reducing the amount of computation and improving hardware acceleration performance by removing zero values from the input feature map; it keeps the original structure of the convolutional neural network algorithm, requires no extra algorithm-level optimization to reduce computation, avoids irregular network operations, and is therefore suitable for hardware acceleration of various convolutional neural network algorithms.

Description

Hardware accelerator for convolution neural network calculation optimization
Technical Field
The application belongs to the technical field of computers, and particularly relates to a hardware accelerating device for convolutional neural network calculation optimization.
Background
In recent years, deep neural network algorithms have been applied on a large scale in fields such as image processing and audio processing, and have had a great influence on world economic and social activities. Deep convolutional neural network technology has attracted wide attention in many machine learning fields; compared with traditional machine learning algorithms it achieves higher precision, and in some tasks can exceed human-level accuracy.
Generally, the deeper a convolutional neural network (the more layers it has), the more accurate its inference results. At the same time, a deeper network consumes more computing resources. In convolutional neural network architectures, computations within a layer are independent of one another, while computations between layers resemble a pipeline structure; such a workload is inefficient to implement on a general-purpose processor. Because of this special computation pattern, convolutional neural networks are particularly well suited to hardware acceleration.
Deep neural networks have the advantage of high accuracy but the disadvantage of an enormous amount of computation, so reducing the computation of convolutional neural networks has long been a popular research direction in the field of artificial intelligence. How to keep a simple pipelined and parallel structure, without adding extra preprocessing, while remaining compatible with many deep neural network algorithms and reducing the amount of computation, is the current difficulty of hardware acceleration.
Disclosure of Invention
The aim of the application is to provide a hardware acceleration device for convolutional neural network computation optimization, which can significantly reduce the amount of convolution computation and improve hardware acceleration performance.
To achieve this aim, the technical solution adopted by the application is as follows:
Provided is a hardware acceleration device for convolutional neural network computation optimization, the device comprising a parameter storage module, a scheduling control module, and a plurality of acceleration kernel modules, each acceleration kernel module comprising an input image cache unit, a weight cache unit, a zero-removal processing unit, a multiply-accumulate operation array unit, a rectified linear unit, and an output image cache unit, wherein:
the parameter storage module is used for caching the convolutional neural network to be accelerated and its corresponding convolution kernels;
the scheduling control module is used for load-balancing the computation across the plurality of acceleration kernel modules, detecting idle acceleration kernel modules, and distributing input feature map data to be processed to an idle acceleration kernel module;
the input image cache unit is used for receiving and caching the input feature map data fed into the acceleration kernel module;
the weight cache unit is used for receiving and caching the convolution kernel output by the parameter storage module;
the zero-removal processing unit is used for removing zero values from the input feature map data;
the multiply-accumulate operation array unit is used for the multiply-accumulate operations between the weight data in the convolution kernel and the zero-removed input feature map data, and for outputting the convolution operation result;
the rectified linear unit is used for correcting negative numbers in the convolution operation result to zero to obtain a corrected result;
and the output image cache unit is used for caching the corrected result as output feature map data, which serves as the input feature map data for the next layer's convolution operation.
Preferably, the maximum amount of data that the multiply-accumulate operation performed by an acceleration kernel module can directly process at one time is: a convolution operation between an input feature map of size C × R × N and convolution kernels of size W × H × N × M, where C represents the width of the image, R represents the height of the image, N represents the number of channels, W represents the width of the convolution kernel, H represents the height of the convolution kernel, and M represents the number of sets of convolution kernels.
Preferably, the input image cache unit is a first random access memory for caching input feature map data; the first random access memory has C × R address spaces in total, and each address space stores the data of the N channels of one pixel.
Preferably, the weight cache unit is a second random access memory for caching the weight data; the second random access memory has W × H × N address spaces in total, and each address space stores the weight data of the M sets of convolution kernels at one point.
Preferably, the multiply-accumulate array unit comprises M parallel MAC units, each MAC unit implementing the multiply-accumulate operation of the input feature map data with the weight data of one set of convolution kernels.
Preferably, if the size of the input feature map to be processed is C′ × R′ × N′, where C′ represents the width of the image to be processed, R′ represents the height of the image to be processed, and N′ represents the number of channels of the image to be processed:
if N′ > N, the input image cache unit uses several consecutive address spaces to store the data of the N′ channels of one pixel; and if C′ × R′ > C × R, the input feature map to be processed is split into several blocks of size C × R × N, which are distributed to several acceleration kernel modules for operation.
Preferably, in the parameter storage module, if the size of the convolution kernels to be processed is W′ × H′ × N′ × M′, where W′ represents the width of the convolution kernel to be processed, H′ represents the height of the convolution kernel to be processed, N′ represents the number of channels of the convolution kernel to be processed, and M′ represents the number of sets of convolution kernels to be processed:
if M′ > M, the convolution kernels are split into several groups of M sets of convolution kernels, which are distributed to several acceleration kernel modules for operation;
or, if M′ > M, the weight cache unit stores the weight data of the M′ sets of convolution kernels at one point using consecutive addresses.
Compared with the prior art, the hardware acceleration device for convolutional neural network computation optimization of the present application has the following beneficial effects:
(1) The application keeps a simple pipelined and parallel hardware structure, reduces the amount of computation by removing zero values from the input feature map, and improves hardware acceleration performance.
(2) The application keeps the original structure of the convolutional neural network algorithm, requires no extra algorithm-level optimization to reduce computation, avoids irregular network operations, and is suitable for hardware acceleration of various convolutional neural network algorithms.
Drawings
FIG. 1 is a schematic structural diagram of a hardware acceleration apparatus for convolutional neural network computational optimization according to the present application;
FIG. 2 is a schematic diagram illustrating a storage manner of an input feature map according to the present application;
FIG. 3 is a schematic diagram of a zeroing process for an input feature map of the present application;
FIG. 4 is a schematic diagram of a convolution kernel weight data storage method according to the present application;
FIG. 5 is a schematic diagram of the operation of convolution multiply accumulate operation of the present application;
FIG. 6 is a schematic diagram of an input/output feature image storage method according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In one embodiment, a hardware acceleration device for convolutional neural network computation optimization is provided, which solves the problems that a conventional convolutional neural network consumes large amounts of computing resources and is inefficient to implement on a general-purpose processor.
As shown in fig. 1, the hardware acceleration device for convolutional neural network computation optimization of this embodiment includes a parameter storage module, a scheduling control module, and a plurality of acceleration kernel modules; each acceleration kernel module includes an input image cache unit, a weight cache unit, a zero-removal processing unit, a multiply-accumulate operation array unit, a rectified linear unit, and an output image cache unit.
In the convolution calculation process, the function of each module and unit is as follows:
The parameter storage module is used for caching the convolutional neural network to be accelerated and its corresponding convolution kernels. In practical applications, the parameter storage module actually stores compiled network-layer information; for example, a YOLO-series or MobileNet-series network may be stored as the neural network whose convolution operations are to be accelerated and optimized.
The scheduling control module is used for load-balancing the computation across the plurality of acceleration kernel modules, detecting idle acceleration kernel modules, and distributing input feature map data to be processed (i.e., a new operation request) to an idle acceleration kernel module. In this embodiment, using the scheduling control module to supervise the working state of all acceleration kernel modules effectively improves the utilization of each module and reduces computation waiting time. It is easy to understand that detecting whether an acceleration kernel module is idle is a conventional technique, which may be done through a flag or through its state, and is not described here again.
The input image cache unit is used for receiving and caching the input feature map data fed into the acceleration kernel module, and the weight cache unit is used for receiving and caching the convolution kernel output by the parameter storage module.
The zero-removal processing unit is used for removing zero values from the input feature map data.
The multiply-accumulate operation array unit is used for the multiply-accumulate operations between the weight data in the convolution kernel and the zero-removed input feature map data, and for outputting the convolution operation result.
The rectified linear unit (i.e., the ReLU activation function) is used for correcting negative numbers in the convolution operation result to zero to obtain a corrected result. In this embodiment, applying the ReLU activation function after each layer's convolution sets all negative values in the convolution results to zero, which makes it convenient to remove the zeros from each layer's input feature map data and can greatly reduce the amount of computation of the algorithm.
The output image cache unit is used for caching the corrected result as output feature map data, which serves as the input feature map data for the next layer's convolution operation.
The hardware acceleration device for convolutional neural network computation optimization keeps the original pipelined and parallel structure of the convolutional neural network algorithm while, in combination with the parameter storage module, switching among and storing multiple convolutional neural networks, so that it is compatible with the accelerated computation of various convolutional neural networks. Meanwhile, it exploits the property of the ReLU activation function that negative values become zero: because the negative values in the current layer's output feature map data are zeroed, the zero data can be removed when the next layer performs its convolution operation, reducing the amount of computation and improving hardware acceleration performance.
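As a minimal illustration of why ReLU creates skippable work (a sketch with hypothetical sizes, not the patented implementation), the following Python fragment applies ReLU to a random convolution output and counts the zero activations that the next layer's zero-removal unit could skip:

    import numpy as np

    rng = np.random.default_rng(0)
    conv_out = rng.standard_normal((20, 20, 32))   # hypothetical raw convolution output
    relu_out = np.maximum(conv_out, 0.0)           # rectified linear unit: negatives -> 0

    zero_frac = float(np.mean(relu_out == 0.0))
    print(f"fraction of zero activations after ReLU: {zero_frac:.2f}")
    # each zero activation lets the next layer skip one full set of M
    # multiply-accumulate operations in the MAC array

For roughly zero-mean data about half of the activations are zeroed, which is the source of the computation savings described above.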
Throughout the convolution calculation process, the input feature map data and convolution kernels are obtained through an input bus, and the final calculation result data are output through an output bus after the convolution calculation is finished. Realizing data transmission over buses gives high transmission stability and integrity.
It is easy to understand that if the whole convolution calculation is not yet finished, the output feature map data of the current layer will be used as the input feature map data for the next layer's convolution operation; once the whole convolution calculation is finished, the output feature map data of the current layer are output through the output bus.
To improve the accuracy of the convolution operation, in another embodiment the hardware acceleration device further includes an offset (bias) cache unit, which is configured to cache offset data and provide offset compensation for the convolution calculation. Since the weight data and the input feature map data are operated on in order, the zero-removal processing unit must also control the read address of the weight cache unit; if the hardware acceleration device includes an offset cache unit, the zero-removal processing unit likewise controls the read address of the offset cache unit. Controlling read addresses for offsets is a conventional technique in convolution operations and is not described in detail in this embodiment.
Specifically, to keep the hardware structure simple, each acceleration kernel module has a maximum data size that it can process at one time, which is also the maximum storage size of the input image cache unit and the weight cache unit. In one embodiment, the maximum amount of data that the multiply-accumulate operation performed by each acceleration kernel module can directly process is set as: a convolution operation between an input feature map of size C × R × N and convolution kernels of size W × H × N × M, where C represents the width of the image, R represents the height of the image, N represents the number of channels, W represents the width of the convolution kernel, H represents the height of the convolution kernel, and M represents the number of sets of convolution kernels.
As shown in fig. 2, when the input feature map and the weight data are obtained and stored, the input image cache unit is a first random access memory (RAM) for caching input feature map data; the first RAM has C × R address spaces in total, and each address space stores the data of the N channels of one pixel.
As shown in fig. 3, when removing the zero values from the input feature map, the zero-removal processing unit processes the contents of each address space of the input image cache unit in turn, compressing the original N data into L data, where N ≥ L.
As shown in fig. 4, the weight cache unit is a second random access memory (RAM) for caching weight data; the second RAM has W × H × N address spaces in total, and each address space stores the weight data of the M sets of convolution kernels at one point.
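The storage layouts of FIGS. 2-4 and the zero-removal step can be modeled in a few lines of Python. This is a behavioral sketch only; the array names, the address formulas, and the tile sizes are illustrative assumptions, not taken from the patent beyond the C × R and W × H × N address counts it specifies:

    import numpy as np

    C, R, N, M, W, H = 4, 4, 8, 3, 3, 3       # illustrative tile sizes only

    # FIG. 2: the first RAM has C*R addresses; address r*C + c is assumed to
    # hold the N channel values of the pixel at (row r, column c)
    ifmap = np.random.default_rng(1).standard_normal((R, C, N)).clip(min=0)  # post-ReLU, so >= 0
    ifmap_ram = ifmap.reshape(R * C, N)

    # FIG. 3: zero removal compresses one address's N values into L <= N
    # (value, channel) pairs that are streamed serially to the MAC array
    word = ifmap_ram[0]
    channels = np.nonzero(word)[0]            # surviving channel indices
    pairs = list(zip(word[channels], channels))
    print(f"N = {N} values compressed to L = {len(pairs)}")

    # FIG. 4: the second RAM has W*H*N addresses; address (h*W + w)*N + n is
    # assumed to hold the M kernels' weights for kernel position (w, h), channel n
    weights = np.random.default_rng(2).standard_normal((H, W, N, M))
    weight_ram = weights.reshape(W * H * N, M)

Keeping the channel index alongside each surviving value is what lets the MAC array fetch the matching weight word for every skipped-ahead operand.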
To ensure that the multiply-accumulate operation array can compute in parallel at the single-pass maximum data amount, in one embodiment the multiply-accumulate operation array unit comprises M parallel MAC (multiply-accumulate) units, each of which implements the multiply-accumulate operation of the input feature map data with the weight data of one set of convolution kernels; this guarantees maximum parallelism of the convolution algorithm and significantly improves operation efficiency.
The working flow of the hardware acceleration device for calculating and optimizing the convolutional neural network of the present application is further described by the following embodiments.
Example 1
When the convolution calculation is performed, an idle acceleration kernel module is selected to carry out the work according to the idle state of the acceleration kernels.
An input feature image of C × R × N is acquired from the input bus, where C represents the width of the image, R represents the height of the image, and N represents the number of channels of the image; each address space in the input image cache unit RAM stores the data of the N channels of one pixel.
Convolution kernels of W × H × N × M are acquired from the input bus, where W represents the width of the convolution kernels, H represents the height of the convolution kernels, N represents the number of channels of the convolution kernels, and M represents the number of sets of convolution kernels; each address space in the weight cache unit RAM stores the weight data of the M sets of convolution kernels at one point.
N data are read in parallel from the input image cache unit (i.e., the data in one address space), and the zero-removal processing unit removes the zero data and serializes the remainder into L data.
As shown in fig. 5, in each cycle the multiply-accumulate operation array unit takes one of the L data as the operand to be calculated, together with the weight data, in each of the M sets of convolution kernels, whose channel index matches that operand, and performs the multiply-accumulate operation. The multiply-accumulate operation array consists of M parallel MAC units, each of which implements the multiply-accumulate operation of one set of convolution kernels with the input feature map data. After W × H × L cycles, the output feature map data of the M channels of one pixel are produced.
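The following self-contained Python sketch models this cycle-level behavior for one output pixel under the same illustrative assumptions as above (random data, the (h·W + w)·N + n weight-address formula); it checks that zero-skipping produces the same result as the dense convolution while issuing fewer cycles:

    import numpy as np

    W, H, N, M = 3, 3, 8, 4                     # illustrative sizes only
    rng = np.random.default_rng(0)
    patch = rng.standard_normal((H, W, N)).clip(min=0)   # post-ReLU input window
    weight_ram = rng.standard_normal((W * H * N, M))     # address (h*W + w)*N + n -> M weights

    acc = np.zeros(M)          # M parallel MAC units (an offset/bias unit would preload acc)
    cycles = 0
    for h in range(H):
        for w in range(W):
            word = patch[h, w]                           # one input-RAM word (N channels)
            for n in np.nonzero(word)[0]:                # only the L nonzero values are issued
                acc += word[n] * weight_ram[(h * W + w) * N + n]
                cycles += 1                              # one value per cycle, M MACs in parallel

    dense = np.einsum('hwn,hwnm->m', patch, weight_ram.reshape(H, W, N, M))
    assert np.allclose(acc, dense)                       # zero-skipping only changes the cycle count
    print(f"{cycles} cycles instead of {W * H * N}")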
As shown in fig. 6, the output image cache unit is also a RAM (a third random access memory); each address space in the third RAM stores the data of the M channels of one pixel. The output feature map data are cached, the negative values in them are changed to zero by the ReLU activation function, one layer of convolution operation is thus completed, and the result is output through the output bus.
The above is the calculation process of the hardware acceleration device of this embodiment for a single maximum calculation amount. It should be noted that if the input data exceed the single-pass maximum calculation amount, they need to be preprocessed before the convolution operation:
for the input feature map: if the size of the input feature map to be processed is C '. R'. N ', wherein C' represents the width of the image to be processed, R 'represents the height of the image to be processed, and N' represents the number of channels of the image to be processed; and if the N '> N is obtained, the input image cache unit stores the data of N' channels of one pixel point by using a plurality of continuous address spaces, and adopts N '/N (integer) accelerating kernel modules to operate the input characteristic diagram, and when the N'/N is calculated, the integer is obtained by adding 1 to the quotient if the remainder is not zero.
For example, if the maximum storable input feature size C × R × N of the input image cache unit is 20 × 20 × 32, then when N′ is within 32, C × R points, i.e., 20 × 20, may be stored; when N′ is 64, or greater than 32 and less than 64, C × R × N/N′ points may be stored, i.e., 20 × 20/2; other values of N′ follow the same pattern.
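A one-line Python sketch of this rounded-up quotient (the helper name is ours, not the patent's):

    def kernels_needed(n_prime: int, n: int) -> int:
        """Acceleration kernel modules needed for N' input channels when one
        module natively handles N: the quotient N'/N rounded up."""
        return -(-n_prime // n)              # ceiling division

    assert kernels_needed(32, 32) == 1
    assert kernels_needed(33, 32) == 2       # remainder nonzero -> quotient + 1
    assert kernels_needed(64, 32) == 2       # the 20 x 20 x 32 example above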
And if C′ × R′ > C × R, the input feature image is split into several blocks of size C × R × N and distributed to several acceleration kernel modules for operation. When splitting the input feature image, if C′ × R′ is not an integer multiple of C × R, the outer edge of the C′ × R′ map may be zero-padded first and then split.
For example, when C′ × R′ is 6 × 8 and C × R is 2 × 3, the outer edge of the C′ × R′ map needs zero-padding to reach a 6 × 9 specification before the splitting is performed (see the sketch below). It should be noted that in practical applications, for rigor, the splitting of the input feature image may need to account for overlapping borders between adjacent blocks.
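A hedged sketch of this zero-pad-then-split step, reproducing the 6 × 8 to 6 × 9 example (the helper name and the use of NumPy are our assumptions; border overlap is deliberately ignored here):

    import numpy as np

    def pad_to_tiles(fmap: np.ndarray, tile_h: int, tile_w: int) -> np.ndarray:
        """Zero-pad the outer edge so the map splits into whole tile_h x tile_w blocks."""
        h, w = fmap.shape[:2]
        pad_h, pad_w = (-h) % tile_h, (-w) % tile_w
        return np.pad(fmap, ((0, pad_h), (0, pad_w)) + ((0, 0),) * (fmap.ndim - 2))

    fmap = np.ones((8, 6, 3))                         # C' x R' = 6 x 8 (height 8, width 6)
    padded = pad_to_tiles(fmap, tile_h=3, tile_w=2)   # C x R = 2 x 3 tiles
    assert padded.shape[:2] == (9, 6)                 # height 8 -> 9, i.e. 6 x 8 becomes 6 x 9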
The splitting of the input feature image can be done by a splitting module on the CPU side; after the splitting module performs the split, each data block is sent to the scheduling control module for distribution, and jump addressing is supported for reading the external data. It should be noted that the splitting module may be part of the hardware acceleration device of this embodiment or an external module connected to it. Of course, the splitting of the input feature image can also be done by the scheduling control module.
For the convolution kernels: if the size of the convolution kernels to be processed is W′ × H′ × N′ × M′, where W′ represents the width of the convolution kernel to be processed, H′ represents its height, N′ represents its number of channels, and M′ represents the number of sets of convolution kernels to be processed, then if M′ > M, the convolution kernels can be split into several groups of M sets and distributed to several acceleration kernel modules for operation; this split can likewise be done by the same or a similar splitting module, or by the parameter storage module. Alternatively, if M′ > M, the weight cache unit uses several consecutive addresses to store the weight data of the M′ sets of convolution kernels at one point, similar to the handling of an input feature map with N′ > N, which is not described again.
After the input feature map data or the convolution kernels are split, each acceleration kernel module, having processed its operation request, writes its output feature map to addresses in an external SDRAM cache, where the pieces are stitched together. Each row of output data from each acceleration kernel module jumps addresses: for example, if the input feature image is split left and right and the two outputs are each 10 × 20 (ignoring channels for the moment), then addresses 0-9 store the data of acceleration kernel module 1, addresses 10-19 store the data of acceleration kernel module 2, addresses 20-29 store the data of module 1, addresses 30-39 store the data of module 2, and so on (a sketch follows).
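A minimal software model of this address-jumping stitch for the left/right split above (the Python list stands in for the external SDRAM cache; the function name is hypothetical):

    WIDTH, HALF, ROWS = 20, 10, 4                # full row width, half width, rows shown
    sdram = [None] * (WIDTH * ROWS)              # stand-in for the external SDRAM cache

    def write_half(module_id: int, col_offset: int) -> None:
        """Each module writes HALF pixels per row, then jumps a full WIDTH of addresses."""
        for row in range(ROWS):
            base = row * WIDTH + col_offset
            for col in range(HALF):
                sdram[base + col] = (module_id, row, col)

    write_half(1, 0)       # module 1: addresses 0-9, 20-29, 40-49, ...
    write_half(2, HALF)    # module 2: addresses 10-19, 30-39, 50-59, ...
    assert all(v is not None for v in sdram)     # the halves interleave into one image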
In this embodiment, a parallel computing structure is adopted while the multiply-accumulate operations whose input image data are zero are skipped, thereby achieving the purpose of accelerating the neural network operation.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (7)

1. A hardware acceleration device for convolutional neural network computation optimization, comprising a parameter storage module, a scheduling control module, and a plurality of acceleration kernel modules, each acceleration kernel module comprising an input image cache unit, a weight cache unit, a zero-removal processing unit, a multiply-accumulate operation array unit, a rectified linear unit, and an output image cache unit, wherein:
the parameter storage module is used for caching the convolutional neural network to be accelerated and its corresponding convolution kernels;
the scheduling control module is used for load-balancing the computation across the plurality of acceleration kernel modules, detecting idle acceleration kernel modules, and distributing input feature map data to be processed to an idle acceleration kernel module;
the input image cache unit is used for receiving and caching the input feature map data fed into the acceleration kernel module;
the weight cache unit is used for receiving and caching the convolution kernel output by the parameter storage module;
the zero-removal processing unit is used for removing zero values from the input feature map data;
the multiply-accumulate operation array unit is used for the multiply-accumulate operations between the weight data in the convolution kernel and the zero-removed input feature map data, and for outputting the convolution operation result;
the rectified linear unit is used for correcting negative numbers in the convolution operation result to zero to obtain a corrected result;
and the output image cache unit is used for caching the corrected result as output feature map data, which serves as the input feature map data for the next layer's convolution operation.
2. The hardware acceleration device for convolutional neural network computation optimization of claim 1, wherein the maximum amount of data that the multiply-accumulate operation performed by the acceleration kernel module can directly process at one time is: a convolution operation between an input feature map of size C × R × N and convolution kernels of size W × H × N × M, where C represents the width of the image, R represents the height of the image, N represents the number of channels, W represents the width of the convolution kernel, H represents the height of the convolution kernel, and M represents the number of sets of convolution kernels.
3. The hardware acceleration device for convolutional neural network computation optimization of claim 2, wherein the input image cache unit is a first random access memory for caching input feature map data; the first random access memory has C × R address spaces in total, and each address space stores the data of the N channels of one pixel.
4. The hardware acceleration device for convolutional neural network computation optimization of claim 2, wherein the weight cache unit is a second random access memory for caching weight data; the second random access memory has W × H × N address spaces in total, and each address space stores the weight data of the M sets of convolution kernels at one point.
5. The hardware acceleration device for convolutional neural network computation optimization of claim 2, wherein the multiply-accumulate array unit comprises M parallel MAC units, each MAC unit implementing the multiply-accumulate operation of the input feature map data with the weight data of one set of convolution kernels.
6. The hardware acceleration device for convolutional neural network computation optimization of claim 3, wherein if the size of the input feature map to be processed is C′ × R′ × N′, where C′ represents the width of the image to be processed, R′ represents the height of the image to be processed, and N′ represents the number of channels of the image to be processed:
if N′ > N, the input image cache unit uses several consecutive address spaces to store the data of the N′ channels of one pixel; and if C′ × R′ > C × R, the input feature map to be processed is split into several blocks of size C × R × N, which are distributed to several acceleration kernel modules for operation.
7. The hardware acceleration device for convolutional neural network computation optimization of claim 4, wherein, in the parameter storage module, if the size of the convolution kernels to be processed is W′ × H′ × N′ × M′, where W′ represents the width of the convolution kernel to be processed, H′ represents the height of the convolution kernel to be processed, N′ represents the number of channels of the convolution kernel to be processed, and M′ represents the number of sets of convolution kernels to be processed:
if M′ > M, the convolution kernels are split into several groups of M sets of convolution kernels, which are distributed to several acceleration kernel modules for operation;
or, if M′ > M, the weight cache unit stores the weight data of the M′ sets of convolution kernels at one point using consecutive addresses.
CN202011279360.1A 2020-11-16 2020-11-16 Hardware accelerator for convolution neural network calculation optimization Active CN112465110B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011279360.1A CN112465110B (en) 2020-11-16 2020-11-16 Hardware accelerator for convolution neural network calculation optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011279360.1A CN112465110B (en) 2020-11-16 2020-11-16 Hardware accelerator for convolution neural network calculation optimization

Publications (2)

Publication Number Publication Date
CN112465110A true CN112465110A (en) 2021-03-09
CN112465110B CN112465110B (en) 2022-09-13

Family

ID=74836284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011279360.1A Active CN112465110B (en) 2020-11-16 2020-11-16 Hardware accelerator for convolution neural network calculation optimization

Country Status (1)

Country Link
CN (1) CN112465110B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435586A (en) * 2021-08-03 2021-09-24 北京大学深圳研究生院 Convolution operation device and system for convolution neural network and image processing device
CN113487017A (en) * 2021-07-27 2021-10-08 湖南国科微电子股份有限公司 Data convolution processing method and device and computer equipment
CN113642724A (en) * 2021-08-11 2021-11-12 西安微电子技术研究所 CNN accelerator with high bandwidth storage
CN114169514A (en) * 2022-02-14 2022-03-11 浙江芯昇电子技术有限公司 Convolution hardware acceleration method and convolution hardware acceleration circuit
CN114254740A (en) * 2022-01-18 2022-03-29 长沙金维信息技术有限公司 Convolution neural network accelerated calculation method, calculation system, chip and receiver
CN114997392A (en) * 2022-08-03 2022-09-02 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
WO2023030061A1 (en) * 2021-09-03 2023-03-09 Oppo广东移动通信有限公司 Convolution operation circuit and method, neural network accelerator and electronic device
CN116152520A (en) * 2023-04-23 2023-05-23 深圳市九天睿芯科技有限公司 Data processing method for neural network accelerator, chip and electronic equipment
CN117391149A (en) * 2023-11-30 2024-01-12 爱芯元智半导体(宁波)有限公司 Processing method, device and chip for neural network output data

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951395A (en) * 2017-02-13 2017-07-14 上海客鹭信息技术有限公司 Towards the parallel convolution operations method and device of compression convolutional neural networks
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110390382A (en) * 2019-06-20 2019-10-29 东南大学 A kind of convolutional neural networks hardware accelerator with novel feature figure cache module
KR102038390B1 (en) * 2018-07-02 2019-10-31 한양대학교 산학협력단 Artificial neural network module and scheduling method thereof for highly effective parallel processing
CN110622206A (en) * 2017-06-15 2019-12-27 三星电子株式会社 Image processing apparatus and method using multi-channel feature map
CN110826707A (en) * 2018-08-10 2020-02-21 北京百度网讯科技有限公司 Acceleration method and hardware accelerator applied to convolutional neural network
US20200167405A1 (en) * 2018-11-28 2020-05-28 Electronics And Telecommunications Research Institute Convolutional operation device with dimensional conversion
CN111242277A (en) * 2019-12-27 2020-06-05 中国电子科技集团公司第五十二研究所 Convolutional neural network accelerator supporting sparse pruning and based on FPGA design
US20200192726A1 (en) * 2018-12-12 2020-06-18 Samsung Electronics Co., Ltd. Method and apparatus for load balancing in neural network

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951395A (en) * 2017-02-13 2017-07-14 上海客鹭信息技术有限公司 Towards the parallel convolution operations method and device of compression convolutional neural networks
CN110622206A (en) * 2017-06-15 2019-12-27 三星电子株式会社 Image processing apparatus and method using multi-channel feature map
KR102038390B1 (en) * 2018-07-02 2019-10-31 한양대학교 산학협력단 Artificial neural network module and scheduling method thereof for highly effective parallel processing
CN110826707A (en) * 2018-08-10 2020-02-21 北京百度网讯科技有限公司 Acceleration method and hardware accelerator applied to convolutional neural network
US20200167405A1 (en) * 2018-11-28 2020-05-28 Electronics And Telecommunications Research Institute Convolutional operation device with dimensional conversion
US20200192726A1 (en) * 2018-12-12 2020-06-18 Samsung Electronics Co., Ltd. Method and apparatus for load balancing in neural network
CN110390382A (en) * 2019-06-20 2019-10-29 东南大学 A kind of convolutional neural networks hardware accelerator with novel feature figure cache module
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN111242277A (en) * 2019-12-27 2020-06-05 中国电子科技集团公司第五十二研究所 Convolutional neural network accelerator supporting sparse pruning and based on FPGA design

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUIMIN LI ET AL: "A high performance FPGA-based accelerator for large-scale convolutional neural networks", Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '16), IEEE *
FANG RUI ET AL: "Design of an FPGA parallel acceleration scheme for convolutional neural networks", Computer Engineering and Applications *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487017A (en) * 2021-07-27 2021-10-08 湖南国科微电子股份有限公司 Data convolution processing method and device and computer equipment
CN113435586A (en) * 2021-08-03 2021-09-24 北京大学深圳研究生院 Convolution operation device and system for convolution neural network and image processing device
CN113642724A (en) * 2021-08-11 2021-11-12 西安微电子技术研究所 CNN accelerator with high bandwidth storage
WO2023030061A1 (en) * 2021-09-03 2023-03-09 Oppo广东移动通信有限公司 Convolution operation circuit and method, neural network accelerator and electronic device
CN114254740B (en) * 2022-01-18 2022-09-30 长沙金维信息技术有限公司 Convolution neural network accelerated calculation method, calculation system, chip and receiver
CN114254740A (en) * 2022-01-18 2022-03-29 长沙金维信息技术有限公司 Convolution neural network accelerated calculation method, calculation system, chip and receiver
CN114169514A (en) * 2022-02-14 2022-03-11 浙江芯昇电子技术有限公司 Convolution hardware acceleration method and convolution hardware acceleration circuit
CN114997392B (en) * 2022-08-03 2022-10-21 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
CN114997392A (en) * 2022-08-03 2022-09-02 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
CN116152520A (en) * 2023-04-23 2023-05-23 深圳市九天睿芯科技有限公司 Data processing method for neural network accelerator, chip and electronic equipment
CN116152520B (en) * 2023-04-23 2023-07-07 深圳市九天睿芯科技有限公司 Data processing method for neural network accelerator, chip and electronic equipment
CN117391149A (en) * 2023-11-30 2024-01-12 爱芯元智半导体(宁波)有限公司 Processing method, device and chip for neural network output data
CN117391149B (en) * 2023-11-30 2024-03-26 爱芯元智半导体(宁波)有限公司 Processing method, device and chip for neural network output data

Also Published As

Publication number Publication date
CN112465110B (en) 2022-09-13

Similar Documents

Publication Publication Date Title
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN111242277B (en) Convolutional neural network accelerator supporting sparse pruning based on FPGA design
US20210357736A1 (en) Deep neural network hardware accelerator based on power exponential quantization
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN108229671B (en) System and method for reducing storage bandwidth requirement of external data of accelerator
CN112668708B (en) Convolution operation device for improving data utilization rate
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
WO2022134465A1 (en) Sparse data processing method for accelerating operation of re-configurable processor, and device
CN112257844B (en) Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof
CN110069444A (en) A kind of computing unit, array, module, hardware system and implementation method
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
CN108197075B (en) Multi-core implementation method of Inceptation structure
CN209708122U (en) A kind of computing unit, array, module, hardware system
CN114491402A (en) Calculation method for sparse matrix vector multiplication access optimization
CN108647780B (en) Reconfigurable pooling operation module structure facing neural network and implementation method thereof
CN111860819B (en) Spliced and sectionable full-connection neural network reasoning accelerator and acceleration method thereof
CN113516236A (en) VGG16 network parallel acceleration processing method based on ZYNQ platform
CN110766136B (en) Compression method of sparse matrix and vector
KR20220101418A (en) Low power high performance deep-neural-network learning accelerator and acceleration method
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
CN115204364A (en) Convolution neural network hardware accelerating device for dynamic allocation of cache space
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN114692079A (en) GPU batch matrix multiplication accelerator and processing method thereof
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant