CN211554991U - Convolutional neural network reasoning accelerator - Google Patents

Convolutional neural network reasoning accelerator Download PDF

Info

Publication number
CN211554991U
CN211554991U (utility model) — Application CN202020683437.0U
Authority
CN
China
Prior art keywords
module
data
calculation
signal
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202020683437.0U
Other languages
Chinese (zh)
Inventor
李丽
黄延
傅玉祥
宋文清
何书专
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Ningqi Intelligent Computing Chip Research Institute Co ltd
Original Assignee
Nanjing Ningqi Intelligent Computing Chip Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Ningqi Intelligent Computing Chip Research Institute Co ltd filed Critical Nanjing Ningqi Intelligent Computing Chip Research Institute Co ltd
Priority to CN202020683437.0U
Application granted
Publication of CN211554991U

Landscapes

  • Complex Calculations (AREA)

Abstract

The utility model discloses a convolutional neural network reasoning accelerator, which belongs to the field of hardware implementation of artificial intelligence algorithms. Aiming at the problems of high power consumption, poor configurability and low calculation precision in the prior art, the utility model provides a convolutional neural network reasoning accelerator comprising a main control module, an address generation module, an SRAM storage module, a data input module, a calculation engine module and a result output module. The parallel computing units in the calculation engine module are mutually independent, so that the performance advantage of high parallelism is realized on the basis of limited computational and storage resources; the parallel computing units possess fixed-point truncation and turn-off functions, which improves the parallelism and calculation precision of the accelerator and reduces its power consumption.

Description

Convolutional neural network reasoning accelerator
Technical Field
The utility model relates to the field of hardware implementation of artificial intelligence algorithms, and more specifically to a convolutional neural network reasoning accelerator.
Background
The convolutional neural network is a deep feedforward artificial neural network, is one of representative algorithms of deep learning, and has been successfully applied to the fields of computer vision, natural language processing and the like. In a convolutional neural network, convolutional layers can account for more than 80% of the total computation amount and computation time of the whole network, and therefore, acceleration of convolutional layers is the key for improving the performance of the whole CNN network. In the calculation of convolutional layers, there are several parameters: convolution kernel size (K), zero padding mode (Pa), convolution step (St), input image channel number (Ch), convolution kernel number (Nu), input image size (In), and output image size (Ou). According to the above parameters, the size of the output image is:
Ou = (In − K + 2 × Pa) / St + 1 (rounded down to an integer)
according to analysis of convolution operators, convolution mainly relates to multiply-accumulate operation, and because most input data are multi-channel two-dimensional data and the number of convolution kernels is multiple, convolution in deep learning belongs to a typical calculation-intensive algorithm and needs strong calculation force for supporting.
Chinese patent application No. CN201810068051.6, published on 2018-06-19, discloses an accelerator and method for convolutional neural network inference. The accelerator comprises: an input image buffer module containing N buffers for loading input image data, each buffer storing the data of one image row; and N arithmetic units connected to the input image buffer module for performing convolution operations. The N arithmetic units support systolic transfer of image data between adjacent units: the units connected to the buffers read image data from the buffers, while the remaining units read image data from their neighbouring units. This bidirectional systolic array, designed around the data reusability of convolutional neural networks, improves data-loading efficiency and thereby accelerates the network. Its weakness is that each arithmetic unit performs a single calculation and can only process one batch of images per pass; it cannot process multiple batches of input images simultaneously, so its parallelism is insufficient. Moreover, the floating-point operation of its arithmetic units greatly increases the consumption of hardware resources, adding extra area and time overhead.
SUMMARY OF THE UTILITY MODEL
1. Technical problem to be solved
To address the slow calculation speed, high power consumption, poor configurability and limited fixed-point precision of the prior art, the utility model provides a convolutional neural network reasoning accelerator that improves calculation speed, reduces power consumption, improves the configurability of the device, and guarantees the accuracy of the calculation results.
2. Technical scheme
The purpose of the utility model is realized through the following technical scheme.
A convolutional neural network reasoning accelerator comprises a main control module, an address generation module, an SRAM storage module, a data input module, a calculation engine module and a result output module, wherein:
the main control module receives the starting signal and distributes data to the address generation module, the data input module, the calculation engine module and the result output module;
the address generation module receives the data of the main control module and the result output module and sends a control signal to the SRAM storage module;
the SRAM storage module receives the control signal of the address generation module and sends storage data to the data input module;
the data input module receives data of the main control module and the SRAM storage module and sends the data to the calculation engine module;
the calculation engine module receives the data of the main control module and the data input module and sends result data to the result output module;
and the result output module is used for receiving the result data of the calculation engine module and sending the result data to the address generation module.
Further, the calculation engine module comprises a plurality of parallel calculation units.
Furthermore, the parallel computing unit comprises a plurality of multiply-accumulate units and an activation function computing unit, wherein the multiply-accumulate units receive data of the data input module through the data input bus and send the data to the activation function computing unit through the result input bus.
Furthermore, the multiply-accumulate unit comprises a fixed-point multiplier and a fixed-point adder, and the fixed-point multiplier is connected with the fixed-point adder.
Furthermore, the activation function unit comprises a lookup table memory, a fixed point multiplier, a fixed point adder and two registers, wherein the lookup table memory and the first register receive data, the lookup table memory sends the data to the fixed point multiplier and the second register, and the fixed point multiplier and the second register send the data to the fixed point adder.
Furthermore, the data sent by the main control module to the address generation module comprises a start signal and a zero padding signal.
Furthermore, the address generating module receives the zero padding signal, sends the generated zero padding address to the result output module, and the result output module sends the result data and the zero padding address to the SRAM storage module.
Furthermore, the data received by the calculation engine module from the data input module includes an input data valid signal, a module shutdown signal and a fixed point position signal.
3. Advantageous effects
Compared with the prior art, the utility model has the following advantages:
the utility model discloses realize making full use of convolution's concurrency, designed parallel computing unit, on the basis of limited computational resource and storage resource, realized the performance advantage of high parallelism; the parallel computing unit, the multiply-accumulate unit and the activation function computing unit inside the parallel computing unit support adjustable parallelism, and the power consumption can be reduced by closing an idle computing unit; the multiply-accumulate unit and the activation function calculation unit support two configurable fixed point data types of 16bit/8bit so as to meet the requirements of different scenes on precision; the multiply-accumulate unit and the activation function calculation unit support the dynamic change of the fixed point position of the data aiming at the data of different layers, thereby effectively ensuring the precision of the calculation result; the parallel computing unit supports configurable algorithm parameters, is suitable for various scenes and has strong expandability.
Drawings
FIG. 1 is a schematic diagram of the overall framework of the convolutional neural network inference accelerator of the present invention;
FIG. 2 is a schematic diagram of a parallel computing unit according to the present invention;
FIG. 3 is a schematic diagram of the multiply-accumulate unit structure of the present invention;
FIG. 4 is a schematic diagram of an activation function calculation unit according to the present invention;
fig. 5 is a schematic structural diagram of an SRAM memory module according to the present invention;
FIG. 6 is a schematic diagram of the parallel computing unit slice flow calculation;
fig. 7 is a schematic diagram of the input ping-pong of the SRAM memory cell of the present invention.
Detailed Description
An analysis of the rules of the convolution operation shows that the main operator of convolution, the multiply-accumulate operation, offers a large amount of calculation parallelism that can be exposed by slicing the data: multiple convolution kernels are independent and can be processed concurrently, and multiple input images are likewise independent and concurrent. This provides the idea behind the hardware design.
Under the constraints of hardware resources and cost, the design must first fully exploit the parallelism of the convolution operation to meet the requirements of high performance and low power consumption; second, the algorithm and its parameters must be configurable, improving expandability so that different application scenes can be served; finally, the calculation precision must be improved and the result error reduced.
The invention is described in detail below with reference to the drawings and specific examples.
As shown in fig. 1, the utility model provides a convolutional neural network reasoning accelerator comprising a main control module, an address generation module, an SRAM storage module, a data input module, a calculation engine module and a result output module. The main control module receives the start signal and distributes data to the address generation module, the data input module, the calculation engine module and the result output module; the address generation module receives data from the main control module and the result output module and sends a control signal to the SRAM storage module; the SRAM storage module receives the control signal from the address generation module and sends stored data to the data input module; the data input module receives data from the main control module and the SRAM storage module and sends the data to the calculation engine module; the calculation engine module receives data from the main control module and the data input module and sends result data to the result output module; and the result output module receives the result data of the calculation engine module and sends it to the address generation module. Specifically:
the main control module is used for receiving a start signal, sending the start signal to the data input module and the address generation module after receiving the start signal, and sending configuration information to the address generation module, the SRAM module, the data input module, the calculation engine module and the result output module at the same time, wherein the configuration information comprises convolution parameters such as image size, channel number, convolution kernel size, convolution kernel number, image batch number, convolution step length, parallelism and activation function;
the address generation module is used for receiving a starting signal of the main control module, generating a source data address, a weight address and an SRAM control signal after receiving the starting signal, and sending the source data address, the weight address and the SRAM control signal to the SRAM module; the SRAM control module is also used for receiving result data, a result data valid signal and a calculation end signal of the result output module, generating a corresponding result data address and an SRAM control signal and sending the result data address and the SRAM control signal to the SRAM storage module;
the SRAM storage module is used for receiving the address and the SRAM control signal of the address generation module, storing or reading data according to the address and the control signal, and sending the read data to the data input module;
the data input module is used for receiving a starting signal of the main control module, generating a calculation control signal, receiving data of the SRAM storage module, and sending the calculation control signal and the data to the calculation engine module, wherein the calculation control signal comprises an input data effective signal, a module turn-off signal and a fixed-point position signal;
the calculation engine module is used for receiving the calculation control signal and the data of the data input module, receiving the configuration information of the main control module, calculating the data according to the configuration information and the calculation control signal to obtain result data, generating a result data effective signal and a calculation end signal, and sending the result data, the result data effective signal and the calculation end signal to the result output module;
and the result output module is used for receiving the result data, the result data effective signal and the calculation ending signal generated by the calculation engine module, receiving the configuration information of the main control module, performing zero filling on the result data according to the configuration information, and sending the result data after zero filling, the result data effective signal and the calculation ending signal to the address generation module.
Specifically, the calculation engine module in this embodiment comprises a plurality of parallel computing units, each containing a plurality of multiply-accumulate units (PE) and an activation function calculation unit (ACT). In this embodiment the calculation engine module contains 32 parallel computing units, and each parallel computing unit contains 32 multiply-accumulate units (PE) and one activation function calculation unit (ACT). It should be noted that the utility model places no specific limit on the number of parallel computing units, multiply-accumulate units or activation function calculation units in the calculation engine; in a specific implementation the number of units can be set according to actual calculation needs, and the numbers here are set only according to the convolutional-layer calculation parameters of this embodiment. The parallel computing units of the calculation engine module are mutually independent and can execute concurrently; slicing of the source data and pipelined parallelism of the convolutional-layer calculation are realized through the algorithm control modules. The multiply-accumulate units inside a parallel computing unit are likewise mutually independent and can execute concurrently, realizing slicing and pipelined parallelism over the convolution kernels of the convolutional-layer calculation; the algorithm control modules here are the main control module, the address generation module, the data input module and the result output module. The parallel computing units, and the multiply-accumulate and activation function calculation units inside them, have a turn-off function and stop working after receiving a valid turn-off signal.
In the internal structure of the parallel computing unit shown in fig. 2, the multiply-accumulate units in the parallel computing unit receive source data and weight data from the data input module over two data input buses. After the data are computed, the calculation results of the multiply-accumulate units flow over a result input bus into the activation function calculation unit for further calculation, and the final results are then transmitted over a result output bus to the result output module.
In the internal structure of the multiply-accumulate unit PE shown in fig. 3, the multiply-accumulate unit comprises one fixed-point multiplier (FM) and one fixed-point adder (FA). The fixed-point multiplier receives source data and weight data, multiplies them, and truncates the calculation result according to the received fixed-point position signal (trunc); the fixed-point truncation is implemented by shifting the data in the fixed-point multiplier to the right. The multiply-accumulate unit supports two fixed-point data types, 16 bit and 8 bit, and also supports dynamic change of the fixed-point integer bits, which meets the precision requirements of different scenes. Fixed-point truncation avoids floating-point operation in the calculation unit, reducing the consumption of hardware resources and increasing the operation speed.
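As a hedged illustration of the PE behaviour just described, the sketch below models one fixed-point multiply-accumulate step with truncation by right shift; the saturation behaviour and the function names are assumptions made for the sketch and are not specified in the patent.

```python
def saturate(value: int, bits: int) -> int:
    """Clamp a two's-complement value to the given word width (assumed behaviour)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def fixed_point_mac(acc: int, a: int, b: int, trunc: int, bits: int = 16) -> int:
    """One PE step: multiply, right-shift by the fixed-point position signal, accumulate.

    acc, a, b are integers interpreted as fixed-point numbers of `bits` width;
    `trunc` is the number of fractional bits discarded from the product.
    """
    product = a * b
    product >>= trunc                     # fixed-point truncation by right shift
    return saturate(acc + product, bits)  # accumulate within the 16-bit/8-bit word
```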
In the internal structure of the activation function unit ACT shown in fig. 4, the activation function unit comprises a look-up table memory (LUT), a fixed-point multiplier (FM), a fixed-point adder (FA) and two registers (FF); the look-up table is essentially a RAM in which a truth table is stored. In this embodiment, the LUT and one register receive the calculation result X of the multiply-accumulate unit, and the LUT simultaneously receives the fixed-point truncation signal trunc. The LUT performs its logic operation as a table look-up by address, finding the parameters k and b corresponding to the current input X. The intermediate result k is stored in the second register, while the intermediate result b and the value X held in the first register are sent to the fixed-point multiplier FM and shifted according to the truncation signal; finally the product and the intermediate result k in the second register are sent to the fixed-point adder, giving the result b·X + k. The activation function unit supports the 16-bit and 8-bit fixed-point data types and the dynamic change of the fixed-point integer bits, meeting the precision requirements of different scenes.
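The following sketch illustrates the LUT-based piecewise-linear activation described above, computing b·X + k from table coefficients; how the table is indexed (here, by the high-order bits of X) and the table size are assumptions made only for the example.

```python
def lut_activation(x: int, lut: list[tuple[int, int]], trunc: int, index_bits: int = 6) -> int:
    """Piecewise-linear activation: look up (k, b) for the segment containing x,
    then compute b*x + k with fixed-point truncation of the product.

    lut: table of (k, b) fixed-point coefficient pairs, one per segment
         (table size assumed to be a power of two).
    index_bits: how many low-order bits of x are ignored when selecting
                the segment (an assumption about the addressing scheme).
    """
    segment = (x >> index_bits) & (len(lut) - 1)   # table look-up by address
    k, b = lut[segment]
    product = (b * x) >> trunc                     # fixed-point multiplier + right shift
    return product + k                             # fixed-point adder: result = b*X + k
```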
In the internal structure of the SRAM memory module shown in fig. 5, the SRAM memory module is divided into three parts: a source data area, a weight area and a result area. The memory module in this embodiment uses 128 SRAMs, each of size 256 bits × 1k, so each SRAM can store 16k 16-bit fixed-point data or 32k 8-bit fixed-point data; the width of the fixed-point data stored in the SRAM memory module is determined by the fixed-point width used by the multiply-accumulate and activation function calculation units, and data storage addresses are in units of bytes. The 128 SRAMs in the memory module are organized as 128 banks, of which 64 banks form the source data area and store source data, 32 banks form the weight area and store weight data, and 32 banks form the result area and store result data.
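To make the bank partition concrete, the sketch below models the 64/32/32 split of the 128 banks, the per-SRAM capacity, and one possible byte-address-to-bank mapping; the interleaving scheme is an assumption, since the description only specifies the partition and the SRAM sizes.

```python
SRAM_WORD_BITS = 256
SRAM_DEPTH     = 1024
BYTES_PER_WORD = SRAM_WORD_BITS // 8          # 32 bytes per SRAM row

REGIONS = {                                   # bank partition from the description
    "source": range(0, 64),
    "weight": range(64, 96),
    "result": range(96, 128),
}

def capacity_words(data_bits: int) -> int:
    """Fixed-point words one SRAM holds: 16k for 16-bit data, 32k for 8-bit data."""
    return SRAM_WORD_BITS * SRAM_DEPTH // data_bits

def map_address(region: str, byte_addr: int):
    """Map a byte address within a region to (bank, row, byte offset).

    Assumed layout: consecutive 32-byte words are interleaved across the
    region's banks (low-order interleave); this is illustrative only.
    """
    banks = list(REGIONS[region])
    word_index = byte_addr // BYTES_PER_WORD
    bank = banks[word_index % len(banks)]
    row = word_index // len(banks)
    return bank, row, byte_addr % BYTES_PER_WORD
```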
In this embodiment, before the calculation results of the calculation engine module are stored in the result SRAM, a zero-padding (padding) operation is performed on the convolution results. The zero-padding operation is completed jointly by the main control module, the address generation module and the result output module: the main control module sends a zero-padding signal to the address generation module, which is thereby controlled to generate zero-padding addresses corresponding to the locations in the SRAM storage module where the result data need to be padded; the result output module then stores the value 0 in the SRAM at each zero-padding address. Zero-padding the convolution results in this way makes the calculation of the next convolutional layer more convenient.
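A minimal sketch of the effect of this zero-padding step on a whole result feature map is given below; in the accelerator itself the padding is realized address by address through the address generation module, so the function here is only a behavioural stand-in.

```python
import numpy as np

def pad_result(result, pa: int):
    """Zero-pad a result feature map of shape (Nu, Ou, Ou) by `pa` pixels on each
    border, mimicking the padding applied before the results are written to the
    result SRAM so they can serve directly as the next layer's input."""
    return np.pad(result, ((0, 0), (pa, pa), (pa, pa)))
```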
The parallel computing units implement sliced pipelined parallelism as shown in fig. 6. A parallel computing unit receives source data X_abc and weight data W_abc and performs the convolution calculation; the source data and weight data are two-dimensional matrices, where a denotes the row of the matrix, b the column and c the image channel, so the indices a, b, c of X_abc determine the position of an input value in the source data, and the indices of W_abc determine the position of a weight value in the convolution kernel. In the convolution calculation of the calculation engine module, all multiply-accumulate units within a single parallel computing unit operate on the same group of source data, while different multiply-accumulate units of that unit correspond to different convolution kernels; the weight data here refers to a single weight, and one convolution kernel can contain many weight values. In this embodiment one parallel computing unit is called a compute group, and each compute group is assigned 32 convolution kernels corresponding to its 32 multiply-accumulate units PE: kernel0 is assigned to PE0, kernel1 to PE1, and so on, up to kernel31 assigned to PE31; PE0 of compute group 0 and PE0 of compute group 1 correspond to the same convolution kernel. By assigning a multiply-accumulate unit PE to each convolution kernel, parallel calculation over the convolution kernels is achieved inside the parallel computing units. Each multiply-accumulate unit performs the multiply-accumulate of source data and weight data internally; when its calculation finishes it produces one result value of the corresponding channel of the output image, which enters the activation function unit ACT over the result input bus for activation-function calculation and is finally output to the result output module over the result output bus.
This sliced pipelined calculation mode allows multiple batches of source data and multiple convolution kernels to be computed in parallel at the same time, increasing the calculation speed of the convolution kernels in the convolutional neural network. At the same time, each parallel computing unit calculates independently, and every parallel computing unit and the PEs and ACT inside it have a turn-off function, so the accelerator can adapt to the number of images actually being calculated by sending a valid turn-off signal to the calculation engine module to stop the corresponding number of parallel computing units; it can likewise adapt to the number of convolution kernels actually being calculated by sending a valid turn-off signal that controls how many multiply-accumulate units PE in each parallel computing unit are working. Controlling the number of working parallel computing units and multiply-accumulate units realizes adjustable calculation parallelism, and turning off idle calculation units reduces the power consumption of the accelerator.
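The sketch below illustrates, under the embodiment's 32 × 32 configuration, how image batches could be mapped to compute groups and convolution kernels to PEs, with a turn-off mask for the idle units; the scheduling policy shown is an assumption made only for illustration.

```python
NUM_GROUPS = 32          # parallel computing units (compute groups) in this embodiment
PES_PER_GROUP = 32       # multiply-accumulate units per compute group

def schedule(num_images: int, num_kernels: int):
    """Assign images to compute groups and kernels to PEs; the rest are turned off.

    Returns (active_groups, active_pes, turnoff_mask) where turnoff_mask[g][p] is True
    for units that receive a valid turn-off signal to save power.
    """
    active_groups = min(num_images, NUM_GROUPS)
    active_pes = min(num_kernels, PES_PER_GROUP)
    turnoff = [[not (g < active_groups and p < active_pes)
                for p in range(PES_PER_GROUP)]
               for g in range(NUM_GROUPS)]
    return active_groups, active_pes, turnoff
```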
The utility model also provides a reasoning acceleration method for a convolutional neural network, which specifically comprises the following steps:
step 1, a main control module receives a convolution starting signal and algorithm configuration information, transmits the configuration information to a calculation engine module and a result output module, and simultaneously sends a starting signal to a data input module and an address generation module, and configuration parameters are used for the calculation engine module and the result output module to carry out convolution operation on data;
step 2, after the address generation module receives the start signal, generating a source data address, a weight address and a corresponding SRAM control signal and transmitting the source data address, the weight address and the corresponding SRAM control signal to the SRAM storage module, wherein the corresponding SRAM control signal is a data fetching signal;
step 3, the data input module generates an input data effective signal, a module turn-off signal and a fixed point position signal according to the source data and the weight data sent by the SRAM storage module, and sends the source data, the weight data, the input data effective signal, the module turn-off signal and the fixed point position signal, wherein the fixed point position signal comprises a PE fixed point position signal and an ACT fixed point position signal;
step 4, after receiving the source data, the weight data, the input data effective signal, the fixed point position signal and the module turn-off signal, the calculation engine module starts convolution reasoning calculation to obtain result data, generates a result data effective signal and a calculation end signal, and sends the result data, the result data effective signal and the calculation end signal to the result output module;
step 5, after the result output module receives the result data and the result data effective signal, carrying out corresponding zero padding operation on the result data according to the configuration information, and sending the result data after zero padding, the result data effective signal and the calculation end signal to the address generation module;
step 6, the address generation module generates a result data address and the corresponding SRAM control signal according to the received zero-padded result data, the calculation end signal and the result data effective signal, wherein the corresponding SRAM control signal is a data storage signal; the result data, the result data address and the corresponding SRAM control signal are sent together to the SRAM storage module to complete the storage of the result data.
When data are stored in the SRAM storage module and the complete input data cannot all fit in the corresponding SRAM, the source data must be transferred in several passes. The input image, i.e. the source data, therefore needs to be cut: it is divided into several segments that are carried into the SRAM in sequence, and ping-pong operation is used to hide part of the transfer time, improving the calculation efficiency.
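The following sketch shows the idea of slicing the source data and overlapping transfers with computation through two buffers (ping-pong); the helper names load and compute are hypothetical, and in hardware the loading of the next segment proceeds concurrently with the current computation rather than sequentially as in this software model.

```python
def pingpong_process(source_slices, load, compute):
    """Double-buffered processing of sliced source data.

    source_slices: list of input-image segments (the sliced source data)
    load(segment)  -> buffer contents placed in SRAM
    compute(buf)   -> result for one segment
    While one buffer is being computed on, the next segment is loaded into
    the other buffer, hiding part of the transfer time.
    """
    buffers = [None, None]
    results = []
    buffers[0] = load(source_slices[0])                   # prime the first buffer
    for i in range(len(source_slices)):
        if i + 1 < len(source_slices):
            buffers[(i + 1) % 2] = load(source_slices[i + 1])  # fetch next segment
        results.append(compute(buffers[i % 2]))           # compute current segment
        # in hardware the fetch above overlaps this computation
    return results
```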
In this method, the main control module of the convolutional neural network reasoning accelerator receives external data and information and distributes the data to the corresponding modules; the calculation engine unit computes the input image data, realizing pipelined parallel calculation of the convolutional layers of the convolutional neural network, and the calculation results are stored in the SRAM storage module, making the next convolution calculation convenient.
The utility model and its embodiments have been described above schematically, and the description is not limiting; the utility model may be embodied in other specific forms without departing from its spirit or essential characteristics. The embodiment shown in the drawings is only one embodiment of the utility model, the actual structure is not limited to it, and any reference signs in the claims shall not be construed as limiting the claims. Therefore, if a person skilled in the art, having received the teachings of the utility model, devises a similar structure or embodiment to the above technical solution without inventive design, it shall fall within the protection scope of this patent. Furthermore, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Several of the elements recited in the product claims may also be implemented by one element in software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.

Claims (8)

1. A convolutional neural network reasoning accelerator, comprising a main control module, an address generation module, an SRAM storage module, a data input module, a calculation engine module and a result output module, wherein:
the main control module receives the starting signal and distributes data to the address generation module, the data input module, the calculation engine module and the result output module;
the address generation module receives the data of the main control module and the result output module and sends a control signal to the SRAM storage module;
the SRAM storage module receives the control signal of the address generation module and sends storage data to the data input module;
the data input module receives data of the main control module and the SRAM storage module and sends the data to the calculation engine module;
the calculation engine module receives the data of the main control module and the data input module and sends result data to the result output module;
and the result output module is used for receiving the result data of the calculation engine module and sending the result data to the address generation module.
2. The convolutional neural network inference accelerator of claim 1, wherein: the calculation engine module comprises a plurality of parallel calculation units.
3. The convolutional neural network inference accelerator of claim 2, wherein: the parallel computing unit comprises a plurality of multiply-accumulate units and an activation function computing unit, the multiply-accumulate units receive data of the data input module through the data input bus, and send the data to the activation function computing unit through the result input bus.
4. A convolutional neural network inference accelerator as defined in claim 3, wherein: the multiply-accumulate unit comprises a fixed-point multiplier and a fixed-point adder, and the fixed-point multiplier is connected with the fixed-point adder.
5. A convolutional neural network inference accelerator as defined in claim 3, wherein: the activation function unit comprises a lookup table memory, a fixed point multiplier, a fixed point adder and two registers, wherein the lookup table memory and the first register receive data, the lookup table memory sends the data to the fixed point multiplier and the second register, and the fixed point multiplier and the second register send the data to the fixed point adder.
6. The convolutional neural network inference accelerator of claim 1, wherein: the data sent by the main control module to the address generation module comprises a start signal and a zero padding signal.
7. The convolutional neural network inference accelerator of claim 6, wherein: the address generating module receives the zero padding signal, sends the generated zero padding address to the result output module, and the result output module sends the result data and the zero padding address to the SRAM storage module.
8. The convolutional neural network inference accelerator of claim 1, wherein: the data received by the calculation engine module from the data input module comprises an input data valid signal, a module turn-off signal and a fixed point position signal.
CN202020683437.0U 2020-04-28 2020-04-28 Convolutional neural network reasoning accelerator Active CN211554991U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202020683437.0U CN211554991U (en) 2020-04-28 2020-04-28 Convolutional neural network reasoning accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202020683437.0U CN211554991U (en) 2020-04-28 2020-04-28 Convolutional neural network reasoning accelerator

Publications (1)

Publication Number Publication Date
CN211554991U true CN211554991U (en) 2020-09-22

Family

ID=72496130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202020683437.0U Active CN211554991U (en) 2020-04-28 2020-04-28 Convolutional neural network reasoning accelerator

Country Status (1)

Country Link
CN (1) CN211554991U (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723044A (en) * 2021-09-10 2021-11-30 上海交通大学 Data sparsity-based extra row activation and storage integrated accelerator design
CN113723044B (en) * 2021-09-10 2024-04-05 上海交通大学 Excess row activation and calculation integrated accelerator design method based on data sparsity

Similar Documents

Publication Publication Date Title
CN111401532A (en) Convolutional neural network reasoning accelerator and acceleration method
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN111488983B (en) Lightweight CNN model calculation accelerator based on FPGA
CN110516801A (en) A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN108733348B (en) Fused vector multiplier and method for performing operation using the same
CN104899182A (en) Matrix multiplication acceleration method for supporting variable blocks
CN109472356A (en) A kind of accelerator and method of restructural neural network algorithm
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN111047008A (en) Convolutional neural network accelerator and acceleration method
CN209231976U (en) A kind of accelerator of restructural neural network algorithm
CN211554991U (en) Convolutional neural network reasoning accelerator
CN117032807A (en) AI acceleration processor architecture based on RISC-V instruction set
CN109472734B (en) Target detection network based on FPGA and implementation method thereof
CN113313244B (en) Near-storage neural network accelerator for addition network and acceleration method thereof
CN108647780B (en) Reconfigurable pooling operation module structure facing neural network and implementation method thereof
CN113516236A (en) VGG16 network parallel acceleration processing method based on ZYNQ platform
CN117725963A (en) Method, system and device for converting model reasoning calculation
CN112862091A (en) Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
CN113705794B (en) Neural network accelerator design method based on dynamic activation bit sparseness
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN115081602A (en) Computing device, integrated circuit device and board card for executing Winograd convolution

Legal Events

Date Code Title Description
GR01 Patent grant