CN111242277B - Convolutional neural network accelerator supporting sparse pruning based on FPGA design - Google Patents

Convolutional neural network accelerator supporting sparse pruning based on FPGA design

Info

Publication number
CN111242277B
CN111242277B (application CN201911383518.7A)
Authority
CN
China
Prior art keywords
data
weight
adder
ram
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911383518.7A
Other languages
Chinese (zh)
Other versions
CN111242277A (en)
Inventor
邱蔚
丁永林
曹学成
廖湘萍
李炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 52 Research Institute
Original Assignee
CETC 52 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 52 Research Institute filed Critical CETC 52 Research Institute
Priority to CN201911383518.7A priority Critical patent/CN111242277B/en
Publication of CN111242277A publication Critical patent/CN111242277A/en
Application granted granted Critical
Publication of CN111242277B publication Critical patent/CN111242277B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a convolutional neural network accelerator supporting sparse pruning, designed for FPGA, comprising a bus interface unit, a weight storage and management unit, an input cache FIFO array, a MAC computing array, an intermediate result cache RAM array, a pipeline adder array, and an output cache and management unit. From convolution kernels whose zero-valued weights have been removed, the invention forms weight columns composed only of non-zero weights, and the non-zero weights in a weight column are dynamically allocated to the multiply-adders in sequence so that each distribution performs the maximum amount of computation. The accelerator can therefore cope with the irregularity of network computation after pruning and effectively skip multiply-accumulate operations with zero weights, reducing the total amount of computation and accelerating neural network inference.

Description

Convolutional neural network accelerator supporting sparse pruning based on FPGA design
Technical Field
The application belongs to the field of deep learning, and particularly relates to a convolutional neural network accelerator supporting sparse pruning based on FPGA design.
Background
In recent years artificial intelligence has become popular worldwide, and computer vision and speech recognition are its most widely deployed applications, bringing great convenience to daily life. Both rely primarily on deep neural networks for inference, because deep neural networks achieve very high accuracy.
In general, the deeper a neural network and the more parameters it has, the more accurate its inference results. At the same time, a deeper network with more parameters consumes more computing and storage resources. Moreover, although a neural network has many parameters, some of them contribute little to the final output and are redundant.
Convolutional neural networks are a class of deep neural networks: they share the high accuracy of deep neural networks but also inherit their huge computational cost, so reducing the amount of computation in convolutional neural networks has long been a popular research direction in artificial intelligence. Convolution operations account for more than 90% of the total workload and are proportional to the number of parameters, while pruning removes redundant parameters and maintains considerable accuracy. Pruning accelerates neural network inference by reducing the total workload, but the irregularity of the pruned network increases the complexity of the hardware implementation of a neural network accelerator, so existing accelerator designs cannot fully exploit the speedup offered by weight sparsity; their acceleration effect is poor and their utilization of computing resources is low.
Disclosure of Invention
The convolutional neural network accelerator based on FPGA design and supporting sparse pruning provided by the present application can cope with the irregularity of network computation after pruning and effectively skip multiply-accumulate operations with zero weights, thereby reducing the total amount of computation and accelerating neural network inference.
In order to achieve the above purpose, the technical scheme adopted by the application is as follows:
the utility model provides a convolutional neural network accelerator of support sparse pruning based on FPGA design, convolutional neural network accelerator of support sparse pruning based on FPGA design includes bus interface unit, weight storage and management unit, input buffer memory FIFO array, MAC calculate array, intermediate result buffer memory RAM array, pipeline adder array and output buffer memory and management unit, wherein:
the bus interface unit is used for transmitting feature map columns from the DDR memory to the input cache FIFO array, where a feature map column consists of feature map data of the same size as a convolution kernel, arranged according to a preset rule;
the input cache FIFO array comprises N FIFOs; each FIFO receives a different feature map column output by the bus interface unit and outputs the data to be calculated in that column to the MAC computing array on a first-in, first-out basis;
the weight storage and management unit is used for outputting a weight column to the MAC computing array, where a weight column consists of the non-zero weights located at the same position in the same channel of all convolution kernels, and each non-zero weight carries an index value equal to the number of the convolution kernel it belongs to;
the MAC computing array comprises one MAC group per FIFO, and each MAC group comprises a plurality of multiply-adders; each MAC group receives a weight column output by the weight storage and management unit and one datum to be calculated output by its corresponding FIFO, and each multiply-adder in the group in turn receives one non-zero weight of the weight column, multiplies it by the datum to obtain a product result, and outputs the result to the intermediate result cache RAM array;
the intermediate result cache RAM array comprises one RAM group per MAC group, and each RAM group comprises a plurality of intermediate cache RAMs; an intermediate cache RAM receives the multiply-accumulate result output by its corresponding multiply-adder, the address at which the multiply-accumulate result is stored is the index value of the non-zero weight used to compute it, and the multiply-accumulate result is obtained by adding the product just computed by the multiply-adder to the intermediate result stored at the same address in the intermediate cache RAM corresponding to that multiply-adder;
the pipeline adder array comprises one adder group per RAM group, and each adder group comprises a plurality of adders; each adder group sums the intermediate results stored in its corresponding RAM group and outputs the sums, as convolution operation results, to the output cache and management unit;
the output cache and management unit comprises a plurality of output cache RAMs, one per adder group, and each output cache RAM stores the convolution operation results output by its corresponding adder group.
Preferably, the numbers of FIFOs, MAC groups, RAM groups, adder groups, and output cache RAMs are all N, where N denotes the degree of parallelism, and there is a one-to-one correspondence between the FIFOs, MAC groups, RAM groups, adder groups, and output cache RAMs;
each MAC group comprises M multiply-adders, where M is a power of 2, and each RAM group comprises 2*M intermediate cache RAMs; each multiply-adder is connected to two intermediate cache RAMs, the multiply-accumulate result output by the multiply-adder is cached as an intermediate result in one of the two connected intermediate cache RAMs, intermediate results computed for the same feature map column are stored in the same intermediate cache RAM, and the two intermediate cache RAMs are operated in a ping-pong fashion.
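For concreteness, the sizing relations above can be captured in a small configuration sketch. This is illustrative only; the class and field names are not from the patent, and the adder count simply restates the formula given in the next paragraph.

```python
from dataclasses import dataclass
from math import log2

@dataclass
class AcceleratorConfig:
    N: int      # degree of parallelism: FIFOs = MAC groups = RAM groups = adder groups = output RAMs
    M: int      # multiply-adders per MAC group, a power of two

    @property
    def intermediate_rams_per_group(self) -> int:
        return 2 * self.M                   # two intermediate cache RAMs per multiply-adder (ping-pong)

    @property
    def adders_per_group(self) -> int:
        q1 = int(log2(self.M))
        # P = 1 + 2 + 4 + ... + 2^(q1 - 1), which sums to M - 1
        return sum(2 ** i for i in range(q1))

cfg = AcceleratorConfig(N=4, M=8)
print(cfg.intermediate_rams_per_group, cfg.adders_per_group)    # 16 7
```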
Preferably, each adder group comprises P adders, where P = 1 + 2 + 4 + … + 2^q, q = q1 - 1, and q1 = log2(M); each adder group starts its summation operation after the corresponding MAC group has completed the operation of one feature map column against all the weights of the convolution kernels corresponding to that column, and the summation operation is carried out in a pipelined manner.
Preferably, performing the summation operation in a pipelined manner comprises:
taking the data at the same address in the M intermediate cache RAMs to obtain M data;
taking the M data as the data to be accumulated, grouping them in pairs, and feeding them into the 2^q adders to obtain M/2 data; taking the M/2 data as the new data to be accumulated, grouping them in pairs, and feeding them into the 2^(q-1) adders; and repeating this until the data to be accumulated, grouped in pairs, are fed into the 2^0 adder, whose output is taken as the final convolution operation result of the feature map column and the convolution kernel corresponding to that address;
and traversing the same addresses of the M intermediate cache RAMs in sequence and summing the M data obtained each time, so as to obtain the final convolution operation result of every convolution kernel with the feature map column.
Preferably, each multiply-adder in the MAC group in turn receiving one non-zero weight of the weight column, multiplying it by the data to be calculated to obtain a product result, and outputting the multiply-accumulate result to the intermediate result cache RAM array comprises:
each multiply-adder in the MAC group receiving the datum to be calculated and, in turn, one non-zero weight of the weight column, where the datum corresponds to the position of the non-zero weights, the datum is the same in every multiply-adder, and the non-zero weights differ;
computing the product of the non-zero weight and the datum, and reading an intermediate result from the connected intermediate cache RAM that caches the intermediate results of the current feature map column, the address of the intermediate result being the index value of the non-zero weight used by the multiply-adder in this computation;
adding the intermediate result to the product result and storing the sum, as a new intermediate result, back into the corresponding intermediate cache RAM at the same address as the original intermediate result, so as to update the intermediate result.
Preferably, the bus interface unit comprises a bus interface unit positioned at the reading side of the FPGA and a bus interface unit positioned at the writing side of the FPGA, wherein the bus interface unit positioned at the reading side of the FPGA is used for reading the characteristic diagram and the convolution kernel from the DDR memory, and the bus interface unit positioned at the writing side of the FPGA is used for writing the convolution operation result in the output buffer and management unit into the DDR memory.
Preferably, the output buffer and management unit comprises a polling controller and N output buffer RAMs, wherein the polling controller is used for controlling convolution operation results stored in the N output buffer RAMs to be sequentially transmitted to the DDR memory through a bus interface unit positioned at the writing side of the FPGA.
In the convolutional neural network accelerator supporting sparse pruning based on FPGA design described above, weight columns composed of non-zero weights are formed from convolution kernels whose zero-valued weights have been removed, and the non-zero weights in a weight column are dynamically allocated to the multiply-adders in sequence so that each distribution performs the maximum amount of computation. The accelerator can therefore cope with the irregularity of network computation after pruning and effectively skip multiply-accumulate operations with zero weights, reducing the total amount of computation and accelerating neural network inference.
Drawings
Fig. 1 is a schematic structural diagram of a convolutional neural network accelerator supporting sparse pruning based on FPGA design of the present application;
FIG. 2 is a schematic diagram of a calculation mode of the convolutional neural network accelerator of the present application;
FIG. 3 is a schematic diagram of a feature map data storage mode in embodiment 1 of the present application;
FIG. 4 is a schematic diagram of a weight data storage mode in embodiment 1 of the present application;
FIG. 5 is a schematic diagram of the operation of the multiplier-adder in example 1 of the present application;
FIG. 6 is a schematic diagram illustrating operation of a pipelined adder array according to embodiment 1 of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art without undue burden on the basis of these embodiments fall within the scope of protection of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in FIG. 1, one embodiment provides a convolutional neural network accelerator supporting sparse pruning based on FPGA design, which copes with the irregular computation that follows sparse pruning and accelerates neural network operation.
The convolutional neural network accelerator of this embodiment includes a bus interface unit, a weight storage and management unit, an input cache FIFO array, a MAC computing array, an intermediate result cache RAM array, a pipeline adder array, and an output cache and management unit, where:
1) Bus interface unit: the bus interface unit is used for transmitting feature map columns from the DDR memory to the input cache FIFO array, where a feature map column consists of feature map data of the same size as the convolution kernel, arranged according to a preset rule. As shown in FIG. 2, the preset rule in this embodiment is to take a block of data the size of the convolution kernel, traverse it from top left to bottom right in channel order, and arrange it as one column.
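As a point of reference only, the column assembly can be modeled in software roughly as below. This is a behavioral sketch, not the patented hardware; the exact interleaving of channels and spatial positions within a column is an assumption here, and the function and array names are illustrative.

```python
import numpy as np

def feature_map_column(fmap, row, col, k):
    """Flatten the k x k window of `fmap` (shape C, H, W) at (row, col) into one
    feature map column, walking the window top-left to bottom-right with the
    channels of each pixel kept together (matching the per-pixel DDR bursts)."""
    window = fmap[:, row:row + k, col:col + k]      # (C, k, k) block under the kernel
    return window.transpose(1, 2, 0).reshape(-1)    # spatial-major, channels innermost

# Example: 8x8 feature map with 3 channels, 3x3 kernel, window at the top-left corner
fmap = np.arange(3 * 8 * 8, dtype=np.int32).reshape(3, 8, 8)
column = feature_map_column(fmap, 0, 0, 3)          # 27 values streamed into one FIFO
print(column.shape)                                 # (27,)
```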
The bus interface unit is a generic AXI Master interface used to transfer data, i.e. weights and feature maps, between the DDR memory and the FPGA chip. For feature map data, one burst transfers the pixel values at the same position of all channels; for weights, one burst transfers one layer of weight data.
To keep the structure complete, in one embodiment the bus interface unit includes a bus interface unit on the read side of the FPGA and a bus interface unit on the write side of the FPGA. The read-side unit reads the feature map and the convolution kernels (i.e., the weights in the convolution kernels) from the DDR memory; the write-side unit writes the summation results output by the output cache and management unit into the DDR memory.
2) The input cache FIFO array comprises N FIFOs. Each FIFO receives a different feature map column output by the bus interface unit and outputs the data to be calculated in that column to the MAC computing array on a first-in, first-out basis; a read is enabled only when none of the FIFOs is empty.
3) The weight storage and management unit is used for caching, reading, and distributing weight data, and finally outputs a weight column to the MAC computing array, where a weight column consists of the non-zero weights located at the same position in the same channel of all convolution kernels, and each non-zero weight carries an index value equal to the number of the convolution kernel it belongs to.
In one embodiment, all weight columns are stored contiguously in a chained structure in the format non-zero weight + index value + end flag, where the index value is the position of the non-zero weight among all convolution kernels, i.e. the number of its convolution kernel, and the end flag separates different weight columns. Weights are read continuously; whether the weight at the next address is read immediately is decided by checking the end flag stored at the current address.
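A minimal software sketch of this chained weight-column encoding follows, assuming each stored entry is a (non-zero weight, index, end-flag) triple. The bit-level on-chip layout is not specified in the text, so this format and all names are illustrative.

```python
import numpy as np

def encode_weight_columns(kernels):
    """kernels: list of K convolution kernels, each a numpy array of shape (C, k, k).
    Returns one chained stream of (weight, index, is_end) entries: one weight column
    per (channel, row, col) position, where `index` is the number of the kernel the
    weight belongs to and `is_end` marks the last entry of its column.
    Note: a position whose weights are all zero contributes no entries here."""
    C, k, _ = kernels[0].shape
    stream = []
    for c in range(C):
        for r in range(k):
            for s in range(k):
                column = [(int(kernels[n][c, r, s]), n)
                          for n in range(len(kernels))
                          if kernels[n][c, r, s] != 0]      # zero weights are pruned away
                for i, (w, idx) in enumerate(column):
                    stream.append((w, idx, i == len(column) - 1))
    return stream

# Example: 32 random 3x3x3 kernels with roughly half of the weights pruned to zero
rng = np.random.default_rng(0)
kernels = [rng.integers(-3, 4, size=(3, 3, 3)) * rng.integers(0, 2, size=(3, 3, 3))
           for _ in range(32)]
print(len(encode_weight_columns(kernels)))   # number of non-zero weights actually stored
```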
The weight storage and management unit triggers weight reads from the read enable of the input cache FIFO array, stops after a valid end flag is read, and then distributes the weights in their stored order; only non-zero weights are distributed, and no computation is performed for zero-valued weights.
4) The MAC computing array comprises one MAC group per FIFO, and each MAC group comprises a plurality of multiply-adders. Each MAC group receives a weight column output by the weight storage and management unit and one datum to be calculated output by its corresponding FIFO; each multiply-adder in the group in turn receives one non-zero weight of the weight column, multiplies it by the datum to obtain a product result, and outputs the multiply-accumulate result to the intermediate result cache RAM array.
5) The intermediate result cache RAM array comprises one RAM group per MAC group, and each RAM group comprises a plurality of intermediate cache RAMs. An intermediate cache RAM receives the multiply-accumulate result output by its corresponding multiply-adder; the address at which the result is stored is the index value of the non-zero weight used to compute it, and the multiply-accumulate result is obtained by adding the product just computed by the multiply-adder to the intermediate result stored at the same address in the intermediate cache RAM corresponding to that multiply-adder.
For one convolution kernel, a convolution is a multiply-accumulate process, but after sparse pruning the number of weights in a weight column is generally not equal to the number of convolution kernels. Because the weight storage and management unit distributes weights in their stored order, the weights handled by one multiply-adder in successive rounds may come from different convolution kernels; in other words, weights are dynamically allocated to the multiply-adders, which guarantees maximum-efficiency computation in every round.
To support this dynamic weight allocation, the present embodiment implements the multiply-accumulate operation with the help of the intermediate cache RAMs, as follows:
each multiply-adder in the MAC group receives the datum to be calculated and, in turn, one non-zero weight of the weight column; the datum corresponds to the position of the non-zero weights, the datum is the same in every multiply-adder, and the non-zero weights differ;
the product of the non-zero weight and the datum is computed, and an intermediate result is read from the connected intermediate cache RAM that caches the intermediate results of the current feature map column, the address of the intermediate result being the index value of the non-zero weight used by the multiply-adder in this computation;
the intermediate result and the product result are added, and the sum is stored, as a new intermediate result, back into the corresponding intermediate cache RAM at the same address as the original intermediate result, thereby updating the intermediate result.
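A behavioral sketch of this scheme follows (illustrative only, not the RTL): the M multiply-adders of a group share one input datum, each takes one pending non-zero weight, and each accumulates into its own intermediate cache RAM at the address given by the weight's index value.

```python
def mac_group_round(datum, assigned_weights, intermediate_rams):
    """assigned_weights: up to M (weight, index) pairs handed out this round.
    intermediate_rams: one list (modeling a RAM) per multiply-adder, indexed by
    kernel number. Multiply-adder m multiplies the shared datum by its weight and
    accumulates onto the partial sum stored at address `index` in its own RAM."""
    for m, (w, idx) in enumerate(assigned_weights):     # runs in parallel in hardware
        intermediate_rams[m][idx] += datum * w

def mac_group_process(datum, weight_column, intermediate_rams, M=8):
    """Feed one weight column (a list of (weight, index) pairs) to the group,
    M weights per round, in stored order: a given multiply-adder may therefore
    serve different convolution kernels in different rounds."""
    for start in range(0, len(weight_column), M):
        mac_group_round(datum, weight_column[start:start + M], intermediate_rams)
```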
6) The pipeline adder array comprises adder groups corresponding to the RAM groups, each adder group comprises a plurality of adders, each adder group is used for summing the intermediate results stored in the corresponding RAM group, and the summed results are used as convolution operation results to be output to the output buffer and management unit.
7) The output buffer and management unit comprises a plurality of output buffer RAMs corresponding to the adder groups, and each output buffer RAM is used for storing convolution operation results output by the corresponding adder group.
To improve parallel computing efficiency, in one embodiment there are N FIFOs, MAC groups, RAM groups, adder groups, and output cache RAMs, where N denotes the degree of parallelism, and there is a one-to-one correspondence between the FIFOs, MAC groups, RAM groups, adder groups, and output cache RAMs.
Each MAC group comprises M multiply-adders, where M is a power of 2, and each RAM group comprises 2*M intermediate cache RAMs; each multiply-adder is connected to two intermediate cache RAMs, the multiply-accumulate result output by the multiply-adder is cached as an intermediate result in one of the two connected intermediate cache RAMs, intermediate results computed for the same feature map column are stored in the same intermediate cache RAM, and the two intermediate cache RAMs are operated in a ping-pong fashion.
In this embodiment, the two intermediate cache RAMs associated with one multiply-adder are swapped after the computation of one feature map column is finished. When the intermediate cache RAM used for storage is to be switched, the MAC group first checks whether the target intermediate cache RAM is in a continuous-read state (understood here as being read sequentially from its start address to its end address); if so, the MAC group waits, and if not, it performs read and write operations on the target intermediate cache RAM.
In this embodiment, every address of every intermediate cache RAM has a corresponding valid flag that indicates whether the data stored at that address is valid; the flag is set valid when the address is written.
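The ping-pong switch and the per-address valid flags can be sketched as below. This is a simplified software model of the stated behavior (accumulate into one bank, swap only when the other bank is not being drained), with hypothetical names throughout.

```python
class PingPongIntermediateRam:
    """Two intermediate cache RAM banks for one multiply-adder: one accumulates
    partial sums for the current feature map column while the other is drained
    by the pipeline adder array. Per-address valid flags let unwritten addresses
    read as zero instead of stale data."""
    def __init__(self, depth):
        self.banks = [[0] * depth, [0] * depth]
        self.valid = [[False] * depth, [False] * depth]
        self.write_bank = 0                         # bank currently accumulating

    def accumulate(self, addr, product):
        b = self.write_bank
        base = self.banks[b][addr] if self.valid[b][addr] else 0
        self.banks[b][addr] = base + product
        self.valid[b][addr] = True                  # address now holds live data

    def try_switch(self, other_bank_being_read):
        """Swap banks after finishing a feature map column; stall (return False)
        while the target bank is still in its continuous read-out."""
        if other_bank_being_read:
            return False
        self.write_bank ^= 1
        self.valid[self.write_bank] = [False] * len(self.banks[self.write_bank])
        return True
```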
In one embodiment, each adder group comprises P adders, where P = 1 + 2 + 4 + … + 2^q, q = q1 - 1, and q1 = log2(M); each adder group starts its summation operation after the corresponding MAC group has completed the operation of one feature map column against all the weights of the convolution kernels corresponding to that column, and the summation operation is carried out in a pipelined manner.
Further, a pipelined summation operation is provided as follows:
take the data at the same address in the M intermediate cache RAMs to obtain M data;
take the M data as the data to be accumulated, group them in pairs, and feed them into the 2^q adders to obtain M/2 data; take the M/2 data as the new data to be accumulated, group them in pairs, and feed them into the 2^(q-1) adders; repeat this until the data to be accumulated, grouped in pairs, are fed into the 2^0 adder, whose output is taken as the final convolution operation result of the feature map column and the convolution kernel corresponding to that address;
traverse the same addresses of the M intermediate cache RAMs in sequence and sum the M data obtained each time, so as to obtain the final convolution operation result of every convolution kernel with the feature map column.
The looping in the pipelined summation works, for example, as follows: at pipe0 the M data are summed in pairs, at pipe1 the resulting M/2 data are summed in pairs, at pipe2 the resulting M/4 data are summed in pairs, and so on until a single summation result is output. After finishing the addition for the data of one address, the adder group outputs the summation result directly to the output cache RAM.
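A software sketch of this pairwise, staged reduction is given below; the pipeline stages collapse into a loop here, and it assumes M is a power of two and that unwritten addresses read as zero. Names are illustrative.

```python
def adder_tree_sum(values):
    """Pairwise reduction as done by one adder group: M values -> M/2 -> ... -> 1,
    i.e. the pipe0 / pipe1 / pipe2 stages of the pipelined adder array."""
    stage = list(values)
    while len(stage) > 1:
        stage = [stage[i] + stage[i + 1] for i in range(0, len(stage), 2)]
    return stage[0]

def drain_ram_group(intermediate_rams, num_addresses):
    """Walk addresses 0 .. num_addresses-1, take the same address from all M RAMs,
    and reduce them into one convolution result per address (one output pixel per kernel)."""
    return [adder_tree_sum([ram[addr] for ram in intermediate_rams])
            for addr in range(num_addresses)]
```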
To ensure that data are output in order, the output cache and management unit comprises a polling controller and N output cache RAMs. The polling controller controls the convolution results stored in the N output cache RAMs to be transferred in turn to the DDR memory through the bus interface unit on the write side of the FPGA; each transfer outputs, from one output cache RAM, the data of one pixel position across all output channels.
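A minimal sketch of this polling (round-robin) drain, modeling each output cache RAM as a queue of bursts, where one burst is the set of values at one output position across all output channels; names are illustrative.

```python
from collections import deque

def poll_output_rams(output_rams):
    """output_rams: list of N deques, each holding bursts (lists of kernel-count
    values). Visit the RAMs in fixed order and emit one burst per RAM per turn,
    which is what the polling controller forwards to the write-side bus interface."""
    while any(output_rams):
        for n, ram in enumerate(output_rams):       # fixed polling order 0 .. N-1
            if ram:
                yield n, ram.popleft()              # one pixel position, all output channels
```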
In order to facilitate an understanding of the present application, the following is further described by way of examples.
Example 1
In this embodiment, the input cache FIFO array contains 4 FIFOs, the MAC computing array contains 4×8 multiply-adders, the intermediate result cache RAM array contains 4×16 intermediate cache RAMs, the input picture is 8×8×3, the convolution kernel size is 3×3, the number of convolution kernels is 32, and the stride is 1. The convolutional neural network accelerator of this embodiment works as follows:
The host writes the feature map data into region 0 of the DDR memory in the manner shown in FIG. 3 and writes the weight data into region 1 of the DDR memory in the manner shown in FIG. 4; if there is enough RAM space on the FPGA chip to hold the pruned weights, region 1 can be omitted.
The bus interface unit first reads the weight data from external DDR region 1 into RAM inside the accelerator, for example one layer of weight data per burst; if the weights are already stored on chip, this step can be omitted. It then reads the feature map data from external DDR region 0 into the accelerator. Since there are 4 input cache FIFOs and the stride is 1, 4 pixels are read at a time (pixels 0 to 3, each containing 3 channels): the data of pixel 0 is written into FIFO 0, the data of pixel 1 into FIFOs 1 and 0, and so on, so that the data of pixel 3 is written into FIFOs 3, 2, and 1. The read order is the 4 points of row 1, then the 4 points of row 2, then the 4 points of row 3; after that, pixels 4 to 7 (each containing 3 channels) are read, again in the order row 1, row 2, row 3.
It should be noted that when reading pixels, the number of rows read is determined by the convolution kernel size, and which pixel data of a row is written into which FIFO is determined by the convolution kernel size and the stride.
When none of the 4 FIFOs of the input cache FIFO array is empty, the FIFOs are read once and the data of each FIFO is passed to its corresponding MAC group (4 MAC groups in total); at the same time, the weight storage and management unit reads the weight column corresponding to that pixel data and distributes it to all MAC groups, so every MAC group in the MAC computing array shares one set of weights. This is repeated until the convolution operation is finished.
When the weight storage and management unit reads one weight column and distributes it to all MAC groups, suppose the length of the column after pruning is 16 (the number of non-zero weights): the column is then distributed to the MAC groups for multiply-accumulate computation in 2 rounds, with 8 weights going to the 8 multiply-adders in each round. Without pruning, the length of a weight column would be 32 (because there are 32 convolution kernels) and 4 rounds of 8 weights would be needed. The accelerator of this embodiment therefore raises the computation speed of the convolutional neural network by supporting pruning and skipping computations with zero weights.
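The round counts above follow directly from the column length and the number of multiply-adders per group; a quick check (illustrative arithmetic only):

```python
import math

M = 8                                   # multiply-adders per MAC group

def distribution_rounds(column_len, m=M):
    """Rounds needed to hand a weight column of `column_len` weights to m multiply-adders."""
    return math.ceil(column_len / m)

print(distribution_rounds(16))          # pruned column of 16 non-zero weights -> 2 rounds
print(distribution_rounds(32))          # unpruned column of 32 weights        -> 4 rounds
```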
The operation of a multiply-adder in the MAC computing array is shown schematically in FIG. 5: the multiplication is followed by an addition between the product and the intermediate result stored at the corresponding address of the intermediate cache RAM, and the result is written back, as the new intermediate result, to that same address, the storage address being the index value of the weight. This approach raises the utilization of the multiply-adders because a multiply-adder is not bound to a particular convolution kernel; if it were, the multiply-adders corresponding to convolution kernels whose weight is zero would sit idle and computing resources would be wasted.
When the multiply-accumulate operation of one weight column with one feature map column is complete, the pipeline adder array is started to sum the data stored in the group of 8 intermediate cache RAMs connected to it; at the same time, so as not to block the MAC computing array, subsequent intermediate results are cached in the other group of intermediate cache RAMs. This is why each multiply-adder corresponds to 2 intermediate cache RAMs, and it markedly improves the efficiency of the convolution operation.
The data in the 8 intermediate cache RAMs are read simultaneously starting from address 0 until all of their contents have been output to the corresponding adder group.
The pipeline adders work as in FIG. 6: the 8 data at the same address of the 8 intermediate cache RAMs are taken; during accumulation the 8 data are added in pairs at pipe0, the resulting 4 data are added in pairs at pipe1, and the remaining 2 data are added at pipe2 to give the final convolution result for that address, i.e. one pixel of one output channel.
The output cache and management unit receives the convolution results output by the 4 adder groups and outputs them to the external DDR through the bus interface unit in the order of the output cache RAMs; each transfer from one output cache RAM has length 32 (the number of convolution kernels), i.e. one pixel position across all output channels.
The input picture size, the convolution kernel size and number, and the stride mentioned above are all configurable for the accelerator proposed in this application.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described, but as long as a combination contains no contradiction it should be considered within the scope of this description.
The above examples represent only a few embodiments of the present application and are described in relative detail, but they are not to be construed as limiting the scope of the invention. It should be noted that a person of ordinary skill in the art could make various modifications and improvements without departing from the concept of the present application, and these fall within its scope of protection. Accordingly, the scope of protection of the present application is determined by the appended claims.

Claims (7)

1. The convolutional neural network accelerator supporting sparse pruning based on the FPGA design is characterized by comprising a bus interface unit, a weight storage and management unit, an input cache FIFO array, an MAC calculation array, an intermediate result cache RAM array, a pipeline adder array and an output cache and management unit, wherein:
the bus interface unit is used for transmitting a characteristic image column from the DDR memory to the input cache FIFO array, wherein the characteristic image column consists of data with the same size as a convolution kernel on a characteristic image according to a preset rule;
the input buffer memory FIFO array comprises N FIFOs, each FIFO receives different characteristic image columns output by the bus interface unit, and outputs data to be calculated in the characteristic image columns to the MAC calculation array according to a first-in first-out principle;
the weight storage and management unit is used for outputting a weight column into the MAC computing array, wherein the weight column consists of non-zero weights at the same position in the same channel of all convolution kernels, and each non-zero weight has an index value identical to the number of the convolution kernel;
the MAC computing array comprises MAC groups corresponding to the FIFOs, each MAC group comprises a plurality of multiply-add devices, each MAC group receives a weight column output from the weight storage and management unit and one data to be computed output from the corresponding FIFO, each multiply-add device in the MAC group sequentially receives one non-zero weight in the weight column, the product of the non-zero weight and the data to be computed is calculated to obtain a product result, and the product result is output to the intermediate result cache RAM array;
the intermediate result buffer RAM array comprises RAM groups corresponding to each MAC group, the RAM groups comprise a plurality of intermediate buffer RAMs, the intermediate buffer RAMs receive multiplication and accumulation results output by corresponding multiply-add devices, addresses stored in the intermediate buffer RAMs by the multiplication and accumulation results are index values corresponding to non-zero weights used for calculating the multiplication and accumulation results, and the multiplication and accumulation results are obtained by accumulating the product results calculated by the multiply-add devices and intermediate results stored in the intermediate buffer RAMs corresponding to the multiply-add devices and having the same addresses;
the pipeline adder array comprises adder groups corresponding to the RAM groups, each adder group comprises a plurality of adders, each adder group is used for summing the intermediate results stored in the corresponding RAM group, and the summed results are used as convolution operation results to be output to the output buffer and management unit;
the output buffer and management unit comprises a plurality of output buffer RAMs corresponding to the adder groups, and each output buffer RAM is used for storing convolution operation results output by the corresponding adder group.
2. The convolutional neural network accelerator supporting sparse pruning based on the FPGA design of claim 1, wherein the number of the FIFO, the MAC group, the RAM group, the adder group and the output buffer RAM is N, the N represents the parallel degree, and a one-to-one correspondence exists among each FIFO, the MAC group, the RAM group, the adder group and the output buffer RAM;
each MAC group comprises M multiply-add devices, M is a power of 2, each RAM group comprises 2*M intermediate cache RAMs, each multiply-add device is correspondingly connected with two intermediate cache RAMs, a multiply-accumulate result output by the multiply-add device is cached in one of the two correspondingly connected intermediate cache RAMs as an intermediate result, intermediate results calculated for the same feature image column are stored in the same intermediate cache RAM, and the two intermediate cache RAMs are stored in a ping-pong working mode.
3. The FPGA design-based convolutional neural network accelerator supporting sparse pruning of claim 2, wherein each of the adder groups comprises P adders, and P = 1 + 2 + 4 + … + 2^q, q = q1 - 1, q1 = log2(M); each adder group starts the summation operation after the corresponding MAC group completes the operation of one feature map column and all weights in the convolution kernels corresponding to that column, and the summation operation is carried out in a pipelined manner.
4. The FPGA design-based convolutional neural network accelerator supporting sparse pruning of claim 3, wherein the summing operation is performed in a pipelined manner comprising:
taking data of the same address in M middle cache RAMs to obtain M data;
taking the M data as the data to be accumulated, grouping them in pairs, and feeding them into the 2^q adders to obtain M/2 data; taking the M/2 data as the new data to be accumulated, grouping them in pairs, and feeding them into the 2^(q-1) adders; repeating this until the data to be accumulated, grouped in pairs, are fed into the 2^0 adder, the result output by which is taken as the final convolution operation result of the convolution kernel and the feature map column corresponding to the address;
and traversing the data of the same address in the M middle cache RAMs in sequence, and summing according to the M data obtained each time to obtain the final convolution operation result of each convolution kernel and the feature map column.
5. The FPGA design-based convolutional neural network accelerator supporting sparse pruning of claim 3, wherein each multiply-add device in the MAC group receives a non-zero weight in the weight column sequentially, calculates the product of the non-zero weight and the data to be calculated to obtain a product result, and outputs the multiply-accumulate result to the intermediate result cache RAM array, comprising:
each multiplier-adder in the MAC group receives data to be calculated, and sequentially receives a non-zero weight in a weight column, wherein the data to be calculated corresponds to the position of each non-zero weight, and the data to be calculated in each multiplier-adder is the same and the non-zero weights are different;
calculating the product of the non-zero weight and the data to be calculated, and reading an intermediate result from a connected intermediate cache RAM, wherein the intermediate cache RAM is used for caching the intermediate result of the current feature image column calculation, and the address of the intermediate result is the index value corresponding to the non-zero weight used by the multiplier adder in the calculation;
and accumulating the intermediate result and the product result, and then storing the accumulated intermediate result and the product result as a new intermediate result in a corresponding intermediate cache RAM, wherein the stored address of the new intermediate result is the same as the address of the original intermediate result so as to update the intermediate result.
6. The convolutional neural network accelerator supporting sparse pruning based on the FPGA design of claim 2, wherein the bus interface unit comprises a bus interface unit located on a read side of the FPGA and a bus interface unit located on a write side of the FPGA, the bus interface unit located on the read side of the FPGA is used for reading the feature map and the convolutional kernel from the DDR memory, and the bus interface unit located on the write side of the FPGA is used for writing the convolutional operation result in the output buffer and management unit into the DDR memory.
7. The FPGA design-based convolutional neural network accelerator supporting sparse pruning of claim 6, wherein the output buffer and management unit comprises a poll controller and N output buffer RAMs, the poll controller being configured to control the convolution results stored in the N output buffer RAMs to be sequentially transferred to the DDR memory through a bus interface unit located on the write side of the FPGA.
CN201911383518.7A 2019-12-27 2019-12-27 Convolutional neural network accelerator supporting sparse pruning based on FPGA design Active CN111242277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911383518.7A CN111242277B (en) 2019-12-27 2019-12-27 Convolutional neural network accelerator supporting sparse pruning based on FPGA design

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911383518.7A CN111242277B (en) 2019-12-27 2019-12-27 Convolutional neural network accelerator supporting sparse pruning based on FPGA design

Publications (2)

Publication Number Publication Date
CN111242277A CN111242277A (en) 2020-06-05
CN111242277B true CN111242277B (en) 2023-05-05

Family

ID=70874120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911383518.7A Active CN111242277B (en) 2019-12-27 2019-12-27 Convolutional neural network accelerator supporting sparse pruning based on FPGA design

Country Status (1)

Country Link
CN (1) CN111242277B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814972B (en) * 2020-07-08 2024-02-02 上海雪湖科技有限公司 Neural network convolution operation acceleration method based on FPGA
CN111553471A (en) * 2020-07-13 2020-08-18 北京欣奕华数字科技有限公司 Data analysis processing method and device
US11488664B2 (en) 2020-10-13 2022-11-01 International Business Machines Corporation Distributing device array currents across segment mirrors
CN112465110B (en) * 2020-11-16 2022-09-13 中国电子科技集团公司第五十二研究所 Hardware accelerator for convolution neural network calculation optimization
CN112668708B (en) * 2020-12-28 2022-10-14 中国电子科技集团公司第五十二研究所 Convolution operation device for improving data utilization rate
CN112596912B (en) * 2020-12-29 2023-03-28 清华大学 Acceleration operation method and device for convolution calculation of binary or ternary neural network
CN112801276B (en) * 2021-02-08 2022-12-02 清华大学 Data processing method, processor and electronic equipment
CN113138957A (en) * 2021-03-29 2021-07-20 北京智芯微电子科技有限公司 Chip for neural network inference and method for accelerating neural network inference
CN113592075B (en) * 2021-07-28 2024-03-08 浙江芯昇电子技术有限公司 Convolution operation device, method and chip
CN114037857B (en) * 2021-10-21 2022-09-23 中国科学院大学 Image classification precision improving method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN107977704A (en) * 2017-11-10 2018-05-01 中国科学院计算技术研究所 Weighted data storage method and the neural network processor based on this method
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10698657B2 (en) * 2016-08-12 2020-06-30 Xilinx, Inc. Hardware accelerator for compressed RNN on FPGA
US11003985B2 (en) * 2016-11-07 2021-05-11 Electronics And Telecommunications Research Institute Convolutional neural network system and operation method thereof

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN107977704A (en) * 2017-11-10 2018-05-01 中国科学院计算技术研究所 Weighted data storage method and the neural network processor based on this method
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Liqiang Lu, et al. An efficient hardware accelerator for sparse convolutional neural networks on FPGAs. 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2019. *
刘勤让 et al. Computation optimization of convolutional neural networks exploiting parameter sparsity and its FPGA accelerator design. Journal of Electronics & Information Technology, 2018, 40(6): 1368-1374. *
周徐达. Research on sparse neural networks and sparse neural network accelerators. China Doctoral Dissertations Electronic Journal, Information Technology Series, 2019, (05): 50-64. *

Also Published As

Publication number Publication date
CN111242277A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111242277B (en) Convolutional neural network accelerator supporting sparse pruning based on FPGA design
CN111199273B (en) Convolution calculation method, device, equipment and storage medium
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN111831254A (en) Image processing acceleration method, image processing model storage method and corresponding device
CN112668708B (en) Convolution operation device for improving data utilization rate
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
CN111583095B (en) Image data storage method, image data processing system and related device
CN109146065B (en) Convolution operation method and device for two-dimensional data
CN109840585B (en) Sparse two-dimensional convolution-oriented operation method and system
CN114092336B (en) Image scaling method, device, equipment and medium based on bilinear interpolation algorithm
CN110942145A (en) Convolutional neural network pooling layer based on reconfigurable computing, hardware implementation method and system
CN109993293A (en) A kind of deep learning accelerator suitable for stack hourglass network
CN115186802A (en) Block sparse method and device based on convolutional neural network and processing unit
CN108647780B (en) Reconfigurable pooling operation module structure facing neural network and implementation method thereof
CN115860080B (en) Computing core, accelerator, computing method, apparatus, device, medium, and system
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
CN115130672B (en) Software and hardware collaborative optimization convolutional neural network calculation method and device
CN115034376A (en) Neural network processor, batch standardization processing method and storage medium
CN113627587A (en) Multichannel convolutional neural network acceleration method and device
CN114707649A (en) General convolution arithmetic device
WO2021031154A1 (en) Method and device for loading feature map of neural network
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN117291240B (en) Convolutional neural network accelerator and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant