CN111242277A - Convolutional neural network accelerator supporting sparse pruning and based on FPGA design - Google Patents
- Publication number
- CN111242277A CN111242277A CN201911383518.7A CN201911383518A CN111242277A CN 111242277 A CN111242277 A CN 111242277A CN 201911383518 A CN201911383518 A CN 201911383518A CN 111242277 A CN111242277 A CN 111242277A
- Authority
- CN
- China
- Prior art keywords
- adder
- data
- weight
- result
- ram
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Hardware Design (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a convolutional neural network accelerator supporting sparse pruning based on FPGA design, which comprises a bus interface unit, a weight storage and management unit, an input buffer FIFO array, a MAC calculation array, an intermediate result cache RAM array, a pipeline adder array and an output buffer and management unit. According to the invention, a weight column composed only of non-zero weights is formed from each convolution kernel after its zero-valued weights are removed, and the non-zero weights in the weight column are dynamically distributed to the multiplier-adders in order, so that each distribution pass performs the maximum possible amount of computation of the weight column. The accelerator can therefore cope with the irregularity of network calculation after pruning and effectively skip the multiply-accumulate operations whose weight is zero, thereby reducing the total amount of computation and accelerating neural network operation.
Description
Technical Field
The application belongs to the field of deep learning, and particularly relates to a convolutional neural network accelerator supporting sparse pruning and designed based on an FPGA.
Background
In recent years, artificial intelligence has become popular worldwide; computer vision and speech recognition are its most widely deployed applications and have brought great convenience to daily life. Both rely mainly on deep neural networks for inference, because deep neural networks achieve very high accuracy.
Generally speaking, the deeper the network and the more parameters it has, the more accurate its inference. At the same time, deeper networks and more parameters mean that more computing and storage resources are consumed. Many of these parameters, however, contribute little to the final output and are effectively redundant.
A convolutional neural network is a kind of deep neural network: it inherits both the high accuracy and the huge computational cost of deep networks, so reducing its computational load has long been a popular research direction in artificial intelligence. Convolution accounts for more than 90% of the total operations of a convolutional neural network, and this cost is proportional to the number of parameters; pruning can remove redundant parameters while keeping accuracy essentially unchanged, and it accelerates the network by reducing the total amount of computation. However, the irregularity of the pruned network increases the complexity of a hardware accelerator, so existing accelerator designs cannot fully exploit the speedup offered by sparse weights; their acceleration is poor and their computing resources are under-utilized.
Disclosure of Invention
The convolutional neural network accelerator supporting sparse pruning based on FPGA design provided by the application can cope with the irregularity of network calculation after pruning and effectively skips multiply-accumulate operations whose weight is zero, thereby reducing the total amount of computation and accelerating neural network operation.
In order to achieve the purpose, the technical scheme adopted by the application is as follows:
The application provides a convolutional neural network accelerator supporting sparse pruning based on FPGA design. The accelerator comprises a bus interface unit, a weight storage and management unit, an input buffer FIFO array, a MAC calculation array, an intermediate result cache RAM array, a pipeline adder array and an output buffer and management unit, wherein:
the bus interface unit is used for transmitting feature-map columns from the DDR memory to the input buffer FIFO array, a feature-map column being composed of a block of data on the feature map equal in size to a convolution kernel, arranged according to a preset rule;
the input buffer FIFO array comprises N FIFOs; each FIFO receives a different feature-map column output by the bus interface unit and outputs the data to be calculated in that column to the MAC calculation array on a first-in first-out basis;
the weight storage and management unit is used for outputting a weight column to the MAC calculation array; a weight column is composed of the non-zero weights at the same position in the same channel of all convolution kernels, and each non-zero weight carries an index value equal to the number of its convolution kernel;
the MAC calculation array comprises one MAC group per FIFO; each MAC group comprises a plurality of multiplier-adders and receives the weight column output by the weight storage and management unit and the data to be calculated output by its corresponding FIFO; each multiplier-adder in a MAC group receives in turn one non-zero weight of the weight column, calculates the product of that non-zero weight and the data to be calculated, and outputs the result to the intermediate result cache RAM array;
the intermediate result cache RAM array comprises one RAM group per MAC group; each RAM group comprises a plurality of intermediate cache RAMs, and each intermediate cache RAM receives the multiply-accumulate result output by its corresponding multiplier-adder; the address at which a multiply-accumulate result is stored is the index value of the non-zero weight used to calculate it, and the multiply-accumulate result is obtained by adding the product calculated by the multiplier-adder in the current pass to the intermediate result stored at the same address of the intermediate cache RAM corresponding to that multiplier-adder;
the pipeline adder array comprises one adder group per RAM group; each adder group comprises a plurality of adders and sums the intermediate results stored in its corresponding RAM group; the summed results are output to the output buffer and management unit as convolution operation results;
the output buffer and management unit comprises a plurality of output buffer RAMs, one per adder group; each output buffer RAM stores the convolution operation results output by its corresponding adder group.
Preferably, the numbers of FIFOs, MAC groups, RAM groups, adder groups and output buffer RAMs are all N, where N denotes the parallelism, and the FIFOs, MAC groups, RAM groups, adder groups and output buffer RAMs are in one-to-one correspondence;
each MAC group comprises M multiplier-adders, M being a power of 2, and each RAM group comprises 2M intermediate cache RAMs, each multiplier-adder being connected to two of them; the multiply-accumulate result output by a multiplier-adder is cached as an intermediate result in one of its two intermediate cache RAMs, the intermediate results calculated for the same feature-map column are stored in the same intermediate cache RAM, and the two intermediate cache RAMs store in a ping-pong working mode.
Preferably, each adder group comprises P adders, where P = 1 + 2 + 4 + … + 2^q, q = q1 − 1 and q1 = log2(M); the summation operation starts after the MAC group corresponding to the adder group has finished the operation of one feature-map column with all the weights of the convolution kernels corresponding to that column, and the summation operation is performed in a pipelined manner.
Preferably, the summation operation performed in a pipelined manner comprises:
taking the data at one common address of the M intermediate cache RAMs to obtain M data;
taking the M data as the data to be accumulated, grouping them in pairs and inputting the pairs into 2^q adders to obtain M/2 data; taking the M/2 data as the new data to be accumulated, grouping them in pairs and inputting them into 2^(q-1) adders; repeating this until the data to be accumulated, grouped in pairs, are input into a single (2^0) adder, whose output is the final convolution operation result of the convolution kernel corresponding to that address with the feature-map column;
traversing the common addresses of the M intermediate cache RAMs in turn and summing the M data obtained each time, so as to obtain the final convolution operation result of every convolution kernel with the feature-map column.
Preferably, each multiplier-adder in the MAC group receiving in turn a non-zero weight of the weight column, calculating the product of the non-zero weight and the data to be calculated, and outputting the multiply-accumulate result to the intermediate result cache RAM array comprises:
each multiplier-adder in the MAC group receives the data to be calculated and, in turn, one non-zero weight of the weight column; the data to be calculated corresponds to the position of the non-zero weights, and the data to be calculated is the same for every multiplier-adder while the non-zero weights are different;
the multiplier-adder calculates the product of the non-zero weight and the data to be calculated, and reads an intermediate result from the connected intermediate cache RAM that caches the intermediate results of the current feature-map column; the read address is the index value of the non-zero weight used by the multiplier-adder in the current calculation;
the intermediate result and the product result are accumulated to form a new intermediate result, which is stored back into the corresponding intermediate cache RAM at the same address as the original intermediate result, so as to update the intermediate result.
Preferably, the bus interface unit comprises a bus interface unit located on the read side of the FPGA and a bus interface unit located on the write side of the FPGA; the read-side bus interface unit is used for reading the feature map and the convolution kernels from the DDR memory, and the write-side bus interface unit is used for writing the convolution operation results in the output buffer and management unit into the DDR memory.
Preferably, the output buffer and management unit comprises a polling controller and N output buffer RAMs, and the polling controller is configured to make the convolution operation results stored in the N output buffer RAMs be transmitted in turn to the DDR memory through the bus interface unit located on the write side of the FPGA.
In the convolutional neural network accelerator supporting sparse pruning based on FPGA design of the application, a weight column composed only of non-zero weights is formed from each convolution kernel after its zero-valued weights are removed, and the non-zero weights in the weight column are dynamically distributed to the multiplier-adders in order, so that each distribution pass performs the maximum possible amount of computation of the weight column. The accelerator can therefore cope with the irregularity of network calculation after pruning, effectively skip the multiply-accumulate operations whose weight is zero, reduce the total amount of computation and accelerate neural network operation.
Drawings
FIG. 1 is a schematic structural diagram of a convolutional neural network accelerator supporting sparse pruning based on FPGA design according to the present application;
FIG. 2 is a schematic diagram of a calculation method of the convolutional neural network accelerator according to the present application;
FIG. 3 is a schematic diagram illustrating a data storage manner of a feature map in embodiment 1 of the present application;
FIG. 4 is a schematic diagram of a weight data storage method in embodiment 1 of the present application;
FIG. 5 is a diagram illustrating the operation of a multiplier-adder according to embodiment 1 of the present application;
FIG. 6 is a schematic diagram of the operation of the pipeline adder array according to embodiment 1 of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1, one embodiment of the present invention provides a convolutional neural network accelerator supporting sparse pruning and designed based on an FPGA, so as to handle the irregular computation that follows sparse pruning and accelerate neural network operation.
The convolutional neural network accelerator supporting sparse pruning based on the FPGA design of this embodiment includes a bus interface unit, a weight storage and management unit, an input buffer FIFO array, an MAC calculation array, an intermediate result buffer RAM array, a pipeline adder array, and an output buffer and management unit, where:
1) Bus interface unit: the bus interface unit is used for transmitting feature-map columns from the DDR memory to the input buffer FIFO array; a feature-map column is composed of a block of data on the feature map equal in size to a convolution kernel, arranged according to a preset rule. As shown in fig. 2, the preset rule in this embodiment is to take the kernel-sized blocks of data from top left to bottom right and arrange each of them into one column in channel order (a sketch of this column layout is given at the end of this subsection).
The bus interface unit is a generic AXI Master interface used for data transfer between the DDR memory and the FPGA chip, covering both the weights and the feature map. One burst of feature-map data carries the pixel at the same position of all channels; the weights are burst one layer at a time.
To keep the structure complete, in one embodiment the bus interface unit comprises a bus interface unit located on the read side of the FPGA and a bus interface unit located on the write side of the FPGA. The read-side bus interface unit is used for reading the feature map and the convolution kernels, i.e. the weights in the convolution kernels, from the DDR memory; the write-side bus interface unit is used for writing the summation results output by the output buffer and management unit into the DDR memory.
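By way of illustration only (not part of the patent disclosure), a minimal Python sketch of the preset rule of fig. 2, assuming the feature map is held as an H × W × C array and that the channels of each pixel stay together, as the burst format above suggests:

```python
import numpy as np

def feature_map_columns(fmap, kh, kw, stride=1):
    """Slide a kh x kw window over fmap (H x W x C) from top-left to bottom-right
    and flatten each window into one column (channels of a pixel kept together)."""
    H, W, C = fmap.shape
    cols = []
    for r in range(0, H - kh + 1, stride):
        for c in range(0, W - kw + 1, stride):
            cols.append(fmap[r:r + kh, c:c + kw, :].ravel())
    return np.stack(cols, axis=1)            # one column per output pixel

# 8x8 feature map with 3 channels, 3x3 kernel, stride 1 -> 36 columns of 27 values
print(feature_map_columns(np.zeros((8, 8, 3)), 3, 3).shape)   # (27, 36)
```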
2) The input buffer FIFO array comprises N FIFOs; each FIFO receives a different feature-map column output by the bus interface unit and outputs the data to be calculated in that column to the MAC calculation array on a first-in first-out basis; a read is enabled only when all FIFOs are non-empty.
3) The weight storage and management unit is used for caching, reading and distributing the weight data, and finally outputs a weight column to the MAC calculation array; a weight column is composed of the non-zero weights at the same position in the same channel of all convolution kernels, and each non-zero weight carries an index value equal to the number of its convolution kernel.
In one embodiment, all weight columns are stored contiguously in a chained structure, the storage format being non-zero weight + index value + end flag; the index value is the position of the non-zero weight among all convolution kernels, i.e. the number of its convolution kernel, and the end flag separates different weight columns. Weights are read continuously by checking the end flag stored at the current address to decide whether the weight at the next address should be read immediately (see the sketch after the next paragraph).
The weight storage and management unit triggers the reading of weights with the read enable of the input buffer FIFO array and stops after a valid end flag has been read. After reading, it distributes the weights in their stored order; only non-zero weights are distributed, and zero-valued weights are never computed.
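A minimal sketch (an illustrative software model, not the on-chip layout; Python dictionaries stand in for RAM words, and field widths are not modelled) of the non-zero weight + index value + end flag chained format described above:

```python
def encode_weight_column(kernel_weights):
    """kernel_weights[k] is the weight at one (channel, position) of kernel k.
    Keep only non-zero weights; the index value is the kernel number; the last
    entry of the column carries end_flag = 1 so the reader knows where to stop."""
    entries = [{"weight": w, "index": k, "end_flag": 0}
               for k, w in enumerate(kernel_weights) if w != 0]
    if entries:
        entries[-1]["end_flag"] = 1
    return entries

def read_weight_column(memory, start_addr):
    """Read entries from consecutive addresses until a valid end flag is seen."""
    addr, column = start_addr, []
    while True:
        entry = memory[addr]
        column.append(entry)
        if entry["end_flag"]:
            return column, addr + 1          # the next column starts right after
        addr += 1

# 32 kernels; only kernels 3, 10 and 25 keep a non-zero weight at this position
col = encode_weight_column([0] * 3 + [0.5] + [0] * 6 + [-1.2] + [0] * 14 + [0.8] + [0] * 6)
memory = dict(enumerate(col))
print(read_weight_column(memory, 0)[0])
```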
4) The MAC calculation array comprises one MAC group per FIFO; each MAC group comprises a plurality of multiplier-adders and receives the weight column output by the weight storage and management unit and the data to be calculated output by its corresponding FIFO; each multiplier-adder in a MAC group receives in turn one non-zero weight of the weight column, calculates the product of that non-zero weight and the data to be calculated, and outputs the result to the intermediate result cache RAM array.
5) The intermediate result cache RAM array comprises one RAM group per MAC group; each RAM group comprises a plurality of intermediate cache RAMs, and each intermediate cache RAM receives the multiply-accumulate result output by its corresponding multiplier-adder. The address at which a multiply-accumulate result is stored is the index value of the non-zero weight used to calculate it, and the multiply-accumulate result is obtained by adding the product calculated by the multiplier-adder in the current pass to the intermediate result stored at the same address of the intermediate cache RAM corresponding to that multiplier-adder.
For a single convolution kernel, convolution is a sequence of multiply-accumulate operations. After sparse pruning, the number of weights in a weight column is generally not equal to the number of convolution kernels, so when the weight storage and management unit distributes the weights in their stored order, the weights assigned to a given multiplier-adder may come from different convolution kernels in different passes. In other words, the weights are distributed to the multiplier-adders dynamically, which guarantees that every pass is a maximum-efficiency calculation.
To cope with this dynamic weight distribution, this embodiment implements the multiply-accumulate operation with the help of the intermediate cache RAM as follows (a behavioural sketch follows the three steps below):
each multiplier-adder in the MAC group receives data to be calculated and sequentially receives a non-zero weight in the weight list, the data to be calculated corresponds to the position of each non-zero weight, and the data to be calculated in each multiplier-adder is the same and has different non-zero weights.
And calculating the product of the nonzero weight and the data to be calculated, and reading an intermediate result from a connected intermediate cache RAM, wherein the intermediate cache RAM is used for caching the intermediate result calculated by the current characteristic diagram column, and the address of the read intermediate result is an index value corresponding to the nonzero weight used by the multiplier-adder in the current calculation.
And accumulating the intermediate result and the product result to serve as a new intermediate result and storing the new intermediate result into a corresponding intermediate cache RAM, wherein the address stored by the new intermediate result is the same as the address of the original intermediate result so as to update the intermediate result.
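The following behavioural sketch (an assumed Python model, not RTL) illustrates the three steps above: every multiplier-adder in a pass receives the same input datum but a different non-zero weight, and the partial sum lives at the RAM address given by the weight's index value, so it does not matter which multiplier-adder happens to serve which convolution kernel:

```python
def mac_pass(data, weight_column, intermediate_ram, num_macs):
    """One distribution pass: hand out up to num_macs non-zero weights, multiply
    each by the shared input datum and accumulate at address = index value."""
    batch, rest = weight_column[:num_macs], weight_column[num_macs:]
    for entry in batch:                      # each iteration = one multiplier-adder
        addr = entry["index"]                # index value of the non-zero weight
        product = entry["weight"] * data
        intermediate_ram[addr] = intermediate_ram.get(addr, 0) + product
    return rest                              # weights left for the next pass

# 8 multiplier-adders per MAC group; keep distributing until the column is empty
ram = {}
column = [{"weight": 0.5, "index": 3}, {"weight": -1.2, "index": 10},
          {"weight": 0.8, "index": 25}]
data = 2.0
while column:
    column = mac_pass(data, column, ram, num_macs=8)
print(ram)   # {3: 1.0, 10: -2.4, 25: 1.6}
```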
6) The pipeline adder array comprises one adder group per RAM group; each adder group comprises a plurality of adders and sums the intermediate results stored in its corresponding RAM group; the summed results are output to the output buffer and management unit as convolution operation results.
7) The output buffer and management unit comprises a plurality of output buffer RAMs, one per adder group; each output buffer RAM stores the convolution operation results output by its corresponding adder group.
To improve the efficiency of parallel computation, in one embodiment the numbers of FIFOs, MAC groups, RAM groups, adder groups and output buffer RAMs are all set to N, where N denotes the parallelism, and the FIFOs, MAC groups, RAM groups, adder groups and output buffer RAMs are in one-to-one correspondence.
Each MAC group comprises M multiplier-adders, M being a power of 2, and each RAM group comprises 2M intermediate cache RAMs, each multiplier-adder being connected to two of them; the multiply-accumulate result output by a multiplier-adder is cached as an intermediate result in one of its two intermediate cache RAMs, the intermediate results calculated for the same feature-map column are stored in the same intermediate cache RAM, and the two intermediate cache RAMs store in a ping-pong working mode.
In this embodiment, the two intermediate cache RAMs attached to a multiplier-adder are swapped after the calculation of one feature-map column has been completed. When switching the intermediate cache RAM used for storage, the MAC group first checks whether the target intermediate cache RAM is in a continuous-read state (continuous reading here means being read sequentially from the start address to the end address); if it is, the MAC group waits, otherwise it starts reading and writing the target intermediate cache RAM.
In this embodiment, every address of every intermediate cache RAM has a flag bit indicating whether the data stored at that address is valid; the flag is set valid after the address has been written.
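A compact sketch of the ping-pong arrangement (an assumed software model; the `reading` flags stand in for the continuous-read state of the target RAM, and presence of an address in the dictionary stands in for the valid flag):

```python
class PingPongRam:
    """Two intermediate cache RAMs per multiplier-adder: one accumulates the
    current feature-map column while the other is drained by the adder group."""

    def __init__(self):
        self.banks = [dict(), dict()]     # address -> partial sum (present = valid)
        self.reading = [False, False]     # True while the adder group drains a bank
        self.write_bank = 0

    def accumulate(self, addr, product):
        bank = self.banks[self.write_bank]
        bank[addr] = bank.get(addr, 0) + product

    def switch(self):
        """Called after one feature-map column is finished. Returns False (the MAC
        group must wait) while the target bank is still being read out."""
        target = 1 - self.write_bank
        if self.reading[target]:
            return False
        self.banks[target].clear()        # its old results were already summed
        self.write_bank = target
        return True
```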
In one embodiment, each adder group comprises P adders, where P = 1 + 2 + 4 + … + 2^q, q = q1 − 1 and q1 = log2(M); the summation operation starts after the MAC group corresponding to the adder group has finished the operation of one feature-map column with all the weights of the convolution kernels corresponding to that column, and the summation operation is performed in a pipelined manner.
Further, the pipelined summation operation proceeds as follows:
The data at one common address of the M intermediate cache RAMs are taken, giving M data.
The M data are taken as the data to be accumulated, grouped in pairs and input into 2^q adders, giving M/2 data; the M/2 data are taken as the new data to be accumulated, grouped in pairs and input into 2^(q-1) adders; this is repeated until the data to be accumulated, grouped in pairs, are input into a single (2^0) adder, whose output is the final convolution operation result of the convolution kernel corresponding to that address with the feature-map column.
The common addresses of the M intermediate cache RAMs are traversed in turn, and the M data obtained each time are summed, giving the final convolution operation result of every convolution kernel with the feature-map column.
During this loop, for example, the M data are added in pairs at stage pipe0, the M/2 results are added in pairs at stage pipe1, the M/4 results at stage pipe2, and so on, until the final sum is output. Once the data of an address have been added up, the adder group outputs the sum directly to the output buffer RAM.
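A minimal sketch of this pairwise reduction (a software model under the stated assumptions; the pipe0, pipe1, … stages are shown as successive list passes rather than true pipeline registers):

```python
from math import log2

def adder_count(m):
    """P = 1 + 2 + 4 + ... + 2**q adders per adder group, with q = log2(m) - 1."""
    q = int(log2(m)) - 1
    return 2 ** (q + 1) - 1               # equals m - 1

def adder_group_sum(values):
    """Pairwise reduction of the M partial sums read at one common address."""
    assert len(values) & (len(values) - 1) == 0, "M must be a power of two"
    stage = list(values)
    while len(stage) > 1:                 # each loop iteration = one pipe stage
        stage = [stage[i] + stage[i + 1] for i in range(0, len(stage), 2)]
    return stage[0]

# M = 8 intermediate cache RAMs -> 7 adders; the sum for one address
print(adder_count(8), adder_group_sum([1, 2, 3, 4, 5, 6, 7, 8]))   # 7 36
```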
To guarantee the orderly output of data, the output buffer and management unit comprises a polling controller and N output buffer RAMs. The polling controller makes the convolution operation results stored in the N output buffer RAMs be transmitted in turn to the DDR memory through the bus interface unit located on the write side of the FPGA; its output granularity is all the data stored in one output buffer RAM for the pixel at the same position of all output channels, output in a single transfer.
To facilitate an understanding of the present application, further description is provided below by way of examples.
Example 1
In this embodiment, the input buffer FIFO array comprises 4 FIFOs, the MAC calculation array comprises 4 × 8 multiplier-adders and the intermediate result cache RAM array comprises 4 × 16 intermediate cache RAMs. Taking an input picture of 8 × 8 with 3 channels, a convolution kernel size of 3 × 3, 32 convolution kernels and a stride of 1 as an example, the accelerator of this embodiment operates as follows:
the host writes the data of the feature map into the region 0 of the DDR memory in the manner shown in fig. 3, and writes the weight data into the DDR region 1 in the manner shown in fig. 4, and if there is enough RAM space on the FPGA chip to store the pruned weight on the FPGA chip, the region 1 may be omitted.
The bus interface unit first reads the weight data from external DDR region 1 into a RAM inside the accelerator, bursting one layer of weight data at a time; if the weights are stored on chip, this step can be omitted. It then reads the feature-map data from external DDR region 0 into the accelerator. Since there are 4 input buffer FIFOs and the stride is 1, 4 pixels are read at a time (pixel points 0 to 3, each containing 3 channels): the data of pixel point 0 is written into FIFO0, the data of pixel point 1 into FIFOs 1 and 0, and so on, up to the data of pixel point 3, which is written into FIFOs 3, 2 and 1. The read order is the 4 points of row 1, the 4 points of row 2, then the 4 points of row 3; after that, pixel points 4 to 7 (each containing 3 channels) are read, again in the order row 1, row 2, row 3.
It should be noted that, when reading pixels, the number of rows to be read is determined by the size of the convolution kernel, and which FIFOs the data of a pixel point in a row is written into is determined jointly by the convolution kernel size and the stride.
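A small sketch of that mapping (an assumed model, not taken from the patent): a pixel at column c of the current row belongs to every sliding window whose starting column x satisfies x ≤ c < x + kw, and is therefore written into the FIFOs assigned to those windows:

```python
def fifos_for_pixel(col, kernel_w, stride, num_fifos):
    """Return the FIFO numbers (window indices modulo the FIFO count) that the
    pixel at column `col` must be written into, for the given kernel width
    and stride."""
    fifos = []
    for window in range((col // stride) + 1):
        start = window * stride
        if start <= col < start + kernel_w:
            fifos.append(window % num_fifos)
    return fifos

# 3x3 kernel, stride 1, 4 FIFOs: pixel 0 -> FIFO 0, pixel 1 -> FIFOs 0 and 1,
# pixel 3 -> FIFOs 1, 2, 3 (matching the embodiment above)
for c in range(4):
    print(c, fifos_for_pixel(c, kernel_w=3, stride=1, num_fifos=4))
```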
When none of the 4 FIFOs of the input buffer FIFO array is empty, each FIFO is read once and its data is transmitted to the corresponding one of the 4 MAC groups; at the same time, the weight storage and management unit reads the weight column corresponding to that pixel data and distributes it to all MAC groups, so every MAC group in the MAC calculation array shares one set of weights. This is repeated until the convolution operation is finished.
Each time the weight storage and management unit reads one weight column, it distributes it to all MAC groups. If, after pruning, a weight column has length 16 (16 non-zero weights), it is distributed to the MAC groups for multiply-accumulate calculation in two passes, with 8 weights handed to the 8 multiplier-adders in each pass. Without pruning, the weight column would have length 32 (since there are 32 convolution kernels) and 8 weights would have to be distributed to the 8 multiplier-adders 4 times. The accelerator of this embodiment therefore supports pruning by skipping the operations whose weight is zero, which raises the calculation speed of the convolutional neural network.
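The pass counts above follow directly from the column length and the number of multiplier-adders per MAC group; a one-line check (illustrative only):

```python
from math import ceil

def mac_passes(column_length, macs_per_group=8):
    """Number of times the weight column must be distributed to one MAC group."""
    return ceil(column_length / macs_per_group)

print(mac_passes(16), mac_passes(32))   # 2 passes after pruning vs 4 without
```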
The operation of a multiplier-adder in the MAC calculation array is shown in fig. 5: the multiplication is followed by an addition of the product and the intermediate result stored at the corresponding address of the intermediate cache RAM, and the sum is written back to that address as the new intermediate result; the storage address in the intermediate cache RAM is the index value of the weight. This scheme raises the utilization of the multiplier-adders because a multiplier-adder does not need to be bound to a convolution kernel; once bound, the multiplier-adders corresponding to kernels whose weight is zero would inevitably sit idle and waste computing resources.
When the multiply-accumulate operations of one weight column with one feature-map column are complete, the pipeline adder array is started to sum the data stored in the group of (8) intermediate cache RAMs connected to it. Meanwhile, the MAC calculation array keeps working without stalling, and the subsequent intermediate results are cached in the other group of intermediate cache RAMs; this is why each multiplier-adder corresponds to 2 intermediate cache RAMs, and it noticeably improves the efficiency of the convolution operation.
The data in the 8 intermediate cache RAMs are read simultaneously, starting from address 0, until all of their contents have been output to the corresponding adder group.
The pipeline adder array works in the manner of fig. 6: the 8 data at the same address of the 8 intermediate cache RAMs are taken; during accumulation the 8 data are added in pairs at stage pipe0, the 4 results are added in pairs at stage pipe1, and the remaining 2 are added at stage pipe2, giving the final convolution operation result for that address, i.e. one pixel of one output channel.
The output buffer and management unit receives the convolution operation results output by the 4 adder groups and outputs them to the external DDR through the bus interface unit in the order of the output buffer RAMs; each output buffer RAM transfers 32 values at a time (the number of convolution kernels being 32), i.e. the pixel at the same position of all output channels.
The input picture size, the convolution kernel size and number, and the stride mentioned above are all configurable for the accelerator proposed by the present application.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations are described, but any combination that contains no contradiction should be considered within the scope of this specification.
The above-mentioned embodiments only express several embodiments of the present application, and although their description is specific and detailed, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (7)
1. A convolutional neural network accelerator supporting sparse pruning based on FPGA design, characterized in that it comprises a bus interface unit, a weight storage and management unit, an input buffer FIFO array, a MAC calculation array, an intermediate result cache RAM array, a pipeline adder array and an output buffer and management unit, wherein:
the bus interface unit is used for transmitting feature-map columns from the DDR memory to the input buffer FIFO array, a feature-map column being composed of a block of data on the feature map equal in size to a convolution kernel, arranged according to a preset rule;
the input buffer FIFO array comprises N FIFOs, each FIFO receiving a different feature-map column output by the bus interface unit and outputting the data to be calculated in that column to the MAC calculation array on a first-in first-out basis;
the weight storage and management unit is used for outputting a weight column to the MAC calculation array, the weight column being composed of the non-zero weights at the same position in the same channel of all convolution kernels, each non-zero weight carrying an index value equal to the number of its convolution kernel;
the MAC calculation array comprises one MAC group per FIFO, each MAC group comprising a plurality of multiplier-adders and receiving the weight column output by the weight storage and management unit and the data to be calculated output by its corresponding FIFO; each multiplier-adder in a MAC group receives in turn one non-zero weight of the weight column, calculates the product of that non-zero weight and the data to be calculated, and outputs the result to the intermediate result cache RAM array;
the intermediate result cache RAM array comprises one RAM group per MAC group, each RAM group comprising a plurality of intermediate cache RAMs; each intermediate cache RAM receives the multiply-accumulate result output by its corresponding multiplier-adder, the address at which a multiply-accumulate result is stored being the index value of the non-zero weight used to calculate it, and the multiply-accumulate result being obtained by adding the product calculated by the multiplier-adder in the current pass to the intermediate result stored at the same address of the intermediate cache RAM corresponding to that multiplier-adder;
the pipeline adder array comprises one adder group per RAM group, each adder group comprising a plurality of adders and being used for summing the intermediate results stored in its corresponding RAM group, the summed results being output to the output buffer and management unit as convolution operation results;
the output buffer and management unit comprises a plurality of output buffer RAMs, one per adder group, each output buffer RAM being used for storing the convolution operation results output by its corresponding adder group.
2. The convolutional neural network accelerator supporting sparse pruning based on FPGA design of claim 1, wherein the numbers of FIFOs, MAC groups, RAM groups, adder groups and output buffer RAMs are all N, N denoting the parallelism, and the FIFOs, MAC groups, RAM groups, adder groups and output buffer RAMs are in one-to-one correspondence;
each MAC group comprises M multiplier-adders, M being a power of 2, and each RAM group comprises 2M intermediate cache RAMs, each multiplier-adder being connected to two of them; the multiply-accumulate result output by a multiplier-adder is cached as an intermediate result in one of its two intermediate cache RAMs, the intermediate results calculated for the same feature-map column are stored in the same intermediate cache RAM, and the two intermediate cache RAMs store in a ping-pong working mode.
3. The convolutional neural network accelerator supporting sparse pruning based on FPGA design of claim 2, wherein each adder group comprises P adders, P = 1 + 2 + 4 + … + 2^q, where q = q1 − 1 and q1 = log2(M); the summation operation starts after the MAC group corresponding to the adder group has finished the operation of one feature-map column with all the weights of the convolution kernels corresponding to that column, and the summation operation is performed in a pipelined manner.
4. The convolutional neural network accelerator supporting sparse pruning based on FPGA design of claim 3, wherein the summation operation performed in a pipelined manner comprises:
taking the data at one common address of the M intermediate cache RAMs to obtain M data;
taking the M data as the data to be accumulated, grouping them in pairs and inputting the pairs into 2^q adders to obtain M/2 data; taking the M/2 data as the new data to be accumulated, grouping them in pairs and inputting them into 2^(q-1) adders; repeating this until the data to be accumulated, grouped in pairs, are input into a single (2^0) adder, whose output is the final convolution operation result of the convolution kernel corresponding to that address with the feature-map column;
traversing the common addresses of the M intermediate cache RAMs in turn and summing the M data obtained each time, so as to obtain the final convolution operation result of every convolution kernel with the feature-map column.
5. The convolutional neural network accelerator supporting sparse pruning based on FPGA design of claim 3, wherein each multiplier-adder in the MAC group receiving in turn a non-zero weight of the weight column, calculating the product of the non-zero weight and the data to be calculated, and outputting the multiply-accumulate result to the intermediate result cache RAM array comprises:
each multiplier-adder in the MAC group receives the data to be calculated and, in turn, one non-zero weight of the weight column, the data to be calculated corresponding to the position of the non-zero weights, the data to be calculated being the same for every multiplier-adder and the non-zero weights being different;
the multiplier-adder calculates the product of the non-zero weight and the data to be calculated, and reads an intermediate result from the connected intermediate cache RAM that caches the intermediate results of the current feature-map column, the read address being the index value of the non-zero weight used by the multiplier-adder in the current calculation;
the intermediate result and the product result are accumulated to form a new intermediate result, which is stored back into the corresponding intermediate cache RAM at the same address as the original intermediate result, so as to update the intermediate result.
6. The convolutional neural network accelerator supporting sparse pruning based on FPGA design of claim 2, wherein the bus interface unit comprises a bus interface unit located on the read side of the FPGA and a bus interface unit located on the write side of the FPGA, the read-side bus interface unit being used for reading the feature map and the convolution kernels from the DDR memory, and the write-side bus interface unit being used for writing the convolution operation results in the output buffer and management unit into the DDR memory.
7. The convolutional neural network accelerator supporting sparse pruning based on FPGA design of claim 6, wherein the output buffer and management unit comprises a polling controller and N output buffer RAMs, the polling controller being configured to make the convolution operation results stored in the N output buffer RAMs be transmitted in turn to the DDR memory through the bus interface unit located on the write side of the FPGA.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911383518.7A CN111242277B (en) | 2019-12-27 | 2019-12-27 | Convolutional neural network accelerator supporting sparse pruning based on FPGA design |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911383518.7A CN111242277B (en) | 2019-12-27 | 2019-12-27 | Convolutional neural network accelerator supporting sparse pruning based on FPGA design |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111242277A true CN111242277A (en) | 2020-06-05 |
CN111242277B CN111242277B (en) | 2023-05-05 |
Family
ID=70874120
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911383518.7A Active CN111242277B (en) | 2019-12-27 | 2019-12-27 | Convolutional neural network accelerator supporting sparse pruning based on FPGA design |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111242277B (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111553471A (en) * | 2020-07-13 | 2020-08-18 | 北京欣奕华数字科技有限公司 | Data analysis processing method and device |
CN111814972A (en) * | 2020-07-08 | 2020-10-23 | 上海雪湖科技有限公司 | Neural network convolution operation acceleration method based on FPGA |
CN111860778A (en) * | 2020-07-08 | 2020-10-30 | 北京灵汐科技有限公司 | Full-additive convolution method and device |
CN112465110A (en) * | 2020-11-16 | 2021-03-09 | 中国电子科技集团公司第五十二研究所 | Hardware accelerator for convolution neural network calculation optimization |
CN112596912A (en) * | 2020-12-29 | 2021-04-02 | 清华大学 | Acceleration operation method and device for convolution calculation of binary or ternary neural network |
CN112668708A (en) * | 2020-12-28 | 2021-04-16 | 中国电子科技集团公司第五十二研究所 | Convolution operation device for improving data utilization rate |
CN112801276A (en) * | 2021-02-08 | 2021-05-14 | 清华大学 | Data processing method, processor and electronic equipment |
CN113138957A (en) * | 2021-03-29 | 2021-07-20 | 北京智芯微电子科技有限公司 | Chip for neural network inference and method for accelerating neural network inference |
CN113592075A (en) * | 2021-07-28 | 2021-11-02 | 浙江芯昇电子技术有限公司 | Convolution operation device, method and chip |
CN113688976A (en) * | 2021-08-26 | 2021-11-23 | 哲库科技(上海)有限公司 | Neural network acceleration method, device, equipment and storage medium |
CN114037857A (en) * | 2021-10-21 | 2022-02-11 | 中国科学院大学 | Image classification precision improving method |
US11488664B2 (en) | 2020-10-13 | 2022-11-01 | International Business Machines Corporation | Distributing device array currents across segment mirrors |
CN117273101A (en) * | 2020-06-30 | 2023-12-22 | 墨芯人工智能科技(深圳)有限公司 | Method and system for balanced weight sparse convolution processing |
WO2024168514A1 (en) * | 2023-02-14 | 2024-08-22 | 北京大学 | Data processing method and apparatus applied to in-memory computing chip, and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107239824A (en) * | 2016-12-05 | 2017-10-10 | 北京深鉴智能科技有限公司 | Apparatus and method for realizing sparse convolution neutral net accelerator |
US20180046897A1 (en) * | 2016-08-12 | 2018-02-15 | Beijing Deephi Intelligence Technology Co., Ltd. | Hardware accelerator for compressed rnn on fpga |
CN107977704A (en) * | 2017-11-10 | 2018-05-01 | 中国科学院计算技术研究所 | Weighted data storage method and the neural network processor based on this method |
US20180129935A1 (en) * | 2016-11-07 | 2018-05-10 | Electronics And Telecommunications Research Institute | Convolutional neural network system and operation method thereof |
CN110378468A (en) * | 2019-07-08 | 2019-10-25 | 浙江大学 | A kind of neural network accelerator quantified based on structuring beta pruning and low bit |
- 2019-12-27: CN application CN201911383518.7A granted as patent CN111242277B (status: active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180046897A1 (en) * | 2016-08-12 | 2018-02-15 | Beijing Deephi Intelligence Technology Co., Ltd. | Hardware accelerator for compressed rnn on fpga |
US20180129935A1 (en) * | 2016-11-07 | 2018-05-10 | Electronics And Telecommunications Research Institute | Convolutional neural network system and operation method thereof |
CN107239824A (en) * | 2016-12-05 | 2017-10-10 | 北京深鉴智能科技有限公司 | Apparatus and method for realizing sparse convolution neutral net accelerator |
CN107977704A (en) * | 2017-11-10 | 2018-05-01 | 中国科学院计算技术研究所 | Weighted data storage method and the neural network processor based on this method |
CN110378468A (en) * | 2019-07-08 | 2019-10-25 | 浙江大学 | A kind of neural network accelerator quantified based on structuring beta pruning and low bit |
Non-Patent Citations (3)
Title |
---|
LIQIANG LU, ET AL.: "An efficient hardware accelerator for sparse convolutional neural networks on FPGAs" *
LIU QINRANG ET AL.: "Computation optimization of convolutional neural networks exploiting parameter sparsity and its FPGA accelerator design" *
ZHOU XUDA: "Research on sparse neural networks and sparse neural network accelerators" *
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117273101B (en) * | 2020-06-30 | 2024-05-24 | 墨芯人工智能科技(深圳)有限公司 | Method and system for balanced weight sparse convolution processing |
CN117273101A (en) * | 2020-06-30 | 2023-12-22 | 墨芯人工智能科技(深圳)有限公司 | Method and system for balanced weight sparse convolution processing |
CN111814972A (en) * | 2020-07-08 | 2020-10-23 | 上海雪湖科技有限公司 | Neural network convolution operation acceleration method based on FPGA |
CN111860778A (en) * | 2020-07-08 | 2020-10-30 | 北京灵汐科技有限公司 | Full-additive convolution method and device |
CN111814972B (en) * | 2020-07-08 | 2024-02-02 | 上海雪湖科技有限公司 | Neural network convolution operation acceleration method based on FPGA |
CN111553471A (en) * | 2020-07-13 | 2020-08-18 | 北京欣奕华数字科技有限公司 | Data analysis processing method and device |
US11488664B2 (en) | 2020-10-13 | 2022-11-01 | International Business Machines Corporation | Distributing device array currents across segment mirrors |
CN112465110B (en) * | 2020-11-16 | 2022-09-13 | 中国电子科技集团公司第五十二研究所 | Hardware accelerator for convolution neural network calculation optimization |
CN112465110A (en) * | 2020-11-16 | 2021-03-09 | 中国电子科技集团公司第五十二研究所 | Hardware accelerator for convolution neural network calculation optimization |
CN112668708A (en) * | 2020-12-28 | 2021-04-16 | 中国电子科技集团公司第五十二研究所 | Convolution operation device for improving data utilization rate |
CN112668708B (en) * | 2020-12-28 | 2022-10-14 | 中国电子科技集团公司第五十二研究所 | Convolution operation device for improving data utilization rate |
CN112596912B (en) * | 2020-12-29 | 2023-03-28 | 清华大学 | Acceleration operation method and device for convolution calculation of binary or ternary neural network |
CN112596912A (en) * | 2020-12-29 | 2021-04-02 | 清华大学 | Acceleration operation method and device for convolution calculation of binary or ternary neural network |
CN112801276A (en) * | 2021-02-08 | 2021-05-14 | 清华大学 | Data processing method, processor and electronic equipment |
CN113138957A (en) * | 2021-03-29 | 2021-07-20 | 北京智芯微电子科技有限公司 | Chip for neural network inference and method for accelerating neural network inference |
CN113592075A (en) * | 2021-07-28 | 2021-11-02 | 浙江芯昇电子技术有限公司 | Convolution operation device, method and chip |
CN113592075B (en) * | 2021-07-28 | 2024-03-08 | 浙江芯昇电子技术有限公司 | Convolution operation device, method and chip |
CN113688976A (en) * | 2021-08-26 | 2021-11-23 | 哲库科技(上海)有限公司 | Neural network acceleration method, device, equipment and storage medium |
CN114037857B (en) * | 2021-10-21 | 2022-09-23 | 中国科学院大学 | Image classification precision improving method |
CN114037857A (en) * | 2021-10-21 | 2022-02-11 | 中国科学院大学 | Image classification precision improving method |
WO2024168514A1 (en) * | 2023-02-14 | 2024-08-22 | 北京大学 | Data processing method and apparatus applied to in-memory computing chip, and device |
Also Published As
Publication number | Publication date |
---|---|
CN111242277B (en) | 2023-05-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111242277B (en) | Convolutional neural network accelerator supporting sparse pruning based on FPGA design | |
CN109948774B (en) | Neural network accelerator based on network layer binding operation and implementation method thereof | |
US20220012593A1 (en) | Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization | |
CN108805267B (en) | Data processing method for hardware acceleration of convolutional neural network | |
CN112465110B (en) | Hardware accelerator for convolution neural network calculation optimization | |
CN109447241B (en) | Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things | |
CN110516801A (en) | A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput | |
CN109063825A (en) | Convolutional neural networks accelerator | |
CN112418396B (en) | Sparse activation perception type neural network accelerator based on FPGA | |
CN109146065B (en) | Convolution operation method and device for two-dimensional data | |
CN109840585B (en) | Sparse two-dimensional convolution-oriented operation method and system | |
CN109993293B (en) | Deep learning accelerator suitable for heap hourglass network | |
CN115880132B (en) | Graphics processor, matrix multiplication task processing method, device and storage medium | |
CN113869507B (en) | Neural network accelerator convolution calculation device and method based on pulse array | |
CN111583095A (en) | Image data storage method, image data processing system and related device | |
CN110942145A (en) | Convolutional neural network pooling layer based on reconfigurable computing, hardware implementation method and system | |
CN110598844A (en) | Parallel convolution neural network accelerator based on FPGA and acceleration method | |
Shahshahani et al. | Memory optimization techniques for fpga based cnn implementations | |
CN115860080A (en) | Computing core, accelerator, computing method, device, equipment, medium and system | |
CN109948787B (en) | Arithmetic device, chip and method for neural network convolution layer | |
CN112308217B (en) | Convolutional neural network acceleration method and system | |
TWI798591B (en) | Convolutional neural network operation method and device | |
CN111078589B (en) | Data reading system, method and chip applied to deep learning calculation | |
CN110766133B (en) | Data processing method, device, equipment and storage medium in embedded equipment | |
WO2021031154A1 (en) | Method and device for loading feature map of neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |