CN109948774B - Neural network accelerator based on network layer binding operation and implementation method thereof - Google Patents

Info

Publication number
CN109948774B
Authority
CN
China
Prior art keywords
module
characteristic value
layer
convolution
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910070755.1A
Other languages
Chinese (zh)
Other versions
CN109948774A (en)
Inventor
黄立文
谭展宏
陈小柏
虞志益
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201910070755.1A
Publication of CN109948774A
Application granted
Publication of CN109948774B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a neural network accelerator based on network layer binding operation, and an implementation method thereof. The invention merges and packs intermediate layers with large data volumes and selects layers with relatively small data volumes for output, reducing the amount of output data. Off-chip memory needs to be read only once, during the first convolution operation: a single data read suffices to complete the multi-layer bound computation. This effectively reduces off-chip memory usage, reduces data-access operations, improves working efficiency, and reduces redundant state toggling in the circuit, thereby lowering power consumption and cost. The invention can be widely applied in the technical field of deep neural networks.

Description

Neural network accelerator based on network layer binding operation and implementation method thereof
Technical Field
The invention relates to the technical field of deep neural networks, in particular to a neural network accelerator based on network layer binding operation and an implementation method thereof.
Background
Convolutional neural networks are widely applied in the fields of computer vision, speech recognition, natural language processing and the like, owing to their high accuracy and high performance.
Convolutional neural network operation is both computation-intensive and storage-intensive; the large number of computations and data-read operations places a heavy load on the processor. For this reason, convolutional neural network computation has been ported to GPUs, FPGAs and even ASICs. A GPU offers high parallelism but consumes a great deal of power. An ASIC is low-power and high-performance, but its development cycle is long, its cost is high, and its hardware is not easy to reconfigure. An FPGA is a compromise between the GPU and the ASIC: a low-cost, short-development-cycle design platform on which neural network operations can also be specially optimized, achieving high-performance computation.
In a traditional neural network accelerator, each layer is processed iteratively, and the next layer begins only after the current layer has been processed completely. The intermediate-layer data are therefore very large, and off-chip storage must be accessed many times. Such accesses, however, are very costly.
In summary, current neural network accelerators have the following disadvantages:
First, the power consumption overhead is large: the power consumed by off-chip memory accesses is very high.
Second, the time overhead is large: an off-chip access is far from instantaneous, and many cycles pass between issuing the access signal and finally receiving the data.
Third, the economic overhead is large: off-chip memory access brings in DRAM (e.g., DDR). DDR is relatively expensive, and a design that reduces the DDR requirement can greatly reduce the product's cost.
Disclosure of Invention
To solve the above technical problems, the present invention aims to provide a neural network accelerator based on network layer binding operation, and an implementation method thereof, that are low in cost, high in efficiency and low in power consumption.
The technical scheme adopted by one aspect of the invention is as follows:
the neural network accelerator based on the network layer binding operation comprises:
the off-chip memory module is used for storing the picture data acquired from the camera and preset weight parameters;
the weight parameter caching module is used for storing preset weight parameters read from the off-chip storage module;
the characteristic value cache module is used for storing input picture data of the first convolution layer, an output characteristic value of the middle layer and an input characteristic value of the next convolution layer;
the calculation unit array module is used for executing convolution operation according to the input characteristic value and the weight parameter;
the weight parameter register module is used for storing the weight parameters read from the weight parameter cache module;
the characteristic value register module is used for storing the output characteristic value of the middle layer obtained by convolution operation;
the local addition tree module is used for accumulating the partial sums of all the output channels;
the pooling calculation module is used for performing pooling calculation on the result of the convolution operation;
and the output channel addition tree module is used for accumulating the output channels of the computing unit array.
Further, the accelerator also comprises:
a cache control module, used for controlling the weight parameters to be read into the weight parameter register module.
Further, the cache control module is also used for controlling the characteristic value to be read into the characteristic value register module.
Further, the number of the computing unit arrays is 8.
Further, the convolutional neural network comprises a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a fourth convolutional layer, a fifth convolutional layer, a third pooling layer, a first full-connection layer, a second full-connection layer and a third full-connection layer;
wherein the dimension of the input characteristic value of the second convolution layer is 27 × 27 × 96, and the dimension of the output characteristic value of the second convolution layer is 27 × 27 × 256;
the dimension of the input characteristic value of the third convolutional layer is 13 × 13 × 256, and the dimension of the output characteristic value of the third convolutional layer is 13 × 13 × 384;
the input characteristic value dimension of the fourth convolution layer is 15 × 15 × 384, and the output characteristic value dimension of the fourth convolution layer is 13 × 13 × 384;
the input characteristic value dimension of the fifth convolutional layer is 15 × 15 × 384, and the output characteristic value dimension of the fifth convolutional layer is 13 × 13 × 384.
The technical scheme adopted by the other aspect of the invention is as follows:
the implementation method of the neural network accelerator based on the network layer binding operation comprises the following steps:
storing picture data acquired from a camera and preset weight parameters through an off-chip memory module;
storing preset weight parameters read from the off-chip storage module through a weight parameter cache module;
storing the input image data of the first convolution layer, the output characteristic value of the middle layer and the input characteristic value of the next convolution layer through a characteristic value cache module;
performing convolution operation according to the input characteristic value and the weight parameter through the computing unit array;
storing the weight parameters read from the weight parameter cache module through a weight parameter register module;
storing, through a characteristic value register module, the output characteristic value of the middle layer obtained by convolution operation;
accumulating the partial sums of each output channel through a local addition tree module;
performing pooling calculation on the result of the convolution operation through a pooling calculation module;
and accumulating the output channels of the computing unit array through an output channel addition tree module.
Further, the method also comprises the following steps:
controlling the weight parameters to be read into the weight parameter register module through the cache control module.
Further, the method also comprises the following steps:
and the characteristic value is controlled to be read into the characteristic value register module through the cache control module.
Further, the method also comprises the following steps:
splicing four computing unit arrays into a large computing unit array;
performing convolution calculation on the input characteristic values through the large computing unit array obtained by splicing;
performing pooling processing on the result of the convolution calculation to obtain the input feature map of the next convolution layer.
Further, the method also comprises the following steps:
reading input data of the first convolution layer from an off-chip memory;
executing a first convolution operation of the first convolution layer through the computing unit array;
performing first pooling operation on the result of the first convolution operation through a pooling calculation module, and storing the result of the first pooling operation in a characteristic value cache module;
reading input data of the second convolution layer from the characteristic value cache module;
performing first bundling operation on the second convolution operation, the third convolution operation and the second pooling operation to obtain an output result of the second pooling operation, and storing the output result of the second pooling operation in a characteristic value cache module;
reading input data of the fourth convolution layer from the characteristic value cache module;
and performing second bundling operation on the fourth convolution operation, the fifth convolution operation and the third pooling operation to obtain an output result of the third pooling operation.
The invention has the following beneficial effects. The invention merges and packs intermediate layers with large data volumes and selects layers with relatively small data volumes for output, reducing the amount of output data. The characteristic values produced by convolution are stored in the characteristic value register module, so they need not be read back from an off-chip memory; off-chip memory is read only once, during the first convolution operation, and that single read suffices to complete the multi-layer bound computation. This effectively reduces off-chip memory usage, reduces data-access operations, improves working efficiency, and reduces redundant state toggling in the circuit, thereby lowering power consumption and cost.
Drawings
FIG. 1 is a schematic diagram of a neural network accelerator based on network layer binding according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a first implementation step of the embodiment of the invention;
fig. 3 is a schematic diagram of a second implementation procedure of the embodiment of the invention.
Detailed Description
The invention will be further explained and illustrated below with reference to the drawings and the embodiments in the description. The step numbers in the embodiments of the present invention are set for convenience of illustration only; the order of the steps is not limited in any way, and the execution order of the steps in an embodiment can be adjusted adaptively according to the understanding of those skilled in the art.
Referring to fig. 1, an embodiment of the present invention provides a neural network accelerator based on network layer binding operation, including:
the off-chip memory module is used for storing the picture data acquired from the camera and preset weight parameters;
the weight parameter caching module is used for storing preset weight parameters read from the off-chip storage module;
the characteristic value cache module is used for storing input picture data of the first convolution layer, an output characteristic value of the middle layer and an input characteristic value of the next convolution layer;
the calculation unit array module is used for executing convolution operation according to the input characteristic value and the weight parameter;
the weight parameter register module is used for storing the weight parameters read from the weight parameter caching module;
the characteristic value register module is used for storing the output characteristic value of the middle layer obtained by convolution operation;
the local addition tree module is used for accumulating the partial sums of all the output channels;
the pooling calculation module is used for performing pooling calculation on the result of the convolution operation;
and the output channel addition tree module is used for accumulating the output channels of the computing unit array.
Referring to fig. 1, a further preferred embodiment further includes:
and the cache control module is used for controlling the weight parameter to be read into the weight parameter register module.
Referring to fig. 1, as a further preferred embodiment, the cache control module is further configured to control reading of the characteristic value into the characteristic value register module.
Referring to fig. 1, in a further preferred embodiment, the number of the calculation unit arrays is 8.
Further preferably, the convolutional neural network comprises a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a fourth convolutional layer, a fifth convolutional layer, a third pooling layer, a first fully-connected layer, a second fully-connected layer and a third fully-connected layer;
wherein the dimension of the input characteristic value of the second convolution layer is 27 × 27 × 96, and the dimension of the output characteristic value of the second convolution layer is 27 × 27 × 256;
the dimension of the input characteristic value of the third convolutional layer is 13 × 13 × 256, and the dimension of the output characteristic value of the third convolutional layer is 13 × 13 × 384;
the input characteristic value dimension of the fourth convolution layer is 15 × 15 × 384, and the output characteristic value dimension of the fourth convolution layer is 13 × 13 × 384;
the input characteristic value dimension of the fifth convolutional layer is 15 × 15 × 384, and the output characteristic value dimension of the fifth convolutional layer is 13 × 13 × 384.
Based on the neural network accelerator shown in fig. 1, an embodiment of the present invention further provides an implementation method of a neural network accelerator based on network layer binding operation, including the following steps:
storing picture data acquired from a camera and preset weight parameters through an off-chip memory module;
storing preset weight parameters read from the off-chip storage module through a weight parameter cache module;
storing the input image data of the first convolution layer, the output characteristic value of the middle layer and the input characteristic value of the next convolution layer through a characteristic value cache module;
performing convolution operation according to the input characteristic value and the weight parameter through the computing unit array;
storing the weight parameters read from the weight parameter cache module through a weight parameter register module;
storing, through a characteristic value register module, the output characteristic value of the middle layer obtained by convolution operation;
accumulating the partial sums of each output channel through a local addition tree module;
performing pooling calculation on the result of the convolution operation through a pooling calculation module;
and accumulating the output channels of the computing unit array through an output channel addition tree module.
Further as a preferred embodiment, the method further comprises the following steps:
controlling the weight parameters to be read into the weight parameter register module through the cache control module.
Further as a preferred embodiment, the method further comprises the following steps:
controlling the characteristic value to be read into the characteristic value register module through the cache control module.
Further as a preferred embodiment, the method further comprises the following steps:
splicing four computing unit arrays into a large computing unit array;
performing convolution calculation on the input characteristic values through the large computing unit array obtained by splicing;
performing pooling processing on the result of the convolution calculation to obtain the input feature map of the next convolution layer.
Further as a preferred embodiment, the method further comprises the following steps:
reading input data of the first convolution layer from an off-chip memory;
executing a first convolution operation of the first convolution layer through the computing unit array;
performing first pooling operation on the result of the first convolution operation through a pooling calculation module, and storing the result of the first pooling operation in a characteristic value cache module;
reading input data of the second convolution layer from the characteristic value cache module;
performing first bundling operation on the second convolution operation, the third convolution operation and the second pooling operation to obtain an output result of the second pooling operation, and storing the output result of the second pooling operation in a characteristic value cache module;
reading input data of the fourth convolution layer from the characteristic value cache module;
and performing second bundling operation on the fourth convolution operation, the fifth convolution operation and the third pooling operation to obtain an output result of the third pooling operation.
Aiming at the low-power and low-cost requirements of mobile-terminal accelerators, the invention designs an accelerator that performs multi-layer binding operation on network layers, taking AlexNet as an implementation case. The embodiment is realized on the Xilinx ZYNQ UltraScale+ ZCU102 development platform, which has 2500 DSP resources and suits the bundled network computing mode of the invention.
Specifically, taking AlexNet as an example, the basic network structure of this embodiment is, in order: a first convolution layer Conv1, a first pooling layer Pooling1, a second convolution layer Conv2, a second pooling layer Pooling2, a third convolution layer Conv3, a fourth convolution layer Conv4, a fifth convolution layer Conv5, a third pooling layer Pooling3, and three final fully-connected layers, where Conv denotes a convolution layer and Pooling denotes a pooling layer.
Firstly, the Conv1 input is read from the off-chip memory, the Conv1 convolution operation is executed in the computing unit arrays, the Pooling1 pooling operation is executed in the pooling calculation module, and the result is stored in the characteristic value cache module.
Next, the Conv2 input is read from the characteristic value cache module; Conv2, Pooling2 and Conv3 are then bound and computed together, directly yielding the Conv3 output, which is stored in the characteristic value cache module.
Then, the Conv4 input is read from the characteristic value cache module, and Conv4, Conv5 and Pooling3 are bound and computed together, directly yielding the Pooling3 output. A functional sketch of this schedule follows.
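For orientation, here is a minimal NumPy sketch of this bound schedule. It is a functional model only, not the hardware: the 227 × 227 × 3 input, kernel sizes and strides follow standard AlexNet and are assumptions here (the description quotes only the feature-map dimensions), and bias, ReLU and AlexNet's channel grouping are omitted. What it illustrates is which results are written back to the characteristic value cache at each step.

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """Plain direct convolution: x is (H, W, Cin), w is (kh, kw, Cin, Cout)."""
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    kh, kw, _, cout = w.shape
    H = (xp.shape[0] - kh) // stride + 1
    W = (xp.shape[1] - kw) // stride + 1
    wf = w.reshape(-1, cout)
    y = np.empty((H, W, cout))
    for i in range(H):
        for j in range(W):
            patch = xp[i*stride:i*stride+kh, j*stride:j*stride+kw]
            y[i, j] = patch.reshape(-1) @ wf
    return y

def maxpool(x, k=3, stride=2):
    """3x3 stride-2 max pooling over x (H, W, C), as used by AlexNet."""
    H = (x.shape[0] - k) // stride + 1
    W = (x.shape[1] - k) // stride + 1
    y = np.empty((H, W, x.shape[2]))
    for i in range(H):
        for j in range(W):
            y[i, j] = x[i*stride:i*stride+k, j*stride:j*stride+k].max(axis=(0, 1))
    return y

rng = np.random.default_rng(0)
image = rng.standard_normal((227, 227, 3))        # the single off-chip read
w1 = rng.standard_normal((11, 11, 3, 96)) * 0.01  # kernel shapes are assumed
w2 = rng.standard_normal((5, 5, 96, 256)) * 0.01  # from standard AlexNet;
w3 = rng.standard_normal((3, 3, 256, 384)) * 0.01 # the feature-map dimensions
w4 = rng.standard_normal((3, 3, 384, 384)) * 0.01 # match the ones quoted in
w5 = rng.standard_normal((3, 3, 384, 384)) * 0.01 # this description

# Step 1: Conv1 + Pooling1; only the pooled 27x27x96 map enters the buffer.
fv = maxpool(conv2d(image, w1, stride=4))
# Step 2: Conv2 -> Pooling2 -> Conv3 bound together; the intermediates never
# leave the chip, only the Conv3 output (13x13x384) is written to the buffer.
fv = conv2d(maxpool(conv2d(fv, w2, pad=2)), w3, pad=1)
# Step 3: Conv4 -> Conv5 -> Pooling3 bound; only the pooled result is kept.
out = maxpool(conv2d(conv2d(fv, w4, pad=1), w5, pad=1))
print(out.shape)  # (6, 6, 384)
```

Only `fv` and `out` cross a step boundary in this model; everything else stays inside one bound pass.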
The invention selects the layer with a relatively small data volume for output; compared with the prior approach of writing every layer's data out to the off-chip memory, this reduces the amount of output data.
Because different convolution and pooling layers output different amounts of data, the invention selects the layer whose output is relatively small and stores it in the characteristic value cache module. A layer with a larger data volume is not written out directly; instead, the next layer is computed immediately (assuming the next layer's data volume is relatively small), and that layer's output is what gets stored in the characteristic value cache module. For example, in this embodiment, since the output data volume of Conv2 is larger than that of Conv3, the Conv2 result is not written out; Conv2 is bundled with Pooling2 and Conv3, and the Conv3 output is written out instead.
In addition, Conv4 and Conv5 output the same amount of data; this embodiment still merges and binds them. In that case the effect of 'selecting the layer with a relatively small data volume to output' is not obtained, but the effect of 'reducing the amount of data written out (rather than storing every layer in the off-chip memory)' still is. The short tally below makes the comparison concrete.
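As a quick tally, using only the dimensions quoted in this description (the arithmetic below is illustrative, not part of the claimed design):

```python
# Output sizes (number of characteristic values) of the layers discussed above.
volumes = {
    "Conv2":    27 * 27 * 256,   # 186,624
    "Pooling2": 13 * 13 * 256,   #  43,264
    "Conv3":    13 * 13 * 384,   #  64,896
    "Conv4":    13 * 13 * 384,   #  64,896
    "Conv5":    13 * 13 * 384,   #  64,896
}
for layer, n in volumes.items():
    print(f"{layer:9s} {n:7,d} values")

# Conv2's output is roughly 2.9x Conv3's, so the Conv2/Pooling2/Conv3 binding
# writes back only the smaller Conv3 result; Conv4 and Conv5 are equal in
# size, so binding them simply removes one full write-back.
```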
As shown in fig. 1, the neural network accelerator of this embodiment includes an off-chip memory module, a weight parameter cache module, a cache control module, a feature value cache module, a computing unit array module, a weight parameter register module, a feature value register module, a local addition tree module, a pooling computing module, and an output channel addition tree module.
The off-chip memory module is used for storing the image data acquired from the camera and the trained weight parameters;
the weight parameter caching module stores the weight parameters read from the off-chip memory module;
the cache control module is used for controlling the weight parameter to be read into the weight parameter register module;
the characteristic value cache module is used for storing input picture data of a first layer, intermediate layer output characteristic values returned from the computing unit array and input characteristic values of a later layer;
the cache control module is used for controlling the characteristic value to be read into the characteristic value register module;
the computing unit array module is used for performing convolution operations on the input characteristic values and the weights;
the weight parameter register module is used for storing read weight parameters;
the characteristic value register module is used for storing the output characteristic value of the middle layer;
the local addition tree module is used for accumulating the partial sum of each output channel;
the pooling computing module is used for pooling computing;
and the output channel addition tree module is used for adding the output channels of the 8 calculation unit arrays.
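Where a concrete picture helps, the accumulation performed by the output channel addition tree can be sketched as follows; the pairwise tree structure is an assumption, since the description says only that the module adds the 8 arrays' outputs:

```python
import numpy as np

def adder_tree(psums):
    """Reduce a list of partial-sum maps pairwise, log2(n) levels deep,
    modelling an addition tree over the computing unit arrays' outputs."""
    level = list(psums)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

rng = np.random.default_rng(0)
parts = [rng.standard_normal((13, 13)) for _ in range(8)]  # 8 PE array outputs
assert np.allclose(adder_tree(parts), sum(parts))          # same result, fewer
                                                           # sequential stages
```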
As shown in fig. 2, the Conv4, Conv5 and Pooling3 network layer binding operation mode is implemented as follows:
the input eigenvalue dimension of the two-layer convolution operations of Conv4 and Conv5 is 15 × 15 × 384, and the output eigenvalue dimension is 13 × 13 × 384.
1) For Conv4, the present embodiment adopts a data multiplexing strategy with stable output:
firstly, all input channels CH X1, CH X2, … …, CH Xk of Conv4 and convolution kernels corresponding to a Conv4 layer are read in one time, and multiplication and addition operation is carried out in the same computing unit Array 15 × 15PE Array, so that a complete output channel of Conv4, such as CH Y1, can be obtained;
then, repeating the operation Yn times to obtain all output channels of Conv4, namely CH Y1, CHY2, … …, CH Yn;
the output channels of Conv4 are all stored in the characteristic value register module of the middle layer on the chip, and do not need to be stored in an additional off-chip memory.
2) For Conv5, the present embodiment adopts a data multiplexing strategy with stable input:
firstly, taking an output channel obtained by Conv4 as an input channel of Conv5, such as CH Y1, obtaining a convolution kernel corresponding to a Conv5 layer, and performing multiplication and addition operation in the same computing unit Array 15 × 15PE Array to obtain a group of partial sums of all output channels of Conv5, namely Psum CH Z1, psum CH Z2, … … and Psum CH Zm;
then, after the 8 calculation unit arrays are calculated, 8 groups of Psum CH Z1, psum CH Z2, … … and Psum CH Zm can be obtained, and all the parts are accumulated to obtain the output of the final M channel;
since there are 384 groups of input channels in the Conv5, the operation can be completed by repeating 384/8=48 times, and all the output channels of the Conv5 are obtained.
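The two dataflows can be modelled functionally with the NumPy sketch below. It assumes 3 × 3 same-padding convolutions for Conv4 and Conv5 (consistent with the 15 × 15 padded inputs and 13 × 13 outputs quoted above); the array `mid` stands in for the on-chip characteristic value registers, and the accumulation into `out` stands in for the output channel addition tree. It models the order of data movement, not the circuit.

```python
import numpy as np

def conv_output_stationary(x, w_n):
    """One output-stationary pass: consume every input channel of x (H, W, C)
    against a single 3x3xC kernel w_n, producing one complete output channel."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))   # pad 13x13 up to 15x15
    y = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            y[i, j] = np.sum(xp[i:i+3, j:j+3, :] * w_n)
    return y

def psums_input_stationary(x_ch, w_ch):
    """One input-stationary pass: hold one input channel x_ch (H, W) and emit
    its partial-sum contribution to all M output channels; w_ch is (3, 3, M)."""
    H, W = x_ch.shape
    xp = np.pad(x_ch, 1)
    M = w_ch.shape[-1]
    wf = w_ch.reshape(-1, M)                   # (9, M)
    y = np.empty((H, W, M))
    for i in range(H):
        for j in range(W):
            y[i, j] = xp[i:i+3, j:j+3].reshape(-1) @ wf
    return y

rng = np.random.default_rng(0)
conv4_in = rng.standard_normal((13, 13, 384))       # one read from the buffer
w4 = rng.standard_normal((3, 3, 384, 384)) * 0.01   # Conv4 kernels
w5 = rng.standard_normal((3, 3, 384, 384)) * 0.01   # Conv5 kernels

# 1) Conv4, output-stationary: Yn passes, each producing one full channel
#    CH Yn; the result stays in the modelled on-chip registers ('mid').
mid = np.stack([conv_output_stationary(conv4_in, w4[..., n])
                for n in range(384)], axis=-1)      # 13x13x384, never off-chip

# 2) Conv5, input-stationary: per pass, each of the 8 PE arrays takes one of
#    Conv4's output channels and produces partial sums Psum CH Z1..Zm for all
#    output channels; the addition tree accumulates them. 384/8 = 48 passes.
out = np.zeros((13, 13, 384))
for g in range(384 // 8):                 # 48 passes
    for pe in range(8):                   # the 8 PE arrays work in parallel
        ch = g * 8 + pe
        out += psums_input_stationary(mid[:, :, ch], w5[:, :, ch, :])
print(out.shape)                          # (13, 13, 384): Conv5 output
```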
The invention designs a bound network-layer operation mode based on output-stationary and input-stationary dataflows, which enables two consecutive convolution layers to be computed back to back: after the former layer finishes an entire intermediate output channel, the latter layer can continue operating directly on the compute array without the data being stored in off-chip memory, obtaining the partial sums of several groups of the latter layer's output channels and accumulating them continuously. This saves one pass of writing data to the off-chip memory, removes unnecessary operations, improves computational efficiency, and reduces power consumption during computation.
As shown in fig. 3, the Conv2, Pooling2 and Conv3 network layer bundling operation mode is implemented as follows:
the input characteristic value dimension of Conv2 is 27 × 27 × 96, and its output characteristic value dimension is 27 × 27 × 256;
the input characteristic value dimension of Conv3 is 13 × 13 × 256, and its output characteristic value dimension is 13 × 13 × 384;
since the output feature map of Conv2 is larger than a single computing unit array, four arrays must be spliced into one large computing unit array, i.e. four 15 × 15 PE Arrays are tiled together, so that the Conv2 convolution can be completed with the output-stationary dataflow as intended.
After the Conv2 operation completes, the pooling module outputs the input feature map of Conv3, whose size is 13 × 13; this embodiment can then map the Conv3 input feature map entirely onto one of the computing unit arrays, i.e. a single 15 × 15 PE Array, and execute the Conv3 operation on that array.
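As a concrete illustration of the splicing, the short sketch below partitions a 27 × 27 output map across four 15 × 15 arrays. The exact partition (one full tile plus three edge tiles) is an assumption; the description states only that four arrays are spliced into one large array.

```python
ARRAY = 15   # one PE array computes up to a 15x15 output tile
OUT = 27     # Conv2's output feature map is 27x27

tiles = []
for r0 in range(0, OUT, ARRAY):
    for c0 in range(0, OUT, ARRAY):
        r1, c1 = min(r0 + ARRAY, OUT), min(c0 + ARRAY, OUT)
        tiles.append(((r0, r1), (c0, c1)))

print(tiles)
# [((0, 15), (0, 15)), ((0, 15), (15, 27)), ((15, 27), (0, 15)), ((15, 27), (15, 27))]
# Four tiles, one per PE array: 15x15, 15x12, 12x15 and 12x12; together they
# cover the whole 27x27 map in one output-stationary pass. Conv3's 13x13 map,
# by contrast, fits entirely inside a single 15x15 array.
```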
In summary, the accelerator designed by the invention comprises an off-chip memory module, a weight parameter cache module, a cache control module, a characteristic value cache module, a computing unit array module, a weight parameter register module, a characteristic value register module, a local addition tree module, a pooling calculation module and an output channel addition tree module, and a network layer binding operation mode based on the Conv2, Pooling2 and Conv3 binding operation and the Conv4, Conv5 and Pooling3 binding operation is designed in detail.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (4)

1. The method for realizing the neural network accelerator based on the network layer binding operation is characterized by comprising the following steps:
storing picture data acquired from a camera and preset weight parameters through an off-chip memory module;
storing preset weight parameters read from the off-chip storage module through a weight parameter cache module;
storing the input image data of the first convolution layer, the output characteristic value of the middle layer and the input characteristic value of the next convolution layer through a characteristic value cache module;
performing convolution operation according to the input characteristic value and the weight parameter through the computing unit array;
storing the weight parameters read from the weight parameter cache module through a weight parameter register module;
storing, through a characteristic value register module, the output characteristic value of the middle layer obtained by convolution operation;
accumulating the partial sums of each output channel through a local addition tree module;
performing pooling calculation on the result of the convolution operation through a pooling calculation module;
accumulating the output channels of the computing unit array through an output channel addition tree module;
further comprising the steps of:
reading input data of the first convolution layer from an off-chip memory;
executing a first convolution operation of the first convolution layer through the computing unit array;
performing first pooling operation on the result of the first convolution operation through a pooling calculation module, and storing the result of the first pooling operation in a characteristic value cache module;
reading input data of the second convolution layer from the characteristic value cache module;
performing first bundling operation on the second convolution operation, the third convolution operation and the second pooling operation to obtain an output result of the second pooling operation, and storing the output result of the second pooling operation in a characteristic value cache module;
reading input data of the fourth convolution layer from the characteristic value cache module;
and performing second bundling operation on the fourth convolution operation, the fifth convolution operation and the third pooling operation to obtain an output result of the third pooling operation.
2. The method for implementing a neural network accelerator based on network layer binding operation as claimed in claim 1, wherein the method further comprises the following steps:
controlling the weight parameters to be read into the weight parameter register module through the cache control module.
3. The method for implementing a neural network accelerator based on network layer binding operation as claimed in claim 2, wherein the method further comprises the following steps:
controlling the characteristic value to be read into the characteristic value register module through the cache control module.
4. The method for implementing a neural network accelerator based on network layer binding operation as claimed in claim 1, wherein the method further comprises the following steps:
splicing four computing unit arrays into a large computing unit array;
performing convolution calculation on the input characteristic values through the large computing unit array obtained by splicing;
performing pooling processing on the result of the convolution calculation to obtain the input feature map of the next convolution layer.
CN201910070755.1A 2019-01-25 2019-01-25 Neural network accelerator based on network layer binding operation and implementation method thereof Active CN109948774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910070755.1A CN109948774B (en) 2019-01-25 2019-01-25 Neural network accelerator based on network layer binding operation and implementation method thereof

Publications (2)

Publication Number Publication Date
CN109948774A CN109948774A (en) 2019-06-28
CN109948774B true CN109948774B (en) 2022-12-13

Family

ID=67007223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910070755.1A Active CN109948774B (en) 2019-01-25 2019-01-25 Neural network accelerator based on network layer binding operation and implementation method thereof

Country Status (1)

Country Link
CN (1) CN109948774B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516801B (en) * 2019-08-05 2022-04-22 西安交通大学 High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN110490302B (en) * 2019-08-12 2022-06-07 中科寒武纪科技股份有限公司 Neural network compiling and optimizing method and device and related products
CN112396072B (en) * 2019-08-14 2022-11-25 上海大学 Image classification acceleration method and device based on ASIC (application specific integrated circuit) and VGG16
CN110780923B (en) * 2019-10-31 2021-09-14 合肥工业大学 Hardware accelerator applied to binary convolution neural network and data processing method thereof
CN110991630A (en) * 2019-11-10 2020-04-10 天津大学 Convolutional neural network processor for edge calculation
CN112819022B (en) * 2019-11-18 2023-11-07 同方威视技术股份有限公司 Image recognition device and image recognition method based on neural network
CN111078189B (en) * 2019-11-23 2023-05-02 复旦大学 Sparse matrix multiplication accelerator for cyclic neural network natural language processing
CN111062471B (en) * 2019-11-23 2023-05-02 复旦大学 Deep learning accelerator for accelerating BERT neural network operation
CN111126589B (en) 2019-12-31 2022-05-20 昆仑芯(北京)科技有限公司 Neural network data processing device and method and electronic equipment
CN111340198B (en) * 2020-03-26 2023-05-05 上海大学 Neural network accelerator for data high multiplexing based on FPGA
CN111783967B (en) * 2020-05-27 2023-08-01 上海赛昉科技有限公司 Data double-layer caching method suitable for special neural network accelerator
CN112766479B (en) * 2021-01-26 2022-11-11 东南大学 Neural network accelerator supporting channel separation convolution based on FPGA
CN113159309B (en) * 2021-03-31 2023-03-21 华南理工大学 NAND flash memory-based low-power-consumption neural network accelerator storage architecture
CN113191493A (en) * 2021-04-27 2021-07-30 北京工业大学 Convolutional neural network accelerator based on FPGA parallelism self-adaptation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184366A (en) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexing general neural network processor
EP3154001A2 (en) * 2015-10-08 2017-04-12 VIA Alliance Semiconductor Co., Ltd. Neural network unit with neural memory and array of neural processing units that collectively shift row of data received from neural memory
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A hybrid pruning method for convolutional neural network compression; 靳丽蕾 (Jin Lilei) et al.; 《小型微型计算机系统》 (Journal of Chinese Computer Systems); 31 December 2018 (No. 12); pp. 2596-2601 *

Also Published As

Publication number Publication date
CN109948774A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN106875011B (en) Hardware architecture of binary weight convolution neural network accelerator and calculation flow thereof
CN110097174B (en) Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
US20190188237A1 (en) Method and electronic device for convolution calculation in neutral network
CN110852428B (en) Neural network acceleration method and accelerator based on FPGA
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN111242277B (en) Convolutional neural network accelerator supporting sparse pruning based on FPGA design
KR20200098684A (en) Matrix multiplier
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN112465110A (en) Hardware accelerator for convolution neural network calculation optimization
CN110705702A (en) Dynamic extensible convolutional neural network accelerator
CN112633490B (en) Data processing device, method and related product for executing neural network model
CN110991630A (en) Convolutional neural network processor for edge calculation
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN110580519A (en) Convolution operation structure and method thereof
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
CN111582465A (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
CN110222835A (en) A kind of convolutional neural networks hardware system and operation method based on zero value detection
CN111222090B (en) Convolution calculation module, neural network processor, chip and electronic equipment
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
CN113128688B (en) General AI parallel reasoning acceleration structure and reasoning equipment
CN114912596A (en) Sparse convolution neural network-oriented multi-chip system and method thereof
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant