CN112668708B - Convolution operation device for improving data utilization rate

Convolution operation device for improving data utilization rate

Info

Publication number: CN112668708B
Application number: CN202011578577.2A
Authority: CN (China)
Prior art keywords: data, multiply, input, convolution, feature map
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN112668708A
Inventors: 廖湘萍, 丁永林, 曹学成, 李炜
Current and original assignee: CETC 52 Research Institute
Application filed by CETC 52 Research Institute; priority to CN202011578577.2A
Publication of application: CN112668708A; application granted and published as CN112668708B

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a convolution operation device that improves data utilization. The multiply-accumulate units adopt a parallel structure, and the input feature map data is split into a number of data blocks according to the parallelism of the multiply-accumulate units. Convolution is computed block by block: the same data block is read and reused repeatedly until it has completed convolution with all weights of the current network layer. This reduces the number of times each data block must be re-fetched into the input cache, which improves data utilization, reduces the on-chip storage requirement, avoids re-reading feature map data from off-chip storage, and markedly improves the overall efficiency of the convolution operation.

Description

Convolution operation device for improving data utilization rate
Technical Field
The application belongs to the field of computers, and particularly relates to a convolution operation device for improving data utilization rate.
Background
In recent years, deep neural networks have been widely applied, especially in image processing, speech recognition, and text classification. Deep neural networks achieve excellent accuracy by analyzing huge amounts of data, and in general, the deeper the network, the stronger its modeling capability and the higher its inference accuracy.
Convolutional neural networks are a class of deep neural networks. Although they achieve high accuracy, the amount of computation grows enormously as the number of network layers increases, placing ever higher demands on data bandwidth and data storage. Hardware resources are limited and cannot simply grow with the computational load of the algorithm. Moreover, because a convolution kernel window must slide across the entire image, a large amount of data is reused during convolution. Since the on-chip storage of an arithmetic device is limited, data must be staged in external storage, so the computation becomes bound by data bandwidth, which hurts efficiency.
Disclosure of Invention
An object of the present application is to provide a convolution operation device that improves data utilization, reduces the on-chip storage requirement, and improves convolution efficiency.
To this end, the application adopts the following technical solution:
the utility model provides an improve convolution arithmetic device of data utilization ratio, including input bus, input data buffer array, multiply and accumulate unit array, output data buffer array and output bus, input data buffer array includes input characteristic map data buffer array and weight buffer array, multiply and accumulate unit array includes Q parallel multiply and accumulate unit, wherein:
the input feature map data cache array is used to sequentially acquire a number of data blocks from the input bus and cache them, the data blocks being obtained by dividing the input feature map data according to the parallelism Q of the multiply-accumulate units;
the weight cache array is used to acquire weight data from the input bus and cache it, the weight data comprising the weights of one or more convolution kernels;
the multiply-accumulate unit array is used to perform multiply-accumulate operations on the data blocks and the weight data, repeatedly reading the same data block and pairing it with each convolution kernel until the multiply-accumulate operations of all data blocks are completed, and to output the convolution result obtained from each data block with each convolution kernel;
the output data cache array is used to receive the convolution results output by the multiply-accumulate unit array, splice them into the output feature map data, and cache that data;
and the output bus is used to output the output feature map data.
Several preferred options are provided below, not as additional limitations on the general solution above but merely as further additions or preferences; absent technical or logical contradiction, each option may be combined with the general solution individually or together with other options.
Preferably, dividing the input feature map data according to the parallelism Q of the multiply-accumulate units includes:
splitting input feature map data of size R*K*N into J data blocks of size L*K*N, where R is the height of the input feature map data; K is the width of the data blocks, which equals the width of the input feature map data; N is the number of channels of the data blocks, which equals the number of channels of the input feature map data; and L is the height of the data blocks.
Preferably, each address space in the input feature map data cache array stores Q pixel values, and one data block is stored in H*E*N consecutive address spaces, where H is the height of the convolution kernel and E is the stride of the current network layer in the convolutional neural network.
Preferably, when the input feature map data is divided, the following constraints must be satisfied: the height L of the data blocks satisfies

L = (Q * H * E) / K

and the number of data blocks J, the height R of the input feature map data, and the block height L satisfy

R = L + (L - H + E) * (J - 1)

where H is the height of the convolution kernel, E is the stride of the current network layer in the convolutional neural network, and Q is the parallelism of the multiply-accumulate units.

Preferably, the weight data has size H*W*N*M; each address space of the weight cache array stores H*W weight values, and all the weight data of the current layer is stored in N*M consecutive address spaces, where H denotes the height of a convolution kernel, W the width, N the number of channels, and M the number of convolution kernels.
In the convolution operation device provided by the application, the multiply-accumulate units adopt a parallel structure, the input feature map data is split into data blocks according to the parallelism of the multiply-accumulate units, and the convolution is computed block by block. The same data block is read and reused until it has completed convolution with all weights of the current network layer, which reduces the number of times the block must be re-fetched into the input cache. This improves data utilization, reduces the on-chip storage requirement, avoids re-reading feature map data from off-chip storage, and markedly improves the overall efficiency of the convolution operation.
Drawings
FIG. 1 is a schematic structural diagram of a convolution operation apparatus for improving data utilization rate according to the present application;
FIG. 2 is a schematic diagram of multiply-accumulate operation of the convolution operation device for improving data utilization rate according to the present application;
FIG. 3 is a schematic diagram of how the input feature map data cache array of the present application caches data;
FIG. 4 is a diagram illustrating a weight cache array according to the present application;
FIG. 5 is a diagram illustrating a data block cache according to an embodiment of the present application;
FIG. 6 is a diagram illustrating weight data caching in an embodiment of the present application;
FIG. 7 is a diagram illustrating multiply-accumulate operations in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In one embodiment, a convolution operation device for improving data utilization is provided to raise the convolution efficiency in each network layer of a convolutional neural network, so as to meet the computational requirements of deeper convolutional neural networks.
As shown in fig. 1, the convolution operation apparatus of this embodiment includes an input bus, an input data cache array, a multiply-accumulate unit array, an output data cache array, and an output bus, where the input data cache array includes an input feature map data cache array and a weight cache array, and the multiply-accumulate unit array includes Q parallel multiply-accumulate units (MACs); Q is determined by the available hardware resources, such as the number of DSPs.
The input feature map data cache array sequentially acquires a number of data blocks from the input bus and caches them; the data blocks are obtained by dividing the input feature map data according to the parallelism Q of the multiply-accumulate units.
The weight cache array acquires weight data from the input bus and caches it; the weight data comprises the weights of one or more convolution kernels.
The multiply-accumulate unit array performs multiply-accumulate operations on the data blocks and the weight data. As shown in fig. 2, the same data block is read repeatedly and paired with each convolution kernel in turn; once one data block has completed its multiply-accumulate operations with every convolution kernel, the next data block is read and processed in the same way, until all data blocks are finished. The array outputs the convolution result of each data block with each convolution kernel.
The output data cache array receives the convolution results output by the multiply-accumulate unit array, splices them into the output feature map data, and caches that data.
The output bus outputs the output feature map data.
In the convolution operation device of this embodiment, the multiply-accumulate units adopt a parallel structure: the input feature map data is split into data blocks according to the parallelism of the multiply-accumulate units, the convolution is computed block by block, and the same data block is read and reused until it has completed convolution with all weights of the current network layer. This reduces the number of times each data block is re-fetched into the input cache, which improves data utilization, reduces the on-chip storage requirement, avoids re-reading feature map data from off-chip storage, and markedly improves the overall efficiency of the convolution operation.
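As a rough illustration of the saving this block-stationary ordering buys (a sketch with hypothetical layer dimensions, not figures from the patent), compare the number of input-bus transfers with and without reuse:

```python
# Illustrative only: input-bus transfer counts for one network layer with
# J data blocks and M convolution kernels (hypothetical values).
J, M = 8, 64
fetches_without_reuse = J * M   # block re-fetched once per kernel
fetches_with_reuse = J          # block fetched once, reused for all M kernels
print(fetches_without_reuse, fetches_with_reuse)  # 512 vs 8: an M-fold saving
```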
It is easy to understand that bias data is usually added during convolution to improve accuracy. The convolution operation device of this embodiment may therefore further include a bias cache array in the input data cache array; it caches the bias data and can hold at least the largest single-layer bias among all network layers.
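A minimal sketch of that bias path, assuming one bias value per convolution kernel (output channel) broadcast over the output feature map; the exact layout of the bias cache array is not specified here:

```python
import numpy as np

# Hypothetical shapes: a 6*6 output feature map with M = 4 channels and one
# bias value per channel, added after the multiply-accumulate results.
conv_out = np.zeros((6, 6, 4))            # convolution results before bias
bias = np.array([0.1, -0.2, 0.3, 0.0])    # hypothetical per-kernel bias values
biased_out = conv_out + bias              # broadcast add over height and width
```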
It should be noted that the convolution operation device of this embodiment operates on one layer of the convolutional neural network at a time; that is, the input feature map data, weight data, and bias data are those of the current layer. Deploying several such devices, or running one device several times, completes all convolution operations of the whole convolutional neural network.
In this embodiment, dividing the input feature map data into data blocks according to the parallelism Q of the multiply-accumulate units proceeds as follows:
As shown in fig. 3, when the convolution kernel size is 1*1 and the stride is 1, the input feature map data of size R*K*N is split into J data blocks of size L*K*N, where R is the height of the input feature map data, K is the width of both the input feature map data and the data blocks, N is the number of channels of both, and L is the height of a data block. After the split, each address space in the input feature map data cache array stores Q pixel values, and one data block occupies N consecutive address spaces.
When the convolution kernel size is 3*3 and the stride is 1, the input feature map data of size R*K*N is likewise split into J data blocks of size L*K*N. After the split, each address space stores Q pixel values, and one data block occupies N*3 consecutive address spaces.
By analogy, if the convolution kernel size is H*W and the stride of the current network layer in the convolutional neural network is E, where H is the kernel height and W the kernel width, then after the input feature map data is split into data blocks, each address space in the input feature map data cache array stores Q pixel values, and one data block occupies H*E*N consecutive address spaces.
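A minimal NumPy sketch of this cache layout, using the dimensions of the worked example later in the text (the pixel ordering within an address word is an assumption; the text only fixes the word size Q and the word count H*E*N):

```python
import numpy as np

# One data block of L*K*N pixels fills H*E*N consecutive address spaces of
# Q pixels each, which is consistent exactly when L*K = Q*H*E.
L, K, N, H, E, Q = 3, 6, 3, 1, 1, 18
assert L * K == Q * H * E
block = np.arange(L * K * N).reshape(L, K, N)            # height x width x channels
words = block.transpose(2, 0, 1).reshape(H * E * N, Q)   # one channel plane per word
assert words.shape == (3, 18)  # matches Fig. 5: 18 pixels per address, 3 addresses
```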
When splitting the input feature map data in this embodiment, the blocking may be chosen empirically and the blocking information issued to the input data cache array, so that the input feature map data cache array can fetch data blocks over the input bus according to that information; alternatively, the blocking information may be solved from the specified constraints and then sent to the input data cache array.
To obtain data blocks of reasonable size, in one embodiment the following constraints are satisfied when dividing the input feature map data:
1) The height L of the data blocks satisfies

L = (Q * H * E) / K

2) The number of data blocks J, the height R of the input feature map data, and the block height L satisfy

R = L + (L - H + E) * (J - 1)

where H is the height of the convolution kernel and E is the stride of the current network layer in the convolutional neural network.
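A short sketch of how these two constraints fix the blocking; it assumes the divisions come out even, as they do in the patent's examples:

```python
import numpy as np

def split_blocks(fmap, Q, H, E):
    """Split an R*K*N feature map into J overlapping blocks of height L."""
    R, K, N = fmap.shape
    L = Q * H * E // K                # from L = (Q*H*E)/K
    J = (R - L) // (L - H + E) + 1    # from R = L + (L-H+E)*(J-1)
    step = L - H + E                  # consecutive blocks overlap by H - E rows
    return [fmap[j * step : j * step + L] for j in range(J)]

blocks = split_blocks(np.zeros((6, 6, 3)), Q=18, H=1, E=1)
assert len(blocks) == 2 and blocks[0].shape == (3, 6, 3)
```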
As shown in fig. 4, the weight data has size H*W*N*M (that is, all the weight data of a single network layer of the convolutional neural network); each address space of the weight cache array stores H*W weight values, and all the weight data of the current layer is stored in N*M consecutive address spaces, where H denotes the height of a convolution kernel, W the width, N the number of channels, and M the number of convolution kernels.
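A corresponding sketch of the weight layout (the ordering of kernels and channels across address spaces is an assumption consistent with the worked example in fig. 6):

```python
import numpy as np

# All weights of one layer, H*W*N*M in total, fill N*M consecutive address
# spaces of H*W weight values each.
H, W, N, M = 1, 1, 3, 4
layer_weights = np.arange(H * W * N * M).reshape(M, N, H, W)
words = layer_weights.reshape(N * M, H * W)   # 12 addresses of 1 weight each
assert words.shape == (12, 1)                 # matches Fig. 6
```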
After the multiply-accumulate operations of the J data blocks of size L*K*N with all the weight data H*W*N*M of the current layer are completed, J*M groups of convolution results of size Q are obtained, and these are spliced into the complete output feature map data. By sharing weights and biases and adopting a parallel computing structure for the multiply-accumulate operations, the invention reduces the number of times the same data block is repeatedly fetched and cached, improving data utilization and reducing the on-chip storage requirement.
To further illustrate the operation of the convolution device of the present application, a specific example is given below.
Taking an input feature map of size 6*6*3, 2 input feature map data blocks, convolution kernel size 1*1, 4 convolution kernels, stride 1, and 18 multiply-accumulate units as an example, the convolution operation device works as follows:
As shown in fig. 5, an input feature map data block of size 3*6*3 is obtained from the input bus and cached in the input feature map data cache array in the manner shown in fig. 5: each address space stores 3*6 = 18 pixel values, and the block occupies 3 consecutive address spaces.
It should be noted that, when caching data blocks, several blocks may be cached ahead of time as long as the input feature map data cache array is not full; the cached blocks are read and convolved in order, and blocks that have finished their convolution are evicted so that new blocks can be stored in their place. Caching, computation, and eviction of data blocks thus form a dynamic process that keeps the computation efficient.
Fig. 5 shows the state in which 2 data blocks have been stored; in actual operation, one data block is fetched and processed first, and when its computation finishes, the next data block is fetched, cached, and processed.
As shown in fig. 6, weight data of size 1*1*3*4 is obtained from the input bus and cached in the weight cache array in the manner shown in fig. 6: each address space stores 1*1 = 1 weight value, and the weights occupy 3*4 = 12 consecutive address spaces.
As shown in fig. 7, in each step the 3*6 = 18 pixel values of one address space and one 1*1 weight value are taken, and the 18 multiply-accumulate units operate on them in parallel. Fetching feature map data and weight data 3 times in succession completes the multiply-accumulate operation of one data block with one convolution kernel and yields one group of output feature map data of size 18.
At this point the weights in the weight cache array have not all been used: the 3 address spaces of the data block are re-read 3 more times while the remaining weight data is read in sequence, until the multiply-accumulate operations with all the weights are completed, yielding three more groups of output feature map data of size 18.
After the first data block finishes its computation, the feature map data of the next data block is read in the same manner and the weight data is read through once more; when its convolution completes, another four groups of output feature map data of size 18 are obtained.
In total, eight groups of output feature map data of size 18 are obtained; splicing them yields the complete output feature map of size 6*6 with 4 channels, completing the convolution operation of the current network layer.
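The worked example can be checked end to end in NumPy; the sketch below stands in for the multiply-accumulate array, uses random data since the text gives no concrete pixel or weight values, and confirms that splicing the eight groups reproduces a direct 1*1 convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
fmap = rng.standard_normal((6, 6, 3))     # input feature map, R*K*N = 6*6*3
kernels = rng.standard_normal((4, 3))     # M = 4 kernels of 1*1*N weights

groups = []                               # J*M = 8 groups of 18 results each
for block in (fmap[0:3], fmap[3:6]):      # each block reused for all kernels
    for k in kernels:
        groups.append(np.tensordot(block, k, axes=([2], [0])).ravel())

# Splice: per block, stack its M groups as channels, then stack blocks by height.
out = np.concatenate(
    [np.stack([g.reshape(3, 6) for g in groups[b * 4:(b + 1) * 4]], axis=-1)
     for b in range(2)], axis=0)          # 6*6 output with 4 channels

direct = np.tensordot(fmap, kernels, axes=([2], [1]))   # direct 1*1 convolution
assert out.shape == (6, 6, 4) and np.allclose(out, direct)
```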
In the convolution operation device of this embodiment, the multiply-accumulate units adopt a parallel structure, the convolution is computed per input feature map data block, and each data block is reused until it has completed convolution with all weights of the current network layer; this reduces the number of times a data block is repeatedly fetched and cached, improves data utilization, and reduces the on-chip storage requirement.
The technical features of the embodiments described above may be combined arbitrarily; for brevity, not all possible combinations are described, but any combination that contains no contradiction should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within the protection scope of the application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (3)

1. A convolution operation device for improving data utilization rate, comprising an input bus, an input data cache array, a multiply-accumulate unit array, an output data cache array, and an output bus, wherein the input data cache array comprises an input feature map data cache array and a weight cache array, and the multiply-accumulate unit array comprises Q parallel multiply-accumulate units, wherein:
the input feature map data cache array is used to sequentially acquire a number of data blocks from the input bus and cache them, the data blocks being obtained by dividing the input feature map data according to the parallelism Q of the multiply-accumulate units;
the weight cache array is used to acquire weight data from the input bus and cache it, the weight data comprising the weights of one or more convolution kernels;
the multiply-accumulate unit array is used to perform multiply-accumulate operations on the data blocks and the weight data, repeatedly reading the same data block and pairing it with each convolution kernel until the multiply-accumulate operations of all data blocks are completed, and to output the convolution result obtained from each data block with each convolution kernel;
the output data cache array is used to receive the convolution results output by the multiply-accumulate unit array, splice them into the output feature map data, and cache that data;
the output bus is used to output the output feature map data;
dividing the input feature map data according to the parallelism Q of the multiply-accumulate units comprises:
splitting input feature map data of size R*K*N into J data blocks of size L*K*N, where R is the height of the input feature map data; K is the width of the data blocks, which equals the width of the input feature map data; N is the number of channels of the data blocks, which equals the number of channels of the input feature map data; and L is the height of the data blocks;
when the input feature map data is divided, the following constraints must be satisfied: the height L of the data blocks satisfies

L = (Q * H * E) / K

and the number of data blocks J, the height R of the input feature map data, and the block height L satisfy

R = L + (L - H + E) * (J - 1)

where H is the height of the convolution kernel and E is the stride of the current network layer in the convolutional neural network.
2. The apparatus according to claim 1, wherein each address space in the input feature map data cache array stores Q pixel values, and one data block is stored in H*E*N consecutive address spaces, where H is the height of the convolution kernel and E is the stride of the current network layer in the convolutional neural network.
3. The apparatus according to claim 1, wherein the weight data has size H*W*N*M, each address space of the weight cache array stores H*W weight values, and all the weight data of the current layer is stored in N*M consecutive address spaces, where H denotes the height of a convolution kernel, W denotes the width of the convolution kernel, N denotes the number of channels of the convolution kernel, and M denotes the number of convolution kernels.
Application CN202011578577.2A, priority date 2020-12-28, filing date 2020-12-28: Convolution operation device for improving data utilization rate. Granted as CN112668708B (Active).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011578577.2A | 2020-12-28 | 2020-12-28 | Convolution operation device for improving data utilization rate


Publications (2)

Publication Number | Publication Date
CN112668708A | 2021-04-16
CN112668708B | 2022-10-14

Family

ID=75410653

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011578577.2A (Active, CN112668708B) | Convolution operation device for improving data utilization rate | 2020-12-28 | 2020-12-28

Country Status (1)

Country | Link
CN | CN112668708B

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469350B (en) * 2021-07-07 2023-03-24 武汉魅瞳科技有限公司 Deep convolutional neural network acceleration method and system suitable for NPU
CN114089911B (en) * 2021-09-07 2024-01-05 上海新氦类脑智能科技有限公司 Block segmentation and splicing processing method, device, equipment and medium based on data multiplexing
CN113688069B (en) * 2021-09-10 2022-08-02 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
CN116721006A (en) * 2022-02-28 2023-09-08 格兰菲智能科技有限公司 Feature map processing method and device
CN116861149B (en) * 2023-09-05 2024-01-09 之江实验室 Convolution operation optimization method, device and processor


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200073636A1 (en) * 2018-08-31 2020-03-05 Qualcomm Incorporated Multiply-accumulate (mac) operations for convolutional neural networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10515135B1 (en) * 2017-10-17 2019-12-24 Xilinx, Inc. Data format suitable for fast massively parallel general matrix multiplication in a programmable IC
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
CN109598338A (en) * 2018-12-07 2019-04-09 东南大学 A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel
CN110097174A (en) * 2019-04-22 2019-08-06 西安交通大学 Preferential convolutional neural networks implementation method, system and device are exported based on FPGA and row
CN110147252A (en) * 2019-04-28 2019-08-20 深兰科技(上海)有限公司 A kind of parallel calculating method and device of convolutional neural networks
CN111242277A (en) * 2019-12-27 2020-06-05 中国电子科技集团公司第五十二研究所 Convolutional neural network accelerator supporting sparse pruning and based on FPGA design

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Search-free Accelerator for Sparse Convolutional Neural Networks; Bosheng Liu et al.; 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC); 2020-03-26 *
Design of an FPGA-based convolutional neural network acceleration module (基于FPGA的卷积神经网络加速模块设计); Mei Zhiwei et al.; Journal of Nanjing University (Natural Science), vol. 56, no. 4, pp. 581-590; July 2020 *
Progress and trends of deep-learning FPGA accelerators (深度学习FPGA加速器的进展与趋势); Wu Yanxia et al.; Chinese Journal of Computers, vol. 42, no. 11; November 2019 *

Also Published As

Publication number Publication date
CN112668708A (en) 2021-04-16

Similar Documents

Publication Publication Date Title
CN112668708B (en) Convolution operation device for improving data utilization rate
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
CN107145939B (en) Computer vision processing method and device of low-computing-capacity processing equipment
CN111199273B (en) Convolution calculation method, device, equipment and storage medium
US20180046895A1 (en) Device and method for implementing a sparse neural network
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN108629406B (en) Arithmetic device for convolutional neural network
WO2022037257A1 (en) Convolution calculation engine, artificial intelligence chip, and data processing method
US11544542B2 (en) Computing device and method
CN110851779B (en) Systolic array architecture for sparse matrix operations
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
WO2022041188A1 (en) Accelerator for neural network, acceleration method and device, and computer storage medium
CN115186802A (en) Block sparse method and device based on convolutional neural network and processing unit
CN112906865A (en) Neural network architecture searching method and device, electronic equipment and storage medium
CN111008691B (en) Convolutional neural network accelerator architecture with weight and activation value both binarized
CN116720549A (en) FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache
CN113485750B (en) Data processing method and data processing device
CN112966807A (en) Convolutional neural network implementation method based on storage resource limited FPGA
CN113158132A (en) Convolution neural network acceleration system based on unstructured sparsity
CN113052299A (en) Neural network memory computing device based on lower communication bound and acceleration method
CN110766136B (en) Compression method of sparse matrix and vector
CN109902821B (en) Data processing method and device and related components
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant