CN112465110A - Hardware accelerator for convolution neural network calculation optimization

Info

Publication number
CN112465110A
Authority
CN
China
Prior art keywords
neural network
data
kernel
convolution
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011279360.1A
Other languages
Chinese (zh)
Other versions
CN112465110B
Inventor
曹学成
廖湘萍
丁永林
李炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 52 Research Institute
Original Assignee
CETC 52 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 52 Research Institute filed Critical CETC 52 Research Institute
Priority to CN202011279360.1A priority Critical patent/CN112465110B/en
Publication of CN112465110A publication Critical patent/CN112465110A/en
Application granted granted Critical
Publication of CN112465110B publication Critical patent/CN112465110B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a hardware acceleration device for convolutional neural network computation optimization, which comprises a parameter storage module, a scheduling control module, and a plurality of acceleration kernel modules, each acceleration kernel module comprising an input image cache unit, a weight cache unit, a zero-removal processing unit, a multiply-accumulate operation array unit, a rectified linear unit, and an output image cache unit. The invention keeps a simple pipelined and parallel hardware structure while reducing the amount of computation and improving hardware acceleration performance by removing zero values from the input feature map; it keeps the original structure of the convolutional neural network algorithm, requires no extra algorithm-level optimization to reduce computation, avoids irregular network operations, and is therefore suitable for hardware acceleration of various convolutional neural network algorithms.

Description

Hardware accelerator for convolution neural network calculation optimization
Technical Field
The application belongs to the technical field of computers, and particularly relates to a hardware accelerating device for convolutional neural network calculation optimization.
Background
In recent years, deep neural network algorithms have been applied on a large scale in fields such as image processing and audio processing, and have had a great influence on world economic and social activities. Deep convolutional neural network technology has attracted wide attention in many machine learning fields; compared with traditional machine learning algorithms it achieves higher precision, and in some tasks can exceed human-level accuracy.
Generally, the deeper a convolutional neural network (the more layers it has), the more accurate its inference results. At the same time, a deeper network consumes more computing resources. In convolutional neural network architectures, computations within a layer are independent of one another, while computations between layers resemble a pipeline structure; such a workload is inefficient to implement on a general-purpose processor. Because of this special computation pattern, convolutional neural networks are particularly well suited to hardware acceleration.
Deep neural networks have the advantage of high accuracy but the disadvantage of an enormous amount of computation, so reducing the computation of convolutional neural networks has long been a popular research direction in the field of artificial intelligence. How to keep a simple pipelined and parallel structure, without adding extra preprocessing, while remaining compatible with many deep neural network algorithms and reducing the amount of computation, is the current difficulty of hardware acceleration.
Disclosure of Invention
The aim of the application is to provide a hardware acceleration device for convolutional neural network computation optimization, which can significantly reduce the amount of convolution computation and improve hardware acceleration performance.
To achieve this aim, the technical solution adopted by the application is as follows:
Provided is a hardware acceleration device for convolutional neural network computation optimization, the device comprising a parameter storage module, a scheduling control module, and a plurality of acceleration kernel modules, each acceleration kernel module comprising an input image cache unit, a weight cache unit, a zero-removal processing unit, a multiply-accumulate operation array unit, a rectified linear unit, and an output image cache unit, wherein:
the parameter storage module is used for caching the convolutional neural network to be accelerated and its corresponding convolution kernels;
the scheduling control module is used for load-balancing the computation across the plurality of acceleration kernel modules, detecting idle acceleration kernel modules, and distributing input feature map data to be processed to an idle acceleration kernel module;
the input image cache unit is used for receiving and caching the input feature map data fed into the acceleration kernel module;
the weight cache unit is used for receiving and caching the convolution kernel output by the parameter storage module;
the zero-removal processing unit is used for removing zero values from the input feature map data;
the multiply-accumulate operation array unit is used for the multiply-accumulate operations between the weight data in the convolution kernel and the zero-removed input feature map data, and for outputting the convolution operation result;
the rectified linear unit is used for correcting negative numbers in the convolution operation result to zero to obtain a corrected result;
and the output image cache unit is used for caching the corrected result as output feature map data, which serves as the input feature map data for the next layer's convolution operation.
Preferably, the maximum amount of data that the multiply-accumulate operation performed by an acceleration kernel module can directly process at one time is: a convolution operation between an input feature map of size C × R × N and convolution kernels of size W × H × N × M, where C represents the width of the image, R represents the height of the image, N represents the number of channels, W represents the width of the convolution kernel, H represents the height of the convolution kernel, and M represents the number of sets of convolution kernels.
Preferably, the input image cache unit is a first random access memory for caching input feature map data; the first random access memory has C × R address spaces in total, and each address space stores the data of the N channels of one pixel.
Preferably, the weight cache unit is a second random access memory for caching the weight data; the second random access memory has W × H × N address spaces in total, and each address space stores the weight data of the M sets of convolution kernels at one point.
Preferably, the multiply-accumulate array unit comprises M parallel MAC units, each MAC unit implementing the multiply-accumulate operation of the input feature map data with the weight data of one set of convolution kernels.
Preferably, if the size of the input feature map to be processed is C′ × R′ × N′, where C′ represents the width of the image to be processed, R′ represents the height of the image to be processed, and N′ represents the number of channels of the image to be processed:
if N′ > N, the input image cache unit uses several consecutive address spaces to store the data of the N′ channels of one pixel; and if C′ × R′ > C × R, the input feature map to be processed is split into several blocks of size C × R × N, which are distributed to several acceleration kernel modules for operation.
Preferably, in the parameter storage module, if the size of the convolution kernels to be processed is W′ × H′ × N′ × M′, where W′ represents the width of the convolution kernel to be processed, H′ represents the height of the convolution kernel to be processed, N′ represents the number of channels of the convolution kernel to be processed, and M′ represents the number of sets of convolution kernels to be processed:
if M′ > M, the convolution kernels are split into several groups of M sets of convolution kernels, which are distributed to several acceleration kernel modules for operation;
or, if M′ > M, the weight cache unit stores the weight data of the M′ sets of convolution kernels at one point using consecutive addresses.
Compared with the prior art, the hardware acceleration device for convolutional neural network computation optimization of the present application has the following beneficial effects:
(1) The application keeps a simple pipelined and parallel hardware structure, reduces the amount of computation by removing zero values from the input feature map, and improves hardware acceleration performance.
(2) The application keeps the original structure of the convolutional neural network algorithm, requires no extra algorithm-level optimization to reduce computation, avoids irregular network operations, and is suitable for hardware acceleration of various convolutional neural network algorithms.
Drawings
FIG. 1 is a schematic structural diagram of a hardware acceleration apparatus for convolutional neural network computational optimization according to the present application;
FIG. 2 is a schematic diagram illustrating a storage manner of an input feature map according to the present application;
FIG. 3 is a schematic diagram of a zeroing process for an input feature map of the present application;
FIG. 4 is a schematic diagram of a convolution kernel weight data storage method according to the present application;
FIG. 5 is a schematic diagram of the operation of convolution multiply accumulate operation of the present application;
FIG. 6 is a schematic diagram of an input/output feature image storage method according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In one embodiment, a hardware acceleration device for convolutional neural network computation optimization is provided, which solves the problems that a conventional convolutional neural network consumes large amounts of computing resources and is inefficient to implement on a general-purpose processor.
As shown in fig. 1, the hardware acceleration device for convolutional neural network computation optimization of this embodiment includes a parameter storage module, a scheduling control module, and a plurality of acceleration kernel modules; each acceleration kernel module includes an input image cache unit, a weight cache unit, a zero-removal processing unit, a multiply-accumulate operation array unit, a rectified linear unit, and an output image cache unit.
In the convolution calculation process, the function of each module and unit is as follows:
The parameter storage module is used for caching the convolutional neural network to be accelerated and its corresponding convolution kernels. In practical applications, the parameter storage module actually stores compiled network-layer information; for example, a YOLO-series or MobileNet-series network may be stored as the neural network whose convolution operations are to be accelerated and optimized.
The scheduling control module is used for load-balancing the computation across the plurality of acceleration kernel modules, detecting idle acceleration kernel modules, and distributing input feature map data to be processed (i.e., a new operation request) to an idle acceleration kernel module. In this embodiment, using the scheduling control module to supervise the working state of all acceleration kernel modules effectively improves the utilization of each module and reduces computation waiting time. It is easy to understand that detecting whether an acceleration kernel module is idle is a conventional technique, which may be done through a flag or through its state, and is not described here again.
The input image cache unit is used for receiving and caching the input feature map data fed into the acceleration kernel module, and the weight cache unit is used for receiving and caching the convolution kernel output by the parameter storage module.
The zero-removal processing unit is used for removing zero values from the input feature map data.
The multiply-accumulate operation array unit is used for the multiply-accumulate operations between the weight data in the convolution kernel and the zero-removed input feature map data, and for outputting the convolution operation result.
The rectified linear unit (i.e., the ReLU activation function) is used for correcting negative numbers in the convolution operation result to zero to obtain a corrected result. In this embodiment, applying the ReLU activation function after each layer's convolution sets all negative values in the convolution results to zero, which makes it convenient to remove the zeros from each layer's input feature map data and can greatly reduce the amount of computation of the algorithm.
The output image cache unit is used for caching the corrected result as output feature map data, which serves as the input feature map data for the next layer's convolution operation.
The hardware acceleration device for convolutional neural network computation optimization keeps the original pipelined and parallel structure of the convolutional neural network algorithm while, in combination with the parameter storage module, switching among and storing multiple convolutional neural networks, so that it is compatible with the accelerated computation of various convolutional neural networks. Meanwhile, it exploits the property of the ReLU activation function that negative values become zero: because the negative values in the current layer's output feature map data are zeroed, the zero data can be removed when the next layer performs its convolution operation, reducing the amount of computation and improving hardware acceleration performance.
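As a minimal illustration of why ReLU creates skippable work (a sketch with hypothetical sizes, not the patented implementation), the following Python fragment applies ReLU to a random convolution output and counts the zero activations that the next layer's zero-removal unit could skip:

    import numpy as np

    rng = np.random.default_rng(0)
    conv_out = rng.standard_normal((20, 20, 32))   # hypothetical raw convolution output
    relu_out = np.maximum(conv_out, 0.0)           # rectified linear unit: negatives -> 0

    zero_frac = float(np.mean(relu_out == 0.0))
    print(f"fraction of zero activations after ReLU: {zero_frac:.2f}")
    # each zero activation lets the next layer skip one full set of M
    # multiply-accumulate operations in the MAC array

For roughly zero-mean data about half of the activations are zeroed, which is the source of the computation savings described above.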
Throughout the convolution calculation process, the input feature map data and convolution kernels are obtained through an input bus, and the final calculation result data are output through an output bus after the convolution calculation is finished. Realizing data transmission over buses gives high transmission stability and integrity.
It is easy to understand that if the whole convolution calculation is not yet finished, the output feature map data of the current layer will be used as the input feature map data for the next layer's convolution operation; once the whole convolution calculation is finished, the output feature map data of the current layer are output through the output bus.
To improve the accuracy of the convolution operation, in another embodiment the hardware acceleration device further includes an offset (bias) cache unit, which is configured to cache offset data and provide offset compensation for the convolution calculation. Since the weight data and the input feature map data are operated on in order, the zero-removal processing unit must also control the read address of the weight cache unit; if the hardware acceleration device includes an offset cache unit, the zero-removal processing unit likewise controls the read address of the offset cache unit. Controlling read addresses for offsets is a conventional technique in convolution operations and is not described in detail in this embodiment.
Specifically, to keep the hardware structure simple, each acceleration kernel module has a maximum data size that it can process at one time, which is also the maximum storage size of the input image cache unit and the weight cache unit. In one embodiment, the maximum amount of data that the multiply-accumulate operation performed by each acceleration kernel module can directly process is set as: a convolution operation between an input feature map of size C × R × N and convolution kernels of size W × H × N × M, where C represents the width of the image, R represents the height of the image, N represents the number of channels, W represents the width of the convolution kernel, H represents the height of the convolution kernel, and M represents the number of sets of convolution kernels.
As shown in fig. 2, when the input feature map and the weight data are obtained and stored, the input image cache unit is a first random access memory (RAM) for caching input feature map data; the first RAM has C × R address spaces in total, and each address space stores the data of the N channels of one pixel.
As shown in fig. 3, when removing the zero values from the input feature map, the zero-removal processing unit processes the contents of each address space of the input image cache unit in turn, compressing the original N data into L data, where N ≥ L.
As shown in fig. 4, the weight cache unit is a second random access memory (RAM) for caching weight data; the second RAM has W × H × N address spaces in total, and each address space stores the weight data of the M sets of convolution kernels at one point.
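The storage layouts of FIGS. 2-4 and the zero-removal step can be modeled in a few lines of Python. This is a behavioral sketch only; the array names, the address formulas, and the tile sizes are illustrative assumptions, not taken from the patent beyond the C × R and W × H × N address counts it specifies:

    import numpy as np

    C, R, N, M, W, H = 4, 4, 8, 3, 3, 3       # illustrative tile sizes only

    # FIG. 2: the first RAM has C*R addresses; address r*C + c is assumed to
    # hold the N channel values of the pixel at (row r, column c)
    ifmap = np.random.default_rng(1).standard_normal((R, C, N)).clip(min=0)  # post-ReLU, so >= 0
    ifmap_ram = ifmap.reshape(R * C, N)

    # FIG. 3: zero removal compresses one address's N values into L <= N
    # (value, channel) pairs that are streamed serially to the MAC array
    word = ifmap_ram[0]
    channels = np.nonzero(word)[0]            # surviving channel indices
    pairs = list(zip(word[channels], channels))
    print(f"N = {N} values compressed to L = {len(pairs)}")

    # FIG. 4: the second RAM has W*H*N addresses; address (h*W + w)*N + n is
    # assumed to hold the M kernels' weights for kernel position (w, h), channel n
    weights = np.random.default_rng(2).standard_normal((H, W, N, M))
    weight_ram = weights.reshape(W * H * N, M)

Keeping the channel index alongside each surviving value is what lets the MAC array fetch the matching weight word for every skipped-ahead operand.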
To ensure that the multiply-accumulate operation array can compute in parallel at the single-pass maximum data amount, in one embodiment the multiply-accumulate operation array unit comprises M parallel MAC (multiply-accumulate) units, each of which implements the multiply-accumulate operation of the input feature map data with the weight data of one set of convolution kernels; this guarantees maximum parallelism of the convolution algorithm and significantly improves operation efficiency.
The working flow of the hardware acceleration device for calculating and optimizing the convolutional neural network of the present application is further described by the following embodiments.
Example 1
When the convolution calculation is performed, an idle acceleration kernel module is selected to carry out the work according to the idle state of the acceleration kernels.
An input feature image of C × R × N is acquired from the input bus, where C represents the width of the image, R represents the height of the image, and N represents the number of channels of the image; each address space in the input image cache unit RAM stores the data of the N channels of one pixel.
Convolution kernels of W × H × N × M are acquired from the input bus, where W represents the width of the convolution kernels, H represents the height of the convolution kernels, N represents the number of channels of the convolution kernels, and M represents the number of sets of convolution kernels; each address space in the weight cache unit RAM stores the weight data of the M sets of convolution kernels at one point.
N data are read in parallel from the input image cache unit (i.e., the data in one address space), and the zero-removal processing unit removes the zero data and serializes the remainder into L data.
As shown in fig. 5, in each cycle the multiply-accumulate operation array unit takes one of the L data as the operand to be calculated, together with the weight data, in each of the M sets of convolution kernels, whose channel index matches that operand, and performs the multiply-accumulate operation. The multiply-accumulate operation array consists of M parallel MAC units, each of which implements the multiply-accumulate operation of one set of convolution kernels with the input feature map data. After W × H × L cycles, the output feature map data of the M channels of one pixel are produced.
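The following self-contained Python sketch models this cycle-level behavior for one output pixel under the same illustrative assumptions as above (random data, the (h·W + w)·N + n weight-address formula); it checks that zero-skipping produces the same result as the dense convolution while issuing fewer cycles:

    import numpy as np

    W, H, N, M = 3, 3, 8, 4                     # illustrative sizes only
    rng = np.random.default_rng(0)
    patch = rng.standard_normal((H, W, N)).clip(min=0)   # post-ReLU input window
    weight_ram = rng.standard_normal((W * H * N, M))     # address (h*W + w)*N + n -> M weights

    acc = np.zeros(M)          # M parallel MAC units (an offset/bias unit would preload acc)
    cycles = 0
    for h in range(H):
        for w in range(W):
            word = patch[h, w]                           # one input-RAM word (N channels)
            for n in np.nonzero(word)[0]:                # only the L nonzero values are issued
                acc += word[n] * weight_ram[(h * W + w) * N + n]
                cycles += 1                              # one value per cycle, M MACs in parallel

    dense = np.einsum('hwn,hwnm->m', patch, weight_ram.reshape(H, W, N, M))
    assert np.allclose(acc, dense)                       # zero-skipping only changes the cycle count
    print(f"{cycles} cycles instead of {W * H * N}")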
As shown in fig. 6, the output image cache unit is also a RAM (a third random access memory); each address space in the third RAM stores the data of the M channels of one pixel. The output feature map data are cached, the negative values in them are changed to zero by the ReLU activation function, one layer of convolution operation is thus completed, and the result is output through the output bus.
The above is the calculation process of the hardware acceleration device of this embodiment for a single maximum calculation amount. It should be noted that if the input data exceed the single-pass maximum calculation amount, they need to be preprocessed before the convolution operation:
for the input feature map: if the size of the input feature map to be processed is C '. R'. N ', wherein C' represents the width of the image to be processed, R 'represents the height of the image to be processed, and N' represents the number of channels of the image to be processed; and if the N '> N is obtained, the input image cache unit stores the data of N' channels of one pixel point by using a plurality of continuous address spaces, and adopts N '/N (integer) accelerating kernel modules to operate the input characteristic diagram, and when the N'/N is calculated, the integer is obtained by adding 1 to the quotient if the remainder is not zero.
For example, if the maximum storable input feature size C × R × N of the input image cache unit is 20 × 20 × 32, then when N′ is within 32, C × R points, i.e., 20 × 20, may be stored; when N′ is 64, or greater than 32 and less than 64, C × R × N/N′ points may be stored, i.e., 20 × 20/2; other values of N′ follow the same pattern.
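A one-line Python sketch of this rounded-up quotient (the helper name is ours, not the patent's):

    def kernels_needed(n_prime: int, n: int) -> int:
        """Acceleration kernel modules needed for N' input channels when one
        module natively handles N: the quotient N'/N rounded up."""
        return -(-n_prime // n)              # ceiling division

    assert kernels_needed(32, 32) == 1
    assert kernels_needed(33, 32) == 2       # remainder nonzero -> quotient + 1
    assert kernels_needed(64, 32) == 2       # the 20 x 20 x 32 example above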
And if C′ × R′ > C × R, the input feature image is split into several blocks of size C × R × N and distributed to several acceleration kernel modules for operation. When splitting the input feature image, if C′ × R′ is not an integer multiple of C × R, the outer edge of the C′ × R′ map may be zero-padded first and then split.
For example, when C′ × R′ is 6 × 8 and C × R is 2 × 3, the outer edge of the C′ × R′ map needs zero-padding to reach a 6 × 9 specification before the splitting is performed (see the sketch below). It should be noted that in practical applications, for rigor, the splitting of the input feature image may need to account for overlapping borders between adjacent blocks.
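A hedged sketch of this zero-pad-then-split step, reproducing the 6 × 8 to 6 × 9 example (the helper name and the use of NumPy are our assumptions; border overlap is deliberately ignored here):

    import numpy as np

    def pad_to_tiles(fmap: np.ndarray, tile_h: int, tile_w: int) -> np.ndarray:
        """Zero-pad the outer edge so the map splits into whole tile_h x tile_w blocks."""
        h, w = fmap.shape[:2]
        pad_h, pad_w = (-h) % tile_h, (-w) % tile_w
        return np.pad(fmap, ((0, pad_h), (0, pad_w)) + ((0, 0),) * (fmap.ndim - 2))

    fmap = np.ones((8, 6, 3))                         # C' x R' = 6 x 8 (height 8, width 6)
    padded = pad_to_tiles(fmap, tile_h=3, tile_w=2)   # C x R = 2 x 3 tiles
    assert padded.shape[:2] == (9, 6)                 # height 8 -> 9, i.e. 6 x 8 becomes 6 x 9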
The splitting of the input feature image can be done by a splitting module on the CPU side; after the splitting module performs the split, each data block is sent to the scheduling control module for distribution, and jump addressing is supported for reading the external data. It should be noted that the splitting module may be part of the hardware acceleration device of this embodiment or an external module connected to it. Of course, the splitting of the input feature image can also be done by the scheduling control module.
For the convolution kernels: if the size of the convolution kernels to be processed is W′ × H′ × N′ × M′, where W′ represents the width of the convolution kernel to be processed, H′ represents its height, N′ represents its number of channels, and M′ represents the number of sets of convolution kernels to be processed, then if M′ > M, the convolution kernels can be split into several groups of M sets and distributed to several acceleration kernel modules for operation; this split can likewise be done by the same or a similar splitting module, or by the parameter storage module. Alternatively, if M′ > M, the weight cache unit uses several consecutive addresses to store the weight data of the M′ sets of convolution kernels at one point, similar to the handling of an input feature map with N′ > N, which is not described again.
After the input feature map data or the convolution kernels are split, each acceleration kernel module, having processed its operation request, writes its output feature map to addresses in an external SDRAM cache, where the pieces are stitched together. Each row of output data from each acceleration kernel module jumps addresses: for example, if the input feature image is split left and right and the two outputs are each 10 × 20 (ignoring channels for the moment), then addresses 0-9 store the data of acceleration kernel module 1, addresses 10-19 store the data of acceleration kernel module 2, addresses 20-29 store the data of module 1, addresses 30-39 store the data of module 2, and so on (a sketch follows).
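A minimal software model of this address-jumping stitch for the left/right split above (the Python list stands in for the external SDRAM cache; the function name is hypothetical):

    WIDTH, HALF, ROWS = 20, 10, 4                # full row width, half width, rows shown
    sdram = [None] * (WIDTH * ROWS)              # stand-in for the external SDRAM cache

    def write_half(module_id: int, col_offset: int) -> None:
        """Each module writes HALF pixels per row, then jumps a full WIDTH of addresses."""
        for row in range(ROWS):
            base = row * WIDTH + col_offset
            for col in range(HALF):
                sdram[base + col] = (module_id, row, col)

    write_half(1, 0)       # module 1: addresses 0-9, 20-29, 40-49, ...
    write_half(2, HALF)    # module 2: addresses 10-19, 30-39, 50-59, ...
    assert all(v is not None for v in sdram)     # the halves interleave into one image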
In this embodiment, a parallel computing structure is adopted while the multiply-accumulate operations whose input image data are zero are skipped, thereby achieving the purpose of accelerating the neural network operation.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (7)

1. A hardware acceleration device for convolutional neural network computation optimization, comprising a parameter storage module, a scheduling control module, and a plurality of acceleration kernel modules, each acceleration kernel module comprising an input image cache unit, a weight cache unit, a zero-removal processing unit, a multiply-accumulate operation array unit, a rectified linear unit, and an output image cache unit, wherein:
the parameter storage module is used for caching the convolutional neural network to be accelerated and its corresponding convolution kernels;
the scheduling control module is used for load-balancing the computation across the plurality of acceleration kernel modules, detecting idle acceleration kernel modules, and distributing input feature map data to be processed to an idle acceleration kernel module;
the input image cache unit is used for receiving and caching the input feature map data fed into the acceleration kernel module;
the weight cache unit is used for receiving and caching the convolution kernel output by the parameter storage module;
the zero-removal processing unit is used for removing zero values from the input feature map data;
the multiply-accumulate operation array unit is used for the multiply-accumulate operations between the weight data in the convolution kernel and the zero-removed input feature map data, and for outputting the convolution operation result;
the rectified linear unit is used for correcting negative numbers in the convolution operation result to zero to obtain a corrected result;
and the output image cache unit is used for caching the corrected result as output feature map data, which serves as the input feature map data for the next layer's convolution operation.
2. The hardware acceleration device for convolutional neural network computation optimization of claim 1, wherein the maximum amount of data that the multiply-accumulate operation performed by the acceleration kernel module can directly process at one time is: a convolution operation between an input feature map of size C × R × N and convolution kernels of size W × H × N × M, where C represents the width of the image, R represents the height of the image, N represents the number of channels, W represents the width of the convolution kernel, H represents the height of the convolution kernel, and M represents the number of sets of convolution kernels.
3. The hardware acceleration device for convolutional neural network computation optimization of claim 2, wherein the input image cache unit is a first random access memory for caching input feature map data; the first random access memory has C × R address spaces in total, and each address space stores the data of the N channels of one pixel.
4. The hardware acceleration device for convolutional neural network computation optimization of claim 2, wherein the weight cache unit is a second random access memory for caching weight data; the second random access memory has W × H × N address spaces in total, and each address space stores the weight data of the M sets of convolution kernels at one point.
5. The hardware acceleration device for convolutional neural network computation optimization of claim 2, wherein the multiply-accumulate array unit comprises M parallel MAC units, each MAC unit implementing the multiply-accumulate operation of the input feature map data with the weight data of one set of convolution kernels.
6. The hardware acceleration device for convolutional neural network computation optimization of claim 3, wherein if the size of the input feature map to be processed is C′ × R′ × N′, where C′ represents the width of the image to be processed, R′ represents the height of the image to be processed, and N′ represents the number of channels of the image to be processed:
if N′ > N, the input image cache unit uses several consecutive address spaces to store the data of the N′ channels of one pixel; and if C′ × R′ > C × R, the input feature map to be processed is split into several blocks of size C × R × N, which are distributed to several acceleration kernel modules for operation.
7. The hardware acceleration device for convolutional neural network computation optimization of claim 4, wherein, in the parameter storage module, if the size of the convolution kernels to be processed is W′ × H′ × N′ × M′, where W′ represents the width of the convolution kernel to be processed, H′ represents the height of the convolution kernel to be processed, N′ represents the number of channels of the convolution kernel to be processed, and M′ represents the number of sets of convolution kernels to be processed:
if M′ > M, the convolution kernels are split into several groups of M sets of convolution kernels, which are distributed to several acceleration kernel modules for operation;
or, if M′ > M, the weight cache unit stores the weight data of the M′ sets of convolution kernels at one point using consecutive addresses.
CN202011279360.1A 2020-11-16 2020-11-16 Hardware accelerator for convolution neural network calculation optimization Active CN112465110B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011279360.1A CN112465110B (en) 2020-11-16 2020-11-16 Hardware accelerator for convolution neural network calculation optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011279360.1A CN112465110B (en) 2020-11-16 2020-11-16 Hardware accelerator for convolution neural network calculation optimization

Publications (2)

Publication Number Publication Date
CN112465110A true CN112465110A (en) 2021-03-09
CN112465110B CN112465110B (en) 2022-09-13

Family

ID=74836284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011279360.1A Active CN112465110B (en) 2020-11-16 2020-11-16 Hardware accelerator for convolution neural network calculation optimization

Country Status (1)

Country Link
CN (1) CN112465110B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435586A (en) * 2021-08-03 2021-09-24 北京大学深圳研究生院 Convolution operation device and system for convolution neural network and image processing device
CN113487017A (en) * 2021-07-27 2021-10-08 湖南国科微电子股份有限公司 Data convolution processing method and device and computer equipment
CN113642724A (en) * 2021-08-11 2021-11-12 西安微电子技术研究所 CNN accelerator with high bandwidth storage
CN114169514A (en) * 2022-02-14 2022-03-11 浙江芯昇电子技术有限公司 Convolution hardware acceleration method and convolution hardware acceleration circuit
CN114254740A (en) * 2022-01-18 2022-03-29 长沙金维信息技术有限公司 Convolution neural network accelerated calculation method, calculation system, chip and receiver
CN114997392A (en) * 2022-08-03 2022-09-02 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
WO2023030061A1 (en) * 2021-09-03 2023-03-09 Oppo广东移动通信有限公司 Convolution operation circuit and method, neural network accelerator and electronic device
CN116152520A (en) * 2023-04-23 2023-05-23 深圳市九天睿芯科技有限公司 Data processing method for neural network accelerator, chip and electronic equipment
CN117391149A (en) * 2023-11-30 2024-01-12 爱芯元智半导体(宁波)有限公司 Processing method, device and chip for neural network output data

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951395A (en) * 2017-02-13 2017-07-14 上海客鹭信息技术有限公司 Towards the parallel convolution operations method and device of compression convolutional neural networks
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110390382A (en) * 2019-06-20 2019-10-29 东南大学 A kind of convolutional neural networks hardware accelerator with novel feature figure cache module
KR102038390B1 (en) * 2018-07-02 2019-10-31 한양대학교 산학협력단 Artificial neural network module and scheduling method thereof for highly effective parallel processing
CN110622206A (en) * 2017-06-15 2019-12-27 三星电子株式会社 Image processing apparatus and method using multi-channel feature map
CN110826707A (en) * 2018-08-10 2020-02-21 北京百度网讯科技有限公司 Acceleration method and hardware accelerator applied to convolutional neural network
US20200167405A1 (en) * 2018-11-28 2020-05-28 Electronics And Telecommunications Research Institute Convolutional operation device with dimensional conversion
CN111242277A (en) * 2019-12-27 2020-06-05 中国电子科技集团公司第五十二研究所 Convolutional neural network accelerator supporting sparse pruning and based on FPGA design
US20200192726A1 (en) * 2018-12-12 2020-06-18 Samsung Electronics Co., Ltd. Method and apparatus for load balancing in neural network

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951395A (en) * 2017-02-13 2017-07-14 上海客鹭信息技术有限公司 Towards the parallel convolution operations method and device of compression convolutional neural networks
CN110622206A (en) * 2017-06-15 2019-12-27 三星电子株式会社 Image processing apparatus and method using multi-channel feature map
KR102038390B1 (en) * 2018-07-02 2019-10-31 한양대학교 산학협력단 Artificial neural network module and scheduling method thereof for highly effective parallel processing
CN110826707A (en) * 2018-08-10 2020-02-21 北京百度网讯科技有限公司 Acceleration method and hardware accelerator applied to convolutional neural network
US20200167405A1 (en) * 2018-11-28 2020-05-28 Electronics And Telecommunications Research Institute Convolutional operation device with dimensional conversion
US20200192726A1 (en) * 2018-12-12 2020-06-18 Samsung Electronics Co., Ltd. Method and apparatus for load balancing in neural network
CN110390382A (en) * 2019-06-20 2019-10-29 东南大学 A kind of convolutional neural networks hardware accelerator with novel feature figure cache module
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN111242277A (en) * 2019-12-27 2020-06-05 中国电子科技集团公司第五十二研究所 Convolutional neural network accelerator supporting sparse pruning and based on FPGA design

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUIMIN LI ET AL: "A high performance FPGA-based accelerator for large-scale convolutional neural networks", Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '16), IEEE *
FANG RUI ET AL: "Design of an FPGA parallel acceleration scheme for convolutional neural networks", Computer Engineering and Applications *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487017A (en) * 2021-07-27 2021-10-08 湖南国科微电子股份有限公司 Data convolution processing method and device and computer equipment
CN113435586A (en) * 2021-08-03 2021-09-24 北京大学深圳研究生院 Convolution operation device and system for convolution neural network and image processing device
CN113642724A (en) * 2021-08-11 2021-11-12 西安微电子技术研究所 CNN accelerator with high bandwidth storage
WO2023030061A1 (en) * 2021-09-03 2023-03-09 Oppo广东移动通信有限公司 Convolution operation circuit and method, neural network accelerator and electronic device
CN114254740B (en) * 2022-01-18 2022-09-30 长沙金维信息技术有限公司 Convolution neural network accelerated calculation method, calculation system, chip and receiver
CN114254740A (en) * 2022-01-18 2022-03-29 长沙金维信息技术有限公司 Convolution neural network accelerated calculation method, calculation system, chip and receiver
CN114169514A (en) * 2022-02-14 2022-03-11 浙江芯昇电子技术有限公司 Convolution hardware acceleration method and convolution hardware acceleration circuit
CN114997392B (en) * 2022-08-03 2022-10-21 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
CN114997392A (en) * 2022-08-03 2022-09-02 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
CN116152520A (en) * 2023-04-23 2023-05-23 深圳市九天睿芯科技有限公司 Data processing method for neural network accelerator, chip and electronic equipment
CN116152520B (en) * 2023-04-23 2023-07-07 深圳市九天睿芯科技有限公司 Data processing method for neural network accelerator, chip and electronic equipment
CN117391149A (en) * 2023-11-30 2024-01-12 爱芯元智半导体(宁波)有限公司 Processing method, device and chip for neural network output data
CN117391149B (en) * 2023-11-30 2024-03-26 爱芯元智半导体(宁波)有限公司 Processing method, device and chip for neural network output data

Also Published As

Publication number Publication date
CN112465110B (en) 2022-09-13

Similar Documents

Publication Publication Date Title
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN111242277B (en) Convolutional neural network accelerator supporting sparse pruning based on FPGA design
US20210357736A1 (en) Deep neural network hardware accelerator based on power exponential quantization
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN108229671B (en) System and method for reducing storage bandwidth requirement of external data of accelerator
CN112668708B (en) Convolution operation device for improving data utilization rate
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
WO2022134465A1 (en) Sparse data processing method for accelerating operation of re-configurable processor, and device
CN112257844B (en) Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof
CN110069444A (en) A kind of computing unit, array, module, hardware system and implementation method
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
CN108197075B (en) Multi-core implementation method of Inceptation structure
CN209708122U (en) A kind of computing unit, array, module, hardware system
CN114491402A (en) Calculation method for sparse matrix vector multiplication access optimization
CN108647780B (en) Reconfigurable pooling operation module structure facing neural network and implementation method thereof
CN111860819B (en) Spliced and sectionable full-connection neural network reasoning accelerator and acceleration method thereof
CN113516236A (en) VGG16 network parallel acceleration processing method based on ZYNQ platform
CN110766136B (en) Compression method of sparse matrix and vector
KR20220101418A (en) Low power high performance deep-neural-network learning accelerator and acceleration method
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
CN115204364A (en) Convolution neural network hardware accelerating device for dynamic allocation of cache space
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN114692079A (en) GPU batch matrix multiplication accelerator and processing method thereof
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant