CN108647773B - Hardware interconnection system for a reconfigurable convolutional neural network - Google Patents

Hardware interconnection system for a reconfigurable convolutional neural network

Info

Publication number
CN108647773B
CN108647773B
Authority
CN
China
Prior art keywords
module
neural network
basic
convolutional neural
calculation
Prior art date
Legal status
Active
Application number
CN201810358443.6A
Other languages
Chinese (zh)
Other versions
CN108647773A (en)
Inventor
曹伟
王伶俐
谢亮
罗成
范锡添
周学功
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date: 2018-04-20
Filing date: 2018-04-20
Publication date: 2021-07-23
Application filed by Fudan University
Priority to CN201810358443.6A
Publication of CN108647773A
Application granted
Publication of CN108647773B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)
  • Image Processing (AREA)
  • Stored Programmes (AREA)

Abstract

The invention belongs to the technical field of hardware design for image processing algorithms, and specifically relates to a hardware interconnection architecture for a reconfigurable convolutional neural network. The interconnection architecture of the invention comprises: a data and parameter off-chip cache module, which caches the input pixel data of the picture to be processed and the input parameters used during convolutional neural network computation; a basic computing unit array module, which implements the core computation of the convolutional neural network; and an arithmetic logic unit computing module, which processes the results of the basic computing unit array to implement the downsampling layer, the activation function, and partial-sum accumulation. The basic computing units are interconnected as a two-dimensional array: along the row direction they share input data and compute in parallel with different parameter data; along the column direction, results are passed down row by row and serve as inputs to the next row. Through this structured interconnection, the invention improves data reuse while reducing bandwidth requirements.

Description

Hardware interconnection system for a reconfigurable convolutional neural network
Technical Field
The invention belongs to the technical field of hardware design of image processing algorithms, and particularly relates to a hardware interconnection system of a reconfigurable convolutional neural network.
Background
With the rise of artificial intelligence, deep learning is being applied ever more widely to computer vision, speech recognition, and other big-data applications. Convolutional neural networks, an important algorithm model in deep learning, are widely used in image classification, face recognition, video detection, speech recognition, and similar tasks.
As image-recognition accuracy improves, convolutional neural networks become ever more complex and their computational requirements grow. Conventional general-purpose processors, which carry large amounts of resources redundant for this workload, are therefore poorly suited to large convolutional neural networks and cannot meet practical needs in many scenarios. Accelerating convolutional neural networks with dedicated hardware platforms has thus gradually become the mainstream engineering approach; implementing the algorithm with a parallel, pipelined hardware architecture yields a good speedup and enables real-time processing.
Although the performance of hardware platforms is far higher than that of traditional general-purpose processors, the network structures of convolutional neural networks are increasingly complex and must handle convolution kernels of various sizes. Because hardware structures offer little flexibility, a hardware processor can improve the acceleration of only certain specific networks and cannot accelerate other networks efficiently. The demand for a reconfigurable convolutional neural network hardware architecture applicable to convolution kernels of any size and to various network structures is therefore becoming ever more urgent.
For reconfigurable convolutional neural network hardware, current research faces two main difficulties: how to configure convolution kernels of different sizes on a fixed hardware structure, and how to keep that structure highly efficient for any convolution kernel. In the prior art, a hardware structure implementing the most common 3×3 convolution template is designed as the basic unit; data is then passed through the interconnections between basic units, and several basic units together implement other kernel types (for example, 3 basic units are needed to implement one 5×5 convolution template). Under this scheme, the number of basic units in the hardware is fixed while the number needed to realize one convolution kernel varies, so some basic units sit idle when convolutions of certain sizes are realized, and one-hundred-percent utilization cannot be guaranteed. In addition, not every basic unit produces valid output under this scheme: all outputs must first be buffered and the valid data selected afterwards, wasting a large amount of extra storage resources.
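To make the utilization penalty concrete, the following sketch (ours for illustration, not taken from the patent; Python is used throughout these examples) counts how many fixed 3×3 basic units a K×K template occupies under the prior-art scheme and what fraction of their multipliers do useful work:

```python
# Illustrative model of the prior-art mapping: a KxK convolution template
# tiled onto ceil(K*K / 9) fixed 3x3 basic units, some multipliers idle.
import math

def prior_art_utilization(k: int, unit_size: int = 3) -> float:
    """Fraction of multipliers doing useful work for a KxK kernel."""
    multiplies_needed = k * k
    units_needed = math.ceil(multiplies_needed / unit_size ** 2)
    return multiplies_needed / (units_needed * unit_size ** 2)

for k in (1, 3, 5, 7):
    print(f"{k}x{k} kernel: {prior_art_utilization(k):.0%} utilization")
# 1x1 -> 11%, 3x3 -> 100%, 5x5 -> 93% (25 of 27), 7x7 -> 91% (49 of 54)
```

Only kernel sizes that are multiples of the basic unit reach full utilization, which is precisely the inefficiency the present invention sets out to remove.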
Many existing image-processing hardware accelerators design an optimal hardware structure for one algorithm and then support other algorithms through interconnection. Because algorithms are diverse while the hardware scale is fixed, resource utilization and computing performance drop whenever the network model or the computation type changes. Moreover, in the field of image processing, and in neural network algorithms in particular, bandwidth is, alongside compute-unit utilization, a key factor limiting computing performance.
What is needed is a way to increase data reuse through structured interconnection while reducing bandwidth requirements.
Disclosure of Invention
The invention aims to provide a hardware interconnection system for a reconfigurable convolutional neural network that improves data reuse through structured interconnection and reduces bandwidth requirements.
The invention provides a hardware interconnection system for a reconfigurable convolutional neural network, applied to the field of image processing, the system comprising:
a data and parameter off-chip cache module, used to cache the input pixel data of the picture to be processed and to cache the input parameters during convolutional neural network computation;
a basic computing unit array module, connected to the data and parameter off-chip cache module and used to implement the core computation of the convolutional neural network; the basic computing units are interconnected as a two-dimensional array, sharing input data along the row direction and computing in parallel with different parameter data; along the column direction, the results of the basic computing unit array are passed down row by row and serve as inputs to the next row;
an arithmetic logic unit computing module, connected to the basic computing unit array module and used to process the results of the basic computing unit array, implementing the downsampling layer, the activation function, and partial-sum accumulation;
wherein, when a convolutional neural network is computed and each row of the basic computing unit array module independently computes an output feature map, convolution kernels of different sizes are mapped by partitioning the input feature map according to position within the convolution window and feeding the partitions in over multiple passes;
and a control module, connected to the basic computing unit array module and the arithmetic logic unit computing module, used to realize convolution kernels of any size and multiple computing modes according to different parameters.
Optionally, when the arithmetic logic unit computing module and the basic computing unit array module compute the convolutional neural network, each partitioned input feature map requires only a single parameter of the convolution kernel.
Optionally, the basic computing unit array module is built by interconnection, and the interconnection pattern and the data paths inside the module can be changed to support computations of different types and different bit widths.
Optionally, the hardware interconnection system is implemented on either a field-programmable gate array or an integrated circuit.
According to the specific embodiments provided herein, the invention achieves the following technical effects. A new mapping scheme maps the convolutional neural network so that the degrees of freedom of different network models are shifted from the spatial level to the temporal level, allowing a fixed hardware interconnection system to maintain one-hundred-percent resource utilization for any convolutional neural network. Within the basic computing unit module, the digital signal processing (DSP) units are fully utilized, and a flexible interconnection structure supports different computing modes and variable data bit widths. Pipelined parallel computation between rows and input-data reuse across columns reduce the input-data and output bandwidth requirements.
Drawings
Fig. 1 is a schematic structural diagram of a hardware interconnection system of a convolutional neural network provided in the present invention.
Fig. 2 is a schematic diagram of a specific mapping of the mapping scheme provided by the present invention.
Fig. 3 is a schematic diagram of an internal structure of the basic computing module provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments described herein without creative effort shall fall within the protection scope of the present invention.
The invention aims to provide a hardware interconnection system for a convolutional neural network that improves data reuse through structured interconnection and reduces bandwidth requirements.
A hardware interconnection system for a reconfigurable convolutional neural network, used in the field of image processing, the interconnection system comprising:
a data and parameter off-chip cache module, used to cache the input pixel data of the picture to be processed and to cache the input parameters during convolutional neural network computation;
a basic computing unit array module, connected to the data and parameter off-chip cache module and used to implement the core image-processing computation, the convolutional neural network; the basic computing modules are interconnected as a two-dimensional array, sharing input data along the row direction and computing in parallel with different parameter data; along the column direction, the results of the basic computing modules are passed down row by row and serve as inputs to the next row;
wherein, when a convolutional neural network is computed and each row of the basic computing unit array module independently computes an output feature map, convolution kernels of different sizes are mapped by partitioning the input feature map according to position within the convolution window and feeding the partitions in over multiple passes;
an arithmetic logic unit computing module, connected to the basic computing module and used to process the results of the basic computing unit array, implementing the downsampling layer, the activation function, and partial-sum accumulation;
and a control module, connected to the basic computing module and the arithmetic logic unit computing module, used to realize convolution kernels of any size and multiple computing modes according to different parameters; with it, efficient and flexible computing units and interconnection structures can be configured for convolutional neural networks of different network structures, as well as for the distance, vector, and digital-signal-processing computations found in image-processing methods.
Optionally, when the arithmetic logic unit computing module and the basic computing module compute the convolutional neural network, each partitioned input feature map requires only a single parameter of the convolution kernel.
Optionally, the basic computing module is built by interconnection, and the interconnection pattern and the data paths inside the module can be changed to support computations of different types and different bit widths.
Optionally, the system is implemented on either a field-programmable gate array or an integrated circuit.
As shown in Fig. 1, the data and parameter off-chip cache module 1 is configured as follows: externally input image data and parameters are first buffered in this module until read out by the basic computing unit array module, and computed results are likewise cached here until called again by the arithmetic logic unit computing module or output. The data and parameter off-chip cache module acts as a large first-in first-out (FIFO) buffer for absorbing the large mismatch between the processing speed of the basic computing unit array module and the external input speed; without it, sending external data directly to the internal basic computing unit array would require relatively complex handshake signals to guarantee that no data is dropped or retransmitted when the two speeds do not match.
The basic computing unit array module 2, connected to the data and parameter off-chip cache module 1, implements the core computation of the convolutional neural network; the interconnection between the basic computing units is shown in Fig. 1. The interconnection system forms a multi-stage pipeline in which every stage performs the same function: multiplying an input feature map by a parameter and adding the partial sum produced by the previous stage. When the number of input feature maps exceeds the number of rows of the basic computing unit array, the partial results are sent from the off-chip cache module back to the arithmetic logic unit computing module for continued accumulation into the final output feature map. One cycle after the first feature map begins computing in the first pipeline stage, the second feature map begins computing in the second stage, so the first stage delivers its result exactly when the second stage needs it for the addition; when the pipeline is completely filled, high parallelism and high processing performance are achieved. For the pipeline computation to be efficient, the number of rows/columns of the basic computing unit array should, as far as possible, be a factor of the number of input/output feature maps.
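As a functional model of this dataflow, the following minimal sketch (our own, with illustrative names; cycle-level timing is ignored and each stage is modeled as a whole-feature-map operation) shows one pass of the row pipeline and the role of the arithmetic logic unit module when the input maps outnumber the rows:

```python
# Each pipeline row multiplies its input feature map by its parameter and
# adds the partial sum handed down from the row above; only the last row's
# result leaves the array, so intermediate rows need no output buffers.
import numpy as np

def pe_array_pass(input_fmaps: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """input_fmaps: (rows, H, W), one feature map per pipeline stage.
    weights: (rows,), one 1x1 parameter per stage."""
    partial = np.zeros_like(input_fmaps[0])
    for r in range(input_fmaps.shape[0]):   # one pipeline stage per row
        partial = partial + input_fmaps[r] * weights[r]
    return partial                           # single output per column

# More input maps than rows: the ALU module accumulates successive passes,
# reading the earlier partial sum back from the off-chip cache.
maps = np.random.rand(8, 16, 16)
w = np.random.rand(8)
final = pe_array_pass(maps[:4], w[:4]) + pe_array_pass(maps[4:], w[4:])
```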
The arithmetic logic unit computing module 3, connected to the basic computing unit array module 2, implements the partial-sum accumulation of convolution, the downsampling layer, and the activation-function layer. It is configured specifically as follows: in a convolutional layer, after the results obtained from the last row of the basic computing unit array are stored in the off-chip cache module, any results that are still partial sums are read back from the off-chip cache module, sent to the arithmetic logic unit computing module for further accumulation with the last-row results of the next group of the basic computing unit array, and stored again in the off-chip cache module, thereby accumulating the results of several groups of input feature maps. In the downsampling layer, the arithmetic logic unit computing module reads the output feature map of the previous convolutional layer from the off-chip cache module and, according to whether max-value or average-value downsampling is required, selects a comparator or an adder as the downsampling function. In the activation-function layer, the ReLU function in this example, the arithmetic logic unit computing module implements the activation function using a data selector (Mux).
Besides computing convolutional neural networks, the hardware interconnection system of the reconfigurable convolutional neural network also effectively supports distance computation, linear-algebra computation, and image-signal-processing computation.
The control module 4, connected to the modules above, generates the control signals for the remaining modules. It is configured specifically as follows: it receives and decodes external instructions, generates the enable signals and addresses with which the off-chip cache module reads and writes external data and the basic computing unit array reads and writes the off-chip cache module, and generates the data- and function-gating signals inside the basic computing unit array module and the arithmetic logic unit computing module.
The basic computing unit array 2 described above is organized as a multi-stage pipeline. During mapping, the image data occupying the same position of the convolution template within the sliding convolution window (i.e., multiplied by the same parameter) are extracted to form a new input feature map; in other words, the convolution template is split into a number of 1×1 convolution templates, and the results computed by the split templates are accumulated to recover the original convolution result. Taking the 5×5 convolution template as an example, the mapping is shown in Fig. 2. In this way, a convolution template of any size can be divided into 1×1 templates, so the structure stays fully efficient for a convolution kernel of any size. Furthermore, the inter-row pipeline structure uses the DSP function efficiently while reducing the number of output ports from the number of computing units to the number of array columns; the results of all computing units except the last row therefore need no extra buffer space, lowering the demand for on-chip storage resources.
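The equivalence behind this mapping can be checked numerically. The sketch below (our NumPy model; the function names are illustrative, and stride 1 with no padding is assumed) splits a K×K template into K×K single-parameter products of regrouped feature maps, as in Fig. 2, and verifies that accumulating them recovers the direct convolution:

```python
# A KxK convolution rewritten as K*K 1x1 convolutions: for each kernel
# position (i, j), the pixels that multiply w[i, j] form one new input map.
import numpy as np

def conv_direct(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    K = w.shape[0]
    H, W = x.shape[0] - K + 1, x.shape[1] - K + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(x[i:i + K, j:j + K] * w)
    return out

def conv_as_1x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    K = w.shape[0]
    H, W = x.shape[0] - K + 1, x.shape[1] - K + 1
    out = np.zeros((H, W))
    for i in range(K):
        for j in range(K):
            # The regrouped map needs only the single parameter w[i, j].
            out += x[i:i + H, j:j + W] * w[i, j]
    return out

x, w = np.random.rand(8, 8), np.random.rand(5, 5)  # any kernel size works
assert np.allclose(conv_direct(x, w), conv_as_1x1(x, w))
```

Because every regrouped map is multiplied by exactly one weight, a partitioned input feature map needs only a single convolution-kernel parameter, as claim 2 states.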
The arithmetic logic unit computing module 3 described above links the downsampling layer, the activation-function layer, and the convolutional layer. Because the downsampling and activation-function layers need no parameters and involve only simple operations, the system does not map these two layers onto the basic computing unit array; instead, to reduce the external data-bandwidth requirement and improve computing efficiency, the comparison, addition, or data-selection operation is completed in the arithmetic logic unit computing module before the convolutional-layer result is output. Concretely, for max downsampling the arithmetic logic unit computing module is configured as a comparator; for average downsampling, the denominator is divided out of the preceding convolutional layer's weights in advance, so the module need only be configured as an adder; and for the activation function ReLU the module is configured as a data selector (Mux).
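A hedged functional sketch of these three configurations (the mode names and the function itself are ours; the hardware selects among a comparator, an adder, and a data selector rather than branching in software):

```python
# Model of the ALU module's three post-convolution configurations.
import numpy as np

def alu(mode: str, a, b=None):
    if mode == "max":    # comparator: max-value downsampling
        return np.maximum(a, b)
    if mode == "avg":    # adder only: the 1/N denominator has already been
        return a + b     # folded into the previous conv layer's weights
    if mode == "relu":   # data selector (Mux): forward the value or zero
        return np.where(a > 0, a, np.zeros_like(a))
    raise ValueError(f"unknown mode: {mode}")
```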
In the control module 4 described above, the control signals generated for the basic computing unit array are transmitted stage by stage alongside the data, matching the computation timing of each pipeline stage. Because the same control signal is delayed over multiple stages, the number of computing units each copy must drive falls from the total number of computing units to the number of columns, reducing the fan-out of the control signal and allowing a higher chip operating frequency.
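The timing and fan-out argument can be modeled as a shift register on the control path. In this sketch (our assumption: one register per pipeline row, so each registered copy of a control bit drives only the computing units of one row), row r sees every control bit r cycles after row 0, in step with its data:

```python
# Control bits re-registered once per row: matched timing, and each register
# drives one row of computing units instead of the whole array.
from collections import deque

def delayed_controls(ctrl_stream, rows):
    """Yield, per cycle, the control bit currently seen by each row."""
    regs = deque([0] * rows, maxlen=rows)  # one register per pipeline stage
    for c in ctrl_stream:
        regs.appendleft(c)                 # shift the control down the rows
        yield list(regs)                   # regs[r] = control bit at row r

for view in delayed_controls([1, 0, 1, 1], rows=3):
    print(view)  # [1,0,0] [0,1,0] [1,0,1] [1,1,0]: row r lags by r cycles
```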
The invention realizes a reconfigurable convolutional neural network hardware system with a new interconnection structure and a new mapping scheme. Resource utilization reaches 100% for convolution kernels of any size, yielding high performance at low hardware resource cost.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (7)

1. A hardware interconnection system for a reconfigurable convolutional neural network, applied to the field of image processing, characterized by comprising:
a data and parameter off-chip cache module, used to cache the input pixel data of the picture to be processed and to cache the input parameters during convolutional neural network computation;
a basic computing unit array module, connected to the data and parameter off-chip cache module; the basic computing unit array module is used to implement the core computation of the convolutional neural network; the basic computing units are interconnected as a two-dimensional array, sharing input data along the row direction and computing in parallel with different parameter data; along the column direction, the results of the basic computing unit array are passed down row by row and serve as inputs to the next row;
an arithmetic logic unit computing module, connected to the basic computing unit array module and used to process the results of the basic computing unit array, implementing the downsampling layer, the activation function, and partial-sum accumulation;
wherein, when a convolutional neural network is computed and each row of the basic computing unit array module independently computes an output feature map, convolution kernels of different sizes are mapped by partitioning the input feature map according to position within the convolution window and feeding the partitions in over multiple passes;
and a control module, connected to the basic computing unit array module and the arithmetic logic unit computing module, used to realize convolution kernels of any size and multiple computing modes according to different parameters.
2. The hardware interconnection system of the reconfigurable convolutional neural network of claim 1, wherein, when the arithmetic logic unit computing module and the basic computing unit array module compute the convolutional neural network, each partitioned input feature map requires only a single parameter of the convolution kernel.
3. The hardware interconnection system of the reconfigurable convolutional neural network of claim 1, wherein the interior of the basic computing unit array module is interconnected such that the interconnection pattern and the data paths inside the module can be changed to support computations of different types and different bit widths.
4. The hardware interconnection system of the reconfigurable convolutional neural network of claim 1, wherein the hardware interconnection system is implemented on either a field-programmable gate array or an integrated circuit.
5. The hardware interconnection system of the reconfigurable convolutional neural network of claim 1, wherein the basic computing unit array modules are interconnected as a two-dimensional array forming a multi-stage pipeline, every stage of which performs the same function: multiplying the input feature map by the parameters and adding the partial sum to the result obtained from the previous stage; input data is computed from the first pipeline stage onward, stage by stage, until the last stage stores its result in the off-chip cache module; when the number of input feature maps exceeds the number of rows of the basic computing unit array, the results are sent from the off-chip cache module back to the arithmetic logic unit computing module for continued accumulation into the final output feature map; one cycle after the first feature map begins computing in the first pipeline stage, the second feature map begins computing in the second stage, so that the first stage delivers its result exactly when it is needed for the addition in the second stage; when the pipeline is completely filled, high parallelism and high processing performance are achieved.
6. The hardware interconnection system of the reconfigurable convolutional neural network of claim 1, wherein the arithmetic logic unit computing module is used to implement the partial-sum accumulation of convolution, the downsampling layer, and the activation-function layer, configured specifically as follows: in the convolutional layer, after the results obtained from the last row of the basic computing unit array are stored in the off-chip cache module, any results that are still partial sums are read from the off-chip cache module, sent to the arithmetic logic unit computing module for further accumulation with the last-row results of the next group of the basic computing unit array, and stored again in the off-chip cache module, thereby accumulating the results of several groups of input feature maps; in the downsampling layer, the arithmetic logic unit computing module reads the output feature map of the previous convolutional layer from the off-chip cache module and, according to whether max-value or average-value downsampling is required, selects a comparator or an adder as the downsampling function; in the activation-function layer, the arithmetic logic unit computing module implements the activation function using a data selector (Mux).
7. The hardware interconnection system of the reconfigurable convolutional neural network of claim 1, wherein the control module is used to generate the control signals for the remaining modules, configured specifically as follows: it receives and decodes external instructions, generates the enable signals and addresses with which the off-chip cache module reads and writes external data and the basic computing unit array reads and writes the off-chip cache module, and generates the data- and function-gating signals inside the basic computing unit array module and the arithmetic logic unit computing module.
CN201810358443.6A 2018-04-20 2018-04-20 Hardware interconnection system for a reconfigurable convolutional neural network Active CN108647773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810358443.6A CN108647773B (en) 2018-04-20 2018-04-20 Hardware interconnection system for a reconfigurable convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810358443.6A CN108647773B (en) 2018-04-20 2018-04-20 Hardware interconnection system for a reconfigurable convolutional neural network

Publications (2)

Publication Number Publication Date
CN108647773A CN108647773A (en) 2018-10-12
CN108647773B true CN108647773B (en) 2021-07-23

Family

ID=63747074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810358443.6A Active CN108647773B (en) Hardware interconnection system for a reconfigurable convolutional neural network

Country Status (1)

Country Link
CN (1) CN108647773B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102020100209A1 (en) * 2019-01-21 2020-07-23 Samsung Electronics Co., Ltd. Neural network device, neural network system and method for processing a neural network model by using a neural network system
CN109902821B (en) * 2019-03-06 2021-03-16 苏州浪潮智能科技有限公司 Data processing method and device and related components
CN110503189B (en) * 2019-08-02 2021-10-08 腾讯科技(深圳)有限公司 Data processing method and device
US11556450B2 (en) * 2019-10-11 2023-01-17 International Business Machines Corporation Hybrid data-model parallelism for efficient deep learning
CN112766453A (en) * 2019-10-21 2021-05-07 华为技术有限公司 Data processing device and data processing method
CN111582451B (en) * 2020-05-08 2022-09-06 中国科学技术大学 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN111738433B (en) * 2020-05-22 2023-09-26 华南理工大学 Reconfigurable convolution hardware accelerator
CN113971261A (en) * 2020-07-23 2022-01-25 中科亿海微电子科技(苏州)有限公司 Convolution operation device, convolution operation method, electronic device, and medium
WO2022126630A1 (en) * 2020-12-18 2022-06-23 清华大学 Reconfigurable processor and method for computing multiple neural network activation functions thereon
CN112860320A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for data processing based on RISC-V instruction set
CN113191491B (en) * 2021-03-16 2022-08-09 杭州慧芯达科技有限公司 Multi-dimensional parallel artificial intelligence processor architecture
CN113064852B (en) * 2021-03-24 2022-06-10 珠海一微半导体股份有限公司 Reconfigurable processor and configuration method
CN113240074B (en) * 2021-04-15 2022-12-06 中国科学院自动化研究所 Reconfigurable neural network processor
CN114416182B (en) * 2022-03-31 2022-06-17 深圳致星科技有限公司 FPGA accelerator and chip for federal learning and privacy computation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10552732B2 (en) * 2016-08-22 2020-02-04 Kneron Inc. Multi-layer neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103019656A (en) * 2012-12-04 2013-04-03 中国科学院半导体研究所 Dynamically reconfigurable multi-stage parallel single instruction multiple data array processing system
CN107836001A (en) * 2015-06-29 2018-03-23 微软技术许可有限责任公司 Convolutional neural networks on hardware accelerator
CN107851214A (en) * 2015-07-23 2018-03-27 米雷普里卡技术有限责任公司 For the performance enhancement of two-dimensional array processor
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN107832839A (en) * 2017-10-31 2018-03-23 北京地平线信息技术有限公司 The method and apparatus for performing the computing in convolutional neural networks
CN107832699A (en) * 2017-11-02 2018-03-23 北方工业大学 Method and device for testing interest point attention degree based on array lens
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array

Also Published As

Publication number Publication date
CN108647773A (en) 2018-10-12

Similar Documents

Publication Publication Date Title
CN108647773B (en) Hardware interconnection system for a reconfigurable convolutional neural network
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN111416743B (en) Convolutional network accelerator, configuration method and computer readable storage medium
WO2020258527A1 (en) Deep neural network hardware accelerator based on power exponent quantisation
CN106022468A (en) Artificial neural network processor integrated circuit and design method therefor
CN107256424B (en) Three-value weight convolution network processing system and method
CN111488983A (en) Lightweight CNN model calculation accelerator based on FPGA
CN111210019B (en) Neural network inference method based on software and hardware cooperative acceleration
CN112734020B (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN110543936B (en) Multi-parallel acceleration method for CNN full-connection layer operation
CN114219699B (en) Matching cost processing method and circuit and cost aggregation processing method
CN110110852B (en) Method for transplanting deep learning network to FPAG platform
CN111582465A (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
CN108647780B (en) Reconfigurable pooling operation module structure facing neural network and implementation method thereof
WO2021243489A1 (en) Data processing method and apparatus for neural network
US20230128421A1 (en) Neural network accelerator
CN112508174B (en) Weight binary neural network-oriented pre-calculation column-by-column convolution calculation unit
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN110163793B (en) Convolution calculation acceleration method and device
CN115496190A (en) Efficient reconfigurable hardware accelerator for convolutional neural network training
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN116882467B (en) Edge-oriented multimode configurable neural network accelerator circuit structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant