CN113191493A - Convolutional neural network accelerator based on FPGA parallelism self-adaptation - Google Patents

Convolutional neural network accelerator based on FPGA parallelism self-adaptation

Info

Publication number
CN113191493A
Authority
CN
China
Prior art keywords
output
data
convolution
activation
different
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110461762.1A
Other languages
Chinese (zh)
Other versions
CN113191493B (en)
Inventor
袁海英
曾智勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110461762.1A priority Critical patent/CN113191493B/en
Publication of CN113191493A publication Critical patent/CN113191493A/en
Application granted granted Critical
Publication of CN113191493B publication Critical patent/CN113191493B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a convolutional neural network accelerator based on FPGA parallelism self-adaptation, comprising a read command generator, a data distributor, an operation cluster group, an addition tree group, an output cache group and an output arbiter. According to the structure of the convolutional layer, the accelerator configures its parallelism as either multiple or single output-activation parallelism. The data distributor analyzes the consistency of the data held in the on-chip caches and broadcasts repeated off-chip data to the corresponding caches simultaneously. The operation units in an operation cluster are responsible for the convolution operations of different input channels; according to the structure of the convolution operation, the operation clusters are configured to operate either on different input channels or on different output activations. The output cache group contains a data routing module: for operation clusters that compute the same output activation, the corresponding output caches are connected end to end; otherwise each cache outputs independently. The accelerator both perceives sparse activations, reducing the operation load, and configures the operation units flexibly, greatly improving the parallel operation efficiency of the FPGA.

Description

Convolutional neural network accelerator based on FPGA parallelism self-adaptation
Technical Field
The invention relates to the fields of digital integrated circuits, electronic information and deep learning, and in particular to a convolutional neural network accelerator based on FPGA parallelism self-adaptation.
Background
With the breakthroughs of deep learning in scientific research, daily life, production, national defense and the military, the convolutional neural network (CNN) model has achieved great success. The large difference in size between the convolutional layers of a CNN model is one of the factors limiting accelerator performance. Because the front layers of a CNN have very few input channels (layer 0 of the typical image-recognition network VGG-16 has only 3 input channels), an accelerator with high parallelism cannot reach its expected operation performance on these front layers. Sparsity-aware accelerators perform even worse in this respect because they require finer-grained control. If the operation of the matrix-type operation unit were more flexible, the operation parallelism of a sparsity-aware accelerator could be raised while its operation efficiency is preserved.
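To make the utilization problem concrete, the short sketch below (the array dimensions Tm and Tn are illustrative assumptions, not values from this disclosure) estimates how much of a fixed multiplier array stays busy when parallelism is tiled only over input channels:

```python
def input_channel_utilization(chin, tm, tn):
    """Rough fraction of a Tm x Tn multiplier array doing useful work when
    parallelism is tiled only over the input channels of one layer."""
    used = min(chin, tm * tn)      # multipliers that actually receive input channels
    return used / (tm * tn)

# Hypothetical array: Tm = 16 operation units x Tn = 8 MACs = 128 multipliers.
print(input_channel_utilization(chin=3, tm=16, tn=8))    # layer 0 of VGG-16: about 2.3%
print(input_channel_utilization(chin=256, tm=16, tn=8))  # a deep layer: 1.0 (fully used)
```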
Disclosure of Invention
To solve the problem that, in prior-art neural network accelerator schemes, the operation loop cannot be completely tiled onto the operation units because the expansion of the operation parallelism is inflexible, the invention provides a convolutional neural network accelerator based on FPGA parallelism self-adaptation. On the basis of sparse activation perception, the accelerator expands the dimensions of parallel tiling and can control the sizes of three degrees of parallelism online, so that the utilization rate of the operation units remains high even when the convolutional layer is small.
In order to achieve the technical purpose, the invention adopts the technical scheme that:
a convolutional neural network accelerator based on FPGA parallelism adaptation, comprising: the device comprises a read command generator, a data distributor, an operation cluster group, an addition tree group, an output cache group and an output arbiter.
The deep convolutional neural network accelerator based on FPGA parallelism self-adaptation operates as follows. The operating parameters of the accelerator are first written externally into the configuration registers, and these parameters are broadcast into each operation cluster. The read command generator generates read requests for activations and weights, which are transmitted over two independent data channels; an arbiter dynamically issues the read request commands to the external bus. Data in the off-chip memory are read into the data distributor through the external bus, distributed by the data distributor to the different operation clusters, and the operation clusters perform the convolution operations. According to the parallelism, the multi-output addition tree selects the addition isolation mode and the number of effective outputs; the number of effective outputs equals the output-activation parallelism. The data routing module sets different output-cache connection modes according to the parallelism; these connection modes ensure that each parallel output-activation operation is connected to one output cache. Finally, an output arbiter writes the different output caches back to the off-chip memory.
The output arbiter selects which output buffer data are output from, according to the request level of each output buffer: when an output buffer holds half a burst length of data it raises a low-level request, and when it holds a full burst length of data it raises a high-level request. When no high-level request is present, the arbiter outputs the data of a low-level request; when a high-level request is present, it outputs the data corresponding to that high-level request.
Optionally, the data distributor unicasts or broadcasts data according to the data consistency within the operation clusters. When different operation clusters operate on different output activations, the data distributor broadcasts the weights and the last (k - stride) × w_in activations to the different operation clusters simultaneously, where k is the convolution kernel size, stride is the stride, and w_in is the width of the input feature map. When different operation clusters operate on different input channels of the same output activation, the data distributor unicasts, transmitting the weight and activation data to the corresponding operation clusters in turn.
Optionally, the operation cluster group comprises Tp operation clusters composed of operation units and adders, and each operation unit contains an on-chip activation cache and weight cache, an address generator, a responder, a sparse activation perceptron, a non-0 cache and Tn multiply-accumulators. The multiply-accumulators compute the activations of Tn output channels, and the multiply-accumulate results of the different operation units within one operation cluster are added by an adder to form the output of that cluster. The operation cluster group contains Tm operation units, distributed evenly over the operation clusters, where Tm is an integer multiple of Tp. Tm, Tp and Tn are hardware configuration parameters; the configuration scheme requires Tm × Tn to be smaller than the number of DSPs in the FPGA, and Tp to be an integer power of 2 between a minimum of 2 and a maximum of Tm. The operation cluster group can operate in parallel the convolution operations of different output channels and different input channels (or different output activations) of the same convolutional layer. Let the number of input channels of the convolutional layer be Chin. When Chin ≥ Tm × Tn, each operation cluster computes the convolution of ceil(Chin/Tp) input channels, where ceil() is the ceiling function. When Chin < Tm × Tn, Tu is taken as the smallest integer satisfying Chin < 2^Tu × Tn, and each operation cluster performs the convolution corresponding to ceil(Hout/2^Tu) rows of output activations and ceil(Chin/Tn) input channels. Each operation cluster is responsible for the convolution of the same number of output channels. The activation cache of each operation unit in the cluster group has an output bit width of 16 × Tn and outputs Tn activations to the sparse activation perceptron. The sparse activation perceptron in the operation unit extracts the non-zero values among the Tn activations and their corresponding offset values; after these are held in the non-0 cache, the weight cache marks the exact non-0 activation position c + off from the current weight read position c and the offset value off output by the non-0 cache, performs weight addressing to fetch the weights of the Tn output channels, and then executes c = c + Tn.
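The following sketch restates the parallelism-selection rule of this paragraph in Python; the function name, the default values of Tm, Tp and Tn, and the example layer sizes are illustrative assumptions only.

```python
import math

def configure_cluster_group(chin, hout, Tm=16, Tp=4, Tn=8):
    """Apply the rule as stated: if Chin >= Tm*Tn, tile over input channels;
    otherwise pick Tu as the smallest integer with Chin < 2**Tu * Tn and
    tile over output-activation rows as well."""
    if chin >= Tm * Tn:
        return {"mode": "input-channel parallel",
                "channels_per_cluster": math.ceil(chin / Tp)}
    Tu = 0
    while chin >= (2 ** Tu) * Tn:
        Tu += 1
    return {"mode": "output-activation parallel",
            "Tu": Tu,
            "rows_per_cluster": math.ceil(hout / 2 ** Tu),
            "channels_per_cluster": math.ceil(chin / Tn)}

print(configure_cluster_group(chin=512, hout=14))  # deep layer: input-channel parallel
print(configure_cluster_group(chin=64, hout=56))   # shallow layer: rows split by 2**Tu
```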
Optionally, the addition tree group has a multi-node output function and is composed of Tn multi-segment addition trees. The input port of each adder stage is equipped with a first-in first-out memory that isolates the operations of the different adders. The input of each adder stage is connected to an operation cluster, and its input values are convolution partial sums from different input channels. The output value of each adder stage serves as an independent output, each corresponding to a different output activation; when Tu is not 0, these outputs are linked to different output-activation caches, otherwise only the output of the last stage is valid and the other stages are discarded.
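A behavioural sketch of the multi-node adder chain (plain Python with illustrative names; the FIFO isolation and the cluster grouping details are not modelled): each stage adds one cluster's partial sum, and either every stage output is forwarded (Tu not 0) or only the final sum is kept.

```python
def multi_node_adder_chain(cluster_partial_sums, Tu):
    """Fold the partial sums of the operation clusters stage by stage; return
    every stage output when Tu != 0, otherwise only the last (full) sum."""
    stage_outputs = []
    acc = 0
    for partial in cluster_partial_sums:
        acc += partial                  # one adder stage
        stage_outputs.append(acc)       # this stage's tap
    return stage_outputs if Tu != 0 else [stage_outputs[-1]]

print(multi_node_adder_chain([1, 2, 3, 4], Tu=0))  # [10]: single output activation
print(multi_node_adder_chain([1, 2, 3, 4], Tu=2))  # [1, 3, 6, 10]: one tap per stage
```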
Optionally, the storage depth of the output caches equals the burst length of the external bus, the output caches correspond one to one with the operation clusters, and their inputs are connected to the adders of the multi-segment addition trees. The output cache group contains a data routing module composed of Tm/Tp multiplexers. The inputs of each multiplexer are connected to the adders of the addition tree and to the output of the previous multiplexer, and its output is connected to the input of the next multiplexer; the data selection mode of the multiplexers is that the outputs and inputs of every 2^Tu multiplexers are connected end to end.
An FPGA-based single-operation multi-task data stream for the output activations of a convolution operation decomposes one convolution operation, along the row direction, into Tu independently running tasks; the data stream inside every task is identical, and the tasks flow in parallel through different operation clusters without interfering with each other. The data stream within a task is unrolled in the order output channel, input channel and convolution window of the convolution loop, and each task performs the parallel convolution of Tn output channels and Tm × Tu input channels.
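A reference sketch of this data stream in NumPy (stride 1, no padding; the splitting helper and tensor shapes are illustrative assumptions): the output rows are first split into Tu independent tasks, and inside each task the loops run over output channel, input channel and convolution window.

```python
import numpy as np

def convolution_as_row_tasks(ifmap, weights, Tu=2):
    """Reference convolution (stride 1, no padding) split along the output-row
    direction into Tu independent tasks, each unrolled in the order
    output channel -> input channel -> convolution window."""
    chin, h_in, w_in = ifmap.shape
    chout, _, k, _ = weights.shape
    h_out, w_out = h_in - k + 1, w_in - k + 1
    out = np.zeros((chout, h_out, w_out))
    row_tasks = np.array_split(np.arange(h_out), Tu)     # one block of output rows per task
    for rows in row_tasks:                                # tasks could run on different clusters
        for co in range(chout):                           # output-channel loop
            for ci in range(chin):                        # input-channel loop
                for y in rows:                            # convolution-window loops
                    for x in range(w_out):
                        out[co, y, x] += np.sum(ifmap[ci, y:y+k, x:x+k] * weights[co, ci])
    return out

ifmap = np.random.rand(3, 8, 8)        # 3 input channels, 8x8 feature map (illustrative)
weights = np.random.rand(4, 3, 3, 3)   # 4 output channels, 3x3 kernels
print(convolution_as_row_tasks(ifmap, weights, Tu=2).shape)   # (4, 6, 6)
```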
The technical scheme adopted by the invention has the advantages and beneficial effects that:
The scheme realizes a flexible parallelism-expansion architecture for a sparse-activation accelerator; the architecture expands the dimensions of parallel tiling, and the sizes of the three degrees of parallelism can be controlled online. When the number of input channels of a convolutional layer is insufficient, the accelerator uses the operation units originally assigned to input-channel expansion to expand the output-activation parallelism, which raises the utilization rate of the operation units and achieves a higher throughput. Effectively skipping operations on 0-valued activations reduces the operation load without affecting the operation result. The data distributor transmits repeated on-chip data by broadcast, which greatly reduces the reading of redundant off-chip data and lowers the accelerator's dependence on bandwidth.
Drawings
FIG. 1 is a schematic structural view of the present invention;
FIG. 2 is a diagram illustrating a structure of an operation cluster;
FIG. 3 is a schematic diagram of a data routing architecture;
Detailed Description
As shown in FIG. 1, the present embodiment relates to a deep convolutional neural network accelerator based on FPGA parallelism self-adaptation, which comprises: a read command generator, a data distributor, an operation cluster group, an addition tree group, an output cache group and an output arbiter.
The deep convolutional neural network accelerator based on FPGA parallelism self-adaptation operates as follows. The operating parameters of the accelerator are first written externally into the configuration registers, and these parameters are broadcast into each operation cluster. The read command generator generates read requests for activations and weights, which are transmitted over two independent data channels; an arbiter dynamically issues the read request commands to the external bus. Data in the off-chip memory are read into the data distributor through the external bus, distributed by the data distributor to the different operation clusters, and the operation clusters perform the convolution operations. According to the parallelism, the multi-output addition tree selects the addition isolation mode and the number of effective outputs, which equals the output-activation parallelism. To make full use of the output caches, the data routing module sets different output-cache connection modes according to the parallelism; these connection modes ensure that each parallel output-activation operation is connected to one output cache. Finally, an arbiter writes the different output caches back to the off-chip memory, and, so that the parallel multi-output-activation operations can be recombined into a single operation, the arbiter configures the output addresses according to the physical positions of the operation clusters on which the different output-activation operations run.
The read command generator sends read requests to the external bus to address the activation and weight data stored in the off-chip memory. A read request covers the activations and weights of Tn input channels as one unit, and the read order is: for the feature map, width, then height, then input-channel depth; for the weights, width, then height, then input-channel depth, then output-channel depth.
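The read order can be sketched as simple index generators (illustrative function names; only the ordering is meaningful): each request covers a block of Tn input channels, with width varying fastest.

```python
def activation_read_order(w_in, h_in, chin, Tn):
    """Yield (channel_block, y, x): width varies fastest, then height,
    then input-channel depth, in blocks of Tn channels per request."""
    for cb in range(0, chin, Tn):          # slowest: block of Tn input channels
        for y in range(h_in):              # then height
            for x in range(w_in):          # fastest: width
                yield (cb, y, x)

def weight_read_order(k, chin, chout, Tn):
    """Yield (out_channel, channel_block, ky, kx): width fastest, then height,
    then input-channel depth, then output-channel depth."""
    for co in range(chout):
        for cb in range(0, chin, Tn):
            for ky in range(k):
                for kx in range(k):
                    yield (co, cb, ky, kx)

print(list(activation_read_order(w_in=2, h_in=2, chin=4, Tn=2))[:4])
print(list(weight_read_order(k=2, chin=4, chout=2, Tn=2))[:4])
```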
The data streams in the operation clusters are identical and flow in parallel without interfering with each other; each data stream is unrolled in the order output channel, input channel and convolution window of the convolution loop.
The operation cluster group comprises Tp operation clusters composed of operation units and adders, and each operation unit contains an on-chip activation cache and weight cache, an address generator, a responder, a sparse activation perceptron, a non-0 cache and Tn multiply-accumulators. The multiply-accumulators compute the activations of Tn output channels, and the multiply-accumulate results of the different operation units within one operation cluster are added by an adder to form the output of that cluster. An operation cluster can be flexibly configured to operate on input channels or on output activations. The operation cluster group contains Tm operation units, distributed evenly over the operation clusters, where Tm is an integer multiple of Tp. Tm, Tp and Tn are hardware configuration parameters and can be configured offline according to the operation resources of the FPGA and the target network.
The structure of an operation cluster is shown in FIG. 2. The cluster group can operate in parallel the convolution operations of different output channels and different input channels, or of different output activations, of the same convolutional layer. When the number of input channels Chin of the convolutional layer satisfies Chin ≥ Tm × Tn, each operation cluster computes the convolution of ceil(Chin/Tp) input channels, where ceil() is the ceiling function. When Chin < Tm × Tn, Tu is taken as the smallest integer satisfying Chin < 2^Tu × Tn, and each operation cluster performs the convolution corresponding to ceil(Hout/2^Tu) rows of output activations and ceil(Chin/Tn) input channels. Each operation cluster operates on the convolution of the same number of output channels. The output bit width of the activation cache in each operation unit is 16 × Tn, and Tn activations are output to the sparse activation perceptron. The sparse activation perceptron extracts the non-zero values among the Tn activations and their corresponding offset values; after the non-0 cache, the weight cache marks the exact non-0 activation position c + off from the current weight read position c and the offset value off output by the non-0 cache, and performs weight addressing to fetch the weights of the Tn output channels. In the multiply-accumulators, the weights of the Tn output channels and one non-0 activation undergo parallel multiply-accumulate operations, after which c = c + Tn is executed.
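A behavioural sketch of this sparse inner loop in NumPy (function and variable names are illustrative; the weight-cache layout is an assumption): non-zero activations and their offsets are extracted, each surviving activation is multiplied against the weights of Tn output channels addressed at c + off, and c advances by Tn per group.

```python
import numpy as np

def sparse_mac_group(activations, weight_cache, acc, c, Tn):
    """Process one group of Tn activations from the activation buffer.
    weight_cache[pos] is assumed to hold the weights of the Tn output
    channels that correspond to the activation at read position pos."""
    offsets = np.flatnonzero(activations)          # perceptron: positions of non-0 activations
    for off in offsets:
        w = weight_cache[c + off]                  # weight addressing at position c + off
        acc += activations[off] * w                # Tn output channels updated in parallel
    return acc, c + Tn                             # afterwards c = c + Tn

Tn = 4
acc = np.zeros(Tn)
weight_cache = np.random.rand(8, Tn)               # 2 groups x Tn positions (illustrative)
c = 0
acc, c = sparse_mac_group(np.array([0.0, 1.5, 0.0, 2.0]), weight_cache, acc, c, Tn)
acc, c = sparse_mac_group(np.array([0.5, 0.0, 0.0, 0.0]), weight_cache, acc, c, Tn)
print(acc, c)    # accumulated Tn partial sums, c == 8
```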
The data distributor unicasts or broadcasts data according to the data consistency within the operation clusters. When the operation cluster group operates the convolution of different output channels and different output activations of the same convolutional layer, Tu is taken as the smallest integer satisfying Chin < 2^Tu × Tn, where Chin is the number of input channels. Every group of 2^Tu operation clusters receives the same weight data broadcast by the data distributor, while the remaining operation clusters receive different weight data unicast by the data distributor. Within every group of 2^Tu operation clusters, the last (k - stride) × w_in activation data are received as a broadcast from the data distributor and the rest are unicast, where w_in is the input activation width, k is the convolution kernel size, and stride is the stride.
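The sketch below (illustrative names and parameter values) groups the clusters that share a broadcast and computes the (k - stride) × w_in boundary activations broadcast inside each group:

```python
def distribution_plan(num_clusters, Tu, k, stride, w_in):
    """Group clusters that share the same weight broadcast, and count the
    boundary activations ((k - stride) * w_in values reused by neighbouring
    output rows) that are broadcast inside each group; the rest are unicast."""
    group_size = 2 ** Tu
    groups = [list(range(g, g + group_size))
              for g in range(0, num_clusters, group_size)]
    halo = (k - stride) * w_in        # activations broadcast inside each group
    return groups, halo

groups, halo = distribution_plan(num_clusters=8, Tu=1, k=3, stride=1, w_in=28)
print(groups)   # [[0, 1], [2, 3], [4, 5], [6, 7]]: clusters sharing one weight broadcast
print(halo)     # 56 boundary activations broadcast per group; the rest are unicast
```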
The addition trees of the addition tree group have a multi-node output function: the value of each adder stage serves as an independent output, and these output values are different output activations. When Tu is not 0, they correspond to different output-activation caches; otherwise only the output of the last stage is valid and the other stages are discarded. Taking the 8-operation-cluster structure shown in FIG. 3 as an example, each output passes through a data routing structure composed of several multiplexers, where AL_a_b denotes the output of the b-th adder of the a-th addition tree and OB_c denotes the data held in the c-th output cache. Each multiplexer is responsible for the data routing of one output cache and sends the addition tree results to the corresponding output cache according to the value of Tu. Every multiplexer except the first also has an input port from the previous output cache. In this way, when Tu is small, several output caches can be re-linked into one larger cache.
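A wiring sketch of this routing structure (the naming follows FIG. 3's AL/OB convention, but the grouping parameter below is a stand-in for the 2^Tu factor, not an exact reproduction of the selection logic):

```python
def output_buffer_wiring(num_buffers, group_size):
    """Return, for each output buffer OB_c, the source its multiplexer selects:
    an adder-tree tap for the first buffer of each group, or the previous
    buffer for the rest, so each group of group_size buffers chains end to
    end into one deeper buffer. group_size stands in for the 2**Tu factor."""
    wiring = {}
    for c in range(num_buffers):
        if c % group_size == 0:
            wiring[f"OB_{c}"] = f"AL_tap_{c}"      # fresh output-activation stream
        else:
            wiring[f"OB_{c}"] = f"OB_{c - 1}"      # forward the previous buffer
    return wiring

print(output_buffer_wiring(num_buffers=8, group_size=1))  # 8 independent buffers
print(output_buffer_wiring(num_buffers=8, group_size=4))  # two chains of 4 buffers each
```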
The storage depth of the output buffers equals the burst length of the external bus. Each output buffer presents 2 request levels to the output arbiter: when it holds half a burst length of data it raises a low-level request, and when it holds a full burst length of data it raises a high-level request. When no high-level request is present, the arbiter outputs the data of a low-level request; when a high-level request is present, it outputs the data corresponding to that high-level request.
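A behavioural sketch of the two-level arbitration (illustrative names; ties are broken by buffer index here, which the disclosure does not specify):

```python
def select_buffer_to_drain(buffer_fill_levels, burst_length):
    """Pick which output buffer to write back next.
    High-level requests (a full burst of data) win over low-level requests
    (half a burst); returns a buffer index or None if nothing is ready."""
    high = [i for i, fill in enumerate(buffer_fill_levels) if fill >= burst_length]
    low = [i for i, fill in enumerate(buffer_fill_levels) if fill >= burst_length // 2]
    if high:
        return high[0]          # serve a high-level request first
    if low:
        return low[0]           # otherwise serve any low-level request
    return None

print(select_buffer_to_drain([3, 10, 16, 7], burst_length=16))  # -> 2 (full burst ready)
print(select_buffer_to_drain([3, 10, 5, 7], burst_length=16))   # -> 1 (half burst ready)
```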

Claims (6)

1. A convolutional neural network accelerator based on FPGA parallelism self-adaptation, characterized by comprising: a read command generator, a data distributor, an operation cluster group, an addition tree group, an output cache group and an output arbiter;
the deep convolutional neural network accelerator based on FPGA parallelism self-adaptation operates as follows: the operating parameters of the accelerator are first written externally into the configuration registers, and these parameters are broadcast into each operation cluster; the read command generator generates read requests for activations and weights, which are transmitted over two independent data channels; an arbiter dynamically issues the read request commands to the external bus; data in the off-chip memory are read into the data distributor through the external bus, distributed by the data distributor to the different operation clusters, and the operation clusters perform the convolution operations; according to the parallelism, the multi-output addition tree selects the addition isolation mode and the number of effective outputs, which equals the output-activation parallelism; the data routing module sets different output-cache connection modes according to the parallelism, and these connection modes ensure that each parallel output-activation operation is connected to one output cache; finally, an output arbiter writes the different output caches back to the off-chip memory;
the output arbiter selects which output buffer data are output from, according to the request level of each output buffer: when an output buffer holds half a burst length of data a low-level request is triggered, and when it holds a full burst length of data a high-level request is triggered; when no high-level request is present, the arbiter outputs the data of a low-level request; when a high-level request is present, it outputs the data corresponding to that high-level request.
2. The convolutional neural network accelerator based on FPGA parallelism self-adaptation according to claim 1, wherein the data distributor unicasts or broadcasts data according to the data consistency within the operation clusters; when different operation clusters operate on different output activations, the data distributor broadcasts the weights and the last (k - stride) × w_in activations to the different operation clusters simultaneously, where k is the convolution kernel size, stride is the stride, and w_in is the width of the input feature map; when different operation clusters operate on different input channels of the same output activation, the data distributor unicasts, transmitting the weight and activation data to the corresponding operation clusters in turn.
3. The convolutional neural network accelerator based on FPGA parallelism self-adaptation according to claim 1, wherein the operation cluster group comprises Tp operation clusters composed of operation units and adders, and each operation unit contains an on-chip activation cache and weight cache, an address generator, a responder, a sparse activation perceptron, a non-0 cache and Tn multiply-accumulators; the multiply-accumulators compute the activations of Tn output channels, and the multiply-accumulate results of the different operation units within one operation cluster are added by an adder to form the output of that cluster; the operation cluster group contains Tm operation units, distributed evenly over the operation clusters, where Tm is an integer multiple of Tp; Tm, Tp and Tn are hardware configuration parameters, and the configuration scheme requires Tm × Tn to be smaller than the number of DSPs in the FPGA and Tp to be an integer power of 2 between a minimum of 2 and a maximum of Tm; the operation cluster group can operate in parallel the convolution operations of different output channels and different input channels (or different output activations) of the same convolutional layer; letting the number of input channels of the convolutional layer be Chin, when Chin ≥ Tm × Tn, each operation cluster computes the convolution of ceil(Chin/Tp) input channels, where ceil() is the ceiling function; when Chin < Tm × Tn, Tu is taken as the smallest integer satisfying Chin < 2^Tu × Tn, and each operation cluster performs the convolution corresponding to ceil(Hout/2^Tu) rows of output activations and ceil(Chin/Tn) input channels; each operation cluster is responsible for the convolution of the same number of output channels; the activation cache of each operation unit in the cluster group has an output bit width of 16 × Tn and outputs Tn activations to the sparse activation perceptron; the sparse activation perceptron in the operation unit extracts the non-zero values among the Tn activations and their corresponding offset values; after these are held in the non-0 cache, the weight cache marks the exact non-0 activation position c + off from the current weight read position c and the offset value off output by the non-0 cache, performs weight addressing to fetch the weights of the Tn output channels, and then executes c = c + Tn.
4. The convolutional neural network accelerator based on FPGA parallelism self-adaptation according to claim 1, wherein the addition tree group has a multi-node output function and is composed of Tn multi-segment addition trees; the input port of each adder stage is equipped with a first-in first-out memory that isolates the operations of the different adders; the input of each adder stage is connected to an operation cluster, and its input values are convolution partial sums from different input channels; the output value of each adder stage serves as an independent output, each corresponding to a different output activation; when Tu is not 0, these outputs are linked to different output-activation caches, otherwise only the output of the last stage is valid and the other stages are discarded.
5. The convolutional neural network accelerator based on FPGA parallelism self-adaptation according to claim 1, wherein the storage depth of the output caches equals the burst length of the external bus and their inputs are connected to the adders of the multi-segment addition trees; the output cache group contains a data routing module composed of Tm/Tp multiplexers, whose inputs are connected to the adders of the addition tree and to the output of the previous multiplexer and whose output is connected to the input of the next multiplexer; the data selection mode of the multiplexers is that the outputs and inputs of every 2^Tu multiplexers are connected end to end.
6. An FPGA-based single-operation multi-task data stream, characterized in that, for the output activations of a convolution operation, one convolution operation is decomposed along the row direction into Tu independently running tasks; the data stream inside every task is identical, and the tasks flow in parallel through different operation clusters without interfering with each other; the data stream within a task is unrolled in the order output channel, input channel and convolution window of the convolution loop, and each task performs the parallel convolution of Tn output channels and Tm × Tu input channels.
CN202110461762.1A 2021-04-27 2021-04-27 Convolutional neural network accelerator based on FPGA parallelism self-adaption Active CN113191493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110461762.1A CN113191493B (en) 2021-04-27 2021-04-27 Convolutional neural network accelerator based on FPGA parallelism self-adaption

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110461762.1A CN113191493B (en) 2021-04-27 2021-04-27 Convolutional neural network accelerator based on FPGA parallelism self-adaption

Publications (2)

Publication Number Publication Date
CN113191493A true CN113191493A (en) 2021-07-30
CN113191493B CN113191493B (en) 2024-05-28

Family

ID=76979620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110461762.1A Active CN113191493B (en) 2021-04-27 2021-04-27 Convolutional neural network accelerator based on FPGA parallelism self-adaption

Country Status (1)

Country Link
CN (1) CN113191493B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819129A (en) * 2022-05-10 2022-07-29 福州大学 Convolution neural network hardware acceleration method of parallel computing unit

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN109948774A (en) * 2019-01-25 2019-06-28 中山大学 Neural network accelerator and its implementation based on network layer binding operation
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN112052941A (en) * 2020-09-10 2020-12-08 南京大学 Efficient storage and calculation system applied to CNN network convolution layer and operation method thereof
CN112418396A (en) * 2020-11-20 2021-02-26 北京工业大学 Sparse activation perception type neural network accelerator based on FPGA

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
余成宇; 李志远; 毛文宇; 鲁华祥: "Design and implementation of an efficient sparse convolutional neural network accelerator" (一种高效的稀疏卷积神经网络加速器的设计与实现), CAAI Transactions on Intelligent Systems (智能系统学报), no. 02, 31 December 2020 (2020-12-31) *
李永博; 王琴; 蒋剑飞: "Design of a sparse convolutional neural network accelerator" (稀疏卷积神经网络加速器设计), Microelectronics & Computer (微电子学与计算机), no. 06, 5 June 2020 (2020-06-05) *


Also Published As

Publication number Publication date
CN113191493B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
CN108171317B (en) Data multiplexing convolution neural network accelerator based on SOC
CN110516801B (en) High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN111445012B (en) FPGA-based packet convolution hardware accelerator and method thereof
CN107689948B (en) Efficient data access management device applied to neural network hardware acceleration system
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
CN112149811A (en) Scheduling perception tensor distribution module
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN108647779B (en) Reconfigurable computing unit of low-bit-width convolutional neural network
CN113807509B (en) Neural network acceleration device, method and communication equipment
CN113128675B (en) Multiplication-free convolution scheduler based on impulse neural network and hardware implementation method thereof
WO2021089009A1 (en) Data stream reconstruction method and reconstructable data stream processor
CN112418396A (en) Sparse activation perception type neural network accelerator based on FPGA
CN113222133B (en) FPGA-based compressed LSTM accelerator and acceleration method
CN113033794B (en) Light weight neural network hardware accelerator based on deep separable convolution
CN110705702A (en) Dynamic extensible convolutional neural network accelerator
CN110580519A (en) Convolution operation structure and method thereof
CN113191493B (en) Convolutional neural network accelerator based on FPGA parallelism self-adaption
CN114611686A (en) Synapse delay implementation system and method based on programmable neural mimicry core
CN113792868B (en) Neural network computing module, method and communication equipment
CN113313244B (en) Near-storage neural network accelerator for addition network and acceleration method thereof
US7827023B2 (en) Method and apparatus for increasing the efficiency of an emulation engine
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN114004351A (en) Convolution neural network hardware acceleration platform
CN110046695B (en) Configurable high-parallelism pulse neuron array
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant