CN106951961B - Coarse-grained reconfigurable convolutional neural network accelerator and system - Google Patents

Coarse-grained reconfigurable convolutional neural network accelerator and system

Info

Publication number
CN106951961B
CN106951961B (application CN201710104029.8A)
Authority
CN
China
Prior art keywords
unit
coarse-grained
adder unit
convolutional neural
neural networks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710104029.8A
Other languages
Chinese (zh)
Other versions
CN106951961A (en)
Inventor
袁哲
刘勇攀
杨华中
岳金山
李金阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201710104029.8A (2017-02-24)
Publication of CN106951961A (2017-07-14)
Application granted
Publication of CN106951961B (2019-11-26)
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/60 Memory management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2200/00 Indexing scheme for image data processing or generation, in general
    • G06T 2200/28 Indexing scheme for image data processing or generation, in general involving image processing hardware

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides a coarse-grained reconfigurable convolutional neural network accelerator and system. The accelerator comprises multiple processing unit clusters; each processing unit cluster contains several basic computing units connected through a sub-adder unit, and the sub-adder units of the multiple clusters are each connected to a mother adder unit. Each sub-adder unit generates the partial sum of its adjacent basic computing units, and the mother adder unit accumulates the outputs of the sub-adder units. Through coarse-grained reconfiguration, the invention links different weight and image tracks via SRAM or other interconnect units to realize different convolution kernel processing structures, efficiently supporting networks and convolution kernels of different sizes while greatly reducing reconfiguration overhead.

Description

Coarse-grained reconfigurable convolutional neural network accelerator and system
Technical field
The present invention relates to the field of energy-efficient hardware accelerator design, and more particularly to a coarse-grained reconfigurable convolutional neural network accelerator and system.
Background art
A convolutional neural network (Convolutional Neural Network, CNN) is a feedforward neural network whose artificial neurons respond to surrounding units within a local receptive field; CNNs perform outstandingly in large-scale image processing. They have become the most common algorithms in fields such as image recognition and speech recognition, but these methods require very large amounts of computation and therefore need dedicated accelerators. They also have good application prospects on mobile devices. However, because mobile devices are resource-constrained, the accelerators currently designed on GPU and FPGA (Field Programmable Gate Array) platforms are difficult to use on such low-power, resource-constrained platforms.
Since convolutional neural networks come in network structures and convolution kernels of many sizes, a dedicated convolutional network accelerator should support these different networks and kernels efficiently. To support this diversity, traditional accelerators can generally be divided into two classes. The first class is instruction-based accelerators, which decompose the computations of different convolution kernels into individual instructions and fetch the correct weight and image data at each moment; this approach requires a large amount of on-chip bandwidth and on-chip storage. It is fairly efficient when the network's weight data fits on chip, but when the weight data cannot be stored entirely on chip, energy efficiency declines severely. The second class supports networks and kernels of different sizes by means of fine-grained reconfigurable circuits, for example by reconfiguring a network-on-chip: each processing element is assigned an address, and every datum is sent to the corresponding address. Although this is more efficient than instruction-based accelerators when handling different convolutional neural networks, fine-grained reconfigurable circuits bring considerable extra energy and reconfiguration overhead.
In the field of large-scale computing, reconfigurable systems are a research hotspot of current computer architecture: they combine the flexibility of general-purpose processors with the efficiency of ASICs (Application Specific Integrated Circuits) and are a promising approach to large-scale computing. Traditional DSP (Digital Signal Processing) solutions suffer from low arithmetic speed, non-reconfigurable hardware structures, long development and upgrade cycles, and poor portability, and these disadvantages become even more obvious in large-scale computing. ASICs have clear advantages in performance, area, and power consumption, but the complexity of changeable and rapidly growing application demands makes ASIC design and verification difficult and development cycles long, so ASICs struggle to meet the requirement of rapid product deployment. Among programmable logic devices, although the Virtex-6 FPGA series from Xilinx achieves more than 1000 GMACS (1×10^12 multiply-accumulate operations per second) using DSP48E1 slices running at 600 MHz, when facing large-scale computing the circuit scale to be configured is too large, synthesis and configuration take too long, and the actual operating frequency is not high, making it difficult to maintain high performance while pursuing flexibility and low power consumption.
Therefore, a dedicated low-power, energy-efficient accelerator architecture is urgently needed to meet the requirements of low-power mobile devices.
Summary of the invention
The present invention provides a coarse-grained reconfigurable convolutional neural network accelerator and system that overcome, or at least partially solve, the above problems. Through coarse-grained reconfiguration, different weight and image tracks are linked by SRAM (Static Random Access Memory) or other interconnect units to realize different convolution kernel processing structures, so that networks and convolution kernels of different sizes can be supported efficiently while reconfiguration overhead is greatly reduced.
According to one aspect of the present invention, a coarse-grained reconfigurable convolutional neural network accelerator is provided, comprising multiple processing unit clusters. Each processing unit cluster contains several basic computing units connected through a sub-adder unit, and the sub-adder units of the multiple processing unit clusters are each connected to a mother adder unit. Each sub-adder unit is used to generate the partial sum of its adjacent basic computing units, and the mother adder unit is used to accumulate the outputs of the sub-adder units.
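Although the embodiment is a hardware design, the two-level accumulation hierarchy can be illustrated with a short software model. The following Python sketch is a minimal illustration, not part of the patent; the class names (SubAdder, MotherAdder) and the assumption that each basic computing unit has already produced one partial sum per cycle are ours:

```python
from dataclasses import dataclass

@dataclass
class SubAdder:
    """Sums the partial results of the basic computing units in one cluster."""
    unit_outputs: list                  # one partial sum per basic computing unit

    def partial_sum(self):
        return sum(self.unit_outputs)

@dataclass
class MotherAdder:
    """Accumulates the outputs of all sub-adder units."""
    sub_adders: list

    def total(self):
        return sum(s.partial_sum() for s in self.sub_adders)

# Four clusters of four basic computing units each, as in the preferred embodiment.
clusters = [SubAdder([1, 2, 3, 4]) for _ in range(4)]
print(MotherAdder(clusters).total())    # -> 40
```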
Preferably, each basic computing unit comprises a 3×3 convolution unit.
Preferably, there are 4 processing unit clusters arranged in a 2×2 matrix, and each processing unit cluster comprises 4 basic computing units, also arranged in a 2×2 matrix.
Preferably, each basic computing unit comprises 9 multipliers arranged in a 3×3 grid and 1 adder, and the input registers of the 3 multipliers in the same column form a shift register.
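As a behavioral illustration of this preferred structure, the sketch below models one basic computing unit: a 3×3 grid of multipliers reduced by a single adder, with each column of image registers acting as a 3-deep shift register so that image data moves downward one row per step and the shifted-out row can feed the next unit. All names and the data layout are our assumptions for illustration:

```python
class BasicUnit3x3:
    def __init__(self, weights):
        assert len(weights) == 3 and all(len(r) == 3 for r in weights)
        self.weights = weights
        self.cols = [[0, 0, 0] for _ in range(3)]   # one shift register per column

    def shift_in(self, new_row):
        """Shift every column down one place and load one new pixel per column.
        Returns the row shifted out, which can feed the unit below."""
        out_row = [col[2] for col in self.cols]
        for c, px in enumerate(new_row):
            self.cols[c] = [px, self.cols[c][0], self.cols[c][1]]
        return out_row

    def mac(self):
        """The 9 multiplies, reduced by the unit's single adder."""
        return sum(self.weights[r][c] * self.cols[c][r]
                   for r in range(3) for c in range(3))

u = BasicUnit3x3([[1, 0, -1]] * 3)                  # e.g. a vertical-edge kernel
for row in ([1, 2, 3], [4, 5, 6], [7, 8, 9]):
    u.shift_in(row)
print(u.mac())                                      # -> -6
```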
Preferably, adjacent basic computing units in each row of a processing unit cluster matrix are connected to the weight track through weight interconnect units, and adjacent basic computing units in each column are connected to the image track through image interconnect units;
The weight interconnect unit connects each basic computing unit to the weight track and, under SRAM control, selects weight data from the weight track for each basic computing unit;
The image interconnect unit connects the basic computing units to the image track and, under SRAM control, selects 3 consecutive data values from the union of the image track and the preceding basic computing unit's output.
Preferably, the multipliers and adders in each processing unit cluster are switched off when not in use, and the sub-adder units and the mother adder unit are powered off when not in use.
According to another aspect of the present invention, a coarse-grained reconfigurable convolutional neural network acceleration system is provided, comprising several of the above convolutional neural network accelerators in parallel.
The present application proposes a coarse-grained reconfigurable convolutional neural network accelerator and system. Through coarse-grained reconfiguration, different weight and image tracks are linked by SRAM or other interconnect units to realize different convolution kernel processing structures, efficiently supporting networks and convolution kernels of different sizes while greatly reducing reconfiguration overhead. This coarse-grained reconfigurable accelerator hardware architecture supports different networks with little reconfiguration overhead, and provides computing units that efficiently support the coarse-grained reconfigurable structure, an interconnect architecture that supports coarse-grained reconfiguration, and a mechanism for reconstructing large convolution kernels from small ones. Compared with a traditional reconfigurable FPGA, reconfiguration speed improves by a factor of 10^5 and energy efficiency reaches 18.8 times; compared with a traditional fine-grained reconfigurable ASIC, reconfiguration time is reduced by 81.0% and average energy efficiency improves by 80.0%.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the coarse-grained reconfigurable convolutional neural network accelerator according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the operating modes in which coarse-grained configuration supports convolution kernels of different sizes according to an embodiment of the present invention;
Fig. 3 is an equivalent circuit schematic of the accelerator architecture configured in 5×5 mode according to an embodiment of the present invention;
Fig. 4 is a schematic comparison of the efficiency of a fine-grained reconfigurable ASIC accelerator, a traditional reconfigurable FPGA, and the coarse-grained reconfigurable convolutional neural network accelerator of the present invention.
Specific embodiment
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following embodiments are intended to illustrate the present invention, not to limit its scope.
Fig. 1 shows a coarse-grained reconfigurable convolutional neural network accelerator comprising multiple processing unit clusters. Each processing unit cluster contains several basic computing units connected through a sub-adder unit (ADDB1-ADDB4 in Fig. 1), and the sub-adder units of the multiple clusters are each connected to a mother adder unit (ADDB0 in Fig. 1); the sub-adder units and the mother adder unit have identical structures. Each sub-adder unit generates the partial sum of its adjacent basic computing units, and the mother adder unit accumulates the outputs of the sub-adder units.
Granularity refers to the bit width of the operation data of the system's reconfigurable components (or reconfigurable processing units); computing units are divided into fine-grained, coarse-grained, and mixed-grained. In this embodiment, each basic computing unit contains a 3×3 convolution unit, 3×3 being the most common convolution kernel in neural networks. Because fine-grained reconfigurability brings large chip-area and power overheads, the present invention proposes an accelerator architecture that is specially optimized for 3×3 convolution kernels and supports other kernel types through coarse-grained reconfiguration. Having been specially optimized for 3×3, the accelerator processes 3×3 kernels efficiently, and since 3×3 kernels account for a large proportion of common neural networks, this markedly improves efficiency; coarse-grained reconfiguration then combines these 3×3 convolution units into larger kernels. Supporting other kernels through coarse-grained reconfiguration thus greatly reduces reconfiguration overhead without losing much performance.
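The kernel-composition idea can be checked numerically. The following sketch (our illustration, using NumPy; not code from the patent) zero-pads a 5×5 kernel to 6×6, splits it into four 3×3 tiles so that each tile maps onto one basic computing unit, and lets a sub-adder sum the four partial products; the result equals a direct 5×5 convolution:

```python
import numpy as np

def pad_and_tile(kernel5x5):
    """Zero-pad a 5x5 kernel to 6x6 and cut it into four 3x3 tiles."""
    k = np.zeros((6, 6))
    k[:5, :5] = kernel5x5
    return [k[r:r + 3, c:c + 3] for r in (0, 3) for c in (0, 3)]

def conv5x5_via_tiles(patch6x6, tiles):
    """Each basic unit multiplies its 3x3 weight tile with its 3x3 image tile;
    the sub-adder accumulates the four partial sums."""
    corners = [(0, 0), (0, 3), (3, 0), (3, 3)]
    parts = [np.sum(patch6x6[r:r + 3, c:c + 3] * t)
             for (r, c), t in zip(corners, tiles)]
    return sum(parts)                                # sub-adder output

rng = np.random.default_rng(0)
kernel = rng.integers(-2, 3, (5, 5))
patch = rng.integers(0, 10, (6, 6))
direct = np.sum(patch[:5, :5] * kernel)              # reference 5x5 convolution
assert conv5x5_via_tiles(patch, pad_and_tile(kernel)) == direct
```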
In this embodiment there are 4 processing unit clusters arranged in a 2×2 matrix, and each processing unit cluster contains 4 basic computing units. As shown in Fig. 1, NE11, NE12, NE21, NE22 and sub-adder unit ADDB1 form the first processing unit cluster; NE13, NE14, NE23, NE24 and sub-adder unit ADDB2 form the second; NE31, NE32, NE41, NE42 and sub-adder unit ADDB3 form the third; and NE33, NE34, NE43, NE44 and sub-adder unit ADDB4 form the fourth. The 4 basic computing units in each cluster are arranged in a 2×2 matrix. As shown in Fig. 1(e), the sub-adder unit has four inputs (inputs 0, 1, 2 and 3 in Fig. 1) and a buffer; the four inputs are connected to the first, second, third and fourth processing unit clusters respectively, and the buffer serves as the sub-adder unit's output (the adder output in the figure).
Preferably, each basic computing unit contains 9 multipliers (MUL) arranged in a 3×3 grid and 1 adder (ADD); the 9 multipliers and the adder can be switched off when not in use to save power. The input registers of the three multipliers in the same column form a shift register, so image data can move downward through the column. In addition, each basic computing unit has an output port through which image data can be moved out of the unit.
As shown in Fig. 1(d), adjacent basic computing units in each row of a processing unit cluster matrix are connected to the weight track through weight interconnect units (FC), and adjacent basic computing units in each column are connected to the image track through image interconnect units (IC);
The weight interconnect unit FC connects each basic computing unit to the weight track and, under the control of the SRAM (Static Random Access Memory), selects weight data from the weight track for each basic computing unit;
The image interconnect unit connects the basic computing units to the image track. Since each basic computing unit has three columns, the image interconnect unit, under SRAM control, selects three consecutive data values from the union of the image track and the preceding basic computing unit's output. When the chip needs to be reconfigured, it is only necessary to load data into the configuration SRAM to complete the reconfiguration.
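The selection performed by the image interconnect unit can be pictured as an SRAM-controlled multiplexer. The toy model below is our assumption of the behavior (the parameter cfg_offset stands in for the contents of the configuration SRAM):

```python
def image_interconnect(image_track, prev_unit_out, cfg_offset):
    """Select 3 consecutive values from the union of the image track and the
    previous unit's output; only cfg_offset changes on reconfiguration."""
    candidates = list(image_track) + list(prev_unit_out)
    return candidates[cfg_offset:cfg_offset + 3]

# e.g. route the last image-track pixel plus two pixels from the unit above
print(image_interconnect([10, 11, 12], [20, 21, 22], cfg_offset=2))
# -> [12, 20, 21]
```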
Preferably, the multipliers and adders in each processing unit cluster are switched off when not in use, and the sub-adder units and the mother adder unit are powered off when not in use, to save power.
Fig. 2 shows the operating modes in which coarse-grained configuration supports convolution kernels of different sizes. The present invention supports kernel sizes from 1×1 to 12×12, configurable as 16 kernels of size 1×1 to 3×3, 4 kernels of size 4×4 to 6×6, or 1 kernel of size 7×7 to 12×12. For example, a 5×5 kernel is formed by 4 basic computing units and one sub-adder unit, in which part of the multipliers in three of the four basic computing units can be powered down, preserving the 5×5 kernel size while saving power.
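The mode selection described above can be summarized in a few lines. This sketch is our reading of Fig. 2 (the unit counts follow the 16-unit array of the embodiment; the power-gating count is derived, not quoted from the patent):

```python
def configure(k: int):
    """Return (parallel kernels, active multipliers, total multipliers)."""
    if 1 <= k <= 3:
        kernels, units_per_kernel = 16, 1    # one basic unit per kernel
    elif 4 <= k <= 6:
        kernels, units_per_kernel = 4, 4     # one 4-unit cluster per kernel
    elif 7 <= k <= 12:
        kernels, units_per_kernel = 1, 16    # all 16 units for one kernel
    else:
        raise ValueError("supported kernel sizes are 1x1 to 12x12")
    active = kernels * k * k                 # multipliers doing useful work
    total = kernels * units_per_kernel * 9   # multipliers allocated
    return kernels, active, total

for k in (3, 5, 12):
    kernels, active, total = configure(k)
    print(f"{k}x{k}: {kernels} kernel(s), {active}/{total} multipliers active")
# in 5x5 mode, 100 of 144 multipliers are active; the rest can be power-gated
```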
Fig. 3 shows the equivalent circuit of the accelerator architecture configured in 5×5 mode. Through coarse-grained reconfiguration, taking a 5×5 kernel as an example, the accelerator forms an efficient processing structure that exploits two data-reuse modes, greatly reducing data movement and improving computational efficiency. The first is reuse within a convolution kernel: for a 5×5 kernel built by coarse-grained reconfiguration, 4 pixels can be reused between adjacent kernel positions and need not be reloaded. Meanwhile, through coarse-grained reconfiguration, each image datum can be shared by N convolution kernels until all N kernels have been processed; this inter-kernel reuse reduces the movement of image data. After all N kernels have been processed, the whole image shifts down one row and the above process repeats, which simultaneously realizes intra-kernel data reuse in the other direction.
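A rough count shows why these two reuse modes matter. The sketch below uses assumed numbers (image width 64, N = 8 resident kernels) purely for illustration: with stride 1, only one new column of k pixels is fetched per window slide, and each fetched pixel serves all N kernels:

```python
def pixel_fetches(width, k=5, n_kernels=8, reuse=True):
    windows = width - k + 1
    if not reuse:
        # every window of every kernel reloads all k*k pixels
        return windows * n_kernels * k * k
    # intra-kernel reuse: one new column (k pixels) per slide;
    # inter-kernel reuse: that column is shared by all N kernels.
    return k * k + (windows - 1) * k

print(pixel_fetches(64, reuse=False))   # 12000 pixel loads
print(pixel_fetches(64, reuse=True))    #   320 pixel loads, ~37x fewer
```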
Fig. 4 compares the efficiency of a fine-grained reconfigurable ASIC accelerator, a traditional reconfigurable FPGA, and the coarse-grained reconfigurable convolutional neural network accelerator of the present invention when applied to the AlexNet deep convolutional network, the Clarifai network model, the Overfeat algorithm, and the VGG16 deep convolutional neural network. As the figure shows, compared with a traditional reconfigurable FPGA, the present invention improves reconfiguration speed by a factor of 10^5 and reaches 18.8 times the energy efficiency; compared with a traditional fine-grained reconfigurable ASIC, it reduces reconfiguration time by 81.0% and improves average energy efficiency by 80.0%.
This embodiment also provides a coarse-grained reconfigurable convolutional neural network acceleration system comprising several of the above convolutional neural network accelerators in parallel. Since there is no data exchange between different accelerators, the speedup brought by this parallel architecture is linear.
In summary, the coarse-grained reconfigurable convolutional neural network accelerator and system proposed by the present application support different networks with little reconfiguration overhead, providing computing units that efficiently support the coarse-grained reconfigurable structure, an interconnect architecture for coarse-grained reconfiguration, and a mechanism for building large convolution kernels from small ones. Compared with a traditional reconfigurable FPGA, reconfiguration speed improves by a factor of 10^5 and energy efficiency reaches 18.8 times; compared with a traditional fine-grained reconfigurable ASIC, reconfiguration time is reduced by 81.0% and average energy efficiency improves by 80.0%.
Finally, the above embodiments are merely preferred implementations of the present application and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (6)

1. A coarse-grained reconfigurable convolutional neural network accelerator, characterized in that it comprises multiple processing unit clusters, each processing unit cluster comprising several basic computing units connected through a sub-adder unit, the sub-adder units of the multiple processing unit clusters being each connected to a mother adder unit; each sub-adder unit is used to generate the partial sum of its adjacent basic computing units, and the mother adder unit is used to accumulate the outputs of the sub-adder units; there are 4 processing unit clusters arranged in a 2×2 matrix, and each processing unit cluster comprises 4 basic computing units arranged in a 2×2 matrix.
2. The coarse-grained reconfigurable convolutional neural network accelerator according to claim 1, characterized in that each basic computing unit comprises a 3×3 convolution unit.
3. The coarse-grained reconfigurable convolutional neural network accelerator according to claim 2, characterized in that each basic computing unit comprises 9 multipliers arranged in a 3×3 grid and 1 adder, the input registers of the 3 multipliers in the same column forming a shift register.
4. The coarse-grained reconfigurable convolutional neural network accelerator according to claim 1, characterized in that adjacent basic computing units in each row of a processing unit cluster matrix are connected to the weight track through weight interconnect units, and adjacent basic computing units in each column are connected to the image track through image interconnect units;
the weight interconnect unit connects each basic computing unit to the weight track and, under SRAM control, selects weight data from the weight track for each basic computing unit;
the image interconnect unit connects the basic computing units to the image track and, under SRAM control, selects 3 consecutive data values from the union of the image track and the preceding basic computing unit's output.
5. The coarse-grained reconfigurable convolutional neural network accelerator according to claim 3, characterized in that the multipliers and adders in each processing unit cluster are switched off when not in use, and the sub-adder units and the mother adder unit are powered off when not in use.
6. A coarse-grained reconfigurable convolutional neural network acceleration system, characterized in that it comprises several parallel convolutional neural network accelerators according to any one of claims 1 to 5.
CN201710104029.8A 2017-02-24 2017-02-24 Coarse-grained reconfigurable convolutional neural network accelerator and system Active CN106951961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710104029.8A CN106951961B (en) 2017-02-24 2017-02-24 Coarse-grained reconfigurable convolutional neural network accelerator and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710104029.8A CN106951961B (en) 2017-02-24 2017-02-24 Coarse-grained reconfigurable convolutional neural network accelerator and system

Publications (2)

Publication Number Publication Date
CN106951961A CN106951961A (en) 2017-07-14
CN106951961B (en) 2019-11-26

Family

ID=59466600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710104029.8A Active CN106951961B (en) 2017-02-24 2017-02-24 Coarse-grained reconfigurable convolutional neural network accelerator and system

Country Status (1)

Country Link
CN (1) CN106951961B (en)

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180189641A1 (en) 2017-01-04 2018-07-05 Stmicroelectronics S.R.L. Hardware accelerator engine
CN108269224B (en) 2017-01-04 2022-04-01 意法半导体股份有限公司 Reconfigurable interconnect
CN109284827A (en) * 2017-07-19 2019-01-29 阿里巴巴集团控股有限公司 Neural computing method, equipment, processor and computer readable storage medium
CN109284822B (en) * 2017-07-20 2021-09-21 上海寒武纪信息科技有限公司 Neural network operation device and method
CN107491416B (en) * 2017-08-31 2020-10-23 中国人民解放军信息工程大学 Reconfigurable computing structure suitable for convolution requirement of any dimension and computing scheduling method and device
US11609623B2 (en) * 2017-09-01 2023-03-21 Qualcomm Incorporated Ultra-low power neuromorphic artificial intelligence computing accelerator
CN108958801B (en) * 2017-10-30 2021-06-25 上海寒武纪信息科技有限公司 Neural network processor and method for executing vector maximum value instruction by using same
CN108256628B (en) * 2018-01-15 2020-05-22 合肥工业大学 Convolutional neural network hardware accelerator based on multicast network-on-chip and working method thereof
US11468302B2 (en) 2018-03-13 2022-10-11 Recogni Inc. Efficient convolutional engine
CN108510066B (en) * 2018-04-08 2020-05-12 湃方科技(天津)有限责任公司 Processor applied to convolutional neural network
CN108805266B (en) * 2018-05-21 2021-10-26 南京大学 Reconfigurable CNN high-concurrency convolution accelerator
CN110826707B (en) * 2018-08-10 2023-10-31 北京百度网讯科技有限公司 Acceleration method and hardware accelerator applied to convolutional neural network
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device
CN109919826B (en) * 2019-02-02 2023-02-17 西安邮电大学 Graph data compression method for graph computation accelerator and graph computation accelerator
CN109949202B (en) * 2019-02-02 2022-11-11 西安邮电大学 Parallel graph computation accelerator structure
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
CN111126593B (en) * 2019-11-07 2023-05-05 复旦大学 Reconfigurable natural language deep convolutional neural network accelerator
US11593609B2 (en) 2020-02-18 2023-02-28 Stmicroelectronics S.R.L. Vector quantization decoding hardware unit for real-time dynamic decompression for parameters of neural networks
CN111340206A (en) * 2020-02-20 2020-06-26 云南大学 Alexnet forward network accelerator based on FPGA
CN111325327B (en) * 2020-03-06 2022-03-08 四川九洲电器集团有限责任公司 Universal convolution neural network operation architecture based on embedded platform and use method
WO2021189209A1 (en) * 2020-03-23 2021-09-30 深圳市大疆创新科技有限公司 Testing method and verification platform for accelerator
CN111652361B (en) * 2020-06-04 2023-09-26 南京博芯电子技术有限公司 Composite granularity near storage approximate acceleration structure system and method for long-short-term memory network
US11531873B2 (en) 2020-06-23 2022-12-20 Stmicroelectronics S.R.L. Convolution acceleration with embedded vector decompression
CN111610963B (en) * 2020-06-24 2021-08-17 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN111860780A (en) * 2020-07-10 2020-10-30 逢亿科技(上海)有限公司 Hardware acceleration system and calculation method for irregular convolution kernel convolution neural network
CN112183732A (en) * 2020-10-22 2021-01-05 中国人民解放军国防科技大学 Convolutional neural network acceleration method and device and computer equipment
CN112905526B (en) * 2021-01-21 2022-07-08 北京理工大学 FPGA implementation method for multiple types of convolution
CN112686228B (en) * 2021-03-12 2021-06-01 深圳市安软科技股份有限公司 Pedestrian attribute identification method and device, electronic equipment and storage medium
CN115576895B (en) * 2022-11-18 2023-05-02 摩尔线程智能科技(北京)有限责任公司 Computing device, computing method, and computer-readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984560A (en) * 2014-05-30 2014-08-13 东南大学 Embedded reconfigurable system based on large-scale coarseness and processing method thereof
WO2015168774A1 (en) * 2014-05-05 2015-11-12 Chematria Inc. Binding affinity prediction system and method
CN105453021A (en) * 2013-08-01 2016-03-30 经度企业快闪公司 Systems and methods for atomic storage operations
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm
CN106127302A (en) * 2016-06-23 2016-11-16 杭州华为数字技术有限公司 Process the circuit of data, image processing system, the method and apparatus of process data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7219085B2 (en) * 2003-12-09 2007-05-15 Microsoft Corporation System and method for accelerating and optimizing the processing of machine learning techniques using a graphics processing unit

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105453021A (en) * 2013-08-01 2016-03-30 经度企业快闪公司 Systems and methods for atomic storage operations
WO2015168774A1 (en) * 2014-05-05 2015-11-12 Chematria Inc. Binding affinity prediction system and method
CN103984560A (en) * 2014-05-30 2014-08-13 东南大学 Embedded reconfigurable system based on large-scale coarseness and processing method thereof
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm
CN106127302A (en) * 2016-06-23 2016-11-16 杭州华为数字技术有限公司 Process the circuit of data, image processing system, the method and apparatus of process data

Also Published As

Publication number Publication date
CN106951961A (en) 2017-07-14

Similar Documents

Publication Publication Date Title
CN106951961B (en) Coarse-grained reconfigurable convolutional neural network accelerator and system
Kwon et al. Maeri: Enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects
Qin et al. Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training
US11625245B2 (en) Compute-in-memory systems and methods
CN111178519B (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
JP6960700B2 (en) Multicast Network On-Chip Convolutional Neural Network Hardware Accelerator and Its Behavior
CN205139973U (en) BP neural network based on FPGA device founds
Kim et al. FPGA-based CNN inference accelerator synthesized from multi-threaded C software
Kim et al. A highly scalable restricted Boltzmann machine FPGA implementation
CN109740739A (en) Neural computing device, neural computing method and Related product
CN109740754A (en) Neural computing device, neural computing method and Related product
CN109711539A (en) Operation method, device and Related product
Wu et al. Compute-efficient neural-network acceleration
Catthoor et al. Very large-scale neuromorphic systems for biological signal processing
US11645225B2 (en) Partitionable networked computer
Delaye et al. Deep learning challenges and solutions with xilinx fpgas
Gerlinghoff et al. A resource-efficient spiking neural network accelerator supporting emerging neural encoding
Zhang et al. Evaluating low-memory GEMMs for convolutional neural network inference on FPGAS
Sait et al. Engineering a memetic algorithm from discrete cuckoo search and tabu search for cell assignment of hybrid nanoscale CMOL circuits
Morcel et al. Fpga-based accelerator for deep convolutional neural networks for the spark environment
CN109857024A (en) The unit performance test method and System on Chip/SoC of artificial intelligence module
Ascia et al. Networks-on-chip based deep neural networks accelerators for iot edge devices
Greensted et al. Extrinsic evolvable hardware on the RISA architecture
CN110287628B (en) Simulation method of nanometer quantum cellular automatic machine circuit
CN107644143B (en) A kind of high-performance city CA model construction method based on vectorization and parallel computation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant