CN114723029A - DCNN accelerator based on hybrid multi-row data flow strategy - Google Patents

DCNN accelerator based on hybrid multi-row data flow strategy

Info

Publication number
CN114723029A
CN114723029A (Application CN202210482658.5A)
Authority
CN
China
Prior art keywords
data
processing module
convolution
input
convolution processing
Prior art date
Legal status
Pending
Application number
CN202210482658.5A
Other languages
Chinese (zh)
Inventor
黄以华
罗聪慧
黄文津
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210482658.5A
Publication of CN114723029A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention discloses a DCNN accelerator based on a hybrid multi-row data flow strategy, formed by stacking a plurality of convolution processing modules. Each convolution processing module comprises a plurality of parallel computing unit arrays, computing buffers, and data buffers. Data transmission between adjacent convolution processing modules is in units of rows; row data are stored in the data buffers, and data read from the data buffers in sequence are rearranged and sent to the computing buffers for use by the computing unit arrays. Each computing unit array is responsible for computing a single row of the output feature map, all computing unit arrays share the same weight data, and all weight data are stored in off-chip DRAM. Off-chip bandwidth usage can be tuned by adjusting the parallelism of the computing unit arrays of each convolution processing module, which solves the problem that existing layer-by-layer pipelined accelerators cannot optimize off-chip bandwidth.

Description

DCNN accelerator based on hybrid multi-row data flow strategy
Technical Field
The invention relates to the technical field of electronic information and deep learning, in particular to a DCNN accelerator based on a hybrid multi-row data flow strategy.
Background
In the wave of artificial intelligence development in recent years, deep convolutional neural networks (DCNNs) have shown performance superior to traditional algorithms in fields such as object detection, semantic segmentation, face recognition, speech recognition, and computer-aided medical diagnosis. DCNNs have therefore received very wide attention and research.
Because it can fully exploit the inter-layer and intra-layer parallelism of the DCNN model, the layer-by-layer pipelined system architecture is widely used in FPGA-based DCNN accelerators. In a layer-by-layer pipelined system architecture, the computation paradigm of the convolutional-layer computation tasks (row by row, layer by layer) determines the number of times the weight data are read from off-chip DRAM, which in turn determines the off-chip bandwidth of the accelerator. However, existing layer-by-layer pipelined architectures all use a fixed computation paradigm, so the throughput of the accelerator is limited by the off-chip bandwidth and it is difficult to reduce off-chip bandwidth usage through on-chip storage.
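As a rough, hedged illustration of why the computation paradigm fixes off-chip weight traffic, the Python sketch below uses hypothetical layer sizes and assumes (not stated in this patent) that the full weight tensor of a layer is streamed from DRAM once per computation pass, with each pass producing rows_per_pass output rows; row-by-row processing corresponds to rows_per_pass = 1, and a multi-row paradigm divides the re-read count accordingly.

    # Rough estimate of off-chip weight traffic for one convolution layer.
    # Assumption (illustrative only): weights are re-read from DRAM once per pass,
    # and each pass produces `rows_per_pass` rows of the output feature map.
    def weight_dram_traffic(n_out, c_in, k, h_out, rows_per_pass, bytes_per_weight=2):
        weight_bytes = n_out * c_in * k * k * bytes_per_weight   # size of one layer's weights
        passes = -(-h_out // rows_per_pass)                      # ceil(h_out / rows_per_pass)
        return passes * weight_bytes                             # total bytes read from DRAM

    # Hypothetical layer: 256 output channels, 128 input channels, 3x3 kernel, 56x56 output.
    for r in (1, 2, 4, 8):
        mb = weight_dram_traffic(256, 128, 3, 56, r) / 2**20
        print(f"rows per pass = {r}: ~{mb:.1f} MiB of weight reads")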
A convolution computation apparatus based on hybrid parallelism is disclosed in the prior art. The apparatus comprises: an input module configured to acquire input convolution data and corresponding parameters, judge the convolution shape from the input convolution data, and extract the feature map size, convolution kernel size, and channel number of the input convolution data; a simulation module configured to obtain the parallelism corresponding to the input convolution data from the data features extracted by the input module, the data features including the convolution shape and parameters; an on-chip processor including a plurality of parallel processing modules; a grouping control module connected to each processing module and configured to divide all processing modules on the on-chip processor into G groups according to the parallelism, where G equals the parallelism and each group contains the same number of processing modules; and a mapping module connected to each processing module and configured to control the data and parameters input to each processing module according to the parallelism, the input convolution data, and the corresponding parameters. Processing modules in the same group receive the same parameters but different data, while processing modules in different groups receive different parameters; each processing module completes the convolution acceleration according to the input data and parameters and outputs a result. This scheme also has difficulty reducing off-chip bandwidth usage through on-chip storage.
Disclosure of Invention
The invention provides a DCNN (deep convolutional neural network) accelerator based on a hybrid multi-row data flow strategy, which solves the problem that existing layer-by-layer pipelined DCNN accelerators cannot optimize off-chip bandwidth.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a DCNN accelerator based on a hybrid multi-row data stream policy, comprising a convolution processing unit and a fully-connected processing unit, wherein:
the convolution processing unit is responsible for processing the convolution computation part of the DCNN model and comprises a plurality of sequentially connected convolution processing modules, a bypass convolution processing module, and a branch processing module; the number of convolution processing modules equals the number L of convolution layers of the DCNN model; the number of input row data of each convolution processing module is r_i and the number of output row data is r_{i+1}, and the input row data quantity of a convolution processing module equals the output row data quantity of the preceding convolution processing module; the input row data quantity of the bypass convolution processing module equals the output row data quantity of the first convolution processing module, the output of the bypass convolution processing module is the input of the branch processing module, the output of the branch processing module is the input of the last convolution processing module, and the branch processing module processes the branch parts of the deep convolutional neural network;
and the output row data of the last convolution processing module are output to the fully-connected processing unit, which processes the fully-connected layer part of the deep convolutional neural network.
Preferably, a pooling processing module is connected between adjacent convolution processing modules; the pooling processing module processes a pooling layer part of the deep convolutional neural network.
Preferably, the external data source inputs one row of input feature map data to the first convolution processing module every ΔT_1 clock cycles, and every r_{i+1}·S_i·ΔT_i clock cycles the i-th convolution processing module completes the computation of r_{i+1} rows of output feature map data, where

ΔT_i = ΔT_1 · ∏_{j=1}^{i-1} S_j·Ps_j

In the formula, Ps_j is the stride of the pooling processing module after the j-th convolution processing module and S_j is the stride corresponding to the input feature map of the j-th convolution processing module; the convolution kernel size is N_i × C_i × K_i × K_i and the padding is pad_i.
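The sketch below is an illustration of this timing relation rather than the patent's own formula image: it derives the row interval ΔT_i layer by layer from the stated behaviour that the i-th module emits its output rows every S_i·ΔT_i cycles and that a pooling stage with stride Ps_i keeps only every Ps_i-th row, so ΔT_{i+1} = S_i·Ps_i·ΔT_i. The strides and ΔT_1 used here are hypothetical.

    # Hedged reconstruction of the row-interval recursion implied by the text:
    # one output row of layer i appears every S_i * dT_i cycles, and pooling with
    # stride Ps_i keeps every Ps_i-th row, so dT_{i+1} = S_i * Ps_i * dT_i.
    def row_intervals(dT1, conv_strides, pool_strides):
        intervals = [dT1]
        for S, Ps in zip(conv_strides, pool_strides):
            intervals.append(S * Ps * intervals[-1])
        return intervals  # intervals[i-1] is dT_i for layer i (1-indexed)

    # Hypothetical 4-layer network: conv strides 1,1,2,1; 2x2/stride-2 pooling after layers 1 and 2.
    print(row_intervals(dT1=4, conv_strides=[1, 1, 2, 1], pool_strides=[2, 2, 1, 1]))
    # -> [4, 8, 16, 32, 32]: the clock-cycle gap between successive input rows of each layer.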
Preferably, each convolution processing module includes an input data buffer, a plurality of parallel computation buffers, a plurality of parallel computation cell arrays, and an output data buffer, where:
the input data buffer reads and stores data from the output data buffer of the preceding convolution processing module, the plurality of parallel computing buffers read data from the input data buffer, the inputs of the plurality of parallel computing unit arrays are the data in the computing buffers, and the outputs of the plurality of parallel computing unit arrays are stored in the output data buffer.
Data transmission between adjacent convolution processing modules is in units of rows; row data are stored in the data buffer, and data read from the data buffer in sequence are rearranged and sent to the computing buffers for use by the computing unit arrays. Each computing unit array is responsible for computing a single row of the output feature map, all computing unit arrays share the same weight data, and all weight data are stored in off-chip DRAM.
Preferably, each of the plurality of parallel computing unit arrays consists of W_{h,i} × I_{w,i} computing units, each computing unit is a W_{w,i}-input multiply-accumulate tree, intermediate calculation data are buffered in a dual-port RAM, and the final calculation results are buffered in a RAM which serves as the data source of the input data buffer of the next convolution processing module.
Preferably, each of said computing units performs small-size matrix multiplications W_rb × I_rpb in turn, thereby finally realizing the large-size matrix multiplication W_r × I_rp; the final calculation result is a single row of output feature map data.
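A plain-Python sketch of this blocked scheme follows; the matrix sizes and block shapes are hypothetical and not taken from the patent. The large product W_r × I_rp is assembled by multiplying small sub-blocks of the two operands and accumulating the partial results.

    import numpy as np

    def blocked_matmul(W, I, rb, pb):
        """Compute W @ I by accumulating small sub-products, mimicking a compute
        unit that only performs small W_rb x I_rpb multiplications per step."""
        M, K = W.shape
        K2, N = I.shape
        assert K == K2
        out = np.zeros((M, N))
        for k0 in range(0, K, rb):          # walk the shared (reduction) dimension
            for n0 in range(0, N, pb):      # walk the output columns
                W_rb  = W[:, k0:k0 + rb]            # small weight block
                I_rpb = I[k0:k0 + rb, n0:n0 + pb]   # small input block
                out[:, n0:n0 + pb] += W_rb @ I_rpb  # accumulate the partial product
        return out

    W = np.random.rand(8, 36)    # hypothetical weight matrix W_r
    I = np.random.rand(36, 28)   # hypothetical Toeplitz column block I_rp
    assert np.allclose(blocked_matmul(W, I, rb=9, pb=7), W @ I)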
Preferably, the computation paradigm of each computing unit array is based on Toeplitz matrix multiplication: the input feature map data are converted into a Toeplitz matrix, and the input feature map data processed by each PE array are located in one column matrix of the Toeplitz matrix. In the hybrid multi-row data flow strategy, all parallel computing unit arrays share the same weight parameters and respectively process Ifmap data from different column matrices of the Toeplitz matrix, thereby optimizing bandwidth usage.
Preferably, the computing resource usage #PE Mult of each computing unit array is:
#PE Mult = W_{h,i} × W_{w,i} × I_{w,i}

where W_{h,i}, W_{w,i}, I_{w,i} need to satisfy

[Formula image in the original: a constraint relating W_{h,i}, W_{w,i}, I_{w,i} to Hout_i, the layer's convolution workload, and ΔT_1.]

In the formula, Hout_i is the width and height of the output feature map corresponding to the i-th convolution processing module, and ΔT_1 is the clock-cycle interval at which the external data source inputs one row of data to the accelerator, which satisfies:

[Formula image in the original: the constraint on ΔT_1 required to reach the target throughput TRP_obj.]

where TRP_obj is the desired throughput of the accelerator design.
Preferably, the data stored in each input data buffer have the same row position and column position in the input feature map and are arranged in the input data buffers by channel size, and the number of data buffers of the i-th convolution processing module is #RowDataBuffer × Hin_i, where

[Formula images in the original: the definition of #RowDataBuffer in terms of DataIn0′, and the definitions of the reduced factors r_i′ and (r_{i+1}S_i)′ obtained by dividing r_i and r_{i+1}S_i by their greatest common divisor.]

DataIn0′ = K_i + S_i(r_{i+1} − 1) + GCD(r_{i+1}S_i, r_i)(r_i′ − 1)

in which GCD(r_{i+1}S_i, r_i) is the greatest common divisor of r_i and r_{i+1}S_i, r_i′ and (r_{i+1}S_i)′ are two relatively prime positive integers, the stride corresponding to the output feature map is S_i, and the padding is pad_i.
Preferably, when the pooling processing module responsible for the pooling operation processes the output data of a convolution processing module, its number of input row data equals the number of output row data of that convolution processing module, and the number of data buffers it contains satisfies:

Hout_i × #RowPoolingBuffer

where

[Formula image in the original: the definition of #RowPoolingBuffer.]
compared with the prior art, the technical scheme of the invention has the beneficial effects that:
according to the invention, a mixed multi-row calculation paradigm (data flow strategy) is introduced into a layer-by-layer pipeline system architecture, and efficient on-chip storage and off-chip bandwidth balance can be realized through flexible data flow strategy configuration, so that the flexibility of layer-by-layer pipeline system architecture design and the theoretical throughput upper limit thereof are improved.
Drawings
Fig. 1 is a schematic diagram of an overall accelerator framework according to the present invention.
Fig. 2 is a schematic diagram of a hybrid multi-row data flow strategy.
FIG. 3 is a block diagram of a convolution processing module.
FIG. 4 is a diagram of a hardware structure of a computing unit array.
FIG. 5 is a diagram illustrating a data storage sequence of an input data buffer.
Fig. 6 is a schematic diagram of convolution calculation paradigm based on Toeplitz matrix.
Fig. 7 is a schematic diagram of the sequence of processing the weight data and Toeplitz matrix data by the computing unit array.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
The present embodiment provides a DCNN accelerator based on a hybrid multi-row data flow strategy, as shown in Fig. 1, comprising a convolution processing unit and a fully-connected processing unit, wherein:
the convolution processing unit is responsible for processing the convolution computation part of the DCNN model and comprises a plurality of sequentially connected convolution processing modules, a bypass convolution processing module, and a branch processing module; the number of convolution processing modules equals the number L of convolution layers of the DCNN model; the number of input row data of each convolution processing module is r_i and the number of output row data is r_{i+1}, and the input row data quantity of a convolution processing module equals the output row data quantity of the preceding convolution processing module; the input row data quantity of the bypass convolution processing module equals the output row data quantity of the first convolution processing module, the output of the bypass convolution processing module is the input of the branch processing module, the output of the branch processing module is the input of the last convolution processing module, and the branch processing module processes the branch parts of the deep convolutional neural network;
and the output row data of the last convolution processing module are output to the fully-connected processing unit, which processes the fully-connected layer part of the deep convolutional neural network.
In Fig. 1, BottleNeck is the branch processing module used in networks such as ResNet; FPM (Fully Connected Process Module) is the fully-connected processing unit used to process the fully-connected layers of the network; CPM (Convolution Process Module) is the convolution processing module used to process the convolution computation part of the network; External Memory is the off-chip memory, i.e., memory outside the FPGA chip, represented here by the DDR on the FPGA development board.
Example 2
This example continues to disclose the following on the basis of example 1:
when the DCNN model comprises pooling layers, a pooling processing module is connected between every two convolution processing modules and processes the pooling layer part in the deep convolution neural network.
When the pooling layer parameters are PK_i = 2 and Ps_i = 2, in the pooling module each output terminal of the pooling buffer is connected to a comparator (Comp); three comparators form a 4-input comparator tree, and the output of the comparator tree is connected to the data buffer input of CCM_{i+1}. In addition, the parallelism of the comparator trees is Hin_{i+1}. The pooling module outputs data in parallel to the r_{i+1} row data buffers of CCM_{i+1}, so the output data dimension of the pooling module is r_{i+1} × Hin_{i+1}.
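A behavioural sketch of this 2x2/stride-2 pooling path follows (pure Python; the buffer widths and the CCM interface are simplified assumptions): three two-input comparators arranged as a 4-input comparator tree reduce each 2x2 window to its maximum.

    def comp(a, b):
        """One two-input comparator (max)."""
        return a if a >= b else b

    def comparator_tree4(w0, w1, w2, w3):
        """4-input comparator tree built from 3 comparators, as in the pooling module."""
        return comp(comp(w0, w1), comp(w2, w3))

    def max_pool_2x2_stride2(row_a, row_b):
        """Pool two adjacent feature-map rows into one output row (PK_i = Ps_i = 2)."""
        assert len(row_a) == len(row_b) and len(row_a) % 2 == 0
        return [comparator_tree4(row_a[c], row_a[c + 1], row_b[c], row_b[c + 1])
                for c in range(0, len(row_a), 2)]

    print(max_pool_2x2_stride2([1, 5, 2, 0], [3, 4, 7, 6]))   # -> [5, 7]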
Example 3
This example continues to disclose the following on the basis of example 1 or example 2:
assuming that the number of convolutional layers of the convolutional neural network is L, the width and height of the input feature map (Ifmap) corresponding to each convolutional layer are HiniThe number of input channels is CiThe number of output channels is NiThe width and height of the output feature map (Ofmap) are both HoutiThe size of convolution kernel corresponding to Ifmap is Ni×Ci×Ki×KiStride is SiPadding is padiWithout losing oneIn general, it is assumed that a Pooling Layer (PL) exists between each of the convolution layersi) The size of the pooling filter is PKiThe pooled stride is PsiWherein all i are more than or equal to 1 and less than or equal to L.
The layer-by-layer pipelined DCNN accelerator comprises a plurality of convolution processing modules (CCMs), the number of which equals the number L of convolution layers of the DCNN model; its overall framework is shown in Fig. 1. The number of input row data of each convolution processing module is r_i and the number of output row data is r_{i+1}. The input row data quantity of a convolution processing module equals the output row data quantity of the preceding convolution processing module (see Fig. 2), and the external data source inputs one row of input feature map data to the accelerator every ΔT_1 clock cycles. Every r_2·S_1·ΔT_1 clock cycles, the first CCM outputs r_2 rows of data to the next CCM; similarly, every r_{i+1}·S_i·ΔT_i clock cycles, CCM_i completes the computation of r_{i+1} rows of Ofmap data, where

ΔT_i = ΔT_1 · ∏_{j=1}^{i-1} S_j·Ps_j
in the PE array, each computation module (PE module) is a multiply-accumulate tree structure, see FIG. 4, in which W ish,iAnd Iw,iHeight and width of the PE array, Ww,iIn order to multiply the number of input ports of the accumulator tree, intermediate calculation data is cached in a Dual port RAM, and a final calculation result is cached in the RAM. Wherein, the RAM is used as the data source of the data buffer of the next CCM. Each PE array is provided with a PE buffer of a Ping-Pong structure, the PE buffers are used for caching input data of the PE arrays, and the data are sourced from a data buffer of a convolution processing module. The Data buffers are used for buffering output Data from the previous convolution processing module, the Data stored by each Data buffer has the same row position and column position in the input characteristic diagram and is arranged in the Data buffers according to the channel size, see figure 5, the Data stored by each Data buffer has the same row position and column position in the input characteristic diagram, and in addition, the Data buffers have the same row position and column position in the input characteristic diagram and are arranged in the Data buffers according to the channel sizeArranged in data buffers by channel size.
Further, the number of data buffers is #RowDataBuffer × Hin_i, where

[Formula images in the original: the definition of #RowDataBuffer in terms of DataIn0′, and the definitions of the reduced factors r_i′ and (r_{i+1}S_i)′ obtained by dividing r_i and r_{i+1}S_i by their greatest common divisor.]

DataIn0′ = K_i + S_i(r_{i+1} − 1) + GCD(r_{i+1}S_i, r_i)(r_i′ − 1)

in which GCD(r_{i+1}S_i, r_i) is the greatest common divisor of r_i and r_{i+1}S_i, and r_i′ and (r_{i+1}S_i)′ are two relatively prime positive integers.
Further, the computational paradigm of the PE array is based on Toeplitz matrix multiplication, i.e. the input feature map data must be converted into a Toeplitz matrix. The input feature map data processed by each PE array are located in one column matrix of the Toeplitz matrix; see Fig. 6 for the convolution operation based on the Toeplitz matrix. Each PE array performs small-size matrix multiplications W_rb × I_rpb in turn, thereby finally realizing the large-size matrix multiplication W_r × I_rp; the final calculation result is a single row of output feature map data.
The order in which a PE array processes the weight data and the Toeplitz matrix data for a single row of output feature map data is shown in Fig. 7. In the hybrid multi-row data flow strategy, all parallel PE arrays share the same weight parameters and respectively process Ifmap data from different column matrices of the Toeplitz matrix, thereby optimizing bandwidth usage. Fig. 7 shows an example of the operation of two parallel PE arrays.
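The NumPy sketch below illustrates this paradigm with hypothetical tensor sizes (no padding, and a simple unit-stride layer): the input feature map is unrolled into a Toeplitz (im2col) matrix one output row at a time, and two "PE arrays" reuse the same weight matrix while each multiplies a different column matrix, i.e. each produces a different output row.

    import numpy as np

    def toeplitz_row(ifmap, out_row, K, S):
        """im2col for one output row: each column holds one K x K x C receptive field."""
        C, H, W = ifmap.shape
        W_out = (W - K) // S + 1
        cols = [ifmap[:, out_row * S:out_row * S + K, c * S:c * S + K].reshape(-1)
                for c in range(W_out)]
        return np.stack(cols, axis=1)            # shape (C*K*K, W_out)

    C, H, W, K, S, N = 3, 8, 8, 3, 1, 4          # hypothetical layer
    ifmap   = np.random.rand(C, H, W)
    weights = np.random.rand(N, C * K * K)       # weight matrix W_r, shared by all PE arrays

    # Two parallel PE arrays, same weights, different column matrices (different output rows).
    out_row0 = weights @ toeplitz_row(ifmap, 0, K, S)   # PE array 0 -> output row 0
    out_row1 = weights @ toeplitz_row(ifmap, 1, K, S)   # PE array 1 -> output row 1
    print(out_row0.shape, out_row1.shape)               # each (4, 6): N channels x W_out columns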
Further, each PE array has a computational resource usage of
#PE Mult = W_{h,i} × W_{w,i} × I_{w,i}

where W_{h,i}, W_{w,i}, I_{w,i} need to satisfy

[Formula image in the original: a constraint relating W_{h,i}, W_{w,i}, I_{w,i} to Hout_i, the layer's convolution workload, and ΔT_1.]

Here ΔT_1 is the clock-cycle interval at which the external data source inputs one row of data to the accelerator, and it satisfies

[Formula image in the original: the constraint on ΔT_1 required to reach the target throughput TRP_obj.]

where TRP_obj is the desired throughput of the accelerator design, IOP is the total computation of the DCNN model executed by the accelerator, and freq is the operating frequency of the accelerator.
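A hedged back-of-the-envelope check of this constraint; how the quantities combine below is an assumption for illustration, not the patent's formula image. If one inference streams in as Hin_1 rows at ΔT_1-cycle intervals, the achieved throughput is roughly IOP·freq / (Hin_1·ΔT_1), so ΔT_1 must be small enough for this value to stay above TRP_obj.

    def achieved_throughput(IOP, freq_hz, Hin1, dT1):
        """Rough throughput in ops/s, assuming one inference = Hin1 input rows
        streamed at dT1-cycle intervals (an assumption, not the patent's exact formula)."""
        seconds_per_inference = Hin1 * dT1 / freq_hz
        return IOP / seconds_per_inference

    # Hypothetical numbers: 3.8 GOP per inference, 200 MHz clock, 224 input rows, 1 TOP/s target.
    IOP, freq, Hin1, TRP_obj = 3.8e9, 200e6, 224, 1.0e12
    for dT1 in (1000, 3000, 5000):
        trp = achieved_throughput(IOP, freq, Hin1, dT1)
        print(f"dT1={dT1}: ~{trp/1e12:.2f} TOP/s ({'meets' if trp >= TRP_obj else 'misses'} TRP_obj)")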
Furthermore, when the pooling module responsible for the pooling operation processes the output data of a convolution processing module, its number of input row data equals the number of output row data of that convolution processing module, and the number of data buffers it contains satisfies

Hout_i × #RowPoolingBuffer

where

[Formula image in the original: the definition of #RowPoolingBuffer.]
Further, the implementation of the fully-connected processing unit includes a PE array and a data access system. Specifically, the PE array consists of batch PEs, where batch is the number of inference tasks processed in parallel. All PEs share the same weight data but process different Ifmap data, i.e. each PE is responsible for processing the Ifmap data of a different inference task. The hardware structure of each PE is a MACC_f-input multiply-accumulate tree, so each PE can perform MACC_f MACC operations per clock cycle.
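A behavioural sketch of this batching scheme (NumPy; the batch, MACC_f, and layer sizes below are hypothetical): every PE applies the same weight matrix, but each PE works on the Ifmap vector of a different inference task, consuming MACC_f input pairs per "cycle".

    import numpy as np

    def fc_pe(weights_row, ifmap_vec, MACC_f=8):
        """One PE: a MACC_f-input multiply-accumulate tree sweeping a dot product."""
        acc = 0.0
        for k in range(0, len(ifmap_vec), MACC_f):
            acc += float(np.dot(weights_row[k:k + MACC_f], ifmap_vec[k:k + MACC_f]))
        return acc

    def fc_layer_batched(weights, ifmaps, MACC_f=8):
        """len(ifmaps) PEs share the same weights; PE b handles inference task b."""
        return np.array([[fc_pe(w_row, ifmaps[b], MACC_f) for w_row in weights]
                         for b in range(len(ifmaps))])

    batch, in_dim, out_dim = 4, 64, 10             # hypothetical sizes
    weights = np.random.rand(out_dim, in_dim)      # shared weight matrix
    ifmaps  = np.random.rand(batch, in_dim)        # one Ifmap vector per inference task
    assert np.allclose(fc_layer_batched(weights, ifmaps), ifmaps @ weights.T)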
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. This need not be, nor should it be exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A DCNN accelerator based on a hybrid multi-row data flow strategy, comprising a convolution processing unit and a fully-connected processing unit, wherein:
the convolution processing unit is responsible for processing the convolution computation part of the DCNN model and comprises a plurality of sequentially connected convolution processing modules, a bypass convolution processing module, and a branch processing module; the number of convolution processing modules equals the number L of convolution layers of the DCNN model; the number of input row data of each convolution processing module is r_i and the number of output row data is r_{i+1}, and the input row data quantity of a convolution processing module equals the output row data quantity of the preceding convolution processing module; the input row data quantity of the bypass convolution processing module equals the output row data quantity of the first convolution processing module, the output of the bypass convolution processing module is the input of the branch processing module, the output of the branch processing module is the input of the last convolution processing module, and the branch processing module processes the branch parts of the deep convolutional neural network;
and the output row data of the last convolution processing module are output to the fully-connected processing unit, which processes the fully-connected layer part of the deep convolutional neural network.
2. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 1, wherein a pooling processing module is connected between adjacent convolution processing modules, and the pooling processing module processes a pooling layer portion of the deep convolutional neural network.
3. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 2, wherein the external data source inputs one row of input feature map data to the first convolution processing module every ΔT_1 clock cycles, and every r_{i+1}·S_i·ΔT_i clock cycles the i-th convolution processing module completes the computation of r_{i+1} rows of output feature map data, where

ΔT_i = ΔT_1 · ∏_{j=1}^{i-1} S_j·Ps_j

In the formula, Ps_j is the stride of the pooling processing module after the j-th convolution processing module and S_j is the stride corresponding to the input feature map of the j-th convolution processing module; the convolution kernel size is N_i × C_i × K_i × K_i and the padding is pad_i.
4. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 3, wherein each convolution processing module comprises an input data buffer, a plurality of parallel computation buffers, a plurality of parallel computing unit arrays, and an output data buffer, wherein:
the input data buffer reads and stores data from the output data buffer of the preceding convolution processing module, the plurality of parallel computing buffers read data from the input data buffer, the inputs of the plurality of parallel computing unit arrays are the data in the computing buffers, and the outputs of the plurality of parallel computing unit arrays are stored in the output data buffer.
5. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 4, wherein each of the plurality of parallel computing unit arrays consists of W_{h,i} × I_{w,i} computing units, each computing unit is a W_{w,i}-input multiply-accumulate tree, intermediate calculation data are buffered in a dual-port RAM, and the final calculation results are buffered in a RAM which serves as the data source of the input data buffer of the next convolution processing module.
6. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 5, wherein each computing unit performs small-size matrix multiplications W_rb × I_rpb in turn, thereby finally realizing the large-size matrix multiplication W_r × I_rp; the final calculation result is a single row of output feature map data.
7. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 6, wherein the computation paradigm of each computing unit is based on Toeplitz matrix multiplication: the input feature map data are converted into a Toeplitz matrix, the input feature map data processed by each PE array are located in one column matrix of the Toeplitz matrix, and in the hybrid multi-row data flow strategy all parallel computing unit arrays share the same weight parameters to respectively process Ifmap data from different column matrices of the Toeplitz matrix, thereby optimizing bandwidth usage.
8. The DCNN accelerator according to claim 7, wherein the computing resource usage # PE Mult of each computing unit is:
#PE Mult = W_{h,i} × W_{w,i} × I_{w,i}

where W_{h,i}, W_{w,i}, I_{w,i} need to satisfy

[Formula image in the original: a constraint relating W_{h,i}, W_{w,i}, I_{w,i} to Hout_i, the layer's convolution workload, and ΔT_1.]

In the formula, Hout_i is the width and height of the output feature map corresponding to the i-th convolution processing module, and ΔT_1 is the clock-cycle interval at which the external data source inputs one row of data to the accelerator, which satisfies:

[Formula image in the original: the constraint on ΔT_1 required to reach the target throughput TRP_obj.]

where TRP_obj is the desired throughput of the accelerator design.
9. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 8, wherein the data stored in each input data buffer have the same row position and column position in the input feature map and are arranged in the input data buffers by channel size, and the number of data buffers of the i-th convolution processing module is #RowDataBuffer × Hin_i, where

[Formula images in the original: the definition of #RowDataBuffer in terms of DataIn0′, and the definitions of the reduced factors r_i′ and (r_{i+1}S_i)′ obtained by dividing r_i and r_{i+1}S_i by their greatest common divisor.]

DataIn0′ = K_i + S_i(r_{i+1} − 1) + GCD(r_{i+1}S_i, r_i)(r_i′ − 1)

in which GCD(r_{i+1}S_i, r_i) is the greatest common divisor of r_i and r_{i+1}S_i, r_i′ and (r_{i+1}S_i)′ are two relatively prime positive integers, the stride corresponding to the output feature map is S_i, and the padding is pad_i.
10. The DCNN accelerator according to claim 9, wherein, when the pooling processing module responsible for the pooling operation processes the output data of a convolution processing module, its number of input row data equals the number of output row data of that convolution processing module, and the number of data buffers satisfies:

Hout_i × #RowPoolingBuffer

where

[Formula image in the original: the definition of #RowPoolingBuffer.]
CN202210482658.5A 2022-05-05 2022-05-05 DCNN accelerator based on hybrid multi-row data flow strategy Pending CN114723029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210482658.5A CN114723029A (en) 2022-05-05 2022-05-05 DCNN accelerator based on hybrid multi-row data flow strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210482658.5A CN114723029A (en) 2022-05-05 2022-05-05 DCNN accelerator based on hybrid multi-row data flow strategy

Publications (1)

Publication Number Publication Date
CN114723029A true CN114723029A (en) 2022-07-08

Family

ID=82231586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210482658.5A Pending CN114723029A (en) 2022-05-05 2022-05-05 DCNN accelerator based on hybrid multi-row data flow strategy

Country Status (1)

Country Link
CN (1) CN114723029A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115292662A (en) * 2022-08-18 2022-11-04 上海燧原科技有限公司 Convolution acceleration operation method and device, electronic equipment and storage medium
CN115292662B (en) * 2022-08-18 2023-09-22 上海燧原科技有限公司 Convolution acceleration operation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110516801B (en) High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
CN111445012B (en) FPGA-based packet convolution hardware accelerator and method thereof
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN110705703B (en) Sparse neural network processor based on systolic array
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
CN108170640B (en) Neural network operation device and operation method using same
CN112905530B (en) On-chip architecture, pooled computing accelerator array, unit and control method
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN114723029A (en) DCNN accelerator based on hybrid multi-row data flow strategy
CN113298237A (en) Convolutional neural network on-chip training accelerator based on FPGA
CN114462587B (en) FPGA implementation method for photoelectric hybrid computation neural network
CN111275167A (en) High-energy-efficiency pulse array framework for binary convolutional neural network
CN111079908B (en) Network-on-chip data processing method, storage medium, computer device and apparatus
CN112862091B (en) Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN116167424B (en) CIM-based neural network accelerator, CIM-based neural network accelerator method, CIM-based neural network storage processing system and CIM-based neural network storage processing equipment
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
CN110766136B (en) Compression method of sparse matrix and vector
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
US11928176B2 (en) Time domain unrolling sparse matrix multiplication system and method
CN114912596A (en) Sparse convolution neural network-oriented multi-chip system and method thereof
CN112766453A (en) Data processing device and data processing method
CN113592067B (en) Configurable convolution calculation circuit for convolution neural network
Wang et al. An FPGA-Based Reconfigurable CNN Training Accelerator Using Decomposable Winograd

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination