CN113313252B - Depthwise separable convolution implementation method based on a systolic array - Google Patents

Depthwise separable convolution implementation method based on a systolic array

Info

Publication number
CN113313252B
CN113313252B (application number CN202110562786.6A)
Authority
CN
China
Prior art keywords
data
convolution
array
register
units
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110562786.6A
Other languages
Chinese (zh)
Other versions
CN113313252A (en)
Inventor
陆生礼
张广明
张娟
庞伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110562786.6A priority Critical patent/CN113313252B/en
Publication of CN113313252A publication Critical patent/CN113313252A/en
Application granted granted Critical
Publication of CN113313252B publication Critical patent/CN113313252B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50: Adding; Subtracting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a depthwise separable convolution implementation method based on a systolic array, in which M rows and N columns of processing units (PE units) form the systolic array structure: in the horizontal direction, adjacent PE units are connected to each other, and each PE unit can transmit data to the PE unit on its right; in the vertical direction, each PE unit has its own data input port and data output port. A data prefetch module supplies feature map data and weight parameters to the compute array, and an adder tree accumulates the partial-sum data output in parallel by each column of PE units. Internally, each PE unit consists mainly of registers, data selectors, an adder, and a multiplier. With this systolic array structure, cooperating with the data prefetch module and the adder tree, different data flows and data reuse modes can be realized, thereby accelerating the computation of standard convolution, pointwise convolution, and depthwise convolution.

Description

Depthwise separable convolution implementation method based on a systolic array
Technical Field
The invention discloses a depthwise separable convolution implementation method based on a systolic array, relates to hardware accelerator architectures for convolutional neural networks, and belongs to the technical field of computing.
Background
Convolutional neural networks achieve high accuracy and are widely applied in computer vision fields such as image classification, object detection, and object tracking. However, convolutional neural networks are computationally intensive models that require enormous amounts of computation and parameters during training and deployment, which limits their application on resource-constrained embedded and mobile terminals.
To meet the requirements of practical applications, network architectures have developed toward lightweight networks. Lightweight architectures widely adopt depthwise separable convolution in place of standard convolution, decomposing the standard convolution into a depthwise convolution part and a pointwise convolution part, so that the neural network has fewer parameters and less computation while retaining accuracy comparable to that of large networks.
However, depthwise separable convolution offers less data reuse and lower computational parallelism, which greatly reduces the utilization of the compute array when an accelerator computes it, and thus degrades performance. A method for implementing depthwise separable convolution on a systolic array is therefore of great significance.
Disclosure of Invention
To fully exploit the data reuse and computational parallelism of depthwise separable convolution, the invention provides a depthwise separable convolution implementation method based on a systolic array, which adopts flexible data flows, improves the utilization of the accelerator's compute array, and enables the accelerator to accelerate both standard convolution and depthwise separable convolution.
The invention adopts the following technical scheme to solve the above problems:
a depth separable convolution implementation method based on a systolic array comprises a data pre-fetching module and the systolic array, wherein the systolic array comprises a plurality of PE units which are arranged in the horizontal direction and the vertical direction, and the PE units have different processing modes for input data, parts and data thereof; the PE unit updates the data A once every period or fixes the data A in a register inside the PE unit for repeated use; the PE unit updates the data B once in each period and transmits the data B in the previous period to the adjacent PE unit; outputting the partial sum data once per period or accumulating the partial sum in the PE unit, storing the partial sum in the PE unit, and outputting the partial sum in a specific period; the adjacent PE units in the horizontal direction of the systolic array are connected with each other, and each PE unit in the vertical direction is provided with a data input port and a data output port; when the systolic array calculates different convolutions, the data transmitted in the horizontal direction and the vertical direction are different, when the standard convolution and the point convolution are calculated, the characteristic diagram data are transmitted in the horizontal direction, and the weight parameters are transmitted in the vertical direction; when the depth convolution is calculated, the weight parameters are transmitted in the horizontal direction, and the characteristic diagram data are transmitted in the vertical direction; the data pre-fetching module provides the feature map data and the weight parameters for the systolic array according to the requirements of the systolic array on the feature map data and the weight parameters when different convolution calculations are executed.
Further, when computing standard convolution and pointwise convolution, the systolic array performs parallel computation over the input-channel and output-channel dimensions simultaneously: in the horizontal direction, each row of PE units computes a different input channel in parallel, and in the vertical direction, each column of PE units computes a different output channel. When computing depthwise convolution, parallel computation is performed over the input-channel and convolution-window dimensions: in the horizontal direction, each row of PE units computes a different input channel in parallel, and in the vertical direction, each column of PE units computes a different convolution window.
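As a minimal illustration of the pointwise mapping (a behavioral sketch, not the hardware itself; the function name and the tiny M and N values are assumptions for illustration), row i of the M by N PE grid handles input channel i, column j handles output channel j, and each column's M products are reduced by an adder tree:

```python
def pointwise_mapping(features, weights):
    """features: M input-channel values for one pixel.
    weights[i][j]: weight from input channel i to output channel j.
    Row i of the M x N PE grid computes input channel i in parallel;
    column j computes output channel j and reduces its M products."""
    M, N = len(features), len(weights[0])
    outputs = []
    for j in range(N):                       # one column per output channel
        products = [features[i] * weights[i][j] for i in range(M)]
        outputs.append(sum(products))        # adder tree per column
    return outputs

# M = 2 input channels, N = 3 output channels
print(pointwise_mapping([1, 2], [[1, 0, 2],
                                 [3, 1, 1]]))  # → [7, 2, 4]
```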
Further, the PE unit comprises a multiplier, an adder, registers, and data selectors. There are 3 registers, namely register I, register II, and register III, and 4 data selectors, namely data selector I, data selector II, data selector III, and data selector IV.
When input data A does not need to be reused, it passes directly through data selector II into the multiplier to be multiplied with input data B.
When input data A needs to be reused, in the first clock cycle input data A is passed through data selector II into the multiplier and, at the same time, registered in register I through data selector I; in each subsequent cycle, data selector II selects the output of register I as the input of the multiplier, and data selector I feeds the output of register I back as the input of register I, so that input data A remains latched in register I, realizing the reuse of input data A.
Input data B is used directly as the other input of the multiplier to be multiplied with input data A; at the same time, input data B is registered in register II and output from register II in the next clock cycle as the input data of the adjacent PE unit.
When the output does not need to be accumulated inside the PE, data selector IV selects the output of the multiplier as the input of register III, and the data in register III is output in the next clock cycle.
When the output needs to be accumulated inside the PE, data selector III first selects the constant 0 to be added to the multiplier output in the adder, and data selector IV selects the adder output as the input of register III; in the subsequent accumulation, data selector III selects the output of register III to be added to the multiplier output in the adder, and the data in register III is output in a specific clock cycle.
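The cycle-by-cycle behavior described above can be sketched as a small behavioral model (an illustrative sketch only; the class and attribute names are assumptions, not the patent's hardware description):

```python
class PE:
    """Behavioral model of one PE unit: multiply A*B each cycle,
    optionally latch A (operand-stationary) and accumulate in register III."""
    def __init__(self, reuse_a, accumulate):
        self.reuse_a = reuse_a        # data selectors I/II: keep A in register I
        self.accumulate = accumulate  # data selectors III/IV: accumulate into register III
        self.reg1 = None              # register I: latched operand A
        self.reg2 = None              # register II: B forwarded to the right neighbor
        self.reg3 = 0                 # register III: partial sum / product

    def cycle(self, a, b):
        """One clock cycle; returns the B value passed to the adjacent PE."""
        if self.reuse_a:
            if self.reg1 is None:     # first cycle: latch A into register I
                self.reg1 = a
            a = self.reg1             # selector II then picks register I
        product = a * b               # multiplier
        # selector III picks 0 (pass-through) or register III (accumulate)
        self.reg3 = product + (self.reg3 if self.accumulate else 0)
        forwarded, self.reg2 = self.reg2, b
        return forwarded

pe = PE(reuse_a=True, accumulate=True)
pe.cycle(5, 1)   # latches A=5, partial sum 5
pe.cycle(99, 2)  # A stays 5, partial sum 5 + 5*2
print(pe.reg3)   # → 15
```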
Further, when the systolic array performs pointwise convolution, the feature map data is transmitted systolically between PE units in the horizontal direction of the array, the weight parameters enter each column of PE units in parallel in the vertical direction, each column of PE units outputs its computed partial sums in parallel, and the partial sums are accumulated through an adder tree.
Further, when performing standard convolution, the systolic array divides each 3×3 standard convolution kernel into 9 groups of 1×1 convolution kernels and, for each group of 1×1 kernels, executes the same computation data flow as pointwise convolution. Unlike pointwise convolution, the positions of the input feature points corresponding to the different groups among the 9 groups of 1×1 kernels are not all the same. Finally, the 9 groups of partial sums computed by the 9 groups of 1×1 kernels are accumulated to obtain the final output.
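The decomposition of a 3×3 convolution into 9 shifted 1×1 convolutions can be checked numerically (a sketch under the assumption of a single channel, stride 1, no padding; the helper names are illustrative):

```python
def conv3x3(img, k):
    """Direct 3x3 convolution (cross-correlation), stride 1, no padding."""
    H, W = len(img), len(img[0])
    return [[sum(img[r+i][c+j] * k[i][j] for i in range(3) for j in range(3))
             for c in range(W - 2)] for r in range(H - 2)]

def conv3x3_as_nine_1x1(img, k):
    """Same result via 9 shifted 1x1 convolutions whose partial sums
    are accumulated, mirroring the standard-convolution data flow."""
    H, W = len(img), len(img[0])
    out = [[0] * (W - 2) for _ in range(H - 2)]
    for i in range(3):
        for j in range(3):            # one 1x1 kernel group per (i, j) shift
            for r in range(H - 2):
                for c in range(W - 2):
                    out[r][c] += img[r+i][c+j] * k[i][j]
    return out

img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ker = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(conv3x3(img, ker) == conv3x3_as_nine_1x1(img, ker))  # → True
```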
Furthermore, when the systolic array performs depthwise convolution, the weight parameters are transmitted systolically between PE units in the horizontal direction of the array, the feature map data of different convolution windows enters each column of PE units in parallel in the vertical direction, each column of PE units needs 9 cycles to finish computing one convolution window, and during the computation the partial sums are accumulated inside the PE units; after one convolution window is completed, the results are output in parallel.
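For one 3×3 window, the 9-cycle accumulation inside a single PE can be sketched as follows (illustrative only; the function name and row-major flattening order are assumptions):

```python
def depthwise_window(window, kernel):
    """One 3x3 convolution window: a single PE consumes one
    (feature, weight) pair per cycle and accumulates for 9 cycles."""
    partial_sum = 0                       # register III
    flat_w = [v for row in window for v in row]
    flat_k = [v for row in kernel for v in row]
    for cycle in range(9):                # 9 multiply-accumulate cycles
        partial_sum += flat_w[cycle] * flat_k[cycle]
    return partial_sum                    # output in parallel afterwards

print(depthwise_window([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                       [[0, 0, 0], [0, 1, 0], [0, 0, 0]]))  # → 5
```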
Further, the systolic array uses different data reuse modes in different directions when performing different types of convolution.
Further, when performing standard convolution and pointwise convolution, the feature map data is transmitted systolically in the horizontal direction of the systolic array, realizing reuse of the feature map data among different PE units in the horizontal direction; in the vertical direction, the weight data, once updated, is temporarily stored in a register inside each PE unit during computation, realizing reuse of the weight parameters inside the PE units. When performing depthwise convolution, the weight parameters are transmitted systolically in the horizontal direction, realizing reuse of the weight parameters among different PE units in the horizontal direction, while the feature map data is updated every cycle in the vertical direction and is not reused within the systolic array.
Further, the data prefetch module comprises a feature map buffer, a register storage array, a data grouping module, a data selection module, and a weight buffer. When standard convolution and pointwise convolution are executed, the data prefetch module transmits the feature map data in the feature map buffer and the weight parameters in the weight buffer directly to the systolic array. When depthwise convolution is executed, for the feature map data, the data prefetch module first reads part of the feature map data from the feature map buffer (for stride 1, 3 rows by (N+2) columns of feature map data for each of the M input channels; for stride 2, 3 rows by (2N-1) columns for each of the M input channels) and temporarily stores it in the register storage array; the data grouping module then divides the data in the register storage array into N groups of 3×M data each, supplied respectively to the N columns of PE units in the systolic array; the weight parameters are transmitted directly from the weight buffer to the systolic array. While the systolic array is computing, the data prefetch module updates the next batch of feature map data into the register storage array, reducing the time the systolic array waits for the next batch of feature map data. The data grouping module realizes reuse of the feature map data; the data selection module selects whether the data in the feature map buffer or the grouped data is transmitted to the PE unit array.
Furthermore, when the systolic array computes depthwise convolution, the data prefetch module reads part of the feature map data from the feature map buffer into the register storage array and groups it through the data grouping module, realizing reuse of the feature map data. While the systolic array computes the previous batch of feature map data, the data prefetch module reads the next batch from the feature map buffer into the register storage array, reducing the time the systolic array waits for the next batch of feature map data.
By adopting the above technical scheme, the invention has the following beneficial effects:
(1) The systolic array provides computational parallelism along different dimensions, can match the computational characteristics of pointwise convolution, standard convolution, and depthwise convolution, and improves the PE unit utilization of the systolic array when computing the three types of convolution.
(2) The systolic array provides computational parallelism along different dimensions, can fully exploit the data reuse of the different convolution types, reduces accesses to the data buffers, and lowers memory-access power consumption.
Drawings
FIG. 1 is a schematic diagram of the data prefetch module and systolic array structure employed in the present invention;
FIG. 2 is a schematic diagram of the PE unit employed in the present invention;
FIG. 3 is a schematic data flow diagram of pointwise convolution in the systolic array;
FIG. 4 is a data flow diagram of standard convolution in the systolic array;
FIG. 5 is a data flow diagram of depthwise convolution in the systolic array.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings.
Example 1
FIG. 1 is a schematic diagram of the data prefetch module and systolic array employed in the present invention, and FIG. 2 is a schematic diagram of the internal structure of the PE unit. The PE unit comprises a multiplier, an adder, registers, and data selectors. For each PE unit, data A and data B are, according to the convolution type, the feature map data and the weight parameters (or the reverse); data A can be held inside the PE unit for repeated use, and data B is transmitted between PE units, realizing data reuse among different PE units. The computed partial sums can be accumulated inside the PE unit.
The data flow and data reuse of pointwise convolution, standard convolution, and depthwise convolution are detailed below:
(1) Pointwise convolution
The data flow of pointwise convolution is shown in FIG. 3; the weights and feature maps enter the array simultaneously. In the first clock cycle, the feature map data of the M input channels and the first group of M weights enter the leftmost column of PE units of the systolic array simultaneously; in the second clock cycle, the leftmost column continues to update its feature map data but no longer updates its weights, while the second group of M weights enters the second column of PE units. After N cycles, all N groups of weights have entered the array and each PE unit holds its own weight; the weights are then held fixed until all the input feature map data of the M channels has entered the array, after which the weights and the corresponding feature maps are updated simultaneously.
In the horizontal direction of the systolic array, the input feature map data is passed between adjacent PE units from left to right in each clock cycle, so the input feature map data is reused among different PE units. The weight data is fixed inside each PE unit and reused throughout the computation.
For the output, each column of PE units computes M products per clock cycle, which are accumulated through an M-input adder tree.
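An M-input adder tree reduces the M products of one column through levels of pairwise additions, each level halving the number of operands; a minimal sketch (the function name is an assumption, and an odd operand count is padded with zero purely for illustration):

```python
def adder_tree(products):
    """Pairwise reduction of one column's M products, as an adder tree
    would perform it: each level halves the number of operands."""
    level = list(products)
    while len(level) > 1:
        if len(level) % 2:            # odd operand carried into the next level
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

# M = 4 products from one PE column in one cycle
print(adder_tree([3, 1, 4, 1]))  # → 9
```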
(2) Standard convolution
The data flow of standard convolution is shown in FIG. 4 and is similar to that of pointwise convolution. Each 3×3 standard convolution kernel is divided into 9 groups of 1×1 convolution kernels, and each group of 1×1 kernels is computed in the same way as pointwise convolution. Unlike pointwise convolution, the positions of the input feature points corresponding to the different groups among the 9 groups of 1×1 kernels are not all the same, so the feature map data prefetch module must read the corresponding feature map data from the feature map buffer for each case. Finally, the 9 groups of partial sums computed by the 9 groups of 1×1 kernels are accumulated to obtain the final output.
(3) Depthwise convolution
The data flow of depthwise convolution is shown in FIG. 5. Each input channel of a depthwise convolution corresponds to only one 3×3 convolution kernel, so only the weight reuse within the convolution kernel can be exploited. Taking a 3×3 depthwise convolution with stride 1 as an example, before the computation starts the feature map data prefetch module must fetch the feature map data of M channels in advance, 3 rows by (N+2) columns per channel. With a sliding-window stride of 1, the 3 rows by (N+2) columns of each channel can be split into N 3×3 convolution windows; this data splitting is performed by the data prefetch module. The computation then starts: the weights enter the array from the horizontal direction, and the input feature map data of the N convolution windows enters from the vertical direction.
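The splitting of 3 rows by (N+2) columns into N overlapping 3×3 windows at stride 1 can be sketched as follows (one channel shown; the function name is an assumption):

```python
def split_windows(rows, n_windows):
    """Split 3 rows x (N+2) columns into N overlapping 3x3 windows
    (stride 1), as the data grouping module does for one channel."""
    return [[row[c:c + 3] for row in rows] for c in range(n_windows)]

rows = [[0, 1, 2, 3, 4],      # 3 rows, N+2 = 5 columns, so N = 3 windows
        [5, 6, 7, 8, 9],
        [10, 11, 12, 13, 14]]
wins = split_windows(rows, 3)
print(len(wins))      # → 3
print(wins[1][0])     # second window, top row → [1, 2, 3]
```

Adjacent windows share two of their three columns, which is exactly the overlap the register storage array lets the prefetch module reuse instead of re-reading it from the feature map buffer.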
Because the array's bandwidth requirement for feature map data is high during the computation, the data prefetch module continuously reads new feature map data from the feature map buffer while the array is computing, reducing the time the array waits for feature map data.
During the computation, each column of PE units must update its feature map data every cycle, so the feature map data cannot be reused within the array. Because the data prefetch module contains the register storage array, the data overlapping between different convolution windows of the feature map can be reused inside the data prefetch module, compensating for the array's inability to reuse feature map data. The weight data, which is transmitted horizontally in the systolic array, is reused among different PE units.
The invention discloses a depthwise separable convolution implementation method based on a systolic array, in which M rows and N columns of processing units (PE units) form the systolic array structure: in the horizontal direction, adjacent PE units are connected to each other, and each PE unit can transmit data to the PE unit on its right; in the vertical direction, each PE unit has its own data input port and data output port. The data prefetch module supplies feature map data and weight parameters to the compute array, and the adder tree accumulates the partial-sum data output in parallel by each column of PE units. Internally, each PE unit consists mainly of registers, data selectors, an adder, and a multiplier. With this systolic array structure, cooperating with the data prefetch module and the adder tree, different data flows and data reuse modes can be realized, thereby accelerating the computation of standard convolution, pointwise convolution, and depthwise convolution.
The above content merely illustrates the technical idea of the present invention and does not thereby limit its protection scope; any modification made on the basis of this technical idea falls within the protection scope of the present invention.

Claims (8)

1. A depthwise separable convolution implementation method based on a systolic array, characterized by comprising a data prefetch module and the systolic array, wherein the systolic array comprises a plurality of PE units arranged in the horizontal and vertical directions, and the PE units support different processing modes for input data and partial-sum data: a PE unit either updates data A once every cycle or fixes data A in an internal register for repeated use; a PE unit updates data B once every cycle and transmits the data B of the previous cycle to the adjacent PE unit; the partial-sum data is either output once every cycle, or accumulated and held inside the PE unit and output in a specific cycle; adjacent PE units in the horizontal direction of the systolic array are connected to each other, and in the vertical direction each PE unit has its own data input port and data output port; the data transmitted in the horizontal and vertical directions differs according to the convolution being computed: for standard convolution and pointwise convolution, feature map data is transmitted in the horizontal direction and weight parameters in the vertical direction; for depthwise convolution, weight parameters are transmitted in the horizontal direction and feature map data in the vertical direction; the data prefetch module supplies feature map data and weight parameters to the systolic array according to the array's requirements when executing the different convolution computations.
2. The method of claim 1, wherein the PE unit comprises a multiplier, an adder, registers, and data selectors; there are 3 registers, namely register I, register II, and register III, and 4 data selectors, namely data selector I, data selector II, data selector III, and data selector IV;
when input data A does not need to be reused, it passes directly through data selector II into the multiplier to be multiplied with input data B;
when input data A needs to be reused, in the first clock cycle input data A is passed through data selector II into the multiplier and, at the same time, registered in register I through data selector I; in each subsequent cycle, data selector II selects the output of register I as the input of the multiplier, and data selector I feeds the output of register I back as the input of register I, so that input data A remains latched in register I, realizing the reuse of input data A;
input data B is used directly as the other input of the multiplier to be multiplied with input data A; at the same time, input data B is registered in register II and output from register II in the next clock cycle as the input data of the adjacent PE unit;
when the output does not need to be accumulated inside the PE, data selector IV selects the output of the multiplier as the input of register III, and the data in register III is output in the next clock cycle;
when the output needs to be accumulated inside the PE, data selector III first selects the constant 0 to be added to the multiplier output in the adder, and data selector IV selects the adder output as the input of register III; in the subsequent accumulation, data selector III selects the output of register III to be added to the multiplier output in the adder, and the data in register III is output in a specific clock cycle.
3. The method of claim 1, wherein, when the systolic array performs pointwise convolution, the feature map data is transmitted systolically between PE units in the horizontal direction of the array, the weight parameters enter each column of PE units in parallel in the vertical direction, each column of PE units outputs its computed partial sums in parallel, and the partial sums are accumulated through an adder tree.
4. The method of claim 1, wherein, when performing standard convolution, the systolic array divides each 3×3 standard convolution kernel into 9 groups of 1×1 convolution kernels and, for each group of 1×1 kernels, executes the same computation data flow as pointwise convolution; unlike pointwise convolution, the positions of the input feature points corresponding to the different groups among the 9 groups of 1×1 kernels are not all the same; finally, the 9 groups of partial sums computed by the 9 groups of 1×1 kernels are accumulated to obtain the final output.
5. The method of claim 1, wherein, when performing depthwise convolution, the weight parameters are transmitted systolically between PE units in the horizontal direction of the array, the feature map data of different convolution windows enters each column of PE units in parallel in the vertical direction, each column of PE units needs 9 cycles to finish computing one convolution window, and during the computation the partial sums are accumulated inside the PE units; after one convolution window is completed, the results are output in parallel.
6. The method of claim 1, wherein the systolic array uses different data reuse modes in different directions when performing different types of convolution.
7. The method of claim 6, wherein, when performing standard convolution and pointwise convolution, the feature map data is transmitted systolically in the horizontal direction of the systolic array, realizing reuse of the feature map data among different PE units in the horizontal direction, and in the vertical direction the weight data, once updated, is temporarily stored in a register inside each PE unit during computation, realizing reuse of the weight parameters inside the PE units; when performing depthwise convolution, the weight parameters are transmitted systolically in the horizontal direction, realizing reuse of the weight parameters among different PE units in the horizontal direction, while the feature map data is updated every cycle in the vertical direction and is not reused within the systolic array.
8. The method of claim 7, wherein the data prefetch module comprises a feature map buffer, a register storage array, a data grouping module, a data selection module, and a weight buffer; when standard convolution and pointwise convolution are executed, the data prefetch module transmits the feature map data in the feature map buffer and the weight parameters in the weight buffer directly to the systolic array; when depthwise convolution is executed, for the feature map data, the data prefetch module first reads part of the feature map data from the feature map buffer into the register storage array, the data grouping module then divides the data in the register storage array into N groups of 3×M data each, supplied respectively to the N columns of PE units in the systolic array, and the weight parameters are transmitted directly from the weight buffer to the systolic array; while the systolic array is computing, the data prefetch module updates the next batch of feature map data into the register storage array, reducing the time the systolic array waits for the next batch of feature map data; the data grouping module realizes reuse of the feature map data; the data selection module selects whether the data in the feature map buffer or the grouped data is transmitted to the PE unit array.
CN202110562786.6A 2021-05-24 2021-05-24 Depthwise separable convolution implementation method based on systolic array Active CN113313252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110562786.6A CN113313252B (en) Depthwise separable convolution implementation method based on systolic array


Publications (2)

Publication Number Publication Date
CN113313252A CN113313252A (en) 2021-08-27
CN113313252B CN113313252B (en) 2022-10-25

Family

ID=77374382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110562786.6A Active CN113313252B (en) Depthwise separable convolution implementation method based on systolic array

Country Status (1)

Country Link
CN (1) CN113313252B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113869507B (en) * 2021-12-02 2022-04-15 Zhejiang Lab Neural network accelerator convolution calculation device and method based on systolic array
CN116050474A (en) * 2022-12-29 2023-05-02 Shanghai Tianshu Zhixin Semiconductor Co., Ltd. Convolution calculation method, SOC chip, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934339A (en) * 2019-03-06 2019-06-25 Southeast University A general convolutional neural network accelerator based on a one-dimensional systolic array
CN110543934A (en) * 2019-08-14 2019-12-06 Beihang University Systolic array computing structure and method for convolutional neural network
CN111506343A (en) * 2020-03-05 2020-08-07 Peking University Shenzhen Graduate School Deep learning convolution operation implementation method based on systolic array hardware architecture


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hardware Design of Convolution Computing Module Based on Systolic Array; Wang Chunlin et al.; Application of Electronic Technique; 2020-01-06 (Issue 01); full text *

Also Published As

Publication number Publication date
CN113313252A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN110458279B (en) FPGA-based binary neural network acceleration method and system
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN107844826B (en) Neural network processing unit and processing system comprising same
CN107862374B (en) Neural network processing system and processing method based on assembly line
CN110263925B (en) Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN111898733B Depthwise separable convolutional neural network accelerator architecture
US20210357735A1 (en) Split accumulator for convolutional neural network accelerator
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN113313252B Depthwise separable convolution implementation method based on systolic array
CN108733348B (en) Fused vector multiplier and method for performing operation using the same
CN110516801A High-throughput dynamically reconfigurable convolutional neural network accelerator architecture
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
CN114781629B (en) Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method
US20220164663A1 (en) Activation Compression Method for Deep Learning Acceleration
CN110543939A Hardware acceleration implementation framework for convolutional neural network backward training based on FPGA
CN113033794B Lightweight neural network hardware accelerator based on depthwise separable convolution
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
CN109558944B (en) Algorithm optimization method and device of convolutional neural network based on configurable convolutional layer
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
CN113298241B Depthwise separable convolutional neural network acceleration method and accelerator
CN113240101B (en) Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant