CN108241890B - Reconfigurable neural network acceleration method and architecture - Google Patents

Info

Publication number
CN108241890B
Authority
CN
China
Prior art keywords
convolution
channels
output
input
input data
Prior art date
Legal status
Active
Application number
CN201810084089.2A
Other languages
Chinese (zh)
Other versions
CN108241890A
Inventor
尹首一
涂锋斌
严佳乐
欧阳鹏
唐士斌
刘雷波
魏少军
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201810084089.2A priority Critical patent/CN108241890B/en
Publication of CN108241890A publication Critical patent/CN108241890A/en
Application granted granted Critical
Publication of CN108241890B publication Critical patent/CN108241890B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Error Detection And Correction (AREA)

Abstract

The invention provides a reconfigurable neural network acceleration method and a reconfigurable neural network acceleration architecture. Using an architecture composed of an input cache unit, a weight cache unit, a convolution calculation kernel unit and an output cache unit, the convolution calculation kernel unit performs convolution operations on the read input data and convolution kernels to generate output data, in input data multiplexing, output data multiplexing and weight data multiplexing modes. The application copes with neural networks having diverse numbers of layers through a layer-by-layer acceleration strategy and optimizes the acceleration with loop transformations, thereby reducing the number of accesses to the Buffer and the DRAM. This solves the problem in the prior art that frequent memory accesses waste power, reduces energy consumption, and maximizes the hardware resource utilization of the PE array.

Description

Reconfigurable neural network acceleration method and architecture
Technical Field
The invention relates to a computation mode in deep convolutional neural networks, and in particular to a reconfigurable neural network acceleration method and a reconfigurable neural network acceleration architecture.
Background
Deep convolutional neural networks have been widely used in computer vision and speech processing. However, their high complexity poses great challenges for hardware implementation, especially with respect to power consumption and performance. Traditional execution hardware includes CPUs, GPUs and FPGAs. A CPU cannot provide low-latency processing in embedded devices; a GPU can meet the low-latency requirement, but its power consumption is far too high for embedded devices; and although an FPGA can barely meet the power and performance requirements, its routing resources and computing units limit the execution efficiency of different deep convolutional neural networks.
In response to the above needs and challenges, a dedicated architecture for implementing deep convolutional neural networks is needed to replace CPU, GPU or FPGA hardware. Even so, the computation modes adopted by some conventional neural network hardware architectures do not achieve a good compromise between performance and power consumption. In a traditional hardware deep neural network computation mode, the data volume of each layer differs, yet the existing computation mode accesses the buffer and the memory in a single fixed way and cannot be reconfigured in real time according to the computation requirements; the number of memory accesses therefore increases greatly, causing unnecessary power consumption. Fig. 1 is a schematic diagram of the computation of a classical deep convolutional neural network, and fig. 2 is a pseudo-code loop expression of its convolutional layer operation. As shown in fig. 1, in a classical deep convolutional neural network of the prior art, the size of a convolution kernel is K × K and each convolution kernel has N convolution channels; the size of the input data is H × L and the input data has N input channels; the convolution kernels perform convolution operations on the input data to obtain output data of size R × C with M output channels. As shown in fig. 2, the pseudo-code loop process of the convolutional layer operation is as follows (a runnable sketch in code follows the list):
the loops R and C sequentially traverse each portion of the output data on each channel;
the loop M sequentially convolves the N convolution channels of each convolution kernel with the N input channels of the current portion of input data, thereby obtaining the output data of each output channel in turn;
and the loop N sequentially convolves the N input channels of the current portion of input data with the N convolution channels of the current convolution kernel.
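For illustration only, the loop nest of fig. 2 can be sketched in NumPy as follows (the function name, array layout and stride handling are assumptions made for this sketch, not details taken from the patent):

```python
import numpy as np

def conv_layer(inputs, weights, S=1):
    # inputs:  (N, H, L)    input data with N input channels of size H x L
    # weights: (M, N, K, K) M convolution kernels, each with N convolution channels
    M, N, K, _ = weights.shape
    _, H, L = inputs.shape
    R, C = (H - K) // S + 1, (L - K) // S + 1
    outputs = np.zeros((M, R, C))
    for r in range(R):                  # loop R: output rows
        for c in range(C):              # loop C: output columns
            for m in range(M):          # loop M: output channels
                for n in range(N):      # loop N: input channels
                    patch = inputs[n, r * S:r * S + K, c * S:c * S + K]
                    outputs[m, r, c] += np.sum(patch * weights[m, n])
    return outputs
```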
In a deep convolutional neural network acceleration processor, energy efficiency is a very important index. It is defined as:
Energy Efficiency = Operations / Energy = Performance / Power
where Operations is the number of operations, Energy is the energy consumed, Performance is the performance and Power is the power consumption. For a specific convolutional neural network the number of operations is fixed, so the only key factor affecting energy efficiency is Energy.
The energy can be defined as:
Energy = MA_DRAM · E_DRAM + MA_buffer · E_buffer + Operations · E_operation
where MA_DRAM and MA_buffer are the numbers of accesses to the DRAM and to the cache (Buffer), Operations is the number of operations, and E_DRAM, E_buffer and E_operation are the energies of a single DRAM access, a single Buffer access and a single operation, respectively. Thus, for a fixed convolutional neural network, the key factors affecting power consumption are the number of DRAM accesses MA_DRAM and the number of Buffer accesses MA_buffer.
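As an illustrative sketch of this energy model (the default per-access energies below are placeholders for illustration, not values from the patent):

```python
def energy(ma_dram, ma_buffer, operations,
           e_dram=200.0, e_buffer=6.0, e_operation=1.0):
    # Energy = MA_DRAM*E_DRAM + MA_buffer*E_buffer + Operations*E_operation
    return ma_dram * e_dram + ma_buffer * e_buffer + operations * e_operation

def energy_efficiency(operations, total_energy):
    # Energy Efficiency = Operations / Energy (= Performance / Power)
    return operations / total_energy
```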
In addition, when convolution operation is performed, the utilization rate of the PE array by the conventional calculation mode is not high, and especially when the step size of the convolution kernel is greater than 1, the hardware resource utilization rate of the PE array is greatly reduced.
Therefore, how to reduce energy consumption by reducing the number of accesses to the Buffer and the DRAM in a deep convolutional neural network, and how to improve the utilization rate of the PE array in the convolution operation, are technical problems to be solved at present.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a reconfigurable neural network acceleration method and a reconfigurable neural network acceleration architecture, which cope with neural networks having various numbers of layers through a layer-by-layer acceleration strategy and optimize the computation mode with loop transformations, thereby reducing energy consumption and maximizing the utilization rate of the PE array.
The invention provides a reconfigurable neural network acceleration method I, which is an input data multiplexing method and comprises the following steps:
the input buffer unit divides the input data of N input channels into n input data blocks; each of the input data blocks has Tn input channels; each input data block is sent in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
the weight cache unit divides the M convolution kernels into m convolution groups; each said convolution group has Tm convolution kernels, and each said convolution kernel has N convolution channels; each convolution group is sent in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
the convolution calculation kernel unit sequentially performs convolution operations on the read a-th input data block and each convolution group to generate output data blocks with Tm output channels, until the convolution operation with the m-th convolution group is completed and output data blocks with M output channels are generated; it accumulates the stored output data blocks of the M output channels fed back by the output cache unit with the generated output data blocks of the M output channels, and sends the accumulated output data blocks; the stored output data blocks are generated by sequentially convolving the 1st to (a-1)-th input data blocks, read before the a-th input data block, with each convolution group and accumulating the results;
the output cache unit stores the received accumulated output data blocks as stored output data blocks of the M output channels and feeds the stored output data blocks back to the convolution calculation core unit; when a = n, the output cache unit holds all output data of the M output channels; wherein a ≤ n, and a is a positive integer.
The invention provides a reconfigurable neural network acceleration architecture I, which comprises: an input cache unit, a weight cache unit, a convolution calculation kernel unit and an output cache unit;
the input buffer unit is used for dividing the input data of the N input channels into n input data blocks, each of the input data blocks having Tn input channels, and for sending each input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
the weight cache unit is used for dividing the M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and for sending each convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
the convolution calculation kernel unit is used for sequentially performing convolution operations on the read a-th input data block and each convolution group to generate output data blocks with Tm output channels, until the convolution operation with the m-th convolution group is completed and output data blocks with M output channels are generated; for accumulating the stored output data blocks of the M output channels fed back by the output cache unit with the generated output data blocks of the M output channels; and for sending the accumulated output data blocks; the stored output data blocks are generated by sequentially convolving the 1st to (a-1)-th input data blocks, read before the a-th input data block, with each convolution group and accumulating the results;
the output cache unit is used for storing the received accumulated output data blocks as stored output data blocks of the M output channels and feeding the stored output data blocks back to the convolution calculation core unit; when a = n, the output cache unit holds all output data of the M output channels; wherein a ≤ n, and a is a positive integer.
The invention provides a reconfigurable neural network acceleration method II, which is an output data multiplexing method and comprises the following steps:
the input buffer unit divides the input data of N input channels into n input data blocks; each of the input data blocks has Tn input channels; each input data block is sent in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
the weight cache unit divides the M convolution kernels into m convolution groups; each said convolution group has Tm convolution kernels, and each said convolution kernel has N convolution channels; each convolution group is sent in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
the convolution calculation kernel unit sequentially performs convolution operations on each read input data block and the read b-th convolution group to generate output data blocks with Tm output channels, until the convolution operation with the n-th input data block is completed and all output data of the Tm output channels are generated; it accumulates the partial-channel output data stored in the convolution calculation core unit with all the generated output data of the Tm output channels, and generates and stores the accumulated partial-channel output data; the stored partial-channel output data are generated by sequentially convolving the 1st to (b-1)-th convolution groups, read before the b-th convolution group, with each input data block and accumulating the results; when b = m, the convolution calculation kernel unit sends the accumulated output data of the M channels; wherein b ≤ m, and b is a positive integer; the received output data of the M output channels are pooled, and the pooled output data are sent;
and the output buffer unit receives and stores the pooled output data and generates the pooled output data of the M output channels.
The invention provides a reconfigurable neural network acceleration architecture II, which comprises: an input cache unit, a weight cache unit, a convolution calculation kernel unit and an output cache unit;
the input buffer unit is used for dividing the input data of the N input channels into n input data blocks, each of the input data blocks having Tn input channels, and for sending each input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
the weight cache unit is used for dividing the M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and for sending each convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
the convolution calculation kernel unit is used for sequentially performing convolution operations on each read input data block and the read b-th convolution group to generate output data blocks with Tm output channels, until the convolution operation with the n-th input data block is completed and all output data of the Tm output channels are generated; for accumulating the partial-channel output data stored in the convolution calculation core unit with all the generated output data of the Tm output channels; and for generating and storing the accumulated partial-channel output data; the stored partial-channel output data are generated by sequentially convolving the 1st to (b-1)-th convolution groups, read before the b-th convolution group, with each input data block and accumulating the results; when b = m, the convolution calculation kernel unit sends the accumulated output data of the M channels; wherein b ≤ m, and b is a positive integer; the received output data of the M output channels are pooled, and the pooled output data are sent;
and the output buffer unit is used for receiving and storing the pooled output data and generating the pooled output data of the M output channels.
The invention provides a reconfigurable neural network acceleration method III, which is a weight data multiplexing method and comprises the following steps:
the input buffer unit divides the input data of N input channels into n input data blocks; each of the input data blocks has Tn input channels; each input data block is sent in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
the weight cache unit divides the M convolution kernels into m convolution groups; each said convolution group has Tm convolution kernels, and each said convolution kernel has N convolution channels; each convolution group is sent in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
the convolution calculation kernel unit sequentially performs convolution operations on each read input data block and the read b-th convolution group to generate output data blocks with Tm output channels, until the convolution operation with the n-th input data block is completed and the output data of the Tm output channels are generated; it accumulates the partial-channel output data fed back by the output cache unit with the generated output data of the Tm output channels, and sends the accumulated partial-channel output data; the fed-back partial-channel output data are generated by sequentially convolving the 1st to (b-1)-th convolution groups, read before the b-th convolution group, with each input data block and accumulating the results;
the output cache unit stores the received accumulated partial-channel output data as partial-channel output data and feeds the stored partial-channel output data back to the convolution calculation core unit; when b = m, the output cache unit holds all output data of the M output channels; wherein b ≤ m, and b is a positive integer.
The invention provides a reconfigurable neural network acceleration architecture III, which comprises: an input cache unit, a weight cache unit, a convolution calculation kernel unit and an output cache unit;
the input buffer unit is used for dividing the input data of the N input channels into n input data blocks, each of the input data blocks having Tn input channels, and for sending each input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
the weight cache unit is used for dividing the M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and for sending each convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
the convolution calculation kernel unit is used for sequentially performing convolution operations on each read input data block and the read b-th convolution group to generate output data blocks with Tm output channels, until the convolution operation with the n-th input data block is completed and the output data of the Tm output channels are generated; for accumulating the partial-channel output data fed back by the output cache unit with the generated output data of the Tm output channels; and for sending the accumulated partial-channel output data; the fed-back partial-channel output data are generated by sequentially convolving the 1st to (b-1)-th convolution groups, read before the b-th convolution group, with each input data block and accumulating the results;
the output cache unit is used for storing the received accumulated partial-channel output data as partial-channel output data and feeding the stored partial-channel output data back to the convolution calculation core unit; when b = m, the output cache unit holds all output data of the M output channels; wherein b ≤ m, and b is a positive integer.
The invention has the beneficial effects that: according to the reconfigurable neural network acceleration method and the reconfigurable neural network acceleration architecture, on the basis of the input cache unit, the weight cache unit, the convolution calculation kernel unit and the output cache unit, an input data multiplexing method, an output data multiplexing method and a weight data multiplexing method are respectively adopted, the aim of coping with neural networks with various layers through a layer-by-layer acceleration strategy is achieved, the neural network acceleration method is optimized by using cyclic transformation, energy consumption is reduced, and the utilization rate of a PE array is maximized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a classical deep convolutional neural network computation;
FIG. 2 is a diagram of a pseudocode circular expression of a convolutional layer operation of a classical deep convolutional neural network;
FIG. 3 is a flowchart of a reconfigurable neural network acceleration method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of sending an input data block along the Z-axis according to one embodiment of the invention;
FIG. 5 is a flow chart of a reconfigurable neural network acceleration method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a convolution calculation defect one of a conventional convolution kernel;
FIG. 7 is a diagram illustrating a defect-one based parallel convolution mapping scheme according to an embodiment of the present invention;
FIG. 8 is a pseudo code loop expression diagram of a defect one based parallel convolution mapping scheme in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of a convolution calculation defect two of a conventional convolution kernel;
FIG. 10 is a block diagram of an input data block partition based on defect two according to an embodiment of the present invention;
FIG. 11 is a diagram illustrating an input data block after defect two based stitching according to an embodiment of the present invention;
FIG. 12 is a diagram illustrating a parallel convolution mapping scheme based on Defect two according to an embodiment of the present invention;
FIG. 13 is a pseudo code loop expression diagram of a convolution operation according to a first embodiment of the present invention;
fig. 14 is a schematic structural diagram of a reconfigurable neural network acceleration architecture according to a second embodiment of the present invention;
fig. 15 is a flowchart of a reconfigurable neural network acceleration method according to a third embodiment of the present invention;
FIG. 16 is a schematic diagram of sending an input data block along the XY plane in accordance with one embodiment of the invention;
FIG. 17 is a flow chart of a reconfigurable neural network acceleration method of an embodiment of the present invention;
FIG. 18 is a pseudo code loop expression diagram of a convolution operation according to a third embodiment of the present invention;
fig. 19 is a schematic structural diagram of a reconfigurable neural network acceleration architecture according to a fourth embodiment of the present invention;
fig. 20 is a flowchart of a reconfigurable neural network acceleration method according to a fifth embodiment of the present invention;
FIG. 21 is a flow chart of a reconfigurable neural network acceleration method of an embodiment of the present invention;
FIG. 22 is a pseudo code loop expression diagram of a convolution operation according to a fifth embodiment of the present invention;
fig. 23 is a schematic structural diagram of a reconfigurable neural network acceleration architecture according to a sixth embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As used herein, the terms "first," "second," … …, etc. do not denote any order or order, nor are they used to limit the invention, but rather are used to distinguish one element from another element or operation described by the same technical terms.
As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including, but not limited to.
As used herein, "and/or" includes any and all combinations of the described items.
With respect to directional terminology used herein, for example: up, down, left, right, front or rear, etc., are referred to only in the direction of the attached drawings. Therefore, the directional terminology is used for purposes of illustration and is in no way limiting.
Aiming at the defects in the prior art, the invention provides a reconfigurable neural network acceleration method, which is used for coping with neural networks with various layers by a data multiplexing layer-by-layer acceleration strategy and optimizing the neural network acceleration method by using cyclic transformation, and has the beneficial effects of reducing energy consumption and maximizing the utilization rate of a PE array.
The first embodiment is as follows: in order to solve the defects in the prior art, the present embodiment provides a reconfigurable neural network acceleration method, which adopts an input data multiplexing mode, and as shown in fig. 3, the reconfigurable neural network acceleration method includes:
S101: the input buffer unit divides the input data of N input channels into n input data blocks; each of the input data blocks has Tn input channels; each input data block is sent in sequence; wherein n = N/Tn, and N, n and Tn are positive integers.
S102: the weight cache unit divides the M convolution kernels into m convolution groups; each said convolution group has Tm convolution kernels, and each said convolution kernel has N convolution channels; each convolution group is sent in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers.
S103: the convolution calculation kernel unit sequentially performs convolution operations on the read a-th input data block and each convolution group to generate output data blocks with Tm output channels, until the convolution operation with the m-th convolution group is completed and output data blocks with M output channels are generated; it accumulates the stored output data blocks of the M output channels fed back by the output cache unit with the generated output data blocks of the M output channels, and sends the accumulated output data blocks; the stored output data blocks are generated by sequentially convolving the 1st to (a-1)-th input data blocks, read before the a-th input data block, with the convolution groups and accumulating the results.
S104: the output cache unit stores the received accumulated output data blocks as stored output data blocks of the M output channels and feeds the stored output data blocks back to the convolution calculation core unit; when a = n, the output cache unit holds all output data of the M output channels; wherein a ≤ n, and a is a positive integer.
The reconfigurable neural network acceleration method provided by this embodiment divides the input data into input data blocks and sends them to the convolution calculation kernel unit in sequence; each time, the convolution calculation kernel unit sequentially convolves one input data block with the m convolution groups to generate output data blocks of the M output channels; the above operations are repeated so that every input data block is convolved with the m convolution groups, and the output data blocks of the M output channels are accumulated continuously, finally yielding all output data of the M output channels. The reconfigurable neural network acceleration method of this embodiment copes with neural networks having various numbers of layers through a data multiplexing, layer-by-layer acceleration strategy, optimizing the neural network and reducing energy consumption. Further, when the input buffer unit sends the input data blocks in sequence, it may send them sequentially along the Z-axis direction.
In a specific implementation, as shown in fig. 4, the input data has a three-dimensional structure with N input channels (the Z-axis direction), and the data of each channel has size H × L (the XY plane). The input data of each input channel is divided into input data blocks of size Th × Tl, the n input data blocks are read sequentially along the Z-axis direction, and each input data block is sent to the convolution calculation kernel unit for the convolution operation. As shown in fig. 4, the 1st to i-th input data blocks are sent first, then the (i+1)-th to 2i-th input data blocks, and so on, until finally the n-th input data block is sent, where i and n are positive integers.
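One possible reading of this Z-axis sending order, as a sketch (the grouping of the blocks and the generator form are assumptions made for illustration):

```python
import numpy as np

def blocks_in_z_order(inputs, Tn, Th, Tl):
    """Yield (Tn, Th, Tl) input data blocks, walking the channel (Z) axis first."""
    N, H, L = inputs.shape
    for h in range(0, H, Th):              # one Th x Tl tile position on the XY plane...
        for l in range(0, L, Tl):
            for c in range(0, N, Tn):      # ...then all N/Tn channel blocks along Z
                yield inputs[c:c + Tn, h:h + Th, l:l + Tl]
```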
Further, the convolution computation kernel unit sequentially performs convolution operation on the read a-th input data block and each group of convolution groups, and may sequentially perform convolution operation on Tn input channels of the read a-th input data block and Tn convolution channels of each group of convolution groups; and performing convolution operation by the Tn input channels in one-to-one correspondence with the Tn convolution channels of each convolution kernel.
Further, as shown in fig. 5, the reconfigurable neural network acceleration method further includes:
s105: and judging whether the step length of the current convolution kernel is larger than 1.
S106: if yes, the input data block is mapped to the PE array in an interleaving mode and is subjected to convolution operation with the same convolution kernel.
S107: if not, when the size of the output data block is smaller than that of the input data block, dividing each input data block into W input data small blocks with the same size, re-splicing the input data small blocks at the corresponding positions of the input data blocks, and generating W spliced input data blocks with the same size; and mapping the W spliced input data blocks to a PE array and carrying out convolution operation on the PE array and the same convolution kernel.
In a specific implementation, when a conventional convolutional neural network is executed on a hardware platform, the convolution kernel is in effect multiplied with every datum of the input data as if the step size were 1. Such an operation mode introduces invalid PE operations when the step size or the size of the output data changes, and has the following two defects:
the first drawback is that, as shown in fig. 6, when the input data block size Th ═ Tl is 8, the output data block size Tr ═ Tc ═ 4, the convolution kernel size K ═ 2, and the convolution kernel step length S ═ 2 > 1, at the algorithm level, the convolution kernel is required to perform convolution calculation through the entire input data block in a step length 2 manner, and the first weight value at the upper left corner of the convolution kernel is multiplied by each data in the input data block in a step length 1 manner, so as to generate an invalid calculated PE, and a truly valid calculated PE is a black box portion in fig. 6, where the utilization rate of PE is only 16/64 ×% > < 25%.
In order to solve the first defect, the present invention determines whether the step size of the convolution kernel is greater than 1 by executing step S105, where the step size S of the convolution kernel is 2 > 1, so as to execute step S106, and if the step size of the convolution kernel is greater than 1, map the input data blocks of different input channels to the PE array in an interleaving manner to perform convolution operation with the same convolution kernel. The specific implementation process is as follows:
As shown in fig. 7, the present invention uses the same convolution kernel weight. Since the output data block size is Tr = Tc = 4, when the convolution kernel is multiplied with the data of the input data block, an overlapped and staggered arrangement of 4 different input data blocks 1, 2, 3, 4 is adopted, namely: the first row, first column holds datum 1(1,1) (the first row, first column of input data block 1); the first row, second column holds datum 2(1,1); the second row, first column holds datum 3(1,1); and the second row, second column holds datum 4(1,1). The first row, third column holds datum 1(1,3) (since the convolution kernel step size is 2, the data in the first row, second column of all the blocks need not be calculated), the first row, fourth column holds datum 2(1,3), the second row, third column holds datum 3(1,3), the second row, fourth column holds datum 4(1,3), and so on. In this way the PEs that performed invalid calculations in fig. 6 are given data from different input data blocks that require valid calculations, so that 4 output data blocks are convolved in parallel at the same time, where Tr = Tc = Trr = Tcc = 4. The corresponding pseudo-code loop diagram is shown in fig. 8: the four-layer loop Tm/Tn/Tr/Tc indicates the convolution calculation performed in the convolution calculation kernel unit, namely Tm output data blocks of size Tr × Tc are calculated from Tn input data blocks of size Th × Tl, and two further loop layers Trr and Tcc are added at the innermost level to cycle over output data small blocks of size Trr × Tcc, which re-partition the output data block in order to realize the parallel convolution mapping method.
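The placement just described can be checked with a small NumPy sketch (the 8 × 8 array, K = 2 and S = 2 are taken from the example above; the function name and data layout are otherwise assumptions):

```python
import numpy as np

def interleave_stride2(blocks):
    """blocks: four (8, 8) input data blocks -> one (8, 8) PE input plane."""
    b1, b2, b3, b4 = blocks
    plane = np.empty((8, 8))
    plane[0::2, 0::2] = b1[::2, ::2]   # block 1 fills even rows / even columns
    plane[0::2, 1::2] = b2[::2, ::2]   # block 2 fills even rows / odd columns
    plane[1::2, 0::2] = b3[::2, ::2]   # block 3 fills odd rows / even columns
    plane[1::2, 1::2] = b4[::2, ::2]   # block 4 fills odd rows / odd columns
    return plane

# Without interleaving, only the 4x4 stride-2 positions of one block (16 of the
# 64 PEs, 25%) would hold data needing valid multiplications; with four blocks
# interleaved, every PE multiplies valid data by the shared kernel weight.
```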
The second defect: as shown in fig. 9, when the input data block size is Th = Tl = 8, the output data block size is Tr = Tc = 6, the convolution kernel size is K = 2 and the convolution kernel step size is S = 1, the size of the output data block (6 × 6) means that the convolution kernel does not need to traverse the entire 8 × 8 input data block. Although the kernel still moves with step size 1, only the black box portion in fig. 9 corresponds to valid calculations, so the PE utilization is 36/64 × 100% = 56.25%.
In order to solve the second drawback, the present invention executes step S107, when the convolution kernel step S is equal to 1 and the size 6x6 of the output data block is smaller than the size 8x8 of the input data block, dividing each input data block into W input data small blocks with the same size, and re-splicing the input data small blocks at corresponding positions to generate W spliced input data blocks with the same size; mapping the spliced W input data blocks to a PE array and carrying out convolution operation on the PE array and the same convolution kernel, wherein the specific execution process is as follows:
when 16 input data blocks w1, w2, w3, … … and w16 need to be convolved with 16 different convolution kernels, the defect shown in fig. 9 exists. The present invention performs the splitting process on the truly valid computing part, as shown in fig. 10, each input data block is split in a 2x2 manner, so that the 6x6 part of each input data block is split into 9 input data small blocks of 2x 2. As shown in fig. 11, the original 16 input data blocks are subjected to the above-described division processing, and 16 × 9 input data small blocks of 2 × 2 size are obtained. As shown in fig. 12, the input data tiles are re-spliced, each input data tile takes one input data tile at the same position, and 9 new input data tiles are formed, and each new input data tile has a size of 8 × 8. The 9 re-spliced 8x8 input data blocks are all subjected to convolution operation with the same convolution kernel, the convolution kernel moves through a 6x6 part of the original 16 input data blocks by taking the step size as 1, and the data in each input data block is fully utilized, so that 16 output data block sizes of 6x6 (namely, Tr is Tc is 6) are correspondingly obtained, the output data block is composed of 9 output data small blocks of 2x2 (namely, Trr is Tcc is 2), and the utilization rate of the PE is improved to 100% at this time (because the input data is composed of 9 input data blocks and each input data block is composed of 16 input data small blocks of 2x2, all effective calculation is carried out, and 16 output data blocks with the size of 6x6 are output).
The PE utilization in the traditional mode is calculated as:
Utilization = (R · C) / (⌈R/Tr⌉ · ⌈C/Tc⌉ · A)
where R and C correspond to the two-dimensional size of the output data in fig. 1, A is the size of the computational array, and Tr and Tc are the sizes of the output data blocks.
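Using the expression as reconstructed above, the two utilization figures quoted for the defects can be checked (A = 64 for an 8 × 8 array, and the output here consists of a single block, so R = Tr and C = Tc); this is an illustrative check, not code from the patent:

```python
import math

def pe_utilization(R, C, Tr, Tc, A):
    return (R * C) / (math.ceil(R / Tr) * math.ceil(C / Tc) * A)

print(pe_utilization(4, 4, 4, 4, 64))   # 0.25   -> 25%    (fig. 6 case)
print(pe_utilization(6, 6, 6, 6, 64))   # 0.5625 -> 56.25% (fig. 9 case)
```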
In the parallel convolution mapping mode, the PE utilization is calculated as:
[Formula: PE utilization of the parallel convolution mapping mode, in terms of R, C, A, Tr, Tc, Trr and Tcc]
where R and C correspond to the two-dimensional size of the output data in fig. 1, A is the size of the computational array, Tr and Tc are the sizes of the output data blocks, and Trr and Tcc are the sizes of the output data small blocks in the parallel convolution mapping method. The invention adopts the parallel convolution mapping mode to maximize the utilization of the hardware resources of the PE array.
Fig. 13 is the pseudo-code loop expression of the convolution operation of this embodiment. As shown in fig. 13, the reconfigurable neural network acceleration method of the first embodiment loops, from the inside out, as follows: the innermost four-layer loop Tm/Tn/Tr/Tc represents the convolution operation executed by the convolution calculation kernel unit, and the order of data multiplexing is depicted outside the dashed box. In loop M, each input data block is operated on with all M convolution kernels, producing partial sums of the output data of the M output channels. In loop N, the n input data blocks are traversed in sequence, the inner calculation is repeated, and the partial sums of the output data of the M output channels are accumulated continuously; the partial sums are therefore read and updated continuously until the complete convolution calculation is finished. The loops R and C then traverse the other portions of each output channel, repeat all previous operations, and finally obtain all output data of the M output channels.
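The input-multiplexing order of fig. 13 can be sketched as follows (an illustrative sketch only: the helper conv_block, stride 1 and the array shapes are assumptions, and the spatial tiling by loops R and C is omitted, so the input data blocks here are the channel blocks of one tile position):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_block(x, w):
    """x: (Tn, Th, Tl), w: (Tm, Tn, K, K) -> partial sums (Tm, Tr, Tc), stride 1."""
    K = w.shape[-1]
    win = sliding_window_view(x, (K, K), axis=(1, 2))   # (Tn, Tr, Tc, K, K)
    return np.einsum('mnkl,nrckl->mrc', w, win)

def input_reuse(input_blocks, conv_groups):
    # input_blocks[a]: (Tn, Th, Tl); conv_groups[b]: (Tm, N, K, K)
    Tn = input_blocks[0].shape[0]
    Tm, _, K, _ = conv_groups[0].shape
    Th, Tl = input_blocks[0].shape[1:]
    Tr, Tc = Th - K + 1, Tl - K + 1
    output_buffer = np.zeros((len(conv_groups) * Tm, Tr, Tc))   # all M channels
    for a, block in enumerate(input_blocks):       # the block is read once (loop N)...
        for b, group in enumerate(conv_groups):    # ...and reused by every group (loop M)
            partial = conv_block(block, group[:, a * Tn:(a + 1) * Tn])
            stored = output_buffer[b * Tm:(b + 1) * Tm]         # fed back from the buffer
            output_buffer[b * Tm:(b + 1) * Tm] = stored + partial
    return output_buffer                           # complete when a = n
```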
In a data multiplexing mode, the number of memory accesses is a very important criterion. The convolutional layer is first divided as shown in fig. 3: the input data of the N input channels are divided into n input data blocks, each input data block having Tn input channels and size Th × Tl, where n = N/Tn and N, n, Tn are positive integers; the output data of the M output channels are composed of m output data blocks, each output data block having Tm output channels and size Tr × Tc, where m = M/Tm and M, m, Tm are positive integers, Th = (Tr-1)·S + K, Tl = (Tc-1)·S + K, K × K is the size of the convolution kernel and S is the convolution step size. The number of memory accesses MA can then be expressed as:
MA = TI·α_i + TO·α_o + TW·α_w + TPO
where TI, TO and TW denote the amounts of input data, output data and weight data, respectively, α_i, α_o and α_w represent the respective reuse factors of the input data, output data and weight data, and TPO represents the total amount of pooled output data.
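Written directly as a small helper (a sketch; the reuse coefficients are passed in, since their exact expressions depend on the multiplexing mode, as discussed below):

```python
def memory_accesses(TI, TO, TW, TPO, alpha_i, alpha_o, alpha_w):
    # MA = TI*alpha_i + TO*alpha_o + TW*alpha_w + TPO
    return TI * alpha_i + TO * alpha_o + TW * alpha_w + TPO
```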
In the above input data multiplexing method, the corresponding Buffer access coefficients are:
[Formula: Buffer access coefficients α_i, α_o and α_w for the input data multiplexing mode]
Because the input register unit sequentially reads and writes the n = N/Tn input data blocks, and the convolution operation must traverse all N channels of the input data to obtain the final result, the convolution calculation kernel unit has to accumulate continuously: the partial sums are read and written n-1 times, which amounts to 2(n-1) accesses since each read and each write is counted once. The number of weight loads corresponds to n, and the final coefficient takes the step size, the size of the convolution kernel, the overlap during the movement of the convolution kernel and other factors into account:
[Formula: input data reuse coefficient]
For the DRAM access coefficients, B_o and B_w denote the storage sizes of the output cache unit and the weight cache unit, respectively. If the output cache unit is larger than the M·Tr·Tc amount of data, no additional DRAM storage needs to be occupied and the DRAM does not need to be accessed, so the coefficient is 1; otherwise the DRAM must be accessed. The same holds for the weight storage:
[Formula: DRAM access coefficients for the input data multiplexing mode]
The reconfigurable neural network acceleration method provided by the embodiment deals with the neural networks with various layers through a layer-by-layer acceleration strategy, and optimizes the neural network acceleration method by using a cyclic transformation method, so that the access times to buffers and a DRAM are reduced, the problem of power consumption waste caused by a large number of times of accessing a memory in the prior art is solved, and the reconfigurable neural network acceleration method has the beneficial effects of reducing energy consumption and maximizing the utilization rate of hardware resources of a PE array.
Example two: based on the same application concept as the reconfigurable neural network acceleration method, this embodiment also provides a reconfigurable neural network acceleration architecture, as described below. The reconfigurable neural network acceleration architecture implements the reconfigurable neural network acceleration method of the first embodiment, and repeated details are not described again.
As shown in fig. 14, the reconfigurable neural network acceleration architecture provided by this embodiment includes: an input buffer unit 1, a weight buffer unit 2, a convolution calculation kernel unit 3 and an output buffer unit 4.
An input buffer unit 1, configured to divide the input data of the N input channels into n input data blocks, each of the input data blocks having Tn input channels, and to send each input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers.
A weight cache unit 2, configured to divide the M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and to send each convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers.
A convolution calculation kernel unit 3, configured to sequentially perform convolution operations on the read a-th input data block and each convolution group to generate output data blocks with Tm output channels, until the convolution operation with the m-th convolution group is completed and output data blocks with M output channels are generated; to accumulate the stored output data blocks of the M output channels fed back by the output cache unit with the generated output data blocks of the M output channels; and to send the accumulated output data blocks; the stored output data blocks are generated by sequentially convolving the 1st to (a-1)-th input data blocks, read before the a-th input data block, with the convolution groups and accumulating the results.
An output buffer unit 4, configured to store the received accumulated output data blocks as stored output data blocks of the M output channels and to feed the stored output data blocks back to the convolution calculation core unit; when a = n, the output cache unit holds all output data of the M output channels; wherein a ≤ n, and a is a positive integer.
Further, the input buffer unit 1 is specifically configured to sequentially send each input data block along the Z-axis direction.
Further, as shown in fig. 14, the convolution computation core unit 3 includes: an input register unit 31, a calculation engine unit 32, and an output register unit 33.
An input register unit 31, configured to read an a-th input data block from the input buffer unit, and send the a-th input data block to the compute engine unit.
And the calculation engine unit 32 is configured to sequentially perform convolution operations on the Tn input channels of the read a-th input data block and the Tn convolution channels of the Tm convolution kernels of each convolution group, generating output data blocks with Tm output channels, and, when the convolution operation with the m-th convolution group is completed, to generate and send output data blocks with M output channels.
The output register unit 33 is configured to accumulate the stored output data blocks of the M output channels fed back by the output buffer unit with the generated output data blocks of the M output channels, and to send the accumulated output data blocks; the stored output data blocks are generated by sequentially convolving the 1st to (a-1)-th input data blocks, read before the a-th input data block, with the convolution groups and accumulating the results.
According to the reconfigurable neural network acceleration method and the reconfigurable neural network acceleration architecture provided by the embodiment, the input cache unit, the weight cache unit, the convolution calculation kernel unit and the output cache unit architecture are adopted, the input data multiplexing method is adopted, the neural networks with various layers are responded by a layer-by-layer acceleration strategy, and the neural network acceleration method is optimized by using a circular transformation method, so that the access times to the Buffer and the DRAM are reduced, the problem of power consumption waste caused by multiple times of accessing a memory in the prior art is solved, the energy consumption is reduced, and the hardware resource utilization rate of the PE array is maximized.
Example three: in order to solve the defects in the prior art, the present embodiment further provides a reconfigurable neural network acceleration method, where the method uses an output data multiplexing mode, and as shown in fig. 15, the reconfigurable neural network acceleration method includes:
S201: the input buffer unit divides the input data of N input channels into n input data blocks; each of the input data blocks has Tn input channels; each input data block is sent in sequence; wherein n = N/Tn, and N, n and Tn are positive integers.
S202: the weight cache unit divides the M convolution kernels into m convolution groups; each said convolution group has Tm convolution kernels, and each said convolution kernel has N convolution channels; each convolution group is sent in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers.
S203: the convolution calculation kernel unit sequentially performs convolution operations on each read input data block and the read b-th convolution group to generate output data blocks with Tm output channels, until the convolution operation with the n-th input data block is completed and all output data of the Tm output channels are generated; it accumulates the partial-channel output data stored in the convolution calculation core unit with all the generated output data of the Tm output channels, and generates and stores the accumulated partial-channel output data; the stored partial-channel output data are generated by sequentially convolving the 1st to (b-1)-th convolution groups, read before the b-th convolution group, with each input data block and accumulating the results; when b = m, the convolution calculation kernel unit sends the accumulated output data of the M channels; wherein b ≤ m, and b is a positive integer; the received output data of the M output channels are pooled, and the pooled output data are sent.
S204: and the output buffer unit receives and stores the pooled output data and generates the pooled output data of the M output channels.
In the reconfigurable neural network acceleration method provided by this embodiment, the input data are divided into input data blocks and sent to the convolution calculation kernel unit in sequence; the convolution calculation kernel unit sequentially convolves the n input data blocks with the same convolution group and accumulates the results to generate all output data of the Tm output channels; the above operations are repeated for each of the m convolution groups, and finally all output data of the M output channels are obtained. The reconfigurable neural network acceleration method of this embodiment copes with neural networks having various numbers of layers through a data multiplexing, layer-by-layer acceleration strategy, optimizing the neural network and reducing energy consumption.
Further, when the input buffer unit sequentially transmits the input data blocks, the input data blocks may be sequentially transmitted along the XY plane.
In a specific implementation, as shown in fig. 16, the input data has a three-dimensional structure with N input channels (the Z-axis direction), and the data of each channel has size H × L (the XY plane). The input data of each input channel is divided into input data blocks of size Th × Tl, the n input data blocks are read sequentially along the XY plane, and each input data block is sent to the convolution calculation kernel unit for the convolution operation. As shown in fig. 16, the 1st to i-th input data blocks are sent first, then the (i+1)-th to ki-th input data blocks, then the (ki+1)-th input data block, and so on, until finally the n-th input data block is sent, where i, k and n are positive integers.
Further, the convolution calculation kernel unit sequentially performs convolution operation on each read input data block and the read b-th group of convolution groups, and the convolution operation includes: the convolution calculation kernel unit carries out convolution operation on the Tn input channels of the read input data blocks and the Tn convolution channels of the read group b convolution group in sequence; and performing convolution operation by the Tn input channels in one-to-one correspondence with the Tn convolution channels of each convolution kernel.
In one embodiment, as shown in fig. 17, the method further comprises:
s205: and judging whether the step length of the current convolution kernel is larger than 1.
S206: if yes, the input data block is mapped to the PE array in an interleaving mode and is subjected to convolution operation with the same convolution kernel.
S207: if not, when the size of the output data block is smaller than that of the input data block, dividing each input data block into W input data small blocks with the same size, re-splicing the input data small blocks at the corresponding positions of the input data blocks, and generating W spliced input data blocks with the same size; and mapping the W spliced input data blocks to a PE array and carrying out convolution operation on the PE array and the same convolution kernel.
The specific implementation process is shown in the implementation process of steps S105-S107 in the first embodiment.
Fig. 18 is the pseudo-code loop expression of the convolution operation of this embodiment. As shown in fig. 18, in the reconfigurable neural network acceleration method of the third embodiment, loop M is outside loop N, which means that each convolution group performs convolution operations with the input data of all N input channels to obtain all output data of its output channels, so the partial output data blocks stored in the output buffer unit do not need to be read back repeatedly. As shown in fig. 18, the loops from the inside out are as follows: the inner four-layer loop Tm/Tn/Tr/Tc represents the convolution calculation in the convolution calculation kernel unit, namely Tm output data blocks of size Tr × Tc are calculated from Tn input data blocks of size Th × Tl, and the data multiplexing order is depicted outside the dashed box. In loop N, the n input data blocks are traversed in sequence, the inner calculation is repeated, all output data of the Tm output channels are generated by accumulation and finally stored in the output cache unit. In loop M, all the previously used input data are read in again, and the calculation of all M output channels is completed. The loops R and C then traverse the other portions of the output channels, repeat all previous operations, and finally obtain all output data of the M output channels.
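The output-multiplexing order of fig. 18 can be sketched as follows (an illustrative sketch only: the helper conv_block, stride 1, 2 × 2 max pooling and the array shapes are assumptions, and the spatial tiling by loops R and C is omitted, so the input data blocks here are the channel blocks of one tile position):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_block(x, w):
    """x: (Tn, Th, Tl), w: (Tm, Tn, K, K) -> partial sums (Tm, Tr, Tc), stride 1."""
    K = w.shape[-1]
    win = sliding_window_view(x, (K, K), axis=(1, 2))   # (Tn, Tr, Tc, K, K)
    return np.einsum('mnkl,nrckl->mrc', w, win)

def max_pool_2x2(y):
    Tm, Tr, Tc = y.shape
    y = y[:, :Tr - Tr % 2, :Tc - Tc % 2]
    return y.reshape(Tm, Tr // 2, 2, Tc // 2, 2).max(axis=(2, 4))

def output_reuse(input_blocks, conv_groups):
    # input_blocks[a]: (Tn, Th, Tl); conv_groups[b]: (Tm, N, K, K)
    Tn = input_blocks[0].shape[0]
    Tm, _, K, _ = conv_groups[0].shape
    Th, Tl = input_blocks[0].shape[1:]
    Tr, Tc = Th - K + 1, Tl - K + 1
    pooled = []
    for group in conv_groups:                      # loop M sits outside loop N
        acc = np.zeros((Tm, Tr, Tc))               # partial sums stay in the core
        for a, block in enumerate(input_blocks):   # accumulate over all n blocks
            acc += conv_block(block, group[:, a * Tn:(a + 1) * Tn])
        pooled.append(max_pool_2x2(acc))           # pool once, then write out
    return np.concatenate(pooled, axis=0)          # pooled output of M channels
```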
In the above output data multiplexing method, the corresponding Buffer access coefficients are:
[Formula: Buffer access coefficients for the output data multiplexing mode]
and the corresponding DRAM access coefficients are:
[Formula: DRAM access coefficients for the output data multiplexing mode]
where B_i denotes the storage size of the input cache unit: if the input cache unit can store the n input data blocks, the input data do not need to be fetched from the DRAM again. N is the number of input channels, Th × Tl is the size of an input data block, M is the number of convolution kernels, and each convolution group has Tm convolution kernels.
The reconfigurable neural network acceleration method provided by the third embodiment deals with neural networks with various layers through a layer-by-layer acceleration strategy, and optimizes the neural network acceleration method by using a cyclic transformation method, so that the access times to buffers and a DRAM are reduced, the problem of power consumption waste caused by a large number of times of accessing a memory in the prior art is solved, and the reconfigurable neural network acceleration method has the beneficial effects of reducing energy consumption and maximizing the utilization rate of hardware resources of a PE array.
Example four: based on the same application concept as the reconfigurable neural network acceleration method, the invention also provides a reconfigurable neural network acceleration architecture, which is described in the following. Because the principle of the reconfigurable neural network acceleration architecture for solving the problems is similar to that of the reconfigurable neural network acceleration method in the third embodiment, the implementation of the reconfigurable neural network acceleration architecture can refer to the implementation of the reconfigurable neural network acceleration method in the third embodiment, and repeated details are not repeated.
As shown in fig. 19, the reconfigurable neural network acceleration architecture provided in this embodiment includes: an input buffer unit 1, a weight buffer unit 2, a convolution calculation kernel unit 3 and an output buffer unit 4;
an input buffer unit 1, configured to divide input data of N input channels into N input data blocks; each of the input data blocks has Tn input channels; sequentially transmitting each input data block; wherein N is N/Tn, N, N and Tn are positive integers;
the weight cache unit 2 is used for dividing the M convolution kernels into M convolution groups; each said convolution group having Tm convolution kernels, each said convolution kernel having N convolution channels; sequentially transmitting each convolution group; wherein M is M/Tm, M, M, Tm and N are positive integers;
the convolution calculation kernel unit 3 is used for sequentially performing convolution operation on each read input data block and the read b-th group of convolution groups to generate output data blocks with Tm output channels until the convolution operation on the nth input data block is completed, and generating all output data of the Tm output channels; accumulating part of channels and output data stored by the convolution calculation core unit and all output data of the generated Tm output channels to generate and store the accumulated part of channels and output data; the stored partial channel and output data are generated by sequentially convolving the 1 st to (b-1) th convolution groups read before the b-th convolution group with each input data block and accumulating the convolution groups; when b is M, the convolution calculation kernel unit sends the accumulated output data of the M channels; wherein b is not more than m, and b is a positive integer; pooling the received output data of the M output channels, and sending the pooled output data;
and the output buffer unit 4 is used for receiving and storing the pooled output data and generating the pooled output data of the M output channels.
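A minimal sketch of the tiling performed by the input buffer unit and the weight cache unit is given below: the N input channels are split into n = N/Tn input data blocks and the M kernels into m = M/Tm convolution groups. Exact divisibility and NumPy array shapes are assumptions; the spatial division into Th × Tl windows is omitted for brevity.

```python
def split_into_blocks(inputs, kernels, Tn, Tm):
    """Channel tiling done by the input buffer and weight cache units (sketch).

    inputs:  (N, H, L) NumPy array    -> n = N // Tn input data blocks of Tn channels
    kernels: (M, N, K, K) NumPy array -> m = M // Tm convolution groups of Tm kernels
    N and M are assumed to be exact multiples of Tn and Tm.
    """
    N = inputs.shape[0]
    M = kernels.shape[0]
    input_blocks = [inputs[a:a + Tn] for a in range(0, N, Tn)]    # sent one by one
    conv_groups = [kernels[b:b + Tm] for b in range(0, M, Tm)]    # sent one by one
    return input_blocks, conv_groups
```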
Further, the input buffer unit 1 is specifically configured to sequentially send each input data block along the XY plane.
Further, as shown in fig. 19, the convolution computation core unit 3 includes: an input register unit 31, a calculation engine unit 32, an output register unit 33, and a pooling unit 34.
An input register unit 31, configured to read each input data block from the input buffer unit one by one, and send the input data block to the calculation engine unit;
a calculation engine unit 32, configured to perform convolution operations on the Tn input channels of each read input data block and the Tn convolution channels of the Tm convolution kernels of the read group b convolution group in sequence, and generate an output data block having Tm output channels until the convolution operation with the nth input data block is completed, and generate all output data of the Tm output channels; accumulating the partial channels and the output data fed back by the output register unit and all the output data of the generated Tm output channels, and generating and sending the accumulated partial channels and the output data; the partial channel and the output data fed back by the output register unit are generated by sequentially convolving the 1 st to (b-1) th convolution groups read before the b-th convolution group with each input data block and accumulating the convolution groups;
an output register unit 33, configured to store the received accumulated partial channels and output data as partial channels and output data, and feed back the stored partial channels and output data to the calculation engine unit; when b is M, the output register unit sends the accumulated output data of the M channels; wherein b is not more than m, and b is a positive integer;
the pooling unit 34 is configured to perform pooling on the received output data of the M output channels, and send the pooled output data.
The reconfigurable neural network acceleration method and architecture provided by this embodiment adopt an architecture of an input cache unit, a weight cache unit, a convolution calculation kernel unit and an output cache unit together with an output data multiplexing method, handle neural networks with varying numbers of layers through a layer-by-layer acceleration strategy, and optimize the acceleration with a loop transformation method, so that the number of accesses to the Buffer and the DRAM is reduced, the prior-art problem of power wasted by frequent memory accesses is solved, energy consumption is lowered, and the utilization of the PE array's hardware resources is maximized.
Example five: in order to overcome the defects in the prior art, the present embodiment further provides a reconfigurable neural network acceleration method that uses a weight data multiplexing mode. As shown in fig. 20, the reconfigurable neural network acceleration method includes:
s301: the input buffer unit divides input data of N input channels into N input data blocks; each of the input data blocks has Tn input channels; sequentially transmitting each input data block; wherein N is N/Tn, N, N and Tn are positive integers;
s302: the weight cache unit divides the M convolution kernels into M convolution groups; each said convolution group having Tm convolution kernels, each said convolution kernel having N convolution channels; sequentially transmitting each convolution group; wherein M is M/Tm, M, M, Tm and N are positive integers;
s303: the convolution calculation kernel unit carries out convolution operation on each read input data block and the read b-th group of convolution groups in sequence to generate output data blocks with Tm output channels until the convolution operation with the nth input data block is finished, and output data of the Tm output channels are generated; accumulating part of channels and output data fed back by the output cache unit and the generated output data of the Tm output channels, and sending the accumulated part of channels and output data; the feedback partial channel and the output data are generated by sequentially convolving the 1 st to (b-1) th convolution groups read before the b-th convolution group with each input data block and accumulating the convolution groups;
s304: the output buffer unit stores the received accumulated partial channels and output data into partial channels and output data, and feeds back the stored partial channels and output data to the convolution calculation core unit; when b is M, the output buffer unit stores all output data of M output channels; wherein b is not more than m, and b is a positive integer.
In the reconfigurable neural network acceleration method provided by this embodiment, the input data is divided into input data blocks, each input data block is convolved in turn with the same convolution group to generate output data of Tm output channels, and this operation is repeated over the n input data blocks and the m convolution groups, with the output data of the Tm output channels accumulated continuously, so as to obtain all the output data of the M output channels. The reconfigurable neural network acceleration method of this embodiment handles neural networks with varying numbers of layers through a data-multiplexing, layer-by-layer acceleration strategy, and has the effects of optimizing the neural network and reducing energy consumption.
Further, when the input buffer unit sequentially transmits the input data blocks, the input data blocks may be sequentially transmitted in the Z-axis direction.
In specific implementation, as shown in fig. 4, the input data has a three-dimensional structure with N input channels (Z-axis direction), the input data of each channel having a size of H × L (XY plane). The input data of each input channel is divided into input data blocks of size Th × Tl, and the n input data blocks are read sequentially along the Z-axis direction and sent to the convolution calculation core unit for the convolution operation. As shown in fig. 4, the 1st to i-th input data blocks are transmitted first, then the (i+1)-th to 2i-th input data blocks, and so on, until the n-th input data block is transmitted last, the block index taking the values 1, 2, …, i, i+1, …, n.
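One possible reading of this Z-axis-first transmission order is sketched below: for a fixed Th × Tl spatial window, the n = N/Tn channel groups are read one after another along the Z axis, and moving to other spatial windows is left to the outer loops. The parameters h0 and l0 that select the window are illustrative and do not appear in the patent.

```python
def z_axis_block_order(inputs, Tn, Th, Tl, h0=0, l0=0):
    """Yield input data blocks in Z-axis-first order for one spatial window (sketch).

    inputs: (N, H, L) NumPy array. Yields the n = N // Tn blocks, each with Tn
    channels and size Th x Tl, consecutive blocks differing only in their
    position along the Z axis (channel dimension).
    """
    N = inputs.shape[0]
    for a in range(0, N, Tn):                           # step along the Z axis
        yield inputs[a:a + Tn, h0:h0 + Th, l0:l0 + Tl]  # one Tn x Th x Tl block
```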
Further, the convolution calculation kernel unit sequentially performs convolution operation on the read input data blocks and the read b-th convolution group, and may sequentially perform convolution operation on Tn input channels of the read input data blocks and Tn convolution channels of the read b-th convolution group, where the Tn input channels and the Tn convolution channels of each convolution kernel perform convolution operation in a one-to-one correspondence.
Further, as shown in fig. 21, the reconfigurable neural network acceleration method further includes:
s305: and judging whether the step length of the current convolution kernel is larger than 1.
S306: if yes, the input data block is mapped to the PE array in an interleaving mode and is subjected to convolution operation with the same convolution kernel.
S307: if not, when the size of the output data block is smaller than that of the input data block, dividing each input data block into W input data small blocks with the same size, re-splicing the input data small blocks at the corresponding positions of the input data blocks, and generating W spliced input data blocks with the same size; and mapping the W spliced input data blocks to a PE array and carrying out convolution operation on the PE array and the same convolution kernel.
The specific implementation process is shown in the implementation process of steps S105-S107 in the first embodiment.
Fig. 22 is a pseudo-code loop expression of the convolution operation of the present embodiment. As shown in fig. 22, in the reconfigurable neural network acceleration method provided in the fifth embodiment, the convolution calculation kernel unit sends the input data blocks of Tn input channels to the input register unit in sequence. Each input data block is multiplied by Tm convolution kernels to produce partial sums of the output data of Tm output channels. Since loop R and loop C are on the inside, the Tm convolution kernels, each with Tn convolution channels, are fully utilized, and the Tn channels of the n input data blocks are traversed, thereby obtaining the partial sums of the Tm output channels (output data size R × C). The Tm partial sums of output data generated from each input data block are retrieved in the loop and accumulated with the Tm partial sums generated from the next input data block, until all the output data of the M output channels are obtained. The loops from inside to outside are as follows. The inner four loops Tm/Tn/Tr/Tc represent the convolution calculation inside the convolution calculation kernel unit, i.e. the calculation on the Convolution Core of fig. 6: Tm output data blocks of size Tr × Tc are calculated from Tn input data blocks of size Th × Tl, and the data multiplexing order is depicted outside the dashed box. Loop R and loop C traverse the remaining portions of the output feature maps, repeating all the operations of the inner loops, so that the weights are fully multiplexed. In loop N, the N input channels are traversed in sequence and the calculation of the inner loops is repeated, completing the accumulation of the output data of the Tm output channels, which are finally stored in the output buffer unit. In loop M, all the input data used before are read in again, completing the calculation of all M convolution kernels, and finally all the output data of the M output channels are obtained.
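For comparison with the output-data-reuse schedule above, the weight-data-reuse loop order (loop M outermost, then loop N, with loops R and C inside so that each weight tile is reused over the whole output plane) can be sketched as follows, again reusing the conv_block helper and assuming stride 1, no padding and exact divisibility by the tile sizes.

```python
import numpy as np

def conv_layer_weight_reuse(inputs, weights, Tn, Tm, Tr, Tc, K):
    """Weight-data-reuse schedule (sketch).

    inputs: (N, H, L); weights: (M, N, K, K). A Tm x Tn x K x K weight tile is
    fetched once and reused over the entire R x C output plane; the partial
    sums of the Tm output channels accumulate in the output buffer across loop N.
    """
    N, H, L = inputs.shape
    M = weights.shape[0]
    R, C = H - K + 1, L - K + 1
    out = np.zeros((M, R, C))                              # output buffer
    for b in range(0, M, Tm):                              # loop M: convolution groups
        for a in range(0, N, Tn):                          # loop N: input-channel groups
            w = weights[b:b + Tm, a:a + Tn]                # weight tile, read once here
            for r in range(0, R, Tr):                      # loop R inside: weights reused
                for c in range(0, C, Tc):                  # loop C inside: weights reused
                    x = inputs[a:a + Tn, r:r + Tr + K - 1, c:c + Tc + K - 1]
                    out[b:b + Tm, r:r + Tr, c:c + Tc] += conv_block(x, w, K)
    return out
```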
In the weight data multiplexing method, the corresponding access coefficient to the Buffer is:
[Access coefficient formula shown as image BDA0001561887610000201 in the original publication; not reproduced in this text.]
the corresponding access coefficients to the DRAM are:
[Access coefficient formula shown as image BDA0001561887610000211 in the original publication; not reproduced in this text.]
the reconfigurable neural network acceleration method provided by the fifth embodiment deals with neural networks with various layers through a layer-by-layer acceleration strategy, and optimizes the neural network acceleration method by using a cyclic transformation method, so that the access times to buffers and a DRAM are reduced, the problem of power consumption waste caused by a large number of times of accessing a memory in the prior art is solved, and the reconfigurable neural network acceleration method has the beneficial effects of reducing energy consumption and maximizing the utilization rate of hardware resources of a PE array.
Example six: based on the same application concept as the reconfigurable neural network acceleration method, the invention also provides a reconfigurable neural network acceleration architecture, which is described in the following. The principle of the reconfigurable neural network acceleration architecture for solving the problems is similar to that of the reconfigurable neural network acceleration method in the fifth embodiment, so that the implementation of the reconfigurable neural network acceleration architecture can refer to the implementation of the reconfigurable neural network acceleration method in the fifth embodiment, and repeated details are not repeated.
As shown in fig. 23, the reconfigurable neural network acceleration architecture provided in this embodiment includes: an input buffer unit 1, a weight buffer unit 2, a convolution calculation kernel unit 3 and an output buffer unit 4;
an input buffer unit 1, configured to divide input data of N input channels into N input data blocks; each of the input data blocks has Tn input channels; sequentially transmitting each input data block; wherein N is N/Tn, and N, N and Tn are positive integers.
The weight cache unit 2 is used for dividing the M convolution kernels into M convolution groups; each said convolution group having Tm convolution kernels, each said convolution kernel having N convolution channels; sequentially transmitting each convolution group; wherein M is M/Tm, and M, M, Tm and N are positive integers.
The convolution calculation kernel unit 3 is used for sequentially performing convolution operation on each read input data block and the read b-th group of convolution groups to generate output data blocks with Tm output channels until the convolution operation on the nth input data block is completed, and generating output data of the Tm output channels; accumulating part of channels and output data fed back by the output cache unit and the generated output data of the Tm output channels, and sending the accumulated part of channels and output data; and the feedback partial channel and output data are generated by sequentially convolving the 1 st to (b-1) th convolution groups read before the b-th convolution group with each input data block and accumulating the convolution groups.
The output buffer unit 4 is used for storing the received accumulated partial channels and output data into partial channels and output data and feeding back the stored partial channels and output data to the convolution calculation core unit; when b is M, the output buffer unit stores all output data of M output channels; wherein b is not more than m, and b is a positive integer.
Further, the input buffer unit is specifically configured to sequentially send each input data block along the Z-axis direction.
Further, as shown in fig. 23, the convolution computation core unit 3 includes: an input register unit 31, a calculation engine unit 32, and an output register unit 33.
An input register unit 31, configured to read each input data block from the input buffer unit one by one, and send the input data block to the calculation engine unit.
And the calculation engine unit 32 is configured to perform convolution operations on the Tn input channels of the read input data blocks sequentially with the Tn convolution channels of the Tm convolution kernels of the read group b convolution group, generate output data blocks with Tm output channels, and send the generated output data of the Tm output channels until the convolution operations with the nth input data block are completed.
The output register unit 33 is configured to accumulate the partial channels and the output data fed back by the output buffer unit and the generated output data of the Tm output channels, and send the accumulated partial channels and output data; and the feedback partial channel and output data are generated by sequentially convolving the 1 st to (b-1) th convolution groups read before the b-th convolution group with each input data block and accumulating the convolution groups.
The reconfigurable neural network acceleration method and architecture provided by this embodiment adopt an architecture of an input cache unit, a weight cache unit, a convolution calculation kernel unit and an output cache unit together with a weight data multiplexing method, handle neural networks with varying numbers of layers through a layer-by-layer acceleration strategy, and optimize the acceleration with a loop transformation method, so that the number of accesses to the Buffer and the DRAM is reduced, the prior-art problem of power wasted by frequent memory accesses is solved, energy consumption is lowered, and the utilization of the PE array's hardware resources is maximized.
The reconfigurable neural network acceleration method and architecture of the present invention adopt an architecture of an input cache unit, a weight cache unit, a convolution calculation kernel unit and an output cache unit, respectively applying the input data multiplexing, output data multiplexing and weight data multiplexing methods, handle neural networks with varying numbers of layers through a layer-by-layer acceleration strategy, and optimize the acceleration with a loop transformation method, so that the number of accesses to the Buffer and the DRAM is reduced, the prior-art problem of power wasted by frequent memory accesses is solved, energy consumption is lowered, and the utilization of the PE array's hardware resources is maximized.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and the implementation mode of the invention are explained by applying specific embodiments in the invention, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (18)

1. A reconfigurable neural network acceleration method, comprising:
the input buffer unit divides input data of N input channels into N input data blocks; each of the input data blocks has Tn input channels; sequentially transmitting each input data block; wherein N is N/Tn, N, N and Tn are positive integers;
the weight cache unit divides the M convolution kernels into M convolution groups; each said convolution group having Tm convolution kernels, each said convolution kernel having N convolution channels; sequentially transmitting each convolution group; wherein M is M/Tm, M, M, Tm and N are positive integers;
the convolution calculation kernel unit carries out convolution operation on the Tn input channels of the read a-th input data block and the Tn convolution channels of each convolution group in sequence to generate output data blocks with Tm output channels; generating an output data block with M output channels until the convolution operation with the mth group of convolution groups is completed; performing convolution operation on the Tn input channels and the Tn convolution channels of each convolution kernel in a one-to-one correspondence mode; accumulating the storage output data blocks of the M output channels fed back by the output cache unit and the generated output data blocks of the M output channels, and sending the accumulated output data blocks; the storage output data block is generated by sequentially convolving the 1 st to (a-1) th input data blocks read before the a-th input data block is read with each group of convolution groups and then accumulating the convolution groups;
the output cache unit stores the received accumulated output data blocks into storage output data blocks of M output channels, and feeds back the storage output data blocks to the convolution calculation core unit; when a is n, the output buffer unit stores all output data of M output channels; wherein a is not more than n, and a is a positive integer.
2. The method of claim 1, wherein the sequentially sending the input data blocks comprises: and sequentially sending each input data block along the Z-axis direction.
3. The reconfigurable neural network acceleration method of claim 2, further comprising:
judging whether the step length of the current convolution kernel is larger than 1;
if yes, mapping the input data block to the PE array in a staggered manner, and performing convolution operation on the PE array and the same convolution kernel;
if not, when the size of the output data block is smaller than that of the input data block, dividing each input data block into W input data small blocks with the same size, re-splicing the input data small blocks at the corresponding positions of the input data blocks, and generating W spliced input data blocks with the same size; and mapping the W spliced input data blocks to a PE array and carrying out convolution operation on the PE array and the same convolution kernel.
4. A reconfigurable neural network acceleration architecture, comprising: the device comprises an input cache unit, a weight cache unit, a convolution calculation kernel unit and an output cache unit;
the input buffer unit is used for dividing the input data of the N input channels into N input data blocks; each of the input data blocks has Tn input channels; sequentially transmitting each input data block; wherein N is N/Tn, N, N and Tn are positive integers;
the weight cache unit is used for dividing the M convolution kernels into M convolution groups; each said convolution group having Tm convolution kernels, each said convolution kernel having N convolution channels; sequentially transmitting each convolution group; wherein M is M/Tm, M, M, Tm and N are positive integers;
the convolution calculation kernel unit is used for sequentially performing convolution operation on the Tn input channels of the read a-th input data block and the Tn convolution channels of each convolution group to generate output data blocks with Tm output channels; performing convolution operation on the Tn input channels and the Tn convolution channels of each convolution kernel in a one-to-one correspondence mode; generating an output data block with M output channels until the convolution operation with the mth group of convolution groups is completed; accumulating the storage output data blocks of the M output channels fed back by the output cache unit and the generated output data blocks of the M output channels, and sending the accumulated output data blocks; the storage output data block is generated by sequentially convolving the 1 st to (a-1) th input data blocks read before the a-th input data block is read with each group of convolution groups and then accumulating the convolution groups;
the output cache unit is used for storing the received accumulated output data blocks into storage output data blocks of M output channels and feeding back the storage output data blocks to the convolution calculation core unit; when a is n, the output buffer unit stores all output data of M output channels; wherein a is not more than n, and a is a positive integer.
5. The reconfigurable neural network acceleration architecture of claim 4, wherein the input cache unit is specifically configured to: and sequentially sending each input data block along the Z-axis direction.
6. The reconfigurable neural network acceleration architecture of claim 4, wherein the convolution computation kernel unit comprises: the input register unit, the calculation engine unit and the output register unit;
the input register unit is used for reading the a-th input data block from the input buffer unit and sending the a-th input data block to the calculation engine unit;
the calculation engine unit is used for sequentially carrying out convolution operation on the Tn input channels of the read a-th input data block and the Tn convolution channels of the Tm convolution kernels of each convolution group to generate output data blocks with Tm output channels, and sending and generating the output data blocks with M output channels until the convolution operation with the mth group of convolution groups is finished;
the output register unit is used for accumulating the storage output data blocks of the M output channels fed back by the output cache unit and the generated output data blocks of the M output channels and sending the accumulated output data blocks; and the storage output data block is generated by sequentially convolving the 1 st to (a-1) th input data blocks read before the a-th input data block is read with the convolution groups and accumulating the convolution groups.
7. A reconfigurable neural network acceleration method, comprising:
the input buffer unit divides input data of N input channels into N input data blocks; each of the input data blocks has Tn input channels; sequentially transmitting each input data block; wherein N is N/Tn, N, N and Tn are positive integers;
the weight cache unit divides the M convolution kernels into M convolution groups; each said convolution group having Tm convolution kernels, each said convolution kernel having N convolution channels; sequentially transmitting each convolution group; wherein M is M/Tm, M, M, Tm and N are positive integers;
the convolution calculation kernel unit sequentially performs convolution operation on the Tn input channels of the read input data blocks and the Tn convolution channels of the read group b convolution group to generate output data blocks with Tm output channels, and generates all output data of the Tm output channels until the convolution operation on the nth input data block is completed; performing convolution operation on the Tn input channels and the Tn convolution channels of each convolution kernel in a one-to-one correspondence mode; accumulating part of channels and output data stored by the convolution calculation core unit and all output data of the generated Tm output channels to generate and store the accumulated part of channels and output data; the stored partial channel and output data are generated by sequentially convolving the 1 st to (b-1) th convolution groups read before the b-th convolution group with each input data block and accumulating the convolution groups; when b is M, the convolution calculation kernel unit sends the accumulated output data of the M channels; wherein b is not more than m, and b is a positive integer; pooling the received output data of the M output channels, and sending the pooled output data;
and the output buffer unit receives and stores the pooled output data and generates the pooled output data of the M output channels.
8. The method of claim 7, wherein the sequentially transmitting the input data blocks comprises: and sequentially transmitting each input data block along an XY plane.
9. The reconfigurable neural network acceleration method of claim 7, further comprising:
judging whether the step length of the current convolution kernel is larger than 1;
if yes, mapping the input data block to the PE array in a staggered manner, and performing convolution operation on the PE array and the same convolution kernel;
if not, when the size of the output data block is smaller than that of the input data block, dividing each input data block into W input data small blocks with the same size, re-splicing the input data small blocks at the corresponding positions of the input data blocks, and generating W spliced input data blocks with the same size; and mapping the W spliced input data blocks to a PE array and carrying out convolution operation on the PE array and the same convolution kernel.
10. A reconfigurable neural network acceleration architecture, comprising: the device comprises an input cache unit, a weight cache unit, a convolution calculation kernel unit and an output cache unit;
the input buffer unit is used for dividing the input data of the N input channels into N input data blocks; each of the input data blocks has Tn input channels; sequentially transmitting each input data block; wherein N is N/Tn, N, N and Tn are positive integers;
the weight cache unit is used for dividing the M convolution kernels into M convolution groups; each said convolution group having Tm convolution kernels, each said convolution kernel having N convolution channels; sequentially transmitting each convolution group; wherein M is M/Tm, M, M, Tm and N are positive integers;
the convolution calculation kernel unit is used for sequentially carrying out convolution operation on the Tn input channels of the read input data blocks and the Tn convolution channels of the read group b convolution group to generate output data blocks with Tm output channels until the convolution operation with the nth input data block is finished, and generating all output data of the Tm output channels; performing convolution operation on the Tn input channels and the Tn convolution channels of each convolution kernel in a one-to-one correspondence mode; accumulating part of channels and output data stored by the convolution calculation core unit and all output data of the generated Tm output channels to generate and store the accumulated part of channels and output data; the stored partial channel and output data are generated by sequentially convolving the 1 st to (b-1) th convolution groups read before the b-th convolution group with each input data block and accumulating the convolution groups; when b is M, the convolution calculation kernel unit sends the accumulated output data of the M channels; wherein b is not more than m, and b is a positive integer; pooling the received output data of the M output channels, and sending the pooled output data;
and the output buffer unit is used for receiving and storing the pooled output data and generating the pooled output data of the M output channels.
11. The reconfigurable neural network acceleration architecture of claim 10, wherein the input cache unit is specifically configured to: and sequentially transmitting each input data block along an XY plane.
12. The reconfigurable neural network acceleration architecture of claim 10, wherein the convolution computation kernel unit comprises: the device comprises an input register unit, a calculation engine unit, an output register unit and a pooling unit;
the input register unit is used for reading each input data block from the input cache unit one by one and sending the input data block to the calculation engine unit;
the calculation engine unit is used for sequentially performing convolution operation on the Tn input channels of the read input data blocks and the Tn convolution channels of the Tm convolution kernels of the read group b convolution group to generate output data blocks with Tm output channels until the convolution operation on the nth input data block is completed, and generating all output data of the Tm output channels; accumulating the partial channels and the output data fed back by the output register unit and all the output data of the generated Tm output channels, and generating and sending the accumulated partial channels and the output data; the partial channel and the output data fed back by the output register unit are generated by sequentially convolving the 1 st to (b-1) th convolution groups read before the b-th convolution group with each input data block and accumulating the convolution groups;
the output register unit is used for storing the received accumulated partial channels and output data into partial channels and output data and feeding back the stored partial channels and output data to the calculation engine unit; when b is M, the output register unit sends the accumulated output data of the M channels; wherein b is not more than m, and b is a positive integer;
and the pooling unit is used for pooling the received output data of the M output channels and sending the pooled output data.
13. A reconfigurable neural network acceleration method, comprising:
the input buffer unit divides input data of N input channels into N input data blocks; each of the input data blocks has Tn input channels; sequentially transmitting each input data block; wherein N is N/Tn, N, N and Tn are positive integers;
the weight cache unit divides the M convolution kernels into M convolution groups; each said convolution group having Tm convolution kernels, each said convolution kernel having N convolution channels; sequentially transmitting each convolution group; wherein M is M/Tm, M, M, Tm and N are positive integers;
the convolution calculation kernel unit sequentially performs convolution operation on the Tn input channels of the read input data blocks and the Tn convolution channels of the read group b convolution group to generate output data blocks with Tm output channels, and generates output data of the Tm output channels until the convolution operation on the nth input data block is completed; performing convolution operation on the Tn input channels and the Tn convolution channels of each convolution kernel in a one-to-one correspondence mode; accumulating part of channels and output data fed back by the output cache unit and the generated output data of the Tm output channels, and sending the accumulated part of channels and output data; the feedback partial channel and the output data are generated by sequentially convolving the 1 st to (b-1) th convolution groups read before the b-th convolution group with each input data block and accumulating the convolution groups;
the output buffer unit stores the received accumulated partial channels and output data into partial channels and output data, and feeds back the stored partial channels and output data to the convolution calculation core unit; when b is M, the output buffer unit stores all output data of M output channels; wherein b is not more than m, and b is a positive integer.
14. The method of claim 13, wherein the sequentially transmitting the input data blocks comprises: and sequentially sending each input data block along the Z-axis direction.
15. The reconfigurable neural network acceleration method of claim 14, further comprising:
judging whether the step length of the current convolution kernel is larger than 1;
if yes, mapping the input data block to the PE array in a staggered manner, and performing convolution operation on the PE array and the same convolution kernel;
if not, when the size of the output data block is smaller than that of the input data block, dividing each input data block into W input data small blocks with the same size, re-splicing the input data small blocks at the corresponding positions of the input data blocks, and generating W spliced input data blocks with the same size; and mapping the W spliced input data blocks to a PE array and carrying out convolution operation on the PE array and the same convolution kernel.
16. A reconfigurable neural network acceleration architecture, comprising: the device comprises an input cache unit, a weight cache unit, a convolution calculation kernel unit and an output cache unit;
the input buffer unit is used for dividing the input data of the N input channels into N input data blocks; each of the input data blocks has Tn input channels; sequentially transmitting each input data block; wherein N is N/Tn, N, N and Tn are positive integers;
the weight cache unit is used for dividing the M convolution kernels into M convolution groups; each said convolution group having Tm convolution kernels, each said convolution kernel having N convolution channels; sequentially transmitting each convolution group; wherein M is M/Tm, M, M, Tm and N are positive integers;
the convolution calculation kernel unit is used for sequentially carrying out convolution operation on the Tn input channels of the read input data blocks and the Tn convolution channels of the read group b convolution group to generate output data blocks with Tm output channels until the convolution operation with the nth input data block is finished, and generating output data of the Tm output channels; performing convolution operation on the Tn input channels and the Tn convolution channels of each convolution kernel in a one-to-one correspondence mode; accumulating part of channels and output data fed back by the output cache unit and the generated output data of the Tm output channels, and sending the accumulated part of channels and output data; the feedback partial channel and the output data are generated by sequentially convolving the 1 st to (b-1) th convolution groups read before the b-th convolution group with each input data block and accumulating the convolution groups;
the output buffer unit is used for storing the received accumulated partial channels and output data into partial channels and output data and feeding back the stored partial channels and output data to the convolution calculation core unit; when b is M, the output buffer unit stores all output data of M output channels; wherein b is not more than m, and b is a positive integer.
17. The reconfigurable neural network acceleration architecture of claim 16, wherein the input cache unit is specifically configured to: and sequentially sending each input data block along the Z-axis direction.
18. The reconfigurable neural network acceleration architecture of claim 16, wherein the convolution computation kernel unit comprises: the input register unit, the calculation engine unit and the output register unit;
the input register unit is used for reading each input data block from the input cache unit one by one and sending the input data block to the calculation engine unit;
the calculation engine unit is used for sequentially carrying out convolution operation on the Tn input channels of the read input data blocks and the Tn convolution channels of the Tm convolution kernels of the read group b convolution group, generating output data blocks with Tm output channels until the convolution operation with the nth input data block is finished, and sending the generated output data of the Tm output channels;
the output register unit is used for accumulating the partial channels and the output data fed back by the output cache unit and the generated output data of the Tm output channels and sending the accumulated partial channels and the output data; and the feedback partial channel and output data are generated by sequentially convolving the 1 st to (b-1) th convolution groups read before the b-th convolution group with each input data block and accumulating the convolution groups.
CN201810084089.2A 2018-01-29 2018-01-29 Reconfigurable neural network acceleration method and architecture Active CN108241890B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810084089.2A CN108241890B (en) 2018-01-29 2018-01-29 Reconfigurable neural network acceleration method and architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810084089.2A CN108241890B (en) 2018-01-29 2018-01-29 Reconfigurable neural network acceleration method and architecture

Publications (2)

Publication Number Publication Date
CN108241890A CN108241890A (en) 2018-07-03
CN108241890B true CN108241890B (en) 2021-11-23

Family

ID=62698691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810084089.2A Active CN108241890B (en) 2018-01-29 2018-01-29 Reconfigurable neural network acceleration method and architecture

Country Status (1)

Country Link
CN (1) CN108241890B (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110716751B (en) * 2018-07-12 2022-10-18 赛灵思公司 High-parallelism computing platform, system and computing implementation method
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm
WO2020041962A1 (en) * 2018-08-28 2020-03-05 深圳鲲云信息科技有限公司 Parallel deconvolutional calculation method, single-engine calculation method and related product
CN110865950B (en) * 2018-08-28 2021-01-12 中科寒武纪科技股份有限公司 Data preprocessing method and device, computer equipment and storage medium
CN109284824B (en) * 2018-09-04 2021-07-23 复旦大学 Reconfigurable technology-based device for accelerating convolution and pooling operation
US11367498B2 (en) * 2018-09-07 2022-06-21 Black Sesame Technologies Inc. Multi-level memory hierarchy
CN109447257B (en) * 2018-09-18 2021-08-17 复旦大学 Operation device of deep neural network acceleration chip with self-organized channels
CN109447241B (en) * 2018-09-29 2022-02-22 西安交通大学 Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
CN109359735B (en) * 2018-11-23 2020-12-04 浙江大学 Data input device and method for accelerating deep neural network hardware
CN109598338B (en) * 2018-12-07 2023-05-19 东南大学 Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization
CN109740732B (en) * 2018-12-27 2021-05-11 深圳云天励飞技术有限公司 Neural network processor, convolutional neural network data multiplexing method and related equipment
CN109711367B (en) * 2018-12-29 2020-03-06 中科寒武纪科技股份有限公司 Operation method, device and related product
CN111523652B (en) * 2019-02-01 2023-05-02 阿里巴巴集团控股有限公司 Processor, data processing method thereof and image pickup device
CN110110849B (en) * 2019-04-29 2023-04-07 西安电子科技大学 Line fixed data stream mapping method based on graph segmentation
CN110390384B (en) * 2019-06-25 2021-07-06 东南大学 Configurable general convolutional neural network accelerator
CN110414672B (en) * 2019-07-23 2022-11-01 江苏鼎速网络科技有限公司 Convolution operation method, device and system
CN112308217B (en) * 2019-07-31 2024-06-04 北京欣奕华科技有限公司 Convolutional neural network acceleration method and system
CN110516801B (en) * 2019-08-05 2022-04-22 西安交通大学 High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN110490302B (en) * 2019-08-12 2022-06-07 中科寒武纪科技股份有限公司 Neural network compiling and optimizing method and device and related products
CN110533177B (en) * 2019-08-22 2023-12-26 安谋科技(中国)有限公司 Data read-write device, method, equipment, medium and convolution accelerator
CN111126593B (en) * 2019-11-07 2023-05-05 复旦大学 Reconfigurable natural language deep convolutional neural network accelerator
CN111199273B (en) * 2019-12-31 2024-03-26 深圳云天励飞技术有限公司 Convolution calculation method, device, equipment and storage medium
CN111258574B (en) * 2020-01-14 2021-01-15 中科驭数(北京)科技有限公司 Programming method and system for accelerator architecture
TWI733334B (en) 2020-02-15 2021-07-11 財團法人工業技術研究院 Convolutional neural-network calculating apparatus and operation methods thereof
CN111427895B (en) * 2020-04-01 2022-10-25 西安交通大学 Neural network reasoning acceleration method based on two-segment cache
CN111610963B (en) * 2020-06-24 2021-08-17 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN111859797A (en) * 2020-07-14 2020-10-30 Oppo广东移动通信有限公司 Data processing method and device and storage medium
CN112580774B (en) * 2020-09-01 2022-10-21 浙江大学 Neural network layout method for reconfigurable neural network processor
CN112801277B (en) * 2021-02-08 2024-11-08 清华大学 Data processing method, processor, chip and electronic equipment
CN114089911B (en) * 2021-09-07 2024-01-05 上海新氦类脑智能科技有限公司 Block segmentation and splicing processing method, device, equipment and medium based on data multiplexing
CN116306840A (en) * 2021-12-03 2023-06-23 中兴通讯股份有限公司 Neural network operation method, device, chip, electronic equipment and storage medium
CN118761900A (en) * 2024-09-06 2024-10-11 西安邮电大学 A cache structure in a self-reconfigurable and self-evolving AI chip and its chip

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250103A (en) * 2016-08-04 2016-12-21 东南大学 A kind of convolutional neural networks cyclic convolution calculates the system of data reusing

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250103A (en) * 2016-08-04 2016-12-21 东南大学 A kind of convolutional neural networks cyclic convolution calculates the system of data reusing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks; Ma Y, Cao Y, Vrudhula S, et al.; Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays; 20171231; full text *
Research on the parallel architecture of convolutional neural networks based on FPGA; Lu Zhijian; China Doctoral Dissertations Full-text Database, Information Science and Technology; 20140415 (No. 04); full text *
Research on key technologies of reconfigurable accelerators for deep learning algorithms; Liu Zhiqiang; China Master's Theses Full-text Database, Information Science and Technology; 20170315 (No. 03); full text *

Also Published As

Publication number Publication date
CN108241890A (en) 2018-07-03

Similar Documents

Publication Publication Date Title
CN108241890B (en) Reconfigurable neural network acceleration method and architecture
CN111178519B (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
JP7474586B2 (en) Tensor Computation Data Flow Accelerator Semiconductor Circuit
KR102443546B1 (en) matrix multiplier
Zhang et al. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
CN111461311B (en) Convolutional neural network operation acceleration method and device based on many-core processor
CN104899182B (en) A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks
CN106445471A (en) Processor and method for executing matrix multiplication on processor
CN109948774A (en) Neural network accelerator and its implementation based on network layer binding operation
CN108170640B (en) Neural network operation device and operation method using same
KR20180123846A (en) Logical-3d array reconfigurable accelerator for convolutional neural networks
CN113469350B (en) Deep convolutional neural network acceleration method and system suitable for NPU
CN112633490B (en) Data processing device, method and related product for executing neural network model
TW202123093A (en) Method and system for performing convolution operation
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN116881618B (en) General matrix multiplication calculation optimization method, device and processor
Li et al. Fsa: A fine-grained systolic accelerator for sparse cnns
CN110414672B (en) Convolution operation method, device and system
CN106484532A (en) GPGPU parallel calculating method towards SPH fluid simulation
CN111522776B (en) Computing architecture
Gao et al. Revisiting thread configuration of SpMV kernels on GPU: A machine learning based approach
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant