CN113298241A - Deep separable convolutional neural network acceleration method and accelerator

Info

Publication number
CN113298241A
Authority
CN
China
Prior art keywords
data
convolution
channel
input
row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110851351.3A
Other languages
Chinese (zh)
Other versions
CN113298241B (en)
Inventor
李肖飞
雍珊珊
张兴
王新安
李秋平
刘焕双
郭朋非
高金潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN202110851351.3A priority Critical patent/CN113298241B/en
Publication of CN113298241A publication Critical patent/CN113298241A/en
Application granted granted Critical
Publication of CN113298241B publication Critical patent/CN113298241B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Optimization (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a method for accelerating a depthwise separable convolutional neural network, comprising: performing depthwise convolution on the input neurons, wherein, during the depthwise convolution, the same M rows of data from C input channels are computed independently and in parallel in a three-dimensional processing element (PE) array to obtain the same N rows of output neurons for the C channels, where N is less than M; and performing pointwise convolution on the output neurons produced by the depthwise convolution, wherein, during the pointwise convolution, each row of data of the C channels is computed independently and in parallel. By allocating the hardware resources of the depthwise convolution and the pointwise convolution appropriately, lightweight neural network models that use depthwise separable convolution can be supported efficiently; data reuse is fully exploited to reduce accesses to external memory, and basic computing units capable of zero-skipping are used, so that power consumption is greatly reduced.

Description

Deep separable convolutional neural network acceleration method and accelerator
Technical Field
The present invention relates to the technical field of depthwise separable convolutional neural networks, and in particular to a deep separable convolutional neural network acceleration method and accelerator.
Background
Convolutional neural networks (CNNs) have achieved strong performance in computer vision tasks such as image classification and object recognition. Because of their high accuracy, CNNs are widely used in autonomous vehicles, Internet-of-Things devices, and robot vision. These applications typically require the CNN to operate under limited hardware resources and low power consumption. This poses a significant challenge, because CNN models typically require millions of parameters and computations; designing lightweight neural networks is therefore important. In recent years there has been increasing interest in small and compact CNN models, which further reduce the computational requirements. Recent CNN models such as Xception and MobileNet use depthwise separable convolution to reduce the amount of computation. For example, MobileNetV1 reports an 8-9x reduction in computation at the cost of only about 1% in accuracy. A depthwise separable convolution consists of a depthwise convolution and a pointwise convolution. In the depthwise convolution, each convolution kernel is responsible for one channel and each channel is convolved by only one kernel, so the number of feature-map channels produced is identical to the number of input channels. The pointwise convolution is computed much like a conventional convolution, with a kernel of size 1 x 1 x D, where D is the number of channels of the previous layer; it therefore combines the feature maps of the previous step with weights along the depth direction to produce new feature maps. Although various neural network accelerators have been proposed, many of them are not optimized for depthwise separable convolution.
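For background on the computation reduction mentioned above, the following is a minimal NumPy sketch (illustrative only; it is not the patented hardware or any specific library API) of a depthwise separable convolution, i.e. a per-channel depthwise pass followed by a 1 x 1 pointwise pass, together with a MAC-count comparison against a standard convolution of the same shape. The channel counts match the MobileNetV1 layer discussed later; the spatial size is reduced so the sketch runs quickly.

```python
# Minimal NumPy sketch of a depthwise separable convolution (illustrative only).
import numpy as np

def depthwise_separable_conv(x, dw_filters, pw_filters):
    """x: (C, H, W) input; dw_filters: (C, K, K) depthwise kernels; pw_filters: (D, C) 1x1 kernels."""
    C, H, W = x.shape
    K = dw_filters.shape[1]
    Ho, Wo = H - K + 1, W - K + 1
    # Depthwise: each channel is convolved only with its own K x K kernel.
    dw_out = np.zeros((C, Ho, Wo))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                dw_out[c, i, j] = np.sum(x[c, i:i + K, j:j + K] * dw_filters[c])
    # Pointwise: 1 x 1 x C kernels combine the depthwise outputs along the channel (depth) axis.
    return np.tensordot(pw_filters, dw_out, axes=([1], [0]))   # shape (D, Ho, Wo)

C, H, W, K, D = 32, 16, 16, 3, 64      # 32 input / 64 output channels as in the example layer; small H, W
out = depthwise_separable_conv(np.random.rand(C, H, W),
                               np.random.rand(C, K, K),
                               np.random.rand(D, C))
Ho, Wo = H - K + 1, W - K + 1
standard_macs = D * C * K * K * Ho * Wo                    # one standard D x C x K x K layer
separable_macs = C * K * K * Ho * Wo + D * C * Ho * Wo     # depthwise + pointwise
print(out.shape, round(standard_macs / separable_macs, 1)) # roughly 8x fewer MACs for this layer
```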
Disclosure of Invention
The main object of the present invention is to provide a deep separable convolutional neural network acceleration method and accelerator, so as to solve the technical problem that depthwise separable convolution is not optimized in the prior art.
To achieve the above object, the present invention provides a method for accelerating a deep separable convolutional neural network, comprising: performing depthwise convolution on the input neurons, wherein, during the depthwise convolution, the same M rows of data from C input channels are computed independently and in parallel in a three-dimensional processing element (PE) array to obtain the same N rows of output neurons for the C channels, where N is less than M; and performing pointwise convolution on the output neurons produced by the depthwise convolution, wherein, during the pointwise convolution, each row of data of the C channels is computed independently and in parallel.
Optionally, when the depthwise convolution computes each channel, the rows of the filter corresponding to that channel are assigned to the PE columns, one filter row per column; the input neurons are assigned to the PE array row by row in a diagonal fashion.
Optionally, the pointwise convolution uses vertical systolic flow of the input neurons: after a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse; each row is equipped with an adder tree that accumulates the products of the multipliers in that row; and a first-in first-out (FIFO) queue outside the multiplier array temporarily stores the input neuron data flowing out of the array and supplies the input neurons for the next stage of the pointwise computation.
Optionally, the pointwise convolution specifically comprises:
the pointwise convolution has N planar multiplier arrays, and each planar multiplier array independently computes, in parallel, one of the same rows of data of the C channels; when each row of data of the different channels is computed, the input neuron data of column 1 of the multiplier array are the output neurons produced by the depthwise convolution for channel 1, the input neuron data of column 2 are the output neurons produced by the depthwise convolution for channel 2, and so on;
the filter parameters fed into row 1 of the multiplier array are the filter parameters of output channel 1, the filter parameters fed into row 2 are those of output channel 2, and so on;
an adder tree at the end of each row of the multiplier array accumulates the products of that row into a partial sum and stores it in a partial-sum buffer;
after a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse, and once a data value has passed through the last row of multipliers it is written into the FIFO;
when the depthwise convolution has sent all of the same N rows of output neurons of the C channels to the pointwise convolution, the depthwise convolution is paused from providing new data to the pointwise convolution, the filter parameters of the pointwise multiplier arrays begin to be updated in sequence, and the depthwise-convolution output neurons of the different channels held in the FIFO are taken out again and flow through the multiplier arrays in systolic fashion, generating new partial sums;
the partial sums previously stored at the corresponding positions of each row are read out of the partial-sum buffer and added to the new partial sums, whereby one row of partial sums of all output channels is obtained, and the N planar multiplier arrays compute the first N rows of partial sums of all output channels;
the depthwise convolution is then directed to start supplying the next batch of output neurons to the pointwise convolution, the above computation is repeated, and finally the output neurons of all rows of all output channels are obtained.
Optionally, both the depthwise convolution and the pointwise convolution use a channel-first data pattern. For the depthwise convolution, the channel-first pattern means that the next M rows of input neuron data are not computed until the same M rows of data of all input channels have been computed; for the pointwise convolution, the next batch of input neurons enters the computation only after the current batch of input neurons has been computed against all of the corresponding filter channels.
The present invention also provides a deep separable convolutional neural network accelerator, comprising:
a depthwise convolution processing unit, configured to perform depthwise convolution on the input neurons, wherein, during the depthwise convolution, the same M rows of data from C input channels are computed independently and in parallel in a three-dimensional processing element (PE) array to obtain the same N rows of output neurons for the C channels, where N is less than M; and
a pointwise convolution processing unit, configured to perform pointwise convolution on the output neurons produced by the depthwise convolution, wherein, during the pointwise convolution, each row of data of the C channels is computed independently and in parallel.
Optionally, the depthwise convolution processing unit comprises a plurality of planar PE arrays, each planar PE array computing one channel; for each channel, the rows of the filter corresponding to that channel are assigned to the PE columns, one filter row per column, and the input neurons are assigned to the PE array row by row in a diagonal fashion.
Optionally, the pointwise convolution processing unit comprises a plurality of planar multiplier arrays, a plurality of adder trees, and a first-in first-out (FIFO) queue, wherein:
after a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse;
each row is equipped with an adder tree that accumulates the products of the multipliers in that row; and
the FIFO queue, arranged outside the multiplier arrays, temporarily stores the input neuron data flowing out of the arrays and supplies the input neurons for the next stage of the pointwise computation.
Optionally, the pointwise convolution processing unit has N planar multiplier arrays, and each planar multiplier array independently computes, in parallel, one of the same rows of data of the C channels; when each row of data of the different channels is computed, the input neuron data of column 1 of the multiplier array are the output neurons produced by the depthwise convolution for channel 1, the input neuron data of column 2 are the output neurons produced by the depthwise convolution for channel 2, and so on;
the filter parameters fed into row 1 of the multiplier array are the filter parameters of output channel 1, the filter parameters fed into row 2 are those of output channel 2, and so on;
an adder tree at the end of each row of the multiplier array accumulates the products of that row into a partial sum and stores it in a partial-sum buffer;
after a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse, and once a data value has passed through the last row of multipliers it is written into the FIFO;
when the depthwise convolution has sent all of the same N rows of output neurons of the C channels to the pointwise convolution, the depthwise convolution is paused from providing new data to the pointwise convolution, the filter parameters of the pointwise multiplier arrays begin to be updated in sequence, and the depthwise-convolution output neurons of the different channels held in the FIFO are taken out again and flow through the multiplier arrays in systolic fashion, generating new partial sums;
the partial sums previously stored at the corresponding positions of each row are read out of the partial-sum buffer and added to the new partial sums, whereby one row of partial sums of all output channels is obtained, and the N planar multiplier arrays compute the first N rows of partial sums of all output channels;
the depthwise convolution is then directed to start supplying the next batch of output neurons to the pointwise convolution, the above computation is repeated, and finally the output neurons of all rows of all output channels are obtained.
Optionally, both the depthwise convolution and the pointwise convolution use a channel-first data pattern. For the depthwise convolution, the channel-first pattern means that the next M rows of input neuron data are not computed until the same M rows of data of all input channels have been computed; for the pointwise convolution, the next batch of input neurons enters the computation only after the current batch of input neurons has been computed against all of the corresponding filter channels.
With the acceleration method for a deep separable convolutional neural network according to the invention, by allocating the hardware resources of the depthwise convolution and the pointwise convolution appropriately, lightweight neural network models that use depthwise separable convolution can be supported efficiently; data reuse is fully exploited to reduce accesses to external memory, and basic computing units capable of zero-skipping are used, so that power consumption is greatly reduced.
Additional features and advantages of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart illustrating a method for accelerating a deep separable convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a deep separable convolutional neural network accelerator according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the PE computing unit of the depthwise convolution according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the multiplier computing unit of the pointwise convolution according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
Referring to fig. 1, a deep separable convolutional neural network acceleration method according to an embodiment of the present invention is shown, including:
s101, carrying out deep convolution on input neurons, wherein when the deep convolution is calculated, the same M rows of C input channels are independently and parallelly calculated in a three-dimensional processing unit PE array to obtain the same N rows of output neurons of the C channels, and N is less than M;
and S102, performing point convolution on the output neurons obtained by the deep convolution, and independently performing parallel computation on each row of data of the C channel during the point convolution computation.
Optionally, when the depthwise convolution computes each channel, the rows of the filter corresponding to that channel are assigned to the PE columns, one filter row per column; the input neurons are assigned to the PE array row by row in a diagonal fashion.
Optionally, the pointwise convolution uses vertical systolic flow of the input neurons: after a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse; each row is equipped with an adder tree that accumulates the products of the multipliers in that row; and a first-in first-out (FIFO) queue outside the multiplier array temporarily stores the input neuron data flowing out of the array and supplies the input neurons for the next stage of the pointwise computation.
Optionally, the pointwise convolution specifically comprises:
the pointwise convolution has N planar multiplier arrays, and each planar multiplier array independently computes, in parallel, one of the same rows of data of the C channels; when each row of data of the different channels is computed, the input neuron data of column 1 of the multiplier array are the output neurons produced by the depthwise convolution for channel 1, the input neuron data of column 2 are the output neurons produced by the depthwise convolution for channel 2, and so on;
the filter parameters fed into row 1 of the multiplier array are the filter parameters of output channel 1, the filter parameters fed into row 2 are those of output channel 2, and so on;
an adder tree at the end of each row of the multiplier array accumulates the products of that row into a partial sum and stores it in a partial-sum buffer;
after a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse, and once a data value has passed through the last row of multipliers it is written into the FIFO;
when the depthwise convolution has sent all of the same N rows of output neurons of the C channels to the pointwise convolution, the depthwise convolution is paused from providing new data to the pointwise convolution, the filter parameters of the pointwise multiplier arrays begin to be updated in sequence, and the depthwise-convolution output neurons of the different channels held in the FIFO are taken out again and flow through the multiplier arrays in systolic fashion, generating new partial sums;
the partial sums previously stored at the corresponding positions of each row are read out of the partial-sum buffer and added to the new partial sums, whereby one row of partial sums of all output channels is obtained, and the N planar multiplier arrays compute the first N rows of partial sums of all output channels;
the depthwise convolution is then directed to start supplying the next batch of output neurons to the pointwise convolution, the above computation is repeated, and finally the output neurons of all rows of all output channels are obtained.
Optionally, both the depthwise convolution and the pointwise convolution use a channel-first data pattern. For the depthwise convolution, the channel-first pattern means that the next M rows of input neuron data are not computed until the same M rows of data of all input channels have been computed; for the pointwise convolution, the next batch of input neurons enters the computation only after the current batch of input neurons has been computed against all of the corresponding filter channels, as illustrated by the schedule sketch below.
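The following is a schematic sketch of the channel-first ordering (loop order only; the function and parameter names are assumptions, and the pipelined overlap between the two stages is ignored): the depthwise stage finishes the same batch of M rows for every input-channel group before advancing, and the pointwise stage consumes that batch against every group of pointwise filters before the next batch of rows is admitted.

```python
# Schematic sketch of the channel-first schedule (names and granularity are assumptions).
def channel_first_schedule(num_rows, M, num_in_channels, in_channels_per_pass,
                           num_out_channels, out_channels_per_pass):
    schedule = []
    for row in range(0, num_rows, M):                                  # one batch of M input rows
        for c in range(0, num_in_channels, in_channels_per_pass):      # depthwise: every channel group first
            schedule.append(("depthwise", c, row))
        for oc in range(0, num_out_channels, out_channels_per_pass):   # pointwise: every filter group next
            schedule.append(("pointwise", oc, row))
    return schedule

# Sizes taken from the embodiment described below: 112 rows processed 5 at a time,
# 32 input channels 4 at a time, 64 pointwise output channels 16 at a time.
print(len(channel_first_schedule(112, 5, 32, 4, 64, 16)))
```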
A deep separable convolutional neural network acceleration method according to an embodiment of the present invention is described in detail below with reference to fig. 2. Fig. 2 is a schematic structural diagram of a deep separable convolutional neural network accelerator according to an embodiment of the present invention, which mainly comprises a depthwise convolution processing unit and a pointwise convolution processing unit.
In the two-stage pipelined accelerator of this embodiment, the depthwise convolution processing unit consists of a three-dimensional PE array. It adopts an input scheme in which multiple input channels are processed in parallel, the filter rows stay fixed in the PE columns, and the input neurons enter diagonally; this fully exploits the parallelism of the PE computing units, minimizes data movement, and maximizes data reuse. In addition, as shown in FIG. 3, the PE computing unit is designed with a zero-skipping computation function.
The depthwise convolution processing unit comprises a plurality of planar PE arrays, and each planar PE array computes one channel; for each channel, the rows of that channel's filter are assigned to the PE columns, one filter row per column, and the input neurons are assigned to the PE array row by row in a diagonal fashion. The depthwise convolution uses a channel-first data pattern: the next several rows of input neurons are not computed until the same rows of input neurons of all channels have been computed.
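The following is a behavioral sketch (assumed structure, not RTL and not the exact circuit of FIG. 3) of the zero-skipping multiply-accumulate mentioned above: when either operand is zero the multiplication is skipped, which in hardware corresponds to gating the multiplier so that the partial sum is left unchanged and dynamic power is saved.

```python
# Behavioral sketch of a zero-skipping MAC processing element (assumed structure).
class ZeroSkipPE:
    def __init__(self):
        self.partial_sum = 0

    def mac(self, neuron, weight):
        if neuron == 0 or weight == 0:
            return self.partial_sum              # zero operand: the multiplier is not exercised
        self.partial_sum += neuron * weight      # normal multiply-accumulate
        return self.partial_sum

pe = ZeroSkipPE()
for n, w in [(11, 110), (0, 120), (13, 130)]:     # the zero input neuron is skipped
    pe.mac(n, w)
print(pe.partial_sum)                              # 11*110 + 13*130 = 2900
```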
In the two-stage pipelined accelerator of this embodiment, the pointwise convolution processing unit consists of a three-dimensional array of MU multipliers and uses vertical systolic flow of the input neurons, so that the input neuron data are fully reused; as shown in FIG. 4, the MU multiplier computing unit is likewise designed with a zero-skipping computation function. Each row is equipped with an adder tree that accumulates the products of the multipliers in that row, and a first-in first-out (FIFO) queue outside the multiplier array temporarily stores the input neuron data flowing out of the array and supplies the input neurons for the next stage of the pointwise computation.
In other words, the pointwise convolution processing unit comprises a plurality of planar multiplier arrays, a plurality of adder trees, and a FIFO queue. After a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse; each row is equipped with an adder tree that accumulates the products of the multipliers in that row; and the FIFO queue outside the multiplier array temporarily stores the input neuron data flowing out of the array and supplies the input neurons for the next stage of the pointwise computation. The pointwise convolution uses a channel-first data pattern: the next batch of input neurons enters the computation only after the current batch of input neurons has been computed against all of the corresponding filter channels.
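The following is an illustrative sketch of one planar multiplier array of the pointwise convolution processing unit (the function names are assumptions and per-cycle timing is collapsed): each column carries the depthwise output of one input channel, each row holds the 1 x 1 filter weights of one output channel, the input vector pulses down the rows, each row's products are reduced by its adder tree, and data leaving the last row is pushed into the FIFO so it can be replayed for the next group of output channels.

```python
# Simplified sketch of one planar multiplier array with a per-row adder tree and an output FIFO.
from collections import deque
import numpy as np

def pointwise_systolic_pass(inputs, weights, fifo):
    """inputs: (C,) depthwise outputs of one pixel; weights: (R, C) 1x1 filters of R output channels."""
    R, _ = weights.shape
    partial_sums = np.zeros(R)
    for r in range(R):                                  # the input vector reaches row r and is reused there
        partial_sums[r] = np.sum(inputs * weights[r])   # the per-row adder tree reduces the products
    fifo.append(inputs.copy())                          # past the last row, data is kept for replay
    return partial_sums

fifo = deque()
x = np.arange(1.0, 5.0)                                 # depthwise outputs of 4 channels
psums_1_16 = pointwise_systolic_pass(x, np.random.rand(16, 4), fifo)                  # output channels 1-16
psums_17_32 = pointwise_systolic_pass(fifo.popleft(), np.random.rand(16, 4), deque()) # replayed inputs
print(psums_1_16.shape, psums_17_32.shape)
```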
The operation of two convolution layers of MobileNetV1, listed in Table 1, is described in detail below.
TABLE 1
[Table 1 is reproduced as an image in the original document; it gives the parameters of the two MobileNetV1 convolution layers (one depthwise, one pointwise) used in the example below.]
For the depthwise convolution, the input neurons of the first 4 channels are processed first, and the results are passed directly, through tight coupling, to the pointwise convolution for the subsequent pointwise computation. Each planar PE array computes one channel, and the number of columns of each planar multiplier array of the pointwise convolution processing unit equals the number of channels processed by the depthwise convolution processing unit at one time; combined with the resource allocation of the accelerator hardware as a whole, it is therefore determined that 4 planar PE arrays compute 4 channels independently and in parallel at a time. After the input neurons of the first 4 channels have been processed, the input neurons of the next 4 channels are processed, until the input neurons of all 32 channels have been processed. When the input neurons of 4 channels are processed, the channels are processed simultaneously and independently, which speeds up data processing. When the first channel is processed, the first row of the filter corresponding to that channel is assigned to the first PE column, the second row to the second PE column, and the third row to the third PE column; the input neurons are assigned to the PE array diagonally. Since the filter size is 3, each planar PE array has three columns; the number of rows of each planar PE array can be chosen according to the actual situation and is 3 in this embodiment, so that one channel can process 5 rows of input neurons at a time. Every three time units the current input neuron data are replaced and the input neuron data just used for the computation are passed to the next PE in the diagonal direction, so that, a few time units after the computation starts, each output row produces one output neuron datum every three time units, as described in detail below with reference to Table 2. To cooperate with the pointwise computation, the depthwise convolution uses the channel-first data pattern: after the first 5 rows of input neurons of the current group are finished, the first 5 rows of input neuron data of the next group of 4 input channels are computed. When the first 5 rows of data of all 32 input channels have been computed, the next group of 5 rows of the 32-channel input neurons is computed, and this continues until all 112 rows of the 32-channel input neurons have been computed.
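Functionally, the mapping just described amounts to a row-stationary decomposition (a simplified sketch; the assignment of filter rows to specific PE columns and the exact timing are abstracted away): each PE performs a 1-D convolution of one filter row with one input row, and output row p of a channel is the sum of the 1-D convolutions of filter row r with input row p + r.

```python
# Functional sketch of one planar PE array computing one depthwise channel (simplified).
import numpy as np

def conv1d_valid(x_row, f_row):
    K = len(f_row)
    return np.array([np.dot(x_row[j:j + K], f_row) for j in range(len(x_row) - K + 1)])

def planar_pe_array(x, f):
    """x: (H, W) one input channel; f: (K, K) its depthwise filter; returns (H-K+1, W-K+1)."""
    K = f.shape[0]
    out_rows = []
    for p in range(x.shape[0] - K + 1):                                        # one PE row per output row
        out_rows.append(sum(conv1d_valid(x[p + r], f[r]) for r in range(K)))   # one PE per filter row
    return np.stack(out_rows)

x = np.arange(25.0).reshape(5, 5)          # 5 input rows give 3 output rows (M = 5, N = 3)
f = np.ones((3, 3))
print(planar_pe_array(x, f))
```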
The depthwise convolution computation of an embodiment of the present invention is further described below with reference to Table 2, assuming a 5 x 5 input feature map and a 3 x 3 filter.
TABLE 2
5 x 5 input feature map (the element in row i, column j is written as "ij"):
irow1: 11 12 13 14 15
irow2: 21 22 23 24 25
irow3: 31 32 33 34 35
...
3 x 3 filter (weights):
frow1: 110 120 130
frow2: 210 220 230
frow3: 310 320 330
Assume that the time taken to perform one MAC (multiply-accumulate) operation in a PE is one time unit (denoted T). Every 3 time units, after three MACs have been performed, a partial sum is produced and stored in the local partial-sum storage. Every three time units the current input neuron data are replaced, and there is reuse here as well: for example, after 11, 12 and 13 have been used, the input neuron data are updated to 12, 13 and 14, so only one new datum, 14, needs to be read, while 12 and 13 continue to be reused. Meanwhile, the input neuron data just used for the computation are passed to the next PE in the diagonal direction.
For example, during 1-3T the rightmost PE in the second row computes 21 x 110 + 22 x 120 + 23 x 130, and the resulting partial sum is stored in the local partial-sum storage. During 4-6T the three input neurons 21, 22 and 23 are passed to the second PE in the first row for the corresponding computation, which embodies data reuse; meanwhile the rightmost PE in the second row replaces its input neuron data and computes 22 x 110 + 23 x 120 + 24 x 130. The weights used by this PE are always the first filter row, 110, 120 and 130, which embodies the stationary (fixed) weights.
The first row of PEs is taken as an example to explain how psumrow1 is obtained:
During 1-3T, the first and second PEs in the first row have no data yet and do not compute; the third PE in the first row computes 11 x 110 + 12 x 120 + 13 x 130 and stores the partial sum in its local partial-sum storage.
During 4-6T, the first PE in the first row still has no data and does not compute; the second PE in the first row receives the input neuron data passed from the third PE in the second row, computes 21 x 210 + 22 x 220 + 23 x 230, and stores the partial sum in its local partial-sum storage. The third PE in the first row computes 12 x 110 + 13 x 120 + 14 x 130 and stores the partial sum in its local partial-sum storage.
During 7-9T, the first PE in the first row receives data and computes 31 x 310 + 32 x 320 + 33 x 330, storing the partial sum in its local partial-sum storage. The second PE in the first row receives new data and computes 22 x 210 + 23 x 220 + 24 x 230, storing the partial sum in its local partial-sum storage. The third PE in the first row computes 13 x 110 + 14 x 120 + 15 x 130 and stores the partial sum in its local partial-sum storage.
At this point, the local partial-sum storage of the first PE holds 31 x 310 + 32 x 320 + 33 x 330; that of the second PE holds 21 x 210 + 22 x 220 + 23 x 230 and 22 x 210 + 23 x 220 + 24 x 230; and that of the third PE holds 11 x 110 + 12 x 120 + 13 x 130, 12 x 110 + 13 x 120 + 14 x 130, and 13 x 110 + 14 x 120 + 15 x 130.
The local partial-sum storages are designed to be first-in first-out, so the data stored first in each of the three storages can be taken out at this time: the third PE in the first row takes out 11 x 110 + 12 x 120 + 13 x 130, the second PE takes out 21 x 210 + 22 x 220 + 23 x 230, and the first PE takes out 31 x 310 + 32 x 320 + 33 x 330. Adding these three partial sums yields exactly the first datum of psumrow1.
During 10-12T, the first PE in the first row computes 32 x 310 + 33 x 320 + 34 x 330 and stores the partial sum in its local partial-sum storage. The second PE in the first row computes 23 x 210 + 24 x 220 + 25 x 230 and stores the partial sum in its local partial-sum storage. The third PE in the first row does not compute in this round, because all the data of the first input row have already been used (this happens so soon only because the example feature map is a small 5 x 5; a real input feature map is much larger, e.g. 112 x 112).
The data taken out of the three local partial-sum storages at this time are: the third PE in the first row takes out 12 x 110 + 13 x 120 + 14 x 130, the second PE takes out 22 x 210 + 23 x 220 + 24 x 230, and the first PE takes out 32 x 310 + 33 x 320 + 34 x 330.
It can be seen that, after a short start-up period (9T in this example), each output row produces one output neuron datum (i.e. one datum of a psumrow) every 3T.
The operation of the first row of PEs has been illustrated; the other two rows operate similarly, except that the second row of PEs can produce a psumrow2 datum after only 6T and the third row of PEs needs only 3T to produce a psumrow3 datum. To keep the three psumrows synchronized, this embodiment uses the controller to control the adders: after 9T, the partial sums of the three rows are added uniformly to obtain the corresponding psumrows, so that all rows produce data synchronously and each row thereafter produces one output neuron datum (one datum of a psumrow) every 3T.
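The walkthrough above can be checked numerically with the concrete values of Table 2: adding the three partial sums taken out of the three partial-sum stores reproduces the first element of psumrow1, which is exactly the ordinary 2-D convolution of the 3 x 3 filter with the top-left 3 x 3 window of the input feature map.

```python
# Numerical check of the Table 2 walkthrough (values copied from the example above).
import numpy as np

irows = np.array([[11, 12, 13, 14, 15],
                  [21, 22, 23, 24, 25],
                  [31, 32, 33, 34, 35]])
frows = np.array([[110, 120, 130],
                  [210, 220, 230],
                  [310, 320, 330]])

p3 = np.dot(irows[0, 0:3], frows[0])    # third PE of row 1:  11*110 + 12*120 + 13*130
p2 = np.dot(irows[1, 0:3], frows[1])    # second PE of row 1: 21*210 + 22*220 + 23*230
p1 = np.dot(irows[2, 0:3], frows[2])    # first PE of row 1:  31*310 + 32*320 + 33*330
direct = np.sum(irows[:, 0:3] * frows)  # direct 3x3 convolution at the same window
print(p1 + p2 + p3, direct)             # both print 49620, the first datum of psumrow1
```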
For the pointwise convolution, the first 3 rows of output neuron data from the first 4 channels of the depthwise convolution are received first. As shown in fig. 2, the pointwise convolution has 3 planar multiplier arrays, and each planar multiplier array independently computes one row of data in parallel. The number of columns of each planar multiplier array equals the number of channels computed in parallel by the depthwise convolution at one time, and the number of output channels of the pointwise convolution is a multiple of the number of rows of each planar multiplier array; in this embodiment the pointwise convolution has 64 output channels (i.e. 64 pointwise filters), so the number of rows of each planar multiplier array is set to 16 in view of the available hardware resources. When the first 4 channels of the 1st row of data are computed, the input neuron data of column 1 of the multiplier array are the output neurons produced by the depthwise convolution for channel 1, the input neuron data of column 2 are the output neurons produced for channel 2, and so on. The filter parameters fed into row 1 of the multiplier array are the filter parameters of output channel 1, those fed into row 2 are the filter parameters of output channel 2, and so on, up to row 16, which receives the filter parameters of output channel 16. An adder tree at the end of each row of the multiplier array accumulates the products of that row into a partial sum and stores it in the partial-sum buffer. Every three time units, each row of multipliers computes on the data produced by the depthwise convolution, generating a new partial sum that is stored in the buffer. After a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse; once a data value has passed through the 16th row of multipliers, it is written into the FIFO. When the depthwise convolution has delivered all of the first 3 rows of output neurons of the 4 channels to the pointwise convolution, the controller pauses the depthwise convolution from supplying new data; the filter parameters of the multiplier arrays then begin to be updated in sequence to those of output channels 17 to 32 (and so on for the remaining output channels), and the first 3 rows of depthwise-convolution output neurons of the first 4 channels stored in the FIFO are taken out again and flow through the multiplier arrays in systolic fashion, generating new partial sums. The partial-sum data previously stored at the corresponding positions of each row are read from the partial-sum buffer and added to the new partial sums; in this way the first row of partial sums of all 64 output channels is obtained, and the 3 planar multiplier arrays compute the first 3 rows of partial sums of all output channels.
The depthwise convolution is then directed to start supplying the next batch of output neurons to the pointwise convolution, and the above computation is repeated until the output neurons of all rows of all output channels have been computed. Specifically:
the depthwise convolution is directed to supply the first 3 rows of output neurons of the next 4 channels (channels 5-8) to the pointwise convolution, and the above computation is repeated until the complete first 3 rows of output neurons of all output channels have been obtained;
the depthwise convolution is then directed to supply rows 4-6 of the output neurons of the next group of 4 channels to the pointwise convolution, and the above computation is repeated until, finally, the output neurons of all rows of all output channels have been computed.
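The output-channel tiling and FIFO replay described above can be sketched functionally as follows (a simplified model; the loop structure and names are assumptions, with sizes taken from this example: 4 depthwise channels per pass, 16 multiplier rows, 64 pointwise output channels): the same FIFO-buffered batch of depthwise outputs is reused once per group of 16 output channels before the depthwise stage is allowed to produce the next batch.

```python
# Functional sketch of pointwise convolution tiled over output channels with input replay (assumed model).
import numpy as np

def pointwise_tiled(dw_rows, pw_weights, rows_per_array=16):
    """dw_rows: (N, C_group, W) depthwise outputs of one batch; pw_weights: (D, C_group)."""
    D = pw_weights.shape[0]
    out = np.zeros((D, dw_rows.shape[0], dw_rows.shape[2]))
    for oc in range(0, D, rows_per_array):                     # filter parameters updated per 16-channel group
        w = pw_weights[oc:oc + rows_per_array]
        for n in range(dw_rows.shape[0]):                      # one planar multiplier array per output row
            out[oc:oc + rows_per_array, n] += w @ dw_rows[n]   # inputs replayed from the FIFO
    return out

dw_batch = np.random.rand(3, 4, 112)          # first 3 rows of depthwise channels 1-4
pw_w = np.random.rand(64, 4)                  # 64 output channels, weights for this 4-channel slice
partial = pointwise_tiled(dw_batch, pw_w)     # partial sums contributed by channels 1-4
print(partial.shape)                          # (64, 3, 112): first 3 rows of all 64 output channels
```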
With this method, by allocating the hardware resources of the depthwise convolution and the pointwise convolution appropriately, lightweight neural network models that use depthwise separable convolution can be supported efficiently. Through the cooperation of the various buffers, the channel-first scheme obtains multiple rows of output neuron data at the same time and fully exploits data reuse to reduce accesses to external memory, while basic computing units capable of zero-skipping greatly reduce power consumption.
As shown in fig. 2, an embodiment of the present invention further provides a deep separable convolutional neural network accelerator, including:
a depthwise convolution processing unit, configured to perform depthwise convolution on the input neurons, wherein, during the depthwise convolution, the same M rows of data from C input channels are computed independently and in parallel in a three-dimensional processing element (PE) array to obtain the same N rows of output neurons for the C channels, where N is less than M; and
a pointwise convolution processing unit, configured to perform pointwise convolution on the output neurons produced by the depthwise convolution, wherein, during the pointwise convolution, each row of data of the C channels is computed independently and in parallel.
Optionally, the depthwise convolution processing unit comprises a plurality of planar PE arrays, each planar PE array computing one channel; for each channel, the rows of the filter corresponding to that channel are assigned to the PE columns, one filter row per column, and the input neurons are assigned to the PE array row by row in a diagonal fashion.
Optionally, the pointwise convolution processing unit comprises a plurality of planar multiplier arrays, a plurality of adder trees, and a first-in first-out (FIFO) queue, wherein:
after a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse;
each row is equipped with an adder tree that accumulates the products of the multipliers in that row; and
the FIFO queue, arranged outside the multiplier arrays, temporarily stores the input neuron data flowing out of the arrays and supplies the input neurons for the next stage of the pointwise computation.
Optionally, the pointwise convolution processing unit has N planar multiplier arrays, and each planar multiplier array independently computes, in parallel, one of the same rows of data of the C channels; when each row of data of the different channels is computed, the input neuron data of column 1 of the multiplier array are the output neurons produced by the depthwise convolution for channel 1, the input neuron data of column 2 are the output neurons produced by the depthwise convolution for channel 2, and so on;
the filter parameters fed into row 1 of the multiplier array are the filter parameters of output channel 1, the filter parameters fed into row 2 are those of output channel 2, and so on;
an adder tree at the end of each row of the multiplier array accumulates the products of that row into a partial sum and stores it in a partial-sum buffer;
after a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse, and once a data value has passed through the last row of multipliers it is written into the FIFO;
when the depthwise convolution has sent all of the same N rows of output neurons of the C channels to the pointwise convolution, the depthwise convolution is paused from providing new data to the pointwise convolution, the filter parameters of the pointwise multiplier arrays begin to be updated in sequence, and the depthwise-convolution output neurons of the different channels held in the FIFO are taken out again and flow through the multiplier arrays in systolic fashion, generating new partial sums;
the partial sums previously stored at the corresponding positions of each row are read out of the partial-sum buffer and added to the new partial sums, whereby one row of partial sums of all output channels is obtained, and the N planar multiplier arrays compute the first N rows of partial sums of all output channels;
the depthwise convolution is then directed to start supplying the next batch of output neurons to the pointwise convolution, the above computation is repeated, and finally the output neurons of all rows of all output channels are obtained.
Optionally, both the depthwise convolution and the pointwise convolution use a channel-first data pattern. For the depthwise convolution, the channel-first pattern means that the next M rows of input neuron data are not computed until the same M rows of data of all input channels have been computed; for the pointwise convolution, the next batch of input neurons enters the computation only after the current batch of input neurons has been computed against all of the corresponding filter channels.
The preferred embodiments of the present invention have been described in detail above with reference to the accompanying drawings; however, the present invention is not limited to the specific details of the above embodiments. Various simple modifications can be made to the technical solution of the present invention within the scope of its technical concept, and these simple modifications all fall within the protection scope of the present invention.
It should further be noted that the specific technical features described in the above embodiments can be combined in any suitable manner provided there is no contradiction; to avoid unnecessary repetition, the possible combinations are not described separately.
In addition, the various embodiments of the present invention can also be combined arbitrarily, and such combinations should likewise be regarded as disclosed by the present invention as long as they do not depart from the idea of the present invention.

Claims (10)

1. A method for accelerating a deep separable convolutional neural network, comprising:
performing depthwise convolution on input neurons, wherein, during the depthwise convolution, the same M rows of data from C input channels are computed independently and in parallel in a three-dimensional processing element (PE) array to obtain the same N rows of output neurons for the C channels, where N is less than M; and
performing pointwise convolution on the output neurons produced by the depthwise convolution, wherein, during the pointwise convolution, each row of data of the C channels is computed independently and in parallel.
2. The method of claim 1, wherein, when the depthwise convolution computes each channel, the rows of the filter corresponding to that channel are assigned to the PE columns, one filter row per column; and the input neurons are assigned to the PE array row by row in a diagonal fashion.
3. The method according to claim 1, wherein the pointwise convolution uses vertical systolic flow of the input neurons: after a multiplier has used the current input data, it passes that data to the next multiplier in its column for data reuse; each row is equipped with an adder tree that accumulates the products of the multipliers in that row; and a first-in first-out (FIFO) queue outside the multiplier array temporarily stores the input neuron data flowing out of the array and supplies the input neurons for the next stage of the pointwise computation.
4. The method according to claim 3, wherein the pointwise convolution specifically comprises:
the pointwise convolution has N planar multiplier arrays, and each planar multiplier array independently computes, in parallel, one of the same rows of data of the C channels; when each row of data of the different channels is computed, the input neuron data of column 1 of the multiplier array are the output neurons produced by the depthwise convolution for channel 1, the input neuron data of column 2 are the output neurons produced by the depthwise convolution for channel 2, and so on;
the filter parameters fed into row 1 of the multiplier array are the filter parameters of output channel 1, the filter parameters fed into row 2 are those of output channel 2, and so on;
an adder tree at the end of each row of the multiplier array accumulates the products of that row into a partial sum and stores it in a partial-sum buffer;
after a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse, and once a data value has passed through the last row of multipliers it is written into the FIFO;
when the depthwise convolution has sent all of the same N rows of output neurons of the C channels to the pointwise convolution, the depthwise convolution is paused from providing new data to the pointwise convolution, the filter parameters of the pointwise multiplier arrays begin to be updated in sequence, and the depthwise-convolution output neurons of the different channels held in the FIFO are taken out again and flow through the multiplier arrays in systolic fashion, generating new partial sums;
the partial sums previously stored at the corresponding positions of each row are read out of the partial-sum buffer and added to the new partial sums, whereby one row of partial sums of all output channels is obtained, and the N planar multiplier arrays compute the first N rows of partial sums of all output channels; and
the depthwise convolution is then directed to start supplying the next batch of output neurons to the pointwise convolution, the above computation is repeated, and finally the output neurons of all rows of all output channels are obtained.
5. The method of any one of claims 1-4, wherein both the depthwise convolution and the pointwise convolution use a channel-first data pattern; for the depthwise convolution, the channel-first pattern means that the next M rows of input neuron data are not computed until the same M rows of data of all input channels have been computed; and for the pointwise convolution, the next batch of input neurons enters the computation only after the current batch of input neurons has been computed against all of the corresponding filter channels.
6. A deep separable convolutional neural network accelerator, comprising:
a depthwise convolution processing unit, configured to perform depthwise convolution on input neurons, wherein, during the depthwise convolution, the same M rows of data from C input channels are computed independently and in parallel in a three-dimensional processing element (PE) array to obtain the same N rows of output neurons for the C channels, where N is less than M; and
a pointwise convolution processing unit, configured to perform pointwise convolution on the output neurons produced by the depthwise convolution, wherein, during the pointwise convolution, each row of data of the C channels is computed independently and in parallel.
7. The accelerator according to claim 6, wherein the depthwise convolution processing unit comprises a plurality of planar PE arrays, each planar PE array computing one channel; for each channel, the rows of the filter corresponding to that channel are assigned to the PE columns, one filter row per column, and the input neurons are assigned to the PE array row by row in a diagonal fashion.
8. The accelerator of claim 6, wherein the pointwise convolution processing unit comprises a plurality of planar multiplier arrays, a plurality of adder trees, and a first-in first-out (FIFO) queue, wherein:
after a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse;
each row is equipped with an adder tree that accumulates the products of the multipliers in that row; and
the FIFO queue, arranged outside the multiplier arrays, temporarily stores the input neuron data flowing out of the arrays and supplies the input neurons for the next stage of the pointwise computation.
9. The accelerator of claim 8,
the pointwise convolution processing unit has N planar multiplier arrays, and each planar multiplier array independently computes, in parallel, one of the same rows of data of the C channels; when each row of data of the different channels is computed, the input neuron data of column 1 of the multiplier array are the output neurons produced by the depthwise convolution for channel 1, the input neuron data of column 2 are the output neurons produced by the depthwise convolution for channel 2, and so on;
the filter parameters fed into row 1 of the multiplier array are the filter parameters of output channel 1, the filter parameters fed into row 2 are those of output channel 2, and so on;
an adder tree at the end of each row of the multiplier array accumulates the products of that row into a partial sum and stores it in a partial-sum buffer;
after a multiplier has used the current input data, it passes that data down to the corresponding multiplier in the next row for data reuse, and once a data value has passed through the last row of multipliers it is written into the FIFO;
when the depthwise convolution has sent all of the same N rows of output neurons of the C channels to the pointwise convolution, the depthwise convolution is paused from providing new data to the pointwise convolution, the filter parameters of the pointwise multiplier arrays begin to be updated in sequence, and the depthwise-convolution output neurons of the different channels held in the FIFO are taken out again and flow through the multiplier arrays in systolic fashion, generating new partial sums;
the partial sums previously stored at the corresponding positions of each row are read out of the partial-sum buffer and added to the new partial sums, whereby one row of partial sums of all output channels is obtained, and the N planar multiplier arrays compute the first N rows of partial sums of all output channels; and
the depthwise convolution is then directed to start supplying the next batch of output neurons to the pointwise convolution, the above computation is repeated, and finally the output neurons of all rows of all output channels are obtained.
10. The accelerator according to any one of claims 6 to 9, wherein both the depthwise convolution and the pointwise convolution use a channel-first data pattern; for the depthwise convolution, the channel-first pattern means that the next M rows of input neuron data are not computed until the same M rows of data of all input channels have been computed; and for the pointwise convolution, the next batch of input neurons enters the computation only after the current batch of input neurons has been computed against all of the corresponding filter channels.
CN202110851351.3A 2021-07-27 2021-07-27 Deep separable convolutional neural network acceleration method and accelerator Active CN113298241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110851351.3A CN113298241B (en) 2021-07-27 2021-07-27 Deep separable convolutional neural network acceleration method and accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110851351.3A CN113298241B (en) 2021-07-27 2021-07-27 Deep separable convolutional neural network acceleration method and accelerator

Publications (2)

Publication Number Publication Date
CN113298241A true CN113298241A (en) 2021-08-24
CN113298241B CN113298241B (en) 2021-10-22

Family

ID=77331288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110851351.3A Active CN113298241B (en) 2021-07-27 2021-07-27 Deep separable convolutional neural network acceleration method and accelerator

Country Status (1)

Country Link
CN (1) CN113298241B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200090030A1 (en) * 2018-09-19 2020-03-19 British Cayman Islands Intelligo Technology Inc. Integrated circuit for convolution calculation in deep neural network and method thereof
CN110705687A (en) * 2019-09-05 2020-01-17 北京三快在线科技有限公司 Convolution neural network hardware computing device and method
CN113033794A (en) * 2021-03-29 2021-06-25 重庆大学 Lightweight neural network hardware accelerator based on deep separable convolution

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEIGUANG CHEN ET AL.: "Accelerating Compact Convolutional Neural Networks with Multi-threaded Data Streaming", 18th IEEE Computer Society Annual Symposium on VLSI *
XIAO JIALE ET AL. (萧嘉乐等): "Efficient and Scalable FPGA-based Implementation of a MobileNet Accelerator", Computer Engineering & Science (计算机工程与科学) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600652A (en) * 2022-11-29 2023-01-13 深圳市唯特视科技有限公司(Cn) Convolutional neural network processing device, high-speed target detection method and equipment

Also Published As

Publication number Publication date
CN113298241B (en) 2021-10-22

Similar Documents

Publication Publication Date Title
US12067492B2 (en) Processing for multiple input data sets in a multi-layer neural network
Guo et al. Software-hardware codesign for efficient neural network acceleration
CN109886400B (en) Convolution neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof
CN110210610B (en) Convolution calculation accelerator, convolution calculation method and convolution calculation device
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN110020723B (en) Neural network processing unit and system on chip comprising same
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
US20180174036A1 (en) Hardware Accelerator for Compressed LSTM
Fan et al. F-E3D: FPGA-based acceleration of an efficient 3D convolutional neural network for human action recognition
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
CN112633490B (en) Data processing device, method and related product for executing neural network model
CN110580519B (en) Convolution operation device and method thereof
CN108304925B (en) Pooling computing device and method
US20200356809A1 (en) Flexible pipelined backpropagation
US11709783B1 (en) Tensor data distribution using grid direct-memory access (DMA) controller
CN113313252B (en) Depth separable convolution implementation method based on pulse array
CN112734020B (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN113298241B (en) Deep separable convolutional neural network acceleration method and accelerator
CN112114942A (en) Streaming data processing method based on many-core processor and computing device
CN115310037A (en) Matrix multiplication computing unit, acceleration unit, computing system and related method
JP2022137247A (en) Processing for a plurality of input data sets
Véstias et al. Hybrid dot-product calculation for convolutional neural networks in FPGA
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
US20230065725A1 (en) Parallel depth-wise processing architectures for neural networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant