CN108241890A - A reconfigurable neural network acceleration method and architecture - Google Patents
A reconfigurable neural network acceleration method and architecture
- Publication number
- CN108241890A CN108241890A CN201810084089.2A CN201810084089A CN108241890A CN 108241890 A CN108241890 A CN 108241890A CN 201810084089 A CN201810084089 A CN 201810084089A CN 108241890 A CN108241890 A CN 108241890A
- Authority
- CN
- China
- Prior art keywords
- output
- input
- convolution
- block
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The present invention provides a reconfigurable neural network acceleration method and architecture. The architecture comprises an input buffer unit, a weight buffer unit, a convolution compute core unit and an output buffer unit, and respectively adopts input-data-reuse, output-data-reuse and weight-data-reuse modes, in which the convolution compute core unit performs convolution between the input data it reads and the convolution kernels to generate output data. The application handles neural networks with any number of layers through a layer-by-layer acceleration strategy, and optimizes the acceleration with loop transformations, thereby reducing the number of accesses to the buffer and to DRAM, solving the prior-art problem that frequent memory accesses waste power, and achieving the advantageous effects of reducing energy consumption and maximizing the hardware utilization of the PE array.
Description
Technical field
The present invention relates to computation patterns in deep convolutional neural networks, and more particularly to a reconfigurable neural network acceleration method and architecture.
Background technology
Deep convolutional neural networks have been widely used in computer vision and speech processing. However, their inherent high complexity poses great challenges when they are executed on hardware, especially with respect to power consumption and performance. Traditional execution hardware includes CPUs, GPUs and FPGAs. Unfortunately, a CPU cannot provide low-latency processing in an embedded device; a GPU can meet the low-latency requirement, but its power consumption is too high for embedded devices; and although an FPGA can barely meet the requirements on power consumption and execution performance, its internal routing resources and compute units limit the execution efficiency of different deep convolutional neural networks.
To address these demands and challenges, an architecture dedicated to executing deep convolutional neural networks is needed to replace CPUs, GPUs and FPGAs. Even so, the computation patterns adopted by some traditional neural network hardware architectures fail to strike a good compromise between execution efficiency and energy consumption. In traditional hardware computation patterns for deep neural networks, because the data volume differs from layer to layer, some computation patterns access the buffer and memory in a single fixed mode and cannot be reconfigured at run time according to the computation demand, which significantly increases the number of memory accesses and causes unnecessary power waste. Fig. 1 is a schematic diagram of computation in a classical deep convolutional neural network, and Fig. 2 shows the pseudocode loop expression of a convolutional layer operation in a classical deep convolutional neural network. As shown in Fig. 1, in a classical deep convolutional neural network of the prior art, each convolution kernel has size K × K and N convolution channels; the input data has size H × L and N input channels; convolving the input data with M convolution kernels produces output data of size R × C with M output channels. As shown in Fig. 2, the pseudocode loop of the convolutional layer operation proceeds as follows:
The loops over R and C produce, in turn, each portion of the output data on every channel;
the loop over M makes the N convolution channels of each convolution kernel perform convolution with the N input channels of the current portion of the input data, thereby producing the output data of each output channel in turn;
the loop over N makes each of the N input channels of the current portion of the input data perform convolution with the corresponding one of the N convolution channels of the current kernel.
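The loop nest described above can be sketched in plain Python/NumPy. This is a hedged illustration of the classical computation pattern of Figs. 1 and 2, not the patented hardware; the function name and all sizes are assumptions.

```python
# Minimal NumPy sketch of the classical convolutional-layer loop nest:
# loops over output rows R, columns C, the M kernels, and the N input channels.
import numpy as np

def conv_layer(inp, kernels, stride=1):
    """inp: (N, H, L) input; kernels: (M, N, K, K); returns (M, R, C) output."""
    N, H, L = inp.shape
    M, _, K, _ = kernels.shape
    R = (H - K) // stride + 1
    C = (L - K) // stride + 1
    out = np.zeros((M, R, C))
    for r in range(R):                 # loop over output rows (R)
        for c in range(C):             # loop over output columns (C)
            for m in range(M):         # loop over the M convolution kernels
                for n in range(N):     # loop over the N input channels
                    patch = inp[n, r*stride:r*stride+K, c*stride:c*stride+K]
                    out[m, r, c] += np.sum(patch * kernels[m, n])
    return out
```

With stride 1 and no padding, R = H − K + 1 and C = L − K + 1, matching the R × C output size of Fig. 1.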
In a deep convolutional neural network accelerator, energy efficiency is a very important metric, and it is defined as:

Efficiency = Operations / Energy = Performance / Power

where Operations is the number of operations, Energy is the energy consumed, Performance is the throughput and Power is the power consumption. For a given convolutional neural network, the number of operations is fixed, so the only key factor affecting energy efficiency is the energy Energy.
The energy may be defined as:

Energy = MA_DRAM · E_DRAM + MA_buffer · E_buffer + Operations · E_operation

where Energy is the total energy, MA_DRAM and MA_buffer are the numbers of accesses to DRAM and to the on-chip buffer, Operations is the number of operations, and E_DRAM, E_buffer and E_operation are the energy of a single DRAM access, a single buffer access and a single operation, respectively. Therefore, for a fixed convolutional neural network, the key factors affecting energy consumption are the number of DRAM accesses MA_DRAM and the number of buffer accesses MA_buffer.
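A small numerical illustration of this energy model follows. The per-access energies are made-up placeholder values in arbitrary units, chosen only to reflect the common situation that a DRAM access costs far more energy than a buffer access; they are not figures from the patent.

```python
# Energy = MA_DRAM*E_DRAM + MA_buffer*E_buffer + Operations*E_operation
# (placeholder energy costs; a DRAM access is assumed much more expensive)
def total_energy(ma_dram, ma_buffer, operations,
                 e_dram=200.0, e_buffer=6.0, e_op=1.0):
    return ma_dram * e_dram + ma_buffer * e_buffer + operations * e_op

# With Operations fixed, reducing DRAM accesses saves far more energy
# than reducing buffer accesses by the same fraction:
base      = total_energy(ma_dram=1000, ma_buffer=10000, operations=100000)
less_dram = total_energy(ma_dram=500,  ma_buffer=10000, operations=100000)
less_buf  = total_energy(ma_dram=1000, ma_buffer=5000,  operations=100000)
```

This is why the reuse strategies below target MA_DRAM and MA_buffer rather than the (fixed) operation count.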
In addition, when performing convolution, the traditional computation pattern does not achieve a high utilization of the PE array; in particular, when the convolution stride is greater than 1, the hardware utilization of the PE array drops substantially.

Therefore, how to reduce energy consumption in deep convolutional neural networks by reducing the number of accesses to the buffer and to DRAM, and how to improve the utilization of the PE array during convolution, are technical problems that urgently need to be solved.
Summary of the invention
In order to overcome the defects of the prior art, the present invention proposes a reconfigurable neural network acceleration method and architecture, which handles neural networks with any number of layers through a layer-by-layer acceleration strategy and optimizes the computation pattern with loop transformations, thereby achieving the advantageous effects of reducing energy consumption and maximizing the utilization of the PE array.
The present invention proposes a reconfigurable neural network acceleration method one, which is an input-data-reuse method and comprises:

An input buffer unit divides the input data of N input channels into n input data blocks, each input data block having Tn input channels, and sends each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

A weight buffer unit divides M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and sends each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

A convolution compute core unit convolves the a-th input data block it reads with each convolution group in turn, generating an output data block with Tm output channels, until convolution with all m convolution groups is finished and an output data block with M output channels is generated; it accumulates the stored output data block of M output channels fed back by the output buffer unit with the generated output data block of M output channels, and sends the accumulated output data block. Here, the stored output data block was generated by accumulation after the 1st to (a-1)-th input data blocks, read before the a-th input data block, were convolved in turn with each convolution group.

The output buffer unit stores the received accumulated output data block as the stored output data block of M output channels and feeds it back to the convolution compute core unit. When a = n, the output buffer unit stores the complete output data of the M output channels, where a ≤ n and a is a positive integer.
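As a rough software analogue of method one (not the claimed hardware), the input-data-reuse ordering can be sketched as follows: each input data block is read once and convolved with every convolution group before the next block is read, while partial sums for all M output channels are accumulated across blocks. Stride 1 and no padding are assumed.

```python
# Input-data-reuse loop order: outer loop over input blocks (a),
# inner loop over convolution groups (b); each block is read only once.
import numpy as np

def input_reuse_conv(inp, kernels, Tn, Tm):
    """inp: (N, H, L); kernels: (M, N, K, K); Tn | N and Tm | M assumed."""
    N, H, L = inp.shape
    M, _, K, _ = kernels.shape
    R, C = H - K + 1, L - K + 1
    out = np.zeros((M, R, C))                  # stored output, all M channels
    for a in range(N // Tn):                   # read input block a once
        block = inp[a*Tn:(a+1)*Tn]
        for b in range(M // Tm):               # reuse it for every conv group
            group = kernels[b*Tm:(b+1)*Tm, a*Tn:(a+1)*Tn]
            for r in range(R):
                for c in range(C):
                    # accumulate partial sums for this group's Tm channels
                    out[b*Tm:(b+1)*Tm, r, c] += np.einsum(
                        'nkl,mnkl->m', block[:, r:r+K, c:c+K], group)
    return out
```

After the last block (a = n), `out` holds the complete output data of the M output channels, mirroring the a = n condition above.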
The present invention provides a reconfigurable neural network acceleration architecture one, comprising an input buffer unit, a weight buffer unit, a convolution compute core unit and an output buffer unit.

The input buffer unit is configured to divide the input data of N input channels into n input data blocks, each input data block having Tn input channels, and to send each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

The weight buffer unit is configured to divide M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and to send each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

The convolution compute core unit is configured to convolve the a-th input data block it reads with each convolution group in turn, generating an output data block with Tm output channels, until convolution with all m convolution groups is finished and an output data block with M output channels is generated; to accumulate the stored output data block of M output channels fed back by the output buffer unit with the generated output data block of M output channels; and to send the accumulated output data block. The stored output data block was generated by accumulation after the 1st to (a-1)-th input data blocks, read before the a-th input data block, were convolved in turn with each convolution group.

The output buffer unit is configured to store the received accumulated output data block as the stored output data block of M output channels and to feed it back to the convolution compute core unit. When a = n, the output buffer unit stores the complete output data of the M output channels, where a ≤ n and a is a positive integer.
The present invention provides a reconfigurable neural network acceleration method two, which is an output-data-reuse method and comprises:

An input buffer unit divides the input data of N input channels into n input data blocks, each input data block having Tn input channels, and sends each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

A weight buffer unit divides M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and sends each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

A convolution compute core unit convolves each input data block it reads, in turn, with the b-th convolution group it reads, generating an output data block with Tm output channels, until convolution with the n-th input data block is finished and the complete output data of the Tm output channels is generated; it accumulates the partial-channel output data stored in the convolution compute core unit with the generated complete output data of the Tm output channels, and generates and stores the accumulated partial-channel output data. Here, the stored partial-channel output data was generated by accumulation after the 1st to (b-1)-th convolution groups, read before the b-th convolution group, were convolved in turn with each input data block. When b = m, the convolution compute core unit sends the accumulated output data of the M channels, where b ≤ m and b is a positive integer; it performs pooling on the output data of the M output channels and sends the pooled output data.

An output buffer unit receives and stores the pooled output data, generating the pooled output data of the M output channels.
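A hedged software analogue of method two follows (the pooling step is omitted for brevity). In the output-data-reuse ordering, the loop over convolution groups is outermost: the partial sums for one group of Tm output channels stay local while every input data block streams past, so each channel group's output is completed, and written out, only once.

```python
# Output-data-reuse loop order: outer loop over convolution groups (b),
# inner loop over input blocks (a); partial sums stay in the compute core.
import numpy as np

def output_reuse_conv(inp, kernels, Tn, Tm):
    """inp: (N, H, L); kernels: (M, N, K, K); stride 1, no padding assumed."""
    N, H, L = inp.shape
    M, _, K, _ = kernels.shape
    R, C = H - K + 1, L - K + 1
    out = np.zeros((M, R, C))
    for b in range(M // Tm):                   # one group of Tm output channels
        acc = np.zeros((Tm, R, C))             # local partial sums for the group
        for a in range(N // Tn):               # stream all input blocks past it
            block = inp[a*Tn:(a+1)*Tn]
            group = kernels[b*Tm:(b+1)*Tm, a*Tn:(a+1)*Tn]
            for r in range(R):
                for c in range(C):
                    acc[:, r, c] += np.einsum(
                        'nkl,mnkl->m', block[:, r:r+K, c:c+K], group)
        out[b*Tm:(b+1)*Tm] = acc               # each channel group written once
    return out
```

The arithmetic is identical to method one; only the loop order, and hence which data is kept resident, changes.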
The present invention provides a reconfigurable neural network acceleration architecture two, comprising an input buffer unit, a weight buffer unit, a convolution compute core unit and an output buffer unit.

The input buffer unit is configured to divide the input data of N input channels into n input data blocks, each input data block having Tn input channels, and to send each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

The weight buffer unit is configured to divide M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and to send each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

The convolution compute core unit is configured to convolve each input data block it reads, in turn, with the b-th convolution group it reads, generating an output data block with Tm output channels, until convolution with the n-th input data block is finished and the complete output data of the Tm output channels is generated; to accumulate the partial-channel output data stored in the convolution compute core unit with the generated complete output data of the Tm output channels, generating and storing the accumulated partial-channel output data. The stored partial-channel output data was generated by accumulation after the 1st to (b-1)-th convolution groups, read before the b-th convolution group, were convolved in turn with each input data block. When b = m, the convolution compute core unit sends the accumulated output data of the M channels, where b ≤ m and b is a positive integer; it performs pooling on the output data of the M output channels and sends the pooled output data.

The output buffer unit is configured to receive and store the pooled output data, generating the pooled output data of the M output channels.
The present invention provides a reconfigurable neural network acceleration method three, which is a weight-data-reuse method and comprises:

An input buffer unit divides the input data of N input channels into n input data blocks, each input data block having Tn input channels, and sends each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

A weight buffer unit divides M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and sends each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

A convolution compute core unit convolves each input data block it reads, in turn, with the b-th convolution group it reads, generating an output data block with Tm output channels, until convolution with the n-th input data block is finished and the output data of the Tm output channels is generated; it accumulates the partial-channel output data fed back by the output buffer unit with the generated output data of the Tm output channels, and sends the accumulated partial-channel output data. Here, the fed-back partial-channel output data was generated by accumulation after the 1st to (b-1)-th convolution groups, read before the b-th convolution group, were convolved in turn with each input data block.

The output buffer unit stores the received accumulated partial-channel output data as the partial-channel output data, and feeds the stored partial-channel output data back to the convolution compute core unit. When b = m, the output buffer unit stores the complete output data of the M output channels, where b ≤ m and b is a positive integer.
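A hedged software analogue of method three follows. The loop order matches method two, but here each convolution group's weights are loaded once and reused across every input data block, while the partial results round-trip through a stand-in for the output buffer unit rather than staying in the compute core.

```python
# Weight-data-reuse loop order: weights for group b are fetched once and
# reused for all input blocks; partials accumulate in the output buffer.
import numpy as np

def weight_reuse_conv(inp, kernels, Tn, Tm):
    """inp: (N, H, L); kernels: (M, N, K, K); stride 1, no padding assumed."""
    N, H, L = inp.shape
    M, _, K, _ = kernels.shape
    R, C = H - K + 1, L - K + 1
    out_buffer = np.zeros((M, R, C))           # models the output buffer unit
    for b in range(M // Tm):
        group = kernels[b*Tm:(b+1)*Tm]         # group b's weights, loaded once
        for a in range(N // Tn):               # reused for every input block
            block = inp[a*Tn:(a+1)*Tn]
            gsub = group[:, a*Tn:(a+1)*Tn]
            for r in range(R):
                for c in range(C):
                    # accumulate into the fed-back partial-channel result
                    out_buffer[b*Tm:(b+1)*Tm, r, c] += np.einsum(
                        'nkl,mnkl->m', block[:, r:r+K, c:c+K], gsub)
    return out_buffer                          # complete when b = m
```

Again the arithmetic is unchanged; the three methods trade off which of the three data streams (input, output, weight) avoids repeated fetches.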
The present invention provides a reconfigurable neural network acceleration architecture three, comprising an input buffer unit, a weight buffer unit, a convolution compute core unit and an output buffer unit.

The input buffer unit is configured to divide the input data of N input channels into n input data blocks, each input data block having Tn input channels, and to send each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

The weight buffer unit is configured to divide M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and to send each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

The convolution compute core unit is configured to convolve each input data block it reads, in turn, with the b-th convolution group it reads, generating an output data block with Tm output channels, until convolution with the n-th input data block is finished and the output data of the Tm output channels is generated; to accumulate the partial-channel output data fed back by the output buffer unit with the generated output data of the Tm output channels; and to send the accumulated partial-channel output data. The fed-back partial-channel output data was generated by accumulation after the 1st to (b-1)-th convolution groups, read before the b-th convolution group, were convolved in turn with each input data block.

The output buffer unit is configured to store the received accumulated partial-channel output data as the partial-channel output data, and to feed the stored partial-channel output data back to the convolution compute core unit. When b = m, the output buffer unit stores the complete output data of the M output channels, where b ≤ m and b is a positive integer.
Beneficial effects of the present invention: the reconfigurable neural network acceleration method and architecture provided by the invention, built on an input buffer unit, a weight buffer unit, a convolution compute core unit and an output buffer unit, respectively adopt the input-data-reuse, output-data-reuse and weight-data-reuse methods, handle neural networks with any number of layers through a layer-by-layer acceleration strategy, and optimize the acceleration method with loop transformations, achieving the advantageous effects of reducing energy consumption and maximizing the utilization of the PE array.
Description of the drawings
In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of computation in a classical deep convolutional neural network;
Fig. 2 is a pseudocode loop diagram of the convolutional layer operation of a classical deep convolutional neural network;
Fig. 3 is a flowchart of a reconfigurable neural network acceleration method provided by embodiment one of the present invention;
Fig. 4 is a schematic diagram of sending input data blocks along the Z axis according to an embodiment of the present invention;
Fig. 5 is a flowchart of the reconfigurable neural network acceleration method of an embodiment of the present invention;
Fig. 6 is a schematic diagram of the first defect of convolution computation with a traditional convolution kernel;
Fig. 7 is a schematic diagram of the parallel convolution mapping mode addressing the first defect according to an embodiment of the present invention;
Fig. 8 is a pseudocode loop diagram of the parallel convolution mapping mode addressing the first defect according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of the second defect of convolution computation with a traditional convolution kernel;
Fig. 10 is a schematic diagram of the segmentation of input data blocks addressing the second defect according to an embodiment of the present invention;
Fig. 11 is a schematic diagram of the spliced input data blocks addressing the second defect according to an embodiment of the present invention;
Fig. 12 is a schematic diagram of the parallel convolution mapping mode addressing the second defect according to an embodiment of the present invention;
Fig. 13 is a pseudocode loop diagram of the convolution operation of embodiment one of the present invention;
Fig. 14 is a structural diagram of a reconfigurable neural network acceleration architecture provided by embodiment two of the present invention;
Fig. 15 is a flowchart of a reconfigurable neural network acceleration method provided by embodiment three of the present invention;
Fig. 16 is a schematic diagram of sending input data blocks along the X/Y plane according to an embodiment of the present invention;
Fig. 17 is a flowchart of the reconfigurable neural network acceleration method of an embodiment of the present invention;
Fig. 18 is a pseudocode loop diagram of the convolution operation of embodiment three of the present invention;
Fig. 19 is a structural diagram of a reconfigurable neural network acceleration architecture provided by embodiment four of the present invention;
Fig. 20 is a flowchart of a reconfigurable neural network acceleration method provided by embodiment five of the present invention;
Fig. 21 is a flowchart of the reconfigurable neural network acceleration method of an embodiment of the present invention;
Fig. 22 is a pseudocode loop diagram of the convolution operation of embodiment five of the present invention;
Fig. 23 is a structural diagram of a reconfigurable neural network acceleration architecture provided by embodiment six of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

The terms "first", "second", etc. used herein do not denote any particular order or sequence, nor do they limit the present invention; they are only used to distinguish elements or operations described with the same technical term.

The terms "comprising", "including", "having", "containing", etc. used herein are open terms, meaning including but not limited to.

The term "and/or" used herein includes any and all combinations of the listed items.

Directional terms used herein, such as up, down, left, right, front or rear, refer only to the directions in the accompanying drawings; they are used for illustration and are not intended to limit the present case.
To address the defects of the prior art, the present invention proposes a reconfigurable neural network acceleration method, which handles neural networks with any number of layers through a layer-by-layer, data-reuse acceleration strategy and optimizes the acceleration method with loop transformations, achieving the advantageous effects of reducing energy consumption and maximizing the utilization of the PE array.
Embodiment one: In order to overcome the defects of the prior art, this embodiment provides a reconfigurable neural network acceleration method that uses the input-data-reuse mode. As shown in Fig. 3, the reconfigurable neural network acceleration method includes:

S101: An input buffer unit divides the input data of N input channels into n input data blocks, each input data block having Tn input channels, and sends each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

S102: A weight buffer unit divides M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and sends each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

S103: A convolution compute core unit convolves the a-th input data block it reads with each convolution group in turn, generating an output data block with Tm output channels, until convolution with all m convolution groups is finished and an output data block with M output channels is generated; it accumulates the stored output data block of M output channels fed back by the output buffer unit with the generated output data block of M output channels, and sends the accumulated output data block. The stored output data block was generated by accumulation after the 1st to (a-1)-th input data blocks, read before the a-th input data block, were convolved in turn with each convolution group.

S104: The output buffer unit stores the received accumulated output data block as the stored output data block of M output channels and feeds it back to the convolution compute core unit. When a = n, the output buffer unit stores the complete output data of the M output channels, where a ≤ n and a is a positive integer.
The reconfigurable neural network acceleration method provided by this embodiment divides the input data into input data blocks and sends them in sequence to the convolution compute core unit; each time, the convolution compute core unit convolves one input data block with the m convolution groups in turn, generating the output data block of M output channels; repeating this operation convolves every input data block with the m convolution groups while continuously accumulating the generated output data blocks of the M output channels, finally obtaining the complete output data of the M output channels. Through the layer-by-layer data-reuse acceleration strategy, the method of this embodiment handles neural networks with any number of layers and has the effect of optimizing the neural network and reducing energy consumption. Further, when the input buffer unit sends the input data blocks in turn, each input data block may be sent along the Z-axis direction.

In specific implementation, as shown in Fig. 4, the input data is a three-dimensional structure with N input channels (the Z-axis direction) and per-channel size H × L (the X/Y plane). The input data of each input channel is divided into input data blocks of size Th × Tl; the n input data blocks are read in turn along the Z-axis direction and sent to the convolution compute core unit for convolution. As shown in Fig. 4, the 1st to i-th input data blocks are sent first, then the (i+1)-th to 2i-th input data blocks, and so on, until the n-th input data block is sent, where i and n are positive integers.
Further, when the convolution compute core unit convolves the a-th input data block it reads with each convolution group in turn, it may convolve the Tn input channels of the a-th input data block with the Tn convolution channels of each convolution group in turn, where the Tn input channels and the Tn convolution channels of each convolution kernel are convolved in one-to-one correspondence.
Further, as shown in Fig. 5, the reconfigurable neural network acceleration method further includes:

S105: Judging whether the stride of the current convolution kernel is greater than 1.

S106: If so, mapping input data blocks to the PE array in an interleaved manner and convolving them with the same convolution kernel.

S107: If not, and the size of the output data block is smaller than the size of the input data block, dividing each input data block into W small input data tiles of identical size, re-splicing the tiles at corresponding positions of each input data block to generate W spliced input data blocks of identical size, and mapping the W spliced input data blocks to the PE array for convolution with the same convolution kernel.
In specific implementation, when a convolution in a traditional convolutional neural network is executed on a hardware platform, the convolution kernel is in effect multiplied with every element of the input data as if the stride were 1. This mode of operation causes invalid PE operations whenever the stride or the output size changes, and has the following two defects.
First defect: as shown in Fig. 6, when the input data block size is Th = Tl = 8, the output data block size is Tr = Tc = 4, the kernel size is K = 2 and the kernel stride is S = 2 > 1, the algorithm requires the kernel to traverse the entire input data block with a stride of 2. If the top-left weight of the kernel is instead multiplied with every element of the input data block as if the stride were 1, invalid PE computations are generated; the PEs that do useful work are only the black squares in Fig. 6, and the PE utilization is a mere 16/64 = 25%.
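The 25% figure can be reproduced with a small calculation (illustrative only; the one-PE-per-input-element assumption is inferred from the example above):

```python
# Stride-2 utilization defect: an 8x8 PE array holds one 8x8 input block,
# but with kernel size K=2 and stride S=2 only a 4x4 grid of output
# positions is valid, so most PEs compute nothing useful.
Th = Tl = 8                     # input data block size
K, S = 2, 2                     # kernel size and stride
Tr = Tc = (Th - K) // S + 1     # output block size: (8-2)/2 + 1 = 4
A = Th * Tl                     # PE array size, one PE per input element
utilization = (Tr * Tc) / A
print(f"{utilization:.0%}")     # 25% of the PEs do useful work
```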
To overcome the first defect, the present invention executes step S105 to judge whether the kernel stride is greater than 1. Here the kernel stride is S = 2 > 1, so step S106 is executed: when the kernel stride is greater than 1, the input data blocks of different input channels are mapped onto the PE array in an interleaved fashion and convolved with the same convolution kernel. The specific procedure is as follows.
As shown in Fig. 7, the present invention uses identical kernel weights. Since the output data block size is Tr = Tc = 4, four different input data blocks 1, 2, 3 and 4 are interleaved when the kernel is multiplied with the elements of the input data. The placement is: row 1 column 1 holds element 1(1,1), the row-1-column-1 element of input data block 1; row 1 column 2 holds element 2(1,1) of input data block 2; row 2 column 1 holds element 3(1,1) of input data block 3; row 2 column 2 holds element 4(1,1) of input data block 4. Row 1 column 3 holds element 1(1,3) of input data block 1 (since the kernel stride is 2, the row-1-column-2 elements of all blocks need not be computed); row 1 column 4 holds element 2(1,3) of input data block 2; row 2 column 3 holds element 3(1,3) of input data block 3; row 2 column 4 holds element 4(1,3) of input data block 4; and so on. The PEs that would have performed invalid computation in Fig. 6 are thus given data from other input data blocks that genuinely needs to be computed. In this way four output data blocks execute their convolutions in parallel, with Tr = Tc = Trr = Tcc = 4. The corresponding pseudocode loop diagram is shown in Fig. 8: the four innermost loops, Loop Tm/Tn/Tr/Tc, represent the convolution performed in the convolution compute core unit, which computes an output data block of Tm channels of size Tr x Tc from Tn input data blocks of size Th x Tl. Trr and Tcc are added as two innermost loops that emit output data tiles of size Trr x Tcc; this re-cuts the Tr x Tc output data block and implements the parallel convolution mapping method.
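The interleaved placement above can be sketched as an indexing rule (a hedged reconstruction: the rule below is inferred from the placement examples in the text, generalizing them to any stride S):

```python
import numpy as np

# Interleaved ("dislocation") mapping for stride S=2: four input blocks
# share one 8x8 PE array, so the PEs that would otherwise idle hold data
# from the other blocks. Block values are offset by 100*b for visibility.
S = 2
blocks = [np.arange(64).reshape(8, 8) + 100 * b for b in range(S * S)]

merged = np.empty((8, 8), dtype=int)
for r in range(8):
    for c in range(8):
        b = (r % S) * S + (c % S)              # which of the 4 blocks
        merged[r, c] = blocks[b][S * (r // S), S * (c // S)]

# Row 0 interleaves block 0 and block 1 at their stride-2 positions:
# block0[0,0], block1[0,0], block0[0,2], block1[0,2], ...
print(merged[0, :4])
```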
Second defect: as shown in Fig. 9, when the input data block size is Th = Tl = 8, the output data block size is Tr = Tc = 6, the kernel size is K = 2 and the kernel stride is S = 1, the kernel moves with a stride of 1, but the 6x6 output block size (Tr = Tc = 6) means the kernel need not traverse the entire 8x8 input data block. The hardware execution mechanism nonetheless moves the kernel over the whole block with a stride of 1; the genuinely effective computation is the black area in Fig. 9, so the PE utilization is 36/64 = 56.25%.
To overcome the second defect, the present invention executes step S107: when the kernel stride is S = 1 and the 6x6 output data block is smaller than the 8x8 input data block, each input data block is divided into W small input data tiles of equal size, the tiles at the corresponding positions of each input data block are re-stitched, and W stitched input data blocks of equal size are generated; the W stitched input data blocks are mapped onto the PE array and convolved with the same convolution kernel. The specific procedure is as follows.
When 16 input data blocks w1, w2, w3, ..., w16 are to be convolved with 16 different convolution kernels, the defect shown in Fig. 9 arises. The present invention re-divides the genuinely effective computation region: as shown in Fig. 10, each input data block is divided in a 2x2 manner, so the effective 6x6 part of each input data block is split into nine 2x2 input data tiles. As shown in Fig. 11, the original 16 input data blocks yield 16 x 9 input data tiles of size 2x2 after this division. As shown in Fig. 12, the tiles are then re-stitched: taking the tile at the same position from each input data block forms nine new input data blocks, each of size 8x8. The nine stitched 8x8 input data blocks are convolved with the same convolution kernel; the kernel traverses, with a stride of 1, exactly the 6x6 parts of the original 16 input data blocks, and every element of the input data is fully used. Correspondingly, 16 output data blocks of size 6x6 (i.e. Tr = Tc = 6) are obtained, each composed of nine 2x2 output data tiles (i.e. Trr = Tcc = 2). The PE utilization is thereby raised to 100%, because the input data of the nine stitched input blocks, composed of 16 2x2 input data tiles each, is all effectively computed, producing the 16 output data blocks of size 6x6.
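The split-and-stitch procedure above can be sketched as follows (illustrative sizes; the 4x4 arrangement of same-position tiles into an 8x8 block is inferred from Figs. 10-12):

```python
import numpy as np

# Re-splitting for the stride-1 defect: 16 input blocks whose effective
# region is 6x6 (Tr=Tc=6, Trr=Tcc=2) are each cut into 9 tiles of 2x2;
# the same-position tile from all 16 blocks is stitched into one 8x8
# block, yielding 9 stitched blocks that fill the PE array completely.
W, eff, t = 16, 6, 2                    # blocks, effective size, tile size
blocks = [np.random.rand(eff, eff) for _ in range(W)]

stitched = []
for ti in range(eff // t):              # 3x3 = 9 tile positions
    for tj in range(eff // t):
        tiles = [b[ti*t:(ti+1)*t, tj*t:(tj+1)*t] for b in blocks]
        # arrange the 16 tiles in a 4x4 grid -> one 8x8 stitched block
        rows = [np.hstack(tiles[r*4:(r+1)*4]) for r in range(4)]
        stitched.append(np.vstack(rows))

print(len(stitched), stitched[0].shape)   # 9 stitched blocks of 8x8
```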
The PE utilization under the traditional mapping, consistent with the examples above, is U = (Tr x Tc) / A, where R and C correspond to the two output dimensions in Fig. 1, A is the size of the computing array, and Tr and Tc are the size of the output data block.
Under the parallel convolution mapping mode of the present invention, the PE utilization becomes U = (W x Trr x Tcc) / A, where R and C again correspond to the two output dimensions in Fig. 1, A is the size of the computing array, Tr and Tc are the size of the output data block, Trr and Tcc are the size of the output data tiles in the parallel convolution mapping method, and W is the number of tiles mapped onto the array simultaneously. By using the parallel convolution mapping mode, the present invention maximizes the hardware resource utilization of the PE array.
Fig. 13 is the pseudocode loop expression of the convolution operation of this embodiment. As shown in Fig. 13, in the reconfigurable neural network acceleration method provided by embodiment one, the loops from the inside out are as follows. The four innermost loops, Loop Tm/Tn/Tr/Tc, represent the convolution performed by the convolution compute core unit; what the dashed box outside them describes is the data reuse order. In loop M, every input data block is convolved with all M convolution kernels, generating partial sums for M output channels. In loop N, the N input channels are traversed in turn, the inner computation is repeated, and the partial sums of the M output channels accumulate continuously; data is therefore read and updated constantly until the complete convolution is finished. Loops R and C traverse the remaining portions of each output channel, repeating all of the preceding operations, and finally the complete output data of the M output channels is obtained.
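The tiled loop nest described above can be sketched in software as follows (a minimal model with illustrative sizes, not the hardware implementation; here the single Tr x Tc tile covers the whole output, so loops R and C are omitted). It is checked against a direct convolution:

```python
import numpy as np

# Loop Tm/Tn/Tr/Tc tiled convolution: Tn input channels of a block
# produce Tm output channels of a Tr x Tc block, with partial sums
# accumulating across input-channel tiles (the N loop).
N, M, K, S = 4, 4, 3, 1           # channels, kernels, kernel size, stride
R = C = 6                          # output feature-map size
H = L = (R - 1) * S + K            # input size: Th = (Tr-1)S + K = 8
Tn, Tm, Tr, Tc = 2, 2, 6, 6        # tile sizes (Tr = R, Tc = C here)
x = np.random.rand(N, H, L)
w = np.random.rand(M, N, K, K)

out = np.zeros((M, R, C))
for mo in range(0, M, Tm):                 # over convolution groups
    for no in range(0, N, Tn):             # over input blocks (accumulate)
        for m in range(mo, mo + Tm):       # Loop Tm
            for n in range(no, no + Tn):   # Loop Tn
                for r in range(Tr):        # Loop Tr
                    for c in range(Tc):    # Loop Tc
                        patch = x[n, r*S:r*S+K, c*S:c*S+K]
                        out[m, r, c] += np.sum(patch * w[m, n])

# direct reference convolution over all channels at once
ref = np.zeros((M, R, C))
for m in range(M):
    for r in range(R):
        for c in range(C):
            ref[m, r, c] = np.sum(x[:, r:r+K, c:c+K] * w[m])
assert np.allclose(out, ref)
print("tiled convolution matches direct convolution")
```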
In a data reuse pattern, the number of accesses to the storage units is a very important metric. The convolutional layer is first split, as shown in Fig. 3. The input data of the N input channels is divided into n input data blocks; each input data block has Tn input channels and size Th x Tl, where n = N/Tn and N, n, Tn are positive integers. The output data of the M output channels is composed of m output data blocks; each output data block has Tm output channels and size Tr x Tc, where m = M/Tm and M, m, Tm are positive integers, Th = (Tr-1)S + K, Tl = (Tc-1)S + K, K x K is the size of the convolution kernel, and S is the convolution stride. The number of memory accesses MA can then be expressed as:
MA = TI·α_i + TO·α_o + TW·α_w + TPO
where TI, TO and TW are the quantities of input data, output data and weight data respectively, α_i, α_o and α_w are the respective reuse counts of input data, output data and weight data, and TPO is the total quantity of output data produced by pooling.
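A small numeric illustration of the MA expression above (only the formula comes from the text; the data quantities and reuse counts below are made up):

```python
# MA = TI*a_i + TO*a_o + TW*a_w + TPO, with illustrative counts.
TI, TO, TW, TPO = 1024, 512, 256, 128   # data quantities (illustrative)
a_i, a_o, a_w = 2, 4, 8                 # reuse counts (illustrative)
MA = TI * a_i + TO * a_o + TW * a_w + TPO
print(MA)                               # total memory accesses
```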
In the method for above-mentioned input data multiplexing, it is to the Buffer coefficients accessed accordingly:
Because input deposit unit reads and writes n=N/Tn input block successively, final in order to obtain as a result, rolling up
Product operation when, need to have traversed total N number of Channel of input data, thus convolutional calculation nuclear unit just need ceaselessly to do it is tired
Add, wherein reading and writing operation needs n-1 times, it is contemplated that read and write is each primary, therefore should be multiplied by 2 times, i.e., 2 (n-1) are secondary.And for
The importing number of weight is corresponding with n, it is contemplated that the factors such as coincidence in step-length and convolution kernel size and convolution kernel moving process,
Final parameter is
And for the access coefficient of DRAM, BoAnd BwRepresent that the storage of output buffer unit and weight buffer unit is big respectively
Small, if when the size for exporting buffer unit is bigger than MTrTc data volume, there is no need to additionally occupy the storage energy of DRAM
Power, do not need at this time access DRAM, therefore coefficient be 1, it is on the contrary then need access DRAM;Likewise, weight is stored also such as
This.
The reconfigurable neural network acceleration method provided by embodiment one copes with neural networks of various depths through a layer-by-layer acceleration strategy, and optimizes the acceleration method through loop transformation. It reduces the number of accesses to the Buffer and to DRAM, solves the prior-art problem that frequent memory accesses cause wasted power, and has the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
Embodiment two: based on the same inventive concept as the reconfigurable neural network acceleration method above, this embodiment also provides a reconfigurable neural network acceleration architecture, as described below. Since the principle by which this architecture solves the problem is similar to the reconfigurable neural network acceleration method of embodiment one, its implementation may refer to the implementation of that method, and repeated passages are not restated.
As shown in Fig. 14, the reconfigurable neural network acceleration architecture provided by this embodiment includes: an input buffer unit 1, a weight buffer unit 2, a convolution compute core unit 3 and an output buffer unit 4.
The input buffer unit 1 divides the input data of the N input channels into n input data blocks, each with Tn input channels, and sends each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.
The weight buffer unit 2 divides the M convolution kernels into m convolution groups, each group containing Tm convolution kernels and each kernel having N convolution channels, and sends each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.
The convolution compute core unit 3 convolves the a-th input data block it has read with each convolution group in turn, generating an output data block with Tm output channels, until the convolution with all m convolution groups is finished and an output data block with M output channels is generated; it accumulates the stored output data block of the M output channels fed back by the output buffer unit with the generated output data block of the M output channels, and sends the accumulated output data block. The stored output data block is the accumulation generated, before the a-th input data block is read, by convolving the 1st to (a-1)-th input data blocks with each convolution group in turn.
The output buffer unit 4 stores the received accumulated output data block as the stored output data block of the M output channels, and feeds the stored output data block back to the convolution compute core unit; when a = n, the output buffer unit stores the complete output data of the M output channels, where a ≤ n and a is a positive integer.
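The accumulation the compute core performs can be sketched as follows (illustrative sizes, one output channel for brevity): the output block produced by input block a is added to the stored block generated by blocks 1..a-1, so after all n blocks the result equals a convolution over all N input channels.

```python
import numpy as np

# Accumulating partial sums over input blocks read one by one.
N, Tn, K = 8, 2, 3
x = np.random.rand(N, 8, 8)
w = np.random.rand(N, K, K)          # one output channel, N conv channels

def conv(chans, kerns):              # valid convolution summed over channels
    R = x.shape[1] - K + 1
    out = np.zeros((R, R))
    for r in range(R):
        for c in range(R):
            out[r, c] = np.sum(chans[:, r:r+K, c:c+K] * kerns)
    return out

stored = np.zeros((6, 6))            # output block fed back and updated
for a in range(0, N, Tn):            # read the n = N/Tn input blocks
    stored += conv(x[a:a+Tn], w[a:a+Tn])   # accumulate the partial sums

assert np.allclose(stored, conv(x, w))
print("accumulated partial sums equal the full convolution")
```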
Further, the input buffer unit 1 is specifically configured to send each input data block in turn along the Z-axis direction.
Further, as shown in Fig. 14, the convolution compute core unit 3 includes: an input register unit 31, a compute engine unit 32 and an output register unit 33.
The input register unit 31 reads the a-th input data block from the input buffer unit and sends it to the compute engine unit.
The compute engine unit 32 convolves the Tn input channels of the a-th input data block it has read with the Tn convolution channels of the Tm convolution kernels of each convolution group in turn, generating an output data block with Tm output channels, until the convolution with all m convolution groups is finished, and sends the generated output data block with M output channels.
The output register unit 33 accumulates the stored output data block of the M output channels fed back by the output buffer unit with the generated output data block of the M output channels, and sends the accumulated output data block; the stored output data block is the accumulation generated, before the a-th input data block is read, by convolving the 1st to (a-1)-th input data blocks with each convolution group in turn.
The reconfigurable neural network acceleration method and architecture provided by the above embodiment, through the architecture of input buffer unit, weight buffer unit, convolution compute core unit and output buffer unit, cope with neural networks of various depths using the input data reuse method and a layer-by-layer acceleration strategy, and optimize the acceleration method through loop transformation. They reduce the number of accesses to the Buffer and to DRAM, solve the prior-art problem that frequent memory accesses cause wasted power, and have the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
Embodiment three: to overcome the defects of the prior art, this embodiment further provides a reconfigurable neural network acceleration method that uses an output data reuse mode. As shown in Fig. 15, the reconfigurable neural network acceleration method includes:
S201: The input buffer unit divides the input data of the N input channels into n input data blocks; each input data block has Tn input channels; each input data block is sent in turn; where n = N/Tn and N, n, Tn are positive integers.
S202: The weight buffer unit divides the M convolution kernels into m convolution groups; each convolution group has Tm convolution kernels, and each convolution kernel has N convolution channels; each convolution group is sent in turn; where m = M/Tm and M, m, Tm, N are positive integers.
S203: The convolution compute core unit convolves each input data block it reads with the b-th convolution group it has read in turn, generating an output data block with Tm output channels, until the convolution with the n-th input data block is finished and the complete output data of the Tm output channels is generated; it accumulates the partial-channel output data stored in the convolution compute core unit with the generated complete output data of the Tm output channels, and generates and stores the accumulated partial-channel output data. The stored partial-channel output data is the accumulation generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups with each input data block in turn. When b = m, the convolution compute core unit sends the accumulated output data of the M channels, where b ≤ m and b is a positive integer; it pools the received output data of the M output channels and sends the pooled output data.
S204: The output buffer unit receives and stores the pooled output data, generating the pooled output data of the M output channels.
In the reconfigurable neural network acceleration method provided by this embodiment, the data to be sent is divided into input data blocks, and the input data blocks are sent in order to the convolution compute core unit; the convolution compute core unit convolves the n input data blocks in turn with the same convolution group, generating the complete output data of the Tm output channels. Repeating these operations convolves the n input data blocks with each of the m convolution groups, continuously accumulating the generated complete output data of the Tm output channels until the complete output data of the M output channels is obtained. Through this data reuse and layer-by-layer acceleration strategy, the reconfigurable neural network acceleration method of this embodiment copes with neural networks of various depths, optimizes the neural network, and reduces energy consumption.
Further, when the input buffer unit sends each input data block in turn, it may send the blocks in turn along the XY plane.
In specific implementation, as shown in Fig. 16, the input data is a three-dimensional structure with N input channels (the Z-axis direction) and a per-channel size of H x L (the XY plane). The input data of each input channel is divided into input data blocks of size Th x Tl; the n input data blocks are read in turn along the XY-plane direction and sent to the convolution compute core unit for the convolution operation. As shown in Fig. 16, the 1st to i-th input data blocks are sent first, then the (i+1)-th to (ki)-th, then the (ki+1)-th, and so on, until the n-th input data block is sent, where n, i, i+1, ..., ki, ki+1, ... are positive integers.
Further, the convolution compute core unit convolving each input data block it reads with the b-th convolution group it has read includes: the convolution compute core unit convolves the Tn input channels of each input data block with the Tn convolution channels of the b-th convolution group in turn; the Tn input channels and the Tn convolution channels of each convolution kernel are paired one-to-one for the convolution operation.
In one embodiment, as shown in Fig. 17, the method further includes:
S205: Judge whether the stride of the current convolution kernel is greater than 1.
S206: If so, map the input data blocks onto the PE array in an interleaved fashion and convolve them with the same convolution kernel.
S207: If not, and the size of the output data block is smaller than the size of the input data block, divide each input data block into W small input data tiles of equal size, re-stitch the tiles at the corresponding positions of each input data block to generate W stitched input data blocks of equal size, and map the W stitched input data blocks onto the PE array for convolution with the same convolution kernel.
For the specific execution procedure, refer to the execution procedure of steps S105-S107 in embodiment one.
Fig. 18 is the pseudocode loop expression of the convolution operation of this embodiment. As shown in Fig. 18, in the reconfigurable neural network acceleration method provided by embodiment three, loop M is outside loop N, which means each convolution group is convolved with the input data of all N input channels to obtain the complete output data of its output channels; there is no need to repeatedly read back the partial output data blocks stored in the output buffer unit. The loops from the inside out are as follows. The four innermost loops, Loop Tm/Tn/Tr/Tc, represent the convolution performed in the convolution compute core unit, which computes an output data block of Tm channels of size Tr x Tc from Tn input data blocks of size Th x Tl; what the dashed box outside them describes is the data reuse order. In loop N, the N input channels are traversed in turn, the inner computation is repeated, and the complete output data of the Tm output channels is accumulated and finally stored in the output buffer unit. In loop M, the input data used before is read in repeatedly to complete the computation of all M output channels. Loops R and C traverse the remaining portions of the output channels, repeating all of the preceding operations, and finally the complete output data of the M output channels is obtained.
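The output-reuse ordering of Fig. 18 can be sketched as follows (a minimal model with illustrative sizes): for each convolution group, the partial sums of its Tm output channels stay resident while all N input channels stream through, so the input blocks are re-read once per group but each output is written back exactly once.

```python
import numpy as np

# Loop M outside loop N: partial sums for one convolution group stay in
# the compute core while the input blocks stream past.
N, M, Tn, Tm, K = 4, 4, 2, 2, 3
x = np.random.rand(N, 8, 8)
w = np.random.rand(M, N, K, K)
R = 8 - K + 1

reads_of_input = 0
out = np.zeros((M, R, R))
for mo in range(0, M, Tm):               # loop M: one convolution group
    psum = np.zeros((Tm, R, R))          # stays in the compute core
    for no in range(0, N, Tn):           # loop N: stream input blocks
        reads_of_input += 1
        for m in range(Tm):
            for r in range(R):
                for c in range(R):
                    psum[m, r, c] += np.sum(
                        x[no:no+Tn, r:r+K, c:c+K] * w[mo+m, no:no+Tn])
    out[mo:mo+Tm] = psum                 # written back exactly once

# check against a direct convolution over all channels
ref = np.array([[[np.sum(x[:, r:r+K, c:c+K] * w[m])
                  for c in range(R)] for r in range(R)] for m in range(M)])
assert np.allclose(out, ref)
print(reads_of_input)                    # (M/Tm)*(N/Tn) = 4 block reads
```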
In the output data reuse method above, the corresponding Buffer access coefficients and the corresponding DRAM access coefficients are as follows: B_i denotes the storage size of the input buffer unit; if the input buffer unit can hold all n input data blocks, only one access is needed. N is the number of input channels, Th x Tl is the size of the input data blocks, M is the number of convolution kernels, and each convolution group has Tm convolution kernels.
The reconfigurable neural network acceleration method provided by embodiment three copes with neural networks of various depths through a layer-by-layer acceleration strategy, and optimizes the acceleration method through loop transformation. It reduces the number of accesses to the Buffer and to DRAM, solves the prior-art problem that frequent memory accesses cause wasted power, and has the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
Embodiment four: based on the same inventive concept as the reconfigurable neural network acceleration method above, the present invention also provides a reconfigurable neural network acceleration architecture, as described below. Since the principle by which this architecture solves the problem is similar to the reconfigurable neural network acceleration method of embodiment three, its implementation may refer to the implementation of that method, and repeated passages are not restated.
As shown in Fig. 19, the reconfigurable neural network acceleration architecture provided by this embodiment includes: an input buffer unit 1, a weight buffer unit 2, a convolution compute core unit 3 and an output buffer unit 4.
The input buffer unit 1 divides the input data of the N input channels into n input data blocks, each with Tn input channels, and sends each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.
The weight buffer unit 2 divides the M convolution kernels into m convolution groups, each group containing Tm convolution kernels and each kernel having N convolution channels, and sends each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.
The convolution compute core unit 3 convolves each input data block it reads with the b-th convolution group it has read in turn, generating an output data block with Tm output channels, until the convolution with the n-th input data block is finished and the complete output data of the Tm output channels is generated; it accumulates the partial-channel output data stored in the convolution compute core unit with the generated complete output data of the Tm output channels, and generates and stores the accumulated partial-channel output data. The stored partial-channel output data is the accumulation generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups with each input data block in turn. When b = m, the convolution compute core unit sends the accumulated output data of the M channels, where b ≤ m and b is a positive integer; it pools the received output data of the M output channels and sends the pooled output data.
The output buffer unit 4 receives and stores the pooled output data, generating the pooled output data of the M output channels.
Further, the input buffer unit 1 is specifically configured to send each input data block in turn along the XY plane.
Further, as shown in Fig. 19, the convolution compute core unit 3 includes: an input register unit 31, a compute engine unit 32, an output register unit 33 and a pooling unit 34.
The input register unit 31 reads each input data block one by one from the input buffer unit and sends it to the compute engine unit.
The compute engine unit 32 convolves the Tn input channels of each input data block it reads with the Tn convolution channels of the Tm convolution kernels of the b-th convolution group in turn, generating an output data block with Tm output channels, until the convolution with the n-th input data block is finished and the complete output data of the Tm output channels is generated; it accumulates the partial-channel output data fed back by the output register unit with the generated complete output data of the Tm output channels, and generates and sends the accumulated partial-channel output data. The partial-channel output data fed back by the output register unit is the accumulation generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups with each input data block in turn.
The output register unit 33 stores the received accumulated partial-channel output data, and feeds the stored partial-channel output data back to the compute engine unit; when b = m, the output register unit sends the accumulated output data of the M channels, where b ≤ m and b is a positive integer.
The pooling unit 34 pools the received output data of the M output channels and sends the pooled output data.
The reconfigurable neural network acceleration method and architecture provided by the above embodiment, through the architecture of input buffer unit, weight buffer unit, convolution compute core unit and output buffer unit, cope with neural networks of various depths using the output data reuse method and a layer-by-layer acceleration strategy, and optimize the acceleration method through loop transformation. They reduce the number of accesses to the Buffer and to DRAM, solve the prior-art problem that frequent memory accesses cause wasted power, and have the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
Embodiment five: to overcome the defects of the prior art, this embodiment further provides a reconfigurable neural network acceleration method that uses a weight data reuse mode. As shown in Fig. 20, the reconfigurable neural network acceleration method includes:
S301: The input buffer unit divides the input data of the N input channels into n input data blocks; each input data block has Tn input channels; each input data block is sent in turn; where n = N/Tn and N, n, Tn are positive integers.
S302: The weight buffer unit divides the M convolution kernels into m convolution groups; each convolution group has Tm convolution kernels, and each convolution kernel has N convolution channels; each convolution group is sent in turn; where m = M/Tm and M, m, Tm, N are positive integers.
S303: The convolution compute core unit convolves each input data block it reads with the b-th convolution group it has read in turn, generating an output data block with Tm output channels, until the convolution with the n-th input data block is finished and the output data of the Tm output channels is generated; it accumulates the partial-channel output data fed back by the output buffer unit with the generated output data of the Tm output channels, and sends the accumulated partial-channel output data. The fed-back partial-channel output data is the accumulation generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups with each input data block in turn.
S304: The output buffer unit stores the received accumulated partial-channel output data, and feeds the stored partial-channel output data back to the convolution compute core unit; when b = m, the output buffer unit stores the complete output data of the M output channels, where b ≤ m and b is a positive integer.
In the reconfigurable neural network acceleration method provided by this embodiment, the input data is divided into input data blocks, and each input data block is convolved in turn with the same convolution group to generate the complete output data of the Tm output channels. Repeating these operations convolves the n input data blocks with each of the m convolution groups, continuously accumulating the generated complete output data of the Tm output channels until the complete output data of the M output channels is obtained. Through this data reuse and layer-by-layer acceleration strategy, the reconfigurable neural network acceleration method of this embodiment copes with neural networks of various depths, optimizes the neural network, and reduces energy consumption.
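The weight-reuse schedule of this embodiment can be sketched as an access count (illustrative numbers): each of the m convolution groups is loaded once and kept resident while all n input blocks stream past it, so every weight is fetched from the buffer a single time while the input blocks are re-streamed once per group.

```python
# Access-count sketch of the weight-reuse schedule: convolution groups
# on the outer loop, input blocks on the inner loop.
n, m = 4, 8                      # input blocks and convolution groups
weight_loads = input_reads = 0
for b in range(m):               # outer loop: the b-th convolution group
    weight_loads += 1            # group loaded once, then reused
    for a in range(n):           # inner loop: every input block
        input_reads += 1         # blocks are re-streamed per group
print(weight_loads, input_reads) # m weight loads, n*m input-block reads
```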
Further, when the input buffer unit sends each input data block in turn, it may send the blocks in turn along the Z-axis direction.
In specific implementation, as shown in Fig. 4, the input data is a three-dimensional structure with N input channels (the Z-axis direction) and a per-channel size of H x L (the XY plane). The input data of each input channel is divided into input data blocks of size Th x Tl; the n input data blocks are read in turn along the Z-axis direction and sent to the convolution compute core unit for the convolution operation. As shown in Fig. 4, the 1st to i-th input data blocks are sent first, then the (i+1)-th to 2i-th, and so on, until the n-th input data block is sent, where n, i, i+1, ... are positive integers.
Further, when the convolution compute core unit convolves each input data block it reads with the b-th convolution group it has read, it may convolve the Tn input channels of each input data block with the Tn convolution channels of the b-th convolution group in turn; the Tn input channels and the Tn convolution channels of each convolution kernel are paired one-to-one for the convolution operation.
Further, as shown in Fig. 21, the method further includes:
S305: judging whether the stride of the current convolution kernel is greater than 1.
S306: if so, mapping the input data block to the PE array in an interleaved manner and convolving it with the same convolution kernel.
S307: if not, when the size of the output data block is smaller than the size of the input data block, dividing each input data block into W small input data pieces of identical size, re-stitching the small input data pieces at corresponding positions of each input data block to generate W stitched input data blocks of identical size, and mapping the W stitched input data blocks to the PE array for convolution with the same convolution kernel.
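The piece-splitting of step S307 can be illustrated roughly as follows; the quadrant split with W = 4 is only an assumed example, since the exact piece geometry is defined in Embodiment 1:

```python
import numpy as np

# Hedged sketch of an S307-style split: cut one input tile into W = 4
# equally sized small pieces (here, quadrants); the pieces at corresponding
# positions across blocks would then be stitched into W same-sized blocks
# and mapped to the PE array. The real piece geometry is per Embodiment 1.
def split_into_quadrants(tile):
    h, w = tile.shape
    hh, hw = h // 2, w // 2
    return [tile[:hh, :hw], tile[:hh, hw:],   # top-left, top-right
            tile[hh:, :hw], tile[hh:, hw:]]   # bottom-left, bottom-right

tile = np.arange(16).reshape(4, 4)
pieces = split_into_quadrants(tile)
print(len(pieces))        # 4 pieces
print(pieces[0].shape)    # (2, 2)
```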
For the specific implementation process, refer to steps S105–S107 in Embodiment 1.
Fig. 22 shows the pseudocode loop expression of the convolution operation of this embodiment. As shown in Fig. 22, in the reconfigurable neural network acceleration method provided in Embodiment 5, the convolution computation core unit sends the input data blocks of the Tn input channels in sequence to the input register unit. Each input data block is multiplied by the Tm convolution kernels, generating partial sums of the output data of the Tm output channels. Because loop R and loop C are on the inside, the Tm convolution kernels, each with Tn convolution channels, can be fully reused while the Tn channels of the n input data blocks are traversed, yielding the partial sums of the Tm output data (each of size R × C). The partial sums of the Tm output data generated from each input data block are accumulated with the partial sums of the Tm output data generated from the next input data block, until all the output data of the M output channels are obtained. From the inside out, the loops are as follows: the innermost four loops, Loop Tm/Tn/Tr/Tc, represent the computation performed in the convolution core (Convolution Core) of Fig. 6, in which Tm output tiles of size Tr × Tc are computed from Tn input tiles of size Th × Tl; the description outside the dashed box is the data reuse order. Loops R and C traverse the remaining portions of the output channels, repeating all the operations of the inner loops, so the weights are fully reused. In loop N, the N input channels are traversed in turn, repeating the inner-loop computation and completing all the accumulations of the output data of the Tm output channels, which are finally stored in the output buffer unit. In loop M, the input data used before are read in repeatedly to complete the computation of all M convolution kernels and finally obtain all the output data of the M output channels.
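As a rough illustration of this loop order, the nest can be sketched in plain Python (a naive software reference with assumed example sizes, not the hardware implementation; loop M is outermost, then N, then R/C, with the Tm/Tn/Tr/Tc tile loops innermost):

```python
import numpy as np

# Naive reference for the Fig. 22 loop nest (all sizes are assumed examples).
# Outer loops M and N step over kernel groups (Tm) and channel groups (Tn);
# loops R and C step over output tiles (Tr x Tc); the innermost four loops
# do the tile computation that the Convolution Core performs.
M, N, R, C, K = 4, 4, 4, 4, 3          # kernels, channels, output size, kernel
Tm, Tn, Tr, Tc = 2, 2, 2, 2
inp = np.random.rand(N, R + K - 1, C + K - 1)
wts = np.random.rand(M, N, K, K)
out = np.zeros((M, R, C))

for mo in range(0, M, Tm):             # loop M: input data re-read per group
    for no in range(0, N, Tn):         # loop N: accumulate over channel groups
        for ro in range(0, R, Tr):     # loop R
            for co in range(0, C, Tc): # loop C: weights reused across tiles
                for m in range(mo, mo + Tm):            # Loop Tm
                    for n in range(no, no + Tn):        # Loop Tn
                        for r in range(ro, ro + Tr):    # Loop Tr
                            for c in range(co, co + Tc):  # Loop Tc
                                out[m, r, c] += np.sum(
                                    inp[n, r:r + K, c:c + K] * wts[m, n])

# The tiled result matches an untiled direct convolution.
ref = np.zeros_like(out)
for m in range(M):
    for n in range(N):
        for r in range(R):
            for c in range(C):
                ref[m, r, c] += np.sum(inp[n, r:r + K, c:c + K] * wts[m, n])
print(np.allclose(out, ref))
```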
In the weight-data reuse method, the corresponding number of Buffer accesses is:
The corresponding number of DRAM accesses is:
The reconfigurable neural network acceleration method provided in Embodiment 5 adapts to neural networks with various numbers of layers through the layer-by-layer acceleration strategy, and optimizes the acceleration method through loop transformation, reducing the number of accesses to the Buffer and to DRAM. This solves the prior-art problem that frequent memory accesses waste power, with the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
Embodiment 6: Based on the same inventive concept as the above reconfigurable neural network acceleration method, the present invention also provides a reconfigurable neural network acceleration architecture, as described below. Since the principle by which this reconfigurable neural network acceleration architecture solves the problem is similar to that of the reconfigurable neural network acceleration method of Embodiment 5, for its implementation reference may be made to the implementation of the method of Embodiment 5, and repeated description is omitted.
As shown in Fig. 23, the reconfigurable neural network acceleration architecture provided in this embodiment includes: an input buffer unit 1, a weight buffer unit 2, a convolution computation core unit 3 and an output buffer unit 4.
The input buffer unit 1 is configured to divide the input data of N input channels into n input data blocks, each input data block having Tn input channels, and to send each input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers.
The weight buffer unit 2 is configured to divide M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and to send each convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers.
The convolution computation core unit 3 is configured to convolve each input data block read thereby in turn with the b-th convolution group read thereby, generating output data blocks having Tm output channels, until convolution with the n-th input data block is finished, thereby generating the output data of the Tm output channels; to accumulate the partial channel output data fed back by the output buffer unit with the generated output data of the Tm output channels; and to send the accumulated partial channel output data. The fed-back partial channel output data are generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups in turn with each input data block and accumulating the results.
The output buffer unit 4 is configured to store the received accumulated partial channel output data as the partial channel output data, and to feed the stored partial channel output data back to the convolution computation core unit. When b = m, the output buffer unit stores all the output data of the M output channels; wherein b ≤ m, and b is a positive integer.
Further, the input buffer unit is specifically configured to send each input data block in sequence along the Z-axis direction.
Further, as shown in Fig. 23, the convolution computation core unit 3 includes: an input register unit 31, a computing engine unit 32 and an output register unit 33.
The input register unit 31 is configured to read each input data block one by one from the input buffer unit and send the input data block to the computing engine unit.
The computing engine unit 32 is configured to convolve the Tn input channels of each input data block read thereby in turn with the Tn convolution channels of the Tm convolution kernels of the b-th convolution group read thereby, generating output data blocks having Tm output channels, until convolution with the n-th input data block is finished, and to send the generated output data of the Tm output channels.
The output register unit 33 is configured to accumulate the partial channel output data fed back by the output buffer unit with the generated output data of the Tm output channels, and to send the accumulated partial channel output data; wherein the fed-back partial channel output data are generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups in turn with each input data block and accumulating the results.
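One reading of the feedback loop between the computing engine and the output register/buffer units is a running partial-sum accumulation over the n input data blocks, sketched below with made-up shapes (the real units are hardware buffers, not Python objects):

```python
import numpy as np

# Hedged sketch of the partial-sum accumulation for one convolution group:
# each of the n input data blocks contributes a partial sum for the same
# Tm output channels; the buffer feeds its running total back, and the
# core adds the new partial sum to it. Shapes are made-up examples.
def run_group(partial_sums):
    stored = None                          # held in the output buffer unit
    for part in partial_sums:              # one partial sum per input block
        stored = part if stored is None else stored + part  # core accumulates
    return stored                          # output data of the Tm channels

n_blocks = [np.full((2, 3, 3), v) for v in (1.0, 2.0, 3.0)]  # Tm=2 channels
total = run_group(n_blocks)
print(float(total[0, 0, 0]))               # 6.0 = 1 + 2 + 3
```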
The reconfigurable neural network acceleration method and architecture provided by the above embodiments, through the architecture of the input buffer unit, weight buffer unit, convolution computation core unit and output buffer unit, adopt the weight-data reuse method and the layer-by-layer acceleration strategy to adapt to neural networks with various numbers of layers, and optimize the acceleration method through loop transformation. This reduces the number of accesses to the Buffer and to DRAM and solves the prior-art problem that frequent memory accesses waste power, with the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
The reconfigurable neural network acceleration method and architecture provided by the present invention, through the architecture of the input buffer unit, weight buffer unit, convolution computation core unit and output buffer unit, respectively adopt the input-data reuse method, the output-data reuse method and the weight-data reuse method, use the layer-by-layer acceleration strategy to adapt to neural networks with various numbers of layers, and optimize the acceleration method through loop transformation. This reduces the number of accesses to the Buffer and to DRAM and solves the prior-art problem that frequent memory accesses waste power, with the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Specific examples are applied herein to expound the principles and embodiments of the present invention; the description of the above embodiments is merely intended to help understand the method of the present invention and its core concept. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific embodiments and the scope of application according to the concept of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (21)
1. A reconfigurable neural network acceleration method, characterized in that the method comprises:
dividing, by an input buffer unit, the input data of N input channels into n input data blocks, each said input data block having Tn input channels, and sending each said input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
dividing, by a weight buffer unit, M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and sending each said convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
convolving, by a convolution computation core unit, the a-th input data block read thereby in turn with each said convolution group, generating output data blocks having Tm output channels, until convolution with the m-th convolution group is finished, thereby generating an output data block having M output channels; accumulating the stored output data block of M output channels fed back by an output buffer unit with the generated output data block of the M output channels, and sending the accumulated output data block; wherein the stored output data block is generated, before the a-th input data block is read, by convolving the 1st to (a-1)-th input data blocks in turn with each said convolution group and accumulating the results;
storing, by the output buffer unit, the received accumulated output data block as the stored output data block of M output channels, and feeding the stored output data block back to the convolution computation core unit; when a = n, the output buffer unit stores all the output data of the M output channels; wherein a ≤ n, and a is a positive integer.
2. The reconfigurable neural network acceleration method according to claim 1, characterized in that sending each said input data block in sequence comprises: sending each said input data block in sequence along the Z-axis direction.
3. The reconfigurable neural network acceleration method according to claim 1, characterized in that convolving, by the convolution computation core unit, the a-th input data block read thereby in turn with each said convolution group comprises: convolving, by the convolution computation core unit, the Tn input channels of the a-th input data block read thereby in turn with the Tn convolution channels of each said convolution group; wherein the Tn input channels correspond one-to-one with the Tn convolution channels of each said convolution kernel for the convolution operation.
4. The reconfigurable neural network acceleration method according to claim 3, characterized in that the method further comprises:
judging whether the stride of the current convolution kernel is greater than 1;
if so, mapping the input data block to a PE array in an interleaved manner and convolving it with the same convolution kernel;
if not, when the size of the output data block is smaller than the size of the input data block, dividing each said input data block into W small input data pieces of identical size, re-stitching the small input data pieces at corresponding positions of each input data block to generate W stitched input data blocks of identical size, and mapping the W stitched input data blocks to the PE array for convolution with the same convolution kernel.
5. A reconfigurable neural network acceleration architecture, characterized by comprising: an input buffer unit, a weight buffer unit, a convolution computation core unit and an output buffer unit;
the input buffer unit being configured to divide the input data of N input channels into n input data blocks, each said input data block having Tn input channels, and to send each said input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
the weight buffer unit being configured to divide M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and to send each said convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
the convolution computation core unit being configured to convolve the a-th input data block read thereby in turn with each said convolution group, generating output data blocks having Tm output channels, until convolution with the m-th convolution group is finished, thereby generating an output data block having M output channels; to accumulate the stored output data block of M output channels fed back by the output buffer unit with the generated output data block of the M output channels; and to send the accumulated output data block; wherein the stored output data block is generated, before the a-th input data block is read, by convolving the 1st to (a-1)-th input data blocks in turn with each said convolution group and accumulating the results;
the output buffer unit being configured to store the received accumulated output data block as the stored output data block of M output channels, and to feed the stored output data block back to the convolution computation core unit; when a = n, the output buffer unit stores all the output data of the M output channels; wherein a ≤ n, and a is a positive integer.
6. The reconfigurable neural network acceleration architecture according to claim 5, characterized in that the input buffer unit is specifically configured to: send each said input data block in sequence along the Z-axis direction.
7. The reconfigurable neural network acceleration architecture according to claim 5, characterized in that the convolution computation core unit comprises: an input register unit, a computing engine unit and an output register unit;
the input register unit being configured to read the a-th input data block from the input buffer unit and send the a-th input data block to the computing engine unit;
the computing engine unit being configured to convolve the Tn input channels of the a-th input data block read thereby in turn with the Tn convolution channels of the Tm convolution kernels of each convolution group, generating output data blocks having Tm output channels, until convolution with the m-th convolution group is finished, and to send the generated output data block having M output channels;
the output register unit being configured to accumulate the stored output data block of M output channels fed back by the output buffer unit with the generated output data block of the M output channels, and to send the accumulated output data block; wherein the stored output data block is generated, before the a-th input data block is read, by convolving the 1st to (a-1)-th input data blocks in turn with each said convolution group and accumulating the results.
8. A reconfigurable neural network acceleration method, characterized in that the method comprises:
dividing, by an input buffer unit, the input data of N input channels into n input data blocks, each said input data block having Tn input channels, and sending each said input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
dividing, by a weight buffer unit, M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and sending each said convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
convolving, by a convolution computation core unit, each input data block read thereby in turn with the b-th convolution group read thereby, generating output data blocks having Tm output channels, until convolution with the n-th input data block is finished, thereby generating all the output data of the Tm output channels; accumulating the partial channel output data stored by the convolution computation core unit with all the generated output data of the Tm output channels, and generating and storing the accumulated partial channel output data; wherein the stored partial channel output data are generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups in turn with each said input data block and accumulating the results; when b = m, sending, by the convolution computation core unit, the accumulated output data of the M channels; wherein b ≤ m, and b is a positive integer; performing pooling on the received output data of the M output channels, and sending the pooled output data;
receiving and storing, by an output buffer unit, the pooled output data, thereby generating the pooled output data of the M output channels.
9. The reconfigurable neural network acceleration method according to claim 8, characterized in that sending each said input data block in sequence comprises: sending each said input data block in sequence along the X/Y plane.
10. The reconfigurable neural network acceleration method according to claim 8, characterized in that convolving, by the convolution computation core unit, each input data block read thereby in turn with the b-th convolution group read thereby comprises: convolving, by the convolution computation core unit, the Tn input channels of each input data block read thereby in turn with the Tn convolution channels of the b-th convolution group read thereby; wherein the Tn input channels correspond one-to-one with the Tn convolution channels of each convolution kernel for the convolution operation.
11. The reconfigurable neural network acceleration method according to claim 10, characterized in that the method further comprises:
judging whether the stride of the current convolution kernel is greater than 1;
if so, mapping the input data block to a PE array in an interleaved manner and convolving it with the same convolution kernel;
if not, when the size of the output data block is smaller than the size of the input data block, dividing each said input data block into W small input data pieces of identical size, re-stitching the small input data pieces at corresponding positions of each input data block to generate W stitched input data blocks of identical size, and mapping the W stitched input data blocks to the PE array for convolution with the same convolution kernel.
12. A reconfigurable neural network acceleration architecture, characterized by comprising: an input buffer unit, a weight buffer unit, a convolution computation core unit and an output buffer unit;
the input buffer unit being configured to divide the input data of N input channels into n input data blocks, each said input data block having Tn input channels, and to send each said input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
the weight buffer unit being configured to divide M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and to send each said convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
the convolution computation core unit being configured to convolve each input data block read thereby in turn with the b-th convolution group read thereby, generating output data blocks having Tm output channels, until convolution with the n-th input data block is finished, thereby generating all the output data of the Tm output channels; to accumulate the partial channel output data stored by the convolution computation core unit with all the generated output data of the Tm output channels, and to generate and store the accumulated partial channel output data; wherein the stored partial channel output data are generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups in turn with each said input data block and accumulating the results; when b = m, the convolution computation core unit sends the accumulated output data of the M channels; wherein b ≤ m, and b is a positive integer; and to perform pooling on the received output data of the M output channels and send the pooled output data;
the output buffer unit being configured to receive and store the pooled output data, thereby generating the pooled output data of the M output channels.
13. The reconfigurable neural network acceleration architecture according to claim 12, characterized in that the input buffer unit is specifically configured to: send each said input data block in sequence along the X/Y plane.
14. The reconfigurable neural network acceleration architecture according to claim 12, characterized in that the convolution computation core unit comprises: an input register unit, a computing engine unit, an output register unit and a pooling unit;
the input register unit being configured to read each input data block one by one from the input buffer unit and send the input data block to the computing engine unit;
the computing engine unit being configured to convolve the Tn input channels of each input data block read thereby in turn with the Tn convolution channels of the Tm convolution kernels of the b-th convolution group read thereby, generating output data blocks having Tm output channels, until convolution with the n-th input data block is finished, thereby generating all the output data of the Tm output channels; and to accumulate the partial channel output data fed back by the output register unit with all the generated output data of the Tm output channels, and to generate and send the accumulated partial channel output data; wherein the partial channel output data fed back by the output register unit are generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups in turn with each said input data block and accumulating the results;
the output register unit being configured to store the received accumulated partial channel output data as the partial channel output data, and to feed the stored partial channel output data back to the computing engine unit; when b = m, the output register unit sends the accumulated output data of the M channels; wherein b ≤ m, and b is a positive integer;
the pooling unit being configured to perform pooling on the received output data of the M output channels and send the pooled output data.
15. A reconfigurable neural network acceleration method, characterized in that the method comprises:
dividing, by an input buffer unit, the input data of N input channels into n input data blocks, each said input data block having Tn input channels, and sending each said input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
dividing, by a weight buffer unit, M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and sending each said convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
convolving, by a convolution computation core unit, each input data block read thereby in turn with the b-th convolution group read thereby, generating output data blocks having Tm output channels, until convolution with the n-th input data block is finished, thereby generating the output data of the Tm output channels; accumulating the partial channel output data fed back by an output buffer unit with the generated output data of the Tm output channels, and sending the accumulated partial channel output data; wherein the fed-back partial channel output data are generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups in turn with each said input data block and accumulating the results;
storing, by the output buffer unit, the received accumulated partial channel output data as the partial channel output data, and feeding the stored partial channel output data back to the convolution computation core unit; when b = m, the output buffer unit stores all the output data of the M output channels; wherein b ≤ m, and b is a positive integer.
16. The reconfigurable neural network acceleration method according to claim 15, characterized in that sending each said input data block in sequence comprises: sending each said input data block in sequence along the Z-axis direction.
17. The reconfigurable neural network acceleration method according to claim 15, characterized in that convolving, by the convolution computation core unit, each input data block read thereby in turn with the b-th convolution group read thereby comprises: convolving, by the convolution computation core unit, the Tn input channels of each input data block read thereby in turn with the Tn convolution channels of the b-th convolution group read thereby; wherein the Tn input channels correspond one-to-one with the Tn convolution channels of each convolution kernel for the convolution operation.
18. The reconfigurable neural network acceleration method according to claim 17, characterized in that the method further comprises:
judging whether the stride of the current convolution kernel is greater than 1;
if so, mapping the input data block to a PE array in an interleaved manner and convolving it with the same convolution kernel;
if not, when the size of the output data block is smaller than the size of the input data block, dividing each said input data block into W small input data pieces of identical size, re-stitching the small input data pieces at corresponding positions of each input data block to generate W stitched input data blocks of identical size, and mapping the W stitched input data blocks to the PE array for convolution with the same convolution kernel.
19. A reconfigurable neural network acceleration architecture, characterized by comprising: an input buffer unit, a weight buffer unit, a convolution compute core unit and an output buffer unit;
the input buffer unit is configured to divide the input data of N input channels into n input data blocks, each input data block having Tn input channels, and to send each input data block in turn, wherein n = N/Tn and N, n, Tn are positive integers;
the weight buffer unit is configured to divide M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and to send each convolution group in turn, wherein m = M/Tm and M, m, Tm, N are positive integers;
the convolution compute core unit is configured to convolve each input data block read in turn with the b-th convolution group that has been read, generating output data blocks with Tm output channels, until the convolution with the n-th input data block is finished and the output data of the Tm output channels is generated; to accumulate the channel partial-sum output data fed back by the output buffer unit with the generated output data of the Tm output channels; and to send the accumulated channel partial-sum output data, wherein the fed-back channel partial-sum output data is generated, before the b-th convolution group is read, by accumulating the results of convolving the 1st to (b-1)-th convolution groups in turn with each input data block;
the output buffer unit is configured to store the received accumulated channel partial-sum output data as channel partial-sum output data, and to feed the stored channel partial-sum output data back to the convolution compute core unit; when b = m, the output buffer unit stores all the output data of the M output channels, wherein b ≤ m and b is a positive integer.
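As a minimal sketch of the tiled dataflow recited in claim 19 (not part of the claims): the loop below tiles N input channels into n blocks of Tn channels and M kernels into m groups of Tm kernels, and accumulates channel partial sums into the output buffer until all input blocks have been consumed. The 1x1 kernels, the flattened spatial axis, and the name `tiled_conv` are simplifying assumptions made for the example.

```python
import numpy as np

def tiled_conv(inputs, kernels, Tn, Tm):
    N, S = inputs.shape            # N input channels, S spatial positions
    M = kernels.shape[0]           # M kernels, each with N convolution channels
    n, m = N // Tn, M // Tm
    out = np.zeros((M, S))         # output buffer: M output channels
    for b in range(m):                          # one convolution group at a time
        kgroup = kernels[b * Tm:(b + 1) * Tm]   # the Tm kernels of group b
        for i in range(n):                      # one input data block at a time
            iblock = inputs[i * Tn:(i + 1) * Tn]        # Tn input channels
            kblock = kgroup[:, i * Tn:(i + 1) * Tn]     # matching Tn channels
            # accumulate channel partial sums into the output buffer
            out[b * Tm:(b + 1) * Tm] += kblock @ iblock
    return out

x = np.random.rand(8, 5)      # N = 8 input channels
w = np.random.rand(4, 8)      # M = 4 kernels
y = tiled_conv(x, w, Tn=2, Tm=2)
```

After the inner loop finishes for group b, rows b*Tm .. (b+1)*Tm-1 of `out` hold the completed output channels of that group; when b = m, all M output channels are complete, matching the untiled result.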
20. The reconfigurable neural network acceleration architecture according to claim 19, characterized in that the input buffer unit is specifically configured to send each input data block in turn along the Z direction.
21. The reconfigurable neural network acceleration architecture according to claim 19, characterized in that the convolution compute core unit comprises: an input register unit, a compute engine unit and an output register unit;
the input register unit is configured to read each input data block one by one from the input buffer unit and to send the input data block to the compute engine unit;
the compute engine unit is configured to convolve the Tn input channels of each input data block read in turn with the Tn convolution channels of the Tm convolution kernels of the b-th convolution group that has been read, generating output data blocks with Tm output channels, until the convolution with the n-th input data block is finished, and to send the generated output data of the Tm output channels;
the output register unit is configured to accumulate the channel partial-sum output data fed back by the output buffer unit with the generated output data of the Tm output channels, and to send the accumulated channel partial-sum output data, wherein the fed-back channel partial-sum output data is generated, before the b-th convolution group is read, by accumulating the results of convolving the 1st to (b-1)-th convolution groups in turn with each input data block.
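A pure-Python sketch of what one compute-engine tile computes (an assumption-level illustration, not the claimed hardware): Tm multiply-accumulate lanes each combine the Tn input channels of one block with the matching Tn convolution channels of one kernel, producing Tm partial output values for one spatial position.

```python
# Sketch of one compute-engine tile: a Tm x Tn grid of multiply-accumulate
# lanes; `engine_tile` and the scalar-per-channel layout are assumptions.
def engine_tile(iblock, kblock):
    Tm, Tn = len(kblock), len(iblock)
    out = [0.0] * Tm
    for tm in range(Tm):            # one lane row per output channel
        acc = 0.0
        for tn in range(Tn):        # MAC across the Tn convolution channels
            acc += kblock[tm][tn] * iblock[tn]
        out[tm] = acc
    return out

# Tn = 2 input channels, Tm = 2 kernels
partial = engine_tile([1.0, 2.0], [[1.0, 0.0], [0.5, 0.5]])
```

The output register unit would then add `partial` to the partial sums fed back by the output buffer unit, exactly as the accumulation step in the claim describes.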
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810084089.2A CN108241890B (en) | 2018-01-29 | 2018-01-29 | Reconfigurable neural network acceleration method and architecture |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108241890A true CN108241890A (en) | 2018-07-03 |
CN108241890B CN108241890B (en) | 2021-11-23 |
Family
ID=62698691
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810084089.2A Active CN108241890B (en) | 2018-01-29 | 2018-01-29 | Reconfigurable neural network acceleration method and architecture |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108241890B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250103A (en) * | 2016-08-04 | 2016-12-21 | Southeast University | System for data reuse in cyclic convolution computation of a convolutional neural network |
Non-Patent Citations (3)
Title |
---|
MA Y, CAO Y, VRUDHULA S, et al.: "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks", Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays * |
LIU Zhiqiang: "Research on Key Technologies of Reconfigurable Accelerators for Deep Learning Algorithms", China Master's Theses Full-text Database, Information Science and Technology * |
LU Zhijian: "Research on FPGA-based Parallel Structures for Convolutional Neural Networks", China Doctoral Dissertations Full-text Database, Information Science and Technology * |
Cited By (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110716751A (en) * | 2018-07-12 | 2020-01-21 | 赛灵思公司 | High-parallelism computing platform, system and computing implementation method |
CN109032781A (en) * | 2018-07-13 | 2018-12-18 | 重庆邮电大学 | A kind of FPGA parallel system of convolutional neural networks algorithm |
CN109844774A (en) * | 2018-08-28 | 2019-06-04 | 深圳鲲云信息科技有限公司 | A kind of parallel deconvolution calculation method, single engine calculation method and Related product |
CN109844774B (en) * | 2018-08-28 | 2023-01-24 | 深圳鲲云信息科技有限公司 | Parallel deconvolution computing method, single-engine computing method and related products |
CN110865950A (en) * | 2018-08-28 | 2020-03-06 | 中科寒武纪科技股份有限公司 | Data preprocessing method and device, computer equipment and storage medium |
CN110865950B (en) * | 2018-08-28 | 2021-01-12 | 中科寒武纪科技股份有限公司 | Data preprocessing method and device, computer equipment and storage medium |
CN109284824B (en) * | 2018-09-04 | 2021-07-23 | 复旦大学 | Reconfigurable technology-based device for accelerating convolution and pooling operation |
CN109284824A (en) * | 2018-09-04 | 2019-01-29 | 复旦大学 | A kind of device for being used to accelerate the operation of convolution sum pond based on Reconfiguration Technologies |
CN110888824A (en) * | 2018-09-07 | 2020-03-17 | 黑芝麻智能科技(上海)有限公司 | Multilevel memory hierarchy |
CN109447257A (en) * | 2018-09-18 | 2019-03-08 | 复旦大学 | A kind of deep neural network of channel self-organizing accelerates the arithmetic unit of chip |
CN109447257B (en) * | 2018-09-18 | 2021-08-17 | 复旦大学 | Operation device of deep neural network acceleration chip with self-organized channels |
CN109447241A (en) * | 2018-09-29 | 2019-03-08 | 西安交通大学 | A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field |
CN109447241B (en) * | 2018-09-29 | 2022-02-22 | 西安交通大学 | Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things |
CN109359735B (en) * | 2018-11-23 | 2020-12-04 | 浙江大学 | Data input device and method for accelerating deep neural network hardware |
CN109359735A (en) * | 2018-11-23 | 2019-02-19 | 浙江大学 | The hardware-accelerated data input device of deep neural network and method |
CN109598338B (en) * | 2018-12-07 | 2023-05-19 | 东南大学 | Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization |
CN109598338A (en) * | 2018-12-07 | 2019-04-09 | 东南大学 | A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA |
CN109740732A (en) * | 2018-12-27 | 2019-05-10 | 深圳云天励飞技术有限公司 | Neural network processor, convolutional neural networks data multiplexing method and relevant device |
CN109711367A (en) * | 2018-12-29 | 2019-05-03 | 北京中科寒武纪科技有限公司 | Operation method, device and Related product |
CN111523652B (en) * | 2019-02-01 | 2023-05-02 | 阿里巴巴集团控股有限公司 | Processor, data processing method thereof and image pickup device |
CN111523652A (en) * | 2019-02-01 | 2020-08-11 | 阿里巴巴集团控股有限公司 | Processor, data processing method thereof and camera device |
CN110110849B (en) * | 2019-04-29 | 2023-04-07 | 西安电子科技大学 | Line fixed data stream mapping method based on graph segmentation |
CN110110849A (en) * | 2019-04-29 | 2019-08-09 | 西安电子科技大学 | Row fixed data stream mapping method based on figure segmentation |
CN110390384A (en) * | 2019-06-25 | 2019-10-29 | 东南大学 | A kind of configurable general convolutional neural networks accelerator |
CN110390384B (en) * | 2019-06-25 | 2021-07-06 | 东南大学 | Configurable general convolutional neural network accelerator |
CN110414672A (en) * | 2019-07-23 | 2019-11-05 | 江苏鼎速网络科技有限公司 | Convolution algorithm method, apparatus and system |
CN112308217A (en) * | 2019-07-31 | 2021-02-02 | 北京欣奕华科技有限公司 | Convolutional neural network acceleration method and system |
CN110516801A (en) * | 2019-08-05 | 2019-11-29 | 西安交通大学 | A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput |
CN110516801B (en) * | 2019-08-05 | 2022-04-22 | 西安交通大学 | High-throughput-rate dynamic reconfigurable convolutional neural network accelerator |
CN110490302A (en) * | 2019-08-12 | 2019-11-22 | 北京中科寒武纪科技有限公司 | A kind of neural network compiling optimization method, device and Related product |
CN110533177B (en) * | 2019-08-22 | 2023-12-26 | 安谋科技(中国)有限公司 | Data read-write device, method, equipment, medium and convolution accelerator |
CN110533177A (en) * | 2019-08-22 | 2019-12-03 | 安谋科技(中国)有限公司 | A kind of data read-write equipment, method, equipment, medium and convolution accelerator |
CN111126593B (en) * | 2019-11-07 | 2023-05-05 | 复旦大学 | Reconfigurable natural language deep convolutional neural network accelerator |
CN111126593A (en) * | 2019-11-07 | 2020-05-08 | 复旦大学 | Reconfigurable natural language deep convolution neural network accelerator |
CN111199273B (en) * | 2019-12-31 | 2024-03-26 | 深圳云天励飞技术有限公司 | Convolution calculation method, device, equipment and storage medium |
CN111199273A (en) * | 2019-12-31 | 2020-05-26 | 深圳云天励飞技术有限公司 | Convolution calculation method, device, equipment and storage medium |
CN111258574A (en) * | 2020-01-14 | 2020-06-09 | 中科驭数(北京)科技有限公司 | Programming method and system for accelerator architecture |
CN111258574B (en) * | 2020-01-14 | 2021-01-15 | 中科驭数(北京)科技有限公司 | Programming method and system for accelerator architecture |
US11423292B2 (en) | 2020-02-15 | 2022-08-23 | Industrial Technology Research Institute | Convolutional neural-network calculating apparatus and operation methods thereof |
CN111427895B (en) * | 2020-04-01 | 2022-10-25 | 西安交通大学 | Neural network reasoning acceleration method based on two-segment cache |
CN111427895A (en) * | 2020-04-01 | 2020-07-17 | 西安交通大学 | Neural network reasoning acceleration method based on two-segment cache |
CN111610963A (en) * | 2020-06-24 | 2020-09-01 | 上海西井信息科技有限公司 | Chip structure and multiply-add calculation engine thereof |
CN111610963B (en) * | 2020-06-24 | 2021-08-17 | 上海西井信息科技有限公司 | Chip structure and multiply-add calculation engine thereof |
CN111859797A (en) * | 2020-07-14 | 2020-10-30 | Oppo广东移动通信有限公司 | Data processing method and device and storage medium |
CN112580774A (en) * | 2020-09-01 | 2021-03-30 | 浙江大学 | Neural network layout method for reconfigurable neural network processor |
CN114089911A (en) * | 2021-09-07 | 2022-02-25 | 上海新氦类脑智能科技有限公司 | Block segmentation splicing processing method, device, equipment and medium based on data multiplexing |
CN114089911B (en) * | 2021-09-07 | 2024-01-05 | 上海新氦类脑智能科技有限公司 | Block segmentation and splicing processing method, device, equipment and medium based on data multiplexing |
WO2023098256A1 (en) * | 2021-12-03 | 2023-06-08 | 中兴通讯股份有限公司 | Neural network operation method and apparatus, chip, electronic device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108241890B (en) | 2021-11-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108241890A (en) | A kind of restructural neural network accelerated method and framework | |
CN111178519B (en) | Convolutional neural network acceleration engine, convolutional neural network acceleration system and method | |
CN106779060B (en) | A kind of calculation method for the depth convolutional neural networks realized suitable for hardware design | |
Wardono et al. | A tabu search algorithm for the multi-stage parallel machine problem with limited buffer capacities | |
CN108009106A (en) | Neural computing module | |
CN110390384A (en) | A kind of configurable general convolutional neural networks accelerator | |
CN106126481A (en) | A kind of computing engines and electronic equipment | |
CN103135132A (en) | Hybrid-domain full wave form inversion method of central processing unit (CPU)/graphics processing unit (GPU) synergetic parallel computing | |
CN105739951B (en) | A kind of L1 minimization problem fast solution methods based on GPU | |
CN104375838B (en) | It is a kind of based on OpenMP to the optimization method of astronomy software Gridding | |
CN109872161A (en) | A kind of chip and system accelerating IOTA subchain transaction verification process | |
CN108427861A (en) | A method of material periodicities polycrystalline structure is built based on mpt kits | |
CN110187965A (en) | The running optimizatin and data processing method of neural network, equipment and storage medium | |
CN106415526A (en) | FET processor and operation method | |
CN109615071A (en) | A kind of neural network processor of high energy efficiency, acceleration system and method | |
JP5572340B2 (en) | Data processing apparatus and method | |
CN108491924A (en) | A kind of serial stream treatment device of Neural Network Data calculated towards artificial intelligence | |
CN110414672B (en) | Convolution operation method, device and system | |
CN106484532B (en) | GPGPU parallel calculating method towards SPH fluid simulation | |
CN110490308A (en) | Accelerate design method, terminal device and the storage medium in library | |
CN112732630A (en) | Floating-point matrix multiplier many-core parallel optimization method for deep learning | |
CN106529679A (en) | Machine learning method and system | |
CN106934485A (en) | A kind of new one-dimensional based on genetic algorithm rehearses baiting method | |
CN110109913B (en) | Hardware implementation method and device of zerocase mining algorithm | |
CN108038304A (en) | A kind of Lattice Boltzmann Method parallel acceleration method using temporal locality |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||