CN108241890A - A reconfigurable neural network acceleration method and architecture - Google Patents
A reconfigurable neural network acceleration method and architecture
- Publication number
- CN108241890A CN108241890A CN201810084089.2A CN201810084089A CN108241890A CN 108241890 A CN108241890 A CN 108241890A CN 201810084089 A CN201810084089 A CN 201810084089A CN 108241890 A CN108241890 A CN 108241890A
- Authority
- CN
- China
- Prior art keywords
- output
- input
- convolution
- block
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The present invention provides a reconfigurable neural network acceleration method and architecture. The architecture comprises an input buffer unit, a weight buffer unit, a convolution compute core unit and an output buffer unit, and respectively adopts input-data-reuse, output-data-reuse and weight-data-reuse modes, in which the convolution compute core unit performs convolution between the input data it reads and the convolution kernels to generate output data. The application handles neural networks with any number of layers through a layer-by-layer acceleration strategy, and optimizes the acceleration with loop transformations, thereby reducing the number of accesses to the buffer and to DRAM, solving the prior-art problem that frequent memory accesses waste power, and achieving the advantageous effects of reducing energy consumption and maximizing the hardware utilization of the PE array.
Description
Technical field
The present invention relates to computation patterns in deep convolutional neural networks, and more particularly to a reconfigurable neural network acceleration method and architecture.
Background technology
Deep convolutional neural networks have been widely used in computer vision and speech processing. However, their inherent high complexity poses great challenges when they are executed on hardware, especially with respect to power consumption and performance. Traditional execution hardware includes CPUs, GPUs and FPGAs. Unfortunately, a CPU cannot provide low-latency processing in an embedded device; a GPU can meet the low-latency requirement, but its power consumption is too high for embedded devices; and although an FPGA can barely meet the requirements on power consumption and execution performance, its internal routing resources and compute units limit the execution efficiency of different deep convolutional neural networks.
To address these demands and challenges, an architecture dedicated to executing deep convolutional neural networks is needed to replace CPUs, GPUs and FPGAs. Even so, the computation patterns adopted by some traditional neural network hardware architectures fail to strike a good compromise between execution efficiency and energy consumption. In traditional hardware computation patterns for deep neural networks, because the data volume differs from layer to layer, some computation patterns access the buffer and memory in a single fixed mode and cannot be reconfigured at run time according to the computation demand, which significantly increases the number of memory accesses and causes unnecessary power waste. Fig. 1 is a schematic diagram of computation in a classical deep convolutional neural network, and Fig. 2 shows the pseudocode loop expression of a convolutional layer operation in a classical deep convolutional neural network. As shown in Fig. 1, in a classical deep convolutional neural network of the prior art, each convolution kernel has size K × K and N convolution channels; the input data has size H × L and N input channels; convolving the input data with M convolution kernels produces output data of size R × C with M output channels. As shown in Fig. 2, the pseudocode loop of the convolutional layer operation proceeds as follows:
The loops over R and C produce, in turn, each portion of the output data on every channel;
the loop over M makes the N convolution channels of each convolution kernel perform convolution with the N input channels of the current portion of the input data, thereby producing the output data of each output channel in turn;
the loop over N makes each of the N input channels of the current portion of the input data perform convolution with the corresponding one of the N convolution channels of the current kernel.
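The loop nest described above can be sketched in plain Python/NumPy. This is a hedged illustration of the classical computation pattern of Figs. 1 and 2, not the patented hardware; the function name and all sizes are assumptions.

```python
# Minimal NumPy sketch of the classical convolutional-layer loop nest:
# loops over output rows R, columns C, the M kernels, and the N input channels.
import numpy as np

def conv_layer(inp, kernels, stride=1):
    """inp: (N, H, L) input; kernels: (M, N, K, K); returns (M, R, C) output."""
    N, H, L = inp.shape
    M, _, K, _ = kernels.shape
    R = (H - K) // stride + 1
    C = (L - K) // stride + 1
    out = np.zeros((M, R, C))
    for r in range(R):                 # loop over output rows (R)
        for c in range(C):             # loop over output columns (C)
            for m in range(M):         # loop over the M convolution kernels
                for n in range(N):     # loop over the N input channels
                    patch = inp[n, r*stride:r*stride+K, c*stride:c*stride+K]
                    out[m, r, c] += np.sum(patch * kernels[m, n])
    return out
```

With stride 1 and no padding, R = H − K + 1 and C = L − K + 1, matching the R × C output size of Fig. 1.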
In a deep convolutional neural network accelerator, energy efficiency is a very important metric, and it is defined as:

Efficiency = Operations / Energy = Performance / Power

where Operations is the number of operations, Energy is the energy consumed, Performance is the throughput and Power is the power consumption. For a given convolutional neural network, the number of operations is fixed, so the only key factor affecting energy efficiency is the energy Energy.
The energy may be defined as:

Energy = MA_DRAM · E_DRAM + MA_buffer · E_buffer + Operations · E_operation

where Energy is the total energy, MA_DRAM and MA_buffer are the numbers of accesses to DRAM and to the on-chip buffer, Operations is the number of operations, and E_DRAM, E_buffer and E_operation are the energy of a single DRAM access, a single buffer access and a single operation, respectively. Therefore, for a fixed convolutional neural network, the key factors affecting energy consumption are the number of DRAM accesses MA_DRAM and the number of buffer accesses MA_buffer.
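A small numerical illustration of this energy model follows. The per-access energies are made-up placeholder values in arbitrary units, chosen only to reflect the common situation that a DRAM access costs far more energy than a buffer access; they are not figures from the patent.

```python
# Energy = MA_DRAM*E_DRAM + MA_buffer*E_buffer + Operations*E_operation
# (placeholder energy costs; a DRAM access is assumed much more expensive)
def total_energy(ma_dram, ma_buffer, operations,
                 e_dram=200.0, e_buffer=6.0, e_op=1.0):
    return ma_dram * e_dram + ma_buffer * e_buffer + operations * e_op

# With Operations fixed, reducing DRAM accesses saves far more energy
# than reducing buffer accesses by the same fraction:
base      = total_energy(ma_dram=1000, ma_buffer=10000, operations=100000)
less_dram = total_energy(ma_dram=500,  ma_buffer=10000, operations=100000)
less_buf  = total_energy(ma_dram=1000, ma_buffer=5000,  operations=100000)
```

This is why the reuse strategies below target MA_DRAM and MA_buffer rather than the (fixed) operation count.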
In addition, when performing convolution, the traditional computation pattern does not achieve a high utilization of the PE array; in particular, when the convolution stride is greater than 1, the hardware utilization of the PE array drops substantially.

Therefore, how to reduce energy consumption in deep convolutional neural networks by reducing the number of accesses to the buffer and to DRAM, and how to improve the utilization of the PE array during convolution, are technical problems that urgently need to be solved.
Summary of the invention
In order to overcome the defects of the prior art, the present invention proposes a reconfigurable neural network acceleration method and architecture, which handles neural networks with any number of layers through a layer-by-layer acceleration strategy and optimizes the computation pattern with loop transformations, thereby achieving the advantageous effects of reducing energy consumption and maximizing the utilization of the PE array.
The present invention proposes a reconfigurable neural network acceleration method one, which is an input-data-reuse method and comprises:

An input buffer unit divides the input data of N input channels into n input data blocks, each input data block having Tn input channels, and sends each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

A weight buffer unit divides M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and sends each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

A convolution compute core unit convolves the a-th input data block it reads with each convolution group in turn, generating an output data block with Tm output channels, until convolution with all m convolution groups is finished and an output data block with M output channels is generated; it accumulates the stored output data block of M output channels fed back by the output buffer unit with the generated output data block of M output channels, and sends the accumulated output data block. Here, the stored output data block was generated by accumulation after the 1st to (a-1)-th input data blocks, read before the a-th input data block, were convolved in turn with each convolution group.

The output buffer unit stores the received accumulated output data block as the stored output data block of M output channels and feeds it back to the convolution compute core unit. When a = n, the output buffer unit stores the complete output data of the M output channels, where a ≤ n and a is a positive integer.
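As a rough software analogue of method one (not the claimed hardware), the input-data-reuse ordering can be sketched as follows: each input data block is read once and convolved with every convolution group before the next block is read, while partial sums for all M output channels are accumulated across blocks. Stride 1 and no padding are assumed.

```python
# Input-data-reuse loop order: outer loop over input blocks (a),
# inner loop over convolution groups (b); each block is read only once.
import numpy as np

def input_reuse_conv(inp, kernels, Tn, Tm):
    """inp: (N, H, L); kernels: (M, N, K, K); Tn | N and Tm | M assumed."""
    N, H, L = inp.shape
    M, _, K, _ = kernels.shape
    R, C = H - K + 1, L - K + 1
    out = np.zeros((M, R, C))                  # stored output, all M channels
    for a in range(N // Tn):                   # read input block a once
        block = inp[a*Tn:(a+1)*Tn]
        for b in range(M // Tm):               # reuse it for every conv group
            group = kernels[b*Tm:(b+1)*Tm, a*Tn:(a+1)*Tn]
            for r in range(R):
                for c in range(C):
                    # accumulate partial sums for this group's Tm channels
                    out[b*Tm:(b+1)*Tm, r, c] += np.einsum(
                        'nkl,mnkl->m', block[:, r:r+K, c:c+K], group)
    return out
```

After the last block (a = n), `out` holds the complete output data of the M output channels, mirroring the a = n condition above.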
The present invention provides a reconfigurable neural network acceleration architecture one, comprising an input buffer unit, a weight buffer unit, a convolution compute core unit and an output buffer unit.

The input buffer unit is configured to divide the input data of N input channels into n input data blocks, each input data block having Tn input channels, and to send each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

The weight buffer unit is configured to divide M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and to send each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

The convolution compute core unit is configured to convolve the a-th input data block it reads with each convolution group in turn, generating an output data block with Tm output channels, until convolution with all m convolution groups is finished and an output data block with M output channels is generated; to accumulate the stored output data block of M output channels fed back by the output buffer unit with the generated output data block of M output channels; and to send the accumulated output data block. The stored output data block was generated by accumulation after the 1st to (a-1)-th input data blocks, read before the a-th input data block, were convolved in turn with each convolution group.

The output buffer unit is configured to store the received accumulated output data block as the stored output data block of M output channels and to feed it back to the convolution compute core unit. When a = n, the output buffer unit stores the complete output data of the M output channels, where a ≤ n and a is a positive integer.
The present invention provides a reconfigurable neural network acceleration method two, which is an output-data-reuse method and comprises:

An input buffer unit divides the input data of N input channels into n input data blocks, each input data block having Tn input channels, and sends each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

A weight buffer unit divides M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and sends each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

A convolution compute core unit convolves each input data block it reads, in turn, with the b-th convolution group it reads, generating an output data block with Tm output channels, until convolution with the n-th input data block is finished and the complete output data of the Tm output channels is generated; it accumulates the partial-channel output data stored in the convolution compute core unit with the generated complete output data of the Tm output channels, and generates and stores the accumulated partial-channel output data. Here, the stored partial-channel output data was generated by accumulation after the 1st to (b-1)-th convolution groups, read before the b-th convolution group, were convolved in turn with each input data block. When b = m, the convolution compute core unit sends the accumulated output data of the M channels, where b ≤ m and b is a positive integer; it performs pooling on the output data of the M output channels and sends the pooled output data.

An output buffer unit receives and stores the pooled output data, generating the pooled output data of the M output channels.
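A hedged software analogue of method two follows (the pooling step is omitted for brevity). In the output-data-reuse ordering, the loop over convolution groups is outermost: the partial sums for one group of Tm output channels stay local while every input data block streams past, so each channel group's output is completed, and written out, only once.

```python
# Output-data-reuse loop order: outer loop over convolution groups (b),
# inner loop over input blocks (a); partial sums stay in the compute core.
import numpy as np

def output_reuse_conv(inp, kernels, Tn, Tm):
    """inp: (N, H, L); kernels: (M, N, K, K); stride 1, no padding assumed."""
    N, H, L = inp.shape
    M, _, K, _ = kernels.shape
    R, C = H - K + 1, L - K + 1
    out = np.zeros((M, R, C))
    for b in range(M // Tm):                   # one group of Tm output channels
        acc = np.zeros((Tm, R, C))             # local partial sums for the group
        for a in range(N // Tn):               # stream all input blocks past it
            block = inp[a*Tn:(a+1)*Tn]
            group = kernels[b*Tm:(b+1)*Tm, a*Tn:(a+1)*Tn]
            for r in range(R):
                for c in range(C):
                    acc[:, r, c] += np.einsum(
                        'nkl,mnkl->m', block[:, r:r+K, c:c+K], group)
        out[b*Tm:(b+1)*Tm] = acc               # each channel group written once
    return out
```

The arithmetic is identical to method one; only the loop order, and hence which data is kept resident, changes.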
The present invention provides a reconfigurable neural network acceleration architecture two, comprising an input buffer unit, a weight buffer unit, a convolution compute core unit and an output buffer unit.

The input buffer unit is configured to divide the input data of N input channels into n input data blocks, each input data block having Tn input channels, and to send each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

The weight buffer unit is configured to divide M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and to send each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

The convolution compute core unit is configured to convolve each input data block it reads, in turn, with the b-th convolution group it reads, generating an output data block with Tm output channels, until convolution with the n-th input data block is finished and the complete output data of the Tm output channels is generated; to accumulate the partial-channel output data stored in the convolution compute core unit with the generated complete output data of the Tm output channels, generating and storing the accumulated partial-channel output data. The stored partial-channel output data was generated by accumulation after the 1st to (b-1)-th convolution groups, read before the b-th convolution group, were convolved in turn with each input data block. When b = m, the convolution compute core unit sends the accumulated output data of the M channels, where b ≤ m and b is a positive integer; it performs pooling on the output data of the M output channels and sends the pooled output data.

The output buffer unit is configured to receive and store the pooled output data, generating the pooled output data of the M output channels.
The present invention provides a reconfigurable neural network acceleration method three, which is a weight-data-reuse method and comprises:

An input buffer unit divides the input data of N input channels into n input data blocks, each input data block having Tn input channels, and sends each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

A weight buffer unit divides M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and sends each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

A convolution compute core unit convolves each input data block it reads, in turn, with the b-th convolution group it reads, generating an output data block with Tm output channels, until convolution with the n-th input data block is finished and the output data of the Tm output channels is generated; it accumulates the partial-channel output data fed back by the output buffer unit with the generated output data of the Tm output channels, and sends the accumulated partial-channel output data. Here, the fed-back partial-channel output data was generated by accumulation after the 1st to (b-1)-th convolution groups, read before the b-th convolution group, were convolved in turn with each input data block.

The output buffer unit stores the received accumulated partial-channel output data as the partial-channel output data, and feeds the stored partial-channel output data back to the convolution compute core unit. When b = m, the output buffer unit stores the complete output data of the M output channels, where b ≤ m and b is a positive integer.
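A hedged software analogue of method three follows. The loop order matches method two, but here each convolution group's weights are loaded once and reused across every input data block, while the partial results round-trip through a stand-in for the output buffer unit rather than staying in the compute core.

```python
# Weight-data-reuse loop order: weights for group b are fetched once and
# reused for all input blocks; partials accumulate in the output buffer.
import numpy as np

def weight_reuse_conv(inp, kernels, Tn, Tm):
    """inp: (N, H, L); kernels: (M, N, K, K); stride 1, no padding assumed."""
    N, H, L = inp.shape
    M, _, K, _ = kernels.shape
    R, C = H - K + 1, L - K + 1
    out_buffer = np.zeros((M, R, C))           # models the output buffer unit
    for b in range(M // Tm):
        group = kernels[b*Tm:(b+1)*Tm]         # group b's weights, loaded once
        for a in range(N // Tn):               # reused for every input block
            block = inp[a*Tn:(a+1)*Tn]
            gsub = group[:, a*Tn:(a+1)*Tn]
            for r in range(R):
                for c in range(C):
                    # accumulate into the fed-back partial-channel result
                    out_buffer[b*Tm:(b+1)*Tm, r, c] += np.einsum(
                        'nkl,mnkl->m', block[:, r:r+K, c:c+K], gsub)
    return out_buffer                          # complete when b = m
```

Again the arithmetic is unchanged; the three methods trade off which of the three data streams (input, output, weight) avoids repeated fetches.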
The present invention provides a reconfigurable neural network acceleration architecture three, comprising an input buffer unit, a weight buffer unit, a convolution compute core unit and an output buffer unit.

The input buffer unit is configured to divide the input data of N input channels into n input data blocks, each input data block having Tn input channels, and to send each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

The weight buffer unit is configured to divide M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and to send each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

The convolution compute core unit is configured to convolve each input data block it reads, in turn, with the b-th convolution group it reads, generating an output data block with Tm output channels, until convolution with the n-th input data block is finished and the output data of the Tm output channels is generated; to accumulate the partial-channel output data fed back by the output buffer unit with the generated output data of the Tm output channels; and to send the accumulated partial-channel output data. The fed-back partial-channel output data was generated by accumulation after the 1st to (b-1)-th convolution groups, read before the b-th convolution group, were convolved in turn with each input data block.

The output buffer unit is configured to store the received accumulated partial-channel output data as the partial-channel output data, and to feed the stored partial-channel output data back to the convolution compute core unit. When b = m, the output buffer unit stores the complete output data of the M output channels, where b ≤ m and b is a positive integer.
Beneficial effects of the present invention: the reconfigurable neural network acceleration method and architecture provided by the invention, built on an input buffer unit, a weight buffer unit, a convolution compute core unit and an output buffer unit, respectively adopt the input-data-reuse, output-data-reuse and weight-data-reuse methods, handle neural networks with any number of layers through a layer-by-layer acceleration strategy, and optimize the acceleration method with loop transformations, achieving the advantageous effects of reducing energy consumption and maximizing the utilization of the PE array.
Description of the drawings
In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of computation in a classical deep convolutional neural network;
Fig. 2 is a pseudocode loop diagram of the convolutional layer operation of a classical deep convolutional neural network;
Fig. 3 is a flowchart of a reconfigurable neural network acceleration method provided by embodiment one of the present invention;
Fig. 4 is a schematic diagram of sending input data blocks along the Z axis according to an embodiment of the present invention;
Fig. 5 is a flowchart of the reconfigurable neural network acceleration method of an embodiment of the present invention;
Fig. 6 is a schematic diagram of the first defect of convolution computation with a traditional convolution kernel;
Fig. 7 is a schematic diagram of the parallel convolution mapping mode addressing the first defect according to an embodiment of the present invention;
Fig. 8 is a pseudocode loop diagram of the parallel convolution mapping mode addressing the first defect according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of the second defect of convolution computation with a traditional convolution kernel;
Fig. 10 is a schematic diagram of the segmentation of input data blocks addressing the second defect according to an embodiment of the present invention;
Fig. 11 is a schematic diagram of the spliced input data blocks addressing the second defect according to an embodiment of the present invention;
Fig. 12 is a schematic diagram of the parallel convolution mapping mode addressing the second defect according to an embodiment of the present invention;
Fig. 13 is a pseudocode loop diagram of the convolution operation of embodiment one of the present invention;
Fig. 14 is a structural diagram of a reconfigurable neural network acceleration architecture provided by embodiment two of the present invention;
Fig. 15 is a flowchart of a reconfigurable neural network acceleration method provided by embodiment three of the present invention;
Fig. 16 is a schematic diagram of sending input data blocks along the X/Y plane according to an embodiment of the present invention;
Fig. 17 is a flowchart of the reconfigurable neural network acceleration method of an embodiment of the present invention;
Fig. 18 is a pseudocode loop diagram of the convolution operation of embodiment three of the present invention;
Fig. 19 is a structural diagram of a reconfigurable neural network acceleration architecture provided by embodiment four of the present invention;
Fig. 20 is a flowchart of a reconfigurable neural network acceleration method provided by embodiment five of the present invention;
Fig. 21 is a flowchart of the reconfigurable neural network acceleration method of an embodiment of the present invention;
Fig. 22 is a pseudocode loop diagram of the convolution operation of embodiment five of the present invention;
Fig. 23 is a structural diagram of a reconfigurable neural network acceleration architecture provided by embodiment six of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

The terms "first", "second", etc. used herein do not denote any particular order or sequence, nor do they limit the present invention; they are only used to distinguish elements or operations described with the same technical term.

The terms "comprising", "including", "having", "containing", etc. used herein are open terms, meaning including but not limited to.

The term "and/or" used herein includes any and all combinations of the listed items.

Directional terms used herein, such as up, down, left, right, front or rear, refer only to the directions in the accompanying drawings; they are used for illustration and are not intended to limit the present case.
To address the defects of the prior art, the present invention proposes a reconfigurable neural network acceleration method, which handles neural networks with any number of layers through a layer-by-layer, data-reuse acceleration strategy and optimizes the acceleration method with loop transformations, achieving the advantageous effects of reducing energy consumption and maximizing the utilization of the PE array.
Embodiment one: In order to overcome the defects of the prior art, this embodiment provides a reconfigurable neural network acceleration method that uses the input-data-reuse mode. As shown in Fig. 3, the reconfigurable neural network acceleration method includes:

S101: An input buffer unit divides the input data of N input channels into n input data blocks, each input data block having Tn input channels, and sends each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.

S102: A weight buffer unit divides M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and sends each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.

S103: A convolution compute core unit convolves the a-th input data block it reads with each convolution group in turn, generating an output data block with Tm output channels, until convolution with all m convolution groups is finished and an output data block with M output channels is generated; it accumulates the stored output data block of M output channels fed back by the output buffer unit with the generated output data block of M output channels, and sends the accumulated output data block. The stored output data block was generated by accumulation after the 1st to (a-1)-th input data blocks, read before the a-th input data block, were convolved in turn with each convolution group.

S104: The output buffer unit stores the received accumulated output data block as the stored output data block of M output channels and feeds it back to the convolution compute core unit. When a = n, the output buffer unit stores the complete output data of the M output channels, where a ≤ n and a is a positive integer.
The reconfigurable neural network acceleration method provided by this embodiment divides the input data into input data blocks and sends them in sequence to the convolution compute core unit; each time, the convolution compute core unit convolves one input data block with the m convolution groups in turn, generating the output data block of M output channels; repeating this operation convolves every input data block with the m convolution groups while continuously accumulating the generated output data blocks of the M output channels, finally obtaining the complete output data of the M output channels. Through the layer-by-layer data-reuse acceleration strategy, the method of this embodiment handles neural networks with any number of layers and has the effect of optimizing the neural network and reducing energy consumption. Further, when the input buffer unit sends the input data blocks in turn, each input data block may be sent along the Z-axis direction.

In specific implementation, as shown in Fig. 4, the input data is a three-dimensional structure with N input channels (the Z-axis direction) and per-channel size H × L (the X/Y plane). The input data of each input channel is divided into input data blocks of size Th × Tl; the n input data blocks are read in turn along the Z-axis direction and sent to the convolution compute core unit for convolution. As shown in Fig. 4, the 1st to i-th input data blocks are sent first, then the (i+1)-th to 2i-th input data blocks, and so on, until the n-th input data block is sent, where i and n are positive integers.
Further, when the convolution compute core unit convolves the a-th input data block it reads with each convolution group in turn, it may convolve the Tn input channels of the a-th input data block with the Tn convolution channels of each convolution group in turn, where the Tn input channels and the Tn convolution channels of each convolution kernel are convolved in one-to-one correspondence.
Further, as shown in Fig. 5, the reconfigurable neural network acceleration method further includes:

S105: Judging whether the stride of the current convolution kernel is greater than 1.

S106: If so, mapping input data blocks to the PE array in an interleaved manner and convolving them with the same convolution kernel.

S107: If not, and the size of the output data block is smaller than the size of the input data block, dividing each input data block into W small input data tiles of identical size, re-splicing the tiles at corresponding positions of each input data block to generate W spliced input data blocks of identical size, and mapping the W spliced input data blocks to the PE array for convolution with the same convolution kernel.
In specific implementation, when a convolution in a traditional convolutional neural network is executed on a hardware platform, the convolution kernel is in effect multiplied with every element of the input data as if the stride were 1. This mode of operation causes invalid PE operations whenever the stride or the output size changes, and has the following two defects.
First defect: as shown in Fig. 6, when the input data block size is Th = Tl = 8, the output data block size is Tr = Tc = 4, the kernel size is K = 2 and the kernel stride is S = 2 > 1, the algorithm requires the kernel to traverse the entire input data block with a stride of 2. If the top-left weight of the kernel is instead multiplied with every element of the input data block as if the stride were 1, invalid PE computations are generated; the PEs that do useful work are only the black squares in Fig. 6, and the PE utilization is a mere 16/64 = 25%.
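The 25% figure can be reproduced with a small calculation (illustrative only; the one-PE-per-input-element assumption is inferred from the example above):

```python
# Stride-2 utilization defect: an 8x8 PE array holds one 8x8 input block,
# but with kernel size K=2 and stride S=2 only a 4x4 grid of output
# positions is valid, so most PEs compute nothing useful.
Th = Tl = 8                     # input data block size
K, S = 2, 2                     # kernel size and stride
Tr = Tc = (Th - K) // S + 1     # output block size: (8-2)/2 + 1 = 4
A = Th * Tl                     # PE array size, one PE per input element
utilization = (Tr * Tc) / A
print(f"{utilization:.0%}")     # 25% of the PEs do useful work
```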
To overcome the first defect, the present invention executes step S105 to judge whether the kernel stride is greater than 1. Here the kernel stride is S = 2 > 1, so step S106 is executed: when the kernel stride is greater than 1, the input data blocks of different input channels are mapped onto the PE array in an interleaved fashion and convolved with the same convolution kernel. The specific procedure is as follows.
As shown in Fig. 7, the present invention uses identical kernel weights. Since the output data block size is Tr = Tc = 4, four different input data blocks 1, 2, 3 and 4 are interleaved when the kernel is multiplied with the elements of the input data. The placement is: row 1 column 1 holds element 1(1,1), the row-1-column-1 element of input data block 1; row 1 column 2 holds element 2(1,1) of input data block 2; row 2 column 1 holds element 3(1,1) of input data block 3; row 2 column 2 holds element 4(1,1) of input data block 4. Row 1 column 3 holds element 1(1,3) of input data block 1 (since the kernel stride is 2, the row-1-column-2 elements of all blocks need not be computed); row 1 column 4 holds element 2(1,3) of input data block 2; row 2 column 3 holds element 3(1,3) of input data block 3; row 2 column 4 holds element 4(1,3) of input data block 4; and so on. The PEs that would have performed invalid computation in Fig. 6 are thus given data from other input data blocks that genuinely needs to be computed. In this way four output data blocks execute their convolutions in parallel, with Tr = Tc = Trr = Tcc = 4. The corresponding pseudocode loop diagram is shown in Fig. 8: the four innermost loops, Loop Tm/Tn/Tr/Tc, represent the convolution performed in the convolution compute core unit, which computes an output data block of Tm channels of size Tr x Tc from Tn input data blocks of size Th x Tl. Trr and Tcc are added as two innermost loops that emit output data tiles of size Trr x Tcc; this re-cuts the Tr x Tc output data block and implements the parallel convolution mapping method.
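The interleaved placement above can be sketched as an indexing rule (a hedged reconstruction: the rule below is inferred from the placement examples in the text, generalizing them to any stride S):

```python
import numpy as np

# Interleaved ("dislocation") mapping for stride S=2: four input blocks
# share one 8x8 PE array, so the PEs that would otherwise idle hold data
# from the other blocks. Block values are offset by 100*b for visibility.
S = 2
blocks = [np.arange(64).reshape(8, 8) + 100 * b for b in range(S * S)]

merged = np.empty((8, 8), dtype=int)
for r in range(8):
    for c in range(8):
        b = (r % S) * S + (c % S)              # which of the 4 blocks
        merged[r, c] = blocks[b][S * (r // S), S * (c // S)]

# Row 0 interleaves block 0 and block 1 at their stride-2 positions:
# block0[0,0], block1[0,0], block0[0,2], block1[0,2], ...
print(merged[0, :4])
```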
Second defect: as shown in Fig. 9, when the input data block size is Th = Tl = 8, the output data block size is Tr = Tc = 6, the kernel size is K = 2 and the kernel stride is S = 1, the kernel moves with a stride of 1, but the 6x6 output block size (Tr = Tc = 6) means the kernel need not traverse the entire 8x8 input data block. The hardware execution mechanism nonetheless moves the kernel over the whole block with a stride of 1; the genuinely effective computation is the black area in Fig. 9, so the PE utilization is 36/64 = 56.25%.
To overcome the second defect, the present invention executes step S107: when the kernel stride is S = 1 and the 6x6 output data block is smaller than the 8x8 input data block, each input data block is divided into W small input data tiles of equal size, the tiles at the corresponding positions of each input data block are re-stitched, and W stitched input data blocks of equal size are generated; the W stitched input data blocks are mapped onto the PE array and convolved with the same convolution kernel. The specific procedure is as follows.
When 16 input data blocks w1, w2, w3, ..., w16 are to be convolved with 16 different convolution kernels, the defect shown in Fig. 9 arises. The present invention re-divides the genuinely effective computation region: as shown in Fig. 10, each input data block is divided in a 2x2 manner, so the effective 6x6 part of each input data block is split into nine 2x2 input data tiles. As shown in Fig. 11, the original 16 input data blocks yield 16 x 9 input data tiles of size 2x2 after this division. As shown in Fig. 12, the tiles are then re-stitched: taking the tile at the same position from each input data block forms nine new input data blocks, each of size 8x8. The nine stitched 8x8 input data blocks are convolved with the same convolution kernel; the kernel traverses, with a stride of 1, exactly the 6x6 parts of the original 16 input data blocks, and every element of the input data is fully used. Correspondingly, 16 output data blocks of size 6x6 (i.e. Tr = Tc = 6) are obtained, each composed of nine 2x2 output data tiles (i.e. Trr = Tcc = 2). The PE utilization is thereby raised to 100%, because the input data of the nine stitched input blocks, composed of 16 2x2 input data tiles each, is all effectively computed, producing the 16 output data blocks of size 6x6.
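The split-and-stitch procedure above can be sketched as follows (illustrative sizes; the 4x4 arrangement of same-position tiles into an 8x8 block is inferred from Figs. 10-12):

```python
import numpy as np

# Re-splitting for the stride-1 defect: 16 input blocks whose effective
# region is 6x6 (Tr=Tc=6, Trr=Tcc=2) are each cut into 9 tiles of 2x2;
# the same-position tile from all 16 blocks is stitched into one 8x8
# block, yielding 9 stitched blocks that fill the PE array completely.
W, eff, t = 16, 6, 2                    # blocks, effective size, tile size
blocks = [np.random.rand(eff, eff) for _ in range(W)]

stitched = []
for ti in range(eff // t):              # 3x3 = 9 tile positions
    for tj in range(eff // t):
        tiles = [b[ti*t:(ti+1)*t, tj*t:(tj+1)*t] for b in blocks]
        # arrange the 16 tiles in a 4x4 grid -> one 8x8 stitched block
        rows = [np.hstack(tiles[r*4:(r+1)*4]) for r in range(4)]
        stitched.append(np.vstack(rows))

print(len(stitched), stitched[0].shape)   # 9 stitched blocks of 8x8
```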
The PE utilization under the traditional mapping, consistent with the examples above, is U = (Tr x Tc) / A, where R and C correspond to the two output dimensions in Fig. 1, A is the size of the computing array, and Tr and Tc are the size of the output data block.
Under the parallel convolution mapping mode of the present invention, the PE utilization becomes U = (W x Trr x Tcc) / A, where R and C again correspond to the two output dimensions in Fig. 1, A is the size of the computing array, Tr and Tc are the size of the output data block, Trr and Tcc are the size of the output data tiles in the parallel convolution mapping method, and W is the number of tiles mapped onto the array simultaneously. By using the parallel convolution mapping mode, the present invention maximizes the hardware resource utilization of the PE array.
Fig. 13 is the pseudocode loop expression of the convolution operation of this embodiment. As shown in Fig. 13, in the reconfigurable neural network acceleration method provided by embodiment one, the loops from the inside out are as follows. The four innermost loops, Loop Tm/Tn/Tr/Tc, represent the convolution performed by the convolution compute core unit; what the dashed box outside them describes is the data reuse order. In loop M, every input data block is convolved with all M convolution kernels, generating partial sums for M output channels. In loop N, the N input channels are traversed in turn, the inner computation is repeated, and the partial sums of the M output channels accumulate continuously; data is therefore read and updated constantly until the complete convolution is finished. Loops R and C traverse the remaining portions of each output channel, repeating all of the preceding operations, and finally the complete output data of the M output channels is obtained.
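The tiled loop nest described above can be sketched in software as follows (a minimal model with illustrative sizes, not the hardware implementation; here the single Tr x Tc tile covers the whole output, so loops R and C are omitted). It is checked against a direct convolution:

```python
import numpy as np

# Loop Tm/Tn/Tr/Tc tiled convolution: Tn input channels of a block
# produce Tm output channels of a Tr x Tc block, with partial sums
# accumulating across input-channel tiles (the N loop).
N, M, K, S = 4, 4, 3, 1           # channels, kernels, kernel size, stride
R = C = 6                          # output feature-map size
H = L = (R - 1) * S + K            # input size: Th = (Tr-1)S + K = 8
Tn, Tm, Tr, Tc = 2, 2, 6, 6        # tile sizes (Tr = R, Tc = C here)
x = np.random.rand(N, H, L)
w = np.random.rand(M, N, K, K)

out = np.zeros((M, R, C))
for mo in range(0, M, Tm):                 # over convolution groups
    for no in range(0, N, Tn):             # over input blocks (accumulate)
        for m in range(mo, mo + Tm):       # Loop Tm
            for n in range(no, no + Tn):   # Loop Tn
                for r in range(Tr):        # Loop Tr
                    for c in range(Tc):    # Loop Tc
                        patch = x[n, r*S:r*S+K, c*S:c*S+K]
                        out[m, r, c] += np.sum(patch * w[m, n])

# direct reference convolution over all channels at once
ref = np.zeros((M, R, C))
for m in range(M):
    for r in range(R):
        for c in range(C):
            ref[m, r, c] = np.sum(x[:, r:r+K, c:c+K] * w[m])
assert np.allclose(out, ref)
print("tiled convolution matches direct convolution")
```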
In a data reuse pattern, the number of accesses to the storage units is a very important metric. The convolutional layer is first split, as shown in Fig. 3. The input data of the N input channels is divided into n input data blocks; each input data block has Tn input channels and size Th x Tl, where n = N/Tn and N, n, Tn are positive integers. The output data of the M output channels is composed of m output data blocks; each output data block has Tm output channels and size Tr x Tc, where m = M/Tm and M, m, Tm are positive integers, Th = (Tr-1)S + K, Tl = (Tc-1)S + K, K x K is the size of the convolution kernel, and S is the convolution stride. The number of memory accesses MA can then be expressed as:
MA = TI·α_i + TO·α_o + TW·α_w + TPO
where TI, TO and TW are the quantities of input data, output data and weight data respectively, α_i, α_o and α_w are the respective reuse counts of input data, output data and weight data, and TPO is the total quantity of output data produced by pooling.
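A small numeric illustration of the MA expression above (only the formula comes from the text; the data quantities and reuse counts below are made up):

```python
# MA = TI*a_i + TO*a_o + TW*a_w + TPO, with illustrative counts.
TI, TO, TW, TPO = 1024, 512, 256, 128   # data quantities (illustrative)
a_i, a_o, a_w = 2, 4, 8                 # reuse counts (illustrative)
MA = TI * a_i + TO * a_o + TW * a_w + TPO
print(MA)                               # total memory accesses
```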
In the method for above-mentioned input data multiplexing, it is to the Buffer coefficients accessed accordingly:
Because input deposit unit reads and writes n=N/Tn input block successively, final in order to obtain as a result, rolling up
Product operation when, need to have traversed total N number of Channel of input data, thus convolutional calculation nuclear unit just need ceaselessly to do it is tired
Add, wherein reading and writing operation needs n-1 times, it is contemplated that read and write is each primary, therefore should be multiplied by 2 times, i.e., 2 (n-1) are secondary.And for
The importing number of weight is corresponding with n, it is contemplated that the factors such as coincidence in step-length and convolution kernel size and convolution kernel moving process,
Final parameter is
And for the access coefficient of DRAM, BoAnd BwRepresent that the storage of output buffer unit and weight buffer unit is big respectively
Small, if when the size for exporting buffer unit is bigger than MTrTc data volume, there is no need to additionally occupy the storage energy of DRAM
Power, do not need at this time access DRAM, therefore coefficient be 1, it is on the contrary then need access DRAM;Likewise, weight is stored also such as
This.
The reconfigurable neural network acceleration method provided by embodiment one copes with neural networks of various depths through a layer-by-layer acceleration strategy, and optimizes the acceleration method through loop transformation. It reduces the number of accesses to the Buffer and to DRAM, solves the prior-art problem that frequent memory accesses cause wasted power, and has the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
Embodiment two: based on the same inventive concept as the reconfigurable neural network acceleration method above, this embodiment also provides a reconfigurable neural network acceleration architecture, as described below. Since the principle by which this architecture solves the problem is similar to the reconfigurable neural network acceleration method of embodiment one, its implementation may refer to the implementation of that method, and repeated passages are not restated.
As shown in Fig. 14, the reconfigurable neural network acceleration architecture provided by this embodiment includes: an input buffer unit 1, a weight buffer unit 2, a convolution compute core unit 3 and an output buffer unit 4.
The input buffer unit 1 divides the input data of the N input channels into n input data blocks, each with Tn input channels, and sends each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.
The weight buffer unit 2 divides the M convolution kernels into m convolution groups, each group containing Tm convolution kernels and each kernel having N convolution channels, and sends each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.
The convolution compute core unit 3 convolves the a-th input data block it has read with each convolution group in turn, generating an output data block with Tm output channels, until the convolution with all m convolution groups is finished and an output data block with M output channels is generated; it accumulates the stored output data block of the M output channels fed back by the output buffer unit with the generated output data block of the M output channels, and sends the accumulated output data block. The stored output data block is the accumulation generated, before the a-th input data block is read, by convolving the 1st to (a-1)-th input data blocks with each convolution group in turn.
The output buffer unit 4 stores the received accumulated output data block as the stored output data block of the M output channels, and feeds the stored output data block back to the convolution compute core unit; when a = n, the output buffer unit stores the complete output data of the M output channels, where a ≤ n and a is a positive integer.
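The accumulation the compute core performs can be sketched as follows (illustrative sizes, one output channel for brevity): the output block produced by input block a is added to the stored block generated by blocks 1..a-1, so after all n blocks the result equals a convolution over all N input channels.

```python
import numpy as np

# Accumulating partial sums over input blocks read one by one.
N, Tn, K = 8, 2, 3
x = np.random.rand(N, 8, 8)
w = np.random.rand(N, K, K)          # one output channel, N conv channels

def conv(chans, kerns):              # valid convolution summed over channels
    R = x.shape[1] - K + 1
    out = np.zeros((R, R))
    for r in range(R):
        for c in range(R):
            out[r, c] = np.sum(chans[:, r:r+K, c:c+K] * kerns)
    return out

stored = np.zeros((6, 6))            # output block fed back and updated
for a in range(0, N, Tn):            # read the n = N/Tn input blocks
    stored += conv(x[a:a+Tn], w[a:a+Tn])   # accumulate the partial sums

assert np.allclose(stored, conv(x, w))
print("accumulated partial sums equal the full convolution")
```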
Further, the input buffer unit 1 is specifically configured to send each input data block in turn along the Z-axis direction.
Further, as shown in Fig. 14, the convolution compute core unit 3 includes: an input register unit 31, a compute engine unit 32 and an output register unit 33.
The input register unit 31 reads the a-th input data block from the input buffer unit and sends it to the compute engine unit.
The compute engine unit 32 convolves the Tn input channels of the a-th input data block it has read with the Tn convolution channels of the Tm convolution kernels of each convolution group in turn, generating an output data block with Tm output channels, until the convolution with all m convolution groups is finished, and sends the generated output data block with M output channels.
The output register unit 33 accumulates the stored output data block of the M output channels fed back by the output buffer unit with the generated output data block of the M output channels, and sends the accumulated output data block; the stored output data block is the accumulation generated, before the a-th input data block is read, by convolving the 1st to (a-1)-th input data blocks with each convolution group in turn.
The reconfigurable neural network acceleration method and architecture provided by the above embodiment, through the architecture of input buffer unit, weight buffer unit, convolution compute core unit and output buffer unit, cope with neural networks of various depths using the input data reuse method and a layer-by-layer acceleration strategy, and optimize the acceleration method through loop transformation. They reduce the number of accesses to the Buffer and to DRAM, solve the prior-art problem that frequent memory accesses cause wasted power, and have the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
Embodiment three: to overcome the defects of the prior art, this embodiment further provides a reconfigurable neural network acceleration method that uses an output data reuse mode. As shown in Fig. 15, the reconfigurable neural network acceleration method includes:
S201: The input buffer unit divides the input data of the N input channels into n input data blocks; each input data block has Tn input channels; each input data block is sent in turn; where n = N/Tn and N, n, Tn are positive integers.
S202: The weight buffer unit divides the M convolution kernels into m convolution groups; each convolution group has Tm convolution kernels, and each convolution kernel has N convolution channels; each convolution group is sent in turn; where m = M/Tm and M, m, Tm, N are positive integers.
S203: The convolution compute core unit convolves each input data block it reads with the b-th convolution group it has read in turn, generating an output data block with Tm output channels, until the convolution with the n-th input data block is finished and the complete output data of the Tm output channels is generated; it accumulates the partial-channel output data stored in the convolution compute core unit with the generated complete output data of the Tm output channels, and generates and stores the accumulated partial-channel output data. The stored partial-channel output data is the accumulation generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups with each input data block in turn. When b = m, the convolution compute core unit sends the accumulated output data of the M channels, where b ≤ m and b is a positive integer; it pools the received output data of the M output channels and sends the pooled output data.
S204: The output buffer unit receives and stores the pooled output data, generating the pooled output data of the M output channels.
In the reconfigurable neural network acceleration method provided by this embodiment, the data to be sent is divided into input data blocks, and the input data blocks are sent in order to the convolution compute core unit; the convolution compute core unit convolves the n input data blocks in turn with the same convolution group, generating the complete output data of the Tm output channels. Repeating these operations convolves the n input data blocks with each of the m convolution groups, continuously accumulating the generated complete output data of the Tm output channels until the complete output data of the M output channels is obtained. Through this data reuse and layer-by-layer acceleration strategy, the reconfigurable neural network acceleration method of this embodiment copes with neural networks of various depths, optimizes the neural network, and reduces energy consumption.
Further, when the input buffer unit sends each input data block in turn, it may send the blocks in turn along the XY plane.
In specific implementation, as shown in Fig. 16, the input data is a three-dimensional structure with N input channels (the Z-axis direction) and a per-channel size of H x L (the XY plane). The input data of each input channel is divided into input data blocks of size Th x Tl; the n input data blocks are read in turn along the XY-plane direction and sent to the convolution compute core unit for the convolution operation. As shown in Fig. 16, the 1st to i-th input data blocks are sent first, then the (i+1)-th to (ki)-th, then the (ki+1)-th, and so on, until the n-th input data block is sent, where n, i, i+1, ..., ki, ki+1, ... are positive integers.
Further, the convolution compute core unit convolving each input data block it reads with the b-th convolution group it has read includes: the convolution compute core unit convolves the Tn input channels of each input data block with the Tn convolution channels of the b-th convolution group in turn; the Tn input channels and the Tn convolution channels of each convolution kernel are paired one-to-one for the convolution operation.
In one embodiment, as shown in Fig. 17, the method further includes:
S205: Judge whether the stride of the current convolution kernel is greater than 1.
S206: If so, map the input data blocks onto the PE array in an interleaved fashion and convolve them with the same convolution kernel.
S207: If not, and the size of the output data block is smaller than the size of the input data block, divide each input data block into W small input data tiles of equal size, re-stitch the tiles at the corresponding positions of each input data block to generate W stitched input data blocks of equal size, and map the W stitched input data blocks onto the PE array for convolution with the same convolution kernel.
For the specific execution procedure, refer to the execution procedure of steps S105-S107 in embodiment one.
Fig. 18 is the pseudocode loop expression of the convolution operation of this embodiment. As shown in Fig. 18, in the reconfigurable neural network acceleration method provided by embodiment three, loop M is outside loop N, which means each convolution group is convolved with the input data of all N input channels to obtain the complete output data of its output channels; there is no need to repeatedly read back the partial output data blocks stored in the output buffer unit. The loops from the inside out are as follows. The four innermost loops, Loop Tm/Tn/Tr/Tc, represent the convolution performed in the convolution compute core unit, which computes an output data block of Tm channels of size Tr x Tc from Tn input data blocks of size Th x Tl; what the dashed box outside them describes is the data reuse order. In loop N, the N input channels are traversed in turn, the inner computation is repeated, and the complete output data of the Tm output channels is accumulated and finally stored in the output buffer unit. In loop M, the input data used before is read in repeatedly to complete the computation of all M output channels. Loops R and C traverse the remaining portions of the output channels, repeating all of the preceding operations, and finally the complete output data of the M output channels is obtained.
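The output-reuse ordering of Fig. 18 can be sketched as follows (a minimal model with illustrative sizes): for each convolution group, the partial sums of its Tm output channels stay resident while all N input channels stream through, so the input blocks are re-read once per group but each output is written back exactly once.

```python
import numpy as np

# Loop M outside loop N: partial sums for one convolution group stay in
# the compute core while the input blocks stream past.
N, M, Tn, Tm, K = 4, 4, 2, 2, 3
x = np.random.rand(N, 8, 8)
w = np.random.rand(M, N, K, K)
R = 8 - K + 1

reads_of_input = 0
out = np.zeros((M, R, R))
for mo in range(0, M, Tm):               # loop M: one convolution group
    psum = np.zeros((Tm, R, R))          # stays in the compute core
    for no in range(0, N, Tn):           # loop N: stream input blocks
        reads_of_input += 1
        for m in range(Tm):
            for r in range(R):
                for c in range(R):
                    psum[m, r, c] += np.sum(
                        x[no:no+Tn, r:r+K, c:c+K] * w[mo+m, no:no+Tn])
    out[mo:mo+Tm] = psum                 # written back exactly once

# check against a direct convolution over all channels
ref = np.array([[[np.sum(x[:, r:r+K, c:c+K] * w[m])
                  for c in range(R)] for r in range(R)] for m in range(M)])
assert np.allclose(out, ref)
print(reads_of_input)                    # (M/Tm)*(N/Tn) = 4 block reads
```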
In the output data reuse method above, the corresponding Buffer access coefficients and the corresponding DRAM access coefficients are as follows: B_i denotes the storage size of the input buffer unit; if the input buffer unit can hold all n input data blocks, only one access is needed. N is the number of input channels, Th x Tl is the size of the input data blocks, M is the number of convolution kernels, and each convolution group has Tm convolution kernels.
The reconfigurable neural network acceleration method provided by embodiment three copes with neural networks of various depths through a layer-by-layer acceleration strategy, and optimizes the acceleration method through loop transformation. It reduces the number of accesses to the Buffer and to DRAM, solves the prior-art problem that frequent memory accesses cause wasted power, and has the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
Embodiment four: based on the same inventive concept as the reconfigurable neural network acceleration method above, the present invention also provides a reconfigurable neural network acceleration architecture, as described below. Since the principle by which this architecture solves the problem is similar to the reconfigurable neural network acceleration method of embodiment three, its implementation may refer to the implementation of that method, and repeated passages are not restated.
As shown in Fig. 19, the reconfigurable neural network acceleration architecture provided by this embodiment includes: an input buffer unit 1, a weight buffer unit 2, a convolution compute core unit 3 and an output buffer unit 4.
The input buffer unit 1 divides the input data of the N input channels into n input data blocks, each with Tn input channels, and sends each input data block in turn, where n = N/Tn and N, n, Tn are positive integers.
The weight buffer unit 2 divides the M convolution kernels into m convolution groups, each group containing Tm convolution kernels and each kernel having N convolution channels, and sends each convolution group in turn, where m = M/Tm and M, m, Tm, N are positive integers.
The convolution compute core unit 3 convolves each input data block it reads with the b-th convolution group it has read in turn, generating an output data block with Tm output channels, until the convolution with the n-th input data block is finished and the complete output data of the Tm output channels is generated; it accumulates the partial-channel output data stored in the convolution compute core unit with the generated complete output data of the Tm output channels, and generates and stores the accumulated partial-channel output data. The stored partial-channel output data is the accumulation generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups with each input data block in turn. When b = m, the convolution compute core unit sends the accumulated output data of the M channels, where b ≤ m and b is a positive integer; it pools the received output data of the M output channels and sends the pooled output data.
The output buffer unit 4 receives and stores the pooled output data, generating the pooled output data of the M output channels.
Further, the input buffer unit 1 is specifically configured to send each input data block in turn along the XY plane.
Further, as shown in Fig. 19, the convolution compute core unit 3 includes: an input register unit 31, a compute engine unit 32, an output register unit 33 and a pooling unit 34.
The input register unit 31 reads each input data block one by one from the input buffer unit and sends it to the compute engine unit.
The compute engine unit 32 convolves the Tn input channels of each input data block it reads with the Tn convolution channels of the Tm convolution kernels of the b-th convolution group in turn, generating an output data block with Tm output channels, until the convolution with the n-th input data block is finished and the complete output data of the Tm output channels is generated; it accumulates the partial-channel output data fed back by the output register unit with the generated complete output data of the Tm output channels, and generates and sends the accumulated partial-channel output data. The partial-channel output data fed back by the output register unit is the accumulation generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups with each input data block in turn.
The output register unit 33 stores the received accumulated partial-channel output data, and feeds the stored partial-channel output data back to the compute engine unit; when b = m, the output register unit sends the accumulated output data of the M channels, where b ≤ m and b is a positive integer.
The pooling unit 34 pools the received output data of the M output channels and sends the pooled output data.
The reconfigurable neural network acceleration method and architecture provided by the above embodiment, through the architecture of input buffer unit, weight buffer unit, convolution compute core unit and output buffer unit, cope with neural networks of various depths using the output data reuse method and a layer-by-layer acceleration strategy, and optimize the acceleration method through loop transformation. They reduce the number of accesses to the Buffer and to DRAM, solve the prior-art problem that frequent memory accesses cause wasted power, and have the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
Embodiment five: to overcome the defects of the prior art, this embodiment further provides a reconfigurable neural network acceleration method that uses a weight data reuse mode. As shown in Fig. 20, the reconfigurable neural network acceleration method includes:
S301: The input buffer unit divides the input data of the N input channels into n input data blocks; each input data block has Tn input channels; each input data block is sent in turn; where n = N/Tn and N, n, Tn are positive integers.
S302: The weight buffer unit divides the M convolution kernels into m convolution groups; each convolution group has Tm convolution kernels, and each convolution kernel has N convolution channels; each convolution group is sent in turn; where m = M/Tm and M, m, Tm, N are positive integers.
S303: The convolution compute core unit convolves each input data block it reads with the b-th convolution group it has read in turn, generating an output data block with Tm output channels, until the convolution with the n-th input data block is finished and the output data of the Tm output channels is generated; it accumulates the partial-channel output data fed back by the output buffer unit with the generated output data of the Tm output channels, and sends the accumulated partial-channel output data. The fed-back partial-channel output data is the accumulation generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups with each input data block in turn.
S304: The output buffer unit stores the received accumulated partial-channel output data, and feeds the stored partial-channel output data back to the convolution compute core unit; when b = m, the output buffer unit stores the complete output data of the M output channels, where b ≤ m and b is a positive integer.
In the reconfigurable neural network acceleration method provided by this embodiment, the input data is divided into input data blocks, and each input data block is convolved in turn with the same convolution group to generate the complete output data of the Tm output channels. Repeating these operations convolves the n input data blocks with each of the m convolution groups, continuously accumulating the generated complete output data of the Tm output channels until the complete output data of the M output channels is obtained. Through this data reuse and layer-by-layer acceleration strategy, the reconfigurable neural network acceleration method of this embodiment copes with neural networks of various depths, optimizes the neural network, and reduces energy consumption.
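The weight-reuse schedule of this embodiment can be sketched as an access count (illustrative numbers): each of the m convolution groups is loaded once and kept resident while all n input blocks stream past it, so every weight is fetched from the buffer a single time while the input blocks are re-streamed once per group.

```python
# Access-count sketch of the weight-reuse schedule: convolution groups
# on the outer loop, input blocks on the inner loop.
n, m = 4, 8                      # input blocks and convolution groups
weight_loads = input_reads = 0
for b in range(m):               # outer loop: the b-th convolution group
    weight_loads += 1            # group loaded once, then reused
    for a in range(n):           # inner loop: every input block
        input_reads += 1         # blocks are re-streamed per group
print(weight_loads, input_reads) # m weight loads, n*m input-block reads
```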
Further, when the input buffer unit sends each input data block in turn, it may send the blocks in turn along the Z-axis direction.
In specific implementation, as shown in Fig. 4, the input data is a three-dimensional structure with N input channels (the Z-axis direction) and a per-channel size of H x L (the XY plane). The input data of each input channel is divided into input data blocks of size Th x Tl; the n input data blocks are read in turn along the Z-axis direction and sent to the convolution compute core unit for the convolution operation. As shown in Fig. 4, the 1st to i-th input data blocks are sent first, then the (i+1)-th to 2i-th, and so on, until the n-th input data block is sent, where n, i, i+1, ... are positive integers.
Further, when the convolution compute core unit convolves each input data block it reads with the b-th convolution group it has read, it may convolve the Tn input channels of each input data block with the Tn convolution channels of the b-th convolution group in turn; the Tn input channels and the Tn convolution channels of each convolution kernel are paired one-to-one for the convolution operation.
Further, as shown in Fig. 21, the method further includes:
S305: judging whether the stride of the current convolution kernel is greater than 1.
S306: if so, mapping the input data block to the PE array in an interleaved manner and convolving it with the same convolution kernel.
S307: if not, when the size of the output data block is smaller than the size of the input data block, dividing each input data block into W small input data pieces of identical size, re-stitching the small input data pieces at corresponding positions of each input data block to generate W stitched input data blocks of identical size, and mapping the W stitched input data blocks to the PE array for convolution with the same convolution kernel.
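The piece-splitting of step S307 can be illustrated roughly as follows; the quadrant split with W = 4 is only an assumed example, since the exact piece geometry is defined in Embodiment 1:

```python
import numpy as np

# Hedged sketch of an S307-style split: cut one input tile into W = 4
# equally sized small pieces (here, quadrants); the pieces at corresponding
# positions across blocks would then be stitched into W same-sized blocks
# and mapped to the PE array. The real piece geometry is per Embodiment 1.
def split_into_quadrants(tile):
    h, w = tile.shape
    hh, hw = h // 2, w // 2
    return [tile[:hh, :hw], tile[:hh, hw:],   # top-left, top-right
            tile[hh:, :hw], tile[hh:, hw:]]   # bottom-left, bottom-right

tile = np.arange(16).reshape(4, 4)
pieces = split_into_quadrants(tile)
print(len(pieces))        # 4 pieces
print(pieces[0].shape)    # (2, 2)
```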
For the specific implementation process, refer to steps S105–S107 in Embodiment 1.
Fig. 22 shows the pseudocode loop expression of the convolution operation of this embodiment. As shown in Fig. 22, in the reconfigurable neural network acceleration method provided in Embodiment 5, the convolution computation core unit sends the input data blocks of the Tn input channels in sequence to the input register unit. Each input data block is multiplied by the Tm convolution kernels, generating partial sums of the output data of the Tm output channels. Because loop R and loop C are on the inside, the Tm convolution kernels, each with Tn convolution channels, can be fully reused while the Tn channels of the n input data blocks are traversed, yielding the partial sums of the Tm output data (each of size R × C). The partial sums of the Tm output data generated from each input data block are accumulated with the partial sums of the Tm output data generated from the next input data block, until all the output data of the M output channels are obtained. From the inside out, the loops are as follows: the innermost four loops, Loop Tm/Tn/Tr/Tc, represent the computation performed in the convolution core (Convolution Core) of Fig. 6, in which Tm output tiles of size Tr × Tc are computed from Tn input tiles of size Th × Tl; the description outside the dashed box is the data reuse order. Loops R and C traverse the remaining portions of the output channels, repeating all the operations of the inner loops, so the weights are fully reused. In loop N, the N input channels are traversed in turn, repeating the inner-loop computation and completing all the accumulations of the output data of the Tm output channels, which are finally stored in the output buffer unit. In loop M, the input data used before are read in repeatedly to complete the computation of all M convolution kernels and finally obtain all the output data of the M output channels.
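As a rough illustration of this loop order, the nest can be sketched in plain Python (a naive software reference with assumed example sizes, not the hardware implementation; loop M is outermost, then N, then R/C, with the Tm/Tn/Tr/Tc tile loops innermost):

```python
import numpy as np

# Naive reference for the Fig. 22 loop nest (all sizes are assumed examples).
# Outer loops M and N step over kernel groups (Tm) and channel groups (Tn);
# loops R and C step over output tiles (Tr x Tc); the innermost four loops
# do the tile computation that the Convolution Core performs.
M, N, R, C, K = 4, 4, 4, 4, 3          # kernels, channels, output size, kernel
Tm, Tn, Tr, Tc = 2, 2, 2, 2
inp = np.random.rand(N, R + K - 1, C + K - 1)
wts = np.random.rand(M, N, K, K)
out = np.zeros((M, R, C))

for mo in range(0, M, Tm):             # loop M: input data re-read per group
    for no in range(0, N, Tn):         # loop N: accumulate over channel groups
        for ro in range(0, R, Tr):     # loop R
            for co in range(0, C, Tc): # loop C: weights reused across tiles
                for m in range(mo, mo + Tm):            # Loop Tm
                    for n in range(no, no + Tn):        # Loop Tn
                        for r in range(ro, ro + Tr):    # Loop Tr
                            for c in range(co, co + Tc):  # Loop Tc
                                out[m, r, c] += np.sum(
                                    inp[n, r:r + K, c:c + K] * wts[m, n])

# The tiled result matches an untiled direct convolution.
ref = np.zeros_like(out)
for m in range(M):
    for n in range(N):
        for r in range(R):
            for c in range(C):
                ref[m, r, c] += np.sum(inp[n, r:r + K, c:c + K] * wts[m, n])
print(np.allclose(out, ref))
```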
In the weight-data reuse method, the corresponding number of Buffer accesses is:
The corresponding number of DRAM accesses is:
The reconfigurable neural network acceleration method provided in Embodiment 5 adapts to neural networks with various numbers of layers through the layer-by-layer acceleration strategy, and optimizes the acceleration method through loop transformation, reducing the number of accesses to the Buffer and to DRAM. This solves the prior-art problem that frequent memory accesses waste power, with the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
Embodiment 6: Based on the same inventive concept as the above reconfigurable neural network acceleration method, the present invention also provides a reconfigurable neural network acceleration architecture, as described below. Since the principle by which this reconfigurable neural network acceleration architecture solves the problem is similar to that of the reconfigurable neural network acceleration method of Embodiment 5, for its implementation reference may be made to the implementation of the method of Embodiment 5, and repeated description is omitted.
As shown in Fig. 23, the reconfigurable neural network acceleration architecture provided in this embodiment includes: an input buffer unit 1, a weight buffer unit 2, a convolution computation core unit 3 and an output buffer unit 4.
The input buffer unit 1 is configured to divide the input data of N input channels into n input data blocks, each input data block having Tn input channels, and to send each input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers.
The weight buffer unit 2 is configured to divide M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and to send each convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers.
The convolution computation core unit 3 is configured to convolve each input data block read thereby in turn with the b-th convolution group read thereby, generating output data blocks having Tm output channels, until convolution with the n-th input data block is finished, thereby generating the output data of the Tm output channels; to accumulate the partial channel output data fed back by the output buffer unit with the generated output data of the Tm output channels; and to send the accumulated partial channel output data. The fed-back partial channel output data are generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups in turn with each input data block and accumulating the results.
The output buffer unit 4 is configured to store the received accumulated partial channel output data as the partial channel output data, and to feed the stored partial channel output data back to the convolution computation core unit. When b = m, the output buffer unit stores all the output data of the M output channels; wherein b ≤ m, and b is a positive integer.
Further, the input buffer unit is specifically configured to send each input data block in sequence along the Z-axis direction.
Further, as shown in Fig. 23, the convolution computation core unit 3 includes: an input register unit 31, a computing engine unit 32 and an output register unit 33.
The input register unit 31 is configured to read each input data block one by one from the input buffer unit and send the input data block to the computing engine unit.
The computing engine unit 32 is configured to convolve the Tn input channels of each input data block read thereby in turn with the Tn convolution channels of the Tm convolution kernels of the b-th convolution group read thereby, generating output data blocks having Tm output channels, until convolution with the n-th input data block is finished, and to send the generated output data of the Tm output channels.
The output register unit 33 is configured to accumulate the partial channel output data fed back by the output buffer unit with the generated output data of the Tm output channels, and to send the accumulated partial channel output data; wherein the fed-back partial channel output data are generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups in turn with each input data block and accumulating the results.
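One reading of the feedback loop between the computing engine and the output register/buffer units is a running partial-sum accumulation over the n input data blocks, sketched below with made-up shapes (the real units are hardware buffers, not Python objects):

```python
import numpy as np

# Hedged sketch of the partial-sum accumulation for one convolution group:
# each of the n input data blocks contributes a partial sum for the same
# Tm output channels; the buffer feeds its running total back, and the
# core adds the new partial sum to it. Shapes are made-up examples.
def run_group(partial_sums):
    stored = None                          # held in the output buffer unit
    for part in partial_sums:              # one partial sum per input block
        stored = part if stored is None else stored + part  # core accumulates
    return stored                          # output data of the Tm channels

n_blocks = [np.full((2, 3, 3), v) for v in (1.0, 2.0, 3.0)]  # Tm=2 channels
total = run_group(n_blocks)
print(float(total[0, 0, 0]))               # 6.0 = 1 + 2 + 3
```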
The reconfigurable neural network acceleration method and architecture provided by the above embodiments, through the architecture of the input buffer unit, weight buffer unit, convolution computation core unit and output buffer unit, adopt the weight-data reuse method and the layer-by-layer acceleration strategy to adapt to neural networks with various numbers of layers, and optimize the acceleration method through loop transformation. This reduces the number of accesses to the Buffer and to DRAM and solves the prior-art problem that frequent memory accesses waste power, with the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
The reconfigurable neural network acceleration method and architecture provided by the present invention, through the architecture of the input buffer unit, weight buffer unit, convolution computation core unit and output buffer unit, respectively adopt the input-data reuse method, the output-data reuse method and the weight-data reuse method, use the layer-by-layer acceleration strategy to adapt to neural networks with various numbers of layers, and optimize the acceleration method through loop transformation. This reduces the number of accesses to the Buffer and to DRAM and solves the prior-art problem that frequent memory accesses waste power, with the advantageous effects of reducing energy consumption and maximizing the hardware resource utilization of the PE array.
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Specific examples are applied herein to expound the principles and embodiments of the present invention; the description of the above embodiments is merely intended to help understand the method of the present invention and its core concept. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific embodiments and the scope of application according to the concept of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (21)
1. A reconfigurable neural network acceleration method, characterized in that the method comprises:
dividing, by an input buffer unit, the input data of N input channels into n input data blocks, each said input data block having Tn input channels, and sending each said input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
dividing, by a weight buffer unit, M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and sending each said convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
convolving, by a convolution computation core unit, the a-th input data block read thereby in turn with each said convolution group, generating output data blocks having Tm output channels, until convolution with the m-th convolution group is finished, thereby generating an output data block having M output channels; accumulating the stored output data block of M output channels fed back by an output buffer unit with the generated output data block of the M output channels, and sending the accumulated output data block; wherein the stored output data block is generated, before the a-th input data block is read, by convolving the 1st to (a-1)-th input data blocks in turn with each said convolution group and accumulating the results;
storing, by the output buffer unit, the received accumulated output data block as the stored output data block of M output channels, and feeding the stored output data block back to the convolution computation core unit; when a = n, the output buffer unit stores all the output data of the M output channels; wherein a ≤ n, and a is a positive integer.
2. The reconfigurable neural network acceleration method according to claim 1, characterized in that sending each said input data block in sequence comprises: sending each said input data block in sequence along the Z-axis direction.
3. The reconfigurable neural network acceleration method according to claim 1, characterized in that convolving, by the convolution computation core unit, the a-th input data block read thereby in turn with each said convolution group comprises: convolving, by the convolution computation core unit, the Tn input channels of the a-th input data block read thereby in turn with the Tn convolution channels of each said convolution group; wherein the Tn input channels correspond one-to-one with the Tn convolution channels of each said convolution kernel for the convolution operation.
4. The reconfigurable neural network acceleration method according to claim 3, characterized in that the method further comprises:
judging whether the stride of the current convolution kernel is greater than 1;
if so, mapping the input data block to a PE array in an interleaved manner and convolving it with the same convolution kernel;
if not, when the size of the output data block is smaller than the size of the input data block, dividing each said input data block into W small input data pieces of identical size, re-stitching the small input data pieces at corresponding positions of each input data block to generate W stitched input data blocks of identical size, and mapping the W stitched input data blocks to the PE array for convolution with the same convolution kernel.
5. A reconfigurable neural network acceleration architecture, characterized by comprising: an input buffer unit, a weight buffer unit, a convolution computation core unit and an output buffer unit;
the input buffer unit being configured to divide the input data of N input channels into n input data blocks, each said input data block having Tn input channels, and to send each said input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
the weight buffer unit being configured to divide M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and to send each said convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
the convolution computation core unit being configured to convolve the a-th input data block read thereby in turn with each said convolution group, generating output data blocks having Tm output channels, until convolution with the m-th convolution group is finished, thereby generating an output data block having M output channels; to accumulate the stored output data block of M output channels fed back by the output buffer unit with the generated output data block of the M output channels; and to send the accumulated output data block; wherein the stored output data block is generated, before the a-th input data block is read, by convolving the 1st to (a-1)-th input data blocks in turn with each said convolution group and accumulating the results;
the output buffer unit being configured to store the received accumulated output data block as the stored output data block of M output channels, and to feed the stored output data block back to the convolution computation core unit; when a = n, the output buffer unit stores all the output data of the M output channels; wherein a ≤ n, and a is a positive integer.
6. The reconfigurable neural network acceleration architecture according to claim 5, characterized in that the input buffer unit is specifically configured to: send each said input data block in sequence along the Z-axis direction.
7. The reconfigurable neural network acceleration architecture according to claim 5, characterized in that the convolution computation core unit comprises: an input register unit, a computing engine unit and an output register unit;
the input register unit being configured to read the a-th input data block from the input buffer unit and send the a-th input data block to the computing engine unit;
the computing engine unit being configured to convolve the Tn input channels of the a-th input data block read thereby in turn with the Tn convolution channels of the Tm convolution kernels of each convolution group, generating output data blocks having Tm output channels, until convolution with the m-th convolution group is finished, and to send the generated output data block having M output channels;
the output register unit being configured to accumulate the stored output data block of M output channels fed back by the output buffer unit with the generated output data block of the M output channels, and to send the accumulated output data block; wherein the stored output data block is generated, before the a-th input data block is read, by convolving the 1st to (a-1)-th input data blocks in turn with each said convolution group and accumulating the results.
8. A reconfigurable neural network acceleration method, characterized in that the method comprises:
dividing, by an input buffer unit, the input data of N input channels into n input data blocks, each said input data block having Tn input channels, and sending each said input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
dividing, by a weight buffer unit, M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and sending each said convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
convolving, by a convolution computation core unit, each input data block read thereby in turn with the b-th convolution group read thereby, generating output data blocks having Tm output channels, until convolution with the n-th input data block is finished, thereby generating all the output data of the Tm output channels; accumulating the partial channel output data stored by the convolution computation core unit with all the generated output data of the Tm output channels, and generating and storing the accumulated partial channel output data; wherein the stored partial channel output data are generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups in turn with each said input data block and accumulating the results; when b = m, sending, by the convolution computation core unit, the accumulated output data of the M channels; wherein b ≤ m, and b is a positive integer; performing pooling on the received output data of the M output channels, and sending the pooled output data;
receiving and storing, by an output buffer unit, the pooled output data, thereby generating the pooled output data of the M output channels.
9. The reconfigurable neural network acceleration method according to claim 8, characterized in that sending each said input data block in sequence comprises: sending each said input data block in sequence along the X/Y plane.
10. The reconfigurable neural network acceleration method according to claim 8, characterized in that convolving, by the convolution computation core unit, each input data block read thereby in turn with the b-th convolution group read thereby comprises: convolving, by the convolution computation core unit, the Tn input channels of each input data block read thereby in turn with the Tn convolution channels of the b-th convolution group read thereby; wherein the Tn input channels correspond one-to-one with the Tn convolution channels of each convolution kernel for the convolution operation.
11. The reconfigurable neural network acceleration method according to claim 10, characterized in that the method further comprises:
judging whether the stride of the current convolution kernel is greater than 1;
if so, mapping the input data block to a PE array in an interleaved manner and convolving it with the same convolution kernel;
if not, when the size of the output data block is smaller than the size of the input data block, dividing each said input data block into W small input data pieces of identical size, re-stitching the small input data pieces at corresponding positions of each input data block to generate W stitched input data blocks of identical size, and mapping the W stitched input data blocks to the PE array for convolution with the same convolution kernel.
12. A reconfigurable neural network acceleration architecture, characterized by comprising: an input buffer unit, a weight buffer unit, a convolution computation core unit and an output buffer unit;
the input buffer unit being configured to divide the input data of N input channels into n input data blocks, each said input data block having Tn input channels, and to send each said input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
the weight buffer unit being configured to divide M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and to send each said convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
the convolution computation core unit being configured to convolve each input data block read thereby in turn with the b-th convolution group read thereby, generating output data blocks having Tm output channels, until convolution with the n-th input data block is finished, thereby generating all the output data of the Tm output channels; to accumulate the partial channel output data stored by the convolution computation core unit with all the generated output data of the Tm output channels, and to generate and store the accumulated partial channel output data; wherein the stored partial channel output data are generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups in turn with each said input data block and accumulating the results; when b = m, the convolution computation core unit sends the accumulated output data of the M channels; wherein b ≤ m, and b is a positive integer; and to perform pooling on the received output data of the M output channels and send the pooled output data;
the output buffer unit being configured to receive and store the pooled output data, thereby generating the pooled output data of the M output channels.
13. The reconfigurable neural network acceleration architecture according to claim 12, characterized in that the input buffer unit is specifically configured to: send each said input data block in sequence along the X/Y plane.
14. The reconfigurable neural network acceleration architecture according to claim 12, characterized in that the convolution computation core unit comprises: an input register unit, a computing engine unit, an output register unit and a pooling unit;
the input register unit being configured to read each input data block one by one from the input buffer unit and send the input data block to the computing engine unit;
the computing engine unit being configured to convolve the Tn input channels of each input data block read thereby in turn with the Tn convolution channels of the Tm convolution kernels of the b-th convolution group read thereby, generating output data blocks having Tm output channels, until convolution with the n-th input data block is finished, thereby generating all the output data of the Tm output channels; and to accumulate the partial channel output data fed back by the output register unit with all the generated output data of the Tm output channels, and to generate and send the accumulated partial channel output data; wherein the partial channel output data fed back by the output register unit are generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups in turn with each said input data block and accumulating the results;
the output register unit being configured to store the received accumulated partial channel output data as the partial channel output data, and to feed the stored partial channel output data back to the computing engine unit; when b = m, the output register unit sends the accumulated output data of the M channels; wherein b ≤ m, and b is a positive integer;
the pooling unit being configured to perform pooling on the received output data of the M output channels and send the pooled output data.
15. A reconfigurable neural network acceleration method, characterized in that the method comprises:
dividing, by an input buffer unit, the input data of N input channels into n input data blocks, each said input data block having Tn input channels, and sending each said input data block in sequence; wherein n = N/Tn, and N, n and Tn are positive integers;
dividing, by a weight buffer unit, M convolution kernels into m convolution groups, each said convolution group having Tm convolution kernels and each said convolution kernel having N convolution channels, and sending each said convolution group in sequence; wherein m = M/Tm, and M, m, Tm and N are positive integers;
convolving, by a convolution computation core unit, each input data block read thereby in turn with the b-th convolution group read thereby, generating output data blocks having Tm output channels, until convolution with the n-th input data block is finished, thereby generating the output data of the Tm output channels; accumulating the partial channel output data fed back by an output buffer unit with the generated output data of the Tm output channels, and sending the accumulated partial channel output data; wherein the fed-back partial channel output data are generated, before the b-th convolution group is read, by convolving the 1st to (b-1)-th convolution groups in turn with each said input data block and accumulating the results;
storing, by the output buffer unit, the received accumulated partial channel output data as the partial channel output data, and feeding the stored partial channel output data back to the convolution computation core unit; when b = m, the output buffer unit stores all the output data of the M output channels; wherein b ≤ m, and b is a positive integer.
16. The reconfigurable neural network acceleration method according to claim 15, characterized in that sending each said input data block in sequence comprises: sending each said input data block in sequence along the Z-axis direction.
17. The reconfigurable neural network acceleration method according to claim 15, characterized in that convolving, by the convolution computation core unit, each input data block read thereby in turn with the b-th convolution group read thereby comprises: convolving, by the convolution computation core unit, the Tn input channels of each input data block read thereby in turn with the Tn convolution channels of the b-th convolution group read thereby; wherein the Tn input channels correspond one-to-one with the Tn convolution channels of each convolution kernel for the convolution operation.
18. The reconfigurable neural network acceleration method according to claim 17, characterized in that the method further comprises:
judging whether the stride of the current convolution kernel is greater than 1;
if so, mapping the input data block to a PE array in an interleaved manner and convolving it with the same convolution kernel;
if not, when the size of the output data block is smaller than the size of the input data block, dividing each said input data block into W small input data pieces of identical size, re-stitching the small input data pieces at corresponding positions of each input data block to generate W stitched input data blocks of identical size, and mapping the W stitched input data blocks to the PE array for convolution with the same convolution kernel.
19. A reconfigurable neural network acceleration architecture, characterized by comprising: an input buffer unit, a weight buffer unit, a convolution compute core unit and an output buffer unit;
the input buffer unit is configured to divide the input data of N input channels into n input data blocks, each input data block having Tn input channels, and to send each input data block in turn, wherein n = N/Tn and N, n, Tn are positive integers;
the weight buffer unit is configured to divide M convolution kernels into m convolution groups, each convolution group having Tm convolution kernels and each convolution kernel having N convolution channels, and to send each convolution group in turn, wherein m = M/Tm and M, m, Tm, N are positive integers;
the convolution compute core unit is configured to convolve each input data block read in turn with the b-th convolution group that has been read, generating output data blocks with Tm output channels, until the convolution with the n-th input data block is finished and the output data of the Tm output channels is generated; to accumulate the channel partial-sum output data fed back by the output buffer unit with the generated output data of the Tm output channels; and to send the accumulated channel partial-sum output data, wherein the fed-back channel partial-sum output data is generated, before the b-th convolution group is read, by accumulating the results of convolving the 1st to (b-1)-th convolution groups in turn with each input data block;
the output buffer unit is configured to store the received accumulated channel partial-sum output data as channel partial-sum output data, and to feed the stored channel partial-sum output data back to the convolution compute core unit; when b = m, the output buffer unit stores all the output data of the M output channels, wherein b ≤ m and b is a positive integer.
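As a minimal sketch of the tiled dataflow recited in claim 19 (not part of the claims): the loop below tiles N input channels into n blocks of Tn channels and M kernels into m groups of Tm kernels, and accumulates channel partial sums into the output buffer until all input blocks have been consumed. The 1x1 kernels, the flattened spatial axis, and the name `tiled_conv` are simplifying assumptions made for the example.

```python
import numpy as np

def tiled_conv(inputs, kernels, Tn, Tm):
    N, S = inputs.shape            # N input channels, S spatial positions
    M = kernels.shape[0]           # M kernels, each with N convolution channels
    n, m = N // Tn, M // Tm
    out = np.zeros((M, S))         # output buffer: M output channels
    for b in range(m):                          # one convolution group at a time
        kgroup = kernels[b * Tm:(b + 1) * Tm]   # the Tm kernels of group b
        for i in range(n):                      # one input data block at a time
            iblock = inputs[i * Tn:(i + 1) * Tn]        # Tn input channels
            kblock = kgroup[:, i * Tn:(i + 1) * Tn]     # matching Tn channels
            # accumulate channel partial sums into the output buffer
            out[b * Tm:(b + 1) * Tm] += kblock @ iblock
    return out

x = np.random.rand(8, 5)      # N = 8 input channels
w = np.random.rand(4, 8)      # M = 4 kernels
y = tiled_conv(x, w, Tn=2, Tm=2)
```

After the inner loop finishes for group b, rows b*Tm .. (b+1)*Tm-1 of `out` hold the completed output channels of that group; when b = m, all M output channels are complete, matching the untiled result.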
20. The reconfigurable neural network acceleration architecture according to claim 19, characterized in that the input buffer unit is specifically configured to send each input data block in turn along the Z direction.
21. The reconfigurable neural network acceleration architecture according to claim 19, characterized in that the convolution compute core unit comprises: an input register unit, a compute engine unit and an output register unit;
the input register unit is configured to read each input data block one by one from the input buffer unit and to send the input data block to the compute engine unit;
the compute engine unit is configured to convolve the Tn input channels of each input data block read in turn with the Tn convolution channels of the Tm convolution kernels of the b-th convolution group that has been read, generating output data blocks with Tm output channels, until the convolution with the n-th input data block is finished, and to send the generated output data of the Tm output channels;
the output register unit is configured to accumulate the channel partial-sum output data fed back by the output buffer unit with the generated output data of the Tm output channels, and to send the accumulated channel partial-sum output data, wherein the fed-back channel partial-sum output data is generated, before the b-th convolution group is read, by accumulating the results of convolving the 1st to (b-1)-th convolution groups in turn with each input data block.
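A pure-Python sketch of what one compute-engine tile computes (an assumption-level illustration, not the claimed hardware): Tm multiply-accumulate lanes each combine the Tn input channels of one block with the matching Tn convolution channels of one kernel, producing Tm partial output values for one spatial position.

```python
# Sketch of one compute-engine tile: a Tm x Tn grid of multiply-accumulate
# lanes; `engine_tile` and the scalar-per-channel layout are assumptions.
def engine_tile(iblock, kblock):
    Tm, Tn = len(kblock), len(iblock)
    out = [0.0] * Tm
    for tm in range(Tm):            # one lane row per output channel
        acc = 0.0
        for tn in range(Tn):        # MAC across the Tn convolution channels
            acc += kblock[tm][tn] * iblock[tn]
        out[tm] = acc
    return out

# Tn = 2 input channels, Tm = 2 kernels
partial = engine_tile([1.0, 2.0], [[1.0, 0.0], [0.5, 0.5]])
```

The output register unit would then add `partial` to the partial sums fed back by the output buffer unit, exactly as the accumulation step in the claim describes.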
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810084089.2A CN108241890B (en) | 2018-01-29 | 2018-01-29 | Reconfigurable neural network acceleration method and architecture |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108241890A true CN108241890A (en) | 2018-07-03 |
CN108241890B CN108241890B (en) | 2021-11-23 |
Family
ID=62698691
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810084089.2A Active CN108241890B (en) | 2018-01-29 | 2018-01-29 | Reconfigurable neural network acceleration method and architecture |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108241890B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250103A (en) * | 2016-08-04 | 2016-12-21 | Southeast University | System for data reuse in cyclic convolution computation of a convolutional neural network |
Non-Patent Citations (3)
Title |
---|
MA Y, CAO Y, VRUDHULA S, et al.: "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks", Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays * |
LIU Zhiqiang: "Research on Key Technologies of Reconfigurable Accelerators for Deep Learning Algorithms", China Master's Theses Full-text Database, Information Science and Technology * |
LU Zhijian: "Research on FPGA-based Parallel Structures for Convolutional Neural Networks", China Doctoral Dissertations Full-text Database, Information Science and Technology * |
Cited By (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110716751A (en) * | 2018-07-12 | 2020-01-21 | 赛灵思公司 | High-parallelism computing platform, system and computing implementation method |
CN109032781A (en) * | 2018-07-13 | 2018-12-18 | 重庆邮电大学 | A kind of FPGA parallel system of convolutional neural networks algorithm |
CN109844774A (en) * | 2018-08-28 | 2019-06-04 | 深圳鲲云信息科技有限公司 | A kind of parallel deconvolution calculation method, single engine calculation method and Related product |
CN109844774B (en) * | 2018-08-28 | 2023-01-24 | 深圳鲲云信息科技有限公司 | Parallel deconvolution computing method, single-engine computing method and related products |
CN110865950A (en) * | 2018-08-28 | 2020-03-06 | 中科寒武纪科技股份有限公司 | Data preprocessing method and device, computer equipment and storage medium |
CN110865950B (en) * | 2018-08-28 | 2021-01-12 | 中科寒武纪科技股份有限公司 | Data preprocessing method and device, computer equipment and storage medium |
CN109284824B (en) * | 2018-09-04 | 2021-07-23 | 复旦大学 | Reconfigurable technology-based device for accelerating convolution and pooling operation |
CN109284824A (en) * | 2018-09-04 | 2019-01-29 | 复旦大学 | A kind of device for being used to accelerate the operation of convolution sum pond based on Reconfiguration Technologies |
CN110888824A (en) * | 2018-09-07 | 2020-03-17 | 黑芝麻智能科技(上海)有限公司 | Multilevel memory hierarchy |
CN109447257A (en) * | 2018-09-18 | 2019-03-08 | 复旦大学 | A kind of deep neural network of channel self-organizing accelerates the arithmetic unit of chip |
CN109447257B (en) * | 2018-09-18 | 2021-08-17 | 复旦大学 | Operation device of deep neural network acceleration chip with self-organized channels |
CN109447241A (en) * | 2018-09-29 | 2019-03-08 | 西安交通大学 | A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field |
CN109447241B (en) * | 2018-09-29 | 2022-02-22 | 西安交通大学 | Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things |
CN109359735B (en) * | 2018-11-23 | 2020-12-04 | 浙江大学 | Data input device and method for accelerating deep neural network hardware |
CN109359735A (en) * | 2018-11-23 | 2019-02-19 | 浙江大学 | The hardware-accelerated data input device of deep neural network and method |
CN109598338B (en) * | 2018-12-07 | 2023-05-19 | 东南大学 | Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization |
CN109598338A (en) * | 2018-12-07 | 2019-04-09 | 东南大学 | A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA |
CN109740732A (en) * | 2018-12-27 | 2019-05-10 | 深圳云天励飞技术有限公司 | Neural network processor, convolutional neural networks data multiplexing method and relevant device |
CN109711367A (en) * | 2018-12-29 | 2019-05-03 | 北京中科寒武纪科技有限公司 | Operation method, device and Related product |
CN111523652B (en) * | 2019-02-01 | 2023-05-02 | 阿里巴巴集团控股有限公司 | Processor, data processing method thereof and image pickup device |
CN111523652A (en) * | 2019-02-01 | 2020-08-11 | 阿里巴巴集团控股有限公司 | Processor, data processing method thereof and camera device |
CN110110849B (en) * | 2019-04-29 | 2023-04-07 | 西安电子科技大学 | Line fixed data stream mapping method based on graph segmentation |
CN110110849A (en) * | 2019-04-29 | 2019-08-09 | 西安电子科技大学 | Row fixed data stream mapping method based on figure segmentation |
CN110390384A (en) * | 2019-06-25 | 2019-10-29 | 东南大学 | A kind of configurable general convolutional neural networks accelerator |
CN110390384B (en) * | 2019-06-25 | 2021-07-06 | 东南大学 | Configurable general convolutional neural network accelerator |
CN110414672A (en) * | 2019-07-23 | 2019-11-05 | 江苏鼎速网络科技有限公司 | Convolution algorithm method, apparatus and system |
CN112308217A (en) * | 2019-07-31 | 2021-02-02 | 北京欣奕华科技有限公司 | Convolutional neural network acceleration method and system |
CN110516801A (en) * | 2019-08-05 | 2019-11-29 | 西安交通大学 | A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput |
CN110516801B (en) * | 2019-08-05 | 2022-04-22 | 西安交通大学 | High-throughput-rate dynamic reconfigurable convolutional neural network accelerator |
CN110490302A (en) * | 2019-08-12 | 2019-11-22 | 北京中科寒武纪科技有限公司 | A kind of neural network compiling optimization method, device and Related product |
CN110533177B (en) * | 2019-08-22 | 2023-12-26 | 安谋科技(中国)有限公司 | Data read-write device, method, equipment, medium and convolution accelerator |
CN110533177A (en) * | 2019-08-22 | 2019-12-03 | 安谋科技(中国)有限公司 | A kind of data read-write equipment, method, equipment, medium and convolution accelerator |
CN111126593B (en) * | 2019-11-07 | 2023-05-05 | 复旦大学 | Reconfigurable natural language deep convolutional neural network accelerator |
CN111126593A (en) * | 2019-11-07 | 2020-05-08 | 复旦大学 | Reconfigurable natural language deep convolution neural network accelerator |
CN111199273B (en) * | 2019-12-31 | 2024-03-26 | 深圳云天励飞技术有限公司 | Convolution calculation method, device, equipment and storage medium |
CN111199273A (en) * | 2019-12-31 | 2020-05-26 | 深圳云天励飞技术有限公司 | Convolution calculation method, device, equipment and storage medium |
CN111258574A (en) * | 2020-01-14 | 2020-06-09 | 中科驭数(北京)科技有限公司 | Programming method and system for accelerator architecture |
CN111258574B (en) * | 2020-01-14 | 2021-01-15 | 中科驭数(北京)科技有限公司 | Programming method and system for accelerator architecture |
US11423292B2 (en) | 2020-02-15 | 2022-08-23 | Industrial Technology Research Institute | Convolutional neural-network calculating apparatus and operation methods thereof |
CN111427895B (en) * | 2020-04-01 | 2022-10-25 | 西安交通大学 | Neural network reasoning acceleration method based on two-segment cache |
CN111427895A (en) * | 2020-04-01 | 2020-07-17 | 西安交通大学 | Neural network reasoning acceleration method based on two-segment cache |
CN111610963A (en) * | 2020-06-24 | 2020-09-01 | 上海西井信息科技有限公司 | Chip structure and multiply-add calculation engine thereof |
CN111610963B (en) * | 2020-06-24 | 2021-08-17 | 上海西井信息科技有限公司 | Chip structure and multiply-add calculation engine thereof |
CN111859797A (en) * | 2020-07-14 | 2020-10-30 | Oppo广东移动通信有限公司 | Data processing method and device and storage medium |
CN112580774A (en) * | 2020-09-01 | 2021-03-30 | 浙江大学 | Neural network layout method for reconfigurable neural network processor |
CN114089911A (en) * | 2021-09-07 | 2022-02-25 | 上海新氦类脑智能科技有限公司 | Block segmentation splicing processing method, device, equipment and medium based on data multiplexing |
CN114089911B (en) * | 2021-09-07 | 2024-01-05 | 上海新氦类脑智能科技有限公司 | Block segmentation and splicing processing method, device, equipment and medium based on data multiplexing |
WO2023098256A1 (en) * | 2021-12-03 | 2023-06-08 | 中兴通讯股份有限公司 | Neural network operation method and apparatus, chip, electronic device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108241890B (en) | 2021-11-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108241890A (en) | A kind of restructural neural network accelerated method and framework | |
CN111178519B (en) | Convolutional neural network acceleration engine, convolutional neural network acceleration system and method | |
CN106779060B (en) | A kind of calculation method for the depth convolutional neural networks realized suitable for hardware design | |
Wardono et al. | A tabu search algorithm for the multi-stage parallel machine problem with limited buffer capacities | |
CN108009106A (en) | Neural computing module | |
CN110390384A (en) | A kind of configurable general convolutional neural networks accelerator | |
CN106126481A (en) | A kind of computing engines and electronic equipment | |
CN103135132A (en) | Hybrid-domain full wave form inversion method of central processing unit (CPU)/graphics processing unit (GPU) synergetic parallel computing | |
CN105739951B (en) | A kind of L1 minimization problem fast solution methods based on GPU | |
CN104375838B (en) | It is a kind of based on OpenMP to the optimization method of astronomy software Gridding | |
CN109872161A (en) | A kind of chip and system accelerating IOTA subchain transaction verification process | |
CN108427861A (en) | A method of material periodicities polycrystalline structure is built based on mpt kits | |
CN110187965A (en) | The running optimizatin and data processing method of neural network, equipment and storage medium | |
CN106415526A (en) | FET processor and operation method | |
CN109615071A (en) | A kind of neural network processor of high energy efficiency, acceleration system and method | |
JP5572340B2 (en) | Data processing apparatus and method | |
CN108491924A (en) | A kind of serial stream treatment device of Neural Network Data calculated towards artificial intelligence | |
CN110414672B (en) | Convolution operation method, device and system | |
CN106484532B (en) | GPGPU parallel calculating method towards SPH fluid simulation | |
CN110490308A (en) | Accelerate design method, terminal device and the storage medium in library | |
CN112732630A (en) | Floating-point matrix multiplier many-core parallel optimization method for deep learning | |
CN106529679A (en) | Machine learning method and system | |
CN106934485A (en) | A kind of new one-dimensional based on genetic algorithm rehearses baiting method | |
CN110109913B (en) | Hardware implementation method and device of zerocase mining algorithm | |
CN108038304A (en) | A kind of Lattice Boltzmann Method parallel acceleration method using temporal locality |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||