CN108133270A - Convolutional neural networks accelerating method and device - Google Patents
- Publication number
- CN108133270A CN108133270A CN201810028998.4A CN201810028998A CN108133270A CN 108133270 A CN108133270 A CN 108133270A CN 201810028998 A CN201810028998 A CN 201810028998A CN 108133270 A CN108133270 A CN 108133270A
- Authority
- CN
- China
- Prior art keywords
- output
- input
- convolution
- convolutional layer
- fully connected layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
This disclosure relates to a convolutional neural network acceleration method and device. The method includes: reading the input feature maps of a convolutional layer; inputting the input feature maps into the processing array groups of the convolutional layer, and performing the convolution multiply-add operations with propagate-partial multiplier-accumulators based on the first reference pixel vector, the second reference pixel vector, the convolution kernel weights, and the data of the completed input channels, to obtain the output results of the processing array groups; obtaining the output feature maps of the convolutional layer according to the output results of the processing array groups; writing the output feature maps of the last convolutional layer into the input buffer of the fully connected layer; the fully connected layer performing multiply-add operations according to the output feature maps of the last convolutional layer, to obtain the output feature vector of the fully connected layer; and outputting the output feature vector of the last fully connected layer into the fourth partition of the first memory. The disclosure effectively reduces hardware resources and power consumption and improves the processing speed of convolutional neural networks.
Description
Technical field
This disclosure relates to the field of neural network technology, and in particular to a convolutional neural network acceleration method and device.
Background technology
Deep learning has shown excellent performance on many problems such as video recognition, speech recognition, and natural language processing. Among the different types of neural networks, convolutional neural networks have been studied most deeply. The basic structure of a convolutional neural network comprises two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to a local region of the previous layer, from which it extracts local features; this is called a convolutional layer. The second is the feature combination layer, generally a fully connected neural network that classifies the image features extracted by the preceding feature extraction layers; this is called a fully connected layer. A network typically contains multiple feature extraction layers and feature combination layers, realizing complicated nonlinear classification of large-scale inputs. As network depth increases, the parameter scale of a convolutional neural network grows greatly, and the amount of computation for forward feature extraction, classification, and back-propagation of errors is usually on the order of millions to hundreds of millions of operations. Accelerating convolutional neural networks is therefore the key to improving the computational efficiency of convolutional neural network models.
Summary of the invention
In view of this, the present disclosure proposes a convolutional neural network acceleration method and device, to solve the problem of slow convolutional neural network computation.
According to one aspect of the disclosure, a convolutional neural network acceleration method is provided. The method includes:
reading the input feature maps of a convolutional layer;
inputting the input feature maps into the processing array groups of the convolutional layer, and performing the convolution multiply-add operations with propagate-partial multiplier-accumulators based on the first reference pixel vector, the second reference pixel vector, the convolution kernel weights, and the data of the completed input channels, to obtain the output results of the processing array groups; wherein each processing array group includes multiple processing arrays, each processing array includes three row processing arrays, the number of rows of the convolution window is M, the first reference pixel vector takes the first M−N rows of the convolution window, the second reference pixel vector takes the last N rows of the convolution window, M and N are positive integers, and M>N;
obtaining the output feature maps of the convolutional layer according to the output results of the processing array groups;
writing the output feature maps of the last convolutional layer into the input buffer of the fully connected layer;
the fully connected layer performing multiply-add operations according to the output feature maps of the last convolutional layer, to obtain the output feature vector of the fully connected layer; and
outputting the output feature vector of the last fully connected layer into the fourth partition of the first memory.
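Functionally, the claimed pipeline computes an ordinary CNN forward pass. The following minimal NumPy sketch is illustrative only — the layer sizes, function names, and single fully connected layer are assumptions, and the hardware parallelism, partitioned memories, and PP-MAC structure are deliberately not modeled; it shows only the arithmetic the accelerator performs:

```python
import numpy as np

def conv_layer(feat_maps, kernels, bias):
    """Naive 3x3 valid convolution: partial sums accumulated per input channel,
    accumulator initialized with the bias, ReLU activation at the end."""
    c_in, h, w = feat_maps.shape
    c_out = kernels.shape[0]                      # kernels: (c_out, c_in, 3, 3)
    out = np.zeros((c_out, h - 2, w - 2)) + bias[:, None, None]
    for o in range(c_out):
        for i in range(c_in):                     # accumulate over input channels
            for y in range(h - 2):
                for x in range(w - 2):
                    out[o, y, x] += np.sum(feat_maps[i, y:y+3, x:x+3] * kernels[o, i])
    return np.maximum(out, 0)                     # ReLU

def fc_layer(vec, weights, bias):
    """One fully connected layer: a plain multiply-add over the flattened input."""
    return weights @ vec + bias

# Tiny end-to-end run: one convolutional layer feeding one fully connected layer.
img = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
conv_out = conv_layer(img, np.ones((3, 2, 3, 3)), np.zeros(3))
fc_out = fc_layer(conv_out.ravel(), np.ones((5, conv_out.size)), np.zeros(5))
```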
In a possible implementation, reading the input feature maps of the convolutional layer includes:
when the input feature map of the current input channel read by one of the first convolution window and the second convolution window is being used for the convolution multiply-add operations, the other convolution window reading the input feature map of the next input channel;
when the first convolution window or the second convolution window needs to be filled, reading input feature map data from the top-row buffer and the right buffer, respectively; and
when the convolution weights of the current group of output channels read by one of the first weight register and the second weight register are being used for the convolution multiply-add operations, the other register reading the convolution weights of the next group of output channels.
In a possible implementation, the method further includes:
writing the input image into the first partition of the first memory as the input feature map of the first convolutional layer.
In a possible implementation, the method further includes:
writing the parameters of the odd-numbered convolutional and fully connected layers into the second partition of the first memory; and
writing the parameters of the even-numbered convolutional and fully connected layers into the second partition of the second memory;
the parameters including weights and biases.
In a possible implementation, obtaining the output feature maps of the convolutional layer according to the output results of the processing array groups includes:
when the convolutional layer is not the last layer, storing the output feature maps of even-numbered convolutional layers in the third partition of the first memory, and storing the output feature maps of odd-numbered convolutional layers in the third partition of the second memory.
In a possible implementation, the fully connected layer performing multiply-add operations according to the output feature maps of the last convolutional layer to obtain the output feature vector of the fully connected layer includes:
obtaining the input feature vector of the fully connected layer according to the output feature maps of the last convolutional layer;
inputting the input feature vector of the fully connected layer and the weights of the fully connected layer into a fully connected processing unit to obtain the output feature vector of the fully connected layer, the fully connected processing unit including a multiplier-accumulator and L registers connected in series, where L is an integer greater than 2; and
in each clock cycle, transferring the data of the previous clock cycle stored in the first through (L−1)-th of the L series-connected registers to the next-stage register, and inputting the data of the previous clock cycle stored in the L-th register, via the multiplier-accumulator, into the first memory.
In a possible implementation, the fully connected layer performing multiply-add operations according to the output feature maps of the last convolutional layer to obtain the output feature vector of the fully connected layer further includes:
when the input feature vector of the fully connected layer or the fully connected weights are invalid data, padding the input of the multiplier-accumulator with zeros and continuing to accumulate until valid data arrives.
According to another aspect of the disclosure, a convolutional neural network acceleration device is provided, including:
an input feature map reading module, configured to read the input feature maps of a convolutional layer;
a processing array group module, configured to input the input feature maps into the processing array groups of the convolutional layer, and perform the convolution multiply-add operations with propagate-partial multiplier-accumulators based on the first reference pixel vector, the second reference pixel vector, the convolution kernel weights, and the data of the completed input channels, to obtain the output results of the processing array groups; wherein each processing array group includes multiple processing arrays, each processing array includes three row processing arrays, the number of rows of the convolution window is M, the first reference pixel vector takes the first M−N rows of the convolution window, the second reference pixel vector takes the last N rows, M and N are positive integers, and M>N;
a convolutional layer output feature map determining module, configured to obtain the output feature maps of the convolutional layer according to the output results of the processing array groups;
a fully connected layer input buffer module, configured to write the output feature maps of the last convolutional layer into the input buffer of the fully connected layer;
a fully connected layer output feature vector acquisition module, configured to have the fully connected layer perform multiply-add operations according to the output feature maps of the last convolutional layer, to obtain the output feature vector of the fully connected layer; and
a fully connected layer output feature vector output module, configured to output the output feature vector of the last fully connected layer into the fourth partition of the first memory.
In a possible implementation, the input feature map reading module includes:
a convolution window reading submodule, configured such that when the input feature map of the current input channel read by one of the first convolution window and the second convolution window is being used for the convolution multiply-add operations, the other convolution window reads the input feature map of the next input channel;
a convolution window filling submodule, configured to read input feature map data from the top-row buffer and the right buffer, respectively, when the first convolution window or the second convolution window needs to be filled; and
a weight register reading submodule, configured such that when the convolution weights of the current group of output channels read by one of the first weight register and the second weight register are being used for the convolution multiply-add operations, the other register reads the convolution weights of the next group of output channels.
In a possible implementation, the device further includes:
an input image writing module, configured to write the input image into the first partition of the first memory as the input feature map of the first convolutional layer.
In a possible implementation, the device further includes:
a parameter writing module, configured to write the parameters of the odd-numbered convolutional and fully connected layers into the second partition of the first memory, and write the parameters of the even-numbered convolutional and fully connected layers into the second partition of the second memory; the parameters including weights and biases.
In a possible implementation, the convolutional layer output feature map determining module includes:
an output submodule, configured to store, when the convolutional layer is not the last layer, the output feature maps of even-numbered convolutional layers in the third partition of the first memory, and the output feature maps of odd-numbered convolutional layers in the third partition of the second memory.
In a possible implementation, the fully connected layer output feature vector acquisition module includes:
an input feature vector acquisition submodule, configured to obtain the input feature vector of the fully connected layer according to the output feature maps of the last convolutional layer; and
a fully connected processing submodule, configured to input the input feature vector of the fully connected layer and the weights of the fully connected layer into a fully connected processing unit to obtain the output feature vector of the fully connected layer, the fully connected processing unit including a multiplier-accumulator and L registers connected in series, where L is an integer greater than 2; in each clock cycle, the data of the previous clock cycle stored in the first through (L−1)-th of the L series-connected registers is transferred to the next-stage register, and the data of the previous clock cycle stored in the L-th register is input, via the multiplier-accumulator, into the first memory.
In a possible implementation, the fully connected layer output feature vector acquisition module further includes:
a zero-padding submodule, configured to pad the input of the multiplier-accumulator with zeros and continue to accumulate until valid data arrives, when the input feature vector of the fully connected layer or the fully connected weights are invalid data.
According to another aspect of the disclosure, a convolutional neural network acceleration device is provided, including:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the convolutional neural network acceleration method described above.
According to another aspect of the disclosure, a non-volatile computer-readable storage medium is provided, on which computer program instructions are stored; when the computer program instructions are executed by a processor, the convolutional neural network acceleration method described above is performed.
By providing multiple processing array groups in the convolutional layer and, when the convolution windows read the input feature maps, using different reference pixel vectors for different rows to perform the convolution calculation, the disclosure effectively reduces hardware resources and power consumption and improves the processing speed of convolutional neural networks.
Other features and aspects of the disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure together with the specification, and serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 2 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 3 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 4 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 5 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 6 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 7 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 8 shows a block diagram of a convolutional neural network hardware structure according to an embodiment of the disclosure;
Fig. 9 shows a flowchart of the ping-pong operation of single-layer convolution according to an embodiment of the disclosure;
Fig. 10 shows a schematic diagram of the on-chip buffering mechanism of the convolution windows according to an embodiment of the disclosure;
Fig. 11 shows a structural diagram of the convolution multiply-add unit according to an embodiment of the disclosure;
Fig. 12 shows a table of the structural parameters of each layer of the VGG16 network;
Fig. 13 shows a block diagram of a convolutional neural network acceleration device according to an embodiment of the disclosure;
Fig. 14 shows a block diagram of a convolutional neural network acceleration device according to an embodiment of the disclosure;
Fig. 15 is a block diagram of a device for convolutional neural network acceleration according to an exemplary embodiment.
Detailed description of the embodiments
Various exemplary embodiments, features, and aspects of the disclosure are described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" herein means "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" should not be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are given in the following detailed description in order to better illustrate the disclosure. Those skilled in the art will appreciate that the disclosure can equally be implemented without certain specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, in order to highlight the gist of the disclosure.
Convolutional neural network acceleration is broadly divided into algorithmic acceleration and hardware acceleration. Algorithmic acceleration is generally used in the classification stage of a network, involves only forward propagation, and cannot effectively accelerate the training process of the network. Realizing convolutional neural networks through hardware acceleration is an effective approach; hardware acceleration schemes include convolutional neural network acceleration architectures based on FPGAs (Field-Programmable Gate Arrays) and ASICs (Application-Specific Integrated Circuits). An FPGA is a programmable logic gate array with outstanding parallel computing capability; a specially designed FPGA features low power consumption, high speed, and reconfigurability. In addition, an FPGA breaks the sequential execution pattern and can complete more processing tasks in each clock cycle, and an FPGA can focus deterministically on a given task without an operating system, reducing the possibility of errors.
Fig. 1 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 1, the method includes the following steps:
Step S10: read the input feature maps of a convolutional layer.
Step S20: input the input feature maps into the processing array groups of the convolutional layer, and perform the convolution multiply-add operations with propagate-partial multiplier-accumulators based on the first reference pixel vector, the second reference pixel vector, the convolution kernel weights, and the data of the completed input channels, to obtain the output results of the processing array groups. Each processing array group includes multiple processing arrays, and each processing array includes multiple row processing arrays; for example, a processing array may include three row processing arrays. The number of rows of the convolution window is M; the first reference pixel vector takes the first M−N rows of the convolution window, the second reference pixel vector takes the last N rows, M and N are positive integers, and M>N.
In a possible implementation, an input channel refers to one of the different input features extracted from an input image according to a specific extracted feature, and an output channel refers to the result output by each processing array in a processing array group. The convolution calculation uses the PP-MAC structure (Propagate Partial Multiplier-Accumulator, a partial-sum-propagating multiply-add architecture). To improve the processing rate, this embodiment uses multiple processing array groups — for example, every 32 processing arrays form one processing array group — to realize parallel processing of different output channels.
The calculation results of the processing array groups are stored in the convolutional layer output buffer, waiting to be accumulated with the calculation results of the other input channels. In the initial stage of the convolution operation, the given biases are first stored in the convolutional layer output buffer as the initial values of the accumulators. The partial sums of different convolution positions of the same output channel are stored in different output buffers, which facilitates reading the partial sums and results during subsequent operations.
When the convolution windows read the input feature map, the first reference pixel vector r0 and the second reference pixel vector r1 are configured units of feature map data. For example, if the convolution window has 16 rows in total, r0 reads the first 14 rows and r1 reads the last two rows; r1 is only valid when the convolution position changes in the column direction, and all other positions are calculated using r0. The feature map is input into the processing arrays in broadcast form for the convolution calculation. One processing array is composed of three one-dimensional row processing arrays, and each row processing array is composed of three multiplication units and an addition unit with three operands. In the three row processing arrays, 1×3 input feature map data is multiplied respectively with the three 1×3 rows of weights in a 3×3 convolution kernel, performing the convolution of three adjacent pixel positions. The processing delays of the three row processing arrays are different.
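What the three row processing arrays compute can be sketched functionally as follows (illustrative Python/NumPy, not the hardware itself: the broadcast of each 1×3 input slice to the three kernel rows is modeled, while the differing hardware delays are collapsed into ordinary accumulation):

```python
import numpy as np

def conv3x3_row_broadcast(feat, kernel):
    """Each 1x3 slice of an input row is broadcast to three row arrays; row
    array r multiplies it with kernel row r, producing a partial sum for one
    of three vertically adjacent output positions."""
    h, w = feat.shape
    out = np.zeros((h - 2, w - 2))
    for y in range(h):                       # input rows arrive one at a time
        for x in range(w - 2):
            slice_1x3 = feat[y, x:x+3]
            for r in range(3):               # row array r holds kernel row r
                oy = y - r                   # output row this partial sum feeds
                if 0 <= oy < h - 2:
                    out[oy, x] += np.dot(slice_1x3, kernel[r])
    return out

# Small demonstration on a 4x4 feature map with an all-ones 3x3 kernel.
out = conv3x3_row_broadcast(np.arange(16, dtype=float).reshape(4, 4), np.ones((3, 3)))
```

Summed over the three row arrays, this reproduces a standard 3×3 valid convolution, since out[oy, x] collects one 1×3 dot product from each kernel row.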
Step S30: obtain the output feature maps of the convolutional layer according to the output results of the processing array groups.
In a possible implementation, the calculation results of the convolutional layer need to be activated with the ReLU (Rectified Linear Unit) function. The output results of the processing array groups yield the output feature map of each convolutional layer, which is written into the convolutional layer output buffer; the process then returns to step S20 and repeats.
Step S40: write the output feature maps of the last convolutional layer into the input buffer of the fully connected layer.
In a possible implementation, the calculation results of the last convolutional layer are written directly into the fully connected layer input buffer, without being written to off-chip memory.
Step S50: the fully connected layer performs multiply-add operations according to the output feature maps of the last convolutional layer, to obtain the output feature vector of the fully connected layer.
In a possible implementation, the output feature maps of the last convolutional layer serve as the input of the fully connected layer; after the multiply-add operations in the fully connected layer are performed, the output feature vector of the fully connected layer is obtained.
Step S60: output the output feature vector of the last fully connected layer into the fourth partition of the first memory.
In a possible implementation, after the first memory is partitioned, the output feature vector of the fully connected layer is stored in one of its partitions, and the other partitions of the first memory are used to store the input image, the parameters of each layer, and so on. Partitioned storage facilitates data extraction and improves the overall computational efficiency of the system.
In this embodiment, the processing array groups introduced for the convolution calculation break the sequential execution pattern and can complete more processing tasks in each clock cycle, improving the calculation rate of the convolutional neural network.
Fig. 2 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 2, the difference from the above embodiment is that step S10 of the method includes:
Step S11: when the input feature map of the current input channel read by one of the first convolution window and the second convolution window is being used for the convolution multiply-add operations, the other convolution window reads the input feature map of the next input channel.
Step S12: when the first convolution window or the second convolution window needs to be filled, read input feature map data from the top-row buffer and the right buffer, respectively.
Step S13: when the convolution weights of the current group of output channels read by one of the first weight register and the second weight register are being used for the convolution multiply-add operations, the other register reads the convolution weights of the next group of output channels.
In a possible implementation, due to the limitation of FPGA on-chip resources, the on-chip memory banks cannot all be used for storing input feature maps. The input feature map is convolved according to the configured convolution unit; after the calculation is completed, another block of pixels is read in and calculated, until the entire input feature map has been read. The weights are read in the same pattern. The input buffer of the convolutional layer includes two convolution windows, two weight registers, a top-row buffer, and a right buffer. The two convolution windows and the two weight registers are used to realize ping-pong operation, carrying out convolution calculation and data reading at the same time.
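The ping-pong (double-buffering) scheme can be sketched as follows (illustrative Python; `channel_loader` and `compute` are hypothetical stand-ins for the memory read and the convolution multiply-add, which in hardware run concurrently rather than sequentially):

```python
def ping_pong(channel_loader, compute, n_channels):
    """Double-buffered pipeline: while one buffer's contents are used for the
    convolution multiply-add, the other buffer is filled with the next input
    channel; the two buffers then swap roles."""
    buffers = [channel_loader(0), None]   # buffer 0 pre-filled with channel 0
    active = 0
    results = []
    for ch in range(n_channels):
        if ch + 1 < n_channels:
            buffers[1 - active] = channel_loader(ch + 1)  # prefetch next channel
        results.append(compute(buffers[active]))           # compute on active buffer
        active = 1 - active                                # swap roles
    return results

# Toy demonstration: "loading" channel ch yields ch * 10, "computing" adds 1.
res = ping_pong(lambda ch: ch * 10, lambda b: b + 1, 3)
```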
Fig. 3 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 3, the difference from the above embodiment is that the method further includes:
Step S70: write the input image into the first partition of the first memory as the input feature map of the first convolutional layer.
In a possible implementation, the input image can also be written into the first partition of the second memory. Storing the input image and the input feature vector of the fully connected layer in different partitions of the first memory facilitates the extraction of different data and improves the overall efficiency of the system.
Fig. 4 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 4, the difference from the above embodiment is that the method further includes:
Step S80: write the parameters of the odd-numbered convolutional and fully connected layers into the second partition of the first memory, and write the parameters of the even-numbered convolutional and fully connected layers into the second partition of the second memory; the parameters include weights and biases.
In a possible implementation, storing the parameters of the convolutional and fully connected layers in partitions separate from the input image and the input feature vector of the fully connected layer facilitates the extraction of different data and improves the overall efficiency of the system. Storing the parameters of odd- and even-numbered layers separately allows each odd- or even-numbered convolutional or fully connected layer to quickly obtain the corresponding parameters for calculation.
Fig. 5 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 5, the difference from the above embodiment is that step S30 includes:
Step S31: when the convolutional layer is not the last layer, store the output feature maps of even-numbered convolutional layers in the third partition of the first memory, and store the output feature maps of odd-numbered convolutional layers in the third partition of the second memory.
In a possible implementation, the output feature maps of odd-numbered convolutional layers are stored in the third partition of the second memory, and the output feature maps of even-numbered layers are stored in the third partition of the first memory. During the convolution calculation of any one layer, one memory only performs read operations while the other memory only performs write operations.
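The alternating read/write roles of the two memories can be sketched as follows (illustrative Python; the dictionary keys and function names are assumptions, standing in for the third partitions of the two memories):

```python
def run_layers(layers, mem_a, mem_b):
    """Alternate two memories so that within any single layer one memory is
    only read and the other only written. Layer 1 reads mem_a and writes
    mem_b; the roles swap for each subsequent layer."""
    src, dst = mem_a, mem_b
    data = src["feat"]                # input feature map of the first layer
    for layer in layers:
        data = layer(data)
        dst["feat"] = data            # write result to the other memory
        src, dst = dst, src           # swap read/write roles for the next layer
    return data

# Toy demonstration with two "layers": add one, then double.
mem_a, mem_b = {"feat": 1}, {}
final = run_layers([lambda x: x + 1, lambda x: x * 2], mem_a, mem_b)
```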
Fig. 6 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 6, the difference from the above embodiment is that step S50 includes:
Step S51: obtain the input feature vector of the fully connected layer from the output feature maps of the last convolutional layer.
Step S52: input the feature vector of the fully connected layer together with its weights into the fully connected processing unit for processing, obtaining the output feature vector of the fully connected layer; the fully connected processing unit includes a multiply-accumulate (MAC) unit and L serially connected registers, where L is an integer greater than 2.
Step S53: in each clock cycle, the first through (L−1)-th of the L serially connected registers transfer the data stored in the previous clock cycle to the next-stage register, while the data of the previous clock cycle stored in the L-th register is fed through the MAC unit into the first memory.
In one possible implementation, if the multiply-accumulate process of the fully connected layer used only a single register, the result would easily become inaccurate. In this embodiment the partial results are distributed over multiple registers and accumulated progressively as the data is passed back along the chain, avoiding repeated accumulation into a single register and reducing the computation error.
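The effect can be illustrated numerically. The sketch below is an illustration, not the hardware design: it accumulates half-precision values round-robin into a configurable number of registers. With a single register the running sum stalls once new terms fall below the floating-point spacing; distributing the sum over several registers keeps the result closer to the true value.

```python
import numpy as np

def accumulate_fp16(values, n_regs):
    """Accumulate half-precision values round-robin into n_regs registers,
    then combine the partial sums at higher precision."""
    regs = [np.float16(0.0)] * n_regs
    for i, v in enumerate(values):
        r = i % n_regs                       # rotate through the registers
        regs[r] = np.float16(regs[r] + np.float16(v))
    return float(np.sum(np.array(regs, dtype=np.float32)))

# 25088 additions of 1.0: a lone fp16 register stalls at 2048, because
# 2048 + 1 rounds back to 2048 in half precision; four registers each
# carry a quarter of the terms and together land closer to 25088.
one_reg = accumulate_fp16([1.0] * 25088, 1)
four_regs = accumulate_fp16([1.0] * 25088, 4)
```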
Fig. 7 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 7, the difference from the above embodiment is that step S50 further includes:
Step S54: when the input feature vector of the fully connected layer or the fully connected weights are invalid data, feed zeros into the MAC unit and keep accumulating until valid data arrives.
In one possible implementation, if in some cycle the input feature map or weight data is invalid, zeros are fed in and accumulated until valid data arrives.
Implementation Example 1:
The convolutional neural network model provided by the disclosure uses the VGG16 network model, comprising 13 convolutional layers and 3 fully connected layers (each of the first two fully connected layers is split into two layers). Fig. 8 shows a block diagram of the convolutional neural network apparatus according to an embodiment of the disclosure, comprising a convolutional-layer accelerator, a convolutional-layer input buffer, convolutional-layer output buffers, a fully-connected-layer accelerator and a fully-connected-layer input buffer.
The acceleration method of the VGG16 convolutional neural network accelerator proposed in the disclosure comprises the following steps:
S1: write the parameters of the convolutional and fully connected layers (including biases and weights) into the DDR3 memories (#0 and #1) through a PCIe (Peripheral Component Interconnect Express) interface; DDR3 is double-data-rate synchronous dynamic RAM.
The parameters of the convolutional and fully connected layers are stored, in layer order, into bank0 of the DDR3 memories over a PCIe 3.0 ×8 bus. The parameters of odd-numbered layers are stored in the first DDR3 (#0) and those of even-numbered layers in the second DDR3 (#1). Within each layer, the biases are arranged before the weights. The weights are arranged in units of 3×3 convolution kernels, and the kernel weights of the same input channel are placed in contiguous memory, so they can be read continuously without address computation.
S2: write the input image into DDR3 (#0) through the PCIe interface.
The input image data is written into bank1 of the DDR3 over the PCIe 3.0 ×8 bus. Its organization in DDR3 is determined by the convolution method and the DDR3 read pattern. The convolutional-layer feature maps must be partitioned into blocks; since the side length of every convolutional-layer feature map is a multiple of 7, feature map data is organized at a minimum granularity of 7×7 pixels. Taking the pooling operation into account, the smallest unit of convolutional computation is 14×14 pixels. The feature maps are stored in DDR3 as 7×7 pixel blocks. Because one DDR3 burst transfers 512 bits of data while a 7×7 pixel block is 784 bits long, the data length of each block is extended to 1024 bits and aligned to the burst start address, reducing wasted data during reads.
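The byte arithmetic behind this padding can be sketched as follows (a back-of-the-envelope illustration; the constant names are ours):

```python
BURST_BITS = 512        # one DDR3 burst transfers 512 bits
BLOCK_PIXELS = 7 * 7    # minimum feature map granularity
PIXEL_BITS = 16         # half-precision pixels

def padded_block_bits() -> int:
    """Round the 784-bit 7x7 block up to a whole number of bursts (1024 bits)."""
    raw = BLOCK_PIXELS * PIXEL_BITS        # 784 bits of real data
    bursts = -(-raw // BURST_BITS)         # ceiling division -> 2 bursts
    return bursts * BURST_BITS

def block_byte_offset(block_index: int) -> int:
    """Burst-aligned starting byte address of the n-th 7x7 block."""
    return block_index * padded_block_bits() // 8
```

Two bursts then fetch a whole block, with no mid-block address computation.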
S3: read the required data from DDR3 into the convolutional-layer input buffer.
Owing to limited FPGA on-chip resources, the on-chip memory cannot hold an entire input feature map. Convolution is therefore computed on the input feature map in 14×14 units: after one unit is finished, the next 14×14 pixel block is read in, until the whole input feature map has been consumed. The weights are read in the same pattern. The convolutional-layer input buffer comprises two 16×16-pixel convolution windows, two 32×3×3 weight registers, a 1024×16-pixel top-row cache and a 512×15×8-pixel right cache. The two convolution windows and weight registers implement ping-pong operation, so that convolution and data reading proceed simultaneously. Fig. 9 shows the ping-pong operation flow of single-layer convolution according to an embodiment of the disclosure.
Figure 10 is a schematic diagram of the on-chip caching mechanism of the convolution window according to an embodiment of the disclosure. As shown in Figure 10, the thick-framed block represents the 16×16-pixel convolution window. When the convolution window needs to be filled, its first row (1×16) is read from the top-row cache and the 15×8 pixel block of the left shaded area is read from the right cache; the pixel blocks B(i,j), B(i,j+1), B(i+1,j), B(i+1,j+1) and the first rows of B(i+2,j) and B(i+2,j+1) must be read from the off-chip DDR3. After the convolution window has been filled, the 1×16 pixel block of the first region (the vertical-line region) replaces the grid region in the top-row cache, and the 15×8 pixel block of the second region (the dotted region) replaces the third region (the shaded region) in the right cache.
S4: perform the convolution operation, repeating S3 until the current convolutional layer is finished.
The convolution uses the PP-MAC (Propagate Partial Multiplier-Accumulator) structure. Fig. 11 shows the structure of the convolution MAC unit according to an embodiment of the disclosure. To raise the processing speed, 32 processing arrays form a processing array group that processes different output channels in parallel.
Let I_i(R, C) denote the input pixel of the i-th feature map at position (R, C), and let r denote the current round number. The task completed by the j-th processing array is:

O_{32r+j}(R, C) = Σ_i I_i(R, C) ⊛ k_{i,32r+j}

where O_{32r+j}(R, C) is the pixel of the (32r+j)-th output feature map at position (R, C), ⊛ denotes convolution, and k_{i,32r+j} denotes the weights.
The reference pixel vectors r0 and r1 are 1×3 feature map data, p0 denotes the partial-sum data read from the convolutional-layer output buffer, and kij (i, j ∈ {0, 1, 2}) denotes the weights of the convolution kernel. r0 is taken from the first 14 rows of the convolution window and r1 from the last two rows; r1 is valid only when the convolution position moves in the column direction. The feature map is broadcast into the processing arrays for convolution. One processing array consists of three one-dimensional row arrays, each composed of three multiplier units and one 3-operand adder unit. In the three row arrays, the 1×3 input feature map data is multiplied by the three 1×3 weight rows of the 3×3 convolution kernel, performing the convolution of three adjacent pixel positions. Because the three row arrays have different processing delays, scratch registers are used to keep the partial sums synchronized.
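Functionally, the work of one processing array at one output position can be pictured as three 1×3 row products combined by the 3-operand adders. The sketch below is a behavioral illustration only; it ignores the partial-sum propagation and the differing row delays:

```python
def conv3x3_pixel(feature, kernel, R, C):
    """One output pixel at (R, C): each of the three row arrays multiplies a
    1x3 slice of the feature map by one 1x3 row of the 3x3 kernel, and the
    three row sums are added together (no padding; caller keeps indices valid)."""
    total = 0.0
    for dr in range(3):                        # the three one-dimensional row arrays
        pixels = feature[R + dr][C:C + 3]      # broadcast 1x3 input feature vector
        row = kernel[dr]
        total += sum(p * w for p, w in zip(pixels, row))
    return total
```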
Figure 12 tabulates the structural parameters of each layer of the VGG16 network. As the table in Figure 12 shows, the output results of the conv1-2, conv2-2, conv3-3, conv4-3 and conv5-3 layers additionally require pooling. The pooling window is 2×2 with stride 2, implemented as max down-sampling. The pooling operation is embedded into the convolution: within a 14×14 convolution unit, after the convolutions of two vertically adjacent pixel positions are finished, only the larger result is kept; when the convolutions of the other two pixel positions of the 2×2 block are performed, the results are compared with the retained one and only the maximum convolution result is kept, yielding a 7×7 pooled output matrix.
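The fused max-pooling step can be sketched on precomputed convolution results (an illustration of the data flow, not of the hardware):

```python
def fuse_maxpool_2x2(conv_out):
    """Collapse a 14x14 block of convolution results into a 7x7 pooled output.

    Mirrors the embedded pooling: the larger of each vertical pair is kept
    first, then compared with the retained result of the other column of the
    2x2 block, so only the maximum of each 2x2 window survives.
    """
    pooled = [[0.0] * 7 for _ in range(7)]
    for pr in range(7):
        for pc in range(7):
            left = max(conv_out[2*pr][2*pc], conv_out[2*pr + 1][2*pc])
            right = max(conv_out[2*pr][2*pc + 1], conv_out[2*pr + 1][2*pc + 1])
            pooled[pr][pc] = max(left, right)
    return pooled
```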
S5: if this is not the last convolutional layer, write the convolution results to DDR3 and repeat S3 and S4.
The convolutional-layer results are activated with the ReLU function, f(x) = max(0, x). The output feature maps of odd-numbered convolutional layers are stored in DDR3 (#1) and those of even-numbered layers in DDR3 (#0). During the convolution of any single layer, one DDR3 is only read while the other is only written.
S6: write the final convolutional-layer results into the fully-connected-layer input buffer.
The results of the last convolutional layer are written directly into the fully-connected-layer input buffer, with no need to write them to the off-chip DDR3. The last convolutional layer has 512 output channels; matching the number of processing arrays in a group, the data of every 32 output channels is gathered into one sector, for 16 sectors in total. Within each sector, every 32 consecutively stored pixels come from the same position of the 32 output channels.
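The resulting buffer layout can be expressed as a simple index computation. This is a sketch under our own assumptions (position-major ordering inside each sector, a 7×7 output map per channel); the names are illustrative:

```python
CHANNELS_PER_SECTOR = 32     # matches the 32 processing arrays of a group
PIXELS_PER_CHANNEL = 7 * 7   # output map of the last convolutional layer

def fc_buffer_index(channel: int, position: int) -> int:
    """Linear index in the fully-connected input buffer for the pixel of
    `channel` (0..511) at flattened spatial `position` (0..48)."""
    sector = channel // CHANNELS_PER_SECTOR          # which group of 32 channels
    lane = channel % CHANNELS_PER_SECTOR             # channel within the sector
    sector_size = CHANNELS_PER_SECTOR * PIXELS_PER_CHANNEL
    return sector * sector_size + position * CHANNELS_PER_SECTOR + lane
```

Indices 0..31 then hold position 0 of channels 0..31, i.e. 32 consecutive pixels from the same position, as described.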
S7: perform the fully connected operation, concurrently with S3–S6.
The multiply-accumulate process of the fully connected layer involves up to 25088 accumulations. Using only a single register would easily cause the "large numbers swallowing small numbers" problem. In the disclosure the partial results are distributed over 4 registers, avoiding repeated accumulation into one register and reducing the computation error.
The fully connected MAC unit needs 4 clock cycles from data input to result output; the 4 pipeline stages are simplified into 4 registers, and 1 additional outer register is added, forming a 5-stage pipeline. During the computation, the MAC unit never stops receiving the next cycle's data, whether or not it has produced a result. Each register stores the partial multiply-accumulate result of the data and weights of a different position. If in some cycle the input feature map or weight data is invalid, zeros are fed in and accumulated until valid data arrives.
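The never-stalling behavior with zero insertion can be sketched as a cycle-by-cycle simulation (illustrative only; the real unit is a 5-stage hardware pipeline):

```python
def mac_pipeline_sum(products, valid, n_regs=4):
    """Each cycle the MAC accepts a product unconditionally; when the input
    feature or weight is flagged invalid, zero is accumulated instead, so
    the pipeline never stops. Partial sums rotate over n_regs registers."""
    regs = [0.0] * n_regs
    for cycle, (p, ok) in enumerate(zip(products, valid)):
        regs[cycle % n_regs] += p if ok else 0.0   # zero-fill on invalid cycles
    return sum(regs)
```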
S8: output the fully connected results to DDR3.
The fully connected layer finally produces 1000 half-precision floating-point results, which are written into bank4 of the DDR3.
S9: if this is not the last image, repeat S7 and S8.
S10: read the computation results of the entire convolutional neural network from DDR3.
Figure 13 shows a block diagram of a convolutional neural network acceleration apparatus according to an embodiment of the disclosure. As shown in Figure 13, the apparatus includes:
an input feature map reading module 41, configured to read the input feature maps of a convolutional layer;
a processing array group module 42, configured to input the input feature maps into the processing array group of the convolutional layer and, according to the first reference pixel vector, the second reference pixel vector, the convolution kernel weights and the data of completed input channels, perform the convolution multiply-accumulate operations with the partial-sum-propagating MAC units to obtain the output results of the processing array group, wherein the processing array group includes multiple processing arrays, each processing array includes three row arrays, the convolution window has M rows, the first reference pixel vector is taken from the first M−N rows of the convolution window and the second reference pixel vector from the last N rows, M and N being positive integers with M > N;
a convolutional-layer output feature map determining module 43, configured to obtain the output feature maps of the convolutional layer from the output results of the processing array group;
a fully-connected-layer input buffering module 44, configured to write the output feature maps of the last convolutional layer into the fully-connected-layer input buffer;
a fully-connected-layer output feature vector obtaining module 45, configured to have the fully connected layer perform multiply-accumulate operations according to the output feature maps of the last convolutional layer to obtain the output feature vector of the fully connected layer;
a fully-connected-layer output feature vector output module 46, configured to output the output feature vector of the last fully connected layer into the fourth partition of the first memory.
Figure 14 shows a block diagram of a convolutional neural network acceleration apparatus according to an embodiment of the disclosure. As shown in Figure 14, the difference from the above implementation lies in the following.
In one possible implementation, the input feature map reading module 41 includes:
a convolution window reading submodule 411, configured so that while the input feature map of the current input channel read by one of the first and second convolution windows is used for the convolution multiply-accumulate operations, the other convolution window reads the input feature map of the next input channel;
a convolution window filling submodule 412, configured to read input feature map data from the top-row cache and the right cache when the first or second convolution window needs to be filled;
a weight register reading submodule 413, configured so that while the convolution weights of the current group of output channels read by one of the first and second weight registers are used for the convolution multiply-accumulate operations, the other register reads the convolution weights of the next group of output channels.
In one possible implementation, the apparatus further includes:
an input image writing module 47, configured to write the input image into the first partition of the first memory as the input feature map of the first convolutional layer.
In one possible implementation, the apparatus further includes:
a parameter writing module 48, configured to write the parameters of the odd-numbered convolutional and fully connected layers into the second partition of the first memory, and the parameters of the even-numbered convolutional and fully connected layers into the second partition of the second memory; the parameters include weights and biases.
In one possible implementation, the convolutional-layer output feature map determining module 43 includes:
an output submodule 431, configured to, when the convolutional layer is not the last layer, store the output feature maps of even-numbered convolutional layers in the third partition of the first memory and the output feature maps of odd-numbered convolutional layers in the third partition of the second memory.
In one possible implementation, the fully-connected-layer output feature vector obtaining module 45 includes:
an input feature vector obtaining submodule 451, configured to obtain the input feature vector of the fully connected layer from the output feature maps of the last convolutional layer;
a fully connected processing submodule 452, configured to input the input feature vector of the fully connected layer and the weights of the fully connected layer into the fully connected processing unit for processing to obtain the output feature vector of the fully connected layer, the fully connected processing unit including a MAC unit and L serially connected registers, L being an integer greater than 2; in each clock cycle, the first through (L−1)-th of the L serially connected registers transfer the data stored in the previous clock cycle to the next-stage register, while the data of the previous clock cycle stored in the L-th register is fed through the MAC unit into the first memory.
In one possible implementation, the fully-connected-layer output feature vector obtaining module 45 further includes:
a zero-padding submodule 453, configured to, when the input feature vector of the fully connected layer or the fully connected weights are invalid data, feed zeros into the MAC unit and keep accumulating until valid data arrives.
Implementation Example 2:
Fig. 8 shows a block diagram of the convolutional neural network hardware structure according to an embodiment of the disclosure. As shown in Fig. 8, the hardware structure of the FPGA-based VGG16 convolutional neural network accelerator proposed in the disclosure includes:
(1) a convolutional-layer input buffer group, including two convolution windows, two weight registers, a top-row cache and a right cache;
(2) 32 convolutional-layer processing arrays, which use the PP-MAC (Propagate Partial Multiplier-Accumulator) structure and complete the convolution multiply-accumulate operations according to control instructions;
(3) 32 convolutional-layer output buffers, each 3136 pixels in size;
(4) a fully-connected-layer input buffer, 512×7×7 pixels in size;
(5) a fully-connected-layer processing array, which completes the fully connected multiply-accumulate operations according to control instructions;
(6) two 4 GB DDR3 memories.
The input data, the parameters and the multiply-accumulate process of the convolution all use half-precision floating-point numbers.
The processing arrays of the convolutional and fully connected layers use different computation structures and can simultaneously execute the convolutional-layer and fully-connected-layer computations of different images.
The convolutional-layer feature maps are partitioned into blocks; since the side length of every convolutional-layer feature map is a multiple of 7, feature map data is organized at a minimum granularity of 7×7 pixels. The feature maps are stored in DDR3 as 7×7 pixel blocks, aligned to the DDR3 burst (512 bits). Taking the pooling operation into account, the basic unit of convolutional computation is 14×14 pixels, and the convolution kernel size is 3×3 pixels.
The weight registers are 32×3×3 pixels in size; to fit the convolution kernel, the convolution window must be extended to 16×16 pixels; the top-row cache is 1024×16 pixels and the right cache 512×15×8 pixels.
The two convolution windows and weight registers implement ping-pong operation: while one set of input feature map or weight parameter data participates in the computation, the data of the other set is being read.
The convolutional-layer processing arrays use the PP-MAC (Propagate Partial Multiplier-Accumulator) structure; 32 processing arrays form a processing array group that processes different output channels in parallel.
Let I_i(R, C) denote the input pixel of the i-th feature map at position (R, C), and let r denote the current round number. The task completed by the j-th processing array is:

O_{32r+j}(R, C) = Σ_i I_i(R, C) ⊛ k_{i,32r+j}

where O_{32r+j}(R, C) is the pixel of the (32r+j)-th output feature map at position (R, C), ⊛ denotes convolution, and k_{i,32r+j} denotes the weights.
The reference pixel vectors r0 and r1 are 1×3 feature map data; r0 is taken from the first 14 rows of the convolution window and r1 only from the last two rows, r1 being valid only when the convolution position switches in the column direction. The feature map is broadcast into the processing arrays for convolution. One processing array consists of 3 one-dimensional row arrays, each composed of 3 multiplier units and one 3-operand adder unit. In the three row arrays, the 1×3 input feature map data is multiplied by the three 1×3 weight rows of the 3×3 convolution kernel, performing the convolution of three adjacent pixel positions. Because the three row arrays have different processing delays, scratch registers are used to keep the partial sums synchronized.
For the convolutional layers that require pooling, the pooling operation is embedded into the convolution: within a 14×14 convolution unit, after the convolutions of two vertically adjacent pixel positions are finished, only the larger result is kept; when the convolutions of the other two pixel positions of the 2×2 block are performed, the results are compared with the previously retained one and only the maximum convolution result is kept, yielding a 7×7 pooled output matrix.
The results of a processing array group are stored in the convolutional-layer output buffer, waiting to be accumulated with the results of the other input channels; at the start of a convolutional layer, the bias is first stored in the output buffer as the initial value of the partial sum. The partial sums of different convolution positions of the same output channel are stored in different output buffers, which facilitates reading the partial-sum results in subsequent operations.
In the fully-connected-layer input buffer, the feature maps of every 32 output channels are stored together, forming one sector. Within each sector, every 32 consecutively stored pixels come from the same position of the 32 output channels.
The multiply-accumulate results of the fully connected layer are distributed over 4 registers, avoiding repeated accumulation into one register. The fully connected MAC unit needs 4 clock cycles from data input to result output; the 4 pipeline stages are simplified into 4 registers, and 1 additional outer register is added, forming a 5-stage pipeline. During the computation, the MAC unit never stops receiving the next cycle's data, whether or not it has produced a result; each register stores the partial multiply-accumulate result of the data and weights of a different position. If in some cycle the input feature map or weight data is invalid, zeros are fed in and accumulated until valid data arrives.
The input feature maps, biases and weight parameters of odd-numbered layers are stored in the first DDR3 (#0); the input feature maps of even-numbered layers (i.e., the output feature maps of odd-numbered layers), together with the biases and weight parameters, are stored in the second DDR3 (#1). During the convolution of any given layer, one DDR3 is only read while the other is only written; the input feature maps and biases are stored in DDR3 banks different from those of the weight parameters.
Exploiting the configurability and the low-power, high-performance computing characteristics of FPGAs, the disclosure realizes a VGG16 convolutional neural network hardware acceleration structure with the following advantages:
(1) Half-precision (16-bit) floating-point computation replaces conventional single-precision (32-bit) computation, effectively reducing hardware resources and power consumption while preserving the accuracy of the VGG16 network algorithm.
(2) With the ping-pong operation of the double convolution windows and weight registers, convolution and data reading proceed simultaneously, reducing accesses to the off-chip memory and improving the processing speed.
(3) With the ping-pong operation of the double DDR3 memories, during a convolution operation one DDR3 is only read and the other only written, reducing read/write turnarounds and effectively lowering the power consumption of the hardware structure. Within one DDR3, the feature maps and parameters are stored in different banks, avoiding the high latency of row switches when reading weight parameters and feature map data.
(4) According to the computational characteristics of the convolutional and fully connected layers, different processing structures are designed separately, so that the convolutional-layer and fully-connected-layer computations of different images can be executed simultaneously, and the MAC units of the convolutional and fully connected layers are pipeline-optimized, improving the processing speed of the accelerator.
Figure 15 is a block diagram of an apparatus 1900 for convolutional neural network acceleration according to an exemplary embodiment. For example, the apparatus 1900 may be provided as a server. Referring to Figure 15, the apparatus 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. An application program stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. The processing component 1922 is configured to execute the instructions to perform the above method.
The apparatus 1900 may also include a power supply component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the apparatus 1900 to complete the above method.
The disclosure may be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement aspects of the disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction-executing device. The computer-readable storage medium may be, for example, but is not limited to, an electric storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission media (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
Computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network, for example the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical-fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
Computer program instructions for carrying out operations of the disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In scenarios involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA) or a programmable logic array (PLA), is personalized by utilizing state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions, thereby implementing aspects of the disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices, causing a series of operational steps to be performed on the computer, other programmable apparatus, or other devices so as to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other devices implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by dedicated hardware-based systems that perform the specified functions or actions, or by combinations of dedicated hardware and computer instructions.
Embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (16)
1. A convolutional neural network acceleration method, characterized in that the method comprises:
reading an input feature map of a convolutional layer;
inputting the input feature map into a processing array group of the convolutional layer, and performing convolution multiply-accumulate operations with partial-sum-propagating multiply-accumulators according to a first reference pixel vector, a second reference pixel vector, convolution kernel weights, and data of completed input channels, to obtain an output result of the processing array group, wherein the processing array group comprises a plurality of processing arrays, each processing array comprises three processing rows, the number of rows of a convolution window is M, the first reference pixel vector is the first M-N rows of the convolution window, the second reference pixel vector is the last N rows of the convolution window, M and N are positive integers, and M>N;
obtaining an output feature map of the convolutional layer according to the output result of the processing array group;
writing the output feature map of the last convolutional layer into an input cache of a fully connected layer;
performing, in the fully connected layer, multiply-accumulate operations according to the output feature map of the last convolutional layer, to obtain an output feature vector of the fully connected layer;
outputting the output feature vector of the last fully connected layer into a fourth partition of a first memory.
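Stripped of the hardware specifics (the reference pixel vectors and the M/N row split of the convolution window), the multiply-accumulate flow of claim 1 reduces to carrying a partial sum across input channels for each output pixel. A minimal software sketch of that channel-wise accumulation, with all function and variable names chosen for illustration rather than taken from the patent:

```python
import numpy as np

def conv_layer_partial_sums(ifmaps, kernels):
    """Sliding-window convolution that builds each output pixel by
    propagating a partial sum across completed input channels, as in
    claim 1 (stride 1, no padding, no bias)."""
    C, H, W = ifmaps.shape          # input channels, height, width
    K, _, M, _ = kernels.shape      # output channels, C, M x M window
    oh, ow = H - M + 1, W - M + 1
    out = np.zeros((K, oh, ow))
    for k in range(K):              # one output channel at a time
        for y in range(oh):
            for x in range(ow):
                psum = 0.0          # partial sum carried between channels
                for c in range(C):  # "data of completed input channels"
                    window = ifmaps[c, y:y + M, x:x + M]
                    psum += float(np.sum(window * kernels[k, c]))
                out[k, y, x] = psum
    return out
```

In the patent's hardware this inner accumulation is done by the partial-sum-propagating multiply-accumulators of the processing array group; the sketch only mirrors the dataflow order, not the parallel array layout.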
2. The method according to claim 1, characterized in that reading the input feature map of the convolutional layer comprises:
when one of a first convolution window and a second convolution window has read the input feature map of a current input channel and is being used to perform convolution multiply-accumulate operations, reading, with the other convolution window, the input feature map of the input channel following the current input channel;
when the first convolution window or the second convolution window needs to be filled, reading the input feature map from a top-row cache and from a right cache, respectively;
when the convolution weights of a current group of output channels, read by one of a first weight register and a second weight register, are being used to perform convolution multiply-accumulate operations, reading, with the other register, the convolution weights of the group of output channels following the current group.
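Claim 2 applies the same ping-pong (double-buffering) pattern twice: to the pair of convolution windows and to the pair of weight registers — while one buffer is consumed by the multiply-accumulate stage, the other is refilled with the next channel or weight group. A sequential sketch of that pattern (names assumed for illustration; in hardware the refill and the compute run concurrently rather than back-to-back):

```python
def double_buffered(reader, compute):
    """Ping-pong scheme of claim 2: alternate two buffers so that the
    buffer being computed on is never the one being refilled."""
    buffers = [None, None]
    results = []
    buffers[0] = next(reader, None)   # prefill buffer 0
    active = 0
    while buffers[active] is not None:
        # In hardware these two steps overlap in time:
        buffers[1 - active] = next(reader, None)   # refill the idle buffer
        results.append(compute(buffers[active]))   # consume the active one
        active = 1 - active                        # swap roles
    return results
```

For example, `double_buffered(iter(channel_maps), mac_stage)` would process channel 0 while channel 1 is loaded, and so on; the same skeleton applies to the two weight registers.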
3. The method according to claim 1, characterized in that the method further comprises:
writing an input image into a first partition of the first memory as the input feature map of the first convolutional layer.
4. The method according to claim 1, characterized in that the method further comprises:
writing the parameters of the odd-numbered layers among the convolutional layers and fully connected layers into a second partition of the first memory;
writing the parameters of the even-numbered layers among the convolutional layers and fully connected layers into a second partition of a second memory;
wherein the parameters include weights and biases.
5. The method according to claim 1, characterized in that obtaining the output feature map of the convolutional layer according to the output result of the processing array group comprises:
when the convolutional layer is not the last layer, storing the output feature maps of even-numbered convolutional layers in a third partition of the first memory, and storing the output feature maps of odd-numbered convolutional layers in a third partition of the second memory.
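Claims 3 to 5 together imply a memory map across the two memories: partition 1 holds the input image, partition 2 the layer parameters, partition 3 the intermediate feature maps (ping-ponged by layer parity, so a layer always reads from one memory while writing the other), and partition 4 the final output. A sketch of the routing decision, with the memory and partition names invented for illustration:

```python
def feature_map_destination(layer_index, is_last_layer):
    """Where a layer's output feature map is written, following the
    partition scheme of claims 1, 3 and 5. Layer numbering here is
    illustrative; the claims only fix the parity alternation."""
    if is_last_layer:
        # The final output feature vector goes to the first memory's
        # fourth partition (claim 1).
        return ("memory1", "partition4")
    if layer_index % 2 == 0:
        # Even-numbered layer: third partition of the first memory.
        return ("memory1", "partition3")
    # Odd-numbered layer: third partition of the second memory.
    return ("memory2", "partition3")
```

The parity split means consecutive layers never contend for the same memory port: layer k reads its input from the memory layer k-1 wrote to, and writes its own output to the other one.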
6. The method according to claim 1, characterized in that performing, in the fully connected layer, multiply-accumulate operations according to the output feature map of the last convolutional layer to obtain the output feature vector of the fully connected layer comprises:
obtaining an input feature vector of the fully connected layer according to the output feature map of the last convolutional layer;
inputting the input feature vector of the fully connected layer and the weights of the fully connected layer into a fully connected processing unit for processing, to obtain the output feature vector of the fully connected layer, wherein the fully connected processing unit comprises multiply-accumulators and L registers connected in series, L being an integer greater than 2;
in each clock cycle, transferring the data of the previous clock cycle stored in the first through (L-1)-th of the L series-connected registers to the next-stage register, and inputting the data of the previous clock cycle stored in the L-th register into the first memory through a multiply-accumulator.
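The fully connected processing unit of claim 6 behaves like a shift-register chain feeding a multiply-accumulator: each clock cycle every register passes last cycle's value downstream, and the value leaving the L-th register is accumulated toward the first memory. A cycle-level software sketch, under the assumption (not fixed by the claim) that each stage carries one input-weight product:

```python
def fc_pipeline(inputs, weights, L=4):
    """Simulate the L series-connected registers of claim 6. Each cycle
    the chain shifts by one stage; the value exiting register L is
    accumulated. Flush cycles drain the chain at the end."""
    assert L > 2                     # per the claim, L is an integer > 2
    regs = [0.0] * L                 # register chain; regs[0] is stage 1
    acc = 0.0
    # Feed one (input, weight) pair per cycle, then L zero cycles to flush.
    stream = list(zip(inputs, weights)) + [(0.0, 1.0)] * L
    for x, w in stream:
        out = regs[-1]               # data stored in the L-th register
        acc += out                   # accumulate toward the first memory
        regs = [x * w] + regs[:-1]   # shift: stage i -> stage i+1
    return acc
```

After the flush, `acc` equals the dot product of the input feature vector with the weight vector, which is the fully connected layer's contribution for one output neuron.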
7. The method according to claim 6, characterized in that performing, in the fully connected layer, multiply-accumulate operations according to the output feature map of the last convolutional layer to obtain the output feature vector of the fully connected layer further comprises:
when the input feature vector of the fully connected layer or the fully connected weights are invalid data, padding the input of the multiply-accumulator with zeros and continuing to accumulate until valid data arrives.
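Claim 7's zero-padding rule keeps the accumulator valid across pipeline bubbles: an invalid operand contributes zero to the running sum instead of stalling the pipeline or corrupting the result. A sketch of the rule, modelling invalid data as `None` (the representation is an assumption; hardware would use a valid flag):

```python
def mac_with_invalid(pairs):
    """Multiply-accumulate over (input, weight) pairs where either
    operand may be invalid (None). Invalid cycles inject zero, as in
    claim 7, so accumulation continues until valid data arrives."""
    acc = 0.0
    for x, w in pairs:
        if x is None or w is None:
            acc += 0.0               # zero-pad the MAC input
        else:
            acc += x * w             # normal multiply-accumulate
    return acc
```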
8. A convolutional neural network acceleration device, characterized by comprising:
an input feature map reading module, configured to read an input feature map of a convolutional layer;
a processing array group module, configured to input the input feature map into a processing array group of the convolutional layer and to perform convolution multiply-accumulate operations with partial-sum-propagating multiply-accumulators according to a first reference pixel vector, a second reference pixel vector, convolution kernel weights, and data of completed input channels, to obtain an output result of the processing array group, wherein the processing array group comprises a plurality of processing arrays, each processing array comprises three processing rows, the number of rows of a convolution window is M, the first reference pixel vector is the first M-N rows of the convolution window, the second reference pixel vector is the last N rows of the convolution window, M and N are positive integers, and M>N;
a convolutional layer output feature map determining module, configured to obtain an output feature map of the convolutional layer according to the output result of the processing array group;
a fully connected layer input cache module, configured to write the output feature map of the last convolutional layer into an input cache of a fully connected layer;
a fully connected layer output feature vector obtaining module, configured to perform, in the fully connected layer, multiply-accumulate operations according to the output feature map of the last convolutional layer to obtain an output feature vector of the fully connected layer;
a fully connected layer output feature vector output module, configured to output the output feature vector of the last fully connected layer into a fourth partition of a first memory.
9. The device according to claim 8, characterized in that the input feature map reading module comprises:
a convolution window reading submodule, configured to, when one of a first convolution window and a second convolution window has read the input feature map of a current input channel and is being used to perform convolution multiply-accumulate operations, read, with the other convolution window, the input feature map of the input channel following the current input channel;
a convolution window filling submodule, configured to, when the first convolution window or the second convolution window needs to be filled, read the input feature map from a top-row cache and from a right cache, respectively;
a weight register reading submodule, configured to, when the convolution weights of a current group of output channels, read by one of a first weight register and a second weight register, are being used to perform convolution multiply-accumulate operations, read, with the other register, the convolution weights of the group of output channels following the current group.
10. The device according to claim 8, characterized by further comprising:
an input image writing module, configured to write an input image into a first partition of the first memory as the input feature map of the first convolutional layer.
11. The device according to claim 8, characterized by further comprising:
a parameter writing module, configured to write the parameters of the odd-numbered layers among the convolutional layers and fully connected layers into a second partition of the first memory, and to write the parameters of the even-numbered layers among the convolutional layers and fully connected layers into a second partition of a second memory, wherein the parameters include weights and biases.
12. The device according to claim 8, characterized in that the convolutional layer output feature map determining module comprises:
an output submodule, configured to, when the convolutional layer is not the last layer, store the output feature maps of even-numbered convolutional layers in a third partition of the first memory, and store the output feature maps of odd-numbered convolutional layers in a third partition of the second memory.
13. The device according to claim 8, characterized in that the fully connected layer output feature vector obtaining module comprises:
an input feature vector obtaining submodule, configured to obtain an input feature vector of the fully connected layer according to the output feature map of the last convolutional layer;
a fully connected processing submodule, configured to input the input feature vector of the fully connected layer and the weights of the fully connected layer into a fully connected processing unit for processing, to obtain the output feature vector of the fully connected layer, wherein the fully connected processing unit comprises multiply-accumulators and L registers connected in series, L being an integer greater than 2; in each clock cycle, the data of the previous clock cycle stored in the first through (L-1)-th of the L series-connected registers is transferred to the next-stage register, and the data of the previous clock cycle stored in the L-th register is input into the first memory through a multiply-accumulator.
14. The device according to claim 13, characterized in that the fully connected layer output feature vector obtaining module further comprises:
a zero-padding submodule, configured to, when the input feature vector of the fully connected layer or the fully connected weights are invalid data, pad the input of the multiply-accumulator with zeros and continue to accumulate until valid data arrives.
15. A convolutional neural network acceleration device, characterized by comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of any one of claims 1 to 7.
16. A non-volatile computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810028998.4A CN108133270B (en) | 2018-01-12 | 2018-01-12 | Convolutional neural network acceleration method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810028998.4A CN108133270B (en) | 2018-01-12 | 2018-01-12 | Convolutional neural network acceleration method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108133270A true CN108133270A (en) | 2018-06-08 |
CN108133270B CN108133270B (en) | 2020-08-04 |
Family
ID=62400444
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810028998.4A Active CN108133270B (en) | 2018-01-12 | 2018-01-12 | Convolutional neural network acceleration method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108133270B (en) |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109032781A (en) * | 2018-07-13 | 2018-12-18 | 重庆邮电大学 | A kind of FPGA parallel system of convolutional neural networks algorithm |
CN109063825A (en) * | 2018-08-01 | 2018-12-21 | 清华大学 | Convolutional neural networks accelerator |
CN109272113A (en) * | 2018-09-13 | 2019-01-25 | 深思考人工智能机器人科技(北京)有限公司 | A kind of convolutional neural networks establish device and method |
CN109409514A (en) * | 2018-11-02 | 2019-03-01 | 广州市百果园信息技术有限公司 | Fixed-point calculation method, apparatus, equipment and the storage medium of convolutional neural networks |
CN109597965A (en) * | 2018-11-19 | 2019-04-09 | 深圳力维智联技术有限公司 | Data processing method, system, terminal and medium based on deep neural network |
CN109740731A (en) * | 2018-12-15 | 2019-05-10 | 华南理工大学 | A kind of adaptive convolutional layer hardware accelerator design method |
CN109740732A (en) * | 2018-12-27 | 2019-05-10 | 深圳云天励飞技术有限公司 | Neural network processor, convolutional neural networks data multiplexing method and relevant device |
CN109919312A (en) * | 2019-03-29 | 2019-06-21 | 北京智芯微电子科技有限公司 | Operation method, device and the DPU of convolutional neural networks |
CN109993303A (en) * | 2019-03-29 | 2019-07-09 | 河南九乾电子科技有限公司 | Computer accelerator for neural network and deep learning |
CN110059797A (en) * | 2018-10-10 | 2019-07-26 | 北京中科寒武纪科技有限公司 | A kind of computing device and Related product |
CN110096993A (en) * | 2019-04-28 | 2019-08-06 | 深兰科技(上海)有限公司 | The object detection apparatus and method of binocular stereo vision |
CN110377781A (en) * | 2019-06-06 | 2019-10-25 | 福建讯网网络科技股份有限公司 | A kind of matched innovatory algorithm of application sole search |
CN110533177A (en) * | 2019-08-22 | 2019-12-03 | 安谋科技(中国)有限公司 | A kind of data read-write equipment, method, equipment, medium and convolution accelerator |
CN110766128A (en) * | 2018-07-26 | 2020-02-07 | 北京深鉴智能科技有限公司 | Convolution calculation unit, calculation method and neural network calculation platform |
CN110826707A (en) * | 2018-08-10 | 2020-02-21 | 北京百度网讯科技有限公司 | Acceleration method and hardware accelerator applied to convolutional neural network |
CN110888824A (en) * | 2018-09-07 | 2020-03-17 | 黑芝麻智能科技(上海)有限公司 | Multilevel memory hierarchy |
WO2020062284A1 (en) * | 2018-09-30 | 2020-04-02 | 深圳市大疆创新科技有限公司 | Convolutional neural network-based image processing method and device, and unmanned aerial vehicle |
CN110989920A (en) * | 2018-10-03 | 2020-04-10 | 马克西姆综合产品公司 | Energy efficient memory system and method |
CN111340201A (en) * | 2018-12-19 | 2020-06-26 | 北京地平线机器人技术研发有限公司 | Convolutional neural network accelerator and method for performing convolutional operation thereof |
CN111898733A (en) * | 2020-07-02 | 2020-11-06 | 西安交通大学 | Deep separable convolutional neural network accelerator architecture |
CN111931918A (en) * | 2020-09-24 | 2020-11-13 | 深圳佑驾创新科技有限公司 | Neural network accelerator |
CN112099737A (en) * | 2020-09-29 | 2020-12-18 | 北京百度网讯科技有限公司 | Method, device and equipment for storing data and storage medium |
CN112132274A (en) * | 2020-09-22 | 2020-12-25 | 地平线(上海)人工智能技术有限公司 | Full-connection convolution method and device for feature graph, readable storage medium and electronic equipment |
CN112734020A (en) * | 2020-12-28 | 2021-04-30 | 中国电子科技集团公司第十五研究所 | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network |
CN112862725A (en) * | 2021-03-12 | 2021-05-28 | 上海壁仞智能科技有限公司 | Method for computing, computing device and computer-readable storage medium |
CN113052291A (en) * | 2019-12-27 | 2021-06-29 | 上海商汤智能科技有限公司 | Data processing method and device |
CN113887720A (en) * | 2021-09-29 | 2022-01-04 | 杭州电子科技大学 | Up-sampling reverse blocking mapping method |
CN112513885B (en) * | 2018-06-22 | 2024-02-27 | 三星电子株式会社 | Neural processor |
KR102659202B1 (en) * | 2021-09-15 | 2024-04-22 | 한국항공대학교산학협력단 | System and method for recognizing CNN-based human behavior using Wi-Fi signals |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170102950A1 (en) * | 2003-05-23 | 2017-04-13 | Ip Reservoir, Llc | Intelligent Data Storage and Processing Using FPGA Devices |
CN106844294A (en) * | 2016-12-29 | 2017-06-13 | 华为机器有限公司 | Convolution algorithm chip and communication equipment |
CN106874219A (en) * | 2016-12-23 | 2017-06-20 | 深圳云天励飞技术有限公司 | A kind of data dispatching method of convolutional neural networks, system and computer equipment |
CN106940815A (en) * | 2017-02-13 | 2017-07-11 | 西安交通大学 | A kind of programmable convolutional neural networks Crypto Coprocessor IP Core |
US9721203B1 (en) * | 2016-11-10 | 2017-08-01 | Google Inc. | Performing kernel striding in hardware |
CN107066239A (en) * | 2017-03-01 | 2017-08-18 | 智擎信息系统(上海)有限公司 | A kind of hardware configuration for realizing convolutional neural networks forward calculation |
CN107229967A (en) * | 2016-08-22 | 2017-10-03 | 北京深鉴智能科技有限公司 | FPGA-based hardware accelerator and method for sparse GRU neural networks |
CN107341544A (en) * | 2017-06-30 | 2017-11-10 | 清华大学 | A kind of reconfigurable accelerator and its implementation based on divisible array |
CN107463990A (en) * | 2016-06-02 | 2017-12-12 | 国家计算机网络与信息安全管理中心 | A kind of FPGA parallel acceleration methods of convolutional neural networks |
-
2018
- 2018-01-12 CN CN201810028998.4A patent/CN108133270B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170102950A1 (en) * | 2003-05-23 | 2017-04-13 | Ip Reservoir, Llc | Intelligent Data Storage and Processing Using FPGA Devices |
CN107463990A (en) * | 2016-06-02 | 2017-12-12 | 国家计算机网络与信息安全管理中心 | A kind of FPGA parallel acceleration methods of convolutional neural networks |
CN107229967A (en) * | 2016-08-22 | 2017-10-03 | 北京深鉴智能科技有限公司 | FPGA-based hardware accelerator and method for sparse GRU neural networks |
US9721203B1 (en) * | 2016-11-10 | 2017-08-01 | Google Inc. | Performing kernel striding in hardware |
CN106874219A (en) * | 2016-12-23 | 2017-06-20 | 深圳云天励飞技术有限公司 | A kind of data dispatching method of convolutional neural networks, system and computer equipment |
CN106844294A (en) * | 2016-12-29 | 2017-06-13 | 华为机器有限公司 | Convolution algorithm chip and communication equipment |
CN106940815A (en) * | 2017-02-13 | 2017-07-11 | 西安交通大学 | A kind of programmable convolutional neural networks Crypto Coprocessor IP Core |
CN107066239A (en) * | 2017-03-01 | 2017-08-18 | 智擎信息系统(上海)有限公司 | A kind of hardware configuration for realizing convolutional neural networks forward calculation |
CN107341544A (en) * | 2017-06-30 | 2017-11-10 | 清华大学 | A kind of reconfigurable accelerator and its implementation based on divisible array |
Non-Patent Citations (3)
Title |
---|
KAIYUAN GUO et al.: "Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems * |
WAN Guochun et al.: Digital System Design Methods and Practice, Tongji University Press, 31 October 2015 * |
YU Zijian: "FPGA-based Convolutional Neural Network Accelerator", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (44)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112513885B (en) * | 2018-06-22 | 2024-02-27 | 三星电子株式会社 | Neural processor |
CN109032781A (en) * | 2018-07-13 | 2018-12-18 | 重庆邮电大学 | A kind of FPGA parallel system of convolutional neural networks algorithm |
CN110766128A (en) * | 2018-07-26 | 2020-02-07 | 北京深鉴智能科技有限公司 | Convolution calculation unit, calculation method and neural network calculation platform |
CN109063825A (en) * | 2018-08-01 | 2018-12-21 | 清华大学 | Convolutional neural networks accelerator |
CN109063825B (en) * | 2018-08-01 | 2020-12-29 | 清华大学 | Convolutional neural network accelerator |
CN110826707A (en) * | 2018-08-10 | 2020-02-21 | 北京百度网讯科技有限公司 | Acceleration method and hardware accelerator applied to convolutional neural network |
CN110826707B (en) * | 2018-08-10 | 2023-10-31 | 北京百度网讯科技有限公司 | Acceleration method and hardware accelerator applied to convolutional neural network |
CN110888824A (en) * | 2018-09-07 | 2020-03-17 | 黑芝麻智能科技(上海)有限公司 | Multilevel memory hierarchy |
CN109272113A (en) * | 2018-09-13 | 2019-01-25 | 深思考人工智能机器人科技(北京)有限公司 | A kind of convolutional neural networks establish device and method |
CN109272113B (en) * | 2018-09-13 | 2022-04-19 | 深思考人工智能机器人科技(北京)有限公司 | Convolutional neural network establishing device and method based on channel |
WO2020062284A1 (en) * | 2018-09-30 | 2020-04-02 | 深圳市大疆创新科技有限公司 | Convolutional neural network-based image processing method and device, and unmanned aerial vehicle |
CN110989920B (en) * | 2018-10-03 | 2024-02-06 | 马克西姆综合产品公司 | Energy efficient memory system and method |
CN110989920A (en) * | 2018-10-03 | 2020-04-10 | 马克西姆综合产品公司 | Energy efficient memory system and method |
CN110059797A (en) * | 2018-10-10 | 2019-07-26 | 北京中科寒武纪科技有限公司 | A kind of computing device and Related product |
CN109409514A (en) * | 2018-11-02 | 2019-03-01 | 广州市百果园信息技术有限公司 | Fixed-point calculation method, apparatus, equipment and the storage medium of convolutional neural networks |
CN109597965A (en) * | 2018-11-19 | 2019-04-09 | 深圳力维智联技术有限公司 | Data processing method, system, terminal and medium based on deep neural network |
CN109597965B (en) * | 2018-11-19 | 2023-04-18 | 深圳力维智联技术有限公司 | Data processing method, system, terminal and medium based on deep neural network |
CN109740731A (en) * | 2018-12-15 | 2019-05-10 | 华南理工大学 | A kind of adaptive convolutional layer hardware accelerator design method |
CN111340201A (en) * | 2018-12-19 | 2020-06-26 | 北京地平线机器人技术研发有限公司 | Convolutional neural network accelerator and method for performing convolutional operation thereof |
WO2020134546A1 (en) * | 2018-12-27 | 2020-07-02 | 深圳云天励飞技术有限公司 | Neural network processor, convolutional neural network data multiplexing method and related device |
CN109740732A (en) * | 2018-12-27 | 2019-05-10 | 深圳云天励飞技术有限公司 | Neural network processor, convolutional neural networks data multiplexing method and relevant device |
CN109919312B (en) * | 2019-03-29 | 2021-04-23 | 北京智芯微电子科技有限公司 | Operation method and device of convolutional neural network and DPU |
CN109993303A (en) * | 2019-03-29 | 2019-07-09 | 河南九乾电子科技有限公司 | Computer accelerator for neural network and deep learning |
CN109919312A (en) * | 2019-03-29 | 2019-06-21 | 北京智芯微电子科技有限公司 | Operation method, device and the DPU of convolutional neural networks |
CN109993303B (en) * | 2019-03-29 | 2022-09-23 | 河南九乾电子科技有限公司 | Computer accelerator for neural network and deep learning |
CN110096993A (en) * | 2019-04-28 | 2019-08-06 | 深兰科技(上海)有限公司 | The object detection apparatus and method of binocular stereo vision |
CN110377781A (en) * | 2019-06-06 | 2019-10-25 | 福建讯网网络科技股份有限公司 | A kind of matched innovatory algorithm of application sole search |
CN110533177A (en) * | 2019-08-22 | 2019-12-03 | 安谋科技(中国)有限公司 | A kind of data read-write equipment, method, equipment, medium and convolution accelerator |
CN110533177B (en) * | 2019-08-22 | 2023-12-26 | 安谋科技(中国)有限公司 | Data read-write device, method, equipment, medium and convolution accelerator |
CN113052291B (en) * | 2019-12-27 | 2024-04-16 | 上海商汤智能科技有限公司 | Data processing method and device |
CN113052291A (en) * | 2019-12-27 | 2021-06-29 | 上海商汤智能科技有限公司 | Data processing method and device |
CN111898733B (en) * | 2020-07-02 | 2022-10-25 | 西安交通大学 | Deep separable convolutional neural network accelerator architecture |
CN111898733A (en) * | 2020-07-02 | 2020-11-06 | 西安交通大学 | Deep separable convolutional neural network accelerator architecture |
CN112132274A (en) * | 2020-09-22 | 2020-12-25 | 地平线(上海)人工智能技术有限公司 | Full-connection convolution method and device for feature graph, readable storage medium and electronic equipment |
CN111931918B (en) * | 2020-09-24 | 2021-02-12 | 深圳佑驾创新科技有限公司 | Neural network accelerator |
CN111931918A (en) * | 2020-09-24 | 2020-11-13 | 深圳佑驾创新科技有限公司 | Neural network accelerator |
CN112099737B (en) * | 2020-09-29 | 2023-09-01 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for storing data |
CN112099737A (en) * | 2020-09-29 | 2020-12-18 | 北京百度网讯科技有限公司 | Method, device and equipment for storing data and storage medium |
CN112734020A (en) * | 2020-12-28 | 2021-04-30 | 中国电子科技集团公司第十五研究所 | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network |
CN112862725A (en) * | 2021-03-12 | 2021-05-28 | 上海壁仞智能科技有限公司 | Method for computing, computing device and computer-readable storage medium |
CN112862725B (en) * | 2021-03-12 | 2023-10-27 | 上海壁仞智能科技有限公司 | Method for computing, computing device, and computer-readable storage medium |
KR102659202B1 (en) * | 2021-09-15 | 2024-04-22 | 한국항공대학교산학협력단 | System and method for recognizing CNN-based human behavior using Wi-Fi signals |
CN113887720A (en) * | 2021-09-29 | 2022-01-04 | 杭州电子科技大学 | Up-sampling reverse blocking mapping method |
CN113887720B (en) * | 2021-09-29 | 2024-04-26 | 杭州电子科技大学 | Upsampling reverse blocking mapping method |
Also Published As
Publication number | Publication date |
---|---|
CN108133270B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108133270A (en) | Convolutional neural networks accelerating method and device | |
US10657306B1 (en) | Deep learning testability analysis with graph convolutional networks | |
CN108268424A (en) | Heterogeneous hardware accelerator architecture for processing sparse matrix data with skewed non-zero distribution | |
CN107169563B (en) | Processing system and method applied to two-value weight convolutional network | |
WO2018227800A1 (en) | Neural network training method and device | |
CN108268422A (en) | Hardware accelerator architecture for processing very sparse and hypersparse matrix data | |
CN109032781A (en) | A kind of FPGA parallel system of convolutional neural networks algorithm | |
CN108268320A (en) | Hardware accelerator architecture and template for web-scale k-means clustering | |
CN106951962A (en) | Compound operation unit, method and electronic device for neural networks | |
CN110334357A (en) | A kind of method, apparatus, storage medium and electronic equipment for naming Entity recognition | |
CN107945204A (en) | A kind of Pixel-level portrait based on generation confrontation network scratches drawing method | |
CN108268423A (en) | Microarchitecture enabling enhanced parallelism for sparse linear algebra operations with write-to-read dependencies | |
CN109492666A (en) | Image recognition model training method, device and storage medium | |
CN108629406B (en) | Arithmetic device for convolutional neural network | |
US11651194B2 (en) | Layout parasitics and device parameter prediction using graph neural networks | |
WO2020156508A1 (en) | Method and device for operating on basis of chip with operation array, and chip | |
US11436017B2 (en) | Data temporary storage apparatus, data temporary storage method and operation method | |
CN108268931A (en) | The methods, devices and systems of data processing | |
Miyamoto et al. | Fast calculation of Haralick texture features | |
Solovyev et al. | Fixed-point convolutional neural network for real-time video processing in FPGA | |
US20220083857A1 (en) | Convolutional neural network operation method and device | |
CN109446996B (en) | Face recognition data processing device and method based on FPGA | |
CN110163363A (en) | A kind of computing device and method | |
CN112906865B (en) | Neural network architecture searching method and device, electronic equipment and storage medium | |
US20210350230A1 (en) | Data dividing method and processor for convolution operation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |