CN108875925A - Control method and device for a convolutional neural network processor - Google Patents
- Publication number: CN108875925A
- Application number: CN201810685989.2A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The present invention provides a control method, including: 1) determining the size n*n of the convolution operation to be performed; 2) according to the size n*n of the convolution operation to be performed, loading the values of the convolution kernel corresponding to that size into m² convolution computation units of size 7*7, and filling each remaining value with 0, where 7m >= n; 3) determining the number of cycles required for the convolution computation according to the size of the convolution operation to be performed and the size of the input feature map on which the convolution is to be performed; 4) in each cycle of the convolution computation, loading the corresponding values of the input feature map into the m² 7*7 convolution computation units, the distribution of the input feature map values in the m² 7*7 convolution computation units being consistent with the distribution of the convolution kernel values in those units; and controlling the m² 7*7 convolution computation units loaded with the values of the convolution kernel and the input feature map to perform the convolution computations corresponding to the number of cycles.
Description
Technical field
The present invention relates to convolutional neural network processors, and in particular to improvements in hardware acceleration for convolutional neural network processors.
Background art
Artificial intelligence technology has developed rapidly in recent years and attracted wide attention worldwide. Both industry and academia have carried out research on artificial intelligence, and the technology has penetrated fields such as visual perception, speech recognition, assisted driving, smart homes, and traffic scheduling. Deep learning has been the booster of this development: it uses deep neural network topologies, such as convolutional neural networks, deep belief networks, and recurrent neural networks, for training, optimization, and inference through repeated iteration. Taking image recognition as an example, a deep learning algorithm can automatically derive hidden image features through a deep neural network and produce results superior to traditional pattern-recognition analysis methods.
However, existing deep learning techniques depend on an enormous amount of computation. In the training stage, the weights of the neural network must be computed by repeated iteration over massive data sets; in the inference stage, the neural network must complete the processing of input data within an extremely short response time (usually milliseconds). This requires the deployed neural network computing circuits (CPUs, GPUs, FPGAs, ASICs, etc.) to reach tens of billions or even trillions of operations per second. Hardware acceleration for deep learning, for example hardware acceleration of convolutional neural network processors, is therefore highly necessary.
It is generally accepted that hardware acceleration can be achieved in roughly two ways: one is to perform the computation in parallel on larger-scale hardware; the other is to improve processing speed or efficiency by designing dedicated hardware circuits.
Regarding the second way, some prior art maps the neural network directly to a hardware circuit, adopting a different computing unit for each network layer so that the computations of the layers proceed in a pipelined fashion. For example, each computing unit other than the first takes the output of the previous computing unit as its input, and each computing unit only performs the computation of its corresponding network layer; in different time slots of the pipeline, a computing unit processes different inputs to that layer. Such prior art is generally aimed at scenarios where a continuous stream of different inputs must be processed, such as a video file comprising multiple frames, and usually targets neural networks with relatively few layers. This is because the number of layers in a deep neural network is large: mapping the network directly to a hardware circuit costs a great deal of circuit area, and power consumption grows with circuit area. Moreover, since the operation times of the network layers differ considerably, realizing the pipeline requires the running time allotted to each pipeline stage to be forced equal, namely equal to the operation time of the slowest stage. For a deep neural network with many layers, designing such a pipeline requires considering very many factors in order to reduce the waiting time of the faster pipeline stages during pipelined computation.
Other prior art, referring to the regularity of neural network computation, proposes "time-division multiplexing" of the computing units in a neural network processor to improve their reusability. Unlike the pipelined approach above, the same computing unit is used to compute each network layer in turn, for example the input layer, the first hidden layer, the second hidden layer, and so on to the output layer one by one, repeating the process in the next iteration. Such prior art suits not only neural networks with few layers but also deep neural networks, and it is particularly suitable for application scenarios with limited hardware resources. In such scenarios, after the neural network processor has performed the computation of a network layer A for one input, it may not need to compute layer A again for a long time; if each network layer used separate dedicated hardware as its computing unit, the hardware would be constrained and its reuse rate would be low. Based on this consideration, most prior art adopts a "time-division multiplexing" scheme for the computing units and improves the hardware of the neural network processor accordingly.
However, no matter which of the above prior arts is used to design a convolutional neural network processor, the hardware utilization rate still leaves room for improvement.
Summary of the invention
Therefore, an object of the present invention is to overcome the defects of the above prior art and provide a control method for a convolutional neural network processor having 7*7 convolution computation units, the control method including:

1) determining the convolution kernel size n*n of the convolution operation to be performed;

2) according to the convolution kernel size n*n of the convolution operation to be performed, loading the values of the convolution kernel corresponding to that size into m² convolution computation units of size 7*7, and filling each remaining value with 0, where 7m >= n;

3) determining the number of cycles required for the convolution computation according to the size of the convolution operation to be performed and the size of the input feature map on which the convolution is to be performed; and

4) according to the number of cycles, in each cycle of the convolution computation, loading the corresponding values of the input feature map into the m² 7*7 convolution computation units, the distribution of the input feature map values in the m² 7*7 convolution computation units being consistent with the distribution of the convolution kernel values in those units; and controlling the m² 7*7 convolution computation units loaded with the values of the convolution kernel and the input feature map to perform the convolution computations corresponding to the number of cycles;

5) accumulating the corresponding elements of the convolution computation results of the m² 7*7 convolution computation units, to obtain the final output feature map of the convolution operation.
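As a rough illustration of step 2), the zero-padded loading of an n*n kernel into m² units of 7*7 can be sketched in NumPy as follows. The function name and the row-major tiling order are assumptions for illustration; the patent's actual loading order is defined by its control unit.

```python
import numpy as np

def load_kernel_into_units(kernel):
    """Zero-pad an n*n kernel into m*m compute units of size 7*7, 7m >= n."""
    n = kernel.shape[0]
    m = -(-n // 7)  # ceil(n / 7)
    units = np.zeros((m, m, 7, 7))
    for i in range(m):
        for j in range(m):
            # Each unit receives at most a 7*7 tile; the rest stays 0.
            tile = kernel[7*i:7*i+7, 7*j:7*j+7]
            units[i, j, :tile.shape[0], :tile.shape[1]] = tile
    return units
```

For n = 5 this yields a single 7*7 unit whose unused two rows and columns are 0; for n = 11 it yields four units (m = 2), matching the 11*11 embodiment described later.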
Preferably, according to the method, step 2) includes:

if the size of the convolution operation to be performed is smaller than 7*7, loading the values of the convolution kernel corresponding to that size into a single 7*7 convolution computation unit and filling each remaining value with 0;

if the size of the convolution operation to be performed is larger than 7*7, loading the values of the convolution kernel corresponding to that size into a corresponding number of 7*7 convolution computation units and filling each remaining value with 0.
Preferably, according to the method, step 4) includes:

in each cycle of the convolution computation, if the values of the input feature map to be loaded include elements of the leftmost column of the input feature map, loading in one pass the elements of the input feature map matching the size of the convolution operation to be performed into the corresponding positions of the convolution computation unit and filling the values of the remaining positions with 0; otherwise, shifting the elements that are identical to those of the previous cycle left by one unit as a whole, and loading the elements of the input feature map that differ from the previous cycle and need updating into the positions vacated by the shift.
Preferably, according to the method, step 4) includes:

in each cycle of the convolution computation, controlling each of the m² 7*7 convolution computation units to multiply the elements at corresponding positions of the input feature map and the convolution kernel loaded into it and to accumulate the products, so as to obtain the element at the corresponding position of the output feature map.
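The per-unit operation in this step is a plain multiply-accumulate over the 49 loaded values; a minimal sketch (the function name is an assumption):

```python
import numpy as np

def unit_mac(loaded_fmap, loaded_kernel):
    """One 7*7 unit: multiply corresponding positions, accumulate the products.

    Positions filled with 0 contribute nothing to the accumulated sum,
    which is what makes the zero-padding scheme work.
    """
    return float(np.sum(loaded_fmap * loaded_kernel))
```

For a 5*5 kernel zero-padded into the 7*7 grid, the 24 padded positions add 0 to the sum, so the unit's output equals a true 5*5 convolution result.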
Preferably, according to the method, step 2) includes:

if the size of the convolution operation to be performed is 5*5, loading the values of the 5*5 convolution kernel into a single 7*7 convolution computation unit and filling each remaining value with 0;

and step 4) includes:

in each of the cycles of the convolution computation, loading the corresponding values of the input feature map into the 7*7 convolution computation unit, the distribution of the input feature map values in the 7*7 convolution computation unit being consistent with the distribution of the values of the 5*5 convolution kernel in that unit;

wherein, in each cycle of the convolution computation, if the values of the input feature map to be loaded include elements of the leftmost column of the input feature map, the 25 elements of the 5*5 region of the input feature map are loaded in one pass into the corresponding positions of the convolution computation unit and the values of the remaining positions are filled with 0; otherwise, the elements identical to those of the previous cycle are shifted left by one unit as a whole, and the 5 elements of the input feature map that differ from the previous cycle and need updating are loaded into the positions vacated by the shift.
Preferably, according to the method, step 2) includes:

if the size of the convolution operation to be performed is 3*3, loading the values of 3*3 convolution kernels for up to 4 channels into the same 7*7 convolution computation unit and filling each remaining value with 0;

and step 4) includes:

in each of the cycles of the convolution computation, loading the corresponding values of the input feature map into the 7*7 convolution computation unit, the values of the input feature map being loaded in the form of one or more copies equal in number to the 3*3 convolution kernels, and the distribution of the input feature map in the 7*7 convolution computation unit corresponding to the distribution of the values of the 3*3 convolution kernels in that unit;

wherein, in each cycle of the convolution computation, if the values of the input feature map to be loaded include elements of the leftmost column of the input feature map, the 9 elements of the 3*3 region of the input feature map are loaded in one pass into the corresponding positions of the convolution computation unit and the values of the remaining positions are filled with 0; otherwise, the elements identical to those of the previous cycle are shifted left by one unit as a whole, and the 3 elements of the input feature map that differ from the previous cycle and need updating are loaded into the corresponding positions vacated by the shift.
Preferably, according to the method, step 4) further includes:

if values of 3*3 convolution kernels for 2 or 4 channels are loaded into the same 7*7 convolution computation unit, and the values of the input feature map to be loaded do not include elements of the leftmost column of the input feature map, then for the 2 copies of the input feature map located in the same columns but different rows of the 7*7 convolution computation unit, the elements identical to those of the previous cycle are shifted left by one unit as a whole, and the 3 elements of the input feature map that differ from the previous cycle and need updating are loaded into the corresponding positions vacated by the shift.
Preferably, according to the method, step 2) includes:

if the size of the convolution operation to be performed is 11*11, controlling four 7*7 convolution computation units to jointly load the values of the 11*11 convolution kernel and filling each remaining value with 0;

and step 4) includes:

in each of the cycles of the convolution computation, loading the corresponding values of the input feature map into the four 7*7 convolution computation units, the distribution of the input feature map values in the four 7*7 convolution computation units being consistent with the distribution of the values of the 11*11 convolution kernel in those units;

wherein, in each cycle of the convolution computation, if the values of the input feature map to be loaded include elements of the leftmost column of the input feature map, the 121 elements of the 11*11 region of the input feature map are loaded in one pass into the corresponding positions of the convolution computation units and the values of the remaining positions are filled with 0; otherwise, the elements identical to those of the previous cycle are shifted left by one unit as a whole, and the 11 elements of the input feature map that differ from the previous cycle and need updating are loaded into the corresponding positions vacated by the shift.
Preferably, according to the method, step 4) includes:

in each cycle of the convolution computation, controlling each of the four 7*7 convolution computation units to multiply the elements at corresponding positions of the input feature map and the convolution kernel loaded into it and to accumulate the products;

and step 5) includes: accumulating the computation results of all four 7*7 convolution computation units, to obtain the element at the corresponding position of the output feature map.
The present invention also provides a control unit for realizing the control method of any one of the above, and a convolutional neural network processor including: 7*7 convolution computation units and a control unit, the control unit being configured to realize the method of any one of the above.
Compared with the prior art, the advantages of the present invention are as follows:

The reusability of the computing units that perform convolution is improved, reducing the number of hardware computing units that must be placed in a convolutional neural network processor. The processor does not need to provide large numbers of hardware computing units of different sizes for convolutional layers that use convolution kernels of different sizes. When the computation of one convolutional layer is performed, computing units whose size does not match the convolution kernel of that layer can still be used for the computation, thereby improving the utilization rate of the hardware computing units in the convolutional neural network processor.
Brief description of the drawings
Embodiments of the present invention are further described below with reference to the drawings, wherein:

Fig. 1 is a schematic diagram of the prior art using M kinds of convolution kernels, each with N channels, to perform convolution computation on an input layer to obtain output layers;

Fig. 2 is a schematic diagram of the prior art realizing a 7*7 convolution operation using one 7*7 computing unit;

Fig. 3 is a schematic diagram of realizing a 5*5 convolution operation using a 7*7 computing unit according to one embodiment of the present invention;

Fig. 4 is a schematic diagram of realizing 3*3 convolution operations for 4 channels at once using one 7*7 computing unit according to one embodiment of the present invention;

Fig. 5 is a schematic diagram of realizing an 11*11 convolution operation using four 7*7 computing units according to one embodiment of the present invention.
Specific embodiment
The present invention is described in detail below with reference to the drawings and specific embodiments.
While studying the prior art, the inventors found that the various classical neural networks, such as AlexNet, GoogleNet, VGG, and ResNet, contain different numbers of convolutional layers, and different convolutional layers use convolution kernels of different sizes. Taking AlexNet as an example, the first layer of the network is a convolutional layer with an 11*11 kernel, the second layer is a convolutional layer with a 5*5 kernel, the third layer is a convolutional layer with a 3*3 kernel, and so on.
However, existing neural network processors provide different computing units for convolution kernels of different sizes. As a result, when the computation of a certain convolutional layer is performed, the other computing units that do not match the kernel size of that layer sit idle.

For example, as shown in Fig. 1, a neural network processor may provide M different convolution kernels, denoted kernel 0 to kernel M-1, each kernel having N channels and respectively used to perform convolution computation on the N channels of an input layer; each kernel convolved with an input layer yields one output layer. For one input layer, M output layers can thus be computed using all M kinds of kernels. If some input layer needs a convolution operation using kernel 1, the computing units other than the one corresponding to kernel 1 are idle at that time.
In view of this, the present invention proposes a multiplexing scheme for the computing units: by control, the data actually loaded into a computing unit is adjusted (for the same computing unit, both the values of the convolution kernel and the values of the input feature map need to be loaded), so that convolution operations of various sizes are realized with computing units of 7*7 scale, reducing the scale of the hardware computing units that convolution operations require.
The neural network processor system architecture of the present invention may include the following five parts: an input data storage unit, a control unit, an output data storage unit, a weight storage unit, and a computing unit.

The input data storage unit stores the data participating in the computation; the output data storage unit stores the computed neuron responses; the weight storage unit stores the trained neural network weights. The control unit is connected to the output data storage unit, the weight storage unit, and the computing unit, and controls the computing unit to perform neural network computation according to control signals obtained by parsing. The computing unit performs the corresponding neural network computation according to the control signals generated by the control unit, completing most of the operations in the neural network algorithm, i.e., vector multiply-add operations and the like.
The multiplexing of the computing units according to the present invention can be controlled and realized by the above control unit, as described in the several embodiments below.
First, consider how the traditional prior art realizes a 7*7 convolution operation using a 7*7 computing unit. Referring to the example given in Fig. 2, in the prior art the 7*7-scale computing unit realizes the convolution operation as follows:

In the first cycle, each element of rows 1-7, columns 1-7 of the input feature map (referred to herein as the sliding window over the input feature map) is multiplied by the element at the corresponding position of the kernel, and the products are accumulated to give the element at row 1, column 1 of the output feature map, i.e. 2 × (-4) + (3 × 2) + (-2 × (-4)) + (2 × (-8)) + (-7 × 3) = -31.

In the second cycle, each element of rows 1-7, columns 2-8 of the input feature map (the values in the sliding window for the current cycle) is multiplied by the element at the corresponding position of the kernel, and the products are accumulated to give the element at row 1, column 2 of the output feature map (not shown in Fig. 2).

And so on: by moving the 7*7 sliding window right or down a total of 15 times, an output feature map of size 4*4 is obtained.
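The prior-art behaviour described above amounts to a "valid" convolution with stride 1, which fixes the output size at (input - kernel + 1) per dimension; a minimal sketch:

```python
import numpy as np

def conv_valid(fmap, kernel):
    """Slide the kernel-sized window over the feature map (stride 1) and
    multiply-accumulate at every window position."""
    n = kernel.shape[0]
    out = fmap.shape[0] - n + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            result[i, j] = np.sum(fmap[i:i+n, j:j+n] * kernel)
    return result
```

For a 10*10 map and a 7*7 kernel this gives a 4*4 output: 16 window positions, i.e. the initial position plus the 15 moves mentioned above.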
The present invention does not exclude using the above approach to realize a 7*7 convolution with a 7*7 computing unit. Further, in the present invention, a 7*7-scale computing unit can also, by control, realize kernel operations of sizes other than 7*7, such as 5*5, 3*3, and 11*11 convolutions.

As described above, in the traditional prior art the size of the output feature map depends on the number of sliding-window moves and the size of the kernel. For example, when a 7*7 convolution is performed on a 10*10 input feature map, the lateral and longitudinal movement ranges of the sliding window each cover 4 positions, and the computation over multiple cycles yields a 4*4 output feature map. This makes it very difficult to realize convolutions of other sizes with a 7*7 computing unit. It will be appreciated that, continuing with the prior art, convolution over a 10*10 input feature map with a 7*7 computing unit can only yield a 4*4 output feature map (e.g., as shown in Fig. 2); the computing unit and the processor have no apparent way of moving the sliding window so that, for example, a 5*5 convolution can be obtained with the 7*7 computing unit.

In view of this, the present invention proposes a corresponding control method: by scheduling the input feature map and the convolution kernel loaded into the computing unit, and controlling the execution of the multiplication and addition operations, a 5*5 convolution operation is performed with a 7*7 computing unit.
According to one embodiment of the present invention, with reference to Fig. 3, the specific control method is as follows:

When the computing unit performs convolution computation, for each sliding window the values of the corresponding convolution kernel and of the corresponding input feature map are loaded into the 7*7 computing unit under control.

As shown in Fig. 3, the size of the input feature map is 10*10 and the size of the convolution operation to be performed is 5*5, so it can be determined that the convolution computation requires 6 × 6 = 36 cycles in total.

In the first cycle, the elements of rows 1-5, columns 1-5 of the input feature map are loaded into the 7*7 computing unit as the elements of its rows 1-5, columns 1-5, and the remaining elements of rows 6-7 and columns 6-7 are filled with "0"; the 5*5 convolution kernel is loaded into the 7*7 computing unit as the elements of its rows 1-5, columns 1-5, and the remaining elements of rows 6-7 and columns 6-7 are likewise filled with "0". The values of the input feature map and of the kernel are thus loaded into the 7*7 computing unit. The 7*7 computing unit is controlled to multiply and accumulate the elements at corresponding positions of the input feature map and the kernel, to obtain the element at row 1, column 1 of the output feature map, i.e. (2 × (-4)) + (3 × 2) + (-2 × (-4)) + (2 × (-8)) = -10. Since every element of the computing unit outside the values of the original 5*5 kernel is 0, the computed result is exactly the same as the result of performing the convolution with an actual 5*5 computing unit.
In the second cycle, all elements of rows 1-5, columns 2-6 of the input feature map (i.e. "0, 0, 2, 0, -3; 0, 3, -2, 5, 0; 0, 0, 0, 2, 0; 0, 0, 0, 3, 0; 0, 0, 0, 0, 0") are loaded into the computing unit as the new elements of rows 1-5, columns 1-5. The computing unit is controlled to multiply and accumulate the elements loaded into it, to obtain the element at row 1, column 2 of the output feature map.
According to a preferred embodiment of the present invention, the manner of loading the input feature map data into the 7*7 computing unit in the second cycle can also be improved to raise loading efficiency. That is, with reference to Fig. 3, all elements of rows 1-5, columns 2-5 of the 7*7 computing unit (i.e. "0, 0, 2, 0; 0, 3, -2, 5; 0, 0, 0, 2; 0, 0, 0, 3; 0, 0, 0, 0") are shifted left as a whole by 1 unit to become the new elements of rows 1-5, columns 1-4, and the elements of rows 1-5, column 6 of the input feature map (i.e. "-3; 0; 0; 0; 0") are loaded into the computing unit as the new elements of rows 1-5, column 5. The values of the input feature map loaded in the 7*7 computing unit are thereby updated, achieving an effect similar to the sliding window of the traditional scheme. The computing unit is likewise controlled to multiply and accumulate the elements loaded into it, to obtain the element at row 1, column 2 of the output feature map.
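The shift-left update of this preferred embodiment can be sketched as follows; the function name and the convention of touching only the active 5*5 block of the 7*7 grid are assumptions for illustration:

```python
import numpy as np

def shift_and_load(loaded, new_col, width=5):
    """Shift the active width*width block of the 7*7 grid left by one
    column and load the new rightmost column fetched from the feature map."""
    loaded = loaded.copy()
    loaded[:width, :width-1] = loaded[:width, 1:width]  # shift left by 1
    loaded[:width, width-1] = new_col                    # load 5 new values
    return loaded
```

Per cycle only 5 new values are fetched instead of 25, which is the loading-efficiency gain the embodiment describes.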
And so on, third is completed to the period 6.
In the seventh cycle, the elements of rows 2-6, columns 1-5 of the input feature map are loaded into the 7*7 computing unit as the elements of rows 1-5, columns 1-5, and the computing unit is controlled to perform multiplication and accumulation on the elements loaded into it, thereby obtaining the element at row 2, column 1 of the output feature map. In the subsequent eighth through twelfth cycles, the elements of the corresponding input feature map are loaded into the computing unit in a manner similar to the aforementioned second through sixth cycles. This continues until all 36 cycles are completed and the 6*6 output feature map is obtained.
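The walkthrough above can be checked against an ordinary convolution with a small software model (a sketch under the assumption of a zero-padded 7*7 multiply-accumulate array; the function name and dense nested-loop evaluation are illustrative, not the hardware implementation):

```python
def conv_via_7x7_unit(feature, kernel, unit=7):
    """Model the scheme: a 7*7 unit holds a zero-padded n*n kernel; each
    cycle loads one n*n input patch and performs a full 7*7
    multiply-accumulate. Cells outside the n*n region stay 0 and
    contribute nothing to the accumulation."""
    n = len(kernel)                  # kernel size, e.g. 5
    N = len(feature)                 # input size, e.g. 10
    out = N - n + 1                  # output size, e.g. 6
    # Load the n*n kernel into the unit and fill the rest with 0.
    k = [[kernel[r][c] if r < n and c < n else 0 for c in range(unit)]
         for r in range(unit)]
    result = [[0] * out for _ in range(out)]
    for i in range(out):             # one cycle per output element
        for j in range(out):
            # Load the matching n*n input patch, zero-filled to 7*7.
            f = [[feature[i + r][j + c] if r < n and c < n else 0
                  for c in range(unit)] for r in range(unit)]
            result[i][j] = sum(k[r][c] * f[r][c]
                               for r in range(unit) for c in range(unit))
    return result
```

Running this for a 10*10 input and a 5*5 kernel executes 36 cycles and produces the same 6*6 output as a direct 5*5 convolution, which is the equivalence the embodiment relies on.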
It can be seen that with the above control method, the 25 values of the 5*5 region of the input feature map are loaded into the computing unit at once in the first cycle. Likewise, the 7th, 13th, 19th, 25th, and 31st cycles each load 25 values of the input feature map at once. Correspondingly, in the second through sixth cycles, only 5 values of the input feature map need to be loaded each time, reusing the 20 values from the previous cycle that are shifted left, while the values loaded for the convolution kernel in the computing unit are not modified at all. Similarly, the 8th-12th, 14th-18th, 20th-24th, 26th-30th, and 32nd-36th cycles load the elements of the input feature map in a manner similar to the second through sixth cycles.
This ensures that in each cycle, the position of each element of the input feature map in the computing unit corresponds one-to-one with the position of the corresponding element of the convolution kernel with which it is multiplied. Moreover, to other units apart from the unit implementing the control method of the invention, such as the computing unit itself or the processor, it is not apparent that the 7*7 convolution unit is actually performing a 5*5 convolution operation. In addition, with the above control method, the values of the input feature map loaded by the computing unit in each cycle do not directly depend on a sliding window. On the one hand, the arrangement of the values of the input feature map loaded into the computing unit does not depend on the actual arrangement of the values in a sliding window of size 5*5; on the other hand, the number of calculation cycles does not depend on the number of moves of a sliding window of size 7*7 (i.e. 4*4): the quantity and size of the output results can be controlled by the control method of the invention. It is thereby possible to use a 7*7 computing unit to perform a 5*5 convolution operation on a 10*10 input feature map and obtain a 6*6 output result.
Similarly, in a manner similar to the example of Fig. 3 above, convolution operations of sizes smaller than 7*7 can be executed in the same 7*7 computing unit, for example a 3*3 convolution operation. That is, a 3*3 convolution kernel is loaded into the 7*7 computing unit and the remaining values are filled with "0". The number of cycles required for the convolution operation is determined according to the size of the input feature map and the size 3*3 of the convolution operation to be performed. In each cycle, the corresponding values of the input feature map are loaded into the 7*7 computing unit to execute the convolution operation.
It can be appreciated that if a 7*7 computing unit performs only a single 3*3 convolution operation at a time, its hardware utilization is low. In view of this, the invention further provides a scheme for controlling the same 7*7 computing unit to execute the 3*3 convolution operations of four channels at once for the same input feature map.
According to one embodiment of the present invention, a control method is further provided that executes a 3*3 convolution operation with a 7*7 computing unit; with reference to Fig. 4, the specific control method is as follows:
The size of the input feature map is 10*10 and the size of the convolution operation to be performed is 3*3, from which it can be determined that the convolutional calculation requires 8 × 8 = 64 cycles in total.
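The cycle count follows the usual stride-1, no-padding output-size formula; a one-line sketch (the function name is illustrative):

```python
def cycles_needed(input_size, kernel_size):
    """One cycle per output element: (N - n + 1) positions per dimension
    for stride 1 and no padding."""
    out = input_size - kernel_size + 1
    return out * out
```

For the examples in this description, `cycles_needed(10, 5)` gives the 36 cycles of the 5*5 case and `cycles_needed(10, 3)` the 64 cycles of the 3*3 case.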
In the first cycle, the elements of rows 1-3, columns 1-3 of the input feature map are copied into 4 copies, which are loaded respectively into rows 1-3 columns 1-3, rows 1-3 columns 4-6, rows 4-6 columns 1-3, and rows 4-6 columns 4-6 of the 7*7 computing unit, and the remaining elements of row 7 and column 7 are filled with "0". The four 3*3 convolution kernels for the four channels are likewise loaded respectively into rows 1-3 columns 1-3, rows 1-3 columns 4-6, rows 4-6 columns 1-3, and rows 4-6 columns 4-6 of the 7*7 computing unit, and the remaining elements of row 7 and column 7 are filled with "0". In the embodiment illustrated in Fig. 3, each convolution kernel is used for one channel, and when computing the result of a multi-channel convolution operation, the convolution results at the same position in each channel can be accumulated together as the output result for that position. In the present invention, after the above elements of the input feature map and convolution kernels have been loaded, the 7*7 computing unit can be controlled to multiply and accumulate the elements at corresponding positions of the input feature map and the convolution kernels; the result obtained is consistent with the accumulated convolution results at the same position for the four channels, and the element at row 1, column 1 of the output feature map is thereby obtained.
In the second cycle, all elements of rows 1-3, columns 2-3 in the 7*7 computing unit (i.e. "0,0; 0,3; 0,0") are shifted left as a whole by 1 unit to become the new elements of rows 1-3, columns 1-2, and the elements of rows 1-3, column 4 of the input feature map (i.e. "2; -2; 0") are loaded into the computing unit as the new elements of rows 1-3, column 3. The same shift-and-load operations are likewise executed for the regions originally at rows 1-3 columns 4-6, rows 4-6 columns 1-3, and rows 4-6 columns 4-6. The values of the input feature map loaded in the 7*7 computing unit are thus updated. Further, in the present invention, the original rows 1-6 columns 2-3 and rows 1-6 columns 5-6 of the computing unit can each be moved as one whole, and/or the new elements to be loaded can be copied into two copies and loaded into the computing unit together as one whole, thereby reducing the number of control operations required. The computing unit is likewise controlled to perform multiplication and accumulation on the elements loaded into it, thereby obtaining the element at row 1, column 2 of the output feature map.
This continues until all cycles are completed and the 8*8 output feature map is obtained.
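The quadrant-packing equivalence underlying this embodiment can be sketched in software (an illustrative model under the assumptions of the description: the same input patch is replicated into each quadrant, and a single 7*7 multiply-accumulate replaces the per-channel accumulation; the function name is an assumption):

```python
def packed_3x3_mac(patch, kernels):
    """patch: one 3*3 input patch; kernels: up to four 3*3 convolution
    kernels. Pack both into the quadrants of a 7*7 unit (row 7 and
    column 7 stay 0) and do one multiply-accumulate; the result equals
    the per-channel 3*3 convolution results accumulated together."""
    unit_f = [[0] * 7 for _ in range(7)]   # input-side 7*7 array
    unit_k = [[0] * 7 for _ in range(7)]   # kernel-side 7*7 array
    # Top-left corners of the four 3*3 quadrants; unused quadrants stay 0.
    offsets = [(0, 0), (0, 3), (3, 0), (3, 3)][:len(kernels)]
    for (dr, dc), ker in zip(offsets, kernels):
        for r in range(3):
            for c in range(3):
                unit_f[dr + r][dc + c] = patch[r][c]   # replicate the patch
                unit_k[dr + r][dc + c] = ker[r][c]     # load this channel
    return sum(unit_f[r][c] * unit_k[r][c]
               for r in range(7) for c in range(7))
```

Passing fewer than four kernels models the "fill unused 3*3 regions with 0" case described below for channel counts other than four.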
It can be seen that with the above control method, the 3*3 convolution operations of four channels can be performed at once for an input feature map, which is particularly suitable when the number of channels is large. When the number of channels is not equal to four, at least one of the 3*3 regions of the 7*7 computing unit used for loading convolution kernels can be filled with "0", for example rows 4-6, columns 4-6 can be filled entirely with "0". When the number of channels is greater than four, for example seven channels, two calculations can be performed under control: 4 convolution kernels are loaded for the first calculation and 3 convolution kernels for the second, and the elements at the same position in the two calculation results are accumulated to obtain the output result for that position.
The foregoing embodiments of the present invention describe how to control a 7*7 computing unit to realize 5*5 and 3*3 convolutional calculations; the following explains how to control 7*7 computing units to realize convolutional calculations whose size exceeds 7*7.
According to one embodiment of the present invention, a control method is provided that executes an 11*11 convolution operation with 7*7 computing units; with reference to Fig. 5, the specific control method is as follows:
First, it is determined that 11 > 7, so the convolution operation of size 11*11 needs to be completed jointly by more than one 7*7 computing unit. Here, k computing units can be chosen such that they are just sufficient to load data of size 11*11. Here k is chosen as k = m², where m is the smallest positive integer satisfying 7m ≥ n. Of course, more 7*7 computing units than this quantity may also be selected to execute the 11*11 convolution operation. For the example illustrated in Fig. 5, k = 4 computing units are selected.
The values of the convolution kernel used are divided into four parts and loaded respectively into the four 7*7 computing units under control, with the remaining parts filled with "0"; moreover, in each cycle, the corresponding data of the input feature map are divided into four parts and loaded respectively into the four 7*7 computing units under control, with the remaining parts filled with "0". Within the four 7*7 computing units, the distribution of the values of the convolution kernel is consistent with the distribution of the values of the input feature map.
Each computing unit is controlled to perform multiplication and accumulation on the elements loaded into it, and the corresponding calculation results of all four computing units are accumulated to obtain the corresponding value of the output feature map.
In this embodiment, a manner similar to the previous embodiments can further be used in each cycle to update the values of the input feature map in each computing unit by shifting and loading the corresponding values of the input feature map. For example, in the second cycle: the values of columns 2-7 of the first 7*7 computing unit (in the upper-left corner of Fig. 5) are shifted left by 1 unit and new values are loaded into column 7; the values of columns 2-4 of the second 7*7 computing unit (in the upper-right corner of Fig. 5) are shifted left by 1 unit and new values are loaded into column 4; the values of columns 2-7, rows 1-4 of the third 7*7 computing unit (in the lower-left corner of Fig. 5) are shifted left by 1 unit and new values are loaded into column 7, rows 1-4; the values of columns 2-4, rows 1-4 of the fourth 7*7 computing unit (in the lower-right corner of Fig. 5) are shifted left by 1 unit and new values are loaded into column 4, rows 1-4.
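The four-unit split can be modeled in software to confirm that the cross-unit accumulation recovers the full 11*11 result (an illustrative sketch; the tile layout follows the four-corner division described above, and the function name is an assumption):

```python
def mac_11x11_via_four_units(patch, kernel):
    """Split an 11*11 input patch and kernel into four tiles (7*7, 7*4,
    4*7, 4*4), zero-pad each into a 7*7 unit, multiply-accumulate per
    unit, then sum the four partial results."""
    tiles = [(0, 0), (0, 7), (7, 0), (7, 7)]  # top-left corner of each tile
    total = 0
    for dr, dc in tiles:
        acc = 0
        for r in range(7):
            for c in range(7):
                # Positions past the 11*11 boundary stay 0 in the unit.
                if dr + r < 11 and dc + c < 11:
                    acc += patch[dr + r][dc + c] * kernel[dr + r][dc + c]
        total += acc                  # cross-unit accumulation (step 5)
    return total
```

Because the four tiles partition the 11*11 index range exactly, the sum of the four units' multiply-accumulates equals the direct 11*11 dot product for that output position.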
In the present invention, a corresponding control unit can be provided for the above control method. Such a control unit can be adapted to an existing convolutional neural network processor to multiplex its convolutional calculation units by implementing the above control method, or a matched convolutional neural network processor can be designed based on the hardware resources required by such a control unit, for example using the minimum amount of hardware resources that satisfies the above multiplexing scheme.
The scheme provided by the present invention relates to improving the reusability of the computing units used to execute convolution, thereby reducing the hardware computing units that must be provided in a convolutional neural network processor: the processor no longer needs to provide a large number of hardware computing units of different sizes for different convolutional layers that use convolution kernels of different sizes. When executing the calculation for a given convolutional layer, computing units of the same size can be used to realize the convolutional calculations of different convolutional layers, thereby increasing the utilization of the hardware computing units in the convolutional neural network processor.
It can be appreciated that the present invention does not exclude performing calculation processing in parallel with larger-scale hardware as described in the background art, while improving the reusability of the computing units by way of "time-division multiplexing".
Furthermore, it should be noted that not every step introduced in the above embodiments is necessary; those skilled in the art can make appropriate omissions, substitutions, modifications, and the like according to actual needs.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail above with reference to embodiments, those skilled in the art should understand that modifications or equivalent substitutions of the technical solution of the invention that do not depart from the spirit and scope of the technical solution of the invention should all be covered by the scope of the claims of the invention.
Claims (11)
1. A control method for a convolutional neural network processor, the convolutional neural network processor having 7*7 convolutional calculation units, the control method comprising:
1) determining the kernel size n*n of the convolution operation to be performed;
2) according to the kernel size n*n of the convolution operation to be performed, loading the values of the convolution kernel corresponding to that size into m² selected 7*7 convolutional calculation units, and filling each remaining value with 0, where 7m ≥ n;
3) determining the number of cycles required for the convolutional calculation process according to the size of the convolution operation to be performed and the size of the input feature map to be convolved; and
4) according to the number of cycles, in each cycle of the convolutional calculation, loading the corresponding values of the input feature map into the m² 7*7 convolutional calculation units, the distribution of the values of the input feature map within the m² 7*7 convolutional calculation units being consistent with the distribution of the values of the convolution kernel within the m² 7*7 convolutional calculation units;
controlling the m² 7*7 convolutional calculation units loaded with the values of the convolution kernel and the input feature map to respectively execute the convolutional calculations corresponding to the number of cycles; and
5) accumulating corresponding elements in the convolutional calculation results of the m² 7*7 convolutional calculation units to obtain the output feature map of the final convolution operation.
2. The method according to claim 1, wherein step 2) comprises:
if the size of the convolution operation to be performed is smaller than 7*7, loading the values of the convolution kernel corresponding to that size into a single 7*7 convolutional calculation unit and filling each remaining value with 0;
if the size of the convolution operation to be performed is larger than 7*7, loading the values of the convolution kernel corresponding to that size into a corresponding number of 7*7 convolutional calculation units and filling each remaining value with 0.
3. The method according to claim 1, wherein step 4) comprises:
in each cycle of the convolutional calculation, if the values of the input feature map to be loaded include elements of the leftmost first column of the input feature map, loading at once the elements of the input feature map matching the size of the convolution operation to be performed into the corresponding positions of the convolutional calculation unit and filling the values of all remaining positions with 0; otherwise, shifting the elements identical to those of the previous cycle left as a whole by one unit, and loading the elements of the input feature map that differ from the previous cycle and need to be updated into the positions vacated by the shift.
4. The method according to claim 1, wherein step 4) comprises:
in each cycle of the convolutional calculation, controlling each of the m² 7*7 convolutional calculation units to multiply the elements at corresponding positions of the input feature map and the convolution kernel loaded into it and to accumulate the results of the multiplications, thereby obtaining the element at the corresponding position of the output feature map.
5. The method according to any one of claims 1-4, wherein step 2) comprises:
if the size of the convolution operation to be performed is 5*5, loading the values of a 5*5 convolution kernel into a single 7*7 convolutional calculation unit and filling each remaining value with 0;
and step 4) comprises:
in each of all the cycles for executing the convolutional calculation, loading the corresponding values of the input feature map into the 7*7 convolutional calculation unit, the distribution of the values of the input feature map in the 7*7 convolutional calculation unit being consistent with the distribution of the values of the 5*5 convolution kernel in the 7*7 convolutional calculation unit;
wherein, in each cycle of the convolutional calculation, if the values of the input feature map to be loaded include elements of the leftmost first column of the input feature map, the 25 elements of size 5*5 in the input feature map are loaded at once into the corresponding positions of the convolutional calculation unit and the values of all remaining positions are filled with 0; otherwise, the elements identical to those of the previous cycle are shifted left as a whole by one unit, and the 5 elements of the input feature map that differ from the previous cycle and need to be updated are loaded into the positions vacated by the shift.
6. The method according to any one of claims 1-4, wherein step 2) comprises:
if the size of the convolution operation to be performed is 3*3, loading into a single 7*7 convolutional calculation unit the values of 3*3 convolution kernels for 4 channels and filling each remaining value with 0;
and step 4) comprises:
in each of all the cycles for executing the convolutional calculation, loading the corresponding values of the input feature map into the 7*7 convolutional calculation unit, the values of the input feature map being loaded into the 7*7 convolutional calculation unit in the form of one or more copies equal in number to the 3*3 convolution kernels, the distribution of the input feature map in the 7*7 convolutional calculation unit corresponding to the distribution of the values of the 3*3 convolution kernels in the 7*7 convolutional calculation unit;
wherein, in each cycle of the convolutional calculation, if the values of the input feature map to be loaded include elements of the leftmost first column of the input feature map, the 9 elements of size 3*3 in the input feature map are loaded at once into the corresponding positions of the convolutional calculation unit and the values of all remaining positions are filled with 0; otherwise, the elements identical to those of the previous cycle are shifted left as a whole by one unit, and the 3 elements of the input feature map that differ from the previous cycle and need to be updated are loaded into the corresponding positions vacated by the shift.
7. The method according to claim 6, wherein step 4) further comprises:
if values of 3*3 convolution kernels for 2 or 4 channels are loaded into the same 7*7 convolutional calculation unit, and the values of the input feature map to be loaded do not include elements of the leftmost first column of the input feature map, then, for the 2 copies of the input feature map located in the same columns but different rows of the 7*7 convolutional calculation unit, the elements identical to those of the previous cycle are shifted left as a whole by one unit, and the 3 elements of the input feature map that differ from the previous cycle and need to be updated are loaded into the corresponding positions vacated by the shift.
8. The method according to any one of claims 1-4, wherein step 2) comprises:
if the size of the convolution operation to be performed is 11*11, controlling four 7*7 convolutional calculation units to jointly load the values of the 11*11 convolution kernel and filling each remaining value with 0;
and step 4) comprises:
in each of all the cycles for executing the convolutional calculation, loading the corresponding values of the input feature map into the four 7*7 convolutional calculation units, the distribution of the values of the input feature map in the four 7*7 convolutional calculation units being consistent with the distribution of the values of the 11*11 convolution kernel in the four 7*7 convolutional calculation units;
wherein, in each cycle of the convolutional calculation, if the values of the input feature map to be loaded include elements of the leftmost first column of the input feature map, the 121 elements of size 11*11 in the input feature map are loaded at once into the corresponding positions of the convolutional calculation units and the values of all remaining positions are filled with 0; otherwise, the elements identical to those of the previous cycle are shifted left as a whole by one unit, and the 11 elements of the input feature map that differ from the previous cycle and need to be updated are loaded into the corresponding positions vacated by the shift.
9. The method according to claim 8, wherein step 4) comprises:
in each cycle of the convolutional calculation, controlling each of the four 7*7 convolutional calculation units to multiply the elements at corresponding positions of the input feature map and the convolution kernel loaded into it and to accumulate the results of the multiplications;
and step 5) comprises: accumulating the calculation results of all four 7*7 convolutional calculation units to obtain the element at the corresponding position of the output feature map.
10. A control unit for realizing the control method according to any one of claims 1-9.
11. A convolutional neural network processor, comprising: 7*7 convolutional calculation units and a control unit, the control unit being used to realize the method according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810685989.2A CN108875925A (en) | 2018-06-28 | 2018-06-28 | A kind of control method and device for convolutional neural networks processor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108875925A true CN108875925A (en) | 2018-11-23 |
Family
ID=64295468
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170192638A1 (en) * | 2016-01-05 | 2017-07-06 | Sentient Technologies (Barbados) Limited | Machine learning based webinterface production and deployment system |
US20170337467A1 (en) * | 2016-05-18 | 2017-11-23 | Nec Laboratories America, Inc. | Security system using a convolutional neural network with pruned filters |
CN107818367A (en) * | 2017-10-30 | 2018-03-20 | 中国科学院计算技术研究所 | Processing system and processing method for neutral net |
CN107844826A (en) * | 2017-10-30 | 2018-03-27 | 中国科学院计算技术研究所 | Neural-network processing unit and the processing system comprising the processing unit |
CN107918794A (en) * | 2017-11-15 | 2018-04-17 | 中国科学院计算技术研究所 | Neural network processor based on computing array |
CN108205700A (en) * | 2016-12-20 | 2018-06-26 | 上海寒武纪信息科技有限公司 | Neural network computing device and method |
Non-Patent Citations (3)
Title |
---|
KOTA ANDO et al., "A Multithreaded CGRA for Convolutional Neural", Scientific Research *
LI DU et al., "A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things", IEEE Transactions on Circuits and Systems I: Regular Papers *
ZHANG Qiangqiang et al., "Acceleration of Convolutional Neural Networks with Large Image Inputs Based on Block Convolution", Sciencepaper Online (中国科技论文在线) *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109711367A (en) * | 2018-12-29 | 2019-05-03 | 北京中科寒武纪科技有限公司 | Operation method, device and Related product |
CN113052291A (en) * | 2019-12-27 | 2021-06-29 | 上海商汤智能科技有限公司 | Data processing method and device |
CN113052291B (en) * | 2019-12-27 | 2024-04-16 | 上海商汤智能科技有限公司 | Data processing method and device |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181123 |