CN108875917A - Control method and device for a convolutional neural network processor - Google Patents

Control method and device for a convolutional neural network processor Download PDF

Info

Publication number
CN108875917A
CN108875917A CN201810685538.9A
Authority
CN
China
Prior art keywords
convolution computation
numerical value
loaded
feature map
input feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810685538.9A
Other languages
Chinese (zh)
Inventor
韩银和
许浩博
王颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201810685538.9A priority Critical patent/CN108875917A/en
Publication of CN108875917A publication Critical patent/CN108875917A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides a control method, including: 1) determining the size n*n of the convolution operation to be performed; 2) according to the size n*n of the convolution operation to be performed, loading the values of the corresponding convolution kernel into m² selected 3*3 convolution computation units and filling every remaining value with 0, where 3m ≥ n; 3) determining the number of cycles required for the convolution computation according to the size of the convolution operation to be performed and the size of the input feature map to be convolved; 4) in each cycle of the convolution computation, loading the values of the corresponding input feature map into the m² 3*3 convolution computation units, the distribution of the input feature map values across the m² 3*3 convolution computation units being consistent with the distribution of the convolution kernel values across those units; and controlling the m² 3*3 convolution computation units loaded with the kernel and feature map values to perform the convolution computations corresponding to the number of cycles.

Description

Control method and device for a convolutional neural network processor
Technical field
The present invention relates to convolutional neural network processors, and in particular to improvements in hardware acceleration for convolutional neural network processors.
Background technique
Artificial intelligence technology has developed rapidly in recent years and has attracted wide attention worldwide; both industry and academia have carried out research on artificial intelligence, which has spread into fields such as visual perception, speech recognition, assisted driving, smart homes and traffic scheduling. Deep learning is the booster of this development. Deep learning uses the topology of deep neural networks for training, optimization and inference; deep neural networks such as convolutional neural networks, deep belief networks and recurrent neural networks are trained through repeated iteration. Taking image recognition as an example, a deep learning algorithm can automatically derive hidden image features through a deep neural network and produce results that are better than those of traditional methods based on pattern recognition analysis.
However, existing deep learning techniques rely on an enormous amount of computation. In the training stage, the weight data of the neural network must be obtained by repeated iteration over massive data; in the inference stage, the neural network must complete the processing of the input data within an extremely short response time (usually milliseconds). This requires the deployed neural network computing circuits (including CPUs, GPUs, FPGAs, ASICs and the like) to reach a computing capability of tens of billions or even trillions of operations per second. It is therefore very necessary to accelerate deep learning in hardware, for example by hardware acceleration of convolutional neural network processors.
It is generally considered that hardware acceleration can be roughly divided into two approaches: one is to perform the computation in parallel with larger-scale hardware, and the other is to improve processing speed or efficiency by designing dedicated hardware circuits.
For the second approach, some prior art maps the neural network directly onto hardware circuits, using a different computation unit for each network layer so that the computation of the layers proceeds in a pipelined fashion. For example, every computation unit except the first takes the output of the previous unit as its input, and each unit only performs the computation of its own network layer; in different time slots of the pipeline, a given unit computes different inputs of that layer. Such prior art is generally aimed at scenarios where different inputs must be processed continuously, for example a video file containing many frames, and it is usually aimed at neural networks with few layers. This is because the number of layers in a deep neural network is large: mapping the network directly onto hardware circuits incurs a very large circuit area, and power consumption grows with circuit area. Moreover, since the running times of the individual network layers differ considerably, to realize a pipeline the running time allotted to every pipeline stage has to be forced to be equal, namely equal to the running time of the slowest stage. For a deep neural network with many layers, designing such a pipeline requires considering a great many factors in order to reduce the waiting time of the faster pipeline stages during pipelined computation.
Other prior art, referring to the regularity of neural network computation, proposes "time-division multiplexing" of the computation units in a neural network processor to improve their reusability. Unlike the pipelined approach above, the same computation unit is used to compute the network layers one after another: the input layer, the first hidden layer, the second hidden layer, ..., the output layer are computed one by one, and the process is repeated in the next iteration. Such prior art can address neural networks with few layers as well as deep neural networks, and is particularly suitable for application scenarios with limited hardware resources. In such scenarios, after the neural network processor has computed layer A for one input, it may not need to compute layer A again for a long time; if each layer used different dedicated hardware as its computation unit, the hardware would be constrained and its reusability would be low. Most prior art is based on this consideration and accordingly improves the hardware of the neural network processor by "time-division multiplexing" the computation units across different layers.
However, no matter which of the above prior-art approaches is used to design a convolutional neural network processor, the hardware utilization still leaves much room for improvement.
Summary of the invention
Therefore, an object of the present invention is to overcome the above-mentioned defects of the prior art and to provide a control method for a convolutional neural network processor, the convolutional neural network processor having 3*3 convolution computation units, the control method comprising:
1) determining the convolution kernel size n*n of the convolution operation to be performed;
2) according to the convolution kernel size n*n of the convolution operation to be performed, loading the values of the corresponding convolution kernel into m² selected 3*3 convolution computation units, and filling every remaining value with 0, where 3m ≥ n;
3) determining the number of cycles required for the convolution computation according to the size of the convolution operation to be performed and the size of the input feature map to be convolved; and
4) according to the number of cycles, in each cycle of the convolution computation, loading the values of the corresponding input feature map into the m² 3*3 convolution computation units, the distribution of the input feature map values across the m² 3*3 convolution computation units being consistent with the distribution of the convolution kernel values across those units;
controlling the m² 3*3 convolution computation units loaded with the kernel and feature map values to perform the convolution computations corresponding to the number of cycles;
5) accumulating the corresponding elements of the convolution results of the m² 3*3 convolution computation units, so as to obtain the final output feature map of the convolution operation.
Preferably, according to the method, step 2) comprises:
if the size of the convolution operation to be performed is less than or equal to 3*3, loading the values of the corresponding convolution kernel into a single 3*3 convolution computation unit and filling every remaining value with 0;
if the size of the convolution operation to be performed is greater than 3*3, loading the values of the corresponding convolution kernel into a corresponding number of 3*3 convolution computation units and filling every remaining value with 0.
Preferably, according to the method, step 4) comprises:
in each cycle of the convolution computation, if the values of the input feature map to be loaded include elements from the leftmost column of the input feature map, loading, in one pass, the elements of the input feature map that match the size of the convolution operation to be performed into the corresponding positions of the convolution computation units and filling the values of all remaining positions with 0; otherwise, shifting the elements that are identical to those of the previous cycle one unit to the left as a whole, and loading the elements of the input feature map that differ from the previous cycle and need to be updated into the positions vacated by the shift.
Preferably, according to the method, step 4) comprises:
in each cycle of the convolution computation, controlling the 3*3 convolution computation units to multiply the elements at corresponding positions of the input feature map and the convolution kernel loaded into them and to accumulate the products, so as to obtain the element at the corresponding position of the output feature map.
Preferably, according to the method, step 2) comprises:
if the size of the convolution operation to be performed is 5*5, loading the values of the 5*5 convolution kernel into four 3*3 convolution computation units and filling every remaining value with 0;
and step 4) comprises:
in each of all the cycles of the convolution computation, loading the values of the corresponding input feature map into the four 3*3 convolution computation units, the distribution of the input feature map values across the four 3*3 convolution computation units being consistent with the distribution of the 5*5 convolution kernel values across those units;
wherein, in each cycle of the convolution computation, if the values of the input feature map to be loaded include elements from the leftmost column of the input feature map, loading, in one pass, the 25 elements of the 5*5 region of the input feature map into the corresponding positions of the four 3*3 convolution computation units and filling the values of all remaining positions with 0; otherwise, shifting the elements that are identical to those of the previous cycle one unit to the left as a whole, and loading the elements of the input feature map that differ from the previous cycle and need to be updated into the positions vacated by the shift.
Preferably, according to the method, step 4) comprises:
in each cycle of the convolution computation, controlling each of the four 3*3 convolution computation units to multiply the elements at corresponding positions of the input feature map and the convolution kernel loaded into it and to accumulate the products,
and step 5) comprises: accumulating the results computed by all four 3*3 convolution computation units, so as to obtain the element at the corresponding position of the output feature map.
Preferably, according to the method, step 2) comprises:
if the size of the convolution operation to be performed is 7*7, loading the values of the 7*7 convolution kernel into nine 3*3 convolution computation units and filling every remaining value with 0;
and step 4) comprises:
in each of all the cycles of the convolution computation, loading the values of the corresponding input feature map into the nine 3*3 convolution computation units, the distribution of the input feature map values across the nine 3*3 convolution computation units being consistent with the distribution of the 7*7 convolution kernel values across those units;
wherein, in each cycle of the convolution computation, if the values of the input feature map to be loaded include elements from the leftmost column of the input feature map, loading, in one pass, the 49 elements of the 7*7 region of the input feature map into the corresponding positions of the nine 3*3 convolution computation units and filling the values of all remaining positions with 0; otherwise, shifting the elements that are identical to those of the previous cycle one unit to the left as a whole, and loading the elements of the input feature map that differ from the previous cycle and need to be updated into the positions vacated by the shift.
Preferably, according to the method, step 4) comprises:
in each cycle of the convolution computation, controlling each of the nine 3*3 convolution computation units to multiply the elements at corresponding positions of the input feature map and the convolution kernel loaded into it and to accumulate the products,
and step 5) comprises: accumulating the results computed by all nine 3*3 convolution computation units, so as to obtain the element at the corresponding position of the output feature map.
Also provided is a control unit for implementing any one of the above control methods.
Also provided is a convolutional neural network processor, comprising: 3*3 convolution computation units and a control unit, the control unit being configured to implement any one of the above methods.
Compared with the prior art, the advantages of the present invention are as follows:
The reusability of the computation units that perform convolution is improved, which reduces the number of hardware computation units that must be provided in a convolutional neural network processor. The convolutional neural network processor no longer needs to provide a large number of hardware computation units of different sizes for convolutional layers that use kernels of different sizes. When performing the computation of one convolutional layer, computation units whose size does not match the kernel size of that layer can still be used for the computation, thereby improving the utilization of the hardware computation units in the convolutional neural network processor.
Detailed description of the invention
Embodiments of the present invention are further described below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of the prior art in which M kinds of convolution kernels are used to perform convolution on an input feature layer to obtain output feature layers, each convolution kernel having N channels;
Fig. 2 is a schematic diagram of a 3*3 convolution operation realized with one 3*3 computation unit in both the prior art and the present invention;
Fig. 3a is a schematic diagram of how the computation units are loaded with the input feature map when a 5*5 convolution operation is performed with four 3*3 computation units according to one embodiment of the present invention;
Fig. 3b is a schematic diagram of a 5*5 convolution operation realized with four 3*3 computation units according to one embodiment of the present invention;
Fig. 4 is a schematic diagram of how the computation unit is loaded with the input feature map when a 3*3 convolution operation is performed with a 3*3 computation unit according to one embodiment of the present invention;
Fig. 5 is a schematic diagram of a 7*7 convolution operation realized with nine 3*3 computation units according to one embodiment of the present invention.
Specific embodiment
The present invention is elaborated below with reference to the accompanying drawings and specific embodiments.
In studying the prior art, the inventors found that the existing classical neural networks, such as AlexNet, GoogleNet, VGG and ResNet, contain different numbers of convolutional layers, and different convolutional layers use convolution kernels of different sizes. Taking AlexNet as an example, the first layer of the network is a convolutional layer whose kernel size is 11*11, the second layer is a convolutional layer whose kernel size is 5*5, the third layer is a convolutional layer whose kernel size is 3*3, and so on.
However, in existing neural network processors, different computation units are provided for convolution kernels of different sizes. As a result, when the computation of a particular convolutional layer is being performed, the other computation units whose size does not match the kernel of that layer are idle.
For example, as shown in Fig. 1, a neural network processor may provide M kinds of convolution kernels, denoted kernel 0 to kernel M-1, each having N channels that are used to convolve the N channels of the input feature layer; convolving one kernel with an input feature layer yields one output feature layer. For one input feature layer, M output feature layers can thus be computed using all M kinds of kernels. If an input feature layer needs to perform the convolution operation using kernel 1, the computation units other than the one corresponding to kernel 1 are idle at that time.
In this regard, the present invention proposes a multiplexing scheme for the computation units: by controlling and adjusting the data actually loaded into a computation unit (the same unit must be loaded both with the values of the convolution kernel and with the values of the input feature map), convolution operations of various sizes are realized with computation units of 3*3 scale, thereby reducing the scale of the hardware computation units that must be used for the convolution operations.
The neural network processor system architecture of the present invention may include the following five parts: an input data storage unit, a control unit, an output data storage unit, a weight storage unit and a computation unit.
The input data storage unit stores the data participating in the computation; the output data storage unit stores the computed neuron responses; the weight storage unit stores the trained neural network weights.
The control unit is connected to the output data storage unit, the weight storage unit and the computation unit respectively; the control unit controls the computation unit to carry out the neural network computation according to the control signals obtained by parsing.
The computation unit performs the corresponding neural network computation according to the control signals generated by the control unit. The computation unit completes most of the operations in the neural network algorithm, i.e. vector multiply-accumulate operations and the like.
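To make the five-part architecture described above more concrete, the following Python sketch models it at a very high level. All class, attribute and method names (ConvUnit3x3, NeuralNetworkProcessor, multiply_accumulate and so on) are illustrative assumptions and are not defined by the patent.

    # Illustrative model of the five-module processor architecture; names are hypothetical.
    import numpy as np

    class ConvUnit3x3:
        """A 3*3 convolution computation unit holding a kernel tile and a feature-map tile."""
        def __init__(self):
            self.kernel_tile = np.zeros((3, 3))
            self.feature_tile = np.zeros((3, 3))

        def multiply_accumulate(self):
            # Multiply the elements at corresponding positions and accumulate the products.
            return float(np.sum(self.kernel_tile * self.feature_tile))

    class NeuralNetworkProcessor:
        def __init__(self, num_units):
            self.input_data_storage = {}    # data participating in the computation
            self.output_data_storage = {}   # computed neuron responses
            self.weight_storage = {}        # trained neural network weights
            self.compute_units = [ConvUnit3x3() for _ in range(num_units)]
            # The control unit (not modelled here) loads the tiles in each cycle
            # and triggers multiply_accumulate on every unit.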
The multiplexing of the computation units according to the present invention can be controlled and realized by the above-mentioned control unit, as will be described below through several embodiments. First, how the traditional prior art uses a 3*3 computation unit to realize a 3*3 convolution operation is introduced. Referring to the example given in Fig. 2, in the prior art a computation unit of 3*3 scale realizes the convolution operation as follows:
In the first cycle, each element of rows 1-3, columns 1-3 of the input feature map (referred to here as the sliding window over the input feature map) is multiplied by the element at the corresponding position of the kernel, and the products are accumulated as the element of row 1, column 1 of the output feature map, i.e. (3 × 2) + (2 × (-8)) = -10.
In the second cycle, each element of rows 1-3, columns 2-4 of the input feature map (the values in the sliding window of the current cycle) is multiplied by the element at the corresponding position of the kernel, and the products are accumulated as the element of row 1, column 2 of the output feature map (not shown in Fig. 2).
And so on: by moving the 3*3 sliding window to the right or downwards 24 times in total, an output feature map of size 5*5 is obtained.
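For reference, the conventional sliding-window behaviour described above can be sketched as follows in Python (assuming NumPy); the function and variable names are illustrative only, and the concrete values of Fig. 2 are not reproduced here.

    import numpy as np

    def conv3x3_sliding_window(feature_map, kernel):
        """Conventional 3*3 convolution: slide a 3*3 window over the input feature map;
        each window position yields one element of the output feature map."""
        h, w = feature_map.shape
        out = np.zeros((h - 2, w - 2))
        for r in range(h - 2):          # vertical window positions
            for c in range(w - 2):      # horizontal window positions
                window = feature_map[r:r + 3, c:c + 3]
                out[r, c] = np.sum(window * kernel)
        return out

    # A 7*7 input with a 3*3 kernel gives a 5*5 output after 24 window moves
    # (25 window positions in total), as described above.
    out = conv3x3_sliding_window(np.arange(49.0).reshape(7, 7), np.ones((3, 3)))
    assert out.shape == (5, 5)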
The present invention does not exclude using the above manner to realize a 3*3 convolution with a 3*3 computation unit. Furthermore, in the present invention, computation units of 3*3 scale can also be controlled to carry out kernel operations of sizes other than 3*3, for example 5*5, 7*7 or 9*9 convolutions.
As described above, in the traditional prior art a computation unit can only be used to perform convolutions of its own size. The prior art gives no teaching on how to realize, for example, 5*5, 7*7 or 9*9 convolutions with 3*3 computation units. On the one hand, the computation unit does not know how to load the convolution kernel and the input feature map. On the other hand, in the prior art the size of the output feature map depends on the number of sliding-window moves: for example, a 3*3 convolution over a 7*7 input feature map has a horizontal and a vertical sliding range of 5 units, and the computation over multiple cycles yields a 5*5 output feature map. This makes it very difficult to realize convolutions of other sizes with a 3*3 computation unit. It will be appreciated that, if the prior art is followed, using a 3*3 computation unit to convolve a 7*7 input feature map can only yield an output feature map of size 5*5 (e.g. as shown in Fig. 2); the computation unit and the processor do not know how the sliding window should be moved so that a convolution such as 5*5 could be obtained with 3*3 computation units.
In this regard, the present invention proposes a corresponding control method which, by scheduling the input feature map and the convolution kernel loaded into the computation units and controlling the multiplication and addition operations, realizes a 5*5 convolution with four 3*3 computation units.
According to one embodiment of the present invention, referring to Fig. 3a, four 3*3 computation units may jointly be loaded with the values of the 5*5 convolution kernel and the values of the input feature map; the four computation units are indicated by dashed lines. For example, in Fig. 3a, the upper-left computation unit is loaded with 9 values, the upper-right unit with 6 values, the lower-left unit with 6 values and the lower-right unit with 4 values.
Fig. 3 b shows specific control method corresponding with Fig. 3 a, and the control method is as follows:
The size of input feature vector figure is 7*7, and the size of the convolution algorithm needed to be implemented is 5*5, thus may determine that convolution Calculating needs to be implemented the 3 × 3=9 period in total.
Judge 5>3, thus need to be completed the convolution fortune having a size of 5*5 jointly by the computing unit of more than one 3*3 It calculates.Here it can choose the computing unit that can be used for being loaded into k 3*3 of the data having a size of 5*5 just.Herein for k's It is selected as:K=m2, 3m can choose the minimum positive integer more than or equal to n.It is of course also possible to select more more than above-mentioned The computing unit of 3*3 execute the convolution algorithm of 5*5.For example shown by Fig. 3 b, select k=4 calculating single here Member.
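The choice of the number of 3*3 units and of the cycle count in this embodiment can be summarized by the following sketch; the parameter and function names are assumptions, not terms of the patent.

    import math

    def plan_convolution(n, input_size):
        """For a kernel of size n*n and an input feature map of size input_size*input_size,
        choose the number of 3*3 units and the number of computation cycles."""
        m = math.ceil(n / 3)           # smallest positive integer m with 3*m >= n
        k = m * m                      # number of 3*3 units loaded jointly
        out_size = input_size - n + 1  # size of the output feature map
        cycles = out_size * out_size   # one cycle per output element
        return k, cycles, out_size

    # The 5*5 convolution over a 7*7 input described above:
    assert plan_convolution(5, 7) == (4, 9, 3)   # four 3*3 units, 9 cycles, 3*3 output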
When the computation units carry out the convolution computation, in each cycle the values of the corresponding convolution kernel and the values of the corresponding input feature map are loaded under control into the four 3*3 computation units.
In the first cycle, the elements of rows 1-3, columns 1-3 of the input feature map are loaded into the upper-left 3*3 computation unit in Fig. 3b, the elements of rows 1-3, columns 4-5 into the upper-right 3*3 unit, the elements of rows 4-5, columns 1-3 into the lower-left 3*3 unit and the elements of rows 4-5, columns 4-5 into the lower-right 3*3 unit, while the remaining elements of row 6 and column 6 are filled with "0". The 5*5 convolution kernel is loaded into the four 3*3 computation units in a similar manner, i.e. the elements of rows 1-3, columns 1-3 of the kernel are loaded into the upper-left 3*3 unit, the elements of rows 1-3, columns 4-5 into the upper-right 3*3 unit, the elements of rows 4-5, columns 1-3 into the lower-left 3*3 unit and the elements of rows 4-5, columns 4-5 into the lower-right 3*3 unit, with the remaining elements of row 6 and column 6 filled with "0". The values of the input feature map and of the convolution kernel are thus loaded into the four 3*3 computation units. Each of the four 3*3 computation units is controlled to multiply the elements at corresponding positions of the input feature map and the kernel loaded into it and to accumulate the products, and the elements at corresponding positions of the results of all four 3*3 computation units are accumulated to obtain the element of row 1, column 1 of the output feature map. Since every element in the computation units other than the values of the original kernel is 0, the computed result is exactly the same as the result of directly performing the 5*5 convolution.
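The first-cycle loading pattern described above amounts to splitting a 5*5 block (the kernel, and likewise the current 5*5 region of the input feature map) into four zero-padded 3*3 tiles. A minimal sketch, with illustrative names, is given below.

    import numpy as np

    def split_into_3x3_tiles(block):
        """Split an n*n block into zero-padded 3*3 tiles in the layout of Fig. 3a:
        rows/columns 1-3 go to the upper-left unit, rows 1-3 / columns 4-5 to the
        upper-right unit, and so on; unused positions are filled with 0."""
        n = block.shape[0]
        m = -(-n // 3)                  # ceil(n / 3)
        tiles = []
        for br in range(m):
            for bc in range(m):
                tile = np.zeros((3, 3))
                part = block[br * 3:(br + 1) * 3, bc * 3:(bc + 1) * 3]
                tile[:part.shape[0], :part.shape[1]] = part
                tiles.append(tile)
        return tiles                    # m*m tiles of size 3*3

    # For a 5*5 kernel, 9, 6, 6 and 4 non-zero values land in the four units,
    # matching the loading pattern of the first cycle described above.
    tiles = split_into_3x3_tiles(np.arange(1.0, 26.0).reshape(5, 5))
    assert [int(np.count_nonzero(t)) for t in tiles] == [9, 6, 6, 4]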
In the second cycle, all the elements of rows 1-5, columns 2-6 of the input feature map in Fig. 3b (i.e. "0,0,2,0,-3; 0,3,-2,5,0; 0,0,0,2,0; 0,0,0,3,0; 0,0,0,0,0") are loaded into the computation units as the new elements of rows 1-5, columns 1-5. The computation units are controlled to perform the multiplication and accumulation on the elements loaded into them, so as to obtain the element of row 1, column 2 of the output feature map.
According to a preferred embodiment of the present invention, the way in which the data of the input feature map are loaded into the above four 3*3 computation units in the second cycle can also be improved, in order to improve the loading efficiency. That is, all elements of rows 1-3, columns 2-3 in the upper-left 3*3 computation unit of Fig. 3b (i.e. "0,0; 0,3; 0,0") are shifted left as a whole by 1 unit to become the new elements of rows 1-3, columns 1-2, and the next column of the input feature map for that unit (i.e. "2; -2; 0") is loaded into the unit as the new elements of rows 1-3, column 3. Similarly, all elements of rows 1-3, column 2 in the upper-right 3*3 unit of Fig. 3b (i.e. "0; 5; 2") are shifted left as a whole by 1 unit to become the new elements of rows 1-3, column 1, and the next column of the input feature map for that unit (i.e. "-3; 0; 0") is loaded into the unit as the new elements of rows 1-3, column 2. All elements of rows 1-2, columns 2-3 in the lower-left 3*3 unit of Fig. 3b (i.e. "0,0; 0,0") are shifted left as a whole by 1 unit to become the new elements of rows 1-2, columns 1-2, and the next column of the input feature map for that unit (i.e. "0; 0") is loaded into the unit as the new elements of rows 1-2, column 3. All elements of rows 1-2, column 2 in the lower-right 3*3 unit of Fig. 3b (i.e. "3; 0") are shifted left as a whole by 1 unit to become the new elements of rows 1-2, column 1, and the next column of the input feature map for that unit (i.e. "0; 0") is loaded into the unit as the new elements of rows 1-2, column 2. The values of the input feature map loaded in the 5*5 arrangement are thereby updated, achieving an effect similar to the sliding window of the traditional scheme. The computation units are likewise controlled to perform the multiplication and accumulation on the elements loaded into them, so as to obtain the element of row 1, column 2 of the output feature map.
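The preferred loading scheme above (shift the elements retained from the previous cycle one position to the left and load only the newly needed values) can be sketched as follows. For simplicity the sketch operates on the whole 5*5 window before it is split into 3*3 tiles; this is an assumed simplification of the per-unit shifting described in the embodiment.

    import numpy as np

    def shift_and_load_window(window, new_column):
        """When the window moves one position to the right, shift the retained
        elements left by one unit and load only the new right-hand column."""
        window[:, :-1] = window[:, 1:].copy()   # elements shared with the previous cycle
        window[:, -1] = new_column              # freshly loaded elements
        return window

    feature_map = np.arange(49.0).reshape(7, 7)
    window = feature_map[0:5, 0:5].copy()                         # cycle 1: loaded in full
    window = shift_and_load_window(window, feature_map[0:5, 5])   # cycle 2: one new column
    assert np.array_equal(window, feature_map[0:5, 1:6])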
The third cycle is completed in the same way.
In the fourth cycle, the elements of rows 2-6, columns 1-5 of the input feature map are loaded into the four 3*3 computation units in a manner similar to that described above, and the computation units are controlled to perform the multiplication and accumulation on the elements loaded into them, so as to obtain the element of row 2, column 1 of the output feature map. In the subsequent fifth and sixth cycles, the elements of the corresponding input feature map are loaded into the computation units in a manner similar to the second and third cycles. And so on, until all nine cycles are completed and the 3*3 output feature map is obtained.
It can be seen that, with the above control method, in the first cycle the computation units are loaded in one pass with the 25 values of the 5*5 region of the input feature map. Similarly, the fourth and seventh cycles each load 25 values of the input feature map in one pass. Correspondingly, in the second and third cycles at most 3 new values of the input feature map need to be loaded into each computation unit, the values already used in the previous cycle are shifted to the left, and the kernel values loaded in the computation units are not modified at all. Similarly, the fifth and sixth cycles and the eighth and ninth cycles load the elements of the input feature map in the same manner as the second and third cycles.
It can thus be ensured that, in every cycle, the position of each element of the input feature map in the computation units corresponds one-to-one with the position of the kernel element with which it is multiplied. Moreover, for units other than the one implementing the control method of the present invention, such as the computation units themselves or the processor, it is not perceivable that what the four 3*3 convolution units actually carry out is a 5*5 convolution operation. In addition, with the above control method, the values of the input feature map loaded by the computation units in each cycle no longer depend directly on a sliding window. On the one hand, the arrangement of the input feature map values loaded into each computation unit does not depend on the actual arrangement of the values within a sliding window of size 5*5; on the other hand, the number of computation cycles does not depend on the number of moves of a 3*3 sliding window equal in size to the computation units. The number and size of the output results can be controlled by the control method of the present invention, whereby a 5*5 convolution can be performed on a 7*7 input feature map using four 3*3 computation units, thereby obtaining a 3*3 output result.
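The complete 5*5-via-four-3*3 computation described in this embodiment can be checked with the following self-contained sketch: because the padded positions are zero in both the kernel tiles and the window tiles, accumulating the partial sums of the four 3*3 units reproduces the direct 5*5 convolution. All names are illustrative.

    import numpy as np

    def conv5x5_via_four_3x3_units(feature_map, kernel_5x5):
        """5*5 convolution carried out only with 3*3 multiply-accumulate units:
        in each cycle the current 5*5 window and the 5*5 kernel are split into four
        zero-padded 3*3 tiles, each unit multiplies and accumulates its tile pair,
        and the four partial sums are accumulated."""
        def tiles(block):
            padded = np.zeros((6, 6))
            padded[:block.shape[0], :block.shape[1]] = block
            return [padded[r:r + 3, c:c + 3] for r in (0, 3) for c in (0, 3)]

        kernel_tiles = tiles(kernel_5x5)           # loaded once, unchanged across cycles
        h, w = feature_map.shape
        out = np.zeros((h - 4, w - 4))
        for r in range(h - 4):
            for c in range(w - 4):                 # one cycle per output element
                window_tiles = tiles(feature_map[r:r + 5, c:c + 5])
                out[r, c] = sum(float(np.sum(kt * wt))
                                for kt, wt in zip(kernel_tiles, window_tiles))
        return out

    # For a 7*7 input the result is a 3*3 output feature map that matches a direct
    # 5*5 convolution, as stated above.
    fmap, kern = np.random.rand(7, 7), np.random.rand(5, 5)
    direct = np.array([[np.sum(fmap[r:r + 5, c:c + 5] * kern) for c in range(3)]
                       for r in range(3)])
    assert np.allclose(conv5x5_via_four_3x3_units(fmap, kern), direct)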
Of course, in the present invention, when a 3*3 convolution is performed with a 3*3 computation unit, the part of the input feature map that is identical to the previous cycle can also be shifted left by one unit and the corresponding 3 new elements of the input feature map filled into the positions vacated by the shift, for example as shown in Fig. 4.
It will be appreciated that the present invention can also use a 3*3 computation unit to carry out convolutions smaller than 3*3, i.e. the values of a kernel of the corresponding size and the values of the input feature map are loaded into the same computation unit and the remaining part is filled with "0". In a specific implementation, the required number of cycles can be determined from the size of the input feature map and the size of the convolution operation to be performed, and the control can be carried out in a manner similar to the above embodiments.
According to one embodiment of the present invention, a control method is provided that realizes a 7*7 convolution with multiple 3*3 computation units; referring to Fig. 5, the specific control method is as follows:
Since 7 > 3, k 3*3 computation units are selected to be loaded with the data of size 7*7: k = m², where m is chosen as the smallest positive integer such that 3m is greater than or equal to n; here k = 9 computation units are selected.
The values of the convolution kernel used are divided into nine parts and loaded under control into the nine 3*3 computation units respectively, and the remaining part is filled with "0". In each cycle, the corresponding data of the input feature map are likewise divided into nine parts and loaded under control into the nine 3*3 computation units respectively, with the remaining part filled with "0". Here, within the nine 3*3 computation units, the distribution of the kernel values and the distribution of the input feature map values are consistent with each other.
Each computation unit is also controlled to perform the multiplication and accumulation on the elements loaded into it, and the corresponding results of all nine computation units are accumulated to obtain the values of the corresponding output feature map.
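The same tiling generalizes from four units to nine, as used here for the 7*7 case (and equally for 9*9 kernels). A sketch under the same illustrative naming as above:

    import numpy as np

    def conv_nxn_via_3x3_units(feature_map, kernel):
        """n*n convolution carried out by m*m zero-padded 3*3 units,
        where m is the smallest positive integer with 3*m >= n."""
        n = kernel.shape[0]
        m = -(-n // 3)                              # ceil(n / 3)

        def tiles(block):
            padded = np.zeros((3 * m, 3 * m))
            padded[:n, :n] = block
            return [padded[r:r + 3, c:c + 3]
                    for r in range(0, 3 * m, 3) for c in range(0, 3 * m, 3)]

        kernel_tiles = tiles(kernel)
        h, w = feature_map.shape
        out = np.zeros((h - n + 1, w - n + 1))
        for r in range(out.shape[0]):
            for c in range(out.shape[1]):
                window_tiles = tiles(feature_map[r:r + n, c:c + n])
                # Each of the m*m units multiplies and accumulates its tile pair;
                # the partial sums of all units are then accumulated.
                out[r, c] = sum(float(np.sum(kt * wt))
                                for kt, wt in zip(kernel_tiles, window_tiles))
        return out

    # A 7*7 kernel over a 9*9 input: nine 3*3 units, 3*3 output feature map.
    assert conv_nxn_via_3x3_units(np.random.rand(9, 9), np.random.rand(7, 7)).shape == (3, 3)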
In this embodiment the input feature map can also be loaded in a manner similar to Fig. 3b.
Similarly, a 9*9 convolution, for example, can also be realized with nine 3*3 computation units.
In the present invention, a corresponding control unit can be provided for the above control method. Such a control unit can be adapted to an existing convolutional neural network processor so as to multiplex its convolution computation units by implementing the above control method, or a matching convolutional neural network processor can be designed on the basis of the hardware resources required by such a control unit, for example using the minimum number of hardware resources that satisfies the above multiplexing scheme.
The scheme provided by the present invention improves the reusability of the computation units that perform convolution, so as to reduce the hardware computation units that must be provided in a convolutional neural network processor; the convolutional neural network processor no longer needs to provide a large number of hardware computation units of different sizes for convolutional layers that use kernels of different sizes. When performing the computation of one convolutional layer, the convolution computations of different convolutional layers can be realized with computation units of the same size, thereby improving the utilization of the hardware computation units in the convolutional neural network processor.
It will be appreciated that the present invention does not exclude performing the computation in parallel with larger-scale hardware as described in the background art, nor improving the reusability of the computation units by way of "time-division multiplexing".
Furthermore, it should be noted that not every step introduced in the above embodiments is necessary; those skilled in the art may make appropriate omissions, substitutions, modifications and the like according to actual needs.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art should understand that modifications or equivalent substitutions of the technical solution of the present invention that do not depart from the spirit and scope of the technical solution of the present invention shall all be covered by the scope of the claims of the present invention.

Claims (10)

1. A control method for a convolutional neural network processor, the convolutional neural network processor having 3*3 convolution computation units, the control method comprising:
1) determining the convolution kernel size n*n of the convolution operation to be performed;
2) according to the convolution kernel size n*n of the convolution operation to be performed, loading the values of the corresponding convolution kernel into m² selected 3*3 convolution computation units, and filling every remaining value with 0, where 3m ≥ n;
3) determining the number of cycles required for the convolution computation according to the size of the convolution operation to be performed and the size of the input feature map to be convolved; and
4) according to the number of cycles, in each cycle of the convolution computation, loading the values of the corresponding input feature map into the m² 3*3 convolution computation units, the distribution of the input feature map values across the m² 3*3 convolution computation units being consistent with the distribution of the convolution kernel values across those units;
controlling the m² 3*3 convolution computation units loaded with the kernel and feature map values to perform the convolution computations corresponding to the number of cycles;
5) accumulating the corresponding elements of the convolution results of the m² 3*3 convolution computation units, so as to obtain the final output feature map of the convolution operation.
2. The method according to claim 1, wherein step 2) comprises:
if the size of the convolution operation to be performed is less than or equal to 3*3, loading the values of the corresponding convolution kernel into a single 3*3 convolution computation unit and filling every remaining value with 0;
if the size of the convolution operation to be performed is greater than 3*3, loading the values of the corresponding convolution kernel into a corresponding number of 3*3 convolution computation units and filling every remaining value with 0.
3. The method according to claim 1, wherein step 4) comprises:
in each cycle of the convolution computation, if the values of the input feature map to be loaded include elements from the leftmost column of the input feature map, loading, in one pass, the elements of the input feature map that match the size of the convolution operation to be performed into the corresponding positions of the convolution computation units and filling the values of all remaining positions with 0; otherwise, shifting the elements that are identical to those of the previous cycle one unit to the left as a whole, and loading the elements of the input feature map that differ from the previous cycle and need to be updated into the positions vacated by the shift.
4. The method according to claim 1, wherein step 4) comprises:
in each cycle of the convolution computation, controlling the 3*3 convolution computation units to multiply the elements at corresponding positions of the input feature map and the convolution kernel loaded into them and to accumulate the products, so as to obtain the element at the corresponding position of the output feature map.
5. The method according to any one of claims 1-4, wherein step 2) comprises:
if the size of the convolution operation to be performed is 5*5, loading the values of the 5*5 convolution kernel into four 3*3 convolution computation units and filling every remaining value with 0;
and step 4) comprises:
in each of all the cycles of the convolution computation, loading the values of the corresponding input feature map into the four 3*3 convolution computation units, the distribution of the input feature map values across the four 3*3 convolution computation units being consistent with the distribution of the 5*5 convolution kernel values across those units;
wherein, in each cycle of the convolution computation, if the values of the input feature map to be loaded include elements from the leftmost column of the input feature map, loading, in one pass, the 25 elements of the 5*5 region of the input feature map into the corresponding positions of the four 3*3 convolution computation units and filling the values of all remaining positions with 0; otherwise, shifting the elements that are identical to those of the previous cycle one unit to the left as a whole, and loading the elements of the input feature map that differ from the previous cycle and need to be updated into the positions vacated by the shift.
6. The method according to claim 5, wherein step 4) comprises:
in each cycle of the convolution computation, controlling each of the four 3*3 convolution computation units to multiply the elements at corresponding positions of the input feature map and the convolution kernel loaded into it and to accumulate the products,
and step 5) comprises: accumulating the results computed by all four 3*3 convolution computation units, so as to obtain the element at the corresponding position of the output feature map.
7. The method according to any one of claims 1-4, wherein step 2) comprises:
if the size of the convolution operation to be performed is 7*7, loading the values of the 7*7 convolution kernel into nine 3*3 convolution computation units and filling every remaining value with 0;
and step 4) comprises:
in each of all the cycles of the convolution computation, loading the values of the corresponding input feature map into the nine 3*3 convolution computation units, the distribution of the input feature map values across the nine 3*3 convolution computation units being consistent with the distribution of the 7*7 convolution kernel values across those units;
wherein, in each cycle of the convolution computation, if the values of the input feature map to be loaded include elements from the leftmost column of the input feature map, loading, in one pass, the 49 elements of the 7*7 region of the input feature map into the corresponding positions of the nine 3*3 convolution computation units and filling the values of all remaining positions with 0; otherwise, shifting the elements that are identical to those of the previous cycle one unit to the left as a whole, and loading the elements of the input feature map that differ from the previous cycle and need to be updated into the positions vacated by the shift.
8. The method according to claim 7, wherein step 4) comprises:
in each cycle of the convolution computation, controlling each of the nine 3*3 convolution computation units to multiply the elements at corresponding positions of the input feature map and the convolution kernel loaded into it and to accumulate the products,
and step 5) comprises: accumulating the results computed by all nine 3*3 convolution computation units, so as to obtain the element at the corresponding position of the output feature map.
9. A control unit for implementing the control method according to any one of claims 1-8.
10. A convolutional neural network processor, comprising: 3*3 convolution computation units and a control unit, the control unit being configured to implement the method according to any one of claims 1-8.
CN201810685538.9A 2018-06-28 2018-06-28 A kind of control method and device for convolutional neural networks processor Pending CN108875917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810685538.9A CN108875917A (en) 2018-06-28 2018-06-28 A kind of control method and device for convolutional neural networks processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810685538.9A CN108875917A (en) 2018-06-28 2018-06-28 A kind of control method and device for convolutional neural networks processor

Publications (1)

Publication Number Publication Date
CN108875917A true CN108875917A (en) 2018-11-23

Family

ID=64295557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810685538.9A Pending CN108875917A (en) 2018-06-28 2018-06-28 A kind of control method and device for convolutional neural networks processor

Country Status (1)

Country Link
CN (1) CN108875917A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948775A (en) * 2019-02-21 2019-06-28 山东师范大学 The configurable neural convolutional network chip system of one kind and its configuration method
CN110377874A (en) * 2019-07-23 2019-10-25 江苏鼎速网络科技有限公司 Convolution algorithm method and system
CN110414672A (en) * 2019-07-23 2019-11-05 江苏鼎速网络科技有限公司 Convolution algorithm method, apparatus and system
CN110443357A (en) * 2019-08-07 2019-11-12 上海燧原智能科技有限公司 Convolutional neural networks calculation optimization method, apparatus, computer equipment and medium
CN113673690A (en) * 2021-07-20 2021-11-19 天津津航计算技术研究所 Underwater noise classification convolution neural network accelerator

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139346A (en) * 2015-07-09 2015-12-09 Tcl集团股份有限公司 Digital image processing method digital image processing device
CN107657581A (en) * 2017-09-28 2018-02-02 中国人民解放军国防科技大学 Convolutional neural network CNN hardware accelerator and acceleration method
CN107818367A (en) * 2017-10-30 2018-03-20 中国科学院计算技术研究所 Processing system and processing method for neutral net
CN107844826A (en) * 2017-10-30 2018-03-27 中国科学院计算技术研究所 Neural-network processing unit and the processing system comprising the processing unit
CN108205700A (en) * 2016-12-20 2018-06-26 上海寒武纪信息科技有限公司 Neural network computing device and method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139346A (en) * 2015-07-09 2015-12-09 Tcl集团股份有限公司 Digital image processing method digital image processing device
CN108205700A (en) * 2016-12-20 2018-06-26 上海寒武纪信息科技有限公司 Neural network computing device and method
CN107657581A (en) * 2017-09-28 2018-02-02 中国人民解放军国防科技大学 Convolutional neural network CNN hardware accelerator and acceleration method
CN107818367A (en) * 2017-10-30 2018-03-20 中国科学院计算技术研究所 Processing system and processing method for neutral net
CN107844826A (en) * 2017-10-30 2018-03-27 中国科学院计算技术研究所 Neural-network processing unit and the processing system comprising the processing unit

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI DU et al.: "A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things", IEEE Transactions on Circuits and Systems I: Regular Papers *
张抢强: "Acceleration of convolutional neural networks with large image inputs based on block convolution", 中国科技论文在线 (Sciencepaper Online) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948775A (en) * 2019-02-21 2019-06-28 山东师范大学 The configurable neural convolutional network chip system of one kind and its configuration method
CN110377874A (en) * 2019-07-23 2019-10-25 江苏鼎速网络科技有限公司 Convolution algorithm method and system
CN110414672A (en) * 2019-07-23 2019-11-05 江苏鼎速网络科技有限公司 Convolution algorithm method, apparatus and system
CN110377874B (en) * 2019-07-23 2023-05-02 江苏鼎速网络科技有限公司 Convolution operation method and system
CN110443357A (en) * 2019-08-07 2019-11-12 上海燧原智能科技有限公司 Convolutional neural networks calculation optimization method, apparatus, computer equipment and medium
CN113673690A (en) * 2021-07-20 2021-11-19 天津津航计算技术研究所 Underwater noise classification convolution neural network accelerator
CN113673690B (en) * 2021-07-20 2024-05-28 天津津航计算技术研究所 Underwater noise classification convolutional neural network accelerator

Similar Documents

Publication Publication Date Title
CN108875917A (en) A kind of control method and device for convolutional neural networks processor
CN106447034B (en) A kind of neural network processor based on data compression, design method, chip
CN106951395B (en) Parallel convolution operations method and device towards compression convolutional neural networks
CN107578095B (en) Neural computing device and processor comprising the computing device
CN107153873B (en) A kind of two-value convolutional neural networks processor and its application method
CN108985449A (en) A kind of control method and device of pair of convolutional neural networks processor
Goodman et al. Brian: a simulator for spiking neural networks in python
CN110059798A (en) Develop the sparsity in neural network
CN109621422A (en) Electronics chess and card decision model training method and device, strategy-generating method and device
CN110298443A (en) Neural network computing device and method
CN112084038B (en) Memory allocation method and device of neural network
CN107977414A (en) Image Style Transfer method and its system based on deep learning
CN106875013A (en) The system and method for optimizing Recognition with Recurrent Neural Network for multinuclear
CN107578098A (en) Neural network processor based on systolic arrays
CN108009627A (en) Neutral net instruction set architecture
CN111176758B (en) Configuration parameter recommendation method and device, terminal and storage medium
CN110121721A (en) The framework accelerated for sparse neural network
CN111291878A (en) Processor for artificial neural network computation
CN106529670A (en) Neural network processor based on weight compression, design method, and chip
CN107451653A (en) Computational methods, device and the readable storage medium storing program for executing of deep neural network
CN109901878A (en) One type brain computing chip and calculating equipment
CN107301453A (en) The artificial neural network forward operation apparatus and method for supporting discrete data to represent
CN109325591A (en) Neural network processor towards Winograd convolution
CN109472356A (en) A kind of accelerator and method of restructural neural network algorithm
CN107301454A (en) The artificial neural network reverse train apparatus and method for supporting discrete data to represent

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181123

RJ01 Rejection of invention patent application after publication