CN106203617A - A kind of acceleration processing unit based on convolutional neural networks and array structure - Google Patents

A kind of acceleration processing unit based on convolutional neural networks and array structure Download PDF

Info

Publication number
CN106203617A
CN106203617A (application CN201610482653.7A)
Authority
CN
China
Prior art keywords
register
processing unit
multiplexer
adder
acceleration processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610482653.7A
Other languages
Chinese (zh)
Other versions
CN106203617B (en)
Inventor
宋博扬
赵秋奇
马芝
刘记朋
韩宇菲
王明江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN INTEGRATED CIRCUIT DESIGN INDUSTRIALIZATION BASE ADMINISTRATION CENTER
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
SHENZHEN INTEGRATED CIRCUIT DESIGN INDUSTRIALIZATION BASE ADMINISTRATION CENTER
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN INTEGRATED CIRCUIT DESIGN INDUSTRIALIZATION BASE ADMINISTRATION CENTER, Shenzhen Graduate School Harbin Institute of Technology filed Critical SHENZHEN INTEGRATED CIRCUIT DESIGN INDUSTRIALIZATION BASE ADMINISTRATION CENTER
Priority to CN201610482653.7A priority Critical patent/CN106203617B/en
Publication of CN106203617A publication Critical patent/CN106203617A/en
Application granted granted Critical
Publication of CN106203617B publication Critical patent/CN106203617B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/22Microcontrol or microprogram arrangements
    • G06F9/28Enhancement of operational speed, e.g. by using several microcontrol devices operating in parallel

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention discloses an acceleration processing unit based on a convolutional neural network, used to perform convolution operations on local data, where the local data comprises multiple items of multimedia data. The acceleration processing unit includes a first register, a second register, a third register, a fourth register, a fifth register, a multiplier, an adder, a first multiplexer, and a second multiplexer. Under the control of the first and second multiplexers, the multiplier and adder of a single acceleration processing unit are reusable, so that one acceleration processing unit needs only one multiplier and one adder to complete a convolution operation. This reduces the number of multipliers and adders used; for the same convolution operation, using fewer multipliers and adders improves processing speed and reduces power consumption, and the chip area of a single acceleration processing unit is smaller.

Description

A kind of acceleration processing unit based on convolutional neural networks and array structure
Technical field
The present invention relates to convolutional neural networks, and more specifically to an acceleration processing unit and array structure for the convolutional layers of a convolutional neural network.
Background art
Deep learning, in contrast to shallow learning, refers to a machine learning rules from historical data by means of algorithms, and making intelligent recognition and predictions about things.
A convolutional neural network (CNN) is a type of deep learning network. Invented in the early 1980s, it is composed of multiple layers of artificial neurons and reflects the way the human brain processes vision. As Moore's law has driven computer technology forward, convolutional neural networks have become able to better mimic the actual operation of biological neural networks. They avoid complex image preprocessing and can take the original image directly as input, and have therefore found increasingly wide application, having been successfully applied to handwritten character recognition, face recognition, human eye detection, pedestrian detection, and robot navigation.
The basic architecture of a convolutional neural network includes multiple convolutional layers; each layer is composed of multiple two-dimensional planes, and each plane is composed of multiple independent neurons. Each neuron performs a convolution operation on local data of the multimedia data; its input is connected to a local receptive field of the previous convolutional layer, and by convolving the data of that local receptive field it extracts the features of the receptive field.
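As background, the per-neuron computation described above is a sliding-window multiply-accumulate over a local receptive field; a minimal one-dimensional sketch (illustrative only, not taken from the patent):

```python
def conv1d(data, weights):
    """Convolve a 1-D data row with a filter: each output is the
    multiply-accumulate of one local receptive field."""
    k = len(weights)
    return [
        sum(data[i + j] * weights[j] for j in range(k))
        for i in range(len(data) - k + 1)
    ]

row = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
print(conv1d(row, kernel))  # each output covers one receptive field
```

Each output element corresponds to what the patent calls one neuron's convolution over its local data.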
In the prior art, acceleration processing units are also commonly used as neurons to perform the convolution operation on local data of multimedia data. Existing acceleration processing units are designed with one adder and one multiplier per input multimedia datum; when the local data that the acceleration processing unit must process contains many items, each acceleration processing unit must include multiple adders and multiple multipliers. This design makes the chip area of the acceleration processing unit large and its power consumption high, and its processing speed also leaves room for improvement.
Summary of the invention
The present application provides an acceleration processing unit based on a convolutional neural network, used to perform a convolution operation on local data, where the local data comprises multiple items of multimedia data. The acceleration processing unit includes a first register, a second register, a third register, a fourth register, a fifth register, a multiplier, an adder, a first multiplexer, and a second multiplexer;
The first register is used to input multimedia data; its output is connected to an input of the multiplier and sends the multimedia data to the multiplier;
The second register is used to input filter weights; its output is connected to an input of the multiplier and sends the filter weights to the multiplier;
The multiplier multiplies the multimedia data by the filter weights; its output is connected to the third register, to which the product is sent;
The output of the third register is connected to the first terminal of the first multiplexer;
The second terminal of said first multiplexer is connected to the adder, and the third terminal is the partial-sum input from the previous acceleration processing unit; by switching state, said first multiplexer either connects the third register to the adder or connects the partial-sum input of the previous acceleration processing unit to the adder;
Said adder is also connected to the fifth register and the fourth register; it adds the product, or the partial sum of the previous acceleration processing unit, delivered by the first multiplexer to the data in the fifth register, and outputs the sum to the fourth register;
The first and second terminals of said second multiplexer are connected to the fourth register and the fifth register respectively; said fourth register is connected to the fifth register through the second multiplexer.
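The datapath just described can be sketched as a small behavioral simulation. The register and multiplexer names follow the patent's description, but the code itself is an illustrative assumption, not the patent's implementation:

```python
class AccelUnit:
    """One acceleration processing unit: one multiplier, one adder,
    and two multiplexers that switch between local multiply-accumulate
    and merging a neighbouring unit's partial sum."""

    def __init__(self):
        self.r5 = 0  # fifth register: running partial sum

    def mac_step(self, data, weight):
        # mux1 = 0, mux2 = 0: multiply, add to r5 via the fourth register
        product = data * weight        # multiplier -> third register
        self.r5 = product + self.r5    # adder -> fourth -> fifth register
        return self.r5

    def merge(self, prev_partial_sum):
        # mux1 = 1: the adder takes the previous unit's partial sum
        total = prev_partial_sum + self.r5
        self.r5 = 0                    # mux2 = 1: fifth register is reset
        return total

pe = AccelUnit()
for d, w in zip([1, 2, 3], [4, 5, 6]):
    pe.mac_step(d, w)
print(pe.merge(10))  # 1*4 + 2*5 + 3*6 plus the neighbour's partial sum 10
```

The key point the sketch shows is that a single multiplier/adder pair serves every multiply-add step, with the multiplexer state deciding what reaches the adder.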
Preferably, said first multiplexer stays in a first state, connecting the third register to the adder, while the acceleration processing unit has not completed the multiply-add operations on the local data, and switches to a second state, connecting the partial-sum input of the previous acceleration processing unit to the adder, after the acceleration processing unit completes the multiply-add operations on the local data.
Preferably, said second multiplexer remains in a first state, connecting the fourth register to the fifth register, while the acceleration processing unit has not completed the multiply-add operations on the local data, and switches to a second state after the acceleration processing unit completes the multiply-add operations on the local data, so as to reset the fifth register.
Preferably, the third terminal of said second multiplexer is a reset terminal; said second multiplexer switches to the second state, connecting the reset terminal to the fifth register, after the acceleration processing unit completes the multiply-add operations on the local data.
Preferably, the unit also includes a first memory, a second memory, and a third memory. Said first memory is connected to the input of the first register; it inputs and stores the local data on which the convolution operation is to be performed, and sends the multiple items of multimedia data in the local data to the first register in turn. Said second memory is connected to the input of the second register; it inputs and stores the filter weights and sends them to the second register. Said third memory is connected to the input of the fourth register; it inputs and stores the sum output by the adder, and sends that sum to the fourth register.
Preferably, said adder also outputs the sum to the next acceleration processing unit.
The present application also provides an array structure based on a convolutional neural network, including multiple of the acceleration processing units described above. The acceleration processing units are arranged as a matrix of M rows and N columns, where M and N are integers greater than or equal to 1, and the acceleration processing units of each column are connected one after another.
Preferably, in each column, the output of the adder of the previous acceleration processing unit is connected to the third terminal of the first multiplexer of the next acceleration processing unit.
Preferably, the acceleration processing units in the same row receive the same filter weights, and the acceleration processing units on the same diagonal receive the same local data.
Preferably, the acceleration processing units in different rows receive different filter weights.
The beneficial effects of the invention are as follows: under the control of the first and second multiplexers, the multiplier and adder of a single acceleration processing unit are reusable, so that one acceleration processing unit needs only one multiplier and one adder to complete a convolution operation. This reduces the number of multipliers and adders used; for the same convolution operation, using fewer multipliers and adders improves processing speed and reduces power consumption, and the chip area of a single acceleration processing unit is smaller.
Brief description of the drawings
Fig. 1 is a structural block diagram of an acceleration processing unit based on a convolutional neural network provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the convolution operation process of an acceleration processing unit based on a convolutional neural network provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the column-wise distribution of an array structure based on a convolutional neural network according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the row-wise distribution of an array structure based on a convolutional neural network according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the diagonal distribution of an array structure based on a convolutional neural network according to an embodiment of the present invention.
Detailed description of the invention
The technical solution of the present invention is described clearly and completely below through specific embodiments in combination with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them.
Embodiment one:
Referring to Fig. 1, this embodiment provides an acceleration processing unit based on a convolutional neural network. The acceleration processing unit 61 includes a first register 21, a second register 22, a third register 23, a fourth register 24, a fifth register 25, a multiplier 41, an adder 51, a first multiplexer 31, and a second multiplexer 32.
The first register 21 is connected to one input of the multiplier 41; it inputs multimedia data and sends that data to the multiplier 41. The second register 22 is connected to the other input of the multiplier 41; it inputs filter weights and sends them to the multiplier 41. The output of the multiplier 41 is connected to the third register 23; the multiplier multiplies the multimedia data by the filter weights and sends the product to the third register 23.
The first terminal of the first multiplexer 31 is connected to the output of the third register 23, the second terminal is connected to one input of the adder 51, and the third terminal is the partial-sum input from the previous acceleration processing unit. When the first multiplexer 31 is switched to the first state (e.g. set to 0), it connects the third register 23 to the adder 51 and sends the data in the third register 23 to the adder 51; when it is switched to the second state (e.g. set to 1), it connects its third terminal to the adder 51 and sends the partial sum of the previous acceleration processing unit to the adder 51.
The other input of the adder 51 is connected to the fifth register 25, and the output of the adder 51 is connected to the fourth register 24. The adder 51 takes the data in the third register 23 and the fifth register 25, adds them, and outputs the sum (also called the internal partial sum) to the fourth register 24.
The first and second terminals of the second multiplexer 32 are connected to the fourth register 24 and the fifth register 25 respectively; the third terminal of the second multiplexer 32 is a reset terminal. When the second multiplexer 32 is switched to the first state (e.g. set to 0), it connects the fourth register 24 to the fifth register 25 and sends the internal partial sum in the fourth register 24 to the fifth register 25; when it is switched to the second state (e.g. set to 1), it connects its third terminal to the fifth register 25 and resets the fifth register 25, zeroing its data.
In some embodiments, to facilitate sending data to the registers, the acceleration processing unit 61 also includes a first memory 11, a second memory 12, and a third memory 13. The first memory 11 is connected to the input of the first register 21; it inputs and stores the local data on which the convolution operation is to be performed, and sends the multiple items of multimedia data in the local data to the first register 21 in turn. The second memory 12 is connected to the input of the second register 22; it inputs and stores the filter weights and sends them to the second register 22. The third memory 13 is connected to the input of the fourth register 24; it inputs and stores the internal partial sum output by the adder 51, and sends the internal partial sum to the fourth register 24.
The acceleration processing unit 61 performs a convolution operation on local data, and the local data comprises multiple items of multimedia data. The multimedia data may be video data, image data, or audio data. When the multimedia data is video data, each multimedia datum can be regarded as corresponding to one pixel.
Image data is taken as an example below to illustrate the convolution operation process of the acceleration processing unit 61.
With reference to Fig. 1 and Fig. 2, a single acceleration processing unit 61 based on a convolutional neural network works as follows:
Step 10: read the image data and filter weights on which the convolution operation is to be performed. If an image datum is not 0, it is stored in the first memory 11 and sent to the first register 21 when it needs to be fetched; if an image datum is 0, the value 0 is routed directly to the first register 21 without fetching, a skip-or-gate strategy that avoids unnecessary reads and computation. The filter weights are stored in the second memory 12 and sent to the second register 22 when needed. Data is fetched serially, one item at a time: in the first cycle, the acceleration processing unit 61 sends the first image datum of the local data to be convolved to the first register 21; in the second cycle, it sends the second image datum to the first register 21; and so on, reading in the image data in order. The filter weights are generated by the processor according to the needs of the convolution operation.
Step 20: multiplication. The image datum in the first register 21 and the filter weight in the second register 22 are sent to the multiplier 41, which performs the multiplication; the product is output to the third register 23.
Step 30: addition. Because the multiply-add operations in the acceleration processing unit 61 are not yet finished, the first multiplexer 31 is set to 0; with the first multiplexer 31 at 0, the data in the third register 23 is sent to the adder 51, which adds it to the previous internal partial sum in the fifth register 25. For the first internal convolution operation, the fifth register 25 holds zero; for subsequent internal convolution operations, it holds the internal partial sum of the previous convolution operation. The sum of this convolution operation (the internal partial sum) is output to the fourth register 24, completing one internal convolution operation and yielding the partial sum of the first image datum and filter weight. Because the multiply-add operations in the acceleration processing unit 61 are not yet finished, the second multiplexer 32 is set to 0; with the second multiplexer 32 at 0, the internal partial sum is sent from the fourth register 24 to the fifth register 25.
Step 40: the acceleration processing unit 61 determines whether the internal convolution operations on all the local data are complete. If not, steps 10, 20, and 30 are repeated: the second image datum is fetched and input to the first register 21, the second filter weight is input to the second register 22, both are sent to the multiplier 41 and multiplied, and the product is sent to the third register 23. Because the multiply-add operations in the acceleration processing unit 61 are not yet finished, the first multiplexer 31 is set to 0, and the data in the third register 23 is sent through the first multiplexer to the adder 51 and summed with the data from the fifth register 25, yielding the partial sum of the second image datum and filter weight. The partial sum from the adder 51 is sent to the fourth register 24; since the multiply-add operations in the acceleration processing unit are still not finished, the second multiplexer 32 is set to 0, and the data in the fourth register 24 is sent through the second multiplexer to the fifth register 25, completing the internal convolution operation on the second image datum. This continues until the last image datum of the local data has been fetched, multiplied by its filter weight, and accumulated as above, producing the partial sum of this acceleration processing unit, which, by the same operations as before, finally enters the fifth register 25. When all internal convolution operations are complete, step 50 is carried out.
Step 50: after the multiply-add operations on the local data in the acceleration processing unit 61 have finished, the first multiplexer 31 and the second multiplexer 32 are set to 1. With the first multiplexer 31 at 1, the partial sum of the previous acceleration processing unit is sent through the first multiplexer 31 to the adder 51, the final partial sum of this acceleration processing unit 61 is sent from the fifth register 25 to the adder 51, and the two are summed, giving the superposed partial sum of the two acceleration processing units; this superposed partial sum is output and sent to the next acceleration processing unit. When the second multiplexer 32 switches from state 0 to 1, the fourth register 24 no longer sends data to the fifth register 25, and the data in the fifth register 25 is cleared.
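Steps 10 through 50 can be sketched as a short loop. The multiplexer settings follow the description above, but the code is an illustrative assumption, not the patent's hardware:

```python
def convolve_unit(local_data, weights, prev_partial_sum):
    """Steps 10-50 for one acceleration processing unit:
    serial multiply-accumulate over the local data, then a merge
    with the previous unit's partial sum once both muxes switch to 1."""
    r5 = 0                       # fifth register starts at zero
    for x, w in zip(local_data, weights):
        if x == 0:
            product = 0          # step 10: zero data is skipped/gated
        else:
            product = x * w      # step 20: multiplier -> third register
        r5 = product + r5        # steps 30-40: mux1 = 0, mux2 = 0
    # step 50: mux1 = 1 feeds the neighbour's partial sum to the adder
    return prev_partial_sum + r5

print(convolve_unit([1, 0, 3], [2, 9, 4], 5))  # 1*2 + 0 + 3*4, plus 5
```

Note that the zero-skip branch mirrors the gating strategy of step 10: a zero datum contributes nothing, so neither a memory fetch nor a multiplication is needed.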
In this embodiment, under the control of the first multiplexer 31 and the second multiplexer 32, the multiplier 41 and adder 51 of a single acceleration processing unit are reusable, so that one acceleration processing unit needs only one multiplier and one adder to complete a convolution operation. This reduces the number of multipliers and adders used; for the same convolution operation, using fewer multipliers and adders improves processing speed and reduces power consumption, and the chip area of a single acceleration processing unit is smaller.
Embodiment two:
Referring to Fig. 3 to Fig. 5, an array structure based on a convolutional neural network is shown, including multiple of the acceleration processing units described above. The acceleration processing units are arranged as a matrix of M rows and N columns, where M and N are integers greater than or equal to 1, and the acceleration processing units of each column are connected one after another.
In this embodiment, the acceleration processing units are arranged as a matrix of 3 rows and 3 columns. In each column, the output of the adder of the previous acceleration processing unit is connected to the third terminal of the first multiplexer of the next acceleration processing unit.
The acceleration processing units in the same row receive the same filter weights; the acceleration processing units on the same diagonal receive the same local data.
The acceleration processing units in different rows receive different filter weights.
The convolutional-layer computation process of multiple acceleration processing units is described below with reference to the drawings.
With reference to Fig. 1 to Fig. 5, the computation process of the array structure based on a convolutional neural network is as follows:
As shown in Fig. 3, the adder 51 of each acceleration processing unit is connected to the first multiplexer 31 of the next acceleration processing unit, and the partial sums output by each row all move vertically, so that the partial sums of consecutive acceleration processing units are accumulated. The results can be read out at the top row when a computation pass ends, and are delivered by a buffer to the bottom row of the array at the start of the next computation pass.
For example, acceleration processing units PE1.1, PE2.1, and PE3.1 first perform their internal convolution operations independently, storing the final results in their respective fifth registers 25. Then the partial sum output by PE3.1 and the partial sum in the fifth register 25 of PE2.1 are summed and accumulated in the adder 51 of PE2.1, giving the first accumulated partial sum. PE2.1 sends this first accumulated partial sum to PE1.1, where it is summed again, in the adder 51 of PE1.1, with the partial sum in the fifth register 25 of PE1.1, finally outputting the partial sum of all the acceleration processing units in the column.
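The column-wise accumulation can be sketched as chaining the per-unit partial sums from the bottom row upward. The PE ordering follows Fig. 3; the code is an illustrative assumption:

```python
def column_sum(unit_partial_sums):
    """Accumulate one column bottom-up: each unit's adder combines the
    partial sum arriving from the unit below with its own
    fifth-register value, and passes the result upward."""
    total = 0
    for ps in reversed(unit_partial_sums):  # PE3.1 -> PE2.1 -> PE1.1
        total = total + ps                  # adder 51 of each unit
    return total

# fifth-register contents of PE1.1, PE2.1, PE3.1 after internal convolution
print(column_sum([10, 20, 30]))  # → 60
```

The order of traversal matters only in hardware terms (which adder fires when); the final column total is the same sum of all per-unit partial sums.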
It should also be pointed out that, as shown in Fig. 4 and Fig. 5, the acceleration processing units in the same row receive the same filter weights, the acceleration processing units on the same diagonal receive the same image data, and the acceleration processing units in different rows receive different filter weights. Since the whole image has several rows and each acceleration processing unit only processes a single row of the image, the acceleration processing units must each process one row of data, after which the convolution results of the rows are accumulated. The input data on the same diagonal are identical, while the input image data on different diagonals differ: the image data input on different diagonals correspond to different rows of the image. Processing different rows of the image requires different filter weights; for example, one filter weight can only be used to process the image data of the first row, and to process the image data of the second row a new filter weight is needed. Therefore the acceleration processing units of the same row can use the same filter weights, while the acceleration processing units of different rows use different filter weights.
For example, the filter weights in acceleration processing units PE1.1, PE1.2, and PE1.3 are identical; the image data input to PE2.1 and PE1.2 are identical; and the filter weights in PE1.1, PE2.2, and PE3.1 differ from one another.
In this way, the multimedia data of one row is processed at the same time, different filter weights are applied to the multimedia data of different rows, and after each row of data has been processed separately, the per-row results are accumulated, so that all of the multimedia data is processed quickly and reliably.
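The row/diagonal mapping above resembles a weight-stationary dataflow; a rough sketch of the index pattern for a 3×3 array follows (an interpretation for illustration, not taken from the patent):

```python
def build_mapping(rows=3, cols=3):
    """Assign each PE (r, c) a filter-weight row and an image-data row:
    PEs in the same array row share weights; PEs on the same
    anti-diagonal (constant r + c) share image data."""
    mapping = {}
    for r in range(rows):
        for c in range(cols):
            mapping[(r, c)] = {"weight_row": r, "data_row": r + c}
    return mapping

m = build_mapping()
# PE2.1 and PE1.2 (0-indexed (1, 0) and (0, 1)) sit on the same diagonal,
# so they receive the same image row
print(m[(1, 0)]["data_row"] == m[(0, 1)]["data_row"])  # → True
```

Under this indexing, "same row, same weights" and "same diagonal, same data" both fall out of the two fields of each mapping entry.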
The present invention has been illustrated above with specific examples, which are intended only to aid understanding of the present invention, not to limit it. Those skilled in the art may, according to the idea of the present invention, also make some simple deductions, variations, or substitutions.

Claims (10)

1. An acceleration processing unit based on a convolutional neural network, for performing a convolution operation on local data, said local data comprising multiple items of multimedia data, characterised by comprising a first register, a second register, a third register, a fourth register, a fifth register, a multiplier, an adder, a first multiplexer, and a second multiplexer;
the first register is used to input multimedia data, and its output is connected to an input of the multiplier, sending the multimedia data to the multiplier;
the second register is used to input filter weights, and its output is connected to an input of the multiplier, sending the filter weights to the multiplier;
the multiplier multiplies the multimedia data by the filter weights, and its output is connected to the third register, sending the product to the third register;
the output of the third register is connected to the first terminal of the first multiplexer;
the second terminal of said first multiplexer is connected to the adder, and the third terminal is the partial-sum input from the previous acceleration processing unit; by switching state, said first multiplexer connects either the third register or the partial-sum input of the previous acceleration processing unit to the adder;
said adder is also connected to the fifth register and the fourth register, adds the product, or the partial sum of the previous acceleration processing unit, delivered by the first multiplexer to the data in the fifth register, and outputs the sum to the fourth register;
the first and second terminals of said second multiplexer are connected to the fourth register and the fifth register respectively, and said fourth register is connected to the fifth register through the second multiplexer.
2. The acceleration processing unit of claim 1, characterised in that said first multiplexer stays in a first state, connecting the third register to the adder, while the acceleration processing unit has not completed the multiply-add operations on the local data, and switches to a second state, connecting the partial-sum input of the previous acceleration processing unit to the adder, after the acceleration processing unit completes the multiply-add operations on the local data.
3. The acceleration processing unit of claim 1, characterised in that said second multiplexer remains in a first state, connecting the fourth register to the fifth register, while the acceleration processing unit has not completed the multiply-add operations on the local data, and switches to a second state after the acceleration processing unit completes the multiply-add operations on the local data, so as to reset the fifth register.
4. The acceleration processing unit of claim 3, characterised in that the third terminal of said second multiplexer is a reset terminal, and said second multiplexer switches to the second state, connecting the reset terminal to the fifth register, after the acceleration processing unit completes the multiply-add operations on the local data.
5. the acceleration processing unit as according to any one of Claims 1-4, it is characterised in that also include first memory, Two memorizeies and the 3rd memorizer, the input of described first memory and the first depositor connects, and needs for inputting and storing The local data of convolution algorithm to be carried out, and the multiple multi-medium datas in local data are sent to the first depositor successively; The input of described second memory and the second depositor connects, and is used for inputting and storing filter weights, and is weighed by wave filter Value is sent to the second depositor;The input of described 3rd memorizer and the 4th depositor connects, and is used for inputting and storing addition Result after the addition of device output, and the result after will add up is sent to the 4th depositor.
6. the acceleration processing unit as according to any one of Claims 1-4, it is characterised in that described adder also will add up After result export rear one accelerate processing unit.
7. an array structure based on convolutional neural networks, it is characterised in that include multiple as arbitrary in claim 1 to 6 Acceleration processing unit described in Xiang, multiple acceleration processing units be rendered as M row N row matrix shape, wherein M and N for more than or Integer equal to 1, is connected before and after the acceleration processing unit of every string.
8. array structure as claimed in claim 7, it is characterised in that in every string, the adder of previous acceleration processing unit Outfan connect after the 3rd end of the first MUX accelerating processing unit.
9. as claimed in claim 7 or 8 array structure, it is characterised in that with in the acceleration processing unit of a line, the filter of input Ripple device weights are identical;Being positioned in the acceleration processing unit on same diagonal, the local data of input is identical.
10. array structure as claimed in claim 9, it is characterised in that in the acceleration processing unit of different rows, the filtering of input Device weights are different.
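Claims 7 through 10 describe a systolic mapping of a 2-D convolution: each array row holds one row of filter weights, units on a diagonal share one row of the input, and the chained adders of each column accumulate one row of the output. A minimal Python sketch of that mapping (illustrative only, not part of the patent; the function name is an assumption):

```python
def conv2d_rowwise(image, kernel):
    """Compute a valid 2-D convolution with the claimed dataflow:
    array row r holds kernel row r (same weights across a row, claim 9),
    the unit at (row r, column c) reads image row r + c (diagonal input
    sharing, claim 9), and each column sums its units' 1-D results via
    the chained adders (claims 7 and 8)."""
    M = len(kernel)               # array rows = kernel height
    K = len(kernel[0])            # taps per 1-D convolution
    N = len(image) - M + 1        # array columns = output rows
    W = len(image[0]) - K + 1     # output width
    out = []
    for c in range(N):            # one array column per output row
        col_sum = [0] * W         # partial sums flowing down the column
        for r in range(M):        # each row contributes one 1-D result
            row = image[r + c]    # diagonal sharing of input rows
            for x in range(W):
                for k in range(K):
                    col_sum[x] += kernel[r][k] * row[x + k]
        out.append(col_sum)
    return out
```

With a 3×3 input and a 2×2 kernel whose only nonzero taps are the main diagonal, each output element is the sum of an input element and its lower-right neighbor, which matches the column-accumulated result.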
CN201610482653.7A 2016-06-27 2016-06-27 A kind of acceleration processing unit and array structure based on convolutional neural networks Expired - Fee Related CN106203617B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610482653.7A CN106203617B (en) 2016-06-27 2016-06-27 A kind of acceleration processing unit and array structure based on convolutional neural networks

Publications (2)

Publication Number Publication Date
CN106203617A true CN106203617A (en) 2016-12-07
CN106203617B CN106203617B (en) 2018-08-21

Family

ID=57462215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610482653.7A Expired - Fee Related CN106203617B (en) 2016-06-27 2016-06-27 A kind of acceleration processing unit and array structure based on convolutional neural networks

Country Status (1)

Country Link
CN (1) CN106203617B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0422348A2 (en) * 1989-10-10 1991-04-17 Hnc, Inc. Two-dimensional systolic array for neural networks, and method
US5471627A (en) * 1989-10-10 1995-11-28 Hnc, Inc. Systolic array image processing system and method
CN103019656A (en) * 2012-12-04 2013-04-03 中国科学院半导体研究所 Dynamically reconfigurable multi-stage parallel single instruction multiple data array processing system
CN103691058A (en) * 2013-12-10 2014-04-02 天津大学 Deep brain stimulation FPGA (Field Programmable Gate Array) experimental platform for basal ganglia and thalamencephalon network for parkinson's disease
EP3035204A1 (en) * 2014-12-19 2016-06-22 Intel Corporation Storage device and method for performing convolution operations
EP3035249A1 (en) * 2014-12-19 2016-06-22 Intel Corporation Method and apparatus for distributed and cooperative computation in artificial neural networks
CN104504205A (en) * 2014-12-29 2015-04-08 南京大学 Parallelizing two-dimensional division method of symmetrical FIR (Finite Impulse Response) algorithm and hardware structure of parallelizing two-dimensional division method
CN105528191A (en) * 2015-12-01 2016-04-27 中国科学院计算技术研究所 Data accumulation apparatus and method, and digital signal processing device
CN105681628A (en) * 2016-01-05 2016-06-15 西安交通大学 Convolution network arithmetic unit, reconfigurable convolution neural network processor and image de-noising method of reconfigurable convolution neural network processor

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
凡保磊: "Research on the Parallelization of Convolutional Neural Networks", China Master's Theses Full-text Database, Information Science and Technology Series *
陆志坚: "Research on FPGA-based Parallel Structures for Convolutional Neural Networks", China Doctoral Dissertations Full-text Database, Information Science and Technology Series *

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018108126A1 (en) * 2016-12-14 2018-06-21 上海寒武纪信息科技有限公司 Neural network convolution operation device and method
WO2018107383A1 (en) * 2016-12-14 2018-06-21 上海寒武纪信息科技有限公司 Neural network convolution computation method and device, and computer-readable storage medium
CN108629405B (en) * 2017-03-22 2020-09-18 杭州海康威视数字技术股份有限公司 Method and device for improving calculation efficiency of convolutional neural network
CN108629405A (en) * 2017-03-22 2018-10-09 杭州海康威视数字技术股份有限公司 The method and apparatus for improving convolutional neural networks computational efficiency
CN110494867A (en) * 2017-03-23 2019-11-22 三星电子株式会社 Method for operating the electronic device of machine learning and for operating machine learning
CN110494867B (en) * 2017-03-23 2024-06-07 三星电子株式会社 Electronic device for operating machine learning and method for operating machine learning
US11907826B2 (en) 2017-03-23 2024-02-20 Samsung Electronics Co., Ltd Electronic apparatus for operating machine learning and method for operating machine learning
CN108629406B (en) * 2017-03-24 2020-12-18 展讯通信(上海)有限公司 Arithmetic device for convolutional neural network
CN108629406A (en) * 2017-03-24 2018-10-09 展讯通信(上海)有限公司 Arithmetic unit for convolutional neural networks
US11567770B2 (en) 2017-04-13 2023-01-31 Nxp B.V. Human-machine-interface system comprising a convolutional neural network hardware accelerator
EP3388981A1 (en) * 2017-04-13 2018-10-17 Nxp B.V. A human-machine-interface system
CN107622305A (en) * 2017-08-24 2018-01-23 中国科学院计算技术研究所 Processor and processing method for neural networks
CN107844826A (en) * 2017-10-30 2018-03-27 中国科学院计算技术研究所 Neural-network processing unit and the processing system comprising the processing unit
CN107844826B (en) * 2017-10-30 2020-07-31 中国科学院计算技术研究所 Neural network processing unit and processing system comprising same
WO2019104695A1 (en) * 2017-11-30 2019-06-06 深圳市大疆创新科技有限公司 Arithmetic device for neural network, chip, equipment and related method
CN108701015A (en) * 2017-11-30 2018-10-23 深圳市大疆创新科技有限公司 For the arithmetic unit of neural network, chip, equipment and correlation technique
CN107862378A (en) * 2017-12-06 2018-03-30 芯原微电子(上海)有限公司 Convolutional neural networks accelerated method and system, storage medium and terminal based on multinuclear
CN107862378B (en) * 2017-12-06 2020-04-24 芯原微电子(上海)股份有限公司 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal
CN108038815A (en) * 2017-12-20 2018-05-15 深圳云天励飞技术有限公司 Integrated circuit
WO2019119480A1 (en) * 2017-12-20 2019-06-27 深圳云天励飞技术有限公司 Integrated circuit
US10706353B2 (en) 2017-12-20 2020-07-07 Shenzhen Intellifusion Technologies Co., Ltd. Integrated circuit
CN109993272A (en) * 2017-12-29 2019-07-09 北京中科寒武纪科技有限公司 Convolution and down-sampled arithmetic element, neural network computing unit and field programmable gate array IC
CN108491926A (en) * 2018-03-05 2018-09-04 东南大学 A kind of hardware-accelerated design method of the efficient depth convolutional neural networks of low bit based on logarithmic quantization, module and system
US11580372B2 (en) 2018-03-13 2023-02-14 Recogni Inc. Efficient convolutional engine
CN112236783A (en) * 2018-03-13 2021-01-15 雷哥尼公司 Efficient convolution engine
US11694069B2 (en) 2018-03-13 2023-07-04 Recogni Inc. Methods for processing data in an efficient convolutional engine with partitioned columns of convolver units
US11694068B2 (en) 2018-03-13 2023-07-04 Recogni Inc. Methods for processing horizontal stripes of data in an efficient convolutional engine
US11645504B2 (en) 2018-03-13 2023-05-09 Recogni Inc. Methods for processing vertical stripes of data in an efficient convolutional engine
CN112236783B (en) * 2018-03-13 2023-04-11 雷哥尼公司 Efficient convolution engine
US11593630B2 (en) 2018-03-13 2023-02-28 Recogni Inc. Efficient convolutional engine
CN110659445A (en) * 2018-06-29 2020-01-07 龙芯中科技术有限公司 Arithmetic device and processing method thereof
CN110659445B (en) * 2018-06-29 2022-12-30 龙芯中科技术股份有限公司 Arithmetic device and processing method thereof
CN109948784A (en) * 2019-01-03 2019-06-28 重庆邮电大学 A kind of convolutional neural networks accelerator circuit based on fast filtering algorithm
CN110059818A (en) * 2019-04-28 2019-07-26 山东师范大学 Neural convolution array circuit core, processor and the circuit that convolution nuclear parameter can match
CN111144556A (en) * 2019-12-31 2020-05-12 中国人民解放军国防科技大学 Hardware circuit of range batch processing normalization algorithm for deep neural network training and reasoning
CN113222126A (en) * 2020-01-21 2021-08-06 上海商汤智能科技有限公司 Data processing device and artificial intelligence chip
CN112115095B (en) * 2020-06-12 2022-07-08 苏州浪潮智能科技有限公司 Reconfigurable hardware for Hash algorithm and operation method
CN112115095A (en) * 2020-06-12 2020-12-22 苏州浪潮智能科技有限公司 Reconfigurable hardware for Hash algorithm and operation method
CN112288085A (en) * 2020-10-23 2021-01-29 中国科学院计算技术研究所 Convolutional neural network acceleration method and system
CN112288085B (en) * 2020-10-23 2024-04-09 中国科学院计算技术研究所 Image detection method and system based on convolutional neural network
CN112598122A (en) * 2020-12-23 2021-04-02 北方工业大学 Convolutional neural network accelerator based on variable resistance random access memory
CN112598122B (en) * 2020-12-23 2023-09-05 北方工业大学 Convolutional neural network accelerator based on variable resistance random access memory
CN113361687B (en) * 2021-05-31 2023-03-24 天津大学 Configurable addition tree suitable for convolutional neural network training accelerator
CN113361687A (en) * 2021-05-31 2021-09-07 天津大学 Configurable addition tree suitable for convolutional neural network training accelerator
CN113591025A (en) * 2021-08-03 2021-11-02 深圳思谋信息科技有限公司 Feature map processing method and device, convolutional neural network accelerator and medium
CN117369707A (en) * 2023-12-04 2024-01-09 杭州米芯微电子有限公司 Digital signal monitoring circuit and chip
CN117369707B (en) * 2023-12-04 2024-03-19 杭州米芯微电子有限公司 Digital signal monitoring circuit and chip

Also Published As

Publication number Publication date
CN106203617B (en) 2018-08-21

Similar Documents

Publication Publication Date Title
CN106203617A (en) A kind of acceleration processing unit based on convolutional neural networks and array structure
CN111684473B (en) Improving performance of neural network arrays
CN105681628B (en) A kind of convolutional network arithmetic element and restructural convolutional neural networks processor and the method for realizing image denoising processing
CN108416437A (en) The processing system and method for artificial neural network for multiply-add operation
CN107341544A (en) A kind of reconfigurable accelerator and its implementation based on divisible array
WO2019136764A1 (en) Convolutor and artificial intelligent processing device applied thereto
CN107886167A (en) Neural network computing device and method
EP0504932A2 (en) A parallel data processing system
EP0505179A2 (en) A parallel data processing system
CN106022468A (en) Artificial neural network processor integrated circuit and design method therefor
CN105608490B (en) Cellular array computing system and communication means therein
CN111462137A (en) Point cloud scene segmentation method based on knowledge distillation and semantic fusion
CN107679522A (en) Action identification method based on multithread LSTM
CN107341542A (en) Apparatus and method for performing Recognition with Recurrent Neural Network and LSTM computings
CN111626184B (en) Crowd density estimation method and system
TWI719512B (en) Method and system for algorithm using pixel-channel shuffle convolution neural network
CN111465943A (en) On-chip computing network
CN108320018A (en) A kind of device and method of artificial neural network operation
CN110378250A (en) Training method, device and the terminal device of neural network for scene cognition
CN110009644B (en) Method and device for segmenting line pixels of feature map
CN108491924A (en) A kind of serial stream treatment device of Neural Network Data calculated towards artificial intelligence
Nathan et al. Skeletonnetv2: A dense channel attention blocks for skeleton extraction
KR20090086660A (en) Computer architecture combining neural network and parallel processor, and processing method using it
CN111886605B (en) Processing for multiple input data sets
CN113592021B (en) Stereo matching method based on deformable and depth separable convolution

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180821

Termination date: 20190627