CN109978161A - A general convolution-pooling synchronous processing convolution kernel system - Google Patents
A general convolution-pooling synchronous processing convolution kernel system
- Publication number
- CN109978161A (application CN201910268608.5A, also referenced as CN201910268608A)
- Authority
- CN
- China
- Prior art keywords
- input
- convolution
- image
- convolution kernel
- pooling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention discloses a general convolution-pooling synchronous processing convolution kernel system, belonging to the field of convolutional neural network acceleration in machine learning. Existing machine learning methods implemented in software suffer from limited computing power and high cost. The present invention implements machine learning in hardware and accelerates convolutional neural networks by processing convolution and pooling synchronously, so that machine learning can be realized quickly, efficiently, and with low power consumption, without any loss of accuracy. The convolution kernels commonly used in existing convolutional neural networks are of fixed size and cannot adapt to varied design needs, whereas the convolution kernel of the present invention allows parameters such as kernel size and stride to be changed, so it can adapt to the design requirements of a wide range of situations.
Description
Technical field
The invention belongs to the field of convolutional neural network acceleration techniques in machine learning.
Technical background
Artificial Intelligence is a major direction of development in the current era and is widely applied in numerous areas such as computing, medicine, biology, and mechanical engineering. Machine Learning, as one of its important branches, has attracted extensive attention and developed rapidly in recent years. By training repeatedly on large numbers of data samples it can achieve excellent results, and it is widely used in fields such as image recognition, object tracking, and speech processing. The Convolutional Neural Network (CNN) is one of the important methods of machine learning and has drawn many researchers to its study. LeNet, AlexNet, and VGG are among its more representative models, and they perform outstandingly in practical applications.
Machine learning arose in the 1950s. After more than a decade of development, its progress stalled from the 1960s to the late 1970s because the computing power of computers at that time was limited. From the late 1970s onward, as computer processing power improved, machine learning entered a second wave of enthusiasm. Today, with the development of computers and big data, machine learning methods have achieved unprecedented growth. However, as data volumes keep increasing and network depths keep growing, the processing power of CPUs can no longer keep up with this development, and GPUs have emerged in their place. Yet although GPUs offer a certain improvement in computing power, their capability is still limited and their cost is high. The current trend is therefore shifting toward implementing machine learning algorithms in hardware, which is fast, low in power consumption, and highly efficient, giving it a bright outlook. Accelerating convolutional neural networks with hardware is likewise an important direction of future development.
Summary of the invention
To address the deficiencies of the prior art, the present invention proposes a hardware acceleration scheme for convolutional neural networks that accelerates them through synchronous convolution-pooling processing without any loss of accuracy.
A general convolution-pooling synchronous processing convolution kernel system consists of nine processing units, 12 input ports, and 3 output ports. Each processing unit comprises: a weight register, an image register, a memory unit, a multiplication unit, an addition unit, an activation function unit, a pooling unit, a first counting unit, and a second counting unit.
The 12 input ports are respectively: the clock input port clk, the reset port rstn, the weight/bias valid signal port wren, the weight/bias input port wb, the image-input valid signal port pren, the image input port p, the image width input port npx (pixels per row), the image height input port npy (pixels per column), the convolution kernel size input port nc, the stride input port st, the padding size input port pa, and the pooling type input port po.
The 3 output ports are respectively: the convolution result output port r, the result valid signal port rl, and the convolution done signal d.
1) The clock input port clk feeds the system an alternating high/low level signal of constant period for timing. The reset port rstn applies a high-level reset signal to the system, under which each processing unit carries out convolution-pooling synchronous processing.
After the reset completes, the parameters of the convolution and the image are loaded into the system through the image width input port npx, the image height input port npy, the kernel size input port nc, the stride input port st, the padding size input port pa, and the pooling type input port po.
After the system receives a high level on the weight/bias valid signal port wren, the convolution kernel weights and the bias value arriving on the weight/bias input port wb are stored in the weight register. Once the weights and bias have been loaded, the weight/bias valid signal returns to an inactive low level.
2) After the system receives a high level on the image-input valid signal port pren, the image pixel values received on the image input port p are stored in the image register. One pixel of the image is received per clock cycle, and the value in the image register is updated to the pixel just received. Pixel values range from -1 to 1. When all image pixels have been input, the image-input valid signal returns to an inactive low level.
3) As each pixel is received, the pixel value in the image register is multiplied by every weight of the convolution kernel held in the weight register, and the multiplication unit sends an indication signal xd to the counter.
4) When the first counting unit receives the indication signal xd, it determines the position coordinates x, y of the pixel being processed by the multiplication unit from the image width npx and image height npy:
y = [n / npx], x = n - y × npx
where x is the abscissa of the pixel, y its ordinate, n the ordinal number of the pixel counted by the counter, npx the number of pixels per image row, npy the number of pixels per image column, and [ ] denotes rounding down. npx and npy are integers in the range 0 to 1024.
The resulting pixel coordinates x, y are then sent to the memory unit.
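The coordinate recovery in step 4) can be sketched in software as follows. This is an illustrative model only, assuming row-major input order and 0-based pixel counting (the patent shows the formula as an image, so the exact convention is an interpretation):

```python
# Software model of the first counting unit: recover the (x, y) position of
# the n-th pixel from the row width npx. Assumes row-major order and
# 0-based counting of n.
def pixel_coords(n, npx):
    y = n // npx       # "[ ]" in the patent denotes rounding down
    x = n - y * npx    # equivalently n % npx
    return x, y
```

For a 28 × 28 image, pixel 29 lands at column 1 of row 1, consistent with one pixel arriving per clock cycle.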
5) According to the ordinal number of each weight, the padding size obtained from the padding size input port pa, and the pixel coordinates x, y, the memory unit stores the results of the multiplication unit as follows:
ram[m][y + pa][x + pa] = w_m × p_xy
where ram denotes the memory and [][][] its three-dimensional coordinates, w_m is the convolution kernel weight with ordinal m (its value ranging from -1 to 1), p_xy is the pixel value with abscissa x and ordinate y, and pa is the padding size, an integer in the range 0 to 5.
Once all pixels of the whole image have been input and processed in row-column order, the memory unit sends an indication signal cd to the second counting unit.
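The write pattern of step 5) can be modelled as below. The memory layout (nc² planes of size (npy + 2·pa) × (npx + 2·pa), initialized to zero so the padded border contributes nothing) is an assumption consistent with the storage formula, not a statement of the actual RTL:

```python
# Sketch of the memory-unit write: each incoming pixel p (at position x, y)
# is multiplied by every kernel weight w_m in parallel, and each product is
# stored at ram[m][y + pa][x + pa], leaving the zero padding untouched.
def store_products(ram, weights, p, x, y, pa):
    for m, w in enumerate(weights):
        ram[m][y + pa][x + pa] = w * p

# Example sizing: a 3x3 kernel (9 weights) over a 4x4 image with padding 1.
npx = npy = 4
pa = 1
nc = 3
ram = [[[0.0] * (npx + 2 * pa) for _ in range(npy + 2 * pa)]
       for _ in range(nc * nc)]
```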
6) When the second counting unit receives the indication signal cd, it computes the positions of the products required for each convolution from the stride obtained on the stride input port st and the image width and height obtained from npx and npy. The next window position is obtained by advancing the previous window coordinates by the step size, wrapping to the next row when the window would exceed the padded image width.
Here cx and cy are the abscissa and ordinate of the convolution window, cx' and cy' the coordinates of the window at the previous moment, npx and npy the pixels per row and per column of the image, pa the padding size, nc the convolution kernel size, and st the stride. The stride st is an integer in the range 1 to 11.
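The window-advance rule of step 6) is shown only as a figure in the patent; the following is a reconstruction under the stated symbols (advance right by st, wrap to the next row when the nc-wide window would leave the padded width npx + 2·pa):

```python
# Illustrative model of how the second counting unit could advance the
# convolution window. The exact update rule is a reconstruction.
def next_window(cx, cy, st, nc, npx, pa):
    cx2 = cx + st
    if cx2 + nc > npx + 2 * pa:   # window would leave the padded image
        return 0, cy + st         # wrap to the start of the next row band
    return cx2, cy
```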
7) The memory unit retrieves all product data for the four adjacent convolution windows being computed; the number of products needed per window is
n = nc²
where n is the number of products required and nc the convolution kernel size, an integer in the range 1 to 11.
The addition unit then performs the corresponding additions and adds the bias, yielding four results: among the four adjacent windows, a1 is the calculation result of the window at the upper left, a2 of the window at the upper right, a3 of the window at the lower left, and a4 of the window at the lower right. cx and cy are the abscissa and ordinate of each group (the four windows of a group being processed together), st is the stride, b is the bias with values from -1 to 1, and nc is the convolution kernel size.
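The accumulation over one window, and over the four adjacent windows of a group, can be sketched as below. The mapping from weight ordinal m to window offset (row-major, m = dy·nc + dx) is an assumption, since the patent gives the summation only as a figure:

```python
# Sum the nc*nc stored products inside one window at (cx, cy) and add the
# bias b; then evaluate the four adjacent windows of a group, offset by the
# stride st in each axis.
def conv_at(ram, cx, cy, nc, b):
    s = b
    for dy in range(nc):
        for dx in range(nc):
            m = dy * nc + dx          # assumed row-major weight ordinal
            s += ram[m][cy + dy][cx + dx]
    return s

def four_adjacent(ram, cx, cy, st, nc, b):
    a1 = conv_at(ram, cx,      cy,      nc, b)  # upper left
    a2 = conv_at(ram, cx + st, cy,      nc, b)  # upper right
    a3 = conv_at(ram, cx,      cy + st, nc, b)  # lower left
    a4 = conv_at(ram, cx + st, cy + st, nc, b)  # lower right
    return a1, a2, a3, a4
```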
8) The four results are passed through the ReLU activation function: each accumulated result a_i (i = 1 to 4) is compared with 0; values greater than 0 are kept as-is, and values less than or equal to 0 are replaced by 0, i.e.
m_i = max(a_i, 0)
where m_i is the calculation result after the activation function.
9) The four results are fed into the pooling unit, which pools them according to the pooling type received on the pooling type input port po. The type po takes one of two values: 0 for max pooling and 1 for average pooling.
For max pooling, the maximum of the four results is output; for average pooling, their average is output:
r = max(m1, m2, m3, m4) when po = 0, or r = (m1 + m2 + m3 + m4) / 4 when po = 1
where r is the pooling result.
The pooling result is output on the convolution result output port r, and at the same time a result valid signal is output on the result valid signal port rl.
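Steps 8) and 9) together reduce the four window results to one output. A minimal software model, using the code values 0 (max) and 1 (average) that the patent assigns to po:

```python
# Model of the activation and pooling units: apply ReLU to the four window
# results, then reduce them with max pooling (po = 0) or average pooling
# (po = 1).
def relu(a):
    return a if a > 0 else 0.0

def pool4(a1, a2, a3, a4, po):
    m = [relu(a) for a in (a1, a2, a3, a4)]
    return max(m) if po == 0 else sum(m) / 4
```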
10) Based on the size of the image, once all convolutions have been completed the second counting unit outputs a done signal on output port d, the reset signal goes low, the convolution of the current picture ends, and the system prepares to receive the next picture.
Completion is judged from the window coordinates: processing is finished when the convolution windows have covered the padded image in both dimensions. Here cx and cy are the abscissa and ordinate of each group of windows, npx the pixels per image row, npy the number of rows, pa the padding size, nc the convolution kernel size, and st the stride.
Beneficial effects of the invention:
When realizing the LeNet-5 convolutional neural network with an input image of size 28 × 28, a padding size of 1, a 3 × 3 convolution kernel, and a stride of 1, this method needs 980 clock cycles from the start of image input to the computation of all convolution results: 784 cycles to input the image, and 196 cycles to process all results after the image has been input. A conventional design in which convolution and pooling are realized separately needs 1568 clock cycles from the start of image input to the computation of all results: 784 cycles to input the image, and 784 cycles to process all results afterwards. Compared with the conventional method, this design is therefore 37.5% faster over the whole process and 75% faster in the data processing that follows image input. Moreover, a conventional convolution kernel is of fixed size and cannot adapt to varied design needs, whereas the kernel of this design allows parameters such as kernel size and stride to be changed and can adapt to the design requirements of a wide range of situations.
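The quoted cycle counts and speedups follow directly from the stated figures and can be checked arithmetically:

```python
# Verify the cycle counts for a 28x28 input: one cycle per input pixel,
# plus 196 post-input cycles for the fused design versus 784 for the
# separate convolution-then-pooling baseline.
npx = npy = 28
input_cycles = npx * npy              # 784 cycles to stream the image in
fused_total = input_cycles + 196      # 980 cycles total, fused design
baseline_total = input_cycles + 784   # 1568 cycles total, separate design
overall_speedup = 1 - fused_total / baseline_total   # 0.375 -> 37.5%
post_input_speedup = 1 - 196 / 784                   # 0.75  -> 75%
```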
Description of the drawings
Fig. 1 is a flow chart of the general convolution-pooling synchronous processing convolution kernel of the invention;
Fig. 2 is a schematic diagram of its algorithm;
Fig. 3 is a structural diagram of its hardware system;
Fig. 4 is a schematic diagram of its internal system structure;
Fig. 5 is a schematic diagram of its timing simulation.
Specific embodiment
To make the design scheme of the invention clearer, the embodiments of the present invention are described in detail below with reference to the drawings.
The present invention is a hardware-accelerated realization of the forward propagation (inference) of a convolutional neural network, covering its convolutional layers and pooling layers. The formula of a convolutional layer is:
a^l = σ(Σ_{i∈M} a^{l-1}_i * W^l_i + b^l)
where l is the layer number, a^l the output tensor, * denotes convolution, b the bias, M the number of sub-matrices, and σ the activation function, usually ReLU.
The formula of a pooling layer is
a^l = pool(a^{l-1})
where pool denotes the process of shrinking the input tensor according to the pooling region size and the pooling criterion.
The overall design idea of the invention, shown in Fig. 2, is as follows:
First, every weight is multiplied with each input pixel synchronously as the pixel arrives.
When the image input finishes, the results of four adjacent convolutions are completed at the same time; the pooling operation (taking their mean or their maximum) can therefore be applied to these four results directly, achieving the goal of performing convolution synchronously with pooling.
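The fused idea above can be expressed end to end as a software reference. This is a behavioural sketch in plain Python under assumed conventions (zero padding, row-major scan, pooling over 2 × 2 groups of windows offset by the stride), not the patent's hardware pipeline:

```python
# Behavioural reference of fused convolution + ReLU + pooling: evaluate the
# four adjacent window results of each group and pool them immediately.
def conv2d_relu_pool(img, w, b, st=1, pa=1, po=0):
    nc = len(w)
    h, wd = len(img), len(img[0])
    # Zero-pad the image by pa on every side.
    padded = [[0.0] * (wd + 2 * pa) for _ in range(h + 2 * pa)]
    for y in range(h):
        for x in range(wd):
            padded[y + pa][x + pa] = img[y][x]

    def conv_at(cx, cy):
        s = b
        for dy in range(nc):
            for dx in range(nc):
                s += w[dy][dx] * padded[cy + dy][cx + dx]
        return max(s, 0.0)            # ReLU

    out = []
    cy = 0
    # Boundary handling below is an interpretation of the patent's figures.
    while cy + st + nc <= h + 2 * pa:
        row = []
        cx = 0
        while cx + st + nc <= wd + 2 * pa:
            m = [conv_at(cx, cy), conv_at(cx + st, cy),
                 conv_at(cx, cy + st), conv_at(cx + st, cy + st)]
            row.append(max(m) if po == 0 else sum(m) / 4)
            cx += 2 * st
        out.append(row)
        cy += 2 * st
    return out
```

On a 4 × 4 image of ones with a 3 × 3 kernel of ones, zero bias, stride 1, and padding 1, every pooled group contains a fully interior window summing to 9, so max pooling yields a 2 × 2 output of 9s.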
Fig. 4 shows the internal structure of the hardware system of the invention. After the system receives the weight valid signal on the wren port, the convolution kernel weights and bias value arriving on the wb port are stored in the weight register. Thereafter the upstream unit sends the pixel valid signal to the system through the pren port; on this signal the system stores the received pixel in the image register and, at the same time, multiplies the received pixel value by each weight of the convolution kernel in the multiplication unit. When multiplication starts, the multiplication unit sends the indication signal xd to counting unit 1, which determines the position coordinates x, y of the pixel from the width and height of the input image in pixels, and the products are stored at the corresponding memory positions according to the weight ordinal and the padding size. When storage completes, the memory unit sends the signal cd to counting unit 2, which begins computing the positions of the products required for convolution; at the same time, all product data for the four adjacent convolution windows (upper/lower, left/right) are fetched, and the corresponding additions, together with the bias, are carried out in the addition unit, yielding four results. These four results next pass through the ReLU activation function: each is compared with 0, values greater than 0 are kept, and values below 0 are set to 0. The four results are then fed into the pooling unit which, depending on the pooling type on input port po, outputs the maximum of the four results for max pooling or their average for average pooling, asserting the result valid signal rl while the result is output. Meanwhile, counting unit 2 tracks the size of the image; once the computation shows that all convolutions are complete, it outputs the done signal on output port d, the convolution of the current picture ends, and the system prepares to receive the next picture.
Fig. 5 is the circuit simulation of the hardware system of the invention. clk is the clock signal, rstn the reset signal, npx the number of pixels per image row, npy the number of pixels per image column, nc the convolution kernel size, st the stride, pa the padding size, po the pooling type, wren the weight/bias valid signal, wb the weight and bias, pren the image-input valid signal, p the pixel, y1 to yn the multiplication units, xd the valid signal of the first counting unit, x and y the pixel abscissa and ordinate, ram the memory unit, cd the valid signal of the second counting unit, cx and cy the window abscissa and ordinate, m1 to m4 the addition units, r the final convolution result, rl the result valid signal, and d the convolution done signal.
The specific implementation of the invention proceeds according to steps 1) through 10) of the system description given above.
Claims (1)
1. a kind of general convolution-pond synchronization process convolution kernel system, which is characterized in that the system by nine processing units,
12 input ports and 3 output port compositions;Processing unit include: weight register, image register, memory cell,
Multiplication unit, addition unit, activation primitive unit, pond unit, the first counting unit and the second counting unit;
12 input ports are respectively input end of clock mouth clk, reseting port rstn, the effective signal port of input weight biasing
Wren, weight bias input end mouth wb, image input useful signal port pren, image input port p, picture traverse pixel number
Measure input port npx, image length pixel quantity input port npy, convolution kernel size input port nc, step sizes input terminal
Mouth st, filling size input port pa and pond type input port po;
3 output ports are respectively that convolution results output port r, output result useful signal port rl and convolution complete signal d;
1) input end of clock mouth clk is used for timing with the alternating low and high level signal input system of constant time length;By reset terminal
To the reset signal of system input high level, each processing unit carries out at convolution-pond synchronization under the signal designation by mouthful rstn
Reason;
System passes through picture traverse pixel quantity input port npx, image length pixel quantity input port after the completion of reset
Npy, convolution kernel size input port nc, step sizes input port st, filling size input port pa and the input of pond type
Port po is by the parameters input system of convolution sum image;
System is biased after effective signal port wren receives the effective high RST of weight by input weight receiving, will be by weight
In the convolution kernel weighted value and offset value deposit weight register of bias input end mouth wb input, convolution kernel weight and biasing
After input, input weight biasing useful signal becomes invalid low signal;
2) system receives after image inputs effective high RST from image input useful signal port pren, and system will pass through image
In the image pixel numerical value deposit image register that input port p is received, each clock cycle receives a picture of image
Element, while the numerical value of image register being updated to the pixel number received at this time;The value range of pixel number be -1~
1, after image pixel end of input, image input useful signal becomes invalid low signal;
It 3), will be in the convolution kernel in the image pixel numerical value and weight register in image register while receiving pixel
Each weighted value is multiplied, and multiplication unit sends indication signal xd to counter;
4) after the first counting unit receives indication signal xd, according to from picture traverse pixel quantity npx and image length pixel
The image length and width pixel quantity that quantity npy is obtained judges position coordinates x, y of multiplication unit pixel calculated, calculation formula
It is as follows:
Wherein, x represents the abscissa of this pixel, and y represents the ordinate of this pixel, and n represents the ordinal number for the pixel that counter counts obtain,
The number of pixels of the every row of npx representative image, the number of pixels of npy representative image each column, [], which represents, to be rounded;Npx's and npy takes
Being worth range is 0~1024, and is integer;
Then gained location of pixels coordinate x, y are sent to memory cell;
5) memory cell according to the ordinal number of weight, from the filling size input port pa filling size obtained and pixel
The calculated result of multiplication unit is stored in memory cell by position coordinates x, y, and storage mode is as follows:
Ram [m] [y+pa] [x+pa]=wm×pxy
Wherein, ram represents memory, and ram represents [] [] [] three-dimensional coordinate of memory, wmRepresentation repeated order number is the convolution of m
Core weight, value range are -1~1, pxyPixel number of the abscissa as x ordinate as y is represented, pa represents filling magnitude numerical value,
The value range for filling size pa is 0~5, and is integer;
After all pixels of the whole image have been input into the system and calculated in row-column order, the memory cell sends an indication signal cd to the second counting unit;
6) After the second counting unit receives the indication signal cd, it calculates the positions of the products required for the convolution according to the stride value obtained from the stride input port st and the image width and height pixel counts obtained from the image width pixel count npx and the image height pixel count npy. The calculation formula is as follows:
Wherein, cx represents the abscissa of the convolution kernel, cy represents the ordinate of the convolution kernel, cx' represents the abscissa of the convolution kernel at the previous moment, cy' represents the ordinate of the convolution kernel at the previous moment, npx represents the number of pixels in each row of the image, npy represents the number of pixels in each column of the image, pa represents the padding size value, nc represents the convolution kernel size, and st represents the stride; the value range of the stride st is 1 to 11, and it is an integer;
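Since the stepping formula itself is not reproduced in the extracted text, the following is only one plausible sketch of how the window corner (cx, cy) could advance by the stride st in raster order over the padded image:

```python
def next_window(cx, cy, st, nc, npx, pa):
    """Advance the top-left corner of an nc x nc convolution window by
    stride st over a padded image of width npx + 2*pa, wrapping to the
    next row band when the window would run past the right edge.
    A sketch of one plausible stepping rule, not the patent's exact formula."""
    cx += st
    if cx + nc > npx + 2 * pa:   # window would overrun the padded row
        cx = 0                   # wrap to the left edge
        cy += st                 # and move down by one stride
    return cx, cy
```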
7) The memory cell takes out all the product data of the four adjacent convolution kernels used in the calculation; the number of products required for each convolution kernel is:
n = nc²
Wherein, n represents the number of products required and nc represents the convolution kernel size; the value range of the convolution kernel size nc is 1 to 11, and it is an integer;
7) The corresponding addition operations are carried out in the addition unit, with the bias added at the same time, yielding four calculated results. The calculation formula is as follows:
Wherein, a1 represents the calculated result of the convolution kernel located at the upper left among the four adjacent convolution kernels, a2 represents that of the kernel at the upper right, a3 represents that of the kernel at the lower left, and a4 represents that of the kernel at the lower right. With four convolution kernels as one group, cx represents the abscissa of each group and cy its ordinate; st represents the stride; b is the weight bias, with value range -1 to 1; nc is the convolution kernel size;
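Assuming each a_i accumulates the nc × nc stored products of one window and adds the bias b (the exact formula is not reproduced in the extracted text), a sketch for a single kernel plane:

```python
def conv_at(ram_m, cx, cy, nc, b):
    """Accumulate the nc*nc stored products of one convolution window
    whose top-left corner is (cx, cy), then add the weight bias b.
    ram_m is the padded product plane for one kernel (see step 5)."""
    total = b
    for dy in range(nc):
        for dx in range(nc):
            total += ram_m[cy + dy][cx + dx]
    return total
```

Evaluating this at the four adjacent window positions yields a1 through a4.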
8) The four results are passed through the activation function ReLU; specifically, each accumulated result ai (i = 1 to 4) is compared with 0: a value greater than 0 takes its own value, and a value less than or equal to 0 takes 0, i.e.
mi = max(ai, 0)
Wherein, mi represents the calculated result after the activation function;
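The ReLU step as described reduces to a single comparison; a minimal sketch:

```python
def relu(a):
    """ReLU as described in step 8: a value greater than 0 passes
    through unchanged; a value less than or equal to 0 becomes 0."""
    return a if a > 0 else 0
```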
9) The four results are sent into the pooling unit, which performs pooling according to the pooling type input through the pooling type input port po. The pooling type po can take two values, 0 and 1: 0 represents maximum pooling and 1 represents mean pooling. For maximum pooling, the maximum of the four results is output; for mean pooling, the average of the four results is calculated and output. Specifically:
r = max(m1, m2, m3, m4) for maximum pooling, or r = (m1 + m2 + m3 + m4)/4 for mean pooling
Wherein, r represents the pooling result.
The pooling result is output from the convolution result output port r, while a result valid signal is output from the output result valid signal port rl;
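A sketch of the pooling selection driven by the po port, as described:

```python
def pool(m1, m2, m3, m4, po):
    """Pool the four adjacent kernel results: po == 0 selects maximum
    pooling, po == 1 selects mean pooling, matching the po input port."""
    if po == 0:
        return max(m1, m2, m3, m4)
    return (m1 + m2 + m3 + m4) / 4
```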
10) After the second counting unit judges from the size of the image that all convolutions have been completed, it outputs a completion signal to the output port d, the reset signal switches to low level, the convolution of one picture ends, and the system prepares to receive the next picture.
The method for judging that processing is complete is as follows: both of the following conditions must hold.
Wherein, cx represents the abscissa of each group of convolution kernels, cy represents the ordinate of each group of convolution kernels, npx represents the number of pixels in each row of the image, npy represents the number of pixel rows of the image, pa represents the padding size, nc represents the convolution kernel size, and st represents the stride.
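The completion test is described only in words here (the two inequalities are not reproduced in the extracted text); one plausible reading, with names matching the step above:

```python
def done(cx, cy, npx, npy, pa, nc, st):
    """Sketch of the completion test: processing ends once the next
    stride would push the nc x nc window past both the padded width
    and the padded height. One plausible reading, not the patent's
    exact inequalities."""
    past_right = cx + st + nc > npx + 2 * pa
    past_bottom = cy + st + nc > npy + 2 * pa
    return past_right and past_bottom
```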
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910177153 | 2019-03-08 | ||
CN2019101771536 | 2019-03-08 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109978161A true CN109978161A (en) | 2019-07-05 |
CN109978161B CN109978161B (en) | 2022-03-04 |
Family
ID=67082846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910268608.5A Active CN109978161B (en) | 2019-03-08 | 2019-04-04 | Universal convolution-pooling synchronous processing convolution kernel system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109978161B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103905686A (en) * | 2012-12-24 | 2014-07-02 | 三星电子株式会社 | Image scanning apparatus, image compensation method and computer-readable recording medium |
CN107894189A (en) * | 2017-10-31 | 2018-04-10 | 北京艾克利特光电科技有限公司 | A kind of EOTS and its method for automatic tracking of target point automatic tracing |
CN108154229A (en) * | 2018-01-10 | 2018-06-12 | 西安电子科技大学 | Accelerate the image processing method of convolutional neural networks frame based on FPGA |
IN201811023855A (en) * | 2018-06-26 | 2018-07-13 | Hcl Technologies Ltd | |
CN108764467A (en) * | 2018-04-04 | 2018-11-06 | 北京大学深圳研究生院 | For convolutional neural networks convolution algorithm and full connection computing circuit |
CN108804973A (en) * | 2017-04-27 | 2018-11-13 | 上海鲲云信息科技有限公司 | The hardware structure and its execution method of algorithm of target detection based on deep learning |
CN108848326A (en) * | 2018-06-13 | 2018-11-20 | 吉林大学 | A kind of high dynamic range MCP detector front end reading circuit and its reading method |
CN109284824A (en) * | 2018-09-04 | 2019-01-29 | 复旦大学 | A kind of device for being used to accelerate the operation of convolution sum pond based on Reconfiguration Technologies |
CN109416756A (en) * | 2018-01-15 | 2019-03-01 | 深圳鲲云信息科技有限公司 | Acoustic convolver and its applied artificial intelligence process device |
Non-Patent Citations (6)
Title |
---|
FRANCESCO CONTI et al.: "A Ultra-Low-Energy Convolution Engine for Fast Brain-Inspired Vision in Multicore Clusters", 《DOI:10.7873/DATE.2015.0404》 * |
LI DU et al.: "A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things", 《DOI:10.1109/TCSI.2017.2735490》 * |
MOHAMMED ALAWAD et al.: "Stochastic-Based Deep Convolutional Networks with Reconfigurable Logic Fabric", 《IEEE TRANSACTIONS ON MULTI-SCALE COMPUTING SYSTEMS》 * |
俞健: "Hardware Design of a Multi-core DSP Image Processing System", 《China Masters' Theses Full-text Database, Information Science and Technology》 * |
姬梦飞 et al.: "Design of a Low-Power Data Conversion System", 《Integrated Circuit Applications》 * |
邱宇: "FPGA-based Acceleration of the AlexNet Forward Network", 《China Masters' Theses Full-text Database, Information Science and Technology》 * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11487288B2 (en) | 2017-03-23 | 2022-11-01 | Tesla, Inc. | Data synthesis for autonomous control systems |
US11403069B2 (en) | 2017-07-24 | 2022-08-02 | Tesla, Inc. | Accelerated mathematical engine |
US11409692B2 (en) | 2017-07-24 | 2022-08-09 | Tesla, Inc. | Vector computational unit |
US11893393B2 (en) | 2017-07-24 | 2024-02-06 | Tesla, Inc. | Computational array microprocessor system with hardware arbiter managing memory requests |
US11681649B2 (en) | 2017-07-24 | 2023-06-20 | Tesla, Inc. | Computational array microprocessor system using non-consecutive data formatting |
US11561791B2 (en) | 2018-02-01 | 2023-01-24 | Tesla, Inc. | Vector computational unit receiving data elements in parallel from a last row of a computational array |
US11797304B2 (en) | 2018-02-01 | 2023-10-24 | Tesla, Inc. | Instruction set architecture for a vector computational unit |
US11734562B2 (en) | 2018-06-20 | 2023-08-22 | Tesla, Inc. | Data pipeline and deep learning system for autonomous driving |
US11841434B2 (en) | 2018-07-20 | 2023-12-12 | Tesla, Inc. | Annotation cross-labeling for autonomous control systems |
US11636333B2 (en) | 2018-07-26 | 2023-04-25 | Tesla, Inc. | Optimizing neural network structures for embedded systems |
US11562231B2 (en) | 2018-09-03 | 2023-01-24 | Tesla, Inc. | Neural networks for embedded devices |
US11893774B2 (en) | 2018-10-11 | 2024-02-06 | Tesla, Inc. | Systems and methods for training machine models with augmented data |
US11665108B2 (en) | 2018-10-25 | 2023-05-30 | Tesla, Inc. | QoS manager for system on a chip communications |
US11816585B2 (en) | 2018-12-03 | 2023-11-14 | Tesla, Inc. | Machine learning models operating at different frequencies for autonomous vehicles |
US11537811B2 (en) | 2018-12-04 | 2022-12-27 | Tesla, Inc. | Enhanced object detection for autonomous vehicles based on field view |
US11908171B2 (en) | 2018-12-04 | 2024-02-20 | Tesla, Inc. | Enhanced object detection for autonomous vehicles based on field view |
US11610117B2 (en) | 2018-12-27 | 2023-03-21 | Tesla, Inc. | System and method for adapting a neural network model on a hardware platform |
US11748620B2 (en) | 2019-02-01 | 2023-09-05 | Tesla, Inc. | Generating ground truth for machine learning from time series elements |
US11567514B2 (en) | 2019-02-11 | 2023-01-31 | Tesla, Inc. | Autonomous and user controlled vehicle summon to a target |
US11790664B2 (en) | 2019-02-19 | 2023-10-17 | Tesla, Inc. | Estimating object properties using visual image data |
Also Published As
Publication number | Publication date |
---|---|
CN109978161B (en) | 2022-03-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109978161A (en) | A kind of general convolution-pond synchronization process convolution kernel system | |
CN109993297A (en) | A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing | |
CN111062472A (en) | Sparse neural network accelerator based on structured pruning and acceleration method thereof | |
CN107239823A (en) | A kind of apparatus and method for realizing sparse neural network | |
CN101706741B (en) | Method for partitioning dynamic tasks of CPU and GPU based on load balance | |
CN105528191B (en) | Data accumulation apparatus and method, and digital signal processing device | |
CN109063825A (en) | Convolutional neural networks accelerator | |
CN109086244A (en) | Matrix convolution vectorization implementation method based on vector processor | |
CN110119809A (en) | The asymmetric quantization of multiplication and accumulating operation in deep learning processing | |
CN111242289A (en) | Convolutional neural network acceleration system and method with expandable scale | |
CN110348574A (en) | A kind of general convolutional neural networks accelerating structure and design method based on ZYNQ | |
CN112465110B (en) | Hardware accelerator for convolution neural network calculation optimization | |
CN111445012A (en) | FPGA-based packet convolution hardware accelerator and method thereof | |
CN107886167A (en) | Neural network computing device and method | |
CN106951926A (en) | The deep learning systems approach and device of a kind of mixed architecture | |
CN110390385A (en) | A kind of general convolutional neural networks accelerator of configurable parallel based on BNRP | |
CN106951395A (en) | Towards the parallel convolution operations method and device of compression convolutional neural networks | |
CN110210610A (en) | Convolutional calculation accelerator, convolutional calculation method and convolutional calculation equipment | |
CN109740739A (en) | Neural computing device, neural computing method and Related product | |
CN110263925A (en) | A kind of hardware-accelerated realization framework of the convolutional neural networks forward prediction based on FPGA | |
CN109191364A (en) | Accelerate the hardware structure of artificial intelligence process device | |
CN112200300B (en) | Convolutional neural network operation method and device | |
CN109901814A (en) | Customized floating number and its calculation method and hardware configuration | |
CN108764467A (en) | For convolutional neural networks convolution algorithm and full connection computing circuit | |
CN106959937A (en) | A kind of vectorization implementation method of warp product matrix towards GPDSP |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||