CN110288086A - Winograd-based configurable convolutional array accelerator architecture - Google Patents

Winograd-based configurable convolutional array accelerator architecture

Info

Publication number
CN110288086A
Authority
CN
China
Prior art keywords
matrix
winograd
module
weight
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910511987.6A
Other languages
Chinese (zh)
Other versions
CN110288086B (en)
Inventor
魏继增
徐文富
王宇吉
郭炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201910511987.6A
Publication of CN110288086A
Application granted
Publication of CN110288086B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A Winograd-based configurable convolutional array accelerator architecture, comprising: an activation-value cache module, a weight cache module, an output buffer module, a controller, a weight preprocessing module, an activation-value preprocessing module, a weight conversion module, an activation-value matrix conversion module, a dot-product module, a result-matrix conversion module, an accumulation module, a pooling module and an activation module. Based on the computation pattern of the fixed-paradigm Winograd convolution algorithm, the invention provides a convolutional array accelerator with configurable bit width, flexibly meeting the bit-width requirements of different neural networks and different convolutional layers. In addition, a dedicated multiplier unit with configurable data bit width is designed, which improves the computational efficiency of neural-network convolution and reduces computation power consumption.

Description

Winograd-based configurable convolutional array accelerator architecture
Technical field
The present invention relates to configurable convolutional array accelerator architectures, and more particularly to a Winograd-based configurable convolutional array accelerator architecture.
Background technique
Neural networks excel in many application areas, especially image-related tasks. In computer vision problems such as image classification, semantic segmentation, image retrieval and object detection they have begun to replace most traditional algorithms and are gradually being deployed on terminal devices.
However, the computational load of neural networks is enormous, which leads to problems such as slow processing speed and high power consumption. A neural network workload consists mainly of a training phase and an inference phase. To obtain high-precision results, the weights must be computed during training by iterating over massive amounts of data. In the inference phase, the computation on the input data must be completed within an extremely short response time (usually milliseconds), especially when the neural network is applied in real-time systems such as autonomous driving. The computations involved in a neural network mainly include convolution, activation and pooling operations.
Existing research shows that more than 90% of the computation time of a neural network is spent on convolution. Traditional convolution computes each element of the output feature map separately through repeated multiply-accumulate operations. Although solutions based on this algorithm have achieved preliminary success, efficiency can be higher when the algorithm itself is more efficient. Researchers have therefore proposed the Winograd convolution algorithm, which applies a specific domain transformation to the input feature map and the weights, completing an equivalent convolution task while reducing the number of multiplications in the convolution process. Since most neural-network processor chips in practical applications perform inference with a fixed network model, the Winograd convolution output paradigm used is generally also fixed, the computation flow is very clear, and there is considerable room for optimization. How to design and optimize a Winograd-based neural network accelerator architecture has therefore become a research focus.
In addition, for the vast majority of neural network applications, fixed-point input data can achieve good experimental results while further improving speed and reducing power consumption. However, the convolution data bit width in existing fixed-point neural networks is fixed and cannot be configured flexibly, which limits applicability. In general, a 16-bit data width satisfies the accuracy requirements of a neural network, while for some networks and scenarios with lower precision requirements an 8-bit data width is also sufficient. Making the data bit width configurable in a neural network therefore enables better optimization.
Summary of the invention
The technical problem to be solved by the invention is to provide a Winograd-based configurable convolutional array accelerator architecture that improves the computational efficiency of neural-network convolution.
The technical solution adopted by the invention is a Winograd-based configurable convolutional array accelerator architecture, comprising: an activation-value cache module, a weight cache module, an output buffer module, a controller, a weight preprocessing module, an activation-value preprocessing module, a weight conversion module, an activation-value matrix conversion module, a dot-product module, a result-matrix conversion module, an accumulation module, a pooling module and an activation module, wherein
the activation-value cache module is connected to the controller, stores input pixel values or input feature-map values, and provides activation-value data to the activation-value preprocessing module;
the weight cache module is connected to the controller, stores trained weights, and provides weight data to the weight preprocessing module;
the output buffer module is connected to the controller and stores the result of one convolutional layer; when the activation module finishes outputting data, the data are passed to the output buffer module for use by the next convolutional layer;
the controller controls the transmission of the activation-value data, weight data and convolutional-layer data to be processed according to the computation flow;
the weight preprocessing module receives the operand data transmitted by the weight cache module and divides the convolution kernel to obtain the time-domain weight matrix K;
the activation-value preprocessing module receives the operand data transmitted by the activation-value cache module, fetches activation values from the activation-value cache module and divides them to obtain the time-domain activation-value matrix I;
the weight conversion module receives the operand data transmitted by the weight preprocessing module and converts the weight data from the time domain to the Winograd domain, obtaining the Winograd-domain weight matrix U;
the activation-value matrix conversion module receives the operand data transmitted by the activation-value preprocessing module and converts the activation values from the time domain to the Winograd domain, obtaining the Winograd-domain activation-value matrix V;
the dot-product module receives the operand data transmitted by the weight conversion module and the activation-value matrix conversion module respectively, and performs the dot-product operation between the Winograd-domain activation-value matrix and the Winograd-domain weight matrix, obtaining the Winograd-domain dot-product result matrix M;
the result-matrix conversion module receives the operand data transmitted by the dot-product module and converts the dot-product result matrix from the Winograd domain back to the time domain, obtaining the converted time-domain dot-product result matrix F;
the accumulation module receives the operand data transmitted by the result-matrix conversion module and accumulates the received data, obtaining the final convolution result;
the pooling module receives the operand data transmitted by the accumulation module and pools the final convolution result matrix;
the activation module receives the operand data transmitted by the pooling module, applies the ReLU activation function to the pooling result, and transfers the activated result to the output buffer module.
The weight preprocessing module:
(1) zero-pads a 5*5 convolution kernel, extending it into a 6*6 convolution matrix;
(2) divides the 6*6 convolution matrix into four 3*3 convolution kernels.
The specific division is as follows, where K_input denotes a 5*5 weight matrix and below it are the four time-domain weight matrices K1, K2, K3, K4 to be processed after division. In the calculation U = G K G^T, K takes the values K1, K2, K3, K4 in turn:
The activation-value preprocessing module divides a 6*6 activation-value matrix into four overlapping 4*4 matrices. The division is as follows, where I_input denotes a 6*6 activation-value matrix and below it are the four 4*4 time-domain activation-value matrices I1, I2, I3, I4 to be processed after division. In the calculation V = B^T I B, I takes the values I1, I2, I3, I4 in turn:
The weight conversion module replaces the matrix multiplications in the calculation with row/column vector additions and shifts, thereby performing the weight-matrix transformation of the Winograd convolution and obtaining the Winograd-domain weight matrix U = G K G^T, where K denotes the time-domain weight matrix, G is the weight transform auxiliary matrix and U is the Winograd-domain weight matrix.
Concrete operations: the first row vector of the weight matrix K is taken as the first row of the provisional matrix C2, where C2 = GK. The division by two is realized by a right shift: a positive weight is shifted right with a 0 filled in on the left, and a negative weight is shifted right with a 1 filled in on the left. The elements of the first, second and third rows of K are added and the result is shifted right by one bit to form the second row of C2; the elements of the first, second and third rows of K are added and the result is shifted right by one bit to form the third row of C2; the third row vector of K forms the fourth row of C2. The first column vector of C2 forms the first column of the Winograd-domain weight matrix U; the first, second and third columns of C2 are added and shifted right by one bit to form the second column of U; the first, second and third columns of C2 are added and shifted right by one bit to form the third column of U; the third column vector of C2 forms the fourth column of U, finally yielding the Winograd-domain weight matrix U.
The activation-value matrix conversion module replaces the matrix multiplications in the calculation with row/column vector additions and subtractions, thereby performing the transformation of the time-domain activation-value matrix in the Winograd convolution and obtaining the matrix V = B^T I B, where I is the time-domain activation-value matrix, B is the activation-value transform auxiliary matrix and V is the Winograd-domain activation-value matrix.
Concrete operations: the first row of the time-domain activation-value matrix I minus its third row forms the first row of the provisional matrix C1, where C1 = B^T I; the second row of I plus its third row forms the second row of C1; the third row of I minus its second row forms the third row of C1; the second row of I minus its fourth row forms the fourth row of C1. The first column of C1 minus its third column forms the first column of the Winograd-domain activation-value matrix V; the second column of C1 plus its third column forms the second column of V; the third column of C1 minus its second column forms the third column of V; the second column of C1 minus its fourth column forms the fourth column of V, finally yielding the Winograd-domain activation-value matrix V.
The dot-product module performs the dot-product operation between the Winograd-domain weight matrix U and the Winograd-domain activation-value matrix V, obtaining the Winograd-domain dot-product result matrix M; the formula is M = U ⊙ V, where U is the Winograd-domain weight matrix and V is the Winograd-domain activation-value matrix. To make the data bit width of the dot product configurable, the dot-product module has two operating modes, an 8-bit multiplier mode and a 16-bit multiplier mode, which respectively perform operations with 8-bit and 16-bit data widths, realizing 8*8-bit and 16*16-bit fixed-point multiplication.
The 8-bit multiplier comprises a first gating unit, a first negation unit, a first shift unit, a first summing unit, a second gating unit, a second negation unit and a third gating unit connected in sequence, wherein
the first gating unit receives the data of the weight conversion module and the activation-value matrix conversion module as well as the sign-control signal of the weight conversion module;
the first negation unit receives the data of the first gating unit and negates the received data;
the first shift unit receives the data of the first negation unit as well as the sign-bit information of the first gating unit, and shifts the received data according to the sign information;
the first summing unit receives the data of the first shift unit and accumulates the received data;
the second gating unit receives the data of the first summing unit and the sign-bit information of the first gating unit and passes them to the second negation unit;
the second negation unit receives the data of the second gating unit and negates the received data;
the third gating unit receives the data of the second negation unit and of the first summing unit respectively, and outputs the result.
The 16-bit multiplier comprises a fourth gating unit, a third negation unit, 8-bit multipliers, a second shift unit, a second summing unit, a fifth gating unit, a fourth negation unit and a sixth gating unit connected in sequence, wherein
the fourth gating unit receives the data of the weight conversion module and the activation-value matrix conversion module as well as the sign-control signal of the weight conversion module;
the third negation unit receives the data of the fourth gating unit and negates the received data;
the 8-bit multipliers perform operations with 8-bit data width, realizing 8*8-bit fixed-point multiplication;
the second shift unit receives the data of the 8-bit multipliers and shifts the received data;
the second summing unit receives the data of the second shift unit and accumulates the received data;
the fifth gating unit receives the data of the second summing unit and the sign-bit information of the fourth gating unit and passes them to the fourth negation unit;
the fourth negation unit receives the data of the fifth gating unit and negates the received data;
the sixth gating unit receives the data of the fourth negation unit and outputs the result.
The result-matrix conversion module performs the transformation F = A^T M A of the Winograd-domain dot-product result matrix M through row/column vector shift, addition and subtraction operations, where M is the Winograd-domain dot-product result matrix, A is the transform auxiliary matrix of M, and F is the time-domain dot-product result matrix.
Concrete operations: the first, second and third rows of the Winograd-domain dot-product result matrix M are added to form the first row of the provisional matrix C3, where C3 = A^T M; the second, third and fourth rows of M are added to form the second row of C3. The first, second and third columns of C3 are added to form the first column of the converted time-domain dot-product result matrix F; the second, third and fourth columns of C3 are added to form the second column of F, finally yielding the converted time-domain dot-product result matrix F.
With the Winograd-based configurable convolutional array accelerator architecture of the invention, a bit-width-configurable convolutional array accelerator is designed according to the computation pattern of the fixed-paradigm Winograd convolution algorithm, flexibly meeting the bit-width requirements of different neural networks and different convolutional layers. In addition, a dedicated multiplier unit with configurable data bit width is designed, which improves the computational efficiency of neural-network convolution and reduces computation power consumption.
Detailed description of the invention
Fig. 1 is the overall block diagram of the Winograd convolution array accelerator;
Fig. 2 is a schematic diagram of the composition of the Winograd-based configurable convolutional array accelerator architecture of the invention;
Fig. 3 is a schematic diagram of the 8-bit multiplier in the configurable-data-bit-width design;
Fig. 4 is a schematic diagram of the 16-bit multiplier in the configurable-data-bit-width design.
Specific embodiment
The Winograd-based configurable convolutional array accelerator architecture of the invention is described in detail below with reference to the embodiments and the accompanying drawings.
In the convolution computation of a neural network, the Winograd transformation formula is
Out = A^T [(G K G^T) ⊙ (B^T I B)] A    (1)
where K denotes the time-domain weight matrix, I denotes the time-domain activation-value matrix, and A, G, B are the transform matrices corresponding to the dot-product result matrix [(G K G^T) ⊙ (B^T I B)], the time-domain weight matrix K and the time-domain activation-value matrix I, respectively. The transform matrices A, G, B are as follows:
The output paradigm of the Winograd convolution used in the invention is F(2*2, 3*3), where the first parameter 2*2 denotes the size of the output feature-map tile and the second parameter 3*3 denotes the size of the convolution kernel.
As shown in Fig. 1, the Winograd convolution is executed in three phases. In the first phase, the time-domain weight matrix K and the time-domain activation-value matrix I read from the caches are converted from the time domain to the Winograd domain; the concrete operations are matrix multiplications, and the results are denoted U and V, where U = G K G^T and V = B^T I B. In the second phase, the dot-product operation "⊙" is performed on the Winograd-domain weight matrix U and the Winograd-domain activation-value matrix V, giving the Winograd-domain dot-product result matrix M = U ⊙ V. In the third phase, the dot-product result is converted from the Winograd domain back to the time domain.
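For reference, the three phases can be reproduced in a few lines of NumPy. The transform matrices G, B^T and A^T below are the standard F(2*2, 3*3) matrices from the Winograd-convolution literature (Lavin et al.); the patent's own matrix figures are not reproduced in this text, so the exact sign conventions shown here are an assumption.

```python
import numpy as np

# Standard F(2x2, 3x3) transform matrices from the Winograd literature
# (Lavin et al.); the patent's matrix figures are not reproduced, so the
# exact sign conventions here are an assumption.
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
Bt = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
At = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(K, I):
    """One Winograd tile: 3x3 kernel K and 4x4 input tile I give a 2x2 output."""
    U = G @ K @ G.T            # phase 1: weight transform into the Winograd domain
    V = Bt @ I @ Bt.T          # phase 1: activation transform
    M = U * V                  # phase 2: element-wise dot product
    return At @ M @ At.T       # phase 3: inverse transform back to the time domain

# Sanity check against a direct sliding-window (correlation-form) convolution
K = np.random.randn(3, 3)
I = np.random.randn(4, 4)
direct = np.array([[np.sum(I[r:r + 3, c:c + 3] * K) for c in range(2)]
                   for r in range(2)])
assert np.allclose(winograd_f2x2_3x3(K, I), direct)
```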
As shown in Fig. 2, the Winograd-based configurable convolutional array accelerator architecture of the invention comprises: an activation-value cache module 1, a weight cache module 2, an output buffer module 3, a controller 4, a weight preprocessing module 5, an activation-value preprocessing module 6, a weight conversion module 7, an activation-value matrix conversion module 8, a dot-product module 9, a result-matrix conversion module 10, an accumulation module 11, a pooling module 12 and an activation module 13, wherein
1) the activation-value cache module 1 is connected to the controller 4, stores input pixel values or input feature-map values, and provides activation-value data to the activation-value preprocessing module 6;
2) the weight cache module 2 is connected to the controller 4, stores trained weights, and provides weight data to the weight preprocessing module 5;
3) the output buffer module 3 is connected to the controller 4 and stores the result of one convolutional layer; when the activation module 13 finishes outputting data, the data are passed to the output buffer module 3 for use by the next convolutional layer;
4) the controller 4 controls the transmission of the activation-value data, weight data and convolutional-layer data to be processed according to the computation flow;
5) the weight preprocessing module 5 receives the operand data transmitted by the weight cache module 2 and divides the convolution kernel, obtaining the four time-domain weight matrices K1, K2, K3, K4 to be processed.
The weight preprocessing module 5: (1) zero-pads a 5*5 convolution kernel, extending it into a 6*6 convolution matrix; (2) divides the 6*6 convolution matrix into four 3*3 convolution kernels. In this way the 5*5 convolution can be realized with the 3*3 Winograd output paradigm, efficiently and without increasing the number of multiplications or the power consumption; a sketch of this step is given after the division below.
The specific division is as follows, where K_input denotes a 5*5 time-domain input weight matrix, on the right is the 6*6 time-domain weight matrix obtained after padding, and below are the four time-domain weight matrices K1, K2, K3, K4 to be processed after division. In the calculation U = G K G^T, K takes the values K1, K2, K3, K4 in turn:
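A minimal software sketch of this padding-and-splitting step follows; since the figure showing the division is not reproduced here, the padding position and the quadrant order K1..K4 are assumptions.

```python
import numpy as np

def split_5x5_kernel(K_input):
    """Zero-pad a 5x5 kernel to 6x6 and split it into four 3x3 sub-kernels.

    The padding position and the quadrant order K1..K4 are assumptions;
    the figure showing the exact division is not reproduced here.
    """
    K6 = np.zeros((6, 6), dtype=K_input.dtype)
    K6[:5, :5] = K_input                             # one extra zero row and column
    return [K6[0:3, 0:3], K6[0:3, 3:6],              # K1, K2
            K6[3:6, 0:3], K6[3:6, 3:6]]              # K3, K4
```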
6) the activation-value preprocessing module 6 receives the operand data transmitted by the activation-value cache module 1, fetches activation values from the activation-value cache module 1 and divides them, obtaining the four time-domain activation-value matrices I1, I2, I3, I4 to be processed. In the calculation V = B^T I B, I takes the values I1, I2, I3, I4 in turn.
The activation-value preprocessing module 6 reads the activation values and preprocesses them. In the Winograd algorithm the activation values must correspond to the weights, and many of the data are reused, so an overlapping division is used. The activation-value preprocessing module 6 divides the 6*6 activation-value matrix into four overlapping 4*4 matrices, corresponding to the four 3*3 convolution kernels. The division is as follows, where I_input denotes a 6*6 time-domain input activation-value matrix and below are the four 4*4 time-domain activation-value matrices I1, I2, I3, I4 to be processed after division. In the calculation V = B^T I B, I takes the values I1, I2, I3, I4 in turn:
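A corresponding sketch of the overlapping activation split is given below; reading the overlap as a stride of 2 (the F(2*2, 3*3) output tile size) is an assumption, as is the tile order I1..I4, since the division figure is not reproduced here.

```python
def split_6x6_activation(I_input):
    """Split a 6x6 activation tile (NumPy array) into four overlapping 4x4
    tiles. A stride of 2, the F(2x2,3x3) output tile size, is assumed for
    the overlap, as is the tile order I1..I4."""
    return [I_input[r:r + 4, c:c + 4] for r in (0, 2) for c in (0, 2)]
```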
7) the weight conversion module 7 receives the operand data transmitted by the weight preprocessing module 5 and converts the weight data from the time domain to the Winograd domain, obtaining the Winograd-domain weight matrix U.
The weight conversion module 7 replaces the matrix multiplications in the calculation with row/column vector additions and shifts, thereby performing the weight-matrix transformation of the Winograd convolution and obtaining the Winograd-domain weight matrix U = G K G^T, where K denotes the time-domain weight matrix, G is the weight transform auxiliary matrix and U is the Winograd-domain weight matrix.
Concrete operations: the first row vector of the time-domain weight matrix K is taken as the first row of the provisional matrix C2, where C2 = GK. Because the value 1/2 appears in the transform matrix, the division by two is realized by a right shift: a positive weight is shifted right with a 0 filled in on the left, and a negative weight is shifted right with a 1 filled in on the left. The elements of the first, second and third rows of K are added and the result is shifted right by one bit to form the second row of C2; the elements of the first, second and third rows of K are added and the result is shifted right by one bit to form the third row of C2; the third row vector of K forms the fourth row of C2. The first column vector of C2 forms the first column of the Winograd-domain weight matrix U; the first, second and third columns of C2 are added and shifted right by one bit to form the second column of U; the first, second and third columns of C2 are added and shifted right by one bit to form the third column of U; the third column vector of C2 forms the fourth column of U, finally yielding the Winograd-domain weight matrix U.
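The shift-and-add weight transform can be modelled in software as follows. The signs follow the standard F(2*2, 3*3) matrix G (the third combination uses row1 - row2 + row3), which is an assumption since the patent's matrix figure is not reproduced; the right shift by one bit plays the role of the 1/2 factors, with the sign bit shifted in for negative values as described above.

```python
def weight_transform_shift_add(K):
    """Model of the add/shift weight transform U = G K G^T for F(2x2,3x3).

    Signs follow the standard Winograd matrix G (assumed); the >> 1
    arithmetic right shift realizes the 1/2 factors, shifting in the sign
    bit for negatives and truncating odd sums as a hardware shift would.
    """
    def g_combine(M):                              # apply G to the rows of M
        r1, r2, r3 = M
        return [list(r1),
                [(a + b + c) >> 1 for a, b, c in zip(r1, r2, r3)],
                [(a - b + c) >> 1 for a, b, c in zip(r1, r2, r3)],
                list(r3)]
    C2 = g_combine(K)                              # C2 = G K, a 4x3 matrix
    C2_cols = list(zip(*C2))                       # work column-wise for C2 G^T
    return [list(row) for row in zip(*g_combine(C2_cols))]   # U, a 4x4 matrix
```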
8) the activation-value matrix conversion module 8 receives the operand data transmitted by the activation-value preprocessing module 6 and converts the activation values from the time domain to the Winograd domain, obtaining the Winograd-domain activation-value matrix V.
The activation-value matrix conversion module 8 replaces the matrix multiplications in the calculation with row/column vector additions and subtractions, thereby performing the transformation of the time-domain activation-value matrix in the Winograd convolution and obtaining the Winograd-domain activation-value matrix V = B^T I B, where I is the time-domain activation-value matrix, B is the activation-value transform auxiliary matrix and V is the Winograd-domain activation-value matrix.
Concrete operations: the first row of the time-domain activation-value matrix I minus its third row forms the first row of the provisional matrix C1, where C1 = B^T I; the second row of I plus its third row forms the second row of C1; the third row of I minus its second row forms the third row of C1; the second row of I minus its fourth row forms the fourth row of C1. The first column of C1 minus its third column forms the first column of the Winograd-domain activation-value matrix V; the second column of C1 plus its third column forms the second column of V; the third column of C1 minus its second column forms the third column of V; the second column of C1 minus its fourth column forms the fourth column of V, finally yielding the Winograd-domain activation-value matrix V.
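A software model of this add/subtract-only activation transform, following the row and column combinations exactly as described above:

```python
def activation_transform(I):
    """Model of the add/subtract activation transform V = B^T I B for
    F(2x2,3x3), following the row/column combinations described above."""
    def b_combine(M):                              # apply B^T to the rows of M
        r1, r2, r3, r4 = M
        return [[a - b for a, b in zip(r1, r3)],   # row1 - row3
                [a + b for a, b in zip(r2, r3)],   # row2 + row3
                [a - b for a, b in zip(r3, r2)],   # row3 - row2
                [a - b for a, b in zip(r2, r4)]]   # row2 - row4
    C1 = b_combine(I)                              # C1 = B^T I
    C1_cols = list(zip(*C1))                       # columns of C1
    return [list(row) for row in zip(*b_combine(C1_cols))]   # V = C1 B
```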
9) the dot-product module 9 receives the operand data transmitted by the weight conversion module 7 and the activation-value matrix conversion module 8 respectively, and performs the dot-product operation between the Winograd-domain activation-value matrix and the Winograd-domain weight matrix, obtaining the Winograd-domain dot-product result matrix M; this is the module that consumes the most computation time and resources in the convolution.
The dot-product module 9 performs the dot-product operation between the Winograd-domain weight matrix U and the Winograd-domain activation-value matrix V, obtaining the Winograd-domain dot-product result matrix M; the formula is M = U ⊙ V, where U is the Winograd-domain weight matrix and V is the Winograd-domain activation-value matrix. To make the data bit width of the dot product configurable, the dot-product module 9 has two operating modes, an 8-bit multiplier mode and a 16-bit multiplier mode, which respectively perform operations with 8-bit and 16-bit data widths, realizing 8*8-bit and 16*16-bit fixed-point multiplication. Specifically:
(1) As shown in Fig. 3, the 8-bit multiplier comprises a first gating unit 14, a first negation unit 15, a first shift unit 16, a first summing unit 17, a second gating unit 18, a second negation unit 19 and a third gating unit 20 connected in sequence, wherein
the first gating unit 14 receives the data of the weight conversion module 7 and the activation-value matrix conversion module 8 as well as the sign-control signal of the weight conversion module 7;
the first negation unit 15 receives the data of the first gating unit 14 and negates the received data;
the first shift unit 16 receives the data of the first negation unit 15 as well as the sign-bit information of the first gating unit 14, and shifts the received data according to the sign information;
the first summing unit 17 receives the data of the first shift unit 16 and accumulates the received data;
the second gating unit 18 receives the data of the first summing unit 17 and the sign-bit information of the first gating unit 14 and passes them to the second negation unit 19;
the second negation unit 19 receives the data of the second gating unit 18 and negates the received data;
the third gating unit 20 receives the data of the second negation unit 19 and of the first summing unit 17 respectively, and outputs the result.
Concrete operations of the 8-bit multiplier: the sign bits of the two operands are XORed to obtain the sign bit of the result. Each operand is then checked for its sign: if it is negative, the sign bit is removed and the remaining seven bits are inverted and incremented by 1; if it is positive, the lower seven bits are kept unchanged. After the signs have been handled, each binary digit of the multiplier B1 is examined in turn: if the digit is 1, the corresponding partial product is the seven-bit multiplicand A1 shifted left by the corresponding position; if the digit is 0, the corresponding partial product is 0. After all seven digits of B1 have been examined, all partial products are added to give the product H2. The product is then adjusted according to the result sign bit: if the result sign bit is 1, H2 is inverted and incremented by 1; if it is 0, it is kept unchanged, giving the product H3. Finally the result sign bit is written into the sign position of H3 to obtain the final result. The unsigned 8-bit multiply does not consider sign bits and simply shifts and adds the multiplicand according to the eight data bits of B1.
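A behavioural model of this signed 8-bit shift-and-add multiply is sketched below (bit widths as described; this is a software illustration under those assumptions, not the hardware circuit).

```python
def mul8_signed(a, b):
    """Behavioural model of the signed 8-bit shift-and-add multiplier:
    result sign = XOR of the operand signs, magnitudes multiplied by adding
    shifted copies of |a|, sign re-applied at the end. Note that -128 is not
    representable in this 7-bit sign-magnitude scheme."""
    a &= 0xFF
    b &= 0xFF
    sign = ((a >> 7) ^ (b >> 7)) & 1                    # XOR of the sign bits
    mag_a = (~a + 1) & 0x7F if a & 0x80 else a & 0x7F   # |a| on 7 bits
    mag_b = (~b + 1) & 0x7F if b & 0x80 else b & 0x7F   # |b| on 7 bits
    acc = 0
    for i in range(7):                                  # shift-and-add over |b|
        if (mag_b >> i) & 1:
            acc += mag_a << i                           # partial product
    return (~acc + 1) & 0xFFFF if sign else acc         # re-apply the sign
```

For example, mul8_signed(-3, 5) returns 0xFFF1, the 16-bit two's-complement encoding of -15.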
(2) As shown in Fig. 4, the 16-bit multiplier comprises a fourth gating unit 21, a third negation unit 22, 8-bit multipliers 23, a second shift unit 24, a second summing unit 25, a fifth gating unit 26, a fourth negation unit 27 and a sixth gating unit 28 connected in sequence, wherein
the fourth gating unit 21 receives the data of the weight conversion module 7 and the activation-value matrix conversion module 8 as well as the sign-control signal of the weight conversion module 7;
the third negation unit 22 receives the data of the fourth gating unit 21 and negates the received data;
the 8-bit multipliers 23 perform operations with 8-bit data width, realizing 8*8-bit fixed-point multiplication;
the second shift unit 24 receives the data of the 8-bit multipliers 23 and shifts the received data;
the second summing unit 25 receives the data of the second shift unit 24 and accumulates the received data;
the fifth gating unit 26 receives the data of the second summing unit 25 and the sign-bit information of the fourth gating unit 21 and passes them to the fourth negation unit 27;
the fourth negation unit 27 receives the data of the fifth gating unit 26 and negates the received data;
the sixth gating unit 28 receives the data of the fourth negation unit 27 and outputs the result.
The 16-bit multiplier is realized with four 8-bit multiplier devices, whose gating signal is 0, i.e. they operate as unsigned multipliers. First, each of the two 16-bit operands is examined according to its sign bit: a positive operand is kept unchanged, a negative operand is inverted and incremented by 1. Next, the resulting 16-bit numbers are each split into a high byte and a low byte, and the corresponding bytes are multiplied. The product of the two high bytes is shifted left by 16 bits; the product of the high byte of operand D and the low byte of operand E is added to the product of the low byte of D and the high byte of E, and the sum is shifted left by 8 bits; the shifted results are added to the product of the two low bytes, giving the product L. Finally, the sign of the result decides whether to negate: if the result sign bit is 1, L is inverted and incremented by 1; if it is 0, it is kept unchanged. The sign bit is then placed in the most significant position of the product L to give the final output.
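A behavioural model of the 16-bit multiply built from four unsigned 8*8 products, following the split/shift/sum/sign steps described above (a sketch, not the hardware circuit):

```python
def mul16_signed(d, e):
    """Behavioural model of the 16-bit multiplier built from four unsigned
    8x8 products: |d| and |e| are split into high/low bytes, the partial
    products are shifted and summed, and the sign is re-applied."""
    sign = (d < 0) ^ (e < 0)
    mag_d, mag_e = abs(d) & 0xFFFF, abs(e) & 0xFFFF
    dh, dl = mag_d >> 8, mag_d & 0xFF                # high / low byte of |d|
    eh, el = mag_e >> 8, mag_e & 0xFF                # high / low byte of |e|
    # four unsigned 8x8 partial products (the gating signal selects unsigned mode)
    acc = ((dh * eh) << 16) + ((dh * el + dl * eh) << 8) + dl * el
    return -acc if sign else acc                     # re-apply the sign
```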
10) the result-matrix conversion module 10 receives the operand data transmitted by the dot-product module 9 and converts the dot-product result matrix from the Winograd domain back to the time domain, obtaining the converted time-domain dot-product result matrix F.
The result-matrix conversion module 10 performs the transformation F = A^T M A of the Winograd-domain dot-product result matrix M through row/column vector shift, addition and subtraction operations, where M is the Winograd-domain dot-product result matrix, A is the transform auxiliary matrix of M, and F is the time-domain dot-product result matrix.
Concrete operations: the first, second and third rows of the Winograd-domain dot-product result matrix M are added to form the first row of the provisional matrix C3, where C3 = A^T M; the second, third and fourth rows of M are added to form the second row of C3. The first, second and third columns of C3 are added to form the first column of the converted time-domain dot-product result matrix F; the second, third and fourth columns of C3 are added to form the second column of F, finally yielding the time-domain dot-product result matrix F.
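A software model of the inverse transform F = A^T M A is sketched below; the signs follow the standard A^T matrix for F(2*2, 3*3) from the Winograd literature, which is an assumption since the patent's matrix figure is not reproduced here.

```python
def output_transform(M):
    """Model of the inverse transform F = A^T M A for F(2x2,3x3).
    Signs follow the standard A^T = [[1,1,1,0],[0,1,-1,-1]] (assumed)."""
    def a_combine(X):                              # apply A^T to the rows of X
        r1, r2, r3, r4 = X
        return [[a + b + c for a, b, c in zip(r1, r2, r3)],   # r1 + r2 + r3
                [b - c - d for b, c, d in zip(r2, r3, r4)]]   # r2 - r3 - r4
    C3 = a_combine(M)                              # C3 = A^T M, a 2x4 matrix
    C3_cols = list(zip(*C3))                       # columns of C3
    return [list(row) for row in zip(*a_combine(C3_cols))]   # F = C3 A, 2x2
```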
11) the accumulation module 11 receives the operand data transmitted by the result-matrix conversion module 10 and accumulates the received data, obtaining the final convolution result, a 2*2 result matrix;
12) the pooling module 12 receives the operand data transmitted by the accumulation module 11 and pools the final convolution result matrix. Different pooling methods can be applied to the input neurons, including taking the maximum, the average or the minimum. Since the result matrix finally output by the Winograd convolution F(2*2, 3*3) is of size 2*2, a 2*2 pooling operation can be performed directly, and the pooling result is obtained with three comparisons: the first comparison is between the two numbers in the first row of the result matrix, the second is between the two numbers in the second row, and the third compares the results of the previous two comparisons, giving the maximum-pooling result of the result matrix.
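The three-comparison 2*2 max pooling can be written directly as:

```python
def max_pool_2x2(F):
    """2x2 max pooling of the 2x2 output tile with three comparisons."""
    top = F[0][0] if F[0][0] > F[0][1] else F[0][1]       # compare first row
    bottom = F[1][0] if F[1][0] > F[1][1] else F[1][1]    # compare second row
    return top if top > bottom else bottom                # compare the two maxima
```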
13) the activation module 13 receives the operand data transmitted by the pooling module 12, applies the ReLU activation function to the pooling result, and transfers the activated result to the output buffer module 3.

Claims (9)

1. A Winograd-based configurable convolutional array accelerator architecture, characterized by comprising: an activation-value cache module (1), a weight cache module (2), an output buffer module (3), a controller (4), a weight preprocessing module (5), an activation-value preprocessing module (6), a weight conversion module (7), an activation-value matrix conversion module (8), a dot-product module (9), a result-matrix conversion module (10), an accumulation module (11), a pooling module (12) and an activation module (13), wherein
the activation-value cache module (1) is connected to the controller (4), stores input pixel values or input feature-map values, and provides activation-value data to the activation-value preprocessing module (6);
the weight cache module (2) is connected to the controller (4), stores trained weights, and provides weight data to the weight preprocessing module (5);
the output buffer module (3) is connected to the controller (4) and stores the result of one convolutional layer; when the activation module (13) finishes outputting data, the data are passed to the output buffer module (3) for use by the next convolutional layer;
the controller (4) controls the transmission of the activation-value data, weight data and convolutional-layer data to be processed according to the computation flow;
the weight preprocessing module (5) receives the operand data transmitted by the weight cache module (2) and divides the convolution kernel to obtain the time-domain weight matrix K;
the activation-value preprocessing module (6) receives the operand data transmitted by the activation-value cache module (1), fetches activation values from the activation-value cache module (1) and divides them to obtain the time-domain activation-value matrix I;
the weight conversion module (7) receives the operand data transmitted by the weight preprocessing module (5) and converts the weight data from the time domain to the Winograd domain, obtaining the Winograd-domain weight matrix U;
the activation-value matrix conversion module (8) receives the operand data transmitted by the activation-value preprocessing module (6) and converts the activation values from the time domain to the Winograd domain, obtaining the Winograd-domain activation-value matrix V;
the dot-product module (9) receives the operand data transmitted by the weight conversion module (7) and the activation-value matrix conversion module (8) respectively, and performs the dot-product operation between the Winograd-domain activation-value matrix and the Winograd-domain weight matrix, obtaining the Winograd-domain dot-product result matrix M;
the result-matrix conversion module (10) receives the operand data transmitted by the dot-product module (9) and converts the dot-product result matrix from the Winograd domain back to the time domain, obtaining the converted time-domain dot-product result matrix F;
the accumulation module (11) receives the operand data transmitted by the result-matrix conversion module (10) and accumulates the received data, obtaining the final convolution result;
the pooling module (12) receives the operand data transmitted by the accumulation module (11) and pools the final convolution result matrix;
the activation module (13) receives the operand data transmitted by the pooling module (12), applies the ReLU activation function to the pooling result, and transfers the activated result to the output buffer module (3).
2. The Winograd-based configurable convolutional array accelerator architecture according to claim 1, characterized in that the weight preprocessing module (5):
(1) zero-pads a 5*5 convolution kernel, extending it into a 6*6 convolution matrix;
(2) divides the 6*6 convolution matrix into four 3*3 convolution kernels;
the specific division is as follows, where K_input denotes a 5*5 weight matrix and below it are the four time-domain weight matrices K1, K2, K3, K4 to be processed after division; in the calculation U = G K G^T, K takes the values K1, K2, K3, K4 in turn:
3. The Winograd-based configurable convolutional array accelerator architecture according to claim 1, characterized in that the activation-value preprocessing module (6) divides a 6*6 activation-value matrix into four overlapping 4*4 matrices; the division is as follows, where I_input denotes a 6*6 activation-value matrix and below it are the four 4*4 time-domain activation-value matrices I1, I2, I3, I4 to be processed after division; in the calculation V = B^T I B, I takes the values I1, I2, I3, I4 in turn:
4. The Winograd-based configurable convolutional array accelerator architecture according to claim 1, characterized in that the weight conversion module (7) replaces the matrix multiplications in the calculation with row/column vector additions and shifts, thereby performing the weight-matrix transformation of the Winograd convolution and obtaining the Winograd-domain weight matrix U = G K G^T, where K denotes the time-domain weight matrix, G is the weight transform auxiliary matrix and U is the Winograd-domain weight matrix;
concrete operations: the first row vector of the weight matrix K is taken as the first row of the provisional matrix C2, where C2 = GK; the division by two is realized by a right shift, a positive weight being shifted right with a 0 filled in on the left and a negative weight being shifted right with a 1 filled in on the left; the elements of the first, second and third rows of K are added and the result is shifted right by one bit to form the second row of C2; the elements of the first, second and third rows of K are added and the result is shifted right by one bit to form the third row of C2; the third row vector of K forms the fourth row of C2; the first column vector of C2 forms the first column of the Winograd-domain weight matrix U; the first, second and third columns of C2 are added and shifted right by one bit to form the second column of U; the first, second and third columns of C2 are added and shifted right by one bit to form the third column of U; the third column vector of C2 forms the fourth column of U, finally yielding the Winograd-domain weight matrix U.
5. The Winograd-based configurable convolutional array accelerator architecture according to claim 1, characterized in that the activation-value matrix conversion module (8) replaces the matrix multiplications in the calculation with row/column vector additions and subtractions, thereby performing the transformation of the time-domain activation-value matrix in the Winograd convolution and obtaining the matrix V = B^T I B, where I is the time-domain activation-value matrix, B is the activation-value transform auxiliary matrix and V is the Winograd-domain activation-value matrix;
concrete operations: the first row of the time-domain activation-value matrix I minus its third row forms the first row of the provisional matrix C1, where C1 = B^T I; the second row of I plus its third row forms the second row of C1; the third row of I minus its second row forms the third row of C1; the second row of I minus its fourth row forms the fourth row of C1; the first column of C1 minus its third column forms the first column of the Winograd-domain activation-value matrix V; the second column of C1 plus its third column forms the second column of V; the third column of C1 minus its second column forms the third column of V; the second column of C1 minus its fourth column forms the fourth column of V, finally yielding the Winograd-domain activation-value matrix V.
6. The Winograd-based configurable convolutional array accelerator architecture according to claim 1, characterized in that the dot-product module (9) performs the dot-product operation between the Winograd-domain weight matrix U and the Winograd-domain activation-value matrix V, obtaining the Winograd-domain dot-product result matrix M, expressed by the formula M = U ⊙ V, where U is the Winograd-domain weight matrix and V is the Winograd-domain activation-value matrix; to make the data bit width of the dot product configurable, the dot-product module (9) has two operating modes, an 8-bit multiplier mode and a 16-bit multiplier mode, which respectively perform operations with 8-bit and 16-bit data widths, realizing 8*8-bit and 16*16-bit fixed-point multiplication.
7. The Winograd-based configurable convolutional array accelerator architecture according to claim 6, characterized in that the 8-bit multiplier comprises a first gating unit (14), a first negation unit (15), a first shift unit (16), a first summing unit (17), a second gating unit (18), a second negation unit (19) and a third gating unit (20) connected in sequence, wherein
the first gating unit (14) receives the data of the weight conversion module (7) and the activation-value matrix conversion module (8) as well as the sign-control signal of the weight conversion module (7);
the first negation unit (15) receives the data of the first gating unit (14) and negates the received data;
the first shift unit (16) receives the data of the first negation unit (15) as well as the sign-bit information of the first gating unit (14), and shifts the received data according to the sign information;
the first summing unit (17) receives the data of the first shift unit (16) and accumulates the received data;
the second gating unit (18) receives the data of the first summing unit (17) and the sign-bit information of the first gating unit (14) and passes them to the second negation unit (19);
the second negation unit (19) receives the data of the second gating unit (18) and negates the received data;
the third gating unit (20) receives the data of the second negation unit (19) and of the first summing unit (17) respectively, and outputs the result.
8. The Winograd-based configurable convolutional array accelerator architecture according to claim 6, characterized in that the 16-bit multiplier comprises a fourth gating unit (21), a third negation unit (22), 8-bit multipliers (23), a second shift unit (24), a second summing unit (25), a fifth gating unit (26), a fourth negation unit (27) and a sixth gating unit (28) connected in sequence, wherein
the fourth gating unit (21) receives the data of the weight conversion module (7) and the activation-value matrix conversion module (8) as well as the sign-control signal of the weight conversion module (7);
the third negation unit (22) receives the data of the fourth gating unit (21) and negates the received data;
the 8-bit multipliers (23) perform operations with 8-bit data width, realizing 8*8-bit fixed-point multiplication;
the second shift unit (24) receives the data of the 8-bit multipliers (23) and shifts the received data;
the second summing unit (25) receives the data of the second shift unit (24) and accumulates the received data;
the fifth gating unit (26) receives the data of the second summing unit (25) and the sign-bit information of the fourth gating unit (21) and passes them to the fourth negation unit (27);
the fourth negation unit (27) receives the data of the fifth gating unit (26) and negates the received data;
the sixth gating unit (28) receives the data of the fourth negation unit (27) and outputs the result.
9. The Winograd-based configurable convolutional array accelerator architecture according to claim 1, characterized in that the result-matrix conversion module (10) performs the transformation F = A^T M A of the Winograd-domain dot-product result matrix M through row/column vector shift, addition and subtraction operations, where M is the Winograd-domain dot-product result matrix, A is the transform auxiliary matrix of M, and F is the time-domain dot-product result matrix;
concrete operations: the first, second and third rows of the Winograd-domain dot-product result matrix M are added to form the first row of the provisional matrix C3, where C3 = A^T M; the second, third and fourth rows of M are added to form the second row of C3; the first, second and third columns of C3 are added to form the first column of the converted time-domain dot-product result matrix F; the second, third and fourth columns of C3 are added to form the second column of F, finally yielding the converted time-domain dot-product result matrix F.
CN201910511987.6A 2019-06-13 2019-06-13 Winograd-based configurable convolution array accelerator structure Active CN110288086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910511987.6A CN110288086B (en) 2019-06-13 2019-06-13 Winograd-based configurable convolution array accelerator structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910511987.6A CN110288086B (en) 2019-06-13 2019-06-13 Winograd-based configurable convolution array accelerator structure

Publications (2)

Publication Number Publication Date
CN110288086A true CN110288086A (en) 2019-09-27
CN110288086B CN110288086B (en) 2023-07-21

Family

ID=68004097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910511987.6A Active CN110288086B (en) 2019-06-13 2019-06-13 Winograd-based configurable convolution array accelerator structure

Country Status (1)

Country Link
CN (1) CN110288086B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325332A (en) * 2020-02-18 2020-06-23 百度在线网络技术(北京)有限公司 Convolutional neural network processing method and device
CN112580793A (en) * 2020-12-24 2021-03-30 清华大学 Neural network accelerator based on time domain memory computing and acceleration method
CN112639839A (en) * 2020-05-22 2021-04-09 深圳市大疆创新科技有限公司 Arithmetic device of neural network and control method thereof
CN112734827A (en) * 2021-01-07 2021-04-30 京东鲲鹏(江苏)科技有限公司 Target detection method and device, electronic equipment and storage medium
WO2021083097A1 (en) * 2019-11-01 2021-05-06 中科寒武纪科技股份有限公司 Data processing method and apparatus, and computer device and storage medium
WO2021082747A1 (en) * 2019-11-01 2021-05-06 中科寒武纪科技股份有限公司 Operational apparatus and related product
CN112862091A (en) * 2021-01-26 2021-05-28 合肥工业大学 Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN112949845A (en) * 2021-03-08 2021-06-11 内蒙古大学 Deep convolutional neural network accelerator based on FPGA
CN113269302A (en) * 2021-05-11 2021-08-17 中山大学 Winograd processing method and system for 2D and 3D convolutional neural networks
CN113283591A (en) * 2021-07-22 2021-08-20 南京大学 Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
CN113407904A (en) * 2021-06-09 2021-09-17 中山大学 Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network
CN113554163A (en) * 2021-07-27 2021-10-26 深圳思谋信息科技有限公司 Convolutional neural network accelerator
CN113656751A (en) * 2021-08-10 2021-11-16 上海新氦类脑智能科技有限公司 Method, device, equipment and medium for realizing signed operation of unsigned DAC (digital-to-analog converter)
CN114399036A (en) * 2022-01-12 2022-04-26 电子科技大学 Efficient convolution calculation unit based on one-dimensional Winograd algorithm

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793199A (en) * 2014-01-24 2014-05-14 天津大学 Rapid RSA cryptography coprocessor capable of supporting dual domains
US20160342893A1 (en) * 2015-05-21 2016-11-24 Google Inc. Rotating data for neural network computations
CN107862374A (en) * 2017-10-30 2018-03-30 中国科学院计算技术研究所 Processing with Neural Network system and processing method based on streamline
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN109190755A (en) * 2018-09-07 2019-01-11 中国科学院计算技术研究所 Matrix conversion device and method towards neural network
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device
CN109325591A (en) * 2018-09-26 2019-02-12 中国科学院计算技术研究所 Neural network processor towards Winograd convolution
CN109359730A (en) * 2018-09-26 2019-02-19 中国科学院计算技术研究所 Neural network processor towards fixed output normal form Winograd convolution
CN109447241A (en) * 2018-09-29 2019-03-08 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793199A (en) * 2014-01-24 2014-05-14 天津大学 Rapid RSA cryptography coprocessor capable of supporting dual domains
US20160342893A1 (en) * 2015-05-21 2016-11-24 Google Inc. Rotating data for neural network computations
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN107862374A (en) * 2017-10-30 2018-03-30 中国科学院计算技术研究所 Processing with Neural Network system and processing method based on streamline
CN109190755A (en) * 2018-09-07 2019-01-11 中国科学院计算技术研究所 Matrix conversion device and method towards neural network
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device
CN109325591A (en) * 2018-09-26 2019-02-12 中国科学院计算技术研究所 Neural network processor towards Winograd convolution
CN109359730A (en) * 2018-09-26 2019-02-19 中国科学院计算技术研究所 Neural network processor towards fixed output normal form Winograd convolution
CN109447241A (en) * 2018-09-29 2019-03-08 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANDREW LAVIN et al.: "Fast Algorithms for Convolutional Neural Networks", 2016 IEEE Conference on Computer Vision and Pattern Recognition *
LINGCHUAN MENG et al.: "Efficient Winograd Convolution via Integer Arithmetic", arXiv *
LIQIANG LU et al.: "SpWA: An Efficient Sparse Winograd Convolutional Neural Networks Accelerator on FPGAs", ACM *
Y HUANG et al.: "A High-efficiency FPGA-based Accelerator for Convolutional Neural Networks using Winograd Algorithm", Journal of Physics: Conference Series *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765538A (en) * 2019-11-01 2021-05-07 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN112765538B (en) * 2019-11-01 2024-03-29 中科寒武纪科技股份有限公司 Data processing method, device, computer equipment and storage medium
WO2021083097A1 (en) * 2019-11-01 2021-05-06 中科寒武纪科技股份有限公司 Data processing method and apparatus, and computer device and storage medium
WO2021082747A1 (en) * 2019-11-01 2021-05-06 中科寒武纪科技股份有限公司 Operational apparatus and related product
CN111325332B (en) * 2020-02-18 2023-09-08 百度在线网络技术(北京)有限公司 Convolutional neural network processing method and device
CN111325332A (en) * 2020-02-18 2020-06-23 百度在线网络技术(北京)有限公司 Convolutional neural network processing method and device
WO2021232422A1 (en) * 2020-05-22 2021-11-25 深圳市大疆创新科技有限公司 Neural network arithmetic device and control method thereof
CN112639839A (en) * 2020-05-22 2021-04-09 深圳市大疆创新科技有限公司 Arithmetic device of neural network and control method thereof
CN112580793B (en) * 2020-12-24 2022-08-12 清华大学 Neural network accelerator based on time domain memory computing and acceleration method
CN112580793A (en) * 2020-12-24 2021-03-30 清华大学 Neural network accelerator based on time domain memory computing and acceleration method
CN112734827A (en) * 2021-01-07 2021-04-30 京东鲲鹏(江苏)科技有限公司 Target detection method and device, electronic equipment and storage medium
CN112862091A (en) * 2021-01-26 2021-05-28 合肥工业大学 Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN112949845A (en) * 2021-03-08 2021-06-11 内蒙古大学 Deep convolutional neural network accelerator based on FPGA
CN113269302A (en) * 2021-05-11 2021-08-17 中山大学 Winograd processing method and system for 2D and 3D convolutional neural networks
CN113407904A (en) * 2021-06-09 2021-09-17 中山大学 Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network
CN113283591B (en) * 2021-07-22 2021-11-16 南京大学 Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
CN113283591A (en) * 2021-07-22 2021-08-20 南京大学 Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
CN113554163A (en) * 2021-07-27 2021-10-26 深圳思谋信息科技有限公司 Convolutional neural network accelerator
CN113554163B (en) * 2021-07-27 2024-03-29 深圳思谋信息科技有限公司 Convolutional neural network accelerator
CN113656751A (en) * 2021-08-10 2021-11-16 上海新氦类脑智能科技有限公司 Method, device, equipment and medium for realizing signed operation of unsigned DAC (digital-to-analog converter)
CN113656751B (en) * 2021-08-10 2024-02-27 上海新氦类脑智能科技有限公司 Method, apparatus, device and medium for realizing signed operation by unsigned DAC
CN114399036A (en) * 2022-01-12 2022-04-26 电子科技大学 Efficient convolution calculation unit based on one-dimensional Winograd algorithm
CN114399036B (en) * 2022-01-12 2023-08-22 电子科技大学 Efficient convolution calculation unit based on one-dimensional Winograd algorithm

Also Published As

Publication number Publication date
CN110288086B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN110288086A (en) A kind of configurable convolution array accelerator structure based on Winograd
CN105681628B (en) A kind of convolutional network arithmetic element and restructural convolutional neural networks processor and the method for realizing image denoising processing
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN109478144B (en) Data processing device and method
CN109598338A (en) A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA
CN108665059A (en) Convolutional neural networks acceleration system based on field programmable gate array
CN105512723B (en) A kind of artificial neural networks apparatus and method for partially connected
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
CN107862374A (en) Processing with Neural Network system and processing method based on streamline
CN108256636A (en) A kind of convolutional neural networks algorithm design implementation method based on Heterogeneous Computing
CN110222760B (en) Quick image processing method based on winograd algorithm
CN106127302A (en) Process the circuit of data, image processing system, the method and apparatus of process data
CN104915322A (en) Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN110390383A (en) A kind of deep neural network hardware accelerator based on power exponent quantization
CN107203808B (en) A kind of two-value Convole Unit and corresponding two-value convolutional neural networks processor
CN108537330A (en) Convolutional calculation device and method applied to neural network
CN103020890A (en) Visual processing device based on multi-layer parallel processing
CN111626403B (en) Convolutional neural network accelerator based on CPU-FPGA memory sharing
CN108009126A (en) A kind of computational methods and Related product
CN117933314A (en) Processing device, processing method, chip and electronic device
CN110991630A (en) Convolutional neural network processor for edge calculation
CN115880132B (en) Graphics processor, matrix multiplication task processing method, device and storage medium
CN109885406B (en) Operator calculation optimization method, device, equipment and storage medium
CN110580519B (en) Convolution operation device and method thereof
CN108334944A (en) A kind of device and method of artificial neural network operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant