CN110288086B - Winograd-based configurable convolution array accelerator structure - Google Patents

Winograd-based configurable convolution array accelerator structure

Info

Publication number
CN110288086B
Authority
CN
China
Prior art keywords
matrix
weight
module
winograd
activation value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910511987.6A
Other languages
Chinese (zh)
Other versions
CN110288086A (en)
Inventor
魏继增
徐文富
王宇吉
郭炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201910511987.6A priority Critical patent/CN110288086B/en
Publication of CN110288086A publication Critical patent/CN110288086A/en
Application granted granted Critical
Publication of CN110288086B publication Critical patent/CN110288086B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Neurology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

A Winograd-based configurable convolutional array accelerator structure comprising: an activation value caching module, a weight caching module, an output caching module, a controller, a weight preprocessing module, an activation value preprocessing module, a weight conversion module, an activation value matrix conversion module, a dot multiplication module, a result matrix conversion module, an accumulation module, a pooling module, and an activation module. In this Winograd-based configurable convolution array accelerator structure, a convolution array accelerator with configurable bit width is designed according to the operational characteristics of a fixed-paradigm Winograd convolution algorithm, flexibly meeting the bit-width requirements of different neural networks and different convolution layers. In addition, a dedicated multiplier unit with configurable data bit width is designed, improving the computational efficiency of neural network convolution and reducing computation power consumption.

Description

Winograd-based configurable convolution array accelerator structure
Technical Field
The present invention relates to configurable convolutional array accelerator structures, and more particularly to a Winograd-based configurable convolutional array accelerator structure.
Background
Neural networks excel in many application fields, particularly image-related tasks; they have begun to replace most traditional algorithms in computer vision problems such as image classification, image semantic segmentation, image retrieval, and object detection, and are gradually being deployed on terminal devices.
However, the computational load of neural networks is enormous, leading to problems such as low processing speed and high power consumption. A neural network workload consists mainly of a training phase and an inference phase. To obtain high-precision results, training must derive the weights from massive data through repeated iterative computation. In the inference phase, the computation on the input data must complete within a very short response time (typically on the order of milliseconds), especially when the network is applied in real-time systems such as autonomous driving. The computations involved in a neural network mainly include convolution operations, activation operations, and pooling operations.
Studies have shown that convolution accounts for more than 90% of a neural network's computation time. The conventional convolution algorithm computes each element of the output feature map through repeated multiply-accumulate operations. While solutions using this algorithm have met with preliminary success, a more efficient algorithm yields higher overall efficiency. Researchers have therefore proposed the Winograd convolution algorithm, which performs an equivalent convolution task while reducing the number of multiplications by applying specific data-domain transforms to the input feature map and the weights. Because most neural network processor chips in practical applications run a fixed network model during prediction, the Winograd convolution output form is usually fixed, the computation flow is well defined, and there is considerable room for optimization. How to design and optimize a Winograd-based neural network accelerator structure is therefore an important research topic.
In addition, for the vast majority of neural network applications, fixed-point input data achieves good experimental results while improving speed and reducing power consumption. However, the convolution data bit width in existing fixed-point neural networks is fixed and cannot be configured flexibly, which limits applicability. In general, a 16-bit data width meets the precision requirements of neural networks, and for some networks and scenarios with lower precision requirements an 8-bit data width also suffices. Making the data bit width configurable in a neural network therefore enables better optimization.
Disclosure of Invention
The invention aims to solve the technical problem of providing a Winograd-based configurable convolution array accelerator structure capable of improving the calculation efficiency of neural network convolution operation.
The technical scheme adopted by the invention is as follows: a Winograd-based configurable convolutional array accelerator structure comprising: an activation value caching module, a weight caching module, an output caching module, a controller, a weight preprocessing module, an activation value preprocessing module, a weight conversion module, an activation value matrix conversion module, a dot multiplication module, a result matrix conversion module, an accumulation module, a pooling module and an activation module,
the activation value buffer module is used for storing an input pixel value or an input characteristic diagram value, is connected with the controller and provides activation value data for the activation value preprocessing module;
the weight buffer module is used for storing the trained weight values, is connected with the controller and provides weight data for the weight preprocessing module;
the output buffer module is used for storing the primary convolution layer result, is connected with the controller, and transmits the data to the output buffer module for the next layer convolution after the data output by the activation module is completed;
the controller is used for controlling transmission of the activation value data, the weight data and the convolution layer data to be processed according to the calculation process;
the weight preprocessing module is used for receiving the data to be operated, which is transmitted by the weight caching module, and dividing a convolution kernel to obtain a time domain weight matrix K;
the activation value preprocessing module is used for receiving the data to be operated, which is transmitted by the activation value caching module, and extracting an activation value from the activation value caching module, and dividing the activation value to obtain a time domain activation value matrix I;
the weight conversion module is used for receiving the data to be operated, which is transmitted by the weight preprocessing module, and converting the weight data from a time domain to a Winograd domain to obtain a Winograd domain weight matrix U;
the activation value matrix conversion module is used for receiving the data to be operated, which is transmitted by the activation value preprocessing module, and converting the activation value from a time domain to a Winograd domain to obtain a Winograd domain activation value matrix V;
the dot multiplication module is used for respectively receiving the data to be operated, which are transmitted by the weight conversion module and the activation value matrix conversion module, and realizing dot product operation of the Winograd domain activation value matrix and the Winograd domain weight matrix to obtain a Winograd domain dot product result matrix M;
the result matrix conversion module is used for receiving the data to be operated transmitted by the dot product module and converting the dot product result matrix from a Winograd domain to a time domain to obtain a converted time domain dot product result matrix F;
the accumulation module is used for receiving the data to be operated, which is transmitted by the result matrix conversion module, and accumulating the received data to obtain a final convolution result;
the pooling module is used for receiving the data to be operated transmitted by the accumulation module and pooling the final convolution result array;
and the activation module is used for receiving the data to be operated on, transmitted by the pooling module, applying the ReLU activation function to the pooling result to obtain the activated result, and transmitting the activated result to the output buffer module.
The weight preprocessing module:
(1) extends a convolution kernel of size 5×5 to a 6×6 convolution matrix by zero padding;
(2) divides the 6×6 convolution matrix into four 3×3 convolution kernels.
The specific division is shown below, where K_input represents the 5×5 weight matrix and below it are the 4 corresponding divided time domain weight matrices to be processed, K1, K2, K3, K4. In calculating U = G K G^T, K takes the values K1, K2, K3, K4 in turn.
The activation value preprocessing module divides an activation value matrix of size 6×6 into 4 overlapping matrices of size 4×4. The division is as follows, where I_input represents the 6×6 activation value matrix and below it are the divided 4×4 time domain activation value matrices to be processed, I1, I2, I3, I4. In calculating V = B^T I B, I takes the values I1, I2, I3, I4 in turn.
The weight conversion module performs the matrix multiplication through row and column vector addition, subtraction, and one-bit shifts, thereby carrying out the weight matrix conversion in Winograd convolution to obtain the Winograd domain weight matrix U = G K G^T, where K represents the time domain weight matrix, G is the weight conversion auxiliary matrix, and U is the Winograd domain weight matrix.
The specific operation is as follows: take the first row vector of the weight matrix K as the first row of a temporary matrix C2, where C2 = G K. Division by two is implemented as a one-bit right shift: when a value is positive it is shifted right with 0 filled on the left, and when a value is negative it is shifted right with 1 filled on the left. Take the vector obtained by adding the first, second, and third rows of K and shifting right by one bit as the second row of C2; take the vector obtained by adding the first and third rows of K, subtracting the second row, and shifting right by one bit as the third row of C2; take the third row vector of K as the fourth row of C2. Then take the first column vector of C2 as the first column of the Winograd domain weight matrix U; take the vector obtained by adding the first, second, and third columns of C2 and shifting right by one bit as the second column of U; take the vector obtained by adding the first and third columns of C2, subtracting the second column, and shifting right by one bit as the third column of U; take the third column vector of C2 as the fourth column of U, finally obtaining the Winograd domain weight matrix U.
The activation value matrix conversion module completes the matrix multiplication through addition and subtraction of row and column vectors, thereby carrying out the conversion of the time domain activation value matrix in Winograd convolution to obtain the matrix V = B^T I B, where I is the time domain activation value matrix, B is the activation value conversion auxiliary matrix, and V is the Winograd domain activation value matrix.
The specific operation is as follows: take the vector difference of the first row minus the third row of the time domain activation value matrix I as the first row of a temporary matrix C1, where C1 = B^T I; take the sum of the second and third rows of I as the second row of C1; take the vector difference of the third row minus the second row of I as the third row of C1; take the vector difference of the second row minus the fourth row of I as the fourth row of C1. Then take the vector difference of the first column minus the third column of C1 as the first column of the Winograd domain activation value matrix V; take the sum of the second and third columns of C1 as the second column of V; take the vector difference of the third column minus the second column of C1 as the third column of V; take the vector difference of the second column minus the fourth column of C1 as the fourth column of V, finally obtaining the Winograd domain activation value matrix V.
The dot multiplication module obtains the Winograd domain dot product result matrix M by performing the dot product of the Winograd domain weight matrix U and the Winograd domain activation value matrix V, expressed as M = U ⊙ V, where U is the Winograd domain weight matrix and V is the Winograd domain activation value matrix. The dot multiplication module implements a dot product with configurable data bit width and has two working modes, an 8-bit multiplier mode and a 16-bit multiplier mode, which perform operations at the two data bit widths of 8 bits and 16 bits respectively, realizing 8×8-bit and 16×16-bit fixed-point multiplication.
The 8-bit multiplier comprises a first gating unit, a first inverting unit, a first shifting unit, a first accumulating unit, a second gating unit, a second inverting unit and a third gating unit which are sequentially connected, wherein,
the first gating units respectively receive: the data information of the weight conversion module and the activation value matrix conversion module and the symbol control signal of the weight conversion module;
the first negation unit receives the data information of the first gating unit and performs negation on the received data;
the first shifting unit receives the data information of the first inverting unit, receives the symbol bit information of the first gating unit and shifts the received data according to the symbol information;
the first accumulation unit receives the data information of the first shift unit and accumulates the received data;
the second gating unit receives the data information of the first accumulating unit and the sign bit information of the first gating unit and transmits the data information and the sign bit information to the second inverting unit;
the second inverting unit receives the data information of the second gating unit and inverts the received data;
the third gating unit receives the data information of the second inverting unit and the first accumulating unit respectively and outputs the data information.
The 16-bit multiplier comprises a fourth gating unit, a third inverting unit, an 8-bit multiplier, a second shifting unit, a second accumulating unit, a fifth gating unit, a fourth inverting unit and a sixth gating unit which are sequentially connected,
the fourth strobe units respectively receive: the data information of the weight conversion module and the activation value matrix conversion module and the symbol control signal of the weight conversion module;
the third inverting unit receives the data information of the fourth gating unit and inverts the received data;
the 8-bit multiplier performs 8-bit data bit width operation to realize 8 x 8-bit fixed-point multiplication operation;
the second shifting unit receives the data information of the 8-bit multiplier and shifts the received data;
the second accumulation unit receives the data information of the second shift unit and accumulates the received data;
the fifth gating unit receives the data information of the second accumulating unit and the sign bit information of the fourth gating unit and transmits the data information and the sign bit information to the fourth inverting unit;
the fourth negation unit receives the data information of the fifth gating unit and inverts the received data;
the sixth gating unit receives the data information of the fourth inverting unit and outputs the data information.
The result matrix conversion module performs the conversion operation F = A^T M A on the Winograd domain dot product result matrix M through row and column vector addition and subtraction, where M is the Winograd domain dot product result matrix, A is the conversion auxiliary matrix of M, and F is the time domain dot product result matrix.
The specific operation is as follows: take the vector result of adding the first, second, and third rows of M as the first row of a temporary matrix C3, where C3 = A^T M; take the vector result of the second row of M minus its third and fourth rows as the second row of C3. Then take the vector result of adding the first, second, and third columns of C3 as the first column of the converted time domain dot product result matrix F; take the vector result of the second column of C3 minus its third and fourth columns as the second column of F, finally obtaining the converted time domain dot product result matrix F.
In the Winograd-based configurable convolution array accelerator structure of the invention, a convolution array accelerator with configurable bit width is designed according to the operational characteristics of a fixed-paradigm Winograd convolution algorithm, flexibly meeting the bit-width requirements of different neural networks and different convolution layers. In addition, a dedicated multiplier unit with configurable data bit width is designed, improving the computational efficiency of neural network convolution and reducing computation power consumption.
Drawings
FIG. 1 is a diagram of the overall architecture of a Winograd convolutional array accelerator;
FIG. 2 is a schematic diagram of the construction of a Winograd-based configurable convolutional array accelerator structure of the present invention;
FIG. 3 is a schematic diagram of an 8-bit multiplier in a data bit width scheme;
FIG. 4 is a schematic diagram of a 16-bit multiplier in a data bit width scheme.
Detailed Description
A detailed description of a Winograd-based configurable convolutional array accelerator structure of the present invention is provided below with reference to the examples and figures.
In the convolution calculation of a neural network, the Winograd conversion formula is as follows:

Out = A^T[(G K G^T) ⊙ (B^T I B)]A    (1)

where K denotes the time domain weight matrix, I denotes the time domain activation value matrix, and A, G, B denote the conversion matrices corresponding, respectively, to the dot product result matrix [(G K G^T) ⊙ (B^T I B)], the time domain weight matrix K, and the time domain activation value matrix I. The conversion matrices A, G, B are specifically shown as follows:
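Since the figure listing A, G, B did not survive reproduction here, the following sketch records the standard Winograd F(2×2, 3×3) conversion matrices, which are consistent with the row and column operations described later in this document, together with a quick check of formula (1) in Python (illustrative, not part of the patent):

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) conversion matrices (the patent's own
# figure was not reproduced; these match the row/column operations
# described later in this document).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                # weight transform (4x3)
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)  # activation transform (4x4)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)   # output transform (2x4)

# Check formula (1) against direct 3x3 convolution of a 4x4 tile:
# Winograd uses 4*4 = 16 multiplications where the direct method uses
# 2*2*9 = 36.
rng = np.random.default_rng(0)
K = rng.integers(-8, 8, (3, 3)).astype(float)   # time domain weights
I = rng.integers(-8, 8, (4, 4)).astype(float)   # time domain activations
out = A_T @ ((G @ K @ G.T) * (B_T @ I @ B_T.T)) @ A_T.T
ref = np.array([[np.sum(I[r:r+3, c:c+3] * K) for c in range(2)]
                for r in range(2)])
assert np.allclose(out, ref)
```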
the output paradigm of the Winograd convolution used in the present invention is F (2×2,3×3), where the first parameter 2×2 represents the size of the output feature map and the second parameter 3*3 represents the size of the convolution kernel.
As shown in FIG. 1, the Winograd convolution is performed in three stages. In the first stage, the time domain weight matrix K and the time domain activation value matrix I read from the caches are converted from the time domain to the Winograd domain; the specific operation is a matrix multiplication, and the results are denoted U and V, where U = G K G^T and V = B^T I B. In the second stage, the dot product operation ⊙ is performed on the Winograd domain weight matrix U and the Winograd domain activation value matrix V to obtain the Winograd domain dot product result matrix M = U ⊙ V. In the third stage, the dot product result is converted from the Winograd domain back to the time domain.
As shown in FIG. 2, the Winograd-based configurable convolutional array accelerator structure of the present invention comprises: an activation value caching module 1, a weight caching module 2, an output caching module 3, a controller 4, a weight preprocessing module 5, an activation value preprocessing module 6, a weight conversion module 7, an activation value matrix conversion module 8, a dot multiplication module 9, a result matrix conversion module 10, an accumulation module 11, a pooling module 12, and an activation module 13, wherein:
1) The activation value buffer module 1 is used for storing an input pixel value or an input feature map value, is connected with the controller 4, and provides activation value data for the activation value preprocessing module 6;
2) The weight buffer module 2 is used for storing the trained weight, is connected with the controller 4 and provides weight data for the weight preprocessing module 5;
3) The output buffer module 3 is used for storing the primary convolution layer result, connected with the controller 4, and transmitting the data into the output buffer module 3 for the next layer convolution after the activation module 13 finishes outputting the data;
4) A controller 4 for controlling transmission of the activation value data, the weight data, and the convolution layer data to be processed according to the calculation process;
5) The weight preprocessing module 5 is used for receiving the data to be operated on transmitted by the weight caching module 2 and dividing the convolution kernel, obtaining the four time domain weight matrices to be processed, K1, K2, K3, K4.
The weight preprocessing module 5: (1) extends a convolution kernel of size 5×5 to a 6×6 convolution matrix by zero padding; (2) divides the 6×6 convolution matrix into four 3×3 convolution kernels. In this way, a 5×5 convolution can be implemented with the 3×3 Winograd output paradigm efficiently, without increasing the number of multiplications, which dominate power consumption.
The specific division is shown below, where K_input represents the 5×5 time domain input weight matrix; on the right is the expanded 6×6 time domain input weight matrix, and the four results of dividing the 6×6 time domain weight matrix are the corresponding time domain weight matrices to be processed, K1, K2, K3, K4. In calculating U = G K G^T, K takes the values K1, K2, K3, K4 in turn.
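For illustration only, a minimal Python sketch of this preprocessing step follows. The patent's division figure is not reproduced, so the exact placement of the inserted zero row and column is an assumption; index 3 is chosen here because it makes the four sub-convolutions recombine exactly with the overlapping activation tiles shown in the next step:

```python
import numpy as np

def split_weights(k5):
    """Zero-pad a 5x5 kernel to 6x6 and divide it into four 3x3 sub-kernels.

    The position of the inserted zero row/column (index 3) is an assumption,
    chosen so that the split pairs consistently with the overlapping 4x4
    activation tiles sketched below; the patent's own figure is not
    reproduced.
    """
    assert k5.shape == (5, 5)
    k6 = np.insert(np.insert(k5, 3, 0.0, axis=0), 3, 0.0, axis=1)  # 6x6
    return [k6[0:3, 0:3], k6[0:3, 3:6],   # K1, K2
            k6[3:6, 0:3], k6[3:6, 3:6]]   # K3, K4
```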
6) The activation value preprocessing module 6 is used for receiving the data to be operated on transmitted by the activation value caching module 1, taking the activation values out of the cache, and dividing them, obtaining the time domain activation value matrices to be processed, I1, I2, I3, I4. In calculating V = B^T I B, I takes the values I1, I2, I3, I4 in turn.
The activation value preprocessing module 6 reads and preprocesses the activation values. In the Winograd algorithm the activation values must correspond to the weights, and much of the data is reused, so the division is overlapping. The module divides an activation value matrix of size 6×6 into 4 overlapping matrices of size 4×4, corresponding to the four 3×3 convolution kernels. The division is shown below, where I_input represents the 6×6 time domain input activation value matrix, and below it are the divided 4×4 time domain activation value matrices to be processed, I1, I2, I3, I4. In calculating V = B^T I B, I takes the values I1, I2, I3, I4 in turn.
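A matching sketch of the overlapping activation division, with a consistency check (again illustrative; the {0, 2} tile offsets are an assumption paired with the kernel split above):

```python
import numpy as np

def split_activations(a6):
    """Divide a 6x6 activation tile into four overlapping 4x4 tiles."""
    assert a6.shape == (6, 6)
    return [a6[0:4, 0:4], a6[0:4, 2:6],   # I1, I2
            a6[2:6, 0:4], a6[2:6, 2:6]]   # I3, I4

def conv_valid(a, k):
    """Plain 'valid' correlation, used only as a reference."""
    n = a.shape[0] - k.shape[0] + 1
    return np.array([[np.sum(a[i:i+k.shape[0], j:j+k.shape[1]] * k)
                      for j in range(n)] for i in range(n)])

rng = np.random.default_rng(1)
k5 = rng.integers(-4, 4, (5, 5)).astype(float)
a6 = rng.integers(-4, 4, (6, 6)).astype(float)
k6 = np.insert(np.insert(k5, 3, 0.0, axis=0), 3, 0.0, axis=1)
sub_k = [k6[0:3, 0:3], k6[0:3, 3:6], k6[3:6, 0:3], k6[3:6, 3:6]]

# The four 3x3 sub-convolutions, summed, equal the full 5x5 convolution.
out = sum(conv_valid(t, kk) for t, kk in zip(split_activations(a6), sub_k))
assert np.allclose(out, conv_valid(a6, k5))
```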
7) The weight conversion module 7 is used for receiving the data to be operated, which is transmitted by the weight preprocessing module 5, and converting the weight data from a time domain to a Winograd domain to obtain a Winograd domain weight matrix U;
the weight conversion module 7 performs matrix multiplication in calculation through row-column vector addition and subtraction, so as to perform conversion of weight matrix in Winograd convolution, and obtain Winograd domain weight matrix U= [ GKG ] T ]Wherein K represents a time domain weight matrix, G is a weight conversion auxiliary matrix, and U is a Winograd domain weight matrix;
the specific operation is as follows: taking the first row vector of the time domain weight matrix K as a temporary matrix C 2 In which the temporary matrix C 2 =G T K, performing K; because the weight matrix has a value of 1/2, only the integer number in the time domain weight matrix K is required to be shifted and supplemented by 0 to the right and the negative number is required to be shifted and supplemented by 1 to the right to complete the division of two; when the weight is positive, the weight is shifted to the right, and 0 is added to the left of the weight; when the weight is negative, the weight is shifted to the right, and 1 is supplemented to the left of the weight; the vector result after adding the first, second and third row elements of the time domain weight matrix K and right shifting by one bit is used as a temporary matrix C 2 Is a second row of (2); the vector result after adding the first, second and third row elements of the time domain weight matrix K and right shifting by one bit is taken as a matrix C 2 Is a third row of (2); taking the third row vector of the time domain weight matrix K as a temporary matrix C 2 Is the fourth row of (2); temporary matrix C 2 Is used as a first column of a Winograd domain weight matrix U; temporary matrix C 2 After the first, second and third columns of (a) are addedThe vector result after one bit shift right is used as the second column of Winograd domain weight matrix U; temporary matrix C 2 The vector result after adding the first, second and third columns and right shifting by one bit is used as the third column of Winograd domain weight matrix U; temporary matrix C 2 And taking the third column vector of the (4) as the fourth column of the Winograd domain weight matrix U to finally obtain the Winograd domain weight matrix U.
8) The activation value matrix conversion module 8 is used for receiving the data to be operated, which is transmitted by the activation value preprocessing module 6, and converting the activation value from a time domain to a Winograd domain to obtain a Winograd domain activation value matrix V;
the activation value matrix conversion module 8 performs matrix multiplication in calculation by adding and subtracting row vectors and column vectors, so as to perform conversion operation on the time domain activation value matrix in Winograd convolution, and obtain a Winograd domain activation value matrix V= [ B ] T IB]Wherein I is a time domain activation value matrix, B is an activation value conversion auxiliary matrix, and V is a Winograd domain activation value matrix;
the specific operation is as follows: taking the vector difference value of the first row minus the third row of the time domain activation value matrix I as a temporary matrix C 1 In which the temporary matrix C 1 =B T I, a step of I; the result of adding the second row and the third row of the time domain activation value matrix I is taken as a temporary matrix C 1 Is a second row of (2); taking the vector difference value of the third row minus the second row of the time domain activation value matrix I as a temporary matrix C 1 Is a third row of (2); taking the vector difference value of the second row minus the fourth row of the time domain activation value matrix I as a temporary matrix C 1 Is the fourth row of (2); temporary matrix C 1 The vector difference value of the third column minus the first column of the Winograd domain activation value matrix V; temporary matrix C 1 The result of the addition of the second and third columns of (a) is taken as the second column of Winograd domain activation value matrix V; temporary matrix C 1 Subtracting the vector difference value of the second column as the third column of the Winograd domain activation value matrix V; temporary matrix C 1 The vector difference value of the second column minus the fourth column is used as the fourth column of the Winograd domain activation value matrix V, and finally the Winograd domain activation value matrix V is obtained.
9) The dot multiplication module 9 is used for receiving the data to be operated on transmitted by the weight conversion module 7 and the activation value matrix conversion module 8, respectively, and performing the dot product of the Winograd domain activation value matrix and the Winograd domain weight matrix to obtain the Winograd domain dot product result matrix M; this is also the module that consumes the most computation time and resources in the convolution.
The dot multiplication module 9 obtains the Winograd domain dot product result matrix M by performing the dot product of the Winograd domain weight matrix U and the Winograd domain activation value matrix V, expressed as M = U ⊙ V, where U is the Winograd domain weight matrix and V is the Winograd domain activation value matrix. The dot multiplication module 9 implements a dot product with configurable data bit width and has two working modes, an 8-bit multiplier mode and a 16-bit multiplier mode, performing 8-bit and 16-bit operations respectively and realizing 8×8-bit and 16×16-bit fixed-point multiplication. Specifically:
(1) As shown in FIG. 3, the 8-bit multiplier includes a first gating unit 14, a first inverting unit 15, a first shifting unit 16, a first accumulating unit 17, a second gating unit 18, a second inverting unit 19, and a third gating unit 20, connected in sequence, wherein:
the first gating units 14 respectively receive: the data information of the weight conversion module 7 and the activation value matrix conversion module 8, and the symbol control signal of the weight conversion module 7;
the first inverting unit 15 receives the data information of the first gating unit 14 and inverts the received data;
the first shifting unit 16 receives the data information of the first inverting unit 15, and receives the sign bit information of the first strobe unit 14, and shifts the received data according to the sign information;
the first accumulating unit 17 receives the data information of the first shifting unit 16 and accumulates the received data;
the second gating unit 18 receives the data information of the first accumulating unit 17 and the sign bit information of the first gating unit 14, and transmits to the second inverting unit 19;
the second inverting unit 19 receives the data information of the second strobe unit 18 and inverts the received data;
the third gating unit 20 receives the data information of the second inverting unit 19 and the first accumulating unit 17, respectively, and outputs it.
The 8-bit multiplier operates as follows: the sign bit of the result is obtained by XOR-ing the sign bits of the two multiplier operands. Each operand is judged positive or negative from its own sign bit; if negative, the sign bit is removed and the remaining seven bits are inverted and incremented by 1, and if positive, the lower seven bits remain unchanged. After this judgment, for multiplier A1, each binary bit of multiplier B1 is examined in turn: if the bit is 1, the corresponding intermediate value is the lower seven bits of A1 shifted left by that bit position; if the bit is 0, the corresponding intermediate value is eight 0 bits. After the lower seven bits of B1 have all been examined, all intermediate values are added to obtain the product H2. Whether the result needs inverting is then determined from the result sign bit: if the result sign bit is 1, H2 is inverted and incremented by 1; if it is 0, H2 is kept unchanged, giving the result H3. Finally the result sign bit is placed in the eighth bit of H3 to obtain the final result. Unsigned 8-bit multiplication, which does not consider sign bits, obtains its result directly by the shift-and-add over the 8 data bits according to multiplier B1.
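A behavioral model of this signed shift-add multiplier (bit widths and names such as mul8_signed are illustrative, not taken from the patent's circuit):

```python
def mul8_signed(a, b):
    """Behavioral model of the shift-add 8-bit signed multiplier.

    Names and widths are illustrative; magnitudes are limited to the
    seven bits described in the text.
    """
    assert -128 < a < 128 and -128 < b < 128
    sign = (a < 0) ^ (b < 0)       # XOR of the two sign bits
    ua, ub = abs(a), abs(b)        # "invert and add 1" for negative inputs
    acc = 0
    for i in range(7):             # scan the seven magnitude bits of B
        if (ub >> i) & 1:
            acc += ua << i         # partial product: A shifted left by i
    return -acc if sign else acc   # restore the result sign

assert mul8_signed(-7, 13) == -91
assert mul8_signed(127, 127) == 16129
```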
(2) As shown in FIG. 4, the 16-bit multiplier includes a fourth gating unit 21, a third inverting unit 22, an 8-bit multiplier 23, a second shifting unit 24, a second accumulating unit 25, a fifth gating unit 26, a fourth inverting unit 27, and a sixth gating unit 28, connected in sequence, wherein:
the fourth strobe unit 21 receives: the data information of the weight conversion module 7 and the activation value matrix conversion module 8, and the symbol control signal of the weight conversion module 7;
the third inverting unit 22 receives the data information of the fourth strobe unit 21 and inverts the received data;
the 8-bit multiplier 23 performs 8-bit data bit width operation to realize 8 x 8bit fixed-point multiplication operation;
the second shifting unit 24 receives the data information of the 8-bit multiplier 23 and shifts the received data;
the second accumulating unit 25 receives the data information of the second shifting unit 24 and accumulates the received data;
the fifth gating unit 26 receives the data information of the second accumulating unit 25 and the sign bit information of the fourth gating unit 21, and transmits to the fourth inverting unit 27;
the fourth inverting unit 27 receives the data information of the fifth gating unit 26 and inverts the received data;
the sixth strobe unit 28 receives the data information of the fourth inverting unit 27 and outputs it.
The 16-bit multiplier is realized with four 8-bit multiplier units, whose gating signals are 0, i.e. they act as unsigned multipliers. First, the signs of the two 16-bit operands are judged from their sign bits; a negative operand is inverted and incremented by 1. Second, each judged 16-bit number is divided into a high 8-bit half and a low 8-bit half, and the halves are multiplied correspondingly. Then the product of the two high 8-bit halves is shifted left by 16 bits; the product of the high 8 bits of multiplier D with the low 8 bits of multiplier E and the product of the low 8 bits of D with the high 8 bits of E are added and the sum is shifted left by 8 bits; the shifted results are added to the product of the two low 8-bit halves to obtain the multiplication result L. Finally, whether inverting and adding 1 is needed is determined from the result sign bit: if the sign of L is 1, L is inverted and incremented by 1; if the sign bit is 0, L is kept unchanged; placing the sign bit value at the first position of L gives the final output result.
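A behavioral model of the four-multiplier composition (again, names are illustrative and the unsigned 8-bit unit is modeled by an ordinary product):

```python
def mul8u(x, y):
    """Stand-in for the unsigned 8-bit multiplier unit (gating signal 0)."""
    assert 0 <= x < 256 and 0 <= y < 256
    return x * y

def mul16_signed(a, b):
    """Behavioral model of the four-multiplier 16-bit scheme."""
    assert -32768 < a < 32768 and -32768 < b < 32768
    sign = (a < 0) ^ (b < 0)
    ua, ub = abs(a), abs(b)            # invert-and-add-1 for negatives
    ah, al = ua >> 8, ua & 0xFF        # high / low 8-bit halves
    bh, bl = ub >> 8, ub & 0xFF
    p = mul8u(ah, bh) << 16                       # high x high, shifted 16
    p += (mul8u(ah, bl) + mul8u(al, bh)) << 8     # cross products, shifted 8
    p += mul8u(al, bl)                            # low x low
    return -p if sign else p

assert mul16_signed(-300, 1234) == -370200
```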
10) The result matrix conversion module 10 receives the data to be operated on transmitted by the dot multiplication module 9 and converts the dot product result matrix from the Winograd domain to the time domain, obtaining the converted time domain dot product result matrix F.
The result matrix conversion module 10 performs the conversion operation F = A^T M A on the Winograd domain dot product result matrix M through row and column vector addition and subtraction, where M is the Winograd domain dot product result matrix, A is the conversion auxiliary matrix of M, and F is the time domain dot product result matrix.
The specific operation is as follows: take the vector result of adding the first, second, and third rows of M as the first row of a temporary matrix C3, where C3 = A^T M; take the vector result of the second row of M minus its third and fourth rows as the second row of C3. Then take the vector result of adding the first, second, and third columns of C3 as the first column of the converted time domain dot product result matrix F; take the vector result of the second column of C3 minus its third and fourth columns as the second column of F, finally obtaining the converted time domain dot product result matrix F.
11) The accumulation module 11 receives the data to be operated on transmitted by the result matrix conversion module 10 and, by accumulating the received data, obtains the final convolution result, a result matrix of size 2×2.
12) The pooling module 12 receives the data to be operated on transmitted by the accumulation module 11 and pools the final convolution result array. Different pooling methods, including maximum, average, and minimum pooling, can be used on the input neurons. Since the final output result matrix of Winograd convolution F(2×2, 3×3) is 2×2, a 2×2 pooling operation can be performed directly, and the pooling result is obtained through three size comparisons: the first compares the two numbers of the first row of the result matrix, the second compares the two numbers of the second row, and the third compares the results of the previous two comparisons, yielding the maximum pooling result of the result matrix, as sketched below.
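A minimal sketch of the three-comparison max pooling (illustrative):

```python
def max_pool_2x2(f):
    """Three-comparison max pooling of the 2x2 result tile, as described."""
    top = f[0][0] if f[0][0] > f[0][1] else f[0][1]   # compare first row
    bot = f[1][0] if f[1][0] > f[1][1] else f[1][1]   # compare second row
    return top if top > bot else bot                  # compare the winners

assert max_pool_2x2([[1, 4], [3, 2]]) == 4
```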
13) The activation module 13 receives the data to be operated on transmitted by the pooling module 12, applies the ReLU activation function to the pooling result to obtain the activated result, and transmits the activated result to the output buffer module 3.

Claims (7)

1. A Winograd-based configurable convolutional array accelerator structure comprising: an activation value buffer module (1), a weight buffer module (2), an output buffer module (3), a controller (4), a weight preprocessing module (5), an activation value preprocessing module (6), a weight conversion module (7), an activation value matrix conversion module (8), a dot multiplication module (9), a result matrix conversion module (10), an accumulation module (11), a pooling module (12) and an activation module (13),
the activation value buffer module (1) is used for storing input pixel values or input characteristic diagram values, is connected with the controller (4) and provides activation value data for the activation value preprocessing module (6);
the weight buffer memory module (2) is used for storing trained weight values, is connected with the controller (4) and provides weight data for the weight preprocessing module (5);
the output buffer module (3) is used for storing the primary convolution layer result, is connected with the controller (4), and transmits the data to the output buffer module (3) for the next layer convolution after the data output by the activation module (13) is completed;
a controller (4) for controlling transmission of the activation value data, the weight data, and the convolutional layer data to be processed according to the calculation process;
the weight preprocessing module (5) is used for receiving the data to be operated transmitted by the weight caching module (2) and dividing convolution kernels to obtain a time domain weight matrix K;
the activation value preprocessing module (6) is used for receiving the data to be operated transmitted by the activation value caching module (1), and is used for taking out the activation value from the activation value caching module (1) and dividing the activation value to obtain a time domain activation value matrix I;
the weight conversion module (7) is used for receiving the data to be operated, which is transmitted by the weight preprocessing module (5), and converting the weight data from a time domain to a Winograd domain to obtain a Winograd domain weight matrix U;
the activation value matrix conversion module (8) is used for receiving the data to be operated, which is transmitted by the activation value preprocessing module (6), and converting the activation value from a time domain to a Winograd domain to obtain a Winograd domain activation value matrix V;
the dot multiplication module (9) is used for respectively receiving the data to be operated transmitted by the weight conversion module (7) and the activation value matrix conversion module (8) and realizing dot product operation of the Winograd domain activation value matrix and the Winograd domain weight matrix to obtain a Winograd domain dot product result matrix M;
the result matrix conversion module (10) is used for receiving the data to be operated transmitted by the dot product module (9) and converting the dot product result matrix from a Winograd domain to a time domain to obtain a converted time domain dot product result matrix F;
the accumulation module (11) is used for receiving the data to be operated transmitted by the result matrix conversion module (10) and accumulating the received data to obtain a final convolution result;
the pooling module (12) receives the data to be operated transmitted by the accumulation module (11) and pools the final convolution result array;
the activation module (13) receives the data to be operated on transmitted by the pooling module (12), applies the ReLU activation function to the pooling result to obtain the activated result, and transmits the activated result to the output buffer module (3), wherein:
the weight preprocessing module (5) comprises:
(1) extending a convolution kernel of size 5×5 to a 6×6 convolution matrix by zero padding;
(2) dividing the 6×6 convolution matrix into four 3×3 convolution kernels;
the specific division is shown below, where K_input represents the 5×5 weight matrix, and below it are the 4 corresponding divided time domain weight matrices to be processed, K1, K2, K3, K4; in calculating U = G K G^T, K takes the values K1, K2, K3, K4 in turn,
wherein K represents the time domain weight matrix, G is the weight conversion auxiliary matrix, and U is the Winograd domain weight matrix;
the activation value preprocessing module (6) divides the 6×6 activation value matrix into 4 overlapping matrices of size 4×4; the division is shown below, where I_input represents the 6×6 activation value matrix, and below it are the divided 4×4 time domain activation value matrices to be processed, I1, I2, I3, I4; in calculating V = B^T I B, I takes the values I1, I2, I3, I4 in turn,
Wherein I is a time domain activation value matrix, B is an activation value conversion auxiliary matrix, and V is a Winograd domain activation value matrix.
2. The Winograd-based configurable convolutional array accelerator structure according to claim 1, wherein the weight conversion module (7) performs the matrix multiplication through row and column vector addition, subtraction, and one-bit shifts, thereby carrying out the weight matrix conversion in Winograd convolution to obtain the Winograd domain weight matrix U = G K G^T;
the specific operation is as follows: take the first row vector of the weight matrix K as the first row of a temporary matrix C2, where C2 = G K; division by two is completed by right-shifting, filling 0 on the left for positive values and 1 on the left for negative values; take the vector obtained by adding the first, second, and third rows of K and shifting right by one bit as the second row of C2; take the vector obtained by adding the first and third rows of K, subtracting the second row, and shifting right by one bit as the third row of C2; take the third row vector of K as the fourth row of C2; take the first column vector of C2 as the first column of the Winograd domain weight matrix U; take the vector obtained by adding the first, second, and third columns of C2 and shifting right by one bit as the second column of U; take the vector obtained by adding the first and third columns of C2, subtracting the second column, and shifting right by one bit as the third column of U; take the third column vector of C2 as the fourth column of U, finally obtaining the Winograd domain weight matrix U.
3. The Winograd-based configurable convolutional array accelerator structure according to claim 1, wherein the activation value matrix conversion module (8) performs the matrix multiplication through addition and subtraction of row and column vectors, thereby carrying out the conversion of the time domain activation value matrix in Winograd convolution to obtain the matrix V = B^T I B;
the specific operation is as follows: take the vector difference of the first row minus the third row of the time domain activation value matrix I as the first row of a temporary matrix C1, where C1 = B^T I; take the sum of the second and third rows of I as the second row of C1; take the vector difference of the third row minus the second row of I as the third row of C1; take the vector difference of the second row minus the fourth row of I as the fourth row of C1; take the vector difference of the first column minus the third column of C1 as the first column of the Winograd domain activation value matrix V; take the sum of the second and third columns of C1 as the second column of V; take the vector difference of the third column minus the second column of C1 as the third column of V; take the vector difference of the second column minus the fourth column of C1 as the fourth column of V, finally obtaining the Winograd domain activation value matrix V.
4. The Winograd-based configurable convolutional array accelerator structure according to claim 1, wherein the dot multiplication module (9) obtains the Winograd domain dot product result matrix M by performing the dot product of the Winograd domain weight matrix U and the Winograd domain activation value matrix V, expressed as M = U ⊙ V, where U is the Winograd domain weight matrix and V is the Winograd domain activation value matrix; the dot multiplication module (9) implements a dot product with configurable data bit width and has two working modes, an 8-bit multiplier mode and a 16-bit multiplier mode, which perform operations at the two data bit widths of 8 bits and 16 bits respectively, realizing 8×8-bit and 16×16-bit fixed-point multiplication.
5. The Winograd-based configurable convolutional array accelerator structure according to claim 4, wherein said 8-bit multiplier comprises a first gating unit (14), a first inverting unit (15), a first shifting unit (16), a first accumulating unit (17), a second gating unit (18), a second inverting unit (19) and a third gating unit (20) connected in sequence,
the first gating units (14) respectively receive: the weight conversion module (7) and the activation value matrix conversion module (8) are used for converting data information of the weight conversion module (7) and a symbol control signal of the weight conversion module (7);
the first negation unit (15) receives the data information of the first gating unit (14) and performs negation on the received data;
the first shifting unit (16) receives the data information of the first inverting unit (15) and the symbol bit information of the first gating unit (14), and shifts the received data according to the symbol information;
a first accumulation unit (17) receives the data information of the first shift unit (16) and accumulates the received data;
the second gating unit (18) receives the data information of the first accumulating unit (17) and the sign bit information of the first gating unit (14) and transmits the data information and the sign bit information to the second inverting unit (19);
a second inverting unit (19) receives the data information of the second gating unit (18) and inverts the received data;
the third gating unit (20) receives the data information of the second inverting unit (19) and the first accumulating unit (17) respectively and outputs the data information.
6. The Winograd-based configurable convolutional array accelerator structure according to claim 4, wherein said 16-bit multiplier comprises a fourth gating unit (21), a third inverting unit (22), an 8-bit multiplier (23), a second shifting unit (24), a second accumulating unit (25), a fifth gating unit (26), a fourth inverting unit (27) and a sixth gating unit (28) connected in sequence,
the fourth strobe units (21) respectively receive: the weight conversion module (7) and the activation value matrix conversion module (8) are used for converting data information of the weight conversion module (7) and a symbol control signal of the weight conversion module (7);
a third inverting unit (22) receives the data information of the fourth gating unit (21) and inverts the received data;
an 8-bit multiplier (23) performs 8-bit data bit width operation to realize 8 x 8bit fixed-point multiplication operation;
a second shift unit (24) receives data information of the 8-bit multiplier (23) and shifts the received data;
a second accumulating unit (25) receives the data information of the second shifting unit (24) and accumulates the received data;
the fifth gating unit (26) receives the data information of the second accumulating unit (25) and the sign bit information of the fourth gating unit (21) and transmits the data information and the sign bit information to the fourth inverting unit (27);
a fourth inverting unit (27) receives the data information of the fifth gating unit (26) and inverts the received data;
a sixth strobe unit (28) receives the data information of the fourth inverting unit (27) and outputs the data information.
7. The Winograd-based configurable convolution array accelerator structure according to claim 1, wherein the result matrix conversion module (10) performs the conversion operation F = AᵀMA on the Winograd-domain dot product result matrix M through row- and column-vector shift add-subtract operations, where M is the Winograd-domain dot product result matrix, A is the auxiliary transformation matrix of M, and F is the time-domain result matrix;
the specific operation is as follows: the vector obtained by adding the first, second and third rows of the Winograd-domain dot product result matrix M is taken as the first row of a temporary matrix C₃, where C₃ = AᵀM; the vector obtained by adding the second, third and fourth rows of M is taken as the second row of C₃; the vector obtained by adding the first, second and third columns of C₃ is taken as the first column of the converted time-domain result matrix F; and the vector obtained by adding the second, third and fourth columns of C₃ is taken as the second column of F, finally yielding the converted time-domain result matrix F.
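For reference, the standard Winograd F(2×2, 3×3) output transform (Lavin et al., cited in the non-patent citations below) matches the 4×4 → 2×2 shape this claim describes; note that its matrix A carries minus signs, so the row and column "additions" of the claim are realized by the shift add-subtract operations it names. The sketch below is illustrative, with the standard A assumed rather than taken from the patent.

```python
import numpy as np

# Standard output-transform matrix of Winograd F(2x2, 3x3); its second
# column carries minus signs, so e.g. row 2 of A^T M is m2 - m3 - m4.
A = np.array([[1,  0],
              [1,  1],
              [1, -1],
              [0, -1]])

def result_transform(M: np.ndarray) -> np.ndarray:
    """Apply F = A^T M A to a 4x4 Winograd-domain dot-product matrix M,
    yielding the 2x2 time-domain output tile F."""
    C3 = A.T @ M        # temporary matrix C3 = A^T M (2x4)
    return C3 @ A       # F = C3 A (2x2)
```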
CN201910511987.6A 2019-06-13 2019-06-13 Winograd-based configurable convolution array accelerator structure Active CN110288086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910511987.6A CN110288086B (en) 2019-06-13 2019-06-13 Winograd-based configurable convolution array accelerator structure

Publications (2)

Publication Number Publication Date
CN110288086A CN110288086A (en) 2019-09-27
CN110288086B (en) 2023-07-21

Family

ID=68004097

Country Status (1)

Country Link
CN (1) CN110288086B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766473B (en) * 2019-11-01 2023-12-05 中科寒武纪科技股份有限公司 Computing device and related product
CN112765538B (en) * 2019-11-01 2024-03-29 中科寒武纪科技股份有限公司 Data processing method, device, computer equipment and storage medium
CN111325332B (en) * 2020-02-18 2023-09-08 百度在线网络技术(北京)有限公司 Convolutional neural network processing method and device
CN112639839A (en) * 2020-05-22 2021-04-09 深圳市大疆创新科技有限公司 Arithmetic device of neural network and control method thereof
CN112580793B (en) * 2020-12-24 2022-08-12 清华大学 Neural network accelerator based on time domain memory computing and acceleration method
CN112734827B (en) * 2021-01-07 2024-06-18 京东鲲鹏(江苏)科技有限公司 Target detection method and device, electronic equipment and storage medium
CN112862091B (en) * 2021-01-26 2022-09-27 合肥工业大学 Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN112949845B (en) * 2021-03-08 2022-08-09 内蒙古大学 Deep convolutional neural network accelerator based on FPGA
CN113269302A (en) * 2021-05-11 2021-08-17 中山大学 Winograd processing method and system for 2D and 3D convolutional neural networks
CN113407904B (en) * 2021-06-09 2023-04-07 中山大学 Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network
CN113283591B (en) * 2021-07-22 2021-11-16 南京大学 Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
CN113554163B (en) * 2021-07-27 2024-03-29 深圳思谋信息科技有限公司 Convolutional neural network accelerator
CN113656751B (en) * 2021-08-10 2024-02-27 上海新氦类脑智能科技有限公司 Method, apparatus, device and medium for realizing signed operation by unsigned DAC
CN114399036B (en) * 2022-01-12 2023-08-22 电子科技大学 Efficient convolution calculation unit based on one-dimensional Winograd algorithm

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9805303B2 (en) * 2015-05-21 2017-10-31 Google Inc. Rotating data for neural network computations
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793199A (en) * 2014-01-24 2014-05-14 天津大学 Rapid RSA cryptography coprocessor capable of supporting dual domains
CN107862374A (en) * 2017-10-30 2018-03-30 中国科学院计算技术研究所 Processing with Neural Network system and processing method based on streamline
CN109190755A (en) * 2018-09-07 2019-01-11 中国科学院计算技术研究所 Matrix conversion device and method towards neural network
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device
CN109325591A (en) * 2018-09-26 2019-02-12 中国科学院计算技术研究所 Neural network processor towards Winograd convolution
CN109359730A (en) * 2018-09-26 2019-02-19 中国科学院计算技术研究所 Neural network processor towards fixed output normal form Winograd convolution
CN109447241A (en) * 2018-09-29 2019-03-08 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Y Huang et al.; "A High-efficiency FPGA-based Accelerator for Convolutional Neural Networks using Winograd Algorithm"; Journal of Physics: Conference Series; 2018-12-31; entire document *
Lingchuan Meng et al.; "Efficient Winograd Convolution via Integer Arithmetic"; arXiv; 2019-01-07; entire document *
Andrew Lavin et al.; "Fast Algorithms for Convolutional Neural Networks"; 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016-12-31; entire document *
Liqiang Lu et al.; "SpWA: An Efficient Sparse Winograd Convolutional Neural Networks Accelerator on FPGAs"; ACM; 2018-12-31; entire document *

Similar Documents

Publication Publication Date Title
CN110288086B (en) Winograd-based configurable convolution array accelerator structure
CN107862374B (en) Neural network processing system and processing method based on assembly line
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
US10810484B2 (en) Hardware accelerator for compressed GRU on FPGA
US11222254B2 (en) Optimized neuron circuit, and architecture and method for executing neural networks
CN110705703B (en) Sparse neural network processor based on systolic array
US20180046903A1 (en) Deep processing unit (dpu) for implementing an artificial neural network (ann)
CN111832719A (en) Fixed point quantization convolution neural network accelerator calculation circuit
CN107633297B (en) Convolutional neural network hardware accelerator based on parallel fast FIR filter algorithm
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN110851779B (en) Systolic array architecture for sparse matrix operations
CN111144556A (en) Hardware circuit of range batch processing normalization algorithm for deep neural network training and reasoning
Yang et al. An approximate multiply-accumulate unit with low power and reduced area
CN117813585A (en) Systolic array with efficient input reduced and extended array performance
CN115982528A (en) Booth algorithm-based approximate precoding convolution operation method and system
Duan et al. Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights
CN114341796A (en) Signed multiword multiplier
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
CN114399036A (en) Efficient convolution calculation unit based on one-dimensional Winograd algorithm
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
WO2021081854A1 (en) Convolution operation circuit and convolution operation method
Wu et al. A high-speed and low-power FPGA implementation of spiking convolutional neural network using logarithmic quantization
CN116524173A (en) Deep learning network model optimization method based on parameter quantization
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
CN115357214A (en) Operation unit compatible with asymmetric multi-precision mixed multiply-accumulate operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant