CN110288086B - Winograd-based configurable convolution array accelerator structure - Google Patents

Winograd-based configurable convolution array accelerator structure

Info

Publication number
CN110288086B
Authority
CN
China
Prior art keywords
matrix
weight
module
winograd
activation value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910511987.6A
Other languages
Chinese (zh)
Other versions
CN110288086A (en)
Inventor
魏继增
徐文富
王宇吉
郭炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201910511987.6A priority Critical patent/CN110288086B/en
Publication of CN110288086A publication Critical patent/CN110288086A/en
Application granted granted Critical
Publication of CN110288086B publication Critical patent/CN110288086B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Neurology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

A Winograd-based configurable convolutional array accelerator structure comprising: an activation value caching module, a weight caching module, an output caching module, a controller, a weight preprocessing module, an activation value preprocessing module, a weight conversion module, an activation value matrix conversion module, a dot multiplication module, a result matrix conversion module, an accumulation module, a pooling module, and an activation module. In this Winograd-based configurable convolution array accelerator structure, a convolution array accelerator with configurable bit width is designed according to the operational characteristics of a fixed-paradigm Winograd convolution algorithm, flexibly meeting the bit-width requirements of different neural networks and different convolution layers. In addition, a dedicated multiplier unit with configurable data bit width is designed, improving the computational efficiency of neural network convolution and reducing computation power consumption.

Description

Winograd-based configurable convolution array accelerator structure
Technical Field
The present invention relates to configurable convolutional array accelerator structures, and more particularly to a Winograd-based configurable convolutional array accelerator structure.
Background
Neural networks excel in many application fields, particularly image-related tasks; they have begun to replace most traditional algorithms in computer vision problems such as image classification, image semantic segmentation, image retrieval, and object detection, and are gradually being deployed on terminal devices.
However, the computational load of neural networks is enormous, leading to problems such as low processing speed and high power consumption. A neural network workload consists mainly of a training phase and an inference phase. To obtain high-precision results, training must derive the weights from massive data through repeated iterative computation. In the inference phase, the computation on the input data must complete within a very short response time (typically on the order of milliseconds), especially when the network is applied in real-time systems such as autonomous driving. The computations involved in a neural network mainly include convolution operations, activation operations, and pooling operations.
Studies have shown that convolution accounts for more than 90% of a neural network's computation time. The conventional convolution algorithm computes each element of the output feature map through repeated multiply-accumulate operations. While solutions using this algorithm have met with preliminary success, a more efficient algorithm yields higher overall efficiency. Researchers have therefore proposed the Winograd convolution algorithm, which performs an equivalent convolution task while reducing the number of multiplications by applying specific data-domain transforms to the input feature map and the weights. Because most neural network processor chips in practical applications run a fixed network model during prediction, the Winograd convolution output form is usually fixed, the computation flow is well defined, and there is considerable room for optimization. How to design and optimize a Winograd-based neural network accelerator structure is therefore an important research topic.
In addition, for the vast majority of neural network applications, fixed-point input data achieves good experimental results while improving speed and reducing power consumption. However, the convolution data bit width in existing fixed-point neural networks is fixed and cannot be configured flexibly, which limits applicability. In general, a 16-bit data width meets the precision requirements of neural networks, and for some networks and scenarios with lower precision requirements an 8-bit data width also suffices. Making the data bit width configurable in a neural network therefore enables better optimization.
Disclosure of Invention
The invention aims to solve the technical problem of providing a Winograd-based configurable convolution array accelerator structure capable of improving the calculation efficiency of neural network convolution operation.
The technical scheme adopted by the invention is as follows: a Winograd-based configurable convolutional array accelerator structure comprising: an activation value caching module, a weight caching module, an output caching module, a controller, a weight preprocessing module, an activation value preprocessing module, a weight conversion module, an activation value matrix conversion module, a dot multiplication module, a result matrix conversion module, an accumulation module, a pooling module and an activation module,
the activation value buffer module is used for storing an input pixel value or an input characteristic diagram value, is connected with the controller and provides activation value data for the activation value preprocessing module;
the weight buffer module is used for storing the trained weight values, is connected with the controller and provides weight data for the weight preprocessing module;
the output buffer module is used for storing the primary convolution layer result, is connected with the controller, and transmits the data to the output buffer module for the next layer convolution after the data output by the activation module is completed;
the controller is used for controlling transmission of the activation value data, the weight data and the convolution layer data to be processed according to the calculation process;
the weight preprocessing module is used for receiving the data to be operated, which is transmitted by the weight caching module, and dividing a convolution kernel to obtain a time domain weight matrix K;
the activation value preprocessing module is used for receiving the data to be operated, which is transmitted by the activation value caching module, and extracting an activation value from the activation value caching module, and dividing the activation value to obtain a time domain activation value matrix I;
the weight conversion module is used for receiving the data to be operated, which is transmitted by the weight preprocessing module, and converting the weight data from a time domain to a Winograd domain to obtain a Winograd domain weight matrix U;
the activation value matrix conversion module is used for receiving the data to be operated, which is transmitted by the activation value preprocessing module, and converting the activation value from a time domain to a Winograd domain to obtain a Winograd domain activation value matrix V;
the dot multiplication module is used for respectively receiving the data to be operated, which are transmitted by the weight conversion module and the activation value matrix conversion module, and realizing dot product operation of the Winograd domain activation value matrix and the Winograd domain weight matrix to obtain a Winograd domain dot product result matrix M;
the result matrix conversion module is used for receiving the data to be operated transmitted by the dot product module and converting the dot product result matrix from a Winograd domain to a time domain to obtain a converted time domain dot product result matrix F;
the accumulation module is used for receiving the data to be operated, which is transmitted by the result matrix conversion module, and accumulating the received data to obtain a final convolution result;
the pooling module is used for receiving the data to be operated transmitted by the accumulation module and pooling the final convolution result array;
and the activation module is used for receiving the data to be operated on, transmitted by the pooling module, applying the ReLU activation function to the pooling result to obtain the activated result, and transmitting the activated result to the output buffer module.
The weight preprocessing module:
(1) extends a convolution kernel of size 5×5 to a 6×6 convolution matrix by zero padding;
(2) divides the 6×6 convolution matrix into four 3×3 convolution kernels.
The specific division is shown below, where K_input represents the 5×5 weight matrix and below it are the 4 corresponding divided time domain weight matrices to be processed, K1, K2, K3, K4. In calculating U = G K G^T, K takes the values K1, K2, K3, K4 in turn.
The activation value preprocessing module divides an activation value matrix of size 6×6 into 4 overlapping matrices of size 4×4. The division is as follows, where I_input represents the 6×6 activation value matrix and below it are the divided 4×4 time domain activation value matrices to be processed, I1, I2, I3, I4. In calculating V = B^T I B, I takes the values I1, I2, I3, I4 in turn.
The weight conversion module performs the matrix multiplication through row and column vector addition, subtraction, and one-bit shifts, thereby carrying out the weight matrix conversion in Winograd convolution to obtain the Winograd domain weight matrix U = G K G^T, where K represents the time domain weight matrix, G is the weight conversion auxiliary matrix, and U is the Winograd domain weight matrix.
The specific operation is as follows: take the first row vector of the weight matrix K as the first row of a temporary matrix C2, where C2 = G K. Division by two is implemented as a one-bit right shift: when a value is positive it is shifted right with 0 filled on the left, and when a value is negative it is shifted right with 1 filled on the left. Take the vector obtained by adding the first, second, and third rows of K and shifting right by one bit as the second row of C2; take the vector obtained by adding the first and third rows of K, subtracting the second row, and shifting right by one bit as the third row of C2; take the third row vector of K as the fourth row of C2. Then take the first column vector of C2 as the first column of the Winograd domain weight matrix U; take the vector obtained by adding the first, second, and third columns of C2 and shifting right by one bit as the second column of U; take the vector obtained by adding the first and third columns of C2, subtracting the second column, and shifting right by one bit as the third column of U; take the third column vector of C2 as the fourth column of U, finally obtaining the Winograd domain weight matrix U.
The activation value matrix conversion module completes the matrix multiplication through addition and subtraction of row and column vectors, thereby carrying out the conversion of the time domain activation value matrix in Winograd convolution to obtain the matrix V = B^T I B, where I is the time domain activation value matrix, B is the activation value conversion auxiliary matrix, and V is the Winograd domain activation value matrix.
The specific operation is as follows: take the vector difference of the first row minus the third row of the time domain activation value matrix I as the first row of a temporary matrix C1, where C1 = B^T I; take the sum of the second and third rows of I as the second row of C1; take the vector difference of the third row minus the second row of I as the third row of C1; take the vector difference of the second row minus the fourth row of I as the fourth row of C1. Then take the vector difference of the first column minus the third column of C1 as the first column of the Winograd domain activation value matrix V; take the sum of the second and third columns of C1 as the second column of V; take the vector difference of the third column minus the second column of C1 as the third column of V; take the vector difference of the second column minus the fourth column of C1 as the fourth column of V, finally obtaining the Winograd domain activation value matrix V.
The dot multiplication module obtains the Winograd domain dot product result matrix M by performing the dot product of the Winograd domain weight matrix U and the Winograd domain activation value matrix V, expressed as M = U ⊙ V, where U is the Winograd domain weight matrix and V is the Winograd domain activation value matrix. The dot multiplication module implements a dot product with configurable data bit width and has two working modes, an 8-bit multiplier mode and a 16-bit multiplier mode, which perform operations at the two data bit widths of 8 bits and 16 bits respectively, realizing 8×8-bit and 16×16-bit fixed-point multiplication.
The 8-bit multiplier comprises a first gating unit, a first inverting unit, a first shifting unit, a first accumulating unit, a second gating unit, a second inverting unit and a third gating unit which are sequentially connected, wherein,
the first gating units respectively receive: the data information of the weight conversion module and the activation value matrix conversion module and the symbol control signal of the weight conversion module;
the first negation unit receives the data information of the first gating unit and performs negation on the received data;
the first shifting unit receives the data information of the first inverting unit, receives the symbol bit information of the first gating unit and shifts the received data according to the symbol information;
the first accumulation unit receives the data information of the first shift unit and accumulates the received data;
the second gating unit receives the data information of the first accumulating unit and the sign bit information of the first gating unit and transmits the data information and the sign bit information to the second inverting unit;
the second inverting unit receives the data information of the second gating unit and inverts the received data;
the third gating unit receives the data information of the second inverting unit and the first accumulating unit respectively and outputs the data information.
The 16-bit multiplier comprises a fourth gating unit, a third inverting unit, an 8-bit multiplier, a second shifting unit, a second accumulating unit, a fifth gating unit, a fourth inverting unit and a sixth gating unit which are sequentially connected,
the fourth strobe units respectively receive: the data information of the weight conversion module and the activation value matrix conversion module and the symbol control signal of the weight conversion module;
the third inverting unit receives the data information of the fourth gating unit and inverts the received data;
the 8-bit multiplier performs 8-bit data bit width operation to realize 8 x 8-bit fixed-point multiplication operation;
the second shifting unit receives the data information of the 8-bit multiplier and shifts the received data;
the second accumulation unit receives the data information of the second shift unit and accumulates the received data;
the fifth gating unit receives the data information of the second accumulating unit and the sign bit information of the fourth gating unit and transmits the data information and the sign bit information to the fourth inverting unit;
the fourth negation unit receives the data information of the fifth gating unit and inverts the received data;
the sixth gating unit receives the data information of the fourth inverting unit and outputs the data information.
The result matrix conversion module performs the conversion operation F = A^T M A on the Winograd domain dot product result matrix M through row and column vector addition and subtraction, where M is the Winograd domain dot product result matrix, A is the conversion auxiliary matrix of M, and F is the time domain dot product result matrix.
The specific operation is as follows: take the vector result of adding the first, second, and third rows of M as the first row of a temporary matrix C3, where C3 = A^T M; take the vector result of the second row of M minus its third and fourth rows as the second row of C3. Then take the vector result of adding the first, second, and third columns of C3 as the first column of the converted time domain dot product result matrix F; take the vector result of the second column of C3 minus its third and fourth columns as the second column of F, finally obtaining the converted time domain dot product result matrix F.
In the Winograd-based configurable convolution array accelerator structure of the invention, a convolution array accelerator with configurable bit width is designed according to the operational characteristics of a fixed-paradigm Winograd convolution algorithm, flexibly meeting the bit-width requirements of different neural networks and different convolution layers. In addition, a dedicated multiplier unit with configurable data bit width is designed, improving the computational efficiency of neural network convolution and reducing computation power consumption.
Drawings
FIG. 1 is a diagram of the overall architecture of a Winograd convolutional array accelerator;
FIG. 2 is a schematic diagram of the construction of a Winograd-based configurable convolutional array accelerator structure of the present invention;
FIG. 3 is a schematic diagram of an 8-bit multiplier in a data bit width scheme;
FIG. 4 is a schematic diagram of a 16-bit multiplier in a data bit width scheme.
Detailed Description
A detailed description of a Winograd-based configurable convolutional array accelerator structure of the present invention is provided below with reference to the examples and figures.
In the convolution calculation of a neural network, the Winograd conversion formula is as follows:

Out = A^T[(G K G^T) ⊙ (B^T I B)]A    (1)

where K denotes the time domain weight matrix, I denotes the time domain activation value matrix, and A, G, B denote the conversion matrices corresponding, respectively, to the dot product result matrix [(G K G^T) ⊙ (B^T I B)], the time domain weight matrix K, and the time domain activation value matrix I. The conversion matrices A, G, B are specifically shown as follows:
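Since the figure listing A, G, B did not survive reproduction here, the following sketch records the standard Winograd F(2×2, 3×3) conversion matrices, which are consistent with the row and column operations described later in this document, together with a quick check of formula (1) in Python (illustrative, not part of the patent):

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) conversion matrices (the patent's own
# figure was not reproduced; these match the row/column operations
# described later in this document).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                # weight transform (4x3)
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)  # activation transform (4x4)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)   # output transform (2x4)

# Check formula (1) against direct 3x3 convolution of a 4x4 tile:
# Winograd uses 4*4 = 16 multiplications where the direct method uses
# 2*2*9 = 36.
rng = np.random.default_rng(0)
K = rng.integers(-8, 8, (3, 3)).astype(float)   # time domain weights
I = rng.integers(-8, 8, (4, 4)).astype(float)   # time domain activations
out = A_T @ ((G @ K @ G.T) * (B_T @ I @ B_T.T)) @ A_T.T
ref = np.array([[np.sum(I[r:r+3, c:c+3] * K) for c in range(2)]
                for r in range(2)])
assert np.allclose(out, ref)
```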
the output paradigm of the Winograd convolution used in the present invention is F (2×2,3×3), where the first parameter 2×2 represents the size of the output feature map and the second parameter 3*3 represents the size of the convolution kernel.
As shown in FIG. 1, the Winograd convolution is performed in three stages. In the first stage, the time domain weight matrix K and the time domain activation value matrix I read from the caches are converted from the time domain to the Winograd domain; the specific operation is a matrix multiplication, and the results are denoted U and V, where U = G K G^T and V = B^T I B. In the second stage, the dot product operation ⊙ is performed on the Winograd domain weight matrix U and the Winograd domain activation value matrix V to obtain the Winograd domain dot product result matrix M = U ⊙ V. In the third stage, the dot product result is converted from the Winograd domain back to the time domain.
As shown in FIG. 2, the Winograd-based configurable convolutional array accelerator structure of the present invention comprises: an activation value caching module 1, a weight caching module 2, an output caching module 3, a controller 4, a weight preprocessing module 5, an activation value preprocessing module 6, a weight conversion module 7, an activation value matrix conversion module 8, a dot multiplication module 9, a result matrix conversion module 10, an accumulation module 11, a pooling module 12, and an activation module 13, wherein:
1) The activation value buffer module 1 is used for storing an input pixel value or an input feature map value, is connected with the controller 4, and provides activation value data for the activation value preprocessing module 6;
2) The weight buffer module 2 is used for storing the trained weight, is connected with the controller 4 and provides weight data for the weight preprocessing module 5;
3) The output buffer module 3 is used for storing the primary convolution layer result, connected with the controller 4, and transmitting the data into the output buffer module 3 for the next layer convolution after the activation module 13 finishes outputting the data;
4) A controller 4 for controlling transmission of the activation value data, the weight data, and the convolution layer data to be processed according to the calculation process;
5) The weight preprocessing module 5 is used for receiving the data to be operated on transmitted by the weight caching module 2 and dividing the convolution kernel, obtaining the four time domain weight matrices to be processed, K1, K2, K3, K4.
The weight preprocessing module 5: (1) extends a convolution kernel of size 5×5 to a 6×6 convolution matrix by zero padding; (2) divides the 6×6 convolution matrix into four 3×3 convolution kernels. In this way, a 5×5 convolution can be implemented with the 3×3 Winograd output paradigm efficiently, without increasing the number of multiplications, which dominate power consumption.
The specific division is shown below, where K_input represents the 5×5 time domain input weight matrix; on the right is the expanded 6×6 time domain input weight matrix, and the four results of dividing the 6×6 time domain weight matrix are the corresponding time domain weight matrices to be processed, K1, K2, K3, K4. In calculating U = G K G^T, K takes the values K1, K2, K3, K4 in turn.
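For illustration only, a minimal Python sketch of this preprocessing step follows. The patent's division figure is not reproduced, so the exact placement of the inserted zero row and column is an assumption; index 3 is chosen here because it makes the four sub-convolutions recombine exactly with the overlapping activation tiles shown in the next step:

```python
import numpy as np

def split_weights(k5):
    """Zero-pad a 5x5 kernel to 6x6 and divide it into four 3x3 sub-kernels.

    The position of the inserted zero row/column (index 3) is an assumption,
    chosen so that the split pairs consistently with the overlapping 4x4
    activation tiles sketched below; the patent's own figure is not
    reproduced.
    """
    assert k5.shape == (5, 5)
    k6 = np.insert(np.insert(k5, 3, 0.0, axis=0), 3, 0.0, axis=1)  # 6x6
    return [k6[0:3, 0:3], k6[0:3, 3:6],   # K1, K2
            k6[3:6, 0:3], k6[3:6, 3:6]]   # K3, K4
```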
6) The activation value preprocessing module 6 is used for receiving the data to be operated on transmitted by the activation value caching module 1, taking the activation values out of the cache, and dividing them, obtaining the time domain activation value matrices to be processed, I1, I2, I3, I4. In calculating V = B^T I B, I takes the values I1, I2, I3, I4 in turn.
The activation value preprocessing module 6 reads and preprocesses the activation values. In the Winograd algorithm the activation values must correspond to the weights, and much of the data is reused, so the division is overlapping. The module divides an activation value matrix of size 6×6 into 4 overlapping matrices of size 4×4, corresponding to the four 3×3 convolution kernels. The division is shown below, where I_input represents the 6×6 time domain input activation value matrix, and below it are the divided 4×4 time domain activation value matrices to be processed, I1, I2, I3, I4. In calculating V = B^T I B, I takes the values I1, I2, I3, I4 in turn.
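A matching sketch of the overlapping activation division, with a consistency check (again illustrative; the {0, 2} tile offsets are an assumption paired with the kernel split above):

```python
import numpy as np

def split_activations(a6):
    """Divide a 6x6 activation tile into four overlapping 4x4 tiles."""
    assert a6.shape == (6, 6)
    return [a6[0:4, 0:4], a6[0:4, 2:6],   # I1, I2
            a6[2:6, 0:4], a6[2:6, 2:6]]   # I3, I4

def conv_valid(a, k):
    """Plain 'valid' correlation, used only as a reference."""
    n = a.shape[0] - k.shape[0] + 1
    return np.array([[np.sum(a[i:i+k.shape[0], j:j+k.shape[1]] * k)
                      for j in range(n)] for i in range(n)])

rng = np.random.default_rng(1)
k5 = rng.integers(-4, 4, (5, 5)).astype(float)
a6 = rng.integers(-4, 4, (6, 6)).astype(float)
k6 = np.insert(np.insert(k5, 3, 0.0, axis=0), 3, 0.0, axis=1)
sub_k = [k6[0:3, 0:3], k6[0:3, 3:6], k6[3:6, 0:3], k6[3:6, 3:6]]

# The four 3x3 sub-convolutions, summed, equal the full 5x5 convolution.
out = sum(conv_valid(t, kk) for t, kk in zip(split_activations(a6), sub_k))
assert np.allclose(out, conv_valid(a6, k5))
```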
7) The weight conversion module 7 is used for receiving the data to be operated, which is transmitted by the weight preprocessing module 5, and converting the weight data from a time domain to a Winograd domain to obtain a Winograd domain weight matrix U;
the weight conversion module 7 performs matrix multiplication in calculation through row-column vector addition and subtraction, so as to perform conversion of weight matrix in Winograd convolution, and obtain Winograd domain weight matrix U= [ GKG ] T ]Wherein K represents a time domain weight matrix, G is a weight conversion auxiliary matrix, and U is a Winograd domain weight matrix;
the specific operation is as follows: taking the first row vector of the time domain weight matrix K as a temporary matrix C 2 In which the temporary matrix C 2 =G T K, performing K; because the weight matrix has a value of 1/2, only the integer number in the time domain weight matrix K is required to be shifted and supplemented by 0 to the right and the negative number is required to be shifted and supplemented by 1 to the right to complete the division of two; when the weight is positive, the weight is shifted to the right, and 0 is added to the left of the weight; when the weight is negative, the weight is shifted to the right, and 1 is supplemented to the left of the weight; the vector result after adding the first, second and third row elements of the time domain weight matrix K and right shifting by one bit is used as a temporary matrix C 2 Is a second row of (2); the vector result after adding the first, second and third row elements of the time domain weight matrix K and right shifting by one bit is taken as a matrix C 2 Is a third row of (2); taking the third row vector of the time domain weight matrix K as a temporary matrix C 2 Is the fourth row of (2); temporary matrix C 2 Is used as a first column of a Winograd domain weight matrix U; temporary matrix C 2 After the first, second and third columns of (a) are addedThe vector result after one bit shift right is used as the second column of Winograd domain weight matrix U; temporary matrix C 2 The vector result after adding the first, second and third columns and right shifting by one bit is used as the third column of Winograd domain weight matrix U; temporary matrix C 2 And taking the third column vector of the (4) as the fourth column of the Winograd domain weight matrix U to finally obtain the Winograd domain weight matrix U.
8) The activation value matrix conversion module 8 is used for receiving the data to be operated, which is transmitted by the activation value preprocessing module 6, and converting the activation value from a time domain to a Winograd domain to obtain a Winograd domain activation value matrix V;
the activation value matrix conversion module 8 performs matrix multiplication in calculation by adding and subtracting row vectors and column vectors, so as to perform conversion operation on the time domain activation value matrix in Winograd convolution, and obtain a Winograd domain activation value matrix V= [ B ] T IB]Wherein I is a time domain activation value matrix, B is an activation value conversion auxiliary matrix, and V is a Winograd domain activation value matrix;
the specific operation is as follows: taking the vector difference value of the first row minus the third row of the time domain activation value matrix I as a temporary matrix C 1 In which the temporary matrix C 1 =B T I, a step of I; the result of adding the second row and the third row of the time domain activation value matrix I is taken as a temporary matrix C 1 Is a second row of (2); taking the vector difference value of the third row minus the second row of the time domain activation value matrix I as a temporary matrix C 1 Is a third row of (2); taking the vector difference value of the second row minus the fourth row of the time domain activation value matrix I as a temporary matrix C 1 Is the fourth row of (2); temporary matrix C 1 The vector difference value of the third column minus the first column of the Winograd domain activation value matrix V; temporary matrix C 1 The result of the addition of the second and third columns of (a) is taken as the second column of Winograd domain activation value matrix V; temporary matrix C 1 Subtracting the vector difference value of the second column as the third column of the Winograd domain activation value matrix V; temporary matrix C 1 The vector difference value of the second column minus the fourth column is used as the fourth column of the Winograd domain activation value matrix V, and finally the Winograd domain activation value matrix V is obtained.
9) The dot multiplication module 9 is used for receiving the data to be operated on transmitted by the weight conversion module 7 and the activation value matrix conversion module 8, respectively, and performing the dot product of the Winograd domain activation value matrix and the Winograd domain weight matrix to obtain the Winograd domain dot product result matrix M; this is also the module that consumes the most computation time and resources in the convolution.
The dot multiplication module 9 obtains the Winograd domain dot product result matrix M by performing the dot product of the Winograd domain weight matrix U and the Winograd domain activation value matrix V, expressed as M = U ⊙ V, where U is the Winograd domain weight matrix and V is the Winograd domain activation value matrix. The dot multiplication module 9 implements a dot product with configurable data bit width and has two working modes, an 8-bit multiplier mode and a 16-bit multiplier mode, performing 8-bit and 16-bit operations respectively and realizing 8×8-bit and 16×16-bit fixed-point multiplication. Specifically:
(1) As shown in FIG. 3, the 8-bit multiplier includes a first gating unit 14, a first inverting unit 15, a first shifting unit 16, a first accumulating unit 17, a second gating unit 18, a second inverting unit 19, and a third gating unit 20, connected in sequence, wherein:
the first gating units 14 respectively receive: the data information of the weight conversion module 7 and the activation value matrix conversion module 8, and the symbol control signal of the weight conversion module 7;
the first inverting unit 15 receives the data information of the first gating unit 14 and inverts the received data;
the first shifting unit 16 receives the data information of the first inverting unit 15, and receives the sign bit information of the first strobe unit 14, and shifts the received data according to the sign information;
the first accumulating unit 17 receives the data information of the first shifting unit 16 and accumulates the received data;
the second gating unit 18 receives the data information of the first accumulating unit 17 and the sign bit information of the first gating unit 14, and transmits to the second inverting unit 19;
the second inverting unit 19 receives the data information of the second strobe unit 18 and inverts the received data;
the third gating unit 20 receives the data information of the second inverting unit 19 and the first accumulating unit 17, respectively, and outputs it.
The 8-bit multiplier operates as follows: the sign bit of the result is obtained by XOR-ing the sign bits of the two multiplier operands. Each operand is judged positive or negative from its own sign bit; if negative, the sign bit is removed and the remaining seven bits are inverted and incremented by 1, and if positive, the lower seven bits remain unchanged. After this judgment, for multiplier A1, each binary bit of multiplier B1 is examined in turn: if the bit is 1, the corresponding intermediate value is the lower seven bits of A1 shifted left by that bit position; if the bit is 0, the corresponding intermediate value is eight 0 bits. After the lower seven bits of B1 have all been examined, all intermediate values are added to obtain the product H2. Whether the result needs inverting is then determined from the result sign bit: if the result sign bit is 1, H2 is inverted and incremented by 1; if it is 0, H2 is kept unchanged, giving the result H3. Finally the result sign bit is placed in the eighth bit of H3 to obtain the final result. Unsigned 8-bit multiplication, which does not consider sign bits, obtains its result directly by the shift-and-add over the 8 data bits according to multiplier B1.
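A behavioral model of this signed shift-add multiplier (bit widths and names such as mul8_signed are illustrative, not taken from the patent's circuit):

```python
def mul8_signed(a, b):
    """Behavioral model of the shift-add 8-bit signed multiplier.

    Names and widths are illustrative; magnitudes are limited to the
    seven bits described in the text.
    """
    assert -128 < a < 128 and -128 < b < 128
    sign = (a < 0) ^ (b < 0)       # XOR of the two sign bits
    ua, ub = abs(a), abs(b)        # "invert and add 1" for negative inputs
    acc = 0
    for i in range(7):             # scan the seven magnitude bits of B
        if (ub >> i) & 1:
            acc += ua << i         # partial product: A shifted left by i
    return -acc if sign else acc   # restore the result sign

assert mul8_signed(-7, 13) == -91
assert mul8_signed(127, 127) == 16129
```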
(2) As shown in FIG. 4, the 16-bit multiplier includes a fourth gating unit 21, a third inverting unit 22, an 8-bit multiplier 23, a second shifting unit 24, a second accumulating unit 25, a fifth gating unit 26, a fourth inverting unit 27, and a sixth gating unit 28, connected in sequence, wherein:
the fourth strobe unit 21 receives: the data information of the weight conversion module 7 and the activation value matrix conversion module 8, and the symbol control signal of the weight conversion module 7;
the third inverting unit 22 receives the data information of the fourth strobe unit 21 and inverts the received data;
the 8-bit multiplier 23 performs 8-bit data bit width operation to realize 8 x 8bit fixed-point multiplication operation;
the second shifting unit 24 receives the data information of the 8-bit multiplier 23 and shifts the received data;
the second accumulating unit 25 receives the data information of the second shifting unit 24 and accumulates the received data;
the fifth gating unit 26 receives the data information of the second accumulating unit 25 and the sign bit information of the fourth gating unit 21, and transmits to the fourth inverting unit 27;
the fourth inverting unit 27 receives the data information of the fifth gating unit 26 and inverts the received data;
the sixth strobe unit 28 receives the data information of the fourth inverting unit 27 and outputs it.
The 16-bit multiplier is realized with four 8-bit multiplier units, whose gating signals are 0, i.e. they act as unsigned multipliers. First, the signs of the two 16-bit operands are judged from their sign bits; a negative operand is inverted and incremented by 1. Second, each judged 16-bit number is divided into a high 8-bit half and a low 8-bit half, and the halves are multiplied correspondingly. Then the product of the two high 8-bit halves is shifted left by 16 bits; the product of the high 8 bits of multiplier D with the low 8 bits of multiplier E and the product of the low 8 bits of D with the high 8 bits of E are added and the sum is shifted left by 8 bits; the shifted results are added to the product of the two low 8-bit halves to obtain the multiplication result L. Finally, whether inverting and adding 1 is needed is determined from the result sign bit: if the sign of L is 1, L is inverted and incremented by 1; if the sign bit is 0, L is kept unchanged; placing the sign bit value at the first position of L gives the final output result.
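A behavioral model of the four-multiplier composition (again, names are illustrative and the unsigned 8-bit unit is modeled by an ordinary product):

```python
def mul8u(x, y):
    """Stand-in for the unsigned 8-bit multiplier unit (gating signal 0)."""
    assert 0 <= x < 256 and 0 <= y < 256
    return x * y

def mul16_signed(a, b):
    """Behavioral model of the four-multiplier 16-bit scheme."""
    assert -32768 < a < 32768 and -32768 < b < 32768
    sign = (a < 0) ^ (b < 0)
    ua, ub = abs(a), abs(b)            # invert-and-add-1 for negatives
    ah, al = ua >> 8, ua & 0xFF        # high / low 8-bit halves
    bh, bl = ub >> 8, ub & 0xFF
    p = mul8u(ah, bh) << 16                       # high x high, shifted 16
    p += (mul8u(ah, bl) + mul8u(al, bh)) << 8     # cross products, shifted 8
    p += mul8u(al, bl)                            # low x low
    return -p if sign else p

assert mul16_signed(-300, 1234) == -370200
```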
10) The result matrix conversion module 10 receives the data to be operated on transmitted by the dot multiplication module 9 and converts the dot product result matrix from the Winograd domain to the time domain, obtaining the converted time domain dot product result matrix F.
The result matrix conversion module 10 performs the conversion operation F = A^T M A on the Winograd domain dot product result matrix M through row and column vector addition and subtraction, where M is the Winograd domain dot product result matrix, A is the conversion auxiliary matrix of M, and F is the time domain dot product result matrix.
The specific operation is as follows: take the vector result of adding the first, second, and third rows of M as the first row of a temporary matrix C3, where C3 = A^T M; take the vector result of the second row of M minus its third and fourth rows as the second row of C3. Then take the vector result of adding the first, second, and third columns of C3 as the first column of the converted time domain dot product result matrix F; take the vector result of the second column of C3 minus its third and fourth columns as the second column of F, finally obtaining the converted time domain dot product result matrix F.
11) The accumulation module 11 receives the data to be operated on transmitted by the result matrix conversion module 10 and, by accumulating the received data, obtains the final convolution result, a result matrix of size 2×2.
12) The pooling module 12 receives the data to be operated on transmitted by the accumulation module 11 and pools the final convolution result array. Different pooling methods, including maximum, average, and minimum pooling, can be used on the input neurons. Since the final output result matrix of Winograd convolution F(2×2, 3×3) is 2×2, a 2×2 pooling operation can be performed directly, and the pooling result is obtained through three size comparisons: the first compares the two numbers of the first row of the result matrix, the second compares the two numbers of the second row, and the third compares the results of the previous two comparisons, yielding the maximum pooling result of the result matrix, as sketched below.
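A minimal sketch of the three-comparison max pooling (illustrative):

```python
def max_pool_2x2(f):
    """Three-comparison max pooling of the 2x2 result tile, as described."""
    top = f[0][0] if f[0][0] > f[0][1] else f[0][1]   # compare first row
    bot = f[1][0] if f[1][0] > f[1][1] else f[1][1]   # compare second row
    return top if top > bot else bot                  # compare the winners

assert max_pool_2x2([[1, 4], [3, 2]]) == 4
```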
13) The activation module 13 receives the data to be operated on transmitted by the pooling module 12, applies the ReLU activation function to the pooling result to obtain the activated result, and transmits the activated result to the output buffer module 3.

Claims (7)

1. A Winograd-based configurable convolutional array accelerator structure comprising: an activation value buffer module (1), a weight buffer module (2), an output buffer module (3), a controller (4), a weight preprocessing module (5), an activation value preprocessing module (6), a weight conversion module (7), an activation value matrix conversion module (8), a dot multiplication module (9), a result matrix conversion module (10), an accumulation module (11), a pooling module (12) and an activation module (13),
the activation value buffer module (1) is used for storing input pixel values or input characteristic diagram values, is connected with the controller (4) and provides activation value data for the activation value preprocessing module (6);
the weight buffer memory module (2) is used for storing trained weight values, is connected with the controller (4) and provides weight data for the weight preprocessing module (5);
the output buffer module (3) is used for storing the primary convolution layer result, is connected with the controller (4), and transmits the data to the output buffer module (3) for the next layer convolution after the data output by the activation module (13) is completed;
a controller (4) for controlling transmission of the activation value data, the weight data, and the convolutional layer data to be processed according to the calculation process;
the weight preprocessing module (5) is used for receiving the data to be operated transmitted by the weight caching module (2) and dividing convolution kernels to obtain a time domain weight matrix K;
the activation value preprocessing module (6) is used for receiving the data to be operated transmitted by the activation value caching module (1), and is used for taking out the activation value from the activation value caching module (1) and dividing the activation value to obtain a time domain activation value matrix I;
the weight conversion module (7) is used for receiving the data to be operated, which is transmitted by the weight preprocessing module (5), and converting the weight data from a time domain to a Winograd domain to obtain a Winograd domain weight matrix U;
the activation value matrix conversion module (8) is used for receiving the data to be operated, which is transmitted by the activation value preprocessing module (6), and converting the activation value from a time domain to a Winograd domain to obtain a Winograd domain activation value matrix V;
the dot multiplication module (9) is used for respectively receiving the data to be operated transmitted by the weight conversion module (7) and the activation value matrix conversion module (8) and realizing dot product operation of the Winograd domain activation value matrix and the Winograd domain weight matrix to obtain a Winograd domain dot product result matrix M;
the result matrix conversion module (10) is used for receiving the data to be operated transmitted by the dot product module (9) and converting the dot product result matrix from a Winograd domain to a time domain to obtain a converted time domain dot product result matrix F;
the accumulation module (11) is used for receiving the data to be operated transmitted by the result matrix conversion module (10) and accumulating the received data to obtain a final convolution result;
the pooling module (12) receives the data to be operated transmitted by the accumulation module (11) and pools the final convolution result array;
the activation module (13) receives the data to be operated on transmitted by the pooling module (12), applies the ReLU activation function to the pooling result to obtain the activated result, and transmits the activated result to the output buffer module (3), wherein:
the weight preprocessing module (5) comprises:
(1) extending a convolution kernel of size 5×5 to a 6×6 convolution matrix by zero padding;
(2) dividing the 6×6 convolution matrix into four 3×3 convolution kernels;
the specific division is shown below, where K_input represents the 5×5 weight matrix, and below it are the 4 corresponding divided time domain weight matrices to be processed, K1, K2, K3, K4; in calculating U = G K G^T, K takes the values K1, K2, K3, K4 in turn,
wherein K represents the time domain weight matrix, G is the weight conversion auxiliary matrix, and U is the Winograd domain weight matrix;
the activation value preprocessing module (6) divides the 6×6 activation value matrix into 4 overlapping matrices of size 4×4; the division is shown below, where I_input represents the 6×6 activation value matrix, and below it are the divided 4×4 time domain activation value matrices to be processed, I1, I2, I3, I4; in calculating V = B^T I B, I takes the values I1, I2, I3, I4 in turn,
Wherein I is a time domain activation value matrix, B is an activation value conversion auxiliary matrix, and V is a Winograd domain activation value matrix.
2. The Winograd-based configurable convolutional array accelerator structure according to claim 1, wherein the weight conversion module (7) performs the matrix multiplication through row and column vector addition, subtraction, and one-bit shifts, thereby carrying out the weight matrix conversion in Winograd convolution to obtain the Winograd domain weight matrix U = G K G^T;
the specific operation is as follows: take the first row vector of the weight matrix K as the first row of a temporary matrix C2, where C2 = G K; division by two is completed by right-shifting, filling 0 on the left for positive values and 1 on the left for negative values; take the vector obtained by adding the first, second, and third rows of K and shifting right by one bit as the second row of C2; take the vector obtained by adding the first and third rows of K, subtracting the second row, and shifting right by one bit as the third row of C2; take the third row vector of K as the fourth row of C2; take the first column vector of C2 as the first column of the Winograd domain weight matrix U; take the vector obtained by adding the first, second, and third columns of C2 and shifting right by one bit as the second column of U; take the vector obtained by adding the first and third columns of C2, subtracting the second column, and shifting right by one bit as the third column of U; take the third column vector of C2 as the fourth column of U, finally obtaining the Winograd domain weight matrix U.
3. The Winograd-based configurable convolutional array accelerator structure according to claim 1, wherein the activation value matrix conversion module (8) performs the matrix multiplication through addition and subtraction of row and column vectors, thereby carrying out the conversion of the time domain activation value matrix in Winograd convolution to obtain the matrix V = B^T I B;
the specific operation is as follows: take the vector difference of the first row minus the third row of the time domain activation value matrix I as the first row of a temporary matrix C1, where C1 = B^T I; take the sum of the second and third rows of I as the second row of C1; take the vector difference of the third row minus the second row of I as the third row of C1; take the vector difference of the second row minus the fourth row of I as the fourth row of C1; take the vector difference of the first column minus the third column of C1 as the first column of the Winograd domain activation value matrix V; take the sum of the second and third columns of C1 as the second column of V; take the vector difference of the third column minus the second column of C1 as the third column of V; take the vector difference of the second column minus the fourth column of C1 as the fourth column of V, finally obtaining the Winograd domain activation value matrix V.
4. The Winograd-based configurable convolutional array accelerator structure according to claim 1, wherein the dot multiplication module (9) obtains the Winograd domain dot product result matrix M by performing the dot product of the Winograd domain weight matrix U and the Winograd domain activation value matrix V, expressed as M = U ⊙ V, where U is the Winograd domain weight matrix and V is the Winograd domain activation value matrix; the dot multiplication module (9) implements a dot product with configurable data bit width and has two working modes, an 8-bit multiplier mode and a 16-bit multiplier mode, which perform operations at the two data bit widths of 8 bits and 16 bits respectively, realizing 8×8-bit and 16×16-bit fixed-point multiplication.
5. The Winograd-based configurable convolutional array accelerator structure according to claim 4, wherein said 8-bit multiplier comprises a first gating unit (14), a first inverting unit (15), a first shifting unit (16), a first accumulating unit (17), a second gating unit (18), a second inverting unit (19) and a third gating unit (20) connected in sequence,
the first gating units (14) respectively receive: the weight conversion module (7) and the activation value matrix conversion module (8) are used for converting data information of the weight conversion module (7) and a symbol control signal of the weight conversion module (7);
the first negation unit (15) receives the data information of the first gating unit (14) and performs negation on the received data;
the first shifting unit (16) receives the data information of the first inverting unit (15) and the symbol bit information of the first gating unit (14), and shifts the received data according to the symbol information;
a first accumulation unit (17) receives the data information of the first shift unit (16) and accumulates the received data;
the second gating unit (18) receives the data information of the first accumulating unit (17) and the sign bit information of the first gating unit (14) and transmits the data information and the sign bit information to the second inverting unit (19);
a second inverting unit (19) receives the data information of the second gating unit (18) and inverts the received data;
the third gating unit (20) receives the data information of the second inverting unit (19) and the first accumulating unit (17) respectively and outputs the data information.
6. The Winograd-based configurable convolutional array accelerator structure according to claim 4, wherein said 16-bit multiplier comprises a fourth gating unit (21), a third inverting unit (22), an 8-bit multiplier (23), a second shifting unit (24), a second accumulating unit (25), a fifth gating unit (26), a fourth inverting unit (27) and a sixth gating unit (28) connected in sequence,
the fourth strobe units (21) respectively receive: the weight conversion module (7) and the activation value matrix conversion module (8) are used for converting data information of the weight conversion module (7) and a symbol control signal of the weight conversion module (7);
a third inverting unit (22) receives the data information of the fourth gating unit (21) and inverts the received data;
an 8-bit multiplier (23) performs 8-bit data bit width operation to realize 8 x 8bit fixed-point multiplication operation;
a second shift unit (24) receives data information of the 8-bit multiplier (23) and shifts the received data;
a second accumulating unit (25) receives the data information of the second shifting unit (24) and accumulates the received data;
the fifth gating unit (26) receives the data information of the second accumulating unit (25) and the sign bit information of the fourth gating unit (21) and transmits the data information and the sign bit information to the fourth inverting unit (27);
a fourth inverting unit (27) receives the data information of the fifth gating unit (26) and inverts the received data;
a sixth strobe unit (28) receives the data information of the fourth inverting unit (27) and outputs the data information.
7. The Winograd-based configurable convolution array accelerator structure according to claim 1, wherein the result matrix conversion module (10) performs the conversion operation F = AᵀMA on the Winograd-domain dot product result matrix M through row- and column-vector shift add-subtract operations, where M is the Winograd-domain dot product result matrix, A is the auxiliary transformation matrix of M, and F is the time-domain result matrix;
the specific operation is as follows: the vector obtained by adding the first, second and third rows of the Winograd-domain dot product result matrix M is taken as the first row of a temporary matrix C₃, where C₃ = AᵀM; the vector obtained by adding the second, third and fourth rows of M is taken as the second row of C₃; the vector obtained by adding the first, second and third columns of C₃ is taken as the first column of the converted time-domain result matrix F; and the vector obtained by adding the second, third and fourth columns of C₃ is taken as the second column of F, finally yielding the converted time-domain result matrix F.
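For reference, the standard Winograd F(2×2, 3×3) output transform (Lavin et al., cited in the non-patent citations below) matches the 4×4 → 2×2 shape this claim describes; note that its matrix A carries minus signs, so the row and column "additions" of the claim are realized by the shift add-subtract operations it names. The sketch below is illustrative, with the standard A assumed rather than taken from the patent.

```python
import numpy as np

# Standard output-transform matrix of Winograd F(2x2, 3x3); its second
# column carries minus signs, so e.g. row 2 of A^T M is m2 - m3 - m4.
A = np.array([[1,  0],
              [1,  1],
              [1, -1],
              [0, -1]])

def result_transform(M: np.ndarray) -> np.ndarray:
    """Apply F = A^T M A to a 4x4 Winograd-domain dot-product matrix M,
    yielding the 2x2 time-domain output tile F."""
    C3 = A.T @ M        # temporary matrix C3 = A^T M (2x4)
    return C3 @ A       # F = C3 A (2x2)
```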
CN201910511987.6A 2019-06-13 2019-06-13 Winograd-based configurable convolution array accelerator structure Active CN110288086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910511987.6A CN110288086B (en) 2019-06-13 2019-06-13 Winograd-based configurable convolution array accelerator structure

Publications (2)

Publication Number Publication Date
CN110288086A CN110288086A (en) 2019-09-27
CN110288086B (en) 2023-07-21

Family

ID=68004097

Country Status (1)

Country Link
CN (1) CN110288086B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766473B (en) * 2019-11-01 2023-12-05 中科寒武纪科技股份有限公司 Computing device and related product
CN112765538B (en) * 2019-11-01 2024-03-29 中科寒武纪科技股份有限公司 Data processing method, device, computer equipment and storage medium
CN111325332B (en) * 2020-02-18 2023-09-08 百度在线网络技术(北京)有限公司 Convolutional neural network processing method and device
CN112639839A (en) * 2020-05-22 2021-04-09 深圳市大疆创新科技有限公司 Arithmetic device of neural network and control method thereof
CN112580793B (en) * 2020-12-24 2022-08-12 清华大学 Neural network accelerator based on time domain memory computing and acceleration method
CN112734827B (en) * 2021-01-07 2024-06-18 京东鲲鹏(江苏)科技有限公司 Target detection method and device, electronic equipment and storage medium
CN112862091B (en) * 2021-01-26 2022-09-27 合肥工业大学 Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN112949845B (en) * 2021-03-08 2022-08-09 内蒙古大学 Deep convolutional neural network accelerator based on FPGA
CN113269302A (en) * 2021-05-11 2021-08-17 中山大学 Winograd processing method and system for 2D and 3D convolutional neural networks
CN113407904B (en) * 2021-06-09 2023-04-07 中山大学 Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network
CN113283591B (en) * 2021-07-22 2021-11-16 南京大学 Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
CN113554163B (en) * 2021-07-27 2024-03-29 深圳思谋信息科技有限公司 Convolutional neural network accelerator
CN113656751B (en) * 2021-08-10 2024-02-27 上海新氦类脑智能科技有限公司 Method, apparatus, device and medium for realizing signed operation by unsigned DAC
CN114399036B (en) * 2022-01-12 2023-08-22 电子科技大学 Efficient convolution calculation unit based on one-dimensional Winograd algorithm

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9805303B2 (en) * 2015-05-21 2017-10-31 Google Inc. Rotating data for neural network computations
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793199A (en) * 2014-01-24 2014-05-14 天津大学 Rapid RSA cryptography coprocessor capable of supporting dual domains
CN107862374A (en) * 2017-10-30 2018-03-30 中国科学院计算技术研究所 Processing with Neural Network system and processing method based on streamline
CN109190755A (en) * 2018-09-07 2019-01-11 中国科学院计算技术研究所 Matrix conversion device and method towards neural network
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device
CN109325591A (en) * 2018-09-26 2019-02-12 中国科学院计算技术研究所 Neural network processor towards Winograd convolution
CN109359730A (en) * 2018-09-26 2019-02-19 中国科学院计算技术研究所 Neural network processor towards fixed output normal form Winograd convolution
CN109447241A (en) * 2018-09-29 2019-03-08 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Y Huang et al.; "A High-efficiency FPGA-based Accelerator for Convolutional Neural Networks using Winograd Algorithm"; Journal of Physics: Conference Series; 2018-12-31; entire document *
Lingchuan Meng et al.; "Efficient Winograd Convolution via Integer Arithmetic"; arXiv; 2019-01-07; entire document *
Andrew Lavin et al.; "Fast Algorithms for Convolutional Neural Networks"; 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016-12-31; entire document *
Liqiang Lu et al.; "SpWA: An Efficient Sparse Winograd Convolutional Neural Networks Accelerator on FPGAs"; ACM; 2018-12-31; entire document *

Similar Documents

Publication Publication Date Title
CN110288086B (en) Winograd-based configurable convolution array accelerator structure
CN107862374B (en) Neural network processing system and processing method based on assembly line
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
US10810484B2 (en) Hardware accelerator for compressed GRU on FPGA
US11222254B2 (en) Optimized neuron circuit, and architecture and method for executing neural networks
CN110705703B (en) Sparse neural network processor based on systolic array
US20180046903A1 (en) Deep processing unit (dpu) for implementing an artificial neural network (ann)
CN111832719A (en) Fixed point quantization convolution neural network accelerator calculation circuit
CN107633297B (en) Convolutional neural network hardware accelerator based on parallel fast FIR filter algorithm
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN110851779B (en) Systolic array architecture for sparse matrix operations
CN111144556A (en) Hardware circuit of range batch processing normalization algorithm for deep neural network training and reasoning
Yang et al. An approximate multiply-accumulate unit with low power and reduced area
CN117813585A (en) Systolic array with efficient input reduced and extended array performance
CN115982528A (en) Booth algorithm-based approximate precoding convolution operation method and system
Duan et al. Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights
CN114341796A (en) Signed multiword multiplier
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
CN114399036A (en) Efficient convolution calculation unit based on one-dimensional Winograd algorithm
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
WO2021081854A1 (en) Convolution operation circuit and convolution operation method
Wu et al. A high-speed and low-power FPGA implementation of spiking convolutional neural network using logarithmic quantization
CN116524173A (en) Deep learning network model optimization method based on parameter quantization
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
CN115357214A (en) Operation unit compatible with asymmetric multi-precision mixed multiply-accumulate operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant