CN115081600A - Conversion unit for executing Winograd convolution, integrated circuit device and board card


Info

Publication number
CN115081600A
Authority
CN
China
Prior art keywords
data
convolution
bandwidth
input
transform unit
Prior art date
Legal status
Pending
Application number
CN202110266331.XA
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority to CN202110266331.XA
Publication of CN115081600A
Legal status: Pending


Classifications

    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06F 17/15 - Correlation function computation including computation of convolution operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization


Abstract

The invention relates to a transform unit for executing Winograd convolution, which comprises an input buffer, two adder groups and an output buffer. The input buffer receives and temporarily stores neuron data based on the input bandwidth; the two adder groups perform addition operations on the neuron data to generate transform data; the output buffer temporarily stores and outputs the transform data based on the output bandwidth. The invention guarantees network precision while accelerating performance, reducing area and reducing power consumption.

Description

Conversion unit for executing Winograd convolution, integrated circuit device and board card
Technical Field
The present invention relates generally to the field of neural networks. More particularly, the present invention relates to a transform unit, an integrated circuit device, and a board that perform Winograd convolution.
Background
With the rapid development of the information age, research in artificial intelligence and machine learning has achieved remarkable results, and the related industries are developing vigorously. Convolutional neural networks are widely used in computer vision, automatic driving, machine translation, speech recognition, smart home and other applications.
Convolutional neural networks have large numbers of parameters and heavy computational loads, so the execution performance of a convolutional neural network model is severely limited by the restricted area and computing power of portable mobile terminals; meanwhile, a processor not specially designed for convolution incurs a huge power-consumption overhead when carrying out convolution operations.
Winograd convolution is a convolution acceleration scheme based on a polynomial interpolation algorithm. The two inputs of the convolution operation, the neurons and the weights, are first partitioned at a certain scale and then each subjected to a linear transformation, the Winograd forward transform; the transformed neurons and weights are multiplied bit-wise (element-wise), the bit-wise multiplication result is subjected to a further linear transformation, the Winograd inverse transform, and finally a convolution result equivalent to the original convolution operation is obtained.
In the Winograd convolution operation, the forward and inverse transform matrices of the neurons and the weights consist entirely of simple fixed values, so the Winograd forward and inverse transforms of the neurons and weights can be realized with additions only. The multiplications required by the Winograd algorithm occur only in the bit-wise multiplication step, whose multiplication complexity is considerably lower than that of the original convolution algorithm. Because the hardware cost (timing, power consumption and area) of a multiplication is much higher than that of an addition of the same bit width, replacing the original convolution with Winograd convolution brings obvious gains in hardware energy efficiency and operation time.
However, no existing hardware is designed specifically for the Winograd convolution acceleration algorithm, so conventional artificial-intelligence chips cannot fully exploit the advantages of the Winograd convolution operation. A hardware device capable of efficiently running the Winograd convolution algorithm is therefore urgently needed.
Disclosure of Invention
In order to at least partially solve the technical problems mentioned in the background, the present invention provides a transformation unit, an integrated circuit device and a board card for performing Winograd convolution.
In one aspect, the present disclosure provides a forward transform unit for performing Winograd convolution, which includes an input buffer, two adder groups and an output buffer. The input buffer receives and temporarily stores neuron data based on the input bandwidth; the two adder groups perform addition operations on the neuron data to generate forward-transform data; the output buffer temporarily stores and outputs the forward-transform data based on the output bandwidth. The input bandwidth and the output bandwidth are the same, and the computing power of the addition operation is twice the input and output bandwidth.
In another aspect, the present invention discloses an inverse transform unit for performing Winograd convolution, comprising an input buffer, two adder groups and an output buffer. The input buffer receives and temporarily stores the bit-wise multiplication data based on the input bandwidth; the two adder groups perform addition operations on the bit-wise multiplication data to generate the convolution result; the output buffer temporarily stores and outputs the convolution result based on the output bandwidth. The input bandwidth and the output bandwidth are the same, and the computing power of the addition operation is twice the input and output bandwidth.
In another aspect, the present invention discloses an integrated circuit device comprising the forward transform unit and the inverse transform unit. The invention also discloses a board card comprising the integrated circuit device.
The hardware structure provided by the invention can be matched with a Winograd convolution acceleration algorithm, and has the technical effects of ensuring network precision, accelerating performance, reducing area and reducing power consumption.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are illustrated by way of example and not by way of limitation, and like reference numerals designate like or corresponding parts throughout the several views, in which:
FIG. 1 is a schematic diagram illustrating a convolution kernel performing a convolution operation with an input neuron image;
FIG. 2 is a diagram showing the conversion of an original convolution of F (2 × 2,3 × 3) to a Winograd convolution;
FIG. 3 is a visualization diagram illustrating a multiply-by-bit operation;
FIG. 4 is a diagram illustrating the homogeneous operation of forward-transformed data with weights;
FIG. 5 is a structural diagram showing a board card of an embodiment of the present invention;
FIG. 6 is a block diagram illustrating an integrated circuit device of an embodiment of the invention;
FIG. 7 is a schematic diagram showing the internal structure of a computing device of an embodiment of the invention;
FIG. 8 is a diagram showing an overlapping portion when a transform is being performed;
FIG. 9 is a schematic diagram showing a neuron cache of an embodiment of the present invention;
FIG. 10 is a schematic diagram showing a forward transform unit of an embodiment of the present invention;
FIG. 11 is a schematic diagram illustrating a forward transform data cache of an embodiment of the present invention;
FIG. 12 is a diagram illustrating weight caching according to an embodiment of the invention;
FIG. 13 is a schematic diagram showing the forward transform data buffer output side of an embodiment of the present invention;
FIG. 14 is a diagram illustrating a weight buffer output side according to an embodiment of the invention;
FIG. 15 is a schematic diagram showing an inverse transform unit of an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third" and "fourth", etc. in the claims, the description and the drawings of the present invention are used for distinguishing different objects and are not used for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims of this application, the singular form of "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this specification refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
The following detailed description of embodiments of the invention refers to the accompanying drawings.
The Winograd convolution acceleration algorithm (hereinafter the Winograd algorithm or Winograd convolution) applies linear transformations to the operands of the convolution operation so as to find a transformed form requiring the fewest multiplications, replacing the eliminated multiplications with a number of extra additions. In hardware terms, a multiplier is structurally more complex than an adder, has larger area and power consumption, and poorer overall processing performance, so a Winograd algorithm that in effect trades multiplications for additions has great advantages when processing convolution operations.
For a two-dimensional convolution, assume the input neuron image has size H × W (H being its height and W its width) and the weight has size r × s (r being its height and s its width); the convolution can then be expressed as F(m × n, r × s), where m × n is the size of the output neuron image, m its height and n its width. In order to reduce hardware complexity, improve generality and still achieve a good acceleration effect, the embodiment of the invention takes convolution kernels (i.e. weights) no larger than 3 × 3 as base convolution units and combines them to perform Winograd convolution of any scale with a convolution stride of 1. Specifically, an arbitrary F(m × n, r × s) is decomposed into, and then recombined from, five kinds of base convolutions with operation scales 3 × 3, 3 × 2 (or 2 × 3), 3 × 1, 2 × 2 and 2 × 1. More specifically, in the embodiments of the present invention an arbitrary F(m × n, r × s) is decomposed into a combination of the base convolutions F(2 × 2, 3 × 3), F(2 × 2, 3 × 2), F(2 × 2, 2 × 3), F(2 × 2, 3 × 1), F(2 × 2, 2 × 2) and F(2 × 2, 2 × 1). Note that, since a 1 × 1 convolution cannot be accelerated by Winograd convolution, the 1 × 1 scale is not among the base convolution units set by the embodiment of the present invention.
Taking F (2 × 2,5 × 5) with the input neuron image size of 6 × 6 and the step size of 1 as an example, before using the computing apparatus according to the embodiment of the present invention to perform Winograd convolution acceleration operation, the input neuron image of 6 × 6 and the convolution kernel of 5 × 5 need to be linearly split based on the base convolution unit, and the splitting process is shown in fig. 1.
Fig. 1 shows the convolution of a 5 × 5 convolution kernel 101 with a 6 × 6 input neuron image 102 to obtain a 2 × 2 convolution result 103. The convolution kernel 101 needs to be split into the scales 3 × 3, 3 × 2 (or 2 × 3), 3 × 1, 2 × 2 and 2 × 1; this embodiment preferentially splits off 3 × 3 first, then 3 × 2 (or 2 × 3), then 3 × 1, then 2 × 2, and finally 2 × 1. According to this rule, the convolution kernel 101 is split into 4 base convolution kernels: a 3 × 3 first base convolution kernel 104, a 3 × 2 second base convolution kernel 105, a 2 × 3 third base convolution kernel 106 and a 2 × 2 fourth base convolution kernel 107; that is, F(2 × 2, 5 × 5) is decomposed into one F(2 × 2, 3 × 3), one F(2 × 2, 3 × 2), one F(2 × 2, 2 × 3) and one F(2 × 2, 2 × 2). The input neuron image 102 is correspondingly split into 4 pieces of sub-neuron data: 4 × 4 first sub-neuron data 108, 4 × 3 second sub-neuron data 109, 3 × 4 third sub-neuron data 110 and 3 × 3 fourth sub-neuron data 111.
Then Winograd convolution operation is carried out, namely: convolving the first basis convolution kernel 104 with the first sub-neuron data 108 to generate a first sub-convolution result 112; convolving the second basis convolution kernel 105 with the second sub-neuron data 109 to generate a second sub-convolution result 113; convolving the third base convolution kernel 106 with the third sub-neuron data 110 to generate a third sub-convolution result 114; the fourth base convolution kernel 107 is convolved with the fourth sub-neuron data 111 to generate a fourth sub-convolution result 115.
Finally, the first sub-convolution result 112, the second sub-convolution result 113, the third sub-convolution result 114 and the fourth sub-convolution result 115 are added to obtain a convolution result 116, and the convolution result 116 is the same as the convolution result 103. The above is an example of using the Winograd convolution algorithm to implement the original convolution operation.
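This splitting can be checked numerically. The sketch below is a minimal illustration (not part of the patent) that assumes SciPy's correlate2d for the sliding-window (cross-correlation) convolution used in neural networks; it splits a 5 × 5 kernel into the four base kernels of Fig. 1 and confirms that the summed sub-results equal the original F(2 × 2, 5 × 5) result.

```python
import numpy as np
from scipy.signal import correlate2d  # CNN "convolution" = cross-correlation (no kernel flip)

rng = np.random.default_rng(0)
d = rng.standard_normal((6, 6))   # input neuron image, 6 x 6
g = rng.standard_normal((5, 5))   # original convolution kernel, 5 x 5

full = correlate2d(d, g, mode="valid")          # 2 x 2 reference result (stride 1)

# Split the 5 x 5 kernel into the four base kernels of Fig. 1:
# (row range, col range) -> 3x3, 3x2, 2x3, 2x2
blocks = [(slice(0, 3), slice(0, 3)),
          (slice(0, 3), slice(3, 5)),
          (slice(3, 5), slice(0, 3)),
          (slice(3, 5), slice(3, 5))]

acc = np.zeros((2, 2))
for rs, cs in blocks:
    g_sub = g[rs, cs]
    # matching sub-neuron block: a 2x2 output needs (2 + kh - 1) x (2 + kw - 1) input
    d_sub = d[rs.start: rs.start + 2 + g_sub.shape[0] - 1,
              cs.start: cs.start + 2 + g_sub.shape[1] - 1]
    acc += correlate2d(d_sub, g_sub, mode="valid")

assert np.allclose(acc, full)     # summed sub-convolutions reproduce F(2x2, 5x5)
```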
Further, the Winograd algorithm can be represented by the following equation:
Y = A^T[(GgG^T) ⊙ (B^T dB)]A

wherein Y denotes the output matrix of the convolution operation, A^T is the inverse-transform left-multiplication constant matrix, G is the weight-transform left-multiplication constant matrix, g is the weight of the original convolution, G^T is the weight-transform right-multiplication constant matrix, ⊙ denotes bit-wise (element-wise) multiplication, B^T is the neuron-transform left-multiplication constant matrix, d is the neuron data, B is the neuron-transform right-multiplication constant matrix, and A is the inverse-transform right-multiplication constant matrix. The left- and right-multiplication matrices of each transform are simply transposes of each other.
Taking F(2 × 2, 3 × 3) as an example, the constant matrices B^T, G and A^T are fixed numerical matrices (shown as figures in the original publication).
Fig. 2 shows a schematic diagram of the conversion of the original convolution of F(2 × 2, 3 × 3) into a Winograd convolution. As shown, neuron data 201 is convolved with convolution kernel 202. During calculation, the neuron data 201 is arranged row-wise according to the elements in the sliding window 203; the sliding window 203 slides 4 times to form a 4 × 9 matrix 204, the elements of the convolution kernel 202 are arranged in a column to form a 9 × 1 matrix 205, and the 4 × 9 matrix 204 and the 9 × 1 matrix 205 are multiplied (the matrix form of the convolution) to obtain the 4 × 1 convolution result 206.
Further, dividing the matrices along the dotted lines, the 4 × 9 matrix 204 is partitioned into a 2 × 3 block matrix 207, the 9 × 1 matrix 205 into a 3 × 1 block matrix 208, and the 4 × 1 convolution result 206 into a 2 × 1 block result 209. After the linear transformation, the first element of the 2 × 1 convolution result 209 is R_0 = M_0 + M_1 + M_2 and the second is R_1 = M_1 - M_2 - M_3, where M_0, M_1, M_2 and M_3 are given by the following sub-formulas:

M_0 = (K_0 - K_2)·W_0
M_1 = (K_1 + K_2)·(W_0 + W_1 + W_2)/2
M_2 = (K_2 - K_1)·(W_0 - W_1 + W_2)/2
M_3 = (K_1 - K_3)·W_2
Through this partitioning and linear transformation, the original convolution operation requires 36 multiplications, whereas the Winograd algorithm needs to execute only 16, reducing the multiplication complexity by a factor of 2.25.
As this conversion of a two-dimensional convolution shows, the Winograd algorithm mainly comprises the following steps. First, the weights are left- and right-multiplied by the weight constant matrices, i.e. GgG^T, to obtain the weights after Winograd linear transformation, namely the Winograd weights. Next, the neuron data undergo the forward transform operation, i.e. left- and right-multiplication by the neuron constant matrices, B^T dB, to obtain the forward-transformed data after Winograd linear transformation. Further, the forward-transformed data and the Winograd weight matrix are multiplied bit-wise, (GgG^T) ⊙ (B^T dB), to obtain the bit-wise multiplication data. Finally, the bit-wise multiplication data undergo the inverse transform operation, i.e. left- and right-multiplication by the Winograd inverse-transform constant matrices, A^T LA, where L = (GgG^T) ⊙ (B^T dB), finally yielding a convolution result equivalent to the original convolution.
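For reference, these four steps can be sketched in a few lines of NumPy. The constant matrices below are the standard F(2 × 2, 3 × 3) Winograd matrices from the literature and are assumed here only for illustration; the patent's own matrix figures are not reproduced in this text.

```python
import numpy as np
from scipy.signal import correlate2d

# Standard F(2x2, 3x3) Winograd constant matrices (assumed for illustration).
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

rng = np.random.default_rng(1)
d = rng.standard_normal((4, 4))   # neuron tile, (m + r - 1) x (n + s - 1)
g = rng.standard_normal((3, 3))   # weight, r x s

U = G @ g @ G.T          # step 1: weight transform GgG^T (done offline -> Winograd weight)
V = B_T @ d @ B_T.T      # step 2: neuron forward transform B^T dB (additions only)
M = U * V                # step 3: bit-wise (element-wise) multiplication
Y = A_T @ M @ A_T.T      # step 4: inverse transform A^T [.] A

assert np.allclose(Y, correlate2d(d, g, mode="valid"))  # equals the original convolution
```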
From the perspective of hardware design, the embodiment of the present invention performs pipeline design on the three large transformation steps according to the dependency and operation distinguishing characteristics among the three processes, so as to achieve more efficient acceleration performance. The following will be separately described for the design of the forward transform operation, the bit-multiplication operation, and the inverse transform operation.
Embodiments of the present invention use a forward transform unit to implement the forward transform operation, i.e. to compute B^T dB. According to the rules of Winograd convolution, the forward-transform left-multiplication matrix B^T has size (m + r - 1) × (m + r - 1) and the right-multiplication matrix B has size (n + s - 1) × (n + s - 1). Since the elements of B^T and B consist only of 0, 1 and -1, the matrix multiplications of the forward transform can be decomposed into additions in a fixed pattern, so the computing device of the embodiment configures a specific number of floating-point adders to complete the linear additions required by the whole matrix multiplication. Since the embodiment converts any original convolution into base convolutions, the scale of the forward transform unit is determined by the five base-convolution scales listed above; the following therefore analyses the five base convolutions on FP32 data, taking a 2 × 2 convolution result as an example (i.e. m = n = 2).
Taking the 3 × 3 base convolution as an example, its forward transform B^T dB can be expanded into a fixed pattern of additions (the expansion is given as matrix figures in the original). Based on this expansion, the forward-transform computing-power requirement of the forward transform unit corresponds directly to the number of additions, namely 4 × (n + s - 1) + 4 × (m + r - 1) = 32 flops (floating-point operations), and the input and output quantities of the forward transform unit are both (r + 1)(s + 1) × 32 = 16 × 32 bits; the factor of 32 bits arises because the data are FP32, i.e. 32-bit values. The hardware utilization of the forward transform unit is optimal when its input/output time equals its operation time, so the ratio of the input/output bandwidth of the forward transform unit to the addition computing power is preferably 16:32 = 1:2. In other words, when the cache bandwidth (or vectorization length) is l, the input bandwidth and the output bandwidth of the forward transform unit are l × 32 bits and the computing power of its adder group is 2 × l flops. Each operation generates 16 final results and, considering the 8 intermediate results produced during the operation, the minimum number of register bits in the register file is l × 32 × (16 + 8).
Taking the 3 × 2 base convolution as an example, its forward transform can likewise be expanded into additions (the expansion is given as matrix figures in the original). Based on this expansion, the forward-transform computing-power requirement of the forward transform unit is 4 × (n + s - 1) + 2 × (m + r - 1) = 20 flops, and the input and output quantities are both (r + 1)(s + 1) × 32 = 12 × 32 bits. To maximize the hardware utilization of the forward transform unit, the ratio of its input/output bandwidth to the addition computing power is preferably 12:20 = 3:5; that is, with an input bandwidth and output bandwidth of l × 32 bits, the computing power of the adder group is (5/3) × l flops. Each calculation yields 12 final results and 6 intermediate results, so with maximum pipelined use of the register file its minimum number of register bits is l × 32 × (12 + 6).
Taking the 2 × 2 base convolution as an example, its forward transform can be expanded into additions in the same way (the expansion is given as matrix figures in the original). Based on this expansion, the forward-transform computing-power requirement of the forward transform unit is 2 × (n + s - 1) + 2 × (m + r - 1) = 12 flops, and the input and output quantities are both (r + 1)(s + 1) × 32 = 9 × 32 bits, so the ratio of the input/output bandwidth of the forward transform unit to the addition computing power is preferably 9:12 = 3:4; that is, with an input bandwidth and output bandwidth of l × 32 bits, the computing power of the adder group is (4/3) × l flops. Each calculation yields 9 final results and 6 intermediate results, so with maximum pipelined use of the register file its minimum number of register bits is l × 32 × (9 + 6).
Taking the 3 × 1 base convolution as an example, its forward transform can be expanded into additions (the expansion is given as matrix figures in the original). Based on this expansion, the forward-transform computing-power requirement of the forward transform unit is 4 flops, and the input and output quantities are both (r + 1) × 32 = 4 × 32 bits. The ratio of the input/output bandwidth of the forward transform unit to the addition computing power is therefore preferably 4:4 = 1:1; that is, with an input bandwidth and output bandwidth of l × 32 bits, the computing power of the adder group is l flops. Each calculation yields 4 final results and 2 intermediate results, so with maximum pipelined use of the register file its minimum number of register bits is l × 32 × (4 + 2).
Taking the 2 × 1 base convolution as an example, its forward transform can be expanded into additions (the expansion is given as matrix figures in the original). Based on this expansion, the forward-transform computing-power requirement of the forward transform unit is 2 flops, and the input and output quantities are both (r + 1) × 32 = 3 × 32 bits, so the ratio of the input/output bandwidth of the forward transform unit to the addition computing power is preferably 3:2; that is, with an input bandwidth and output bandwidth of l × 32 bits, the computing power of the adder group is (2/3) × l flops. Each calculation yields 3 final results and 1 intermediate result, so with maximum pipelined use of the register file its minimum number of register bits is l × 32 × (3 + 1).
In order to support all five base convolutions simultaneously, the embodiment of the present invention makes the input bandwidth and the output bandwidth of the forward transform unit equal and sets the computing power of the addition operation to twice the input/output bandwidth; that is, the input bandwidth and the output bandwidth are both l × 32 bits, the computing power of the adder group is 2 × l flops, and the register file provides l × 32 × (16 + 8) register bits.
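The bandwidth-versus-computing-power bookkeeping above can be summarized with a short sketch. It is only an illustration built from the element and addition counts stated in the text; the helper names are not from the patent.

```python
# (transformed-tile elements, forward-transform additions) for each base convolution
# with a 2 x 2 output and FP32 data, as stated in the description.
forward_cases = {
    "3x3": (16, 32),
    "3x2": (12, 20),
    "2x2": (9, 12),
    "3x1": (4, 4),
    "2x1": (3, 2),
}

# adder flops needed per element of input/output bandwidth
ratios = {name: adds / elems for name, (elems, adds) in forward_cases.items()}
# 3x3 -> 2.0, 3x2 -> ~1.67, 2x2 -> ~1.33, 3x1 -> 1.0, 2x1 -> ~0.67

l = 16                         # vectorization length preferred by the embodiment
worst = max(ratios.values())   # 2.0, set by the 3x3 base convolution
print(f"I/O bandwidth: {l * 32} bits, adder-group computing power: {worst * l:g} flops")
```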
Turning to the bit-wise multiply-accumulate operator: based on overall considerations of hardware design, scheduling strategy and execution performance, the embodiment of the invention combines the bit-wise multiply-accumulate operation with the accumulation along the feature-map (channel) direction of the convolution neuron data and performs both in the same multiply-accumulate operator. This not only effectively reduces the overall complexity and resource consumption of the hardware design, but also reduces the number of on-chip cache accesses, saving power consumption and area while improving performance.
Assume the parameters of the convolutional layer are: input batch number (batch) N, number of input neuron channels Ci, input neuron data height Hi, input neuron data width Wi, number of output neuron channels Co, output neuron data height Ho, output neuron data width Wo, convolution kernel size r × s, and stride 1. Since this embodiment supports F(2 × 2, r × s) operations, Ho = Hi - r + 1 and Wo = Wi - s + 1, and the number of Winograd operations is determined by T, the number of tiles (slices) along the HW direction (the exact expression is given as a figure in the original).
Since the on-chip cache capacity is limited, the computing device of this embodiment performs calculation with a single batch (N = 1); the scale of the input neuron data fed to the computing device is therefore [1 Ci Hi Wi], the scale of the forward-transformed data is [1 Ci T (r + 1) × (s + 1)], the scale of the original weights is [Co Ci r s], and the scale of the Winograd weights is [1 Co Ci (r + 1) × (s + 1)].
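A small shape-bookkeeping sketch may help here. Note that the tile-count expression T = ceil(Ho/2) × ceil(Wo/2) is an assumption made for this sketch only, since the original gives the corresponding formula as a figure.

```python
import math

def layer_shapes(Ci, Hi, Wi, Co, r, s):
    """Data scales for one layer processed with F(2x2, r x s), stride 1, N = 1.
    The tile count T = ceil(Ho/2) * ceil(Wo/2) is an assumption of this sketch."""
    Ho, Wo = Hi - r + 1, Wi - s + 1
    T = math.ceil(Ho / 2) * math.ceil(Wo / 2)
    return {
        "input neurons":   (1, Ci, Hi, Wi),
        "forward data":    (1, Ci, T, (r + 1) * (s + 1)),
        "Winograd weight": (1, Co, Ci, (r + 1) * (s + 1)),
        "output neurons":  (1, Co, Ho, Wo),
    }

print(layer_shapes(Ci=64, Hi=56, Wi=56, Co=128, r=3, s=3))
```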
Fig. 3 is a visual illustration of the bit-wise multiplication described above. Since N = 1, each of the aforementioned tensors can be represented in three dimensions: the forward-transformed data 301 has scale [Ci T (r + 1) × (s + 1)], its three dimensions being Ci, T (the number of HW tiles) and (r + 1) × (s + 1); the Winograd weight 302 has scale [Co Ci (r + 1) × (s + 1)], its three dimensions being Co, Ci and (r + 1) × (s + 1). The bit-wise multiplication crosses the T tiles in the HW direction with the Co output channels and accumulates along the Ci direction, yielding the bit-wise multiplication data 303 of scale [Co T (r + 1) × (s + 1)], whose three dimensions are Co, T and (r + 1) × (s + 1).
In more detail, the forward-transformed data 301 comprises T data units of [Ci (r + 1) × (s + 1)] and the Winograd weight 302 comprises Co data units of [Ci (r + 1) × (s + 1)]; multiplying a pair of such units bit-wise gives an intermediate result of [Ci (r + 1) × (s + 1)], which is then accumulated along the Ci direction. This process is the same as a matrix multiplication, so it can be merged into a matrix multiplication operation, using hardware resources more effectively and reducing the registers consumed for intermediate storage.
Since the forward-transformed data 301 comprises T data units of [Ci (r + 1) × (s + 1)] and the Winograd weights 302 comprise Co such data units, every data unit of the forward-transformed data 301 must be multiplied with every data unit of the Winograd weights 302. As shown in Fig. 4, when performing the bit-wise multiplication, one data unit 401 of the forward-transformed data 301 is operated on simultaneously with the Co weight data units, i.e. the Co direction is used as the direction of parallel computation, producing an intermediate result 402. The next data unit and the Co weight data units are then taken from the forward-transformed data 301 and processed in the same way to generate the next intermediate result, and so on until all T data units have been processed, yielding the bit-wise multiplication data 303.
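The equivalence between this bit-wise multiplication with Ci-direction accumulation and a matrix multiplication can be checked with a small NumPy sketch; the dimensions and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
Ci, Co, T, P = 8, 5, 6, 16            # P = (r + 1) * (s + 1) = 16 for the 3x3 base convolution

V = rng.standard_normal((T, Ci, P))   # forward-transformed neuron data  [T,  Ci, P]
U = rng.standard_normal((Co, Ci, P))  # Winograd weights                 [Co, Ci, P]

# Reference: for every (t, co) pair, multiply bit-wise and accumulate over Ci.
ref = np.zeros((Co, T, P))
for t in range(T):
    for co in range(Co):
        ref[co, t] = (V[t] * U[co]).sum(axis=0)

# Same computation expressed as P independent [Co, Ci] x [Ci, T] matrix multiplications.
out = np.einsum('cip,tip->ctp', U, V)

assert np.allclose(out, ref)
```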
When the above data units are multiplied bit-wise and accumulated along the feature-map direction, the required amount of computation is (Ci + Ci - 1) × (r + 1) × (s + 1) flops. Since Ci is often very large, it is practically difficult to feed the bit-wise multiply-accumulate operator with Ci as the granularity of a single operation; this embodiment therefore splits Ci further and performs the multiply-accumulate in units of the vectorization length l, splits the multiply-accumulate over the other dimension (r + 1) × (s + 1) into (r + 1) × (s + 1) beats executed in sequence, and finally adds all results along the Ci direction to obtain the final result.
Since the output bandwidth of the forward transform unit is l × 32 bits, in order to keep the overall pipeline time from the forward transform unit to the bit-wise multiply-accumulate operator the same, this embodiment sets the computing power of each bit-wise multiply-accumulate operator to l + (l - 1) flops, i.e. l multiplications and l - 1 additions. If the multiply-accumulate unit has ω parallel lanes, i.e. comprises ω such operators working simultaneously, its computing power is ω × (l + (l - 1)) flops, a function of ω and l.
This embodiment further provides an inverse transform unit for performing the inverse transform operation, i.e. the calculation A^T LA, where L = (GgG^T) ⊙ (B^T dB), based on the inverse-transform left-multiplication matrix A^T of size 2 × (m + r - 1) and the right-multiplication matrix A of size (n + s - 1) × 2. Since the elements of A^T and A also consist only of 0, 1 and -1, the matrix multiplications of the inverse transform can likewise be decomposed into additions in a fixed pattern. Accordingly, the adder group of the inverse transform unit is configured with a specific number of floating-point adders to perform the linear additions required for the whole matrix multiplication. The five base convolutions are analysed below to determine the size of the inverse transform unit.
Taking the 3 × 3 base convolution as an example, its inverse transform can be expanded into additions (the expansion is given as matrix figures in the original). Based on this expansion, the inverse-transform computing-power requirement of the inverse transform unit (ITU 715) is 24 flops; the input bandwidth is (r + 1)(s + 1) × 32 = 16 × 32 bits and the output bandwidth is (s + 1) × 32 = 4 × 32 bits. As before, the hardware utilization of the inverse transform unit is optimal when its input time equals its operation time, so the ratio of the input bandwidth to the addition computing power is preferably 16:24 = 2:3; that is, with an input bandwidth of l × 32 bits, the computing power of the adder group is (3/2) × l flops. Each calculation produces 16 final results and no intermediate results, so with maximum pipelined use of the register file its minimum number of register bits is l × 32 × 16.
Taking the 3 × 2 base convolution as an example, its inverse transform can be expanded into additions (the expansion is given as matrix figures in the original). Based on this expansion, the inverse-transform computing-power requirement of the inverse transform unit is 16 flops, the input bandwidth is 12 × 32 bits and the output bandwidth is 4 × 32 bits, so the ratio of the input bandwidth to the addition computing power is preferably 12:16 = 3:4; that is, with an input bandwidth of l × 32 bits, the computing power of the adder group is (4/3) × l flops. Each calculation produces 12 final results and no intermediate results, so with maximum pipelined use of the register file its minimum number of register bits is l × 32 × 12.
Taking the 2 × 2 base convolution as an example, its inverse transform can be expanded into additions (the expansion is given as matrix figures in the original). Based on this expansion, the inverse-transform computing-power requirement of the inverse transform unit is 10 flops, the input bandwidth is 9 × 32 bits and the output bandwidth is 4 × 32 bits, so the ratio of the input bandwidth to the addition computing power is preferably 9:10; that is, with an input bandwidth of l × 32 bits, the computing power of the adder group is (10/9) × l flops. Each calculation produces 9 final results and no intermediate results, so with maximum pipelined use of the register file its minimum number of register bits is l × 32 × 9.
Taking the 3 × 1 base convolution as an example, its inverse transform can be expanded into additions (the expansion is given as matrix figures in the original). Based on this expansion, the inverse-transform computing-power requirement of the inverse transform unit is 4 flops, the input bandwidth is 4 × 32 bits and the output bandwidth is 2 × 32 bits, so the ratio of the input bandwidth to the addition computing power is preferably 4:4 = 1:1; that is, with an input bandwidth of l × 32 bits, the computing power of the adder group is l flops. Each calculation produces 4 final results and 2 intermediate results, so with maximum pipelined use of the register file its minimum number of register bits is l × 32 × (4 + 2).
Taking the 2 × 1 base convolution as an example, its inverse transform can be expanded into additions (the expansion is given as matrix figures in the original). Based on this expansion, the inverse-transform computing-power requirement of the inverse transform unit is 2 flops, the input bandwidth is 3 × 32 bits and the output bandwidth is 3 × 32 bits, so the ratio of the input bandwidth to the addition computing power is preferably 3:2; that is, with an input bandwidth of l × 32 bits, the computing power of the adder group is (2/3) × l flops. Each calculation produces 3 final results and 1 intermediate result, so with maximum pipelined use of the register file its minimum number of register bits is l × 32 × (3 + 1).
In order to support all five base convolutions simultaneously, the addition computing power of the inverse transform unit could be set to 3/2 times the input bandwidth, i.e. an adder-group computing power of (3/2) × l flops for an input bandwidth of l × 32 bits. However, to keep the hardware design relatively simple, this embodiment further makes the hardware configuration of the forward transform unit and the inverse transform unit identical. On the premise of satisfying the requirements of both units, the inverse transform unit in this embodiment therefore adopts the design of the forward transform unit: the input bandwidth and the output bandwidth are the same, and the computing power of the addition operation is twice the input/output bandwidth. In other words, the input bandwidth of the inverse transform unit is l × 32 bits, the output bandwidth is also l × 32 bits, and the computing power of its adder group is 2 × l flops.
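The same bookkeeping for the inverse transform unit confirms that its worst-case requirement (3/2 times the bandwidth, set by the 3 × 3 case) is covered by reusing the forward-unit design. The sketch below only restates the counts given in the text.

```python
# (input elements, inverse-transform additions) per base convolution, as stated above.
inverse_cases = {
    "3x3": (16, 24),
    "3x2": (12, 16),
    "2x2": (9, 10),
    "3x1": (4, 4),
    "2x1": (3, 2),
}

l = 16
worst = max(adds / elems for elems, adds in inverse_cases.values())   # 1.5
print(f"minimum adder-group power: {worst * l:g} flops")
# The embodiment instead reuses the forward-unit design (2 * l = 32 flops),
# which also covers this worst case and keeps the two units identical.
```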
In summary, the bandwidths and computing powers of the three core modules that perform the Winograd convolution in this embodiment (the forward transform unit, the bit-wise multiply-accumulate operator and the inverse transform unit) are all matched: the input bandwidths of the three core modules are all set to l × 32 bits, the output bandwidths are likewise all l × 32 bits, the computing power of the forward transform unit is 2 × l flops, the computing power of the bit-wise multiply-accumulate operator is ω × (l + (l - 1)) flops, and the computing power of the inverse transform unit is 2 × l flops.
As can be seen from the foregoing, the Winograd convolution operation is directly tied to the vectorization length parameter l. The vectorization length l is the minimum processing length and governs how much the neuron transform results can be reused in the computing device of this embodiment: the larger l is, the higher the reuse rate, while the required memory access volume, computation volume, power consumption and average hardware area all decrease proportionally. However, the parameters of the convolutional layers change with the network model; as l grows, whenever the channel count of part of a network model is smaller than l, computing power is wasted, which degrades the acceleration effect and adds area and power overhead. Therefore, when determining the vectorization length l, these two factors must be traded off against each other to arrive at the most suitable configuration.
Based on empirical values, weights were assigned to the main hardware components of this embodiment (such as the FP32 adders, the bit-wise multiplication units and the registers) to obtain their computing-power and resource-overhead functions; it was found that when l is greater than 16 the utilization of hardware resources can be kept at a high level. Taking further into account the input and output channel counts of currently common neural network models (such as LeNet, VGG16, VGG19 and AlexNet), the computing-power loss was calculated, and the combined loss was found to rise sharply once l exceeds 64. From these two quantitative analyses, the computing device of this embodiment performs better when the vectorization length l is between 16 and 64; considering generality towards possible future network architectures and parameters, this embodiment preferably selects l = 16.
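The channel-count trade-off described above can be illustrated with a toy utilization estimate; the padding model Ci -> ceil(Ci/l) * l is an assumption of this sketch, not a formula from the patent.

```python
import math

def lane_utilization(Ci, l):
    """Fraction of the l-wide datapath doing useful work when Ci channels
    are processed in ceil(Ci / l) groups of l lanes (assumed padding model)."""
    return Ci / (math.ceil(Ci / l) * l)

for Ci in (3, 16, 24, 64, 96):
    for l in (16, 64):
        print(f"Ci={Ci:3d}, l={l:2d}: utilization = {lane_utilization(Ci, l):.2f}")
# Small early-layer channel counts waste most of a large l, which is the
# trade-off behind choosing l between 16 and 64 (preferably 16).
```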
Fig. 5 shows the structure of the foregoing embodiment in the form of a board card. As shown in Fig. 5, the board card 50 includes a chip 501, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial-intelligence arithmetic unit that supports various deep-learning and machine-learning algorithms and meets the intelligent-processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing and data mining. Deep-learning technology in particular is widely applied in the cloud-intelligence field, a notable characteristic of which is the large input data size and the correspondingly high demands on the storage and computing capacity of the platform.
The chip 501 is connected to an external device 503 through an external interface device 502. The external device 503 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface. Data to be processed may be transferred from the external device 503 to the chip 501 through the external interface device 502, and the calculation results of the chip 501 may be transmitted back to the external device 503 via the external interface device 502. Depending on the application scenario, the external interface device 502 may take different interface forms, such as a PCIe interface.
The board card 50 also includes a memory device 504 for storing data, which comprises one or more memory units 505. The memory device 504 is connected to the control device 506 and the chip 501 by a bus and transfers data with them. The control device 506 on the board card 50 is configured to regulate the state of the chip 501; to this end, in one application scenario, the control device 506 may include a single-chip microcomputer (MCU).
Fig. 6 is a structural diagram showing a combined processing device in the chip 501 of this embodiment. As shown in fig. 6, the combination processing device 60 includes a computing device 601, an interface device 602, a processing device 603, and a DRAM 604.
The computing device 601 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, especially Winograd convolution operations, which can interact with the processing device 603 through the interface device 602 to collectively perform the user-specified operations.
The interface device 602 is used for transmitting data and control commands between the computing device 601 and the processing device 603. For example, the computing device 601 may obtain input data from the processing device 603 via the interface device 602, and write the input data to the on-chip cache of the computing device 601. Further, the computing device 601 may obtain the control command from the processing device 603 via the interface device 602, and also write the control command into the on-chip cache of the computing device 601. Alternatively or optionally, the interface device 602 may also read data in an on-chip cache of the computing device 601 and transmit to the processing device 603.
The processing device 603 is a general-purpose processing device that performs basic control, including but not limited to data transfer and turning the computing device 601 on and/or off. Depending on the implementation, the processing device 603 may be one or more types of central processing unit (CPU), graphics processing unit (GPU) or other general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined according to actual needs. As previously mentioned, the computing device 601 of the present invention on its own may be viewed as having a single-core structure or a homogeneous multi-core structure; when the computing device 601 and the processing device 603 are considered together, they form a heterogeneous multi-core structure.
The DRAM 604 stores the data to be processed; it is an off-chip memory, typically 16 GB or larger, and stores the data of the computing device 601 and/or the processing device 603, in particular the neuron data and weights on which Winograd convolution is to be performed. In this embodiment, the processing device 603 has beforehand linearly transformed the weights of the original convolution into the Winograd weights GgG^T and stored them in the DRAM 604.
Fig. 7 shows a block diagram of the computing device 601. The computing device 601 includes a bus 701, a direct memory access (DMA) module 702, an instruction cache (Iram) 707, a decode unit (IDU) 708, a neuron cache (Nram) 709, a forward transform unit (NTU) 710, a transform data cache (WNram) 711, a weight cache (Wram) 712, a multiply-accumulate operator (MAC) 713, a multiplication data cache (WRram) 714, an inverse transform unit (ITU) 715, a result cache (Rram) 716, and a logical operation module (ALU, arithmetic logic unit) 717.
The bus 701 is a common communication trunk for transmitting information between the devices, and is a transmission line bundle composed of wires, and the bus 701 is a generic name of a data bus, an address bus, and a control bus for transmitting data, data addresses, and commands, respectively, according to the kind of information transmitted by the combination processing device 60. The bus 701 serves as a communication channel for the DRAM 604 and the computing device 601, which in this embodiment is specifically PCIe.
The DMA module 702 is used to copy data from one address space to another, typically by transferring data between external memory (e.g., DRAM 604) and internal caches of the computing device 601. When the DMA transfer is to be performed, the processing device 603 gives the DMA module 702 the bus control right, and the DMA module 702 controls the bus 701 to transfer data, and after the DMA transfer is completed, the DMA module 702 gives the bus control right back to the processing device 603.
The DMA module 702 includes a neuron direct memory access (NDMA) 703, a weight direct memory access (WDMA) 704, an instruction direct memory access (IDMA) 705 and a result direct memory access (RDMA) 706. NDMA 703 inputs neuron data from the DRAM 604, WDMA 704 inputs the Winograd weights from the DRAM 604, IDMA 705 inputs instructions from the DRAM 604, and RDMA 706 outputs the calculation results to the DRAM 604. In other embodiments, NDMA 703, WDMA 704, IDMA 705 and RDMA 706 may be implemented by the same direct memory access.
Iram 707 is used to temporarily store instructions input by IDMA 705, and IDU 708 fetches the instructions from Iram 707 to decode them and controls other units to operate according to the decoded instructions. The IDU 708 is a decoding and scheduling unit of the entire computing device 601, and is responsible for decoding the control instructions obtained from the DRAM 604, converting the control instructions into control signals to coordinate operations of the various modules/units on the chip, and also responsible for performing various tasks such as branch prediction, exception handling, and interrupt handling. In fig. 7, thin line arrows indicate control flows, and thick line arrows indicate data flows.
Since the computing device 601 is aimed mainly at Winograd convolution calculation and has little general-purpose processing capability, it depends heavily on the scheduling and data communication of the processing device 603 during task execution; input/output communication between the computing device 601 and the processing device 603 is therefore very frequent, which would greatly limit the operation performance of the computing device 601. For this reason, the computing device 601 is provided with several small-capacity on-chip caches, such as Nram 709, WNram 711, Wram 712 and WRram 714, for temporarily storing data that can be reused.
When data on/off-chip is transferred, the neuron data and the Winograd weight are transferred in units of a single batch (N is 1), that is, the data unit of the neuron data is [ Ci Hi Wi ], the data unit of the Winograd weight is [ Co Ci (r +1) × (s +1) ], and the scale of the result obtained after the convolution operation of Winograd is [ Co Ho Wo ]. The former two are input data and the latter is output data, which are the minimum throughput transmitted and calculated in the calculating device 601, and as for the actual data throughput, it needs to be determined according to the size of the on-chip buffer and the operation scheduling flow, which will be further described below.
As can be seen from the characteristics of convolution operation, the convolution operation related to the input data of the above scale can be split in multiple dimensions, for example, in the Ci direction, the HW image direction, or the Co direction, but when Winograd transformation is involved, the minimum operation splitting unit is F (2 × 2, r × s), and the minimum splitting unit in the HW direction is (r +1) × (s + 1). Considering that the base convolution size of the computing device 601 for achieving the Winograd acceleration does not exceed 3 × 3, the embodiment estimates the buffer capacity based on the 3 × 3 base convolution which consumes the most on-chip buffer resources.
According to the rule of Winograd convolution, when forward conversion operation is carried out, vectorization length parameter l is required to be processed in parallel in the Ci direction, when bit-wise multiplication accumulation operation is carried out, operation is required to be carried out in parallel in the Co direction in the unit of l, and when inverse conversion is carried out, operation is required to be carried out in parallel in the Co direction in the unit of l, so that the size of the minimum neuron input data block participating in the operation can be estimated to be [ l (r +1) × (s +1) ]. Since the data block size of the neuron transformation result is estimated by the convolution with 3 × 3 basis, the block size of the Winograd weight data to be subjected to bit-wise multiplication and accumulation is [ l l 4 × 4], the block size of the bit-wise multiplication output data is [ l 4 × 4], and the block size of the inverse transformation output [ l 2 × 2 ].
Designing the on-chip caches at this scale would satisfy all functional requirements, but the design goals of data reuse and low power consumption must also be taken into account: the sizes above are only the minimum input/output storage scales needed for functional correctness, and the potential for optimizing the input/output volume of the Winograd convolution operation needs to be considered further. This embodiment therefore plans the caches as follows.
During the neuron forward transform, the operation is based on the minimum implementation unit F(2 × 2, r × s) with l as the vectorization length; the data block fetched each time has size [l 4 4], and the stride for fetching neurons is kept at 2. As shown in Fig. 8, there is a one-quarter overlap 806 between the data unit 801 to be transformed and the four data blocks 802, 803, 804 and 805 generated by the sliding window, the size of the overlap 806 being [l 4 4]. It can be seen that, in the course of forward-transforming data unit 801, each of the data blocks 802, 803, 804 and 805 contains one overlap 806, so 4 overlaps 806 arise. If the data were moved by splitting it into the minimum data unit [l 4 4], the data throughput required for the overlap portion 806 would be quadrupled and redundant data transfer would increase. To solve this problem, this embodiment further reduces the input/output volume by caching data units of a larger set scale in the on-chip cache of the computing device 601.
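The redundancy caused by fetching [l 4 4] blocks with a stride of 2 can be quantified roughly as follows; this is a simplified element-read count per channel group, not a formula from the patent.

```python
def fetch_volume(H, W, block=4, stride=2):
    """Elements read (per channel group of l) if every 4x4 forward-transform
    block is fetched separately from off-chip with stride 2."""
    tiles_h = (H - block) // stride + 1
    tiles_w = (W - block) // stride + 1
    return tiles_h * tiles_w * block * block

H = W = 32
print(fetch_volume(H, W) / (H * W))   # ~3.5x the unique data for 32x32, approaching 4x
```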
As previously mentioned, the convolution is performed between neuron data of scale [Ci Hi Wi] and Winograd weights of scale [Co Ci (r + 1) × (s + 1)]. This embodiment keeps as many Winograd weights on-chip as possible, i.e. it temporarily stores as many weight blocks of scale [l l (r + 1) × (s + 1)] on-chip as possible (the exact number is given as a figure in the original), so that this batch of neuron data can be computed with only one weight-loading operation, saving input/output volume on the weight data.
As for the output data, since the convolutional neural network also contains other network-layer operations such as activation, pooling and normalization, the convolution results need to be cached on-chip so that the subsequent network-layer operations can proceed; the computing device 601 therefore reserves a cache of fixed capacity for storing convolution results. This data buffer can share cache space with the results that finally pass through the various other layer operations, which reduces the data throughput needed for other layers to reload the convolution results and to transmit the final computation results off-chip.
As can be seen from the above optimization analysis, the buffer capacity of the neuron data should be as large as possible, so as to reduce the total throughput of the neuron data, and since the neuron data is accumulated along the Ci direction, the larger the amount of data stored along the Ci direction is, the more times the neuron data is reloaded and accumulated can be reduced. Furthermore, the buffer space for Winograd weights also needs to be as large as possible. Finally, this embodiment also needs to reserve the corresponding output result space for other layer operations. To sum up, this embodiment divides the on-chip cache into three main blocks respectively responsible for different functions: nram 709 is responsible for storing neuron data, Wram 712 is responsible for storing Winograd weights, and Rram716 is responsible for storing convolution results. The computing device 601 further sets 2 buffers responsible for temporarily storing the intermediate results: WNram 711 is responsible for temporarily storing the data after being transformed, and WRram 714 is responsible for temporarily storing the data after bit multiplication and accumulation.
Although the larger the buffer capacities for the neuron data, the Winograd weights, and the convolution results, the better, the buffer sizes are tied to the configuration of the operator resources; if they are set too large, the computing capability of the computing device 601 suffers. The criteria for judgment are the input/output bottleneck pressure and the computing-power pressure. This embodiment sets the size of Nram 709 to α × β × [l 4 4], where α is the directional coefficient of Ci and β is the directional coefficient of HW; the size of Wram 712 is set to α × γ × [l l 4 4], where γ is the directional coefficient of Co; and the scale of Rram 716 is set to β × γ × [l 2 2]. The time required to complete the operation on data of these scales is l × α × β × γ.
Preferably, this embodiment selects l to be 16, α to be 4, β to be 64, and γ to be 16. Considering that each FP32 datum occupies 4B, the storage capacity of Nram 709 is α × β × [l 4 4] × 4B = 256KB, the storage capacity of Wram 712 is α × γ × [l l 4 4] × 4B = 1MB, and the storage capacity of Rram 716 is β × γ × [l 2 2] × 4B = 256KB.
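The capacity arithmetic can be checked with a short Python sketch using the scales and preferred parameters given above; FP32 is taken as 4 bytes:

```python
# Capacity check for the on-chip buffers with the preferred parameters
# (l = 16, alpha = 4, beta = 64, gamma = 16, FP32 = 4 bytes).
l, alpha, beta, gamma, FP32 = 16, 4, 64, 16, 4

nram_bytes = alpha * beta * (l * 4 * 4) * FP32          # neuron data
wram_bytes = alpha * gamma * (l * l * 4 * 4) * FP32     # Winograd weights
rram_bytes = beta * gamma * (l * 2 * 2) * FP32          # convolution results

print(nram_bytes // 1024, "KB")            # 256 KB
print(wram_bytes // (1024 * 1024), "MB")   # 1 MB
print(rram_bytes // 1024, "KB")            # 256 KB
```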
Returning to FIG. 7, Nram 709 temporarily stores the neuron data sent by NDMA 703 according to the decoded instruction, and NTU 710 reads the neuron data from Nram 709 for the forward transformation, i.e., computes BᵀdB to produce the forward-transformed data, which is temporarily stored in WNram 711. FIG. 9 shows a schematic of Nram 709. In this embodiment, Nram 709 includes 4 storage arrays 901, 902, 903, 904, each of which includes 4 storage blocks 905, 906, 907, 908; each storage block consists of d storage locations of w bits, where d also represents the number of addresses of the storage locations. Preferably, w is 128 and d is 1024, so each storage block is 16KB, each storage array is 64KB, and Nram 709 has a total storage capacity of 256KB, a total width of 4 × w bits = 64B, and a depth of 4 × d = 4096.
In the width direction, the input bandwidth of Nram 709 is set to 4B, and the output bandwidth is matched to the input bandwidth of NTU 710. As mentioned above, the input bandwidth of NTU 710 is set to l × 32 bits; with l preferably 16, the input bandwidth of NTU 710 is 64B, so the output bandwidth of Nram 709 is also 4 × w bits = 64B. The input and output of Nram 709 need to proceed simultaneously, so a dual-port input/output design is adopted.
Fig. 10 shows a schematic diagram of the NTU 710. The NTU 710 includes an input buffer 1001, a register file 1002, an adder set 1003, and an output buffer 1004.
When the NTU 710 receives a command to load neuron data from Nram 709, the input buffer 1001 acts as a first-in-first-out queue buffer to temporarily store the neuron data based on the input bandwidth of 64B. The neuron data loading stage continues until all data has been received; the overall process is controlled by instructions issued by the IDU 708.
The register file 1002 fetches the temporarily stored neuron data from the input buffer 1001 in the programmed operation order according to the decoded instruction, and stores it at specific addresses of the register file 1002 to serve as addition operands. In this embodiment, the pipeline durations of the input stage, the operation stage and the output stage of the NTU 710 should be equal, so a hardware resource dependency can arise on the buffering. To solve this resource dependency, the register file 1002 is divided into a ping storage unit 1005 and a pong storage unit 1006 of the same size: the ith addition operand and the forward-transformed data generated from it are temporarily stored in the ping storage unit 1005, the (i+1)th addition operand and the (i+1)th forward-transformed data are temporarily stored in the pong storage unit 1006, and the (i+2)th addition operand and the (i+2)th forward-transformed data are temporarily stored in the ping storage unit 1005, overwriting the ith addition operand and the ith forward-transformed data. The register file 1002 stores data according to this rule.
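A minimal Python sketch of this ping/pong rule (class and variable names are hypothetical, not part of this embodiment) shows how operand i+2 lands back on the bank that held operand i:

```python
class PingPongRegfile:
    """Toy model of a two-bank register file: even indices use ping, odd use pong."""
    def __init__(self):
        self.banks = [None, None]          # banks[0] = ping, banks[1] = pong

    def store(self, i, operand):
        self.banks[i % 2] = operand        # operand i+2 lands on the same bank as i

    def load(self, i):
        return self.banks[i % 2]

rf = PingPongRegfile()
for i, data in enumerate(["op0", "op1", "op2"]):
    rf.store(i, data)
print(rf.load(2))   # "op2" has overwritten "op0" in the ping bank
```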
The adder group 1003 reads the addition operands in sequence from the specific addresses of the register file 1002 according to the decoded instruction and performs the addition operations. In this embodiment, there are 2 adder groups 1003, corresponding to the addition scheduling direction; each group includes 16 adders, corresponding to the vectorization direction l, and each adder is an FP32 adder. In the channel direction of the neuron data, the additions of the Winograd convolution forward transform are performed in a specific order: first the additions for the left multiplication matrix Bᵀ of the Winograd convolution are calculated, then the additions for the right multiplication matrix B of the Winograd convolution are calculated, finally producing the forward-transformed data, which is stored back into the register file 1002. The operation order, the register allocation, and the operation time all depend on the convolution filter size and are controlled by instructions sent by the IDU 708. The operation stage has a data dependency on the neuron data loading stage; the two are executed as a pipeline, realized by hardware counting.
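Since the matrix B of F(2 × 2, 3 × 3) contains only 0 and ±1, the forward transform BᵀdB reduces to additions and subtractions, which is why adder groups suffice. The following NumPy sketch is only a numerical reference for that identity, assuming the commonly used F(2, 3) transform matrix; it does not reproduce the hardware's operand scheduling:

```python
import numpy as np

# Standard F(2, 3) forward-transform matrix B^T (entries are 0 and +-1 only).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

d = np.arange(16, dtype=np.float32).reshape(4, 4)   # one 4x4 input tile

V = BT @ d @ BT.T        # left-multiply by B^T, right-multiply by B
print(V)                 # forward-transformed tile, obtainable with adds/subs only
```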
The output buffer 1004 is also a first-in-first-out queue buffer, which temporarily stores the forward-transformed data taken in turn from the ping storage unit 1005 and the pong storage unit 1006. The output stage depends on the complete finish of the operation stage and outputs the corresponding buffered data based on the output bandwidth of 64B.
Since the forward-transformed data needs to be multiplexed to save overhead, WNram 711 is configured to buffer it and send it repeatedly multiple times. It includes multiple cache units; an exemplary WNram 711 is shown in fig. 11, where WNram 711 includes 4 cache units: a first cache unit 1101, a second cache unit 1102, a third cache unit 1103, and a fourth cache unit 1104. The forward-transformed data from NTU 710 is sent to one or more of these cache units by route distribution.
WNram 711 sends the forward-transformed data to MAC 713 in a certain order for the subsequent operations. WNram 711 is designed to cache one part of the forward-transformed data and send it to MAC 713 before storing the next part; this pipelining keeps the required size of WNram 711 small. Further, each block of forward-transformed data is to be bit-multiplied with Winograd weights of scale γ × [l l 4 4], which are transmitted to MAC 713 in γ blocks; in this way the forward-transformed data only has to be output once every γ beats on average, which effectively reduces the power consumption overhead of WNram 711. Accordingly, the previous γ pieces of forward-transformed data are sequentially overwritten by the next γ pieces, so the minimum storage size of WNram 711 can be limited to [l (r+1) (s+1)] × 4B, that is, [l 4 4] × 4B = 1KB as described above.
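The 1KB minimum and the γ-fold reuse follow from simple bookkeeping; the Python sketch below merely restates that arithmetic with the preferred parameters and is not a model of the WNram hardware:

```python
# Reuse bookkeeping: one [l 4 4] forward-transformed block stays resident while
# gamma weight blocks stream past it.
l, gamma, FP32 = 16, 16, 4

min_wnram_bytes = l * 4 * 4 * FP32      # one resident [l 4 4] transformed block
print(min_wnram_bytes)                  # 1024 bytes = 1 KB

sends_without_reuse = gamma             # re-sent on every one of the gamma beats
sends_with_reuse = 1                    # sent once, then reused for gamma beats
print(sends_without_reuse // sends_with_reuse)   # 16x fewer transfers of the block
```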
Specifically, the first cache unit 1101, the second cache unit 1102, the third cache unit 1103 and the fourth cache unit 1104 each have a width of w1 bytes and a depth of d1, and are divided into m parts in the depth direction. In this embodiment, m is preferably 8, w1 is 64, and d1 is 128, so each cache unit has a width of 64B and a depth of 128, with its address space divided into 8 parts in the depth direction for data multiplexing; each cache unit is therefore 8KB, and the total capacity of WNram 711 is set to 32KB.
Referring back to fig. 7, Wram 712 temporarily stores the Winograd weights sent from WDMA 704 according to the decoded instructions, and MAC 713 reads the Winograd weights from Wram 712 and the forward-transformed data from WNram 711 according to the decoded instructions, performing the bit-multiplication and accumulation operation on them, that is, computing [(GgGᵀ) ⊙ (BᵀdB)], which generates the bit-multiplied data; the bit-multiplied data is temporarily stored in WRram 714.
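Logically, this step is an element-wise product of the transformed weight tile and the transformed data tile, accumulated along the input-channel direction Ci. The NumPy sketch below is a functional reference only (the channel count of 16 and the random tiles are assumed for illustration), not a description of the MAC array:

```python
import numpy as np

ci = 16                                            # hypothetical input-channel count
V = np.random.rand(ci, 4, 4).astype(np.float32)    # B^T d B, one tile per input channel
U = np.random.rand(ci, 4, 4).astype(np.float32)    # G g G^T, one tile per input channel

M = (U * V).sum(axis=0)    # element-wise (bit) multiplication, accumulated over Ci
print(M.shape)             # (4, 4): one tile ready for the inverse transform
```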
Fig. 12 shows a schematic diagram of Wram 712. In this embodiment, Wram 712 includes 4 storage arrays 1201, 1202, 1203, 1204, and WDMA 704 sends the Winograd weights to the storage arrays 1201, 1202, 1203, 1204 by route distribution. Each storage array includes 4 storage blocks 1205, 1206, 1207, 1208, and each storage block includes 4 storage cells 1209, 1210, 1211, 1212, each storage cell having a size of d × w bits. As previously mentioned, w is 128 and d is 1024, so the size of each storage block is 64KB and the size of each storage array is 256KB, giving Wram 712 a total capacity of 1MB. Each storage array has a width of 4 × w = 512 bits and is segmented in the depth direction into 4 address-independent segments, each of depth d = 1024, for a total depth of 4 × d = 4096.
In this embodiment, each storage array 1201, 1202, 1203, 1204 independently has an input bandwidth and an output bandwidth of 4 × w bits, and the total output bandwidth of Wram 712 is 4 × 4 × w bits. Specifically, when w is 128, the input bandwidth and the output bandwidth of each storage array are 64B, and the total output bandwidth is 256B.
In this embodiment, the MAC 713 includes 64 MAC operators, divided into 4 groups that operate on 4 different batches, with the 16 MAC operators in each group laid out independently. The forward-transformed data from WNram 711 needs to be sent to the 64 MAC operators simultaneously so that it can be bit-multiplied and accumulated with different Winograd weights; WNram 711 therefore sends the forward-transformed data by broadcasting or route distribution. Because the output load is large, in order to guarantee drive capability and timing, the forward-transformed data of WNram 711 is sent through a two-stage N1/N2 broadcast or distribution routing: it is first sent to 4 N1 nodes, each N1 node broadcasts or distributes it to 4 N2 nodes, and each N2 node broadcasts or distributes it to 4 MAC operators.
Fig. 13 shows a schematic diagram of the output side of WNram 711. The MAC 713 first performs the bit-wise multiplication and then accumulates the resulting vector in sequence; its logical function is equivalent to computing a vector inner product, or one element of a matrix multiplication. Each MAC group includes 16 MAC units 1301, i.e., ω = 16, and since l is preferably 16, the computing power of each MAC group is 16 × (16 + (16 - 1)) = 496 flops.
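The 496-flop figure follows from each MAC unit computing a length-l inner product, i.e., l multiplications plus l - 1 additions, with 16 units per group. A short Python check (the random vectors are illustrative only):

```python
import numpy as np

l, units_per_group = 16, 16
flops_per_unit = l + (l - 1)                 # 16 multiplies + 15 additions
print(units_per_group * flops_per_unit)      # 496

# One MAC unit is logically an inner product over the vectorization length l.
a = np.random.rand(l).astype(np.float32)
b = np.random.rand(l).astype(np.float32)
print(np.dot(a, b))
```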
Fig. 14 shows a schematic diagram of the output side of Wram 712. The 4 outputs of Wram 712 are each responsible for feeding 16 MAC units 1301; that is, in this embodiment each storage array 1201, 1202, 1203, 1204 is responsible for the data transfer of the 16 MAC units 1301 under a single N1 node. Since the output bandwidth is only 64B, the bandwidth is time-division multiplexed: each N2 node occupies only one eighth of the bandwidth time, and the remaining bandwidth time is idle to reduce power consumption. In more detail, Wram 712 transmits the Winograd weights to the N1 node based on the 64B bandwidth, the N1 node transmits the Winograd weights to the N2 nodes in a broadcast manner based on the 64B bandwidth, and each N2 node transmits the Winograd weights to its MAC units 1301 by route distribution based on the 64B bandwidth. Each MAC unit 1301 can perform an FP32 multiply-accumulate operation of length l.
ITU 715 reads the bit-multiplied data from WRram 714 according to the decoded instruction and inversely transforms it, i.e., performs the AᵀLA operation (where L denotes the bit-multiplied data) to obtain the convolution result, which is temporarily stored in Rram 716.
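For F(2 × 2, 3 × 3) the matrix A likewise contains only 0 and ±1, so AᵀLA again needs only additions and subtractions. The NumPy sketch below is a numerical reference assuming the standard F(2, 3) inverse-transform matrix, with a random 4 × 4 tile standing in for the bit-multiplied data L:

```python
import numpy as np

# Standard F(2, 3) inverse-transform matrix A^T (entries are 0 and +-1 only).
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

L = np.random.rand(4, 4).astype(np.float32)   # one bit-multiplied, Ci-accumulated tile

Y = AT @ L @ AT.T          # 2x2 tile of convolution output
print(Y)
```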
Figure 15 shows a schematic diagram of ITU 715. ITU 715 includes input buffer 1501, register file 1502, adder bank 1503, and output buffer 1504.
When ITU 715 receives an instruction to load the bit-multiplied data from WRram 714, the input buffer 1501 acts as a first-in-first-out queue buffer to temporarily store the bit-multiplied data based on the input bandwidth. The stage of loading the bit-multiplied data continues until all data has been received; convolution filters of different sizes are configured with fixed, independent cache resource partitions and input counts, and the overall process is controlled by instructions sent by the IDU 708.
The register file 1502 fetches the temporarily stored bit-multiplied data from the input buffer 1501 in a fixed operation order according to the decoded instruction, and stores it at specific addresses of the register file 1502 to serve as addition operands. Similarly, to solve the resource dependency problem, the register file 1502 has a ping storage unit 1505 and a pong storage unit 1506 of the same size: the ith addition operand and the convolution result generated from it are temporarily stored in the ping storage unit 1505, the (i+1)th addition operand and the (i+1)th convolution result are temporarily stored in the pong storage unit 1506, and the (i+2)th addition operand and the (i+2)th convolution result are temporarily stored in the ping storage unit 1505, overwriting the ith addition operand and the ith convolution result. The register file 1502 stores data according to this rule.
The adder group 1503 sequentially reads the addition operands from the specific addresses of the register file 1502 according to the decoded instruction and performs the addition operations. Like the adder groups 1003, there are 2 adder groups 1503, corresponding to the addition scheduling direction; each group includes 16 adders, corresponding to the vectorization direction, and each adder is an FP32 adder. In the channel direction of the bit-multiplied data, the additions of the Winograd convolution inverse transform are performed in a specific order: first the additions for the left multiplication matrix Aᵀ of the Winograd convolution are calculated, then the additions for the right multiplication matrix A of the Winograd convolution are calculated, finally producing the convolution result, which is stored back into the register file 1502. The operation order, the register allocation, and the operation time all depend on the convolution filter size and are controlled by instructions sent by the IDU 708. The operation stage has a data dependency on the bit-multiplied data loading stage; the two are executed as a pipeline, realized by hardware counting.
The output buffer 1504 is also a first-in-first-out queue buffer, which temporarily stores the convolution results taken in turn from the ping storage unit 1505 and the pong storage unit 1506. The output stage depends on the complete finish of the operation stage and outputs the corresponding buffered data based on the output bandwidth.
In addition to the Winograd convolution, the computing device 601 is capable of performing all neural-network-related operations, and the ALU 717 performs two kinds of tasks according to the decoded instructions. The first task is the fused convolution operations, i.e., operations that can be completed on chip together with the convolution layer in one pass without depending on additional data, such as activation, bias addition, and partial accumulation along a direction. The second task is the non-convolution operations. The results of the ALU 717 operations are also buffered in Rram 716. The presence of the ALU 717 ensures that the various operations in a convolutional neural network can be carried out entirely within the computing device 601, giving the computing device 601 generality and completeness for neural networks.
RDMA 706 fetches the convolution result from Rram 716 and outputs it to DRAM 604 according to the decoded instruction, completing the entire convolution operation. Similarly, RDMA 706 can also fetch the other operation results generated by the ALU 717 from Rram 716 and output them to DRAM 604 according to the decoded instruction. In this embodiment, the output bandwidth of Rram 716 is w bytes, and Rram 716 likewise includes 4 storage arrays; each storage array contains (4 × d) × (4 × w) bits of storage cells, i.e., it is 512 bits wide and 4096 deep, so each storage array is 256KB and Rram 716 is 1MB. The input/output dual-port bandwidth of each storage array is 64B, and the addresses in the depth direction are divided into 16 parts with 256 addresses in each part, for storing results in the neuron multiplexing direction.
The present invention performs the hardware design based on the characteristics of the Winograd algorithm so as to make the acceleration general-purpose, provides a pipelined operation mode to speed up the Winograd convolution operation, and makes full use of reusable resources in the hardware implementation through methods such as time-division multiplexing and broadcast routing. The hardware structure provided by the present invention matches the Winograd convolution algorithm and achieves the technical effects of preserving network accuracy, accelerating performance, reducing area, and reducing power consumption.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a car recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present invention can also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical care, and the like. Furthermore, the electronic equipment or the device can be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the electronic device or apparatus with high computational power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), and the electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of simplicity, the present invention sets forth some methods and embodiments thereof as a series and combination of acts, but those skilled in the art will appreciate that the inventive arrangements are not limited by the order of acts described. Accordingly, persons skilled in the art may appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the invention. Further, those skilled in the art will appreciate that the described embodiments of the invention are capable of being practiced in other alternative embodiments that may involve fewer acts or modules than are necessary to practice one or more aspects of the invention. In addition, the description of some embodiments of the present invention is also focused on different schemes. In view of this, those skilled in the art will understand that portions of the present invention that are not described in detail in one embodiment may also refer to related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, one of ordinary skill in the art will appreciate that the several embodiments disclosed herein can be practiced in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the scheme of the embodiment of the invention. In addition, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause a1, a forward transform unit for performing a Winograd convolution, comprising: an input buffer for receiving and temporarily storing neuron data based on an input bandwidth; 2 groups of adder groups for performing addition operations on the neuron data to generate positive transformation data; and an output buffer to temporarily store and output the forward transformed data based on an output bandwidth; wherein the input bandwidth and the output bandwidth are the same, and the computational power of the addition operation is twice of the input bandwidth and the output bandwidth.
Clause a2, the forward transform unit of clause a1, wherein each set of adders includes 16 adders to perform addition operations in a particular order in a channel direction of the neuron data.
Clause A3, the forward transform unit of clause a2, wherein the specific order is the addition of first calculating the left-times matrix of the Winograd convolution, then calculating the right-times matrix of the Winograd convolution.
Clause a4, the forward transform unit of clause a2, wherein each adder is a FP32 adder.
Clause a5, the forward transform unit of clause a4, wherein the input bandwidth and the output bandwidth are l x 32 bits, the computation power is 2 x l floating point operations performed per second, where l is a vectorization length.
Clause a6, the forward transform unit of clause a1, further comprising a register file for fetching the neuron data from the input buffer and storing it to a specific address to become an addition operand, the adder bank reading the addition operand from the specific address to perform an addition operation in forward transform.
Clause a7, the forward transform unit of clause a1, wherein the input buffer and the output buffer are first-in-first-out queue buffers.
Clause A8, an inverse transform unit performing a Winograd convolution, comprising: an input buffer for receiving and temporarily storing bit-multiplied data based on an input bandwidth; 2 adder groups for performing addition operations on the bit-multiplied data to generate a convolution result; and an output buffer for temporarily storing and outputting the convolution result based on an output bandwidth; wherein the input bandwidth and the output bandwidth are the same, and the computational power of the addition operation is twice the input bandwidth and the output bandwidth.
Clause a9, the inverse transform unit of clause A8, wherein each set of adders includes 16 adders to add in a particular order in the channel direction of the bit multiplied data.
Clause a10, the inverse transform unit of clause a9, wherein the specific order is the addition of first calculating the left-hand multiplication matrix of the Winograd convolution, and then calculating the addition of the right-hand multiplication matrix of the Winograd convolution.
Clause a11, the inverse transform unit of clause a9, wherein each adder is a FP32 adder.
Clause a12, the inverse transform unit of clause a11, wherein the input bandwidth and the output bandwidth are l x 32 bits, the computation power is 2 x l floating point operations performed per second, where l is the vectorization length.
Clause A13, the inverse transform unit of clause A8, further comprising a register file for fetching the bit-multiplied data from the input buffer and storing it at a specific address to become an addition operand, the adder groups reading the addition operand from the specific address for the addition operation in the inverse transform.
Clause a14, the inverse transform unit of clause A8, wherein the input buffer and the output buffer are first-in-first-out queue buffers.
Clause a15, an integrated circuit device comprising the forward transform unit of any of clauses a 1-7 and the inverse transform unit of any of clauses A8-14.
Clause a16, a board comprising the integrated circuit device of clause a 15.
The above embodiments of the present invention are described in detail, and the principle and the implementation of the present invention are explained by applying specific embodiments, and the description of the above embodiments is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (16)

1. A forward transform unit that performs a Winograd convolution, comprising:
an input buffer for receiving and temporarily storing neuron data based on an input bandwidth;
2 groups of adder groups for performing addition operations on the neuron data to generate positive transformation data; and
an output buffer for temporarily storing and outputting the forward transformed data based on an output bandwidth;
wherein the input bandwidth and the output bandwidth are the same, and the computational power of the addition operation is twice of the input bandwidth and the output bandwidth.
2. The forward transform unit of claim 1, wherein each set of adder banks includes 16 adders performing addition operations in a particular order in a channel direction of the neuron data.
3. The forward transform unit of claim 2, wherein the particular order is the addition of first calculating a left-times matrix of a Winograd convolution, then calculating a right-times matrix of a Winograd convolution.
4. The forward transform unit of claim 2, wherein each adder is an FP32 adder.
5. The forward transform unit of claim 4, wherein the input bandwidth and the output bandwidth are l x 32 bits, the computation power is 2 x l floating point operations performed per second, where l is a vectorization length.
6. The forward transform unit of claim 1, further comprising a register file to fetch the neuron data from the input buffer and store to a specific address to become an add operand, the adder bank to read the add operand from the specific address for an add operation in a forward transform.
7. The forward transform unit of claim 1, wherein the input buffer and the output buffer are first-in-first-out queue buffers.
8. An inverse transform unit performing Winograd convolution, comprising:
an input buffer for receiving and temporarily storing bit-multiplied data based on an input bandwidth;
2 adder groups for performing addition operations on the bit-multiplied data to generate a convolution result; and
an output buffer for temporarily storing and outputting the convolution result based on an output bandwidth;
wherein the input bandwidth and the output bandwidth are the same, and the computational power of the addition operation is twice of the input bandwidth and the output bandwidth.
9. The inverse transform unit of claim 8, wherein each group of adders includes 16 adders to add in a particular order in the channel direction of the bit-multiplied data.
10. The inverse transform unit of claim 9, wherein the specific order is an addition of first calculating a left-times matrix of a Winograd convolution and then calculating a right-times matrix of the Winograd convolution.
11. The inverse transform unit of claim 9, wherein each adder is an FP32 adder.
12. The inverse transform unit of claim 11, wherein the input bandwidth and the output bandwidth are l x 32 bits, the computation power is 2 x l floating point operations performed per second, where l is a vectorization length.
13. The inverse transform unit of claim 8, further comprising a register file for fetching the bit-multiplied data from the input buffer and storing it at a specific address to become an addition operand, the adder groups reading the addition operand from the specific address for the addition operation in the inverse transform.
14. The inverse transform unit of claim 8, wherein the input buffer and the output buffer are first-in-first-out queue buffers.
15. An integrated circuit device comprising a forward transform unit according to any one of claims 1 to 7 and an inverse transform unit according to any one of claims 8 to 14.
16. A board card comprising the integrated circuit device of claim 15.
CN202110266331.XA 2021-03-11 2021-03-11 Conversion unit for executing Winograd convolution, integrated circuit device and board card Pending CN115081600A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110266331.XA CN115081600A (en) 2021-03-11 2021-03-11 Conversion unit for executing Winograd convolution, integrated circuit device and board card

Publications (1)

Publication Number Publication Date
CN115081600A true CN115081600A (en) 2022-09-20

Family

ID=83240660



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116401502A (en) * 2023-06-09 2023-07-07 之江实验室 Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN116401502B (en) * 2023-06-09 2023-11-03 之江实验室 Method and device for optimizing Winograd convolution based on NUMA system characteristics


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination