CN115079927A - Temporary storage of convolution results, computing device, integrated circuit device and board card - Google Patents


Info

Publication number
CN115079927A
Authority
CN
China
Prior art keywords
data
convolution
winograd
input
output
Prior art date
Legal status
Pending
Application number
CN202110265208.6A
Other languages
Chinese (zh)
Inventor
Not disclosed (不公告发明人)
Current Assignee
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd
Priority to CN202110265208.6A
Publication of CN115079927A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G06F3/0613 Improving I/O performance in relation to throughput
    • G06F3/0625 Power saving in storage systems

Abstract

The invention relates to a cache for temporarily storing the convolution results of a Winograd convolution, and to a computing device, an integrated circuit device and a board card using it. The output bandwidth of the cache is w bytes; the cache comprises 4 storage arrays, and each storage array comprises 4 × d storage units of 4 × w bits. The invention has the technical effects of ensuring network precision, accelerating performance, reducing area and reducing power consumption.

Description

Temporary storage of convolution results, computing device, integrated circuit device and board card
Technical Field
The present invention relates generally to the field of neural networks. More particularly, the present invention relates to a buffer, a computing device, an integrated circuit device, and a board for temporarily storing convolution results of Winograd convolution.
Background
With the rapid development of the information age, research in artificial intelligence and machine learning has flourished and the related industries have developed vigorously. Convolutional neural networks are widely used in computer vision, autonomous driving, machine translation, speech recognition, smart home and other fields.
Convolutional neural networks involve large numbers of parameters and operations, so the execution performance of a convolutional neural network model is severely limited by the restricted area and computing power of portable mobile terminals; at the same time, a processor not specially designed for convolution incurs a huge power-consumption overhead when performing convolution operations.
Winograd convolution is a convolution acceleration technique based on a polynomial interpolation algorithm. The two inputs of the convolution operation, the neurons and the weights, are partitioned at a certain scale and each subjected to a linear transformation (the Winograd forward transform); the transformed neurons and weights are then multiplied element-wise, the element-wise product is linearly transformed again (the Winograd inverse transform), and the result is a convolution output equivalent to that of the original convolution operation.
In Winograd convolution, the forward and inverse transform matrices of the neurons and weights consist only of simple fixed values, so the forward and inverse transforms of the neurons and weights can be realized with additions alone. The multiplications required by the Winograd algorithm occur only in the element-wise multiplication step, whose multiplication complexity is considerably lower than that of the original convolution algorithm. Because the hardware cost (timing, power consumption and area) of a multiplication is much higher than that of an addition of the same bit width, replacing the original convolution with Winograd convolution brings obvious benefits in hardware energy efficiency and operation time.
However, no hardware has so far been designed specifically for the Winograd convolution acceleration algorithm, so existing artificial intelligence chips cannot fully exhibit the advantages of the Winograd convolution operation. A hardware device capable of efficiently running the Winograd convolution algorithm is therefore urgently needed.
Disclosure of Invention
In order to at least partially solve the technical problems mentioned in the background art, the invention provides a buffer for temporarily storing convolution results of Winograd convolution, a computing device, an integrated circuit device and a board card.
In one aspect, the present invention discloses a buffer for temporarily storing convolution results of Winograd convolution, wherein an output bandwidth of the buffer is w bytes, the buffer includes 4 memory arrays, and each memory array includes 4 × d memory cells with 4 × w bits.
In another aspect, the present invention discloses a computing device, which includes the aforementioned cache.
In another aspect, the present invention discloses an integrated circuit device including the aforementioned computing device, and a board including the integrated circuit device according to the aforementioned description.
The hardware structure provided by the invention can be matched with a Winograd convolution acceleration algorithm, and has the technical effects of ensuring network precision, accelerating performance, reducing area and reducing power consumption.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are illustrated by way of example and not by way of limitation, and like reference numerals designate like or corresponding parts throughout the several views, in which:
FIG. 1 is a schematic diagram illustrating a convolution kernel performing a convolution operation with an input neuron image;
FIG. 2 is a diagram showing the conversion of a raw convolution of F (2 × 2,3 × 3) to a Winograd convolution;
FIG. 3 is a visualization diagram illustrating a multiply-by-bit operation;
FIG. 4 is a diagram illustrating the homogenous operation of forward transformed data with weights;
fig. 5 is a structural diagram showing a board card of the embodiment of the present invention;
FIG. 6 is a block diagram illustrating an integrated circuit device of an embodiment of the invention;
FIG. 7 is a schematic diagram showing the internal structure of a computing device of an embodiment of the invention;
FIG. 8 is a diagram showing an overlapping portion when a transform is being performed;
FIG. 9 is a schematic diagram showing a neuron cache of an embodiment of the present invention;
FIG. 10 is a schematic diagram showing a forward transform unit of an embodiment of the present invention;
FIG. 11 is a diagram illustrating weight caching according to an embodiment of the invention;
FIG. 12 is a schematic diagram showing the forward transform data buffer output side of an embodiment of the present invention;
fig. 13 is a schematic diagram showing an inverse transform unit of an embodiment of the present invention; and
fig. 14 is a schematic diagram illustrating a connection relationship of result caches according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third" and "fourth", etc. in the claims, the description and the drawings of the present invention are used for distinguishing different objects and are not used for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the invention. As used in the specification and claims of this application, the singular form of "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this specification refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
The following detailed description of embodiments of the invention refers to the accompanying drawings.
The Winograd convolution acceleration algorithm (hereinafter the Winograd algorithm or Winograd convolution) applies linear transformations to the operands of a convolution operation so as to find a transformed form that requires the fewest multiplications, replacing the eliminated multiplications with some additional additions. In hardware terms, a multiplier is structurally more complex than an adder, with larger area and power consumption and poorer overall processing performance, so a Winograd algorithm that in effect replaces multiplication with addition has great advantages when processing convolution operations.
For a two-dimensional convolution, assume that the size of the input neuron image is H × W (H is its height and W its width) and the size of the weight is r × s (r its height and s its width); the convolution can then be expressed as F(m × n, r × s), where m × n is the size of the output neuron image, m its height and n its width. To reduce hardware complexity, improve generality and still achieve a good acceleration effect, the embodiment of the present invention takes convolution kernels (i.e., weights) no larger than 3 × 3 as base convolution units and combines them to perform Winograd convolution at any scale with a convolution stride of 1. In the embodiment of the present invention, an arbitrary F(m × n, r × s) is decomposed into combinations of 5 kinds of base convolution with operation scales 3 × 3, 3 × 2 (or 2 × 3), 3 × 1, 2 × 2 and 2 × 1. More specifically, an arbitrary F(m × n, r × s) is decomposed into a combination of the base convolution calculations F(2 × 2, 3 × 3), F(2 × 2, 3 × 2), F(2 × 2, 2 × 3), F(2 × 2, 3 × 1), F(2 × 2, 2 × 2) and F(2 × 2, 2 × 1). It should be noted that, since a 1 × 1 convolution cannot be accelerated by Winograd convolution, the 1 × 1 scale is not one of the base convolution units set in the embodiment of the present invention.
Taking F (2 × 2,5 × 5) with the input neuron image size of 6 × 6 and the step size of 1 as an example, before using the computing apparatus according to the embodiment of the present invention to perform Winograd convolution acceleration operation, the input neuron image of 6 × 6 and the convolution kernel of 5 × 5 need to be linearly split based on the base convolution unit, and the splitting process is shown in fig. 1.
Fig. 1 shows the convolution of a 5 × 5 convolution kernel 101 with a 6 × 6 input neuron image 102 to obtain a 2 × 2 convolution result 103. The convolution kernel 101 needs to be split into the scales 3 × 3, 3 × 2 (or 2 × 3), 3 × 1, 2 × 2 and 2 × 1; this embodiment preferably splits in the priority order 3 × 3 first, then 3 × 2 (or 2 × 3), then 3 × 1, then 2 × 2, and finally 2 × 1. According to this rule, the convolution kernel 101 is split into 4 base convolution kernels: a 3 × 3 first base convolution kernel 104, a 3 × 2 second base convolution kernel 105, a 2 × 3 third base convolution kernel 106 and a 2 × 2 fourth base convolution kernel 107; that is, F(2 × 2, 5 × 5) is decomposed into one F(2 × 2, 3 × 3), one F(2 × 2, 3 × 2), one F(2 × 2, 2 × 3) and one F(2 × 2, 2 × 2). The input neuron image 102 is correspondingly split into 4 pieces of sub-neuron data: 4 × 4 first sub-neuron data 108, 4 × 3 second sub-neuron data 109, 3 × 4 third sub-neuron data 110 and 3 × 3 fourth sub-neuron data 111.
Then Winograd convolution operation is carried out, namely: the first base convolution kernel 104 convolves with the first sub-neuron data 108 to generate a first sub-convolution result 112; convolving the second basis convolution kernel 105 with the second sub-neuron data 109 to generate a second sub-convolution result 113; convolving the third base convolution kernel 106 with the third sub-neuron data 110 to generate a third sub-convolution result 114; the fourth base convolution kernel 107 is convolved with the fourth sub-neuron data 111 to generate a fourth sub-convolution result 115.
Finally, the first sub-convolution result 112, the second sub-convolution result 113, the third sub-convolution result 114 and the fourth sub-convolution result 115 are added to obtain a convolution result 116, and the convolution result 116 is the same as the convolution result 103. The above is an example of using the Winograd convolution algorithm to implement the original convolution operation.
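As a concrete illustration of this split (not part of the patent text), the following Python/NumPy sketch builds the four base convolutions of fig. 1 from a random 6 × 6 input and 5 × 5 kernel and checks that the sum of the partial results reproduces the original convolution result; the helper names are invented for the example.

```python
import numpy as np

def conv2d_valid(x, k):
    # Plain valid-mode 2D convolution (CNN-style correlation, stride 1).
    H, W = x.shape
    r, s = k.shape
    out = np.zeros((H - r + 1, W - s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+r, j:j+s] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))   # input neuron image 102
k = rng.standard_normal((5, 5))   # original 5x5 convolution kernel 101

full = conv2d_valid(x, k)         # reference F(2x2, 5x5) result

# Split the kernel into base kernels (3x3, 3x2, 2x3, 2x2) and the input
# into the matching sub-neuron blocks, then sum the partial results.
parts = [
    (x[0:4, 0:4], k[0:3, 0:3]),   # F(2x2, 3x3)
    (x[0:4, 3:6], k[0:3, 3:5]),   # F(2x2, 3x2)
    (x[3:6, 0:4], k[3:5, 0:3]),   # F(2x2, 2x3)
    (x[3:6, 3:6], k[3:5, 3:5]),   # F(2x2, 2x2)
]
combined = sum(conv2d_valid(xs, ks) for xs, ks in parts)
assert np.allclose(full, combined)
```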
Further, the Winograd algorithm can be represented by the following equation:
Y = A^T[(GgG^T) ⊙ (B^T dB)]A
where Y denotes the output matrix of the convolution operation, A^T is the left-multiplication constant matrix of the inverse transform, G is the left-multiplication constant matrix of the weight transform, g is the weight of the original convolution, G^T is the right-multiplication constant matrix of the weight transform, ⊙ denotes element-wise (bit-wise) multiplication, B^T is the left-multiplication constant matrix of the neuron transform, d is the neuron data, B is the right-multiplication constant matrix of the neuron transform, and A is the right-multiplication constant matrix of the inverse transform. The left- and right-multiplication matrices of each transform are simply transposes of each other.
Taking F (2 × 2,3 × 3) as an example, the constant matrices are as follows:
$$B^T=\begin{bmatrix}1&0&-1&0\\0&1&1&0\\0&-1&1&0\\0&1&0&-1\end{bmatrix},\qquad G=\begin{bmatrix}1&0&0\\\tfrac{1}{2}&\tfrac{1}{2}&\tfrac{1}{2}\\\tfrac{1}{2}&-\tfrac{1}{2}&\tfrac{1}{2}\\0&0&1\end{bmatrix},\qquad A^T=\begin{bmatrix}1&1&1&0\\0&1&-1&-1\end{bmatrix}$$
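A minimal NumPy sketch (illustrative only; it assumes the standard F(2 × 2, 3 × 3) constant matrices reproduced above) that evaluates Y = A^T[(GgG^T) ⊙ (B^T dB)]A for a random 4 × 4 neuron tile and 3 × 3 weight and compares it against the direct convolution:

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd constant matrices (as reproduced above).
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=np.float64)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=np.float64)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))   # neuron tile
g = rng.standard_normal((3, 3))   # weight

U = G @ g @ G.T                   # Winograd weight, G g G^T
V = B_T @ d @ B_T.T               # forward-transformed neurons, B^T d B
Y = A_T @ (U * V) @ A_T.T         # inverse transform of the element-wise product

direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(Y, direct)
```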
fig. 2 shows a schematic diagram of the conversion of the original convolution of F (2 × 2,3 × 3) into a Winograd convolution. As shown, neuron data 201 is convolved with convolution kernel 202. During calculation, the neuron data 201 is arranged in a row according to elements in the sliding window 203, the sliding window 203 slides for 4 times to form a4 × 9 matrix 204, then the elements of the convolution kernel 202 are arranged in a column to form a 9 × 1 matrix 205, and the 4 × 9 matrix 204 and the 9 × 1 matrix 205 are subjected to convolution operation to obtain a4 × 1 convolution result 206.
Further, partitioning along the dotted lines converts the 4 × 9 matrix 204 into a 2 × 3 block matrix 207, the 9 × 1 matrix 205 into a 3 × 1 block matrix 208, and the 4 × 1 convolution result 206 into a 2 × 1 block result 209. After the linear transformation, the first element of the 2 × 1 convolution result 209 is R_0 = M_0 + M_1 + M_2 and the second is R_1 = M_1 - M_2 - M_3, where M_0, M_1, M_2 and M_3 follow the standard Winograd F(2,3) decomposition. Writing the block matrix 207 as [[D_0, D_1, D_2], [D_1, D_2, D_3]] (its four distinct 2 × 3 sub-blocks) and matrix 208 as [W_0; W_1; W_2], the sub-formulas are:
$$M_0=(D_0-D_2)W_0,\quad M_1=(D_1+D_2)\frac{W_0+W_1+W_2}{2},\quad M_2=(D_2-D_1)\frac{W_0-W_1+W_2}{2},\quad M_3=(D_1-D_3)W_2$$
by the segmentation and linear transformation, the original convolution operation involves 36 multiplications, while the Winograd algorithm only needs to execute 16 multiplications, so that the computational complexity of the multiplications is reduced by 2.25 times.
As the Winograd conversion of the two-dimensional convolution shows, the Winograd algorithm consists mainly of the following steps. First, the weights are left- and right-multiplied by the weight constant matrices, i.e., GgG^T, to obtain the linearly transformed weights, called the Winograd weights. Next, the neuron data undergo the forward transform, i.e., left- and right-multiplication by the neuron constant matrices, B^T dB, giving the forward-transformed data. Then, the forward-transformed data and the Winograd weight matrix are multiplied element-wise, (GgG^T) ⊙ (B^T dB), giving the bit-multiplied data. Finally, the bit-multiplied data undergo the inverse transform, i.e., left- and right-multiplication by the Winograd inverse-transform constant matrices, A^T L A with L = [(GgG^T) ⊙ (B^T dB)], which yields a convolution result equivalent to the original convolution.
From the perspective of hardware design, the embodiment of the present invention performs pipeline design on the three large transformation steps according to the dependency and operation distinguishing characteristics among the three processes, so as to achieve more efficient acceleration performance. The following will be separately described for the design of the forward transform operation, the multiply-by-bit operation, and the inverse transform operation.
Embodiments of the present invention use a forward transform unit to implement the forward transform operation, i.e., to compute B^T dB. According to the rules of Winograd convolution, the forward-transform left-multiplication matrix B^T has size (m + r - 1) × (m + r - 1) and the right-multiplication matrix B has size (n + s - 1) × (n + s - 1). Because the elements of the left-multiplication matrix B^T and of the right-multiplication matrix B consist only of 0, 1 and -1, the matrix multiplications of the forward transform can be decomposed into fixed-pattern addition operations, and the forward transform unit of the computing device of the embodiments accordingly configures a particular number of floating-point adders to complete the linear additions required for the entire matrix multiplication. Since the embodiments of the present invention convert any original convolution into base convolutions, the scale of the forward transform unit is determined by the 5 types of base convolution described above; the following analysis therefore considers FP32 data for these 5 types of base convolution, taking a 2 × 2 convolution result as an example (i.e., m = n = 2).
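The fixed-pattern addition property can be checked symbolically. The sketch below (an illustration under the assumption that B^T is the standard F(2 × 2, 3 × 3) matrix shown earlier; it is not the patent's hardware) expands B^T dB for a 4 × 4 tile and confirms that every entry is a signed sum of input elements, so only adders are needed:

```python
import sympy as sp

# Forward-transform left-multiplication matrix B^T for F(2x2, 3x3);
# its elements are only 0, 1 and -1 (assumed standard form, see above).
Bt = sp.Matrix([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]])
d = sp.Matrix(4, 4, lambda i, j: sp.Symbol(f"d{i}{j}"))

V = (Bt * d * Bt.T).applyfunc(sp.expand)   # B^T d B, expanded element-wise

# Every entry is a signed sum of d_ij terms: all coefficients are +1 or -1,
# so the forward transform reduces to a fixed pattern of additions.
coeffs = {c for entry in V for c in entry.as_coefficients_dict().values()}
print(V[0, 0])   # d00 - d02 - d20 + d22
print(coeffs)    # {1, -1}
```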
Taking the 3 × 3 base convolution as an example, its forward transform B^T dB (with B^T and B both of size 4 × 4) expands into a fixed pattern of additions. Based on that expansion, the forward-transform computing-power requirement of the forward transform unit corresponds directly to the number of additions, which is 4 × (n + s - 1) + 4 × (m + r - 1) = 32 flops (floating-point operations); the input and output quantities of the forward transform unit are: both the input data and the output data amount to (r + 1)(s + 1) × 32 = 16 × 32 bits (the factor of 32 bits arises because the data are FP32, i.e., 32-bit values). The hardware utilization of the forward transform unit is best when its input/output time matches its computation time, so the ratio of the input/output bandwidth of the forward transform unit to the addition computing power is preferably 16:32 = 1:2. In other words, when the cache bandwidth (or vectorization length) is l, the input bandwidth and the output bandwidth of the forward transform unit are l × 32 bits and the computing power of its adder group is 2 × l flops. Each operation generates 16 final results and, considering that 8 intermediate results are also produced during the operation, the minimum number of registers in the register file is l × 32 × (16 + 8).
Taking the 3 × 2 base convolution as an example, its forward transform B^T dB likewise expands into a fixed pattern of additions. The forward-transform computing-power requirement of the forward transform unit is then 4 × (n + s - 1) + 2 × (m + r - 1) = 20 flops, and both the input data and the output data amount to (r + 1)(s + 1) × 32 = 12 × 32 bits. To raise the hardware utilization of the forward transform unit, the ratio of its input/output bandwidth to the addition computing power is preferably 12:20 = 3:5, i.e., the input bandwidth and output bandwidth are l × 32 bits and the computing power of the adder group is (5/3) × l flops. Each calculation yields 12 final results and 6 intermediate results, so with maximum pipelined use of the register file the minimum number of registers is l × 32 × (12 + 6).
Taking the 2 × 2 base convolution as an example, its forward transform B^T dB likewise expands into a fixed pattern of additions. The forward-transform computing-power requirement of the forward transform unit is 2 × (n + s - 1) + 2 × (m + r - 1) = 12 flops, and both the input data and the output data amount to (r + 1)(s + 1) × 32 = 9 × 32 bits, so the ratio of the input/output bandwidth of the forward transform unit to the addition computing power is preferably 9:12 = 3:4, i.e., the input bandwidth and output bandwidth are l × 32 bits and the computing power of the adder group is (4/3) × l flops. Each calculation yields 9 final results and 6 intermediate results, and with maximum pipelined use of the register file the minimum number of registers is l × 32 × (9 + 6).
Taking the 3 × 1 base convolution as an example, its forward transform likewise expands into a fixed pattern of additions. The forward-transform computing-power requirement of the forward transform unit is 4 flops, and both the input data and the output data amount to (r + 1) × 32 = 4 × 32 bits. The ratio of the input/output bandwidth of the forward transform unit to the addition computing power is therefore preferably 4:4 = 1:1, i.e., the input bandwidth and output bandwidth are l × 32 bits and the computing power of the adder group is l flops. Each calculation yields 4 final results and 2 intermediate results, and with maximum pipelined use of the register file the minimum number of registers is l × 32 × (4 + 2).
Taking the 2 × 1 base convolution as an example, its forward transform likewise expands into a fixed pattern of additions. The forward-transform computing-power requirement of the forward transform unit is 2 flops, and both the input data and the output data amount to (r + 1) × 32 = 3 × 32 bits, so the ratio of the input/output bandwidth of the forward transform unit to the addition computing power is preferably 3:2, i.e., the input bandwidth and output bandwidth are l × 32 bits and the computing power of the adder group is (2/3) × l flops. Each calculation yields 3 final results and 1 intermediate result, and with maximum pipelined use of the register file the minimum number of registers is l × 32 × (3 + 1).
To satisfy and support all 5 of the aforementioned base convolution operations at the same time, the embodiment of the present invention makes the input bandwidth and output bandwidth of the forward transform unit identical and sets the addition computing power to twice that bandwidth (counted in 32-bit words): the input bandwidth and output bandwidth are both l × 32 bits, the computing power of the adder group is 2 × l flops, and the number of registers in the register file is l × 32 × (16 + 8).
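The ratios quoted above can be tabulated directly. The following sketch (illustrative Python; the per-case addition counts and word counts are taken from the preceding paragraphs) computes the adder computing power each base convolution would demand for a given vectorization length l and confirms that 2 × l flops covers all five cases:

```python
# (additions per tile, input/output words per tile) for the forward transform,
# as given in the text for each base convolution (m = n = 2, FP32 data).
FWD = {
    "3x3": (32, 16),
    "3x2": (20, 12),
    "2x2": (12, 9),
    "3x1": (4, 4),
    "2x1": (2, 3),
}

def fwd_adder_flops(l: int) -> dict:
    # With input/output bandwidth of l 32-bit words per cycle, the adder group
    # needs flops = l * additions / words so transfer and compute take equal time.
    return {k: l * adds / words for k, (adds, words) in FWD.items()}

l = 16
need = fwd_adder_flops(l)
print(need)                          # 3x3 -> 32.0, 3x2 -> 26.7, 2x2 -> 21.3, 3x1 -> 16.0, 2x1 -> 10.7
print(max(need.values()) <= 2 * l)   # True: 2*l flops supports all five cases
```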
Next, consider the bit-wise multiply-accumulate operator. Based on overall considerations of hardware design, scheduling strategy and execution performance, this embodiment merges the bit-wise multiplication with the accumulation along the feature-map (channel) direction of the convolution neuron data in the same multiply-accumulate operator, which not only effectively reduces the overall complexity and resource consumption of the hardware design, but also reduces the amount of on-chip cache access, saving power consumption and area while improving performance.
Assume that the parameters of the convolutional layer are: input batch number N, input neuron channel count Ci, input neuron height Hi, input neuron width Wi, output neuron channel count Co, output neuron height Ho, output neuron width Wo, convolution kernel size r × s and stride 1. Since this embodiment supports F(2 × 2, r × s) operations, Ho = Hi - r + 1 and Wo = Wi - s + 1, and the number of Winograd operation units is T, the number of 2 × 2 output tiles (slices) along the HW direction (on the order of (Ho/2) × (Wo/2)).
Since the on-chip cache capacity is limited, the computing device of this embodiment performs its calculation with a single batch (N = 1). The scale of the input neuron data fed to the computing device is therefore [1 Ci Hi Wi], the scale of the forward-transformed data is [1 Ci T (r + 1) × (s + 1)], the scale of the original weights is [Co Ci r s], and the scale of the Winograd weights is [1 Co Ci (r + 1) × (s + 1)].
Fig. 3 shows a visual schematic of the bit-wise multiplication described above. Since N = 1, each of these data can be represented in three dimensions. The scale of the forward-transformed data 301 is [Ci T (r + 1) × (s + 1)], its three dimensions being Ci, T (the number of HW slices) and (r + 1) × (s + 1); the scale of the Winograd weight 302 is [Co Ci (r + 1) × (s + 1)], its three dimensions being Co, Ci and (r + 1) × (s + 1). The bit-wise multiplication cross-combines the HW slices with the Co weights, multiplying element-wise and accumulating along the Ci direction, to obtain the bit-multiplied data 303 of scale [Co T (r + 1) × (s + 1)], whose three dimensions are Co, T and (r + 1) × (s + 1).
In more detail, the forward-transformed data 301 comprises T data units of [Ci (r + 1) × (s + 1)] and the Winograd weight 302 comprises Co data units of [Ci (r + 1) × (s + 1)]; multiplying one data unit element-wise by one weight unit gives an intermediate result of [Ci (r + 1) × (s + 1)], which is then accumulated along the Ci direction. This process is the same as a matrix multiplication, so it can be merged into a matrix multiplication operation, using hardware resources more effectively and reducing the register consumption for intermediate storage.
Since the forward-transformed data 301 comprises T data units of [Ci (r + 1) × (s + 1)] and the Winograd weights 302 comprise Co data units of [Ci (r + 1) × (s + 1)], every data unit of the forward-transformed data 301 has to be multiplied with every data unit of the Winograd weights 302. As shown in fig. 4, during the bit-wise multiplication one data unit 401 of the forward-transformed data 301 is combined with all Co weight data units, i.e., the Co direction is taken as the direction of parallel computation, producing an intermediate result 402. The next data unit is then taken from the forward-transformed data 301 and combined with the Co weight data units to generate the next intermediate result, and so on until all T data units have been processed, yielding the bit-multiplied data 303.
When the data units described above are bit-multiplied and accumulated along the feature-map direction, the required amount of computation is (Ci + Ci - 1) × (r + 1) × (s + 1) flops. Because Ci is often very large, it is impractical to feed the bit-wise multiply-accumulate operator with the whole of Ci at once, so this embodiment further splits Ci and performs the multiply-accumulate in units of the vectorization length l, splits the other dimension of (r + 1) × (s + 1) into (r + 1) × (s + 1) beats completed in sequence, and finally adds all results along the Ci direction to obtain the final result.
Since the output bandwidth of the forward transform unit is l × 32 bits, in order to keep the overall pipeline time from the forward transform unit to the bit-wise multiply-accumulate unit the same, this embodiment sets the computing power of each bit-wise multiply-accumulate operator to l + (l - 1) flops, comprising l multiplications and l - 1 additions. If the multiply-accumulate unit has ω parallel dimensions, i.e., comprises ω operators working simultaneously, its computing power is ω × (l + (l - 1)) flops, a function of ω and l.
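Functionally, this bit-wise multiply-accumulate is a batched contraction over Ci for each of the (r + 1) × (s + 1) transform positions, with Co as the parallel direction. A minimal NumPy sketch (illustrative only; the array names, sizes and the tiling into chunks of the vectorization length l are assumptions for the example) is:

```python
import numpy as np

Ci, Co, T, P = 64, 32, 8, 16      # P = (r + 1) * (s + 1), e.g. 4 x 4 for a 3x3 base kernel
l = 16                            # vectorization length

rng = np.random.default_rng(0)
fwd = rng.standard_normal((T, Ci, P))      # forward-transformed neuron data
wgt = rng.standard_normal((Co, Ci, P))     # Winograd weights

# Reference: element-wise multiply along P, accumulate along Ci,
# with Co as the parallel ("homogeneous") direction.
ref = np.einsum("tcp,ocp->otp", fwd, wgt)  # scale [Co, T, P]

# Tiled version: Ci is processed l channels at a time and the P positions
# are handled beat by beat, partial sums accumulating along Ci.
out = np.zeros((Co, T, P))
for c0 in range(0, Ci, l):                 # split Ci into chunks of length l
    for p in range(P):                     # one "beat" per transform position
        out[:, :, p] += wgt[:, c0:c0+l, p] @ fwd[:, c0:c0+l, p].T
assert np.allclose(ref, out)
```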
This embodiment further provides an inverse transform unit for performing the inverse transform operation: based on the inverse-transform left-multiplication matrix A^T of size 2 × (m + r - 1) and the right-multiplication matrix A of size (n + s - 1) × 2, it carries out the calculation A^T L A, where L = (GgG^T) ⊙ (B^T dB). Because the elements of the inverse-transform left-multiplication matrix A^T and of the right-multiplication matrix A also consist only of 0, 1 and -1, the matrix multiplications of the inverse transform can likewise be decomposed into fixed-pattern additions. The adder group of the inverse transform unit accordingly configures a specific number of floating-point adders to complete the linear additions required for the entire matrix multiplication. The scale of the inverse transform unit is again determined below from the 5 kinds of base convolution.
Taking the 3 × 3 base convolution as an example, its inverse transform A^T L A likewise expands into a fixed pattern of additions. The inverse-transform computing power of the inverse transform unit (ITU) 715 is 24 flops, the input bandwidth is (r + 1)(s + 1) × 32 = 16 × 32 bits, and the output bandwidth is (s + 1) × 32 = 4 × 32 bits. As before, the hardware utilization of the inverse transform unit is best when its input bandwidth matches its computing power, so the ratio of input bandwidth to addition computing power is preferably 16:24 = 2:3, i.e., the input bandwidth is l × 32 bits and the computing power of the adder group is (3/2) × l flops. Each calculation produces 16 final results and no intermediate results, and with maximum pipelined use of the register file the minimum number of registers is l × 32 × 16.
Taking the 3 × 2 base convolution as an example, its inverse transform A^T L A likewise expands into a fixed pattern of additions. The inverse-transform computing power of the inverse transform unit is 16 flops, the input bandwidth is 12 × 32 bits and the output bandwidth is 4 × 32 bits, so the ratio of input bandwidth to addition computing power is preferably 12:16 = 3:4, i.e., the input bandwidth is l × 32 bits and the computing power of the adder group is (4/3) × l flops. Each calculation produces 12 final results and no intermediate results, and with maximum pipelined use of the register file the minimum number of registers is l × 32 × 12.
Taking the 2 × 2 base convolution as an example, its inverse transform A^T L A likewise expands into a fixed pattern of additions. The inverse-transform computing power of the inverse transform unit is 10 flops, the input bandwidth is 9 × 32 bits and the output bandwidth is 4 × 32 bits, so the ratio of input bandwidth to addition computing power is preferably 9:10, i.e., the input bandwidth is l × 32 bits and the computing power of the adder group is (10/9) × l flops. Each calculation produces 9 final results and no intermediate results, and with maximum pipelined use of the register file the minimum number of registers is l × 32 × 9.
Taking the 3 × 1 base convolution as an example, its inverse transform likewise expands into a fixed pattern of additions. The inverse-transform computing power of the inverse transform unit is 4 flops, the input bandwidth is 4 × 32 bits and the output bandwidth is 2 × 32 bits, so the ratio of input bandwidth to addition computing power is preferably 4:4 = 1:1, i.e., the input bandwidth is l × 32 bits and the computing power of the adder group is l flops. Each calculation produces 4 final results and 2 intermediate results, and with maximum pipelined use of the register file the minimum number of registers is l × 32 × (4 + 2).
Taking the 2 × 1 base convolution as an example, its inverse transform likewise expands into a fixed pattern of additions. The inverse-transform computing power of the inverse transform unit is 2 flops, the input bandwidth is 3 × 32 bits and the output bandwidth is 3 × 32 bits, so the ratio of input bandwidth to addition computing power is preferably 3:2, i.e., the input bandwidth is l × 32 bits and the computing power of the adder group is (2/3) × l flops. Each calculation produces 3 final results and 1 intermediate result, and with maximum pipelined use of the register file the minimum number of registers is l × 32 × (3 + 1).
To satisfy and support all 5 kinds of base convolution at the same time, the addition computing power of the inverse transform unit could be set to 3/2 of the input bandwidth (counted in 32-bit words), i.e., when the input bandwidth is l × 32 bits, the computing power of the adder group would be (3/2) × l flops. However, to keep the hardware design relatively simple, this embodiment further makes the hardware configuration of the forward transform unit and the inverse transform unit identical. On the premise of meeting the requirements of both units, the inverse transform unit adopts the design of the forward transform unit: the input bandwidth and the output bandwidth are the same, and the addition computing power is twice that bandwidth. In other words, the input bandwidth of the inverse transform unit is l × 32 bits, the output bandwidth is also l × 32 bits, and the computing power of the adder group is 2 × l flops.
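The same tabulation as for the forward transform shows why reusing that design is safe. The sketch below (illustrative; the addition and word counts are taken from the preceding paragraphs) computes the per-case adder demand of the inverse transform and checks it against 2 × l flops:

```python
# (additions per tile, input words per tile) for the inverse transform,
# as given in the text for each base convolution.
INV = {
    "3x3": (24, 16),
    "3x2": (16, 12),
    "2x2": (10, 9),
    "3x1": (4, 4),
    "2x1": (2, 3),
}

l = 16
need = {k: l * adds / words for k, (adds, words) in INV.items()}
print(need)                          # worst case is the 3x3 base convolution: 24 = 1.5 * l
print(max(need.values()) <= 2 * l)   # True: reusing the forward-transform design (2*l) is sufficient
```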
In summary, the bandwidths and computing powers of the 3 core modules performing the Winograd convolution operation in this embodiment (the forward transform unit, the bit-wise multiply-accumulate operator and the inverse transform unit) are all matched: the input bandwidths of the 3 core modules are all set to l × 32 bits, the output bandwidths are also all set to l × 32 bits, the computing power of the forward transform unit is 2 × l flops, the computing power of the bit-wise multiply-accumulate operator is ω × (l + (l - 1)) flops, and the computing power of the inverse transform unit is 2 × l flops.
As can be seen from the foregoing, the Winograd convolution operation is directly tied to the vectorization length parameter l. The vectorization length l is the minimum processing length and governs how much the computing device of this embodiment can reuse the transformed neurons: the larger l is, the higher the reuse rate, while the required access volume, operation count, power consumption and average hardware area decrease proportionally. However, the parameters of the convolutional layers of a neural network change with the network model, and as the vectorization length l grows, whenever the channel count of part of a network model is smaller than l, computing power is wasted, which hurts the acceleration effect and adds area and power overhead. When determining the vectorization length l, these two factors therefore have to be traded off to arrive at the most suitable configuration of the vectorization length parameter.
Based on empirical values, weights were assigned in this embodiment to the main hardware components (such as the FP32 adders, the bit-wise multiplication units and the registers) to obtain their computing-power and resource-overhead functions; it was found that when l is greater than 16, the utilization of hardware resources can be kept at a high level. Taking into account the input and output channel counts of currently common neural network models (such as LeNet, VGG16, VGG19 and AlexNet) and computing the resulting loss of computing power, the overall computing-power loss was found to rise sharply once l exceeds 64. From these two quantitative analyses, the computing device of this embodiment performs better when the vectorization length parameter l is between 16 and 64. If further generality is desired to cover possible future network architectures and parameters, this embodiment preferably selects l = 16.
Fig. 5 shows a schematic structural diagram of the foregoing embodiment in the form of a board card. As shown in fig. 5, the board card 50 includes a chip 501, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial-intelligence operation unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing and data mining. Deep learning technology in particular is widely applied in the cloud intelligence field, a notable characteristic of which is the large volume of input data and the high demands placed on the platform's storage and computing capacity.
The chip 501 is connected to an external device 503 through an external interface 502. The external device 503 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred to the chip 501 by the external device 503 through the external interface means 502. The results of the calculations of the chip 501 may be communicated back to the external device 503 via the external interface means 502. The external interface device 502 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The card 50 also includes a memory device 504 for storing data, including one or more memory cells 505. The memory device 504 is connected and data-transferred to the control device 506 and the chip 501 via a bus. The control device 506 in the board 50 is configured to regulate the state of the chip 501. For this purpose, in an application scenario, the control device 506 may include a single chip Microcomputer (MCU).
Fig. 6 is a structural diagram showing a combined processing device in the chip 501 of this embodiment. As shown in fig. 6, the combination processing device 60 includes a computing device 601, an interface device 602, a processing device 603, and a DRAM 604.
The computing device 601 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, especially Winograd convolution operations, which can interact with the processing device 603 through the interface device 602 to collectively perform the user-specified operations.
The interface device 602 is used for transmitting data and control commands between the computing device 601 and the processing device 603. For example, the computing device 601 may obtain input data from the processing device 603 via the interface device 602, and write the input data to an on-chip cache of the computing device 601. Further, the computing device 601 may obtain the control command from the processing device 603 via the interface device 602, and also write the control command into the on-chip cache of the computing device 601. Alternatively or optionally, the interface device 602 may also read data in an on-chip cache of the computing device 601 and transmit to the processing device 603.
The processing device 603 is a general purpose processing device that performs basic control including, but not limited to, data handling, starting and/or stopping of the computing device 601. Depending on the implementation, the processing device 603 may be one or more types of Central Processing Unit (CPU), Graphics Processing Unit (GPU) or other general purpose and/or special purpose processor, including but not limited to a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc., and the number thereof may be determined according to actual needs. As previously mentioned, the computing device 601 of the present invention may be viewed as having a single core structure or an isomorphic multiple core structure. However, when considered collectively, the computing device 601 and the processing device 603 are considered to form a heterogeneous multi-core structure.
The DRAM 604 is an off-chip memory for storing the data to be processed, generally 16 GB or larger, holding the data of the computing device 601 and/or the processing device 603, in particular the neuron data and weights to be used in the Winograd convolution operation. In this embodiment, the processing device 603 has previously linearly transformed the weights of the original convolution into the Winograd weights GgG^T and stored them in the DRAM 604.
Fig. 7 shows a block diagram of the computing device 601. The computing device 601 includes a bus 701, a direct memory access (DMA) module 702, an instruction cache (Iram) 707, a decode unit (IDU) 708, a neuron cache (Nram) 709, a forward transform unit (NTU) 710, a forward-transformed data cache (WNram) 711, a weight cache (Wram) 712, a multiply-accumulate unit (MAC) 713, a bit-multiplied data cache (WRram) 714, an inverse transform unit (ITU) 715, a result cache (Rram) 716, and a logical operation module (ALU, arithmetic logic unit) 717.
The bus 701 is a common communication trunk for transmitting information between the devices, and is a transmission line bundle composed of wires, and the bus 701 is a generic name of a data bus, an address bus, and a control bus for transmitting data, data addresses, and commands, respectively, according to the kind of information transmitted by the combination processing device 60. The bus 701 serves as a communication channel for the DRAM 604 and the computing device 601, which in this embodiment is specifically PCIe.
The DMA module 702 is used to copy data from one address space to another, typically by transferring data between external memory (e.g., DRAM 604) and internal caches of the computing device 601. When the DMA transfer is to be performed, the processing device 603 gives the DMA module 702 the bus control right, and the DMA module 702 controls the bus 701 to transfer data, and after the DMA transfer is completed, the DMA module 702 gives the bus control right back to the processing device 603.
The DMA module 702 includes Neuronal Direct Memory Access (NDMA)703, Weighted Direct Memory Access (WDMA)704, Instruction Direct Memory Access (IDMA)705, and Resultant Direct Memory Access (RDMA) 706. NDMA 703 is used to input neuron data from DRAM 604, WDMA 704 is used to input Winograd weights from DRAM 604, IDMA 705 is used to input commands from DRAM 604, and RDMA 706 is used to output the calculation results to DRAM 604. In other embodiments, NDMA 703, WDMA 704, IDMA 705, and RDMA 706 may be implemented by the same direct memory access.
Iram 707 is used to temporarily store instructions input by IDMA 705, and IDU 708 fetches the instructions from Iram 707 to decode them and controls other units to operate according to the decoded instructions. The IDU 708 is a decoding and scheduling unit of the entire computing device 601, and is responsible for decoding the control instructions obtained from the DRAM 604, converting the control instructions into control signals to coordinate operations of the various modules/units on the chip, and also responsible for performing various tasks such as branch prediction, exception handling, and interrupt handling. In fig. 7, thin line arrows indicate control flows, and thick line arrows indicate data flows.
Since the computing device 601 mainly aims at Winograd convolution calculation, which has no or low general processing capability, it will greatly depend on scheduling and data communication of the processing device 603 during task execution, which results in that input/output communication between the computing device 601 and the processing device 603 is very frequent, and thus, the operation performance of the computing device 601 is greatly limited. To this end, the computing device 601 is provided with a plurality of small-capacity on-chip caches for caching data capable of being temporarily stored in a multiplex manner, such as Nram 709, WNram 711, WRram 712, WRram 714, and the like.
When data on/off-chip is transferred, the neuron data and the Winograd weight are transferred in units of a single batch (N is 1), that is, the data unit of the neuron data is [ Ci Hi Wi ], the data unit of the Winograd weight is [ Co Ci (r +1) × (s +1) ], and the scale of the result obtained after the convolution operation of Winograd is [ Co Ho Wo ]. The former two are input data and the latter is output data, which are the minimum throughput transmitted and calculated in the calculating device 601, and as for the actual data throughput, it needs to be determined according to the size of the on-chip buffer and the operation scheduling flow, which will be further described below.
As can be seen from the characteristics of convolution operation, the convolution operation related to the input data of the above scale can be split in multiple dimensions, for example, in the Ci direction, the HW image direction, or the Co direction, but when Winograd conversion is involved, the minimum operation splitting unit is F (2 × 2, r × s), and the minimum splitting unit in the HW direction is (r +1) × (s + 1). Considering that the base convolution size of the computing device 601 for achieving the Winograd acceleration does not exceed 3 × 3, the embodiment estimates the buffer capacity based on the 3 × 3 base convolution which consumes the most on-chip buffer resources.
According to the rule of Winograd convolution, when forward conversion operation is carried out, vectorization length parameter l is required to be processed in parallel in the Ci direction, when bit-wise multiplication accumulation operation is carried out, operation is required to be carried out in parallel in the Co direction in the unit of l, and when inverse conversion is carried out, operation is required to be carried out in parallel in the Co direction in the unit of l, so that the size of the minimum neuron input data block participating in the operation can be estimated to be [ l (r +1) × (s +1) ]. Since the data block size of the neuron transform result is estimated by the convolution with 3 × 3 basis, the block size of the Winograd weight data to be bit-multiplied and accumulated is [ l l 4 × 4], the block size of the bit-multiplied output data is [ l 4 × 4], and the block size of the inverse transform output result [ l 2 × 2 ].
If the on-chip caches were designed at exactly this scale, all functional requirements could be met; however, considering the design goals of data reuse and low power consumption, this scale is only the minimum input/output storage scale needed to realize the function, and the potential for optimizing the input/output volume of the Winograd convolution operation must be considered further. This embodiment therefore plans the caches as follows.
During the forward transform of the neurons, the operation is based on F(2 × 2, r × s) as the minimum implementation unit with l as the vectorization length; the size of the data block fetched each time is [l 4 4], and the neuron fetch stride is kept at 2. As shown in fig. 8, the four data blocks 802, 803, 804 and 805 generated by the sliding window over the data unit 801 to be transformed share an overlap region 806 amounting to a quarter of a block; as can be seen from the figure, during the forward transform of data unit 801 each of the data blocks 802, 803, 804 and 805 contains one copy of the overlap 806, so the overlap is fetched 4 times. When data are moved by splitting them into minimum data units of [l 4 4], the data throughput required for the overlap portion 806 is therefore quadrupled and redundant data increase. To solve this problem, this embodiment caches data units of a larger set scale in the on-chip cache of the computing device 601, further reducing the input/output volume.
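The roughly fourfold redundancy can be seen by simply counting fetched elements. The sketch below (illustrative Python with an assumed plane size) extracts the 4 × 4 tiles needed when stepping with stride 2 over one HW plane and compares the total volume fetched with the plane size:

```python
import numpy as np

H = W = 64
plane = np.zeros((H, W), dtype=np.float32)   # one HW plane of neuron data

# 4x4 tiles fetched with stride 2, as used for F(2x2, 3x3).
tiles = [plane[i:i + 4, j:j + 4]
         for i in range(0, H - 3, 2)
         for j in range(0, W - 3, 2)]

fetched = sum(t.size for t in tiles)
print(fetched / plane.size)   # ~3.75 here, approaching 4x for large planes
```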
As mentioned above, the convolution operation is performed between neuron data of scale [Ci Hi Wi] and Winograd weights of scale [Co Ci (r + 1) × (s + 1)]. This embodiment keeps as many Winograd weights on-chip as possible, i.e., it temporarily stores as many [l l (r + 1) × (s + 1)] weight blocks on-chip as possible, so that the whole batch of neuron data can be processed with only one weight-loading operation, saving the input/output volume of the weight data.
For the output data: since the convolutional neural network also contains other network-layer operations such as activation, pooling and normalization, the convolution results need to be cached on-chip so that the subsequent network-layer operations can continue; the computing device 601 therefore reserves a cache of fixed capacity for storing convolution results. This part of the data cache can share cache space with the results that finally pass through the various other layer operations, which reduces the data throughput needed for other layer operations to reload the convolution results and to transmit the computed results off-chip.
As can be seen from the above optimization analysis, the buffer capacity of the neuron data should be as large as possible, so as to reduce the total throughput of the neuron data, and since the neuron data is accumulated along the Ci direction, the larger the amount of data stored along the Ci direction is, the more times the neuron data is reloaded and accumulated can be reduced. Furthermore, the buffer space for Winograd weights also needs to be as large as possible. Finally, this embodiment also needs to reserve the corresponding output result space for other layer operations. To sum up, the on-chip cache is mainly divided into three blocks in this embodiment, which are respectively responsible for different functions: nram 709 is responsible for storing neuron data, Wram 712 is responsible for storing Winograd weights, and Rram 716 is responsible for storing convolution results. The computing device 601 further sets 2 buffers responsible for temporarily storing the intermediate results: WNram 711 is responsible for temporarily storing the data after being converted, and WRram 714 is responsible for temporarily storing the data after bit multiplication and accumulation.
Although larger caches for the neuron data, the Winograd weights and the convolution results are always desirable, the cache sizes are closely tied to the configuration of the operator resources; once configured too large, they take away from the computing capability of the computing device 601. The criterion is the balance between input/output bottleneck pressure and computing-power pressure. This embodiment sets the size of Nram 709 to α × β × [l 4 4], where α is the directional coefficient along Ci and β is the directional coefficient along HW; the size of Wram 712 is set to α × γ × [l l 4 4], where γ is the directional coefficient along Co; and the size of Rram 716 is set to β × γ × [l 2 2]. The time required to complete the operation on data of these scales is l × α × β × γ.
Preferably, this embodiment selects l = 16, α = 4, β = 64 and γ = 16. Considering that each FP32 datum occupies 4 B, the storage capacity of Nram 709 is α × β × [l 4 4] × 4B = 256 KB, the storage capacity of Wram 712 is α × γ × [l l 4 4] × 4B = 1 MB, and the storage capacity of Rram 716 is β × γ × [l 2 2] × 4B = 256 KB.
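These capacities can be checked with a short calculation (a sketch only, using the FP32 element size of 4 B stated above):

```python
# Verify the on-chip buffer capacities implied by l=16, alpha=4, beta=64, gamma=16.
l, alpha, beta, gamma, fp32 = 16, 4, 64, 16, 4

nram = alpha * beta * (l * 4 * 4) * fp32        # neuron data buffer
wram = alpha * gamma * (l * l * 4 * 4) * fp32   # Winograd weight buffer
rram = beta * gamma * (l * 2 * 2) * fp32        # convolution result buffer

print(nram // 1024, "KB")   # 256 KB
print(wram // 1024, "KB")   # 1024 KB = 1 MB
print(rram // 1024, "KB")   # 256 KB
```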
Referring back to FIG. 7, Nram 709 temporarily stores, according to the decoded instruction, the neuron data sent from NDMA 703; NTU 710 reads the neuron data from Nram 709 according to the decoded instruction and performs the forward transformation, i.e., computes B^T dB, to produce the forward-transformed data, which is temporarily stored in WNram 711. FIG. 9 shows a schematic of Nram 709. In this embodiment, Nram 709 includes 4 memory arrays 901, 902, 903, 904, each of which includes 4 memory blocks 905, 906, 907, 908; each memory block consists of d memory units of w bits, where d also represents the number of addresses in the memory block. Preferably, w is 128 and d is 1024, so each memory block is 16 KB, each memory array is 64 KB, and Nram 709 has a total storage capacity of 256 KB, a total width of 4 × w bits = 64 B and a depth of 4 × d = 4096.
In the width direction, the input bandwidth of Nram 709 is set to 4 B, and the output bandwidth is matched to the input bandwidth of NTU 710. As mentioned above, the input bandwidth of NTU 710 is set to l × 32 bits, and since l is preferably 16, the input bandwidth of NTU 710 is 64 B; the output bandwidth of Nram 709 is therefore also 4 × w bits = 64 B. Since the input and output of Nram 709 need to take place simultaneously, a dual-port input/output design is adopted.
Fig. 10 shows a schematic diagram of the NTU 710. The NTU 710 includes an input buffer 1001, a register file 1002, an adder group 1003, and an output buffer 1004.
When the NTU 710 receives an instruction to load neuron data from Nram 709, the input buffer 1001 acts as a first-in-first-out queue buffer to temporarily store the neuron data at the input bandwidth of 64 B. The neuron-data loading stage continues until all data has been received, with the overall process controlled by instructions issued by the IDU 708.
Based on the decoded instruction, the register file 1002 fetches the temporarily stored neuron data from the input buffer 1001 in the programmed operation order, stores it at specific addresses of the register file 1002, and uses the neuron data stored at those addresses as addition operands. In this embodiment, since the pipeline durations of the input stage, the operation stage and the output stage of the NTU 710 should be equal, contention for the buffering hardware resources can occur. To resolve this resource dependency, the register file 1002 is divided into a ping storage unit 1005 and a pong storage unit 1006 of equal size: the i-th addition operands and the forward-transformed data computed from them are temporarily stored in the ping storage unit 1005, the (i+1)-th addition operands and forward-transformed data are temporarily stored in the pong storage unit 1006, and the (i+2)-th addition operands and forward-transformed data are temporarily stored in the ping storage unit 1005 again, overwriting the i-th; the register file 1002 keeps storing data according to this rule.
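The ping-pong rule amounts to selecting one half of the register file by the parity of the batch index, as in the following sketch (illustrative only, not the actual register-file hardware):

```python
# Ping-pong selection: batch i lands in "ping", i+1 in "pong", i+2 back in
# "ping" (overwriting batch i), so loading, computing and storing can overlap.
class PingPongFile:
    def __init__(self):
        self.halves = [None, None]          # [ping, pong]

    def store(self, i, operands):
        self.halves[i % 2] = operands       # overwrite the half used two batches ago

    def load(self, i):
        return self.halves[i % 2]

rf = PingPongFile()
for i, batch in enumerate(["op_0", "op_1", "op_2"]):
    rf.store(i, batch)                      # "op_2" overwrites "op_0" in ping
```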
The adder group 1003 reads the addition operands in order from the specific addresses of the register file 1002 according to the decoded instruction and performs the additions. In this embodiment, there are 2 adder groups 1003, corresponding to the addition scheduling direction, and each group includes 16 adders corresponding to the vectorization direction l; each adder is an FP32 adder. The additions of the forward transform of the Winograd convolution are carried out along the channel direction of the neuron data in a specific order: first the additions for the left-multiplication matrix B^T of the Winograd convolution are computed, then the additions for the right-multiplication matrix B, finally producing the forward-transformed data, which is stored back into the register file 1002. The order of operations, as well as the register allocation and operation time, all depend on the convolution filter size and are controlled by instructions sent by the IDU 708. The operation stage has a data dependency on the neuron-data loading stage; the two are executed in a pipelined manner, which the hardware implements through counting.
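For the common F(2×2, 3×3) case, the forward transform B^T dB reduces to additions and subtractions on a 4×4 tile, which is what the adder groups schedule; the sketch below uses the standard Winograd B matrix and NumPy, not the device's internal instruction sequence:

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) input-transform matrix B^T (additions only).
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

def forward_transform(d):
    """B^T d B for one 4x4 input tile d: left-multiplication first, then right."""
    return Bt @ d @ Bt.T

tile = np.arange(16, dtype=np.float32).reshape(4, 4)
print(forward_transform(tile))
```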
The output buffer 1004 is also a first-in-first-out queue buffer, which temporarily stores the forward-transformed data taken in turn from the ping storage unit 1005 and the pong storage unit 1006. This output stage has to wait for the operation stage to complete in full before performing the corresponding buffered output at the output bandwidth of 64 B.
Since the forward-transformed data needs to be multiplexed to save overhead, WNram 711 is configured to buffer it and send it out repeatedly. WNram 711 includes 4 cache units: a first cache unit, a second cache unit, a third cache unit and a fourth cache unit. The forward-transformed data from NTU 710 is sent to one or more of these cache units by route distribution.
WNram 711 sends the forward-transformed data to MAC 713 in a fixed order for the subsequent operations. WNram 711 is designed to cache a portion of the forward-transformed data, send it to MAC 713, and then store the next portion; through this pipelining the size of WNram 711 is kept small. Further, since the forward-transformed data is to be bit-wise multiplied with Winograd weights of scale γ × [l l 4 4], WNram 711 transmits the data to MAC 713 in γ batches. In this way a piece of forward-transformed data only has to be output once every γ beats on average, which effectively reduces the power consumption of WNram 711. Accordingly, the first γ pieces of forward-transformed data are overwritten in turn by the next γ pieces, so the minimum storage size of WNram 711 can be limited to [l (r+1) (s+1)] × 4B, that is, [l 4 4] × 4B = 1 KB as described above.
Specifically, the first, second, third and fourth cache units each have a width of w1 bytes and a depth of d1, and each is divided into m parts in the depth direction. In this embodiment, m is preferably 8, w1 is 64 and d1 is 128, so each cache unit is 64 B wide and 128 deep, with its address space divided into 8 parts in the depth direction for data multiplexing; each cache unit is therefore 8 KB, that is, the total capacity of WNram 711 is set to 32 KB.
Referring back to fig. 7, Wram 712 temporarily stores, according to the decoded instruction, the Winograd weights sent from WDMA 704; MAC 713 reads the Winograd weights from Wram 712 and the forward-transformed data from WNram 711 according to the decoded instruction and performs a bit-wise multiplication and accumulation of the two, that is, it computes [(GgG^T)⊙(B^T dB)], producing the bit-wise multiplied data, which it temporarily stores in WRram 714.
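The bit-wise multiply-accumulate step combines one forward-transformed input tile with one Winograd weight tile per channel and sums along the Ci direction; a NumPy sketch with illustrative shapes:

```python
import numpy as np

def eltwise_mac(U, V):
    """U: Winograd weights (Ci, 4, 4), i.e. G g G^T per channel.
       V: transformed inputs (Ci, 4, 4), i.e. B^T d B per channel.
       Returns the (4, 4) tile accumulated over the Ci direction."""
    return np.sum(U * V, axis=0)      # element-wise product, then accumulate

Ci = 64
U = np.random.rand(Ci, 4, 4).astype(np.float32)
V = np.random.rand(Ci, 4, 4).astype(np.float32)
M = eltwise_mac(U, V)                 # this tile is later fed to the inverse transform
```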
Fig. 11 shows a schematic diagram of Wram 712. In this embodiment, Wram 712 includes 4 storage arrays 1101, 1102, 1103, 1104, and WDMA 704 sends the Winograd weights to the storage arrays 1101, 1102, 1103, 1104 through route distribution. Each storage array comprises 4 storage blocks 1105, 1106, 1107, 1108, and each storage block comprises 4 storage units 1109, 1110, 1111, 1112, so that each storage block has a size of 4 × d × w bits. As mentioned above, w is 128 and d is 1024, so each storage block is 64 KB and each storage array is 256 KB, giving Wram 712 a total capacity of 1 MB. Each storage block is 4 × w = 512 bits wide and is segmented in the depth direction into 4 address-independent memory spaces, each of depth d = 1024, for a total depth of 4 × d = 4096.
In this embodiment, each storage array 1101, 1102, 1103, 1104 independently has an input bandwidth and an output bandwidth of 4 × w bits, and the total input bandwidth and total output bandwidth of Wram 712 are each 4 × 4 × w bits. Specifically, when w is 128, the input bandwidth and output bandwidth of each storage array are 64 B, and the total input and output bandwidths are 256 B.
In this embodiment, MAC 713 includes 64 MAC operators, divided into 4 groups that perform 4 different batches of operations, with the 16 MAC operators in each group laid out independently. The forward-transformed data of WNram 711 needs to be sent to the 64 MAC operators at the same time so that it can be bit-wise multiplied and accumulated with different Winograd weights; WNram 711 therefore sends the forward-transformed data by broadcast or distribution routing. Because the output load is heavy, to guarantee drive strength and timing the forward-transformed data of WNram 711 passes through two levels of broadcast or distribution routing, N1 and N2: it is first sent to 4 N1 nodes, each N1 node broadcasts or distributes it to 4 N2 nodes, and each N2 node broadcasts or distributes it to 4 MAC operators.
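The two-level fan-out can be pictured as a small routing tree (an illustrative sketch of the connectivity only, not the physical network):

```python
# Two-level broadcast/distribution: 1 source -> 4 N1 nodes, each N1 node
# -> 4 N2 nodes, each N2 node -> 4 MAC operators, covering all 64 operators.
def broadcast_tree(data, n1=4, n2=4, macs=4):
    delivered = []
    for i in range(n1):                 # first-level routing to N1 nodes
        for j in range(n2):             # each N1 node feeds 4 N2 nodes
            for k in range(macs):       # each N2 node feeds 4 MAC operators
                delivered.append(((i, j, k), data))
    return delivered                    # 4 * 4 * 4 = 64 destinations

assert len(broadcast_tree("tile")) == 64
```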
Fig. 12 shows a schematic diagram of the output side of WNram 711. MAC 713 first performs the bit-wise multiplication and then accumulates the resulting vector in sequence; its logical function is equivalent to computing a vector inner product, i.e., one element of a matrix multiplication. Each MAC group includes 16 MAC units 1201, i.e., ω = 16, and since l is preferably 16, the compute capability of each MAC group is 16 × (16 + (16 − 1)) = 496 FLOPs.
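The 496-FLOP figure follows from ω MAC units each computing a length-l inner product (l multiplications plus l − 1 additions); a one-line check:

```python
# FLOPs of one MAC group: omega MAC units, each computing a length-l inner
# product, i.e. l multiplications plus (l - 1) additions.
def mac_group_flops(omega=16, l=16):
    return omega * (l + (l - 1))

print(mac_group_flops())   # 496
```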
ITU 715 reads the bit-wise multiplied data from WRram 714 according to the decoded instruction and inversely transforms it, i.e., computes A^T L A, where L denotes the bit-wise multiplied data, to obtain the convolution result, which is temporarily stored in Rram 716.
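For F(2×2, 3×3), the inverse transform applies the standard output matrix A^T to reduce the 4×4 accumulated tile to a 2×2 result tile; the sketch below is the generic Winograd identity, not the device-specific adder scheduling:

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) output-transform matrix A^T.
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def inverse_transform(L):
    """A^T L A: turn one 4x4 bit-wise multiplied (and Ci-accumulated) tile
    into the corresponding 2x2 block of the convolution result."""
    return At @ L @ At.T

L_tile = np.arange(16, dtype=np.float32).reshape(4, 4)
print(inverse_transform(L_tile))     # 2x2 output tile
```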
Figure 13 shows a schematic diagram of ITU 715. ITU 715 includes an input buffer 1301, a register file 1302, an adder group 1303, and an output buffer 1304.
When ITU 715 receives an instruction to load the bit-wise multiplied data from WRram 714, the input buffer 1301 acts as a first-in-first-out queue buffer to temporarily store the bit-wise multiplied data at the input bandwidth. The loading stage continues until all data has been received; convolution filters of different sizes are configured with fixed, independent cache-resource partitions and input counts, and the overall process is controlled by instructions sent by the IDU 708.
The register file 1302 fetches the temporarily stored bit-wise multiplied data from the input buffer 1301 in a fixed operation order according to the decoded instruction, stores it at specific addresses of the register file 1302, and uses the data stored at those addresses as addition operands. Similarly, to resolve the resource dependency, the register file 1302 has a ping storage unit 1305 and a pong storage unit 1306 of equal size: the i-th addition operands and the convolution results computed from them are temporarily stored in the ping storage unit 1305, the (i+1)-th addition operands and convolution results are temporarily stored in the pong storage unit 1306, and the (i+2)-th addition operands and convolution results are temporarily stored in the ping storage unit 1305 again, overwriting the i-th; the register file 1302 keeps storing data according to this rule.
The adder group 1303 reads the addition operands in order from the specific addresses of the register file 1302 according to the decoded instruction and performs the additions. Like the adder group 1003, there are 2 adder groups 1303, corresponding to the addition scheduling direction, and each group includes 16 adders corresponding to the vectorization direction; each adder is an FP32 adder. The additions of the inverse transform of the Winograd convolution are carried out along the channel direction of the bit-wise multiplied data in a specific order: first the additions for the left-multiplication matrix A^T of the Winograd convolution are computed, then the additions for the right-multiplication matrix A, producing the convolution result, which is stored back into the register file 1302. The order of operations, as well as the register allocation and operation time, all depend on the convolution filter size and are controlled by instructions sent by the IDU 708. The operation stage has a data dependency on the above-mentioned loading stage of the bit-wise multiplied data; the two are executed in a pipelined manner, which the hardware implements through counting.
The output buffer 1304 is also a first-in-first-out queue buffer, which temporarily stores the convolution results taken in turn from the ping storage unit 1305 and the pong storage unit 1306. The output stage has to wait for the operation stage to complete in full before performing the corresponding buffered output at the output bandwidth.
In addition to Winograd convolution, the computing device 601 can perform all neural-network-related operations; the ALU 717 carries out two kinds of tasks according to the decoded instructions. The first is convolution fusion, i.e., operations that can be completed on chip in one pass together with the convolution layer without depending on additional data, including activation, bias addition, partial accumulation along a direction and the like; the second is non-convolution operations. The results of the ALU 717 operations are likewise buffered in Rram 716. The presence of the ALU 717 ensures that all operations in a convolutional neural network can be carried out entirely within the computing device 601, giving the computing device 601 the generality and completeness required for neural networks.
RDMA 706 fetches the convolution results from Rram 716 according to the decoded instruction and outputs them to DRAM 604, which completes the entire convolution operation. Similarly, RDMA 706 can also fetch the other operation results produced by the ALU 717 from Rram 716 and output them to DRAM 604 according to the decoded instruction. In this embodiment, the output bandwidth of Rram 716 is w bytes; it likewise includes 4 storage arrays, each comprising 4 × d storage units of 4 × w bits, i.e., 512 bits wide and 4096 deep, so that each storage array is 256 KB and Rram 716 is 1 MB in total. Each storage array has dual input/output ports with a bandwidth of 64 B, and its addresses are divided into 16 parts in the depth direction, each address space being 256 deep, which is used to store the results along the neuron multiplexing direction.
Fig. 14 shows a schematic diagram of the connections of Rram 716. The input ports of Rram 716 are connected to ITU 715 and ALU 717 and receive their output data. Because the convolution operations and the other operations do not occur at the same time, the two input ports never need to work simultaneously, so the input bandwidth of each storage array is kept at 64 B and this 64 B bandwidth is time-division multiplexed to take in the data of ITU 715 and ALU 717. Rram 716 likewise has 2 output ports, one connected to RDMA 706 and the other to ALU 717. After the ALU 717 operations are complete, Rram 716 sends the computation results to DRAM 604 via RDMA 706, so the data transfers to RDMA 706 and to ALU 717 are likewise accomplished with 64 B of output bandwidth, time-division multiplexed at the output.
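The time-division multiplexing of a shared 64 B port can be pictured as a simple per-cycle arbiter; the sketch below is illustrative only and does not reflect the actual arbitration logic:

```python
# One 64-byte port shared by two requesters (e.g. ITU and ALU on the input
# side, or RDMA and ALU on the output side): grant at most one per cycle.
def arbitrate(requests_per_cycle):
    grants = []
    for cycle, (itu_req, alu_req) in enumerate(requests_per_cycle):
        if itu_req and alu_req:
            grants.append("ITU" if cycle % 2 == 0 else "ALU")  # alternate on conflict
        elif itu_req:
            grants.append("ITU")
        elif alu_req:
            grants.append("ALU")
        else:
            grants.append(None)
    return grants

print(arbitrate([(True, False), (True, True), (False, True)]))
```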
The present invention performs the hardware design based on the characteristics of the Winograd algorithm to achieve general-purpose acceleration, provides a pipelined operation mode to speed up Winograd convolution, and makes full use of reusable resources in the hardware implementation through time-division multiplexing, broadcast routing and similar methods. The hardware structure provided by the present invention matches the Winograd convolution algorithm and has the technical effects of preserving network accuracy, accelerating performance, reducing area and reducing power consumption.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a car recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present invention can also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical care, and the like. Furthermore, the electronic equipment or the device can be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the electronic device or apparatus with high computational power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), and the electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of simplicity, the present invention sets forth some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the inventive arrangements are not limited by the order of acts described. Accordingly, it will be appreciated by those skilled in the art, given the benefit of this disclosure or teaching of this invention, that certain steps may be performed in other sequences or concurrently. Further, those skilled in the art will appreciate that the described embodiments of the invention are capable of being practiced in other alternative embodiments that may involve fewer acts or modules than are necessary to practice one or more aspects of the invention. In addition, the description of some embodiments of the present invention is also focused on different schemes. In view of this, those skilled in the art will understand that portions of the present invention that are not described in detail in one embodiment may also refer to related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, one of ordinary skill in the art will appreciate that the several embodiments disclosed herein can be practiced in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the solution described in the embodiments of the present invention. In addition, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors and like devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause a1, a buffer for temporarily storing convolution results of Winograd convolution, the output bandwidth of the buffer being w bytes, the buffer comprising 4 storage arrays, each storage array comprising 4 × d storage units of 4 × w bits.
Clause A2, the cache of clause A1, wherein the convolution result is calculated based on neuron data and Winograd weights, the neuron data having a scale of [Ci Hi Wi], the Winograd weights having a scale of [Co Ci (r+1)×(s+1)], and the storage capacity of the cache being β × γ × [l 2 2] × 4B, wherein Ci is the number of channels of the neuron data, Hi is the height of the neuron data, Wi is the width of the neuron data, Co is the number of the convolution results, r is the height of the Winograd weights, s is the width of the Winograd weights, β is the directional coefficient along HiWi, γ is the directional coefficient along Co, and l is the vectorization length.
Clause A3, the cache of clause a2, wherein β is 64 and γ is 16.
Clause a4, the cache of clause a1, wherein w is 128.
Clause a5, the cache of clause a1, wherein d is 1024.
Clause a6, a computing device comprising the cache of any of clauses a 1-5.
Clause a7, an integrated circuit device, comprising the computing device of clause a 6.
Clause A8, a board comprising the integrated circuit device of clause a 7.
The above embodiments of the present invention are described in detail, and the principle and the implementation of the present invention are explained by applying specific embodiments, and the above description of the embodiments is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (8)

1. A buffer for temporarily storing convolution results of Winograd convolution is characterized in that the output bandwidth of the buffer is w bytes, the buffer comprises 4 storage arrays, and each storage array comprises 4 xd storage units with 4 xw bits.
2. The cache of claim 1, wherein the convolution result is calculated based on neuron data and Winograd weights, the neuron data having a scale of [Ci Hi Wi], the Winograd weights having a scale of [Co Ci (r+1)×(s+1)], and the storage capacity of the cache being β × γ × [l 2 2] × 4B, wherein Ci is the number of channels of the neuron data, Hi is the height of the neuron data, Wi is the width of the neuron data, Co is the number of the convolution results, r is the height of the Winograd weights, s is the width of the Winograd weights, β is the directional coefficient along HiWi, γ is the directional coefficient along Co, and l is the vectorization length.
3. The cache of claim 2, wherein β is 64 and γ is 16.
4. The cache of claim 1, wherein w is 128.
5. The cache of claim 1, wherein d is 1024.
6. A computing device comprising a cache according to any of claims 1 to 5.
7. An integrated circuit device comprising the computing device of claim 6.
8. A board card comprising the integrated circuit device of claim 7.
CN202110265208.6A 2021-03-11 2021-03-11 Temporary storage of convolution results, computing device, integrated circuit device and board card Pending CN115079927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110265208.6A CN115079927A (en) 2021-03-11 2021-03-11 Temporary storage of convolution results, computing device, integrated circuit device and board card

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110265208.6A CN115079927A (en) 2021-03-11 2021-03-11 Temporary storage of convolution results, computing device, integrated circuit device and board card

Publications (1)

Publication Number Publication Date
CN115079927A true CN115079927A (en) 2022-09-20

Family

ID=83241146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110265208.6A Pending CN115079927A (en) 2021-03-11 2021-03-11 Temporary storage of convolution results, computing device, integrated circuit device and board card

Country Status (1)

Country Link
CN (1) CN115079927A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination