CN114692810A - Device and board card for calculating Winograd convolution - Google Patents

Device and board card for calculating Winograd convolution

Info

Publication number
CN114692810A
CN114692810A
Authority
CN
China
Prior art keywords
data
winograd
neuron
convolution
multiply
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011579087.4A
Other languages
Chinese (zh)
Inventor
Not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority to CN202011579087.4A priority Critical patent/CN114692810A/en
Publication of CN114692810A publication Critical patent/CN114692810A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/15 - Correlation function computation including computation of convolution operations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a device and a board card for calculating Winograd convolution, wherein the device is connected to an off-chip memory in which neuron data and a plurality of instructions are stored. The device comprises a neuron cache and a forward transform unit. The neuron cache is used to start loading the neuron data from the off-chip memory; the forward transform unit is used to start reading the neuron data from the neuron cache and performing the forward transform before the neuron data has finished loading, so as to generate forward-transformed data. The invention has the technical effects of preserving network precision, accelerating performance, reducing area, and reducing power consumption.

Description

Device and board card for calculating Winograd convolution
Technical Field
The present invention relates generally to the field of neural networks. More particularly, the invention relates to a device and a board card for calculating Winograd convolution.
Background
With the rapid development of the information age, research in artificial intelligence and machine learning is flourishing and the related industries are developing vigorously. Convolutional neural networks are widely used in computer vision, autonomous driving, machine translation, speech recognition, smart home, and other applications.
Convolutional neural networks have a large number of parameters and a large amount of computation, so the execution performance of a convolutional neural network model is severely limited by the restricted area and computing power of a portable mobile terminal; at the same time, a processor not specially designed for convolution incurs a huge power-consumption overhead when performing convolution operations.
Winograd convolution is a convolution acceleration technique based on polynomial interpolation. The two inputs of the convolution operation, the neurons and the weights, are first partitioned at a certain scale and then each undergoes a linear transformation, the Winograd forward transform; the transformed neurons and weights are multiplied bit-wise (element by element), and the bit-wise multiplication result undergoes another linear transformation, the Winograd inverse transform, finally yielding a convolution result equivalent to the original convolution operation.
In the Winograd convolution operation, the forward- and inverse-transform matrices for the neurons and the weights consist entirely of simple fixed values, so the forward and inverse transforms of the Winograd neurons and weights can be realized with additions alone. The multiplications required by the Winograd algorithm occur only in the bit-wise multiplication step, whose multiplication complexity is considerably lower than that of the original convolution algorithm. Because the hardware cost (timing, power consumption, and area) of a multiplication is much higher than that of an addition of the same bit width, replacing the original convolution with Winograd convolution brings clear gains in hardware energy efficiency and in operation time.
However, at present no hardware has been designed specifically for the Winograd convolution acceleration algorithm, so conventional artificial intelligence chips cannot fully exhibit the advantages of the Winograd convolution operation. A hardware device capable of efficiently running the Winograd convolution algorithm is therefore urgently needed.
Disclosure of Invention
In order to at least partially solve the technical problems mentioned in the background art, the invention provides a device and a board card for calculating Winograd convolution.
In one aspect, the present disclosure discloses a device for computing a Winograd convolution coupled to an off-chip memory, the off-chip memory storing neuron data and a plurality of instructions. The device comprises: a neuron cache for starting to load the neuron data from the off-chip memory; and a forward transform unit for starting to read the neuron data from the neuron cache and performing a forward transform before the neuron data has finished loading, so as to generate forward-transformed data.
In another aspect, the invention discloses an integrated circuit device comprising the above device, and a board card comprising the above integrated circuit device.
The hardware structure provided by the invention can be matched with a Winograd convolution acceleration algorithm, and has the technical effects of ensuring network precision, accelerating performance, reducing area and reducing power consumption.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are illustrated by way of example and not by way of limitation, and like reference numerals designate like or corresponding parts throughout the several views, in which:
FIG. 1 is a diagram showing the conversion of an original convolution of F (2 × 2,3 × 3) to a Winograd convolution;
FIG. 2 is a structural diagram showing a board card of an embodiment of the present invention;
FIG. 3 is a block diagram illustrating an integrated circuit device of an embodiment of the invention;
FIG. 4 is a schematic diagram showing the internal structure of a computing device of an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating a forward transform data cache of an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a bit multiply accumulate operator according to an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating a pipeline of an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third" and "fourth", etc. in the claims, the description and the drawings of the present invention are used for distinguishing different objects and are not used for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims of this application, the singular form of "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this specification refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "once", "in response to a determination", or "in response to a detection".
The following detailed description of embodiments of the invention refers to the accompanying drawings.
The Winograd convolution acceleration algorithm (hereinafter the Winograd algorithm or Winograd convolution) is a transformation method that applies linear transformations to the operands of a convolution operation so as to minimize the number of multiplications required, trading some of those multiplications for additional additions. In hardware, a multiplier is structurally more complex than an adder, consumes more area and power, and has poorer overall processing performance, so a Winograd algorithm that replaces multiplication with addition has great advantages for two-dimensional convolution operations.
For two-dimensional convolution, the convolution result can be expressed as F (m × n, r × s), i.e., the output shape is m × n and the weight shape is r × s. The matrix representation of the Winograd algorithm is as follows:
Y = A^T[(GgG^T) ⊙ (B^T dB)]A
where Y denotes the output matrix of the convolution operation, A^T is the left-multiplication constant matrix of the inverse transform, G is the left-multiplication constant matrix of the weight transform, g is the weight of the original convolution, G^T is the right-multiplication constant matrix of the weight transform, ⊙ denotes bit-wise (element-wise) multiplication, B^T is the left-multiplication constant matrix of the neuron transform, d is the neuron data, B is the right-multiplication constant matrix of the neuron transform, and A is the right-multiplication constant matrix of the inverse transform. The left- and right-multiplication matrices of each transform are simply transposes of each other.
Taking F (2 × 2,3 × 3) as an example, the constant matrices are as follows:
(The constant matrices B^T, G, and A^T for F(2 × 2, 3 × 3) are given as figures in the original document and are not reproduced here.)
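As a concrete numerical sketch of the matrix form above, the following Python fragment uses the constant matrices commonly published for F(2 × 2, 3 × 3) (their exact values are an assumption here, since the patent's figures are not reproduced) and checks the Winograd result against a direct sliding-window convolution:

    import numpy as np

    # Commonly used constant matrices for F(2 x 2, 3 x 3); assumed values,
    # since the patent's own figures are not reproduced above.
    B_T = np.array([[1, 0, -1, 0],
                    [0, 1, 1, 0],
                    [0, -1, 1, 0],
                    [0, 1, 0, -1]], dtype=float)
    G = np.array([[1, 0, 0],
                  [0.5, 0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0, 0, 1]], dtype=float)
    A_T = np.array([[1, 1, 1, 0],
                    [0, 1, -1, -1]], dtype=float)

    d = np.random.rand(4, 4)   # one 4 x 4 neuron tile
    g = np.random.rand(3, 3)   # one 3 x 3 convolution kernel

    # Y = A^T [ (G g G^T) (.) (B^T d B) ] A
    U = G @ g @ G.T            # Winograd-transformed weight
    V = B_T @ d @ B_T.T        # Winograd-transformed neuron tile (B is the transpose of B^T)
    Y = A_T @ (U * V) @ A_T.T  # bit-wise (element-wise) product, then inverse transform

    # Reference: the 2 x 2 output of sliding the 3 x 3 kernel over the 4 x 4 tile
    ref = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                    for i in range(2)])
    assert np.allclose(Y, ref)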
FIG. 1 shows a schematic diagram of the conversion of the original convolution of F(2 × 2, 3 × 3) into a Winograd convolution. As shown, neuron data 101 is convolved with convolution kernel 102. During calculation, the neuron data 101 is arranged into rows according to the elements in the sliding window 103; sliding the window 103 four times forms a 4 × 9 matrix 104, and the elements of the convolution kernel 102 are arranged in a column to form a 9 × 1 matrix 105. The 4 × 9 matrix 104 is multiplied by the 9 × 1 matrix 105 to obtain the 4 × 1 convolution result 106.
Further, dividing the figure along the dotted lines converts the 4 × 9 matrix 104 into the 2 × 3 matrix 107, the 9 × 1 matrix 105 into the 3 × 1 matrix 108, and the 4 × 1 convolution result 106 into the 2 × 1 convolution result 109. After the linear transformation, the first element of the 2 × 1 convolution result 109 is R_0 = M_0 + M_1 + M_2 and the second element is R_1 = M_1 - M_2 - M_3, where M_0, M_1, M_2, and M_3 can be represented by the following sub-formulas:
(The sub-formulas for M_0 through M_3 are given as figures in the original document and are not reproduced here.)
through the segmentation and linear transformation, the original convolution operation needs to execute 36 times of multiplication, and the Winograd algorithm only needs to execute 16 times of multiplication, so that the multiplication complexity is reduced by 2.25 times.
As can be seen from the Winograd conversion of the two-dimensional convolution, the Winograd algorithm consists mainly of the following steps. First, the weights are left- and right-multiplied by the weight constant matrices, i.e., GgG^T, to obtain the Winograd-transformed weights; at the same time, the neuron data are left- and right-multiplied by the neuron constant matrices, i.e., B^T dB, to obtain the Winograd-transformed neurons. Next, the Winograd-transformed neurons and weights are multiplied bit-wise, i.e., (GgG^T) ⊙ (B^T dB), to obtain the bit-wise multiplication result. Finally, the bit-wise multiplication result is left- and right-multiplied by the Winograd inverse-transform constant matrices, i.e., A^T[(GgG^T) ⊙ (B^T dB)]A, finally yielding a convolution result equivalent to the original convolution.
From the perspective of hardware design, the present invention pipelines these three major transformation steps according to the dependencies among the three processes and the differences in their operations, so as to achieve more efficient acceleration.
FIG. 2 shows a schematic structural diagram of a board card 20 according to an embodiment of the present invention. As shown in FIG. 2, the board card 20 includes a chip 201, which is a System-on-Chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial-intelligence computing unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning in particular is widely applied to cloud intelligence, and a notable characteristic of cloud intelligence applications is the large input data size, which places high demands on the storage capacity and computing power of the platform.
The chip 201 is connected to an external device 203 through an external interface device 202. The external device 203 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a Wi-Fi interface, or the like. Data to be processed may be transferred from the external device 203 to the chip 201 through the external interface device 202, and the calculation results of the chip 201 may be transmitted back to the external device 203 via the external interface device 202. The external interface device 202 may take different interface forms, such as a PCIe interface, depending on the application scenario.
The board card 20 also includes a storage device 204 for storing data, which comprises one or more storage units 205. The storage device 204 is connected to, and exchanges data with, the control device 206 and the chip 201 via a bus. The control device 206 on the board card 20 is configured to regulate the state of the chip 201; for this purpose, in one application scenario, the control device 206 may include a micro controller unit (MCU).
FIG. 3 is a structural diagram of the combined processing device in the chip 201 of this embodiment. As shown in FIG. 3, the combined processing device 30 includes a computing device 301, an interface device 302, a processing device 303, and a DRAM 304.
The computing device 301 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor that performs deep learning or machine learning computations, especially Winograd convolution operations. It can interact with the processing device 303 through the interface device 302 to jointly complete the user-specified operations.
The interface device 302 is used for transmitting data and control instructions between the computing device 301 and the processing device 303. For example, the computing device 301 may obtain input data from the processing device 303 via the interface device 302 and write it to an on-chip storage device of the computing device 301. Further, the computing device 301 may obtain control instructions from the processing device 303 via the interface device 302 and write them into the control cache on the computing device 301. Alternatively or additionally, the interface device 302 may also read data from a storage device of the computing device 301 and transmit it to the processing device 303.
The processing device 303, as a general-purpose processing device, performs basic control including, but not limited to, data handling and the starting and/or stopping of the computing device 301. Depending on the implementation, the processing device 303 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and the like, and their number may be determined according to actual needs. As previously mentioned, the computing device 301 of the present invention, considered alone, may be viewed as having a single-core structure or a homogeneous multi-core structure; considered together, however, the computing device 301 and the processing device 303 form a heterogeneous multi-core structure.
The DRAM 304 is used for storing data to be processed. It is an off-chip memory, typically 16 GB or larger, that stores data of the computing device 301 and/or the processing device 303, and in particular stores the neuron data and weights on which Winograd convolution is to be performed. In this embodiment, the processing device 303 has previously linearly transformed the weights of the original convolution into the Winograd weights GgG^T and stored them in the DRAM 304.
FIG. 4 shows a block diagram of the computing device 301. The computing device 301 includes a bus 401, a direct memory access (DMA) module 402, an instruction cache (Iram) 407, a decode unit (IDU) 408, a neuron cache (Nram) 409, a forward transform unit (NTU) 410, a forward-transformed data cache (WNram) 411, a weight cache (Wram) 412, a bit-wise multiply-accumulate operator (MAC) 413, a bit-wise multiplication data cache (WRram) 414, an inverse transform unit (ITU) 415, a result cache (Rram) 416, and a logic operation module (ALU) 417.
The bus 401 is the common communication trunk over which the units exchange information; it is a bundle of transmission lines and, according to the kind of information transmitted by the combined processing device 30, is the collective name for the data bus, address bus, and control bus that carry data, data addresses, and instructions, respectively. The bus 401 serves as the communication channel between the DRAM 304 and the computing device 301 and, in this embodiment, is specifically PCIe.
The DMA module 402 is used to copy data from one address space to another, typically transferring data between an external memory (e.g., the DRAM 304) and the internal caches of the computing device 301. When a DMA transfer is to be performed, the processing device 303 hands bus mastership to the DMA module 402; the DMA module 402 then controls the bus 401 to transfer the data, and after the DMA transfer is completed it hands bus mastership back to the processing device 303.
The DMA module 402 includes a neuron direct memory access (NDMA) 403, a weight direct memory access (WDMA) 404, an instruction direct memory access (IDMA) 405, and a result direct memory access (RDMA) 406. The NDMA 403 inputs neuron data from the DRAM 304, the WDMA 404 inputs Winograd weights from the DRAM 304, the IDMA 405 inputs instructions from the DRAM 304, and the RDMA 406 outputs the calculation results to the DRAM 304. In other embodiments, the NDMA 403, WDMA 404, IDMA 405, and RDMA 406 may be implemented by the same direct memory access unit.
The Iram 407 temporarily stores the instructions input by the IDMA 405, and the IDU 408 fetches instructions from the Iram 407, decodes them, and controls the operation of the other units according to the decoded instructions. The IDU 408 is the decoding and scheduling unit of the entire computing device 301: it decodes the control instructions obtained from the DRAM 304 and converts them into control signals that coordinate the operation of the modules/units on the chip, and it is also responsible for tasks such as preserving instruction order, resolving instruction dependencies, branch prediction, exception handling, and interrupt handling. In the figure, thin arrows indicate control flows and thick arrows indicate data flows.
The Nram 409 temporarily stores, according to the decoded instruction, the neuron data sent by the NDMA 403, and the NTU 410 reads the neuron data from the Nram 409 according to the decoded instruction and performs the forward transform, i.e., computes B^T dB, to produce forward-transformed data, which is temporarily stored in the WNram 411.
The NTU 410 includes an input buffer, a register file, an adder bank, and an output buffer.
When the NTU 410 receives an instruction to load neuron data from the Nram 409, the input buffer acts as a first-in-first-out queue that temporarily stores the neuron data. The neuron-data loading stage continues until all data have been received; convolution filters of different sizes are configured with fixed, independent cache-resource partitions and input counts, and the overall process is controlled by instructions issued by the IDU 408.
The register file fetches the temporarily stored neuron data from the input buffer according to the decoded instruction and the planned operation sequence and stores it at specific addresses of the register file; the neuron data stored at those addresses become the addition operands. In this embodiment, because the input stage, the operation stage, and the output stage have equal pipeline durations, a dependency on the cache hardware resources arises. To resolve this resource dependency, the register file is divided into a ping storage unit and a pong storage unit of equal size: the i-th addition operand and the forward-transformed data generated from it are temporarily stored in the ping storage unit, the (i+1)-th addition operand and the (i+1)-th forward-transformed data are temporarily stored in the pong storage unit, and the (i+2)-th addition operand and the (i+2)-th forward-transformed data are temporarily stored in the ping storage unit again, overwriting the i-th addition operand and the i-th forward-transformed data; the register file continues to store data according to this rule, as the sketch below illustrates.
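The alternation between the ping and pong storage units can be pictured with the following simplified software analogy (assumed behavior for illustration, not the actual register-file hardware): data for iteration i live in bank i mod 2, so iteration i+1 occupies the other bank while iteration i+2 later overwrites the bank used by iteration i.

    # Simplified ping/pong double-buffering analogy (illustrative only).
    banks = [None, None]  # bank 0 = "ping" storage unit, bank 1 = "pong" storage unit

    def store_operand(i, operand):
        # Iteration i's addition operand goes into bank i % 2,
        # overwriting whatever iteration i - 2 left there.
        banks[i % 2] = {"operand": operand, "result": None}

    def store_result(i, result):
        # The forward-transformed data produced from operand i stays in the same bank.
        banks[i % 2]["result"] = result

    for i in range(4):
        store_operand(i, f"addition operand {i}")
        store_result(i, f"forward-transformed data {i}")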
The adder group reads the addition operands in sequence from the specific addresses of the register file and performs the additions according to the decoded instruction. In this embodiment there are 2 adder groups, corresponding to the addition scheduling direction, and each group includes 16 adders corresponding to the vectorization direction, each an FB32 adder. The additions of the forward transform of the Winograd convolution are performed along the channel direction of the neuron data in a specific order: first the additions for the left-multiplication matrix B^T of the Winograd convolution are computed, then the additions for the right-multiplication matrix B, finally producing the forward-transformed data, which is stored back into the register file. The operation order, register allocation, and operation time all depend on the convolution filter size and are controlled by instructions issued by the IDU 408. The operation stage has a data dependency on the neuron-data loading stage; the two are executed in a pipelined manner, implemented by hardware counting.
The output buffer is likewise a first-in-first-out queue that temporarily stores, in sequence, the forward-transformed data from the ping and pong storage units. The output stage can only output the corresponding buffer after the operation stage has completed in its entirety.
The WNram 411 includes a plurality of cache units. FIG. 5 shows a schematic diagram of an exemplary WNram 411; as shown, the WNram 411 includes 4 cache units: a first cache unit 501, a second cache unit 502, a third cache unit 503, and a fourth cache unit 504. The forward-transformed data from the NTU 410 is sent to one or more of these cache units by routed distribution.
Referring back to FIG. 4, the Wram 412 temporarily stores, according to the decoded instruction, the Winograd weights sent by the WDMA 404, and the MAC 413 reads the Winograd weights from the Wram 412 and the forward-transformed data from the WNram 411 according to the decoded instruction, performs the bit-wise multiply-accumulate operation on the forward-transformed data and the Winograd weights, i.e., computes (GgG^T) ⊙ (B^T dB), generates the bit-wise multiplication data, and temporarily stores it in the WRram 414.
In this embodiment, the MAC 413 includes 64 MAC operators divided into 4 groups that perform 4 different batches of operations, with the 16 MAC operators in each group laid out independently. The forward-transformed data in the WNram 411 must be sent to all 64 MAC operators simultaneously so that it can be multiply-accumulated bit-wise with different Winograd weights; the WNram 411 therefore sends the forward-transformed data by broadcast or distribution routing. Because the output load is large, in order to guarantee drive capability and timing, the forward-transformed data of the WNram 411 is sent through a two-level broadcast or distribution route of N1 and N2 nodes: it first goes to 4 N1 nodes, each N1 node broadcasts or distributes to 4 N2 nodes, and each N2 node broadcasts or distributes to 4 MAC operators.
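The two-level fan-out can be tallied directly (an illustrative count, not hardware code):

    # 1 source (WNram 411) -> 4 N1 nodes -> 4 N2 nodes each -> 4 MAC operators each
    n1_nodes = 4
    n2_per_n1 = 4
    macs_per_n2 = 4
    assert n1_nodes * n2_per_n1 * macs_per_n2 == 64  # matches the 64 MAC operators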
FIG. 6 shows a schematic diagram of a MAC operator 601. The MAC operator 601 first performs the bit-wise multiplication and then sequentially accumulates the resulting vector; its logical function is equivalent to computing an inner product of two vectors, or to computing one element value in a matrix multiplication.
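In software terms, the logical function of one MAC operator can be sketched as follows (an illustrative equivalent, not the hardware implementation):

    import numpy as np

    def bitwise_mac(transformed_neurons, winograd_weights):
        # Multiply the two vectors element by element, then accumulate the products;
        # for 1-D inputs this equals their inner product.
        return np.sum(transformed_neurons * winograd_weights)

    a = np.arange(4.0)
    b = np.ones(4)
    assert bitwise_mac(a, b) == np.dot(a, b)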
The ITU 415 reads the bit-wise multiplication data from the WRram 414 according to the decoded instruction and inverse-transforms it, i.e., computes A^T[(GgG^T) ⊙ (B^T dB)]A, so as to obtain the convolution result, which is temporarily stored in the Rram 416.
The ITU 415 includes an input buffer, a register file, an adder bank, and an output buffer.
When the ITU 415 receives an instruction to load the bit-wise multiplication data from the WRram 414, the input buffer acts as a first-in-first-out queue that temporarily stores the bit-wise multiplication data. The loading stage continues until all data have been received; convolution filters of different sizes are configured with fixed, independent cache-resource partitions and input counts, and the overall process is controlled by instructions issued by the IDU 408.
The register file fetches the temporarily stored bit-wise multiplication data from the input buffer according to the decoded instruction and a fixed operation sequence and stores it at specific addresses of the register file; the bit-wise multiplication data stored at those addresses become the addition operands. Similarly, to resolve the resource dependency, the register file has a ping storage unit and a pong storage unit of equal size: the i-th addition operand and the convolution result generated from it are temporarily stored in the ping storage unit, the (i+1)-th addition operand and the (i+1)-th convolution result are temporarily stored in the pong storage unit, and the (i+2)-th addition operand and the (i+2)-th convolution result are temporarily stored in the ping storage unit again, overwriting the i-th addition operand and the i-th convolution result; the register file continues to store data according to this rule.
The adder group reads the addition operands in sequence from the specific addresses of the register file and performs the additions according to the decoded instruction. As with the NTU 410, there are 2 adder groups, corresponding to the addition scheduling direction, and each group includes 16 adders corresponding to the vectorization direction, each an FB32 adder. The additions of the inverse transform of the Winograd convolution are performed along the channel direction of the bit-wise multiplication data in a specific order: first the additions for the left-multiplication matrix A^T of the Winograd convolution are computed, then the additions for the right-multiplication matrix A, finally producing the convolution result, which is stored back into the register file. The operation order, register allocation, and operation time all depend on the convolution filter size and are controlled by instructions issued by the IDU 408. The operation stage has a data dependency on the above-mentioned stage of loading the bit-wise multiplication data; the two are executed in a pipelined manner, implemented by hardware counting.
The output buffer is likewise a first-in-first-out queue that temporarily stores, in sequence, the convolution results from the ping and pong storage units. The output stage can only output the corresponding buffer after the operation stage has completed in its entirety.
In addition to Winograd convolution, the computing device 301 is capable of performing all neural-network-related operations; the ALU 417 is configured to perform two kinds of tasks according to the decoded instructions. The first is fused convolution operations, i.e., operations that can be completed on-chip in one pass together with a convolution layer without depending on further data, including processes such as activation, bias addition, direction part, accumulation, and the like; the second is non-convolution operations. The results of operations performed by the ALU 417 are also buffered in the Rram 416. The presence of the ALU 417 ensures that the various operations of a convolutional neural network are fully implemented in the computing device 301, giving the computing device 301 the generality and completeness needed for neural networks.
The RDMA 406 fetches the convolution result from the Rram 416 and outputs it to the DRAM 304 according to the decoded instruction, thereby completing the entire convolution operation. Similarly, the RDMA 406 may also fetch other operation results generated by the ALU 417 from the Rram 416 and output them to the DRAM 304 according to the decoded instruction.
Because the convolution operation involves a huge amount of data, in order to reduce instruction start-up overhead this embodiment further uses instruction control to make the relevant modules/units execute as a pipeline, improving hardware utilization.
As can be seen from the above, the input timing and data scale of the neuron data affect the neuron forward-transform stage of the Winograd convolution instruction, the input timing and data scale of the weight data likewise affect the bit-wise multiply-accumulate stage of the Winograd convolution instruction, and the completion time of the Winograd inverse transform affects the execution of the convolution-result output instruction. From the control point of view, therefore, the order of instructions and their execution times are critical; in addition, this embodiment inserts synchronization instructions between instructions that have dependency relationships, to resolve the data dependencies between the input/output program and the Winograd convolution program.
FIG. 7 shows a schematic diagram of the pipeline of this embodiment, in which the IDU 408 mainly controls the pipelining among the Nram 409, NTU 410, Wram 412, MAC 413, ITU 415, and Rram 416.
During the i-th convolution operation, the IDU 408 sends an instruction controlling the Nram 409 to start loading neuron data i from the DRAM 304 at time T1; the loading finishes at time T2. Before neuron data i has finished loading, at time T3, the IDU 408 controls the NTU 410, according to the synchronization instruction, to start reading neuron data i from the Nram 409 and forward-transforming it to generate forward-transformed data i. From time T3 onward, while the Nram 409 is still loading neuron data i, the NTU 410 is already reading neuron data i from the Nram 409 and forward-transforming it; forward-transformed data i is completed at time T4.
The convolution operation on neuron data i must be paired with Winograd weight i. In the hardware structure of this embodiment, the input of neuron data i is handled by the NDMA 403 and the input of Winograd weight i by the WDMA 404, and these inputs could run in parallel. However, considering that the input/output bandwidth of the computing device 301 is fixed, and that neuron data i must be forward-transformed by the NTU 410 before the MAC 413 can multiply-accumulate it bit-wise with Winograd weight i, this embodiment is designed so that neuron data i is loaded first, starting at time T1, and its forward transform starts at time T3, while Winograd weight i is input to the Wram 412 later; in this way the Nram 409, NTU 410, Wram 412, and MAC 413 are well matched, and no module/unit is left idle or blocked as far as possible. To this end, the IDU 408, according to the synchronization instruction, controls the Wram 412 to start loading Winograd weight i from the DRAM 304 before forward-transformed data i is completely generated. The time at which loading of Winograd weight i starts may be chosen according to the input/output bandwidth of the computing device 301; preferably, time T3 is chosen, i.e., starting the forward transform and starting the loading of Winograd weight i are performed simultaneously. Suppose that the loading of Winograd weight i also finishes at time T4.
Before Winograd weight i finishes loading, at time T5, the IDU 408 controls the MAC 413, according to the synchronization instruction, to start the bit-wise multiply-accumulate operation on forward-transformed data i and Winograd weight i, so as to generate bit-wise multiplication data i. From time T5 onward, while the Wram 412 is still loading Winograd weight i, the MAC 413 is already performing the multiply-accumulate operation; bit-wise multiplication data i is completed at time T6.
Before bit-wise multiplication data i is completely generated, at time T7, the IDU 408 controls the ITU 415, according to the instruction, to start reading the bit-wise multiplication data i from the WRram 414 and inverse-transforming it to generate convolution result i. From time T7 onward, while the MAC 413 is still performing the bit-wise multiply-accumulate, the ITU 415 is already performing the inverse transform; convolution result i is completed at time T8.
There are two possible points at which the Rram 416 can start temporarily storing convolution result i: before convolution result i is completely generated, i.e., between times T7 and T8, or after convolution result i is completely generated. FIG. 7 illustrates the latter: at time T8, the IDU 408 controls the Rram 416, according to the synchronization instruction, to start temporarily storing convolution result i, and the temporary storage finishes at time T9.
After neuron data i has finished loading, the (i+1)-th convolution operation can start: the IDU 408 sends an instruction controlling the Nram 409 to start loading neuron data i+1 from the DRAM 304 at time T2, and the loading finishes at time T10; in other words, the Nram 409 has started loading neuron data i+1 before forward-transformed data i is completely generated. Before neuron data i+1 has finished loading, at time T4, the IDU 408 controls the NTU 410, according to the synchronization instruction, to start reading neuron data i+1 from the Nram 409 and forward-transforming it to generate forward-transformed data i+1. From time T4 onward, while the Nram 409 is still loading neuron data i+1, the NTU 410 is already reading neuron data i+1 from the Nram 409 and forward-transforming it; forward-transformed data i+1 is completed at time T11.
Before forward-transformed data i+1 is completely generated, and before the MAC 413 completely generates bit-wise multiplication data i, the IDU 408, according to the synchronization instruction, controls the Wram 412 to start loading Winograd weight i+1 from the DRAM 304. The time at which loading of Winograd weight i+1 starts may be chosen according to the input/output bandwidth of the computing device 301; preferably, time T4 is chosen, i.e., starting the forward transform and starting the loading of Winograd weight i+1 are performed simultaneously. Suppose that the loading of Winograd weight i+1 also finishes at time T11.
Before Winograd weight i+1 finishes loading, at time T6, and before the ITU 415 completely generates convolution result i, the IDU 408 controls the MAC 413, according to the synchronization instruction, to start the bit-wise multiply-accumulate operation on forward-transformed data i+1 and Winograd weight i+1, so as to generate bit-wise multiplication data i+1. From time T6 onward, while the Wram 412 is still loading Winograd weight i+1, the MAC 413 is already performing the multiply-accumulate operation; bit-wise multiplication data i+1 is completed at time T12.
Before bit-wise multiplication data i+1 is completely generated, at time T8, the IDU 408 controls the ITU 415, according to the instruction, to start reading bit-wise multiplication data i+1 from the WRram 414 and inverse-transforming it to generate convolution result i+1. From time T8 onward, while the MAC 413 is still performing the bit-wise multiply-accumulate, the ITU 415 is already performing the inverse transform; convolution result i+1 is completed at time T13.
Similarly, there are two possible points at which the Rram 416 can start temporarily storing convolution result i+1: before convolution result i+1 is completely generated, i.e., between times T9 and T13, or after convolution result i+1 is completely generated. For example, when the temporary storage starts after convolution result i+1 is completely generated, at time T13 the IDU 408 controls the Rram 416, according to the synchronization instruction, to start temporarily storing convolution result i+1, and the temporary storage finishes at time T14.
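The overlap described above can be summarized with a simple scheduling sketch (an illustrative pseudo-timeline that assumes four equal-length stages, not the actual control logic):

    # Four-stage pipeline: load -> forward transform -> bit-wise MAC -> inverse transform.
    # At step t, iteration i occupies stage t - i, so consecutive iterations overlap,
    # mirroring how Nram/NTU/Wram/MAC/ITU work on data i+1 while data i is still in flight.
    stages = ["load neurons and weights", "forward transform",
              "bit-wise multiply-accumulate", "inverse transform"]

    def schedule(num_iterations):
        for t in range(num_iterations + len(stages) - 1):
            active = [(i, stages[t - i]) for i in range(num_iterations)
                      if 0 <= t - i < len(stages)]
            print(f"step {t}: " + ", ".join(f"iteration {i}: {s}" for i, s in active))

    schedule(3)

Running schedule(3) prints, for each step, which iteration occupies which stage, showing how iteration i+1 enters a stage as soon as iteration i vacates it.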
Based on the structure of the computing device 301 shown in FIG. 4, this embodiment performs the Winograd convolution operation according to the aforementioned flow, fully exploiting the advantages of the hardware and improving input/output and operation efficiency.
The invention performs hardware design based on the characteristics of the Winograd algorithm to achieve general-purpose acceleration, provides a pipelined mode of operation to speed up Winograd convolution, and makes full use of reusable resources in the hardware implementation through methods such as time-division multiplexing and broadcast routing. The hardware structure provided by the invention matches the Winograd convolution algorithm and has the technical effects of preserving network precision, accelerating performance, reducing area, and reducing power consumption.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a car recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance instrument, a B ultrasonic instrument and/or an electrocardiograph. The electronic device or apparatus of the present invention can also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical care, and the like. Furthermore, the electronic equipment or the device can be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the electronic device or apparatus with high computational power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), and the electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of simplicity, the present invention sets forth some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the inventive arrangements are not limited by the order of acts described. Accordingly, persons skilled in the art may appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the invention. Further, those skilled in the art will appreciate that the described embodiments of the invention are capable of being practiced in other alternative embodiments that may involve fewer acts or modules than are necessary to practice one or more aspects of the invention. In addition, the description of some embodiments of the present invention is also focused on different schemes. In view of this, those skilled in the art will understand that portions of the present invention that are not described in detail in one embodiment may also refer to related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, one of ordinary skill in the art will appreciate that the several embodiments disclosed herein can be practiced in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the scheme of the embodiment of the invention. In addition, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause a1, a device for calculating a Winograd convolution coupled to off-chip memory, the off-chip memory storing neuron data and a plurality of instructions, the device comprising: a neuron cache for starting to load the neuron data from the off-chip memory; and the positive transformation unit is used for starting reading the neuron data from the neuron cache to perform positive transformation before the neuron data are loaded, so as to generate positive transformation data.
Clause a2, the apparatus of clause a1, wherein the off-chip memory further stores a Winograd weight, the apparatus further comprising a weight cache to initiate loading of the Winograd weight from the off-chip memory before the forward transform data is completely generated.
Clause A3, the apparatus of clause a2, wherein starting the forward transform is performed concurrently with starting to load the Winograd weights.
Clause a4, the apparatus according to clause a2, further comprising a bit-wise multiply-accumulate operator for starting the bit-wise multiply-accumulate operation on the forward-transformed data and the Winograd weights before the Winograd weights have finished loading, so as to generate bit-wise multiplication data.
Clause a5, the apparatus of clause a4, further comprising an inverse transform unit for starting to inverse-transform the bit-wise multiplication data to produce a convolution result before the bit-wise multiplication data is completely generated.
Clause a6, the apparatus of clause a5, further comprising a convolution result buffer to initiate temporary storage of the convolution result before it is completely generated.
Clause a7, the apparatus of clause a5, further comprising a convolution result buffer to initiate temporary storage of the convolution result after it is completely generated.
Clause A8, the apparatus of clause a5, wherein the bit-wise multiply-accumulate operator starts a bit-wise multiply-accumulate operation on the next forward-transformed data and the next Winograd weight to generate the next bit-wise multiplication data before the inverse transform unit completely generates the convolution result.
Clause a9, the apparatus of clause a4, wherein the weight cache starts loading a next Winograd weight from the off-chip memory before the bit-wise multiply-accumulate operator completely generates the bit-wise multiplication data.
Clause a10, the apparatus of clause a1, wherein the neuron cache initiates loading of next neuron data from the off-chip memory before the forward transform unit completely generates the forward transform data.
Clause a11, an integrated circuit device, comprising the device of any one of clauses a 1-10.
Clause a12, a board comprising the integrated circuit device of clause a 11.
The above embodiments of the present invention are described in detail, and the principle and the implementation of the present invention are explained by applying specific embodiments, and the above description of the embodiments is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (12)

1. An apparatus for computing Winograd convolution coupled to off-chip memory having stored therein neuron data and a plurality of instructions, the apparatus comprising:
a neuron cache for starting to load the neuron data from the off-chip memory; and
a forward transform unit for starting to read the neuron data from the neuron cache and performing a forward transform before the neuron data has finished loading, so as to generate forward-transformed data.
2. The apparatus of claim 1, wherein the off-chip memory further stores Winograd weights, the apparatus further comprising a weight cache to initiate loading of the Winograd weights from the off-chip memory before the forward transformed data is completely generated.
3. The apparatus of claim 2, wherein initiating a forward transform is performed concurrently with initiating loading of the Winograd weights.
4. The apparatus of claim 2, further comprising a bit-wise multiply-accumulate operator for starting a bit-wise multiply-accumulate operation on the forward-transformed data and the Winograd weights before the Winograd weights have finished loading, to generate bit-wise multiplication data.
5. The apparatus of claim 4, further comprising an inverse transform unit for starting an inverse transform of the bit-wise multiplication data to produce a convolution result before the bit-wise multiplication data is completely generated.
6. The apparatus of claim 5, further comprising a convolution result buffer to initiate temporary storage of the convolution result before it is fully generated.
7. The apparatus of claim 5, further comprising a convolution result buffer to initiate temporary storage of the convolution result after the convolution result is fully generated.
8. The device of claim 5, wherein the bit-wise multiply-accumulate operator initiates a bit-wise multiply-accumulate operation on the next forward transformed data and the next Winograd weight to generate the next bit-wise multiply data before the inverse transform unit fully generates the convolution result.
9. The apparatus of claim 4, wherein the weight cache starts loading a next Winograd weight from the off-chip memory before the bit-wise multiply-accumulate operator completely generates the bit-wise multiplication data.
10. The apparatus of claim 1, wherein the neuron cache initiates loading of next neuron data from the off-chip memory before the forward transform unit completely generates the forward transform data.
11. An integrated circuit device comprising the device of any one of claims 1 to 10.
12. A board card comprising the integrated circuit device of claim 11.
CN202011579087.4A 2020-12-28 2020-12-28 Device and board card for calculating Winograd convolution Pending CN114692810A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011579087.4A CN114692810A (en) 2020-12-28 2020-12-28 Device and board card for calculating Winograd convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011579087.4A CN114692810A (en) 2020-12-28 2020-12-28 Device and board card for calculating Winograd convolution

Publications (1)

Publication Number Publication Date
CN114692810A true CN114692810A (en) 2022-07-01

Family

ID=82129552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011579087.4A Pending CN114692810A (en) 2020-12-28 2020-12-28 Device and board card for calculating Winograd convolution

Country Status (1)

Country Link
CN (1) CN114692810A (en)

Similar Documents

Publication Publication Date Title
CN110597559B (en) Computing device and computing method
CN112416433B (en) Data processing device, data processing method and related product
CN112799726A (en) Data processing device, method and related product
CN109711540B (en) Computing device and board card
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN112801276B (en) Data processing method, processor and electronic equipment
CN112596881B (en) Storage component and artificial intelligence processor
CN114692810A (en) Device and board card for calculating Winograd convolution
CN114692849A (en) Inverse transformation unit, device and board card for inverse transformation of Winograd convolution bit-to-bit multiplication data
CN114692850A (en) Device and board card for performing Winograd convolution forward conversion on neuron data
CN114692811A (en) Device and board card for executing Winograd convolution
CN114692848A (en) Device and board card for obtaining convolution result
CN115081601A (en) Computing device and method for synchronous Winograd convolution
CN115438778A (en) Integrated circuit device for executing Winograd convolution
CN114003198A (en) Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium
CN115438777A (en) Device for performing Winograd convolution forward transform on neuron data
CN113469326A (en) Integrated circuit device and board card for executing pruning optimization in neural network model
CN115081602A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN115081606A (en) Device and board card for executing Winograd convolution
CN115079927A (en) Temporary storage of convolution results, computing device, integrated circuit device and board card
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
CN115081605A (en) Buffer memory, device and board card for temporarily storing neuron data in Winograd convolution
CN115081604A (en) Buffer for temporarily storing Winograd weight, computing device, integrated circuit device and board card
CN115081599A (en) Method for preprocessing Winograd convolution, computer readable storage medium and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination