CN114692849A - Inverse transformation unit, device and board card for inverse transformation of Winograd convolution bit-to-bit multiplication data - Google Patents

Inverse transformation unit, device and board card for inverse transformation of Winograd convolution bit-to-bit multiplication data

Info

Publication number
CN114692849A
CN114692849A
Authority
CN
China
Prior art keywords
data
convolution
inverse transform
transform unit
winograd
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011580527.8A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority to CN202011580527.8A priority Critical patent/CN114692849A/en
Publication of CN114692849A publication Critical patent/CN114692849A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Algebra (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to an inverse transform unit, a device, and a board card for inverse-transforming the element-wise multiplication data of a Winograd convolution, wherein the inverse transform unit comprises an input buffer, a register file, and an adder group. The input buffer temporarily stores the element-wise multiplication data; the register file fetches the temporarily stored element-wise multiplication data from the input buffer and stores it at specific addresses to serve as addition operands; and the adder group reads the addition operands from the specific addresses and performs additions to generate a convolution result.

Description

Inverse transformation unit, device and board card for inverse transformation of Winograd convolution bit-to-bit multiplication data
Technical Field
The present invention relates generally to the field of neural networks. More particularly, the present invention relates to an inverse transform unit, a device, and a board card for inverse-transforming the element-wise multiplication data of a Winograd convolution.
Background
With the rapid development of the information age, research in the fields of artificial intelligence and machine learning has flourished, and the related industries have developed vigorously. Convolutional neural networks are widely used in computer vision, autonomous driving, machine translation, speech recognition, smart homes, and other areas.
Convolutional neural networks have large parameter counts and heavy computational loads, so the limited area and computing power of portable mobile terminals severely constrain the execution performance of convolutional neural network models; at the same time, a processor not specifically designed for the task incurs enormous power consumption when performing convolution operations.
Winograd convolution is a convolution acceleration scheme based on a polynomial interpolation algorithm. The two inputs of the convolution operation, the neurons and the weights, are partitioned at a certain scale and then each subjected to a linear transformation, the Winograd forward transform; the transformed neurons and weights are multiplied element-wise; the element-wise product is then subjected to another linear transformation, the Winograd inverse transform; and the final result is a convolution result equivalent to that of the original convolution operation.
In the Winograd convolution operation, the forward- and inverse-transformation matrices of the neurons and weights consist entirely of simple fixed values, so the forward and inverse transforms of the Winograd neurons and weights can be realized with additions alone. The only multiplications required by the Winograd algorithm occur in the element-wise multiplication step, whose multiplication complexity is considerably lower than that of the original convolution algorithm. Because the hardware cost (timing, power consumption, and area) of implementing multiplication is much higher than that of implementing addition at the same bit width, replacing the original convolution operation with Winograd convolution brings clear gains in hardware energy efficiency and in operation time.
However, no hardware has yet been designed specifically for the Winograd convolution acceleration algorithm, so existing artificial-intelligence chips cannot fully exhibit the advantages of the Winograd convolution operation. A hardware device capable of efficiently running the Winograd convolution algorithm is therefore urgently needed.
Disclosure of Invention
To at least partially solve the technical problems mentioned in the background, the present invention provides an inverse transform unit, a device, and a board card for inverse-transforming the element-wise multiplication data of a Winograd convolution.
In one aspect, the present invention discloses an inverse transform unit for inverse-transforming the element-wise multiplication data of a Winograd convolution, comprising an input buffer, a register file, and an adder group. The input buffer temporarily stores the element-wise multiplication data; the register file fetches the temporarily stored element-wise multiplication data from the input buffer and stores it at specific addresses to serve as addition operands; and the adder group reads the addition operands from the specific addresses and performs additions to generate a convolution result.
In another aspect, the present invention discloses an integrated circuit device including the inverse transform unit, and a board card including the integrated circuit device.
The hardware structure provided by the invention matches the Winograd convolution acceleration algorithm and has the technical effects of preserving network accuracy, accelerating performance, reducing area, and reducing power consumption.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are illustrated by way of example and not by way of limitation, and like reference numerals designate like or corresponding parts throughout the several views, in which:
FIG. 1 is a schematic diagram showing the conversion of the original convolution F(2×2, 3×3) into a Winograd convolution;
FIG. 2 is a structural diagram showing a board card according to an embodiment of the present invention;
FIG. 3 is a structural diagram showing an integrated circuit device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram showing the internal structure of a computing device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram showing a forward transform unit according to an embodiment of the present invention;
FIG. 6 is a schematic diagram showing an inverse transform unit according to an embodiment of the present invention; and
FIG. 7 is a schematic diagram illustrating a pipeline according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third" and "fourth", etc. in the claims, the description and the drawings of the present invention are used for distinguishing different objects and are not used for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims of this application, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
The following detailed description of the embodiments of the invention refers to the accompanying drawings.
The Winograd convolution acceleration algorithm (hereinafter, the Winograd algorithm or Winograd convolution) applies linear transformations to the operands of a convolution operation to find the minimum number of multiplications required, replacing the eliminated multiplications with some additional additions. In hardware terms, a multiplier is structurally more complex than an adder, costs more area and power, and has poorer overall processing performance, so the Winograd algorithm, which replaces multiplication with addition, offers great advantages when processing two-dimensional convolution operations.
For a two-dimensional convolution, the convolution result can be expressed as F(m×n, r×s), i.e., the output shape is m×n and the weight shape is r×s. The matrix form of the Winograd algorithm is as follows:
Y = A^T[(G g G^T) ⊙ (B^T d B)]A
where Y denotes the output matrix of the convolution operation; A^T is the inverse-transform left-multiplication constant matrix; G is the weight-transform left-multiplication constant matrix; g is the weight of the original convolution; G^T is the weight-transform right-multiplication constant matrix; ⊙ denotes element-wise multiplication; B^T is the neuron-transform left-multiplication constant matrix; d is the neuron data; B is the neuron-transform right-multiplication constant matrix; and A is the inverse-transform right-multiplication constant matrix. The left- and right-multiplication matrices of each transform are simply transposes of each other.
Taking F(2×2, 3×3) as an example, the constant matrices are as follows:

B^T =
[ 1   0  -1   0 ]
[ 0   1   1   0 ]
[ 0  -1   1   0 ]
[ 0   1   0  -1 ]

G =
[  1    0    0  ]
[ 1/2  1/2  1/2 ]
[ 1/2 -1/2  1/2 ]
[  0    0    1  ]

A^T =
[ 1   1   1   0 ]
[ 0   1  -1  -1 ]
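These constant matrices are the one-dimensional F(2, 3) transforms, which the two-dimensional algorithm applies on both sides. As a quick numerical check (an illustrative sketch, not part of the original disclosure; the sample data d and weights g are arbitrary), the one-dimensional identity Y = A^T[(G g) ⊙ (B^T d)] reproduces a direct 3-tap convolution with 4 multiplications instead of 6:

import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # 4 input elements
g = np.array([0.5, 1.0, -1.0])       # 3-tap filter

winograd = A_T @ ((G @ g) * (B_T @ d))       # 4 multiplications
direct = np.array([d[0:3] @ g, d[1:4] @ g])  # 6 multiplications
assert np.allclose(winograd, direct)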
FIG. 1 shows a schematic diagram of the conversion of the original convolution F(2×2, 3×3) into a Winograd convolution. As shown, the neuron data 101 is convolved with the convolution kernel 102. During calculation, the elements of the neuron data 101 inside the sliding window 103 are laid out as a row; sliding the window 4 times forms a 4×9 matrix 104. The elements of the convolution kernel 102 are arranged as a column to form a 9×1 matrix 105, and multiplying the 4×9 matrix 104 by the 9×1 matrix 105 yields the 4×1 convolution result 106.
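For illustration (an assumed numerical sketch, not part of the original disclosure; the values of d and g are arbitrary), the flattening of FIG. 1 can be reproduced for one 4x4 input tile:

import numpy as np

d = np.arange(16, dtype=float).reshape(4, 4)               # neuron data tile
g = np.array([[1., 2., 1.], [0., 1., 0.], [-1., 0., 1.]])  # 3x3 kernel

windows = [d[i:i+3, j:j+3].ravel() for i in range(2) for j in range(2)]
m104 = np.stack(windows)   # 4x9 matrix 104: one row per window position
m105 = g.reshape(9, 1)     # 9x1 matrix 105: kernel elements as a column
m106 = m104 @ m105         # 4x1 convolution result 106

ref = np.array([[(d[i:i+3, j:j+3] * g).sum()]
                for i in range(2) for j in range(2)])      # direct correlation
assert np.allclose(m106, ref)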
Further, partitioning along the dotted lines in the figure converts the 4×9 matrix 104 into a 2×3 block matrix 107, the 9×1 matrix 105 into a 3×1 block matrix 108, and the 4×1 convolution result 106 into a 2×1 block result 109. Writing the distinct 2×3 blocks of matrix 107 as K0, K1, K2, K3 (so that matrix 107 is [[K0, K1, K2], [K1, K2, K3]]) and the 3×1 blocks of matrix 108 as W0, W1, W2, the linear transformation gives the first element of the 2×1 convolution result 109 as R0 = M0 + M1 + M2 and the second as R1 = M1 - M2 - M3, where M0, M1, M2, M3 are given by the following sub-formulas:

M0 = (K0 - K2) W0
M1 = (K1 + K2)(W0 + W1 + W2) / 2
M2 = (K2 - K1)(W0 - W1 + W2) / 2
M3 = (K1 - K3) W2
through the segmentation and the linear transformation, the original convolution operation needs to execute 36 times of multiplication, while the Winograd algorithm only needs to execute 16 times of multiplication, so that the multiplication computation complexity is reduced by 2.25 times.
As the Winograd-algorithm conversion of the two-dimensional convolution shows, the Winograd algorithm divides into the following main steps. First, the weights are left- and right-multiplied by the weight constant matrices, i.e., G g G^T, yielding the Winograd-transformed weights; at the same time, the neuron data is left- and right-multiplied by the neuron constant matrices, i.e., B^T d B, yielding the Winograd-transformed neurons. Next, the Winograd-transformed neurons and weight matrix are multiplied element-wise, i.e., (G g G^T) ⊙ (B^T d B), producing the element-wise multiplication result. Finally, the element-wise multiplication result is left- and right-multiplied by the Winograd inverse-transform constant matrices, i.e., A^T[(G g G^T) ⊙ (B^T d B)]A, finally yielding a convolution result equivalent to the original convolution.
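End to end, the three steps can be sketched for a single 4x4 neuron tile (illustrative only; the unit names in the comments anticipate the hardware description below, and the tile and weight values are arbitrary):

import numpy as np

B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
                [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

d = np.arange(16, dtype=float).reshape(4, 4)               # 4x4 neuron tile
g = np.array([[1., 2., 1.], [0., 1., 0.], [-1., 0., 1.]])  # 3x3 weight

U = G @ g @ G.T      # step 1a: weight transform G g G^T (precomputed)
V = B_T @ d @ B_T.T  # step 1b: neuron forward transform B^T d B (the NTU's task)
M = U * V            # step 2:  element-wise multiplication, 16 mults (the MAC's task)
Y = A_T @ M @ A_T.T  # step 3:  inverse transform A^T M A (the ITU's task) -> 2x2

ref = np.array([[(d[i:i+3, j:j+3] * g).sum() for j in range(2)]
                for i in range(2)])                        # direct: 36 mults
assert np.allclose(Y, ref)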
From the perspective of hardware design, these three transformation steps are designed in a pipelined manner according to the dependencies among the three processes and their distinct operational characteristics, thereby achieving highly efficient acceleration.
FIG. 2 shows a schematic structural diagram of a board card 20 according to an embodiment of the present invention. As shown in FIG. 2, the board card 20 includes a chip 201, which is a system-on-chip (SoC) integrating one or more combined processing devices. A combined processing device is an artificial-intelligence computing unit that supports various deep-learning and machine-learning algorithms and meets the intelligent processing demands of fields such as computer vision, speech, natural language processing, and data mining under complex scenarios. Deep-learning technology in particular is widely applied in cloud intelligence, and one notable characteristic of cloud-intelligence applications is the large size of the input data, which places high demands on the storage and computing capacity of the platform.
The chip 201 is connected to an external device 203 through an external interface device 202. The external device 203 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed can be transferred from the external device 203 to the chip 201 through the external interface device 202, and the computation results of the chip 201 can be transmitted back to the external device 203 via the external interface device 202. The external interface device 202 may take different interface forms, such as a PCIe interface, depending on the application scenario.
The board card 20 also includes a storage device 204 for storing data, which includes one or more storage units 205. The storage device 204 is connected to the control device 206 and the chip 201 via a bus and exchanges data with them. The control device 206 on the board card 20 is configured to regulate the state of the chip 201; to this end, in one application scenario, the control device 206 may include a micro-controller unit (MCU).
FIG. 3 is a structural diagram of the combined processing device in the chip 201 of this embodiment. As shown in FIG. 3, the combined processing device 30 includes a computing device 301, an interface device 302, a processing device 303, and a DRAM 304.
The computing device 301 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor that performs deep-learning or machine-learning computations, in particular Winograd convolution operations; it can interact with the processing device 303 through the interface device 302 to jointly complete the user-specified operations.
The interface device 302 transfers data and control instructions between the computing device 301 and the processing device 303. For example, the computing device 301 may obtain input data from the processing device 303 via the interface device 302 and write it to an on-chip storage device of the computing device 301; it may further obtain control instructions from the processing device 303 via the interface device 302 and write them into an on-chip control cache. Alternatively, the interface device 302 may read data from a storage device of the computing device 301 and transmit it to the processing device 303.
The processing device 303, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and the starting and/or stopping of the computing device 301. Depending on the implementation, the processing device 303 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number may be determined according to actual needs. On its own, the computing device 301 of the present invention can be regarded as having a single-core or homogeneous multi-core structure; considered together, however, the computing device 301 and the processing device 303 form a heterogeneous multi-core structure.
The DRAM 304 is off-chip memory, typically 16 GB or larger, used to store the data to be processed by the computing device 301 and/or the processing device 303, in particular the neuron data and weights on which the Winograd convolution operation is to be performed. In this embodiment, the processing device 303 has linearly transformed the weights of the original convolution into the Winograd weights G g G^T in advance and stored them in the DRAM 304.
FIG. 4 illustrates a block diagram of the computing device 301. The computing device 301 includes a bus 401, a direct memory access (DMA) module 402, an instruction cache (Iram) 407, a decoding unit (IDU) 408, a neuron cache (Nram) 409, a forward transform unit (NTU) 410, a forward-transformed data cache (WNram) 411, a weight cache (Wram) 412, an element-wise multiply-accumulate module (MAC) 413, an element-wise multiplication data cache (WRram) 414, an inverse transform unit (ITU) 415, a result cache (Rram) 416, and a logical operation module (ALU) 417.
The bus 401 is the common communication trunk over which information is transmitted between the units; physically it is a bundle of transmission lines. According to the kind of information transmitted within the combined processing device 30, namely data, data addresses, and instructions, the bus 401 is the collective name for a data bus, an address bus, and a control bus. The bus 401 serves as the communication channel between the DRAM 304 and the computing device 301, and in this embodiment is specifically PCIe.
The DMA module 402 copies data from one address space to another, typically by transferring data between external memory (e.g., the DRAM 304) and the internal caches of the computing device 301. When a DMA transfer is to be performed, the processing device 303 hands control of the bus to the DMA module 402; the DMA module 402 controls the bus 401 to transfer the data; and after the DMA transfer completes, the DMA module 402 hands bus control back to the processing device 303.
The DMA module 402 includes a neuron direct memory access (NDMA) 403, a weight direct memory access (WDMA) 404, an instruction direct memory access (IDMA) 405, and a result direct memory access (RDMA) 406. The NDMA 403 inputs neuron data from the DRAM 304, the WDMA 404 inputs Winograd weights from the DRAM 304, the IDMA 405 inputs instructions from the DRAM 304, and the RDMA 406 outputs the calculation results to the DRAM 304. In other embodiments, the NDMA 403, WDMA 404, IDMA 405, and RDMA 406 may be implemented by the same direct memory access.
The Iram 407 temporarily stores the instructions input by the IDMA 405; the IDU 408 fetches the instructions from the Iram 407, decodes them, and controls the operation of the other units according to the decoded instructions. The IDU 408 is the decoding and scheduling unit of the entire computing device 301: it decodes the control instructions obtained from the DRAM 304 and converts them into control signals that coordinate the on-chip modules/units, and it is also responsible for tasks such as preserving instruction order, resolving instruction dependencies, branch prediction, exception handling, and interrupt handling. In the figures, thin arrows indicate control flows and thick arrows indicate data flows.
The Nram 409 temporarily stores, according to the decoded instruction, the neuron data sent by the NDMA 403; the NTU 410 reads the neuron data from the Nram 409 according to the decoded instruction and performs the forward transform on it, i.e., computes B^T d B, to generate the forward-transformed data, which is temporarily stored in the WNram 411.
FIG. 5 shows a schematic diagram of the NTU 410. The NTU 410 includes an input buffer 501, a register file 502, an adder group 503, and an output buffer 504.
When the NTU 410 receives an instruction to load neuron data from the Nram 409, the input buffer 501 acts as a first-in-first-out queue buffer that temporarily stores the neuron data. The neuron-data loading stage continues until all the data has been received; convolution filters of different sizes are configured with fixed, independent cache-resource partitions and input counts, and the whole process is controlled by instructions issued by the IDU 408.
The register file 502 fetches the temporarily stored neuron data from the input buffer 501 in a prescribed operation order according to the decoded instruction and stores it at specific addresses of the register file 502; the neuron data stored at those addresses serves as the addition operands. In this embodiment, because the input, operation, and output pipeline stages are of equal length, the cache hardware resources would otherwise be contended. To resolve this resource dependency, the register file 502 is divided into a ping storage unit 505 and a pong storage unit 506 of equal size: the ith addition operands and the forward-transformed data they produce are temporarily stored in the ping storage unit 505; the (i+1)th addition operands and forward-transformed data are temporarily stored in the pong storage unit 506; the (i+2)th addition operands and forward-transformed data are then stored in the ping storage unit 505, overwriting the ith; and the register file 502 continues to store data according to this rule.
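As a software analogy of the ping/pong rule (assumed behavior for illustration, not the actual register-file hardware):

def run_tiles(tiles, transform):
    halves = [None, None]                 # [ping 505, pong 506]
    outputs = []
    for i, tile in enumerate(tiles):
        h = i % 2                         # even i -> ping, odd i -> pong
        halves[h] = tile                  # overwrites the (i-2)th contents
        halves[h] = transform(halves[h])  # result stays in the same half
        outputs.append(halves[h])
    return outputs

print(run_tiles([1, 2, 3, 4], lambda t: t + 10))  # -> [11, 12, 13, 14]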
The adder group 503 sequentially reads the addition operands from the specific addresses of the register file 502 according to the decoded instruction and performs the additions. In this embodiment, there are 2 adder groups 503, corresponding to the addition scheduling direction; each group includes 16 adders corresponding to the vectorization direction, and each adder is an FB32 adder. The additions of the forward transform of the Winograd convolution are performed along the channel direction of the neuron data in a specific order: first the additions for the left-multiplication matrix B^T of the Winograd convolution are computed, then the additions for the right-multiplication matrix B, finally producing the forward-transformed data, which is stored back into the register file 502. The operation order, register allocation, and operation time all depend on the convolution filter size and are controlled by instructions issued by the IDU 408. The operation stage has a data dependency on the neuron-data loading stage; the two are executed in a pipelined manner, realized in hardware by counting.
The output buffer 504 is likewise a first-in-first-out queue buffer, temporarily storing the forward-transformed data in turn from the ping storage unit 505 and the pong storage unit 506. The output stage can output to the corresponding cache only after the operation stage has completed in its entirety.
The WNram 411 includes a plurality of cache units, e.g., 4 cache units; the forward-transformed data from the NTU 410 is sent to one or more of these cache units by route distribution.
Referring back to FIG. 4, the Wram 412 temporarily stores, according to the decoded instructions, the Winograd weights sent from the WDMA 404. The MAC 413 reads the Winograd weights from the Wram 412 and the forward-transformed data from the WNram 411 according to the decoded instructions and performs the element-wise multiply-accumulate operation on them, i.e., computes (G g G^T) ⊙ (B^T d B), generating the element-wise multiplication data, which is temporarily stored in the WRram 414.
In this embodiment, the MAC 413 includes 64 MAC operators divided into 4 groups that execute 4 different batches of operations, with the 16 MAC operators in each group laid out independently. The forward-transformed data of the WNram 411 must be sent to all 64 MAC operators simultaneously so that it can be multiply-accumulated element-wise with different Winograd weights; the WNram 411 therefore sends the forward-transformed data by broadcast or distribution routing. Because the output load is large, to guarantee drive capability and timing, the forward-transformed data of the WNram 411 passes through two stages of broadcast or distribution routing, N1 and N2: the data is first sent to 4 N1 nodes, each N1 node broadcasts or distributes to 4 N2 nodes, and each N2 node broadcasts or distributes to 4 MAC operators. Each MAC operator first performs the element-wise multiplication and then accumulates the resulting vector in sequence; its logical function is equivalent to computing a vector inner product, i.e., the computation of one element value in a matrix multiplication.
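The assumed fan-out topology (1 root to 4 N1 nodes, each N1 node to 4 N2 nodes, each N2 node to 4 MAC operators) can be sketched as follows; no single driver ever fans out to more than 4 loads, yet all 64 MAC operators receive the same forward-transformed data:

def fan_out(tile):
    n1_nodes = [tile] * 4                               # root drives 4 N1 nodes
    n2_nodes = [t for t in n1_nodes for _ in range(4)]  # each N1 drives 4 N2 nodes
    macs = [t for t in n2_nodes for _ in range(4)]      # each N2 drives 4 MACs
    return macs

assert len(fan_out("V_tile")) == 64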
The ITU 415 reads the element-wise multiplication data from the WRram 414 according to the decoded instruction and inverse-transforms it, i.e., computes A^T[(G g G^T) ⊙ (B^T d B)]A, thereby obtaining the convolution result, which is temporarily stored in the Rram 416.
FIG. 6 shows a schematic diagram of the ITU 415. The ITU 415 includes an input buffer 601, a register file 602, an adder group 603, and an output buffer 604.
When the ITU 415 receives an instruction to load the element-wise multiplication data from the WRram 414, the input buffer 601 acts as a first-in-first-out queue buffer that temporarily stores the element-wise multiplication data. The loading stage continues until all the data has been received; convolution filters of different sizes are configured with fixed, independent cache-resource partitions and input counts, and the whole process is controlled by instructions issued by the IDU 408.
The register file 602 fetches the temporarily stored element-wise multiplication data from the input buffer 601 in a fixed operation order according to the decoded instruction and stores it at specific addresses of the register file 602; the element-wise multiplication data stored at those addresses serves as the addition operands. As before, to resolve the resource dependency, the register file 602 has a ping storage unit 605 and a pong storage unit 606 of equal size: the ith addition operands and the convolution result they produce are temporarily stored in the ping storage unit 605; the (i+1)th addition operands and convolution result are temporarily stored in the pong storage unit 606; the (i+2)th addition operands and convolution result are stored in the ping storage unit 605, overwriting the ith addition operands and convolution result; and the register file 602 continues to store data according to this rule.
The adder group 603 sequentially reads the addition operands from the specific addresses of the register file 602 according to the decoded instruction and performs the additions. Like the adder group 503, there are 2 adder groups 603, corresponding to the addition scheduling direction; each group includes 16 adders corresponding to the vectorization direction, and each adder is an FB32 adder. The additions of the inverse transform of the Winograd convolution are performed along the channel direction of the element-wise multiplication data in a specific order: first the additions for the left-multiplication matrix A^T of the Winograd convolution are computed, then the additions for the right-multiplication matrix A, generating the convolution result, which is stored back into the register file 602. The operation order, register allocation, and operation time all depend on the convolution filter size and are controlled by instructions issued by the IDU 408. The operation stage has a data dependency on the aforementioned loading stage of the element-wise multiplication data; the two are executed in a pipelined manner, realized in hardware by counting.
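Because every entry of A^T for F(2×2, 3×3) is 0 or ±1, the inverse transform reduces entirely to signed additions, which is why the ITU needs only adder groups. A minimal sketch of the left-multiply-then-right-multiply order (illustrative scheduling, not the adder-group circuitry):

import numpy as np

def inverse_transform_add_only(M):
    # left-multiply by A^T: rows combine as m0+m1+m2 and m1-m2-m3
    t = np.empty((2, 4))
    t[0] = M[0] + M[1] + M[2]
    t[1] = M[1] - M[2] - M[3]
    # right-multiply by A: columns combine the same way
    Y = np.empty((2, 2))
    Y[:, 0] = t[:, 0] + t[:, 1] + t[:, 2]
    Y[:, 1] = t[:, 1] - t[:, 2] - t[:, 3]
    return Y

M = np.arange(16, dtype=float).reshape(4, 4)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)
assert np.allclose(inverse_transform_add_only(M), A_T @ M @ A_T.T)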
The output buffer 604 is likewise a first-in-first-out queue buffer, temporarily storing the convolution results in turn from the ping storage unit 605 and the pong storage unit 606. The output stage can output to the corresponding cache only after the operation stage has completed in its entirety.
Besides Winograd convolution, the computing device 301 can perform all neural-network-related operations, and the ALU 417 performs two kinds of tasks according to the decoded instructions. The first is convolution-fusion operations, i.e., operations that can be completed on chip together with a convolution layer in one pass without depending on additional data, including processes such as activation, bias addition, partial-sum computation, and accumulation; the second is non-convolution operations. The results of the operations performed by the ALU 417 are likewise cached in the Rram 416. The presence of the ALU 417 ensures that the various operations of a convolutional neural network can be fully realized within the computing device 301, giving the computing device 301 the generality and completeness needed for neural networks.
According to the decoded instruction, the RDMA 406 fetches the convolution result from the Rram 416 and outputs it to the DRAM 304, completing the entire convolution operation. Similarly, the RDMA 406 may also fetch the other operation results generated by the ALU 417 from the Rram 416 and output them to the DRAM 304 according to the decoded instruction.
Because convolution operations involve data of enormous scale, to reduce instruction start-up overhead this embodiment further uses instruction control to make the relevant modules/units execute as a pipeline, improving hardware utilization.
As the above shows, the input timing and data scale of the neuron data affect the neuron forward-transform process of the Winograd convolution instruction; the input timing and data scale of the weight data likewise affect its element-wise multiply-accumulate process; and the completion timing of the Winograd inverse transform affects the execution of the instruction that outputs the convolution result. From a control standpoint, therefore, the order of the instructions and their execution time points are critical. In addition, this embodiment inserts synchronization instructions between instructions that have dependency relationships, resolving the data dependencies between the input/output program and the Winograd convolution program.
FIG. 7 shows a schematic diagram of the pipeline of this embodiment, in which the IDU 408 mainly controls the pipelined cooperation among the Nram 409, NTU 410, Wram 412, MAC 413, ITU 415, and Rram 416.
During the ith convolution operation, the IDU 408 sends an instruction controlling the Nram 409 to start loading neuron data i from the DRAM 304 at time T1; the loading completes at time T2. Before neuron data i has finished loading, at time T3, the IDU 408 controls the NTU 410 according to a synchronization instruction to start reading neuron data i from the Nram 409 for the forward transform, so as to generate forward-transformed data i. From time T3 onward, while the Nram 409 is still loading neuron data i, the NTU 410 simultaneously reads neuron data i from the Nram 409 and forward-transforms it; forward-transformed data i is completed at time T4.
The convolution of neuron data i must be paired with Winograd weight i. In the hardware structure of this embodiment, the NDMA 403 is responsible for inputting neuron data i and the WDMA 404 for inputting Winograd weight i, and the two could proceed in parallel. However, considering that the input/output bandwidth of the computing device 301 is fixed, and that neuron data i must pass through the NTU 410's forward transform before being multiply-accumulated element-wise with Winograd weight i in the MAC 413, this embodiment is designed so that neuron data i is loaded first, at time T1, and Winograd weight i is input to the Wram 412 only after the forward transform has started at time T3. In this way the Nram 409, NTU 410, Wram 412, and MAC 413 cooperate well, avoiding as far as possible the idling or blocking of any module/unit. To this end, the IDU 408 controls the Wram 412 according to a synchronization instruction to start loading Winograd weight i from the DRAM 304 before forward-transformed data i has been completely generated. The time at which loading of Winograd weight i starts may be determined by the input/output bandwidth of the computing device 301; preferably it is time T3, i.e., starting the forward transform and starting to load Winograd weight i are executed simultaneously. Assume Winograd weight i also finishes loading at time T4.
Before Winograd weight i has finished loading, at time T5, the IDU 408 controls the MAC 413 according to a synchronization instruction to start the element-wise multiply-accumulate of forward-transformed data i and Winograd weight i, so as to generate element-wise multiplication data i. From time T5 onward, while the Wram 412 is still loading Winograd weight i, the MAC 413 simultaneously performs the element-wise multiply-accumulate; element-wise multiplication data i is completed at time T6.
Before element-wise multiplication data i has been completely generated, at time T7, the IDU 408 controls the ITU 415 according to the instruction to start reading element-wise multiplication data i from the WRram 414 for the inverse transform, so as to generate convolution result i. From time T7 onward, while the MAC 413 is still performing the element-wise multiply-accumulate, the ITU 415 simultaneously performs the inverse transform; convolution result i is completed at time T8.
There are two possible times at which the Rram 416 may start temporarily storing convolution result i: before convolution result i is completely generated, i.e., between times T7 and T8, or after convolution result i is completely generated. FIG. 7 illustrates the latter: at time T8, the IDU 408 controls the Rram 416 according to a synchronization instruction to start temporarily storing convolution result i, and the temporary storage finishes at time T9.
After the input of neuron data i has finished, the (i+1)th convolution operation can begin. The IDU 408 sends an instruction controlling the Nram 409 to start loading neuron data i+1 from the DRAM 304 at time T2; the loading completes at time T10. That is, the Nram 409 starts loading neuron data i+1 before forward-transformed data i has been completely generated. Before neuron data i+1 has finished loading, at time T4, the IDU 408 controls the NTU 410 according to a synchronization instruction to start reading neuron data i+1 from the Nram 409 for the forward transform, so as to generate forward-transformed data i+1. From time T4 onward, while the Nram 409 is loading neuron data i+1, the NTU 410 simultaneously reads neuron data i+1 from the Nram 409 and forward-transforms it; forward-transformed data i+1 is completed at time T11.
The IDU 408 controls the Wram 412 according to a synchronization instruction to start loading Winograd weight i+1 from the DRAM 304 before forward-transformed data i+1 has been completely generated and before the MAC 413 has completely generated element-wise multiplication data i. The time at which loading of Winograd weight i+1 starts may be determined by the input/output bandwidth of the computing device 301; preferably it is time T4, i.e., starting the forward transform and starting to load Winograd weight i+1 are executed simultaneously. Assume Winograd weight i+1 also finishes loading at time T11.
Before Winograd weight i+1 has finished loading, at time T6 and before the ITU 415 has completely generated convolution result i, the IDU 408 controls the MAC 413 according to a synchronization instruction to start the element-wise multiply-accumulate of forward-transformed data i+1 and Winograd weight i+1, so as to generate element-wise multiplication data i+1. From time T6 onward, while the Wram 412 is loading Winograd weight i+1, the MAC 413 simultaneously performs the element-wise multiply-accumulate; element-wise multiplication data i+1 is completed at time T12.
Before element-wise multiplication data i+1 has been completely generated, at time T8, the IDU 408 controls the ITU 415 according to the instruction to start reading element-wise multiplication data i+1 from the WRram 414 for the inverse transform, so as to generate convolution result i+1. From time T8 onward, while the MAC 413 is performing the element-wise multiply-accumulate, the ITU 415 simultaneously performs the inverse transform; convolution result i+1 is completed at time T13.
Similarly, there are two possible times at which the Rram 416 may start temporarily storing convolution result i+1: before convolution result i+1 is completely generated, i.e., between times T9 and T13, or after convolution result i+1 is completely generated. If, for example, the temporary storage starts after convolution result i+1 is completely generated, then at time T13 the IDU 408 controls the Rram 416 according to a synchronization instruction to start temporarily storing convolution result i+1, and the temporary storage finishes at time T14.
Based on the structure of the computing device 301 shown in FIG. 4, this embodiment performs the Winograd convolution operation according to the foregoing flow, fully exploiting the advantages of the hardware and improving input/output and operation efficiency.
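The overlap can be visualized with a toy timeline (durations are invented for illustration; only the ordering of the T labels follows the description above):

timeline = [
    ("Nram: load neuron data i",         1, 4),   # T1 -> T2
    ("NTU:  forward transform i",        2, 5),   # T3 -> T4, overlaps the load
    ("Wram: load Winograd weight i",     2, 5),   # starts with the transform
    ("MAC:  element-wise mul-acc i",     4, 7),   # T5 -> T6, overlaps the load
    ("ITU:  inverse transform i",        6, 9),   # T7 -> T8, overlaps the MAC
    ("Rram: store convolution result i", 9, 10),  # T8 -> T9
]
for name, start, end in timeline:
    print(f"{name:36s}|{' ' * start}{'#' * (end - start)}")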
The present invention designs hardware around the characteristics of the Winograd algorithm to achieve generally applicable acceleration, provides a pipelined mode of operation that increases the speed of the Winograd convolution operation, and makes full use of reusable resources during the hardware implementation through methods such as time-division multiplexing and broadcast routing. The hardware structure provided by the invention matches the Winograd convolution algorithm and has the technical effects of preserving network accuracy, accelerating performance, reducing area, and reducing power consumption.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an Internet-of-Things terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, mobile storage, a wearable device, a visual terminal, an autonomous-driving terminal, a means of transport, a household appliance, and/or medical equipment. The means of transport include an airplane, a ship, and/or a vehicle; the household appliances include a television, an air conditioner, a microwave oven, a refrigerator, an electric rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical equipment includes a nuclear magnetic resonance apparatus, a B-mode ultrasound scanner, and/or an electrocardiograph. The electronic device or apparatus of the present invention can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical care. Furthermore, the electronic device or apparatus can be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, an electronic device or apparatus with high computing power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and that of the terminal device and/or edge device are mutually compatible, so that, based on the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, achieving unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
It is noted that for the sake of simplicity, the present invention sets forth some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the inventive arrangements are not limited by the order of acts described. Accordingly, persons skilled in the art may appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the invention. Further, those skilled in the art will appreciate that the described embodiments of the invention are capable of being practiced in other alternative embodiments that may involve fewer acts or modules than are necessary to practice one or more aspects of the invention. In addition, the description of some embodiments of the present invention is also focused on different schemes. In view of this, those skilled in the art will understand that portions of the present invention that are not described in detail in one embodiment may also refer to related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, one of ordinary skill in the art will appreciate that the several embodiments disclosed herein can be practiced in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the scheme of the embodiment of the invention. In addition, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, as a specific hardware circuit, which may include digital circuits and/or analog circuits, etc. The physical realization of the hardware structure of the circuits may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In this regard, the various devices described herein (e.g., the computing device or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), and may be, for example, a resistive random-access memory (RRAM), a dynamic random-access memory (DRAM), a static random-access memory (SRAM), an enhanced dynamic random-access memory (EDRAM), a high-bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause a1, an inverse transform unit that inversely transforms bit-multiplied data of a Winograd convolution, comprising: an input buffer for temporarily storing the bit-aligned multiplication data; the register file is used for taking out the temporarily stored opposite-bit multiplier data from the input cache and storing the opposite-bit multiplier data to a specific address so as to become an addition operand; and the adder group is used for reading the addition operand from the specific address to carry out addition operation so as to generate a convolution result.
Clause a2, the inverse transform unit according to clause a1, wherein the number of adder groups is 2 groups.
Clause A3, the inverse transform unit of clause a2, wherein each group of adders includes 16 adders performing addition operations in a specific order in a channel direction of the neuron data.
Clause a4, the inverse transform unit of clause A3, wherein each adder is an FB32 adder.
Clause a5, the inverse transform unit of clause A3, wherein the specific order is the addition of first calculating the left-times matrix of the Winograd convolution and then the addition of calculating the right-times matrix of the Winograd convolution to produce the convolution result.
Clause a6, the inverse transform unit of clause a1, further comprising an output buffer to temporarily store the convolution results from the adder group.
Clause a7, the inverse transform unit of clause a6, wherein the input buffer, the register file, and the output buffer are pipelined.
Clause A8, the inverse transform unit of clause a6, wherein the input buffer and the output buffer are first-in-first-out queue buffers.
Clause a9, the inverse transform unit according to clause a1, connected to a convolution result buffer (Rram) for temporarily storing the convolution result, the convolution result buffer outputting the convolution result to an off-chip memory.
Clause a10, the inverse transform unit of clause a1, connected to a decode unit (IDU) configured to decode a plurality of instructions and control the inverse transform unit according to the decoded plurality of instructions.
Clause a11, an integrated circuit device, comprising the inverse transform unit of any one of clauses a 1-10.
Clause a12, a board comprising the integrated circuit device of clause a 11.
The embodiments of the present invention have been described above in detail; specific examples have been applied herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (12)

1. An inverse transform unit for inverse-transforming the element-wise multiplication data of a Winograd convolution, comprising:
an input buffer for temporarily storing the element-wise multiplication data;
a register file for fetching the temporarily stored element-wise multiplication data from the input buffer and storing it at specific addresses to serve as addition operands; and
an adder group for reading the addition operands from the specific addresses and performing additions to generate a convolution result.
2. The inverse transform unit of claim 1, wherein the number of adder groups is 2.
3. The inverse transform unit of claim 2, wherein each adder group includes 16 adders that perform additions in a specific order along the channel direction of the neuron data.
4. The inverse transform unit of claim 3, wherein each adder is an FB32 adder.
5. The inverse transform unit of claim 3, wherein the specific order is to compute first the additions for the left-multiplication matrix of the Winograd convolution and then the additions for the right-multiplication matrix of the Winograd convolution, producing the convolution result.
6. The inverse transform unit of claim 1, further comprising an output buffer for temporarily storing the convolution result from the adder group.
7. The inverse transform unit of claim 6, wherein the input buffer, the register file, and the output buffer operate in a pipelined manner.
8. The inverse transform unit of claim 6, wherein the input buffer and the output buffer are first-in-first-out queue buffers.
9. The inverse transform unit of claim 1, connected to a convolution result buffer (Rram) for temporarily storing the convolution result, the convolution result buffer outputting the convolution result to an off-chip memory.
10. The inverse transform unit of claim 1, connected to a decoding unit (IDU) configured to decode a plurality of instructions and to control the inverse transform unit according to the decoded instructions.
11. An integrated circuit device comprising the inverse transform unit of any one of claims 1 to 10.
12. A board card comprising the integrated circuit device of claim 11.
CN202011580527.8A 2020-12-28 2020-12-28 Inverse transformation unit, device and board card for inverse transformation of Winograd convolution bit-to-bit multiplication data Pending CN114692849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011580527.8A CN114692849A (en) 2020-12-28 2020-12-28 Inverse transformation unit, device and board card for inverse transformation of Winograd convolution bit-to-bit multiplication data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011580527.8A CN114692849A (en) 2020-12-28 2020-12-28 Inverse transformation unit, device and board card for inverse transformation of Winograd convolution bit-to-bit multiplication data

Publications (1)

Publication Number Publication Date
CN114692849A true CN114692849A (en) 2022-07-01

Family

ID=82129162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011580527.8A Pending CN114692849A (en) 2020-12-28 2020-12-28 Inverse transformation unit, device and board card for inverse transformation of Winograd convolution bit-to-bit multiplication data

Country Status (1)

Country Link
CN (1) CN114692849A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination