CN115438777A - Device for performing Winograd convolution forward transform on neuron data - Google Patents

Device for performing Winograd convolution forward transform on neuron data

Info

Publication number
CN115438777A
Authority
CN
China
Prior art keywords
data
addition
winograd
forward transform
convolution
Prior art date
Legal status
Pending
Application number
CN202110614459.0A
Other languages
Chinese (zh)
Inventor
Not announced (inventor withheld from publication)
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110614459.0A priority Critical patent/CN115438777A/en
Publication of CN115438777A publication Critical patent/CN115438777A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a device for performing a Winograd convolution forward transform on neuron data, wherein the device comprises a forward transform unit and a forward transform data buffer. The forward transform unit is used for forward-transforming the neuron data to generate forward transform data; the forward transform data buffer is used for temporarily storing the forward transform data. The invention has the technical effects of preserving network accuracy, accelerating performance, reducing area, and reducing power consumption.

Description

Device for performing Winograd convolution forward transform on neuron data
Technical Field
The present invention relates generally to the field of neural networks. More particularly, the present invention relates to an apparatus for performing a Winograd convolution forward transform on neuron data.
Background
With the rapid development of the information age, research in artificial intelligence and machine learning has flourished, and the related industries have developed vigorously. Convolutional neural networks are widely used in computer vision, autonomous driving, machine translation, speech recognition, smart home, and many other fields.
Convolutional neural networks have large numbers of parameters and large amounts of computation, so the execution performance of a convolutional neural network model is severely limited by the restricted area and computing power of a portable mobile terminal; at the same time, a processor not specially designed for convolution incurs a huge power consumption overhead when performing convolution operations.
Winograd convolution is a convolution acceleration scheme based on a polynomial interpolation algorithm. The two inputs of the convolution operation, the neurons and the weights, are first tiled at a certain scale and then each undergoes a linear transformation, namely the Winograd forward transform; the transformed neurons and weights are multiplied element-wise, the element-wise product is linearly transformed again, namely the Winograd inverse transform, and finally a convolution result equivalent to the original convolution operation is obtained.
In the Winograd convolution operation, the forward- and inverse-transform matrices of the neurons and the weights consist of simple fixed values, so the Winograd forward and inverse transforms of the neurons and weights can be implemented with additions only. The multiplications required by the Winograd algorithm occur only in the element-wise multiplication step, whose multiplication complexity is considerably lower than that of the original convolution algorithm. Because the hardware cost (timing, power consumption, and area) of a multiplier is much higher than that of an adder of the same bit width, replacing the original convolution with Winograd convolution brings clear gains in hardware energy efficiency and in computation time.
However, no hardware has so far been designed specifically for the Winograd convolution acceleration algorithm, so existing artificial intelligence chips cannot fully demonstrate the advantages of the Winograd convolution algorithm. A hardware device capable of efficiently running the Winograd convolution algorithm is therefore urgently needed.
Disclosure of Invention
In order to at least partially solve the technical problems mentioned in the background, the invention provides a device and a board card for performing Winograd convolution forward transform on neuron data.
In one aspect, an integrated circuit device for performing a Winograd convolution forward transform on neuron data includes a forward transform unit for forward-transforming the neuron data to generate forward transform data. The forward transform unit comprises an input buffer, a register file, and an adder group.
The input buffer is used to temporarily store the neuron data. The register file is used to fetch the temporarily stored neuron data from the input buffer according to one of the decoded instructions and to store it to specific addresses, where it becomes a plurality of addition operands. The adder group is used to read the addition operands one by one from the specific addresses according to one of the decoded instructions and perform addition operations.
The adder group decomposes each addition operand into a plurality of addition element operands according to the number of elements of the addition operand; in each addition element operand, only one element has the same value as the element at the corresponding position of the addition operand, and all other elements are 0. The adder group operates on the plurality of addition element operands to obtain a plurality of Winograd forward-transform intermediate results, and sums the plurality of Winograd forward-transform intermediate results to obtain the forward transform data.
The hardware structure provided by the invention cooperates with the Winograd convolution acceleration algorithm, and has the technical effects of preserving network accuracy, accelerating performance, reducing area, and reducing power consumption.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are illustrated by way of example and not by way of limitation, and like reference numerals designate like or corresponding parts throughout the several views, in which:
FIG. 1 is a diagram showing the conversion of an original convolution of F(2×2, 3×3) into a Winograd convolution;
FIG. 2 is a structural diagram showing a board card of an embodiment of the present invention;
FIG. 3 is a block diagram illustrating an integrated circuit device of an embodiment of the invention;
FIG. 4 is a schematic diagram showing the internal structure of a computing device of an embodiment of the present invention;
FIG. 5 is a schematic diagram showing a forward transform unit of an embodiment of the present invention;
FIG. 6 is a diagram illustrating the disassembly of an add operand into multiple add element operands according to an embodiment of the invention;
FIG. 7 is a schematic diagram illustrating a forward transform data cache of an embodiment of the present invention; and
FIG. 8 is a diagram illustrating a bit multiply accumulate operator according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third" and "fourth", etc. in the claims, the description and the drawings of the present invention are used for distinguishing different objects and are not used for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims of this application, the singular form of "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this specification refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
The following detailed description of embodiments of the invention refers to the accompanying drawings.
The Winograd convolution acceleration algorithm (hereinafter referred to as the Winograd algorithm or Winograd convolution) is a transformation method that applies linear transforms to the operands of the convolution operation to find the minimum number of multiplications required, replacing the eliminated multiplications with additional addition operations. In hardware terms, a multiplier is structurally more complex than an adder, has larger area and power consumption, and has poorer overall processing performance, so the Winograd algorithm, which replaces multiplication with addition, has great advantages when processing two-dimensional convolution operations.
For a two-dimensional convolution, the convolution result can be expressed as F(m×n, r×s), i.e., the output shape is m×n and the weight shape is r×s. The matrix representation of the Winograd algorithm is as follows:
Y = A^T[(GgG^T) ⊙ (B^T dB)]A
where Y is the output matrix of the convolution operation, A^T is the inverse-transform left-multiplication constant matrix, G is the weight-transform left-multiplication constant matrix, g is the weight of the original convolution, G^T is the weight-transform right-multiplication constant matrix, ⊙ denotes element-wise multiplication, B^T is the neuron-transform left-multiplication constant matrix, d is the neuron data, B is the neuron-transform right-multiplication constant matrix, and A is the inverse-transform right-multiplication constant matrix. The left- and right-multiplication matrices of each transform are simply transposes of each other.
Taking F(2×2, 3×3) as an example, the constant matrices are as follows:
B^T =
[ 1   0  -1   0 ]
[ 0   1   1   0 ]
[ 0  -1   1   0 ]
[ 0   1   0  -1 ]
G =
[ 1     0     0   ]
[ 1/2   1/2   1/2 ]
[ 1/2  -1/2   1/2 ]
[ 0     0     1   ]
A^T =
[ 1   1   1   0 ]
[ 0   1  -1  -1 ]
FIG. 1 shows the conversion of an original convolution of F(2×2, 3×3) into a Winograd convolution. As shown, the neuron data 101 is convolved with the convolution kernel 102. During computation, the elements of the neuron data 101 covered by the sliding window 103 are arranged into a row; the sliding window 103 slides 4 times, forming a 4×9 matrix 104. The elements of the convolution kernel 102 are then arranged into a column, forming a 9×1 matrix 105. The 4×9 matrix 104 is multiplied by the 9×1 matrix 105 to obtain the 4×1 convolution result 106.
Further, partitioning along the dotted lines in the figure, the 4×9 matrix 104 is converted into a 2×3 block matrix 107, the 9×1 matrix 105 is converted into a 3×1 block matrix 108, and the 4×1 convolution result 106 is converted into a 2×1 convolution result 109. After the linear transformation, the first element of the 2×1 convolution result 109 is R_0 = M_0 + M_1 + M_2, and the second element is R_1 = M_1 - M_2 - M_3, where M_0, M_1, M_2 and M_3 are given by the following sub-formulas:
M_0 = (K_0 - K_2) · W_0
M_1 = (K_1 + K_2) · (W_0 + W_1 + W_2)/2
M_2 = (K_2 - K_1) · (W_0 - W_1 + W_2)/2
M_3 = (K_1 - K_3) · W_2
Through this partitioning and linear transformation, the original convolution operation needs to execute 36 multiplications, while the Winograd algorithm needs to execute only 16, so the multiplication complexity is reduced by a factor of 2.25.
From the Winograd formulation of the two-dimensional convolution, it can be seen that the Winograd algorithm is mainly divided into the following steps. First, the weights are left- and right-multiplied by the weight constant matrices, i.e., GgG^T, to obtain the Winograd-transformed weights; at the same time, the neuron data is left- and right-multiplied by the neuron constant matrices, i.e., B^T dB, to obtain the Winograd-transformed neurons. Next, the Winograd-transformed neurons and weight matrices are multiplied element-wise, i.e., (GgG^T) ⊙ (B^T dB), to obtain the element-wise product. Finally, the element-wise product is left- and right-multiplied by the Winograd inverse-transform constant matrices, i.e., A^T[(GgG^T) ⊙ (B^T dB)]A, finally yielding a convolution result equivalent to the original convolution.
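As an illustration only (not part of the patent disclosure), the following Python sketch walks through these steps with the commonly cited F(2×2, 3×3) constant matrices, checks the result against a direct convolution, and makes the multiplication count visible (16 element-wise multiplications versus 36 for the direct method).

```python
import numpy as np

# Commonly cited F(2x2, 3x3) constant matrices (assumed here for illustration).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float32)

d = np.random.rand(4, 4).astype(np.float32)   # 4x4 neuron tile
g = np.random.rand(3, 3).astype(np.float32)   # 3x3 weight

U = G @ g @ G.T        # weight transform GgG^T (pre-computed offline in this design)
V = B_T @ d @ B_T.T    # neuron forward transform B^T d B
M = U * V              # element-wise multiplication: 4*4 = 16 multiplications
Y = A_T @ M @ A_T.T    # inverse transform A^T [.] A -> 2x2 output

# Direct convolution for comparison: 2*2 outputs * 9 multiplications each = 36.
Y_ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                  for i in range(2)], dtype=np.float32)
assert np.allclose(Y, Y_ref, atol=1e-4)
```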
From the perspective of hardware design, the present invention pipelines these three major transform steps according to the dependencies and the distinct operation characteristics among the three processes, so as to achieve more efficient acceleration performance.
Fig. 2 shows a schematic structural diagram of a board card 20 according to an embodiment of the present invention. As shown in fig. 2, the board card 20 includes a chip 201, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in the cloud intelligence field; a notable characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the storage and computing capabilities of the platform.
The chip 201 is connected to an external device 203 through an external interface device 202. The external device 203 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. The data to be processed can be transferred from the external device 203 to the chip 201 through the external interface device 202, and the calculation results of the chip 201 can be transmitted back to the external device 203 via the external interface device 202. The external interface device 202 may take different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 20 also includes a storage device 204 for storing data, which includes one or more storage units 205. The storage device 204 is connected to the control device 206 and the chip 201 through a bus for data transfer. The control device 206 on the board card 20 is configured to regulate the state of the chip 201; to this end, in one application scenario, the control device 206 may include a single-chip microcomputer (MCU).
Fig. 3 is a structural diagram showing a combined processing device in the chip 201 of this embodiment. As shown in fig. 3, the combination processing device 30 includes a computing device 301, an interface device 302, a processing device 303, and a DRAM 304.
The computing device 301 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for performing deep learning or machine learning computations, especially Winograd convolution operations; it can interact with the processing device 303 through the interface device 302 to jointly complete the user-specified operations.
The interface device 302 is used for transmitting data and control instructions between the computing device 301 and the processing device 303. For example, the computing device 301 may obtain input data from the processing device 303 via the interface device 302, and write to a storage device on-chip with the computing device 301. Further, the computing device 301 may obtain control instructions from the processing device 303 via the interface device 302, and write the control instructions into a control cache on the computing device 301. Alternatively or optionally, the interface device 302 may also read data in a storage device of the computing device 301 and transmit to the processing device 303.
The processing device 303, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and the starting and/or stopping of the computing device 301. Depending on the implementation, the processing device 303 may be one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing device 301 of the present invention may be viewed as having a single-core structure or a homogeneous multi-core structure; however, when the computing device 301 and the processing device 303 are considered together, they form a heterogeneous multi-core structure.
The DRAM 304 is an off-chip memory used for storing the data to be processed, typically 16 GB or larger in size; it stores the data of the computing device 301 and/or the processing device 303, especially the neuron data and the weights on which the Winograd convolution operation is to be performed. In this embodiment, the processing device 303 has previously linearly transformed the weights of the original convolution into the Winograd weights GgG^T and stored them in the DRAM 304.
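A minimal host-side sketch of this offline weight preparation (illustrative only; the tensor layout and function name are assumptions, not the patent's interface) could look as follows.

```python
import numpy as np

# Commonly cited F(2x2, 3x3) weight-transform matrix (assumed for illustration).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

def pretransform_weights(w):
    """Convert original 3x3 weights into 4x4 Winograd weights GgG^T.

    w: array of shape (out_channels, in_channels, 3, 3).
    Returns an array of shape (out_channels, in_channels, 4, 4) that a host
    processor could store in DRAM before the convolution is launched.
    """
    oc, ic = w.shape[:2]
    u = np.empty((oc, ic, 4, 4), dtype=w.dtype)
    for o in range(oc):
        for i in range(ic):
            u[o, i] = G @ w[o, i] @ G.T
    return u

winograd_weights = pretransform_weights(np.random.rand(16, 8, 3, 3).astype(np.float32))
```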
Fig. 4 shows a block diagram of the computing device 301. The computing device 301 includes a bus 401, a direct memory access (DMA) module 402, an instruction cache (Iram) 407, a decode unit (IDU) 408, a neuron cache (Nram) 409, a forward transform unit (NTU) 410, a forward transform data cache (WNram) 411, a weight cache (Wram) 412, an element-wise multiply-accumulate operator (MAC) 413, an element-wise product data cache (WRram) 414, an inverse transform unit (ITU) 415, a result cache (Rram) 416, and a logical operation module (ALU) 417.
The bus 401 is the common communication trunk that transmits information between the units; it is a bundle of transmission lines composed of wires and, according to the kind of information transmitted within the combined processing device 30, is the collective name for the data bus, the address bus, and the control bus, which transmit data, data addresses, and instructions respectively. The bus 401 serves as the communication channel between the DRAM 304 and the computing device 301 and, in this embodiment, is specifically PCIe.
The DMA module 402 is used to copy data from one address space to another, typically by transferring data between an external memory (e.g., the DRAM 304) and the internal caches of the computing device 301. When a DMA transfer is performed, the processing device 303 hands bus control over to the DMA module 402, the DMA module 402 controls the bus 401 to transfer the data, and after the DMA transfer is completed, the DMA module 402 returns bus control to the processing device 303.
The DMA module 402 includes a neuron direct memory access (NDMA) 403, a weight direct memory access (WDMA) 404, an instruction direct memory access (IDMA) 405, and a result direct memory access (RDMA) 406. The NDMA 403 is used to input neuron data from the DRAM 304, the WDMA 404 is used to input Winograd weights from the DRAM 304, the IDMA 405 is used to input instructions from the DRAM 304, and the RDMA 406 is used to output the calculation results to the DRAM 304. In other embodiments, the NDMA 403, WDMA 404, IDMA 405, and RDMA 406 may be implemented by the same direct memory access unit.
The Iram 407 is used to temporarily store the instruction input by the IDMA 405, and the IDU 408 fetches the instruction from the Iram 407 to decode it, and controls the other units to operate according to the decoded instruction. The IDU 408 is a decoding and scheduling unit of the entire computing device 301, and is responsible for decoding the control instructions obtained from the DRAM 304, converting the control instructions into control signals to coordinate operations of the various modules/units on the chip, and also responsible for performing various tasks such as instruction order preservation, instruction dependency resolution, branch prediction, exception handling, and interrupt handling. In the figure, thin line arrows indicate control flows, and thick line arrows indicate data flows.
The Nram 409 temporarily stores the neuron data sent by the NDMA 403 according to the decoded instruction, and the NTU 410 reads the neuron data from the Nram 409 according to the decoded instruction and performs the forward transform, that is, computes B^T dB, to produce the forward transform data, which is temporarily stored in the WNram 411.
Fig. 5 shows a schematic diagram of the NTU 410. The NTU 410 includes an input buffer 501, a register file 502, an adder group 503, and an output buffer 504.
When the NTU 410 receives an instruction to load neuron data from the Nram 409, the input buffer 501 acts as a first-in first-out queue buffer that temporarily stores the neuron data. The neuron data loading stage continues until all data has been received; convolution filters of different sizes are configured with fixed and independent cache resource partitions and input counts, and the overall process is controlled by instructions issued by the IDU 408.
The register file 502 fetches the temporarily stored neuron data from the input buffer 501 in a planned operation order according to the decoded instruction and stores it at specific addresses of the register file 502; the neuron data stored at these specific addresses serve as the addition operands. In this embodiment, because the pipeline stages for input, operation, and output are of equal length, cache hardware resource dependencies arise. To resolve this resource dependency, the register file 502 is divided into a ping storage unit 505 and a pong storage unit 506 of the same size: the i-th addition operand and the forward transform data generated from it are temporarily stored in the ping storage unit 505, the (i+1)-th addition operand and the (i+1)-th forward transform data are temporarily stored in the pong storage unit 506, and the (i+2)-th addition operand and the (i+2)-th forward transform data are temporarily stored in the ping storage unit 505 again, overwriting the i-th addition operand and forward transform data; the register file 502 continues to store data according to this rule.
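The ping/pong alternation described above can be pictured with a small illustrative model (not hardware code): operand i goes to the ping storage unit when i is even and to the pong storage unit when i is odd, so operand i+2 reuses, and overwrites, the buffer of operand i only after that data has been drained by the output stage.

```python
# Illustrative model of the ping/pong storage rule (even index -> ping 505, odd -> pong 506).
def storage_unit_for(i):
    return "ping(505)" if i % 2 == 0 else "pong(506)"

for i in range(6):
    print(f"addition operand {i} and its forward transform data -> {storage_unit_for(i)}")

# Operand i+2 lands in the same unit as operand i and overwrites it.
assert storage_unit_for(0) == storage_unit_for(2)
assert storage_unit_for(1) == storage_unit_for(3)
```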
The adder group 503 reads the addition operands one by one from the specific addresses of the register file 502 according to the decoded instruction and performs the addition operations. In this embodiment, the adder group 503 consists of 2 groups, corresponding to the scheduling direction of the addition operations, and each group includes 16 adders, corresponding to the vectorization direction; each adder is an FB32 adder. The addition operations of the Winograd convolution forward transform are performed along the channel direction of the neuron data, in a specific order: the additions for the left-multiplication matrix B^T of the Winograd convolution are computed first, then the additions for the right-multiplication matrix B, finally generating the forward transform data. The operation order, register allocation, and operation time are all related to the convolution filter size and are controlled by instructions issued by the IDU 408. The operation stage has a data dependency on the neuron data loading stage; the two are executed in a pipelined manner, which is implemented in hardware by counting.
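Because every entry of the forward-transform matrices is 0, 1 or -1 (assuming the commonly used F(2×2, 3×3) matrices), the computation B^T dB needs no multiplications at all; the following sketch, for illustration only, evaluates it as two passes of signed additions, mirroring the order described above (left-multiplication by B^T first, then right-multiplication by B).

```python
import numpy as np

B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float32)

def combine(rows, coeffs):
    """Combine rows using additions/subtractions only, since each coefficient is 0, 1 or -1."""
    out = np.zeros_like(rows[0])
    for c, r in zip(coeffs, rows):
        if c == 1:
            out = out + r
        elif c == -1:
            out = out - r
        # c == 0: the row contributes nothing
    return out

def forward_transform(d):
    """Compute B^T d B using signed additions only (no multiplications)."""
    t = np.stack([combine(list(d), B_T[k]) for k in range(4)])      # left-multiply by B^T
    v = np.stack([combine(list(t.T), B_T[k]) for k in range(4)]).T  # right-multiply by B
    return v

d = np.random.rand(4, 4).astype(np.float32)
assert np.allclose(forward_transform(d), B_T @ d @ B_T.T, atol=1e-5)
```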
In more detail, the adder group 503 decomposes each addition operand into a plurality of addition element operands, where the addition element operands are in tensor form. The number of addition element operands equals the number of elements of the addition operand; in each addition element operand, only one element has the same value as the element at the corresponding position of the addition operand, and all other elements are 0. FIG. 6 is a diagram illustrating the decomposition of an addition operand into multiple addition element operands according to this embodiment. The addition operand 601 is illustrated as a 2×2 matrix including 4 elements a11, a12, a21, a22. The adder group 503 decomposes the addition operand 601 into 4 addition element operands 602, 603, 604, 605, where the addition element operand 602 contains only a11 at the corresponding position and all its other elements are 0, the addition element operand 603 contains only a12 at the corresponding position and all its other elements are 0, the addition element operand 604 contains only a21 at the corresponding position and all its other elements are 0, and the addition element operand 605 contains only a22 at the corresponding position and all its other elements are 0.
When the adder group 503 performs the Winograd convolution forward-transform operation on the addition operand 601, it operates on the 4 addition element operands 602, 603, 604, 605 instead, so only the non-zero elements of the 4 addition element operands need to be processed. Each addition element operand is left-multiplied by the forward-transform left-multiplication matrix B^T and right-multiplied by the forward-transform right-multiplication matrix B, yielding the Winograd forward-transform intermediate result of that addition element operand. Finally, the Winograd forward-transform intermediate results of the 4 addition element operands 602, 603, 604, 605 are summed to obtain the forward transform data of the addition operand 601. Since the transform of each single-element operand follows a fixed pattern, in actual operation these intermediate results can be obtained directly without repeated computation, which shortens the computation time and saves computing resources.
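The decomposition can be checked numerically with the short sketch below (illustration only, applied here to a full 4×4 neuron tile rather than the simplified 2×2 example of FIG. 6): by linearity, the sum of the transforms of the single-element operands equals the transform of the whole operand, and each single-element transform is just the element value times a fixed 0/±1 pattern B^T E_ij B, which is why it can be produced without repeated computation.

```python
import numpy as np

B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float32)

def element_operands(d):
    """Split d into single-element operands: one per element, all other entries 0."""
    ops = []
    for i in range(d.shape[0]):
        for j in range(d.shape[1]):
            e = np.zeros_like(d)
            e[i, j] = d[i, j]
            ops.append(e)
    return ops

d = np.random.rand(4, 4).astype(np.float32)

# Winograd forward-transform intermediate result of each element operand, then the sum.
intermediates = [B_T @ e @ B_T.T for e in element_operands(d)]
assert np.allclose(sum(intermediates), B_T @ d @ B_T.T, atol=1e-5)
```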
The output buffer 504 is also a first-in first-out queue buffer, temporarily storing the forward transform data coming in turn from the ping storage unit 505 and the pong storage unit 506. The output stage can only output the corresponding buffer after the operation stage has completed in its entirety.
The WNram 411 includes a plurality of cache units. Fig. 7 shows a schematic diagram of an exemplary WNram 411; as shown, the WNram 411 includes 4 cache units: a first cache unit 701, a second cache unit 702, a third cache unit 703, and a fourth cache unit 704. The forward transform data from the NTU 410 is sent to one or more of these cache units by distribution routing.
Returning to fig. 4, the Wram 412 temporarily stores the Winograd weights sent from the WDMA 404 according to the decoded instruction, and the MAC 413 reads the Winograd weights from the Wram 412 and the forward transform data from the WNram 411 according to the decoded instruction, and performs the element-wise multiply-accumulate operation on the forward transform data and the Winograd weights, that is, computes (GgG^T) ⊙ (B^T dB), generating the element-wise product data, which is temporarily stored in the WRram 414.
In this embodiment, the MAC 413 includes 64 MAC operators, which are divided into 4 groups performing 4 different batches of operations, and the 16 MAC operators in each group are laid out independently. The forward transform data of the WNram 411 needs to be sent to the 64 MAC operators simultaneously so that it can be multiplied and accumulated element-wise with different Winograd weights; the WNram 411 therefore sends the forward transform data by broadcasting or distribution routing. Because the output load is large, in order to guarantee drive capability and timing, the forward transform data of the WNram 411 is sent through two stages of broadcast or distribution routing, N1 and N2: it is first sent to 4 N1 nodes, each N1 node broadcasts or distributes it to 4 N2 nodes, and each N2 node broadcasts or distributes it to 4 MAC operators.
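The two-stage fan-out can be enumerated as follows (illustrative only): the root reaches 4 N1 nodes, each N1 node reaches 4 N2 nodes, and each N2 node reaches 4 MAC operators, i.e., 4 × 4 × 4 = 64 operators in total.

```python
# Illustrative enumeration of the N1/N2 broadcast (or distribution) tree.
def broadcast_targets():
    targets = []
    for n1 in range(4):             # root -> 4 N1 nodes
        for n2 in range(4):         # each N1 node -> 4 N2 nodes
            for mac in range(4):    # each N2 node -> 4 MAC operators
                targets.append((n1, n2, mac))
    return targets

assert len(broadcast_targets()) == 64
```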
Fig. 8 shows a schematic diagram of a MAC operator 801. The MAC operator 801 first performs the element-wise multiplication and then sequentially accumulates the resulting vector; its logical function is equivalent to computing a vector inner product, or to computing one element value in a matrix multiplication.
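A sketch of that logical function (illustrative; the vector length here is arbitrary, not specified by the document) is an element-wise multiplication followed by sequential accumulation, i.e., an inner product:

```python
import numpy as np

def mac_operator(forward_data, winograd_weight):
    """Element-wise multiply two equal-length vectors, then accumulate sequentially."""
    products = forward_data * winograd_weight   # element-wise multiplication
    acc = np.float32(0.0)
    for p in products:                          # sequential accumulation
        acc += p
    return acc

v = np.random.rand(16).astype(np.float32)
w = np.random.rand(16).astype(np.float32)
assert np.isclose(mac_operator(v, w), np.dot(v, w), atol=1e-4)
```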
The ITU 415 reads the element-wise product data from the WRram 414 according to the decoded instruction and performs the inverse transform on it, i.e., computes A^T[(GgG^T) ⊙ (B^T dB)]A, to obtain the convolution result, which is temporarily stored in the Rram 416.
The ITU 415 is similar in structure to the NTU 410 and also includes an input buffer, a register file, an adder group, and an output buffer.
When the ITU 415 receives an instruction to load the element-wise product data from the WRram 414, the input buffer acts as a first-in first-out queue buffer that temporarily stores the element-wise product data. The loading stage continues until all data has been received; convolution filters of different sizes are configured with fixed and independent cache resource partitions and input counts, and the overall process is controlled by instructions issued by the IDU 408.
The register file fetches the temporarily stored element-wise product data from the input buffer in a fixed operation order according to the decoded instruction and stores it at specific addresses of the register file; the element-wise product data stored at these specific addresses become the addition operands. Similarly, in order to resolve the resource dependency problem, the register file has a ping storage unit and a pong storage unit of the same size, whose storage scheme is not described again here.
The adder group reads the addition operands one by one from the specific addresses of the register file according to the decoded instruction and performs the addition operations. Like the adder group 503, it consists of 2 groups corresponding to the scheduling direction of the addition operations, each group including 16 adders corresponding to the vectorization direction; each adder is an FB32 adder. The addition operations of the Winograd convolution inverse transform are performed along the channel direction of the element-wise product data, in a specific order: the additions for the left-multiplication matrix A^T of the Winograd convolution are computed first, then the additions for the right-multiplication matrix A, generating the convolution result, which is stored back into the register file. The operation order, register allocation, and operation time are all related to the convolution filter size and are controlled by instructions issued by the IDU 408. The operation stage has a data dependency on the element-wise product data loading stage; the two are executed in a pipelined manner, which is implemented in hardware by counting.
The output buffer is also a first-in first-out queue buffer, temporarily storing the convolution results coming in turn from the ping storage unit and the pong storage unit. The output stage can only output the corresponding buffer after the operation stage has completed in its entirety.
In addition to Winograd convolution, the computing device 301 is capable of performing all neural-network-related operations, and the ALU 417 is used to perform two kinds of tasks according to the decoded instructions. The first is fused convolution operations, i.e., operations that can be completed on-chip together with the convolution layer in one pass without depending on additional data, including processes such as activation, bias addition, directional partial sums, and accumulation. The second is non-convolution operations. The results of the operations performed by the ALU 417 are also temporarily stored in the Rram 416. The presence of the ALU 417 ensures that the various operations in a convolutional neural network can be fully implemented in the computing device 301, giving the computing device 301 the generality and completeness needed for neural networks.
The RDMA 406 fetches the convolution result from the Rram 416 and outputs it to the DRAM 304 according to the decoded instruction, thus completing the entire convolution operation. Similarly, the RDMA 406 may also fetch the other operation results generated by the ALU 417 from the Rram 416 and output them to the DRAM 304 according to the decoded instructions.
Because the data scale of the convolution operation is huge, in order to reduce instruction launch overhead, this embodiment further uses instruction control to make the relevant modules/units execute in a pipelined manner, improving hardware utilization.
As can be seen from the above, the input timing and data scale of the neuron data affect the neuron forward-transform process of the Winograd convolution instruction, the input timing and data scale of the weight data likewise affect the element-wise multiply-accumulate process of the Winograd convolution instruction, and the completion timing of the Winograd inverse transform affects the execution of the convolution result output instruction. Therefore, from the control point of view, the order of the instructions and their execution times are critical; in addition, this embodiment inserts synchronization instructions between instructions with dependency relationships to resolve the data dependency problems between the input/output program and the Winograd convolution program.
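Purely as a hypothetical illustration (the actual instruction set is not disclosed in this document, and all mnemonics below are invented), the dependency and synchronization scheme might be expressed as an instruction sequence in which a SYNC is inserted between dependent instructions:

```python
# Hypothetical instruction stream; every mnemonic is illustrative, not the real ISA.
program = [
    "LOAD.N   Nram  <- DRAM",          # NDMA: load neuron data
    "LOAD.W   Wram  <- DRAM",          # WDMA: load pre-transformed Winograd weights
    "SYNC",                            # forward transform depends on the loaded neurons
    "NTRANS   WNram <- Nram",          # NTU: compute B^T d B
    "SYNC",                            # MAC depends on forward transform data and weights
    "MAC      WRram <- WNram, Wram",   # element-wise multiply-accumulate
    "SYNC",                            # inverse transform depends on the element-wise products
    "ITRANS   Rram  <- WRram",         # ITU: compute A^T [.] A
    "SYNC",                            # output depends on completion of the inverse transform
    "STORE.R  DRAM  <- Rram",          # RDMA: write the convolution result back
]
for instr in program:
    print(instr)
```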
Based on the structure of the computing device 301 shown in fig. 4, this embodiment performs the Winograd convolution operation according to the aforementioned pipeline, so that the advantages of the hardware can be fully exploited and the input/output and operation efficiency improved.
The present invention performs hardware design based on the characteristics of the Winograd algorithm to achieve general-purpose acceleration, provides a pipeline-level operation mode to increase the speed of the Winograd convolution operation, and makes full use of reusable resources in the hardware implementation through methods such as time-division multiplexing and broadcast routing. The hardware structure provided by the invention cooperates with the Winograd convolution algorithm, and has the technical effects of preserving network accuracy, accelerating performance, reducing area, and reducing power consumption.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a car recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present invention can also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical care, and the like. Furthermore, the electronic equipment or the device can also be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the electronic device or apparatus with high computational power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), and the electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of simplicity, the present invention sets forth some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the inventive arrangements are not limited by the order of acts described. Accordingly, persons skilled in the art may appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the invention. Further, those skilled in the art will appreciate that the described embodiments of the invention are capable of being practiced in other alternative embodiments that may involve fewer acts or modules than are necessary to practice one or more aspects of the invention. In addition, the description of some embodiments of the present invention is also focused on different schemes. In view of this, those skilled in the art will understand that portions of the present invention that are not described in detail in one embodiment may also refer to related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, one of ordinary skill in the art will appreciate that the several embodiments disclosed herein may be practiced in other ways than as specifically disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the scheme of the embodiment of the invention. In addition, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors and like devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The above embodiments of the present invention are described in detail, and the principle and the implementation of the present invention are explained by applying specific embodiments, and the description of the above embodiments is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (5)

1. An integrated circuit device for performing a Winograd convolution forward transform on neuron data, comprising:
a forward transform unit configured to forward-transform the neuron data to produce forward transform data, the forward transform unit comprising:
an input buffer configured to temporarily store the neuron data;
a register file configured to fetch the temporarily stored neuron data from the input buffer according to one of a plurality of decoded instructions and to store the neuron data to specific addresses so as to become a plurality of addition operands; and
an adder group configured to read the addition operands one by one from the specific addresses according to one of the decoded instructions and to perform addition operations, wherein the adder group decomposes each addition operand into a plurality of addition element operands according to the number of elements of the addition operand, only one element in each addition element operand having the same value as the element at the corresponding position of the addition operand and all other elements being 0, and wherein the adder group operates on the plurality of addition element operands to obtain a plurality of Winograd forward-transform intermediate results and sums the plurality of Winograd forward-transform intermediate results to obtain the forward transform data.
2. The integrated circuit device according to claim 1, coupled to a decode unit, the decode unit being configured to decode a plurality of instructions and to control the forward transform unit according to the decoded plurality of instructions.
3. The integrated circuit device according to claim 1, wherein the number of adder groups is 2 groups.
4. The integrated circuit device according to claim 1, wherein the register file comprises a ping storage unit and a pong storage unit, the addition operand and the forward transform data being temporarily stored in the ping storage unit, and a next addition operand and next forward transform data being temporarily stored in the pong storage unit.
5. The integrated circuit device of claim 4, wherein the forward transform unit further comprises an output buffer to temporarily store the forward transform data from the ping storage unit, wherein the forward transform data buffer temporarily stores the forward transform data from the output buffer.
CN202110614459.0A 2021-06-02 2021-06-02 Device for performing Winograd convolution forward transform on neuron data Pending CN115438777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110614459.0A CN115438777A (en) 2021-06-02 2021-06-02 Device for performing Winograd convolution forward transform on neuron data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110614459.0A CN115438777A (en) 2021-06-02 2021-06-02 Device for performing Winograd convolution forward transform on neuron data

Publications (1)

Publication Number Publication Date
CN115438777A true CN115438777A (en) 2022-12-06

Family

ID=84271962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110614459.0A Pending CN115438777A (en) 2021-06-02 2021-06-02 Device for performing Winograd convolution forward transform on neuron data

Country Status (1)

Country Link
CN (1) CN115438777A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination