CN115878543A - Computing device, method for performing convolution operation by using computing device and related product - Google Patents

Computing device, method for performing convolution operation by using computing device and related product Download PDF

Info

Publication number
CN115878543A
CN115878543A CN202111129739.9A CN202111129739A CN115878543A CN 115878543 A CN115878543 A CN 115878543A CN 202111129739 A CN202111129739 A CN 202111129739A CN 115878543 A CN115878543 A CN 115878543A
Authority
CN
China
Prior art keywords
data
circuit
convolution
computing device
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111129739.9A
Other languages
Chinese (zh)
Inventor
请求不公布姓名
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN202111129739.9A priority Critical patent/CN115878543A/en
Publication of CN115878543A publication Critical patent/CN115878543A/en
Pending legal-status Critical Current

Links

Images

Abstract

The disclosure discloses a computing device, a method for performing convolution operation by using the computing device and a related product. The computing means may be comprised in a combined processing means which may also comprise interface means and other processing means. The computing device interacts with other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing device, respectively, for storing data of the computing device and the other processing device. The scheme disclosed by the invention optimizes the convolution operation, and improves the operation processing efficiency.

Description

Computing device, method for performing convolution operation by using computing device and related product
Technical Field
The present disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to a computing device configured to perform convolution operations, a method of performing convolution operations using the computing device, a chip, and a board.
Background
At present, deep Learning (Deep Learning) has become an important branch in machine Learning, and the development of Artificial Intelligence (AI) is also greatly promoted. The core technology of deep learning, deep Neural Network (DNN), has been widely used in many industries.
The Neural Network is one of the most critical technologies in artificial intelligence and deep learning, and a Convolutional Neural Network (CNN) is one of the most important Network types. The most critical calculation in the convolutional neural network is Convolution Operation (Convolution Operation) of the convolutional layer (Conv layer). The convolutional layer has the function of extracting the characteristics of input data, and can extract complex characteristics through multilayer convolution so as to ensure that the network has enough expression capacity and generalization capacity. The neural network model comprises a large number of convolution operations of various types, and the calculation performance of the convolution operations greatly influences the calculation performance of the whole neural network model. When the neural network model is applied to different fields, such as speech recognition, machine translation, image processing, etc., the respective dimensions of the input feature map and the weight thereof may be different. In order to fully utilize the hardware advantages of the deep learning processor, optimization needs to be performed for different types of convolution operations with different scales, so as to improve the calculation performance of executing the neural network model.
Disclosure of Invention
In order to solve at least one or more of the technical problems as mentioned above, the present disclosure proposes, in various aspects, a calculation apparatus that can enable data of various dimensional sizes to be adapted to hardware of convolution operations by performing a blocking process on an input feature map and weights, thereby improving the calculation efficiency of the convolution operations. The convolution operations of the disclosed embodiments may be operations in various neural network models that may be applied in various fields, such as image processing, speech processing, text processing, and so forth, which may include, for example, but not limited to, recognition and classification.
In a first aspect, embodiments of the present disclosure provide a computing apparatus configured to perform a deep convolution operation in inverse training of a neural network model, the computing apparatus including: a main processing circuit to: acquiring input neuron data and/or neuron gradient data, wherein the input neuron data and the neuron gradient data are respectively split into a plurality of splitting units according to a convolution splitting scheme, one splitting unit comprises data of the lowest storage dimension and at least one other storage dimension, the total data volume of one splitting unit does not exceed the single maximum operation volume of hardware, and data in one splitting unit are continuously stored into one data line; and a plurality of slave processing circuits that perform the deep convolution operation on corresponding data rows of the input neuron data and neuron gradient data.
In a second aspect, embodiments of the present disclosure provide a chip including the computing device of any of the foregoing first aspects.
In a third aspect, embodiments of the present disclosure provide a board card including the chip of any of the foregoing second aspects.
In a fourth aspect, embodiments of the present disclosure provide a method for performing convolution operations by the computing apparatus of any one of the foregoing embodiments of the first aspect.
By the computing device, the chip, the board card and the method for implementing convolution operation by the computing device, the scheme of the embodiment of the disclosure enables the input neuron data to adapt to the processing capability of the hardware operation device by applying the convolution splitting scheme, thereby fully utilizing the parallel processing capability of the plurality of slave processing circuits and effectively improving the operation efficiency of convolution operation. In addition, in some embodiments, the input feature map and the weight may be transmitted through different data paths, thereby supporting multiple multiplexing modes of the input feature map and the weight, further optimizing convolution operation, and reducing data access amount.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to like or corresponding parts and in which:
fig. 1 shows a block diagram of a board card of an embodiment of the present disclosure;
FIG. 2 shows a block diagram of a combined processing device of an embodiment of the disclosure;
FIG. 3 illustrates an internal structural diagram of a processor core of a single or multi-core computing device of an embodiment of the present disclosure;
4 a-4 c illustrate several exemplary convolution operation principle examples to which embodiments of the present disclosure may be applied;
FIG. 5 shows a schematic block diagram of a computing device according to an embodiment of the present disclosure;
FIG. 6 illustrates an exemplary data storage sequence in accordance with embodiments of the present disclosure;
7a-7d illustrate several exemplary grouping patterns according to embodiments of the present disclosure;
FIG. 8 illustrates an exemplary split schematic of an input feature map in accordance with an embodiment of the present disclosure;
9a-9d illustrate data storage schematics in a second memory circuit according to embodiments of the present disclosure;
10a-10b illustrate output point division schematics of an arithmetic circuit according to an embodiment of the present disclosure;
figure 11 illustrates a split and store schematic diagram for the Updated1 scheme, according to an embodiment of the present disclosure;
FIG. 12 illustrates a schematic diagram of data storage in a second storage circuit in an Update1 scheme according to an embodiment of the disclosure;
FIG. 13 illustrates a single pass operation in the Update1 scheme according to an embodiment of the disclosure;
FIG. 14 illustrates a schematic diagram of sliding convolution in the Update1 scheme according to an embodiment of the present disclosure; and
fig. 15 illustrates an output data format diagram of an Update1 scheme according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. as may appear in the claims, specification, and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
Exemplary hardware Environment
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the disclosure. As shown in fig. 1, the board card 10 includes a Chip 101, which is a System on Chip (SoC), or System on Chip, integrated with one or more combined processing devices, which is an artificial intelligence arithmetic unit for supporting various deep learning and machine learning algorithms and meeting the intelligent processing requirements in the fields of computer vision, speech, natural language processing, data mining and the like under complex scenes. Especially, the deep learning technology is widely applied to the field of cloud intelligence, and one remarkable characteristic of the cloud intelligence application is that the input data size is large, and the requirements on the storage capacity and the computing capacity of the platform are high.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The card 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected and data-transferred with the control device 106 and the chip 101 through a bus. The control device 106 in the board 10 is configured to regulate the state of the chip 101. For this purpose, in an application scenario, the control device 106 may include a single chip Microcomputer (MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.
The computing device 201 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively perform the user-specified operations.
The interface means 202 is used for transferring data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general purpose processing device, performs basic control including, but not limited to, data transfer, starting and/or stopping of the computing device 201, and the like. Depending on the implementation, the processing device 203 may be one or more types of Central Processing Unit (CPU), graphics Processing Unit (GPU) or other general and/or special purpose processor, including but not limited to a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., and the number thereof may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure may be viewed as having a single core structure or an isomorphic multi-core structure only. However, when considered collectively, the computing device 201 and the processing device 203 are considered to form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed, which may be a DRAM, a DDR memory, and is typically 16G or larger in size, and is used to store data of the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of an internal structure of a processing core when the computing apparatus 201 is a single-core or multi-core apparatus. The computing device 301 is used for processing input data such as computer vision, voice, natural language, data mining, and the like, and the computing device 301 includes three modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decoding unit 312 decodes the obtained instruction and sends the decoded result as control information to the operation module 32 and the storage module 33.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, nonlinear transformation, and the like; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a Direct Memory Access (DMA) 333.NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; WRAM 332 is used to store the convolution kernel of the deep learning network, i.e. the weight; the DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204.
Exemplary convolution operation types
Based on the foregoing hardware environment, in one aspect, the disclosed embodiments provide a computing device configured to perform convolution operations such that the convolution operations in, for example, a neural network model, may be optimized. Convolution layers in the neural network model may perform convolution operations, and perform feature extraction by applying convolution kernels (also referred to as filters, weights, etc.) to an input feature map (also referred to as input data, neurons, or input neurons) for convolution processing. The convolutional layer may contain a plurality of convolutional kernels, and each element constituting a convolutional kernel corresponds to a weight coefficient and a bias.
Various convolution layers may be included in the neural network model, such as a convolution layer to perform forward, conventional 3D convolution, and an deconvolution layer to perform depth (Depthwise) convolution. In the inverse training, it may be necessary to perform an inverse deep convolution operation or a cross-product convolution operation. Embodiments of the present disclosure may be optimized for these different types of convolution operations.
In the conventional 3D convolution operation, assuming that the tensor shape of the input Feature map (Feature map) in the convolution layer is represented as X [ N Hi Wi Ci ], the tensor shape of the convolution kernel (kernel) is represented as K [ Co Kh Kw Ci ], and the output result is Y [ N Ho Wo Co ], then the mathematical calculation formula of the simplified convolution operation can be expressed as follows:
Y in,jc,jh,jw =∑ 0≤ic≤ci,0≤ih≤kh,0≤iw≤kw X in,ic,jh×sh+ih,jw×sw+iw ×K jc,ic,ih,iw (1)
in the above equation, X is input data, Y is output data, K is a convolution kernel, kh and Kw are length and width of K, sh and sw are step sizes (stride) in the length and width directions, the formula ignores the amount of deviation bias, padding pad and dilation, and the convolution kernel has been dilated assuming that the input data X has been padded. The formula omits the N dimension and the C dimension, and the forward calculation of the neural network model is independent in the N dimension and is fully connected in the C dimension. When the convolution kernel works, the input features are swept according to a certain step length, matrix element multiplication summation is carried out on the input features in a convolution window, and deviation amount is superposed. In a conventional 3D convolution operation, the bit product results in H, W and Ci directions are accumulated, and thus referred to as 3D convolution. However, such 3D convolution has constraints: the dimension Ci of the convolution kernel is equal to the dimension Ci of the input feature map, so that the convolution kernel does not slide in the direction Ci and is a pseudo 3D convolution. To distinguish from other convolution operations herein, the above convolution operation is referred to as a 3D convolution operation.
FIG. 4a shows an example of an exemplary conventional 3D convolution operation principle to which embodiments of the present disclosure may be applied.
The figure shows exemplarily four-dimensional input data X of size [ N Hi Wi Ci ], which can be represented as N Hi × Wi × Ci-sized solid rectangles 410a. Also illustrated is a four-dimensional convolution kernel K of size [ Co Kh Kw Ci ], which can be represented as Co Kh Kw Ci sized stereo convolution kernels 420a. The result of the convolution of the input data X with the convolution kernel K results in output data Y, which is four-dimensional data of size [ N Ho Wo Co ], and which can be represented as N Ho × Wo × Co-sized solid rectangles 430a.
The figure also specifically shows an example of convolution operation, in which the input data is an input feature map 440a of size 6 × 6 × 3, and N dimensions are omitted; the convolution kernel is a 3 × 3 × 3 stereo convolution kernel 450a, for a single convolution kernel Co; the output data is a 4 × 4 output profile 460a. The specific operation process is as follows:
the convolution kernel 450a sweeps the input signature graph 440a by a certain step size, and performs matrix element multiplication summation and overlap offset on the input signature within the convolution window 470 a. That is, the value at each position in the output feature map 460a is obtained by performing two-dimensional convolution operation on the corresponding block of each input feature map and the corresponding convolution kernel and then summing the two-dimensional convolution operation. For example, the values (i.e., convolution output points) at the (0, 0) positions on the output feature map 460a are shown to be obtained by performing a two-dimensional convolution operation on the convolution window 470a framed by the black cube in the input feature map and the stereo convolution kernel 450a to obtain 3 values, and then summing the values to obtain the final value.
To obtain outputs at other locations, the position of the convolution kernel 450a, i.e., the convolution window of the convolution output points, may be shifted over the input feature map 440 a. In the example shown, the convolution step (Sx, sy) is (1, 1), and when the horizontal direction (width direction) is moved to the right or the vertical direction (height direction) is moved down by one lattice, the convolution operation is performed to obtain the value of (0, 1) or (1, 0) position on the output feature map 460a, respectively.
From the above description, in a convolutional layer of a neural network, there are N sets of input feature maps, each set containing Hi × Wi × Ci pieces of information, where Hi and Wi are the height and width of the input feature maps, respectively, and Ci is the number of input feature maps, also referred to as the number of input channels. The convolutional layer has Ci × Co convolutional kernels of Kh × Kw size, where Ci is the number of input channels, co is the number of output feature maps (or the number of output channels), and Kh and Kw are the height and width of the convolutional kernels, respectively. The output profile contains information of Ho × Wo × Co, where Ho and Wo are the height and width of the output profile, respectively, and Co is the number of output channels. In addition, convolution steps (Sx, sy) are involved in the convolution operation, and the size of the convolution steps affects the size of the output feature map.
FIG. 4b shows an example of an exemplary principle of deep convolution operation to which embodiments of the present disclosure may be applied.
The depth convolution differs from conventional 3D convolution in that the depth direction, which is referred to herein as the input channel Ci, is not accumulated. In the conventional 3D convolution, each convolution kernel needs to be calculated and accumulated with all layers (input channels) of the input feature map, so the number of input channels of each convolution kernel is equal to the number of input channels of the input feature map. Each convolution kernel in the deep convolution is a single channel, one convolution kernel is responsible for one channel, and one channel is only convoluted by one convolution kernel. Thus, the deep convolution is sometimes also referred to as a 2D convolution, i.e., sliding accumulation over only the H and W dimensions.
As shown, the dimension of the input feature map 410b is 12 × 12 × 3, i.e., includes three channels, each channel including 12 × 12 images. In this deep convolution, 3 convolution kernels 420b are used, each of which is a single channel, for example, 5 × 5 × 1 in size. Each convolution kernel convolves only one channel of the input feature map 410b, and such convolution yields outputs of size 8 × 8 × 1 at a time, and then the outputs are stacked together to create an 8 × 8 × 3 image, and finally yields an 8 × 8 × 3 output feature map 430b. As can be seen from the figure, the depth (number of channels) of the output feature map remains the same as the input feature map.
Because the input channels are not accumulated in the deep convolution, when the deep convolution is involved, the dimensions of the input feature map, the convolution kernel and the output feature map can be simplified into three dimensions of C (channel), H (height) and W (width).
In the back propagation of the neural network model training, the calculation of the neuron gradient and the weight gradient is involved, as follows:
Figure BDA0003280093990000051
Figure BDA0003280093990000052
wherein, topLet _diffand bottom _ diff be the neuron gradient, W be the weight of the iteration, Δ W be the weight gradient calculated by the iteration,
Figure BDA0003280093990000053
is a computation in the back propagation similar to a convolution operation. With respect to the reverse propagation direction, bottom _ diff of the previous layer is top _ diff of the current layer, and bottom _ diff of the current layer is top _ diff of the next layer, so that errors can be reversely transferred layer by layer.
In the calculation of equation (2), the operation between top _ diff and W is similar to the operation process between the input neuron and the weight W, where top _ diff corresponds to the input feature map.
In the calculation of equation (3), the operation between top _ diff and bottom _ data is similar to the deep convolution operation, where top _ diff is equivalent to a convolution kernel, and sliding accumulation is performed in the XY direction of bottom _ data, and the operation principle can refer to fig. 4b. In this operation scenario, the sizes of top _ diff and bottom _ data are usually large. Therefore, the embodiments of the present disclosure also provide an optimization scheme for convolution operation (referred to as inverse depth convolution) in such a scenario.
In backpropagation, for convolutional layers that perform conventional 3D convolution operations, the operations in their inverse process may be referred to as cross-product convolution operations. Embodiments of the present disclosure may also provide an optimization scheme for such convolution operations.
FIG. 4c shows an example of the principle of cross-product convolution operation to which embodiments of the present disclosure may be applied.
Three-dimensional data top _ diff of size [ Ho Wo Co ] is exemplarily shown in the figure, which can be represented as a solid rectangle 410c of size Ho × Wo × Co; also shown is three-dimensional data bottom _ data of size [ Hi Wi Ci ], which can be represented as a solid rectangle 420c of size Hi Wi Ci. the top _ diff and bottom _ data perform cross-product convolution operation to obtain output data 430c, which is four-dimensional data with the size of [ Co Kh Kw Ci ], and can be represented as Co three-dimensional rectangles 430c with the size of Kh × Kw × Ci. As can be seen from comparison with fig. 4a, the cross-product convolution of fig. 4c is equivalent to the inverse operation of the conventional 3D convolution, i.e. the convolution kernel is calculated by the output feature map (top _ diff) and the input feature map (bottom _ data). The N dimension is omitted in fig. 4 c.
Specifically, ci copies are made for each HoWo surface data in top _ diff, that is, for each Co value of the HoWo surface, to obtain data 440c of Ho × Wo × Ci. This data 440c is subjected to a deep convolution operation with the bottom _ data (see the schematic diagram of fig. 4 b), i.e., ci direction is not accumulated, thereby obtaining an output 460c, which is three-dimensional data of Kh × Kw × Ci size. For each HoWo surface, the replication and depth convolution operations are repeated to obtain Co three-dimensional data of Kh × Kw × Ci, that is, a four-dimensional convolution kernel 430c of Co × Kh × Kw × Ci.
Herein, input Feature maps (Feature maps), input data, neurons, or input neurons may be used interchangeably; convolution kernels, filters, or weights may be used interchangeably. Further, the H (height) and Y dimensions may be used interchangeably, and the W (width) and X dimensions may be used interchangeably. Accordingly, the H dimension of the input feature map may be represented as Hi or Yi, the H dimension of the output feature map may be represented as Ho or Yo, and the W dimension is similarly represented. In the disclosed embodiment, each convolution output point has a corresponding convolution window, the shape of which is equal to the shape of the convolution kernel. The value of each convolution output point corresponds to the result of the pair-wise multiplication and accumulation of the input feature map and the weight value in the convolution window. In addition, whatever type of convolution operation is involved, the data involved can be divided into an input signature, a convolution kernel, and an output signature. For example, in the inverse operation, top _ diff corresponds to a convolution kernel, bottom _ data corresponds to an input feature map, and Δ W corresponds to an output feature map.
Exemplary computing device
In the embodiments of the present disclosure, the above convolution operation may be implemented using a computing device of a master-slave structure. Further, different data paths can be configured for the input feature map and the convolution kernel, so that the memory access efficiency is improved.
FIG. 5 shows a schematic block diagram of a computing device 500 according to an embodiment of the present disclosure. It is understood that the structure can be regarded as an internal structure refinement of the operation module of a single processing core in fig. 3, and can also be regarded as a function division block diagram combined on the basis of a plurality of operation modules of the processing cores shown in fig. 3. As shown in fig. 5, computing device 500 of an embodiment of the present disclosure may be configured to perform various types of convolution operations, which may include a master processing circuit (MA) 510 and a plurality of slave processing circuits (SL) 520, 16 slave processing circuits SL 0-SL 15 being shown. Those skilled in the art will appreciate that the number of slave processing circuits may be more or less, depending on the particular hardware configuration, and the disclosed embodiments are not limited in this respect.
The master processing circuit and the slave processing circuits and the plurality of slave processing circuits may communicate with each other through various connections. In different application scenarios, the connection manner between the multiple slave processing circuits may be a hard connection manner arranged by a hard wire, or a logic connection manner configured according to, for example, a microinstruction, so as to form a topology of multiple slave processing circuit arrays. The disclosed embodiments are not limited in this respect. The master processing circuit and the slave processing circuit may cooperate with each other, thereby realizing parallel arithmetic processing.
To support the arithmetic function, the master processing circuit and the slave processing circuit may include various calculation circuits, and may include, for example, a vector operation unit and a matrix operation unit. The vector operation unit is used for executing vector operation and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit is responsible for core calculation of the deep learning algorithm, such as matrix multiplication and convolution.
The slave processing circuit may be configured to perform an intermediate operation on the corresponding data in parallel according to the operation instruction to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results back to the master processing circuit.
By configuring the computing apparatus 500 to be in a master-slave configuration (e.g., a master-slave configuration, or a multi-master-slave configuration, which is not limited in this respect), for a forward-direction computation instruction, data can be split according to the computation instruction, so that a portion with a large computation amount is computed in parallel by a plurality of slave processing circuits to increase the computation speed, save the computation time, and further reduce the power consumption.
In some embodiments of the present disclosure, by using different data paths to transmit the input feature map and the weight, multiple multiplexing modes of the input feature map and the weight can be supported, thereby reducing the data access amount during operation and improving the processing efficiency.
Specifically, the computing apparatus 500 may further include a first storage circuit 530 and a second storage circuit 540 for storing data transmitted via different data channels, respectively. Optionally, the first storage circuit 530 and the second storage circuit 540 may be two storage blocks formed by dividing the same memory, or may be two independent memories, which is not limited herein.
The first memory circuit 530 may be used to store multicast data, i.e. the data in the first memory circuit will be transmitted via the broadcast bus to a plurality of slave processing circuits, which receive the same data. It will be appreciated that broadcast and multicast may be implemented via a broadcast bus. Multicast refers to a communication mode in which a piece of data is transmitted to a plurality of slave processing circuits; broadcast is a communication mode for transmitting a piece of data to all slave processing circuits, and is a special case of multicast. Since multicast and broadcast both correspond to one-to-many transmission modes, and are not specifically distinguished herein, broadcast and multicast may be collectively referred to as multicast, the meaning of which may be clear to those skilled in the art depending on the context.
The second memory circuit 540 may be used for storing the distribution data, i.e. the data in the second memory circuit will be transmitted to different slave processing circuits, respectively, each receiving different data.
By providing the first storage circuit and the second storage circuit separately, it is possible to support transmission in different transmission manners for data to be operated on, thereby reducing the data access amount by multiplexing multicast data among a plurality of slave processing circuits.
In some embodiments, one of the input signature graph and the convolution kernel may be determined to be multicast data and stored in the first storage circuit to transmit the data to the scheduled plurality of slave processing circuits in a broadcast manner during the operation. Correspondingly, the other of the input feature map and the convolution kernel may be determined as distribution data and stored in the second storage circuit. These distribution data may be distributed to the corresponding slave processing circuits prior to operation.
Fig. 5 also shows an internal structural schematic diagram of the slave processing circuit SL according to an embodiment of the present disclosure. As shown, each slave processing circuit 520 may include a plurality of operation circuits CU 521, a first buffer circuit 522, and a second buffer circuit 523. The figure shows 4 arithmetic circuits CU0 to CU3. Those skilled in the art will appreciate that the number of operational circuits may be greater or less, depending on the particular hardware configuration, and embodiments of the present disclosure are not limited in this respect.
In some embodiments, the first buffer circuit 522 may be used to buffer the weights or input signature assigned to the slave processing circuit. Accordingly, the second buffer circuit 523 may be used to buffer the input signature or weights assigned to the slave processing circuit. Both of the buffer circuits are used to select data to participate in the operation. The data of the first buffer circuit 522 may be a plurality of data lines from, for example, the first storage circuit 530 or the second storage circuit 540, and correspondingly, the data of the second buffer circuit 523 may be a plurality of data lines from, for example, the second storage circuit 540 or the first storage circuit 530. Depending on the particular multiplexing scheme, the data lines may be distributed to the corresponding arithmetic circuitry CU 521 during operation or broadcast to all CUs 521 within the slave processing circuitry 520.
Each of the operation circuits CU 521 is configured to perform a bit-by-bit accumulation operation on a data line selected from the first buffer circuit and a data line selected from the second buffer circuit in each operation cycle.
By providing the first buffer circuit and the second buffer circuit separately, it is possible to support transmission in different transmission manners for data to be operated on, thereby reducing the data access amount by multiplexing data as much as possible among a plurality of operation circuits within a single slave processing circuit.
The slave processing circuit 520 may further include a third buffer circuit 524 for buffering the operation result of each operation circuit CU 521.
It will be appreciated that although the individual processing and memory circuits are shown as separate modules in fig. 5, the memory and processing circuits may be combined into one module according to different configurations. For example, the first memory circuit 530 may be incorporated with the master processing circuit 510, and the second memory circuit 540 may be shared by a plurality of slave processing circuits 520, and each slave processing circuit may be assigned a separate memory region to speed up access. The disclosed embodiments are not limited in this respect. Furthermore, in the computing device, the master processing circuit and the slave processing circuit may belong to different modules of the same processor or chip, or may belong to different processors, and the disclosure is not limited in this respect.
Exemplary data splitting and storage
In the disclosed embodiment, the dimension of the multidimensional data is characterized as (N, H, W, C) or (Co, H, W, ci), which represents the storage order of the data in the memory. It will be appreciated that although the multidimensional data has multiple dimensions, there is correspondence between the multidimensional data and the order of storage on the memory because the layout of the memory is always one-dimensional. The multidimensional data is usually allocated in a continuous storage space, i.e. the multidimensional data can be one-dimensionally expanded and stored on the memory in sequence. For example, in embodiments of the present disclosure, the initial input feature maps may be stored sequentially in a low-dimensional (where C/Ci is the lowest dimension) first-in-place manner; however, in order to optimize the convolution operation, the storage order of the input feature maps may be adjusted during or before the operation, as will be described in detail later. The adjacent dimensions refer to dimensions next to each other in the dimension information representation of the multidimensional data, for example, W and Ci are adjacent, and the adjacent dimensions may also be referred to as continuous dimensions.
In an intelligent processor, the main operation unit of hardware is a vector multiplication and addition operator due to the need of calculation power and the consideration of area power consumption overhead. The support of various convolution algorithms is realized in hardware design, the multiplication and addition operation in the algorithms is extracted in a maximized mode, and input and output data of the multiplication and addition operation are exchanged between an on-chip RAM (such as NRAM, WRAM and the like in FIG. 3) and an operator efficiently through a data path.
Hardware is stored in a row-by-row (cache line) manner, and read, write and calculation operations are most efficient when the whole row is aligned, so that data is generally required to be vectorized and aligned in order to fully utilize bandwidth and meet the requirements of access amount of an arithmetic operator array and the like. The design of the artificial intelligence chip usually takes the Ci dimension as the lowest dimension, that is, the NHWC placement order mentioned above, and the data in the Ci dimension is continuous. Therefore, vectorization alignment requires that the size of Ci dimension be aligned to a specified value, for example, an alignment value M, so as to perform access number by taking the alignment value M as a unit, where M may also be referred to as a hardware maximum computation amount in one time. Based on different hardware designs, M may have different values, such as 64bit, 128bit, 256bit, 512bit, etc. Generally, the size of the input port of the operator array is also related to M, for example, in the case of symmetric bit width of the input data, the size of the input port of the operator array is generally 2 times of M, that is, input feature map data and weight data of the alignment value M scale are processed at one time. When the Ci dimension of the input feature map is large, the above alignment requirement is relatively easy to satisfy.
When the Ci dimension of the input feature map is small, for example, smaller than the size of one cache line, the Ci dimension needs to be filled into one line of data (for example, 512 bits), that is, invalid data 0 is filled. Such padding causes a large amount of redundant computation, resulting in resource waste and reduced computational efficiency.
In an embodiment of the present disclosure, a convolution operation scheme is proposed, which may be performed by the computing device of fig. 5, for example. The main processing circuit is used for acquiring an input feature map and/or a convolution kernel, wherein the input feature map and the convolution kernel are respectively split into a plurality of splitting units according to a convolution splitting scheme and the dimension storage sequence is converted, so that data in one splitting unit is continuously stored into one data line. The above-described splitting and dimension conversion of the input feature map and convolution kernel may be performed at different locations and at different times, depending on different hardware configurations and/or other considerations. In the back-propagating neuron gradient update process, top _ diff can be considered as an input feature map.
In some embodiments, the main processing circuitry may include a partitioning circuitry, i.e., a partitioning circuitry integrated in the main processing circuitry for splitting and dimension conversion storage for the input feature map and the convolution kernel, respectively. For example, the main processing circuit may read the input feature map and the convolution kernel of the original storage format from an external storage circuit (e.g., DDR), then perform splitting and dimension conversion on the input feature map and the convolution kernel, respectively, using the blocking circuit, and then store one of the input feature map and the convolution kernel in the first storage circuit and the other in the second storage circuit. The splitting process described above may be performed during or prior to an operation to prepare the data.
In other embodiments, the main processing circuit may include a partial blocking circuit configured to perform splitting and dimension conversion storage only for data in the input feature map and the convolution kernel that is determined to be multicast data, and data that is determined to be distributed data may be split and dimension converted by an external blocking circuit. For example, in one example, after the convolution kernel determined to distribute data is split and dimension-converted by the external circuit, it is stored in the second storage circuit in advance, and it may be directly stored into the second storage circuit from the off-chip storage circuit or stored into the second storage circuit via the first storage circuit.
In still other embodiments, the main processing circuit may not include the block circuit at all or perform the function of the block circuit. In these embodiments, the input signature graph and convolution kernel are stored for splitting and dimension conversion by a partitioning circuit that is independent of the main processing circuit. One of the split and dimension-converted input feature map and convolution kernel may be stored in a first storage circuit, and the other may be stored in a second storage circuit.
A corresponding convolution splitting scheme may be determined according to the size of the lowest storage dimension (e.g., ci) of the input feature map, where the convolution splitting scheme indicates at least the shape of the splitting unit of the data to be operated on. The data volume contained in one split unit does not exceed the single maximum operation volume of hardware.
In some embodiments, the data amount contained in one split unit can be set to the one-time processing alignment value M of the hardware, so that the calculation processing is performed by taking the split unit as a unit, the calculation power of the hardware can be fully exerted, and invalid calculation can be avoided or reduced.
In an exemplary description of the present disclosure, instead of assuming M =512bit =64byte, the data type may be Int8, int16, float16, or Float32, and the input profile is consistent with the data type of the convolution kernel. Since the data type requires at least a width of 1 byte and the minimum unit of arithmetic processing is one data, various calculations are performed in units of bytes in the following examples, such as M =64b, ci =28b, and the like, with the unit sometimes omitted for the sake of brevity.
When the data volume of a split cell is equal to M, the data block shape of each split cell is block c block y block x, which may have various situations, and table 1 lists several of them:
Figure BDA0003280093990000091
TABLE 1 data Block shape
As can be seen from Table 1, some data block shapes have equal dimensions in the X and Y dimensions (as indicated by the dark rows), which simplifies subsequent operations. Therefore, in the embodiment of the present disclosure, it may be preferable to split the data to be operated on using such a data block shape.
For the sake of simplicity, a splitting scheme of a 64B × 1 × 1 shape is referred to as Forward64, a splitting scheme of a 16B × 2 × 2 shape is referred to as Forward16, a splitting scheme of a 4B × 4 × 4 shape is referred to as Forward4, a splitting scheme of a 4B × 4 × 4 shape applied to a deep convolution operation is referred to as Forward1, a splitting scheme of a 4B × 4 × 4 shape applied to a reverse deep convolution operation is referred to as Update1, and a splitting scheme of a 4B × 4 × 4 shape applied to a cross-product convolution operation is referred to as Update4. In addition to Forward64, these splitting schemes are suitable for scenarios where the channel C is small in the convolution calculation, and therefore can also be collectively referred to as small convolutions. In these small convolution splitting schemes, one splitting unit includes data of the lowest storage dimension and at least one other storage dimension, and the total data amount of one splitting unit does not exceed the hardware single maximum operation amount.
Different convolution splitting schemes can be suitable for different operation scenes, so that performance optimization of different degrees is obtained. In particular, in some embodiments, the corresponding convolution splitting scheme may be determined according to at least one rule as follows:
aligning the lowest storage dimension Ci before splitting the input feature map to the nearest M/4 n Where M is the hardware single maximum operand,
Figure BDA0003280093990000101
and determining the size Uci (namely blockC) of the splitting unit on the lowest storage dimension as M/4 n
There are a plurality of nearest M/4 n When multiple of (2), take M/4 n Taking the maximum value as Uci, or taking M/4 with the minimum alignment filling amount n As Uci; and
the size Ux (i.e. blockX) and Uy (blockY) of the split unit in the X and Y storage dimensions are determined such that Uci X Uy X Ux = M, wherein Ux = Uy is preferred.
The application of the above rules is described below in connection with several examples. Assuming that M =64 in all examples, M/4 is n May be 64, 16 and 4.
In one example, assuming Ci =28, then align to the nearest M/4 n Is 4 x 7, the size Uc (i.e., blockC) of the split unit in the lowest storage dimension is determined to be 4. When Ux = Uy is preferred, the shape of the splitting unit can be determined to be 4B × 4 × 4, i.e. Forward4 scheme.
In another example, assuming Ci =112, if aligned to 64 × 2=128, 16 padding are performed; if align to 16 × 7=112, zero padding is not needed; zero padding is also not required if aligned to 4 × 28= 112. At this time, the nearest M/4 n Is 16 × 7=4 × 28=112, according to the rule, M/4 can be taken n The medium maximum value 16 is Uc. When preferably Ux = Uy, the shape of the splitting unit can be determined to be 16B × 2 × 2, i.e. Forward16 scheme.
After the splitting scheme is determined, the input feature map and the convolution kernel can be split into a plurality of corresponding splitting units according to the determined convolution splitting scheme, and the dimension storage order of the splitting units is converted, so that data in one splitting unit can be continuously stored as one data line, and subsequent reading processing is facilitated by taking the splitting unit (data line) as a unit.
In some embodiments, the data of the neurons or weights in three or four dimensions is all divided into data blocks with the size of block c block y block x (Uc × Uy × Ux), and each data block is continuously stored in one row of, for example, M =64B, so that when one row of data is read, the data of one data block is actually taken out.
Specifically, one or more splitting units may be read in a first reading order from the data to be operated stored in a first-dimension storage order, with the splitting units as a unit, and the read splitting units are stored in corresponding storage circuits, where data in each splitting unit is stored in a second-dimension storage order, and the splitting units are stored in a third-dimension storage order.
FIG. 6 illustrates an exemplary data storage sequence in accordance with embodiments of the present disclosure.
As shown in the figure, 610 represents a storage manner of a four-dimensional tensor to be computed, which includes N3-dimensional sub-tensors, where N is in the highest dimension, that is, the storage order of the first dimension of the four-dimensional tensor is NHWC. Note that H and Y, W and X may be used interchangeably herein. Each sub-tensor is divided into smaller data blocks or split units, and the number of data blocks in each dimension is C/Y/X respectively.
The middle graph 620 represents the storage of each sub-tensor, with each data block stored as a contiguous 64Byte, i.e., a row. When the order in which the data blocks are read differs, the order between the rows may also change accordingly. In the example shown in the figure, the data blocks are read in the directions of C, then X, and finally Y, i.e., the first reading order is YXC, and the rows are stored in the order of Y X C, i.e., the third dimension storage order is YXC or HWC. In this example, the third dimension storage order is the same as the first dimension storage order. It will be appreciated that other reading orders may be used, resulting in a third dimension storage order that is different from the first dimension storage order, and are not further enumerated herein.
The right diagram 630 shows the order within each row, i.e., the data order within each data block, which is shaped as blockC block y block x, when the second dimension storage order is CYX or CHW. The specific splitting scheme will be described in detail later in connection with various exemplary convolutional splitting schemes.
Exemplary packet operations and data multiplexing
The foregoing describes a hardware structure of a computing device according to an embodiment of the present disclosure, and an exemplary splitting scheme and a storage manner of data, where the hardware structure may provide different data paths for input feature maps and weights participating in operations, so as to reduce data access amount during operations and improve operation efficiency by using different data transmission manners (e.g., broadcast, multicast, distribution, etc.). The convolution calculation is that each input feature map needs to be subjected to multiplication and addition operation with each convolution kernel of Co, so that Co output feature maps are output. However, on-chip space cannot store convolution kernels and input feature maps of all scales at the same time, so that for hardware, a series of operations of repeatedly loading input feature data or weight data exist, and how to balance repeatedly loading input feature data or weight data has a certain influence on calculation efficiency. In actual operation, in order to reduce frequent off-chip access, different multiplexing modes can be adopted according to the scale characteristics of data participating in operation. In the convolution operation, there are mainly two data multiplexing methods: convolution kernel multiplexing and input feature map multiplexing.
Depending on the multiplexing scenario, convolution kernel multiplexing can be further divided into intra-channel convolution kernel multiplexing and inter-batch convolution kernel multiplexing. Intra-channel convolution kernel multiplexing applies to a single output channel, i.e., one output feature map, where there is only one set of convolution kernels: for each input feature map, multiple convolution windows may multiplex the same convolution kernel. Inter-batch convolution kernel multiplexing applies to batch processing, i.e., processing multiple input images simultaneously: the multiple input images are processed with the same set of convolution kernels, so the convolution kernels can be multiplexed.
Similarly, depending on the multiplexing scenario, input feature map multiplexing can be divided into intra-channel input feature map multiplexing and inter-channel input feature map multiplexing. Intra-channel input feature map multiplexing applies to a single output channel: for each input feature map, adjacent convolution windows may multiplex part of its data. Inter-channel input feature map multiplexing applies to multiple output channels, i.e., the case where there are multiple output feature maps (multiple sets of convolution kernels); here the input feature map within one convolution window may be convolved with the multiple sets of convolution kernels.
According to the convolution operation principle described above, the operation results along the Co dimension (the C dimension for depthwise convolution) do not need to be accumulated, so the operations for different Co values can be distributed relatively independently onto different operation circuits. In scenarios with a small number of input channels, the convolution kernel is generally small, for example Kh and Kw are usually single-digit numbers, and Co and Ci are of roughly the same size. In these embodiments, the output channel dimension Co of the convolution kernel in a single round of operation typically does not exceed the number of scheduled slave processing circuits, so a single Co value is completed by one or more slave processing circuits. Even when the Co dimension is large, this can be achieved more generally by splitting into multiple rounds of operation, where the Co size processed per round does not exceed the number of scheduled slave processing circuits. Thus, in one example, the number of rounds required to complete the convolution operation, and the number of Co values processed in each round or the corresponding grouping pattern, may first be determined based on the output channel dimension size Co of the convolution kernel and the number Ns of schedulable slave processing circuits.
When determining the number of rounds of operation required to complete the convolution operation, the number of Co processed in each round may be different, so that there may be multiple allocation patterns even for the same Co dimension size.
For example, taking the computing device shown in fig. 5, which includes 16 slave processing circuits SL, and assuming that all slave processing circuits are schedulable, i.e., Ns = 16: when Co = 40, the operation can be divided into three rounds, the first round processing the first 16 Co values with each SL handling a different Co value; the second round processing the next 16 Co values, again one Co value per SL; and the last round processing the remaining 8 Co values, with every 2 SLs handling one Co value. In another allocation, the operation can be divided into two rounds, the first round processing the first 32 Co values with each SL handling 2 different Co values, and the last round processing the remaining 8 Co values with every 2 SLs handling one Co value. For another example, when Co = 12, the operation may be completed in a single round, with each SL handling a different Co value and 4 SLs idle or performing no-effect operations. In another allocation, the operation may be divided into three rounds, each processing 4 consecutive Co values with every 4 SLs handling 1 Co value, so that all schedulable slave processing circuits are utilized in each round. It will be appreciated that those skilled in the art may envisage further allocation schemes.
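As shown below, one possible round-planning rule can be sketched in a few lines. It only illustrates the first allocation style above (at most Ns Co values per round); the helper name plan_rounds is hypothetical and not part of the disclosure.

def plan_rounds(co, ns=16):
    # Return (nco, rs) per round: nco Co values processed this round,
    # rs slave processing circuits sharing each Co value (kernel reuse Rs).
    rounds = []
    remaining = co
    while remaining > 0:
        nco = min(remaining, ns)
        rs = ns // nco
        rounds.append((nco, rs))
        remaining -= nco
    return rounds

print(plan_rounds(40, 16))   # [(16, 1), (16, 1), (8, 2)] -> the first Co = 40 allocation above
print(plan_rounds(12, 16))   # [(12, 1)] -> single round with 4 SLs idle, as in the Co = 12 example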
Therefore, regardless of the allocation method, within a single round of operation there are two possible cases for Co: multiple slave processing circuits process one Co value, or a single slave processing circuit processes one or more Co values. Specifically, in a single round processing Nco output channels, every Rs SLs form a slave processing circuit group SLB that processes the convolution kernels corresponding to the same Co value, where Rs = [Ns/Nco]; that is, the same convolution kernel is multiplexed across the Rs SLs of an SLB, and Rs represents the number of times the convolution kernel is multiplexed among the slave processing circuits. Correspondingly, the input feature maps may be multiplexed among the slave processing circuit groups SLB, with Rn = [Ns/Rs] representing the number of times the input feature maps are multiplexed among the slave processing circuits.
Alternatively or additionally, when each slave processing circuit processes the convolution kernels corresponding to rn Co values, rn = [Nco/Ns], the input feature map processed by each slave processing circuit may be reused for the rn convolution kernels, rn representing the number of times the input feature map is multiplexed within a single slave processing circuit. The maximum convolution kernel multiplexing factor rs and the maximum input feature map multiplexing factor rn applicable within a single slave processing circuit may be determined by taking into account factors such as hardware buffer space limitations (e.g., the sizes of the first buffer circuit and the second buffer circuit in fig. 5).
Considering the buffer size limitation and the multiplexing gain in hardware circuits, some embodiments of the present disclosure do not, for now, consider the case of one slave processing circuit processing multiple Co values in a single round; only the case of one or more slave processing circuits processing a single Co value in a single round is considered.
Different grouping modes can be used depending on the number of slave processing circuits SL that process the same Co value in a single round of operation. It will be appreciated that the invocable slave processing circuits SL are preferably distributed evenly so as to balance the computation load; for example, in groups of 2 SLs each, so that 16 SLs can process 8 Co values simultaneously, or in groups of 4 SLs each, so that 16 SLs can handle 4 Co values at the same time, and so on. In some embodiments, for a computing device including Ns = 16 SLs as shown in fig. 5, the following grouping modes may be selected: the Group1 mode, the Group4 mode and the Group16 mode. Those skilled in the art will appreciate that different grouping patterns exist for other values of Ns, and each can be handled with reference to the three representative grouping patterns given herein.
In some embodiments, the grouping pattern may be collectively denoted GroupN, meaning that all the slave processing circuits SL scheduled in the current round of operation are divided into N groups, each slave processing circuit group SLB processes the same Co value, and different slave processing circuit groups SLB process different Co values. For the case of 16 schedulable SLs in total, N may be 1, 4 or 16, corresponding to Group1, Group4 and Group16 above, respectively.
Figures 7a-7d illustrate several exemplary grouping patterns according to embodiments of the present disclosure. Fig. 7a shows the Group1 mode, fig. 7b shows the Group16 mode, fig. 7c shows one Group4 mode, and fig. 7d shows another Group4 mode.
As shown in fig. 7a, the Group1 mode means that all 16 schedulable SLs belong to one group and collectively handle one Co value; for example, SL0 to SL15 all belong to group G0. The operation for that one output channel is thus distributed over 16 SLs. In this mode, it may be preferable to transmit the convolution kernel 720 of the output channel to each SL in a broadcast manner, while the input feature map 710 is split and allocated to the SLs, so as to improve access efficiency.
In one embodiment, the convolution kernel may be stored on the first storage circuit 530 of FIG. 5 for transmission over the broadcast channel. The input feature map may then be divided according to the XY directions of the output feature map and stored in the second storage circuit 540 for assignment to different SLs. Thus, all SLs collectively compute the output feature map of one Co. The division and storage of the input feature map will be described in detail later with reference to the drawings.
As shown in fig. 7b, the Group16 mode means that all 16 schedulable SLs are divided into 16 groups, i.e., one SL per group, with each SL handling a different Co value. For example, SL0 belongs to group G0, SL1 belongs to group G1, and so on, until SL15 belongs to group G15. In this mode, the same block of input feature map 730 may be reused across the 16 SLs, so it may be preferable to broadcast the input feature map 730 to each SL, while the convolution kernels 740 corresponding to different Co values are distributed to the corresponding SLs.
In one embodiment, the input feature map may be stored on the first storage circuit 530 of FIG. 5 for transmission over the broadcast channel. The convolution kernels are then stored in the second storage circuit 540, partitioned by Co, to be assigned to different SLs. Thus, all SLs compute the output feature maps of different Co values for the same input feature map.
The Group4 mode means that all 16 schedulable SLs are divided into 4 groups, each group handling one Co value. Each SL group (SLB for short) contains Rs = Ns/4 = 4 SLs. For example, SL0 to SL3 belong to group G0, SL4 to SL7 belong to group G1, SL8 to SL11 belong to group G2, and SL12 to SL15 belong to group G3. This pattern lies between Group1 and Group16, so either the convolution kernel or the input feature map can be determined as the multicast data, with the other determined as the distribution data.
In one embodiment, the convolution kernels may be divided into 4 groups by Co and stored on the first storage circuit 530 of fig. 5 for transmission over the broadcast channel. The input feature map can be divided into 4 parts according to the XY directions of the output feature map and stored in the second storage circuit 540 to be distributed to the 4 SLBs. Each SLB obtains the same complete input feature map, which is distributed within the SLB to its 4 SLs as the 4 divided parts. Thus, all SLs in each SLB together compute the output feature map of one Co value, with the 4 SLBs each processing a different Co.
In another embodiment, the convolution kernel may be stored on the second storage circuit 540 of fig. 5, while the input feature map is stored on the first storage circuit 530, in a manner similar to the previous embodiment.
In this mode, the partition of the convolution kernel between SLBs can be done in a number of ways.
Fig. 7c shows one Co allocation 770 of the convolution kernels. In this manner, the convolution kernels are divided into 4 groups, assigned to the groups by Co at an interval of 1. For example, when Co = 12, the 4 groups of Co values are {0,4,8}, {1,5,9}, {2,6,10} and {3,7,11}, respectively. One Co from each group is transmitted at a time; for example, the first transmission covers Co = 0 to 3, with one Co per SLB and the 4 SLs within an SLB sharing the same weights, the second transmission covers Co = 4 to 7, and so on. In this way, after each round of operation is completed, the Co dimension of the operation results output by the SLBs is contiguous.
Fig. 7d shows another Co allocation 780 of the convolution kernels. In this manner, the convolution kernels are divided by Co into 4 contiguous, equal groups. For example, when Co = 12, the 4 groups of Co values are {0,1,2}, {3,4,5}, {6,7,8} and {9,10,11}, respectively. One Co from each group is transmitted at a time; for example, the first transmission covers Co = 0, 3, 6, 9, with one Co per SLB and the 4 SLs within an SLB sharing the same weights, the second transmission covers Co = 1, 4, 7, 10, and so on. In this way, the Co dimension of the operation results output by each SLB is contiguous across the multiple rounds of operation.
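The two partition rules can be illustrated with a short sketch; the function names are hypothetical and the group count of 4 simply follows the Group4 example above.

def split_co_interleaved(co, groups=4):
    # Fig. 7c style: interval-1 assignment, e.g. Co = 12 -> {0,4,8}, {1,5,9}, {2,6,10}, {3,7,11}
    return [list(range(g, co, groups)) for g in range(groups)]

def split_co_contiguous(co, groups=4):
    # Fig. 7d style: contiguous equal groups, e.g. Co = 12 -> {0,1,2}, {3,4,5}, {6,7,8}, {9,10,11}
    per = co // groups
    return [list(range(g * per, (g + 1) * per)) for g in range(groups)]

print(split_co_interleaved(12))   # first transmission takes element 0 of each group: Co = 0..3
print(split_co_contiguous(12))    # first transmission takes element 0 of each group: Co = 0, 3, 6, 9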
Exemplary splitting of the input feature map
As can be seen from the foregoing description, when multiple SLs collectively process one Co value, the input feature map needs to be split among the multiple SLs, for example, the Group1 grouping mode needs to split the input feature map into 16 parts, and the Group4 grouping mode needs to split the input feature map into 4 parts.
To ensure that the split parts of the input feature map can share one convolution kernel, the split can be made along the Ho/Wo directions of the output feature map and then mapped back onto the input feature map. In some embodiments, the input feature map may be divided among the Rs slave processing circuits SL of each slave processing circuit group as follows: according to the size of the corresponding output feature map, evenly divide the output feature map in the XY dimensions (i.e., the Ho/Wo dimensions) into Rs output feature blocks of the same shape; then, according to the input feature map region required to calculate each output feature block, divide the input feature map in the XY dimensions (i.e., the Hi/Wi dimensions) into Rs input feature blocks to be assigned to the Rs slave processing circuits. It will be appreciated that, depending on the convolution kernel size and convolution stride, the input feature map regions corresponding to adjacent output points on the output feature map may overlap.
FIG. 8 illustrates an exemplary splitting of an input feature map according to an embodiment of the disclosure. In this example, the input feature map is divided into 16 parts distributed over 16 SLs, corresponding to the Group1 mode.
In the figure, 810 represents the output feature map of a single Co, which is divided in the XY directions, in a 4 × 4 manner, into 16 output feature blocks of the same shape assigned to SL0 to SL15, respectively. The 16 output feature blocks can then be mapped back onto the input feature map 820 to obtain the 16 input feature map regions required to calculate them, which likewise divides the input feature map in the XY directions. These 16 input feature map regions may be correspondingly assigned to the 16 slave processing circuits SL.
According to the foregoing description, the input feature map is split in units of split units according to the determined convolution splitting scheme. Therefore, in the above embodiment, the input feature map is partitioned such that each input feature map block is a multiple of the split unit dimensions in the XY directions, that is, each input feature map block is aligned to the split unit in the XY directions. For example, when a 4 × 4 × 4 convolution splitting scheme is selected, each input feature block is aligned to 4 × 4; whereas when a 16 × 2 × 2 convolution splitting scheme is selected, each input feature block is aligned to 2 × 2.
For the case where the output feature maps are not aligned in split units (e.g., 4 × 4 or 2 × 2), corresponding padding (e.g., 0 padding) on the input feature maps is required so that the actually calculated output XY is aligned in split units (e.g., 4 × 4 or 2 × 2) and the input XY is also aligned in split units (e.g., 4 × 4 or 2 × 2).
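A rough sketch of this block-to-region mapping and split-unit alignment is given below; it assumes unit stride and no dilation, and the helper names are illustrative rather than taken from the disclosure.

import math

def pad_to_unit(size, unit):
    # Zero-pad a dimension up to the next multiple of the split unit
    return math.ceil(size / unit) * unit

def input_region_for_output_block(oy0, ox0, oh, ow, kh, kw, sy=1, sx=1):
    # Input region (top-left corner and extent) needed for an oh x ow output block
    iy0, ix0 = oy0 * sy, ox0 * sx
    ih = (oh - 1) * sy + kh      # adjacent blocks overlap by kh - sy rows
    iw = (ow - 1) * sx + kw
    return iy0, ix0, ih, iw

ho = wo = pad_to_unit(14, 4)     # a 14 x 14 output plane padded to 16 x 16 for 4 x 4 alignment
bh, bw = ho // 4, wo // 4        # Group1 mode: 4 x 4 grid of output blocks, one per SL
print(input_region_for_output_block(0, 0, bh, bw, 3, 3))   # (0, 0, 6, 6) for a 3 x 3 kernel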
It will be understood by those skilled in the art that the output feature map may be divided in the XY direction according to other rules, for example, the output feature map may be divided into 16 output feature blocks with the same shape according to a 1 × 16 scheme, and the output feature blocks are respectively allocated to SL0 to SL15. The disclosed embodiments are not limited in this respect. Furthermore, it is to be understood that, although the foregoing is described in conjunction with splitting between slave processing circuits, this splitting manner may also be applied to splitting in other scenarios, for example, splitting between operation circuits CU within a single slave processing circuit SL, and the embodiments of the present disclosure are not limited in this respect.
Examples of data storage on the second storage circuit
As previously described, one of the input feature map and the convolution kernel may be stored on the first storage circuit 530 of fig. 5, and the other may be stored on the second storage circuit 540. The data in the first storage circuit may be multicast over the broadcast path, while the data in the second storage circuit is typically distributed. Reasonably allocating the storage of the various data can increase data access speed. In some embodiments, the second storage circuit may assign a storage region to each slave processing circuit SL, so that the data required for the operation of each slave processing circuit need only be read from its corresponding storage region.
Figs. 9a-9d illustrate data storage schemes in the second storage circuit according to embodiments of the present disclosure. The figures exemplarily show 16 storage regions 900 to 915 allocated for, e.g., Ns = 16 slave processing circuits SL0 to SL15. Each storage region stores the convolution kernel or input feature map to be processed by its slave processing circuit. It will be appreciated that the contents of the storage regions differ depending on the grouping scheme.
Fig. 9a shows that in the Group1 mode, the input feature map is divided into 16 parts FB0 to FB15 and stored in the respective storage regions of the second storage circuit. The storage region of each SL stores one contiguous two-dimensional region obtained by splitting, for example, in the manner of fig. 8. Within each two-dimensional region, the split units described above are stored as rows, i.e., one row corresponds to one split unit of the input feature map. For example, assuming each split input feature block contains 4 split units, i.e., 4 lines of data, the storage region 900 allocated to SL0 stores, in order, the first to fourth input feature lines Line01 to Line04. Each row may also be referred to as an input feature line.
Fig. 9b shows that in the Group16 mode, convolution kernels are stored in respective storage areas of the second storage circuit according to Co division to be assigned to corresponding SLs. The storage area for each SL stores the convolution kernel assigned to its different Co value. For example, two Co allocation schemes are described above, and correspondingly, there are two storage schemes. One of which is shown in fig. 9b, i.e. successive Co values are assigned to the SLs in order in each round of operation. Thus, after each round of operation is completed, the Co dimension of the operation result output by each SL is continuous. For example, the figure shows that convolution kernels of Co =0 to 15 in the first round of operation are sequentially stored in 16 storage areas 900 to 915; convolution kernels with Co = 16-31 in the second round of operation are sequentially stored in the 16 storage areas 900-915; and so on. It is understood that in Group16 mode, the input feature map may also be stored on a second storage circuit (not shown). At this time, the input feature map is directly copied into 16 copies without splitting, and the copies are respectively stored in the storage areas of the second storage circuit to be allocated to the corresponding SLs, so that each SL can perform convolution operation on the same input feature map and convolution kernels with different Co values.
Fig. 9c shows one possible storage content in the Group4 mode. In this example, the input feature map is divided into 4 parts, replicated 4 times, and stored in the respective storage regions of the second storage circuit. Specifically, each slave processing circuit group SLB processes convolution kernels of different Co values for the same input feature map, while the 4 SLs within each SLB each process one split input feature block. Therefore, the storage contents of the storage regions for the 4 SLBs in the figure are the same; for example, the contents of 900 to 903 are the same as those of 912 to 915. Further, within each SLB, the storage regions for different SLs store different split input feature blocks, e.g., 900 stores input feature block FB0, 901 stores input feature block FB1, and so on. The storage regions of the other SLBs are allocated in the same way, which is not repeated here.
Fig. 9d shows another possible storage content in the Group4 mode. In this example, the convolution kernels are divided into 4 groups by Co and stored in the respective storage regions of the second storage circuit. Specifically, the convolution kernels are assigned to the groups by Co at an interval of 1. For example, when Co = 16, the Co values are assigned to the 4 SLBs round by round: Co = 0 is assigned to G0 {SL0 to SL3}, Co = 1 to G1 {SL4 to SL7}, Co = 2 to G2 {SL8 to SL11}, and Co = 3 to G3 {SL12 to SL15}; the assignment then continues from Co = 4 to the 4 SLBs in turn. The 4 SLs within each SLB share the same weights; for example, the same weight values are stored in storage regions 900, 901, 902 and 903. Likewise, Co may also be allocated contiguously within a single SLB; those skilled in the art can deduce the corresponding storage manner with reference to the foregoing description, which is not detailed here.
Exemplary convolution operation procedure within a single slave processing circuit
After the data to be operated on are split and stored accordingly, a plurality of slave processing circuits can be scheduled to perform convolution operations on the corresponding data lines of the input feature map and the convolution kernels, and the operation results returned by the slave processing circuits can then be spliced according to the convolution splitting scheme to obtain the output feature map of the convolution of the input feature map and the convolution kernels. Specifically, the convolution operation may be performed using the plurality of operation circuits CU and the respective buffer circuits within the slave processing circuit (see fig. 5). Depending on the buffer space within the slave processing circuit and the computing-power limits of the operation circuits, multiple computations are typically needed in each round of operation to complete the desired operation.
In some embodiments, the first buffer circuit may be used to buffer the input feature map, which may come from the first storage circuit or the second storage circuit; correspondingly, the second buffer circuit may be used to buffer the convolution kernel, which may come from the second storage circuit or the first storage circuit. As mentioned above, the convolution operation is performed in units of split units (one data line each), so that the computing power of the hardware can be fully utilized and invalid computation avoided or reduced. Thus, in each computation, each operation circuit CU may perform an element-wise multiply-accumulate on a data line (e.g., an input feature line) selected from the first buffer circuit and a data line (e.g., a weight line) selected from the second buffer circuit. For simplicity, the following describes the processing within a single slave processing circuit SL, it being understood that similar processing is performed within the other SLs.
As can be seen from the foregoing description, in the conventional 3D convolution scenario, all the arithmetic circuits within a single slave processing circuit calculate the output feature map, or a partial output feature map, corresponding to the same output channel Co. Depending on the sizes of the buffer spaces of the first and second buffer circuits within the slave processing circuit SL and the processing capability of the arithmetic circuits CU (e.g., internal registers, etc.), the slave processing circuit may not be able to calculate the output feature map assigned to it all at once. The output feature map may therefore be divided into output feature blocks in units of the single-operation capability of the Ncu schedulable arithmetic circuits within a single SL, i.e., Ncu × Nop output points, where Nop is the number of output points or partial sums one arithmetic circuit computes in a single operation. For example, taking the example of fig. 5 above where each SL includes 4 CUs, and assuming each CU can compute Nop = 4 output points or partial sums at a time, a single SL can compute 4 × 4 = 16 output points (or partial sums) at a time. Therefore, the output feature map can be divided, in the XoYo dimensions, into output feature blocks aligned to 16 output points, and the output feature blocks can be calculated one by one. The 16 output points may be arranged as 4 × 4 or 1 × 16, and embodiments of the present disclosure are not limited in this respect.
When calculating each divided output feature block, the output points of the block may further be divided among the Ncu arithmetic circuits to determine the processing object of each arithmetic circuit. Then, according to this division of output points, with the split unit as a sliding window, Ncu input feature data lines are selected from the first buffer circuit and distributed to the Ncu arithmetic circuits, and the corresponding weight data is selected from the second buffer circuit and broadcast to the Ncu arithmetic circuits, so that the output points corresponding to each sliding window are calculated in parallel by multiplexing the weight data. Nk sliding selections are performed, where Nk is determined by the smaller of the convolution kernel size in the X and Y dimensions and the maximum convolution kernel size supported by a single operation of the slave processing circuit.
In some embodiments, when performing a three-dimensional convolution operation, the corresponding weight data may be selected as follows: 1/Nop of a weight line is selected from the second buffer circuit according to the corresponding sliding in the first buffer circuit, copied Nop-1 times to expand it into one extended weight line, and broadcast to the Ncu arithmetic circuits in the slave processing circuit.
In this case, during each sliding selection period, each arithmetic circuit may perform an element-wise multiply-accumulate, in units of 1/Nop of a data line, on one input feature line from the first buffer circuit and one extended weight line from the second buffer circuit, obtaining Nop partial sums; the Nk × Nop partial sums obtained over the Nk sliding selection periods are then accumulated according to the convolution output points to which they belong, yielding Nop operation results for output.
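A behavioural sketch of one such computation step is shown below, assuming a 64-element data line and Nop = 4 (so 1/Nop of a line is 16 elements); the NumPy code and the name cu_step are illustrative only and do not represent the actual circuit.

import numpy as np

def cu_step(feature_line, weight_subrow, nop=4):
    # One arithmetic circuit, one sliding selection period:
    # replicate a 1/Nop weight sub-line into a full line and accumulate segment-wise.
    seg = feature_line.size // nop
    assert weight_subrow.size == seg
    extended = np.tile(weight_subrow, nop)                    # copy Nop-1 times
    prod = feature_line.astype(np.int32) * extended.astype(np.int32)
    return prod.reshape(nop, seg).sum(axis=1)                 # Nop partial sums

feature = np.random.randint(-8, 8, 64).astype(np.int8)        # one input feature line
w_sub = np.random.randint(-8, 8, 16).astype(np.int8)          # 1/Nop of a weight line
partials = cu_step(feature, w_sub)                            # accumulated over Nk slidings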
When the slave processing circuit outputs the output points of its arithmetic circuits, the output points can be output in a specific order according to the way the output points were divided, so that consecutively output points are contiguous in the X and/or Y dimensions, which facilitates subsequent processing. In some embodiments, the aforementioned blocking circuit may further store the operation results returned from the slave processing circuits in the fourth-dimension storage order. Where appropriate, the blocking circuit can also convert the operation results into a desired dimension storage order for storage.
The output points can be divided among the arithmetic circuits in various ways; accordingly, the sliding-selection convolution process and the output order of the output points differ.
Fig. 10a-10b show two different output point divisions between the arithmetic circuits.
FIG. 10a illustrates assigning consecutive output points to each arithmetic circuit, according to some embodiments of the present disclosure. In these embodiments, the output feature block may be equally divided among the Ncu arithmetic circuits into Ncu output feature sub-blocks of the same shape, each containing Nop output points, so that each arithmetic circuit is responsible for one output feature sub-block. For example, in the above example, the output feature block 1010a is shown to include 4 × 4 output points, each of the equally divided output feature sub-blocks 1011a to 1014a includes 2 × 2 output points, and each arithmetic circuit calculates 2 × 2 consecutive output points (or partial sums) at a time. The output points assigned to the 4 different arithmetic circuits CU0 to CU3 are shown with different backgrounds in the figure.
Based on this output point division, when the convolution operation is performed by sliding selection, Ncu data lines may be selected from the first buffer circuit, one corresponding to each output feature sub-block position, according to the data required for calculating the Ncu output feature sub-blocks, and operated on.
For example, at the first selection of input feature data, based on the 4 input feature blocks required for calculating the 4 output feature sub-blocks 1011a to 1014a, the first input data line of each may be selected from the corresponding input feature block and distributed to the 4 arithmetic circuits.
When selecting weight data, the corresponding weight data can be selected from the second buffer circuit and broadcast to the Ncu arithmetic circuits, which calculate their corresponding output points in parallel by multiplexing the weight data.
Further, in some embodiments, in order to fully exploit the computational power (e.g., multiply-add operator) inside the arithmetic unit CU, e.g., to compute Nop output points or partial sums at a single time, weight multiplexing may be performed within a single input data line, thereby computing Nop output points or partial sums simultaneously.
For example, when selecting weight data, only 1/Nop of a weight line may be selected and copied Nop-1 times to expand it into one weight line, where the expanded weight line consists of Nop identical 1/Nop sub-lines. The extended weight line may be broadcast to the Ncu arithmetic circuits, so that the weights are multiplexed among the arithmetic circuits and, at a finer granularity (e.g., 1/Nop of a line), among the computations of the Nop output points of a single arithmetic circuit.
Thus, by fetching Ncu input feature data lines and 1/Nop of a weight line (copied and expanded into one weight line) each time, Ncu × Nop output points or partial sums can be calculated per computation. When the calculation results are partial sums, they are computed over multiple slidings and accumulated according to the output points to which they belong, giving the final results.
The number of slidings and the sliding step of the convolution operation can be determined from the division of the output points. For the partitioning of fig. 10a, the number of slidings Nk = Kx × Ky and the sliding step is 1, where Kx and Ky are, respectively, the smaller of the convolution kernel size in the X and Y dimensions and the maximum convolution kernel size supported by a single operation of the slave processing circuit. The maximum convolution kernel size supported by a single operation of the slave processing circuit is determined, for example, by the space of the first buffer circuit and the second buffer circuit. It will be appreciated that when the convolution kernel exceeds the maximum convolution kernel size, it must be split in the Kx and Ky directions according to the maximum convolution kernel size.
With the division manner of fig. 10a, since the output points calculated by each arithmetic circuit are contiguous in the X and/or Y dimensions, the operation results of the arithmetic circuits can be output one by one. For example, the result of one arithmetic circuit (2 × 2 output points) is output each time, in the order of the arithmetic circuits, and the 4 × 4 output feature block is returned in 4 consecutive outputs.
FIG. 10b illustrates assigning spaced output points to each arithmetic circuit, according to further embodiments of the present disclosure. In these embodiments, the output feature block may be equally divided among the Ncu arithmetic circuits into Nop output feature sub-blocks of the same shape, each containing Ncu output points that are respectively assigned to the Ncu arithmetic circuits. For example, the figure likewise illustrates the output feature block 1010b comprising 4 × 4 output points, where each of the equally divided output feature sub-blocks 1011b to 1014b includes 2 × 2 output points. Within each output feature sub-block, these 2 × 2 output points are assigned to the 4 arithmetic circuits, so that each arithmetic circuit calculates one output point in each of the Nop output feature sub-blocks. The output points assigned to the 4 different arithmetic circuits CU0 to CU3 are shown with different backgrounds in the figure.
Based on this output point division, when the convolution operation is performed by sliding selection, Ncu data lines can be selected from the first buffer circuit, corresponding to the output point positions within each output feature sub-block, according to the data required for calculating the output feature sub-block, and operated on.
For example, at the first selection of input feature data, based on the 4 input feature blocks required for calculating the 4 output points in the first output feature sub-block 1011b, 4 input data lines may be selected from the corresponding input feature blocks and distributed to the 4 arithmetic circuits. It will be appreciated that since these 4 output points are adjacent in the X and/or Y direction, the spacing, or step, in the X and/or Y direction between the 4 simultaneously selected input data lines is 1.
When selecting weight data, the corresponding weight data can be selected from the second buffer circuit and broadcast to the Ncu arithmetic circuits, which calculate their corresponding output points in parallel by multiplexing the weight data.
Further, in some embodiments, in order to fully exploit the computational power (e.g., multiply-add operator) inside the arithmetic unit CU, e.g., to compute Nop output points or partial sums at a single time, weight multiplexing may be performed within a single input data line, thereby computing Nop output points or partial sums simultaneously.
For example, when selecting weight data, only 1/Nop of a weight line may be selected and copied Nop-1 times to expand it into one weight line, where the expanded weight line consists of Nop identical 1/Nop sub-lines. The extended weight line may be broadcast to the Ncu arithmetic circuits, so that the weights are multiplexed among the arithmetic circuits and, at a finer granularity (e.g., 1/Nop of a line), among the computations of the Nop output points of a single arithmetic circuit.
Thus, by fetching Ncu input feature data lines and 1/Nop of a weight line (copied and expanded into one weight line) each time, Ncu × Nop output points or partial sums can be calculated per computation. When the calculation results are partial sums, they are computed over multiple slidings and accumulated according to the output points to which they belong, giving the final results.
The number of slidings and the sliding step of the convolution operation can be determined from the division of the output points. For the partitioning of fig. 10b, the number of slidings Nk = ceil(Kx/2) × ceil(Ky/2) and the sliding step is 2, where Kx and Ky are, respectively, the smaller of the convolution kernel size in the X and Y dimensions and the maximum convolution kernel size supported by a single operation of the slave processing circuit. Likewise, the maximum convolution kernel size supported by a single operation of the slave processing circuit is determined, for example, by the space of the first buffer circuit and the second buffer circuit. It will be appreciated that when the convolution kernel exceeds the maximum convolution kernel size, it must be split in the Kx and Ky directions according to the maximum convolution kernel size.
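The sliding parameters of the two divisions can be summarised in a small sketch; it simply restates the two formulas above and assumes the kernel already fits within the maximum size supported by a single operation.

import math

def sliding_params(kx, ky, mode):
    if mode == "contiguous":     # Fig. 10a: each CU owns a block of adjacent output points
        return {"nk": kx * ky, "step": 1}
    if mode == "interleaved":    # Fig. 10b: each CU owns one output point per 2 x 2 sub-block
        return {"nk": math.ceil(kx / 2) * math.ceil(ky / 2), "step": 2}
    raise ValueError(mode)

print(sliding_params(3, 3, "contiguous"))    # {'nk': 9, 'step': 1}
print(sliding_params(3, 3, "interleaved"))   # {'nk': 4, 'step': 2}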
With the division manner of fig. 10b, the output points calculated by each arithmetic circuit are spaced, i.e., discontinuous, in the X and/or Y dimensions, so a subset of the results of some of the arithmetic circuits needs to be selected for each output, such that the output points are contiguous in the X and/or Y dimensions. For example, 1 × 4 operation results may be output one row at a time, returning the 4 × 4 output feature block in 4 consecutive outputs; in this example, the first row needs two results from CU0 and two from CU1, the second row needs two results from CU2 and two from CU3, and so on. In another example, 2 × 2 operation results may still be output at a time, returning the 4 × 4 output feature block in 4 consecutive outputs; in this example, the first output takes the first result of each of CU0 to CU3, the second output takes the second result of each of CU0 to CU3, and so on. In yet another example, the operation results may be output by columns, which is not described again here.
Furthermore, it is contemplated that a slave processing circuit may calculate a plurality of 4 × 4 regions, for example, a maximum of 16 4 × 4 regions in the Xo/Yo direction, using registers inside the arithmetic circuit CU. At this time, the weight or the neuron may be multiplexed according to the storage content in the second storage circuit, and the reading frequency of the second storage circuit may be reduced. The calculated result, if a partial sum, is stored in a register within the arithmetic circuit.
In these embodiments, each slave processing circuit may control the way the weight data lines and the input feature map data lines are read according to the weight multiplexing and/or input feature map multiplexing manner, so that, over multiple operations, the weight data and the input feature map data traverse the entire convolution window of a convolution output point, producing multiple partial sums that are accumulated into the convolution output at that output point.
The following describes a detailed operation process applied to different types of convolution operations by using different convolution splitting schemes in combination with specific embodiments.
The embodiment is as follows: update1
In Update1, the shape of the split unit is 4B × 4 × 4, and the operation process described here also applies to convolution splitting schemes with similar split units. The size of the split unit indicated by these convolution splitting schemes may be expressed as Uci × Uy × Ux = M, where Uci is the size of the split unit in the lowest storage dimension (e.g., the Ci dimension) of the input feature map and the original convolution kernel, Ux and Uy are the sizes of the split unit in the X and Y storage dimensions of the input feature map and the original convolution kernel, respectively, and M is the maximum amount of data the hardware can operate on in a single operation. In these convolution splitting schemes, Ux = Uy ≥ Uci > 1 and Uci = M/4^n.
For example, assuming M = 64, M/4^n may be 64, 16, 4 or 1; according to Ux = Uy ≥ Uci > 1, the split unit may have the shape 4B × 4 × 4. When using this convolution splitting scheme, the Ci dimensions of the input feature map and the convolution kernel need to be aligned to 4B. For example, when Ci = 10, it can be zero-padded to 3 × 4 = 12, so that there are 3 split units along the Ci dimension.
As another example, assuming M = 128, M/4^n may be 128, 32, 8 or 2; according to Ux = Uy ≥ Uci > 1, the split unit may have the shape 2B × 8 × 8. When using this convolution splitting scheme, the Ci dimensions of the input feature map and the convolution kernel need to be aligned to 2B. For example, when Ci = 3, it can be zero-padded to 2 × 2 = 4, so that there are 2 split units along the Ci dimension.
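The constraints above can be checked with a short enumeration; this is only an illustrative sketch of the stated conditions (Uci × Uy × Ux = M, Ux = Uy ≥ Uci > 1, Uci = M/4^n), and the helper names are not from the disclosure.

import math

def candidate_split_units(m):
    shapes = []
    n = 0
    while m // (4 ** n) >= 1:
        uci = m // (4 ** n)                    # Uci = M / 4^n
        side = math.isqrt(m // uci)            # Ux = Uy
        if side * side * uci == m and side >= uci > 1:
            shapes.append((uci, side, side))
        n += 1
    return shapes

def align_ci(ci, uci):
    return math.ceil(ci / uci) * uci           # zero-pad Ci to a multiple of Uci

print(candidate_split_units(64))    # [(4, 4, 4)]  -> the 4B x 4 x 4 scheme
print(candidate_split_units(128))   # [(2, 8, 8)]  -> the 2B x 8 x 8 scheme
print(align_ci(10, 4))              # 12, i.e. 3 split units along Ci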
Update1 is applied to the depthwise convolution operation in the backward training of a neural network model, which is a 2D convolution operation. For the principle of the backward depthwise convolution operation, reference can be made to the description above in connection with fig. 4b. In this backward depthwise convolution scenario, top_diff and bottom_data are usually large, so a different optimized operation scheme is required.
In the following description, although the data to be operated on will be referred to as top_diff and bottom_data, the foregoing description for the convolution kernel applies similarly to top_diff, and the description for the input feature map applies similarly to bottom_data; that is, the terms may be used interchangeably. The following description also applies to convolution splitting schemes similar to Update1.
Since the input channels are not accumulated in the depthwise convolution, the dimensions of top_diff and bottom_data can be reduced to three: C (channel), H (height) and W (width). The shape of the split unit indicated by these convolution splitting schemes likewise satisfies Uc × Uy × Ux = M, where Uc is the size of the split unit in the lowest storage dimension (e.g., the C dimension) of the original bottom_data and top_diff, Ux and Uy are the sizes of the split unit in the X and Y storage dimensions of the original bottom_data and top_diff, respectively, and M is the maximum single-operation amount of the hardware. In these convolution splitting schemes, Ux = Uy ≥ Uc > 1 and Uc = M/4^n.
Because the products along the C channel dimension are not accumulated in the depthwise convolution operation, an operator that, in conventional 3D convolution, multiplies 64 numbers by 64 numbers along the C dimension and accumulates them into 1 number now produces 64 numbers. That is, the computing power of the operator is wasted because the C dimension is not accumulated, which causes a performance loss. In order to fully utilize the computing power of the operator, the splitting scheme moves data from dimensions that do need to be accumulated (such as the HW dimensions) into the C dimension, thereby improving operator utilization. For example, when a 4B × 4 × 4 split unit is adopted and the data type is int8, multiplying 64 numbers by 64 numbers and accumulating yields 4 numbers instead of the original 64.
Figure 11 illustrates a split and store diagram for the Update1 scheme according to one embodiment of the present disclosure. For simplicity, the illustration in the figure assumes a data type of Int8.
Diagram 1110 shows the raw data to be operated on (which may be neurons or weights) in its HWC storage order. Also shown are 4 data blocks 1111-1114 into which the original data is split in units of split units, each data block comprising 4 × 4 × 4 = 64 data elements.
The split data is shown at 1120 in a tiled format for easy reading. It can be seen that each original data block (e.g., 1111-1114) is laid out as one row (e.g., 1121-1124). Within each row, data is stored in CHW order; for example, in data row 1121, the 16 data of C = 0 are stored first, then the 16 of C = 1, then the 16 of C = 2, and finally the 16 of C = 3.
Specifically, bottom_data needs to be rearranged from the shape [Hi, Wi, C] into [Hi/4, Wi/4, C/4, (4 × 4 × 4)], the shape of a six-dimensional tensor with the N dimension omitted. Likewise, top_diff needs to be rearranged from the shape [Ho, Wo, C] into [Ho/4, Wo/4, C/4, (4 × 4 × 4)], also a six-dimensional tensor with the N dimension omitted.
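A compact NumPy sketch of this placement for bottom_data is given below (the same transform applies to top_diff); it assumes Hi, Wi and C are already padded to multiples of 4 and that, as described for fig. 11, the data within each 4 × 4 × 4 row is ordered C, then H, then W. The function name is illustrative.

import numpy as np

def to_update1_layout(x_hwc):
    # [Hi, Wi, C] -> [Hi/4, Wi/4, C/4, 4, 4, 4], intra-block order C, H, W
    hi, wi, c = x_hwc.shape
    t = x_hwc.reshape(hi // 4, 4, wi // 4, 4, c // 4, 4)
    return np.ascontiguousarray(t.transpose(0, 2, 4, 5, 1, 3))

bottom = np.arange(8 * 8 * 8).reshape(8, 8, 8)     # a small [Hi, Wi, C] example
print(to_update1_layout(bottom).shape)             # (2, 2, 2, 4, 4, 4)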
When the Update1 convolution splitting scheme is executed on the computing apparatus shown in fig. 5, bottom_data and top_diff may be split into a plurality of split units according to the Update1 convolution splitting scheme by a blocking circuit integrated within the main processing circuit, or by a blocking circuit wholly or partially independent of the main processing circuit. The blocking circuit may also convert the dimension storage order of bottom_data and top_diff so that the data within each split unit is stored contiguously as one data line. The split and converted bottom_data and/or top_diff may be provided to the master processing circuit or the slave processing circuits. The master processing circuit may then distribute the data it obtains to a plurality of slave processing circuits for performing the convolution operation, and, according to the convolution splitting scheme, splice the operation results returned from the slave processing circuits to obtain the output ΔW (also called weight_diff) of the depthwise convolution of bottom_data and top_diff. The plurality of slave processing circuits perform the convolution operation on the data they obtain and return the operation results to the master processing circuit.
In order to make full use of the schedulable slave processing circuits, the computational tasks may be distributed among them to improve parallel processing efficiency. In the depthwise convolution scenario of Update1, the operation results along the C dimension do not need to be accumulated, so the operations for different C values can be distributed relatively independently onto different operation circuits. It should be noted that in the Update1 splitting scheme the C dimension is aligned to 4B; therefore, when processing in units of split units, the C dimension is first aligned to 4B (that is, Uc) and then split. In other words, the processing on the different operation circuits is split along the C dimension in units of Uc.
In the backward depthwise convolution scenario, the C dimension is typically large, e.g., greater than 64, and bottom_data and top_diff are also usually large. In these embodiments, the channel dimension size Nc of bottom_data and top_diff in a single round of operation may be a multiple of 64, so that the operation for a single channel group, counted in units of Uc, can be allocated to one slave processing circuit. Thus, in some embodiments, the convolution splitting scheme also indicates a grouping manner for performing the depthwise convolution operation, in which bottom_data and top_diff are divided sequentially along the channel C dimension, in units of Uc, onto the Ns schedulable slave processing circuits, each processing a different consecutive Uc of bottom_data and top_diff. In other words, each single slave processing circuit forms a group and processes the operations of different C values (in units of Uc), which corresponds to the Group16 grouping pattern described above.
In such embodiments, where the grouping is divided according to the C dimension, top_diff may be stored in the first storage circuit after being split and dimension-converted according to the foregoing convolution splitting scheme. Since each slave processing circuit processes a different Uc, the top_diff corresponding to the different Uc values can be unicast separately, over the broadcast bus, to the scheduled Ns slave processing circuits during operation.
In these embodiments, bottom_data may be determined as the distribution data: after splitting and dimension conversion, it is divided sequentially along the channel C dimension in units of Uc and stored in the storage regions of the second storage circuit corresponding to the Ns slave processing circuits, so as to be distributed to the corresponding slave processing circuits.
Fig. 12 illustrates an exemplary storage manner of bottom_data in the second storage circuit according to some embodiments of the present disclosure.
As shown in the figure, the second storage circuit can allocate a storage region for each slave processing circuit, so that the bottom_data required for the operation of each slave processing circuit only needs to be read from its corresponding storage region. The figure exemplarily shows 16 storage regions 1200 to 1215 allocated to 16 slave processing circuits, each storing a block of bottom_data to be processed by the corresponding slave processing circuit.
As mentioned earlier, the C dimension is split in units of Uc. In the example in the figure, assuming Uc = 4B and the data type is int8, one Uc covers 4 C values. When the size of the C dimension exceeds Uc times the number of schedulable slave processing circuits, the operation needs to be completed over several rounds.
Taking the example in the figure, assume that 16 slave processing circuits in total can be scheduled and that the C dimension of bottom_data is 128, exceeding Uc times the number of schedulable slave processing circuits (16 × 4 = 64); the whole calculation can then be completed in two rounds of operation. bottom_data can be split along the C dimension, in units of Uc, into 32 bottom_data blocks: the first 16 data blocks are calculated in the first round of operation, and the last 16 in the second round.
As shown in the figure, among the data of the first round of operation, the bottom_data block containing C = 0, 1, 2, 3 is allocated to the first slave processing circuit; the bottom_data block containing C = 4, 5, 6, 7 is allocated to the second slave processing circuit; and so on. The data of the second round of operation is divided and stored similarly, which is not repeated here.
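A few lines suffice to sketch this C-dimension dealing-out; the helper name is illustrative, and the numbers simply reproduce the C = 128, Uc = 4, Ns = 16 example above.

def split_c_blocks(c, uc=4, ns=16):
    # Divide channels into Uc-sized blocks and deal them out to Ns SLs per round
    blocks = [list(range(i, min(i + uc, c))) for i in range(0, c, uc)]
    return [blocks[r * ns:(r + 1) * ns] for r in range((len(blocks) + ns - 1) // ns)]

rounds = split_c_blocks(128)
print(len(rounds))      # 2 rounds of operation
print(rounds[0][0])     # [0, 1, 2, 3] -> first slave processing circuit, first round
print(rounds[0][1])     # [4, 5, 6, 7] -> second slave processing circuit, first round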
Accordingly, the first buffer circuit may buffer the plurality of bottom_data lines distributed from the second storage circuit to the slave processing circuit, and the second buffer circuit may buffer the plurality of top_diff data lines unicast from the first storage circuit that correspond to the Uc handled by that slave processing circuit. Depending on the specific splitting and/or multiplexing manner, these data lines may be distributed to the corresponding arithmetic circuits CU during operation or broadcast to all CUs within the slave processing circuit. Each arithmetic circuit CU may then, in each operation, perform a multiply-accumulate on a bottom_data line selected from the first buffer circuit and a top_diff data line selected from the second buffer circuit.
When the plurality of arithmetic circuits CU within a single slave processing circuit SL collectively process one Uc, the output points need to be split among the CUs. Update1 uses the division in which spaced output points are allocated to each arithmetic circuit (e.g., fig. 10b). In Update1, the convolution kernel top_diff is split in units of 4 × 4, and bottom_data uses only 2 × 2 lines of 64B data in the first buffer circuit at a time, so after several sliding fetch calculations on the first buffer circuit, at most 4 × 4 output points can be calculated.
Specifically, in one embodiment, in each calculation each arithmetic circuit calculates 1 output point, the points of different circuits being adjacent in the X and/or Y dimensions on the XY plane of each of the Uc channel values of the output ΔW; and in different calculations each arithmetic circuit calculates a different output point in the X and/or Y dimensions of the output ΔW. The number of slidings Nk = ceil(Kx/2) × ceil(Ky/2), where Kx and Ky are, respectively, the smaller of the size of the output ΔW in the X and Y dimensions and the maximum output size supported by a single operation of the slave processing circuit in the current convolution splitting mode. For example, for Kx = Ky = 4, Nk = 2 × 2 = 4, i.e., 4 slidings, with 2 × 2 output points calculated each time and 4 × 4 output points in total.
Figure 13 illustrates a single operation pass in the Update1 scheme according to one embodiment of the present disclosure. In this example, the first buffer circuit 1310 has a size of 3 × 3 × 64B, i.e., it can buffer at most 9 data lines, and the second buffer circuit 1320 has a size of 2 × 2 × 64B, i.e., it can buffer at most 4 data lines. To be consistent with the split units, the storage within the buffer circuits in the figure is also shown in units of split units.
The figure shows the operation process of the first sliding fetch. With the split unit as the sliding window, Ncu input feature lines are selected from the first buffer circuit by sliding, in a manner corresponding to the division of the output points, and respectively sent to the Ncu arithmetic circuits in the slave processing circuit for calculation; 1 weight data line is read from the second buffer circuit and broadcast to the Ncu arithmetic circuits in the slave processing circuit.
Specifically, in the computing device shown in FIG. 5, Ncu = 4 and Nop = 4.
As shown, one input feature data line is selected from the first buffer circuit 1310 at the start position and at the positions shifted by 1 in the X and/or Y directions, giving a total of 4 input feature data lines that are correspondingly sent to the 4 arithmetic circuits 1340 in the slave processing circuit SL. From the second buffer circuit 1320, 1 weight data line is selected at the start position, i.e., the 4 × 4 data 1330, and broadcast to the 4 arithmetic circuits 1340 within the SL.
During each calculation, for one input feature line from the first buffer circuit and one weight line from the second buffer circuit, the feature data and the weight data corresponding to the same channel value are multiplied and accumulated element-wise in units of 1/Uc of a data line, obtaining Uc output points.
As shown in the figure, the 4 arithmetic circuits 1340 perform the element-wise multiply-accumulate on the distributed input feature data line and the broadcast weight data line in 1/Uc (Uc = 4) segments, obtaining the operation result 1350. The different background colors in 1350 represent the results obtained by the different arithmetic circuits 1340. Each CU calculates 1 output point on each of the Uc XoYo planes per calculation, so the 4 CUs together yield Uc × 2 × 2 output points, and the output points calculated by the 4 CUs are adjacent in the XoYo dimensions of the output feature map.
Then, the first buffer circuit performs a sliding fetch, while the second buffer circuit does not need to slide and the same weight line is still used for the next calculation. Nk sliding selections are performed on the first buffer circuit, where Nk = ceil(Kx/2) × ceil(Ky/2), and Kx and Ky are, respectively, the smaller of the size of the output ΔW in the X and Y dimensions and the maximum output size supported by a single operation of the slave processing circuit in the current convolution splitting mode. Correspondingly, the arithmetic circuits splice the Nk × Uc output points obtained over the Nk sliding calculation periods according to the division of the output points, giving Nk × Ncu calculation results on the Uc channels.
In some embodiments, in the Update1 mode, the maximum convolution kernel size supported by a single operation of the slave processing circuit is 4 × 4.
Figure 14 illustrates the sliding convolution process in the Update1 scheme according to one embodiment of the present disclosure. In this example, the first buffer circuit buffers 2 × 2 = 4 bottom_data lines, shown as 1410 in the figure with the C dimension omitted; the second buffer circuit buffers 1 top_diff data line, shown as 1420, again with the C dimension omitted. Each data line is a block of size 4 × 4 × 4 (C × H × W). The size of ΔW in the X and Y dimensions is Kx = Ky = 4. In each calculation, a 4 × 4 top_diff, corresponding exactly to one 4 × 4 block of top_diff, is selected from the second buffer circuit and broadcast to the 4 arithmetic circuits.
Specifically, with the splitting unit as the sliding window and in a manner corresponding to the division of the output points, N_CU bottom_data lines are selected from the first buffer circuit in a sliding manner and are respectively sent to the N_CU arithmetic circuits in the slave processing circuit for computation. Further, 1 top_diff data line is read from the second buffer circuit and broadcast to the N_CU arithmetic circuits in the slave processing circuit. Nk sliding selections are performed on the first buffer circuit, where Nk = ceil(Kx/2) × ceil(Ky/2), and Kx and Ky are respectively the smaller of the size of the weight gradient data ΔW in the X and Y dimensions and the maximum output size supported by a single operation of the slave processing circuit in the current convolution splitting mode.
The selection ranges of bottom_data and top_diff in the first and second buffer circuits at each sliding are shown in fig. 14, whose four sub-diagrams represent the four slidings. Block 1410 in the figure represents the bottom_data in the first buffer circuit, and the four dashed boxes represent the regions selected for the four CUs; block 1420 represents the top_diff in the second buffer circuit, and the dashed box represents the selected 1 top_diff data line, which is broadcast to the 4 CUs and does not need to be reselected during the sliding. The number of slidings is Nk = 4, and the sliding step is 2. In the Update1 convolution operation mode, the maximum ΔW size supported by a single operation of the slave processing circuit is 4 × 4. It is understood that when ΔW exceeds the maximum supported size, it needs to be split in the XY directions according to the maximum supported size.
In each calculation, for one bottom_data line from the first buffer circuit and one top_diff data line from the second buffer circuit, each CU performs element-wise multiplication and accumulation on the bottom_data and top_diff data corresponding to the same channel value in units of 1/Uc of a data line, obtaining Uc output points, that is, 1 output point of ΔW on each of the Uc KxKy planes; thus the N_CU arithmetic circuits obtain N_CU output points on the Uc KxKy planes per calculation. It will be appreciated that after sliding through Nk operation cycles, each arithmetic circuit has calculated Nk output points spaced apart in the X and/or Y dimensions on the Uc KxKy planes. Over the Nk slidings, the N_CU arithmetic circuits together obtain Nk × N_CU output points on the Uc KxKy planes. These output points are spliced to form at most 4 × 4 (Kx × Ky) output points on each of the Uc planes in the C dimension, namely Uc × 4 × 4 output points.
Specifically, for each sub-diagram in fig. 14, the number of CUs is N_CU = 4; each CU calculates 1 output point on each of the Uc planes in the C dimension at a time, the partial sum being the element-wise accumulation result over 1/Uc (1/4) of a data line, that is, each output point is a 4 × 4 (Y × X) 2D convolution. After Nk = 4 slidings, the calculation of at most 4 × 4 output points is completed, and a 4 × 4 (Y × X) output is obtained for 1 SL (as shown in fig. 10 b).
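As a hedged illustration of the Update1 sliding computation just described (assuming a stride-1 depthwise convolution and an illustrative bottom_data region size; the array and variable names are not from the disclosure), the following Python sketch computes the Uc × 4 × 4 weight gradient ΔW with Nk = 4 slidings of step 2, each of the 4 CUs producing one kernel-position output point per slide on each of the Uc planes:

import numpy as np

Uc, Kx, Ky, N_CU = 4, 4, 4, 4
bottom = np.random.randn(Uc, 7, 7).astype(np.float32)     # bottom_data region (illustrative size)
top_diff = np.random.randn(Uc, 4, 4).astype(np.float32)   # one top_diff data line

dW = np.zeros((Uc, Ky, Kx), dtype=np.float32)
Nk = ((Kx + 1) // 2) * ((Ky + 1) // 2)                     # ceil(Kx/2) * ceil(Ky/2) = 4 slidings
for nk in range(Nk):
    sy, sx = divmod(nk, (Kx + 1) // 2)
    sy, sx = 2 * sy, 2 * sx                                # sliding step of 2
    for cu in range(N_CU):
        dy, dx = divmod(cu, 2)                             # the 4 CUs cover a 2 x 2 block of kernel points
        ky, kx = sy + dy, sx + dx
        # stride-1 depthwise weight gradient: dW[c, ky, kx] = sum over output
        # positions of bottom[c, ky + ho, kx + wo] * top_diff[c, ho, wo]
        window = bottom[:, ky:ky + 4, kx:kx + 4]
        dW[:, ky, kx] = (window * top_diff).sum(axis=(1, 2))
# dW now holds Uc x 4 x 4 output points, assembled from Nk x N_CU per-slide results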
It can be understood that when Kx/Ky calculated by each CU is greater than 4, sliding along the Kx/Ky direction is needed to read different bottom _ data and top _ diff. The calculation process can be derived similarly by those skilled in the art from the foregoing description, and is not described in detail here.
As can be seen from the foregoing sliding convolution process, the results output in this sliding mode are not in the normal ordering of conventional convolution output data. Therefore, during output, each slave processing circuit SL may convert the operation results of its internal arithmetic circuits CU into a specified format. In some embodiments, each slave processing circuit may output, at a time, the 1 output point at the same position on the Uc XY planes computed by one of its arithmetic circuits. The Ns slave processing circuits thus simultaneously output Ns × Uc output points at the same XY position at a time. With this output mode, the Ns × Uc output points are continuous in the C dimension. The blocking circuit may further store the operation results returned from the respective slave processing circuits in the fourth dimension storage order, for example sequentially according to the dimensions Ky × Kx × (Ns × Uc). Depending on the situation, the blocking circuit may also convert the operation results into a desired dimension storage order for storage.
Figure 15 illustrates a schematic diagram of the output data format in the Update1 scheme according to an embodiment of the present disclosure. In this embodiment, grouping by the C dimension is used, i.e. each slave processing circuit SL processes the operations for a different set of Uc channel values.
The original output of 1 SL is shown at 1510. As can be seen from the figure, each SL outputs a region of Uc × 1 × 1 (C × Y × X) at a time, that is, it outputs the Uc operation results of one of its arithmetic circuits at a time, for example the 4 operation results of CU0, and these 4 operation results are continuous in the C dimension of the output data. Since different SLs process the operations of different Uc channel values, the 16 SLs can simultaneously output 1 output point at the same XY position on different channels, which can be spliced into 16 × Uc output points that are continuous in the C dimension.
The diagram 1520 shows the stored data structure for the 16 SLs. As shown, the outputs of the 16 SLs at each time are spliced into one row of data that is continuous in the C dimension. For example, the first time, all 16 SLs output the output point at the Ky = 0, Kx = 0 position (label "1"); the second time, all 16 SLs output the output point at the Ky = 0, Kx = 1 position (label "2"); and so on. The final output data, after being written into the storage circuit (e.g., the first storage circuit), is in the format Kh × Kw × (16 × Uc), where 16 corresponds to the division over the 16 SLs. In some implementations, a data rearrangement operation may be performed again as needed to convert the data into other desired formats.
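The output ordering described above can be sketched in Python as follows; the per-SL result array is a stand-in assumption for the hardware registers, and only the Ky × Kx × (Ns × Uc) splicing order is being illustrated:

import numpy as np

Ns, Uc, Ky, Kx = 16, 4, 4, 4
# per_sl[s] stands in for SL s's results, shaped (Uc, Ky, Kx); each SL owns different channels
per_sl = np.random.randn(Ns, Uc, Ky, Kx).astype(np.float32)

stored = np.empty((Ky, Kx, Ns * Uc), dtype=np.float32)
for ky in range(Ky):
    for kx in range(Kx):
        # at each step all 16 SLs emit the 1 point at the same (ky, kx) position,
        # Uc channels apiece; the 16 x Uc values are continuous in the C dimension
        stored[ky, kx] = np.concatenate([per_sl[s, :, ky, kx] for s in range(Ns)])
# `stored` is in the Ky x Kx x (Ns x Uc) order written to the first storage circuit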
In some embodiments, considering the storage space of the internal registers of the arithmetic circuits, a single slave processing circuit, for example one including four arithmetic circuits, can calculate up to 16 output regions of 4 × 4 points, so bottom_data can be reused, thereby reducing the reading frequency of the second storage circuit. That is, the reading frequencies of the first and second storage circuits may differ. If the result calculated by an arithmetic circuit is a partial sum, it is stored in the internal register.
In these embodiments, the slave processing circuit may be further configured to: determine, according to the limitation of the storage space within the arithmetic circuits, the number of reuse times rn of bottom_data within the slave processing circuit; and control the loading frequency of top_diff data in the second buffer circuit, so that the bottom_data loaded into the first buffer circuit each time is reused rn times, performing convolution operations with the top_diff data loaded into the second buffer circuit over rn loads. In some examples, rn may take a value of no more than 16.
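A minimal control-flow sketch of this reuse is given below, assuming rn = 16 and placeholder functions; compute_delta_w stands in for the sliding ΔW computation sketched earlier and is not an interface from the disclosure.

import numpy as np

def compute_delta_w(bottom, top_diff):
    # placeholder for the sliding delta-W computation sketched earlier
    return (bottom[:, :4, :4] * top_diff).sum(axis=(1, 2))

def weight_gradient_pass(bottom_blocks, top_diff_blocks, rn=16):
    results = []
    for b_idx, bottom in enumerate(bottom_blocks):          # one bottom_data load ...
        for r in range(rn):                                 # ... is reused rn times
            t_idx = b_idx * rn + r
            if t_idx >= len(top_diff_blocks):
                break
            results.append(compute_delta_w(bottom, top_diff_blocks[t_idx]))
    return results

bottoms = [np.random.randn(4, 7, 7).astype(np.float32) for _ in range(2)]
top_diffs = [np.random.randn(4, 4, 4).astype(np.float32) for _ in range(32)]
out = weight_gradient_pass(bottoms, top_diffs, rn=16)       # 32 results from only 2 bottom loads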
The convolution optimization scheme provided by the embodiments of the present disclosure has been exemplarily described above in connection with the specific Update1 convolution splitting scheme. Based on the teachings of the present disclosure, those skilled in the art may devise other convolution splitting schemes depending on the specific hardware circuit configuration (such as the number of slave processing circuits, the number of arithmetic circuits within a slave processing circuit, the single-operation processing capability of the hardware, etc.), which all fall within the scope of the present disclosure and are not enumerated here.
The embodiments of the present disclosure also provide a method for performing convolution operations using the foregoing computing device. Those skilled in the art will appreciate that the method steps for performing the convolution operation correspond to the respective circuits of the computing device described above in connection with the figures; therefore, the features described above apply equally to the method steps and are not repeated here.
The disclosed embodiments also provide a chip that may include the computing device of any of the embodiments described above in connection with the figures. Further, the disclosure also provides a board card, which may include the aforementioned chip.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, an internet of things terminal, a mobile phone, a drive recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration.
It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in that the acts or modules involved are not necessarily required for the implementation of the solution or solutions of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of the connection relationships between the different units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units can be selected to achieve the purpose of the solution described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, as a specific hardware circuit, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors, memristors, and other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, or the like.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description only; it is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the scope of application according to the ideas of the present disclosure. In summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (16)

1. A computing device configured to perform a depthwise convolution operation in reverse training of a neural network model, the computing device comprising:
a main processing circuit configured to: acquire input neuron data and/or neuron gradient data, wherein the input neuron data and the neuron gradient data are each split into a plurality of splitting units according to a convolution splitting scheme and converted in dimension storage order, one splitting unit comprising data in the lowest storage dimension and at least one other storage dimension, the total data amount of one splitting unit not exceeding the maximum single-operation amount of the hardware, and the data within one splitting unit being stored contiguously as one data line; and
a plurality of slave processing circuits configured to perform the depthwise convolution operation on corresponding data lines of the input neuron data and the neuron gradient data.
2. The computing device of claim 1, wherein the convolution splitting scheme indicates that the shape of the splitting unit is Uc × Uy × Ux = M, where Uc is the size of the splitting unit in the lowest storage dimension, the channel C dimension, in which the input neuron data and neuron gradient data are initially stored, Ux and Uy are respectively the sizes of the splitting unit in the X and Y storage dimensions in which the input neuron data and neuron gradient data are initially stored, M is the maximum single-operation amount of the hardware, Ux = Uy ≥ Uc > 1, and Uc = M/4^n,
(the further constraints on n are given by two formulas presented as images FDA0003280093980000011 and FDA0003280093980000012 in the original publication, not reproduced here).
3. The computing device of claim 2, wherein the convolution splitting scheme further indicates a grouping mode in which the depthwise convolution operation is performed, wherein the grouping mode groups the input neuron data and neuron gradient data sequentially in the channel C dimension in units of Uc over the schedulable Ns slave processing circuits, each slave processing circuit processing the input neuron data and neuron gradient data of different consecutive Uc C values.
4. The computing device of claim 3, wherein the computing device further comprises a first storage circuit and a second storage circuit,
the neuron gradient data is determined as unicast data, the unicast data, after splitting and conversion of the dimension storage order, is stored in the first storage circuit, and during operation the neuron gradient data corresponding to different Uc C values is transmitted respectively to the Ns slave processing circuits via a broadcast bus; and
the input neuron data is determined as distribution data, and the distribution data, after splitting and conversion of the dimension storage order, is stored in the storage regions of the second storage circuit corresponding to the Ns slave processing circuits, divided sequentially in the channel C dimension in units of Uc, so as to be distributed to the corresponding slave processing circuits.
5. The computing device of claim 4, wherein each of the slave processing circuits comprises a first buffer circuit, a second buffer circuit, and a plurality of arithmetic circuits, wherein:
the first buffer circuit is configured to buffer a plurality of input neuron data lines from the second storage circuit that are distributed to the slave processing circuit;
the second buffer circuit is configured to buffer a plurality of neuron gradient data lines transmitted to the slave processing circuit in a unicast manner from the first storage circuit; and
each arithmetic circuit is configured to perform, in each operation, a multiply-accumulate operation on an input neuron data line selected from the first buffer circuit and a neuron gradient data line selected from the second buffer circuit.
6. The computing device of claim 5, wherein the slave processing circuit is further configured to divide output points among its N_CU arithmetic circuits as follows:
in each calculation, each arithmetic circuit calculates 1 output point on the XY planes of the Uc channel C values of the weight gradient data, the output points being adjacent to each other in the X and/or Y dimensions; and
in different calculations, each arithmetic circuit calculates different output points of the weight gradient data in X and/or Y dimensions.
7. The computing device of claim 6, wherein each of the slave processing circuits is further configured to:
select, with the splitting unit as the sliding window and in a manner corresponding to the division of the output points, N_CU input neuron data lines from the first buffer circuit in a sliding manner, and send them respectively to the N_CU arithmetic circuits in the slave processing circuit for computation;
read 1 neuron gradient data line from the second buffer circuit and broadcast it to the N_CU arithmetic circuits in the slave processing circuit; and
perform Nk sliding selections on the first buffer circuit, wherein Nk = ceil(Kx/2) × ceil(Ky/2), and Kx and Ky are respectively the smaller of the size of the weight gradient data in the X and Y dimensions and the maximum weight gradient size supported by a single operation of the slave processing circuit in the convolution splitting mode.
8. The computing device of claim 7, wherein each of the arithmetic circuits is further configured to:
in each calculation, for one input neuron data line from the first buffer circuit and one neuron gradient data line from the second buffer circuit, perform element-wise multiplication and accumulation on the input neuron data and the neuron gradient data corresponding to the same channel value in units of 1/Uc of a data line, obtaining 1 output point at the same position on each of the Uc XY planes; and
in the Nk sliding calculations, calculate Nk output points spaced apart in the X and/or Y dimensions on the Uc XY planes.
9. The computing device of claim 8, wherein each of the slave processing circuits is further configured to:
output, at a time, the 1 output point at the same position on the Uc XY planes obtained by the operation of one of its arithmetic circuits.
10. The computing device of claim 9, wherein the main processing circuit is further configured to:
splice and store the operation results output by the slave processing circuits according to the dimension order Ky × Kx × (Ns × Uc).
11. The computing device of any of claims 1-10, wherein M = 64 bytes, Uc = 4 bytes, and Ux = Uy = 4.
12. The computing device of any of claims 6-10, wherein N_CU = 4 and Ns = 16.
13. The computing device of claim 7, wherein the maximum weight gradient size supported by a single operation of the slave processing circuit in the convolution splitting mode is 4 × 4.
14. A chip comprising a computing device according to any one of claims 1-13.
15. A board comprising the chip of claim 14.
16. A method of performing convolution operations using the computing device of any of claims 1-13.
CN202111129739.9A 2021-09-26 2021-09-26 Computing device, method for performing convolution operation by using computing device and related product Pending CN115878543A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111129739.9A CN115878543A (en) 2021-09-26 2021-09-26 Computing device, method for performing convolution operation by using computing device and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111129739.9A CN115878543A (en) 2021-09-26 2021-09-26 Computing device, method for performing convolution operation by using computing device and related product

Publications (1)

Publication Number Publication Date
CN115878543A true CN115878543A (en) 2023-03-31

Family

ID=85762600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111129739.9A Pending CN115878543A (en) 2021-09-26 2021-09-26 Computing device, method for performing convolution operation by using computing device and related product

Country Status (1)

Country Link
CN (1) CN115878543A (en)

Similar Documents

Publication Publication Date Title
CN113850380A (en) Data processing device, data processing method and related product
CN111488976B (en) Neural network computing device, neural network computing method and related products
CN112799599B (en) Data storage method, computing core, chip and electronic equipment
CN112416433B (en) Data processing device, data processing method and related product
WO2023045446A1 (en) Computing apparatus, data processing method, and related product
CN112633490A (en) Data processing device and method for executing neural network model and related products
CN111488963B (en) Neural network computing device and method
CN113850377A (en) Data processing device, data processing method and related product
CN113850379A (en) Data processing device, data processing method and related product
CN112084023A (en) Data parallel processing method, electronic equipment and computer readable storage medium
CN115878543A (en) Computing device, method for performing convolution operation by using computing device and related product
CN114154112A (en) Data processing device, chip and board card
CN115878541A (en) Computing device, method for performing convolution operation by using computing device and related product
CN115878546A (en) Computing device, method for performing convolution operation by using computing device and related product
CN115878542A (en) Computing device, method for performing convolution operation by using computing device and related product
CN115878545A (en) Computing device, method for performing convolution operation by using computing device and related product
CN115878544A (en) Processing circuit, method for performing convolution operation by using processing circuit and related product
CN115878547A (en) Computing device, method for performing convolution operation by using computing device and related product
CN113469337A (en) Compiling method for optimizing neural network model and related product
CN113850378A (en) Data processing device, data processing method and related product
CN113837923A (en) Data processing device, data processing method and related product
CN113837921A (en) Data processing device, data processing method and related product
CN112801276A (en) Data processing method, processor and electronic equipment
CN111860809A (en) Method for carrying out first-layer convolution layer processing by filling image sensing chip with dummy unit
WO2023087698A1 (en) Computing apparatus and method for executing convolution operation, and related products

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination