CN117252241A - Computing device, method and related product for performing convolution operation


Info

Publication number
CN117252241A
Authority
CN
China
Prior art keywords: dimension, output, convolution, data, computing device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210633573.2A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN202210633573.2A
Publication of CN117252241A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/544 - Methods or arrangements for performing computations using non-contact-making devices, for evaluating functions by calculation
    • G06F 7/5443 - Sum of products


Abstract

A computing device, a method of performing convolution operations using the computing device, and related products. The computing device may be included in a combined processing device, which may also include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a user-specified computing operation. The combined processing device may further include a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. The scheme optimizes convolution operations, improves data reuse efficiency, and improves processing efficiency.

Description

Computing device, method and related product for performing convolution operation
Technical Field
The present disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to a computing device, a method, a chip, and a board for performing convolution operations using the computing device.
Background
Deep learning has become an important branch of machine learning and has greatly advanced the development of artificial intelligence (AI). Its core technology, the deep neural network (DNN), has found wide application in many industries.
Neural networks are among the most critical technologies in artificial intelligence and deep learning, and the convolutional neural network (Convolutional Neural Network, CNN) is one of the most important network types. The most critical computation in a convolutional neural network is the convolution operation (Convolution Operation) of the convolution layer (Conv layer). The function of the convolution layer is to extract features from the input data; complex features can be extracted through multiple layers of convolution, ensuring that the network has sufficient expressive power and generalization ability. A neural network model contains a large number and variety of convolution operations, and the computational performance of these convolutions strongly affects the computational performance of the whole model. When a neural network model is applied to different fields, such as speech recognition, machine translation and image processing, the dimensions of the corresponding input feature maps and weights may differ. In addition, the convolution layers employed by different neural network models may also differ, each with its own properties and requirements. For example, U-net is a fully convolutional neural network: it is composed entirely of convolution layers, is structurally symmetric, and requires the input/output channel dimensions of its data to be aligned to multiples of 16, among other properties. To fully exploit the hardware advantages of deep learning processors, convolution operations of different scales and/or different types must be optimized to improve the computational performance of executing neural network models.
Disclosure of Invention
In order to solve at least one or more of the technical problems mentioned above, the disclosure proposes, in various aspects, a computing device, a chip, a board card and corresponding methods that segment the input channel dimension of the input feature map and split the other dimensions according to the segmentation, so that data of various dimension sizes can be adapted to the convolution hardware and the hardware performance can be fully utilized, improving the computational efficiency of convolution operations. In addition, the output channel dimension of the output feature map can also be segmented, so that the output can be used directly as the input feature map of the next convolution layer without additional data-rearrangement processing. The convolution operations of embodiments of the present disclosure may be operations in various neural network models, which may be applied in various fields such as image processing, speech processing and text processing, including, for example but without limitation, recognition and classification. The convolution operation of the disclosed embodiments is particularly suited to U-net neural network models.
In a first aspect, embodiments of the present disclosure provide a computing device configured to perform convolution operations, the computing device comprising: an input storage circuit for storing an input feature map and convolution kernels, wherein the input feature map is stored in segments according to a split granularity Pci of the input channel Ci dimension; a plurality of slave processing circuits for performing convolution operations on the broadcast input feature map and on the corresponding convolution kernels allocated to each slave processing circuit; and an output storage circuit for storing the output feature maps computed and output by the plurality of slave processing circuits, wherein the output feature maps are stored in segments according to a split granularity Pco of the output channel Co dimension.
In a second aspect, embodiments of the present disclosure provide a chip comprising the computing device of the first aspect described above.
In a third aspect, embodiments of the present disclosure provide a board card comprising the chip of the foregoing second aspect.
In a fourth aspect, embodiments of the present disclosure provide a method of performing a convolution operation with the computing device of the first aspect described above.
With the computing device, chip, board card, and method of performing convolution operations by the computing device provided above, the scheme of the disclosed embodiments applies different channel segmentation and dimension splitting schemes to input feature maps of different dimension sizes to match the processing capability of the hardware computing device, thereby fully utilizing the parallel processing capability of the multiple slave processing circuits and effectively improving the efficiency of convolution operations. Further, the output channel dimension of the output feature map can be segmented, so that the output can conveniently be passed to the next convolution layer as its input feature map, without additional data rearrangement. Other advantages and effects will become apparent from the following detailed description taken in conjunction with the accompanying drawings.
Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts:
FIG. 1 illustrates a block diagram of a board of an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a combination processing device according to an embodiment of the present disclosure;
FIG. 3 illustrates an internal architecture schematic diagram of a processor core of a single-core or multi-core computing device of an embodiment of the present disclosure;
FIG. 4 illustrates an example convolution operation principle example to which embodiments of the present disclosure may be applied;
FIG. 5 shows a schematic block diagram of a computing device according to an embodiment of the disclosure;
FIG. 6 schematically illustrates an exemplary storage of an input feature map according to some embodiments of the present disclosure;
FIG. 7 illustrates a convolution kernel storage scheme in accordance with an embodiment of the present disclosure;
FIG. 8 illustrates several exemplary ways of dividing an output block among multiple arithmetic circuits of a single slave processing circuit in accordance with an embodiment of the present disclosure;
FIG. 9 illustrates a schematic diagram of an operation of multiplexing input signature data in the H dimension according to some embodiments of the present disclosure;
FIGS. 10a-10d show schematic diagrams of the operation of a convolution operation scheme according to embodiment 1 of the present disclosure;
FIG. 11 shows a schematic diagram of the write and output logic of the result of an operation according to embodiment 1 of the present disclosure;
FIG. 12 shows a schematic diagram of the operation of a convolution operation scheme according to embodiment 2 of the present disclosure; and
Fig. 13 shows an operational process diagram of a convolution operation scheme according to embodiment 3 of the present disclosure.
Detailed Description
The following description of the embodiments of the present disclosure is made clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," and the like, as may appear in the claims, specification and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context.
Exemplary hardware Environment
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present disclosure. As shown in fig. 1, the board card 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet intelligent processing demands in complex fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is applied extensively in the cloud intelligence field; a notable characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the storage and computing capacity of the platform. The board card 10 of this embodiment is suitable for cloud intelligence applications, with very large off-chip storage, on-chip storage, and powerful computing capacity.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface means 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combined processing means 20 comprises computing means 201, interface means 202, processing means 203 and storage means 204.
The computing device 201 is configured to perform user-specified operations, primarily implemented as a single-core smart processor or as a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively accomplish the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control, including but not limited to data handling and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of general-purpose and/or special-purpose processors, including but not limited to a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU), a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and so on, and their number may be determined according to actual needs. As previously mentioned, considered on its own, the computing device 201 of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM, typically DDR memory with a size of 16G or larger, and stores data for the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal architecture of a processing core when the computing device 201 is a single-core or multi-core device. The computing device 301 is configured to process input data such as computer vision, voice, natural language, data mining, etc., and the computing device 301 includes three modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used for coordinating and controlling the operation of the operation module 32 and the storage module 33 to complete the task of deep learning, and comprises a fetch unit (instruction fetch unit, IFU) 311 and an instruction decode unit (instruction decode unit, IDU) 312. The instruction fetching unit 311 is configured to fetch an instruction from the processing device 203, and the instruction decoding unit 312 decodes the fetched instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 322 is responsible for the core computation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333. NRAM 331 stores input neurons, output neurons, and intermediate results of computation; WRAM 332 stores the convolution kernels, i.e., the weights, of the deep learning network; DMA 333 is coupled to DRAM 204 via bus 34 and is responsible for data transfers between the computing device 301 and DRAM 204.
Exemplary convolution operation types
Based on the foregoing hardware environment, in one aspect, embodiments of the present disclosure provide a computing device configured to perform convolution operations, so that convolution operations in, for example, a neural network model can be optimized. A convolution layer in a neural network model performs convolution operations to extract features by applying convolution kernels (also known as filters or weights) to the input feature map (also known as input data, neurons, or input neurons). A convolution layer may contain multiple convolution kernels, where each element of a convolution kernel corresponds to a weight coefficient, together with a bias term.
A neural network model may include various convolution layers, such as convolution layers performing forward, conventional 3D convolution operations and layers performing depthwise (Depthwise) convolution operations. In backward training, it may also be necessary to perform backward depthwise convolution or cross-product convolution operations. Embodiments of the present disclosure are primarily optimized for conventional 3D convolution operations, but may be applied to other types of convolution operations where no conflict arises.
In a conventional 3D convolution operation, assume that the tensor shape of the input feature map (Feature map) in the convolution layer is X[N, Hi, Wi, Ci], the tensor shape of the convolution kernel (kernel) is K[Co, Kh, Kw, Ci], and the output result is Y[N, Ho, Wo, Co]. A simplified mathematical formula for the convolution operation can then be expressed as follows:
$$Y_{in,jc,jh,jw} = \sum_{0 \le ic < Ci,\; 0 \le ih < Kh,\; 0 \le iw < Kw} X_{in,\,ic,\,jh \times Sh + ih,\,jw \times Sw + iw} \times K_{jc,\,ic,\,ih,\,iw} \tag{1}$$
In the above formula, X is the input data, Y is the output data, K is the convolution kernel, Kh and Kw are the height and width of K, and Sh and Sw are the strides (stride) in the height and width directions. The formula ignores bias, padding (pad) and dilation, assuming the input data X has already been padded and the convolution kernel has already been dilated. The formula also does not expand the N and C dimensions: the forward computation of a neural network model is independent in the N dimension and fully connected in the C dimension. When the convolution kernel operates, it sweeps across the input features with a given stride, performing element-wise multiplication and summation of the input features within each convolution window and superimposing the bias. In a conventional 3D convolution operation, the element-wise products in the H, W and Ci directions are accumulated, hence the name 3D convolution. However, this 3D convolution carries a constraint: the Ci dimension of the convolution kernel equals the Ci dimension of the input feature map, so the convolution kernel does not slide in the Ci direction, making it a pseudo-3D convolution. For simplicity, the convolution operation described above is referred to as a conventional 3D convolution operation.
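For concreteness, the following sketch (an illustrative NumPy reference of our own, not the hardware scheme of this disclosure) evaluates formula (1) directly over the [N, Hi, Wi, Ci] and [Co, Kh, Kw, Ci] layouts, omitting bias, padding and dilation just as the formula does:

```python
import numpy as np

def conv3d_reference(X, K, Sh=1, Sw=1):
    """Naive formula (1): X[N,Hi,Wi,Ci] conv K[Co,Kh,Kw,Ci] -> Y[N,Ho,Wo,Co]."""
    N, Hi, Wi, Ci = X.shape
    Co, Kh, Kw, Ci_k = K.shape
    assert Ci == Ci_k, "kernel Ci must equal input-map Ci (no sliding along Ci)"
    Ho = (Hi - Kh) // Sh + 1          # output height in the no-padding case
    Wo = (Wi - Kw) // Sw + 1          # output width
    Y = np.zeros((N, Ho, Wo, Co))
    for n in range(N):
        for jc in range(Co):
            for jh in range(Ho):
                for jw in range(Wo):
                    # element-wise multiply-accumulate over one Kh x Kw x Ci window
                    window = X[n, jh*Sh:jh*Sh+Kh, jw*Sw:jw*Sw+Kw, :]
                    Y[n, jh, jw, jc] = np.sum(window * K[jc])
    return Y

# A 6x6x3 input with a single 3x3x3 kernel yields a 4x4 output map at stride (1, 1):
Y = conv3d_reference(np.ones((1, 6, 6, 3)), np.ones((1, 3, 3, 3)))
print(Y.shape)   # (1, 4, 4, 1)
```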
Fig. 4 illustrates an example of an exemplary conventional 3D convolution operation principle to which embodiments of the present disclosure may be applied.
The figure exemplarily shows four-dimensional input data X of size [N, Hi, Wi, Ci], which can be represented as N three-dimensional cuboids 410 of size Hi×Wi×Ci. Also shown by way of example is a four-dimensional convolution kernel K of size [Co, Kh, Kw, Ci], which can be represented as Co three-dimensional convolution kernels 420 of size Kh×Kw×Ci. The convolution of the input data X with the convolution kernel K yields the output data Y, four-dimensional data of size [N, Ho, Wo, Co], which can be represented as N three-dimensional cuboids 430 of size Ho×Wo×Co.
The figure also specifically shows an example of the convolution operation, in which the input data is an input feature map 440 of size 6×6×3 (omitting the N dimension), the kernel is a stereo convolution kernel 450 of size 3×3×3 for a single Co, and the output data is a 4×4 output feature map 460. The specific operation process is as follows:
The convolution kernel 450 sweeps across the input feature map 440 with a given stride, performing element-wise multiplication and summation of the input features within a convolution window 470 and superimposing the bias. That is, the value at each position in the output feature map 460 is obtained by performing a two-dimensional convolution between each channel of the input feature map and the corresponding layer of the convolution kernel and then summing the results. For example, the value at position (0, 0) of the output feature map 460 (i.e., a convolution output point) is computed by two-dimensional convolutions between the convolution window 470 outlined by the black cube in the input feature map and the stereo convolution kernel 450, yielding 3 intermediate values that are then summed to give the final value.
To obtain outputs at other positions, the position of the convolution kernel 450, i.e., the convolution window of the convolution output point, is shifted over the input feature map 440. In the example in the figure, the convolution stride (Sw, Sh) is (1, 1); performing the convolution operation after shifting one cell to the right in the lateral (width) direction or downward in the longitudinal (height) direction yields the value at position (0, 1) or (1, 0) of the output feature map 460, respectively.
From the above description, in one convolution layer of a neural network there are N groups of input feature maps, each group containing Hi×Wi×Ci items of information, where Hi and Wi are the height and width of the input feature map and Ci is the number of input feature maps, also called the number of input channels. The convolution layer has Ci×Co convolution kernels of size Kh×Kw, where Ci is the number of input channels, Co is the number of output feature maps (or the number of output channels), and Kh and Kw are the height and width of the convolution kernel. The output feature maps contain Ho×Wo×Co items of information, where Ho and Wo are the height and width of the output feature map and Co is the number of output channels. The convolution operation also involves the convolution stride (Sw, Sh), whose size affects the size of the output feature map.
Input feature map (Feature map), input data, neurons, and input neurons are used interchangeably herein; convolution kernel, filter, and weights are used interchangeably; output feature map, output data, and output neurons are used interchangeably. Further, the H (height) and Y dimensions are used interchangeably, as are the W (width) and X dimensions. Accordingly, the H dimension of the input feature map may be denoted Hi or Yi, and the H dimension of the output feature map may be denoted Ho or Yo, with the W dimension denoted similarly. In embodiments of the present disclosure, each convolution output point has a corresponding convolution window whose shape equals the shape of the convolution kernel, and the value of each convolution output point corresponds to the result of the element-wise multiply-accumulate of the input feature map and the weights within its convolution window.
Exemplary computing device
In embodiments of the present disclosure, computing devices in a master-slave configuration may be employed to implement the convolution operations described above. Further, different data paths can be configured for the input feature map and the convolution kernel, so that memory efficiency is improved.
Fig. 5 shows a schematic block diagram of a computing device 500 according to an embodiment of the disclosure. It will be appreciated that this architecture may be regarded as a refinement of the internal architecture of the operation module of a single processing core in fig. 3, or as a functional partitioning block diagram unifying the operation modules of the multiple processing cores shown in fig. 3. As shown in fig. 5, the computing device 500 of an embodiment of the disclosure may be configured to perform various types of convolution operations, and may include a master processing circuit (MA) 510 and a plurality of slave processing circuits (SL) 520, of which 16 are shown as SL0 to SL15. Those skilled in the art will appreciate that the number of slave processing circuits may be greater or smaller depending on the particular hardware configuration; embodiments of the present disclosure are not limited in this respect.
The master processing circuit and the slave processing circuits, as well as the slave processing circuits among themselves, may communicate with each other through various connections. In different application scenarios, the connections among the multiple slave processing circuits may be hard-wired, or may be logical connections configured according to, for example, microinstructions, forming any of a variety of slave-processing-circuit array topologies. The disclosed embodiments are not limited in this respect. The master and slave processing circuits can cooperate with each other, realizing parallel arithmetic processing.
In order to support the arithmetic function, the master processing circuit and the slave processing circuit may include various calculation circuits, and may include a vector arithmetic unit and a matrix arithmetic unit, for example. The vector operation unit is used for executing vector operation and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit is responsible for core computation of the deep learning algorithm, such as matrix multiplication and convolution.
The slave processing circuit may be configured to perform an intermediate operation on the corresponding data in parallel according to the operation instruction to obtain a plurality of intermediate results, and transmit the plurality of intermediate results back to the master processing circuit, for example.
By arranging the computing device 500 in a master-slave configuration (e.g., one master and multiple slaves, or multiple masters and multiple slaves; the disclosure is not limited in this respect), the data for a forward-operation computation instruction can be split according to that instruction, so that the computation-heavy portion is processed in parallel by the multiple slave processing circuits. This raises computation speed, saves computation time, and in turn reduces power consumption.
In some embodiments of the present disclosure, by transmitting the input feature map and the weight using different data paths, multiple multiplexing modes of the input feature map and the weight may be supported, so as to reduce the data access amount during the operation and improve the processing efficiency.
In particular, the computing device 500 may further include a first storage circuit 530 and a second storage circuit 540 for respectively storing data transmitted via different data channels. Alternatively, the first memory circuit 530 and the second memory circuit 540 may be two memory blocks formed by the same memory partition, or may be two independent memories, which are not limited herein. The first memory circuit 530 and the second memory circuit 540 may also be collectively referred to as an input memory circuit for storing the input signature and convolution kernel, respectively.
The first memory circuit 530 may be used to store multicast data, i.e., the data in the first memory circuit is to be transmitted over a broadcast bus to a plurality of slave processing circuits that receive the same data. It will be appreciated that broadcast and multicast may be implemented over a broadcast bus. Multicasting refers to a communication scheme in which a piece of data is transmitted to a plurality of slave processing circuits; and broadcasting is a communication mode of transmitting a piece of data to all slave processing circuits, which is a special case of multicasting. Since both multicast and broadcast correspond to one-to-many transmission, which are not purposely distinguished herein, broadcast and multicast may be collectively referred to as multicast, whose meaning will be apparent to those skilled in the art from the context.
The second storage circuit 540 may be used to store distribution data, i.e. the data in the second storage circuit will be transferred to different slave processing circuits, each receiving different data.
By providing the first memory circuit and the second memory circuit separately, transmission of data to be operated on in different transmission modes can be supported, thereby reducing the data access amount by multiplexing multicast data among a plurality of slave processing circuits.
In some embodiments, the input feature map may be determined to be multicast data and stored in the first storage circuit, so that it can be broadcast to the scheduled slave processing circuits during operation. Correspondingly, the convolution kernel may be determined to be distribution data and stored in the second storage circuit; the distribution data can be distributed to the corresponding slave processing circuits before the operation.
An output storage circuit may also be included in the computing device 500 for storing the output feature maps computed by the slave processing circuits. The output storage circuit may be multiplexed with the first storage circuit or the second storage circuit, or may be a separate storage circuit; the embodiments of the disclosure are not limited in this respect.
Fig. 5 also shows an internal structural schematic diagram of the slave processing circuit SL according to an embodiment of the present disclosure. As shown, each slave processing circuit 520 may include a plurality of arithmetic circuits CU 521, a first buffer circuit 522, and a second buffer circuit 523. The figure shows 4 arithmetic circuits CU0 to CU3. Those skilled in the art will appreciate that the number of operational circuits may be greater or lesser, depending on the particular hardware configuration, and embodiments of the present disclosure are not limited in this respect.
In some embodiments, the first buffer circuit 522 may be used to buffer weights or input profiles assigned to the slave processing circuit. Accordingly, the second buffer circuit 523 may then be used to buffer the input profile or weights assigned to the slave processing circuit. Both buffer circuits are used to select the data involved in the operation. The data of the first buffer circuit 522 may be a plurality of data lines from, for example, the first memory circuit 530 or the second memory circuit 540, and correspondingly, the data of the second buffer circuit 523 may be a plurality of data lines from, for example, the second memory circuit 540 or the first memory circuit 530. Depending on the particular multiplexing, these data lines may be distributed to the corresponding arithmetic circuits CU 521 or broadcast to all CUs 521 within the slave processing circuit 520 during the operation.
Each arithmetic circuit CU 521 is configured to perform, in each arithmetic cycle, an element-wise multiply-accumulate operation on a data line selected from the first buffer circuit and a data line selected from the second buffer circuit.
By providing the first buffer circuit and the second buffer circuit respectively, it is possible to support transmission of data to be operated on in different transmission modes, thereby reducing the data access amount by multiplexing data as much as possible between a plurality of operation circuits within a single slave processing circuit.
The slave processing circuit 520 may further include a third buffer circuit 524 for buffering the operation result of each operation circuit CU 521.
It will be appreciated that although the various processing and memory circuits are shown as separate modules in fig. 5, the memory and processing circuits may be combined into one module according to different configurations. For example, the first memory circuit 530 may be integrated with the master processing circuit 510, and the second memory circuit 540 may be shared by multiple slave processing circuits 520, and each slave processing circuit may be assigned a separate memory area to accelerate access. The disclosed embodiments are not limited in this respect. Furthermore, in the computing device, the master processing circuit and the slave processing circuit may belong to different modules of the same processor or chip, or may belong to different processors, as the disclosure is not limited in this respect.
Exemplary convolution optimization scheme
In the embodiments of the present disclosure, the dimensions of the multidimensional data involved are characterized as (N, H, W, C) or (Co, H, W, Ci), which represent the order in which the data is stored in memory. It will be appreciated that although multidimensional data has multiple dimensions, there is a correspondence between the multidimensional data and its storage order in memory, because the memory layout is always one-dimensional. Multidimensional data is typically allocated in contiguous memory space, i.e., it can be expanded one-dimensionally and stored sequentially in memory. For example, the input feature maps may be stored sequentially with the lowest dimension varying fastest (where C/Ci is the lowest dimension). Adjacent dimensions are dimensions next to each other in the dimension representation of the multidimensional data; for example, W and Ci are adjacent. When the storage order matches the dimension order, the positions of adjacent dimensions in memory are contiguous: since W and Ci are adjacent, their data is also contiguous in memory.
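As a small illustration (a hypothetical helper of our own, not from the disclosure), the mapping of an (n, h, w, c) index to a one-dimensional offset under this NHWC order can be written as:

```python
def nhwc_offset(n, h, w, c, H, W, C):
    # C is the lowest (fastest-varying) dimension, then W, then H, then N
    return ((n * H + h) * W + w) * C + c

# Adjacent Ci elements are adjacent in memory, and stepping W by 1 advances by
# C elements, so the W and Ci data of one (n, h) row form one contiguous block.
assert nhwc_offset(0, 0, 1, 0, H=4, W=4, C=16) == 16
```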
In an intelligent processor, for reasons of compute-power requirements and area and power overhead, the main arithmetic unit of the hardware is a vector multiply-add operator. Supporting various convolution algorithms in the hardware design essentially amounts to maximally extracting the multiply-add operations in the algorithm, and efficiently exchanging the input and output data of those multiply-add operations between on-chip RAM (such as the NRAM and WRAM in fig. 3) and the operators through data paths.
Hardware storage is organized in rows (cache lines), and read, write, and compute operations are most efficient when aligned to whole rows. Therefore, to make full use of bandwidth, data must be vectorization-aligned to suit the requirements of the operator array and memory. Artificial intelligence chips are typically designed with the Ci dimension lowest, i.e., the NHWC placement order described above, in which data along the Ci dimension is contiguous. Vectorization alignment therefore requires the size of the Ci dimension to be aligned to a specified value, e.g., an alignment value M, so that memory accesses are made in units of M; M may also be called the hardware's single maximum operand amount. M may take different values depending on the hardware design, e.g., 32B, 64B, 128B. In general, the input port size of the operator array is also related to M; for example, with symmetric input data bit-widths, the input port of the operator array is typically 2×M, i.e., it processes input feature map data and weight data each of scale M at a time. When the Ci dimension of the input feature map is large, this alignment requirement is easier to satisfy.
When the Ci dimension of the input feature map is small, or when the remainder Ci mod M is small (for example, smaller than one cache line), the Ci dimension must be padded to a full line of data (e.g., 64B), i.e., filled with invalid zeros. This padding causes a large amount of redundant computation, wasting resources and reducing operational efficiency.
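The cost of such padding can be made concrete with a short sketch (illustrative arithmetic only, assuming a line size of M = 64B):

```python
def ci_padding(Ci, M=64):
    rem = Ci % M
    pad = (M - rem) % M              # invalid zero bytes appended per line
    return pad, pad / (Ci + pad)     # padding, and the fraction of wasted work

print(ci_padding(8))     # (56, 0.875): 87.5% of each line is invalid data
print(ci_padding(48))    # (16, 0.25)
print(ci_padding(64))    # (0, 0.0): a large, aligned Ci needs no padding
```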
A known small-convolution scheme, suited to cases where channel C is small, splits the operation data into split units and converts it into a corresponding dimension storage order. The amount of data contained in one split unit can be set to the hardware's single-pass alignment value M, so that computation proceeds in units of split units; this allows the hardware's computing power to be fully exploited and avoids or reduces invalid computation.
However, in such small-convolution schemes, both the input feature map and the convolution kernel must be block-split and dimension-converted in advance by software, and the output feature map likewise requires corresponding block and dimension conversion by software, which undoubtedly increases software complexity. Moreover, software is also needed for alignment processing in these block and dimension conversions. Further, these small-convolution schemes only support convolution operations whose stride is 1 in both the width and height directions.
In view of this, to further optimize convolution operations and reduce software complexity, embodiments of the present disclosure provide a convolution scheme that segments the Ci dimension: the input channel Ci dimension of the input feature map is segmented and the other dimensions are split according to the segmentation, which relaxes the data alignment requirements and makes full use of the hardware performance. Furthermore, the output channel Co dimension of the output feature map can also be segmented, so that when several convolution layers run in succession, the output feature map of the current convolution layer can be used directly as the input feature map of the next convolution layer without extra segmentation, eliminating software-side data splitting and dimension conversion. Further, by supporting discrete reads, the convolution strides in the width and height directions are not limited to 1; arbitrary strides can be supported without changing the processing behavior of the arithmetic circuits.
In particular, in some embodiments, a computing device configured to perform convolution operations is provided, the computing device comprising: an input storage circuit for storing an input feature map and convolution kernels, wherein the input feature map is stored in segments according to a split granularity Pci of the input channel Ci dimension; a plurality of slave processing circuits for performing convolution operations on the broadcast input feature map and on the corresponding convolution kernels allocated to each slave processing circuit; and an output storage circuit for storing the output feature maps computed and output by the plurality of slave processing circuits, wherein the output feature maps are stored in segments according to a split granularity Pco of the output channel Co dimension.
By storing the input feature maps in segments according to Pci and the output feature maps in segments according to Pco, the scheme allows, in a neural network model with several consecutive convolution layers, the output feature map of the current convolution layer to be used directly as the input feature map of the next convolution layer, achieving a seamless hand-over without additional segmentation processing. Various aspects of the convolution operation are described in detail below.
Exemplary input feature map storage
In some embodiments, the previous layer's output data for certain convolution layers (e.g., the non-first convolution layers of U-net) has already been divided into segments along the Ci dimension, each segment having a Ci size of 16B (e.g., for data type int8), 32B (e.g., for data type int16), 64B, or the like. In that case, the split granularity Pci of the input channel Ci dimension may match the segment size, i.e., 16B, 32B, or 64B.
In still other embodiments, where the data has not been segmented in advance, the input channel split granularity Pci may be determined from the size of the input channel dimension Ci of the input feature map and the hardware single-pass data amount M. For example, the Ci dimension of the data required in the U-net network is aligned to multiples of 16, and the split granularity Pci of the Ci dimension may likewise be aligned to multiples of 16; for data types of int8 and above, Pci is then aligned to 16B, i.e., 16B, 32B, or 64B. It will be appreciated that the convolution scheme of embodiments of the present disclosure reduces alignment requirements by splitting the Ci dimension at the split granularity, thereby adapting to any Ci size. It can further be understood that the maximum split granularity does not exceed the hardware's single-pass alignment value M (also called the reference alignment value, or the hardware's single-pass data amount), i.e., Pci ≤ M. Therefore, for different ranges of Ci, a suitable Pci can be chosen to reduce the alignment requirement on the Ci dimension.
In some embodiments, the input channel split granularity Pci may be chosen as M/2^n, n = 0, 1, 2, …, which makes it convenient to borrow 2^n data elements from the next-lowest storage dimension W into the lowest storage dimension Ci to meet the hardware alignment requirement. Table 1 shows several exemplary schemes for the input channel split granularity Pci, assuming M = 64B.
Split granularity (Pci):   4B    8B    16B    32B    64B
Ws (W contribution):       16    8     4      2      1

TABLE 1
As can be seen from Table 1, the composition of one input feature line varies with the split granularity Pci. When Pci = 16B, the W dimension must contribute 4 data elements, i.e., the shape of one data line is Wi×Ci = 4×16B. When Pci = 32B, the shape of one data line is Wi×Ci = 2×32B.
It can also be seen that the smaller the input channel split granularity, the more W-dimension elements contribute in the Ci direction, and the stronger the alignment restriction on Wi, which must be aligned to Ws (i.e., Wi must be a multiple of Ws).
It will be appreciated that although in theory the split granularity could be any M/2^n, in practice only some of the values M/2^n are selected as alternative split granularities, taking into account the W-dimension requirements when the granularity is too small, the instruction overhead, the value range of actual Ci, and the cost of the subsequent discrete reads that support arbitrary convolution strides. In the example of M = 64B, the alternative split granularities may include, for example, 64B, 32B, and 16B.
Different splitting granularities can be suitable for different operation scenarios, so that performance optimization of different degrees is obtained. Specifically, in some embodiments, the input channel split granularity Pci may be selected as follows:
aligning the lowest storage dimension Ci of the input feature map to each alternative splitting granularity; and
comprehensively weighing the alignment padding required by each alternative split granularity against the size of the granularity itself, and selecting an appropriate granularity, for example taking as Pci the largest alternative split granularity whose padding falls within a predetermined range.
For example, when the alignment padding amounts are the same, the larger split granularity is preferred; when they differ, the split granularity with the smallest padding is selected; and when they differ only slightly (e.g., within a predetermined range, such as not more than 16B), the larger split granularity is preferred.
Although rules for selecting the input channel split granularity Pci are listed above, these rules are merely preferred embodiments for choosing the split granularity best suited to the current Ci value. The application of the above rules is illustrated below with several examples, all assuming M = 64B, with alternative split granularities 64B, 32B, and 16B.
In one example, assume Ci = 48B: no zero padding is needed to align to 16B, while 16B of padding is needed to align to either 32B or 64B. In this case, the split granularity requiring no zero padding may be preferred as Pci, i.e., 16B.
In another example, assume Ci = 28B: aligning to 16B or 32B requires 4B of zero padding, while aligning to 64B requires 36B. In this case, the larger split granularity with the small padding amount may be preferred as Pci, i.e., 32B.
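The two examples and the selection rules above can be summarized in the following sketch (our simplified reading of the preferred rules, assuming M = 64B and alternatives {64B, 32B, 16B}):

```python
def choose_pci(Ci, candidates=(64, 32, 16)):
    def padding(g):                        # zero-fill needed to align Ci to g
        return (g - Ci % g) % g
    no_pad = [g for g in candidates if padding(g) == 0]
    if no_pad:
        return max(no_pad)                 # prefer granularities needing no padding
    best = min(padding(g) for g in candidates)
    # among minimal-padding candidates, prefer the larger granularity
    return max(g for g in candidates if padding(g) == best)

print(choose_pci(48))   # 16: only 16B needs no zero padding
print(choose_pci(28))   # 32: 16B and 32B both need 4B, the larger one wins
```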
Fig. 6 schematically illustrates exemplary storage of input feature maps according to some embodiments of the present disclosure. As shown in the figure, assuming Pci = 16B or 32B and the number of Ci segments ci_seg_num is 2, the input feature map is stored in two segments along Ci, each segment stored contiguously in HWC dimension order, with the start addresses of the two segments separated by the interval ci_seg_stride; within each segment, the Ci size is 16B or 32B. The data format may be expressed as [2, Hi, Wi, 16B] or [2, Hi, Wi, 32B]. For a 16B segment, the shape of one data line is Wi×Ci = 4×16B; for a 32B segment, the shape of one data line is Wi×Ci = 2×32B.
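In NumPy terms, this segmentation can be sketched as follows (illustrative only; Ci is assumed already aligned to Pci, and one array element stands in for one byte):

```python
import numpy as np

def segment_ci(x, Pci):
    """Split an [Hi, Wi, Ci] map into [ci_seg_num, Hi, Wi, Pci] segments, each in HWC order."""
    Hi, Wi, Ci = x.shape
    assert Ci % Pci == 0, "Ci assumed aligned to Pci"
    seg_num = Ci // Pci
    return x.reshape(Hi, Wi, seg_num, Pci).transpose(2, 0, 1, 3).copy()

x = np.arange(4 * 4 * 32).reshape(4, 4, 32)
segs = segment_ci(x, 16)
print(segs.shape)        # (2, 4, 4, 16): the [2, Hi, Wi, 16B] format above
```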
The input feature map may be stored in the first storage circuit 530 in the segmented manner described above, for example, for broadcast to the scheduled slave processing circuits during operation.
The foregoing describes the storage format of the input feature map in embodiments of the present disclosure.
Exemplary convolution kernel storage
In a convolution computation, each input feature map must be multiplied and accumulated with the convolution kernel of every Co, yielding Co output feature maps. However, on-chip space cannot necessarily hold the convolution kernels and input feature maps at all scales simultaneously, so the hardware performs repeated loads of input feature data or weight data, and how to balance this repeated loading has a definite impact on computational efficiency. In actual operation, to reduce frequent off-chip accesses, different reuse schemes can be adopted according to the scale characteristics of the data involved in the computation.
According to the convolution principle described above, operation results along the Co dimension need not be accumulated, so operations for different Co can be distributed relatively independently across different arithmetic circuits. That is, convolution kernels of different Co can be allocated to different arithmetic circuits that operate on the same input feature map; the input feature map is then reused across the arithmetic circuits, with reuse count rn = ns, where ns is the number of arithmetic circuits.
In some embodiments of the present disclosure, the Co values assigned to the individual slave processing circuits processing may be determined based on the split granularity Pco of the output channel Co dimension of the convolution kernel and the number of schedulable slave processing circuits Ns.
To simplify scheduling of the slave processing circuits, in some embodiments the convolution kernel may be stored in blocks along the Co dimension according to the split granularity Pco of the output channel Co dimension, so that each scheduled slave processing circuit loads its corresponding weight block. In some embodiments, the split granularity Pco may be chosen as a multiple of the number Ns of schedulable slave processing circuits, making it convenient to divide Co evenly over the Ns slave processing circuits; for example, each slave processing circuit processes the convolution kernels of Pco/Ns different Co values spaced Ns apart. In other embodiments, the Co-dimension split granularity Pco may be aligned to multiples of 16, such as 16, 32, 48, or 64. In view of constraints such as instruction limitations, Pco typically does not exceed the maximum throughput of a single microinstruction, e.g., 64.
As previously mentioned, in some embodiments the convolution kernel may be determined to be distribution data and stored in the second storage circuit 540, to be distributed to or read by the corresponding slave processing circuits before the operation. The second storage circuit 540 may be shared by the plurality (for example, Ns) of slave processing circuits 520, with each slave processing circuit allocated an independent storage area, so that the data needed by each slave processing circuit only has to be read from its own area, accelerating access. When the convolution kernels are stored partitioned along Co, the convolution kernels corresponding to the Co values allocated to a given slave processing circuit are stored in its corresponding storage area of the second storage circuit. Since Co is the highest storage dimension of the convolution kernel, this Co-wise partitioned storage requires no dimension conversion or similar processing.
FIG. 7 illustrates a convolution kernel storage scheme in accordance with an embodiment of the present disclosure. The figure exemplarily shows 8 storage areas 700 to 707 allocated to, for example, Ns=8 slave processing circuits SL0 to SL7. Each storage area stores the convolution kernels of the Co values to be processed by the corresponding slave processing circuit.
As previously described, the Co dimension of the convolution kernel is split by Pco; assuming Pco=48 and Ns=8, each slave processing circuit processes 6 Co values. In the example shown in the figure, consecutive Co values are allocated to the 8 SLs one by one (i.e., at interval 1), starting a new round after each round of allocation completes. For example, the convolution kernels of Co=0 to 7 are stored in turn in the 8 storage areas 700 to 707; the next convolution kernels, of Co=8 to 15, are again stored in turn in the 8 storage areas 700 to 707; and so on. In this way, the Co dimension of the operation results output by the slave processing circuits is continuous after each round of operation completes. In another example, the convolution kernels may instead be divided into blocks of consecutive Co values, each SL processing one block of Pco/Ns consecutive Co values. For example, with Pco=48, the convolution kernels of Co=0 to 5 are stored in storage area 700 allocated to SL0, those of Co=6 to 11 in storage area 701 allocated to SL1, those of Co=12 to 17 in storage area 702 allocated to SL2, and so on, until the convolution kernels of Co=42 to 47 are stored in storage area 707 allocated to SL7. In this way, the Co dimension processed by each SL is continuous.
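The two allocation orders can be summarized by the following sketch (the function names are ours; Pco=48 and Ns=8 as in the example above):

```python
# Illustrative sketch of the two Co-to-storage-area allocation orders.
def interleaved(co: int, ns: int = 8) -> int:
    """Consecutive Co values dealt one by one to SL0..SL7, round after round,
    so each round of results is contiguous in the Co dimension."""
    return co % ns            # co=0..7 -> areas 700..707, co=8..15 -> again, ...

def blocked(co: int, pco: int = 48, ns: int = 8) -> int:
    """Co values divided into Ns consecutive blocks of Pco/Ns values each,
    so the Co values processed by each SL are contiguous."""
    return co // (pco // ns)  # co=0..5 -> SL0, co=6..11 -> SL1, ...

assert [interleaved(c) for c in range(8)] == list(range(8))
assert [blocked(c) for c in (0, 5, 6, 42, 47)] == [0, 0, 1, 7, 7]
```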
Given the above Co division, each slave processing circuit may need to process convolution kernels of one or more Co values. When multiple Co values are processed, each input feature map processed by the slave processing circuit may be further multiplexed across the convolution kernels of those Co values, with a maximum multiplexing count rn = Pco/Ns, where rn denotes the number of times the input feature map is multiplexed within a single slave processing circuit. The number of input feature map multiplexes actually available within a single slave processing circuit may be determined in view of hardware buffer space constraints, such as the sizes of the first buffer circuit and the second buffer circuit in FIG. 5.
The above describes how the convolution kernels are aligned in the Co dimension and stored in blocks in the second storage circuit. As with the input feature map, the convolution kernel for each Co value is also stored in segments along the Ci dimension at the split granularity Pci, which is not repeated here.
Exemplary reading of input feature lines
As mentioned previously, the convolution optimization scheme provided by embodiments of the present disclosure supports arbitrary convolution strides in the width and height directions, not only 1. So that a convolution stride other than 1 leaves the processing behavior of the arithmetic circuits unchanged, what is adjusted instead is the data delivered to the arithmetic circuits, that is, the manner in which input feature rows are read.
Specifically, in some embodiments, when the convolution stride Sw of the convolution operation in the W dimension is 1, one input feature row has the format Pci×Ws (Ci×Wi), that is, it is composed of Ws data blocks read consecutively in the W dimension, each with Ci dimension equal to Pci. For example, when Pci=16B and M=64B, 4 columns of data are read consecutively in the W dimension, because within each data block stored in Ci-dimension segments, the dimension storage order is HWC, i.e., the WC dimensions are contiguous. When Pci=32B, 2 columns of data are read consecutively in the W dimension to constitute one input feature row. It will be appreciated that when the input computation bandwidth exceeds the capacity of one data row, e.g., is 128B, 2 input feature rows can be read at a time, thereby avoiding a bottleneck for subsequent operations.
In other embodiments, when the convolution stride Sw in the W dimension is not 1, one input feature row still has the format Pci×Ws (Ci×Wi), but is composed of Ws data blocks read at intervals of (Sw-1) data blocks in the W dimension, each with Ci dimension equal to Pci. For example, when Pci=16B, M=64B, and Sw=2, data is read skipping one column in the W dimension, that is, one 16B block is read and the next 16B block is skipped; reading 4 such 16B blocks constitutes one data row. When Pci=32B and Sw=2, one 32B block is read per skipped 32B block in the W dimension, and 2 such 32B blocks read at intervals constitute one data row.
The above interval reading may be accomplished by memory wiring: the reads at the corresponding addresses are wired at the read interval (e.g., 16B) to support skip reads, i.e., discrete data can be read out at once in one beat. It will be appreciated that the smaller the interval unit, the more and denser the wiring required, increasing routing difficulty and area, and thus overhead. Typically, instructions require address accesses on RAM to be aligned to 16B, so reading at 16B intervals incurs little wiring overhead.
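The composition of an input feature row under both cases can be sketched as follows (an illustrative Python fragment of ours, not hardware code; Pci, Sw, and M=64B as in the examples above):

```python
# Illustrative only: the W coordinates of the Ws data blocks, each Pci bytes
# in the Ci dimension, that make up one input feature row of M = Pci x Ws
# bytes, with and without a W-dimension stride Sw.
def feature_row_w_coords(w_start: int, pci_bytes: int, sw: int, m: int = 64):
    ws = m // pci_bytes                       # Ws = M / Pci
    return [w_start + i * sw for i in range(ws)]

print(feature_row_w_coords(0, pci_bytes=16, sw=1))  # [0, 1, 2, 3]: 4 contiguous columns
print(feature_row_w_coords(0, pci_bytes=16, sw=2))  # [0, 2, 4, 6]: skip one 16B block per read
print(feature_row_w_coords(0, pci_bytes=32, sw=2))  # [0, 2]: 2 blocks of 32B
```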
It will be appreciated that this interval reading manner may also apply to input feature row multiplexing when Sw < Kw, as will be described in detail later in connection with specific embodiments.
Exemplary convolution operation procedure within a single slave processing circuit
After the input feature map is broadcast to the scheduled slave processing circuits and the convolution kernels are distributed to the corresponding slave processing circuits, each slave processing circuit can execute the convolution operation on the corresponding data of the input feature map and convolution kernels, and the master processing circuit can then splice the operation results returned by the multiple slave processing circuits according to the convolution optimization scheme, to obtain the output feature map of the convolution of the input feature map and the convolution kernels. Specifically, the convolution operation itself may be performed using the multiple arithmetic circuits CU within each slave processing circuit together with its buffer circuits (see FIG. 5). Depending on the buffer space within the slave processing circuit and the computational capacity of the arithmetic circuits, each round of operations typically requires multiple operation cycles to complete the desired computation.
In some embodiments, the first buffer circuit may buffer a plurality of input feature rows to be convolved, where one input feature row contains an amount Pci×Ws=M of data from the input feature map, Ws being the split multiple in the width W dimension and M the amount of data processed by the hardware in a single pass. Correspondingly, the second buffer circuit may buffer the weight data to be convolved. At each computation, each arithmetic circuit performs a para-multiply-accumulate operation between an input feature row selected from the first buffer circuit and an extended weight row selected or generated from the second buffer circuit, where one extended weight row is formed by copying a column of data blocks of the convolution kernel, split or aligned to Pci in the Ci dimension, into Ws columns.
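A single computation step of one arithmetic circuit can be sketched as follows (a numpy illustration under our own naming; real hardware operates on fixed-point rows of M bytes rather than float arrays):

```python
import numpy as np

def cu_step(input_row: np.ndarray, weight_block: np.ndarray, acc: np.ndarray) -> np.ndarray:
    """input_row:    (Ws, Pci) - Ws data blocks, each Pci channel values wide.
    weight_block: (Pci,)    - the kernel values at one (kh, kw) position.
    acc:          (Ws,)     - running partial sums of Ws adjacent output points."""
    extended = np.tile(weight_block, (input_row.shape[0], 1))  # copy into Ws columns
    return acc + (input_row * extended).sum(axis=1)            # para-multiply-accumulate

acc = np.zeros(4)
row = np.random.rand(4, 16)   # Ws=4, Pci=16 (the 16B case, one value per channel here)
wgt = np.random.rand(16)
acc = cu_step(row, wgt, acc)  # partial sums of 4 output points in the Wo dimension
```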
For simplicity, the following description is directed to processing for one Co value within a single slave processing circuit SL, it being understood that similar processing occurs within other SLs.
As can be seen from the foregoing description, in the context of conventional convolution operations, all the arithmetic circuits within a single slave processing circuit compute one output feature map, or part of one, corresponding to the same output channel Co. Depending on the buffer space of the first and second buffer circuits within the slave processing circuit SL and the processing capacity of the arithmetic circuits CU (e.g., internal registers), a slave processing circuit generally cannot compute its allocated output feature map in one pass. Therefore, the output feature map can be divided into output feature blocks (or output blocks) in units of the single-operation capability of an arithmetic circuit (e.g., Nop output points or partial sums per computation), each output block corresponding to the single-operation capability of all Ncu schedulable arithmetic circuits within a single SL (i.e., Ncu×Nop output points).
In the disclosed embodiments, as follows from the foregoing Ci-dimension segmentation, each output block includes Ws output points that are contiguous in the width Wo dimension. For example, when Pci=16B, Ws=4; when Pci=32B, Ws=2. When one slave processing circuit includes Ncu schedulable arithmetic circuits, the Ncu arithmetic circuits can compute Ncu output blocks in parallel. These output blocks may be arranged contiguously on the output feature map in the width Wo dimension and/or the height Ho dimension.
For example, taking the 4 CUs per SL of FIG. 5 above, each CU can compute at a time 1×Ws (Ho×Wo) output points, or partial sums thereof, constituting one output block, and a single SL can compute 4 such output blocks at a time. These 4 output blocks may be arranged in the WoHo dimensions of the output feature map in a variety of ways.
FIG. 8 illustrates several exemplary ways of dividing output blocks among the multiple arithmetic circuits of a single slave processing circuit in accordance with embodiments of the present disclosure. In this example, the number of arithmetic circuits is assumed to be Ncu=4.
As shown in FIG. 8(a), the output blocks may be arranged contiguously in the Wo dimension. Specifically, the 4 output blocks of 1×Ws are arranged in a row in the Wo dimension, forming a 4Ws×1 (Wo×Ho) distribution, with each arithmetic circuit computing one of the output blocks. When Ws=4, the 4 arithmetic circuits together compute an output feature map portion of size 16×1 (Wo×Ho); when Ws=2, they compute a portion of size 8×1 (Wo×Ho).

As shown in FIG. 8(b), the output blocks may instead be arranged contiguously in the Ho dimension. Specifically, the 4 output blocks of 1×Ws are arranged in sequence in a column in the Ho dimension, forming a Ws×4 (Wo×Ho) distribution. When Ws=4, the 4 arithmetic circuits together compute an output feature map portion of size 4×4 (Wo×Ho); when Ws=2, a portion of size 2×4 (Wo×Ho).

As shown in FIG. 8(c), the output blocks may also be arranged contiguously in both the Wo and Ho dimensions. Specifically, the 4 output blocks of 1×Ws are arranged 2×2 in the WoHo dimensions, forming a 2Ws×2 (Wo×Ho) distribution. When Ws=4, the 4 arithmetic circuits together compute an output feature map portion of size 8×2 (Wo×Ho); when Ws=2, a portion of size 4×2 (Wo×Ho).
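The three layouts of FIG. 8 can be summarized by a small sketch mapping each CU to the origin of its 1×Ws output block (the layout labels "a"/"b"/"c" follow the figure panels; the function name is ours):

```python
# Illustrative sketch of the three output-block layouts of FIG. 8, for
# Ncu = 4 arithmetic circuits; returns the (ho, wo) origin of cu's block.
def block_origin(cu: int, ws: int, layout: str):
    if layout == "a":             # row in Wo: 4Ws x 1 (Wo x Ho)
        return (0, cu * ws)
    if layout == "b":             # column in Ho: Ws x 4 (Wo x Ho)
        return (cu, 0)
    if layout == "c":             # 2 x 2 tiling: 2Ws x 2 (Wo x Ho)
        return (cu // 2, (cu % 2) * ws)
    raise ValueError(layout)

# Ws = 4: layout (a) covers 16x1, (b) covers 4x4, (c) covers 8x2 (Wo x Ho).
print([block_origin(cu, 4, "c") for cu in range(4)])
# [(0, 0), (0, 4), (1, 0), (1, 4)]
```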
With the division of output blocks described, the specific computation of a single output point within an output block is described below.
From the convolution operation principle described earlier, the value of each convolution output point on the output feature map corresponds to the result of para-multiply-accumulating the input feature map with the weights within the convolution window. That is, the value of each output point is obtained by aligned multiplication of the individual parts followed by accumulation.
In some embodiments, for a single output point in the output feature map, its value may be computed in a multi-layer loop in the following order: the Kw and Kh dimensions of the convolution kernel serve as the inner loop for computing partial sums of the output point, Kw being the kernel width dimension and Kh the kernel height dimension; the convolution stride Sw in the W dimension serves as the middle loop for computing partial sums of the output point; the number Ci_seg_num of segments split by Pci in the Ci dimension of the convolution kernel serves as the outer loop for computing partial sums of the output point; and the partial sums are accumulated to obtain the value of the output point.
From the foregoing description, the convolution kernel and the input feature map are each split into multiple segments along the Ci dimension according to Pci and stored in segments; therefore, the Ci-dimension segments can be treated as the outermost loop. The number of loops is Nci = Ci_seg_num = ceil(Ci/Pci).
Since the convolution optimization scheme supports a convolution stride Sw other than 1 and accordingly provides the input feature data by interval reading, the convolution stride Sw in the W dimension may be cycled as the middle loop. The number of cycles is Nsw = Sw.
The Kw and Kh dimensions of the convolution kernel may be treated as the innermost loop, with a total loop count over all Sw rounds of Nk = Kw×Kh. The loop count within each Sw round depends on the current round: the Kw-dimension loop must determine its count and corresponding indices from the value of Sw. For example, when Sw=1, there is only one Sw round, and the inner Kw and Kh loop count is Nk = Kw×Kh. When Sw=2 and Kw=Kh=3, there are 2 Sw rounds: in the first round, sw_index=0, the convolution kernel takes the weights of columns 0 and 2 (i.e., kw_index=0/2), looping 3×2=6 times; in the second round, sw_index=1, the kernel takes the column 1 weights (i.e., kw_index=1), looping 3×1=3 times; the total is 3×3=9 loops.
It can be seen that the total cycle count for a single output point is Ncycle = Nci × Nk = Ci_seg_num × Kw × Kh; that is, partial sums are computed data block by data block in the W and H dimensions, and segment by segment in the Ci dimension.
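The loop structure described above can be sketched as follows (an illustrative Python fragment; `partial_sum` is a hypothetical stand-in for one para-multiply-accumulate over a Pci segment):

```python
import math

# Outer: Ci segments; middle: Sw rounds; inner: Kh/Kw, where kw_index steps
# by Sw within each round so only the columns of that round are visited.
def output_point(ci, pci, kh, kw, sw, partial_sum):
    acc = 0
    for ci_seg_index in range(math.ceil(ci / pci)):          # outer: Ci segments
        for sw_index in range(sw):                           # middle: Sw rounds
            for kh_index in range(kh):                       # inner: Kh
                for kw_index in range(sw_index, kw, sw):     # inner: Kw, step Sw
                    acc += partial_sum(ci_seg_index, sw_index, kh_index, kw_index)
    return acc

# For Sw=2, Kw=Kh=3: round sw_index=0 visits kw_index 0,2 (3x2=6 loops);
# round sw_index=1 visits kw_index 1 (3x1=3 loops); 9 inner loops per Ci segment.
calls = []
output_point(32, 16, 3, 3, 2, lambda *ix: calls.append(ix) or 0)
assert len(calls) == 2 * 9   # Ncycle = Ci_seg_num x Kw x Kh = 18
```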
Further, the loop order of the innermost Kw and Kh dimensions may be interchanged. In some embodiments, when the Ncu output blocks computed by the Ncu arithmetic circuits are contiguous in the width Wo dimension, the Kw-dimension loop may be prioritized, followed by the Kh-dimension loop, facilitating possible data multiplexing in the W dimension. In other embodiments, where the Ncu output blocks are contiguous in the height Ho dimension, the Kh-dimension loop may be prioritized, followed by the Kw-dimension loop, facilitating possible data multiplexing in the H dimension. In still other embodiments, when the Ncu output blocks are contiguous in both the width Wo and height Ho dimensions, the ordering of the Kw and Kh loops may be arbitrary. The above loop procedure is described later with reference to specific embodiments.
It will be appreciated that when the width dimension of the convolution kernel exceeds the maximum kernel width Kmax supported by the slave processing circuit, a split in the Kw direction according to that maximum width is required. In that case, a further loop level over the Kw splits is added to the three-layer loop above, which is not detailed here.
As mentioned previously, in some embodiments the input feature data may additionally be multiplexed in the H dimension, further reducing memory accesses. Specifically, each selected input feature row may be multiplexed rn times, performing the para-multiply-accumulate operation against rn extended weight rows corresponding to different positions of the convolution kernel in the height dimension, thereby obtaining rn output blocks of the output feature map that are contiguous in the height dimension, where rn is determined from the kernel height dimension Kh and the convolution stride Sy of the convolution operation in the height direction.
FIG. 9 illustrates a schematic of multiplexing input feature data in the H dimension according to some embodiments of the present disclosure.
As shown, when the same input feature data point traverses Kh weight points in the H dimension for para-multiply-accumulate operations, the resulting partial sums belong to different output points. Take the input feature data point <2,0> as an example. When <2,0> is para-multiply-accumulated with weight point <0,0>, corresponding to convolution window A, a first partial sum is obtained that belongs to output point <2,0>; when <2,0> is para-multiply-accumulated with weight point <1,0>, corresponding to convolution window B, a second partial sum is obtained that belongs to output point <1,0>; when <2,0> is para-multiply-accumulated with weight point <2,0>, corresponding to convolution window C, a third partial sum is obtained that belongs to output point <0,0>.
It follows that the multiplexing count of the input feature map in the H dimension depends on the maximum overlap of adjacent convolution windows in the H dimension. For example, in the above example with Kh=3 and Sy=1, the input feature data point <2,0> is covered simultaneously by the three convolution windows corresponding to three output points (output points <2,0>, <1,0> and <0,0>), and can therefore be multiplexed 3 times. It will be appreciated that when Sy>1, the multiplexing count rn is smaller than Kh: rn = Kh - Sy + 1; and some data points are not covered by overlapping convolution windows, i.e., require no multiplexing.
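This overlap count can be checked with a small sketch (ours) that enumerates the convolution windows covering a given input row:

```python
# Which output rows (convolution windows) cover input row h, for kernel
# height kh, height stride sy, and ho output rows? The count for a fully
# covered row equals rn = Kh - Sy + 1.
def windows_covering(h: int, kh: int, sy: int, ho: int):
    return [o for o in range(ho) if o * sy <= h < o * sy + kh]

print(len(windows_covering(h=2, kh=3, sy=1, ho=3)))  # 3 = Kh - Sy + 1 (windows A, B, C)
print(len(windows_covering(h=2, kh=3, sy=2, ho=3)))  # 2 = Kh - Sy + 1
```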
The above describes how the value of a single output point is obtained by computing and accumulating partial sums over multiple cycles, and how, by interleaving input feature map multiplexing in the H dimension into the computation of a single output point, multiple output points/output blocks in the H dimension can be computed.
To fully exploit the parallelism of the multiple arithmetic circuits in a slave processing circuit, the multiple arithmetic circuits CU within a single slave processing circuit compute the output feature map in parallel. Considering the dimension storage order of the output feature map and the Ws split of the input feature map in the W dimension, in some embodiments each slave processing circuit may write the operation results of its arithmetic circuits into the third buffer circuit alternately, in the order of width dimension Wo first and height dimension Ho second. The computing device (e.g., the master processing circuit) may then read the operation results from the third buffer circuit of each slave processing circuit in the order of the output channel Co values within a single Pco segment, so that the HoWoCo dimension storage order is followed within each Pco segment; the output feature map is thereby stored by Pco segments and can be used directly in the convolution operation of the next layer.
The detailed operation procedure of the convolution operation of embodiments of the present disclosure is described below in connection with specific examples.
In the following examples, the parameters are defined as follows:
Ci_byte: the number of bytes in the Ci dimension;
Co_num: the size of the Co dimension;
Ci_seg_size: the size of a Ci-dimension segment, i.e., the split granularity Pci described earlier, e.g., 16B or 32B;
Co_seg_size: the size of a Co-dimension segment, i.e., the split granularity Pco, e.g., 16, 32, 48, or 64;
Ci_seg_num: the number of Ci-dimension segments, Ci_seg_num = ceil(Ci_byte/Ci_seg_size);
Co_seg_num: the number of Co-dimension segments, Co_seg_num = ceil(Co_num/Co_seg_size).
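These derived quantities are plain ceiling divisions; for reference, a sketch using the parameters of Example 1 below (Co_num=48 is an assumption, since only the segment size is given):

```python
import math

Ci_byte, Ci_seg_size = 32, 16    # two 16B segments in the Ci dimension
Co_num, Co_seg_size = 48, 48     # one 48-value segment in the Co dimension (assumed)

Ci_seg_num = math.ceil(Ci_byte / Ci_seg_size)   # = 2
Co_seg_num = math.ceil(Co_num / Co_seg_size)    # = 1
```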
Example 1: Ci_seg_size=16B, Ci_seg_num=2, Wo×Ho=16×1, Co_seg_size=48, Kw=Kh=3, Sw=Sh=2
FIGS. 10a-10d show schematic diagrams of the operation of a convolution scheme according to Example 1 of the present disclosure. In this example, Ci is split into two segments, Ci_seg_index=0 to 1, and one input feature data row therefore has the format 4×16B (Wi×Ci); as shown, each row includes 4 columns of Wi data, so the output block computed by one arithmetic circuit includes 1×4 (Co×Wo) output points. Co_seg_size=48, and assuming the number of schedulable slave processing circuits is Ns=8, each slave processing circuit processes 6 Co values per round of operation. Assume the convolution kernel size is Kh×Kw=3×3. In the following description, the height/width coordinates <h,w> denote individual data points, each of size Ci_seg_size (here 16B) in the Ci dimension.
Considering the convolution stride Sw=Sh=2, when supplying data to the first buffer circuit, the corresponding input feature map data may be read and buffered in a discrete reading manner, reading 16B per 16B skipped. Assuming an input operation bandwidth of 128B, 8 blocks of 16B data can be read in one beat. In this example, per the stride Sw=2, the first beat may read 8×16B of input feature map data with w-coordinates 0, 2, 4, 6, 8, 10, 12, 14, and store them in the first buffer circuit. Since Sw is not 1 here, Sw gives rise to two loop rounds when Sw=2: even columns in the w direction (sw_index=0) and odd columns (sw_index=1).
FIG. 10a shows the loop operation for Ci_seg_index=0, sw_index=0, kh_index=0, kw_index=0. Ncu input feature rows are selected from the first buffer circuit in a manner corresponding to the division of the output blocks and sent to the Ncu arithmetic circuits respectively. During the subsequent 6 computations, these Ncu input feature rows are each multiplexed 6 times within their respective arithmetic circuits.
During these 6 computations, extended weight rows corresponding to different Co values are selected in turn from the second buffer circuit and broadcast to the Ncu arithmetic circuits for computation. In this example, each extended weight row is formed by copying a 1/4-row weight block into 4 copies. For the example in the figure, take slave processing circuit SL0, which processes the 6 convolution kernels of Co=0/8/16/24/32/40.
During the 1st computation, shown by arrow (1), data is selected from the data segment of Ci_seg_index=0. Specifically, the data row formed by input feature points <0,0>, <0,2>, <0,4>, <0,6> is selected and sent to the first arithmetic circuit CU0; the data row formed by <0,8>, <0,10>, <0,12>, <0,14> is sent to arithmetic circuit CU1; the data row formed by <0,16>, <0,18>, <0,20>, <0,22> is sent to arithmetic circuit CU2; and the data row formed by <0,24>, <0,26>, <0,28>, <0,30> is sent to arithmetic circuit CU3 (the selections are shown by black dotted boxes in the figure). Correspondingly, the extended weight row A0 (hereinafter "A0"), formed by extending data point <0,0> of the convolution kernel data segment with Co=0 and Ci_seg_index=0, is selected and broadcast to the four arithmetic circuits. The 4 arithmetic circuits thus each perform the para-multiply-accumulate operation to obtain partial sums of the 16 output points w0 to w15 on ho=0, each arithmetic circuit computing 4 adjacent output points. It will be appreciated that these computed output points are all partial sums, accumulating only the data on Ci_seg_index=0 corresponding to kh_index=0. During the remaining 5 computations, indicated by arrows (2)-(6), the partial sums of the output points on the corresponding Co are computed for the 5 convolution kernels of Co=8/16/24/32/40 respectively, and are not described again here. Thus, cycling in turn through the 6 Co values processed by the single slave processing circuit yields 16 Wo-dimension output points for each of the 6 Co values. In the computation of each CU, the input feature rows are multiplexed 6 times, while in the computation of a single slave processing circuit the weights are multiplexed Ncu × M/Ci_seg_size = 4 × (64/16) = 16 times.
The inner Kw-dimension loop can then continue, i.e., kw_index=2.
FIG. 10b shows the loop operation for Ci_seg_index=0, sw_index=0, kh_index=0, kw_index=2. Since the convolution stride is Sw=2, a conventional sliding selection would also use step 2. In this embodiment, because skip reading based on the convolution stride has already been applied when supplying the input feature map data to the first buffer circuit, selection can still slide with step 1 over the first buffer circuit. For kw_index=2, the Co loop and input feature map multiplexing proceed as in FIG. 10a. The difference lies in the data selected for the operation: 4 input feature rows are now selected from the first buffer circuit by sliding one step in the Wi direction and sent to the 4 arithmetic circuits respectively; and the weight data of kh_index=0, kw_index=2 corresponding to the different Co values is selected in turn from the second buffer circuit. During the subsequent 6 computations, the 4 input feature rows are each multiplexed 6 times in the 4 arithmetic circuits, each time with an extended weight row of a different Co value.
For example, during the 1st computation, shown by arrow (1), data is selected from the first buffer circuit with a slide of 1 step. Specifically, the data row formed by input feature points <0,2>, <0,4>, <0,6>, <0,8> is selected and sent to the first arithmetic circuit CU0, the data row formed by <0,10>, <0,12>, <0,14>, <0,16> is sent to arithmetic circuit CU1, and so on. Correspondingly, the extended weight row A2 (hereinafter "A2"), formed by extending data point <0,2> of the convolution kernel data segment with Co=0 and Ci_seg_index=0, is selected and broadcast to the four arithmetic circuits. The 4 arithmetic circuits thus each perform the para-multiply-accumulate operation to obtain partial sums of the 16 output points w0 to w15 on ho=0, each computing 4 adjacent output points. It will be appreciated that these computed output points are all partial sums, accumulating only the data on Ci_seg_index=0 corresponding to kh_index=0, kw_index=2. It will also be appreciated that since the input feature rows slide in the Wi direction (Sw=2) and the weight also slides 2 steps in the W direction (kw_index=2), these partial sums still belong to the 16 output points w0 to w15 in the Wo dimension of Co=0. They are therefore added to the partial sums obtained in the previous computation (FIG. 10a).
During the remaining 5 computations, indicated by arrows (2)-(6), the partial sums of the output points on the corresponding Co are similarly computed for the 5 convolution kernels of Co=8/16/24/32/40, and are not described again here.
Thus, after the Kw loop for kh_index=0 under sw_index=0 (with Sw=2) is completed, the Kh loop can proceed next, i.e., switching to kh_index=1.
FIG. 10c shows the loop operation for Ci_seg_index=0, sw_index=0, kh_index=1, kw_index=0. Note that at this point the first buffer circuit must reload input feature map data. As in FIG. 10a, per the stride Sw=2, the first beat can read 8×16B of input feature map data with w-coordinates 0, 2, 4, 6, 8, 10, 12, 14, except that the row of data with hi=1 is now read.
As can be seen from FIG. 10c, through 6 computation cycles the partial sums of the 16 output points w0 to w15 on ho=0 can similarly be computed for each of Co=0/8/16/24/32/40, each arithmetic circuit computing 4 adjacent output points. These computed output points are again partial sums, accumulating only the data on Ci_seg_index=0 corresponding to kh_index=1 and kw_index=0. Also, since the input feature rows have hi=1 and the weights have kh_index=1, these partial sums still belong to the 16 output points w0 to w15 in the Wo dimension of the corresponding Co, and are therefore added to the partial sums obtained in the previous computation (FIG. 10b).
Similarly, the foregoing loop operation may be performed for Ci_seg_index=0, sw_index=0, kh_index=1, kw_index=2, completing the loop for the kh_index=1 portion.
When the Kh loop is also completed (e.g., by continuing the operation with the corresponding data of Ci_seg_index=0, sw_index=0, kh_index=2), the Sw loop may proceed, i.e., switching to sw_index=1. This round correspondingly uses the odd columns of the input feature map in the w direction.
FIG. 10d shows the loop operation for Ci_seg_index=0, sw_index=1, kh_index=0, kw_index=1. Note that the input feature map data loaded into the first buffer circuit at this point is the data with w-coordinates 1, 3, 5, 7, 9, 11, 13, 15, ….
During the 1st computation, shown by arrow (1), data is selected from the data segment of Ci_seg_index=0. Specifically, the data row formed by input feature points <0,1>, <0,3>, <0,5>, <0,7> is selected and sent to the first arithmetic circuit CU0, the data row formed by <0,9>, <0,11>, <0,13>, <0,15> is sent to arithmetic circuit CU1, and so on. Correspondingly, the extended weight row A1 (hereinafter "A1"), formed by extending data point <0,1> of the convolution kernel data segment with Co=0 and Ci_seg_index=0, is selected and broadcast to the four arithmetic circuits. The 4 arithmetic circuits thus each perform the para-multiply-accumulate operation to obtain partial sums of the 16 output points w0 to w15 on ho=0, each computing 4 adjacent output points. These computed output points are all partial sums, accumulating only the data on Ci_seg_index=0 corresponding to kh_index=0 and kw_index=1.
During the remaining 5 computations, indicated by arrows (2)-(6), the partial sums of the output points on the corresponding Co are computed for the 5 convolution kernels of Co=8/16/24/32/40, and are not described again here.
Thus, following the split by Sw, all loops in the Kw direction are processed.
Next, the foregoing loop operation may be performed for kh_index=1 and 2 with Ci_seg_index=0 and sw_index=1, completing the Kh loop and, with it, the Sw loop.
Then the Ci_seg loop may be performed, i.e., switching to Ci_seg_index=1 and repeating the above selection and loop computation over the data segment of Ci_seg_index=1. When all these loops are completed, each arithmetic circuit has accumulated the final convolution results of 4 output points for each of 6 Co values. The 4 arithmetic circuits within 1 slave processing circuit together yield the 16 wo-direction output points on the 6 Co values, and the 8 slave processing circuits obtain 16×1 (Wo×Ho) output points on 48 Co values in total.
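The selection pattern of FIGS. 10a-10d can be condensed into one formula, wi = wo×Sw + kw_index (ignoring padding); the following sketch (ours) reproduces the coordinates quoted above:

```python
# Illustrative only: the W coordinates of the Ws=4 input feature points that
# CU number `cu` reads at a given (sw_index, kw_index) in Example 1 (Sw=2).
def selected_wi(cu: int, sw_index: int, kw_index: int, sw: int = 2, ws: int = 4):
    # Within round sw_index, only kw_index values with kw_index % Sw == sw_index occur.
    assert kw_index % sw == sw_index
    # Each output point wo reads input column wi = wo*Sw + kw_index.
    return [(cu * ws + i) * sw + kw_index for i in range(ws)]

print(selected_wi(0, 0, 0))  # [0, 2, 4, 6]    (FIG. 10a, CU0)
print(selected_wi(1, 0, 0))  # [8, 10, 12, 14] (FIG. 10a, CU1)
print(selected_wi(0, 0, 2))  # [2, 4, 6, 8]    (FIG. 10b, CU0: slide 1 step)
print(selected_wi(0, 1, 1))  # [1, 3, 5, 7]    (FIG. 10d, CU0: odd columns)
```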
FIG. 11 shows a schematic diagram of the write and output logic for the operation results according to Example 1 of the present disclosure.
As shown, the multiple arithmetic circuits CU within a single slave processing circuit SL may write their operation results into a result buffer circuit (e.g., the third buffer circuit of FIG. 5) in the order of the operations. Specifically, the output points of the same Co computed by the CUs may be written first, in Wo order (write loop (1)); next, the output points of the different Co values computed by the CUs are written in Co order (write loop (2)). For example, for SL0, w0 to w15 of Co=0 are written first, then w0 to w15 of Co=8, then w0 to w15 of Co=16, cycling in turn. The other SLs write their results similarly, differing only in the Co values processed.
As the above write order shows, the operation results in the result buffer circuit are stored in CW order (W being the lowest storage dimension). The final output, however, is desired to be stored in Co segments, with the HWC storage order (C being the lowest storage dimension) within each segment; the data is therefore converted from CW order to WC order on the datapath along which results are read out of the result buffer circuit.
Specifically, as shown, the first output point w0 of the W dimension may be read in Co order from the result buffer circuits of the slave processing circuits in turn (output loop (1)), and the output points on each Wo may then be read out in Wo order (output loop (2)). The right side of FIG. 11 shows the read-out result; note that when reading in Co order, the reads alternate over the result buffer circuits of the 8 SLs, so that the Co dimension is contiguous, e.g., from 0 to 47.
The data, output in Co segments in this way, can be used directly in the operation of the next convolution layer without additional data rearrangement.
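The write and readout orders of FIG. 11 can be sketched as follows (an illustrative fragment of ours assuming the interleaved Co allocation, where Co = co_idx×Ns + sl):

```python
def readout(buffers, ns=8, co_per_sl=6, wo=16):
    """buffers[sl][co_idx][w] holds output point w of Co = co_idx*ns + sl.
    Returns the flat HoWoCo-ordered stream of one Pco=48 segment (Ho=1 here)."""
    out = []
    for w in range(wo):                    # output loop (2): Wo order
        for co_idx in range(co_per_sl):    # rounds of Co across the SLs
            for sl in range(ns):           # output loop (1): alternate over the 8 SLs
                out.append(buffers[sl][co_idx][w])
    return out

bufs = [[[f"co{ci * 8 + sl}_w{w}" for w in range(16)] for ci in range(6)] for sl in range(8)]
print(readout(bufs)[:3])  # ['co0_w0', 'co1_w0', 'co2_w0']: Co contiguous within each w
```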
Example 2: Ci_seg_size=16B, Ci_seg_num=2, Wo×Ho=4×4, Co_seg_size=48, Kw=Kh=3, Sw=Sh=1
FIG. 12 shows a schematic diagram of the operation of a convolution scheme according to Example 2 of the present disclosure. Example 2 shares most parameters with Example 1, except that the output data processed within a single slave processing circuit is split as Wo×Ho=4×4, and the convolution stride is 1.
Since the convolution stride is Sw=Sh=1, when supplying data to the first buffer circuit, the input feature map can be read directly in w-direction order and buffered. Because the output data is split as 4×4 (Wo×Ho), data must also be read in the h direction. In this example, the first beat may read a total of 8×16B of input feature map data with h-coordinates 0 to 1 and w-coordinates 0 to 3; the second beat may read 8×16B with h-coordinates 2 to 3 and w-coordinates 0 to 3. Since Sw is 1 here, there is no loop induced by Sw, i.e., sw_index only takes the value 0.
FIG. 12 shows the loop operation for Ci_seg_index=0, sw_index=0, kh_index=0, kw_index=0. Ncu input feature rows are selected from the first buffer circuit in a manner corresponding to the division of the output blocks and sent to the Ncu arithmetic circuits respectively. During the subsequent 6 computations, these Ncu input feature rows are each multiplexed 6 times within their respective arithmetic circuits, while extended weight rows corresponding to different Co values are selected in turn from the second buffer circuit and broadcast to the Ncu arithmetic circuits for computation.
During the 1st computation, shown by arrow (1), data is selected from the data segment of Ci_seg_index=0. Specifically, the data row formed by input feature points <0,0>, <0,1>, <0,2>, <0,3> is selected and sent to the first arithmetic circuit CU0, the data row formed by <1,0>, <1,1>, <1,2>, <1,3> is sent to arithmetic circuit CU1, and so on (the selections are shown by black dotted boxes in the figure). Correspondingly, the extended weight row A0 ("A0"), formed by extending data point <0,0> of the convolution kernel data segment with Co=0 and Ci_seg_index=0, is selected and broadcast to the four arithmetic circuits. The 4 arithmetic circuits thus each perform the para-multiply-accumulate operation to obtain partial sums of the 16 output points w0 to w3 on ho=0 to 3, each arithmetic circuit computing 4 adjacent output points on a different ho. These computed output points are all partial sums, accumulating only the data on Ci_seg_index=0 corresponding to kh_index=0. During the remaining 5 computations, indicated by arrows (2)-(6), the partial sums of the output points on the corresponding Co are computed for the 5 convolution kernels of Co=8/16/24/32/40, and are not described again here. Thus, cycling in turn through the 6 Co values processed by the single slave processing circuit yields the 4×4 (Wo×Ho) output points for each of the 6 Co values.
Since the data in the h direction is already loaded, partial sums can be computed by looping Kh times in the h direction according to the size of Kh, thereby multiplexing the input feature map. The Kh-dimension loop is then performed, i.e., data is selected by sliding in the h direction in the order kh_index=1, 2 and the partial sums are computed. For example, for kh_index=1, the data rows with wi=0 to 3 on hi=1, 2, 3, 4 are selected by sliding and sent to the 4 arithmetic circuits, computing the partial sums of the 4 output points w0 to w3 on the different ho.
After the Kh-dimension loop, the Kw-dimension loop may continue, i.e., data is selected in the w direction in the order kw_index=1, 2 and the partial sums are computed. For example, for kw_index=1, the input feature map data selected is the data rows with hi=0, 1, 2, 3 and wi=1 to 4, which are sent to the 4 arithmetic circuits to compute the partial sums of the 4 output points w0 to w3 on ho=0 to 3.
After the Kw-dimension loop is completed, since the convolution stride is 1, there is no loop induced by the stride.
Then the Ci_seg loop may be performed, i.e., switching to Ci_seg_index=1 and repeating the above selection and loop computation over the data segment of Ci_seg_index=1. When all these loops are completed, each arithmetic circuit has accumulated the final convolution results of 4 output points for each of 6 Co values. The 4 arithmetic circuits within 1 slave processing circuit together yield the 4×4 (Wo×Ho) output points on the 6 Co values, and the 8 slave processing circuits obtain 4×4 (Wo×Ho) output points on 48 Co values in total.
Example 3: Ci_seg_size=32B, Ci_seg_num=2, Wo×Ho=8×1, Co_seg_size=48, Kw=Kh=3, Sw=Sh=2
FIG. 13 shows a schematic diagram of the operation of a convolution scheme according to Example 3 of the present disclosure. Example 3 shares most parameters with Example 1, except that the Ci-dimension segment size is 32B and, accordingly, the output data processed within a single slave processing circuit is split as Wo×Ho=8×1.
In this example, since the Ci segment size is 32B, one input feature data row has the format 2×32B (Wi×Ci); as shown, each row includes 2 columns of Wi data, so the output block computed by one arithmetic circuit includes 1×2 (Co×Wo) output points.
Considering the convolution stride Sw=Sh=2, when supplying data to the first buffer circuit, the corresponding input feature map data may be read and buffered in a discrete reading manner, reading 32B per 32B skipped. Assuming an input operation bandwidth of 128B, 4 blocks of 32B data can be read in one beat. In this example, per the stride Sw=2, the first beat may read 4×32B of input feature map data with w-coordinates 0, 2, 4, 6 and store them in the first buffer circuit. Since Sw is not 1 here, Sw gives rise to two loop rounds when Sw=2: even columns in the w direction (sw_index=0) and odd columns (sw_index=1).
FIG. 13 shows the loop operation for Ci_seg_index=0, sw_index=0, kh_index=0, kw_index=0. Ncu input feature rows are selected from the first buffer circuit in a manner corresponding to the division of the output blocks and sent to the Ncu arithmetic circuits respectively. During the subsequent 6 computations, these Ncu input feature rows are each multiplexed 6 times within their respective arithmetic circuits.
During these 6 computations, extended weight rows corresponding to different Co values are selected in turn from the second buffer circuit and broadcast to the Ncu arithmetic circuits for computation. In this example, each extended weight row is formed by copying a 1/2-row weight block into 2 copies. For the example in the figure, take slave processing circuit SL0, which processes the 6 convolution kernels of Co=0/8/16/24/32/40.
During the 1st computation, shown by arrow (1), data is selected from the data segment of Ci_seg_index=0. Specifically, the data row formed by input feature points <0,0>, <0,2> is selected and sent to the first arithmetic circuit CU0; the data row formed by <0,4>, <0,6> is sent to arithmetic circuit CU1; the data row formed by <0,8>, <0,10> is sent to arithmetic circuit CU2; and the data row formed by <0,12>, <0,14> is sent to arithmetic circuit CU3 (the selections are shown by black dotted boxes in the figure). Correspondingly, the extended weight row A0 (hereinafter "A0"), formed by extending data point <0,0> of the convolution kernel data segment with Co=0 and Ci_seg_index=0, is selected and broadcast to the four arithmetic circuits. The 4 arithmetic circuits thus each perform the para-multiply-accumulate operation to obtain partial sums of the 8 output points w0 to w7 on ho=0, each arithmetic circuit computing 2 adjacent output points. These computed output points are all partial sums, accumulating only the data on Ci_seg_index=0 corresponding to kh_index=0. During the remaining 5 computations, indicated by arrows (2)-(6), the partial sums of the output points on the corresponding Co are computed for the 5 convolution kernels of Co=8/16/24/32/40, and are not described again here. Thus, cycling in turn through the 6 Co values processed by the single slave processing circuit yields the 8 Wo-dimension output points for each of the 6 Co values. In the computation of each CU, the input feature rows are multiplexed 6 times, while in the computation of a single slave processing circuit the weights are multiplexed Ncu × M/Ci_seg_size = 4 × (64/32) = 8 times.
The inner Kw-dimension loop can then continue, i.e., kw_index=2.
Thus, after the Kw loop for kh_index=0 under sw_index=0 (with Sw=2) is completed, the Kh loop can proceed next, i.e., switching to kh_index=1.
When the Kh loop is also completed (e.g., by continuing the operation with the corresponding data of Ci_seg_index=0, sw_index=0, kh_index=2), the Sw loop may proceed, i.e., switching to sw_index=1. This round correspondingly uses the odd columns of the input feature map in the w direction. The specific loop is similar to the above and is not detailed here.
After the Sw loop is completed, the Ci_seg loop may then be performed, i.e., switching to Ci_seg_index=1 and repeating the above selection and loop computation over the data segment of Ci_seg_index=1. When all these loops are completed, each arithmetic circuit has accumulated the final convolution results of 2 output points for each of 6 Co values. The 4 arithmetic circuits within 1 slave processing circuit together yield the 8 wo-direction output points on the 6 Co values, and the 8 slave processing circuits obtain 8×1 (Wo×Ho) output points on 48 Co values in total.
The convolution optimization schemes provided by embodiments of the present disclosure have been exemplarily described above in connection with the specific convolution operation procedures of several examples. It will be appreciated that different values of the individual parameters yield further combinations and thus different embodiments. Moreover, based on the teachings of the present disclosure, those skilled in the art may devise other convolution optimization schemes for specific hardware circuit configurations (such as the number of slave processing circuits, the number of arithmetic circuits within a slave processing circuit, and the hardware single-pass processing capacity), which all fall within the scope of the present disclosure and are not enumerated here.
The disclosed embodiments also provide a chip that may include the data processing apparatus of any of the embodiments described above in connection with the accompanying drawings. Further, the present disclosure also provides a board that may include the foregoing chip.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of the actions described: certain steps may, in light of the present disclosure, be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as optional embodiments, i.e., the actions or modules involved are not necessarily required for the implementation of some aspect of this disclosure. In addition, depending on the scenario, the descriptions of different embodiments have different emphases. In view of this, those skilled in the art may refer to the descriptions of other embodiments for portions not detailed in a given embodiment.
In particular implementations, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that several embodiments of the disclosure disclosed herein may also be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are split in consideration of the logic function, and there may be another splitting manner when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present disclosure. Furthermore, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit, or each unit may physically exist alone.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e. as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as central processing units, GPU, FPGA, DSP, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which may be, for example, variable resistance memory (Resistive Random Access Memory, RRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), high bandwidth memory (High Bandwidth Memory, HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM, RAM, etc.
The foregoing may be better understood in light of the following clauses:
Clause 1, a computing device configured to perform convolution operations, the computing device comprising:
an input storage circuit for storing an input feature map and convolution kernels, wherein the input feature map is stored in segments according to the split granularity Pci of the input channel Ci dimension;

a plurality of slave processing circuits for performing convolution operations on the broadcast input feature map and the corresponding convolution kernels allocated to them, respectively; and

an output storage circuit for storing the output feature maps computed and output by the plurality of slave processing circuits, wherein the output feature maps are stored in segments according to the split granularity Pco of the output channel Co dimension.
Clause 2, the computing device of clause 1, wherein each slave processing circuit comprises a first buffer circuit, a second buffer circuit, and a plurality of arithmetic circuits, wherein:
the first buffer circuit is configured to buffer a plurality of input feature rows to be convolved, wherein one input feature row comprises an amount Pci×Ws=M of data from the input feature map, Ws is the split multiple in the width W dimension, and M is the amount of data processed by the hardware in a single pass;
the second buffer circuit is used for buffering weight data to be subjected to convolution operation; and
each arithmetic circuit is configured to perform, at each computation, a para-multiply-accumulate operation between an input feature row selected from the first buffer circuit and an extended weight row selected or generated from the second buffer circuit, wherein one extended weight row is formed by copying a column of data blocks of the convolution kernel, split or aligned to Pci in the Ci dimension, into Ws columns.
Clause 3, the computing device of clause 2, wherein:

when the convolution stride Sw of the convolution operation in the W dimension is 1, one input feature row is composed of Ws data blocks read consecutively in the W dimension, the Ci dimension of each being Pci; and/or

when the convolution stride Sw of the convolution operation in the W dimension is not 1, one input feature row is composed of Ws data blocks read at intervals of (Sw-1) data blocks in the W dimension, the Ci dimension of each being Pci.
Clause 4, the computing device of any of clauses 2-3, wherein for a single output channel Co in the output feature map, each of the slave processing circuits is further configured to:

within each computation cycle, use its Ncu schedulable arithmetic circuits to compute in parallel Ncu output blocks of the output feature map, each output block comprising Ws output points contiguous in the width Wo dimension, the Ncu output blocks being contiguous in the width Wo dimension and/or the height Ho dimension.
Clause 5, the computing device of clause 4, wherein for a single output point on a single output channel Co in the output feature map, the arithmetic circuit computes the value of the output point in a multi-layer loop in the following order, wherein:

the Kw and Kh dimensions of the convolution kernel serve as the inner loop for computing partial sums of the output point, Kw being the width dimension and Kh the height dimension of the convolution kernel;

the convolution stride Sw of the convolution operation in the W dimension serves as the middle loop for computing partial sums of the output point;

the number Ci_seg_num of segments split by Pci in the Ci dimension of the convolution kernel serves as the outer loop for computing partial sums of the output point; and

the partial sums are accumulated to obtain the value of the output point.
Clause 6, the computing device of clause 5, wherein in the inner loop,
when the Ncu output blocks are contiguous in the width Wo dimension, the Kw-dimension loop is performed first, followed by the Kh-dimension loop; or
when the Ncu output blocks are contiguous in the height Ho dimension, the Kh-dimension loop is performed first, followed by the Kw-dimension loop; or
when the Ncu output blocks are contiguous in both the width Wo dimension and the height Ho dimension, either ordering of the Kw-dimension and Kh-dimension loops may be used.
Clause 7, the computing device of any of clauses 5-6, wherein in the inner loop, the Kw-dimension loop determines its number of iterations and the corresponding indices from the value of Sw.
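Clauses 5-7 together describe a loop nest that, for one output point, might look like the following sketch. Grouping the Kw indices by residue modulo Sw (so that the stride-Sw input reads line up across the middle loop) is one reading of clauses 5 and 7, and `get_input` / `kernel_segs` are assumed accessors rather than the patent's interfaces:

```python
import numpy as np

def compute_output_point(get_input, kernel_segs, kh: int, kw: int,
                         sw: int, ci_seg_num: int) -> float:
    """Accumulate partial sums for one output point in the clause-5 order:
    Ci segments outermost, the Sw phases in the middle, Kh/Kw innermost."""
    acc = 0.0
    for ci_seg in range(ci_seg_num):            # outer loop: Ci_seg_num segments
        for phase in range(sw):                 # middle loop: Sw phases
            for h in range(kh):                 # inner loops: Kh, then the
                for w in range(phase, kw, sw):  # Kw indices of this phase
                    # (clause 7: iteration count and indices depend on Sw)
                    acc += float(np.dot(get_input(ci_seg, h, w),
                                        kernel_segs[ci_seg][h][w]))
    return acc
```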
Clause 8, the computing device of any of clauses 5-7, wherein the convolution kernels are stored in blocks by output channel Co value for allocation to different slave processing circuits, each slave processing circuit being allocated the convolution kernels of Pco/Ns different Co values taken at intervals of Ns, where Ns is the number of schedulable slave processing circuits.
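The interval-Ns allocation of clause 8 amounts to a round-robin over Co values within one Pco segment, as in this sketch (variable names are ours):

```python
def co_values_for_slave(s: int, ns: int, pco: int) -> list:
    """Co values within one Pco segment handled by slave s:
    Pco/Ns values, taken at intervals of Ns."""
    return list(range(s, pco, ns))

# e.g. with ns=4 slaves and pco=16, slave 1 handles Co values [1, 5, 9, 13]
```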
Clause 9, the computing device of clause 8, wherein in the inner loop, each of the slave processing circuits is further configured to:
reuse each input feature line Pco/Ns times to compute Ws output points at the same width Wo positions on each of the Pco/Ns output channels Co in the output feature map.
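Clause 9's input-line reuse can be sketched as one element-wise multiply-accumulate per assigned Co value; the dict of per-Co kernel blocks is an assumed interface for illustration:

```python
import numpy as np

def multiplex_line(input_line: np.ndarray, kernel_blocks: dict, ws: int) -> dict:
    """Reuse one input feature line Pco/Ns times: for each assigned Co value,
    multiply against that Co's expansion weight line and reduce to Ws
    partial sums, one per W position."""
    pci = input_line.size // ws
    return {co: (input_line * np.tile(blk, ws)).reshape(ws, pci).sum(axis=1)
            for co, blk in kernel_blocks.items()}
```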
Clause 10, the computing device of any of clauses 2-9, wherein each of the slave processing circuits further comprises a third buffer circuit, and each slave processing circuit is further configured to:
write the operation results of its arithmetic circuits into the third buffer circuit in turn, in order of the width dimension Wo and then the height dimension Ho; and
the computing device is further configured to read the operation results from the third buffer circuits of the slave processing circuits in order of the output channel Co values within a single Pco segment, so that each Pco segment follows the HoWoCo dimension storage order.
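The read-out order of clause 10 reduces to three nested loops: within each Pco segment, Co varies fastest, then Wo, then Ho. In this sketch, `results[(h, w, co)]` is an assumed lookup standing in for the third-buffer reads:

```python
def read_out_howoco(results: dict, ho: int, wo: int, pco: int) -> list:
    """Gather one Pco segment's results in HoWoCo storage order."""
    return [results[(h, w, co)]
            for h in range(ho)          # Ho slowest
            for w in range(wo)          # then Wo
            for co in range(pco)]       # Co fastest within the Pco segment
```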
Clause 11, a chip comprising the computing device according to any of clauses 1-10.
Clause 12, a board card comprising the chip of clause 11.
Clause 13, a method of performing convolution operations using the computing device of any of clauses 1-10.
The foregoing has described embodiments of the present disclosure in detail, and specific examples have been employed herein to illustrate the principles and implementations of the present disclosure; the above examples are provided solely to assist in understanding the methods of the present disclosure and their core ideas. Meanwhile, a person of ordinary skill in the art may, in light of the ideas of the present disclosure, make changes to the specific implementations and the scope of application; in view of the foregoing, the contents of this specification should not be construed as limiting the present disclosure.

Claims (13)

1. A computing device configured to perform convolution operations, the computing device comprising:
an input storage circuit for storing an input feature map and convolution kernels, wherein the input feature map is stored in segments according to a splitting granularity Pci of the input channel Ci dimension;
a plurality of slave processing circuits for performing convolution operations on the broadcast input feature map and the corresponding convolution kernels allocated to each slave processing circuit, respectively; and
an output storage circuit for storing the output feature maps computed and output by the plurality of slave processing circuits, wherein the output feature maps are stored in segments according to a splitting granularity Pco of the output channel Co dimension.
2. The computing device of claim 1, wherein each slave processing circuit comprises a first buffer circuit, a second buffer circuit, and a plurality of arithmetic circuits, wherein:
the first buffer circuit is for buffering a plurality of input feature lines on which the convolution operation is to be performed, wherein one input feature line comprises an amount of data Pci×Ws=M from the input feature map, Ws is an expansion multiple in the width W dimension, and M is the amount of data the hardware processes in a single pass;
the second buffer circuit is for buffering weight data on which the convolution operation is to be performed; and
each arithmetic circuit is for performing, at each computation time, an element-wise multiply-accumulate operation on an input feature line selected from the first buffer circuit and an expansion weight line selected or generated from the second buffer circuit, wherein one expansion weight line is formed by copy-expanding one column of data blocks of a convolution kernel, split or aligned to Pci in the Ci dimension, into Ws columns.
3. The computing device of claim 2, wherein:
when the convolution stride Sw of the convolution operation in the W dimension is 1, one input feature line is composed of Ws data read contiguously in the W dimension, each datum being Pci deep in the Ci dimension; and/or
when the convolution stride Sw of the convolution operation in the W dimension is not 1, one input feature line is composed of Ws data read at intervals of (Sw-1) data in the W dimension, each datum being Pci deep in the Ci dimension.
4. The computing device of any of claims 2-3, wherein for a single output channel Co in the output feature map, each of the slave processing circuits is further configured to:
within each computation cycle, compute in parallel, using its Ncu schedulable arithmetic circuits, Ncu output blocks on the output feature map, each output block comprising Ws output points contiguous in the width Wo dimension, the Ncu output blocks being contiguous in the width Wo dimension and/or the height Ho dimension.
5. The computing device of claim 4, wherein for a single output point on a single output channel Co in the output feature map, the arithmetic circuit computes the value of the output point in a multi-layer loop in the following order, wherein:
the Kw dimension and the Kh dimension of the convolution kernel serve as the inner loops for computing partial sums of the output point, where Kw is the width dimension of the convolution kernel and Kh is its height dimension;
the convolution stride Sw of the convolution operation in the W dimension serves as the middle loop for computing partial sums of the output point;
the number Ci_seg_num of segments into which the Ci dimension of the convolution kernel is split according to Pci serves as the outer loop for computing partial sums of the output point; and
the partial sums are accumulated to obtain the value of the output point.
6. The computing device of claim 5, wherein in the inner loop,
when the Ncu output blocks are contiguous in the width Wo dimension, the Kw-dimension loop is performed first, followed by the Kh-dimension loop; or
when the Ncu output blocks are contiguous in the height Ho dimension, the Kh-dimension loop is performed first, followed by the Kw-dimension loop; or
when the Ncu output blocks are contiguous in both the width Wo dimension and the height Ho dimension, either ordering of the Kw-dimension and Kh-dimension loops may be used.
7. The computing device of any of claims 5-6, wherein in the inner loop, the Kw-dimension loop determines its number of iterations and the corresponding indices from the value of Sw.
8. The computing device of any of claims 5-7, wherein the convolution kernels are stored in blocks by output channel Co value for allocation to different slave processing circuits, each slave processing circuit being allocated the convolution kernels of Pco/Ns different Co values taken at intervals of Ns, where Ns is the number of schedulable slave processing circuits.
9. The computing device of claim 8, wherein in the inner loop, each of the slave processing circuits is further configured to:
reuse each input feature line Pco/Ns times to compute Ws output points at the same width Wo positions on each of the Pco/Ns output channels Co in the output feature map.
10. The computing device of any of claims 2-9, wherein each of the slave processing circuits further comprises a third buffer circuit, and each slave processing circuit is further configured to:
write the operation results of its arithmetic circuits into the third buffer circuit in turn, in order of the width dimension Wo and then the height dimension Ho; and
the computing device is further configured to read the operation results from the third buffer circuits of the slave processing circuits in order of the output channel Co values within a single Pco segment, so that each Pco segment follows the HoWoCo dimension storage order.
11. A chip comprising a computing device according to any of claims 1-10.
12. A board card comprising the chip of claim 11.
13. A method of performing convolution operations using the computing device of any one of claims 1-10.
CN202210633573.2A 2022-06-06 2022-06-06 Computing device, method and related product for performing convolution operation Pending CN117252241A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210633573.2A CN117252241A (en) 2022-06-06 2022-06-06 Computing device, method and related product for performing convolution operation

Publications (1)

Publication Number Publication Date
CN117252241A 2023-12-19

Family

ID=89125141


Similar Documents

Publication Publication Date Title
CN108170640B (en) Neural network operation device and operation method using same
CN112633490B (en) Data processing device, method and related product for executing neural network model
CN113850380A (en) Data processing device, data processing method and related product
CN114154112A (en) Data processing device, chip and board card
CN113837922A (en) Computing device, data processing method and related product
CN114003198A (en) Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium
CN117252241A (en) Computing device, method and related product for performing convolution operation
CN113850379A (en) Data processing device, data processing method and related product
CN113850377A (en) Data processing device, data processing method and related product
CN116150556A (en) Computing device, method and related product for performing convolution operation
Qiu et al. An FPGA‐Based Convolutional Neural Network Coprocessor
CN115470176B (en) Computing device, method for implementing convolution operation by utilizing computing device and related product
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN113469333B (en) Artificial intelligence processor, method and related products for executing neural network model
CN116150555A (en) Computing device, method for implementing convolution operation by utilizing computing device and related product
CN113792867B (en) Arithmetic circuit, chip and board card
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
CN115878543A (en) Computing device, method for performing convolution operation by using computing device and related product
CN115878546A (en) Computing device, method for performing convolution operation by using computing device and related product
CN115878544A (en) Processing circuit, method for performing convolution operation by using processing circuit and related product
CN115878542A (en) Computing device, method for performing convolution operation by using computing device and related product
CN113791754A (en) Arithmetic circuit, chip and board card
CN115878541A (en) Computing device, method for performing convolution operation by using computing device and related product
CN115081602A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN115878545A (en) Computing device, method for performing convolution operation by using computing device and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination