WO2024149112A1

WO2024149112A1 - Compilation method for convolution operator, and related product

Info

Publication number: WO2024149112A1
Application number: PCT/CN2024/070133
Authority: WO
Inventors: 杨云召; 张鹏; 沈宇斌
Original assignee: 上海寒武纪信息科技有限公司
Priority date: 2023-01-13
Filing date: 2024-01-02
Publication date: 2024-07-18
Also published as: CN116090519A

Abstract

Disclosed in the present application is a compilation method for a convolution operator. The compilation method is realized by means of a processing apparatus. The processing apparatus may be comprised in a combined processing apparatus, the combined processing apparatus comprising an interface apparatus and a computing apparatus, wherein the computing apparatus interacts with the processing apparatus to jointly complete a computing operation, which is specified by a user. The combined processing apparatus comprises a storage apparatus, the storage apparatus being respectively connected to the computing apparatus and the processing apparatus, and being used for storing data of the computing apparatus and the processing apparatus.

Description

Compilation method of convolution operator and related products

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese patent application number 2023100734451, filed on January 13, 2023, entitled “Compilation method for convolution operator and related products”.

Technical Field

The present disclosure generally relates to the field of intelligent computing, and more particularly to the field of neural networks. More specifically, the present disclosure relates to a compilation method of a convolution operator implemented by a processing device, a processing device, a computer-readable storage medium, and a computer program product.

Background technique

Neural network is one of the key technologies in artificial intelligence and deep learning, among which convolutional neural network (CNN) is the most important type of network. A very important calculation in convolutional neural network is the convolution operation of the convolution layer (Conv layer). The most time-consuming operation in common neural network models is often the convolution operation. The function of the convolution layer is to extract features from the input data. Through multiple layers of convolution, complex features can be extracted to ensure that the network has sufficient expression and generalization capabilities. The neural network model contains a large number of convolution operations of various types. The computational performance of the convolution operation greatly affects the computational performance of the entire neural network model. Therefore, the acceleration optimization of the convolution operation is an important part of the optimization of the deep learning computational graph.

Deep learning accelerators usually have multiple computing components, which can accelerate neural network calculations through parallel operation. Common deep learning accelerators are often subject to the limitations of hardware design, such as computing alignment issues and on-chip memory space size limitations.

Therefore, there is an urgent need for a convolution operator optimization solution suitable for the hardware design of deep learning accelerators, so as to maximize the computational parallelism and improve the computational efficiency while meeting the hardware limitations.

Summary of the invention

In order to at least solve one or more of the technical problems mentioned above, the present disclosure proposes a compilation scheme for the roll operator in multiple aspects, which, while adapting to the hardware requirements of the computing device, fully utilizes the parallel computing power of the computing components in the computing device as much as possible.

In a first aspect, the present disclosure provides a compilation method of a convolution operator implemented by a processing device, comprising: obtaining a file to be compiled containing a convolution operator; in response to the scale of the weight of the convolution operator exceeding the single-round computing amount of the computing device to execute the convolution operator, performing a first-level splitting on the weight to generate multiple first-level convolution sub-operators, wherein the first-level splitting splits the weight into multiple first-level sub-weights according to a first channel dimension, and each first-level sub-weight corresponds to a first-level convolution sub-operator; generating a first merging operator, the first merging operator is used to merge the computing results of the multiple first-level convolution sub-operators to obtain the final result of the convolution operator; and compiling and optimizing the file to be compiled based on the multiple first-level convolution sub-operators and the first merging operator to obtain a corresponding binary instruction sequence to be allocated to the computing device to execute the task corresponding to the convolution operator.

In a second aspect, the present disclosure provides a processing device for performing compilation on a computational graph containing a convolution operator, comprising: a processor configured to execute program instructions; and a memory configured to store the program instructions, so that when the program instructions are loaded and executed by the processor, the processor executes the compilation method according to the first aspect of the present disclosure.

In a third aspect, the present disclosure provides a computer-readable storage medium having program instructions stored therein, which, when loaded and executed by a processor, causes the processor to execute the compilation method according to the first aspect of the present disclosure.

In a fourth aspect, the present disclosure provides a computer program product, comprising a computer program or instructions, which, when executed by a processor, implements the compilation method according to the first aspect of the present disclosure.

Through the compilation scheme of the convolution operator provided above, the disclosed embodiment provides a suitable splitting scheme for the implementation of a larger-scale convolution operator to adapt to the hardware computing power conditions. Furthermore, by setting a suitable splitting granularity, it is possible to adapt to the hardware design requirements of the computing device, make full use of the computing power of the parallel computing components, and improve the overall computing performance.

BRIEF DESCRIPTION OF THE DRAWINGS

By reading the detailed description below with reference to the accompanying drawings, the above and other objects, features and advantages of the exemplary embodiments of the present disclosure will become readily understood. In the accompanying drawings, several embodiments of the present disclosure are shown in an exemplary and non-limiting manner, and the same or corresponding reference numerals represent the same or corresponding parts, wherein:

FIG1 shows a structural diagram of a board according to an embodiment of the present disclosure;

FIG2 shows a structural diagram of a combined processing device according to an embodiment of the present disclosure;

FIG3 is a schematic diagram showing the internal structure of a single processor core of a single-core computing device according to an embodiment of the present disclosure;

FIG4 is a simplified schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure;

FIG5 shows an example of an exemplary conventional 3D convolution operation principle to which the disclosed embodiments can be applied;

FIG6 shows an exemplary flow chart of a method for compiling a convolution operator implemented by a processing device according to an embodiment of the present disclosure;

FIG7 shows a schematic diagram of a convolution operator splitting scheme according to some embodiments of the present disclosure;

FIG8 shows a schematic diagram of a convolution operator splitting scheme according to other embodiments of the present disclosure.

Detailed ways

The following will be combined with the drawings in the embodiments of the present disclosure to clearly and completely describe the technical solutions in the embodiments of the present disclosure. Obviously, the described embodiments are part of the embodiments of the present disclosure, not all of the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative work are within the scope of protection of the present disclosure.

It should be understood that the terms "first", "second", "third" and "fourth" etc. that may appear in the claims, specifications and drawings of the present disclosure are used to distinguish different objects rather than to describe a specific order. The terms "include" and "comprise" used in the specifications and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or their collections.

It should also be understood that the terms used in this disclosure are only for the purpose of describing specific embodiments and are not intended to limit the disclosure. As used in this disclosure and claims, the singular forms of "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should also be further understood that the term "and/or" used in this disclosure and claims refers to any combination of one or more of the associated listed items and all possible combinations, including these combinations.

As used in this specification and claims, the term "if" may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting," depending on the context. Similarly, the phrase "if it is determined" or "if [described condition or event] is detected" may be interpreted as meaning "upon determination" or "in response to determining" or "upon detection of [described condition or event]" or "in response to detecting [described condition or event]," depending on the context.

The specific implementation of the present disclosure is described in detail below with reference to the accompanying drawings.

Exemplary Hardware Environment

FIG1 shows a schematic diagram of the structure of a board 10 according to an embodiment of the present disclosure. As shown in FIG1 , the board 10 includes a chip 101, which is a system-on-chip (SoC), or system-on-chip, and integrates one or more combined processing devices. The combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet the intelligent processing requirements in complex scenarios in the fields of computer vision, speech, natural language processing, data mining, etc. In particular, deep learning technology is widely used in the field of cloud intelligence. A notable feature of cloud intelligence applications is that the input The amount of input data is large, and high requirements are placed on the storage capacity and computing power of the platform. The board 10 of this embodiment is suitable for cloud-based intelligent applications and has huge off-chip storage, on-chip storage and powerful computing power.

The chip 101 is connected to an external device 103 via an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a wifi interface. The data to be processed can be transmitted from the external device 103 to the chip 101 via the external interface device 102. The calculation result of the chip 101 can be transmitted back to the external device 103 via the external interface device 102. According to different application scenarios, the external interface device 102 can have different interface forms, such as a PCIe interface.

The board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to the control device 106 and the chip 101 through a bus and transmits data. The control device 106 in the board 10 is configured to control the state of the chip 101. To this end, in an application scenario, the control device 106 may include a microcontroller (MCU).

Fig. 2 is a block diagram showing a combined processing device in the chip 101 of this embodiment. As shown in Fig. 2, the combined processing device 20 includes a calculation device 201, an interface device 202, a processing device 203 and a storage device 204.

The computing device 201 is configured to execute user-specified operations, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning calculations. It can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.

The interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 can obtain input data from the processing device 203 via the interface device 202 and write it into the storage device on the computing device 201 chip. Further, the computing device 201 can obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the computing device 201 chip. Alternatively or optionally, the interface device 202 can also read data in the storage device of the computing device 201 and transmit it to the processing device 203.

The processing device 203, as a general processing device, performs basic controls including but not limited to data handling, starting and/or stopping the computing device 201, etc. According to different implementations, the processing device 203 can be a central processing unit (CPU), a graphics processing unit (GPU), or one or more types of processors in other general and/or special processors, which include but are not limited to digital signal processors (DSP), application specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs. As mentioned above, only with respect to the computing device 201 disclosed in the present invention, it can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are integrated and considered together, the two are regarded as forming a heterogeneous multi-core structure.

The storage device 204 is used to store data to be processed, which may be DRAM or DDR memory, and is usually 16G or larger in size, and is used to store data of the computing device 201 and/or the processing device 203 .

When the computing device 201 runs the neural network, it is generally necessary to first compile the neural network using the processing device 203 to obtain an executable file, which contains device information, that is, which device in the heterogeneous computer system the executable file needs to be executed on. After the executable file is assembled and linked, an executable program of the neural network can be obtained, and the executable program is stored in the storage device 204.

The processing device 203 can read the executable program from the storage location of the executable program and obtain multiple tasks of the program according to the executable program. These tasks are distributed to the computing device 201 for execution via the interface device 202, and finally obtain the operation result.

Fig. 3 shows a schematic diagram of the internal structure of a processing core when the computing device 201 in Fig. 2 is a single-core device. The computing device 301 is used to process input data such as computer vision, speech, natural language, and data mining. The computing device 301 includes three modules: a control module 31 (also called a controller), an operation module 32 (also called an operator), and a storage module 33 (also called a memory).

The control module 31 is used to coordinate and control the operation of the operation module 32 and the storage module 33 to complete the deep learning task. It includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IFU). The instruction fetch unit 311 is used to fetch instructions from the processing device 203, and the instruction decoding unit 312 decodes the fetched instructions and sends the decoding results as control information to the operation module 32 and the storage module 33.

The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.

The storage module 33 is used to store or transfer relevant data, including a neuron RAM (NRAM) 331, a weight RAM (WRAM) 332, and a direct memory access module (DMA) 333. NRAM 331 is used to store input neurons, output neurons, and intermediate results after calculation; WRAM 332 is used to store the convolution kernel of the deep learning network, that is, the weight; DMA 333 is connected to DRAM 204 through bus 34, and is responsible for data transfer between the computing device 301 and DRAM 204. It should be noted that the NRAM and WRAM here can be two storage areas formed by dividing the same memory in the logical storage space, or they can be two independent memories, which are not specifically limited here.

FIG4 shows a simplified schematic diagram of the internal structure of the computing device 201 in FIG2 when it is multi-core. A multi-core computing device can be abstracted using a hierarchical hardware model. As shown in the figure, the multi-core computing device 400 is a system on chip, which includes at least one computing cluster, and each computing cluster includes multiple processor cores. In other words, the multi-core computing device 400 is composed of a hierarchy of system on chip-computing cluster-processor core.

From the perspective of the system on chip (SoC) level, as shown in the figure, the multi-core computing device 400 includes an external storage controller 41 , a peripheral communication module 42 , an on-chip interconnect module 43 , a global synchronization module 44 , and multiple computing clusters 45 .

There may be multiple external storage controllers 41, two of which are shown in the figure as an example, which are used to respond to access requests issued by the processor core and access external storage devices (such as DRAM 204 in Figure 2), thereby reading data from outside the chip or writing data. The peripheral communication module 42 is used to receive control signals from the processing device (203 in Figure 2) through the interface device (202 in Figure 2) to start the computing device (201 in Figure 2) to perform tasks. The on-chip interconnect module 43 connects the external storage controller 41, the peripheral communication module 42 and multiple computing clusters 45 to transmit data and control signals between each module. The global synchronization module 44 is, for example, a global synchronization barrier controller (GBC), which is used to coordinate the work progress of each computing cluster and ensure information synchronization. The plurality of computing clusters 45 are the computing cores of the multi-core computing device 400. In the figure, four computing clusters are shown on each die by way of example. With the development of hardware, the multi-core computing device 400 disclosed herein may also include 8, 16, 64, or even more computing clusters 45. The computing clusters 45 are used to efficiently execute deep learning algorithms.

From the perspective of the computing cluster level, as shown in the figure, each computing cluster 45 includes multiple processor cores 406 as control and computing units, and a shared storage core 407 as a storage unit. Furthermore, each computing cluster may also include a local synchronization module 412 to coordinate the work progress of each processor core in the computing cluster and ensure information synchronization. The processor cores 406 are shown in the figure as an example, and the present disclosure does not limit the number of processor cores 406.

The storage core 407 is mainly used for storage and communication, that is, to store shared data or intermediate results between the processor cores 406, and to perform communication between the computing cluster 45 and the DRAM 204, between the computing clusters 45, and between the processor cores 406. In other embodiments, the storage core 407 has the ability of scalar operations and is used to perform scalar operations.

The storage core 407 includes a shared memory unit (SRAM) 408, a broadcast bus 409, a cluster direct memory access module (cluster direct memory access, CDMA) 410, and a global direct memory access module (global direct memory access, GDMA) 411. The SRAM 408 plays the role of a high-performance data transfer station. The data reused between different processor cores 406 in the same computing cluster 45 does not need to be obtained from the DRAM 204 by each processor core 406, but is transferred between the processor cores 406 through the SRAM 408. The storage core 407 only needs to quickly distribute the reused data from the SMEM 408 to multiple processor cores 406 to improve the efficiency of inter-core communication and greatly reduce on-chip and off-chip input/output access. The broadcast bus 409, CDMA 410, and GDMA 411 are used to perform communication between processor cores 406, communication between computing clusters 45, and data transmission between computing clusters 45 and DRAM 204, respectively.

From the perspective of the processor core level, the structure of a single processor core may be similar to the structure diagram of a single-core computing device shown in FIG3 , and will not be described in detail here.

Convolution operation principle

The convolution layer in the neural network model can perform convolution operations to extract features by applying convolution kernels (also called filters, weights, etc.) to the input feature map (also called input data, neurons, or input neurons).

The neural network model may include various convolution operation layers, such as a convolution layer that performs forward, conventional 3D convolution operations, and a deconvolution layer that performs depthwise convolution operations. In reverse training, it may be necessary to perform reverse depthwise convolution operations or cross-product convolution operations. The disclosed embodiments are mainly optimized for conventional 3D convolution operations, and can also be applied to other types of convolution operations without conflict.

FIG. 5 shows an example of an exemplary conventional 3D convolution operation principle to which the disclosed embodiments can be applied.

The figure exemplarily shows a four-dimensional input data X of size [N Hi Wi Ci], which can be represented as N Hi×Wi×Ci 3D rectangles 510. The figure also exemplarily shows a four-dimensional convolution kernel K of size [Co Kh Kw Ci], which can be represented as Co Kh×Kw×Ci 3D convolution kernels 520. The convolution result of the input data X and the convolution kernel K obtains the output data Y, which is four-dimensional data of size [N Ho Wo Co], which can be represented as N Ho×Wo×Co 3D rectangles 530.

The figure also specifically shows an example of a convolution operation, in which the input data is an input feature map 540 of size 6×6×3, omitting the N dimension; the convolution kernel is a stereo convolution kernel 550 of size 3×3×3, targeting a single Co; and the output data is a 4×4 output feature map 560. The specific operation process is as follows:

The convolution kernel 550 scans the input feature map 540a at a certain step length, performs matrix element multiplication and summation on the input features in the convolution window 570, and superimposes the deviation. That is, the value at each position in the output feature map 560 is obtained by performing a two-dimensional convolution operation on the corresponding block of each input feature map and the corresponding convolution kernel, and then adding them together. For example, the figure shows that the value at the (0,0) position on the output feature map 560 (that is, the convolution output point) is obtained by performing a two-dimensional convolution operation on the convolution window 570 framed by the black cube in the input feature map and the three-dimensional convolution kernel 550 to obtain three values, and then adding them together to obtain the final value.

In order to obtain outputs at other positions, the position of the convolution kernel 550 can be moved on the input feature map 540, that is, the convolution window of the convolution output point can be moved. In the example in the figure, the convolution step size (Sw, Sh) is (1,1). When the convolution operation is performed after moving one grid to the right horizontally (width direction) or downward vertically (height direction), the value of the position (0,1) or (1,0) on the output feature map 560a can be obtained respectively.

From the above description, we can know that in a convolution layer of a neural network, there are N groups of input feature maps, each group contains Hi×Wi×Ci information, where Hi and Wi are the height and width of the input feature map, respectively, and Ci is the number of input feature maps, also known as the number of input channels. The convolution layer has Ci×Co convolution kernels of size Kh×Kw, where Ci is the number of input channels, Co is the number of output feature maps (or the number of output channels), and Kh and Kw are the height and width of the convolution kernel, respectively. The output feature map contains Ho×Wo×Co information, where Ho and Wo are the height and width of the output feature map, respectively, and Co is the number of output channels. In addition, in the convolution operation, the convolution step size (Sw, Sh) is also involved, and the size of the convolution step size will affect the size of the output feature map.

In this article, input feature map, input data, neuron or input neuron are used interchangeably; convolution kernel, filter or weight are used interchangeably; output feature map, output data or output neuron are used interchangeably.

Exemplary Convolution Operator Coding Scheme

In intelligent computing systems, programming frameworks are used to encapsulate common operations in neural network model algorithms into operators for programmers to call directly, such as convolution and pooling. TensorFlow, PyTorch, etc. are currently popular deep learning frameworks. In these programming frameworks, computational graphs are usually used to describe the computational process of machine learning algorithms, tensors are used to represent all data in the computational graphs, and operators are used to represent various operations.

Regarding the terms "node" and "operator (OP)" mentioned in this disclosure, it should be noted that the term "operator" is from the perspective of computer computing (or from the perspective of software or algorithm); while the term "node" is a more figurative term (from the perspective of graphics or a more intuitive level). In terms of what they refer to, the terms "operator" and "node" actually refer to the same thing. That is, in this disclosure, it can be considered that the terms "operator" and "node" are the same. "Node" has the same meaning and can be used interchangeably, but is described from different aspects.

As mentioned above, common deep learning accelerators are often subject to computational alignment issues and on-chip memory space limitations due to hardware design limitations. In the disclosed embodiments, during compilation, for larger convolution operators, a solution for splitting the convolution operator is proposed so that the final operation result can be obtained through multiple rounds of operations.

FIG6 shows an exemplary flow chart of a method for compiling a convolution operator implemented by a processing device according to an embodiment of the present disclosure. The processing device may be, for example, the processing device 203 of FIG2 .

As shown in the figure, in step 610, a file to be compiled containing a convolution operator is obtained. In a programming framework, a computational graph is usually used to describe the computational process of a machine learning algorithm, and operators are used to represent various operations. The convolution operator is included in the computational graph as a computational node for compilation.

Next, in step 620, in response to the scale of the weight of the convolution operator exceeding the single-round computing capacity of the computing device to execute the convolution operator, the weight of the convolution operator is split into a first level to generate multiple first-level convolution sub-operators.

Depending on the hardware configuration of the computing device, its single-round computing volume is also different. Specifically, the single-round computing volume of the computing device can be determined based on one or more of the following factors: the number of parallel computing components in the computing device; the size of the on-chip storage space of the computing device; and the single-round computing volume of the parallel computing components. For example, under the hardware configuration shown in FIG3, the computing device is a single-core device, and the number of parallel computing components is 1; the on-chip storage space of the computing device is determined according to the capacity of the storage module 33; the single-round computing volume of the parallel computing components is determined according to the computing power of the computing module 32. For another example, under the hardware configuration shown in FIG4, the computing device is a multi-core device, and the number of parallel computing components depends on the number of computing clusters and the number of processing cores in each computing cluster. The example in FIG4 is 4×4=16 parallel computing components (processing cores); the on-chip storage space of the computing device can be determined by comprehensively considering the capacity of the shared storage core SRAM and the storage module inside the processor core; the single-round computing volume of the parallel computing components is determined according to the computing power of a single processing core.

In the disclosed embodiment, when the weight scale of the convolution operator to be compiled exceeds the single-round computational amount of the computing device to execute the convolution operator, the convolution operator can be directly split into multiple convolution sub-operators during compilation, thereby facilitating subsequent optimization of the computation graph. Specifically, the weight of the convolution operator can be split into a first-level sub-operator to generate multiple first-level convolution sub-operators.

In some embodiments, the first-level splitting splits the weight into multiple first-level sub-weights according to the first channel dimension, and each first-level sub-weight corresponds to a first-level convolution sub-operator. The first channel dimension can be the input channel Ci dimension or the output channel Co dimension described above for the convolution operation, which will be described in detail later.

Next, in step 630, a first merging operator is generated, and the first merging operator is used to merge the operation results of the multiple first-level convolution sub-operators split out previously to obtain the final result of the original convolution operator. Since the original convolution operator is split into multiple convolution sub-operators, in order to obtain the final result of the original convolution operator, it is necessary to merge the operation results of these split convolution sub-operators. The specific operation involved in the first merging operator depends on the specific method of the first-level splitting used by these first-level convolution sub-operators, that is, it is related to the first channel dimension of the split. This will be described later in conjunction with the specific splitting method.

Finally, in step 640, the compiled file is compiled and optimized based on the split multiple first-level convolution operators and the first merging operator to obtain a corresponding binary instruction sequence to be allocated to the computing device to execute the task corresponding to the original convolution operator.

The original large convolution operator in the computational graph to be compiled is now split into multiple first-level convolution sub-operators and a first merging operator. Therefore, these multiple first-level convolution sub-operators and the first merging operator can be used to replace the original convolution operator, thereby updating the computational graph. Based on the updated computational graph, other compilation optimizations can be continued, including but not limited to various optimizations that do not involve underlying hardware information, such as pruning, constant folding, arithmetic simplification, layout optimization, etc.; and various optimizations related to hardware architecture, such as operator fusion, weight preloading, weight retention, etc. The binary instruction sequence generated after compilation can be assigned to the target computing device to execute the tasks corresponding to the computational graph, and the tasks corresponding to the original convolution operator are performed according to the split convolution sub-operators through multiple rounds of operations to obtain the final operation results.

Therefore, the above describes the compilation scheme of the convolution operator provided by the embodiment of the present disclosure, which is aimed at large-scale convolution operators. By splitting it into small sub-convolution operators, it can more flexibly adapt to the hardware configuration of the computing device, give full play to its computing power, and improve the overall computing efficiency.

As mentioned above, the first-level splitting splits the weight into multiple first-level sub-weights according to the first channel dimension, and each first-level sub-weight corresponds to a first-level convolution sub-operator. The first channel dimension can be the input channel Ci dimension or the output channel Co dimension, so there are two splitting schemes.

Fig. 7 shows a schematic diagram of a convolution operator splitting scheme according to some embodiments of the present disclosure. In these embodiments, the splitting is performed according to the output channel Co dimension.

From the convolution operation principle described above in conjunction with FIG5 , it can be seen that the operation results on the Co dimension do not need to be accumulated, so the operation results of each convolution sub-operator obtained by splitting according to the Co dimension can be directly cascaded to obtain the operation result of the original convolution operator. That is, when the first channel dimension is the output channel Co dimension, the corresponding first merging operator is a cascade operator, which is used to cascade the operation results of multiple first-level convolution sub-operators obtained by splitting according to the splitting order, thereby obtaining the final result of the original convolution operator.

In the example of FIG. 7 , a convolution operator with an input neuron size of 1×32×32×2048 (NHWC) and a weight size of 2048×3×3×2048 (CoKxKyCi) is taken as an example to illustrate the splitting process.

The left side of FIG. 7 shows the operation diagram of the original convolution operator: the scale of the input neuron 710 is 1×32×32×2048, the scale of the weight of the convolution operator 720 is 2048×3×3×2048, and the scale of the output neuron 730 is 1×32×32×2048.

The right side of FIG. 7 shows multiple convolution sub-operators 740 and a first merging operator 760 obtained after performing the Co dimension split. In this example, because the split is performed according to the Co dimension, the first merging operator 760 is a cascade operator. The input of each of these convolution sub-operators 740 is still the original input neuron 710, whose scale is 1×32×32×2048. The sub-weight scale of each convolution sub-operator 740 is the same, and is evenly split into 256×3×3×2048, thereby splitting into a total of 8 convolution sub-operators. The scale of the output result 750 of each convolution sub-operator 740 is also the same, all of which are 1×32×32×256. The output results 750 of these convolution sub-operators 740 are cascaded together through the first merging operator 760 to obtain the final output result 770, whose scale is 1×32×32×2048, which is consistent with the output neuron 730 of the original convolution operator.

From the above embodiment, it can be seen that according to the Co dimension splitting, no additional operations need to be introduced, and it is only necessary to cascade the operation results of each convolution sub-operator together. In some implementations, the above cascade operation can be integrated into the operation result storage operation, that is, when the operation results of each convolution sub-operator are stored back to the memory, the storage location is directly allocated in the cascade order, so that the cascade operation is hidden in the storage operation.

Fig. 8 shows a schematic diagram of a convolution operator splitting scheme according to some other embodiments of the present disclosure. In these embodiments, the splitting is performed according to the input channel Ci dimension.

From the convolution operation principle described above in conjunction with Figure 5, it can be seen that the operation results on the Ci dimension need to be accumulated, so the operation results of each convolution operator obtained by splitting according to the Ci dimension can only be obtained after the operation results of the original convolution operator are accumulated. That is, when the first channel dimension is the input channel Ci dimension, the corresponding first merging operator is an addition operator, which is used to perform positional accumulation of the operation results of multiple first-level convolution operators obtained by splitting, thereby obtaining the final result of the original convolution operator.

In the example of FIG. 8 , a convolution operator with an input neuron size of 1×32×32×2048 (NHWC) and a weight size of 2048×3×3×2048 (CoKxKyCi) is taken as an example to illustrate the splitting process.

The left side of FIG8 shows the operation diagram of the original convolution operator: the scale of the input neuron 810 is 1×32×32×2048, the scale of the weight of the convolution operator 820 is 2048×3×3×2048, and the scale of the output neuron 830 is 1×32×32×2048.

The right side of Figure 8 shows multiple convolution sub-operators 840 and a first merging operator 860 obtained after performing the Ci dimension split. In this example, because the split is performed according to the Ci dimension, the first merging operator 860 is an addition operator. Different from the Co dimension split, when the weights are split according to the Ci dimension, since the operation performed in the Ci dimension is a bitwise multiplication and accumulation, the input neurons also need to be split according to the Ci dimension accordingly. Therefore, in these embodiments, it is also necessary to additionally generate a splitting operator 880, which is used to perform the same splitting as the weights on the input neurons of the original convolution operator, and then provide the split input neurons to the corresponding convolution sub-operators 840. As shown in the figure, after the original input neuron 810 is split by the splitting operator 880, multiple There are sub-input neurons 890, which serve as input neurons of the corresponding convolution sub-operators 840, and the scale of each is 1×32×32×256.

The sub-weights of each convolution sub-operator 840 are of the same size, and are evenly split into 2048×3×3×256 according to the Ci dimension, thus being split into 8 convolution sub-operators in total. The output results 850 of each convolution sub-operator 840 are also of the same size, which is 1×32×32×2048. The output results 850 of these convolution sub-operators 840 are bitwise accumulated through the first merging operator 860 to obtain the final output result 870, which has a size of 1×32×32×2048, which is consistent with the output neuron 830 of the original convolution operator.

From the above embodiment, it can be seen that according to the Ci dimension splitting, an additional addition operation needs to be introduced to accumulate the operation results of each convolution operator in place, so as to obtain an operation result consistent with the original convolution operator. Since the additional addition operation is introduced outside the convolution operation, it may cause an out-of-bounds error, so the data needs to be truncated, which may cause a decrease in accuracy.

In some embodiments, the Ci or Co dimension can be selected for splitting according to the actual accuracy requirement. For example, when the accuracy is not sensitive or the accuracy requirement is not high, such as below the accuracy threshold, the Co dimension can be selected or the Ci dimension can be selected for splitting. When the accuracy requirement is high, such as above the accuracy threshold, the Co dimension is selected for splitting.

In the above splitting, the splitting granularity can be set according to the operation characteristics of the hardware. In some embodiments, when the hardware has an operation alignment requirement, the splitting granularity can match the operation alignment requirement of the computing device.

In order to make full use of the bandwidth and adapt to the throughput requirements of the operator array, some operators will require the input data to be aligned to a specified value, such as an alignment value M, so that the data can be processed at the granularity of the alignment value M. If the input data is not enough for the alignment value, the data will be padded by zero padding. Based on different hardware designs, M can have different values, for example, 64, 128, 256, etc., and the unit can be the number of bytes (Byte) or the number of data. Furthermore, the operation alignment requirements can have different alignment values depending on the dimension. For example, some hardware may require the Ci dimension to be aligned to 64 and/or the Co dimension to be aligned to 64.

It can be understood that when the amount of data meets the operation alignment requirements, the operation efficiency of the operator is the highest and the computing power can be fully utilized. Therefore, in some embodiments, the above-mentioned first-level splitting according to the first channel dimension can determine the splitting granularity according to the alignment requirements of the first channel dimension. For example, assuming that the hardware alignment value of the first channel dimension is 64, the splitting granularity can be aligned to 64, that is, it can be an integer multiple of 64, so that the split data blocks are more conducive to fully utilizing the computing power of the hardware computing components and improving the computing efficiency. More specifically, when splitting according to the Co dimension, the splitting granularity is aligned to the alignment value of the Co dimension; when splitting according to the Ci dimension, the splitting granularity is aligned to the alignment value of the Ci dimension.

Since the splitting must be aligned to the operation alignment requirements, the minimum splitting granularity corresponds to the alignment value of the corresponding channel. In some cases, when the splitting is performed according to the minimum splitting granularity, the size of the sub-weights that may be obtained still exceeds the single-round operation amount of the computing device. In this case, a second splitting can be performed.

Specifically, for a first-level convolution sub-operator that needs to be split into two levels, it can further include: in response to the scale of the first-level sub-weight of the first-level convolution sub-operator exceeding the single-round computing capacity of the computing device, performing a second-level split on the first-level sub-weight to generate multiple second-level convolution sub-operators, the second-level splitting splits the first-level sub-weight into multiple second-level sub-weights according to the second channel dimension, and each second-level sub-weight corresponds to a second-level convolution sub-operator; and generating a second merging operator, the second merging operator is used to merge the computing results of the multiple split second-level convolution sub-operators to obtain the computing result of the original first-level convolution sub-operator.

Since the first-level splitting has split the first channel dimension into the minimum acceptable granularity, the second channel dimension split by the second-level splitting is different from the first channel dimension.

In some implementations, if the first channel dimension is the output channel Co dimension, the second channel dimension is the input channel Ci dimension, and the second merging operator is an addition operator, which is used to perform positional accumulation of the operation results of multiple split secondary convolution sub-operators to obtain the operation results of the corresponding primary convolution sub-operators.

In some other implementations, if the first channel dimension is the input channel Ci dimension, the second channel dimension is the output channel Co dimension, and the second merging operator is a cascade operator, which is used to cascade the operation results of the multiple split secondary convolution sub-operators in the split order to obtain the operation results of the corresponding primary convolution sub-operators.

Similar to the first-level split, the split granularity of the second-level split can be determined according to the calculation alignment requirements of the computing device for the second channel dimension. That is, the split granularity of the second-level split can be an integer multiple of the calculation alignment value of the second channel dimension. As a result, the split data blocks are more conducive to fully utilizing the computing power of the hardware computing components and improving the computing efficiency.

The scale of convolution operators in different neural network models varies, and the scale of different convolution operators in the same neural network model may also be different, so there will be multiple ways to split. Moreover, even convolution operators of the same scale may have multiple splitting schemes. The performance of the splitting scheme can be evaluated based on multiple factors.

On the one hand, as mentioned above, the splitting granularity can be determined according to the hardware's operation alignment requirements. However, not all convolution operators can be split exactly according to multiples of the alignment value. When the split data blocks are not aligned, the data will be padded by zero padding, and then the operation will be performed based on the aligned data, thereby introducing invalid operations and reducing the operation efficiency. Therefore, the performance of the splitting scheme in terms of operation alignment can be evaluated based on the amount of zero padding. For example, the amount of zero padding, the amount of invalid operations, or the invalid operation ratio can be used to characterize the invalid operation index caused by the splitting scheme due to the operation alignment requirements.

On the other hand, in order to improve processing efficiency, a pipeline method of load (Load)-compute (Compute)-store (Store) is usually used for processing. In the pipeline method, the first data block can be loaded while calculating the second data block, thereby saving processing time. If the time of each stage in the pipeline matches, that is, the time spent in each stage is not much different, the pipeline can work smoothly, otherwise there may be a situation of mutual waiting between the stages. Therefore, in some embodiments, the splitting scheme can consider the loading time of the split sub-weights to match the time of calculating the convolution, so as to facilitate the parallel pipeline of weight IO and calculation. When the execution time of the sub-operator is relatively uniform, the loading time of the sub-weights can be effectively shielded in the parallel pipeline of IO and calculation, thereby avoiding the IO bottleneck and better playing the purpose of weight preloading.

During compilation, the computation time of the computational components can be determined based on the hardware configuration information of the target computing device, thereby estimating the preloading time of the weights and determining the size of the weight split. Therefore, in some embodiments, when splitting, the size of each sub-weight after the split can be made as balanced as possible, so that the processing time of the previous and next operations is equivalent. Furthermore, the loading time of the weights after the split can be made to match the convolution operation time as much as possible, so that the weight preloading feature can work better. From the perspective of pipeline processing, the loading time of the latter sub-weight matches the operation time of the previous convolution sub-operator, for example, the difference between the two is within a predetermined range.

Therefore, the performance of the split scheme in weight preloading can be evaluated based on the matching degree between the weight loading time and the convolution operation time. For example, the matching degree between the loading time of adjacent sub-weights and the operation time of the convolution sub-operator can be used to characterize the performance index of the split scheme in weight preloading.

In the compilation optimization of the computational graph, other optimization measures are also involved, including but not limited to operator fusion, IO/computation parallel pipeline mode, weight retention, weight preloading, etc. Usually, the effects of these optimization methods on the optimization of the computational graph are comprehensively considered to select the most appropriate optimization combination. Therefore, in addition to being used alone to evaluate the performance of the splitting scheme, the performance indicators for evaluating the splitting scheme can also be integrated into the evaluation of the overall compilation optimization performance to comprehensively consider the improvement brought by various optimization methods to the overall performance. In the overall compilation performance evaluation, the performance advantages and disadvantages of various optimization schemes can be compared by calculating the overall execution time of various optimization schemes, so as to select the best optimization scheme.

Therefore, the disclosed embodiment provides a compilation method for a convolution operator, which supports taking a suitable splitting scheme when the scale of the convolution operator exceeds the single-round computing amount of the target computing device, splitting the large convolution operator into multiple small convolution sub-operators, so as to obtain the final result through multiple rounds of computing. The sub-weights after splitting try to meet the computational alignment requirements of the computing components on the target computing device, so that the computing power of the computing components can be fully utilized. Furthermore, the weights of each convolution sub-operator after splitting are as balanced as possible, and the weight loading time is matched with the convolution operation time as much as possible, so that the pipeline efficiency is high, and the weight preloading feature can work better. In this way, the weight scale of the split convolution sub-operator is smaller and more balanced. This convolution sub-operator is more conducive to scheduling between parallel computing components, and it is easier to meet the constraints of computational alignment and on-chip space limitations, which is more conducive to accelerated optimization.

The present disclosure also provides a processing device, which can be used to compile a computational graph containing a convolution operator, including: a processor configured to execute program instructions; and a memory configured to store the program instructions, so that when the program instructions are loaded and executed by the processor, the processor executes the compilation method described in the embodiment of the present disclosure.

In the disclosed embodiment, a computer-readable storage medium is also provided, in which program instructions are stored. When the program instructions are loaded and executed by a processor, the processor executes the compilation method of the convolution operator described in the disclosed embodiment. In the disclosed embodiment, a computer program product is also provided, including a computer program or instruction. When the computer program or instruction is executed by a processor, the compilation method of the convolution operator described in the disclosed embodiment is implemented.

The present disclosure also provides a chip, which may include the aforementioned processing device. Furthermore, the present disclosure also provides a board, which may include the aforementioned chip.

According to different application scenarios, the electronic equipment or device disclosed herein may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, automatic driving terminals, transportation, household appliances, and/or medical equipment. The transportation includes airplanes, ships and/or vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; the medical equipment includes magnetic resonance imaging, ultrasound machines and/or electrocardiographs. The electronic equipment or device disclosed herein may also be applied to the Internet, IoT, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical fields. Further, the electronic equipment or device disclosed herein may also be used in cloud, edge, and terminal applications related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic devices or devices with high computing power according to the disclosed solution can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or edge devices (such as smart phones or cameras). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are compatible with each other, so that according to the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, so as to complete the unified management, scheduling and collaborative work of end-to-end or cloud-edge-to-end.

It should be noted that, for the purpose of simplicity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will appreciate that the scheme of the present disclosure is not limited by the order of the actions described. Therefore, based on the disclosure or teaching of the present disclosure, those skilled in the art will appreciate that some of the steps therein may be performed in other orders or simultaneously. Further, those skilled in the art will appreciate that the embodiments described in the present disclosure may be regarded as optional embodiments, i.e., the actions or modules involved therein are not necessarily necessary for the implementation of one or some of the schemes of the present disclosure. In addition, depending on the different schemes, the present disclosure also has different focuses on the description of some embodiments. In view of this, those skilled in the art will appreciate that the parts that are not described in detail in a certain embodiment of the present disclosure may also refer to the relevant descriptions of other embodiments.

In terms of specific implementation, based on the disclosure and teaching of this disclosure, those skilled in the art can understand that several embodiments disclosed in this disclosure can also be implemented by other methods not disclosed herein. For example, with respect to the various units in the electronic device or device embodiments described above, this article splits them on the basis of considering the logical functions, and there may be other splitting methods in actual implementation. For another example, multiple units or components can be combined or integrated into another system, or some features or functions in the units or components can be selectively disabled. In terms of the connection relationship between different units or components, the connection discussed in the above text in conjunction with the accompanying drawings can be a direct or indirect coupling between units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection using an interface, wherein the communication interface can support electrical, optical, acoustic, magnetic or other forms of signal transmission.

In the present disclosure, the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units. The aforementioned components or units may be located in the same location or distributed on multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the scheme described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.

In some other implementation scenarios, the above integrated units may also be implemented in the form of hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit. The physical implementation of the hardware structure of the circuit may include But it is not limited to physical devices, and physical devices may include but are not limited to devices such as transistors or memristors. In view of this, the various devices described herein (such as computing devices or other processing devices) can be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any appropriate storage medium (including magnetic storage media or magneto-optical storage media, etc.), which may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, and a RAM, etc.

The embodiments of the present disclosure are introduced in detail above. Specific examples are used in this article to illustrate the principles and implementation methods of the present disclosure. The description of the above embodiments is only used to help understand the method of the present disclosure and its core idea. At the same time, for those skilled in the art, according to the ideas of the present disclosure, there will be changes in the specific implementation methods and application scopes. In summary, the content of this specification should not be understood as a limitation on the present disclosure.

Claims

A method for compiling a convolution operator implemented by a processing device, comprising:

Get the file to be compiled containing the convolution operator;

In response to the scale of the weight of the convolution operator exceeding the single-round computation amount of the computing device to execute the convolution operator, the weight is split into a first level to generate a plurality of first-level convolution sub-operators, wherein the first-level splitting splits the weight into a plurality of first-level sub-weights according to a first channel dimension, and each first-level sub-weight corresponds to a first-level convolution sub-operator;

generating a first merging operator, wherein the first merging operator is used to merge the operation results of the plurality of first-level convolution sub-operators to obtain a final result of the convolution operator; and

The file to be compiled is compiled and optimized based on the multiple first-level convolution operators and the first merging operator to obtain a corresponding binary instruction sequence to be allocated to the computing device to execute tasks corresponding to the convolution operator.
The compiling method according to claim 1, wherein:

The first channel dimension is the output channel Co dimension; and

The first merging operator is a cascade operator, which is used to cascade the operation results of the multiple first-level convolution sub-operators in the splitting order to obtain the final result of the convolution operator.
The compiling method according to claim 1, wherein:

The first channel dimension is the input channel Ci dimension; and

The first merging operator is an addition operator, which is used to perform bitwise accumulation of the operation results of the multiple first-level convolution sub-operators to obtain the final result of the convolution operator.
The compiling method according to claim 3, further comprising:

A splitting operator is generated to perform the first-level splitting on the input neurons of the convolution operator to provide the first-level convolution sub-operator to the corresponding neuron.
According to any one of claims 1 to 4, the compiling method, wherein generating a first-level convolution operator further comprises:

In response to the scale of the first-level sub-weight of the first-level convolution sub-operator exceeding the single-round computing amount of the computing device, performing a second-level splitting on the first-level sub-weight to generate a plurality of second-level convolution sub-operators, wherein the second-level splitting splits the first-level sub-weight into a plurality of second-level sub-weights according to a second channel dimension, and each second-level sub-weight corresponds to a second-level convolution sub-operator; and

A second merging operator is generated, where the second merging operator is used to merge the operation results of the multiple secondary convolution sub-operators to obtain the operation result of the primary convolution sub-operator.
The compiling method according to claim 5, wherein:

If the first channel dimension is the output channel Co dimension, the second channel dimension is the input channel Ci dimension, and the second merging operator is an addition operator, which is used to perform bitwise accumulation of the operation results of the multiple secondary convolution sub-operators to obtain the operation results of the corresponding primary convolution sub-operators;

If the first channel dimension is the input channel Ci dimension, the second channel dimension is the output channel Co dimension, and the second merging operator is a cascade operator, which is used to cascade the operation results of the multiple secondary convolution sub-operators in the splitting order to obtain the operation results of the corresponding first-level convolution sub-operators.
The compiling method according to any one of claims 1 to 6, wherein the single-round computation amount of the computing device is determined based on one or more of the following factors:

The number of parallel computing components in the computing device;

The size of the on-chip storage space of the computing device; and

The single-round computing capacity of the parallel computing component.
The compilation method according to any one of claims 1 to 7, wherein the split granularity of the first-level split is determined according to the operation alignment requirement of the computing device on the first channel dimension.
According to the compilation method according to any one of claims 1-8, the split size of the multiple first-level sub-weights is determined according to the convolution operation time of the computing device, so that the loading time of the subsequent sub-weight matches the operation time of the previous convolution sub-operator.
The compilation method according to any one of claims 5-6, wherein the split granularity of the second-level split is determined according to the operation alignment requirement of the computing device for the second channel dimension.
According to the compilation method according to claim 5-6 or 10, the split size of the multiple secondary sub-weights is determined according to the convolution operation time of the computing device, so that the loading time of the latter sub-weight matches the operation time of the previous convolution sub-operator.
The compilation method according to any one of claims 5-6 or 10-11, further comprising: evaluating the performance of the primary split and the secondary split based on one or more of the following factors:

Invalid operation pointers due to operation alignment requirements; and

The matching degree between the loading time of adjacent sub-weights and the operation time of the convolution sub-operator.
A processing device for compiling a computation graph including a convolution operator, comprising:

a processor configured to execute program instructions; and

A memory configured to store the program instructions, and when the program instructions are loaded and executed by the processor, the processor executes the compilation method according to any one of claims 1-12.
A computer-readable storage medium stores program instructions, which, when loaded and executed by a processor, enable the processor to execute the compilation method according to any one of claims 1 to 12.
A computer program product comprises a computer program or instructions, wherein when the computer program or instructions are executed by a processor, the compiling method according to any one of claims 1 to 12 is implemented.