CN116090519A - Compiling method of convolution operator and related product - Google Patents

Compiling method of convolution operator and related product

Info

Publication number: CN116090519A
Application number: CN202310073445.1A
Authority: CN (China)
Prior art keywords: convolution, operator, sub-operators, computing device
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Other languages: Chinese (zh)
Inventor: Name withheld at the applicant's request
Current Assignee: Shanghai Cambricon Information Technology Co Ltd
Original Assignee: Shanghai Cambricon Information Technology Co Ltd
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority application: CN202310073445.1A, published as CN116090519A
Related application: PCT/CN2024/070133, published as WO2024149112A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Stored Programmes (AREA)

Abstract

The disclosure provides a compiling method for a convolution operator and related products. The compiling method may be implemented by a processing device. The processing device may be included in a combined processing device, which may further include an interface device and a computing device. The computing device interacts with the processing device to jointly complete a user-specified computing operation. The combined processing device may further include a storage device connected to the computing device and the processing device, respectively, for storing data of the computing device and the processing device. The disclosed scheme provides a compiling method for convolution operators that can optimize the implementation of large-scale convolutions, match the hardware characteristics of the target computing device, in particular its operation alignment characteristics, and improve overall operation performance.

Description

Compiling method of convolution operator and related product
Technical Field
The present disclosure relates generally to the field of intelligent computing, and more particularly to the field of neural networks. Specifically, the present disclosure relates to a method of compiling a convolution operator implemented with a processing device, a processing apparatus, a computer readable storage medium, and a computer program product.
Background
Neural networks are one of the key technologies of artificial intelligence and deep learning, and convolutional neural networks (Convolutional Neural Network, CNN) are among the most important network types. A key computation in convolutional neural networks is the convolution operation (Convolution Operation) of the convolution layer (Conv layer). The most time-consuming operations in common neural network models are also often convolution operations. The function of the convolution layer is to extract features from the input data; through multiple layers of convolution, complex features can be extracted, ensuring that the network has sufficient expressive and generalization capability. A neural network model contains a large number and variety of convolution operations, and the computational performance of these convolution operations greatly affects the computational performance of the whole model. Therefore, accelerating and optimizing convolution operations is an important part of optimizing deep learning computation graphs.
A deep learning accelerator usually has multiple operation components and accelerates neural network computation through parallelism among these components. Due to hardware design constraints, conventional deep learning accelerators tend to suffer from operation alignment problems and limits on the size of on-chip memory space.
Therefore, there is a need for a convolution operator optimization scheme suitable for the hardware design of a deep learning accelerator, so as to improve the computation parallelism and the computation efficiency as much as possible while meeting the hardware limitation.
Disclosure of Invention
In order to address at least one or more of the technical problems mentioned above, the present disclosure proposes, in various aspects, a compilation scheme for convolution operators that, while accommodating the hardware requirements of a computing device, exploits as fully as possible the parallel computing power of the operation components in the computing device.
In a first aspect, the present disclosure provides a method of compiling a convolution operator implemented with a processing device, comprising: acquiring a file to be compiled containing a convolution operator; in response to the scale of the weight of the convolution operator exceeding a single-round operation amount of a computing device that is to execute the convolution operator, performing first-level splitting on the weight to generate a plurality of first-level convolution sub-operators, wherein the first-level splitting splits the weight into a plurality of first-level sub-weights according to a first channel dimension, each corresponding to one of the first-level convolution sub-operators; generating a first merging operator, wherein the first merging operator is used for merging the operation results of the plurality of first-level convolution sub-operators so as to obtain a final result of the convolution operator; and compiling and optimizing the file to be compiled based on the plurality of first-level convolution sub-operators and the first merging operator to obtain a corresponding binary instruction sequence, to be distributed to the computing device to execute the task corresponding to the convolution operator.
In a second aspect, the present disclosure provides a processing apparatus for performing compilation of a computational graph containing convolution operators, comprising: a processor configured to execute program instructions; and a memory configured to store the program instructions, which when loaded and executed by the processor, cause the processor to perform a compiling method according to the first aspect of the disclosure.
In a third aspect, the present disclosure provides a computer readable storage medium having stored therein program instructions which, when loaded and executed by a processor, cause the processor to perform a compiling method according to the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides a computer program product comprising a computer program or instructions which, when executed by a processor, implements a compiling method according to the first aspect of the disclosure.
By the convolution operator compilation scheme provided above, embodiments of the present disclosure provide a suitable splitting scheme for implementing larger-scale convolution operators so as to accommodate the computing capability of the hardware. Further, by setting a proper splitting granularity, the hardware design requirements of the computing device can be met, the computing power of the parallel operation components can be fully utilized, and the overall computing performance can be improved.
Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the drawings, several embodiments of the present disclosure are illustrated by way of example and not by way of limitation, and like reference numerals refer to similar or corresponding parts, in which:
FIG. 1 illustrates a block diagram of a board of an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a combination processing device according to an embodiment of the present disclosure;
FIG. 3 illustrates an internal architecture diagram of a single processor core of a single core computing device of an embodiment of the present disclosure;
FIG. 4 illustrates a simplified schematic diagram of the internal structure of a multi-core computing device of an embodiment of the present disclosure;
FIG. 5 illustrates an example of an exemplary conventional 3D convolution operation principle to which embodiments of the present disclosure may be applied;
FIG. 6 illustrates an exemplary flow chart of a method of compiling convolution operators implemented by a processing device according to an embodiment of the present disclosure;
FIG. 7 illustrates a convolution operator splitting scheme schematic diagram in accordance with some embodiments of the present disclosure;
FIG. 8 illustrates a convolution operator splitting scheme schematic diagram according to further embodiments of the present disclosure.
Detailed Description
The following describes the technical solutions in the embodiments of the present disclosure clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," and the like, as may appear in the claims, specification and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting the [described condition or event]", or "in response to detecting the [described condition or event]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Exemplary hardware Environment
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. A combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in the cloud intelligence field; a notable characteristic of cloud intelligence applications is the large amount of input data and the high requirements on the storage and computing capabilities of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, having huge off-chip storage, on-chip storage, and strong computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface means 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combined processing means 20 comprises computing means 201, interface means 202, processing means 203 and storage means 204.
The computing device 201 is configured to perform user-specified operations, primarily implemented as a single-core smart processor or as a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively accomplish the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control, including but not limited to data handling and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors, including but not limited to a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU), a digital signal processor (digital signal processor, DSP), an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA) or other programmable logic device, discrete gate or transistor logic devices, discrete hardware components, or other general-purpose and/or special-purpose processors, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered on its own, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they are regarded as forming a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM, typically DDR memory with a size of 16 GB or more, for storing data of the computing device 201 and/or the processing device 203.
When the computing device 201 runs a neural network, the processing device 203 is generally required to compile the neural network to obtain an executable file, where the executable file includes device information, that is, on which device in the heterogeneous computer system the executable file needs to be executed. The executable file is assembled and linked to obtain an executable program of the neural network, and the executable program is stored in the storage device 204.
The processing device 203 may read an executable program from a storage location of the executable program and obtain a plurality of tasks of the program according to the executable program. These tasks are distributed via the interface means 202 to the computing means 201 for execution, ultimately obtaining the result of the operation.
Fig. 3 shows a schematic diagram of the internal structure of a processing core when the computing device 201 in fig. 2 is a single-core device. The computing device 301 is configured to process input data from fields such as computer vision, speech, natural language, and data mining, and includes three modules: a control module 31 (also referred to as a controller), an operation module 32 (also referred to as an operator), and a storage module 33 (also referred to as a memory).
The control module 31 is used for coordinating and controlling the operation of the operation module 32 and the storage module 33 to complete the task of deep learning, and comprises a fetch unit (instruction fetch unit, IFU) 311 and an instruction decode unit (instruction decode unit, IDU) 312. The instruction fetching unit 311 is configured to fetch an instruction from the processing device 203, and the instruction decoding unit 312 decodes the fetched instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 322 is responsible for the core computation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used for storing or transferring related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333. NRAM 331 is used to store input neurons, output neurons, and intermediate results of the computation; WRAM 332 is used to store the convolution kernels, i.e. the weights, of the deep learning network; DMA 333 is coupled to DRAM 204 via bus 34 and is responsible for data transfer between the computing device 301 and DRAM 204. It should be noted that the NRAM and WRAM here may be two memory areas formed by dividing the same memory in the logical memory space, or may be two independent memories, which is not specifically limited here.
Fig. 4 shows a simplified schematic diagram of the internal structure of the computing device 201 of fig. 2 when it is a multi-core device. The multi-core computing device may be abstracted using a hierarchical hardware model. As shown, the multi-core computing device 400 is a system-on-chip that includes at least one compute cluster (cluster), and each compute cluster in turn includes a plurality of processor cores; in other words, the multi-core computing device 400 is organized in a system-on-chip / compute cluster / processor core hierarchy.
At the system-on-chip level, as shown, the multi-core computing device 400 includes an external memory controller 41, a peripheral communication module 42, an on-chip interconnect module 43, a global synchronization module 44, and a plurality of computing clusters 45.
There may be a plurality of external memory controllers 41, 2 being shown by way of example, for accessing external memory devices (e.g., DRAM 204 in FIG. 2) to read data from or write data to off-chip in response to access requests issued by the processor cores. The peripheral communication module 42 is configured to receive a control signal from the processing device (203 of fig. 2) via the interface device (202 of fig. 2) and to initiate the computing device (201 of fig. 2) to perform a task. The on-chip interconnect module 43 connects the external memory controller 41, the peripheral communication module 42, and the plurality of computing clusters 45 for transmitting data and control signals between the respective modules. The global synchronization module 44 is, for example, a global synchronization barrier controller (global barrier controller, GBC) for coordinating the working progress of each computing cluster to ensure synchronization of information. The plurality of computing clusters 45 are the computing cores of the multi-core computing device 400, 4 on each die being illustratively shown, the multi-core computing device 400 of the present disclosure may also include 8, 16, 64, or even more computing clusters 45 as hardware evolves. The computing clusters 45 are used to efficiently execute the deep learning algorithm.
At the level of the compute clusters, each compute cluster 45 includes a plurality of processor cores 406 as control and compute units, and a shared memory core 407 as a memory unit, as shown. Further, each computing cluster may further include a local synchronization module 412, configured to coordinate the working progress of each processor core in the computing cluster, so as to ensure synchronization of information. The processor cores 406 are illustratively shown as 4, and the present disclosure does not limit the number of processor cores 406.
The storage cores 407 are mainly used for storing and communicating, i.e., storing shared data or intermediate results between the processor cores 406, and executing communication between the compute clusters 45 and the DRAM 204, communication between the compute clusters 45, communication between the processor cores 406, and the like. In other embodiments, the memory core 407 has scalar operation capabilities to perform scalar operations.
The memory core 407 includes a shared memory unit (SRAM) 408, a broadcast bus 409, a compute cluster direct memory access module (cluster direct memory access, CDMA) 410, and a global direct memory access module (global direct memory access, GDMA) 411. The SRAM 408 plays the role of a high-performance data transfer station: data multiplexed between different processor cores 406 in the same computing cluster 45 does not need to be fetched from the DRAM 204 by each processor core 406 separately; instead, it is transferred between the processor cores 406 through the SRAM 408, and the memory core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to the processor cores 406. This improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output access. The broadcast bus 409, CDMA 410, and GDMA 411 are used to perform communication between processor cores 406, communication between compute clusters 45, and data transfer between compute clusters 45 and the DRAM 204, respectively.
At the level of the processor cores, the structure of a single processor core may be similar to the block diagram of a single core computing device shown in FIG. 3 and will not be described in detail herein.
Principle of convolution operation
The convolution layers in the neural network model may perform convolution operations to perform feature extraction by applying convolution kernels (also known as filters, weights, etc.) to the input feature map (also known as input data, neurons, or input neurons).
Various convolution operation layers may be included in a neural network model, such as a convolution layer that performs a forward, conventional 3D convolution operation and a depthwise convolution layer that performs a depthwise (Depthwise) convolution operation. In reverse training, it may be necessary to perform a reverse depthwise convolution operation or a cross-product convolution operation. Embodiments of the present disclosure are primarily optimized for conventional 3D convolution operations, but may also be applied to other types of convolution operations where no conflict arises.
Fig. 5 illustrates an example of an exemplary conventional 3D convolution operation principle to which embodiments of the present disclosure may be applied.
The figure exemplarily shows four-dimensional input data X of size [N Hi Wi Ci], which can be represented as N three-dimensional rectangles 510 of size Hi×Wi×Ci. Also shown by way of example is a four-dimensional convolution kernel K of size [Co Kh Kw Ci], which can be represented as Co three-dimensional convolution kernels 520 of size Kh×Kw×Ci. Convolving the input data X with the convolution kernel K yields the output data Y, which is four-dimensional data of size [N Ho Wo Co] and can be represented as N three-dimensional rectangles 530 of size Ho×Wo×Co.
The figure also specifically shows an example of the convolution operation, in which the input data is an input feature map 540 of size 6×6×3, with the N dimension omitted; the convolution kernel is a three-dimensional convolution kernel 550 of size 3×3×3 for a single Co; and the output data is a 4×4 output feature map 560. The specific operation process is as follows:
the convolution kernel 550 sweeps over the input feature map 540 according to a certain step size; within each convolution window 570, the input features are multiplied element-wise by the kernel and summed, and the bias is then added. That is, the value at each position in the output feature map 560 is obtained by performing a two-dimensional convolution operation on the corresponding block of each input feature map and the corresponding convolution kernel and then summing the results. For example, the value at the (0, 0) position of the output feature map 560 (i.e., a convolution output point) is obtained by performing two-dimensional convolution operations between the convolution window 570 outlined by the black cube in the input feature map and the three-dimensional convolution kernel 550, yielding 3 values, which are then summed to give the final value.
To obtain outputs at other positions, the position of the convolution kernel 550, i.e., the convolution window of the convolution output point, is shifted over the input feature map 540. In the example in the figure, the convolution stride (Sw, Sh) is (1, 1); after shifting one cell to the right in the lateral direction (width direction) or downward in the longitudinal direction (height direction) and performing the convolution operation, the value at the (0, 1) or (1, 0) position of the output feature map 560 is obtained, respectively.
From the above description, in one convolution layer of a neural network there are N groups of input feature maps, each group containing Hi×Wi×Ci pieces of information, where Hi and Wi are the height and width of the input feature maps, respectively, and Ci is the number of input feature maps, also referred to as the number of input channels. The convolution layer has convolution kernels of size Co×Kh×Kw×Ci, where Ci is the number of input channels, Co is the number of output feature maps (or the number of output channels), and Kh and Kw are the height and width of the convolution kernel, respectively. The output feature maps contain Ho×Wo×Co pieces of information, where Ho and Wo are the height and width of an output feature map, respectively, and Co is the number of output channels. In addition, the convolution operation also involves a convolution stride (Sw, Sh), whose size affects the size of the output feature map.
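A minimal sketch of how these dimension relations work out in practice, using the dimension names above (Hi/Wi, Kh/Kw, Sh/Sw); the padding parameters are an addition for generality and are not taken from this disclosure:

```python
# Output spatial size of a conventional 3D convolution, without dilation.
def conv_output_size(hi, wi, kh, kw, sh=1, sw=1, pad_h=0, pad_w=0):
    """Return (Ho, Wo) for the given input size, kernel size, stride, and padding."""
    ho = (hi + 2 * pad_h - kh) // sh + 1
    wo = (wi + 2 * pad_w - kw) // sw + 1
    return ho, wo

# The worked example above: 6x6 input, 3x3 kernel, stride (1, 1), no padding -> 4x4 output.
print(conv_output_size(6, 6, 3, 3))  # (4, 4)
```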
Input feature map (Feature map), input data, neurons, and input neurons are used interchangeably herein; convolution kernel, filter, and weights are used interchangeably; output feature map, output data, and output neurons are used interchangeably.
Exemplary convolution operator compilation scheme
In intelligent computing systems, common operations in neural network model algorithms, such as convolution and pooling, are packaged into operators by the programming framework for direct invocation by programmers. TensorFlow, PyTorch, etc. are currently popular deep learning frameworks. In these programming frameworks, computational graphs are typically used to describe the computation process of machine learning algorithms, with tensors representing all the data in the computational graph and operators representing the various operations.
Regarding the terms "node" and "operator (OP)" mentioned in this disclosure, it should be noted that the term "operator" is used from the computational level of the computer (or from the software or algorithm level), while the term "node" is a more visual expression (from the graph level or a more intuitive level). The two terms refer to the same thing; that is, in the present disclosure, "operator" and "node" may be considered to have the same meaning and may be used interchangeably, merely being described from different perspectives.
As mentioned above, conventional deep learning accelerators tend to suffer from operation alignment problems and limits on on-chip memory space due to hardware design constraints. In the disclosed embodiments, during compilation, a scheme is proposed for splitting a convolution operator of larger scale, so that the final operation result is obtained through multiple rounds of operations.
FIG. 6 illustrates an exemplary flow chart of a method of compiling convolution operators implemented by a processing device according to an embodiment of the present disclosure. The processing means may be, for example, the processing means 203 of fig. 2.
As shown, in step 610, a file to be compiled containing convolution operators is obtained. In the programming framework, computational graphs are typically used to describe the computational process of machine learning algorithms, with operators representing various operations. The convolution operator is included in the computational graph as a computational node for compilation.
Next, in step 620, in response to the scale of the weight of the convolution operator exceeding the single-round operation amount of the computing device that is to execute the convolution operator, the weight of the convolution operator is subjected to first-level splitting to generate a plurality of first-level convolution sub-operators.
The single-round operation amount varies depending on the hardware configuration of the computing device. Specifically, the single-round operation amount of the computing device may be determined based on one or more of the following factors: the number of parallel operation components in the computing device; the size of the on-chip storage space of the computing device; and the single-round operation amount of each parallel operation component. For example, in the hardware configuration shown in fig. 3, the computing device is a single-core device, and the number of parallel operation components is 1; the on-chip storage space of the computing device is determined according to the capacity of the storage module 33; and the single-round operation amount of the parallel operation component is determined according to the computing power of the operation module 32. As another example, in the hardware configuration shown in fig. 4, the computing device is a multi-core device, and the number of parallel operation components depends on the number of compute clusters and the number of processor cores within each compute cluster; in the example of fig. 4 there are 4×4=16 parallel operation components (processor cores). The on-chip storage space of the computing device can be determined by jointly considering the capacity of the shared storage core SRAM and the capacity of the storage module inside each processor core; the single-round operation amount of a parallel operation component is determined according to the computing power of a single processor core.
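Purely as an illustration of the three factors listed above, the following sketch shows how a compiler might combine them into a single-round capacity estimate; the field names and the concrete numbers are hypothetical and not taken from this disclosure:

```python
from dataclasses import dataclass

@dataclass
class DeviceSpec:
    num_parallel_units: int      # e.g. compute clusters x processor cores per cluster
    on_chip_bytes: int           # usable on-chip storage per operation component
    unit_round_elems: int        # elements one component can process in one round

def single_round_capacity(spec: DeviceSpec, bytes_per_elem: int = 2) -> int:
    """Weight elements that fit in one round, bounded by both memory and compute."""
    by_memory = spec.on_chip_bytes // bytes_per_elem
    by_compute = spec.unit_round_elems
    return spec.num_parallel_units * min(by_memory, by_compute)

spec = DeviceSpec(num_parallel_units=16, on_chip_bytes=512 * 1024, unit_round_elems=256 * 1024)
print(single_round_capacity(spec))
```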
In the embodiments of the present disclosure, when the scale of the weight of the convolution operator to be compiled exceeds the single-round operation amount of the computing device that is to execute the convolution operator, the convolution operator can be split directly into a plurality of convolution sub-operators at compile time, which benefits the subsequent optimization of the computational graph. Specifically, the weight of the convolution operator may be subjected to first-level splitting to generate a plurality of first-level convolution sub-operators.
In some embodiments, the first level splitting splits the weights into a plurality of first level sub-weights according to the first channel dimension, each corresponding to a first level convolution sub-operator. The first channel dimension may be the input channel Ci dimension or the output channel Co dimension described above for the convolution operation, as will be described in detail later.
Next, in step 630, a first merging operator is generated, where the first merging operator is used to merge the operation results of the plurality of first-level convolution sub-operators split out before, so as to obtain the final result of the original convolution operator. Since the original convolution operator is split into a plurality of convolution sub-operators, in order to obtain the final result of the original convolution operator, the operation results of the split convolution sub-operators need to be combined. The specific operations involved in the first merge operator depend on the specific manner of primary splitting used by these primary convolution sub-operators, i.e. in relation to the first channel dimension of the splitting. This will be described later in connection with a specific splitting scheme.
Finally, in step 640, compiling and optimizing the file to be compiled based on the split multiple primary convolution sub-operators and the first merging operator to obtain a corresponding binary instruction sequence, so as to be distributed to the computing device to execute the task corresponding to the original convolution operator.
The original large convolution operator in the computational graph to be compiled has now been split into a plurality of first-level convolution sub-operators and a first merging operator, so the original convolution operator can be replaced by these sub-operators and the merging operator, thereby updating the computational graph. Other compilation optimizations may then continue based on the updated computational graph, including, for example but not limited to, various optimizations that do not involve underlying hardware information, such as pruning, constant folding, arithmetic simplification, and layout optimization, as well as various optimizations related to the hardware architecture, such as operator fusion, weight preloading, and weight residency. The binary instruction sequence generated after compilation can be distributed to the target computing device to execute the tasks corresponding to the computational graph; the task corresponding to the original convolution operator obtains its final operation result through multiple rounds of operations according to the split convolution sub-operators.
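A minimal compile-time sketch of this replacement step; all names and the descriptor format are hypothetical, not the disclosure's actual compiler API:

```python
# Derive the first-level sub-operator descriptors and the merge operator that together
# replace one oversized convolution node in the computational graph.
def split_conv_operator(name, weight_shape, num_splits, channel_dim="Co"):
    co, kh, kw, ci = weight_shape                     # CoKhKwCi layout used in this disclosure
    subs = []
    for i in range(num_splits):
        sub_shape = ((co // num_splits, kh, kw, ci) if channel_dim == "Co"
                     else (co, kh, kw, ci // num_splits))
        subs.append({"op": "conv", "name": f"{name}_sub{i}", "weight_shape": sub_shape})
    merge = {"op": "concat" if channel_dim == "Co" else "add",
             "name": f"{name}_merge", "inputs": [s["name"] for s in subs]}
    return subs, merge

subs, merge = split_conv_operator("conv0", (2048, 3, 3, 2048), num_splits=8, channel_dim="Co")
print(len(subs), subs[0]["weight_shape"], merge["op"])   # 8 (256, 3, 3, 2048) concat
```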
The above describes the compiling scheme for convolution operators provided by the embodiments of the present disclosure. For a large-scale convolution operator, splitting it into small convolution sub-operators allows the hardware configuration of the computing device to be adapted more flexibly, fully exploits the computing power of the computing device, and improves overall computing efficiency.
As mentioned previously, the first level splitting splits the weights into a plurality of first level sub-weights according to the first channel dimension, each corresponding to a first level convolution sub-operator. The first channel dimension may be either the input channel Ci dimension or the output channel Co dimension, so there are two split schemes.
FIG. 7 illustrates a convolution operator splitting scheme schematic diagram according to some embodiments of the present disclosure. In these embodiments, splitting is performed in accordance with the output channel Co dimension.
As can be seen from the convolution operation principle described in connection with fig. 5, the operation results in the Co dimension do not need to be accumulated, so that the operation results of the convolution sub-operators obtained by splitting according to the Co dimension can be directly cascaded to obtain the operation results of the original convolution operators. That is, when the first channel dimension is the output channel Co dimension, the corresponding first merging operator is a cascade operator, and the cascade operator is used for cascading the operation results of the plurality of first-level convolution sub-operators obtained by splitting according to the splitting sequence, so as to obtain the final result of the original convolution operator.
In the example of fig. 7, the splitting process is illustrated by taking as an example a convolution operator whose input neuron has a size of 1×32×32×2048 (NHWC) and whose weight has a size of 2048×3×3×2048 (CoKxKyCi).
The left hand side of fig. 7 shows an operational schematic of the original convolution operator: the scale of the input neurons 710 is 1×32×32×2048, the scale of the weights of the convolution operator 720 is 2048×3×3×2048, and the scale of the output neurons 730 is 1×32×32×2048.
The right side of fig. 7 shows the plurality of convolution sub-operators 740 and the first merging operator 760 obtained after performing the Co-dimension splitting. In this example, because the splitting is along the Co dimension, the first merging operator 760 is a concatenation operator. The input to each of these convolution sub-operators 740 is still the original input neuron 710, of size 1×32×32×2048. The sub-weights of the convolution sub-operators 740 are equally sized, evenly split into 256×3×3×2048, giving a total of 8 convolution sub-operators. The output 750 of each convolution sub-operator 740 has size 1×32×32×256. The output results 750 of these convolution sub-operators 740 are concatenated together by the first merging operator 760 to yield the final output result 770, of size 1×32×32×2048, consistent with the output neurons 730 of the original convolution operator.
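A numerical sanity check of this Co-dimension splitting, written with PyTorch purely as an illustration: the disclosure's NHWC/CoKhKwCi layouts are mapped onto PyTorch's NCHW/CoCiKhKw, the channel counts are scaled down to keep the example fast, and padding=1 is assumed so that the 3×3 convolution preserves the 32×32 spatial size as in the example above:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)          # stand-in for the 1x32x32x2048 input
w = torch.randn(64, 64, 3, 3)           # stand-in for the 2048x3x3x2048 weight
full = F.conv2d(x, w, padding=1)

parts = []
for w_sub in torch.chunk(w, 8, dim=0):  # split the weight along Co into 8 sub-weights
    parts.append(F.conv2d(x, w_sub, padding=1))   # each sub-operator sees the full input
merged = torch.cat(parts, dim=1)        # first merging operator: concatenate along Co

print(torch.allclose(full, merged, atol=1e-4))    # True: concatenation recovers the result
```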
From the above embodiment, according to the Co dimension splitting, no additional operation is required, and only the operation results of the convolution sub-operators are cascaded together. In some implementations, the cascade operation may be fused in an operation result copy-back operation, that is, when the operation result of each convolution sub-operator is copied back to the memory, the storage positions of the convolution sub-operators are directly allocated in a cascade order, so that the cascade operation is hidden in the copy-back operation.
FIG. 8 illustrates a convolution operator splitting scheme schematic diagram according to further embodiments of the present disclosure. In these embodiments, splitting is performed in terms of the input channel Ci dimension.
As can be seen from the convolution operation principle described in connection with fig. 5, the operation results in the Ci dimension need to be accumulated, so that the operation results of the original convolution operators can be obtained after the operation results of the convolution sub-operators obtained by splitting according to the Ci dimension are accumulated. That is, when the first channel dimension is the dimension of the input channel Ci, the corresponding first merging operator is an adding operator, and is used for performing aligned accumulation on the operation results of the multiple primary convolution sub-operators obtained by splitting, so as to obtain the final result of the original convolution operator.
In the example of fig. 8, the splitting process is illustrated by taking a convolution operator with one input neuron of size 1×32×32×2048 (NHWC) and with a weight of size 2048×3×3×2048 (CoKxKyCi) as an example.
The left hand side of fig. 8 shows an operational schematic of the original convolution operator: the scale of the input neurons 810 is 1×32×32×2048, the scale of the weights of the convolution operator 820 is 2048×3×3×2048, and the scale of the output neurons 830 is 1×32×32×2048.
The right side of fig. 8 shows a plurality of convolution sub-operators 840 and a first merge operator 860 resulting from performing the Ci dimension splitting. In this example, the first merge operator 860 is an addition operator because it splits in the Ci dimension. Unlike Co dimension splitting, when weights are split in the Ci dimension, since the operations performed by the Ci dimension are multiply-accumulate, the input neurons also need to be split in the Ci dimension accordingly. Thus, in these embodiments, it is also necessary to additionally generate a splitting operator 880 for performing the same splitting as the weight on the input neurons of the original convolution operator, and then provide the split input neurons to the corresponding convolution sub-operators 840. As shown, the original input neuron 810 is split into a plurality of sub-input neurons 890, each of which is 1×32×32×256 in scale, as input neurons of a corresponding convolution sub-operator 840, respectively, by the splitting operator 880.
The sub-weights of the convolution sub-operators 840 are equally sized, evenly split into 2048×3×3×256 along the Ci dimension, giving a total of 8 convolution sub-operators. The output 850 of each convolution sub-operator 840 has size 1×32×32×2048. The output results 850 of these convolution sub-operators 840 are accumulated element-wise by the first merging operator 860 to obtain the final output result 870, of size 1×32×32×2048, consistent with the output neurons 830 of the original convolution operator.
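A companion check for the Ci-dimension splitting, under the same PyTorch stand-in shapes and assumptions as the Co example above: both the input and the weight are split along Ci, and the merging operator accumulates the partial results element-wise:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)
w = torch.randn(64, 64, 3, 3)
full = F.conv2d(x, w, padding=1)

x_subs = torch.chunk(x, 8, dim=1)       # splitting operator: split the input along Ci
w_subs = torch.chunk(w, 8, dim=1)       # split the weight along Ci into 8 sub-weights
merged = sum(F.conv2d(xs, ws, padding=1) for xs, ws in zip(x_subs, w_subs))

print(torch.allclose(full, merged, rtol=1e-3, atol=1e-3))   # True: summation recovers the result
```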
As can be seen from the above embodiment, splitting along the Ci dimension requires an additional addition operation to accumulate the operation results of the convolution sub-operators so as to obtain a result consistent with the original convolution operator. Since the addition is introduced outside the convolution operation, numerical overflow may occur, so the data needs to be truncated, which may cause some degradation in accuracy.
In some embodiments, the Ci or Co dimension may be selected for splitting according to the actual accuracy requirement. For example, when the operation is insensitive to accuracy or the accuracy requirement is not high (for example, below an accuracy threshold), either the Co dimension or the Ci dimension may be selected for splitting. When the accuracy requirement is high (for example, above the accuracy threshold), the Co dimension is selected for splitting.
In the splitting, the splitting granularity may be set according to the operational characteristics of hardware. In some embodiments, when the hardware has an operational alignment requirement, the split granularity may match the operational alignment requirement of the computing device.
In order to fully utilize bandwidth, adapt to the throughput of the arithmetic unit array, and meet other demands, some arithmetic units may require the input data to be aligned to a specified value, for example an alignment value M, so that the data are processed at a granularity of M. If the input data are not aligned to this value, they are padded with zeros or otherwise aligned. M may take different values, e.g., 64, 128, 256, etc., and may be measured in bytes or in number of data elements, depending on the hardware design. Further, the operation alignment requirement may have different alignment values for different dimensions; for example, some hardware may require the Ci dimension to be aligned to 64 and/or the Co dimension to be aligned to 64.
It can be understood that when the data volume meets the operation alignment requirement, the operation component works most efficiently and the computing power can be fully utilized. Thus, in some embodiments, the above first-level splitting according to the first channel dimension may determine the splitting granularity based on the alignment requirement of the first channel dimension. For example, assuming the hardware's alignment value in the first channel dimension is 64, the splitting granularity may be aligned to 64, i.e. be an integer multiple of 64, so that the split data blocks are better able to saturate the computing power of the hardware operation components and improve computing efficiency. More specifically, when splitting along the Co dimension, the splitting granularity is aligned to the alignment value of the Co dimension; when splitting along the Ci dimension, the splitting granularity is aligned to the alignment value of the Ci dimension.
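A sketch of choosing such a splitting granularity; the threshold numbers are hypothetical, and the only assumption carried over from the text is that the granularity should stay an integer multiple of the alignment value while fitting in one round:

```python
def aligned_split(channel_size, max_channel_per_round, align=64):
    """Return per-sub-operator channel sizes, each (except possibly the last) a multiple of align."""
    # Largest alignment-multiple granularity that still fits in one round.
    granularity = max(align, (max_channel_per_round // align) * align)
    sizes = []
    remaining = channel_size
    while remaining > 0:
        step = min(granularity, remaining)
        sizes.append(step)
        remaining -= step
    return sizes

print(aligned_split(2048, max_channel_per_round=256))   # [256, 256, 256, 256, 256, 256, 256, 256]
```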
Since the split must be aligned to the operation alignment requirement, the minimum splitting granularity corresponds to the alignment value of the corresponding channel. In some cases, even when the split has been performed at the minimum splitting granularity, the scale of a sub-weight may still exceed the single-round operation amount of the computing device, at which point a second-level split may be performed.
Specifically, for a first-level convolution sub-operator that needs to undergo second-level splitting, the method may further include: in response to the scale of the first-level sub-weight of the first-level convolution sub-operator exceeding the single-round operation amount of the computing device, performing second-level splitting on the first-level sub-weight to generate a plurality of second-level convolution sub-operators, wherein the second-level splitting splits the first-level sub-weight into a plurality of second-level sub-weights according to a second channel dimension, each corresponding to one second-level convolution sub-operator; and generating a second merging operator, wherein the second merging operator is used for merging the operation results of the plurality of second-level convolution sub-operators so as to obtain the operation result of the original first-level convolution sub-operator.
The second channel dimension used by the second-level split is different from the first channel dimension, since the first channel dimension has already been split down to the acceptable minimum granularity by the first-level split.
In some implementations, if the first channel dimension is the output channel Co dimension, the second channel dimension is the input channel Ci dimension, and the second merging operator is an adding operator, configured to perform aligned accumulation on the operation results of the split multiple second-level convolution sub-operators, so as to obtain the operation result of the corresponding first-level convolution sub-operator.
In other implementations, if the first channel dimension is the input channel Ci dimension, the second channel dimension is the output channel Co dimension, and the second merging operator is a cascade operator, configured to cascade the operation results of the split multiple second-level convolution sub-operators according to the splitting order, so as to obtain the operation result of the corresponding first-level convolution sub-operator.
Similar to the first-level split, the splitting granularity of the second-level split may be determined from the operation alignment requirement of the computing device in the second channel dimension. That is, the splitting granularity of the second-level split may be an integer multiple of the operation alignment value of the second channel dimension, so that the split data blocks are better able to saturate the computing power of the hardware operation components and improve computing efficiency.
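An illustrative two-level splitting sketch along the lines just described (the single-round threshold is hypothetical): the weight is first split along the Co dimension down to its alignment value, and any first-level sub-weight that still exceeds the single-round operation amount is split again along the Ci dimension:

```python
def two_level_split(co, ci, kh, kw, max_elems_per_round, align=64):
    plans = []
    co_step = align                                   # minimum first-level granularity (Co split)
    for co_off in range(0, co, co_step):
        sub_co = min(co_step, co - co_off)
        if sub_co * kh * kw * ci <= max_elems_per_round:
            plans.append([(sub_co, kh, kw, ci)])      # no second-level split needed
        else:                                         # second-level split along Ci
            ci_step = max(align, (max_elems_per_round // (sub_co * kh * kw) // align) * align)
            plans.append([(sub_co, kh, kw, min(ci_step, ci - ci_off))
                          for ci_off in range(0, ci, ci_step)])
    return plans

plans = two_level_split(2048, 2048, 3, 3, max_elems_per_round=64 * 3 * 3 * 512)
print(len(plans), plans[0])   # 32 first-level groups, each split into 4 Ci pieces of 512
```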
The convolution operators of different neural network models have different scales, and the different convolution operators in the same neural network model may have different scales, so that multiple splitting modes exist. Moreover, multiple splitting schemes are possible even for convolution operators of the same scale. The performance of the split scheme may be evaluated based on a number of factors.
On the one hand, as previously described, the splitting granularity may be determined based on the operation alignment requirement of the hardware. However, not all convolution operators can be split exactly into multiples of the alignment value. When the split data blocks are not aligned, the data are zero-padded to the alignment value and the operation is performed on the padded data, which introduces invalid operations and reduces operation efficiency. Thus, the performance of a splitting scheme with respect to operation alignment can be evaluated based on the zero-padding amount. For example, the zero-padding amount, the amount of invalid operations, or the ratio of invalid operations may be used to characterize the invalid-operation indicator of the splitting scheme caused by the operation alignment requirement.
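A small sketch of this alignment-related indicator, assuming the invalid-operation ratio is taken as the fraction of work spent on padding when a channel size is rounded up to the alignment value:

```python
def invalid_op_ratio(channel_size, align=64):
    padded = -(-channel_size // align) * align        # round up to a multiple of `align`
    return (padded - channel_size) / padded

print(invalid_op_ratio(2048))   # 0.0   -> splits cleanly, no wasted computation
print(invalid_op_ratio(1000))   # ~0.023 -> about 2.3% of the work is padding
```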
On the other hand, in order to improve processing efficiency, a load-compute-store pipeline is generally used. In a pipelined manner, the next data block can be loaded while the current data block is being computed, thereby saving processing time. If the times of the stages in the pipeline are matched, that is, the differences in time spent by the stages are not large, the pipeline can run smoothly; otherwise, the stages may wait for one another. Thus, in some embodiments, the splitting scheme may aim for the loading time of the split sub-weights to match the time of the convolution computation, thereby facilitating parallel pipelining of weight IO and computation. When the execution times of the sub-operators are uniform, the loading time of the sub-weights can be effectively hidden in the parallel IO/compute pipeline, so that IO bottlenecks are avoided and weight preloading can work as intended.
During compilation, the computation time of the operation components can be determined from the hardware configuration information of the target computing device, and the preloading time of the weights can thereby be estimated, which in turn determines the split size of the weights. Therefore, in some embodiments, the sizes of the sub-weights can be balanced as much as possible during splitting, so that the processing times of successive operations are comparable. Further, the loading time of a split sub-weight is matched as closely as possible to the convolution operation time, so that the weight preloading feature can work better. From the viewpoint of pipelined processing, the loading time of the next sub-weight should match the operation time of the previous convolution sub-operator, for example, with the difference between the two within a predetermined range.
Therefore, the performance of the splitting scheme in terms of weight preloading can be evaluated according to the matching degree of weight loading time consumption and convolution operation time consumption. For example, the matching degree of the loading time of adjacent sub-weights and the operation time of a convolution sub-operator can be used to characterize the performance index of the splitting scheme in terms of weight preloading.
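A sketch of such a preloading indicator; the bandwidth and compute-throughput figures are hypothetical, and the ratio is only one plausible way to express the matching degree described above:

```python
def preload_match(sub_weight_bytes, conv_ops, io_bytes_per_s, ops_per_s):
    """Ratio of sub-weight load time to convolution time; values near 1.0 mean matched stages."""
    load_t = sub_weight_bytes / io_bytes_per_s
    compute_t = conv_ops / ops_per_s
    return load_t / compute_t

# 256x3x3x2048 fp16 sub-weight vs. its convolution on a 1x32x32x2048 input.
ratio = preload_match(sub_weight_bytes=256 * 3 * 3 * 2048 * 2,
                      conv_ops=2 * 32 * 32 * 256 * 3 * 3 * 2048,
                      io_bytes_per_s=100e9, ops_per_s=20e12)
print(f"{ratio:.2f}")   # < 1 means the weight load hides behind the computation
```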
Other optimization measures are also involved in the compilation optimization of the computational graph, including but not limited to operator fusion, IO/computation parallel pipelining, weight residency, and weight preloading. Generally, the effects of these optimization approaches on the computational graph are considered jointly so that the most appropriate combination of optimizations is selected. Therefore, the performance indicators used to evaluate a splitting scheme can either be used independently to evaluate the splitting scheme itself or be folded into the evaluation of the overall compilation optimization performance, so that the improvements from the various optimization means are considered comprehensively. In the overall compilation performance evaluation, the overall execution times of the various optimization schemes can be computed and compared to select the optimal one.
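Purely as an illustration of ranking candidate splitting schemes by estimated end-to-end time, the following sketch uses a very rough pipeline model in which the slowest of the load/compute/store stages paces the pipeline and padding inflates the compute stage; all timings and the cost model itself are assumptions, not taken from this disclosure:

```python
def estimate_total_time(num_rounds, load_t, compute_t, store_t, pad_ratio):
    effective_compute = compute_t / max(1e-9, 1.0 - pad_ratio)   # padding inflates compute work
    stage = max(load_t, effective_compute, store_t)              # slowest stage paces the pipeline
    return load_t + num_rounds * stage + store_t                 # fill + steady state + drain

candidates = {
    "Co split x8": estimate_total_time(8, 9.4e-5, 4.8e-4, 6.0e-5, 0.0),
    "Ci split x8": estimate_total_time(8, 9.4e-5, 4.8e-4, 6.0e-5, 0.02),
}
best = min(candidates, key=candidates.get)
print(best, f"{candidates[best] * 1e3:.2f} ms")
```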
Thus, the embodiments of the present disclosure provide a compiling method for a convolution operator which, when the scale of the convolution operator exceeds the single-round operation amount of the target computing device, adopts a suitable splitting scheme to split the large convolution operator into a plurality of small convolution sub-operators, so that the final result is obtained through multiple rounds of operations. The split sub-weights satisfy, as far as possible, the operation alignment requirement of the operation components on the target computing device, so that the computing power of the operation components can be fully utilized. Further, the sub-weights of the split convolution sub-operators are balanced as much as possible, and the weight loading time is matched as closely as possible to the convolution operation time, so that the pipeline is efficient and the weight preloading feature can work better. As a result, the weight scales of the split convolution sub-operators are smaller and more balanced, which makes them easier to schedule among the parallel operation components and easier to fit within the constraints of operation alignment and on-chip space, and thus more amenable to acceleration and optimization.
The present disclosure also provides a processing apparatus that may be used to perform compilation of a computational graph containing convolution operators, comprising: a processor configured to execute program instructions; and a memory configured to store the program instructions, which when loaded and executed by the processor, cause the processor to perform the compiling method described in the embodiments of the present disclosure.
In an embodiment of the present disclosure, there is also provided a computer-readable storage medium in which program instructions are stored, which when loaded and executed by a processor, cause the processor to perform the compiling method of a convolution operator described in the embodiment of the present disclosure. In an embodiment of the present disclosure, there is also provided a computer program product comprising a computer program or instructions which, when executed by a processor, implements a method of compiling a convolution operator according to the embodiments described in the present disclosure.
The disclosed embodiments also provide a chip, which may include the aforementioned processing device. Further, the present disclosure also provides a board that may include the foregoing chip.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet-of-things terminal, a mobile terminal, a cell phone, a driving recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a visual terminal, an autonomous-driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicle includes an aircraft, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to fields such as the internet, the internet of things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, an electronic device or apparatus with high computing power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a more power-efficient electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device, according to the hardware information of the terminal device and/or the edge device, to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling, and collaborative work of terminal-cloud or edge-cloud integration.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of the actions described. Accordingly, based on the disclosure or teachings herein, one of ordinary skill in the art will appreciate that certain steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as optional embodiments, i.e., the actions or modules involved are not necessarily required for the implementation of some aspect or aspects of this disclosure. In addition, depending on the scenario, the descriptions of different embodiments of the present disclosure have different emphases. In view of this, those skilled in the art will appreciate that, for portions not described in detail in one embodiment of the disclosure, reference may be made to the related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings herein, one of ordinary skill in the art will appreciate that the several embodiments disclosed herein may also be implemented in other ways not described. For example, in the foregoing embodiments of the electronic device or apparatus, the units are divided according to logical functions, and other division manners are possible in actual implementation. For another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connection relationships between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network nodes. In addition, some or all of the units may be selected as needed to achieve the objectives of the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit may physically exist alone.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of such circuits may include, but is not limited to, physical devices, which may include, but are not limited to, transistors, memristors, and the like. In view of this, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which may be, for example, resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, etc.
The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to illustrate the principles and implementations of the present disclosure; the above examples are provided solely to assist in understanding the methods of the present disclosure and their core ideas. Meanwhile, for those of ordinary skill in the art, there may be changes in the specific implementation and scope of application in accordance with the ideas of the present disclosure. In summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (15)

1. A method of compiling a convolution operator, implemented by a processing device, comprising:
acquiring a file to be compiled containing a convolution operator;
in response to a scale of a weight of the convolution operator exceeding a single-round operation amount of a computing device that is to execute the convolution operator, performing a first-stage splitting on the weight to generate a plurality of first-stage convolution sub-operators, wherein the first-stage splitting splits the weight into a plurality of first-stage sub-weights according to a first channel dimension, each first-stage sub-weight corresponding to one of the first-stage convolution sub-operators;
generating a first merging operator, wherein the first merging operator is used for merging operation results of the plurality of first-stage convolution sub-operators to obtain a final result of the convolution operator; and
compiling and optimizing the file to be compiled based on the plurality of first-stage convolution sub-operators and the first merging operator to obtain a corresponding binary instruction sequence for distribution to the computing device to execute a task corresponding to the convolution operator.
2. The compiling method of claim 1, wherein:
the first channel dimension is an output channel (Co) dimension; and
the first merging operator is a cascade operator, configured to cascade the operation results of the plurality of first-stage convolution sub-operators in the splitting order to obtain the final result of the convolution operator.
3. The compiling method of claim 1, wherein:
the first channel dimension is an input channel (Ci) dimension; and
the first merging operator is an addition operator, configured to perform aligned accumulation of the operation results of the plurality of first-stage convolution sub-operators to obtain the final result of the convolution operator.
4. A compiling method according to claim 3, further comprising:
generating a split operator for performing the first-stage splitting on input neurons of the convolution operator so as to provide them to the corresponding first-stage convolution sub-operators.
5. The compilation method of any of claims 1-4, wherein generating a first-stage convolution sub-operator further comprises:
in response to a scale of the first-stage sub-weight of the first-stage convolution sub-operator exceeding the single-round operation amount of the computing device, performing a second-stage splitting on the first-stage sub-weight to generate a plurality of second-stage convolution sub-operators, wherein the second-stage splitting splits the first-stage sub-weight into a plurality of second-stage sub-weights according to a second channel dimension, each second-stage sub-weight corresponding to one second-stage convolution sub-operator; and
generating a second merging operator, wherein the second merging operator is used for merging operation results of the plurality of second-stage convolution sub-operators to obtain the operation result of the first-stage convolution sub-operator.
6. The compiling method of claim 5, wherein:
if the first channel dimension is the output channel (Co) dimension, the second channel dimension is the input channel (Ci) dimension, and the second merging operator is an addition operator, configured to perform aligned accumulation of the operation results of the plurality of second-stage convolution sub-operators to obtain the operation result of the corresponding first-stage convolution sub-operator; and
if the first channel dimension is the input channel (Ci) dimension, the second channel dimension is the output channel (Co) dimension, and the second merging operator is a cascade operator, configured to cascade the operation results of the plurality of second-stage convolution sub-operators in the splitting order to obtain the operation result of the corresponding first-stage convolution sub-operator.
7. The compilation method of any of claims 1-6, wherein the single-round operation amount of the computing device is determined based on one or more of:
the number of parallel computing components in the computing device;
an on-chip storage size of the computing device; and
the single-round operation amount of each parallel computing component.
8. The compilation method of any of claims 1-7, wherein a split granularity of the first-stage splitting is determined based on an operation alignment requirement of the computing device in the first channel dimension.
9. The compiling method according to any one of claims 1-8, wherein split sizes of the plurality of first-stage sub-weights are determined according to the convolution operation time of the computing device, such that the loading time of a subsequent sub-weight matches the operation time of the preceding convolution sub-operator.
10. The compilation method of any of claims 5-6, wherein a split granularity of the second-stage splitting is determined based on an operation alignment requirement of the computing device in the second channel dimension.
11. The compiling method of any one of claims 5-6 or 10, wherein split sizes of the plurality of second-stage sub-weights are determined according to the convolution operation time of the computing device, such that the loading time of a subsequent sub-weight matches the operation time of the preceding convolution sub-operator.
12. The compiling method according to any one of claims 5-6 or 10-11, further comprising: evaluating the performance of the first-stage splitting and the second-stage splitting based on one or more of the following factors:
an invalid operation index caused by the operation alignment requirement; and
a degree of matching between the loading time of adjacent sub-weights and the operation time of the convolution sub-operator.
13. A processing apparatus for performing compilation of a computational graph containing convolution operators, comprising:
a processor configured to execute program instructions; and
a memory configured to store the program instructions, which when loaded and executed by the processor, cause the processor to perform the compiling method of any one of claims 1 to 12.
14. A computer readable storage medium having stored therein program instructions which, when loaded and executed by a processor, cause the processor to perform the compiling method according to any of claims 1-12.
15. A computer program product comprising a computer program or instructions which, when executed by a processor, implements a compiling method according to any one of claims 1 to 12.
CN202310073445.1A 2023-01-13 2023-01-13 Compiling method of convolution operator and related product Pending CN116090519A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202310073445.1A CN116090519A (en) 2023-01-13 2023-01-13 Compiling method of convolution operator and related product
PCT/CN2024/070133 WO2024149112A1 (en) 2023-01-13 2024-01-02 Compilation method for convolution operator, and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310073445.1A CN116090519A (en) 2023-01-13 2023-01-13 Compiling method of convolution operator and related product

Publications (1)

Publication Number Publication Date
CN116090519A true CN116090519A (en) 2023-05-09

Family

ID=86198919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310073445.1A Pending CN116090519A (en) 2023-01-13 2023-01-13 Compiling method of convolution operator and related product

Country Status (2)

Country Link
CN (1) CN116090519A (en)
WO (1) WO2024149112A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024149112A1 (en) * 2023-01-13 2024-07-18 上海寒武纪信息科技有限公司 Compilation method for convolution operator, and related product

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885700B (en) * 2017-12-29 2021-05-14 中国人民解放军国防科技大学 Multi-core implementation method for large-scale matrix convolution
WO2021120177A1 (en) * 2019-12-20 2021-06-24 华为技术有限公司 Method and apparatus for compiling neural network model
CN115169548A (en) * 2022-06-01 2022-10-11 华为技术有限公司 Tensor-based continuous learning method and device
CN116090519A (en) * 2023-01-13 2023-05-09 上海寒武纪信息科技有限公司 Compiling method of convolution operator and related product


Also Published As

Publication number Publication date
WO2024149112A1 (en) 2024-07-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination