WO2024149112A1 - Compilation method for convolution operator, and related product - Google Patents

Compilation method for convolution operator, and related product

Info

Publication number
WO2024149112A1
Authority
WO
WIPO (PCT)
Prior art keywords
convolution
operator
sub
level
dimension
Prior art date
Application number
PCT/CN2024/070133
Other languages
French (fr)
Chinese (zh)
Inventor
杨云召
张鹏
沈宇斌
Original Assignee
上海寒武纪信息科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海寒武纪信息科技有限公司
Publication of WO2024149112A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure generally relates to the field of intelligent computing, and more particularly to the field of neural networks. More specifically, the present disclosure relates to a compilation method of a convolution operator implemented by a processing device, a processing device, a computer-readable storage medium, and a computer program product.
  • Neural networks are one of the key technologies in artificial intelligence and deep learning, among which the convolutional neural network (CNN) is the most important type of network.
  • a very important calculation in a convolutional neural network is the convolution operation of the convolution layer (Conv layer).
  • the most time-consuming operation in common neural network models is often the convolution operation.
  • the function of the convolution layer is to extract features from the input data. Through multiple layers of convolution, complex features can be extracted to ensure that the network has sufficient expression and generalization capabilities.
  • the neural network model contains a large number of convolution operations of various types.
  • the computational performance of the convolution operation greatly affects the computational performance of the entire neural network model. Therefore, the acceleration optimization of the convolution operation is an important part of the optimization of the deep learning computational graph.
  • Deep learning accelerators usually have multiple computing components, which can accelerate neural network calculations through parallel operation.
  • Common deep learning accelerators are often subject to the limitations of hardware design, such as computing alignment issues and on-chip memory space size limitations.
  • the present disclosure proposes a compilation scheme for the convolution operator in multiple aspects, which, while adapting to the hardware requirements of the computing device, makes full use of the parallel computing power of the computing components in the computing device as much as possible.
  • the present disclosure provides a compilation method of a convolution operator implemented by a processing device, comprising: obtaining a file to be compiled containing a convolution operator; in response to the scale of the weight of the convolution operator exceeding the single-round computing amount of the computing device to execute the convolution operator, performing a first-level splitting on the weight to generate multiple first-level convolution sub-operators, wherein the first-level splitting splits the weight into multiple first-level sub-weights according to a first channel dimension, and each first-level sub-weight corresponds to a first-level convolution sub-operator; generating a first merging operator, wherein the first merging operator is used to merge the computing results of the multiple first-level convolution sub-operators to obtain the final result of the convolution operator; and compiling and optimizing the file to be compiled based on the multiple first-level convolution sub-operators and the first merging operator to obtain a corresponding binary instruction sequence to be allocated to the computing device to execute the task corresponding to the convolution operator.
  • the present disclosure provides a processing device for performing compilation on a computational graph containing a convolution operator, comprising: a processor configured to execute program instructions; and a memory configured to store the program instructions, so that when the program instructions are loaded and executed by the processor, the processor executes the compilation method according to the first aspect of the present disclosure.
  • the present disclosure provides a computer-readable storage medium having program instructions stored therein, which, when loaded and executed by a processor, causes the processor to execute the compilation method according to the first aspect of the present disclosure.
  • the present disclosure provides a computer program product, comprising a computer program or instructions, which, when executed by a processor, implements the compilation method according to the first aspect of the present disclosure.
  • the disclosed embodiment provides a suitable splitting scheme for the implementation of a larger-scale convolution operator to adapt to the hardware computing power conditions. Furthermore, by setting a suitable splitting granularity, it is possible to adapt to the hardware design requirements of the computing device, make full use of the computing power of the parallel computing components, and improve the overall computing performance.
  • FIG1 shows a structural diagram of a board according to an embodiment of the present disclosure
  • FIG2 shows a structural diagram of a combined processing device according to an embodiment of the present disclosure
  • FIG3 is a schematic diagram showing the internal structure of a single processor core of a single-core computing device according to an embodiment of the present disclosure
  • FIG4 is a simplified schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure
  • FIG5 shows an example of an exemplary conventional 3D convolution operation principle to which the disclosed embodiments can be applied
  • FIG6 shows an exemplary flow chart of a method for compiling a convolution operator implemented by a processing device according to an embodiment of the present disclosure
  • FIG7 shows a schematic diagram of a convolution operator splitting scheme according to some embodiments of the present disclosure
  • FIG8 shows a schematic diagram of a convolution operator splitting scheme according to other embodiments of the present disclosure.
  • the term “if” may be interpreted as “when” or “upon” or “in response to determining” or “in response to detecting,” depending on the context.
  • the phrase “if it is determined” or “if [described condition or event] is detected” may be interpreted as meaning “upon determination” or “in response to determining” or “upon detection of [described condition or event]” or “in response to detecting [described condition or event],” depending on the context.
  • FIG1 shows a schematic diagram of the structure of a board 10 according to an embodiment of the present disclosure.
  • the board 10 includes a chip 101, which is a system-on-chip (SoC) that integrates one or more combined processing devices.
  • the combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet the intelligent processing requirements in complex scenarios in the fields of computer vision, speech, natural language processing, data mining, etc.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is that the amount of input data is large, which places high requirements on the storage capacity and computing power of the platform.
  • the board 10 of this embodiment is suitable for cloud-based intelligent applications and has huge off-chip storage, on-chip storage and powerful computing power.
  • the chip 101 is connected to an external device 103 via an external interface device 102.
  • the external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
  • the data to be processed can be transmitted from the external device 103 to the chip 101 via the external interface device 102.
  • the calculation result of the chip 101 can be transmitted back to the external device 103 via the external interface device 102.
  • the external interface device 102 can have different interface forms, such as a PCIe interface.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105.
  • the storage device 104 is connected to the control device 106 and the chip 101 through a bus and transmits data.
  • the control device 106 in the board 10 is configured to control the state of the chip 101.
  • the control device 106 may include a microcontroller (MCU).
  • Fig. 2 is a block diagram showing a combined processing device in the chip 101 of this embodiment.
  • the combined processing device 20 includes a calculation device 201, an interface device 202, a processing device 203 and a storage device 204.
  • the computing device 201 is configured to execute user-specified operations, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning calculations. It can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203.
  • the computing device 201 can obtain input data from the processing device 203 via the interface device 202 and write it into the storage device on the computing device 201 chip.
  • the computing device 201 can obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the computing device 201 chip.
  • the interface device 202 can also read data in the storage device of the computing device 201 and transmit it to the processing device 203.
  • the processing device 203 performs basic controls including but not limited to data handling, starting and/or stopping the computing device 201, etc.
  • the processing device 203 can be a central processing unit (CPU), a graphics processing unit (GPU), or one or more types of processors in other general and/or special processors, which include but are not limited to digital signal processors (DSP), application specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • DSP digital signal processors
  • ASIC application specific integrated circuits
  • FPGA field-programmable gate arrays
  • the computing device 201 can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing device 201 and the processing device 203 are integrated and considered together, the two are regarded as forming a heterogeneous multi-core structure.
  • the storage device 204 is used to store data to be processed, which may be DRAM or DDR memory, is usually 16 GB or larger in size, and is used to store data of the computing device 201 and/or the processing device 203.
  • when the computing device 201 runs a neural network, it is generally necessary to first compile the neural network using the processing device 203 to obtain an executable file, which contains device information, that is, which device in the heterogeneous computer system the executable file needs to be executed on. After the executable file is assembled and linked, an executable program of the neural network can be obtained, and the executable program is stored in the storage device 204.
  • the processing device 203 can read the executable program from the storage location of the executable program and obtain multiple tasks of the program according to the executable program. These tasks are distributed to the computing device 201 for execution via the interface device 202, and finally obtain the operation result.
  • Fig. 3 shows a schematic diagram of the internal structure of a processing core when the computing device 201 in Fig. 2 is a single-core device.
  • the computing device 301 is used to process input data such as computer vision, speech, natural language, and data mining.
  • the computing device 301 includes three modules: a control module 31 (also called a controller), an operation module 32 (also called an operator), and a storage module 33 (also called a memory).
  • the control module 31 is used to coordinate and control the operation of the operation module 32 and the storage module 33 to complete the deep learning task. It includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312.
  • the instruction fetch unit 311 is used to fetch instructions from the processing device 203, and the instruction decoding unit 312 decodes the fetched instructions and sends the decoding results as control information to the operation module 32 and the storage module 33.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322.
  • the vector operation unit 321 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
  • the storage module 33 is used to store or transfer relevant data, including a neuron RAM (NRAM) 331, a weight RAM (WRAM) 332, and a direct memory access module (DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons, and intermediate results after calculation;
  • WRAM 332 is used to store the convolution kernel of the deep learning network, that is, the weight;
  • DMA 333 is connected to DRAM 204 through bus 34, and is responsible for data transfer between the computing device 301 and DRAM 204.
  • the NRAM and WRAM here can be two storage areas formed by dividing the same memory in the logical storage space, or they can be two independent memories, which are not specifically limited here.
  • FIG4 shows a simplified schematic diagram of the internal structure of the computing device 201 in FIG2 when it is multi-core.
  • a multi-core computing device can be abstracted using a hierarchical hardware model.
  • the multi-core computing device 400 is a system on chip, which includes at least one computing cluster, and each computing cluster includes multiple processor cores.
  • the multi-core computing device 400 is composed of a hierarchy of system on chip-computing cluster-processor core.
  • the multi-core computing device 400 includes an external storage controller 41 , a peripheral communication module 42 , an on-chip interconnect module 43 , a global synchronization module 44 , and multiple computing clusters 45 .
  • the peripheral communication module 42 is used to receive control signals from the processing device (203 in Figure 2) through the interface device (202 in Figure 2) to start the computing device (201 in Figure 2) to perform tasks.
  • the on-chip interconnect module 43 connects the external storage controller 41, the peripheral communication module 42 and multiple computing clusters 45 to transmit data and control signals between each module.
  • the global synchronization module 44 is, for example, a global synchronization barrier controller (GBC), which is used to coordinate the work progress of each computing cluster and ensure information synchronization.
  • GBC global synchronization barrier controller
  • the plurality of computing clusters 45 are the computing cores of the multi-core computing device 400. In the figure, four computing clusters are shown on each die by way of example. With the development of hardware, the multi-core computing device 400 disclosed herein may also include 8, 16, 64, or even more computing clusters 45. The computing clusters 45 are used to efficiently execute deep learning algorithms.
  • each computing cluster 45 includes multiple processor cores 406 as control and computing units, and a shared storage core 407 as a storage unit. Furthermore, each computing cluster may also include a local synchronization module 412 to coordinate the work progress of each processor core in the computing cluster and ensure information synchronization.
  • a number of processor cores 406 are shown in the figure by way of example, and the present disclosure does not limit the number of processor cores 406.
  • the storage core 407 is mainly used for storage and communication, that is, to store shared data or intermediate results between the processor cores 406, and to perform communication between the computing cluster 45 and the DRAM 204, between the computing clusters 45, and between the processor cores 406.
  • the storage core 407 has the ability of scalar operations and is used to perform scalar operations.
  • the storage core 407 includes a shared memory unit (SRAM) 408, a broadcast bus 409, a cluster direct memory access module (cluster direct memory access, CDMA) 410, and a global direct memory access module (global direct memory access, GDMA) 411.
  • the SRAM 408 plays the role of a high-performance data transfer station.
  • the data reused between different processor cores 406 in the same computing cluster 45 does not need to be obtained from the DRAM 204 by each processor core 406, but is transferred between the processor cores 406 through the SRAM 408.
  • the storage core 407 only needs to quickly distribute the reused data from the SRAM 408 to multiple processor cores 406 to improve the efficiency of inter-core communication and greatly reduce on-chip and off-chip input/output access.
  • the broadcast bus 409, CDMA 410, and GDMA 411 are used to perform communication between processor cores 406, communication between computing clusters 45, and data transmission between computing clusters 45 and DRAM 204, respectively.
  • the structure of a single processor core may be similar to the structure diagram of a single-core computing device shown in FIG3 , and will not be described in detail here.
  • the convolution layer in the neural network model can perform convolution operations to extract features by applying convolution kernels (also called filters, weights, etc.) to the input feature map (also called input data, neurons, or input neurons).
  • the neural network model may include various convolution operation layers, such as a convolution layer that performs forward, conventional 3D convolution operations, and a deconvolution layer that performs depthwise convolution operations.
  • in reverse training, it may be necessary to perform reverse depthwise convolution operations or cross-product convolution operations.
  • the disclosed embodiments are mainly optimized for conventional 3D convolution operations, and can also be applied to other types of convolution operations without conflict.
  • FIG. 5 shows an example of an exemplary conventional 3D convolution operation principle to which the disclosed embodiments can be applied.
  • the figure exemplarily shows four-dimensional input data X of size [N Hi Wi Ci], which can be represented as N three-dimensional rectangles 510 of size Hi×Wi×Ci.
  • the figure also exemplarily shows a four-dimensional convolution kernel K of size [Co Kh Kw Ci], which can be represented as Co three-dimensional convolution kernels 520 of size Kh×Kw×Ci.
  • the convolution of the input data X with the convolution kernel K yields the output data Y, which is four-dimensional data of size [N Ho Wo Co] and can be represented as N three-dimensional rectangles 530 of size Ho×Wo×Co.
  • the figure also specifically shows an example of a convolution operation, in which the input data is an input feature map 540 of size 6×6×3 (omitting the N dimension), the convolution kernel is a stereo convolution kernel 550 of size 3×3×3 for a single Co, and the output data is a 4×4 output feature map 560.
  • the specific operation process is as follows:
  • the convolution kernel 550 scans the input feature map 540 at a certain step size, performs element-wise multiplication and summation on the input features in the convolution window 570, and adds the bias. That is, the value at each position in the output feature map 560 is obtained by performing a two-dimensional convolution operation on the corresponding block of each input feature map and the corresponding convolution kernel, and then adding the results together. For example, the figure shows that the value at the (0,0) position of the output feature map 560 (that is, the convolution output point) is obtained by performing a two-dimensional convolution operation between the convolution window 570 framed by the black cube in the input feature map and the three-dimensional convolution kernel 550 to obtain three values, which are then added together to obtain the final value.
  • the position of the convolution kernel 550 can be moved on the input feature map 540, that is, the convolution window of the convolution output point can be moved.
  • the convolution step size (Sw, Sh) is (1,1).
  • each group (i.e., each input sample) contains Hi×Wi×Ci pieces of information, where Hi and Wi are the height and width of the input feature map, respectively, and Ci is the number of input feature maps, also known as the number of input channels.
  • the convolution layer has Ci×Co convolution kernels of size Kh×Kw, where Ci is the number of input channels, Co is the number of output feature maps (or the number of output channels), and Kh and Kw are the height and width of the convolution kernel, respectively.
  • the output feature map contains Ho×Wo×Co pieces of information, where Ho and Wo are the height and width of the output feature map, respectively, and Co is the number of output channels.
  • the convolution step size (Sw, Sh) is also involved, and its size affects the size of the output feature map; a worked check of this size relation is given below.
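  • as a worked check (the disclosure does not state the formula explicitly; this is the standard output-size relation for convolution without padding, which is consistent with the illustrated example):

$$H_o = \left\lfloor \frac{H_i - K_h}{S_h} \right\rfloor + 1, \qquad W_o = \left\lfloor \frac{W_i - K_w}{S_w} \right\rfloor + 1$$

For the example above, $H_o = \lfloor (6-3)/1 \rfloor + 1 = 4$ and likewise $W_o = 4$, which matches the 4×4 output feature map 560; when padding is used, the padded input size replaces $H_i$ and $W_i$.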
  • input feature map, input data, neuron or input neuron are used interchangeably; convolution kernel, filter or weight are used interchangeably; output feature map, output data or output neuron are used interchangeably.
  • programming frameworks are used to encapsulate common operations in neural network model algorithms into operators for programmers to call directly, such as convolution and pooling.
  • TensorFlow, PyTorch, etc. are currently popular deep learning frameworks.
  • computational graphs are usually used to describe the computational process of machine learning algorithms, tensors are used to represent all data in the computational graphs, and operators are used to represent various operations.
  • regarding the terms "node" and "operator (OP)" mentioned in this disclosure, it should be noted that the term "operator" is from the perspective of computer computation (or from the perspective of software or algorithms), while the term "node" is a more figurative term (from the perspective of the graph or a more intuitive level). In terms of what they refer to, the two terms actually refer to the same thing; that is, in this disclosure the terms "operator" and "node" have the same meaning and can be used interchangeably, only being described from different aspects.
  • FIG6 shows an exemplary flow chart of a method for compiling a convolution operator implemented by a processing device according to an embodiment of the present disclosure.
  • the processing device may be, for example, the processing device 203 of FIG2 .
  • a file to be compiled containing a convolution operator is obtained.
  • a computational graph is usually used to describe the computational process of a machine learning algorithm, and operators are used to represent various operations.
  • the convolution operator is included in the computational graph as a computational node for compilation.
  • in step 620, in response to the scale of the weight of the convolution operator exceeding the single-round computing amount of the computing device that is to execute the convolution operator, a first-level splitting is performed on the weight to generate multiple first-level convolution sub-operators.
  • the single-round computing amount of the computing device can be determined based on one or more of the following factors: the number of parallel computing components in the computing device; the size of the on-chip storage space of the computing device; and the single-round computing amount of each parallel computing component.
  • when the computing device is a single-core device (see FIG. 3), the number of parallel computing components is 1; the on-chip storage space of the computing device is determined according to the capacity of the storage module 33; and the single-round computing amount of the parallel computing component is determined according to the computing power of the operation module 32.
  • when the computing device is a multi-core device (see FIG. 4), the number of parallel computing components depends on the number of computing clusters and the number of processor cores in each computing cluster; a rough sketch of how these factors can be combined is given below.
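  • as a rough illustration only (the function, parameter names, and numbers below are hypothetical and not taken from this disclosure), combining these factors might look like:

```python
def single_round_capacity(num_parallel_units: int,
                          per_unit_amount: int,
                          on_chip_bytes: int,
                          bytes_per_element: int = 2) -> int:
    """Illustrative estimate of a single-round computing amount (in elements).

    The amount is bounded both by what the parallel computing components can
    process in one round and by how much data fits in on-chip storage.
    """
    compute_bound = num_parallel_units * per_unit_amount
    memory_bound = on_chip_bytes // bytes_per_element
    return min(compute_bound, memory_bound)

# e.g. 4 clusters x 4 processor cores, each handling 65536 elements per round,
# with 2 MiB of on-chip storage (all figures hypothetical)
print(single_round_capacity(16, 65536, 2 * 1024 * 1024))
```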
  • in the disclosed embodiments, when the weight scale of the convolution operator to be compiled exceeds the single-round computing amount of the computing device that is to execute the convolution operator, the convolution operator can be directly split into multiple convolution sub-operators during compilation, thereby facilitating subsequent optimization of the computation graph. Specifically, a first-level splitting can be performed on the weight of the convolution operator to generate multiple first-level convolution sub-operators.
  • the first-level splitting splits the weight into multiple first-level sub-weights according to the first channel dimension, and each first-level sub-weight corresponds to a first-level convolution sub-operator.
  • the first channel dimension can be the input channel Ci dimension or the output channel Co dimension described above for the convolution operation, which will be described in detail later.
  • in step 630, a first merging operator is generated, and the first merging operator is used to merge the operation results of the multiple first-level convolution sub-operators split out previously to obtain the final result of the original convolution operator. Since the original convolution operator is split into multiple convolution sub-operators, in order to obtain the final result of the original convolution operator, it is necessary to merge the operation results of these split convolution sub-operators.
  • the specific operation involved in the first merging operator depends on the specific method of the first-level splitting used by these first-level convolution sub-operators, that is, it is related to the first channel dimension of the split. This will be described later in conjunction with the specific splitting method.
  • in step 640, the file to be compiled is compiled and optimized based on the split multiple first-level convolution sub-operators and the first merging operator to obtain a corresponding binary instruction sequence, which is allocated to the computing device to execute the task corresponding to the original convolution operator.
  • the original large convolution operator in the computational graph to be compiled is now split into multiple first-level convolution sub-operators and a first merging operator. Therefore, these multiple first-level convolution sub-operators and the first merging operator can be used to replace the original convolution operator, thereby updating the computational graph.
  • other compilation optimizations can be continued, including but not limited to various optimizations that do not involve underlying hardware information, such as pruning, constant folding, arithmetic simplification, layout optimization, etc.; and various optimizations related to hardware architecture, such as operator fusion, weight preloading, weight retention, etc.
  • the binary instruction sequence generated after compilation can be assigned to the target computing device to execute the tasks corresponding to the computational graph, and the tasks corresponding to the original convolution operator are performed according to the split convolution sub-operators through multiple rounds of operations to obtain the final operation results.
  • the above describes the compilation scheme of the convolution operator provided by the embodiment of the present disclosure, which is aimed at large-scale convolution operators. By splitting it into small sub-convolution operators, it can more flexibly adapt to the hardware configuration of the computing device, give full play to its computing power, and improve the overall computing efficiency.
  • the first-level splitting splits the weight into multiple first-level sub-weights according to the first channel dimension, and each first-level sub-weight corresponds to a first-level convolution sub-operator.
  • the first channel dimension can be the input channel Ci dimension or the output channel Co dimension, so there are two splitting schemes.
  • Fig. 7 shows a schematic diagram of a convolution operator splitting scheme according to some embodiments of the present disclosure. In these embodiments, the splitting is performed according to the output channel Co dimension.
  • the operation results on the Co dimension do not need to be accumulated, so the operation results of each convolution sub-operator obtained by splitting according to the Co dimension can be directly cascaded to obtain the operation result of the original convolution operator. That is, when the first channel dimension is the output channel Co dimension, the corresponding first merging operator is a cascade operator, which is used to cascade the operation results of multiple first-level convolution sub-operators obtained by splitting according to the splitting order, thereby obtaining the final result of the original convolution operator.
  • a convolution operator with an input neuron size of 1×32×32×2048 (NHWC) and a weight size of 2048×3×3×2048 (CoKxKyCi) is taken as an example to illustrate the splitting process.
  • the left side of FIG. 7 shows the operation diagram of the original convolution operator: the scale of the input neuron 710 is 1×32×32×2048, the scale of the weight of the convolution operator 720 is 2048×3×3×2048, and the scale of the output neuron 730 is 1×32×32×2048.
  • the right side of FIG. 7 shows multiple convolution sub-operators 740 and a first merging operator 760 obtained after performing the Co dimension split.
  • the first merging operator 760 is a cascade operator.
  • the input of each of these convolution sub-operators 740 is still the original input neuron 710, whose scale is 1×32×32×2048.
  • the sub-weight scale of each convolution sub-operator 740 is the same, and is evenly split into 256×3×3×2048, thereby splitting into a total of 8 convolution sub-operators.
  • the scale of the output result 750 of each convolution sub-operator 740 is also the same, all of which are 1×32×32×256.
  • the output results 750 of these convolution sub-operators 740 are cascaded together through the first merging operator 760 to obtain the final output result 770, whose scale is 1×32×32×2048, which is consistent with the output neuron 730 of the original convolution operator.
  • the above cascade operation can be integrated into the operation result storage operation, that is, when the operation results of each convolution sub-operator are stored back to the memory, the storage location is directly allocated in the cascade order, so that the cascade operation is hidden in the storage operation.
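  • a minimal NumPy sketch of this Co-dimension scheme is shown below; it uses scaled-down, hypothetical shapes rather than the 1×32×32×2048 example, and the naive reference convolution is written only for illustration. Cascading the sub-results along the output channel dimension reproduces the original result:

```python
import numpy as np

def conv2d(x, w):
    """Naive reference convolution: x is (N, H, W, Ci), w is (Co, Kh, Kw, Ci);
    stride 1, no padding."""
    N, H, W, Ci = x.shape
    Co, Kh, Kw, _ = w.shape
    Ho, Wo = H - Kh + 1, W - Kw + 1
    y = np.zeros((N, Ho, Wo, Co))
    for i in range(Ho):
        for j in range(Wo):
            patch = x[:, i:i + Kh, j:j + Kw, :]
            # sum over Kh, Kw, Ci for every output channel
            y[:, i, j, :] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
    return y

x = np.random.randn(1, 8, 8, 16)    # stand-in for the 1x32x32x2048 input neuron
w = np.random.randn(32, 3, 3, 16)   # stand-in for the 2048x3x3x2048 weight

full = conv2d(x, w)

# first-level splitting along Co: 4 sub-weights of 8 output channels each
sub_results = [conv2d(x, w_k) for w_k in np.split(w, 4, axis=0)]

# first merging operator (cascade): concatenate the sub-results in split order
merged = np.concatenate(sub_results, axis=-1)
assert np.allclose(full, merged)
```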
  • Fig. 8 shows a schematic diagram of a convolution operator splitting scheme according to some other embodiments of the present disclosure. In these embodiments, the splitting is performed according to the input channel Ci dimension.
  • the operation results on the Ci dimension need to be accumulated, so the final result of the original convolution operator can only be obtained after the operation results of the convolution sub-operators obtained by splitting along the Ci dimension are accumulated. That is, when the first channel dimension is the input channel Ci dimension, the corresponding first merging operator is an addition operator, which is used to perform element-wise accumulation of the operation results of the multiple first-level convolution sub-operators obtained by splitting, thereby obtaining the final result of the original convolution operator.
  • a convolution operator with an input neuron size of 1×32×32×2048 (NHWC) and a weight size of 2048×3×3×2048 (CoKxKyCi) is taken as an example to illustrate the splitting process.
  • the left side of FIG. 8 shows the operation diagram of the original convolution operator: the scale of the input neuron 810 is 1×32×32×2048, the scale of the weight of the convolution operator 820 is 2048×3×3×2048, and the scale of the output neuron 830 is 1×32×32×2048.
  • the right side of Figure 8 shows multiple convolution sub-operators 840 and a first merging operator 860 obtained after performing the Ci dimension split.
  • the first merging operator 860 is an addition operator.
  • in these embodiments, the input neurons also need to be split along the Ci dimension accordingly. Therefore, it is also necessary to additionally generate a splitting operator 880, which is used to perform the same splitting on the input neurons of the original convolution operator as on the weights, and then provide the split input neurons to the corresponding convolution sub-operators 840.
  • the split sub-input neurons 890 serve as input neurons of the corresponding convolution sub-operators 840, and the scale of each is 1×32×32×256.
  • the sub-weights of each convolution sub-operator 840 are of the same size, being evenly split into 2048×3×3×256 according to the Ci dimension, thus being split into 8 convolution sub-operators in total.
  • the output results 850 of each convolution sub-operator 840 are also of the same size, which is 1×32×32×2048.
  • the output results 850 of these convolution sub-operators 840 are accumulated element-wise through the first merging operator 860 to obtain the final output result 870, which has a size of 1×32×32×2048, consistent with the output neuron 830 of the original convolution operator; a NumPy sketch of this scheme is given below.
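  • a minimal NumPy sketch of the Ci-dimension scheme, again with scaled-down hypothetical shapes and a naive reference convolution written only for illustration. The splitting operator splits the input along Ci, and the addition-type first merging operator accumulates the full-sized partial results:

```python
import numpy as np

def conv2d(x, w):
    """Naive reference convolution: x is (N, H, W, Ci), w is (Co, Kh, Kw, Ci);
    stride 1, no padding."""
    N, H, W, Ci = x.shape
    Co, Kh, Kw, _ = w.shape
    Ho, Wo = H - Kh + 1, W - Kw + 1
    y = np.zeros((N, Ho, Wo, Co))
    for i in range(Ho):
        for j in range(Wo):
            patch = x[:, i:i + Kh, j:j + Kw, :]
            y[:, i, j, :] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
    return y

x = np.random.randn(1, 8, 8, 16)    # stand-in for the 1x32x32x2048 input neuron
w = np.random.randn(32, 3, 3, 16)   # stand-in for the 2048x3x3x2048 weight

full = conv2d(x, w)

# splitting operator: split the input neurons along Ci, matching the weight split
x_parts = np.split(x, 4, axis=-1)   # each (1, 8, 8, 4)
w_parts = np.split(w, 4, axis=-1)   # each (32, 3, 3, 4)

# each sub-convolution produces a full-sized partial result; the first merging
# operator (addition) accumulates them element-wise
merged = sum(conv2d(xp, wp) for xp, wp in zip(x_parts, w_parts))
assert np.allclose(full, merged)
```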
  • the Ci or Co dimension can be selected for splitting according to the actual accuracy requirement. For example, when the computation is not sensitive to accuracy or the accuracy requirement is not high, such as below an accuracy threshold, either the Co dimension or the Ci dimension can be selected for splitting; when the accuracy requirement is high, such as above the accuracy threshold, the Co dimension is selected for splitting, since the element-wise accumulation of partial results required by the Ci-dimension splitting may introduce additional precision loss.
  • the splitting granularity can be set according to the operation characteristics of the hardware. In some embodiments, when the hardware has an operation alignment requirement, the splitting granularity can match the operation alignment requirement of the computing device.
  • some operators require the input data to be aligned to a specified value, for example an alignment value M, so that the data can be processed at the granularity of the alignment value M. If the input data does not reach the alignment value, the data is padded with zeros.
  • M can have different values, for example, 64, 128, 256, etc., and the unit can be the number of bytes (Byte) or the number of data.
  • the operation alignment requirements can have different alignment values depending on the dimension. For example, some hardware may require the Ci dimension to be aligned to 64 and/or the Co dimension to be aligned to 64.
  • the above-mentioned first-level splitting according to the first channel dimension can determine the splitting granularity according to the alignment requirements of the first channel dimension. For example, assuming that the hardware alignment value of the first channel dimension is 64, the splitting granularity can be aligned to 64, that is, it can be an integer multiple of 64, so that the split data blocks are more conducive to fully utilizing the computing power of the hardware computing components and improving the computing efficiency. More specifically, when splitting according to the Co dimension, the splitting granularity is aligned to the alignment value of the Co dimension; when splitting according to the Ci dimension, the splitting granularity is aligned to the alignment value of the Ci dimension.
  • the minimum splitting granularity corresponds to the alignment value of the corresponding channel.
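  • as a small sketch of how a splitting granularity respecting such an alignment value can be chosen (the per-round limit and function name below are hypothetical, used only to make the 2048-to-8×256 split of the running example concrete):

```python
def channel_split_sizes(channel: int, align: int = 64,
                        max_channels_per_round: int = 256) -> list[int]:
    """Split a channel dimension into pieces that are multiples of the
    alignment value, up to a hypothetical per-round channel limit."""
    granularity = (max_channels_per_round // align) * align  # multiple of align
    sizes = []
    remaining = channel
    while remaining > 0:
        part = min(granularity, remaining)
        sizes.append(part)
        remaining -= part
    return sizes

print(channel_split_sizes(2048))  # [256, 256, 256, 256, 256, 256, 256, 256]
```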
  • in some cases, the size of the sub-weights obtained after the first-level splitting may still exceed the single-round computing amount of the computing device. In this case, a second-level splitting can be performed.
  • for a first-level convolution sub-operator that needs such a two-level splitting, the method can further include: in response to the scale of the first-level sub-weight of the first-level convolution sub-operator exceeding the single-round computing amount of the computing device, performing a second-level splitting on the first-level sub-weight to generate multiple second-level convolution sub-operators, wherein the second-level splitting splits the first-level sub-weight into multiple second-level sub-weights according to a second channel dimension, and each second-level sub-weight corresponds to a second-level convolution sub-operator; and generating a second merging operator, wherein the second merging operator is used to merge the computing results of the multiple split second-level convolution sub-operators to obtain the computing result of the original first-level convolution sub-operator.
  • the second channel dimension split by the second-level splitting is different from the first channel dimension.
  • when the second channel dimension is the input channel Ci dimension, the second merging operator is an addition operator, which is used to perform element-wise accumulation of the operation results of the multiple split second-level convolution sub-operators to obtain the operation result of the corresponding first-level convolution sub-operator.
  • when the second channel dimension is the output channel Co dimension, the second merging operator is a cascade operator, which is used to cascade the operation results of the multiple split second-level convolution sub-operators in the split order to obtain the operation result of the corresponding first-level convolution sub-operator.
  • the split granularity of the second-level split can be determined according to the calculation alignment requirements of the computing device for the second channel dimension. That is, the split granularity of the second-level split can be an integer multiple of the calculation alignment value of the second channel dimension.
  • in this way, the split data blocks are more conducive to fully utilizing the computing power of the hardware computing components and improving the computing efficiency; a shape-level sketch of such a two-level split is given below.
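  • continuing the running example purely at the level of shapes (the split sizes follow the example above; whether a second-level split is needed depends on the device, so this is only an illustration):

```python
# weight of the running example: Co x Kh x Kw x Ci = 2048 x 3 x 3 x 2048
co, kh, kw, ci = 2048, 3, 3, 2048

# first-level splitting along Co (first channel dimension): 8 sub-weights
first_level = [(256, kh, kw, ci) for _ in range(co // 256)]

# if a first-level sub-weight (256 x 3 x 3 x 2048) still exceeds the
# single-round amount, split it again along Ci (second channel dimension)
second_level = [[(256, kh, kw, 256) for _ in range(ci // 256)]
                for _ in first_level]

# merging: the second merging operator adds the 8 partial results of each
# first-level sub-operator element-wise; the first merging operator then
# cascades the 8 first-level results along the Co dimension
print(len(first_level), len(second_level[0]))  # 8 8
```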
  • the scale of convolution operators in different neural network models varies, and the scale of different convolution operators in the same neural network model may also be different, so there will be multiple ways to split. Moreover, even convolution operators of the same scale may have multiple splitting schemes. The performance of the splitting scheme can be evaluated based on multiple factors.
  • the splitting granularity can be determined according to the hardware's operation alignment requirements.
  • not all convolution operators can be split exactly according to multiples of the alignment value.
  • the performance of the splitting scheme in terms of operation alignment can be evaluated based on the amount of zero padding.
  • the amount of zero padding, the amount of invalid operations, or the invalid operation ratio can be used to characterize the invalid operation index caused by the splitting scheme due to the operation alignment requirements.
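  • one straightforward way to compute such an index (this particular formula is an illustrative interpretation, not a metric defined in this disclosure) is to compare the zero padding needed to reach the next multiple of the alignment value with the padded size:

```python
def padding_and_invalid_ratio(channel: int, align: int = 64) -> tuple[int, float]:
    """Zero padding needed to align a channel size, and the resulting
    fraction of padded (invalid) operations along that dimension."""
    padded = -(-channel // align) * align   # round up to a multiple of align
    padding = padded - channel
    return padding, padding / padded

print(padding_and_invalid_ratio(2000))  # (48, 0.0234375): about 2.3% wasted work
```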
  • a pipeline method of load (Load)-compute (Compute)-store (Store) is usually used for processing.
  • for example, while the first data block is being computed, the second data block can be loaded, thereby saving processing time. If the time of each stage in the pipeline matches, that is, the time spent in each stage does not differ much, the pipeline can work smoothly; otherwise the stages may end up waiting for one another. Therefore, in some embodiments, the splitting scheme can make the loading time of the split sub-weights match the time of computing the convolution, so as to facilitate the parallel pipelining of weight IO and computation. When the execution time of the sub-operators is relatively uniform, the loading time of the sub-weights can be effectively hidden in the parallel pipeline of IO and computation, thereby avoiding an IO bottleneck and better realizing the purpose of weight preloading.
  • the computation time of the computational components can be determined based on the hardware configuration information of the target computing device, thereby estimating the preloading time of the weights and determining the size of the weight split. Therefore, in some embodiments, when splitting, the size of each sub-weight after the split can be made as balanced as possible, so that the processing time of the previous and next operations is equivalent. Furthermore, the loading time of the weights after the split can be made to match the convolution operation time as much as possible, so that the weight preloading feature can work better. From the perspective of pipeline processing, the loading time of the latter sub-weight matches the operation time of the previous convolution sub-operator, for example, the difference between the two is within a predetermined range.
  • the performance of the split scheme in weight preloading can be evaluated based on the matching degree between the weight loading time and the convolution operation time.
  • the matching degree between the loading time of adjacent sub-weights and the operation time of the convolution sub-operator can be used to characterize the performance index of the split scheme in weight preloading.
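  • as a rough illustration of such a matching check (the bandwidth and throughput figures below are placeholders, not real device parameters, and the split may or may not be judged pipeline-friendly under them):

```python
def preload_timing(sub_weight_bytes: float, sub_op_macs: float,
                   load_bandwidth: float = 1e11,   # bytes/s, placeholder
                   compute_rate: float = 1e13,     # MACs/s, placeholder
                   tolerance: float = 0.2):
    """Compare the time to load the next sub-weight with the time to compute
    the current convolution sub-operator; they 'match' when their difference
    is within the given tolerance, so the load can be hidden by the compute."""
    load_time = sub_weight_bytes / load_bandwidth
    compute_time = sub_op_macs / compute_rate
    matched = abs(load_time - compute_time) <= tolerance * max(load_time, compute_time)
    return load_time, compute_time, matched

# a 256x3x3x2048 sub-weight in 16-bit, applied over a 32x32 output (example shapes)
print(preload_timing(256 * 3 * 3 * 2048 * 2, 1 * 32 * 32 * 256 * 3 * 3 * 2048))
```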
  • the performance indicators for evaluating the splitting scheme can also be integrated into the evaluation of the overall compilation optimization performance to comprehensively consider the improvement brought by various optimization methods to the overall performance.
  • the performance advantages and disadvantages of various optimization schemes can be compared by calculating the overall execution time of various optimization schemes, so as to select the best optimization scheme.
  • the disclosed embodiment provides a compilation method for a convolution operator, which supports taking a suitable splitting scheme when the scale of the convolution operator exceeds the single-round computing amount of the target computing device, splitting the large convolution operator into multiple small convolution sub-operators, so as to obtain the final result through multiple rounds of computing.
  • the sub-weights after splitting are made to meet, as far as possible, the computational alignment requirements of the computing components on the target computing device, so that the computing power of the computing components can be fully utilized.
  • the weights of each convolution sub-operator after splitting are as balanced as possible, and the weight loading time is matched with the convolution operation time as much as possible, so that the pipeline efficiency is high, and the weight preloading feature can work better.
  • the weight scale of the split convolution sub-operators is smaller and more balanced.
  • such convolution sub-operators are more conducive to scheduling among parallel computing components, and it is easier to meet the constraints of computational alignment and on-chip space limitations, which is more conducive to acceleration optimization.
  • the present disclosure also provides a processing device, which can be used to compile a computational graph containing a convolution operator, including: a processor configured to execute program instructions; and a memory configured to store the program instructions, so that when the program instructions are loaded and executed by the processor, the processor executes the compilation method described in the embodiment of the present disclosure.
  • a computer-readable storage medium in which program instructions are stored. When the program instructions are loaded and executed by a processor, the processor executes the compilation method of the convolution operator described in the disclosed embodiment.
  • a computer program product is also provided, including a computer program or instruction. When the computer program or instruction is executed by a processor, the compilation method of the convolution operator described in the disclosed embodiment is implemented.
  • the present disclosure also provides a chip, which may include the aforementioned processing device. Furthermore, the present disclosure also provides a board, which may include the aforementioned chip.
  • the electronic equipment or device disclosed herein may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, automatic driving terminals, transportation, household appliances, and/or medical equipment.
  • the transportation includes airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes magnetic resonance imaging, ultrasound machines and/or electrocardiographs.
  • the electronic equipment or device disclosed herein may also be applied to the Internet, IoT, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical fields. Further, the electronic equipment or device disclosed herein may also be used in cloud, edge, and terminal applications related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic devices or devices with high computing power according to the disclosed solution can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or edge devices (such as smart phones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are compatible with each other, so that according to the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, so as to complete the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will appreciate that the scheme of the present disclosure is not limited by the order of the actions described. Therefore, based on the disclosure or teaching of the present disclosure, those skilled in the art will appreciate that some of the steps therein may be performed in other orders or simultaneously. Further, those skilled in the art will appreciate that the embodiments described in the present disclosure may be regarded as optional embodiments, i.e., the actions or modules involved therein are not necessarily indispensable for the implementation of one or some of the schemes of the present disclosure. In addition, depending on the different schemes, the present disclosure also places different emphases on the description of some embodiments. In view of this, those skilled in the art will appreciate that the parts that are not described in detail in a certain embodiment of the present disclosure may also refer to the relevant descriptions of other embodiments.
  • connection relationship between different units or components can be a direct or indirect coupling between units or components.
  • the aforementioned direct or indirect coupling involves a communication connection using an interface, wherein the communication interface can support electrical, optical, acoustic, magnetic or other forms of signal transmission.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units.
  • the aforementioned components or units may be located in the same location or distributed on multiple network units.
  • some or all of the units may be selected to achieve the purpose of the scheme described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
  • the above integrated units may also be implemented in the form of hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various devices described herein can be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device may be any appropriate storage medium (including magnetic storage media or magneto-optical storage media, etc.), which may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, and a RAM, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Stored Programmes (AREA)

Abstract

Disclosed in the present application is a compilation method for a convolution operator. The compilation method is realized by means of a processing apparatus. The processing apparatus may be comprised in a combined processing apparatus, the combined processing apparatus comprising an interface apparatus and a computing apparatus, wherein the computing apparatus interacts with the processing apparatus to jointly complete a computing operation, which is specified by a user. The combined processing apparatus comprises a storage apparatus, the storage apparatus being respectively connected to the computing apparatus and the processing apparatus, and being used for storing data of the computing apparatus and the processing apparatus.

Description

卷积算子的编译方法及相关产品Compilation method of convolution operator and related products
相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS
本申请要求于2023年1月13日申请的,申请号为2023100734451,名称为“卷积算子的编译方法及相关产品”的中国专利申请的优先权。This application claims priority to Chinese patent application number 2023100734451, filed on January 13, 2023, entitled “Compilation method for convolution operator and related products”.
技术领域Technical Field
本披露一般涉及智能计算领域,尤其涉及神经网络领域。更具体地,本披露涉及一种用处理装置实现的卷积算子的编译方法、处理装置、计算机可读存储介质和计算机程序产品。The present disclosure generally relates to the field of intelligent computing, and more particularly to the field of neural networks. More specifically, the present disclosure relates to a compilation method of a convolution operator implemented by a processing device, a processing device, a computer-readable storage medium, and a computer program product.
背景技术Background technique
神经网络是人工智能、深度学习中的关键技术之一,其中卷积神经网络(Convolution Neural Network,CNN)是最为重要的一种网络类型。卷积神经网络中非常重要的计算即为卷积层(Conv layer)的卷积运算(Convolution Operation)。常见神经网络模型中耗时最多的运算往往也是卷积运算。卷积层的功能是对输入数据进行特征提取,通过多层卷积,能够抽取复杂特征,以保证网络具有足够的表达能力和泛化能力。神经网络模型中包含了大量的、各种类型的卷积运算,卷积运算的计算性能极大地影响整个神经网络模型的计算性能。因此,对卷积运算的加速优化是深度学习计算图优化中的重要部分。Neural network is one of the key technologies in artificial intelligence and deep learning, among which convolutional neural network (CNN) is the most important type of network. A very important calculation in convolutional neural network is the convolution operation of the convolution layer (Conv layer). The most time-consuming operation in common neural network models is often the convolution operation. The function of the convolution layer is to extract features from the input data. Through multiple layers of convolution, complex features can be extracted to ensure that the network has sufficient expression and generalization capabilities. The neural network model contains a large number of convolution operations of various types. The computational performance of the convolution operation greatly affects the computational performance of the entire neural network model. Therefore, the acceleration optimization of the convolution operation is an important part of the optimization of the deep learning computational graph.
Deep learning accelerators usually contain multiple operation units and accelerate neural network computation through parallelism among these units. Due to hardware design constraints, common deep learning accelerators often face computation alignment issues and limited on-chip memory capacity.
Therefore, there is an urgent need for a convolution operator optimization scheme suited to the hardware design of deep learning accelerators, so as to maximize computational parallelism and improve computational efficiency while satisfying the hardware constraints.
Summary of the Invention
In order to solve at least one or more of the technical problems mentioned above, the present disclosure proposes, in multiple aspects, a compilation scheme for a convolution operator, which adapts to the hardware requirements of the computing device while exploiting the parallel computing capability of the operation units in the computing device as fully as possible.
In a first aspect, the present disclosure provides a compilation method for a convolution operator implemented by a processing device, comprising: obtaining a file to be compiled that contains a convolution operator; in response to the size of the weight of the convolution operator exceeding the single-round operation amount of the computing device that is to execute the convolution operator, performing first-level splitting on the weight to generate multiple first-level convolution sub-operators, wherein the first-level splitting splits the weight into multiple first-level sub-weights along a first channel dimension, each first-level sub-weight corresponding to one first-level convolution sub-operator; generating a first merging operator, the first merging operator being used to merge the operation results of the multiple first-level convolution sub-operators to obtain the final result of the convolution operator; and compiling and optimizing the file to be compiled based on the multiple first-level convolution sub-operators and the first merging operator to obtain a corresponding binary instruction sequence, which is distributed to the computing device to execute the task corresponding to the convolution operator.
In a second aspect, the present disclosure provides a processing device for compiling a computational graph containing a convolution operator, comprising: a processor configured to execute program instructions; and a memory configured to store the program instructions, wherein when the program instructions are loaded and executed by the processor, the processor performs the compilation method according to the first aspect of the present disclosure.
In a third aspect, the present disclosure provides a computer-readable storage medium having program instructions stored therein, wherein when the program instructions are loaded and executed by a processor, the processor performs the compilation method according to the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides a computer program product comprising a computer program or instructions which, when executed by a processor, implement the compilation method according to the first aspect of the present disclosure.
Through the compilation scheme for the convolution operator provided above, the disclosed embodiments provide a suitable splitting scheme for implementing a large-scale convolution operator so as to adapt to the available hardware computing power. Furthermore, by setting a suitable splitting granularity, the scheme can meet the hardware design requirements of the computing device, make full use of the computing power of the parallel operation units, and improve overall computing performance.
Brief Description of the Drawings
By reading the following detailed description with reference to the accompanying drawings, the above and other objects, features, and advantages of the exemplary embodiments of the present disclosure will become readily understood. In the drawings, several embodiments of the present disclosure are shown in an exemplary and non-limiting manner, and the same or corresponding reference numerals denote the same or corresponding parts, wherein:
FIG. 1 shows a structural diagram of a board card according to an embodiment of the present disclosure;
FIG. 2 shows a structural diagram of a combined processing device according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of the internal structure of a single processor core of a single-core computing device according to an embodiment of the present disclosure;
FIG. 4 shows a simplified schematic diagram of the internal structure of a multi-core computing device according to an embodiment of the present disclosure;
FIG. 5 shows an example of the principle of a conventional 3D convolution operation to which embodiments of the present disclosure can be applied;
FIG. 6 shows an exemplary flowchart of a compilation method for a convolution operator implemented by a processing device according to an embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of a convolution operator splitting scheme according to some embodiments of the present disclosure;
FIG. 8 shows a schematic diagram of a convolution operator splitting scheme according to other embodiments of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first", "second", "third", "fourth", and the like that may appear in the claims, specification, and drawings of the present disclosure are used to distinguish different objects rather than to describe a specific order. The terms "include" and "comprise" used in the specification and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in this specification are only for the purpose of describing specific embodiments and are not intended to limit the present disclosure. As used in this specification and the claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in this specification and the claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
Specific implementations of the present disclosure are described in detail below with reference to the accompanying drawings.
Exemplary Hardware Environment
FIG. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present disclosure. As shown in FIG. 1, the board card 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. A combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in the cloud intelligence field. A notable feature of cloud intelligence applications is the large amount of input data, which places high demands on the storage capacity and computing power of the platform. The board card 10 of this embodiment is suitable for cloud intelligence applications and has large off-chip storage, large on-chip storage, and powerful computing capability.
The chip 101 is connected to an external device 103 via an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a WiFi interface. Data to be processed can be transmitted from the external device 103 to the chip 101 via the external interface device 102, and the computation results of the chip 101 can be transmitted back to the external device 103 via the external interface device 102. Depending on the application scenario, the external interface device 102 may have different interface forms, such as a PCIe interface.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to the control device 106 and the chip 101 via a bus for data transmission. The control device 106 on the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a micro controller unit (MCU).
FIG. 2 is a structural diagram of the combined processing device in the chip 101 of this embodiment. As shown in FIG. 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor for performing deep learning or machine learning computations. It can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into an on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into an on-chip control cache of the computing device 201. Alternatively or optionally, the interface device 202 may also read data from the storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors among central processing units (CPUs), graphics processing units (GPUs), or other general-purpose and/or special-purpose processors, including but not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and the number of processors may be determined according to actual needs. As mentioned above, the computing device 201 of the present disclosure, considered alone, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they are regarded as forming a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM, specifically DDR memory, typically 16 GB or larger in size, and is used to store data of the computing device 201 and/or the processing device 203.
When the computing device 201 runs a neural network, it is generally necessary to first compile the neural network using the processing device 203 to obtain an executable file. The executable file contains device information, i.e., on which device in the heterogeneous computer system the executable file is to be executed. After the executable file is assembled and linked, an executable program of the neural network is obtained, and the executable program is stored in the storage device 204.
The processing device 203 can read the executable program from its storage location and obtain multiple tasks of the program according to the executable program. These tasks are distributed via the interface device 202 to the computing device 201 for execution, and the operation result is finally obtained.
FIG. 3 shows a schematic diagram of the internal structure of a processing core when the computing device 201 in FIG. 2 is a single-core device. The computing device 301 is used to process input data for computer vision, speech, natural language, data mining, and similar applications. The computing device 301 includes three main modules: a control module 31 (also called a controller), an operation module 32 (also called an operation unit), and a storage module 33 (also called a memory).
The control module 31 coordinates and controls the work of the operation module 32 and the storage module 33 to complete deep learning tasks. It includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 fetches instructions from the processing device 203, and the instruction decode unit 312 decodes the fetched instructions and sends the decoding results as control information to the operation module 32 and the storage module 33.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 performs vector operations and supports complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 322 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution.
The storage module 33 is used to store or transfer relevant data and includes a neuron RAM (NRAM) 331, a weight RAM (WRAM) 332, and a direct memory access (DMA) module 333. The NRAM 331 stores input neurons, output neurons, and intermediate results after computation; the WRAM 332 stores the convolution kernels, i.e., the weights, of the deep learning network; the DMA 333 is connected to the DRAM 204 via a bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204. It should be noted that the NRAM and WRAM here may be two storage areas formed by partitioning the logical storage space of the same memory, or they may be two independent memories, which is not specifically limited here.
FIG. 4 shows a simplified schematic diagram of the internal structure of the computing device 201 in FIG. 2 when it is multi-core. A multi-core computing device can be abstracted using a hierarchical hardware model. As shown in the figure, the multi-core computing device 400 is a system-on-chip that includes at least one computing cluster, and each computing cluster in turn includes multiple processor cores. In other words, the multi-core computing device 400 is organized in a hierarchy of system-on-chip, computing cluster, and processor core.
At the system-on-chip level, as shown in the figure, the multi-core computing device 400 includes an external storage controller 41, a peripheral communication module 42, an on-chip interconnect module 43, a global synchronization module 44, and multiple computing clusters 45.
There may be multiple external storage controllers 41, two of which are shown in the figure as an example. They respond to access requests issued by the processor cores and access external storage devices (such as the DRAM 204 in FIG. 2) to read data from or write data to off-chip memory. The peripheral communication module 42 receives control signals from the processing device (203 in FIG. 2) through the interface device (202 in FIG. 2) and starts the computing device (201 in FIG. 2) to perform tasks. The on-chip interconnect module 43 connects the external storage controllers 41, the peripheral communication module 42, and the multiple computing clusters 45 to transmit data and control signals between the modules. The global synchronization module 44 is, for example, a global barrier controller (GBC), which coordinates the work progress of the computing clusters and ensures information synchronization. The multiple computing clusters 45 are the computing cores of the multi-core computing device 400; four per die are shown in the figure as an example. With the development of hardware, the multi-core computing device 400 of the present disclosure may also include 8, 16, 64, or even more computing clusters 45. The computing clusters 45 are used to efficiently execute deep learning algorithms.
At the computing cluster level, as shown in the figure, each computing cluster 45 includes multiple processor cores 406 as control and computing units, and a shared storage core 407 as a storage unit. Further, each computing cluster may also include a local synchronization module 412 to coordinate the work progress of the processor cores within the computing cluster and ensure information synchronization. Four processor cores 406 are shown in the figure as an example; the present disclosure does not limit the number of processor cores 406.
The storage core 407 is mainly used for storage and communication, i.e., storing shared data or intermediate results among the processor cores 406, and handling communication between the computing cluster 45 and the DRAM 204, communication among the computing clusters 45, and communication among the processor cores 406. In other embodiments, the storage core 407 has scalar computation capability and is used to perform scalar operations.
The storage core 407 includes a shared memory unit (SRAM) 408, a broadcast bus 409, a cluster direct memory access (CDMA) module 410, and a global direct memory access (GDMA) module 411. The SRAM 408 plays the role of a high-performance data transfer station: data reused among different processor cores 406 within the same computing cluster 45 does not need to be fetched from the DRAM 204 by each processor core 406 individually, but is relayed among the processor cores 406 via the SRAM 408. The storage core 407 only needs to quickly distribute the reused data from the SRAM 408 to the multiple processor cores 406, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses. The broadcast bus 409, the CDMA 410, and the GDMA 411 are used, respectively, for communication among the processor cores 406, communication among the computing clusters 45, and data transmission between the computing clusters 45 and the DRAM 204.
At the processor core level, the structure of a single processor core may be similar to the structural diagram of the single-core computing device shown in FIG. 3, and will not be described in detail here.
Convolution Operation Principle
The convolution layers in a neural network model perform convolution operations: features are extracted by applying convolution kernels (also called filters, weights, etc.) to the input feature maps (also called input data, neurons, or input neurons).
A neural network model may contain various convolution operation layers, for example, convolution layers that perform forward, conventional 3D convolution operations, and deconvolution layers that perform depthwise convolution operations. In backward training, it may be necessary to perform backward depthwise convolution operations or cross-product convolution operations. The embodiments of the present disclosure are mainly optimized for conventional 3D convolution operations, and can also be applied to other types of convolution operations where no conflict arises.
FIG. 5 shows an example of the principle of a conventional 3D convolution operation to which embodiments of the present disclosure can be applied.
The figure exemplarily shows four-dimensional input data X of size [N Hi Wi Ci], which can be represented as N three-dimensional rectangular blocks 510 of size Hi×Wi×Ci. The figure also exemplarily shows a four-dimensional convolution kernel K of size [Co Kh Kw Ci], which can be represented as Co three-dimensional convolution kernels 520 of size Kh×Kw×Ci. The convolution of the input data X with the convolution kernel K yields the output data Y, which is four-dimensional data of size [N Ho Wo Co] and can be represented as N three-dimensional rectangular blocks 530 of size Ho×Wo×Co.
The figure also shows a specific convolution operation example, in which the input data is an input feature map 540 of size 6×6×3 (omitting the N dimension), the convolution kernel is a three-dimensional convolution kernel 550 of size 3×3×3 for a single Co, and the output data is a 4×4 output feature map 560. The specific operation process is as follows:
The convolution kernel 550 sweeps over the input feature map 540 with a certain stride, performing element-wise multiplication and summation of the input features within the convolution window 570 and adding a bias. That is, the value at each position of the output feature map 560 is obtained by performing a two-dimensional convolution of the corresponding block of each input feature map with the corresponding convolution kernel and then summing the results. For example, the figure shows that the value at position (0,0) of the output feature map 560 (i.e., a convolution output point) is obtained by performing a two-dimensional convolution of the convolution window 570, framed by the black cube in the input feature map, with the three-dimensional convolution kernel 550, yielding three values that are then summed to obtain the final value.
To obtain the outputs at other positions, the position of the convolution kernel 550 is moved on the input feature map 540, i.e., the convolution window of the convolution output point is moved. In the example in the figure, the convolution stride (Sw, Sh) is (1,1). After moving one cell to the right in the horizontal (width) direction or one cell down in the vertical (height) direction and performing the convolution operation, the value at position (0,1) or (1,0) of the output feature map 560 is obtained, respectively.
From the above description, in one convolution layer of a neural network, there are N groups of input feature maps, each group containing Hi×Wi×Ci elements, where Hi and Wi are the height and width of the input feature map, respectively, and Ci is the number of input feature maps, also known as the number of input channels. The convolution layer has Ci×Co convolution kernels of size Kh×Kw, where Ci is the number of input channels, Co is the number of output feature maps (or the number of output channels), and Kh and Kw are the height and width of the convolution kernel, respectively. The output feature map contains Ho×Wo×Co elements, where Ho and Wo are the height and width of the output feature map, respectively, and Co is the number of output channels. In addition, the convolution operation also involves the convolution stride (Sw, Sh), whose size affects the size of the output feature map.
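Purely for reference, the shape relationship implied above can be written down as a minimal sketch; the zero-padding parameters are an assumption added here for generality and are not part of the FIG. 5 example:

```python
def conv_output_size(hi, wi, kh, kw, sh, sw, pad_h=0, pad_w=0):
    # Output height/width of a convolution with stride (sh, sw).
    # pad_h / pad_w are hypothetical padding parameters; the FIG. 5
    # example uses no padding.
    ho = (hi + 2 * pad_h - kh) // sh + 1
    wo = (wi + 2 * pad_w - kw) // sw + 1
    return ho, wo

# The 6x6x3 input with a 3x3x3 kernel and stride (1, 1) from FIG. 5
# yields a 4x4 output feature map.
assert conv_output_size(6, 6, 3, 3, 1, 1) == (4, 4)
```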
In this document, input feature map, input data, neuron, and input neuron are used interchangeably; convolution kernel, filter, and weight are used interchangeably; output feature map, output data, and output neuron are used interchangeably.
Exemplary Convolution Operator Compilation Scheme
In intelligent computing systems, programming frameworks encapsulate common operations in neural network model algorithms, such as convolution and pooling, into operators that programmers can call directly. TensorFlow and PyTorch are currently popular deep learning frameworks. In these programming frameworks, a computational graph is usually used to describe the computation process of a machine learning algorithm, tensors are used to represent all the data in the computational graph, and operators are used to represent the various operations.
Regarding the terms "node" and "operator (OP)" mentioned in the present disclosure, it should be noted that the term "operator" is used from the computational perspective (or the software or algorithmic perspective), whereas the term "node" is a more figurative expression (from the graph perspective, or a more intuitive perspective). In terms of what they refer to, the two terms are actually the same. That is, in the present disclosure, the terms "operator" and "node" can be considered to have the same meaning and can be used interchangeably; they merely describe the same thing from different perspectives.
As mentioned above, common deep learning accelerators, constrained by their hardware design, often face computation alignment issues and limited on-chip memory capacity. In the embodiments of the present disclosure, for a larger-scale convolution operator, a scheme for splitting the convolution operator during compilation is proposed, so that the final operation result is obtained through multiple rounds of operations.
FIG. 6 shows an exemplary flowchart of a compilation method for a convolution operator implemented by a processing device according to an embodiment of the present disclosure. The processing device may be, for example, the processing device 203 of FIG. 2.
As shown in the figure, in step 610, a file to be compiled that contains a convolution operator is obtained. In a programming framework, a computational graph is usually used to describe the computation process of a machine learning algorithm, and operators are used to represent the various operations. The convolution operator is included in the computational graph as a computation node for compilation.
Next, in step 620, in response to the size of the weight of the convolution operator exceeding the single-round operation amount of the computing device that is to execute the convolution operator, first-level splitting is performed on the weight of the convolution operator to generate multiple first-level convolution sub-operators.
The single-round operation amount varies with the hardware configuration of the computing device. Specifically, the single-round operation amount of the computing device can be determined based on one or more of the following factors: the number of parallel operation units in the computing device; the size of the on-chip storage space of the computing device; and the single-round operation amount of each parallel operation unit. For example, under the hardware configuration shown in FIG. 3, the computing device is a single-core device and the number of parallel operation units is 1; the on-chip storage space of the computing device is determined by the capacity of the storage module 33; and the single-round operation amount of the parallel operation unit is determined by the computing power of the operation module 32. As another example, under the hardware configuration shown in FIG. 4, the computing device is a multi-core device, and the number of parallel operation units depends on the number of computing clusters and the number of processing cores in each computing cluster; in the example of FIG. 4 there are 4×4=16 parallel operation units (processing cores). The on-chip storage space of the computing device can be determined by jointly considering the capacities of the shared storage core SRAM and the storage modules inside the processor cores, and the single-round operation amount of each parallel operation unit is determined by the computing power of a single processing core.
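Purely as an illustration of how these factors might combine, the following sketch estimates a single-round operation amount; the structure, field names, and numeric values are hypothetical and are not taken from the disclosure:

```python
from dataclasses import dataclass

@dataclass
class DeviceSpec:
    num_parallel_units: int   # e.g. computing clusters x cores per cluster
    on_chip_bytes: int        # usable on-chip storage per operation unit
    unit_round_elems: int     # elements one unit can process in one round

def single_round_capacity(spec: DeviceSpec, elem_bytes: int = 2) -> int:
    # A weight tile must both fit in on-chip storage and stay within the
    # per-unit computational amount; all units work in parallel.
    per_unit = min(spec.unit_round_elems, spec.on_chip_bytes // elem_bytes)
    return per_unit * spec.num_parallel_units

# Hypothetical multi-core device: 4 clusters x 4 cores, 512 KB per core.
spec = DeviceSpec(num_parallel_units=16,
                  on_chip_bytes=512 * 1024,
                  unit_round_elems=256 * 1024)
print(single_round_capacity(spec))
```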
In the embodiments of the present disclosure, when the size of the weight of the convolution operator to be compiled exceeds the single-round operation amount of the computing device that is to execute the convolution operator, the convolution operator can be split directly into multiple convolution sub-operators during compilation, which facilitates subsequent optimization of the computational graph. Specifically, first-level splitting can be performed on the weight of the convolution operator to generate multiple first-level convolution sub-operators.
In some embodiments, the first-level splitting splits the weight into multiple first-level sub-weights along a first channel dimension, each first-level sub-weight corresponding to one first-level convolution sub-operator. The first channel dimension may be the input channel Ci dimension or the output channel Co dimension described above for the convolution operation, as will be detailed later.
Next, in step 630, a first merging operator is generated, which is used to merge the operation results of the multiple first-level convolution sub-operators split out above so as to obtain the final result of the original convolution operator. Since the original convolution operator has been split into multiple convolution sub-operators, the operation results of these sub-operators need to be merged to obtain the final result of the original convolution operator. The specific operation of the first merging operator depends on the manner of the first-level splitting used by these first-level convolution sub-operators, i.e., on the first channel dimension along which the split is made. This will be described later in connection with the specific splitting manners.
Finally, in step 640, the file to be compiled is compiled and optimized based on the multiple first-level convolution sub-operators and the first merging operator to obtain a corresponding binary instruction sequence, which is distributed to the computing device to execute the task corresponding to the original convolution operator.
The original large convolution operator in the computational graph to be compiled has now been split into multiple first-level convolution sub-operators and a first merging operator, so these sub-operators and the first merging operator can be used to replace the original convolution operator, thereby updating the computational graph. Based on the updated computational graph, other compilation optimizations can then be performed, including but not limited to various optimizations that do not involve underlying hardware information, such as pruning, constant folding, arithmetic simplification, and layout optimization, as well as various optimizations related to the hardware architecture, such as operator fusion, weight preloading, and weight residency. The binary instruction sequence generated after compilation can be assigned to the target computing device to execute the task corresponding to the computational graph; the task corresponding to the original convolution operator is then carried out by the split convolution sub-operators, and the final operation result is obtained through multiple rounds of operations.
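Purely for illustration, the compile-time decision in steps 620-630 could be sketched as follows; this is a simplified sketch operating on a plain weight array, and the 256-channel capacity figure and the textual merge tag are hypothetical, not an actual framework API:

```python
import numpy as np

def split_conv_weight(weight, single_round_elems, align=64):
    """Sketch: if the weight is too large for one round, split it along Co
    into aligned sub-weights and record the merge operator needed to
    recover the original result."""
    if weight.size <= single_round_elems:
        return [weight], None                  # no splitting, no merge operator
    co = weight.shape[0]                       # weight layout: [Co, Kh, Kw, Ci]
    per_round_co = max(align, single_round_elems // (weight.size // co))
    per_round_co = (per_round_co // align) * align      # align the granularity
    sub_weights = [weight[i:i + per_round_co] for i in range(0, co, per_round_co)]
    return sub_weights, "concat_along_Co"      # first merging operator: concatenation

# Hypothetical 2048x3x3x2048 weight with a capacity of 256x3x3x2048 elements per round.
w = np.zeros((2048, 3, 3, 2048), dtype=np.int8)
subs, merge = split_conv_weight(w, 256 * 3 * 3 * 2048)
print(len(subs), subs[0].shape, merge)         # 8 (256, 3, 3, 2048) concat_along_Co
```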
Thus, the above describes the compilation scheme for a convolution operator provided by the embodiments of the present disclosure. For a large-scale convolution operator, it splits the operator into smaller convolution sub-operators, which makes it possible to adapt more flexibly to the hardware configuration of the computing device, fully exploit its computing power, and improve overall operation efficiency.
As mentioned above, the first-level splitting splits the weight into multiple first-level sub-weights along a first channel dimension, each first-level sub-weight corresponding to one first-level convolution sub-operator. The first channel dimension can be either the input channel Ci dimension or the output channel Co dimension, so there are two splitting schemes.
FIG. 7 shows a schematic diagram of a convolution operator splitting scheme according to some embodiments of the present disclosure. In these embodiments, the splitting is performed along the output channel Co dimension.
From the convolution operation principle described above with reference to FIG. 5, the operation results along the Co dimension do not need to be accumulated, so the operation results of the convolution sub-operators obtained by splitting along the Co dimension can simply be concatenated to obtain the operation result of the original convolution operator. That is, when the first channel dimension is the output channel Co dimension, the corresponding first merging operator is a concatenation operator, which concatenates the operation results of the multiple first-level convolution sub-operators in the splitting order to obtain the final result of the original convolution operator.
In the example of FIG. 7, a convolution operator with an input neuron of size 1×32×32×2048 (NHWC) and a weight of size 2048×3×3×2048 (CoKxKyCi) is used to illustrate the splitting process.
The left side of FIG. 7 shows the operation of the original convolution operator: the input neuron 710 has a size of 1×32×32×2048, the weight of the convolution operator 720 has a size of 2048×3×3×2048, and the output neuron 730 has a size of 1×32×32×2048.
The right side of FIG. 7 shows the multiple convolution sub-operators 740 and the first merging operator 760 obtained after splitting along the Co dimension. In this example, because the split is along the Co dimension, the first merging operator 760 is a concatenation operator. The input of each convolution sub-operator 740 is still the original input neuron 710, with a size of 1×32×32×2048. The sub-weights of the convolution sub-operators 740 are all of the same size and are evenly split into blocks of 256×3×3×2048, yielding a total of 8 convolution sub-operators. The output results 750 of the convolution sub-operators 740 are also all of the same size, 1×32×32×256. These output results 750 are concatenated by the first merging operator 760 to obtain the final output result 770, whose size is 1×32×32×2048, consistent with the output neuron 730 of the original convolution operator.
As can be seen from the above embodiment, splitting along the Co dimension introduces no additional operations; the operation results of the convolution sub-operators only need to be concatenated. In some implementations, this concatenation can be fused into the write-back of the operation results: when the operation results of the convolution sub-operators are stored back to memory, their storage locations are allocated directly in concatenation order, so that the concatenation is hidden in the write-back operation.
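The equivalence can be checked numerically. Below is a small NumPy sketch using a 1×1 kernel, so that the convolution reduces to a per-pixel matrix multiply over channels; the shapes are scaled down from the 2048-channel example purely for illustration:

```python
import numpy as np

n, h, w, ci, co, splits = 1, 4, 4, 8, 8, 4
x = np.random.rand(n, h, w, ci)
wgt = np.random.rand(co, ci)                 # 1x1 kernel: [Co, Ci]

def conv1x1(x, wgt):
    # Per-pixel channel contraction, equivalent to a 1x1 convolution.
    return np.einsum("nhwc,oc->nhwo", x, wgt)

full = conv1x1(x, wgt)

# Split the weight along Co, run each sub-convolution on the full input,
# then concatenate the results in splitting order (first merging operator).
step = co // splits
parts = [conv1x1(x, wgt[i * step:(i + 1) * step]) for i in range(splits)]
merged = np.concatenate(parts, axis=-1)
assert np.allclose(full, merged)
```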
FIG. 8 shows a schematic diagram of a convolution operator splitting scheme according to other embodiments of the present disclosure. In these embodiments, the splitting is performed along the input channel Ci dimension.
From the convolution operation principle described above with reference to FIG. 5, the operation results along the Ci dimension need to be accumulated, so the operation results of the convolution sub-operators obtained by splitting along the Ci dimension must be accumulated to obtain the operation result of the original convolution operator. That is, when the first channel dimension is the input channel Ci dimension, the corresponding first merging operator is an addition operator, which element-wise accumulates the operation results of the multiple first-level convolution sub-operators to obtain the final result of the original convolution operator.
In the example of FIG. 8, a convolution operator with an input neuron of size 1×32×32×2048 (NHWC) and a weight of size 2048×3×3×2048 (CoKxKyCi) is again used to illustrate the splitting process.
The left side of FIG. 8 shows the operation of the original convolution operator: the input neuron 810 has a size of 1×32×32×2048, the weight of the convolution operator 820 has a size of 2048×3×3×2048, and the output neuron 830 has a size of 1×32×32×2048.
The right side of FIG. 8 shows the multiple convolution sub-operators 840 and the first merging operator 860 obtained after splitting along the Ci dimension. In this example, because the split is along the Ci dimension, the first merging operator 860 is an addition operator. Unlike the Co-dimension split, when the weight is split along the Ci dimension, the operation performed along the Ci dimension is an element-wise multiply-accumulate, so the input neuron also needs to be split along the Ci dimension correspondingly. Therefore, in these embodiments, a splitting operator 880 additionally needs to be generated, which performs on the input neuron of the original convolution operator the same split as that performed on the weight, and the split input neurons are then provided to the corresponding convolution sub-operators 840. As shown in the figure, after passing through the splitting operator 880, the original input neuron 810 yields multiple sub-input neurons 890, which serve as the input neurons of the corresponding convolution sub-operators 840, each with a size of 1×32×32×256.
The sub-weights of the convolution sub-operators 840 are all of the same size and are evenly split along the Ci dimension into blocks of 2048×3×3×256, yielding a total of 8 convolution sub-operators. The output results 850 of the convolution sub-operators 840 are also all of the same size, 1×32×32×2048. These output results 850 are element-wise accumulated by the first merging operator 860 to obtain the final output result 870, whose size is 1×32×32×2048, consistent with the output neuron 830 of the original convolution operator.
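Analogously, the Ci-dimension split can be checked with the same 1×1-kernel sketch: the input is split along Ci together with the weight, and the partial results are accumulated element-wise (shapes again scaled down purely for illustration):

```python
import numpy as np

n, h, w, ci, co, splits = 1, 4, 4, 8, 8, 4
x = np.random.rand(n, h, w, ci)
wgt = np.random.rand(co, ci)                         # 1x1 kernel: [Co, Ci]

def conv1x1(x, wgt):
    # Per-pixel channel contraction, equivalent to a 1x1 convolution.
    return np.einsum("nhwc,oc->nhwo", x, wgt)

full = conv1x1(x, wgt)

# Split both the input and the weight along Ci; each pair yields a partial
# result of full output size, and the first merging operator adds them up.
step = ci // splits
partials = [conv1x1(x[..., i * step:(i + 1) * step],
                    wgt[:, i * step:(i + 1) * step])
            for i in range(splits)]
assert np.allclose(full, sum(partials))
```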
As can be seen from the above embodiment, splitting along the Ci dimension requires introducing an additional addition operation to accumulate the operation results of the convolution sub-operators element-wise, so as to obtain an operation result consistent with that of the original convolution operator. Since this addition is introduced outside the convolution operation, it may cause overflow, so the data needs to be truncated, which may lead to a loss of precision.
In some embodiments, the Ci or Co dimension can be selected for splitting according to the actual precision requirement. For example, when the computation is insensitive to precision or the precision requirement is not high, e.g., below a precision threshold, either the Co dimension or the Ci dimension may be selected for splitting. When the precision requirement is high, e.g., above the precision threshold, the Co dimension is selected for splitting.
In the above splitting, the splitting granularity can be set according to the operational characteristics of the hardware. In some embodiments, when the hardware has operation alignment requirements, the splitting granularity can match the operation alignment requirement of the computing device.
In order to make full use of the bandwidth and match requirements such as the throughput of the operation unit array, some operation units require the input data to be aligned to a specified value, e.g., an alignment value M, and process the data at the granularity of this alignment value M. If the input data does not reach the alignment value, the data is padded, for example with zeros. Depending on the hardware design, M may take different values, for example 64, 128, or 256, and the unit may be a number of bytes or a number of data elements. Furthermore, the operation alignment requirement may have different alignment values for different dimensions. For example, some hardware may require the Ci dimension to be aligned to 64 and/or the Co dimension to be aligned to 64.
It can be understood that when the amount of data satisfies the operation alignment requirement, the operation units work at their highest efficiency and the computing power is fully utilized. Therefore, in some embodiments, the above first-level splitting along the first channel dimension can determine the splitting granularity according to the alignment requirement of that first channel dimension. For example, if the hardware alignment value in the first channel dimension is 64, the splitting granularity can be aligned to 64, i.e., it can be an integer multiple of 64, so that the split data blocks are more conducive to saturating the computing power of the hardware operation units and improving operation efficiency. More specifically, when splitting along the Co dimension, the splitting granularity is aligned to the alignment value of the Co dimension; when splitting along the Ci dimension, the splitting granularity is aligned to the alignment value of the Ci dimension.
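A minimal sketch of choosing such an alignment-friendly split granularity is given below; the alignment value 64 and the per-round limit are example figures only, and actual values depend on the hardware:

```python
def aligned_split(total_channels: int, max_per_split: int, align: int = 64):
    """Return per-split channel counts that are multiples of `align`
    (except possibly the last), each no larger than max_per_split;
    `align` is treated as the minimum granularity."""
    step = max(align, (max_per_split // align) * align)  # round down to a multiple of align
    sizes = []
    remaining = total_channels
    while remaining > 0:
        cur = min(step, remaining)
        sizes.append(cur)
        remaining -= cur
    return sizes

# 2048 output channels, at most 300 channels per round, alignment 64
# -> eight splits of 256 channels each, as in the FIG. 7 example.
print(aligned_split(2048, 300, 64))   # [256, 256, 256, 256, 256, 256, 256, 256]
```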
Since the splitting must be aligned to the operation alignment requirement, the minimum splitting granularity corresponds to the alignment value of the corresponding channel. In some cases, even when the splitting has been performed at the minimum splitting granularity, the size of the resulting sub-weights may still exceed the single-round operation amount of the computing device; in this case, a second-level splitting can be performed.
Specifically, for a first-level convolution sub-operator that requires second-level splitting, the method may further include: in response to the size of the first-level sub-weight of this first-level convolution sub-operator exceeding the single-round operation amount of the computing device, performing second-level splitting on the first-level sub-weight to generate multiple second-level convolution sub-operators, wherein the second-level splitting splits the first-level sub-weight into multiple second-level sub-weights along a second channel dimension, each second-level sub-weight corresponding to one second-level convolution sub-operator; and generating a second merging operator, which is used to merge the operation results of the multiple second-level convolution sub-operators so as to obtain the operation result of the original first-level convolution sub-operator.
Since the first-level splitting has already split the first channel dimension to the minimum acceptable granularity, the second channel dimension split by the second-level splitting is different from the first channel dimension.
In some implementations, if the first channel dimension is the output channel Co dimension, the second channel dimension is the input channel Ci dimension, and the second merging operator is an addition operator, which element-wise accumulates the operation results of the multiple second-level convolution sub-operators to obtain the operation result of the corresponding first-level convolution sub-operator.
In other implementations, if the first channel dimension is the input channel Ci dimension, the second channel dimension is the output channel Co dimension, and the second merging operator is a concatenation operator, which concatenates the operation results of the multiple second-level convolution sub-operators in the splitting order to obtain the operation result of the corresponding first-level convolution sub-operator.
Similar to the first-level splitting, the splitting granularity of the second-level splitting can be determined according to the operation alignment requirement of the computing device for the second channel dimension; that is, the splitting granularity of the second-level splitting can be an integer multiple of the operation alignment value of the second channel dimension. In this way, the split data blocks are more conducive to saturating the computing power of the hardware operation units and improving operation efficiency.
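As an illustration only, one way of picking the second-level granularity, assuming the first level has already reached its minimum Co granularity, could look like the following sketch (all numeric values are hypothetical):

```python
def second_level_granularity(co_min, ci, kh, kw, single_round_elems, align=64):
    """Sketch: with the first level already at its minimum Co granularity
    (co_min), choose a Ci granularity for the second-level split so that
    one sub-weight fits in a single round, aligned to `align`."""
    max_ci = single_round_elems // (co_min * kh * kw)   # largest Ci that still fits
    ci_part = max(align, (max_ci // align) * align)     # round down to alignment
    return min(ci_part, ci)

# Hypothetical: Co already at its minimum granularity of 64, 3x3 kernel,
# single-round capacity of 64*3*3*512 weight elements -> Ci split into 512s.
print(second_level_granularity(64, 2048, 3, 3, 64 * 3 * 3 * 512))   # 512
```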
不同神经网络模型的卷积算子的规模各异,同一神经网络模型中不同卷积算子的规模也可能不同,因此会存在多种拆分方式。而且,即使同一规模的卷积算子也可能存在多种拆分方案。拆分方案的性能可以基于多个因素进行评估。The scale of convolution operators in different neural network models varies, and the scale of different convolution operators in the same neural network model may also be different, so there will be multiple ways to split. Moreover, even convolution operators of the same scale may have multiple splitting schemes. The performance of the splitting scheme can be evaluated based on multiple factors.
On the one hand, as described above, the split granularity can be determined according to the hardware's operation alignment requirement. However, not every convolution operator can be split into exact multiples of the alignment value. When a split data block is not aligned, the data is padded, for example with zeros, and the operation is then performed on the aligned data, which introduces invalid operations and reduces computational efficiency. Therefore, the performance of a splitting scheme with respect to operation alignment can be evaluated based on the amount of padding. For example, the zero-padding amount, the amount of invalid operations, or the invalid-operation ratio can be used as the invalid-operation metric incurred by the splitting scheme due to the operation alignment requirement.
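As one possible way to quantify such a metric (the cost model below is an assumption for illustration, not a formula given in the disclosure), the invalid-operation ratio introduced by padding along one channel dimension could be estimated as follows:

    import math

    def invalid_op_ratio(channel_size, granularity, align):
        """Estimate the fraction of multiply-accumulates wasted on zero padding
        when `channel_size` is split into blocks of `granularity` channels and
        each block is padded up to a multiple of `align`.
        All parameter names are illustrative."""
        blocks = math.ceil(channel_size / granularity)
        padded = 0
        remaining = channel_size
        for _ in range(blocks):
            block = min(granularity, remaining)
            padded += math.ceil(block / align) * align
            remaining -= block
        return (padded - channel_size) / padded

    # Example: splitting 96 channels into blocks of 64 with alignment 64 pads the
    # 32-channel tail block to 64, so 32 of 128 channel slots are wasted (25%).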
On the other hand, to improve processing efficiency, a load-compute-store pipeline is usually used. In such a pipeline, one data block can be loaded while another data block is being computed, saving processing time. If the stages of the pipeline are time-matched, that is, the time spent in each stage does not differ much, the pipeline runs smoothly; otherwise the stages may end up waiting on one another. Therefore, in some embodiments, the splitting scheme may aim to match the loading time of the split sub-weights with the time taken to compute the convolution, which helps exploit the parallel pipelining of weight IO and computation. When the execution times of the sub-operators are relatively uniform, the loading time of the sub-weights can be effectively hidden in the IO-compute pipeline, avoiding an IO bottleneck and better serving the purpose of weight preloading.
During compilation, the computation time of the computing units can be determined from the hardware configuration information of the target computing device; from this, the weight preloading time can be estimated and the split size of the weights determined. Therefore, in some embodiments, the sub-weights may be split so that their sizes are as balanced as possible, making the processing times of successive operations comparable. Further, the loading time of the split weights may be matched as closely as possible to the convolution computation time so that the weight preloading feature can work better. From the perspective of pipelining, the loading time of a subsequent sub-weight matches the computation time of the preceding convolution sub-operator, for example such that the difference between the two is within a predetermined range.
Therefore, the performance of a splitting scheme with respect to weight preloading can be evaluated based on how well the weight loading time matches the convolution computation time. For example, the degree of matching between the loading time of adjacent sub-weights and the computation time of the convolution sub-operators can be used as the performance metric of the splitting scheme for weight preloading.
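A minimal sketch of such a matching metric, assuming simple linear cost models for weight IO and convolution computation (all names and the scoring rule are assumptions for illustration):

    def preload_match_score(sub_weight_bytes, sub_op_macs, io_bytes_per_s, macs_per_s):
        """Score how well each sub-weight's load time hides behind the previous
        sub-operator's compute time; 1.0 means fully hidden on average.
        The linear cost model and parameter names are illustrative assumptions."""
        load_t = [b / io_bytes_per_s for b in sub_weight_bytes]
        comp_t = [m / macs_per_s for m in sub_op_macs]
        # The load of sub-weight i+1 should not exceed the compute of sub-operator i.
        ratios = []
        for i in range(len(load_t) - 1):
            ratios.append(min(comp_t[i] / load_t[i + 1], 1.0) if load_t[i + 1] > 0 else 1.0)
        return sum(ratios) / len(ratios) if ratios else 1.0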
Compilation optimization of the computational graph also involves other optimization measures, including but not limited to operator fusion, IO/compute parallel pipelining, weight residency, and weight preloading. Usually, the effects of these optimizations on the computational graph are considered together in order to select the most suitable combination. Therefore, besides being used on their own to evaluate a splitting scheme, the above performance metrics can also be incorporated into the evaluation of the overall compilation optimization performance, so that the improvement brought by the various optimization measures is considered as a whole. In the overall evaluation, the performance of the candidate optimization schemes can be compared by computing their overall execution times, and the best scheme can then be selected.
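For illustration only, selecting among candidate optimization schemes by estimated overall execution time might then reduce to the following, where the cost-model callback is assumed to be supplied by the compiler rather than prescribed by the disclosure:

    def select_best_scheme(candidate_schemes, estimate_total_time):
        """Return the candidate whose estimated overall execution time is smallest.
        `estimate_total_time` is an assumed cost-model callback that accounts for
        splitting, operator fusion, pipelining, weight residency and preloading."""
        return min(candidate_schemes, key=estimate_total_time)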
Thus, embodiments of the present disclosure provide a compilation method for a convolution operator which, when the scale of the convolution operator exceeds the single-round computation amount of the target computing device, adopts a suitable splitting scheme to split the large convolution operator into multiple small convolution sub-operators, so that the final result is obtained through multiple rounds of computation. The split sub-weights satisfy, as far as possible, the operation alignment requirements of the computing units on the target computing device, so that the computing power of the computing units can be fully utilized. Further, the weight sizes of the split convolution sub-operators are as balanced as possible, and the weight loading time matches the convolution computation time as closely as possible, so that the pipeline runs efficiently and the weight preloading feature can work better. In this way, the weights of the split convolution sub-operators are smaller and more balanced; such sub-operators are more amenable to scheduling across parallel computing units and more easily satisfy the constraints of operation alignment and on-chip memory capacity, which is more conducive to accelerated optimization.
The present disclosure also provides a processing device that can be used to compile a computational graph containing a convolution operator, including: a processor configured to execute program instructions; and a memory configured to store the program instructions, where the program instructions, when loaded and executed by the processor, cause the processor to execute the compilation method described in the embodiments of the present disclosure.
Embodiments of the present disclosure also provide a computer-readable storage medium storing program instructions which, when loaded and executed by a processor, cause the processor to execute the compilation method of the convolution operator described in the embodiments of the present disclosure. Embodiments of the present disclosure also provide a computer program product, including a computer program or instructions which, when executed by a processor, implement the compilation method of the convolution operator described in the embodiments of the present disclosure.
Embodiments of the present disclosure also provide a chip that may include the aforementioned processing device. Further, the present disclosure also provides a board card that may include the aforementioned chip.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a visual terminal, an autonomous-driving terminal, a vehicle, a household appliance, and/or medical equipment. The vehicle includes an airplane, a ship and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment includes a magnetic resonance imaging scanner, an ultrasound scanner and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care and other fields. Further, the electronic device or apparatus of the present disclosure may also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as the cloud, the edge and the terminal. In one or more embodiments, an electronic device or apparatus with high computing power according to the solution of the present disclosure may be applied to a cloud device (for example, a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (for example, a smartphone or a webcam). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, suitable hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, based on the disclosure or teaching of the present disclosure, those skilled in the art will understand that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art will understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for implementing one or more solutions of the present disclosure. In addition, depending on the solution, the descriptions of some embodiments in the present disclosure have different emphases. In view of this, those skilled in the art will understand that, for parts not described in detail in one embodiment of the present disclosure, reference may be made to the related descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teaching of the present disclosure, those skilled in the art will understand that several embodiments disclosed herein may also be implemented in other ways not disclosed herein. For example, the units in the electronic device or apparatus embodiments described above are divided on the basis of logical functions, and other division manners are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connection relationships between different units or components, the connections discussed above in conjunction with the drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection using an interface, where the communication interface may support electrical, optical, acoustic, magnetic or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units. The aforementioned components or units may be located at the same position or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit may exist physically on its own.
In some other implementation scenarios, the above integrated units may also be implemented in the form of hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (for example, the computing apparatus or other processing apparatuses) may be implemented by appropriate hardware processors, such as a CPU, GPU, FPGA, DSP and ASIC. Further, the aforementioned storage unit or storage apparatus may be any appropriate storage medium (including a magnetic storage medium or a magneto-optical storage medium), which may be, for example, a Resistive Random Access Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, and the like.
The embodiments of the present disclosure have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the description of the above embodiments is only intended to help understand the method of the present disclosure and its core idea. Meanwhile, those of ordinary skill in the art may, based on the idea of the present disclosure, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (15)

  1. A compilation method for a convolution operator, implemented by a processing device, comprising:
    obtaining a file to be compiled that contains a convolution operator;
    in response to the scale of the weight of the convolution operator exceeding the single-round computation amount of the computing device that is to execute the convolution operator, performing first-level splitting on the weight to generate a plurality of first-level convolution sub-operators, wherein the first-level splitting splits the weight into a plurality of first-level sub-weights according to a first channel dimension, and each first-level sub-weight corresponds to one first-level convolution sub-operator;
    generating a first merging operator, wherein the first merging operator is used to merge operation results of the plurality of first-level convolution sub-operators to obtain a final result of the convolution operator; and
    performing compilation optimization on the file to be compiled based on the plurality of first-level convolution sub-operators and the first merging operator to obtain a corresponding binary instruction sequence to be distributed to the computing device for executing a task corresponding to the convolution operator.
  2. The compilation method according to claim 1, wherein:
    the first channel dimension is the output channel Co dimension; and
    the first merging operator is a concatenation operator used to concatenate the operation results of the plurality of first-level convolution sub-operators in the splitting order to obtain the final result of the convolution operator.
  3. The compilation method according to claim 1, wherein:
    the first channel dimension is the input channel Ci dimension; and
    the first merging operator is an addition operator used to accumulate, element by element, the operation results of the plurality of first-level convolution sub-operators to obtain the final result of the convolution operator.
  4. The compilation method according to claim 3, further comprising:
    generating a splitting operator for performing the first-level splitting on input neurons of the convolution operator, so as to provide the split input neurons to the corresponding first-level convolution sub-operators.
  5. The compilation method according to any one of claims 1-4, wherein generating the first-level convolution sub-operators further comprises:
    in response to the scale of a first-level sub-weight of a first-level convolution sub-operator exceeding the single-round computation amount of the computing device, performing second-level splitting on the first-level sub-weight to generate a plurality of second-level convolution sub-operators, wherein the second-level splitting splits the first-level sub-weight into a plurality of second-level sub-weights according to a second channel dimension, and each second-level sub-weight corresponds to one second-level convolution sub-operator; and
    generating a second merging operator, wherein the second merging operator is used to merge the operation results of the plurality of second-level convolution sub-operators to obtain the operation result of the first-level convolution sub-operator.
  6. The compilation method according to claim 5, wherein:
    if the first channel dimension is the output channel Co dimension, the second channel dimension is the input channel Ci dimension, and the second merging operator is an addition operator used to accumulate, element by element, the operation results of the plurality of second-level convolution sub-operators to obtain the operation result of the corresponding first-level convolution sub-operator;
    if the first channel dimension is the input channel Ci dimension, the second channel dimension is the output channel Co dimension, and the second merging operator is a concatenation operator used to concatenate the operation results of the plurality of second-level convolution sub-operators in the splitting order to obtain the operation result of the corresponding first-level convolution sub-operator.
  7. The compilation method according to any one of claims 1-6, wherein the single-round computation amount of the computing device is determined based on one or more of the following factors:
    the number of parallel computing units in the computing device;
    the size of the on-chip storage space of the computing device; and
    the single-round computation amount of the parallel computing units.
  8. The compilation method according to any one of claims 1-7, wherein the split granularity of the first-level splitting is determined according to the operation alignment requirement of the computing device on the first channel dimension.
  9. The compilation method according to any one of claims 1-8, wherein the split sizes of the plurality of first-level sub-weights are determined according to the convolution computation time of the computing device, so that the loading time of a subsequent sub-weight matches the computation time of the preceding convolution sub-operator.
  10. The compilation method according to any one of claims 5-6, wherein the split granularity of the second-level splitting is determined according to the operation alignment requirement of the computing device on the second channel dimension.
  11. The compilation method according to any one of claims 5-6 or 10, wherein the split sizes of the plurality of second-level sub-weights are determined according to the convolution computation time of the computing device, so that the loading time of a subsequent sub-weight matches the computation time of the preceding convolution sub-operator.
  12. The compilation method according to any one of claims 5-6 or 10-11, further comprising: evaluating the performance of the first-level splitting and the second-level splitting based on one or more of the following factors:
    an invalid-operation metric caused by the operation alignment requirement; and
    a degree of matching between the loading time of adjacent sub-weights and the computation time of the convolution sub-operators.
  13. A processing device for compiling a computational graph containing a convolution operator, comprising:
    a processor configured to execute program instructions; and
    a memory configured to store the program instructions, wherein the program instructions, when loaded and executed by the processor, cause the processor to execute the compilation method according to any one of claims 1-12.
  14. A computer-readable storage medium storing program instructions which, when loaded and executed by a processor, cause the processor to execute the compilation method according to any one of claims 1-12.
  15. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the compilation method according to any one of claims 1-12.
PCT/CN2024/070133 2023-01-13 2024-01-02 Compilation method for convolution operator, and related product WO2024149112A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310073445.1A CN116090519A (en) 2023-01-13 2023-01-13 Compiling method of convolution operator and related product
CN202310073445.1 2023-01-13

Publications (1)

Publication Number Publication Date
WO2024149112A1 true WO2024149112A1 (en) 2024-07-18

Family

ID=86198919

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/070133 WO2024149112A1 (en) 2023-01-13 2024-01-02 Compilation method for convolution operator, and related product

Country Status (2)

Country Link
CN (1) CN116090519A (en)
WO (1) WO2024149112A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116090519A (en) * 2023-01-13 2023-05-09 上海寒武纪信息科技有限公司 Compiling method of convolution operator and related product

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885700A (en) * 2017-12-29 2018-04-06 中国人民解放军国防科技大学 Multi-core implementation method for large-scale matrix convolution
CN114746868A (en) * 2019-12-20 2022-07-12 华为技术有限公司 Method and apparatus for compiling neural network model
CN115169548A (en) * 2022-06-01 2022-10-11 华为技术有限公司 Tensor-based continuous learning method and device
CN116090519A (en) * 2023-01-13 2023-05-09 上海寒武纪信息科技有限公司 Compiling method of convolution operator and related product

Also Published As

Publication number Publication date
CN116090519A (en) 2023-05-09


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24741111

Country of ref document: EP

Kind code of ref document: A1