CN113821208A - Compiling optimization method and system for deep learning operator - Google Patents

Compiling optimization method and system for deep learning operator

Info

Publication number
CN113821208A
Authority
CN
China
Prior art keywords
operator
compiling
thread
optimization
strategy
Prior art date
Legal status
Pending
Application number
CN202110677399.7A
Other languages
Chinese (zh)
Inventor
胡事民
李相利
梁盾
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110677399.7A priority Critical patent/CN113821208A/en
Publication of CN113821208A publication Critical patent/CN113821208A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning


Abstract

The invention provides a compiling optimization method and system for a deep learning operator, wherein the method comprises the following steps: mapping each order of operators in the meta-operator to corresponding hardware threads through an operator dynamic compiler according to a thread scheduling algorithm; compiling and optimizing the operator mapped to the hardware thread according to the type of the hardware thread and a compilation optimization strategy to obtain a compiled and optimized operator; and compiling the compiled and optimized operator to obtain executable code. By compiling and optimizing the deep learning operator, the invention can greatly improve the performance of the deep learning operator and thereby improve the overall performance of the deep learning framework.

Description

Compiling optimization method and system for deep learning operator
Technical Field
The invention relates to the technical field of deep learning frameworks, in particular to a compiling optimization method and a compiling optimization system for deep learning operators.
Background
A deep learning framework is responsible for training and inference of machine learning models, for managing the large-scale data and models required by deep learning applications, and for scheduling and resource allocation on the underlying computing devices. From the Theano framework introduced in 2008 to Caffe, TensorFlow, and PyTorch, deep learning frameworks have been evolving toward greater ease of use, completeness, and efficiency.
Meanwhile, new deep learning acceleration chips keep appearing, with development progressing from general-purpose chips such as the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU) to special-purpose chips such as the Tensor Processing Unit (TPU) and the Machine Learning Unit (MLU). A complete deep learning framework often needs to support thousands of operators, and each operator needs to be adapted to different hardware. Different hardware architectures therefore pose huge challenges for the migration and optimization of operators, and besides hardware optimization, software optimization is also a key factor for improving performance. However, the existing performance optimization of operators is not comprehensive enough, and the optimization approach needs further improvement.
Therefore, there is a need for a compilation optimization method and system for deep learning operators to solve the above problems.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a compiling optimization method and a compiling optimization system for a deep learning operator.
The invention provides a compiling optimization method for a deep learning operator, which comprises the following steps:
mapping each order of operators in the meta-operators to corresponding hardware threads through an operator dynamic compiler according to a thread scheduling algorithm;
compiling and optimizing the operator mapped to the hardware thread according to the type of the hardware thread and a compiling and optimizing strategy to obtain a compiled and optimized operator;
and compiling to obtain an executable code according to the compiled and optimized operator.
According to the compiling optimization method for the deep learning operator provided by the invention, before mapping each order operator in the meta-operator to the corresponding hardware thread through the operator dynamic compiler according to the thread scheduling algorithm, the method further comprises the following steps:
according to a preset compiling optimization requirement, a thread scheduling algorithm is constructed, wherein the thread scheduling algorithm comprises a search strategy and a manual strategy;
wherein the search strategy is determined based on the actual running speed of the hardware, and the manual strategy is determined based on the hardware overhead.
According to the compiling optimization method for the deep learning operator provided by the invention, the search strategy comprises a brute-force enumeration strategy, a loop-reordering search strategy, and a loop splitting strategy; the manual strategy comprises increasing memory-access continuity, reducing atomic operations, and reducing register pressure.
According to the compiling optimization method for the deep learning operator, the compilation optimization strategy comprises: loop splitting, vectorization, cache hints, and thread allocation.
According to the compiling optimization method for the deep learning operator provided by the invention, the operator mapped to the hardware thread is compiled and optimized according to the type of the hardware thread and the compiling optimization strategy to obtain the compiled and optimized operator, and the compiling optimization method comprises the following steps:
if the type of the hardware thread is the central processing unit, loop splitting, vectorization, and cache hinting are applied in sequence to the operator mapped to the hardware thread, with the goal of maximizing the memory access speed of the central processing unit, to obtain the compiled and optimized operator.
According to the compiling optimization method for the deep learning operator provided by the invention, the operator mapped to the hardware thread is compiled and optimized according to the type of the hardware thread and the compiling optimization strategy to obtain the compiled and optimized operator, and the compiling optimization method further comprises the following steps:
and if the type of the hardware thread is a graphics processor, performing thread allocation on the operator mapped to the hardware thread with the goal of maximizing memory-access continuity, to obtain the compiled and optimized operator.
According to the compiling optimization method for the deep learning operator provided by the invention, the operator mapped to the hardware thread is compiled and optimized according to the type of the hardware thread and the compiling optimization strategy to obtain the compiled and optimized operator, and the compiling optimization method further comprises the following steps:
and if the type of the hardware thread is an artificial intelligence processor, performing loop splitting, vectorization, and thread allocation in sequence on the operator mapped to the hardware thread with the goal of maximizing tensor processing performance, to obtain the compiled and optimized operator.
The invention also provides a compiling optimization system for the deep learning operator, which comprises:
the mapping module is used for mapping each order of operators in the meta-operators to corresponding hardware threads through an operator dynamic compiler according to a thread scheduling algorithm;
the compiling and optimizing module is used for compiling and optimizing the operator mapped to the hardware thread according to the type of the hardware thread and a compiling and optimizing strategy to obtain a compiled and optimized operator;
and the compiling module is used for compiling to obtain the executable code according to the compiled and optimized operator.
The invention further provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the compiling optimization methods for the deep learning operator.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for compilation optimization of deep learning operators as described in any of the above.
The compiling and optimizing method and system for the deep learning operator, provided by the invention, are used for compiling and optimizing the deep learning operator, and can greatly improve the performance of the deep learning operator, so that the overall performance of the whole deep learning framework is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the following briefly introduces the drawings needed for the embodiments or the prior art descriptions, and obviously, the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic flow chart of a compiling optimization method for a deep learning operator according to the present invention;
FIG. 2 is a schematic diagram of a broadcast operator-based CPU optimization strategy provided in the present invention;
FIG. 3 is a schematic diagram of a broadcast operator-based GPU optimization strategy provided in the present invention;
FIG. 4 is a schematic diagram of an MLU optimization strategy based on broadcast operators according to the present invention;
FIG. 5 is a schematic structural diagram of a compiling optimization system for deep learning operators according to the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A complete deep learning framework is often required to support thousands of operators, and each operator needs to be adapted to different hardware. Thus, different hardware architectures present significant challenges to operator migration and optimization. Besides hardware optimization, software optimization is also a key factor for improving the performance of operators. The software optimization strategies of a deep learning framework can be roughly divided as follows, ordered from fine to coarse granularity of influence: 1. compilation optimization, which fully exploits the performance of the underlying hardware by optimizing an operator's execution logic on the hardware; 2. scheduling optimization, which improves system throughput and reduces latency by optimizing the execution order and execution mode of operators; 3. distributed optimization, which improves system scalability by optimizing the distribution of computing tasks and the communication overhead across nodes.
In software optimization, the main objective of compilation optimization is to maximize the performance of the underlying hardware for a given task-specific operator, such as convolution or matrix multiplication. Compilation optimization must account for differences in thread logic, cache structure, and instruction set across hardware, and maximizing performance often requires experienced developers. How to optimize program compilation automatically to maximize performance has therefore become a key research topic. TVM (Tensor Virtual Machine) automatically realizes optimization strategies such as cache reuse and loop tiling by providing a set of operator description languages suited to deep learning; AutoTVM builds on TVM to achieve automatic optimization through machine learning, with results comparable to manual optimization; Ansor uses an evolutionary search algorithm to optimize performance further. MLIR is built on the LLVM infrastructure and designs a deep-learning-specific intermediate representation (IR) on top of LLVM's IR; its distinguishing feature is a set of dialects designed to adapt to the many special cases of deep learning.
Scheduling optimization focuses on the execution order and execution mode of operators and models, and plans resource allocation holistically so that key resources do not become bottlenecks that slow down the whole system. Rammer fuses multiple different operators into one kernel function by abstracting virtual execution units and dispatches the kernel to different hardware threads simultaneously, greatly improving hardware throughput. AntMan improves the utilization of GPU training clusters through targeted GPU memory optimization and scheduling in the deep learning framework. PipeSwitch optimizes multi-task switching with a pipelining algorithm, improving the utilization of the underlying hardware and reducing task-switching overhead to below that of Nvidia's official MPS (Multi-Process Service). Clockwork observes that resource contention during inference makes the latency of some requests excessively high; by introducing a global scheduler it greatly reduces inference latency, and such high-latency requests rarely occur. The present invention optimizes the deep learning operator by combining hardware-specific optimization with compilation optimization in software, thereby improving operator performance and the overall performance of the deep learning framework. A machine learning model trained with the framework after operator optimization can further improve recognition and prediction efficiency in its respective application field.
Fig. 1 is a schematic flow chart of a compiling optimization method for a deep learning operator provided by the present invention, and as shown in fig. 1, the present invention provides a compiling optimization method for a deep learning operator, including:
step 101, mapping each order operator in the meta-operator to a corresponding hardware thread through an operator dynamic compiler according to a thread scheduling algorithm.
In the invention, meta-operators written in Python are used for illustration. The operator dynamic compiler first maps each order of the meta-operator to a corresponding hardware thread (or a model built on hardware threads) using a thread scheduling algorithm, based on the content of the operator. The thread scheduling algorithm performs thread allocation mainly through a search strategy and a manual strategy: the search strategy determines the thread allocation based on the actual running speed on the hardware, through techniques such as brute-force enumeration, loop-reordering search, and loop splitting; the manual strategy completes performance tuning without adding extra hardware overhead, through techniques such as increasing memory-access continuity, reducing atomic operations, and reducing register pressure.
Step 102, compiling and optimizing the operator mapped to the hardware thread according to the type of the hardware thread and a compilation optimization strategy, to obtain a compiled and optimized operator.
Step 103, compiling the compiled and optimized operator to obtain executable code.
In the invention, the operator dynamic compiler performs intelligent optimizing compilation separately for the CPU, the GPU, and the AI tensor processing chip (i.e., an artificial intelligence processor, which includes the TPU and the MLU; the invention is explained using the MLU, and the compilation optimization process for the TPU is similar and can refer to that of the MLU). For the CPU, the operator dynamic compiler mainly applies optimization strategies such as loop splitting, vectorization, and cache hinting to the deep learning operator, so as to maximize the memory access speed of the CPU. For the GPU, the operator dynamic compiler allocates threads and blocks reasonably according to the thread hierarchy of CUDA (Compute Unified Device Architecture), so as to maximize memory-access continuity. For the AI tensor processing chip (MLU), the operator dynamic compiler applies optimization strategies such as loop splitting, vectorization, and thread allocation, so as to maximize tensor processing performance. Finally, the operator dynamic compiler compiles the compiled and optimized operator into executable code for the specific hardware.
The compiling and optimizing method for the deep learning operator, provided by the invention, is used for compiling and optimizing the deep learning operator, and can greatly improve the performance of the deep learning operator, so that the overall performance of the whole deep learning framework is improved.
On the basis of the above embodiment, before mapping each order operator in the meta-operator to a corresponding hardware thread through an operator dynamic compiler according to the thread scheduling algorithm, the method further includes:
according to a preset compiling optimization requirement, a thread scheduling algorithm is constructed, wherein the thread scheduling algorithm comprises a search strategy and a manual strategy;
wherein the search strategy is determined based on the actual running speed of the hardware, and the manual strategy is determined based on the hardware overhead.
On the basis of the above embodiment, the search strategy includes a brute-force enumeration strategy, a loop-reordering search strategy, and a loop splitting strategy; the manual strategy includes increasing memory-access continuity, reducing atomic operations, and reducing register pressure.
In the invention, the thread scheduling algorithm fuses the orders (loop levels) of an operator and maps the operator reasonably onto the hardware thread model. According to the compilation optimization requirement, thread allocation can be realized mainly with two algorithms: the search strategy (when higher optimization performance is required) and the manual strategy (when overhead must be kept low). Specifically, the search strategy determines the thread allocation based on the actual running speed on the hardware, through techniques such as brute-force enumeration, loop-reordering search, and loop splitting. The performance found by a brute-force enumeration search is necessarily optimal, but the search space is relatively large: the number of loop reorderings alone is factorial in the number of loop levels, and once loop splitting is included the total number of candidates can reach tens of thousands; moreover, the search must be repeated for different shapes and sizes, which imposes a huge search overhead on the deep learning framework. When the search cost is too high, the manual strategy can be adopted instead: by increasing memory-access continuity, reducing atomic operations, reducing register pressure, and similar techniques, it can reach about 95% of the performance of the search strategy without any extra overhead. In the invention, the manual strategy is therefore the preferred method; it completes performance tuning without adding extra overhead and realizes the mapping of each order of the meta-operator to the corresponding hardware thread. The three specific manual strategies are described below:
Increasing memory-access continuity: this strategy improves the memory-access efficiency and throughput of operators, is cache friendly, and makes further optimization easier. Different hardware devices call for different allocation strategies: for CUDA, the lowest (innermost) level is preferentially assigned to the inner threads, e.g., the W level in NCHW (N: batch number, C: channel, H: height, W: width) is assigned to threadIdx.x; for the CPU, the top (outermost) level is preferentially assigned to the outer threads, e.g., the N level in NCHW is assigned to the outermost thread. The difference between CUDA and the CPU arises because the number of CUDA threads is far larger than the number of CPU threads and their memory-access models differ greatly: the CPU achieves high-frequency, low-throughput memory access through a multi-level cache, whereas CUDA achieves low-frequency, high-throughput memory access through many cores and a two-level cache, so the two are parallelized at different depths.
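For illustration, the following minimal CUDA sketch (the kernel name, operation, and launch configuration are assumptions for the example, not generated code from the patent) applies this continuity rule to an element-wise operator on an NCHW tensor: the innermost W dimension is given to threadIdx.x so that adjacent threads access adjacent addresses, while the outer N*C*H iterations are given to blocks.

```cuda
__global__ void relu_nchw(const float* __restrict__ in,
                          float* __restrict__ out,
                          int W) {
    long base = (long)blockIdx.x * W;                   // one (n, c, h) slice per block
    for (int w = threadIdx.x; w < W; w += blockDim.x)   // W dimension -> threadIdx.x
        out[base + w] = fmaxf(in[base + w], 0.f);       // adjacent threads, adjacent addresses
}

// launch (illustrative): relu_nchw<<<N * C * H, 256>>>(in, out, W);
```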
Reducing atomic operations: the re-index reduction operator produces atomic operations at certain orders, and atomic operations are expensive on every device. CUDA is relatively tolerant of atomic operations, so it can be accelerated by adjusting the allocation strategy, whereas the CPU's atomic operations are very inefficient; therefore, when the back end finds that a CPU operator contains atomic operations, it simply runs the operator with a single thread to avoid them. For CUDA there are two strategies: if the reduced order is the innermost order, that order is allocated 32 threads (one CUDA warp); if the reduced order is any other order, the number of threads allocated to that order is reduced as far as possible, so the allocation is not optimal, but the total number of threads is still kept saturated (>64k).
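The following CUDA sketch (kernel name and launch shape are illustrative assumptions) shows one way the first rule can be realized: the reduced innermost W dimension is assigned exactly one warp of 32 threads, and the partial sums are combined with a warp shuffle, so no atomic operations are needed at all.

```cuda
__global__ void sum_over_w(const float* __restrict__ in,
                           float* __restrict__ out, int W) {
    long row  = blockIdx.x;                // one output element per block
    int  lane = threadIdx.x;               // 32 threads = one CUDA warp
    float s = 0.f;
    for (int w = lane; w < W; w += 32)     // strided pass over the reduced dimension
        s += in[row * W + w];
    for (int off = 16; off > 0; off >>= 1) // warp-level tree reduction, no atomics
        s += __shfl_down_sync(0xffffffffu, s, off);
    if (lane == 0) out[row] = s;
}

// launch (illustrative): sum_over_w<<<rows, 32>>>(in, out, W);
```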
Reducing register pressure: this strategy merges adjacent orders when conditions permit, e.g., merging H and W. Many deep learning operations are element-wise, such as adding two 4th-order tensors element by element. The framework (Jittor) creates four loops according to the definition of the meta-operator, but so many loop variables are not actually needed; a large number of loop variables only consumes more register resources and makes thread allocation harder. Therefore, the back end compresses unnecessary loop variables: for example, when two 4th-order tensors are added, if the tensors' memory is contiguous, the back end directly compresses the four loop variables into one. The re-index operator and the re-index reduction operator also use this strategy.
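A minimal CUDA sketch of this loop collapsing (names are illustrative; it assumes both input tensors are contiguous in memory, as stated above):

```cuda
// When the tensors are contiguous, the element-wise add over an N x C x H x W
// tensor needs only a single linear index instead of four loop variables.
__global__ void add4d_collapsed(const float* __restrict__ a,
                                const float* __restrict__ b,
                                float* __restrict__ c,
                                long total /* = N * C * H * W */) {
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < total)
        c[i] = a[i] + b[i];   // one loop variable instead of four
}
```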
The invention realizes the optimal parallelization of operators by utilizing a thread scheduling algorithm, and greatly improves the execution efficiency of the operators.
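To make the search strategy described above concrete, the following host-side sketch (plain C++ that also compiles as CUDA host code; the tensor sizes and the measured loop body are assumptions for the example) brute-force enumerates every ordering of a three-level loop nest for an element-wise addition, times each ordering on the actual hardware, and keeps the fastest, which is the essence of the brute-force enumeration and loop-reordering search.

```cuda
#include <algorithm>
#include <array>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 32, H = 256, W = 256;
    std::vector<float> A((size_t)N * H * W, 1.f), B(A.size(), 2.f), C(A.size());
    const int  dims[3]    = {N, H, W};
    const long strides[3] = {(long)H * W, W, 1};

    std::array<int, 3> perm = {0, 1, 2}, best_perm = perm;
    double best_ms = 1e30;
    do {  // enumerate all 3! loop orderings and time each candidate
        auto t0 = std::chrono::steady_clock::now();
        for (int i0 = 0; i0 < dims[perm[0]]; ++i0)
            for (int i1 = 0; i1 < dims[perm[1]]; ++i1)
                for (int i2 = 0; i2 < dims[perm[2]]; ++i2) {
                    long off = i0 * strides[perm[0]] + i1 * strides[perm[1]]
                             + i2 * strides[perm[2]];
                    C[off] = A[off] + B[off];
                }
        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - t0).count();
        if (ms < best_ms) { best_ms = ms; best_perm = perm; }
    } while (std::next_permutation(perm.begin(), perm.end()));

    std::printf("fastest loop order: %d %d %d (%.3f ms)\n",
                best_perm[0], best_perm[1], best_perm[2], best_ms);
    return 0;
}
```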
On the basis of the above embodiment, the compilation optimization strategy includes: loop splitting, vectorization, cache hints, and thread allocation.
In the invention, compilation optimization is realized through optimization strategies such as loop splitting, vectorization, cache hints, and thread allocation. For the CPU, the operator dynamic compiler mainly applies loop splitting, vectorization, and cache hints to the deep learning operator to maximize the CPU's memory access speed: the loop splitting strategy changes the memory access order by splitting one loop into several loops; the vectorization strategy uses a single instruction to access multiple data, e.g., Intel's AVX instruction set can access eight 32-bit floating-point numbers at a time; and the cache hint strategy maximizes cache utilization by using instructions that act as cache hints.
For the GPU, the operator dynamic compiler allocates threads and blocks reasonably according to CUDA's thread hierarchy to maximize memory-access continuity: it assigns the same contiguous row of memory to the same block. When the number of elements is too large and exceeds the upper limit on the total number of threads, loop splitting is applied.
For the AI tensor processing chip (MLU), the operator dynamic compiler applies loop splitting, vectorization, and thread allocation to the deep learning operator to maximize tensor processing performance: the loop splitting strategy splits loops so that the subsequent vectorization can proceed smoothly; the vectorization strategy vectorizes whole blocks of memory; and the thread allocation strategy improves operator performance by running the MLU's multiple clusters and multiple cores concurrently.
The invention performs compilation optimization for specific hardware, so it can adapt to a variety of computing hardware and bring out the maximum performance of that hardware.
On the basis of the foregoing embodiment, the performing compilation optimization on the operator mapped to the hardware thread according to the type of the hardware thread and a compilation optimization strategy to obtain a compiled and optimized operator includes:
if the type of the hardware thread is the central processing unit, loop splitting, vectorization, and cache hinting are applied in sequence to the operator mapped to the hardware thread, with the goal of maximizing the memory access speed of the central processing unit, to obtain the compiled and optimized operator.
In the invention, regarding compilation optimization for the CPU, deep learning operators place high demands on memory access speed, and on the CPU the factors that most affect memory access speed are the cache hit rate and the CPU's memory prefetching strategy. Fig. 2 is a schematic diagram of the CPU optimization strategy based on the broadcast operator provided by the present invention. Referring to Fig. 2, this embodiment takes the broadcast operator as an example to show three optimization strategies: 1. loop splitting: splitting one loop into several loops to change the memory access order; 2. vectorization: using a single instruction to access multiple data, e.g., Intel's AVX instruction set can access eight 32-bit floating-point numbers at a time; 3. cache hints: maximizing cache utilization by using instructions that act as cache hints.
Specifically, as shown in Fig. 2, take the CPU's row-by-row traversal of the broadcast operator as an example. In Fig. 2(a), the solid arrow from top to bottom indicates that each element of the vector is broadcast to the corresponding column of the matrix; a square in the vector represents one instruction, which starts from the first square of the vector, reads the vector elements one by one, and writes them into the matrix; the dashed arrows in the matrix indicate the traversal order, i.e., only after one row of the matrix has been fully broadcast does the broadcast of the next row begin. In Fig. 2(b), loop splitting is applied and the traversal order changes: instead of broadcasting a whole row at a time, the left half of the matrix is broadcast first and then the right half. This optimization helps when the length of the broadcast vector exceeds the L1 cache size, because loop splitting guarantees that the portion of the broadcast vector currently in use stays resident in the L1 cache and does not have to be fetched from main memory.
Further, referring to Fig. 2(c), vectorization is the second important optimization for CPU compilation. A conventional scalar instruction can read or write only one element at a time, whereas a vector instruction set such as AVX can read or write eight elements at a time. In Fig. 2(c), every two squares form a group, indicating that two elements are read and written at a time. However, vector instructions come with an alignment constraint: addresses must be aligned, otherwise performance degrades. In the framework implementation, memory allocation functions such as aligned_alloc are used to guarantee that memory addresses are aligned.
Further, cache hints are the third important optimization for CPU compilation. Broadcast operators have an important property: the input is typically read many times, while the output is written only once. Referring to Fig. 2(d), for such operators the framework can keep the input resident in the cache for a long time, while the output does not need to be resident. Therefore, the back-end compiler can use non-temporal vector write instructions to hint to the CPU that the output does not need to be kept in the cache and can instead be written directly to main memory. This optimization ensures that the input vector stays in the cache for a long time and is not evicted by the output, greatly improving cache utilization.
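The three CPU strategies can be combined in a single routine. The following host-side sketch (plain C++ with AVX intrinsics compiled as host code; the function name, block size, and alignment assumptions are illustrative, not the patent's generated code) broadcasts a length-W vector across the H rows of an H x W matrix using loop splitting over the columns, AVX vectorization, and non-temporal stores as the cache hint. It assumes out is 32-byte aligned (e.g., from aligned_alloc) and W is a multiple of 8; otherwise regular stores should be used.

```cuda
#include <immintrin.h>

void broadcast_rows_cpu(const float* vec, float* out, long H, long W) {
    const long BLOCK = 1024;                    // assumed L1-friendly column block
    for (long c0 = 0; c0 < W; c0 += BLOCK) {    // loop splitting over columns
        long c1 = c0 + BLOCK < W ? c0 + BLOCK : W;
        for (long h = 0; h < H; ++h) {
            long c = c0;
            for (; c + 8 <= c1; c += 8) {       // vectorization: 8 floats per AVX op
                __m256 v = _mm256_loadu_ps(vec + c);
                _mm256_stream_ps(out + h * W + c, v);  // cache hint: non-temporal store
            }
            for (; c < c1; ++c)                 // scalar tail
                out[h * W + c] = vec[c];
        }
    }
    _mm_sfence();                               // make the streaming stores visible
}
```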
On the basis of the foregoing embodiment, the compiling and optimizing an operator mapped to a hardware thread according to a type of the hardware thread and a compiling and optimizing policy to obtain a compiled and optimized operator further includes:
and if the type of the hardware thread is a graphics processor, performing thread allocation on the operator mapped to the hardware thread with the goal of maximizing memory-access continuity, to obtain the compiled and optimized operator.
In the invention, the broadcast operator is again taken as the example for GPU compilation optimization. The GPU optimization strategy for the broadcast operator is simpler than that of the CPU, and the key point is the thread allocation strategy. The CUDA thread hierarchy has two levels, thread and block: the number of threads per block cannot exceed 1024, a block is the structure above threads, and threads in the same block execute on the same physical streaming multiprocessor. Fig. 3 is a schematic diagram of the GPU optimization strategy based on the broadcast operator provided by the present invention. As shown in Fig. 3, different columns of the matrix are allocated to threads and different rows are allocated to blocks. This is critical: if the assignment is reversed, performance drops sharply, because threads in the same block need to satisfy memory-access continuity as much as possible, so the same memory-contiguous row must be allocated to the same block. When the number of elements is too large and exceeds the total number of threads, loop splitting must be applied according to the actual situation.
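A minimal CUDA sketch of the allocation shown in Fig. 3 (kernel name and launch configuration are illustrative): columns map to threadIdx.x and rows map to blockIdx.x, so each block walks one memory-contiguous row and the accesses of neighbouring threads coalesce; the inner loop also covers the loop-splitting case where W exceeds the block size.

```cuda
__global__ void broadcast_rows_gpu(const float* __restrict__ vec,
                                   float* __restrict__ out,
                                   int W) {
    long row = blockIdx.x;                                   // one matrix row per block
    for (int col = threadIdx.x; col < W; col += blockDim.x)  // columns -> threads
        out[row * W + col] = vec[col];
}

// launch (illustrative): broadcast_rows_gpu<<<H, 256>>>(vec, out, W);
```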
On the basis of the foregoing embodiment, the compiling and optimizing an operator mapped to a hardware thread according to a type of the hardware thread and a compiling and optimizing policy to obtain a compiled and optimized operator further includes:
and if the type of the hardware thread is an artificial intelligence processor, performing loop splitting, vectorization, and thread allocation in sequence on the operator mapped to the hardware thread with the goal of maximizing tensor processing performance, to obtain the compiled and optimized operator.
In the invention, the optimization strategy for the AI tensor processing chip (MLU) is guided mainly by maximizing the chip's performance; the MLU is used for explanation. The MLU and the GPU have a similar two-level thread structure: the GPU is divided into threads and blocks, and the MLU is divided into cores and clusters. The magnitudes differ, however: one GPU block can contain at most 1024 threads, whereas one MLU cluster contains 4 cores. A single MLU core is also more capable than a single GPU thread: a GPU thread can only operate on scalars, while an MLU core can operate on a vector or even a matrix at a time. The ALUs in the MLU, one per core and also called Function Units (FUs), have more powerful vector and matrix capabilities than the ALUs in the CPU and GPU. Each ALU can access the NRAM storage inside its core, the SRAM is shared among the cores of a cluster, and the MLU main memory is shared among the clusters. Therefore, making full use of the ALU's vector and matrix operations and of the NRAM and SRAM bandwidths is the key to optimizing MLU operators. Fig. 4 is a schematic diagram of the MLU optimization strategy based on the broadcast operator provided by the present invention. As shown in Fig. 4, the optimization strategy of the broadcast operator on the MLU is: 1. loop splitting: splitting the loop so that the subsequent vectorization can proceed smoothly; 2. vectorization: vectorizing whole blocks of memory; 3. thread allocation: running the MLU's multiple clusters and multiple cores concurrently.
Specifically, the purpose of the MLU's loop splitting in this embodiment is similar to that of the CPU's loop splitting in the above embodiments: on the CPU it is to make maximal use of the L1 cache, while on the MLU it is to fully exploit the ALU's performance on whole vector segments. The vector length the ALU can operate on at once is much larger than on the CPU and can reach the thousands, but there is still an upper limit, so loop splitting ensures that the length of the subsequent vectorization stays within the ALU's limits. After loop splitting, the deep learning framework can map the innermost loops of the broadcast operator directly onto the MLU's vectorized operations. In the actual implementation, __memcpy is used to move vectors between the MLU main memory and the NRAM. Finally, the thread allocation strategy runs the MLU's multiple cores concurrently. Unlike the CUDA thread allocation strategy, the MLU allocates the same four columns to the same core so that the broadcast vector resides in the NRAM as much as possible, because each core's NRAM is private; if the threads were split by rows, the broadcast vector would be loaded repeatedly, reducing performance.
Fig. 5 is a schematic structural diagram of the compiling optimization system for a deep learning operator provided by the present invention. As shown in Fig. 5, the present invention provides a compiling optimization system for a deep learning operator, which includes a mapping module 501, a compilation optimization module 502, and a compiling module 503, where the mapping module 501 is configured to map each order of operators in a meta-operator to a corresponding hardware thread through an operator dynamic compiler according to a thread scheduling algorithm; the compilation optimization module 502 is configured to compile and optimize the operator mapped to the hardware thread according to the type of the hardware thread and a compilation optimization strategy to obtain a compiled and optimized operator; and the compiling module 503 is configured to compile the compiled and optimized operator into executable code.
The compiling and optimizing system for the deep learning operator, provided by the invention, is used for compiling and optimizing the deep learning operator, and can greatly improve the performance of the deep learning operator, so that the overall performance of the whole deep learning framework is improved.
The system provided by the present invention is used for executing the above method embodiments, and for the specific processes and details, reference is made to the above embodiments, which are not described herein again.
Fig. 6 is a schematic structural diagram of an electronic device provided by the present invention. As shown in Fig. 6, the electronic device may include: a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 communicate with one another through the communication bus 604. The processor 601 may invoke logic instructions in the memory 603 to perform the compilation optimization method for a deep learning operator, the method comprising: mapping each order of operators in the meta-operator to corresponding hardware threads through an operator dynamic compiler according to a thread scheduling algorithm; compiling and optimizing the operator mapped to the hardware thread according to the type of the hardware thread and a compilation optimization strategy to obtain a compiled and optimized operator; and compiling the compiled and optimized operator to obtain executable code.
Furthermore, the logic instructions in the memory 603 may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions, when executed by a computer, the computer being capable of executing the compilation optimization method for deep learning operators provided by the above methods, the method including: mapping each order of operators in the meta-operators to corresponding hardware threads through an operator dynamic compiler according to a thread scheduling algorithm; compiling and optimizing the operator mapped to the hardware thread according to the type of the hardware thread and the compiling and optimizing strategy to obtain a compiled and optimized operator; and compiling to obtain an executable code according to the compiled and optimized operator.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to execute the compiling optimization method for a deep learning operator provided in the foregoing embodiments, the method including: mapping each-order operator in the meta-operator to a corresponding hardware thread through an operator dynamic compiler according to a thread scheduling algorithm; compiling and optimizing the operator mapped to the hardware thread according to the type of the hardware thread and a compiling and optimizing strategy to obtain a compiled and optimized operator; and compiling to obtain an executable code according to the compiled and optimized operator.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement the present invention without any inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may be modified or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A compilation optimization method for a deep learning operator, comprising:
mapping each order of operators in the meta-operators to corresponding hardware threads through an operator dynamic compiler according to a thread scheduling algorithm;
compiling and optimizing the operator mapped to the hardware thread according to the type of the hardware thread and a compiling and optimizing strategy to obtain a compiled and optimized operator;
and compiling to obtain an executable code according to the compiled and optimized operator.
2. The compilation optimization method for deep learning operators according to claim 1, wherein before mapping each order of operators in meta-operators to corresponding hardware threads by an operator dynamic compiler according to a thread scheduling algorithm, the method further comprises:
according to a preset compiling optimization requirement, a thread scheduling algorithm is constructed, wherein the thread scheduling algorithm comprises a search strategy and a manual strategy;
wherein the search strategy is determined based on the actual running speed of the hardware, and the manual strategy is determined based on the hardware overhead.
3. The compilation optimization method for the deep learning operator according to claim 2, wherein the search strategy includes a brute-force enumeration strategy, a loop-reordering search strategy, and a loop splitting strategy; and the manual strategy includes increasing memory-access continuity, reducing atomic operations, and reducing register pressure.
4. The compilation optimization method for deep learning operators according to claim 1, wherein the compilation optimization strategy comprises: loop splitting, vectorization, cache hints, and thread allocation.
5. The compiling optimization method for the deep learning operator according to claim 4, wherein the compiling optimization of the operator mapped to the hardware thread according to the type of the hardware thread and the compiling optimization strategy to obtain a compiled optimized operator comprises:
if the type of the hardware thread is the central processing unit, sequentially performing loop splitting, vectorization, and cache hinting on the operator mapped to the hardware thread with the goal of maximizing the memory access speed of the central processing unit, to obtain the compiled and optimized operator.
6. The compiling optimization method for a deep learning operator according to claim 4, wherein the compiling optimization is performed on the operator mapped to the hardware thread according to the type of the hardware thread and a compiling optimization strategy to obtain a compiled optimized operator, further comprising:
and if the type of the hardware thread is a graphics processor, performing thread allocation on the operator mapped to the hardware thread with the goal of maximizing memory-access continuity, to obtain the compiled and optimized operator.
7. The compiling optimization method for a deep learning operator according to claim 4, wherein the compiling optimization is performed on the operator mapped to the hardware thread according to the type of the hardware thread and a compiling optimization strategy to obtain a compiled optimized operator, further comprising:
and if the type of the hardware thread is an artificial intelligence processor, sequentially performing loop splitting, vectorization, and thread allocation on the operator mapped to the hardware thread with the goal of maximizing tensor processing performance, to obtain the compiled and optimized operator.
8. A compilation optimization system for deep learning operators, comprising:
the mapping module is used for mapping each order of operators in the meta-operators to corresponding hardware threads through an operator dynamic compiler according to a thread scheduling algorithm;
the compiling and optimizing module is used for compiling and optimizing the operator mapped to the hardware thread according to the type of the hardware thread and a compiling and optimizing strategy to obtain a compiled and optimized operator;
and the compiling module is used for compiling to obtain an executable code according to the compiled and optimized operator.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the compilation optimization method for a deep learning operator according to any one of claims 1 to 7 when executing the computer program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the compilation optimization method for a deep learning operator according to any one of claims 1 to 7.
CN202110677399.7A 2021-06-18 2021-06-18 Compiling optimization method and system for deep learning operator Pending CN113821208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110677399.7A CN113821208A (en) 2021-06-18 2021-06-18 Compiling optimization method and system for deep learning operator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110677399.7A CN113821208A (en) 2021-06-18 2021-06-18 Compiling optimization method and system for deep learning operator

Publications (1)

Publication Number Publication Date
CN113821208A true CN113821208A (en) 2021-12-21

Family

ID=78923995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110677399.7A Pending CN113821208A (en) 2021-06-18 2021-06-18 Compiling optimization method and system for deep learning operator

Country Status (1)

Country Link
CN (1) CN113821208A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023221406A1 (en) * 2022-05-19 2023-11-23 北京百度网讯科技有限公司 Method and apparatus for operating deep learning compiler, and electronic device
CN115186813A (en) * 2022-07-12 2022-10-14 上海人工智能创新中心 Method for expressing and fusing tensor reference operator in deep learning compiler
CN115495095A (en) * 2022-11-18 2022-12-20 上海燧原科技有限公司 Whole program compiling method, device, equipment, medium and cluster of tensor program
CN115495095B (en) * 2022-11-18 2023-03-21 上海燧原科技有限公司 Whole program compiling method, device, equipment, medium and cluster of tensor program
CN116954721A (en) * 2023-09-20 2023-10-27 天津南大通用数据技术股份有限公司 Asynchronous non-blocking splitting method for multi-modal operator of actuator
CN116954721B (en) * 2023-09-20 2023-12-15 天津南大通用数据技术股份有限公司 Asynchronous non-blocking splitting method for multi-modal operator of actuator
CN117648091A (en) * 2023-12-12 2024-03-05 上海寒武纪信息科技有限公司 Compiling method of calculation graph and related product


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination