WO2024039923A1 - Method of compile-time optimization for nested parallel for-loops for deep learning neural network computation

Info

Publication number: WO2024039923A1
Application number: PCT/US2023/067621
Authority: WO
Inventors: Yijie MEI; Zhennan Qin; Jianhui Li
Original assignee: Intel Corporation
Priority date: 2022-08-19
Filing date: 2023-05-30
Publication date: 2024-02-22

Abstract

A computing system includes memory circuitry to store instructions and a deep neural network (DNN) computation subgraph; and a processor coupled to the memory circuitry to execute the instructions to transform a current operation of the DNN computation subgraph to a nested parallel-for loop instruction for the current operation; block the nested parallel-for loop instruction to create a one-to-one mapping between parallel subtasks of the nested parallel-for loop instruction with threads; and mark a parallel-for loop instruction of the nested parallel-for loop instruction of the current operation and a parallel-for loop instruction of a next operation of the DNN computation subgraph as linkable if both the current operation and the next operation are parallelized along a same data dimension at a top level of the DNN computation subgraph and with a same blocking factor.

Description

METHOD OF COMPILE-TIME OPTIMIZATION FOR NESTED PARALLEL FOR-LOOPS FOR DEEP LEARNING NEURAL NETWORK COMPUTATION

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of Patent Cooperation Treaty (PCT) Patent Application No. PCT/CN2022/113638, filed August 19, 2022, which is incorporated by reference herein in its entirety.

FIELD OF THE DISCLOSURE

[0002] This disclosure relates generally to compilers in computing systems, and more particularly, to compile-time optimizations for deep learning neural network computations in deep learning compilers in computing systems.

BACKGROUND

[0003] A deep learning (DL) compiler generates code for neural network (NN) computations. For multi-core processors, the DL compiler needs to decompose the NN computations into multiple subtasks and submit the subtasks to multiple processor cores for execution. The software abstraction of multi-core processor hardware is often represented as a thread pool interface, which allows the DL compiler to submit parallel subtasks to the cores. The thread pool interface can be implemented using runtime libraries, such as Open Multi-Processing (OpenMP), Threading Building Blocks (TBB), and Eigen (a C++ library), for example. Existing DL compilers don’t exploit the opportunity of improving runtime performance by identifying affinities between subtasks and software (SW) threads, and the thread pool interface doesn’t allow the DL compiler to control the mapping of subtasks to threads.

[0004] Most DL compilers and runtime libraries generate one or more parallel sections for each single operation and perform global synchronization between each parallel section of the operation. Thread pools do not guarantee the affinity of a task (or subtask) to any thread. It is possible that two subtasks from different parallel sections which share the same memory accesses will be dispatched to different threads, resulting in bad cache reuse. As processor core counts increase, global synchronization costs increase. This becomes a performance bottleneck for executing neural networks on computing systems having processors with increasing numbers of cores.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] Figure 1 is a block diagram illustrating an example computing system environment according to an implementation.

[0006] Figure 2 illustrates an example of two deep neural network (DNN) operations and the data movement between subtasks from the two operations in an implementation.

[0007] Figure 3 illustrates cross-operation subtask affinity (CSA) processing, in an implementation.

[0008] Figures 4A and 4B are flow diagrams of cross-operation subtask affinity (CSA) processing, in an implementation.

[0009] Figure 5 is a block diagram of an example processor platform structured to execute and/or instantiate the machine-readable instructions and/or operations of Figures 1-4 to implement the apparatus discussed with reference to Figures 1-4.

[0010] Figure 6 is a block diagram of an example implementation of the processor circuitry of Figure 5.

[0011] Figure 7 is a block diagram of another example implementation of the processor circuitry of Figure 5.

[0012] Figure 8 is a block diagram illustrating an example software distribution platform to distribute software such as the example machine readable instructions of Figure 5 to hardware devices owned and/or operated by third parties.

[0013] The figures are not to scale. In general, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts.

DETAILED DESCRIPTION

[0014] The technology described herein provides a method, system and apparatus for DL compiler optimization, named cross-operation subtask affinity (CSA) herein, to assign subtasks from consecutive deep neural network (DNN) operations to specific SW threads. CSA links subtasks from multiple DNN operations as linked subtasks and uses a thread pool interface to dispatch the linked subtasks to threads. CSA ensures each linked subtask is bound to one thread, which helps reduce the data movement between subtasks from consecutive DNN operations. CSA further groups threads to limit the synchronization within thread groups and performs a cleanup function only for whole linked subtasks.

[0015] In the following detailed description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific examples that may be practiced. These examples are described in sufficient detail to enable one skilled in the art to practice the subject matter, and it is to be understood that other examples may be utilized and that logical, mechanical, electrical and/or other changes may be made without departing from the scope of the subject matter of this disclosure. The following detailed description is, therefore, provided to describe example implementations and not to be taken as limiting on the scope of the subject matter described in this disclosure. Certain features from different aspects of the following description may be combined to form yet new aspects of the subject matter discussed below.

[0016] As used herein, connection references (e.g., attached, coupled, connected, and joined) may include intermediate members between the elements referenced by the connection reference and/or relative movement between those elements unless otherwise indicated. As such, connection references do not necessarily infer that two elements are directly connected and/or in fixed relation to each other. As used herein, stating that any part is in “contact” with another part is defined to mean that there is no intermediate part between the two parts.

[0017] Unless specifically stated otherwise, descriptors such as “first,” “second,” “third,” etc., are used herein without imputing or otherwise indicating any meaning of priority, physical order, arrangement in a list, and/or ordering in any way, but are merely used as labels and/or arbitrary names to distinguish elements for ease of understanding the disclosed examples. In some examples, the descriptor “first” may be used to refer to an element in the detailed description, while the same element may be referred to in a claim with a different descriptor such as “second” or “third.” In such instances, it should be understood that such descriptors are used merely for identifying those elements distinctly that might, for example, otherwise share a same name. As used herein, “approximately” and “about” refer to dimensions that may not be exact due to manufacturing tolerances and/or other real-world imperfections.

[0018] As used herein, “processor” or “processing device” or “processor circuitry” or “hardware resources” are defined to include (i) one or more special purpose electrical circuits structured to perform specific operation(s) and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors), and/or (ii) one or more general purpose semiconductor-based electrical circuits programmed with instructions to perform specific operations and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors). Examples of processor circuitry include programmed microprocessors, Field Programmable Gate Arrays (FPGAs) that may instantiate instructions, Central Processor Units (CPUs), Graphics Processor Units (GPUs), Digital Signal Processors (DSPs), XPUs, or microcontrollers and integrated circuits such as Application Specific Integrated Circuits (ASICs). For example, an XPU may be implemented by a heterogeneous computing system including multiple types of processor circuitry (e.g., one or more FPGAs, one or more CPUs, one or more GPUs, one or more DSPs, etc., and/or a combination thereof) and application programming interface(s) (API(s)) that may assign computing task(s) to whichever one(s) of the multiple types of the processing circuitry is/are best suited to execute the computing task(s). As used herein, a device may comprise processor circuitry or hardware resources.

[0019] As used herein, a computing system can be, for example, a server, a disaggregated server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cell phone, a smart phone, a tablet (such as an iPad™)), a personal digital assistant (PDA), an Internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a gaming console, a personal video recorder, a set top box, a headset (e.g., an augmented reality (AR) headset, a virtual reality (VR) headset, etc.) or other wearable device, an electronic voting machine, or any other type of computing device.

[0020] As used herein, a compiler is a first computer program executed by a processor that converts instructions of a second computer program into machine code or a lower-level form so that the second computer program instructions can be read and executed by a processor.

[0021] In the following description, numerous specific details are set forth, such as specific interfaces, primitives, specific operations and sequences of operations, and the like. However, embodiments may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail to avoid obscuring the understanding of the description.

[0022] The CSA described herein exploits a first opportunity to optimize the cache data movement between two compute-intensive DNN operations (such as two matrix multiplication (MATMUL) operations). In the scenario where the second operation consumes the output from the first operation, it is possible and desirable to arrange the decomposition of both operations so that a subset of the second operation’s subtasks only consumes output from a corresponding subset of the first operation’s subtasks. Since the data produced and consumed are limited to the two subsets, identifying and assigning an affinity of these two subsets to a same thread (or a group of threads) may significantly reduce the data movement across a cache (e.g., a level two (L2) cache).

[0023] The CSA described herein also exploits a second opportunity to reduce synchronization overhead between subtasks. The traditional DL compiler inserts a global synchronization between two Parallel-For loops. In some scenarios, this global synchronization is known as a global barrier. As used herein, a Parallel-For loop is a programming “for” loop in which iterations of the loop may be executed by the processor in parallel. The CSA of the DL compiler described herein ensures that only a subset of the first operation’s subtasks is consumed by a subset of the second operation’s subtasks, and these two subsets are mapped to the same thread (or group of threads), and then synchronization is only performed within the thread group.

[0024] The CSA described herein also exploits a third opportunity to reduce the overhead of managing resources for accelerating processor hardware. When a processor implements matrix multiply accelerator circuitry (such as a tile matrix multiply unit (TMUL)), the CSA notifies the operating system (OS) to release resources so that the OS doesn’t have to maintain the tile register in context during a context switch. Without knowing the thread to which a subtask is bound, a DL compiler releases resources at the end of every subtask, although the release otherwise only needs to be performed at the end of a sequence of subtasks created from multiple operations, before exiting the thread.

[0025] Figure 1 is a block diagram illustrating an example computing system environment 100 according to an implementation. A deep learning (DL) model 102 may include one or more tensors and operations. In an implementation, DL model 102 may be code provided by a user in a programming language such as Python. DL model 102 is input to artificial intelligence (AI) framework 104. AI framework 104 may include a SW library of procedure calls and/or application programming interfaces (APIs) to implement operations of DL model 102. Examples of AI framework 104 include PyTorch (a machine learning (ML) framework based on the Torch library, used for applications such as computer vision and natural language processing) and TensorFlow (a SW library for machine learning and artificial intelligence used across a range of tasks but with a particular focus on training and inference of deep neural networks). AI framework 104 may represent the DL model as a graph intermediate representation (IR) 106. In an implementation, graph IR 106 may be a deep neural network (DNN) computation graph.

[0026] Graph IR 106 uses graph, logical tensor, and operations to describe a computational graph. A graph contains a set of operations and logical tensors. Each operation represents a computation in a computation graph. A logical tensor represents the tensor’s metadata, such as the element’s data type, shape, and memory layout. An operation includes a kind, category, attributes and logical tensors for inputs and outputs.

DL compiler 108 comprises a tensor compiler that automates code generation for compute-intensive DNN operations (such as matrix multiplications). DL compiler 108 generates generated DL code 122 from graph IR 106. Generated DL code 122 may be binary code or be code input to a C or Low Level Virtual Machine (LLVM) compiler 124, where compiler 124 generates generated code 126. LLVM is a set of compiler and toolchain technologies that can be used to develop a frontend for any programming language and a backend for any instruction set architecture. LLVM is designed around a language-independent intermediate representation (IR) that serves as a portable, high-level assembly language that can be optimized with a variety of transformations over multiple passes. Generated code 126 may be executed by a processor of a computing system. In an implementation, DL compiler 108 is executed by a processor of a computing system.

[0027] DL compiler 108 includes graph IR optimizer 110 to perform a plurality of transformations (e.g., decomposition, transformation, fusion) that optimize and group graph IR 106 as a sequence of fused operations. Graph IR optimizer 110 decomposes complex operations into basic DNN operations. Complex DNN operations are operations with complex semantics which could be composed of simple fundamental operations like addition and subtraction. Basic DNN operations are categorized as tunable or fusible. Tunable operations describe DNN operations that use tunable parameters to instantiate a pre-defined template to generate best-performing code (for example, compute-intensive operations such as matrix multiplication). Fusible operations refer to operations that can be fused to tunable operations, such as element-wise operations, broadcast, reduction and data movement operations.

[0028] Graph IR lowerer 111 transforms graph IR 106 from a higher level of semantics to a lower level of semantics. Graph IR 106 is further lowered to tensor IR 114. Tensor IR 114 doesn’t preserve DNN operation semantics and is closer to C program semantics. The data structures of tensor IR 114 are typically multidimensional arrays, representing tensor buffers in physical memory. Tensor IR optimizer 116 optimizes tensor IR 114 (e.g., loops, tensors, variables, constants) and the optimized tensor IR 114 is further lowered by tensor IR lowerer 120 (e.g., to LLVM and/or intrinsic calls to microkernels). Graph IR 106 keeps DNN operations semantics, so domain-specific optimizations are performed at this level. Instead of lowering DNN operations to tensor IR 114 and performing sophisticated loop analysis to achieve an optimized loop schedule and fusion, DL compiler 108 uses templates, microkernels and heuristics to guide code generation of compute-intensive operations and fusion processes. The decisions of parallel task decomposition, loop scheduling and tiling, tensor memory layout, and whether to fuse with neighbor operations are based on graph IR 106 with DNN operations semantics.

[0029] Similar to a C program, tensor IR 114 supports functions, statements, expressions, and intrinsic functions. Tensor IR 114 includes multiple functions, each of which represents a lowered fused operation. Tensor IR 114 includes an entry function that contains a sequence of calls to other functions lowered from fused operations. A tensor IR function includes multiple statements built on expressions, which operate on constants, variables and tensors. Constants and variables represent individual data elements, used to represent scalar data such as a loop index, a tensor shape, or an address and/or offset into a tensor buffer. Tensors represent multi-dimensional arrays backed by a data buffer.

[0030] DL compiler 108 generates generated DL code 122 for a DNN computation subgraph, which may consist of multiple compute-intensive operations and/or memory-intensive operations. The memory-intensive operations are usually fused to the compute-intensive operations. After the fusion, the DNN computation subgraph is reduced to contain multiple fused operations, which are further translated by graph IR lowerer 111 into a sequence of nested Parallel-For loops. DL compiler 108 represents each nested Parallel-For as a task and decomposes the task into subtasks. For example, each iteration of Parallel-For can be viewed as a subtask. However, the traditional DL compiler doesn’t control the affinity of each subtask with a thread, so subtasks from different Parallel-For loops may be scheduled to different threads.

[0031] In an implementation, graph IR lowerer 111 includes cross-operation subtask affinity (CSA) Parallel-For lowerer 112 to translate the higher-level semantics of a Parallel-For loop into lower-level semantics. In an implementation, tensor IR optimizer 116 includes CSA subtask linker, CSA subtask grouper, and cleanup function reducer 118 to link subtasks to threads, combine subtasks into groups, and clean up functions left over from lowering and optimizing processing.

[0032] Figure 2 illustrates an example 200 of two deep neural network (DNN) operations of graph IR 106 and the data movement between subtasks from the two operations in an implementation. For example, the two operations may be matrix multiply operations, such as operation 1 212 Mm1: C[M, N] = A[M, K] * B[K, N] and operation 2 214 Mm2: D[M, N2] = C[M, N] * B[N, N2]. In this example, four SW threads are shown: thread 1 204, thread 2 206, thread 3 208 and thread 4 210. In other examples, any number of threads may be used. Tasks and/or subtasks may be assigned to threads using thread pool interface 202, where any task or subtask may be assigned to any thread. In this example, each DNN operation, notably operation 1 212 and operation 2 214, is decomposed into four subtasks (e.g., subtask 1-1 220, subtask 1-2 224, subtask 1-3 226 and subtask 1-4 228 for operation 1 212, and subtask 2-1 222, subtask 2-2 230, subtask 2-3 232 and subtask 2-4 234 for operation 2 214) wherein the second set of subtasks needs to read data from the first set of subtasks. In other examples, there may be any number of DNN operations, tasks, and subtasks.

[0033] A traditional DL compiler uses a traditional thread pool interface, which doesn’t guarantee the affinity of subtasks to threads. Since the traditional DL compiler can’t assign affinity of operation 1’s subtasks to operation 2’s subtasks, the subtasks may be scheduled to different threads, potentially resulting in excessive inter-thread data movement.

[0034] In contrast, the technology described herein introduces cross-operation subtask affinity (CSA), so that DL compiler 108 can link the subtasks across operations as linked subtasks and assign affinity of the linked subtasks to the same thread. This is illustrated as linked subtasks 250, 252, 254, and 256 in Figure 2, each of which contains subtasks from consecutive operations (e.g., operation 1 212 and operation 2 214, in this example). Due to a set of subtasks from operation 1 212 and operation 2 214 being assigned to one set of linked subtasks and mapped to one thread, inter-thread data movement may be reduced. For example, subtask 1-1 220 and subtask 2-1 222 are linked and assigned to thread 1 204, subtask 1-2 224 and subtask 2-2 230 are linked and assigned to thread 2 206, subtask 1-3 226 and subtask 2-3 232 are linked and assigned to thread 3 208, and subtask 1-4 228 and subtask 2-4 234 are linked and assigned to thread 4 210. In use cases for best performance, each thread is often pinned to a specific processor core, so reducing inter-thread data movement helps to reduce data traffic across processor cores.
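As an illustrative, non-limiting sketch of why linking works for the two matrix multiplications of Figure 2, the following Python code assumes that both operations are blocked along the M dimension with the same blocking factor. The helper names (matmul, linked_subtask), the matrix B2 standing in for the second operand of operation 2, and the small matrix sizes are assumptions made for illustration and do not appear in the disclosure. Each linked subtask computes its row block of C and then consumes only that row block to produce its row block of D.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def linked_subtask(a_rows, b, b2):
    """One linked subtask: a row block of operation 1, then of operation 2."""
    c_rows = matmul(a_rows, b)   # subtask 1-k produces one row block of C
    return matmul(c_rows, b2)    # subtask 2-k reads only that row block of C


# Hypothetical sizes: M rows split evenly across NUM_THREADS linked subtasks.
M, K, N, N2, NUM_THREADS = 8, 3, 4, 2, 4
A = [[i + k for k in range(K)] for i in range(M)]
B = [[k - n for n in range(N)] for k in range(K)]
B2 = [[n * m + 1 for m in range(N2)] for n in range(N)]

rows_per_block = M // NUM_THREADS
blocked_D = []
for k in range(NUM_THREADS):     # each iteration stands for one thread's work
    rows = A[k * rows_per_block:(k + 1) * rows_per_block]
    blocked_D.extend(linked_subtask(rows, B, B2))

# Reference result computed as two whole, unblocked operations.
full_D = matmul(matmul(A, B), B2)
```

Because each row block of D depends only on the matching row block of C, the intermediate data may remain with the thread that produced it.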

[0035] In some existing scenarios, all subtasks for a first operation are synchronized using a global barrier with all subtasks of a second, succeeding operation, thereby negatively affecting overall subtask performance. In an implementation, DL compiler 108 may assign affinity of the linked subtasks of two or more threads to a group of threads and use only an intra-group barrier, thereby decreasing the cost of synchronization for some threads. In this example, DL compiler 108 may determine that data dependence exists only within each of two groups of subtasks (such as group 1 260 and group 2 262) drawn from the two operations. For example, operation 2’s subtask 2-1 222 and subtask 2-2 230 can start once operation 1’s subtask 1-1 220 and subtask 1-2 224 are complete. Similarly, DL compiler 108 may determine that operation 2’s subtask 2-3 232 and subtask 2-4 234 can start once operation 1’s subtask 1-3 226 and subtask 1-4 228 are complete. Thus, in an implementation, an intra-group barrier may be used in place of a traditional global barrier. Figure 2 illustrates this optimization. Without this optimization, a traditional DL compiler uses a global barrier between two operations. With this optimization, the synchronization only happens within the thread group for DL compiler 108, and data movement is limited to within the group of threads (e.g., group 1 and group 2).
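The intra-group barrier may be sketched as follows, assuming two groups of two threads each as in Figure 2. The use of Python's threading.Barrier, the shared log, and the names linked_subtask and THREADS_PER_GROUP are assumptions made for illustration; the disclosure does not prescribe a particular barrier implementation.

```python
import threading

NUM_GROUPS, THREADS_PER_GROUP = 2, 2

# One barrier per thread group instead of a single global barrier.
group_barriers = [threading.Barrier(THREADS_PER_GROUP) for _ in range(NUM_GROUPS)]
log = []
log_lock = threading.Lock()


def linked_subtask(glb_tid):
    grp_tid = glb_tid // THREADS_PER_GROUP
    with log_lock:
        log.append(("op1", glb_tid))   # stands in for the operation 1 subtask
    group_barriers[grp_tid].wait()     # waits only for threads of this group
    with log_lock:
        log.append(("op2", glb_tid))   # stands in for the operation 2 subtask


threads = [threading.Thread(target=linked_subtask, args=(tid,))
           for tid in range(NUM_GROUPS * THREADS_PER_GROUP)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this sketch a thread in group 1 may begin its operation 2 subtask while group 2 is still executing operation 1, which a global barrier would not allow.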

[0036] In an implementation, each set of linked subtasks may use an assigned register (reg) set of a processor. For example, reg set 1 240 may be assigned to set of linked subtasks 250, reg set 2 242 may be assigned to set of linked subtasks 252, reg set 3 244 may be assigned to set of linked subtasks 254, and reg set 4 246 may be assigned to set of linked subtasks 256. Each set of registers may be assigned to a thread.

[0037] A traditional DL compiler releases registers (such as Advanced Matrix Extensions (AMX) registers) at the end of each parallel subtask. In contrast, the technology described herein releases the registers (e.g., reg set 1 240, reg set 2 242, reg set 3 244 and reg set 4 246) once, at the end of execution of a linked subtask, before exiting the thread. At the end of execution of a linked subtask, CSA calls a “cleanup” function before the task exits. For example, the cleanup function releases the matrix multiply accelerator related resources (such as tile matrix multiply (TMUL) instruction related resources) to the OS.
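As a small worked example of the saving, the following sketch counts cleanup calls under the two policies. The function names and the counting model are assumptions made for illustration; an actual cleanup function would release accelerator resources rather than increment a counter.

```python
def cleanups_per_subtask(num_threads, num_ops):
    """Traditional policy: release resources at the end of every subtask."""
    cleanups = 0
    for _op in range(num_ops):
        for _tid in range(num_threads):
            # ... subtask body would run here ...
            cleanups += 1
    return cleanups


def cleanups_per_linked_subtask(num_threads, num_ops):
    """CSA policy: release resources once per linked subtask."""
    cleanups = 0
    for _tid in range(num_threads):
        for _op in range(num_ops):
            pass  # ... subtasks of consecutive operations run back to back ...
        cleanups += 1
    return cleanups
```

For the four threads and two operations of Figure 2, the first policy performs eight cleanup calls and the second performs four.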

[0038] Representation of Parallel-For Loop.

[0039] A Parallel-For loop may be represented in different forms in different layers within DL compiler 108. The Parallel-For loop starts as a node for a Parallel-For loop in a DNN computation graph (e.g., graph IR 106), and then is translated to a call to a thread pool interface (TPI) 202.

[0040] DL compiler 108 represents the Parallel-For loop internally as a node in graph IR 106. The Parallel-For node is composed of a loop variable node and three expressions specifying the start, end, and step of the loop. The start expression represents the value of the loop variable in the first iteration of the loop, the end expression represents the exclusive upper bound of the loop variable (the loop stops before the loop variable reaches this value), and the step expression represents the increment applied after each iteration. In an implementation, the Parallel-For node may also contain a constant field called “num_thrds”, which indicates the number of threads to be used to execute the Parallel-For loop.

[0041] A Parallel-For node also contains a loop body. In example pseudo code, a Parallel-For loop may be represented in the form shown in Table 1.

Table 1

parallel_for (i, start, end, step, num_thrds) {
    for_body
}
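One possible in-memory form of such a node is sketched below. The class name ParallelForNode, the use of plain integers in place of start, end, and step expressions, and the iterations helper are assumptions made for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ParallelForNode:
    """Hypothetical node mirroring the fields of the Parallel-For loop."""
    loop_var: str                      # name of the loop variable, e.g. "i"
    start: int                         # value in the first iteration
    end: int                           # exclusive upper bound
    step: int                          # increment per iteration
    body: Callable[[int], None]        # loop body ("for_body")
    num_thrds: Optional[int] = None    # optional thread count

    def iterations(self):
        """Values taken by the loop variable, in sequential order."""
        return range(self.start, self.end, self.step)


node = ParallelForNode("i", 0, 10, 3, lambda i: None, num_thrds=4)
```

Here node.iterations() yields 0, 3, 6, and 9, showing that the end value itself is never visited.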

[0042] DL compiler 108 lowers (that is, translates a higher level semantic into a lower level semantic) “parallel_for” nodes to a call to “TP_parallel_for()”, with the “for_body” prepared as a closure. The “for_body” closure contains an address of the Parallel-For loop body code, called “for_body_func” in an example, and the captured variables as arguments, called “for_body_func_args” in an example. The start, end, and step expressions are evaluated before being passed as arguments to the “TP_parallel_for()” function call.

[0043] The “TP_parallel_for()” code is a “parallel_for()” interface typically provided by thread pool interface 202, which is implemented on top of an underlying thread pool (such as OpenMP, for example, where the thread pool includes the threads, such as thread 1 204, thread 2 206, thread 3 208, and thread 4 210). The “TP_parallel_for()” code does not guarantee the affinity of a task (or subtask) to any thread. An example pseudo code implementation of this API is shown in Table 2.

Table 2

parallel_for (int iter = 0; iter < num_tasks; iter++) {
    fn (iter, num_tasks);
}

[0044] To simplify the pseudo code of Table 2, the example pseudo code format for TP_parallel_for() shown in Table 3 may be used.

Table 3

TP_parallel_for (iter, num_tasks) {
    the function body of for_body_func
}
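The lowering of a Parallel-For loop to a thread pool call may be sketched as follows, using Python's concurrent.futures.ThreadPoolExecutor as a stand-in for the underlying thread pool. The function names tp_parallel_for and lower_parallel_for, their signatures, and the example loop body are assumptions made for illustration. As with the interface described above, the executor gives no guarantee about which worker thread runs which iteration.

```python
from concurrent.futures import ThreadPoolExecutor


def tp_parallel_for(fn, num_tasks, pool):
    """Submit fn(iter, num_tasks) for every iteration and wait for all."""
    futures = [pool.submit(fn, it, num_tasks) for it in range(num_tasks)]
    for future in futures:
        future.result()  # re-raises any exception from the task


def lower_parallel_for(start, end, step, for_body_func, for_body_func_args, pool):
    """Evaluate the loop bounds, wrap the body as a closure, and dispatch it."""
    trip_count = len(range(start, end, step))

    def closure(it, num_tasks):
        # Recover the loop variable from the task index.
        for_body_func(start + it * step, *for_body_func_args)

    tp_parallel_for(closure, trip_count, pool)


def body(i, dst):
    dst[i // 2] = i * i  # each iteration writes a distinct slot


squares = [None] * 5
with ThreadPoolExecutor(max_workers=4) as pool:
    lower_parallel_for(0, 10, 2, body, (squares,), pool)
```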

[0045] Cross-operation subtask affinity optimization (CSA).

[0046] As used herein, CSA refers to CSA parallel-for lowerer 112 and/or CSA subtask linker, subtask grouper and cleanup function reducer 118, collectively. It is assumed that CSA knows the number of working threads at compile time. When CSA decomposes a nested Parallel-For loop by blocking the parallel index, CSA creates the same number of subtasks as the number of threads and assigns each subtask to a specific thread. As used herein, blocking a loop index means that the iterations of the loop are divided into smaller blocks. The original loop is split into two levels, where the top level (the outer loop) iterates at the block granularity, and the inner loop iterates over every index within the block. In an implementation, the outer loop remains parallel, and the inner loop executes sequentially. CSA further links subtasks from multiple DNN operations as linked subtasks and uses the underlying thread pool to dispatch the linked subtasks. CSA ensures each linked subtask is bound to one thread, which helps reduce the data movement between subtasks from consecutive DNN operations. CSA further groups threads to limit the synchronization within thread groups and calls a cleanup function only for whole linked subtasks.
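Blocking a loop index may be illustrated by the following sketch, which divides the iterations of a loop into one block per thread. The function name blocked_iterations and the choice of ceiling division to size the blocks are assumptions made for illustration; other ways of dividing the iterations are possible.

```python
def blocked_iterations(start, end, step, num_blocks):
    """Split the iterations of a loop into num_blocks contiguous blocks."""
    indices = list(range(start, end, step))
    block_len = -(-len(indices) // num_blocks)  # ceiling division
    return [indices[b * block_len:(b + 1) * block_len] for b in range(num_blocks)]


# Outer level: one block per thread (may run in parallel).
# Inner level: the indices inside one block (run sequentially).
blocks = blocked_iterations(0, 10, 1, 4)
```

With ten iterations and four threads, the blocks in this sketch are [0, 1, 2], [3, 4, 5], [6, 7, 8], and [9], so the last thread receives the shorter remainder block.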

[0047] Decomposing nested Parallel-For loop and assigning affinity of subtasks to threads.

[0048] Many DNN operations may be split into a two-dimensional (2D) array of subtasks, and the subtasks are independent of each other and may be executed in parallel by a processor. For example, when implementing the matrix multiplication (MATMUL) operation (for an M*K matrix and a K*N matrix, where M, N, and K are natural numbers), DL compiler 108 decomposes the overall task into multiple parallel subtasks by blocking the parallel indexes (e.g., M and N). Table 4 shows an example DNN operation task represented as two level nested Parallel-For loops.

Table 4

parallel_for (i, start_i, end_i, step_i) {
    parallel_for (j, start_j, end_j, step_j) {
        Subtask (i, j)
    }
}

[0049] The outer Parallel-For task may be sliced into BLK_I blocks, and the inner Parallel-For task may be sliced into BLK_J blocks. The task may thus be decomposed into BLK_I * BLK_J subtasks, grouped into BLK_I groups, each of which contains BLK_J subtasks. Each subtask is then mapped to a thread. The outer Parallel-For loop may be associated with a thread group identifier (ID) (grp_tid), and the inner Parallel-For loop may be associated with a local thread ID (local_tid). The local thread ID is the ID of the thread in the current thread group and in an implementation is an integer ranging from 0 to one less than the number of threads in the current group. The group ID is the ID of the thread group and in an implementation is an integer ranging from 0 to one less than the number of groups.

[0050] CSA divides the total length of the loop variable of the outer Parallel-For loop equally by the number of thread groups, and the inner Parallel-For loop by the number of local threads. The original nested Parallel-For loop is transformed into a sequential for-loop with the same body but within a small loop variable range assigned to the current subtask. The start and the end of the loop variable need to be modified, and the new values are calculated based on the group ID and local thread ID. Table 5 shows example pseudo code illustrating the transformed loops.

Table 5

parallel_for (grp_tid, 0, BLK_I, 1) {
    parallel_for (local_tid, 0, BLK_J, 1) {
        // BLK_I groups, each group has BLK_J subtasks/threads
        [tile_start_i, tile_end_i] = get_subtask (grp_tid, start_i, end_i, step_i)
        [tile_start_j, tile_end_j] = get_subtask (local_tid, start_j, end_j, step_j)
        for (int i = tile_start_i; i < tile_end_i; i += step_i) {
            for (int j = tile_start_j; j < tile_end_j; j += step_j) {
                Subtask (i, j)
            }
        }
    }
}
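The disclosure does not define get_subtask, so the following is only one possible sketch of it. The extra num_blocks parameter (the number of thread groups or local threads), the ceiling-division policy, and the half-open return range are assumptions made for illustration. The nested loops after the function run the transformed loop nest sequentially to check that every (i, j) pair of the original loop is visited exactly once.

```python
def get_subtask(tid, start, end, step, num_blocks):
    """Return the half-open loop range [tile_start, tile_end) for block tid."""
    trip_count = len(range(start, end, step))
    per_block = -(-trip_count // num_blocks)        # ceiling division
    first = min(tid * per_block, trip_count)        # first iteration index
    last = min(first + per_block, trip_count)       # one past the last index
    return start + first * step, start + last * step


# Hypothetical sizes: a 6 x 4 iteration space on 2 groups of 2 threads.
BLK_I, BLK_J = 2, 2
visited = []
for grp_tid in range(BLK_I):            # would be parallel in the compiler
    for local_tid in range(BLK_J):      # would be parallel in the compiler
        tile_start_i, tile_end_i = get_subtask(grp_tid, 0, 6, 1, BLK_I)
        tile_start_j, tile_end_j = get_subtask(local_tid, 0, 4, 1, BLK_J)
        for i in range(tile_start_i, tile_end_i, 1):
            for j in range(tile_start_j, tile_end_j, 1):
                visited.append((i, j))  # stands in for Subtask (i, j)
```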

[0051] The nested Parallel-For code is then flattened into a single-level Parallel-For. The length of the single-level Parallel-For is the total number of subtasks or threads in all groups. Thus, the flattened Parallel-For is associated with a global thread ID (glb_tid), which may be decomposed into a group ID and local thread ID for each iteration.

[0052] The nested loop may be flattened as shown in Table 6.

Table 6

parallel_for (glb_tid, 0, BLK_I * BLK_J, 1) {
    // loop body denoted as flatten_parallel_for_body
    // BLK_I groups, each group has BLK_J threads
    grp_tid = glb_tid / BLK_J
    local_tid = glb_tid % BLK_J
    [tile_start_i, tile_end_i] = get_subtask (grp_tid, start_i, end_i, step_i)
    [tile_start_j, tile_end_j] = get_subtask (local_tid, start_j, end_j, step_j)
    for (int i = tile_start_i; i < tile_end_i; i += step_i) {
        for (int j = tile_start_j; j < tile_end_j; j += step_j) {
            subtask (i, j)
        }
    }
}
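The decomposition of the global thread ID may be checked with a short sketch. The function name split_global_tid and the sizes used are assumptions made for illustration; the integer division and remainder follow the pseudo code of Table 6.

```python
def split_global_tid(glb_tid, blk_j):
    """Return (grp_tid, local_tid) for a flattened global thread ID."""
    return divmod(glb_tid, blk_j)  # quotient is the group, remainder is local


BLK_I, BLK_J = 2, 3
flattened = [split_global_tid(glb_tid, BLK_J) for glb_tid in range(BLK_I * BLK_J)]
nested = [(grp_tid, local_tid)
          for grp_tid in range(BLK_I)
          for local_tid in range(BLK_J)]
```

The flattened loop visits the same (grp_tid, local_tid) pairs, in the same order, as the two-level nest it replaces.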

[0053] The “parallel_for” is then lowered to “TP_parallel_for()” as shown in Table 7. The loop body of Table 6 is extracted as a closure function, denoted as “flatten_parallel_for_body”, and passed as an argument to TP_parallel_for(). The pseudo code below shows the original loop body instead of the closure function for simplicity.

Table 7

TP_parallel_for (glb_tid, num_threads) {
    // BLK_I groups, each group has BLK_J threads
    grp_tid = glb_tid / BLK_J
    local_tid = glb_tid % BLK_J
    [tile_start_i, tile_end_i] = get_subtask (grp_tid, start_i, end_i, step_i)
    [tile_start_j, tile_end_j] = get_subtask (local_tid, start_j, end_j, step_j)
    for (int i = tile_start_i; i < tile_end_i; i += step_i) {
        for (int j = tile_start_j; j < tile_end_j; j += step_j) {
            subtask (i, j)
        }
    }
}
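One way to picture the lowering of Table 7 is the Python sketch below, in which the loop body is a closure handed to a thread-pool entry point. The function tp_parallel_for is a hypothetical stand-in for TP_parallel_for() built on plain threads, get_subtask is an assumed equal-split helper, and the sizes M, N, BLK_I and BLK_J are example values; none of these details are specified by the disclosure.

```python
import threading


def get_subtask(block_id, num_blocks, start, end, step):
    """Assumed helper: the [tile_start, tile_end) range owned by block_id."""
    iters = (end - start + step - 1) // step
    per_block = (iters + num_blocks - 1) // num_blocks
    tile_start = min(start + block_id * per_block * step, end)
    tile_end = min(tile_start + per_block * step, end)
    return tile_start, tile_end


def tp_parallel_for(num_threads, body):
    """Hypothetical stand-in: run body(glb_tid) once on each of num_threads threads."""
    threads = [threading.Thread(target=body, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


M, N = 4, 6            # iteration space of the original nested loop
BLK_I, BLK_J = 2, 3    # BLK_I groups, each group has BLK_J threads
owner = [[None] * N for _ in range(M)]   # owner[i][j]: thread that ran subtask(i, j)


def flatten_parallel_for_body(glb_tid):
    grp_tid, local_tid = divmod(glb_tid, BLK_J)
    tile_start_i, tile_end_i = get_subtask(grp_tid, BLK_I, 0, M, 1)
    tile_start_j, tile_end_j = get_subtask(local_tid, BLK_J, 0, N, 1)
    for i in range(tile_start_i, tile_end_i):
        for j in range(tile_start_j, tile_end_j):
            owner[i][j] = glb_tid   # stand-in for subtask(i, j)


tp_parallel_for(BLK_I * BLK_J, flatten_parallel_for_body)
```

Because each thread writes only the cells of its own tile, no locking is needed in this sketch, and the owner table makes the subtask-to-thread mapping explicit.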

[0054] Link multiple Parallel-For loops as one kernel.

[0055] When DL compiler 108 handles multiple matrix multiplication operations (MATMULs), as in a multilayer perceptron (MLP), where the result of one matrix multiplication is the input of the next, a naive, sub-optimal implementation is two side-by-side matrix multiplications, each with its own nested Parallel-For loop. Example pseudo code of this naive, sub-optimal implementation is shown in Table 8.

Table 8

parallel_for (i, start_i_1, end_i_1, step_i_1) {
    parallel_for (j, start_j_1, end_j_1, step_j_1) {
        subtask1 (i, j)
    }
}
parallel_for (i, start_i_2, end_i_2, step_i_2) {
    parallel_for (j, start_j_2, end_j_2, step_j_2) {
        subtask2 (i, j)
    }
}

[0056] This representation may then be transformed by CSA by blocking, flattening, and lowering to normal for-loops and a barrier, as shown in the example pseudo code of Table 9.

Table 9

barrier glb_barrier;
TP_parallel_for (glb_tid, 0, num_threads, 1) {
    // BLK_I groups, each group has BLK_J threads
    grp_tid = glb_tid / BLK_J
    local_tid = glb_tid % BLK_J
    [tile_start_i_1, tile_end_i_1] = get_subtask (grp_tid, start_i_1, end_i_1, step_i_1)
    [tile_start_j_1, tile_end_j_1] = get_subtask (local_tid, start_j_1, end_j_1, step_j_1)
    for (int i = tile_start_i_1; i < tile_end_i_1; i += step_i_1) {
        for (int j = tile_start_j_1; j < tile_end_j_1; j += step_j_1) {
            subtask1 (i, j)
        }
    }
    global_barrier (glb_barrier);
    [tile_start_i_2, tile_end_i_2] = get_subtask (grp_tid, start_i_2, end_i_2, step_i_2)
    [tile_start_j_2, tile_end_j_2] = get_subtask (local_tid, start_j_2, end_j_2, step_j_2)
    for (int i = tile_start_i_2; i < tile_end_i_2; i += step_i_2) {
        for (int j = tile_start_j_2; j < tile_end_j_2; j += step_j_2) {
            subtask2 (i, j)
        }
    }
}
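The role of the global barrier in Table 9 can be illustrated with a small Python sketch. It uses threading.Barrier from the standard library as a stand-in for global_barrier(); the arrays a and b, the sizes, and the arithmetic inside the two stages are invented for the example and only serve to create a dependence in which the second stage reads a value produced by a different thread.

```python
import threading

NUM_THREADS = 4
N = 8                                   # assumed to be a multiple of NUM_THREADS
a = [0] * N                             # output of the first loop
b = [0] * N                             # output of the second loop
glb_barrier = threading.Barrier(NUM_THREADS)


def kernel_body(glb_tid):
    chunk = N // NUM_THREADS
    lo, hi = glb_tid * chunk, (glb_tid + 1) * chunk
    for i in range(lo, hi):             # stand-in for subtask1
        a[i] = 2 * i
    glb_barrier.wait()                  # all of a[] is complete past this point
    for i in range(lo, hi):             # stand-in for subtask2
        b[i] = a[(i + 1) % N] + 1       # may read an element written by another thread


threads = [threading.Thread(target=kernel_body, args=(tid,))
           for tid in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Both stages run inside a single parallel section, and each thread keeps the same index range across the two stages, which is the affinity the linked kernel is intended to preserve.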

[0057] Since DL compiler 108 knows which worker thread each loop iteration is mapped to, the DL compiler can ensure that certain related subtasks in different Parallel-For loop iterations are executed on the same thread, thus reducing possible cross-thread data movement.

[0058] Thread Group with Intra-Group Barrier.

[0059] CSA may assign multiple Parallel-For loops to a thread, which effectively assigns affinity of the Parallel-For loops to the thread. However, between consecutive Parallel-For loops, CSA uses an implicit global barrier to enforce the data dependence between the preceding and the succeeding Parallel-For loops. In some cases, the DL compiler can limit the data dependence to one between a subset of the preceding Parallel-For subtasks and a subset of the succeeding Parallel-For subtasks. The technology described herein includes DL compiler 108 mapping each such subset of Parallel-For subtasks to a group of threads and using only an intra-group barrier, which reduces the synchronization overhead. The overall Parallel-For subtasks may be mapped to multiple groups of threads.

[0060] An example of two Parallel-For loops is shown in the example pseudo code of Table 10.

Table 10

parallel_for (i, start_i_1, end_i_1, step_i_1) {
    parallel_for (j, start_j_1, end_j_1, step_j_1) {
        subtask1 (i, j)
    }
}
parallel_for (i, start_i_2, end_i_2, step_i_2) {
    parallel_for (j, start_j_2, end_j_2, step_j_2) {
        subtask2 (i, j)
    }
}

[0061] DL compiler 108 first blocks both Parallel-For loops with the same blocking numbers BLK_I and BLK_J, and then merges the outer Parallel-For loops as shown in the example pseudo code of Table 11.

Table 11

parallel_for (grp_tid, 0, BLK_I, 1) {
    parallel_for (local_tid, 0, BLK_J, 1) {
        // BLK_I groups, each group has BLK_J subtasks/threads
        [tile_start_i, tile_end_i] = get_subtask (grp_tid, start_i_1, end_i_1, step_i_1)
        [tile_start_j, tile_end_j] = get_subtask (local_tid, start_j_1, end_j_1, step_j_1)
        for (int i = tile_start_i; i < tile_end_i; i += step_i_1) {
            for (int j = tile_start_j; j < tile_end_j; j += step_j_1) {
                subtask1 (i, j)
            }
        }
    }
    parallel_for (local_tid, 0, BLK_J, 1) {
        // BLK_I groups, each group has BLK_J subtasks/threads
        [tile_start_i, tile_end_i] = get_subtask (grp_tid, start_i_2, end_i_2, step_i_2)
        [tile_start_j, tile_end_j] = get_subtask (local_tid, start_j_2, end_j_2, step_j_2)
        for (int i = tile_start_i; i < tile_end_i; i += step_i_2) {
            for (int j = tile_start_j; j < tile_end_j; j += step_j_2) {
                subtask2 (i, j)
            }
        }
    }
}

[0062] Often, the DL compiler cannot simply merge the inner Parallel-For loops, since the second inner Parallel-For loop depends on the completion of the first inner Parallel-For loop. In an implementation, CSA provides a “CSA_Intra_group_barrier()” call which allows the DL compiler to ensure that the thread group assigned to the second inner Parallel-For loop waits for the completion of the first inner Parallel-For loop. Table 12 shows example pseudo code of two nested Parallel-For loops fully merged with the intra-group barrier (such as intra-group barrier 1 236 or intra-group barrier 2 238 for the example of Figure 2).

Table 12

barrier grp_barrier [BLK_I] // there are BLK_I groups in outer Parallel-For
parallel_for (grp_tid, 0, BLK_I, 1) {
    parallel_for (local_tid, 0, BLK_J, 1) { // denoted as flatten_merged_parallel_for_body
        // BLK_I groups, each group has BLK_J subtasks/threads
        [tile_start_i, tile_end_i] = get_subtask (grp_tid, start_i_1, end_i_1, step_i_1)
        [tile_start_j, tile_end_j] = get_subtask (local_tid, start_j_1, end_j_1, step_j_1)
        for (int i = tile_start_i; i < tile_end_i; i += step_i_1) {
            for (int j = tile_start_j; j < tile_end_j; j += step_j_1) {
                subtask1 (i, j)
            }
        }
        CSA_Intra_group_barrier (grp_barrier [grp_tid])
        // BLK_I groups, each group has BLK_J subtasks/threads
        [tile_start_j, tile_end_j] = get_subtask (local_tid, start_j_2, end_j_2, step_j_2)
        for (int i = tile_start_i; i < tile_end_i; i += step_i_2) {
            for (int j = tile_start_j; j < tile_end_j; j += step_j_2) {
                subtask2 (i, j)
            }
        }
    }
}

[0063] With this transformation, the DL compiler can lower the nested Parallel-For loops as shown in Table 13.

Table 13

barrier grp_barrier [BLK_I] // there are BLK_I groups in outer Parallel-For
TP_parallel_for (glb_tid, 0, num_threads, 1) {
    // BLK_I groups, each group has BLK_J subtasks/threads
    [tile_start_j, tile_end_j] = get_subtask (local_tid, start_j_1, end_j_1, step_j_1)
    for (int i = tile_start_i; i < tile_end_i; i += step_i_1) {
        for (int j = tile_start_j; j < tile_end_j; j += step_j_1) {
            subtask1 (i, j)
        }
    }
    CSA_Intra_group_barrier (grp_barrier [grp_tid])
    // BLK_I groups, each group has BLK_J subtasks/threads
}
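A minimal Python sketch of the intra-group barrier idea follows, assuming a dependence pattern in which the second stage of a group reads only data written by threads of the same group. One threading.Barrier per group plays the role of grp_barrier[grp_tid]; the matrices x and y, their sizes, and the row-sum computation are hypothetical and chosen so that the dependence stays inside a group.

```python
import threading

M, N = 4, 6                       # example iteration space
BLK_I, BLK_J = 2, 3               # BLK_I groups, each group has BLK_J threads
ROWS, COLS = M // BLK_I, N // BLK_J   # assumes M and N divide evenly

x = [[0] * N for _ in range(M)]   # written by stage 1
y = [[0] * N for _ in range(M)]   # written by stage 2

# One barrier per group; each waits only for the BLK_J threads of that group.
grp_barrier = [threading.Barrier(BLK_J) for _ in range(BLK_I)]


def flatten_merged_parallel_for_body(glb_tid):
    grp_tid, local_tid = divmod(glb_tid, BLK_J)
    rows = range(grp_tid * ROWS, (grp_tid + 1) * ROWS)
    cols = range(local_tid * COLS, (local_tid + 1) * COLS)
    for i in rows:                # stand-in for subtask1
        for j in cols:
            x[i][j] = i + j
    grp_barrier[grp_tid].wait()   # rows owned by this group are now complete
    for i in rows:                # stand-in for subtask2
        for j in cols:
            y[i][j] = sum(x[i])   # reads a whole row, all written within the group


threads = [threading.Thread(target=flatten_merged_parallel_for_body, args=(tid,))
           for tid in range(BLK_I * BLK_J)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this sketch a group never waits for threads of another group, so a slow group delays only its own second stage rather than the whole kernel.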

[0064] Smart AMX Register Context Release.

[0065] Advanced Matrix Extensions (AMX) is an instruction set extension, introduced in processors available from Intel Corporation, to accelerate matrix computations. AMX introduces a per-core state register that stores the tile configuration (tile config), e.g., a tile config register. A deep learning (DL) kernel must set the tile config register before using any AMX features in a thread, and the DL kernel must release the AMX tile configuration in the state register in every thread it uses when the kernel exits. The operating system (OS) uses the tile config register to determine whether it needs to save the AMX data registers when performing a context switch. OS context-switching performance degrades if a thread fails to release its tile config.

[0066] A conservative and easy-to-implement way to release the AMX tile config in DL workloads is to release it immediately after using AMX features and to re-load it before the next use of AMX features. This results in an approximately 2% loss of performance. A more efficient (but still sub-optimal) way is to release the tile config at the end of each DNN operation. However, a kernel generated by a DL compiler usually contains more than one operation, so the kernel needs to release the tile config multiple times on each thread.

[0067] The technology described herein introduces an optimal way to safely release the AMX tile config on all threads used in a kernel, based on CSA in an implementation. Note that CSA blocks the parallel loops and creates one-to-one mappings between parallel subtasks and threads, and all threads of CSA call a “cleanup” function just once at the exit of the parallel section of the underlying thread pool. This ensures that all threads used in the kernel release the tile config before the threads are returned to the underlying thread pool, and that the release of the tile config occurs only once per thread per call of the kernel.

[0068] An advantage of using CSA for AMX tile release, as compared to traditional deep learning compiler techniques, is that CSA significantly reduces the number of cleanup function calls. Traditional deep learning compiler implementations run a cleanup function at the end of each Parallel-For section, so the number of cleanup calls can be significantly larger than the number of threads, and multiple Parallel-For sections are needed for multiple operations.

[0069] The pseudo code below shows the optimization result of the CSA cleanup function reducer based on the code example of Table 13, after multiple linked Parallel-For loops are merged and the scope of synchronization is reduced. CSA scans through the loop body, removes the cleanup function calls inside it, and calls the cleanup function only once, at the end of the loop body.

Table 14

barrier grp_barrier [BLK_I] // there are BLK_I groups in outer Parallel-For
TP_parallel_for (glb_tid, 0, num_threads, 1) {
    // BLK_I groups, each group has BLK_J threads
    grp_tid = glb_tid / BLK_J
    local_tid = glb_tid % BLK_J
    // BLK_I groups, each group has BLK_J subtasks/threads
    [tile_start_j, tile_end_j] = get_subtask (local_tid, start_j_1, end_j_1, step_j_1)
    for (int i = tile_start_i; i < tile_end_i; i += step_i_1) {
        for (int j = tile_start_j; j < tile_end_j; j += step_j_1) {
            subtask1_with_cleanupfunction_removed (i, j)
        }
    }
    CSA_Intra_group_barrier (grp_barrier [grp_tid])
    // BLK_I groups, each group has BLK_J subtasks/threads
    [tile_start_j, tile_end_j] = get_subtask (local_tid, start_j_2, end_j_2, step_j_2)
    for (int i = tile_start_i; i < tile_end_i; i += step_i_2) {
        for (int j = tile_start_j; j < tile_end_j; j += step_j_2) {
            subtask2_with_cleanupfunction_removed (i, j)
        }
    }
    cleanup()
}
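The effect of the cleanup function reducer can be pictured with the following Python sketch, which counts how often each thread runs its cleanup. Here cleanup() is only a counter standing in for the release of the AMX tile config, the subtask function and the tile count of 3 are invented for the example, and no real AMX state is touched.

```python
import threading

NUM_THREADS = 4
TILES_PER_STAGE = 3                     # hypothetical amount of work per stage
cleanup_calls = [0] * NUM_THREADS       # per-thread count of cleanup() calls
work_done = [0] * NUM_THREADS           # per-thread count of executed subtasks


def cleanup(glb_tid):
    """Stand-in for releasing the tile config on the calling thread."""
    cleanup_calls[glb_tid] += 1


def subtask_with_cleanupfunction_removed(glb_tid):
    """Stand-in for a subtask whose own trailing cleanup has been stripped."""
    work_done[glb_tid] += 1


def kernel_body(glb_tid):
    for _ in range(TILES_PER_STAGE):    # first linked operation
        subtask_with_cleanupfunction_removed(glb_tid)
    for _ in range(TILES_PER_STAGE):    # second linked operation
        subtask_with_cleanupfunction_removed(glb_tid)
    cleanup(glb_tid)                    # single cleanup at the end of the loop body


threads = [threading.Thread(target=kernel_body, args=(tid,))
           for tid in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread executes six subtasks across the two linked operations but performs the cleanup once, matching the once-per-thread-per-kernel-call behavior described above.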

[0070] Figure 3 illustrates CSA processing 300 in an implementation. DNN computation subgraph 302 (e.g., a graph IR 106) is input to CSA parallel-for lowerer 112 (which may be part of graph IR lowerer 111). CSA parallel-for lowerer 112 generates a sequence of nested parallel-for loops 304 from the DNN computation subgraph 302. In an implementation, lowering compute-intensive tunable operations and fusible operations may be performed using one or more templates. Templates use consistent parallel decomposition for the outermost parallel-for loops of neighboring tunable operations so that the parallel-for loops (in subtasks) can be subsequently linked and grouped. CSA subtask linker 118-1 (which may be part of tensor IR optimizer 116) merges the sequence of nested parallel-for loops 304 into a single merged parallel-for loop 306. CSA subtask grouper 118-2 assigns merged parallel-for loop 306 with a relaxed synchronization to a subtask group to form merged parallel-for loop with relaxed synchronization with subtask group 308. The affinity of subtasks within a group may then be set to one or more threads. CSA cleanup function reducer 118-3 adds reduced cleanup functions (e.g., fewer cleanup functions for implementations described herein than for existing approaches) to generate merged parallel-for loop with relaxed synchronization with subtask group and reduced cleanup functions 310. Merged parallel-for loop with relaxed synchronization with subtask group and reduced cleanup functions 310 may then be input to tensor IR lowerer 120 of DL compiler 108.

[0071] Figures 4A and 4B are flow diagrams of CSA processing 400 in an implementation. Actions shown in Figures 4A and 4B may be performed by one or more of CSA parallel-for lowerer 112, CSA subtask linker 118-1, CSA subtask grouper 118-2 and CSA cleanup function reducer 118-3. At block 402 of Figure 4A, a current operation indicator may be set to a first operation of a DNN computation subgraph (e.g., graph IR 106) in topological order. As used herein, topological order is an ordering of the nodes in a directed acyclic graph (DAG) (e.g., Graph IR 106) such that for every directed edge (u, v) in the graph, node u comes before node v in the ordering. It is a linear ordering of the operations in the DNN computation graph that respects the relation between producer operation and consumer operation. At block 404, the current operation may be transformed to a nested parallel-for loop, as shown in Table 5. When the DNN operations are lowered to parallel-for loops, the parallel-for loops are created sequentially, and so they form a parallel-for list.
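For illustration, the topological traversal of blocks 402 to 412 can be sketched with the graphlib module of the Python standard library (Python 3.9 or later). The operation names and the three-node chain are hypothetical, and the disclosure does not state how the DL compiler computes the order.

```python
from graphlib import TopologicalSorter

# Hypothetical DNN computation subgraph: each key maps to the set of its
# producer operations, i.e. the edges are (producer, consumer).
producers = {
    "relu": {"matmul1"},
    "matmul2": {"relu"},
}

# static_order() yields every producer before any of its consumers.
order = list(TopologicalSorter(producers).static_order())

# Lowering the operations in this order creates the parallel-for list.
parallel_for_list = [f"parallel_for<{op}>" for op in order]
```

For this chain the only valid order is matmul1, relu, matmul2, so the parallel-for list has the producer loop ahead of each consumer loop.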

[0072] At block 406, the nested parallel-for loop is blocked to create a one-to-one mapping between parallel subtasks and threads, as shown in Table 6. At block 408, the parallel-for loop of the current operation is marked as linkable if the current operation and a preceding operation are both parallelized without cross-iteration dependencies along at least one same data dimension at the top level and with the same blocking factor. In parallel computing, blocking is a technique used to optimize the performance of parallel loops by dividing the iteration space into smaller blocks and processing each block sequentially. The blocking factor is the number of iterations in each block. At block 410, the current operation indicator is set to the next operation of the DNN computation subgraph in topological order. At block 412, if the current operation exists, then processing returns to block 404 (e.g., thereby traversing through operations of the DNN computation subgraph). If the current operation does not exist at block 412 (e.g., all operations of the DNN computation subgraph have been processed), then processing continues with block 414 of Figure 4B via connector 4B.
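The linkable test of block 408 might be modeled as in the Python sketch below. The ParallelFor record, its field names, and the example operations are assumptions made for illustration; the sketch compares each loop only with its immediate predecessor and takes the absence of cross-iteration dependencies for granted.

```python
from dataclasses import dataclass


@dataclass
class ParallelFor:
    op_name: str            # operation the loop was lowered from
    top_dim: str            # data dimension parallelized at the top level
    blocking_factor: int    # iterations per block along that dimension
    linkable: bool = False  # set by mark_linkable()


def mark_linkable(loops):
    """Mark a loop linkable when it matches the preceding loop's top-level
    dimension and blocking factor (simplified form of block 408)."""
    for prev, cur in zip(loops, loops[1:]):
        cur.linkable = (prev.top_dim == cur.top_dim
                        and prev.blocking_factor == cur.blocking_factor)
    return loops


loops = mark_linkable([
    ParallelFor("matmul1", "batch", 4),
    ParallelFor("relu", "batch", 4),
    ParallelFor("matmul2", "batch", 4),
    ParallelFor("softmax", "feature", 8),
])
flags = [loop.linkable for loop in loops]
```

In this example the first loop has no predecessor and stays unmarked, the next two match their predecessors, and the last loop differs in both dimension and blocking factor.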

[0073] At block 414 of Figure 4B, a parallel-for-group indicator is set to the first parallel-for loop in a lowered parallel-for list. At block 416, a parallel-for-next indicator is set to the next parallel-for loop in the lowered parallel-for list. At block 418, if the next parallel-for loop is linkable to any parallel-for loops in the parallel-for group, then at block 420 the next parallel-for loop is merged into the parallel-for group and processing continues with the next parallel-for loop at block 416. At block 418, if the next parallel-for loop is not linkable to any parallel-for loops in the parallel-for group, then at block 422 the parallel-for loops of the parallel-for group are flattened to a one-dimensional TP-parallel-for loop, as shown in Table 9. At block 424, the global barrier is replaced with a group level barrier, as shown in Table 13. At block 426, one or more cleanup functions are removed from the inner loop body, as shown in Table 14. At block 428, the parallel-for-group indicator is set to the next parallel-for loop in the lowered parallel-for list. If there are more parallel-for loops to process at block 430, then processing continues at block 416 with the next parallel-for loop. Otherwise, processing is done at block 432.
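The grouping pass of Figure 4B can be approximated by the Python sketch below, which only forms the groups and leaves out the flattening, barrier replacement, and cleanup removal of blocks 422 to 426. Representing each loop as a (name, linkable) pair and treating linkable as "linkable to the current group" are simplifying assumptions.

```python
def group_parallel_fors(loops):
    """Partition the lowered parallel-for list into groups of linked loops.

    loops is a list of (name, linkable) pairs in lowering order. A loop joins
    the current group when it is linkable; otherwise the current group is
    closed (it would then be flattened) and a new group starts.
    """
    groups = []
    current = []
    for name, linkable in loops:
        if current and linkable:
            current.append(name)        # block 420: merge into the group
        else:
            if current:
                groups.append(current)  # blocks 422-426 would apply here
            current = [name]            # block 428: start the next group
    if current:
        groups.append(current)
    return groups


groups = group_parallel_fors([
    ("matmul1", False),
    ("relu", True),
    ("matmul2", True),
    ("softmax", False),
])
```

Each resulting group corresponds to one flattened TP-parallel-for kernel; here the first three loops form one kernel and the last loop forms another.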

[0074] The examples shown above illustrate parallel-for loops in one dimension. In an implementation, multiple parallel loops at the top level may be generated on the same dimensions and with the same blocking factors. The examples show only the case of merging along the top parallel-for loop, but this may be extended to multiple parallel loops. In such loops, each iteration may be executed independently of all other iterations, without any dependency between them.

[0075] While an example manner of implementing the technology described herein is illustrated in Figures 1-4, one or more of the elements, processes, and/or devices illustrated in Figures 1-4 may be combined, divided, re-arranged, omitted, eliminated, and/or implemented in any other way. Further, the example improved computing system 101 may be implemented by hardware, software, firmware, and/or any combination of hardware, software, and/or firmware. Thus, for example, any portion or all of the improved computing system 101 could be implemented by processor circuitry, analog circuit(s), digital circuit(s), logic circuit(s), programmable processor(s), programmable microcontroller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), and/or field programmable logic device(s) (FPLD(s)) such as Field Programmable Gate Arrays (FPGAs). When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of the example hardware resources is/are hereby expressly defined to include a non-transitory computer readable storage device or storage disk such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc., including the software and/or firmware. Further still, the example embodiments of Figures 1-4 may include one or more elements, processes, and/or devices in addition to, or instead of, those illustrated in Figures 1-4, and/or may include more than one of any or all the illustrated elements, processes and devices.

[0076] A flowchart representative of example hardware logic circuitry, machine readable instructions, hardware implemented state machines, and/or any combination thereof is shown in Figures 3 and 4. The machine readable instructions may be one or more executable programs or portion(s) of an executable program for execution by processor circuitry, such as the processor circuitry 1012 shown in the example processor platform 1000 discussed below in connection with Figure 5 and/or the example processor circuitry discussed below in connection with Figures 6 and/or 7. The program may be embodied in software stored on one or more non-transitory computer readable storage media such as a CD, a floppy disk, a hard disk drive (HDD), a DVD, a Blu-ray disk, a volatile memory (e.g., Random Access Memory (RAM) of any type, etc.), or a non-volatile memory (e.g., FLASH memory, an HDD, etc.) associated with processor circuitry located in one or more hardware devices, but the entire program and/or parts thereof could alternatively be executed by one or more hardware devices other than the processor circuitry and/or embodied in firmware or dedicated hardware. The tangible machine-readable instructions may be distributed across multiple hardware devices and/or executed by two or more hardware devices (e.g., a server and a client hardware device). For example, the client hardware device may be implemented by an endpoint client hardware device (e.g., a hardware device associated with a user) or an intermediate client hardware device (e.g., a radio access network (RAN) gateway that may facilitate communication between a server and an endpoint client hardware device). Similarly, the non-transitory computer readable storage media may include one or more mediums located in one or more hardware devices. 
Further, although the example program is described with reference to the flowcharts illustrated in Figures 3 and 4, many other methods of implementing the example computing system may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., processor circuitry, discrete and/or integrated analog and/or digital circuitry, an FPGA, an ASIC, a comparator, an operational-amplifier (op-amp), a logic circuit, etc.) structured to perform the corresponding operation without executing software or firmware. The processor circuitry may be distributed in different network locations and/or local to one or more hardware devices (e.g., a single-core processor (e.g., a single core central processor unit (CPU)), a multi-core processor (e.g., a multi-core CPU), etc.) in a single machine, multiple processors distributed across multiple servers of a server rack, multiple processors distributed across one or more server racks, a CPU and/or an FPGA located in the same package (e.g., the same integrated circuit (IC) package or in two or more separate housings, etc.).

[0077] The machine-readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine readable instructions as described herein may be stored as data or a data structure (e.g., as portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine-readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers) located at the same or different locations of a network or collection of networks (e.g., in the cloud, in edge devices, etc.). The machine-readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc., in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine-readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and/or stored on separate computing devices, wherein the parts when decrypted, decompressed, and/or combined form a set of machine executable instructions that implement one or more operations that may together form a program such as that described herein.

[0078] In another example, the machine-readable instructions may be stored in a state in which they may be read by processor circuitry, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc., in order to execute the machine-readable instructions on a particular computing device or other device. In another example, the machine-readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine-readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, machine readable media, as used herein, may include machine readable instructions and/or program(s) regardless of the particular format or state of the machine-readable instructions and/or program(s) when stored or otherwise at rest or in transit.

[0079] The machine-readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine-readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.

[0080] As mentioned above, the example operations of Figures 3 and 4 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on one or more non-transitory computer and/or machine readable media such as optical storage devices, magnetic storage devices, an HDD, a flash memory, a read-only memory (ROM), a CD, a DVD, a cache, a RAM of any type, a register, and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the terms non-transitory computer readable medium and non-transitory computer readable storage medium are expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.

[0081] “Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc., may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the term “comprising” and “including” are open ended. The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, or (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. 
Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B.

[0082] As used herein, singular references (e.g., “a”, “an”, “first”, “second”, etc.) do not exclude a plurality. The term “a” or “an” object, as used herein, refers to one or more of that object. The terms “a” (or “an”), “one or more”, and “at least one” are used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements or method actions may be implemented by, e.g., the same entity or object. Additionally, although individual features may be included in different examples or claims, these may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.

[0083] Figure 5 is a block diagram of an example processor platform 1000 structured to execute and/or instantiate the machine-readable instructions and/or operations of Figures 1-4. The processor platform 1000 can be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad™), a personal digital assistant (PDA), an Internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a gaming console, a personal video recorder, a set top box, a headset (e.g., an augmented reality (AR) headset, a virtual reality (VR) headset, etc.) or other wearable device, or any other type of computing device.

[0084] The processor platform 1000 of the illustrated example includes processor circuitry 1012. The processor circuitry 1012 of the illustrated example is hardware. For example, the processor circuitry 1012 can be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The processor circuitry 1012 may be implemented by one or more semiconductor based (e.g., silicon based) devices. In this example, the processor circuitry 1012 implements the example processor circuitry to implement DL compiler 108.

[0085] The processor circuitry 1012 of the illustrated example includes a local memory 1013 (e.g., a cache, registers, etc.). The processor circuitry 1012 of the illustrated example is in communication with a main memory including a volatile memory 1014 and a non-volatile memory 1016 by a bus 1018. The volatile memory 1014 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®), and/or any other type of RAM device. The non-volatile memory 1016 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 1014, 1016 of the illustrated example is controlled by a memory controller 1017.

[0086] The processor platform 1000 of the illustrated example also includes interface circuitry 1020. The interface circuitry 1020 may be implemented by hardware in accordance with any type of interface standard, such as an Ethernet interface, a universal serial bus (USB) interface, a Bluetooth® interface, a near field communication (NFC) interface, a PCI interface, and/or a PCIe interface.

[0087] In the illustrated example, one or more input devices 1022 are connected to the interface circuitry 1020. The input device(s) 1022 permit(s) a user to enter data and/or commands into the processor circuitry 1012. The input device(s) 1022 can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a trackpad, a trackball, an isopoint device, and/or a voice recognition system.

[0088] One or more output devices 1024 are also connected to the interface circuitry 1020 of the illustrated example. The output devices 1024 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube (CRT) display, an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer, and/or speaker. The interface circuitry 1020 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip, and/or graphics processor circuitry such as a GPU.

[0089] The interface circuitry 1020 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) by a network 1026. The communication can be by, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, an optical connection, etc.

[0090] The processor platform 1000 of the illustrated example also includes one or more mass storage devices 1028 to store software and/or data. Examples of such mass storage devices 1028 include magnetic storage devices, optical storage devices, floppy disk drives, HDDs, CDs, Blu-ray disk drives, redundant array of independent disks (RAID) systems, solid state storage devices such as flash memory devices, and DVD drives.

[0091] The machine executable instructions 1032, which may be implemented by the machine-readable instructions of Figures 1-4, may be stored in the mass storage device 1028, in the volatile memory 1014, in the non-volatile memory 1016, and/or on a removable non-transitory computer readable storage medium such as a CD or DVD.

[0092] FIG. 6 is a block diagram of an example implementation of the processor circuitry 1012 of FIG. 5. In this example, the processor circuitry 1012 of FIG. 6 is implemented by a microprocessor 1100. For example, the microprocessor 1100 may implement multi-core hardware circuitry such as a CPU, a DSP, a GPU, an XPU, etc. Although it may include any number of example cores 1102 (e.g., 1 core), the microprocessor 1100 of this example is a multicore semiconductor device including N cores. The cores 1102 of the microprocessor 1100 may operate independently or may cooperate to execute machine readable instructions. For example, machine code corresponding to a firmware program, an embedded software program, or a software program may be executed by one of the cores 1102 or may be executed by multiple ones of the cores 1102 at the same or different times. In some examples, the machine code corresponding to the firmware program, the embedded software program, or the software program is split into threads and executed in parallel by two or more of the cores 1102. The software program may correspond to a portion or all the machine-readable instructions and/or operations represented by the flowcharts of Figures 3 and 4.

[0093] The cores 1102 may communicate by an example bus 1104. In some examples, the bus 1104 may implement a communication bus to effectuate communication associated with one(s) of the cores 1102. For example, the bus 1104 may implement at least one of an Inter-Integrated Circuit (I2C) bus, a Serial Peripheral Interface (SPI) bus, a PCI bus, or a PCIe bus. Additionally or alternatively, the bus 1104 may implement any other type of computing or electrical bus. The cores 1102 may obtain data, instructions, and/or signals from one or more external devices by example interface circuitry 1106. The cores 1102 may output data, instructions, and/or signals to the one or more external devices by the interface circuitry 1106. Although the cores 1102 of this example include example local memory 1120 (e.g., Level 1 (L1) cache that may be split into an L1 data cache and an L1 instruction cache), the microprocessor 1100 also includes example shared memory 1110 that may be shared by the cores (e.g., Level 2 (L2) cache) for high-speed access to data and/or instructions. Data and/or instructions may be transferred (e.g., shared) by writing to and/or reading from the shared memory 1110. The local memory 1120 of each of the cores 1102 and the shared memory 1110 may be part of a hierarchy of storage devices including multiple levels of cache memory and the main memory (e.g., the main memory 1014, 1016 of FIG. 5). Typically, higher levels of memory in the hierarchy exhibit lower access time and have smaller storage capacity than lower levels of memory. Changes in the various levels of the cache hierarchy are managed (e.g., coordinated) by a cache coherency policy.

[0094] Each core 1102 may be referred to as a CPU, DSP, GPU, etc., or any other type of hardware circuitry. Each core 1102 includes control unit circuitry 1114, arithmetic and logic (AL) circuitry (sometimes referred to as an ALU) 1116, a plurality of registers 1118, the L1 cache in local memory 1120, and an example bus 1122. Other structures may be present. For example, each core 1102 may include vector unit circuitry, single instruction multiple data (SIMD) unit circuitry, load/store unit (LSU) circuitry, branch/jump unit circuitry, floating-point unit (FPU) circuitry, etc. The control unit circuitry 1114 includes semiconductor-based circuits structured to control (e.g., coordinate) data movement within the corresponding core 1102. The AL circuitry 1116 includes semiconductor-based circuits structured to perform one or more mathematic and/or logic operations on the data within the corresponding core 1102. The AL circuitry 1116 of some examples performs integer-based operations. In other examples, the AL circuitry 1116 also performs floating point operations. In yet other examples, the AL circuitry 1116 may include first AL circuitry that performs integer-based operations and second AL circuitry that performs floating point operations. In some examples, the AL circuitry 1116 may be referred to as an Arithmetic Logic Unit (ALU). The registers 1118 are semiconductor-based structures to store data and/or instructions such as results of one or more of the operations performed by the AL circuitry 1116 of the corresponding core 1102. For example, the registers 1118 may include vector register(s), SIMD register(s), general purpose register(s), flag register(s), segment register(s), machine specific register(s), instruction pointer register(s), control register(s), debug register(s), memory management register(s), machine check register(s), etc. The registers 1118 may be arranged in a bank as shown in FIG. 6. Alternatively, the registers 1118 may be organized in any other arrangement, format, or structure including distributed throughout the core 1102 to shorten access time. Bus 1104 may implement at least one of an I2C bus, a SPI bus, a PCI bus, or a PCIe bus.

[0095] Each core 1102 and/or, more generally, the microprocessor 1100 may include additional and/or alternate structures to those shown and described above. For example, one or more clock circuits, one or more power supplies, one or more power gates, one or more cache home agents (CHAs), one or more converged/common mesh stops (CMSs), one or more shifters (e.g., barrel shifter(s)) and/or other circuitry may be present. The microprocessor 1100 is a semiconductor device fabricated to include many transistors interconnected to implement the structures described above in one or more integrated circuits (ICs) contained in one or more packages. The processor circuitry may include and/or cooperate with one or more accelerators. In some examples, accelerators are implemented by logic circuitry to perform certain tasks more quickly and/or efficiently than can be done by a general-purpose processor. Examples of accelerators include ASICs and FPGAs such as those discussed herein. A GPU or other programmable device can also be an accelerator. Accelerators may be on-board the processor circuitry, in the same chip package as the processor circuitry and/or in one or more separate packages from the processor circuitry.

[0096] FIG. 7 is a block diagram of another example implementation of the processor circuitry 1012 of FIG. 5. In this example, the processor circuitry 1012 is implemented by FPGA circuitry 1200. The FPGA circuitry 1200 can be used, for example, to perform operations that could otherwise be performed by the example microprocessor 1100 of FIG. 6 executing corresponding machine-readable instructions. However, once configured, the FPGA circuitry 1200 instantiates the machine-readable instructions in hardware and, thus, can often execute the operations faster than they could be performed by a general-purpose microprocessor executing the corresponding software.

[0097] More specifically, in contrast to the microprocessor 1100 of FIG. 6 described above (which is a general purpose device that may be programmed to execute some or all of the machine readable instructions represented by the flowchart of Figure 4 but whose interconnections and logic circuitry are fixed once fabricated), the FPGA circuitry 1200 of the example of FIG. 7 includes interconnections and logic circuitry that may be configured and/or interconnected in different ways after fabrication to instantiate, for example, some or all of the machine readable instructions represented by the flowcharts of Figures 3 and 4. In particular, the FPGA 1200 may be thought of as an array of logic gates, interconnections, and switches. The switches can be programmed to change how the logic gates are interconnected by the interconnections, effectively forming one or more dedicated logic circuits (unless and until the FPGA circuitry 1200 is reprogrammed). The configured logic circuits enable the logic gates to cooperate in different ways to perform different operations on data received by input circuitry. Those operations may correspond to some or all of the software represented by the flowcharts of Figures 3 and 4. As such, the FPGA circuitry 1200 may be structured to effectively instantiate some or all of the machine-readable instructions of the flowchart of Figure 4 as dedicated logic circuits to perform the operations corresponding to those software instructions in a dedicated manner analogous to an ASIC. Therefore, the FPGA circuitry 1200 may perform the operations corresponding to some or all of the machine-readable instructions of Figure 4 faster than the general-purpose microprocessor can execute the same.

[0098] In the example of FIG. 7, the FPGA circuitry 1200 is structured to be programmed (and/or reprogrammed one or more times) by an end user by a hardware description language (HDL) such as Verilog. The FPGA circuitry 1200 of FIG. 7 includes example input/output (I/O) circuitry 1202 to obtain and/or output data to/from example configuration circuitry 1204 and/or external hardware (e.g., external hardware circuitry) 1206. For example, the configuration circuitry 1204 may implement interface circuitry that may obtain machine readable instructions to configure the FPGA circuitry 1200, or portion(s) thereof. In some such examples, the configuration circuitry 1204 may obtain the machine-readable instructions from a user, a machine (e.g., hardware circuitry (e.g., programmed or dedicated circuitry) that may implement an Artificial Intelligence/Machine Learning (AI/ML) model to generate the instructions), etc. In some examples, the external hardware 1206 may implement the microprocessor 1100 of FIG. 6. The FPGA circuitry 1200 also includes an array of example logic gate circuitry 1208, a plurality of example configurable interconnections 1210, and example storage circuitry 1212. The logic gate circuitry 1208 and interconnections 1210 are configurable to instantiate one or more operations that may correspond to at least some of the machine-readable instructions of Figures 3 and 4 and/or other desired operations. The logic gate circuitry 1208 shown in FIG. 7 is fabricated in groups or blocks. Each block includes semiconductor-based electrical structures that may be configured into logic circuits. In some examples, the electrical structures include logic gates (e.g., AND gates, OR gates, NOR gates, etc.) that provide basic building blocks for logic circuits. Electrically controllable switches (e.g., transistors) are present within each of the logic gate circuitry 1208 to enable configuration of the electrical structures and/or the logic gates to form circuits to perform desired operations. The logic gate circuitry 1208 may include other electrical structures such as look-up tables (LUTs), registers (e.g., flip-flops or latches), multiplexers, etc.

[0099] The interconnections 1210 of the illustrated example are conductive pathways, traces, vias, or the like that may include electrically controllable switches (e.g., transistors) whose state can be changed by programming (e.g., using an HDL instruction language) to activate or deactivate one or more connections between one or more of the logic gate circuitry 1208 to program desired logic circuits.

[00100] The storage circuitry 1212 of the illustrated example is structured to store result(s) of the one or more of the operations performed by corresponding logic gates. The storage circuitry 1212 may be implemented by registers or the like. In the illustrated example, the storage circuitry 1212 is distributed amongst the logic gate circuitry 1208 to facilitate access and increase execution speed.

[00101] The example FPGA circuitry 1200 of FIG. 7 also includes example Dedicated Operations Circuitry 1214. In this example, the Dedicated Operations Circuitry 1214 includes special purpose circuitry 1216 that may be invoked to implement commonly used functions to avoid the need to program those functions in the field. Examples of such special purpose circuitry 1216 include memory (e.g., DRAM) controller circuitry, PCIe controller circuitry, clock circuitry, transceiver circuitry, memory, and multiplier-accumulator circuitry. Other types of special purpose circuitry may be present. In some examples, the FPGA circuitry 1200 may also include example general purpose programmable circuitry 1218 such as an example CPU 1220 and/or an example DSP 1222. Other general purpose programmable circuitry 1218 may additionally or alternatively be present such as a GPU, an XPU, etc., that can be programmed to perform other operations.

[00102] Although FIGS. 6 and 7 illustrate two example implementations of the processor circuitry 1012 of FIG. 5, many other approaches are contemplated. For example, as mentioned above, modern FPGA circuitry may include an on-board CPU, such as one or more of the example CPU 1220 of FIG. 7. Therefore, the processor circuitry 1012 of FIG. 5 may additionally be implemented by combining the example microprocessor 1100 of FIG. 6 and the example FPGA circuitry 1200 of FIG. 7. In some such hybrid examples, a first portion of the machine-readable instructions represented by the flowchart of Figure 4 may be executed by one or more of the cores 1102 of FIG. 6 and a second portion of the machine-readable instructions represented by the flowcharts of Figures 3 and 4 may be executed by the FPGA circuitry 1200 of FIG. 7.

[00103] In some examples, the processor circuitry 1012 of FIG. 5 may be in one or more packages. For example, the microprocessor 1100 of FIG. 6 and/or the FPGA circuitry 1200 of FIG. 7 may be in one or more packages. In some examples, an XPU may be implemented by the processor circuitry 1012 of FIG. 5, which may be in one or more packages. For example, the XPU may include a CPU in one package, a DSP in another package, a GPU in yet another package, and an FPGA in still yet another package.

[00104] A block diagram illustrating an example software distribution platform 1305 to distribute software such as the example machine readable instructions 1032 of FIG. 5 to hardware devices owned and/or operated by third parties is illustrated in FIG. 8. The example software distribution platform 1305 may be implemented by any computer server, data facility, cloud service, etc., capable of storing and transmitting software to other computing devices. The third parties may be customers of the entity owning and/or operating the software distribution platform 1305. For example, the entity that owns and/or operates the software distribution platform 1305 may be a developer, a seller, and/or a licensor of software such as the example machine readable instructions 1032 of FIG. 5. The third parties may be consumers, users, retailers, OEMs, etc., who purchase and/or license the software for use and/or re-sale and/or sublicensing. In the illustrated example, the software distribution platform 1305 includes one or more servers and one or more storage devices. The storage devices store the machine-readable instructions 1032, which may correspond to the example machine readable instructions, as described above. The one or more servers of the example software distribution platform 1305 are in communication with a network 1310, which may correspond to any one or more of the Internet and/or any of the example networks, etc., described above. In some examples, the one or more servers are responsive to requests to transmit the software to a requesting party as part of a commercial transaction. Payment for the delivery, sale, and/or license of the software may be handled by the one or more servers of the software distribution platform and/or by a third-party payment entity. The servers enable purchasers and/or licensors to download the machine-readable instructions 1032 from the software distribution platform 1305. For example, the software, which may correspond to the example machine readable instructions described above, may be downloaded to the example processor platform 1300, which is to execute the machine-readable instructions 1032 to implement the methods described above and associated computing system 101. In some examples, one or more servers of the software distribution platform 1305 periodically offer, transmit, and/or force updates to the software (e.g., the example machine readable instructions 1032 of FIG. 5) to ensure improvements, patches, updates, etc., are distributed and applied to the software at the end user devices.

[00105] In some examples, an apparatus includes means for data processing of Figures 1-4. For example, the means for processing may be implemented by processor circuitry, firmware circuitry, etc. In some examples, the processor circuitry may be implemented by machine executable instructions executed by processor circuitry, which may be implemented by the example processor circuitry 1012 of FIG. 5, the example microprocessor 1100 of FIG. 6, and/or the example Field Programmable Gate Array (FPGA) circuitry 1200 of FIG. 7. In other examples, the processor circuitry is implemented by other hardware logic circuitry, hardware implemented state machines, and/or any other combination of hardware, software, and/or firmware. For example, the processor circuitry may be implemented by at least one or more hardware circuits (e.g., processor circuitry, discrete and/or integrated analog and/or digital circuitry, an FPGA, an Application Specific Integrated Circuit (ASIC), a comparator, an operational-amplifier (op-amp), a logic circuit, etc.) structured to perform the corresponding operation without executing software or firmware, but other structures are likewise appropriate.

[00106] From the foregoing, it will be appreciated that example systems, methods, apparatus, and articles of manufacture have been disclosed that provide improved performance for a compiler in a computing system. The disclosed systems, methods, apparatus, and articles of manufacture improve the performance of implementing a compiler in a computing system. The disclosed systems, methods, apparatus, and articles of manufacture are accordingly directed to one or more improvement(s) in the operation of a machine such as a computer or other electronic and/or mechanical device.

[00107] The following examples pertain to further embodiments. Specifics in the examples may be used anywhere in one or more embodiments. Example 1 is a computing system including memory circuitry to store instructions and a deep neural network (DNN) computation subgraph; and a processor coupled to the memory circuitry to execute the instructions to transform a current operation of the DNN computation subgraph to a nested parallel-for loop instruction for the current operation; block the nested parallel-for loop instruction to create a one-to-one mapping between parallel subtasks of the nested parallel-for loop instruction with threads; and mark a parallel-for loop instruction of the nested parallel-for loop instruction of the current operation and a parallel-for loop instruction of a next operation of the DNN computation subgraph as linkable if both the current operation and the next operation are parallelized along a same data dimension at a top level of the DNN computation subgraph and with a same blocking factor.
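The blocking and linkable-marking conditions of Example 1 can be illustrated with a minimal sketch. The record type `ParallelFor`, the helper names, and the choice of a ceiling-division blocking factor are assumptions made for illustration only; they are not taken from the disclosure, which does not prescribe a particular data structure or blocking formula.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class ParallelFor:
    """Hypothetical record for the top-level parallel-for loop of one operation."""
    op_name: str
    dim: str      # data dimension parallelized at the top level, e.g. "batch"
    extent: int   # number of iterations along that dimension
    block: int    # blocking factor: iterations handled by one subtask


def block_for_threads(op_name, dim, extent, num_threads):
    """Choose a blocking factor so that there is at most one subtask per thread."""
    return ParallelFor(op_name, dim, extent, ceil(extent / num_threads))


def subtask_ranges(pf):
    """Return the [start, stop) range of each subtask; list index == thread id."""
    return [(s, min(s + pf.block, pf.extent)) for s in range(0, pf.extent, pf.block)]


def is_linkable(cur, nxt):
    """Linkable if both loops use the same data dimension and blocking factor."""
    return cur.dim == nxt.dim and cur.block == nxt.block


matmul = block_for_threads("matmul", "batch", 64, num_threads=4)
relu = block_for_threads("relu", "batch", 64, num_threads=4)
softmax = block_for_threads("softmax", "channel", 64, num_threads=4)
```

In this sketch, `matmul` and `relu` are linkable because subtask `k` of each touches the same slice of the `batch` dimension, so dispatching both to thread `k` may let the second operation reuse data the first left in that core's cache. `softmax` is blocked along a different dimension and is not linkable to either.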

[00108] In Example 2, the subject matter of Example 1 optionally includes the processor to perform transforming, blocking and marking for all operations of the DNN computation subgraph in topological order. In Example 3, the subject matter of Example 1 optionally includes the processor to set a parallel-for group to include a first parallel-for loop instruction and merge a next parallel-for loop instruction to the parallel-for group if the next parallel-for loop instruction is linkable to any parallel-for loop instructions in the parallel-for group. In Example 4, the subject matter of Example 3 optionally includes, if the next parallel-for loop instruction is not linkable to any parallel-for loop instructions in the parallel-for group, the processor to flatten one or more parallel-for loop instructions of the parallel-for group to a one-dimensional thread pool (TP) parallel-for loop instruction. In Example 5, the subject matter of Example 4 optionally includes the processor to replace a global barrier with a group level barrier for parallel-for loop instructions of the parallel-for group. In Example 6, the subject matter of Example 5 optionally includes the processor to remove one or more cleanup functions between parallel-for loop instructions of the parallel-for group. In Example 7, the subject matter of Example 6 optionally includes the processor to generate deep learning (DL) instructions representing a DL model from the parallel-for loop instructions. In Example 8, the subject matter of Example 1 optionally includes the processor to dispatch linked parallel subtasks of the nested parallel-for loop instruction to a same thread.
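The grouping rule of Examples 3 and 4 can be sketched as a single pass over the parallel-for loops in topological order. The `Loop` tuple and the function names are hypothetical, and closing a group here merely stands in for the point at which the group would be flattened; this is a sketch under those assumptions, not the disclosed implementation.

```python
from typing import List, NamedTuple


class Loop(NamedTuple):
    """Hypothetical summary of one top-level parallel-for loop."""
    name: str
    dim: str    # data dimension parallelized at the top level
    block: int  # blocking factor


def linkable(a: Loop, b: Loop) -> bool:
    # Same data dimension and same blocking factor.
    return a.dim == b.dim and a.block == b.block


def form_groups(loops: List[Loop]) -> List[List[Loop]]:
    """Merge each loop into the current group if it is linkable to any member.

    Otherwise the current group is closed (the point at which it would be
    flattened) and the loop starts a new group.
    """
    groups: List[List[Loop]] = []
    current: List[Loop] = []
    for loop in loops:
        if not current or any(linkable(loop, member) for member in current):
            current.append(loop)
        else:
            groups.append(current)
            current = [loop]
    if current:
        groups.append(current)
    return groups


loops = [
    Loop("conv", "n", 8),
    Loop("relu", "n", 8),
    Loop("reduce", "c", 4),
    Loop("scale", "c", 4),
]
group_names = [[loop.name for loop in group] for group in form_groups(loops)]
```

With these inputs, `conv` and `relu` form one group and `reduce` and `scale` form another, because `reduce` is not linkable to any member of the first group.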

[00109] Example 9 is a method including transforming a current operation of a deep neural network (DNN) computation subgraph to a nested parallel-for loop instruction for the current operation; blocking the nested parallel-for loop instruction to create a one-to-one mapping between parallel subtasks of the nested parallel-for loop instruction with threads; and marking a parallel-for loop instruction of the nested parallel-for loop instruction of the current operation and a parallel-for loop instruction of a next operation of the DNN computation subgraph as linkable if both the current operation and the next operation are parallelized along a same data dimension at a top level of the DNN computation subgraph and with a same blocking factor.

[00110] In Example 10, the subject matter of Example 9 optionally includes performing the transforming, blocking and marking for all operations of the DNN computation subgraph in topological order. In Example 11, the subject matter of Example 9 optionally includes setting a parallel-for group to include a first parallel-for loop instruction and merging a next parallel-for loop instruction to the parallel-for group if the next parallel-for loop instruction is linkable to any parallel-for loop instructions in the parallel-for group. In Example 12, the subject matter of Example 11 optionally includes, if the next parallel-for loop instruction is not linkable to any parallel-for loop instructions in the parallel-for group, flattening one or more parallel-for loop instructions of the parallel-for group to a one-dimensional thread pool (TP) parallel-for loop instruction. In Example 13, the subject matter of Example 12 optionally includes replacing a global barrier with a group level barrier for parallel-for loop instructions of the parallel-for group. In Example 14, the subject matter of Example 13 optionally includes removing one or more cleanup functions between parallel-for loop instructions of the parallel-for group. In Example 15, the subject matter of Example 14 optionally includes generating deep learning (DL) instructions representing a DL model from the parallel-for loop instructions.
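The flattening step of Example 12 can be pictured as mapping a two-level loop nest onto a single task index. The sketch below assumes a simple row-major index mapping and uses Python's standard `ThreadPoolExecutor` as a stand-in for the thread pool (TP) interface; the function names are hypothetical, and unlike the scheme described above, this executor does not guarantee which worker thread runs a given task.

```python
from concurrent.futures import ThreadPoolExecutor


def flatten_index(outer, inner, inner_extent):
    """Map a nested (outer, inner) iteration to a one-dimensional task id."""
    return outer * inner_extent + inner


def unflatten_index(task, inner_extent):
    """Recover (outer, inner) from a one-dimensional task id."""
    return divmod(task, inner_extent)


def run_flattened(body, outer_extent, inner_extent):
    """Run a two-level parallel-for nest as one 1-D parallel-for over a thread pool."""
    total = outer_extent * inner_extent

    def task(t):
        outer, inner = unflatten_index(t, inner_extent)
        return body(outer, inner)

    with ThreadPoolExecutor(max_workers=total) as pool:
        # pool.map yields results in task-id order regardless of completion order.
        return list(pool.map(task, range(total)))


results = run_flattened(lambda outer, inner: (outer, inner), 2, 3)
```

Each task id corresponds to exactly one `(outer, inner)` pair, so the one-dimensional loop performs the same work as the nest while exposing a single level of parallelism to the pool.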

[00111] Example 16 is at least one machine-readable storage medium comprising instructions which, when executed by at least one processor, cause the at least one processor to transform a current operation of a deep neural network (DNN) computation subgraph to a nested parallel-for loop instruction for the current operation; block the nested parallel-for loop instruction to create a one-to-one mapping between parallel subtasks of the nested parallel-for loop instruction with threads; and mark a parallel-for loop instruction of the nested parallel-for loop instruction of the current operation and a parallel-for loop instruction of a next operation of the DNN computation subgraph as linkable if both the current operation and the next operation are parallelized along a same data dimension at a top level of the DNN computation subgraph and with a same blocking factor.

[00112] In Example 17, the subject matter of Example 16 optionally includes instructions which, when executed by at least one processor, cause the at least one processor to perform transforming, blocking and marking for all operations of the DNN computation subgraph in topological order. In Example 18, the subject matter of Example 16 optionally includes instructions which, when executed by at least one processor, cause the at least one processor to set a parallel-for group to include a first parallel-for loop instruction and merge a next parallel-for loop instruction to the parallel-for group if the next parallel-for loop instruction is linkable to any parallel-for loop instructions in the parallel-for group. In Example 19, the subject matter of Example 18 optionally includes instructions which, when executed by at least one processor, cause the at least one processor to, if the next parallel-for loop instruction is not linkable to any parallel-for loop instructions in the parallel-for group, flatten one or more parallel-for loop instructions of the parallel-for group to a one-dimensional thread pool (TP) parallel-for loop instruction. In Example 20, the subject matter of Example 19 optionally includes instructions which, when executed by at least one processor, cause the at least one processor to replace a global barrier with a group level barrier for parallel-for loop instructions of the parallel-for group.

[00113] Example 21 is an apparatus operative to perform the method of any one of Examples 9 to 15. Example 22 is an apparatus that includes means for performing the method of any one of Examples 9 to 15. Example 23 is an apparatus that includes any combination of modules and/or units and/or logic and/or circuitry and/or means operative to perform the method of any one of Examples 9 to 15. Example 24 is an optionally non-transitory and/or tangible machine-readable medium, which optionally stores or otherwise provides instructions that if and/or when executed by a computer system or other machine are operative to cause the machine to perform the method of any one of Examples 9 to 15.

[00114] Although certain example systems, methods, apparatus, and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all systems, methods, apparatus, and articles of manufacture fairly falling within the scope of the examples of this patent.

Claims

What is claimed is:

1. A computing system comprising: memory circuitry to store instructions and a deep neural network (DNN) computation subgraph; and a processor coupled to the memory circuitry to execute the instructions to: transform a current operation of the DNN computation subgraph to a nested parallel-for loop instruction for the current operation; block the nested parallel-for loop instruction to create a one-to-one mapping between parallel subtasks of the nested parallel-for loop instruction with threads; and mark a parallel-for loop instruction of the nested parallel-for loop instruction of the current operation and a parallel-for loop instruction of a next operation of the DNN computation subgraph as linkable if both the current operation and the next operation are parallelized along a same data dimension at a top level of the DNN computation subgraph and with a same blocking factor.

2. The computing system of claim 1, comprising the processor to perform transforming, blocking and marking for all operations of the DNN computation subgraph in topological order.

3. The computing system of claim 1, comprising the processor to set a parallel-for group to include a first parallel-for loop instruction and merge a next parallel-for loop instruction to the parallel-for group if the next parallel-for loop instruction is linkable to any parallel-for loop instructions in the parallel-for group.

4. The computing system of claim 3, comprising, if the next parallel-for loop instruction is not linkable to any parallel-for loop instructions in the parallel-for group, the processor to flatten one or more parallel-for loop instructions of the parallel-for group to a one-dimensional thread pool (TP) parallel-for loop instruction.

5. The computing system of claim 4, comprising the processor to replace a global barrier with a group level barrier for parallel-for loop instructions of the parallel-for group.

6. The computing system of claim 5, comprising the processor to remove one or more cleanup functions between parallel-for loop instructions of the parallel-for group.

7. The computing system of claim 6, comprising the processor to generate deep learning (DL) instructions representing a DL model from the parallel-for loop instructions.

8. The computing system of claim 1, comprising the processor to dispatch linked parallel subtasks of the nested parallel-for loop instruction to a same thread.

9. A method comprising: transforming a current operation of a deep neural network (DNN) computation subgraph to a nested parallel-for loop instruction for the current operation; blocking the nested parallel-for loop instruction to create a one-to-one mapping between parallel subtasks of the nested parallel-for loop instruction with threads; and marking a parallel-for loop instruction of the nested parallel-for loop instruction of the current operation and a parallel-for loop instruction of a next operation of the DNN computation subgraph as linkable if both the current operation and the next operation are parallelized along a same data dimension at a top level of the DNN computation subgraph and with a same blocking factor.

10. The method of claim 9, comprising performing the transforming, blocking and marking for all operations of the DNN computation subgraph in topological order.

11. The method of claim 9, comprising setting a parallel-for group to include a first parallel-for loop instruction and merging a next parallel-for loop instruction to the parallel-for group if the next parallel-for loop instruction is linkable to any parallel-for loop instructions in the parallel-for group.

12. The method of claim 11, comprising, if the next parallel-for loop instruction is not linkable to any parallel-for loop instructions in the parallel-for group, flattening one or more parallel-for loop instructions of the parallel-for group to a one-dimensional thread pool (TP) parallel-for loop instruction.

13. The method of claim 12, comprising replacing a global barrier with a group level barrier for parallel-for loop instructions of the parallel-for group.

14. The method of claim 13, comprising removing one or more cleanup functions between parallel-for loop instructions of the parallel-for group.

15. The method of claim 14, comprising generating deep learning (DL) instructions representing a DL model from the parallel-for loop instructions.

16. At least one machine-readable storage medium comprising instructions which, when executed by at least one processor, cause the at least one processor to: transform a current operation of a deep neural network (DNN) computation subgraph to a nested parallel-for loop instruction for the current operation; block the nested parallel-for loop instruction to create a one-to-one mapping between parallel subtasks of the nested parallel-for loop instruction with threads; and mark a parallel-for loop instruction of the nested parallel-for loop instruction of the current operation and a parallel-for loop instruction of a next operation of the DNN computation subgraph as linkable if both the current operation and the next operation are parallelized along a same data dimension at a top level of the DNN computation subgraph and with a same blocking factor.

17. The at least one machine-readable storage medium of claim 16, comprising instructions which, when executed by at least one processor, cause the at least one processor to perform transforming, blocking and marking for all operations of the DNN computation subgraph in topological order.

18. The at least one machine-readable storage medium of claim 16, comprising instructions which, when executed by at least one processor, cause the at least one processor to set a parallel-for group to include a first parallel-for loop instruction and merge a next parallel-for loop instruction to the parallel-for group if the next parallel-for loop instruction is linkable to any parallel-for loop instructions in the parallel-for group.

19. The at least one machine-readable storage medium of claim 18, comprising instructions which, when executed by at least one processor, cause the at least one processor to, if the next parallel-for loop instruction is not linkable to any parallel-for loop instructions in the parallel-for group, flatten one or more parallel-for loop instructions of the parallel-for group to a one-dimensional thread pool (TP) parallel-for loop instruction.

20. The at least one machine-readable storage medium of claim 19, comprising instructions which, when executed by at least one processor, cause the at least one processor to replace a global barrier with a group level barrier for parallel-for loop instructions of the parallel-for group.