CN116339704A - Method and apparatus for machine learning guided compiler optimization - Google Patents


Info

Publication number
CN116339704A
Authority
CN
China
Prior art keywords
source code
circuit
search tree
code
machine learning
Prior art date
Legal status
Pending
Application number
CN202211462568.6A
Other languages
Chinese (zh)
Inventor
Anand Venkat
Justin Gottschlich
Niranjan Hasabnis
Current Assignee
Intel Corp
Original Assignee
Intel Corp
Priority date
Filing date
Publication date
Application filed by Intel Corp filed Critical Intel Corp
Publication of CN116339704A

Classifications

    • G06F 8/33 Intelligent editors (G Physics; G06 Computing; calculating or counting; G06F Electric digital data processing; G06F 8/00 Arrangements for software engineering; G06F 8/30 Creation or generation of source code)
    • G06N 3/08 Learning methods (G06N Computing arrangements based on specific computational models; G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks)
    • G06F 8/441 Register allocation; assignment of physical memory space to logical memory space (G06F 8/40 Transformation of program code; G06F 8/41 Compilation; G06F 8/44 Encoding)
    • G06F 8/443 Optimisation (G06F 8/40 Transformation of program code; G06F 8/41 Compilation; G06F 8/44 Encoding)
    • G06N 5/01 Dynamic search techniques; heuristics; dynamic trees; branch-and-bound (G06N 5/00 Computing arrangements using knowledge-based models)
    • G06N 3/045 Combinations of networks (G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)


Abstract

Methods and apparatus for machine learning guided compiler optimization are disclosed. Examples disclosed herein include a non-transitory computer-readable medium comprising instructions that, when executed, cause a machine to at least: select a register-based compiler transformation to apply to source code at a current location in a search tree; determine whether the search tree requires pruning based on an output of a query to a machine learning (ML) model; prune the search tree at the current location in response to determining that the search tree requires pruning; generate a code variant in response to applying the selected register-based compiler transformation to the source code; calculate a score associated with the source code at the current location in the search tree; and update parameters of the ML model to include the calculated score.

Description

Method and apparatus for machine learning guided compiler optimization
Technical Field
The present disclosure relates generally to compiler optimization, and more particularly, to a method and apparatus for machine learning guided compiler optimization for register-based hardware architecture.
Background
In recent years, the field of software development has progressed rapidly. In general, achieving peak performance for high performance computing (HPC) and machine learning (ML) applications is a primary goal of automated software development on modern central processing unit (CPU) architectures. Register-based compiler optimizations (e.g., scalar substitution, unroll-and-jam, etc.) can be utilized to improve application performance on CPU architectures.
Disclosure of Invention
One aspect of the present disclosure provides a computer-readable medium. The computer-readable medium includes instructions that, when executed, cause a machine to at least: select a register-based compiler transformation to apply to source code at a current location in a search tree; determine whether the search tree requires pruning based on an output of a query to a machine learning (ML) model; prune the search tree at the current location in response to determining that the search tree requires pruning; generate a code variant in response to applying the selected register-based compiler transformation to the source code; calculate a score associated with the source code at the current location in the search tree; and update parameters of the ML model to include the calculated score.
Another aspect of the present disclosure provides a method of performing machine learning guided compiler optimization for a register-based hardware architecture. The method comprises the following steps: selecting a register-based compiler transformation to apply to source code at a current location in a search tree; determining whether the search tree requires pruning based on an output of a query to a machine learning (ML) model; pruning the search tree at the current location in response to determining that the search tree requires pruning; generating a code variant in response to applying the selected register-based compiler transformation to the source code; calculating a score associated with the source code at the current location; and updating parameters of the ML model to include the calculated score.
Another aspect of the present disclosure provides an apparatus for performing machine learning guided compiler optimization for a register-based hardware architecture. The apparatus comprises: an interface circuit; and processor circuitry including one or more of: at least one of a central processing unit, a graphics processing unit, or a digital signal processor having control circuitry to control movement of data within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions in the apparatus, and one or more registers to store results of the one or more first operations; a field programmable gate array (FPGA) including logic gate circuitry, a plurality of configurable interconnections, and storage circuitry, the logic gate circuitry and interconnections to perform one or more second operations and the storage circuitry to store results of the one or more second operations; or an application specific integrated circuit (ASIC) including logic gate circuitry to perform one or more third operations. The processor circuitry performs at least one of the first operations, the second operations, or the third operations to instantiate: a transformation selection circuit to select a register-based compiler transformation to apply to source code at a current location in a search tree; a search tree pruning circuit to prune the search tree at the current location in response to determining that the search tree requires pruning; a code variant generation circuit to apply the selected register-based compiler transformation to the source code; and a machine learning (ML) model parameter update circuit to calculate a score associated with the source code at the current location in the search tree and update parameters of the ML model to include the calculated score.
Drawings
FIG. 1 is a block diagram of an example implementation of compiler optimization circuitry within an example compiler optimization system to improve application performance on a CPU architecture.
FIG. 2 is a flow chart representing example machine readable instructions executable by an example processor circuit to implement the example compiler optimization system of FIG. 1 in accordance with the teachings of the present disclosure.
FIGS. 3A-3C illustrate example compiler transformations executed on example source code to improve application performance.
FIG. 4 depicts an example dependency vector representation of example source code for 2-dimensional convolution.
FIG. 5 illustrates an example embedding process that utilizes the example dependency vector representation of FIG. 4.
FIG. 6 illustrates example encodings for actions representing compiler optimizations corresponding to edges in an example search tree.
FIG. 7 illustrates an example search tree of code variants and edges annotated with actions corresponding to compiler transformations.
FIG. 8 is a block diagram of an example processing platform including processor circuitry configured to execute the example machine readable instructions of FIG. 2 to implement compiler optimization circuit 104 of FIG. 1.
FIG. 9 is a block diagram of an example implementation of the processor circuit of FIG. 8.
FIG. 10 is a block diagram of another example implementation of the processor circuit of FIG. 8.
FIG. 11 is a block diagram of an example software distribution platform (e.g., one or more servers) for distributing software (e.g., software corresponding to the example machine readable instructions of FIG. 2) to client devices associated with end users and/or consumers (e.g., for license, sale, and/or use), retailers (e.g., for sale, re-sale, license, and/or sub-license), and/or original equipment manufacturers (OEMs) (e.g., for inclusion in products to be distributed to, for example, retailers and/or other end users such as direct-buy customers).
The figures are not to scale; rather, the thickness of a layer or region may be exaggerated in the drawings. Although the figures illustrate layers and regions with sharp lines and boundaries, some or all of these lines and/or boundaries may be idealized. In reality, boundaries and/or lines may be imperceptible, blended, and/or irregular. In general, the same reference numerals are used throughout the drawings and the accompanying written description to refer to the same or like parts. As used herein, unless otherwise indicated, a reference to a connection (e.g., attaching, coupling, connecting, joining) may include intermediate members between the referenced elements and/or relative movement between those elements. Thus, a reference to a connection does not necessarily imply that two elements are directly connected and/or in a fixed relationship to each other. As used herein, stating that any element is "in contact with" another element is defined to mean that there are no intervening elements between the two elements.
Unless specifically stated otherwise, descriptors such as "first," "second," "third," and the like are used herein without imputing or otherwise indicating any meaning of priority, physical order, arrangement in a list, and/or ordering in any way, but are merely used as labels and/or arbitrary names to distinguish elements for ease of understanding the disclosed examples. In some examples, the descriptor "first" may be used in the detailed description to refer to a certain element, while the same element may be referred to in a claim by a different descriptor, such as "second" or "third." In such instances, it should be understood that such descriptors are used merely to explicitly identify those elements that, for example, might otherwise share the same name. As used herein, the phrase "in communication with," including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, does not require direct physical (e.g., wired) communication and/or continuous communication, and further includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events. As used herein, "processor circuit" is defined to include (i) one or more special purpose electrical circuits structured to perform the specified operation(s) and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors), and/or (ii) one or more general purpose semiconductor-based electrical circuits programmed with instructions to perform the specified operation(s) and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors). Examples of processor circuits include programmed microprocessors, field programmable gate arrays (FPGAs) that can instantiate instructions, central processor units (CPUs), graphics processor units (GPUs), digital signal processors (DSPs), XPUs, and microcontrollers and integrated circuits such as application specific integrated circuits (ASICs). For example, an XPU may be implemented by a heterogeneous computing system that includes multiple types of processor circuitry (e.g., one or more FPGAs, one or more CPUs, one or more GPUs, one or more DSPs, etc., and/or combinations thereof) and application programming interface(s) (API(s)) that can assign computing task(s) to whichever of the multiple types of processing circuitry is best suited to execute the computing task(s).
Detailed Description
Artificial intelligence (artificial intelligence, AI), including Machine Learning (ML), deep Learning (DL), and/or other artificial machine-driven logic, enables a machine (e.g., a computer, logic circuitry, etc.) to process input data using a model to generate output based on patterns and/or associations that the model previously learned via a training process. For example, the model may be trained with data to identify patterns and/or associations, and follow such patterns and/or associations as input data is processed, such that other input(s) produce output(s) consistent with the identified patterns and/or associations.
Many different types of machine learning models and/or machine learning architectures exist. In some examples disclosed herein, a convolutional neural network (CNN) model is used. Using a CNN model enables data to be interpreted using weighted importance. In general, a machine learning model/architecture suitable for use in the example methods disclosed herein will be a deep neural network (DNN) in which interconnections are not visible outside the model. However, other types of machine learning models may additionally or alternatively be used, such as recurrent neural networks (RNNs), support vector machines (SVMs), gated recurrent units (GRUs), long short-term memory (LSTM) networks, and so forth.
In general, implementing an ML/AI system involves two phases: a learning/training phase and an inference phase. In the learning/training phase, a training algorithm is used to train the model to operate according to patterns and/or associations based on, for example, training data. Generally, a model includes internal parameters that guide how input data is transformed into output data, such as through a series of nodes and connections within the model. Further, hyperparameters are used as part of the training process to control how learning is performed (e.g., learning rate, number of layers to be used in the machine learning model, etc.). Hyperparameters are defined as training parameters that are determined before initiating the training process.
Based on the type and/or expected output of the ML/AI model, different types of training may be performed. For example, supervised training uses inputs and corresponding expected (e.g., labeled) outputs to select parameters for the ML/AI model (e.g., by iterating over combinations of selected parameters) that reduce model error. As used herein, a label refers to an expected output (e.g., a classification, an expected output value, etc.) of the machine learning model. Alternatively, unsupervised training (e.g., as used in deep learning, a subset of machine learning, etc.) involves inferring patterns from inputs to select parameters for the ML/AI model (e.g., without the benefit of expected (e.g., labeled) outputs).
In examples disclosed herein, the ML/AI model is trained using code variants obtained by applying register-based compiler transforms and their associated scores. However, any other training algorithm may additionally or alternatively be used. In examples disclosed herein, training is performed until a maximum depth of the search tree is reached.
In some examples, the ML/AI model may additionally be trained using scores associated with code variants (e.g., a quantification of main memory accesses, etc.). However, any other training algorithm may additionally or alternatively be used. In some examples, training may be performed on all states of the machine learning model.
Training is performed with hyperparameters that control how learning is performed (e.g., learning rate, number of layers to be used in the machine learning model, etc.).
Training is performed using the training data. In examples disclosed herein, the training data is derived from a local set of source code and/or code variants. However, any type of dataset of source code and/or code variants may be utilized.
Automating software development for modern hardware architectures (e.g., central processing units (CPUs), graphics processing units (GPUs), field programmable gate arrays (FPGAs), etc.) using machine learning (ML) techniques involves training and deploying machine learning models to facilitate identifying code development paths that produce high performance.
Current methods of achieving peak performance for high performance computing (HPC) and machine learning (ML) applications on modern architectures are challenging and require significant effort and time from teams of experts and/or software developers.
Typically, chip vendors, including Intel, publish software libraries, such as the Math Kernel Library (MKL), for efficient implementation of high performance computing (HPC) and machine learning (ML) applications. These libraries focus on small kernels and optimize them efficiently, because the kernels are believed to account for a large share of an application's runtime. However, achieving peak performance may require other optimizations that take into account the application context around multiple kernels (e.g., compiler transformations such as fusion).
Machine learning applications, such as convolutional neural networks (CNNs), have such kernels stacked in large code sequences and require such compiler-based optimizations to achieve state-of-the-art performance. However, compilers have drawbacks as well. The out-of-the-box performance that ML compilers achieve on such applications falls short of the possible peak performance because it is difficult in practice to predict the optimal sequence and/or type of compiler transformations for a particular application, the combinatorial search space of transformations tends to be too large, and optimizations are not portable across architectures and hardware generations.
Example methods and apparatus disclosed herein limit the search space of possible optimizations to compiler transformations that only exploit data reuse from registers, thereby ensuring that the optimized code does not degrade the performance of the original code; formulate an accurate cost metric for transformed code variants by quantifying accesses to main memory; rank code variants with higher reuse of data from registers (e.g., a low cost metric) more highly; and automatically identify optimal transformation sequences at low computational cost using a machine-learning-based tree search algorithm (e.g., Monte Carlo Tree Search (MCTS)). The ability to constrain the search space by focusing only on register-based optimizations, to compute accurate cost metrics for transformed code variants and rank them accordingly, and to identify optimal transformation sequences using an ML model allows scalability across multiple hardware architectures and/or hardware generations, and achieves peak application performance on CPUs at minimal computational cost.
In examples disclosed herein, an open source compilation framework may be utilized, such as Multi-Level Intermediate Representation (MLIR) and/or the Low Level Virtual Machine (LLVM). Further, tensor compilation frameworks, such as TensorFlow and PyTorch, may also be used in the examples disclosed herein. However, in other examples, any other type of compilation framework may be utilized.
In the examples disclosed herein, the Monte Carlo Tree Search (MCTS) algorithm is used as the search method to explore the combinatorial search space of code variants derived from a set of register-based optimizations, because it incorporates randomness into the selection of paths to explore. However, in some examples, any other type of search method may be utilized.
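For background, many MCTS implementations balance exploration against exploitation using the upper confidence bound for trees (UCT) selection rule. The patent does not specify the selection policy used here, so the following is a standard textbook formulation rather than the claimed method:

$$a^{*} = \arg\max_{a \in A(s)} \left( \bar{X}_{a} + C \sqrt{\frac{\ln N(s)}{n_{a}}} \right)$$

where $A(s)$ is the set of actions (e.g., compiler transformations) available at state $s$, $\bar{X}_{a}$ is the mean score observed below child $a$, $n_{a}$ is that child's visit count, $N(s)$ is the parent's visit count, and $C$ is a constant controlling the degree of random exploration.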
In examples disclosed herein, "register-based compiler optimization" may be referred to interchangeably hereinafter as "register optimization," "compiler optimization," "register-based transformation," and/or "compiler transformation." Further, in examples disclosed herein, "state" may be used interchangeably hereinafter with "code variant" to refer to a transformed code segment and/or a node in the search tree.
FIG. 1 illustrates an example implementation of the compiler optimization circuit 104 within an example compiler optimization system 100 to improve application performance on a CPU architecture. The example input code 102 is received by the compiler optimization circuit 104, where it is used by: an example transformation framework circuit 105, which contains an example source code score computation circuit 108, an example transformation selection circuit 120, and an example code variant generation circuit 122; an example machine learning framework circuit 106, which includes an example source code feature extraction circuit 112, an example feature embedding circuit 114, an example profitability calculation circuit 116, and an example machine learning (ML) model parameter update circuit 124; and an example search framework circuit 107, which includes an example search tree state checking circuit 110 and an example search tree pruning circuit 118. In some examples, the input code 102 may be located within a larger database of source code and/or code variants.
The example source code score computation circuit 108 evaluates the input code 102 and determines its associated score. The score determined by the source code score calculation circuit 108 directly corresponds to the number of memory loads and/or stores eliminated from main memory and/or CPU cache by reusing data from registers. In the examples disclosed herein, the memory load and/or store count is inversely proportional to the score of the input code 102 determined by the source code score calculation circuit 108.
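The description above fixes only the direction of the relationship, not a formula. As one illustrative (assumed) reading of "inversely proportional," the score of a code variant $v$ could be written as

$$\mathrm{score}(v) \propto \frac{1}{1 + \mathrm{loads}(v) + \mathrm{stores}(v)}$$

where $\mathrm{loads}(v)$ and $\mathrm{stores}(v)$ count the memory accesses remaining after register reuse; the added 1 is purely an assumption here to avoid division by zero. Under any such formulation, eliminating more loads and/or stores yields a higher score.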
The example search tree state checking circuit 110 keeps track of the current location in the search tree, checks whether the maximum depth has been reached in the tree during the search, and determines whether the current location in the search tree is at the root. In the examples disclosed herein, when the search tree state checking circuit 110 determines that the maximum depth has been reached and the current location is not at the root of the tree, the search tree state checking circuit 110 returns the current location to the parent node in the search tree. Further, in the example disclosed herein, if the search tree status checking circuit 110 determines that the current location is at the root of the search tree, then the process is ended.
The example source code feature extraction circuit 112 evaluates the input code 102 and extracts source code features (e.g., array accesses, loop bounds, scores of code variants, etc.) from each state (e.g., code variant). In the examples disclosed herein, the source code feature extraction circuit 112 parses the source code to extract features; however, in other examples, any type of dimensionality reduction algorithm (e.g., classification) may be used.
The example feature embedding circuit 114 receives the extracted feature dependencies (e.g., array accesses, loop bounds, code variant scores, etc.) from the source code feature extraction circuit 112 and embeds the extracted features into a feature vector representation. In the examples disclosed herein, the feature embedding circuit 114 uses one-hot encoding to embed the extracted features and the scores associated with the code variants into feature vectors; however, any other type of feature embedding algorithm may be utilized.
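As a minimal C sketch of the one-hot step (not taken from the patent; the transformation set, vector layout, and function names below are hypothetical), a categorical choice such as a transformation can be embedded as a vector with a single position set:

    #include <string.h>

    /* Hypothetical set of register-based transformations (illustrative only). */
    enum transform { SCALAR_REPLACE, UNROLL_AND_JAM, FUSION, INTERCHANGE, NUM_TRANSFORMS };

    /* Write a one-hot embedding of transformation t into vec: all zeros
     * except a 1.0 at the index corresponding to t. */
    static void one_hot(float vec[NUM_TRANSFORMS], enum transform t) {
        memset(vec, 0, sizeof(float) * NUM_TRANSFORMS);
        vec[t] = 1.0f;
    }

In practice, such per-feature one-hot segments (plus a slot for the variant's score) would be concatenated into the full feature vector fed to the ML model, per the embedding process of FIG. 5.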
The example profitability calculation circuit 116 utilizes a machine learning (ML) model to correlate the features contained within the feature vector representation generated by the feature embedding circuit 114 with the profitability of exploring sub-states reachable from the current state (i.e., code variant).
The example search tree pruning circuit 118 uses the profitability calculation performed by the profitability calculation circuit 116 and compares it to a threshold to determine whether pruning of the search tree at the current location is required. In examples disclosed herein, the threshold value of profitability is a predetermined value.
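A minimal sketch of this pruning gate in C follows (not from the patent; the names, signatures, and threshold value are all assumptions standing in for circuits 116 and 118):

    struct node;                                               /* search tree node (opaque) */
    extern float ml_profitability(const float *feat, int n);   /* circuit 116 stand-in */
    extern void  tree_prune(struct node *cur);                 /* circuit 118 stand-in */

    #define PRUNE_THRESHOLD 0.25f   /* assumed predetermined value */

    /* Query the ML model for the profitability of exploring below the
     * current node and prune when it falls under the threshold. */
    void maybe_prune(struct node *cur, const float *feature_vec, int feat_len) {
        float p = ml_profitability(feature_vec, feat_len);
        if (p < PRUNE_THRESHOLD)
            tree_prune(cur);
    }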
The example transformation selection circuit 120 uses a selection algorithm (e.g., random selection) to determine which register-based transformation to apply to the received input code 102.
The example code variant generation circuit 122 then applies the selected register-based transformation (e.g., scalar substitution, unroll-and-jam, fusion, loop interchange, etc.) to the input code 102 to generate a new code variant.
In examples disclosed herein, scalar substitution refers to one example of a register-based compiler transformation in which repeatedly referenced array accesses are identified and copied into scalars. The scalar is then used to perform the desired computation, and once the computation is complete, it is copied back into the original array, avoiding unnecessary memory load and/or store operations associated with the array.
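For illustration, a minimal C sketch of scalar substitution is shown below; it is not taken from the patent, and the function names, arrays, and bounds are hypothetical:

    /* Before: a[i] is loaded from and stored to memory on every iteration. */
    void accumulate_before(float *a, const float *b, int i, int n) {
        for (int j = 0; j < n; j++) {
            a[i] += b[j];            /* load a[i], add, store a[i] each time */
        }
    }

    /* After scalar substitution: a[i] is copied into a scalar, which the
     * compiler can keep in a register; the result is written back once. */
    void accumulate_after(float *a, const float *b, int i, int n) {
        float t = a[i];              /* one load */
        for (int j = 0; j < n; j++) {
            t += b[j];               /* register-resident accumulation */
        }
        a[i] = t;                    /* one store back to the array */
    }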
In examples disclosed herein, unroll-and-jam refers to one example of a register-based compiler transformation that facilitates reuse of registers among multiple independent computations referencing the same register. Furthermore, unroll-and-jam facilitates instruction-level parallelism (ILP), i.e., parallel execution of sequential instructions in a program.
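A minimal C sketch of unroll-and-jam (hypothetical, not from the patent) on a matrix-vector product: the outer loop is unrolled by two and the resulting inner loop copies are jammed together, so each load of x[j] serves two independent accumulations:

    enum { N = 128 };                /* assumed even dimension */

    /* Before: x[j] is reloaded for every row i. */
    void matvec(float y[N], const float A[N][N], const float x[N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                y[i] += A[i][j] * x[j];
    }

    /* After unroll-and-jam by 2: one x[j] load feeds two rows, and the two
     * independent multiply-adds expose instruction-level parallelism. */
    void matvec_uaj(float y[N], const float A[N][N], const float x[N]) {
        for (int i = 0; i < N; i += 2) {
            float s0 = y[i], s1 = y[i + 1];
            for (int j = 0; j < N; j++) {
                float xj = x[j];     /* loaded once, reused twice */
                s0 += A[i][j] * xj;
                s1 += A[i + 1][j] * xj;
            }
            y[i] = s0;
            y[i + 1] = s1;
        }
    }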
In the examples disclosed herein, fusion refers to one example of a register-based compiler transformation that enables reuse of array elements referenced in two separate loop nests by fusing the statements of the two loop nests together, thereby avoiding unnecessary memory loads of the array elements.
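A minimal C sketch of fusion (hypothetical, not from the patent):

    /* Before: the first loop stores b[i]; the second loop reloads it. */
    void separate(const float *a, float *b, float *c, int n) {
        for (int i = 0; i < n; i++) b[i] = a[i] * 2.0f;
        for (int i = 0; i < n; i++) c[i] = b[i] + 1.0f;
    }

    /* After fusion: the two loop bodies share one iteration, so the value
     * of b[i] is reused from a register instead of being reloaded. */
    void fused(const float *a, float *b, float *c, int n) {
        for (int i = 0; i < n; i++) {
            float t = a[i] * 2.0f;
            b[i] = t;
            c[i] = t + 1.0f;
        }
    }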
In the examples disclosed herein, loop interchange is one example of a register-based compiler transformation in which loops are reordered into an optimal order to maximize performance (e.g., exchanging an inner loop with an outer loop, etc.).
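A minimal C sketch of loop interchange (hypothetical, not from the patent), exchanging the loops so the innermost loop walks row-major memory contiguously:

    enum { M = 128 };

    /* Before: the inner loop strides across rows (stride-M accesses in C's
     * row-major layout). */
    void column_major(float a[M][M]) {
        for (int j = 0; j < M; j++)
            for (int i = 0; i < M; i++)
                a[i][j] *= 2.0f;
    }

    /* After interchange: unit-stride inner accesses, which are friendlier
     * to caches, prefetchers, and vector registers. */
    void row_major(float a[M][M]) {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++)
                a[i][j] *= 2.0f;
    }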
The example code variant generation circuit 122 applies the register-based compiler transform selected by the transform selection circuit 120 to the input code 102 to generate a new code variant. The code variant generation circuit 122 then updates the input code set 102 to include the new code variant.
When the ML model is in the training mode, the example Machine Learning (ML) model parameter update circuit 124 recursively iterates through the search tree and calculates and updates a score associated with each code variant in the search tree until a root node is reached.
In some examples, the source code score computation circuit 108 of FIG. 1 includes means for computing a score based on a quantification of memory load and/or store operations in source code. For example, the means for computing a score based on a quantification of memory load and/or store operations in source code may be implemented by the source code score computation circuit 108. In some examples, the source code score computation circuit 108 may be implemented by machine-executable instructions, such as those implemented by at least block 213 of FIG. 2, executed by a processor circuit, which may be implemented by the example processor circuit 912 of FIG. 9, the example processor circuit 1000 of FIG. 10, and/or the example field programmable gate array (FPGA) circuit 1100 of FIG. 11. In other examples, the source code score computation circuit 108 is implemented by other hardware logic circuits, a hardware-implemented state machine, and/or any other combination of hardware, software, and/or firmware. For example, the source code score computation circuit 108 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application specific integrated circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the corresponding operations without executing software or firmware, although other configurations may be equally suitable.
In some examples, the search tree state checking circuit 110 of FIG. 1 includes means for keeping track of the current location within the search tree and determining whether the maximum depth and/or root node has been reached. For example, the means for keeping track of the current location within the search tree and determining whether the maximum depth and/or root node has been reached may be implemented by the search tree state checking circuit 110. In some examples, the search tree state checking circuit 110 may be implemented by machine executable instructions, such as those implemented by at least blocks 206, 208, 210 of FIG. 2, executed by a processor circuit, which may be implemented by the example processor circuit 912 of FIG. 9, the example processor circuit 1000 of FIG. 10, and/or the example field programmable gate array (FPGA) circuit 1100 of FIG. 11. In other examples, the search tree state checking circuit 110 is implemented by other hardware logic circuits, a hardware-implemented state machine, and/or any other combination of hardware, software, and/or firmware. For example, the search tree state checking circuit 110 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application specific integrated circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the corresponding operations without executing software or firmware, although other configurations are equally suitable.
In some examples, the source code feature extraction circuit 112 includes means for extracting feature dependencies from source code (e.g., input code 102 from fig. 1). For example, means for extracting feature dependencies from source code (e.g., input code 102 from FIG. 1) may be implemented by source code feature extraction circuitry 112. In some examples, the source code feature extraction circuit 112 may be implemented by machine-executable instructions, such as at least the block 216 of fig. 2, executed by a processor circuit, which may be implemented by the example processor circuit 912 of fig. 9, the example processor circuit 1000 of fig. 10, and/or the example Field Programmable Gate Array (FPGA) circuit 1100 of fig. 11. In other examples, source code feature extraction circuitry 112 is implemented by other hardware logic circuitry, a hardware-implemented state machine, and/or any other combination of hardware, software, and/or firmware. For example, the source code feature extraction circuit 112 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the corresponding operations, without executing software or firmware, although other configurations may be equally suitable.
In some examples, feature embedding circuit 114 of fig. 1 includes means for embedding the extracted feature dependencies into feature dependency vectors for input into the ML model. For example, the means for embedding the extracted feature dependencies into feature dependency vectors for input into the ML model may be implemented by the feature embedding circuit 114. In some examples, feature embedding circuit 114 may be implemented by machine-executable instructions, such as at least block 216 of fig. 2, executed by a processor circuit, which may be implemented by example processor circuit 912 of fig. 9, example processor circuit 1000 of fig. 10, and/or example Field Programmable Gate Array (FPGA) circuit 1100 of fig. 11. In other examples, feature embedding circuitry 114 is implemented by other hardware logic circuitry, a hardware-implemented state machine, and/or any other combination of hardware, software, and/or firmware. For example, feature embedding circuit 114 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the corresponding operations, without executing software or firmware, although other configurations may be equally suitable.
In some examples, the profitability calculation circuit 116 of fig. 1 includes means for calculating profitability of exploring potential code variations based on embedded features of source code within the feature dependency vector. For example, means for computing profitability of exploring potential code variations based on embedded features of source code within feature dependency vectors may be implemented by the profitability calculation circuit 116. In some examples, the profitability calculation circuitry 116 may be implemented by machine-executable instructions, such as implemented by at least the block 218 of fig. 2, executed by a processor circuit, which may be implemented by the example processor circuit 912 of fig. 9, the example processor circuit 1000 of fig. 10, and/or the example Field Programmable Gate Array (FPGA) circuit 1100 of fig. 11. In other examples, the profitability calculation circuitry 116 is implemented by other hardware logic circuitry, a hardware-implemented state machine, and/or any other combination of hardware, software, and/or firmware. For example, the profitability calculation circuit 116 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) that are configured to perform the corresponding operations without executing software or firmware, although other configurations are equally suitable.
In some examples, the search tree pruning circuit 118 of FIG. 1 includes means for determining, based on the determined profitability, whether a branch of the search tree requires pruning and, if so, subsequently pruning it. For example, the means for determining whether a branch of the search tree requires pruning and, if needed, subsequently pruning it may be implemented by the search tree pruning circuit 118. In some examples, the search tree pruning circuit 118 may be implemented by machine-executable instructions, such as those implemented by at least blocks 220, 222 of FIG. 2, executed by a processor circuit, which may be implemented by the example processor circuit 912 of FIG. 9, the example processor circuit 1000 of FIG. 10, and/or the example field programmable gate array (FPGA) circuit 1100 of FIG. 11. In other examples, the search tree pruning circuit 118 is implemented by other hardware logic circuits, a hardware-implemented state machine, and/or any other combination of hardware, software, and/or firmware. For example, the search tree pruning circuit 118 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application specific integrated circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the corresponding operations without executing software or firmware, although other configurations are equally suitable.
In some examples, the transform selection circuit 120 of fig. 1 includes means for selecting a register-based compiler transform type to apply to source code in order to generate a new code variant. For example, the means for selecting a register-based compiler transform type to apply to the source code in order to generate a new code variant may be implemented by the transform selection circuit 120. In some examples, the transform selection circuit 120 may be implemented by machine-executable instructions, such as at least the block 224 of fig. 2, executed by a processor circuit, which may be implemented by the example processor circuit 912 of fig. 9, the example processor circuit 1000 of fig. 10, and/or the example Field Programmable Gate Array (FPGA) circuit 1100 of fig. 11. In other examples, transform selection circuit 120 is implemented by other hardware logic circuits, a hardware-implemented state machine, and/or any other combination of hardware, software, and/or firmware. For example, the transform selection circuit 120 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) that are configured to perform the corresponding operations without executing software or firmware, although other configurations are equally suitable.
In some examples, code variant generation circuit 122 of fig. 1 includes means for applying the selected register-based compiler optimization type to source code to generate a new code variant and adding the newly generated code variant to the set of code states. For example, means for applying the selected register-based compiler optimization type to source code to generate new code variants and adding the newly generated code variants to the set of code states may be implemented by the code variant generation circuit 122. In some examples, the code variant generation circuit 122 may be implemented by machine-executable instructions, such as those implemented by at least the blocks 226, 228 of fig. 2, executed by a processor circuit, which may be implemented by the example processor circuit 912 of fig. 9, the example processor circuit 1000 of fig. 10, and/or the example Field Programmable Gate Array (FPGA) circuit 1100 of fig. 11. In other examples, code variant generation circuit 122 is implemented by other hardware logic circuits, a hardware-implemented state machine, and/or any other combination of hardware, software, and/or firmware. For example, code variant generation circuit 122 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the corresponding operations without executing software or firmware, although other configurations may be equally suitable.
In some examples, machine Learning (ML) model parameter updating circuit 124 of fig. 1 includes means for recursively calculating and updating a score associated with each code variant in the search tree if the ML model is in a training mode until a root node of the search tree is reached. For example, means for recursively calculating and updating the score associated with each code variant in the search tree with the ML model in training mode until the root node of the search tree is reached may be implemented by Machine Learning (ML) model parameter updating circuit 124. In some examples, the Machine Learning (ML) model parameter updating circuit 124 may be implemented by machine-executable instructions, such as those implemented by at least blocks 230, 232 of fig. 2, executed by a processor circuit, which may be implemented by the example processor circuit 912 of fig. 9, the example processor circuit 1000 of fig. 10, and/or the example Field Programmable Gate Array (FPGA) circuit 1100 of fig. 11. In other examples, the Machine Learning (ML) model parameter updating circuit 124 is implemented by other hardware logic circuitry, a hardware-implemented state machine, and/or any other combination of hardware, software, and/or firmware. For example, the Machine Learning (ML) model parameter updating circuit 124 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the corresponding operations, without executing software or firmware, although other configurations are equally suitable.
Although an example manner of implementing the compiler optimization circuit 104 is illustrated in FIG. 1, one or more of the elements, processes, and/or devices illustrated in FIG. 1 may be combined, divided, rearranged, omitted, eliminated, and/or implemented in any other way. Further, the example source code score computation circuit 108, the example search tree state checking circuit 110, the example source code feature extraction circuit 112, the example feature embedding circuit 114, the example profitability calculation circuit 116, the example search tree pruning circuit 118, the example transformation selection circuit 120, the example code variant generation circuit 122, the example machine learning (ML) model parameter update circuit 124, and/or, more generally, the example compiler optimization circuit 104 of FIG. 1 may be implemented in hardware, software, firmware, and/or any combination of hardware, software, and/or firmware. Thus, for example, any of the example source code score computation circuit 108, the example search tree state checking circuit 110, the example source code feature extraction circuit 112, the example feature embedding circuit 114, the example profitability calculation circuit 116, the example search tree pruning circuit 118, the example transformation selection circuit 120, the example code variant generation circuit 122, the example machine learning (ML) model parameter update circuit 124, and/or, more generally, the example compiler optimization circuit 104 of FIG. 1 may be implemented by processor circuit(s), analog circuit(s), digital circuit(s), logic circuit(s), programmable processor(s), programmable microcontroller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), and/or field programmable logic device(s) (FPLD(s)) (e.g., field programmable gate array(s) (FPGA(s))). When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of the example source code score computation circuit 108, the example search tree state checking circuit 110, the example source code feature extraction circuit 112, the example feature embedding circuit 114, the example profitability calculation circuit 116, the example search tree pruning circuit 118, the example transformation selection circuit 120, the example code variant generation circuit 122, and/or the example machine learning (ML) model parameter update circuit 124 is hereby expressly defined to include a non-transitory computer-readable storage device or storage disk, such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc., containing the software and/or firmware. Further still, the example compiler optimization circuit 104 of FIG. 1 may include one or more elements, processes, and/or devices in addition to or instead of those illustrated in FIG. 1, and/or may include more than one of any or all of the illustrated elements, processes, and devices.
A flowchart representative of example hardware logic circuits, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the compiler optimization circuit 104 of FIG. 1 is shown in FIG. 2. The machine-readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a processor circuit, such as the processor circuit 912 shown in the example processor platform 900 discussed below in connection with FIG. 9 and/or the example processor circuits discussed below in connection with FIGS. 10 and/or 11. The program may be embodied in software stored on one or more non-transitory computer readable storage media associated with processor circuitry located in one or more hardware devices, such as a CD, a floppy disk, a hard disk drive (HDD), a DVD, a Blu-ray disc, volatile memory (e.g., any type of random access memory (RAM), etc.), or non-volatile memory (e.g., FLASH memory, an HDD, etc.), but the entire program and/or portions thereof may alternatively be executed by one or more hardware devices other than the processor circuitry and/or embodied in firmware or dedicated hardware. The machine-readable instructions may be distributed across multiple hardware devices and/or executed by two or more hardware devices (e.g., a server and a client hardware device). For example, the client hardware device may be implemented by an endpoint client hardware device (e.g., a hardware device associated with a user) or an intermediary client hardware device (e.g., a radio access network (RAN) gateway that may facilitate communication between the server and the endpoint client hardware device). Similarly, the non-transitory computer readable storage media may include one or more media located in one or more hardware devices. Further, although the example program is described with reference to the flowchart illustrated in FIG. 2, many other methods of implementing the example compiler optimization circuit 104 of FIG. 1 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the corresponding operations without executing software or firmware. The processor circuits may be distributed across different network locations and/or local to one or more hardware devices (e.g., a single-core processor (e.g., a single-core central processing unit (CPU)), a multi-core processor (e.g., a multi-core CPU), etc.) in a single machine, multiple processors distributed across multiple servers of a server rack, multiple processors distributed across one or more server racks, a CPU and/or an FPGA located in the same package (e.g., the same integrated circuit (IC) package) or in two or more separate housings, etc.
The machine-readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a segmented format, a compiled format, an executable format, a packaged format, and the like. Machine-readable instructions described herein may be stored as data or data structures (e.g., as portions of instructions, code, representations of code, etc.) that can be utilized to create, fabricate, and/or generate machine-executable instructions. For example, the machine-readable instructions may be segmented and stored on one or more storage devices and/or computing devices (e.g., servers) located in the same or different locations of a network or collection of networks (e.g., in the cloud, in an edge device, etc.). The machine-readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decrypting, decompressing, unpacking, distributing, reassigning, compiling, etc., in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, machine-readable instructions may be stored as portions that are individually compressed, encrypted, and/or stored on separate computing devices, wherein the portions, when decrypted, decompressed, and/or combined, form a set of machine-executable instructions that implement one or more operations that together form a program such as the one described herein.
In another example, machine-readable instructions may be stored in the following state: in this state, they may be read by the processor circuit, but require the addition of libraries (e.g., dynamically linked libraries (dynamic link library, DLLs)), software development suites (software development kit, SDKs), application programming interfaces (application programming interface, APIs), etc., in order to execute these machine-readable instructions on a particular computing device or other device. In another example, machine-readable instructions may need to be configured (e.g., store settings, input data, record network addresses, etc.) before the machine-readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, a machine-readable medium as used herein may include machine-readable instructions and/or program(s) regardless of the particular format or state of the machine-readable instructions and/or program(s) when stored or otherwise at rest or in transit.
Machine-readable instructions described herein may be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine-readable instructions may be represented using any of the following languages: C. c++, java, c#, perl, python, javaScript, hyper text markup language (HyperText Markup Language, HTML), structured query language (Structured Query Language, SQL), swift, etc.
As described above, the example operations of fig. 2 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on one or more non-transitory computer and/or machine readable media, such as an optical storage device, a magnetic storage device, an HDD, a flash memory, a read-only memory (ROM), a CD, a DVD, a cache, any type of RAM, a register, and/or any other storage device or storage disk in which information may be stored for any duration (e.g., for longer periods of time, permanently stored, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the terms non-transitory computer readable medium and non-transitory computer readable storage medium are expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
"including" and "comprising" (and all forms and tenses thereof) are used herein as open ended terms. Thus, whenever a claim is used as a preamble or in any of the various claims, it is to be understood that additional elements, terms, etc. may be present without falling outside the scope of the respective claim or claim. As used herein, the phrase "at least" is open ended when used as a transitional term in, for example, the preamble of a claim, as are the terms "comprising" and "including". The term "and/or" when used in the form of, for example, A, B and/or C, refers to any combination or subset of A, B, C, such as (1) a alone a,
(2) B alone, (3) C alone, (4) a and B, (5) a and C, (6) B and C, or (7) a and B and C. As used herein in the context of describing structures, components, items, C and/or things, the phrase "at least one of a and B" is intended to refer to an implementation that includes any one of the following: (1) at least one a, (2) at least one B, or (3) at least one a and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase "at least one of a or B" is intended to refer to an implementation that includes any of the following: (1) at least one a, (2) at least one B, or (3) at least one a and at least one B. As used herein in the context of describing the execution or execution of a process, instruction, action, activity, and/or step, the phrase "at least one of a and B" is intended to refer to an implementation that includes any one of the following:
(1) at least one a, (2) at least one B, or (3) at least one a and at least one B. Similarly, as used herein in the context of describing the execution or execution of a process, instruction, action, activity, and/or step, the phrase "at least one of a or B" is intended to refer to an implementation that includes any one of the following: (1) at least one a, (2) at least one B, or (3) at least one a and at least one B.
As used herein, singular references (e.g., "a", "an", "first", "second", etc.) do not exclude a plurality. As used herein, the terms "a" or "an" object refer to one or more of the object. The terms "a" (or "an"), "one or more" and "at least one" are used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements or method acts may be implemented by e.g. the same entity or object. Furthermore, although individual features may be included in different examples or claims, they may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.
The flowchart of FIG. 2 represents example machine readable instructions and/or example operations 200 that may be executed and/or instantiated by a processor circuit to find an optimal code development path (e.g., one that maximizes CPU performance) for a given source code segment. The machine-readable instructions and/or operations 200 of FIG. 2 begin at block 202, where the source code score computation circuit 108 of FIG. 1 receives source code to process.
As shown in FIG. 2, at block 202, the source code score computation circuit 108 receives the source code to be processed. In the examples disclosed herein, "source code" may refer to any piece of code to which a register-based compiler transformation may be applied to generate code variants (e.g., the input code 102 of FIG. 1).
At block 204, the search tree state checking circuit 110 determines whether the source code received at block 202 was previously encountered. If the search tree state checking circuit 110 determines that the source code was previously seen, the process proceeds to block 206. If the search tree state checking circuit 110 determines that the source code has not been encountered previously, the process moves to block 212.
At block 206, the search tree state checking circuit 110 determines whether the maximum depth of the search tree has been reached. In the examples disclosed herein, the search tree state checking circuit 110 determines whether the maximum depth of the search tree has been reached by keeping track of the current location within the search tree and/or counting the number of nodes accessed within the search tree. Further, in the examples disclosed herein, the maximum depth of the search tree is a predetermined value. If the search tree state checking circuit 110 determines that the maximum depth of the search tree has been reached, the process proceeds to block 208. If the search tree state checking circuit 110 determines that the maximum depth of the search tree has not been reached, the process moves to block 214.
At block 208, the search tree state checking circuit 110 determines whether the current location is at the root of the search tree. In examples disclosed herein, the "root of the search tree" refers to the starting node and/or starting point of the search tree (e.g., the initial input code 102 of FIG. 1). If the search tree state checking circuit 110 determines that the current location is indeed at the root of the search tree, the process ends. However, if the search tree state checking circuit 110 determines that the current location is not at the root of the search tree, the process moves to block 210.
At block 210, the search tree state checking circuit 110 returns the current location in the search tree to the parent node in the search tree. In examples disclosed herein, a "parent node" refers to the state (e.g., code variant) immediately preceding the current location. Once the search tree state checking circuit 110 returns the current location in the search tree to the parent node, the process returns to block 204.
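For illustration only, the following minimal Python sketch shows the backtracking bookkeeping of blocks 204-210; the node representation, field names, and maximum-depth value are hypothetical and not part of the disclosed circuits.

```python
MAX_DEPTH = 5  # hypothetical predetermined maximum search tree depth (block 206)

def back_up(node):
    """Blocks 208-210: end the process at the root, otherwise return the
    parent state (the code variant immediately preceding the current one)
    and re-enter the check at block 204."""
    if node["parent"] is None:  # current location is the root of the search tree
        return None             # process ends (block 208)
    return node["parent"]       # back up one state (block 210)

# Hypothetical usage: the root has no parent; a child backs up to the root.
root = {"parent": None}
child = {"parent": root}
assert back_up(child) is root and back_up(root) is None
```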
At block 212, the search tree state checking circuit 110 determines whether a Machine Learning (ML) model is currently being trained. If the search tree state checking circuit 110 determines that the ML model is not in a training mode, the process moves to block 213. However, if the search tree state checking circuit 110 determines that the ML model is in a training mode, the process moves to block 214. In some examples, the search tree state checking circuit 110 may verify whether the ML model is in the training mode by using a training state identifier (e.g., a Boolean variable, etc.).
At block 213, the source code score calculation circuit 108 calculates a score associated with the source code received at block 202. In the examples disclosed herein, the score calculated by the source code score calculation circuit 108 directly corresponds to the number of memory loads and/or stores eliminated from main memory and/or the CPU cache by reusing data from registers. Accordingly, in the examples disclosed herein, the remaining memory load and/or store count of the source code (e.g., the input code 102 of FIG. 1) is inversely proportional to the score determined by the source code score calculation circuit 108.
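Because the disclosure fixes only this proportionality (and not a formula), the following Python sketch shows one hypothetical way the source code score calculation circuit 108 could score a code variant; the function name and the exact formula are assumptions for illustration.

```python
def source_code_score(loads_eliminated, stores_eliminated,
                      loads_remaining, stores_remaining):
    """Hypothetical score: grows with the memory accesses eliminated by
    register reuse and shrinks as the remaining load/store count grows,
    matching the inverse relationship described above."""
    eliminated = loads_eliminated + stores_eliminated
    remaining = loads_remaining + stores_remaining
    return eliminated / (1.0 + remaining)

# e.g., eliminating 8 of 12 accesses scores higher than eliminating 2:
assert source_code_score(6, 2, 3, 1) > source_code_score(1, 1, 6, 4)
```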
At block 215, the source code score calculation circuit 108 determines whether the score calculated at block 213 meets a threshold for further exploring code variants. In the examples disclosed herein, the threshold to which the source code score calculation circuit 108 compares the score is a predetermined value. If the source code score calculation circuit 108 determines that the score associated with the source code does not meet the given threshold, the process moves to block 216. However, if the source code score calculation circuit 108 determines that the score associated with the source code does meet the threshold, the process moves to block 214.
At block 216, the source code feature extraction circuit 112 extracts feature dependencies from the source code. In examples disclosed herein, source code features (e.g., array accesses, loop boundaries, scores for code variants, etc.) are extracted from each state (e.g., code variant). Further, in the examples disclosed herein, the source code feature extraction circuit 112 parses the source code to perform feature extraction; however, in other examples, any type of dimensionality reduction algorithm (e.g., classification, etc.) may be used.
At block 217, the source code feature embedding circuit 114 receives the extracted feature dependencies (e.g., array accesses, loop boundaries, code variant scores, etc.) from the source code feature extraction circuit 112 (from block 216) and embeds the extracted features into a feature vector representation. In the examples disclosed herein, the feature embedding circuit 114 uses one-hot encoding at block 217 to embed the extracted features and the scores associated with the code variants into feature vectors; however, any other type of feature embedding algorithm may be utilized.
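As a concrete but hypothetical illustration of block 217, the Python sketch below one-hot encodes a dependency type and concatenates it with numeric features; the vocabulary and vector layout are assumptions, since the disclosure specifies only that one-hot encoding is one usable embedding.

```python
DEP_TYPES = ["true", "anti", "input", "output"]  # hypothetical vocabulary

def one_hot(value, vocab):
    # Build a one-hot vector with a 1 at the position of `value`.
    vec = [0] * len(vocab)
    vec[vocab.index(value)] = 1
    return vec

def embed_features(dep_type, loop_bound, variant_score):
    # Concatenate the one-hot dependency type with numeric features.
    return one_hot(dep_type, DEP_TYPES) + [loop_bound, variant_score]

# e.g., embed_features("true", 16, 0.42) -> [1, 0, 0, 0, 16, 0.42]
```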
At block 218, the profitability calculation circuit 116 queries a Machine Learning (ML) model to determine the profitability of exploring further code variant paths from the current state (e.g., the source code). The profitability calculation circuit 116 utilizes the Machine Learning (ML) model to correlate the embedded features with the profitability of exploring sub-states reachable from the current state or code variant.
At block 220, the search tree pruning circuit 118 evaluates the profitability metric calculated by the profitability calculation circuit 116 at block 218 and determines whether it meets a threshold for pruning. For example, if the profitability metric is below a minimum threshold, the search tree pruning circuit 118 determines that the search tree needs to be pruned at the current state, thereby limiting search space exploration by pruning paths that are relatively unlikely to lead to a profitable state. If the search tree pruning circuit 118 determines that the search tree needs to be pruned, the process proceeds to block 222. However, if the search tree pruning circuit 118 determines that the search tree does not need pruning, the process moves to block 214.
At block 222, the search tree pruning circuit 118 prunes the search tree at the current state. In examples disclosed herein, "pruning" refers to the process in which a given portion of the tree is removed after that portion is deemed redundant and/or non-critical. Further, in the examples disclosed herein, the type of search tree that is pruned is a Monte Carlo search tree; however, any other type of search tree may be utilized.
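A minimal sketch of the pruning decision of blocks 220-222 follows, assuming a dictionary-based tree representation and a hypothetical threshold (the disclosure says only that the threshold is predetermined).

```python
PRUNE_THRESHOLD = 0.2  # hypothetical predetermined minimum profitability

def prune_subtree(tree, node_id):
    """Remove node_id and all of its descendants; `tree` maps a node id
    to the list of its children's ids (an assumed representation)."""
    for child in tree.pop(node_id, []):
        prune_subtree(tree, child)

def maybe_prune(tree, node_id, profitability):
    # Blocks 220-222: prune when predicted profitability is below threshold.
    if profitability < PRUNE_THRESHOLD:
        prune_subtree(tree, node_id)
        return True
    return False

# e.g., maybe_prune({"a": ["b"], "b": []}, "a", 0.05) removes both nodes.
```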
At block 214, the transform selection circuit 120 determines whether all possible register-based compiler transforms (e.g., scalar replacement, unrolling and blocking, loop fusion, etc.) that may be applied to a given piece of source code have been exhausted. If the transform selection circuit 120 determines that all possible transforms have been exhausted, the process moves to block 210. However, if the transform selection circuit 120 determines that all possible transforms have not been exhausted, the process proceeds to block 224.
At block 224, the transform selection circuit 120 selects a register-based compiler transform (e.g., scalar replacement, unrolling and blocking, loop fusion, etc.) to apply to the source code to generate a new optimized code variant. In the examples disclosed herein, the transform selection circuit 120 uses a random selection algorithm to select the register-based compiler transform; however, any other type of selection algorithm may be utilized.
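For illustration, a Python sketch of blocks 214 and 224 under the random-selection policy described above; the transform names follow the text, while the function signature and set representation are assumptions.

```python
import random

TRANSFORMS = ("scalar replacement", "unrolling and blocking",
              "loop interchange", "loop fusion")

def select_transform(already_tried):
    """Return a randomly chosen untried register-based transform (block 224),
    or None when all transforms are exhausted (the block 214 check)."""
    remaining = [t for t in TRANSFORMS if t not in already_tried]
    return random.choice(remaining) if remaining else None
```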
At block 226, the code variant generation circuit 122 applies the register-based compiler transform selected by the transform selection circuit 120 at block 224 to the source code to generate a new code variant.
At block 228, the code variant generation circuit 122 adds the newly generated code variant to the total set of code variants. In the examples disclosed herein, the set of code variants is represented as states in a Monte Carlo search tree; however, any other representation of the code variants may be utilized.
At block 230, the search tree state checking circuit 110 checks whether the Machine Learning (ML) model is currently in a training mode, similar to block 212. If the search tree state checking circuit 110 determines that the Machine Learning (ML) model is currently in a training mode, the process proceeds to block 232. However, if the search tree state checking circuit 110 determines that the Machine Learning (ML) model is not currently in a training mode, the process moves to block 210.
At block 232, the Machine Learning (ML) model parameter update circuit 124 recursively calculates the scores associated with the explored code variants (e.g., sub-states in the search tree) and updates the Machine Learning (ML) model parameters to include the calculated scores.
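One hypothetical way to realize this recursive update is sketched below; the node layout and the form of the training record are assumptions, since the disclosure states only that scores of sub-states are calculated recursively and folded into the model's parameters.

```python
def backpropagate_scores(node, training_examples):
    """Block 232 sketch: recursively accumulate the scores of explored
    sub-states and record (features, accumulated score) pairs that a
    subsequent training step can use to update the ML model."""
    total = node["score"]
    for child in node.get("children", []):
        total += backpropagate_scores(child, training_examples)
    training_examples.append((node["features"], total))
    return total
```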
FIGS. 3A-3C illustrate example compiler transformations 300 that are applied to example source code 305 to improve application performance. FIG. 3A illustrates an example of applying scalar replacement 310 to the source code 305, FIG. 3B illustrates an example of applying unrolling and blocking 315 to the source code 305, and FIG. 3C illustrates an example of applying loop fusion 320 to the source code 305.
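The source code 305 of FIGS. 3A-3C is not reproduced here; the Python sketches below illustrate, on hypothetical loop nests, the effect of two of the named transforms (whether the scalar actually lands in a register is ultimately up to the compiler or runtime).

```python
# Scalar replacement: hold out[i] in a local so the inner loop reuses a
# single value instead of re-loading and re-storing out[i] each iteration.
def row_sums(a, out):
    for i in range(len(a)):
        acc = out[i]              # one load
        for j in range(len(a[i])):
            acc += a[i][j]        # no per-iteration access to out[i]
        out[i] = acc              # one store

# Loop fusion: merge two loops over the same range so each b[i] is
# consumed immediately after it is produced, saving a second memory pass.
def scale_then_shift(a, b, c, k):
    for i in range(len(a)):      # fused form of two separate i-loops
        b[i] = a[i] * k
        c[i] = b[i] + 1
```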
FIG. 4 depicts an example dependency vector representation 400 of example source code 305 for a 2-dimensional convolution. The example dependency vector representation 400 includes example arrays/tensors 402 and their corresponding dependency vectors 404. For example, the output array 403 represented in the source code 305 has a corresponding output dependency vector 406, the input array 405 has a corresponding input dependency vector 408, and the weights array 409 has a corresponding weight dependency vector 410. In the example source code 305, the output array 403 is calculated using the following statement: output[n][kb][h][w][0:15] += input[n][cb][h+r][w+s][c] * weights[kb][cb][r][s][c][0:15].
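A runnable NumPy sketch of this update follows; the loop bounds and tensor sizes are hypothetical (FIG. 4 fixes only the indexing pattern), and the innermost [0:15] slice denotes 16 vector lanes.

```python
import numpy as np

N, KB, CB, H, W, R, S, C, LANES = 1, 2, 2, 4, 4, 3, 3, 4, 16  # assumed sizes

output = np.zeros((N, KB, H, W, LANES))
inputs = np.random.rand(N, CB, H + R, W + S, C)
weights = np.random.rand(KB, CB, R, S, C, LANES)

for n in range(N):
    for kb in range(KB):
        for cb in range(CB):
            for h in range(H):
                for w in range(W):
                    for r in range(R):
                        for s in range(S):
                            for c in range(C):
                                # output[n][kb][h][w][0:15] +=
                                #   input[n][cb][h+r][w+s][c] * weights[kb][cb][r][s][c][0:15]
                                output[n, kb, h, w, :] += (
                                    inputs[n, cb, h + r, w + s, c]
                                    * weights[kb, cb, r, s, c, :]
                                )
```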
FIG. 5 illustrates an example embedding process 500 that utilizes the example dependency vectors 404 of FIG. 4. Example features 505 represent features extracted from the source code of each state in the search tree. The features 505 include the dependency vectors 404 of FIG. 4, loop limits 506, array dimensions 507, and dependency types 508 (e.g., template, input, output, true, anti, etc.). The features 505 represent the source code of each state in a form that can be embedded as input into a Machine Learning (ML) model. In the examples disclosed herein, a loop limit 506 represents the number of times a given loop (and the statements contained therein) is executed. The features 505 are then embedded in an embedding layer 510 for input into the ML model 515. In the examples disclosed herein, one-hot encoding is used to embed the features 505; however, any other type of embedding algorithm and/or technique may be utilized. The embedded features within the embedding layer 510 are then passed as input into the Machine Learning (ML) model 515. The ML model 515 correlates the embedded features within the embedding layer 510 with the profitability of exploring code variants. The profitability metrics generated by the ML model 515 are then mapped to actions 520 (e.g., applying a selected register-based compiler transformation to a code variant, pruning the current state, etc.).
FIG. 6 illustrates example code 600 representing compiler-optimization actions corresponding to edges in an example search tree (e.g., a Monte Carlo search tree). Example actions 520 (e.g., fusion, loop interchange, unrolling and blocking, scalar replacement, etc.) are mapped to encoded actions 610. In the examples disclosed herein, the encoded actions 610 are generated using one-hot encoding; however, any other type of encoding technique may be utilized.
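For illustration, a minimal sketch of such an action encoding; the action set mirrors the transforms named in the text, while the ordering and resulting vectors are assumptions.

```python
ACTIONS = ("fusion", "scalar replacement", "loop interchange",
           "unrolling and blocking")

def encode_action(action):
    """One-hot encode a compiler-optimization action (FIG. 6)."""
    return [1 if a == action else 0 for a in ACTIONS]

# encode_action("loop interchange") -> [0, 0, 1, 0]
```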
FIG. 7 illustrates an example search tree 700 of code variants, with edges annotated with parameters 702 corresponding to compiler transforms and/or scores. The search tree 700 includes an example root node 704 and its corresponding score 706. Originating from the root node 704 are example code variants 710A, 710B, 710C, 710D, each representing a code variant generated by applying a different action 708A, 708B, 708C, 708D to the root node 704. For example, action 708A, "F:0," indicates "no fusion," action 708B, "SR:1," indicates "scalar replacement," action 708C, "P:1," indicates "loop interchange," and action 708D, "U:1," indicates "unrolling and blocking."
FIG. 8 is a block diagram of an example processor platform 800 configured to execute and/or instantiate the machine readable instructions and/or operations of FIG. 2 to implement the compiler optimization system 100 of FIG. 1. The processor platform 800 may be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cellular telephone, a smart phone, a tablet such as an iPad™, etc.), a personal digital assistant (PDA), an Internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a gaming console, a personal video recorder, a set top box, a headset (e.g., an augmented reality (AR) headset, a virtual reality (VR) headset, etc.) or other wearable device, or any other type of computing device.
The processor platform 800 of the illustrated example includes a processor circuit 825. The processor circuit 825 of the illustrated example is hardware. For example, the processor circuit 825 may be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The processor circuit 825 may be implemented by one or more semiconductor-based (e.g., silicon-based) devices. In this example, the processor circuit 825 implements the example source code score calculation circuit 108, the example search tree state checking circuit 110, the example source code feature extraction circuit 112, the example feature embedding circuit 114, the example profitability calculation circuit 116, the example search tree pruning circuit 118, the example transform selection circuit 120, the example code variant generation circuit 122, and the example Machine Learning (ML) model parameter update circuit 124.
The processor circuit 825 of the illustrated example includes local memory 805 (e.g., cache, registers, etc.). The processor circuit 825 of the illustrated example communicates with main memory, including volatile memory 815 and non-volatile memory 820, via a bus 830. The volatile memory 815 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®), and/or any other type of RAM device. The non-volatile memory 820 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 815, 820 of the illustrated example is controlled by a memory controller.
The processor platform 800 of the illustrated example also includes interface circuitry 845. The interface circuitry 845 may be implemented in hardware in accordance with any type of interface standard, such as an Ethernet interface, a universal serial bus (USB) interface, a Bluetooth® interface, a near field communication (NFC) interface, a PCI interface, and/or a PCIe interface.
In the illustrated example, one or more input devices 840 are connected to interface circuitry 845. Input device(s) 840 allow a user to input data and/or commands into processor circuit 825. Input device(s) 840 may be implemented by, for example, an audio sensor, microphone, camera (still or video), keyboard, buttons, mouse, touch screen, touch pad, trackball, isopoint device, and/or voice recognition system.
One or more output devices 850 are also connected to the interface circuitry 845 of the illustrated example. The output devices 850 may be implemented, for example, by display devices (e.g., a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a liquid crystal display (LCD), a cathode ray tube (CRT) display, an in-plane switching (IPS) display, a touch screen, etc.), a haptic output device, a printer, and/or speakers. The interface circuitry 845 of the illustrated example thus typically includes a graphics driver card, a graphics driver chip, and/or graphics processor circuitry such as a GPU.
The interface circuitry 845 of the illustrated example also includes a communication device, such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface, to facilitate the exchange of data with external machines (e.g., computing devices of any kind) via the network 810. The communication may be via, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, an optical connection, etc.
The processor platform 800 of the illustrated example also includes one or more mass storage devices 835 to store software and/or data. Examples of such mass storage devices 835 include magnetic storage devices, optical storage devices, floppy disk drives, HDDs, CDs, Blu-ray disc drives, redundant array of independent disks (RAID) systems, solid-state storage devices (such as flash memory devices), and DVD drives.
The machine-executable instructions 832, which may be implemented by the machine-readable instructions of fig. 2, may be stored in the mass storage device 835, in the volatile memory 815, in the non-volatile memory 820, and/or on a removable non-transitory computer-readable storage medium such as a CD or DVD.
Fig. 9 is a block diagram of an example implementation of the processor circuit 825 of fig. 8. In this example, the processor circuit 825 of fig. 8 is implemented by the microprocessor 900. For example, microprocessor 900 may implement multi-core hardware circuitry, such as CPU, DSP, GPU, XPU, and so forth. The microprocessor 900 of this example is a multi-core semiconductor device including N cores, although it may include any number of example cores 902 (e.g., 1 core). The cores 902 of the microprocessor 900 may operate independently or may cooperate to execute machine-readable instructions. For example, machine code corresponding to a firmware program, an embedded software program, or a software program may be executed by one of cores 902, or may be executed by multiple ones of cores 902 at the same or different times. In some examples, machine code corresponding to a firmware program, an embedded software program, or a software program is partitioned into threads and executed in parallel by two or more of cores 902. The software program may correspond to a part or all of the machine readable instructions and/or operations represented by the flow chart of fig. 2.
The cores 902 may communicate via an example first bus 904. In some examples, the first bus 904 may implement a communication bus to enable communication associated with one (or more) of the cores 902. For example, the first bus 904 may implement at least one of an Inter-Integrated Circuit (I2C) bus, a Serial Peripheral Interface (SPI) bus, a PCI bus, or a PCIe bus. Additionally or alternatively, the first bus 904 may implement any other type of computing or electrical bus. The cores 902 may obtain data, instructions, and/or signals from one or more external devices through example interface circuitry 906. The cores 902 may output data, instructions, and/or signals to the one or more external devices via the interface circuitry 906. While the cores 902 of this example include example local memory 920 (e.g., a level 1 (L1) cache that may be partitioned into an L1 data cache and an L1 instruction cache), the microprocessor 900 also includes example shared memory 910 (e.g., a level 2 (L2) cache) that may be shared by the cores for high-speed access to data and/or instructions. Data and/or instructions may be transferred (e.g., shared) by writing to and/or reading from the shared memory 910. The local memory 920 of each core 902 and the shared memory 910 may be part of a hierarchy of memory devices including multiple levels of cache memory and main memory (e.g., the main memory 815, 820 of FIG. 8). Typically, higher levels of memory in the hierarchy exhibit lower access times and have smaller storage capacity than lower levels of memory. Changes to the various levels of the cache hierarchy are managed (e.g., coordinated) by a cache coherency policy.
Each core 902 may be referred to as a CPU, DSP, GPU, etc., or any other type of hardware circuitry. Each core 902 includes control unit circuitry 914, arithmetic and logic (AL) circuitry (sometimes referred to as an ALU) 916, a plurality of registers 918, an L1 cache 920, and an example second bus 922. Other structures may also be present. For example, each core 902 may include vector unit circuitry, single instruction multiple data (SIMD) unit circuitry, load/store unit (LSU) circuitry, branch/jump unit circuitry, floating-point unit (FPU) circuitry, and so forth. The control unit circuitry 914 includes semiconductor-based circuits structured to control (e.g., coordinate) data movement within the corresponding core 902. The AL circuitry 916 includes semiconductor-based circuits structured to perform one or more mathematical and/or logical operations on data within the corresponding core 902. The AL circuitry 916 in some examples performs integer-based operations. In other examples, the AL circuitry 916 also performs floating-point operations. In still other examples, the AL circuitry 916 may include first AL circuitry that performs integer-based operations and second AL circuitry that performs floating-point operations. In some examples, the AL circuitry 916 may be referred to as an Arithmetic Logic Unit (ALU). The registers 918 are semiconductor-based structures used to store data and/or instructions, such as the results of one or more operations performed by the AL circuitry 916 of the corresponding core 902. For example, the registers 918 may include vector register(s), SIMD register(s), general purpose register(s), flag register(s), segment register(s), machine-specific register(s), instruction pointer register(s), control register(s), debug register(s), memory management register(s), machine check register(s), and so forth. The registers 918 may be arranged in banks as shown in FIG. 9. Alternatively, the registers 918 may be organized in any other arrangement, format, or structure, including being distributed throughout the core 902 to shorten access time. The second bus 922 may implement at least one of an I2C bus, an SPI bus, a PCI bus, or a PCIe bus.
Each core 902 and/or more generally microprocessor 900 may include additional and/or alternative structures to those shown and described above. For example, there may be one or more clock circuits, one or more power supplies, one or more power gates, one or more Cache Home Agents (CHA), one or more aggregation/Common Mesh Stops (CMS), one or more shifters (e.g., barrel shifter (s)), and/or other circuitry. Microprocessor 900 is a semiconductor device that is fabricated to include a number of interconnected transistors to implement the structure described above in one or more Integrated Circuits (ICs) contained within one or more packages. The processor circuit may include and/or cooperate with one or more accelerators. In some examples, the accelerator is implemented by logic circuitry to perform certain tasks faster and/or more efficiently than a general purpose processor. Examples of accelerators include ASICs and FPGAs, such as those discussed herein. The GPU or other programmable device may also be an accelerator. The accelerator may be on a board of the processor circuit, in the same chip package as the processor circuit, and/or in one or more packages separate from the processor circuit.
Fig. 10 is a block diagram of another example implementation of the processor circuit 825 of fig. 8. In this example, processor circuit 825 is implemented by FPGA circuit 1000. For example, FPGA circuitry 1000 may be used, for example, to perform operations that may otherwise be performed by the example microprocessor 900 of fig. 9 executing corresponding machine-readable instructions. Once configured, however, the FPGA circuitry 1000 instantiates machine-readable instructions in hardware so that the operations are often performed faster than the general purpose microprocessor executing the corresponding software.
More specifically, in contrast to the microprocessor 900 of fig. 9 described above (which is a general purpose device that may be programmed to execute some or all of the machine readable instructions represented by the flowchart of fig. 2, but whose interconnections and logic circuitry are fixed once manufactured), the FPGA circuit 1000 of the example of fig. 10 includes interconnections and logic circuitry that may be configured and/or interconnected in a different manner after manufacture to instantiate some or all of the machine readable instructions represented, for example, by the flowchart of fig. 2. In particular, FPGA 1000 may be considered an array of logic gates, interconnects, and switches. The switches can be programmed to change the manner in which the logic gates are interconnected, effectively forming one or more dedicated logic circuits (unless and until FPGA circuit 1000 is reprogrammed). The logic circuits are configured such that the logic gates can cooperate in different ways to perform different operations on data received by the input circuit. These operations may correspond to a portion or all of the software represented by the flow chart of fig. 2. Accordingly, FPGA circuitry 1000 may be configured to effectively instantiate a portion or all of the machine-readable instructions of the flowchart of figure 2 as dedicated logic circuitry to perform operations corresponding to those software instructions in a manner analogous to that of an ASIC. Accordingly, FPGA circuit 1000 may execute operations corresponding to some or all of the machine-readable instructions of figure 2 faster than a general-purpose microprocessor can execute such instructions.
In the example of FIG. 10, the FPGA circuitry 1000 is structured to be programmed (and/or reprogrammed one or more times) by an end user via a hardware description language (HDL) such as Verilog. The FPGA circuitry 1000 of FIG. 10 includes example input/output (I/O) circuitry 1002 to obtain data from and/or output data to example configuration circuitry 1004 and/or external hardware (e.g., external hardware circuitry) 1006. For example, the configuration circuitry 1004 may implement interface circuitry that may obtain machine readable instructions to configure the FPGA circuitry 1000, or portion(s) thereof. In some such examples, the configuration circuitry 1004 may obtain the machine readable instructions from a user, a machine (e.g., hardware circuitry (e.g., programmed or dedicated circuitry) that may implement an artificial intelligence/machine learning (AI/ML) model to generate the instructions), and so forth. In some examples, the external hardware 1006 may implement the microprocessor 900 of FIG. 9. The FPGA circuitry 1000 also includes an array of example logic gates 1008, a plurality of example configurable interconnects 1010, and example storage circuitry 1012. The logic gates 1008 and the interconnects 1010 may be configured to instantiate one or more operations corresponding to at least some of the machine readable instructions of FIG. 2, and/or other desired operations. The logic gates 1008 shown in FIG. 10 are fabricated in groups or blocks. Each block includes semiconductor-based electrical structures that may be configured into logic circuits. In some examples, the electrical structures include logic gates (e.g., AND gates, OR gates, NOR gates, etc.) that provide basic building blocks for logic circuits. Within each of the logic gates 1008 are electrically controllable switches (e.g., transistors) that enable the electrical structures and/or the logic gates to be configured to form circuits that perform desired operations. The logic gates 1008 may include other electrical structures such as look-up tables (LUTs), registers (e.g., flip-flops or latches), multiplexers, and so forth.
The interconnect 1010 of the illustrated example is a conductive path, trace, via, or the like, which may include electrically controllable switches (e.g., transistors) whose states may be changed by programming (e.g., using HDL instruction language) to activate or deactivate one or more connections between one or more logic gates 1008 to program a desired logic circuit.
The storage circuit 1012 of the illustrated example is configured to store the result(s) of one or more operations performed by the respective logic gates. The storage circuit 1012 may be implemented by a register or the like. In the illustrated example, the storage circuits 1012 are distributed among the logic gates 1008 to facilitate access and increase execution speed.
The example FPGA circuit 1000 of fig. 10 also includes example special purpose operational circuitry 1014. In this example, the special purpose operation circuit 1014 includes special purpose circuits 1016 that can be invoked to implement commonly used functions to avoid the need to program these functions in the field. Examples of such dedicated circuitry 1016 include memory (e.g., DRAM) controller circuitry, PCIe controller circuitry, clock circuitry, transceiver circuitry, memory, and multiplier-accumulator circuitry. Other types of dedicated circuitry may also be present. In some examples, FPGA circuit 1000 can also include example general-purpose programmable circuitry 1018, such as example CPU 1020 and/or example DSP 1022. Other general purpose programmable circuits 1018 may additionally or alternatively exist, such as GPUs, XPUs, etc., which may be programmed to perform other operations.
While fig. 9 and 10 illustrate two example implementations of the processor circuit 825 of fig. 8, many other approaches are also contemplated. For example, as described above, modern FPGA circuitry may include an on-board CPU, such as one or more of the example CPUs 1020 of fig. 10. Thus, the processor circuit 825 of fig. 8 may additionally be implemented by combining the example microprocessor 900 of fig. 9 and the example FPGA circuit 1000 of fig. 10. In some such hybrid examples, a first portion of the machine-readable instructions represented by the flowchart of fig. 2 may be executed by the one or more cores 902 of fig. 9, and a second portion of the machine-readable instructions represented by the flowchart of fig. 2 may be executed by the FPGA circuitry 1000 of fig. 10.
In some examples, the processor circuit 825 of fig. 8 may be in one or more packages. For example, the processor circuit 900 of fig. 9 and/or the FPGA circuit 1000 of fig. 10 may be in one or more packages. In some examples, the XPU may be implemented by the processor circuit 825 of fig. 8, which may be in one or more packages. For example, an XPU may include a CPU in one package, a DSP in another package, a GPU in yet another package, and an FPGA in yet another package.
FIG. 11 is a block diagram illustrating an example software distribution platform 1105 to distribute software, such as the example machine readable instructions 1132 of FIG. 11, to hardware devices owned and/or operated by third parties. The example software distribution platform 1105 may be implemented by any computer server, data facility, cloud service, etc., capable of storing and transmitting software to other computing devices. The third parties may be customers of the entity owning and/or operating the software distribution platform 1105. For example, the entity that owns and/or operates the software distribution platform 1105 may be a developer, a seller, and/or a licensor of software such as the example machine readable instructions represented by the flowchart of FIG. 2. The third parties may be consumers, users, retailers, OEMs, etc., who purchase and/or license the software for use, re-sale, and/or sub-licensing. In the illustrated example, the software distribution platform 1105 includes one or more servers and one or more storage devices. The storage devices store the machine readable instructions 1132, which may correspond to the example machine readable instructions represented by the flowchart of FIG. 2, as described above. The one or more servers of the example software distribution platform 1105 are in communication with a network 1110, which may correspond to the Internet and/or any one or more of the example networks described above. In some examples, the one or more servers respond to requests to transmit the software to a requesting party as part of a commercial transaction. Payment for the delivery, sale, and/or license of the software may be handled by the one or more servers of the software distribution platform and/or by a third party payment entity. The servers enable purchasers and/or licensees to download the machine readable instructions 1132 from the software distribution platform 1105. For example, the software, which may correspond to the example machine readable instructions represented by the flowchart of FIG. 2, may be downloaded to the example processor platform 1100, which is to execute the machine readable instructions 1132 to implement the compiler optimization system 100 of FIG. 1. In some examples, one or more servers of the software distribution platform 1105 periodically offer, transmit, and/or force updates to the software (e.g., the example machine readable instructions 1132 of FIG. 11) to ensure that improvements, patches, updates, etc., are distributed and applied to the software at the end user devices.
Example methods, apparatus, systems, and articles of manufacture for machine-learning guided compiler optimization for register-based hardware architecture are disclosed herein. Further examples and combinations thereof include the following examples:
example 1 includes a non-transitory computer-readable medium comprising instructions that, when executed, cause a machine to select at least a register-based compiler transformation to apply to source code at a current location in a search tree, determine whether the search tree needs pruning based on an output of a query to a Machine Learning (ML) model, prune the search tree at the current location in response to determining that the search tree needs pruning, generate code variants in response to applying the selected register-based compiler transformation to the source code, calculate a score associated with the source code at the current location in the search tree, and update parameters of the Machine Learning (ML) model to include the calculated score.
Example 2 includes the non-transitory computer-readable medium of example 1, wherein the instructions, when executed, cause the machine to calculate a score associated with the source code to determine profitability of exploring a given path of code variants, the score being inversely proportional to at least a memory load or memory store count of the source code.
Example 3 includes the non-transitory computer-readable medium of example 1, wherein the query to the Machine Learning (ML) model is run using a feature dependency vector as input.
Example 4 includes the non-transitory computer-readable medium of example 1, wherein parameters of the Machine Learning (ML) model are recursively updated.
Example 5 includes the non-transitory computer-readable medium of example 4, wherein the parameters of the Machine Learning (ML) model include at least a score or an action associated with the code variant.
Example 6 includes the non-transitory computer-readable medium of example 1, wherein the instructions, when executed, cause the machine to compare a score associated with the source code at the current location with a threshold score value to determine that exploration of a path of the code variants is unproductive.

Example 7 includes the non-transitory computer-readable medium of example 6, wherein the threshold score value is a predetermined value.
Example 8 includes the non-transitory computer-readable medium of example 1, wherein the determination of whether the search tree needs pruning is made by comparing an output of a query to the Machine Learning (ML) model to a threshold profitability value.
Example 9 includes the non-transitory computer-readable medium of example 8, wherein the threshold profitability value is a predetermined value.
Example 10 includes the non-transitory computer-readable medium of example 1, wherein the register-based compiler transformation that may be applied to the source code includes at least one of: scalar replacement, unrolling and blocking, loop interchange, or loop fusion.
Example 11 includes the non-transitory computer-readable medium of example 1, wherein the instructions, when executed, cause the machine to determine a metric associated with exploring a given path of code variants originating from the source code and, in response to determining that exploration of the path of the code variants is unproductive, extract features from the source code and embed the extracted features into a feature dependency vector.
Example 12 includes a method of performing machine learning guided compiler optimization for a register-based hardware architecture, the method comprising: selecting a register-based compiler transformation to apply to source code at a current location in a search tree, determining whether the search tree needs pruning based on an output of a query to a Machine Learning (ML) model, pruning the search tree at the current location in response to determining that the search tree needs pruning, generating code variants in response to applying the selected register-based compiler transformation to the source code, calculating a score associated with the source code at the current location, and updating parameters of the Machine Learning (ML) model to include the calculated score.
Example 13 includes the method of example 12, wherein the profitability of exploring a given path of code variants originating from the source code at the current location is determined by calculating a score associated with the source code, the score being inversely proportional to at least a memory load or memory store count of the source code.
Example 14 includes the method of example 12, wherein the query to the Machine Learning (ML) model is run using feature dependency vectors as input.
Example 15 includes the method of example 12, wherein parameters of the Machine Learning (ML) model are recursively updated.
Example 16 includes the method of example 15, wherein the parameters of the Machine Learning (ML) model include at least a score or an action associated with the code variant.
Example 17 includes the method of example 12, wherein the determination that there is no benefit to exploring the path of the code variant is made by comparing a score associated with source code at the current location to a threshold value score.
Example 18 includes the method of example 12, wherein the determination of whether the search tree needs pruning is made by comparing an output of a query to the Machine Learning (ML) model to a threshold profitability value.
Example 19 includes the method of example 12, further comprising determining a metric associated with exploring a given path of code variants originating from the source code and, in response to determining that exploration of the path of the code variants is unproductive, extracting features from the source code and embedding the extracted features into a feature dependency vector.
Example 20 includes the method of example 12, wherein the register-based compiler transformation that may be applied to the source code includes at least one of: scalar replacement, unrolling and blocking, loop interchange, or loop fusion.
Example 21 includes an apparatus to perform machine learning guided compiler optimization for a register-based hardware architecture, the apparatus comprising interface circuitry and processor circuitry, the processor circuitry including one or more of: at least one of a central processing unit, a graphics processing unit, or a digital signal processor, the at least one of the central processing unit, the graphics processing unit, or the digital signal processor having control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions in the apparatus, and one or more registers to store results of the one or more first operations; a Field Programmable Gate Array (FPGA) including logic gates, a plurality of configurable interconnections, and storage circuitry, the logic gates and interconnections to perform one or more second operations, the storage circuitry to store results of the one or more second operations; or an Application Specific Integrated Circuit (ASIC) including logic gates to perform one or more third operations; the processor circuitry to perform at least one of the first operations, the second operations, or the third operations to instantiate: a transform selection circuit to select a register-based compiler transformation to apply to source code at a current location in a search tree, a search tree pruning circuit to prune the search tree at the current location in response to determining that the search tree needs pruning, a code variant generation circuit to apply the selected register-based compiler transformation to the source code, and a Machine Learning (ML) model parameter update circuit to calculate a score associated with the source code at the current location in the search tree and update parameters of the ML model to include the calculated score.
Example 22 includes the apparatus of example 21, wherein the ML model parameter update circuit recursively updates the parameters of the Machine Learning (ML) model.
Example 23 includes the apparatus of example 21, wherein the search tree pruning circuit is to determine whether the search tree requires pruning by comparing an output of a query to the Machine Learning (ML) model to a threshold profitability value.
Example 24 includes the apparatus of example 21, further comprising a profitability calculation circuit to determine a metric associated with exploring a given path of code variants originating from the source code, a source code feature extraction circuit to extract features from the source code in response to determining that exploration of the path of the code variants is unproductive, and a feature embedding circuit to embed the extracted features into a feature dependency vector.
Example 25 includes the apparatus of example 21, wherein the transform selection circuit selects a register-based compiler transform to apply to the source code, the register-based compiler transform comprising at least one of: scalar replacement, unrolling and blocking, loop interchange, or loop fusion.

Although certain example systems, methods, apparatus, and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all systems, methods, apparatus, and articles of manufacture fairly falling within the scope of the appended claims.
Example 26 is an edge computing gateway, comprising processing circuitry to perform any of examples 12-20.
Example 27 is an edge computing node, comprising processing circuitry to perform any of examples 12-20.
Example 28 is a base station, comprising a network interface card and processing circuitry to perform any of examples 12-20.
Example 29 is a computer-readable medium comprising instructions to perform any of examples 12-20.
The following claims are hereby incorporated into this detailed description by this reference, with each claim standing on its own as a separate embodiment of this disclosure.

Claims (24)

1. A computer-readable medium comprising instructions that, when executed, cause a machine to at least:
select a register-based compiler transformation to apply to source code at a current location in a search tree;
determine whether the search tree requires pruning based on an output of a query to a Machine Learning (ML) model;
prune the search tree at the current location in response to determining that the search tree needs to be pruned;
generate code variants in response to applying the selected register-based compiler transformation to the source code;
calculate a score associated with the source code at the current location in the search tree; and
update parameters of the Machine Learning (ML) model to include the calculated score.
2. The computer readable medium of claim 1, wherein the instructions, when executed, cause the machine to:
determine a metric associated with exploring a given path of code variants originating from the source code; and
in response to determining that exploration of the given path of the code variants is unproductive:
extract features from the source code; and
embed the extracted features into a feature dependency vector.
3. The computer-readable medium of claim 2, wherein the instructions, when executed, cause the machine to calculate a score associated with the source code to determine the metric associated with exploring the given path of the code variants, the score being inversely proportional to at least a memory load or memory store count of the source code.
4. The computer-readable medium of claim 2, wherein the query to the Machine Learning (ML) model is run using the feature dependency vector as input.
5. The computer-readable medium of claim 1, wherein parameters of the Machine Learning (ML) model are updated recursively.
6. The computer-readable medium of claim 1, wherein parameters of the Machine Learning (ML) model include at least a score or an action associated with the code variant.
7. The computer readable medium of any of claims 1 to 6, wherein the instructions, when executed, cause the machine to compare a score associated with the source code at the current location to a threshold score value to determine that exploration of a given path of the code variants is unproductive.
8. The computer-readable medium of claim 7, wherein the threshold score value is a predetermined value.
9. The computer-readable medium of claim 1, wherein the determination of whether the search tree requires pruning is made by comparing an output of a query to the Machine Learning (ML) model to a threshold profitability value.
10. The computer-readable medium of claim 9, wherein the threshold profitability value is a predetermined value.
11. The computer-readable medium of claim 1, wherein a register-based compiler transformation applicable to the source code comprises at least one of: scalar replacement, unrolling and blocking, loop interchange, or loop fusion.
12. A method of performing machine learning guided compiler optimization for a register-based hardware architecture, the method comprising:
selecting a register-based compiler transformation to apply to source code at a current location in a search tree;
determining whether the search tree requires pruning based on an output of a query to a Machine Learning (ML) model;
pruning the search tree at the current location in response to determining that the search tree needs to be pruned;
generating code variants in response to applying the selected register-based compiler transformation to the source code;
calculating a score associated with the source code at the current location; and
updating parameters of the Machine Learning (ML) model to include the calculated score.
13. The method of claim 12, further comprising:
determining a metric associated with exploring a given path of code variants originating from the source code; and
in response to determining that exploration of the given path of the code variants is unproductive:
extracting features from the source code; and
embedding the extracted features into a feature dependency vector.
14. The method of claim 13, wherein the metric associated with exploring the given path of the code variants originating from the source code is determined by calculating a score associated with the source code, the score being inversely proportional to at least a memory load or memory store count of the source code.
15. The method of claim 13, wherein the query to the Machine Learning (ML) model is run using the feature dependency vector as input.
16. The method of claim 12, wherein parameters of the Machine Learning (ML) model are updated recursively.
17. The method of any of claims 12 to 16, wherein parameters of the Machine Learning (ML) model include at least a score or an action associated with the code variant.
18. The method of claim 13, wherein the determination that exploration of the path of the code variants is unproductive is made by comparing a score associated with the source code at the current location to a threshold score value.
19. The method of claim 12, wherein the determination of whether the search tree needs pruning is made by comparing an output of a query to the Machine Learning (ML) model to a threshold profitability value.
20. The method of claim 12, wherein a register-based compiler transformation applicable to the source code comprises at least one of: scalar replacement, unrolling and blocking, loop interchange, or loop fusion.
21. An apparatus for performing machine learning guided compiler optimization for a register-based hardware architecture, the apparatus comprising:
an interface circuit;
processor circuitry comprising one or more of:
at least one of a central processing unit, a graphics processing unit, or a digital signal processor, the at least one of the central processing unit, the graphics processing unit, or the digital signal processor having: control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions in the apparatus, and one or more registers to store results of the one or more first operations;
a Field Programmable Gate Array (FPGA), the FPGA comprising logic gates, a plurality of configurable interconnects, and storage circuitry, the logic gates and the interconnects performing one or more second operations, the storage circuitry storing results of the one or more second operations; or alternatively
an Application Specific Integrated Circuit (ASIC) including logic gates to perform one or more third operations;
the processor circuit performs at least one of the first operation, the second operation, or the third operation to cause instantiation of:
a transform selection circuit to select a register-based compiler transform to apply to source code at a current location in the search tree;
a search tree pruning circuit to prune the search tree in response to determining that the search tree needs to be pruned at the current location;
code variant generation circuitry to apply the selected register-based compiler transformation to the source code; and
a Machine Learning (ML) model parameter updating circuit to calculate a score associated with source code at a current location in the search tree and update parameters of the ML model to include the calculated score.
22. The apparatus of claim 21, further comprising:
a profitability calculation circuit to determine metrics associated with exploring a given path of code variants originating from the source code;
source code feature extraction circuitry to extract features from the source code in response to determining that exploration of the given path of the code variants is unproductive; and
a feature embedding circuit to embed the extracted features into a feature dependency vector.
23. The apparatus of claim 21, wherein the search tree pruning circuit determines whether the search tree requires pruning by comparing an output of a query to the Machine Learning (ML) model to a threshold profitability value.
24. The apparatus of any of claims 21 to 23, wherein the transform selection circuit selects a register-based compiler transform to apply to the source code, the register-based compiler transform comprising at least one of: scalar replacement, unrolling and blocking, loop interchange, or loop fusion.
CN202211462568.6A 2021-12-23 2022-11-22 Method and apparatus for machine learning guided compiler optimization Pending CN116339704A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US17/561,417 US11954466B2 (en) 2021-12-23 2021-12-23 Methods and apparatus for machine learning-guided compiler optimizations for register-based hardware architectures
US17/561,417 2021-12-23

Publications (1)

Publication Number Publication Date
CN116339704A true CN116339704A (en) 2023-06-27

Family

ID=81186422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211462568.6A Pending CN116339704A (en) 2021-12-23 2022-11-22 Method and apparatus for machine learning guided compiler optimization

Country Status (3)

Country Link
US (1) US11954466B2 (en)
CN (1) CN116339704A (en)
DE (1) DE102022129219A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11790243B1 (en) * 2022-06-30 2023-10-17 International Business Machines Corporation Ferroelectric field effect transistor for implementation of decision tree

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321999B (en) * 2018-03-30 2021-10-01 赛灵思电子科技(北京)有限公司 Neural network computational graph optimization method
US10884485B2 (en) * 2018-12-11 2021-01-05 Groq, Inc. Power optimization in an artificial intelligence processor
US20210124739A1 (en) * 2019-10-29 2021-04-29 Microsoft Technology Licensing, Llc Query Processing with Machine Learning
US20220051104A1 (en) * 2020-08-14 2022-02-17 Microsoft Technology Licensing, Llc Accelerating inference of traditional ml pipelines with neural network frameworks
US20220374723A1 (en) * 2021-05-10 2022-11-24 Nvidia Corporation Language-guided distributional tree search

Also Published As

Publication number Publication date
DE102022129219A1 (en) 2023-06-29
US20220121430A1 (en) 2022-04-21
US11954466B2 (en) 2024-04-09

Legal Events

Date Code Title Description
PB01 Publication