CN117546141A

CN117546141A - Apparatus, article of manufacture, and method for managing processing units

Info

Publication number: CN117546141A
Application number: CN202180099788.4A
Authority: CN
Inventors: 文森特·齐默; 尼尔什·贾因; 拉杰什·普尔纳沙德朗; 阿比吉特·戴维斯; 考什克·巴拉苏布兰马尼安; 昌特·拉斯韦尔; 卡兰·普坦纳亚; 方家豪; 苏布拉塔·巴尼克; 拉贾拉姆·雷古帕蒂; 萨利尔·马塔尚·托马斯
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2021-06-23
Filing date: 2021-12-24
Publication date: 2024-02-09

Abstract

Apparatus, articles of manufacture, and methods for managing a processing unit are disclosed. Examples disclosed herein facilitate management of systems utilizing heterogeneous processing units, XPUs, and the like to efficiently utilize such processing units. For example, some apparatus, articles of manufacture, and methods facilitate hardware resource based resource sharing, resource allocation, and/or kernel generation.

Description

Apparatus, article of manufacture, and method for managing processing units

RELATED APPLICATIONS

This patent claims the benefit of Indian patent application No. 202141028125, filed on June 23, 2021; U.S. patent application No. 63/222,938, filed on July 16, 2021; Indian patent application No. 202141036070, filed on August 10, 2021; U.S. patent application No. 17/645,742, filed on December 22, 2021; U.S. patent application No. 17/559,730, filed on December 22, 2021; U.S. patent application No. 17/560,025, filed on December 22, 2021; and U.S. patent application No. 17/558,284, filed on December 21, 2021. Indian patent application No. 202141028125, U.S. patent application No. 63/222,938, Indian patent application No. 202141036070, U.S. patent application No. 17/645,742, U.S. patent application No. 17/559,730, U.S. patent application No. 17/560,025, and U.S. patent application No. 17/558,284 are all incorporated herein by reference. Priority to Indian patent application No. 202141028125, U.S. patent application No. 63/222,938, Indian patent application No. 202141036070, U.S. patent application No. 17/645,742, U.S. patent application No. 17/559,730, U.S. patent application No. 17/560,025, and U.S. patent application No. 17/558,284 is hereby claimed.

Technical Field

The present disclosure relates generally to computing systems, and more particularly, to an apparatus, article of manufacture, and method for managing processing units.

Background

The evolution of computing systems has led to systems that include many types of processing units. For example, the XPU concept is directed to the use of application-specific processing elements that may be included in a computing system. For example, a computing system may include a general purpose processing unit, a graphics processing unit, and an artificial intelligence processing unit. XPU is a cross-architecture computing solution whose processing units can be tied together under a single application programming interface (e.g., the oneAPI standard application programming interface) that manages the assignment of each task to the processing unit best suited to handle it. For example, many cloud service providers (CSPs) are evolving their hardware platforms into disaggregated elements consisting of general purpose processors, heterogeneous accelerators, and purpose-built, vertically integrated infrastructure processing units (IPUs). Such processing units may be implemented as an add-in card (e.g., a Peripheral Component Interconnect Express (PCIE) add-in card), as an external processing unit connected via a cable (e.g., via a Thunderbolt port), as a motherboard-down (MB-down) solution soldered or otherwise attached to a motherboard, as a unit built into a central processing unit (CPU), etc.

Drawings

FIG. 1 is a block diagram of an example architecture for supporting heterogeneous computing.

FIG. 2 is a block diagram of an example architecture for sharing memory between two processing units (e.g., a CPU and a GPU).

FIG. 3 is a block diagram of an example method for sharing SPI flash using attached flash sharing.

FIG. 4 illustrates an example updated IFWI layout for the SPI flash of FIG. 2.

FIG. 5 is a flowchart representative of example machine readable instructions and/or example operations which may be executed and/or instantiated by the processor circuit to perform firmware booting of a system in which shared access flash memory has been implemented between two processing units.

FIG. 6 is a block diagram of an example layout of a BIOS (e.g., BIOS stored in region 2 of the IFWI layout of FIG. 4).

FIGS. 7A and 7B are flowcharts representative of example machine readable instructions and/or example operations which may be executed and/or instantiated by a processor circuit to perform a unified initialization of a processing unit using silicon initialization code.

FIG. 8 is a flow chart illustrating an example detailed unified FSP initialization flow using an Integrated Graphics Device (IGD) and GPU.

FIG. 9 is a block diagram of an example architecture for an IPURDT.

FIG. 10 is a flowchart representative of example machine readable instructions and/or example operations which may be executed and/or instantiated by the processor circuit to perform a configuration using the IPURDT.

FIG. 11 is a flow diagram representing example machine readable instructions and/or example operations executable and/or instantiated by a processor circuit to negotiate to dynamically configure resources based on tolerances specified by an application and available IPU resources.

FIG. 12 illustrates an example environment in which resources managed by an IPU, spread across CPUs, GPUs, SSDs, and the like, are in various free and busy states.

FIG. 13 illustrates an example environment in which consensus in collaborative resource management is implemented via a decentralized public blockchain ledger.

FIG. 14 is a block diagram of an example dynamically negotiable deep neural network library.

FIG. 15 is a flowchart representative of example machine readable instructions and/or example operations which may be executed and/or instantiated by the processor circuit to select features for deep neural network learning based on hardware capabilities.

FIG. 16 is a block diagram of an example processing platform including processor circuitry configured to execute example machine readable instructions and/or example operations to implement the example configurable machine learning system configurator of FIG. 1, FIG. 2, and/or FIG. 3.

FIG. 17 is an illustration of an example automatic machine learning (AutoML) architecture that includes an example machine learning system configurator to identify and/or generate configurable machine learning computing nodes.

FIG. 18 is a block diagram of an example configuration of a dynamic XPU hardware-aware Deep Learning (DL) model management system 200 implemented in accordance with the teachings of the present disclosure.

FIG. 19 is a flowchart representative of example machine readable instructions and/or example operations executable by the example processor circuit to implement the example model training circuit of FIG. 18.

FIG. 20 is a flowchart representative of example machine readable instructions and/or example operations which may be executed by the example processor circuit to implement the example model management circuit of FIG. 18.

FIG. 21 is a block diagram of an example processing platform including processor circuitry configured to execute the example machine readable instructions and/or the example operations of FIG. 19 to implement the model training circuitry and model management circuitry of FIG. 18.

FIG. 22 is a block diagram of an example system implemented in accordance with the teachings of the present disclosure for data-enhanced automation model generation.

FIG. 23 is a block diagram of an example process flow utilizing the example system of FIG. 22.

FIG. 24 is a flowchart representative of example machine readable instructions and/or example operations that may be executed by the example processor circuit to implement the example knowledge builder circuit and the example model builder circuit of FIG. 22.

FIG. 25 is a flowchart representative of example machine readable instructions and/or example operations which can be executed by the example processor circuit to implement the example target hardware of FIG. 22.

FIG. 26 is a block diagram of an example processing platform including processor circuitry configured to execute the example machine readable instructions and/or the example operations of FIG. 24 to implement the example knowledge builder circuit and the example model builder circuit of FIG. 22.

FIG. 27 is a block diagram of an example computing device.

FIG. 28 is a block diagram of an implementation of the example Instruction Set Architecture (ISA) management circuit and microcode processing circuit of FIG. 27.

FIGS. 29 and 30 are flowcharts representative of example machine readable instructions executable by the example processor circuit to implement the ISA management circuit of FIG. 28.

FIG. 31 is a flowchart representative of example machine readable instructions executable by the example processor circuit to implement the microcode processing circuitry of FIG. 28.

FIG. 32 is an exemplary diagram representing exemplary operations that may be performed by ISA management circuitry of FIG. 28.

FIG. 33 is a block diagram of an example processing platform including processor circuitry configured to execute the example machine readable instructions of FIGS. 29-31 to implement the example computing device of FIG. 27.

FIG. 34 is an illustration of an example automatic machine learning (AutoML) architecture that includes an example machine learning system configurator to identify and/or generate configurable machine learning computing nodes.

FIG. 35 is a block diagram of an example implementation of the machine learning system configurator of FIG. 34.

FIG. 36 is a block diagram of an example implementation of the machine learning system configurator of FIG. 34 and/or FIG. 35.

FIG. 37 is an illustration of an example workflow of generating a configurable machine learning computing node.

FIG. 38 is an illustration of another example workflow for identifying configurable machine learning computing nodes.

FIG. 39 is an illustration of an example implementation of an example ontology database.

FIG. 40 is an illustration of another example workflow for identifying configurable machine learning computing nodes.

FIG. 41 is a flowchart representative of example machine readable instructions and/or example operations executable by the example processor circuit to implement the example configurable machine learning system configurator of FIG. 34, FIG. 35, and/or FIG. 36 to perform workload with the configurable machine learning computing node.

FIG. 42 is a flowchart representative of example machine readable instructions and/or example operations executable by the example processor circuit to implement the example configurable machine learning system configurator of FIG. 34, FIG. 35, and/or FIG. 36 to generate a first configuration of one or more machine learning models based on machine learning workloads.

FIG. 43 is a flowchart representative of example machine readable instructions and/or example operations which may be executed by the example processor circuit to implement the example configurable machine learning system configurator of FIG. 34, FIG. 35, and/or FIG. 36 to generate a second configuration of hardware.

FIG. 44 is a flowchart representative of example machine readable instructions and/or example operations executable by the example processor circuit to implement the example configurable machine learning system configurator of FIG. 34, FIG. 35, and/or FIG. 36 to adjust the first configuration based on the evaluation parameters.

FIG. 45 is a flowchart representative of example machine readable instructions and/or example operations executable by the example processor circuit to implement the example configurable machine learning system configurator of FIG. 34, FIG. 35, and/or FIG. 36 to adjust the second configuration based on the evaluation parameters.

FIG. 46 is a flowchart representative of example machine readable instructions and/or example operations executable by the example processor circuit to implement the example configurable machine learning system configurator of FIG. 34, FIG. 35, and/or FIG. 36 to deploy computing nodes to perform machine learning workloads.

FIG. 47 is a block diagram of an example processing platform including processor circuitry configured to execute the example machine readable instructions of FIGS. 41-46 and/or the example operations to implement the example configurable machine learning system configurator of FIGS. 34, 35, and/or 36.

FIG. 48 is a block diagram of an example implementation of the processor circuit of FIGS. 16, 21, 26, 33, and/or 47.

FIG. 49 is a block diagram of another example implementation of the processor circuit of FIGS. 16, 21, 26, 33, and/or 47.

FIG. 50 is a block diagram of an example software distribution platform (e.g., one or more servers) for distributing software (e.g., software corresponding to example machine-readable instructions described herein) to client devices associated with end users and/or consumers (e.g., for licensing, selling and/or using), retailers (e.g., for selling, resale, licensing and/or secondary licensing), and/or Original Equipment Manufacturers (OEMs) (e.g., for inclusion in products to be distributed to, e.g., retailers and/or other end users such as direct purchasing customers).

In general, the same reference numerals will be used throughout the various figures and the accompanying written description to refer to the same or like parts. The figures are not to scale.

Detailed Description

As used herein, connection references (e.g., attached, coupled, connected, and joined) may include intermediate members between the elements referenced by the connection reference and/or relative movement between those elements unless otherwise indicated. Thus, connection references do not necessarily imply that two elements are directly connected and/or in fixed relation to each other.

Unless specifically stated otherwise, descriptors such as "first," "second," "third," and the like are used herein without imputing or otherwise indicating any meaning of priority, physical order, arrangement in a list, and/or ordering in any way, but are merely used as labels and/or arbitrary names to distinguish elements for ease of understanding the disclosed examples. In some examples, the descriptor "first" may be used in the detailed description to refer to a certain element, while the same element may be referred to in the claims by a different descriptor, such as "second" or "third". In such cases, it should be understood that such descriptors are used merely to distinctly identify elements that might, for example, otherwise share the same name.

As used herein, "substantially real-time" and "substantially simultaneous" refer to occurrence in a near instantaneous manner, recognizing that there may be real-world delays for computing time, transmission, etc. Thus, unless otherwise indicated, "substantially real-time" and "substantially simultaneous" refer to real-time +/-1 second. As used herein, the phrase "in communication with" (including variations thereof) encompasses direct communication and/or indirect communication through one or more intermediary components. It does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events.

As used herein, "processor circuit" is defined to include (i) one or more special purpose electrical circuits configured to perform the specified operation(s) and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors), and/or (ii) one or more general purpose semiconductor-based electrical circuits programmed with instructions to perform the specified operation(s) and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors). Examples of processor circuits include programmed microprocessors, field programmable gate arrays (FPGAs) that can instantiate instructions, central processor units (CPUs), graphics processor units (GPUs), digital signal processors (DSPs), XPUs, microcontrollers, and integrated circuits such as application specific integrated circuits (ASICs). For example, an XPU may be implemented by a heterogeneous computing system (e.g., a computing system having one or more heterogeneous processing units) that includes multiple types of processor circuitry (e.g., one or more FPGAs, one or more CPUs, one or more GPUs, one or more DSPs, etc., and/or combinations of these) and application programming interface(s) (API(s)) that may assign computing task(s) to whichever one(s) of the multiple types of processor circuitry is/are best suited to perform the computing task(s).

Computer components, such as components including processors (including heterogeneous processors), and/or other computer components, may use firmware to boot, initialize, and/or operate. It is desirable to provide multiple processing capabilities, such as graphics and/or artificial intelligence, to computer components and computers. It is also desirable to reduce the bill of materials (BOM) and/or cost of such computing systems. Apparatuses, articles of manufacture, and methods are disclosed that facilitate resource sharing among processors such as CPUs, GPUs, AI chips, FPGAs, ASICs, microcontrollers (e.g., embedded microcontrollers), and the like. Identifying common and/or shareable resources between the CPU and other processors in a heterogeneous processor platform (e.g., a platform including a CPU and discrete graphics) may reduce dedicated hardware usage on the platform, which may help reduce BOM costs. The apparatus, articles of manufacture, and methods disclosed herein improve efficiency, for example, by reusing firmware and/or software (e.g., using the oneAPI library).

Some cloud service providers (CSPs) are evolving their hardware platforms into disaggregated elements consisting of general purpose processors, heterogeneous accelerators, and purpose-built, vertically integrated infrastructure processing units (IPUs), XPUs, DPUs, etc. Some resource management systems (RMSs) (e.g., RDT) operate in the domain of the CPU as a control point and manage server node level platform resources with the CPU as a pivot. This approach may not be scalable or even applicable to microservice-based infrastructure hosted by the IPU, where the IPU becomes the control point. IPU-based systems are disrupting the way data center resource management systems operate (e.g., moving away from the CPU as a control point to disaggregated, heterogeneous, self-manageable intelligent accelerators).

The apparatus, articles of manufacture, and methods disclosed herein facilitate implementation of an IPU resource management system (IPURMS) that provides distributed services. In some examples, the proposed IPURMS provides low-latency, microservice-oriented, decentralized peer-to-peer (P2P) IPU resource negotiation and management without CPU-centric involvement. In some examples, the proposed IPURMS provides application-aware resource management, where the IPU can dynamically renegotiate RMS service level agreements (SLAs) for various microservices at runtime. In some examples, the proposed IPURMS facilitates IPU P2P negotiation and resource management via decentralized, distributed public ledger (blockchain) tracking with revocation capability to track/record auditable telemetry. In some examples, the proposed IPURMS includes an IPU divided into two parts, i.e., i) a data plane, and ii) a control plane. The control plane handles resource allocation, monitoring, and policy enforcement, and the data plane handles data flow between the IPU and the logic units associated with the IPU.
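By way of illustration only, runtime SLA renegotiation by an IPU control plane might be sketched in Python as follows. The sketch assumes a single abstract resource counted in integer units and a simple grant rule (grant as much as is available, provided the grant is not below the application's stated minimum). The names `SlaRequest`, `IpuControlPlane`, and `negotiate` are hypothetical and do not appear in this disclosure, and the actual negotiation protocol may differ.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SlaRequest:
    """Hypothetical SLA request from a microservice."""
    service: str
    requested: int        # units the service would like
    min_acceptable: int   # tolerance: smallest grant the service accepts


class IpuControlPlane:
    """Hypothetical control plane tracking one pool of resource units."""

    def __init__(self, capacity: int) -> None:
        self.available = capacity
        self.grants: Dict[str, int] = {}

    def negotiate(self, req: SlaRequest) -> Optional[int]:
        """Grant or renegotiate an SLA; return the grant, or None if refused."""
        # On renegotiation, the service's current grant counts as reusable.
        previous = self.grants.get(req.service, 0)
        pool = self.available + previous
        grant = min(req.requested, pool)
        if grant < req.min_acceptable:
            return None  # refused; any previous grant stays in place
        self.available = pool - grant
        self.grants[req.service] = grant
        return grant


ipu = IpuControlPlane(capacity=10)
first = ipu.negotiate(SlaRequest("svc-a", requested=8, min_acceptable=4))
refused = ipu.negotiate(SlaRequest("svc-b", requested=6, min_acceptable=3))
shrunk = ipu.negotiate(SlaRequest("svc-a", requested=5, min_acceptable=5))
retry = ipu.negotiate(SlaRequest("svc-b", requested=6, min_acceptable=3))
```

In this sketch, the second service is refused until the first service renegotiates to a smaller grant, after which the freed units become available without any CPU involvement.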

Deep neural network (DNN) libraries (e.g., the oneAPI Deep Neural Network Library (oneDNN)) provide computational primitives that improve deep learning performance on CPUs, GPUs, etc., or any combination thereof, through a unified API. Existing DNN libraries detect underlying target hardware capabilities (e.g., deep learning enhancement techniques) to accelerate inference/training performance. For example, oneDNN may use just-in-time (JIT) code generation and attempt to select an instruction set architecture (ISA) or mix of ISAs based on the detected target hardware features. Even though this abstraction provides the ability to utilize the underlying hardware capabilities, challenges remain. The apparatus, articles of manufacture, and methods disclosed herein provide a dynamically negotiable deep neural network library with a configurable and negotiable interface that allows an application framework to specify SLAs to configure JIT code generation parameters at runtime. Such a system may be policy configurable, with or without a platform trusted execution environment (TEE), which may help dynamically manage kernels in terms of power, performance, energy efficiency, and optimization, in addition to pure hardware capabilities. The apparatus, articles of manufacture, and methods disclosed herein filter a set of implementation parameters to identify a candidate set based on the application SLA and platform information. A respective JIT kernel may be dynamically generated for each member of the candidate set. The apparatus, articles of manufacture, and methods disclosed herein can dry run these kernels one by one, pick the one with the best performance (e.g., power/energy efficiency, TCO advantage, etc.), and cache it for later use.
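By way of illustration only, the filter, generate, dry-run, select, and cache flow described above might be sketched in Python as follows. This is not the oneDNN API. The parameter tuples, the ISA labels, the SLA limit `sla_max_block`, and the functions `filter_candidates`, `generate_kernel`, and `select_kernel` are hypothetical, and a plain Python closure stands in for JIT-generated machine code. Wall-clock time is assumed as the default cost; a power or energy measurement could be substituted through the `measure` argument.

```python
import time
from typing import Callable, Dict, List, Sequence, Set, Tuple

Params = Tuple[str, int]             # hypothetical: (ISA label, block size)
Kernel = Callable[[Sequence[int]], int]

_KERNEL_CACHE: Dict[str, Tuple[Params, Kernel]] = {}


def filter_candidates(all_params: List[Params], platform_isas: Set[str],
                      sla_max_block: int) -> List[Params]:
    """Keep parameters the platform supports and the SLA permits."""
    return [p for p in all_params
            if p[0] in platform_isas and p[1] <= sla_max_block]


def generate_kernel(params: Params) -> Kernel:
    """Stand-in for JIT generation: a blocked sum specialized on block size."""
    _, block = params

    def kernel(data: Sequence[int]) -> int:
        total = 0
        for i in range(0, len(data), block):
            total += sum(data[i:i + block])
        return total

    return kernel


def time_kernel(params: Params, kernel: Kernel, sample: Sequence[int]) -> float:
    """Default dry-run cost: elapsed wall-clock seconds for one run."""
    start = time.perf_counter()
    kernel(sample)
    return time.perf_counter() - start


def select_kernel(key: str, all_params: List[Params], platform_isas: Set[str],
                  sla_max_block: int, sample: Sequence[int],
                  measure=time_kernel) -> Tuple[Params, Kernel]:
    """Dry run each candidate kernel, keep the cheapest, and cache it."""
    if key in _KERNEL_CACHE:
        return _KERNEL_CACHE[key]  # reuse the earlier selection
    candidates = filter_candidates(all_params, platform_isas, sla_max_block)
    if not candidates:
        raise ValueError("no implementation satisfies the SLA on this platform")
    best = None
    best_cost = float("inf")
    for params in candidates:
        kernel = generate_kernel(params)
        cost = measure(params, kernel, sample)
        if cost < best_cost:
            best, best_cost = (params, kernel), cost
    _KERNEL_CACHE[key] = best
    return best
```

A caller would invoke `select_kernel` once per primitive and workload shape; later calls with the same key return the cached kernel without repeating the dry runs.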

FIG. 1 is a block diagram of an example architecture 100 that includes an example optimization application 104, example optimization middleware and framework 106, and example application programming interfaces (APIs) 108. In some examples, the optimization application 104 may be implemented by an application (e.g., a software application, a web or browser-based application, etc.) that is customized, tailored, and/or otherwise optimized to enable identification and/or generation of configurable ML computing nodes. For example, the optimization application 104 can be accessed, utilized, etc. by a developer (e.g., software developer, researcher, etc.), information technology (IT) personnel, etc. In some such examples, the optimization application 104 may be accessed, utilized, etc. to co-design hardware/software (HW/SW) solutions to technical problems that may benefit from AI/ML technology. In some examples, the optimization middleware and framework 106 may be implemented by middleware and frameworks that are customized, tailored, and/or otherwise optimized to enable identification and/or generation of configurable ML computing nodes. For example, the optimization middleware and framework 106 can implement interfaces (e.g., communications, connectivity, etc.) between the optimization application 104 and the APIs 108.

The APIs 108 of the illustrated example may be invoked to program, develop, and/or otherwise generate an AI/ML application through at least one of direct programming or API-based programming. The APIs 108 of the illustrated example include an example migration tool 110, an example direct programming API 112, an example API-based programming API 114, and an example analysis tool 116.

In some examples, the migration tool 110 may be implemented by software (e.g., a software application) that may adapt a program to execute in a first computing or electronic environment that differs from a second computing or electronic environment for which the program was originally designed. For example, the migration tool 110 may transform and/or otherwise adapt a first program developed for a first type of hardware, operating system (OS), library, etc., into a second program for a second type of hardware, OS, library, etc.

In some examples, the direct programming API 112 may be invoked to implement a direct programming task, which may include developing and/or compiling a Data Parallel C++ application. In some examples, the API-based programming API 114 may be invoked to implement API-based programming, which may include developing and/or compiling applications that call (or invoke, instantiate, etc.) a Math Kernel Library (MKL), an MKL deep neural network (DNN) library, a data analytics acceleration library, a thread building block library, a parallel standard template library, a media software development kit (SDK), a deep learning deployment kit, a machine learning scaling library, etc., and/or any combination(s) of these.

In some examples, the analysis tool 116 may be called, instantiated, and/or otherwise invoked to analyze hardware, software, and/or configuration(s) thereof of the configurable ML computing node. For example, the analysis tool 116 may instantiate simulator(s) to simulate all hardware and/or software features of the configurable ML computing node to generate and/or otherwise output one or more evaluation parameters. In some such examples, the evaluation parameters may include parameters that represent and/or otherwise indicate the accuracy, latency, number of cycles to complete the workload, or throughput of the configurable ML computing node. In some examples, the evaluation parameters may include parameters that represent and/or otherwise indicate: processor or clock frequency, fabric frequency, read memory bandwidth, write memory bandwidth, hardware throttling factor, number of memory ports, number of data processing units (DPUs), number of model layers (e.g., neural network layers, convolutional layers, etc.), activation precision (e.g., precision of activation values to be processed), weight precision (e.g., precision of weight values to be processed), and/or the like, and/or any combination(s) of these. For example, the analysis tool 116 may execute a simulator based on the configurable ML computing node. In some such examples, the analysis tool 116 may execute a simulator to determine the throughput of the configurable ML computing node when executing a particular AI/ML model with a particular configuration.

In some examples, the analysis tool 116 may instantiate simulator(s) to simulate behavior, configuration, etc. of the configurable ML computing node to generate and/or otherwise output one or more evaluation parameters. For example, the analysis tool 116 may execute a model (e.g., a simulation model, an AI/ML model, etc.) based on the configurable ML computing node. In some such examples, the analysis tool 116 may execute a model to estimate, predict, and/or otherwise determine the throughput of the configurable ML computing node when executing a particular AI/ML model with a particular configuration.
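By way of illustration only, a few of the evaluation parameters named above might be related to one another as in the following Python sketch. It assumes a fixed clock frequency and a workload that completes in a known number of cycles, so that latency is cycles divided by frequency and throughput is items processed divided by latency. The names `EvaluationParameters` and `evaluate` are hypothetical, and a real simulator would account for many more factors (memory bandwidth, throttling, precision, etc.).

```python
from dataclasses import dataclass


@dataclass
class EvaluationParameters:
    """Hypothetical record of simulator outputs for one configuration."""
    cycles: int               # cycles to complete the workload
    latency_s: float          # seconds to complete the workload
    throughput_per_s: float   # items processed per second


def evaluate(cycles: int, clock_hz: float, items: int) -> EvaluationParameters:
    """Derive latency and throughput from a cycle count at a fixed clock."""
    if cycles <= 0 or clock_hz <= 0:
        raise ValueError("cycles and clock_hz must be positive")
    latency = cycles / clock_hz
    return EvaluationParameters(cycles, latency, items / latency)


# Example: 2,000,000 cycles at 1 GHz for a batch of 100 items.
result = evaluate(cycles=2_000_000, clock_hz=1e9, items=100)
```

Under these assumptions, the example batch takes about 2 ms and the node sustains about 50,000 items per second.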

The architecture 100 of the illustrated example includes different types of hardware and/or software that may be used to generate a configurable ML computing node. In the illustrated example, architecture 100 includes interfaces and target system software for scalar, vector, matrix, and spatial hardware. Additionally and/or alternatively, any other type of hardware may be used. In this example, scalar hardware is implemented by an example CPU 118 and example CPU system software 120. For example, the CPU system software 120 may include instructions corresponding to a CPU instruction set architecture (ISA). In this example, vector hardware is implemented by an example GPU 122 and example GPU system software 124. For example, the GPU system software 124 may include kernels, portion(s) of code, etc., such as compute kernels and/or shaders. In some examples, the kernels, portion(s) of code, etc. may be expressed in a high-level programming language, such as High-Level Shader Language (HLSL), OpenCL, etc.

In this example, the matrix hardware is implemented by an example AI processor 126 and example AI system software 128. For example, the AI system software 128 may include one or more AI/ML algorithms, models, etc., such as neural networks (e.g., convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), etc.), linear regression models, logistic regression models, decision tree models, learning vector quantization models, etc., and/or combination(s) thereof. In this example, the spatial hardware is implemented by an example FPGA 130 and example FPGA system software 132. For example, the FPGA system software 132 may include kernels, portion(s) of code, etc., based on a hardware description language (HDL), such as Verilog.
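By way of illustration only, the pairing of scalar, vector, matrix, and spatial workloads with the CPU 118, GPU 122, AI processor 126, and FPGA 130, together with an API that assigns each task to the best-suited processing unit, might be sketched in Python as follows. The `ProcessingUnit` type, the `assign` function, and the numeric suitability scores are hypothetical and are chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProcessingUnit:
    name: str
    # Hypothetical suitability score per workload kind; higher is better.
    suitability: Dict[str, float]


def assign(task_kind: str, units: List[ProcessingUnit]) -> ProcessingUnit:
    """Return the unit with the highest suitability for the task kind."""
    capable = [u for u in units if task_kind in u.suitability]
    if not capable:
        raise ValueError(f"no processing unit supports {task_kind!r}")
    return max(capable, key=lambda u: u.suitability[task_kind])


# Illustrative scores only: each unit is strongest at one workload kind.
UNITS = [
    ProcessingUnit("CPU", {"scalar": 0.9, "vector": 0.4, "matrix": 0.2}),
    ProcessingUnit("GPU", {"vector": 0.9, "matrix": 0.6}),
    ProcessingUnit("AI processor", {"matrix": 0.95}),
    ProcessingUnit("FPGA", {"spatial": 0.9}),
]

chosen = {kind: assign(kind, UNITS).name
          for kind in ("scalar", "vector", "matrix", "spatial")}
```

A static score table is the simplest possible policy; a practical system could instead derive the scores from measured evaluation parameters or from current resource availability.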

In the illustrated example, the CPU system software 120, GPU system software 124, AI system software 128, FPGA system software 132, host interface 134, and/or zero-level interface 136 may correspond to and/or otherwise implement example below-zero-level system software 138. For example, the below-zero-level system software 138 may correspond to and/or otherwise be implemented as a low-level direct-to-metal interface tailored to hardware such as the CPU 118, GPU 122, and the like.

In the illustrated example, the APIs 108 may implement example above-zero-level system software 140 and an example developer interface 142. For example, a developer, user, etc. may access and/or otherwise utilize architecture 100 through the APIs 108. In some examples, developers, users, etc. may access and/or otherwise utilize system software at a higher level than the low-level direct-to-metal interfaces by way of the APIs 108. In some examples, developers, users, etc. may access and/or otherwise utilize the below-zero-level system software 138 via the host interface 134 and/or the zero-level interface 136.

Architecture 100 is well suited to facilitate efficient utilization of hardware, such as the CPU 118, GPU 122, and the like, by way of the APIs 108. For example, an API may be added to the APIs 108 to facilitate and/or improve various processes. For example, the disclosed examples include APIs for a set of library functions that can communicate with XPU hardware (e.g., to facilitate sharing of firmware and software resources between processing units). In some disclosed examples, the APIs 108 may include a platform component to support machine learning (e.g., a dynamically negotiable deep neural network platform). For example, the machine learning component of the APIs 108 can operate to improve the targeting of hardware capabilities to improve performance (e.g., to improve deep learning inference performance). The disclosed API improvements (as well as other improvements disclosed herein) may be implemented alone and/or in combination. For example, the APIs 108 may include APIs for a set of library functions that may communicate with XPU hardware to facilitate sharing of firmware and software resources between processing units, and the APIs 108 may include APIs to improve the targeting of hardware capabilities to improve deep learning inference performance. For example, the various improvements, when combined, may provide additive system performance improvements and reduce BOM costs.

Symbiotic boot

FIG. 2 is a block diagram of an example architecture 200 for sharing memory between two processing units (e.g., a CPU and a GPU). For example, architecture 200 may be utilized in conjunction with architecture 100 of FIG. 1 or any other computer architecture including a plurality of processing units. The example architecture 200 of FIG. 2 includes an example CPU 202 that includes an example platform controller hub 204 and an example serial peripheral interface (SPI) 206; an example GPU 208 that includes an example dedicated GPU flash 210 and an example shared SPI 212; and an example SPI flash 214. According to the illustrated example, architecture 200 facilitates sharing of SPI flash 214 by CPU 202 and GPU 208.

The example CPU 202 is a central processing unit for a computing system. Alternatively, CPU 202 may be any other type of processing unit. The example CPU 202 includes an example platform controller hub (PCH) 204 that includes circuitry, software, and/or firmware to manage data paths and support the functionality of the CPU 202. Alternatively, any other type of control circuitry, chipset, software, and/or firmware may be utilized. The example PCH 204 may include several interfaces including, according to the illustrated example, an SPI 206. The example SPI 206 interfaces the PCH 204 and the CPU 202 with the SPI flash 214 to facilitate initialization and booting of the CPU 202 and the architecture 200 as a whole.

Example GPU 208 is a system-on-chip (SoC) that is soldered to the motherboard on which CPU 202 is mounted (e.g., a motherboard (MB) down solution). Alternatively, GPU 208 may be any other type of processing unit (e.g., an AI processing unit, an XPU, etc.) coupled to architecture 200 in any other manner (e.g., a discrete PCIE-based add-in card (AIC) attached to a PCIE slot in a client device, an external graphics processing unit connected via a cable/port (e.g., a Thunderbolt port) of architecture 200, etc.).

While a typical GPU has its own SPI memory (e.g., 8MB cache) to store instructions, in addition to the CPU's SPI memory (e.g., 32MB cache), for handling the boot process associated with the GPU, the example GPU 208 includes a dedicated GPU flash memory 210 and a shared SPI 212, the shared SPI 212 facilitating sharing of the SPI flash 214 with the CPU 202. According to the illustrated example, the GPU's integrated firmware image (IFWI) is stored in the shared SPI flash 214.

The example SPI flash 214 is an SPI NOR flash memory device that is accessed through an SPI interface. SPI flash 214 stores IFWI information for initialization and booting of CPU 202 and GPU 208. Alternatively, any other type of flash memory may be utilized.

FIG. 3 is a block diagram of an example method for sharing SPI flash 214 using attached flash sharing. According to the illustrated example, the example GPU 208 is communicatively coupled to the example CPU 202 via an example first enhanced SPI (eSPI) interface 302 of the CPU 202, the interface 302 being in communication with an example second eSPI interface 304 of the GPU 208. Thus, GPU 208 may access SPI flash 214 via flash access channels supported by first eSPI interface 302 and second eSPI interface 304, while PCH 204 of CPU 202 accesses SPI flash 214 via SPI 206.

Runtime access to SPI flash 214 through the eSPI interface established by first eSPI 302 and second eSPI 304 passes through the eSPI primary (CPU 202), which routes the cycle to the flash access block of CPU 202 before the cycle is forwarded to the PCH of CPU 202 (e.g., the SPI flash controller of PCH 204). The SPI flash controller then performs the access to SPI flash 214 on behalf of the eSPI secondary (GPU 208). Because the flash access address used by the eSPI secondary device (e.g., GPU 208) is a physical flash linear address, it spans the entire flash address space. However, the SPI flash controller may impose access restrictions on certain regions of the SPI flash 214 to ensure security.

The proposed hardware changes to support sharing SPI flash 214 can be coupled with updates to SPI flash 214 (e.g., updated master portion descriptors) to accommodate dedicated secondary device firmware mapped into SPI flash 214. The descriptor change may facilitate injection of secondary device firmware regions into the IFWI layout on SPI flash 214.

FIG. 4 illustrates an example updated IFWI layout 400 for SPI flash 214. As shown in FIG. 4, IFWI layout 400 includes a dedicated firmware region for each XPU device. For example, the example IFWI layout 400 includes: region 13 for storing firmware (e.g., country-specific code (CSC) firmware, firmware patches, and redundant images) for initializing the GPU; region 14 for storing firmware for a field programmable gate array (FPGA); and region 15 for storing firmware for the AI processing unit. During booting, the basic input output system (BIOS) (e.g., system boot software) is accessed from the SPI flash before booting and initialization. Once a hardware reset (e.g., RESET#) is issued to GPU 208, GPU 208 causes its ROM to begin retrieving a firmware image from SPI flash 214 and reads the descriptor to learn the specific flash range mapped for initializing GPU 208.

The regions of SPI flash 214 may be configured for read or write access by setting protection parameters in the flash descriptor. For example, region 0 may be read-only to the CPU and inaccessible to the GPU, region 1 may be read and written by the CPU (e.g., before end of POST (EOP)) and inaccessible to the GPU, and region 13 may be read and written by both the CPU (e.g., for firmware update) and the GPU.
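The per-region protections described above can be pictured as a lookup keyed by region number and requester. The following is a minimal sketch only: the table layout, the names `REGION_ACCESS` and `is_access_allowed`, and the choice to treat unlisted regions as inaccessible are illustrative assumptions, not the actual flash descriptor format.

```python
# Hypothetical model of per-region protection parameters.
# Only regions 0, 1, and 13 and their permissions come from the description;
# the data structure and names are assumptions made for illustration.
REGION_ACCESS = {
    0: {"CPU": {"read"}, "GPU": set()},
    # The "before end of POST" timing condition on region 1 is not modeled here.
    1: {"CPU": {"read", "write"}, "GPU": set()},
    13: {"CPU": {"read", "write"}, "GPU": {"read", "write"}},
}


def is_access_allowed(region: int, requester: str, operation: str) -> bool:
    """Return True if `requester` may perform `operation` on `region`."""
    permissions = REGION_ACCESS.get(region, {})  # unlisted regions: no access
    return operation in permissions.get(requester, set())
```

Under these assumptions, `is_access_allowed(13, "GPU", "read")` is true, while `is_access_allowed(0, "GPU", "read")` is false.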

Although an example manner of implementing the components of architecture 100 of FIG. 1 is illustrated in FIGS. 2 and 3, one or more of the elements, processes, and/or devices illustrated in FIG. 2 and/or FIG. 3 may be combined, divided, rearranged, omitted, eliminated, and/or implemented in any other way. In addition, the example CPU 202, the example PCH 204, the example SPI 206, the example GPU 208, the example shared SPI 212, the example first eSPI 302, the example second eSPI 304, and/or, more generally, the architectures 200 and/or 300 of FIG. 2 and/or FIG. 3, may be implemented in hardware alone or in combination with software and/or firmware. Thus, for example, any of the example CPU 202, the example PCH 204, the example SPI 206, the example GPU 208, the example shared SPI 212, the example first eSPI 302, the example second eSPI 304, and/or, more generally, the architectures 200 and/or 300 of FIG. 2 and/or FIG. 3, may be implemented by processor circuitry, analog circuit(s), digital circuit(s), logic circuit(s), programmable processor(s), programmable microcontroller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), and/or field programmable logic device(s) (FPLD(s)) (e.g., field programmable gate arrays (FPGAs)). Further, the example architecture 100 of FIG. 1 may include one or more elements, processes, and/or devices in addition to or instead of those shown in FIGS. 2 and 3, and/or may include more than one of any or all of the illustrated elements, processes, and devices.

A flowchart representative of example hardware logic circuits, machine-readable instructions, hardware-implemented state machines, and/or any combination of these for implementing the architecture 200 of FIG. 2 and/or the example architecture 300 of FIG. 3 is shown in FIG. 5. The machine-readable instructions may be one or more executable programs, or portion(s) of an executable program, for execution by a processor circuit, such as the processor circuit 1612 shown in the example processor platform 1600 discussed below in connection with FIG. 16 and/or the example processor circuit discussed below in connection with FIG. 48 and/or FIG. 49. The program may be embodied in software stored on one or more non-transitory computer-readable storage media, such as a compact disc (CD), a floppy disk, a hard disk drive (HDD), a solid state drive (SSD), a digital versatile disc (DVD), a Blu-ray disc, volatile memory (e.g., any type of random access memory (RAM), etc.), or non-volatile memory (e.g., electrically erasable programmable read-only memory (EEPROM), flash memory, an HDD, an SSD, etc.), associated with processor circuitry located in one or more hardware devices, but the entire program and/or portions thereof may be executed by one or more hardware devices other than the processor circuitry and/or embodied in firmware or dedicated hardware. The machine-readable instructions may be distributed across multiple hardware devices and/or executed by two or more hardware devices (e.g., a server and a client hardware device). For example, the client hardware device may be implemented by an endpoint client hardware device (e.g., a hardware device associated with a user) or an intermediary client hardware device (e.g., a radio access network (RAN) gateway that may facilitate communications between the server and the endpoint client hardware device).
Similarly, the non-transitory computer readable storage medium may include one or more media located in one or more hardware devices. Additionally, while the example program is described with reference to the flowchart illustrated in FIG. 5, many other methods of implementing the example architectures 200 and/or 300 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the respective operations without executing software or firmware. The processor circuits may be distributed in different network locations and/or local to one or more hardware devices (e.g., a single core processor (e.g., a single core Central Processing Unit (CPU)), a multi-core processor (e.g., a multi-core CPU), etc.), multiple processors distributed across multiple servers of a server rack, multiple processors distributed across one or more server racks, CPUs and/or FPGAs located in the same package (e.g., the same Integrated Circuit (IC) package or in two or more separate housings, etc.).

The machine-readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a segmented format, a compiled format, an executable format, a packaged format, and the like. Machine-readable instructions described herein may be stored as data or data structures (e.g., as portions of instructions, code, representations of code, etc.) that can be utilized to create, fabricate, and/or generate machine-executable instructions. For example, the machine-readable instructions may be segmented and stored on one or more storage devices and/or computing devices (e.g., servers) located in the same or different locations of a network or collection of networks (e.g., in the cloud, in an edge device, etc.). The machine-readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decrypting, decompressing, unpacking, distributing, reassigning, compiling, etc., in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, machine-readable instructions may be stored as portions that are individually compressed, encrypted, and/or stored on separate computing devices, wherein the portions, when decrypted, decompressed, and/or combined, form a set of machine-executable instructions that implement one or more operations that together form a program such as the one described herein.

In another example, machine-readable instructions may be stored in a state in which they can be read by the processor circuit but require the addition of libraries (e.g., dynamic link libraries (DLLs)), software development kits (SDKs), application programming interfaces (APIs), etc., in order to be executed on a particular computing device or other device. In another example, machine-readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine-readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, a machine-readable medium as used herein may include machine-readable instructions and/or program(s) regardless of the particular format or state of the machine-readable instructions and/or program(s) when stored or otherwise at rest or in transit.

Machine-readable instructions described herein may be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine-readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.

As described above, the example operations of FIG. 5 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on one or more non-transitory computer and/or machine readable media, such as optical storage devices, magnetic storage devices, HDDs, flash memory, read-only memory (ROM), CDs, DVDs, caches, any type of RAM, registers, and/or any other storage device or storage disk where information may be stored for any duration (e.g., for longer periods of time, permanently stored, temporarily stored, for temporarily buffering, and/or for caching of the information). As used herein, the terms non-transitory computer readable medium and non-transitory computer readable storage medium are expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.

"including" and "comprising" (and all forms and tenses thereof) are used herein as open ended terms. Thus, whenever a claim is used as a preamble or in any of the various claims, it is to be understood that additional elements, terms, etc. may be present without falling outside the scope of the respective claim or claim. As used herein, the phrase "at least" is open ended when used as a transitional term in, for example, the preamble of a claim, as are the terms "comprising" and "including". The term "and/or" when used in a form such as A, B and/or C, for example, refers to any combination or subset of A, B, C, e.g., (1) a alone, (2) B alone, (3) C alone, (4) a and B, (5) a and C, (6) B and C, or (7) a and B and C. As used herein in the context of describing structures, components, items, C and/or things, the phrase "at least one of a and B" is intended to refer to an implementation that includes any one of the following: (1) at least one a, (2) at least one B, or (3) at least one a and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase "at least one of a or B" is intended to refer to an implementation that includes any of the following: (1) at least one a, (2) at least one B, or (3) at least one a and at least one B. As used herein in the context of describing the execution or execution of a process, instruction, action, activity, and/or step, the phrase "at least one of a and B" is intended to refer to an implementation that includes any one of the following: (1) at least one a, (2) at least one B, or (3) at least one a and at least one B. Similarly, as used herein in the context of describing the execution or execution of a process, instruction, action, activity, and/or step, the phrase "at least one of a or B" is intended to refer to an implementation that includes any one of the following: (1) at least one a, (2) at least one B, or (3) at least one a and at least one B.

As used herein, singular references (e.g., "a", "an", and "the") do not exclude a plurality. As used herein, the term "a" or "an" object refers to one or more of that object. The terms "a," "an," "one or more," and "at least one" are used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements, or method acts may be implemented by, e.g., the same entity or object. Furthermore, although individual features may be included in different examples or claims, they may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.

FIG. 5 is a flowchart representative of example machine readable instructions and/or example operations 500 that may be executed and/or instantiated by processor circuitry to perform firmware booting of a system in which shared access flash has been implemented between two processing units (e.g., CPU 202 and GPU 208).

The machine-readable instructions and/or operations 500 of FIG. 5 begin at block 502, where the CPU 202 retrieves the BIOS from the SPI flash 214 via the SPI 206. According to the illustrated example, the BIOS executes from region 2 according to IFWI layout 400 of FIG. 4 (block 504). The CPU 202 then continues with programming of the CPU 202 and chipset registers (block 506).

According to the illustrated example, in parallel with BIOS execution, GPU 208 receives a reset (e.g., RESET#) and begins executing the CSC ROM (block 508). Example GPU 208 retrieves GPU firmware from SPI flash 214 (e.g., region 13) (block 510). The example GPU firmware authenticates and loads the pCode patch from the SPI flash 214 (block 512). The GPU firmware executed by GPU 208 then performs memory controller initialization (block 514). Although initialization of GPU 208 is illustrated in blocks 508-514, the program may additionally or alternatively perform initialization of any other processing unit (e.g., initialization of another processing unit may begin after block 514).

GPU 208 determines whether the memory controller initialization is complete (block 516). When the memory controller initialization is complete, the BIOS initiates GPU initialization (block 518). For example, an example process for performing GPU initialization is described in connection with FIGS. 7A and 7B. Once GPU initialization has been performed, any output device (e.g., high-definition multimedia interface (HDMI) or DisplayPort (DP)) on the GPU (e.g., discrete graphics) is ready, with its resolution set and frame buffers allocated, for further display-related use (block 520). The CPU executing the BIOS or operating system (OS) loader uses the frame buffer at boot time to render the pre-OS startup screen (block 522). The process 500 of FIG. 5 is then complete.
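The ordering constraint in the flow of FIG. 5, in which the CPU and GPU paths run in parallel and GPU initialization by the BIOS waits for memory controller initialization, can be sketched as follows. This is a toy trace rather than firmware: the function name `run_shared_flash_boot`, the use of two threads, and the `threading.Event` standing in for the completion check of block 516 are all assumptions made for illustration.

```python
import threading


def run_shared_flash_boot() -> list[str]:
    """Trace the block ordering of FIG. 5 with two parallel paths."""
    log: list[str] = []
    lock = threading.Lock()
    # Assumption: an Event stands in for the "initialization complete" check.
    memory_controller_ready = threading.Event()

    def record(step: str) -> None:
        with lock:
            log.append(step)

    def cpu_path() -> None:
        record("502 CPU retrieves BIOS from SPI flash")
        record("504 BIOS executes from region 2")
        record("506 CPU and chipset registers programmed")
        memory_controller_ready.wait()  # block 516: wait for completion
        record("518 BIOS initiates GPU initialization")
        record("520 output devices ready, frame buffers allocated")
        record("522 pre-OS screen rendered from frame buffer")

    def gpu_path() -> None:
        record("508 GPU receives reset and begins ROM execution")
        record("510 GPU firmware retrieved from region 13")
        record("512 pCode patch authenticated and loaded")
        record("514 memory controller initialized")
        memory_controller_ready.set()

    threads = [threading.Thread(target=cpu_path), threading.Thread(target=gpu_path)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log
```

The interleaving of blocks 502-506 with blocks 508-514 varies from run to run, but block 518 always appears after block 514 in the returned trace.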

FIG. 6 is a block diagram of an example layout of a BIOS 600 (e.g., the BIOS stored in region 2 of IFWI layout 400 of FIG. 4). The example BIOS 600 includes a boot loader 602 and silicon initialization code 604 (e.g., referred to herein as a firmware support package (FSP)). For example, the silicon initialization code may be an FSP that includes support for the shared SPI flash. Example FSP 604 includes an example FSP silicon (FSP-S) 606, an example FSP memory (FSP-M) 608, and an example FSP Temp RAM (FSP-T) 610.

A modern system BIOS is typically composed of two key elements: silicon initialization code provided by the SoC vendor in binary format (e.g., firmware support packages (FSPs)), and an open and/or closed source boot loader implementation (e.g., tianocore, coreboot, Slim Bootloader, etc.) that consumes the silicon initialization code to differentiate the production BIOS for the original design manufacturer (ODM)/original equipment manufacturer (OEM) platform. However, on a platform with multiple heterogeneous processors, each of the other heterogeneous processors has its own SPI flash holding dedicated firmware blobs that execute outside the boundary of the silicon initialization code (e.g., FSP), which can cause redundancy. Having a dedicated firmware blob for each heterogeneous processor requires separate hardware blocks, which results in a higher bill of materials (BoM). Furthermore, discrete graphics (DG) initialization code that is allowed to run in the boot loader context does not qualify for SoC authenticated boot, and executing an option ROM for each processor results in a longer boot time due to the dependencies on PCI enumeration and dynamic resource allocation prior to initializing the controller or device.

According to the illustrated example, the FSP 604 is extended so that all XPU initialization falls within the scope of the FSP, creating a hardware abstraction layer that ensures that all SoC-vendor-recommended chipset programming is performed using a unified block. By utilizing FSP 604 and its components for initialization of a processing unit (e.g., a GPU), the dedicated option ROM may be eliminated, thereby reducing redundant components.

FIGS. 7A-7B are flowcharts representative of example hardware logic circuits, machine-readable instructions, hardware-implemented state machines, and/or any combination of these to implement unified firmware for the example architecture 200 of FIG. 2 and/or the example architecture 300 of FIG. 3. The machine-readable instructions may be one or more executable programs, or portion(s) of an executable program, for execution by a processor circuit, such as the processor circuit 1612 shown in the example processor platform 1600 discussed below in connection with FIG. 16 and/or the example processor circuit discussed below in connection with FIG. 48 and/or FIG. 49. The program may be embodied in software stored on one or more non-transitory computer-readable storage media, such as a compact disc (CD), a floppy disk, a hard disk drive (HDD), a solid state drive (SSD), a digital versatile disc (DVD), a Blu-ray disc, volatile memory (e.g., any type of random access memory (RAM), etc.), or non-volatile memory (e.g., electrically erasable programmable read-only memory (EEPROM), flash memory, an HDD, an SSD, etc.), associated with processor circuitry located in one or more hardware devices, but the entire program and/or a portion thereof may be executed by one or more hardware devices other than the processor circuitry and/or embodied in firmware or dedicated hardware. The machine-readable instructions may be distributed across multiple hardware devices and/or executed by two or more hardware devices (e.g., a server and a client hardware device). For example, the client hardware device may be implemented by an endpoint client hardware device (e.g., a hardware device associated with a user) or an intermediary client hardware device (e.g., a radio access network (RAN) gateway that may facilitate communications between the server and the endpoint client hardware device). Similarly, the non-transitory computer-readable storage medium may include one or more media located in one or more hardware devices.
Additionally, while the example program is described with reference to the flowcharts shown in FIGS. 7A-7B, many other methods of implementing the example architectures 200 and/or 300 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the respective operations without executing software or firmware. The processor circuits may be distributed in different network locations and/or local to one or more hardware devices (e.g., a single core processor (e.g., a single core Central Processing Unit (CPU)), a multi-core processor (e.g., a multi-core CPU), etc.), multiple processors distributed across multiple servers of a server rack, multiple processors distributed across one or more server racks, CPUs and/or FPGAs located in the same package (e.g., the same Integrated Circuit (IC) package or in two or more separate housings, etc.).

Fig. 7A and 7B are flowcharts representative of example machine readable instructions and/or example operations 700 that may be executed and/or instantiated by a processor circuit to perform unified initialization of a processing unit using silicon initialization code (FSP 604).

The machine-readable instructions and/or operations 700 of FIGS. 7A-7B begin at block 702, where the boot loader 602 owns the reset vector. For example, the boot loader 602 includes real-mode reset vector handler code. In some examples, the boot loader 602 may invoke the FSP-T 610 for cache-as-RAM (CAR) setup and stack initialization. The CPU 202 executing the boot loader 602 populates the FSP initialization parameters (block 704). For example, the boot loader 602 may populate the updateable product data (UPD).

The example boot loader 602 invokes the FSP-M 608 for memory initialization (block 706). Upon exiting from FSP-M 608, the boot loader 602 tears down the CAR (block 708). The boot loader 602 performs silicon programming (block 710). For example, silicon programming may include populating the UPDs for the FSP-S 606. The boot loader 602 then invokes the FSP-S 606 to initialize the chipset (block 712).

According to the illustrated example, a heterogeneous processor (e.g., GPU 208) is soldered onto a motherboard using a dedicated PCI-E slot, so that boot loader 602 does not need to perform PCI enumeration. Instead, boot loader 602 may rely on configuration information for a particular motherboard to provide such PCI-E slot information to FSP 604. Alternatively, boot loader 602 may perform PCI enumeration to identify the hardware.

The boot loader 602 then transfers the call to FSP-S 606 to begin the XPU initialization sequence (block 714). For example, control reaches the XPU initialization sequence within FSP-S 606.

Continuing to FIG. 7B, FSP 604 adds new FSP initialization parameters (e.g., UPDs) to pass PCIE slot information (e.g., information about heterogeneous processors attached via PCIE) from boot loader 602 to the FSP blobs (block 716). For example, the UPDs may include IAXPUAddress, an array of 32-bit UPD parameters populated by the boot loader to inform the FSP 604 of the address, in bus, device, and function form, of each XPU attached via a PCIE slot. For example, the default value is 0x0, which is identified as an invalid address. The format of IAXPUAddress may be: bus << 16 | device << 11 | function << 8 | offset (assumed to be 0). For example, for bus number 0xFE and device/function 0, the IAdGPUAddress UPD value would be 0x00FE0000. Another UPD may be XPUConfigPtr, a 32-bit UPD parameter populated by boot loader 602 to inform FSP 604 of the location of additional configuration data, such as the Video BIOS Table (VBT) for GPU 208. For example, the default value is NULL, which identifies an invalid address.
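The address packing given above (bus shifted by 16, device by 11, function by 8, plus an offset) can be checked with a short sketch. The function names `encode_xpu_address` and `decode_xpu_address` are hypothetical, and the field widths (8-bit bus, 5-bit device, 3-bit function, 8-bit offset) are an assumption based on conventional PCI numbering; the description specifies only the shift amounts.

```python
def encode_xpu_address(bus: int, device: int, function: int, offset: int = 0) -> int:
    """Pack bus/device/function/offset as bus<<16 | device<<11 | function<<8 | offset."""
    # Assumed field widths: bus 8 bits, device 5 bits, function 3 bits, offset 8 bits.
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F
            and 0 <= function <= 0x7 and 0 <= offset <= 0xFF):
        raise ValueError("bus/device/function/offset out of range")
    return (bus << 16) | (device << 11) | (function << 8) | offset


def decode_xpu_address(value: int) -> tuple[int, int, int, int]:
    """Unpack a value produced by encode_xpu_address into (bus, device, function, offset)."""
    return ((value >> 16) & 0xFF, (value >> 11) & 0x1F, (value >> 8) & 0x7, value & 0xFF)


# Worked example from the description: bus 0xFE, device 0, function 0.
assert encode_xpu_address(0xFE, 0, 0) == 0x00FE0000
```

Decoding 0x00FE0000 with the same assumed widths yields bus 0xFE with device, function, and offset all 0.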

Example UPD variable definitions within the FSP 604 may include:

# !BSF NAME:{XPU PCI-E Address format for FSP purposes} TYPE:{EditNum, HEX, (0x00,0xFFFFFFFF)}

# !BSF HELP:{Bootloader tells FSP about the address format of the attached PCIE slot for FSP use, default will be 0, identifying no device attached.}

gPlatformFspPkgTokenSpaceGuid.IAXPUAddress|*|0x20|{0x00FE0000,0x00,0x00}

# !BSF NAME:{XPU configuration pointer}

# !BSF TYPE:{EditNum, HEX, (0x0,0xFFFFFFFF)}

# !BSF HELP:{Point to configuration data file like VBT}

gPlatformFspPkgTokenSpaceGuid.XPUConfigPtr|*|0x04|0x00000000

Returning to process 700, the example boot loader 602 invokes the FSP-S 606 with the XPU address FSP initialization parameters overridden to initialize the display device (e.g., on a discrete GPU (DGPU)) (block 718). The example FSP-S 606 reads the XPU address FSP initialization parameters to determine whether any heterogeneous processors are attached to the platform (block 720). For example, if the IAXPUAddress UPD value is greater than 0, a Dash-G device is present; the bus, device, and function (BDF) information is then retrieved from the UPD, and the XPU configuration pointer is read to determine whether a configuration table such as a VBT exists. FSP 604 identifies and initializes any XPU devices attached to the processor (block 722). For example, FSP 604 may identify the type of XPU associated with a PCIE port and perform various calls to initialize a device attached to the processor (e.g., a display attached to the GPU). An example detailed process is shown in FIG. 8.
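The check performed at block 720 can be sketched as a scan over the address array that skips the invalid default value. This is an illustrative sketch only: `AttachedXpu` and `find_attached_xpus` are hypothetical names, the assumed field widths follow conventional PCI numbering, and applying a single configuration pointer to every detected device is a simplifying assumption.

```python
from typing import NamedTuple, Optional

INVALID_XPU_ADDRESS = 0x0  # default UPD value, identified as an invalid address


class AttachedXpu(NamedTuple):
    bus: int
    device: int
    function: int
    has_config_table: bool  # True when the configuration pointer is non-NULL


def find_attached_xpus(xpu_addresses: list[int],
                       config_ptr: Optional[int]) -> list[AttachedXpu]:
    """Return the devices described by non-zero entries of the address array."""
    found = []
    for address in xpu_addresses:
        if address == INVALID_XPU_ADDRESS:
            continue  # no device attached for this entry
        found.append(AttachedXpu(
            bus=(address >> 16) & 0xFF,      # assumed 8-bit bus field
            device=(address >> 11) & 0x1F,   # assumed 5-bit device field
            function=(address >> 8) & 0x7,   # assumed 3-bit function field
            has_config_table=bool(config_ptr),
        ))
    return found
```

With the example array {0x00FE0000, 0x00, 0x00} and a NULL configuration pointer, the sketch reports a single device on bus 0xFE with no configuration table.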

Control exits FSP-S 606 operation (block 724). Upon exit, the display attached to the device (e.g., the DGPU) will have been initialized. The example boot loader 602 performs PCI enumeration and resource allocation for PCI/PCI-E devices (block 726). For example, in addition to a Dash-G device, resource allocation may be based on looking at the already implemented base address registers (BARs) and the enabled MMIO/IO address space. FSP 604 then communicates VBT information to the OS (block 728). For example, FSP 604 may create a DGPU GFX ACPI operation region to pass VBT information for the GPU driver to the OS.

The boot loader 602 then invokes NotifyPhase (block 730). For example, boot loader 602 may invoke NotifyPhase before handing over to the payload. Control is transferred to the boot loader 602 to render a pre-OS logo, a UEFI setup screen, or an OS start screen (block 732). Process 700 then ends as the OS boots.

Since the FSP is designated to perform initialization of the XPU device, the initialization sequence can be divided into two parts: (1) a static DG initialization procedure performed as part of the boot services inside the FSP 604, and (2) creation of oneAPI library functions for accessing XPU hardware resources, in which a set of library functions for communicating with the XPU hardware is made available as part of the FSP runtime services so that the different OS stacks do not require a dedicated OS driver for communicating with the XPU hardware. For example, the API 108 of FIG. 1 may include a oneAPI library for accessing XPU hardware resources.

FIG. 8 is a flowchart illustrating an example detailed unified FSP initialization flow with an integrated graphics device (IGD) and a GPU.

The machine-readable instructions and/or operations 800 of FIG. 8 begin at block 802, where FSP-S reads the UPD IADGpuAddress. FSP-S determines whether a discrete graphics processing unit (DGPU) is present (block 804). If the DGPU is not present, initialization of the integrated graphics processing unit (IGPU) is performed by obtaining the IGD VBT PTR (block 806), reading the GFX MMIO base address (block 808), reading the child device configuration (block 810), and reading the GFX frame buffer address (block 812). Control then passes to block 830, described below.

If FSP-S determines that a DGPU is present (block 804), FSP-S performs initialization of the DGPU as follows. FSP-S obtains the PCI location (block 814) and obtains the DGPU VBT PTR (block 816). FSP-S reads the GFX MMIO base address (block 818) and reads the child device configuration (block 820). FSP-S reads the device identifier (DID) and compares it to a list of supported DIDs (block 822). If the DID is invalid (i.e., not supported) (block 824), no display is presented (block 826), and control returns to block 802. If the DID is valid, FSP-S reads the GFX frame buffer address (block 828) and control proceeds to block 830.

After initializing the IGD (blocks 806-812) or the DGPU (blocks 814-828), the FSP-S reads the values from the GT driver mailbox (block 830). The FSP-S then initializes the video memory variables (block 832) and programs the GTT (e.g., sets the maximum voltage, programs the CD CLK, etc.) (block 834). The FSP-S performs watermark initialization (block 836). Then, to reach the attached display, the FSP-S enumerates the supported displays and executes a display timing algorithm (block 838). Finally, the FSP-S programs the phase locked loop (PLL) (block 840) and brings the display online (block 842). The process of FIG. 8 then ends.
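The branch structure of FIG. 8 can be sketched as follows. This is a minimal illustration only, not FSP code: the `Platform` type, the `SUPPORTED_DIDS` values, and the step strings are hypothetical, and the sketch returns on an unsupported DID instead of looping back to block 802.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical supported device identifiers (DIDs); the values are placeholders.
SUPPORTED_DIDS = {0x1234, 0x5678}


@dataclass
class Platform:
    dgpu_present: bool
    dgpu_did: int = 0


def fsp_s_graphics_init(platform: Platform) -> List[str]:
    """Return the ordered steps taken, following blocks 802-842 of FIG. 8."""
    steps = ["read UPD IADGpuAddress"]                    # block 802
    if not platform.dgpu_present:                         # block 804
        steps += [
            "get IGD VBT PTR",                            # block 806
            "read MMIO base address",                     # block 808
            "read child device configuration",            # block 810
            "read GFX frame buffer address",              # block 812
        ]
    else:
        steps += [
            "get PCI location",                           # block 814
            "get DGPU VBT PTR",                           # block 816
            "read GFX MMIO base address",                 # block 818
            "read child device configuration",            # block 820
            "read device identifier",                     # block 822
        ]
        if platform.dgpu_did not in SUPPORTED_DIDS:       # block 824
            steps.append("no display")                    # block 826
            return steps                                  # sketch stops here
        steps.append("read GFX frame buffer address")     # block 828
    steps += [
        "read GT driver mailbox",                         # block 830
        "initialize video memory variables",              # block 832
        "program GTT",                                    # block 834
        "initialize watermarks",                          # block 836
        "enumerate displays and run timing algorithm",    # block 838
        "program PLL",                                    # block 840
        "bring display online",                           # block 842
    ]
    return steps
```

Both paths converge on the shared tail (blocks 830-842), which is why the sketch appends those steps once after the branch.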

From the foregoing, it will be apparent that example systems, methods, apparatus, and articles of manufacture have been disclosed for symbiotic boot between heterogeneous processors. The disclosed systems, methods, apparatus, and articles of manufacture improve the efficiency of using a computing device by sharing memory resources such as SPI flash, which reduces BoM cost and boot time. Moving XPU initialization into the FSP encapsulates XPU silicon initialization, which protects intellectual property and maintains the security of the boot process while allowing shared utilization of memory (e.g., memory storing the IFWI). The use of unified firmware and software modules results in a smaller footprint and an optimized verified boot for heterogeneous processors. The disclosed examples also support a unified firmware flash layout between the CPU and other processing units to allow field firmware updates (e.g., for DG motherboard-down solutions).

Infrastructure processing unit resource director technology

Apparatus, articles of manufacture, and methods for implementing infrastructure processing unit resource director technology (IPURMS) are disclosed. The example IPURMS provides decentralized peer-to-peer IPU resource negotiation and management without CPU-centric involvement to facilitate low-latency micro-services and workloads, such as VRAN and the like. In addition, IPURMS provides application-aware resource management, in which IPUs can dynamically renegotiate RMS SLAs for various micro-services at runtime. IPURMS may also facilitate IPU P2P negotiation and resource management that is tracked via a decentralized distributed public ledger, such as a blockchain with revocation capability (e.g., revocation management), to track/record telemetry with auditability. Furthermore, IPURMS may facilitate an IPU that is divided into two parts: i) a data plane, and ii) a control plane. The control plane handles resource allocation, monitoring, and policy enforcement, and the data plane handles data flow between the IPU and a logical unit associated with the IPU.

FIG. 9 is a block diagram of an example architecture 900 for an IPURMS. According to the example shown in FIG. 9, a new workload (or VM) 902 communicates with an example coordinator 904 to request a system with a particular SLA. The example architecture 900 includes the coordinator 904, an example user space 908, an example XPU/IPU software domain 910, and an example IPU hardware domain 912.

The example coordinator 904 is a server circuit that negotiates with existing workloads to place the new workload on computing resources based on the SLA. The example coordinator 904 communicates with one or more computing systems 906 to manage assignment of workloads to computing resources.

The example computing system 906 is represented by several abstractions, including the user space 908, the XPU/IPU software domain 910, and the IPU hardware domain 912. The example user space 908 includes application A 914 and application B 916, but may include any number or type of applications. The example user space 908 is monitored by the coordinator 904.

The example XPU/IPU software domain 910 includes an example RMS exposure 918 monitored by an example SLA manager 920. The example RMS exposure 918 facilitates communication of application level information with the coordinator 904.

The example IPU hardware domain 912 includes an example XPU/IPU resource monitor 922 monitored by an example SLA manager 924, an example XPU/IPU resource implementation 926 monitored by an example SLA manager 928, and a Punit RMS 930.

The example XPU/IPU resource monitor 922 provides resource feedback to the example RMS exposure 918, while the example XPU/IPU resource monitor 922 and the example XPU/IPU resource implementation 926 communicate with respect to hardware policies. The example RMS exposure 918 communicates QoS hints to the example XPU/IPU resource implementation 926, and the example XPU/IPU resource implementation 926 communicates with the Punit RMS 930 regarding QoS hardware features. The example architecture 900 facilitates transitioning from CPU-centric single-node resource management to scalable self-managed XPU/IPU that can work in concert with peers. The consensus in such collaborative resource management may be implemented via a centralized trust proxy, a decentralized public ledger blockchain as shown in fig. 13, and so on.

Flowcharts representative of example hardware logic circuits, machine readable instructions, hardware-implemented state machines, and/or any combination thereof to implement the example architecture 900 are shown in FIGS. 10 and 11. The machine-readable instructions may be one or more executable programs, or portion(s) of an executable program, for execution by a processor circuit, such as the processor circuit 1612 shown in the example processor platform 1600 discussed below in connection with fig. 16 and/or the example processor circuit discussed below in connection with fig. 48 and/or fig. 49. The program may be embodied in software stored on one or more non-transitory computer readable storage media, such as a Compact Disc (CD), a floppy disk, a Hard Disk Drive (HDD), a Solid State Drive (SSD), a Digital Versatile Disc (DVD), a blu-ray disc, volatile memory (e.g., any type of Random Access Memory (RAM), etc.), or non-volatile memory (e.g., electrically erasable programmable read-only memory (EEPROM), flash memory, HDD, SSD, etc.), associated with processor circuitry located in one or more hardware devices, but the entire program and/or a portion thereof may be executed by one or more hardware devices other than processor circuitry and/or embodied in firmware or dedicated hardware. The machine-readable instructions may be distributed across multiple hardware devices and/or executed by two or more hardware devices (e.g., a server and a client hardware device). For example, the client hardware device may be implemented by an endpoint client hardware device (e.g., a hardware device associated with a user) or an intermediary client hardware device (e.g., a Radio Access Network (RAN) gateway that may facilitate communications between the server and the endpoint client hardware device). Similarly, the non-transitory computer readable storage medium may include one or more media located in one or more hardware devices.
In addition, while the example program is described with reference to the flowcharts shown in FIGS. 10 and 11, many other methods of implementing the example architecture 900 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the respective operations without executing software or firmware. The processor circuits may be distributed in different network locations and/or local to one or more hardware devices (e.g., a single core processor (e.g., a single core Central Processing Unit (CPU)), a multi-core processor (e.g., a multi-core CPU), etc.), multiple processors distributed across multiple servers of a server rack, multiple processors distributed across one or more server racks, CPUs and/or FPGAs located in the same package (e.g., the same Integrated Circuit (IC) package or in two or more separate housings, etc.).

Fig. 10 is a flow diagram representing example machine readable instructions and/or example operations 1000 executable and/or instantiated by a processor circuit for performing a configuration using IPURMS.

The machine-readable instructions and/or operations 1000 of FIG. 10 begin at block 1002, where the example coordinator 904 detects a new instance/application (e.g., workload 902) to be run in a heterogeneous IPU-based data center platform, along with a resource and migration tolerance SLA. For example, when a new instance/application is created (e.g., using an SLA template), resource requirements and tolerances may be established by the user/manager. The coordinator 904 determines whether verification of the device and resource requirements is successful (block 1004). For example, the resource requirements may be analyzed to determine whether they are viable within the constraints of the computing system. If the resource requirements are not valid and/or cannot be met by the computing system, the coordinator 904 returns control to block 1002.

If the resource requirements are valid (block 1004), the coordinator 904 negotiates with the IPU control plane to identify resources for executing the new instance/application (block 1006). For example, based on the type of hardware resource specified in the request (e.g., CPU, GPU, FPGA and SSD), a set of IPUs corresponding to the specified resource is selected. Negotiation between the new request and the existing application in the IPU is then started. For example, the negotiations may include making policy-based decisions using the identified resource tolerance thresholds and dynamically migrating existing workloads between IPUs to efficiently utilize all resources. Each IPU may include two parts, i) a data plane, and ii) a control plane. The control plane handles resource allocation, monitoring, and policy enforcement, and the data plane handles data flow between the IPU and the logic units associated with the IPU. An example process for negotiating is described in connection with fig. 11.

The coordinator 904 determines if the negotiation was successful (block 1008). For example, if the coordinator is able to find the necessary resources within the set of IPUs, it may be determined that the negotiation was successful. For example, in one scenario, an existing application continues to run on a given IPU, but there are additional resources free for a new application to run. In another scenario, the coordinator 904 negotiates with an existing application and arranges that the application be migrated to a different set of IPUs to release resources to the new instance/application.

If the negotiation is unsuccessful (block 1008), control returns to block 1002 so that the coordinator 904 looks for a different set of IPUs meeting the resource requirements.

If the negotiation is successful (block 1008), the coordinator 904 configures the IPU/XPU resource monitoring and enforcement in the IPU control plane (block 1010). The coordinator 904 then configures hardware resources on the IPU-based datacenter platform(s) for the new instance/application (block 1012). Thus, the negotiation process between IPUs may enable data center level cross-domain coordinated resource management.
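The validate, negotiate, and configure sequence of FIG. 10 can be sketched as follows. This is an illustrative sketch under simplifying assumptions: the `Ipu` type, the integer capacity units, and the "most free capacity" selection rule are hypothetical, and workload migration is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Ipu:
    name: str
    resource_type: str  # e.g. "CPU", "GPU", "FPGA", "SSD"
    capacity: int       # abstract resource units (assumption)
    used: int = 0

    @property
    def free(self) -> int:
        return self.capacity - self.used


def validate(request: Dict[str, int], ipus: List[Ipu]) -> bool:
    """Block 1004: each amount must be positive and within total capacity."""
    for rtype, amount in request.items():
        total = sum(i.capacity for i in ipus if i.resource_type == rtype)
        if amount <= 0 or amount > total:
            return False
    return True


def negotiate(request: Dict[str, int], ipus: List[Ipu]) -> Optional[Dict[str, Ipu]]:
    """Blocks 1006-1008: pick, per resource type, an IPU with enough free units."""
    chosen: Dict[str, Ipu] = {}
    for rtype, amount in request.items():
        fits = [i for i in ipus if i.resource_type == rtype and i.free >= amount]
        if not fits:
            return None  # negotiation failed for this resource type
        chosen[rtype] = max(fits, key=lambda i: i.free)
    return chosen


def configure(request: Dict[str, int], ipus: List[Ipu]) -> Optional[Dict[str, str]]:
    """Run blocks 1004-1012; return the resource-to-IPU assignment or None."""
    if not validate(request, ipus):
        return None
    chosen = negotiate(request, ipus)
    if chosen is None:
        return None
    for rtype, ipu in chosen.items():  # blocks 1010-1012: commit the allocation
        ipu.used += request[rtype]
    return {rtype: ipu.name for rtype, ipu in chosen.items()}
```

Nothing is committed until every requested resource type has been matched, so a failed negotiation leaves the existing allocations untouched.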

FIG. 11 is a flow diagram representing example machine readable instructions and/or example operations 1100 executable and/or instantiated by a processor circuit to negotiate to dynamically configure resources based on tolerances specified by an application and available IPU resources.

The machine-readable instructions and/or operations 1100 of FIG. 11 begin at block 1102, where the coordinator 904 detects that a user has launched a new instance/application (e.g., VM, application, etc.). For example, the request may identify QoS parameters, SLA requirements, and so forth. For example, the QoS parameters may be expressed as QoS = func(DEVICE-REQS, FREQUENCY, CACHE, MEM-BW, POWER, IPC, CORES, STORAGE, MIGRATION-POLICY). Specifying SLA parameters enables specification of hardware resources (e.g., CPUs, GPUs, FPGAs, SSDs, and individual IPUs) within the data center. An example SLA template is specified as:

1. CPU:

A. Frequency range

B. Memory bandwidth range

C. Cache size range

D. TDP range

E. Core count range

F. Migration tolerance

G. Xeon IPC range

2. SSD storage space range

3. GPU core range

4. FPGA

5. PCIe generation requirements

6. IPU control plane management

h. Network bandwidth range

i. Queue prioritization
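One way to hold such a template in software is a small set of typed records. The following is a sketch only: every class name, field name, and unit (GHz, GB/s, MB, W, GB) is an assumption made for illustration and is not specified by the template above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Range = Tuple[float, float]  # (minimum, maximum), both inclusive


@dataclass
class CpuSla:
    frequency_ghz: Range
    memory_bandwidth_gbs: Range
    cache_mb: Range
    tdp_watts: Range
    core_count: Range
    migration_tolerant: bool
    ipc: Range


@dataclass
class IpuControlPlaneSla:
    network_bandwidth_gbps: Range
    queue_priority: int


@dataclass
class SlaTemplate:
    cpu: CpuSla
    ssd_storage_gb: Optional[Range] = None
    gpu_cores: Optional[Range] = None
    fpga_required: bool = False
    pcie_generation_min: Optional[int] = None
    ipu_control_plane: Optional[IpuControlPlaneSla] = None


def range_ok(bounds: Range) -> bool:
    """A range is well formed when 0 <= minimum <= maximum."""
    low, high = bounds
    return 0 <= low <= high


def template_ok(t: SlaTemplate) -> bool:
    """Check every range that is present in the template."""
    ranges = [t.cpu.frequency_ghz, t.cpu.memory_bandwidth_gbs, t.cpu.cache_mb,
              t.cpu.tdp_watts, t.cpu.core_count, t.cpu.ipc]
    ranges += [r for r in (t.ssd_storage_gb, t.gpu_cores) if r is not None]
    if t.ipu_control_plane is not None:
        ranges.append(t.ipu_control_plane.network_bandwidth_gbps)
    return all(range_ok(r) for r in ranges)
```

A check like `template_ok` is the kind of test a coordinator could apply when it verifies the validity of a request before any negotiation starts.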

The coordinator 904 verifies the validity of the request (block 1104). If the request is not valid, the user is prompted to provide a valid request and control returns to block 1102. If the request is valid (block 1104), the coordinator 904 determines availability of computing resources (block 1106). If there are no available computing resources (e.g., IPU resources) to negotiate, control returns to block 1102.

If computing resources are available for negotiation (block 1106), the coordinator 904 begins negotiating with the existing instances/applications executing on the IPU and determines whether the negotiation is successful (block 1108). For example, negotiation may involve identifying existing applications on the IPU that can tolerate reduced resources, so that resources are released to the new instance/application. Alternatively, the negotiation may identify applications that can be migrated to other resources to release the selected resources to the new instance/application. If the negotiation fails to release resources to the new instance/application, control returns to block 1106 to identify a different resource.

If the negotiation successfully identifies available resources for execution of the new instance/application (block 1108), the coordinator 904 determines whether any existing instances/applications are to be migrated to free the resources (block 1110). If there are existing instances/applications to be migrated, control returns to block 1106 to manage negotiation and assignment for the existing instances/applications.

If no existing instance/application is to be migrated (block 1110), the coordinator 904 updates the resource allocator (e.g., the Class of Service (CloS) for the existing instances/applications) (block 1112). The coordinator 904 then starts the requested instance/application (e.g., workload 902) with the negotiated set of IPUs (block 1114).
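The two negotiation strategies described for FIG. 11 (reducing the allocation of tolerant applications, then migrating applications whose SLA permits it) can be sketched as a planning function. The `App` type, the integer resource units, and the "shrink first, then migrate" ordering are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class App:
    name: str
    allocated: int    # units currently held on this IPU
    minimum: int      # lowest allocation the application's SLA tolerates
    migratable: bool  # whether the SLA allows migration to another IPU


def negotiate(capacity: int, apps: List[App],
              needed: int) -> Optional[List[Tuple[str, str, int]]]:
    """Return (action, app, units) steps that free `needed` units, or None."""
    free = capacity - sum(a.allocated for a in apps)
    plan: List[Tuple[str, str, int]] = []
    shrunk: Dict[str, int] = {}

    # Strategy 1: take slack from applications that tolerate fewer resources.
    for app in apps:
        if free >= needed:
            break
        take = min(app.allocated - app.minimum, needed - free)
        if take > 0:
            plan.append(("shrink", app.name, take))
            shrunk[app.name] = take
            free += take

    # Strategy 2: migrate applications whose SLA permits it.
    for app in apps:
        if free >= needed:
            break
        if app.migratable:
            remaining = app.allocated - shrunk.get(app.name, 0)
            plan.append(("migrate", app.name, remaining))
            free += remaining

    return plan if free >= needed else None
```

The function only builds a plan and does not modify the applications, which mirrors the separation between negotiating (block 1108) and updating the resource allocator (block 1112).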

FIG. 12 illustrates an example environment 1200 in which resources managed by an IPU 1202 (or any type of processing unit, e.g., XPU, GPU, etc.) are in various free and busy states across a CPU 1204, a GPU 1206, SSD storage 1208, etc. According to the illustrated example, application 1 is utilizing a portion of the CPU 1204, the GPU 1206, and the SSD storage 1208, application 2 is utilizing a portion of the CPU 1204 and the GPU 1206, and application 3 is utilizing a portion of the CPU 1204 and the SSD storage 1208.

FIG. 13 illustrates an example environment 1300 in which consensus in collaborative resource management is implemented via a decentralized public blockchain ledger. As shown in FIG. 13, the operating states of several IPUs, IPU 1 to IPU N (e.g., state S₁, state S₂, ..., state S_N), are recorded in the blockchain. Thus, each block in the blockchain (e.g., blocks B₁ to B_N) may store state information that may be utilized for peer-to-peer resource negotiation. Utilizing such a blockchain facilitates distributed collection of trusted information, so that the blockchain operates effectively as a trust proxy. Although FIG. 9 illustrates a single centralized coordinator 904, a blockchain or other decentralization technique may be utilized to facilitate a decentralized coordinator that manages resources through the control plane portion of the IPU. In this decentralized approach, resource management may be tracked via a decentralized public ledger with revocation capability to track/record auditable telemetry. Thus, for devices associated with an IPU, the IPU 1202 may be considered to have computer resources and to manage intellectual property (IP). The control plane of the IPU hosts a decentralized coordinator that handles resource allocation, monitoring, and policy enforcement.
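The auditability property of such a ledger comes from chaining each block to the hash of its predecessor. The following single-process sketch shows only that chaining and its verification; it does not model consensus between peers or revocation, and the `Ledger` class and the payload layout are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder previous-hash for the first block


def block_hash(index: int, prev_hash: str, payload: Dict[str, Any]) -> str:
    """Hash the block contents deterministically with SHA-256."""
    body = json.dumps({"index": index, "prev": prev_hash, "payload": payload},
                      sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class Ledger:
    """Append-only chain of IPU state records."""

    def __init__(self) -> None:
        self.blocks: List[Dict[str, Any]] = []

    def append(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        index = len(self.blocks)
        prev = self.blocks[-1]["hash"] if self.blocks else GENESIS
        block = {"index": index, "prev": prev, "payload": payload,
                 "hash": block_hash(index, prev, payload)}
        self.blocks.append(block)
        return block

    def verify(self) -> bool:
        """Recompute every hash and link; any edited block breaks the chain."""
        prev = GENESIS
        for i, block in enumerate(self.blocks):
            expected = block_hash(i, prev, block["payload"])
            if block["index"] != i or block["prev"] != prev or block["hash"] != expected:
                return False
            prev = block["hash"]
        return True
```

Because each hash covers the previous hash, changing a recorded state after the fact invalidates that block and every block after it.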

From the foregoing, it will be apparent that example systems, methods, apparatus, and articles of manufacture have been disclosed for managing assignment of resources in systems utilizing IPUs. The disclosed systems, methods, apparatus, and articles of manufacture improve the efficiency of using computing devices by improving IPU and component resource utilization, manageability with auditability, and secure metering, thereby reducing the total cost of ownership. The disclosed examples facilitate fine-grained resource monitoring and manageability among IPUs in a very large scale data center. Providing application-negotiable resource monitoring and management allows dynamic prioritization to provide deterministic performance for large scale micro-services.

Dynamic negotiable deep neural network

Some neural network systems attempt to detect the underlying target hardware capabilities to speed up inference/training performance. For example, JIT code generation may be utilized to attempt to select an Instruction Set Architecture (ISA) or a mix of ISAs based on a detected target hardware feature of a computing environment. Even though this abstraction provides the ability to utilize the underlying hardware capabilities, it has drawbacks.

The apparatus, articles of manufacture, and methods disclosed herein provide a dynamically negotiable deep neural network solution. This approach facilitates utilization of hardware resources, particularly in situations where there are a large number of possible features (e.g., single instruction, multiple data (SIMD) features, deep learning boost features, etc.). The disclosed dynamic negotiable deep neural network stack involves a configurable and negotiable interface, implemented in the API 108 of FIG. 1, for specifying an SLA. A candidate set of features may be filtered from the set of available implementations, and JIT kernels may be dynamically generated for the candidate set of hardware features. The disclosed dynamically negotiable deep neural network stack may dry run these kernels one by one to pick the kernel with the best performance and cache it for subsequent use.

FIG. 14 is a block diagram of an example dynamic negotiable deep neural network library 1400. For example, the dynamic negotiable deep neural network library 1400 may be added to the API 108 of the architecture 100 of FIG. 1. The example dynamic negotiable deep neural network library 1400 includes an example configurable user interface 1402, an example platform capability manager 1404, an example application SLA manager 1406, an example JIT manager 1408, and an example kernel evaluation engine 1410.

The example configurable user interface 1402 provides a user interface (e.g., via the oneAPI stack of the architecture 100) to provision the middleware/framework to configure the SLA associated with the operation. For example, the user interface 1402 may be a graphical user interface, a text interface, an API, or the like.

The example platform capability manager 1404 identifies target hardware capabilities. The platform capability manager 1404 also works with the configurable user interface 1402 to configure JIT kernels for an application.

The example application SLA manager 1406 collects and enforces SLAs provided via the configurable user interface 1402.

The example JIT manager 1408 generates and manages dynamic JIT kernels based on specified SLAs and in conjunction with bare metal/VM heuristics observed in the past.

The example kernel evaluation engine 1410 provides the ability to perform fused sandbox evaluation of newly generated kernels/operations prior to large-scale deployment.

A flowchart representative of example hardware logic circuits, machine readable instructions, hardware implemented state machines, and/or any combination of these for implementing the dynamic negotiable deep neural network 1400 of FIG. 14 is shown in FIG. 15. The machine-readable instructions may be one or more executable programs, or portion(s) of an executable program, for execution by a processor circuit, such as the processor circuit 1612 shown in the example processor platform 1600 discussed below in connection with fig. 16 and/or the example processor circuit discussed below in connection with fig. 48 and/or fig. 49. The program may be embodied in software stored on one or more non-transitory computer readable storage media, such as a Compact Disc (CD), a floppy disk, a Hard Disk Drive (HDD), a Solid State Drive (SSD), a Digital Versatile Disc (DVD), a blu-ray disc, volatile memory (e.g., any type of Random Access Memory (RAM), etc.), or non-volatile memory (e.g., electrically erasable programmable read-only memory (EEPROM), flash memory, HDD, SSD, etc.), associated with processor circuitry located in one or more hardware devices, but the entire program and/or a portion thereof may be executed by one or more hardware devices other than processor circuitry and/or embodied in firmware or dedicated hardware. The machine-readable instructions may be distributed across multiple hardware devices and/or executed by two or more hardware devices (e.g., a server and a client hardware device). For example, the client hardware device may be implemented by an endpoint client hardware device (e.g., a hardware device associated with a user) or an intermediary client hardware device (e.g., a Radio Access Network (RAN) gateway that may facilitate communications between the server and the endpoint client hardware device). Similarly, the non-transitory computer readable storage medium may include one or more media located in one or more hardware devices.
Additionally, while the example program is described with reference to the flowchart shown in FIG. 15, many other methods of implementing the example dynamic negotiable deep neural network 1400 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the respective operations without executing software or firmware. The processor circuits may be distributed in different network locations and/or local to one or more hardware devices (e.g., a single core processor (e.g., a single core Central Processing Unit (CPU)), a multi-core processor (e.g., a multi-core CPU), etc.), multiple processors distributed across multiple servers of a server rack, multiple processors distributed across one or more server racks, CPUs and/or FPGAs located in the same package (e.g., the same Integrated Circuit (IC) package or in two or more separate housings, etc.).

FIG. 15 is a flowchart representative of example machine readable instructions and/or example operations 1500 that may be executed and/or instantiated by the processor circuit to select features for deep neural network learning based on hardware capabilities.

The machine-readable instructions and/or operations 1500 of FIG. 15 begin at block 1502, where the example configurable user interface 1402 obtains an operation description (e.g., instructions and SLA information entered by a user). The example application SLA manager 1406 obtains the SLA criteria for the current configuration (block 1504). The example platform capability manager 1404 selects a candidate configuration (e.g., a primitive descriptor) based on the target hardware capability (block 1506). For example, the platform capability manager 1404 may select candidates that are successfully created from the implementation set based on platform information and SLA criteria.

The example JIT manager 1408 generates a kernel corresponding to the selected candidate (block 1508). For example, the JIT manager 1408 may generate kernels for each candidate in the candidate set one by one. The example kernel evaluation engine 1410 then performs a dry run/test run of the kernel and gathers performance information (block 1510). For example, where multiple kernels are generated one by one by the JIT manager 1408, the example kernel evaluation engine 1410 may perform a test run of each kernel and collect performance results to facilitate performance-based kernel selection (e.g., selecting the kernel with the best performance). For example, the kernel evaluation engine 1410 may cache the kernel with the best performance.

The example application SLA manager 1406 then determines whether the selected kernel satisfies the requested SLA in the sandbox configuration based on the configuration policy (block 1512). If the SLA is not met, control returns to block 1508 to attempt to generate another kernel that may meet the SLA. If the application SLA manager 1406 determines that the SLA is satisfied, process 1500 ends and the appropriate kernel has been selected for operation.
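The dry-run, select, and cache steps of FIG. 15 can be sketched as follows. This is a simplified illustration, not the disclosed implementation: kernels are modeled as plain Python callables, the SLA is reduced to a single hypothetical latency bound (`max_seconds`), and the sketch returns `None` when the bound is missed instead of returning to block 1508 to generate another kernel.

```python
import time
from typing import Callable, Dict, List, Optional

Kernel = Callable[[List[float]], object]


def dry_run(kernel: Kernel, sample: List[float], repeats: int = 3) -> float:
    """Time a kernel on sample data; return the best of several runs in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(sample)
        best = min(best, time.perf_counter() - start)
    return best


def select_kernel(op_key: str,
                  candidates: Dict[str, Kernel],
                  sample: List[float],
                  max_seconds: float,
                  cache: Dict[str, str],
                  measure: Callable[[Kernel, List[float]], float] = dry_run
                  ) -> Optional[str]:
    """Return the name of the fastest candidate that meets the latency bound."""
    if op_key in cache:                 # reuse an earlier selection
        return cache[op_key]
    if not candidates:
        return None
    # Block 1510: dry run every candidate and record its measured time.
    timings = {name: measure(kernel, sample) for name, kernel in candidates.items()}
    best = min(timings, key=lambda name: timings[name])
    # Block 1512: check the selected kernel against the (simplified) SLA.
    if timings[best] > max_seconds:
        return None
    cache[op_key] = best
    return best
```

The `measure` parameter exists so that the timing source can be replaced, for example with recorded measurements, which keeps the selection logic testable without depending on wall-clock noise.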

In some implementations, the process 1500 may detect the ISA capabilities of a CPU or another processing unit and generate a queue of all implementations of an operation. For example, the following is an example queue for the FP32 data type and convolution operations:

{{forward, f32, f32, f32}, {
    CPU_INSTANCE_X64(jit_avx512_common_dw_convolution_fwd_t)
    CPU_INSTANCE_X64(jit_avx512_common_1x1_convolution_fwd_f32_t)
    CPU_INSTANCE_X64(jit_avx512_core_f32_wino_conv_2x3_fwd_t)
    CPU_INSTANCE_X64(jit_avx512_core_f32_wino_conv_4x3_fwd_t)
    CPU_INSTANCE_X64(jit_avx512_common_convolution_winograd_fwd_t)
    CPU_INSTANCE_X64(jit_avx512_common_convolution_fwd_t<f32>)
    CPU_INSTANCE_X64(jit_avx2_dw_convolution_fwd_t)
    CPU_INSTANCE_X64(jit_avx2_1x1_convolution_fwd_t)
    CPU_INSTANCE_X64(jit_sse41_dw_convolution_fwd_t)
    CPU_INSTANCE_X64(jit_sse41_1x1_convolution_fwd_t)
    CPU_INSTANCE_X64(jit_avx2_convolution_fwd_t)
    CPU_INSTANCE_X64(jit_sse41_convolution_fwd_t)
    CPU_INSTANCE(gemm_convolution_fwd_t)
    CPU_INSTANCE(ref_convolution_fwd_t<f32>)
    CPU_INSTANCE(ref_fused_convolution_fwd_t)
    nullptr,
}},

The example process 1500 may attempt to instantiate each primitive descriptor in the implementation queue. The platform capability manager 1404 may select all successfully instantiated primitive descriptors as candidates for the next layer based on the application/middleware SLA and the target hardware platform capability. The JIT manager 1408 may then generate the JIT kernel corresponding to each primitive descriptor candidate and save it to the JIT kernel candidate queue. The example kernel evaluation engine 1410 dry runs each kernel from the JIT kernel candidate queue on the current platform, reports performance data, selects a JIT kernel based on performance (e.g., selects the JIT kernel with the best throughput), and caches it for subsequent use.
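The step of attempting to instantiate each entry of the implementation queue and keeping the successes can be sketched as a filter over an ordered list. The sketch below uses a shortened subset of the queue above, and the ISA requirement attached to each entry is an assumption inferred from the entry's name, not something the queue itself states.

```python
from typing import List, Optional, Set, Tuple

# (implementation name, required ISA or None for portable fallbacks).
# The ISA column is an assumption inferred from each name's prefix.
IMPL_QUEUE: List[Tuple[str, Optional[str]]] = [
    ("jit_avx512_common_convolution_fwd_t", "avx512"),
    ("jit_avx2_convolution_fwd_t", "avx2"),
    ("jit_sse41_convolution_fwd_t", "sse41"),
    ("gemm_convolution_fwd_t", None),
    ("ref_convolution_fwd_t", None),
]


def instantiable(required_isa: Optional[str], supported: Set[str]) -> bool:
    """An entry can be created if it needs no ISA or the platform has that ISA."""
    return required_isa is None or required_isa in supported


def candidate_queue(supported: Set[str]) -> List[str]:
    """Keep queue order while dropping entries the platform cannot instantiate."""
    return [name for name, isa in IMPL_QUEUE if instantiable(isa, supported)]
```

A selector that simply takes the first entry of this list would pick by queue position; evaluating every entry of the candidate queue is what lets the selection be based on measured performance instead.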

In some examples, the proposed method provides about a 10% improvement in performance over existing methods (e.g., methods that select the first JIT kernel that meets SLA requirements).

FIG. 16 is a block diagram of an example processor platform 1600 configured to execute and/or instantiate the machine-readable instructions and/or operations of one or more of FIG. 5, FIG. 7A, FIG. 7B, FIG. 8, FIG. 10, FIG. 11, and/or FIG. 15 to implement the architectures 100, 200, 300, the BIOS 600, and/or the dynamic negotiable deep neural network 1400. The processor platform 1600 may be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cellular telephone, a smart phone, a tablet device such as an iPad™), a headset (e.g., an augmented reality (AR) headset, a virtual reality (VR) headset, etc.) or other wearable device, or any other type of computing device.

The processor platform 1600 of the illustrated example includes a processor circuit 1612. The processor circuit 1612 of the illustrated example is hardware. For example, the processor circuit 1612 may be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The processor circuit 1612 may be implemented by one or more semiconductor-based (e.g., silicon-based) devices.

The processor circuit 1612 of the illustrated example includes a local memory 1613 (e.g., cache, registers, etc.). The processor circuit 1612 of the illustrated example communicates with a main memory, including a volatile memory 1614 and a non-volatile memory 1616, over a bus 1618. The volatile memory 1614 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM), and/or any other type of RAM device. The non-volatile memory 1616 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 1614, 1616 of the illustrated example is controlled by a memory controller 1617.

The processor platform 1600 of the illustrated example also includes interface circuitry 1620. The interface circuitry 1620 may be implemented in hardware according to any type of interface standard, such as an Ethernet interface, a universal serial bus (USB) interface, a Bluetooth interface, a near field communication (NFC) interface, a peripheral component interconnect (PCI) interface, and/or a peripheral component interconnect express (PCIe) interface.

In the illustrated example, one or more input devices 1622 are connected to the interface circuitry 1620. Input device(s) 1622 allow a user to input data and/or commands into processor circuit 1612. Input device(s) 1622 may be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, buttons, a mouse, a touch screen, a touch pad, a trackball, an isopoint device, and/or a speech recognition system.

One or more output devices 1624 are also connected to the interface circuitry 1620 in the illustrated example. Output device(s) 1624 may be implemented, for example, by a display device (e.g., a light emitting diode (light emitting diode, LED), an organic light emitting diode (organic light emitting diode, OLED), a liquid crystal display (liquid crystal display, LCD), a Cathode Ray Tube (CRT) display, an in-plane switching (IPS) display, a touch screen, etc.), a haptic output device, a printer, and/or speakers. The interface circuitry 1620 of the illustrated example thus generally includes a graphics driver card, a graphics driver chip, and/or graphics processor circuitry, such as a GPU.

The interface circuitry 1620 of the illustrated example also includes communication devices such as transmitters, receivers, transceivers, modems, residential gateways, wireless access points, and/or network interfaces to facilitate the exchange of data with external machines (e.g., any kind of computing device) via a network 1626. The communication may be through, for example, an ethernet connection, a digital subscriber line (digital subscriber line, DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-to-line wireless system, a cellular telephone system, an optical connection, etc.

The processor platform 1600 of the illustrated example also includes one or more mass storage devices 1628 to store software and/or data. Examples of such mass storage devices 1628 include magnetic storage devices, optical storage devices, floppy disk drives, HDDs, CDs, Blu-ray disc drives, redundant array of independent disks (RAID) systems, solid-state storage devices (such as flash memory devices and/or SSDs), and DVD drives.

The machine-executable instructions 1632, which may be implemented by the machine-readable instructions of fig. 5, 7A, 7B, 8, 10, 11, and/or 15, may be stored in the mass storage device 1628, in the volatile memory 1614, in the non-volatile memory 1616, and/or on a removable non-transitory computer-readable storage medium such as a CD or DVD.

The processor platform 1600 of the illustrated example of fig. 16 includes an example acceleration circuit 1634 that includes an example GPU 1640, an example vision processing unit (VPU) 1642, and an example neural network processor 1644. Additionally and/or alternatively, the acceleration circuit 1634 may include any other type of hardware, such as a CPU, an FPGA, an ASIC, etc. In this example, the GPU 1640, VPU 1642, and neural network processor 1644 communicate with different hardware of the processor platform 1600, such as the volatile memory 1614, the non-volatile memory 1616, and the like, via a bus 1618. In this example, the neural network processor 1644 may be implemented with one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer, which may be used to execute AI models, such as neural networks.

Methods and apparatus for dynamic XPU hardware-aware deep learning model management

The computational workload of a computing device may be performed using a deep learning (DL) model. DL models, such as neural networks, are useful tools that have proven valuable in solving complex problems in pattern recognition, object classification, natural language processing, automatic speech recognition, and the like. Identifying the optimal combination of hardware (HW) and/or software (SW) (e.g., a deep learning model) to perform the computational workload is complex because the available types of hardware and/or DL models, and their customization(s), are wide ranging.

Artificial intelligence (AI), including machine learning (ML), deep learning (DL), and/or other artificial machine-driven logic, enables a machine (e.g., a computer, logic circuitry, etc.) to process input data using a model to generate output based on patterns and/or associations that the model previously learned via a training process. For example, the model may be trained with data to identify patterns and/or associations as input data is processed, such that other input(s) result in output(s) consistent with the identified patterns and/or associations.

Many different types of machine learning models and/or machine learning architectures exist. In examples disclosed herein, a decision tree model is used. The use of decision tree models enables interpretation of simple and interpretable data. In general, a machine learning model/architecture suitable for use in the example methods disclosed herein will be a convolutional neural network (CNN) and/or a deep neural network (DNN), where interconnections are not visible outside the model. However, other types of machine learning models may be used in addition or instead, such as recurrent neural networks (RNNs), support vector machines (SVMs), gated recurrent units (GRUs), long short-term memory (LSTM) networks, and so forth.

In general, implementing an ML/AI system involves two phases: a learning/training phase and an inference phase. In the learning/training phase, a training algorithm is used to train the model to operate according to patterns and/or associations based on, for example, training data. Generally, a model includes internal parameters that direct how input data is transformed into output data, such as through a series of nodes and connections within the model. Further, hyperparameters are used as part of the training process to control how learning is performed (e.g., learning rate, number of layers to be used in the machine learning model, etc.). Hyperparameters are defined as training parameters that are determined before initiating the training process.

Based on the type of ML/AI model and/or the expected output, different types of training may be performed. For example, supervised training uses inputs and corresponding expected (e.g., labeled) outputs to select parameters for the ML/AI model (e.g., by iterating over combinations of selected parameters) that reduce model error. As used herein, labeling refers to an expected output (e.g., classification, expected output value, etc.) of the machine learning model. Alternatively, unsupervised training (e.g., as used in deep learning, a subset of machine learning, etc.) involves inferring patterns from the inputs to select parameters of the ML/AI model (e.g., without the benefit of expected (e.g., labeled) outputs).
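As a minimal sketch of the supervised case described above, the following Python snippet iterates over candidate parameter values and keeps the one that reduces error against the labeled outputs. The one-parameter linear model, the candidate values, and the names `mean_squared_error` and `select_parameter` are illustrative assumptions and are not part of the disclosure.

```python
from typing import Sequence


def mean_squared_error(weight: float, inputs: Sequence[float], labels: Sequence[float]) -> float:
    """Error of the model y = weight * x against the labeled (expected) outputs."""
    return sum((weight * x - y) ** 2 for x, y in zip(inputs, labels)) / len(inputs)


def select_parameter(candidates: Sequence[float], inputs: Sequence[float], labels: Sequence[float]) -> float:
    """Iterate over candidate parameter values and keep the one with the lowest error."""
    return min(candidates, key=lambda w: mean_squared_error(w, inputs, labels))


# Hypothetical labeled training data generated by y = 2x.
inputs = [1.0, 2.0, 3.0]
labels = [2.0, 4.0, 6.0]
best_weight = select_parameter([0.5, 1.0, 2.0, 3.0], inputs, labels)
```

A practical training algorithm would typically search a continuous parameter space (e.g., by gradient descent) rather than a fixed list of candidates; the exhaustive loop is used here only to keep the example short.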

In examples disclosed herein, known software samples (e.g., malicious and/or clean) are used to train the ML/AI model. However, any other training algorithm may additionally or alternatively be used. In the examples disclosed herein, training is performed on a set of models optimized for a selected goal (e.g., performance, accuracy, cost, etc.).

Training is performed with hyperparameters that control how learning is performed (e.g., learning rate, number of layers to be used in the machine learning model, etc.).

Training is performed using training data. In examples disclosed herein, the training data may be any type of feature data set (e.g., AI features).

Once trained, the model is deployed for use as an executable construct that processes inputs and provides outputs based on the network of nodes and connections defined in the model. The model is stored in memory. The model may then be executed by the model management circuit 1814 of FIG. 18.

Once trained, the deployed model can be operated in the inference phase to process data. In the inference phase, the data to be analyzed (e.g., live data) is input to the model, and the model executes to create an output. This inference phase can be thought of as the AI "thinking" to generate the output based on what it learned from the training (e.g., by executing the model to apply the learned patterns and/or associations to the live data). In some examples, the input data may undergo preprocessing before being used as input to the machine learning model. Further, in some examples, the output data may undergo post-processing after it is generated by the AI model to transform the output into a useful result (e.g., a display of the data, an instruction to be executed by a machine, etc.).

In some examples, the output of the deployed model may be captured and provided as feedback. By analyzing the feedback, the accuracy of the deployed model can be determined. If the feedback indicates that the accuracy of the deployed model is below a threshold or fails another criterion, training of an updated model can be triggered using the feedback and an updated training data set, hyperparameters, etc., to generate an updated deployed model.
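The feedback check described above can be sketched as follows. This is only an illustration: the accuracy metric (fraction of matching outputs), the default threshold of 0.9, and the names `accuracy` and `needs_retraining` are assumptions, since the disclosure does not fix a particular metric or threshold.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], expected: Sequence[int]) -> float:
    """Fraction of captured model outputs that match the expected outputs."""
    matches = sum(1 for p, e in zip(predictions, expected) if p == e)
    return matches / len(expected)


def needs_retraining(predictions: Sequence[int], expected: Sequence[int], threshold: float = 0.9) -> bool:
    """Return True when the feedback shows accuracy below the threshold."""
    return accuracy(predictions, expected) < threshold


# Hypothetical feedback: 3 of 4 captured outputs were correct.
retrain = needs_retraining([1, 0, 1, 1], [1, 0, 0, 1], threshold=0.9)
```

When `needs_retraining` returns True, a system along these lines would start a new training run with the updated training data set and hyperparameters.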

The exploration and discovery of new Artificial Intelligence (AI) features is a time consuming problem. Rapid discovery of new hardware features will speed up the time to market for new AI products and/or features.

Currently, both the training and inference phases in DL model management systems focus on a single DL model. Some of these single DL models are broken down into multiple smaller models; however, the emphasis of these DL model management systems remains on a single abstract entity. Current DL model management systems do not analyze the differences between alternative models to gain insight and propose new features for AI feature development and/or exploration.

Neural architecture search (Neural Architecture Search, NAS) refers to a method of Deep Learning (DL) model management that focuses on finding the correct network topology for a particular set of requirements. The hardware aware NAS approach considers information from the target Hardware (HW) when searching for the optimal neural network topology. The main focus of the hardware aware NAS approach is to find a single DL model that meets the listed criteria.

Current NAS approaches to DL model management treat each discovered model in isolation. That is, they do not further consider the differences between models (e.g., candidate features optimized by NAS algorithms for different objectives) to discover new features and/or gain further insight.

Most current NAS solutions fail to consider how, where, and under what conditions an optimized model will be deployed. For example, the target hardware may have other processes that affect the availability of the device's resources, yet the optimization assumes that all available resources will be allocated to the model during inference. This has proven to be a significant disadvantage during deployment, because if the target hardware experiences a change in resource utilization at runtime, the model will likely need to be replaced with another model that is better suited to the new conditions.

Model duality must be exploited in order to explore two or more different architectural options optimized for multiple objectives (e.g., accuracy, latency, performance, cost, etc.). Differences between these architectural options are identified and explored to establish new features and/or gaps in Software (SW) or Hardware (HW) to assist in model design/management and/or hardware co-optimization.

FIG. 17 is an illustration of an example AutoML architecture 1700 that includes an example machine learning (ML) system configurator 1702 to identify and/or generate configurable ML computing nodes. The AutoML architecture 1700 includes the ML system configurator 1702 to generate a hardware search space and/or a software search space based on a computing task or workload (e.g., an artificial intelligence/machine learning (AI/ML) computing task or workload). The ML system configurator 1702 may identify hardware, or portion(s) thereof, from the hardware search space. The ML system configurator 1702 may also discover and/or otherwise identify software (e.g., AI/ML models), or portion(s) thereof, from the software search space. In some examples, the ML system configurator 1702 may individually and/or simultaneously evolve a configurable ML computing node by iterating over (i) the architecture and/or type of hardware and/or software and/or (ii) the configuration(s) of the hardware and/or software. For example, the ML system configurator 1702 may evolve a configurable ML computing node by evaluating the hardware and/or software when executing a workload and/or based on a simulation of the hardware and/or software executing the workload. In some such examples, the configurable ML computing node is configurable in that hardware and/or software components may be selected and assembled in various combinations to meet specific or predefined requirements (e.g., accuracy requirements, latency requirements, throughput requirements, etc.). In some such examples, in response to identifying a particular combination of hardware and/or software that meets the specific or predefined requirements, the ML system configurator 1702 may output the combination as a configurable ML computing node to execute the workload of interest.
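One way to picture the search described above is an exhaustive loop over the hardware and software search spaces that stops at the first combination meeting the requirements. The sketch below is an assumption-laden illustration: the function `find_compute_node`, the `evaluate` callback, the metric names, and the lookup table of measurements are all hypothetical, and a real configurator would likely use an evolutionary or otherwise guided search rather than brute force.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Optional, Tuple

Metrics = Dict[str, float]


def find_compute_node(
    hardware_space: Iterable[str],
    software_space: Iterable[str],
    evaluate: Callable[[str, str], Metrics],
    min_accuracy: float,
    max_latency_ms: float,
) -> Optional[Tuple[str, str]]:
    """Return the first hardware/software pair that meets the requirements."""
    for hardware, software in product(hardware_space, software_space):
        metrics = evaluate(hardware, software)
        if metrics["accuracy"] >= min_accuracy and metrics["latency_ms"] <= max_latency_ms:
            return hardware, software
    return None  # no combination in the search spaces meets the requirements


# Hypothetical stand-in for executing or simulating the workload.
_MEASUREMENTS = {
    ("CPU", "small_cnn"): {"accuracy": 0.88, "latency_ms": 40.0},
    ("CPU", "large_cnn"): {"accuracy": 0.95, "latency_ms": 120.0},
    ("GPU", "small_cnn"): {"accuracy": 0.88, "latency_ms": 8.0},
    ("GPU", "large_cnn"): {"accuracy": 0.95, "latency_ms": 25.0},
}


def evaluate(hardware: str, software: str) -> Metrics:
    return _MEASUREMENTS[(hardware, software)]


node = find_compute_node(["CPU", "GPU"], ["small_cnn", "large_cnn"], evaluate, 0.9, 30.0)
```

With these made-up measurements, only the GPU running the larger model satisfies both the accuracy and the latency requirement, so that pair is returned as the computing node.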

In some examples, the configurable ML computing node may be implemented by a single homogeneous computing or electronic system that may be configured and/or otherwise utilized to execute the AI/ML model. For example, the configurable ML computing node may be implemented by a single central processing unit (CPU), graphics processing unit (GPU), artificial intelligence processor (AI processor), field programmable gate array (FPGA), digital signal processor (DSP), XPU, or the like. In some examples, the configurable ML computing node may be implemented by portion(s) of a single homogeneous computing or electronic system, such as portion(s) (e.g., kernel(s)) of a single CPU, GPU, AI processor, FPGA, DSP, XPU, and so on. In some such examples, the portion(s) may include a core (e.g., a hardware core) and/or corresponding interconnect(s), to which different core(s), hardware, etc. may be coupled (e.g., physically coupled, communicatively coupled, coupled via a computing or electrical bus, etc.). In some examples, the configurable ML computing node may be implemented by multiple homogeneous computing or electronic systems of the same type or portion(s) thereof. For example, a configurable ML computing node may be implemented by two or more CPUs (or portion(s) thereof), two or more GPUs (or portion(s) thereof), two or more AI processors (or portion(s) thereof), two or more FPGAs (or portion(s) thereof), two or more DSPs (or portion(s) thereof), two or more XPUs (or portion(s) thereof), and so forth.

In some examples, the configurable ML computing node may be implemented by a single heterogeneous computing or electronic system that may be configured and/or otherwise utilized to execute the AI/ML model. For example, the configurable ML computing node may be implemented by a CPU, a GPU, an AI processor, an FPGA, a DSP, an XPU, and the like, and/or any combination(s) of these. In some such examples, the configurable ML computing node may be implemented by one or more CPUs, one or more GPUs, one or more AI processors, one or more FPGAs, one or more DSPs, one or more XPUs, and the like, and/or any combination(s) of these. In some examples, the configurable ML computing node may be implemented by portion(s) of a single heterogeneous computing or electronic system, such as portion(s) of a CPU, a GPU, an AI processor, an FPGA, a DSP, an XPU, and the like, and/or any combination(s) of these. In some examples, the configurable ML computing node may be implemented by multiple identical heterogeneous computing or electronic systems or portion(s) thereof. For example, a configurable ML computing node may be implemented by two or more instances of a heterogeneous computing system that includes one or more CPUs (or portion(s) thereof), one or more GPUs (or portion(s) thereof), one or more AI processors (or portion(s) thereof), one or more FPGAs (or portion(s) thereof), one or more DSPs (or portion(s) thereof), one or more XPUs (or portion(s) thereof), and/or combinations thereof. In some examples, the configurable ML computing node may be implemented by two or more different heterogeneous computing or electronic systems. For example, a configurable ML computing node may be implemented by a first heterogeneous computing system and a second heterogeneous computing system. In some such examples, the portion(s) of the first heterogeneous computing system and the second heterogeneous computing system may be different.

In some examples, the configurable ML computing node may include, store, and/or otherwise access an executable construct to execute an AI/ML model to complete a workload, or portion(s) thereof. For example, the executable construct may be implemented by a configuration image, an executable binary, an executable code (e.g., executable machine readable code), an executable file (e.g., executable binary), an executable program, executable instructions (e.g., executable machine readable instructions), etc., that when executed may implement an AI/ML model to implement the completion of AI/ML workload.

The example AutoML architecture 1700 is illustrated as including an example optimized application 1704, example optimized middleware and framework 1706, and example application programming interfaces (APIs) 1708. In some examples, the optimized application 1704 may be implemented by an application (e.g., a software application, a web or browser-based application, etc.) that is customized, tailored, and/or otherwise optimized to enable identification and/or generation of configurable ML computing nodes. For example, the optimized application 1704 can be accessed, utilized, etc. by a developer (e.g., a software developer, a researcher, etc.), information technology (IT) personnel, etc. In some such examples, the optimized application 1704 may be accessed, utilized, etc. to co-design hardware/software (HW/SW) solutions to solve technical problems that may benefit from AI/ML technology. In some examples, the optimized middleware and framework 1706 may be implemented by middleware and frameworks that are customized, tailored, and/or otherwise optimized to enable identification and/or generation of configurable ML computing nodes. For example, the optimized middleware and framework 1706 can implement interfaces (e.g., communications, connectivity, etc.) between the optimized application 1704 and the APIs 1708.

The API 1708 of the illustrated example can be invoked to program, develop, and/or otherwise generate an AI/ML application via at least one of direct programming or API-based programming. The APIs 1708 of the illustrated example include an example migration tool 1710, an example direct programming API 1712, an example API-based programming API 1714, and an example analysis tool 1716.

In some examples, the migration tool 1710 may be implemented by software (e.g., a software application) that adapts a program to execute in a first computing or electronic environment that is different from a second computing or electronic environment for which the program was originally designed. For example, the migration tool 1710 may convert and/or otherwise adapt a first program developed for a first type of hardware, operating system (OS), library, etc., into a second program for a second type of hardware, OS, library, etc.

In some examples, the direct programming API 1712 may be invoked to implement direct programming tasks, which may include developing and/or compiling a Data Parallel C++ application. In some examples, the API-based programming API 1714 may be invoked to implement API-based programming, which may include developing and/or compiling applications that invoke (or call, instantiate, etc.) a Math Kernel Library (MKL), an MKL Deep Neural Network (DNN) library, a data analysis acceleration library, a thread building block library, a parallel standard template library, a media software development kit (SDK), a deep learning deployment kit, a machine learning scaling library, etc., and/or any combination(s) of these.

In some examples, the analysis tool 1716 may be invoked, instantiated, and/or otherwise executed to analyze the hardware, software, and/or configuration(s) thereof of the configurable ML computing node. For example, the analysis tool 1716 may instantiate simulator(s) to simulate all hardware and/or software features of the configurable ML computing node to generate and/or otherwise output one or more evaluation parameters. In some such examples, the evaluation parameters may include parameters that represent and/or otherwise indicate the accuracy, latency, number of cycles to complete the workload, or throughput of the configurable ML computing node. In some examples, the evaluation parameters may include parameters that represent and/or otherwise indicate: processor or clock frequency, fabric frequency, read memory bandwidth, write memory bandwidth, hardware throttling factor, number of memory ports, number of data processing units (DPUs), number of model layers (e.g., neural network layers, convolutional layers, etc.), activation precision (e.g., precision of activation values to be processed), weight precision (e.g., precision of weight values to be processed), and/or the like, and/or any combination(s) of these. For example, the analysis tool 1716 may execute a simulator based on the configurable ML computing node. In some such examples, the analysis tool 1716 may execute the simulator to determine the throughput of the configurable ML computing node when executing a particular AI/ML model with a particular configuration.

In some examples, the analysis tool 1716 may instantiate a simulator(s) to simulate behavior, configuration, etc. of the configurable ML computing node to generate and/or otherwise output one or more evaluation parameters. For example, the analysis tool 1716 may execute a model (e.g., a simulation model, an AI/ML model, etc.) based on the configurable ML computing nodes. In some such examples, the analysis tool 1716 may execute a model to estimate, predict, and/or otherwise determine the throughput of the configurable ML computing node when executing a particular AI/ML model with a particular configuration.

The AutoML architecture 1700 of the illustrated example includes different types of hardware and/or software that can be used to generate a configurable ML computing node. In the illustrated example, the AutoML architecture 1700 includes interfaces and target system software for scalar, vector, matrix, and spatial hardware. Additionally and/or alternatively, any other type of hardware may be used. In this example, the scalar hardware is implemented by an example CPU 1718 and example CPU system software 1720. For example, the CPU system software 1720 may include instructions corresponding to a CPU instruction set architecture (ISA). In this example, the vector hardware is implemented by an example GPU 1722 and example GPU system software 1724. For example, the GPU system software 1724 may include kernels, portion(s) of code, etc., such as compute kernels and/or shaders. In some examples, the kernels, portion(s) of code, etc. may be expressed in a high-level programming language, such as High-Level Shader Language (HLSL), OpenCL, etc.

In this example, the matrix hardware is implemented by the example AI processor 1726 and the example AI system software 1728. For example, the AI system software 1728 can include one or more AI/ML algorithms, models, etc., such as a neural network (e.g., a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), etc.), linear regression models, logistic regression models, decision tree models, learning vector quantization models, etc., and/or combination(s) thereof. In this example, the spatial hardware is implemented by the example FPGA 1730 and the example FPGA system software 1732. For example, the FPGA system software 1732 can include kernels, portion(s) of code, and the like, which are based on a hardware description language (HDL), such as Verilog.

The ML system configurator 1702 of the illustrated example can interface with the CPU 1718 and/or the CPU system software 1720 via the example host interface 1734. The ML system configurator 1702 of the illustrated example may interface with a GPU 1722, GPU system software 1724, AI processor 1726, AI system software 1728, FPGA 1730, and/or FPGA system software 1732 via an example zero-level interface 1736.

In the illustrated example, the CPU system software 1720, GPU system software 1724, AI system software 1728, FPGA system software 1732, host interface 1734, and/or zero-level interface 1736 may correspond to and/or otherwise implement example below-zero-level system software 1738. For example, the below-zero-level system software 1738 may correspond to and/or otherwise be implemented as a low-level direct-to-metal interface tailored to hardware such as the CPU 1718, the GPU 1722, and the like.

In the illustrated example, the API 1708 may implement example above-zero-level system software 1740 and an example developer interface 1742. For example, a developer, user, etc. may access and/or otherwise utilize the AutoML architecture 1700 through the API 1708. In some examples, developers, users, etc. may access and/or otherwise utilize system software at a higher level than the low-level direct-to-metal interfaces by way of the API 1708. In some examples, developers, users, etc. can access and/or otherwise utilize the below-zero-level system software 1738 via the host interface 1734 and/or the zero-level interface 1736.

FIG. 18 is a block diagram of an example configuration of a dynamic XPU hardware-aware deep learning (DL) model management system implemented according to the teachings of the present disclosure. The example DL model management system 1800 includes an example input data set 1802; an example model training circuit 1804 including an example difference determiner circuit 1806, an example similarity determiner circuit 1808, and an example feature collector circuit 1810; example first, second, and third models 1812A, 1812B, and 1812C; and an example model management circuit 1814 including an example QoS sampler circuit 1816, an example QoS selector circuit 1818, and an example model scheduler circuit 1820.

In examples disclosed herein, the example input dataset 1802 may contain candidate features, targets for which the model is to be optimized, and so forth. The example input data set 1802 is transmitted to the model training circuit 1804 for use by the DL model management system 1800 in training and/or optimization of models.

The example model training circuit 1804, including the example difference determiner circuit 1806, the example similarity determiner circuit 1808, and the example feature collector circuit 1810, receives the example input data set 1802 and generates a set of models (e.g., a first model 1812A, a second model 1812B, and a third model 1812C) based on the selected targets. For example, in the DL model management system 1800 disclosed herein, a first model 1812A is trained to optimize accuracy as a key objective, a second model 1812B is trained to optimize performance as a key objective, and a third model 1812C is trained to optimize cost as a key objective.

The example difference determiner circuit 1806 analyzes feature lists of models optimized for different selected objectives (e.g., accuracy, performance, cost, etc.) to identify feature differences between the various models. In the examples disclosed herein, the difference determiner circuit 1806 identifies these differences by correlating features that are present when a first target is selected for a first model (e.g., features from the first model 1812A having a selected accuracy target) but absent when a second target is selected for a second model (e.g., features from the second model 1812B having a selected performance target). Determining these differences makes it possible to better understand why a certain model may improve its overall performance at the expense of another objective (e.g., cost).

In some examples, the model training circuit 1804 includes means for identifying candidate differences between models optimized for different selected objectives (e.g., accuracy, performance, cost, etc.). For example, the means for identifying differences may be implemented by the example difference determiner circuit 1806. In some examples, the example difference determiner circuit 1806 may be instantiated by a processor circuit, such as the example processor circuit 2112 of fig. 21. For example, the example difference determiner circuit 1806 can be instantiated by the example general purpose processor circuit 2112 of fig. 21 executing machine executable instructions, such as implemented by at least blocks 1905, 1910, and 1915 of fig. 19. In some examples, the example difference determiner circuit 1806 may be instantiated by a hardware logic circuit, which may be implemented by the ASIC or FPGA circuit 4900 of fig. 49 configured to perform operations corresponding to machine readable instructions. Additionally or alternatively, the example difference determiner circuit 1806 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the example difference determiner circuit 1806 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, Application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine readable instructions and/or to perform some or all of the operations corresponding to the machine readable instructions without executing software or firmware, although other structures may be equally suitable.

The example similarity determiner circuit 1808 analyzes feature lists of models optimized for different selected objectives (e.g., accuracy, performance, cost, etc.) to identify feature similarities between the various models. In the examples disclosed herein, the similarity determiner circuit 1808 identifies these similarities by associating features that are present when a first target is selected for a first model (e.g., features from the first model 1812A having a selected accuracy target) and that are still present when a second target is selected for a second model (e.g., features from the second model 1812B having a selected performance target). In determining these similarities, it is further understood which features are important to overall model performance (e.g., it can be concluded that some layers are very important in performing object detection).

In some examples, the model training circuit 1804 includes means for identifying similarities between models optimized for different selected objectives (e.g., accuracy, performance, cost, etc.). For example, the means for identifying similarities may be implemented by the example similarity determiner circuit 1808. In some examples, the example similarity determiner circuit 1808 may be instantiated by a processor circuit, such as the example processor circuit 2112 of fig. 21. For example, the example similarity determiner circuit 1808 can be instantiated via the example general purpose processor circuit 2112 of fig. 21 executing machine-executable instructions, such as implemented by at least block 1920 of fig. 19. In some examples, the example similarity determiner circuit 1808 may be instantiated by hardware logic circuitry, which may be implemented by the ASIC or FPGA circuit 4900 of fig. 49 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, the example similarity determiner circuit 1808 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the example similarity determiner circuit 1808 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, Application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine readable instructions and/or to perform some or all of the operations corresponding to the machine readable instructions without executing software or firmware, although other structures may be equally suitable.

The example feature collector circuit 1810 collects the lists of features identified by both the difference determiner circuit 1806 and the similarity determiner circuit 1808. In some examples, the feature collector circuit 1810 may then perform further analysis on the collected list of features; however, in examples disclosed herein, the list may be retained for output.
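The difference, similarity, and collection steps described above map naturally onto set operations over per-model feature lists. The following Python sketch is one possible reading under that assumption; the function names and the feature names (`conv3x3`, `attention`, and so on) are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, List, Set


def feature_differences(first: Set[str], second: Set[str]) -> Set[str]:
    """Features present in the first model but absent from the second."""
    return first - second


def feature_similarities(first: Set[str], second: Set[str]) -> Set[str]:
    """Features present in both models."""
    return first & second


def collect_features(first: Set[str], second: Set[str]) -> Dict[str, List[str]]:
    """Gather both results into one report, sorted for stable output."""
    return {
        "differences": sorted(feature_differences(first, second)),
        "similarities": sorted(feature_similarities(first, second)),
    }


# Hypothetical feature lists for an accuracy-optimized model and a
# performance-optimized model.
accuracy_model = {"conv3x3", "attention", "residual"}
performance_model = {"conv3x3", "depthwise_conv", "residual"}
report = collect_features(accuracy_model, performance_model)
```

In this made-up example, `attention` appears only when accuracy is the objective, which hints that it buys accuracy at some cost in performance, while `conv3x3` and `residual` survive under both objectives and are therefore likely important to the model overall.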

In some examples, the model training circuit 1804 includes means for collecting features identified by the example difference determiner circuit 1806 and the example similarity determiner circuit 1808. For example, the means for collecting features may be implemented by the example feature collector circuit 1810. In some examples, the example feature collector circuit 1810 may be instantiated by a processor circuit, such as the example processor circuit 2112 of fig. 21. For example, the example feature collector circuit 1810 may be instantiated via the example general purpose processor circuit 2112 of fig. 21 executing machine-executable instructions, such as implemented by at least block 1925 of fig. 19. In some examples, the example feature collector circuit 1810 may be instantiated by hardware logic circuitry, which may be implemented by the ASIC or FPGA circuit 4900 of fig. 49 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, the example feature collector circuit 1810 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the example feature collector circuit 1810 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, Application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions without executing software or firmware, although other structures may be equally suitable.

The first, second, and third models (1812A, 1812B, and 1812C) obtained from the input data set 1802 are input into the example model management circuit 1814 for further processing after use by the model training circuit 1804. In the examples disclosed herein, the first model 1812A is optimized to maximize the selected accuracy target, the second model 1812B is optimized to maximize the selected performance target, and the third model 1812C is optimized to maximize the selected cost target.

In examples disclosed herein, the example model management circuit 1814 includes an example quality of service (QoS) sampler circuit 1816, an example QoS selector circuit 1818, and an example model scheduler circuit 1820.

The example quality of service (QoS) sampler circuit 1816 samples the current state of the target hardware platform. For example, quality of service (QoS) sampler circuit 1816 may determine that the target hardware platform is currently responding to a high priority request from an application.

In some examples, the model management circuitry 1814 includes means for determining a current state of a target hardware platform. For example, the means for determining may be implemented by the example QoS sampler circuit 1816. In some examples, the example QoS sampler circuit 1816 may be instantiated by a processor circuit, such as the example processor circuit 2112 of fig. 21. For example, the example QoS sampler circuit 1816 may be instantiated via the example general purpose processor circuit 2112 of fig. 21 executing machine-executable instructions, such as implemented by at least block 2005 of fig. 20. In some examples, the example QoS sampler circuit 1816 may be instantiated by hardware logic circuitry, which may be implemented by the ASIC or FPGA circuit 4900 of fig. 49 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, the example QoS sampler circuit 1816 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the example QoS sampler circuit 1816 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions without executing software or firmware, although other structures are equally suitable.

The example QoS selector circuit 1818 selects a quality of service (QoS) target to be prioritized based on the current state of the target hardware platform, as determined by the QoS sampler circuit 1816. For example, if the QoS sampler circuit 1816 has determined that the target hardware platform is currently responding to a high priority request from an application, the QoS selector circuit 1818 may select accuracy as the highest-priority QoS target.

In some examples, the model management circuit 1814 includes means for selecting a quality of service (QoS) target. For example, the means for selecting QoS targets may be implemented by the example QoS selector circuit 1818. In some examples, the example QoS selector circuit 1818 may be instantiated by a processor circuit, such as the example processor circuit 2112 of fig. 21. For example, the example QoS selector circuit 1818 may be instantiated via the example general purpose processor circuit 2100 of fig. 21 executing machine-executable instructions, such as implemented by at least blocks 2010, 2015, and 2020 of fig. 20. In some examples, the example QoS selector circuit 1818 may be instantiated by hardware logic circuitry, which may be implemented by the ASIC or FPGA circuit 4900 of fig. 49 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, the example QoS selector circuit 1818 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the example QoS selector circuit 1818 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions, without executing software or firmware, although other structures are equally suitable.

The example model scheduler circuit 1820 selects a model that will best meet the requirements of the selected quality of service (QoS) prioritization target for use by the target hardware platform. In addition, the model scheduler circuit 1820 also monitors utilization metrics of the target hardware platform. If any utilization metrics are determined to be below a predetermined threshold, the model scheduler circuit 1820 then adjusts the model selection to produce another model for use by the target hardware platform. For example, if the first model 1812A begins to produce low utilization metrics on a hardware platform, the model scheduler circuit 1820 selects the second model 1812B to use as the new model. If the second model 1812B begins to produce a low utilization metric after a period of time, the model scheduler circuit 1820 may determine that the first model 1812A is more suitable for use by the hardware platform.

In some examples, the model management circuit 1814 includes means for selecting a model. For example, the means for selecting may be implemented by the example model scheduler circuit 1820. In some examples, the example model scheduler circuit 1820 may be instantiated by a processor circuit, such as the example processor circuit 2112 of fig. 21. For example, the example model scheduler circuit 1820 may be instantiated via the example general purpose processor circuit 2100 of fig. 21 executing machine executable instructions, such as implemented by at least blocks 2025, 2030, and 2035 of fig. 20. In some examples, the example model scheduler circuit 1820 may be instantiated by hardware logic circuitry, which may be implemented by the ASIC or FPGA circuit 4900 of fig. 49 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, the example model scheduler circuit 1820 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the example model scheduler circuit 1820 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions, without executing software or firmware, although other structures are equally suitable.

Although an example manner of implementing the model training circuit 1804 of fig. 18 is illustrated in fig. 18, one or more of the elements, processes, and/or devices illustrated in fig. 18 may be combined, divided, rearranged, omitted, eliminated, and/or implemented in any other way. In addition, the example variance determiner circuit 1806, the example similarity determiner circuit 1808, the example feature collector circuit 1810, and/or, more generally, the example model training circuit 1804 of fig. 18 may be implemented in hardware alone or in combination with software and/or firmware. Thus, for example, any of the example variance determiner circuit 1806, the example similarity determiner circuit 1808, the example feature collector circuit 1810, and/or, more generally, the example model training circuit 1804 may be implemented by a processor circuit, analog circuit(s), digital circuit(s), logic circuit(s), programmable processor(s), programmable microcontroller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), and/or field programmable logic device(s) (FPLD(s)) (e.g., Field Programmable Gate Array(s) (FPGA(s))). Further, the example model training circuit 1804 of fig. 18 may include one or more elements, processes, and/or devices in addition to or instead of those shown in fig. 18, and/or may include more than one of any or all of the illustrated elements, processes, and devices.

While an example manner of implementing the model management circuit 1814 of fig. 18 is illustrated in fig. 18, one or more of the elements, processes, and/or devices illustrated in fig. 18 may be combined, divided, rearranged, omitted, eliminated, and/or implemented in any other way. In addition, the example quality of service (QoS) sampler circuit 1816, the example QoS selector circuit 1818, the example model scheduler circuit 1820, and/or, more generally, the example model management circuit 1814 of fig. 18 may be implemented in hardware alone or in combination with software and/or firmware. Thus, for example, any of the example QoS sampler circuit 1816, the example QoS selector circuit 1818, the example model scheduler circuit 1820, and/or, more generally, the example model management circuit 1814 may be implemented by a processor circuit, analog circuit(s), digital circuit(s), logic circuit(s), programmable processor(s), programmable microcontroller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), and/or field programmable logic device(s) (FPLD(s)) (e.g., Field Programmable Gate Array(s) (FPGA(s))). Further, the example model management circuit 1814 of fig. 18 may include one or more elements, processes, and/or devices in addition to or instead of those shown in fig. 18, and/or may include more than one of any or all of the illustrated elements, processes, and devices.

A flowchart representative of example hardware logic circuits, machine readable instructions, hardware implemented state machines, and/or any combination of these for implementing the model training circuit 1804 of fig. 18 is shown in fig. 19. A flowchart representative of example hardware logic circuits, machine readable instructions, hardware implemented state machines, and/or any combination of these to implement the model management circuit 1814 of fig. 18 is shown in fig. 20. The machine-readable instructions may be one or more executable programs, or portion(s) of an executable program, for execution by a processor circuit, such as the processor circuit 2112 shown in the example processor platform 2100 discussed below in connection with fig. 21 and/or the example processor circuit discussed below in connection with fig. 48 and/or 49. The program may be embodied in software stored on one or more non-transitory computer readable storage media, such as a Compact Disc (CD), a floppy disk, a Hard Disk Drive (HDD), a Solid State Drive (SSD), a Digital Versatile Disc (DVD), a blu-ray disc, volatile memory (e.g., any type of Random Access Memory (RAM), etc.), or non-volatile memory (e.g., electrically erasable programmable read-only memory (EEPROM), flash memory, HDD, SSD, etc.), associated with processor circuitry located in one or more hardware devices, but the entire program and/or a portion thereof may be executed by one or more hardware devices other than processor circuitry and/or embodied in firmware or dedicated hardware. The machine-readable instructions may be distributed across multiple hardware devices and/or executed by two or more hardware devices (e.g., a server and a client hardware device). 
For example, the client hardware device may be implemented by an endpoint client hardware device (e.g., a hardware device associated with a user) or an intermediary client hardware device (e.g., a Radio Access Network (RAN) gateway that may facilitate communications between the server and the endpoint client hardware device). Similarly, the non-transitory computer readable storage medium may include one or more media located in one or more hardware devices. Additionally, although the example program is described with reference to the flow diagrams shown in fig. 19 and/or 20, many other methods of implementing the example model training circuit 1804 and/or the example model management circuit 1814 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the respective operations without executing software or firmware. The processor circuits may be distributed in different network locations and/or local to one or more hardware devices (e.g., a single core processor (e.g., a single core Central Processing Unit (CPU)), a multi-core processor (e.g., a multi-core CPU), etc.), multiple processors distributed across multiple servers of a server rack, multiple processors distributed across one or more server racks, CPUs and/or FPGAs located in the same package (e.g., the same Integrated Circuit (IC) package or in two or more separate housings, etc.).

In another example, machine-readable instructions may be stored in a state in which they can be read by the processor circuit but require the addition of libraries (e.g., dynamic link libraries (DLLs)), software development kits (SDKs), application programming interfaces (APIs), etc. in order to execute these machine-readable instructions on a particular computing device or other device. In another example, machine-readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine-readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, a machine-readable medium as used herein may include machine-readable instructions and/or program(s) regardless of the particular format or state of the machine-readable instructions and/or program(s) when stored or otherwise at rest or in transit.

Machine-readable instructions described herein may be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine-readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.

As described above, the example operations of FIG. 19 and/or FIG. 20 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on one or more non-transitory computer and/or machine readable media, such as an optical storage device, a magnetic storage device, an HDD, a flash memory, a read-only memory (ROM), a CD, a DVD, a cache, any type of RAM, a register, and/or any other storage device or storage disk in which information may be stored for any duration (e.g., for longer periods of time, permanently stored, temporarily stored, for temporarily buffering, and/or for caching of the information). As used herein, the terms non-transitory computer readable medium and non-transitory computer readable storage medium are expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.

FIG. 19 is a flowchart representative of example machine readable instructions and/or example operations 1900 that may be executed and/or instantiated by the processor circuit to identify and collect similar and/or different features between a set of models optimized for various target platform objectives. The machine-readable instructions and/or operations 1900 of fig. 19 begin at block 1905 where the variance determiner circuit 1806 receives the input data set 1802 of fig. 18 for processing.

As shown in fig. 19, at block 1905, the variance determiner circuit 1806 receives a data set (e.g., the input data set 1802 from fig. 18) for processing. In examples disclosed herein, the data set includes an optimized model, however, in other examples, the data set may be configured to include candidate features, platform metrics, and so forth.

At block 1910, the variance determiner circuit 1806 checks whether the models contained within the example data set received at block 1905 (e.g., the input data set 1802 from fig. 18) are optimized for the same target hardware. The variance determiner circuit 1806 performs this target hardware check before the various models are compared to each other. If the variance determiner circuit 1806 determines that the models are optimized for the same target hardware, the process proceeds to block 1915. However, if the variance determiner circuit 1806 determines that the models are not all optimized for the same target hardware, the process moves back to the starting point.

At block 1915, the variance determiner circuit 1806 identifies feature variances among the models received for processing at block 1905. In the examples disclosed herein, the example data set received for processing in block 1905 includes various models, each model optimized for a different goal on the same target hardware platform. Thus, the variance determiner circuit 1806 identifies feature variances among the models by comparing the list of features present in each model and selecting those features that are not present in all models. For example, certain features that exist for models with selected accuracy targets, but do not exist for models with selected performance targets, are identified by the variance determiner circuit 1806.

At block 1920, the example similarity determiner circuit 1808 performs a process similar to that of the example variance determiner circuit 1806, but identifies feature similarities among the models instead. For example, certain features that exist for models with selected accuracy targets and also exist for models with selected performance targets are identified by the similarity determiner circuit 1808.

At block 1925, the example feature collector circuit 1810 aggregates the features identified by the example variance determiner circuit 1806 and the example similarity determiner circuit 1808 into a single set. In examples disclosed herein, feature collector circuit 1810 may output an aggregated feature set.
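The flow of blocks 1910 through 1925 can be illustrated with set operations. The following is a minimal sketch, not the disclosed implementation: the `OptimizedModel` record, the function names, and the choice to label each feature in the aggregated set are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizedModel:
    """Hypothetical record for one model in the input data set."""
    name: str
    target_hardware: str
    features: frozenset


def same_target_hardware(models):
    """Block 1910: all models must be optimized for the same target hardware."""
    return len({m.target_hardware for m in models}) == 1


def feature_similarities(models):
    """Block 1920: features present in every model."""
    return frozenset.intersection(*(m.features for m in models))


def feature_variances(models):
    """Block 1915: features present in some models but not in all of them."""
    every_feature = frozenset().union(*(m.features for m in models))
    return every_feature - feature_similarities(models)


def collect_features(models):
    """Block 1925: aggregate both results into a single labeled set."""
    if not models or not same_target_hardware(models):
        return None  # the flow of fig. 19 returns to its starting point
    labeled = {("variance", f) for f in feature_variances(models)}
    labeled |= {("similarity", f) for f in feature_similarities(models)}
    return labeled
```

Labeling each entry keeps the origin of a feature visible after aggregation; a plain union would also satisfy the description of a single set but would discard that distinction.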

FIG. 20 is a flowchart representative of example machine readable instructions and/or example operations 2000 that may be executed and/or instantiated by the processor circuit to dynamically select and/or adjust an optimized model for use based on a current state of a target hardware platform and/or model utilization metrics. The machine-readable instructions and/or operations 2000 of fig. 20 begin at block 2005, where a quality of service (QoS) sampler circuit 1816 samples a current state of a hardware platform.

As shown in fig. 20, at block 2005, the QoS sampler circuit 1816 samples the current state of the hardware platform. For example, the QoS sampler circuit 1816 may determine that the hardware platform is currently responding to high priority requests from an application.

At block 2010, the QoS selector circuit 1818 selects a quality of service (QoS) target (e.g., cost, accuracy, performance, etc.) for prioritization based on the current state of the hardware platform (e.g., currently responding to a high priority request from the application) determined by the QoS sampler circuit 1816 at block 2005. For example, if the QoS sampler circuit 1816 determines that the hardware platform is currently responding to a high priority request from an application, the QoS selector circuit 1818 may select accuracy as the highest-priority QoS target.

At block 2015, the QoS selector circuit 1818 sorts the set of models, each model optimized for a different QoS target, based on the QoS priority target selected in block 2010. In examples disclosed herein, the QoS selector circuit 1818 may sort the set of models in descending order based on their ability to maximize the selected QoS priority target.
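Blocks 2010 and 2015 can be sketched as a selection step followed by a descending sort. This is an illustrative sketch under stated assumptions: the dictionary layout of the platform state and of each model, the `scores` field, and the fallback to a performance target when no high priority request is pending are hypothetical and are not specified by the description.

```python
def select_qos_target(platform_state):
    """Block 2010: pick the QoS target to prioritize from the sampled state."""
    if platform_state.get("high_priority_request", False):
        return "accuracy"  # the example given in the description
    return "performance"  # assumed default; the description does not name one


def sort_models(models, qos_target):
    """Block 2015: order models by their score for the target, best first."""
    return sorted(models, key=lambda m: m["scores"][qos_target], reverse=True)
```

Because `sorted` returns a new list, the original model set is left untouched and can be re-sorted whenever the sampled state changes.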

At block 2020, the QoS selector circuit 1818 checks whether the list of ordered models (e.g., ordered based on the ability to maximize the selected QoS prioritization target) is empty. If QoS selector circuit 1818 determines that the list is empty, the process moves back to block 2005. However, if the QoS selector circuit 1818 determines that the list is not empty, the process moves forward to block 2025.

At block 2025, the model scheduler circuit 1820 selects a model that will meet the requirements of the selected QoS prioritization target for use by the target hardware platform. In the examples disclosed herein, since the list of optimization models is ordered in descending order based on the ability to meet a selected QoS priority target, the first model in the list is selected for use.

At block 2030, the model scheduler circuit 1820 determines whether the selected model produces a low utilization metric on the target hardware platform. If the model scheduler circuit 1820 determines that the selected model does produce a low utilization metric, the process moves to block 2035. However, if the model scheduler circuit 1820 determines that the selected model does not produce a low utilization metric on the target hardware platform, the process ends.

At block 2035, the model scheduler circuit 1820 removes the currently in-use model from the list of ordered models after determining that the selected model yields a low utilization metric on the target hardware platform. The process then moves back to block 2020, where the QoS selector circuit 1818 checks whether the list of ordered models is empty.
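The loop formed by blocks 2020 through 2035 can be sketched as follows. This is a minimal sketch, assuming that a utilization metric is a single number and that "low" means below a caller-supplied threshold; the function name `schedule_model` and the `utilization_of` callback are hypothetical.

```python
def schedule_model(ordered_models, utilization_of, threshold):
    """Return the first model whose utilization is not low, or None.

    ordered_models: models sorted best-first for the selected QoS target.
    utilization_of: callable returning the utilization metric of a model.
    threshold: utilization below this value counts as low (assumed scalar).
    """
    remaining = list(ordered_models)  # work on a copy of the ordered list
    while remaining:  # block 2020: is the list empty?
        selected = remaining[0]  # block 2025: take the first model
        if utilization_of(selected) >= threshold:  # block 2030
            return selected  # utilization is acceptable; the flow ends
        remaining.pop(0)  # block 2035: drop the low-utilization model
    return None  # empty list: the flow returns to block 2005 to resample
```

Returning `None` leaves it to the caller to resample the platform state, which mirrors the return path from block 2020 to block 2005.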

Fig. 21 is a block diagram of an example processor platform 2100 that is configured to execute and/or instantiate the machine readable instructions and/or operations of figs. 19-20 to implement the model training circuit 1804, the model management circuit 1814, and/or, more generally, the Deep Learning (DL) model management system 1800 of fig. 18. The processor platform 2100 may be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cellular telephone, a smart phone, or a tablet device such as an iPad™), a personal digital assistant (PDA), an Internet appliance, a DVD player, a CD player, a digital video recorder, a blu-ray player, a gaming machine, a personal video recorder, a set-top box, a headset (e.g., an augmented reality (AR) headset, a virtual reality (VR) headset, etc.) or other wearable device, or any other type of computing device.

The processor platform 2100 of the illustrated example includes processor circuit 2112. The processor circuit 2112 of the illustrated example is hardware. For example, the processor circuit 2112 may be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The processor circuit 2112 may be implemented by one or more semiconductor-based (e.g., silicon-based) devices. In this example, the processor circuit 2112 implements the example model training circuit 1804, including the example variance determiner circuit 1806, the example similarity determiner circuit 1808, and the example feature collector circuit 1810, and the example model management circuit 1814, including the example quality of service (QoS) sampler circuit 1816, the example QoS selector circuit 1818, and the example model scheduler circuit 1820.

The processor circuit 2112 of the illustrated example includes a local memory 2113 (e.g., cache, registers, etc.). The processor circuit 2112 of the illustrated example communicates with a main memory including a volatile memory 2114 and a non-volatile memory 2116 over a bus 2118. The volatile memory 2114 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), another type of dynamic random access memory, and/or any other type of RAM device. The non-volatile memory 2116 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 2114, 2116 of the illustrated example is controlled by a memory controller 2117.

The processor platform 2100 of the illustrated example also includes interface circuit 2120. The interface circuit 2120 may be implemented in hardware in accordance with any type of interface standard, such as an Ethernet interface, a Universal Serial Bus (USB) interface, a Near Field Communication (NFC) interface, a Peripheral Component Interconnect (PCI) interface, and/or a Peripheral Component Interconnect Express (PCIe) interface.

In the illustrated example, one or more input devices 2122 are connected to the interface circuit 2120. The input device(s) 2122 allow a user to input data and/or commands into the processor circuit 2112. The input device(s) 2122 may be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, buttons, a mouse, a touch screen, a touch pad, a trackball, an isopoint device, and/or a speech recognition system.

One or more output devices 2124 are also connected to the interface circuit 2120 of the illustrated example. The output device(s) 2124 can be implemented, for example, by a display device (e.g., a Light Emitting Diode (LED), an Organic Light Emitting Diode (OLED), a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT) display, an in-plane switching (IPS) display, a touch screen, etc.), a haptic output device, a printer, and/or speakers. The interface circuit 2120 of the illustrated example thus generally includes a graphics driver card, a graphics driver chip, and/or graphics processor circuitry, such as a GPU.

The interface circuit 2120 of the illustrated example also includes communication devices, such as a transmitter, receiver, transceiver, modem, residential gateway, wireless access point, and/or network interface to facilitate data exchange with external machines (e.g., any kind of computing device) via a network 2126. The communication may be through, for example, an Ethernet connection, a Digital Subscriber Line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, an optical connection, etc.

The processor platform 2100 of the illustrated example also includes one or more mass storage devices 2128 to store software and/or data. Examples of such mass storage devices 2128 include magnetic storage devices, optical storage devices, floppy disk drives, HDDs, CDs, blu-ray disc drives, Redundant Array of Independent Disks (RAID) systems, solid-state storage devices (such as flash memory devices and/or SSDs), and DVD drives.

The machine-executable instructions 2132, which may be implemented by the machine-readable instructions of fig. 19-20, may be stored in the mass storage device 2128, in the volatile memory 2114, in the nonvolatile memory 2116, and/or on a removable non-transitory computer-readable storage medium such as a CD or DVD.

From the foregoing, it will be apparent that example systems, methods, apparatus, and articles of manufacture have been disclosed for dynamic XPU hardware-aware Deep Learning (DL) model management. The disclosed systems, methods, apparatuses, and articles of manufacture increase the efficiency of using a computing device by allowing new hardware features to be discovered quickly, which speeds up the time to market for new Artificial Intelligence (AI) products and/or features, and enhances performance improvement measures of the computing device by applying the newly discovered features. The disclosed systems, methods, apparatus, and articles of manufacture are thus directed to one or more improvements in the operation of machines such as computers or other electronic and/or mechanical devices.

Method and apparatus for data enhanced automated model generation

Machine learning is an important enabling technology of the currently ongoing artificial intelligence revolution, driving truly significant advances in such fields as object detection, image classification, speech recognition, natural language processing, and many others. Machine learning is used to create models that when utilized enable output to be generated based on input. Neural architecture searching enables searching of various architectures when creating machine learning models.

Neural architecture searching (Neural Architecture Search, NAS) is a method for exploring different machine learning algorithms to solve machine learning tasks. NAS algorithms take a large number of resources (e.g., computing resources, time resources, energy resources, etc.) to identify acceptable architectures. During the exploration phase, most of these resources are spent by checking for non-optimal architecture configurations. Existing NAS algorithms do not provide a clear explanation of the decision making to select a particular architecture, and such algorithms do not benefit from data collected about previous findings (e.g., sequence of operations, FLOP, etc.) or target hardware capabilities. This information is typically discarded and is not beneficial for future applications of NAS algorithms.

Due to the complexity of the task, NAS solutions tend to forget any insight from one run to the next. The initial conditions/configurations in the previous solutions are independent of any other configuration previously used.

Existing NAS methods do not reuse data from previous executions related to the models identified via NAS. That is, existing approaches do not benefit from the collected knowledge about the tasks (e.g., detection, segmentation, etc.) that the model is to perform. Existing methods start from scratch each time NAS is executed to look for better models. Many existing NAS methods also require extensive reconfiguration when moving to different tasks, so such methods do not generalize the neural architecture search process.

The example methods disclosed herein analyze the current state of the art and emerging workloads, and collect historical information about the model for each operation, including performance, sequence of operations, size, floating point operations per second (FLOPS), and so forth.

In the examples disclosed herein, the user provides a task (object recognition, segmentation, etc.) and a target (accuracy, latency, mixed, etc.), and the NAS system selects starting hyperparameters/configuration information including the best configuration for the task, the target, and, in some examples, the target hardware on which the model will execute.
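One way to picture this selection is as a lookup over previously collected records. The following is only a sketch under assumptions: the record fields (`task`, `target`, `hardware`, `config`, `score`) and the rule of taking the highest-scoring match are hypothetical, and the description does not state how the best configuration is actually chosen.

```python
def select_starting_configuration(history, task, target, hardware=None):
    """Return the best-scoring past configuration for the request, or None.

    history: list of dict records collected from earlier searches
             (hypothetical layout, see the field names used below).
    hardware: optional target hardware; when omitted, any hardware matches.
    """
    matches = [
        record for record in history
        if record["task"] == task
        and record["target"] == target
        and (hardware is None or record["hardware"] == hardware)
    ]
    if not matches:
        return None  # no prior knowledge: the search starts without a hint
    return max(matches, key=lambda record: record["score"])["config"]
```

For example, a request for segmentation with a latency target on a given device would return the configuration of the highest-scoring earlier run recorded for that same combination.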

The collected execution and/or performance information provides insight and guidance regarding the initial conditions for searching for architectures that meet these requirements. The system also gathers information about target hardware resources, making the system hardware aware and allowing the system to refine its search for the specific target hardware. For example, the system may avoid an expanded 7 x 7 convolution kernel if that kernel does not perform well on the selected target hardware (e.g., its latency exceeds a threshold amount of latency).
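The latency-based exclusion in this example can be sketched as a filter over candidate operations. This is an illustrative sketch, not the disclosed method: the operation names, the millisecond units, and the decision to keep operations that have no recorded measurement are assumptions.

```python
def prune_search_space(candidate_ops, latency_ms, threshold_ms):
    """Split candidate operations into kept ones and excluded ones.

    candidate_ops: operation names proposed for the search space.
    latency_ms: measured latency per operation on the target hardware.
    threshold_ms: operations slower than this are excluded.
    Returns (kept, reasons), where reasons maps each excluded operation
    to a short explanation of why it was dropped.
    """
    kept, reasons = [], {}
    for op in candidate_ops:
        measured = latency_ms.get(op)
        if measured is not None and measured > threshold_ms:
            reasons[op] = f"latency {measured} ms exceeds {threshold_ms} ms"
        else:
            kept.append(op)  # unmeasured operations are kept (assumption)
    return kept, reasons
```

Recording a reason for each exclusion makes it possible to report why an operation was left out of the search space.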

Example methods disclosed herein provide a user with a generated model and the reasons behind the selections made at each selection operation. Decisions are based on the collected historical data and task knowledge obtained from the knowledge builder (KB). Providing the reasons for decisions may yield insight for future hardware improvements (e.g., optimizing a particular core, memory bandwidth, etc.).

FIG. 22 is a block diagram of an example system implemented in accordance with the teachings of the present disclosure for data-enhanced automated model generation. The example system 2200 of FIG. 22 includes: knowledge builder circuit 2205, which receives user input 2210; and model builder circuit 2215, which builds a model and provides the model to target hardware 2220.

The example system 2200 of FIG. 22 presents an end-to-end solution that receives information (objectives, tasks, target hardware) from a user, analyzes this information using a knowledge base, and builds a search space and an initial configuration suggestion for a NAS method. This approach is agnostic to the NAS method used, enabling the user to choose the state-of-the-art method that will receive the proposed configuration.

Example user input 2210 includes information including, for example, a goal of the machine learning model, a task to be performed by the machine learning model, and, optionally, one or more characteristics of target hardware on which the machine learning model is to be executed. Tasks (object recognition, segmentation, etc.) will include input layer requirements, output layer requirements, and data requirements. The system of FIG. 22 is flexible enough that a user can provide information for influencing model generation (e.g., by specifying whether a current task is similar to another task, and/or by specifying additional layers (not in the knowledge base, or associated with a different task) to include in the search space).

The knowledge builder circuit 2205 of FIG. 22 can be instantiated (e.g., created to exist for any length of time, materialized, implemented, etc.) by processor circuit executing instructions, such as a central processing unit. Additionally or alternatively, the knowledge builder circuit 2205 of fig. 22 can be instantiated (e.g., create an instance thereof, exist for any length of time, materialize, implement, etc.) by an ASIC or FPGA configured to perform operations corresponding to the instructions. It should be appreciated that some or all of the circuitry of fig. 22 may thus be instantiated at the same or different times (and/or by different hardware circuitry). Some or all of the circuitry may be instantiated, for example, in one or more threads executing concurrently on hardware and/or executing serially on hardware. Further, in some examples, some or all of the circuitry of fig. 22 may be implemented by one or more virtual machines and/or containers executing on a microprocessor.

The example knowledge builder circuit 2205 of the illustrated example of FIG. 22 includes a request accessor circuit 2230, a hardware data coordination circuit 2235, a task data coordination circuit 2240, and a knowledge data store 2245. The example knowledge builder circuit 2205 archives information for the model and hardware into the knowledge data store 2245. If the hardware is not known in knowledge data store 2245, the user can cause the system to execute on target hardware 2220 to extract performance metrics. Reports of such performance metrics are obtained and added to knowledge data store 2245 to build task knowledge. If the task is not in knowledge data store 2245, task data coordination circuit 2240 creates task knowledge for the new task. FIG. 23 illustrates a process for creating or updating knowledge data store 2245.

In the examples disclosed herein, knowledge data store 2245 of knowledge builder circuit 2205 may be pre-populated with state-of-the-art (SOTA) or custom models and hardware configurations. In addition, knowledge data store 2245 may be updated at any time based on, for example, statistics collected by target hardware 2220. In the examples disclosed herein, knowledge data store 2245 separates models by task. To build task knowledge, model information is retrieved from knowledge data store 2245 for a specific task, and features are extracted from the model. In the case of new or custom tasks, similar tasks/models are retrieved based on user input. These features include, but are not limited to, the framework used to train the model, the hardware specification and any information (latency, etc.) used to map the model, including hardware telemetry, performance goals, sequence of operations, number of FLOPs, data sets used, number of layers, etc. These features may then be ranked by hardware feature, goal, etc. The extracted and ranked features are then considered to be task knowledge, which may then be archived in knowledge data store 2245 for future use.
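As a minimal sketch of how task knowledge might be assembled from archived model records, the following groups the records for one task by hardware and ranks each group by a chosen metric. The record fields, the `MODEL_RECORDS` contents, and the function `build_task_knowledge` are assumptions made for illustration; the disclosure does not prescribe a schema or a ranking rule.

```python
from collections import defaultdict

# Hypothetical archived records; field names and values are illustrative only.
MODEL_RECORDS = [
    {"task": "segmentation", "model": "model_a", "hardware": "cpu",
     "latency_ms": 40.0, "flops": 3.1e9, "layers": 50},
    {"task": "segmentation", "model": "model_b", "hardware": "cpu",
     "latency_ms": 25.0, "flops": 1.8e9, "layers": 34},
    {"task": "segmentation", "model": "model_c", "hardware": "gpu",
     "latency_ms": 6.0, "flops": 3.1e9, "layers": 50},
    {"task": "detection", "model": "model_d", "hardware": "cpu",
     "latency_ms": 18.0, "flops": 0.9e9, "layers": 20},
]


def build_task_knowledge(records, task, rank_key="latency_ms"):
    """Group one task's records by hardware and rank each group (lower is better)."""
    by_hardware = defaultdict(list)
    for record in records:
        if record["task"] == task:
            by_hardware[record["hardware"]].append(record)
    return {
        hardware: sorted(group, key=lambda r: r[rank_key])
        for hardware, group in by_hardware.items()
    }


knowledge = build_task_knowledge(MODEL_RECORDS, "segmentation")
for hardware, ranked in knowledge.items():
    print(hardware, [r["model"] for r in ranked])
```

Passing a different `rank_key` (for example `"flops"`) would rank the same extracted features against a different goal.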

The example request accessor circuit 2230 of the illustrated example of FIG. 22 receives a request to generate a model to perform a selected task. In the examples disclosed herein, user input 2210 received by request accessor circuit 2230 includes information including, for example, a goal of the machine learning model, a task to be performed by the machine learning model, and, in some examples, one or more characteristics of the target hardware on which the machine learning model is to be executed. The request may be formatted, for example, as a request received at a web server or as a request in a structured data format (e.g., JavaScript Object Notation (JSON) format, Extensible Markup Language (XML) format, etc.). The example request accessor circuit 2230 accesses hardware data coordination information via the hardware data coordination circuit 2235 and task data coordination information via the task data coordination circuit 2240. The accessed information (if available) and the request are provided to search space management circuit 2260 of model builder circuit 2215.
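For illustration, a request in JSON format could be parsed as sketched below. The field names (`task`, `objective`, `target_hardware`, `extra_layers`), the `ModelRequest` type, and the `parse_request` function are hypothetical; the disclosure states only that the request may carry a goal, a task, and optionally target hardware characteristics.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelRequest:
    """Hypothetical in-memory form of a model generation request."""
    task: str
    objective: str
    target_hardware: Optional[str] = None  # optional in some examples
    extra_layers: List[str] = field(default_factory=list)


def parse_request(payload: str) -> ModelRequest:
    """Parse a JSON request, requiring the task and objective fields."""
    data = json.loads(payload)
    for key in ("task", "objective"):
        if key not in data:
            raise ValueError(f"missing required field: {key}")
    return ModelRequest(
        task=data["task"],
        objective=data["objective"],
        target_hardware=data.get("target_hardware"),
        extra_layers=list(data.get("extra_layers", [])),
    )


request = parse_request(
    '{"task": "segmentation", "objective": "latency", '
    '"target_hardware": "example-cpu"}'
)
print(request)
```

Treating the target hardware as optional mirrors the text, where hardware characteristics are supplied only in some examples.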

In some examples, the apparatus includes means for accessing the request. For example, the means for accessing may be implemented by the request accessor circuit 2230. In some examples, the request accessor circuit 2230 may be instantiated by a processor circuit, such as the example processor circuit 2612 of fig. 26. For example, request accessor circuit 2230 may be instantiated via the example general purpose processor circuit 4800 of FIG. 48 executing machine-executable instructions, such as implemented by at least block 2410 of FIG. 24. In some examples, the request accessor circuit 2230 may be instantiated by hardware logic circuitry, which may be implemented by the ASIC or FPGA circuit 4900 of fig. 49 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, request accessor circuit 2230 may be instantiated by any other combination of hardware, software, and/or firmware. For example, request accessor circuit 2230 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions, without executing software or firmware, although other structures may be equally suitable.

The example hardware data coordination circuitry 2235 of the illustrated example of fig. 22 determines whether there is any prior knowledge in the knowledge data store 2245 for the selected hardware (e.g., the selected hardware identified in the request accessed by the request accessor circuitry 2230). If no prior knowledge is known for the selected hardware, the example hardware data coordination circuitry 2235 adds an identification of the selected hardware to the knowledge data store 2245. The identification of hardware enables subsequent performance metrics associated with the selected hardware to be stored in the knowledge data store 2245 in an organized manner. In some examples, the identification of the selected hardware may be omitted prior to model creation and may instead be performed when the performance metrics are provided to the knowledge data store by the performance statistics collection circuit 2285.

The example task data orchestration circuit 2240 of the illustrated example of fig. 22 determines whether any task information is available for the selected task. If no prior knowledge is available for the selected task, the example task data orchestration circuit 2240 adds an identification of the selected task to the knowledge data store 2245. The identification of the selected task enables subsequent performance metrics associated with the selected task to be stored in the knowledge data store 2245 in an organized manner. In some examples, the identification of the selected task may be omitted prior to model creation and may instead be performed when the performance metrics are provided to the knowledge data store by the performance statistics collection circuit 2285.
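The lookup-then-register behavior described for hardware and tasks can be sketched with a small in-memory store. The `KnowledgeStore` class and its method names are assumptions for illustration and stand in for whatever storage backs knowledge data store 2245.

```python
class KnowledgeStore:
    """In-memory stand-in for a knowledge data store (illustrative only)."""

    def __init__(self):
        self.hardware = {}  # hardware id -> list of performance metric records
        self.tasks = {}     # task id -> list of performance metric records

    def ensure_hardware(self, hardware_id):
        """Return True if the hardware was already known; register it otherwise."""
        known = hardware_id in self.hardware
        self.hardware.setdefault(hardware_id, [])
        return known

    def ensure_task(self, task_id):
        """Return True if the task was already known; register it otherwise."""
        known = task_id in self.tasks
        self.tasks.setdefault(task_id, [])
        return known

    def add_metrics(self, hardware_id, task_id, metrics):
        """File one metrics record under its hardware and task, registering lazily."""
        self.ensure_hardware(hardware_id)
        self.ensure_task(task_id)
        self.hardware[hardware_id].append(metrics)
        self.tasks[task_id].append(metrics)


store = KnowledgeStore()
first_seen = store.ensure_hardware("example-accelerator")   # not known yet
second_seen = store.ensure_hardware("example-accelerator")  # now known
store.add_metrics("example-accelerator", "segmentation", {"latency_ms": 12.5})
print(first_seen, second_seen, store.tasks)
```

Because `add_metrics` registers identifiers on demand, the sketch also covers the variant in which identification is deferred until performance metrics are reported.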

In some examples, the apparatus includes means for generating task knowledge. For example, the means for generating task knowledge may be implemented by the example task data orchestration circuit 2240. In some examples, the example task data orchestration circuit 2240 may be instantiated by a processor circuit, such as the example processor circuit 2612 of fig. 26. For example, the example task data orchestration circuit 2240 may be instantiated via the example general purpose processor circuit 4800 of fig. 48 executing machine-executable instructions, e.g., implemented by at least blocks 2420, 2435, 2425 of fig. 24. In some examples, the example task data orchestration circuit 2240 may be instantiated by hardware logic, which may be implemented by the ASIC or FPGA circuit 4900 of fig. 49 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, the example task data orchestration circuit 2240 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the example task data orchestration circuit 2240 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions, without executing software or firmware, although other structures are equally suitable.

The example knowledge data store 2245 of the illustrated example of fig. 22 is implemented by any memory, storage device, and/or storage disk for storing data, such as flash memory, magnetic media, optical media, solid state memory, hard disk drive(s), thumb drive(s), and so forth. Further, the data stored in the example knowledge data store 2245 may take any data format, such as binary data, comma separated data, tab separated data, structured Query Language (SQL) constructs, and so forth. Although in the illustrated example, the example knowledge data store 2245 is illustrated as a single device, the example knowledge data store 2245 and/or any other data storage device described herein may be implemented by any number and/or any type(s) of memory. In the illustrated example of fig. 22, the example knowledge data store 2245 stores hardware and/or task knowledge.

The model builder circuit 2215 of FIG. 22 can be instantiated (e.g., created to exist for any length of time, materialized, implemented, etc.) by processor circuit executing instructions, such as a central processing unit. Additionally or alternatively, the model builder circuit 2215 of fig. 22 can be instantiated (e.g., create an instance thereof, exist for any length of time, materialize, implement, etc.) by an ASIC or FPGA configured to perform operations corresponding to the instructions. As noted above, it should be appreciated that some or all of the circuitry of fig. 22 may thus be instantiated at the same or different times (and/or by different hardware circuitry). Some or all of the circuitry may be instantiated, for example, in one or more threads executing concurrently on hardware and/or executing serially on hardware. Further, in some examples, some or all of the circuitry of fig. 22 may be implemented by one or more virtual machines and/or containers executing on a microprocessor.

The example model builder circuit 2215 of the illustrated example of FIG. 22 includes a search space management circuit 2260, an anchor point inserter circuit 2265, a neural architecture search circuit 2270, and a model outputter circuit 2275. Model builder circuit 2215 is responsible for extracting insights from the knowledge data store and performing neural architecture searches to identify optimal models. First, the example search space management circuit 2260 creates a search space. This search space includes operations provided by task knowledge from the knowledge data store, variants of these operations, and additional layers if specified by the user. Neural architecture search circuit 2270 performs a search that is initialized using the configuration identified by search space management circuit 2260 for the objective, task, hardware, etc. Anchor points are inserted in the selected NAS algorithm by anchor point inserter circuit 2265 to capture decisions made during this process. Task knowledge is incorporated in the training loop of neural architecture search circuit 2270 to inform decisions and guide searches. During training, historical decisions, confidence levels, and recommendations derived from task knowledge in the knowledge data store are used to guide the neural architecture search.
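A minimal sketch of search space creation follows, combining operations taken from task knowledge, variants of those operations, and optional user-specified layers. Treating kernel size as the only kind of variant, and the names `create_search_space`, `conv`, `sep_conv`, and `skip_connect`, are assumptions made for illustration.

```python
from itertools import product


def create_search_space(task_ops, kernel_sizes, extra_layers=()):
    """Enumerate operation variants, then append user-specified layers once each."""
    space = [f"{op}_{k}x{k}" for op, k in product(task_ops, kernel_sizes)]
    for layer in extra_layers:
        if layer not in space:
            space.append(layer)
    return space


space = create_search_space(
    task_ops=["conv", "sep_conv"],   # operations assumed to come from task knowledge
    kernel_sizes=[3, 5],             # variants of those operations
    extra_layers=["skip_connect"],   # additional layer specified by the user
)
print(space)
```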

In some examples, the apparatus includes means for creating a search space. For example, the means for creating may be implemented by the example search space management circuit 2260. In some examples, the example search space management circuit 2260 may be instantiated by a processor circuit, such as the example processor circuit 2612 of fig. 26. For example, the example search space management circuit 2260 may be instantiated via the example general purpose processor circuit 2600 of fig. 26 executing machine-executable instructions, such as implemented by at least blocks 2427, 2440 of fig. 24. In some examples, the example search space management circuit 2260 may be instantiated by hardware logic circuitry, which may be implemented by the ASIC or FPGA circuit 4900 of fig. 49 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, the example search space management circuit 2260 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the example search space management circuit 2260 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, Application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions, without executing software or firmware, although other structures may be equally suitable.

In some examples, the apparatus includes means for generating a machine learning model. For example, the means for generating may be implemented by the example neural architecture search circuit 2270. In some examples, the example neural architecture search circuit 2270 may be instantiated by a processor circuit, such as the example processor circuit 2612 of fig. 26. For example, the example neural architecture search circuit 2270 may be instantiated by the example general purpose processor circuit 4800 of fig. 48 executing machine-executable instructions, such as implemented by at least blocks 2430, 2450 of fig. 24. In some examples, the example neural architecture search circuit 2270 may be instantiated by hardware logic circuitry, which may be implemented by the ASIC or FPGA circuit 4900 of fig. 49 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, the example neural architecture search circuit 2270 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the example neural architecture search circuit 2270 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions, without executing software or firmware, although other structures may be equally suitable.

In some examples, the apparatus includes means for inserting. For example, the means for inserting may be implemented by an example anchor point inserter circuit 2265. In some examples, the example anchor point inserter circuit 2265 may be instantiated by a processor circuit, such as the example processor circuit 2612 of fig. 26. For example, the example anchor point inserter circuit 2265 may be instantiated by the example general purpose processor circuit 4800 of fig. 48 executing machine-executable instructions, such as implemented by at least block 2460 of fig. 24. In some examples, example anchor point inserter circuit 2265 may be instantiated by hardware logic circuitry, which may be implemented by ASIC or FPGA circuit 4900 of fig. 49 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, the example anchor point inserter circuit 2265 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the example anchor point inserter circuit 2265 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, application Specific Integrated Circuits (ASICs), comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions, without executing software or firmware, although other structures may be equally suitable.

After the model is generated, the example model outputter circuit 2275 provides the model for execution. In some examples, the decisions made during the neural architecture search, and/or the reasons for those decisions, are provided in association with the generated model.

The target hardware 2220 of fig. 22 may be instantiated (e.g., created to exist for any length of time, embodied, implemented, etc.) by processor circuitry executing instructions, such as a central processing unit. Additionally or alternatively, the target hardware 2220 of fig. 22 may be instantiated (e.g., create an instance thereof, exist for any length of time, materialize, implement, etc.) by an ASIC or FPGA configured to perform operations corresponding to the instructions. As noted above, it should be appreciated that some or all of the circuitry of fig. 22 may thus be instantiated at the same or different times (and/or by different hardware circuitry). Some or all of the circuitry may be instantiated, for example, in one or more threads executing concurrently on hardware and/or executing serially on hardware. Further, in some examples, some or all of the circuitry of fig. 22 may be implemented by one or more virtual machines and/or containers executing on a microprocessor.

The example target hardware 2220 of the illustrated example of fig. 22 includes model execution circuitry 2280 and execution performance statistics collection circuitry 2285. The example model execution circuit 2280 of the illustrated example of fig. 22 executes the model provided by the model outputter circuit 2275.

During execution of the model by model execution circuit 2280, example execution performance statistics collection circuit 2285 of the illustrated example of fig. 22 uses the inserted anchor points to collect model execution statistics. The collected performance statistics are provided to knowledge data store 2245. In examples disclosed herein, the collected performance statistics include information about anchor points. The inclusion of information about anchor points enables the use of feature-specific statistics in generating task knowledge.
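One way to picture anchor-based statistics collection is a named timing region around each instrumented part of the model, as in the sketch below. This is only an assumed realization: the disclosure does not say how anchor points are implemented, and the `AnchorRecorder` class, the anchor names, and the placeholder workloads are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class AnchorRecorder:
    """Collects per-anchor timing samples (hypothetical helper)."""

    def __init__(self):
        self.samples = defaultdict(list)  # anchor name -> elapsed seconds

    @contextmanager
    def anchor(self, name):
        """Time the enclosed region and file the sample under the anchor name."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[name].append(time.perf_counter() - start)

    def summary(self):
        """Return the sample count and mean duration for each anchor."""
        return {
            name: {"count": len(values), "mean_s": sum(values) / len(values)}
            for name, values in self.samples.items()
        }


recorder = AnchorRecorder()
for _ in range(3):
    with recorder.anchor("block1.conv"):
        sum(i * i for i in range(1000))  # placeholder for a model operation
with recorder.anchor("block2.pool"):
    max(range(1000))                     # placeholder for a model operation
stats = recorder.summary()
print(stats)
```

Keying the samples by anchor name is what would allow feature-specific statistics to be reported back alongside whole-model statistics.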

FIG. 23 is a block diagram of an example process flow utilizing the example system of FIG. 22. The example process begins when a user submits a request to generate a model to perform a selected task. (block 2310). The requested model is generated using the neural architecture search and prior knowledge of models associated with the selected task. (block 2320). The generated model is provided to the target hardware for execution and collection of performance statistics. (block 2330). Execution features are extracted from the model. (block 2340). The extracted features are ranked based on the collected performance metrics. (block 2350). The extracted features and their associated performance metrics are added to knowledge data store 2245. (block 2360). This added knowledge can then be used for future model generation. (block 2320).
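The feedback loop in this process flow can be sketched as follows: generate a model using prior knowledge, execute it, extract features, rank them, and store them for the next request. The candidate list, the fixed latency table, and the functions `generate`, `execute`, and `run_cycle` are stand-ins invented for illustration; a real system would run a neural architecture search and measure the target hardware.

```python
# Hypothetical candidate operation sequences and their pretend measured latencies.
CANDIDATES = [("conv3x3", "pool"), ("conv5x5", "pool"), ("conv3x3", "conv3x3")]
FAKE_LATENCY_MS = {CANDIDATES[0]: 5.0, CANDIDATES[1]: 9.0, CANDIDATES[2]: 4.0}


def generate(task, prior):
    """Stand-in for model generation: try an untested candidate, else reuse the best."""
    tried = {entry["ops"] for entry in prior}
    for ops in CANDIDATES:
        if ops not in tried:
            return {"task": task, "ops": ops}
    return {"task": task, "ops": prior[0]["ops"]}


def execute(model):
    """Stand-in for running the model on target hardware and collecting statistics."""
    return {"latency_ms": FAKE_LATENCY_MS[model["ops"]]}


def run_cycle(store, task):
    """One pass: generate, execute, extract features, rank, and store."""
    prior = store.get(task, [])
    model = generate(task, prior)
    metrics = execute(model)
    features = {"ops": model["ops"], **metrics}          # extracted features
    store.setdefault(task, []).append(features)
    store[task].sort(key=lambda f: f["latency_ms"])      # ranked, best first
    return model


store = {}
for _ in range(3):
    run_cycle(store, "detection")
print(store["detection"])
```

After three cycles the store holds all three candidates ranked by latency, and a further request would start from the best known sequence rather than from scratch.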

Although an example manner of implementing the example knowledge builder circuit 2205 and/or the example model builder circuit 2215 is illustrated in fig. 22, one or more of the elements, processes, and/or devices illustrated in fig. 22 may be combined, divided, rearranged, omitted, eliminated, and/or implemented in any other way. In addition, the example request accessor circuit 2230, the example hardware data coordination circuit 2235, the example task data coordination circuit 2240, and/or, more generally, the example knowledge builder circuit 2205, and/or the example search space management circuit 2260, the example anchor point inserter circuit 2265, the example neural architecture search circuit 2270, the example model outputter circuit 2275, and/or, more generally, the example model builder circuit 2215 of fig. 22, may be implemented in hardware alone or in combination with software and/or firmware. Thus, for example, any of the example request accessor circuit 2230, the example hardware data coordination circuit 2235, the example task data coordination circuit 2240, and/or, more generally, the example knowledge builder circuit 2205, and/or the example search space management circuit 2260, the example anchor point inserter circuit 2265, the example neural architecture search circuit 2270, the example model outputter circuit 2275, and/or, more generally, the example model builder circuit 2215 of fig. 22, may be implemented by processor circuit(s), analog circuit(s), digital circuit(s), logic circuit(s), programmable processor(s), programmable microcontroller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), and/or field programmable logic device(s) (FPLD(s)) (e.g., field programmable gate array(s) (FPGA(s))).
Further, the example request accessor circuit 2230, the example hardware data coordination circuit 2235, the example task data coordination circuit 2240, and/or, more generally, the example knowledge builder circuit 2205 of fig. 22, and/or the example search space management circuit 2260, the example anchor point inserter circuit 2265, the example neural architecture search circuit 2270, the example model outputter circuit 2275, and/or, more generally, the example model builder circuit 2215 of fig. 22, may include one or more elements, processes, and/or devices in addition to or instead of those shown in fig. 22, and/or may include more than one of any or all of the illustrated elements, processes, and devices.

A flowchart representative of example hardware logic circuits, machine readable instructions, hardware implemented state machines, and/or any combination of these for implementing the knowledge builder circuit 2205 and/or the example model builder circuit 2215 of fig. 22 is shown in fig. 24. The machine-readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a processor circuit, such as the processor circuit 2612 shown in the example processor platform 2600 discussed below in connection with fig. 26 and/or the example processor circuit discussed below in connection with fig. 48 and/or fig. 49.

A flowchart representative of example hardware logic circuits, machine readable instructions, hardware implemented state machines, and/or any combination of these to implement the target hardware 2220 of fig. 22 is shown in fig. 25. The machine-readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a processor circuit, such as the processor circuit 2612 shown in the example processor platform 2600 discussed below in connection with fig. 26 and/or the example processor circuit discussed below in connection with fig. 48 and/or fig. 49.

The program of fig. 24 and/or 25 may be embodied in software stored on one or more non-transitory computer readable storage media, such as Compact Discs (CDs), floppy discs, hard Disc Drives (HDDs), solid State Drives (SSDs), digital Versatile Discs (DVDs), blu-ray discs, volatile memory (e.g., any type of Random Access Memory (RAM), etc.), or non-volatile memory (e.g., electrically erasable programmable read-only memory (EEPROM), flash memory, HDD, SSD, etc.), associated with processor circuitry located in one or more hardware devices, although the entire program and/or a portion thereof may be executed by one or more hardware devices other than processor circuitry and/or embodied in firmware or dedicated hardware. The machine-readable instructions may be distributed across multiple hardware devices and/or executed by two or more hardware devices (e.g., a server and a client hardware device). For example, the client hardware device may be implemented by an endpoint client hardware device (e.g., a hardware device associated with a user) or an intermediary client hardware device (e.g., a Radio Access Network (RAN) gateway that may facilitate communications between the server and the endpoint client hardware device). Similarly, the non-transitory computer readable storage medium may include one or more media located in one or more hardware devices. Additionally, while the example program is described with reference to the flowchart shown in fig. 24, many other methods of implementing the example knowledge builder circuit 2205 and/or the example model builder circuit 2215 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. 
Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the respective operations without executing software or firmware. The processor circuits may be distributed in different network locations and/or local to one or more hardware devices (e.g., a single core processor (e.g., a single core Central Processing Unit (CPU)), a multi-core processor (e.g., a multi-core CPU), etc.), multiple processors distributed across multiple servers of a server rack, multiple processors distributed across one or more server racks, CPUs and/or FPGAs located in the same package (e.g., the same Integrated Circuit (IC) package or in two or more separate housings, etc.).

As described above, the example operations of fig. 24 and/or 25 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on one or more non-transitory computer and/or machine readable media, such as optical storage devices, magnetic storage devices, HDDs, flash memory, read-only memory (ROM), CDs, DVDs, caches, any type of RAM, registers, and/or any other storage device or storage disk where information may be stored for any duration (e.g., for a longer period of time, permanently stored, temporarily stored, used for temporary buffering, and/or used for caching of the information). As used herein, the terms non-transitory computer readable medium and non-transitory computer readable storage medium are expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.

FIG. 24 is a flowchart representative of example machine readable instructions and/or example operations 2400 that may be executed and/or instantiated by the processor circuit to implement the example knowledge builder circuit and the example model builder circuit of FIG. 22. The machine-readable instructions and/or operations 2400 of FIG. 24 begin when request accessor circuit 2230 receives a request to generate a model to perform a selected task. (block 2410). In the examples disclosed herein, user input 2210 received by request accessor circuit 2230 includes information including, for example, a goal of the machine learning model, a task to be performed by the machine learning model, and, in some examples, one or more characteristics of the target hardware on which the machine learning model is to be executed. The request may be formatted, for example, as a request received at a web server or as a request in a structured data format (e.g., JavaScript Object Notation (JSON) format, Extensible Markup Language (XML) format, etc.). The example request accessor circuit 2230 accesses hardware data coordination information via the hardware data coordination circuit 2235 and task data coordination information via the task data coordination circuit 2240. The accessed information (if available) and the request are provided to search space management circuit 2260 of model builder circuit 2215.

The example hardware data coordination circuitry 2235 determines whether any prior knowledge exists in the knowledge data store 2245 regarding the selected hardware. (block 2412). If no prior knowledge is known for the selected hardware (e.g., block 2412 returns a "no" result), the example hardware data coordination circuitry 2235 adds an identification of the selected hardware to the knowledge data store 2245. (block 2414). The identification of hardware enables subsequent performance metrics associated with the selected hardware to be stored in the knowledge data store 2245 in an organized manner. In some examples, the identification of the selected hardware may be omitted prior to model creation and may instead be performed when the performance metrics are provided to the knowledge data store by the performance statistics collection circuit 2285.

The example task data orchestration circuit 2240 determines whether any task information is available for the selected task. (block 2420). If no prior knowledge is available for the selected task (e.g., block 2420 returns a "no" result), the example task data orchestration circuit 2240 adds an identification of the selected task to the knowledge data store 2245. (block 2425). The identification of the selected task enables subsequent performance metrics associated with the selected task to be stored in the knowledge data store 2245 in an organized manner. In some examples, the identification of the selected task may be omitted prior to model creation and may instead be performed when the performance metrics are provided to the knowledge data store by the performance statistics collection circuit 2285. The example search space management circuit 2260 creates a search space based on building blocks that the user selects from those available for the task, or on building blocks from state-of-the-art architecture(s). (block 2427). In this way, a search space is created, but it is not based on specific prior task knowledge (as described below in connection with block 2440). In some examples, whether the user is permitted to select available building blocks (and/or whether to use state-of-the-art architecture(s) for the task) may be configured by a policy.

The example NAS search circuit 2270 performs a neural architecture search to generate a model using the search space. (block 2430). In the illustrated example of fig. 24, NAS search circuit 2270 starts from an uninitialized state. That is, prior knowledge about the performance of various tasks and/or the hardware on which the tasks are to be performed is not used when performing the neural architecture search of block 2430.

Returning to block 2420, if the task data orchestration circuit 2240 determines that prior knowledge exists for the selected task (e.g., block 2420 returns a "yes" result), the example task data orchestration circuit 2240 builds task knowledge. (block 2435). To construct task knowledge, model information is retrieved by the task data orchestration circuit 2240 from the knowledge data store 2245 for a particular task, and features are extracted from the model. In the case of new or custom tasks, similar tasks/models are retrieved based on user input. These features include, but are not limited to, a framework for training the model, hardware specifications, and/or any information (latency, etc.) for mapping the model including hardware telemetry, performance goals, sequence of operations, number of FLOPs, data sets used, number of layers, etc. These features are then ranked by hardware, goal, etc. The individual features extracted from the model(s) and ranked are collectively identified as task knowledge, which is then used to create a search space. In some examples, this task knowledge is archived in knowledge data store 2245 to allow efficient retrieval if the same task is later requested.

The example search space management circuit 2260 creates a search space from prior task knowledge. (block 2440). The search space may be created by, for example, ranking and selecting previous architectures that have an acceptable level of performance on (and/or similar to) the target hardware. In some examples, performance statistics stored in knowledge data store 2245 associated with different architectures and tasks are compared to select an architecture that satisfies the threshold performance statistics. In some examples, the performance statistics on which the selection is based may depend on user input 2210, which may, for example, indicate whether the power consumption statistics are to take precedence over the processing speed statistics.
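
The ranking-and-threshold selection described for block 2440 can be illustrated with a minimal Python sketch. The record fields, the two thresholds, and the `prefer` switch (standing in for the user input 2210 priority between power and speed) are assumptions made for illustration; the disclosure does not prescribe a data layout or a specific ranking rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchRecord:
    """Hypothetical knowledge-store entry for a previously evaluated architecture."""
    name: str
    hardware: str
    latency_ms: float
    power_w: float


def build_search_space(records, target_hw, max_latency_ms, max_power_w, prefer="latency"):
    """Keep architectures measured on the target hardware that satisfy both
    thresholds, ranked best-first by the preferred statistic."""
    if prefer == "power":
        key = lambda r: r.power_w
    else:
        key = lambda r: r.latency_ms
    eligible = [
        r for r in records
        if r.hardware == target_hw
        and r.latency_ms <= max_latency_ms
        and r.power_w <= max_power_w
    ]
    return sorted(eligible, key=key)


# Illustrative, made-up records (not measurements from the disclosure).
records = [
    ArchRecord("arch_a", "gpu0", 12.0, 30.0),
    ArchRecord("arch_b", "gpu0", 8.0, 45.0),
    ArchRecord("arch_c", "cpu0", 20.0, 15.0),
    ArchRecord("arch_d", "gpu0", 40.0, 10.0),
]
space = build_search_space(records, "gpu0", max_latency_ms=15.0, max_power_w=50.0)
```

Switching `prefer` to `"power"` reorders the same eligible set, which mirrors the case in which power consumption statistics take precedence over processing speed statistics.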

In some examples, the selection of a prioritization (e.g., a prioritization of functions, performance, power optimization, etc.) may be guided by a policy. For example, policies may be provided by a policy providing entity to control training operations and/or search space management behavior. In some examples, policies control other details regarding the creation and/or training of models, including, for example, different levels of neural network sparsity (e.g., 260%, 90%, etc.), different levels of accuracy (e.g., thirty-two bit floating point values, sixteen bit floating point values, eight bit integer values, etc.).

In some examples, the policy providing entity may be a user of the system of FIG. 22. However, the policy providing entity may be any other entity that directs the functionality of the system of FIG. 22, including, for example, a system administrator, manufacturer, device provider, and so forth. In some examples, the policy providing entity may be separate from the user. In this way, a user can enter a request to train and/or create a machine learning model, while the parameters that govern the training and/or creation of the machine learning model are set by policies created by the policy providing entity.

In some examples, policies are provisioned to the system of fig. 22 by a policy providing entity via a platform Trusted Execution Environment (TEE). However, policies may be provided to the system of FIG. 22 in any other manner.

The example NAS search circuitry 2270 uses neural architecture searches to generate models based on the search space created by the search space management circuitry 2260. (block 2450). In this way, the neural architecture search performed by NAS search circuit 2270 at block 2450 begins from an initialized state (e.g., from an architecture that previously met a performance threshold) based on previous task knowledge.

The example anchor point inserter circuit 2265 then inserts the anchor point into the generated model. (block 2460). The anchor point provides a location where performance statistics are to be measured by the performance statistics collection circuit 2285. Furthermore, the anchor points provide locations that can be used to capture additional information about the model and/or the goals/tasks of the model. In examples disclosed herein, anchor points are inserted in the middle of layers of the generated model. In some examples, anchor points are added to the model before the first layer and after the last layer of the model. In some other examples, anchor points are added adjacent (e.g., before and after) a particular type of layer (e.g., a convolutional layer).
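
The three anchor placements described above (between layers, before the first and after the last layer, and adjacent to a particular layer type) can be sketched as follows. The model is represented as a plain list of layer-type strings and the `ANCHOR` marker is a hypothetical placeholder; a real model representation would differ.

```python
ANCHOR = "ANCHOR"  # hypothetical marker for a measurement point


def insert_anchors(layers, mode="between", layer_type=None):
    """Return a new layer list with anchor markers added.

    mode="between": one anchor between each pair of consecutive layers
    mode="ends":    one anchor before the first and after the last layer
    mode="around":  anchors before and after every layer of `layer_type`
    """
    if mode == "between":
        out = []
        for i, layer in enumerate(layers):
            if i > 0:
                out.append(ANCHOR)
            out.append(layer)
        return out
    if mode == "ends":
        return [ANCHOR, *layers, ANCHOR]
    if mode == "around":
        out = []
        for layer in layers:
            if layer == layer_type:
                out.extend([ANCHOR, layer, ANCHOR])
            else:
                out.append(layer)
        return out
    raise ValueError(f"unknown mode: {mode}")


model = ["conv", "relu", "dense"]
```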

The example model outputter circuit 2275 provides the generated model to the target hardware 2220 for execution by the model execution circuit 2280. (block 2470). In examples disclosed herein, the model may first be stored in a storage location (e.g., a server) before being provided to model execution circuitry 2280. In some examples, model execution circuitry 2280 may retrieve the model from the storage location or directly from model outputter circuitry 2275. The process of the illustrated example of fig. 24 then terminates, but may be re-executed, for example, upon receipt of a subsequent user input 2210.

FIG. 25 is a flowchart representative of example machine readable instructions and/or example operations 2500 that can be executed and/or instantiated by processor circuitry to implement the example target hardware 2220 of FIG. 22. The machine-readable instructions and/or operations 2500 of FIG. 25 begin when the model execution circuit 2280 begins executing the model received from the model outputter circuit 2275. (block 2510). During model execution, the example performance statistics collection circuit 2285 uses the inserted anchor points to collect model execution statistics. (block 2520). The collected performance statistics are provided to the knowledge data store 2245. (block 2530). In examples disclosed herein, the collected performance statistics include information about the anchor points. The inclusion of information about the anchor points enables the use of feature-specific statistics in generating task knowledge.
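
As a rough illustration of blocks 2510 to 2530, the sketch below runs a model expressed as a list of callables and records, at each anchor, the time elapsed since the previous anchor. The list-of-callables model, the `"ANCHOR"` marker, and the choice of wall-clock time as the only statistic are assumptions; the disclosure leaves the collected statistics open.

```python
import time


def run_with_anchors(steps, x):
    """Execute `steps` in order. A step is either a callable layer or the
    string "ANCHOR"; at each anchor, record the elapsed time since the
    previous anchor (or since the start of execution)."""
    stats = []
    last = time.perf_counter()
    for index, step in enumerate(steps):
        if isinstance(step, str) and step == "ANCHOR":
            now = time.perf_counter()
            stats.append({"anchor_index": index, "elapsed_s": now - last})
            last = now
        else:
            x = step(x)
    return x, stats


output, stats = run_with_anchors(
    [lambda v: v + 1, "ANCHOR", lambda v: v * 2, "ANCHOR"], 3
)
```

Because each entry carries the anchor's position, the statistics can later be attributed to the layers between two anchors when they are stored.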

FIG. 26 is a block diagram of an example processor platform 2600 configured to execute and/or instantiate the machine readable instructions and/or operations of FIGS. 24 and/or 25 to implement the system 2200 of FIG. 22. The processor platform 2600 may be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cellular telephone, a smart phone, a tablet device such as an iPad™), a personal digital assistant (PDA), an Internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a game console, a personal video recorder, a set-top box, a headset (e.g., an augmented reality (AR) headset, a virtual reality (VR) headset, etc.) or other wearable device, or any other type of computing device.

The processor platform 2600 of the illustrated example includes a processor circuit 2612. The processor circuit 2612 of the illustrated example is hardware. For example, the processor circuit 2612 may be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The processor circuit 2612 may be implemented by one or more semiconductor-based (e.g., silicon-based) devices. In this example, the processor circuit 2612 implements a knowledge builder circuit 2205 and a model builder circuit 2215. In some examples, the knowledge builder circuit 2205 and the model builder circuit 2215 may be implemented on separate processor platforms.

The processor circuit 2612 of the illustrated example includes a local memory 2613 (e.g., a cache, registers, etc.). The processor circuit 2612 of the illustrated example communicates with a main memory including a volatile memory 2614 and a non-volatile memory 2616 over a bus 2618. The volatile memory 2614 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), and/or any other type of RAM device. The non-volatile memory 2616 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 2614, 2616 of the illustrated example is controlled by a memory controller 2617.

The processor platform 2600 of the illustrated example also includes interface circuitry 2620. The interface circuitry 2620 may be implemented in hardware in accordance with any type of interface standard, such as an Ethernet interface, a Universal Serial Bus (USB) interface, a Near Field Communication (NFC) interface, a Peripheral Component Interconnect (PCI) interface, and/or a Peripheral Component Interconnect Express (PCIe) interface.

In the illustrated example, one or more input devices 2622 are connected to the interface circuitry 2620. Input device(s) 2622 allow a user to input data and/or commands into the processor circuit 2612. Input device(s) 2622 may be implemented by, for example, an audio sensor, microphone, camera (still or video), keyboard, buttons, mouse, touch screen, touch pad, trackball, isopoint device, and/or voice recognition system.

One or more output devices 2624 are also connected to the interface circuitry 2620 of the illustrated example. The output device(s) 2624 may be implemented, for example, by a display device (e.g., a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a liquid crystal display (LCD), a cathode ray tube (CRT) display, an in-plane switching (IPS) display, a touch screen, etc.), a haptic output device, a printer, and/or a speaker. The interface circuitry 2620 of the illustrated example thus generally includes a graphics driver card, a graphics driver chip, and/or graphics processor circuitry, such as a GPU.

The interface circuitry 2620 of the illustrated example also includes a communication device, such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface, to facilitate data exchange with external machines (e.g., any kind of computing device) over a network 2626. The communication may be through, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, an optical connection, etc.

The processor platform 2600 of the illustrated example also includes one or more mass storage devices 2628 to store software and/or data. Examples of such mass storage devices 2628 include magnetic storage devices, optical storage devices, floppy disk drives, HDDs, CDs, Blu-ray disc drives, redundant array of independent disks (RAID) systems, solid-state storage devices (such as flash memory devices and/or SSDs), and DVD drives.

The machine-executable instructions 2632, which may be implemented by the machine-readable instructions of fig. 24 and/or 25, may be stored in the mass storage device 2628, in the volatile memory 2614, in the non-volatile memory 2616, and/or on a removable non-transitory computer-readable storage medium such as a CD or DVD.

From the foregoing, it will be apparent that example systems, methods, apparatus, and articles of manufacture have been disclosed that enable neural architecture searches to be performed based on prior knowledge of models created to perform particular tasks. The disclosed systems, methods, apparatus, and articles of manufacture improve the efficiency of using a computing device by avoiding the re-discovery of models that a neural architecture search would otherwise find early on, but that do not work well for the intended task. By starting from prior knowledge, a higher-performing model can be identified more quickly. This not only reduces resource consumption on the target hardware (e.g., a more efficient model may be developed), but also reduces resource consumption on the system that generates the model (e.g., a higher-performing model may be discovered more quickly and/or efficiently). The disclosed systems, methods, apparatus, and articles of manufacture are thus directed to one or more improvements in the operation of machines such as computers or other electronic and/or mechanical devices.

Method and apparatus for conditionally activating large cores in a computing system

Some computing systems include one or more large device processors (e.g., cores) and/or one or more small device processors (e.g., atoms) to perform operations. A large device processor may include one or more cores and/or processing units, while a small device processor may have one or two cores. Furthermore, large device processors are more powerful and/or occupy more space than small device processors. A large device processor may handle high-performance applications, while a small device processor provides lower power consumption, a smaller footprint, and more modest performance than a large device processor. Examples of small device processors include SoCs, LITTLE cores, etc.

Hardware-based microcode (also referred to as hardware-level instructions) may be implemented in the hardware of a computing system (e.g., a computer, a laptop, a mobile phone, a server, an edge device, a cloud-based device, etc.) to configure the hardware of the computing system. In some examples, such hardware-level instructions (e.g., uCode, xuCode, etc.) may control the operation of hardware, including processing devices. If a computing device includes multiple processing devices (e.g., large cores, small cores, atoms, central processing unit (CPU) sockets, CPUs, sockets, etc.), the microcode may facilitate the operation and/or configuration of the multiple processing devices.

As the number and/or types of architectures increase, the difficulty of programming instructions also increases, as instructions may need to be configured separately for each type of architecture. For example, the instruction may be a 2724-bit instruction configured to be executed by hardware capable of processing 2724-bit instructions. Similarly, a system having multiple smaller processing units that process 64-bit instructions will not be able to execute instructions above 64 bits.

Examples disclosed herein provide software and/or firmware based Application Programming Interfaces (APIs) to process instructions of applications running on an operating system, virtual machine manager (VMM), etc., and instruct microcode to configure processing units to be able to execute these instructions, regardless of their structure. For example, if a 512-bit instruction is obtained from an application, examples disclosed herein may configure eight 64-bit processing units to decompose the 512-bit instruction into eight 64-bit instructions, execute the 64-bit instructions in parallel, and combine the results so that the processing units operate as a conditionally active large core (e.g., a large core capable of processing 512-bit instructions). In this way, an application may generate one instruction, and examples disclosed herein may determine whether and/or how to execute the instruction in view of constraints of the computing system via which the instruction is to be executed.
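
The decompose, execute-in-parallel, and combine steps can be modeled in software with a minimal Python sketch. Bitwise XOR is chosen here as an assumption because each 64-bit lane can be computed independently; the thread pool merely stands in for the eight processing units, and none of the function names come from the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor

LANE_BITS = 64
LANE_MASK = (1 << LANE_BITS) - 1


def split_lanes(value, total_bits=512):
    """Split an integer operand into 64-bit lanes, least significant first."""
    return [(value >> (i * LANE_BITS)) & LANE_MASK
            for i in range(total_bits // LANE_BITS)]


def combine_lanes(lanes):
    """Reassemble lanes (least significant first) into one wide integer."""
    result = 0
    for i, lane in enumerate(lanes):
        result |= lane << (i * LANE_BITS)
    return result


def wide_xor(a, b, total_bits=512):
    """Emulate a wide XOR as independent 64-bit XORs, one per worker."""
    a_lanes = split_lanes(a, total_bits)
    b_lanes = split_lanes(b, total_bits)
    with ThreadPoolExecutor(max_workers=len(a_lanes)) as pool:
        out = list(pool.map(lambda pair: pair[0] ^ pair[1], zip(a_lanes, b_lanes)))
    return combine_lanes(out)


a = (1 << 511) | 0xFFFF
b = (1 << 300) | 0xFF
result = wide_xor(a, b)
```

Operations whose lanes depend on one another, such as addition with carries, need an extra combining step and cannot be split this simply.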

The example disclosed APIs obtain ISA instructions from the OS/VMM. ISA instructions are instructions that require multiple processing devices to operate as a single large processing device capable of processing the ISA instructions. When the disclosed API obtains an ISA request from an application to execute ISA instructions (e.g., as an interrupt), the API first determines whether the processing units are capable of and/or available for executing the instructions while meeting a service level agreement (SLA), latency requirements, tolerance requirements, etc., corresponding to the instructions. If the API determines that the processing units are capable of and available for executing the instructions while meeting the requirements, the API instructs the microcode to cause the processing units to execute the instructions as required. If the API determines that the processing units are capable of but not available for executing the instructions, the API may indicate (e.g., to the application) (1) when the processing units will be available (e.g., an approximation of when the currently executing workload will complete) and/or (2) that the large core may be emulated, but may not meet the requirements. In this way, the application may determine whether to wait so that the instructions execute in a manner that meets the requirements, proceed with emulation without meeting one or more of the requirements, or execute the instructions without utilizing the corresponding processing units. If the API determines that the processing units are not capable of executing the instructions, the API indicates (e.g., to the application) that the instructions cannot be executed.
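
The three outcomes described above reduce to a small decision function. The sketch below is an illustration only: the `Outcome` names and the boolean inputs are hypothetical, and the capability and availability checks themselves (including the SLA and latency tests) are assumed to have been evaluated elsewhere.

```python
from enum import Enum


class Outcome(Enum):
    EXECUTE = "execute as a large core"
    OFFER_WAIT_OR_EMULATE = "report availability estimate and/or emulation option"
    REJECT = "report that the instruction cannot be executed"


def triage(capable: bool, available: bool) -> Outcome:
    """Map the capability/availability checks onto the three outcomes."""
    if not capable:
        # Availability is irrelevant if the hardware cannot run the instruction.
        return Outcome.REJECT
    if available:
        return Outcome.EXECUTE
    return Outcome.OFFER_WAIT_OR_EMULATE
```

Checking capability first reflects the text: an incapable processing unit leads to rejection regardless of whether it happens to be idle.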

FIG. 27 is a block diagram of an example computing device 2700. The example computing device 2700 includes example hardware 2702 that includes one or more example cores 2704, one or more example small device processors 2706, example microcode processing circuitry 2711, and example register(s) 2713. The example computing device 2700 also includes an example BIOS 2708 that includes example ISA management circuitry 2710. The example computing device 2700 also includes an example Operating System (OS)/Virtual Machine Manager (VMM) 2707 and an example Application (APP) 2714.

The example hardware 2702 of FIG. 27 performs tasks corresponding to instructions from the application 2714, the OS/VMM 2707, and/or the BIOS 2708. The example hardware 2702 may include processor resources (e.g., memory, register(s), and/or logic circuitry of the example processor core(s) 2704 and/or the small device processor(s) 2706) to execute the instructions of the example application 2714 and/or to access data from memory.

The example processor core(s) 2704 and/or the example small device processor(s) 2706 of FIG. 27 execute instructions (e.g., a workload) from an application (e.g., by reading and/or writing data). Tasks performed on one or more cores 2704 may take a different amount of time to complete and/or have a different efficiency than the same tasks performed on one or more small device processors 2706. For example, when performing a compute-bound task, the one or more cores 2704 may be more efficient in terms of iterations per cycle (IPC). In addition, the one or more cores 2704 may have a larger cache than the small device processor(s) 2706 for performing cache-bound tasks. The one or more small device processors 2706 may be more efficient for memory-bound tasks, which spend more time stalled in the pipeline waiting for memory, and/or for I/O-bound tasks, because I/O-bound tasks do not depend on processing speed. Although the example hardware 2702 includes the core(s) 2704 and the small device processor(s) 2706, the hardware 2702 may include any number and/or type(s) of processing components (e.g., small cores, large cores, threads, etc.). Examples of the small device processor 2706 include SoCs, LITTLE cores, etc. As described above, two or more of the core(s) 2704 and/or the small device processor(s) 2706 may work together (e.g., based on instructions from the ISA management circuit 2710 and/or the microcode processing circuit 2711) to split a large instruction into sub-instructions and execute the sub-instructions on respective processing devices. In this way, the application 2714 and/or the OS/VMM 2707 may send a single instruction that a single core or small device processor cannot execute alone, and the core(s) 2704 and/or the small device processor(s) 2706 may work together as a larger computing device to execute the single instruction.

The example OS/VMM 2707 of FIG. 27 is a software system that manages the example hardware 2702 and software resources and/or provides services for computer programs and/or applications of the computing device 2700. The OS/VMM 2707 of FIG. 27 sends instructions and/or ISA execution requests to the ISA management circuit 2710 to cause the ISA management circuit 2710 to control processing resources (e.g., the core(s) 2704 and/or the small device processor(s) 2706) to operate as a large core. In some examples, the OS/VMM 2707 stores instructions and/or ISA execution requests in the example register(s) 2713 monitored by the ISA management circuit 2710. In this way, the OS/VMM 2707 may cause an interrupt to occur to facilitate ISA execution when new data is placed in the register(s) 2713.

The example BIOS 2708 of FIG. 27 provides low-level control of the hardware 2702 of the computing device 2700. For example, the BIOS 2708 may use the example core(s) 2704 and/or the small device processor(s) 2706 to execute instructions and/or perform operations so that they operate as a large core. The BIOS 2708 may perform hardware initialization and/or provide runtime services for the OS/VMM 2707 and/or other programs. Although the example computing device 2700 of FIG. 27 includes a BIOS 2708, the BIOS 2708 may be replaced with EFI, UEFI, and/or any other type of firmware capable of interfacing between hardware and the OS/VMM 2707. The example BIOS 2708 includes the example ISA management circuitry 2710.

The example ISA management circuit 2710 of FIG. 27 obtains instructions from an application (e.g., to perform ISA execution with processor resources operating as a large core) via the OS/VMM 2707. In some examples, the ISA management circuit 2710 determines that the OS/VMM 2707 has requested that processing components of the hardware 2702 operate as a large core by monitoring changes in data in one or more registers 2713 of the hardware 2702. For example, when the OS/VMM 2707 requests a large core operation, data may be placed into one or more registers 2713 to indicate the large core operation (e.g., as an interrupt). Thus, the ISA management circuit 2710 may monitor the register(s) 2713 (e.g., like an interrupt) to determine when to facilitate large core operations.

When the example ISA management circuit 2710 of FIG. 27 determines that a large core operation is to occur, the ISA management circuit 2710 determines the ISA requirements (SLA, latency requirements, tolerance requirements, etc.) of the instructions to be executed by the large core structure. For example, if an instruction is stored in one or more registers 2713, the ISA management circuit 2710 processes the ISA instruction to identify the requirements. The ISA management circuitry 2710 evaluates whether processing resources (e.g., one or more of the core(s) 2704 and/or the small device processor(s) 2706) are capable of and/or available for operating as a large core that handles the ISA execution according to the determined requirements. In some examples, one or more of the processing resources may be capable of handling the ISA execution but not currently available to execute the instructions, because the processing resources may be executing other workloads. In some examples, the processing resources may not be capable of handling the ISA execution. For example, the processing resources may be configured to process integer-based instructions. In such an example, if the OS/VMM 2707 sends instructions to process floating point numbers, the processing resources may not have the ability to process such instructions. Accordingly, the example ISA management circuit 2710 determines whether processing resources are available for and/or capable of executing the instructions from the OS/VMM 2707 corresponding to the ISA execution.

If the example ISA management circuit 2710 of FIG. 27 determines that the processing resources are capable of and available for executing the ISA instructions by combining the operations of multiple ones of the core(s) 2704 and/or the small device processor(s) 2706 to operate as a large core, the example ISA management circuit 2710 instructs the microcode processing circuit 2711 of the hardware 2702 to cause the core(s) 2704 and/or the small device processor(s) 2706 to operate as a large core. If the example ISA management circuit 2710 determines that the processing resources are capable of but unavailable for executing the instructions (e.g., only a portion of the processing resources is available), the example ISA management circuit 2710 may determine (a) when sufficient processor resources will be available to operate as a large core (e.g., when the current workload(s) will complete, based on the current workload and/or schedule) and/or (b) whether emulation of a large core is possible. The combinations of small device processors that can act as a larger processing device are policy configurable and can be provisioned via a platform trusted execution environment (TEE). Emulation is possible when the available processor resources are able to execute as a large core, but the execution would not meet all of the requirements. For example, the ISA management circuit 2710 may determine that 512 bits per cycle is not possible, but 256 bits per cycle is possible. In such an example, a 512-bit instruction may be executed in two 256-bit cycles instead of one 512-bit cycle. Thus, while the instruction can complete, it takes twice as many cycles as the 512-bit-per-cycle requirement allows. The example ISA management circuit 2710 may send information to the example OS/VMM 2707 regarding emulation and/or when additional resources will be available. In this manner, the OS/VMM 2707 may determine whether to wait, proceed with emulation, and/or not proceed, based on the information from the ISA management circuit 2710.
In some examples, the OS/VMM 2707 and the ISA management circuit 2710 may negotiate the terms of emulation. If the example ISA management circuit 2710 determines that the processor resources are not capable of executing an instruction and/or are not capable of operating as a large core, the ISA management circuit 2710 may generate an exception (also referred to as a trap and/or a block) for the ISA execution and inform the OS/VMM 2707 that the instruction will not be executed because the processor resources are not capable of executing it. The example ISA management circuit 2710 is further described below in conjunction with FIG. 28.
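
The 512-bit versus 256-bit example amounts to a ceiling division, which the following sketch makes explicit. The function name and the returned fields are hypothetical, and the single-cycle requirement is an assumption standing in for whatever requirement accompanies the ISA request.

```python
def emulation_plan(instruction_bits: int, available_bits_per_cycle: int) -> dict:
    """Estimate how many cycles an emulated wide instruction needs when only
    `available_bits_per_cycle` of combined width is free."""
    if instruction_bits <= 0 or available_bits_per_cycle <= 0:
        raise ValueError("bit widths must be positive")
    # Ceiling division without floating point.
    cycles = (instruction_bits + available_bits_per_cycle - 1) // available_bits_per_cycle
    return {"cycles": cycles, "meets_single_cycle_requirement": cycles == 1}


plan = emulation_plan(512, 256)
```

Information of this kind is what the OS/VMM could weigh when deciding whether to wait for more resources or accept emulation.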

The example microcode processing circuitry 2711 of FIG. 27 is hardware that executes microcode (e.g., xuCode, etc.) to control the operation of the example core(s) 2704 and/or the small device processor(s) 2706. For example, if the small device processor(s) 2706 are 64-bit-per-cycle processors and the ISA management circuit 2710 instructs the microcode processing circuit 2711 to operate them as a large core executing a 512-bit-per-cycle instruction, the microcode processing circuit 2711 splits the 512-bit instruction into eight 64-bit instructions, causes eight 64-bit-per-cycle small device processors 2706 to execute the respective 64-bit instructions, and combines the results into an output. For example, the microcode processing circuitry 2711 may divide and/or group instructions into smaller portions or sub-instructions. The smaller sub-instructions are loaded into the small device processors 2706, and the microcode processing circuitry 2711 performs a cumulative combination in a larger register space of temporary storage (e.g., virtual registers). For example, if a small device processor 2706 only supports a 256-bit width, a 512-bit operation is obtained, and the small device processor 2706 has a 512-bit accumulation register, the small device processor 2706 may operate using the accumulation register and/or an accumulation register configured in SRAM. Additional operations may include multiplication, additive encryption, and so forth. In this way, a 512-bit instruction may be executed by eight small device processors that act as a large core. If the microcode processing circuitry 2711 identifies an error during execution, the microcode processing circuitry 2711 may return an error to the ISA management circuitry 2710 to indicate that the ISA execution failed and to prevent a crash. The example microcode processing circuit 2711 is described further below in connection with FIG. 28.
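
The idea of accumulating partial results in a wider register space can be illustrated with wide addition, where each 64-bit partial sum passes a carry to the next lane. This is a hedged sketch: the choice of addition, the carry-propagation scheme, and the Python integer standing in for the wider accumulation register are all assumptions, since the disclosure does not detail how the cumulative combination is performed.

```python
def wide_add(a: int, b: int, total_bits: int = 512, lane_bits: int = 64):
    """Add two wide operands using only lane-sized additions.

    Returns (sum modulo 2**total_bits, carry_out)."""
    mask = (1 << lane_bits) - 1
    accumulator = 0  # stands in for the wider accumulation register
    carry = 0
    for i in range(total_bits // lane_bits):
        lane_a = (a >> (i * lane_bits)) & mask
        lane_b = (b >> (i * lane_bits)) & mask
        partial = lane_a + lane_b + carry        # at most lane_bits + 1 bits
        accumulator |= (partial & mask) << (i * lane_bits)
        carry = partial >> lane_bits             # carry into the next lane
    return accumulator, carry


total, carry_out = wide_add((1 << 64) - 1, 1)
```

Unlike a lane-independent operation, the lanes here are processed in order because each depends on the carry from the one below it.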

FIG. 28 is a block diagram of an example implementation of the example ISA management circuit 2710 and the example microcode processing circuit 2711 of FIG. 27. The example ISA management circuit 2710 includes one or more example interfaces 2800, example authentication circuit 2802, and example hardware management circuit 2804. The example microcode processing circuitry 2711 includes one or more example interfaces 2810, example hardware control circuitry 2812, example error determination circuitry 2814, and example output control circuitry 2816.

The example interface(s) 2800 of the ISA management circuit 2710 of FIG. 28 obtain instructions to perform ISA execution by operating multiple processing devices as a large core. In some examples, the ISA management circuit 2710 obtains the instructions directly from the OS/VMM 2707 of FIG. 27. In some examples, the OS/VMM 2707 writes data into the register(s) 2713 when ISA execution is required. In such examples, the interface(s) 2800 access the data in the register(s) 2713 to allow the hardware management circuitry 2804 to determine whether ISA execution is possible. In addition, the example interface(s) 2800 send instructions to the microcode processing circuitry 2711 to cause processing resources to operate according to ISA execution requests from the OS/VMM 2707.

The example authentication circuitry 2802 of FIG. 28 authenticates ISA execution requests and/or instructions to verify that the requests are valid and/or authentic. To verify an ISA execution request, the example authentication circuitry 2802 may (a) match a CPU in the platform, (b) check the header, loader version, and/or checksum of the ISA execution request, (c) perform an authenticity and/or signature check, and/or (d) utilize any other verification technique. The example authentication circuit 2802 may match a CPU in the platform against a CPU ID/manifest that is provisioned via factory provisioning (e.g., fuse settings) during manufacturing or via field provisioning of firmware/microcode patches. CPU matching may be dynamically controlled in the field after deployment via policies and/or via out-of-band manageability of a platform Trusted Execution Environment (TEE). If the ISA execution request is invalid and/or not authentic, the authentication circuitry 2802 may inform the OS/VMM 2707 that the ISA execution request cannot be verified and/or return control to the OS/VMM 2707.
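
Check (b) above can be sketched in Python as a loader-version check followed by a checksum comparison. The use of SHA-256, the integer version numbers, and the function signature are assumptions for illustration; the disclosure does not specify a checksum algorithm or a request layout.

```python
import hashlib
import hmac


def verify_request(payload: bytes, expected_sha256_hex: str,
                   loader_version: int, min_loader_version: int) -> bool:
    """Accept a request only if its loader version is recent enough and its
    payload matches the expected SHA-256 digest."""
    if loader_version < min_loader_version:
        return False
    actual = hashlib.sha256(payload).hexdigest()
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(actual, expected_sha256_hex.lower())


payload = b"isa-execution-request"
good_digest = hashlib.sha256(payload).hexdigest()
accepted = verify_request(payload, good_digest, loader_version=3, min_loader_version=2)
```

A checksum of this kind only detects corruption or mismatch; the separate authenticity and/or signature check named in (c) would be needed to establish who issued the request.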

The example hardware management circuitry 2804 of FIG. 28 obtains the validated ISA execution request and determines how to execute the ISA execution request based on the requirements of the ISA execution request, the availability and/or capabilities of processing resources (e.g., the core(s) 2704 and/or the small device processor(s) 2706), and any policies. A policy may be a user-designed and/or manufacturer-designed policy that determines whether ISA execution should be performed, emulated, and/or blocked based on various factors. The hardware management circuitry 2804 monitors the capabilities and/or availability of the processor resources (e.g., the core(s) 2704 and/or the small device processor(s) 2706). If the ISA request corresponds to executing an X-bit-per-cycle instruction that includes a floating point operation, the hardware management circuitry 2804 determines whether processing resources are available and have the ability to handle an ISA execution request for floating point operations at a rate of X bits per cycle. For example, if the total bits per cycle provided by two or more available processor resources equals or exceeds X bits per cycle, the hardware management circuitry 2804 may determine that ISA execution is available and instruct the microcode processing circuitry 2711 to coordinate the ISA execution using the two or more processor resources (e.g., the core(s) 2704 and/or the small device processor(s) 2706) as a large core.
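
The aggregate-width test described above can be sketched as a filter followed by a running sum. The resource dictionaries, the `float` capability flag, and the widest-first selection order are assumptions made for illustration; the disclosure only states that the combined bits per cycle must equal or exceed X.

```python
def select_resources(resources, required_bits_per_cycle, needs_float=False):
    """Pick available (and, if required, floating-point-capable) resources,
    widest first, until their combined width reaches the requirement.
    Returns the chosen names, or None if the requirement cannot be met."""
    eligible = [
        r for r in resources
        if r["available"] and (r["float"] or not needs_float)
    ]
    eligible.sort(key=lambda r: r["bits_per_cycle"], reverse=True)
    chosen, total = [], 0
    for resource in eligible:
        chosen.append(resource["name"])
        total += resource["bits_per_cycle"]
        if total >= required_bits_per_cycle:
            return chosen
    return None


# Hypothetical resource pool.
pool = [
    {"name": "core0", "bits_per_cycle": 256, "available": True, "float": True},
    {"name": "small0", "bits_per_cycle": 64, "available": True, "float": False},
    {"name": "small1", "bits_per_cycle": 64, "available": True, "float": True},
    {"name": "small2", "bits_per_cycle": 64, "available": False, "float": True},
]
picked = select_resources(pool, 320, needs_float=True)
```

A `None` result corresponds to the cases handled next in the text, in which the circuitry may offer to wait, emulate, or block the request.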

Further, the example hardware management circuitry 2804 of fig. 28 may determine that two or more processor resources are capable of performing floating point operations, but not as required by the ISA execution. If hardware management circuitry 2804 determines that the ISA execution requirements cannot be met, hardware management circuitry 2804 may identify when the requirements can be met and/or may generate an emulation protocol that executes the ISA request, but not as required. In this manner, hardware management circuitry 2804 may negotiate with OS/VMM 2707 to determine whether to emulate, not emulate, and/or wait until additional resources are available. If hardware management circuitry 2804 determines that ISA execution is not possible and/or is unlikely to be possible in the future, hardware management circuitry 2804 sends a response (e.g., via interface(s) 2800) to OS/VMM 2707 to indicate that ISA execution is not possible. If example hardware management circuitry 2804 determines that the processing resources are not capable of handling the ISA execution request (e.g., regardless of availability), example hardware management circuitry 2804 generates an exception that blocks the ISA execution and indicates to example OS/VMM 2707 that the processing resources are not capable of executing the ISA execution. After hardware management circuitry 2804 determines how to handle the ISA execution request, hardware management circuitry 2804 instructs microcode processing circuitry 2711 to control processing resources accordingly.

Example interface(s) 2810 of microcode processing circuit 2711 of fig. 28 obtain instructions regarding execution of an ISA execution request from ISA management circuit 2710. In addition, example interface(s) 2810 obtain ISA-based instructions for ISA execution. After completion of the ISA instruction, interface(s) 2810 send the output to OS/VMM 2707 (e.g., directly or via BIOS 2708).

Example hardware control circuitry 2812 of fig. 28 determines how processing resources (e.g., example core(s) 2704 and/or example small device processor(s) 2706) are configured to execute ISA execution based on instructions from ISA management circuitry 2710. For example, hardware control circuitry 2812 may decompose ISA instructions into sub-instructions executable by available processing resources and provide the sub-instructions to respective processing resources (e.g., via interface(s) 2810). For example, if a 128-bit instruction is obtained, the hardware control circuit 2812 may decompose the 128-bit instruction into two 64-bit sub-instructions for execution by two 64-bit small device processors (e.g., a first sub-instruction to a first small device processor and a second sub-instruction to a second small device processor). In this way, the processing resources may execute larger instructions without using larger processing resources.

The example error determination circuit 2814 of fig. 28 monitors for errors in execution of ISA execution. For example, if an instruction causes a divide by zero, infinite loop, and/or other instruction errors, the error determination circuit 2814 may identify the error, stop execution, and return a message to the OS/VMM 2707 indicating that the instruction execution cannot be completed. In this way, the error determination circuit 2814 can prevent occurrence of a crash.

The example output control circuit 2816 of fig. 28 obtains multiple outputs from multiple processing resources and combines the outputs to generate a single output. For example, if the hardware control circuit 2812 splits a 128-bit instruction into two 64-bit instructions for two 64-bit processing resources, the output control circuit 2816 obtains a first output from the first processing resource and a second output from the second processing resource and combines the outputs to generate a 128-bit output. Output control circuit 2816 sends the output to OS/VMM 2707 via interface(s) 2810.
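The split-and-combine behavior can be sketched for a simple case: a 128-bit bitwise XOR handled as two independent 64-bit halves whose outputs are concatenated. The function names are hypothetical, and the sketch assumes a lane-independent operation; operations that carry between halves (e.g., addition) would need extra coordination that is not shown.

```python
MASK64 = (1 << 64) - 1


def split_128(value):
    """Split a 128-bit operand into (low, high) 64-bit halves."""
    if not 0 <= value < (1 << 128):
        raise ValueError("operand must fit in 128 bits")
    return value & MASK64, value >> 64


def combine_64(low, high):
    """Concatenate two 64-bit outputs into one 128-bit output."""
    return (high << 64) | low


def xor_128_on_two_64bit_units(a, b):
    """Emulate a 128-bit XOR using two 64-bit sub-operations."""
    a_low, a_high = split_128(a)
    b_low, b_high = split_128(b)
    out_low = a_low ^ b_low      # would run on the first 64-bit resource
    out_high = a_high ^ b_high   # would run on the second 64-bit resource
    return combine_64(out_low, out_high)


a = 0x0123456789ABCDEF_FEDCBA9876543210
b = 0xFFFFFFFF00000000_00000000FFFFFFFF
result = xor_128_on_two_64bit_units(a, b)
```

The combined result equals the XOR computed directly on the full 128-bit operands, which is the property the output control circuit relies on when concatenating outputs.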

Although an example manner of implementing ISA management circuit 2710 and/or microcode processing circuit 2711 of fig. 27 is illustrated in fig. 28, one or more of the elements, processes, and/or devices shown in fig. 28 may be combined, divided, rearranged, omitted, eliminated, and/or implemented in any other way. In addition, example interface(s) 2800, example authentication circuit 2802, example hardware management circuit 2804, example interface(s) 2810, example hardware control circuit 2812, example error determination circuit 2814, example output control circuit 2816, and/or, more generally, ISA management circuit 2710 and/or microcode processing circuit 2711 of fig. 27-28 may be implemented by hardware, software, firmware, and/or any combination of hardware, software, and/or firmware. Thus, for example, any of example interface(s) 2800, example authentication circuit 2802, example hardware management circuit 2804, example interface(s) 2810, example hardware control circuit 2812, example error determination circuit 2814, example output control circuit 2816, and/or, more generally, ISA management circuit 2710 and/or microcode processing circuit 2711 of fig. 27-28 may be implemented by processor circuit(s), analog circuit(s), digital circuit(s), logic circuit(s), programmable processor(s), programmable microcontroller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), and/or field programmable logic device(s) (FPLD(s)) (e.g., field programmable gate array(s)). When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of ISA management circuit 2710 and/or microcode processing circuit 2711 of fig. 27-28 is expressly defined herein to include a non-transitory computer readable storage device or storage disk containing the software and/or firmware, such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc. Further, ISA management circuit 2710 and/or microcode processing circuit 2711 of fig. 27-28 may include one or more elements, processes, and/or devices in addition to or instead of those shown in fig. 27-28, and/or may include more than one of any or all of the illustrated elements, processes, and devices.

Flowcharts representative of example hardware logic circuits, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing ISA management circuit 2710 and/or microcode processing circuit 2711 of fig. 27-28 are shown in fig. 29-31. The machine-readable instructions may be one or more executable programs, or portion(s) of an executable program, for execution by a processor circuit, such as the processor circuit 3312 shown in the example processor platform 3300 discussed below in connection with fig. 33 and/or the example processor circuit discussed below in connection with fig. 48. The program may be embodied in software stored on one or more non-transitory computer readable storage media, such as a CD, floppy disk, hard disk drive (HDD), DVD, Blu-ray disk, volatile memory (e.g., any type of Random Access Memory (RAM), etc.), or non-volatile memory (e.g., FLASH memory, HDD, etc.), associated with processor circuitry located in one or more hardware devices, but the entire program and/or a portion thereof may also be executed by one or more hardware devices other than the processor circuitry and/or embodied in firmware or dedicated hardware. The machine-readable instructions may be distributed across multiple hardware devices and/or executed by two or more hardware devices (e.g., a server and a client hardware device). For example, the client hardware device may be implemented by an endpoint client hardware device (e.g., a hardware device associated with a user) or an intermediary client hardware device (e.g., a Radio Access Network (RAN) gateway that may facilitate communications between the server and the endpoint client hardware device). Similarly, the non-transitory computer readable storage medium may include one or more media located in one or more hardware devices. In addition, while the example program is described with reference to the flowcharts shown in fig. 29-31, many other methods of implementing the computing device 2700, ISA management circuit 2710, and/or microcode processing circuit 2711 of fig. 27-28 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the respective operations without executing software or firmware. The processor circuits may be distributed in different network locations and/or local to one or more hardware devices (e.g., a single core processor (e.g., a single core Central Processing Unit (CPU)), a multi-core processor (e.g., a multi-core CPU), etc.), multiple processors distributed across multiple servers of a server rack, multiple processors distributed across one or more server racks, CPUs and/or FPGAs located in the same package (e.g., the same Integrated Circuit (IC) package or in two or more separate housings, etc.).

As described above, the example operations of fig. 29-31 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on one or more non-transitory computer and/or machine readable media, such as optical storage devices, magnetic storage devices, HDDs, flash memory, read-only memory (ROM), CDs, DVDs, caches, any type of RAM, registers, and/or any other storage device or storage disk where information may be stored for any duration (e.g., for a longer period of time, permanently stored, temporarily stored, used for temporary buffering, and/or used for caching of the information). As used herein, the terms non-transitory computer readable medium and non-transitory computer readable storage medium are expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.

Fig. 29 is a flow diagram representing example machine readable instructions and/or example operations 2900 executable and/or instantiated by processor circuitry (e.g., example ISA management circuitry 2710 of fig. 28) for processing ISA execution requests. The instructions begin at block 2902 when example hardware management circuitry 2804 determines whether data has been written into an ISA manager status register (e.g., one or more registers 2713 of fig. 27). As described above, OS/VMM 2707 may write data into register 2713 to trigger an interrupt when ISA execution is about to occur. In some examples, OS/VMM 2707 may send instructions directly to ISA management circuit 2710.

If example hardware management circuit 2804 determines that data has not been written to ISA manager status register 2713 (block 2902: no), control returns to block 2902. If example hardware management circuit 2804 determines that data has been written to ISA manager status register 2713 (block 2902: yes), example authentication circuit 2802 authenticates an ISA execution request corresponding to the data in ISA manager status register 2713 (block 2904). As described above in connection with fig. 28, example authentication circuitry 2802 may authenticate ISA requests using any authentication technique to determine that ISA execution requests are valid.

If example authentication circuit 2802 determines that the ISA request is not authentic (block 2906: no), authentication circuit 2802 returns a response to OS/VMM 2707 indicating that the ISA request cannot be performed (block 2908), and control continues to block 2922. If example authentication circuitry 2802 determines that the ISA request is authentic (block 2906: yes), example hardware management circuitry 2804 evaluates the ISA request based on one or more policies, resource capacities, and/or resource capabilities (block 2910). For example, hardware management circuitry 2804 may process one or more policies to determine how to process the request and/or may determine whether available processor resources are capable of processing the request.

At block 2912, example hardware management circuitry 2804 determines whether the ISA request can be executed according to requirements (e.g., latency, bit rate, etc.) corresponding to the ISA execution and/or according to one or more policies. For example, hardware management circuitry 2804 determines whether processor resources are capable of and/or available for handling the ISA execution. If hardware management circuitry 2804 determines that the ISA request can be executed by the processor resources (block 2912: yes), example hardware management circuitry 2804 instructs the microcode of the hardware (e.g., microcode processing circuitry 2711) to cause the processing elements to operate like a large core to handle the ISA execution (block 2914). For example, hardware management circuitry 2804 may provide ISA execution instructions and/or requirements to the microcode to cause the microcode to facilitate ISA execution with the corresponding processor resources.

If hardware management circuitry 2804 determines that the ISA request cannot be executed by the processor resources (block 2912: no), example hardware management circuitry 2804 determines whether the processor resources may emulate the ISA execution and/or execute the ISA request at a later time (block 2916) (e.g., based on policy(ies), resource capabilities, and/or resource availability). If example hardware management circuit 2804 determines that emulation should occur (block 2916: yes), example ISA management circuit 2710 facilitates execution of the ISA emulation (block 2918), as described further below in connection with fig. 30.

If example hardware management circuitry 2804 determines that emulation should not occur (block 2916: no), example hardware management circuitry 2804 generates an exception for the ISA request to OS/VMM 2707 (e.g., via interface(s) 2800) and/or blocks the ISA request to indicate that the ISA request cannot be executed (block 2920). At block 2922, the example hardware management circuit 2804 returns control to the example OS/VMM 2707.
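The decision flow of fig. 29 can be summarized in a short sketch. The `Outcome` enum and the three boolean inputs are hypothetical simplifications: in the description, each input is itself the result of authentication, policy evaluation, and resource checks.

```python
from enum import Enum


class Outcome(Enum):
    REJECTED = "request could not be authenticated"
    EXECUTE = "execute as a large core"
    EMULATE = "emulate the ISA execution"
    BLOCKED = "exception generated, request blocked"


def process_isa_request(is_authentic, can_execute, may_emulate):
    """Mirror the branch structure of the fig. 29 flowchart."""
    if not is_authentic:
        return Outcome.REJECTED   # response sent at block 2908
    if can_execute:               # block 2912: yes
        return Outcome.EXECUTE
    if may_emulate:               # block 2916: yes
        return Outcome.EMULATE    # handled at block 2918
    return Outcome.BLOCKED        # block 2920


outcome = process_isa_request(is_authentic=True, can_execute=False, may_emulate=True)
```

In every branch, control ultimately returns to the OS/VMM (block 2922), so the sketch returns an outcome rather than raising.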

Fig. 30 is a flow diagram representing example machine-readable instructions and/or example operations executable and/or instantiated by processor circuitry (e.g., ISA management circuit 2710 of fig. 28) to facilitate ISA emulation in connection with block 2918 of fig. 29.

Machine-readable instructions and/or operations corresponding to block 2918 of fig. 30 begin at block 3002 when example hardware management circuitry 2804 determines whether additional resources will be available to execute ISA execution corresponding to the ISA request at a later time. For example, hardware management circuitry 2804 may determine that additional hardware (e.g., sufficient resources to perform ISA execution according to, and/or more closely in agreement with, the policy(ies) and/or parameter(s)) is currently executing one or more workloads, but will be free for ISA execution after completion of the one or more workloads.

If example hardware management circuitry 2804 determines that no additional resources will be available to perform ISA execution corresponding to the ISA request at a later time (block 3002: no), control continues to block 3008. If example hardware management circuitry 2804 determines that additional resources will be available later to perform ISA execution corresponding to the ISA request (block 3002: yes), example hardware management circuitry 2804 instructs interface(s) 2800 to send an indication to example OS/VMM 2707 as to when the ISA instructions will be executable by the processor resources (block 3004). For example, hardware management circuitry 2804 may determine and/or estimate when currently unavailable processor resources will be available based on the speed of the currently unavailable resources and the amount of workload remaining to be completed.

At block 3006, example hardware management circuitry 2804 determines whether OS/VMM 2707 has refused subsequent execution based on the response from OS/VMM 2707. For example, after receiving an indication as to when processing resources will be available, OS/VMM 2707 may determine whether it wishes to wait for full execution of the ISA instructions or to proceed immediately with emulation. In some examples, if OS/VMM 2707 determines to wait for additional resources to become available (e.g., based on user and/or manufacturer preferences indicating when to wait for resources to become fully available if they are not currently available), control may return to OS/VMM 2707 and OS/VMM 2707 may submit a subsequent request based on the identified time at which the resources will be available. In some examples, if OS/VMM 2707 decides to wait for additional resources to become available, hardware management circuitry 2804 may reserve the currently unavailable resources and/or queue the ISA instructions so that those resources execute the ISA instructions after their workload is completed.

If the example hardware management circuit 2804 determines that OS/VMM 2707 does not refuse subsequent execution (block 3006: no), control returns to block 2922 of fig. 29. If example hardware management circuit 2804 determines that OS/VMM 2707 refuses subsequent execution (block 3006: yes), example hardware management circuit 2804 identifies a configuration of resources that may be utilized to emulate the ISA execution (block 3008). For example, if two small device processors that each process 64 bits per cycle are available and the ISA instruction corresponds to a 256-bit instruction, hardware management circuitry 2804 may identify a configuration that uses the two small device processors to execute the instruction at half the bit rate (e.g., 128 bits per cycle over 2 cycles rather than 256 bits per cycle). At block 3010, the example hardware management circuitry 2804 sends emulation configuration information to the OS/VMM 2707 via interface(s) 2800. The emulation configuration information can include information regarding the processor resources to be used for emulating the ISA execution, policies and/or parameters that will be met, policies and/or parameters that will not be met, and/or parameters of the emulation configuration (e.g., bit rate, latency, etc.).
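The arithmetic behind the half-bit-rate example can be made explicit. The helper name `emulation_cycles` is hypothetical, and the sketch assumes the instruction divides evenly across cycles with no coordination overhead.

```python
def emulation_cycles(instruction_bits, resource_widths):
    """Cycles needed when the listed resources work together each cycle."""
    total_bits_per_cycle = sum(resource_widths)
    if total_bits_per_cycle <= 0:
        raise ValueError("at least one resource with positive width is required")
    # Ceiling division: a partially filled final cycle still costs a cycle.
    return -(-instruction_bits // total_bits_per_cycle)


# Two 64-bit small device processors emulating one 256-bit instruction.
cycles = emulation_cycles(256, [64, 64])
```

Two 64-bit resources provide 128 bits per cycle, so a 256-bit instruction takes two cycles, which is half the bit rate of a native 256-bit per cycle execution.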

At block 3012, the example hardware management circuitry 2804 determines whether the configuration is accepted by the OS/VMM 2707 (e.g., based on a response obtained from the OS/VMM 2707 via interface(s) 2800). If the example hardware management circuit 2804 determines that the configuration is accepted (block 3012: yes), the example hardware management circuit 2804 instructs the microcode of the hardware (e.g., the microcode processing circuit 2711) to cause the processing resources to operate according to the emulation configuration (block 3014), and control returns to block 2922 of fig. 29. If the example hardware management circuit 2804 determines that the configuration is not accepted (block 3012: no), the example hardware management circuit 2804 determines whether other emulation configurations are available (block 3016). In this manner, example OS/VMM 2707 and ISA management circuit 2710 may negotiate an emulation configuration. In some examples, OS/VMM 2707 may provide instructions and/or preferences that it wishes to see in the emulation configuration, and ISA management circuit 2710 may attempt to satisfy these instructions and/or preferences and/or provide an emulation configuration that better matches them.

If the example hardware management circuit 2804 determines that other emulation configurations are available (block 3016: yes), control returns to block 3010. If the example hardware management circuit 2804 determines that other emulation configurations are not available (block 3016: no), the example hardware management circuit 2804 sends (e.g., to the OS/VMM 2707 using the example interface(s) 2800) an indication that emulation is not available (block 3018), and control returns to block 2922.
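The negotiation loop between the hardware management circuitry and the OS/VMM can be sketched as an offer-and-accept iteration. The `negotiate` function, the dictionary-based configuration, and the `os_accepts` callback are hypothetical stand-ins for the exchange over the interface(s).

```python
def negotiate(candidate_configs, os_accepts):
    """Offer emulation configurations in order until one is accepted."""
    for config in candidate_configs:   # offer a configuration (block 3010)
        if os_accepts(config):         # acceptance check (block 3012)
            return config
    return None                        # no configuration left: emulation unavailable


# Hypothetical candidate configurations and an OS/VMM latency preference.
candidates = [
    {"resources": ["small0"], "cycles": 4},
    {"resources": ["small0", "small1"], "cycles": 2},
]
accepted = negotiate(candidates, lambda cfg: cfg["cycles"] <= 2)
rejected = negotiate(candidates, lambda cfg: cfg["cycles"] <= 1)
```

A `None` result corresponds to the case in which no further emulation configurations are available and the OS/VMM is told that emulation is not available.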

Fig. 31 is a flow diagram representing example machine readable instructions and/or example operations 3100 executable and/or instantiated by processor circuitry (e.g., microcode processing circuitry 2711) to control processing resources to process execution of ISA instructions. The instructions begin at block 3102 when the example hardware control circuit 2812 determines whether an ISA instruction has been obtained (e.g., directly from the OS/VMM 2707 or via the BIOS 2708).

If example hardware control circuit 2812 determines that an ISA instruction has not been obtained (block 3102: no), control returns to block 3102 until an ISA instruction is obtained. If example hardware control circuit 2812 determines that an ISA instruction has been obtained (block 3102: yes), example hardware control circuit 2812 partitions the instruction into sub-instructions according to the configuration instruction from ISA management circuit 2710 (block 3104). For example, if the configuration corresponds to one 128-bit processor and two 64-bit processors, the hardware control circuit 2812 may split one 256-bit instruction into one 128-bit instruction and two 64-bit instructions to correspond to the configuration, as further described above in connection with fig. 27.

At block 3106, the example hardware control circuit 2812 causes the processing resources to execute the partitioned instructions based on the configuration instruction. Using the above example, the hardware control circuit 2812 may provide the 128-bit instruction to be executed with a processing resource operating at 128 bits per cycle, the first 64-bit instruction to be executed with a first processing resource operating at 64 bits per cycle, and the second 64-bit instruction to be executed with a second processing resource operating at 64 bits per cycle. At block 3108, the example error determination circuit 2814 determines whether an error has occurred at any of the processing resources. For example, the error determination circuit 2814 can identify operations that result in errors, infinite loops, and so forth.

If example error determination circuit 2814 determines that an error has occurred (block 3108: yes), example error determination circuit 2814 sends an indication (e.g., using interface(s) 2810) that the ISA instruction cannot be completed (block 3110) and the instructions end. If the example error determination circuit 2814 determines that no error has occurred (block 3108: no), the example output control circuit 2816 combines results (e.g., outputs) from multiple executions at multiple processor resources to generate a final output for the cycle (block 3112), as further described above in connection with fig. 27. For example, the output control circuit 2816 can combine results (e.g., outputs) by concatenating the outputs, adding the outputs, multiplying the outputs, and so forth. If an ISA instruction corresponds to multiple instructions over multiple cycles, microcode processing circuitry 2711 may store the output of each cycle in memory (e.g., registers, caches, volatile memory, non-volatile memory, etc.) for use during subsequent cycles and/or until all instructions are completed, and then combine some or all of the outputs of the cycles. At block 3114, the example output control circuit 2816 sends the output to the OS/VMM 2707 (e.g., directly or via the BIOS 2708) using the interface(s) 2810.
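The execute, check-for-error, and combine sequence can be sketched as follows. The `run_partitioned` helper and its dictionary result are hypothetical; sub-instructions are modeled as plain Python callables, and only arithmetic errors such as divide by zero are detected (an infinite loop would need a timeout mechanism that is not shown).

```python
def run_partitioned(sub_operations, combine):
    """Run each sub-operation, stop on the first error, else combine outputs."""
    outputs = []
    for operation in sub_operations:
        try:
            outputs.append(operation())
        except ArithmeticError as err:
            # Report that the instruction cannot be completed instead of crashing.
            return {"ok": False, "error": type(err).__name__}
    return {"ok": True, "output": combine(outputs)}


# Combining by addition; concatenation or multiplication would work the same way.
success = run_partitioned([lambda: 10, lambda: 32], sum)
failure = run_partitioned([lambda: 10, lambda: 1 // 0], sum)
```

On success the combined output would be forwarded to the OS/VMM; on failure only the error indication is returned, which reflects the crash-prevention role of the error determination circuit.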

Fig. 32 illustrates an example diagram 3200 corresponding to the operation of ISA management circuit 2710 of fig. 27. Example diagram 3200 of fig. 32 begins when OS/VMM 2707 writes data to an ISA manager status register (isa_msr) to initiate an interrupt to ISA management circuit 2710 to determine whether and/or how to execute ISA instructions in accordance with an ISA execution request. When the ISA management circuitry (e.g., implementing a UEFI BIOS microcode update manager) recognizes the isa_msr write, authentication circuitry 2802 (e.g., implementing an ISA decoder and/or evaluator) decodes the isa_msr write and verifies its authenticity. If the write is authenticated, hardware management circuitry 2804 (e.g., implementing an ISA manager) verifies the ISA configuration for the current session with message passing interface (MPI) bits, configures the ISA MPI bits as allowed to execute, emulate, or generate, and applies the ISA configuration for the current session by instructing the Xucode (e.g., microcode processing circuitry 2711). In some examples, hardware management circuitry 2804 may take policy-based actions, including generating new micro-operations to execute using a surplus mapper to configure processing resources to execute ISA instructions. After completion, example ISA management circuit 2710 returns control to OS/VMM 2707. A similar process occurs to return to the normal thin mode (e.g., in which the processing resources operate not as a large core but as separate smaller processor devices).

Fig. 33 is a block diagram of an example processor platform 3300 that is configured to execute and/or instantiate the machine readable instructions and/or operations of fig. 29-31 to implement ISA management circuit 2710 and/or microcode processing circuit 2711 of fig. 27. The processor platform 3300 may be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cellular telephone, a smart phone, a tablet device such as an iPad™), a Personal Digital Assistant (PDA), an internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a game console, a personal video recorder, a set-top box, a headset (e.g., an Augmented Reality (AR) headset, a Virtual Reality (VR) headset, etc.) or other wearable device, or any other type of computing device.

The processor platform 3300 of the illustrated example includes processor circuitry 3312. The processor circuit 3312 of the illustrated example is hardware. For example, the processor circuit 3312 may be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The processor circuit 3312 may be implemented by one or more semiconductor-based (e.g., silicon-based) devices. In this example, processor circuit 3312 implements example interface(s) 2800, example authentication circuit 2802, example hardware management circuit 2804, example interface(s) 2810, example hardware control circuit 2812, example error determination circuit 2814, and example output control circuit 2816.

The processor circuit 3312 of the illustrated example includes local memory 3313 (e.g., buffers, registers, etc.). The processor circuit 3312 of the illustrated example communicates with a main memory including a volatile memory 3314 and a non-volatile memory 3316 over a bus 3318. Volatile memory 3314 can be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM), and/or any other type of RAM device. The non-volatile memory 3316 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memories 3314, 3316 of the illustrated example is controlled by a memory controller 3317.

The processor platform 3300 of the illustrated example also includes interface circuitry 3320. The interface circuit 3320 may be implemented in hardware according to any type of interface standard, such as an Ethernet interface, a Universal Serial Bus (USB) interface, a Bluetooth interface, a Near Field Communication (NFC) interface, a PCI interface, and/or a PCIe interface.

In the illustrated example, one or more input devices 3322 are connected to the interface circuit 3320. Input device(s) 3322 allows a user to input data and/or commands into processor circuit 3312. The input device(s) 3322 may be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, buttons, a mouse, a touch screen, a touch pad, a trackball, an isopoint device, and/or a speech recognition system.

One or more output devices 3324 are also connected to the interface circuit 3320 of the illustrated example. The output device(s) 3324 may be implemented, for example, by a display device (e.g., a Light Emitting Diode (LED) display, an Organic Light Emitting Diode (OLED) display, a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT) display, an in-plane switching (IPS) display, a touch screen, etc.), a haptic output device, a printer, and/or speakers. The interface circuit 3320 of the illustrated example thus generally includes a graphics driver card, a graphics driver chip, and/or a graphics processor circuit, such as a GPU.

The interface circuit 3320 of the illustrated example also includes communication devices, such as transmitters, receivers, transceivers, modems, residential gateways, wireless access points, and/or network interfaces, to facilitate the exchange of data with external machines (e.g., any kind of computing device) via the network 3326. The communication may be through, for example, an Ethernet connection, a Digital Subscriber Line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, an optical connection, etc.

The processor platform 3300 of the illustrated example also includes one or more mass storage devices 3328 to store software and/or data. Examples of such mass storage devices 3328 include magnetic storage devices, optical storage devices, floppy disk drives, HDDs, CDs, blu-ray disc drives, redundant Array of Independent Disks (RAID) systems, solid-state storage devices (such as flash memory devices), and DVD drives.

The machine-executable instructions 3332, which may be implemented by the machine-readable instructions of fig. 29-31, may be stored in the mass storage device 3328, in the volatile memory 3314, in the non-volatile memory 3316, and/or on a removable non-transitory computer-readable storage medium such as a CD or DVD.

From the foregoing, it will be apparent that example systems, methods, apparatus, and articles of manufacture have been disclosed to improve boot performance. The disclosed systems, methods, apparatus, and articles of manufacture provide an Application Programming Interface (API) based on software and/or firmware to process instructions from applications running on an operating system, virtual Machine Manager (VMM), etc., and instruct microcode to configure a processing unit to be able to execute the instructions, regardless of the structure of the instructions. Thus, examples disclosed herein are able to combine smaller resources to execute code designed for larger resources without requiring instructions to be constructed for the smaller resources. In this way, an application may generate an instruction, and examples disclosed herein may determine whether and/or how to execute the instruction in view of constraints of a computing system.

Apparatus, article of manufacture, and method for configurable machine learning computing nodes

The computational workload may be performed by using a machine learning model. Machine learning models, such as neural networks, are useful tools that have proven valuable in solving complex problems with pattern recognition, natural language processing, automatic speech recognition, and the like. Identifying the optimal combination of hardware and/or software (e.g., machine learning models) to perform the computational workload is complex because the available types of hardware and/or machine learning models and their customization(s) are wide-ranging.

Automated machine learning (AutoML) provides techniques to improve the accessibility and usability of machine learning (ML) for various applications and use cases. AutoML is a process that automates the application of ML to tasks and workloads. For example, AutoML may be used to automate the selection, composition, and parameterization of ML models. In some such examples, AutoML may be used throughout the ML pipeline, from receiving a raw dataset to generating a deployable machine learning model.

Some AutoML methods may select an ML model (e.g., an ML model that performs a workload) based on a hardware search space and/or a software search space. As used herein, a "hardware search space" is a space or collection of viable hardware, configurations of hardware, etc., and/or combination(s) thereof, in which a desired hardware configuration exists to execute the ML model. For example, an AutoML system may evaluate various types of ML models based on the configuration of hardware included in a hardware search space. As used herein, a "software search space" is a space or collection of feasible ML models, configurations of ML models, etc., and/or combination(s) thereof, in which a desired software configuration exists to perform a workload (e.g., a computational workload, an ML task, an ML operation, etc.). For example, an AutoML system may evaluate various types of ML models based on ML models and/or configurations of ML models included in a software search space.
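As a minimal sketch of these two definitions, each search space can be treated as the set of all combinations of a few configuration options. The option names and values below (`device`, `num_compute_units`, `model_type`, `num_layers`) are illustrative assumptions and do not appear in the disclosure.

```python
from itertools import product

# Hypothetical option lists; names and values are illustrative assumptions only.
HARDWARE_OPTIONS = {
    "device": ["CPU", "GPU", "FPGA"],
    "num_compute_units": [4, 16],
}
SOFTWARE_OPTIONS = {
    "model_type": ["CNN", "RNN"],
    "num_layers": [8, 16, 32],
}

def enumerate_space(options):
    """Return every configuration in the Cartesian product of the options."""
    keys = list(options)
    return [dict(zip(keys, values)) for values in product(*(options[k] for k in keys))]

hardware_search_space = enumerate_space(HARDWARE_OPTIONS)
software_search_space = enumerate_space(SOFTWARE_OPTIONS)

# A joint HW/SW candidate pairs one hardware configuration with one software configuration.
joint_candidates = [(hw, sw) for hw in hardware_search_space for sw in software_search_space]
```

In this toy setting each space holds six configurations, so a joint search would consider 36 hardware/software pairs. Real search spaces are far too large to enumerate, which is why a guided search is used instead.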

Some AutoML methods may use a single, inflexible hardware template (e.g., CPU, GPU, FPGA, etc.) to express the hardware search space, which an AutoML system may use to identify ML models to execute workloads of interest. For example, a hardware template may be inflexible in that the interconnect topology of the hardware may be fixed and/or otherwise non-configurable. Some such AutoML methods may evaluate different types of ML models and/or configurations of ML models based on a single type of hardware. In some such examples, the type of hardware may have weaknesses when instantiating particular one(s) of the ML models. Thus, one (or more) of the ML models may not be selected for a particular type of ML workload based on the hardware type being evaluated. In some such examples, one (or more) of the ML models may be efficient when executing a particular type of ML workload on different hardware, but the AutoML system may not select that one (or more) of the ML models because of the inefficiency of the underlying hardware type on which that one (or more) of the ML models is being evaluated.

Some AutoML methods may use a single, inflexible software template (e.g., type of neural network, configuration of neural network, etc.) to express a software search space, which an AutoML system may use to identify ML models to perform workloads of interest. Some such AutoML methods may evaluate execution of the workload(s) based on a single type of ML model. In some such examples, the ML model may have weaknesses in executing a particular type of workload. Thus, one (or more) of the ML models may not be selected for a particular type of ML workload. In some such examples, one (or more) of the ML models may be efficient in executing a particular type of ML workload, but the AutoML system may not select that one (or more) of the ML models because of the inefficiency of the inflexible configuration of the software search space on which that one (or more) of the ML models is being evaluated.

The co-development of artificial intelligence/machine learning (AI/ML) models and the hardware on which they are executed and/or instantiated is beneficial for obtaining highly efficient solutions. However, this co-development requires many slow manual iterations by interdisciplinary human experts in hardware design and AI/ML algorithms. Recently, AutoML methods such as those described above have been proposed to reduce human design effort by performing automatic AI/ML hardware/software (HW/SW) co-design. However, as described above, existing AutoML methods lack the flexibility in hardware and software design needed to unlock the real potential of AI/ML HW/SW co-design. For example, existing AutoML methods typically use a single fixed hardware architecture template that is based on a fixed set of modules and connectivity, with a fixed set of low-level design parameters (e.g., buffer size, number of computing units, etc.) for each module. Thus, the hardware design search space is restricted to a limited set of instances from only a single hardware architecture style. The software search space is similarly limited. In neural network searching, the search space typically covers a single class of networks (e.g., only the recurrent neural network (RNN) class or only the convolutional neural network (CNN) class).

Examples disclosed herein include apparatuses, articles of manufacture, and methods for a configurable machine learning computing node. In some disclosed examples, incorporating hardware and software heterogeneity into an AutoML search can potentially discover new models (e.g., AI/ML models) that take advantage of different computing platforms (e.g., branching and heavy control flow on a CPU, massively parallel layers on a GPU, custom new layers on an FPGA, etc.) to generate a machine learning system based on configurable, modular building blocks of hardware and/or software.

Examples disclosed herein include an expressive search space representation that covers multiple templates of hardware and software architectures. In some disclosed examples, these templates may be dynamically modifiable during the HW/SW co-design search. Advantageously, the expressive search space enables the HW/SW co-design system to explore a much larger, richer HW/SW design space that spans multiple architectural styles. In some disclosed examples, one (or more) of the architectural styles may be flexible in their respective sets of modules and connectivity (e.g., selection and/or configuration of connections, topologies, inputs/outputs, etc.). In some such disclosed examples, the set of modules and connectivity may be formed from configurable building blocks. Advantageously, the examples disclosed herein increase the likelihood of discovering more efficient hardware architecture instances and their corresponding co-designed software as compared to previous AutoML approaches, because the examples disclosed herein provide much larger HW/SW search space(s) and configurable version(s) thereof.

Examples disclosed herein include a set of hardware architecture templates and a set of software architecture templates. Advantageously, the hardware and software templates may be based on a palette of configurable architectural building blocks, each of which may have a set of microarchitectural parameters. In some disclosed examples, the microarchitectural parameters may be searchable to enhance the granularity of the AutoML search. Advantageously, the example hardware and software templates are not limited to a predefined set of modules and their fixed connectivity, as are some templates used in previous AutoML methods. In some disclosed examples, the configurable architectural building blocks may be flexibly combined, added, removed, modified, and/or mutated based on a set of design rules (e.g., pre-specified design rules, design rules specified dynamically or on the fly, etc.) to create a large number of new HW/SW architecture instances. In some disclosed examples, the formal and precise semantics and interfaces of the example hardware and software templates allow for automated searching of the HW/SW design space in an AutoML framework, as well as easy extension of the HW/SW block palette with new user-specified and/or machine-specified blocks.
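The idea of mutating a template built from a palette of parameterized blocks, subject to design rules, can be sketched as follows. The palette contents, the parameter names, and the single design rule (at least one compute block plus a scratchpad) are assumptions made for illustration; the disclosure does not specify them.

```python
import copy
import random

# Hypothetical palette: each block lists the allowed values of its microarchitectural parameters.
PALETTE = {
    "systolic_array": {"rows": [8, 16, 32], "cols": [8, 16, 32]},
    "vector_unit": {"lanes": [4, 8, 16]},
    "scratchpad": {"size_kib": [64, 128, 256]},
}

def design_rule_ok(template):
    """Assumed design rule: a template needs at least one compute block and one memory block."""
    names = {block["name"] for block in template}
    return bool(names & {"systolic_array", "vector_unit"}) and "scratchpad" in names

def random_block(rng, name=None):
    """Instantiate a block from the palette with randomly chosen parameter values."""
    name = name or rng.choice(sorted(PALETTE))
    return {"name": name, "params": {k: rng.choice(v) for k, v in PALETTE[name].items()}}

def mutate(template, rng):
    """Add, remove, or re-parameterize one block; keep the original if the rule is violated."""
    candidate = copy.deepcopy(template)
    op = rng.choice(["add", "remove", "modify"])
    if op == "add":
        candidate.append(random_block(rng))
    elif op == "remove" and len(candidate) > 1:
        candidate.pop(rng.randrange(len(candidate)))
    else:
        idx = rng.randrange(len(candidate))
        candidate[idx] = random_block(rng, candidate[idx]["name"])
    return candidate if design_rule_ok(candidate) else template

rng = random.Random(0)
template = [random_block(rng, "systolic_array"), random_block(rng, "scratchpad")]
for _ in range(20):
    template = mutate(template, rng)
```

Because every mutation is checked against the rule before it is accepted, each template produced this way remains a valid architecture instance while its set of blocks and their parameters change.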

Examples disclosed herein include evolving multiple sets of related configurable building blocks simultaneously, where each set of building blocks may cover different architectural categories and design styles. For example, in a hardware search space, an AI/ML processor architecture with a systolic-array-based design style may suit computationally intensive AI/ML models but not memory-bound, less computationally intensive workloads. Thus, examples disclosed herein may simultaneously evolve HW architectures having different architectural design styles to allow the AI/ML model to evolve flexibly to achieve improved software accuracy and hardware efficiency during the co-design process. Similarly, as an example, in a software search space (e.g., a neural network software search space), there are multiple classes of networks that have their own beneficial attributes (e.g., CNN, RNN, Transformer, etc.) and configurable building blocks (e.g., matrix-vector operations for RNNs, convolutions for CNNs, etc.). Advantageously, examples disclosed herein may build an improved HW/SW solution based on a configurable ML compute node to perform a workload with less development effort than previous AutoML methods.

FIG. 34 is an illustration of an example AutoML architecture 3400 that includes an example machine learning (ML) system configurator 3402 to identify and/or generate configurable ML computing nodes. The AutoML architecture 3400 includes an ML system configurator 3402 to generate a hardware search space and/or a software search space based on a computing task or workload (e.g., an artificial intelligence/machine learning (AI/ML) computing task or workload). The ML system configurator 3402 may identify hardware or portion(s) thereof from the hardware search space. The ML system configurator 3402 may also discover and/or otherwise identify software (e.g., AI/ML models), or portion(s) thereof, from the software search space. In some examples, ML system configurator 3402 may individually and/or simultaneously evolve a configurable ML computing node by iterating (i) the architecture and/or type of hardware and/or software and/or (ii) the configuration(s) of hardware and/or software. For example, the ML system configurator 3402 may evolve the configurable ML computing nodes by evaluating hardware and/or software when executing a workload and/or based on simulations of hardware and/or software executing a workload. In some such examples, the configurable ML computing node may be configurable in that hardware and/or software components may be selected and assembled in various combinations to meet specific or predefined requirements (e.g., accuracy requirements, latency requirements, throughput requirements, etc.). In some such examples, in response to identifying a particular combination of hardware and/or software that meets a particular or predefined requirement, the ML system configurator 3402 may output the combination as a configurable ML computing node to execute the workload of interest.
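The iterate-evaluate-output loop described above can be illustrated with a small greedy search. This is only a sketch under stated assumptions: the `evaluate` function is a toy stand-in for executing or simulating a workload, and the choice lists, speedup factors, base accuracies, and the weighting in `gap` are all hypothetical values, not taken from the disclosure.

```python
# Hypothetical choices and toy formulas; none of these values come from the disclosure.
CHOICES = {
    "device": ["CPU", "GPU", "FPGA"],   # type of hardware
    "units": [1, 2, 4, 8],              # configuration of hardware
    "model": ["RNN", "CNN"],            # type of software
    "layers": [4, 8, 16],               # configuration of software
}
DEVICE_SPEEDUP = {"CPU": 1, "FPGA": 2, "GPU": 4}
BASE_ACCURACY = {"RNN": 0.65, "CNN": 0.70}

def evaluate(node):
    """Toy stand-in for a simulator: return (accuracy, latency_ms)."""
    accuracy = min(0.99, BASE_ACCURACY[node["model"]] + 0.015 * node["layers"])
    latency_ms = 2.0 * node["layers"] / (node["units"] * DEVICE_SPEEDUP[node["device"]])
    return accuracy, latency_ms

def gap(node, min_accuracy, max_latency_ms):
    """Weighted shortfall against the requirements; 0.0 means both are met."""
    accuracy, latency_ms = evaluate(node)
    return max(0.0, min_accuracy - accuracy) + 0.01 * max(0.0, latency_ms - max_latency_ms)

def neighbors(node):
    """Yield every node that differs from `node` in exactly one field."""
    for key, values in CHOICES.items():
        for value in values:
            if value != node[key]:
                yield {**node, key: value}

def search(min_accuracy, max_latency_ms):
    """Greedy iteration: take the best single-field change until the requirements are met."""
    node = {"device": "CPU", "units": 1, "model": "RNN", "layers": 4}
    while gap(node, min_accuracy, max_latency_ms) > 0.0:
        best = min(neighbors(node), key=lambda n: gap(n, min_accuracy, max_latency_ms))
        if gap(best, min_accuracy, max_latency_ms) >= gap(node, min_accuracy, max_latency_ms):
            return None  # stuck: no single change improves the node
        node = best
    return node

node = search(min_accuracy=0.90, max_latency_ms=5.0)
```

The returned `node` is the first hardware/software combination found that satisfies both requirements. A greedy walk is used here only because it is short and deterministic; an evolutionary or reinforcement-learning search could take its place.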

The example AutoML architecture 3400 includes an example optimized application 3404, example optimized middleware and framework 3406, and example application programming interfaces (APIs) 3408. In some examples, the optimized application 3404 may be implemented by an application (e.g., a software application, a web or browser-based application, etc.) that is customized, tailored, and/or otherwise optimized to enable identification and/or generation of configurable ML computing nodes. For example, the optimized application 3404 may be accessed, utilized, etc. by a developer (e.g., software developer, researcher, etc.), information technology (IT) personnel, etc. In some such examples, the optimized application 3404 may be accessed, utilized, etc. to co-design hardware/software (HW/SW) solutions to solve technical problems that may benefit from AI/ML technology. In some examples, optimized middleware and framework 3406 may be implemented by middleware and framework that are customized, tailored, and/or otherwise optimized to enable identification and/or generation of configurable ML compute nodes. For example, the optimized middleware and framework 3406 may implement interfaces (e.g., communications, connectivity, etc.) between the optimized applications 3404 and the APIs 3408.

The API 3408 of the illustrated example may be invoked to program, develop, and/or otherwise generate an AI/ML application by at least one of direct programming or API-based programming. The APIs 3408 of the illustrated example include an example migration tool 3410, an example direct programming API 3412, an example API-based programming API 3414, and an example analysis tool 3416.

In some examples, the migration tool 3410 may be implemented by software (e.g., a software application) that may adapt a program for execution in a first computing or electronic environment that differs from a second computing or electronic environment for which the program was originally designed. For example, the migration tool 3410 may convert and/or otherwise adapt a first program developed for a first type of hardware, operating system (OS), library, etc. to a second program for a second type of hardware, OS, library, etc.

In some examples, the direct programming API 3412 may be invoked to implement a direct programming task, which may include developing and/or compiling a Data Parallel C++ application. In some examples, the API-based programming API 3414 may be invoked to implement API-based programming, which may include developing and/or compiling applications that call (or invoke, instantiate, etc.) a Math Kernel Library (MKL), an MKL Deep Neural Network (DNN) library, a data analytics acceleration library, a thread building block library, a parallel standard template library, a media software development kit (SDK), a deep learning deployment toolkit, a machine learning scaling library, etc., and/or any combination(s) thereof.

In some examples, the analysis tool 3416 may be invoked, instantiated, and/or otherwise executed to analyze hardware, software, and/or configuration(s) thereof of the configurable ML computing node. For example, the analysis tool 3416 may instantiate simulator(s) to simulate all hardware and/or software features of the configurable ML computing node to generate and/or otherwise output one or more evaluation parameters. In some such examples, the evaluation parameters may include parameters that represent and/or otherwise indicate the accuracy, latency, number of cycles to complete the workload, or throughput of the configurable ML compute node. In some examples, the evaluation parameters may include parameters that represent and/or otherwise indicate: processor or clock frequency, fabric frequency, read memory bandwidth, write memory bandwidth, hardware throttling factor, number of memory ports, number of data processing units (DPUs), number of model layers (e.g., neural network layers, convolutional layers, etc.), activation precision (e.g., precision of activation values to be processed), weight precision (e.g., precision of weight values to be processed), and/or the like, and/or any combination(s) thereof. For example, the analysis tool 3416 may execute a simulator based on the configurable ML computing node. In some such examples, the analysis tool 3416 may execute a simulator to determine the throughput of the configurable ML computing node when executing a particular AI/ML model with a particular configuration.

In some examples, the analysis tool 3416 may instantiate the simulator(s) to simulate behavior, configuration, etc. of the configurable ML computing node to generate and/or otherwise output one or more evaluation parameters. For example, the analysis tool 3416 may execute a model (e.g., a simulation model, an AI/ML model, etc.) based on the configurable ML computing nodes. In some such examples, the analysis tool 3416 may execute a model to estimate, predict, and/or otherwise determine the throughput of the configurable ML computing node when executing a particular AI/ML model with a particular configuration.
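To make the relationship between these evaluation parameters concrete, the following sketch estimates latency, cycle count, and throughput from clock frequency, DPU count, read memory bandwidth, and activation/weight precision. It assumes a simple analytical model in which each layer takes the longer of its compute time and its memory-read time; the model, the `macs_per_dpu_per_cycle` field, and the per-layer counts are illustrative assumptions rather than the simulator of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class NodeConfig:
    clock_hz: float              # processor clock frequency
    num_dpus: int                # number of data processing units
    macs_per_dpu_per_cycle: int  # assumed per-DPU compute width (hypothetical parameter)
    read_bw_bytes_per_s: float   # read memory bandwidth
    activation_bits: int         # activation precision
    weight_bits: int             # weight precision

@dataclass
class Layer:
    macs: int          # multiply-accumulate operations in the layer
    activations: int   # activation values read
    weights: int       # weight values read

def estimate(config, layers):
    """Return (latency_s, inferences_per_s, cycles) under a max(compute, memory) model."""
    total_s = 0.0
    peak_macs_per_s = config.clock_hz * config.num_dpus * config.macs_per_dpu_per_cycle
    for layer in layers:
        compute_s = layer.macs / peak_macs_per_s
        bits_read = layer.activations * config.activation_bits + layer.weights * config.weight_bits
        memory_s = (bits_read / 8) / config.read_bw_bytes_per_s
        total_s += max(compute_s, memory_s)  # the slower resource bounds the layer
    return total_s, 1.0 / total_s, total_s * config.clock_hz

config = NodeConfig(clock_hz=1e9, num_dpus=4, macs_per_dpu_per_cycle=250,
                    read_bw_bytes_per_s=2e9, activation_bits=8, weight_bits=8)
layers = [
    Layer(macs=2_000_000_000, activations=1_000_000, weights=1_000_000),  # compute-bound
    Layer(macs=1_000_000_000, activations=2_000_000, weights=6_000_000),  # memory-bound
]
latency_s, throughput, cycles = estimate(config, layers)
```

With these numbers the first layer is limited by compute (2 ms) and the second by memory reads (4 ms), giving an estimated latency of 6 ms per inference. Lowering the weight precision would shorten only the memory-bound layer, which is the kind of trade-off a co-design search can exploit.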

The AutoML architecture 3400 of the illustrated example includes different types of hardware and/or software that may be used to generate a configurable ML compute node. In the illustrated example, the AutoML architecture 3400 includes interface and target system software for scalar, vector, matrix, and spatial hardware. Additionally and/or alternatively, any other type of hardware may be used. In this example, scalar hardware is implemented by the example CPU 3418 and the example CPU system software 3420. For example, the CPU system software 3420 may include instructions corresponding to a CPU instruction set architecture (ISA). In this example, vector hardware is implemented by an example GPU 3422 and example GPU system software 3424. For example, GPU system software 3424 may include kernels, portion(s) of code, etc., such as compute kernels and/or shaders. In some examples, the kernels, portion(s) of code, etc. may be expressed in a high-level programming language, such as High-Level Shader Language (HLSL), OpenCL, etc.

In this example, the matrix hardware is implemented by the example AI processor 3426 and the example AI system software 3428. For example, the AI system software 3428 can include one or more AI/ML algorithms, models, etc., such as neural networks (e.g., convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), etc.), linear regression models, logistic regression models, decision tree models, learning vector quantization models, etc., and/or combination(s) thereof. In this example, the spatial hardware is implemented by an example FPGA 3430 and example FPGA system software 3432. For example, FPGA system software 3432 can include kernels, portion(s) of code, and the like, which are based on a hardware description language (HDL), such as Verilog.

The ML system configurator 3402 of the illustrated example may interface with the CPU 3418 and/or CPU system software 3420 via the example host interface 3434. The ML system configurator 3402 of the illustrated example may interface with the GPU 3422, GPU system software 3424, AI processor 3426, AI system software 3428, FPGA 3430, and/or FPGA system software 3432 via the example zero-level interface 3436.

In the illustrated example, the CPU system software 3420, GPU system software 3424, AI system software 3428, FPGA system software 3432, host interface 3434, and/or zero-level interface 3436 may correspond to and/or otherwise implement example below-zero-level system software 3436. For example, the below-zero-level system software 3436 may correspond to and/or otherwise be implemented as a low-level, direct-to-metal interface tailored to hardware such as the CPU 3418, GPU 3422, and the like.

In the illustrated example, the API 3408 may implement example above-zero-level system software 3440 and an example developer interface 3442. For example, a developer, user, etc. may access and/or otherwise utilize the AutoML architecture 3400 by way of the API 3408. In some examples, a developer, user, etc. may access and/or otherwise utilize system software at a higher level than the low-level, direct-to-metal interfaces by way of the API 3408. In some examples, a developer, user, etc. can access and/or otherwise utilize the below-zero-level system software 3436 via the host interface 3434 and/or the zero-level interface 3436.

FIG. 35 is a block diagram of an example implementation of the ML system configurator 3402 of FIG. 34. The ML system configurator 3402 includes an example controller 3502, an example evaluator 3504, an example ontology generator 3506, and an example ontology database 3508.

In the illustrated example, the ontology database 3508 includes a plurality of example configurable building block databases 3510. In the illustrated example, the configurable building block database 3510 includes example software templates 3512 and hardware templates 3514. For example, the configurable building block database 3510 may include a first configurable building block database, which may include a first software template (identified by "SW template 34") of the software templates 3512. In some such examples, the first software template may include one or more CNNs, configuration(s) thereof, and/or metadata. For example, the metadata may describe the operation of the CNN, different configurations and/or capabilities of the CNN, aspects of the CNN that may be modified or mutated, and so on. In some examples, the first software template may expose and/or otherwise make available aspects, configurations, interconnections, etc. of the CNN, which may be adjusted, changed, modified, mutated, etc. In some examples, the configurable building block database 3510 may include a second configurable building block database that may include a second software template (identified by "SW template 35") of the software templates 3512, a third configurable building block database that may include a third software template (identified by "SW template N") of the software templates 3512, and so on. In the illustrated example, the second software template may include one or more RNNs and/or configuration(s) thereof. In the illustrated example, the third software template may include one or more transformers and/or configuration(s) thereof. Additionally and/or alternatively, any other type of AI/ML model and/or configuration(s) thereof may be included in the configurable building block database 3510.

In some examples, the configurable building block database 3510 may include database(s) and/or template(s) from example contributors 3513. For example, contributors 3513 can be users, developers, researchers, and so forth. The contributors 3513 of the illustrated example can upload and/or otherwise provide database(s), template(s), etc., to the example repository 3515. In some examples, contributors 3513 may include metadata in the database(s), template(s), and/or the like, that provides an indication of the configurability of the hardware and/or software of the template(s). In the illustrated example, repository 3515 is an application repository (e.g., an app store) that can be accessed by ML system configurator 3402 for use in composing, generating, instantiating, etc., an example ML compute node 3517. For example, ML compute node 3517 may implement a configurable ML compute node. The ML compute node 3517 of the illustrated example contains example software 3519 and example hardware 3521. For example, software 3519 can be implemented by one or more AI/ML models. In some examples, hardware 3521 may be implemented by one or more CPUs (or portion(s) thereof), one or more GPUs (or portion(s) thereof), one or more AI processors (or portion(s) thereof), one or more FPGAs (or portion(s) thereof), one or more ASICs (or portion(s) thereof), and/or the like, and/or any combination(s) thereof.

In the illustrated example, the configurable building block database 3510 may include a fourth configurable building block database, which may include a first hardware template (identified by "HW template 34") of the hardware templates 3514. In some such examples, the first hardware template may include one or more FPGAs (e.g., one or more architectures, manufacturer models, types, etc., of FPGAs) and/or configuration(s) thereof. For example, a hardware template may expose and/or otherwise make available aspects, configurations, interconnections, etc. of an FPGA, which may be adjusted, changed, modified, mutated, etc. In some examples, the configurable building block database 3510 may include a fifth configurable building block database that may include a second hardware template (identified by "HW template 35"), a sixth configurable building block database that may include a third hardware template (identified by "HW template N"), and so on. In the illustrated example, the second hardware template may include one or more GPUs (e.g., one or more architectures, manufacturer models, types, etc. of GPUs) and/or configuration(s) thereof. In the illustrated example, the third hardware template may include one or more CPUs (e.g., one or more architectures, manufacturer models, types, etc., of the CPUs) and/or configuration(s) thereof. Additionally and/or alternatively, any other type of hardware and/or configuration(s) thereof may be included in the configurable building block database 3510.

In example operation, the controller 3502 may receive, obtain, and/or otherwise identify example workload(s) 3516 (e.g., AI/ML workload(s)). For example, workload(s) 3516 may be scientific simulation, financial analysis, AI/deep learning, 3D modeling and analysis, image and audio/video processing, cryptography, data compression, and the like. In the illustrated example, the controller 3502 can generate an example software search space 3518 and an example hardware search space 3520 based on the workload(s) 3516.

In some examples, the controller 3502 can generate the software search space 3518 and the hardware search space 3520 in response to a query to the ontology generator 3506 for HW/SW solutions of previous AutoML searches corresponding to the workload(s) 3516. For example, the controller 3502 may query the ontology generator 3506 with an identifier corresponding to the workload(s) 3516, an initial or seed AI/ML model that may execute the workload(s) 3516, and so on. In some such examples, the ontology generator 3506 can identify an association of an initial or seed AI/ML model with another AI/ML model in the ontology database 3508. For example, ontology generator 3506 can track and learn from previous searches, runs of the ML system configurator 3402, and so forth. In some examples, ontology generator 3506 may search for such previous searches, runs, etc. in ontology database 3508. For example, the ontology database 3508 can store learnings, mappings, etc. from previous searches that are associated with the software templates 3512 and/or the hardware templates 3514 across hardware and/or software domains. In some examples, the previous search may correspond to a search for a previous workload. In some examples, the previous search may correspond to an iteration of the search for workload(s) 3516. Advantageously, the controller 3502 can utilize the ontology generator 3506 to identify fine-grained configurable building blocks to mix and match for dynamic, flexible template generation when generating the software search space 3518 and the hardware search space 3520.
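The query flow described above (workload identifier and seed model in, candidate templates out) might be sketched as follows. The `OntologyDatabase` class, its method names, and the scoring scheme are hypothetical constructs for illustration; the disclosure does not define a data model for the ontology database 3508.

```python
from collections import defaultdict

class OntologyDatabase:
    """Hypothetical in-memory store of results from previous searches."""

    def __init__(self):
        self._runs = defaultdict(list)     # workload id -> [(score, sw_template, hw_template)]
        self._related = defaultdict(set)   # model name -> associated model names

    def record(self, workload_id, sw_template, hw_template, score):
        """Store the outcome of one previous search or search iteration."""
        self._runs[workload_id].append((score, sw_template, hw_template))

    def associate(self, model_a, model_b):
        """Record a learned association between two AI/ML models."""
        self._related[model_a].add(model_b)
        self._related[model_b].add(model_a)

    def query(self, workload_id, seed_model=None, top_k=2):
        """Return seed software and hardware templates for building the search spaces."""
        best = sorted(self._runs.get(workload_id, []), reverse=True)[:top_k]
        software = list(dict.fromkeys(sw for _, sw, _ in best))  # de-duplicate, keep order
        hardware = list(dict.fromkeys(hw for _, _, hw in best))
        if seed_model is not None:
            for model in [seed_model, *sorted(self._related.get(seed_model, ()))]:
                if model not in software:
                    software.append(model)
        return software, hardware

db = OntologyDatabase()
db.record("image-classification", "CNN", "GPU", 0.91)
db.record("image-classification", "CNN", "FPGA", 0.88)
db.record("image-classification", "RNN", "CPU", 0.60)
db.associate("CNN", "Transformer")
software, hardware = db.query("image-classification", seed_model="CNN")
```

Here the two best previous results seed the hardware side with GPU and FPGA templates, while the stored association widens the software side from the CNN seed to include a Transformer template.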

Advantageously, the controller 3502 can provide an expressive search space representation (e.g., software search space 3518, hardware search space 3520, etc.) that covers multiple templates of hardware and software architecture (e.g., software templates 3512, hardware templates 3514, etc.), wherein the templates can be dynamically modifiable during a HW/SW co-design search. Advantageously, the controller 3502 may enable the HW/SW co-design system (which may be implemented by the ML system configurator 3402) to explore a much larger and richer HW/SW design space, spanning multiple architectural styles. In some examples, one (or more) of the architectural styles corresponding to the software template 3512 and/or hardware template 3514 can be flexible in their respective sets of modules and connectivity (e.g., selection and/or configuration of connections, topologies, inputs/outputs, etc.). In some such examples, the set of modules and connectivity may be formed by configurable building blocks that may be included in software template 3512 (e.g., configurable software building blocks in software template 3512) and/or hardware template 3514 (e.g., configurable hardware building blocks in hardware template 3514). Advantageously, the controller 3502 and/or more generally the ML system configurator 3402 may increase the likelihood of discovering more efficient hardware architecture instances and their corresponding co-designed software as compared to previous AutoML approaches, as the controller 3502 of the illustrated example may utilize much larger HW/SW search space(s) and configurable version(s) thereof.

In some examples, the controller 3502, the evaluator 3504, the ontology generator 3506, and the like, and/or more generally, the ML system configurator 3402 may utilize artificial intelligence and/or machine learning techniques to identify and/or otherwise generate the ML computing node 3517 to execute the workload(s) 3516. Artificial intelligence (AI), including machine learning (ML), deep learning (DL), and/or other artificial machine-driven logic, enables a machine (e.g., a computer, logic circuitry, etc.) to process input data using a model to generate an output based on patterns and/or associations that the model previously learned via a training process (e.g., a machine learning training process). For example, the controller 3502, the evaluator 3504, the ontology generator 3506, and/or the like, and more generally the ML system configurator 3402, may be trained with data to identify patterns and/or associations, and follow such patterns and/or associations in processing input data, such that other input(s) result in output(s) consistent with the identified patterns and/or associations.

Many different types of machine learning models and/or machine learning architectures exist. In some examples, ML system configurator 3402 generates software 3519 as neural network model(s). Advantageously, using a neural network model enables hardware 3521 and/or, more generally, ML computing node 3517 to perform AI/ML workloads. In general, machine learning models/architectures suitable for use in the example methods disclosed herein include reinforcement learning networks. However, other types of machine learning models, such as recurrent neural networks (RNNs), supervised learning artificial neural network (ANN) models, clustering models, classification models, and the like, and/or combinations of these, may additionally or alternatively be used. An example supervised learning ANN model may include a two-layer radial basis neural network (RBN), a learning vector quantization (LVQ) classification neural network, and so on. Example clustering models may include k-means clustering, hierarchical clustering, mean-shift clustering, density-based clustering, and the like. Example classification models may include logistic regression, support vector machines or networks, naive Bayes, and so forth. In some examples, ML system configurator 3402 may compile and/or otherwise generate software 3519 as lightweight machine learning model(s).

In general, implementing an ML/AI system involves two phases: a learning/training phase and an inference phase. In the learning/training phase, the ML system configurator 3402 is trained using a training algorithm to operate according to patterns and/or associations based on, for example, training data. In general, the ML system configurator 3402 includes internal parameters that direct how input data is transformed into output data, such as by a series of nodes and connections within the ML system configurator 3402. Further, hyperparameters are used as part of the training process to control how learning is performed (e.g., learning rate, number of layers to be used in a machine learning model, etc.). Hyperparameters are defined as training parameters that are determined before initiating the training process. In some examples, the hyperparameters that control model performance and training speed may include the learning rate, number of epochs, topology of the neural network, size of the neural network, and/or regularization parameter(s). Such hyperparameters are selected, for example, by trial and error, to achieve optimal model performance. In some examples, retraining may be performed. Such retraining may be performed in response to the user's override(s).
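The distinction between internal parameters (learned during training) and hyperparameters (fixed before training and chosen by trial and error) can be shown with a deliberately small example. The sketch below fits a one-variable linear model by gradient descent; the model, the data, and the candidate hyperparameter values are assumptions chosen for illustration and are unrelated to how the ML system configurator 3402 is actually trained.

```python
def train(xs, ys, learning_rate, epochs):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # internal parameters, learned from the training data
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def mse(w, b, xs, ys):
    """Mean squared error of the fitted line on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1

# Hyperparameters are fixed before each training run and chosen here by trial and error.
trials = [(lr, ep) for lr in (0.001, 0.05) for ep in (10, 2000)]
best_lr, best_epochs = min(trials, key=lambda h: mse(*train(xs, ys, *h), xs, ys))
w, b = train(xs, ys, best_lr, best_epochs)
```

Only the combination of the larger learning rate and the larger epoch count converges to the underlying line in this example, so the trial-and-error loop selects it.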

Based on the type and/or expected output of the ML/AI model, different types of training may be performed. Reinforcement learning, for example, involves a machine, an agent, etc., interacting with an environment, performing actions, and learning through trial and error. In other examples, supervised training uses inputs and corresponding expected (e.g., labeled) outputs to select parameters for the AI/ML model that reduce model error (e.g., by iterating over combinations of selected parameters). As used herein, labeling refers to an expected output (e.g., a classification, an expected output value, etc.) of the machine learning model. Alternatively, unsupervised training (e.g., as used in deep learning, a subset of machine learning, etc.) involves inferring patterns from the inputs to select parameters for the ML/AI model (e.g., without the benefit of expected (e.g., labeled) outputs). Additionally and/or alternatively, any other training technique may be used, such as stochastic gradient descent, simulated annealing, particle swarm optimization, evolutionary algorithms, genetic algorithms, and/or nonlinear conjugate gradients.

Once training is complete, ML system configurator 3402 is deployed to serve as an executable construct that processes inputs and provides outputs based on the network of nodes and connections defined in the model. For example, the ML system configurator 3402 may be operated in an inference phase to process data. In the inference phase, the data to be analyzed (e.g., live data, workload(s) 3516, etc.) is input to the ML system configurator 3402, and the ML system configurator 3402 executes to create an output. This inference phase can be thought of as the AI "thinking" to generate an output based on what it learned from training, reinforcement learning, and so forth. In some examples, the input data undergoes preprocessing before being used as input to the ML system configurator 3402. Further, in some examples, the output data may undergo post-processing after it is generated by the ML system configurator 3402 to transform the output into useful results (e.g., compilation of software 3519, generation of a configuration file associated with hardware 3521, etc.).

In some examples, the ML system configurator 3402 of the illustrated examples can be stored in a memory of one or more computing systems or in a database of one or more remote computing systems. The ML system configurator 3402 may then be executed by the one or more computing systems or one or more disparate computing systems.

In the illustrated example, ML system configurator 3402 may use reinforcement learning to organize and/or otherwise cause compilation of ML compute nodes 3517. However, any other AI/ML algorithm or technique may additionally or alternatively be used. In some examples, ML system configurator 3402 may iteratively generate proposed HW/SW instances 3522 until the error level is no longer decreasing and/or otherwise meets a threshold (e.g., accuracy threshold, training threshold, etc.). As used herein, a "threshold" is expressed as data, such as a numerical value expressed in any form, that can be used by the processor circuit as a reference for a comparison operation. As used herein, data is any form of information that may be ingested, processed, interpreted, and/or otherwise manipulated by a processor circuit to produce a result. The results generated may themselves be data. As used herein, a model is a set of instructions and/or data that may be ingested, processed, interpreted, and/or otherwise manipulated by a processor circuit to produce a result. Typically, the model is operated on using input data to produce output data according to one or more relationships reflected in the model. The model may be based on training data.
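The stopping rule described above (iterate until the error level no longer decreases and/or meets a threshold) can be sketched as follows. This is a minimal illustration only: the function name `iterate_until_converged`, the `evaluate_error` callback, and the `max_iterations` cap are assumptions introduced for the sketch and do not appear in the disclosure.

```python
from typing import Callable, List


def iterate_until_converged(
    evaluate_error: Callable[[int], float],
    error_threshold: float,
    max_iterations: int = 100,
) -> List[float]:
    """Return the error history of successive proposed instances.

    evaluate_error(step) is a hypothetical stand-in for generating and
    evaluating the proposed HW/SW instance at iteration `step`.
    """
    errors: List[float] = []
    for step in range(max_iterations):
        error = evaluate_error(step)
        # Stop when the error level is no longer decreasing.
        if errors and error >= errors[-1]:
            break
        errors.append(error)
        # Stop when the error level meets the threshold.
        if error <= error_threshold:
            break
    return errors


# Example: the error halves, thirds, quarters, ... until it reaches 0.25.
history = iterate_until_converged(lambda step: 1.0 / (step + 1), error_threshold=0.25)

# Example: the error plateaus at 0.8, so iteration stops early.
plateau = iterate_until_converged(lambda step: [0.9, 0.8, 0.8, 0.1][step], error_threshold=0.0)
```

Here the comparison against `error_threshold` is the "comparison operation" for which the threshold serves as a reference; a real system could equally compare accuracy or reward values.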

In some examples, the ML system configurator 3402 utilizes Bayesian hyperparameter optimization to determine an optimal and/or otherwise improved or more efficient network and/or hardware architecture to avoid model overfitting and improve the overall applicability of the software 3519 and/or hardware 3521 of the ML computing node 3517. Alternatively, the ML system configurator 3402 may use any other type of optimization.

In an example operation, the controller 3502 can receive a history of previous runs of the ML system configurator 3402 for the type of workload(s) 3516 (or different types of workloads). The controller 3502 can generate the software search space 3518 by populating the software search space 3518 with one or more AI/ML models used in previous runs. In some examples, the controller 3502 can populate the software search space 3518 with one or more different types of AI/ML models based on the workload(s) 3516. In the illustrated example, the software search space 3518 includes one or more neural network (NN) algorithms and/or configuration(s) thereof. Additionally and/or alternatively, the software search space 3518 can include any other type of AI/ML model, algorithm, and the like. For example, the controller 3502 may discover and/or otherwise identify one or more RNNs, one or more transformers, etc. by examining and/or otherwise searching the configurable building block database 3510.

In example operations, the controller 3502 can generate the hardware search space 3520 by populating the hardware search space 3520 with one or more types of hardware used in previous runs and/or configuration(s) thereof. In some examples, the controller 3502 can populate the hardware search space 3520 with one or more different types of hardware based on the workload(s) 3516. In the illustrated example, the hardware search space 3520 includes one or more NN accelerators. Additionally and/or alternatively, the hardware search space 3520 can include any other type of hardware (e.g., one or more CPUs, one or more FPGAs, etc.).
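As a rough illustration of populating the two search spaces from a history of previous runs, the following sketch filters the history by workload type and collects the models and hardware that were used. The `PreviousRun` record layout, the `build_search_spaces` function, and the string labels are hypothetical and are introduced only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PreviousRun:
    """One historical run (hypothetical record layout)."""
    workload_type: str
    model: str      # e.g., "cnn"
    hardware: str   # e.g., "nn_accelerator"


def build_search_spaces(
    history: Iterable[PreviousRun], workload_type: str
) -> Tuple[List[str], List[str]]:
    """Populate software and hardware search spaces from matching runs."""
    matching = [run for run in history if run.workload_type == workload_type]
    # Sets remove duplicates; sorting makes the result deterministic.
    software_space = sorted({run.model for run in matching})
    hardware_space = sorted({run.hardware for run in matching})
    return software_space, hardware_space


history = [
    PreviousRun("image_classification", "cnn", "nn_accelerator"),
    PreviousRun("image_classification", "transformer", "gpu"),
    PreviousRun("speech_recognition", "rnn", "cpu"),
    PreviousRun("image_classification", "cnn", "gpu"),
]
software_space, hardware_space = build_search_spaces(history, "image_classification")
```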

In an example operation, the controller 3502 can generate an example proposed HW/SW instance 3522 and provide the proposed HW/SW instance 3522 to the evaluator 3504. In some examples, the proposed HW/SW instance 3522 may implement a candidate or proposed ML compute node. For example, the proposed HW/SW instance 3522 may be a configurable ML compute node implemented by an NN accelerator having a first hardware configuration and an NN algorithm having a first software configuration.

In example operations, the evaluator 3504 may execute the example performance modeling 3524 to generate and/or otherwise output the example evaluation parameters 3526. For example, evaluator 3504 may simulate, debug, etc. the proposed HW/SW instance 3522 to generate evaluation parameters 3526. For example, the evaluation parameter 3526 may be implemented by a value representing and/or otherwise indicating an evaluation metric of accuracy, latency, number of cycles to complete workload, or throughput of the proposed HW/SW instance 3522. In some examples, the evaluation parameters may represent and/or otherwise indicate a processor or clock frequency, an architecture frequency, a read memory bandwidth, a write memory bandwidth, a hardware throttling factor, a number of memory ports, a number of Data Processing Units (DPUs), a number of model layers (e.g., neural network layers, convolutional layers, etc.), an activation precision (e.g., a precision of an activation value to be processed), a weight precision (e.g., a precision of a weight value to be processed), etc., and/or any combination(s) of these associated with the proposed HW/SW instance 3522.

In some examples, the evaluator 3504 may perform and/or otherwise instantiate an analysis, a software simulation, a register transfer level (Register Transfer Level, RTL) simulation to verify correctness of digital Integrated Circuit (IC) operation, a simulation (e.g., NN accelerator simulator), and so forth. In some such examples, the evaluator 3504 may perform the performance modeling 3524 by simulating, debugging, etc. the NN accelerator having the first hardware configuration while the NN accelerator is executing the NN algorithm having the first software configuration. For example, the evaluator 3504 may instantiate a simulation of an NN accelerator executing an NN algorithm to output the evaluation parameters 3526. In some examples, the evaluator 3504 may instantiate a simulation of the NN accelerator executing the NN algorithm to determine the evaluation parameters 3526.

In an example operation, the evaluator 3504 may output an example reward function 3528. In some examples, the reward function 3528 may be implemented by a mathematical function that captures what is desired to be optimized (e.g., a mathematical function that includes a higher weight for throughput to optimize throughput) and what is desired to be penalized (e.g., a mathematical function that includes a lower weight for latency to optimize throughput at the expense of latency). For example, the reward function 3528 may include one or more outputs (e.g., evaluation parameters 3526) from the evaluator 3504. In some examples, the evaluator 3504 may generate the reward function 3528 to include at least a first output, e.g., accuracy, having a first weight and a second output, e.g., throughput, having a second weight. In some examples, the evaluation parameter 3526 can be implemented using the first output (and/or the first weight) and the second output (and/or the second weight). The evaluator 3504 may generate the first weight to be greater than the second weight to invoke and/or otherwise cause the controller 3502 to increase emphasis on increasing and/or otherwise optimizing accuracy and decrease emphasis on increasing and/or otherwise optimizing the second output. In some examples, in response to obtaining the reward function 3528, the controller 3502 can alter, modify, and/or otherwise adjust the proposed HW/SW instance 3522 to increase accuracy and decrease throughput based on the respective first and second weights of the first and second outputs of the reward function 3528. In some examples, the reward function 3528 may be the accuracy of the proposed HW/SW instance 3522 in executing the NN algorithm. In the illustrated example, the reward function 3528 may correspond to an evaluation result provided and/or otherwise fed back to the controller 3502 to update (e.g., iteratively update) the next version of the proposed HW/SW instance 3522.
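One simple way to realize such a weighted reward is a weighted sum of evaluation parameters, sketched below. The weighted-sum form, the `reward` function name, the metric names, and the assumption that each metric is normalized to the range [0, 1] are all illustrative choices; the disclosure does not prescribe a specific formula.

```python
from typing import Dict


def reward(evaluation_parameters: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of evaluation parameters; a larger weight means more emphasis."""
    return sum(weight * evaluation_parameters[name] for name, weight in weights.items())


# Two hypothetical candidates with metrics normalized to [0, 1] (assumption).
candidate_a = {"accuracy": 0.75, "throughput": 0.25}
candidate_b = {"accuracy": 0.5, "throughput": 1.0}

# First weight (accuracy) greater than second weight (throughput), and the reverse.
accuracy_first = {"accuracy": 4.0, "throughput": 1.0}
throughput_first = {"accuracy": 1.0, "throughput": 4.0}

prefers_a = reward(candidate_a, accuracy_first) > reward(candidate_b, accuracy_first)
prefers_b = reward(candidate_b, throughput_first) > reward(candidate_a, throughput_first)
```

With the accuracy weight larger, the more accurate candidate scores higher; swapping the weights reverses the preference, which is the emphasis shift the paragraph above describes.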

In an example operation, the controller 3502 can update the proposed HW/SW instance 3522 based on the reward function 3528. For example, the controller 3502 may change the manufacturer, model, configuration, etc. of the NN accelerator to maximize and/or otherwise increase the reward function 3528. In some such examples, the controller 3502 can modify the hardware interconnect(s) (e.g., input(s) and/or output(s)) of portion(s) of the NN accelerator, the configuration image (e.g., values of one or more configuration registers of the NN accelerator), and/or the like, and/or any combination(s) of these. Alternatively, the controller 3502 may replace the NN accelerator with a different type of hardware (e.g., a GPU). In some examples, the controller 3502 may modify the NN algorithm based on the reward function 3528. For example, the controller 3502 may change the number of layers of the NN algorithm, the value(s) of the activation(s) and/or weight(s), the interconnect(s) of the NN algorithm (e.g., input(s) and/or output(s)), and so forth. Alternatively, the controller 3502 may replace the NN algorithm with a different type of AI/ML algorithm, such as a transformer.

In some examples, the controller 3502, in response to the reward function 3528 being maximized and/or otherwise meeting a threshold, e.g., a reward threshold, can output the proposed HW/SW instance 3522 as the ML compute node 3517 to execute the workload(s) 3516. For example, the controller 3502 can compile the software portion of the proposed HW/SW instance 3522 into an executable construct (e.g., executable file, machine readable executable file, etc.) for execution on the hardware portion of the HW/SW instance 3522.
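The propose, evaluate, and update cycle with a reward threshold can be sketched as a loop over candidate software/hardware pairs. Note that this sketch swaps in a plain exhaustive search for the reinforcement learning controller described above, purely to keep the example short; `select_instance`, the `evaluate` callback, and the reward table are hypothetical.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

Instance = Tuple[str, str]  # (software configuration, hardware configuration)


def select_instance(
    software_space: Sequence[str],
    hardware_space: Sequence[str],
    evaluate: Callable[[str, str], float],
    reward_threshold: float,
) -> Tuple[Optional[Instance], float]:
    """Return the first instance meeting the threshold, else the best one seen."""
    best: Optional[Instance] = None
    best_reward = float("-inf")
    for model, hardware in product(software_space, hardware_space):
        current = evaluate(model, hardware)  # stand-in for the evaluator
        if current > best_reward:
            best, best_reward = (model, hardware), current
        if current >= reward_threshold:
            break  # reward threshold met: output this instance
    return best, best_reward


# Hypothetical reward values for each proposed instance.
REWARDS = {
    ("cnn", "gpu"): 0.6,
    ("cnn", "nn_accelerator"): 0.9,
    ("transformer", "gpu"): 0.95,
    ("transformer", "nn_accelerator"): 0.7,
}
instance, value = select_instance(
    ["cnn", "transformer"],
    ["gpu", "nn_accelerator"],
    lambda model, hardware: REWARDS[(model, hardware)],
    reward_threshold=0.85,
)
```

In this run the loop stops at the second candidate because its reward already meets the threshold, even though a later candidate would score higher; a threshold that is never met causes the full space to be searched and the maximum to be returned.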

Fig. 36 is a block diagram of an example ML system configuration circuit 3600 that composes ML computing nodes (e.g., ML computing node 3517 of fig. 35) to execute a workload (e.g., workload(s) 3516 of fig. 35). In some examples, ML system configuration circuit 3600 of fig. 36 may implement ML system configurator 3402 of fig. 34 and/or 35. The ML system configuration circuit 3600 of fig. 36 may be instantiated (e.g., an instance thereof may be created, brought into existence for any length of time, embodied, implemented, etc.) by processor circuitry, such as a CPU, executing instructions. Additionally and/or alternatively, the ML system configuration circuit 3600 of fig. 36 may be instantiated (e.g., an instance thereof may be created, brought into existence for any length of time, embodied, implemented, etc.) by an ASIC or FPGA that is structured to perform operations corresponding to the instructions. It should be appreciated that some or all of the ML system configuration circuit 3600 of fig. 36 may thus be instantiated at the same or different times. Some or all of ML system configuration circuit 3600 may be instantiated, for example, in one or more threads executing concurrently on hardware and/or executing serially on hardware. Further, in some examples, a portion or all of ML system configuration circuit 3600 of fig. 36 may be implemented by one or more virtual machines and/or containers executing on a microprocessor.

The ML system configuration circuit 3600 of the illustrated example includes an example interface circuit 3610, an example ML software configuration circuit 3620, an example ML hardware configuration circuit 3630, an example configuration evaluation circuit 3640, an example ontology generation circuit 3650, an example workload execution circuit 3660, an example data store 3670, and an example bus 3680. The data store 3670 of the illustrated example includes example software templates 3672, example hardware templates 3674, example interconnect topology 3676, and example history configuration 3678.

In the illustrated example of fig. 36, interface circuit 3610, ML software configuration circuit 3620, ML hardware configuration circuit 3630, configuration evaluation circuit 3640, ontology generation circuit 3650, workload execution circuit 3660, and data store 3670 are in communication with bus 3680. For example, bus 3680 may be implemented by at least one of an Inter-Integrated Circuit (I2C) bus, a Serial Peripheral Interface (SPI) bus, a Peripheral Component Interconnect (PCI) bus, or a Peripheral Component Interconnect Express (PCIe or PCI-E) bus. Additionally or alternatively, bus 3680 may be implemented by any other type of computing or electrical bus.

The ML system configuration circuit 3600 of the illustrated example of fig. 36 includes an interface circuit 3610 to receive a request to execute an AI/ML workload. For example, the interface circuit 3610 may receive a request from a user, computing or electronic system, or the like for configuring an AutoML solution (e.g., a combination of hardware and/or software) based on the workload(s) 3516. In some examples, the interface circuit 3610 may receive a request for the AI/ML model and corresponding hardware to execute an AI/ML workload. In some examples, the interface circuit 3610 may receive an AI/ML workload.

The ML system configuration circuitry 3600 of the illustrated example of fig. 36 includes ML software configuration circuitry 3620 to generate a first configuration of one or more models (e.g., one or more ML models, one or more AI/ML models, etc.) based on a workload. In some examples, ML software configuration circuit 3620 may generate a software search space based on at least one of the request or the historical configuration. For example, the ML software configuration circuit 3620 can populate and/or otherwise generate the software search space 3518 to include one or more AI/ML models identified in at least one of the ontology database 3508 or the configurable building block database 3510. In some such examples, ML software configuration circuit 3620 can generate software search space 3518 based on workload(s) 3516, or aspect(s) or portion(s) thereof.

In some examples, ML software configuration circuit 3620 queries the configuration database with a workload using an API. For example, one (or more) of the configurable building block databases 3510 may implement a configuration database, and the ML software configuration circuit 3620 may query one (or more) of the configurable building block databases 3510. In some such examples, ML software configuration circuit 3620 may query one (or more) of configurable building block database 3510 with workload(s) 3516 or aspect(s) thereof as input(s).

In some examples, the ML software configuration circuit 3620 determines the number of layers of the AI/ML model. For example, ML software configuration circuit 3620 may identify CNNs in software templates 3512, 3672, and so forth. In some examples, ML software configuration circuit 3620 may determine the number of layers of the CNN.

In some examples, the ML software configuration circuit 3620 determines weights for layers of the AI/ML model. For example, ML software configuration circuit 3620 may identify a weight value corresponding to the CNN in software template 3512. In some such examples, ML software configuration circuit 3620 may utilize the weights identified in software template 3512, determine new weight(s), adjust the value of one (or more) of the weights, and/or the like, and/or any combination(s) of these.

In some examples, the ML software configuration circuit 3620 determines a training type of the AI/ML model. For example, ML software configuration circuit 3620 may determine that reinforcement learning is associated with the CNN in software template 3512. In some examples, ML software configuration circuit 3620 may select a different training type for the CNN, such as stochastic gradient descent, simulated annealing, particle swarm optimization, evolutionary algorithms, genetic algorithms, nonlinear conjugate gradients, and so forth.

In some examples, the ML software configuration circuit 3620 determines hyperparameters to train the AI/ML model. For example, ML software configuration circuit 3620 may identify a hyperparameter, a value of the hyperparameter, etc., corresponding to the CNN in software template 3512. In some such examples, ML software configuration circuit 3620 may utilize the hyperparameters identified in software template 3512, determine new hyperparameter(s), adjust the value of one (or more) of the hyperparameters, and/or the like, and/or any combination(s) of these.

In some examples, the ML software configuration circuit 3620 determines whether another AI/ML model has been identified. For example, ML software configuration circuit 3620 may determine that a transformer model is identified in addition to CNN. In some such examples, the ML software configuration circuit 3620 may determine that more than one AI/ML model has been identified, such as CNN and transformer models. In some such examples, the ML software configuration circuit 3620 can generate a topology (e.g., an interconnect or interconnect topology, an input/output (I/O) topology, etc.) based on connection(s) between one (or more) of the AI/ML models. For example, the ML software configuration circuit 3620 may select the CNN as the first or primary model and the transformer model as the second or secondary model. For example, ML software configuration circuit 3620 may determine that the CNN and the transformer model may be coupled together by connecting the output(s) of the CNN to the input(s) of the transformer model.
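Coupling a primary model to a secondary model by connecting outputs to inputs can be represented as a list of directed edges, as in the following sketch. The helper names `chain_models` and `run_chain` are hypothetical, and the two lambdas are trivial stand-ins for a CNN and a transformer rather than real models.

```python
from typing import Any, Callable, Dict, List, Tuple


def chain_models(models: List[str]) -> List[Tuple[str, str]]:
    """Build a topology in which each model's output feeds the next model's input."""
    return list(zip(models, models[1:]))


def run_chain(models: List[str], stages: Dict[str, Callable[[Any], Any]], value: Any) -> Any:
    """Execute the coupled models in order, passing outputs forward."""
    for name in models:
        value = stages[name](value)
    return value


# Primary model first, secondary model second.
order = ["cnn", "transformer"]
topology = chain_models(order)

# Stand-in stages (assumptions): the "cnn" doubles each element, the "transformer" sums them.
stages = {
    "cnn": lambda xs: [2 * x for x in xs],
    "transformer": lambda xs: sum(xs),
}
result = run_chain(order, stages, [1, 2, 3])
```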

In some examples, ML software configuration circuit 3620 adjusts the first configuration (e.g., the configuration of software to be included in proposed HW/SW instance 3522) based on the evaluation parameters. For example, the evaluator 3504 may calculate and/or otherwise determine the evaluation parameters 3526 based on an evaluation of the proposed HW/SW instance 3522. In some such examples, the evaluator 3504 may determine a first one of the evaluation parameters 3526 as an accuracy parameter (e.g., accuracy of the output(s) of the proposed HW/SW instance 3522, accuracy evaluation parameter, etc.).

In some examples, the ML software configuration circuit 3620 determines whether to replace the first AI/ML model with a different AI/ML model. For example, ML software configuration circuit 3620 may determine to replace the CNN with a different model, such as an ANN, DNN, or the like. In some such examples, ML software configuration circuit 3620 may determine to replace the CNN based on the value of the accuracy parameter in an attempt to increase and/or otherwise improve the value. In some examples, in response to determining to replace the first AI/ML model with a different AI/ML model, the ML software configuration circuit 3620 may identify the second AI/ML model in the configuration database. For example, ML software configuration circuit 3620 may identify ANNs, DNNs, and the like in software template 3512. In some examples, the ML software configuration circuit 3620 generates a new configuration based on replacing the first AI/ML model with the second AI/ML model. For example, the ML software configuration circuit 3620 can generate a new version, an updated version, etc., of the proposed HW/SW instance 3522 based on replacing the CNN with a different AI/ML model.

In some examples, the ML software configuration circuit 3620 may determine to add a second AI/ML model to the configuration. For example, the ML software configuration circuit 3620 can determine to add another AI/ML model, such as an ANN, DNN, etc., in connection with the CNN. In some such examples, the ML software configuration circuit 3620 may determine to add another AI/ML model based on the value of the evaluation parameter, e.g., the value of the accuracy parameter. In some examples, the ML software configuration circuit 3620 can identify the second AI/ML model to add to the configuration by identifying the second AI/ML model in the software template 3512 and/or, more generally, in the configurable building block database 3510.

In some examples, responsive to determining to add another AI/ML model to the proposed configuration of HW/SW instance 3522, ML software configuration circuitry 3620 determines one or more first layers of a first AI/ML model to execute a first portion of the workload and one or more second layers of a second AI/ML model to execute a second portion of the workload. For example, ML software configuration circuit 3620 may identify (or select) one or more first layers of CNN to execute a first portion of workload(s) 3516 and one or more second layers of ANN, DNN, etc. to execute a second portion of workload(s) 3516. In some examples, ML software configuration circuit 3620 may determine the new configuration based on the topology of the one or more first layers and the one or more second layers. For example, the ML software configuration circuit 3620 can determine new and/or updated instances, versions, etc. of the proposed HW/SW instance 3522 based on a topology that couples the first AI/ML model and the second AI/ML model.

The ML system configuration circuit 3600 of the illustrated example of fig. 36 includes ML hardware configuration circuit 3630 to generate a second configuration of hardware based on AI/ML workload. In some examples, ML hardware configuration circuit 3630 can query the configuration database with AI/ML workload using an API. For example, one (or more) of the configurable building block databases 3510 may implement a configuration database, and the ML hardware configuration circuit 3630 may query one (or more) of the configurable building block databases 3510. In some such examples, ML hardware configuration circuit 3630 may query one (or more) of configurable building block database 3510 with workload(s) 3516 or aspect(s) thereof as input(s).

In some examples, ML hardware configuration circuit 3630 may identify a first block (or portion) of hardware to execute a matrix-matrix workload. For example, workload(s) 3516 may include matrix-matrix computing operations, vector-vector computing operations, matrix-vector computing operations, and the like, and/or any combination(s) of these. In some examples, ML hardware configuration circuit 3630 may identify a first kernel of the GPU (or other hardware) to execute the matrix-matrix workload. In some such examples, ML hardware configuration circuit 3630 may identify the first kernel, and/or more generally, the GPU, in one of hardware templates 3514, 3674, and/or the like.

In some examples, ML hardware configuration circuit 3630 may identify a second block (or portion) of hardware to execute the vector-vector workload. For example, ML hardware configuration circuit 3630 may identify a second kernel of the GPU (or other hardware) to execute the vector-vector workload. In some such examples, ML hardware configuration circuit 3630 may identify the second kernel, and/or more generally, the GPU, in one of hardware templates 3514.

In some examples, ML hardware configuration circuit 3630 may identify a third block (or portion) of hardware to perform the matrix-vector workload. For example, ML hardware configuration circuit 3630 may identify a third kernel of the GPU (or other hardware) to execute the matrix-vector workload. In some such examples, ML hardware configuration circuit 3630 may identify the third kernel, and/or more generally, the GPU, in one of hardware templates 3514.

In some examples, ML hardware configuration circuit 3630 may identify a register file to configure each of the first, second, and/or third blocks. For example, ML hardware configuration circuit 3630 may identify a register file associated with the GPU, and the register file may be identified in one of hardware templates 3514. In some such examples, the register file may include a first configuration to configure the first kernel of the GPU, a second configuration to configure the second kernel of the GPU, and/or a third configuration to configure the third kernel of the GPU.
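The mapping from workload operation types to hardware blocks, together with a per-block register file configuration, can be pictured as a lookup into a hardware template. The template layout, the kernel names, and the register fields (`tile_size`, `vector_width`) below are invented for illustration and do not reflect any actual template format.

```python
from typing import Dict, List

# Hypothetical hardware template for a GPU (all names and values are assumptions).
HARDWARE_TEMPLATE = {
    "device": "gpu",
    "kernels": {
        "matrix-matrix": "kernel_0",
        "vector-vector": "kernel_1",
        "matrix-vector": "kernel_2",
    },
    "register_file": {
        "kernel_0": {"tile_size": 16},
        "kernel_1": {"vector_width": 8},
        "kernel_2": {"tile_size": 8},
    },
}


def plan_hardware(operations: List[str], template: dict) -> Dict[str, dict]:
    """Select a kernel and its register configuration for each operation type."""
    plan: Dict[str, dict] = {}
    for operation in operations:
        kernel = template["kernels"][operation]
        plan[operation] = {
            "kernel": kernel,
            "config": template["register_file"][kernel],
        }
    return plan


# A workload containing only matrix-matrix and matrix-vector operations.
plan = plan_hardware(["matrix-matrix", "matrix-vector"], HARDWARE_TEMPLATE)
```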

In some examples, ML hardware configuration circuit 3630 determines whether another type of hardware and/or another instance of hardware has been identified. For example, ML hardware configuration circuit 3630 may determine that in addition to the first instance of the GPU, another instance of the GPU is identified. In some examples, ML hardware configuration circuit 3630 may determine that a different type of hardware, such as an AI processor, has been identified in hardware template 3514. In some such examples, ML hardware configuration circuit 3630 may generate a topology (e.g., one (or more) of an interconnect or interconnect topology, an input/output (I/O) topology, an interconnect topology 3676, etc.) based on the connection(s) between the first GPU and the second GPU or one (or more) of the AI processors. For example, the ML hardware configuration circuit 3630 may select a first GPU as the first or primary hardware and a second GPU or AI processor as the second or secondary hardware. For example, the ML hardware configuration circuit 3630 may determine that the first GPU and the second GPU or AI processor may be coupled together by connecting the output(s) of the first GPU to the input(s) of the second GPU or AI processor.

In some examples, ML hardware configuration circuit 3630 adjusts the second configuration (e.g., the configuration of hardware to be included in proposed HW/SW instance 3522) based on the evaluation parameters. For example, the evaluator 3504 may calculate and/or otherwise determine the evaluation parameters 3526 based on an evaluation of the proposed HW/SW instance 3522. In some such examples, the evaluator 3504 may determine a first one of the evaluation parameters 3526 as a throughput parameter (e.g., throughput of output(s) of the proposed HW/SW instance 3522, throughput evaluation parameter, etc.).

In some examples, ML hardware configuration circuit 3630 determines whether to replace the first hardware with different hardware. For example, ML hardware configuration circuit 3630 may determine to replace the GPU with different hardware, such as a CPU, AI processor, FPGA, or the like. In some such examples, ML hardware configuration circuit 3630 may determine to replace the GPU based on the value of the throughput parameter in an attempt to increase and/or otherwise improve the value. In some examples, in response to determining to replace the first hardware with different hardware, ML hardware configuration circuit 3630 may identify the second hardware in the configuration database. For example, the ML hardware configuration circuit 3630 can identify a CPU, AI processor, FPGA, etc. in the hardware template 3514. In some examples, ML hardware configuration circuit 3630 generates a new configuration based on replacing the first hardware with the second hardware. For example, ML hardware configuration circuit 3630 may generate a new version, an updated version, etc. of proposed HW/SW instance 3522 based on replacing the GPU with different hardware.

In some examples, ML hardware configuration circuit 3630 may determine to add second hardware to the configuration. For example, ML hardware configuration circuit 3630 may determine to add additional hardware, such as a CPU, another GPU, an AI processor, an FPGA, etc., in connection with the first GPU. In some such examples, ML hardware configuration circuit 3630 may determine to add additional hardware based on the value of the evaluation parameter, e.g., the value of the throughput parameter. In some examples, ML hardware configuration circuit 3630 may identify second hardware for addition to the configuration by identifying the second hardware in hardware template 3514 and/or, more generally, in configurable building block database 3510.

In some examples, responsive to determining to add hardware to the configuration of the proposed HW/SW instance 3522, the ML hardware configuration circuit 3630 determines one or more first portions of first hardware to execute a first portion of a workload and one or more second portions of second hardware to execute a second portion of the workload. For example, ML hardware configuration circuit 3630 may identify (or select) one or more first cores of a first GPU to execute a first portion of workload(s) 3516 and one or more second cores of a second GPU, AI processor, CPU, FPGA, etc. to execute a second portion of workload(s) 3516. In some examples, ML hardware configuration circuit 3630 may determine the new configuration based on a topology of the one or more first portions and the one or more second portions. For example, ML hardware configuration circuit 3630 may determine new and/or updated instances, versions, etc. of proposed HW/SW instance 3522 based on a topology that couples the first hardware and the second hardware.

The ML system configuration circuit 3600 of the illustrated example of fig. 36 includes a configuration evaluation circuit 3640 to generate an evaluation parameter according to execution of a workload based on the first configuration and the second configuration. For example, configuration evaluation circuit 3640 may generate evaluation parameter 3526. In some such examples, configuration evaluation circuit 3640 may generate evaluation parameters 3526 in response to simulation, emulation, etc. of execution of workload(s) 3516 (or different workloads) with the proposed HW/SW instance 3522. In some such examples, the configuration evaluation circuit 3640 may evaluate the proposed HW/SW instance 3522 based on a first configuration of software (e.g., one or more AI/ML models) and a second configuration of hardware (e.g., one or more instances and/or types of hardware) that configure the proposed HW/SW instance 3522.

In some examples, configuration evaluation circuit 3640 may determine whether the evaluation parameter meets a threshold. For example, the configuration evaluation circuit 3640 may determine whether the first value of the accuracy parameter meets an accuracy threshold. In some such examples, configuration evaluation circuit 3640 may determine that the first value meets the accuracy threshold in response to determining that the first value is greater than the accuracy threshold. For example, the configuration evaluation circuit 3640 may determine that an accuracy parameter value of 40% does not meet a 90% accuracy threshold because 40% is less than 90%. In some examples, configuration evaluation circuit 3640 may determine that an accuracy parameter value of 95% meets a 90% accuracy threshold because 95% is greater than 90%. Additionally or alternatively, configuration evaluation circuit 3640 may determine whether one or more other evaluation parameters (e.g., latency parameters, throughput parameters, etc.) satisfy one or more respective evaluation thresholds (e.g., latency thresholds, throughput thresholds, etc.).
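The threshold comparison described above can be sketched as follows. This is a minimal illustration only: the names `Threshold`, `meets_threshold`, and `all_thresholds_met` are hypothetical, and treating latency as a "lower is better" parameter is an assumption not stated in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical threshold record for one evaluation parameter."""
    value: float
    higher_is_better: bool = True  # assumption: accuracy/throughput style


def meets_threshold(measured: float, threshold: Threshold) -> bool:
    # The description treats a value as meeting the accuracy threshold when
    # it is greater than the threshold; the mirrored rule for
    # "lower is better" parameters (e.g., latency) is an assumption.
    if threshold.higher_is_better:
        return measured > threshold.value
    return measured < threshold.value


def all_thresholds_met(measured: dict, thresholds: dict) -> bool:
    # Every evaluation parameter must satisfy its respective threshold.
    return all(meets_threshold(measured[name], t) for name, t in thresholds.items())


accuracy_threshold = Threshold(0.90)
print(meets_threshold(0.40, accuracy_threshold))  # False: 40% is less than 90%
print(meets_threshold(0.95, accuracy_threshold))  # True: 95% is greater than 90%
```

Keeping the comparison direction in the threshold record lets the same check cover accuracy, latency, and throughput parameters.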

The ML system configuration circuit 3600 of the illustrated example of fig. 36 includes an ontology generation circuit 3650 to generate, update, and/or otherwise maintain an ontology database. In some examples, the ontology generation circuit 3650 generates the ontology database 3508 based on at least one of the configurable building block database 3510 or the application store 3515. In some such examples, the ontology generation circuit 3650 may generate the ontology database 3508 by including associations between different AI/ML models, their configuration(s), type of AI/ML workload(s), and/or the like, and/or any combination(s) of these. In some such examples, the association may be implemented by an identifier, a variable, a pointer, etc., or any other identifying data structure. In some examples, the ontology generation circuit 3650 may update the ontology database 3508 based on the proposed HW/SW instance 3522, historical configuration (e.g., historical configuration 3678), evaluation parameters 3526, reward function 3528, and/or the like, and/or any combination(s) of these. For example, the ontology generation circuit 3650 may update the ontology database 3508 based on a previous version of the proposed HW/SW instance 3522, one (or more) of the evaluation parameters 3526 associated therewith, and so forth.
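As one way to picture identifier-based associations, the following minimal sketch keeps an in-memory mapping between identifiers of models, configurations, and workload types. The class `OntologyDatabase` and its methods are hypothetical names chosen for illustration; the description does not prescribe a storage layout.

```python
from collections import defaultdict


class OntologyDatabase:
    """Hypothetical in-memory stand-in for an ontology database."""

    def __init__(self) -> None:
        self._associations = defaultdict(set)   # identifier -> related identifiers
        self._evaluations = defaultdict(list)   # identifier -> recorded evaluations

    def associate(self, first_id: str, second_id: str) -> None:
        # Associations are symmetric: each identifier points at the other.
        self._associations[first_id].add(second_id)
        self._associations[second_id].add(first_id)

    def query(self, identifier: str) -> list:
        # Return related identifiers without creating an empty entry.
        return sorted(self._associations.get(identifier, ()))

    def update(self, identifier: str, evaluation: dict) -> None:
        # Record evaluation parameters observed for a configuration.
        self._evaluations[identifier].append(dict(evaluation))

    def evaluations(self, identifier: str) -> list:
        return list(self._evaluations.get(identifier, ()))


db = OntologyDatabase()
db.associate("cnn", "image_classification")
db.associate("cnn", "transformer")
db.update("cnn", {"accuracy": 0.95})
print(db.query("cnn"))  # ['image_classification', 'transformer']
```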

In some examples, the ontology generation circuit 3650 identifies the AI/ML model based on a historical configuration. For example, the ontology generation circuit 3650 may identify AI/ML models, such as NN, based on previously generated ML compute nodes, proposed HW/SW instances, and/or the like, and/or any combination(s) of these. In some examples, ontology generating circuit 3650 identifies hardware based on a historical configuration, such as historical configuration 3678. For example, the ontology generation circuit 3650 may identify hardware, such as a GPU, based on previously generated ML compute nodes, proposed HW/SW instances, and/or the like, and/or any combination(s) of these.

The ML system configuration circuit 3600 of the illustrated example of fig. 36 includes a workload execution circuit 3660 to deploy computing node(s) to execute a workload. For example, workload execution circuit 3660 may deploy ML compute node 3517 to execute workload(s) 3516. In some such examples, workload execution circuit 3660 may deploy ML compute node 3517 in response to one or more evaluation parameters meeting one or more respective thresholds. In some examples, workload execution circuit 3660 may deploy ML compute node 3517 by compiling software 3519 using a software configuration determined by ML software configuration circuit 3620. In some examples, workload execution circuit 3660 may deploy ML compute node 3517 by configuring hardware 3521 using a hardware configuration determined by ML hardware configuration circuit 3630. In some such examples, the workload execution circuit 3660 may execute one or more AI/ML models, which may be implemented by software 3519, based on the software configuration and the hardware configuration.

The ML system configuration circuit 3600 of the illustrated example of fig. 36 includes a data store 3670 to record data (e.g., software templates 3672, hardware templates 3674, interconnect topologies 3676, historical configurations 3678, etc.). The data store 3670 may be implemented by volatile memory (e.g., Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM), etc.) and/or non-volatile memory (e.g., Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, Hard Disk Drive (HDD), Solid State Disk (SSD) drive, etc.). The data store 3670 may additionally or alternatively be implemented by one or more Double Data Rate (DDR) memories, such as DDR, DDR2, DDR3, DDR4, DDR5, mobile DDR (mDDR), DDR SDRAM, etc. The data store 3670 may additionally or alternatively be implemented by one or more mass storage devices, such as HDD(s), Compact Disc (CD) drive(s), Digital Versatile Disk (DVD) drive(s), SSD drive(s), Secure Digital (SD) card(s), CompactFlash (CF) card(s), and so forth. Although in the illustrated example, the data store 3670 is illustrated as a single data store, the data store 3670 may be implemented by any number and/or type(s) of data stores. In addition, the data stored in the data store 3670 may take any data format, such as binary data, comma separated data, tab separated data, Structured Query Language (SQL) constructs, and so forth. In some examples, the data store 3670 may include and/or otherwise implement one or more databases. The term "database" as used herein refers to an organization of related data, regardless of the manner in which the data or organization thereof is represented. For example, the organization of related data may be in the form of one or more of a table, map, grid, package, datagram, frame, file, document, report, list, or any other form.

In some examples, software template 3672 may be implemented by software template 3512 of fig. 35. For example, the software templates 3672 may include first templates corresponding to a first type of AI/ML model (e.g., NN, such as ANN, CNN, DNN, RNN, etc.) and/or configuration(s) associated therewith. In some such examples, the software templates 3672 may include second templates corresponding to a second type of AI/ML model (e.g., a transformer model) and/or configuration(s) thereof, a third type of AI/ML model (e.g., a reinforcement learning model) and/or configuration(s) thereof, and so forth.

In some examples, hardware template 3674 may be implemented by hardware template 3514 of fig. 35. For example, the hardware templates 3674 may include a first template corresponding to a first type of hardware (e.g., CPU, etc.) and/or configuration(s) associated therewith, a second template corresponding to a second type of hardware (e.g., GPU) and/or configuration(s) thereof, a third template corresponding to a third type of hardware (e.g., AI processor) and/or configuration(s) thereof, and so forth.

In some examples, interconnect topology 3676 can be implemented by portion(s) of software template 3512 and/or hardware template 3514. For example, the interconnect topology 3676 can include AI/ML network topology (e.g., layer configuration, etc.), model input(s), model output(s), and so forth. In some such examples, AI/ML network topology, model input(s), model output(s), and the like may be included in portion(s) of software template 3512. In some examples, interconnect topology 3676 can include hardware architecture topology (e.g., core coupling, printed circuit board layout, etc.), input(s) (e.g., bare metal input(s), interface(s), etc.), output(s) (e.g., bare metal output(s), interface(s), etc.), and so forth. In some such examples, the hardware architecture topology, input(s), output(s), and the like may be included in the portion(s) of the hardware template 3514.

In some examples, the historical configuration 3678 may be implemented by part(s) of the ontology database 3508 and/or, more generally, by the ontology database 3508. For example, the historical configuration 3678 may include ML compute nodes previously generated, determined, identified, etc., proposed HW/SW instances, workload(s), etc., and/or any combination(s) of these. In some examples, historical configuration 3678 may include occurrence or other statistics associated with hardware and/or software kernels in ML computing nodes.
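Occurrence statistics of the kind mentioned above could, for instance, be tallied as in the following minimal sketch. It assumes, purely for illustration, that each historical configuration is recorded as a list of kernel or block names; `kernel_occurrences` is a hypothetical helper.

```python
from collections import Counter


def kernel_occurrences(historical_configurations: list) -> Counter:
    """Count how often each kernel/block name appears across past configurations."""
    counts = Counter()
    for configuration in historical_configurations:
        counts.update(configuration)  # each configuration is a list of names
    return counts


history = [
    ["MAT_VEC", "VEC_VEC"],
    ["MAT_VEC", "VEC_VEC", "VEC_VEC"],
]
counts = kernel_occurrences(history)
print(counts["VEC_VEC"])  # 3
print(counts["MAT_VEC"])  # 2
```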

In some examples, ML system configuration circuit 3600 includes means for receiving a workload. For example, the means for receiving may be implemented by the interface circuit 3610. In some examples, interface circuit 3610 may be instantiated by a processor circuit, such as example processor circuit 4712 of fig. 47. For example, interface circuit 3610 may be instantiated by example general purpose processor circuit 34500 of fig. 345 executing machine-executable instructions implemented, for example, by at least block 4102 of fig. 41, block 4202 of fig. 42, block 4302 of fig. 43, and block 4602 of fig. 46. In some examples, interface circuit 3610 may be instantiated by hardware logic circuitry, which may be implemented by ASIC or FPGA circuitry 34600 of fig. 346 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, interface circuit 3610 may be instantiated by any other combination of hardware, software, and/or firmware. For example, interface circuit 3610 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.), a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or any kind of network interface configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions without executing software or firmware, although other configurations are equally suitable.

In some examples, the ML system configuration circuit 3600 includes first means for generating a first configuration of one or more machine learning models based on a workload. In some such examples, the first configuration is stored in a first configuration database, the first configuration database comprising a plurality of machine learning models, and the plurality of machine learning models comprising the one or more machine learning models. For example, the first means for generating may be implemented by ML software configuration circuit 3620. In some examples, ML software configuration circuit 3620 may be instantiated by a processor circuit, such as example processor circuit 4712 of fig. 47. For example, ML software configuration circuit 3620 may be instantiated by example general purpose processor circuit 34500 of fig. 345 executing machine executable instructions implemented, for example, by at least blocks 4104 and 4114 of fig. 41, blocks 4202, 4206, 4208, 4210, 4212, 4214, 4216, and 4218 of fig. 42, blocks 4402, 4404, 4406, 4408, 4410, 4412, 4414, and 4416 of fig. 44, and blocks 4604, 4606, and 4608 of fig. 46. In some examples, ML software configuration circuit 3620 may be instantiated by hardware logic circuitry, which may be implemented by ASIC or FPGA circuitry 34600 of fig. 346 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, ML software configuration circuit 3620 may be instantiated by any other combination of hardware, software, and/or firmware. For example, ML software configuration circuit 3620 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions, without executing software or firmware, although other architectures are equally suitable.

In some examples where the one or more machine learning models include the first machine learning model, the first means for generating identifies a second machine learning model in the first configuration database in response to the evaluation parameter not meeting the threshold, generates a third configuration of the second machine learning model, determines the evaluation parameter according to execution of the workload based on the third configuration, and deploys the second machine learning model to execute the workload based on the third configuration.

In some examples where the one or more machine learning models include a first machine learning model, the first means for generating determines one or more first layers of the first machine learning model to perform a first portion of the workload in response to the evaluation parameter not meeting a threshold, identifies a second machine learning model in a first configuration database, determines one or more second layers of the second machine learning model to perform a second portion of the workload, and determines a third configuration based on a topology of the one or more first layers and the one or more second layers, the topology based on output from the one or more first layers as input to the one or more second layers.
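A topology in which output from the first layers serves as input to the second layers can be sketched as simple chaining. In this minimal illustration the layers are plain Python callables standing in for model layers, and `chain_layers` is a hypothetical helper; no real ML framework is implied.

```python
from typing import Callable, Sequence


def chain_layers(first_layers: Sequence[Callable], second_layers: Sequence[Callable]) -> Callable:
    """Build a pipeline where the first layers' output feeds the second layers."""
    ordered = list(first_layers) + list(second_layers)

    def run(inputs):
        value = inputs
        for layer in ordered:
            value = layer(value)  # each layer consumes the previous output
        return value

    return run


# Toy stand-ins: layers taken from a first model and from a second model.
first_model_layers = [lambda v: [2 * x for x in v]]   # handles the first portion
second_model_layers = [lambda v: sum(v)]              # handles the second portion

pipeline = chain_layers(first_model_layers, second_model_layers)
print(pipeline([1, 2, 3]))  # 12
```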

In some examples where the one or more machine learning models include a first machine learning model, the first means for generating identifies the first machine learning model in a first configuration database, identifies a second machine learning model based on entering a query to an ontology database including an association of the first machine learning model and the second machine learning model with an identifier of the first machine learning model as an input, and updates the ontology database based on the first configuration in response to the evaluation parameter meeting a threshold.

In some examples, ML system configuration circuit 3600 includes a second means for generating a second configuration of hardware. In some such examples, the second configuration is stored in a second configuration database, the second configuration database comprising one or more portions of a plurality of hardware, and the plurality of hardware comprising the hardware. For example, the second means for generating may be implemented by ML hardware configuration circuit 3630. In some examples, ML hardware configuration circuit 3630 may be instantiated by a processor circuit, such as example processor circuit 4712 of fig. 47. For example, ML hardware configuration circuit 3630 may be instantiated by example general purpose processor circuit 34500 of fig. 345 executing machine executable instructions implemented, for example, by at least blocks 4106 and 4116 of fig. 41, blocks 4302, 4306, 4308, 4310, 4312, 4314, 4316, and 4318 of fig. 43, blocks 4502, 4504, 4506, 4508, 4510, 4512, 4514, and 4516 of fig. 45, and blocks 4604, 4606, and 4608 of fig. 46. In some examples, ML hardware configuration circuit 3630 may be instantiated by hardware logic circuitry, which may be implemented by ASIC or FPGA circuitry 34600 of fig. 346 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, ML hardware configuration circuit 3630 may be instantiated by any other combination of hardware, software, and/or firmware. For example, ML hardware configuration circuit 3630 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions, without executing software or firmware, although other architectures are equally suitable.

In some examples, where the one or more portions include at least one of a first block, a second block, or a third block, the second means for generating identifies the first block of hardware to execute a matrix-matrix workload, identifies the second block of hardware to execute a vector-vector workload, identifies the third block of hardware to execute a matrix-vector workload, and identifies a register file for each of the first block, the second block, and the third block, the register file to store a state of each of the first block, the second block, and the third block, the second configuration based on a topology including at least one of the first block, the second block, or the third block.
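The pairing of each identified block with its own register file can be sketched with a few small data types. Everything here is hypothetical naming for illustration: `WorkloadKind`, `RegisterFile`, `Block`, and `identify_blocks` are not taken from the description, and the label "MAT_MAT" is an assumed name by analogy with block labels such as "MAT_VEC".

```python
from dataclasses import dataclass, field
from enum import Enum


class WorkloadKind(Enum):
    MATRIX_MATRIX = "matrix-matrix"
    VECTOR_VECTOR = "vector-vector"
    MATRIX_VECTOR = "matrix-vector"


@dataclass
class RegisterFile:
    """Stores the state of exactly one block."""
    state: dict = field(default_factory=dict)


@dataclass
class Block:
    name: str
    kind: WorkloadKind
    register_file: RegisterFile = field(default_factory=RegisterFile)


# Assumed labels; "MAT_MAT" in particular is hypothetical.
BLOCK_NAMES = {
    WorkloadKind.MATRIX_MATRIX: "MAT_MAT",
    WorkloadKind.VECTOR_VECTOR: "VEC_VEC",
    WorkloadKind.MATRIX_VECTOR: "MAT_VEC",
}


def identify_blocks(kinds: list) -> list:
    # One block per requested workload kind, each with its own register file.
    return [Block(BLOCK_NAMES[kind], kind) for kind in kinds]


topology = identify_blocks([WorkloadKind.MATRIX_VECTOR, WorkloadKind.VECTOR_VECTOR])
print([block.name for block in topology])  # ['MAT_VEC', 'VEC_VEC']
```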

In some examples where the hardware is the first hardware, the second means for generating identifies the second hardware in the second configuration database in response to the evaluation parameter not meeting the threshold, generates a third configuration of the second hardware, determines the evaluation parameter based on execution of the workload by the second hardware in the third configuration, and deploys the second hardware with the third configuration to execute the one or more machine learning models to execute the workload.

In some examples where the hardware is first hardware, the second means for generating determines one or more first portions of the first hardware to execute a first portion of the workload in response to the evaluation parameter not meeting the threshold, identifies the second hardware in the first configuration database, determines one or more second portions of the second hardware to execute a second portion of the workload, and determines a third configuration based on a topology of the one or more first portions and the one or more second portions, the topology based on output from the one or more first portions as input to the one or more second portions.

In some examples, ML system configuration circuit 3600 includes means for determining an evaluation parameter based on execution of a workload. In some such examples, execution of the workload is based on a first configuration of one or more machine learning models and a second configuration of hardware. In some such examples, the second configuration is stored in a second configuration database, the second configuration database comprising one or more portions of a plurality of hardware, and the plurality of hardware comprising the hardware. In some examples where the evaluation parameter is a first evaluation parameter, the means for determining determines a reward function comprising a first evaluation parameter having a first weight and a second evaluation parameter having a second weight, the first weight being greater than the second weight, and, in response to determining that at least one of the first evaluation parameter or the second evaluation parameter does not satisfy the threshold, changes at least one of the first configuration or the second configuration to at least one of increase the first evaluation parameter or decrease the second evaluation parameter. For example, the means for determining may be implemented by the configuration evaluation circuit 3640. In some examples, the configuration evaluation circuit 3640 may be instantiated by a processor circuit, such as the example processor circuit 4712 of fig. 47. For example, configuration evaluation circuit 3640 may be instantiated via example general purpose processor circuit 34500 of fig. 345 executing machine-executable instructions, such as implemented by at least blocks 4108 and 4110 of fig. 41 and blocks 4610 and 4612 of fig. 46. In some examples, configuration evaluation circuit 3640 may be instantiated by hardware logic circuitry, which may be implemented by ASIC or FPGA circuitry 34600 of fig. 346 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, configuration evaluation circuit 3640 may be instantiated by any other combination of hardware, software, and/or firmware. For example, configuration evaluation circuit 3640 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions, without executing software or firmware, although other configurations may be equally suitable.
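A weighted reward function of the kind described for the means for determining can be sketched as below. The linear form, the default weights, and the choice of accuracy as the first parameter and latency as the second are all assumptions made for illustration; the description only requires that the first weight exceed the second and that the reward favor increasing the first parameter and decreasing the second.

```python
def reward(first_param: float, second_param: float,
           first_weight: float = 0.7, second_weight: float = 0.3) -> float:
    """Hypothetical linear reward: favor a larger first parameter
    (e.g., accuracy) and a smaller second parameter (e.g., latency)."""
    if not first_weight > second_weight:
        raise ValueError("the first weight must be greater than the second weight")
    return first_weight * first_param - second_weight * second_param


baseline = reward(0.80, 0.50)
more_accurate = reward(0.90, 0.50)   # first parameter increased
lower_latency = reward(0.80, 0.40)   # second parameter decreased
print(more_accurate > baseline, lower_latency > baseline)  # True True
```

With the first weight larger, an equal-sized change in the first parameter moves the reward more than the same change in the second parameter.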

In some examples, ML system configuration circuit 3600 includes means for generating, maintaining, and/or updating an ontology database based on the evaluation parameters. For example, the means for generating, maintaining, and/or updating may be implemented by the ontology generation circuit 3650. In some examples, the ontology generation circuit 3650 may be instantiated by a processor circuit, such as the example processor circuit 4712 of fig. 47. For example, the ontology generation circuit 3650 may be instantiated by the example general purpose processor circuit 34500 of fig. 345 executing machine-executable instructions implemented, for example, by at least block 4112 of fig. 41, block 4204 of fig. 42, block 4304 of fig. 43, and block 4604 of fig. 46. In some examples, the ontology generation circuit 3650 may be instantiated by hardware logic circuitry, which may be implemented by ASIC or FPGA circuitry 34600 of fig. 346 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, ontology generation circuit 3650 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the ontology generation circuit 3650 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions, without executing software or firmware, although other configurations are equally suitable.

In some examples, ML system configuration circuit 3600 includes means for executing one or more machine learning models in a first configuration on hardware in a second configuration. In some such examples, the executing is in response to the evaluation parameter meeting a threshold. In some such examples, the one or more machine learning models and the hardware are to execute a workload. For example, the means for executing may be implemented by the workload execution circuit 3660. In some examples, the workload execution circuit 3660 may be instantiated by a processor circuit, such as the example processor circuit 4712 of fig. 47. For example, workload execution circuit 3660 may be instantiated via example general purpose processor circuit 34500 of fig. 345 executing machine-executable instructions, such as implemented by at least block 4118 of fig. 41 and block 4614 of fig. 46. In some examples, the workload execution circuit 3660 may be instantiated by hardware logic circuitry, which may be implemented by ASIC or FPGA circuitry 34600 of fig. 346 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, workload execution circuit 3660 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the workload execution circuit 3660 may be implemented by at least one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions without executing software or firmware, although other structures are equally suitable.

In some examples, ML system configuration circuit 3600 includes means for storing data. In some examples, the data may include software templates 3672, hardware templates 3674, interconnect topologies 3676, historical configuration 3678, or any other data described herein. For example, the means for storing may be implemented by the data store 3670. In some examples, the data store 3670 may be instantiated by a processor circuit, such as the example processor circuit 4712 of fig. 47. For example, data store 3670 may be instantiated via general purpose processor circuit 34500 of fig. 345 executing machine-executable instructions. In some examples, the data store 3670 may be instantiated by hardware logic circuitry, which may be implemented by ASIC or FPGA circuitry 34600 of fig. 346 configured to perform operations corresponding to machine-readable instructions. Additionally or alternatively, the data store 3670 may be instantiated by any other combination of hardware, software, and/or firmware. For example, the data store 3670 may be implemented by one or more mass storage devices (e.g., one or more mass storage devices 4728 of fig. 47), one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuitry, etc.) configured to execute some or all of the machine-readable instructions and/or to perform some or all of the operations corresponding to the machine-readable instructions, without executing software or firmware, although other arrangements are equally suitable.

While an example manner of implementing the ML system configurator 3402 of fig. 34 and/or 35 is illustrated in fig. 36, one or more of the elements, processes, and/or devices illustrated in fig. 36 may be combined, divided, rearranged, omitted, eliminated, and/or implemented in any other way. In addition, the example interface circuit 3610, the example ML software configuration circuit 3620, the example ML hardware configuration circuit 3630, the example configuration evaluation circuit 3640, the example ontology generation circuit 3650, the example workload execution circuit 3660, the example data store 3670, the example bus 3680, and/or, more generally, the example ML system configurator 3402 of fig. 34 and/or 35 may be implemented in hardware alone or in combination with software and/or firmware. Thus, for example, any of the example interface circuit 3610, the example ML software configuration circuit 3620, the example ML hardware configuration circuit 3630, the example configuration evaluation circuit 3640, the example ontology generation circuit 3650, the example workload execution circuit 3660, the example data store 3670, the example bus 3680, and/or, more generally, the example ML system configurator 3402, may be implemented by processor circuit(s), analog circuit(s), digital circuit(s), logic circuit(s), programmable processor(s), programmable microcontroller(s), GPU(s), DSP(s), ASIC(s), programmable logic device(s) (PLD(s)), and/or field programmable logic device(s) (FPLD(s)) (e.g., FPGAs). Further, the example ML system configurator 3402 of fig. 34 and/or 35 may include one or more elements, processes, and/or devices in addition to or instead of those shown in fig. 36, and/or may include more than one of any or all of the illustrated elements, processes, and devices.

FIG. 37 is an illustration of an example workflow 3700 of generating an ML compute node, such as the configurable ML compute node 3517 of FIG. 35. Workflow 3700 includes a first configurable building block database 3510A in the configurable building block database 3510 of fig. 35, a first hardware template 3514A in the hardware template 3514 of fig. 35, an ontology generator 3506 of fig. 35, an ontology database 3508 of fig. 35, an ML compute node 3517 of fig. 35, and hardware 3521 of fig. 35.

The first hardware template 3514A of the illustrated example includes a first example block 3702, a second example block 3704, and an example register file 3706. In this example, the first block 3702 is a matrix-vector block (identified by a "MAT_VEC block"). For example, the first block 3702 may be a hardware block or a portion of hardware, such as the GPU 3422 of fig. 34 (or the CPU 3418, AI processor 3426, FPGA 3430 of fig. 34, etc.), which may perform matrix-vector computing operations. Additionally and/or alternatively, the first block 3702 may be a software block, a kernel, etc., which may include a portion or fragment of machine-readable instructions. In some such examples, the first block 3702 may be implemented by code, which when executed by hardware or processor circuitry, may perform matrix-vector calculations.

In this example, the second block 3704 is a vector-vector block (identified by a "VEC_VEC block"). For example, the second block 3704 may be a hardware block or a portion of hardware, such as the GPU 3422 of fig. 34 (or the CPU 3418, AI processor 3426, FPGA 3430 of fig. 34, etc.), which may perform vector-vector computing operations. Additionally and/or alternatively, the second block 3704 may be a software block, a kernel, etc., which may include a portion or fragment of machine readable instructions. In some such examples, the second block 3704 may be implemented by code, which when executed by hardware or processor circuitry, may perform vector-vector calculations.
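When the matrix-vector and vector-vector blocks are realized as software kernels, the code involved might resemble the following minimal sketch. The function names `vec_vec` and `mat_vec` are hypothetical, and interpreting the vector-vector operation as a dot product is an assumption (an element-wise operation would be an equally valid reading).

```python
def vec_vec(a: list, b: list) -> float:
    """Vector-vector kernel, assumed here to be a dot product."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def mat_vec(matrix: list, vector: list) -> list:
    """Matrix-vector kernel built from one vector-vector product per row."""
    return [vec_vec(row, vector) for row in matrix]


print(vec_vec([1, 2, 3], [4, 5, 6]))        # 32
print(mat_vec([[1, 2], [3, 4]], [5, 6]))    # [17, 39]
```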

In this example, register file 3706 may include one or more register files, each of which may be implemented by an array, bank, or the like of processor registers. For example, register file 3706 can store the state of a processor thread (e.g., CPU thread, GPU thread, etc.) that supports execution of a workload.

In the illustrated example of fig. 37, workflow 3700 begins when ML system configurator 3402 of fig. 34 and/or 35 generates first example configuration 3708 (identified by "configuration iteration 34") based on first hardware template 3514A and/or, more generally, first configurable building block database 3510A. The first configuration 3708 of the illustrated example includes a first block 3702, a second block 3704, and two register files of the register file 3706. In response to generating the first configuration 3708, the ML system configurator 3402 may evaluate the first configuration 3708 based on execution of the workload(s) 3516 of fig. 35 utilizing the first configuration 3708. The ontology generator 3506 may update the ontology database 3508 based on the first configuration 3708, the evaluation parameter(s) associated with the first configuration 3708, and/or the like, and/or any combination(s) of these.

In the illustrated example of fig. 37, workflow 3700 includes ML system configurator 3402 generating second example configuration 3710 (identified by "configuration iteration 35") based on first hardware template 3514A and/or, more generally, first configurable building block database 3510A. In the illustrated example, the second configuration 3710 is an iteration, update, etc. of the first configuration 3708. In some examples, the iteration of the first configuration 3708 can be implemented based on the evaluation parameter(s) associated with the first configuration 3708 (e.g., by incentivizing improved evaluation parameter values such as accuracy, latency, throughput, etc.). The second configuration 3710 of the illustrated example includes a first block 3702, two instances of a second block 3704, and three register files of the register file 3706. In response to generating the second configuration 3710, the ML system configurator 3402 may evaluate the second configuration 3710 based on execution of the workload(s) 3516 utilizing the second configuration 3710. The ontology generator 3506 may update the ontology database 3508 based on the second configuration 3710, the evaluation parameter(s) associated with the second configuration 3710, and/or the like, and/or any combination(s) of these.
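The generate, evaluate, and record cycle of the configuration iterations can be sketched as a small greedy search. This is an illustrative sketch only: `Configuration`, `candidates`, `iterate`, and `toy_throughput` are hypothetical, the greedy selection rule is an assumption, and the scoring function is a made-up stand-in for evaluating a workload. The `history` list plays the role of the records an ontology database could be updated with.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Configuration:
    mat_vec_blocks: int
    vec_vec_blocks: int
    register_files: int


def candidates(cfg: Configuration) -> list:
    # Assumed mutation rule: add one block together with one register file.
    return [
        replace(cfg, mat_vec_blocks=cfg.mat_vec_blocks + 1,
                register_files=cfg.register_files + 1),
        replace(cfg, vec_vec_blocks=cfg.vec_vec_blocks + 1,
                register_files=cfg.register_files + 1),
    ]


def iterate(start: Configuration, evaluate, threshold: float, max_iterations: int = 10):
    """Greedily evolve a configuration until its score exceeds the threshold."""
    current, score = start, evaluate(start)
    history = [(current, score)]          # what an ontology update could record
    for _ in range(max_iterations):
        if score > threshold:
            break
        current = max(candidates(current), key=evaluate)
        score = evaluate(current)
        history.append((current, score))
    return current, history


def toy_throughput(cfg: Configuration) -> float:
    # Made-up scoring function standing in for workload evaluation.
    return cfg.mat_vec_blocks + 2 * cfg.vec_vec_blocks


first = Configuration(mat_vec_blocks=1, vec_vec_blocks=1, register_files=2)
final, history = iterate(first, toy_throughput, threshold=4)
print(final)
```

Under this toy scoring, one iteration moves from one MAT_VEC block, one VEC_VEC block, and two register files to one MAT_VEC block, two VEC_VEC blocks, and three register files.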

Advantageously, the ML system configurator 3402 may evolve multiple groups of related configurable building blocks simultaneously, each group of building blocks covering a different architectural class and design style. For example, the workflow 3700 may be executed for different hardware simultaneously (e.g., substantially simultaneously). In some such examples, the workflow 3700 can be executed at substantially the same time for a GPU, a CPU, an AI processor, or the like. Advantageously, evolving multiple groups of related configurable building blocks for different hardware simultaneously may result in identifying hardware that meets the requirements of a given workload. For example, the ML system configurator 3402 may determine that an AI processor architecture based on a systolic array design style may be suitable for a computationally intensive AI model, but not for memory-bound and less computationally intensive workloads. Thus, by evolving hardware architectures having different design styles simultaneously, the ML system configurator 3402 is allowed to evolve flexibly to achieve the best combination of accuracy and hardware efficiency during the co-design process, which may be implemented in whole and/or in part by the workflow 3700. Similarly, workflow 3700 can be performed in software search space 3518 of FIG. 35 by evolving multiple groups of related configurable building blocks for different software simultaneously. For example, in a neural network software search, there are multiple classes of networks that have their own beneficial attributes (e.g., RNN, CNN, transformer, etc.) and their own configurable building block(s) (e.g., matrix-vector operations for an RNN, convolutions for a CNN, etc.).

During workflow 3700, ML system configurator 3402 may generate and/or otherwise identify ML computing node 3517 based on a plurality of configuration iterations (e.g., first configuration 3708, second configuration 3710, etc.). In this example, ML system configurator 3402 may generate ML computing node 3517 based on third example configuration 3712 (identified by "configuration iteration N"). The third configuration 3712 includes three instances of the first block 3702, the second block 3704, and two of the register files 3706. The ontology generator 3506 may update the ontology database 3508 based on the third configuration 3712, the evaluation parameter(s) associated with the third configuration 3712, and/or the like, and/or any combination(s) of these.
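The three configuration iterations of fig. 37 can be pictured as simple records whose block and register file counts change from one iteration to the next. The following is a minimal sketch only; the `Configuration` class, its field names, and the `changes` helper are hypothetical and are not part of the disclosed examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    """Hypothetical record of one configuration iteration (illustrative names)."""
    label: str
    blocks_3702: int     # instances of block 3702
    blocks_3704: int     # instances of block 3704
    register_files: int  # register files taken from register files 3706


# Counts follow the description of configurations 3708, 3710, and 3712.
iterations = [
    Configuration("first configuration 3708", 1, 1, 2),
    Configuration("second configuration 3710", 1, 2, 3),
    Configuration("third configuration 3712", 3, 1, 2),
]


def changes(prev, curr):
    """Return the fields that differ between two iterations as {field: (old, new)}."""
    fields = ("blocks_3702", "blocks_3704", "register_files")
    return {
        f: (getattr(prev, f), getattr(curr, f))
        for f in fields
        if getattr(prev, f) != getattr(curr, f)
    }


# What changed between the first and second iterations.
delta = changes(iterations[0], iterations[1])
print(delta)
```

Recording each iteration in this form would make it straightforward to store the configuration alongside its evaluation parameters, as the ontology database update describes.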

FIG. 38 is an illustration of another example workflow 3800 for identifying configurable machine learning computing nodes, such as ML computing node 3517 of FIG. 35. The workflow 3800 of the illustrated example includes a second of the configurable building block databases 3510 of fig. 35, a controller 3502 of fig. 35, an evaluator 3504 of fig. 35, a software search space 3518 of fig. 35, a hardware search space 3520 of fig. 35, a proposed HW/SW instance 3522 of fig. 35, performance modeling 3524 of fig. 35, evaluation parameters 3526 of fig. 35, a reward function 3528 of fig. 35, and an example interconnection topology library 3802.

In the illustrated example, the second configurable building block database 3510B includes and/or otherwise implements the interconnection topology library 3802. In some examples, the interconnection topology library 3802 can be implemented by the interconnect topology 3676 of fig. 36. In the illustrated example, the interconnection topology library 3802 depicts example topologies of different example nodes 3804, 3806, 3808, 3810, including a first example node 3804, a second example node 3806, a third example node 3808, and a fourth example node 3810. The nodes 3804, 3806, 3808, 3810 of the illustrated example are heterogeneous computing nodes, which may be implemented by one or more portions from different types of hardware. For example, the first node 3804 includes a first example hardware core 3812, a second example hardware core 3814, and a third example hardware core 3816. In some such examples, the first hardware core 3812 may be a hardware core of a GPU, the second hardware core 3814 may be a hardware core of an AI processor, and the third hardware core 3816 may be a hardware core of a CPU.

In the illustrated example, each of the nodes 3804, 3806, 3808, 3810 has a different topology (e.g., interconnect configuration). For example, the first node 3804 has a first topology in which the cores 3812, 3814, 3816 are coupled in sequence. The second node 3806 has a second topology in which each of the cores 3812, 3814, 3816 is coupled with two other cores. The third node 3808 has a third topology in which one core provides an output to each of the remaining cores. The fourth node 3810 has a fourth topology in which all cores, except one, provide their respective outputs to the remaining core. Alternatively, any other topology may be included in the interconnection topology library 3802.
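The four topologies described for the nodes 3804, 3806, 3808, 3810 can be expressed as directed edge lists over three cores. The sketch below is illustrative only: the function names and the choice of which core acts as the source or sink are assumptions, since the description does not specify them.

```python
# Hypothetical identifiers for the hardware cores 3812, 3814, 3816.
CORES = ["core_3812", "core_3814", "core_3816"]


def chain(cores):
    """First topology: cores coupled in sequence."""
    return [(cores[i], cores[i + 1]) for i in range(len(cores) - 1)]


def ring(cores):
    """Second topology: each core is coupled with two other cores."""
    return [(cores[i], cores[(i + 1) % len(cores)]) for i in range(len(cores))]


def fan_out(cores):
    """Third topology: one core (assumed to be the first) feeds every other core."""
    return [(cores[0], c) for c in cores[1:]]


def fan_in(cores):
    """Fourth topology: every core but one feeds the remaining core (assumed last)."""
    return [(c, cores[-1]) for c in cores[:-1]]


def degree(edges, core):
    """Number of connections that touch the given core."""
    return sum(core in edge for edge in edges)


topology_library = {
    "node_3804": chain(CORES),
    "node_3806": ring(CORES),
    "node_3808": fan_out(CORES),
    "node_3810": fan_in(CORES),
}
for node, edges in topology_library.items():
    print(node, edges)
```

An edge list keeps each topology easy to compare and enumerate, which suits a search that evaluates one candidate topology after another.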

The workflow 3800 may generally implement a first example operation 3818 and a second example operation 3820. For example, the ML system configurator 3402 may perform the first operation 3818 by optimizing and/or otherwise improving a heterogeneous system solution (e.g., an example implementation of the ML computing node 3517) given the candidate AI model architecture (e.g., software 3519 of fig. 35, portion(s) of the proposed HW/SW instance 3522 of fig. 35, etc.). In some such examples, ML system configurator 3402 may iteratively evolve the hardware portion of proposed HW/SW instance 3522 by iteratively evaluating one (or more) of nodes 3804, 3806, 3808, 3810 and their respective topologies to determine which one (or more) of nodes 3804, 3806, 3808, 3810 implements an improved and/or otherwise optimal value of the evaluation parameter of interest.

In some examples, the ML system configurator 3402 may perform the second operation 3820 by optimizing and/or otherwise improving the AI model given the candidate system solution. For example, the ML system configurator 3402 may iteratively evolve the software portion of the proposed HW/SW instance 3522 by iteratively evaluating different AI/ML models, different AI/ML model topologies, and so on, in response to changes in the hardware portion of the proposed HW/SW instance 3522. In some examples, the first and second operations 3818, 3820 may be iteratively performed to identify (i) a best and/or otherwise optimal target platform (e.g., hardware and/or software platform) for the different compute kernels and/or (ii) a best and/or otherwise optimal interconnection topology between the different compute nodes.
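The alternation between the first operation 3818 (improve the hardware given a candidate model) and the second operation 3820 (improve the model given a candidate system) resembles a coordinate-wise search. The following sketch assumes a hypothetical lookup table of scores in place of real performance modeling; the score values, the candidate names, and the `co_design` function are illustrative assumptions.

```python
# Hypothetical evaluation scores for (hardware, software) pairs; higher is better.
SCORES = {
    ("node_3804", "cnn"): 0.70, ("node_3804", "transformer"): 0.60,
    ("node_3806", "cnn"): 0.75, ("node_3806", "transformer"): 0.90,
    ("node_3808", "cnn"): 0.65, ("node_3808", "transformer"): 0.80,
}
HARDWARE = ["node_3804", "node_3806", "node_3808"]
SOFTWARE = ["cnn", "transformer"]


def co_design(hw, sw, max_rounds=10):
    """Alternate hardware and software selection until neither choice changes."""
    for _ in range(max_rounds):
        # First operation: pick the best hardware for the current model.
        best_hw = max(HARDWARE, key=lambda h: SCORES[(h, sw)])
        # Second operation: pick the best model for that hardware.
        best_sw = max(SOFTWARE, key=lambda s: SCORES[(best_hw, s)])
        if (best_hw, best_sw) == (hw, sw):
            break  # neither side changed, so the search has settled
        hw, sw = best_hw, best_sw
    return hw, sw


result = co_design("node_3804", "cnn")
print(result)
```

A search of this kind settles on a pair that neither step can improve alone, which is not necessarily the best pair overall; a fuller search strategy would be needed for that guarantee.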

Fig. 39 is an illustration of an example implementation of an example ontology database 3900. In some examples, the ontology database 3900 may implement the ontology database 3508 of fig. 35, the historical configuration 3678 of fig. 36, and/or the data store 3670 of fig. 36.

The ontology database 3900 of the illustrated example includes an example building block ontology 3902. Building block ontology 3902 of the illustrated example is implemented by a graph (e.g., an ontology graph). Additionally and/or alternatively, building block ontology 3902 may be implemented by any other data representation, such as a table, map, grid, package, datagram, frame, file, document, report, list, or any other form. Building block ontology 3902 includes example software blocks 3904 arranged in relation to one another. For example, software block 3904 may correspond to portion(s) of the AI/ML model. In the illustrated example, software blocks 3904 include convolution blocks, residual blocks, pooling blocks, bottleneck blocks, linear blocks, and so forth. In the illustrated example, the convolution blocks include two-dimensional convolution (identified by CONV 2D), three-dimensional convolution (CONV 3D), grouped convolution, and so forth. For example, different layers of building block ontology 3902 may provide increased granularity of different types and sub-types of AI/ML components.

The ontology database 3900 of the illustrated example includes an example historical configuration database 3904. Database 3904 of the illustrated example is implemented by a table (e.g., a historical configuration table). Additionally and/or alternatively, database 3904 may be implemented by any other data representation, such as a graph, map, grid, package, datagram, frame, file, document, report, list, or any other form. The database 3904 of the illustrated example includes columns for index, layer type, kernel size, input channels, output channels, rank within category, positions of the preceding and following layers, occurrence in optimized SW/HW, and so forth. In the illustrated example, a first one of the indices (identified by "index 7") corresponds to a layer of the AI/ML model, which in this example is a layer at a particular location in the neural network, which can implement a two-dimensional convolution. In the illustrated example, "index 7" corresponds to a two-dimensional convolution having a kernel size of 5x5, 128 input channels, and 64 output channels, and ranking third among the two-dimensional convolution layers. In the illustrated example, the two-dimensional convolution layer identified by "index 7" typically has a preceding layer corresponding to the layer identified by "index 2" in the table, and a following layer corresponding to the layer identified by "index 43" in the table. For example, the AI/ML model can have a first layer (e.g., the layer identified by "index 2"), a second layer (e.g., the layer identified by "index 7"), and a third layer (e.g., the layer identified by "index 43"). In some such examples, the output(s) of the layer identified by "index 2" are provided to the input(s) of the layer identified by "index 7". In some such examples, the output(s) of the layer identified by "index 7" are provided to the input(s) of the layer identified by "index 43".
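One row of such a historical configuration table might be represented as follows. This is a sketch under stated assumptions: the `LayerRecord` class and its field names are hypothetical, and only the row for "index 7" is populated because it is the only row whose values are given in the description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LayerRecord:
    """Hypothetical row of the historical configuration table."""
    index: int
    layer_type: str
    kernel_size: Tuple[int, int]
    in_channels: int
    out_channels: int
    rank_in_category: int
    prev_index: Optional[int]  # index of the layer that usually precedes this one
    next_index: Optional[int]  # index of the layer that usually follows this one


# Values for "index 7" as described: 5x5 kernel, 128 in, 64 out, ranked third.
table = {
    7: LayerRecord(7, "conv2d", (5, 5), 128, 64, 3, prev_index=2, next_index=43),
}


def neighbors(table, index):
    """Return the (preceding, following) layer indices recorded for a row."""
    record = table[index]
    return record.prev_index, record.next_index


print(neighbors(table, 7))
```

Storing the neighboring indices in each row lets a search reconstruct layer orderings that appeared in earlier configurations.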

FIG. 40 is an illustration of an example workflow 4000 for identifying configurable ML compute nodes, such as ML compute node 3517 of FIG. 35. Workflow 4000 includes controller 3502 and evaluator 3504 of fig. 35. Workflow 4000 includes an example building block 4002 and an example model layer 4004. In some examples, building block 4002 can be implemented by software template 3512, hardware template 3514, and/or more generally configurable building block database 3510 of fig. 35. In the illustrated example, the building block 4002 includes an example CPU core 4006, an example GPU core 4008, an example FPGA core 4010, and an example ASIC core 4012. In some examples, one (or more) of the cores 4006, 4008, 4010, 4012 may be implemented by one (or more) of the hardware templates 3514 of fig. 35. For example, the CPU core 4006 may be implemented by the "HW template N" of fig. 35, the GPU core 4008 may be implemented by the "HW template 35" of fig. 35, and the FPGA core 4010 may be implemented by the "HW template 34" of fig. 35, and so on.

In some examples, model layer 4004 may be implemented by the proposed HW/SW instance 3522 of fig. 35 and/or software 3519 of fig. 35. For example, model layer 4004 may be implemented by a database that includes historical implementations of ML computing nodes, immediate or current implementations of ML computing nodes being evaluated, and so forth.

During the workflow 4000, at an initial example operation 4014, the controller 3502 receives an initial AI model, which can be referred to as a seed AI model. For example, the initial AI model may be a particular neural network that is known to be efficient for the workload of interest (e.g., image processing). Additionally and/or alternatively, the initial operation 4014 may include a function input, request, etc. indicating a desired AI/ML operation (e.g., a desire to perform image processing without specifying an initial AI model). In some such examples, the controller 3502 can identify an initial AI model based on the function input, the request, and the like.

During a first example operation 4016, the controller 3502 can select a layer implementation given the initial AI model. For example, the controller 3502 can map the initial AI model to one (or more) of the cores 4006, 4008, 4010, 4012 of the building block 4002. In some such examples, the controller 3502 can identify the GPU core 4008 based on determining that the GPU core 4008 is efficient for executing the initial AI model. For example, the controller 3502 can identify an implementation(s) of layer(s) of the initial AI model, wherein the implementation(s) can correspond to hardware, such as one or more of the GPU cores 4008.

During a second example operation 4018, the controller 3502 can provide the initial AI model and the layer implementation to the evaluator 3504. For example, the evaluator 3504 may evaluate the model and layer implementation based on simulation(s) and the like of the model and layer implementation executing a desired or expected workload. The evaluator 3504 may evaluate the model and layer implementation to generate example accuracy parameters 4020, example performance parameters 4022, example energy parameters 4024, and/or any other type of parameters, such as latency, cost (e.g., computational cost, monetary cost, production or manufacturing cost, cost of purchasing energy to power hardware running the model, etc.), and so forth. For example, the accuracy parameter 4020 may be the accuracy of the model and layer implementation. In some examples, the performance parameter 4022 may be the efficiency, throughput, etc. of the model and layer implementation. In some examples, the energy parameter 4024 may be the power consumption of the layer implementation in executing the model. In some examples, the energy parameter 4024 may be the heat dissipation of hardware configured using the layer implementation when executing the model. In the illustrated example, parameters 4020, 4022, 4024 are provided as inputs to an example cost function 4026. In some examples, the cost function 4026 may be implemented by the reward function 3528 of fig. 35. For example, the cost function 4026 may determine the difference between the values of the parameters 4020, 4022, 4024 and the expected or predicted values of the parameters 4020, 4022, 4024.
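One way to picture the cost function 4026 is as an aggregate of the differences between measured and expected parameter values. The sketch below assumes a weighted sum of absolute differences; that particular form, the parameter names, and the numeric values are illustrative assumptions, since the description states only that a difference is determined.

```python
def cost(measured, expected, weights=None):
    """Weighted sum of absolute differences between measured and expected values.

    The weighted-sum form is an assumption made for illustration.
    """
    if weights is None:
        weights = {name: 1.0 for name in expected}
    return sum(
        weights[name] * abs(measured[name] - expected[name])
        for name in expected
    )


# Hypothetical values standing in for parameters 4020, 4022, and 4024.
measured = {"accuracy": 0.5, "performance": 0.75, "energy": 2.0}
expected = {"accuracy": 1.0, "performance": 1.0, "energy": 1.5}

total = cost(measured, expected)
print(total)  # 0.5 + 0.25 + 0.5 = 1.25
```

Raising the weight of one parameter makes deviations in that parameter dominate the result, which is one way a controller could prioritize one parameter over another.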

During a third example operation 4028, the output of the cost function 4026 may result in an update of agent parameters (e.g., agent parameters in a reinforcement learning AI/ML model) processed and/or otherwise maintained by the controller 3502. For example, the controller 3502 may determine whether to modify the model to prioritize one parameter (e.g., heat dissipation, accuracy, etc.) over another parameter (e.g., energy consumption, etc.).

During a fourth example operation 4030, the controller 3502 may adjust the model and/or layer implementation based on the output from the cost function 4026. For example, the controller 3502 may replace the initial AI model with a different type of AI/ML model, change the configuration of the initial AI model, and so forth. In some examples, the controller 3502 can replace the GPU core 4008 with a different core (e.g., FPGA core 4010, etc.), change the configuration (e.g., register file, topology, etc.) of the GPU core 4008, and so forth.

During a fifth example operation 4032, the controller 3502 provides another iteration of the model and layer implementation to the evaluator 3504 for evaluation. Advantageously, the workflow 4000 of fig. 40 may be executed (e.g., iteratively executed) to identify models and corresponding layer implementations that execute workloads with improved accuracy, performance, energy consumption, heat dissipation, cost, and the like.

Flowcharts representative of example hardware logic circuits, machine readable instructions, hardware-implemented state machines, and/or any combination thereof for implementing ML system configurator 3402 of fig. 34 and/or 35 and/or ML system configuration circuit 3600 of fig. 36 are shown in figs. 41-46. The machine-readable instructions may be one or more executable programs, or portion(s) of an executable program, for execution by a processor circuit, such as the processor circuit 4712 shown in the example processor platform 4700 discussed below in connection with fig. 47 and/or the example processor circuit discussed below in connection with fig. 345 and/or 346. The program may be embodied in software stored on one or more non-transitory computer readable storage media, such as a Compact Disc (CD), a floppy disk, a Hard Disk Drive (HDD), a Solid State Drive (SSD), a Digital Versatile Disc (DVD), a blu-ray disc, volatile memory (e.g., any type of Random Access Memory (RAM), etc.), or non-volatile memory (e.g., electrically erasable programmable read-only memory (EEPROM), flash memory, HDD, SSD, etc.), associated with processor circuitry located in one or more hardware devices, but the entire program and/or a portion thereof may be executed by one or more hardware devices other than processor circuitry and/or embodied in firmware or dedicated hardware. The machine-readable instructions may be distributed across multiple hardware devices and/or executed by two or more hardware devices (e.g., a server and a client hardware device). For example, the client hardware device may be implemented by an endpoint client hardware device (e.g., a hardware device associated with a user) or an intermediary client hardware device (e.g., a Radio Access Network (RAN) gateway that may facilitate communications between the server and the endpoint client hardware device). Similarly, the non-transitory computer readable storage medium may include one or more media located in one or more hardware devices. Additionally, although the example program is described with reference to the flowcharts shown in figs. 41-46, many other methods of implementing the example ML system configurator 3402 of fig. 34 and/or 35 and/or the example ML system configuration circuit 3600 of fig. 36 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., processor circuits, discrete and/or integrated analog and/or digital circuits, FPGAs, ASICs, comparators, operational amplifiers (op-amps), logic circuits, etc.) configured to perform the respective operations without executing software or firmware. The processor circuits may be distributed in different network locations and/or local to one or more hardware devices (e.g., a single core processor (e.g., a single core Central Processing Unit (CPU)), a multi-core processor (e.g., a multi-core CPU), etc.), multiple processors distributed across multiple servers of a server rack, multiple processors distributed across one or more server racks, CPUs and/or FPGAs located in the same package (e.g., the same Integrated Circuit (IC) package or in two or more separate housings, etc.).

As described above, the example operations of figs. 41-46 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on one or more non-transitory computer and/or machine readable media, such as optical storage devices, magnetic storage devices, HDDs, flash memory, read-only memory (ROM), CDs, DVDs, caches, any type of RAM, registers, and/or any other storage device or storage disk where information may be stored for any duration (e.g., for a longer period of time, permanently stored, temporarily stored, for temporary buffering, and/or for caching of the information). As used herein, the terms non-transitory computer readable medium and non-transitory computer readable storage medium are expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.

FIG. 41 is a flowchart representative of example machine readable instructions and/or example operations 4100 that can be executed and/or instantiated by the processor circuit to perform a workload with the configurable ML compute node. The example machine readable instructions and/or example operations 4100 of fig. 41 begin at block 4102 where ML system configuration circuit 3600 receives a request to perform a Machine Learning (ML) workload. For example, interface circuit 3610 (fig. 36) may receive a request identifying a combination of hardware and/or software to execute workload(s) 3516 of fig. 35. In some such examples, a combination of hardware and/or software may be implemented by software 3519, hardware 3521, and/or more generally ML computing node 3517 of fig. 35.

At block 4104, ML system configuration circuit 3600 generates a first configuration of one or more ML models based on the ML workload. For example, the ML software configuration circuit 3620 (FIG. 36) can identify an AI/ML model, such as a CNN, from the software search space 3518. In some such examples, ML software configuration circuit 3620 may identify the configuration of the CNN based on one of software template 3512 of fig. 35, software template 3672 of fig. 36, and so forth, corresponding to the CNN. An example process that may be performed to implement block 4104 is described below in connection with fig. 42.

At block 4106, ML system configuration circuit 3600 generates a second configuration of hardware based on the ML workload. For example, ML hardware configuration circuit 3630 (fig. 36) may identify hardware, such as a GPU, from hardware search space 3520. In some such examples, ML hardware configuration circuit 3630 may identify the configuration of the GPU based on one of hardware templates 3514 of fig. 35, hardware templates 3674 of fig. 36, etc., corresponding to the GPU. An example process that may be performed to implement block 4106 is described below in connection with fig. 43.

At block 4108, ML system configuration circuit 3600 generates an evaluation parameter from execution of the workload based on the first configuration and the second configuration. For example, configuration evaluation circuit 3640 (fig. 36) may perform performance modeling (e.g., simulation(s), emulation(s), debugging, etc.) associated with the GPU executing the CNN. In some such examples, configuration evaluation circuit 3640 may generate evaluation parameters 3526, which may result from a simulation, emulation, etc. of the GPU executing an AI/ML workload with the CNN.

At block 4110, the ML system configuration circuit 3600 determines whether the evaluation parameter meets a threshold. For example, the configuration evaluation circuit 3640 may determine whether an evaluation parameter, such as an accuracy parameter, has a value that meets an evaluation parameter threshold, such as an accuracy threshold (e.g., an accuracy parameter threshold). In some such examples, configuration evaluation circuit 3640 may determine that the accuracy parameter has a value of 425%, which satisfies an accuracy threshold of 420%, because the value of 425% is greater than 420%.

If at block 4110 the ML system configuration circuit 3600 determines that the evaluation parameters do not meet the threshold, at block 4112 the ML system configuration circuit 3600 updates the ontology database based on the evaluation parameters. For example, ontology generation circuit 3650 (fig. 36) may update the ontology database 3508 of fig. 35 based on the evaluation parameters 3526, the proposed HW/SW instance 3522 associated with the evaluation parameters 3526, and/or the like, and/or any combination(s) of these.

At block 4114, the ML system configuration circuit 3600 adjusts the first configuration based on the evaluation parameters. For example, the ML software configuration circuit 3620 can replace the CNN with a different AI/ML model, add another AI/ML model, change the configuration of the CNN, and/or the like, and/or any combination(s) of these. An example process that may be performed to implement block 4114 is described below in connection with fig. 44.

At block 4116, the ML system configuration circuit 3600 adjusts the second configuration based on the evaluation parameters. For example, ML hardware configuration circuit 3630 may replace the GPU with different hardware, add additional hardware, change the configuration of the GPU, and the like, and/or any combination(s) of these. An example process that may be performed to implement block 4116 is described below in connection with fig. 45. In response to adjusting the second configuration based on the evaluation parameters at block 4116, control returns to block 4108 to generate evaluation parameters from execution of the workload using the first configuration (e.g., an updated or adjusted version of the first configuration) and the second configuration (e.g., an updated or adjusted version of the second configuration).

If at block 4110 the ML system configuration circuit 3600 determines that the evaluation parameters meet the threshold, control proceeds to block 4118 to execute the one or more ML models in the first configuration on the hardware in the second configuration. For example, workload execution circuit 3660 (fig. 36) may compile, organize, generate, identify, and/or otherwise instantiate ML computing node 3517 of fig. 35. In some such examples, software 3519 of ML computing node 3517 can be implemented by one or more AI/ML models based on the first configuration. In some examples, hardware 3521 of ML computing node 3517 can be implemented by one or more hardware types and/or instances based on the second configuration. In some examples, ML computing node 3517 may be deployed and/or otherwise provided to execute workload(s) 3516. In response to executing the one or more ML models based on the first configuration on the hardware in the second configuration at block 4118, the example machine readable instructions and/or the example operations 4100 of fig. 41 end.
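The control flow of blocks 4104 through 4118 can be summarized as a generate, evaluate, and adjust loop. The following is a minimal sketch, assuming hypothetical callables for each step and a toy scoring rule; the `max_iterations` guard and the use of `>=` for the threshold test are assumptions that the flowchart does not specify.

```python
def configure(generate_sw, generate_hw, evaluate, adjust_sw, adjust_hw,
              threshold, max_iterations=100):
    """Iterate software/hardware configurations until the score meets the threshold."""
    sw = generate_sw()                      # block 4104: first configuration
    hw = generate_hw()                      # block 4106: second configuration
    history = []                            # stand-in for the ontology database
    for _ in range(max_iterations):         # guard added for this sketch only
        score = evaluate(sw, hw)            # block 4108: evaluation parameter
        if score >= threshold:              # block 4110: threshold check
            return sw, hw, history          # block 4118 would execute from here
        history.append((sw, hw, score))     # block 4112: record the failed attempt
        sw = adjust_sw(sw, score)           # block 4114: adjust first configuration
        hw = adjust_hw(hw, score)           # block 4116: adjust second configuration
    raise RuntimeError("no configuration met the threshold")


# Toy stand-ins: configurations are integers and the score is their sum.
sw, hw, history = configure(
    generate_sw=lambda: 2,
    generate_hw=lambda: 1,
    evaluate=lambda s, h: s + h,
    adjust_sw=lambda s, score: s + 1,
    adjust_hw=lambda h, score: h + 1,
    threshold=6,
)
print(sw, hw, len(history))
```

In this toy run the first two attempts score 3 and 5 and are recorded, and the third attempt scores 7 and is returned.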

Fig. 42 is a flowchart representative of example machine readable instructions and/or example operations 4200 that may be executed and/or instantiated by the processor circuit to generate a first configuration of one or more machine learning models based on a machine learning workload. The example machine readable instructions and/or example operations 4200 of fig. 42 may be executed and/or instantiated by the processor circuit to implement block 4104 of the example machine readable instructions and/or example operations 4100 of fig. 41. The example machine readable instructions and/or example operations 4200 of fig. 42 begin at block 4202 where the ML system configuration circuit 3600 of fig. 36 queries a configuration database with the ML workload using an application programming interface (API). For example, ML software configuration circuit 3620 (fig. 36) may query one (or more) of configurable building block database 3510 of fig. 35, software template 3672 of fig. 36, and/or interconnect topology 3676 of fig. 36 via one or more APIs.

At block 4204, the ML system configuration circuit 3600 identifies ML models based on historical configurations. For example, the ontology generation circuit 3650 (fig. 36) may identify ML models, such as an NN, utilized in previous AutoML searches. In some such examples, ontology generation circuit 3650 may identify the ML model based on historical configurations that may be stored in ontology database 3508 of fig. 35 and/or historical configuration 3678 of fig. 36.

At block 4206, the ML system configuration circuit 3600 determines a number of layers of the ML model. For example, the ML software configuration circuit 3620 can determine that the NN is to have multiple layers (e.g., network layers, NN layers, etc.), wherein one (or more) of the multiple layers is coupled to a different one (or more) of the multiple layers in the NN configuration. In some such examples, ML software configuration circuit 3620 may determine the plurality of layers and/or configuration(s) thereof based on information (e.g., metadata or other data) included in software template 3512 of fig. 35, software template 3672 of fig. 36, and so forth.

At block 4208, the ML system configuration circuit 3600 determines weights for layers of the ML model. For example, the ML software configuration circuit 3620 can determine that one (or more) of the plurality of layers is to have a particular weight (e.g., weight value). In some such examples, ML software configuration circuit 3620 may determine the weights based on information (e.g., metadata or other data) included in software template 3512 of fig. 35, software template 3672 of fig. 36, and so forth.

At block 4210, ML system configuration circuit 3600 determines a type of ML training for the ML model. For example, the ML software configuration circuit 3620 may determine that the NN model is to be trained with reinforcement learning. In some such examples, the ML software configuration circuit 3620 can determine a type of ML training for training the NN model based on information (e.g., metadata or other data) included in software template 3512 of fig. 35, software template 3672 of fig. 36, and so forth.

At block 4212, the ML system configuration circuit 3600 determines hyper-parameters for training the ML model. For example, the ML software configuration circuit 3620 may determine values of one or more hyper-parameters that may be utilized to train the NN model. In some such examples, ML software configuration circuit 3620 may determine the value of the hyper-parameter based on information (e.g., metadata or other data) included in software template 3512, software template 3672 of fig. 36, and so forth.

At block 4214, ML system configuration circuit 3600 determines if another ML model is identified. For example, the ML software configuration circuit 3620 can determine that another type of AI/ML model, such as a transformer, is identified for use with the NN. In some such examples, the ML software configuration circuit 3620 can identify a number of AI/ML models and/or types thereof by searching the software search space 3518. In some examples, ML software configuration circuit 3620 may determine that the identified first NN model is a CNN and another type of NN model, such as an ANN, DNN, or the like, may be utilized in conjunction with the CNN.

If at block 4214 the ML system configuration circuit 3600 determines that another ML model is identified, control returns to block 4206 to determine the number of layers of the additional identified ML model. If at block 4214 the ML system configuration circuit 3600 determines that another ML model has not been identified, then at block 4216 the ML system configuration circuit 3600 determines if more than one ML model has been identified. For example, the ML software configuration circuit 3620 may determine that only one ML model (e.g., CNN) has been identified, while in other examples, the ML software configuration circuit 3620 may determine that more than one ML model (e.g., CNN and transformer model) has been identified.

If at block 4216, the ML system configuration circuit 3600 determines that only one ML model is identified, the example machine readable instructions of fig. 42 and/or the example operation 4200 end. For example, the machine-readable instructions and/or example operations 4200 of fig. 42 may return to block 4106 of the machine-readable instructions and/or example operations 4100 of fig. 41 to generate a second configuration of hardware based on the ML workload.

If at block 4216 the ML system configuration circuit 3600 determines that more than one ML model has been identified, at block 4218 the ML system configuration circuit 3600 generates a topology based on the connection(s) between one (or more) of the ML models. For example, the ML software configuration circuit 3620 can analyze different ones of the interconnect topologies 3676 to identify connection(s) between the first identified AI/ML model (e.g., CNN) and the second identified AI/ML model (e.g., transformer model). In some such examples, the ML software configuration circuit 3620 can couple the output(s) of the first identified AI/ML model with the input(s) of the second identified AI/ML model based on the topology in the interconnect topology 3676.

In response to generating the topology based on the connection(s) between one (or more) of the ML models at block 4218, the example machine readable instructions of fig. 42 and/or the example operation 4200 end. For example, the machine-readable instructions and/or example operations 4200 of fig. 42 may return to block 4106 of the machine-readable instructions and/or example operations 4100 of fig. 41 to generate a second configuration of hardware based on the ML workload.
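
The decision at blocks 4216 and 4218 can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions: the names `ModelSpec` and `build_model_topology` are hypothetical, and the simple output-to-input chain stands in for whatever topology would actually be selected from the interconnect topologies 3676.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    """Hypothetical record for one identified AI/ML model."""
    name: str        # e.g., "CNN" or "transformer"
    num_layers: int  # layer count, as determined at block 4206


def build_model_topology(models):
    """Return (source, destination) couplings between identified models.

    Mirrors blocks 4216 and 4218: a single model needs no coupling; with
    several models, the output(s) of each model feed the input(s) of the
    next one. The linear chain is an assumption made for illustration.
    """
    if len(models) <= 1:  # block 4216: only one model identified
        return []
    # block 4218: couple model i to model i + 1
    return [(a.name, b.name) for a, b in zip(models, models[1:])]


identified = [ModelSpec("CNN", 8), ModelSpec("transformer", 12)]
print(build_model_topology(identified))  # [('CNN', 'transformer')]
```

With a single identified model the function returns an empty list, which corresponds to the path in which the operations 4200 end without generating a topology.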

Fig. 43 is a flowchart representative of example machine readable instructions and/or example operations 4300 that may be executed and/or instantiated by the processor circuit to generate a second configuration of hardware based on a machine learning workload. The example machine readable instructions and/or example operations 4300 of fig. 43 may be executed and/or instantiated by processor circuitry to implement block 4106 of the example machine readable instructions and/or example operations 4100 of fig. 41. The example machine readable instructions and/or example operations 4300 of fig. 43 begin at block 4302 where the ML system configuration circuit 3600 of fig. 36 queries a configuration database with ML workloads using an application programming interface. For example, ML hardware configuration circuit 3630 (fig. 36) may query one (or more) of configurable building block database 3510 of fig. 35, hardware template 3674 of fig. 36, and/or interconnect topology 3676 of fig. 36 via one or more APIs.

At block 4304, the ML system configuration circuit 3600 identifies the type of hardware based on the historical configuration. For example, the ontology generation circuit 3650 (fig. 36) may identify the type of hardware, e.g., a GPU, utilized in a previous AutoML search. In some such examples, the ontology generation circuit 3650 may identify the GPU based on historical configurations that may be stored in the ontology database 3508 of fig. 35 and/or the historical configuration 3678 of fig. 36.

At block 4306, the ML system configuration circuit 3600 determines a first block of hardware to execute a matrix-matrix workload. For example, ML hardware configuration circuit 3630 may identify a first core of the GPU to perform the matrix-matrix computing operation(s). In some such examples, ML hardware configuration circuit 3630 may identify the first core and/or its configuration(s) based on information (e.g., metadata or other data) included in hardware template 3514 of fig. 35, hardware template 3674 of fig. 36, and so forth.

At block 4308, the ML system configuration circuit 3600 determines a second block of hardware to execute a vector-vector workload. For example, ML hardware configuration circuit 3630 may identify a second core of the GPU (e.g., second block 404 of fig. 4) to perform the vector-vector computing operation(s). In some such examples, ML hardware configuration circuit 3630 may identify the second core and/or its configuration(s) based on information (e.g., metadata or other data) included in hardware template 3514 of fig. 35, hardware template 3674 of fig. 36, and so forth.

At block 4310, the ML system configuration circuit 3600 determines a third block of hardware to execute a matrix-vector workload. For example, ML hardware configuration circuit 3630 may identify a third core of the GPU (e.g., first block 402 of fig. 4) to perform the matrix-vector computing operation(s). In some such examples, ML hardware configuration circuit 3630 may identify the third core and/or its configuration(s) based on information (e.g., metadata or other data) included in hardware template 3514 of fig. 35, hardware template 3674 of fig. 36, and so forth.

At block 4312, the ML system configuration circuit 3600 identifies register file(s) to store the state of each of the first block, the second block, and/or the third block. For example, ML hardware configuration circuit 3630 may generate and/or otherwise identify a first register file (e.g., one of register files 406 of fig. 4) in which state(s) of hardware thread(s) corresponding to the first core may be stored. In some such examples, ML hardware configuration circuit 3630 may generate, identify, and/or otherwise instantiate a second register file corresponding to the second core and/or a third register file corresponding to the third core.

At block 4314, the ML system configuration circuit 3600 determines whether another type of hardware has been identified. For example, the ML hardware configuration circuit 3630 may determine that another type of hardware, such as a CPU, AI processor, FPGA, etc., is identified for use with the GPU. In some such examples, ML hardware configuration circuit 3630 may identify the number of instances of hardware (or portion(s) thereof) and/or the type thereof by searching hardware search space 3520. In some examples, ML hardware configuration circuit 3630 may determine that another instance of the GPU (or portion(s) thereof) may be utilized along with the GPU.

If at block 4314, the ML system configuration circuit 3600 determines that another type of hardware is identified, control returns to block 4306 to identify a first block of the identified hardware. If at block 4314, the ML system configuration circuit 3600 determines that another type of hardware has not been identified, then at block 4316, the ML system configuration circuit 3600 determines whether more than one hardware type and/or instance has been identified. For example, ML hardware configuration circuit 3630 may determine that only one hardware type and/or instance (e.g., a single GPU core, a single GPU, etc.) is identified. In some such examples, ML hardware configuration circuit 3630 may determine that a homogeneous ML compute node has been identified. In some examples, ML hardware configuration circuit 3630 may determine that more than one hardware instance and/or type (e.g., more than one GPU core, a GPU and an FPGA, at least one GPU core and at least one FPGA core, etc.) has been identified. In some such examples, ML hardware configuration circuit 3630 may determine that a heterogeneous ML compute node has been identified.

If at block 4316, the ML system configuration circuit 3600 determines that only one hardware type and/or instance has been identified, the example machine readable instructions of fig. 43 and/or the example operations 4300 end. For example, the machine-readable instructions and/or example operations 4300 of fig. 43 may return to block 4108 of the machine-readable instructions and/or example operations 4100 of fig. 41 to generate the evaluation parameters according to execution of the workload based on the first configuration and the second configuration.

If at block 4316, the ML system configuration circuit 3600 determines that more than one hardware type and/or instance has been identified, at block 4318, the ML system configuration circuit 3600 generates a topology based on the connection(s) of the hardware. For example, the ML hardware configuration circuit 3630 may analyze different ones of the interconnect topologies 3676 to identify connection(s) between a first hardware core (e.g., a first GPU core) and a second hardware core (e.g., a second GPU core). In some examples, ML hardware configuration circuit 3630 may analyze different topologies in interconnect topology 3676 to identify connection(s) between a first type of hardware (e.g., GPU) and a second type of hardware (e.g., AI processor). In some examples, ML hardware configuration circuit 3630 may couple the output(s) of the first hardware core and the second hardware core based on a topology included in interconnect topology 3676. In some examples, ML hardware configuration circuit 3630 may couple the output(s) of the first type of hardware and the second type of hardware based on a topology included in interconnect topology 3676.

In response to generating the topology based on the hardware connection(s) at block 4318, the example machine readable instructions of fig. 43 and/or the example operation 4300 end. For example, the machine-readable instructions and/or example operations 4300 of fig. 43 may return to block 4108 of the machine-readable instructions and/or example operations 4100 of fig. 41 to generate the evaluation parameters according to execution of the workload based on the first configuration and the second configuration.
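
The flow of fig. 43 may be easier to follow as code. The following Python is a minimal sketch under stated assumptions, not the disclosed implementation: `HardwareConfig`, `configure_hardware`, `configure_node`, the block and register file naming scheme, and the linear linking of instances are all hypothetical and are introduced here only to show the order of the decisions.

```python
from dataclasses import dataclass, field

# Workload kinds named at blocks 4306, 4308, and 4310.
WORKLOAD_KINDS = ("matrix-matrix", "vector-vector", "matrix-vector")


@dataclass
class HardwareConfig:
    """Hypothetical per-instance result of blocks 4306 through 4312."""
    hw_type: str
    block_for: dict = field(default_factory=dict)      # workload kind -> block id
    register_file: dict = field(default_factory=dict)  # block id -> register file id


def configure_hardware(hw_type, instance):
    """Pick one block per workload kind and one register file per block."""
    cfg = HardwareConfig(hw_type)
    for i, kind in enumerate(WORKLOAD_KINDS):
        block = f"{hw_type}{instance}.block{i}"
        cfg.block_for[kind] = block               # blocks 4306, 4308, 4310
        cfg.register_file[block] = f"{block}.rf"  # block 4312
    return cfg


def configure_node(hw_types):
    """Configure every identified hardware instance, then link them."""
    # Loop driven by block 4314: one pass per identified hardware instance.
    configs = [configure_hardware(t, n) for n, t in enumerate(hw_types)]
    # Blocks 4316 and 4318: links exist only if more than one instance
    # was identified. The linear chain is an illustrative assumption.
    links = [(f"{a.hw_type}{i}", f"{b.hw_type}{i + 1}")
             for i, (a, b) in enumerate(zip(configs, configs[1:]))]
    return configs, links


configs, links = configure_node(["GPU", "FPGA"])
print(configs[0].block_for["matrix-matrix"])  # GPU0.block0
print(links)                                  # [('GPU0', 'FPGA1')]
```

Calling `configure_node(["GPU"])` yields an empty list of links, which corresponds to the path from block 4316 directly to the end of the operations 4300.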

FIG. 44 is a flowchart representative of example machine readable instructions and/or example operations 4400 executable and/or instantiated by the processor circuit to adjust the first configuration based on the evaluation parameter. The example machine readable instructions and/or example operations 4400 of fig. 44 may be executed and/or instantiated by processor circuitry to implement block 4114 of the example machine readable instructions and/or example operations 4100 of fig. 41. The example machine readable instructions and/or example operations 4400 of fig. 44 begin at block 4402 where the ML system configuration circuit 3600 determines whether to replace a first ML model with a different ML model. For example, the ML software configuration circuit 3620 (fig. 36) can determine that the proposed HW/SW instance 3522 of fig. 35 includes a first AI/ML model, e.g., CNN. In some such examples, ML software configuration circuit 3620 may determine that the CNN model is to be replaced with a DNN model.

If at block 4402, the ML system configuration circuit 3600 determines not to replace the first ML model with a different ML model, control proceeds to block 4408. If at block 4402 the ML system configuration circuit 3600 determines that the first ML model is to be replaced with a different ML model, then at block 4404 the ML system configuration circuit 3600 identifies a second ML model in the configuration database. For example, ML software configuration circuit 3620 can identify DNNs in software templates 3512 of configurable building block database 3510.

At block 4406, the ML system configuration circuit 3600 generates a new configuration based on replacing the first ML model with the second ML model. For example, ML software configuration circuit 3620 may generate a new or updated configuration of software in the proposed HW/SW instance 3522 by replacing CNN with DNN.

At block 4408, the ML system configuration circuit 3600 determines whether to add a second ML model to the configuration. For example, the ML software configuration circuit 3620 may determine to add the DNN to the configuration of the software along with the CNN and/or a different AI/ML model.

If at block 4408, the ML system configuration circuit 3600 determines not to add the second ML model to the configuration, the example machine readable instructions of fig. 44 and/or the example operation 4400 ends. For example, the machine-readable instructions of fig. 44 and/or the example operations 4400 may return to block 4116 of the machine-readable instructions of fig. 41 and/or the example operations 4100 to adjust the second configuration based on the evaluation parameters.

If at block 4408 the ML system configuration circuit 3600 determines to add the second ML model to the configuration, at block 4410 the ML system configuration circuit 3600 determines one or more first layers of the first ML model to execute the first portion of the workload. For example, in a configuration including CNN and DNN, ML software configuration circuit 3620 may identify and/or otherwise determine one or more first layers of CNN to execute a first portion of workload(s) 3516.

At block 4412, the ML system configuration circuit 3600 identifies a second ML model in the configuration database. For example, ML software configuration circuit 3620 can identify DNNs in software templates 3512 of configurable building block database 3510.

At block 4414, the ML system configuration circuit 3600 determines one or more second layers of a second ML model to execute a second portion of the workload. For example, in a configuration including CNN and DNN, ML software configuration circuit 3620 may identify and/or otherwise determine one or more second layers of DNN to execute the second portion of workload(s) 3516.

At block 4416, the ML system configuration circuit 3600 determines a new configuration based on the topology of the one or more first layers and the one or more second layers. For example, ML software configuration circuit 3620 may determine to couple output(s) of the CNN to input(s) of the DNN (or vice versa) based on the topology included in interconnect topology 3676.

In response to determining a new configuration based on the topology of the one or more first layers and the one or more second layers at block 4416, the example machine readable instructions of fig. 44 and/or the example operations 4400 end. For example, the machine-readable instructions of fig. 44 and/or the example operations 4400 may return to block 4116 of the machine-readable instructions of fig. 41 and/or the example operations 4100 to adjust the second configuration based on the evaluation parameters.
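
The two adjustment paths of fig. 44 (replace the first model, or add a second model alongside it) can be sketched as follows. This is a hypothetical illustration: `adjust_software_config` and its parameters are not named in the disclosure, the configuration is reduced to a non-empty list of model names, and the coupling is assumed to be a simple chain.

```python
def adjust_software_config(config, replace_with=None, add=None):
    """Return (new_config, topology) for a software configuration.

    config:       non-empty list of model names; config[0] is the
                  "first ML model"
    replace_with: model substituted for the first model
                  (blocks 4402 through 4406)
    add:          model appended and coupled after the existing ones
                  (blocks 4408 through 4416)
    """
    new_config = list(config)  # leave the caller's configuration untouched
    if replace_with is not None:
        new_config[0] = replace_with
    if add is not None:
        new_config.append(add)
    # Couple output(s) of each model to input(s) of the next one.
    topology = list(zip(new_config, new_config[1:]))
    return new_config, topology


print(adjust_software_config(["CNN"], replace_with="DNN"))
# (['DNN'], [])
print(adjust_software_config(["CNN"], add="DNN"))
# (['CNN', 'DNN'], [('CNN', 'DNN')])
```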

FIG. 45 is a flowchart representative of example machine readable instructions and/or example operations 4500 that may be executed and/or instantiated by the processor circuit to adjust the second configuration based on the evaluation parameter. The example machine readable instructions and/or example operations 4500 of fig. 45 may be executed and/or instantiated by the processor circuit to implement block 4116 of the example machine readable instructions and/or example operations 4100 of fig. 41. The example machine readable instructions and/or example operations 4500 of fig. 45 begin at block 4502, where ML system configuration circuit 3600 determines whether to replace a first hardware with a different hardware. For example, ML hardware configuration circuit 3630 (fig. 36) may determine that proposed HW/SW instance 3522 of fig. 35 includes first hardware, e.g., GPU. In some such examples, ML hardware configuration circuit 3630 may determine that the GPU is to be replaced with an FPGA.

If at block 4502, the ML system configuration circuit 3600 determines not to replace the first hardware with a different hardware, control proceeds to block 4508. If at block 4502, the ML system configuration circuit 3600 determines that the first hardware is to be replaced with a different hardware, then at block 4504, the ML system configuration circuit 3600 identifies the second hardware in the configuration database. For example, ML hardware configuration circuit 3630 may identify the FPGA in hardware template 3514 of configurable building block database 3510.

At block 4506, the ML system configuration circuit 3600 generates a new configuration based on replacing the first hardware with the second hardware. For example, ML hardware configuration circuit 3630 may generate a new or updated configuration of hardware in the proposed HW/SW instance 3522 by replacing the GPU with an FPGA.

At block 4508, the ML system configuration circuit 3600 determines whether to add second hardware to the configuration. For example, ML hardware configuration circuit 3630 may determine to add the FPGA to the configuration of the hardware along with the GPU and/or different hardware (e.g., an AI processor).

If at block 4508, the ML system configuration circuit 3600 determines not to add the second hardware to the configuration, the example machine readable instructions of fig. 45 and/or the example operations 4500 end. For example, the machine-readable instructions and/or example operations 4500 of fig. 45 may return to block 4118 of the machine-readable instructions and/or example operations 4100 of fig. 41 to execute one or more ML models based on the first configuration on the hardware in the second configuration.

If at block 4508, the ML system configuration circuit 3600 determines to add the second hardware to the configuration, then at block 4510, the ML system configuration circuit 3600 determines one or more first portions of the first hardware to execute the first portion of the workload. For example, in a configuration including a GPU and an FPGA, ML hardware configuration circuit 3630 may identify and/or otherwise determine one or more first cores of the GPU to execute a first portion of workload(s) 3516.

At block 4512, the ML system configuration circuit 3600 identifies the second hardware in the configuration database. For example, ML hardware configuration circuit 3630 may identify the FPGA in hardware template 3514 of configurable building block database 3510.

At block 4514, the ML system configuration circuit 3600 determines one or more second portions of the second hardware to execute the second portion of the workload. For example, in a configuration including a GPU and an FPGA, ML hardware configuration circuit 3630 may identify and/or otherwise determine one or more second cores of the FPGA to execute the second portion of workload(s) 3516.

At block 4516, the ML system configuration circuit 3600 determines a new configuration based on the topology of the one or more first portions and the one or more second portions. For example, ML hardware configuration circuit 3630 may determine to couple the output(s) of the GPU to the input(s) of the FPGA (or the output(s) of the FPGA to the input(s) of the GPU) based on the topology included in interconnect topology 3676.

In response to determining a new configuration based on the topology of the one or more first portions and the one or more second portions at block 4516, the example machine readable instructions and/or example operations 4500 of fig. 45 end. For example, the machine-readable instructions and/or example operations 4500 of fig. 45 may return to block 4118 of the machine-readable instructions and/or example operations 4100 of fig. 41 to execute one or more ML models based on the first configuration on the hardware in the second configuration.
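
Blocks 4510 through 4514 divide a workload between portions of the first hardware and portions of the second hardware. The sketch below shows one hypothetical way such a split could be expressed; the function `partition_workload`, the operation names, and the routing policy keyed on workload kind are assumptions made for illustration and are not described in the disclosure.

```python
def partition_workload(ops, first_hw, second_hw, first_kinds):
    """Assign each operation of a workload to the first or second hardware.

    ops:         list of (op_name, kind) tuples making up the workload
    first_kinds: set of kinds routed to the first hardware; every other
                 kind goes to the second hardware (hypothetical policy)
    """
    plan = {first_hw: [], second_hw: []}
    for name, kind in ops:
        target = first_hw if kind in first_kinds else second_hw
        plan[target].append(name)
    return plan


ops = [("conv1", "matrix-matrix"), ("dot", "vector-vector"), ("fc", "matrix-vector")]
plan = partition_workload(ops, "GPU", "FPGA", {"matrix-matrix", "matrix-vector"})
print(plan)  # {'GPU': ['conv1', 'fc'], 'FPGA': ['dot']}
```

The resulting plan is the kind of input that block 4516 would combine with a topology from interconnect topology 3676 to decide how the two portions exchange data.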

FIG. 46 is a flowchart representative of example machine readable instructions and/or example operations 4600 executable and/or instantiated by the processor circuit to deploy computing nodes to perform machine learning workloads. The example machine readable instructions and/or example operations 4600 of fig. 46 begin at block 4602 where ML system configuration circuitry 3600 receives a request for a Machine Learning (ML) model and corresponding hardware to execute an ML workload. For example, interface circuit 3610 (fig. 36) may receive a request identifying a combination of hardware and/or software to execute workload(s) 3516 of fig. 35. In some such examples, a combination of hardware and/or software may be implemented by software 3519, hardware 3521, and/or more generally ML computing node 3517 of fig. 35.

At block 4604, the ML system configuration circuit 3600 generates a software search space and a hardware search space based on at least one of the request or the historical configuration. For example, ML software configuration circuit 3620 can generate software search space 3518 of fig. 35 based on workload(s) 3516, historical configurations of ML computing nodes that can be stored in ontology database 3508 of fig. 35, historical configurations 3678 of fig. 36, and/or the like, and/or any combination(s) of these. In some examples, ML hardware configuration circuit 3630 can generate hardware search space 3520 of fig. 35 based on workload(s) 3516, historical configurations of ML computing nodes that can be stored in ontology database 3508 of fig. 35, historical configurations 3678 of fig. 36, and/or the like, and/or any combination(s) of these.

At block 4606, ML system configuration circuit 3600 selects a configuration of ML model(s) and corresponding hardware for the compute node based on at least one of the software search space or the hardware search space. For example, ML software configuration circuit 3620 and/or ML hardware configuration circuit 3630 can generate proposed HW/SW instance 3522 of fig. 35 based on one or more AI/ML models from software search space 3518 and hardware from hardware search space 3520.

At block 4608, ML system configuration circuit 3600 selects a topology for the configuration of ML model(s) and corresponding hardware of the compute node. For example, the ML software configuration circuit 3620 can couple together one or more ML models of the proposed HW/SW instance 3522. In some examples, ML hardware configuration circuit 3630 may couple the hardware of the proposed HW/SW instance 3522 together.

At block 4610, the ML system configuration circuit 3600 outputs evaluation parameters associated with the configuration. For example, configuration evaluation circuit 3640 (fig. 36) may determine evaluation parameters 3526 based on performance modeling 3524 of the proposed HW/SW instance 3522.

At block 4612, the ML system configuration circuit 3600 determines whether one (or more) of the evaluation parameters satisfy respective thresholds. For example, the configuration evaluation circuit 3640 may determine whether a first value of the accuracy parameter satisfies an accuracy threshold, whether a second value of the latency parameter satisfies a latency threshold, and/or the like, and/or any combination(s) of these.

If at block 4612, the ML system configuration circuit 3600 determines that one (or more) of the evaluation parameters do not meet the respective threshold(s), control returns to block 4606, otherwise at block 4614, the ML system configuration circuit 3600 deploys the compute node to execute the ML workload. For example, workload execution circuit 3660 (fig. 36) may deploy ML compute node 3517 to execute workload(s) 3516. In some such examples, workload execution circuitry 3660 may compile and/or otherwise provide ML computing node 3517 as an executable construct that, when executed and/or instantiated, may execute workload(s) 3516. In response to deploying the computing node to execute the ML workload at block 4614, the example machine readable instructions of fig. 46 and/or the example operations 4600 end.
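
The select, evaluate, and deploy loop of blocks 4606 through 4614 can be sketched as a small search routine. Everything named below is a hypothetical stand-in: `search_and_deploy` enumerates candidate pairs exhaustively (the disclosure does not state how the next configuration is selected), and `toy_evaluate` returns made-up numbers in place of performance modeling 3524.

```python
import itertools


def search_and_deploy(sw_space, hw_space, evaluate, thresholds, max_iters=100):
    """Try (software, hardware) candidates until one satisfies the thresholds.

    evaluate:   callable returning {"accuracy": ..., "latency_ms": ...}
    thresholds: dict with "min_accuracy" and "max_latency_ms"
    Returns the first passing (software, hardware, params), or None.
    """
    candidates = itertools.product(sw_space, hw_space)  # block 4606
    for sw, hw in itertools.islice(candidates, max_iters):
        params = evaluate(sw, hw)                       # block 4610
        meets = (params["accuracy"] >= thresholds["min_accuracy"]
                 and params["latency_ms"] <= thresholds["max_latency_ms"])
        if meets:                                       # block 4612
            return sw, hw, params                       # block 4614: deploy
    return None


def toy_evaluate(sw, hw):
    """Made-up evaluation parameters, for illustration only."""
    accuracy = {"CNN": 0.90, "transformer": 0.95}[sw]
    base_latency = {"GPU": 12.0, "FPGA": 8.0}[hw]
    scale = 2.0 if sw == "transformer" else 1.0
    return {"accuracy": accuracy, "latency_ms": base_latency * scale}


result = search_and_deploy(
    ["CNN", "transformer"], ["GPU", "FPGA"], toy_evaluate,
    {"min_accuracy": 0.93, "max_latency_ms": 20.0},
)
print(result)  # ('transformer', 'FPGA', {'accuracy': 0.95, 'latency_ms': 16.0})
```

In this toy run the two CNN candidates fail the accuracy threshold and the transformer on the GPU fails the latency threshold, so control conceptually returns to block 4606 three times before the fourth candidate is deployed.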

Fig. 47 is a block diagram of an example processor platform 4700 that is configured to execute and/or instantiate the machine readable instructions and/or operations of figs. 41-46 to implement the ML system configurator 3402 of fig. 34 and/or 35 and/or the ML system configuration circuit 3600 of fig. 36. The processor platform 4700 may be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cellular telephone, a smart phone, a tablet device such as an iPad™), a personal digital assistant, a headset (e.g., an Augmented Reality (AR) headset, a Virtual Reality (VR) headset, etc.) or other wearable device, or any other type of computing device.

The processor platform 4700 of the illustrated example includes processor circuit 4712. The processor circuit 4712 of the illustrated example is hardware. For example, the processor circuit 4712 may be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The processor circuit 4712 may be implemented by one or more semiconductor-based (e.g., silicon-based) devices. In this example, the processor circuit 4712 implements the ML software configuration circuit 3620 (identified by "ML SW configuration circuit"), the ML hardware configuration circuit 3630 (identified by "ML HW configuration circuit"), the configuration evaluation circuit 3640 (identified by "configuration evaluation circuit"), the ontology generating circuit 3650 (identified by "ontology generating circuit"), and the workload executing circuit 3660 (identified by "workload executing circuit") of fig. 36.

The processor circuit 4712 of the illustrated example includes local memory 4713 (e.g., cache, registers, etc.). The processor circuit 4712 of the illustrated example communicates with a main memory including a volatile memory 4714 and a non-volatile memory 4716 over a bus 4718. In some examples, bus 4718 implements bus 3680 of fig. 36. The volatile memory 4714 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), and/or any other type of RAM device. The non-volatile memory 4716 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 4714, 4716 of the illustrated example is controlled by a memory controller 4717.

The processor platform 4700 of the illustrated example also includes interface circuit 4720. In this example, interface circuit 4720 implements interface circuit 3610 of fig. 36. The interface circuit 4720 may be implemented in hardware in accordance with any type of interface standard, such as an Ethernet interface, a Universal Serial Bus (USB) interface, a Near Field Communication (NFC) interface, a Peripheral Component Interconnect (PCI) interface, and/or a Peripheral Component Interconnect Express (PCIe) interface.

In the illustrated example, one or more input devices 4722 are connected to the interface circuit 4720. Input device(s) 4722 allow a user to input data and/or commands into processor circuit 4712. The input device(s) 4722 may be implemented by, for example, an audio sensor, microphone, camera (still or video), keyboard, buttons, mouse, touch screen, touch pad, trackball, isopoint device, and/or voice recognition system.

One or more output devices 4724 are also connected to the interface circuit 4720 of the illustrated example. The output device(s) 4724 may be implemented, for example, by a display device (e.g., a Light Emitting Diode (LED) display, an Organic Light Emitting Diode (OLED) display, a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT) display, an In-Plane Switching (IPS) display, a touch screen, etc.), a haptic output device, a printer, and/or speakers. The interface circuit 4720 of the illustrated example thus typically includes a graphics driver card, a graphics driver chip, and/or a graphics processor circuit, such as a GPU.

The interface circuit 4720 of the illustrated example also includes communication devices, such as transmitters, receivers, transceivers, modems, residential gateways, wireless access points, and/or network interfaces, to facilitate the exchange of data with external machines (e.g., any kind of computing device) via the network 4726. The communication may be through, for example, an Ethernet connection, a Digital Subscriber Line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, an optical connection, etc.

The processor platform 4700 of the illustrated example also includes one or more mass storage devices 4728 to store software and/or data. In this example, the one or more mass storage devices 4728 implement a data store 3670, a software template 3672 (identified by "SW template"), a hardware template 3674 (identified by "HW template"), an interconnect topology 3676 (identified by "interconnect topology"), and a historical configuration 3678 (identified by "historical configuration"). Examples of such mass storage devices 4728 include magnetic storage devices, optical storage devices, floppy disk drives, HDDs, CD drives, Blu-ray disc drives, Redundant Array of Independent Disks (RAID) systems, solid state storage devices (such as flash memory devices and/or SSDs), and DVD drives.

The machine-executable instructions 4732, which may be implemented by the machine-readable instructions of figs. 41-46, may be stored in the mass storage device 4728, in the volatile memory 4714, in the non-volatile memory 4716, and/or on a removable non-transitory computer-readable storage medium such as a CD or DVD.

The processor platform 4700 of the illustrated example of fig. 47 includes an example acceleration circuit 4734 that includes an example GPU 4740, an example Visual Processing Unit (VPU) 4742, and an example neural network processor 4744. Additionally and/or alternatively, the acceleration circuit 4734 may include any other type of hardware, such as a CPU, an FPGA, an ASIC, etc. In this example, the GPU 4740, VPU 4742, and neural network processor 4744 communicate with different hardware of the processor platform 4700, such as volatile memory 4714, non-volatile memory 4716, and the like, via a bus 4718. In this example, the neural network processor 4744 may be implemented with one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer, which may be used to execute AI models, such as neural networks. In some examples, one or more of ML software configuration circuit 3620, ML hardware configuration circuit 3630, configuration evaluation circuit 3640, ontology generation circuit 3650, and/or workload execution circuit 3660 may be implemented in or with at least one of GPU 4740, VPU 4742, or neural network processor 4744, instead of or in addition to the processor circuit 4712.

From the foregoing, it will be apparent that example systems, methods, apparatus, and articles of manufacture have been disclosed for a configurable machine learning computing node. The disclosed systems, methods, apparatus, and articles of manufacture improve the efficiency of using a computing device by identifying and/or generating improved and/or otherwise optimal combinations of hardware and/or software to implement AI/ML workloads. The disclosed systems, methods, apparatus, and articles of manufacture include an expressive search space representation that covers multiple templates of hardware and software architectures. These templates may be dynamically modifiable during the HW/SW co-design search. Advantageously, the expressive search space enables the HW/SW co-design system to explore a much larger, richer HW/SW design space that spans multiple architectural styles. One (or more) of the architectural styles may be flexible in their respective modules and connectivity sets (e.g., selection and/or configuration of connections, topologies, inputs/outputs, etc.). The collection of modules and connectivity may be formed by configurable building blocks. Advantageously, the disclosed systems, methods, apparatus, and articles of manufacture increase the likelihood of discovering more efficient hardware architecture instances and their corresponding co-designed software as compared to previous AutoML methods, because the examples disclosed herein provide much larger HW/SW search space(s) and configurable version(s) thereof. The disclosed systems, methods, apparatus, and articles of manufacture are thus directed to one or more improvements in the operation of machines such as computers or other electronic and/or mechanical devices.

Fig. 48 is a block diagram of an example implementation of processor circuit 1612 of fig. 16, processor circuit 2112 of fig. 21, processor circuit 2612 of fig. 26, processor circuit 3312 of fig. 33, and/or processor circuit 4712 of fig. 47. In this example, processor circuit 1612 of fig. 16, processor circuit 2112 of fig. 21, processor circuit 2612 of fig. 26, processor circuit 3312 of fig. 33, and/or processor circuit 4712 of fig. 47 are implemented by general-purpose microprocessor 4800. The general-purpose microprocessor 4800 executes some or all of the machine-readable instructions of the flowcharts disclosed herein to effectively instantiate the logic circuitry to perform the operations corresponding to these machine-readable instructions. For example, microprocessor 4800 may implement multi-core hardware circuitry, such as a CPU, a DSP, a GPU, an XPU, and so forth. Although it may include any number of example cores 4802 (e.g., 1 core), the microprocessor 4800 of this example is a multi-core semiconductor device including N cores. The cores 4802 of the microprocessor 4800 can operate independently or can cooperate to execute machine-readable instructions. For example, machine code corresponding to a firmware program, an embedded software program, or a software program may be executed by one of cores 4802, or may be executed by multiple ones of cores 4802 at the same or different times. In some examples, machine code corresponding to a firmware program, an embedded software program, or a software program is partitioned into threads and executed in parallel by two or more of cores 4802. The software program may correspond to a portion or all of the machine readable instructions and/or operations represented by one or more of the flowcharts disclosed herein.
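
The idea of partitioning a program into threads that are handed to several cores can be illustrated at the software level. The sketch below is an assumption-laden illustration rather than a model of microprocessor 4800: `run_partitioned` is a hypothetical helper, worker threads stand in for cores 4802, and in CPython the global interpreter lock means these threads show the partitioning pattern but do not necessarily execute on separate cores at the same time.

```python
from concurrent.futures import ThreadPoolExecutor


def run_partitioned(data, num_workers):
    """Split `data` into contiguous slices, sum each slice on its own
    worker thread, and combine the partial results."""
    # Ceiling division, with a floor of 1 so an empty input is handled.
    chunk = max(1, -(-len(data) // num_workers))
    slices = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(sum, slices))
    return sum(partials)


print(run_partitioned(list(range(1000)), num_workers=4))  # 499500
```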

The cores 4802 may communicate over a first example bus 4804. In some examples, first bus 4804 may implement a communication bus to enable communication associated with one (or more) of cores 4802. For example, first bus 4804 may implement at least one of an inter-integrated circuit (I2C) bus, a Serial Peripheral Interface (SPI) bus, a PCI bus, or a PCIe bus. Additionally or alternatively, the first bus 4804 may implement any other type of computing or electrical bus. The cores 4802 may obtain data, instructions, and/or signals from one or more external devices via example interface circuitry 4806. The cores 4802 may output data, instructions, and/or signals to one or more external devices via the interface circuitry 4806. While each core 4802 of this example includes an example local memory 4820 (e.g., a level 1 (L1) cache that may be partitioned into an L1 data cache and an L1 instruction cache), the microprocessor 4800 also includes an example shared memory 4810 (e.g., a level 2 (L2) cache) that may be shared by the cores for high-speed access to data and/or instructions. Data and/or instructions may be transferred (e.g., shared) by writing to shared memory 4810 and/or reading from shared memory 4810. The local memory 4820 of each core 4802 and the shared memory 4810 may be part of a hierarchy of storage devices including multi-level cache memory and main memory (e.g., main memory of one or more of fig. 16, 21, 26, 33, and 47). In general, higher level memory in the hierarchy exhibits lower access times and has less storage capacity than lower level memory. Changes to the various levels of the cache hierarchy are managed (e.g., coordinated) by a cache coherency policy.
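
The relationship between a per-core local memory, a shared memory, and main memory can be sketched with a toy model. Everything in the following block is an assumption made for illustration: the class name `MemoryHierarchy`, the cycle costs, and the fill-on-read behavior are hypothetical, and the model has no capacity limits and no coherency policy.

```python
class MemoryHierarchy:
    """Toy model: one local L1 per core, one shared L2, and main memory."""

    # Hypothetical access costs in cycles; real values depend on the device.
    L1_COST, L2_COST, MAIN_COST = 1, 10, 100

    def __init__(self, num_cores, main_memory):
        self.l1 = [dict() for _ in range(num_cores)]  # local to each core
        self.l2 = {}                                   # shared by all cores
        self.main = dict(main_memory)

    def read(self, core, address):
        """Return (value, cost) and fill the caches on the way back."""
        if address in self.l1[core]:
            return self.l1[core][address], self.L1_COST
        if address in self.l2:
            value, cost = self.l2[address], self.L2_COST
        else:
            value, cost = self.main[address], self.MAIN_COST
            self.l2[address] = value
        self.l1[core][address] = value
        return value, cost


mem = MemoryHierarchy(num_cores=2, main_memory={0x10: 42})
first = mem.read(0, 0x10)   # served from main memory
second = mem.read(0, 0x10)  # served from core 0's local memory
third = mem.read(1, 0x10)   # served from the shared memory
```

In this sketch a value read by one core becomes visible to another core through the shared level, which mirrors the statement that data may be transferred by writing to and reading from the shared memory.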

Each core 4802 may be referred to as a CPU, DSP, GPU, or the like, or any other type of hardware circuitry. Each core 4802 includes a control unit circuit 4814, an arithmetic and logic (AL) circuit (sometimes referred to as an ALU) 4816, a plurality of registers 4818, an L1 cache 4820, and a second example bus 4822. Other structures may also be present. For example, each core 4802 may include vector unit circuitry, single instruction multiple data (SIMD) unit circuitry, load/store unit (LSU) circuitry, branch/jump unit circuitry, floating-point unit (FPU) circuitry, and so forth. The control unit circuitry 4814 includes semiconductor-based circuitry configured to control (e.g., coordinate) movement of data within the respective core 4802. The AL circuit 4816 includes semiconductor-based circuitry configured to perform one or more mathematical and/or logical operations on data within the respective core 4802. The AL circuit 4816 in some examples performs integer-based operations. In other examples, AL circuit 4816 also performs floating point operations. In still other examples, the AL circuit 4816 may include a first AL circuit that performs integer-based operations and a second AL circuit that performs floating point operations. In some examples, the AL circuit 4816 may be referred to as an arithmetic logic unit (ALU). The registers 4818 are semiconductor-based structures for storing data and/or instructions, e.g., the results of one or more operations performed by the AL circuitry 4816 of the respective core 4802. For example, registers 4818 may include vector register(s), SIMD register(s), general purpose register(s), flag register(s), segment register(s), machine-specific register(s), instruction pointer register(s), control register(s), debug register(s), memory management register(s), machine check register(s), and so forth. The registers 4818 may be arranged as banks as shown in fig. 48. Alternatively, registers 4818 may be organized in any other arrangement, format, or structure, including distributed throughout core 4802 to reduce access time. The second bus 4822 may implement at least one of an I2C bus, an SPI bus, a PCI bus, or a PCIe bus.

Each core 4802 and/or, more generally, microprocessor 4800 can include additional and/or alternative structures to those shown and described above. For example, there may be one or more clock circuits, one or more power supplies, one or more power gates, one or more cache home agents (CHAs), one or more converged/common mesh stops (CMSs), one or more shifters (e.g., barrel shifter(s)), and/or other circuitry. Microprocessor 4800 is a semiconductor device that is fabricated to include a number of interconnected transistors to implement the structure described above in one or more Integrated Circuits (ICs) contained within one or more packages. The processor circuit may include and/or cooperate with one or more accelerators. In some examples, the accelerator is implemented by logic circuitry to perform certain tasks faster and/or more efficiently than a general purpose processor. Examples of accelerators include ASICs and FPGAs, such as those discussed herein. The GPU or other programmable device may also be an accelerator. The accelerator may be on-board the processor circuit, in the same chip package as the processor circuit, and/or in one or more packages separate from the processor circuit.

Fig. 49 is a block diagram of another example implementation of processor circuit 1612 of fig. 16, processor circuit 2112 of fig. 21, processor circuit 2612 of fig. 26, processor circuit 312 of fig. 33, and/or processor circuit 4712 of fig. 47. In this example, processor circuit 1612 of fig. 16, processor circuit 2112 of fig. 21, processor circuit 2612 of fig. 26, processor circuit 312 of fig. 33, and/or processor circuit 4712 of fig. 47 are implemented by FPGA circuit 4900. For example, the FPGA circuitry 4900 may be used to perform operations that may otherwise be performed by the example microprocessor 4800 of fig. 48 executing corresponding machine-readable instructions. Once configured, however, FPGA circuitry 4900 instantiates the machine-readable instructions in hardware, so that the operations are often performed faster than they would be by a general purpose microprocessor executing the corresponding software.

More specifically, in contrast to the microprocessor 4800 of fig. 48 described above (which is a general purpose device that may be programmed to execute some or all of the machine readable instructions represented by the flowcharts disclosed herein, but whose interconnections and logic circuitry are fixed once manufactured), the FPGA circuit 4900 of the example of fig. 49 includes interconnections and logic circuitry that may be configured and/or interconnected differently after manufacture to instantiate some or all of the machine readable instructions represented, for example, by the flowcharts disclosed herein. In particular, FPGA 4900 may be considered an array of logic gates, interconnects, and switches. The switches can be programmed to change the manner in which the logic gates are interconnected, effectively forming one or more dedicated logic circuits (unless and until FPGA circuit 4900 is reprogrammed). The logic circuits are configured such that the logic gates can cooperate in different ways to perform different operations on data received by the input circuit. These operations may correspond to a part or all of the software represented by the flowcharts disclosed herein. Accordingly, FPGA circuitry 4900 can be configured to effectively instantiate a portion or all of the machine-readable instructions of the flowcharts disclosed herein as dedicated logic circuitry to perform operations corresponding to these software instructions in a manner analogous to that of an ASIC. Accordingly, the FPGA circuit 4900 may execute operations corresponding to some or all of the machine-readable instructions disclosed herein faster than the general-purpose microprocessor can execute such instructions.

In the example of fig. 49, FPGA circuit 4900 is structured to be programmed (and/or reprogrammed one or more times) by an end user via a hardware description language (HDL) (e.g., Verilog). FPGA circuit 4900 of fig. 49 includes example input/output (I/O) circuitry 4902 to obtain and/or output data from/to example configuration circuitry 4904 and/or external hardware (e.g., external hardware circuitry) 4906. For example, the configuration circuit 4904 may implement interface circuitry that may obtain machine-readable instructions to configure the FPGA circuit 4900, or portion(s) thereof. In some such examples, the configuration circuit 4904 may obtain machine-readable instructions from a user, a machine (e.g., a hardware circuit (e.g., a programmed or dedicated circuit) that may implement an artificial intelligence/machine learning (AI/ML) model to generate instructions), and so forth. In some examples, external hardware 4906 may implement microprocessor 4800 of fig. 48. FPGA circuit 4900 also includes an array of example logic gates 4908, a plurality of example configurable interconnects 4910, and example storage circuitry 4912. Logic gates 4908 and interconnects 4910 may be configured to instantiate one or more operations corresponding to at least some of the machine readable instructions of fig. 8-13, and/or other desired operations. Logic gates 4908 shown in fig. 49 are fabricated in groups or blocks. Each block includes semiconductor-based electrical structures that may be configured as logic circuits. In some examples, the electrical structures include logic gates (e.g., AND gates, OR gates, NOR gates, etc.) that provide basic building blocks for logic circuitry. Within each logic gate circuit 4908 there are electrically controllable switches (e.g., transistors) so that electrical structures and/or logic gates can be configured to form a circuit to perform a desired operation. Logic gates 4908 may include other electrical structures such as look-up tables (LUTs), registers (e.g., flip-flops or latches), multiplexers, and the like.

The interconnects 4910 of the illustrated example are conductive pathways, traces, vias, or the like, which may include electrically controllable switches (e.g., transistors) whose state may be changed by programming (e.g., using an HDL instruction language) to activate or deactivate one or more connections between one or more logic gates 4908 to program a desired logic circuit.

The storage circuit 4912 of the illustrated example is structured to store the result(s) of one or more operations performed by the respective logic gates. The storage circuit 4912 may be implemented by a register or the like. In the illustrated example, the storage circuitry 4912 is distributed among the logic gates 4908 to facilitate access and to increase execution speed.
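
The idea that configurable logic can be programmed, and later reprogrammed, to form different dedicated circuits can be sketched in software with a look-up table model. This is only an analogy: the class `LookUpTable`, the four-bit truth tables, and the `half_adder` wiring are hypothetical names chosen for this sketch, and a real FPGA is configured through an HDL toolchain rather than through code such as this.

```python
class LookUpTable:
    """A 2-input look-up table whose four configuration bits define its function."""

    def __init__(self):
        self.config = [0, 0, 0, 0]  # outputs for inputs (0,0), (0,1), (1,0), (1,1)
        self.register = 0           # stores the most recent result

    def program(self, truth_table):
        """Load a new configuration; the same structure then computes a new function."""
        if len(truth_table) != 4 or any(bit not in (0, 1) for bit in truth_table):
            raise ValueError("expected four bits")
        self.config = list(truth_table)

    def evaluate(self, a, b):
        """Select the configured output for inputs a and b and latch it."""
        self.register = self.config[(a << 1) | b]
        return self.register


AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]


def half_adder(a, b):
    """Connect two programmed tables: XOR produces the sum, AND the carry."""
    sum_lut, carry_lut = LookUpTable(), LookUpTable()
    sum_lut.program(XOR)
    carry_lut.program(AND)
    return sum_lut.evaluate(a, b), carry_lut.evaluate(a, b)
```

Calling `program` again with a different truth table changes what the same table computes, which loosely corresponds to reprogramming the switches, while the `register` attribute plays the role of storage distributed alongside the logic.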

The example FPGA circuit 4900 of fig. 49 also includes example dedicated operating circuitry 4914. In this example, dedicated operating circuit 4914 includes dedicated circuitry 4916 that can be invoked to implement commonly used functions to avoid the need to program these functions in the field. Examples of such dedicated circuitry 4916 include memory (e.g., DRAM) controller circuitry, PCIe controller circuitry, clock circuitry, transceiver circuitry, memory, and multiplier-accumulator circuitry. Other types of dedicated circuitry may also be present. In some examples, the FPGA circuit 4900 may also include example general-purpose programmable circuitry 4918, such as example CPU 4920 and/or example DSP 4922. Other general purpose programmable circuitry 4918 may additionally or alternatively be present, such as a GPU, XPU, etc., which may be programmed to perform other operations.

While fig. 48 and 49 illustrate two example implementations of the processor circuit 1612 of fig. 16, the processor circuit 2112 of fig. 21, the processor circuit 2612 of fig. 26, the processor circuit 312 of fig. 33, and/or the processor circuit 4712 of fig. 47, many other approaches are contemplated. For example, as described above, modern FPGA circuitry may include an on-board CPU, such as one or more of the example CPUs 4920 of fig. 49. Thus, the processor circuit 1612 of fig. 16, the processor circuit 2112 of fig. 21, the processor circuit 2612 of fig. 26, the processor circuit 312 of fig. 33, and/or the processor circuit 4712 of fig. 47 may also be implemented by combining the example microprocessor 4800 of fig. 48 and the example FPGA circuit 4900 of fig. 49. In some such hybrid examples, a first portion of the machine-readable instructions represented by the flowcharts of fig. 8-13 may be executed by the one or more cores 4802 of fig. 48, a second portion of the machine-readable instructions represented by the flowcharts of fig. 8-13 may be executed by the FPGA circuit 4900 of fig. 49, and/or a third portion of the machine-readable instructions represented by the flowcharts disclosed herein may be executed by the ASIC. Some or all of the circuitry may be instantiated, for example, in one or more threads executing concurrently and/or serially.

In some examples, the processor circuit 1612 of fig. 16, the processor circuit 2112 of fig. 21, the processor circuit 2612 of fig. 26, the processor circuit 312 of fig. 33, and/or the processor circuit 4712 of fig. 47 may be in one or more packages. For example, processor circuit 4800 of fig. 48 and/or FPGA circuit 4900 of fig. 49 can be in one or more packages. In some examples, the XPU may be implemented by the processor circuit 1612 of fig. 16, the processor circuit 2112 of fig. 21, the processor circuit 2612 of fig. 26, the processor circuit 312 of fig. 33, and/or the processor circuit 4712 of fig. 47, which may be in one or more packages. For example, an XPU may include a CPU in one package, a DSP in another package, a GPU in another package, and an FPGA in another package.

Fig. 50 is a block diagram illustrating an example software distribution platform 5005 for distributing software, such as example machine readable instructions 1632 or machine readable instructions of one or more of fig. 16, 21, 26, 33, and/or 47, to hardware devices owned and/or operated by third parties. The example software distribution platform 5005 may be implemented by any computer server, data facility, cloud service, etc. capable of storing and transmitting software to other computing devices. The third party may be a customer of the entity owning and/or operating the software distribution platform 5005. For example, the entity that owns and/or operates the software distribution platform 5005 may be a developer, seller, and/or licensor of software (e.g., example machine readable instructions 1632). The third party may be a consumer, user, retailer, OEM, etc. who purchases and/or licenses the software for use and/or resale and/or sublicensing. In the illustrated example, the software distribution platform 5005 includes one or more servers and one or more storage devices. The storage devices store machine-readable instructions 1632, which may correspond to example machine-readable instructions of the flowcharts disclosed herein as described above. One or more servers of the example software distribution platform 5005 are in communication with a network 5010, which can correspond to the internet and/or any one or more of the example networks 1626 described above. In some examples, one or more servers respond to requests to transmit software to a requestor as part of a commercial transaction. Payment for delivery, sales, and/or licensing of the software may be handled by one or more servers of the software distribution platform and/or by a third party payment entity. These servers enable purchasers and/or licensees to download machine readable instructions 1632 from the software distribution platform 5005. For example, software that may correspond to the example machine readable instructions of the flowcharts disclosed herein may be downloaded to the example processor platform 1600 or any of the processor platforms disclosed in one or more of fig. 16, 21, 26, 33, and/or 47, which would execute the machine readable instructions. In some examples, one or more servers of the software distribution platform 5005 periodically provide, transmit, and/or force updates to the software (e.g., the example machine readable instructions 1632) to ensure that improvements, patches, updates, etc. are distributed and applied to the software at the end user device.

Example methods, apparatus, systems, and articles of manufacture for a configurable machine learning computing node are disclosed herein. Further examples and combinations thereof include the following:

Example methods, apparatus, systems, and articles of manufacture to manage a processing unit are disclosed herein. Further examples and combinations thereof include the following:

Example 1 includes an apparatus for managing a processing unit, the apparatus comprising interface circuitry to detect a request to initialize a computing system, and processor circuitry to include one or more of: at least one of a central processing unit, a graphics processing unit, or a digital signal processor having control circuitry, arithmetic and logic circuitry, and one or more registers, the processor circuitry to execute instructions to execute system boot software retrieved from memory, execute firmware for a heterogeneous processing unit, the firmware retrieved from the memory, identify a type of the heterogeneous processing unit via silicon initialization code, and cause initialization of the heterogeneous processing unit via the silicon initialization code.

Example 2 includes the apparatus as defined in example 1, wherein the memory is a serial peripheral interface flash memory.

Example 3 includes the apparatus as defined in example 2, further comprising an enhanced serial peripheral interface to facilitate sharing the serial peripheral interface flash memory between the central processing unit and the heterogeneous processing unit.

Example 4 includes the apparatus as defined in example 1, wherein the heterogeneous processor is a graphics processing unit.

Example 5 includes the apparatus as defined in example 1, wherein the heterogeneous processor is a discrete graphics processing unit.

Example 6 includes the apparatus as defined in example 1, wherein the processor circuit is to execute the instructions to retrieve a motherboard-specific configuration via the silicon initialization code, the motherboard-specific configuration including peripheral component interconnect express (PCI-E) slot information.

Example 7 includes the apparatus as defined in example 1, wherein the processor circuit executes the instructions to store updatable product data comprising address information of the heterogeneous processing unit.

Example 8 includes the apparatus as defined in example 7, wherein the processor circuit is to execute the instructions to retrieve the updateable product data via the silicon initialization code to access information of the heterogeneous processing unit.

Example 9 includes a non-transitory computer-readable medium comprising instructions that when executed cause a processor to at least detect a request to initialize a computing system and execute system boot software retrieved from a memory, execute firmware for a heterogeneous processing unit, the firmware retrieved from the memory, identify a type of the heterogeneous processing unit via silicon initialization code, and cause initialization of the heterogeneous processing unit via the silicon initialization code.

Example 10 includes the non-transitory computer-readable medium as defined in example 9, wherein the memory is a serial peripheral interface flash memory.

Example 11 includes the non-transitory computer-readable medium as defined in example 10, wherein the instructions, when executed, cause the processor to facilitate sharing the serial peripheral interface flash memory between the central processing unit and the heterogeneous processing unit.

Example 12 includes the non-transitory computer-readable medium as defined in example 9, wherein the heterogeneous processor is a graphics processing unit.

Example 13 includes the non-transitory computer-readable medium as defined in example 9, wherein the heterogeneous processor is a discrete graphics processing unit.

Example 14 includes the non-transitory computer-readable medium as defined in example 9, wherein the instructions, when executed, cause the processor to retrieve a motherboard-specific configuration via the silicon initialization code, the motherboard-specific configuration including peripheral component interconnect express (PCI-E) slot information.

Example 15 includes the non-transitory computer-readable medium as defined in example 9, wherein the instructions, when executed, cause the processor to store updatable product data comprising address information of the heterogeneous processing unit.

Example 16 includes the non-transitory computer-readable medium as defined in example 15, wherein the instructions, when executed, cause the processor to retrieve the updateable product data via the silicon initialization code to access information of the heterogeneous processing unit.

Example 17 includes a method comprising detecting a request to initialize a computing system and executing system boot software retrieved from a memory, executing firmware for a heterogeneous processing unit, the firmware retrieved from the memory, identifying a type of the heterogeneous processing unit via silicon initialization code, and causing initialization of the heterogeneous processing unit via the silicon initialization code.

Example 18 includes the method as defined in example 17, wherein the memory is a serial peripheral interface flash memory.

Example 19 includes the method as defined in example 18, further comprising facilitating sharing the serial peripheral interface flash memory between the central processing unit and the heterogeneous processing unit.

Example 20 includes the method as defined in example 17, wherein the heterogeneous processor is a graphics processing unit.

Example 21 includes the method as defined in example 17, wherein the heterogeneous processor is a discrete graphics processing unit.

Example 22 includes the method as defined in example 17, further comprising retrieving, via the silicon initialization code, a motherboard-specific configuration including peripheral component interconnect express (PCI-E) slot information.

Example 23 includes the method as defined in example 17, further comprising storing updatable product data comprising address information of the heterogeneous processing unit.

Example 24 includes the method as defined in example 23, further comprising retrieving the updateable product data via the silicon initialization code to access information of the heterogeneous processing unit.

Example 25 includes an apparatus for managing a processing unit, comprising interface circuitry to detect a request to obtain a resource request from a workload, processor circuitry to include one or more of: at least one of a central processing unit, a graphics processing unit, or a digital signal processor having control circuitry, arithmetic and logic circuitry, and one or more registers, the processor circuitry to execute instructions to determine whether resources are available on an infrastructure processing unit management system for the workload, negotiate with the infrastructure processing unit to determine whether an execution workload can be migrated, in response to determining that an execution workload can be migrated, cause the execution workload to be migrated, and cause the workload to execute on the resources.

Example 26 includes the apparatus as defined in example 25, wherein the workload is a virtual machine.

Example 27 includes the apparatus as defined in example 25, wherein the processor circuitry is to execute the instructions to validate the resource request.

Example 28 includes the apparatus as defined in example 25, wherein the resource request identifies a service level agreement.

Example 29 includes the apparatus as defined in example 28, wherein the processor circuitry is to execute the instructions to determine whether a service level agreement identified in the resource request can be satisfied by any available resources.

Example 30 includes the apparatus as defined in example 29, wherein the processor circuit prompts a user to provide a valid request in response to determining that the service level agreement cannot be satisfied.

Example 31 includes the apparatus as defined in example 25, wherein the processor circuit is to execute the instructions to update a class of service for the execution workload.

Example 32 includes the apparatus as defined in example 25, wherein the processor circuitry is to execute the instructions to store an association of the workload and the resource in a blockchain.

Example 33 includes a non-transitory computer-readable medium comprising instructions that, when executed, cause a processor to at least detect a request to obtain a resource request from a workload, determine whether resources are available on an infrastructure processing unit management system for the workload, negotiate with the infrastructure processing unit to determine whether an executing workload can be migrated, cause the executing workload to be migrated, and cause the workload to execute on the resources in response to determining that an executing workload can be migrated.

Example 34 includes the non-transitory computer-readable medium as defined in example 33, wherein the workload is a virtual machine.

Example 35 includes the non-transitory computer-readable medium as defined in example 33, wherein the instructions, when executed, cause the processor to validate the resource request.

Example 36 includes the non-transitory computer-readable medium as defined in example 33, wherein the resource request identifies a service level agreement.

Example 37 includes the non-transitory computer-readable medium as defined in example 36, wherein the instructions, when executed, cause the processor to execute the instructions to determine whether a service level agreement identified in the resource request can be satisfied by any available resources.

Example 38 includes the non-transitory computer-readable medium as defined in example 37, wherein the instructions, when executed, cause the processor to prompt a user to provide a valid request in response to determining that the service level agreement cannot be satisfied.

Example 39 includes the non-transitory computer-readable medium as defined in example 33, wherein the instructions, when executed, cause the processor to update a class of service for the execution workload.

Example 40 includes the non-transitory computer-readable medium as defined in example 33, wherein the instructions, when executed, cause the processor to store the association of the workload and the resource in a blockchain.

Example 41 includes a method comprising detecting a request to obtain a resource request from a workload, determining whether resources are available on an infrastructure processing unit management system for the workload, negotiating with the infrastructure processing unit to determine whether an execution workload can be migrated, causing the execution workload to be migrated, and causing the workload to be executed on the resources in response to determining that an execution workload can be migrated.

Example 42 includes the method as defined in example 41, wherein the workload is a virtual machine.

Example 43 includes the method as defined in example 41, further comprising validating the resource request.

Example 44 includes the method as defined in example 41, wherein the resource request identifies a service level agreement.

Example 45 includes the method as defined in example 44, further comprising executing the instructions to determine whether a service level agreement identified in the resource request can be satisfied by any available resources.

Example 46 includes the method as defined in example 45, further comprising prompting a user to provide a valid request in response to determining that the service level agreement cannot be satisfied.

Example 47 includes the method as defined in example 41, further comprising updating a class of service for the execution workload.

Example 48 includes the method as defined in example 41, further comprising storing an association of the workload and the resource in a blockchain.

Example 49 includes an apparatus for managing a processing unit, the apparatus comprising interface circuitry to detect a request to execute a deep neural network, and processor circuitry to include one or more of: at least one of a central processing unit, a graphics processing unit, or a digital signal processor having control circuitry, arithmetic and logic circuitry, and one or more registers, the processor circuitry to execute instructions to obtain a service level agreement associated with the request, to determine a candidate set of operating parameters based on the service level agreement to service the request, to generate a kernel for a set of operating parameters from the candidate set, and to execute the kernel to determine performance of the kernel.

Example 50 includes the apparatus as defined in example 49, wherein the processor circuit is to execute the instructions to determine whether the performance meets the service level agreement.

Example 51 includes the apparatus as defined in example 49, wherein the processor circuit is to execute the instructions to determine the candidate set based on hardware capabilities of a computing system used to execute the kernel.

Example 52 includes the apparatus as defined in example 49, wherein the processor circuit is to execute the instructions to obtain an operational description associated with the request.

Example 53 includes the apparatus as defined in example 49, wherein the processor circuit executes the instructions to implement an application programming interface to receive the request.

Example 54 includes the apparatus as defined in example 53, wherein the application programming interface manages a plurality of heterogeneous processors.

Example 55 includes the apparatus as defined in example 53, wherein the application programming interface is included in a oneAPI framework.

Example 56 includes a non-transitory computer-readable medium comprising instructions that, when executed, cause a processor to at least detect a request to execute a deep neural network and obtain a service level agreement associated with the request, determine a candidate set of operating parameters based on the service level agreement to service the request, generate a kernel for a set of operating parameters from the candidate set, and execute the kernel to determine performance of the kernel.

Example 57 includes the non-transitory computer-readable medium as defined in example 56, wherein the instructions, when executed, cause the processor to determine whether the performance satisfies the service level agreement.

Example 58 includes the non-transitory computer-readable medium as defined in example 56, wherein the instructions, when executed, cause the processor to determine the candidate set based on hardware capabilities of a computing system for executing the kernel.

Example 59 includes the non-transitory computer-readable medium as defined in example 56, wherein the instructions, when executed, cause the processor to obtain an operational description associated with the request.

Example 60 includes the non-transitory computer-readable medium as defined in example 56, wherein the instructions, when executed, cause the processor to implement an application programming interface to receive the request.

Example 61 includes the non-transitory computer-readable medium as defined in example 60, wherein the application programming interface manages a plurality of heterogeneous processors.

Example 62 includes the non-transitory computer-readable medium as defined in example 60, wherein the application programming interface is included in a oneAPI framework.

Example 63 includes a method comprising detecting a request to execute a deep neural network, and obtaining a service level agreement associated with the request, determining a candidate set of operating parameters based on the service level agreement to service the request, generating a kernel for a set of operating parameters from the candidate set, and executing the kernel to determine performance of the kernel.

Example 64 includes the method as defined in example 63, further comprising determining whether the performance meets the service level agreement.

Example 65 includes the method as defined in example 63, further comprising determining the candidate set based on hardware capabilities of a computing system for executing the kernel.

Example 66 includes the method as defined in example 63, further comprising obtaining an operational description associated with the request.

Example 67 includes the method as defined in example 63, further comprising implementing an application programming interface to receive the request.

Example 68 includes the method as defined in example 67, wherein the application programming interface manages a plurality of heterogeneous processors.

Example 69 includes the method as defined in example 67, wherein the application programming interface is included in a oneAPI framework.
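By way of illustration only, the flow of examples 63 to 65 (derive a candidate set of operating parameters from the service level agreement and the hardware capabilities, generate a kernel for a candidate, execute it to measure performance, and compare the result against the agreement) might be sketched as follows. The names `SLA`, `OperatingParams`, `candidate_set`, `generate_kernel`, and `select_kernel`, the latency-only agreement, and the tiling heuristic are all assumptions made for this sketch and do not appear in the disclosure; the "kernel" is a plain Python reduction standing in for a generated deep neural network kernel.

```python
import time
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterator, Optional, Sequence, Tuple


@dataclass(frozen=True)
class SLA:
    """Hypothetical service level agreement: a latency budget in seconds."""
    max_latency_s: float


@dataclass(frozen=True)
class OperatingParams:
    """Hypothetical operating parameters for one kernel variant."""
    tile_size: int
    num_threads: int  # carried as metadata only; this sketch runs single-threaded


def candidate_set(sla: SLA, max_threads: int) -> Iterator[OperatingParams]:
    """Derive candidates from the SLA and a hardware capability (thread count)."""
    # Assumed heuristic: a tight latency budget drops the low-thread candidates.
    min_threads = max(1, max_threads // 2) if sla.max_latency_s < 0.01 else 1
    threads = [t for t in (1, 2, 4, 8, 16) if min_threads <= t <= max_threads]
    for tile, n in product((16, 32, 64), threads):
        yield OperatingParams(tile_size=tile, num_threads=n)


def generate_kernel(params: OperatingParams) -> Callable[[Sequence[int]], int]:
    """Stand-in for kernel generation: a tiled sum specialized on tile_size."""
    def kernel(data: Sequence[int]) -> int:
        total = 0
        for start in range(0, len(data), params.tile_size):
            total += sum(data[start:start + params.tile_size])
        return total
    return kernel


def measure(kernel: Callable[[Sequence[int]], int], data: Sequence[int]) -> float:
    """Execute the kernel once and return the elapsed wall-clock time."""
    start = time.perf_counter()
    kernel(data)
    return time.perf_counter() - start


def select_kernel(
    sla: SLA, data: Sequence[int], max_threads: int = 8
) -> Optional[Tuple[OperatingParams, float]]:
    """Return the first candidate whose measured latency satisfies the SLA."""
    for params in candidate_set(sla, max_threads):
        latency = measure(generate_kernel(params), data)
        if latency <= sla.max_latency_s:
            return params, latency
    return None  # no candidate satisfied the agreement
```

In this sketch a caller would invoke `select_kernel(SLA(max_latency_s=0.5), data)` and fall back to another policy when `None` is returned.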

Example methods, apparatus, systems, and articles of manufacture for a dynamic XPU hardware-aware deep learning model are disclosed herein. Further examples and combinations thereof include the following:

Example 70 includes an apparatus for a Deep Learning (DL) model management system, the apparatus comprising: interface circuitry; and processor circuitry comprising one or more of: at least one of a central processing unit, a graphics processing unit, or a digital signal processor, the at least one of the central processing unit, the graphics processing unit, or the digital signal processor having control circuitry to control movement of data within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations in accordance with instructions, and one or more registers to store results of the one or more first operations, the instructions in the apparatus; a Field Programmable Gate Array (FPGA), the FPGA comprising logic gates, a plurality of configurable interconnects, and storage circuitry, the logic gates and interconnects to perform one or more second operations, the storage circuitry to store results of the one or more second operations; or an Application Specific Integrated Circuit (ASIC) comprising logic gates to perform one or more third operations; the processor circuitry to perform at least one of the first operations, the second operations, or the third operations to instantiate: a variance determiner circuit to analyze a feature list of a plurality of models optimized for a selected target to identify feature variances between the plurality of models; a similarity determiner circuit to analyze a plurality of feature lists of the plurality of models optimized for a plurality of selected targets to identify feature similarities between the plurality of models; QoS selector circuitry to determine a QoS target for prioritization among the plurality of selected targets; and model scheduler circuitry to select a model from the plurality of models for use on a target hardware platform.

Example 71 includes the apparatus of example 70, wherein the processor circuit instantiates a QoS sampler circuit to sample a current state of the target hardware platform.

Example 72 includes the apparatus of example 70, wherein the QoS selector circuitry is further to order the plurality of models based on an ability to maximize the QoS target for prioritization.

Example 73 includes the apparatus of example 70, wherein the model scheduler circuitry is further to calculate a model utilization metric for the selected model on the target hardware platform, and to select another model of the plurality of models for use on the target hardware platform in response to determining that the model utilization metric is below a threshold.

Example 74 includes the apparatus of example 70, wherein the feature collector circuit retains a plurality of features identified by the variance determiner circuit and the similarity determiner circuit.

Example 75 includes a method for Deep Learning (DL) model training, the method comprising: extracting a plurality of models from a dataset, each of the plurality of models being optimized for a selected quality of service (QoS) target of a plurality of QoS targets; identifying a plurality of feature differences between the plurality of models; and identifying a plurality of feature similarities between the plurality of models.

Example 76 includes the method of example 75, wherein the plurality of feature differences and the plurality of feature similarities are aggregated for retention.
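As a non-limiting sketch of examples 75 and 76, the feature lists of the models can be treated as sets, with the similarities taken as the features common to every model and the differences as what remains per model. The set-based representation and the names `feature_similarities` and `feature_differences` are assumptions for illustration; the disclosure does not prescribe how feature lists are encoded.

```python
from typing import Dict, Set


def feature_similarities(feature_lists: Dict[str, Set[str]]) -> Set[str]:
    """Features present in the feature list of every model."""
    lists = list(feature_lists.values())
    return set.intersection(*lists) if lists else set()


def feature_differences(feature_lists: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Per model, the features that are not shared by all models."""
    common = feature_similarities(feature_lists)
    return {name: feats - common for name, feats in feature_lists.items()}


# Hypothetical feature lists for two models optimized for different QoS targets.
models = {
    "latency": {"conv3x3", "relu", "int8"},
    "accuracy": {"conv3x3", "relu", "fp32", "attention"},
}
# Aggregated for retention, as in example 76.
retained = {
    "similarities": feature_similarities(models),
    "differences": feature_differences(models),
}
```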

Example 77 includes a method for Deep Learning (DL) model management, the method comprising: sampling a current state of a target hardware platform, selecting a quality of service (QoS) target for prioritization among a plurality of QoS targets based on the current state of the target hardware platform, ordering a plurality of models, each model of the plurality of models being optimized for each QoS target of the plurality of QoS targets, selecting one model of the ordered plurality of models for use by the target hardware platform, calculating a utilization metric of the model on the target hardware platform, and selecting another model of the plurality of models for use by the target hardware platform in response to determining that the utilization metric does not satisfy a threshold.

Example 78 includes the method of example 77, wherein the ordering of the models is based on the selected QoS targets.
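The management method of examples 77 and 78 may be illustrated, purely as a sketch, by the loop below: a sampled platform state selects the QoS target to prioritize, the models are ordered for that target, and the scheduler moves to the next model whenever the utilization metric does not satisfy the threshold. The state keys `battery_pct` and `cpu_load`, the per-target `scores`, and the selection rules are hypothetical and chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Model:
    name: str
    scores: Dict[str, float]  # hypothetical per-QoS-target score, higher is better


def select_qos_target(state: Dict[str, float]) -> str:
    """Choose the QoS target to prioritize from a sampled platform state."""
    if state["battery_pct"] < 20.0:
        return "power"
    if state["cpu_load"] > 0.8:
        return "latency"
    return "accuracy"


def order_models(models: List[Model], target: str) -> List[Model]:
    """Order the models by how well each serves the prioritized target."""
    return sorted(models, key=lambda m: m.scores.get(target, 0.0), reverse=True)


def schedule(
    models: List[Model],
    state: Dict[str, float],
    utilization: Callable[[Model], float],
    threshold: float,
) -> Optional[Model]:
    """Return the first ordered model whose utilization metric meets the threshold."""
    target = select_qos_target(state)
    for model in order_models(models, target):
        if utilization(model) >= threshold:
            return model
    return None  # no model satisfied the threshold
```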

Example 79 includes a non-transitory computer-readable medium comprising instructions that, when executed, cause a machine to at least: analyze a feature list of a plurality of models optimized for a plurality of selected targets to identify feature differences between the plurality of models, analyze a plurality of feature lists of the plurality of models optimized for the plurality of selected targets to identify feature similarities between the plurality of models, determine a QoS target for prioritization among the plurality of selected targets, and select a model from the plurality of models for use on a target hardware platform.

Example 80 includes the non-transitory computer-readable medium of example 79, wherein the instructions, when executed, further cause the machine to sample a current state of the target hardware platform.

Example 81 includes the non-transitory computer-readable medium of example 80, wherein the plurality of models are ordered based on a capability to maximize the QoS target for prioritization.

Example 82 includes the non-transitory computer-readable medium of example 80, wherein the instructions, when executed, further cause the machine to calculate a model utilization metric for the selected model on the target hardware platform, and select another model of the plurality of models for use on the target hardware platform in response to determining that the model utilization metric is below a threshold.

Example 83 includes the non-transitory computer-readable medium of example 80, wherein a plurality of features identified by the variance determiner circuit and the similarity determiner circuit are retained.

Example 84 includes an apparatus comprising at least one interface circuit, instructions in the apparatus, and processor circuitry to execute the instructions to analyze a feature list of models optimized for a selected target to identify feature differences between a plurality of models, analyze a plurality of feature lists of the plurality of models optimized for a plurality of selected targets to identify feature similarities between the plurality of models, determine QoS targets for prioritization among the plurality of selected targets, and select a model from the plurality of models for use on a target hardware platform.

Example 85 includes the apparatus of example 84, wherein the processor circuit is to sample a current state of the target hardware platform.

Example 86 includes the apparatus of example 84, wherein the processor circuit is further to order the plurality of models based on an ability to maximize the QoS target for prioritization.

Example 87 includes the apparatus of example 84, wherein the processor circuit is further to calculate a model utilization metric for the selected model on the target hardware platform, and to select another model of the plurality of models to use on the target hardware platform in response to determining that the model utilization metric is below a threshold.

Example 88 includes the apparatus of example 84, wherein the processor circuit retains a plurality of features identified by the variance determiner circuit and the similarity determiner circuit.

Example 89 includes the apparatus of example 87, wherein the threshold of the model utilization metric is a predetermined threshold.

Example methods, apparatus, systems, and articles of manufacture for data-enhanced automated model generation are disclosed herein. Further examples and combinations thereof include the following:

Example 90 includes an apparatus for data-enhanced automated model generation, the apparatus comprising: interface circuitry to access a request to generate a machine learning model; and processor circuitry comprising one or more of: at least one of a central processing unit, a graphics processing unit, or a digital signal processor, the at least one of the central processing unit, the graphics processing unit, or the digital signal processor having control circuitry to control movement of data within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions, and one or more registers to store results of the one or more first operations, the instructions in the apparatus; a Field Programmable Gate Array (FPGA) comprising logic gates, a plurality of configurable interconnections, and storage circuitry, the logic gates and interconnections to perform one or more second operations, the storage circuitry to store results of the one or more second operations; or an Application Specific Integrated Circuit (ASIC) comprising logic gates to perform one or more third operations; the processor circuitry to perform at least one of the first operations, the second operations, or the third operations to instantiate: a task data coordination circuit to generate task knowledge based on a previously generated machine learning model; search space management circuitry to create a search space based on the task knowledge; and a neural architecture search circuit to generate the machine learning model using a neural architecture search, the neural architecture search circuit to initiate the neural architecture search based on the search space.

Example 91 includes the apparatus of example 90, wherein the processor circuit is to insert a plurality of anchor points into the machine learning model during generation of the machine learning model, the anchor points to be used for collection of performance statistics related to execution of the machine learning model.

Example 92 includes the apparatus of example 91, wherein the performance statistics include at least one of power efficiency or energy efficiency.

Example 93 includes the apparatus of example 91, wherein the processor circuit is further to collect the performance statistics based on the anchor points.

Example 94 includes the apparatus of example 93, wherein to generate the task knowledge, the processor circuit is further to rank features of a previously generated machine learning model.

Example 95 includes the apparatus of example 90, wherein to create the search space, the processor circuit is to select a previous architecture based on performance of the previous architecture on the selected hardware.

Example 96 includes at least one non-transitory computer-readable storage medium comprising instructions that, when executed, cause at least one processor to at least: access a request to generate a machine learning model to perform a selected task, generate task knowledge based on a previously generated machine learning model, create a search space based on the task knowledge, and generate the machine learning model utilizing a neural architecture search that begins based on the search space.

Example 97 includes the at least one non-transitory computer-readable storage medium of example 96, wherein the instructions, when executed, further cause the at least one processor to insert a plurality of anchor points into the machine learning model, the anchor points to be used in collecting performance statistics regarding execution of the machine learning model.

Example 98 includes the at least one non-transitory computer-readable storage medium of example 97, wherein the instructions, when executed, further cause the at least one processor to collect the performance statistics based on the anchor points.

Example 99 includes the at least one non-transitory computer-readable storage medium of example 98, wherein the instructions, when executed, further cause the at least one processor to rank features of a previously generated machine learning model to generate the task knowledge.

Example 100 includes the at least one non-transitory computer-readable storage medium of example 96, wherein the instructions, when executed, further cause the at least one processor to select a previous architecture to create the search space based on performance of the previous architecture on the selected hardware.

Example 101 includes a method for data-enhanced automated model generation, the method comprising: accessing a request to generate a machine learning model to perform a selected task, generating task knowledge based on a previously generated machine learning model, creating a search space based on the task knowledge, and generating the machine learning model utilizing a neural architecture search that begins based on the search space.

Example 102 includes the method of example 101, further comprising, during generation of the machine learning model, inserting a plurality of anchor points into the machine learning model, the anchor points to be used in collecting performance statistics regarding execution of the machine learning model.

Example 103 includes the method of example 102, further comprising collecting the performance statistics based on the anchor points.
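For illustration only, the anchor points of examples 102 and 103 can be pictured as thin wrappers placed around parts of a model that record a statistic each time the wrapped part executes. In the sketch below a model is assumed to be a list of Python callables and the statistic is wall-clock time; both are simplifying assumptions, and a real system could instead record the power or energy figures mentioned in example 92. The names `insert_anchor` and `run_model` are hypothetical.

```python
import time
from typing import Callable, Dict, List

Layer = Callable[[float], float]


def insert_anchor(name: str, layer: Layer, stats: Dict[str, List[float]]) -> Layer:
    """Wrap a layer so that each call appends its elapsed time to stats[name]."""
    def anchored(x: float) -> float:
        start = time.perf_counter()
        result = layer(x)
        stats.setdefault(name, []).append(time.perf_counter() - start)
        return result
    return anchored


def run_model(layers: List[Layer], x: float) -> float:
    """Execute the layers in order, feeding each output to the next layer."""
    for layer in layers:
        x = layer(x)
    return x


# Anchors are inserted while the (toy) model is being assembled.
stats: Dict[str, List[float]] = {}
model = [
    insert_anchor("scale", lambda v: v * 2.0, stats),
    insert_anchor("shift", lambda v: v + 1.0, stats),
]
output = run_model(model, 3.0)  # statistics are collected as a side effect
```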

Example 104 includes the method of example 103, wherein the generating of task knowledge includes ranking features of previously generated machine learning models.

Example 105 includes the method of example 101, wherein the creating of the search space includes selecting a previous architecture based on performance of the previous architecture on the selected hardware.
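The method of examples 101, 104, and 105 may be sketched, under stated assumptions, as three steps: rank previously generated architectures to form task knowledge, restrict the search space to values seen in the retained architectures, and start the search inside that space. In the sketch, architectures are assumed to be dictionaries of integer hyperparameters, the recorded score is assumed to reflect performance on the selected hardware, and seeded random sampling stands in for the neural architecture search itself, which the disclosure does not tie to any particular algorithm. All function names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional, Set, Tuple

Arch = Dict[str, int]


def generate_task_knowledge(previous: List[Tuple[Arch, float]], top_k: int) -> List[Arch]:
    """Rank previously generated architectures by recorded score; keep the best."""
    ranked = sorted(previous, key=lambda item: item[1], reverse=True)
    return [arch for arch, _ in ranked[:top_k]]


def create_search_space(knowledge: List[Arch]) -> Dict[str, List[int]]:
    """Limit each hyperparameter to the values seen in the retained architectures."""
    seen: Dict[str, Set[int]] = {}
    for arch in knowledge:
        for key, value in arch.items():
            seen.setdefault(key, set()).add(value)
    return {key: sorted(values) for key, values in seen.items()}


def architecture_search(
    space: Dict[str, List[int]],
    evaluate: Callable[[Arch], float],
    trials: int,
    seed: int = 0,
) -> Tuple[Optional[Arch], float]:
    """Random search over the space (a stand-in for a real NAS algorithm)."""
    rng = random.Random(seed)
    best: Optional[Arch] = None
    best_score = float("-inf")
    for _ in range(trials):
        candidate = {key: rng.choice(values) for key, values in space.items()}
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

Seeding the search space from prior results narrows the number of candidates the search must evaluate, which is the apparent motivation for creating the search space from task knowledge.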

Example 106 includes an apparatus for data-enhanced automated model generation, the apparatus comprising: means for accessing a request to generate a machine learning model to perform a selected task; means for generating task knowledge based on a previously generated machine learning model; means for creating a search space based on the task knowledge; and means for generating the machine learning model using a neural architecture search, the neural architecture search beginning based on the search space.

Example 107 includes the apparatus of example 106, further comprising means for inserting, during generation of the machine learning model, a plurality of anchor points into the machine learning model, the anchor points to be used in collecting performance statistics regarding execution of the machine learning model.

Example 108 includes the apparatus of example 107, further comprising means for collecting the performance statistics based on the anchor points.

Example 109 includes the apparatus of example 108, wherein the means for generating further ranks features of previously generated machine learning models.

Example 110 includes the apparatus of example 106, wherein the means for creating selects a previous architecture based on performance of the previous architecture on the selected hardware.

Example methods, apparatus, systems, and articles of manufacture to conditionally activate large cores in computing systems are disclosed herein. Further examples and combinations thereof include the following:

Example 111 includes an apparatus to conditionally activate a large core in a computing system, the apparatus comprising: a first instruction in the apparatus and a processor circuit to execute the first instruction to, in response to a request to operate two or more processing devices as a single processing device, determine from the request whether the two or more processing devices are available and capable of executing a second instruction, split the second instruction into a first sub-instruction and a second sub-instruction when the two or more processing devices are available and capable of executing the second instruction, (a) provide the first sub-instruction to a first processing device of the two or more processing devices, and (b) provide the second sub-instruction to a second processing device of the two or more processing devices, and generate an output for the second instruction by combining a first output of the first processing device and a second output of the second processing device.

Example 112 includes the apparatus of example 111, wherein the request is a first request, the processor circuit is to determine, in response to a second request to operate the two or more processing devices as a single processing device, whether the two or more processing devices are available and capable of executing a third instruction based on the second request, and when the two or more processing devices are capable of executing the third instruction but not available, determine whether the two or more processing devices will be capable of executing the third instruction at a subsequent point in time.

Example 113 includes the apparatus of example 112, wherein the processor circuit is to send a response indicating when the two or more processing devices will be available in response to determining that the two or more processing devices will have the capability to execute the third instruction in the future.

Example 114 includes the apparatus of example 111, wherein the request is a first request, the processor circuit is to determine, in response to a second request to operate the two or more processing devices as a single processing device, whether the two or more processing devices are available and capable of executing a third instruction according to a parameter associated with the second request, and when the two or more processing devices are capable of executing the third instruction but not available according to the parameter, generate an emulation configuration corresponding to execution of the third instruction based on the first processing device and the second processing device.

Example 115 includes the apparatus of example 114, wherein the processor circuit is to send an indication that the third instruction is executable according to the emulation configuration, and in response to acceptance of the emulation configuration, split the third instruction into a third sub-instruction and a fourth sub-instruction, (a) provide the third sub-instruction to a first processing device of the two or more processing devices, and (b) provide the fourth sub-instruction to a second processing device of the two or more processing devices, and combine a third output of the first processing device and a fourth output of the second processing device.

Example 116 includes the apparatus of example 114, wherein the processor circuit determines that the two or more processing devices are capable of executing the third instruction but not available at a first time based on the parameter and determines that the two or more processing devices are capable of executing the third instruction and available at a second time based on the parameter.

Example 117 includes the apparatus of example 111, wherein the processor circuit is to send an indication that the second instruction cannot be executed when the two or more processing devices are not capable of executing the second instruction.
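As one possible reading of examples 111 to 113 and 117, the handling of a request reduces to a three-way decision: execute now when the devices are both capable and available, report when they will become available when they are capable but busy, and report that the instruction cannot be executed when they are not capable. The sketch below assumes capability can be summarized as a lane count and availability as a time at which each device becomes free; the `Device` fields, the `Decision` values, and `decide` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Decision(Enum):
    EXECUTE = "execute"          # capable and available now
    RETRY_LATER = "retry_later"  # capable, but available only at a later time
    REJECT = "reject"            # not capable of executing the instruction


@dataclass
class Device:
    name: str
    lanes: int         # hypothetical capability measure
    busy_until: float  # hypothetical time at which the device becomes free


def decide(
    devices: List[Device], required_lanes: int, now: float
) -> Tuple[Decision, Optional[float]]:
    """Return the decision and, where applicable, the time execution can start."""
    if len(devices) < 2 or sum(d.lanes for d in devices) < required_lanes:
        return Decision.REJECT, None
    ready_at = max(d.busy_until for d in devices)
    if ready_at <= now:
        return Decision.EXECUTE, now
    return Decision.RETRY_LATER, ready_at
```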

Example 118 includes the apparatus of example 111, wherein the processor circuit authenticates the request before determining from the request whether the two or more processing devices are available and capable of executing the second instruction.

Example 119 includes the apparatus of example 111, wherein the two or more processing devices are configurable to operate as a single processing device of different sizes.

Example 120 includes the apparatus of example 119, wherein the combination of two or more processing devices is configurable via a policy.

Example 121 includes the apparatus of example 120, wherein the policy is enforced via a platform trusted execution environment.

Example 122 includes the apparatus of example 111, wherein the processor circuit combines the first output of the first processing device and the second output of the second processing device by at least one of: concatenating the first output and the second output, adding the first output and the second output, or multiplying the first output and the second output.
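A minimal sketch of the split, dispatch, and combine sequence of examples 111 and 122 follows. It assumes, for illustration only, that an "instruction" can be modeled as a list of operands, that each processing device is a Python callable, and that two worker threads stand in for the two devices; the names `split`, `COMBINERS`, and `run_as_single_device` are not taken from the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Sequence, Tuple

ProcessingDevice = Callable[[Sequence[float]], Any]

# The three combination modes named in example 122.
COMBINERS: Dict[str, Callable[[Any, Any], Any]] = {
    "concatenate": lambda a, b: list(a) + list(b),
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}


def split(work: Sequence[float]) -> Tuple[Sequence[float], Sequence[float]]:
    """Split the workload into a first and a second sub-workload."""
    mid = len(work) // 2
    return work[:mid], work[mid:]


def run_as_single_device(
    work: Sequence[float],
    first_device: ProcessingDevice,
    second_device: ProcessingDevice,
    combine: str,
) -> Any:
    """Run each half on its own device and combine the two outputs."""
    first_sub, second_sub = split(work)
    with ThreadPoolExecutor(max_workers=2) as pool:
        first_future = pool.submit(first_device, first_sub)
        second_future = pool.submit(second_device, second_sub)
        return COMBINERS[combine](first_future.result(), second_future.result())
```

Which combiner is appropriate depends on the operation: element-wise results are concatenated, whereas partial sums are added and partial products are multiplied.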

Example 123 includes a non-transitory computer-readable medium comprising instructions that, when executed, cause one or more processors to at least: in response to a request to operate two or more processing devices as a single processing device, determining from the request whether the two or more processing devices are available and capable of executing an instruction, dividing the instruction into a first sub-instruction and a second sub-instruction when the two or more processing devices are available and capable of executing the instruction, (a) providing the first sub-instruction to a first processing device of the two or more processing devices, and (b) providing the second sub-instruction to a second processing device of the two or more processing devices, and generating an output for the instruction by combining a first output of the first processing device and a second output of the second processing device.

Example 124 includes the computer-readable medium of example 123, wherein the request is a first request and the instructions are first instructions that cause the one or more processors to, in response to a second request to operate the two or more processing devices as a single processing device, determine from the second request whether the two or more processing devices are available and capable of executing the second instructions, and determine whether the two or more processing devices will have the capability to execute the second instructions at a subsequent point in time when the two or more processing devices are capable of executing the second instructions but not available.

Example 125 includes the computer-readable medium of example 124, wherein the instructions cause the one or more processors to, in response to determining that the two or more processing devices will have the capability to execute the second instruction in the future, send a response indicating when the two or more processing devices will be available.

Example 126 includes the computer-readable medium of example 123, wherein the request is a first request and the instructions are first instructions that cause the one or more processors to, in response to a second request to operate the two or more processing devices as a single processing device, determine whether the two or more processing devices are available and capable of executing a second instruction according to a parameter associated with the second request, and generate an emulation configuration corresponding to execution of the second instruction based on the first processing device and the second processing device when the two or more processing devices are capable of executing the second instruction but not available according to the parameter.

Example 127 includes the computer-readable medium of example 126, wherein the instructions cause the one or more processors to send an indication that the second instruction is executable according to the emulation configuration and, in response to acceptance of the emulation configuration, split the second instruction into a third sub-instruction and a fourth sub-instruction, (a) provide the third sub-instruction to a first processing device of the two or more processing devices, and (b) provide the fourth sub-instruction to a second processing device of the processing devices, and combine a third output of the first processing device and a fourth output of the second processing device.

Example 128 includes the computer-readable medium of example 126, wherein the instructions cause the one or more processors to determine that the two or more processing devices are capable of executing the second instruction but not available at a first time as a function of the parameter, and determine that the two or more processing devices are capable of executing the second instruction and available at a second time as a function of the parameter, different from the first time.

Example 129 includes the computer-readable medium of example 123, wherein the instructions cause the one or more processors to send an indication that the instructions cannot be executed when the two or more processing devices are not capable of executing the instructions.

Example 130 includes the computer-readable medium of example 123, wherein the instructions cause the one or more processors to authenticate the request before determining from the request whether the two or more processing devices are available and capable of executing the instructions.

Example 131 includes the computer-readable medium of example 123, wherein the two or more processing devices are configurable to operate as a single processing device of different sizes.

Example 132 includes the computer-readable medium of example 131, wherein the combination of the two or more processing devices is configurable via a policy.

Example 133 includes the computer-readable medium of example 132, wherein the policy is enforced via a platform trusted execution environment.

Example 134 includes the computer-readable medium of example 123, wherein the instructions cause the one or more processors to combine the first output of the first processing device and the second output of the second processing device by at least one of: concatenating the first output and the second output, adding the first output and the second output, or multiplying the first output and the second output.

Example 135 includes an apparatus to conditionally activate a large core in a computing system, the apparatus comprising: interface circuitry to obtain a request to operate two or more processing devices as a single processing device; and processor circuitry comprising one or more of: at least one of a central processing unit, a graphics processing unit, or a digital signal processor, the at least one of the central processing unit, the graphics processing unit, or the digital signal processor having control circuitry to control movement of data within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions, and one or more registers to store results of the one or more first operations, the instructions in the apparatus; a Field Programmable Gate Array (FPGA), the FPGA comprising logic gates, a plurality of configurable interconnects, and storage circuitry, the logic gates and interconnects to perform one or more second operations, the storage circuitry to store results of the one or more second operations; or an Application Specific Integrated Circuit (ASIC) comprising logic gates to perform one or more third operations; the processor circuitry to perform at least one of the first operations, the second operations, or the third operations to instantiate hardware management circuitry to: determine, in response to the request, whether the two or more processing devices are available and capable of executing an instruction; when the two or more processing devices are available and capable of executing the instruction, split the instruction into a first sub-instruction and a second sub-instruction, (a) provide the first sub-instruction to a first processing device of the two or more processing devices, and (b) provide the second sub-instruction to a second processing device of the two or more processing devices; and generate an output for the instruction by combining a first output of the first processing device and a second output of the second processing device.

Example 136 includes the apparatus of example 135, wherein the request is a first request and the instruction is a first instruction, the hardware management circuitry is to determine, in response to a second request to operate the two or more processing devices as a single processing device, whether the two or more processing devices are available and capable of executing a second instruction based on the second request, and determine, when the two or more processing devices are capable of executing the second instruction but not available, whether the two or more processing devices will have the capability to execute the second instruction at a subsequent point in time.

Example 137 includes the apparatus of example 136, wherein the hardware management circuitry is to send a response indicating when the two or more processing devices will be available in response to determining that the two or more processing devices will have the capability to execute the second instruction in the future.

Example 138 includes the apparatus of example 135, wherein the request is a first request and the instruction is a first instruction, the hardware management circuitry is to determine, in response to a second request to operate the two or more processing devices as a single processing device, whether the two or more processing devices are available and capable of executing a second instruction according to a parameter associated with the second request, and when the two or more processing devices are capable of executing the second instruction but not available according to the parameter, generate an emulation configuration corresponding to execution of the second instruction based on the first processing device and the second processing device.

Example 139 includes the apparatus of example 138, wherein the hardware management circuitry is to send an indication that the second instruction is executable according to the emulation configuration, and in response to acceptance of the emulation configuration, to split the second instruction into a third sub-instruction and a fourth sub-instruction, (a) provide the third sub-instruction to a first processing device of the two or more processing devices, and (b) provide the fourth sub-instruction to a second processing device of the two or more processing devices, and to combine a third output of the first processing device and a fourth output of the second processing device.

Example 140 includes the apparatus of example 138, wherein the hardware management circuitry is to determine, at a first time, that the two or more processing devices are capable of executing the second instruction but not available based on the parameter, and to determine, at a second time, that the two or more processing devices are capable of executing the second instruction and available based on the parameter.

Example 141 includes the apparatus of example 135, wherein the hardware management circuitry is to send an indication that indicates that the instruction cannot be executed when the two or more processing devices are not capable of executing the instruction.

Example 142 includes the apparatus of example 135, wherein the processor circuit instantiates an authentication circuit to authenticate the request before determining from the request whether the two or more processing devices are available and capable of executing the instructions.

Example 143 includes the apparatus of example 135, wherein the two or more processing devices are configurable to operate as a single processing device of different sizes.

Example 144 includes the apparatus of example 143, wherein the combination of two or more processing devices is configurable via a policy.

Example 145 includes the apparatus of example 144, wherein the policy is enforced via a platform trusted execution environment.

Example 146 includes the apparatus of example 135, wherein the hardware management circuitry is to combine the first output of the first processing device and the second output of the second processing device by at least one of: concatenating the first output and the second output, adding the first output and the second output, or multiplying the first output and the second output.

Example 147 includes a method of conditionally activating a large core in a computing system, the method comprising: in response to a request to operate two or more processing devices as a single processing device, determining whether the two or more processing devices are available and capable of executing instructions according to the request by executing instructions with one or more processors, when the two or more processing devices are available and capable of executing the instructions, splitting the instructions into a first sub-instruction and a second sub-instruction by executing instructions with the one or more processors, (a) providing the first sub-instruction to a first one of the two or more processing devices, and (b) providing the second sub-instruction to a second one of the two or more processing devices, and generating an output for the instructions by combining a first output of the first processing device and a second output of the second processing device by executing instructions with the one or more processors.

Example 148 includes the method of example 147, wherein the request is a first request and the instruction is a first instruction, further comprising, in response to a second request to operate the two or more processing devices as a single processing device, determining from the second request whether the two or more processing devices are available and capable of executing a second instruction, and determining, when the two or more processing devices are capable of executing the second instruction but not available, whether the two or more processing devices will have the capability to execute the second instruction at a subsequent point in time.

Example 149 includes the method of example 148, further comprising, in response to determining that the two or more processing devices will have the capability to execute the second instruction in the future, sending a response indicating when the two or more processing devices will be available.

Example 150 includes the method of example 147, wherein the request is a first request and the instruction is a first instruction, further comprising, in response to a second request to operate the two or more processing devices as a single processing device, determining from a parameter associated with the second request whether the two or more processing devices are available and capable of executing a second instruction, and generating, based on the first processing device and the second processing device, an emulation configuration corresponding to execution of the second instruction when the two or more processing devices are capable of executing the second instruction but not available according to the parameter.

Example 151 includes the method of example 150, further comprising sending an indication that the second instruction is executable according to the emulation configuration, and in response to acceptance of the emulation configuration, splitting the second instruction into a third sub-instruction and a fourth sub-instruction, (a) providing the third sub-instruction to a first processing device of the two or more processing devices, and (b) providing the fourth sub-instruction to a second processing device of the two or more processing devices, and combining a third output of the first processing device and a fourth output of the second processing device.

Example 152 includes the method of example 150, further comprising determining, at a first time, that the two or more processing devices are capable of executing the second instruction but not available according to the parameter, and determining, at a different time than the first time, that the two or more processing devices are capable of executing the second instruction and available according to the parameter.

Example 153 includes the method of example 147, further comprising sending an indication that indicates that the instruction cannot be executed when the two or more processing devices are not capable of executing the instruction.
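
By way of illustration only, the availability and capability checks of examples 148 to 153 might be sketched as the following decision routine. The names `Decision`, `DeviceGroupStatus`, `free_at`, and `allow_emulation` are hypothetical and do not appear in this disclosure, and the ordering of the emulation branch ahead of the deferred-availability branch is an assumption, since the examples present these as alternatives.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Decision(Enum):
    EXECUTE = "execute"                  # available and capable
    OFFER_EMULATION = "offer_emulation"  # capable, not available, emulation offered
    AVAILABLE_LATER = "available_later"  # capable, not available, report when free
    CANNOT_EXECUTE = "cannot_execute"    # not capable


@dataclass
class DeviceGroupStatus:
    """Hypothetical snapshot of two or more devices requested as one device."""
    capable: bool
    available: bool
    free_at: Optional[float] = None  # assumed: time at which the group becomes free


def handle_request(status: DeviceGroupStatus,
                   allow_emulation: bool = False) -> Tuple[Decision, Optional[float]]:
    """Return a decision and, where relevant, the time the devices will be free."""
    if not status.capable:
        return Decision.CANNOT_EXECUTE, None
    if status.available:
        return Decision.EXECUTE, None
    if allow_emulation:
        # Assumption: the request parameter permits an emulation configuration.
        return Decision.OFFER_EMULATION, None
    return Decision.AVAILABLE_LATER, status.free_at


busy = DeviceGroupStatus(capable=True, available=False, free_at=12.5)
print(handle_request(busy))
print(handle_request(busy, allow_emulation=True))
```

In this sketch the same request may yield a different decision at a later time simply because `available` has changed, which corresponds to the first-time and second-time determinations of example 152.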

Example 154 includes the method of example 147, further comprising authenticating the request before determining from the request whether the two or more processing devices are available and capable of executing the instructions.

Example 155 includes the method of example 147, wherein the two or more processing devices are configurable to operate as a single processing device of different sizes.

Example 156 includes the method of example 155, wherein the combination of two or more processing devices is configurable via a policy.

Example 157 includes the method of example 156, wherein the policy is enforced via a platform trusted execution environment.

Example 158 includes the method of example 147, wherein the combination of the first output of the first processing device and the second output of the second processing device includes at least one of: concatenating the first output and the second output, adding the first output and the second output, or multiplying the first output and the second output.
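
As a non-limiting sketch of the split, dispatch, and combine flow of examples 147 and 158, the following models each processing device as a plain function and an instruction as a list of operands. These modelling choices, and the names `split_instruction`, `combine`, and `run_as_single_device`, are assumptions made for illustration and are not taken from this disclosure.

```python
from typing import Callable, List, Sequence, Tuple

Device = Callable[[List[float]], List[float]]  # hypothetical stand-in for a processing device


def split_instruction(operands: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split one wide instruction into a first and a second sub-instruction."""
    mid = len(operands) // 2
    return list(operands[:mid]), list(operands[mid:])


def combine(first: List[float], second: List[float], mode: str) -> List[float]:
    """Combine two device outputs by concatenation, addition, or multiplication."""
    if mode == "concatenate":
        return first + second
    if mode == "add":
        # zip stops at the shorter output, so unequal halves are truncated.
        return [a + b for a, b in zip(first, second)]
    if mode == "multiply":
        return [a * b for a, b in zip(first, second)]
    raise ValueError(f"unknown combine mode: {mode}")


def run_as_single_device(operands: Sequence[float], first_device: Device,
                         second_device: Device, mode: str = "concatenate") -> List[float]:
    """Operate two devices as one: split, dispatch each half, combine the outputs."""
    first_sub, second_sub = split_instruction(operands)
    return combine(first_device(first_sub), second_device(second_sub), mode)


def double(values: List[float]) -> List[float]:
    return [2 * v for v in values]


print(run_as_single_device([1, 2, 3, 4], double, double))         # [2, 4, 6, 8]
print(run_as_single_device([1, 2, 3, 4], double, double, "add"))  # [8, 12]
```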

Example 159 includes an apparatus for generating a computing node, the apparatus comprising interface circuitry to receive a workload, instructions in the apparatus, and processor circuitry to at least one of execute or instantiate the instructions to generate a first configuration of one or more machine learning models based on the workload, the first configuration stored in a first configuration database comprising a plurality of machine learning models, the plurality of machine learning models comprising the one or more machine learning models, generate a second configuration of hardware, the second configuration stored in a second configuration database comprising one or more portions of a plurality of hardware, the plurality of hardware comprising the hardware, determine an evaluation parameter based on execution of the workload, the execution of the workload based on the first configuration and the second configuration, and execute the one or more machine learning models in the first configuration on the hardware in the second configuration in response to the evaluation parameter meeting a threshold, the one or more machine learning models and the hardware to execute the workload.

Example 160 includes the apparatus of example 159, wherein the first configuration includes at least one of a number of model layers associated with the one or more machine learning models, weights of model layers, a type of machine learning training, or one or more hyperparameters.

Example 161 includes the apparatus of example 159, wherein the one or more portions include at least one of a first block, a second block, or a third block, and the processor circuit performs at least one of execution or instantiation of the instructions to identify the first block of the hardware to execute a matrix-matrix workload, identify the second block of the hardware to execute a vector-vector workload, identify the third block of the hardware to execute a matrix-vector workload, and identify a register file for each of the first block, the second block, and the third block, the register file storing states for each of the first block, the second block, and the third block, the second configuration being based on a topology including at least one of the first block, the second block, or the third block.
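
The block identification of example 161 might, purely as an illustrative sketch, be expressed as a mapping from workload type to hardware block, with one register file created per identified block. Classifying a workload by the rank of its two operands, the block labels, and the register-file contents below are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]

# Hypothetical labels for the first, second, and third blocks of the hardware.
BLOCK_FOR_KIND: Dict[str, str] = {
    "matrix-matrix": "first_block",
    "vector-vector": "second_block",
    "matrix-vector": "third_block",
}


def workload_kind(lhs: Shape, rhs: Shape) -> str:
    """Classify a workload by operand rank (an assumed heuristic)."""
    ranks = (len(lhs), len(rhs))
    if ranks == (2, 2):
        return "matrix-matrix"
    if ranks == (1, 1):
        return "vector-vector"
    if ranks in ((2, 1), (1, 2)):
        return "matrix-vector"
    raise ValueError(f"unsupported operand ranks: {ranks}")


def build_hardware_configuration(workloads: List[Tuple[Shape, Shape]]) -> dict:
    """Identify the blocks a workload needs and a register file for each block."""
    topology: List[str] = []
    register_files: Dict[str, dict] = {}
    for lhs, rhs in workloads:
        block = BLOCK_FOR_KIND[workload_kind(lhs, rhs)]
        if block not in register_files:
            topology.append(block)
            register_files[block] = {"state": "idle"}  # per-block state storage
    return {"topology": topology, "register_files": register_files}


config = build_hardware_configuration([((4, 4), (4, 4)), ((4, 4), (4,))])
print(config["topology"])  # ['first_block', 'third_block']
```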

Example 162 includes the apparatus of example 159, wherein the one or more machine learning models comprise a first machine learning model, and the processor circuit performs at least one of execution or instantiation of the instructions to identify a second machine learning model in the first configuration database in response to the evaluation parameter not meeting the threshold, generate a third configuration of the second machine learning model, determine the evaluation parameter according to execution of the workload based on the third configuration, and deploy the second machine learning model to execute the workload based on the third configuration.

Example 163 includes the apparatus of example 159, wherein the one or more machine learning models comprise a first machine learning model, and the processor circuit performs at least one of execution or instantiation of the instructions to determine one or more first layers of the first machine learning model to execute a first portion of the workload in response to the evaluation parameter not meeting the threshold, identify a second machine learning model in the first configuration database, determine one or more second layers of the second machine learning model to execute a second portion of the workload, and determine a third configuration based on a topology of the one or more first layers and the one or more second layers, the topology based on output from the one or more first layers as input to the one or more second layers.

Example 164 includes the apparatus of example 159, wherein the one or more machine learning models comprise a first machine learning model, and the processor circuit performs at least one of execution or instantiation of the instructions to identify the first machine learning model in the first configuration database, identify a second machine learning model based on a query to an ontology database with an identifier of the first machine learning model as input, the ontology database comprising an association of the first machine learning model and the second machine learning model, and update the ontology database based on the first configuration in response to the evaluation parameter meeting the threshold.
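
A minimal sketch of the ontology database of example 164 follows, assuming the database can be modelled as an in-memory association from a model identifier to related model identifiers. The class and method names are hypothetical, and treating "meeting the threshold" as greater-than-or-equal is an assumption.

```python
from typing import Dict, List, Optional


class OntologyDatabase:
    """Hypothetical in-memory stand-in for the ontology database."""

    def __init__(self) -> None:
        self._related: Dict[str, List[str]] = {}
        self._configurations: Dict[str, dict] = {}

    def associate(self, first_model: str, second_model: str) -> None:
        """Record an association of a first model and a second model."""
        self._related.setdefault(first_model, []).append(second_model)

    def query(self, model_id: str) -> List[str]:
        """Return models associated with the given model identifier."""
        return list(self._related.get(model_id, []))

    def update(self, model_id: str, configuration: dict,
               evaluation: float, threshold: float) -> bool:
        """Store the configuration only when the evaluation meets the threshold."""
        if evaluation >= threshold:
            self._configurations[model_id] = dict(configuration)
            return True
        return False

    def configuration(self, model_id: str) -> Optional[dict]:
        return self._configurations.get(model_id)


ontology = OntologyDatabase()
ontology.associate("model_a", "model_b")
print(ontology.query("model_a"))  # ['model_b']
stored = ontology.update("model_a", {"layers": 8}, evaluation=0.9, threshold=0.8)
```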

Example 165 includes the apparatus of example 159, wherein the hardware is first hardware and the processor circuit performs at least one of execution or instantiation of the instructions to identify second hardware in the second configuration database in response to the evaluation parameter not meeting the threshold, generate a third configuration of the second hardware, determine the evaluation parameter based on execution of the workload by the second hardware in the third configuration, and deploy the second hardware with the third configuration to execute the one or more machine learning models to execute the workload.

Example 166 includes the apparatus of example 159, wherein the hardware is first hardware and the processor circuit performs at least one of execution or instantiation of the instructions to determine one or more first portions of the first hardware to execute a first portion of the workload in response to the evaluation parameter not meeting the threshold, identify second hardware in the first configuration database, determine one or more second portions of the second hardware to execute a second portion of the workload, and determine a third configuration based on a topology of the one or more first portions and the one or more second portions, the topology based on output from the one or more first portions as input to the one or more second portions.

Example 167 includes the apparatus of example 166, wherein the first hardware and the second hardware are one of: a central processing unit, a graphics processing unit, a digital signal processor, an artificial intelligence processor, a neural network processor, or a field programmable gate array.

Example 168 includes the apparatus of example 159, wherein the evaluation parameter is a first evaluation parameter and the processor circuit performs at least one of execution or instantiation of the instructions to output a reward function including the first evaluation parameter having a first weight and a second evaluation parameter having a second weight, the first weight being greater than the second weight, and in response to determining that at least one of the first evaluation parameter or the second evaluation parameter does not satisfy the threshold, modify at least one of the first configuration or the second configuration to achieve at least one of increasing the first evaluation parameter or decreasing the second evaluation parameter.

Example 169 includes the apparatus of example 159, wherein the evaluation parameter is at least one of accuracy, cost, energy consumption, latency, performance, or throughput associated with at least one of the one or more machine learning models or the hardware.

Example 170 includes an apparatus comprising first means for generating a first configuration of one or more machine learning models based on a workload, the first configuration stored in a first configuration database comprising a plurality of machine learning models, the plurality of machine learning models comprising the one or more machine learning models, second means for generating a second configuration of hardware, the second configuration stored in a second configuration database comprising one or more portions of a plurality of hardware, the plurality of hardware comprising the hardware, means for determining an evaluation parameter based on execution of the workload, the execution of the workload being based on the first configuration and the second configuration, and means for executing the one or more machine learning models in the first configuration on the hardware in the second configuration in response to the evaluation parameter meeting a threshold, the one or more machine learning models and the hardware to execute the workload.

Example 171 includes the apparatus of example 170, wherein the one or more portions include at least one of a first block, a second block, or a third block, and the second means for generating identifies the first block of the hardware to execute a matrix-matrix workload, identifies the second block of the hardware to execute a vector-vector workload, identifies the third block of the hardware to execute a matrix-vector workload, and identifies a register file for each of the first block, the second block, and the third block, the register file storing states for each of the first block, the second block, and the third block, the second configuration being based on a topology including at least one of the first block, the second block, or the third block.

Example 172 includes the apparatus of example 170, wherein the one or more machine learning models comprise a first machine learning model, and the first means for generating identifies a second machine learning model in the first configuration database in response to the evaluation parameter not meeting the threshold, generates a third configuration of the second machine learning model, determines the evaluation parameter according to execution of the workload based on the third configuration, and deploys the second machine learning model to execute the workload based on the third configuration.

Example 173 includes the apparatus of example 170, wherein the one or more machine learning models comprise a first machine learning model, and the first means for generating determines one or more first layers of the first machine learning model to execute the first portion of the workload in response to the evaluation parameter not meeting the threshold, identifies a second machine learning model in the first configuration database, determines one or more second layers of the second machine learning model to execute the second portion of the workload, and determines a third configuration based on a topology of the one or more first layers and the one or more second layers, the topology based on output from the one or more first layers as input to the one or more second layers.

Example 174 includes the apparatus of example 170, wherein the one or more machine learning models comprise a first machine learning model, and the first means for generating identifies the first machine learning model in the first configuration database, identifies a second machine learning model based on a query of an ontology database with an identifier of the first machine learning model as input, the ontology database comprising an association of the first machine learning model and the second machine learning model, and updates the ontology database based on the first configuration in response to the evaluation parameter satisfying the threshold.

Example 175 includes the apparatus of example 170, wherein the hardware is first hardware, and the second means for generating identifies second hardware in the second configuration database in response to the evaluation parameter not meeting the threshold, generates a third configuration of the second hardware, determines the evaluation parameter based on execution of the workload by the second hardware in the third configuration, and deploys the second hardware with the third configuration to execute the one or more machine learning models to execute the workload.

Example 176 includes the apparatus of example 170, wherein the hardware is first hardware, and the second means for generating determines one or more first portions of the first hardware to execute a first portion of the workload in response to the evaluation parameter not meeting the threshold, identifies second hardware in the first configuration database, determines one or more second portions of the second hardware to execute a second portion of the workload, and determines a third configuration based on a topology of the one or more first portions and the one or more second portions, the topology based on output from the one or more first portions as input to the one or more second portions.

Example 177 includes the apparatus of example 170, wherein the evaluation parameter is a first evaluation parameter and the means for determining determines a reward function comprising the first evaluation parameter having a first weight and a second evaluation parameter having a second weight, the first weight being greater than the second weight, and in response to determining that at least one of the first evaluation parameter or the second evaluation parameter does not satisfy the threshold, changes at least one of the first configuration or the second configuration to achieve at least one of increasing the first evaluation parameter or decreasing the second evaluation parameter.

Example 178 includes at least one non-transitory computer-readable storage medium comprising instructions that, when executed, cause a processor circuit to generate a first configuration of one or more machine learning models based at least on a workload, the first configuration stored in a first configuration database comprising a plurality of machine learning models, the plurality of machine learning models comprising the one or more machine learning models, generate a second configuration of hardware, the second configuration stored in a second configuration database comprising one or more portions of a plurality of hardware, the plurality of hardware comprising the hardware, determine an evaluation parameter based on execution of the workload, the execution of the workload being based on the first configuration and the second configuration, and execute the one or more machine learning models in the first configuration on the hardware in the second configuration in response to the evaluation parameter meeting a threshold, the one or more machine learning models and the hardware to execute the workload.

Example 179 includes the at least one non-transitory computer-readable storage medium of example 178, wherein the first configuration includes at least one of a number of model layers associated with the one or more machine learning models, weights of model layers, a type of machine learning training, or one or more hyper-parameters.

Example 180 includes the at least one non-transitory computer-readable storage medium of example 178, wherein the one or more portions include at least one of a first block, a second block, or a third block, and the instructions, when executed, cause the processor circuit to select the first block of the hardware to perform a matrix-matrix workload, select the second block of the hardware to perform a vector-vector workload, select the third block of the hardware to perform a matrix-vector workload, and create a register file for each of the first block, the second block, and the third block, the register file storing states for each of the first block, the second block, and the third block, the second configuration being based on a topology including at least one of the first block, the second block, or the third block.

Example 181 includes the at least one non-transitory computer-readable storage medium of example 178, wherein the one or more machine learning models comprise a first machine learning model, and the instructions, when executed, cause the processor circuit to identify a second machine learning model in the first configuration database in response to the evaluation parameter not meeting the threshold, configure a third configuration of the second machine learning model, calculate the evaluation parameter according to execution of the workload based on the third configuration, and deploy the second machine learning model to execute the workload based on the third configuration.

Example 182 includes the at least one non-transitory computer-readable storage medium of example 178, wherein the one or more machine learning models include a first machine learning model, and the instructions, when executed, cause the processor circuit to determine one or more first layers of the first machine learning model to cause execution of a first portion of the workload in response to the evaluation parameter not meeting the threshold, identify a second machine learning model in the first configuration database, determine one or more second layers of the second machine learning model to cause execution of a second portion of the workload, and determine a third configuration based on a topology of the one or more first layers and the one or more second layers, the topology based on an output from the one or more first layers as an input to the one or more second layers.

Example 183 includes the at least one non-transitory computer-readable storage medium of example 178, wherein the one or more machine learning models comprise a first machine learning model, and the instructions, when executed, cause the processor circuit to discover the first machine learning model in the first configuration database, discover a second machine learning model based on a query of an ontology database with an identifier of the first machine learning model as input, the ontology database comprising an association of the first machine learning model and the second machine learning model, and update the ontology database based on the first configuration in response to the evaluation parameter satisfying the threshold.

Example 184 includes the at least one non-transitory computer-readable storage medium of example 178, wherein the hardware is first hardware, and the instructions, when executed, cause the processor circuit to identify second hardware in the second configuration database in response to the evaluation parameter not meeting the threshold, generate a third configuration of the second hardware, determine the evaluation parameter based on execution of the workload by the second hardware in the third configuration, and deploy the second hardware with the third configuration to execute the one or more machine learning models to execute the workload.

Example 185 includes the at least one non-transitory computer-readable storage medium of example 178, wherein the hardware is first hardware, and the instructions, when executed, cause the processor circuit to select one or more first portions of the first hardware to execute a first portion of the workload in response to the evaluation parameter not meeting the threshold, identify second hardware in the first configuration database, select one or more second portions of the second hardware to execute a second portion of the workload, and determine a third configuration based on a topology of the one or more first portions and the one or more second portions, the topology based on an output from the one or more first portions as an input to the one or more second portions.

Example 186 includes the at least one non-transitory computer-readable storage medium of example 178, wherein the evaluation parameter is a first evaluation parameter, and the instructions, when executed, cause the processor circuit to generate a reward function including the first evaluation parameter having a first weight and a second evaluation parameter having a second weight, the first weight being greater than the second weight, and in response to determining that at least one of the first evaluation parameter or the second evaluation parameter does not satisfy the threshold, adjust at least one of the first configuration or the second configuration to achieve at least one of increasing the first evaluation parameter or decreasing the second evaluation parameter.

Example 187 includes a method for generating a computing node, the method comprising: generating a first configuration of one or more machine learning models based on a workload, the first configuration stored in a first configuration database comprising a plurality of machine learning models including the one or more machine learning models, generating a second configuration of hardware, the second configuration stored in a second configuration database comprising one or more portions of a plurality of hardware, the plurality of hardware including the hardware, determining an evaluation parameter based on execution of the workload, the execution of the workload being based on the first configuration and the second configuration, and executing the one or more machine learning models in the first configuration on the hardware in the second configuration in response to the evaluation parameter meeting a threshold, the one or more machine learning models and the hardware to execute the workload.

Example 188 includes the method of example 187, wherein the first configuration includes at least one of a number of model layers associated with the one or more machine learning models, weights of model layers, a type of machine learning training, or one or more hyperparameters.

Example 189 includes the method of example 187, wherein the one or more portions include at least one of a first block, a second block, or a third block, and further comprising identifying the first block of the hardware to perform a matrix-matrix workload, identifying the second block of the hardware to perform a vector-vector workload, identifying the third block of the hardware to perform a matrix-vector workload, and identifying a register file for each of the first block, the second block, and the third block, the register file storing a state for each of the first block, the second block, and the third block, the second configuration being based on a topology including at least one of the first block, the second block, or the third block.

Example 190 includes the method of example 187, wherein the one or more machine learning models comprise a first machine learning model, and further comprising, in response to the evaluation parameter not meeting the threshold, identifying a second machine learning model in the first configuration database, generating a third configuration of the second machine learning model, determining the evaluation parameter according to execution of the workload based on the third configuration, and deploying the second machine learning model to execute the workload based on the third configuration.

Example 191 includes the method of example 187, wherein the one or more machine learning models comprise a first machine learning model, and further comprising determining one or more first layers of the first machine learning model to execute the first portion of the workload in response to the evaluation parameter not meeting the threshold, identifying a second machine learning model in the first configuration database, determining one or more second layers of the second machine learning model to execute the second portion of the workload, and determining a third configuration based on a topology of the one or more first layers and the one or more second layers, the topology based on output from the one or more first layers as input to the one or more second layers.

Example 192 includes the method of example 187, wherein the one or more machine learning models comprise a first machine learning model, and further comprising identifying the first machine learning model in the first configuration database, identifying a second machine learning model based on a query to an ontology database with an identifier of the first machine learning model as input, the ontology database comprising an association of the first machine learning model and the second machine learning model, and updating the ontology database based on the first configuration in response to the evaluation parameter satisfying the threshold.

Example 193 includes the method of example 187, wherein the hardware is first hardware, and further comprising, in response to the evaluation parameter not meeting the threshold, identifying second hardware in the second configuration database, generating a third configuration of the second hardware, determining the evaluation parameter based on execution of the workload by the second hardware in the third configuration, and deploying the second hardware with the third configuration to execute the one or more machine learning models to execute the workload.
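
The evaluate-then-deploy flow of examples 187, 190, and 193 might be sketched as a search over candidate model and hardware configurations, falling back to the next candidate whenever the evaluation parameter does not meet the threshold. The exhaustive search order, the toy scoring function, and every name and number below are assumptions for illustration only; no particular search strategy is prescribed by these examples.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


def select_configuration(model_configs: Iterable[dict],
                         hardware_configs: Iterable[dict],
                         evaluate: Callable[[dict, dict], float],
                         threshold: float) -> Optional[Tuple[dict, dict, float]]:
    """Return the first (model, hardware, score) whose score meets the threshold."""
    for model_cfg, hardware_cfg in product(model_configs, hardware_configs):
        score = evaluate(model_cfg, hardware_cfg)  # stands in for executing the workload
        if score >= threshold:
            return model_cfg, hardware_cfg, score
    return None  # no candidate pair met the threshold


# Hypothetical entries of the first and second configuration databases.
models = [{"name": "small", "accuracy": 0.7, "cost": 2.0},
          {"name": "large", "accuracy": 0.9, "cost": 8.0}]
hardware = [{"name": "cpu", "speedup": 1.0},
            {"name": "gpu", "speedup": 4.0}]


def toy_evaluate(model_cfg: dict, hardware_cfg: dict) -> float:
    """Assumed metric: accuracy counts only if latency fits a budget of 2.0."""
    latency = model_cfg["cost"] / hardware_cfg["speedup"]
    return model_cfg["accuracy"] if latency <= 2.0 else 0.0


chosen = select_configuration(models, hardware, toy_evaluate, threshold=0.85)
print(chosen[0]["name"], chosen[1]["name"])  # large gpu
```

With these toy values the small model fails the threshold on either hardware and the large model exceeds the latency budget on the CPU, so the search settles on the large model running on the GPU.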

Example 194 includes the method of example 187, wherein the hardware is first hardware, and further comprising determining one or more first portions of the first hardware to execute a first portion of the workload in response to the evaluation parameter not meeting the threshold, identifying second hardware in the first configuration database, determining one or more second portions of the second hardware to execute a second portion of the workload, and determining a third configuration based on a topology of the one or more first portions and the one or more second portions, the topology based on output from the one or more first portions as input to the one or more second portions.

Example 195 includes the method of example 194, wherein the first hardware and the second hardware are one of: a central processing unit, a graphics processing unit, a digital signal processor, an artificial intelligence processor, a neural network processor, or a field programmable gate array.

Example 196 includes the method of example 187, wherein the evaluation parameter is a first evaluation parameter and further comprising outputting a reward function comprising the first evaluation parameter having a first weight and a second evaluation parameter having a second weight, the first weight being greater than the second weight, and in response to determining that at least one of the first evaluation parameter or the second evaluation parameter does not satisfy the threshold, adjusting at least one of the first configuration or the second configuration to achieve at least one of increasing the first evaluation parameter or decreasing the second evaluation parameter.
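By way of illustration only, the weighted reward function of Example 196 could take a form such as the following. The sketch assumes a simple linear combination, assumes each evaluation parameter carries its own threshold, and assumes that parameters to be decreased (for example latency) contribute negatively; none of these choices is stated in the example, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EvaluationParameter:
    name: str
    value: float
    weight: float
    threshold: float
    higher_is_better: bool = True  # False for parameters to be decreased

    def satisfied(self) -> bool:
        """Return True when the parameter meets its threshold."""
        if self.higher_is_better:
            return self.value >= self.threshold
        return self.value <= self.threshold

def reward(params: List[EvaluationParameter]) -> float:
    """Weighted sum; parameters to be decreased are subtracted."""
    total = 0.0
    for p in params:
        signed_value = p.value if p.higher_is_better else -p.value
        total += p.weight * signed_value
    return total

def needs_adjustment(params: List[EvaluationParameter]) -> bool:
    """A configuration is adjusted when any parameter misses its threshold."""
    return any(not p.satisfied() for p in params)


# Made-up values: accuracy carries the greater weight.
accuracy = EvaluationParameter("accuracy", 0.90, weight=0.7, threshold=0.85)
latency = EvaluationParameter("latency_s", 0.20, weight=0.3, threshold=0.10,
                              higher_is_better=False)
current_reward = reward([accuracy, latency])
adjust = needs_adjustment([accuracy, latency])
```

In this sketch the latency value exceeds its threshold, so `needs_adjustment` signals that a configuration should be adjusted to decrease latency.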

Example 197 includes the method of example 187, wherein the evaluation parameter is at least one of accuracy, cost, energy consumption, latency, performance, or throughput associated with at least one of the one or more machine learning models or the hardware.

The appended claims are hereby incorporated into this detailed description by this reference. Although certain example systems, methods, apparatus, and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all systems, methods, apparatus, and articles of manufacture fairly falling within the scope of the appended claims.

Claims

1. An apparatus for managing a processing unit, comprising:

interface circuitry to detect a request to initialize a computing system; and

a processor circuit comprising one or more of:

at least one of a central processing unit, a graphics processing unit, or a digital signal processor, the at least one of the central processing unit, the graphics processing unit, or the digital signal processor having control circuitry, arithmetic and logic circuitry, and one or more registers, the processor circuit to execute instructions to:

execute system boot software retrieved from a memory;

execute firmware for a heterogeneous processing unit, the firmware retrieved from the memory;

identify a type of the heterogeneous processing unit via silicon initialization code; and

cause initialization of the heterogeneous processing unit via the silicon initialization code.
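As a non-limiting illustration, the sequence recited in claim 1 can be modeled in software as follows. This is a simulation sketch only: real silicon initialization code runs as platform firmware, and every name here (`SpiFlash`, `silicon_init_identify`, the region names, and the device-identifier table) is a hypothetical stand-in.

```python
from dataclasses import dataclass, field

@dataclass
class SpiFlash:
    """Stand-in for the memory holding boot software and XPU firmware."""
    regions: dict

    def read(self, region: str) -> str:
        return self.regions[region]

@dataclass
class BootLog:
    steps: list = field(default_factory=list)

def silicon_init_identify(device_id: int) -> str:
    """Identify the heterogeneous processing unit type (hypothetical table)."""
    known = {0x0001: "integrated_gpu", 0x0002: "discrete_gpu"}
    return known.get(device_id, "unknown")

def silicon_init_start(xpu_type: str, firmware: str, log: BootLog) -> None:
    """Cause initialization of the identified unit; here it is only logged."""
    log.steps.append(f"init:{xpu_type}:{firmware}")

def handle_init_request(flash: SpiFlash, device_id: int, log: BootLog) -> str:
    """Model the claimed order of operations after an initialization request."""
    boot = flash.read("boot")                    # system boot software
    log.steps.append(f"boot:{boot}")
    firmware = flash.read("xpu_firmware")        # firmware for the XPU
    log.steps.append(f"firmware:{firmware}")
    xpu_type = silicon_init_identify(device_id)  # identify the type
    log.steps.append(f"type:{xpu_type}")
    silicon_init_start(xpu_type, firmware, log)  # cause initialization
    return xpu_type


flash = SpiFlash({"boot": "bios-v1", "xpu_firmware": "gpu-fw-v1"})
log = BootLog()
detected = handle_init_request(flash, 0x0002, log)
```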

2. The apparatus as defined in claim 1, wherein the memory is a serial peripheral interface flash memory.

3. The apparatus as defined in claim 2, further comprising: an enhanced serial peripheral interface for facilitating sharing of the serial peripheral interface flash memory between the central processing unit and the heterogeneous processing unit.

4. The apparatus as defined in claim 1, wherein the heterogeneous processing unit is a graphics processing unit.

5. The apparatus as defined in claim 1, wherein the heterogeneous processing unit is a discrete graphics processing unit.

6. The apparatus as defined in claim 1, wherein the processor circuit is to execute the instructions to retrieve a motherboard-specific configuration via the silicon initialization code, the motherboard-specific configuration including Peripheral Component Interconnect Express (PCI-E) slot information.

7. The apparatus as defined in claim 1, wherein the processor circuit executes the instructions to store updatable product data, the updatable product data including address information of the heterogeneous processing unit.

8. The apparatus as defined in claim 7, wherein the processor circuit executes the instructions to retrieve the updatable product data via the silicon initialization code to access information of the heterogeneous processing unit.

9. A non-transitory computer readable medium comprising instructions that when executed cause a processor to at least:

detect a request to initialize a computing system; and

execute system boot software retrieved from a memory;

10. The non-transitory computer readable medium as defined in claim 9, wherein the memory is a serial peripheral interface flash memory.

11. The non-transitory computer readable medium as defined in claim 10, wherein the instructions, when executed, cause the processor to facilitate sharing the serial peripheral interface flash memory between the central processing unit and the heterogeneous processing unit.

12. The non-transitory computer readable medium as defined in claim 9, wherein the heterogeneous processing unit is a graphics processing unit.

13. The non-transitory computer readable medium as defined in claim 9, wherein the heterogeneous processing unit is a discrete graphics processing unit.

14. The non-transitory computer readable medium as defined in claim 9, wherein the instructions, when executed, cause the processor to retrieve a motherboard-specific configuration via the silicon initialization code, the motherboard-specific configuration including Peripheral Component Interconnect Express (PCI-E) slot information.

15. The non-transitory computer readable medium as defined in claim 9, wherein the instructions, when executed, cause the processor to store updatable product data comprising address information of the heterogeneous processing unit.

16. The non-transitory computer readable medium as defined in claim 15, wherein the instructions, when executed, cause the processor to retrieve the updatable product data via the silicon initialization code to access information of the heterogeneous processing unit.

17. A method, comprising:

detecting a request to initialize a computing system; and

executing system boot software retrieved from a memory;

18. The method as defined in claim 17, wherein the memory is a serial peripheral interface flash memory.

19. The method as defined in claim 18, further comprising: facilitating sharing of the serial peripheral interface flash memory between the central processing unit and the heterogeneous processing unit.

20. The method as defined in claim 17, wherein the heterogeneous processing unit is a graphics processing unit.

21. The method as defined in claim 17, wherein the heterogeneous processing unit is a discrete graphics processing unit.

22. The method as defined in claim 17, further comprising: retrieving a motherboard-specific configuration via the silicon initialization code, the motherboard-specific configuration including Peripheral Component Interconnect Express (PCI-E) slot information.

23. The method as defined in claim 17, further comprising: storing updatable product data, the updatable product data including address information of the heterogeneous processing unit.

24. The method as defined in claim 23, further comprising: retrieving the updatable product data via the silicon initialization code to access information of the heterogeneous processing unit.

25. An apparatus for managing a processing unit, comprising:

interface circuitry to detect a request to obtain a resource request from a workload; and

a processor circuit comprising one or more of:

determining whether resources are available on an infrastructure processing unit management system for the workload;

negotiating with the infrastructure processing unit to determine whether an execution workload can be migrated;

responsive to determining that the execution workload is capable of being migrated, causing the execution workload to be migrated; and

causing the workload to execute on the resource.
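As a non-limiting illustration, the availability check, migration negotiation, and placement recited in claim 25 might be sketched as follows. The data model (abstract capacity units, a `migratable` flag standing in for the outcome of the negotiation, and the names `Resource`, `Workload`, and `place`) is assumed for the sketch and is not taken from the claim.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Workload:
    name: str
    units: int
    migratable: bool = True  # stand-in for the negotiated answer (assumption)

@dataclass
class Resource:
    name: str
    capacity: int
    workloads: Dict[str, int] = field(default_factory=dict)  # name -> units

    def free(self) -> int:
        return self.capacity - sum(self.workloads.values())

def place(workload: Workload, target: Resource, others: List[Resource],
          registry: Dict[str, Workload]) -> bool:
    """Run workload on target, migrating executing workloads if needed."""
    if target.free() < workload.units:
        # Negotiate: look for executing workloads that can be moved elsewhere.
        for name in list(target.workloads):
            running = registry[name]
            if not running.migratable:
                continue
            dest = next((r for r in others if r.free() >= running.units), None)
            if dest is None:
                continue
            dest.workloads[name] = target.workloads.pop(name)  # migrate
            if target.free() >= workload.units:
                break
    if target.free() < workload.units:
        return False  # resources still unavailable
    target.workloads[workload.name] = workload.units  # execute on the resource
    registry[workload.name] = workload
    return True


# Made-up scenario: workload "A" occupies the target and is migrated away
# so that workload "B" can execute there.
a = Workload("A", units=3)
target = Resource("xpu0", capacity=4, workloads={"A": 3})
spare = Resource("xpu1", capacity=4)
registry = {"A": a}
placed = place(Workload("B", units=2), target, [spare], registry)
```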

26. The apparatus as defined in claim 25, wherein the workload is a virtual machine.

27. The apparatus as defined in claim 25, wherein the processor circuit executes the instructions to validate the resource request.

28. The apparatus as defined in claim 25, wherein the resource request identifies a service level agreement.

29. The apparatus as defined in claim 28, wherein the processor circuit executes the instructions to determine whether a service level agreement identified in the resource request can be satisfied by any available resources.

30. The apparatus as defined in claim 29, wherein the processor circuit prompts a user to provide a valid request in response to determining that the service level agreement cannot be satisfied.

31. The apparatus as defined in claim 25, wherein the processor circuit executes the instructions to update a class of service for the execution workload.

32. The apparatus as defined in claim 25, wherein the processor circuit is to execute the instructions to store an association of the workload and the resource in a blockchain.

33. A non-transitory computer readable medium comprising instructions that, when executed, cause a processor to at least:

detect a request to obtain a resource request from a workload;

cause the workload to execute on the resource.

34. The non-transitory computer readable medium as defined in claim 33, wherein the workload is a virtual machine.

35. The non-transitory computer readable medium as defined in claim 33, wherein the instructions, when executed, cause the processor to validate the resource request.

36. The non-transitory computer readable medium as defined in claim 33, wherein the resource request identifies a service level agreement.

37. The non-transitory computer readable medium as defined in claim 36, wherein the instructions, when executed, cause the processor to determine whether the service level agreement identified in the resource request can be satisfied by any available resources.

38. The non-transitory computer readable medium as defined in claim 37, wherein the instructions, when executed, cause the processor to prompt a user to provide a valid request in response to determining that the service level agreement cannot be satisfied.

39. The non-transitory computer readable medium as defined in claim 33, wherein the instructions, when executed, cause the processor to update a class of service for the execution workload.

40. The non-transitory computer readable medium as defined in claim 33, wherein the instructions, when executed, cause the processor to store the association of the workload and the resource in a blockchain.

41. A method, comprising:

detecting a request to obtain a resource request from a workload;

causing the workload to execute on the resource.

42. The method as defined in claim 41, wherein the workload is a virtual machine.

43. The method as defined in claim 41, further comprising: validating the resource request.

44. The method as defined in claim 41, wherein the resource request identifies a service level agreement.

45. The method as defined in claim 44, further comprising: determining whether the service level agreement identified in the resource request can be satisfied by any available resources.

46. The method as defined in claim 45, further comprising: prompting a user to provide a valid request in response to determining that the service level agreement cannot be satisfied.

47. The method as defined in claim 41, further comprising updating a class of service for the execution workload.

48. The method as defined in claim 41, further comprising storing an association of the workload and the resource in a blockchain.

49. An apparatus for managing a processing unit, comprising:

interface circuitry to detect a request to execute a deep neural network; and

a processor circuit comprising one or more of:

obtaining a service level agreement associated with the request;

determining a candidate set of operating parameters to service the request based on the service level agreement;

generating a kernel for a set of operating parameters from the candidate set; and

executing the kernel to determine performance of the kernel.
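As a non-limiting illustration, the candidate selection, kernel generation, and performance measurement recited in claim 49 might be sketched as follows. The sketch assumes the operating parameter is a tile size for a blocked matrix multiplication and that the service level agreement is a latency budget; the rule that a tight budget drops the smallest tile is an arbitrary placeholder, and all function names are hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]

def candidate_parameters(sla: Dict[str, float], hw_max_tile: int) -> List[dict]:
    """Build a candidate set from the SLA and a hardware limit (assumptions)."""
    tiles = [t for t in (2, 4, 8) if t <= hw_max_tile]
    if sla["max_latency_s"] < 0.001 and len(tiles) > 1:
        tiles = tiles[1:]  # placeholder rule: skip the smallest tile
    return [{"tile": t} for t in tiles]

def generate_kernel(params: dict) -> Callable[[Matrix, Matrix], Matrix]:
    """Generate a blocked matrix-multiply kernel for one parameter set."""
    tile = params["tile"]

    def kernel(a: Matrix, b: Matrix) -> Matrix:
        n = len(a)  # square matrices assumed
        c = [[0.0] * n for _ in range(n)]
        for i0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                for k0 in range(0, n, tile):
                    for i in range(i0, min(i0 + tile, n)):
                        for j in range(j0, min(j0 + tile, n)):
                            acc = c[i][j]
                            for k in range(k0, min(k0 + tile, n)):
                                acc += a[i][k] * b[k][j]
                            c[i][j] = acc
        return c

    return kernel

def measure(kernel, a: Matrix, b: Matrix) -> Tuple[Matrix, float]:
    """Execute the kernel and return its result and elapsed seconds."""
    start = time.perf_counter()
    result = kernel(a, b)
    return result, time.perf_counter() - start


sla = {"max_latency_s": 0.5}
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
results = []
for params in candidate_parameters(sla, hw_max_tile=4):
    product, seconds = measure(generate_kernel(params), a, b)
    results.append((params["tile"], product, seconds <= sla["max_latency_s"]))
```

Each entry in `results` records the tile size, the computed product, and whether the measured time met the assumed latency budget, which corresponds to the comparison described in claim 50.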

50. The apparatus as defined in claim 49, wherein the processor circuit executes the instructions to determine whether the performance meets the service level agreement.

51. The apparatus as defined in claim 49, wherein the processor circuit executes the instructions to determine the candidate set based on hardware capabilities of a computing system used to execute the kernel.

52. The apparatus as defined in claim 49, wherein the processor circuit executes the instructions to obtain an operational description associated with the request.

53. The apparatus as defined in claim 49, wherein the processor circuit executes the instructions to implement an application programming interface to receive the request.

54. The apparatus as defined in claim 53, wherein the application programming interface manages a plurality of heterogeneous processors.

55. The apparatus as defined in claim 53, wherein the application programming interface is included in a oneAPI framework.

56. A non-transitory computer readable medium comprising instructions that, when executed, cause a processor to at least:

detect a request to execute a deep neural network;

obtain a service level agreement associated with the request;

execute the kernel to determine performance of the kernel.

57. The non-transitory computer readable medium as defined in claim 56, wherein the instructions, when executed, cause the processor to determine whether the performance satisfies the service level agreement.

58. The non-transitory computer readable medium as defined in claim 56, wherein the instructions, when executed, cause the processor to determine the candidate set based on hardware capabilities of a computing system for executing the kernel.

59. The non-transitory computer readable medium as defined in claim 56, wherein the instructions, when executed, cause the processor to obtain an operational description associated with the request.

60. The non-transitory computer readable medium as defined in claim 56, wherein the instructions, when executed, cause the processor to implement an application programming interface to receive the request.

61. The non-transitory computer readable medium as defined in claim 60, wherein the application programming interface manages a plurality of heterogeneous processors.

62. The non-transitory computer readable medium as defined in claim 60, wherein the application programming interface is included in a oneAPI framework.

63. A method, comprising:

obtaining a service level agreement associated with the request;

executing the kernel to determine performance of the kernel.

64. The method as defined in claim 63, further comprising: determining whether the performance satisfies the service level agreement.

65. The method as defined in claim 63, further comprising: determining the candidate set based on hardware capabilities of a computing system used to execute the kernel.

66. The method as defined in claim 63, further comprising: obtaining an operational description associated with the request.

67. The method as defined in claim 63, further comprising: implementing an application programming interface to receive the request.

68. The method as defined in claim 67, wherein the application programming interface manages a plurality of heterogeneous processors.

69. The method as defined in claim 67, wherein the application programming interface is included in a oneAPI framework.