CN117632469A - Application programming interface for terminating software workload - Google Patents

Application programming interface for terminating software workload

Info

Publication number
CN117632469A
Authority
CN
China
Prior art keywords
api
workload
software
processor
operations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311084053.1A
Other languages
Chinese (zh)
Inventor
S·查特吉
S·纳塔拉詹
S·罗伊
M·科鲁波卢
N·维斯瓦纳坦
S·拉马穆尔蒂
R·H·穆昆丹
A·P·派森卡尔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nvidia Corp
Original Assignee
Nvidia Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US18/219,017 (published as US20240069973A1)
Application filed by Nvidia Corp filed Critical Nvidia Corp
Publication of CN117632469A

Abstract

Disclosed is an application programming interface for terminating a software workload, and in particular apparatuses, systems, and techniques for executing software workloads. In at least one embodiment, one or more circuits of a processor execute a first application programming interface to select a second application programming interface, wherein the second application programming interface terminates execution of one or more software workloads identified by the first application programming interface.

Description

Application programming interface for terminating software workload
Cross Reference to Related Applications
The present application claims the benefit of U.S. Provisional Application No. 63/400,887, entitled "MULTI-NODE LAUNCHER," filed on August 25, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
At least one embodiment relates to processing resources for managing one or more applications executing on a distributed system. For example, at least one embodiment relates to launching, monitoring, and/or terminating an application on a distributed system.
Background
Performing computing operations can consume significant amounts of memory, time, or computing resources. The amounts of memory and time used, and/or the use of resources (e.g., computing resources), may be improved. A computer program may be organized in different manners and in different orders, and may use multiple computer systems to execute its various components. While advances in computer hardware may accelerate or otherwise assist in the execution of the various components of a computer program, such advances often fail to account for all the different configurations of computer programs, as well as the various ways in which elements of a computer program are assigned to computer systems. Generating a computer program to perform computing operations, some of which are deployed to various systems, can cause delays in execution of the software program as those operations complete.
Drawings
FIG. 1 is a block diagram illustrating operations performed using a computing environment in accordance with at least one embodiment;
FIG. 2 is a block diagram illustrating an example system for launching and terminating a distributed application on multiple nodes using a multi-node launcher utility, in accordance with at least one embodiment;
FIG. 3 illustrates a process for launching and terminating a distributed application on multiple nodes using a multi-node launcher utility according to at least one embodiment;
FIG. 4 is a block diagram illustrating an example computing cluster of a high performance computer system in accordance with at least one embodiment;
FIG. 5 illustrates a process for launching one or more workloads using the high performance computing environment in accordance with at least one embodiment;
FIG. 6 illustrates a process of monitoring one or more workloads using the high performance computing environment in accordance with at least one embodiment;
FIG. 7 illustrates a process of terminating one or more workloads using the high performance computing environment in accordance with at least one embodiment;
FIG. 8 is a block diagram illustrating a software program to be executed by one or more processors in accordance with at least one embodiment;
FIG. 9 is a block diagram illustrating an Application Programming Interface (API) for launching one or more software workloads, in accordance with at least one embodiment;
FIG. 10 is a block diagram illustrating an Application Programming Interface (API) for monitoring one or more software workloads, in accordance with at least one embodiment;
FIG. 11 is a block diagram illustrating an Application Programming Interface (API) for terminating one or more software workloads, according to at least one embodiment;
FIG. 12 illustrates a process for executing one or more Application Programming Interfaces (APIs) in accordance with at least one embodiment;
FIG. 13 is a block diagram illustrating an example software stack for processing an Application Programming Interface (API) in accordance with at least one embodiment;
FIG. 14 is a block diagram illustrating a processor and modules in accordance with at least one embodiment;
FIG. 15 is a block diagram illustrating a driver and/or runtime including one or more libraries for providing one or more Application Programming Interfaces (APIs), in accordance with at least one embodiment;
FIG. 16 illustrates a distributed system in accordance with at least one embodiment;
FIG. 17 illustrates an exemplary data center in accordance with at least one embodiment;
FIG. 18 illustrates a client-server network in accordance with at least one embodiment;
FIG. 19 illustrates an example of a computer network in accordance with at least one embodiment;
FIG. 20A illustrates a networked computer system in accordance with at least one embodiment;
FIG. 20B illustrates a networked computer system in accordance with at least one embodiment;
FIG. 20C illustrates a networked computer system in accordance with at least one embodiment;
FIG. 21 illustrates one or more components of a system environment in which a service may be provided as a third party network service in accordance with at least one embodiment;
FIG. 22 illustrates a cloud computing environment in accordance with at least one embodiment;
FIG. 23 illustrates a set of functional abstraction layers provided by a cloud computing environment in accordance with at least one embodiment;
FIG. 24 illustrates a chip-level supercomputer in accordance with at least one embodiment;
FIG. 25 illustrates a rack module level supercomputer in accordance with at least one embodiment;
FIG. 26 illustrates a rack-level supercomputer in accordance with at least one embodiment;
FIG. 27 illustrates an overall system level supercomputer in accordance with at least one embodiment;
FIG. 28A illustrates inference and/or training logic in accordance with at least one embodiment;
FIG. 28B illustrates inference and/or training logic in accordance with at least one embodiment;
FIG. 29 illustrates training and deployment of a neural network in accordance with at least one embodiment;
FIG. 30 illustrates an architecture of a network system in accordance with at least one embodiment;
FIG. 31 illustrates an architecture of a network system in accordance with at least one embodiment;
FIG. 32 illustrates a control plane protocol stack in accordance with at least one embodiment;
FIG. 33 illustrates a user plane protocol stack in accordance with at least one embodiment;
fig. 34 illustrates components of a core network in accordance with at least one embodiment;
FIG. 35 illustrates components of a system supporting Network Function Virtualization (NFV) in accordance with at least one embodiment;
FIG. 36 illustrates a processing system in accordance with at least one embodiment;
FIG. 37 illustrates a computer system in accordance with at least one embodiment;
FIG. 38 illustrates a system in accordance with at least one embodiment;
FIG. 39 illustrates an exemplary integrated circuit in accordance with at least one embodiment;
FIG. 40 illustrates a computing system in accordance with at least one embodiment;
FIG. 41 illustrates an APU in accordance with at least one embodiment;
FIG. 42 illustrates a CPU in accordance with at least one embodiment;
FIG. 43 illustrates an exemplary accelerator integrated slice in accordance with at least one embodiment;
FIGS. 44A and 44B illustrate an exemplary graphics processor in accordance with at least one embodiment;
FIG. 45A illustrates a graphics core in accordance with at least one embodiment;
FIG. 45B illustrates a GPGPU in accordance with at least one embodiment;
FIG. 46A illustrates a parallel processor in accordance with at least one embodiment;
FIG. 46B illustrates a processing cluster in accordance with at least one embodiment;
FIG. 46C illustrates a graphics multiprocessor in accordance with at least one embodiment;
FIG. 47 illustrates a software stack of a programming platform in accordance with at least one embodiment;
FIG. 48 illustrates a CUDA implementation of the software stack of FIG. 47 in accordance with at least one embodiment;
FIG. 49 illustrates a ROCm implementation of the software stack of FIG. 47 in accordance with at least one embodiment;
FIG. 50 illustrates an OpenCL implementation of the software stack of FIG. 47 in accordance with at least one embodiment;
FIG. 51 illustrates software supported by a programming platform in accordance with at least one embodiment;
FIG. 52 illustrates compiled code for execution on the programming platform of FIGS. 47-50 in accordance with at least one embodiment; and
FIG. 53 illustrates components of a system for accessing a large language model in accordance with at least one embodiment.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of at least one embodiment. It will be apparent, however, to one skilled in the art that the concepts of the invention may be practiced without one or more of these specific details.
In at least one embodiment, a distributed deep learning application requires a multi-node launcher utility to launch and terminate the distributed application on multiple nodes. In at least one embodiment, multi-node launcher utilities exist in the field of High Performance Computing (HPC) but have significant drawbacks. In at least one embodiment, one drawback is that an HPC multi-node launcher is not aware of the setup requirements of a Deep Learning (DL) framework and is not integrated with DL workloads. In at least one embodiment, another drawback is that HPC multi-node launchers do not work "out of the box" when used within a container and typically require platform-specific setup.
In at least one embodiment, ideally, a multi-node launcher on an AI training platform should provide a unified start and stop mechanism for both HPC and DL applications running within containers, using the same unified Application Programming Interface (API). In at least one embodiment, there is a need for a launcher that can handle applications such as MPI, PyTorch, and TensorFlow, that has a unified API, and that does not require modification of existing applications.
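The unified launcher described above can be pictured as a single API surface that dispatches to framework-specific launch mechanisms. The following sketch is illustrative only; the class, method names, and command strings are hypothetical and not taken from the patent.

```python
# Hypothetical sketch of a unified multi-node launcher: one API surface
# that selects a framework-specific launch backend for MPI, PyTorch, or
# TensorFlow workloads. All names and command formats are illustrative.

class UnifiedLauncher:
    """Selects a framework-specific backend behind one unified API."""

    # Mapping from framework name to the backend launch routine it selects.
    _backends = {
        "mpi": lambda cmd, nodes: f"mpirun -np {nodes} {cmd}",
        "pytorch": lambda cmd, nodes: f"torchrun --nnodes {nodes} {cmd}",
        "tensorflow": lambda cmd, nodes: f"python {cmd} --num-nodes {nodes}",
    }

    def launch(self, framework: str, command: str, num_nodes: int) -> str:
        """Unified entry point: the same call shape for every framework."""
        if framework not in self._backends:
            raise ValueError(f"unsupported framework: {framework}")
        return self._backends[framework](command, num_nodes)

launcher = UnifiedLauncher()
print(launcher.launch("mpi", "train.py", 4))  # prints: mpirun -np 4 train.py
```

Because the existing application (here, `train.py`) is passed through unmodified, the same client call works for an MPI solver or a PyTorch training script.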
FIG. 1 is a block diagram 100 illustrating operations performed using a computing environment in accordance with at least one embodiment. In at least one embodiment, the processor 114 of the client environment 112 includes one or more circuits for causing one or more Application Programming Interfaces (APIs) to perform the operations described herein. In at least one embodiment, an Application Programming Interface (API) specifies or otherwise indicates one or more operations to be performed by a processor, such as described herein, to cause the processor to perform one or more operations (e.g., to initiate a software workload, monitor a software workload, terminate a software workload, and/or other such operations described herein). In at least one embodiment, the client environment 112 is an environment of one or more client devices, such as those described herein. In at least one embodiment, not shown in fig. 1, client environment 112 is an environment of one or more client devices that are clients of a cloud computing environment such as described herein. In at least one embodiment, processor 114 is a processor such as those described below. In at least one embodiment, processor 114 is a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Parallel Processing Unit (PPU), a General Purpose Graphics Processing Unit (GPGPU), a computing cluster, and/or a combination of these and/or other such processors. In at least one embodiment, the processor 114 is part of a computer system such as described herein. In at least one embodiment, the APIs are APIs such as those described herein in connection with at least FIGS. 8-15.
In at least one embodiment, one or more circuits of the processor 114 cause the start workload 116 operations to be performed. In at least one embodiment, launching the workload 116 includes executing operations of the software workload using the computing environment 102. For example, in at least one embodiment, initiating workload 116 includes: initiating a workload to perform neural network training operations (as described herein), initiating a workload to perform one or more molecular chemistry analyses, initiating a workload to execute one or more large language models (e.g., as described herein at least in connection with fig. 53), and/or other operations described herein. In at least one embodiment, the start workload 116 is executed using an API, such as the start workload API 902 described herein in connection with at least FIG. 9. In at least one embodiment, the launch workload 116 is to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, not shown in FIG. 1, the start workload 116 includes one or more parameters, including but not limited to those described herein with respect to FIG. 9. In at least one embodiment, the start workload 116, when executed, starts a single workload or a single job. In at least one embodiment, the start workload 116, when executed, starts a single workload with multiple jobs. In at least one embodiment, the start workload 116, when executed, starts a plurality of workloads.
In at least one embodiment, the start-up workload 116 is indicated, sent, or otherwise provided to the computing environment 102. In at least one embodiment, the computing environment 102 is a high performance computing environment. In at least one embodiment, the computing environment 102 is a deep learning environment. In at least one embodiment, the startup workload 116 is performed using the computing environment 102 and the systems, methods, operations, and/or techniques described herein. In at least one embodiment, the computing environment 102 is a cloud computing environment such as those described herein. In at least one embodiment, the computing environment 102 includes one or more processors 104. In at least one embodiment, the computing environment 102 includes one or more graphics processors 106. In at least one embodiment, the processor 104 includes one or more processors such as those described herein. In at least one embodiment, graphics processor 106 is one or more graphics processors such as those described herein. In at least one embodiment, not shown in fig. 1, processor 104 and/or graphics processor 106 includes one or more Central Processing Units (CPUs), Graphics Processing Units (GPUs), Parallel Processing Units (PPUs), General Purpose Graphics Processing Units (GPGPUs), computing clusters, and/or combinations of these and/or other such processors as described herein.
In at least one embodiment, at least a portion of the startup workload 116 is executed using one or more of the processors 104 and/or one or more of the graphics processors 106. In at least one embodiment, the processor 104 includes one or more processors such as those described herein. In at least one embodiment, graphics processor 106 includes one or more graphics processors such as those described herein. In at least one embodiment, one or more of the processors 104 are connected together using systems and methods such as those described herein. For example, in at least one embodiment, at least some of the processors 104 are connected together using one or more clusters (such as cluster 402 described herein in connection with at least fig. 4). In at least one embodiment, one or more of the graphics processors 106 are connected together using systems and methods such as those described herein. For example, in at least one embodiment, at least some of the graphics processors 106 are connected together using one or more clusters (such as cluster 402 described herein in connection with at least FIG. 4).
In at least one embodiment, not shown in fig. 1, when the startup workload 116 is executed using the computing environment 102, one or more additional APIs such as described herein are executed (e.g., using the processor 104 and/or the graphics processor 106). In at least one embodiment, not shown in FIG. 1, when the startup workload 116 is executed using the computing environment 102, at least a portion of the startup workload 116 is executed using one or more additional processors of the computing environment 102 (e.g., one or more of the processors 104 and/or one or more of the graphics processors 106). In at least one embodiment, the computing environment 102 indicates, transmits, or otherwise provides one or more responses to the client environment 112, including but not limited to those described herein in connection with at least fig. 9. In at least one embodiment, the computing environment 102 indicates, transmits, or otherwise provides the job ID 118 (e.g., job identifier) to the client environment 112. In at least one embodiment, job ID 118 indicates one or more processes in computing environment 102 for executing start workload 116. In at least one embodiment, not shown in FIG. 1, job ID 118 includes a plurality of identifiers in computing environment 102 for executing a process that initiates workload 116.
In at least one embodiment, one or more circuits of processor 114 cause monitoring workload 120 operations to be performed. In at least one embodiment, the monitoring workload 120 is performed using an API such as the monitoring workload API 1002 described herein in connection with at least FIG. 10. In at least one embodiment, the monitoring workload 120 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, not shown in fig. 1, the monitoring workload 120 includes one or more parameters including, but not limited to, those described herein with respect to fig. 10. In at least one embodiment, the one or more parameters include one or more job IDs (e.g., job ID 118) indicating a process to be monitored. In at least one embodiment, the one or more job IDs indicate processes in the computing environment 102 for executing the start-up workload 116, as described above. In at least one embodiment, the monitoring workload 120, when executed, monitors a single workload (e.g., a workload corresponding to a job ID such as job ID 118). In at least one embodiment, monitoring workload 120, when executed, monitors multiple workloads (e.g., workloads corresponding to a single job ID, such as job ID 118).
In at least one embodiment, the monitoring workload 120 is indicated, sent, or otherwise provided to the computing environment 102. In at least one embodiment, monitoring the workload 120 is performed using the computing environment 102 using the systems, methods, operations, and/or techniques described herein. In at least one embodiment, at least a portion of the monitoring workload 120 is performed using one or more of the processors 104 and/or one or more of the graphics processors 106.
In at least one embodiment, not shown in fig. 1, when monitoring workload 120 is performed using computing environment 102, one or more additional APIs such as described herein are performed (e.g., using processor 104 and/or graphics processor 106). In at least one embodiment, not shown in fig. 1, when monitoring workload 120 is performed using computing environment 102, at least a portion of monitoring workload 120 is performed using one or more additional processors of computing environment 102 (e.g., one or more of processors 104 and/or one or more of graphics processors 106). In at least one embodiment, the computing environment 102 indicates, transmits, or otherwise provides one or more responses to the monitoring workload 120, including but not limited to those described herein in connection with at least fig. 10. In at least one embodiment, the computing environment 102 indicates, transmits, or otherwise provides the state 122 (e.g., the state of one or more monitored workloads) to the client environment 112. In at least one embodiment, the state 122 indicates a state of one or more monitored workloads, such as, for example, running, waiting, terminated, one or more error conditions, and the like. In at least one embodiment, as described above, the state 122 indicates a state of one or more processes in the computing environment 102 used to execute the start workload 116. In at least one embodiment, not shown in FIG. 1, state 122 includes a plurality of states of processes in the computing environment 102 used to execute the start workload 116, as described above.
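The monitoring flow can be sketched as a call that takes one or more job IDs and responds with a per-workload state such as running, waiting, terminated, or an error condition. All names are hypothetical and the environment is simulated in memory.

```python
# Hypothetical sketch of a monitor-workload API: given one or more job
# IDs, it responds with a state for each identified workload (running,
# waiting, terminated, or an error condition). Names are illustrative.
_jobs = {"job-1": "running", "job-2": "waiting"}  # simulated environment

def monitor_workload(*job_ids: str) -> dict:
    """Return the state of each workload identified by a job ID."""
    return {jid: _jobs.get(jid, "error: unknown job") for jid in job_ids}

print(monitor_workload("job-1", "job-9"))
# prints: {'job-1': 'running', 'job-9': 'error: unknown job'}
```

A single call can thus report on one workload or on many, matching the description above of monitoring a single workload or a plurality of workloads.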
In at least one embodiment, one or more circuits of processor 114 cause termination workload 124 operations to be performed. In at least one embodiment, the termination workload 124 is executed using an API such as the termination workload API 1102 described herein in connection with at least FIG. 11. In at least one embodiment, the termination workload 124 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, not shown in FIG. 1, terminating workload 124 includes one or more parameters including, but not limited to, those described herein with respect to FIG. 11. In at least one embodiment, the one or more parameters include one or more job IDs (e.g., job ID 118) indicating a process to be terminated. In at least one embodiment, the one or more job IDs indicate a process in the computing environment 102 for executing the start-up workload 116, as described above. In at least one embodiment, termination workload 124, when executed, causes a single workload (e.g., a workload corresponding to a job ID such as job ID 118) to be terminated. In at least one embodiment, termination workload 124, when executed, causes a plurality of workloads (e.g., workloads corresponding to a single job ID, such as job ID 118) to be terminated.
In at least one embodiment, the termination workload 124 is indicated, sent, or otherwise provided to the computing environment 102. In at least one embodiment, the termination workload 124 is performed using the computing environment 102 using the systems, methods, operations, and/or techniques described herein. In at least one embodiment, at least a portion of termination workload 124 is performed using one or more of processors 104 and/or one or more of graphics processors 106.
In at least one embodiment, not shown in fig. 1, when the termination workload 124 is executed using the computing environment 102, one or more additional APIs (such as those described herein) are executed (e.g., using the processor 104 and/or the graphics processor 106). In at least one embodiment, not shown in FIG. 1, when termination workload 124 is executed using computing environment 102, at least a portion of termination workload 124 is executed using one or more additional processors in computing environment 102 (e.g., one or more of processors 104 and/or one or more of graphics processors 106). In at least one embodiment, the computing environment 102 indicates, sends, or otherwise provides one or more responses to the terminating workload 124, including but not limited to those described herein in connection with at least fig. 11. In at least one embodiment, the computing environment 102 indicates, sends, or otherwise provides the state 126 (e.g., the state of one or more workloads to terminate) to the client environment 112. In at least one embodiment, the state 126 indicates a state of one or more workloads to terminate, such as, for example, terminated, not terminated, one or more error conditions, and the like. In at least one embodiment, the state 126 indicates a state in the computing environment 102 for executing one or more processes that launch the workload 116, as described above. In at least one embodiment, not shown in FIG. 1, state 126 includes a plurality of states in computing environment 102 for executing processes that launch workload 116, as described above.
In at least one embodiment, one or more processors (e.g., processor 114, one or more of processors 104, one or more of graphics processors 106, and/or other processors and/or accelerators such as described herein) include one or more circuits to perform operations or instructions described herein, such as one or more circuits to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, the one or more processors include one or more circuits to perform the operations or instructions described herein, such as one or more circuits to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API using an API, such as the launch workload API 902 described herein in connection with at least fig. 9. In at least one embodiment, one or more processors include one or more circuits to perform operations or instructions described herein, such as one or more circuits to execute a first Application Programming Interface (API) to cause a second API to be executed, thereby causing one or more software workloads to be executed by one or more other processors. In at least one embodiment, not shown in fig. 1, a set of instructions stored on a machine-readable medium, which if executed by one or more processors, is to perform the operations described herein in connection with at least fig. 1-15, such as having a first Application Programming Interface (API) select a second API to perform the operations of one or more software workloads identified by the first API. 
In at least one embodiment, one or more processors include one or more circuits to perform operations or instructions to cause a first Application Programming Interface (API) to select a second API to execute one or more workloads identified by the first API by at least executing the operations or instructions to execute the first API to cause the second API to be executed, thereby causing execution of one or more software workloads by one or more other processors.
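The pattern described above, in which a first API selects a second API that then acts on the workloads the first API identifies, can be illustrated with the following sketch. Every function name is hypothetical; the point is only the two-step selection-then-execution structure.

```python
# Hypothetical sketch of the selection pattern: a first API selects a
# second API, and the second API acts on the one or more workloads the
# first API identifies. All names are illustrative, not from the patent.

def _launch_backend(workload_ids):
    """A second API: causes the identified workloads to be executed."""
    return {wid: "running" for wid in workload_ids}

def _terminate_backend(workload_ids):
    """A second API: causes the identified workloads to be terminated."""
    return {wid: "terminated" for wid in workload_ids}

# Second APIs the first API may select among, keyed by operation.
_second_apis = {"launch": _launch_backend, "terminate": _terminate_backend}

def first_api(operation: str, workload_ids):
    """First API: selects a second API for the identified workloads."""
    second_api = _second_apis[operation]  # selection step
    return second_api(workload_ids)       # execution by the second API

print(first_api("terminate", ["wl-1", "wl-2"]))
# prints: {'wl-1': 'terminated', 'wl-2': 'terminated'}
```

The client calls only `first_api`; which backend actually runs, and on which processors, is decided by the selection step.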
In at least one embodiment, one or more processors (e.g., processor 114, one or more of processors 104, one or more of graphics processors 106, and/or other processors and/or accelerators such as described herein) include one or more circuits to perform operations or instructions described herein, such as one or more circuits to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, the one or more processors include one or more circuits to perform the operations or instructions described herein, such as one or more circuits to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API using an API, such as monitoring workload API 1002 described herein in connection with at least fig. 10. In at least one embodiment, the one or more processors include one or more circuits to perform the operations or instructions described herein, such as one or more circuits to execute a first Application Programming Interface (API) to cause a second API to be executed such that a state of one or more software workloads is provided. In at least one embodiment, not shown in fig. 1, a set of instructions stored on a machine-readable medium that, if executed by one or more processors, perform the operations described herein in connection with at least fig. 1-15, such as executing a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. 
In at least one embodiment, one or more processors include one or more circuits to perform operations or instructions to execute a first Application Programming Interface (API) to select a second API by at least executing the operations or instructions to execute the first API to cause the second API to be executed, such that a state of one or more software workloads is provided, to monitor execution of the one or more software workloads identified by the first API.
In at least one embodiment, one or more processors (e.g., processor 114, one or more of processors 104, one or more of graphics processors 106, and/or other processors and/or accelerators such as described herein) include one or more circuits to perform operations or instructions described herein, such as one or more circuits to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, the one or more processors include one or more circuits to perform the operations or instructions described herein, such as one or more circuits to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API using an API, such as termination workload API 1102 described herein in connection with fig. 11 at least. In at least one embodiment, one or more processors include one or more circuits to perform the operations or instructions described herein, such as one or more circuits to execute a first Application Programming Interface (API) to cause a second API to be executed such that one or more software workloads being executed by one or more other processors are terminated. In at least one embodiment, not shown in fig. 1, a set of instructions stored on a machine-readable medium, which if executed by one or more processors, is to perform the operations described herein in connection with at least fig. 1-15, such as executing a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. 
In at least one embodiment, one or more processors include one or more circuits to perform operations or instructions to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API, by at least executing the operations or instructions to execute the first API to cause the second API to be executed, thereby causing termination of one or more software workloads being executed by one or more other processors.
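The first-API-selects-second-API pattern described above can be sketched as follows. This is an illustrative sketch only: the text does not define concrete signatures, so `terminate_workload_api`, the backend names, and the backend functions below are hypothetical stand-ins.

```python
def _terminate_via_scheduler(workload_ids):
    """Hypothetical second API: asks a scheduler to stop the workloads."""
    return {wid: "terminated" for wid in workload_ids}

def _terminate_via_signal(workload_ids):
    """Hypothetical second API: signals worker processes directly."""
    return {wid: "terminated" for wid in workload_ids}

# Candidate "second APIs" the first API may select among (hypothetical).
_BACKENDS = {
    "scheduler": _terminate_via_scheduler,
    "signal": _terminate_via_signal,
}

def terminate_workload_api(workload_ids, backend="scheduler"):
    """Hypothetical first API: selects a second API and causes it to be
    executed, terminating the workloads identified by its arguments."""
    second_api = _BACKENDS.get(backend)
    if second_api is None:
        return {"error": f"no termination backend named {backend!r}"}
    return second_api(workload_ids)
```

In this sketch the selection is a simple table lookup; the embodiments above leave the selection mechanism open.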
FIG. 2 is a block diagram 200 illustrating an example system for launching and terminating a distributed application on multiple nodes using a multi-node launcher utility, in accordance with at least one embodiment. In at least one embodiment, the system illustrated in block diagram 200 is a collection of one or more hardware and/or software computing resources having instructions that, when executed, perform one or more communication processes such as those described herein. In at least one embodiment, the system shown in block diagram 200 is a software program executing on computer hardware, an application executing on computer hardware, and/or variations thereof. In at least one embodiment, one or more processes of the system shown in block diagram 200 are performed by any suitable processing system or unit (e.g., a Graphics Processing Unit (GPU), a General Purpose GPU (GPGPU), a Parallel Processing Unit (PPU), a Central Processing Unit (CPU), as described below) in any suitable manner, including sequentially, in parallel, and/or variations thereof. In at least one embodiment, the system shown in block diagram 200 is a software program executing on one or more processors (such as one or more of processors 114, 104, and/or one or more of graphics processors 106 described herein in connection with at least fig. 1).
In at least one embodiment, user 202 may be a user of a computer system. In at least one embodiment, the network 204 may be a network as described with respect to FIGS. 16-53. In at least one embodiment, the control plane 206 can be a control plane of an AI training platform for managing containerized workloads. In at least one embodiment, control plane 206 is a control plane such as those described herein in connection with at least FIGS. 30-33. In at least one embodiment, control plane 206 includes a plurality of modules such as those described herein in connection with FIGS. 16-53. In at least one embodiment, the control plane 206 includes a scheduler, job controller, cluster agent, API server, MPI operator, or any combination thereof. In at least one embodiment, control plane 206 sends one or more workloads to nodes 210 and 214. In at least one embodiment, node 210 and node 214 are nodes such as node 404, node 434, and/or node 436, as described herein at least in connection with FIG. 4. In at least one embodiment, node 210 includes a container 212. In at least one embodiment, the container includes a ready-to-run software package that contains the software and/or data needed to run an application. In at least one embodiment, a container, such as container 212, includes code, required runtime information, application libraries, system libraries, environment variables, and/or any default values for basic settings. In at least one embodiment, the control plane 206 automates container operations. In at least one embodiment, the control plane 206 groups the containers that make up an application into logical units. In at least one embodiment, the control plane 206 allows for clustering of groups of hosts running container applications, and the system helps manage those clusters. In at least one embodiment, node 214 includes container 216. In at least one embodiment, container 216 is a container such as container 212, as described above.
In at least one embodiment, container 212 sends one or more workload APIs 208 to container 216 using systems and methods as described herein. In at least one embodiment, node 210 and node 214 are in communication with a network 218. In at least one embodiment, the network 218 is an InfiniBand network (e.g., a network with high throughput and/or low latency used in a high performance computing environment). In at least one embodiment, network 218 is a channel-based fabric network that may facilitate high-speed communications between interconnected nodes.
In at least one embodiment, the workload APIs 208 include APIs of a multi-node application launcher utility on an AI training platform and GPU cloud cluster, such as those described herein (e.g., the start workload API 902, the monitor workload API 1002, and/or the terminate workload API 1102 described herein). In at least one embodiment, the workload API 208 includes a multi-node initiator utility for initiating, monitoring, and terminating one or more distributed applications on one or more nodes, as described herein. In at least one embodiment, the workload API 208 is used to provide a unified launch mechanism for High Performance Computing (HPC) and Deep Learning (DL) applications running within an Artificial Intelligence (AI) training platform container environment, such as described herein. In at least one embodiment, the workload API 208 abstracts the framework-specific environment required by a distributed DL application (e.g., distributed PyTorch or TensorFlow). In at least one embodiment, the workload API 208 allows a user (such as user 202) to submit commands as part of a batch script. In at least one embodiment, the workload API 208 allows a user (such as the user 202) to submit commands using APIs (such as the start workload API 902 described herein at least in connection with FIG. 9, the monitor workload API 1002 described herein at least in connection with FIG. 10, and/or the terminate workload API 1102 described herein at least in connection with FIG. 11).
In at least one embodiment, not shown in fig. 2, the environment for supporting one or more workload APIs 208 is automatically injected into a container at "/usr/local/bin" (e.g., as described herein at least in connection with fig. 3). In at least one embodiment, for example, injecting an environment for supporting one or more workload APIs 208 comprises: software is added to execute one or more workload APIs 208 when the container (e.g., container 212 and/or container 216) is defined, specified, instantiated, or otherwise created. In at least one embodiment, when an environment for supporting one or more workload APIs 208 is automatically injected into a container, the workload APIs 208 are always available to applications running on an AI training platform using the container. In at least one embodiment, the workload API initiates processes (e.g., remote processes) using an open source platform for managing the workload and services of the container, and manages them with other processes (e.g., local processes). In at least one embodiment, the environment variables and binding information are propagated to all nodes during startup. In at least one embodiment, during termination, one or more of the workload APIs 208 manage the propagation of status indicators, error indicators, debug information, and the like to calling processes.
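The injection described above can be sketched as follows, assuming a container spec represented as a Python dict. The tool names and the `inject_workload_api` helper are hypothetical; the text only says that software for the workload APIs is added at "/usr/local/bin" when the container is created.

```python
# Hypothetical names for the injected workload-API tools.
LAUNCHER_TOOLS = ["start_workload", "monitor_workload", "terminate_workload"]

def inject_workload_api(container_spec, install_dir="/usr/local/bin"):
    """Return a copy of a (hypothetical) container spec with the
    workload-API tools recorded at the given install directory, so any
    application running in the container can invoke them."""
    spec = dict(container_spec)
    injected = list(spec.get("injected_files", []))
    injected += [f"{install_dir}/{tool}" for tool in LAUNCHER_TOOLS]
    spec["injected_files"] = injected
    return spec
```

Because the injection happens when the container is defined or instantiated, the workload APIs are available to every application in the container without per-application setup.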
In at least one embodiment, the parameters of the workload API 208 include necessary parameters and optional parameters. In at least one embodiment, the necessary parameters to launch a workload API, such as launch workload API 902, include a command to launch the workload (e.g., as a string). For example, in at least one embodiment, the parameters for launching the workload include a string such as "--cmd 'python train.py'".
In at least one embodiment, the parameters of the workload API 208 include one or more optional parameters, such as the number of nodes to run (e.g., as an integer). In at least one embodiment, the parameters of the workload API 208 include an optional parameter for a range of minimum to maximum numbers of nodes. In at least one embodiment, the minimum value of the range is 1 and the maximum value of the range is R, where R is the number of replicas requested by the GPU cloud job.
In at least one embodiment, the parameters of the workload API 208 include one or more optional parameters, such as the number of tasks per node to be run (e.g., as an integer). In at least one embodiment, the default value of the number of tasks per node to be run is 1.
In at least one embodiment, the parameters of the workload API 208 include one or more optional parameters, such as environment variables set in a "key=value" format (e.g., as a string). In at least one embodiment, the environment variables include a string such as "--env 'var1=value1' --env 'var2=value2'". In at least one embodiment, different environment variables require separate --env flags, one for each environment variable.
In at least one embodiment, the parameters of the workload API 208 include one or more optional parameters, such as a base directory from which commands are run (e.g., as a string). In at least one embodiment, the default value of the base directory parameter is the working directory. In at least one embodiment, the directory variable includes a string such as "--workdir '$WORK_HOME/scripts' --env 'WORK_HOME=/mnt/workspace'".
In at least one embodiment, the parameters of the workload API 208 include one or more optional parameters, such as one or more external initiators for initiating a workload (e.g., as a string). For example, in at least one embodiment, the supported external initiators are "mpirun" or "horovodrun" or some other such initiator. In at least one embodiment, mpirun maps to one or more OpenMPI options. In at least one embodiment, horovodrun maps to one or more Horovod options. In at least one embodiment, the option assumes that the initiator is present and accessible. In at least one embodiment, initiator-specific parameters (not part of the script name options) are provided as suffixes. For example, in at least one embodiment, the external initiator string includes "--launcher 'mpirun --allow-run-as-root'".
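The required and optional parameters above can be assembled into a launcher invocation as sketched below. The flag names follow the strings quoted in the text (`--cmd`, `--env`, `--workdir`, `--launcher`); the `build_launch_argv` helper, the `start_workload` executable name, and the remaining flag spellings are hypothetical.

```python
def build_launch_argv(cmd, nodes=None, tasks_per_node=None,
                      env=None, workdir=None, launcher=None):
    """Assemble an argv for a (hypothetical) launch-workload utility from
    the required --cmd parameter and the optional parameters described
    in the text."""
    argv = ["start_workload", "--cmd", cmd]  # command to launch, as a string
    if nodes is not None:
        argv += ["--nodes", str(nodes)]  # number of nodes to run
    if tasks_per_node is not None:
        argv += ["--tasks-per-node", str(tasks_per_node)]
    # Each environment variable needs its own --env flag, per the text.
    for key, value in (env or {}).items():
        argv += ["--env", f"{key}={value}"]
    if workdir is not None:
        argv += ["--workdir", workdir]  # base directory to run commands from
    if launcher is not None:
        argv += ["--launcher", launcher]  # external initiator, e.g. mpirun
    return argv
```

For example, `build_launch_argv("python train.py", nodes=2, env={"var1": "value1"})` yields an argv beginning `["start_workload", "--cmd", "python train.py", ...]`.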
In at least one embodiment, the parameters of the workload API 208 include one or more optional parameters for specifying one or more execution modes, including but not limited to enabling asynchronous failure support (e.g., a sub-process launched by the script may exit upon failure without stopping the program). In at least one embodiment, the optional parameter "enable run asynchronous failure support" means that the program will continue to run while at least one sub-process is running. In at least one embodiment, the optional parameter "enable run asynchronous failure support" overrides the default semantics of the script, which are to stop the program when any sub-process initiated by the script exits due to an error.
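The two failure semantics above can be sketched as follows, using exit codes in place of real sub-processes; the `run_children` function and its return shape are hypothetical illustrations of the described behavior.

```python
def run_children(exit_codes, async_failure_support=False):
    """Simulate launching children in order. By default, stop the program
    as soon as any child exits with an error; with asynchronous failure
    support enabled, keep going while work remains."""
    completed = []
    for i, code in enumerate(exit_codes):
        completed.append((i, code))  # record that this child finished
        if code != 0 and not async_failure_support:
            # Default semantics: any erroring sub-process stops the program.
            return {"status": "stopped_on_error", "completed": completed}
    return {"status": "finished", "completed": completed}
```

With `async_failure_support=True`, a failing child no longer aborts the remaining children, matching the "continue to run while at least one sub-process is running" semantics.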
In at least one embodiment, the parameters of the workload API 208 include one or more optional "binding" parameters that bind processes to CPU cores. In at least one embodiment, the "bind" option only applies when the array type is PYTORCH. In at least one embodiment, the optional "binding" parameter includes whether a non-uniform memory access (NUMA) binding option is available. In at least one embodiment, the binding parameters include an optional parameter "node," where a process is bound to the CPUs within its NUMA node. In at least one embodiment, on a compute node supporting a GPU, processes are bound to all CPUs of the associated NUMA node (e.g., local rank is mapped to GPU id), and the total number of ranks is limited to the total number of GPUs. In at least one embodiment, for example, given 2 NUMA nodes N{0,1}, each node having 4 GPUs and 32 CPUs C{0-31,32-63}, 8 processes P{0-7} would be mapped as P{0-3}:N0:C{0-31}, P{4-7}:N1:C{32-63}.
In at least one embodiment, the binding has an optional parameter "exclusive," where a process is bound to an exclusive set of CPUs within the NUMA node. In at least one embodiment, on a compute node supporting a GPU, a process is bound to an exclusive set of CPUs within the associated NUMA node (local rank is mapped to GPU id), and the total number of ranks is limited to the total number of GPUs. In at least one embodiment, for example, given 2 NUMA nodes N{0,1}, each node having 4 GPUs and 32 CPUs C{0-31,32-63}, 8 processes P{0-7} would be mapped as P0:N0:C{0-7}, P1:N0:C{8-15}, P2:N0:C{16-23}, P3:N0:C{24-31}, P4:N1:C{32-39}, P5:N1:C{40-47}, P6:N1:C{48-55}, P7:N1:C{56-63}.
In at least one embodiment, the binding has an optional parameter "core-complex," where a process is bound to a core complex, e.g., the CPUs sharing a last-level cache. In at least one embodiment, on a compute node supporting a GPU, processes are bound to the core complexes of the associated NUMA node (local rank is mapped to GPU id), and the total number of ranks is limited to the total number of GPUs. In at least one embodiment, for example, given 2 NUMA nodes N{0,1}, each node having 2 GPUs and 4 core complexes X{0-3,4-7}, 4 processes P{0-3} would be mapped as P0:N0:X0, P1:N0:X1, P2:N1:X4, P3:N1:X5.
In at least one embodiment, the binding has an optional parameter "socket," where a process is bound to the CPUs within a socket. In at least one embodiment, on a compute node supporting a GPU, a process is bound to the CPUs of the socket containing the associated NUMA node (local rank is mapped to GPU id), and the total number of ranks is limited to the total number of GPUs. In at least one embodiment, for example, given 2 sockets S{0,1}, each socket having 4 GPUs and 64 CPUs C{0-63,64-127}, 8 processes P{0-7} would be mapped as P{0-3}:S0:C{0-63}, P{4-7}:S1:C{64-127}.
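The "node" and "exclusive" binding examples above can be reproduced with the sketch below. The function names are hypothetical; the sketch assumes ranks are assigned to NUMA nodes in GPU order and that the rank count equals the GPU count, as in the worked examples.

```python
def bind_node(num_numa, gpus_per_node, cpus_per_node):
    """'node' mode: each rank is bound to all CPUs of its NUMA node."""
    binding = {}
    for rank in range(num_numa * gpus_per_node):
        node = rank // gpus_per_node  # local rank maps to GPU id on that node
        first_cpu = node * cpus_per_node
        binding[rank] = (node, list(range(first_cpu, first_cpu + cpus_per_node)))
    return binding

def bind_exclusive(num_numa, gpus_per_node, cpus_per_node):
    """'exclusive' mode: each rank gets a disjoint CPU slice of its node."""
    per_rank = cpus_per_node // gpus_per_node
    binding = {}
    for rank in range(num_numa * gpus_per_node):
        node, local = divmod(rank, gpus_per_node)
        first_cpu = node * cpus_per_node + local * per_rank
        binding[rank] = (node, list(range(first_cpu, first_cpu + per_rank)))
    return binding
```

With 2 NUMA nodes, 4 GPUs, and 32 CPUs per node, `bind_exclusive` gives rank 1 CPUs 8-15 on node 0 and rank 4 CPUs 32-39 on node 1, matching the "exclusive" example in the text.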
In at least one embodiment, the parameters of the workload API 208 include necessary parameters and/or optional parameters, such as parameters described at least in connection with the start workload API 902 described herein at least in connection with FIG. 9, the monitor workload API 1002 described herein at least in connection with FIG. 10, and/or the end workload API 1102 described herein at least in connection with FIG. 11.
FIG. 3 illustrates a process 300 for launching and terminating a distributed application on multiple nodes using a multi-node launcher utility in accordance with at least one embodiment. In at least one embodiment, some or all of process 300 (or any other process described herein, or variations and/or combinations thereof) is performed under control of one or more computer systems configured with computer-executable instructions (such as the computer systems described in fig. 16-53), and is implemented by hardware, software, or combinations thereof as code (e.g., computer-executable instructions, one or more computer programs, or one or more applications) that is executed collectively on one or more processors. In at least one embodiment, the code is stored on a computer readable storage medium in the form of a computer program comprising a plurality of computer readable instructions executable by one or more processors. In at least one embodiment, the computer-readable storage medium is a non-transitory computer-readable medium. In at least one embodiment, a processor, such as processor 114, one or more of processors 104, one or more of graphics processors 106 (all as described herein at least in connection with FIG. 1), performs one or more steps of process 300 using systems, methods, operations, and techniques, such as those described herein, to launch and terminate distributed applications on multiple nodes using a multi-node launcher utility.
In at least one embodiment, the system that executes at least a portion of process 300 to launch and terminate distributed applications on multiple nodes using a multi-node launcher utility includes executable code for injecting at least an environment into a container that supports one or more workload APIs (e.g., as described in connection with FIG. 2, and as described in connection with step 302, below). In at least one embodiment, the workload API parameters are performed (e.g., as described herein at least in connection with FIGS. 1-13, and also as described in connection with step 304, described below). In at least one embodiment, the command is run based on the workload API and parameters (e.g., as described in connection with step 306, described below).
In at least one embodiment, at step 302 of process 300 for starting and terminating a distributed application on multiple nodes using a multi-node starter utility, a system such as the system shown in FIG. 1 (such as using computing environment 102) includes executable code for injecting an environment supporting a workload API into a directory of executable code of a container (e.g., /usr/local/bin). In at least one embodiment, at step 302, a system, such as the system shown in FIG. 1, includes executable code for injecting an environment supporting a workload API into a container using systems, methods, processes, and techniques, such as those described at least in connection with FIG. 2. In at least one embodiment, after step 302, the process 300 of starting and terminating a distributed application on a plurality of nodes using a multi-node initiator utility continues at step 304.
In at least one embodiment, at step 304 of process 300 for starting and terminating a distributed application on multiple nodes using a multi-node starter utility, a system, such as computing environment 102 shown in FIG. 1, includes executable code for executing one or more workload APIs having specified parameters. In at least one embodiment, at 304, the executable code for executing one or more workload APIs having specified parameters includes executable code for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API using at least the launch workload API 902 described herein in connection with FIG. 9. In at least one embodiment, at 304, the executable code for executing one or more workload APIs having specified parameters includes executable code for executing a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API using at least the monitor workload API 1002 described herein in connection with FIG. 10. In at least one embodiment, at 304, the executable code for executing one or more workload APIs having specified parameters includes executable code for executing a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API using at least the terminate workload API 1102 described herein in connection with FIG. 11. In at least one embodiment, not shown in fig. 3, at step 304, executable code for executing one or more workload APIs with specified parameters includes executable code that identifies one or more APIs using systems, methods, techniques, and operations as described herein. In at least one embodiment, after step 304, the process 300 of starting and terminating a distributed application on a plurality of nodes using a multi-node initiator utility continues at step 306.
In at least one embodiment, at step 306 of process 300 for starting and terminating a distributed application on multiple nodes using a multi-node starter utility, a system, such as computing environment 102 shown in FIG. 1, includes executable code for running a command based on a workload API and specified parameters. In at least one embodiment, executable code that runs commands based on a workload API and specified parameters includes executable code for identifying one or more APIs that are available to run commands using systems, methods, techniques, and operations such as those described herein. In at least one embodiment, the executable code that runs commands based on the workload API and the specified parameters includes executable code for executing one or more APIs including, but not limited to, the start workload API 902 described herein with respect to at least FIG. 9, the monitor workload API 1002 described herein with respect to at least FIG. 10, and/or the terminate workload API 1102 described herein with respect to at least FIG. 11. In at least one embodiment, after step 306, process 300 for starting and terminating the distributed application on the plurality of nodes using the multi-node initiator utility ends at step 308.
In at least one embodiment, the operations of process 300 for starting and terminating a distributed application on multiple nodes using a multi-node initiator utility are performed in a different order than that shown in FIG. 3. In at least one embodiment, the operations of process 300 for starting and terminating a distributed application on multiple nodes using a multi-node initiator utility are performed simultaneously or in parallel. In at least one embodiment, the operations of process 300 for starting and terminating a distributed application on multiple nodes using a multi-node initiator utility are performed simultaneously or in parallel, the operations not being interdependent (e.g., order independent). In at least one embodiment, the operations of process 300 for starting and terminating a distributed application on multiple nodes using a multi-node initiator utility are performed by multiple threads executing on a processor as described herein.
FIG. 4 is a block diagram 400 illustrating an example high performance computer system in accordance with at least one embodiment. In at least one embodiment, a high performance computer system such as that shown in FIG. 4 is implemented using a computing environment such as computing environment 102 described herein in connection with at least FIG. 1. In at least one embodiment, the computing cluster 402 includes one or more nodes. In at least one embodiment, the computing cluster 402 includes nodes 404. In at least one embodiment, the node 404 includes one or more switches. In at least one embodiment, node 404 includes a switch 406. In at least one embodiment, switch 406 is a hardware device that manages connections between processors, processor memory, and/or other nodes. In at least one embodiment, switch 406 is software that manages connections between processors, processor memory, and/or other nodes. In at least one embodiment, switch 406 is a virtual device that emulates hardware and/or software that manages connections between processors, processor memory, and/or other nodes. In at least one embodiment, switch 406 is a device implementing software that manages connections between processors, processor memory, and/or other nodes. In at least one embodiment, node 404 includes one or more other additional switches, such as switch 414. In at least one embodiment, switch 414 is the same switch (e.g., hardware, software, virtual device) as switch 406.
In at least one embodiment, switch 406 includes a software stack 408. In at least one embodiment, software stack 408 implements one or more software systems for enabling switch 406 to manage connections between processors, processor memory, and/or other nodes. In at least one embodiment, not shown in FIG. 4, software stack 408 has one or more memory space designations, including, but not limited to, kernel space, non-privileged user space, and the like. In at least one embodiment, software stack 408 includes one or more drivers, such as driver 410. In at least one embodiment, driver 410 is a kernel driver. In at least one embodiment, the driver 410 is a runtime driver. In at least one embodiment, software stack 408 includes one or more memory managers, such as memory manager 412. In at least one embodiment, software stack 408 is a software stack, such as the software stack shown in block diagram 1300 described herein at least in connection with FIG. 13. In at least one embodiment, software stack 408 is a software stack such as those described herein in connection with at least FIGS. 47-50.
In at least one embodiment, switch 414 includes a software stack 416. In at least one embodiment, software stack 416 implements one or more software systems for enabling switch 414 to manage connections between processors, processor memory, and/or other nodes. In at least one embodiment, not shown in fig. 4, the software stack 416 has one or more memory space names, such as those described herein. In at least one embodiment, the software stack 416 includes one or more drivers, such as driver 418. In at least one embodiment, the driver 418 is a kernel driver, as described herein. In at least one embodiment, the driver 418 is a runtime driver, as described herein. In at least one embodiment, the software stack 416 includes one or more memory managers, such as memory manager 420. In at least one embodiment, software stack 416 is a software stack such as that shown in block diagram 1300 described herein at least in connection with fig. 13. In at least one embodiment, software stack 416 is a software stack such as those described herein in connection with at least FIGS. 47-50.
In at least one embodiment, node 404 includes one or more processors, such as processor 422, processor 426, and/or processor 430. In at least one embodiment, processor 422, processor 426, and/or processor 430 are processors such as described herein at least in connection with fig. 1 (e.g., one or more of processors 104 and/or one or more of graphics processors 106). In at least one embodiment, processor 422 may access memory 424, processor 426 may access memory 428, and processor 430 may access memory 432. In at least one embodiment, memory 424, memory 428, and memory 432 are memories such as those described herein.
In at least one embodiment, switch 406 is connected to processor 422, processor 426, and/or processor 430 using systems and methods such as those described herein. In at least one embodiment, switch 406 may access memory 424 (using processor 422), memory 428 (using processor 426), and/or memory 432 (using processor 430). In at least one embodiment, switch 406 may be connected to one or more other processors and/or may access other memory not shown in fig. 4.
In at least one embodiment, switch 414 is coupled to processor 422, processor 426, and/or processor 430 using systems and methods such as those described herein, and switch 414 may access memory 424 (using processor 422), memory 428 (using processor 426), and/or memory 432 (using processor 430). In at least one embodiment, switch 414 may also be connected to one or more other processors and/or may access other memory not shown in FIG. 4. In at least one embodiment, switch 406 and/or switch 414 are connected to one or more other nodes, such as node 434 and/or node 436. In at least one embodiment, not shown in FIG. 4, node 434 and/or node 436 comprises one or more switches, processors, and memory, as described herein.
FIG. 5 illustrates a process 500 for launching one or more workloads using a high-performance computing environment in accordance with at least one embodiment. In at least one embodiment, some or all of process 500 (or any other process described herein, or variations and/or combinations thereof) is performed under control of one or more computer systems configured with computer-executable instructions (such as the computer systems described in FIGS. 16-53), and is implemented by hardware, software, or combinations thereof as code (e.g., computer-executable instructions, one or more computer programs, or one or more application programs) that is executed collectively on one or more processors. In at least one embodiment, the code is stored on a computer readable storage medium in the form of a computer program comprising a plurality of computer readable instructions executable by one or more processors. In at least one embodiment, the computer-readable storage medium is a non-transitory computer-readable medium. In at least one embodiment, a processor, such as processor 114, one or more of processors 104, one or more of graphics processors 106 (all as described herein at least in connection with FIG. 1), performs one or more steps of process 500 using systems, methods, operations, and techniques such as those described herein.
In at least one embodiment, at step 502 of process 500 for launching a workload using a high performance computing environment, a processor executing process 500 executes instructions to receive or otherwise obtain a launch workload API with parameters. In at least one embodiment, at step 502, a launch workload API having parameters is received from an environment, such as client environment 112 described herein in connection with at least FIG. 1. In at least one embodiment, at step 502, the launch workload API with parameters is an API such as launch workload API 902 described herein in connection with at least FIG. 9. In at least one embodiment, at step 502, the parameters of the launch workload API include one or more of the workload indicators 904, the number of nodes 906, the tasks per node 908, the environment variables 910, the working directory 912, the initiator 914, the execution mode 916, and/or other parameters 918 described herein at least in connection with FIG. 9. In at least one embodiment, after step 502, the process 500 of starting up a workload using a high performance computing environment continues at step 504.
In at least one embodiment, at step 504 of process 500, which starts a workload using a high performance computing environment, a processor executing process 500 executes instructions to identify one or more software workloads to start. In at least one embodiment, at step 504, the one or more software workloads to be started are parameters of the start workload API with parameters received in step 502. In at least one embodiment, after step 504, the process 500 of starting up a workload using a high performance computing environment continues at step 506.
In at least one embodiment, at step 506 of process 500 to launch a workload using a high performance computing environment, a processor executing process 500 executes instructions to select one or more APIs to cause a software workload to be launched. In at least one embodiment, at step 506, the one or more APIs are selected to cause the software workload to be launched based at least in part on one or more parameters of the launch workload API received at step 502. In at least one embodiment, at step 506, selecting one or more APIs to cause the software workload to be launched includes selecting one or more APIs from a candidate API list using, for example, a lookup table, a decision tree, an algorithm, or some other similar method. In at least one embodiment, at step 506, selecting one or more APIs to cause the software workload to be launched includes selecting a default API. In at least one embodiment, after step 506, the process 500 of starting up the workload using the high performance computing environment continues at step 508.
In at least one embodiment, at step 508 of process 500 of launching a workload using a high performance computing environment, a determination is made as to whether an API (or APIs) that causes the software workload to be launched is selected at step 506. In at least one embodiment, at step 508, determining whether an API (or APIs) that causes the software workload to be launched is selected at step 506 comprises: determining whether a plurality of acceptable APIs are selected, determining whether a single API (or a set of APIs) is selected, determining whether a default API is selected, and/or determining whether no APIs are selected. In at least one embodiment, at step 508, if it is determined that the API (or APIs) that caused the software workload to be launched was selected at step 506 ("Yes" branch), then process 500 of launching the workload using the high performance computing environment continues at step 510. In at least one embodiment, at step 508, if it is determined that the API (or APIs) that caused the software workload to be launched was not selected at step 506 ("NO" branch), then process 500 of launching the workload using the high performance computing environment continues at step 514.
In at least one embodiment, at step 510 of process 500 for launching a workload using a high performance computing environment, a processor executing process 500 executes instructions to cause the software workload to be launched using the one or more APIs selected at step 506. In at least one embodiment, after step 510, the process 500 of launching the workload using the high performance computing environment continues at step 512.
In at least one embodiment, at step 512 of the process 500 for launching a workload using a high performance computing environment, a processor executing the process 500 executes instructions to return a success indicator and a job identifier. In at least one embodiment, at step 512, the processor executing process 500 executes instructions to return a success indicator (e.g., success indicator 922) and a job identifier (e.g., job identifier 926) using the launch workload API return 920 described herein in connection with at least FIG. 9. In at least one embodiment, a job identifier, such as job identifier 926, includes an indicator of one or more jobs (e.g., "JOB12345," "123456," "job_abcde," etc.). In at least one embodiment, after step 512, the process 500 of launching the workload using the high performance computing environment ends. In at least one embodiment, not shown in FIG. 5, after step 512, the process 500 of launching a workload using a high performance computing environment continues at step 502 to receive an additional launch workload API with parameters.
In at least one embodiment, at step 514 of process 500 launching a workload using a high performance computing environment, a processor executing process 500 executes instructions to return an error indicator. In at least one embodiment, at step 514, the processor executing process 500 executes instructions to return an error indicator (e.g., error indicator 924) using the launch workload API return 920 described herein in connection with at least FIG. 9. In at least one embodiment, after step 514, the process 500 of launching the workload using the high performance computing environment ends. In at least one embodiment, not shown in FIG. 5, after step 514, the process 500 of launching a workload using a high performance computing environment continues at step 502 to receive an additional launch workload API with parameters.
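Steps 510 through 514 (launch, then return either a success indicator with a job identifier or an error indicator) can be sketched as below. The function name, the return-dictionary shape, and the job-identifier format are hypothetical illustrations of the launch workload API return (920), not the actual disclosed implementation.

```python
import uuid

def launch_workload(api_name, workload):
    """Sketch of steps 510-514: invoke the API selected at step 506 to
    launch `workload`, then return a success indicator with a job
    identifier (cf. 922/926) or an error indicator (cf. 924)."""
    if api_name is None:
        return {"error": "no suitable API selected"}   # step 514
    # ... the selected API would be invoked here to start `workload` ...
    job_id = "JOB_" + uuid.uuid4().hex[:8].upper()     # e.g. "JOB_1A2B3C4D"
    return {"success": True, "job_id": job_id}         # step 512
```

A caller would keep the returned job identifier to pass into later monitor or terminate calls.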
In at least one embodiment, the operations of process 500 for launching a workload using a high performance computing environment are performed in a different order than that shown in FIG. 5. In at least one embodiment, the operations of process 500 for launching a workload using a high performance computing environment are performed simultaneously or in parallel. In at least one embodiment, the operations of process 500 for launching a workload using a high performance computing environment are performed simultaneously or in parallel, independent of each other (e.g., order independent). In at least one embodiment, the operations of process 500 for launching a workload using a high performance computing environment are performed by multiple threads executing on a processor as described herein.
FIG. 6 illustrates a process 600 for monitoring one or more workloads using a high performance computing environment in accordance with at least one embodiment. In at least one embodiment, part or all of process 600 (or any other process described herein, or variations and/or combinations thereof) is performed under the control of one or more computer systems configured with computer-executable instructions (such as the computer systems described in FIGS. 16-53), and is implemented by hardware, software, or combinations thereof as code (e.g., computer-executable instructions, one or more computer programs, or one or more application programs) that is executed collectively on one or more processors. In at least one embodiment, the code is stored on a computer readable storage medium in the form of a computer program comprising a plurality of computer readable instructions executable by one or more processors. In at least one embodiment, the computer-readable storage medium is a non-transitory computer-readable medium. In at least one embodiment, a processor, such as processor 114, one or more of processors 104, or one or more of graphics processors 106 (all as described herein at least in connection with FIG. 1), performs one or more steps of process 600 using systems, methods, operations, and techniques such as those described herein.
In at least one embodiment, at step 602 of process 600 for monitoring workloads using a high performance computing environment, a processor executing process 600 executes instructions to receive or otherwise obtain a monitor workload API having parameters. In at least one embodiment, at step 602, the monitor workload API having parameters is received from an environment, such as the client environment 112 described herein in connection with at least FIG. 1. In at least one embodiment, at step 602, the monitor workload API having parameters is an API such as monitor workload API 1002 described herein in connection with at least FIG. 10. In at least one embodiment, at step 602, the parameters of the monitor workload API include one or more of the job identifier 1004 and/or other parameters 1006 described herein in connection with at least FIG. 10. In at least one embodiment, after step 602, the process 600 of monitoring the workload using the high performance computing environment continues at step 604.
In at least one embodiment, at step 604 of process 600 for monitoring workloads using a high performance computing environment, a processor executing process 600 executes instructions to identify one or more software workloads to monitor. In at least one embodiment, at step 604, the one or more software workloads to be monitored are identified by one or more parameters of the monitor workload API with parameters received at step 602. In at least one embodiment, after step 604, the process 600 of monitoring the workload using the high performance computing environment continues at step 606.
In at least one embodiment, at step 606 of process 600 using a high performance computing environment to monitor a workload, a processor executing process 600 executes instructions to select one or more APIs to obtain the state of the identified software workload. In at least one embodiment, at step 606, the one or more APIs are selected to cause the software workload to be monitored based at least in part on one or more parameters of the monitor workload API with parameters received at step 602. In at least one embodiment, at step 606, selecting one or more APIs such that the software workload is monitored comprises: one or more APIs are selected from the candidate API list using, for example, a look-up table, decision tree, algorithm, or other similar method. In at least one embodiment, at step 606, selecting one or more APIs to cause the software workload to be monitored includes: a default API is selected. In at least one embodiment, after step 606, the process 600 of monitoring the workload using the high performance computing environment continues at step 608.
In at least one embodiment, at step 608 of process 600 for monitoring a workload using a high performance computing environment, a determination is made as to whether an API (or APIs) is selected at step 606 that causes the software workload to be monitored. In at least one embodiment, at step 608, determining whether an API (or APIs) that causes the software workload to be monitored is selected at step 606 includes: determining whether a plurality of acceptable APIs are selected, determining whether a single API (or a set of APIs) is selected, determining whether a default API is selected, and/or determining whether an API is not selected. In at least one embodiment, at step 608, if it is determined that the API (or APIs) that cause the software workload to be monitored was selected at step 606 ("Yes" branch), the process 600 of monitoring the workload using the high performance computing environment continues at step 610. In at least one embodiment, at step 608, if it is determined that the API (or APIs) that cause the software workload to be monitored were not selected at step 606 ("NO" branch), then the process 600 of monitoring the workload using the high performance computing environment continues at step 614.
In at least one embodiment, at step 610 of process 600 using a high performance computing environment to monitor workloads, a processor executing process 600 executes instructions to obtain the status of the one or more software workloads to be monitored using the one or more APIs selected at step 606. In at least one embodiment, at step 610, the processor executing process 600 executes instructions to obtain the status of the one or more software workloads to be monitored from a computing environment (such as computing environment 102 described herein in connection with at least FIG. 1). In at least one embodiment, after step 610, the process 600 of monitoring the workload using the high performance computing environment continues at step 612.
In at least one embodiment, at step 612 of process 600 monitoring workload using a high performance computing environment, a processor executing process 600 executes instructions to return a success indicator and a workload state. In at least one embodiment, at step 612, the processor executing process 600 executes instructions to return a success indicator (e.g., success indicator 1022) and a workload state (e.g., workload state 1026) using the monitor workload API return 1020 described herein in connection with at least FIG. 10. In at least one embodiment, after step 612, the process 600 of monitoring the workload using the high performance computing environment ends. In at least one embodiment, not shown in FIG. 6, after step 612, the process 600 of monitoring the workload using the high performance computing environment continues at step 602 to receive an additional monitor workload API with parameters.
In at least one embodiment, at step 614 of process 600 monitoring workload using a high performance computing environment, a processor executing process 600 executes instructions to return an error indicator. In at least one embodiment, at step 614, the processor executing process 600 executes instructions to return an error indicator (e.g., error indicator 1024) using the monitor workload API return 1020 described herein in connection with at least FIG. 10. In at least one embodiment, after step 614, the process 600 of monitoring the workload using the high performance computing environment ends. In at least one embodiment, not shown in FIG. 6, after step 614, the process 600 of monitoring the workload using the high performance computing environment continues at step 602 to receive an additional monitor workload API with parameters.
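The monitor flow of steps 602 through 614 (look up a workload by job identifier, then return its state or an error) can be sketched as below. The function name and the use of a plain dictionary to stand in for the computing environment are hypothetical; the return shape loosely mirrors the monitor workload API return (1020) with its success indicator (1022), error indicator (1024), and workload state (1026).

```python
def monitor_workload(job_table, job_id):
    """Sketch of process 600: obtain the state of the workload identified
    by `job_id` (cf. job identifier 1004). `job_table` stands in for the
    computing environment queried at step 610."""
    state = job_table.get(job_id)
    if state is None:
        return {"error": f"unknown job {job_id}"}   # step 614
    return {"success": True, "state": state}        # step 612
```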
In at least one embodiment, the operations of process 600 for monitoring workload using a high performance computing environment are performed in a different order than shown in FIG. 6. In at least one embodiment, the operations of process 600 for monitoring workloads using a high performance computing environment are performed simultaneously or in parallel. In at least one embodiment, the operations of process 600 for monitoring workloads using a high performance computing environment are performed simultaneously or in parallel, independent of each other (e.g., order independent). In at least one embodiment, the operations of process 600 for monitoring workload using a high performance computing environment are performed by multiple threads executing on a processor such as described herein.
FIG. 7 illustrates a process 700 for terminating one or more workloads using a high performance computing environment in accordance with at least one embodiment. In at least one embodiment, part or all of process 700 (or any other process described herein, or variations and/or combinations thereof) is performed under control of one or more computer systems configured with computer-executable instructions (such as the computer systems described in FIGS. 16-53), and is implemented by hardware, software, or combinations thereof as code (e.g., computer-executable instructions, one or more computer programs, or one or more application programs) that is executed collectively on one or more processors. In at least one embodiment, the code is stored on a computer readable storage medium in the form of a computer program comprising a plurality of computer readable instructions executable by one or more processors. In at least one embodiment, the computer-readable storage medium is a non-transitory computer-readable medium. In at least one embodiment, a processor, such as processor 114, one or more of processors 104, or one or more of graphics processors 106 (all as described herein at least in connection with FIG. 1), performs one or more steps of process 700 using systems, methods, operations, and techniques such as those described herein.
In at least one embodiment, at step 702 of process 700 for terminating a workload using a high performance computing environment, a processor executing process 700 executes instructions to receive or otherwise obtain a terminate workload API with parameters. In at least one embodiment, at step 702, the terminate workload API having parameters is received from an environment, such as the client environment 112 described herein in connection with at least FIG. 1. In at least one embodiment, at step 702, the terminate workload API having parameters is an API such as terminate workload API 1102 described herein in connection with at least FIG. 11. In at least one embodiment, at step 702, the parameters of the terminate workload API include one or more of the job identifier 1104 and/or other parameters 1106 described herein in connection with at least FIG. 11. In at least one embodiment, after step 702, the process 700 of terminating the workload using the high performance computing environment continues at step 704.
In at least one embodiment, at step 704 of process 700 for terminating a workload using a high performance computing environment, a processor executing process 700 executes instructions to identify one or more software workloads to terminate. In at least one embodiment, at step 704, the one or more software workloads to be terminated are identified by one or more parameters of the terminate workload API with parameters received at step 702. In at least one embodiment, after step 704, the process 700 of terminating the workload using the high performance computing environment continues at step 706.
In at least one embodiment, at step 706 of process 700, which terminates a workload using a high performance computing environment, a processor executing process 700 executes instructions to select one or more APIs to terminate the identified software workload. In at least one embodiment, at step 706, the one or more APIs are selected to cause the software workload to be terminated based at least in part on one or more parameters of the terminate workload API with parameters received at step 702. In at least one embodiment, at step 706, selecting one or more APIs to cause the software workload to be terminated comprises: one or more APIs are selected from the candidate API list using, for example, a look-up table, decision tree, algorithm, or other similar method. In at least one embodiment, at step 706, selecting one or more APIs to cause the software workload to be terminated comprises: a default API is selected. In at least one embodiment, after step 706, the process 700 of terminating the workload using the high performance computing environment continues at step 708.
In at least one embodiment, at step 708 of process 700 for terminating a workload using a high performance computing environment, a determination is made as to whether an API (or APIs) is selected at step 706 that causes the software workload to be terminated. In at least one embodiment, at step 708, determining whether an API (or APIs) that causes the software workload to be terminated was selected at step 706 comprises: determining whether a plurality of acceptable APIs are selected, determining whether a single API (or a set of APIs) is selected, determining whether a default API is selected, and/or determining whether an API is not selected. In at least one embodiment, at step 708, if it is determined that the API (or APIs) that cause the software workload to be terminated were selected at step 706 ("Yes" branch), then process 700 of terminating the workload using the high performance computing environment continues at step 710. In at least one embodiment, at step 708, if it is determined that the API (or APIs) that cause the software workload to be terminated were not selected at step 706 ("NO" branch), then process 700 of terminating the workload using the high performance computing environment continues at step 716.
In at least one embodiment, at step 710 of process 700 for terminating a workload using a high performance computing environment, a processor executing process 700 executes instructions to cause one or more software workloads to be terminated using the one or more APIs selected at step 706. In at least one embodiment, at step 710, a processor executing process 700 executes instructions to cause one or more software workloads running on a computing environment (such as computing environment 102 described herein in connection with at least FIG. 1) to be terminated. In at least one embodiment, after step 710, the process 700 of terminating the workload using the high performance computing environment continues at step 712.
In at least one embodiment, at step 712 of process 700 for terminating a workload using a high performance computing environment, a determination is made as to whether the software workload identified at step 704 was terminated at step 710. In at least one embodiment, at step 712, the processor executing process 700 executes instructions to determine whether the software workload identified at step 704 was terminated at step 710 by querying a computing environment, such as computing environment 102 described herein in connection with at least FIG. 1. In at least one embodiment, at step 712, the processor executing process 700 executes instructions to determine whether the software workload identified at step 704 was terminated at step 710 by executing one or more additional APIs, such as the monitor workload API 1002 described herein in connection with at least FIG. 10. In at least one embodiment, at step 712, the processor executing process 700 executes instructions to determine whether the software workload identified at step 704 was terminated at step 710 by performing one or more steps of process 600 described herein in connection with at least FIG. 6. In at least one embodiment, at step 712, if it is determined that the software workload identified at step 704 was terminated at step 710 ("yes" branch), then the process 700 of terminating the workload using the high performance computing environment continues at step 714. In at least one embodiment, at step 712, if it is determined that the software workload identified at step 704 was not terminated at step 710 (the "no" branch), then process 700 of terminating the workload using the high performance computing environment continues at step 716.
In at least one embodiment, at step 714 of process 700 terminating the workload using the high performance computing environment, the processor executing process 700 executes instructions to return a success indicator. In at least one embodiment, at step 714, the processor executing process 700 executes instructions to return a success indicator (e.g., success indicator 1122) using the terminate workload API return 1120 described herein in connection with at least FIG. 11. In at least one embodiment, after step 714, the process 700 of terminating the workload using the high performance computing environment ends. In at least one embodiment, not shown in FIG. 7, after step 714, process 700 of terminating the workload using the high performance computing environment continues at step 702 to receive an additional terminate workload API with parameters.
In at least one embodiment, at step 716 of process 700, which terminates the workload using the high performance computing environment, a processor executing process 700 executes instructions to return an error indicator. In at least one embodiment, at step 716, the processor executing process 700 executes instructions to return an error indicator (e.g., error indicator 1124) using the terminate workload API return 1120 described herein in connection with at least FIG. 11. In at least one embodiment, after step 716, the process 700 of terminating the workload using the high performance computing environment ends. In at least one embodiment, not shown in FIG. 7, after step 716, process 700 of terminating a workload using a high performance computing environment continues at step 702 to receive an additional terminate workload API with parameters.
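The terminate-then-verify flow of steps 710 through 716 (terminate the identified workload, re-check its state, then return a success or error indicator) can be sketched as below. The function name and the dictionary standing in for the computing environment are hypothetical; the verification step models re-checking status, for example via the monitor workload API, as described for step 712.

```python
def terminate_workload(job_table, job_id):
    """Sketch of steps 710-716: terminate the job identified by `job_id`
    (cf. job identifier 1104), verify termination, and return a success
    indicator (cf. 1122) or an error indicator (cf. 1124)."""
    if job_id not in job_table:
        return {"error": f"unknown job {job_id}"}            # step 716
    job_table[job_id] = "TERMINATED"                         # step 710
    # Step 712: verify termination, e.g. by querying workload status again.
    if job_table.get(job_id) != "TERMINATED":
        return {"error": f"job {job_id} did not terminate"}  # step 716
    return {"success": True}                                 # step 714
```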
In at least one embodiment, the operations of process 700 for terminating a workload using a high performance computing environment are performed in a different order than that shown in FIG. 7. In at least one embodiment, the operations of process 700 for terminating a workload using a high performance computing environment are performed simultaneously or in parallel. In at least one embodiment, the operations of process 700 for terminating a workload using a high performance computing environment are performed simultaneously or in parallel, independent of each other (e.g., order independent). In at least one embodiment, the operations of process 700 for terminating a workload using a high performance computing environment are performed by multiple threads executing on a processor such as described herein.
FIG. 8 is a block diagram 800 illustrating a software program executed by one or more processors in accordance with at least one embodiment. In at least one embodiment, block diagram 800 illustrates a software program 804 to be executed by a processor, such as CPU 802 (e.g., a central processing unit) and GPU 810 (e.g., a graphics processing unit) and accelerator 814 within a heterogeneous processor. In at least one embodiment, CPU 802 is a processor, such as processor 114 and/or one or more of processors 104, as described herein at least in connection with FIG. 1. In at least one embodiment, the CPU 802 includes one or more processors, such as one or more of the processors 114, one or more of the processors 104, one or more of the graphics processors 106, and/or one or more of the other processors and/or accelerators, such as described herein. In at least one embodiment, GPU 810 is a graphics processor, such as one or more of graphics processors 106, as described herein at least in connection with FIG. 1. In at least one embodiment, GPU 810 includes one or more processors, such as one or more of processors 114, one or more of processors 104, one or more of graphics processors 106, and/or one or more of other processors and/or accelerators, as described herein. In at least one embodiment, CPU 802 is any processor having any architecture further described herein. In at least one embodiment, CPU 802 is any general-purpose processor having any architecture further described herein. In at least one embodiment, a processor, such as CPU 802, includes circuitry for performing one or more computing operations. In at least one embodiment, a processor, such as CPU 802, includes any configuration of circuitry for performing one or more of the computing operations described further herein.
In at least one embodiment, the CPU 802 executes a parallel computing environment 808. In at least one embodiment, the CPU 802 includes one or more processors such as those described herein. In at least one embodiment, a processor, such as CPU 802, executes a parallel computing environment 808, such as a Compute Unified Device Architecture (CUDA). In at least one embodiment, the parallel computing environment 808 includes instructions that, if executed by one or more processors (such as the CPU 802), facilitate execution of one or more software programs by one or more CPUs 802, one or more Parallel Processing Units (PPUs) (such as the GPU 810), and/or one or more accelerators 814 within a heterogeneous processor.
In at least one embodiment, the one or more PPUs are processors that include one or more circuits for performing parallel computing operations, such as GPU 810 and any other parallel processors described further herein. In at least one embodiment, GPU 810 is hardware that includes circuitry for performing one or more computing operations, as described further below in connection with various embodiments. In at least one embodiment, GPU 810 includes one or more processing cores that are each to perform one or more computing operations. In at least one embodiment, GPU 810 includes one or more processing cores to perform one or more parallel computing operations. In at least one embodiment, GPU 810 is packaged with CPU 802 or other processor as a system on a chip (SoC). In at least one embodiment, GPU 810 is packaged with CPU 802 or other processor on a shared die or other substrate as a system on a chip (SoC). In at least one embodiment, the one or more accelerators 814 within the heterogeneous processor are hardware including one or more circuits for performing specific computing operations, such as a Deep Learning Accelerator (DLA), a Programmable Vision Accelerator (PVA), a Field Programmable Gate Array (FPGA), or any other accelerator described further herein. In at least one embodiment, accelerator 814 within a heterogeneous processor is packaged with CPU 802 or other processor as a system on a chip (SoC). In at least one embodiment, accelerator 814 within a heterogeneous processor is packaged with CPU 802 or other processor on a shared die or other substrate as a system on a chip (SoC). In at least one embodiment, one or more CPUs (such as CPU 802), one or more GPUs (such as GPU 810), one or more or other PPUs, and/or accelerator 814 within a heterogeneous processor are packaged as a system on a chip (SoC). 
In at least one embodiment, one or more CPUs 802, one or more GPUs 810, or other PPUs and/or accelerators 814 within a heterogeneous processor are packaged on a shared die or other substrate as a system on a chip (SoC).
In at least one embodiment, the parallel computing environment 808 (such as a CUDA) includes libraries and other software programs for performing one or more computing operations using one or more PPUs (such as GPU 810) and/or one or more accelerators 814 within a heterogeneous processor. In at least one embodiment, the parallel computing environment 808 includes libraries and other software programs that, if executed by one or more processors (such as one or more CPUs 802), cause one or more PPUs (such as GPUs 810) and/or one or more accelerators 814 within a heterogeneous processor to perform one or more computing operations. In at least one embodiment, the parallel computing environment 808 includes a library that, if executed, causes one or more PPUs (such as GPU 810) and/or one or more accelerators 814 in the heterogeneous processor to perform mathematical operations. In at least one embodiment, the parallel computing environment 808 includes a library that, if executed, causes one or more PPUs (such as GPU 810) and/or one or more accelerators 814 in a heterogeneous processor to perform any other operations described further herein.
In at least one embodiment, one or more PPUs (such as GPU 810) and/or one or more accelerators 814 within the heterogeneous processor perform one or more computing operations in response to one or more Application Programming Interfaces (APIs). In at least one embodiment, the API is a set of software instructions that, if executed by one or more processors (such as one or more CPUs 802), cause one or more PPUs (such as GPU 810) and/or one or more accelerators 814 within the heterogeneous processor to perform one or more computing operations. In at least one embodiment, the parallel computing environment 808 includes one or more APIs 806 that, if executed by one or more processors (such as one or more CPUs 802), cause one or more PPUs (such as GPUs 810) and/or one or more accelerators 814 within a heterogeneous processor to perform one or more computing operations. In at least one embodiment, the one or more APIs 806 include one or more functions (functions) that, if executed, cause one or more processors (such as the one or more CPUs 802) to perform one or more operations, such as computing operations, error reporting, scheduling other operations to be performed by the GPUs 810 and/or the accelerators 814 within the heterogeneous processors, or any other operations described further herein. In at least one embodiment, one or more APIs 806 include one or more functions that, if executed, cause one or more PPUs (such as GPU 810) to perform one or more operations, such as computing operations, error reporting, or any other operations described further herein. In at least one embodiment, the one or more APIs 806 include one or more functions, such as those described below in connection with fig. 9-11, that if executed, cause one or more accelerators 814 within the heterogeneous processor to perform one or more operations, such as computing operations, error reporting, or any other operations further described herein. 
In at least one embodiment, one or more APIs 806 include one or more functions for causing CPU 802 to perform one or more computing operations in response to information or events generated by one or more PPUs (such as GPU 810) and/or one or more accelerators 814 within a heterogeneous processor. In at least one embodiment, one or more APIs 806 include one or more functions that, if invoked, cause CPU 802 to perform one or more computing operations in response to information or events generated by one or more PPUs (such as GPU 810) and/or one or more accelerators 814 in a heterogeneous processor.
In at least one embodiment, a processor, such as CPU 802, executes one or more software programs 804. In at least one embodiment, the one or more software programs are a set of instructions that, if executed, cause one or more processors (such as one or more of the CPUs 802, PPUs (such as the GPU 810), and/or the accelerator 814) in the heterogeneous processor to perform computing operations. In at least one embodiment, software program 804 includes instructions and/or operations to be performed by one or more PPUs, such as GPU 810. In at least one embodiment, one or more software programs 804 include GPU-specific code 812 and/or accelerator-specific code 816. In at least one embodiment, the instructions and/or operations to be performed by one or more PPUs (such as GPU 810) are PPU-specific code or GPU-specific code 812. In at least one embodiment, GPU-specific code 812 is a set of software instructions and/or other operations to be performed by one or more GPUs 810, as further described herein. In at least one embodiment, software program 804 includes instructions and/or operations to be performed by one or more accelerators 814 in a heterogeneous processor. In at least one embodiment, the instructions and/or operations to be performed by one or more accelerators 814 in the heterogeneous processor are accelerator-specific code 816. In at least one embodiment, the accelerator-specific code 816 is a set of software instructions and/or other operations to be performed by one or more accelerators 814, as further described herein. In at least one embodiment, PPU-specific code or GPU-specific code 812 and/or accelerator-specific code 816 is to be executed in response to one or more APIs 806, as described below in connection with FIGS. 9-11.
FIG. 9 is a block diagram 900 illustrating an Application Programming Interface (API) that initiates one or more software workloads, according to at least one embodiment. In at least one embodiment, one or more circuits of the processor are to execute the launch workload API 902 to launch one or more software workloads using a computing environment such as the computing environment 102 described herein at least in connection with FIG. 1. In at least one embodiment, not shown in fig. 9, one or more circuits of a processor, such as those described herein, execute one or more instructions to execute the launch workload API 902 to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, not shown in fig. 9, one or more circuits of a processor, such as those described herein, execute one or more instructions to execute the launch workload API 902 to execute a first Application Programming Interface (API) to cause a second API to be executed, thereby causing one or more software workloads to be executed by one or more other processors. In at least one embodiment, also not shown in fig. 9, one or more circuits of a processor, such as described herein, execute one or more instructions to execute the launch workload API 902 to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API in response to receiving additional APIs, such as described herein.
In at least one embodiment, the launch workload API 902, when invoked, receives one or more parameters indicating information about operations to be performed using techniques such as those described herein. In at least one embodiment, the launch workload API 902, when invoked, receives one or more parameters indicating information about instructions to be executed using techniques such as those described herein.
In at least one embodiment, the launch workload API 902 receives as input one or more parameters including a workload indicator 904. In at least one embodiment, the workload indicator 904 is a data value that includes information that can be used to identify, indicate, or otherwise specify one or more workloads to be started using the start workload API 902. In at least one embodiment, the workload indicator 904 is a command (e.g., a script of command line commands) to be executed to cause one or more workloads to be initiated. In at least one embodiment, the one or more workloads to be started that are identified, indicated, or otherwise specified by the workload indicator 904 are one or more of a plurality of parameters that can be used by the start workload API 902 to start one or more software workloads. In at least one embodiment, the workload indicator 904 is a data value used to identify, indicate, or otherwise specify a set of operations or instructions to an API (such as the launch workload API 902) that are to be performed by one or more PPUs (such as GPUs) and/or one or more accelerators within a heterogeneous processor, as described herein.
In at least one embodiment, the launch workload API 902 receives as input one or more parameters including a number of nodes 906. In at least one embodiment, the number of nodes 906 is a data value that includes information that can be used to identify, indicate, or otherwise specify the number of nodes to be used to launch a workload using the launch workload API 902. In at least one embodiment, the number of nodes to be used to launch a workload, identified, indicated, or otherwise specified by the number of nodes 906, is one of a plurality of parameters that the launch workload API 902 may use to launch one or more software workloads. In at least one embodiment, the number of nodes 906 is a data value that is used to identify, indicate, or otherwise specify a set of operations or instructions to an API (such as the launch workload API 902) that are to be executed by one or more PPUs (such as GPUs) and/or one or more accelerators within a heterogeneous processor, as described herein.
In at least one embodiment, the launch workload API 902 receives as input one or more parameters including tasks per node 908. In at least one embodiment, the tasks per node 908 is a data value that includes information that can be used to identify, indicate, or otherwise specify the number of tasks per node that will be used to launch a workload using the launch workload API 902. In at least one embodiment, the number of tasks per node to be used to launch a workload, identified, indicated, or otherwise specified by the tasks per node 908, is one of a plurality of parameters that the launch workload API 902 may use to launch one or more software workloads. In at least one embodiment, the tasks per node 908 is a data value that is used to identify, indicate, or otherwise specify a set of operations or instructions to an API (such as the launch workload API 902) that are to be performed by one or more PPUs (such as GPUs) and/or one or more accelerators within a heterogeneous processor, as described herein.
In at least one embodiment, the launch workload API 902 receives as input one or more parameters including the environment variables 910. In at least one embodiment, the environment variables 910 are data values that include information that may be used to identify, indicate, or otherwise specify one or more environment variables using the launch workload API 902. In at least one embodiment, the environment variables 910 include one or more key-value pairs (e.g., key=value) specified using a list of key-value pairs. In at least one embodiment, the environment variables identified, indicated, or otherwise specified by the environment variables 910 are one of a plurality of parameters that may be used by the launch workload API 902 to launch one or more software workloads. In at least one embodiment, the environment variables 910 are a data value that is used to identify, indicate, or otherwise specify a set of operations or instructions to an API (such as the launch workload API 902) that are to be executed by one or more PPUs (such as GPUs) and/or one or more accelerators within a heterogeneous processor, as described herein.
In at least one embodiment, the launch workload API 902 receives as input one or more parameters including the working directory 912. In at least one embodiment, the working directory 912 is a data value that includes information that can be used to identify, indicate, or otherwise designate the working directory from which the workload is launched using the launch workload API 902. In at least one embodiment, the working directory identified, indicated, or otherwise specified by the working directory 912 is one of a plurality of parameters that may be used by the launch workload API 902 to launch one or more software workloads. In at least one embodiment, the working directory 912 is a data value that is used to identify, indicate, or otherwise specify a set of operations or instructions to an API (such as the launch workload API 902) that are to be executed by one or more PPUs (such as GPUs) and/or one or more accelerators within a heterogeneous processor, as described herein.
In at least one embodiment, the launch workload API 902 receives as input one or more parameters including a launcher 914. In at least one embodiment, the launcher 914 is a data value that includes information that can be used to identify, indicate, or otherwise specify the launcher that will be used to launch the software workload using the launch workload API 902. In at least one embodiment, the launcher 914 identifies, indicates, or otherwise specifies a software program to be used to launch a software workload using the launch workload API 902. In at least one embodiment, the launcher identified, indicated, or otherwise specified by the launcher 914 is one of a plurality of parameters that can be used by the launch workload API 902 to launch one or more software workloads. In at least one embodiment, the launcher 914 is a data value that is used to identify, indicate, or otherwise designate a set of operations or instructions to an API (such as the launch workload API 902) that are to be performed by one or more PPUs (such as GPUs) and/or one or more accelerators within a heterogeneous processor, as described herein.
In at least one embodiment, the launch workload API 902 receives as input one or more parameters including an execution mode 916. In at least one embodiment, the execution mode 916 is a data value that includes information that can be used to identify, indicate, or otherwise designate one or more execution modes to be used when launching a software workload using the launch workload API 902. In at least one embodiment, the execution mode 916 specifies one or more execution modes including, but not limited to, a launch supporting asynchronous faults and/or a launch in debug mode. In at least one embodiment, the one or more execution modes identified, indicated, or otherwise specified by the execution mode 916 are one of a plurality of parameters that may be used by the launch workload API 902 to launch one or more software workloads. In at least one embodiment, the execution mode 916 is a data value that is used to identify, indicate, or otherwise specify to an API (such as the launch workload API 902) a set of operations or instructions to be performed by one or more PPUs (such as GPUs) and/or one or more accelerators in a heterogeneous processor, as described herein.
In at least one embodiment, the launch workload API 902 receives as input one or more parameters including one or more other parameters 918. In at least one embodiment, the other parameters 918 are data comprising information indicative of any other information available when executing the launch workload API 902 to launch one or more software workloads. In at least one embodiment, one or more of the workload indicator 904, the number of nodes 906, the tasks per node 908, the environment variables 910, the working directory 912, the launcher 914, the execution mode 916, and/or other parameters 918 are required parameters of the launch workload API 902. In at least one embodiment, one or more of the workload indicator 904, the number of nodes 906, the tasks per node 908, the environment variables 910, the working directory 912, the launcher 914, the execution mode 916, and/or other parameters 918 are optional parameters of the launch workload API 902.
In at least one embodiment, not shown in FIG. 9, the processor executes one or more instructions to execute one or more APIs, such as the launch workload API 902, to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API using one or more parameters including, but not limited to, the workload indicator 904, the number of nodes 906, the tasks per node 908, the environment variables 910, the working directory 912, the launcher 914, the execution mode 916, and/or other parameters 918. In at least one embodiment, not shown in FIG. 9, a processor executes one or more instructions to execute one or more APIs (such as the launch workload API 902) to execute a first Application Programming Interface (API) to cause a second API to be executed to cause one or more software workloads to be executed by one or more other processors using one or more parameters including, but not limited to, the workload indicator 904, the number of nodes 906, the tasks per node 908, the environment variables 910, the working directory 912, the launcher 914, the execution mode 916, and/or other parameters 918.
In at least one embodiment, the launch workload API 902, if invoked, causes one or more APIs (such as one or more APIs 806 described herein in connection with at least FIG. 8) to add, insert, or otherwise include one or more operations or instructions in a stream or set of instructions to be executed by one or more accelerators within a heterogeneous processor. In at least one embodiment, the launch workload API 902, if invoked, causes one or more APIs (such as one or more APIs 806) in a parallel computing environment (such as parallel computing environment 808 described herein in connection with at least FIG. 8) to add, insert, or otherwise include one or more operations or instructions in a stream or set of instructions executed by one or more accelerators within a heterogeneous processor.
In at least one embodiment, in response to the launch workload API 902, the one or more APIs 806, if executed, cause the one or more processors to execute the launch workload API return 920. In at least one embodiment, the launch workload API return 920 is a set of instructions that, if executed, generates and/or indicates one or more data values in response to the launch workload API 902.
In at least one embodiment, the launch workload API return 920 indicates a success indicator 922. In at least one embodiment, the success indicator 922 is data that includes any value indicating success of the launch workload API 902. In at least one embodiment, the success indicator 922 includes information indicating one or more particular types of success generated as a result of executing the launch workload API 902. In at least one embodiment, the success indicator 922 includes information indicating one or more other data values generated as a result of executing the launch workload API 902.
In at least one embodiment, the launch workload API return 920 indicates an error indicator 924. In at least one embodiment, the error indicator 924 is data comprising any value that indicates a failure of the launch workload API 902. In at least one embodiment, the error indicator 924 includes information indicating one or more particular types of errors generated as a result of executing the launch workload API 902. In at least one embodiment, the error indicator 924 includes information indicating one or more other data values generated as a result of executing the launch workload API 902.
In at least one embodiment, the launch workload API return 920 indicates the job identifier 926. In at least one embodiment, the job identifier 926 is data comprising any value indicative of an identifier of a job launched by the launch workload API 902 (e.g., an identifier of a launched workload). In at least one embodiment, the job identifier 926 includes information that may be used to identify the launched workload to the monitor workload API 1002, the terminate workload API 1102, and/or other such APIs, as described herein. In at least one embodiment, the job identifier 926 includes information indicating one or more other data values generated as a result of executing the launch workload API 902. In at least one embodiment, the job identifier 926 is a parameter of one or more other APIs including, but not limited to, an API for monitoring workloads (such as the monitor workload API 1002, described herein in connection with at least FIG. 10) and/or an API for terminating workloads (such as the terminate workload API 1102, described herein in connection with at least FIG. 11).
In at least one embodiment, the parallel computing environment 808 includes one or more APIs 806, including but not limited to the launch workload API 902, that add various types of operations to streams executed by one or more accelerators in a heterogeneous processor. In at least one embodiment, the stream operations include an acquire-semaphore operation. In at least one embodiment, the stream operations include a release-semaphore operation. In at least one embodiment, the stream operations include one or more operations to flush and/or invalidate cache memory (such as an L2 cache memory of a PPU, such as a GPU, and/or a cache memory of one or more accelerators in a heterogeneous processor). In at least one embodiment, the stream operations include one or more operations for indicating submission of an operation to an external device (such as one or more accelerators in a heterogeneous processor). In at least one embodiment, example software code indicating the types of stream operations is as follows:
/**
 * Types of stream operations
 */
typedef enum
{
    /**< Acquire semaphore */
    CUSOCKET_STREAM_OP_SEMA_ACQ,
    /**< Release semaphore */
    CUSOCKET_STREAM_OP_SEMA_REL,
    /**< Flush GPU L2 cache */
    CUSOCKET_STREAM_OP_GPU_L2_FLUSH,
    /**< Invalidate GPU L2 cache */
    CUSOCKET_STREAM_OP_GPU_L2_INVALIDATE,
    /**< Submit an operation to an external device */
    CUSOCKET_STREAM_OP_EXTERNAL_DEVICE_SUBMIT
} cuSocketStreamOpType;
In at least one embodiment, the parallel computing environment 808 includes one or more APIs 806, including but not limited to the launch workload API 902, that include one or more function signatures that can be used to indicate one or more callback functions for operations to be performed by one or more accelerators within a heterogeneous processor. In at least one embodiment, one or more operations cause one or more callback functions to be executed. In at least one embodiment, example software code indicating a function signature of a callback function is as follows:
/**
*Callback function signature for submitting to an external device.
*/
typedef unsigned int (*cuSocketExternalDeviceSubmitCallback)(void *submitArgs);
In at least one embodiment, to specify that one or more accelerators within a heterogeneous processor are to perform one or more operations indicated to the one or more APIs 806 by the launch workload API 902, one or more data structures of the one or more APIs 806 may be used to specify one or more external devices to which the one or more APIs 806 are to submit the one or more operations. In at least one embodiment, example software code indicating a data structure representing a device node for one or more accelerators within a heterogeneous processor is as follows:
/**
 * Struct representing the external device node that captures the information
 * about a particular task submitted to an external device.
 */
typedef struct
{
    void *submitArgs;
    cuSocketExternalDeviceSubmitCallback callback;
} cuSocketExternalDeviceNodeParams;
In at least one embodiment, to specify the type and data of one or more operations to be performed by one or more accelerators within a heterogeneous processor, one or more data structures of the one or more APIs 806 are to be used. In at least one embodiment, example software code indicating a data structure for specifying the type and data of one or more operations to be performed by one or more accelerators within a heterogeneous processor is not reproduced in this text.
In at least one embodiment, the one or more APIs 806 include instructions that, if executed, cause one or more operations or instructions to be added to a stream or other set of instructions to be executed by one or more accelerators within the heterogeneous processor. In at least one embodiment, instructions for causing one or more operations or instructions to be added to a stream or other instruction set are executed in response to the launch workload API 902, as described above. In at least one embodiment, example software code indicating a stream operation API call in a parallel computing environment 808 (such as CUDA) is not reproduced in this text.
In at least one embodiment, the one or more APIs 806 include instructions that, if executed, cause one or more operations or instructions to be performed by one or more accelerators within the heterogeneous processor to be added to one or more executable graphs, similar to how the one or more operations or instructions to be performed by the one or more accelerators within the heterogeneous processor are added to one or more streams or instruction sets in response to the launch workload API 902. In at least one embodiment, example software code that instructs one or more APIs 806 of the parallel computing environment 808 to add one or more operations or instructions to one or more executable graphs is not reproduced in this text.
FIG. 10 is a block diagram 1000 illustrating an Application Programming Interface (API) for monitoring one or more software workloads, in accordance with at least one embodiment. In at least one embodiment, one or more circuits of the processor are to execute the monitor workload API 1002 to monitor one or more software workloads of a computing environment, such as the computing environment 102 described herein in connection with at least FIG. 1. In at least one embodiment, one or more circuits of the processor are to execute the monitor workload API 1002 to monitor one or more software workloads that are launched using at least the launch workload API 902 described herein in connection with FIG. 9. In at least one embodiment, not shown in fig. 10, one or more circuits of a processor, such as those described herein, execute one or more instructions to execute the monitor workload API 1002 to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, not shown in fig. 10, one or more circuits of a processor, such as those described herein, execute one or more instructions to execute the monitor workload API 1002 to execute a first Application Programming Interface (API) to cause a second API to be executed such that a state of one or more software workloads is provided. In at least one embodiment, also not shown in fig. 10, one or more circuits of a processor, such as described herein, execute one or more instructions to execute the monitor workload API 1002 to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API in response to receiving additional APIs, such as described herein.
In at least one embodiment, the monitoring workload API 1002, when invoked, receives one or more parameters indicating information about operations to be performed using techniques such as those described herein. In at least one embodiment, the monitoring workload API 1002, when invoked, receives one or more parameters indicating information about instructions to be executed using techniques such as those described herein.
In at least one embodiment, the monitoring workload API 1002 receives as input one or more parameters including a job identifier 1004. In at least one embodiment, job identifier 1004 is a data value that includes information that can be used to identify, indicate, or otherwise specify one or more workloads to be monitored using monitoring workload API 1002. In at least one embodiment, job identifier 1004 is a job identifier returned by the launch workload API 902 (e.g., job identifier 926) described herein. In at least one embodiment, the job identifier identified, indicated, or otherwise specified by job identifier 1004 is one of a plurality of parameters that may be used by monitoring workload API 1002 to monitor one or more software workloads. In at least one embodiment, job identifier 1004 is a data value that is used to identify, indicate, or otherwise specify a set of operations or instructions to be performed by one or more PPUs (such as GPUs) and/or one or more accelerators within a heterogeneous processor to an API (such as monitor workload API 1002), as described herein.
In at least one embodiment, the monitoring workload API 1002 receives as input one or more parameters including one or more other parameters 1006. In at least one embodiment, the other parameters 1006 are data including information indicative of any other information that may be used to execute the monitor workload API 1002 to monitor one or more software workloads. In at least one embodiment, one or more of the job identifier 1004 and/or other parameters 1006 are required parameters of the monitor workload API 1002. In at least one embodiment, one or more of the job identifier 1004 and/or other parameters 1006 are optional parameters of the monitor workload API 1002.
In at least one embodiment, not shown in fig. 10, the processor executes one or more instructions to execute one or more APIs, such as monitor workload API 1002, to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API using one or more parameters including, but not limited to, job identifier 1004 and/or other parameters 1006. In at least one embodiment, not shown in fig. 10, the processor executes one or more instructions to execute one or more APIs, such as monitor workload API 1002, to execute a first Application Programming Interface (API) to cause a second API to be executed to provide a status of one or more software workloads using one or more parameters including, but not limited to, job identifier 1004 and/or other parameters 1006.
In at least one embodiment, the monitoring workload API 1002, if invoked, causes one or more APIs (such as one or more APIs 806 described herein in connection with at least FIG. 8) to add, insert, or otherwise include one or more operations or instructions in a stream or instruction set executed by one or more accelerators within a heterogeneous processor. In at least one embodiment, the monitoring workload API 1002, if invoked, causes one or more APIs (such as one or more APIs 806) in a parallel computing environment (such as parallel computing environment 808 described herein in connection with at least FIG. 8) to add, insert, or otherwise include one or more operations or instructions in a stream or set of instructions executed by one or more accelerators within a heterogeneous processor.
In at least one embodiment, in response to monitoring workload API 1002, one or more APIs 806, if executed, cause the one or more processors to execute monitoring workload API return 1020. In at least one embodiment, the monitoring workload API return 1020 is a set of instructions which, if executed, generates and/or indicates one or more data values in response to the monitoring workload API 1002. In at least one embodiment, the monitoring workload API returns 1020 an indicator of success 1022. In at least one embodiment, success indicator 1022 is data that includes any value for indicating the success of monitoring workload API 1002. In at least one embodiment, success indicator 1022 includes information indicating one or more particular types of success generated as a result of executing monitoring workload API 1002. In at least one embodiment, success indicator 1022 includes information indicating one or more other data values generated as a result of monitoring workload API 1002.
In at least one embodiment, the monitor workload API return 1020 indicates an error indicator 1024. In at least one embodiment, error indicator 1024 is data comprising any value for indicating failure to monitor workload API 1002. In at least one embodiment, the error indicator 1024 includes information indicating one or more particular types of errors generated as a result of executing the monitoring workload API 1002. In at least one embodiment, error indicator 1024 includes information indicating one or more other data values generated as a result of monitoring workload API 1002.
In at least one embodiment, the monitor workload API return 1020 indicates a workload state 1026. In at least one embodiment, the workload state 1026 is data comprising any value indicative of one or more states of a workload to be monitored that are obtained as a result of executing the monitor workload API 1002. In at least one embodiment, the workload state 1026 includes information indicating one or more other data values generated as a result of monitoring the workload API 1002.
In at least one embodiment, the parallel computing environment 808 includes one or more APIs 806, including but not limited to the monitoring workload API 1002, that add various types of operations to streams to be executed by one or more accelerators in heterogeneous processors. In at least one embodiment, the stream operations include an acquire-semaphore operation. In at least one embodiment, the stream operations include a release-semaphore operation. In at least one embodiment, the stream operations include one or more operations to flush and/or invalidate cache memory (such as an L2 cache memory of a PPU such as a GPU and/or a cache memory of one or more accelerators within a heterogeneous processor). In at least one embodiment, the stream operations include one or more operations for indicating submission of an operation to an external device (such as one or more accelerators in a heterogeneous processor). In at least one embodiment, one or more operations for indicating submission of an operation to an external device use software code, such as the example software code indicating a stream operation described herein in connection with at least FIG. 9.
In at least one embodiment, the parallel computing environment 808 includes one or more APIs 806, including but not limited to the monitoring workload API 1002, which includes one or more function signatures that may be used to indicate one or more callback functions for operations to be performed by one or more accelerators within a heterogeneous processor. In at least one embodiment, one or more operations cause one or more callback functions to be executed. In at least one embodiment, the one or more operations for causing the one or more callback functions to be executed use software code, such as the example software code indicating a function signature for a callback function, as described herein at least in connection with FIG. 9.
In at least one embodiment, to specify that one or more accelerators within a heterogeneous processor are to perform one or more operations indicated to the one or more APIs 806 by the monitoring workload API 1002, one or more data structures of the one or more APIs 806 may be used to specify one or more external devices to which the one or more APIs 806 are to submit the one or more operations. In at least one embodiment, the one or more data structures used to specify the one or more external devices use software code, such as the example software code indicating a data structure representing a device node for one or more accelerators within a heterogeneous processor, as described herein at least in connection with FIG. 9.
In at least one embodiment, one or more data structures of the one or more APIs 806 are to be used in order to specify the type and data of one or more operations to be performed by one or more accelerators within a heterogeneous processor. In at least one embodiment, the one or more data structures used to specify the type and data of the one or more operations use software code, such as the example software code indicating a data structure for specifying the type and data of one or more operations to be performed by one or more accelerators within a heterogeneous processor, as described herein at least in connection with FIG. 9.
In at least one embodiment, the one or more APIs 806 include instructions that, if executed, cause one or more operations or instructions to be added to a stream or other set of instructions to be executed by one or more accelerators within the heterogeneous processor. In at least one embodiment, instructions for causing one or more operations or instructions to be added to a stream or other instruction set will be executed in response to monitoring the workload API 1002, as described above. In at least one embodiment, instructions for causing one or more operations or instructions to be added to a stream or other set of instructions to be executed in response to monitoring the workload API 1002 use software code, such as example software code indicating a stream operation API call in the parallel computing environment 808, as described herein at least in connection with FIG. 9.
In at least one embodiment, the one or more APIs 806 include instructions that, if executed, cause one or more operations or instructions to be performed by one or more accelerators within the heterogeneous processor to be added to one or more executable graphs. In at least one embodiment, the instructions, if executed, cause one or more operations or instructions to be performed by one or more accelerators within the heterogeneous processor to be added to one or more executable graphs, similar to how one or more operations or instructions to be performed by one or more accelerators within the heterogeneous processor are added to one or more streams or instruction sets in response to the monitor workload API 1002, as described herein. In at least one embodiment, the instructions that, if executed, cause one or more operations or instructions to be performed by one or more accelerators within the heterogeneous processor to be added to one or more executable graphs use software code, such as example software code that instructs one or more APIs 806 of the parallel computing environment 808 to add one or more operations or instructions to one or more executable graphs, as described herein at least in connection with fig. 9.
FIG. 11 is a block diagram 1100 illustrating an Application Programming Interface (API) for terminating one or more software workloads, according to at least one embodiment. In at least one embodiment, one or more circuits of the processor are to execute the terminate workload API 1102 to terminate one or more software workloads of a computing environment, such as the computing environment 102 described herein in connection with at least FIG. 1. In at least one embodiment, one or more circuits of the processor are to execute a termination workload API 1102 to terminate one or more software workloads that are launched using at least the launch workload API 902 described herein in connection with FIG. 9. In at least one embodiment, not shown in fig. 11, one or more circuits of a processor, such as those described herein, execute one or more instructions to execute a termination workload API 1102 to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, not shown in fig. 11, one or more circuits of a processor, such as those described herein, execute one or more instructions to execute a termination workload API 1102 to execute a first Application Programming Interface (API) to cause a second API to be executed to cause termination of one or more software workloads being executed by one or more other processors. In at least one embodiment, also not shown in fig. 11, one or more circuits of a processor, such as described herein, execute one or more instructions to execute a termination workload API 1102 to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API in response to receiving additional APIs, such as described herein.
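The patent describes the first-API/second-API relationship only abstractly. As an illustrative aid, the following minimal Python sketch (all names hypothetical, not taken from the patent) models a first API that selects among registered second APIs and invokes the selected one to terminate a workload identified by a job identifier:

```python
# Hypothetical second APIs: each terminates a workload in a different environment.
def _terminate_local(job_id):
    # terminate a workload running on this processor
    return {"job": job_id, "where": "local", "state": "terminated"}

def _terminate_remote(job_id):
    # terminate a workload being executed by one or more other processors
    return {"job": job_id, "where": "remote", "state": "terminated"}

# Registry the first API consults when selecting a second API.
_SECOND_APIS = {"local": _terminate_local, "remote": _terminate_remote}

def terminate_workload_api(job_id, location="local"):
    """First API: selects a second API, which terminates the workload
    identified by the job_id parameter passed to the first API."""
    second_api = _SECOND_APIS[location]   # selection step
    return second_api(job_id)             # second API performs the termination
```

The dictionary lookup stands in for whatever selection logic an implementation would use (e.g., choosing a driver- or runtime-level termination routine based on where the workload runs).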
In at least one embodiment, the terminate workload API 1102, when invoked, receives one or more parameters to indicate information about operations to be performed using techniques such as those described herein. In at least one embodiment, the terminate workload API 1102, when invoked, receives one or more parameters to indicate information about instructions to be executed using techniques such as those described herein.
In at least one embodiment, the terminate workload API 1102 receives as input one or more parameters including a job identifier 1104. In at least one embodiment, the job identifier 1104 is a data value that includes information that can be used to identify, indicate, or otherwise specify one or more workloads to terminate using the terminate workload API 1102. In at least one embodiment, the job identifier 1104 is a job identifier returned by the launch workload API 902 described herein (such as job identifier 926). In at least one embodiment, the job identifier identified, indicated, or otherwise specified by the job identifier 1104 is one of a plurality of parameters that may be used by the terminate workload API 1102 to terminate one or more software workloads. In at least one embodiment, the job identifier 1104 is a data value that is used to identify, indicate, or otherwise specify, to an API (such as the terminate workload API 1102), a set of operations or instructions to be performed by one or more PPUs (such as GPUs) and/or one or more accelerators within a heterogeneous processor, as described herein.
In at least one embodiment, the terminate workload API 1102 receives as input one or more parameters including one or more other parameters 1106. In at least one embodiment, the other parameters 1106 are data comprising any other information usable when executing the terminate workload API 1102 to terminate one or more software workloads. In at least one embodiment, one or more of the job identifier 1104 and/or the other parameters 1106 are required parameters of the terminate workload API 1102. In at least one embodiment, one or more of the job identifier 1104 and/or the other parameters 1106 are optional parameters of the terminate workload API 1102.
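A short Python sketch of the parameter handling described above (names and the keyword-argument form are illustrative assumptions, not the patent's interface): the job identifier is treated as required, while the other parameters are accepted as an open-ended optional set.

```python
def terminate_workload(job_identifier, **other_params):
    """Hypothetical sketch: job_identifier is required (like job identifier
    1104); other_params model the optional other parameters 1106."""
    if not job_identifier:
        raise ValueError("a job identifier is required")
    request = {"job": job_identifier}
    request.update(other_params)  # e.g. a hypothetical force flag or timeout
    return request
```

A caller might pass only the identifier, or add optional information such as `terminate_workload("job-42", force=True)`.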
In at least one embodiment, not shown in fig. 11, the processor executes one or more instructions to execute one or more APIs, such as termination workload API 1102, to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API using one or more parameters including, but not limited to, job identifier 1104 and/or other parameters 1106. In at least one embodiment, not shown in fig. 11, the processor executes one or more instructions to execute one or more APIs, such as termination workload API 1102, to execute a first Application Programming Interface (API) to cause a second API to be executed such that a state of one or more software workloads is provided using one or more parameters including, but not limited to, job identifier 1104 and/or other parameters 1106.
In at least one embodiment, the terminate workload API 1102, if invoked, causes one or more APIs (such as one or more APIs 806 described herein in connection with at least FIG. 8) to cause one or more operations or instructions to be added, inserted, or otherwise included in a stream or instruction set to be executed by one or more accelerators within the heterogeneous processor. In at least one embodiment, the terminate workload API 1102, if invoked, causes one or more APIs (such as one or more APIs 806) of a parallel computing environment (such as parallel computing environment 808 described herein in connection with at least FIG. 8) to cause one or more operations or instructions to be added, inserted, or otherwise included in a stream or set of instructions to be executed by one or more accelerators within a heterogeneous processor.
In at least one embodiment, in response to the terminate workload API 1102, the one or more APIs 806, if executed, cause the one or more processors to execute the terminate workload API return 1120. In at least one embodiment, the terminate workload API return 1120 is an instruction set that, if executed, generates and/or indicates one or more data values in response to the terminate workload API 1102. In at least one embodiment, the terminate workload API return 1120 indicates a success indicator 1122. In at least one embodiment, the success indicator 1122 is data comprising any value that indicates success of the terminate workload API 1102. In at least one embodiment, the success indicator 1122 includes information indicating one or more particular types of success generated as a result of executing the terminate workload API 1102. In at least one embodiment, the success indicator 1122 includes information indicating one or more other data values generated as a result of the terminate workload API 1102.
In at least one embodiment, the terminate workload API return 1120 indicates an error indicator 1124. In at least one embodiment, the error indicator 1124 is data comprising any value that indicates a failure of the terminate workload API 1102. In at least one embodiment, the error indicator 1124 includes information indicating one or more particular types of errors generated as a result of executing the terminate workload API 1102. In at least one embodiment, the error indicator 1124 includes information indicating one or more other data values generated as a result of the terminate workload API 1102.
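The success and error indicators can be illustrated with a small sketch (a hypothetical model, not the patent's actual return structure): termination of a known job yields a success indicator, while an unknown job identifier yields an error indicator with additional detail.

```python
# Hypothetical in-memory set of active jobs standing in for a job registry.
_ACTIVE_JOBS = {"job-1", "job-2"}

def terminate_workload(job_identifier):
    """Returns a dict loosely modeling terminate workload API return 1120."""
    if job_identifier in _ACTIVE_JOBS:
        _ACTIVE_JOBS.discard(job_identifier)
        return {"success": True}                       # success indicator 1122
    return {"success": False,
            "error": "unknown job identifier"}         # error indicator 1124
```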
In at least one embodiment, the parallel computing environment 808, including one or more APIs 806 (which include, but are not limited to, the terminate workload API 1102), adds various types of operations to a stream for execution by one or more accelerators within a heterogeneous processor. In at least one embodiment, the stream operations include an acquire semaphore operation. In at least one embodiment, the stream operations include a release semaphore operation. In at least one embodiment, the stream operations include one or more operations to flush and/or invalidate cache memory, such as an L2 cache memory of a PPU (such as a GPU) within a heterogeneous processor, and/or a cache memory of one or more accelerators. In at least one embodiment, the stream operations include one or more operations for indicating submission of operations to an external device (such as one or more accelerators within a heterogeneous processor). In at least one embodiment, the one or more operations for indicating submission of operations to an external device use software code, such as example software code indicating stream operations, as described herein in connection with at least fig. 9.
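The stream operations listed above can be sketched as entries enqueued onto an ordered stream. This is a hedged illustration (the enum values, `Stream` class, and ordering are assumptions); it shows a plausible termination sequence of acquire, submit, cache flush, and release:

```python
from enum import Enum, auto

class StreamOp(Enum):
    ACQUIRE_SEMAPHORE = auto()
    RELEASE_SEMAPHORE = auto()
    FLUSH_CACHE = auto()
    SUBMIT_TO_DEVICE = auto()

class Stream:
    """Minimal model of a stream: an ordered list of (operation, args) pairs."""
    def __init__(self):
        self.ops = []

    def enqueue(self, kind, **args):
        self.ops.append((kind, args))

def enqueue_termination(stream, job_id):
    # acquire the device, submit the terminate operation, flush the L2 cache,
    # then release -- one hypothetical ordering of the operations described
    stream.enqueue(StreamOp.ACQUIRE_SEMAPHORE)
    stream.enqueue(StreamOp.SUBMIT_TO_DEVICE, operation="terminate", job=job_id)
    stream.enqueue(StreamOp.FLUSH_CACHE, level="L2")
    stream.enqueue(StreamOp.RELEASE_SEMAPHORE)
```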
In at least one embodiment, the parallel computing environment 808 including one or more APIs 806 (which include, but are not limited to, the termination workload API 1102) includes one or more function signatures that can be used to indicate one or more callback functions for operations to be performed by one or more accelerators within the heterogeneous processor. In at least one embodiment, one or more operations cause one or more callback functions to be executed. In at least one embodiment, the one or more operations for causing the one or more callback functions to be performed use software code, such as example software code indicating a function signature for the callback functions, as described herein at least in connection with fig. 9.
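A function signature for such a callback might look like the following sketch (the signature and status-dict shape are assumptions for illustration): the operation completes and then invokes the registered callback.

```python
from typing import Callable, Dict

# Hypothetical callback signature: receives a status dict, returns nothing.
WorkloadCallback = Callable[[Dict[str, str]], None]

completed = []

def on_complete(status):
    # record the status reported by the operation
    completed.append(status)

def run_operation(callback: WorkloadCallback):
    # the accelerator would perform the operation here, then fire the callback
    callback({"state": "completed"})
```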
In at least one embodiment, to specify one or more accelerators within a heterogeneous processor that are to perform one or more operations indicated to one or more APIs 806 by the terminate workload API 1102, one or more data structures of the one or more APIs 806 may be used to specify one or more external devices to which the one or more APIs 806 are to submit the one or more operations. In at least one embodiment, the one or more data structures of the one or more APIs 806 that are usable to specify the one or more external devices to which the one or more APIs 806 are to submit the one or more operations use software code, such as example software code indicating a data structure of a device node representing one or more accelerators within a heterogeneous processor, as described herein at least in connection with fig. 9.
In at least one embodiment, to specify the type of, and data for, one or more operations to be performed by one or more accelerators within a heterogeneous processor, one or more data structures of one or more APIs 806 are to be used. In at least one embodiment, the one or more data structures of the one or more APIs 806 for specifying the type of, and data for, the one or more operations use software code, such as example software code that indicates a data structure for specifying the type of and data for one or more operations to be performed by one or more accelerators within a heterogeneous processor, as described herein at least in connection with fig. 9.
In at least one embodiment, the one or more APIs 806 include instructions that, if executed, cause one or more operations or instructions to be added to a stream or other set of instructions to be executed by one or more accelerators within the heterogeneous processor. In at least one embodiment, the instructions that cause one or more operations or instructions to be added to a stream or other instruction set are executed in response to the terminate workload API 1102, as described above. In at least one embodiment, the instructions that cause one or more operations or instructions to be added to a stream or other set of instructions to be executed in response to the terminate workload API 1102 use software code, such as example software code indicating a stream operation API call in the parallel computing environment 808, as described herein at least in connection with FIG. 9.
In at least one embodiment, the one or more APIs 806 include instructions that, if executed, cause one or more operations or instructions to be performed by one or more accelerators within the heterogeneous processor to be added to one or more executable graphs. In at least one embodiment, the instructions, if executed, cause one or more operations or instructions to be performed by one or more accelerators within the heterogeneous processor to be added to one or more executable graphs, similar to how one or more operations or instructions to be performed by one or more accelerators within the heterogeneous processor are added to one or more streams or instruction sets in response to the terminate workload API 1102, as described herein. In at least one embodiment, the instructions that, if executed, cause one or more operations or instructions to be performed by one or more accelerators within the heterogeneous processor to be added to one or more executable graphs use software code, such as example software code that instructs one or more APIs 806 of the parallel computing environment 808 to add one or more operations or instructions to one or more executable graphs, as described herein at least in connection with fig. 9.
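An executable graph, as opposed to a stream, records operations as nodes with explicit dependency edges. The following is a minimal sketch under assumed names (the node/edge representation is illustrative, not the patent's data structure):

```python
class ExecutableGraph:
    """Toy model: nodes are operations, edges are execution dependencies."""
    def __init__(self):
        self.nodes = []
        self.edges = []

    def add_node(self, operation):
        self.nodes.append(operation)
        return len(self.nodes) - 1   # handle used to connect dependencies

    def add_edge(self, src, dst):
        # dst may only execute after src completes
        self.edges.append((src, dst))

# Build a hypothetical termination sequence as a graph.
graph = ExecutableGraph()
acquire = graph.add_node("acquire_semaphore")
terminate = graph.add_node("terminate_workload")
release = graph.add_node("release_semaphore")
graph.add_edge(acquire, terminate)
graph.add_edge(terminate, release)
```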
FIG. 12 illustrates a process 1200 for executing one or more Application Programming Interfaces (APIs) in accordance with at least one embodiment. In at least one embodiment, process 1200 is a process for executing one or more APIs by a parallel computing environment (such as parallel computing environment 808 as described herein in connection with at least fig. 8) using one or more accelerators within a heterogeneous processor. In at least one embodiment, a process 1200 for executing one or more Application Programming Interfaces (APIs) begins 1202 at step 1204 whereby one or more processors are configured to execute a software program comprising one or more instructions that, if executed, cause the one or more processors and/or one or more other processors (such as a Graphics Processing Unit (GPU) and/or one or more accelerators within one or more heterogeneous processors) to perform one or more computing operations. In at least one embodiment, at step 1204, a software program to be executed by the one or more processors includes one or more instructions that, if executed, cause one or more APIs 806 of the parallel computing environment 808 to be executed, as described above. In at least one embodiment, after step 1204, process 1200 continues at step 1206.
In at least one embodiment, at step 1206, the processor executing process 1200 determines whether to execute an API such as those described herein in connection with at least FIGS. 9-11 (e.g., start workload API 902, monitor workload API 1002, and/or terminate workload API 1102). In at least one embodiment, if it is determined at step 1206 that the API is not to be executed (the "NO" branch), process 1200 continues at step 1216. In at least one embodiment, if it is determined at step 1206 that an API is to be executed ("Yes" branch), then process 1200 continues at step 1208.
In at least one embodiment, at step 1208, a processor executing process 1200 executes an API such as those described herein in connection with at least FIGS. 9-11. In at least one embodiment, at step 1208, the one or more processors are to execute the one or more instructions to cause one or more API calls (e.g., launch workload API 902, monitor workload API 1002, and/or terminate workload API 1102) such as described herein at least in connection with fig. 9-11 to be performed by the one or more processors and/or one or more other processors (such as GPUs and/or accelerators within heterogeneous processors), as described above. In at least one embodiment, after step 1208, the process 1200 continues at step 1210.
In at least one embodiment, at step 1210, the processor executing process 1200 determines whether a return value is to be returned as a result of executing one or more instructions, such that one or more API calls (e.g., start workload API 902, monitor workload API 1002, and/or terminate workload API 1102) such as described herein in connection with at least fig. 9-11 are executed by the one or more processors and/or one or more other processors (such as GPUs and/or accelerators within heterogeneous processors), as described above. In at least one embodiment, at step 1210, the processor executing process 1200 determines whether to return a return value using an API return such as described herein at least in connection with FIGS. 9-11 (e.g., initiate workload API return 920, monitor workload API return 1020, and/or terminate workload API return 1120). In at least one embodiment, if, at step 1210, it is determined that a return value is to be returned ("yes" branch), process 1200 continues at step 1212. In at least one embodiment, if it is determined at step 1210 that the return value is not returned ("no" branch), process 1200 continues at step 1214.
In at least one embodiment, at step 1212, a return value is set. In at least one embodiment, at step 1212, the return value is set by storing the return value in a memory location specified by an API such as described herein in connection with at least FIGS. 9-11 (e.g., start workload API 902, monitor workload API 1002, and/or terminate workload API 1102). In at least one embodiment, at step 1212, the return value is set by storing the return value in a memory location included in an API return such as described herein at least in connection with FIGS. 9-11 (e.g., initiate workload API return 920, monitor workload API return 1020, and/or terminate workload API return 1120). In at least one embodiment, after step 1212, process 1200 continues at step 1214.
In at least one embodiment, at step 1214, a success or failure (e.g., an error) is returned using an API return such as described herein at least in connection with FIGS. 9-11 (e.g., initiate workload API return 920, monitor workload API return 1020, and/or terminate workload API return 1120). In at least one embodiment, after step 1214, process 1200 continues at step 1216.
In at least one embodiment, at step 1216, the processor executing process 1200 determines whether execution of the software program (e.g., at step 1204) is complete. In at least one embodiment, at step 1216, the processor executing process 1200 determines whether execution of the software program (e.g., at step 1204) is complete based at least in part on whether one or more processors are executing instructions of the software program (e.g., at step 1204). In at least one embodiment, at step 1216, if it is determined that execution of the software program (e.g., at step 1204) has been completed, process 1200 ends 1218. In at least one embodiment, if it is determined at step 1216 that execution of the software program (e.g., at step 1204) is not complete, process 1200 continues at step 1204 to continue executing one or more instructions of the software program.
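The control flow of process 1200 (steps 1204 through 1216) can be summarized in a short sketch; the instruction dicts and loop structure are illustrative assumptions standing in for the processor's actual execution of a software program:

```python
def run_program(instructions):
    """Sketch of process 1200: execute instructions; when one is an API call
    (step 1206), execute it (1208), optionally set a return value (1210/1212),
    and report success or error (1214) until the program completes (1216)."""
    log = []
    for inst in instructions:               # step 1204: execute the program
        if not inst.get("is_api_call"):     # step 1206, "no" branch
            continue
        try:
            value = inst["api"]()           # step 1208: execute the API
            if inst.get("wants_return"):    # step 1210: return value needed?
                log.append(value)           # step 1212: set return value
            log.append("success")           # step 1214: report success
        except Exception:
            log.append("error")             # step 1214: report failure
    return log                              # step 1216: program complete
```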
In at least one embodiment, the operations of process 1200 for executing one or more Application Programming Interfaces (APIs) are performed in a different order than shown in fig. 12. In at least one embodiment, the operations of process 1200 for executing one or more Application Programming Interfaces (APIs) are performed simultaneously or in parallel. For example, in at least one embodiment, operations of process 1200 for executing one or more Application Programming Interfaces (APIs) are performed simultaneously or in parallel, independent of each other (e.g., order independent). In at least one embodiment, the operations of process 1200 for executing one or more Application Programming Interfaces (APIs) are performed by multiple threads executing on a processor such as described herein.
FIG. 13 is a block diagram 1300 illustrating an example software stack in which an Application Programming Interface (API) is processed in accordance with at least one embodiment. In at least one embodiment, an API such as the launch workload API 902 described herein in connection with at least FIG. 9 is processed using the software stack shown in block 1300 to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, an API, such as the monitor workload API 1002 described herein in connection with at least FIG. 10, is processed using the software stack shown in block 1300 to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, an API, such as termination workload API 1102 described herein in connection with at least FIG. 11, is processed using the software stack shown in block 1300 to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, the software stacks shown in block 1300 are software stacks such as those described herein at least in connection with fig. 47-50. In at least one embodiment, the software stack shown in block 1300 is a software stack such as software stack 408 and/or software stack 416 described herein in connection with at least fig. 4. In at least one embodiment, the application 1302 executes a command to determine whether a feature 1304 is supported. In at least one embodiment, the application 1302 executes commands to determine whether features 1304, such as those APIs described herein, are supported for execution.
In at least one embodiment, the application 1302 uses 1306 one or more runtime APIs 1308 to determine whether to support the feature 1304. In at least one embodiment, the runtime API 1308 uses 1310 one or more driver APIs 1312 to determine whether to support the feature 1304. In at least one embodiment, not shown in FIG. 13, the application 1302 uses one or more driver APIs 1312 to determine whether to support the feature 1304. In at least one embodiment, the driver API 1312 queries 1314 the computer system hardware 1316 to determine whether to support the feature 1304.
In at least one embodiment, the computer system hardware 1316 determines whether the processor 1334 supports the feature 1304 by querying a set of capabilities associated with the processor 1334. In at least one embodiment, the processor 1334 includes one or more processors such as described herein (e.g., one or more of the processors 114 and/or 104 described herein in connection with at least fig. 1). In at least one embodiment, the computer system hardware 1316 uses the operating system of the processor 1334 to determine whether the processor 1334 supports the feature 1304. In at least one embodiment, the computer system hardware 1316 determines whether the graphics processor 1336 supports a feature by querying a set of capabilities associated with the graphics processor 1336. In at least one embodiment, graphics processor 1336 includes one or more graphics processors such as described herein (e.g., one or more of graphics processors 106 described herein in connection with at least fig. 1). In at least one embodiment, the computer system hardware 1316 uses the operating system of the processor 1334 to determine whether the graphics processor 1336 supports the feature 1304. In at least one embodiment, the computer system hardware 1316 uses the operating system of the graphics processor 1336 to determine whether the graphics processor 1336 supports the feature 1304.
In at least one embodiment, after the computer system hardware 1316 determines whether the feature 1304 is supported, the computer system hardware 1316 returns 1318 the determination using the driver API 1312, the driver API 1312 may return 1320 the determination using the runtime API 1308, and the runtime API 1308 may return 1322 the determination to the application 1302. In at least one embodiment, if the application 1302 receives a determination indicating that the feature 1304 is supported 1324, the application 1302 executes the feature 1326 using one or more APIs, such as those described herein. In at least one embodiment, the application 1302 executes the feature 1326 using systems and methods such as those described herein. In at least one embodiment, the application 1302 executes the feature 1326 using 1328 the runtime API 1308, the runtime API 1308 including, but not limited to, runtime versions of APIs such as those described herein at least in connection with FIGS. 9-11.
In at least one embodiment, runtime API 1308 executes feature 1326 using 1330 driver API 1312, driver API 1312 including, but not limited to, a driver version of an API such as described herein. In at least one embodiment, not shown in FIG. 13, application 1302 executes feature 1326 using 1330 driver API 1312. In at least one embodiment, the driver API 1312 uses 1332 the computer system hardware 1316 to execute the feature 1326.
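The layered query of FIG. 13 — application asks the runtime API, which asks the driver API, which queries hardware capabilities — can be sketched as a chain of delegating objects (class and method names are assumptions for illustration):

```python
class Hardware:
    """Models computer system hardware 1316 with a capability set."""
    def __init__(self, features):
        self.features = set(features)

    def query(self, feature):
        return feature in self.features        # query 1314

class DriverAPI:
    """Models driver API 1312: delegates the query to hardware."""
    def __init__(self, hardware):
        self.hardware = hardware

    def supports(self, feature):
        return self.hardware.query(feature)

class RuntimeAPI:
    """Models runtime API 1308: delegates the query to the driver API."""
    def __init__(self, driver):
        self.driver = driver

    def supports(self, feature):
        return self.driver.supports(feature)   # use 1310

# Hypothetical capability set; a real system would query the device.
runtime = RuntimeAPI(DriverAPI(Hardware({"launch_workload"})))
```

An application would call `runtime.supports("launch_workload")` before executing the feature, mirroring the determine-then-execute flow of block diagram 1300.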
Fig. 14 is a block diagram 1400 illustrating a processor 1402 and modules in accordance with at least one embodiment. In at least one embodiment, processor 1402 executes one or more processes, such as those described herein, to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, the processor 1402 executes the process to cause one or more circuits to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API using the systems, methods, operations, and techniques described in connection with FIGS. 1-13.
In at least one embodiment, processor 1402 executes one or more processes, such as those described herein, to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, the processor 1402 executes the process to cause one or more circuits to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API using the systems, methods, operations, and techniques described in connection with FIGS. 1-13.
In at least one embodiment, processor 1402 executes one or more processes, such as those described herein, to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, the processor 1402 executes the process to cause one or more circuits to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API using the systems, methods, operations, and techniques described in connection with FIGS. 1-13.
In at least one embodiment, the processor 1402 includes one or more processors, such as those described in connection with FIGS. 16-53. In at least one embodiment, the processor 1402 is a processor such as the processor 114, one or more of the processors 104, and/or one or more of the graphics processors 106 described herein in connection with at least FIG. 1. In at least one embodiment, the processor 1402 is any suitable processing unit and/or combination of processing units, such as one or more CPUs, GPUs, GPGPUs, PPUs, and/or variants thereof. In at least one embodiment, the processor 1402 includes or has access to a client module 1404, a high-performance computing module 1406, a launch workload module 1408, a monitor workload module 1410, and a terminate workload module 1412. In at least one embodiment, the client module 1404, the high-performance computing module 1406, the launch workload module 1408, the monitor workload module 1410, and the terminate workload module 1412 are part of the processor 1402 and/or one or more other processors such as those described herein. In at least one embodiment, the client module 1404, the high-performance computing module 1406, the launch workload module 1408, the monitor workload module 1410, and the terminate workload module 1412 are distributed among multiple processors that communicate via buses, networks, by writing to shared memory, and/or any suitable communication process, as described herein.
In at least one embodiment, a module, as used in any implementation described herein, refers to any combination of software logic, firmware logic, hardware logic, and/or circuitry configured to provide the functionality described herein, unless explicitly or implicitly stated to the contrary from the context. In at least one embodiment, software may be embodied as a software package, code, and/or instruction set or instructions, and "hardware," as used in any implementation described herein, may include, for example, singly or in any combination, hardwired circuitry, programmable circuitry, state machine circuitry, fixed-function circuitry, execution unit circuitry, and/or firmware that stores instructions executed by programmable circuitry. In at least one embodiment, the modules may be collectively or individually embodied as circuitry that forms part of a larger system (e.g., an integrated circuit (IC), a system on a chip (SoC), etc.). In at least one embodiment, the modules perform one or more processes in connection with any suitable processing unit and/or combination of processing units (such as one or more CPUs, GPUs, GPGPUs, PPUs, and/or variants thereof).
In at least one embodiment, the processor 1402 executes or otherwise implements one or more client environments, such as those described herein, using the client module 1404. In at least one embodiment, the processor 1402 executes one or more APIs such as described herein (e.g., the launch workload API 902, the monitor workload API 1002, and/or the terminate workload API 1102) using the client module 1404. In at least one embodiment, the processor 1402 uses the client module 1404 to execute one or more processes such as those described herein, the client module 1404 at least including or otherwise encoding instructions that, if executed (e.g., by the processor 1402), cause the one or more processes to be executed. In at least one embodiment, a processor using the client module 1404 obtains or is otherwise provided with one or more APIs such as those described herein. In at least one embodiment, the processor 1402 uses the client module 1404 to execute or otherwise implement one or more client environments using the systems, methods, operations, and techniques described herein in connection with at least FIGS. 1-13. In at least one embodiment, the processor 1402 performs one or more operations using the client module 1404 to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, the processor 1402 performs one or more operations using the client module 1404 to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, the processor 1402 performs one or more operations using the client module 1404 to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API.
In at least one embodiment, the processor 1402 executes or otherwise implements a high-performance computing environment, such as one or more of those described herein, using a high-performance computing module 1406. In at least one embodiment, the processor 1402 executes one or more APIs (e.g., the start workload API 902, the monitor workload API 1002, and/or the terminate workload API 1102) such as described herein using the high-performance computing module 1406. In at least one embodiment, the processor 1402 executes one or more processes such as described herein using the high-performance computing module 1406 by including at least or otherwise encoding instructions that cause execution of the one or more processes or that are otherwise available for execution of the one or more processes (e.g., by the processor 1402). In at least one embodiment, the processor 1402 uses the high-performance computing module 1406 to execute one or more APIs in conjunction with the client module 1404. In at least one embodiment, the processor 1402 performs one or more operations using the high-performance computing module 1406 to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, the processor 1402 performs one or more operations using the high-performance computing module 1406 to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, the processor 1402 performs one or more operations using the high-performance computing module 1406 to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API.
In at least one embodiment, the processor 1402 initiates one or more software workloads using the initiation workload module 1408, as described herein. In at least one embodiment, launch workload module 1408 executes one or more processes such as described herein by including at least or otherwise encoding instructions that cause execution of the one or more processes or are otherwise available to execute the one or more processes (e.g., by processor 1402). In at least one embodiment, launch workload module 1408 causes one or more software workloads to be launched using the systems, methods, operations, and/or techniques described herein. In at least one embodiment, launch workload module 1408 causes one or more software workloads to be launched using an API such as launch workload API 902. In at least one embodiment, the processor performs one or more operations using the launch workload module 1408 to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, the processor performs one or more operations using the launch workload module 1408 to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, the processor performs one or more operations using the launch workload module 1408 to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API.
In at least one embodiment, the processor 1402 monitors one or more software workloads using the monitor workload module 1410, as described herein. In at least one embodiment, the monitoring workload module 1410 executes one or more processes such as described herein by including at least or otherwise encoding instructions that cause the execution of or are otherwise available to execute the one or more processes (e.g., by the processor 1402). In at least one embodiment, the monitor workload module 1410 enables monitoring of one or more software workloads using the systems, methods, operations, and/or techniques described herein. In at least one embodiment, the monitor workload module 1410 causes monitoring of one or more software workloads using an API, such as monitor workload API 1002. In at least one embodiment, the processor performs one or more operations using the monitor workload module 1410 to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, the processor performs one or more operations using the monitor workload module 1410 to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, the processor performs one or more operations using the monitor workload module 1410 to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API.
In at least one embodiment, the processor 1402 terminates one or more software workloads as described herein using a termination workload module 1412. In at least one embodiment, the termination workload module 1412 executes one or more processes such as described herein by including at least or otherwise encoding instructions that cause execution of the one or more processes or are otherwise available for execution of the one or more processes (e.g., by the processor 1402). In at least one embodiment, the termination workload module 1412 causes one or more software workloads to be terminated using the systems, methods, operations, and/or techniques described herein. In at least one embodiment, the terminate workload module 1412 causes one or more software workloads to be terminated using an API such as the terminate workload API 1102. In at least one embodiment, the processor performs one or more operations using the termination workload module 1412 to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, the processor performs one or more operations using the termination workload module 1412 to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, the processor performs one or more operations using a termination workload module 1412 to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API.
In at least one embodiment, the processor 1402 includes circuitry for causing one or more circuits of the processor 1402 to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API using one or more of the client module 1404, the high performance computing module 1406, the start workload module 1408, the monitor workload module 1410, and/or the terminate workload module 1412 utilizing at least the systems, methods, operations, and/or techniques described herein in connection with fig. 1-13. In at least one embodiment, the processor 1402 includes circuitry for causing one or more circuits of the processor 1402 to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API using one or more of the client module 1404, the high performance computing module 1406, the start workload module 1408, the monitor workload module 1410, and/or the terminate workload module 1412 utilizing at least the systems, methods, operations, and/or techniques described herein in connection with fig. 1-13. In at least one embodiment, the processor 1402 includes circuitry for causing one or more circuits of the processor 1402 to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API using one or more of the client module 1404, the high performance computing module 1406, the start workload module 1408, the monitor workload module 1410, and/or the terminate workload module 1412 utilizing at least the systems, methods, operations, and/or techniques described herein in connection with fig. 1-13.
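The first-API-selects-second-API arrangement described above can be modeled, purely as an illustrative sketch and not as the patented implementation, by a dispatching function in C that identifies workloads and selects the function that will terminate them; every name below (the workload table, the terminator functions, and `terminate_workload_api`) is a hypothetical stand-in, not part of any real driver interface:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical workload table and status codes; illustrative assumptions only. */
typedef enum { WL_RUNNING, WL_TERMINATED } wl_state_t;
static wl_state_t workloads[4] = { WL_RUNNING, WL_RUNNING, WL_RUNNING, WL_RUNNING };

/* Candidate "second APIs": each knows how to stop one class of workload. */
static int terminate_local(int id)  { workloads[id] = WL_TERMINATED; return 0; }
static int terminate_remote(int id) { workloads[id] = WL_TERMINATED; return 0; }

typedef int (*terminate_fn)(int);

/* "First API": identifies the workloads to stop and selects the second API
 * that actually terminates each of them (here via a simple locality flag). */
int terminate_workload_api(const int *ids, size_t n, int is_remote)
{
    terminate_fn second_api = is_remote ? terminate_remote : terminate_local;
    for (size_t i = 0; i < n; ++i)
        if (second_api(ids[i]) != 0)
            return -1;
    return 0;
}
```

The key design point the sketch captures is that the caller names *which* workloads to act on, while the first API decides *which* second API performs the termination.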
FIG. 15 is a block diagram 1500 illustrating a driver and/or runtime including one or more libraries for providing one or more Application Programming Interfaces (APIs) in accordance with at least one embodiment. In at least one embodiment, software program 1502 is a software module. In at least one embodiment, software program 1502 includes one or more software modules including, but not limited to, the software modules described herein in connection with at least FIG. 14. In at least one embodiment, the software modules are as further described herein in connection with at least FIG. 14. In at least one embodiment, the one or more APIs 1510 are a set of software instructions that, if executed, cause one or more processors to perform one or more computing operations. In at least one embodiment, the one or more APIs 1510 include one or more of a start workload API 902, a monitor workload API 1002, and/or a terminate workload API 1102. In at least one embodiment, the one or more APIs 1510 are a set of software instructions that, if executed, cause the one or more processors to perform one or more computing operations to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, the one or more APIs 1510 are a set of software instructions that, if executed, cause the one or more processors to perform one or more computing operations to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, the one or more APIs 1510 are a set of software instructions that, if executed, cause the one or more processors to perform one or more computing operations to execute a first Application Programming Interface (API) to select a second API to terminate execution of the one or more software workloads identified by the first API.
In at least one embodiment, one or more APIs 1510 are distributed or otherwise provided as part of one or more libraries 1506, drivers, and/or run-times 1504, and/or any other groupings of software and/or executable code described further herein. In at least one embodiment, one or more APIs 1510 perform one or more computing operations in response to a call by software program 1502. In at least one embodiment, software program 1502 is a collection of software code, commands, instructions, or other text sequences for instructing a computing device to perform one or more computing operations and/or to invoke one or more other sets of instructions to be executed, such as API 1510 or API function 1512. In at least one embodiment, the functionality provided by the one or more APIs 1510 includes software functions 1512, such as software functions that are operable to accelerate one or more portions of the software program 1502 using one or more Parallel Processing Units (PPUs), such as Graphics Processors (GPUs).
In at least one embodiment, the API 1510 is a hardware interface of one or more circuits for performing one or more computing operations. In at least one embodiment, one or more software APIs 1510 described herein are implemented as one or more circuits for performing one or more of the techniques described herein in connection with FIGS. 1-13. In at least one embodiment, one or more software programs 1502 include instructions that, if executed, cause one or more hardware devices and/or circuits to perform one or more of the techniques described herein in connection with fig. 1-13.
In at least one embodiment, a software program 1502, such as a user-implemented software program, utilizes one or more Application Programming Interfaces (APIs) 1510 to perform various computing operations, such as memory reservation, matrix multiplication, arithmetic operations, or any computing operation performed by a Parallel Processing Unit (PPU), such as a Graphics Processing Unit (GPU), as further described herein. In at least one embodiment, one or more APIs 1510 provide a set of callable functions 1512 (referred to herein as APIs, API functions, and/or functions) that each perform one or more computing operations, such as computing operations related to parallel computing. For example, in one embodiment, one or more APIs 1510 provide a function 1512 for starting a workload, monitoring a workload, and/or terminating a workload, as described herein.
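The set of callable functions 1512 for starting, monitoring, and terminating workloads can be sketched as a table of function pointers that a library exposes to a software program; the names and behaviors below are illustrative assumptions for this sketch, not an actual driver or runtime interface:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical API function table mirroring the start/monitor/terminate
 * workload functions described in the text; names are assumptions. */
typedef struct {
    int (*start_workload)(const char *name);  /* returns a workload id */
    int (*monitor_workload)(int id);          /* returns 1 while running */
    int (*terminate_workload)(int id);        /* returns 0 on success */
} workload_api_t;

static int next_id = 0;
static int running[8];   /* zero-initialized: no workload running */

static int do_start(const char *name) { (void)name; running[next_id] = 1; return next_id++; }
static int do_monitor(int id)         { return running[id]; }
static int do_terminate(int id)       { running[id] = 0; return 0; }

/* The library exposes one API object whose callable functions a software
 * program invokes, as with functions 1512 in the text. */
const workload_api_t WORKLOAD_API = { do_start, do_monitor, do_terminate };
```

A program would call `WORKLOAD_API.start_workload(...)`, poll `monitor_workload`, and finally `terminate_workload`, matching the lifecycle the surrounding text describes.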
In at least one embodiment, one or more software programs 1502 interact with or otherwise communicate with one or more APIs 1510 to perform one or more computing operations using one or more PPUs (such as a GPU). In at least one embodiment, one or more computing operations using one or more PPUs include at least one or more sets of computing operations to be accelerated by execution at least in part by the one or more PPUs. In at least one embodiment, one or more software programs 1502 interact with one or more APIs 1510 to facilitate parallel computing using a remote interface or a local interface.
In at least one embodiment, the interface is software instructions that, if executed, provide access to one or more functions 1512 provided by one or more APIs 1510. In at least one embodiment, the software programs 1502 use a local interface when a software developer compiles one or more software programs 1502 in conjunction with one or more libraries 1506 that include or otherwise provide access to one or more APIs 1510. In at least one embodiment, one or more software programs 1502 are statically compiled in conjunction with a precompiled library 1506 or uncompiled source code comprising instructions for executing one or more APIs 1510. In at least one embodiment, one or more software programs 1502 are dynamically compiled and linked to one or more precompiled libraries 1506 comprising one or more APIs 1510 using a linker.
In at least one embodiment, when a software developer executes a software program 1502, the software program 1502 uses a remote interface, in which the software program 1502 communicates with a library 1506 including one or more APIs 1510 via a network or other remote communication medium. In at least one embodiment, one or more libraries 1506, including one or more APIs 1510, will be executed by a remote computing service (such as a computing resource service provider). In another embodiment, the one or more libraries 1506, including the one or more APIs 1510, will be executed by any other computing host that provides the one or more APIs 1510 to the one or more software programs 1502.
In at least one embodiment, a processor executing or using one or more software programs 1502 invokes, uses, executes, or otherwise implements one or more APIs 1510 to allocate and otherwise manage memory to be used by the software programs 1502. In at least one embodiment, one or more software programs 1502 utilize one or more APIs 1510 to allocate and otherwise manage memory to be used by one or more portions of the software program 1502 to accelerate using one or more PPUs (such as GPUs) or any other accelerators or processors described further herein. These software programs 1502 request that the processor launch, monitor, and/or terminate a workload using functions 1512 provided in one embodiment by one or more APIs 1510.
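A minimal sketch of the allocate/use/free pattern described above, using host `malloc`/`free` as stand-ins for a PPU memory-management API (a real parallel-computing runtime, such as the CUDA runtime with `cudaMalloc`/`cudaFree`, would allocate accelerator memory instead); all names here are illustrative assumptions:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-ins for memory-management API calls provided by a
 * library or runtime; not a real API. */
static int api_malloc(void **ptr, size_t size) { *ptr = malloc(size); return *ptr ? 0 : -1; }
static void api_free(void *ptr) { free(ptr); }

/* A software program reserves memory through the API, hands it to a
 * workload, then releases it - the allocate/use/free lifecycle the text
 * describes for memory to be used by software program 1502. */
int run_workload_with_buffer(size_t n)
{
    void *buf = NULL;
    if (api_malloc(&buf, n) != 0)
        return -1;
    memset(buf, 0, n);   /* the workload would read/write this buffer */
    api_free(buf);
    return 0;
}
```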
In at least one embodiment, the API 1510 is an API for facilitating parallel computing. In at least one embodiment, the API 1510 is any other API further described herein. In at least one embodiment, the API 1510 is provided by a driver and/or runtime 1504. In at least one embodiment, API 1510 is provided by a CUDA user mode driver. In at least one embodiment, the API 1510 is provided by the CUDA runtime. In at least one embodiment, the driver and/or runtime 1504 are data values and software instructions that, if executed, perform or otherwise facilitate operation of one or more functions 1512 of the API 1510 during loading and execution of one or more portions of the software program 1502. In at least one embodiment, the driver and/or runtime 1504 are data values and software instructions that, if executed, perform or otherwise facilitate operation of one or more functions 1512 of the API 1510 during execution of the software program 1502. In at least one embodiment, one or more software programs 1502 utilize one or more APIs 1510 implemented or otherwise provided by a driver and/or runtime 1504 to perform combined arithmetic operations during execution of the one or more software programs 1502 by one or more PPUs (such as GPUs).
In at least one embodiment, one or more software programs 1502 utilize one or more APIs 1510 provided by the driver and/or runtime 1504 to perform the combined arithmetic operations of one or more PPUs (such as a GPU). In at least one embodiment, one or more APIs 1510 provide for combined arithmetic operations through drivers and/or runtime 1504, as described above. In at least one embodiment, one or more software programs 1502 allocate or otherwise reserve one or more blocks of memory 1514 of one or more PPUs (such as GPUs) using one or more APIs 1510 provided by a driver and/or runtime 1504. In at least one embodiment, one or more software programs 1502 allocate or otherwise reserve blocks of memory using one or more APIs 1510 provided by the driver and/or runtime 1504. In at least one embodiment, one or more APIs 1510 are used to perform the combined arithmetic operations, as described herein in connection with FIGS. 1-13.
To improve the usability of the software program 1502 and/or optimize one or more portions of the software program 1502 to be accelerated by one or more PPUs (such as GPUs), in one embodiment, one or more APIs 1510 provide one or more API functions 1512 to launch, monitor, and/or terminate workloads that are used or available by one or more computing devices as described above and further described herein in connection with fig. 1-13. In at least one embodiment, block diagram 1500 depicts a processor including one or more circuits to execute one or more software programs to combine two or more Application Programming Interfaces (APIs) into a single API. In at least one embodiment, block diagram 1500 depicts a system comprising one or more processors to execute one or more software programs to combine two or more Application Programming Interfaces (APIs) into a single API. In at least one embodiment, the processor uses the API to initiate, monitor, and/or terminate the workload 1516 described herein. In at least one embodiment, the processor initiates, monitors, and/or terminates the workload 1516 using an API, wherein the processor initiates, monitors, and/or terminates the workload 1516 by causing one or more circuits to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, the processor initiates, monitors, and/or terminates the workload 1516 using an API, wherein the processor initiates, monitors, and/or terminates the workload 1516 by causing one or more circuits to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. 
In at least one embodiment, the processor initiates, monitors, and/or terminates the workload 1516 using an API, wherein the processor initiates, monitors, and/or terminates the workload 1516 by causing one or more circuits to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API.
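The combining of two or more APIs into a single API, as block diagram 1500 depicts, can be sketched as one entry point that selects which underlying API executes; the opcode and function names below are hypothetical and purely illustrative:

```c
#include <assert.h>

/* Hypothetical combined API: one call folds the separate start/monitor/
 * terminate APIs into a single entry point selected by an opcode. */
typedef enum { WL_OP_START, WL_OP_MONITOR, WL_OP_TERMINATE } wl_op_t;

static int state = 0;   /* 0 = idle, 1 = running */

static int start_api(void)     { state = 1; return 0; }
static int monitor_api(void)   { return state; }
static int terminate_api(void) { state = 0; return 0; }

/* The single API selects which underlying API executes, echoing the
 * first-API-selects-second-API arrangement described in the text. */
int workload_api(wl_op_t op)
{
    switch (op) {
    case WL_OP_START:     return start_api();
    case WL_OP_MONITOR:   return monitor_api();
    case WL_OP_TERMINATE: return terminate_api();
    }
    return -1;
}
```

From the caller's perspective there is only one API to learn and link against; the selection of the second API happens inside it.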
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of at least one embodiment. It will be apparent, however, to one skilled in the art that the concepts of the invention may be practiced without one or more of these specific details.
Server and data center
The following figures illustrate exemplary web server and data center based systems that may be used to implement at least one embodiment.
Fig. 16 illustrates a distributed system 1600 in accordance with at least one embodiment. In at least one embodiment, the distributed system 1600 includes one or more client computing devices 1602, 1604, 1606, and 1608 configured to execute and operate client applications, such as a network (web) browser, proprietary client, and/or variants thereof, on one or more networks 1610. In at least one embodiment, a server 1612 can be communicatively coupled with remote client computing devices 1602, 1604, 1606, and 1608 via a network 1610.
In at least one embodiment, the server 1612 may be adapted to run one or more services or software applications, such as services and applications that may manage session activity for single sign-on (SSO) access across multiple data centers. In at least one embodiment, the server 1612 may also provide other services, or software applications, which may include non-virtual and virtual environments. In at least one embodiment, these services may be provided to users of client computing devices 1602, 1604, 1606, and/or 1608 as web-based services or cloud services or under a software as a service (SaaS) model. In at least one embodiment, a user operating client computing devices 1602, 1604, 1606, and/or 1608 can, in turn, utilize one or more client applications to interact with server 1612 to utilize services provided by these components.
In at least one embodiment, software components 1618, 1620, and 1622 of system 1600 are implemented on server 1612. In at least one embodiment, one or more components of system 1600 and/or services provided by such components may also be implemented by one or more of client computing devices 1602, 1604, 1606, and/or 1608. In at least one embodiment, a user operating a client computing device may then utilize one or more client applications to use the services provided by these components. In at least one embodiment, these components may be implemented in hardware, firmware, software, or a combination thereof. It should be appreciated that a variety of different system configurations are possible, which may differ from distributed system 1600. Thus, the embodiment shown in FIG. 16 is one example of a distributed system for implementing the embodiment system and is not intended to be limiting.
In at least one embodiment, client computing devices 1602, 1604, 1606, and/or 1608 can include different types of computing systems. In at least one embodiment, a client computing device may comprise a portable handheld device (e.g., a cellular phone, a computing tablet, a Personal Digital Assistant (PDA)) or a wearable device (e.g., a Google Glass head mounted display) running software such as Microsoft Windows Mobile and/or a variety of mobile operating systems (such as iOS, Windows Phone, Android, BlackBerry, Palm OS, and/or variants thereof). In at least one embodiment, the device may support different applications, such as different internet-related applications, email, Short Message Service (SMS) applications, and may use various other communication protocols. In at least one embodiment, a client computing device may also include a general purpose personal computer, including, for example, a personal computer and/or laptop computer running a version of the Microsoft Windows, Apple Macintosh, and/or Linux operating systems. In at least one embodiment, a client computing device may be a workstation computer running any of a variety of commercially available UNIX or UNIX-like operating systems, including but not limited to various GNU/Linux operating systems, such as Google Chrome OS. In at least one embodiment, client computing devices may also include electronic devices capable of communicating over one or more networks 1610, such as a thin client computer, an internet-enabled gaming system (e.g., a Microsoft Xbox game console with or without a gesture input device), and/or a personal messaging device. Although distributed system 1600 in fig. 16 is illustrated as having four client computing devices, any number of client computing devices may be supported. Other devices (such as devices with sensors, etc.) may interact with the server 1612.
In at least one embodiment, the network 1610 in the distributed system 1600 may be any type of network capable of supporting data communications using any of a variety of available protocols, including, but not limited to, TCP/IP (transmission control protocol/internet protocol), SNA (system network architecture), IPX (internet packet exchange), AppleTalk, and/or variants thereof. In at least one embodiment, the network 1610 may be a Local Area Network (LAN), an Ethernet-based network, a token ring, a wide area network, the Internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., a network operating under any of the Institute of Electrical and Electronics Engineers (IEEE) 802.11 suite of protocols and/or any other wireless protocol), and/or any combination of these and/or other networks.
In at least one embodiment, the server 1612 can be implemented by one or more general purpose computers, special purpose server computers (e.g., including PC servers, mid-range servers, mainframe computers, rack mounted servers, etc.), a server farm, a cluster of servers, or any other suitable arrangement and/or combination. In at least one embodiment, the server 1612 may include one or more virtual machines running a virtual operating system or other computing architecture involving virtualization. In at least one embodiment, one or more flexible pools of logical storage devices may be virtualized to maintain virtual storage devices for servers. In at least one embodiment, the virtual network may be controlled by the server 1612 using a software-defined network. In at least one embodiment, the server 1612 may be adapted to run one or more services or software applications.
In at least one embodiment, the server 1612 may run any operating system, as well as any commercially available server operating system. In at least one embodiment, the server 1612 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP (HyperText transfer protocol) servers, FTP (File transfer protocol) servers, CGI (common gateway interface) servers, database servers, and/or variants thereof. In at least one embodiment, exemplary database servers include, but are not limited to, those commercially available from Oracle, Microsoft, Sybase, IBM (International Business Machines), and/or variants thereof.
In at least one embodiment, the server 1612 can include one or more applications for analyzing and merging data feeds and/or event updates received from users of the client computing devices 1602, 1604, 1606, and 1608. In at least one embodiment, the data feeds and/or event updates may include, but are not limited to, real-time updates received from one or more third party information sources and continuous data streams, which may include real-time events related to sensor data applications, financial tickers, network performance measurement tools (e.g., network monitoring and traffic management applications), clickstream analysis tools, automotive traffic monitoring, and/or variants thereof. In at least one embodiment, the server 1612 can also include one or more applications for displaying data feeds and/or real-time events via one or more display devices of the client computing devices 1602, 1604, 1606, and 1608.
In at least one embodiment, distributed system 1600 may also include one or more databases 1614 and 1616. In at least one embodiment, the database may provide a mechanism for storing information such as user interaction information, usage pattern information, adaptation rule information, and other information. In at least one embodiment, databases 1614 and 1616 may reside in various locations. In at least one embodiment, one or more of databases 1614 and 1616 may reside on a non-transitory storage medium local to server 1612 (and/or residing in server 1612). In at least one embodiment, databases 1614 and 1616 may be remote from server 1612 and in communication with server 1612 via a network-based connection or a dedicated connection. In at least one embodiment, databases 1614 and 1616 may reside in a Storage Area Network (SAN). In at least one embodiment, any necessary files for performing the functions attributed to server 1612 may be stored locally on server 1612 and/or remotely as appropriate. In at least one embodiment, databases 1614 and 1616 may include relational databases, such as databases adapted to store, update, and retrieve data in response to SQL formatted commands.
In at least one embodiment, at least one component shown or described with respect to fig. 16 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 16 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 16 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 16 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 16 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 17 illustrates an exemplary data center 1700 in accordance with at least one embodiment. In at least one embodiment, data center 1700 includes, but is not limited to, a data center infrastructure layer 1710, a framework layer 1720, a software layer 1730, and an application layer 1740.
In at least one embodiment, as shown in fig. 17, the data center infrastructure layer 1710 can include a resource coordinator 1712, grouped computing resources 1714, and node computing resources ("node c.r.") 1716 (1) -1716 (N), where "N" represents any complete positive integer. In at least one embodiment, the nodes c.r.1716 (1) -1716 (N) may include, but are not limited to, any number of central processing units ("CPUs") or other processors (including accelerators, field programmable gate arrays ("FPGAs"), graphics processors, etc.), memory devices (e.g., dynamic read only memory), storage devices (e.g., solid state drives or disk drives), network input/output ("NW I/O") devices, network switches, virtual machines ("VMs"), power modules, cooling modules, and the like. In at least one embodiment, one or more of the nodes c.r.1716 (1) -1716 (N) may be a server having one or more of the above-described computing resources.
In at least one embodiment, the grouped computing resources 1714 may include separate groupings of nodes c.r. housed within one or more racks (not shown), or many racks housed within data centers at various geographic locations (also not shown). In at least one embodiment, separate groupings of nodes c.r. within the grouped computing resources 1714 may include grouped compute, network, memory, or storage resources that may be configured or allocated to support one or more workloads. In at least one embodiment, several nodes c.r. including CPUs or processors may be grouped within one or more racks to provide computing resources to support one or more workloads. In at least one embodiment, one or more racks may also include any number of power modules, cooling modules, and network switches, in any combination.
In at least one embodiment, the resource coordinator 1712 may configure or otherwise control one or more nodes c.r.1716 (1) -1716 (N) and/or grouped computing resources 1714. In at least one embodiment, the resource coordinator 1712 may include a software design infrastructure ("SDI") management entity for the data center 1700. In at least one embodiment, the resource coordinator 1712 may include hardware, software, or some combination thereof.
In at least one embodiment, as shown in FIG. 17, framework layer 1720 includes, but is not limited to, a job scheduler 1732, a configuration manager 1734, a resource manager 1736, and a distributed file system 1738. In at least one embodiment, framework layer 1720 can include a framework to support software 1752 of software layer 1730 and/or one or more applications 1742 of application layer 1740. In at least one embodiment, software 1752 or applications 1742 can include web-based service software or applications, such as those provided by Amazon Web Services, Google Cloud, and Microsoft Azure, respectively. In at least one embodiment, framework layer 1720 may be, but is not limited to, a type of free and open-source software web application framework, such as Apache Spark (hereinafter "Spark"), that may utilize distributed file system 1738 for large-scale data processing (e.g., "big data"). In at least one embodiment, job scheduler 1732 may include a Spark driver to facilitate scheduling of the workloads supported by the various layers of data center 1700. In at least one embodiment, the configuration manager 1734 may be capable of configuring different layers, such as the software layer 1730 and the framework layer 1720, which includes Spark and the distributed file system 1738 for supporting large-scale data processing. In at least one embodiment, the resource manager 1736 can manage clustered or grouped computing resources mapped to or allocated for support of the distributed file system 1738 and the job scheduler 1732. In at least one embodiment, the clustered or grouped computing resources can include grouped computing resources 1714 at the data center infrastructure layer 1710. In at least one embodiment, resource manager 1736 may coordinate with resource coordinator 1712 to manage these mapped or allocated computing resources.
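As a rough illustration of the scheduling role described for job scheduler 1732 and resource manager 1736, the sketch below shows a minimal first-in-first-out scheduler that assigns queued workloads to grouped node resources and reassigns freed nodes to waiting work. It is a simplified stand-in with invented names, not Spark's actual scheduler or any interface recited by the patent.

```python
# Illustrative sketch of a FIFO job scheduler assigning queued workloads to
# grouped computing resources; all class and variable names are invented.
import collections

class Scheduler:
    def __init__(self, node_slots):
        self.free = list(node_slots)        # available node resources
        self.queue = collections.deque()    # workloads waiting for a node
        self.assignments = {}               # workload -> node currently used

    def submit(self, job):
        # Queue a workload and assign it immediately if a node is free.
        self.queue.append(job)
        self._dispatch()

    def complete(self, job):
        # Release the workload's node and hand it to the next queued workload.
        node = self.assignments.pop(job)
        self.free.append(node)
        self._dispatch()

    def _dispatch(self):
        while self.queue and self.free:
            job = self.queue.popleft()
            self.assignments[job] = self.free.pop(0)

sched = Scheduler(["node-1", "node-2"])
sched.submit("etl")
sched.submit("training")
sched.submit("inference")   # waits in the queue until a node frees up
```

When `complete("etl")` later releases node-1, the waiting "inference" workload is dispatched onto it, mirroring how a resource manager reallocates grouped resources among workloads.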
In at least one embodiment, the software 1752 included in the software layer 1730 can include software used by at least a portion of the nodes c.r.1716 (1) -1716 (N), the grouped computing resources 1714, and/or the distributed file system 1738 of the framework layer 1720. In at least one embodiment, the one or more types of software may include, but are not limited to, Internet web search software, email virus scanning software, database software, and streaming video content software.
In at least one embodiment, the one or more applications 1742 included in the application layer 1740 can include one or more types of applications used by at least a portion of the nodes c.r.1716 (1) -1716 (N), the grouped computing resources 1714, and/or the distributed file system 1738 of the framework layer 1720. The one or more types of applications may include, but are not limited to, a CUDA application, a 5G network application, an artificial intelligence application, a data center application, and/or variants thereof.
In at least one embodiment, any of the configuration manager 1734, resource manager 1736, and resource coordinator 1712 may implement any number and type of self-modifying actions based on any amount and type of data acquired in any technically feasible manner. In at least one embodiment, the self-modifying actions may relieve a data center operator of data center 1700 from making potentially bad configuration decisions and may help avoid underutilized and/or poorly performing portions of the data center.
In at least one embodiment, at least one component shown or described with respect to fig. 17 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 17 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 17 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 17 is for executing a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 17 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Fig. 18 illustrates a client-server network 1804 formed by a plurality of network server computers 1802 that are interconnected, in accordance with at least one embodiment. In at least one embodiment, in the system 1800, each network server computer 1802 stores data accessible to other network server computers 1802 and to client computers 1806 and networks 1808 linked into the wide area network 1804. In at least one embodiment, the configuration of the client-server network 1804 may change over time as client computers 1806 and one or more networks 1808 connect to and disconnect from the network 1804, and as one or more trunk server computers 1802 are added to or removed from the network 1804. In at least one embodiment, when such client computers 1806 and networks 1808 are connected to the network server computers 1802, the client-server network includes those client computers 1806 and networks 1808. In at least one embodiment, the term computer includes any device or machine capable of accepting data, applying prescribed processes to the data, and supplying the results of those processes.
In at least one embodiment, the client-server network 1804 stores information accessible to the network server computer 1802, the remote network 1808, and the client computers 1806. In at least one embodiment, the network server computer 1802 is formed from a mainframe computer, mini-computer, and/or microcomputer each having one or more processors. In at least one embodiment, the server computer 1802 is linked together by a wired and/or wireless transmission medium, such as a wire, fiber optic cable, and/or microwave transmission medium, satellite transmission medium, or other conductive, optical, or electromagnetic wave transmission medium. In at least one embodiment, the client computer 1806 accesses the network server computer 1802 via a similar wired or wireless transmission medium. In at least one embodiment, the client computers 1806 can be linked into the client-server network 1804 using modems and standard telephone communications networks. In at least one embodiment, alternative carrier systems (e.g., cable and satellite communication systems) may also be used to link into the client-server network 1804. In at least one embodiment, other proprietary or time-shared carrier systems may be used. In at least one embodiment, the network 1804 is a global information network, such as the internet. In at least one embodiment, the network is a private intranet that uses a similar protocol to the Internet but with added security measures and limited access control. In at least one embodiment, the network 1804 is a private or semi-private network that uses proprietary communication protocols.
In at least one embodiment, the client computer 1806 is any end-user computer, and may also be a mainframe computer, minicomputer, or microcomputer having one or more microprocessors. In at least one embodiment, a server computer 1802 may at times function as a client computer that accesses another server computer 1802. In at least one embodiment, the remote network 1808 may be a local area network, a network added to a wide area network through an Internet Service Provider (ISP), or another group of computers interconnected by a wired or wireless transmission medium having a fixed or time-varying configuration. In at least one embodiment, the client computers 1806 can link into and access the network 1804 independently or through the remote network 1808.
In at least one embodiment, at least one component shown or described with respect to fig. 18 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 18 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 18 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 18 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 18 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 19 illustrates an example system 1900, in accordance with at least one embodiment, that includes a computer network 1908 connecting one or more computing machines. In at least one embodiment, the network 1908 may be any type of electronically connected group of computers, including, for example, the following networks: the Internet, an intranet, a Local Area Network (LAN), a Wide Area Network (WAN), or an interconnected combination of these network types. In at least one embodiment, connectivity within the network 1908 may be through a remote modem, Ethernet (IEEE 802.3), Token Ring (IEEE 802.5), Fiber Distributed Data Interface (FDDI), Asynchronous Transfer Mode (ATM), or any other communication protocol. In at least one embodiment, computing devices linked to the network may be desktop, server, portable, handheld, set-top box, Personal Digital Assistant (PDA), terminal, or any other desired type or configuration. In at least one embodiment, depending on their functionality, network-connected devices may vary widely in processing power, internal memory, and other performance aspects. In at least one embodiment, communications within the network and communications to or from the computing devices connected to the network may be either wired or wireless. In at least one embodiment, the network 1908 may comprise, at least in part, the worldwide public Internet, which connects a plurality of users in accordance with the Transmission Control Protocol/Internet Protocol (TCP/IP) specification, generally according to a client-server model. In at least one embodiment, the client-server network is a dominant model for communication between two computers. In at least one embodiment, a client computer ("client") issues one or more commands to a server computer ("server"). In at least one embodiment, the server fulfills client commands by accessing available network resources and returning information to the client pursuant to the client commands.
In at least one embodiment, client computer systems and network resources resident on network servers are assigned network addresses for identification during communications between elements of the network. In at least one embodiment, communications from other network-connected systems to the servers will include the network address of the relevant server/network resource as part of the communication, so that the appropriate destination of the data/request is identified as the recipient. In at least one embodiment, when the network 1908 comprises the global Internet, the network address is an IP address in TCP/IP format that may, at least in part, route data to an email account, a website, or another Internet appliance resident on a server. In at least one embodiment, information and services resident on the network servers may be made available to the web browser of a client computer through a domain name (e.g., www.site.com), which maps to the IP address of the network server.
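The addressing scheme described above, in which a domain name maps to the IP address of a server so that a communication can carry the address of its proper recipient, can be sketched as follows. The lookup table, the addresses (drawn from the documentation range reserved by RFC 5737), and the function names are illustrative only.

```python
# Illustrative name-to-address lookup: a domain name maps to the IP address
# identifying the destination server. Entries use RFC 5737 documentation
# addresses and are not real assignments.
name_to_ip = {
    "www.site.com": "203.0.113.10",
    "mail.site.com": "203.0.113.11",
}

def resolve(domain):
    """Return the IP address that requests to 'domain' would be routed to."""
    return name_to_ip[domain]

def route_request(domain, payload):
    # A communication includes the destination network address so that the
    # appropriate recipient of the data/request can be identified.
    return {"dst": resolve(domain), "data": payload}
```

In a real deployment, the role of `name_to_ip` is played by the Domain Name System rather than a static table.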
In at least one embodiment, a plurality of clients 1902, 1904, and 1906 are connected to the network 1908 via respective communication links. In at least one embodiment, each of these clients may access the network 1908 via any desired form of communication, such as via a dial-up modem connection, a cable link, a Digital Subscriber Line (DSL), a wireless or satellite link, or any other form of communication. In at least one embodiment, each client may communicate using any machine that is compatible with the network 1908, such as a Personal Computer (PC), workstation, dedicated terminal, Personal Digital Assistant (PDA), or other similar equipment. In at least one embodiment, clients 1902, 1904, and 1906 may or may not be located in the same geographic area.
In at least one embodiment, a plurality of servers 1910, 1912, and 1914 are connected to network 1908 to service clients in communication with network 1908. In at least one embodiment, each server is typically a powerful computer or device that manages network resources and responds to client commands. In at least one embodiment, the server includes a computer readable data storage medium such as a hard disk drive and RAM memory that stores program instructions and data. In at least one embodiment, the servers 1910, 1912, 1914 run applications that respond to client commands. In at least one embodiment, the server 1910 can run a web server application for responding to client requests for HTML pages, and can also run a mail server application for receiving and routing email. In at least one embodiment, other applications may also run on server 1910, such as an FTP server or media server for streaming audio/video data to clients. In at least one embodiment, different servers may be dedicated to performing different tasks. In at least one embodiment, the server 1910 can be a dedicated web server that manages website-related resources for different users, while the server 1912 can be dedicated to providing electronic mail (email) management. In at least one embodiment, other servers may be dedicated to media (audio, video, etc.), file Transfer Protocol (FTP), or a combination of any two or more services that are generally available or provided over a network. In at least one embodiment, each server may be in the same or different location as the other servers. In at least one embodiment, there may be multiple servers performing mirroring tasks for the user, thereby alleviating congestion or minimizing traffic to and from a single server. In at least one embodiment, the servers 1910, 1912, 1914 are under the control of a web hosting provider in a business that maintains and delivers third party content over the network 1908.
In at least one embodiment, a web hosting provider delivers services to two different types of clients. In at least one embodiment, one type, which may be referred to as a browser, requests content, such as web pages, email messages, video clips, and the like, from servers 1910, 1912, and 1914. In at least one embodiment, a second type (which may be referred to as a user) hires a web hosting provider to maintain network resources (such as websites) and make them available to the browser. In at least one embodiment, users contract with web hosting providers to make memory space, processor capacity, and communication bandwidth available to their desired network resources, depending on the amount of server resources that users desire to utilize.
In at least one embodiment, in order for a web hosting provider to serve both clients, the application that manages the network resources hosted by the server must be properly configured. In at least one embodiment, the program configuration process involves defining a set of parameters that at least partially control the application's response to browser requests and also at least partially define server resources available to a particular user.
In one embodiment, intranet server 1916 communicates with network 1908 via a communication link. In at least one embodiment, the intranet server 1916 communicates with a server manager 1918. In at least one embodiment, the server manager 1918 includes a database of application configuration parameters used in the servers 1910, 1912, and 1914. In at least one embodiment, the user modifies database 1920 via intranet server 1916, and server manager 1918 interacts with servers 1910, 1912, and 1914 to modify application parameters so that they match the contents of the database. In at least one embodiment, a user logs into the intranet server 1916 by connecting to the intranet server 1916 via the client 1902 and entering authentication information such as a user name and password.
In at least one embodiment, when a user wishes to log in to a new service or modify an existing service, the intranet server 1916 authenticates the user and provides the user with an interactive screen display/control panel that allows the user to access configuration parameters of a particular application. In at least one embodiment, a plurality of modifiable text boxes describing aspects of a configuration of a user's website or other network resource are presented to the user. In at least one embodiment, if a user desires to increase the memory space reserved on a server for his website, the user is provided with a field in which the user specifies the desired memory space. In at least one embodiment, in response to receiving the information, the intranet server 1916 updates a database 1920. In at least one embodiment, the server manager 1918 forwards this information to the appropriate server and uses the new parameters during application operation. In at least one embodiment, the intranet server 1916 is configured to provide a user with access to configuration parameters of hosted network resources (e.g., web pages, emails, FTP sites, media sites, etc.) that the user has signed up with a web hosting service provider.
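The configuration flow described above, in which a user's change is written to database 1920 and server manager 1918 then interacts with the servers so that their application parameters match the database, might be sketched as follows. The class names, parameter names, and update mechanism are invented for illustration and are not the actual implementation.

```python
# Hedged sketch of configuration-parameter propagation: a parameter database
# is updated, and a server manager pushes the new values to hosting servers
# so their parameters match the database. All names are illustrative.

class Server:
    def __init__(self, name):
        self.name = name
        self.params = {}        # application configuration parameters

class ServerManager:
    def __init__(self, database, servers):
        self.database = database    # stands in for database 1920
        self.servers = servers

    def sync(self):
        # Make each server's application parameters match the database.
        for server in self.servers:
            server.params.update(self.database.get(server.name, {}))

database = {"web-1": {"memory_mb": 512}}
servers = [Server("web-1"), Server("mail-1")]
manager = ServerManager(database, servers)

# User increases reserved memory via the control panel: the database is
# updated, then the manager forwards the new parameter to the proper server.
database["web-1"]["memory_mb"] = 1024
manager.sync()
```

After `sync()`, the "web-1" server operates with the enlarged memory reservation, while servers with no database entry are left unchanged.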
In at least one embodiment, at least one component shown or described with respect to fig. 19 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 19 is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 19 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 19 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 19 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Fig. 20A illustrates a networked computer system 2000A in accordance with at least one embodiment. In at least one embodiment, the networked computer system 2000A includes a plurality of nodes 2002, 2018, 2020 (e.g., personal computers ("PCs")). In at least one embodiment, the personal computer or node 2002 includes a processor 2014, a memory 2016, a camera 2004, a microphone 2006, a mouse 2008, a speaker 2010, and a monitor 2012. In at least one embodiment, the nodes 2002, 2018, 2020 may each run one or more desktop servers, e.g., internal networks within a given company, or may be servers of a general network that is not limited to a particular environment. In at least one embodiment, there is one server per PC node of the network, such that each PC node of the network represents a particular network server with a particular network URL address. In at least one embodiment, each server defaults to a default web page for the user of that server, which may itself contain embedded URLs pointing to further sub-pages of the user on that server, or to pages on other servers or other servers on the network.
In at least one embodiment, the nodes 2002, 2018, 2020 and other nodes of the network are interconnected via a medium 2022. In at least one embodiment, medium 2022 may be a communication channel such as an integrated services digital network ("ISDN"). In at least one embodiment, the various nodes of the networked computer system may be connected by a variety of communication media including a local area network ("LAN"), plain old telephone line ("POTS") (sometimes referred to as the public switched telephone network ("PSTN")), and/or variants thereof. In at least one embodiment, the various nodes of the network may also constitute computer system users interconnected via a network, such as the Internet. In at least one embodiment, each server on the network (running from a particular node of the network at a given instance) has a unique address or identity within the network, which may be specified in terms of a URL.
In at least one embodiment, a plurality of multipoint conference units ("MCUs") may thus be used to transmit data to and from various nodes or "endpoints" of the conference system. In at least one embodiment, the nodes and/or MCUs may be interconnected via ISDN links or by a local area network ("LAN") in addition to various other communication media, such as nodes connected by the internet. In at least one embodiment, the nodes of the conference system may be generally connected directly to a communication medium (such as a LAN) or through an MCU, and the conference system may include other nodes or elements, such as routers, servers, and/or variants thereof.
In at least one embodiment, the processor 2014 is a general purpose programmable processor. In at least one embodiment, the processor of the node of the networked computer system 2000A may also be a dedicated video processor. In at least one embodiment, the different peripherals and components of a node (such as those of node 2002) may be different from those of other nodes. In at least one embodiment, node 2018 and node 2020 may be configured the same as or different from node 2002. In at least one embodiment, the nodes may be implemented on any suitable computer system in addition to a PC system.
In at least one embodiment, at least one component shown or described with respect to fig. 20A is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 20A is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 20A is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 20A is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 20A is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 20B illustrates a networked computer system 2000B in accordance with at least one embodiment. In at least one embodiment, system 2000B illustrates a network (such as LAN 2024) that may be used to interconnect various nodes that may communicate with each other. In at least one embodiment, attached to the LAN 2024 are a plurality of nodes, such as PC nodes 2026, 2028, 2030. In at least one embodiment, a node may also be connected to the LAN via a network server or other means. In at least one embodiment, system 2000B includes other types of nodes or elements, including, for example, routers, servers, and nodes.
In at least one embodiment, at least one component shown or described with respect to fig. 20B is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 20B is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 20B is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 20B is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 20B is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Fig. 20C illustrates a networked computer system 2000C in accordance with at least one embodiment. In at least one embodiment, system 2000C illustrates a WWW system having communication across a backbone communication network (such as the internet 2032) that may be used to interconnect the various nodes of the network. In at least one embodiment, the WWW is a set of protocols that operate on top of the internet and allow a graphical interface system to operate thereon to access information through the internet. In at least one embodiment, attached to the internet 2032 in the WWW are a plurality of nodes, e.g., PCs 2040, 2042, 2044. In at least one embodiment, the nodes interface with other nodes of the WWW through WWW HTTP servers (such as servers 2034, 2036). In at least one embodiment, the PC 2044 may be a PC that forms a node of the network 2032, and the PC 2044 itself runs its server 2036, although the PC 2044 and the server 2036 are shown separately in fig. 20C for purposes of illustration.
In at least one embodiment, the WWW is a distributed type of application characterized by WWW HTTP, a protocol of the WWW that runs on top of the transmission control protocol/Internet protocol ("TCP/IP") of the Internet. In at least one embodiment, the WWW may thus be characterized by a set of protocols (i.e., HTTP) running on the internet as its "backbone".
In at least one embodiment, a web browser is an application running on a node of a network that, in a WWW-compatible type of network system, allows a user of a particular server or node to view such information and thus allows the user to search graphics- and text-based files that are linked together using hypertext links embedded in documents or files available from servers on the network that understand HTTP. In at least one embodiment, when a given web page of a first server associated with a first node is retrieved by a user using another server on a network such as the Internet, the retrieved document may have different hypertext links embedded therein, and a local copy of the page is created local to the retrieving user. In at least one embodiment, when the user clicks on a hypertext link, the locally stored information related to the selected hypertext link is typically sufficient to allow the user's machine to open a connection across the Internet to the server indicated by the hypertext link.
In at least one embodiment, more than one user may be coupled to each HTTP server, for example through a LAN, such as LAN 2038 shown with respect to WWW HTTP server 2034. In at least one embodiment, system 2000C may also include other types of nodes or elements. In at least one embodiment, a WWW HTTP server is an application running on a machine, such as a PC. In at least one embodiment, each user may be considered to have a unique "server," as illustrated with respect to PC 2044. In at least one embodiment, a server may be considered to be a server, such as WWW HTTP server 2034, which provides access to the network for a LAN, a plurality of nodes, or a plurality of LANs. In at least one embodiment, there may be a plurality of users, each having a desktop PC or node of the network, with each desktop PC potentially establishing a server for its user. In at least one embodiment, each server is associated with a particular network address or URL that, when accessed, provides a default web page for that user. In at least one embodiment, the web page may contain further links (embedded URLs) pointing to further sub-pages of that user on that server, or to other servers on the network, or to pages on other servers on the network.
In at least one embodiment, at least one component shown or described with respect to fig. 20C is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 20C is configured to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 20C is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 20C is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 20C is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Cloud computing and services
The following figures illustrate, but are not limited to, exemplary cloud-based systems that may be used to implement at least one embodiment.
In at least one embodiment, cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as services over the internet. In at least one embodiment, users need not have knowledge of, expertise in, or control over the technology infrastructure that supports them, which may be referred to as being "in the cloud." In at least one embodiment, cloud computing incorporates infrastructure as a service, platform as a service, software as a service, and other variants that share the common theme of reliance on the internet to satisfy the computing needs of users. In at least one embodiment, a Data Center (DC) in a typical cloud deployment, such as in a private cloud (e.g., an enterprise network) or a public cloud (e.g., the internet), may consist of thousands of servers (or alternatively, VMs), hundreds of Ethernet, Fibre Channel, or Fibre Channel over Ethernet (FCoE) ports, switching and storage infrastructure, etc. In at least one embodiment, the cloud may also consist of network services infrastructure, such as IPsec VPN hubs, firewalls, load balancers, Wide Area Network (WAN) optimizers, or the like. In at least one embodiment, remote subscribers may securely access cloud applications and services by connecting via a VPN tunnel (e.g., an IPsec VPN tunnel).
In at least one embodiment, cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be quickly configured and released with minimal management effort or service provider interaction.
In at least one embodiment, cloud computing is characterized by on-demand self-service, where a consumer can unilaterally and automatically provision computing capabilities, such as server time and network storage, as needed without requiring human interaction with each service provider. In at least one embodiment, cloud computing is characterized by broad network access, where capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs). In at least one embodiment, cloud computing is characterized by resource pooling, in which the provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. In at least one embodiment, there is a sense of location independence in that consumers generally have no control over or knowledge of the exact location of the provided resources, but may be able to specify a location at a higher level of abstraction (e.g., country, state, or data center). In at least one embodiment, examples of resources include storage, processing, memory, network bandwidth, and virtual machines. In at least one embodiment, cloud computing is characterized by rapid elasticity, where capabilities can be rapidly and elastically provisioned (in some cases automatically) to quickly scale out, and rapidly released to quickly scale in. In at least one embodiment, to the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time. In at least one embodiment, cloud computing is characterized by measured service, where cloud systems automatically control and optimize resource usage by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts).
In at least one embodiment, resource usage may be monitored, controlled, and reported to provide transparency to both the provider and consumer of the utilized service.
In at least one embodiment, cloud computing may be associated with various services. In at least one embodiment, cloud software as a service (SaaS) may refer to a service in which the capability provided to the consumer is the use of the provider's applications running on a cloud infrastructure. In at least one embodiment, the applications may be accessed from various client devices through a thin client interface such as a web browser (e.g., web-based email). In at least one embodiment, the consumer does not manage or control the underlying cloud infrastructure, including the network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
In at least one embodiment, cloud platform as a service (PaaS) may refer to a service in which the capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications, created using programming languages and tools supported by the provider. In at least one embodiment, the consumer does not manage or control the underlying cloud infrastructure, including the network, servers, operating systems, or storage, but has control over the deployed applications and possibly the application hosting environment configurations.
In at least one embodiment, cloud infrastructure as a service (IaaS) may refer to a service in which the capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources, where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. In at least one embodiment, the consumer does not manage or control the underlying cloud infrastructure, but has control over the operating systems, storage, and deployed applications, and possibly limited control over select networking components (e.g., host firewalls).
In at least one embodiment, cloud computing may be deployed in different ways. In at least one embodiment, a private cloud may refer to a cloud infrastructure that is operated solely for an organization. In at least one embodiment, the private cloud may be managed by the organization or a third party, and may exist on premises or off premises. In at least one embodiment, a community cloud may refer to a cloud infrastructure that is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). In at least one embodiment, the community cloud may be managed by the organizations or a third party, and may exist on premises or off premises. In at least one embodiment, a public cloud may refer to a cloud infrastructure that is made available to the general public or a large industry group and is owned by an organization selling cloud services. In at least one embodiment, a hybrid cloud may refer to a cloud infrastructure that is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds). In at least one embodiment, a cloud computing environment is service oriented, with a focus on statelessness, low coupling, modularity, and semantic interoperability.
Fig. 21 illustrates one or more components of a system environment 2100 in which services can be provided as third party network services in accordance with at least one embodiment. In at least one embodiment, the third party network may be referred to as a cloud, a cloud network, a cloud computing network, and/or variants thereof. In at least one embodiment, the system environment 2100 includes one or more client computing devices 2104, 2106, and 2108, which client computing devices 2104, 2106, and 2108 can be used by a user to interact with a third party network infrastructure system 2102 that provides third party network services (which can be referred to as cloud computing services). In at least one embodiment, the third party network infrastructure system 2102 may include one or more computers and/or servers.
It should be appreciated that the third party network infrastructure system 2102 depicted in fig. 21 may have other components in addition to those depicted. Further, fig. 21 depicts an embodiment of a third party network infrastructure system. In at least one embodiment, the third party network infrastructure system 2102 may have more or fewer components than depicted in fig. 21, two or more components may be combined, or may have different component configurations or arrangements.
In at least one embodiment, the client computing devices 2104, 2106, and 2108 may be configured to operate a client application, such as a web browser, a proprietary client application or some other application that may be used by a user of the client computing device to interact with the third party network infrastructure system 2102 to use services provided by the third party network infrastructure system 2102. Although the exemplary system environment 2100 is illustrated as having three client computing devices, any number of client computing devices can be supported. In at least one embodiment, other devices, such as devices with sensors, etc., may interact with the third party network infrastructure system 2102. In at least one embodiment, one or more networks 2110 can facilitate communication and data exchange between client computing devices 2104, 2106, and 2108 and third-party network infrastructure system 2102.
In at least one embodiment, the services provided by the third party network infrastructure system 2102 can include hosts of services available to users of the third party network infrastructure system on demand. In at least one embodiment, various services may also be provided including, but not limited to, online data storage and backup solutions, web-based email services, hosted office suites and document collaboration services, database management and processing, managed technical support services, and/or variations thereof. In at least one embodiment, the services provided by the third party network infrastructure system may be dynamically extended to meet the needs of its users.
In at least one embodiment, a particular instantiation of a service provided by the third party network infrastructure system 2102 may be referred to as a "service instance". In at least one embodiment, any service available to a user from a third party network service provider system via a communications network (such as the internet) is generally referred to as a "third party network service". In at least one embodiment, in a public third party network environment, the servers and systems that make up the third party network service provider system are different from the customer's own on-premise servers and systems. In at least one embodiment, a third party network service provider system may host applications, and users may order and use applications on demand via a communication network (such as the internet).
In at least one embodiment, services in a computer network third party network infrastructure may include protected computer network access to storage, hosted databases, hosted network servers, software applications, or other services provided to users by third party network providers. In at least one embodiment, the service may include password-protected access to remote storage on a third party network via the Internet. In at least one embodiment, the services can include a web service-based hosted relational database and a scripting language middleware engine for private use by networking developers. In at least one embodiment, the service may include access to an email software application hosted on a website of a third party network provider.
In at least one embodiment, the third party network infrastructure system 2102 may include a set of applications, middleware, and database service offerings that are delivered to customers in a self-service, subscription-based, elastically extensible, reliable, highly available, and secure manner. In at least one embodiment, the third party network infrastructure system 2102 may also provide "big data" related computing and analysis services. In at least one embodiment, the term "big data" is generally used to refer to a very large set of data that can be stored and manipulated by analysts and researchers to visualize, detect trends, and/or otherwise interact with the data. In at least one embodiment, big data and related applications may be hosted and/or manipulated by the infrastructure system at many levels and on different scales. In at least one embodiment, tens, hundreds, or thousands of processors linked in parallel may act on such data to present the data or simulate external forces on the data or the content represented thereby. In at least one embodiment, these data sets may relate to structured data (such as structured data organized in a database or otherwise according to a structured model) and/or unstructured data (e.g., emails, images, data blobs, web pages, complex event processing). In at least one embodiment, by utilizing the capabilities of the embodiments to relatively quickly focus more (or less) computing resources on a target, a third party network infrastructure system may be better available to perform tasks on a large data set based on demands from an enterprise, government agency, research organization, private individual, group of individuals or organizations with the same ideas, or other entity.
In at least one embodiment, the third party network infrastructure system 2102 can be adapted to automatically provide, manage, and track customer subscriptions to services provided by the third party network infrastructure system 2102. In at least one embodiment, the third party network infrastructure system 2102 can provide third party network services via different deployment models. In at least one embodiment, services may be provided under a public third party network model, where the third party network infrastructure system 2102 is owned by an organization selling third party network services and makes the services available to the general public or to different business enterprises. In at least one embodiment, the services may be provided under a private third party network model in which the third party network infrastructure system 2102 operates for only a single organization and may provide services to one or more entities within the organization. In at least one embodiment, third party network services may also be provided under a community third party network model, where the services provided by the third party network infrastructure system 2102 and the third party network infrastructure system 2102 are shared by several organizations in the relevant community. In at least one embodiment, the third party network services may also be provided under a hybrid third party network model, which is a combination of two or more different models.
In at least one embodiment, the services provided by the third party network infrastructure system 2102 may include one or more services provided under a software as a service (SaaS) category, a platform as a service (PaaS) category, an infrastructure as a service (IaaS) category, or other service categories including hybrid services. In at least one embodiment, a customer via a subscription order may subscribe to one or more services provided by the third party network infrastructure system 2102. In at least one embodiment, the third party network infrastructure system 2102 then performs processing to provide services in a customer's subscription order.
In at least one embodiment, the services provided by the third party network infrastructure system 2102 may include, but are not limited to, application services, platform services, and infrastructure services. In at least one embodiment, the application services may be provided by a third party network infrastructure system via a SaaS platform. In at least one embodiment, the SaaS platform may be configured to provide third party web services belonging to the SaaS class. In at least one embodiment, the SaaS platform may provide the ability to build and deliver a set of on-demand applications on an integrated development and deployment platform. In at least one embodiment, the SaaS platform may manage and control the underlying software and infrastructure for providing the SaaS services. In at least one embodiment, the client may utilize an application executing on a third party network infrastructure system by utilizing services provided by the SaaS platform. In at least one embodiment, the client may obtain the application service without requiring the client to purchase a separate license and support. In at least one embodiment, a variety of different SaaS services may be provided. In at least one embodiment, examples include, but are not limited to, services that provide solutions for sales performance management, enterprise integration, and business flexibility for large organizations.
In at least one embodiment, the platform services may be provided by the third party network infrastructure system 2102 via the PaaS platform. In at least one embodiment, the PaaS platform can be configured to provide third party web services belonging to the PaaS class. In at least one embodiment, examples of platform services may include, but are not limited to, services that enable an organization to merge existing applications on a shared common architecture, and the ability to build new applications that utilize shared services provided by the platform. In at least one embodiment, the PaaS platform can manage and control the underlying software and infrastructure for providing PaaS services. In at least one embodiment, the customer may obtain PaaS services provided by the third party network infrastructure system 2102 without the customer purchasing separate licenses and support.
In at least one embodiment, by utilizing the services provided by the PaaS platform, customers can employ programming languages and tools supported by the third party network infrastructure system and also control the deployed services. In at least one embodiment, the platform services provided by the third party network infrastructure system may include database third party network services, middleware third party network services, and third party network services. In at least one embodiment, database third party network services may support a shared service deployment model that enables an organization to pool database resources and offer customers a Database as a Service in the form of a database third party network. In at least one embodiment, in the third party network infrastructure system, middleware third party network services may provide a platform for customers to develop and deploy various business applications, and third party network services may provide a platform for customers to deploy applications.
In at least one embodiment, various infrastructure services may be provided by the IaaS platform in the third party network infrastructure system. In at least one embodiment, the infrastructure services facilitate management and control of underlying computing resources (such as storage, networks, and other underlying computing resources) by clients that utilize services provided by the SaaS platform and PaaS platform.
In at least one embodiment, the third party network infrastructure system 2102 can further include infrastructure resources 2130 for providing the resources used to provide various services to customers of the third party network infrastructure system. In at least one embodiment, infrastructure resources 2130 can include a combination of pre-integrated and optimized hardware (such as servers, storage, and networking resources) for executing the services provided by the PaaS platform and the SaaS platform, and other resources.
In at least one embodiment, resources in the third party network infrastructure system 2102 can be shared by multiple users and dynamically reallocated as desired. In at least one embodiment, resources can be allocated to users in different time zones. In at least one embodiment, the third party network infrastructure system 2102 can enable a first group of users in a first time zone to utilize resources of the third party network infrastructure system for a specified number of hours and then enable the same resources to be reassigned to another group of users located in a different time zone, thereby maximizing resource utilization.
In at least one embodiment, a plurality of internal sharing services 2132 shared by different components or modules of the third party network infrastructure system 2102 may be provided for enabling services to be provided by the third party network infrastructure system 2102. In at least one embodiment, these internal sharing services may include, but are not limited to, security and identity services, integration services, enterprise library services, enterprise manager services, virus scanning and whitelisting services, high availability, backup and restore services, services for enabling third party network support, email services, notification services, file transfer services, and/or variants thereof.
In at least one embodiment, the third party network infrastructure system 2102 can provide comprehensive management of third party network services (e.g., saaS, paaS, and IaaS services) in the third party network infrastructure system. In at least one embodiment, the third party network management functions may include the ability to provision, manage, and track subscriptions of customers received by the third party network infrastructure system 2102 and/or variations thereof.
In at least one embodiment, as shown in FIG. 21, third party network management functions may be provided by one or more modules, such as an order management module 2120, an order orchestration module 2122, an order provisioning module 2124, an order management and monitoring module 2126, and an identity management module 2128. In at least one embodiment, these modules may include or be provided using one or more computers and/or servers, which may be general purpose computers, special purpose server computers, server farms, server clusters, or any other suitable arrangement and/or combination.
In at least one embodiment, in step 2134, a customer using a client device (such as client computing device 2104, 2106, or 2108) may interact with third party network infrastructure system 2102 by requesting one or more services provided by third party network infrastructure system 2102 and placing an order for a subscription to one or more services provided by third party network infrastructure system 2102. In at least one embodiment, the customer can access a third party network User Interface (UI), such as a third party network UI 2112, a third party network UI 2114, and/or a third party network UI 2116, and place order orders via these UIs. In at least one embodiment, the order information received by the third party network infrastructure system 2102 in response to the customer placing the order may include information identifying the customer and one or more services provided by the third party network infrastructure system 2102 to which the customer wants to subscribe.
In at least one embodiment, at step 2136, the order information received from the customer may be stored in an order database 2118. In at least one embodiment, if this is a new order, a new record may be created for the order. In at least one embodiment, the order database 2118 can be one of several databases operated by the third party network infrastructure system 2102 and operated in conjunction with other system elements.
In at least one embodiment, at step 2138, the order information may be forwarded to an order management module 2120, which may be configured to perform billing and accounting functions related to the order, such as verifying the order and, upon verification, booking the order.
In at least one embodiment, at step 2140, information about the order may be transferred to an order orchestration module 2122, the order orchestration module 2122 configured to orchestrate the provision of services and resources for the order placed by the customer. In at least one embodiment, the order orchestration module 2122 may use the services of the order provisioning module 2124 for provisioning. In at least one embodiment, the order orchestration module 2122 enables the business processes associated with each order to be managed, and applies business logic to determine whether the order should continue to be served.
In at least one embodiment, at step 2142, when a newly subscribed order is received, the order orchestration module 2122 sends a request to the order provisioning module 2124 to allocate resources and configure the resources needed to fulfill the subscription order. In at least one embodiment, the order provisioning module 2124 enables the allocation of resources for the services subscribed to by the customer. In at least one embodiment, the order provisioning module 2124 provides a level of abstraction between the third party network services provided by the third party network infrastructure system 2102 and the physical implementation layer used to provision the resources for providing the requested services. In at least one embodiment, this enables the order orchestration module 2122 to be isolated from implementation details, such as whether services and resources are actually provisioned in real time or are pre-provisioned and allocated/assigned only upon request.
In at least one embodiment, once the services and resources are provisioned, a notification may be sent to the subscribing client indicating that the requested service is now ready for use, at step 2144. In at least one embodiment, information (e.g., a link) may be sent to the customer that enables the customer to begin using the requested service.
In at least one embodiment, at step 2146, the orders to which the customer subscribes may be managed and tracked by the order management and monitoring module 2126. In at least one embodiment, the order management and monitoring module 2126 may be configured to collect usage statistics regarding customer use of subscribed services. In at least one embodiment, statistics may be collected for the amount of memory used, the amount of data transferred, the number of users, and the amount of system up time and system down time, and/or variations thereof.
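The subscription-order flow of steps 2134 through 2146 can be sketched as a simple pipeline. This is an illustrative sketch under assumed names; the function names, record fields, and resource strings are not from the disclosure, and a real system would distribute these stages across the separate modules described above.

```python
# Illustrative sketch of the FIG. 21 order flow (steps 2134-2146).
# All function names and record fields are assumptions.

def place_order(order_db, customer, services):
    """Steps 2134-2136: receive the order and store it in the order database."""
    order = {"customer": customer, "services": list(services), "status": "received"}
    order_db.append(order)
    return order

def validate_and_book(order):
    """Step 2138: order management verifies the order, then books it."""
    order["status"] = "booked"
    return order

def orchestrate(order):
    """Steps 2140-2142: orchestration requests provisioning of resources."""
    order["resources"] = [f"resource-for-{s}" for s in order["services"]]
    order["status"] = "provisioned"
    return order

def notify_and_track(order, usage_log):
    """Steps 2144-2146: notify the customer and begin usage tracking."""
    order["status"] = "active"
    usage_log[order["customer"]] = {"memory": 0, "transfers": 0}
    return order
```

A usage pass simply threads one order record through the four stages, mirroring the way order information moves from the order database to the management, orchestration, provisioning, and monitoring modules.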
In at least one embodiment, the third party network infrastructure system 2102 can include an identity management module 2128 configured to provide identity services, such as access management and authorization services, in the third party network infrastructure system 2102. In at least one embodiment, the identity management module 2128 can control information about customers who wish to utilize the services provided by the third party network infrastructure system 2102. In at least one embodiment, such information may include information that authenticates the identities of such customers and information that describes which actions those customers are authorized to perform relative to various system resources (e.g., files, directories, applications, communication ports, memory segments, etc.). In at least one embodiment, the identity management module 2128 may also include the management of descriptive information about each customer and about how and by whom that descriptive information may be accessed and modified.
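The per-customer, per-resource authorization check attributed to identity management module 2128 can be sketched as a policy lookup. The policy structure and all names below are illustrative assumptions, not the module's actual interface.

```python
# Minimal sketch of an authorization check like the one described for
# identity management module 2128. The policy layout is an assumption:
# customer -> resource -> set of permitted actions.

def is_authorized(policies, customer, action, resource):
    """Return True if the customer may perform the action on the resource."""
    allowed = policies.get(customer, {})     # unknown customers get nothing
    return action in allowed.get(resource, set())

# Hypothetical policy table covering files and applications.
policies = {
    "acme": {
        "/files/report": {"read", "write"},
        "/apps/crm": {"read"},
    },
}
```

In this sketch a missing customer or resource entry denies access by default, which matches the general description of controlling which actions customers are authorized to perform.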
In at least one embodiment, at least one component shown or described with respect to fig. 21 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 21 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 21 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 21 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 21 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 22 illustrates a cloud computing environment 2202 in accordance with at least one embodiment. In at least one embodiment, cloud computing environment 2202 includes one or more computer systems/servers 2204 with which computing devices, such as a Personal Digital Assistant (PDA) or cellular telephone 2206A, a desktop computer 2206B, a laptop computer 2206C, and/or an automobile computer system 2206N, may communicate. In at least one embodiment, this allows infrastructure, platforms, and/or software to be offered as services from cloud computing environment 2202, so that each client is not required to maintain such resources individually. It should be appreciated that the types of computing devices 2206A-N shown in FIG. 22 are intended to be illustrative only, and that cloud computing environment 2202 can communicate with any type of computerized device over any type of network and/or network-addressable connection (e.g., using a web browser).
In at least one embodiment, computer system/server 2204, which may be represented as a cloud computing node, may operate in conjunction with a number of other general purpose or special purpose computing system environments or configurations. In at least one embodiment, examples of computing systems, environments, and/or configurations that may be suitable for use with computer system/server 2204 include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices, and/or variations thereof.
In at least one embodiment, the computer system/server 2204 can be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. In at least one embodiment, program modules include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. In at least one embodiment, the exemplary computer system/server 2204 can be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In at least one embodiment, in a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
In at least one embodiment, at least one component shown or described with respect to fig. 22 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 22 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 22 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 22 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 22 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 23 illustrates a set of functional abstraction layers provided by cloud computing environment 2202 (FIG. 22), in accordance with at least one embodiment. It should be understood in advance that the components, layers, and functions shown in FIG. 23 are intended to be illustrative only, and that the components, layers, and functions may vary.
In at least one embodiment, hardware and software layer 2302 includes hardware and software components. In at least one embodiment, examples of hardware components include a mainframe, servers based on various RISC (reduced instruction set computer) architectures, various computing systems, supercomputers, storage devices, networks, networking components, and/or variations thereof. In at least one embodiment, examples of software components include web application server software, various database software, and/or variations thereof.
In at least one embodiment, virtualization layer 2304 provides an abstraction layer from which the following exemplary virtual entities may be provided: virtual servers, virtual storage, virtual networks (including virtual private networks), virtual applications, virtual clients, and/or variants thereof.
In at least one embodiment, the management layer 2306 provides various functions. In at least one embodiment, resource provisioning provides dynamic procurement of computing resources and other resources used to perform tasks within the cloud computing environment. In at least one embodiment, metering provides usage tracking as resources are utilized within the cloud computing environment, as well as billing or invoicing for consumption of those resources. In at least one embodiment, these resources may include application software licenses. In at least one embodiment, security provides identity verification for users and tasks, as well as protection for data and other resources. In at least one embodiment, the user interface provides access to the cloud computing environment for both users and system administrators. In at least one embodiment, service level management provides cloud computing resource allocation and management such that required service levels are met. In at least one embodiment, Service Level Agreement (SLA) management provides pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA.
In at least one embodiment, the workload layer 2308 provides functionality that utilizes a cloud computing environment. In at least one embodiment, examples of workloads and functions that may be provided from this layer include: map and navigation, software development and management, educational services, data analysis and processing, transaction processing, and service delivery.
In at least one embodiment, at least one component shown or described with respect to FIG. 23 is used to perform the techniques and/or functions described in connection with FIGS. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 23 is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 23 is used to execute a first API to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 23 is used to execute a first API to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 23 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Supercomputing
The following figures set forth, without limitation, exemplary supercomputer-based systems that may be used to implement at least one embodiment.
In at least one embodiment, a supercomputer may refer to a hardware system exhibiting substantial parallelism and including at least one chip, where chips in the system are interconnected by a network and placed in hierarchically organized enclosures. In at least one embodiment, a large hardware system filling a machine room with racks is one particular example of a supercomputer, each rack containing boards/rack modules and each board/rack module containing chips that are all interconnected by a scalable network. In at least one embodiment, a single rack of such a large hardware system is another example of a supercomputer. In at least one embodiment, a single chip exhibiting substantial parallelism and containing several hardware components may also be considered a supercomputer, since as feature sizes decrease, the amount of hardware that can be incorporated into a single chip may also increase.
FIG. 24 illustrates a chip-scale supercomputer, in accordance with at least one embodiment. In at least one embodiment, inside an FPGA or ASIC chip, the main computation is performed within finite state machines (2404) called thread units. In at least one embodiment, a task and synchronization network (2402) connects the finite state machines and is used to dispatch threads and execute operations in the correct order. In at least one embodiment, a multi-level partitioned on-chip cache hierarchy (2408, 2412) is accessed using memory networks (2406, 2410). In at least one embodiment, off-chip memory is accessed using a memory controller (2416) and an off-chip memory network (2414). In at least one embodiment, an I/O controller (2418) is used for cross-chip communication when a design does not fit in a single logic chip.
In at least one embodiment, at least one component shown or described with respect to FIG. 24 is used to perform the techniques and/or functions described in connection with FIGS. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 24 is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 24 is used to execute a first API to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 24 is used to execute a first API to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 24 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 25 illustrates a supercomputer at a rack-module level, in accordance with at least one embodiment. In at least one embodiment, within a rack module, there are multiple FPGA or ASIC chips (2502) connected to one or more DRAM units (2504) that make up the main accelerator memory. In at least one embodiment, each FPGA/ASIC chip is connected to its neighboring FPGA/ASIC chips using wide on-board buses with differential high-speed signaling (2506). In at least one embodiment, each FPGA/ASIC chip is also connected to at least one high-speed serial communication cable.
In at least one embodiment, at least one component shown or described with respect to FIG. 25 is used to perform the techniques and/or functions described in connection with FIGS. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 25 is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 25 is used to execute a first API to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 25 is used to execute a first API to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 25 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 26 illustrates a rack-level supercomputer, in accordance with at least one embodiment. FIG. 27 illustrates a whole-system-level supercomputer, in accordance with at least one embodiment. In at least one embodiment, referring to FIG. 26 and FIG. 27, a scalable, possibly incomplete hypercube network is implemented using high-speed serial or copper cables (2602, 2702) between rack modules in a rack and across the entire system of racks. In at least one embodiment, one of the FPGA/ASIC chips of an accelerator is connected to the host system (2704) through a PCI-Express connection. In at least one embodiment, the host system includes a host microprocessor (2708), on which the software portion of the application runs, and a memory consisting of one or more host memory DRAM units (2706) that are kept coherent with memory on the accelerator. In at least one embodiment, the host system may be a separate module on one of the racks, or may be integrated with one of the modules of the supercomputer. In at least one embodiment, a cube-connected cycles topology provides the communication links to create a hypercube network for a large supercomputer. In at least one embodiment, a small group of FPGA/ASIC chips on a rack module may act as a single hypercube node, such that the total number of external links of each group is increased compared to a single chip. In at least one embodiment, a group contains chips A, B, C, and D on a rack module, with an internal wide differential bus connecting A, B, C, and D in a ring organization. In at least one embodiment, there are 12 serial communication cables connecting a rack module to the outside world. In at least one embodiment, chip A on the rack module connects to serial communication cables 0, 1, and 2. In at least one embodiment, chip B connects to cables 3, 4, and 5. In at least one embodiment, chip C connects to cables 6, 7, and 8. In at least one embodiment, chip D connects to cables 9, 10, and 11.
In at least one embodiment, an entire group {A, B, C, D} constituting a rack module can form a hypercube node within a supercomputer system of up to 2^12 = 4096 rack modules (16384 FPGA/ASIC chips). In at least one embodiment, for chip A to send a message out on link 4 of group {A, B, C, D}, the message must first be routed to chip B using the on-board differential wide bus connection. In at least one embodiment, a message arriving into group {A, B, C, D} on link 4 and destined for chip A (i.e., arriving at B) must also first be routed to the correct destination chip (A) internally within group {A, B, C, D}. In at least one embodiment, parallel supercomputer systems of other sizes may also be implemented.
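The link-to-chip assignment and first-hop routing described above can be sketched as follows. This is an illustrative model of the example group {A, B, C, D} only (chips own external serial links 0-2, 3-5, 6-8, and 9-11 respectively), not code from any actual system.

```python
# Illustrative sketch of intra-group routing on a rack module:
# chips A, B, C, D own serial communication links 0-2, 3-5, 6-8, 9-11.

CHIPS = ["A", "B", "C", "D"]

def link_owner(link):
    """Chip that physically terminates a given external serial link (0-11)."""
    return CHIPS[link // 3]

def outbound_route(src_chip, link):
    """Intra-group hops a message takes from src_chip to leave on `link`."""
    owner = link_owner(link)
    if src_chip == owner:
        return [src_chip]        # link is local; no intra-group hop needed
    return [src_chip, owner]     # first hop over the on-board wide bus
```

For example, a message from chip A going out on link 4 hops first to chip B, matching the routing described in the text.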
In at least one embodiment, at least one component shown or described with respect to FIG. 26 is used to perform the techniques and/or functions described in connection with FIGS. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 26 is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 26 is used to execute a first API to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 26 is used to execute a first API to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 26 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
In at least one embodiment, at least one component shown or described with respect to FIG. 27 is used to perform the techniques and/or functions described in connection with FIGS. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 27 is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 27 is used to execute a first API to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 27 is used to execute a first API to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 27 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Artificial intelligence
The following figures illustrate exemplary artificial intelligence-based systems that may be used to implement at least one embodiment.
FIG. 28A illustrates inference and/or training logic 2815 used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 2815 are provided below in conjunction with FIG. 28A and/or FIG. 28B.
In at least one embodiment, the inference and/or training logic 2815 can include, but is not limited to, code and/or data storage 2801 for storing forward and/or output weights and/or input/output data, and/or other parameters for configuring neurons or layers of a neural network that is trained and/or used for inference in aspects of one or more embodiments. In at least one embodiment, training logic 2815 may include or be coupled to code and/or data store 2801 for storing graph code or other software to control timing and/or sequencing, where loading weights and/or other parameter information configures logic, including integer and/or floating point units (collectively referred to as arithmetic logic units (ALUs)). In at least one embodiment, code (such as graph code) loads weight or other parameter information into the processor ALUs based on the architecture of the neural network to which such code corresponds. In at least one embodiment, code and/or data store 2801 stores weight parameters and/or input/output data of each layer of a neural network that is trained or used in conjunction with one or more embodiments, during forward propagation of input/output data and/or weight parameters in the course of training and/or inference using aspects of one or more embodiments. In at least one embodiment, any portion of code and/or data store 2801 may be included with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache memory or system memory.
In at least one embodiment, any portion of code and/or data store 2801 may be internal or external to one or more processors or other hardware logic devices or circuitry. In at least one embodiment, code and/or data store 2801 can be a cache memory, dynamic random-access memory ("DRAM"), static random-access memory ("SRAM"), non-volatile memory (e.g., flash memory), or other storage device. In at least one embodiment, the choice of whether code and/or data store 2801 is internal or external to the processor, for example, or comprises DRAM, SRAM, flash, or some other storage type, may depend on the available storage on-chip versus off-chip, the latency requirements of the training and/or inference functions being performed, the batch size of the data used in inference and/or training of the neural network, or some combination of these factors.
In at least one embodiment, the inference and/or training logic 2815 can include, but is not limited to, code and/or data store 2805 for storing backward and/or output weights and/or input/output data corresponding to neurons or layers of a neural network that are trained and/or used for inference in aspects of one or more embodiments. In at least one embodiment, code and/or data store 2805 stores weight parameters and/or input/output data of each layer of a neural network that is trained or used in conjunction with one or more embodiments, during backward propagation of input/output data and/or weight parameters in the course of training and/or inference using aspects of one or more embodiments. In at least one embodiment, training logic 2815 may include or be coupled to code and/or data store 2805 for storing graph code or other software to control timing and/or sequencing, where weights and/or other parameter information are to be loaded to configure logic, including integer and/or floating point units (collectively referred to as arithmetic logic units (ALUs)).
In at least one embodiment, code (such as graph code) causes weight or other parameter information to be loaded into the processor ALUs based on the architecture of the neural network to which such code corresponds. In at least one embodiment, any portion of code and/or data store 2805 may be included with other on-chip or off-chip data stores, including the processor's L1, L2, or L3 cache or system memory. In at least one embodiment, any portion of code and/or data store 2805 may be internal or external to one or more processors or other hardware logic devices or circuitry. In at least one embodiment, code and/or data store 2805 can be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the choice of whether code and/or data store 2805 is internal or external to the processor, for example, or comprises DRAM, SRAM, flash, or some other storage type, may depend on the available storage on-chip versus off-chip, the latency requirements of the training and/or inference functions being performed, the batch size of the data used in inference and/or training of the neural network, or some combination of these factors.
In at least one embodiment, code and/or data store 2801 and code and/or data store 2805 can be separate storage structures. In at least one embodiment, code and/or data store 2801 and code and/or data store 2805 can be a combined storage structure. In at least one embodiment, code and/or data store 2801 and code and/or data store 2805 can be partially combined and partially separated. In at least one embodiment, code and/or data store 2801 and any portion of code and/or data store 2805 may be included with other on-chip or off-chip data stores (including processor L1, L2, or L3 caches or system memory).
In at least one embodiment, the inference and/or training logic 2815 can include, but is not limited to, one or more arithmetic logic units ("ALUs") 2810, including integer and/or floating point units, for performing logical and/or mathematical operations based at least in part on, or indicated by, training and/or inference code (e.g., graph code), the results of which can produce activations (e.g., output values from layers or neurons within a neural network) stored in an activation store 2820 that are functions of input/output and/or weight parameter data stored in code and/or data store 2801 and/or code and/or data store 2805. In at least one embodiment, the activations stored in the activation store 2820 are generated according to linear algebra and/or matrix-based mathematics performed by the ALUs 2810 in response to executing instructions or other code, wherein weight values stored in code and/or data store 2805 and/or data store 2801 are used as operands along with other values (such as bias values, gradient information, momentum values, or other parameters or hyperparameters), any or all of which may be stored in code and/or data store 2805 or code and/or data store 2801 or another store on-chip or off-chip.
In at least one embodiment, one or more ALUs 2810 are included within one or more processors or other hardware logic devices or circuits, while in another embodiment, one or more ALUs 2810 may be external to the processor or other hardware logic device or circuit that uses them (e.g., a co-processor). In at least one embodiment, the ALUs 2810 may be included within a processor's execution units or otherwise within a bank of ALUs accessible to the processor's execution units, either within the same processor or distributed among different processors of different types (e.g., central processing units, graphics processing units, fixed function units, etc.). In at least one embodiment, code and/or data store 2801, code and/or data store 2805, and activation store 2820 may share a processor or other hardware logic device or circuitry, while in another embodiment they may be in different processors or other hardware logic devices or circuitry, or in some combination of the same and different processors or other hardware logic devices or circuitry. In at least one embodiment, any portion of the activation store 2820 may be included with other on-chip or off-chip data stores, including the processor's L1, L2, or L3 cache or system memory. In addition, the inference and/or training code can be stored with other code accessible to a processor or other hardware logic or circuitry, and can be fetched and/or processed using the processor's fetch, decode, scheduling, execution, retirement, and/or other logic circuitry.
In at least one embodiment, the activation store 2820 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the activation store 2820 may be wholly or partially within or external to one or more processors or other logic circuits. In at least one embodiment, the choice of whether the activation store 2820 is internal or external to the processor, for example, or comprises DRAM, SRAM, flash, or some other storage type, may depend on the available storage on-chip versus off-chip, the latency requirements of the training and/or inference functions being performed, the batch size of the data used in inference and/or training of the neural network, or some combination of these factors.
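As a deliberately simplified, purely illustrative sketch of the data flow described above (hypothetical names; real inference hardware operates on tensors, not scalars), an ALU computes an activation from inputs and parameters held in a code and/or data store, and the result is placed in an activation store:

```python
# Illustrative sketch only: an "ALU" computes a weighted sum plus bias with
# a ReLU nonlinearity; the result lands in an activation store, as a
# function of inputs and weights held in a code/data store.

def alu_forward(inputs, weights, bias):
    """Weighted sum of inputs and weights, plus bias, through ReLU."""
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return max(0.0, z)

# Hypothetical stand-ins for the code/data store and activation store.
code_and_data_store = {"weights": [0.5, -0.25], "bias": 0.1}
activation_store = {}

activation_store["layer0"] = alu_forward(
    inputs=[2.0, 4.0],
    weights=code_and_data_store["weights"],
    bias=code_and_data_store["bias"],
)
```

Here the stored activation (2.0 * 0.5 + 4.0 * -0.25 + 0.1 = 0.1 after ReLU) is a function of the input data and the stored weight and bias parameters, mirroring the relationship between the data stores, ALUs, and activation storage described in the text.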
In at least one embodiment, the inference and/or training logic 2815 shown in FIG. 28A can be used in conjunction with an application-specific integrated circuit ("ASIC"), such as a TensorFlow® Processing Unit from Google, an inference processing unit (IPU) from Graphcore™, or a Nervana® (e.g., "Lake Crest") processor from Intel Corp. In at least one embodiment, the inference and/or training logic 2815 illustrated in FIG. 28A can be used in conjunction with central processing unit ("CPU") hardware, graphics processing unit ("GPU") hardware, or other hardware, such as a field programmable gate array ("FPGA").
In at least one embodiment, at least one component shown or described with respect to FIG. 28A is used to perform the techniques and/or functions described in connection with FIGS. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 28A is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 28A is used to execute a first API to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 28A is used to execute a first API to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 28A is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 28B illustrates inference and/or training logic 2815, in accordance with at least one embodiment. In at least one embodiment, the inference and/or training logic 2815 can include, but is not limited to, hardware logic in which computing resources are dedicated or otherwise used exclusively in conjunction with weight values or other information corresponding to one or more layers of neurons within a neural network. In at least one embodiment, the inference and/or training logic 2815 shown in FIG. 28B can be used in conjunction with an application-specific integrated circuit (ASIC), such as a TensorFlow® Processing Unit from Google, an inference processing unit (IPU) from Graphcore™, or a Nervana® (e.g., "Lake Crest") processor from Intel Corp. In at least one embodiment, the inference and/or training logic 2815 shown in FIG. 28B can be used in conjunction with central processing unit (CPU) hardware, graphics processing unit (GPU) hardware, or other hardware, such as a field programmable gate array (FPGA). In at least one embodiment, the inference and/or training logic 2815 includes, but is not limited to, code and/or data store 2801 and code and/or data store 2805, which may be used to store code (e.g., graph code), weight values, and/or other information, including bias values, gradient information, momentum values, and/or other parameter or hyperparameter information. In at least one embodiment illustrated in FIG. 28B, each of code and/or data store 2801 and code and/or data store 2805 is associated with a dedicated computing resource (computing hardware 2802 and computing hardware 2806, respectively). In at least one embodiment, each of computing hardware 2802 and 2806 includes one or more ALUs that perform mathematical functions (such as linear algebraic functions) only on information stored in code and/or data store 2801 and code and/or data store 2805, respectively, the results of which are stored in activation store 2820.
In at least one embodiment, each code and/or data store 2801 and 2805 and corresponding computing hardware 2802 and 2806 correspond to different layers of a neural network, respectively, such that the resulting activation from one storage/computing pair 2801/2802 of the code and/or data store 2801 and computing hardware 2802 is provided as input to the next storage/computing pair 2805/2806 of the code and/or data store 2805 and computing hardware 2806 to mirror the conceptual organization of the neural network. In at least one embodiment, each of the storage/computation pairs 2801/2802 and 2805/2806 may correspond to more than one neural network layer. In at least one embodiment, additional storage/computation pairs (not shown) after storage/computation pairs 2801/2802 and 2805/2806 or in parallel with storage/computation pairs 2801/2802 and 2805/2806 may be included in inference and/or training logic 2815.
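The storage/computation pairing described above can be sketched as follows (illustrative only; names and functions are hypothetical stand-ins): each pair couples one layer's stored parameters with a dedicated compute function, and the activation produced by one pair is fed as input to the next, mirroring the layer order of the network.

```python
# Illustrative sketch of storage/computation pairs: each pair holds one
# layer's parameters and a compute function; the activation of one pair
# becomes the input of the next, mirroring the network's layer order.

def scale(x, w):
    """Hypothetical per-layer computation: multiply by a stored weight."""
    return x * w

def shift(x, w):
    """Hypothetical per-layer computation: add a stored parameter."""
    return x + w

# Each (parameters, compute) pair stands in for a storage/computation pair
# such as 2801/2802 followed by 2805/2806.
pairs = [
    (2.0, scale),
    (3.0, shift),
]

def run_pipeline(x, pairs):
    for params, compute in pairs:
        x = compute(x, params)   # resulting activation feeds the next pair
    return x
```

With an input of 1.0, the first pair produces 2.0 and the second produces 5.0, illustrating how the resulting activation of one storage/computation pair is provided as input to the next.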
In at least one embodiment, at least one component shown or described with respect to FIG. 28B is used to perform the techniques and/or functions described in connection with FIGS. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 28B is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 28B is used to execute a first API to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 28B is used to execute a first API to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 28B is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 29 illustrates training and deployment of a deep neural network, in accordance with at least one embodiment. In at least one embodiment, the training dataset 2902 is used to train untrained neural network 2906. In at least one embodiment, training framework 2904 is a PyTorch framework, while in other embodiments training framework 2904 is TensorFlow, Boost, Caffe, Microsoft Cognitive Toolkit/CNTK, MXNet, Chainer, Keras, Deeplearning4j, or another training framework. In at least one embodiment, training framework 2904 trains untrained neural network 2906 using the processing resources described herein to generate trained neural network 2908. In at least one embodiment, the weights may be selected randomly or by pre-training using a deep belief network. In at least one embodiment, training may be performed in a supervised, partially supervised, or unsupervised manner.
In at least one embodiment, the untrained neural network 2906 is trained using supervised learning, wherein the training dataset 2902 includes inputs paired with desired outputs for those inputs, or wherein the training dataset 2902 includes inputs having known outputs and the outputs of the untrained neural network 2906 are manually graded. In at least one embodiment, the untrained neural network 2906 is trained in a supervised manner: inputs from the training dataset 2902 are processed and the resulting outputs are compared against a set of expected or desired outputs. In at least one embodiment, errors are then propagated back through untrained neural network 2906. In at least one embodiment, training framework 2904 adjusts the weights that control untrained neural network 2906. In at least one embodiment, training framework 2904 includes tools for monitoring how well untrained neural network 2906 converges toward a model (such as trained neural network 2908) suitable for generating correct answers (such as result 2914) based on input data (such as new dataset 2912). In at least one embodiment, training framework 2904 trains untrained neural network 2906 repeatedly while adjusting weights using a loss function and an adjustment algorithm (such as stochastic gradient descent) to refine the output of untrained neural network 2906. In at least one embodiment, the training framework 2904 trains the untrained neural network 2906 until the untrained neural network 2906 achieves a desired accuracy. In at least one embodiment, trained neural network 2908 may then be deployed to implement any number of machine learning operations.
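The supervised procedure above (forward pass, comparison against a desired output, backpropagation of error, and weight adjustment by stochastic gradient descent) can be sketched for a one-weight linear model. This toy example is purely illustrative and is not tied to any particular framework.

```python
# Toy illustration of the supervised loop described above: a one-weight
# linear model y = w * x trained with squared error and stochastic
# gradient descent.

def train(dataset, w=0.0, lr=0.1, epochs=100):
    for _ in range(epochs):
        for x, y_desired in dataset:           # inputs paired with desired outputs
            y = w * x                          # forward pass
            grad = 2.0 * (y - y_desired) * x   # gradient of squared error
            w -= lr * grad                     # weight adjustment (SGD step)
    return w

# Inputs paired with desired outputs sampled from y = 3x.
dataset = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]
```

Repeated passes over the dataset drive the weight toward 3.0, the value that makes the model's outputs match the desired outputs, illustrating convergence toward a model that generates correct answers.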
In at least one embodiment, untrained neural network 2906 is trained using unsupervised learning, wherein untrained neural network 2906 attempts to train itself using unlabeled data. In at least one embodiment, for unsupervised learning, training data set 2902 includes input data without any associated output data or "ground truth" data. In at least one embodiment, untrained neural network 2906 may learn groupings within training data set 2902 and may determine how individual inputs relate to the untrained data (e.g., new data set 2912). In at least one embodiment, unsupervised training may be used to generate a self-organizing map in trained neural network 2908 capable of performing operations useful in reducing the dimensionality of new data set 2912. In at least one embodiment, unsupervised training may also be used to perform anomaly detection, which allows identification of data points in new data set 2912 that deviate from the normal pattern of new data set 2912.
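The anomaly-detection use mentioned above can be sketched without a neural network at all: learn the "normal pattern" of unlabeled training data (here, simply its mean and spread) and flag points in a new data set that deviate from it. The numbers and the 3-sigma threshold are hypothetical illustrations, not part of this disclosure.

```python
# Hedged sketch of anomaly detection from unlabeled data: model the normal
# pattern as mean and standard deviation, then flag outliers in new data.
import statistics

training_set = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]   # unlabeled data, no ground truth
mean = statistics.mean(training_set)
stdev = statistics.stdev(training_set)

def is_anomaly(x, threshold=3.0):
    """Flag x if it lies more than `threshold` standard deviations from the mean."""
    return abs(x - mean) > threshold * stdev

new_data_set = [10.1, 9.9, 14.5]
anomalies = [x for x in new_data_set if is_anomaly(x)]
print(anomalies)  # only 14.5 deviates from the normal pattern
```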
In at least one embodiment, semi-supervised learning may be used, a technique in which training data set 2902 includes a mix of labeled and unlabeled data. In at least one embodiment, training framework 2904 may be used to perform incremental learning, such as through transfer learning techniques. In at least one embodiment, incremental learning enables trained neural network 2908 to adapt to new data set 2912 without forgetting knowledge instilled within trained neural network 2908 during initial training.
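The transfer-learning idea above can be sketched as follows: the previously learned mapping is frozen so earlier knowledge is not forgotten, and only a small new "head" parameter is adapted to the new data set. The extractor function, data, and learning rate are all hypothetical stand-ins, not an implementation from this disclosure.

```python
# Minimal sketch of incremental learning via transfer learning: freeze the
# pretrained part, train only the new head on the new data set.

def frozen_feature_extractor(x):
    """Stands in for the pretrained network; its weights are never updated,
    so knowledge from initial training is preserved."""
    return 2.0 * x          # pretend this mapping was learned previously

head_w = 0.0                # the only trainable parameter for the new task
new_data_set = [(1.0, 6.0), (2.0, 12.0)]   # new task: desired = 3 * feature

for _ in range(200):        # incremental training touches only head_w
    for x, desired in new_data_set:
        feature = frozen_feature_extractor(x)
        error = head_w * feature - desired
        head_w -= 0.01 * 2 * error * feature   # SGD step on the head only

print(round(head_w, 2))  # head adapts toward 3.0; extractor is unchanged
```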
In at least one embodiment, at least one component shown or described with respect to fig. 29 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 29 is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 29 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 29 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 29 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
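The API pattern this disclosure repeatedly describes, a first API that selects a second API to execute, monitor, or terminate software workloads it identifies, can be sketched as a dispatcher. Every name below (`dispatch_workload_api`, the `_execute_api`/`_monitor_api`/`_terminate_api` helpers, the workload IDs) is hypothetical and chosen for illustration; this is not an actual NVIDIA API.

```python
# Hedged sketch: a first API selects among second APIs that act on the
# software workloads identified in the first API's invocation.

running = {}   # workload id -> state; stands in for real scheduler state

def _execute_api(ids):
    for wid in ids:
        running[wid] = "running"

def _monitor_api(ids):
    return {wid: running.get(wid, "unknown") for wid in ids}

def _terminate_api(ids):
    for wid in ids:
        running[wid] = "terminated"

# candidate "second APIs", keyed by the requested operation
_SECOND_APIS = {"execute": _execute_api,
                "monitor": _monitor_api,
                "terminate": _terminate_api}

def dispatch_workload_api(operation, workload_ids):
    """First API: selects the second API for `operation` and invokes it on
    the workloads identified by `workload_ids`."""
    second_api = _SECOND_APIS[operation]   # selection of the second API
    return second_api(workload_ids)

dispatch_workload_api("execute", ["w1", "w2"])
dispatch_workload_api("terminate", ["w2"])
print(dispatch_workload_api("monitor", ["w1", "w2"]))  # w1 running, w2 terminated
```

The design choice mirrored here is that the caller of the first API never names the second API directly; it only identifies workloads and an operation, and the first API performs the selection.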
5G network
The following figures illustrate exemplary 5G network-based systems that may be used to implement at least one embodiment.
Fig. 30 illustrates an architecture of a system 3000 of a network in accordance with at least one embodiment. In at least one embodiment, system 3000 is shown to include User Equipment (UE) 3002 and UE 3004. In at least one embodiment, the UEs 3002 and 3004 are shown as smart phones (e.g., handheld touch screen mobile computing devices connectable to one or more cellular networks), but may also include any mobile or non-mobile computing devices, such as Personal Digital Assistants (PDAs), pagers, laptop computers, desktop computers, wireless handheld devices, or any computing device that includes a wireless communication interface.
In at least one embodiment, any of the UEs 3002 and 3004 may include an internet of things (IoT) UE, which may include a network access layer designed for low-power IoT applications utilizing short-lived UE connections. In at least one embodiment, an IoT UE may utilize technologies such as machine-to-machine (M2M) or Machine Type Communication (MTC) to exchange data with an MTC server or device via a Public Land Mobile Network (PLMN), Proximity-based Services (ProSe) or device-to-device (D2D) communication, a sensor network, or an IoT network. In at least one embodiment, an M2M or MTC data exchange may be a machine-initiated data exchange. In at least one embodiment, an IoT network describes interconnected IoT UEs, which may include uniquely identifiable embedded computing devices (within the internet infrastructure) with short-lived connections. In at least one embodiment, IoT UEs may execute background applications (e.g., keep-alive messages, status updates, etc.) to facilitate connections of an IoT network.
In at least one embodiment, UE 3002 and UE 3004 may be configured to connect (e.g., communicatively couple) with a Radio Access Network (RAN) 3016. In at least one embodiment, RAN 3016 may be, for example, an evolved Universal Mobile Telecommunications System (UMTS) Terrestrial Radio Access Network (E-UTRAN), a NextGen RAN (NG RAN), or some other type of RAN. In at least one embodiment, UE 3002 and UE 3004 utilize connections 3012 and 3014, respectively, each of which comprises a physical communication interface or layer. In at least one embodiment, connections 3012 and 3014 are illustrated as air interfaces for enabling communicative coupling and may be consistent with cellular communication protocols, such as a Global System for Mobile Communications (GSM) protocol, a Code Division Multiple Access (CDMA) network protocol, a Push-to-Talk (PTT) protocol, a PTT over Cellular (POC) protocol, a Universal Mobile Telecommunications System (UMTS) protocol, a 3GPP Long Term Evolution (LTE) protocol, a fifth generation (5G) protocol, a New Radio (NR) protocol, and variations thereof.
In at least one embodiment, the UEs 3002 and 3004 may also exchange communication data directly via ProSe interface 3006. In at least one embodiment, proSe interface 3006 may alternatively be referred to as a side link interface, comprising one or more logical channels including, but not limited to, a physical side link control channel (PSCCH), a physical side link shared channel (PSSCH), a physical side link discovery channel (PSDCH), and a physical side link broadcast channel (PSBCH).
In at least one embodiment, UE 3004 is shown configured to access an Access Point (AP) 3010 via connection 3008. In at least one embodiment, connection 3008 may comprise a local wireless connection, such as a connection consistent with any IEEE 802.11 protocol, in which case AP 3010 would comprise a wireless fidelity (Wi-Fi) router. In at least one embodiment, AP 3010 is shown connected to the internet rather than to the core network of the wireless system.
In at least one embodiment, RAN 3016 may include one or more access nodes that enable connections 3012 and 3014. In at least one embodiment, these Access Nodes (ANs) may be referred to as Base Stations (BSs), NodeBs, evolved NodeBs (eNBs), next generation NodeBs (gNBs), RAN nodes, and so forth, and may comprise ground stations (e.g., terrestrial access points) or satellite stations providing coverage within a geographic area (e.g., a cell). In at least one embodiment, RAN 3016 may include one or more RAN nodes (e.g., macro RAN node 3018) for providing macro cells and one or more RAN nodes (e.g., Low Power (LP) RAN node 3020) for providing femtocells or picocells (e.g., cells having smaller coverage areas, smaller user capacity, or higher bandwidth than macro cells).
In at least one embodiment, either of the RAN nodes 3018 and 3020 may terminate the air interface protocol and may be the first point of contact for the UEs 3002 and 3004. In at least one embodiment, either of the RAN nodes 3018 and 3020 may implement various logical functions of the RAN 3016 including, but not limited to, radio Network Controller (RNC) functions such as radio bearer management, uplink and downlink dynamic radio resource management, and data packet scheduling and mobility management.
In at least one embodiment, the UEs 3002 and 3004 may be configured to communicate with each other or any of the RAN nodes 3018 and 3020 over multicarrier communication channels using Orthogonal Frequency Division Multiplexing (OFDM) communication signals in accordance with various communication techniques such as, but not limited to, orthogonal Frequency Division Multiple Access (OFDMA) communication techniques (e.g., for downlink communications) or single carrier frequency division multiple access (SC-FDMA) communication techniques (e.g., for uplink and ProSe or side link communications), and/or variants thereof. In at least one embodiment, the OFDM signal may include a plurality of orthogonal subcarriers.
In at least one embodiment, the downlink resource grid may be used for downlink transmissions from either of the RAN nodes 3018 and 3020 to the UEs 3002 and 3004, while the uplink transmissions may utilize similar techniques. In at least one embodiment, the grid may be a time-frequency grid, referred to as a resource grid or a time-frequency resource grid, which is a physical resource in the downlink in each time slot. In at least one embodiment, such a time-frequency planar representation is a common practice of OFDM systems, which makes it intuitive for radio resource allocation. In at least one embodiment, each column and each row of the resource grid corresponds to one OFDM symbol and one OFDM subcarrier, respectively. In at least one embodiment, the duration of the resource grid in the time domain corresponds to one slot in a radio frame. In at least one embodiment, the smallest time-frequency unit in the resource grid is denoted as a resource element. In at least one embodiment, each resource grid includes a plurality of resource blocks that describe the mapping of certain physical channels to resource elements. In at least one embodiment, each resource block includes a set of resource elements. In at least one embodiment, in the frequency domain, this may represent the minimum number of resources that can currently be allocated. In at least one embodiment, there are several different physical downlink channels transmitted using such resource blocks.
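The resource-grid structure described above can be checked with simple arithmetic, assuming standard LTE numerology with a normal cyclic prefix (these figures come from the LTE specification generally, not from this disclosure specifically): a resource block spans 12 subcarriers in frequency and one slot of 7 OFDM symbols in time.

```python
# Back-of-envelope sketch of the downlink resource grid, assuming LTE
# numerology with a normal cyclic prefix (hypothetical illustration).

subcarriers_per_resource_block = 12     # frequency-domain width of a resource block
ofdm_symbols_per_slot = 7               # time-domain duration of the grid (one slot)

# each (OFDM symbol, subcarrier) cell of the grid is one resource element
resource_elements_per_block = subcarriers_per_resource_block * ofdm_symbols_per_slot
print(resource_elements_per_block)  # 84 resource elements per resource block
```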
In at least one embodiment, a Physical Downlink Shared Channel (PDSCH) may carry user data and higher layer signaling to UEs 3002 and 3004. In at least one embodiment, a Physical Downlink Control Channel (PDCCH) may carry information on a transport format and resource allocation related to a PDSCH channel, and the like. In at least one embodiment, it may also inform UEs 3002 and 3004 of transport format, resource allocation, and HARQ (hybrid automatic repeat request) information related to the uplink shared channel. In at least one embodiment, in general, downlink scheduling (allocation of control and shared channel resource blocks to UEs 3002 within a cell) may be performed at either of the RAN nodes 3018 and 3020 based on channel quality information fed back from either of the UEs 3002 and 3004. In at least one embodiment, the downlink resource allocation information may be transmitted on a PDCCH for (e.g., allocated to) each of the UEs 3002 and 3004.
In at least one embodiment, the PDCCH may transmit control information using Control Channel Elements (CCEs). In at least one embodiment, the PDCCH complex-valued symbols may first be organized into quadruplets before being mapped to resource elements, and these quadruplets may then be permuted using a sub-block interleaver for rate matching. In at least one embodiment, each PDCCH may be transmitted using one or more of these CCEs, where each CCE may correspond to nine sets of four physical resource elements referred to as Resource Element Groups (REGs). In at least one embodiment, four Quadrature Phase Shift Keying (QPSK) symbols may be mapped to each REG. In at least one embodiment, the PDCCH may be transmitted using one or more CCEs depending on the size of the Downlink Control Information (DCI) and channel conditions. In at least one embodiment, there may be four or more different PDCCH formats defined in LTE with different numbers of CCEs (e.g., aggregation level, L=1, 2, 4, or 8).
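The PDCCH sizing arithmetic above can be sketched numerically: one CCE corresponds to nine REGs, each REG carries four QPSK symbols, and QPSK carries 2 bits per symbol, giving 72 bits of capacity per CCE, scaled by the aggregation level. These are standard LTE figures restated from the paragraph, not values specific to this disclosure.

```python
# Quick check of PDCCH capacity per CCE and per aggregation level
# (hypothetical illustration using the standard LTE figures above).

regs_per_cce = 9            # each CCE corresponds to nine REGs
qpsk_symbols_per_reg = 4    # four QPSK symbols mapped to each REG
bits_per_qpsk_symbol = 2    # QPSK carries 2 bits per symbol

bits_per_cce = regs_per_cce * qpsk_symbols_per_reg * bits_per_qpsk_symbol
print(bits_per_cce)  # 72 bits of capacity per CCE

for aggregation_level in (1, 2, 4, 8):   # PDCCH formats defined in LTE
    print(aggregation_level, aggregation_level * bits_per_cce)
```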
In at least one embodiment, an Enhanced Physical Downlink Control Channel (EPDCCH) using PDSCH resources may be used for control information transmission. In at least one embodiment, the EPDCCH may be transmitted using one or more Enhanced Control Channel Elements (ECCEs). In at least one embodiment, each ECCE may correspond to nine sets of four physical resource elements referred to as Enhanced Resource Element Groups (EREGs). In at least one embodiment, ECCEs may have other numbers of EREGs in some cases.
In at least one embodiment, the RAN 3016 is shown communicatively coupled to a Core Network (CN) 3038 via an S1 interface 3022. In at least one embodiment, the CN 3038 may be an Evolved Packet Core (EPC) network, a NextGen Packet Core (NPC) network, or some other type of CN. In at least one embodiment, the S1 interface 3022 is split into two parts: an S1-U interface 3026 carrying traffic data between the RAN nodes 3018 and 3020 and a serving gateway (S-GW) 3030; and an S1-Mobility Management Entity (MME) interface 3024, which is a signaling interface between the RAN nodes 3018 and 3020 and the MME 3028.
In at least one embodiment, the CN 3038 includes an MME 3028, an S-GW 3030, a Packet Data Network (PDN) gateway (P-GW) 3034, and a Home Subscriber Server (HSS) 3032. In at least one embodiment, the MME 3028 may be similar in function to the control plane of a legacy serving General Packet Radio Service (GPRS) support node (SGSN). In at least one embodiment, the MME 3028 may manage mobility aspects in the access such as gateway selection and tracking area list management. In at least one embodiment, HSS 3032 may include a database for network users that includes subscription-related information for supporting network entities to handle communication sessions. In at least one embodiment, the CN 3038 may include one or more HSS 3032, depending on the number of mobile users, the capacity of the device, the organization of the network, etc. In at least one embodiment, HSS 3032 may provide support for routing/roaming, authentication, authorization, naming/addressing resolution, location dependencies, and the like.
In at least one embodiment, the S-GW 3030 may terminate the S1 interface 3022 towards the RAN 3016 and route data packets between the RAN 3016 and the CN 3038. In at least one embodiment, the S-GW 3030 may be a local mobility anchor for inter-RAN node handovers and may also provide an anchor for inter-3 GPP mobility. In at least one embodiment, other responsibilities may include lawful interception, charging, and some policy enforcement.
In at least one embodiment, the P-GW 3034 may terminate the SGi interface towards the PDN. In at least one embodiment, the P-GW 3034 may route data packets between the EPC network (e.g., CN 3038) and external networks, such as networks including an application server 3040 (or referred to as an Application Function (AF)), via an Internet Protocol (IP) interface 3042. In at least one embodiment, the application server 3040 may be an element that provides applications using IP bearer resources employing a core network (e.g., UMTS Packet Service (PS) domain, LTE PS data service, etc.). In at least one embodiment, the P-GW 3034 is shown to be communicatively coupled to an application server 3040 via an Internet Protocol (IP) interface 3042. In at least one embodiment, the application server 3040 may also be configured to support one or more communication services (e.g., voice over internet protocol (VoIP) sessions, PTT sessions, group communication sessions, social network services, etc.) of the UEs 3002 and 3004 via the CN 3038.
In at least one embodiment, the P-GW 3034 may also be a node for policy enforcement and charging data collection. In at least one embodiment, the Policy and Charging Rules Function (PCRF) 3036 is the policy and charging control element of the CN 3038. In at least one embodiment, in a non-roaming scenario, a single PCRF may be present in the Home Public Land Mobile Network (HPLMN) associated with an internet protocol connectivity access network (IP-CAN) session of a UE. In at least one embodiment, in a roaming scenario with local breakout of traffic, there may be two PCRFs associated with the IP-CAN session of a UE: a Home PCRF (H-PCRF) within the HPLMN and a Visited PCRF (V-PCRF) within the Visited Public Land Mobile Network (VPLMN). In at least one embodiment, PCRF 3036 may be communicatively coupled to the application server 3040 via the P-GW 3034. In at least one embodiment, the application server 3040 may signal the PCRF 3036 to indicate a new service flow and to select the appropriate quality of service (QoS) and charging parameters. In at least one embodiment, PCRF 3036 may provision this rule into a Policy and Charging Enforcement Function (PCEF) (not shown) with the appropriate Traffic Flow Template (TFT) and QoS Class Identifier (QCI), which commences the QoS and charging as specified by the application server 3040.
In at least one embodiment, at least one component shown or described with respect to fig. 30 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 30 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 30 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 30 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 30 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Fig. 31 illustrates an architecture of a system 3100 of a network according to some embodiments. In at least one embodiment, system 3100 is shown to include a UE 3102; a 5G access node or RAN node (shown as (R)AN node 3108); a user plane function (shown as UPF 3104); a data network (DN 3106), which may be, for example, an operator service, internet access, or a third-party service; and a 5G core network (5GC) (shown as CN 3110).
In at least one embodiment, CN 3110 includes an authentication server function (AUSF 3114); core access and mobility management functions (AMF 3112); session management function (SMF 3118); network exposure function (NEF 3116); policy control function (PCF 3122); a Network Function (NF) repository function (NRF 3120); unified data management (UDM 3124); and an application function (AF 3126). In at least one embodiment, CN 3110 may also include other elements not shown, such as structured data storage network functions (SDSFs), unstructured data storage network functions (UDSFs), and variations thereof.
In at least one embodiment, UPF 3104 may act as an anchor point for intra-RAT and inter-RAT mobility, an external PDU session point of interconnect to DN 3106, and a branching point to support multi-homed PDU sessions. In at least one embodiment, UPF 3104 may also perform packet routing and forwarding, packet inspection, enforcement of the user plane part of policy rules, lawful interception of packets (UP collection), traffic usage reporting, QoS handling for the user plane (e.g., packet filtering, gating, UL/DL rate enforcement), uplink traffic verification (e.g., SDF to QoS flow mapping), transport-level packet marking in the uplink and downlink, and downlink packet buffering and downlink data notification triggering. In at least one embodiment, UPF 3104 may include an uplink classifier for supporting the routing of traffic flows to a data network. In at least one embodiment, DN 3106 may represent various network operator services, internet access, or third-party services.
In at least one embodiment, the AUSF 3114 may store data for authentication of the UE 3102 and process authentication related functions. In at least one embodiment, AUSF 3114 may facilitate a common authentication framework for various access types.
In at least one embodiment, AMF 3112 may be responsible for registration management (e.g., for registering UE 3102, etc.), connection management, reachability management, mobility management, lawful interception of AMF-related events, and access authentication and authorization. In at least one embodiment, AMF 3112 may provide transport for SM messages for SMF 3118 and act as a transparent proxy for routing SM messages. In at least one embodiment, AMF 3112 may also provide transport for Short Message Service (SMS) messages between UE 3102 and an SMS function (SMSF) (not shown in fig. 31). In at least one embodiment, AMF 3112 may serve as a security anchor function (SEA), which may include interaction with AUSF 3114 and UE 3102 and receipt of an intermediate key established as a result of the UE 3102 authentication process. In at least one embodiment, where USIM-based authentication is used, AMF 3112 may retrieve security material from AUSF 3114. In at least one embodiment, AMF 3112 may also include a Security Context Management (SCM) function, which receives from the SEA a key that it uses to derive access-network-specific keys. Furthermore, in at least one embodiment, AMF 3112 may be a termination point of the RAN CP interface (N2 reference point) and a termination point of NAS (N1) signaling, and may perform NAS ciphering and integrity protection.
In at least one embodiment, AMF 3112 may also support NAS signaling with UE 3102 over an N3 InterWorking Function (N3IWF) interface. In at least one embodiment, the N3IWF may be used to provide access to untrusted entities. In at least one embodiment, the N3IWF may be a termination point of the N2 and N3 interfaces for the control plane and user plane, respectively, and as such may handle N2 signaling from the SMF and AMF for PDU sessions and QoS, encapsulate/decapsulate packets for IPsec and N3 tunnels, mark N3 user plane packets in the uplink, and enforce QoS corresponding to N3 packet marking, taking into account QoS requirements associated with such marking received over N2. In at least one embodiment, the N3IWF may also relay uplink and downlink control plane NAS (N1) signaling between UE 3102 and AMF 3112, and relay uplink and downlink user plane packets between UE 3102 and UPF 3104. In at least one embodiment, the N3IWF also provides a mechanism for establishing an IPsec tunnel with UE 3102.
In at least one embodiment, SMF 3118 may be responsible for session management (e.g., session establishment, modification, and release, including tunnel maintenance between UPF and AN nodes); UE IP address allocation and management (including optional authorization); selection and control of the UP function; configuring traffic steering at the UPF to route traffic to the proper destination; termination of interfaces towards policy control functions; policy enforcement and part of QoS control; lawful interception (for SM events and interfaces to the LI system); termination of the SM part of NAS messages; downlink data notification; acting as the initiator of AN-specific SM information, sent to the AN over N2 via the AMF; and determining the SSC mode of a session. In at least one embodiment, SMF 3118 may include the following roaming functionality: handling local enforcement to apply QoS SLAs (VPLMN); a charging data collection and charging interface (VPLMN); lawful interception (for SM events in the VPLMN and interfaces to the LI system); and support for interaction with an external DN for transport of signaling for PDU session authorization/authentication by the external DN.
In at least one embodiment, the NEF 3116 may provide means for securely exposing services and capabilities provided by 3GPP network functions for third parties, internal exposure/re-exposure, application functions (e.g., AF 3126), edge computing or fog computing systems, and the like. In at least one embodiment, the NEF 3116 may authenticate, authorize and/or throttle AF. In at least one embodiment, the NEF 3116 may also convert information exchanged with the AF 3126 and information exchanged with internal network functions. In at least one embodiment, the NEF 3116 may translate between AF service identifiers and internal 5GC information. In at least one embodiment, the NEF 3116 may also receive information from other Network Functions (NFs) based on the exposed capabilities of the other network functions. In at least one embodiment, this information may be stored as structured data at NEF 3116 or at data store NF using a standardized interface. In at least one embodiment, the stored information may then be re-exposed to other NFs and AFs by the NEF 3116 and/or used for other purposes, such as analysis.
In at least one embodiment, NRF 3120 may support service discovery functionality, receive NF discovery requests from NF instances, and provide NF instances with information of discovered NF instances. In at least one embodiment, NRF 3120 also maintains information of available NF instances and services supported thereby.
In at least one embodiment, PCF 3122 may provide policy rules to control plane functions to implement them and may also support a unified policy framework to manage network behavior. In at least one embodiment, PCF 3122 may also implement a Front End (FE) for accessing subscription information related to policy decisions in the UDR of UDM 3124.
In at least one embodiment, UDM 3124 may handle subscription-related information to support the handling of communication sessions by network entities, and may store subscription data of UE 3102. In at least one embodiment, UDM 3124 may include two parts: an application front end (FE) and a User Data Repository (UDR). In at least one embodiment, the UDM may include a UDM-FE responsible for handling credentials, location management, subscription management, and the like. In at least one embodiment, several different front ends may serve the same user in different transactions. In at least one embodiment, the UDM-FE accesses subscription information stored in the UDR and performs authentication credential processing, user identification handling, access authorization, registration/mobility management, and subscription management. In at least one embodiment, the UDR may interact with PCF 3122. In at least one embodiment, UDM 3124 may also support SMS management, wherein an SMS-FE implements similar application logic as described previously.
In at least one embodiment, the AF 3126 may provide application impact on traffic routing, access to Network Capability Exposure (NCE), and interaction with a policy framework for policy control. In at least one embodiment, NCE may be a mechanism that allows 5GC and AF 3126 to provide information to each other via NEF 3116, which NEF 3116 may be used for edge computing implementations. In at least one embodiment, network operators and third party services may be hosted near the attachment access point of the UE 3102 to enable efficient service delivery with reduced end-to-end latency and load on the transport network. In at least one embodiment, for edge computing implementations, the 5GC may select a UPF 3104 close to the UE 3102 and perform traffic steering from the UPF 3104 to the DN 3106 via the N6 interface. In at least one embodiment, this may be based on UE subscription data, UE location, and information provided by AF 3126. In at least one embodiment, the AF 3126 may influence UPF (re) selection and traffic routing. In at least one embodiment, based on the operator deployment, the network operator may allow the AF 3126 to interact directly with the relevant NF when the AF 3126 is considered a trusted entity.
In at least one embodiment, CN 3110 may include an SMSF, which may be responsible for SMS subscription checking and verification, and for relaying SM messages to/from UE 3102 to/from other entities, such as an SMS-GMSC/IWMSC/SMS router. In at least one embodiment, the SMSF may also interact with AMF 3112 and UDM 3124 for a notification procedure indicating that UE 3102 is available for SMS transfer (e.g., setting a UE-not-reachable flag and notifying UDM 3124 when UE 3102 is available for SMS).
In at least one embodiment, system 3100 may include the following service-based interfaces: Namf: a service-based interface exhibited by the AMF; Nsmf: a service-based interface exhibited by the SMF; Nnef: a service-based interface exhibited by the NEF; Npcf: a service-based interface exhibited by the PCF; Nudm: a service-based interface exhibited by the UDM; Naf: a service-based interface exhibited by the AF; Nnrf: a service-based interface exhibited by the NRF; and Nausf: a service-based interface exhibited by the AUSF.
In at least one embodiment, system 3100 may include the following reference points: N1: a reference point between the UE and the AMF; N2: a reference point between the (R)AN and the AMF; N3: a reference point between the (R)AN and the UPF; N4: a reference point between the SMF and the UPF; and N6: a reference point between the UPF and a data network. In at least one embodiment, there may be more reference points and/or service-based interfaces between NF services in the NFs; however, these interfaces and reference points have been omitted for clarity. In at least one embodiment, an N5 reference point may be between the PCF and the AF; an N7 reference point may be between the PCF and the SMF; an N11 reference point may be between the AMF and the SMF; and so forth. In at least one embodiment, CN 3110 may include an Nx interface, which is an inter-CN interface between the MME and AMF 3112, to enable interworking between CN 3110 and CN 3038.
In at least one embodiment, system 3100 may include multiple RAN nodes (such as (R)AN nodes 3108), wherein an Xn interface is defined between two or more (R)AN nodes 3108 (e.g., gNBs) connected to the 5GC (CN 3110), between a (R)AN node 3108 (e.g., a gNB) connected to CN 3110 and an eNB (e.g., a macro RAN node), and/or between two eNBs connected to CN 3110.
In at least one embodiment, the Xn interface may include an Xn user plane (Xn-U) interface and an Xn control plane (Xn-C) interface. In at least one embodiment, the Xn-U may provide non-guaranteed delivery of user plane PDUs and support/provide data forwarding and flow control functionality. In at least one embodiment, the Xn-C may provide management and error-handling functionality, functionality to manage the Xn-C interface, and mobility support for UE 3102 in CONNECTED mode (e.g., CM-CONNECTED), including functionality to manage CONNECTED-mode UE mobility between one or more (R)AN nodes 3108. In at least one embodiment, mobility support may include context transfer from an old (source) serving (R)AN node 3108 to a new (target) serving (R)AN node 3108, and control of user plane tunnels between the old (source) serving (R)AN node 3108 and the new (target) serving (R)AN node 3108.
In at least one embodiment, the protocol stack of the Xn-U may include a transport network layer built on top of an Internet Protocol (IP) transport layer and a GTP-U layer on top of UDP and/or one or more IP layers for carrying user plane PDUs. In at least one embodiment, the Xn-C protocol stack may include an application layer signaling protocol, referred to as Xn application protocol (Xn-AP), and a transport network layer built upon the SCTP layer. In at least one embodiment, the SCTP layer may be on top of the IP layer. In at least one embodiment, the SCTP layer provides guaranteed delivery of application layer messages. In at least one embodiment, in the transport IP layer, point-to-point transport is used to deliver signaling PDUs. In at least one embodiment, the Xn-U protocol stack and/or the Xn-C protocol stack may be the same or similar to the user plane and/or control plane protocol stacks shown and described herein.
In at least one embodiment, at least one component shown or described with respect to fig. 31 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 31 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 31 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 31 is for executing a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 31 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
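The first-API/second-API pattern recited throughout this disclosure can be sketched as follows. This is an illustrative Python model only, not an actual driver interface; all names (`workload_control`, `_SECOND_LEVEL_APIS`, the workload identifiers) are hypothetical:

```python
# Hypothetical sketch: a first API receives workload identifiers and an
# operation, then selects a second API that actually performs (executes,
# monitors, or terminates) the identified software workloads.

def _execute(workload_ids):
    # Second API (illustrative): start the identified workloads.
    return {wid: "running" for wid in workload_ids}

def _monitor(workload_ids):
    # Second API (illustrative): report status of the identified workloads.
    return {wid: "healthy" for wid in workload_ids}

def _terminate(workload_ids):
    # Second API (illustrative): stop execution of the identified workloads.
    return {wid: "terminated" for wid in workload_ids}

# Table mapping operation names to second-level APIs.
_SECOND_LEVEL_APIS = {
    "execute": _execute,
    "monitor": _monitor,
    "terminate": _terminate,
}

def workload_control(operation, workload_ids):
    """First API: selects and invokes a second API for the identified workloads."""
    try:
        second_api = _SECOND_LEVEL_APIS[operation]
    except KeyError:
        raise ValueError(f"unsupported operation: {operation}")
    return second_api(workload_ids)
```

For example, `workload_control("terminate", ["w1"])` selects the termination API and applies it to workload `w1`; the dispatch table is the "selection" step, and each entry is a candidate second API.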
Fig. 32 is an illustration of a control plane protocol stack in accordance with some embodiments. In at least one embodiment, the control plane 3200 is shown as a communication protocol stack between the UE 3002 (or alternatively, UE 3004), RAN 3016, and MME 3028.
In at least one embodiment, the PHY layer 3202 may transmit or receive information used by the MAC layer 3204 over one or more air interfaces. In at least one embodiment, PHY layer 3202 may also perform link adaptation or Adaptive Modulation and Coding (AMC), power control, cell search (e.g., for initial synchronization and handover purposes), and other measurements used by higher layers (e.g., RRC layer 3210). In at least one embodiment, PHY layer 3202 may further perform error detection for the transmission channel, forward Error Correction (FEC) encoding/decoding of the transmission channel, modulation/demodulation of the physical channel, interleaving, rate matching, mapping to the physical channel, and multiple-input multiple-output (MIMO) antenna processing.
In at least one embodiment, the MAC layer 3204 may perform mapping between logical channels and transport channels, multiplexing of MAC Service Data Units (SDUs) from one or more logical channels onto Transport Blocks (TBs) to be delivered to the PHY via transport channels, demultiplexing of MAC SDUs from Transport Blocks (TBs) delivered from the PHY via transport channels onto one or more logical channels, multiplexing of MAC SDUs onto TBs, scheduling information reporting, error correction through hybrid automatic repeat request (HARQ), and logical channel prioritization.
In at least one embodiment, the RLC layer 3206 may operate in a variety of modes of operation, including: transparent Mode (TM), unacknowledged Mode (UM), and Acknowledged Mode (AM). In at least one embodiment, the RLC layer 3206 may perform transmission of upper layer Protocol Data Units (PDUs), error correction by automatic repeat request (ARQ) for AM data transmission, and concatenation, segmentation, and reassembly of RLC SDUs for UM and AM data transmission. In at least one embodiment, the RLC layer 3206 may also perform re-segmentation of RLC data PDUs for AM data transmissions, reorder RLC data PDUs for UM and AM data transmissions, detect duplicate data for UM and AM data transmissions, discard RLC SDUs for UM and AM data transmissions, detect protocol errors for AM data transmissions, and perform RLC re-establishment.
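The segmentation and reassembly of RLC SDUs described above can be illustrated with a minimal sketch. Python is used purely for illustration; real RLC PDUs also carry headers, sequence numbers, and polling bits, all omitted here:

```python
def segment(sdu: bytes, max_pdu: int):
    """Split an RLC SDU into segments no larger than max_pdu bytes."""
    return [sdu[i:i + max_pdu] for i in range(0, len(sdu), max_pdu)]

def reassemble(segments):
    """Concatenate received segments back into the original SDU."""
    return b"".join(segments)
```

Reassembly inverts segmentation when segments arrive complete and in order; re-segmentation for AM retransmissions would apply `segment` again to an already-segmented PDU.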
In at least one embodiment, the PDCP layer 3208 may perform header compression and decompression of IP data, maintain PDCP Sequence Numbers (SNs), perform in-sequence delivery of higher layer PDUs when reconstructing lower layers, eliminate duplication of lower layer SDUs when reconstructing lower layers for radio bearers mapped on RLC AM, encrypt and decrypt control plane data, integrity protect and integrity verify control plane data, discard data based on control timers, and perform security operations (e.g., ciphering, deciphering, integrity protection, integrity verification, etc.).
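The in-sequence delivery and duplicate elimination functions above can be modeled with a toy reordering buffer keyed by sequence number (SN). This is an assumption-laden sketch: real PDCP also handles SN wraparound, reordering timers, and the security operations listed above, none of which is modeled:

```python
class InOrderDelivery:
    """Toy PDCP-style reordering buffer keyed by sequence number (SN)."""

    def __init__(self):
        self.next_sn = 0     # next SN expected for delivery
        self.buffer = {}     # out-of-order PDUs awaiting delivery
        self.delivered = []  # PDUs delivered to the upper layer, in order

    def receive(self, sn, pdu):
        if sn < self.next_sn or sn in self.buffer:
            return  # duplicate PDU: discard
        self.buffer[sn] = pdu
        # Deliver any contiguous run starting at next_sn.
        while self.next_sn in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_sn))
            self.next_sn += 1
```

A PDU received out of order is held until the gap before it is filled, and a retransmitted copy of an already-delivered PDU is silently dropped.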
In at least one embodiment, the primary services and functions of the RRC layer 3210 may include broadcasting of system information (e.g., included in a Master Information Block (MIB) or a System Information Block (SIB) associated with a non-access stratum (NAS)), broadcasting of system information related to an Access Stratum (AS), paging, establishment, maintenance, and release of RRC connections between a UE and an E-UTRAN (e.g., RRC connection paging, RRC connection establishment, RRC connection modification, and RRC connection release), establishment, configuration, maintenance, and release of point-to-point radio bearers, security functions including key management, inter-Radio Access Technology (RAT) mobility, and measurement configuration for UE measurement reporting. In at least one embodiment, the MIB and SIB may include one or more Information Elements (IEs), each of which may include a separate data field or data structure.
In at least one embodiment, the UE 3002 and the RAN 3016 may utilize a Uu interface (e.g., an LTE-Uu interface) to exchange control plane data via a protocol stack including a PHY layer 3202, a MAC layer 3204, an RLC layer 3206, a PDCP layer 3208, and an RRC layer 3210.
In at least one embodiment, the non-access stratum (NAS) protocol (NAS protocol 3212) forms the highest layer of the control plane between the UE 3002 and the MME 3028. In at least one embodiment, NAS protocol 3212 supports mobility and session management procedures for UE 3002 to establish and maintain an IP connection between UE 3002 and P-GW 3034.
In at least one embodiment, the S1 application protocol (S1-AP) layer (S1-AP layer 3222) may support the functionality of the S1 interface and include elementary procedures (EPs). In at least one embodiment, an EP is a unit of interaction between the RAN 3016 and the CN 3038. In at least one embodiment, S1-AP layer services may include two groups: UE-associated services and non-UE-associated services. In at least one embodiment, these services perform functions including, but not limited to: E-UTRAN radio access bearer (E-RAB) management, UE capability indication, mobility, NAS signaling, RAN Information Management (RIM), and configuration transfer.
In at least one embodiment, a Stream Control Transmission Protocol (SCTP) layer (alternatively referred to as a stream control transmission protocol/internet protocol (SCTP/IP) layer) (SCTP layer 3220) may ensure reliable delivery of signaling messages between the RAN 3016 and the MME 3028 based in part on the IP protocols supported by the IP layer 3218. In at least one embodiment, the L2 layer 3216 and L1 layer 3214 may refer to communication links (e.g., wired or wireless) used by the RAN node and MME to exchange information.
In at least one embodiment, the RAN 3016 and the one or more MMEs 3028 may utilize the S1-MME interface to exchange control plane data via a protocol stack comprising an L1 layer 3214, an L2 layer 3216, an IP layer 3218, an SCTP layer 3220, and an S1-AP layer 3222.
In at least one embodiment, at least one component shown or described with respect to fig. 32 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 32 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 32 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 32 is for executing a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 32 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Fig. 33 is an illustration of a user plane protocol stack in accordance with at least one embodiment. In at least one embodiment, the user plane 3300 is shown as a communication protocol stack between the UE 3002, RAN 3016, S-GW 3030, and P-GW 3034. In at least one embodiment, the user plane 3300 may utilize the same protocol layers as the control plane 3200. In at least one embodiment, for example, the UE 3002 and the RAN 3016 may utilize a Uu interface (e.g., an LTE-Uu interface) to exchange user plane data via a protocol stack including a PHY layer 3202, a MAC layer 3204, an RLC layer 3206, and a PDCP layer 3208.
In at least one embodiment, a General Packet Radio Service (GPRS) tunneling protocol (GTP-U) layer for the user plane (GTP-U layer 3304) may be used to carry user data within the GPRS core network and between the radio access network and the core network. In at least one embodiment, for example, the transmitted user data may be packets in any of the IPv4, IPv6, or PPP formats. In at least one embodiment, the UDP and IP security (UDP/IP) layer (UDP/IP layer 3302) may provide a checksum of data integrity, port numbers for addressing different functions at the source and destination, and encryption and authentication of selected data streams. In at least one embodiment, the RAN 3016 and S-GW 3030 may utilize an S1-U interface to exchange user plane data via a protocol stack comprising L1 layer 3214, L2 layer 3216, UDP/IP layer 3302, and GTP-U layer 3304. In at least one embodiment, the S-GW 3030 and P-GW 3034 may utilize an S5/S8a interface to exchange user plane data via a protocol stack comprising L1 layer 3214, L2 layer 3216, UDP/IP layer 3302, and GTP-U layer 3304. In at least one embodiment, as discussed above with respect to fig. 32, the NAS protocol supports mobility and session management procedures for the UE 3002 to establish and maintain an IP connection between the UE 3002 and the P-GW 3034.
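As a rough illustration of how GTP-U carries user data through the tunnels described above, the sketch below prepends the minimal 8-byte GTP-U header (version 1, G-PDU message type) to a user plane packet. Sequence numbers, N-PDU numbers, and extension headers are omitted:

```python
import struct

def gtpu_encapsulate(teid: int, payload: bytes) -> bytes:
    """Prepend a minimal 8-byte GTP-U header (version 1, G-PDU, no options)."""
    flags = 0x30      # version=1, protocol type=GTP, no optional fields
    msg_type = 0xFF   # G-PDU: carries a user plane packet (e.g., an IP packet)
    # Network byte order: flags (1B), type (1B), length (2B), TEID (4B).
    return struct.pack("!BBHI", flags, msg_type, len(payload), teid) + payload

def gtpu_decapsulate(frame: bytes):
    """Strip the GTP-U header, returning (tunnel endpoint id, payload)."""
    flags, msg_type, length, teid = struct.unpack("!BBHI", frame[:8])
    assert flags == 0x30 and msg_type == 0xFF
    return teid, frame[8:8 + length]
```

The tunnel endpoint identifier (TEID) selects the bearer at the receiving node, which is what lets a single S1-U or S5/S8a association multiplex user data for many UEs.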
In at least one embodiment, at least one component shown or described with respect to fig. 33 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 33 is configured to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 33 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 33 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 33 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Fig. 34 illustrates components 3400 of a core network in accordance with at least one embodiment. In at least one embodiment, the components of the CN 3038 may be implemented in one physical node or in separate physical nodes that include components for reading and executing instructions from a machine-readable or computer-readable medium (e.g., a non-transitory machine-readable storage medium). In at least one embodiment, Network Function Virtualization (NFV) is used to virtualize any or all of the above-described network node functions via executable instructions stored in one or more computer-readable storage media (described in further detail below). In at least one embodiment, a logical instantiation of the CN 3038 may be referred to as a network slice 3402 (e.g., the network slice 3402 is shown as including the HSS 3032, the MME 3028, and the S-GW 3030). In at least one embodiment, a logical instantiation of a portion of the CN 3038 may be referred to as a network sub-slice 3404 (e.g., the network sub-slice 3404 is shown as including the P-GW 3034 and the PCRF 3036).
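The slicing idea above, where a slice is a logical instantiation grouping some core network function instances, can be sketched as follows. The dictionary of function instances and the `make_slice` helper are hypothetical names for illustration only:

```python
# Hypothetical inventory of instantiated core network functions.
network_functions = {
    "HSS": "hss-0",
    "MME": "mme-0",
    "S-GW": "sgw-0",
    "P-GW": "pgw-0",
    "PCRF": "pcrf-0",
}

def make_slice(names):
    """Select a subset of core network function instances into one logical slice."""
    return {n: network_functions[n] for n in names}

# Mirroring the example in the text: a slice and a sub-slice.
slice_3402 = make_slice(["HSS", "MME", "S-GW"])
sub_slice_3404 = make_slice(["P-GW", "PCRF"])
```

Each slice references the same underlying physical nodes; only the logical grouping differs, which is what makes slices cheap to instantiate and tear down.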
In at least one embodiment, the NFV architecture and infrastructure can be used to virtualize one or more network functions, which could alternatively be performed by dedicated hardware, onto physical resources comprising a combination of industry-standard server hardware, storage hardware, or switches. In at least one embodiment, NFV systems may be used to execute virtual or reconfigurable implementations of one or more EPC components/functions.
In at least one embodiment, at least one component shown or described with respect to fig. 34 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 34 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 34 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 34 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 34 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Fig. 35 is a block diagram illustrating components of a system 3500 for supporting Network Function Virtualization (NFV) in accordance with at least one embodiment. In at least one embodiment, the system 3500 is shown to include a virtualized infrastructure manager (shown as VIM 3502), a network function virtualization infrastructure (shown as NFVI 3504), a VNF manager (shown as VNFM 3506), virtualized network functions (shown as VNF 3508), an element manager (shown as EM 3510), an NFV orchestrator (shown as NFVO 3512), and a network manager (shown as NM 3514).
In at least one embodiment, VIM 3502 manages the resources of NFVI 3504. In at least one embodiment, NFVI 3504 can include physical or virtual resources and applications (including hypervisors) for executing system 3500. In at least one embodiment, the VIM 3502 can utilize the NFVI 3504 to manage lifecycles of virtual resources (e.g., creation, maintenance, and tear down of Virtual Machines (VMs) associated with one or more physical resources), track VM instances, track performance, failures and security of VM instances and associated physical resources, and expose VM instances and associated physical resources to other management systems.
In at least one embodiment, the VNFM 3506 may manage the VNF 3508. In at least one embodiment, the VNF 3508 may be used to perform EPC components/functions. In at least one embodiment, the VNFM 3506 may manage the life cycle of the VNF 3508 and track performance, failure, and security of the virtual aspects of the VNF 3508. In at least one embodiment, the EM 3510 may track performance, faults, and security in functional aspects of the VNF 3508. In at least one embodiment, tracking data from VNFM 3506 and EM 3510 may include, for example, performance Measurement (PM) data used by VIM 3502 or NFVI 3504. In at least one embodiment, both VNFM 3506 and EM 3510 may scale up/down the number of VNFs of system 3500.
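The VNF lifecycle management and scale up/down behavior described above can be sketched with a toy manager. The class and its methods are hypothetical illustrations, not any real MANO API:

```python
class VNFManager:
    """Toy VNF manager: lifecycle plus scale-out/scale-in of VNF instances."""

    def __init__(self):
        self.instances = []  # each instance: {"name": ..., "state": ...}

    def instantiate(self, vnf_name):
        """Lifecycle: create one running instance of a VNF."""
        self.instances.append({"name": vnf_name, "state": "running"})

    def scale(self, vnf_name, target):
        """Scale the named VNF out or in to `target` instances."""
        current = [i for i in self.instances if i["name"] == vnf_name]
        for _ in range(target - len(current)):   # scale out (no-op if negative)
            self.instantiate(vnf_name)
        for extra in current[target:]:           # scale in surplus instances
            self.instances.remove(extra)

    def terminate_all(self, vnf_name):
        """Lifecycle: tear down every instance of the named VNF."""
        self.instances = [i for i in self.instances if i["name"] != vnf_name]
```

In a real deployment the scaling decision would be driven by the performance measurement (PM) data that the VNFM and EM collect, with the VIM allocating or releasing the underlying NFVI resources.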
In at least one embodiment, the NFVO 3512 can coordinate, authorize, release, and engage resources of the NFVI 3504 in order to provide a requested service (e.g., to execute an EPC function, component, or slice). In at least one embodiment, the NM 3514 may provide a package of end-user functions responsible for network management, which may include network elements with VNFs, non-virtualized network functions, or both (management of the VNFs may occur via the EM 3510).
In at least one embodiment, at least one component shown or described with respect to fig. 35 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 35 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 35 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 35 is for executing a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 35 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Computer-based system
The following figures set forth, but are not limited to, exemplary computer-based systems that can be used to implement at least one embodiment.
Fig. 36 illustrates a processing system 3600 in accordance with at least one embodiment. In at least one embodiment, the system 3600 includes one or more processors 3602 and one or more graphics processors 3608, and may be a single processor desktop system, a multiprocessor workstation system, or a server system having a large number of processors 3602 or processor cores 3607. In at least one embodiment, the processing system 3600 is a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for mobile, handheld, or embedded devices.
In at least one embodiment, the processing system 3600 may include or be incorporated in a server-based gaming platform, including a game console, a mobile game console, a handheld game console, or an online game console. In at least one embodiment, the processing system 3600 is a mobile phone, a smart phone, a tablet computing device, or a mobile internet device. In at least one embodiment, the processing system 3600 may also be coupled with, or integrated in, a wearable device, such as a smart watch wearable device, a smart eyewear device, an augmented reality device, or a virtual reality device. In at least one embodiment, the processing system 3600 is a television or set-top box device having one or more processors 3602 and a graphical interface generated by one or more graphics processors 3608.
In at least one embodiment, the one or more processors 3602 each include one or more processor cores 3607 to process instructions that, when executed, perform operations for system and user software. In at least one embodiment, each of the one or more processor cores 3607 is configured to process a particular instruction set 3609. In at least one embodiment, the instruction set 3609 may facilitate Complex Instruction Set Computing (CISC), reduced Instruction Set Computing (RISC), or computing by Very Long Instruction Words (VLIW). In at least one embodiment, the multiple processor cores 3607 may each process a different instruction set 3609, and the instruction set 3609 may include instructions that facilitate emulation of other instruction sets. In at least one embodiment, the processor core 3607 may also include other processing devices, such as a Digital Signal Processor (DSP).
In at least one embodiment, the processor 3602 includes a cache memory (cache) 3604. In at least one embodiment, the processor 3602 may have a single internal cache or multiple levels of internal cache. In at least one embodiment, the cache memory is shared among the various components of the processor 3602. In at least one embodiment, the processor 3602 also uses an external cache (e.g., a level three (L3) cache or Last Level Cache (LLC)) (not shown), which may be shared among the processor cores 3607 using known cache coherency techniques. In at least one embodiment, a register file 3606 is additionally included in the processor 3602, which may include different types of registers (e.g., integer registers, floating point registers, status registers, and instruction pointer registers) for storing different types of data. In at least one embodiment, the register file 3606 may include general purpose registers or other registers.
In at least one embodiment, one or more processors 3602 are coupled with one or more interface buses 3610 to transmit communication signals, such as address, data, or control signals, between the processors 3602 and other components in the system 3600. In at least one embodiment, the interface bus 3610 may be a processor bus, such as a version of a Direct Media Interface (DMI) bus. In at least one embodiment, the interface bus 3610 is not limited to a DMI bus and may include one or more peripheral component interconnect buses (e.g., PCI, PCI Express), memory buses, or other types of interface buses. In at least one embodiment, the processor 3602 includes an integrated memory controller 3616 and a platform controller hub 3630. In at least one embodiment, the memory controller 3616 facilitates communication between memory devices and other components of the processing system 3600, while the Platform Controller Hub (PCH) 3630 provides connections to input/output (I/O) devices via a local I/O bus.
In at least one embodiment, the memory device 3620 may be a Dynamic Random Access Memory (DRAM) device, a Static Random Access Memory (SRAM) device, a flash memory device, a phase change memory device, or some other memory device having suitable performance to serve as processor memory. In at least one embodiment, the memory device 3620 can operate as system memory for the processing system 3600 to store data 3622 and instructions 3621 for use when the one or more processors 3602 execute an application or process. In at least one embodiment, the memory controller 3616 is also coupled with an optional external graphics processor 3612, which may communicate with the one or more graphics processors 3608 of the processors 3602 to perform graphics and media operations. In at least one embodiment, a display device 3611 can be connected to the processor 3602. In at least one embodiment, the display device 3611 can include one or more of an internal display device, as in a mobile electronic device or portable computer device, or an external display device attached via a display interface (e.g., DisplayPort, etc.). In at least one embodiment, the display device 3611 can include a Head Mounted Display (HMD), such as a stereoscopic display device for use in Virtual Reality (VR) applications or Augmented Reality (AR) applications.
In at least one embodiment, platform controller hub 3630 enables peripheral devices to be connected to storage device 3620 and processor 3602 via a high-speed I/O bus. In at least one embodiment, I/O peripherals include, but are not limited to, an audio controller 3646, a network controller 3634, a firmware interface 3628, a wireless transceiver 3626, a touch sensor 3625, a data storage device 3624 (e.g., hard disk drive, flash memory, etc.). In at least one embodiment, data storage device 3624 can be connected via a memory interface (e.g., SATA) or via a peripheral bus, such as a peripheral component interconnect bus (e.g., PCI, PCIe). In at least one embodiment, the touch sensor 3625 may include a touch screen sensor, a pressure sensor, or a fingerprint sensor. In at least one embodiment, the wireless transceiver 3626 may be a Wi-Fi transceiver, a bluetooth transceiver, or a mobile network transceiver, such as a 3G, 4G, or Long Term Evolution (LTE) transceiver. In at least one embodiment, firmware interface 3628 enables communication with system firmware, for example, and may be a Unified Extensible Firmware Interface (UEFI). In at least one embodiment, network controller 3634 can enable network connections to a wired network. In at least one embodiment, a high performance network controller (not shown) is coupled to interface bus 3610. In at least one embodiment, audio controller 3646 is a multi-channel high definition audio controller. In at least one embodiment, the processing system 3600 includes an optional legacy (legacy) I/O controller 3640 for coupling legacy (e.g., personal System 2 (PS/2)) devices to the processing system 3600. In at least one embodiment, the platform controller hub 3630 may also be connected to one or more Universal Serial Bus (USB) controllers 3642 that connect input devices, such as a keyboard and mouse 3643 combination, a camera 3644, or other USB input devices.
In at least one embodiment, the memory controller 3616 and an instance of the platform controller hub 3630 can be integrated into a discrete external graphics processor, such as external graphics processor 3612. In at least one embodiment, the platform controller hub 3630 and/or the memory controller 3616 may be external to the one or more processors 3602. For example, in at least one embodiment, the processing system 3600 may include an external memory controller 3616 and a platform controller hub 3630, which may be configured as a memory controller hub and a peripheral controller hub in a system chipset in communication with the processor 3602.
In at least one embodiment, at least one component shown or described with respect to fig. 36 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 36 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 36 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 36 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 36 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Fig. 37 illustrates a computer system 3700 in accordance with at least one embodiment. In at least one embodiment, the computer system 3700 can be a system with interconnected devices and components, an SOC, or some combination thereof. In at least one embodiment, the computer system 3700 is formed with a processor 3702, which may include execution units to execute instructions. In at least one embodiment, the computer system 3700 can include, but is not limited to, a component, such as the processor 3702, employing execution units including logic to perform algorithms for processing data. In at least one embodiment, the computer system 3700 may include processors, such as the PENTIUM® processor family, Xeon™, XScale™ and/or StrongARM™, Intel® Core™, or Intel® Nervana™ microprocessors available from Intel Corporation of Santa Clara, California, although other systems (including PCs having other microprocessors, engineering workstations, set-top boxes, and the like) may also be used. In at least one embodiment, the computer system 3700 can execute a version of the WINDOWS operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (e.g., UNIX and Linux), embedded software, and/or graphical user interfaces may also be used.
In at least one embodiment, computer system 3700 can be used in other devices, such as handheld devices and embedded applications. Some examples of handheld devices include cellular telephones, internet protocol (Internet Protocol) devices, digital cameras, personal digital assistants ("PDAs"), and handheld PCs. In at least one embodiment, the embedded application may include a microcontroller, a digital signal processor ("DSP"), a SoC, a network computer ("NetPC"), a set-top box, a hub, a wide area network ("WAN") switch, or any other system that may execute one or more instructions in accordance with at least one embodiment.
In at least one embodiment, the computer system 3700 may include, but is not limited to, the processor 3702, which may include, but is not limited to, one or more execution units 3708 that may be configured to execute a Compute Unified Device Architecture ("CUDA") program (CUDA® is developed by NVIDIA Corporation of Santa Clara, California). In at least one embodiment, a CUDA program is at least a portion of a software application written in a CUDA programming language. In at least one embodiment, the computer system 3700 is a single processor desktop or server system. In at least one embodiment, the computer system 3700 can be a multiprocessor system. In at least one embodiment, the processor 3702 may include, but is not limited to, a CISC microprocessor, a RISC microprocessor, a VLIW microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as, for example, a digital signal processor. In at least one embodiment, the processor 3702 may be coupled to a processor bus 3710, and the processor bus 3710 may transmit data signals between the processor 3702 and other components in the computer system 3700.
In at least one embodiment, the processor 3702 may include, but is not limited to, a level 1 ("L1") internal cache memory ("cache") 3704. In at least one embodiment, the processor 3702 may have a single internal cache or multiple levels of internal caches. In at least one embodiment, the cache memory may reside external to the processor 3702. In at least one embodiment, the processor 3702 may include a combination of internal and external caches. In at least one embodiment, register file 3706 may store different types of data in various registers, including but not limited to integer registers, floating point registers, status registers, and instruction pointer registers.
In at least one embodiment, execution units 3708, including but not limited to logic to perform integer and floating point operations, are also located in the processor 3702. The processor 3702 may also include a microcode ("ucode") read-only memory ("ROM") that stores microcode for certain macroinstructions. In at least one embodiment, execution unit 3708 can include logic to process a packed instruction set 3709. In at least one embodiment, by including the packed instruction set 3709 in the instruction set of the general purpose processor 3702, along with associated circuitry to execute the instructions, the operations used by many multimedia applications may be performed using packed data in the general purpose processor 3702. In at least one embodiment, many multimedia applications may be accelerated and executed more efficiently by performing operations on packed data using the full width of the processor's data bus, which may eliminate the need to transfer smaller units of data across the processor's data bus to perform one or more operations one data element at a time.
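The packed-data idea described above can be illustrated with a short sketch. This is a pure-Python model of the semantics only (real packed instructions operate on processor registers, not Python lists, and all names here are illustrative): one "wide" operation applies the same arithmetic to several data elements at once, rather than one element at a time.

```python
# Illustrative model of packed (SIMD) operation semantics: a single
# logical operation is applied across a whole vector of elements,
# instead of issuing one operation per element.

def packed_add(a, b):
    """One logical operation over a whole packed vector of elements."""
    assert len(a) == len(b)
    return [x + y for x, y in zip(a, b)]

# Four samples processed by a single packed operation, as a multimedia
# workload might do with pixel or audio data.
left = [100, 200, 300, 400]
right = [1, 2, 3, 4]
out = packed_add(left, right)  # -> [101, 202, 303, 404]
```

The benefit described in the text comes from the hardware performing all four additions in one instruction over the full data-bus width, rather than four separate narrow transfers and additions.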
In at least one embodiment, execution unit 3708 may also be used in microcontrollers, embedded processors, graphics devices, DSPs, and other types of logic circuits. In at least one embodiment, computer system 3700 can include, but is not limited to, memory 3720. In at least one embodiment, memory 3720 may be implemented as a DRAM device, an SRAM device, a flash memory device, or other storage device. Memory 3720 may store instructions 3719 and/or data 3721 represented by data signals that may be executed by processor 3702.
In at least one embodiment, a system logic chip may be coupled to processor bus 3710 and memory 3720. In at least one embodiment, the system logic chip may include, but is not limited to, a memory controller hub ("MCH") 3716, and the processor 3702 may communicate with the MCH 3716 via the processor bus 3710. In at least one embodiment, the MCH 3716 may provide a high bandwidth memory path 3718 to memory 3720 for instruction and data storage as well as for storage of graphics commands, data, and textures. In at least one embodiment, the MCH 3716 may direct data signals between the processor 3702, the memory 3720, and other components in the computer system 3700, and bridge data signals between the processor bus 3710, the memory 3720, and the system I/O 3722. In at least one embodiment, the system logic chip may provide a graphics port for coupling to a graphics controller. In at least one embodiment, MCH 3716 may be coupled to memory 3720 through the high bandwidth memory path 3718, and graphics/video card 3712 may be coupled to MCH 3716 through an Accelerated Graphics Port ("AGP") interconnect 3714.
In at least one embodiment, the computer system 3700 may couple the MCH 3716 to an I/O controller hub ("ICH") 3730 using the system I/O 3722 as a proprietary hub interface bus. In at least one embodiment, ICH 3730 may provide a direct connection to certain I/O devices through a local I/O bus. In at least one embodiment, the local I/O bus may include, but is not limited to, a high-speed I/O bus for connecting peripheral devices to memory 3720, the chipset, and processor 3702. Examples may include, but are not limited to, an audio controller 3729, a firmware hub ("Flash BIOS") 3728, a wireless transceiver 3726, a data store 3724, a legacy I/O controller 3723 containing user input and keyboard interfaces 3725, a serial expansion port 3727 (e.g., USB), and a network controller 3734. Data storage 3724 may include a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or other mass storage device.
In at least one embodiment, FIG. 37 illustrates a system including interconnected hardware devices or "chips". In at least one embodiment, FIG. 37 may illustrate an exemplary SoC. In at least one embodiment, the devices shown in FIG. 37 may be interconnected with a proprietary interconnect, a standardized interconnect (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of system 3700 are interconnected using a Compute Express Link ("CXL") interconnect.
In at least one embodiment, at least one component shown or described with respect to fig. 37 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 37 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 37 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 37 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 37 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
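The layered API pattern described above, in which a first API identifies one or more software workloads and selects a second API to execute, monitor, or terminate them, can be sketched as follows. All names here are hypothetical and purely illustrative; they are not APIs defined by the patent or any particular library.

```python
# Hypothetical sketch: a "first API" identifies software workloads and
# selects a "second API" (execute, monitor, or terminate) to apply to them.

workloads = {}  # workload id -> state ("running" or "stopped")

# Second APIs: each performs one operation on a single workload.
def execute_workload(wl_id):
    workloads[wl_id] = "running"

def monitor_workload(wl_id):
    return workloads.get(wl_id, "unknown")

def terminate_workload(wl_id):
    workloads[wl_id] = "stopped"

# First API: selects which second API to invoke for the identified workloads.
_SECOND_APIS = {
    "execute": execute_workload,
    "monitor": monitor_workload,
    "terminate": terminate_workload,
}

def workload_api(operation, wl_ids):
    second_api = _SECOND_APIS[operation]  # selection of the second API
    return [second_api(wl_id) for wl_id in wl_ids]

workload_api("execute", ["wl0", "wl1"])
workload_api("terminate", ["wl1"])
states = workload_api("monitor", ["wl0", "wl1"])
```

The selection step (`_SECOND_APIS[operation]`) is the key idea: the caller interacts only with the first API, which routes the identified workloads to whichever second API performs the requested operation.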
Fig. 38 illustrates a system 3800 in accordance with at least one embodiment. In at least one embodiment, the system 3800 is an electronic device that utilizes a processor 3810. In at least one embodiment, system 3800 can be, for example, but not limited to, a notebook computer, a tower server, a rack server, a blade server, a laptop computer, a desktop computer, a tablet computer, a mobile device, a telephone, an embedded computer, or any other suitable electronic device.
In at least one embodiment, the system 3800 can include, but is not limited to, a processor 3810 communicatively coupled to any suitable number or variety of components, peripheral devices, modules, or devices. In at least one embodiment, processor 3810 is coupled using a bus or interface, such as an I²C bus, a system management bus ("SMBus"), a Low Pin Count ("LPC") bus, a serial peripheral interface ("SPI"), a high definition audio ("HDA") bus, a serial advanced technology attachment ("SATA") bus, a USB (versions 1, 2, or 3) bus, or a universal asynchronous receiver/transmitter ("UART") bus. In at least one embodiment, FIG. 38 illustrates a system that includes interconnected hardware devices or "chips". In at least one embodiment, FIG. 38 may illustrate an exemplary SoC. In at least one embodiment, the devices shown in FIG. 38 may be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of FIG. 38 are interconnected using a Compute Express Link ("CXL") interconnect.
In at least one embodiment, FIG. 38 may include a display 3824, a touch screen 3825, a touch pad 3830, a near field communication unit ("NFC") 3845, a sensor hub 3840, a thermal sensor 3839, an embedded controller ("EC") 3835, a trusted platform module ("TPM") 3838, BIOS/firmware/flash memory ("BIOS, FW Flash") 3832, a DSP 3860, a solid state disk ("SSD") or hard disk drive ("HDD") 3820, a wireless local area network unit ("WLAN") 3850, a Bluetooth unit 3852, a wireless wide area network unit ("WWAN") 3856, a Global Positioning System ("GPS") 3855, a camera 3854 (e.g., a USB 3.0 camera), or a low power double data rate ("LPDDR") memory unit 3815 implemented, for example, according to the LPDDR3 standard. These components may each be implemented in any suitable manner.
In at least one embodiment, other components may be communicatively coupled to the processor 3810 through the components discussed above. In at least one embodiment, an accelerometer 3841, an ambient light sensor ("ALS") 3842, a compass 3843, and a gyroscope 3844 may be communicatively coupled to the sensor hub 3840. In at least one embodiment, the thermal sensor 3839, the fan 3837, the keyboard 3836, and the touch pad 3830 can be communicatively coupled to the EC 3835. In at least one embodiment, speakers 3863, headphones 3864, and a microphone ("mic") 3865 can be communicatively coupled to an audio unit 3862, which in turn can be communicatively coupled to the DSP 3860. In at least one embodiment, the audio unit 3862 may include, but is not limited to, an audio encoder/decoder ("codec") and a class D amplifier. In at least one embodiment, a SIM card ("SIM") 3857 may be communicatively coupled to the WWAN unit 3856. In at least one embodiment, components such as the WLAN unit 3850, the Bluetooth unit 3852, and the WWAN unit 3856 may be implemented in a Next Generation Form Factor ("NGFF").
In at least one embodiment, at least one component shown or described with respect to fig. 38 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 38 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 38 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 38 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 38 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 39 illustrates an exemplary integrated circuit 3900 in accordance with at least one embodiment. In at least one embodiment, the exemplary integrated circuit 3900 is a SoC that can be fabricated using one or more IP cores. In at least one embodiment, the integrated circuit 3900 includes one or more application processors 3905 (e.g., CPUs), at least one graphics processor 3910, and may additionally include an image processor 3915 and/or a video processor 3920, any of which may be a modular IP core. In at least one embodiment, integrated circuit 3900 includes peripheral or bus logic including a USB controller 3925, a UART controller 3930, an SPI/SDIO controller 3935, and an I²S/I²C controller 3940. In at least one embodiment, the integrated circuit 3900 can include a display device 3945 coupled to one or more of a High Definition Multimedia Interface ("HDMI") controller 3950 and a Mobile Industry Processor Interface ("MIPI") display interface 3955. In at least one embodiment, storage may be provided by a flash memory subsystem 3960, including flash memory and a flash memory controller. In at least one embodiment, a memory interface may be provided via a memory controller 3965 for accessing SDRAM or SRAM memory devices. In at least one embodiment, some integrated circuits further include an embedded security engine 3970.
In at least one embodiment, at least one component shown or described with respect to fig. 39 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 39 is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 39 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 39 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 39 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 40 illustrates a computing system 4000 in accordance with at least one embodiment. In at least one embodiment, the computing system 4000 includes a processing subsystem 4001 having one or more processors 4002 and a system memory 4004 that communicate via an interconnection path that may include a memory hub 4005. In at least one embodiment, the memory hub 4005 may be a separate component within a chipset component or may be integrated within one or more processors 4002. In at least one embodiment, the memory hub 4005 is coupled to the I/O subsystem 4011 through a communications link 4006. In at least one embodiment, the I/O subsystem 4011 includes an I/O hub 4007, which can enable the computing system 4000 to receive input from one or more input devices 4008. In at least one embodiment, the I/O hub 4007 can enable a display controller, included in the one or more processors 4002, to provide output to the one or more display devices 4010A. In at least one embodiment, the one or more display devices 4010A coupled to the I/O hub 4007 can comprise local, internal, or embedded display devices.
In at least one embodiment, the processing subsystem 4001 includes one or more parallel processors 4012 coupled to the memory hub 4005 via a bus or other communication link 4013. In at least one embodiment, the communication link 4013 may be one of any number of standards-based communication link technologies or protocols, such as, but not limited to, PCIe, or may be a vendor-specific communications interface or communications fabric. In at least one embodiment, the one or more parallel processors 4012 form a computationally focused parallel or vector processing system that may include a large number of processing cores and/or processing clusters, such as a Many Integrated Core ("MIC") processor. In at least one embodiment, the one or more parallel processors 4012 form a graphics processing subsystem that can output pixels to one of the one or more display devices 4010A coupled via the I/O hub 4007. In at least one embodiment, the one or more parallel processors 4012 can also include a display controller and a display interface (not shown) to enable direct connection to one or more display devices 4010B.
In at least one embodiment, a system memory unit 4014 may be connected to the I/O hub 4007 to provide a storage mechanism for computing system 4000. In at least one embodiment, an I/O switch 4016 can be used to provide an interface mechanism to enable connections between the I/O hub 4007 and other components, such as a network adapter 4018 and/or a wireless network adapter 4019 that may be integrated into the platform, and various other devices that can be added via one or more add-in devices 4020. In at least one embodiment, the network adapter 4018 can be an Ethernet adapter or another wired network adapter. In at least one embodiment, the wireless network adapter 4019 can include one or more of a Wi-Fi, Bluetooth, NFC, or other network device that includes one or more radios.
In at least one embodiment, computing system 4000 may include other components not explicitly shown, including USB or other port connections, optical storage drives, video capture devices, and the like, which may also be connected to the I/O hub 4007. In at least one embodiment, the communication paths interconnecting the various components in FIG. 40 may be implemented using any suitable protocols, such as PCI (Peripheral Component Interconnect)-based protocols (e.g., PCIe), or other bus or point-to-point communication interfaces and/or protocols, such as the NVLink high-speed interconnect or interconnect protocols.
In at least one embodiment, the one or more parallel processors 4012 comprise circuitry optimized for graphics and video processing (including video output circuitry in at least one embodiment), and constitute a Graphics Processing Unit (GPU). In at least one embodiment, one or more of the parallel processors 4012 comprises circuitry optimized for general purpose processing. In at least one embodiment, components of computing system 4000 may be integrated with one or more other system elements on a single integrated circuit. For example, in at least one embodiment, one or more of the parallel processor 4012, the memory hub 4005, the processor 4002 and the I/O hub 4007 may be integrated into a system on chip (SoC) integrated circuit. In at least one embodiment, the components of computing system 4000 may be integrated into a single package to form a System In Package (SIP) configuration. In at least one embodiment, at least a portion of the components of computing system 4000 may be integrated into a multi-chip module (MCM) that may be interconnected with other multi-chip modules into a modular computing system. In at least one embodiment, the I/O subsystem 4011 and display device 4010B are omitted from computing system 4000.
In at least one embodiment, at least one component shown or described with respect to fig. 40 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 40 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 40 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 40 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 40 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Processing system
The following figures illustrate exemplary processing systems that may be used to implement at least one embodiment.
FIG. 41 illustrates an accelerated processing unit ("APU") 4100 in accordance with at least one embodiment. In at least one embodiment, APU 4100 is developed by AMD Corporation of Santa Clara, California. In at least one embodiment, APU 4100 can be configured to execute an application, such as a CUDA program. In at least one embodiment, APU 4100 includes, but is not limited to, a core complex 4110, a graphics complex 4140, a fabric 4160, an I/O interface 4170, a memory controller 4180, a display controller 4192, and a multimedia engine 4194. In at least one embodiment, APU 4100 can include, but is not limited to, any combination of any number of core complexes 4110, any number of graphics complexes 4140, any number of display controllers 4192, and any number of multimedia engines 4194. For explanatory purposes, multiple instances of similar objects are denoted herein with reference numerals identifying the object and parenthetical numerals identifying the needed instance.
In at least one embodiment, core complex 4110 is a CPU, graphics complex 4140 is a GPU, and APU 4100 is a processing unit that integrates, without limitation, core complex 4110 and graphics complex 4140 onto a single chip. In at least one embodiment, some tasks may be assigned to core complex 4110, while other tasks may be assigned to graphics complex 4140. In at least one embodiment, core complex 4110 is configured to execute main control software associated with APU 4100, such as an operating system. In at least one embodiment, core complex 4110 is the main processor of APU 4100, which controls and coordinates the operation of the other processors. In at least one embodiment, the core complex 4110 issues commands that control the operation of the graphics complex 4140. In at least one embodiment, core complex 4110 can be configured to execute host executable code derived from CUDA source code, and graphics complex 4140 can be configured to execute device executable code derived from CUDA source code.
In at least one embodiment, core complex 4110 includes, but is not limited to, cores 4120 (1) -4120 (4) and an L3 cache 4130. In at least one embodiment, core complex 4110 may include, but is not limited to, any combination of any number of cores 4120 and any number and type of caches. In at least one embodiment, core 4120 is configured to execute instructions of a particular instruction set architecture ("ISA"). In at least one embodiment, each core 4120 is a CPU core.
In at least one embodiment, each core 4120 includes, but is not limited to, a fetch/decode unit 4122, an integer execution engine 4124, a floating point execution engine 4126, and an L2 cache 4128. In at least one embodiment, the fetch/decode unit 4122 fetches instructions, decodes the instructions, generates micro-operations, and dispatches separate microinstructions to the integer execution engine 4124 and the floating point execution engine 4126. In at least one embodiment, the fetch/decode unit 4122 may dispatch one microinstruction to the integer execution engine 4124 and another microinstruction to the floating point execution engine 4126 simultaneously. In at least one embodiment, the integer execution engine 4124 performs operations including, but not limited to, integer and memory operations. In at least one embodiment, the floating point execution engine 4126 performs operations including, but not limited to, floating point and vector operations. In at least one embodiment, the fetch/decode unit 4122 dispatches microinstructions to a single execution engine that replaces both the integer execution engine 4124 and the floating point execution engine 4126.
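The fetch/decode dispatch described above can be modeled with a short sketch. This is an illustrative software model only (names, the instruction format, and the crude "f-prefix means floating point" classification are all assumptions for the sketch, not a description of any real decoder): decoded micro-operations are routed either to the integer engine or to the floating point engine based on their type.

```python
# Illustrative model of a fetch/decode unit routing micro-operations to
# an integer execution engine or a floating point execution engine.

def decode(instruction):
    """Decode an instruction into (engine, op, args); classification is a
    toy heuristic: mnemonics starting with 'f' go to the FP engine."""
    op, *args = instruction
    engine = "fp" if op.startswith("f") else "int"
    return engine, op, args

def dispatch(instructions):
    """Dispatch decoded micro-operations to per-engine queues."""
    queues = {"int": [], "fp": []}
    for ins in instructions:
        engine, op, args = decode(ins)
        queues[engine].append((op, args))
    return queues

program = [("add", 1, 2), ("fmul", 1.5, 2.0), ("load", "r1"), ("fadd", 0.5, 0.25)]
queues = dispatch(program)
```

Because the two queues are independent, the model reflects the text's point that one microinstruction can go to the integer engine while another goes to the floating point engine at the same time.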
In at least one embodiment, each core 4120 (i) may access an L2 cache 4128 (i) included in the core 4120 (i), where i is an integer representing a particular instance of the core 4120. In at least one embodiment, each core 4120 included in core complex 4110 (j) is connected to other cores 4120 included in core complex 4110 (j) via an L3 cache 4130 (j) included in core complex 4110 (j), where j is an integer representing a specific instance of core complex 4110. In at least one embodiment, the core 4120 included in the core complex 4110 (j) may access all L3 caches 4130 (j) included in the core complex 4110 (j), where j is an integer representing a particular instance of the core complex 4110. In at least one embodiment, the L3 cache 4130 may include, but is not limited to, any number of slices.
In at least one embodiment, the graphics complex 4140 may be configured to perform computing operations in a highly parallel manner. In at least one embodiment, the graphics complex 4140 is configured to perform graphics pipeline operations such as drawing commands, pixel operations, geometric calculations, and other operations associated with rendering images to a display. In at least one embodiment, the graphics complex 4140 is configured to perform graphics-independent operations. In at least one embodiment, the graphics complex 4140 is configured to perform both graphics-related and graphics-unrelated operations.
In at least one embodiment, the graphics complex 4140 includes, but is not limited to, any number of computing units 4150 and L2 caches 4142. In at least one embodiment, the computing unit 4150 shares the L2 cache 4142. In at least one embodiment, the L2 cache 4142 is partitioned. In at least one embodiment, the graphics complex 4140 includes, but is not limited to, any number of computing units 4150 and any number (including zero) and type of caches. In at least one embodiment, the graphics complex 4140 includes, but is not limited to, any number of dedicated graphics hardware.
In at least one embodiment, each computing unit 4150 includes, but is not limited to, any number of SIMD units 4152 and a shared memory 4154. In at least one embodiment, each SIMD unit 4152 implements a SIMD architecture and is configured to perform operations in parallel. In at least one embodiment, each computing unit 4150 may execute any number of thread blocks, but each thread block executes on a single computing unit 4150. In at least one embodiment, a thread block includes, but is not limited to, any number of threads of execution. In at least one embodiment, a workgroup is a thread block. In at least one embodiment, each SIMD unit 4152 executes a different thread bundle (warp). In at least one embodiment, a thread bundle is a group of threads (e.g., 16 threads), where each thread in the thread bundle belongs to a single thread block and is configured to process a different set of data based on a single set of instructions. In at least one embodiment, predication may be used to disable one or more threads in a thread bundle. In at least one embodiment, a lane is a thread. In at least one embodiment, a work item is a thread. In at least one embodiment, a wavefront is a thread bundle. In at least one embodiment, different wavefronts in a thread block may synchronize with one another and communicate via the shared memory 4154.
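The thread-bundle (warp/wavefront) execution described above can be sketched with a pure-Python model. Sizes and names are illustrative only (the 16-thread bundle size follows the example in the text; real hardware uses fixed register-file-backed lanes, not Python lists): each bundle applies a single instruction across its threads, and a predication mask can disable individual threads (lanes).

```python
# Pure-Python model of thread-bundle (warp/wavefront) execution: one
# instruction is applied across all enabled threads of a bundle, and a
# thread block is executed bundle by bundle.

BUNDLE_SIZE = 16  # the text gives 16 threads per bundle as an example

def run_bundle(instruction, data, mask):
    """Apply one instruction to every enabled thread (lane) in a bundle;
    predicated-off lanes leave their data unchanged."""
    return [instruction(x) if enabled else x
            for x, enabled in zip(data, mask)]

def run_thread_block(instruction, block_data, mask_fn):
    """Split a thread block into bundles and execute them bundle by bundle."""
    out = []
    for start in range(0, len(block_data), BUNDLE_SIZE):
        bundle = block_data[start:start + BUNDLE_SIZE]
        mask = [mask_fn(start + i) for i in range(len(bundle))]
        out.extend(run_bundle(instruction, bundle, mask))
    return out

# A 32-thread block becomes two bundles; predication disables odd threads.
data = list(range(32))
result = run_thread_block(lambda x: x * 2, data, lambda tid: tid % 2 == 0)
```

In this model every thread of a bundle executes the same instruction on its own data element, matching the text's description of a single instruction set applied over different data sets, with predication masking off individual threads.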
In at least one embodiment, the fabric 4160 is a system interconnect that facilitates data and control transmissions across the core complex 4110, the graphics complex 4140, the I/O interface 4170, the memory controller 4180, the display controller 4192, and the multimedia engine 4194. In at least one embodiment, in addition to or instead of the fabric 4160, APU 4100 may include any number and type of system interconnects that facilitate data and control transmissions across any number and type of directly or indirectly linked components that may be internal or external to APU 4100. In at least one embodiment, I/O interface 4170 represents any number and type of I/O interfaces (e.g., PCI, PCI-Extended ("PCI-X"), PCIe, gigabit Ethernet ("GBE"), USB, and the like). In at least one embodiment, various types of peripheral devices are coupled to the I/O interface 4170. In at least one embodiment, the peripheral devices coupled to the I/O interface 4170 may include, but are not limited to, a keyboard, a mouse, a printer, a scanner, a joystick or other type of game controller, a media recording device, an external storage device, a network interface card, and the like.
In at least one embodiment, the display controller 4192 displays images on one or more display devices, such as a liquid crystal display ("LCD") device. In at least one embodiment, the multimedia engine 4194 includes, but is not limited to, any number and type of multimedia-related circuits, such as a video decoder, a video encoder, an image signal processor, and the like. In at least one embodiment, the memory controller 4180 facilitates data transfers between APU 4100 and a unified system memory 4190. In at least one embodiment, the core complex 4110 and the graphics complex 4140 share the unified system memory 4190.
In at least one embodiment, APU 4100 implements a memory subsystem including, but not limited to, any number and type of memory controllers 4180 and memory devices (e.g., shared memory 4154) that may be dedicated to one component or shared among multiple components. In at least one embodiment, APU 4100 implements a cache subsystem including, but not limited to, one or more cache memories (e.g., L2 caches 4128, L3 cache 4130, and L2 cache 4142), each of which may be private to a component or shared among any number of components (e.g., core 4120, core complex 4110, SIMD unit 4152, computing unit 4150, and graphics complex 4140).
In at least one embodiment, at least one component shown or described with respect to fig. 41 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 41 is configured to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 41 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 41 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 41 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 42 illustrates a CPU 4200 in accordance with at least one embodiment. In at least one embodiment, CPU 4200 is developed by AMD Corporation of Santa Clara, California. In at least one embodiment, the CPU 4200 may be configured to execute applications. In at least one embodiment, the CPU 4200 is configured to execute main control software, such as an operating system. In at least one embodiment, the CPU 4200 issues commands that control the operation of an external GPU (not shown). In at least one embodiment, the CPU 4200 can be configured to execute host executable code derived from CUDA source code, and the external GPU can be configured to execute device executable code derived from such CUDA source code. In at least one embodiment, the CPU 4200 includes, but is not limited to, any number of core complexes 4210, a fabric 4260, an I/O interface 4270, and a memory controller 4280.
In at least one embodiment, the core complex 4210 includes, but is not limited to, cores 4220 (1) -4220 (4) and an L3 cache 4230. In at least one embodiment, the core complex 4210 may include, but is not limited to, any combination of any number of cores 4220 and any number and type of caches. In at least one embodiment, core 4220 is configured to execute instructions of a particular ISA. In at least one embodiment, each core 4220 is a CPU core.
In at least one embodiment, each core 4220 includes, but is not limited to, a fetch/decode unit 4222, an integer execution engine 4224, a floating point execution engine 4226, and an L2 cache 4228. In at least one embodiment, the fetch/decode unit 4222 fetches instructions, decodes the instructions, generates micro-operations, and dispatches separate microinstructions to the integer execution engine 4224 and the floating point execution engine 4226. In at least one embodiment, the fetch/decode unit 4222 may dispatch one microinstruction to the integer execution engine 4224 and another microinstruction to the floating point execution engine 4226 simultaneously. In at least one embodiment, the integer execution engine 4224 performs operations including, but not limited to, integer and memory operations. In at least one embodiment, the floating point execution engine 4226 performs operations including, but not limited to, floating point and vector operations. In at least one embodiment, the fetch/decode unit 4222 dispatches microinstructions to a single execution engine that replaces both the integer execution engine 4224 and the floating point execution engine 4226.
In at least one embodiment, each core 4220 (i) may access an L2 cache 4228 (i) included in the core 4220 (i), where i is an integer representing a particular instance of the core 4220. In at least one embodiment, each core 4220 included in the core complex 4210 (j) is connected to other cores 4220 included in the core complex 4210 (j) via an L3 cache 4230 (j) in the core complex 4210 (j), where j is an integer representing a particular instance of the core complex 4210. In at least one embodiment, the cores 4220 included in the core complex 4210 (j) may access all L3 caches 4230 (j) included in the core complex 4210 (j), where j is an integer representing a particular instance of the core complex 4210. In at least one embodiment, the L3 cache 4230 may include, but is not limited to, any number of slices.
In at least one embodiment, the fabric 4260 is a system interconnect that facilitates data and control transfer across the core complexes 4210 (1) -4210 (N) (where N is an integer greater than 0), the I/O interface 4270, and the memory controller 4280. In at least one embodiment, in addition to or in lieu of the fabric 4260, the CPU 4200 may include any number and type of system interconnects, each of which facilitates data and control transfer across any number and type of directly or indirectly linked components that may be internal or external to the CPU 4200. In at least one embodiment, I/O interface 4270 represents any number and type of I/O interfaces (e.g., PCI-X, PCIe, GBE, USB, etc.). In at least one embodiment, various types of peripheral devices are coupled to the I/O interface 4270. In at least one embodiment, the peripheral devices coupled to the I/O interface 4270 may include, but are not limited to, a display, keyboard, mouse, printer, scanner, joystick or other type of game controller, media recording device, external storage device, network interface card, and the like.
In at least one embodiment, memory controller 4280 facilitates data transfer between CPU 4200 and system memory 4290. In at least one embodiment, the core complexes 4210 share system memory 4290. In at least one embodiment, CPU 4200 implements a memory subsystem including, but not limited to, any number and type of memory controllers 4280 and memory devices that may be dedicated to one component or shared among multiple components. In at least one embodiment, CPU 4200 implements a cache subsystem that includes, but is not limited to, one or more cache memories (e.g., L2 cache 4228 and L3 cache 4230), each of which may be private to a component or shared among any number of components (e.g., core 4220 and core complex 4210).
In at least one embodiment, at least one component shown or described with respect to fig. 42 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 42 is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 42 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 42 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 42 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 43 illustrates an exemplary accelerator integrated slice 4390 in accordance with at least one embodiment. As used herein, a "slice" includes a specified portion of the processing resources of the accelerator integrated circuit. In at least one embodiment, the accelerator integrated circuit provides cache management, memory access, context management, and interrupt management services on behalf of a plurality of graphics processing engines of a plurality of graphics acceleration modules. The graphics processing engines may each include a separate GPU. Alternatively, the graphics processing engines may include different types of graphics processing engines within a GPU, such as a graphics execution unit, a media processing engine (e.g., a video encoder/decoder), a sampler, and a blit engine. In at least one embodiment, the graphics acceleration module may be a GPU having multiple graphics processing engines. In at least one embodiment, the graphics processing engines may be respective GPUs integrated on a common package, line card, or chip.
Application effective address space 4382 within system memory 4314 stores process elements 4383. In one embodiment, the process elements 4383 are stored in response to GPU calls 4381 from an application 4380 executing on the processor 4307. The process element 4383 contains the processing state of the corresponding application 4380. The Work Descriptor (WD) 4384 contained in the process element 4383 may be a single job requested by the application or may contain a pointer to a job queue. In at least one embodiment, WD 4384 is a pointer to a job request queue in application effective address space 4382.
The graphics acceleration module 4346 and/or various graphics processing engines may be shared by all or part of the processes in the system. In at least one embodiment, an infrastructure for establishing processing state and sending WD 4384 to graphics acceleration module 4346 to begin jobs in a virtualized environment may be included.
In at least one embodiment, the dedicated-process programming model is implementation-specific. In this model, a single process owns the graphics acceleration module 4346 or an individual graphics processing engine. Because the graphics acceleration module 4346 is owned by a single process, the hypervisor initializes the accelerator integrated circuit for the owning partition, and the operating system initializes the accelerator integrated circuit for the owning process, at the time the graphics acceleration module 4346 is assigned.
In operation, the WD fetch unit 4391 in the accelerator integrated slice 4390 fetches the next WD 4384, which includes an indication of the work to be done by one or more graphics processing engines of the graphics acceleration module 4346. Data from WD 4384 may be stored in registers 4345 for use by Memory Management Unit (MMU) 4339, interrupt management circuit 4347, and/or context management circuit 4348, as shown. For example, MMU 4339 includes segment/page table walk circuitry for accessing segment/page tables 4386 within OS virtual address space 4385. Interrupt management circuit 4347 may process interrupt events (INT) 4392 received from graphics acceleration module 4346. When performing graphics operations, an effective address 4393 generated by a graphics processing engine is translated to a real address by the MMU 4339.
In one embodiment, the same register set 4345 is replicated for each graphics processing engine and/or graphics acceleration module 4346, and may be initialized by a hypervisor or operating system. Each of these replicated registers may be contained in accelerator integrated slice 4390. An exemplary register that may be initialized by the hypervisor is shown in Table 1.
TABLE 1 Registers initialized by the hypervisor
1 Slice Control Register
2 Real Address (RA) Scheduled Processes Area Pointer
3 Authority Mask Override Register
4 Interrupt Vector Table Entry Offset
5 Interrupt Vector Table Entry Limit
6 State Register
7 Logical Partition ID
8 Real Address (RA) Hypervisor Accelerator Utilization Record Pointer
9 Storage Description Register
An exemplary register that may be initialized by the operating system is shown in Table 2.
TABLE 2 Registers initialized by the operating system
In one embodiment, each WD 4384 is specific to a particular graphics acceleration module 4346 and/or a particular graphics processing engine. It contains all of the information needed by a graphics processing engine to do its work, or it may be a pointer to a memory location where the application has set up a command queue of work to be completed.
In at least one embodiment, at least one component shown or described with respect to fig. 43 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 43 is operable to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 43 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 43 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 43 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Fig. 44A and 44B illustrate exemplary graphics processors in accordance with at least one embodiment. In at least one embodiment, any of the exemplary graphics processors may be manufactured using one or more IP cores. In at least one embodiment, other logic and circuitry may be included in addition to that illustrated, including additional graphics processors/cores, peripheral interface controllers, or general-purpose processor cores. In at least one embodiment, an exemplary graphics processor is used within a SoC.
Fig. 44A illustrates an exemplary graphics processor 4410 of an SoC integrated circuit that may be fabricated using one or more IP cores in accordance with at least one embodiment. Fig. 44B illustrates an additional exemplary graphics processor 4440 of a SoC integrated circuit that may be fabricated using one or more IP cores in accordance with at least one embodiment. In at least one embodiment, graphics processor 4410 of FIG. 44A is a low power graphics processor core. In at least one embodiment, graphics processor 4440 of FIG. 44B is a higher performance graphics processor core. In at least one embodiment, each of graphics processors 4410, 4440 may be a variation of graphics processors such as those described herein.
In at least one embodiment, graphics processor 4410 includes a vertex processor 4405 and one or more fragment processors 4415A-4415N (e.g., 4415A, 4415B, 4415C, 4415D through 4415N-1 and 4415N). In at least one embodiment, graphics processor 4410 may execute different shader programs via separate logic such that vertex processor 4405 is optimized to perform operations for vertex shader programs, while the one or more fragment processors 4415A-4415N perform fragment (e.g., pixel) shading operations for fragment or pixel shader programs. In at least one embodiment, vertex processor 4405 performs the vertex processing stage of the 3D graphics pipeline and generates primitives and vertex data. In at least one embodiment, fragment processors 4415A-4415N use the primitives and vertex data generated by vertex processor 4405 to produce a frame buffer that is displayed on a display device. In at least one embodiment, the fragment processors 4415A-4415N are optimized to execute fragment shader programs as provided in the OpenGL API, which may be used to perform operations similar to the pixel shader programs provided in the Direct3D API.
In at least one embodiment, graphics processor 4410 additionally includes one or more MMUs 4420A-4420B, caches 4425A-4425B, and circuit interconnects 4430A-4430B. In at least one embodiment, the one or more MMUs 4420A-4420B provide virtual-to-physical address mapping for graphics processor 4410, including for vertex processor 4405 and/or fragment processors 4415A-4415N, which may reference vertex or image/texture data stored in memory in addition to vertex or image/texture data stored in the one or more caches 4425A-4425B. In at least one embodiment, the one or more MMUs 4420A-4420B may be synchronized with other MMUs within the system, including one or more MMUs associated with one or more application processors, image processors, and/or video processors (such as those described herein), such that each processor can participate in a shared or unified virtual memory system. In at least one embodiment, the one or more circuit interconnects 4430A-4430B enable graphics processor 4410 to connect with other IP cores within the SoC via an internal bus of the SoC or via a direct connection.
In at least one embodiment, graphics processor 4440 includes the one or more MMUs 4420A-4420B, caches 4425A-4425B, and circuit interconnects 4430A-4430B of graphics processor 4410 of FIG. 44A. In at least one embodiment, graphics processor 4440 includes one or more shader cores 4455A-4455N (e.g., 4455A, 4455B, 4455C, 4455D, 4455E, 4455F, through 4455N-1 and 4455N) that provide a unified shader core architecture in which a single core or type of core can execute all types of programmable shader code, including shader program code implementing vertex shaders, fragment shaders, and/or compute shaders. In at least one embodiment, the number of shader cores can vary. In at least one embodiment, graphics processor 4440 includes an inter-core task manager 4445 that acts as a thread dispatcher to dispatch execution threads to the one or more shader cores 4455A-4455N, and a tiling unit 4458 to accelerate tiling operations for tile-based rendering, in which the rendering operations for a scene are subdivided in image space, e.g., to exploit local spatial coherence within the scene or to optimize use of internal caches.
In at least one embodiment, at least one component shown or described with respect to fig. 44A and 44B is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to fig. 44A and 44B is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 44A and 44B is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 44A and 44B is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 44A and 44B is used to perform at least one aspect described with respect to block diagram 100, block diagram 200, process 300, block diagram 400, process 500, process 600, process 700, block diagram 800, block diagram 900, block diagram 1000, block diagram 1100, process 1200, block diagram 1300, block diagram 1400, block diagram 1500, and/or other systems, methods, or operations described herein.
Fig. 45A illustrates a graphics core 4500 in accordance with at least one embodiment. In at least one embodiment, graphics core 4500 may be included within graphics processor 3910 of fig. 39. In at least one embodiment, graphics core 4500 may be unified shader cores 4455A-4455N in FIG. 44B. In at least one embodiment, graphics core 4500 includes shared instruction cache 4502, texture unit 4518, and cache/shared memory 4520, which are common to execution resources within graphics core 4500. In at least one embodiment, graphics core 4500 may include multiple slices 4501A-4501N or partitions of each core, and a graphics processor may include multiple instances of graphics core 4500. The slices 4501A-4501N may include support logic including local instruction caches 4504A-4504N, thread schedulers 4506A-4506N, thread dispatchers 4508A-4508N, and a set of registers 4510A-4510N. In at least one embodiment, slices 4501A-4501N may include a set of Additional Functional Units (AFUs) 4512A-4512N, Floating Point Units (FPUs) 4514A-4514N, Integer Arithmetic Logic Units (ALUs) 4516A-4516N, Address Calculation Units (ACUs) 4513A-4513N, Double-Precision Floating Point Units (DPFPUs) 4515A-4515N, and Matrix Processing Units (MPUs) 4517A-4517N.
In one embodiment, FPUs 4514A-4514N may perform single-precision (32-bit) and half-precision (16-bit) floating-point operations, while DPFPUs 4515A-4515N may perform double-precision (64-bit) floating-point operations. In at least one embodiment, ALUs 4516A-4516N may perform variable-precision integer operations at 8-bit, 16-bit, and 32-bit precision, and may be configured for mixed-precision operations. In at least one embodiment, MPUs 4517A-4517N may also be configured for mixed-precision matrix operations, including half-precision floating-point operations and 8-bit integer operations. In at least one embodiment, MPUs 4517A-4517N can perform various matrix operations to accelerate CUDA programs, including enabling support for accelerated general matrix-matrix multiplication (GEMM). In at least one embodiment, AFUs 4512A-4512N may perform additional logic operations not supported by floating-point or integer units, including trigonometric operations (e.g., sine, cosine, etc.).
In at least one embodiment, at least one component shown or described with respect to fig. 45A is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 45A is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 45A is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 45A is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 45A is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 45B illustrates a General Purpose Graphics Processing Unit (GPGPU) 4530 in accordance with at least one embodiment. In at least one embodiment, GPGPU 4530 is highly parallel and suitable for deployment on a multi-chip module. In at least one embodiment, GPGPU 4530 may be configured to enable highly parallel computing operations to be performed by a GPU array. In at least one embodiment, GPGPU 4530 may be directly linked to other instances of GPGPU 4530 to create a multi-GPU cluster to improve execution time for CUDA programs. In at least one embodiment, the GPGPU 4530 includes a host interface 4532 to enable connection with a host processor. In at least one embodiment, host interface 4532 is a PCIe interface. In at least one embodiment, the host interface 4532 may be a vendor-specific communication interface or communication fabric. In at least one embodiment, GPGPU 4530 receives commands from a host processor and dispatches execution threads associated with those commands to a set of compute clusters 4536A-4536H using global scheduler 4534. In at least one embodiment, the compute clusters 4536A-4536H share cache memory 4538. In at least one embodiment, the cache memory 4538 may serve as a higher-level cache for the cache memories within the compute clusters 4536A-4536H.
In at least one embodiment, GPGPU 4530 includes memories 4544A-4544B coupled with computing clusters 4536A-4536H via a set of memory controllers 4542A-4542B. In at least one embodiment, the memories 4544A-4544B may comprise various types of memory devices including Dynamic Random Access Memory (DRAM) or graphics random access memory, such as Synchronous Graphics Random Access Memory (SGRAM), including Graphics Double Data Rate (GDDR) memory.
In at least one embodiment, the compute clusters 4536A-4536H each include a set of graphics cores, such as graphics core 4500 of FIG. 45A, which may include multiple types of integer and floating-point logic units capable of performing compute operations at a range of precisions, including computations suitable for CUDA programs. For example, in at least one embodiment, at least a subset of the floating-point units in each of the compute clusters 4536A-4536H may be configured to perform 16-bit or 32-bit floating-point operations, while a different subset of the floating-point units may be configured to perform 64-bit floating-point operations.
In at least one embodiment, multiple instances of GPGPU 4530 may be configured to operate as a compute cluster. In at least one embodiment, the computing clusters 4536A-4536H may implement any technically feasible communication technique for synchronizing and exchanging data. In at least one embodiment, multiple instances of the GPGPU 4530 communicate through a host interface 4532. In at least one embodiment, GPGPU 4530 includes an I/O hub 4539 that couples GPGPU 4530 with a GPU link 4540 to enable direct connection to other instances of GPGPU 4530. In at least one embodiment, GPU link 4540 is coupled to a dedicated GPU-to-GPU bridge that enables communication and synchronization between multiple instances of GPGPU 4530. In at least one embodiment, GPU link 4540 is coupled with a high speed interconnect to send and receive data to other GPGPUs 4530 or parallel processors. In at least one embodiment, multiple instances of GPGPU 4530 are located in separate data processing systems and communicate via a network device that is accessible via host interface 4532. In at least one embodiment, GPU link 4540 may be configured to be capable of connecting to a host processor, in addition to or in lieu of host interface 4532. In at least one embodiment, GPGPU 4530 may be configured to execute a CUDA program.
In at least one embodiment, at least one component shown or described with respect to fig. 45B is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 45B is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 45B is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 45B is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 45B is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Fig. 46A illustrates a parallel processor 4600 in accordance with at least one embodiment. In at least one embodiment, the various components of the parallel processor 4600 may be implemented using one or more integrated circuit devices, such as a programmable processor, an Application Specific Integrated Circuit (ASIC), or an FPGA.
In at least one embodiment, parallel processor 4600 includes parallel processing unit 4602. In at least one embodiment, the parallel processing unit 4602 includes an I/O unit 4604 that enables communication with other devices, including other instances of the parallel processing unit 4602. In at least one embodiment, the I/O unit 4604 may be directly connected to other devices. In at least one embodiment, the I/O unit 4604 connects with other devices using a hub or switch interface (e.g., memory hub 2105). In at least one embodiment, the connection between the memory hub 2105 and the I/O unit 4604 forms a communications link. In at least one embodiment, I/O unit 4604 is connected with host interface 4606 and memory crossbar 4616, where host interface 4606 receives commands to perform processing operations and memory crossbar 4616 receives commands to perform memory operations.
In at least one embodiment, when host interface 4606 receives command buffers via I/O unit 4604, host interface 4606 can direct work operations to execute those commands to front end 4608. In at least one embodiment, front end 4608 is coupled to a scheduler 4610, scheduler 4610 being configured to distribute commands or other work items to processing array 4612. In at least one embodiment, scheduler 4610 ensures that processing array 4612 is properly configured and in an active state before tasks are distributed to the clusters of processing array 4612. In at least one embodiment, scheduler 4610 is implemented by firmware logic executing on a microcontroller. In at least one embodiment, the microcontroller-implemented scheduler 4610 may be configured to perform complex scheduling and work-distribution operations at coarse and fine granularity, enabling fast preemption and context switching of threads executing on the processing array 4612. In at least one embodiment, host software can prove workloads for scheduling on the processing array 4612 via one of multiple graphics processing doorbells. In at least one embodiment, the workloads may then be automatically distributed over the processing array 4612 by scheduler 4610 logic within the microcontroller that includes scheduler 4610.
In at least one embodiment, processing array 4612 may include up to "N" processing clusters (e.g., clusters 4614A, 4614B through 4614N). In at least one embodiment, each cluster 4614A-4614N of the processing array 4612 may execute a large number of concurrent threads. In at least one embodiment, the scheduler 4610 may assign work to clusters 4614A-4614N of the processing array 4612 using various scheduling and/or work assignment algorithms, which may vary depending on the workload generated by each program or type of computation. In at least one embodiment, scheduling may be dynamically handled by scheduler 4610 or may be aided in part by compiler logic during compilation of program logic configured to be executed by processing array 4612. In at least one embodiment, different clusters 4614A-4614N of processing array 4612 may be allocated for processing different types of programs or for performing different types of computations.
In at least one embodiment, processing array 4612 may be configured to perform various types of parallel processing operations. In at least one embodiment, processing array 4612 is configured to perform general parallel computing operations. For example, in at least one embodiment, processing array 4612 may include logic to perform processing tasks including filtering video and/or audio data, performing modeling operations, including physical operations, and performing data transformations.
In at least one embodiment, processing array 4612 is configured to perform parallel graphics processing operations. In at least one embodiment, processing array 4612 may include additional logic to support the execution of such graphics processing operations, including, but not limited to, texture sampling logic to perform texture operations, as well as tessellation logic and other vertex processing logic. In at least one embodiment, processing array 4612 may be configured to execute shader programs related to graphics processing, such as, but not limited to, vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. In at least one embodiment, the parallel processing unit 4602 may transfer data from system memory for processing via the I/O unit 4604. In at least one embodiment, during processing, the transferred data may be stored to on-chip memory (e.g., parallel processor memory 4622) during processing and then written back to system memory.
In at least one embodiment, when parallel processing unit 4602 is used to perform graphics processing, scheduler 4610 may be configured to divide the processing workload into approximately equal sized tasks to better allocate graphics processing operations to the multiple clusters 4614A-4614N of processing array 4612. In at least one embodiment, portions of processing array 4612 may be configured to perform different types of processing. For example, in at least one embodiment, a first portion may be configured to perform vertex shading and topology generation, a second portion may be configured to perform tessellation and geometry shading, and a third portion may be configured to perform pixel shading or other screen space operations to generate a rendered image for display. In at least one embodiment, intermediate data generated by one or more of the clusters 4614A-4614N may be stored in a buffer to allow the intermediate data to be transferred between the clusters 4614A-4614N for further processing.
In at least one embodiment, the processing array 4612 may receive processing tasks to be performed via a scheduler 4610, the scheduler 4610 receiving commands defining the processing tasks from a front end 4608. In at least one embodiment, the processing tasks may include an index of data to be processed, which may include, for example, surface (patch) data, raw data, vertex data, and/or pixel data, as well as state parameters and commands defining how to process the data (e.g., what program is to be executed). In at least one embodiment, the scheduler 4610 may be configured to obtain an index corresponding to a task or may receive an index from the front end 4608. In at least one embodiment, the front end 4608 can be configured to ensure that the processing array 4612 is configured to a valid state prior to launching a workload specified by an incoming command buffer (e.g., batch-buffer, push buffer, etc.).
In at least one embodiment, each of the one or more instances of the parallel processing unit 4602 may be coupled with a parallel processor memory 4622. In at least one embodiment, parallel processor memory 4622 may be accessed via memory crossbar 4616, which memory crossbar 4616 may receive memory requests from processing array 4612 and I/O unit 4604. In at least one embodiment, the memory crossbar 4616 can access the parallel processor memory 4622 via the memory interface 4618. In at least one embodiment, the memory interface 4618 may include a plurality of partition units (e.g., partition unit 4620A, partition unit 4620B through partition unit 4620N), each of which may be coupled to a portion (e.g., a memory unit) of the parallel processor memory 4622. In at least one embodiment, the number of partition units 4620A-4620N is configured to equal the number of memory units, such that a first partition unit 4620A has a corresponding first memory unit 4624A, a second partition unit 4620B has a corresponding memory unit 4624B, and an Nth partition unit 4620N has a corresponding Nth memory unit 4624N. In at least one embodiment, the number of partition units 4620A-4620N may not equal the number of memory devices.
In at least one embodiment, memory units 4624A-4624N may include various types of memory devices, including Dynamic Random Access Memory (DRAM) or graphics random access memory, such as Synchronous Graphics Random Access Memory (SGRAM), including Graphics Double Data Rate (GDDR) memory. In at least one embodiment, memory units 4624A-4624N may also include 3D stacked memory, including but not limited to High Bandwidth Memory (HBM). In at least one embodiment, rendering targets such as frame buffers or texture maps may be stored across memory units 4624A-4624N, allowing partition units 4620A-4620N to write portions of each rendering target in parallel to efficiently use the available bandwidth of parallel processor memory 4622. In at least one embodiment, the local instance of parallel processor memory 4622 may be eliminated to facilitate a unified memory design that utilizes system memory in combination with local cache memory.
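The parallel write pattern described above, in which a render target is spread across memory units so that partition units can write portions of it concurrently, can be illustrated with a short conceptual sketch. The Python below is purely illustrative; the function name, chunk granularity, and round-robin policy are hypothetical and not part of any embodiment:

```python
def partition_for_address(addr: int, num_partitions: int, granularity: int = 256) -> int:
    """Map a linear address to a partition unit by interleaving fixed-size
    chunks round-robin across partitions (hypothetical policy)."""
    return (addr // granularity) % num_partitions

# Consecutive 256-byte chunks of a frame buffer land on different partition
# units, so writes to them can proceed in parallel.
chunks = [partition_for_address(a, num_partitions=4) for a in range(0, 1024, 256)]
```

Because adjacent chunks map to distinct partition units, a burst of writes covering a contiguous region of the render target exercises all partitions at once, which is the bandwidth-spreading effect the paragraph describes.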
In at least one embodiment, any of clusters 4614A-4614N of processing array 4612 may process data to be written to any of memory units 4624A-4624N within parallel processor memory 4622. In at least one embodiment, the memory crossbar 4616 may be configured to transmit the output of each cluster 4614A-4614N to any partition unit 4620A-4620N or to another cluster 4614A-4614N, which may perform further processing operations on the output. In at least one embodiment, each cluster 4614A-4614N may communicate with a memory interface 4618 through a memory crossbar 4616 to read from or write to various external storage devices. In at least one embodiment, memory crossbar 4616 has a connection to memory interface 4618 to communicate with I/O unit 4604 and a connection to a local instance of parallel processor memory 4622 to enable processing units within different processing clusters 4614A-4614N to communicate with system memory or other memory that is not local to parallel processing unit 4602. In at least one embodiment, the memory crossbar 4616 may use virtual channels to split traffic between clusters 4614A-4614N and partition units 4620A-4620N.
In at least one embodiment, multiple instances of parallel processing unit 4602 may be provided on a single add-in card, or multiple add-in cards may be interconnected. In at least one embodiment, different instances of parallel processing unit 4602 may be configured to interoperate, even though the different instances have different numbers of processing cores, different amounts of local parallel processor memory, and/or other configuration differences. For example, in at least one embodiment, some instances of parallel processing unit 4602 may include higher precision floating point units relative to other instances. In at least one embodiment, a system incorporating one or more instances of parallel processing unit 4602 or parallel processor 4600 may be implemented in a variety of configurations and form factors, including, but not limited to, a desktop, laptop or handheld personal computer, server, workstation, gaming machine, and/or embedded system.
In at least one embodiment, at least one component shown or described with respect to fig. 46A is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 46A is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 46A is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 46A is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 46A is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 46B illustrates a processing cluster 4694 in accordance with at least one embodiment. In at least one embodiment, processing cluster 4694 is included within a parallel processing unit. In at least one embodiment, the processing cluster 4694 is an instance of one of the processing clusters 4614A-4614N of FIG. 46A. In at least one embodiment, the processing cluster 4694 may be configured to execute many threads in parallel, where the term "thread" refers to an instance of a particular program executing on a particular set of input data. In at least one embodiment, single instruction multiple data (SIMD) instruction issue techniques are used to support parallel execution of a large number of threads without providing multiple independent instruction units. In at least one embodiment, single instruction multiple thread (SIMT) techniques are used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each processing cluster 4694.
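The SIMT model just described, in which a common instruction unit issues one instruction to many generally synchronized threads, can be sketched conceptually: when threads diverge at a branch, each path is executed in turn with an active mask selecting which threads participate. The following Python model is an illustrative assumption about SIMT divergence handling, not a description of any particular hardware:

```python
def simt_step(threads, predicate, then_op, else_op):
    """One SIMT issue: a single instruction stream applied to all threads.
    Divergent branches are handled by masking; each path executes in turn
    while threads on the other path idle."""
    mask = [predicate(t) for t in threads]
    # Taken path: only threads whose mask bit is set are active.
    out = [then_op(t) if m else t for t, m in zip(threads, mask)]
    # Not-taken path: remaining threads execute while the others idle.
    out = [else_op(t) if not m else t for t, m in zip(out, mask)]
    return out

result = simt_step([1, 2, 3, 4], lambda t: t % 2 == 0,
                   lambda t: t * 10, lambda t: t + 100)
```

All four threads consumed the same instruction stream; the mask, not separate instruction units, produced the per-thread behavior.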
In at least one embodiment, the operation of the processing cluster 4694 may be controlled by a pipeline manager 4632 that distributes processing tasks to the SIMT parallel processors. In at least one embodiment, the pipeline manager 4632 receives instructions from the scheduler 4610 of FIG. 46A and manages execution of those instructions through the graphics multiprocessor 4634 and/or texture unit 4636. In at least one embodiment, graphics multiprocessor 4634 is an illustrative example of a SIMT parallel processor. However, in at least one embodiment, various types of SIMT parallel processors of differing architectures may be included within processing cluster 4694. In at least one embodiment, one or more instances of graphics multiprocessor 4634 may be included within processing cluster 4694. In at least one embodiment, the graphics multiprocessor 4634 may process data, and the data crossbar 4640 may be used to distribute the processed data to one of a number of possible destinations, including other shader units. In at least one embodiment, pipeline manager 4632 may facilitate distribution of processed data by specifying a destination for the processed data to be distributed via data crossbar 4640.
In at least one embodiment, each graphics multiprocessor 4634 within a processing cluster 4694 may include the same set of function execution logic (e.g., arithmetic logic units, load Store Units (LSUs), etc.). In at least one embodiment, the function execution logic may be configured in a pipelined fashion, where a new instruction may be issued before a previous instruction completes. In at least one embodiment, the function execution logic supports a variety of operations including integer and floating point arithmetic, comparison operations, boolean operations, shifting, and computation of various algebraic functions. In at least one embodiment, the same functional unit hardware may be utilized to perform different operations, and any combination of functional units may be present.
In at least one embodiment, the instructions transferred to the processing cluster 4694 constitute a thread. In at least one embodiment, the set of threads executing across a set of parallel processing engines is a thread group. In at least one embodiment, a thread group executes a program on different input data. In at least one embodiment, each thread within a thread group may be assigned to a different processing engine within the graphics multiprocessor 4634. In at least one embodiment, a thread group may include fewer threads than the number of processing engines within the graphics multiprocessor 4634. In at least one embodiment, when a thread group includes fewer threads than the number of processing engines, one or more processing engines may be idle during cycles in which that thread group is being processed. In at least one embodiment, a thread group may also include more threads than the number of processing engines within the graphics multiprocessor 4634. In at least one embodiment, when a thread group includes more threads than the number of processing engines within graphics multiprocessor 4634, processing may be performed over consecutive clock cycles. In at least one embodiment, multiple thread groups may execute concurrently on the graphics multiprocessor 4634.
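The relationship above between thread-group size and processing-engine count reduces to simple arithmetic: a group no larger than the engine count completes in one pass (possibly leaving engines idle), while a larger group is processed over consecutive cycles. A minimal sketch, with a hypothetical helper name:

```python
import math

def cycles_for_thread_group(num_threads: int, num_engines: int) -> int:
    """Number of consecutive passes needed to process one thread group on
    a fixed set of processing engines. Fewer threads than engines still
    costs one pass, with some engines idle."""
    return math.ceil(num_threads / num_engines)
```

For example, a 32-thread group on 16 engines takes two consecutive cycles, while an 8-thread group on the same 16 engines takes one cycle with eight engines idle.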
In at least one embodiment, graphics multiprocessor 4634 includes an internal cache memory to perform load and store operations. In at least one embodiment, graphics multiprocessor 4634 may forgo an internal cache and instead use cache memory (e.g., L1 cache 4648) within processing cluster 4694. In at least one embodiment, each graphics multiprocessor 4634 may also access an L2 cache within partition units (e.g., partition units 4620A-4620N of FIG. 46A) that is shared among all processing clusters 4694 and may be used to transfer data between threads. In at least one embodiment, graphics multiprocessor 4634 may also access off-chip global memory, which may include one or more of local parallel processor memory and/or system memory. In at least one embodiment, any memory external to the parallel processing unit 4602 may be used as global memory. In at least one embodiment, processing cluster 4694 includes multiple instances of graphics multiprocessor 4634, which may share common instructions and data that may be stored in L1 cache 4648.
In at least one embodiment, each processing cluster 4694 can include an MMU 4645 configured to map virtual addresses to physical addresses. In at least one embodiment, one or more instances of MMU 4645 may reside within memory interface 4618 of FIG. 46A. In at least one embodiment, the MMU 4645 includes a set of page table entries (PTEs) for mapping virtual addresses to physical addresses of tiles and, optionally, to cache line indexes. In at least one embodiment, MMU 4645 may include an address translation lookaside buffer (TLB) or cache that may reside in graphics multiprocessor 4634 or L1 cache 4648 or within processing cluster 4694. In at least one embodiment, physical addresses are processed to distribute surface data access locality for efficient request interleaving among partition units. In at least one embodiment, the cache line index may be used to determine whether a request for a cache line is a hit or a miss.
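The PTE-plus-TLB translation flow above can be modeled with a toy lookup: the page table maps virtual pages to physical pages, and a small TLB caches recent translations so repeated accesses avoid a page-table walk. The class below is a deliberately simplified conceptual model (names, page size, and unbounded TLB are all assumptions):

```python
class SimpleMMU:
    """Toy MMU: a page-table dict maps virtual page -> physical page;
    a TLB dict caches recent translations."""

    def __init__(self, page_table, page_size=4096):
        self.page_table = page_table
        self.page_size = page_size
        self.tlb = {}

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, self.page_size)
        if vpage in self.tlb:            # TLB hit: no page-table walk
            ppage = self.tlb[vpage]
        else:                            # TLB miss: walk page table, fill TLB
            ppage = self.page_table[vpage]
            self.tlb[vpage] = ppage
        return ppage * self.page_size + offset

mmu = SimpleMMU({0: 7, 1: 3})
```

A real MMU would bound the TLB and evict entries; here the point is only the hit/miss split and the virtual-to-physical arithmetic.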
In at least one embodiment, the processing clusters 4694 may be configured such that each graphics multiprocessor 4634 is coupled to a texture unit 4636 to perform texture mapping operations, which may involve, for example, determining texture sample locations, reading texture data, and filtering the texture data. In at least one embodiment, texture data is read from an internal texture L1 cache (not shown) or from an L1 cache within graphics multiprocessor 4634, and fetched from an L2 cache, local parallel processor memory, or system memory, as desired. In at least one embodiment, each graphics multiprocessor 4634 outputs processed tasks to data crossbar 4640 to provide the processed tasks to another processing cluster 4694 for further processing or to store the processed tasks in an L2 cache, local parallel processor memory, or system memory via memory crossbar 4616. In at least one embodiment, pre-raster operations unit (preROP) 4642 is configured to receive data from graphics multiprocessor 4634, direct the data to ROP units, which may be located with partition units described herein (e.g., partition units 4620A-4620N of FIG. 46A). In at least one embodiment, the PreROP 4642 unit may perform optimization for color blending, organize pixel color data, and perform address translation.
In at least one embodiment, at least one component shown or described with respect to fig. 46B is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 46B is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 46B is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 46B is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 46B is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 46C illustrates a graphics multiprocessor 4696 in accordance with at least one embodiment. In at least one embodiment, graphics multiprocessor 4696 is the graphics multiprocessor 4634 of FIG. 46B. In at least one embodiment, graphics multiprocessor 4696 is coupled with the pipeline manager 4632 of processing cluster 4694. In at least one embodiment, graphics multiprocessor 4696 has an execution pipeline that includes, but is not limited to, an instruction cache 4652, an instruction unit 4654, an address mapping unit 4656, a register file 4658, one or more GPGPU cores 4662, and one or more LSUs 4666. In at least one embodiment, GPGPU cores 4662 and LSUs 4666 are coupled with cache memory 4672 and shared memory 4670 via memory and cache interconnect 4668.
In at least one embodiment, the instruction cache 4652 receives a stream of instructions to be executed from the pipeline manager 4632. In at least one embodiment, instructions are cached in instruction cache 4652 and dispatched for execution by instruction unit 4654. In one embodiment, the instruction unit 4654 may dispatch instructions as a thread group (e.g., a thread bundle), each thread of the thread group being assigned to a different execution unit within the GPGPU core 4662. In at least one embodiment, an instruction may access any local, shared, or global address space by specifying an address within a unified address space. In at least one embodiment, address mapping unit 4656 may be used to translate addresses in a unified address space into different memory addresses that may be accessed by LSU 4666.
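The unified-address-space behavior described above, in which an instruction names a single address and the address mapping unit resolves it to a local, shared, or global access, can be sketched as a window decoder. The window layout and names below are hypothetical illustrations, not addresses from any embodiment:

```python
# Hypothetical layout: each memory space occupies a fixed window of the
# unified address space; the mapping unit decodes an address into
# (space, local offset) for the load/store units.
WINDOWS = [
    ("local",  0x0000_0000, 0x0001_0000),
    ("shared", 0x0001_0000, 0x0002_0000),
    ("global", 0x0002_0000, 0x1000_0000),
]

def decode_unified_address(addr):
    for space, lo, hi in WINDOWS:
        if lo <= addr < hi:
            return space, addr - lo
    raise ValueError("address outside unified address space")
```

One instruction form can thus access any space: the decode, not the opcode, selects which memory the LSU targets.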
In at least one embodiment, register file 4658 provides a set of registers for the functional units of graphics multiprocessor 4696. In at least one embodiment, register file 4658 provides temporary storage for operands of data paths connected to functional units of graphics multiprocessor 4696 (e.g., GPGPU cores 4662, LSU 4666). In at least one embodiment, the register file 4658 is divided among each functional unit such that each functional unit is assigned a dedicated portion of the register file 4658. In at least one embodiment, register file 4658 is divided among different thread groups being executed by graphics multiprocessor 4696.
In at least one embodiment, GPGPU cores 4662 may each include an FPU and/or ALU for executing instructions of graphics multiprocessor 4696. In at least one embodiment, the GPGPU cores 4662 may be similar in architecture or may differ in architecture. In at least one embodiment, a first portion of the GPGPU cores 4662 includes a single precision FPU and an integer ALU, while a second portion of the GPGPU cores 4662 includes a double precision FPU. In at least one embodiment, the FPUs may implement the IEEE 754-2008 standard for floating point arithmetic or enable variable precision floating point arithmetic. In at least one embodiment, graphics multiprocessor 4696 may additionally include one or more fixed-function or special-function units to perform specific functions, such as copy rectangle or pixel blending operations. In at least one embodiment, one or more of the GPGPU cores 4662 may also include fixed-function or special-function logic.
In at least one embodiment, the GPGPU cores 4662 include SIMD logic capable of executing a single instruction on multiple sets of data. In at least one embodiment, GPGPU cores 4662 may physically execute SIMD4, SIMD8, and SIMD16 instructions and logically execute SIMD1, SIMD2, and SIMD32 instructions. In at least one embodiment, the SIMD instructions for the GPGPU cores 4662 may be generated at compile time by a shader compiler, or generated automatically when executing programs written and compiled for single program multiple data ("SPMD") or SIMT architectures. In at least one embodiment, multiple threads of a program configured for the SIMT execution model may be executed by a single SIMD instruction. For example, in at least one embodiment, eight SIMT threads performing the same or similar operations may be executed in parallel by a single SIMD8 logic unit.
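The split above between logical and physical SIMD width can be illustrated conceptually: a logically wide operation (e.g., SIMD32) is carried out on narrower physical units (e.g., SIMD16) by issuing over multiple passes. The Python below is an illustrative model only; the function name and widths are assumptions:

```python
def execute_logical_simd(data, op, physical_width=16):
    """Execute a logically wider SIMD operation on narrower physical SIMD
    units by issuing one pass per physical_width lanes."""
    out = []
    for i in range(0, len(data), physical_width):
        # One physical SIMD issue covers `physical_width` lanes.
        out.extend(op(x) for x in data[i:i + physical_width])
    return out

# A logical SIMD32 doubling, executed as two SIMD16 passes.
res = execute_logical_simd(list(range(32)), lambda x: x * 2, physical_width=16)
```

The caller sees one 32-lane operation; the hardware-level loop over passes is invisible at the instruction-set level, which is what makes the logical widths cheap to support.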
In at least one embodiment, memory and cache interconnect 4668 is an interconnect network that connects each functional unit of graphics multiprocessor 4696 to register file 4658 and shared memory 4670. In at least one embodiment, the memory and cache interconnect 4668 is a crossbar interconnect that allows the LSU 4666 to implement load and store operations between the shared memory 4670 and the register file 4658. In at least one embodiment, register file 4658 may operate at the same frequency as GPGPU core 4662, such that the latency of data transfer between GPGPU core 4662 and register file 4658 is very low. In at least one embodiment, shared memory 4670 may be used to enable communication between threads executing on functional units within graphics multiprocessor 4696. In at least one embodiment, for example, cache memory 4672 may be used as a data cache to cache texture data communicated between functional units and texture units 4636. In at least one embodiment, shared memory 4670 may also be used as a program managed cache. In at least one embodiment, threads executing on the GPGPU core 4662 may also programmatically store data in shared memory in addition to automatically cached data stored in the cache memory 4672.
In at least one embodiment, a parallel processor or GPGPU as described herein is communicatively coupled to a host/processor core to accelerate graphics operations, machine learning operations, pattern analysis operations, and various general purpose GPU (GPGPU) functions. In at least one embodiment, the GPU may be communicatively coupled to the host processor/core via a bus or other interconnect (e.g., a high speed interconnect such as PCIe or NVLink). In at least one embodiment, the GPU may be integrated on the same package or chip as the core and communicatively coupled to the core through an internal processor bus/interconnect (i.e., internal to the package or chip). In at least one embodiment, regardless of the manner in which the GPU is connected, the processor cores may distribute work to the GPU in the form of sequences of commands/instructions contained in a work descriptor ("WD"). In at least one embodiment, the GPU then uses dedicated circuitry/logic to efficiently process these commands/instructions.
In at least one embodiment, at least one component shown or described with respect to fig. 46C is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 46C is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 46C is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 46C is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 46C is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
General purpose computing
The following figures set forth, without limitation, exemplary software configurations for implementing at least one embodiment in general-purpose computing.
FIG. 47 illustrates a software stack of a programming platform in accordance with at least one embodiment. In at least one embodiment, the programming platform is a platform for utilizing hardware on a computing system to accelerate computing tasks. In at least one embodiment, a software developer may access the programming platform through libraries, compiler directives, and/or extensions to programming languages. In at least one embodiment, the programming platform may be, but is not limited to, CUDA, Radeon Open Compute platform ("ROCm"), OpenCL™ (developed by the Khronos Group), SYCL, or Intel oneAPI.
In at least one embodiment, the software stack 4700 of the programming platform provides an execution environment for the application 4701. In at least one embodiment, the application 4701 may include any computer software capable of being launched on the software stack 4700. In at least one embodiment, the application 4701 may include, but is not limited to, an artificial intelligence ("AI")/machine learning ("ML") application, a high performance computing ("HPC") application, a virtual desktop infrastructure ("VDI") or a data center workload.
In at least one embodiment, the application 4701 and software stack 4700 run on hardware 4707. In at least one embodiment, hardware 4707 may include one or more GPUs, CPUs, FPGAs, AI engines, and/or other types of computing devices that support the programming platform. In at least one embodiment, such as with CUDA, software stack 4700 may be vendor specific and compatible only with devices from a particular vendor. In at least one embodiment, such as with OpenCL, the software stack 4700 may be used with devices from different vendors. In at least one embodiment, hardware 4707 includes a host connected to one or more devices that are accessible via Application Programming Interface (API) calls to perform computing tasks. In at least one embodiment, a host within hardware 4707 may include, but is not limited to, a CPU (but may also include a computing device) and its memory, while a device within hardware 4707 may include, but is not limited to, a GPU, FPGA, AI engine, or other computing device (but may also include a CPU) and its memory.
In at least one embodiment, the software stack 4700 of the programming platform includes, but is not limited to, a plurality of libraries 4703, runtime 4705, and device kernel driver 4706. In at least one embodiment, each of the libraries 4703 may include data and programming code that may be used by a computer program and utilized during software development. In at least one embodiment, the libraries 4703 may include, but are not limited to, pre-written code and subroutines, classes, values, type specifications, configuration data, documentation, assistance data, and/or message templates. In at least one embodiment, the libraries 4703 include functions optimized for execution on one or more types of devices. In at least one embodiment, the libraries 4703 may include, but are not limited to, functions for performing mathematical, deep learning, and/or other types of operations on a device. In at least one embodiment, the libraries 4703 are associated with corresponding APIs 4702, which may include one or more APIs that expose the functions implemented in the libraries 4703.
In at least one embodiment, application 4701 is written as source code that is compiled into executable code, as discussed in more detail below in connection with FIG. 52. In at least one embodiment, the executable code of the application 4701 may run at least in part on the execution environment provided by the software stack 4700. In at least one embodiment, code that needs to run on the device (as opposed to the host) may be encountered during execution of application 4701. In that case, in at least one embodiment, runtime 4705 may be invoked to load and launch the necessary code on the device. In at least one embodiment, the runtime 4705 may comprise any technically feasible runtime system capable of supporting execution of the application 4701.
In at least one embodiment, the runtime 4705 is implemented as one or more runtime libraries associated with a corresponding API (which is shown as API 4704). In at least one embodiment, one or more such runtime libraries may include, but are not limited to, functions for memory management, execution control, device management, error handling and/or synchronization, and the like. In at least one embodiment, the memory management functions may include, but are not limited to, functions for allocating, deallocating, and copying device memory and transferring data between host memory and device memory. In at least one embodiment, execution control functions may include, but are not limited to, a function that launches a function (sometimes referred to as a "kernel" when the function is a global function callable from the host) on the device, and a function that sets attribute values in a buffer maintained by the runtime library for a given function to be executed on the device.
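The memory-management surface just listed (allocate, deallocate, and copy device memory; transfer data between host and device) can be sketched as a toy runtime library. The class below is purely a conceptual model of that API shape; it is not a real programming-platform API, and every name in it is hypothetical:

```python
class ToyRuntime:
    """Conceptual runtime-library surface: allocate/free device memory and
    copy data between host and device. Illustrative only."""

    def __init__(self):
        self.device_mem = {}   # handle -> simulated device buffer
        self.next_handle = 1

    def malloc(self, size):
        """Allocate device memory; return an opaque handle."""
        handle = self.next_handle
        self.next_handle += 1
        self.device_mem[handle] = bytearray(size)
        return handle

    def memcpy_host_to_device(self, handle, data):
        self.device_mem[handle][:len(data)] = data

    def memcpy_device_to_host(self, handle, size):
        return bytes(self.device_mem[handle][:size])

    def free(self, handle):
        del self.device_mem[handle]

rt = ToyRuntime()
buf = rt.malloc(16)
rt.memcpy_host_to_device(buf, b"workload")
```

The opaque-handle pattern mirrors how real runtimes hand back device pointers the host never dereferences directly; all access goes through copy functions.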
In at least one embodiment, the runtime libraries and corresponding APIs 4704 may be implemented in any technically feasible manner. In at least one embodiment, one (or any number) of APIs may expose a low-level set of functions for fine-grained control of a device, while another (or any number) of APIs may expose such a higher-level set of functions. In at least one embodiment, a high-level runtime API may be built on top of a low-level API. In at least one embodiment, the one or more runtime APIs may be language-specific APIs that are layered on top of the language-independent runtime APIs.
In at least one embodiment, the device kernel driver 4706 is configured to facilitate communications with underlying devices. In at least one embodiment, device kernel driver 4706 may provide low-level functions that are relied upon by APIs, such as API 4704, and/or other software. In at least one embodiment, the device kernel driver 4706 may be configured to compile intermediate representation ("IR") code into binary code at runtime. In at least one embodiment, for CUDA, the device kernel driver 4706 may compile non-hardware-specific parallel thread execution ("PTX") IR code at runtime into binary code for a particular target device (with caching of the compiled binary code), which is sometimes referred to as "final" code. In at least one embodiment, this may allow the final code to run on a target device that may not have existed when the source code was initially compiled into PTX code. Alternatively, in at least one embodiment, the device source code may be compiled offline into binary code without requiring the device kernel driver 4706 to compile IR code at runtime.
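The compile-with-caching behavior above (IR compiled to "final" code for a target device, with compiled binaries cached) reduces to a memoization pattern keyed by IR text and target. The sketch below is a generic model of that pattern; the compiler itself is faked, and all names are hypothetical:

```python
def make_jit_compiler(compile_fn):
    """Wrap an IR->binary compiler with a cache keyed by (IR, target),
    mirroring how a driver may cache 'final' code compiled at runtime."""
    cache = {}

    def compile_ir(ir_text, target):
        key = (ir_text, target)
        if key not in cache:           # compile once per (IR, target) pair
            cache[key] = compile_fn(ir_text, target)
        return cache[key]

    return compile_ir, cache

calls = []
def fake_compile(ir, target):
    # Stand-in for the real IR-to-binary compilation step.
    calls.append((ir, target))
    return f"binary({ir}@{target})"

jit, cache = make_jit_compiler(fake_compile)
```

Keying on the target as well as the IR is what lets the same PTX-like input produce distinct final code for different devices while repeated launches on one device pay the compile cost only once.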
In at least one embodiment, at least one component shown or described with respect to fig. 47 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 47 is used to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 47 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 47 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 47 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 48 illustrates a CUDA implementation of the software stack 4700 of FIG. 47 in accordance with at least one embodiment. In at least one embodiment, CUDA software stack 4800, on which application 4801 can be launched, includes CUDA library 4803, CUDA runtime 4805, CUDA driver 4807, and device kernel driver 4808. In at least one embodiment, CUDA software stack 4800 executes on hardware 4809, which hardware 4809 can include a CUDA-enabled GPU developed by NVIDIA corporation of santa clara, california.
In at least one embodiment, the application 4801, CUDA runtime 4805, and device kernel driver 4808 can perform similar functions as the application 4701, runtime 4705, and device kernel driver 4706, respectively, described above in connection with FIG. 47. In at least one embodiment, CUDA driver 4807 includes a library (libcuda.so) that implements CUDA driver API 4806. In at least one embodiment, similar to CUDA runtime API 4804 implemented by a CUDA runtime library (cudart), CUDA driver API 4806 may expose, but is not limited to, functions for memory management, execution control, device management, error handling, synchronization, and/or graphics interoperability, etc. In at least one embodiment, CUDA driver API 4806 differs from CUDA runtime API 4804 in that CUDA runtime API 4804 simplifies device code management by providing implicit initialization, context (analogous to a process) management, and module (analogous to a dynamically loaded library) management. In contrast to the high-level CUDA runtime API 4804, in at least one embodiment, the CUDA driver API 4806 is a low-level API that provides finer-grained control of the device, particularly with respect to contexts and module loading. In at least one embodiment, CUDA driver API 4806 can expose functions for context management that are not exposed by CUDA runtime API 4804. In at least one embodiment, CUDA driver API 4806 is also language independent and supports, for example, OpenCL in addition to CUDA runtime API 4804. Further, in at least one embodiment, the development libraries, including CUDA runtime 4805, can be considered separate from the driver components, including user-mode CUDA driver 4807 and kernel-mode device driver 4808 (also sometimes referred to as a "display" driver).
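The implicit-versus-explicit contrast above (a runtime API that initializes contexts lazily on first use, layered over a driver-style API where the caller manages contexts and modules explicitly) can be modeled in a few lines. This is a generic layering sketch, not CUDA code; every class and method name is a hypothetical stand-in:

```python
class ExplicitAPI:
    """Driver-style API: the caller creates contexts and loads modules
    explicitly, gaining fine-grained control."""
    def create_context(self):
        return {"modules": []}

    def load_module(self, ctx, module):
        ctx["modules"].append(module)
        return module

class ImplicitAPI:
    """Runtime-style API layered on the driver-style API: a context is
    created lazily on first use (implicit initialization)."""
    def __init__(self, driver):
        self.driver = driver
        self._ctx = None

    def _context(self):
        if self._ctx is None:          # implicit initialization on first call
            self._ctx = self.driver.create_context()
        return self._ctx

    def launch(self, module):
        ctx = self._context()
        self.driver.load_module(ctx, module)
        return len(ctx["modules"])

runtime_api = ImplicitAPI(ExplicitAPI())
```

A caller of `runtime_api` never touches a context object, while a caller of `ExplicitAPI` manages one directly; that division of labor is the essence of the runtime/driver split described in the paragraph.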
In at least one embodiment, CUDA library 4803 may include, but is not limited to, math libraries, deep learning libraries, parallel algorithm libraries, and/or signal/image/video processing libraries, which may be utilized by parallel computing applications (e.g., application 4801). In at least one embodiment, CUDA library 4803 may include math libraries such as the cuBLAS library, which is an implementation of basic linear algebra subprograms ("BLAS") for performing linear algebra operations; the cuFFT library for computing fast Fourier transforms ("FFTs"); and the cuRAND library for generating random numbers, among others. In at least one embodiment, CUDA library 4803 may include deep learning libraries such as the cuDNN library of primitives for deep neural networks and the TensorRT platform for high-performance deep learning inference, among others.
In at least one embodiment, at least one component shown or described with respect to fig. 48 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 48 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 48 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 48 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 48 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Fig. 49 illustrates a ROCm implementation of the software stack 4700 of FIG. 47 in accordance with at least one embodiment. In at least one embodiment, the ROCm software stack 4900 on which an application 4901 may be launched includes a language runtime 4903, a system runtime 4905, a thunk 4907, a ROCm kernel driver 4908, and a device kernel driver 4909. In at least one embodiment, the ROCm software stack 4900 is executed on hardware 4910, and the hardware 4910 may include a ROCm-enabled GPU developed by AMD Corporation of Santa Clara, California.
In at least one embodiment, application 4901 may perform similar functions as application 4701 discussed above in connection with FIG. 47. Additionally, in at least one embodiment, language runtime 4903 and system runtime 4905 may perform similar functions as runtime 4705 discussed above in connection with FIG. 47. In at least one embodiment, language runtime 4903 differs from system runtime 4905 in that system runtime 4905 is a language-independent runtime that implements ROCr system runtime API 4904 and utilizes a heterogeneous system architecture ("HSA") runtime API. In at least one embodiment, the HSA runtime API is a thin user-mode API that exposes interfaces for accessing and interacting with AMD GPUs, including functions for memory management, execution control via architected kernel dispatch, error handling, system and agent information, and runtime initialization and shutdown, among others. In at least one embodiment, language runtime 4903 is an implementation of a language-specific runtime API 4902 layered above ROCr system runtime API 4904, as compared to system runtime 4905. In at least one embodiment, the language runtime APIs may include, but are not limited to, a heterogeneous compute interface for portability ("HIP") language runtime API, a heterogeneous compute compiler ("HCC") language runtime API, an OpenCL API, or the like. In particular, the HIP language is an extension of the C++ programming language with functionally similar versions of CUDA mechanisms, and in at least one embodiment, the HIP language runtime API includes functions similar to those of the CUDA runtime API 4804 discussed above in connection with FIG. 48, such as functions for memory management, execution control, device management, error handling, synchronization, and the like.
In at least one embodiment, the thunk (ROCt) 4907 is an interface that may be used to interact with the underlying ROCm driver 4908. In at least one embodiment, ROCm driver 4908 is a ROCk driver that is a combination of an AMDGPU driver and an HSA kernel driver (amdkfd). In at least one embodiment, the AMDGPU driver is a device kernel driver for GPUs developed by AMD that performs similar functions as the device kernel driver 4706 discussed above in connection with FIG. 47. In at least one embodiment, the HSA kernel driver is a driver that allows different types of processors to share system resources more efficiently via hardware features.
In at least one embodiment, various libraries (not shown) can be included in the ROCm software stack 4900 above the language runtime 4903 and provide functionality similar to the CUDA library 4803 discussed above in connection with FIG. 48. In at least one embodiment, the various libraries may include, but are not limited to, mathematical, deep learning, and/or other libraries, such as a hipBLAS library that implements functions similar to CUDA cuBLAS, a rocFFT library for calculating FFTs similar to CUDA cuFFT, and the like.
In at least one embodiment, at least one component shown or described with respect to fig. 49 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 49 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 49 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 49 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 49 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 50 illustrates an OpenCL implementation of the software stack 4700 of FIG. 47 in accordance with at least one embodiment. In at least one embodiment, the OpenCL software stack 5000 on which the application 5001 can be launched includes an OpenCL framework 5009, an OpenCL runtime 5006, and a driver 5007. In at least one embodiment, the OpenCL software stack 5000 executes on hardware 5008 that is not vendor specific. In at least one embodiment, since devices developed by different vendors support OpenCL, specific OpenCL drivers may be required to interoperate with hardware from such vendors.
In at least one embodiment, the application 5001, the OpenCL runtime 5006, the device kernel driver 5007, and the hardware 5008 can perform similar functions as the application 4701, the runtime 4705, the device kernel driver 4706, and the hardware 4707, respectively, discussed above in connection with fig. 47. In at least one embodiment, the application 5001 also includes an OpenCL kernel 5002 having code to be executed on the device.
In at least one embodiment, OpenCL defines a "platform" that allows a host to control devices connected to the host. In at least one embodiment, the OpenCL framework provides a platform layer API and a runtime API, shown as platform API 5003 and runtime API 5005. In at least one embodiment, the runtime API 5005 uses contexts to manage the execution of kernels on devices. In at least one embodiment, each identified device can be associated with a respective context that the runtime API 5005 can use to manage that device's command queues, program objects and kernel objects, shared memory objects, etc. In at least one embodiment, the platform API 5003 exposes functions that allow device contexts to be used to select and initialize devices, submit work to devices via command queues, and enable data transfer to and from devices, among other things. In addition, in at least one embodiment, the OpenCL framework provides various built-in functions (not shown), including mathematical functions, relational functions, image processing functions, and the like.
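The per-device context and command-queue pattern described above can be sketched as a small mock. This is a hypothetical Python illustration, not the real OpenCL API; all names are made up.

```python
# Hypothetical sketch of the platform/context pattern described above (not the
# real OpenCL API): each identified device gets its own context, and work is
# submitted to a device through a per-context command queue.

from collections import deque

class Context:
    """Per-device context managing a command queue and memory objects."""
    def __init__(self, device):
        self.device = device
        self.command_queue = deque()
        self.memory_objects = {}

    def enqueue(self, kernel_name, args):
        # Work is submitted to the device via its command queue.
        self.command_queue.append((kernel_name, args))

    def flush(self):
        # Drain the queue, "executing" each enqueued kernel in order.
        results = []
        while self.command_queue:
            kernel, args = self.command_queue.popleft()
            results.append(f"{kernel}({args}) on {self.device}")
        return results


class Platform:
    """Platform-layer API: discovers devices and creates contexts for them."""
    def __init__(self, devices):
        self.devices = devices

    def create_contexts(self):
        return {d: Context(d) for d in self.devices}
```

Each device's work stays isolated in its own context, mirroring how the runtime API described above manages command queues per identified device.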
In at least one embodiment, the compiler 5004 is also included in the OpenCL framework 5009. In at least one embodiment, the source code may be compiled offline prior to executing the application or online during execution of the application. In contrast to CUDA and ROCm, the OpenCL application in at least one embodiment may be compiled online by compiler 5004, with compiler 5004 included to represent any number of compilers that may be used to compile source code and/or IR code (e.g., standard portable intermediate representation ("SPIR-V") code) into binary code. Alternatively, in at least one embodiment, the OpenCL application may be compiled offline prior to execution of such application.
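The offline/online compilation distinction above can be illustrated with Python's own `compile()` built-in as a stand-in analogy (this is not OpenCL and not SPIR-V; it only shows compile-ahead-of-time versus compile-at-execution-time).

```python
# Analogy only: Python's compile() as a stand-in for an online compiler.
# "Online" compiles the source at the point of execution; "offline" compiles
# once ahead of time and reuses the prebuilt code object.

source = "result = x * x + 1"

def run_online(x):
    # Online path: source is compiled during execution, each call.
    code = compile(source, "<kernel>", "exec")
    env = {"x": x}
    exec(code, env)
    return env["result"]

# Offline path: compile ahead of time, execute the prebuilt object repeatedly.
prebuilt = compile(source, "<kernel>", "exec")

def run_offline(x):
    env = {"x": x}
    exec(prebuilt, env)
    return env["result"]
```

Both paths produce identical results; the trade-off is compile latency at execution time (online) versus a build step before execution (offline), matching the two options described above.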
In at least one embodiment, at least one component shown or described with respect to fig. 50 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 50 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 50 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 50 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 50 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 51 illustrates software supported by a programming platform in accordance with at least one embodiment. In at least one embodiment, the programming platform 5104 is configured to support various programming models 5103, middleware and/or libraries 5102, and frameworks 5101 upon which applications 5100 may depend. In at least one embodiment, the application 5100 can be an AI/ML application implemented using, for example, a deep learning framework (e.g., MXNet, PyTorch, or TensorFlow) that can rely on libraries such as cuDNN, the NVIDIA Collective Communications Library ("NCCL"), and/or the NVIDIA Data Loading Library ("DALI") CUDA library to provide accelerated computing on underlying hardware.
In at least one embodiment, programming platform 5104 can be one of the CUDA, ROCm, or OpenCL platforms described above in connection with FIGS. 48, 49, and 50, respectively. In at least one embodiment, the programming platform 5104 supports a plurality of programming models 5103, which are abstractions of the underlying computing system that allow for the expression of algorithms and data structures. In at least one embodiment, the programming model 5103 can expose features of underlying hardware in order to improve performance. In at least one embodiment, programming model 5103 may include, but is not limited to, CUDA, HIP, OpenCL, C++ Accelerated Massive Parallelism ("C++ AMP"), open multiprocessing ("OpenMP"), open accelerators ("OpenACC"), and/or Vulkan compute.
In at least one embodiment, middleware and/or library 5102 provides abstract implementations of programming models 5103. In at least one embodiment, such libraries include data and programming code that can be used by computer programs and utilized during software development. In at least one embodiment, such middleware includes software that provides services to applications in addition to those available from programming platform 5104. In at least one embodiment, middleware and/or library 5102 can include, but is not limited to, cuBLAS, cuFFT, cuRAND, and other CUDA libraries, or rocBLAS, rocFFT, rocRAND, and other ROCm libraries. Additionally, in at least one embodiment, the middleware and/or library 5102 may include the NCCL and ROCm Communication Collectives Library ("RCCL") libraries that provide communication routines for GPUs, the MIOpen library for deep learning acceleration, and/or the Eigen library for linear algebra, matrix and vector operations, geometric transformations, numerical solvers, and related algorithms.
In at least one embodiment, the application framework 5101 relies on middleware and/or libraries 5102. In at least one embodiment, each application framework 5101 is a software framework used to implement a standard architecture for application software. In at least one embodiment, an AI/ML application can be implemented using a framework such as the Caffe, Caffe2, TensorFlow, Keras, PyTorch, or MXNet deep learning frameworks.
In at least one embodiment, at least one component shown or described with respect to fig. 51 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 51 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 51 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 51 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 51 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
FIG. 52 illustrates compiling code to execute on one of the programming platforms of FIGS. 47-50 in accordance with at least one embodiment. In at least one embodiment, the compiler 5201 receives source code 5200, which includes both host code as well as device code. In at least one embodiment, the compiler 5201 is configured to convert the source code 5200 into host executable code 5202 for execution on a host and device executable code 5203 for execution on a device. In at least one embodiment, the source code 5200 can be compiled offline prior to executing the application or online during execution of the application.
In at least one embodiment, the source code 5200 can include code in any programming language supported by the compiler 5201, such as C++, C, Fortran, and the like. In at least one embodiment, the source code 5200 may be included in a single-source file having a mix of host code and device code and in which the locations of the device code are indicated. In at least one embodiment, the single-source file may be a .cu file including CUDA code or a .hip.cpp file including HIP code. Alternatively, in at least one embodiment, the source code 5200 may comprise multiple source code files, rather than a single-source file, in which the host code and the device code are separate.
In at least one embodiment, the compiler 5201 is configured to compile the source code 5200 into host executable code 5202 for execution on a host and device executable code 5203 for execution on a device. In at least one embodiment, the compiler 5201 performs operations including parsing the source code 5200 into an abstract syntax tree (AST), performing optimizations, and generating executable code. In at least one embodiment where the source code 5200 includes a single-source file, the compiler 5201 can separate the device code from the host code in such a single-source file, compile the device code and the host code into device executable code 5203 and host executable code 5202, respectively, and link the device executable code 5203 and the host executable code 5202 together in a single file, as discussed in more detail below with respect to FIG. 41.
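The host/device separation step above can be sketched with a toy splitter. This is a hypothetical illustration: the `@device`/`@end` markers are made up for this example and are not real CUDA or HIP syntax.

```python
# Hypothetical sketch of single-source splitting: lines inside a (made-up)
# "@device" ... "@end" region are separated out as device code, and the
# remainder is kept as host code, mimicking how a compiler separates the two
# before compiling each part toward its own target.

def split_single_source(source: str):
    host_lines, device_lines = [], []
    in_device = False
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == "@device":      # hypothetical device-region opener
            in_device = True
            continue
        if stripped == "@end":         # hypothetical device-region closer
            in_device = False
            continue
        (device_lines if in_device else host_lines).append(line)
    return "\n".join(host_lines), "\n".join(device_lines)
```

After the split, each half would be handed to its own backend (host object code versus device code), then linked together, as the paragraph above describes.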
In at least one embodiment, the host executable code 5202 and the device executable code 5203 may be in any suitable format, such as binary code and/or IR code. In the case of CUDA, in at least one embodiment, the host executable code 5202 may include native object code, while the device executable code 5203 may include code in the PTX intermediate representation. In at least one embodiment, in the case of ROCm, both the host executable code 5202 and the device executable code 5203 may comprise target binary code.
In at least one embodiment, at least one component shown or described with respect to fig. 52 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 52 is for causing a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 52 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 52 is for executing a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 52 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
Fig. 53 is a system diagram illustrating a system 5300 for interfacing with an application 5302 to process data in accordance with at least one embodiment. In at least one embodiment, the application 5302 uses a Large Language Model (LLM) 5312 to generate output data 5320 based at least in part on the input data 5310. In at least one embodiment, the input data 5310 is a text prompt. In at least one embodiment, the input data 5310 includes unstructured text. In at least one embodiment, the input data 5310 includes a sequence of tokens. In at least one embodiment, a token is a portion of the input data. In at least one embodiment, a token is a word. In at least one embodiment, a token is a character. In at least one embodiment, a token is a subword. In at least one embodiment, the input data 5310 is formatted as Chat Markup Language (ChatML). In at least one embodiment, the input data 5310 is an image. In at least one embodiment, the input data 5310 is one or more video frames. In at least one embodiment, input data 5310 is any other expressive medium.
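The token granularities just listed (word, character, subword) can be illustrated with a toy tokenizer. The subword splitter here is a made-up fixed-width chunker for illustration, not a real learned subword model such as BPE.

```python
# Toy illustration of token granularities: the same input text can be
# tokenized into words, characters, or (crudely, here) subwords.

def word_tokens(text):
    # Word-level: split on whitespace.
    return text.split()

def char_tokens(text):
    # Character-level: each character is a token.
    return list(text)

def subword_tokens(text, width=3):
    # Naive fixed-width chunking as a stand-in for learned subword units.
    words = text.split()
    return [w[i:i + width] for w in words for i in range(0, len(w), width)]
```

Real LLM tokenizers learn their subword vocabulary from data; this sketch only shows that one input maps to different token sequences at different granularities.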
In at least one embodiment, the large language model 5312 includes a deep neural network. In at least one embodiment, the deep neural network is a neural network having two or more layers. In at least one embodiment, the large language model 5312 includes a transformer model. In at least one embodiment, the large language model 5312 includes a neural network configured to perform natural language processing. In at least one embodiment, the large language model 5312 is configured to process one or more data sequences. In at least one embodiment, the large language model 5312 is configured to process text. In at least one embodiment, the weights and biases of the large language model 5312 are configured to process text. In at least one embodiment, the large language model 5312 is configured to determine patterns in data to perform one or more natural language processing tasks. In at least one embodiment, the natural language processing task includes text generation. In at least one embodiment, the natural language processing task includes question answering. In at least one embodiment, performing natural language processing tasks results in output data 5320.
In at least one embodiment, the processor queries the retrieval database 5314 using the input data 5310. In at least one embodiment, the retrieval database 5314 is a key-value store. In at least one embodiment, the retrieval database 5314 is a corpus used to train the large language model 5312. In at least one embodiment, the processor provides updated information to the large language model 5312 using the retrieval database 5314. In at least one embodiment, the retrieval database 5314 includes data from internet sources. In at least one embodiment, the large language model 5312 does not use the retrieval database 5314 to perform inference.
In at least one embodiment, an encoder encodes the input data 5310 as one or more feature vectors. In at least one embodiment, the encoder encodes the input data 5310 as a sentence embedding vector. In at least one embodiment, the processor performs a nearest neighbor search using the sentence embedding vector to generate one or more neighbors 5316. In at least one embodiment, the one or more neighbors 5316 are values in the retrieval database 5314 that correspond to keys that include the input data 5310. In at least one embodiment, the one or more neighbors 5316 include text data. In at least one embodiment, the encoder 5318 encodes the one or more neighbors 5316. In at least one embodiment, the encoder 5318 encodes the one or more neighbors 5316 as text embedding vectors. In at least one embodiment, the encoder 5318 encodes the one or more neighbors 5316 as sentence embedding vectors. In at least one embodiment, the large language model 5312 uses the input data 5310 and the data generated by the encoder 5318 to generate the output data 5320. In at least one embodiment, the processor 5306 interfaces with the application 5302 using a Large Language Model (LLM) Application Programming Interface (API) 5304. In at least one embodiment, the processor 5306 accesses the large language model 5312 using the Large Language Model (LLM) Application Programming Interface (API) 5304.
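The retrieval step just described — a query embedding compared against keys in a key-value retrieval database, returning the values for the nearest keys as neighbors — can be sketched minimally. The embeddings and distance metric below are toy values for illustration, not a real encoder.

```python
# Minimal sketch of embedding-based retrieval: a query vector is compared
# against embedding keys in a toy key-value "retrieval database", and the
# values stored under the nearest keys are returned as neighbors.

import math

def euclidean(a, b):
    # Simple Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query_vec, database, k=1):
    """database maps embedding tuples (keys) to stored text (values)."""
    ranked = sorted(database.items(),
                    key=lambda kv: euclidean(query_vec, kv[0]))
    return [value for _key, value in ranked[:k]]
```

In a real retrieval-augmented system, the retrieved neighbor text would then be re-encoded and supplied to the language model alongside the original input, as the paragraph above describes.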
In at least one embodiment, output data 5320 includes computer instructions. In at least one embodiment, output data 5320 includes instructions written in a CUDA programming language. In at least one embodiment, output data 5320 includes instructions to be executed by processor 5306. In at least one embodiment, the output data 5320 includes instructions that control execution of one or more algorithm modules 5308. In at least one embodiment, the one or more algorithm modules 5308 include, for example, one or more neural networks for performing pattern recognition. In at least one embodiment, the one or more algorithm modules 5308 include, for example, one or more neural networks for performing frame generation. In at least one embodiment, the one or more algorithm modules 5308 include, for example, one or more neural networks for generating the driving path. In at least one embodiment, the one or more algorithm modules 5308 include, for example, one or more neural networks for generating 5G signals. In at least one embodiment, the processor 5306 interfaces with the application 5302 using a Large Language Model (LLM) Application Programming Interface (API) 5304. In at least one embodiment, the processor 5306 can use one or more parallel computing platforms and/or programming models (e.g., CUDA model of NVIDIA).
In at least one embodiment, aspects of the systems and techniques described herein with respect to fig. 53 are incorporated into aspects of previous figures. For example, in at least one embodiment, the apparatus described in the previous figures includes a processor 5306. For example, in at least one embodiment, system 5300 writes CUDA code using ChatGPT. For example, in at least one embodiment, the system 5300 trains the object classification neural network using ChatGPT. For example, in at least one embodiment, the system 5300 uses ChatGPT and neural network to identify a driving path. For example, in at least one embodiment, system 5300 generates 5G signals using ChatGPT and a neural network.
It should be noted that while the example embodiments described herein may relate to a CUDA programming model, the techniques described herein may be used with any suitable programming model, such as HIP, oneAPI (e.g., using oneAPI-based programming to perform or implement the methods disclosed herein), and/or variations thereof.
In at least one embodiment, one or more components of the systems and/or processors disclosed above may communicate with one or more CPUs, ASICs, GPUs, FPGAs, or other hardware, circuit, or integrated circuit components, including, for example: an upscaler or upsampler to upscale an image; an image blender or image blender component to blend, fuse, or add images together; a sampler to sample an image (e.g., as part of a DSP); a neural network circuit configured to perform upscaling or to upscale an image (e.g., from a low-resolution image to a high-resolution image); or other hardware to modify or generate an image, frame, or video to adjust its resolution, size, or pixels. One or more components of the systems and/or processors disclosed above may use the components described in this disclosure to perform methods, operations, or instructions that generate or modify images.
In at least one embodiment, at least one component shown or described with respect to fig. 53 is used to perform the techniques and/or functions described in connection with fig. 1-15. In at least one embodiment, at least one component shown or described with respect to FIG. 53 is operable to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 53 is used to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to FIG. 53 is used to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API. In at least one embodiment, at least one component shown or described with respect to fig. 53 is used to perform at least one aspect described with respect to block 100, block 200, process 300, block 400, process 500, process 600, process 700, block 800, block 900, block 1000, block 1100, process 1200, block 1300, block 1400, block 1500, and/or other systems, methods, or operations described herein.
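The first-API/second-API pattern recited throughout this disclosure — a first API that receives input values identifying software workloads and launch parameters (node count, tasks per node, environment variables, working directory, launcher) and selects a second API to launch, monitor, or terminate those workloads — can be sketched as a mock in Python. Every name below is illustrative; this is not an actual NVIDIA or scheduler API.

```python
# Hedged sketch of the first-API/second-API pattern: the first API selects a
# second API (launch, monitor, or terminate) based on its inputs and forwards
# the relevant parameters. Job IDs and states are toy stand-ins.

import itertools

_job_counter = itertools.count(1)
_jobs = {}  # job_id -> workload state

def _launch_api(workloads, nodes, tasks_per_node, env, workdir, launcher):
    """Second API (launch): returns job identifiers for the workloads."""
    ids = []
    for _workload in workloads:
        job_id = next(_job_counter)
        _jobs[job_id] = "RUNNING"
        ids.append(job_id)
    return ids

def _monitor_api(job_ids):
    """Second API (monitor): returns workload states for the job identifiers."""
    return {j: _jobs.get(j, "UNKNOWN") for j in job_ids}

def _terminate_api(job_ids):
    """Second API (terminate): ends execution of the identified workloads."""
    for j in job_ids:
        if j in _jobs:
            _jobs[j] = "TERMINATED"
    return job_ids

def first_api(operation, workloads=None, job_ids=None, *, nodes=1,
              tasks_per_node=1, env=None, workdir=".", launcher="mpirun"):
    """First API: selects and invokes a second API based on its input values."""
    second = {
        "launch": lambda: _launch_api(workloads, nodes, tasks_per_node,
                                      env or {}, workdir, launcher),
        "monitor": lambda: _monitor_api(job_ids),
        "terminate": lambda: _terminate_api(job_ids),
    }[operation]
    return second()
```

In this sketch, the launch path returns output values indicating job identifiers, and the monitor path returns output values indicating workload states, mirroring the inputs and outputs enumerated in the clauses that follow.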
At least one embodiment of the present disclosure may be described in view of the following clauses:
1. a processor, comprising:
one or more circuits to cause a first Application Programming Interface (API) to select a second API to execute one or more software workloads identified by the first API.
2. The processor of clause 1, wherein the first API is to receive one or more input values indicative of the one or more software workloads.
3. The processor of clause 1 or 2, wherein the first API is to receive one or more input values indicating a number of nodes to be used to execute the second API.
4. The processor of any of clauses 1-3, wherein the first API is to receive one or more input values indicating a number of tasks per node to be used to execute the second API.
5. The processor of any one of clauses 1-4, wherein the first API is to receive one or more input values indicative of one or more environment variables to be used to execute the second API.
6. The processor of any one of clauses 1-5, wherein the first API is to receive one or more input values indicating a working directory to be used to execute the second API.
7. The processor of any one of clauses 1-6, wherein the first API is to receive one or more input values indicating a launcher to be used to execute the second API.
8. The processor of any one of clauses 1-7, wherein the first API is to receive one or more input values indicating one or more execution modes to be used to execute the second API.
9. A computer system, comprising:
a memory and one or more processors, the memory to store executable instructions that, if executed by the one or more processors, cause the one or more processors to cause a first application programming interface (API) to select a second API to execute one or more software workloads identified by the first API.
10. The computer system of clause 9, comprising:
the first API is to receive one or more first input values indicative of the one or more software workloads; and
the second API is to receive one or more second input values based at least in part on the one or more first input values.
11. The computer system of clause 9 or 10, wherein the first API is to receive one or more input values indicative of a number of nodes of the high performance computing system to be used to execute the second API.
12. The computer system of any of clauses 9-11, wherein the first API is to receive one or more input values indicative of one or more environment variables to be used to execute the second API.
13. The computer system of any of clauses 9-12, wherein the first API is to receive one or more input values indicating a working directory to be used to execute the second API.
14. The computer system of any of clauses 9-13, wherein the first API is to receive one or more input values indicating a launcher to be used to execute the second API.
15. A computer-implemented method, comprising:
causing a first application programming interface (API) to select a second API to execute one or more software workloads identified by the first API.
16. The computer-implemented method of clause 15, wherein the first API is to receive one or more input values indicative of the one or more software workloads.
17. The computer-implemented method of clauses 15 or 16, wherein the first API is to receive one or more input values indicating a number of nodes to be used to execute the second API.
18. The computer-implemented method of any of clauses 15-17, wherein the first API is to receive one or more input values indicative of one or more environment variables to be used to execute the second API.
19. The computer-implemented method of any of clauses 15-18, wherein the first API is to receive one or more input values indicating a launcher to be used to execute the second API.
20. The computer-implemented method of any of clauses 15-19, wherein the second API is to provide one or more output values indicative of one or more job identifiers of the one or more software workloads.
21. A processor, comprising:
one or more circuits to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API.
22. The processor of clause 21, wherein the first API is to receive one or more input values indicative of one or more job identifiers of the one or more software workloads.
23. The processor of clause 21 or 22, wherein the one or more software workloads are to be identified by the first API based at least in part on output values of a third API used to execute the one or more software workloads.
24. The processor of any one of clauses 21-23, wherein the one or more software workloads are to be identified by the first API based at least in part on executing a third API to launch the one or more software workloads.
25. The processor of any one of clauses 21-24, wherein the one or more software workloads are executed using a high performance computing system.
26. The processor of any one of clauses 21-25, wherein the one or more software workloads are executed using one or more nodes of a high performance computing system.
27. The processor of any one of clauses 21-26, wherein the second API is to provide one or more output values indicative of one or more workload states of the one or more software workloads.
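Clauses 21-27 describe the monitoring counterpart: the first API receives job identifiers and selects a second API that monitors execution and reports workload states. A hedged sketch of that structure follows; all names are hypothetical, and the in-memory state table stands in for a real scheduler queried on a high performance computing system:

```python
from typing import Callable, Dict, List

# Hypothetical job-state table; in a real system the scheduler would own this.
_JOB_STATES: Dict[str, str] = {"job-1": "RUNNING", "job-2": "COMPLETED"}

def _poll_scheduler(job_ids: List[str]) -> Dict[str, str]:
    """A stand-in 'second API': provides a workload state per job ID (clause 27)."""
    return {jid: _JOB_STATES.get(jid, "UNKNOWN") for jid in job_ids}

# Registry of selectable monitoring "second APIs".
MONITORS: Dict[str, Callable[[List[str]], Dict[str, str]]] = {"poll": _poll_scheduler}

def monitor_workloads(job_ids: List[str], monitor: str = "poll") -> Dict[str, str]:
    """The 'first API': receives job identifiers (clause 22) and selects a
    second API to monitor execution of the workloads they identify."""
    second_api = MONITORS[monitor]
    return second_api(job_ids)
```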
28. A computer system, comprising:
one or more processors and memory for storing executable instructions that, if executed by the one or more processors, cause the one or more processors to execute a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API.
29. The computer system of clause 28, wherein the first API is to receive one or more input values indicative of one or more job identifiers of the one or more software workloads.
30. The computer system of clauses 28 or 29, wherein the one or more software workloads are to be identified by the first API based at least in part on output values of a third API used to execute the one or more software workloads.
31. The computer system of any of clauses 28-30, wherein the one or more software workloads are to be identified by the first API based at least in part on executing a third API to launch the one or more software workloads.
32. The computer system of any of clauses 28-31, wherein the one or more software workloads are executed using a high performance computing system.
33. The computer system of any of clauses 28-32, wherein the one or more software workloads are executed using one or more nodes of a high performance computing system.
34. The computer system of any of clauses 28-33, wherein the second API is to provide one or more output values indicative of one or more workload states of the one or more software workloads.
35. A computer-implemented method, comprising:
executing a first Application Programming Interface (API) to select a second API to monitor execution of one or more software workloads identified by the first API.
36. The computer-implemented method of clause 35, wherein the first API is to receive one or more input values indicative of one or more job identifiers of the one or more software workloads.
37. The computer-implemented method of clauses 35 or 36, wherein the one or more software workloads are to be identified by the first API based at least in part on executing a third API to launch the one or more software workloads.
38. The computer-implemented method of any of clauses 35-37, wherein the one or more software workloads are performed using a deep learning computing system.
39. The computer-implemented method of any of clauses 35-38, wherein the one or more software workloads are performed using one or more nodes of the deep learning computing system.
40. The computer-implemented method of any of clauses 35-39, wherein the second API is to provide one or more output values indicative of one or more workload states of the one or more software workloads.
41. A processor, comprising:
one or more circuits to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API.
42. The processor of clause 41, wherein the first API is to receive one or more input values indicative of one or more job identifiers of the one or more software workloads.
43. The processor of clause 41 or 42, wherein the one or more software workloads are to be identified by the first API based at least in part on output values of a third API used to execute the one or more software workloads.
44. The processor of any one of clauses 41-43, wherein the one or more software workloads are to be identified by the first API based at least in part on executing a third API to launch the one or more software workloads.
45. The processor of any of clauses 41-44, wherein the one or more software workloads are executed using a high performance computing system.
46. The processor of any one of clauses 41-45, wherein the one or more software workloads are executed using one or more nodes of a high performance computing system.
47. The processor of any one of clauses 41-46, wherein the second API is to provide one or more output values indicative of one or more states of the one or more software workloads based at least in part on executing the second API to terminate execution.
48. A computer system, comprising:
one or more processors and memory for storing executable instructions that, if executed by the one or more processors, cause the one or more processors to execute a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API.
49. The computer system of clause 48, wherein the first API is to receive one or more input values indicative of one or more job identifiers of the one or more software workloads.
50. The computer system of clauses 48 or 49, wherein the one or more software workloads are to be identified by the first API based at least in part on output values of a third API used to execute the one or more software workloads.
51. The computer system of any of clauses 48-50, wherein the one or more software workloads are to be identified by the first API based at least in part on executing a third API to launch the one or more software workloads.
52. The computer system of any of clauses 48-51, wherein the one or more software workloads are executed using a high performance computing system.
53. The computer system of any of clauses 48-52, wherein the one or more software workloads are executed using one or more nodes of a high performance computing system.
54. The computer system of any of clauses 48-53, wherein the second API is to provide one or more output values indicative of one or more states of the one or more software workloads based at least in part on executing the second API to terminate execution.
55. A computer-implemented method, comprising:
executing a first Application Programming Interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API.
56. The computer-implemented method of clause 55, wherein the first API is to receive one or more input values indicative of one or more job identifiers of the one or more software workloads.
57. The computer-implemented method of clause 55 or 56, wherein the one or more software workloads are identified by the first API based at least in part on executing a third API to launch the one or more software workloads.
58. The computer-implemented method of any of clauses 55-57, wherein the one or more software workloads are performed using a deep learning computing system.
59. The computer-implemented method of any of clauses 55-58, wherein the one or more software workloads are performed using one or more nodes of a deep learning computing system.
60. The computer-implemented method of any of clauses 55-59, wherein the second API is to provide one or more output values indicative of one or more states of the one or more software workloads based at least in part on executing the second API to terminate execution of the one or more software workloads.
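Taken together, the launch, monitor, and terminate clause families share one identification chain: job identifiers output by a launch ("third") API identify the workloads that later monitor and terminate calls operate on (cf. clauses 43-44 and 57). The toy facade below illustrates that chain end to end; every name is invented for illustration and the class is not a disclosed implementation:

```python
from typing import Dict, List

class WorkloadManager:
    """Hypothetical facade over the three API families: launch output (job IDs)
    identifies the workloads that monitor and terminate later act on."""

    def __init__(self) -> None:
        self._states: Dict[str, str] = {}  # job ID -> workload state
        self._counter = 0

    def launch(self, workloads: List[str]) -> List[str]:
        """Starts each workload (stubbed) and returns its job identifier."""
        ids = []
        for _ in workloads:
            self._counter += 1
            jid = f"job-{self._counter}"
            self._states[jid] = "RUNNING"
            ids.append(jid)
        return ids

    def monitor(self, job_ids: List[str]) -> Dict[str, str]:
        """Reports a workload state for each identified job."""
        return {jid: self._states.get(jid, "UNKNOWN") for jid in job_ids}

    def terminate(self, job_ids: List[str]) -> Dict[str, str]:
        """Ends running jobs and reports the resulting states."""
        out = {}
        for jid in job_ids:
            if self._states.get(jid) == "RUNNING":
                self._states[jid] = "TERMINATED"
            out[jid] = self._states.get(jid, "NOT_FOUND")
        return out
```

A typical flow under this sketch: `ids = mgr.launch(["sim.sh"])`, then `mgr.monitor(ids)`, then `mgr.terminate(ids)`.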
Other variations are within the spirit of the present disclosure. Thus, while the disclosed technology is susceptible to various modifications and alternative arrangements, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the disclosure to the specific form or forms disclosed, but on the contrary, the intention is to cover all modifications, alternative arrangements, and equivalents falling within the spirit and scope of the disclosure as defined by the appended claims.
The use of the terms "a" and "an" and "the" and similar referents in the context of describing the disclosed embodiments (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. Unless otherwise indicated, the terms "comprising," "having," "including," and "containing" are to be construed as open-ended terms (meaning "including, but not limited to"). The term "connected" (when unmodified and referring to physical connection) is to be construed as partially or wholly contained within, attached to, or joined together, even if something intervenes. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. In at least one embodiment, unless indicated otherwise or contradicted by context, use of the term "set" (e.g., "a set of items") or "subset" is to be construed as a non-empty collection comprising one or more members. Furthermore, unless indicated otherwise or contradicted by context, the term "subset" of a corresponding set does not necessarily denote a proper subset of the corresponding set; the subset and the corresponding set may be equal.
Unless otherwise explicitly indicated or clearly contradicted by context, conjunctive language such as phrases of the form "at least one of A, B, and C" or "at least one of A, B or C" is understood in context as generally used to present an item, term, etc. that may be A or B or C, or any non-empty subset of the set of A and B and C. For example, in the illustrative example of a set having three members, the conjunctive phrases "at least one of A, B, and C" and "at least one of A, B or C" refer to any of the following sets: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}. Thus, such conjunctive language is not generally intended to imply that certain embodiments require the presence of at least one of A, at least one of B, and at least one of C. In addition, unless otherwise indicated herein or otherwise clearly contradicted by context, the term "plurality" indicates a state of being plural (e.g., "a plurality of items" indicates multiple items). In at least one embodiment, the number of items in a plurality is at least two, but may be more when so indicated either explicitly or by context. Furthermore, unless otherwise indicated or clear from context, the phrase "based on" means "based at least in part on" rather than "based solely on."
The operations of the processes described herein may be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. In at least one embodiment, processes such as those described herein (or variations and/or combinations thereof) are performed under control of one or more computer systems configured with executable instructions and are implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or combinations thereof. In at least one embodiment, the code is stored on a computer-readable storage medium in the form of a computer program that, in at least one embodiment, comprises a plurality of instructions executable by one or more processors. In at least one embodiment, the computer-readable storage medium is a non-transitory computer-readable storage medium that excludes transitory signals (e.g., a propagating transient electric or electromagnetic transmission) but includes non-transitory data storage circuitry (e.g., buffers, caches, and queues) within transceivers of transitory signals. In at least one embodiment, code (e.g., executable code or source code) is stored on a set of one or more non-transitory computer-readable storage media (or other memory for storing executable instructions) that, when executed (i.e., as a result of being executed) by one or more processors of a computer system, cause the computer system to perform operations described herein. In at least one embodiment, the set of non-transitory computer-readable storage media comprises multiple non-transitory computer-readable storage media, and one or more of the individual non-transitory storage media of the multiple non-transitory computer-readable storage media lacks all of the code, while the multiple non-transitory computer-readable storage media collectively store all of the code.
In at least one embodiment, the executable instructions are executed such that different instructions are executed by different processors: in at least one embodiment, a non-transitory computer-readable storage medium stores instructions, and a main central processing unit ("CPU") executes some of the instructions while a graphics processing unit ("GPU") executes other instructions. In at least one embodiment, different components of a computer system have separate processors, and different processors execute different subsets of the instructions.
Thus, in at least one embodiment, computer systems are configured to implement one or more services that singly or collectively perform the operations of the processes described herein, and such computer systems are configured with applicable hardware and/or software that enable the performance of the operations. Further, a computer system that implements at least one embodiment of the present disclosure is a single device or, in another embodiment, is a distributed computer system comprising multiple devices that operate differently, such that the distributed computer system performs the operations described herein and such that a single device does not perform all of the operations.
The use of any and all examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illuminate embodiments of the disclosure and does not pose a limitation on the scope of the disclosure unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the disclosure.
All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.
In the description and claims, the terms "coupled" and "connected," along with their derivatives, may be used. It should be understood that these terms may not be intended as synonyms for each other. Rather, in particular examples, "connected" or "coupled" may be used to indicate that two or more elements are in direct or indirect physical or electrical contact with each other. "coupled" may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
Unless specifically stated otherwise, it is appreciated that throughout the description, terms such as "processing," "computing," "calculating," "determining," or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulates and/or transforms data represented as physical quantities (e.g., electronic) within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices.
In a similar manner, the term "processor" may refer to any device or portion of a device that processes electronic data from registers and/or memory and transforms that electronic data into other electronic data that may be stored in registers and/or memory. As non-limiting examples, a "processor" may be a CPU or a GPU. A "computing platform" may comprise one or more processors. As used herein, in at least one embodiment, "software" processes may include software and/or hardware entities, such as tasks, threads, and intelligent agents, that perform work over time. Also, each process may refer to multiple processes for carrying out instructions in sequence or in parallel, continuously or intermittently. The terms "system" and "method" are used herein interchangeably to the extent that a system can embody one or more methods, and methods may be considered a system.
In at least one embodiment, an arithmetic logic unit is a set of combinational logic circuits that takes one or more inputs to produce a result. In at least one embodiment, a processor uses an arithmetic logic unit to implement mathematical operations such as addition, subtraction, or multiplication. In at least one embodiment, an arithmetic logic unit is used to implement logical operations such as logical AND, OR, or XOR. In at least one embodiment, an arithmetic logic unit is stateless and made from physical switching components such as semiconductor transistors arranged to form logic gates. In at least one embodiment, an arithmetic logic unit may operate internally as a stateful logic circuit with an associated clock. In at least one embodiment, an arithmetic logic unit may be constructed as an asynchronous logic circuit with an internal state not maintained in an associated register set. In at least one embodiment, a processor uses an arithmetic logic unit to combine operands stored in one or more registers of the processor and produce an output that can be stored by the processor in another register or a memory location.
In at least one embodiment, as a result of processing an instruction retrieved by a processor, the processor presents one or more inputs or operands to an arithmetic logic unit, causing the arithmetic logic unit to produce a result based at least in part on instruction code provided to the inputs of the arithmetic logic unit. In at least one embodiment, the instruction code provided by the processor to the ALU is based at least in part on instructions executed by the processor. In at least one embodiment, combinational logic in the ALU processes the inputs and produces outputs that are placed on a bus within the processor. In at least one embodiment, the processor selects a destination register, memory location, output device, or output storage location on the output bus, thereby clocking the processor such that the results produced by the ALU are sent to the desired location.
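The ALU behavior described above can be mirrored by a small stateless model in which an instruction code selects which combinational operation combines two operands. This is an illustrative toy over 8-bit values, not a description of any particular hardware:

```python
# A toy combinational ALU over 8-bit operands: the "instruction code" (opcode)
# selects which arithmetic or logical operation combines the two inputs.
OPS = {
    "ADD": lambda a, b: (a + b) & 0xFF,  # addition, truncated to 8 bits
    "SUB": lambda a, b: (a - b) & 0xFF,  # subtraction with two's-complement wraparound
    "MUL": lambda a, b: (a * b) & 0xFF,  # multiplication, truncated to 8 bits
    "AND": lambda a, b: a & b,           # bitwise logical AND
    "OR":  lambda a, b: a | b,           # bitwise logical OR
    "XOR": lambda a, b: a ^ b,           # bitwise logical XOR
}

def alu(opcode: str, a: int, b: int) -> int:
    """Stateless mapping from (opcode, operands) to a result, mirroring how a
    processor presents register operands to the ALU and stores its output."""
    return OPS[opcode](a & 0xFF, b & 0xFF)
```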
In this document, reference may be made to obtaining, acquiring, receiving or inputting analog or digital data into a subsystem, computer system or computer-implemented machine. In at least one embodiment, the process of obtaining, acquiring, receiving, or inputting analog and digital data may be accomplished in a variety of ways, such as by receiving data that is a parameter of a function call or call to an application programming interface. In some implementations, the process of obtaining, acquiring, receiving, or inputting analog or digital data may be accomplished by transmitting the data via a serial or parallel interface. In another implementation, the process of obtaining, acquiring, receiving, or inputting analog or digital data may be accomplished by transmitting the data from a providing entity to an acquiring entity via a computer network. Reference may also be made to providing, outputting, transmitting, sending or presenting analog or digital data. In various examples, the process of providing, outputting, transmitting, sending, or presenting analog or digital data may be implemented by transmitting the data as input or output parameters for a function call, parameters for an application programming interface, or an inter-process communication mechanism.
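As a concrete illustration of the paragraph above, digital data can be "obtained" as a parameter of a function or API call and "provided" as the call's return value, and can then be serialized for transfer from a providing entity to an acquiring entity. The function names and payload fields below are invented for illustration:

```python
import json
from typing import Dict

def convert_reading(sample: Dict[str, float]) -> Dict[str, float]:
    """Obtains digital data as a parameter of a function call and provides
    derived data as the call's return value."""
    celsius = sample["raw_tenths"] / 10.0
    return {"celsius": celsius, "fahrenheit": celsius * 9.0 / 5.0 + 32.0}

def to_wire(sample: Dict[str, float]) -> str:
    """Serializes the provided data for transmission between a providing entity
    and an acquiring entity (e.g., over a socket or an IPC channel)."""
    return json.dumps(convert_reading(sample), sort_keys=True)
```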
While the above discussion sets forth example implementations of the described technology, other architectures may be used to implement the described functionality and are intended to fall within the scope of the present disclosure. Furthermore, while specific assignments of responsibilities are defined above for purposes of discussion, various functions and responsibilities may be assigned and divided in different ways depending on the circumstances.
Furthermore, although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter claimed in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claims.

Claims (20)

1. A processor, comprising:
one or more circuits to execute a first application programming interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API.
2. The processor of claim 1, wherein the first API is to receive one or more input values indicative of one or more job identifiers of the one or more software workloads.
3. The processor of claim 1, wherein the one or more software workloads are to be identified by the first API based at least in part on an output value of a third API for executing the one or more software workloads.
4. The processor of claim 1, wherein the one or more software workloads are to be identified by the first API based at least in part on executing a third API to launch the one or more software workloads.
5. The processor of claim 1, wherein the one or more software workloads are executing using a high performance computing system.
6. The processor of claim 1, wherein the one or more software workloads are executed using one or more nodes of a high performance computing system.
7. The processor of claim 1, wherein the second API is to provide one or more output values indicative of one or more states of the one or more software workloads based at least in part on executing the second API to terminate execution.
8. A computer system, comprising:
one or more processors; and a memory for storing executable instructions that, if executed by the one or more processors, cause the one or more processors to execute a first application programming interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API.
9. The computer system of claim 8, wherein the first API is to receive one or more input values indicative of one or more job identifiers for the one or more software workloads.
10. The computer system of claim 8, wherein the one or more software workloads are to be identified by the first API based at least in part on an output value of a third API for executing the one or more software workloads.
11. The computer system of claim 8, wherein the one or more software workloads are to be identified by the first API based at least in part on executing a third API to launch the one or more software workloads.
12. The computer system of claim 8, wherein the one or more software workloads are executed using a high performance computing system.
13. The computer system of claim 8, wherein the one or more software workloads are executed using one or more nodes of a high performance computing system.
14. The computer system of claim 8, wherein the second API is to provide one or more output values indicative of one or more states of the one or more software workloads based at least in part on executing the second API to terminate execution.
15. A computer-implemented method, comprising:
executing a first application programming interface (API) to select a second API to terminate execution of one or more software workloads identified by the first API.
16. The computer-implemented method of claim 15, wherein the first API is to receive one or more input values indicative of one or more job identifiers for the one or more software workloads.
17. The computer-implemented method of claim 15, wherein the one or more software workloads are to be identified by the first API based at least in part on executing a third API to launch the one or more software workloads.
18. The computer-implemented method of claim 15, wherein the one or more software workloads are executed using a deep learning computing system.
19. The computer-implemented method of claim 15, wherein the one or more software workloads are executed using one or more nodes of a deep learning computing system.
20. The computer-implemented method of claim 15, wherein the second API is to provide one or more output values indicative of one or more states of the one or more software workloads based at least in part on executing the second API to terminate execution of the one or more software workloads.
CN202311084053.1A 2022-08-25 2023-08-25 Application programming interface for terminating software workload Pending CN117632469A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US63/400,887 2022-08-25
US18/219,017 2023-07-06
US18/219,017 US20240069973A1 (en) 2022-08-25 2023-07-06 Application programming interface to terminate software workloads

Publications (1)

Publication Number Publication Date
CN117632469A 2024-03-01

Family

ID=90020549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311084053.1A Pending CN117632469A (en) 2022-08-25 2023-08-25 Application programming interface for terminating software workload



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination