WO2024060256A1 - Self-evolving and multi-versioning code

Info

Publication number
WO2024060256A1
Authority
WO
WIPO (PCT)
Prior art keywords
checkpoint
file
jit
inference
code
Prior art date
Application number
PCT/CN2022/121108
Other languages
French (fr)
Inventor
Junyong Ding
Yuan Chen
Wenyong HUANG
Xin Wang
Mohammad Reza HAGHIGHAT
Original Assignee
Intel Corporation
Application filed by Intel Corporation
Priority to PCT/CN2022/121108
Publication of WO2024060256A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44: Arrangements for executing specific programs
    • G06F9/455: Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines

Definitions

  • This disclosure relates in general to the field of software compilation, and more particularly, though not exclusively, to systems and methods for self-evolving and multi-versioning code.
  • A software compiler generally translates a high-level software language (a source language) into native machine code optimized for execution by the specific architecture of a host machine.
  • A host receiving a source language may employ ahead-of-time (AOT) compilation or just-in-time (JIT) compilation.
  • FIG. 1 is a simplified illustration of an apparatus in an operating environment, in accordance with various embodiments.
  • FIG. 2 is a flowchart for an example method for self-evolving and multi-versioning code, in accordance with various embodiments.
  • FIG. 3 provides a process flow diagram for an ahead-of-time compiler, in accordance with various embodiments.
  • FIG. 4 provides a process flow for a just-in-time compiler, in accordance with various embodiments.
  • FIG. 5 is an example use case illustrating the use of instrument checkpoint and inference operators.
  • FIG. 6 illustrates examples of configurations of Web Assembly runtime environments and respective Web Assembly System Interfaces.
  • FIG. 7 is a block diagram of an example compute node that may include any of the embodiments disclosed herein.
  • FIG. 8 illustrates a multi-processor environment in which embodiments may be implemented.
  • FIG. 9 is a block diagram of an example processor to execute computer-executable instructions as part of implementing technologies described herein.
  • A software compiler generally translates a high-level software language (a source language) into native machine code optimized for execution by the specific architecture of a host architecture or apparatus (e.g., a host processing unit, such as a complex instruction set computer, “CISC,” or reduced instruction set computer, “RISC,” that has a specific machine architecture and language).
  • the software compiler may use techniques such as ahead of time (AOT) compiling or just in time (JIT) compiling.
  • AOT compiling generally refers to a build or translation step that occurs before execution.
  • JIT compiling (“jitting”) generally refers to compiling source or intermediate code into machine code while the program is executing.
  • JIT compiling is often performed instruction by instruction, so it can slow performance, but it provides an opportunity to dynamically review runtime information to improve runtime performance; this procedure is called profile guided optimization (PGO).
  • Hardware event sampling, based on monitoring predetermined hardware events/interruptions, has a relatively low overhead compared to other methods. Hardware event sampling is good for profile collection and for running optimized native machine code, but it usually requires administrative permission to allow access to system hardware interruptions. Additionally, the profile-to-source-code-position correlation relies on debug information, which may reduce the precision of the hardware event sampler when it is profiling optimized native machine code.
  • The other common PGO profiling method is an instrumented profiler that runs counters through callbacks as instructions are executed.
  • An instrumented profiler requires the additional callbacks and normally runs on top of non-optimized native machine code.
  • The profile generated by the instrumented profiler can be more precise than the profile generated by the event sampler, although the runtime cost of the instrumented profiler can be higher due to the added callbacks.
  • a software compiler employing PGO during JIT compiling can dynamically acquire runtime information and use it to dynamically recompile parts of the executed native machine code, and based thereon, generate a more efficient native machine code. If the dynamic profile changes during execution, the software compiler can deoptimize the previous native machine code and generate a new native machine code that is optimized with the runtime information from the new profile.
  • A similar mechanism applies to compiling the Web Assembly (WASM) language when running a software compiler inside an embedder and compiling or optimizing through that compiler.
  • Embodiments propose a technical solution for the above-described inefficiencies in the form of systems and methods for self-evolving and multi-versioning code.
  • Embodiments use checkpoints to collect globals and input/parameter values for key functions, for use in a profile inference process that improves WASM execution.
  • Embodiments generate profile inference operators and populate a separate WASM file with the inference operators ahead-of-time.
  • A background thread monitors profile changes, the inference operators, hot and cold branches, counts, and frequently accessed memory locations (address and size) to direct profile guided optimization (PGO).
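  • By way of illustration only, the following C sketch shows the kind of record such a checkpoint might populate; the struct layout, names, and hook are assumptions for this sketch, not taken from the disclosure:

```c
#include <stdint.h>
#include <stdio.h>

/* One raw-profile record per checkpoint (layout is illustrative). */
typedef struct {
    uint32_t func_id;      /* which instrumented key function fired */
    int64_t  param_value;  /* input/parameter value observed at entry */
    int64_t  global_value; /* a global accessed by the key function */
    uint64_t hit_count;    /* how often this checkpoint was reached */
} checkpoint_record;

/* Invoked from the instrumented function entry; a real runtime would
 * append into the checkpoint raw profile in shared memory. */
static void checkpoint_hit(checkpoint_record *rec, int64_t param, int64_t glob) {
    rec->param_value  = param;
    rec->global_value = glob;
    rec->hit_count++;
}

int main(void) {
    checkpoint_record rec = { .func_id = 1 };
    checkpoint_hit(&rec, 42, 7);
    printf("func %u: %llu hits, last param %lld\n", rec.func_id,
           (unsigned long long)rec.hit_count, (long long)rec.param_value);
    return 0;
}
```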
  • The terms “module,” “functional block,” “block,” “system,” and “engine” may be used herein, with functionality attributed to them.
  • Such functionality may be implemented by a processor (e.g., a CPU, a reduced instruction set computer (RISC), a complex instruction set computer (CISC), a compute node, a graphics processing unit (GPU)), a processing system, discrete logic or circuitry, an application specific integrated circuit, a field programmable gate array, etc., or a combination thereof.
  • the approaches and methodologies presented herein can be utilized in various computer-based environments (including, but not limited to, virtual machines, web servers, and stand-alone computers) , edge computing environments, network environments, and/or database system environments.
  • the terms “operating” , “executing” , or “running” as they pertain to software or firmware in relation to a processing unit, compute node, system, device, platform, or resource are used interchangeably and can refer to software or firmware stored in one or more computer-readable storage media accessible by the system, device, platform or resource, even though the software or firmware instructions are not actively being executed by the system, device, platform, or resource.
  • circuitry can comprise, singly or in any combination, non-programmable (hardwired) circuitry, programmable circuitry such as processors, state machine circuitry, and/or firmware that stores instructions executable by programmable circuitry.
  • Some embodiments may have some, all, or none of the features described for other embodiments.
  • “First,” “second,” “third,” and the like describe a common object and indicate different instances of like objects being referred to. Such adjectives do not imply that objects so described must be in a given sequence, temporally or spatially, in ranking, or in any other manner.
  • operating environment 100 includes a simplified illustration of a host 104 configured to receive high level languages, referred to as source language or source code, run a browser, and parse a web page.
  • the host 104 is in operational communication with a source 102 of the source language or a JavaScript file (.jsp).
  • the host 104, generally via communication circuitry 118, performs instruction monitoring.
  • the described compile operations can be interactive with a browser at the source 102, meaning that data and commands may be exchanged between the host 104 and the source 102, generally via communication circuitry 118.
  • the source 102 may be one of a plurality of sources that each independently may transmit a source language, JavaScript file or WASM file to the host 104.
  • the host 104 relies on at least one CPU, indicated generally with processor 106, and together they embody a language and hardware architecture (also referred to herein as a host architecture or apparatus) .
  • the host 104 includes at least one storage component, indicated generally with storage device 116.
  • Storage device 116 may be any combination of memory, disk, cache, etc., and may store, inter alia, instructions and parameters and data that are utilized in the operation of the compiler 110 described herein.
  • the host 104 may be a complex computer node or computer processing system, and may include or be integrated with many more components and peripheral devices (see, for example, FIG. 7, compute node 700, and FIG. 8, computing system 800) .
  • the host 104 architecture includes or is upgraded to include an enhanced compiler 110 facilitating self-evolving and multi-versioning code, as described herein.
  • the compiler 110 can be realized as hardware (circuitry) or an algorithm or set of rules embodied in software (e.g., stored in the storage device 116) and executed by the processor 106.
  • the compiler 110 is depicted as a separate functional block or module for discussion; however, in practice, the compiler 110 may be integrated with the host processor 106 as software, hardware, or a combination thereof. Accordingly, the compiler 110 may be updated during updates to the host 104 software, such as boot or during runtime.
  • a high-level source language file ( “input file” ) may be received by the compiler 110, starting a compile operation.
  • the compiler 110 may reference a host library 108.
  • the host specific library 108 is configured with microcode (also referred to as machine code) instructions that are native to the host 104 architecture, so that the compile operation effectively translates the incoming source language into native machine code.
  • the compile operation may take the form of two different threads of overall compiler 110 operation.
  • an ahead-of-time module (AOT 112) translates the high-level source language file into a wasm file (a main.wasm file) and also inserts one or more checkpoints into a function.
  • the AOT 112 may store the checkpoints into a file of checkpoints (referred to as an inference.wasm file).
  • An AOT wasm compile operation can be performed prior to running in a browser (e.g., source 102) , and the output of the AOT wasm compile operation can be part of the source language file received (Fig. 2, 204) .
  • a “Wasm JIT compile” operation may be performed to translate the main.wasm file to JIT’d code that includes both the checkpoints and inference operators described in more detail herein.
  • the AOT 112 module is part of a compiler 110. In other embodiments, the functionality performed by the AOT 112 module may be distributed among other components or processors within the source 102. Further, the AOT results (“main.wasm” and “inference.wasm”) may be pre-built and can be distributed through source 102, and made available for other components within source 102, just like other wasm or .js files at runtime.
  • JIT 114 consumes the main.wasm file and the inference.wasm file, generating self-evolving JIT’d code and performing execution and PGO profiling based thereon.
  • a JIT compile operation (e.g., that may be part of background module 122) can compile the main.wasm and inference.wasm to JIT’d code that is then executed in the JIT 114 module.
  • the JIT 114 may be organized as an execution module 120 in communication with a background module 122. As part of the self-evolving process, the JIT 114 can perform a JIT compile operation on the main.wasm and inference.wasm, generating therefrom JIT’d code to be executed in the JIT 114. In various embodiments, the JIT compile can be performed in the background module 122 and the JIT execution can occur in the execution module 120.
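  • As a rough sketch of this split (all names hypothetical, not from the disclosure), the following C program runs an execution thread that updates checkpoint counters while a background thread polls them for a profile change, loosely mirroring the execution module 120 and background module 122:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_ulong taken_count;  /* e.g., a hot-branch counter */
static atomic_ulong total_count;
static atomic_int   done;

/* Execution thread: runs the program and updates checkpoint counters. */
static void *execution_thread(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        atomic_fetch_add(&total_count, 1);
        if (i % 10 != 0) /* this branch is taken ~90% of the time */
            atomic_fetch_add(&taken_count, 1);
    }
    atomic_store(&done, 1);
    return NULL;
}

/* Background thread: polls the raw counters and flags a hot branch. */
static void *background_thread(void *arg) {
    int reported = 0;
    (void)arg;
    while (!atomic_load(&done)) {
        unsigned long total = atomic_load(&total_count);
        unsigned long taken = atomic_load(&taken_count);
        if (!reported && total > 1000 && taken * 100 > total * 80) {
            puts("profile: branch is hot; candidate for PGO recompilation");
            reported = 1;
        }
        usleep(1000); /* poll rather than interrupt the execution thread */
    }
    return NULL;
}

int main(void) {
    pthread_t exec_t, bg_t;
    pthread_create(&bg_t, NULL, background_thread, NULL);
    pthread_create(&exec_t, NULL, execution_thread, NULL);
    pthread_join(exec_t, NULL);
    pthread_join(bg_t, NULL);
    return 0;
}
```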
  • FIG. 2 provides a flowchart 200 for an example method for self-evolving and multi-versioning code.
  • FIG. 3 provides a process flow for ahead-of-time operations (AOT 112) and
  • FIG. 4 provides a process flow for various just in time operations (JIT 114) .
  • the method 200 may be performed by a processor 106 (e.g., a CISC machine), a compute node (FIG. 7, 700), or a processing system (FIG. 8, 800).
  • method 200 may refer to elements mentioned in connection with FIGS. 1, 3, or 4.
  • portions of method 200 may be performed by different components of the described system environment 100.
  • method 200 may include any number of additional or alternative operations and tasks, the tasks shown in FIG. 2 need not be performed in the illustrated order, and method 200 may be incorporated into a more comprehensive procedure or method having additional functionality not described in detail herein.
  • one or more of the tasks shown in FIG. 2 could be omitted from an embodiment of the method 200 if the intended overall functionality remains intact.
  • the host may be running a browser and parsing a website.
  • a high-level source language file ( “input file” ) may be received by the host 104.
  • the host 104 could receive an input file in the form of a .jsp file or a wasm file.
  • the AOT compile operations may be performed ahead of time (such as, before running a browser at 202) and outputs from the AOT compile may be pre-built. In these embodiments, outputs from the AOT compile operations may be integrated into, or be part of, receiving the source language file at 204.
  • an ahead of time (AOT) wasm compile 304 operation can be performed by AOT 112. This AOT operation may include identifying at 208 one or more functions in the input file (source language 302) and inserting one or more checkpoints into the input file at 210.
  • the AOT compile operations may include a wasm JIT compile to translate the wasm file to JIT’d code that includes the checkpoints 210 and the inference operators 212.
  • more than one checkpoint may be inserted into each of more than one function.
  • the checkpoints may be placed at a specific location with respect to the function, e.g., at the start of the function, and may include specific parameters for which to collect values.
  • the location is predefined as the entry of the key function, and the parameter value to collect at that location is also predefined.
  • Other parameter values to collect can include global data accessed by the key function, or the like.
  • more than one parameter value may be collected for a checkpoint.
  • This operation at 210 may be referred to as creating a checkpoint instrument for the input file; the output of this operation, the checkpoint instrumented 306 input file, is a wasm file and can be stored, e.g., into a main.wasm file.
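  • A source-level view of what such instrumentation could look like is sketched below in C; the checkpoint hook, its signature, and the function identifier are illustrative assumptions:

```c
#include <stdint.h>
#include <stdio.h>

void checkpoint(uint32_t func_id, int64_t value); /* runtime hook */

/* A key function with a checkpoint inserted at its entry. */
int key_function(int day) {
    checkpoint(/*func_id=*/7, day); /* inserted by the AOT compile */
    return day * 2;                 /* original function body */
}

/* Toy hook: a real runtime would append the value to the raw profile. */
void checkpoint(uint32_t func_id, int64_t value) {
    printf("checkpoint %u observed value %lld\n", func_id, (long long)value);
}

int main(void) { return key_function(3) == 6 ? 0 : 1; }
```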
  • embodiments of the AOT 112 compiler may also create at 212 one or more profile inference operators 308 to calculate values at other instrumentation points, for use in general PGO profiling.
  • the AOT 112 may populate an inference.wasm file with the one or more profile inference operators 308.
  • the AOT compile 304 builds a profile operator to infer a conditional branch from input parameter or global value ranges. For memory accesses on loads or stores, it builds an operator from checkpoints to infer the memory address(es) to be accessed.
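  • The following C fragment sketches what such inference operators might compute; the function body being inferred and the operator names are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

/* For a body like:
 *     if (n < 64) sum += table[n];
 * these operators recompute the branch condition and the load address
 * from the value of n captured at the function-entry checkpoint. */
static int infer_branch_taken(int64_t n) { return n < 64; }

static uintptr_t infer_load_address(uintptr_t table_base, int64_t n) {
    return table_base + (uintptr_t)n * sizeof(int32_t);
}

int main(void) {
    int32_t table[64] = { 0 };
    int64_t checkpointed_n = 10; /* value recorded by the checkpoint */
    printf("branch taken: %d, load address: %p\n",
           infer_branch_taken(checkpointed_n),
           (void *)infer_load_address((uintptr_t)table, checkpointed_n));
    return 0;
}
```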
  • Output 213 includes the source language file from 204 and the respective output from operations at 206. Output 213 is consumed by the JIT 114 module at 214.
  • the operations 402 represent the main wasm execution thread 120, which includes a functional block 403, whose functions are generally present on a client device or host (performing, responsive to an input wasm file or an input JavaScript file, the operations of fetch 404, baseline compile 406, instantiate 410, and interpret 408, to thereby translate the input file into JIT’d code for execution) . Additionally, operations 402 include technical improvements in the form of the wasm execution and profiling 412 operations and the creation of the checkpoint instrumented intermediate representation (IR) graph 414.
  • wasm execution and profiling (412) is performed on self-evolving JIT’d code 428 generated in the background thread 122.
  • Output from wasm execution and profiling (412) can include a profiler runtime data dump that generates checkpoint raw profiles (e.g., checkpoint raw profile 416) .
  • the checkpoint raw profiles 416 can be stored into a shared memory file and transmitted to a profile monitor 418 in the background/inference thread 430.
  • the background/inference thread 430 also receives the profile inference operators 308 in the inf.wasm file and follows with a fetch 420 and a compile 422 based thereon.
  • a profile inference 424 operation receives the compile 422 output and processes it with the profile monitor 418 output to determine whether there has been a change from an initial profile. Said differently, the profile inference 424 operation determines whether an actual execution of a program matches a previously generated profile that is based on previous passes through the JIT compile 400.
  • embodiments may build an intermediate representation (IR) graph and attach the inference operators 308 to nodes in the IR graph, thereby creating a checkpoint instrumented IR graph 414.
  • the checkpoint instrumented IR graph 414 can be stored in a disk or other storage device (e.g., storage device 116) .
  • the checkpoint instrumented IR graph 414 can be compiled with output from the profile inference 424 operation (PGO compilation 426) . Pulling together the previous operations, the output from PGO compilation 426 represents monitored profile changes, the inference operators, hot and cold branches, counts, and frequently accessed spaces (address and size) . Output from the PGO compilation 426 can be combined with the baseline compiled (406) output from the main wasm execution thread 120; and, upon determining there has been a profile change, embodiments will regenerate the JIT’d code that the wasm execution and profiling 412 engine operates on (i.e., the herein described self-evolving JIT’d code 428) .
  • the first time through, the wasm execution and profiling 412 engine may rely on initial assumptions, including an initial profile; a profile change may be detected when a deviation in at least one value in the raw profile exceeds the respective profile inference value.
  • the compiler 110 regenerates the JIT’d code 428 when an inference operator is exceeded by a corresponding parameter value in the checkpoint raw profile.
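  • A toy version of this recompilation trigger might look like the following; the tolerance parameter and function names are assumptions of this sketch:

```c
#include <stdio.h>
#include <stdlib.h>

/* Flag a profile change when a raw-profile value deviates from its
 * inferred value by more than a tolerance. */
static int profile_changed(long long raw_value, long long inferred_value,
                           long long tolerance) {
    return llabs(raw_value - inferred_value) > tolerance;
}

int main(void) {
    if (profile_changed(/*raw=*/900, /*inferred=*/100, /*tolerance=*/50))
        puts("deviation exceeds the inference value: regenerate JIT'd code");
    return 0;
}
```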
  • the method 200 may end or repeat.
  • FIG. 5 provides an example use case illustrating how the instrument checkpoint and inference operators facilitate PGO optimization.
  • the code snippet for a function is shown at 500 and the corresponding flowchart is illustrated at 502.
  • the AOT 112 of the compiler 110 inserted a checkpoint (an instrument checkpoint) at the beginning of the function to collect the input value of “day.”
  • the profile monitoring operation (FIG. 4, 418) identifies/determines that “day” is in the range {Monday, Tuesday, Friday} more frequently than the remaining values, never or seldom being Thursday or Wednesday.
  • the compiler 110 then moves to infer the full profile (FIG. 4, 424).
  • blocks 1-N represent blocks of code; the first time through the compiler 110, the blocks are original wasm code. Thereafter, the compiler 110 can self-evolve the code (i.e., self-evolve the layout of the switch case conditions 506 and corresponding blocks 1-N), perform a direct inline decision if any block calls another function (not shown), and conduct more advanced PGO optimizations.
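  • The FIG. 5 snippet itself is not reproduced in this text, but a hypothetical reconstruction in C, with the hot cases grouped first after self-evolution, could look like this:

```c
#include <stdio.h>

enum day { MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY };

const char *handle(enum day day) {
    /* checkpoint(func_id, day);  <- conceptually inserted at entry */
    switch (day) {
    case MONDAY:    return "block 1"; /* hot: self-evolution groups  */
    case TUESDAY:   return "block 2"; /* the frequent cases so they  */
    case FRIDAY:    return "block 5"; /* are tested/laid out first   */
    case WEDNESDAY: return "block 3"; /* seldom taken */
    case THURSDAY:  return "block 4"; /* seldom taken */
    }
    return "default";
}

int main(void) { puts(handle(FRIDAY)); return 0; }
```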
  • Embodiments can use a similar profiling mechanism to optimize data placement during memory accesses. For example, some technologies give access to multiple different storage devices for memory accesses (e.g., registers, level 0 cache, level 1 cache, level 2 cache), and each of these storage devices has a different access speed. In these scenarios, the compiler 110 can optimize data placement. This advantageously brings non-volatility into memory consumption calculations, which is critical to Web performance and experience.
  • the AOT 112 can have some of the WASM runtime features embedded into its compiled binary code, and subsequent passes through the compiler 110 can be used to improve/optimize the AOT 112.
  • provided embodiments enable the monitoring of profile changes, predefined inference operators, hot and cold branches, counts, and frequently accessed spaces (address and size) to direct profile guided optimization (PGO) . Further, some additional advantages and applications as applied to the AOT 112 module are provided below.
  • An AOT runner module may comprise an AOT module loader, an AOT module instantiator, and an AOT module runner. Then, upon a compile operation, the AOT runtime includes the previously described AOT 112 plus the AOT runner.
  • the AOT runtime is to accept the WASM file or the AOT file; in scenarios in which the input is the WASM file, AOT 112 is to compile the WASM file into an AOT file first.
  • the WASM file is divided into a bytecode part and a non-bytecode part.
  • the bytecode part is compiled into AOT code and the necessary rodata; the rodata is used to apply relocations to the AOT code.
  • the AOT code always accesses the AOT 112 module instance through its first function argument: each AOT function's first argument is the handle of the module instance.
  • because the AOT code doesn't access global variables in the object file, no global variables are generated in the object file.
  • the generated AOT file includes: the original WASM file, the AOT code, the rodata, and the relocation records.
  • the WASM file in the AOT file is used to create the AOT module's data, e.g., WASM global/table/memory and their instances.
  • the AOT code, rodata and relocation records are used to allocate the machine code. Relocations are applied.
  • the AOT function pointers are stored inside the AOT 112 module.
  • the AOT module instance's handle is passed as the AOT function's first argument, so that AOT code can access the AOT module instance's data.
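  • Sketched in C, this calling convention might look as follows; the instance layout shown is an assumption of this sketch, not the actual format:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical module-instance layout: all module state the AOT code
 * needs is reached through this handle rather than object-file globals. */
typedef struct {
    uint8_t *linear_memory; /* base of the WASM linear memory */
    int64_t *globals;       /* instantiated WASM globals */
} module_instance;

/* An AOT-compiled WASM function: the instance handle is always the
 * first argument. */
static int64_t aot_func(module_instance *inst, int64_t x) {
    return inst->globals[0] + x;
}

int main(void) {
    int64_t globals[1] = { 40 };
    module_instance inst = { .linear_memory = NULL, .globals = globals };
    printf("%lld\n", (long long)aot_func(&inst, 2));
    return 0;
}
```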
  • the profiling data may be generated as described above. After the PGO profiling data is generated, it can be input into the AOT 112 compiler, and the AOT 112 compiler generates the new AOT file.
  • WASM is a collaboratively developed portable low-level bytecode designed to improve upon the deficiencies of JavaScript.
  • WASM was developed with a component model in which code is organized in modules that have a shared-nothing inter-component invocation.
  • a host 104 such as a virtual machine, container, or microservice, can be populated with multiple different WASM components (also referred to herein as WASM modules) .
  • the WASM modules interface using the shared-nothing interface, which enables fast instance-derived import calls.
  • the shared-nothing interface enables software and hardware optimization via adaptors.
  • a WASM module contains definitions for functions, globals, tables, and memories. The definitions can be imported or exported.
  • a module can define only one memory; that memory is a traditional linear memory that is mutable and may be shared. The code in a module is organized into functions. Functions can call each other, but functions cannot be nested.
  • Instantiating a module can be provided by a JavaScript virtual machine or an operating system. An instance of a module corresponds to a dynamic representation of the module, its defined memory, and an execution stack.
  • a WASM computation is initiated by invoking a function exported from the instance.
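  • For concreteness, a minimal module along these lines could be written in C and compiled with a wasm32 toolchain (e.g., clang --target=wasm32 -nostdlib -Wl,--no-entry); the attribute syntax shown is clang's WebAssembly import/export support, and the module and function names are illustrative:

```c
#include <stdint.h>

/* Imported function: provided by the host/embedder. */
__attribute__((import_module("host"), import_name("log")))
void host_log(int32_t value);

static int32_t counter; /* a module global */

/* Exported function: the embedder initiates computation by invoking it. */
__attribute__((export_name("bump")))
int32_t bump(int32_t by) {
    counter += by;     /* functions can call each other, but not nest */
    host_log(counter);
    return counter;
}
```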
  • WASMTIME is a jointly developed, industry-leading WebAssembly runtime; it includes a compiler for WASM written in Rust.
  • a Web Assembly System Interface (WASI) that may be host specific (processor specific) is used to enable application specific protocols (e.g., for machine language, for machine learning, etc.) for communication and data sharing between the software environment running WASM (WASMTIME) and other host components.
  • Embodiment 600 illustrates a WASM module 602 embodied as a direct command line interface (CLI) .
  • the WASI library 604 is referenced during WASMTIME CLI 606, and the operating system (OS) resources 608 of the host are utilized.
  • one or more WASI application programming interfaces 610 (“WASI API”) enable communication and data sharing between the components in embodiment 600.
  • Embodiment 630 illustrates a WASM module 632 in which WASMTIME and WASI are embedded in an application.
  • a portable WASM application 634 includes the WASI library 636 that is referenced during WASMTIME 638.
  • the portable WASM application 634 may be referred to as a user application.
  • Embodiment 630 may employ a host API 646 for communication and data sharing between the WASM application 634 and the host for certain operations, and employ multiple WASI implementations 640 for communication and data sharing between the portable WASM application 634 and the host OS resources 642 (indicated generally with WASI APIs 648) .
  • Embodiment 630 may represent a standalone environment, such as a standalone desktop, an Internet of Things (IOT) environment, or a cloud application (e.g., a content delivery network (CDN), function as a service (FaaS), an envoy proxy, or the like). In other scenarios, embodiment 630 may represent a resource constrained environment, such as in IOT, embedding, or the like.
  • the systems and methods described herein can be implemented in or performed by any of a variety of computing systems, including mobile computing systems (e.g., smartphones, handheld computers, tablet computers, laptop computers, portable gaming consoles, 2-in-1 convertible computers, portable all-in-one computers) , non-mobile computing systems (e.g., desktop computers, servers, workstations, stationary gaming consoles, set-top boxes, smart televisions, rack-level computing solutions (e.g., blade, tray, or sled computing systems) ) , and embedded computing systems (e.g., computing systems that are part of a vehicle, smart home appliance, consumer electronics product or equipment, manufacturing equipment) .
  • the term “computing system” includes compute nodes, computing devices, and systems comprising multiple discrete physical components.
  • the computing systems are located in a data center, such as an enterprise data center (e.g., a data center owned and operated by a company and typically located on company premises), a managed services data center (e.g., a data center managed by a third party on behalf of a company), a co-located data center (e.g., a data center in which data center infrastructure is provided by the data center host and a company provides and manages their own data center components (servers, etc.)), a cloud data center (e.g., a data center operated by a cloud services provider that hosts companies' applications and data, such as web applications, games, and conference call applications), or an edge data center (e.g., a data center, typically having a smaller footprint than other data center types, located close to the geographic area that it serves).
  • a compute node 700 includes a compute engine (referred to herein as “compute circuitry” ) 702, an input/output (I/O) subsystem 708, data storage 710, a communication circuitry subsystem 712, and, optionally, one or more peripheral devices 714.
  • the compute node 700 or compute circuitry 702 may perform the operations and tasks attributed to the host 104.
  • respective compute nodes 700 may include other or additional components, such as those typically found in a computer (e.g., a display, peripheral devices, etc.).
  • one or more of the illustrative components may be incorporated in, or otherwise form a portion of, another component.
  • the compute node 700 may be embodied as a single device such as an integrated circuit, an embedded system, a field-programmable gate array (FPGA) , a system-on-a-chip (SOC) , or other integrated system or device.
  • the compute node 700 includes or is embodied as a processor 704 and a memory 706.
  • the processor 704 may be embodied as any type of processor capable of performing the functions described herein (e.g., executing compile functions and executing an application) .
  • the processor 704 may be embodied as a multi-core processor (s) , a microcontroller, a processing unit, a specialized or special purpose processing unit, or other processor or processing/controlling circuit.
  • the processor 704 may be embodied as, include, or be coupled to an FPGA, an application specific integrated circuit (ASIC) , reconfigurable hardware or hardware circuitry, or other specialized hardware to facilitate performance of the functions described herein. Also in some examples, the processor 704 may be embodied as a specialized x-processing unit (xPU) also known as a data processing unit (DPU) , infrastructure processing unit (IPU) , or network processing unit (NPU) .
  • Such an xPU may be embodied as a standalone circuit or circuit package, integrated within an SOC, or integrated with networking circuitry (e.g., in a SmartNIC, or enhanced SmartNIC) , acceleration circuitry, storage devices, or AI hardware (e.g., GPUs or programmed FPGAs) .
  • Such an xPU may be designed to receive programming to process one or more data streams and perform specific tasks and actions for the data streams (such as hosting microservices, performing service management or orchestration, organizing, or managing server or data center hardware, managing service meshes, or collecting and distributing telemetry) , outside of the CPU or general-purpose processing hardware.
  • an xPU, an SOC, a CPU, and other variations of the processor 704 may work in coordination with each other to execute many types of operations and instructions within and on behalf of the compute node 700.
  • the memory 706 may be embodied as any type of volatile (e.g., dynamic random-access memory (DRAM) , etc. ) or non-volatile memory or data storage capable of performing the functions described herein.
  • Volatile memory may be a storage medium that requires power to maintain the state of data stored by the medium.
  • Non-limiting examples of volatile memory may include various types of random-access memory (RAM) , such as DRAM or static random-access memory (SRAM) .
  • the memory device is a block addressable memory device, such as those based on NAND or NOR technologies.
  • a memory device may also include a three-dimensional crosspoint memory device (e.g., 3D XPoint TM memory) , or other byte addressable write-in-place nonvolatile memory devices.
  • the memory device may refer to the die itself and/or to a packaged memory product.
  • 3D crosspoint memory may comprise a transistor-less stackable cross point architecture in which memory cells sit at the intersection of word lines and bit lines and are individually addressable and in which bit storage is based on a change in bulk resistance.
  • all or a portion of the memory 706 may be integrated into the processor 704.
  • the memory 706 may store various software and data used during operation such as one or more applications, data operated on by the application (s) , libraries, and drivers.
  • the compute circuitry 702 is communicatively coupled to other components of the compute node 700 via the I/O subsystem 708, which may be embodied as circuitry and/or components to facilitate input/output operations with the compute circuitry 702 (e.g., with the processor 704 and/or the main memory 706) and other components of the compute circuitry 702.
  • the I/O subsystem 708 may be embodied as, or otherwise include, memory controller hubs, input/output control hubs, integrated sensor hubs, firmware devices, communication links (e.g., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc. ) , and/or other components and subsystems to facilitate the input/output operations.
  • the I/O subsystem 708 may form a portion of a system-on-a-chip (SoC) and be incorporated, along with one or more of the processor 704, the memory 706, and other components of the compute circuitry 702, into the compute circuitry 702.
  • the one or more illustrative data storage devices 710 may be embodied as any type of devices configured for short-term or long-term storage of data such as, for example, memory devices and circuits, memory cards, hard disk drives, solid-state drives, or other data storage devices.
  • Individual data storage devices 710 may include a system partition that stores data and firmware code for the data storage device 710.
  • Individual data storage devices 710 may also include one or more operating system partitions that store data files and executables for operating systems depending on, for example, the type of compute node 700.
  • the communication circuitry 712 may be embodied as any communication circuit, device, transceiver circuit, or collection thereof, capable of enabling communications over a network between the compute circuitry 702 and another compute device (e.g., an edge gateway of an implementing edge computing system) .
  • the communication subsystem 712 may implement any of a number of wireless standards or protocols, including but not limited to Institute for Electrical and Electronic Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family), IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment, also known as Broadband Wireless Access or WiMAX), and the Long-Term Evolution (LTE) project along with any amendments, updates, and/or revisions (e.g., the advanced LTE project, the ultra-mobile broadband (UMB) project (also referred to as “3GPP2”), etc.).
  • the communication subsystem 712 may operate in accordance with a Global System for Mobile Communication (GSM) , General Packet Radio Service (GPRS) , Universal Mobile Telecommunications System (UMTS) , High Speed Packet Access (HSPA) , Evolved HSPA (E-HSPA) , or LTE network.
  • the communication subsystem 712 may operate in accordance with Enhanced Data for GSM Evolution (EDGE) , GSM EDGE Radio Access Network (GERAN) , Universal Terrestrial Radio Access Network (UTRAN) , or Evolved UTRAN (E-UTRAN) .
  • the communication subsystem 712 may operate in accordance with Code Division Multiple Access (CDMA) , Time Division Multiple Access (TDMA) , Digital Enhanced Cordless Telecommunications (DECT) , Evolution-Data Optimized (EV-DO) , and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond.
  • the communication subsystem 712 may operate in accordance with other wireless protocols in other embodiments.
  • the communication subsystem 712 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., IEEE 802.3 Ethernet standards) .
  • the communication subsystem 712 may include multiple communication components. For instance, a first communication subsystem 712 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication subsystem 712 may be dedicated to longer-range wireless communications such as global positioning system (GPS) , EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others.
  • the illustrative communication subsystem 712 includes an optional network interface controller (NIC) 720, which may also be referred to as a host fabric interface (HFI) .
  • the NIC 720 may be embodied as one or more add-in-boards, daughter cards, network interface cards, controller chips, chipsets, or other devices that may be used by the compute node 700 to connect with another compute device (e.g., an edge gateway node) .
  • the NIC 720 may be embodied as part of a system-on-a-chip (SoC) that includes one or more processors or included on a multichip package that also contains one or more processors.
  • the NIC 720 may include a local processor (not shown) and/or a local memory (not shown) that are both local to the NIC 720.
  • the local processor of the NIC 720 may be capable of performing one or more of the functions of the compute circuitry 702 described herein.
  • the local memory of the NIC 720 may be integrated into one or more components of the client compute node at the board level, socket level, chip level, and/or other levels.
  • a respective compute node 700 may include one or more peripheral devices 714.
  • peripheral devices 714 may include any type of peripheral device found in a compute device or server, such as audio input devices, a display, other input/output devices, interface devices, and/or other peripheral devices, depending on the type of compute node 700.
  • the compute node 700 may be embodied by a respective edge compute node (whether a client, gateway, or aggregation node) in an edge computing system or like forms of appliances, computers, subsystems, circuitry, or other components.
  • the compute node 700 may be embodied as any type of device or collection of devices capable of performing various compute functions.
  • Respective compute nodes 700 may be embodied as a type of device, appliance, computer, or other “thing” capable of communicating with other compute nodes that may be edge, networking, or endpoint components.
  • a compute device may be embodied as a personal computer, server, smartphone, a mobile compute device, a smart appliance, smart camera, an in-vehicle compute system (e.g., a navigation system) , a weatherproof or weather-sealed computing appliance, a self-contained device within an outer case, shell, etc., or other device or system capable of performing the described functions.
  • FIG. 8 illustrates a multi-processor environment in which embodiments may be implemented.
  • Processors 802 and 804 further comprise cache memories 812 and 814, respectively.
  • the cache memories 812 and 814 can store data (e.g., instructions) utilized by one or more components of the processors 802 and 804, such as the processor cores 808 and 810.
  • the cache memories 812 and 814 can be part of a memory hierarchy for the computing system 800.
  • the cache memories 812 can locally store data that is also stored in a memory 816 to allow for faster access to the data by the processor 802.
  • the cache memories 812 and 814 can comprise multiple cache levels, such as level 1 (L1) , level 2 (L2) , level 3 (L3) , level 4 (L4) and/or other caches or cache levels.
  • one or more levels of cache memory e.g., L2, L3, L4 can be shared among multiple cores in a processor or among multiple processors in an integrated circuit component.
  • the last level of cache memory on an integrated circuit component can be referred to as a last level cache (LLC) .
  • One or more of the higher cache levels (the smaller and faster caches) in the memory hierarchy can be located on the same integrated circuit die as a processor core, and one or more of the lower cache levels (the larger and slower caches) can be located on integrated circuit dies that are physically separate from the processor core integrated circuit dies.
  • a processor can take various forms such as a central processing unit (CPU), a graphics processing unit (GPU), general-purpose GPU (GPGPU), accelerated processing unit (APU), field-programmable gate array (FPGA), neural network processing unit (NPU), data processing unit (DPU), accelerator (e.g., graphics accelerator, digital signal processor (DSP), compression accelerator, artificial intelligence (AI) accelerator), controller, or other types of processing units.
  • the processor can be referred to as an XPU (or xPU) .
  • a processor can comprise one or more of these various types of processing units.
  • the computing system comprises one processor with multiple cores, and in other embodiments, the computing system comprises a single processor with a single core.
  • the terms “processor, ” “processor unit, ” and “processing unit” can refer to any processor, processor core, component, module, engine, circuitry, or any other processing element described or referenced herein.
  • the computing system 800 can comprise one or more processors that are heterogeneous or asymmetric to another processor in the computing system. There can be a variety of differences between the processing units in a system in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like. These differences can effectively manifest themselves as asymmetry and heterogeneity among the processors in a system.
  • the processors 802 and 804 can be located in a single integrated circuit component (such as a multi-chip package (MCP) or multi-chip module (MCM) ) or they can be located in separate integrated circuit components.
  • An integrated circuit component comprising one or more processors can comprise additional components, such as embedded DRAM, stacked high bandwidth memory (HBM) , shared cache memories (e.g., L3, L4, LLC) , input/output (I/O) controllers, or memory controllers. Any of the additional components can be located on the same integrated circuit die as a processor, or on one or more integrated circuit dies separate from the integrated circuit dies comprising the processors. In some embodiments, these separate integrated circuit dies can be referred to as “chiplets” .
  • the heterogeneity or asymmetry can be among processors located in the same integrated circuit component.
  • interconnections between dies can be provided by the package substrate, one or more silicon interposers, one or more silicon bridges embedded in the package substrate (such as embedded multi-die interconnect bridges (EMIBs) ) , or combinations thereof.
  • Processors 802 and 804 further comprise memory controller logic (MC) 820 and 822.
  • MCs 820 and 822 control memories 816 and 818 coupled to the processors 802 and 804, respectively.
  • the memories 816 and 818 can comprise various types of volatile memory (e.g., dynamic random-access memory (DRAM) , static random-access memory (SRAM) ) and/or non-volatile memory (e.g., flash memory, chalcogenide-based phase-change non-volatile memories) , and comprise one or more layers of the memory hierarchy of the computing system. While MCs 820 and 822 are illustrated as being integrated into the processors 802 and 804, in alternative embodiments, the MCs can be external to a processor.
  • Processors 802 and 804 are coupled to an Input/Output (I/O) subsystem 830 via point-to-point interconnections 832 and 834.
  • the point-to-point interconnection 832 connects a point-to-point interface 836 of the processor 802 with a point-to-point interface 838 of the I/O subsystem 830
  • the point-to-point interconnection 834 connects a point-to-point interface 840 of the processor 804 with a point-to-point interface 842 of the I/O subsystem 830.
  • Input/Output subsystem 830 further includes an interface 850 to couple the I/O subsystem 830 to a graphics engine 852.
  • the I/O subsystem 830 and the graphics engine 852 are coupled via a bus 854.
  • the Input/Output subsystem 830 is further coupled to a first bus 860 via an interface 862.
  • the first bus 860 can be a Peripheral Component Interconnect Express (PCIe) bus or any other type of bus.
  • Various I/O devices 864 can be coupled to the first bus 860.
  • a bus bridge 870 can couple the first bus 860 to a second bus 880.
  • the second bus 880 can be a low pin count (LPC) bus.
  • Various devices can be coupled to the second bus 880 including, for example, a keyboard/mouse 882, audio I/O devices 888, and a storage device 890, such as a hard disk drive, solid-state drive, or another storage device for storing computer-executable instructions (code) 892 or data.
  • the code 892 can comprise computer-executable instructions for performing methods described herein.
  • Additional components that can be coupled to the second bus 880 include communication device(s) 884, which can provide for communication between the computing system 800 and one or more wired or wireless networks 886 (e.g., Wi-Fi, cellular, or satellite networks) via one or more wired or wireless communication links (e.g., wire, cable, Ethernet connection, radio-frequency (RF) channel, infrared channel, Wi-Fi channel) using one or more communication standards (e.g., IEEE 802.11 standard and its supplements).
  • the communication devices 884 can comprise wireless communication components coupled to one or more antennas to support communication between the computing system 800 and external devices.
  • the wireless communication components can support various wireless communication protocols and technologies such as Near Field Communication (NFC), IEEE 802.11 (Wi-Fi) variants, WiMax, Bluetooth, Zigbee, 4G Long Term Evolution (LTE), Code Division Multiple Access (CDMA), Universal Mobile Telecommunication System (UMTS), Global System for Mobile Telecommunication (GSM), and 5G broadband cellular technologies.
  • the wireless modems can support communication with one or more cellular networks for data and voice communications within a single cellular network, between cellular networks, or between the computing system and a public switched telephone network (PSTN) .
  • the system 800 can comprise removable memory such as flash memory cards (e.g., SD (Secure Digital) cards), memory sticks, and Subscriber Identity Module (SIM) cards.
  • the memory in system 800 (including caches 812 and 814, memories 816 and 818, and storage device 890) can store data and/or computer-executable instructions for executing an operating system 894 and application programs 896.
  • Example data includes web pages, text messages, images, sound files, video data, biometric thresholds for particular users, or other data sets to be sent to and/or received from one or more network servers or other devices by the system 800 via the one or more wired or wireless networks 886, or for use by the system 800.
  • the system 800 can also have access to external memory or storage (not shown) such as external hard drives or cloud-based storage.
  • the operating system 894 (also simplified to “OS” herein) can control the allocation and usage of the components illustrated in FIG. 8 and support the one or more application programs 896.
  • the application programs 896 can include common computing system applications (e.g., email applications, calendars, contact managers, web browsers, messaging applications) as well as other computing applications.
  • a hypervisor (or virtual machine manager) operates on the operating system 894 and the application programs 896 operate within one or more virtual machines operating on the hypervisor.
  • the hypervisor is a type-2 or hosted hypervisor as it is running on the operating system 894.
  • the hypervisor is a type-1 or “bare-metal” hypervisor that runs directly on the platform resources of the computing system 800 without an intervening operating system layer.
  • the applications 896 can operate within one or more containers.
  • a container is a running instance of a container image, which is a package of binary images for one or more of the applications 896 and any libraries, configuration settings, and any other information that one or more applications 896 need for execution.
  • a container image can conform to any container image format, such as the Appc or LXC container image formats.
  • a container runtime engine, such as Docker Engine, LXC, or an open container initiative (OCI)-compatible container runtime (e.g., Railcar, CRI-O), operates on the operating system (or virtual machine monitor) to provide an interface between the containers and the operating system 894.
  • An orchestrator can be responsible for management of the computing system 800 and various container-related tasks such as deploying container images to the computing system 800, monitoring the performance of deployed containers, and monitoring the utilization of the resources of the computing system 800.
  • the computing system 800 can support various additional input devices, represented generally as user interfaces 898, such as a touchscreen, microphone, monoscopic camera, stereoscopic camera, trackball, touchpad, trackpad, proximity sensor, light sensor, electrocardiogram (ECG) sensor, PPG (photoplethysmogram) sensor, galvanic skin response sensor, and one or more output devices, such as one or more speakers or displays.
  • Other possible input and output devices include piezoelectric and other haptic I/O devices. Any of the input or output devices can be internal to, external to, or removably attachable with the system 800.
  • one or more of the user interfaces 898 may be natural user interfaces (NUIs) .
  • the operating system 894 or applications 896 can comprise speech recognition logic as part of a voice user interface that allows a user to operate the system 800 via voice commands.
  • the computing system 800 can comprise input devices and logic that allows a user to interact with the computing system 800 via body, hand, or face gestures. For example, a user’s hand gestures can be detected and interpreted to provide input to a gaming application.
  • the I/O devices 864 can include at least one input/output port comprising physical connectors (e.g., USB, IEEE 1394 (FireWire), Ethernet, RS-232), a power supply (e.g., battery), a global navigation satellite system (GNSS) receiver (e.g., GPS receiver), a gyroscope, an accelerometer, and/or a compass.
  • a GNSS receiver can be coupled to a GNSS antenna.
  • the computing system 800 can further comprise one or more additional antennas coupled to one or more additional receivers, transmitters, and/or transceivers to enable additional functions.
  • interconnections between components can employ interconnect technologies such as QuickPath Interconnect (QPI), Ultra Path Interconnect (UPI), Compute Express Link (CXL), cache coherent interconnect for accelerators (CCIX), serializer/deserializer (SERDES), NVLink, ARM Infinity Link, Gen-Z, or Open Coherent Accelerator Processor Interface (OpenCAPI).
  • FIG. 8 illustrates only one example computing system architecture.
  • Computing systems based on alternative architectures can be used to implement technologies described herein.
  • a computing system can comprise an SoC (system-on-a-chip) integrated circuit incorporating multiple processors, a graphics engine, and additional components.
  • a computing system can connect its constituent components via bus or point-to-point configurations different from that shown in FIG. 8.
  • the illustrated components in FIG. 8 are not required or all-inclusive, as shown components can be removed and other components added in alternative embodiments.
  • FIG. 9 is a block diagram of an example processor 900 to execute computer-executable instructions as part of implementing technologies described herein.
  • the processor 900 can be a single-threaded core or a multithreaded core in that it may include more than one hardware thread context (or “logical processor” ) per processor.
  • FIG. 9 also illustrates a memory 910 coupled to the processor 900.
  • the memory 910 can be any memory described herein or any other memory known to those of skill in the art.
  • the memory 910 can store computer-executable instructions 915 (code) executable by the processor 900.
  • the processor comprises front-end logic 920 that receives instructions from the memory 910.
  • An instruction can be processed by one or more decoders 930.
  • the decoder 930 can generate as its output a micro-operation such as a fixed width micro-operation in a predefined format, or generate other instructions, microinstructions, or control signals, which reflect the original code instruction.
  • the front-end logic 920 further comprises register renaming logic 935 and scheduling logic 940, which generally allocate resources and queue operations corresponding to converting an instruction for execution.
  • the processor 900 further comprises execution logic 950, which comprises one or more execution units (EUs) 965-1 through 965-N. Some processor embodiments can include a few execution units dedicated to specific functions or sets of functions. Other embodiments can include only one execution unit or one execution unit that can perform a particular function.
  • the execution logic 950 performs the operations specified by code instructions. After completion of execution of the operations specified by the code instructions, back-end logic 970 retires instructions using retirement logic 975. In some embodiments, the processor 900 allows out of order execution but requires in-order retirement of instructions. Retirement logic 975 can take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like) .
  • the processor 900 is transformed during execution of instructions, at least in terms of the output generated by the decoder 930, hardware registers and tables utilized by the register renaming logic 935, and any registers (not shown) modified by the execution logic 950.
  • Any of the disclosed methods can be implemented as computer-executable instructions (also referred to as machine readable instructions) or a computer program product stored on a computer readable (machine readable) storage medium. Such instructions can cause a computing system or one or more processors capable of executing computer-executable instructions to perform any of the disclosed methods.
  • the computer-executable instructions or computer program products as well as any data created and/or used during implementation of the disclosed technologies can be stored on one or more tangible or non-transitory computer-readable storage media, such as volatile memory (e.g., DRAM, SRAM), non-volatile memory (e.g., flash memory, chalcogenide-based phase-change non-volatile memory), optical media discs (e.g., DVDs, CDs), and magnetic storage (e.g., magnetic tape storage, hard disk drives).
  • Computer-readable storage media can be contained in computer-readable storage devices such as solid-state drives, USB flash drives, and memory modules.
  • any of the methods disclosed herein may be performed by hardware components comprising non-programmable circuitry.
  • any of the methods herein can be performed by a combination of non-programmable hardware components and one or more processing units executing computer-executable instructions stored on computer-readable storage media.
  • the computer-executable instructions can be part of, for example, an operating system of the host or computing system, an application stored locally to the computing system, or a remote application accessible to the computing system (e.g., via a web browser) . Any of the methods described herein can be performed by computer-executable instructions performed by a single computing system or by one or more networked computing systems operating in a network environment. Computer-executable instructions and updates to the computer-executable instructions can be downloaded to a computing system from a remote server.
  • implementation of the disclosed technologies is not limited to any specific computer language or program.
  • the disclosed technologies can be implemented by software written in C++, C#, Java, Perl, Python, JavaScript, Adobe Flash, assembly language, Web Assembly, or any other programming language.
  • the disclosed technologies are not limited to any particular computer system or type of hardware.
  • any of the software-based embodiments can be uploaded, downloaded, or remotely accessed through a suitable communication means.
  • suitable communication means include, for example, the Internet, the World Wide Web, an intranet, cable (including fiber optic cable) , magnetic communications, electromagnetic communications (including RF, microwave, ultrasonic, and infrared communications) , electronic communications, or other such communication means.
  • references in the specification to “one embodiment,” “an embodiment,” “an illustrative embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but it is not necessary that every embodiment include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • items included in a list in the form of "at least one of A, B, and C” can mean (A) ; (B) ; (C) ; (A and B) ; (A and C) ; (B and C) ; or (A, B, and C) .
  • items listed in the form of "at least one of A, B, or C” can mean (A) ; (B) ; (C) ; (A and B) ; (A and C) ; (B and C) ; or (A, B, and C) .
  • Example 1 is an apparatus comprising: a processor; a compiler executable by the processor to: receive a high-level source language file; translate the high-level source language file into a web assembly (wasm) file; insert a checkpoint into a function in the wasm file, the checkpoint to collect a parameter value; perform a just in time (JIT) compile operation on the wasm file with the checkpoint to generate just in time compiled (JIT’d) code; execute the JIT’d code to emit a checkpoint raw profile with the parameter value.
  • Example 2 includes the subject matter of Example 1, wherein the compiler is further to insert the checkpoint at a specific location with respect to the function.
  • Example 3 includes the subject matter of Example 1, wherein the checkpoint is further to collect multiple parameter values, and the compiler is further to: execute the JIT’d code and emit the multiple parameter values.
  • Example 4 includes the subject matter of Example 3, wherein the compiler is further to: generate an inference operator for the wasm file; monitor a profile based on the inference operator and the multiple parameter values; regenerate the JIT’d code when there has been a change in the profile.
  • Example 5 includes the subject matter of Example 4, wherein the inference operator is to inference when a conditional branch will occur from an input parameter.
  • Example 6 includes the subject matter of Example 4, wherein the inference operator is to inference when a conditional branch will occur from a global value range.
  • Example 7 includes the subject matter of Example 4, wherein the inference operator is to inference when a specific memory access is to occur.
  • Example 8 includes the subject matter of Example 1, wherein the checkpoint is a first checkpoint, and wherein the compiler is further to insert a second checkpoint in the wasm file, the second checkpoint to collect a value for global data accesses by the function.
  • Example 9 includes the subject matter of Example 1, wherein the compiler is further to: generate a plurality of inference operators for the wasm file; populate an inference file with the plurality of inference operators; reference the inference file to generate the JIT’d code on the wasm file with the checkpoint.
  • Example 10 includes the subject matter of Example 1, wherein the compiler performs a runtime data dump when executing the JIT’d code, thereby generating checkpoint raw profiles for use in monitoring profile changes.
  • Example 11 is a method comprising: at a processor, receiving a high-level source language file; executing a compiler; translating the high-level source language file into a web assembly (wasm) file; inserting a checkpoint into a function in the wasm file, the checkpoint to collect a parameter value; performing a just in time (JIT) compile operation on the wasm file with the checkpoint to generate just in time compiled (JIT’d) code; executing the JIT’d code to emit a checkpoint raw profile with the parameter value.
  • Example 12 includes the subject matter of Example 11, further comprising inserting the checkpoint at a specific location with respect to the function.
  • Example 13 includes the subject matter of Example 11, wherein the checkpoint is further to collect multiple parameter values, and further comprising executing the JIT’d code to emit a checkpoint raw profile with the multiple parameter values.
  • Example 14 includes the subject matter of Example 13, further comprising: generating an inference operator for the wasm file; monitoring the inference operator and the checkpoint raw profile; regenerating the JIT’d code when the inference operator is exceeded by a corresponding parameter value in the checkpoint raw profile.
  • Example 15 includes the subject matter of Example 14, wherein the inference operator is to inference when a conditional branch will occur from an input parameter.
  • Example 16 includes the subject matter of Example 14, wherein the inference operator is to inference when a conditional branch will occur from a global value range.
  • Example 17 includes the subject matter of Example 11, further comprising: generating a plurality of inference operators for the wasm file; storing the plurality of inference operators in an inference file; generating the JIT’d code further based on referencing the inference file.
  • Example 18 includes the subject matter of Example 11, wherein the checkpoint is a first checkpoint, and further comprising inserting a second checkpoint in the wasm file, the second checkpoint to collect a value for global data accesses by the function.
  • Example 19 includes the subject matter of Example 14, wherein the inference operator is to inference when a specific memory access is to occur.
  • Example 20 includes the subject matter of Example 11, wherein the compiler generates checkpoint raw profiles for use in monitoring profile changes by dumping runtime data while executing the JIT’d code.
  • Example 21 is one or more machine readable storage media having instructions stored thereon, the instructions when executed by a machine are to cause the machine to: receive a high-level source language file; translate the high-level source language file into a web assembly (wasm) file; insert a checkpoint into a function in the wasm file, the checkpoint to collect a parameter value; perform a just in time (JIT) compile operation on the wasm file with the checkpoint to generate just in time compiled (JIT’d) code; execute the JIT’d code to emit the parameter value.
  • Example 22 includes the subject matter of Example 21, wherein the instructions, when executed by the machine, are to cause the machine further to insert the checkpoint at a specific location with respect to the function.
  • Example 23 includes the subject matter of Example 21, wherein the instructions, when executed by the machine, are to cause the machine further to: insert the checkpoint into a function in the wasm file, the checkpoint to further collect multiple parameter values; execute the JIT’d code; emit a checkpoint raw profile with the multiple parameter values.
  • Example 24 includes the subject matter of Example 23, wherein the instructions, when executed by the machine, are to cause the machine further to: generate an inference operator for the wasm file; monitor a profile based on the inference operator and the checkpoint raw profile; regenerate the JIT’d code upon determining there has been a change in the profile.
  • Example 25 includes the subject matter of Example 24, wherein the instructions, when executed by the machine, are to cause the machine further to inference when a conditional branch will occur from an input parameter.
  • Example 26 includes the subject matter of Example 24, wherein the instructions, when executed by the machine, are to cause the machine further to inference when a conditional branch will occur from a global value range.
  • Example 27 includes the subject matter of Example 24, wherein the instructions, when executed by the machine, are to cause the machine further to inference when a specific memory access is to occur.
  • Example 28 includes the subject matter of Example 24, wherein the instructions, when executed by the machine, are to cause the machine further to define the checkpoint as a first checkpoint, and insert a second checkpoint in the wasm file, the second checkpoint to collect a value for global data accesses by the function.
  • Example 29 includes the subject matter of Example 24, wherein the instructions, when executed by the machine, are to cause the machine further to: generate a plurality of inference operators for the wasm file; populate an inference file with the plurality of inference operators; generate the JIT’d code based on the inference file and the wasm file with the checkpoint.
  • Example 30 includes the subject matter of Example 24, wherein the instructions, when executed by the machine, are to cause the machine further to dump runtime data during execution of the JIT’d code to generate checkpoint raw profiles for use in monitoring profile changes.

Abstract

Systems and methods for self-evolving and multi-versioning code. The system includes a compiler configured to translate a high-level source language file into a web assembly (wasm) file. One or more checkpoints are inserted into a function in the wasm file. The checkpoints are specified at various functions, to cause the compiler to collect parameter values where they are inserted. The compiler performs a just in time (JIT) compile operation on the wasm file with the checkpoint, generating enhanced JIT'd code. Inference operators can also be created in a background thread and included in the compile process to support profile monitoring. The JIT'd code is executed, and depending on the application, parameter values from the runtime environment can be emitted, as well as the values of the checkpoint parameter(s). Upon detecting a profile change, the compiler regenerates the JIT'd code.

Description

SELF-EVOLVING AND MULTI-VERSIONING CODE
FIELD OF THE SPECIFICATION
This disclosure relates in general to the field of software compilation, and more particularly, though not exclusively, to systems and methods for self-evolving and multi-versioning code.
BACKGROUND
A software compiler generally translates a high-level software language (a source language) into a native machine code optimized for execution by a specific architecture of a host machine. A host receiving a source language may employ an ahead of time (AOT) compilation or a just in time (JIT) compilation. The AOT and JIT compilation approaches each offer benefits and drawbacks. Continued improvements to software compilation are desirable.
BRIEF DESCRIPTION OF THE DRAWINGS
The present disclosure is best understood from the following detailed description when read with the accompanying figures.
FIG. 1 is a simplified illustration of an apparatus in an operating environment, in accordance with various embodiments.
FIG. 2 is a flowchart for an example method for self-evolving and multi-versioning code, in accordance with various embodiments.
FIG. 3 provides a process flow diagram for an ahead-of-time compiler, in accordance with various embodiments.
FIG. 4 provides a process flow for a just-in-time compiler, in accordance with various embodiments.
FIG. 5 is an example use case illustrating the use of instrument checkpoint and inference operators.
FIG. 6 illustrates examples of configurations of Web Assembly runtime environments and respective Web Assembly System Interfaces.
FIG. 7 is a block diagram of an example compute node that may include any of the embodiments disclosed herein.
FIG. 8 illustrates a multi-processor environment in which embodiments may be implemented.
FIG. 9 is a block diagram of an example processor to execute computer-executable instructions as part of implementing technologies described herein.
DETAILED DESCRIPTION
A software compiler generally translates a high-level software language (a source language) into a native machine code optimized for execution by a specific architecture of a host architecture or apparatus (e.g., a host processing unit, such as a complex instruction set computer, “CISC,” or reduced instruction set computer, “RISC,” that has a specific machine architecture and language). Often, the JIT compile operations are done in host software using host-specific libraries. The software environment in which compiling is done is called a runtime or runtime environment.
The software compiler may use techniques such as ahead of time (AOT) compiling or just in time (JIT) compiling. AOT compiling generally refers to a build or translate that occurs before execution. In contrast, JIT compiling ( “jitting” ) generally refers to compiling source or intermediate code into machine code while the machine code is executing. JIT compiling is often performed instruction by instruction, so it can slow performance, but JIT compiling provides an opportunity to dynamically review runtime information to improve runtime performance; this procedure is called profile guided optimization (PGO) .
Profile guided optimization (PGO) plays an important role in optimizing application performance by profiling runtime behavior and optimizing the compiler's output based thereon. Maximizing the benefit of PGO requires a precise correlation between the profile and exact positions in the source code, and a selection of a typical workload that is representative of most user experiences.
There are two common PGO profiling methods. Hardware event sampling, based on monitoring predetermined hardware events/interruptions, has a relatively low overhead compared to other methods. Hardware event sampling is good for profile collection while running optimized native machine code but usually requires administrative permission to allow access to system hardware interruptions. Additionally, the profile-to-source-code position correlation relies on debug information, which may reduce the precision of the hardware event sampler when it is profiling optimized native machine code.
The other common PGO profiling method is an instrumented profiler that runs counters through callbacks when instructions are executed. An instrumented profiler requires the additional callbacks and normally runs on top of non-optimized native machine code. The profile generated by the instrumented profiler can be more precise than the profile generated by the event sampler, although the runtime cost of the instrumented profiler can be higher due to the added callbacks.
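For illustration, the following is a minimal sketch of the instrumented approach, assuming hypothetical names (profile_hit, g_counters) rather than any particular profiler's API; it shows how a compiler-planted callback bumps a per-site counter, which is exactly the added runtime cost noted above:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-site counters; site ids are assigned at compile time,
// which is what gives the precise profile-to-source correlation.
static std::atomic<uint64_t> g_counters[1024];

// The instrumented build inserts a call to this at every probe site.
extern "C" void profile_hit(uint32_t site_id) {
    g_counters[site_id].fetch_add(1, std::memory_order_relaxed);
}

// Example of an instrumented function: the callback is the extra cost.
int add(int a, int b) {
    profile_hit(42);  // probe inserted by the compiler, not the author
    return a + b;
}
```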
Accordingly, a software compiler employing PGO during JIT compiling can dynamically acquire runtime information and use it to dynamically recompile parts of the executed native machine code, and based thereon, generate a more efficient native machine code. If the dynamic profile changes during execution, the software compiler can deoptimize the previous native machine code and generate a new native machine code that is optimized with the runtime information from the new profile. A similar mechanism applies to compiling Web Assembly language when running a software compiler inside an embedder and compiling or optimizing through a compiler.
As mentioned above, both common PGO methods impose limitations on profile quality and execution cost at runtime. If using an instrumented profile, the runtime overheads need to be reduced; if using event sampling, security concerns must also be accounted for when running with administrative access to system events. Additionally, these PGO methods can only provide information about function calls or executed functions; they do not enable other optimizations, such as code or function layout optimizations.
Provided embodiments propose a technical solution for the above-described inefficiencies in the form of systems and methods for self-evolving and multi-versioning code. Embodiments use checkpoints to collect globals and input/parameter values for key functions, for use in a profile inference process that improves WASM execution. Embodiments generate profile inference operators and populate a separate WASM file with the inference operators ahead of time. During WASM execution, a background thread monitors profile changes, the inference operators, hot and cold branches, counts, and frequently accessed memory locations (address and size) to direct profile guided optimization (PGO). Furthermore, other desirable features and characteristics of the system and method will become apparent from the subsequent detailed description and the appended claims, taken in conjunction with the accompanying drawings and the preceding background.
The terms “module,” “functional block,” “block,” “system,” and “engine” may be used herein, with functionality attributed to them. As one with skill in the art will appreciate, in various embodiments, the functionality of each of the modules/blocks/systems/engines described herein can individually or collectively be achieved in various ways, such as via an algorithm implemented in software and executed by a processor (e.g., a CPU, a reduced instruction set computer (RISC), a complex instruction set computer (CISC), a compute node, a graphics processing unit (GPU)), a processing system, as discrete logic or circuitry, as an application specific integrated circuit, as a field programmable gate array, etc., or a combination thereof. The approaches and methodologies presented herein can be utilized in various computer-based environments (including, but not limited to, virtual machines, web servers, and stand-alone computers), edge computing environments, network environments, and/or database system environments.
As used herein, the terms “operating” , “executing” , or “running” as they pertain to software or firmware in relation to a processing unit, compute node, system, device, platform, or resource, are used interchangeably and can refer to software or firmware stored in one or more computer-readable storage media accessible by the system, device, platform or resource, even though the software or firmware instructions are not actively being executed by the system, device, platform, or resource.
As used herein, the term “circuitry” can comprise, singly or in any combination, non-programmable (hardwired) circuitry, programmable circuitry such as processors, state machine circuitry, and/or firmware that stores instructions executable by programmable circuitry.
Some embodiments may have some, all, or none of the features described for other embodiments. “First, ” “second, ” “third, ” and the like describe a common object and indicate different instances of like objects being referred to. Such adjectives do not imply objects so described must be in a given sequence, either temporally or spatially, in ranking, or any other manner.
Reference is now made to the drawings, which are not necessarily drawn to scale, wherein similar or same numbers may be used to designate same or similar parts in different figures. The use of similar or same numbers in different figures does not mean all figures including similar or same numbers constitute a single or same embodiment. Like numerals having different letter  suffixes may represent different instances of similar components. Elements described as “connected” may be in direct physical or electrical contact with each other, whereas elements described as “coupled” may co-operate or interact with each other, but they may or may not be in direct physical or electrical contact. Furthermore, the terms “comprising, ” “including, ” “having, ” and the like, as used with respect to embodiments of the present disclosure, are synonymous. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document.
Turning now to FIG. 1, operating environment 100 includes a simplified illustration of a host 104 configured to receive high-level languages, referred to as source language or source code, run a browser, and parse a web page. The host 104 is in operational communication with a source 102 of the source language or a JavaScript file (.jsp). The host 104, generally via a communication circuitry 118, performs instruction monitoring.
The described compile operations can be interactive with a browser at the source 102, meaning that data and commands may be exchanged between the host 104 and the source 102, generally via communication circuitry 118.
In practice, the source 102 may be one of a plurality of sources that each independently may transmit a source language, JavaScript file or WASM file to the host 104. As described herein, the host 104 relies on at least one CPU, indicated generally with processor 106, and together they embody a language and hardware architecture (also referred to herein as a host architecture or apparatus) . The host 104 includes at least one storage component, indicated generally with storage device 116. Storage device 116 may be any combination of memory, disk, cache, etc., and may store, inter alia, instructions and parameters and data that are utilized in the operation of the compiler 110 described herein. As may be appreciated, in practice, the host 104 may be a complex computer node or computer processing system, and may include or be integrated with many more components and peripheral devices (see, for example, FIG. 7, compute node 700, and FIG. 8, computing system 800) .
The host 104 architecture includes or is upgraded to include an enhanced compiler 110 facilitating self-evolving and multi-versioning code, as described herein. The compiler 110 can be realized as hardware (circuitry) or an algorithm or set of rules embodied in software (e.g., stored in the storage device 116) and executed by the processor 106. The compiler 110 is depicted as a separate functional block or module for discussion; however, in practice, the compiler 110 may be  integrated with the host processor 106 as software, hardware, or a combination thereof. Accordingly, the compiler 110 may be updated during updates to the host 104 software, such as boot or during runtime.
A high-level source language file ( “input file” ) may be received by the compiler 110, starting a compile operation. During a compile operation, the compiler 110 may reference a host library 108. The host specific library 108 is configured with microcode (also referred to as machine code) instructions that are native to the host 104 architecture, so that the compile operation effectively translates the incoming source language into native machine code.
The compile operation may take the form of two different threads of overall compiler 110 operation. In a first thread, an ahead-of-time module (AOT 112) translates the high-level source language file into a wasm file (a main.wasm file) and also inserts one or more checkpoints into a function. The AOT 112 may store the checkpoints into a file of checkpoints (referred to as an inference.wasm file). An AOT wasm compile operation can be performed prior to running in a browser (e.g., source 102), and the output of the AOT wasm compile operation can be part of the source language file received (FIG. 2, 204). Moreover, as part of the AOT 112, in various embodiments, a “wasm JIT compile” operation may be performed to translate the main.wasm file to JIT’d code that includes both the checkpoints and inference operators described in more detail herein.
In some embodiments, the AOT 112 module is part of a compiler 110. In other embodiments, the functionality performed by the AOT 112 module may be distributed among other components or processors within the source 102. Further, the AOT results (“main.wasm” and “inference.wasm”) may be pre-built and can be distributed through source 102, and made available for other components within source 102, just like other wasm or .js files at runtime.
In another thread of the compile operation, a just-in-time module (JIT 114) consumes the main.wasm file and the inference.wasm file, generating self-evolving JIT’d code and performing execution and PGO profiling based thereon. A JIT compile operation (e.g., that may be part of background module 122) can compile the main.wasm and inference.wasm to JIT’d code that then is executed in the JIT 114 module.
In various embodiments, the JIT 114 may be organized as an execution module 120 in communication with a background module 122. As part of the self-evolving process, the JIT 114 can perform a JIT compile operation on the main.wasm and inference.wasm, generating therefrom JIT’d code to be executed in the JIT 114. In various embodiments, the JIT compile can be performed in the background module 122 and the JIT execution can occur in the execution module 120.
The following discussion references FIGS. 2-4 and develops the technical concepts introduced in FIG. 1. FIG. 2 provides a flowchart 200 for an example method for self-evolving and multi-versioning code. FIG. 3 provides a process flow for ahead-of-time operations (AOT 112) and FIG. 4 provides a process flow for various just in time operations (JIT 114) . As used herein, a processor 106 (e.g., a CISC machine) or a computer device, a compute node (FIG. 7, 700) or a processing system (e.g., FIG. 8, 800) referred to as being programmed to perform a method can be programmed to perform the method via software, hardware, firmware or combinations thereof. For illustrative purposes, the following description of the method 200 may refer to elements mentioned in connection with FIGS. 1, 3, or 4. In various embodiments, portions of method 200 may be performed by different components of the described system environment 100. It should be appreciated that method 200 may include any number of additional or alternative operations and tasks, the tasks shown in FIG. 2 need not be performed in the illustrated order, and method 200 may be incorporated into a more comprehensive procedure or method having additional functionality not described in detail herein. Moreover, one or more of the tasks shown in FIG. 2 could be omitted from an embodiment of the method 200 if the intended overall functionality remains intact.
As shown in FIG. 2, at 202, the host may be running a browser and parsing a website. At 204, a high-level source language file ( “input file” ) may be received by the host 104. Additionally, at 204, the host 104 could receive an input file in the form of a . jsp file or wasm file.
However, as mentioned above, and as illustrated with box 206, the AOT compile operations may be performed ahead of time (such as before running a browser at 202), and outputs from the AOT compile may be pre-built. In these embodiments, outputs from the AOT compile operations may be integrated into, or be part of, receiving the source language file at 204. As illustrated at 206, an ahead of time (AOT) wasm compile 304 operation can be performed by AOT 112. This AOT operation may include identifying at 208 one or more functions in the input file (source language 302) and inserting one or more checkpoints into the input file at 210. The AOT compile operations may include a wasm JIT compile to translate the wasm file to JIT’d code that includes the checkpoints 210 and the inference operators 212.
In an embodiment, more than one checkpoint is inserted into each of more than one function. The checkpoints may be placed at a specific location with respect to the function, e.g., at the start of the function, and may include specific parameters for which to collect values. In an example embodiment, the location is predefined as the entry of the key function, and the parameter value to collect at that location is also predefined. Other parameter values to collect can include global data accesses by the key function, or the like. In an embodiment, more than one parameter value may be collected for a checkpoint. This operation at 210 may be referred to as creating a checkpoint instrument for the input file, and the output of this operation, the checkpoint instrumented 306 input file, is a wasm file and can be stored, e.g., into a main.wasm file.
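As a concrete illustration, the following is a minimal C-level sketch of what the checkpoint instrument conceptually produces; the CheckpointStore type and all identifiers are hypothetical stand-ins for whatever the AOT 112 actually emits at the predefined entry location:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical runtime store for checkpoint-collected values, keyed by
// (function id, parameter index); this backs the checkpoint raw profile.
struct CheckpointStore {
    std::mutex mu;
    std::unordered_map<uint64_t, std::vector<int64_t>> values;
    void record(uint32_t func_id, uint32_t param_idx, int64_t v) {
        std::lock_guard<std::mutex> lock(mu);
        values[(uint64_t(func_id) << 32) | param_idx].push_back(v);
    }
};
static CheckpointStore g_checkpoints;

// A key function after the checkpoint instrument is applied: the inserted
// call at the entry records the predefined parameter's value.
int key_function(int day) {
    g_checkpoints.record(/*func_id=*/7, /*param_idx=*/0, day);  // checkpoint
    // ... original function body ...
    return day;
}
```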
To inference a full profile from the checkpoint-collected values, embodiments of the AOT 112 compiler may also create at 212 one or more profile inference operators 308 to calculate values at other instrumentation points, for use in general PGO profiling. In various embodiments, the AOT 112 may populate an inference.wasm file with the one or plurality of profile inference operators 308. For example, the AOT compile 304 builds a profile operator to inference a conditional branch from input parameter or global value ranges; for memory accesses on loads or stores, it builds an operator from checkpoints to inference the memory address(es) to be accessed.
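To illustrate what such an operator might compute, the following hedged sketch maps the values a checkpoint collected for an input parameter to an estimated taken ratio for a branch guarded by a value range; the RangeBranchInference name and its fields are assumptions, not the disclosed operator format:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical inference operator: given values a checkpoint collected for
// an input parameter, estimate how often a branch guarded by a value range
// is taken, without instrumenting the branch itself.
struct RangeBranchInference {
    int64_t lo, hi;  // branch is taken when lo <= value <= hi
    double infer_taken_ratio(const std::vector<int64_t>& samples) const {
        if (samples.empty()) return 0.0;
        auto taken = std::count_if(samples.begin(), samples.end(),
            [this](int64_t v) { return v >= lo && v <= hi; });
        return double(taken) / double(samples.size());
    }
};
```

Attached to a node of the IR graph, an operator of this kind would let the background thread estimate branch frequencies from the checkpoint values alone.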
Output 213 includes the source language file from 204 and the respective output from operations at 206. Output 213 is consumed by the JIT 114 module at 214.
In FIG. 4, the operations 402 represent the main wasm execution thread 120, which includes a functional block 403, whose functions are generally present on a client device or host (performing, responsive to an input wasm file or an input JavaScript file, the operations of fetch 404, baseline compile 406, instantiate 410, and interpret 408, to thereby translate the input file into JIT’d code for execution) . Additionally, operations 402 include technical improvements in the form of the wasm execution and profiling 412 operations and the creation of the checkpoint instrumented intermediate representation (IR) graph 414.
At 216, wasm execution and profiling (412) is performed on self-evolving JIT’d code 428 generated in the background thread 122. Output from wasm execution and profiling (412) can include a profiler runtime data dump that generates checkpoint raw profiles (e.g., checkpoint raw profile 416) . In various embodiments, the checkpoint raw profiles 416 can be stored into a shared memory file and transmitted to a profile monitor 418 in the background/inference thread 430.
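One plausible shape for that runtime data dump, assuming a simple append-only record format (the RawProfileEntry layout and dump_entry helper are hypothetical, not the disclosed file format):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical fixed-size record for one checkpoint observation; the
// execution thread appends these to a shared file that the background
// inference thread reads as the checkpoint raw profile 416.
struct RawProfileEntry {
    uint32_t func_id;
    uint32_t param_idx;
    int64_t  value;
};

void dump_entry(std::FILE* shared, const RawProfileEntry& e) {
    std::fwrite(&e, sizeof e, 1, shared);  // append-only runtime data dump
    std::fflush(shared);                   // make it visible to the monitor
}
```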
The background/inference thread 430 also receives the profile inference operators 308 in the inf.wasm file and follows with a fetch 420 and a compile 422 based thereon. A profile inference 424 operation receives the compile 422 output and processes it with the profile monitor 418 output to determine whether there has been a change from an initial profile. Said differently, the profile inference 424 operation determines whether an actual execution of a program matches a previously generated profile that is based on previous passes through the JIT compile 400.
From the initial baseline compilation 406, embodiments may build an intermediate representation (IR) graph and attach the inference operators 308 to nodes in the IR graph, thereby creating a checkpoint instrumented IR graph 414. The checkpoint instrumented IR graph 414 can be stored in a disk or other storage device (e.g., storage device 116) .
The checkpoint instrumented IR graph 414 can be compiled with output from the profile inference 424 operation (PGO compilation 426) . Pulling together the previous operations, the output from PGO compilation 426 represents monitored profile changes, the inference operators, hot and cold branches, counts, and frequently accessed spaces (address and size) . Output from the PGO compilation 426 can be combined with the baseline compiled (406) output from the main wasm execution thread 120; and, upon determining there has been a profile change, embodiments will regenerate the JIT’d code that the wasm execution and profiling 412 engine operates on (i.e., the herein described self-evolving JIT’d code 428) . In various embodiments, the first time through the wasm execution and profiling 412 engine may rely on initial assumptions including an initial profile; a profile change may be detected when there is a deviation in at least one value in the raw profile that exceeds a respective profile inference value. In various embodiments, the compiler 110 regenerates the JIT’d code 428 when an inference operator is exceeded by a corresponding parameter value in the checkpoint raw profile.
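A minimal sketch of the change test described here, assuming a simple threshold on the deviation between an inferred value and the corresponding observed value (the threshold and the recompile hook are hypothetical):

```cpp
#include <cmath>

// Hypothetical profile monitor step: compare the ratio the inference
// operators predicted against the ratio observed in the raw profile and,
// when the deviation exceeds a threshold, request regeneration of the
// JIT'd code (the PGO compilation 426 path).
bool profile_changed(double inferred, double observed,
                     double threshold = 0.10) {
    return std::fabs(inferred - observed) > threshold;
}

void monitor_step(double inferred, double observed,
                  void (*request_pgo_recompile)()) {
    if (profile_changed(inferred, observed)) {
        request_pgo_recompile();  // triggers the self-evolving JIT'd code 428
    }
}
```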
After 218, the method 200 may end or repeat.
FIG. 5 is an example use case illustrating how the instrument checkpoint and inference operators facilitate PGO optimization. The code snippet for a function is shown at 500 and the corresponding flowchart is illustrated at 502. In this example, the AOT 112 of the compiler 110 inserted a checkpoint (an instrument checkpoint) at the beginning of the function to collect the input value of “day.” During operation of the compiler 110, the profile monitoring operation (FIG. 4, 418) identifies/determines that “day” is in the range {Monday, Tuesday, Friday} more frequently than the remaining values, seldom or never taking the values Wednesday or Thursday. When the compiler 110 moves to inference the full profile (FIG. 4, 424), it invokes the attached inference operators 504 for each switch case condition (506), using the range of “day” to predict the branch (to blocks 1-N) frequencies. As used herein, blocks 1-N represent blocks of code; the first time through the compiler 110 the blocks are original wasm code. Thereafter, the compiler 110 can self-evolve the code (i.e., self-evolve the layout of the switch case conditions 506 and corresponding blocks 1-N), perform a direct inline decision if any block calls another function (not shown), and conduct more advanced PGO optimizations.
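A C-level analogue of the FIG. 5 snippet may clarify the flow; the checkpoint call, block numbering, and hot/cold annotations below are illustrative, not taken from the figure:

```cpp
enum Day { Monday, Tuesday, Wednesday, Thursday, Friday };

// The instrument checkpoint records "day" at entry; profile monitoring then
// learns that day is usually in {Monday, Tuesday, Friday}, so the inference
// operators mark those switch cases hot and the JIT can lay them out first.
int f(Day day) {
    // checkpoint(day);           // instrument checkpoint (hypothetical call)
    switch (day) {
        case Monday:    return 1;  // block 1: predicted hot
        case Tuesday:   return 2;  // block 2: predicted hot
        case Friday:    return 5;  // block 5: predicted hot
        case Wednesday: return 3;  // block 3: predicted cold
        case Thursday:  return 4;  // block 4: predicted cold
    }
    return 0;
}
```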
Embodiments can use a similar profiling mechanism to optimize data placement during memory accesses. For example, some technologies give access to multiple different storage devices for memory accesses (e.g., registers, level 0 cache, level 1 cache, level 2 cache), and each of these storage devices has a different access speed. In these scenarios, the compiler 110 can optimize data placement. This advantageously brings non-volatility into memory consumption calculations, which is critical to Web performance and experience.
In some embodiments, the AOT 112 can have some of the WASM runtime features embedded into its compiled binary code, and subsequent passes through the compiler 110 can be used to improve/optimize the AOT 112.
Thus, systems and methods for self-evolving and multi-versioning code have been described. Advantageously, provided embodiments enable the monitoring of profile changes, predefined inference operators, hot and cold branches, counts, and frequently accessed spaces (address and size) to direct profile guided optimization (PGO). Further, some additional advantages and applications as applied to the AOT 112 module are provided below.
Additional concepts for a self-evolving AOT 112
Create an AOT runner module comprising an AOT module loader, an AOT module instantiator, and an AOT module runner. Then, upon a compile operation, the AOT runtime includes the previously described AOT 112 plus the AOT runner.
The AOT runtime is to accept the WASM file or the AOT file; in scenarios in which the input is the WASM file, AOT 112 is to compile the WASM file into an AOT file first.
How the AOT 112 compiler compiles the WASM file into an AOT file
● The WASM file is divided into a bytecode part and a non-bytecode part;
● The bytecode part is compiled into AOT code and the necessary rodata; the rodata is used to apply relocations on the AOT code. The AOT code always accesses the AOT 112 module instance through its first function argument: each AOT function's first argument is the handle of the module instance. The AOT code doesn't access global variables in the object file; no global variables are generated in the object file.
● The generated AOT file includes: the original WASM file, the AOT code, the rodata, and the relocation records.
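For illustration, a hypothetical in-memory view of that file layout follows (the AotFile struct and field names are assumptions, not a disclosed on-disk format):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical view of the generated AOT file described above: the original
// WASM module travels with the compiled artifacts that reference it.
struct AotFile {
    std::vector<uint8_t> original_wasm;  // used to build module data at load
    std::vector<uint8_t> aot_code;       // compiled machine code
    std::vector<uint8_t> rodata;         // read-only data the code references
    struct Relocation { uint64_t code_offset; uint64_t target; };
    std::vector<Relocation> relocations; // applied when code is placed
};
```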
How to load and run an AOT file
● Create the AOT 112 module and module instance with the AOT file: the WASM file in the AOT file is used to create the AOT module's data, e.g., WASM global/table/memory and their instances. The AOT code, rodata and relocation records are used to allocate the machine code. Relocations are applied. The AOT function pointers are stored inside the AOT 112 module.
● When calling the AOT function, the AOT module instance's handle is passed as the AOT function's first argument, so that the AOT code can access the AOT module instance's data (see the sketch after this list).
● When running the AOT function, the profiling data may be generated.
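A minimal sketch of that calling convention, with hypothetical types (AotModuleInstance, AotFunc) standing in for the runtime's actual handles:

```cpp
#include <cstdint>

// Opaque handle owning the module's globals, tables, and memory; AOT code
// reaches all module data through it instead of through object-file globals.
struct AotModuleInstance;

// Every AOT function takes the module instance handle as its first argument.
using AotFunc = int32_t (*)(AotModuleInstance*, int32_t);

int32_t call_aot(AotModuleInstance* inst, AotFunc fn, int32_t arg) {
    return fn(inst, arg);  // profiling data may be generated during the call
}
```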
How to upgrade the AOT file (or update the AOT 112 module)
● After the PGO profiling data is generated, it can be input into the AOT 112 compiler, which generates the new AOT file.
● There is no need to destroy the whole AOT 112 compiler module and reload it again; instead, embodiments allocate new machine code and apply the relocations to the AOT 112 compiler module, replace the function pointers in the AOT 112 compiler module, and then destroy/delete the original machine code, as sketched below.
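A hedged sketch of that upgrade path, assuming an atomic function-pointer table; the quiescence caveat is an assumption about how such a swap could be made safe, not a detail from the disclosure:

```cpp
#include <atomic>
#include <cstdint>

using AotFuncPtr = void (*)();

// Hypothetical live function-pointer table inside the AOT module.
struct AotModule {
    std::atomic<AotFuncPtr> funcs[256];
};

// Upgrade one function in place: swap in the newly compiled machine code,
// then destroy the original machine code, without reloading the module.
void upgrade(AotModule& m, uint32_t idx, AotFuncPtr new_code,
             void (*free_code)(AotFuncPtr)) {
    AotFuncPtr old_code = m.funcs[idx].exchange(new_code);
    // NOTE: a real runtime must ensure no thread still executes old_code
    // before freeing it (e.g., via a quiescence period).
    free_code(old_code);
}
```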
As mentioned, WASM is a collaboratively developed portable low-level bytecode designed to improve upon the deficiencies of JavaScript. In various scenarios, WASM was developed with a component model in which code is organized in modules that have a shared-nothing inter-component invocation. A host 104, such as a virtual machine, container, or microservice, can be populated with multiple different WASM components (also referred to herein as WASM modules) . The WASM modules interface using the shared-nothing interface, which enables fast instance-derived import calls. The shared-nothing interface enables software and hardware optimization via adaptors.
Many of the above-described operations and/or functions may be part of a wasm module, and wasm modules can be found in various use cases, as illustrated in FIG. 6. A WASM module contains definitions for functions, globals, tables, and memories. The definitions can be imported or exported. A module can define only one memory; that memory is a traditional linear memory that is mutable and may be shared. The code in a module is organized into functions. Functions can call each other, but functions cannot be nested. Instantiating a module can be provided by a JavaScript virtual machine or an operating system. An instance of a module corresponds to a dynamic representation of the module, its defined memory, and an execution stack. A WASM computation is initiated by invoking a function exported from the instance.
WASMTIME and WASI. WASMTIME is a jointly developed, industry-leading WebAssembly runtime; it includes a compiler for WASM written in Rust. In various embodiments, a Web Assembly System Interface (WASI) that may be host specific (processor specific) is used to enable application specific protocols (e.g., for machine language, for machine learning, etc.) for communication and data sharing between the software environment running WASM (WASMTIME) and other host components. These concepts are illustrated in FIG. 6. Embodiment 600 illustrates a WASM module 602 embodied as a direct command line interface (CLI). The WASI library 604 is referenced during WASMTIME CLI 606, and the operating system (OS) resources 608 of the host are utilized. A WASI application programming interface(s) 610 (“WASI API”) enables communication and data sharing between the components in embodiment 600.
Embodiment 630 illustrates a WASM module 632 in which WASMTIME and WASI are embedded in an application. In the embedded environment, a portable WASM application 634 includes the WASI library 636 that is referenced during WASMTIME 638. The portable WASM application 634 may be referred to as a user application. Embodiment 630 may employ a host API 646 for communication and data sharing between the WASM application 634 and the host for certain operations, and employ multiple WASI implementations 640 for communication and data sharing between the portable WASM application 634 and the host OS resources 642 (indicated generally with WASI APIs 648). In various embodiments, different instances of WASI may be concurrently supported for communications with a host application, a native OS, bare metal, a Web polyfill, or similar. The portable WASM application 634 can transmit model and encoding information into the WASM runtime environment 638, and the WASM runtime environment 638 may also reference models based thereon, such as, in a non-limiting example, a virtualized I/O machine learning (ML) model. Embodiment 630 may represent a standalone environment, such as a standalone desktop, an Internet of Things (IOT) environment, or a cloud application (e.g., a content delivery network (CDN), function as a service (FaaS), an envoy proxy, or the like). In other scenarios, embodiment 630 may represent a resource-constrained environment, such as in IOT, embedding, or the like.
The systems and methods described herein can be implemented in or performed by any of a variety of computing systems, including mobile computing systems (e.g., smartphones, handheld computers, tablet computers, laptop computers, portable gaming consoles, 2-in-1 convertible computers, portable all-in-one computers) , non-mobile computing systems (e.g., desktop computers, servers, workstations, stationary gaming consoles, set-top boxes, smart televisions, rack-level computing solutions (e.g., blade, tray, or sled computing systems) ) , and embedded computing systems (e.g., computing systems that are part of a vehicle, smart home appliance, consumer electronics product or equipment, manufacturing equipment) .
As used herein, the term “computing system” includes compute nodes, computing devices, and systems comprising multiple discrete physical components. In some embodiments, the computing systems are located in a data center, such as an enterprise data center (e.g., a data center owned and operated by a company and typically located on company premises) , managed services data center (e.g., a data center managed by a third party on behalf of a company) , a co-located data center (e.g., a data center in which data center infrastructure is provided by the data center host and a company provides and manages their own data center components (servers, etc. ) ) , cloud data center (e.g., a data center operated by a cloud services provider that host companies applications and data, such as, web applications, games, and conference call applications) , and an edge data center (e.g., a data center, typically having a smaller footprint than other data center types, located close to the geographic area that it serves) .
In the simplified example depicted in FIG. 7, a compute node 700 includes a compute engine (referred to herein as “compute circuitry”) 702, an input/output (I/O) subsystem 708, data storage 710, a communication circuitry subsystem 712, and, optionally, one or more peripheral devices 714. With respect to the present example, the compute node 700 or compute circuitry 702 may perform the operations and tasks attributed to the host 104. In other examples, respective compute nodes 700 may include other or additional components, such as those typically found in a computer (e.g., a display, peripheral devices, etc.). Additionally, in some examples, one or more of the illustrative components may be incorporated in, or otherwise form a portion of, another component.
In some examples, the compute node 700 may be embodied as a single device such as an integrated circuit, an embedded system, a field-programmable gate array (FPGA) , a system-on-a-chip (SOC) , or other integrated system or device. In the illustrative example, the compute node 700 includes or is embodied as a processor 704 and a memory 706. The processor 704 may be embodied as any type of processor capable of performing the functions described herein (e.g., executing compile functions and executing an application) . For example, the processor 704 may be embodied as a multi-core processor (s) , a microcontroller, a processing unit, a specialized or special purpose processing unit, or other processor or processing/controlling circuit.
In some examples, the processor 704 may be embodied as, include, or be coupled to an FPGA, an application specific integrated circuit (ASIC), reconfigurable hardware or hardware circuitry, or other specialized hardware to facilitate performance of the functions described herein. Also, in some examples, the processor 704 may be embodied as a specialized x-processing unit (xPU), also known as a data processing unit (DPU), infrastructure processing unit (IPU), or network processing unit (NPU). Such an xPU may be embodied as a standalone circuit or circuit package, integrated within an SOC, or integrated with networking circuitry (e.g., in a SmartNIC or enhanced SmartNIC), acceleration circuitry, storage devices, or AI hardware (e.g., GPUs or programmed FPGAs). Such an xPU may be designed to receive programming to process one or more data streams and perform specific tasks and actions for the data streams (such as hosting microservices, performing service management or orchestration, organizing or managing server or data center hardware, managing service meshes, or collecting and distributing telemetry), outside of the CPU or general-purpose processing hardware. However, it will be understood that an xPU, an SOC, a CPU, and other variations of the processor 704 may work in coordination with each other to execute many types of operations and instructions within and on behalf of the compute node 700.
The memory 706 may be embodied as any type of volatile (e.g., dynamic random-access memory (DRAM) , etc. ) or non-volatile memory or data storage capable of performing the functions described herein. Volatile memory may be a storage medium that requires power to maintain the state of data stored by the medium. Non-limiting examples of volatile memory may  include various types of random-access memory (RAM) , such as DRAM or static random-access memory (SRAM) . One particular type of DRAM that may be used in a memory module is synchronous dynamic random-access memory (SDRAM) .
In an example, the memory device is a block addressable memory device, such as those based on NAND or NOR technologies. A memory device may also include a three-dimensional crosspoint memory device (e.g., 3D XPoint™ memory), or other byte addressable write-in-place nonvolatile memory devices. The memory device may refer to the die itself and/or to a packaged memory product. In some examples, 3D crosspoint memory (e.g., 3D XPoint™ memory) may comprise a transistor-less stackable cross point architecture in which memory cells sit at the intersection of word lines and bit lines and are individually addressable and in which bit storage is based on a change in bulk resistance. In some examples, all or a portion of the memory 706 may be integrated into the processor 704. The memory 706 may store various software and data used during operation such as one or more applications, data operated on by the application(s), libraries, and drivers.
The compute circuitry 702 is communicatively coupled to other components of the compute node 700 via the I/O subsystem 708, which may be embodied as circuitry and/or components to facilitate input/output operations with the compute circuitry 702 (e.g., with the processor 704 and/or the main memory 706) and other components of the compute circuitry 702. For example, the I/O subsystem 708 may be embodied as, or otherwise include, memory controller hubs, input/output control hubs, integrated sensor hubs, firmware devices, communication links (e.g., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc. ) , and/or other components and subsystems to facilitate the input/output operations. In some examples, the I/O subsystem 708 may form a portion of a system-on-a-chip (SoC) and be incorporated, along with one or more of the processor 704, the memory 706, and other components of the compute circuitry 702, into the compute circuitry 702.
The one or more illustrative data storage devices 710 may be embodied as any type of devices configured for short-term or long-term storage of data such as, for example, memory devices and circuits, memory cards, hard disk drives, solid-state drives, or other data storage devices. Individual data storage devices 710 may include a system partition that stores data and firmware code for the data storage device 710. Individual data storage devices 710 may also  include one or more operating system partitions that store data files and executables for operating systems depending on, for example, the type of compute node 700.
The communication circuitry 712 may be embodied as any communication circuit, device, transceiver circuit, or collection thereof, capable of enabling communications over a network between the compute circuitry 702 and another compute device (e.g., an edge gateway of an implementing edge computing system) .
The communication subsystem 712 may implement any of a number of wireless standards or protocols, including but not limited to Institute for Electrical and Electronic Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family) , IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment) , Long-Term Evolution (LTE) project along with any amendments, updates, and/or revisions (e.g., advanced LTE project, ultra-mobile broadband (UMB) project (also referred to as “3GPP2” ) , etc. ) . IEEE 802.16 compatible Broadband Wireless Access (BWA) networks are generally referred to as WiMAX networks, an acronym that stands for Worldwide Interoperability for Microwave Access, which is a certification mark for products that pass conformity and interoperability tests for the IEEE 802.16 standards. The communication subsystem 712 may operate in accordance with a Global System for Mobile Communication (GSM) , General Packet Radio Service (GPRS) , Universal Mobile Telecommunications System (UMTS) , High Speed Packet Access (HSPA) , Evolved HSPA (E-HSPA) , or LTE network. The communication subsystem 712 may operate in accordance with Enhanced Data for GSM Evolution (EDGE) , GSM EDGE Radio Access Network (GERAN) , Universal Terrestrial Radio Access Network (UTRAN) , or Evolved UTRAN (E-UTRAN) . The communication subsystem 712 may operate in accordance with Code Division Multiple Access (CDMA) , Time Division Multiple Access (TDMA) , Digital Enhanced Cordless Telecommunications (DECT) , Evolution-Data Optimized (EV-DO) , and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond. The communication subsystem 712 may operate in accordance with other wireless protocols in other embodiments.
In some embodiments, the communication subsystem 712 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., IEEE 802.3 Ethernet standards) . As noted above, the communication subsystem 712 may include multiple communication components. For instance, a first communication subsystem 712 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second  communication subsystem 712 may be dedicated to longer-range wireless communications such as global positioning system (GPS) , EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others. In some embodiments, a first communication subsystem 712 may be dedicated to wireless communications, and a second communication subsystem 712 may be dedicated to wired communications.
The illustrative communication subsystem 712 includes an optional network interface controller (NIC) 720, which may also be referred to as a host fabric interface (HFI) . The NIC 720 may be embodied as one or more add-in-boards, daughter cards, network interface cards, controller chips, chipsets, or other devices that may be used by the compute node 700 to connect with another compute device (e.g., an edge gateway node) . In some examples, the NIC 720 may be embodied as part of a system-on-a-chip (SoC) that includes one or more processors or included on a multichip package that also contains one or more processors. In some examples, the NIC 720 may include a local processor (not shown) and/or a local memory (not shown) that are both local to the NIC 720. In such examples, the local processor of the NIC 720 may be capable of performing one or more of the functions of the compute circuitry 702 described herein. Additionally, or alternatively, in such examples, the local memory of the NIC 720 may be integrated into one or more components of the client compute node at the board level, socket level, chip level, and/or other levels.
Additionally, in some examples, a respective compute node 700 may include one or more peripheral devices 714. Such peripheral devices 714 may include any type of peripheral device found in a compute device or server such as audio input devices, a display, other input/output devices, interface devices, and/or other peripheral devices, depending on the type of the compute node 700. In further examples, the compute node 700 may be embodied by a respective edge compute node (whether a client, gateway, or aggregation node) in an edge computing system or like forms of appliances, computers, subsystems, circuitry, or other components.
In other examples, the compute node 700 may be embodied as any type of device or collection of devices capable of performing various compute functions. Respective compute nodes 700 may be embodied as a type of device, appliance, computer, or other “thing” capable of communicating with other compute nodes that may be edge, networking, or endpoint components. For example, a compute device may be embodied as a personal computer, server, smartphone, a mobile compute device, a smart appliance, smart camera, an in-vehicle compute system (e.g., a navigation system) , a weatherproof or weather-sealed computing appliance, a self-contained  device within an outer case, shell, etc., or other device or system capable of performing the described functions.
FIG. 8 illustrates a multi-processor environment in which embodiments may be implemented. Processors 802 and 804 further comprise cache memories 812 and 814, respectively. The cache memories 812 and 814 can store data (e.g., instructions) utilized by one or more components of the processors 802 and 804, such as the processor cores 808 and 810. The cache memories 812 and 814 can be part of a memory hierarchy for the computing system 800. For example, the cache memory 812 can locally store data that is also stored in a memory 816 to allow for faster access to the data by the processor 802. In some embodiments, the cache memories 812 and 814 can comprise multiple cache levels, such as level 1 (L1), level 2 (L2), level 3 (L3), level 4 (L4), and/or other caches or cache levels. In some embodiments, one or more levels of cache memory (e.g., L2, L3, L4) can be shared among multiple cores in a processor or among multiple processors in an integrated circuit component. In some embodiments, the last level of cache memory on an integrated circuit component can be referred to as a last level cache (LLC). One or more of the higher levels of cache memory (the smaller and faster caches) in the memory hierarchy can be located on the same integrated circuit die as a processor core, and one or more of the lower cache levels (the larger and slower caches) can be located on one or more integrated circuit dies that are physically separate from the processor core integrated circuit dies.
Although the computing system 800 is shown with two processors, the computing system 800 can comprise any number of processors. Further, a processor can comprise any number of processor cores. A processor can take various forms such as a central processing unit (CPU), a graphics processing unit (GPU), general-purpose GPU (GPGPU), accelerated processing unit (APU), field-programmable gate array (FPGA), neural network processing unit (NPU), data processing unit (DPU), accelerator (e.g., graphics accelerator, digital signal processor (DSP), compression accelerator, artificial intelligence (AI) accelerator), controller, or other types of processing units. As such, the processor can be referred to as an XPU (or xPU). Further, a processor can comprise one or more of these various types of processing units. In some embodiments, the computing system comprises one processor with multiple cores, and in other embodiments, the computing system comprises a single processor with a single core. As used herein, the terms "processor," "processor unit," and "processing unit" can refer to any processor, processor core, component, module, engine, circuitry, or any other processing element described or referenced herein.
In some embodiments, the computing system 800 can comprise one or more processors that are heterogeneous or asymmetric to another processor in the computing system. There can be a variety of differences between the processing units in a system in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like. These differences can effectively manifest themselves as asymmetry and heterogeneity among the processors in a system.
The processors 802 and 804 can be located in a single integrated circuit component (such as a multi-chip package (MCP) or multi-chip module (MCM)) or they can be located in separate integrated circuit components. An integrated circuit component comprising one or more processors can comprise additional components, such as embedded DRAM, stacked high bandwidth memory (HBM), shared cache memories (e.g., L3, L4, LLC), input/output (I/O) controllers, or memory controllers. Any of the additional components can be located on the same integrated circuit die as a processor, or on one or more integrated circuit dies separate from the integrated circuit dies comprising the processors. In some embodiments, these separate integrated circuit dies can be referred to as "chiplets." In some embodiments where there is heterogeneity or asymmetry among processors in a computing system, the heterogeneity or asymmetry can be among processors located in the same integrated circuit component. In embodiments where an integrated circuit component comprises multiple integrated circuit dies, interconnections between dies can be provided by the package substrate, one or more silicon interposers, one or more silicon bridges embedded in the package substrate (such as embedded multi-die interconnect bridges (EMIBs)), or combinations thereof.
Processors 802 and 804 further comprise memory controller logic (MC) 820 and 822. As shown in FIG. 8, MCs 820 and 822 control memories 816 and 818 coupled to the processors 802 and 804, respectively. The memories 816 and 818 can comprise various types of volatile memory (e.g., dynamic random-access memory (DRAM), static random-access memory (SRAM)) and/or non-volatile memory (e.g., flash memory, chalcogenide-based phase-change non-volatile memories), and comprise one or more layers of the memory hierarchy of the computing system. While MCs 820 and 822 are illustrated as being integrated into the processors 802 and 804, in alternative embodiments, the MCs can be external to a processor.
Processors  802 and 804 are coupled to an Input/Output (I/O) subsystem 830 via point-to- point interconnections  832 and 834. The point-to-point interconnection 832 connects a point-to-point interface 836 of the processor 802 with a point-to-point interface 838 of the I/O subsystem 830, and the point-to-point interconnection 834 connects a point-to-point interface 840 of the processor 804 with a point-to-point interface 842 of the I/O subsystem 830. Input/Output subsystem 830 further includes an interface 850 to couple the I/O subsystem 830 to a graphics engine 852. The I/O subsystem 830 and the graphics engine 852 are coupled via a bus 854.
The Input/Output subsystem 830 is further coupled to a first bus 860 via an interface 862. The first bus 860 can be a Peripheral Component Interconnect Express (PCIe) bus or any other type of bus. Various I/O devices 864 can be coupled to the first bus 860. A bus bridge 870 can couple the first bus 860 to a second bus 880. In some embodiments, the second bus 880 can be a low pin count (LPC) bus. Various devices can be coupled to the second bus 880 including, for example, a keyboard/mouse 882, audio I/O devices 888, and a storage device 890, such as a hard disk drive, solid-state drive, or another storage device for storing computer-executable instructions (code) 892 or data. The code 892 can comprise computer-executable instructions for performing methods described herein. Additional components that can be coupled to the second bus 880 include communication device(s) 884, which can provide for communication between the computing system 800 and one or more wired or wireless networks 886 (e.g., Wi-Fi, cellular, or satellite networks) via one or more wired or wireless communication links (e.g., wire, cable, Ethernet connection, radio-frequency (RF) channel, infrared channel, Wi-Fi channel) using one or more communication standards (e.g., IEEE 802.11 standard and its supplements).
In embodiments where the communication devices 884 support wireless communication, the communication devices 884 can comprise wireless communication components coupled to one or more antennas to support communication between the computing system 800 and external devices. The wireless communication components can support various wireless communication protocols and technologies such as Near Field Communication (NFC), IEEE 802.11 (Wi-Fi) variants, WiMax, Bluetooth, Zigbee, 4G Long Term Evolution (LTE), Code Division Multiple Access (CDMA), Universal Mobile Telecommunications System (UMTS), Global System for Mobile Communication (GSM), and 5G broadband cellular technologies. In addition, the wireless modems can support communication with one or more cellular networks for data and voice communications within a single cellular network, between cellular networks, or between the computing system and a public switched telephone network (PSTN).
The system 800 can comprise removable memory such as flash memory cards (e.g., SD (Secure Digital) cards), memory sticks, or Subscriber Identity Module (SIM) cards. The memory in system 800 (including caches 812 and 814, memories 816 and 818, and storage device 890) can store data and/or computer-executable instructions for executing an operating system 894 and application programs 896. Example data includes web pages, text messages, images, sound files, video data, biometric thresholds for particular users, or other data sets to be sent to and/or received from one or more network servers or other devices by the system 800 via the one or more wired or wireless networks 886, or for use by the system 800. The system 800 can also have access to external memory or storage (not shown) such as external hard drives or cloud-based storage.
The operating system 894 (also simplified to "OS" herein) can control the allocation and usage of the components illustrated in FIG. 8 and support the one or more application programs 896. The application programs 896 can include common computing system applications (e.g., email applications, calendars, contact managers, web browsers, messaging applications) as well as other computing applications.
In some embodiments, a hypervisor (or virtual machine manager) operates on the operating system 894 and the application programs 896 operate within one or more virtual machines operating on the hypervisor. In these embodiments, the hypervisor is a type-2 or hosted hypervisor as it is running on the operating system 894. In other hypervisor-based embodiments, the hypervisor is a type-1 or "bare-metal" hypervisor that runs directly on the platform resources of the computing system 800 without an intervening operating system layer.
In some embodiments, the applications 896 can operate within one or more containers. A container is a running instance of a container image, which is a package of binary images for one or more of the applications 896 and any libraries, configuration settings, and any other information that one or more applications 896 need for execution. A container image can conform to any container image format, such as Appc or LXC container image formats. In container-based embodiments, a container runtime engine, such as Docker Engine, LXD, or an open container initiative (OCI)-compatible container runtime (e.g., Railcar, CRI-O), operates on the operating system (or virtual machine monitor) to provide an interface between the containers and the operating system 894. An orchestrator can be responsible for management of the computing system 800 and various container-related tasks such as deploying container images to the computing system 800, monitoring the performance of deployed containers, and monitoring the utilization of the resources of the computing system 800.
The computing system 800 can support various additional input devices, represented generally as user interfaces 898, such as a touchscreen, microphone, monoscopic camera, stereoscopic camera, trackball, touchpad, trackpad, proximity sensor, light sensor, electrocardiogram (ECG) sensor, PPG (photoplethysmogram) sensor, galvanic skin response sensor, and one or more output devices, such as one or more speakers or displays. Other possible input and output devices include piezoelectric and other haptic I/O devices. Any of the input or output devices can be internal to, external to, or removably attachable with the system 800. External input and output devices can communicate with the system 800 via wired or wireless connections.
In addition, one or more of the user interfaces 898 may be natural user interfaces (NUIs). For example, the operating system 894 or applications 896 can comprise speech recognition logic as part of a voice user interface that allows a user to operate the system 800 via voice commands. Further, the computing system 800 can comprise input devices and logic that allows a user to interact with the computing system 800 via body, hand, or face gestures. For example, a user's hand gestures can be detected and interpreted to provide input to a gaming application.
The I/O devices 864 can include at least one input/output port comprising physical connectors (e.g., USB, IEEE 1394 (FireWire), Ethernet, RS-232), a power supply (e.g., battery), a global navigation satellite system (GNSS) receiver (e.g., GPS receiver), a gyroscope, an accelerometer, and/or a compass. A GNSS receiver can be coupled to a GNSS antenna. The computing system 800 can further comprise one or more additional antennas coupled to one or more additional receivers, transmitters, and/or transceivers to enable additional functions.
In addition to those already discussed, integrated circuit components, integrated circuit constituent components, and other components in the computing system 800 can communicate via interconnect technologies such as QuickPath Interconnect (QPI), Ultra Path Interconnect (UPI), Compute Express Link (CXL), Cache Coherent Interconnect for Accelerators (CCIX), serializer/deserializer (SERDES) links, NVLink, ARM Infinity Link, Gen-Z, or Open Coherent Accelerator Processor Interface (OpenCAPI). Other interconnect technologies may be used, and the computing system 800 may utilize one or more interconnect technologies.
It is to be understood that FIG. 8 illustrates only one example computing system architecture. Computing systems based on alternative architectures can be used to implement technologies described herein. For example, instead of the processors 802 and 804 and the graphics engine 852 being located on discrete integrated circuits, a computing system can comprise an SoC (system-on-a-chip) integrated circuit incorporating multiple processors, a graphics engine, and additional components. Further, a computing system can connect its constituent components via bus or point-to-point configurations different from that shown in FIG. 8. Moreover, the illustrated components in FIG. 8 are not required or all-inclusive, as illustrated components can be removed and other components added in alternative embodiments.
FIG. 9 is a block diagram of an example processor 900 to execute computer-executable instructions as part of implementing technologies described herein. The processor 900 can be a single-threaded core or a multithreaded core in that it may include more than one hardware thread context (or “logical processor” ) per processor.
FIG. 9 also illustrates a memory 910 coupled to the processor 900. The memory 910 can be any memory described herein or any other memory known to those of skill in the art. The memory 910 can store computer-executable instructions 915 (code) executable by the processor 900.
The processor comprises front-end logic 920 that receives instructions from the memory 910. An instruction can be processed by one or more decoders 930. The decoder 930 can generate as its output a micro-operation, such as a fixed width micro-operation in a predefined format, or generate other instructions, microinstructions, or control signals, which reflect the original code instruction. The front-end logic 920 further comprises register renaming logic 935 and scheduling logic 940, which generally allocate resources and queue operations corresponding to converting an instruction for execution.
The processor 900 further comprises execution logic 950, which comprises one or more execution units (EUs) 965-1 through 965-N. Some processor embodiments can include a few execution units dedicated to specific functions or sets of functions. Other embodiments can include only one execution unit or one execution unit that can perform a particular function. The execution logic 950 performs the operations specified by code instructions. After completion of execution of the operations specified by the code instructions, back-end logic 970 retires instructions using retirement logic 975. In some embodiments, the processor 900 allows out of order execution but  requires in-order retirement of instructions. Retirement logic 975 can take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like) .
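The interplay between out-of-order execution and in-order retirement described above can be illustrated with a short sketch. The following is a minimal, illustrative model (not taken from this disclosure) of retirement logic using a re-order buffer; the class and method names are assumptions chosen for clarity.

```python
# Illustrative sketch of retirement logic: instructions may complete out of
# order, but the re-order buffer only retires the oldest completed ones, in
# program order.
from collections import OrderedDict

class ReorderBuffer:
    def __init__(self):
        self.entries = OrderedDict()           # preserves program order

    def issue(self, tag):
        self.entries[tag] = False              # allocated, not yet complete

    def complete(self, tag):
        self.entries[tag] = True               # may happen out of order

    def retire(self):
        retired = []
        while self.entries and next(iter(self.entries.values())):
            retired.append(self.entries.popitem(last=False)[0])
        return retired                         # only the oldest, in order

rob = ReorderBuffer()
for t in ("i0", "i1", "i2"):
    rob.issue(t)
rob.complete("i2")                             # finishes first...
print(rob.retire())                            # [] -- i0 still pending
rob.complete("i0"); rob.complete("i1")
print(rob.retire())                            # ['i0', 'i1', 'i2']
```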
The processor 900 is transformed during execution of instructions, at least in terms of the output generated by the decoder 930, hardware registers and tables utilized by the register renaming logic 935, and any registers (not shown) modified by the execution logic 950.
Any of the disclosed methods (or a portion thereof) can be implemented as computer-executable instructions (also referred to as machine readable instructions) or a computer program product stored on a computer readable (machine readable) storage medium. Such instructions can cause a computing system or one or more processors capable of executing computer-executable instructions to perform any of the disclosed methods.
The computer-executable instructions or computer program products as well as any data created and/or used during implementation of the disclosed technologies can be stored on one or more tangible or non-transitory computer-readable storage media, such as volatile memory (e.g., DRAM, SRAM), non-volatile memory (e.g., flash memory, chalcogenide-based phase-change non-volatile memory), optical media discs (e.g., DVDs, CDs), and magnetic storage (e.g., magnetic tape storage, hard disk drives). Computer-readable storage media can be contained in computer-readable storage devices such as solid-state drives, USB flash drives, and memory modules. Alternatively, any of the methods disclosed herein (or a portion thereof) may be performed by hardware components comprising non-programmable circuitry. In some embodiments, any of the methods herein can be performed by a combination of non-programmable hardware components and one or more processing units executing computer-executable instructions stored on computer-readable storage media.
The computer-executable instructions can be part of, for example, an operating system of the host or computing system, an application stored locally to the computing system, or a remote application accessible to the computing system (e.g., via a web browser). Any of the methods described herein can be performed by computer-executable instructions executed by a single computing system or by one or more networked computing systems operating in a network environment. Computer-executable instructions and updates to the computer-executable instructions can be downloaded to a computing system from a remote server.
Further, it is to be understood that implementation of the disclosed technologies is not limited to any specific computer language or program. For instance, the disclosed technologies can be implemented by software written in C++, C#, Java, Perl, Python, JavaScript, Adobe Flash, assembly language, Web Assembly, or any other programming language. Likewise, the disclosed technologies are not limited to any particular computer system or type of hardware.
Furthermore, any of the software-based embodiments (comprising, for example, computer-executable instructions for causing a computer to perform any of the disclosed methods) can be uploaded, downloaded, or remotely accessed through a suitable communication means. Such suitable communication means include, for example, the Internet, the World Wide Web, an intranet, cable (including fiber optic cable) , magnetic communications, electromagnetic communications (including RF, microwave, ultrasonic, and infrared communications) , electronic communications, or other such communication means.
Theories of operation, scientific principles, or other theoretical descriptions presented herein in reference to the apparatuses or methods of this disclosure have been provided for the purposes of better understanding and are not intended to be limiting in scope. The apparatuses and methods in the appended claims are not limited to those apparatuses and methods that function in the manner described by such theories of operation.
While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are described herein in detail. It should be understood, however, that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives consistent with the present disclosure and the appended claims.
References in the specification to "one embodiment," "an embodiment," "an illustrative embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but it is not necessary that every embodiment include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described. Additionally, it should be appreciated that items included in a list in the form of "at least one of A, B, and C" can mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C). Similarly, items listed in the form of "at least one of A, B, or C" can mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C).
The following examples pertain to additional embodiments of technologies disclosed herein.
Example 1 is an apparatus comprising: a processor; a compiler executable by the processor to: receive a high-level source language file; translate the high-level source language file into a web assembly (wasm) file; insert a checkpoint into a function in the wasm file, the checkpoint to collect a parameter value; perform a just in time (JIT) compile operation on the wasm file with the checkpoint to generate just in time compiled (JIT’d) code; execute the JIT’d code to emit a checkpoint raw profile with the parameter value.
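The flow of Example 1 can be sketched compactly. The following is a minimal, hedged illustration using a toy in-memory representation rather than a real wasm toolchain; insert_checkpoint, jit_compile, and the profile layout are illustrative assumptions, not APIs defined by this disclosure.

```python
# Toy sketch of Example 1: instrument a function with a checkpoint, "JIT" it,
# execute it, and emit a checkpoint raw profile of observed parameter values.

def insert_checkpoint(func, param_name, raw_profile):
    """Wrap a function so every call records the observed parameter value."""
    def instrumented(**params):
        raw_profile.setdefault(param_name, []).append(params[param_name])
        return func(**params)
    return instrumented

def jit_compile(instrumented):
    # Stand-in for the JIT compile operation; here the "JIT'd code" is
    # simply the instrumented callable.
    return instrumented

raw_profile = {}
jitted = jit_compile(insert_checkpoint(lambda n: sum(range(n)), "n", raw_profile))
for n in (10, 10, 500):
    jitted(n=n)
print(raw_profile)  # checkpoint raw profile, e.g. {'n': [10, 10, 500]}
```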
Example 2 includes the subject matter of Example 1, wherein the compiler is further to insert the checkpoint at a specific location with respect to the function.
Example 3 includes the subject matter of Example 1, wherein the checkpoint is further to collect multiple parameter values, and the compiler is further to: execute the JIT’d code and emit the multiple parameter values.
Example 4 includes the subject matter of Example 3, wherein the compiler is further to: generate an inference operator for the wasm file; monitor a profile based on the inference operator and the multiple parameter values; regenerate the JIT’d code when there has been a change in the profile.
Example 5 includes the subject matter of Example 4, wherein the inference operator is to inference when a conditional branch will occur from an input parameter.
Example 6 includes the subject matter of Example 4, wherein the inference operator is to inference when a conditional branch will occur from a global value range.
Example 7 includes the subject matter of Example 4, wherein the inference operator is to inference when a specific memory access is to occur.
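Examples 4 through 7 can likewise be sketched. In the hedged model below, an "inference operator" is reduced to a predicate over collected values (e.g., predicting a conditional branch outcome from an input parameter or a global value range); when observed values violate the prediction, the JIT'd code is regenerated. All names are illustrative assumptions, not APIs from this disclosure.

```python
# Toy sketch of Examples 4-7: monitor a checkpoint raw profile against an
# inference operator and regenerate the JIT'd code when the profile changes.

def branch_inference(lo, hi):
    """Predict the hot branch is taken while values stay within [lo, hi]."""
    return lambda v: lo <= v <= hi

def monitor(raw_profile, key, operator, regenerate):
    for v in raw_profile.get(key, []):
        if not operator(v):              # profile changed: prediction violated
            return regenerate(v)
    return None                          # keep the current JIT'd code

recompiled = monitor({"n": [10, 10, 500]}, "n", branch_inference(0, 100),
                     regenerate=lambda v: f"respecialized for n up to {v}")
print(recompiled)  # 'respecialized for n up to 500'
```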
Example 8 includes the subject matter of Example 1, wherein the checkpoint is a first checkpoint, and wherein the compiler is further to insert a second checkpoint in the wasm file, the second checkpoint to collect a value for global data accesses by the function.
Example 9 includes the subject matter of Example 1, wherein the compiler is further to: generate a plurality of inference operators for the wasm file; populate an inference file with the plurality of inference operators; reference the inference file to generate the JIT’d code on the wasm file with the checkpoint.
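A possible shape for the inference file of Example 9 is sketched below, assuming a simple JSON layout; the schema (function name mapped to parameter predictions) is an illustrative assumption, not a format defined by this disclosure.

```python
# Toy sketch of Example 9: persist a plurality of inference operators to an
# inference file, then reference the file at JIT time.
import json

operators = {
    "fib":  {"n": {"kind": "branch_range", "min": 0, "max": 100}},
    "blit": {"stride": {"kind": "memory_access", "expected": 4}},
}
with open("inference.json", "w") as f:
    json.dump(operators, f, indent=2)

# At JIT time the compiler consults the file before specializing code.
with open("inference.json") as f:
    hints = json.load(f)
assert hints["fib"]["n"]["max"] == 100
```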
Example 10 includes the subject matter of Example 1, wherein the compiler performs a runtime data dump when executing the JIT’d code, thereby generating checkpoint raw profiles for use in monitoring profile changes.
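The runtime data dump of Example 10 might look like the following sketch, assuming each dump is a timestamped snapshot so a monitor can diff successive checkpoint raw profiles; the file name and layout are assumptions.

```python
# Toy sketch of Example 10: dump runtime data while executing the JIT'd code,
# producing checkpoint raw profiles for use in monitoring profile changes.
import json, time

def dump_raw_profile(raw_profile, path="checkpoint_profile.json"):
    snapshot = {"timestamp": time.time(), "values": raw_profile}
    with open(path, "w") as f:
        json.dump(snapshot, f)
    return snapshot

dump_raw_profile({"n": [10, 10, 500]})
```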
Example 11 is a method comprising: at a processor, receiving a high-level source language file; executing a compiler; translating the high-level source language file into a web assembly (wasm) file; inserting a checkpoint into a function in the wasm file, the checkpoint to collect a parameter value; performing a just in time (JIT) compile operation on the wasm file with the checkpoint to generate just in time compiled (JIT’d) code; executing the JIT’d code to emit a checkpoint raw profile with the parameter value.
Example 12 includes the subject matter of Example 11, further comprising inserting the checkpoint at a specific location with respect to the function.
Example 13 includes the subject matter of Example 11, wherein the checkpoint is further to collect multiple parameter values, and further comprising executing the JIT’d code to emit a checkpoint raw profile with the multiple parameter values.
Example 14 includes the subject matter of Example 13, further comprising: generating an inference operator for the wasm file; monitoring the inference operator and the checkpoint raw  profile; regenerating the JIT’d code when the inference operator is exceeded by a corresponding parameter value in the checkpoint raw profile.
Example 15 includes the subject matter of Example 14, wherein the inference operator is to inference when a conditional branch will occur from an input parameter.
Example 16 includes the subject matter of Example 14, wherein the inference operator is to inference when a conditional branch will occur from a global value range.
Example 17 includes the subject matter of Example 11, further comprising: generating a plurality of inference operators for the wasm file; storing the plurality of inference operators in an inference file; generating the JIT’d code further based on referencing the inference file.
Example 18 includes the subject matter of Example 11, wherein the checkpoint is a first checkpoint, and further comprising inserting a second checkpoint in the wasm file, the second checkpoint to collect a value for global data accesses by the function.
Example 19 includes the subject matter of Example 14, wherein the inference operator is to inference when a specific memory access is to occur.
Example 20 includes the subject matter of Example 11, wherein the compiler generates checkpoint raw profiles for use in monitoring profile changes by dumping runtime data while executing the JIT’d code.
Example 21 is one or more machine readable storage media having instructions stored thereon, the instructions when executed by a machine are to cause the machine to: receive a high-level source language file; translate the high-level source language file into a web assembly (wasm) file; insert a checkpoint into a function in the wasm file, the checkpoint to collect a parameter value; perform a just in time (JIT) compile operation on the wasm file with the checkpoint to generate just in time compiled (JIT’d) code; execute the JIT’d code to emit the parameter value.
Example 22 includes the subject matter of Example 21, wherein the instructions, when executed by the machine, are to cause the machine further to insert the checkpoint at a specific location with respect to the function.
Example 23 includes the subject matter of Example 21, wherein the instructions, when executed by the machine, are to cause the machine further to: insert the checkpoint into a function in the wasm file, the checkpoint to further collect multiple parameter values; execute the JIT’d code; emit a checkpoint raw profile with the multiple parameter values.
Example 24 includes the subject matter of Example 23, wherein the instructions, when executed by the machine, are to cause the machine further to: generate an inference operator for the wasm file; monitor a profile based on the inference operator and the checkpoint raw profile; regenerate the JIT’d code upon determining there has been a change in the profile.
Example 25 includes the subject matter of Example 24, wherein the instructions, when executed by the machine, are to cause the machine further to inference when a conditional branch will occur from an input parameter.
Example 26 includes the subject matter of Example 24, wherein the instructions, when executed by the machine, are to cause the machine further to inference when a conditional branch will occur from a global value range.
Example 27 includes the subject matter of Example 24, wherein the instructions, when executed by the machine, are to cause the machine further to inference when a specific memory access is to occur.
Example 28 includes the subject matter of Example 24, wherein the instructions, when executed by the machine, are to cause the machine further to define the checkpoint as a first checkpoint, and insert a second checkpoint in the wasm file, the second checkpoint to collect a value for global data accesses by the function.
Example 29 includes the subject matter of Example 24, wherein the instructions, when executed by the machine, are to cause the machine further to: generate a plurality of inference operators for the wasm file; populate an inference file with the plurality of inference operators; generate the JIT’d code based on the inference file and the wasm file with the checkpoint.
Example 30 includes the subject matter of Example 24, wherein the instructions, when executed by the machine, are to cause the machine further to dump runtime data during execution of the JIT’d code to generate checkpoint raw profiles for use in monitoring profile changes.

Claims (25)

  1. An apparatus comprising:
    a processor;
    a compiler executable by the processor to:
    receive a high-level source language file;
    translate the high-level source language file into a web assembly (wasm) file;
    insert a checkpoint into a function in the wasm file, the checkpoint to collect a parameter value;
    perform a just in time (JIT) compile operation on the wasm file with the checkpoint to generate just in time compiled (JIT’d) code; and
    execute the JIT’d code to emit a checkpoint raw profile with the parameter value.
  2. The apparatus of claim 1, wherein the compiler is further to insert the checkpoint at a specific location with respect to the function.
  3. The apparatus of claim 1, wherein the checkpoint is further to collect multiple parameter values, and the compiler is further to:
    execute the JIT’d code and emit the multiple parameter values.
  4. The apparatus of claim 3, wherein the compiler is further to:
    generate an inference operator for the wasm file;
    monitor a profile based on the inference operator and the multiple parameter values; and
    regenerate the JIT’d code when there has been a change in the profile.
  5. The apparatus of claim 4, wherein the inference operator is to inference when a conditional branch will occur from an input parameter.
  6. The apparatus of claim 4, wherein the inference operator is to inference when a conditional branch will occur from a global value range.
  7. The apparatus of claim 4, wherein the inference operator is to inference when a specific memory access is to occur.
  8. The apparatus of claim 1, wherein the checkpoint is a first checkpoint, and wherein the compiler is further to insert a second checkpoint in the wasm file, the second checkpoint to collect a value for global data accesses by the function.
  9. The apparatus of claim 1, wherein the compiler is further to:
    generate a plurality of inference operators for the wasm file;
    populate an inference file with the plurality of inference operators; and
    reference the inference file to generate the JIT’d code on the wasm file with the checkpoint.
  10. The apparatus of claim 1, wherein the compiler performs a runtime data dump when executing the JIT’d code, thereby generating checkpoint raw profiles for use in monitoring profile changes.
  11. A method comprising: at a processor,
    executing a compiler;
    running a browser;
    receiving a high-level source language file;
    translating the high-level source language file into a web assembly (wasm) file;
    inserting a checkpoint into a function in the wasm file, the checkpoint to collect a parameter value;
    performing a just in time (JIT) compile operation on the wasm file with the checkpoint to generate just in time compiled (JIT’d) code; and
    executing the JIT’d code to emit a checkpoint raw profile with the parameter value.
  12. The method of claim 11, further comprising inserting the checkpoint at a specific location with respect to the function.
  13. The method of claim 11, wherein the checkpoint is further to collect multiple parameter values, and further comprising executing the JIT’d code to emit a checkpoint raw profile with the multiple parameter values.
  14. The method of claim 13, further comprising:
    generating an inference operator for the wasm file;
    monitoring the inference operator and the checkpoint raw profile; and
    regenerating the JIT’d code when the inference operator is exceeded by a corresponding parameter value in the checkpoint raw profile.
  15. The method of claim 11, further comprising:
    generating a plurality of inference operators for the wasm file;
    storing the plurality of inference operators in an inference file; and
    generating the JIT’d code further based on referencing the inference file.
  16. The method of claim 11, wherein the compiler generates checkpoint raw profiles for use in monitoring profile changes by dumping runtime data while executing the JIT’d code.
  17. One or more machine readable storage media having instructions stored thereon, the instructions when executed by a machine are to cause the machine to:
    receive a high-level source language file;
    translate the high-level source language file into a web assembly (wasm) file;
    insert a checkpoint into a function in the wasm file, the checkpoint to collect a parameter value;
    perform a just in time (JIT) compile operation on the wasm file with the checkpoint to generate just in time compiled (JIT’d) code; and
    execute the JIT’d code to emit the parameter value.
  18. The one or more machine readable storage media of claim 17, wherein the instructions, when executed by the machine, are to cause the machine further to insert the checkpoint at a specific location with respect to the function.
  19. The one or more machine readable storage media of claim 17, wherein the instructions, when executed by the machine, are to cause the machine further to:
    insert the checkpoint into a function in the wasm file, the checkpoint to further collect multiple parameter values;
    execute the JIT’d code; and
    emit a checkpoint raw profile with the multiple parameter values.
  20. The one or more machine readable storage media of claim 19, wherein the instructions, when executed by the machine, are to cause the machine further to:
    generate an inference operator for the wasm file;
    monitor a profile based on the inference operator and the checkpoint raw profile, to determine when there has been a change in the profile; and
    regenerate the JIT’d code when there has been a change in the profile.
  21. The one or more machine readable storage media of claim 17, wherein the instructions, when executed by the machine, are to cause the machine further to inference when a conditional branch will occur from an input parameter.
  22. The one or more machine readable storage media of claim 17, wherein the instructions, when executed by the machine, are to cause the machine further to inference when a conditional branch will occur from a global value range.
  23. The one or more machine readable storage media of claim 17, wherein the instructions, when executed by the machine, are to cause the machine further to inference when a specific memory access is to occur.
  24. The one or more machine readable storage media of claim 17, wherein the instructions, when executed by the machine, are to cause the machine further to define the checkpoint as a first checkpoint, and insert a second checkpoint in the wasm file, the second checkpoint to collect a value for global data accesses by the function.
  25. The one or more machine readable storage media of claim 17, wherein the instructions, when executed by the machine, are to cause the machine further to dump runtime data during execution of the JIT’d code to generate checkpoint raw profiles for use in monitoring profile changes.