WO2019173937A1

WO2019173937A1 - Improved memory-mapped input/output (mmio) region access

Info

Publication number: WO2019173937A1
Application number: PCT/CN2018/078684
Authority: WO
Inventors: Yu Zhou; Bing NIU
Original assignee: Intel Corporation
Priority date: 2018-03-12
Filing date: 2018-03-12
Publication date: 2019-09-19
Also published as: DE112018007268T5

Abstract

A computing apparatus is provided, including: a hardware platform (102); a virtual machine manager (VMM) (104) including a virtual machine control structure (VMCS) (204), wherein the VMM(104) is configured to provision a virtual machine (VM) according to the VMCS (204); a data structure including an extended page table (EPT) (134) for the VM having a translation lookaside buffer (TLB), the TLB including a region having a passthrough region including a direct guest virtual address (GVA) to host physical address (HPA) translations; and logic to lock the passthrough region to prevent the passthrough region from being evicted from the TLB.

Description

IMPROVED MEMORY-MAPPED INPUT/OUTPUT (MMIO) REGION ACCESS

Field of the specification

This disclosure relates in general to the field of network computing, and more particularly, though not exclusively, to a system and method for improved memory-mapped input/output (MMIO) region access.

Background

In some modern data centers, the function of a device or appliance may not be tied to a specific, fixed hardware configuration. Rather, processing, memory, storage, and accelerator functions may in some cases be aggregated from different locations to form a virtual “composite node. ” A contemporary network may include a data center hosting a large number of generic hardware server devices, contained in a server rack for example, and controlled by a hypervisor. Each hardware device may run one or more instances of a virtual device, such as a workload server or virtual desktop.

In a virtualized environment such as a data center, single root input/output virtualization (SR-IOV) allows for a single physical peripheral component interconnect express (PCIe) bus to be shared by multiple guest systems in the virtual environment. SR-IOV is a specification that allows isolation of PCIe resources, which can improve manageability and performance. All of the guest systems in the virtualized environment can share this single PCIe hardware interface. SR-IOV allows each of the guest hosts in a virtualized environment, such as VMs, to share the single PCIe hardware interface across multiple virtual function instances. SR-IOV has found wide adoption in virtualization environments such as network function virtualization (NFV) .

Memory-mapped input/output (MMIO) is an input/output (I/O) method for providing coherency between the central processing unit (CPU) and peripheral devices in a computer. In MMIO, the memory and registers of I/O devices are mapped to the same address space as the main memory of the host device. Thus, when the CPU accesses a memory address, it may refer either to a portion of main memory, or to one of the mapped memory regions.

Brief Description of the Drawings

The present disclosure is best understood from the following detailed description when read with the accompanying figures. It is emphasized that, in accordance with the standard practice in the industry, various features are not necessarily drawn to scale, and are used for illustration purposes only. Where a scale is shown, explicitly or implicitly, it provides only one illustrative example. In other embodiments, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.

FIGURE 1 is a block diagram of selected elements of a data center, according to one or more examples of the present specification.

FIGURE 2 is a block diagram illustrating selected elements of an extended virtual machine control structure (VMCS) , according to one or more examples of the present specification.

FIGURE 3 illustrates an example of translation lookaside buffer (TLB) locking, according to one or more examples of the present specification.

FIGURE 4 is a block diagram illustrating passthrough access, according to one or more examples of the present specification.

FIGURE 5 is a flowchart of a method that may be performed, according to one or more examples of the present specification.

FIGURE 6 is a block diagram of selected components of a data center with network connectivity, according to one or more examples of the present application.

FIGURE 7 is a block diagram of selected components of an end-user computing device, according to one or more examples of the present specification.

FIGURE 8 is a block diagram of a network function virtualization (NFV) infrastructure, according to one or more examples of the present specification.

FIGURE 9 is a block diagram of components of a computing platform, according to one or more examples of the present specification.

Embodiments of the Disclosure

The following disclosure provides many different embodiments, or examples, for implementing different features of the present disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. Further, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. Different embodiments may have different advantages, and no particular advantage is necessarily required of any embodiment.

A contemporary computing platform, such as a hardware platform provided by

or similar, may include a capability for monitoring device performance and making decisions about resource provisioning. For example, in a large data center such as may be provided by a cloud service provider (CSP) , the hardware platform may include rackmounted servers with compute resources such as processors, memory, storage pools, accelerators, and other similar resources. As used herein, “cloud computing” includes network-connected computing resources and technology that enables ubiquitous (often worldwide) access to data, resources, and/or technology. Cloud resources are generally characterized by great flexibility to dynamically assign resources according to current workloads and needs. This can be accomplished, for example, via virtualization, wherein resources such as hardware, storage, and networks are provided to a virtual machine (VM) via a software abstraction layer, and/or containerization, wherein instances of network functions are provided in “containers” that are separated from one another, but that share underlying operating system, memory, and driver resources.

In large data centers and HPC clusters, network latency may be a premium design consideration. In some cases, the virtual network interface card (vNIC) and/or the virtual switch (vSwitch) can become a bottleneck.

For example, many telecommunication industry applications may require high throughput and low latency in the network. In these implementations, I/O performance may be treated as a premium factor. Many of these telecommunications and other data center applications have moved away from dedicated, single-use network appliances, and instead are providing network services via network function virtualization (NFV) , or in a particular network service may be provided by a virtual network function (VNF) . With such heavy reliance on VNFs, the performance of the vSwitch can become a limiting factor to the overall latency of the network.

An issue that heavily affects the performance of the vNIC in a virtualized environment is the overhead of memory address translation. Applications running on a VM or VNF make calls to a guest virtual address (GVA) . The virtualized system that the VNF is running on has a virtualized memory bank which provides an addressable guest physical address (GPA) . The GPA maps to a particular region of the host physical address (HPA) , or in other words, the virtualized host has a dedicated region of HPA that its memory addresses map to. Thus, a memory operation on the VNF may first require a GVA to GPA translation, and second, a GPA to HPA translation.

In an existing system, a page table (PT) includes the GVA to GPA mapping. The PT may be owned by the virtual machine. An extended page table (EPT) includes the GPA to HPA mapping, and is owned by the physical host. Both the PT and the EPT may employ the translation lookaside buffer (TLB) as content-addressable memory (CAM) that caches recent memory translations (including GVA to HPA mapping) . If a requested physical address is present in the TLB this is referred to as a TLB hit, which enables the GVA to HPA translation to be performed very quickly. However, if a requested address is not present in the TLB, this is defined as a TLB miss. If there is a miss on the TLB, then a two-stage page table lookup, or “walk, ” may be employed to repopulate the TLB and provide the correct address translation. Note that in various literature, this table walk may be referred to as a “two-phase, ” “two-stage, ” or “two-part” table walk, or similar. Throughout this specification, these terms may be considered synonymous.

Although the EPT is optimized to avoid the heavy overhead of a conventional mechanism such as a shadow page table, it does this at the expense of an extra two-stage page walk when a TLB miss occurs. Thus, a TLB miss can incur a substantial performance hit (for example, a 4x degradation in performance for non-virtualized cases, and 24x for virtualized cases) .

TLB misses can become important bottlenecks in the virtualized environment, especially in I/O-sensitive scenarios like carrier-grade NFV. With the increased throughput capability of modern NICs, which may have 10, 25, 50, or 100 gigabit Ethernet (GbE) capacities, memory-mapped input/output (MMIO) region access for a passthrough device such as an SR-IOV VF may be highly intensive for get or set data descriptors from or to transmit (Tx) or receive (Rx) cues.

However, according to embodiments of the present specification, this overhead can be mitigated for MMIO access on a passthrough device. Specifically, in the case of a TLB miss, to mitigate the heavy I/O performance hit in MMIO, the translation from GVA to HPA of an MMIO region for the passthrough device can be locked into a TLB while the guest operating system (OS) is running. The translation information can then be added in and provided by a virtual machine control structure (VMCS) to a TLB to lock the critical memory region.

This architecture reduces latency in the virtualized network, and thus increases performance for passthrough network devices. This can result in substantial improvements in NFV and cloud-type deployments. This solution enhances the access efficiency of an MMIO region for a passthrough device without requiring any dependence on modified CPU instructions. However, it should be noted that embodiments of the present specification could be provided by novel CPU or microcode instructions. This can help to reduce the cost of the solution, while providing improved NFV performance in data centers.

A system and method for improved memory-mapped input/output (MMIO) region access will now be described with more particular reference to the attached FIGURES. It should be noted that throughout the FIGURES, certain reference numerals may be repeated to indicate that a particular device or block is wholly or substantially consistent across the FIGURES. This is not, however, intended to imply any particular relationship between the various embodiments disclosed. In certain examples, a genus of elements may be referred to by a particular reference numeral ( “widget 10” ) , while individual species or examples of the genus may be referred to by a hyphenated numeral ( “first specific widget 10-1” and“second specific widget 10-2” ) .

In the example of FIGURE 1, a hardware platform 102 includes a virtual machine manager (VMM) 104. In this example, two guests are provisioned on hardware platform 102, namely guest A 108-1 and guest B 108-2. Peripheral devices A 112-1 and B 112-2 are provisioned in a passthrough configuration, for example via SR-IOV, so that guest A 108-1 has access to device A 112-1, and guest B 108-2 has direct access to device B 112-2. Note that devices 112 could be peripheral devices, accelerators, coprocessors, or other devices that may be used to assist guests 108 in performing their function.

As illustrated on the right side of the FIGURE, host address space 130 may include not only the main memory of hardware platform 102, but may also include MMIO regions for device A and device B, namely region 132 for device A and region 136 for device B.

When guest A 108-1 is provisioned, it is allocated guest A address space 116. Guest A address space 116 represents a region of an address space that includes a virtual main memory for guest A, which may be allocated from the main memory of hardware platform 102. Within guest A address space 116, there is a device A MMIO mapping from the point of view of guest A 108-1. This mapping 120 corresponds to MMIO region 132 for device A. Similarly, when guest B 108-2 is provisioned, a guest B address space 140 is allocated, including a device B MMIO 142 from the point of view of guest B 108-2. This maps to MMIO region 136 for device B.

Once mappings are established between MMIO regions 120 and 132 and MMIO regions 142 and 136 respectively, these mappings may remain constant while guests 108 remain active.

In certain existing systems, a page table 118 may be allocated and owned by guest A 108-1. Page table 118 includes GVA to GPA mapping.

Similarly, hardware platform 102 may include an EPT 134, which includes GPA to HPA mapping. A TLB may also be provided for both PT and EPT 134, then the combined map from GVA to HPA accessed recently will be cached. However, like a memory cache, entries can be evicted from the TLB, and a TLB miss can result in a page walk for the GVA to GPA mapping.

In this configuration, in case of a TLB miss for an access to a passthrough device 112, a two-phase page walk may be utilized, namely a page walk of PT 118 for GVA to GPA mapping, as well as a page walk of EPT 134 for GPA to HPA mapping. Such a TLB miss can be a very expensive penalty for an access to a passthrough device 112. In a VM, the penalty for a two-phase page walk may be increased exponentially.

To mitigate the expense of such a TLB miss, the present specification introduces a TLB locking and unlocking mechanism wherein certain MMIO mappings are locked upon entry of a VM, and unlocked only upon schedule out of the VM. Specifically, translations from GVA to HPA may be locked in a TLB for the MMIO region to prevent TLB misses once the guest 108 has been scheduled to run. These entries are unlocked when the guest is scheduled to release the occupied entries in the TLB.

To realize this new architecture, two elements are introduced. First, a modified TLB is introduced which can lock a direct GVA to HPA mapping that is owned by a guest 108. These locked mappings may be specific to an MMIO region for a device 112, and in some examples, a heterogeneous structure may be provided wherein a TLB specifically for the MMIO region for the passthrough device is allocated with locked entries. Other memory mappings may remain unlocked. Note that the GPA is determined when the guest sets up its BAR, but the GVA is determined only when the guest attempts to access a region (such as an MMIO) region with that GVA. When a guest access to the MMIO region occurs, a page fault may result, and the host may then find the GVA to GPA mapping from the guest’s PT. Thus, GVA to HPA mapping can be determined at this point.

The second element of providing the improved TLB is a method of manipulating a TLB while the guest is running with its passthrough device, combined with extensions to the virtual machine control structure (VMCS) .

In an embodiment, information on direct GVA to HPA translation can be collected when passthrough device 112 is enumerated by guest 108 for base address register (BAR) configuration and MMIO region first-time access. Thus, the mapping of GVA to HPA of an MMIO region for a passthrough device 112 is determined and constant from device enumeration. No other VM or VMM need be involved, because passthrough device 112 has been allocated and dedicated to guest 108.

FIGURE 2 is a block diagram illustrating selected elements of an extended VMCS 204, according to one or more examples of the present specification.

Modified VMCS 204 may provide the ability to lock TLB entries during VM scheduling. The method provided may be a hardware assisted technique combined with the VMCS.

VMCS 204 includes existing VMCS fields, such as VIRTUAL_PROCESSOR_ID, POSTED_INTR_NV, and HOST_RIP by way of illustrative and nonlimiting example.

VMCS 204 also includes extended VMCS fields including, for example, VM_LOCK_TLB_NR, VM_LOCK_TLB_ARRAY_ADDR, and VM_LOCK_TLB_ARRAY_ADDR_HIGH.

The field VM_LOCK_TLB_NR represents the number of TLB locking entries that are provided in the present instance. VM_LOCK_TLB_ARRAY_ADDR is a pointer to a position in the memory where TLB locking array 208 begins. Thus, starting from VM_LOCK_TLB_ARRAY_ADDR, VM_LOCK_TLB_NR entries follow, with each entry providing a locked direct GVA to HPA translation. Because these entries are locked, they are never evicted so long as the lock remains. In some embodiments, the lock is established when the guest is provisioned and the passthrough device is enumerated, and the lock is not released until the VM is terminated due to end of life cycle or scheduled out to relinquish the logical core to other VMs.

The new field VM_LOCK_TLB_ARRAY_ADDR_HIGH may be a flag or other data structure that indicates, for a 64-bit system, the upper 32 bits of the VM TLB locking address should be used.

Each element in TLB locking array 208 includes a direct GVA to HPA translation that is locked when the guest’s VMCS is loaded to a particular logic core.

FIGURE 3 illustrates an example of TLB locking, according to one or more examples of the present specification. When a guest such as a virtual machine or VNF is allocated, guest VMCS 304, including appropriate extensions such as those illustrated in FIGURE 2, may be used to allocate TLB 308. As illustrated in FIGURE 2, TLB 308 includes a plurality of direct GVA to HPA mappings, that may be particularly directed to an MMIO region for a passthrough device. Upon establishment of GVA to HPA, the entries in TLB 308 are locked as part of scheduling in for logic core 312.

Logic core 312 then performs its virtualized function or guest function for some time. While logic core 312 is providing its function, because the MMIO entries in TLB 308 are locked, they are never evicted from TLB 308, thus there is never an instance of a TLB miss for those MMIO entries, and thus there is no need for an expensive two-phase page walk for memory accesses to the passthrough device.

Eventually, logic core 312 may receive a schedule out event, meaning for example that the guest will at least temporarily not be using or accessing the passthrough device.

Upon scheduling out, logic core 312 unlocks the MMIO region of TLB 308, and control returns to the VMM.

Stated otherwise, TLB locking occurs automatically from the VMCS upon dispatch. Unlocking occurs automatically from the VMCS once swapout occurs.

With the TLB locking and VMCS capability of the present specification, translations of the MMIO region of a passthrough device are loaded and locked into the TLB automatically when the guest is scheduled to run and its VMCS is deployed to a logic core.

The mapping between GVA and HPA of the MMIO region for a passthrough device can be obtained from the TLB directly, with no need to access PT and EPT from DRAM with the heavy overhead of a two-phase page walk in the virtualization environment.

The TLB MMIO region is locked while the guest is running so that this portion of the TLB will not be swapped out from the TLB until the guest is scheduled out. Note that this embodiment uses limited TLB entries, because each GVA is mapped, for example, to a 4 kB physical memory page by default. However, this limited consumption of TLB entries provides significant improvement for MMIO region access of a passthrough device.

FIGURE 4 is a block diagram illustrating passthrough access according to one or more examples of the present specification. The method of FIGURE 4 specifically illustrates a method of avoiding the two-level page walk penalty for TLB translations to access an MMIO region of a VF directly from the VM.

As illustrated in FIGURE 4, use of the disclosed architecture to access virtual I/O devices (e.g., via SR-IOV VF) can provide sometimes even better performance than in a direct hardware environment. This provides intensive MMIO access with low latency and high throughput packet forwarding, which may be valuable to telecommunication NFVs, for example, such as a virtual firewall based on a VM with an SR-IOV vNIC.

For example, in FIGURE 4, VMCS 404 includes VMCS extensions 408. Using VMCS extensions 408, logical core 416 allocates a TLB 412 which includes an MMIO region map. MMIO region map includes direct GVA to HPA mapping for the MMIO regions. Mapping information from the VMCS is locked into TLB 412 to access the MMIO region directly.

Intensive MMIO access in the data plane VNF may be provided by vNIC 424, which may include functions such as virtual firewall (vFW) , virtual router (vRouter) , virtual evolved packet core (vEPC) , or similar.

vNIC 424 provides passthrough access, bypassing hypervisor 430, and accessing the MMIO region map. The MMIO region map may include transmit cues (TXQs) and receive cues (RXQs) with transmit destinations and receive destinations. Within MMIO region 440, MMIO region map of TLB 412 includes direct GVA to HPA mapping. Thus, logical core 416 can access MMIO region 440 directly, without fetching memory 450 for the translation from PT and EPT.

This avoids a two-phase page walk in memory 450 in the case of a TLB miss. Such a two-phase page walk can incur up to a 5-times penalty as opposed to a native TLB miss or 24-times to a direct TLB hit under virtualization environment. A page walk may require getting the HPA from a page table in memory 450 and then rebuilding the GVA to GPA and GPA to HPA mapping.

FIGURE 5 is a flowchart of a method 500 that may be performed, according to one or more examples of the present specification.

In block 504, a guest starts or is scheduled to run.

In block 508, a passthrough device has been allocated to the guest, and the guest enumerates the passthrough device. Upon enumeration, the GPA to HPA mapping is established. GVA to GPA mapping is established upon the first MMIO region access.

In block 512, the guest host may use an extended VMCS to build a TLB with an MMIO region, wherein the MMIO region includes direct GVA to HPA mapping for the passthrough device. Note that the TLB may also include other entries, which may include more traditional GVA to HPA-only mappings, and which may be subject to eviction according to ordinary eviction rules.

In block 516, the guest locks the MMIO region entries.

In block 520, the guest host performs its work, which may include high throughput and low latency access to the passthrough device, which may continue for as long as the guest host is performing its function. Because the MMIO region entries in the TLB are locked, those are not evicted from the TLB.

In block 524, the guest is scheduled out, which may include the guest being scheduled for termination or swapped out so that another guest can use the logic core. Consequently, the guest will temporarily or permanently not manipulate the passthrough device.

After the guest is scheduled out, in block 518, the guest unlocks the MMIO region entries in the TLB. These entries are now subject to eviction according to ordinary rules, and either the guest yields the logic core to neighbor VMs, or the guest itself terminates.

FIGURE 6 is a block diagram of selected components of a data center with connectivity to network 600 of a cloud service provider (CSP) 602, according to one or more examples of the present specification. CSP 602 may be, by way of nonlimiting example, a traditional enterprise data center, an enterprise “private cloud, ” or a “public cloud, ” providing services such as infrastructure as a service (IaaS) , platform as a service (PaaS) , or software as a service (SaaS) . In some cases, CSP 602 may provide, instead of or in addition to cloud services, high-performance computing (HPC) platforms or services. Indeed, while not expressly identical, HPC clusters ( “supercomputers” ) may be structurally similar to cloud data centers, and unless and except where expressly specified, the teachings of this specification may be applied to either.

CSP 602 may provision some number of workload clusters 618, which may be clusters of individual servers, blade servers, rackmount servers, or any other suitable server topology. In this illustrative example, two workload clusters, 618-1 and 618-2 are shown, each providing rackmount servers 646 in a chassis 648.

In this illustration, workload clusters 618 are shown as modular workload clusters conforming to the rack unit ( “U” ) standard, in which a standard rack, 19 inches wide, may be built to accommodate 42 units (42U) , each 1.75 inches high and approximately 36 inches deep. In this case, compute resources such as processors, memory, storage, accelerators, and switches may fit into some multiple of rack units from one to 42.

Each server 646 may host a standalone operating system and provide a server function, or servers may be virtualized, in which case they may be under the control of a virtual machine manager (VMM) , hypervisor, and/or orchestrator, and may host one or more virtual machines, virtual servers, or virtual appliances. These server racks may be collocated in a single data center, or may be located in different geographic data centers. Depending on the contractual agreements, some servers 646 may be specifically dedicated to certain enterprise clients or tenants, while others may be shared.

The various devices in a data center may be connected to each other via a switching fabric 670, which may include one or more high speed routing and/or switching devices. Switching fabric 670 may provide both “north-south” traffic (e.g., traffic to and from the wide area network (WAN) , such as the internet) , and “east-west” traffic (e.g., traffic across the data center) . Historically, north-south traffic accounted for the bulk of network traffic, but as web services become more complex and distributed, the volume of east-west traffic has risen. In many data centers, east-west traffic now accounts for the majority of traffic.

Furthermore, as the capability of each server 646 increases, traffic volume may further increase. For example, each server 646 may provide multiple processor slots, with each slot accommodating a processor having four to eight cores, along with sufficient memory for the cores. Thus, each server may host a number of VMs, each generating its own traffic.

To accommodate the large volume of traffic in a data center, a highly capable switching fabric 670 may be provided. Switching fabric 670 is illustrated in this example as a “flat” network, wherein each server 646 may have a direct connection to a top-of-rack (ToR) switch 620 (e.g., a “star” configuration) , and each ToR switch 620 may couple to a core switch 630. This two-tier flat network architecture is shown only as an illustrative example. In other examples, other architectures may be used, such as three-tier star or leaf-spine (also called “fat tree” topologies) based on the “Clos” architecture, hub-and-spoke topologies, mesh topologies, ring topologies, or 3D mesh topologies, by way of nonlimiting example.

The fabric itself may be provided by any suitable interconnect. For example, each server 646 may include an

Host Fabric Interface (HFI) , a network interface card (NIC) , a host channel adapter (HCA) , or other host interface. For simplicity and unity, these may be referred to throughout this specification as a “host fabric interface” (HFI) , which should be broadly construed as an interface to communicatively couple the host to the data center fabric. The HFI may couple to one or more host processors via an interconnect or bus, such as PCI, PCIe, or similar. In some cases, this interconnect bus, along with other “local” interconnects (e.g., core-to-core Ultra Path Interconnect) may be considered to be part of fabric 670. In other embodiments, the UPI (or other local coherent interconnect) may be treated as part of the secure domain of the processor complex, and thus not part of the fabric.

The interconnect technology may be provided by a single interconnect or a hybrid interconnect, such as where PCIe provides on-chip communication, 1Gb or 10Gb copper Ethernet provides relatively short connections to a ToR switch 620, and optical cabling provides relatively longer connections to core switch 630. Interconnect technologies that may be found in the data center include, by way of nonlimiting example,

Omni-Path ^TM Architecture (OPA) , TrueScale ^TM, Ultra Path Interconnect (UPI) (formerly called QPI or KTI) , FibreChannel, Ethernet, FibreChannel over Ethernet (FCoE) , InfiniBand, PCI, PCIe, or fiber optics, to name just a few. The fabric may be cache-and memory-coherent, cache-and memory-non-coherent, or a hybrid of coherent and non-coherent interconnects. Some interconnects are more popular for certain purposes or functions than others, and selecting an appropriate fabric for the instant application is an exercise of ordinary skill. For example, OPA and Infiniband are commonly used in high-performance computing (HPC) applications, while Ethernet and FibreChannel are more popular in cloud data centers. But these examples are expressly nonlimiting, and as data centers evolve fabric technologies similarly evolve.

Note that while high-end fabrics such as OPA are provided herein by way of illustration, more generally, fabric 670 may be any suitable interconnect or bus for the particular application. This could, in some cases, include legacy interconnects like local area networks (LANs) , token ring networks, synchronous optical networks (SONET) , asynchronous transfer mode (ATM) networks, wireless networks such as WiFi and Bluetooth, “plain old telephone system” (POTS) interconnects, or similar. It is also expressly anticipated that in the future, new network technologies will arise to supplement or replace some of those listed here, and any such future network topologies and technologies can be or form a part of fabric 670.

In certain embodiments, fabric 670 may provide communication services on various “layers, ” as originally outlined in the Open Systems Interconnection (OSI) seven-layer network model. In contemporary practice, the OSI model is not followed strictly. In general terms, layers 1 and 2 are often called the “Ethernet” layer (though in some data centers or supercomputers, Ethernet may be supplanted or supplemented by newer technologies) . Layers 3 and 4 are often referred to as the transmission control protocol/internet protocol (TCP/IP) layer (which may be further subdivided into TCP and IP layers) . Layers 5–7 may be referred to as the “application layer. ” These layer definitions are disclosed as a useful framework, but are intended to be nonlimiting.

FIGURE 7 is a block diagram of an end-user computing device 700, according to one or more examples of the present specification. As above, computing device 700 may provide, as appropriate, cloud service, high-performance computing, telecommunication services, enterprise data center services, or any other compute services that benefit from a computing device 700.

In this example, a fabric 770 is provided to interconnect various aspects of computing device 700. Fabric 770 may be the same as fabric 670 of FIGURE 6, or may be a different fabric. As above, fabric 770 may be provided by any suitable interconnect technology. In this example,

Omni-Path ^TM is used as an illustrative and nonlimiting example.

As illustrated, computing device 700 includes a number of logic elements forming a plurality of nodes. It should be understood that each node may be provided by a physical server, a group of servers, or other hardware. Each server may be running one or more virtual machines as appropriate to its application.

Node 0 708 is a processing node including a processor socket 0 and processor socket 1. The processors may be, for example,

Xeon ^TM processors with a plurality of cores, such as 4 or 8 cores. Node 0 708 may be configured to provide network or workload functions, such as by hosting a plurality of virtual machines or virtual appliances.

Onboard communication between processor socket 0 and processor socket 1 may be provided by an onboard uplink 778. This may provide a very high speed, short-length interconnect between the two processor sockets, so that virtual machines running on node 0708 can communicate with one another at very high speeds. To facilitate this communication, a virtual switch (vSwitch) may be provisioned on node 0 708, which may be considered to be part of fabric 770.

Node 0 708 connects to fabric 770 via an HFI 772. HFI 772 may connect to an

Omni-Path ^TM fabric. In some examples, communication with fabric 770 may be tunneled, such as by providing UPI tunneling over Omni-Path ^TM.

Because computing device 700 may provide many functions in a distributed fashion that in previous generations were provided onboard, a highly capable HFI 772 may be provided. HFI 772 may operate at speeds of multiple gigabits per second, and in some cases may be tightly coupled with node 0 708. For example, in some embodiments, the logic for HFI 772 is integrated directly with the processors on a system-on-a-chip. This provides very high speed communication between HFI 772 and the processor sockets, without the need for intermediary bus devices, which may introduce additional latency into the fabric. However, this is not to imply that embodiments where HFI 772 is provided over a traditional bus are to be excluded. Rather, it is expressly anticipated that in some examples, HFI 772 may be provided on a bus, such as a PCIe bus, which is a serialized version of PCI that provides higher speeds than traditional PCI. Throughout computing device 700, various nodes may provide different types of HFIs 772, such as onboard HFIs and plug-in HFIs. It should also be noted that certain blocks in a system on a chip may be provided as intellectual property (IP) blocks that can be “dropped” into an integrated circuit as a modular unit. Thus, HFI 772 may in some cases be derived from such an IP block.

Note that in “the network is the device” fashion, node 0 708 may provide limited or no onboard memory or storage. Rather, node 0 708 may rely primarily on distributed services, such as a memory server and a networked storage server. Onboard, node 0 708 may provide only sufficient memory and storage to bootstrap the device and get it communicating with fabric 770. This kind of distributed architecture is possible because of the very high speeds of contemporary data centers, and may be advantageous because there is no need to over-provision resources for each node. Rather, a large pool of high-speed or specialized memory may be dynamically provisioned between a number of nodes, so that each node has access to a large pool of resources, but those resources do not sit idle when that particular node does not need them.

In this example, a node 1 memory server 704 and a node 2 storage server 710 provide the operational memory and storage capabilities of node 0 708. For example, memory server node 1 704 may provide remote direct memory access (RDMA) , whereby node 0 708 may access memory resources on node 1 704 via fabric 770 in a direct memory access fashion, similar to how it would access its own onboard memory. The memory provided by memory server 704 may be traditional memory, such as double data rate type 3 (DDR3) dynamic random access memory (DRAM) , which is volatile, or may be a more exotic type of memory, such as a persistent fast memory (PFM) like

3D Crosspoint ^TM (3DXP) , which operates at DRAM-like speeds, but is nonvolatile.

Similarly, rather than providing an onboard hard disk for node 0 708, a storage server node 2 710 may be provided. Storage server 710 may provide a networked bunch of disks (NBOD) , PFM, redundant array of independent disks (RAID) , redundant array of independent nodes (RAIN) , network attached storage (NAS) , optical storage, tape drives, or other nonvolatile memory solutions.

Thus, in performing its designated function, node 0 708 may access memory from memory server 704 and store results on storage provided by storage server 710. Each of these devices couples to fabric 770 via a HFI 772, which provides fast communication that makes these technologies possible.

By way of further illustration, node 3 706 is also depicted. Node 3 706 also includes a HFI 772, along with two processor sockets internally connected by an uplink. However, unlike node 0 708, node 3 706 includes its own onboard memory 722 and storage 750. Thus, node 3 706 may be configured to perform its functions primarily onboard, and may not be required to rely upon memory server 704 and storage server 710. However, in appropriate circumstances, node 3 706 may supplement its own onboard memory 722 and storage 750 with distributed resources similar to node 0 708.

Computing device 700 may also include accelerators 730. These may provide various accelerated functions, including hardware or coprocessor acceleration for functions such as packet processing, encryption, decryption, compression, decompression, network security, or other accelerated functions in the data center. In some examples, accelerators 730 may include deep learning accelerators that may be directly attached to one or more cores in nodes such as node 0 708 or node 3 706. Examples of such accelerators can include, by way of nonlimiting example,

QuickData Technology (QDT) ,

QuickAssist Technology (QAT) ,

Direct Cache Access (DCA) ,

Extended Message Signaled Interrupt (MSI-X) ,

Receive Side Coalescing (RSC) , and other acceleration technologies.

In other embodiments, an accelerator could also be provided as an application-specific integrated circuit (ASIC) , field-programmable gate array (FPGA) , coprocessor, graphics processing unit (GPU) , digital signal processor (DSP) , or other processing entity, which may optionally be tuned or configured to provide the accelerator function.

The basic building block of the various components disclosed herein may be referred to as “logic elements. ” Logic elements may include hardware (including, for example, a software-programmable processor, an ASIC, or an FPGA) , external hardware (digital, analog, or mixed-signal) , software, reciprocating software, services, drivers, interfaces, components, modules, algorithms, sensors, components, firmware, microcode, programmable logic, or objects that can coordinate to achieve a logical operation. Furthermore, some logic elements are provided by a tangible, non-transitory computer-readable medium having stored thereon executable instructions for instructing a processor to perform a certain task. Such a non-transitory medium could include, for example, a hard disk, solid state memory or disk, read-only memory (ROM) , persistent fast memory (PFM) (e.g.,

3D Crosspoint ^TM) , external storage, redundant array of independent disks (RAID) , redundant array of independent nodes (RAIN) , network-attached storage (NAS) , optical storage, tape drive, backup system, cloud storage, or any combination of the foregoing by way of nonlimiting example. Such a medium could also include instructions programmed into an FPGA, or encoded in hardware on an ASIC or processor.

FIGURE 8 is a block diagram of a network function virtualization (NFV) infrastructure 800 according to one or more examples of the present specification. NFV is an aspect of network virtualization that is generally considered distinct from, but that can still interoperate with a software-defined network (SDN) . For example, virtual network functions (VNFs) may operate within the data plane of an SDN deployment. NFV was originally envisioned as a method for providing reduced capital expenditure (Capex) and operating expenses (Opex) for telecommunication services. One feature of NFV is replacing proprietary, special-purpose hardware appliances with virtual appliances running on commercial off-the-shelf (COTS) hardware within a virtualized environment. In addition to Capex and Opex savings, NFV provides a more agile and adaptable network. As network loads change, virtual network functions (VNFs) can be provisioned ( “spun up” ) or removed ( “spun down” ) to meet network demands. For example, in times of high load, more load balancer VNFs may be spun up to distribute traffic to more workload servers (which may themselves be virtual machines) . In times when more suspicious traffic is experienced, additional firewalls or deep packet inspection (DPI) appliances may be needed.

Because NFV started out as a telecommunications feature, many NFV instances are focused on telecommunications. However, NFV is not limited to telecommunication services. In a broad sense, NFV includes one or more VNFs running within a network function virtualization infrastructure (NFVI) , such as NFVI 400. Often, the VNFs are inline service functions that are separate from workload servers or other nodes. These VNFs can be chained together into a service chain, which may be defined by a virtual subnetwork, and which may include a serial string of network services that provide behind-the-scenes work, such as security, logging, billing, and similar.

Like SDN, NFV is a subset of network virtualization. In other words, certain portions of the network may rely on SDN, while other portions (or the same portions) may rely on NFV.

In the example of FIGURE 8, an NFV orchestrator 801 manages a number of the VNFs 812 running on an NFVI 800. NFV involves nontrivial resource management, such as allocating a very large pool of compute resources among appropriate numbers of instances of each VNF, managing connections between VNFs, determining how many instances of each VNF to allocate, and managing memory, storage, and network connections. This may involve complex software management, thus making NFV orchestrator 801 a valuable system resource. Note that NFV orchestrator 801 may provide a browser-based or graphical configuration interface, and in some embodiments may be integrated with SDN orchestration functions.

Note that NFV orchestrator 801 itself may be virtualized (rather than a special-purpose hardware appliance) . NFV orchestrator 801 may be integrated within an existing SDN system, wherein an operations support system (OSS) manages the SDN. This may interact with cloud resource management systems (e.g., OpenStack) to provide NFV orchestration. An NFVI 800 may include the hardware, software, and other infrastructure to enable VNFs to run. This may include a hardware platform 802 on which one or more VMs 804 may run. For example, hardware platform 802-1 in this example runs VMs 804-1 and 804-2. Hardware platform 802-2 runs VMs 804-3 and 804-4. Each hardware platform may include a hypervisor 820, virtual machine manager (VMM) , or similar function, which may include and run on a native (bare metal) operating system, which may be minimal so as to consume very few resources.

Hardware platforms 802 may be or comprise a rack or several racks of blade or slot servers (including, e.g., processors, memory, and storage) , one or more data centers, other hardware resources distributed across one or more geographic locations, hardware switches, or network interfaces. An NFVI 800 may also include the software architecture that enables hypervisors to run and be managed by NFV orchestrator 801.

Running on NFVI 800 are a number of VMs 804, each of which in this example is a VNF providing a virtual service appliance. Each VM 804 in this example includes an instance of the Data Plane Development Kit (DVDK) , a virtual operating system 808, and an application providing the VNF 812.

Virtualized network functions could include, as nonlimiting and illustrative examples, firewalls, intrusion detection systems, load balancers, routers, session border controllers, deep packet inspection (DPI) services, network address translation (NAT) modules, or call security association.

The illustration of FIGURE 8 shows that a number of VNFs 804 have been provisioned and exist within NFVI 800. This figure does not necessarily illustrate any relationship between the VNFs and the larger network, or the packet flows that NFVI 800 may employ.

The illustrated DPDK instances 816 provide a set of highly-optimized libraries for communicating across a virtual switch (vSwitch) 822. Like VMs 804, vSwitch 822 is provisioned and allocated by a hypervisor 820. The hypervisor uses a network interface to connect the hardware platform to the data center fabric (e.g., an HFI) . This HFI may be shared by all VMs 804 running on a hardware platform 802. Thus, a vSwitch may be allocated to switch traffic between VMs 804. The vSwitch may be a pure software vSwitch (e.g., a shared memory vSwitch) , which may be optimized so that data are not moved between memory locations, but rather, the data may stay in one place, and pointers may be passed between VMs 804 to simulate data moving between ingress and egress ports of the vSwitch. The vSwitch may also include a hardware driver (e.g., a hardware network interface IP block that switches traffic, but that connects to virtual ports rather than physical ports) . In this illustration, a distributed vSwitch 822 is illustrated, wherein vSwitch 822 is shared between two or more physical hardware platforms 802.

FIGURE 9 is a block diagram of components of a computing platform 902A according to one or more examples of the present specification. In the embodiment depicted,

platforms

902A, 902B, and 902C, along with a data center management platform 906 and data analytics engine 904 are interconnected via network 908. In other embodiments, a computer system may include any suitable number of (i.e., one or more) platforms. In some embodiments (e.g., when a computer system only includes a single platform) , all or a portion of the system management platform 906 may be included on a platform 902. A platform 902 may include platform logic 910 with one or more central processing units (CPUs) 912, memories 914 (which may include any number of different modules) , chipsets 916, communication interfaces 918, and any other suitable hardware and/or software to execute a hypervisor 920 or other operating system capable of executing workloads associated with applications running on platform 902. In some embodiments, a platform 902 may function as a host platform for one or more guest systems 922 that invoke these applications. Platform 902A may represent any suitable computing environment, such as a high performance computing environment, a data center, a communications service provider infrastructure (e.g., one or more portions of an Evolved Packet Core) , an in-memory computing environment, a computing system of a vehicle (e.g., an automobile or airplane) , an Internet of Things environment, an industrial control system, other computing environment, or combination thereof.

In various embodiments of the present disclosure, accumulated stress and/or rates of stress accumulated of a plurality of hardware resources (e.g., cores and uncores) are monitored and entities (e.g., system management platform 906, hypervisor 920, or other operating system) of computer platform 902A may assign hardware resources of platform logic 910 to perform workloads in accordance with the stress information. In some embodiments, self-diagnostic capabilities may be combined with the stress monitoring to more accurately determine the health of the hardware resources. Each platform 902 may include platform logic 910. Platform logic 910 comprises, among other logic enabling the functionality of platform 902, one or more CPUs 912, memory 914, one or more chipsets 916, and communication interfaces 928. Although three platforms are illustrated, computer platform 902A may be interconnected with any suitable number of platforms. In various embodiments, a platform 902 may reside on a circuit board that is installed in a chassis, rack, or other suitable structure that comprises multiple platforms coupled together through network 908 (which may comprise, e.g., a rack or backplane switch) .

CPUs 912 may each comprise any suitable number of processor cores and supporting logic (e.g., uncores) . The cores may be coupled to each other, to memory 914, to at least one chipset 916, and/or to a communication interface 918, through one or more controllers residing on CPU 912 and/or chipset 916. In particular embodiments, a CPU 912 is embodied within a socket that is permanently or removably coupled to platform 902A. Although four CPUs are shown, a platform 902 may include any suitable number of CPUs.

Memory 914 may comprise any form of volatile or nonvolatile memory including, without limitation, magnetic media (e.g., one or more tape drives) , optical media, random access memory (RAM) , read-only memory (ROM) , flash memory, removable media, or any other suitable local or remote memory component or components. Memory 914 may be used for short, medium, and/or long term storage by platform 902A. Memory 914 may store any suitable data or information utilized by platform logic 910, including software embedded in a computer readable medium, and/or encoded logic incorporated in hardware or otherwise stored (e.g., firmware) . Memory 914 may store data that is used by cores of CPUs 912. In some embodiments, memory 914 may also comprise storage for instructions that may be executed by the cores of CPUs 912 or other processing elements (e.g., logic resident on chipsets 916) to provide functionality associated with the manageability engine 926 or other components of platform logic 910. A platform 902 may also include one or more chipsets 916 comprising any suitable logic to support the operation of the CPUs 912. In various embodiments, chipset 916 may reside on the same die or package as a CPU 912 or on one or more different dies or packages. Each chipset may support any suitable number of CPUs 912. A chipset 916 may also include one or more controllers to couple other components of platform logic 910 (e.g., communication interface 918 or memory 914) to one or more CPUs. In the embodiment depicted, each chipset 916 also includes a manageability engine 926. Manageability engine 926 may include any suitable logic to support the operation of chipset 916. In a particular embodiment, a manageability engine 926 (which may also be referred to as an innovation engine) is capable of collecting real-time telemetry data from the chipset 916, the CPU (s) 912 and/or memory 914 managed by the chipset 916, other components of platform logic 910, and/or various connections between components of platform logic 910. In various embodiments, the telemetry data collected includes the stress information described herein.

In various embodiments, a manageability engine 926 operates as an out-of-band asynchronous compute agent which is capable of interfacing with the various elements of platform logic 910 to collect telemetry data with no or minimal disruption to running processes on CPUs 912. For example, manageability engine 926 may comprise a dedicated processing element (e.g., a processor, controller, or other logic) on chipset 916, which provides the functionality of manageability engine 926 (e.g., by executing software instructions) , thus conserving processing cycles of CPUs 912 for operations associated with the workloads performed by the platform logic 910. Moreover the dedicated logic for the manageability engine 926 may operate asynchronously with respect to the CPUs 912 and may gather at least some of the telemetry data without increasing the load on the CPUs.

A manageability engine 926 may process telemetry data it collects (specific examples of the processing of stress information will be provided herein) . In various embodiments, manageability engine 926 reports the data it collects and/or the results of its processing to other elements in the computer system, such as one or more hypervisors 920 or other operating systems and/or system management software (which may run on any suitable logic such as system management platform 906) . In particular embodiments, a critical event such as a core that has accumulated an excessive amount of stress may be reported prior to the normal interval for reporting telemetry data (e.g., a notification may be sent immediately upon detection) .

Additionally, manageability engine 926 may include programmable code configurable to set which CPU (s) 912 a particular chipset 916 will manage and/or which telemetry data will be collected.

Chipsets 916 also each include a communication interface 928. Communication interface 928 may be used for the communication of signaling and/or data between chipset 916 and one or more I/O devices, one or more networks 908, and/or one or more devices coupled to network 908 (e.g., system management platform 906) . For example, communication interface 928 may be used to send and receive network traffic such as data packets. In a particular embodiment, a communication interface 928 comprises one or more physical network interface controllers (NICs) , also known as network interface cards or network adapters. A NIC may include electronic circuitry to communicate using any suitable physical layer and data link layer standard such as Ethernet (e.g., as defined by a IEEE 802.3 standard) , Fibre Channel, InfiniBand, Wi-Fi, or other suitable standard. A NIC may include one or more physical ports that may couple to a cable (e.g., an Ethernet cable) . A NIC may enable communication between any suitable element of chipset 916 (e.g., manageability engine 926 or switch 930) and another device coupled to network 908. In various embodiments a NIC may be integrated with the chipset (i.e., may be on the same integrated circuit or circuit board as the rest of the chipset logic) or may be on a different integrated circuit or circuit board that is electromechanically coupled to the chipset.

In particular embodiments, communication interfaces 928 may allow communication of data (e.g., between the manageability engine 926 and the data center management platform 906) associated with management and monitoring functions performed by manageability engine 926. In various embodiments, manageability engine 926 may utilize elements (e.g., one or more NICs) of communication interfaces 928 to report the telemetry data (e.g., to system management platform 906) in order to reserve usage of NICs of communication interface 918 for operations associated with workloads performed by platform logic 910.

Switches 930 may couple to various ports (e.g., provided by NICs) of communication interface 928 and may switch data between these ports and various components of chipset 916 (e.g., one or more Peripheral Component Interconnect Express (PCIe) lanes coupled to CPUs 912) . Switches 930 may be a physical or virtual (i.e., software) switch.

Platform logic 910 may include an additional communication interface 918. Similar to communication interfaces 928, communication interfaces 918 may be used for the communication of signaling and/or data between platform logic 910 and one or more networks 908 and one or more devices coupled to the network 908. For example, communication interface 918 may be used to send and receive network traffic such as data packets. In a particular embodiment, communication interfaces 918 comprise one or more physical NICs. These NICs may enable communication between any suitable element of platform logic 910 (e.g., CPUs 912 or memory 914) and another device coupled to network 908 (e.g., elements of other platforms or remote computing devices coupled to network 908 through one or more networks) .

Platform logic 910 may receive and perform any suitable types of workloads. A workload may include any request to utilize one or more resources of platform logic 910, such as one or more cores or associated logic. For example, a workload may comprise a request to instantiate a software component, such as an I/O device driver 924 or guest system 922; a request to process a network packet received from a virtual machine 932 or device external to platform 902A (such as a network node coupled to network 908) ; a request to execute a process or thread associated with a guest system 922, an application running on platform 902A, a hypervisor 920 or other operating system running on platform 902A; or other suitable processing request.

A virtual machine 932 may emulate a computer system with its own dedicated hardware. A virtual machine 932 may run a guest operating system on top of the hypervisor 920. The components of platform logic 910 (e.g., CPUs 912, memory 914, chipset 916, and communication interface 918) may be virtualized such that it appears to the guest operating system that the virtual machine 932 has its own dedicated components.

A virtual machine 932 may include a virtualized NIC (vNIC) , which is used by the virtual machine as its network interface. A vNIC may be assigned a media access control (MAC) address or other identifier, thus allowing multiple virtual machines 932 to be individually addressable in a network.

VNF 934 may comprise a software implementation of a functional building block with defined interfaces and behavior that can be deployed in a virtualized infrastructure. In particular embodiments, a VNF 934 may include one or more virtual machines 932 that collectively provide specific functionalities (e.g., wide area network (WAN) optimization, virtual private network (VPN) termination, firewall operations, load-balancing operations, security functions, etc. ) . A VNF 934 running on platform logic 910 may provide the same functionality as traditional network components implemented through dedicated hardware. For example, a VNF 934 may include components to perform any suitable NFV workloads, such as virtualized evolved packet core (vEPC) components, mobility management entities, 3rd Generation Partnership Project (3GPP) control and data plane components, etc.

Service function chain (SFC) 936 is a group of VNFs 934 organized as a chain to perform a series of operations, such as network packet processing operations. Service function chaining may provide the ability to define an ordered list of network services (e.g. firewalls, load balancers) that are stitched together in the network to create a service chain.

A hypervisor 920 (also known as a virtual machine monitor) may comprise logic to create and run guest systems 922. The hypervisor 920 may present guest operating systems run by virtual machines with a virtual operating platform (i.e., it appears to the virtual machines that they are running on separate physical nodes when they are actually consolidated onto a single hardware platform) and manage the execution of the guest operating systems by platform logic 910. Services of hypervisor 920 may be provided by virtualizing in software or through hardware assisted resources with minimal software intervention, or both. Multiple instances of a variety of guest operating systems may be managed by the hypervisor 920. Each platform 902 may have a separate instantiation of a hypervisor 920.

Hypervisor 920 may be a native or bare-metal hypervisor that runs directly on platform logic 910 to control the platform logic and manage the guest operating systems. Alternatively, hypervisor 920 may be a hosted hypervisor that runs on a host operating system and abstracts the guest operating systems from the host operating system. Hypervisor 920 may include a virtual switch 938 that may provide virtual switching and/or routing functions to virtual machines of guest systems 922. The virtual switch 938 may comprise a logical switching fabric that couples the vNICs of the virtual machines 932 to each other, thus creating a virtual network through which virtual machines may communicate with each other.

Virtual switch 938 may comprise a software element that is executed using components of platform logic 910. In various embodiments, hypervisor 920 may be in communication with any suitable entity (e.g., an SDN controller) which may cause hypervisor 920 to reconfigure the parameters of virtual switch 938 in response to changing conditions in platform 902 (e.g., the addition or deletion of virtual machines 932 or identification of optimizations that may be made to enhance performance of the platform) .

Hypervisor 920 may also include resource allocation logic 944, which may include logic for determining allocation of platform resources based on the telemetry data (which may include stress information) . Resource allocation logic 944 may also include logic for communicating with various components of platform logic 910 entities of platform 902A to implement such optimization, such as components of platform logic 910.

Any suitable logic may make one or more of these optimization decisions. For example, system management platform 906; resource allocation logic 944 of hypervisor 920 or other operating system; or other logic of computer platform 902A may be capable of making such decisions. In various embodiments, the system management platform 906 may receive telemetry data from and manage workload placement across multiple platforms 902. The system management platform 906 may communicate with hypervisors 920 (e.g., in an out-of-band manner) or other operating systems of the various platforms 902 to implement workload placements directed by the system management platform.

The elements of platform logic 910 may be coupled together in any suitable manner. For example, a bus may couple any of the components together. A bus may include any known interconnect, such as a multi-drop bus, a mesh interconnect, a ring interconnect, a point-to-point interconnect, a serial interconnect, a parallel bus, a coherent (e.g. cache coherent) bus, a layered protocol architecture, a differential bus, or a Gunning transceiver logic (GTL) bus.

Elements of the computer platform 902A may be coupled together in any suitable manner such as through one or more networks 908. A network 908 may be any suitable network or combination of one or more networks operating using one or more suitable networking protocols. A network may represent a series of nodes, points, and interconnected communication paths for receiving and transmitting packets of information that propagate through a communication system. For example, a network may include one or more firewalls, routers, switches, security appliances, antivirus servers, or other useful network devices.

The foregoing outlines features of one or more embodiments of the subject matter disclosed herein. These embodiments are provided to enable a person having ordinary skill in the art (PHOSITA) to better understand various aspects of the present disclosure. Certain well-understood terms, as well as underlying technologies and/or standards may be referenced without being described in detail. It is anticipated that the PHOSITA will possess or have access to background knowledge or information in those technologies and standards sufficient to practice the teachings of the present specification.

The PHOSITA will appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes, structures, or variations for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. The PHOSITA will also recognize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.

In the foregoing description, certain aspects of some or all embodiments are described in greater detail than is strictly necessary for practicing the appended claims. These details are provided by way of non-limiting example only, for the purpose of providing context and illustration of the disclosed embodiments. Such details should not be understood to be required, and should not be “read into” the claims as limitations. The phrase may refer to “an embodiment” or “embodiments. ” These phrases, and any other references to embodiments, should be understood broadly to refer to any combination of one or more embodiments. Furthermore, the several features disclosed in a particular “embodiment” could just as well be spread across multiple embodiments. For example, if

features

1 and 2 are disclosed in “an embodiment, ” embodiment A may have feature 1 but lack feature 2, while embodiment B may have feature 2 but lack feature 1.

This specification may provide illustrations in a block diagram format, wherein certain features are disclosed in separate blocks. These should be understood broadly to disclose how various features interoperate, but are not intended to imply that those features must necessarily be embodied in separate hardware or software. Furthermore, where a single block discloses more than one feature in the same block, those features need not necessarily be embodied in the same hardware and/or software. For example, a computer “memory” could in some circumstances be distributed or mapped between multiple levels of cache or local memory, main memory, battery-backed volatile memory, and various forms of persistent memory such as a hard disk, storage server, optical disk, tape drive, or similar. In certain embodiments, some of the components may be omitted or consolidated. In a general sense, the arrangements depicted in the figures may be more logical in their representations, whereas a physical architecture may include various permutations, combinations, and/or hybrids of these elements. Countless possible design configurations can be used to achieve the operational objectives outlined herein. Accordingly, the associated infrastructure has a myriad of substitute arrangements, design choices, device possibilities, hardware configurations, software implementations, and equipment options.

References may be made herein to a computer-readable medium, which may be a tangible and non-transitory computer-readable medium. As used in this specification and throughout the claims, a “computer-readable medium” should be understood to include one or more computer-readable mediums of the same or different types. A computer-readable medium may include, by way of non-limiting example, an optical drive (e.g., CD/DVD/Blu-Ray) , a hard drive, a solid-state drive, a flash memory, or other non-volatile medium. A computer-readable medium could also include a medium such as a read-only memory (ROM) , an FPGA or ASIC configured to carry out the desired instructions, stored instructions for programming an FPGA or ASIC to carry out the desired instructions, an intellectual property (IP) block that can be integrated in hardware into other circuits, or instructions encoded directly into hardware or microcode on a processor such as a microprocessor, digital signal processor (DSP) , microcontroller, or in any other suitable component, device, element, or object where appropriate and based on particular needs. A nontransitory storage medium herein is expressly intended to include any nontransitory special-purpose or programmable hardware configured to provide the disclosed operations, or to cause a processor to perform the disclosed operations.

Various elements may be “communicatively, ” “electrically, ” “mechanically, ” or otherwise “coupled” to one another throughout this specification and the claims. Such coupling may be a direct, point-to-point coupling, or may include intermediary devices. For example, two devices may be communicatively coupled to one another via a controller that facilitates the communication. Devices may be electrically coupled to one another via intermediary devices such as signal boosters, voltage dividers, or buffers. Mechanically-coupled devices may be indirectly mechanically coupled.

Any “module” or “engine” disclosed herein may refer to or include software, a software stack, a combination of hardware, firmware, and/or software, a circuit configured to carry out the function of the engine or module, or any computer-readable medium as disclosed above. Such modules or engines may, in appropriate circumstances, be provided on or in conjunction with a hardware platform, which may include hardware compute resources such as a processor, memory, storage, interconnects, networks and network interfaces, accelerators, or other suitable hardware. Such a hardware platform may be provided as a single monolithic device (e.g., in a PC form factor) , or with some or part of the function being distributed (e.g., a “composite node” in a high-end data center, where compute, memory, storage, and other resources may be dynamically allocated and need not be local to one another) .

There may be disclosed herein flow charts, signal flow diagram, or other illustrations showing operations being performed in a particular order. Unless otherwise expressly noted, the order should be understood to be a non-limiting example only. Furthermore, in cases where one operation is shown to follow another, other intervening operations may also occur, which may be related or unrelated. Some operations may also be performed simultaneously or in parallel. In cases where an operation is said to be “based on” or “according to” another item or operation, this should be understood to imply that the operation is based at least partly on or according at least partly to the other item or operation. This should not be construed to imply that the operation is based solely or exclusively on, or solely or exclusively according to the item or operation.

All or part of any hardware element disclosed herein may readily be provided in a system-on-a-chip (SoC) , including a central processing unit (CPU) package. An SoC represents an integrated circuit (IC) that integrates components of a computer or other electronic system into a single chip. Thus, for example, client devices or server devices may be provided, in whole or in part, in an SoC. The SoC may contain digital, analog, mixed-signal, and radio frequency functions, all of which may be provided on a single chip substrate. Other embodiments may include a multichip module (MCM) , with a plurality of chips located within a single electronic package and configured to interact closely with each other through the electronic package.

In a general sense, any suitably-configured circuit or processor can execute any type of instructions associated with the data to achieve the operations detailed herein. Any processor disclosed herein could transform an element or an article (for example, data) from one state or thing to another state or thing. Furthermore, the information being tracked, sent, received, or stored in a processor could be provided in any database, register, table, cache, queue, control list, or storage structure, based on particular needs and implementations, all of which could be referenced in any suitable timeframe. Any of the memory or storage elements disclosed herein, should be construed as being encompassed within the broad terms “memory” and “storage, ” as appropriate.

Computer program logic implementing all or part of the functionality described herein is embodied in various forms, including, but in no way limited to, a source code form, a computer executable form, machine instructions or microcode, programmable hardware, and various intermediate forms (for example, forms generated by an assembler, compiler, linker, or locator) . In an example, source code includes a series of computer program instructions implemented in various programming languages, such as an object code, an assembly language, or a high-level language such as OpenCL, FORTRAN, C, C++, JAVA, or HTML for use with various operating systems or operating environments, or in hardware description languages such as Spice, Verilog, and VHDL. The source code may define and use various data structures and communication messages. The source code may be in a computer executable form (e.g., via an interpreter) , or the source code may be converted (e.g., via a translator, assembler, or compiler) into a computer executable form, or converted to an intermediate form such as byte code. Where appropriate, any of the foregoing may be used to build or describe appropriate discrete or integrated circuits, whether sequential, combinatorial, state machines, or otherwise.

In one example embodiment, any number of electrical circuits of the FIGURES may be implemented on a board of an associated electronic device. The board can be a general circuit board that can hold various components of the internal electronic system of the electronic device and, further, provide connectors for other peripherals. Any suitable processor and memory can be suitably coupled to the board based on particular configuration needs, processing demands, and computing designs. Note that with the numerous examples provided herein, interaction may be described in terms of two, three, four, or more electrical components. However, this has been done for purposes of clarity and example only. It should be appreciated that the system can be consolidated or reconfigured in any suitable manner. Along similar design alternatives, any of the illustrated components, modules, and elements of the FIGURES may be combined in various possible configurations, all of which are within the broad scope of this specification.

Numerous other changes, substitutions, variations, alterations, and modifications may be ascertained to one skilled in the art and it is intended that the present disclosure encompass all such changes, substitutions, variations, alterations, and modifications as falling within the scope of the appended claims. In order to assist the United States Patent and Trademark Office (USPTO) and, additionally, any readers of any patent issued on this application in interpreting the claims appended hereto, Applicant wishes to note that the Applicant: (a) does not intend any of the appended claims to invoke paragraph six (6) of 35 U.S.C. section 112 (pre-AIA) or paragraph (f) of the same section (post-AIA) , as it exists on the date of the filing hereof unless the words “means for” or “steps for” are specifically used in the particular claims; and (b) does not intend, by any statement in the specification, to limit this disclosure in any way that is not otherwise expressly reflected in the appended claims.

Example Implementations

The following examples are provided by way of illustration.

Example 1 includes a computing apparatus, comprising: a hardware platform; a virtual machine manager (VMM) comprising a virtual machine control structure (VMCS) , wherein the VMM is configured to provision a virtual machine (VM) according to the VMCS; a data structure comprising an extended page table (EPT) for the VM having a translation lookaside buffer (TLB) , the TLB comprising a region having a passthrough region comprising a direct guest virtual address (GVA) to host physical address (HPA) translations; and logic to lock the passthrough region to prevent the passthrough region from being evicted from the TLB.

Example 2 includes the computing apparatus of example 1, wherein the passthrough region comprises address translations for a memory mapped input/output (MMIO) device.

Example 3 includes the computing apparatus of example 2, wherein the MMIO device is mapped according to a single-root input/output virtualization (SR-IOV) protocol.

Example 4 includes the computing apparatus of example 1, wherein the logic is further to lock the passthrough region in response to provisioning of the VM.

Example 5 includes the computing apparatus of example 1, wherein the logic is to lock the passthrough region upon enumeration of a single-root input/output virtualization (SR-IOV) device by the VM.

Example 6 includes the computing apparatus of example 1, wherein the logic is further to release the lock on the passthrough region after receiving an instruction to release a device mapped to the passthrough region.

Example 7 includes the computing apparatus of example 1, wherein the logic is further to release the lock on the passthrough region after receiving an instruction to terminate or release the VM.

Example 8 includes the computing apparatus of any of examples 1–7, wherein the VMCS comprises VMCS extensions to define the passthrough region.

Example 9 includes the computing apparatus of example 8, wherein the VMCS extensions comprise a number of locked entries in the passthrough region.

Example 10 includes the computing apparatus of example 8, wherein the VMCS extensions comprise a start address for an array of locked entries in the passthrough region.

Example 11 includes the computing apparatus of example 8, wherein the VMCS extensions comprise a field indicating that the upper 32 bits of a 64-bit address space should be used to define the passthrough region.

Example 12 includes one or more tangible, non-transitory computer-readable storage mediums having stored thereon instructions to execute on a hardware platform and to: provision a virtual machine control structure (VMCS) ; provision a virtual machine (VM) according to the VMCS; provision a data structure comprising an extended page table (EPT) for the VM having a translation lookaside buffer (TLB) , the TLB comprising a region having a passthrough region comprising direct guest virtual address (GVA) to host physical address (HPA) translations; and provide logic to lock the passthrough region to prevent the passthrough region from being evicted from the TLB.

Example 13 includes the one or more tangible, non-transitory computer-readable storage mediums of example 12, wherein the passthrough region comprises address translations for a memory mapped input/output (MMIO) device.

Example 14 includes the one or more tangible, non-transitory computer-readable storage mediums of example 13, wherein the MMIO device is mapped according to a single-root input/output virtualization (SR-IOV) protocol.

Example 15 includes the one or more tangible, non-transitory computer-readable storage mediums of example 14, wherein the logic is further to lock the passthrough region in response to provisioning of the VM.

Example 16 includes the one or more tangible, non-transitory computer-readable storage mediums of example 14, wherein the logic is to lock the passthrough region upon enumeration of a single-root input/output virtualization (SR-IOV) device by the VM.

Example 17 includes the one or more tangible, non-transitory computer-readable storage mediums of example 14, wherein the logic is further to release the lock on the passthrough region after receiving an instruction to release a device mapped to the passthrough region.

Example 18 includes the one or more tangible, non-transitory computer-readable storage mediums of example 14, wherein the logic is further to release the lock on the passthrough region after receiving an instruction to terminate or release the VM.

Example 19 includes the one or more tangible, non-transitory computer-readable storage mediums of any of examples 12–18, wherein the VMCS comprises VMCS extensions to define the passthrough region.

Example 20 includes the one or more tangible, non-transitory computer-readable storage mediums of example 19, wherein the VMCS extensions comprise a number of locked entries in the passthrough region.

Example 21 includes the one or more tangible, non-transitory computer-readable storage mediums of example 19, wherein the VMCS extensions comprise a start address for an array of locked entries in the passthrough region.

Example 22 includes the one or more tangible, non-transitory computer-readable storage mediums of example 19, wherein the VMCS extensions comprise a field indicating that the upper 32 bits of a 64-bit address space should be used to define the passthrough region.

Example 23 includes a computer-implemented method of providing passthrough memory mapping comprising: provisioning a virtual machine control structure (VMCS) ; provisioning a virtual machine (VM) according to the VMCS; provisioning a data structure comprising an extended page table (EPT) for the VM having a translation lookaside buffer (TLB) , the TLB comprising a region having a passthrough region comprising direct guest virtual address (GVA) to host physical address (HPA) translations, and providing logic to lock the passthrough region to prevent the passthrough region from being evicted from the TLB.

Example 24 includes the computer-implemented method of example 23, wherein the passthrough region comprises address translations for a memory mapped input/output (MMIO) device.

Example 25 includes the computer-implemented method of example 24, wherein the MMIO device is mapped according to a single-root input/output virtualization (SR-IOV) protocol.

Example 26 includes the computer-implemented method of example 25, wherein the logic is further to lock the passthrough region in response to provisioning of the VM.

Example 27 includes the computer-implemented method of example 25, wherein the logic is to lock the passthrough region upon enumeration of a single-root input/output virtualization (SR-IOV) device by the VM.

Example 28 includes the computer-implemented method of example 25, wherein the logic is further to release the lock on the passthrough region after receiving an instruction to release a device mapped to the passthrough region.

Example 29 includes the computer-implemented method of example 25, wherein the logic is further to release the lock on the passthrough region after receiving an instruction to terminate or release the VM.

Example 30 includes the computer-implemented method of any of examples 23–29, wherein the VMCS comprises VMCS extensions to define the passthrough region.

Example 31 includes the computer-implemented method of example 30, wherein the VMCS extensions comprise a number of locked entries in the passthrough region.

Example 32 includes the computer-implemented method of example 30, wherein the VMCS extensions comprise a start address for an array of locked entries in the passthrough region.

Example 33 includes the computer-implemented method of example 30, wherein the VMCS extensions comprise a field indicating that the upper 32 bits of a 64-bit address space should be used to define the passthrough region.

Example 34 includes an apparatus comprising means for performing the method of any of examples 23–33.

Example 35 includes the apparatus of example 34, wherein the means for performing the method comprise a processor and a memory.

Example 36 includes the apparatus of example 35, wherein the memory comprises machine-readable instructions, that when executed cause the apparatus to perform the method of any of examples 23–33.

Example 37 includes the apparatus of any of examples 34–36, wherein the apparatus is a computing system.

Example 38 includes at least one computer readable medium comprising instructions that, when executed, implement a method or realize an apparatus as illustrated in any of examples 23–37.

Claims

A computing apparatus, comprising:

a hardware platform;

a virtual machine manager (VMM) comprising a virtual machine control structure (VMCS) , wherein the VMM is configured to provision a virtual machine (VM) according to the VMCS;

a data structure comprising an extended page table (EPT) for the VM having a translation lookaside buffer (TLB) , the TLB comprising a region having a passthrough region comprising a direct guest virtual address (GVA) to host physical address (HPA) translations; and

logic to lock the passthrough region to prevent the passthrough region from being evicted from the TLB.
The computing apparatus of claim 1, wherein the passthrough region comprises address translations for a memory mapped input/output (MMIO) device.
The computing apparatus of claim 2, wherein the MMIO device is mapped according to a single-root input/output virtualization (SR-IOV) protocol.
The computing apparatus of claim 1, wherein the logic is further to lock the passthrough region in response to provisioning of the VM.
The computing apparatus of claim 1, wherein the logic is to lock the passthrough region upon enumeration of a single-root input/output virtualization (SR-IOV) device by the VM.
The computing apparatus of claim 1, wherein the logic is further to release the lock on the passthrough region after receiving an instruction to release a device mapped to the passthrough region.
The computing apparatus of claim 1, wherein the logic is further to release the lock on the passthrough region after receiving an instruction to terminate or release the VM.
The computing apparatus of any of claims 1–7, wherein the VMCS comprises VMCS extensions to define the passthrough region.
The computing apparatus of claim 8, wherein the VMCS extensions comprise a number of locked entries in the passthrough region.
The computing apparatus of claim 8, wherein the VMCS extensions comprise a start address for an array of locked entries in the passthrough region.
The computing apparatus of claim 8, wherein the VMCS extensions comprise a field indicating that the upper 32 bits of a 64-bit address space should be used to define the passthrough region.
One or more tangible, non-transitory computer-readable storage mediums having stored thereon instructions to execute on a hardware platform and to:

provision a virtual machine control structure (VMCS) ;

provision a virtual machine (VM) according to the VMCS;

provision a data structure comprising an extended page table (EPT) for the VM having a translation lookaside buffer (TLB) , the TLB comprising a region having a passthrough region comprising direct guest virtual address (GVA) to host physical address (HPA) translations; and

provide logic to lock the passthrough region to prevent the passthrough region from being evicted from the TLB.
The one or more tangible, non-transitory computer-readable storage mediums of claim 12, wherein the passthrough region comprises address translations for a memory mapped input/output (MMIO) device.
The one or more tangible, non-transitory computer-readable storage mediums of claim 13, wherein the MMIO device is mapped according to a single-root input/output virtualization (SR-IOV) protocol.
The one or more tangible, non-transitory computer-readable storage mediums of claim 14, wherein the logic is further to lock the passthrough region in response to provisioning of the VM.
The one or more tangible, non-transitory computer-readable storage mediums of claim 14, wherein the logic is to lock the passthrough region upon enumeration of a single-root input/output virtualization (SR-IOV) device by the VM.
The one or more tangible, non-transitory computer-readable storage mediums of claim 14, wherein the logic is further to release the lock on the passthrough region after receiving an instruction to release a device mapped to the passthrough region.
The one or more tangible, non-transitory computer-readable storage mediums of claim 14, wherein the logic is further to release the lock on the passthrough region after receiving an instruction to terminate or release the VM.
The one or more tangible, non-transitory computer-readable storage mediums of any of claims 12–18, wherein the VMCS comprises VMCS extensions to define the passthrough region.
The one or more tangible, non-transitory computer-readable storage mediums of claim 19, wherein the VMCS extensions comprise a number of locked entries in the passthrough region.
The one or more tangible, non-transitory computer-readable storage mediums of claim 19, wherein the VMCS extensions comprise a start address for an array of locked entries in the passthrough region.
A computer-implemented method of providing passthrough memory mapping comprising:

provisioning a virtual machine control structure (VMCS) ;

provisioning a virtual machine (VM) according to the VMCS;

provisioning a data structure comprising an extended page table (EPT) for the VM having a translation lookaside buffer (TLB) , the TLB comprising a region having a passthrough region comprising direct guest virtual address (GVA) to host physical address (HPA) translations; and

providing logic to lock the passthrough region to prevent the passthrough region from being evicted from the TLB.
The computer-implemented method of claim 22, wherein the passthrough region comprises address translations for a memory mapped input/output (MMIO) device.
The computer-implemented method of claim 23, wherein the MMIO device is mapped according to a single-root input/output virtualization (SR-IOV) protocol.
The computer-implemented method of claim 24, wherein the logic is further to lock the passthrough region in response to provisioning of the VM.