CN117591004A - Data stream accelerator - Google Patents

Data stream accelerator

Info

Publication number
CN117591004A
Authority
CN
China
Prior art keywords
pasid
processor
commit
inter
descriptor
Prior art date
Legal status
Pending
Application number
CN202311022669.6A
Other languages
Chinese (zh)
Inventor
R. M. Sankaran
P. R. Lantz
N. Ranganathan
S. Gayen
S. Kumar
N. Rao
D. A. Joshi
H. M. Kuo
U. Y. Kakaiya
Current Assignee
Intel Corp
Original Assignee
Intel Corp
Priority date
Filing date
Publication date
Priority claimed from US 18/233,308 (published as US 2024/0054011 A1)
Application filed by Intel Corp filed Critical Intel Corp
Publication of CN117591004A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • G06F3/0611Improving I/O performance in relation to response time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/0679Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Methods and apparatus related to a data stream accelerator are described. In an embodiment, a hardware accelerator, such as a Data Stream Accelerator (DSA) logic circuit, performs data movement and/or data transformation on data to be transferred between a processor (having one or more processor cores) and a storage device. Other embodiments are also disclosed and claimed.

Description

Data stream accelerator
RELATED APPLICATIONS
The present application relates to and claims priority from U.S. provisional patent application serial No. 63/397,457, filed August 12, 2022, entitled "DATA STREAMING ACCELERATOR," which provisional patent application is incorporated herein in its entirety for all purposes.
Technical Field
The present disclosure relates generally to the field of data streaming. More particularly, embodiments relate to data stream accelerators.
Background
In general, memory used to store data in a computing system may be volatile (for storing volatile information) or non-volatile (for storing persistent information). Volatile data structures stored in volatile memory are typically used to support temporary or intermediate information required by a program during its runtime. On the other hand, persistent data structures stored in non-volatile (or persistent) memory remain available beyond the runtime of the program and can be reused. Furthermore, new data is typically first generated as volatile data before a user or programmer decides to persist it. For example, a programmer or user may map (i.e., instantiate) volatile structures in volatile main memory that is directly accessible by the processor. On the other hand, persistent data structures are instantiated on a non-volatile storage device (e.g., a rotating disk attached to an Input/Output (I/O or IO) bus) or a non-volatile memory-based device (e.g., a solid state drive).
As computing power in processors increases, one concern or bottleneck is the speed with which memory can be accessed by the processor. For example, to process data, a processor may need to first fetch (fetch) data from a memory device. After the data processing is completed, the results may need to be stored in a memory device. Thus, memory access speed and/or efficiency may have a direct impact on overall system performance.
Drawings
The detailed description is provided with reference to the accompanying drawings. In the drawings, the leftmost digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
Fig. 1A illustrates a data stream accelerator (Data Streaming Accelerator, DSA) device according to an embodiment.
FIG. 1B illustrates example fields in a restricted inter-domain memory operation descriptor, according to an embodiment.
FIG. 1C illustrates a flow chart of a method for providing inter-domain memory operations, according to an embodiment.
Fig. 1D illustrates a command capability register (Command Capabilities Register, CMDCAP) according to an embodiment.
Fig. 2A illustrates an example inter-domain bitmap register according to an embodiment.
Fig. 2B illustrates a flow chart of a method for invalidating a bitmap cache according to an embodiment.
Fig. 3 illustrates an update window descriptor according to an embodiment.
FIG. 4 illustrates an example computing system.
Fig. 5 illustrates a block diagram of an example processor and/or System on a Chip (SoC) that may have one or more cores and have an integrated memory controller.
FIG. 6 (A) is a block diagram illustrating both example in-order pipelines and example register renaming, out-of-order issue/execution pipelines, according to some examples.
FIG. 6 (B) is a block diagram illustrating both an example in-order architecture core and an example register renaming, out-of-order issue/execution architecture core to be included in a processor, according to an example.
Fig. 7 illustrates an example of execution unit circuit(s).
FIG. 8 is a block diagram of a register architecture according to some examples.
Fig. 9 illustrates an example of an instruction format.
Fig. 10 illustrates an example of an addressing information field.
Fig. 11 illustrates an example of a first prefix.
Fig. 12 (A) - (D) illustrate examples of how the R, X, and B fields of the first prefix in Fig. 11 are used.
Fig. 13 (A) - (B) illustrate examples of the second prefix.
Fig. 14 illustrates an example of a third prefix.
FIG. 15 is a block diagram illustrating the use of a software instruction converter for converting binary instructions in a source instruction set architecture to binary instructions in a target instruction set architecture, according to an example.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the embodiments. However, embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the particular embodiments. Further, aspects of the embodiments may be performed using various means, such as integrated semiconductor circuits ("hardware"), computer-readable instructions organized into one or more programs ("software"), or some combination of hardware and software. For the purposes of this disclosure, reference to "logic" shall mean hardware (such as logic circuitry or, more generally, a circuit or circuitry), software, firmware, or some combination thereof.
As mentioned above, as computing power in a processor increases, one concern or bottleneck is the speed with which memory can be accessed by the processor. Thus, memory access speed and/or efficiency may have a direct impact on overall system performance.
Some embodiments relate to a data stream accelerator. In one embodiment, the data stream accelerator provides high-performance (e.g., higher speed and/or efficiency) data copy and/or data transformation acceleration. Logic for a Data Stream Accelerator (DSA) may be provided by or implemented on an integrated circuit device having a processor, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field-Programmable Gate Array (FPGA), or the like. Further, DSA logic may optimize streaming data movement and/or transformation operations, such as those common in high-performance storage, networking, persistent memory, and/or various data processing applications.
In one or more embodiments, DSA logic may provide improved virtualization efficiency, thereby enabling sharing and/or virtualization of devices. In at least one embodiment, a System-on-a-chip (SoC) or System-on-Package (SoP) may include DSA logic and/or a processor (such as illustrated in fig. 5).
Further, DSA logic may provide higher overall system performance for data mover and/or data transformation operations while freeing up CPU/processor cycles for other tasks, such as higher-level functions. For example, DSA hardware may support high-performance data mover capabilities to/from volatile memory, persistent memory, and memory-mapped input/output (MMIO), and to/from remote volatile and/or persistent memory on another node in a cluster through a Non-Transparent Bridge (NTB) in the SoC/SoP. It may also expose a programming interface compatible with Peripheral Component Interconnect express (PCIe) to the Operating System (OS) and/or may be controlled through a device driver.
In addition to performing generic data mover operations, DSA logic may also be designed to perform a number of higher-level transformation operations on memory. For example, it may generate and test a Cyclic Redundancy Code (CRC) checksum or Data Integrity Field (DIF) over a memory region to support typical storage and/or networking applications. It may additionally support memory compare operations for equality, generation of delta records, and/or application of delta records to a buffer. The compare and delta generate/merge functions may be utilized by applications such as Virtual Machine (VM) migration, VM fast checkpointing, and/or software-managed memory deduplication.
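For illustration only, the following C sketch shows a software stand-in for the kind of CRC generation that DSA offloads to hardware; the polynomial and check value are the common CRC-32 conventions and are not taken from the DSA specification.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Software stand-in for CRC generation over a memory region. The reflected
     * 0xEDB88320 polynomial used here is the generic CRC-32 convention, not
     * necessarily the one defined for the DSA CRC/DIF operations. */
    static uint32_t crc32_sw(const uint8_t *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= buf[i];
            for (int b = 0; b < 8; b++)
                crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1u));
        }
        return ~crc;
    }

    int main(void)
    {
        const uint8_t data[] = "123456789";
        printf("crc32 = 0x%08x\n", crc32_sw(data, 9)); /* 0xCBF43926 for standard CRC-32 */
        return 0;
    }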
Further examples of DSA logic features may be found in Appendix A and/or B of the U.S. provisional patent application entitled "DATA STREAMING ACCELERATOR," filed August 12, 2022, serial No. 63/397,457, which provisional patent application is incorporated herein in its entirety for all purposes. However, embodiments are not limited to each and every feature discussed in Appendix A and/or B, and DSA implementations may be adjusted for a given design and/or feature set. In addition, DSA logic functions are also described in the Data Streaming Accelerator Architecture Specification, Revision 2.0, September 2022.
Fig. 1A illustrates a Data Stream Accelerator (DSA) device 100 according to an embodiment. Downstream work requests from clients are received on the I/O fabric interface 101. Upstream data read/write operations (102) and address translation operations (103) are also sent over the I/O fabric interface 101. As shown in Fig. 1A, the device includes one or more Work Queue (WQ) configuration registers 104, work queues (labeled WQ0 and WQ1) that hold descriptors submitted by software, arbiters 105a/105b that implement Quality of Service (QoS) and fairness policies, processing engines, address translation logic and a cache interface 103, and a memory read/write interface 102. The batch processing unit 106 processes batch descriptor(s) by reading an array of descriptors from memory. The work descriptor processing unit 107 has stages for reading memory, performing the requested operation on the data, generating output data, and writing output data, completion records, and interrupt messages.
The WQ configuration logic 104 allows software to configure each WQ as a shared work queue (Shared Work Queue, SWQ) that can be shared by multiple software components/applications, or as a dedicated work queue (Dedicated Work Queue, DWQ) that is assigned to a single software component/application at a time. This configuration also allows software to control which WQs feed which engines and the relative priorities of the WQs that feed each engine.
In an embodiment, each work descriptor submitted to the DSA device 100 is associated with a default address space that corresponds to the address space of the work submitter. As discussed herein, a Process Address Space Identifier (PASID) generally refers to a value used in a memory transaction to convey, to the host, the address space of an address used by a device. When the PASID capability is enabled, the default address space is explicitly specified by the PASID carried in the work descriptor submitted to a shared work queue, or by the PASID configured in the Work Queue Configuration (WQCFG) register for a dedicated work queue. Memory accesses and Input/Output Memory Management Unit (IOMMU) requests are tagged with the PASID value.
When the PASID capability is disabled, the default address space is implicitly specified to the IOMMU via the Peripheral Component Interconnect express (PCIe) requester Identifier (ID) (bus, device, function) of the device. This address space of the work submitter may be referred to by the descriptor PASID, and it is the address space normally used for memory accesses and IOMMU requests from the DSA device.
When the PASID capability is enabled, certain operations may allow the presenter to select an alternate address space for the source address, destination address, or both source and destination addresses specified in the work descriptor. The alternate address space may be an address space of a collaborative process. This process may be referred to as the owner of the alternate address space. The set of operations that allow selection of the alternate address space may be referred to as inter-domain operations.
In an embodiment, inter-domain operations may operate on multiple address spaces (each identified by a PASID) with a single descriptor. In one embodiment, inter-domain operation requires the PASID capability to be enabled. In an embodiment, support for inter-domain operations is indicated by an inter-domain support field in the General Capabilities (GENCAP) register. For example, when that field is 1, inter-domain capabilities are reported in the Inter-Domain Capabilities (IDCAP) register. The set of inter-domain operations supported by an implementation is reported in the Operations Capabilities (OPCAP) register and can only be used if inter-domain capabilities are supported. Selection of the PASID to use in each operation may be accomplished using the appropriate descriptor fields. Some details of the supported inter-domain operations and descriptions of the corresponding descriptor fields are discussed below with reference to Fig. 1B.
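A minimal C sketch of this capability-discovery flow is shown below. The register offsets and bit positions are placeholders for illustration; the architected layouts of GENCAP, IDCAP, and OPCAP are defined by the DSA specification, not by this sketch.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical MMIO offsets and bit positions (placeholders only). */
    #define GENCAP_OFFSET           0x10u
    #define IDCAP_OFFSET            0x18u
    #define OPCAP_OFFSET            0x20u
    #define GENCAP_INTER_DOMAIN_BIT (1ull << 40)  /* assumed position of the inter-domain support field */

    static inline uint64_t mmio_read64(const volatile void *bar0, uint32_t off)
    {
        return *(const volatile uint64_t *)((const volatile uint8_t *)bar0 + off);
    }

    /* Returns true when an inter-domain opcode may be used: GENCAP reports
     * inter-domain support and the opcode's bit is set in OPCAP. Assumes the
     * opcode index fits within the single 64-bit OPCAP word read here. */
    static bool inter_domain_op_usable(const volatile void *bar0, unsigned opcode)
    {
        uint64_t gencap = mmio_read64(bar0, GENCAP_OFFSET);
        if (!(gencap & GENCAP_INTER_DOMAIN_BIT))
            return false;             /* IDCAP/OPCAP inter-domain bits are not meaningful */

        uint64_t idcap = mmio_read64(bar0, IDCAP_OFFSET);
        (void)idcap;                  /* entry types supported; checked when programming the IDPT */

        uint64_t opcap = mmio_read64(bar0, OPCAP_OFFSET);
        return (opcap >> opcode) & 1ull;
    }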
If the work submitter does not explicitly select an alternate PASID for an address in the descriptor, the descriptor PASID is used for memory accesses and IOMMU requests related to that address. If the descriptor selects an alternate PASID for the address, that PASID is used instead of the descriptor PASID, provided the submitter has the appropriate permission to do so. When used in this manner, the alternate PASID may be referred to as the "access PASID" for the corresponding address in the descriptor. The device uses the access PASID to perform memory accesses and IOMMU requests related to that address. The descriptor PASID is used for writing the completion record, for interrupt generation, and for verifying that the submitter has sufficient permission to use the specified access PASID, as described below.
In at least one embodiment, inter-domain operations may involve two or three PASIDs, depending on the use case. Some of the example use cases are listed below:
1. data is read from or written to the memory region derived by the user mode owner by one or more user mode submitters.
2. Data is read from or written to the memory region of the user mode process by the kernel mode presenter.
3. Data reads or data writes are made by the kernel-mode presenter between memory regions of two different user-mode processes.
4. Data is read from or written to a memory region of another kernel mode process by a kernel mode presenter.
5. Data reads or data writes are made by the privileged commit party between memory regions of two different guest OSs.
6. Any of the above executing within a guest Operating System (OS).
Use case (1) above requires that the owner explicitly grant access to portions of its memory space to one or more submitters. The region of memory to which the owner grants access is called a memory window. The memory window can only be accessed using the owner's PASID as the access PASID. Use cases (2) through (6) involve privileged software accessing memory regions of other user-mode or kernel-mode processes within the OS domain. These use cases may require flexibility and low overhead, allowing the privileged submitter to explicitly specify the PASID for each address in the descriptor, but without compromising security.
Referring to Fig. 1A, if inter-domain operations are supported, DSA implements an Inter-Domain Permissions Table (IDPT) 108 to allow software to manage: (1) the association between the descriptor PASID and the access PASID(s) that the work submitter is allowed to access; (2) the attributes of memory regions in the address space of the access PASID that the submitter is allowed to access; and/or (3) controls for managing the lifecycle of such associations. The IDPT may be managed by a host kernel-mode driver and may be configured to support use by kernel-mode and user-mode applications in a host or guest OS.
FIG. 1B illustrates example fields in a restricted inter-domain memory operation descriptor in accordance with some embodiments.
In one or more embodiments, each entry in the IDPT contains the following: (1) an entry type, as described below; (2) one or more submitter PASID values that are allowed to use the entry, and a mechanism for validating them; (3) depending on the entry type, the access PASID to be used for memory accesses; (4) memory window address ranges and attributes; and/or (5) permissions and other control information.
Each IDPT Entry (IDPTE) may be configured in one of the following ways, as indicated by the type field described below and summarized in table 1:
Type 0 - Single Access, Single Submitter (SASS): the IDPTE specifies a single access PASID and a single submitter PASID. For example, a process that wants to expose a memory window to a peer process may request that the driver set up a SASS entry with its own PASID as the access PASID and the PASID of its peer as the submitter PASID.
Type 1 - Single Access, Multiple Submitters (SAMS): the IDPTE specifies a single access PASID. The submitter PASID field in the entry is unused. Instead, the IDPTE points to a bitmap in memory that specifies the set of submitter PASIDs that are allowed to use the entry. A bit set to 1 in the bitmap indicates that the corresponding PASID is allowed to submit inter-domain operations using the IDPTE. For example, a process that wants to allow multiple submitters to access a window in its address space requests that a SAMS entry be set up.
Table 1: inter-domain license table entry type
As discussed herein, a descriptor references an IDPT entry using a handle carried in the descriptor. If the Request IDPT Handle field in the Command Capabilities register ("CMDCAP"; see, e.g., Fig. 1D) is 0, the handle is simply the index of the desired entry in the IDPT. If the Request IDPT Handle field in the CMDCAP register is 1, software uses the Request IDPT Handle command to obtain the handle to use. Software specifies in the Request IDPT Handle command the index of the IDPT entry for which it wants a handle, and the response to this command contains the handle that software should place in the descriptor.
In some embodiments, an inter-domain descriptor may contain more than one handle, depending on the type of operation. A separate handle may be specified in the descriptor for each distinct source and/or destination address. Each handle in the descriptor is used by hardware to locate the corresponding IDPTE in order to: (1) verify the submitter's access permission, (2) identify the access PASID and privileges to be used for memory accesses, (3) compute the effective memory address, and/or (4) verify that the access complies with the memory window and permissions granted by the IDPTE.
In an embodiment, the IDPTE may be referenced by:
(a) By an inter-domain descriptor, when the available bit in the IDPTE is 1. In this case, the hardware checks whether the descriptor PASID matches the submitter PASID value in the specified IDPTE.
(b) By an update window descriptor (see, e.g., Fig. 3), when the allow update bit in the IDPTE is 1. In this case, the hardware checks whether the descriptor PASID matches the access PASID value in the specified IDPTE.
If the PASID values do not match, memory access using the entry is not allowed for the descriptor, and the descriptor is completed with an error. Furthermore, Type 0 SASS and Type 1 SAMS IDPTEs can only be used with restricted inter-domain operations (see, e.g., Fig. 1B).
FIG. 1B illustrates details of a restricted descriptor for inter-domain operations according to an embodiment. In some embodiments, one or more of the following new operations support inter-domain capabilities:
(1) Restricted inter-domain memory copy (copies data from a source address to a destination address);
(2) Restricted inter-domain fill (fills memory at the destination address with the value in the pattern field);
(3) Restricted inter-domain compare (compares data at the source 1 address with memory at the source 2 address);
(4) Restricted inter-domain compare pattern (compares data at the source address with the value in the pattern field); and/or
(5) Restricted inter-domain cache flush (flushes processor caches for the destination address).
Further, an update window operation may atomically modify the attributes of the memory window associated with the specified inter-domain permission table entry.
Referring to Fig. 1B, the restricted inter-domain descriptor includes an operation field 109a capable of specifying the operation to be performed (such as those discussed above, including, for example, copy, fill, compare pattern, flush, etc.), a PASID field 109c capable of specifying the submitter PASID of a submitting process executing on a (e.g., host) processor, and an IDPT handle field 110/111 capable of specifying an IDPT entry.
As shown in Fig. 1B, the descriptor for a restricted inter-domain operation allows software to specify an IDPT handle 110/111 for each source or destination address 112/113. The IDPT handle 110/111 references a Type 0 SASS or Type 1 SAMS IDPTE (described above). At least one valid IDPT handle is indicated by the corresponding flag bit 114 being set to 1.
In one embodiment, the IDPT handle(s) 110/111 may be used to look up an access PASID associated with the address space of another process. This approach provides better security/privacy because a submitter that directly specified an access PASID would not necessarily be trusted. Instead, the submitter points to an IDPT entry; the device first checks the IDPT entry to ensure that the correct submitter PASID is using the entry, and then uses the access PASID in the IDPT entry as the source or destination of the descriptor operation. Thus, as discussed herein, an access PASID generally refers to the PASID whose address space is accessed for reading or writing. As a result, the access PASID may be a source access PASID or a destination access PASID. In some embodiments, the access PASID is referenced in the new descriptor via the IDPT handle(s) 110/111 as shown in Fig. 1B to allow access to the alternate address space.
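As an informal illustration of the fields called out above for Fig. 1B, the C structure below groups them together; the field order, widths, and flag encodings are assumptions and do not reflect the architected descriptor layout.

    #include <stdint.h>

    /* Illustrative grouping of the restricted inter-domain descriptor fields. */
    struct restricted_inter_domain_desc {
        uint8_t  opcode;             /* copy, fill, compare, compare pattern, or flush      */
        uint16_t flags;              /* includes per-handle valid bits (cf. flag bit 114)   */
        uint32_t pasid;              /* submitter (descriptor) PASID, field 109c            */
        uint16_t src_idpt_handle;    /* selects the IDPTE for the source address            */
        uint16_t dst_idpt_handle;    /* selects the IDPTE for the destination address       */
        uint64_t src_addr;           /* address or window offset, depending on window mode  */
        uint64_t dst_addr;
        uint64_t xfer_size;
        uint64_t completion_addr;    /* completion record is written using the descriptor
                                        PASID, not the access PASID                         */
    };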
FIG. 1C illustrates a flowchart of a method 150 for providing inter-domain memory operations, according to an embodiment. One or more components discussed herein, such as a hardware accelerator (e.g., DSA 100 of fig. 1A) and/or a processor (such as the processor discussed with reference to fig. 4 and subsequent figures), may be used to perform the operations of method 150.
Referring to Figs. 1A-1C, at operation 152, a plurality of descriptors are stored in a work queue (e.g., WQ0 or WQ1 of Fig. 1A). At operation 154, an arbiter (e.g., arbiter 105a of Fig. 1A) dispatches a descriptor from the work queue. As discussed with reference to Fig. 1B, the descriptor may include: an operation field capable of specifying the operation to be performed, a PASID field capable of specifying the submitter PASID of a submitter process, an Inter-Domain Permissions Table (IDPT) handle field capable of specifying an IDPT entry, and optionally an access PASID field capable of specifying an access PASID associated with the address space of another process.
At operation 158, for a restricted inter-domain memory operation descriptor (such as that shown in Fig. 1B), an engine (e.g., one of engines 0-N of Fig. 1A) performs the following operations: (a) obtaining, from the IDPT entry, the access PASID associated with the address space of another process; (b) verifying whether the submitter process is permitted to access the address space of the other process, based at least in part on the submitter PASID and the IDPT entry; and (c) processing the descriptor based at least in part on the operation specified by the operation field (e.g., by the work descriptor processing unit 107 of Fig. 1A). In one or more embodiments, the operation to be performed is one of: a copy operation, a fill operation, a compare pattern operation, and a flush operation.
Further, in various embodiments: (a) the IDPT entry may specify a single access PASID and a single submitter PASID; (b) the IDPT entry may specify a single access PASID and multiple submitter PASIDs; and/or (c) multiple submitter PASIDs may be specified by a PASID bitmap in memory, as discussed further below.
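A compact C sketch of the checks in operation 158 follows, reusing the illustrative struct idpte shown after Table 1; the helper that consults the submitter bitmap is stubbed out, and all names are hypothetical.

    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Stub: a real implementation reads the 4KB-aligned submitter bitmap that
     * the SAMS entry points to and tests the bit indexed by the PASID. */
    static bool bitmap_allows(uint64_t bitmap_addr, uint32_t pasid)
    {
        (void)bitmap_addr; (void)pasid;
        return true;
    }

    /* Operations 158(a) and 158(b): resolve the access PASID from the IDPT
     * entry and verify that the submitter PASID may use that entry. */
    static int resolve_access_pasid(const struct idpte *e, uint32_t submitter_pasid,
                                    uint32_t *access_pasid_out)
    {
        if (e == NULL || !e->available)
            return -EINVAL;                           /* descriptor completes with an error */

        if (e->type == IDPTE_SASS) {
            if (e->who.submitter_pasid != submitter_pasid)
                return -EPERM;
        } else {                                      /* IDPTE_SAMS */
            if (!bitmap_allows(e->who.submitter_bitmap_addr, submitter_pasid))
                return -EPERM;
        }

        *access_pasid_out = e->access_pasid;          /* 158(a): taken from the entry, never from the submitter */
        return 0;                                     /* 158(c): the requested operation may now proceed        */
    }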
Fig. 1D illustrates a Command Capabilities Register (CMDCAP) 170 according to an embodiment. The command capabilities register indicates which management commands are supported by the command register. The register is a bitmask in which each bit corresponds to the command whose command code equals that bit position. For example, bit 1 of the register corresponds to the Enable Device command (command code 1).
In an embodiment, this register exists only when the command capability support field in GENCAP is 1. If the register indicates support for the Request Interrupt Handle command, that command is used to obtain an interrupt handle for descriptor completion interrupts. If command capability support is 0, this register does not exist and the commands listed in Table 2 are supported:
Command          Code   Operation
Enable Device    1      Enables the device.
Disable Device   2      Disables the device.
Drain All        3      Waits for all descriptors.
Abort All        4      Discards and/or waits for all descriptors.
Reset Device     5      Disables the device and clears the device configuration.
Enable WQ        6      Enables the specified WQ.
Disable WQ       7      Disables the specified WQ.
Drain WQ         8      Waits for descriptors in the specified WQ.
Abort WQ         9      Discards and/or waits for descriptors in the specified WQ.
Reset WQ         10     Disables the specified WQ and clears the WQ configuration.
Drain PASID      11     Waits for descriptors that use the specified PASID.
Abort PASID      12     Discards and/or waits for descriptors that use the specified PASID.
Table 2: supported default commands
Submitter bitmap
In an embodiment, a Type 1 SAMS IDPTE points to a submitter bitmap in memory, with one bit for each possible PASID value. The bitmap is indexed by the PASID value to be checked against it. Access is allowed only if the bit in the bitmap corresponding to the checked PASID is 1. For a SAMS IDPTE, the hardware checks the descriptor PASID against the bitmap before allowing any memory access using the table entry. The Type 1 SAMS entry specifies a 4KB-aligned virtual address or physical address, which is referred to as the submitter bitmap address. Privileged software (e.g., a kernel-mode driver) is responsible for setting up and maintaining the bitmap in memory. The maximum size of the submitter bitmap is 2^20 bits, i.e., 128KB. Each IDPTE that requires a bitmap may point to a different submitter bitmap in memory. Software may also choose to share a submitter bitmap among multiple IDPTEs, if appropriate.
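The check the hardware performs on a SAMS entry can be restated in C as below. An LSB-first bit ordering within each byte is assumed here for illustration; the specification defines the actual bit numbering.

    #include <stdbool.h>
    #include <stdint.h>

    #define SUBMITTER_BITMAP_BYTES (128u * 1024u)   /* 2^20 PASIDs, one bit each */

    /* Hardware-side check: access is allowed only if the bit indexed by the
     * submitter PASID is 1 (LSB-first bit order within a byte assumed). */
    static bool submitter_allowed(const uint8_t *bitmap, uint32_t pasid)
    {
        return (bitmap[pasid >> 3] >> (pasid & 7u)) & 1u;
    }

    /* Driver-side grant: privileged software sets the bit for a PASID. */
    static void grant_submitter(uint8_t *bitmap, uint32_t pasid)
    {
        bitmap[pasid >> 3] |= (uint8_t)(1u << (pasid & 7u));
    }

    int main(void)
    {
        static uint8_t bitmap[SUBMITTER_BITMAP_BYTES]; /* zero-initialized: nothing allowed */
        grant_submitter(bitmap, 42);
        return submitter_allowed(bitmap, 42) ? 0 : 1;
    }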
In one embodiment, an Inter-Domain Bitmap Register (IDBR) controls whether the hardware uses a PASID value for submitter bitmap reads. Fig. 2A illustrates an example inter-domain bitmap register 200 according to an embodiment.
If enabled, the IDBR specifies the PASID value and privileges to be used for bitmap reads. Although each submitter bitmap is mapped to a contiguous virtual address range in the corresponding PASID address space, it may be backed by non-contiguous physical pages in system memory. Software does not need to map the bitmap completely into system memory at any given time; different bitmap pages may be mapped as needed. If a page of the bitmap is inaccessible, all bits on that page are treated as 0. Depending on the IOMMU configuration, faults may be reported on unmapped pages. To avoid these faults, software may pin all memory pages corresponding to the bitmap. The IDBR also specifies the traffic class to be used for bitmap reads.
In one embodiment, if the inter-domain support field in the GENCAP is 1 and bit 1 in the type support field in the IDCAP is set, the inter-domain bitmap register is used to specify the PASID and privileges to be used to read the bitmap referenced by the IDPT. Otherwise the register is reserved. This register is read-write when the device is disabled, otherwise read-only.
Fig. 2B illustrates a flowchart of a method 220 for invalidating a bitmap cache, according to an embodiment. Bitmap read operations may be performed by hardware in an implementation-specific manner. Implementations may issue the bitmap read as a translated or an untranslated access. The hardware may read a single byte, a doubleword, a cache line, or a larger region of the bitmap corresponding to the PASID to be checked. For example, for a PASID value p to be checked against the bitmap, an implementation using cache-line reads of the bitmap would read the cache line at (PPBA + ((p >> 3) & 0xFFFFFFC0)) and check the bit corresponding to p.
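For concreteness, the cache-line read address in the example above can be computed as follows; PPBA here stands for the submitter bitmap address taken from the IDPTE, and the 64-byte line size is an assumption of this sketch.

    #include <stdint.h>
    #include <stdio.h>

    /* Cache line holding the bit for PASID p: the byte index (p >> 3),
     * rounded down to a 64-byte boundary, added to the bitmap base PPBA. */
    static uint64_t bitmap_line_address(uint64_t ppba, uint32_t p)
    {
        return ppba + ((p >> 3) & 0xFFFFFFC0u);
    }

    int main(void)
    {
        uint64_t ppba = 0x100000;    /* example 4KB-aligned bitmap address */
        uint32_t p = 0x12345;
        printf("read line at 0x%llx, then test bit %u within that line\n",
               (unsigned long long)bitmap_line_address(ppba, p), p & 0x1FFu);
        return 0;
    }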
Referring to Fig. 2B, at operation 222, portions of the bitmap may be cached in a device cache, e.g., to avoid repeated memory read accesses. This capability, if supported, may be indicated by the Invalidate Submitter Bitmap Cache field in the CMDCAP of Fig. 1D. If the capability is enabled (e.g., 1), then at operation 226, software issues an Invalidate Submitter Bitmap Cache command after modifying any portion of the bitmap in memory (e.g., as determined at operation 224) or after modifying the mapping of any page of the bitmap. In the latter case, software performs the bitmap invalidation after performing any invalidations normally required for page mapping modifications. Furthermore, this technique supports any caching of bitmap values on devices that implement an on-device bitmap cache. The bitmap may serve as a system-level permission map, and multiple devices in the system may opportunistically cache entries from a single system-level bitmap. An invalidation may be performed on all such devices, for example, to ensure that any stale cache entries are purged.
Memory window
As discussed herein, a "memory window" is a region of memory in the owner's address space that one or more submitters are allowed to access. It may be defined by the window base address, window size, window mode, and access permission fields in the IDPTE. The window attributes may be initialized when the kernel-mode driver assigns the IDPTE to the owner or privileged submitter. A window enable field in the IDPTE controls whether the memory window is active for that IDPTE.
If window enable is 0, the hardware does not perform an address range check when the entry is used. A validated submitter is allowed to access any address in the address space, and the window mode, window base address, and window size fields are reserved.
If window enable is 1, the hardware checks whether the memory region in a descriptor referencing the IDPTE falls within the memory window. The memory window does not wrap around the 2^64 address boundary. The window mode field controls the interpretation of addresses in descriptors that reference the IDPTE.
In some embodiments, two window modes are supported:
(1) Address mode: the hardware checks whether the memory region in the descriptor referencing the IDPTE lies within the window, i.e., between the window base address and the sum of the window base address and the window size.
(2) Offset mode: the address of the memory region in the descriptor is treated as an offset from the window base address. The effective start of the memory region is computed as the sum of the window base address and the address in the descriptor referencing the IDPTE. The effective end of the memory region is the sum of the effective start address and the region size. The effective start and end of the memory region must lie within the window.
The IDPTE specifies read and write permissions for memory accesses using the entry. If the requested permission does not match the granted permission, access is denied.
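A software restatement of the address-mode and offset-mode checks is sketched below; it assumes the window does not wrap the 2^64 boundary (as stated above) and omits the separate read/write permission check.

    #include <stdbool.h>
    #include <stdint.h>

    enum window_mode { WINDOW_ADDRESS_MODE = 0, WINDOW_OFFSET_MODE = 1 };

    /* Returns true and the effective start address when the region described
     * by (desc_addr, region_size) is within the window; false otherwise. */
    static bool window_check(uint64_t win_base, uint64_t win_size, enum window_mode mode,
                             uint64_t desc_addr, uint64_t region_size, uint64_t *eff_start)
    {
        if (mode == WINDOW_ADDRESS_MODE) {
            /* Region must lie between the base and base + size. */
            if (desc_addr < win_base || desc_addr + region_size > win_base + win_size)
                return false;
            *eff_start = desc_addr;
        } else {
            /* Offset mode: the descriptor address is an offset from the window base. */
            if (desc_addr + region_size > win_size)
                return false;
            *eff_start = win_base + desc_addr;
        }
        return true;
    }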
Memory window modification
For a SASS or SAMS IDPTE, if the allow update bit in the IDPTE is 1, the owner may use an update window descriptor (see, e.g., Fig. 3) to modify the memory window attributes. Only a process whose PASID matches the access PASID in the IDPTE is allowed to issue an update window operation. If the descriptor PASID does not match the access PASID, the update window descriptor is completed with an error.
Even when the allow update bit in the IDPTE is 0, the kernel-mode driver may modify the entry using MMIO writes while the IDPTE is not available (i.e., not in active use).
Fig. 3 illustrates an update window descriptor 300 according to an embodiment. The update window descriptor atomically changes only the values of the window base address, window size, window mode, read and write permissions, and window enable fields in the IDPTE. Since the update is performed atomically by hardware, any inter-domain descriptor that references the IDPTE concurrently is guaranteed to see either the old or the new values of the window attributes.
After completing the atomic update, the update window descriptor may also perform an implicit drain to flush out any in-flight descriptors that are still using the IDPTE's pre-update window attributes. This ensures that when the update window operation completes, any descriptors referencing the IDPTE have also completed. In an embodiment, the update window descriptor also allows the implicit drain to be suppressed if desired.
In addition, the update window operation (0x21) atomically modifies the attributes of the memory window associated with the specified inter-domain permission table entry. The descriptor PASID must match the access PASID in the entry referenced by the handle, and the allow update bit in the entry must be 1. There is no alignment requirement for the window base address or window size fields. If the window enable field in the window flags is 1, the sum of the window base address and window size in the descriptor must be less than or equal to 2^64. If window enable is 0, the window mode, window base address, and window size fields in the descriptor must be 0.
In one embodiment, an implicit drain is performed to flush out any in-flight descriptors that are still using the pre-update window attributes. Software may use the suppress drain flag to avoid the implicit drain if necessary. Further, the update window descriptor may not be included in a batch; there it is treated as an unsupported operation type.
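To summarize the update window operation in one place, the C structure below lists the fields and constraints mentioned above; names, widths, and layout are illustrative assumptions, and Tables 3 and 4 define the actual flag encodings.

    #include <stdint.h>

    /* Illustrative view of an update window descriptor (operation code 0x21). */
    struct update_window_desc {
        uint8_t  opcode;           /* 0x21                                                 */
        uint32_t pasid;            /* must equal the access PASID in the referenced IDPTE  */
        uint16_t idpt_handle;      /* entry whose window attributes are updated atomically */
        uint64_t window_base;      /* no alignment requirement                             */
        uint64_t window_size;      /* base + size must not exceed 2^64 when enabled        */
        uint8_t  window_mode;      /* address mode or offset mode                          */
        uint8_t  window_enable;    /* when 0, mode/base/size must be 0                     */
        uint8_t  read_perm, write_perm;
        uint8_t  suppress_drain;   /* skip the implicit drain of in-flight descriptors     */
    };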
Table 3 shows example window flags for the update window descriptor, and Table 4 shows update window operation-specific flags.
Table 3: window sign
Table 4: updating window operation specific flags
Additionally, some embodiments may be applied in computing systems that include one or more processors (e.g., where the one or more processors may include one or more processor cores), such as those discussed with reference to Fig. 1A and the figures below, including, for example, desktop computers, workstations, computer servers, server blades, or mobile computing devices. Mobile computing devices may include smartphones, tablets, UMPCs (Ultra Mobile Personal Computers), laptops, Ultrabook(TM) computing devices, wearable devices (such as smart watches, smart rings, smart bracelets, or smart glasses), and the like.
Fig. 4 illustrates an example computing system. Multiprocessor system 400 is an interfaced system and includes multiple processors or cores, including a first processor 470 and a second processor 480 coupled via an interface 450, such as a point-to-point (P-P) interconnect, a fabric, and/or a bus. In some examples, the first processor 470 and the second processor 480 are homogeneous. In some examples, the first processor 470 and the second processor 480 are heterogeneous. Although the example system 400 is shown with two processors, the system may have three or more processors, or may be a single-processor system. In some embodiments, the computing system is a system on a chip (SoC).
Processors 470 and 480 are shown as including integrated memory controller (integrated memory controller, IMC) circuits 472 and 482, respectively. Processor 470 also includes interface circuits 476 and 478; similarly, the second processor 480 includes interface circuits 486 and 488. Processors 470, 480 may exchange information via an interface 450 using interface circuits 478, 488. IMCs 472 and 482 couple processors 470, 480 to respective memories, namely a memory 432 and a memory 434, which may be portions of main memory locally attached to the respective processors.
Processors 470, 480 may each exchange information with a network interface (network interface, NW I/F) 490 via respective interfaces 452, 454 using interface circuits 476, 494, 486, 498. The network interface 490 (e.g., one or more of an interconnect, bus, and/or fabric, in some examples a chipset) may optionally exchange information with the coprocessor 438 via interface circuitry 492. In some examples, coprocessor 438 is a special-purpose processor, such as, for example, a high-throughput processor, a network or communication processor, compression engine, graphics processor, general-purpose graphics processing unit (general purpose graphics processing unit, GPGPU), neural-Network Processing Unit (NPU), embedded processor, or the like.
A shared cache (not shown) may be included in either processor 470, 480, or external to both processors but connected to the processors via an interface (e.g., a P-P interconnect), such that local cache information for either or both processors may be stored in the shared cache if a processor is placed in a low power mode.
The network interface 490 may be coupled to the first interface 416 via an interface circuit 496. In some examples, the first interface 416 may be an interface such as a peripheral component interconnect (Peripheral Component Interconnect, PCI) interconnect, a PCI Express (PCI Express) interconnect, or another I/O interconnect. In some examples, the first interface 416 is coupled to a power control unit (power control unit, PCU) 417, which PCU 417 may include circuitry, software, and/or firmware to perform power management operations with respect to the processors 470, 480 and/or the co-processor 438. The PCU 417 provides control information to a voltage regulator (not shown) to cause the voltage regulator to generate the appropriate regulated voltage. The PCU 417 also provides control information to control the generated operating voltage. In various examples, PCU 417 may include various power management logic units (circuits) to perform hardware-based power management. Such power management may be entirely processor controlled (e.g., by various processor hardware and may be triggered by workload and/or power constraints, thermal constraints, or other processor constraints), and/or power management may be performed in response to an external source (e.g., a platform or power management source or system software).
PCU 417 is illustrated as residing as separate logic from processor 470 and/or processor 480. In other cases, PCU 417 may execute on a given one or more of the cores (not shown) of processors 470 or 480. In some cases, PCU 417 may be implemented as a microcontroller (dedicated or general purpose) or other control logic configured to execute its own dedicated power management code (sometimes referred to as P-code). In still other examples, the power management operations to be performed by PCU 417 may be implemented external to the processor, for example, by a separate power management integrated circuit (power management integrated circuit, PMIC) or another component external to the processor. In still other examples, the power management operations to be performed by PCU 417 may be implemented within a BIOS or other system software.
Various I/O devices 414 and a bus bridge 418 may be coupled to the first interface 416, with the bus bridge 418 coupling the first interface 416 to a second interface 420. In some examples, additional processor(s) 415 are coupled to the first interface 416, such as coprocessors, high-throughput many integrated core (MIC) processors, GPGPUs, accelerators (e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays (FPGAs), or any other processor. In some examples, the second interface 420 may be a Low Pin Count (LPC) interface. Various devices may be coupled to the second interface 420, including, for example, a keyboard and/or mouse 422, communication devices 427, and storage circuitry 428. The storage circuitry 428 may be one or more non-transitory machine-readable storage media as described below, such as a disk drive or other mass storage device, which in some examples may include instructions/code and data 430. In addition, an audio I/O 424 may be coupled to the second interface 420. Note that other architectures are possible besides the point-to-point architecture described above. For example, instead of a point-to-point architecture, a system such as multiprocessor system 400 may implement a multi-drop interface or other such architecture.
Example core architectures, processors, and computer architectures.
The processor cores may be implemented in different ways, in different processors, for different purposes. For example, implementations of such cores may include: 1) a general-purpose in-order core intended for general-purpose computing; 2) a high-performance general-purpose out-of-order core intended for general-purpose computing; 3) a special-purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general-purpose in-order cores intended for general-purpose computing and/or one or more general-purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special-purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as the CPU; 3) the coprocessor on the same die as the CPU (in which case such a coprocessor is sometimes referred to as special-purpose logic, e.g., integrated graphics and/or scientific (throughput) logic, or as a special-purpose core); and 4) a system on a chip (SoC) that may include, on the same die, the described CPU (sometimes referred to as the application core(s) or application processor(s)), the coprocessor described above, and additional functionality. An example core architecture is described next, followed by descriptions of example processors and computer architectures.
Fig. 5 illustrates a block diagram of an example processor and/or SoC 500, which processor and/or SoC 500 may have one or more cores and an integrated memory controller. The processor 500 illustrated in solid line boxes has a single core 502 (a), a system agent unit circuit 510, and a set of one or more interface controller unit circuits 516, while the optionally added dashed line boxes illustrate the alternative processor 500 as having multiple cores 502 (a) - (N), a set of one or more integrated memory control unit circuits 514 in the system agent unit circuit 510, dedicated logic 508, and a set of one or more interface controller unit circuits 516. Note that processor 500 may be one of processors 470 or 480 or coprocessors 438 or 415 of fig. 4.
Thus, different implementations of the processor 500 may include: 1) A CPU, wherein dedicated logic 508 is integrated graphics and/or scientific (throughput) logic (which may include one or more cores, not shown), cores 502 (a) - (N) are one or more general-purpose cores (e.g., general-purpose ordered cores, general-purpose out-of-order cores, or a combination of both); 2) Coprocessors in which cores 502 (a) - (N) are a large number of specialized cores primarily for graphics and/or scientific (throughput) purposes; and 3) coprocessors in which cores 502 (a) - (N) are a large number of general purpose ordered cores. Thus, the processor 500 may be a general purpose processor, a coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput integrated many-core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 500 may be part of one or more substrates and/or may be implemented on one or more substrates using any of a variety of process technologies, such as complementary metal oxide semiconductor (complementary metal oxide semiconductor, CMOS), bipolar CMOS (BiCMOS), P-type metal oxide semiconductor (P-type metal oxide semiconductor, PMOS), or N-type metal oxide semiconductor (N-type metal oxide semiconductor, NMOS).
The memory hierarchy includes one or more levels of cache unit circuitry 504 (a) - (N) within cores 502 (a) - (N), a set of one or more shared cache unit circuits 506, and an external memory (not shown) coupled to the set of integrated memory controller unit circuits 514. The set of one or more shared cache unit circuits 506 may include one or more intermediate level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, such as a Last Level Cache (LLC), and/or combinations of these. While in some examples interface network circuitry 512 (e.g., a ring interconnect) provides an interface to dedicated logic 508 (e.g., integrated graphics logic), the set of shared cache unit circuits 506, and system agent unit circuitry 510, alternative examples use any number of well-known techniques to provide an interface to these units. In some examples, coherency is maintained between one or more of the shared cache unit circuits 506 and cores 502 (a) - (N). In some examples, interface controller unit circuitry 516 couples these cores 502 to one or more other devices 518, such as one or more I/O devices, storage, one or more communication devices (e.g., wireless network, wired network, etc.), and so forth.
In some examples, one or more of cores 502 (a) - (N) have multi-threading capabilities. System agent unit circuitry 510 includes those components that coordinate and operate cores 502 (A) - (N). The system agent unit circuit 510 may include, for example, a power control unit (power control unit, PCU) circuit and/or a display unit circuit (not shown). The PCU may be (or may include) logic and components required to adjust the power states of cores 502 (a) - (N) and/or dedicated logic 508 (e.g., integrated graphics logic). The display element circuit is used to drive one or more externally connected displays.
Cores 502 (a) - (N) may be homogenous in terms of instruction set architecture (instruction set architecture, ISA). Alternatively, cores 502 (A) - (N) may also be heterogeneous with respect to ISA; that is, a subset of cores 502 (a) - (N) may be capable of executing one ISA, while other cores may be capable of executing only a subset of that ISA or capable of executing another ISA.
Example core architecture-ordered and unordered core block diagrams.
FIG. 6 (A) is a block diagram illustrating both example in-order pipelines and example register renaming, out-of-order issue/execution pipelines, according to some examples. FIG. 6 (B) is a block diagram illustrating both an example in-order architecture core and an example register renaming, out-of-order issue/execution architecture core to be included in a processor, according to an example. The solid line boxes in fig. 6 (a) - (B) illustrate the in-order pipeline and in-order core, while the optionally added dashed line boxes illustrate the register renaming, out-of-order issue/execution pipeline and core. Considering that the ordered aspects are a subset of the unordered aspects, the unordered aspects will be described.
In fig. 6 (a), the processor pipeline 600 includes a fetch stage 602, an optional length decode stage 604, a decode stage 606, an optional allocate (Alloc) stage 608, an optional rename stage 610, a dispatch (also referred to as dispatch or issue) stage 612, an optional register read/memory read stage 614, an execute stage 616, a write back/memory write stage 618, an optional exception handling stage 622, and an optional commit stage 624. One or more operations may be performed in each of these processor pipeline stages. For example, during fetch stage 602, one or more instructions are fetched from instruction memory, and during decode stage 606, the fetched one or more instructions may be decoded, an address (e.g., a Load Store Unit (LSU) address) using a forwarding register port may be generated, and branch forwarding (e.g., an immediate offset or Link Register (LR)) may be performed. In one example, decode stage 606 and register read/memory read stage 614 may be combined into one pipeline stage. In one example, during the execution stage 616, decoded instructions may be executed, LSU address/data pipelining to an advanced microcontroller bus (Advanced Microcontroller Bus, AMB) interface may be performed, multiply and add operations may be performed, arithmetic operations with branch results may be performed, and so on.
By way of example, the example register renaming, out-of-order issue/execution architecture core of Fig. 6 (B) may implement pipeline 600 as follows: 1) instruction fetch circuit 638 performs the fetch and length decode stages 602 and 604; 2) decode circuit 640 performs the decode stage 606; 3) rename/allocator unit circuit 652 performs the allocation stage 608 and the rename stage 610; 4) scheduler circuit(s) 656 perform the scheduling stage 612; 5) physical register file circuit(s) 658 and memory unit circuit 670 perform the register read/memory read stage 614; the execution cluster(s) 660 perform the execution stage 616; 6) memory unit circuit 670 and physical register file circuit(s) 658 perform the write back/memory write stage 618; 7) various circuits may be involved in the exception handling stage 622; and 8) retirement unit circuit 654 and physical register file circuit(s) 658 perform the commit stage 624.
Fig. 6 (B) shows that processor core 690 includes front-end unit circuitry 630 coupled to execution engine unit circuitry 650, and both coupled to memory unit circuitry 670. The cores 690 may be reduced instruction set architecture computing (reduced instruction set architecture computing, RISC) cores, complex instruction set architecture computing (complex instruction set architecture computing, CISC) cores, very long instruction word (very long instruction word, VLIW) cores, or hybrid or alternative core types. As another option, core 690 may be a dedicated core, such as a network or communication core, compression engine, coprocessor core, general purpose computing graphics processing unit (general purpose computing graphics processing unit, GPGPU) core, graphics core, or the like.
The front end unit circuit 630 may include a branch prediction circuit 632 coupled to an instruction cache circuit 634 coupled to an instruction translation look-aside buffer (translation lookaside buffer, TLB) 636 coupled to an instruction fetch circuit 638 coupled to a decode circuit 640. In one example, instruction cache circuit 634 is included in memory cell circuit 670, rather than front-end circuit 630. The decode circuitry 640 (or decoder) may decode the instruction and generate as output one or more micro-operations, micro-code entry points, micro-instructions, other instructions, or other control signals that are decoded from, or otherwise reflect, or are derived from the original instruction. The decoding circuit 640 may further include an address generation unit (address generation unit, AGU, not shown) circuit. In one example, the AGU uses the forwarded register port to generate the LSU address and may further perform branch forwarding (e.g., immediate offset branch forwarding, LR register branch forwarding, etc.). The decoding circuit 640 may be implemented using a variety of different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (programmable logic array, PLA), microcode Read Only Memory (ROM), and the like. In one example, core 690 includes a microcode ROM (not shown) or other medium that stores microcode for certain macro-instructions (e.g., in decode circuitry 640 or otherwise within front-end circuitry 630). In one example, the decode circuitry 640 includes micro-operations (micro-ops) or operation caches (not shown) to hold/cache decoded operations, micro-tags, or micro-operations generated during the decode or other stages of the processor pipeline 600. The decode circuit 640 may be coupled to a rename/allocator unit circuit 652 in the execution engine circuit 650.
The execution engine circuitry 650 includes rename/allocator unit circuitry 652 coupled to retirement unit circuitry 654 and a set of one or more scheduler circuits 656. Scheduler circuit(s) 656 represent any number of different schedulers, including reservation stations, a central instruction window, and the like. In some examples, the scheduler circuit(s) 656 may include arithmetic logic unit (arithmetic logic unit, ALU) scheduler/scheduling circuits, ALU queues, address generation unit (address generation unit, AGU) scheduler/scheduling circuits, AGU queues, and so forth. Scheduler circuit(s) 656 are coupled to physical register file circuit(s) 658. Each of the physical register file circuit(s) 658 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer, i.e., the address of the next instruction to be executed), and so forth. In one example, physical register file circuit(s) 658 include vector register unit circuitry, write mask register unit circuitry, and scalar register unit circuitry. These register units may provide architectural vector registers, vector mask registers, general purpose registers, and so forth. The physical register file circuit(s) 658 are coupled to retirement unit circuitry 654 (also referred to as a retirement queue) to illustrate the various ways in which register renaming and out-of-order execution may be implemented (e.g., with reorder buffer(s) (ROB) and retirement register file(s), with future file(s), history buffer(s), and retirement register file(s), with a register map and a pool of registers, etc.). Retirement unit circuitry 654 and physical register file circuit(s) 658 are coupled to execution cluster(s) 660. Execution cluster(s) 660 include a set of one or more execution unit circuits 662 and a set of one or more memory access circuits 664. Execution unit circuit(s) 662 may perform various arithmetic, logic, floating point, or other types of operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some examples may include several execution units or execution unit circuits dedicated to a particular function or set of functions, other examples may include only one execution unit circuit or multiple execution units/execution unit circuits that all perform all functions. The scheduler circuit(s) 656, physical register file circuit(s) 658, and execution cluster(s) 660 are shown as potentially multiple because some examples create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline, each having its own scheduler circuit, physical register file circuit(s), and/or execution cluster; in the case of a separate memory access pipeline, some examples are implemented in which only the execution cluster of that pipeline has the memory access unit(s) 664). It should also be appreciated that where separate pipelines are used, one or more of these pipelines may be out-of-order in issue/execution, with the remainder being in order.
In some examples, the execution engine unit circuitry 650 may perform load store unit (LSU) address/data pipelining to an Advanced Microcontroller Bus (AMB) interface (not shown), as well as address-phase and writeback, data-phase loads, stores, and branches.
The set of memory access circuits 664 is coupled to memory unit circuitry 670, which includes data TLB circuitry 672, which is coupled to data cache circuitry 674, which is coupled to level 2 (L2) cache circuitry 676. In one example, the memory access circuitry 664 may include load unit circuitry, store address unit circuitry, and store data unit circuitry, each of which is coupled to the data TLB circuitry 672 in the memory unit circuitry 670. The instruction cache circuitry 634 is further coupled to the level 2 (L2) cache circuitry 676 in the memory unit circuitry 670. In one example, the instruction cache 634 and the data cache 674 are combined into a single instruction and data cache (not shown) in the L2 cache circuitry 676, level 3 (L3) cache circuitry (not shown), and/or main memory. The L2 cache circuitry 676 is coupled to one or more other levels of cache and ultimately to main memory.
The core 690 may support one or more instruction sets (e.g., the x86 instruction set architecture (optionally with some extensions that have been added in newer versions), the MIPS instruction set architecture, the ARM instruction set architecture (optionally with additional extensions such as NEON)), including the instruction(s) described herein. In one example, the core 690 includes logic to support a packed data instruction set architecture extension (e.g., AVX1, AVX2), thereby allowing operations used by many multimedia applications to be performed using packed data.
Example execution unit circuitry.
Fig. 7 illustrates an example of execution unit circuit(s), such as execution unit circuit(s) 662 of FIG. 6 (B). As shown, execution unit circuit(s) 662 may include one or more ALU circuits 701, optional vector/single instruction multiple data (single instruction multiple data, SIMD) circuits 703, load/store circuits 705, branch/jump circuits 707, and/or Floating Point Unit (FPU) circuits 709. The ALU circuits 701 perform integer arithmetic and/or Boolean operations. The vector/SIMD circuits 703 perform vector/SIMD operations on packed data (e.g., SIMD/vector registers). The load/store circuits 705 execute load and store instructions to load data from memory into registers or store data from registers to memory. The load/store circuits 705 may also generate addresses. The branch/jump circuits 707 cause a branch or jump to a memory address depending on the instruction. The FPU circuits 709 perform floating point arithmetic. The width of the execution unit circuit(s) 662 varies depending on the example and may range from 16 bits to 1024 bits, for example. In some examples, two or more smaller execution units are logically combined to form a larger execution unit (e.g., two 128-bit execution units are logically combined to form a 256-bit execution unit).
Example register architecture.
FIG. 8 is a block diagram of a register architecture 800 according to some examples. As shown, the register architecture 800 includes vector/SIMD registers 810 that vary in width from 128 bits to 1024 bits. In some examples, vector/SIMD register 810 is 512 bits physically and, depending on the mapping, only some of the low order bits are used. For example, in some examples, vector/SIMD register 810 is a 512-bit ZMM register: the lower 256 bits are used for the YMM register and the lower 128 bits are used for the XMM register. Thus, there is an overlay of registers. In some examples, the vector length field selects between a maximum length and one or more other shorter lengths, where each such shorter length is half the length of the previous length. Scalar operations are operations performed on the lowest order data element locations in the ZMM/YMM/XMM registers; the higher order data element position is either kept the same as it was before the instruction or zeroed out, depending on the example.
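As a rough illustration of this overlay, the following C sketch models a single 512-bit register as a byte array whose low 256 and 128 bits serve as the YMM and XMM views; the type and helper names are assumptions made for this example and do not correspond to any actual register file implementation.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical model of one 512-bit vector register (ZMM). The YMM and
 * XMM views are simply the low 256 and 128 bits of the same storage. */
typedef struct {
    uint8_t bytes[64];              /* 512 bits of backing storage */
} zmm_reg;

/* Return a pointer to the aliased YMM view (low 32 bytes). */
static uint8_t *ymm_view(zmm_reg *z) { return z->bytes; }

/* Return a pointer to the aliased XMM view (low 16 bytes). */
static uint8_t *xmm_view(zmm_reg *z) { return z->bytes; }

int main(void) {
    zmm_reg zmm0;
    memset(zmm0.bytes, 0xAA, sizeof zmm0.bytes);

    /* Writing through the XMM view modifies only the low 128 bits;
     * the upper bytes of the ZMM register are untouched here. */
    memset(xmm_view(&zmm0), 0x11, 16);

    printf("byte 0  = 0x%02X (XMM/YMM/ZMM overlap)\n", zmm0.bytes[0]);
    printf("byte 16 = 0x%02X (YMM/ZMM only)\n", ymm_view(&zmm0)[16]);
    printf("byte 32 = 0x%02X (ZMM only)\n", zmm0.bytes[32]);
    return 0;
}
```

Running the sketch shows byte 0 changing under the XMM write while bytes 16 and 32 retain the original pattern, mirroring the ZMM/YMM/XMM aliasing described above.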
In some examples, the register architecture 800 includes a write mask/predicate (predicate) register 815. For example, in some examples, there are 8 write mask/predicate registers (sometimes referred to as k0 through k7), each of which is 16, 32, 64, or 128 bits in size. The write mask/predicate registers 815 may allow merging (e.g., allowing any set of elements in the destination to be protected from updates during execution of any operation) and/or zeroing (e.g., a zeroing vector mask allows any set of elements in the destination to be zeroed during execution of any operation). In some examples, each data element position in a given write mask/predicate register 815 corresponds to a data element position of the destination. In other examples, the write mask/predicate registers 815 are scalable and consist of a set number of enable bits for a given vector element (e.g., 8 enable bits per 64-bit vector element).
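The merging and zeroing behaviors can be illustrated with a small C sketch that applies an 8-bit mask to an 8-element destination; the function name and element type are assumptions for illustration, not a model of any particular hardware.

```c
#include <stdint.h>
#include <stddef.h>

/* Apply 'src' to 'dst' under an 8-bit write mask.
 * Merging: elements whose mask bit is 0 keep their old value.
 * Zeroing: elements whose mask bit is 0 are set to zero.            */
static void masked_copy(int64_t *dst, const int64_t *src,
                        uint8_t mask, int zeroing, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (mask & (1u << i)) {
            dst[i] = src[i];     /* mask bit 1: element is updated        */
        } else if (zeroing) {
            dst[i] = 0;          /* zeroing-masking: clear the element    */
        }                        /* merging-masking: leave dst[i] as is   */
    }
}
```

With zeroing set to 0 the call behaves like merging-masking (unselected elements keep their old values); with zeroing set to 1 the unselected elements are cleared, matching the two behaviors described above.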
The register architecture 800 includes a plurality of general purpose registers 825. These registers may be 16 bits, 32 bits, 64 bits, etc., and can be used for scalar operations. In some examples, these registers are referred to by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP and R8 to R15.
In some examples, the register architecture 800 includes a scalar floating-point (FP) register file 845, which is used for scalar floating point operations on 32/64/80-bit floating point data using the x87 instruction set architecture extension, or as MMX registers to perform operations on 64-bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.
One or more flag registers 840 (e.g., EFLAGS, RFLAGS, etc.) store state and control information for arithmetic, comparison, and system operation. For example, one or more flag registers 840 may store condition code information such as carry, parity, auxiliary carry, zero, sign, and overflow. In some examples, one or more flag registers 840 are referred to as program status and control registers.
Segment registers 820 contain segment pointers used for accessing memory. In some examples, these registers are referred to by the names CS, DS, SS, ES, FS, and GS.
Machine-specific registers (MSRs) 835 control and report on processor performance. Most MSRs 835 handle system-related functions and are not accessible to applications. Machine check registers 860 consist of control, status, and error reporting MSRs that are used to detect and report hardware errors.
One or more instruction pointer registers 830 store instruction pointer values. Control register(s) 855 (e.g., CR0-CR 4) determine the operating mode of the processor (e.g., processors 470, 480, 438, 418, and/or 500) and the nature of the task currently being performed. Debug registers 850 control and allow for monitoring of debug operations of the processor or core.
A memory (mem) management register 865 specifies the location of the data structure used in protected mode memory management. These registers may include global descriptor table registers (global descriptor table register, GDTR), interrupt descriptor table registers (interrupt descriptor table register, IDTR), task registers, and local descriptor table registers (local descriptor table register, LDTR).
Alternative examples may use wider or narrower registers. Further, alternative examples may use more, fewer, or different register files and registers. The register architecture 800 may be used, for example, in a register file/memory ISAB08, or in a physical register file circuit 658.
Instruction set architecture.
An instruction set architecture (instruction set architecture, ISA) may include one or more instruction formats. A given instruction format may define various fields (e.g., number of bits, location of bits) to specify the operation to be performed (e.g., the opcode), the operand(s) on which that operation is to be performed, and/or other data field(s) (e.g., mask), and so on. Some instruction formats are further broken down through the definition of instruction templates (or sub-formats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the fields of that instruction format (the included fields are typically in the same order, but at least some have different bit positions because fewer fields are included) and/or to have a given field interpreted differently. Thus, each instruction of the ISA is expressed using a given instruction format (and, if defined, in one of the instruction templates of that instruction format) and includes fields for specifying the operation and the operands. For example, an example ADD instruction has a particular opcode and an instruction format that includes an opcode field to specify the opcode and operand fields to select operands (source 1/destination and source 2); and an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands. Moreover, while the following description is made in the context of the x86 ISA, it is within the knowledge of one skilled in the art to apply the teachings of the present disclosure to other ISAs.
Example instruction format.
Examples of the instruction(s) described herein may be implemented in different formats. Furthermore, example systems, architectures, and pipelines are detailed below. Examples of instruction(s) may be executed on these systems, architectures, and pipelines, but are not limited to those detailed.
Fig. 9 illustrates an example of an instruction format. As shown, an instruction may include a number of components, including, but not limited to, one or more fields for: one or more prefixes 901, an opcode 903, addressing information 905 (e.g., register identifiers, memory addressing information, etc.), a displacement value 907, and/or an immediate value 909. Note that some instructions utilize some or all of the fields of the format, while other instructions may use only the field for the opcode 903. In some examples, the order illustrated is the order in which these fields are to be encoded; however, it should be appreciated that in other examples, these fields may be encoded in a different order, combined, etc.
Prefix field(s) 901 modify an instruction when used. In some examples, one or more prefixes are used to repeat string instructions (e.g., 0xF0, 0xF2, 0xF3, etc.), provide segment overrides (e.g., 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65, 0x2E, 0x3E, etc.), perform bus lock operations, and/or change operand size (e.g., 0x66) and address size (e.g., 0x67). Some instructions require a mandatory prefix (e.g., 0x66, 0xF2, 0xF3, etc.). Some of these prefixes may be considered "legacy" prefixes. Other prefixes (one or more examples of which are detailed herein) indicate and/or provide further capabilities, such as specifying particular registers, and so on. These other prefixes typically follow the "legacy" prefixes.
The opcode field 903 is used to at least partially define the operation to be performed upon decoding of an instruction. In some examples, the length of the primary opcode encoded in opcode field 903 is one, two, or three bytes. In other examples, the primary opcode may be other lengths. An additional 3-bit opcode field is sometimes encoded in another field.
The addressing information field 905 is used to address one or more operands of an instruction, such as a location in memory or one or more registers. Fig. 10 illustrates an example of the addressing information field 905. In this illustration, an optional MOD R/M byte 1002 and an optional Scale-Index-Base (SIB) byte 1004 are shown. The MOD R/M byte 1002 and the SIB byte 1004 are used to encode up to two operands of an instruction, each of which is a direct register or an effective memory address. Note that both of these fields are optional, i.e., not all instructions include one or more of these fields. The MOD R/M byte 1002 includes a MOD field 1042, a register (reg) field 1044, and an R/M field 1046.
The contents of the MOD field 1042 distinguish between memory access and non-memory access modes. In some examples, the register-direct addressing mode is used when the MOD field 1042 has a binary value of 11 (11b); otherwise, a register-indirect addressing mode is used.
The register field 1044 may encode a destination register operand or a source register operand or may also encode an opcode extension without being used to encode any instruction operands. The contents of register field 1044 specify the location (in a register or in memory) of the source or destination operand directly or through address generation. In some examples, register field 1044 is complemented with additional bits from a prefix (e.g., prefix 901) to allow for greater addressing.
The R/M field 1046 may be used to encode an instruction operand that references a memory address, or may be used to encode a destination register operand or a source register operand. Note that in some examples, R/M field 1046 may be combined with MOD field 1042 to specify an addressing mode.
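As a simplified view of how the MOD, reg, and R/M fields share a single byte, the following C sketch extracts the three fields and applies the register-direct test described above; it is a minimal fragment under the bit layout just described (MOD in bits 7:6, reg in bits 5:3, R/M in bits 2:0), not a complete x86 decoder.

```c
#include <stdint.h>

typedef struct {
    uint8_t mod;   /* bits [7:6]: addressing mode                        */
    uint8_t reg;   /* bits [5:3]: register operand or opcode extension   */
    uint8_t rm;    /* bits [2:0]: register or memory operand             */
} modrm_fields;

static modrm_fields decode_modrm(uint8_t modrm) {
    modrm_fields f;
    f.mod = (modrm >> 6) & 0x3;
    f.reg = (modrm >> 3) & 0x7;
    f.rm  = modrm & 0x7;
    return f;
}

/* MOD == 11b selects register-direct addressing; any other value
 * selects a register-indirect (memory) form. */
static int is_register_direct(modrm_fields f) {
    return f.mod == 0x3;
}
```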
SIB byte 1004 includes a scaling field 1052, an index field 1054, and a base address field 1056 for the generation of an address. The scaling field 1052 indicates a scaling factor. The index field 1054 specifies the index register to be used. In some examples, the index field 1054 is complemented with additional bits from a prefix (e.g., prefix 901) to allow for greater addressing. The base address field 1056 specifies the base address register to be used. In some examples, the base address field 1056 is complemented with additional bits from a prefix (e.g., prefix 901) to allow for greater addressing. In practice, the contents of the scaling field 1052 allow the contents of the index field 1054 to be scaled for memory address generation (e.g., for address generation that uses 2^scale * index + base).
Some forms of addressing utilize a displacement value to generate a memory address. For example, a memory address may be generated according to 2^scale * index + base + displacement, index * scale + displacement, r/m + displacement, instruction pointer (RIP/EIP) + displacement, register + displacement, and the like. The displacement may be a 1-byte, 2-byte, 4-byte, etc. value. In some examples, the displacement field 907 provides this value. Furthermore, in some examples, the use of a displacement factor is encoded in the MOD field of the addressing information field 905, which indicates a compressed displacement scheme for which the displacement value is calculated and stored in the displacement field 907.
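A small C helper can summarize the scale/index/base/displacement combination described above; register lookup, the RIP-relative case, and compressed displacements are omitted, and the function name is an assumption for the sketch.

```c
#include <stdint.h>

/* Compute an effective address from decoded SIB fields and a displacement.
 * 'scale' is the raw 2-bit scale field, so the scale factor is 1, 2, 4, or 8. */
static uint64_t effective_address(uint8_t scale, uint64_t index,
                                  uint64_t base, int64_t displacement) {
    uint64_t scaled_index = index << scale;   /* 2^scale * index */
    return scaled_index + base + (uint64_t)displacement;
}
```

For example, effective_address(2, index, base, -8) corresponds to 4 * index + base - 8.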
In some examples, the immediate value field 909 specifies an immediate value for the instruction. The immediate value may be encoded as a 1 byte value, a 2 byte value, a 4 byte value, and so on.
Fig. 11 illustrates an example of a first prefix 901 (a). In some examples, the first prefix 901 (a) is an example of a REX prefix. Instructions using this prefix may specify general purpose registers, 64-bit packed data registers (e.g., single Instruction Multiple Data (SIMD) registers or vector registers), and/or control registers and debug registers (e.g., CR8-CR15 and DR8-DR 15).
An instruction using the first prefix 901 (A) may specify up to three registers using 3-bit fields, depending on the format: 1) using the reg field 1044 and the R/M field 1046 of the MOD R/M byte 1002; 2) using the MOD R/M byte 1002 together with the SIB byte 1004, including using the reg field 1044 and the base address field 1056 and index field 1054; or 3) using the register field of an opcode.
In the first prefix 901 (A), bit positions 7:4 are set to 0100. Bit position 3 (W) may be used to determine the operand size but may not solely determine the operand width. Thus, when W = 0, the operand size is determined by the code segment descriptor (code segment descriptor, CS.D), and when W = 1, the operand size is 64 bits.
Note that adding another bit allows 16 (2^4) registers to be addressed, whereas the MOD R/M reg field 1044 and the MOD R/M R/M field 1046 alone can each address only 8 registers.
In the first prefix 901 (a), bit position 2 (R) may be an extension of the reg field 1044 of MOD R/M, and may be used to modify the reg field 1044 of MOD R/M when the field encodes a general purpose register, a 64-bit packed data register (e.g., SSE register), or a control or debug register. When MOD R/M byte 1002 specifies other registers or defines an extended opcode, R is ignored.
Bit position 1 (X) may modify SIB byte index field 1054.
Bit position 0 (B) may modify the base address in the R/M field 1046 or SIB byte base address field 1056 of MOD R/M; or it may modify the opcode register field for accessing a general purpose register (e.g., general purpose register 825).
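A compact C sketch of how the W, R, X, and B bits might be pulled out of such a prefix byte, assuming the 0100 pattern in bits 7:4 described above; the structure and function names are illustrative only.

```c
#include <stdint.h>

typedef struct {
    uint8_t w;  /* bit 3: 64-bit operand size when set                 */
    uint8_t r;  /* bit 2: extends the MOD R/M reg field                */
    uint8_t x;  /* bit 1: extends the SIB index field                  */
    uint8_t b;  /* bit 0: extends the MOD R/M R/M or SIB base field    */
} rex_bits;

/* Returns 1 and fills 'out' if 'byte' looks like this prefix (0100WRXB). */
static int decode_rex_like_prefix(uint8_t byte, rex_bits *out) {
    if ((byte & 0xF0) != 0x40)
        return 0;
    out->w = (byte >> 3) & 1;
    out->r = (byte >> 2) & 1;
    out->x = (byte >> 1) & 1;
    out->b = byte & 1;
    return 1;
}
```

An extended register specifier can then be formed as, e.g., (b << 3) | rm, which is how the extra bit yields 16 addressable registers rather than 8.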
Fig. 12 (a) - (D) illustrate examples of how the R, X and B fields of the first prefix 901 (a) are used. FIG. 12 (A) illustrates that R and B from the first prefix 901 (A) are used to extend the reg field 1044 and R/M field 1046 of the MOD R/M byte 1002 when the SIB byte 1004 is not used for memory addressing. Fig. 12 (B) illustrates that when SIB byte 1004 is not used, R and B from first prefix 901 (a) are used to extend reg field 1044 and R/M field 1046 of MOD R/M byte 1002 (register-register addressing). Fig. 12 (C) illustrates that R, X and B from the first prefix 901 (a) are used to extend reg field 1044 and index and base fields 1054 and 1056 of MOD R/M byte 1002 when SIB byte 1004 is used for memory addressing. Fig. 12 (D) illustrates that B from the first prefix 901 (a) is used to extend the reg field 1044 of MOD R/M byte 1002 when a register is encoded in the opcode 903.
Fig. 13 (A) - (B) illustrate examples of the second prefix 901 (B). In some examples, the second prefix 901 (B) is an example of a VEX prefix. The second prefix 901 (B) encoding allows instructions to have more than two operands and allows SIMD vector registers (e.g., vector/SIMD registers 810) to be longer than 64 bits (e.g., 128 bits and 256 bits). The use of the second prefix 901 (B) provides a three-operand (or more) syntax. For example, previous two-operand instructions performed operations such as A = A + B, which overwrites a source operand. The use of the second prefix 901 (B) enables operands to perform nondestructive operations, such as A = B + C.
In some examples, the second prefix 901 (B) has two forms, a two byte form and a three byte form. A second prefix 901 (B) of two bytes is used primarily for 128-bit, scalar and some 256-bit instructions; while a three byte second prefix 901 (B) provides for a compact replacement of the 3 byte opcode instruction and the first prefix 901 (a).
Fig. 13 (A) illustrates an example of a two-byte form of the second prefix 901 (B). In one example, format field 1301 (byte 0 1303) contains a value of C5H. In one example, byte 1 1305 includes an "R" value in bit [7]. This value is the complement of the "R" value of the first prefix 901 (A). Bit [2] is used to specify the length (L) of the vector (where a value of 0 indicates a scalar or 128-bit vector and a value of 1 indicates a 256-bit vector). Bits [1:0] provide opcode extensibility equivalent to some legacy prefixes (e.g., 00 = no prefix, 01 = 66H, 10 = F3H, and 11 = F2H). Bits [6:3], shown as vvvv, may be used to: 1) encode a first source register operand, specified in inverted (1's complement) form, valid for instructions with two or more source operands; 2) encode a destination register operand, specified in 1's complement form, for certain vector shifts; or 3) not encode any operand, in which case the field is reserved and should contain a certain value, such as 1111b.
An instruction using this prefix may use the R/M field 1046 of the MOD R/M to encode an instruction operand that references a memory address, or to encode a destination register operand or a source register operand.
Instructions that use this prefix may use the reg field 1044 of the MOD R/M to encode either the destination register operand or the source register operand, or be treated as an opcode extension without being used to encode any instruction operands.
For the instruction syntax that supports four operands, vvvv, the R/M field 1046 of MOD R/M, and the reg field 1044 of MOD R/M encode three of the four operands. Bits [7:4] of the immediate value field 909 are then used to encode the third source register operand.
Fig. 13 (B) illustrates an example of a three-byte form of the second prefix 901 (B). In one example, format field 1311 (byte 0 1313) contains a value of C4H. Byte 1 1315 includes "R", "X", and "B" in bits [7:5], which are the complements of these values in the first prefix 901 (A). Bits [4:0] of byte 1 1315 (shown as mmmmm) include content to encode, as needed, one or more implied leading opcode bytes. For example, 00001 implies a 0FH leading opcode, 00010 implies a 0F38H leading opcode, 00011 implies a 0F3AH leading opcode, and so on.
Bit [7] of byte 2 1317 is used similarly to W of the first prefix 901 (A), including helping to determine promotable operand sizes. Bit [2] is used to specify the length (L) of the vector (where a value of 0 indicates a scalar or 128-bit vector and a value of 1 indicates a 256-bit vector). Bits [1:0] provide opcode extensibility equivalent to some legacy prefixes (e.g., 00 = no prefix, 01 = 66H, 10 = F3H, and 11 = F2H). Bits [6:3], shown as vvvv, may be used to: 1) encode a first source register operand, specified in inverted (1's complement) form, valid for instructions with two or more source operands; 2) encode a destination register operand, specified in 1's complement form, for certain vector shifts; or 3) not encode any operand, in which case the field is reserved and should contain a certain value, such as 1111b.
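Putting the byte 1 and byte 2 layouts described above together, a C sketch of a decoder for the three-byte form might look as follows; the struct fields follow the text (R/X/B, mmmmm, W, vvvv, L, pp), and the exact semantics of each field beyond simple bit extraction are not modeled.

```c
#include <stdint.h>

typedef struct {
    uint8_t r, x, b;   /* inverted register-extension bits (byte 1, bits 7:5)      */
    uint8_t map;       /* mmmmm: implied leading opcode bytes (byte 1, bits 4:0)   */
    uint8_t w;         /* byte 2, bit 7                                            */
    uint8_t vvvv;      /* byte 2, bits 6:3, register specifier stored inverted     */
    uint8_t l;         /* vector length: 0 = scalar/128-bit, 1 = 256-bit           */
    uint8_t pp;        /* opcode extension: 00/01/10/11 = none/66H/F3H/F2H         */
} vex3_fields;

/* Decode the two payload bytes that follow the C4H format byte.
 * A real decoder would also validate the mmmmm value and fetch the opcode. */
static vex3_fields decode_vex3(uint8_t byte1, uint8_t byte2) {
    vex3_fields f;
    f.r    = (byte1 >> 7) & 1;
    f.x    = (byte1 >> 6) & 1;
    f.b    = (byte1 >> 5) & 1;
    f.map  = byte1 & 0x1F;
    f.w    = (byte2 >> 7) & 1;
    f.vvvv = (~byte2 >> 3) & 0xF;   /* un-invert the 1's-complement register specifier */
    f.l    = (byte2 >> 2) & 1;
    f.pp   = byte2 & 0x3;
    return f;
}
```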
An instruction using this prefix may use the R/M field 1046 of the MOD R/M to encode an instruction operand that references a memory address, or to encode a destination register operand or a source register operand.
Instructions that use this prefix may use the reg field 1044 of the MOD R/M to encode either the destination register operand or the source register operand, or be treated as an opcode extension without being used to encode any instruction operands.
For the instruction syntax that supports four operands, vvvv, the R/M field 1046 of MOD R/M, and the reg field 1044 of MOD R/M encode three of the four operands. Bits [7:4] of the immediate value field 909 are then used to encode the third source register operand.
Fig. 14 illustrates an example of a third prefix 901 (C). In some examples, the third prefix 901 (C) is an example of an EVEX prefix. The third prefix 901 (C) is a four byte prefix.
The third prefix 901 (C) is capable of encoding 32 vector registers (e.g., 128-bit, 256-bit, and 512-bit registers) in 64-bit mode. In some examples, instructions that utilize a write mask/operation mask (see the discussion of registers in a previous figure, e.g., FIG. 8) or predication utilize this prefix. The operation mask register allows conditional processing or selection control. Operation mask instructions, whose source/destination operands are operation mask registers and which treat the contents of an operation mask register as a single value, are encoded using the second prefix 901 (B).
The third prefix 901 (C) may encode instruction class specific functions (e.g., a packed instruction with "load + operation" semantics may support an embedded broadcast function, a floating point instruction with rounding semantics may support a static rounding function, a floating point instruction with non-rounding arithmetic semantics may support a "suppress all exceptions" function, etc.).
The first byte of the third prefix 901 (C) is a format field 1411, which in one example has a value of 62H. The subsequent bytes are referred to as payload bytes 1415-1419 and collectively form a 24-bit value of P [23:0], providing specific capabilities in the form of one or more fields (detailed herein).
In some examples, P[1:0] of payload byte 1419 are identical to the two low-order mm bits. In some examples, P[3:2] are reserved. Bit P[4] (R'), when combined with P[7] and the reg field 1044 of MOD R/M, allows access to the upper 16 vector register set. P[6] may also provide access to the upper 16 vector registers when SIB-type addressing is not required. P[7:5] consists of R, X, and B, which are operand specifier modifier bits for vector register, general purpose register, and memory addressing; when combined with the MOD R/M reg field 1044 and the MOD R/M R/M field 1046, they allow access to the next set of 8 registers beyond the lower 8 registers. P[9:8] provide opcode extensibility equivalent to some legacy prefixes (e.g., 00 = no prefix, 01 = 66H, 10 = F3H, and 11 = F2H). P[10] is a fixed value of 1 in some examples. P[14:11], shown as vvvv, may be used to: 1) encode a first source register operand, specified in inverted (1's complement) form, valid for instructions with two or more source operands; 2) encode a destination register operand, specified in 1's complement form, for certain vector shifts; or 3) not encode any operand, in which case the field is reserved and should contain a certain value, such as 1111b.
P[15] is similar to W of the first prefix 901 (A) and the second prefix 901 (B) and may serve as an opcode extension bit or for operand size promotion.
P[18:16] specify the index of a register in the operation mask (write mask) registers (e.g., write mask/predicate registers 815). In one example, the particular value aaa = 000 has a special behavior implying that no operation mask is used for the particular instruction (this may be implemented in a variety of ways, including the use of an operation mask hardwired to all ones or hardware that bypasses the masking hardware). When merging, vector masks allow any set of elements in the destination to be protected from updates during execution of any operation (specified by the base operation and the augmentation operation); in one example, the old value of each element of the destination is preserved where the corresponding mask bit has a value of 0. In contrast, when zeroing, vector masks allow any set of elements in the destination to be zeroed during execution of any operation (specified by the base operation and the augmentation operation); in one example, an element of the destination is set to 0 when the corresponding mask bit has a value of 0. A subset of this functionality is the ability to control the vector length of the operation being performed (i.e., the span of elements being modified, from the first to the last); however, the elements that are modified are not necessarily contiguous. Thus, the operation mask field allows for partial vector operations, including loads, stores, arithmetic, logical, and so on. While in the described examples the contents of the operation mask field select which of a number of operation mask registers contains the operation mask to be used (such that the contents of the operation mask field indirectly identify the masking to be performed), alternative examples instead allow the contents of the write mask field to directly specify the masking to be performed.
P[19] may be combined with P[14:11] to encode a second source vector register in a non-destructive source syntax, which may utilize P[19] to access the upper 16 vector registers. P[20] encodes a variety of functions that differ across instruction classes and can affect the meaning of the vector length/rounding control specifier field (P[22:21]). P[23] indicates support for merging-write masking (e.g., when set to 0) or support for zeroing and merging-write masking (e.g., when set to 1).
The following table details an example of encoding registers in an instruction using the third prefix 901 (C).
Table 5: 32 register support in 64-bit mode
Table 6: encoding register designators in 32-bit mode
Table 7: operation mask register designator encoding
Program code may be applied to input information to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system having a processor, such as a digital signal processor (digital signal processor, DSP), microcontroller, application specific integrated circuit (application specific integrated circuit, ASIC), field programmable gate array (field programmable gate array, FPGA), microprocessor, or any combination thereof.
Program code may be implemented in a process-or object-oriented high-level programming language to communicate with a processing system. Program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
Examples of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementation approaches. Examples may be implemented as a computer program or program code that is executed on a programmable system including at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
One or more aspects of at least one example may be implemented by representative instructions, stored on a machine-readable medium, which represent various logic within the processor and which, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, referred to as "intellectual property (IP) cores," may be stored on a tangible machine-readable medium and supplied to various customers or manufacturing facilities to be loaded into the fabrication machines that make the logic or processor.
Such machine-readable storage media may include, but are not limited to, non-transitory tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as: hard disk, any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (compact disk rewritable, CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (random access memory, RAMs) such as dynamic random access memories (dynamic random access memory, DRAMs), static random access memories (static random access memory, SRAMs), erasable programmable read-only memories (erasable programmable read-only memories, EPROMs), flash memories, electrically erasable programmable read-only memories (electrically erasable programmable read-only memories, EEPROMs), phase change memories (phase change memory, PCMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
Thus, examples also include non-transitory tangible machine-readable media containing instructions or containing design data defining the structures, circuits, devices, processors, and/or system features described herein, such as hardware description language (Hardware Description Language, HDL). Such an example may also be referred to as a program product.
Simulation (including binary translation, code morphing, etc.).
In some cases, an instruction converter may be used to convert instructions from a source instruction set architecture to a target instruction set architecture. For example, the instruction converter may translate (e.g., utilizing static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert instructions to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on-processor, off-processor, or a portion on-processor and a portion off-processor.
FIG. 15 is a block diagram illustrating the use of a software instruction converter to convert binary instructions in a source ISA to binary instructions in a target ISA, according to an example. In the illustrated example, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. FIG. 15 illustrates that a program in a high-level language 1502 can be compiled using a first ISA compiler 1504 to generate first ISA binary code 1506, which can be natively executed by a processor 1516 with at least one first ISA core. The processor 1516 with at least one first ISA core represents any processor that can perform substantially the same functions as an Intel processor with at least one first ISA core, so as to achieve substantially the same results as a processor with at least one first ISA core, by compatibly executing or otherwise processing (1) a substantial portion of the first ISA or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one first ISA core. The first ISA compiler 1504 represents a compiler operable to generate first ISA binary code 1506 (e.g., object code) that is capable of being executed, with or without additional linkage processing, on the processor 1516 with at least one first ISA core. Similarly, FIG. 15 illustrates that a program in the high-level language 1502 may be compiled using an alternative ISA compiler 1508 to generate alternative ISA binary code 1510 that may be natively executed by a processor 1514 without a first ISA core. The instruction converter 1512 is used to convert the first ISA binary code 1506 into code that can be natively executed by the processor 1514 without a first ISA core. The converted code is not necessarily identical to the alternative ISA binary code 1510; however, the converted code will accomplish the general operation and be made up of instructions from the alternative ISA. Thus, the instruction converter 1512 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have a first ISA processor or core to execute the first ISA binary code 1506.
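As a loose sketch of the conversion step performed by such an instruction converter, the following C fragment walks a buffer of already-decoded source-ISA operations and emits target-ISA byte sequences through a per-opcode handler table; the types, the handler table, and the fallback behavior are invented for this illustration and do not describe any particular converter.

```c
#include <stddef.h>
#include <stdint.h>

/* A decoded source-ISA operation (greatly simplified). */
typedef struct {
    uint16_t opcode;
    uint64_t operands[2];
} src_insn;

/* Emit target-ISA bytes for one source instruction; returns bytes written. */
typedef size_t (*translate_fn)(const src_insn *in, uint8_t *out, size_t cap);

/* Translate a block of source instructions using a per-opcode handler table. */
static size_t translate_block(const src_insn *in, size_t n,
                              const translate_fn table[], size_t table_len,
                              uint8_t *out, size_t cap) {
    size_t written = 0;
    for (size_t i = 0; i < n && written < cap; i++) {
        translate_fn fn = (in[i].opcode < table_len) ? table[in[i].opcode] : NULL;
        if (fn == NULL)
            break;  /* unhandled opcode: a real converter would fall back to interpretation */
        written += fn(&in[i], out + written, cap - written);
    }
    return written;
}
```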
References to "an example," "one example," etc., indicate that the example described may include a particular feature, structure, or characteristic, but every example may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same example. Further, when a particular feature, structure, or characteristic is described in connection with an example, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other examples (whether or not explicitly described).
Furthermore, in the various examples described above, unless specifically noted otherwise, a selected language such as the phrase "A, B or at least one of C" or "A, B and/or C" should be understood to refer to A, B or C, or any combination thereof (i.e., a and B, a and C, B and C, and A, B and C).
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the claims.
The following examples relate to further embodiments. Example 1 includes an apparatus comprising: work queue circuitry for storing one or more descriptors; an arbiter circuit for dispatching a descriptor from the work queue, wherein the descriptor includes an operation field capable of specifying an inter-domain operation to be performed, a submitter PASID field capable of specifying a submitter Process Address Space Identifier (PASID) of a submitter process to be executed on the processor, and a handle to a destination PASID associated with an address space of another process; and an engine circuit for: verifying whether the submitter process is permitted to access the address space of the other process based at least in part on the submitter PASID and the destination PASID handle; and processing the descriptor based at least in part on the inter-domain operation specified by the operation field.
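For illustration only, the descriptor arrangement recited in Example 1 might be modeled in C roughly as follows; the field names, widths, ordering, and operation codes are assumptions made for this sketch and are not the actual descriptor layout defined by the design.

```c
#include <stdint.h>

/* Hypothetical inter-domain operation codes; the real encoding is not
 * specified here. */
enum interdomain_op {
    OP_COPY    = 1,
    OP_FILL    = 2,
    OP_COMPARE = 3,
    OP_FLUSH   = 4,
};

/* Rough model of a work descriptor carrying an operation field, the
 * submitter's PASID, and a handle to the destination PASID. */
struct interdomain_descriptor {
    uint32_t operation;          /* enum interdomain_op: inter-domain operation  */
    uint32_t submitter_pasid;    /* PASID of the submitting process              */
    uint32_t dest_pasid_handle;  /* handle referring to the other address space  */
    uint64_t src_addr;           /* source address, interpreted per operation    */
    uint64_t dst_addr;           /* destination address in the other domain      */
    uint64_t size;               /* transfer size in bytes                       */
};
```

Before processing such a descriptor, the engine would check, for example against the permission bitmap of Example 3, that the submitter PASID is allowed to use the destination PASID handle.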
Example 2 includes the apparatus of example 1, wherein the inter-domain operation to be performed is at least one of: a copy operation, a fill operation, a compare mode operation, and a flush operation. Example 3 includes the apparatus of example 1, further comprising: a cache for storing bitmaps associated with one or more submitter processes that are allowed to submit inter-domain operations.
Example 4 includes the apparatus of example 3, wherein a hardware accelerator includes the work queue, the arbiter, the engine, and the cache. Example 5 includes the apparatus of example 3, wherein an invalidation command is issued in response to a determination that the bitmap has been modified. Example 6 includes the apparatus of example 1, wherein a hardware accelerator includes the work queue, the arbiter, and the engine.
Example 7 includes the apparatus of example 6, wherein the hardware accelerator is to perform data movement or data transformation on data to be transferred between the processor and a storage device. Example 8 includes the apparatus of example 6, wherein a system on a chip (SOC) apparatus includes the hardware accelerator and a processor. Example 9 includes the apparatus of example 1, wherein the work queue is configured as a Shared Work Queue (SWQ) or a Dedicated Work Queue (DWQ), wherein the SWQ is shared by a plurality of software applications and the DWQ is assigned to a single software application. Example 10 includes the apparatus of example 1, wherein the processor includes one or more processor cores to execute the submitter process.
Example 11 includes a method comprising: storing one or more descriptors in a work queue; dispatching, at the arbiter, a descriptor from the work queue, wherein the descriptor includes an operation field capable of specifying an inter-domain operation to be performed, a submitter PASID field capable of specifying a submitter Process Address Space Identifier (PASID) of a submitter process to be executed on the processor, and a handle to a destination PASID associated with an address space of another process; verifying, at the engine, whether the submitter process is permitted to access the address space of the other process based at least in part on the submitter PASID and the destination PASID handle; and processing, at the engine, the descriptor based at least in part on the inter-domain operation specified by the operation field.
Example 12 includes the method of example 11, further comprising performing the inter-domain operation by performing at least one of: a copy operation, a fill operation, a compare mode operation, and a flush operation. Example 13 includes the method of example 11, further comprising: storing, in a cache, a bitmap associated with one or more submitter processes that are allowed to submit the inter-domain operation.
Example 14 includes a system comprising: a processor for executing one or more processes; an input/output (IO) fabric for transferring data between an accelerator device and a memory unit; the accelerator device including: work queue circuitry for storing one or more descriptors; an arbiter circuit for dispatching a descriptor from the work queue, wherein the descriptor includes an operation field capable of specifying an inter-domain operation to be performed, a submitter PASID field capable of specifying a submitter Process Address Space Identifier (PASID) of a submitter process to be executed on the processor, and a handle to a destination PASID associated with an address space of another process; and an engine circuit for: verifying whether the submitter process is permitted to access the address space of the other process based at least in part on the submitter PASID and the destination PASID handle; and processing the descriptor based at least in part on the inter-domain operation specified by the operation field.
Example 15 includes the system of example 14, wherein the inter-domain operation to be performed is at least one of: a copy operation, a fill operation, a compare mode operation, and a flush operation. Example 16 includes the system of example 14, wherein the accelerator device comprises: a cache for storing bitmaps associated with one or more submitter processes that are allowed to submit inter-domain operations. Example 17 includes the system of example 16, wherein an invalidation command is issued in response to a determination that the bitmap has been modified.
Example 18 includes the system of example 14, wherein the accelerator device is to perform data movement or data transformation on data to be transferred between the processor and a storage device. Example 19 includes the system of example 14, wherein a system on a chip (SOC) device includes the accelerator device and the processor. Example 20 includes the system of example 14, wherein the work queue is configured as a Shared Work Queue (SWQ) or a Dedicated Work Queue (DWQ), wherein the SWQ is shared by a plurality of software applications and the DWQ is assigned to a single software application.
Example 21 includes an apparatus comprising means for performing operations as set forth in any of the preceding examples. Example 22 includes a machine-readable storage device comprising machine-readable instructions that, when executed, are to implement any of the operations set forth in the preceding examples or to implement any of the devices set forth in the preceding examples.
In various embodiments, one or more of the operations discussed with reference to FIG. 1 and the following figures may be performed by one or more components (interchangeably referred to herein as "logic") discussed with reference to any of the figures.
In some embodiments, the operations discussed herein (e.g., with reference to fig. 1 and the following figures) may be implemented as hardware (e.g., logic circuitry), software, firmware, or a combination thereof, which may be provided as a computer program product, e.g., comprising one or more tangible (e.g., non-transitory) machine-readable or computer-readable media having stored thereon instructions (or software programs) for programming a computer to perform the processes discussed herein. A machine-readable medium may include storage devices such as those discussed with reference to the figures.
Further, while various embodiments described herein use the term system-on-a-chip or system-on-chip ("SoC" or "SOC") to describe devices and systems having a processor and associated circuitry (e.g., input/output ("I/O") circuitry, power delivery circuitry, memory circuitry, etc.) monolithically integrated into a single integrated circuit ("Integrated Circuit", "IC") die or chip, the disclosure is not limited in this respect. For example, in various embodiments of the present disclosure, a device or system may have one or more processors (e.g., one or more processor cores) and associated circuitry (e.g., input/output ("I/O") circuitry, power delivery circuitry, etc.) arranged in a disaggregated collection of discrete dies, tiles, and/or chiplets (e.g., one or more discrete processor core dies arranged adjacent to one or more other dies such as memory dies, I/O dies, etc.). In such disaggregated devices and systems, the individual dies, tiles, and/or chiplets may be physically and/or electrically coupled together by package structures including, for example, various package substrates, interposers, active interposers, photonic interposers, interconnect bridges, and the like. The disaggregated collection of discrete dies, tiles, and/or chiplets may also be part of a system-on-package ("SoP").
Additionally, such computer-readable media may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals provided in a carrier wave or other propagation medium via a communication link (e.g., a bus, a modem, or a network connection).
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, and/or characteristic described in connection with the embodiment can be included in at least one implementation. The appearances of the phrase "in one embodiment" in various places in the specification may or may not be all referring to the same embodiment.
Also, in the description and claims, the terms "coupled" and "connected," along with their derivatives, may be used. In some embodiments, "connected" may be used to indicate that two or more elements are in direct physical or electrical contact with each other. "coupled" may mean that two or more elements are in direct physical or electrical contact. However, "coupled" may also mean that two or more elements may not be in direct contact with each other, but may still cooperate or interact with each other.
Thus, although embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that claimed subject matter may not be limited to the specific features or acts described. Rather, the specific features and acts are disclosed as sample forms of implementing the claimed subject matter.

Claims (22)

1. An apparatus for providing data stream acceleration, the apparatus comprising:
work queue circuitry for storing one or more descriptors;
an arbiter circuit for dispatching a descriptor from a work queue, wherein the descriptor comprises an operation field capable of specifying an inter-domain operation to be performed, a submitter PASID field capable of specifying a submitter process address space identifier PASID of a submitter process to be executed on a processor, and a handle to a destination PASID associated with an address space of another process; and
engine circuitry for:
verifying whether the submitter process is permitted to access the address space of the other process based at least in part on the submitter PASID and a destination PASID handle; and
the descriptor is processed based at least in part on the inter-domain operation specified by the operation field.
2. The device of claim 1, wherein the inter-domain operation to be performed is at least one of: copy operation, fill operation, compare mode operation, and flush operation.
3. The apparatus of any one of claims 1 to 2, further comprising: a cache for storing bitmaps associated with one or more submitter processes that are permitted to submit the inter-domain operation.
4. The apparatus of any of claims 1 to 3, wherein a hardware accelerator comprises the work queue, the arbiter, the engine, and the cache.
5. The apparatus of any of claims 1 to 4, wherein an invalidation command is issued in response to a determination that the bitmap has been modified.
6. The apparatus of any of claims 1 to 5, wherein a hardware accelerator comprises the work queue, the arbiter, and the engine.
7. The device of any of claims 1 to 6, wherein the hardware accelerator is to perform data movement or data transformation on data to be transferred between the processor and a storage device.
8. The device of any of claims 1-7, wherein a system-on-chip SOC device includes the hardware accelerator and the processor.
9. The apparatus of any of claims 1 to 8, wherein the work queue is configured as a shared work queue, SWQ, or a dedicated work queue, DWQ, wherein the SWQ is shared by a plurality of software applications and the DWQ is assigned to a single software application.
10. The apparatus of any of claims 1 to 9, wherein the processor comprises one or more processor cores to execute the submitter process.
11. A method for providing data stream acceleration, the method comprising:
storing one or more descriptors in a work queue;
dispatching, at an arbiter, a descriptor from the work queue, wherein the descriptor includes an operation field capable of specifying an inter-domain operation to be performed, a submitter PASID field capable of specifying a submitter process address space identifier PASID of a submitter process to be executed on a processor, and a handle to a destination PASID associated with an address space of another process; and
verifying, at an engine, whether the submitter process is permitted to access the address space of the other process based at least in part on the submitter PASID and a destination PASID handle; and
the descriptor is processed at the engine based at least in part on the inter-domain operation specified by the operation field.
12. The method of claim 11, further comprising performing the inter-domain operation by performing at least one of: copy operation, fill operation, compare mode operation, and flush operation.
13. The method of any of claims 11 to 12, further comprising storing, in a cache, bitmaps associated with one or more submitter processes that are allowed to submit the inter-domain operation.
14. A system for providing data stream acceleration, the system comprising:
a processor for executing one or more processes;
an input/output IO fabric for transferring data between the accelerator device and the memory unit;
the accelerator apparatus includes:
work queue circuitry for storing one or more descriptors;
an arbiter circuit for dispatching a descriptor from a work queue, wherein the descriptor comprises an operation field capable of specifying an inter-domain operation to be performed, a submitter PASID field capable of specifying a submitter process address space identifier PASID of a submitter process to be executed on the processor, and a handle to a destination PASID associated with an address space of another process; and
engine circuitry for:
verifying whether the submitter process is permitted to access the address space of the other process based at least in part on the submitter PASID and a destination PASID handle; and
the descriptor is processed based at least in part on the inter-domain operation specified by the operation field.
15. The system of claim 14, wherein the inter-domain operation to be performed is at least one of: copy operation, fill operation, compare mode operation, and flush operation.
16. The system of any one of claims 14 to 15, wherein the accelerator device comprises: a cache for storing bitmaps associated with one or more submitter processes that are permitted to submit the inter-domain operation.
17. The system of any of claims 14 to 16, wherein an invalidation command is issued in response to a determination that the bitmap has been modified.
18. The system of any of claims 14 to 17, wherein the accelerator device is to perform data movement or data transformation on data to be transferred between the processor and a storage device.
19. The system of any of claims 14 to 18, wherein a system on a chip, SOC, device comprises the accelerator device and the processor.
20. The system of any of claims 14 to 19, wherein the work queue is configured as a shared work queue, SWQ, or a dedicated work queue, DWQ, wherein the SWQ is shared by multiple software applications and the DWQ is assigned to a single software application.
21. A machine readable medium comprising code which, when executed, causes a machine to perform the operations of any of claims 1 to 20.
22. An apparatus comprising means for performing the operations of any one of claims 1 to 20.