TW202405646A - Bfloat16 arithmetic instructions

Info

Publication number: TW202405646A
Application number: TW111127268A
Authority: TW
Inventors: 亞歷山大海內克; 梅納赫姆艾德曼; 羅伯特華倫泰; 季夫史波博; 阿密特格瑞斯坦; 馬克查尼; 伊凡傑洛斯喬治納斯; 迪拉伊卡拉姆卡; 克里斯多福休吉斯; 克里斯汀娜安德森
Original assignee: 美商英特爾公司
Priority date: 2022-07-20
Filing date: 2022-07-20
Publication date: 2024-02-01

Abstract

Techniques for performing arithmetic operations on BF16 values are described. An exemplary instruction includes fields for an opcode, an identification of a location of a first packed data source operand, an identification of a location of a second packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate an arithmetic operation that execution circuitry is to perform, for each data element position of the identified packed data source operands, on the BF16 data elements in that data element position in BF16 format, and to store a result of each arithmetic operation into a corresponding data element position of the identified packed data destination operand.

Description

16-bit brain floating point (BFLOAT16) arithmetic instructions

The present invention relates to 16-bit brain floating point (BF16) arithmetic instructions.

In recent years, fused multiply-add (FMA) units with lower-precision multiplication and higher-precision accumulation have proven useful in machine learning and artificial intelligence applications, most notably in training deep neural networks, because of the extreme computational intensity of that workload. Compared to classic IEEE-754 32-bit (FP32) and 64-bit (FP64) arithmetic, this reduced-precision arithmetic can naturally be accelerated disproportionately to its shortened width.

According to an embodiment of the present invention, an apparatus is provided that comprises: decoding circuitry to decode an instance of a single instruction, the single instruction to include fields for an opcode, an identification of a location of a first packed data source operand, an identification of a location of a second packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate an arithmetic operation that execution circuitry is to perform, for each data element position of the identified packed data source operands, on the BF16 data elements in that data element position in BF16 format, and to store a result of each arithmetic operation into a corresponding data element position of the identified packed data destination operand; and execution circuitry to execute the decoded instruction according to the opcode.

This disclosure relates to methods, apparatus, systems, and non-transitory computer-readable storage media for performing arithmetic operations on BF16 data elements. BF16 is gaining traction because it works well in machine learning algorithms, in particular deep learning training. FP32 format 101 has a sign bit (S), an 8-bit exponent, and a 23-bit fraction (a 24-bit mantissa that uses an implicit bit). FP16 format 103 has a sign bit (S), a 5-bit exponent, and a 10-bit fraction. BF16 format 105 has a sign bit (S), an 8-bit exponent, and a 7-bit fraction.
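
The three layouts can be made concrete with a short Python sketch that splits a raw bit pattern into its sign, exponent, and fraction fields. Only the field widths come from the text above; the `FORMATS` table and the `split_fields` function are hypothetical names chosen for this illustration.

```python
# Field widths (exponent bits, fraction bits) as stated in the text.
# FORMATS and split_fields are hypothetical names used only in this sketch.
FORMATS = {
    "FP32": (8, 23),
    "FP16": (5, 10),
    "BF16": (8, 7),
}

def split_fields(fmt, bits):
    """Return (sign, biased exponent, fraction) for a raw bit pattern."""
    exp_bits, frac_bits = FORMATS[fmt]
    fraction = bits & ((1 << frac_bits) - 1)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    sign = (bits >> (exp_bits + frac_bits)) & 1
    return sign, exponent, fraction
```

For instance, 1.0 is 0x3F800000 in FP32 and 0x3F80 in BF16, and both patterns decode to a biased exponent of 127 with a zero fraction, which reflects the shared 8-bit exponent.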

In contrast to the IEEE 754-standardized 16-bit (FP16) variant, BF16 does not compromise on range when compared to FP32. FP32 numbers have an 8-bit exponent and a 24-bit mantissa (including the one implicit bit). BF16 cuts 16 bits from the 24-bit FP32 mantissa to create a 16-bit floating point datatype. FP16, by comparison, roughly halves the FP32 mantissa to 10 explicit bits and reduces the exponent to 5 bits to fit the 16-bit datatype envelope.
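
The relationship between FP32 and BF16 can be sketched in Python by dropping the low 16 bits of an FP32 bit pattern. This is a minimal illustration that truncates rather than rounds, so it should not be read as the conversion any particular hardware performs; all function names are assumptions made for the sketch.

```python
import struct

def fp32_bits(x):
    """Raw 32-bit pattern of x after conversion to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bf16_from_fp32_truncate(x):
    """Keep sign, 8-bit exponent, and top 7 fraction bits (no rounding)."""
    return fp32_bits(x) >> 16

def bf16_to_fp32(bits):
    """Widen a BF16 pattern back to FP32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]
```

As a usage example, 100000.0 exceeds the largest finite FP16 value (65504) but survives a round trip through BF16 as 99840.0: the magnitude is preserved because the exponent field is unchanged, while the low fraction bits are lost.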

Although BF16 offers less precision than FP16, it is typically better suited to support deep learning tasks. Because of its limited range, FP16 is not sufficient for deep learning training out of the box. BF16 does not suffer from this issue, and its limited precision may actually help to generalize the learned weights in a neural network training task. In other words, the lower precision can be seen as offering a built-in regularization property.

Detailed herein are embodiments of instructions, and their support, that operate natively on BF16 source data elements by performing one of an addition, a subtraction, a multiplication, a division, or a reciprocal calculation. In some embodiments, the instructions are defined such that their execution treats denormal inputs or outputs as zero, supports any rounding mode, and/or reports or suppresses floating-point numerical flags. These instructions operate without requiring upconversion or the like. However, although an accurate arithmetic operation (such as an addition, division, multiplication, or subtraction) may be performed, the final result to be stored may be rounded.

Figure 1 illustrates an exemplary execution of an instruction to perform an arithmetic operation on source BF16 data elements. Although this illustration is in little-endian format, the principles discussed herein also work in big-endian format. The arithmetic operation on source BF16 data elements instruction (shown here with an exemplary opcode mnemonic of V{ARITH}NEPBF16, where {ARITH} denotes the arithmetic operation to be performed, such as ADD, SUB, DIV, or MUL) includes one or more fields to define the opcode for the instruction, one or more fields to reference or indicate the packed data sources (e.g., a register or memory location), and/or one or more fields to reference or indicate a packed data destination (e.g., a register or memory location). In some embodiments, the instruction also includes one or more fields to reference or indicate a write mask or predicate register that is to store write mask or predicate values as described later.

In this example, packed data sources 101 and 103 include 8 packed data elements, each of which is in BF16 format. Packed data sources 101 and 103 may each be a register or a memory location. In some embodiments, when the second source 103 is a memory location, either a number of bytes corresponding to the length of the selected vector register size is loaded from memory at the specified address, or a scalar value is loaded from memory at the specified address and broadcast to all entries of the vector register (e.g., using broadcast circuitry 111). In some embodiments, both of these variants result in a load of a temporary vector register, and the operation then proceeds as if it were operating on 3 vector registers. An embodiment may choose a different microarchitecture for handling fused memory operands.
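
The two memory-operand variants described above (a full-width load and a scalar load with broadcast) can be sketched in Python as follows. Memory is modeled as a `bytes` object with little-endian 16-bit elements; the function names and this memory model are assumptions made for illustration.

```python
import struct

def load_full(memory, addr, vector_bits):
    """Load vector_bits/16 consecutive BF16 elements starting at addr."""
    count = vector_bits // 16
    return list(struct.unpack_from(f"<{count}H", memory, addr))

def load_broadcast(memory, addr, vector_bits):
    """Load one BF16 scalar at addr and replicate it to every element."""
    count = vector_bits // 16
    (scalar,) = struct.unpack_from("<H", memory, addr)
    return [scalar] * count
```

Either function returns a list that plays the role of the temporary vector register, after which the element-wise operation can proceed as if all operands were registers.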

Packed data sources 101 and 103 (or the temporary register variant) are fed into execution circuitry 109 to be operated on. In particular, execution circuitry 109 performs, according to the {ARITH} indication of the opcode, the corresponding per-data-element-position addition (e.g., using addition circuitry 114), multiplication (e.g., using multiplication circuitry 116), subtraction (e.g., using subtraction circuitry 115), or division (e.g., using division circuitry 117).

In some embodiments, this execution uses a round-to-nearest (even) rounding mode. In some embodiments, output denormals are always flushed to zero and input denormals are always treated as zero.
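
A behavioral sketch of this element-wise execution, written in Python, may help make the rounding and denormal handling concrete. It is a software model under stated assumptions, not the disclosed circuitry: each operation is computed in a Python float and then rounded once to BF16 with round-to-nearest-even, denormal inputs and outputs are replaced by same-signed zeros, and exception flags are not modeled. All function names, and the choice of 0x7FC0 as the NaN result, are assumptions.

```python
import math
import struct

def bf16_to_float(bits):
    """Widen a BF16 bit pattern to a Python float (exact)."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def float_to_bf16_rne(x):
    """Round a Python float to BF16 bits using round-to-nearest-even."""
    if math.isnan(x):
        return 0x7FC0  # quiet NaN pattern (a choice made for this sketch)
    try:
        b = struct.unpack("<I", struct.pack("<f", x))[0]
    except OverflowError:  # too large for FP32, hence also for BF16
        return 0xFF80 if x < 0 else 0x7F80
    b += 0x7FFF + ((b >> 16) & 1)  # ties go to the even BF16 neighbor
    return (b >> 16) & 0xFFFF

def zero_if_denormal(bits):
    """Replace a denormal (exponent 0, fraction != 0) by a same-signed zero."""
    if (bits & 0x7F80) == 0 and (bits & 0x007F) != 0:
        return bits & 0x8000
    return bits

def _divide(a, b):
    if b == 0.0:  # Python raises on division by zero, so model it explicitly
        if a == 0.0 or math.isnan(a):
            return math.nan
        return math.copysign(math.inf, a) * math.copysign(1.0, b)
    return a / b

OPS = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "MUL": lambda a, b: a * b,
    "DIV": _divide,
}

def v_arith_nepbf16(arith, src1, src2):
    """Apply the selected operation to each pair of BF16 elements."""
    op = OPS[arith]
    dst = []
    for a_bits, b_bits in zip(src1, src2):
        a = bf16_to_float(zero_if_denormal(a_bits))  # inputs: denormal -> 0
        b = bf16_to_float(zero_if_denormal(b_bits))
        dst.append(zero_if_denormal(float_to_bf16_rne(op(a, b))))
    return dst
```

For example, `v_arith_nepbf16("MUL", [0x0080], [0x3F00])` multiplies the smallest normal BF16 value by 0.5; the product would be denormal, so the sketch stores zero instead.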

Packed data destination 131 is written to store each resulting arithmetic value in the packed data element position that corresponds to the position of the source elements (e.g., in packed data source 101). In some embodiments, when the instruction calls for the use of predication or write masking, a write mask (or predicate) register 131 dictates how the resulting arithmetic values are stored and/or zeroed using write mask circuitry 121.
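
One way to picture the write-mask step is the Python sketch below. It assumes the common convention that a set mask bit stores the new result, while a clear bit either keeps the old destination element (merging) or writes zero (zeroing); the text above only says that results are stored and/or zeroed, so this exact policy, along with the function and parameter names, is an assumption.

```python
def apply_write_mask(results, old_dst, mask, zeroing):
    """Select, per element, between the new result and the masked-off value.

    mask is an integer whose bit i controls element i.
    """
    out = []
    for i, (new_value, old_value) in enumerate(zip(results, old_dst)):
        if (mask >> i) & 1:
            out.append(new_value)           # element enabled: store result
        elif zeroing:
            out.append(0x0000)              # element disabled: write zero
        else:
            out.append(old_value)           # element disabled: keep old value
    return out
```

With `mask = 0b0101`, elements 0 and 2 receive new results, and elements 1 and 3 are either preserved or zeroed depending on `zeroing`.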

Figure 2 illustrates an embodiment of a method performed by a processor to process an arithmetic operation on BF16 data elements instruction. For example, a processor core as shown in Figure 13(B), a pipeline as detailed below, etc. performs this method.

At step 201, an instruction is fetched that has fields for an opcode, an identification of a location of a first packed data source operand, an identification of a location of a second packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate an arithmetic operation that execution circuitry is to perform, for each data element position of the identified packed data source operands, on the BF16 data elements in that data element position, and to store a result of each arithmetic operation into a corresponding data element position of the identified packed data destination operand.

An embodiment of a format for an arithmetic operation on BF16 data elements instruction is V{ARITH}NEPBF16 DST{k}, SRC. In some embodiments, V{ARITH}NEPBF16 is the opcode mnemonic of the instruction. DST is a field for the packed data destination register operand. SRC is one or more fields for the sources, such as packed data registers and/or memory. The source operands and destination operand may come in one or more sizes, such as 128-bit, 256-bit, 512-bit, etc. The {k} is used when write masking or predication is used.
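
To illustrate how the textual format decomposes, the following Python sketch parses an assembly-style line into the {ARITH} selector, the destination, the optional {k} mask, and the sources. The register names in the example (`zmm1`, `k1`, and so on) and the function name are hypothetical; the disclosure specifies only the general shape V{ARITH}NEPBF16 DST{k}, SRC, not a concrete assembler syntax.

```python
import re

# Matches: V<ARITH>NEPBF16 <dst>[{<mask>}], <src>[, <src>...]
_PATTERN = re.compile(
    r"^V(?P<arith>ADD|SUB|MUL|DIV)NEPBF16\s+"
    r"(?P<dst>\w+)(?:\{(?P<mask>\w+)\})?\s*,\s*(?P<srcs>.+)$"
)

def parse_v_arith(text):
    """Split an instruction line into its named parts (illustrative only)."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a V{{ARITH}}NEPBF16 instruction: {text!r}")
    return {
        "arith": match.group("arith"),
        "dst": match.group("dst"),
        "mask": match.group("mask"),  # None when no {k} is given
        "srcs": [s.strip() for s in match.group("srcs").split(",")],
    }
```

For example, `parse_v_arith("VADDNEPBF16 zmm1{k1}, zmm2, zmm3")` yields the selector `ADD`, destination `zmm1`, mask `k1`, and two sources.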

In some embodiments, at step 203, the fetched instruction of a first ISA is translated into one or more instructions of a second, different ISA. The one or more instructions of the second, different ISA, when executed, provide the same result as if the fetched instruction had been executed. Note that the translation may be performed by hardware, software, or a combination thereof.

At step 205, the instruction (or the translated one or more instructions) is decoded. This decoding may cause the generation of one or more micro-operations to be performed.

At step 207, data values associated with the source operands of the decoded instruction are retrieved. For example, when a source operand is stored in memory, the data from the indicated memory location is retrieved.

At step 209, the decoded instruction is executed by execution circuitry (hardware) such as that detailed herein. For each data element position of the identified packed data source operands, the execution circuitry is to perform the arithmetic operation on the BF16 data elements in that data element position and store a result of each arithmetic operation into a corresponding data element position of the identified packed data destination operand.

In some embodiments, at step 211, the instruction is committed or retired.

Figures 3-6 illustrate exemplary embodiments of pseudocode representing the execution and format of V{ARITH}NEPBF16 instructions. Note that EVEX.b maps to the b of prefix 1601(C). The annotations DAZ, FTZ, RNE, and SAE refer to support for flush-to-zero (FTZ), denormals-are-zero (DAZ), suppress all exceptions (SAE), and round-to-nearest-even (RNE) rounding.

Figure 7 illustrates an exemplary execution of an instruction to calculate a reciprocal of BF16 data elements. Although this illustration is in little-endian format, the principles discussed herein also work in big-endian format. The reciprocal of BF16 data elements instruction (shown here with an exemplary opcode mnemonic of VRCPNEPBF16) includes one or more fields to define the opcode for the instruction, one or more fields to reference or indicate a packed data source (e.g., a register or memory location), and/or one or more fields to reference or indicate a packed data destination (e.g., a register or memory location). In some embodiments, the instruction also includes one or more fields to reference or indicate a write mask or predicate register that is to store write mask or predicate values as described later.

An embodiment of a format for an instruction to calculate an approximate reciprocal of BF16 data elements is VRCPNEPBF16 DST{k}, SRC. In some embodiments, VRCPNEPBF16 is the opcode mnemonic of the instruction. DST is a field for the packed data destination register operand. SRC is one or more fields for the source, such as a packed data register and/or memory. The source operand and destination operand may come in one or more sizes, such as 128-bit, 256-bit, 512-bit, etc. The {k} is used when write masking or predication is used.

In this example, packed data source 701 includes 8 packed data elements, each of which is in BF16 format. Packed data source 701 may be a register or a memory location.

Packed data source 701 is fed into execution circuitry 709 to be operated on. In particular, execution circuitry 709 performs a calculation of an approximate reciprocal for each of the packed data elements. In some embodiments, this execution of the instruction uses a round-to-nearest (even) rounding mode. In some embodiments, output denormals are always flushed to zero and input denormals are always treated as zero.
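
The per-element reciprocal can be modeled in Python as shown below. This sketch computes 1/x in a Python float and rounds once to BF16, which stands in for whatever approximation method and error bound the hardware actually uses (neither is stated in this passage). Denormal inputs and outputs are replaced by same-signed zeros, and a zero input is assumed to produce a same-signed infinity. All names are illustrative.

```python
import math
import struct

def bf16_to_float(bits):
    """Widen a BF16 bit pattern to a Python float (exact)."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def float_to_bf16_rne(x):
    """Round a Python float to BF16 bits using round-to-nearest-even."""
    if math.isnan(x):
        return 0x7FC0  # quiet NaN pattern (a choice made for this sketch)
    try:
        b = struct.unpack("<I", struct.pack("<f", x))[0]
    except OverflowError:  # too large for FP32, hence also for BF16
        return 0xFF80 if x < 0 else 0x7F80
    b += 0x7FFF + ((b >> 16) & 1)
    return (b >> 16) & 0xFFFF

def zero_if_denormal(bits):
    """Replace a denormal (exponent 0, fraction != 0) by a same-signed zero."""
    if (bits & 0x7F80) == 0 and (bits & 0x007F) != 0:
        return bits & 0x8000
    return bits

def vrcp_nepbf16(src):
    """Reciprocal of each BF16 element, with denormals treated as zero."""
    dst = []
    for bits in src:
        x = bf16_to_float(zero_if_denormal(bits))
        if x == 0.0:
            r = math.copysign(math.inf, x)  # assumed result for 1/(+-0)
        else:
            r = 1.0 / x                      # NaN propagates, 1/inf gives 0
        dst.append(zero_if_denormal(float_to_bf16_rne(r)))
    return dst
```

As a usage example, `vrcp_nepbf16([0x4000, 0x0001])` returns `[0x3F00, 0x7F80]`: the reciprocal of 2.0 is 0.5, and the denormal input is treated as zero, whose reciprocal is modeled as positive infinity.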

Packed data destination 731 is written to store each resulting BF16-format approximate reciprocal in the packed data element position that corresponds to the position of the source element in packed data source 701. In some embodiments, when the instruction calls for the use of predication or write masking, a write mask (or predicate) register 731 dictates how the resulting BF16 approximate reciprocals are stored and/or zeroed using write mask circuitry 721.

Figure 8 illustrates an embodiment of a method performed by a processor to process an instruction that calculates an approximate reciprocal of BF16 data elements. For example, a processor core as shown in Figure 13(B), a pipeline as detailed below, etc. performs this method.

At step 801, an instruction is fetched that has fields for an opcode, an identification of a location of a packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate that execution circuitry is to perform, for each data element position of the identified packed data source operand, a calculation of an approximate reciprocal of the BF16 data element in that data element position, and to store the approximate reciprocal into a corresponding data element position of the identified packed data destination operand.

In some embodiments, at step 803, the fetched instruction of a first ISA is translated into one or more instructions of a second, different ISA. The one or more instructions of the second, different ISA, when executed, provide the same result as if the fetched instruction had been executed. Note that the translation may be performed by hardware, software, or a combination thereof.

At step 805, the instruction (or the translated one or more instructions) is decoded. This decoding may cause the generation of one or more micro-operations to be performed.

At step 807, data values associated with the source operand of the decoded instruction are retrieved. For example, when a source operand is stored in memory, the data from the indicated memory location is retrieved.

At step 809, the decoded instruction is executed by execution circuitry (hardware) such as that detailed herein. For each data element position of the identified packed data source operand, the execution circuitry is to perform a calculation of an approximate reciprocal of the BF16 data element in that data element position and store the approximate reciprocal into a corresponding data element position of the identified packed data destination operand.

In some embodiments, at step 811, the instruction is committed or retired.

Figure 9 illustrates an exemplary embodiment of pseudocode representing the execution and format of a VRCPNEPBF16 instruction. Note that EVEX.b maps to the b of prefix 1601(C). The annotations DAZ, FTZ, RNE, and SAE refer to support for flush-to-zero (FTZ), denormals-are-zero (DAZ), suppress all exceptions (SAE), and round-to-nearest-even (RNE) rounding.

Figure 10 illustrates embodiments of hardware to process an instruction such as the V{ARITH}NEPBF16 and/or VRCPNEPBF16 instructions. As illustrated, storage 1003 stores a V{ARITH}NEPBF16 and/or VRCPNEPBF16 instruction 1001 to be executed.

Instruction 1001 is received by decoding circuitry 1005. For example, decoding circuitry 1005 receives this instruction from fetch logic/circuitry. The instruction includes fields for an opcode, first and second sources, and a destination. In some embodiments, the sources and destination are registers, and in other embodiments one or more are memory locations. In some embodiments, the opcode details which arithmetic operation is to be performed.

More detailed embodiments of at least one instruction format are described later. Decoding circuitry 1005 decodes the instruction into one or more operations. In some embodiments, this decoding includes generating a plurality of micro-operations to be performed by execution circuitry (such as execution circuitry 1009). Decoding circuitry 1005 also decodes instruction prefixes.

In some embodiments, register renaming, register allocation, and/or scheduling circuitry 1007 provides functionality for one or more of: 1) renaming logical operand values to physical operand values (e.g., a register alias table in some embodiments), 2) allocating status bits and flags to the decoded instruction, and 3) scheduling the decoded instruction for execution on execution circuitry out of an instruction pool (e.g., using a reservation station in some embodiments).

Registers (a register file) and/or memory 1008 store data as operands of the instruction to be operated on by execution circuitry 1009. Exemplary register types include packed data registers, general-purpose registers, and floating-point registers.

Execution circuitry 1009 executes the decoded instruction. Exemplary detailed execution circuitry is shown in Figures 1, 13, etc.

In some embodiments, retirement/write-back circuitry 1011 architecturally commits the result 1008 and retires the instruction.

Exemplary Computer Architecture

Detailed below are descriptions of exemplary computer architectures, instruction formats, etc. that support the disclosed instructions. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.

Figure 11 illustrates embodiments of an exemplary system. Multiprocessor system 1100 is a point-to-point interconnect system and includes a plurality of processors, including a first processor 1170 and a second processor 1180, coupled via a point-to-point interconnect 1150. In some embodiments, first processor 1170 and second processor 1180 are homogeneous. In some embodiments, first processor 1170 and second processor 1180 are heterogeneous.

Processors 1170 and 1180 are shown including integrated memory controller (IMC) unit circuitry 1172 and 1182, respectively. Processor 1170 also includes point-to-point (P-P) interfaces 1176 and 1178 as part of its interconnect controller units; similarly, second processor 1180 includes P-P interfaces 1186 and 1188. Processors 1170, 1180 may exchange information via point-to-point (P-P) interconnect 1150 using P-P interface circuits 1178, 1188. IMCs 1172 and 1182 couple processors 1170, 1180 to respective memories, namely memory 1132 and memory 1134, which may be portions of main memory locally attached to the respective processors.

Processors 1170, 1180 may each exchange information with a chipset 1190 via individual P-P interconnects 1152, 1154 using point-to-point interface circuits 1176, 1194, 1186, 1198. Chipset 1190 may optionally exchange information with a co-processor 1138 via a high-performance interface 1192. In some embodiments, co-processor 1138 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like.

A shared cache (not shown) may be included in either processor 1170, 1180 or outside of both processors, yet connected with the processors via a P-P interconnect, such that local cache information of either or both processors may be stored in the shared cache when a processor is placed into a low-power mode.

Chipset 1190 may be coupled to a first interconnect 1116 via an interface 1196. In some embodiments, first interconnect 1116 may be a Peripheral Component Interconnect (PCI) interconnect, or an interconnect such as a PCI Express interconnect or another I/O interconnect. In some embodiments, one of the interconnects couples to a power control unit (PCU) 1117, which may include circuitry, software, and/or firmware to perform power management operations with regard to processors 1170, 1180 and/or co-processor 1138. PCU 1117 provides control information to a voltage regulator to cause the voltage regulator to generate the appropriate regulated voltage. PCU 1117 also provides control information to control the operating voltage generated. In various embodiments, PCU 1117 may include a variety of power management logic units (circuitry) to perform hardware-based power management. Such power management may be wholly processor controlled (e.g., by various processor hardware, and triggered by workload and/or power, thermal, or other processor constraints) and/or may be performed responsive to external sources (such as a platform or power management source or system software).

PCU 1117 is illustrated as being present as logic separate from processor 1170 and/or processor 1180. In other cases, PCU 1117 may execute on a given one or more of the cores (not shown) of processor 1170 or 1180. In some cases, PCU 1117 may be implemented as a microcontroller (dedicated or general-purpose) or other control logic configured to execute its own dedicated power management code, sometimes referred to as P-code. In yet other embodiments, the power management operations to be performed by PCU 1117 may be implemented externally to a processor, such as by way of a separate power management integrated circuit (PMIC) or another component external to the processor. In still other embodiments, the power management operations to be performed by PCU 1117 may be implemented within BIOS or other system software.

Various I/O devices 1114 may be coupled to first interconnect 1116, along with an interconnect (bus) bridge 1118 that couples first interconnect 1116 to a second interconnect 1120. In some embodiments, one or more additional processors 1115, such as co-processors, high-throughput MIC processors, GPGPUs, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays (FPGAs), or any other processor, are coupled to first interconnect 1116. In some embodiments, second interconnect 1120 may be a low pin count (LPC) interconnect. Various devices may be coupled to second interconnect 1120 including, for example, a keyboard and/or mouse 1122, communication devices 1127, and storage unit circuitry 1128. In some embodiments, storage unit circuitry 1128 is a disk drive or other mass storage device that may include instructions/code and data 1130. Further, an audio I/O 1124 may be coupled to second interconnect 1120. Note that architectures other than the point-to-point architecture described above are possible. For example, instead of the point-to-point architecture, a system such as multiprocessor system 1100 may implement a multi-drop interconnect or other such architecture.

Exemplary Core Architectures, Processors, and Computer Architectures

Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general-purpose in-order core intended for general-purpose computing; 2) a high-performance general-purpose out-of-order core intended for general-purpose computing; 3) a special-purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general-purpose in-order cores intended for general-purpose computing and/or one or more general-purpose out-of-order cores intended for general-purpose computing; and 2) a co-processor including one or more special-purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the co-processor on a separate chip from the CPU; 2) the co-processor on a separate die in the same package as the CPU; 3) the co-processor on the same die as the CPU (in which case such a co-processor is sometimes referred to as special-purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special-purpose cores); and 4) a system on a chip that may include, on the same die as the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above-described co-processor and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.

Figure 12 illustrates a block diagram of an embodiment of a processor 1200 that may have more than one core, may have an integrated memory controller, and may have integrated graphics. The solid-lined boxes illustrate a processor 1200 with a single core 1202(A), a system agent 1210, and a set of one or more interconnect controller unit circuitry 1216, while the optional addition of the dashed-lined boxes illustrates an alternative processor 1200 with multiple cores 1202(A)-(N), a set of one or more integrated memory controller unit circuitry 1214 in the system agent unit circuitry 1210, special purpose logic 1208, and a set of one or more interconnect controller unit circuitry 1216. Note that the processor 1200 may be one of the processors 1170 or 1180, or co-processors 1138 or 1115, of Figure 11.

Thus, different implementations of the processor 1200 may include: 1) a CPU with the special purpose logic 1208 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores, not shown), and the cores 1202(A)-(N) being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, or a combination of the two); 2) a co-processor with the cores 1202(A)-(N) being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a co-processor with the cores 1202(A)-(N) being a large number of general purpose in-order cores. Thus, the processor 1200 may be a general-purpose processor, a co-processor, or a special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit circuitry), a high-throughput many integrated core (MIC) co-processor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1200 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.

A memory hierarchy includes one or more levels of cache unit circuitry 1204(A)-(N) within the cores 1202(A)-(N), a set of one or more shared cache unit circuitry 1206, and external memory (not shown) coupled to the set of integrated memory controller unit circuitry 1214. The set of one or more shared cache unit circuitry 1206 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, such as a last level cache (LLC), and/or combinations thereof. While in some embodiments ring-based interconnect network circuitry 1212 interconnects the special purpose logic 1208 (e.g., integrated graphics logic), the set of shared cache unit circuitry 1206, and the system agent unit circuitry 1210, alternative embodiments use any number of well-known techniques for interconnecting such units. In some embodiments, coherency is maintained between one or more of the shared cache unit circuitry 1206 and the cores 1202(A)-(N).

In some embodiments, one or more of the cores 1202(A)-(N) are capable of multi-threading. The system agent unit circuitry 1210 includes those components coordinating and operating the cores 1202(A)-(N). The system agent unit circuitry 1210 may include, for example, power control unit (PCU) circuitry and/or display unit circuitry (not shown). The PCU may be, or may include, logic and components needed for regulating the power state of the cores 1202(A)-(N) and/or the special purpose logic 1208 (e.g., integrated graphics logic). The display unit circuitry is for driving one or more externally connected displays.

The cores 1202(A)-(N) may be homogeneous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 1202(A)-(N) may be capable of executing the same instruction set, while other cores may be capable of executing only a subset of that instruction set or a different instruction set.

Exemplary Core Architectures

In-Order and Out-of-Order Core Block Diagram

Figure 13(A) is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention. Figure 13(B) is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention. The solid-lined boxes in Figures 13(A)-(B) illustrate the in-order pipeline and in-order core, while the optional addition of the dashed-lined boxes illustrates the register renaming, out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.

In Figure 13(A), a processor pipeline 1300 includes a fetch stage 1302, an optional length decode stage 1304, a decode stage 1306, an optional allocation stage 1308, an optional renaming stage 1310, a scheduling (also known as a dispatch or issue) stage 1312, an optional register read/memory read stage 1314, an execute stage 1316, a write back/memory write stage 1318, an optional exception handling stage 1322, and an optional commit stage 1324. One or more operations can be performed in each of these processor pipeline stages. For example, during the fetch stage 1302, one or more instructions are fetched from instruction memory; during the decode stage 1306, the one or more fetched instructions may be decoded, addresses (e.g., load store unit (LSU) addresses) using forwarded register ports may be generated, and branch forwarding (e.g., immediate offset or a link register (LR)) may be performed. In one embodiment, the decode stage 1306 and the register read/memory read stage 1314 may be combined into one pipeline stage. In one embodiment, during the execute stage 1316, the decoded instructions may be executed, LSU address/data pipelining to an Advanced Microcontroller Bus (AHB) interface may be performed, multiply and add operations may be performed, arithmetic operations with branch results may be performed, etc.

By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 1300 as follows: 1) the instruction fetch 1338 performs the fetch and length decoding stages 1302 and 1304; 2) the decode unit circuitry 1340 performs the decode stage 1306; 3) the rename/allocator unit circuitry 1352 performs the allocation stage 1308 and renaming stage 1310; 4) the scheduler unit(s) circuitry 1356 performs the schedule stage 1312; 5) the physical register file(s) unit(s) circuitry 1358 and the memory unit circuitry 1370 perform the register read/memory read stage 1314; the execution cluster 1360 performs the execute stage 1316; 6) the memory unit circuitry 1370 and the physical register file(s) unit(s) circuitry 1358 perform the write back/memory write stage 1318; 7) various units (unit circuitry) may be involved in the exception handling stage 1322; and 8) the retirement unit circuitry 1354 and the physical register file(s) unit(s) circuitry 1358 perform the commit stage 1324.
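The stage-to-unit assignment above can be summarized as a small lookup table. The following Python sketch is illustrative only: the dictionary name and the helper function are hypothetical, and only the stage names, unit names, and reference numerals come from the description of Figures 13(A) and 13(B).

```python
# Illustrative table: pipeline stage of Figure 13(A) -> unit(s) of Figure 13(B)
# that perform it. The data structure itself is an assumption made for this sketch.
# The exception handling stage 1322 is omitted because the text only says that
# "various units" may be involved in it.
PIPELINE_STAGE_UNITS = {
    "fetch 1302": ["instruction fetch 1338"],
    "length decode 1304": ["instruction fetch 1338"],
    "decode 1306": ["decode unit 1340"],
    "allocation 1308": ["rename/allocator unit 1352"],
    "renaming 1310": ["rename/allocator unit 1352"],
    "schedule 1312": ["scheduler unit(s) 1356"],
    "register read/memory read 1314": [
        "physical register file(s) unit(s) 1358",
        "memory unit 1370",
    ],
    "execute 1316": ["execution cluster 1360"],
    "write back/memory write 1318": [
        "memory unit 1370",
        "physical register file(s) unit(s) 1358",
    ],
    "commit 1324": [
        "retirement unit 1354",
        "physical register file(s) unit(s) 1358",
    ],
}


def stages_using(unit_numeral):
    """Return, in pipeline order, the stages in which the unit with the
    given reference numeral (a string such as "1358") participates."""
    return [
        stage
        for stage, units in PIPELINE_STAGE_UNITS.items()
        if any(unit.endswith(unit_numeral) for unit in units)
    ]
```

For instance, `stages_using("1358")` lists the three stages that touch the physical register file(s): register read/memory read, write back/memory write, and commit.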

Figure 13(B) shows processor core 1390 including front-end unit circuitry 1330 coupled to execution engine unit circuitry 1350, with both coupled to memory unit circuitry 1370. The core 1390 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 1390 may be a special-purpose core, such as, for example, a network or communication core, compression engine, co-processor core, general purpose computing graphics processing unit (GPGPU) core, graphics core, or the like.

The front-end unit circuitry 1330 may include branch prediction unit circuitry 1332 coupled to instruction cache unit circuitry 1334, which is coupled to an instruction translation lookaside buffer (TLB) 1336, which is coupled to instruction fetch unit circuitry 1338, which is coupled to decode unit circuitry 1340. In one embodiment, the instruction cache unit circuitry 1334 is included in the memory unit circuitry 1370 rather than in the front-end unit circuitry 1330. The decode unit circuitry 1340 (or decoder) may decode instructions and generate as an output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit circuitry 1340 may further include address generation unit (AGU) circuitry (not shown). In one embodiment, the AGU generates an LSU address using forwarded register ports, and may further perform branch forwarding (e.g., immediate offset branch forwarding, LR register branch forwarding, etc.). The decode unit circuitry 1340 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 1390 includes a microcode ROM (not shown) or other medium that stores microcode for certain microinstructions (e.g., in the decode unit circuitry 1340 or otherwise within the front-end unit circuitry 1330). In one embodiment, the decode unit circuitry 1340 includes a micro-operation (micro-op) or operation cache (not shown) to hold/cache decoded operations, micro-tags, or micro-operations generated during the decode or other stages of the processor pipeline 1300. The decode unit circuitry 1340 may be coupled to rename/allocator unit circuitry 1352 in the execution engine unit circuitry 1350.

The execution engine circuitry 1350 includes the rename/allocator unit circuitry 1352 coupled to retirement unit circuitry 1354 and a set of one or more scheduler(s) circuitry 1356. The scheduler(s) circuitry 1356 represents any number of different schedulers, including reservation stations, a central instruction window, etc. In some embodiments, the scheduler(s) circuitry 1356 can include arithmetic logic unit (ALU) scheduler/scheduling circuitry, ALU queues, address generation unit (AGU) scheduler/scheduling circuitry, AGU queues, etc. The scheduler(s) circuitry 1356 is coupled to the physical register file(s) circuitry 1358. Each of the physical register file(s) circuitry 1358 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit circuitry 1358 includes vector registers unit circuitry, write mask registers unit circuitry, and scalar register unit circuitry. These register units may provide architectural vector registers, vector mask registers, general-purpose registers, etc. The physical register file(s) unit(s) circuitry 1358 is overlapped by the retirement unit circuitry 1354 (also known as a retire queue or a retirement queue) to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer (ROB) and a retirement register file; using a future file, a history buffer, and a retirement register file; using register maps and a pool of registers; etc.). The retirement unit circuitry 1354 and the physical register file(s) circuitry 1358 are coupled to the execution cluster(s) 1360. The execution cluster(s) 1360 includes a set of one or more execution unit circuitry 1362 and a set of one or more memory access circuitry 1364. The execution unit circuitry 1362 may perform various arithmetic, logic, floating-point, or other types of operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units or execution unit circuitry dedicated to specific functions or sets of functions, other embodiments may include only one execution unit circuitry or multiple execution units/execution unit circuitry that all perform all functions. The scheduler(s) circuitry 1356, physical register file(s) unit(s) circuitry 1358, and execution cluster(s) 1360 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating-point/packed integer/packed floating-point/vector integer/vector floating-point pipeline, and/or a memory access pipeline that each have their own scheduler circuitry, physical register file(s) unit circuitry, and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) circuitry 1364). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

In some embodiments, the execution engine unit circuitry 1350 may perform load store unit (LSU) address/data pipelining to an Advanced Microcontroller Bus (AHB) interface (not shown), as well as address phase and writeback, data phase load, store, and branches.

The set of memory access circuitry 1364 is coupled to the memory unit circuitry 1370, which includes data TLB unit circuitry 1372 coupled to data cache circuitry 1374, which is in turn coupled to level 2 (L2) cache circuitry 1376. In one exemplary embodiment, the memory access unit circuitry 1364 may include load unit circuitry, store address unit circuitry, and store data unit circuitry, each of which is coupled to the data TLB circuitry 1372 in the memory unit circuitry 1370. The instruction cache circuitry 1334 is further coupled to the level 2 (L2) cache unit circuitry 1376 in the memory unit circuitry 1370. In one embodiment, the instruction cache 1334 and the data cache 1374 are combined into a single instruction and data cache (not shown) in the L2 cache unit circuitry 1376, level 3 (L3) cache unit circuitry (not shown), and/or main memory. The L2 cache unit circuitry 1376 is coupled to one or more other levels of cache and eventually to a main memory.
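As a rough illustration of a hierarchy in which the data cache backs onto the L2 cache and eventually onto main memory, the Python sketch below searches the levels in order. It is a simplified software model under stated assumptions: the function name and the dictionary-based caches are hypothetical, and address translation by the data TLB, cache fills, and coherency are not modeled.

```python
def lookup(address, cache_levels, main_memory):
    """Search the cache levels nearest-first and fall back to main memory.

    cache_levels is a list of (name, contents) pairs, where contents maps an
    address to a value. Returns (name of the level that hit, value).
    """
    for name, contents in cache_levels:
        if address in contents:
            return name, contents[address]
    # Missed in every cache level: the request reaches main memory.
    return "main memory", main_memory[address]


# Hypothetical contents, used only to exercise the sketch.
levels = [
    ("data cache 1374", {0x10: 1}),
    ("L2 cache 1376", {0x10: 1, 0x20: 2}),
]
memory = {0x10: 1, 0x20: 2, 0x30: 3}
```

With these contents, address `0x10` hits in the data cache, `0x20` hits in the L2 cache, and `0x30` is served from main memory.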

The core 1390 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set; the ARM instruction set (with optional additional extensions such as NEON)), including the instruction(s) described herein. In one embodiment, the core 1390 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.

Exemplary Execution Unit Circuitry

Figure 14 illustrates embodiments of execution unit circuitry, such as the execution unit circuitry 1362 of Figure 13(B). As illustrated, the execution unit circuitry 1362 may include one or more ALU circuits 1401, vector/SIMD unit circuits 1403, load/store unit circuits 1405, and/or branch/jump unit circuits 1407. The ALU circuits 1401 perform integer arithmetic and/or Boolean operations. The vector/SIMD unit circuits 1403 perform vector/SIMD operations on packed data (such as SIMD/vector registers). The load/store unit circuits 1405 execute load and store instructions to load data from memory into registers or store data from registers to memory. The load/store unit circuits 1405 may also generate addresses. The branch/jump unit circuits 1407 cause a branch or jump to a memory address depending on the instruction. The floating-point unit (FPU) circuits 1409 perform floating-point arithmetic. The width of the execution unit circuitry 1362 varies depending upon the embodiment and can range from 16 bits to 1,024 bits. In some embodiments, two or more smaller execution units are logically combined to form a larger execution unit (e.g., two 128-bit execution units are logically combined to form a 256-bit execution unit).

Exemplary Register Architecture

Figure 15 is a block diagram of a register architecture 1500 according to some embodiments. As illustrated, there are vector/SIMD registers 1510 that vary from 128 bits to 1,024 bits in width. In some embodiments, the vector/SIMD registers 1510 are physically 512 bits wide and, depending upon the mapping, only some of the lower bits are used. For example, in some embodiments, the vector/SIMD registers 1510 are ZMM registers, which are 512 bits wide: the lower 256 bits are used for YMM registers and the lower 128 bits are used for XMM registers. As such, there is an overlay of registers. In some embodiments, a vector length field selects between a maximum length and one or more other shorter lengths, where each such shorter length is half the length of the preceding length. Scalar operations are operations performed on the lowest-order data element position in a ZMM/YMM/XMM register; the higher-order data element positions are either left the same as they were prior to the instruction or zeroed, depending on the embodiment.
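The register overlay can be pictured with a small software model. The Python sketch below is an illustration only: the class and method names are hypothetical, and a 512-bit register is represented as a Python integer whose low 256 and low 128 bits serve as the YMM and XMM views. The `zero_upper` flag models the two behaviors the text allows for the higher-order positions (left unchanged or zeroed, depending on the embodiment).

```python
ZMM_BITS = 512
ZMM_MASK = (1 << ZMM_BITS) - 1


class VectorRegister:
    """Hypothetical model of one 512-bit register with overlaid narrower views."""

    def __init__(self):
        self.bits = 0  # full 512-bit contents held as an integer

    def read(self, width):
        """Read the low `width` bits (512 = ZMM view, 256 = YMM, 128 = XMM)."""
        return self.bits & ((1 << width) - 1)

    def write_zmm(self, value):
        """Write all 512 bits."""
        self.bits = value & ZMM_MASK

    def write_low(self, width, value, zero_upper):
        """Write the low `width` bits; the upper bits are either zeroed or
        preserved, since the text leaves that choice to the embodiment."""
        low_mask = (1 << width) - 1
        upper = 0 if zero_upper else (self.bits & ZMM_MASK & ~low_mask)
        self.bits = upper | (value & low_mask)
```

Because the views share storage, a write through the 128-bit view is visible through the 256-bit and 512-bit views of the same register.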

In some embodiments, the register architecture 1500 includes write mask/predicate registers 1515. For example, in some embodiments, there are 8 write mask/predicate registers (sometimes called k0 through k7) that are each 16 bits, 32 bits, 64 bits, or 128 bits in size. The write mask/predicate registers 1515 may allow for merging (e.g., allowing any set of elements in the destination to be protected from updates during the execution of any operation) and/or zeroing (e.g., zeroing vector masks allow any set of elements in the destination to be zeroed during the execution of any operation). In some embodiments, each data element position in a given write mask/predicate register 1515 corresponds to a data element position of the destination. In other embodiments, the write mask/predicate registers 1515 are scalable and consist of a set number of enable bits for a given vector element (e.g., 8 enable bits per 64-bit vector element).
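The difference between merging and zeroing can be shown with a short Python sketch. It assumes the embodiment in which each mask bit corresponds to one destination data element position; the function name and the list-based vectors are hypothetical and chosen only for illustration.

```python
def apply_write_mask(dest, result, mask, zeroing):
    """Combine a computed result with the old destination under a write mask.

    dest    -- old destination elements
    result  -- newly computed elements (same length as dest)
    mask    -- integer whose bit i controls element position i
    zeroing -- True for zeroing-masking, False for merging-masking
    """
    out = []
    for i, (old, new) in enumerate(zip(dest, result)):
        if (mask >> i) & 1:
            out.append(new)   # mask bit set: the element is updated
        elif zeroing:
            out.append(0)     # zeroing: a masked-off element is cleared
        else:
            out.append(old)   # merging: a masked-off element keeps its old value
    return out
```

For example, with `dest = [1, 2, 3, 4]`, `result = [10, 20, 30, 40]`, and `mask = 0b0101`, merging gives `[10, 2, 30, 4]` and zeroing gives `[10, 0, 30, 0]`.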

The register architecture 1500 includes a plurality of general-purpose registers 1525. These registers may be 16 bits, 32 bits, 64 bits, etc., and can be used for scalar operations. In some embodiments, these registers are referenced by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.

In some embodiments, the register architecture 1500 includes a scalar floating-point register 1545, which is used for scalar floating-point operations on 32/64/80-bit floating-point data using the x87 instruction set extension, or as MMX registers to perform operations on 64-bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.

One or more flag registers 1540 (e.g., EFLAGS, RFLAGS, etc.) store status and control information for arithmetic, compare, and system operations. For example, the one or more flag registers 1540 may store condition code information such as carry, parity, auxiliary carry, zero, sign, and overflow. In some embodiments, the one or more flag registers 1540 are called program status and control registers.
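To make the condition codes concrete, the following Python sketch computes the six flags named above for an 8-bit addition. The flag definitions follow the conventional x86 meanings, which are an assumption here since the text does not define them; the function name and the dictionary result are hypothetical.

```python
def add8_flags(a, b):
    """Add two 8-bit values and return the result with x86-style status flags."""
    a &= 0xFF
    b &= 0xFF
    full = a + b
    result = full & 0xFF
    return {
        "result": result,
        # Carry: unsigned overflow out of bit 7.
        "carry": full > 0xFF,
        # Parity: set when the result byte has an even number of 1 bits.
        "parity": bin(result).count("1") % 2 == 0,
        # Auxiliary carry: carry out of bit 3 (low nibble).
        "auxiliary_carry": ((a & 0xF) + (b & 0xF)) > 0xF,
        "zero": result == 0,
        # Sign: copy of the most significant bit of the result.
        "sign": bool(result & 0x80),
        # Overflow: operands share a sign that differs from the result's sign.
        "overflow": bool(~(a ^ b) & (a ^ result) & 0x80),
    }
```

Adding `0x7F` and `0x01` sets sign, auxiliary carry, and overflow but not carry, while adding `0xFF` and `0x01` sets carry, zero, parity, and auxiliary carry but not overflow.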

Segment registers 1520 contain segment pointers for use in accessing memory. In some embodiments, these registers are referenced by the names CS, DS, SS, ES, FS, and GS.

Machine-specific registers (MSRs) 1535 control and report on processor performance. Most MSRs 1535 handle system-related functions and are not accessible to an application program. Machine check registers 1560 consist of control, status, and error-reporting MSRs that are used to detect and report on hardware errors.

One or more instruction pointer registers 1530 store an instruction pointer value. Control registers 1555 (e.g., CR0-CR4) determine the operating mode of a processor (e.g., processor 1170, 1180, 1138, 1115, and/or 1200) and the characteristics of a currently executing task. Debug registers 1550 control and allow for the monitoring of a processor or core's debugging operations.

Memory management registers 1565 specify the locations of data structures used in protected mode memory management. These registers may include a GDTR, an IDTR, a task register, and an LDTR register.

Alternative embodiments of the invention may use wider or narrower registers. Additionally, alternative embodiments of the invention may use more, fewer, or different register files and registers.

Instruction Set

An instruction set architecture (ISA) may include one or more instruction formats. A given instruction format may define various fields (e.g., number of bits, location of bits) to specify, among other things, the operation to be performed (e.g., opcode) and the operand(s) on which that operation is to be performed, and/or other data field(s) (e.g., mask). Some instruction formats are further broken down through the definition of instruction templates (or sub-formats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields (the included fields are typically in the same order, but at least some have different bit positions because fewer fields are included) and/or defined to have a given field interpreted differently. Thus, each instruction of an ISA is expressed using a given instruction format (and, if defined, in a given one of the instruction templates of that instruction format) and includes fields for specifying the operation and the operands. For example, an exemplary ADD instruction has a specific opcode and an instruction format that includes an opcode field to specify that opcode and operand fields to select operands (source1/destination and source2); and an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands.

Exemplary Instruction Formats

Embodiments of the instruction(s) described herein may be embodied in different formats. Additionally, exemplary systems, architectures, and pipelines are detailed below. Embodiments of the instruction(s) may be executed on such systems, architectures, and pipelines, but are not limited to those detailed.

Figure 16 illustrates embodiments of an instruction format. As illustrated, an instruction may include multiple components including, but not limited to, one or more fields for: one or more prefixes 1601, an opcode 1603, addressing information 1605 (e.g., register identifiers, memory addressing information, etc.), a displacement value 1607, and/or an immediate 1609. Note that some instructions utilize some or all of the fields of the format, whereas others may only use the field for the opcode 1603. In some embodiments, the order illustrated is the order in which these fields are to be encoded; however, it should be appreciated that in other embodiments these fields may be encoded in a different order, combined, etc.
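The field layout can be sketched as a simple record whose optional parts default to empty. The Python dataclass below is a hypothetical illustration, not the encoding logic of any real assembler: it assumes the embodiment in which fields are encoded in the illustrated order (prefixes, opcode, addressing information, displacement, immediate).

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    """Hypothetical container for the fields of Figure 16."""

    opcode: bytes               # opcode 1603 (the only field every instruction uses)
    prefixes: bytes = b""       # prefixes 1601
    addressing: bytes = b""     # addressing information 1605 (e.g., ModR/M, SIB)
    displacement: bytes = b""   # displacement value 1607
    immediate: bytes = b""      # immediate 1609

    def encode(self):
        """Concatenate the fields in the illustrated order."""
        return (
            self.prefixes
            + self.opcode
            + self.addressing
            + self.displacement
            + self.immediate
        )


# An opcode-only instruction, and one with an addressing byte. The byte
# values are the commonly documented x86 encodings of NOP (0x90) and
# ADD EAX, EBX (0x01 0xD8), used here only as sample data.
nop = Instruction(opcode=b"\x90")
add = Instruction(opcode=b"\x01", addressing=b"\xd8")
```

Keeping every field except the opcode optional mirrors the observation that some instructions use only the opcode field.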

The prefix(es) field(s) 1601, when used, modifies an instruction. In some embodiments, one or more prefixes are used to repeat string instructions (e.g., 0xF0, 0xF2, 0xF3, etc.), to provide segment overrides (e.g., 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65, 0x2E, 0x3E, etc.), to perform bus lock operations, and/or to change operand size (e.g., 0x66) and address size (e.g., 0x67). Certain instructions require a mandatory prefix (e.g., 0x66, 0xF2, 0xF3, etc.). Certain of these prefixes may be considered "legacy" prefixes. Other prefixes, one or more examples of which are detailed herein, indicate and/or provide further capability, such as specifying particular registers, etc. The other prefixes typically follow the "legacy" prefixes.
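A decoder's first step can be sketched as peeling the leading "legacy" prefix bytes off an instruction. The Python sketch below uses only the byte values listed above; the set names and the function are hypothetical. Two simplifications are assumed: 0xF0 is grouped with the repeat prefixes as in the text (x86 documentation describes 0xF0 as the LOCK prefix), and mandatory prefixes that act as part of an opcode are not treated specially.

```python
# Byte values taken from the paragraph above; the grouping names are illustrative.
LOCK_AND_REPEAT = {0xF0, 0xF2, 0xF3}
SEGMENT_OVERRIDE = {0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65}
OPERAND_SIZE = {0x66}
ADDRESS_SIZE = {0x67}
LEGACY_PREFIXES = LOCK_AND_REPEAT | SEGMENT_OVERRIDE | OPERAND_SIZE | ADDRESS_SIZE


def split_legacy_prefixes(code):
    """Split an instruction's bytes into (legacy prefixes, remaining bytes)."""
    i = 0
    while i < len(code) and code[i] in LEGACY_PREFIXES:
        i += 1
    return code[:i], code[i:]
```

For example, `split_legacy_prefixes(bytes([0x66, 0x01, 0xD8]))` separates the operand-size prefix `0x66` from the opcode and addressing bytes that follow it.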

The opcode field 1603 at least partially defines the operation to be performed upon decoding of the instruction. In some embodiments, a primary opcode encoded in the opcode field 1603 is 1, 2, or 3 bytes in length. In other embodiments, a primary opcode may be a different length. An additional 3-bit opcode field is sometimes encoded in another field.

The addressing field 1605 is used to address one or more operands of the instruction, such as a location in memory or one or more registers. Figure 17 illustrates embodiments of the addressing field 1605. In this illustration, an optional ModR/M byte 1702 and an optional Scale, Index, Base (SIB) byte 1704 are shown. The ModR/M byte 1702 and the SIB byte 1704 are used to encode up to two operands of an instruction, each of which is a direct register or an effective memory address. Note that each of these fields is optional, in that not all instructions include one or more of them. The ModR/M byte 1702 includes a MOD field 1742, a register field 1744, and an R/M field 1746.

The content of the MOD field 1742 distinguishes between memory access and non-memory access modes. In some embodiments, when the MOD field 1742 has a value of 11b, a register-direct addressing mode is used; otherwise, register-indirect addressing is used.

The register field 1744 may encode either the destination register operand or a source register operand, or may encode an opcode extension and not be used to encode any instruction operand. The content of the register field 1744 specifies, directly or through address generation, the location of a source or destination operand (either in a register or in memory). In some embodiments, the register field 1744 is supplemented with an additional bit from a prefix (e.g., prefix 1601) to allow for greater addressing.

The R/M field 1746 may be used to encode an instruction operand that references a memory address, or to encode either the destination register operand or a source register operand. Note that in some embodiments the R/M field 1746 may be combined with the MOD field 1742 to dictate an addressing mode.
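The split of the ModR/M byte into its three fields can be sketched in Python. The bit positions used here (MOD in bits 7:6, register field in bits 5:3, R/M in bits 2:0) follow the conventional x86 layout and are an assumption, since the text does not state them; the function names are hypothetical.

```python
def decode_modrm(byte):
    """Split a ModR/M byte into (mod, reg, rm).

    Assumed layout (not stated in the text): mod = bits 7:6,
    reg = bits 5:3, rm = bits 2:0.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("ModR/M must be a single byte")
    mod = (byte >> 6) & 0b11    # MOD field 1742
    reg = (byte >> 3) & 0b111   # register field 1744
    rm = byte & 0b111           # R/M field 1746
    return mod, reg, rm


def is_register_direct(byte):
    """True when MOD == 11b, i.e. register-direct addressing."""
    mod, _, _ = decode_modrm(byte)
    return mod == 0b11


print(decode_modrm(0xD8))        # (3, 3, 0)
print(is_register_direct(0xD8))  # True
```

Any other MOD value selects one of the memory-access forms, which is why a decoder would typically test MOD before interpreting the R/M field.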

The SIB byte 1704 includes a scale field 1752, an index field 1754, and a base field 1756 to be used in the generation of an address. The scale field 1752 specifies a scaling factor. The index field 1754 specifies an index register to use. In some embodiments, the index field 1754 is supplemented with an additional bit from a prefix (e.g., prefix 1601) to allow for greater addressing. The base field 1756 specifies a base register to use. In some embodiments, the base field 1756 is supplemented with an additional bit from a prefix (e.g., prefix 1601) to allow for greater addressing. In practice, the content of the scale field 1752 allows the content of the index field 1754 to be scaled for memory address generation (e.g., for address generation that uses 2^scale * index + base).
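A minimal Python sketch of the scaled-index address form 2^scale * index + base follows. The SIB bit positions (scale in bits 7:6, index in bits 5:3, base in bits 2:0) are the conventional x86 layout and are assumed here; the register values passed in are illustrative only.

```python
def decode_sib(byte):
    """Split a SIB byte into (scale, index, base) field values.

    Assumed layout: scale = bits 7:6, index = bits 5:3, base = bits 2:0.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("SIB must be a single byte")
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def scaled_address(scale, index_value, base_value):
    """Compute 2^scale * index + base from register *contents*."""
    return (index_value << scale) + base_value


scale, index, base = decode_sib(0x88)
print(scale, index, base)                      # 2 1 0
print(hex(scaled_address(scale, 0x10, 0x1000)))  # 0x1040
```

Note that `decode_sib` returns register numbers, while `scaled_address` takes the values held in those registers; a real decoder would look the values up in a register file in between.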

Some addressing forms use a displacement value to generate a memory address. For example, a memory address may be generated according to 2^scale * index + base + displacement, index * scale + displacement, r/m + displacement, instruction pointer (RIP/EIP) + displacement, register + displacement, etc. The displacement may be a 1-byte, 2-byte, 4-byte, etc. value. In some embodiments, the displacement field 1607 provides this value. Additionally, in some embodiments, use of a displacement factor is encoded in the MOD field of the addressing field 1605 and indicates a compressed displacement scheme, in which the displacement value is calculated by multiplying disp8 by a scaling factor N that is determined from the vector length, the value of the b bit, and the input element size of the instruction. The displacement value is stored in the displacement field 1607.
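The compressed displacement scheme can be illustrated with a short Python sketch. Two assumptions are made that the text does not state: disp8 is treated as a signed 8-bit value, and the scaling factor N is simply passed in rather than derived from the vector length, b bit, and element size. The function names are hypothetical.

```python
def sign_extend8(value):
    """Interpret the low 8 bits of value as a signed byte (assumption)."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def compressed_displacement(disp8, n):
    """Displacement = disp8 * N, with N supplied by the caller."""
    return sign_extend8(disp8) * n


def address_with_displacement(scale, index_value, base_value, displacement):
    """Compute 2^scale * index + base + displacement."""
    return (index_value << scale) + base_value + displacement


print(compressed_displacement(0x02, 64))  # 128
print(compressed_displacement(0xFF, 64))  # -64
print(hex(address_with_displacement(2, 0x10, 0x1000, 128)))  # 0x10c0
```

The benefit of the scheme is visible in the first call: a single displacement byte reaches an offset of 128 bytes when N is 64, which would otherwise need a wider displacement field.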

In some embodiments, the immediate field 1609 specifies an immediate value for the instruction. An immediate value may be encoded as a 1-byte value, a 2-byte value, a 4-byte value, etc.

Figure 18 illustrates embodiments of a first prefix 1601(A). In some embodiments, the first prefix 1601(A) is an embodiment of a REX prefix. Instructions that use this prefix may specify general purpose registers, 64-bit packed data registers (e.g., single instruction, multiple data (SIMD) registers or vector registers), and/or control registers and debug registers (e.g., CR8-CR15 and DR8-DR15).

Instructions using the first prefix 1601(A) may specify up to three registers using 3-bit fields, depending on the format: 1) using the register field 1744 and the R/M field 1746 of the ModR/M byte 1702; 2) using the ModR/M byte 1702 with the SIB byte 1704, including using the register field 1744, the base field 1756, and the index field 1754; or 3) using the register field of an opcode.

In the first prefix 1601(A), bit positions 7:4 are set to 0100. Bit position 3 (W) can be used to determine the operand size, but may not solely determine the operand width. As such, when W = 0, the operand size is determined by the code segment descriptor (CS.D), and when W = 1, the operand size is 64 bits.

Note that the addition of another bit allows 16 (2^4) registers to be addressed, whereas the ModR/M register field 1744 and the ModR/M R/M field 1746 alone can each address only 8 registers.

In the first prefix 1601(A), bit position 2 (R) may be an extension of the ModR/M register field 1744 and may be used to modify that field when it encodes a general purpose register, a 64-bit packed data register (e.g., an SSE register), or a control or debug register. R is ignored when the ModR/M byte 1702 specifies other registers or defines an extended opcode.

Bit position 1 (X) may modify the SIB byte index field 1754.

Bit position 0 (B) may modify the base in the ModR/M R/M field 1746 or the SIB byte base field 1756; or it may modify the opcode register field used for accessing general purpose registers (e.g., general purpose registers 1525).
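The W, R, X, and B bits and the way one prefix bit widens a 3-bit register field to 4 bits can be sketched in Python as follows. Placing B at bit position 0 follows the conventional REX layout and is an assumption here; the helper names are hypothetical.

```python
def decode_first_prefix(byte):
    """Return the W, R, X, B bits of a REX-style prefix, or None.

    Bits 7:4 must equal 0100; W = bit 3, R = bit 2, X = bit 1,
    B = bit 0 (bit 0 for B is an assumption).
    """
    if (byte >> 4) != 0b0100:
        return None
    return {
        "W": (byte >> 3) & 1,
        "R": (byte >> 2) & 1,
        "X": (byte >> 1) & 1,
        "B": byte & 1,
    }


def extend_register(prefix_bit, field3):
    """Prepend one prefix bit to a 3-bit field, giving 0..15."""
    return (prefix_bit << 3) | (field3 & 0b111)


bits = decode_first_prefix(0x4C)
print(bits)                               # {'W': 1, 'R': 1, 'X': 0, 'B': 0}
print(extend_register(bits["R"], 0b010))  # 10
print(decode_first_prefix(0x90))          # None
```

With R = 1 the 3-bit register field value 010b selects register 10 rather than register 2, which is the 8-to-16 register expansion the text describes.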

Figures 19(A)-(D) illustrate embodiments of how the R, X, and B fields of the first prefix 1601(A) are used. Figure 19(A) illustrates R and B from the first prefix 1601(A) being used to extend the register field 1744 and the R/M field 1746 of the ModR/M byte 1702 when the SIB byte 1704 is not used for memory addressing. Figure 19(B) illustrates R and B from the first prefix 1601(A) being used to extend the register field 1744 and the R/M field 1746 of the ModR/M byte 1702 when the SIB byte 1704 is not used (register-register addressing). Figure 19(C) illustrates R, X, and B from the first prefix 1601(A) being used to extend the register field 1744 of the ModR/M byte 1702 and the index field 1754 and base field 1756 when the SIB byte 1704 is used for memory addressing. Figure 19(D) illustrates B from the first prefix 1601(A) being used to extend the register field 1744 of the ModR/M byte 1702 when a register is encoded in the opcode 1603.

Figures 20(A)-(B) illustrate embodiments of a second prefix 1601(B). In some embodiments, the second prefix 1601(B) is an embodiment of a VEX prefix. The second prefix 1601(B) encoding allows instructions to have more than two operands and allows SIMD vector registers (e.g., vector/SIMD registers 1510) to be longer than 64 bits (e.g., 128-bit and 256-bit). The use of the second prefix 1601(B) provides for three-operand (or more) syntax. For example, previous two-operand instructions performed operations such as A = A + B, which overwrites a source operand. The use of the second prefix 1601(B) enables instructions to perform nondestructive operations such as A = B + C.

In some embodiments, the second prefix 1601(B) comes in two forms: a two-byte form and a three-byte form. The two-byte second prefix 1601(B) is used mainly for 128-bit, scalar, and some 256-bit instructions, while the three-byte second prefix 1601(B) provides a compact replacement of the first prefix 1601(A) and 3-byte opcode instructions.

Figure 20(A) illustrates embodiments of a two-byte form of the second prefix 1601(B). In one example, a format field 2001 (byte 0 2003) contains the value C5H. In one example, byte 1 2005 includes an "R" value in bit[7]. This value is the complement of the same value of the first prefix 1601(A). Bit[2] is used to dictate the length (L) of the vector (where a value of 0 is a scalar or 128-bit vector and a value of 1 is a 256-bit vector). Bits[1:0] provide opcode extensionality equivalent to some legacy prefixes (e.g., 00 = no prefix, 01 = 66H, 10 = F3H, and 11 = F2H). Bits[6:3], shown as vvvv, may be used to: 1) encode the first source register operand, specified in inverted (1s complement) form and valid for instructions with 2 or more source operands; 2) encode the destination register operand, specified in 1s complement form for certain vector shifts; or 3) not encode any operand, in which case the field is reserved and should contain a certain value, such as 1111b.
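The two-byte layout just described can be decoded with a few shifts and masks. The following Python sketch uses the bit positions given in the text; the function name and the dictionary keys are hypothetical, and the inverted fields (R and vvvv) are returned already un-inverted.

```python
# Bits[1:0] -> equivalent legacy prefix byte (None means no prefix).
PP_TO_PREFIX = {0b00: None, 0b01: 0x66, 0b10: 0xF3, 0b11: 0xF2}


def decode_two_byte_prefix(byte0, byte1):
    """Decode the two-byte (C5H) form into its logical fields."""
    if byte0 != 0xC5:
        raise ValueError("two-byte form must start with C5H")
    return {
        "R": ((byte1 >> 7) & 1) ^ 1,        # stored complemented
        "vvvv": (~(byte1 >> 3)) & 0xF,      # stored in 1s complement
        "L": (byte1 >> 2) & 1,              # 0: scalar/128-bit, 1: 256-bit
        "implied_prefix": PP_TO_PREFIX[byte1 & 0b11],
    }


fields = decode_two_byte_prefix(0xC5, 0xF5)
print(fields["R"], fields["vvvv"], fields["L"], hex(fields["implied_prefix"]))
# 0 1 1 0x66
```

A stored vvvv of 1111b decodes to 0 here; whether that means "register 0" or "no operand" depends on the instruction, as the text notes.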

Instructions that use this prefix may use the ModR/M R/M field 1746 to encode the instruction operand that references a memory address, or to encode either the destination register operand or a source register operand.

Instructions that use this prefix may use the ModR/M register field 1744 to encode either the destination register operand or a source register operand, or the field may be treated as an opcode extension and not used to encode any instruction operand.

For instruction syntax that supports four operands, vvvv, the ModR/M R/M field 1746, and the ModR/M register field 1744 encode three of the four operands. Bits[7:4] of the immediate 1609 are then used to encode the third source register operand.

Figure 20(B) illustrates embodiments of a three-byte form of the second prefix 1601(B). In one example, a format field 2011 (byte 0 2013) contains the value C4H. Byte 1 2015 includes in bits[7:5] "R," "X," and "B," which are the complements of the same values of the first prefix 1601(A). Bits[4:0] of byte 1 2015 (shown as mmmmm) include content to encode, as needed, one or more implied leading opcode bytes. For example, 00001 implies a 0FH leading opcode, 00010 implies a 0F38H leading opcode, 00011 implies a 0F3AH leading opcode, etc.

Bit[7] of byte 2 2017 is used similarly to W of the first prefix 1601(A), including helping to determine promotable operand sizes. Bit[2] is used to dictate the length (L) of the vector (where a value of 0 is a scalar or 128-bit vector and a value of 1 is a 256-bit vector). Bits[1:0] provide opcode extensionality equivalent to some legacy prefixes (e.g., 00 = no prefix, 01 = 66H, 10 = F3H, and 11 = F2H). Bits[6:3], shown as vvvv, may be used to: 1) encode the first source register operand, specified in inverted (1s complement) form and valid for instructions with 2 or more source operands; 2) encode the destination register operand, specified in 1s complement form for certain vector shifts; or 3) not encode any operand, in which case the field is reserved and should contain a certain value, such as 1111b.
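A Python sketch of the three-byte form follows, using the bit positions from the text. The ordering of R, X, and B within bits[7:5] (R in bit 7, X in bit 6, B in bit 5) is assumed from the order in which they are listed; the names are hypothetical, and only the three mmmmm values given in the text are mapped.

```python
# mmmmm -> implied leading opcode bytes (only the values named in the text).
LEADING_OPCODE = {0b00001: b"\x0f", 0b00010: b"\x0f\x38", 0b00011: b"\x0f\x3a"}
PP_TO_PREFIX = {0b00: None, 0b01: 0x66, 0b10: 0xF3, 0b11: 0xF2}


def decode_three_byte_prefix(byte0, byte1, byte2):
    """Decode the three-byte (C4H) form into its logical fields."""
    if byte0 != 0xC4:
        raise ValueError("three-byte form must start with C4H")
    return {
        "R": ((byte1 >> 7) & 1) ^ 1,   # R, X, B are stored complemented
        "X": ((byte1 >> 6) & 1) ^ 1,
        "B": ((byte1 >> 5) & 1) ^ 1,
        "leading_opcode": LEADING_OPCODE.get(byte1 & 0b11111),
        "W": (byte2 >> 7) & 1,
        "vvvv": (~(byte2 >> 3)) & 0xF,  # stored in 1s complement
        "L": (byte2 >> 2) & 1,
        "implied_prefix": PP_TO_PREFIX[byte2 & 0b11],
    }


fields = decode_three_byte_prefix(0xC4, 0xE2, 0x7D)
print(fields["leading_opcode"].hex(), fields["W"], fields["vvvv"], fields["L"])
# 0f38 0 0 1
```

Unmapped mmmmm values return `None` for the leading opcode rather than guessing at encodings the text does not list.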

Figure 21 illustrates embodiments of a third prefix 1601(C). In some embodiments, the third prefix 1601(C) is an embodiment of an EVEX prefix. The third prefix 1601(C) is a four-byte prefix.

The third prefix 1601(C) can encode 32 vector registers (e.g., 128-bit, 256-bit, and 512-bit registers) in 64-bit mode. In some embodiments, instructions that utilize a writemask/opmask (see the discussion of registers in a previous figure, such as Figure 15) or predication utilize this prefix. Opmask registers allow for conditional processing or selection control. Opmask instructions, whose source/destination operands are opmask registers and which treat the content of an opmask register as a single value, are encoded using the second prefix 1601(B).

The third prefix 1601(C) may encode functionality that is specific to instruction classes (e.g., a packed instruction with "load + op" semantics can support embedded broadcast functionality, a floating-point instruction with rounding semantics can support static rounding functionality, a floating-point instruction with non-rounding arithmetic semantics can support "suppress all exceptions" functionality, etc.).

The first byte of the third prefix 1601(C) is a format field 2111 that has a value, in one example, of 62H. Subsequent bytes are referred to as payload bytes 2115-2119 and collectively form a 24-bit value P[23:0], providing specific capability in the form of one or more fields (detailed herein).

In some embodiments, P[1:0] of payload byte 2119 are identical to the low two mmmmm bits. P[3:2] are reserved in some embodiments. Bit P[4] (R') allows access to the high 16 vector register set when combined with P[7] and the ModR/M register field 1744. P[6] can also provide access to a high 16 vector register when SIB-type addressing is not needed. P[7:5] consist of R, X, and B, which are operand specifier modifier bits for vector register, general purpose register, and memory addressing, and which allow access to the next set of 8 registers beyond the low 8 registers when combined with the ModR/M register field 1744 and the ModR/M R/M field 1746. P[9:8] provide opcode extensionality equivalent to some legacy prefixes (e.g., 00 = no prefix, 01 = 66H, 10 = F3H, and 11 = F2H). P[10] in some embodiments is a fixed value of 1. P[14:11], shown as vvvv, may be used to: 1) encode the first source register operand, specified in inverted (1s complement) form and valid for instructions with 2 or more source operands; 2) encode the destination register operand, specified in 1s complement form for certain vector shifts; or 3) not encode any operand, in which case the field is reserved and should contain a certain value, such as 1111b.

P[15] is similar to W of the first prefix 1601(A) and the second prefix 1601(B), and may serve as an opcode extension bit or for operand size promotion.

P[18:16] specify the index of a register in the opmask (writemask) registers (e.g., writemask/predicate registers 1515). In one embodiment of the invention, the specific value aaa = 000 has a special behavior implying that no opmask is used for the particular instruction (this may be implemented in a variety of ways, including the use of an opmask hardwired to all ones or hardware that bypasses the masking hardware). When merging, vector masks allow any set of elements in the destination to be protected from updates during the execution of any operation (specified by the base operation and the augmentation operation); in one other embodiment, the old value of each element of the destination is preserved where the corresponding mask bit has a 0 value. In contrast, when zeroing, vector masks allow any set of elements in the destination to be zeroed during the execution of any operation (specified by the base operation and the augmentation operation); in one embodiment, an element of the destination is set to 0 when the corresponding mask bit has a 0 value. A subset of this functionality is the ability to control the vector length of the operation being performed (that is, the span of elements being modified, from the first to the last one); however, the elements that are modified need not be consecutive. Thus, the opmask field allows for partial vector operations, including loads, stores, arithmetic, logical, etc. While embodiments of the invention are described in which the opmask field's content selects one of a number of opmask registers that contains the opmask to be used (and thus the opmask field's content indirectly identifies the masking to be performed), alternative or different embodiments allow the mask write field's content to directly specify the masking to be performed.
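The difference between merging and zeroing can be made concrete with a small Python model. Vectors are modeled as plain lists and the mask as an integer whose bit i governs element i; this representation and the function name are illustrative assumptions, not part of the described hardware.

```python
def apply_opmask(old_dest, results, mask, zeroing=False):
    """Combine per-element results with the old destination under a mask.

    Mask bit i == 1: element i takes the new result.
    Mask bit i == 0: element i keeps its old value (merging)
                     or is set to 0 (zeroing).
    """
    if len(old_dest) != len(results):
        raise ValueError("vectors must have the same number of elements")
    out = []
    for i, (old, new) in enumerate(zip(old_dest, results)):
        if (mask >> i) & 1:
            out.append(new)
        elif zeroing:
            out.append(0)
        else:
            out.append(old)
    return out


old = [10, 20, 30, 40]
new = [1, 2, 3, 4]
print(apply_opmask(old, new, 0b0101))                # [1, 20, 3, 40]
print(apply_opmask(old, new, 0b0101, zeroing=True))  # [1, 0, 3, 0]
print(apply_opmask(old, new, 0b1111))                # [1, 2, 3, 4]
```

The last call, with every mask bit set, behaves like an unmasked operation, which mirrors the "opmask hardwired to all ones" option mentioned for aaa = 000.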

P[19] can be combined with P[14:11] to encode a second source vector register in a non-destructive source syntax, which can access the upper 16 vector registers using P[19]. P[20] encodes multiple functionalities, which differ across different classes of instructions and can affect the meaning of the vector length/rounding control specifier field (P[22:21]). P[23] indicates support for merging-writemasking (e.g., when set to 0) or support for zeroing and merging-writemasking (e.g., when set to 1).
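The P[23:0] field positions listed above can be collected into one Python sketch. It assumes that the first payload byte supplies P[7:0], the second P[15:8], and the third P[23:16], which the text does not state. The key names (`R_prime`, `V_prime`, `b`, `LL`, `z`) are labels chosen for illustration, and every field is returned as the raw stored bits with no inversion applied.

```python
def decode_payload(p0, p1, p2):
    """Extract the fields of the 24-bit payload P[23:0] as raw bits."""
    p = p0 | (p1 << 8) | (p2 << 16)   # assumed byte order

    def bits(hi, lo):
        return (p >> lo) & ((1 << (hi - lo + 1)) - 1)

    return {
        "mm": bits(1, 0),         # low two mmmmm bits
        "R_prime": bits(4, 4),
        "B": bits(5, 5),
        "X": bits(6, 6),
        "R": bits(7, 7),
        "pp": bits(9, 8),
        "fixed": bits(10, 10),    # fixed value 1 in some embodiments
        "vvvv": bits(14, 11),
        "W": bits(15, 15),
        "aaa": bits(18, 16),      # opmask register index
        "V_prime": bits(19, 19),
        "b": bits(20, 20),
        "LL": bits(22, 21),       # vector length / rounding control
        "z": bits(23, 23),        # 0: merging, 1: zeroing and merging
    }


fields = decode_payload(0xF1, 0x7C, 0x48)
print(fields["mm"], fields["vvvv"], fields["aaa"], fields["LL"], fields["z"])
# 1 15 0 2 0
```

Keeping the extraction table-like makes it easy to check each entry against the prose; interpreting the raw bits (for example, un-inverting vvvv) is left to the caller.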

Exemplary embodiments of the encoding of registers in instructions using the third prefix 1601(C) are detailed in the following tables.

Table 1: 32-register support in 64-bit mode

|       | 4  | 3    | [2:0]      | Register type | Common usage              |
| REG   | R' | R    | ModR/M reg | GPR, vector   | Destination or source     |
| VVVV  | V' | vvvv (bits 3:0)   || GPR, vector   | 2nd source or destination |
| RM    | X  | B    | ModR/M R/M | GPR, vector   | 1st source or destination |
| BASE  | 0  | B    | ModR/M R/M | GPR           | Memory addressing         |
| INDEX | 0  | X    | SIB.index  | GPR           | Memory addressing         |
| VIDX  | V' | X    | SIB.index  | Vector        | VSIB memory addressing    |

Table 2: Encoding register specifiers in 32-bit mode

|       | [2:0]      | Register type | Common usage              |
| REG   | ModR/M reg | GPR, vector   | Destination or source     |
| VVVV  | vvvv       | GPR, vector   | 2nd source or destination |
| RM    | ModR/M R/M | GPR, vector   | 1st source or destination |
| BASE  | ModR/M R/M | GPR           | Memory addressing         |
| INDEX | SIB.index  | GPR           | Memory addressing         |
| VIDX  | SIB.index  | Vector        | VSIB memory addressing    |

Table 3: Opmask register specifier encoding

|      | [2:0]      | Register type | Common usage |
| REG  | ModR/M reg | k0-k7         | Source       |
| VVVV | vvvv       | k0-k7         | 2nd source   |
| RM   | ModR/M R/M | k0-k7         | 1st source   |
| {k1} | aaa        | k0-k7         | Opmask       |

Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represent various logic within the processor and which, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores," may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.

Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), and phase change memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.

Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines the structures, circuits, apparatuses, processors, and/or system features described herein. Such embodiments may also be referred to as program products.

Emulation (including binary translation, code morphing, etc.)

In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.

Figure 22 illustrates a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. Figure 22 shows that a program in a high-level language 2202 may be compiled using a first ISA compiler 2204 to generate first ISA binary code 2206 that may be natively executed by a processor with at least one first ISA instruction set core 2216. The processor with at least one first ISA instruction set core 2216 represents any processor that can perform substantially the same functions as an Intel® processor with at least one first ISA instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the first ISA instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one first ISA instruction set core, in order to achieve substantially the same result as a processor with at least one first ISA instruction set core. The first ISA compiler 2204 represents a compiler that is operable to generate first ISA binary code 2206 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor with at least one first ISA instruction set core 2216. Similarly, Figure 22 shows that the program in the high-level language 2202 may be compiled using an alternative instruction set compiler 2208 to generate alternative instruction set binary code 2210 that may be natively executed by a processor without a first ISA instruction set core 2214. The instruction converter 2212 is used to convert the first ISA binary code 2206 into code that may be natively executed by the processor without a first ISA instruction set core 2214. This converted code is not likely to be the same as the alternative instruction set binary code 2210, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 2212 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have a first ISA instruction set processor or core to execute the first ISA binary code 2206.
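As a toy illustration of one instruction being converted into one or more instructions of another instruction set, the following Python sketch uses a lookup table of templates. Every mnemonic in it (`SRC_ADD3`, `TGT_MOV`, and so on) is invented for the example and belongs to no real ISA; a practical converter would work on binary encodings rather than text.

```python
# Hypothetical table: source mnemonic -> target instruction templates.
TRANSLATIONS = {
    # Non-destructive three-operand add becomes a move plus a
    # two-operand add. This toy ignores the case where dst aliases b.
    "SRC_ADD3": ["TGT_MOV {dst}, {a}", "TGT_ADD {dst}, {b}"],
    "SRC_NOP": ["TGT_NOP"],
}


def convert(instruction):
    """Convert one source instruction into a list of target instructions."""
    mnemonic, _, rest = instruction.partition(" ")
    operands = [op.strip() for op in rest.split(",")] if rest else []
    templates = TRANSLATIONS.get(mnemonic)
    if templates is None:
        raise KeyError(f"no translation for {mnemonic!r}")
    names = dict(zip(("dst", "a", "b"), operands))
    return [template.format(**names) for template in templates]


def convert_program(program):
    """Convert a whole program, preserving instruction order."""
    converted = []
    for instruction in program:
        converted.extend(convert(instruction))
    return converted


print(convert_program(["SRC_ADD3 r1, r2, r3", "SRC_NOP"]))
# ['TGT_MOV r1, r2', 'TGT_ADD r1, r3', 'TGT_NOP']
```

The output has more instructions than the input, which matches the observation that converted code accomplishes the same general operation without necessarily matching what a native compiler would have produced.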

References to "one embodiment," "an embodiment," "an example embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such a feature, structure, or characteristic in connection with other embodiments, whether or not explicitly described.

Moreover, in the various embodiments described above, unless specifically noted otherwise, disjunctive language such as the phrase "at least one of A, B, or C" is intended to be understood to mean either A, B, or C, or any combination thereof (e.g., A, B, and/or C). As such, disjunctive language is not intended to, nor should it be understood to, imply that a given embodiment requires at least one of A, at least one of B, or at least one of C to each be present.

Examples of utilizing BF16 data elements include, but are not limited to:
1. A device comprising: decoding circuitry to decode an instance of a single instruction, the single instruction to include fields for an opcode, an identification of a location of a first packed data source operand, an identification of a location of a second packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate an arithmetic operation that execution circuitry is to perform, for each data element position of the identified packed data source operands, on the BF16 data elements in that data element position in BF16 format, and to store a result of each arithmetic operation into a corresponding data element position of the identified packed data destination operand; and the execution circuitry to execute the decoded instruction according to the opcode.
2. The device of example 1, wherein the field for the identification of the first source operand is to identify a vector register.
3. The device of example 1, wherein the field for the identification of the first source operand is to identify a memory location.
4. The device of example 1, wherein the arithmetic operation is addition.
5. The device of example 1, wherein the arithmetic operation is multiplication.
6. The device of example 1, wherein the arithmetic operation is division.
7. The device of example 1, wherein the arithmetic operation is subtraction.
8. A system comprising: memory to store an instance of a single instruction; decoding circuitry to decode the instance of the single instruction, the single instruction to include fields for an opcode, an identification of a location of a first packed data source operand, an identification of a location of a second packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate an arithmetic operation that execution circuitry is to perform, for each data element position of the identified packed data source operands, on the BF16 data elements in that data element position in BF16 format, and to store a result of each arithmetic operation into a corresponding data element position of the identified packed data destination operand; and the execution circuitry to execute the decoded instruction according to the opcode.
9. The system of example 8, wherein the field for the identification of the first source operand is to identify a vector register.
10. The system of example 8, wherein the field for the identification of the first source operand is to identify a memory location.
11. The system of example 8, wherein the arithmetic operation is addition.
12. The system of example 8, wherein the arithmetic operation is multiplication.
13. The system of example 8, wherein the arithmetic operation is division.
14. The system of example 8, wherein the arithmetic operation is subtraction.
15. A method comprising: decoding an instance of a single instruction, the single instruction to include fields for an opcode, an identification of a location of a first packed data source operand, an identification of a location of a second packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate an arithmetic operation that execution circuitry is to perform, for each data element position of the identified packed data source operands, on the BF16 data elements in that data element position in BF16 format, and to store a result of each arithmetic operation into a corresponding data element position of the identified packed data destination operand; and executing the decoded instruction according to the opcode.
16. The method of example 15, wherein the arithmetic operation is addition.
17. The method of example 15, wherein the arithmetic operation is multiplication.
18. The method of example 15, wherein the arithmetic operation is division.
19. The method of example 15, wherein the arithmetic operation is subtraction.
20. The method of example 15, further comprising: translating the single instruction into one or more instructions of a different instruction set architecture, wherein executing the decoded instruction according to the opcode comprises executing the one or more instructions of the different instruction set architecture.
21. A device comprising: decoding circuitry to decode an instance of a single instruction, the single instruction to include fields for an opcode, an identification of a location of a packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate that execution circuitry is to perform, for each data element position of the identified packed data source operand, a calculation of an approximate reciprocal of a BF16 data element in that data element position, and to store the approximate reciprocal into a corresponding data element position of the identified packed data destination operand; and the execution circuitry to execute the decoded instruction according to the opcode.
22. The device of example 21, wherein the field for the identification of the first source operand is to identify a vector register.
23. The device of example 21, wherein the field for the identification of the first source operand is to identify a memory location.
24. A system comprising: memory to store an instance of a single instruction; decoding circuitry to decode the instance of the single instruction, the single instruction to include fields for an opcode, an identification of a location of a packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate that execution circuitry is to perform, for each data element position of the identified packed data source operand, a calculation of an approximate reciprocal of a BF16 data element in that data element position, and to store the approximate reciprocal into a corresponding data element position of the identified packed data destination operand; and the execution circuitry to execute the decoded instruction according to the opcode.
25. The system of example 24, wherein the field for the identification of the first source operand is to identify a vector register.
26. The system of example 24, wherein the field for the identification of the first source operand is to identify a memory location.
27. A method comprising: decoding an instance of a single instruction, the single instruction to include fields for an opcode, an identification of a location of a packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate that execution circuitry is to perform, for each data element position of the identified packed data source operand, a calculation of an approximate reciprocal of a BF16 data element in that data element position, and to store the approximate reciprocal into a corresponding data element position of the identified packed data destination operand; and executing the decoded instruction according to the opcode.
28. The method of example 27, further comprising: translating the single instruction into one or more instructions of a different instruction set architecture, wherein executing the decoded instruction according to the opcode comprises executing the one or more instructions of the different instruction set architecture.
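The element-wise behavior recited in the examples above can be modeled in software. The following is a minimal sketch rather than the claimed hardware. It assumes the standard BF16 layout (1 sign bit, 8 exponent bits, 7 fraction bits, i.e., the upper half of an FP32 value) and round-to-nearest-even when narrowing. The function names, the list-of-bit-patterns representation of a packed operand, and the handling of division by zero are all choices made for this illustration. Results are computed in Python floats and then rounded to BF16, which may differ from a directly rounded hardware result in rare cases. The reciprocal helper computes a rounded 1/x, whereas the instruction described above only needs to produce an approximation.

```python
import math
import struct
from typing import List


def float_to_bf16_bits(x: float) -> int:
    """Round a Python float to BF16 (nearest-even) and return its 16-bit pattern."""
    try:
        packed = struct.pack("<f", x)
    except OverflowError:
        # Magnitude too large for FP32: treat as signed infinity.
        packed = struct.pack("<f", math.copysign(math.inf, x))
    bits = struct.unpack("<I", packed)[0]
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF  # keep NaN a (quiet) NaN
    bias = 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    return ((bits + bias) >> 16) & 0xFFFF


def bf16_bits_to_float(h: int) -> float:
    """Widen a 16-bit BF16 pattern to a Python float (exact)."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]


def _div(a: float, b: float) -> float:
    # Python raises on division by zero, so return NaN or a signed infinity
    # explicitly (an assumption modeled on IEEE-754 default results).
    if b == 0.0:
        if a == 0.0 or math.isnan(a):
            return math.nan
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)
    return a / b


OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": _div,
}


def packed_bf16_op(op: str, src1: List[int], src2: List[int]) -> List[int]:
    """Apply `op` to each pair of BF16 elements at the same position."""
    if len(src1) != len(src2):
        raise ValueError("packed operands must have the same element count")
    fn = OPS[op]
    return [
        float_to_bf16_bits(fn(bf16_bits_to_float(a), bf16_bits_to_float(b)))
        for a, b in zip(src1, src2)
    ]


def packed_bf16_rcp(src: List[int]) -> List[int]:
    """Per-element reciprocal, rounded to BF16 (stand-in for an approximation)."""
    return [float_to_bf16_bits(_div(1.0, bf16_bits_to_float(a))) for a in src]


if __name__ == "__main__":
    one, two = float_to_bf16_bits(1.0), float_to_bf16_bits(2.0)
    print([hex(v) for v in packed_bf16_op("add", [one, two], [two, two])])
    print([hex(v) for v in packed_bf16_rcp([two])])
```

Storing each element as a 16-bit integer pattern keeps the model close to how packed data sits in a vector register; write masking and broadcast, which the figures mention, are deliberately left out of this sketch.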

The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the claims.

101: packed data source, FP32 format
103: packed data source, FP16 format, second source
105: BF16 format
109, 709, 1009: execution circuitry
111: broadcast circuitry
114: addition circuitry
115: subtraction circuitry
116: multiplication circuitry
117: division circuitry
121, 721: writemask circuitry
131: writemask (or predicate) register, packed data destination
701: packed data source
731: packed data destination, writemask (or predicate) register
1001: (VRCPNEPBF16) instruction
1003: storage
1005: decoding circuitry
1007: scheduling circuitry
1008: registers (register file) and/or memory, result
1011: retirement/write back circuitry
1100: multiprocessor system
1114: I/O devices
1115, 1138: (co)processor
1116: first interconnect
1117: power control unit (PCU)
1118: interconnect (bus) bridge
1120: second interconnect
1122: keyboard and/or mouse
1124: audio I/O
1127: communication devices
1128: storage unit circuitry
1130: code and data
1132, 1134: memory
1150: point-to-point (P-P) interconnect
1152, 1154: P-P interconnect
1170: (first) processor
1172, 1182: integrated memory controller (IMC) unit circuitry
1176: point-to-point interface circuit, point-to-point (P-P) interface
1178: P-P interface circuit, point-to-point (P-P) interface
1180: (second) processor
1186: point-to-point interface circuit, P-P interface
1188: P-P interface circuit, P-P interface
1190: chipset
1192: high-performance interface
1194, 1198: point-to-point interface circuit
1196: interface
1200, 2214, 2216: processor
1202(A), 1202(N): core
1204(A), 1204(N): cache unit circuitry
1206: shared cache unit circuitry
1208: special purpose logic
1210: system agent (unit circuitry)
1212: ring-based interconnect network circuitry
1214: integrated memory controller unit circuitry
1216: interconnect controller unit circuitry
1300: (processor) pipeline
1302: fetch stage
1304: length decode stage
1306: decode stage
1308: allocation stage
1310: renaming stage
1312: scheduling stage
1314: register read/memory read stage
1316: execute stage
1318: write back/memory write stage
1322: exception handling stage
1324: commit stage
1330: front end unit circuitry
1332: branch prediction unit circuitry
1334: instruction cache (unit circuitry)
1336: instruction translation lookaside buffer (TLB)
1338: instruction fetch (unit circuitry)
1340: decode unit circuitry
1350: engine (unit) circuitry
1352: rename/allocator unit circuitry
1354: retirement unit circuitry
1356: scheduler (unit) circuitry
1358: physical register file (unit) circuitry
1360: execution cluster
1362: execution unit circuit(ry)
1364: memory access (unit) circuitry
1370: memory unit circuitry
1372: data TLB (unit) circuitry
1374: data cache (circuitry)
1376: level 2 (L2) cache (unit) circuitry
1390: (processor) core
1401: ALU circuit
1403: vector/SIMD unit circuit
1405: load/store unit circuit
1407: branch/jump unit circuit
1409: floating-point unit (FPU) circuit
1500: register architecture
1510: vector/SIMD registers
1515: writemask/predicate registers
1520: segment registers
1525: general-purpose registers
1530: instruction pointer register
1535: machine-specific registers (MSRs)
1540: flag register
1545: scalar floating-point registers
1550: debug registers
1555: control registers
1560: machine check registers
1565: memory management registers
1601(A): first prefix
1601(B), 1611: second prefix
1601(C): (third) prefix
1601: prefix (field)
1603: opcode (field)
1605: addressing information
1607: displacement field, displacement value
1609: immediate (field)
1702: MOD R/M byte, Mod R/M byte, ModR/M byte
1704: scale, index, base (SIB) byte
1742: MOD field
1744: register (index) field, MOD R/M register field, Mod R/M byte, ModR/M register field
1746: (Mod R/M) R/M field, (MOD R/M) R/M field, (ModR/M) R/M field
1752: scale field
1754: (SIB byte) index field
1756: (SIB byte) base field
2001, 2011, 2111: format field
2003, 2013: byte 0
2005, 2015: byte 1
201, 203, 205, 207, 209, 211, 801, 803, 805, 807, 809, 811: steps
2017: byte 2
2115, 2217, 2119: payload bytes
2202: high-level language
2204: first ISA compiler
2206: first ISA binary code
2208: alternative instruction set compiler
2210: alternative instruction set binary code
2212: instruction converter

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:

Figure 1 illustrates an exemplary execution of an instruction to perform an arithmetic operation on source BF16 data elements.

Figure 2 illustrates an embodiment of a method performed by a processor to process an instruction that performs an arithmetic operation on BF16 data elements.

Figures 3-6 illustrate exemplary embodiments of pseudocode representing the execution and format of V{ARITH}NEPBF16 instructions.

Figure 7 illustrates an exemplary execution of an instruction to compute a reciprocal of BF16 data elements.

Figure 8 illustrates an embodiment of a method performed by a processor to process an instruction that computes an approximate reciprocal of BF16 data elements.

Figure 9 illustrates exemplary embodiments of pseudocode representing the execution and format of a VRCPNEPBF16 instruction.

Figure 10 illustrates embodiments of hardware to process an instruction such as the V{ARITH}NEPBF16 and/or VRCPNEPBF16 instructions.

Figure 11 illustrates embodiments of an exemplary system.

Figure 12 illustrates a block diagram of embodiments of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics.

Figure 13(A) is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention.

Figure 13(B) is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention.

Figure 14 illustrates embodiments of execution unit circuitry, such as the execution unit circuitry of Figure 13(B).

Figure 15 is a block diagram of a register architecture according to some embodiments.

Figure 16 illustrates embodiments of an instruction format.

Figure 17 illustrates embodiments of an addressing field.

Figure 18 illustrates embodiments of a first prefix.

Figures 19(A)-(D) illustrate embodiments of how the R, X, and B fields of the first prefix 1601(A) are used.

Figures 20(A)-(B) illustrate embodiments of a second prefix.

Figure 21 illustrates embodiments of a third prefix.

Figure 22 illustrates a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the invention.

101: packed data source, FP32 format

103: packed data source, FP16 format, second source

109: execution circuitry

111: broadcast circuitry

113: arithmetic circuitry

114: addition circuitry

115: subtraction circuitry

116: multiplication circuitry

117: division circuitry

121: writemask circuitry

131: writemask (or predicate) register, packed data destination

Claims

1. A device comprising: a decoding component to decode an instance of a single instruction, the single instruction to include fields for an opcode, an identification of a location of a first packed data source operand, an identification of a location of a second packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate an arithmetic operation that an execution component is to perform, for each data element position of the identified packed data source operands, on the BF16 data elements in that data element position in BF16 format, and to store a result of each arithmetic operation into a corresponding data element position of the identified packed data destination operand; and the execution component to execute the decoded instruction according to the opcode.

2. The device of claim 1, wherein the field for the identification of the first source operand is to identify a vector register.

3. The device of claim 1, wherein the field for the identification of the first source operand is to identify a memory location.

4. The device of claim 1, wherein the arithmetic operation is addition.

5. The device of claim 1, wherein the arithmetic operation is multiplication.

6. The device of claim 1, wherein the arithmetic operation is division.

7. The device of claim 1, wherein the arithmetic operation is subtraction.

8. A system comprising: memory to store an instance of a single instruction; decoding circuitry to decode the instance of the single instruction, the single instruction to include fields for an opcode, an identification of a location of a first packed data source operand, an identification of a location of a second packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate an arithmetic operation that execution circuitry is to perform, for each data element position of the identified packed data source operands, on the BF16 data elements in that data element position in BF16 format, and to store a result of each arithmetic operation into a corresponding data element position of the identified packed data destination operand; and the execution circuitry to execute the decoded instruction according to the opcode.

9. The system of claim 8, wherein the field for the identification of the first source operand is to identify a vector register.

10. The system of claim 8, wherein the field for the identification of the first source operand is to identify a memory location.

11. The system of claim 8, wherein the arithmetic operation is addition.

12. The system of claim 8, wherein the arithmetic operation is multiplication.

13. The system of claim 8, wherein the arithmetic operation is division.

14. The system of claim 8, wherein the arithmetic operation is subtraction.

15. A method comprising: decoding an instance of a single instruction, the single instruction to include fields for an opcode, an identification of a location of a first packed data source operand, an identification of a location of a second packed data source operand, and an identification of a location of a packed data destination operand, wherein the opcode is to indicate an arithmetic operation that execution circuitry is to perform, for each data element position of the identified packed data source operands, on the BF16 data elements in that data element position in BF16 format, and to store a result of each arithmetic operation into a corresponding data element position of the identified packed data destination operand; and executing the decoded instruction according to the opcode.

16. The method of claim 15, wherein the arithmetic operation is addition.

17. The method of claim 15, wherein the arithmetic operation is multiplication.

18. The method of claim 15, wherein the arithmetic operation is division.

19. The method of claim 15, wherein the arithmetic operation is subtraction.

20. The method of any one of claims 15 to 19, further comprising: translating the single instruction into one or more instructions of a different instruction set architecture, wherein executing the decoded instruction according to the opcode comprises executing the one or more instructions of the different instruction set architecture.

21. A computer program comprising instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 15 to 20.