TW202318190A

TW202318190A - Apparatus and method for implementing vector mask in vector processing unit

Info

Publication number: TW202318190A
Application number: TW110138605A
Authority: TW
Inventors: 丁明陳; 許家瑋
Original assignee: 晶心科技股份有限公司
Priority date: 2021-10-18
Filing date: 2021-10-18
Publication date: 2023-05-01

Abstract

The mask data corresponding to each data element of the issued instruction may be handled by a mask queue, where only the valid mask data are stored to the mask queue. The mask data of multiple vector instructions may be stored in the mask queue. The corresponding mask data may be accessed from the mask queue when the vector instruction(s) is issued from the execution queue to the functional unit for execution. In the case of 512-bit wide mask data is needed, the dispatching of the vector instruction from the decode/issue unit to the execution queue may be stalled until the mask queue is available. In some embodiments, one mask queue may be dedicated to one execution queue. Alternatively, one mask queue may be shared between two different execution queues. In the disclosure, resources are conserved without dedicating additional storage space for handling mask data of the vector instruction.

Description

Microprocessor and method for handling mask data for vector instructions

本發明是有關於一種微處理器，且特別是有關於一種用於處理向量指令的方法和微處理器。The present invention relates to a microprocessor, and more particularly to a method and microprocessor for processing vector instructions.

單指令多資料（Single Instruction Multiple Data；SIMD）架構通過並存執行由SIMD指令（也稱為向量指令）指定的倍數資料元素來實現高性能，而標量指令僅處理一個資料元素或一對資料元素（即，兩個來源運算元）。資料元素中的每一個可表示存儲於暫存器或其它存儲位置中的個別資料片（例如，圖元資料、圖形座標等）以及通常具有相同大小的其它資料元素。由向量指令指定的資料元素的數目基於資料元素大小和向量長度乘數（vector length multiplier；LMUL）而極大地改變。舉例來說，當LMUL是1時，512位元寬向量資料可具有64個8位元寬元素、32個16位元寬資料元素、十六個32位元寬資料元素等等。當LMUL是8時，512位元寬向量資料可具有五百一十二個8位元寬資料元素、兩百五十六個16位元寬資料元素、一百二十八個32位元寬資料元素等等。The Single Instruction Multiple Data (SIMD) architecture achieves high performance by concurrently executing multiple data elements specified by SIMD instructions (also known as vector instructions), while scalar instructions only process one data element or a pair of data elements ( That is, two source operands). Each of the data elements may represent an individual piece of data (eg, primitive data, graphics coordinates, etc.) stored in a scratchpad or other storage location, along with other data elements that are typically of the same size. The number of data elements specified by a vector instruction varies greatly based on the data element size and a vector length multiplier (LMUL). For example, when LMUL is 1, a 512-bit wide vector data may have 64 8-bit wide data elements, 32 16-bit wide data elements, sixteen 32-bit wide data elements, and so on. When LMUL is 8, 512-bit wide vector data can have 512 8-bit wide data elements, 256 16-bit wide data elements, 128 32-bit wide data elements data elements etc.

在向量指令的處理中，向量暫存器的每一資料元素將與掩碼位附接，所述掩碼位在啟用時將對應資料元素掩碼到指定操作。在最壞情況情境中，掩碼向量暫存器的所有512位元將用於五百一十二個8位元寬資料元素。另一方面，512位元掩碼向量暫存器的僅16位元為具有十六個32位元寬資料元素的512位元寬向量資料所需。雖然並非所有向量指令是最差情況情境(即，所有512位元皆需使用)，但每一向量指令仍發出有512位元掩碼資料以覆蓋預測的所有可能性，而不管所有512位元掩碼資料是否為向量指令所需（即，“掩碼資料的強行實施”）。具有向量指令的掩碼資料的此類實施方案佔據向量暫存器的多個執行管線中的多個所排隊向量指令的許多存儲區域和功率。In the processing of vector instructions, each data element of the vector register will be appended with a mask bit which, when enabled, masks the corresponding data element to a specified operation. In the worst case scenario, all 512 bits of the mask vector register would be used for five hundred and twelve 8-bit wide data elements. On the other hand, only 16 bits of the 512-bit mask vector register are required for 512-bit wide vector data with sixteen 32-bit wide data elements. Although not all vector instructions are worst-case scenarios (i.e., all 512 bits are required), each vector instruction still issues 512 bits of mask data to cover all possibilities of prediction regardless of all 512 bits Whether mask data is required for vector instructions (ie, "enforcement of mask data"). Such implementations with mask profiles for vector instructions occupy much storage area and power for multiple queued vector instructions in multiple execution pipelines of a vector scratchpad.

本發明提供一種掩碼佇列，所述掩碼佇列管理對應於從解碼/發射單元分派到執行佇列的向量指令的資料元素的掩碼資料。The present invention provides a mask queue that manages mask data corresponding to data elements of vector instructions dispatched from a decode/issue unit to an execution queue.

本發明的微處理器包含解碼/發射單元和執行佇列。執行佇列包含多個佇列條目和掩碼佇列。在實施例中，執行佇列配置成將從解碼/發射單元分派且在具有多個第一資料元素的資料上操作的第一指令分配給第一佇列條目。掩碼佇列包含多個掩碼條目，且當第一指令分配給執行佇列中的第一佇列條目時，對應於第一指令的第一掩碼資料寫入到第一數目個掩碼條目，其中第一數目基於第一資料元素的寬度而判斷。The microprocessor of the present invention includes a decode/transmit unit and an execution queue. Execution queues consist of multiple queue entries and mask queues. In an embodiment, the execution queue is configured to assign to the first queue entry a first instruction dispatched from the decode/issue unit and operating on data having a first plurality of data elements. The mask queue includes a plurality of mask entries, and when the first instruction is assigned to the first queue entry in the execution queue, first mask data corresponding to the first instruction is written to the first number of masks entries, wherein the first number is determined based on the width of the first data element.

本發明的處置向量指令的掩碼資料的方法包括以下步驟。將在具有多個第一資料元素的資料上操作的第一指令分派到包含掩碼佇列的執行佇列。將所述第一指令分配給所述執行佇列中的第一佇列條目。將對應於所述第一指令的第一掩碼資料寫入到所述掩碼佇列中的第一數目個掩碼條目，其中所述第一數目基於所述第一資料元素的寬度而判斷。The method for handling mask data of vector instructions of the present invention includes the following steps. A first instruction operating on data having a plurality of first data elements is dispatched to an execution queue including a mask queue. The first instruction is assigned to a first queue entry in the execution queue. writing first mask data corresponding to the first command to a first number of mask entries in the mask queue, wherein the first number is determined based on a width of the first data element .

在本發明中，對應於所分派指令的資料元素的掩碼資料可由所引入掩碼佇列處置或管理，其中執行佇列中的所有向量指令的僅有效掩碼資料存儲到掩碼佇列。在本公開中，多個向量指令的掩碼資料可存儲於掩碼佇列中。當將向量指令從執行佇列發出到功能單元以用於執行時，可從掩碼佇列存取對應掩碼資料。如果掩碼佇列並不具有足夠可用條目，那麼可停止向量指令從解碼/發射單元分派到執行佇列。在一些實施例中，一個掩碼佇列可專用於一個執行佇列。在一些其它實施例中，一個掩碼佇列可在兩個不同執行佇列之間共用。In the present invention, mask data corresponding to data elements of dispatched instructions may be handled or managed by an introduced mask queue, where only valid mask data for all vector instructions in the execution queue are stored to the mask queue. In the present disclosure, mask data for multiple vector instructions can be stored in a mask queue. When a vector instruction is issued from the execution queue to a functional unit for execution, corresponding mask data can be accessed from the mask queue. Dispatch of vector instructions from the decode/issue unit to the execution queue may be stopped if the mask queue does not have enough entries available. In some embodiments, one mask queue may be dedicated to one execution queue. In some other embodiments, a mask queue can be shared between two different execution queues.

基於上述，在本發明中，保留資源而不使額外存儲空間專用於處置向量指令的掩碼資料。也就是說，當向量指令分派到執行佇列時，掩碼佇列將從掩碼暫存器僅讀取向量指令所需的掩碼資料。掩碼佇列的實施極大地減少管理用於處理向量指令的資料元素的掩碼所需的資源。Based on the above, in the present invention, resources are reserved without dedicating additional storage space to handling mask data for vector instructions. That is, when the vector instruction is dispatched to the execution queue, the mask queue will only read the mask data required by the vector instruction from the mask register. The implementation of mask queues greatly reduces the resources required to manage the masks used to process data elements of vector instructions.

以下公開內容提供用於實施本公開的不同特徵的許多不同實施例或實例。下文描述元件和佈置的特定實例以簡化本公開。當然，這些僅是實例且並不意欲為限制性的。舉例來說，在以下描述中，第一特徵在第二特徵上方或上的形成可包含第一特徵和第二特徵直接接觸地形成的實施例，且還可包含額外特徵可在第一特徵與第二特徵之間形成以使得第一特徵和第二特徵可不直接接觸的實施例。此外，本公開可在各種實例中重複參考標號和/或字母。這種重複是出於簡化和清楚的目的且本身並不規定所論述的各種實施例和/或配置之間的關係。本公開引入掩碼佇列，所述掩碼佇列管理對應於分派的向量指令的資料元素的掩碼資料。在“掩碼操作的強行實施”中，暫存器不管是否將需要所有掩碼資料，都將具有整個掩碼暫存器（例如，512位元）的每一指令分派到執行佇列。如果執行佇列具有8個條目，那麼掩碼存儲裝置將是4096位元（即，512 × 8）。如果存在8個執行佇列，那麼掩碼存儲裝置將是32768位元（即，512 × 8 × 8）。由於並非所有向量指令都會使用512位元掩碼，所以強行實施的掩碼存儲裝置是浪費的。在本公開中，當向量指令分派到執行佇列時，掩碼佇列將從掩碼暫存器僅讀取向量指令所需的掩碼資料。掩碼佇列的實施極大地減少管理用於處理向量指令的資料元素的掩碼所需的資源。 The following disclosure provides many different embodiments, or examples, for implementing different features of the disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. Of course, these are examples only and are not intended to be limiting. For example, in the following description, the formation of a first feature over or on a second feature may include embodiments in which the first feature and the second feature are formed in direct contact, and may also include that additional features may be formed between the first feature and the second feature. An embodiment in which the second features are formed such that the first and second features may not be in direct contact. Additionally, the present disclosure may repeat reference numerals and/or letters in various instances. This repetition is for purposes of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. The present disclosure introduces a mask queue that manages mask data corresponding to data elements of dispatched vector instructions. In "Brute Force of Mask Operations," every instruction with the entire mask register (eg, 512 bits) is dispatched to the execution queue regardless of whether all mask data will be needed by the register. If the execution queue has 8 entries, the mask storage will be 4096 bits (ie, 512 x 8). If there were 8 execution queues, then the mask storage would be 32768 bits (ie, 512 x 8 x 8). Since not all vector instructions will use a 512-bit mask, the forced mask storage is wasteful. In the present disclosure, when the vector instruction is dispatched to the execution queue, the mask queue will only read the mask data required by the vector instruction from the mask register. The implementation of mask queues greatly reduces the resources required to manage the masks used to process data elements of vector instructions.

參考圖1，根據一些實施例示出包含微處理器10和記憶體30的資料處理系統1的示意圖。實施微處理器10以通過執行存儲於記憶體30中的指令來執行各種資料處理功能性。記憶體30可包含2級（level 2；L2）快取記憶體和3級（level 3；L3）快取記憶體以及資料處理系統1的主記憶體，其中L2快取記憶體和L3快取記憶體具有比主記憶體快的存取時間。記憶體可包含隨機存取記憶體（random access memory；RAM）、動態隨機存取記憶體（dynamic random access memory；DRAM）、靜態隨機存取記憶體（static random access memory；SRAM）、唯讀記憶體（read only memory；ROM）、可程式設計唯讀記憶體（programmable read only memory；PROM）、電可程式設計唯讀記憶體（electrically programmable read only memory；EPROM）、電可擦除可程式設計唯讀記憶體（electrically erasable programmable read only memory；EEPROM）以及快閃記憶體中的至少一個。Referring to FIG. 1 , a schematic diagram of a data processing system 1 including a microprocessor 10 and a memory 30 is shown according to some embodiments. Microprocessor 10 is implemented to perform various data processing functionality by executing instructions stored in memory 30 . The memory 30 may include a level 2 (level 2; L2) cache memory and a level 3 (level 3; L3) cache memory and the main memory of the data processing system 1, wherein the L2 cache memory and the L3 cache memory Memory has faster access times than main memory. Memory can include random access memory (random access memory; RAM), dynamic random access memory (dynamic random access memory; DRAM), static random access memory (static random access memory; SRAM), read-only memory Body (read only memory; ROM), programmable read only memory (programmable read only memory; PROM), electrically programmable read only memory (electrically programmable read only memory; EPROM), electrically erasable programmable design At least one of an electrically erasable programmable read only memory (EEPROM) and a flash memory.

微處理器10可以是通用處理器（例如，中央處理單元）或專用處理器（例如，網路處理器、通信處理器、DPS、嵌入式處理器等）。處理器可具有如以下的指令集架構中的任一個：複雜精簡指令集計算（Complex Reduced Instruction Set Computing；CISC）、精簡指令集計算（Reduced Instruction Set Computing；RISC）、超長指令字（Very Long Instruction Word；VLIW）、其混合或其它類型的指令集架構。在實施例中的一些中，微處理器是在向量指令上執行預測或掩碼的RISC處理器。微處理器在單一微處理器內實施指令級並行性且通過執行多個指令每時鐘週期來實現高性能。將多個指令發出到不同功能單元以供並存執行。超標量微處理器可採用亂序（out-of-order；OOO）執行，其中可在第一指令之前執行對第一指令沒有任何相依性的第二指令。在傳統的亂序微處理器設計中，指令可亂序執行，但由於如分支誤預測、中斷以及精確異常的控制危險，所述指令必須有序引退到微處理器的暫存器組。如重新排序緩衝器和暫存器重命名的臨時存儲裝置用於結果資料，直到指令從執行管線有序引退為止。只要指令不具有資料相依性和控制危險，微處理器10就可通過將結果資料亂序寫回到暫存器組而亂序執行和引退指令。The microprocessor 10 may be a general-purpose processor (eg, central processing unit) or a special-purpose processor (eg, network processor, communication processor, DPS, embedded processor, etc.). The processor may have any one of the following instruction set architectures: Complex Reduced Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), Very Long Instruction Word (VLW) Instruction Word; VLIW), its hybrids, or other types of instruction set architectures. In some of the embodiments, the microprocessor is a RISC processor that performs prediction or masking on vector instructions. Microprocessors implement instruction-level parallelism within a single microprocessor and achieve high performance by executing multiple instructions per clock cycle. Multiple instructions are issued to different functional units for concurrent execution. A superscalar microprocessor may employ out-of-order (OOO) execution, wherein a second instruction without any dependency on the first instruction may be executed before the first instruction. In traditional out-of-order microprocessor designs, instructions may be executed out-of-order, but must be retired in-order to the microprocessor's scratchpad due to control hazards such as branch mispredictions, interrupts, and precise exceptions. Temporary storage such as reorder buffers and scratchpad renames are used for result data until an orderly retirement of an instruction from the execution pipeline. As long as the instruction has no data dependencies and control hazards, microprocessor 10 can execute and retire instructions out of order by writing the resulting data back out of order to the register bank.

參考圖1，微處理器10可包含指令快取記憶體11、分支預測單元（branch prediction unit；BPU）12、解碼/發射單元13、暫存器組14、記分板15、讀取/寫入控制單元16、載入/存儲單元17、資料快取記憶體18、多個執行佇列（execution queue；EQ）19A到執行佇列19E以及多個功能單元（functional unit，FUNT）20A到功能單元20C。微處理器10還包含讀取匯流排31和結果匯流排32。讀取匯流排31耦合到載入/存儲單元17、功能單元20A到功能單元20C以及暫存器組14以用於將運算元資料從暫存器組14中的暫存器傳輸到載入/存儲單元17以及功能單元20A到功能單元20C，這還可稱作從暫存器組14讀取運算元據的操作。結果匯流排32耦合到資料快取記憶體18、功能單元20A到功能單元20C以及暫存器組14以用於將資料從資料快取記憶體18或功能單元20A到功能單元20C傳輸到暫存器組14的暫存器，這還可稱作寫回結果資料到暫存器組14的操作。本文中以後接字母的特定參考標號指代的元件將由單獨參考標號共同指代。舉例來說，除非規定，否則執行佇列19A到執行佇列19E以及功能單元20A到功能單元20C可分別統稱為執行佇列19和功能單元20。本公開的一些實施例可使用比圖1中所示出的元件更多、更少或不同的組件。Referring to FIG. 1, the microprocessor 10 may include an instruction cache memory 11, a branch prediction unit (branch prediction unit; BPU) 12, a decode/issue unit 13, a register set 14, a scoreboard 15, a read/write Into control unit 16, load/store unit 17, data cache memory 18, multiple execution queues (execution queue; EQ) 19A to execution queue 19E, and multiple functional units (functional unit, FUNT) 20A to function Unit 20C. Microprocessor 10 also includes read bus 31 and result bus 32 . Read bus 31 is coupled to load/store unit 17, functional unit 20A to functional unit 20C, and register bank 14 for transferring operational metadata from registers in register bank 14 to load/store The storage unit 17 and the functional unit 20A to the functional unit 20C can also be referred to as the operation of reading the operation metadata from the register set 14 . Results bus 32 is coupled to data cache 18, functional unit 20A to functional unit 20C, and scratchpad bank 14 for transferring data from data cache 18 or functional unit 20A to functional unit 20C to scratchpad This may also be referred to as an operation of writing back result data to the scratchpad bank 14. Elements referred to herein by a specific reference number followed by a letter will be referred to collectively by a single reference number. For example, execution queue 19A to execution queue 19E and functional unit 20A to functional unit 20C may be collectively referred to as execution queue 19 and functional unit 20 , respectively, unless otherwise specified. Some embodiments of the present disclosure may use more, fewer or different components than those shown in FIG. 1 .

在一些實施例中，指令快取記憶體11耦合（未繪示）到記憶體30和解碼/發射單元13，且配置成存儲從記憶體30提取的指令並將所述指令分派到解碼/發射單元13。指令快取記憶體11包含來自記憶體30的連續指令位元組的許多快取記憶體線。快取記憶體線被組織為直接映射、完全關聯映射或組相聯映射等。直接映射、完全關聯映射以及組相聯映射在相關技術中是眾所周知的，因此省略關於以上映射的詳細描述。In some embodiments, instruction cache 11 is coupled (not shown) to memory 30 and to decode/issue unit 13 and is configured to store instructions fetched from memory 30 and dispatch the instructions to decode/issue Unit 13. Instruction cache 11 contains a number of cache lines of consecutive instruction bytes from memory 30 . Cache lines are organized as direct-mapped, fully-associative-mapped, or set-associative-mapped, etc. Direct mapping, fully associative mapping, and set associative mapping are well known in the related art, so detailed descriptions about the above mappings are omitted.

指令快取記憶體11可包含標籤陣列（未繪示）和資料陣列（未繪示）以用於分別存儲位址的一部分和由微處理器10頻繁使用的指令的資料。標籤陣列中的每一標籤對應於資料陣列中的快取記憶體線。當微處理器10需要執行指令時，微處理器10首先通過將指令的位址與存儲於標籤陣列中的標籤進行比較來檢查指令在指令快取記憶體11中的存在。如果指令位址與標籤陣列中的標籤中的一個匹配（即，快取記憶體命中），那麼從資料陣列提取對應快取記憶體線。如果指令位址不與標籤陣列中的任何條目匹配（即，快取記憶體未命中），那麼微處理器10可存取記憶體30以找到指令。在一些實施例中，微處理器10更包含指令佇列（未繪示），所述指令佇列耦合到指令快取記憶體11和解碼/發射單元13以用於在將指令發送到解碼/發射單元13之前存儲來自指令快取記憶體11或記憶體30的指令。The instruction cache memory 11 may include a tag array (not shown) and a data array (not shown) for respectively storing a part of addresses and data of instructions frequently used by the microprocessor 10 . Each tag in the tag array corresponds to a cache line in the data array. When the microprocessor 10 needs to execute an instruction, the microprocessor 10 first checks the existence of the instruction in the instruction cache 11 by comparing the address of the instruction with the tags stored in the tag array. If the instruction address matches one of the tags in the tag array (ie, a cache hit), then the corresponding cache line is fetched from the data array. If the instruction address does not match any entry in the tag array (ie, a cache miss), microprocessor 10 may access memory 30 to find the instruction. In some embodiments, the microprocessor 10 further includes an instruction queue (not shown), which is coupled to the instruction cache 11 and the decode/issue unit 13 for sending the instruction to the decode/issue unit 13. The issue unit 13 previously stores instructions from the instruction cache 11 or the memory 30 .

BPU 12耦合到指令快取記憶體11且配置成推測地提取在分支指令之後的指令。BPU 12可基於分支指令的過往行為而提供對分支指令的分支方向（採取或不採取）的預測，且提供所採取的分支指令的預測分支目標位址。分支方向可被“採取”，其中從所採取分支指令的分支目標位址提取後續指令。分支方向可“未採取”，其中從與分支指令連續的記憶體位置提取後續指令。在一些實施例中，BPU 12實施用於從基本塊的起始位址預測基本塊的結束的基本塊分支預測。基本塊的起始位址（例如，基本塊的第一指令的位址）可以是先前採取的分支指令的目標位址。基本塊的結束位址是在基本塊的最後一個指令之後的指令位址，所述指令位址可以是另一基本塊的起始位址。基本塊可包含多個指令，且當基本塊中的分支跳躍到另一基本塊時，所述基本塊結束。BPU 12 is coupled to instruction cache 11 and is configured to speculatively fetch instructions following a branch instruction. The BPU 12 may provide a prediction of the branch direction (taken or not taken) of the branch instruction based on the past behavior of the branch instruction and provide the predicted branch target address of the taken branch instruction. Branch directions may be "taken," wherein subsequent instructions are fetched from the branch target address of the taken branch instruction. A branch direction may be "not taken," wherein subsequent instructions are fetched from memory locations consecutive to the branch instruction. In some embodiments, the BPU 12 implements basic block branch prediction for predicting the end of the basic block from the starting address of the basic block. The starting address of the basic block (eg, the address of the first instruction of the basic block) may be the target address of a previously taken branch instruction. The end address of a basic block is the instruction address after the last instruction of the basic block, which may be the start address of another basic block. A basic block may contain multiple instructions, and the basic block ends when a branch in the basic block jumps to another basic block.

解碼/發射單元13可解碼從指令快取記憶體11接收到的指令。指令可包含以下欄位：操作代碼（或作業碼）、運算元（例如，來源運算元和目的地運算元）以及立即值。作業碼可指定進行何種操作（例如，ADD、SUBTRACT、SHIFT、STORE、LOAD等）。運算元可指定暫存器組14中的暫存器的索引(即，位址)，其中來源運算元指示來自暫存器組的將從其讀取操作的暫存器，且目的地運算元指示操作的結果資料將寫回到的在暫存器組中的暫存器。應注意，來源運算元和目的地運算元還可稱作來源暫存器和目的地暫存器，其在下文中可互換使用。在實施例中，運算元將需要5位元索引來識別具有32個暫存器的暫存器組中的暫存器。一些指令可使用如指令中所指定的立即值而非暫存器資料。每一指令將在功能單元20或載入/存儲單元17中執行。基於由作業碼指定的操作類型和資源的可用性（例如，暫存器、功能單元等），每一指令將具有執行時延時間和吞吐時間。執行時延時間（或時延時間）是指用於執行由指令指定的操作以完成和寫回結果資料的時間量（即，時鐘週期數）。吞吐時間是指當下一指令可進入功能單元20時的時間量（即，時鐘週期數）。The decode/issue unit 13 can decode instructions received from the instruction cache 11 . An instruction can contain the following fields: operation code (or operation code), operands (for example, source operand and destination operand), and immediate value. The job code specifies what operation to perform (for example, ADD, SUBTRACT, SHIFT, STORE, LOAD, etc.). An operand may specify the index (i.e., address) of a register in the register bank 14, where the source operand indicates the register from which the operation is to be read from the register bank, and the destination operand Indicates the register in the register group to which the result data of the operation will be written back. It should be noted that source operand and destination operand may also be referred to as source register and destination register, which are used interchangeably hereinafter. In an embodiment, an operand will require a 5-bit index to identify a register in a bank of 32 registers. Some instructions may use immediate values instead of register data as specified in the instruction. Each instruction will be executed in functional unit 20 or load/store unit 17 . Each instruction will have an execution latency and throughput time based on the type of operation specified by the jobcode and the availability of resources (eg, registers, functional units, etc.). Execution latency (or latency) is the amount of time (ie, the number of clock cycles) it takes to perform the operation specified by an instruction to complete and write back the resulting data. Throughput time refers to the amount of time (ie, the number of clock cycles) when the next instruction can enter the functional unit 20 .

在實施例中，指令在解碼/發佈單元13中解碼以基於作業碼而獲得執行時延時間、吞吐時間以及指令類型。多個指令可分派到一個執行佇列19，其中累積多個指令的吞吐時間。鑒於執行佇列19中的先前分派指令，所累積吞吐時間指示待發出的下一指令何時可進入功能單元20以用於執行（即，在進入功能單元20之前指令必須等待的時間量）。當待發出的指令可發送到功能單元20時界定的時間稱作讀取時間（來自暫存器組），且當指令由功能單元20完成時界定的時間稱為寫入時間（到暫存器組）。指令分派到執行佇列19，其中每一所分派指令具有調度的讀取時間以發出到對應功能單元20或載入/存儲單元17以供執行。在指令的分派處，執行佇列19中的所分派指令的累積吞吐時間是待分派的指令的讀取時間。當指令分派到執行佇列19的下一可用條目時，將待分派的指令的執行時延時間添加到累積的吞吐時間以產生寫入時間。修改後的執行時延時間在本文中將稱作最近分派的指令的寫入時間，且修改後的開始時間在本文中將稱作待發出的下一指令的讀取時間。寫入時間和讀取時間也可統稱為存取時間，所述存取時間描述將所分派指令寫入到暫存器組14的暫存器或從暫存器組14的暫存器讀取所分派指令的特定時間點。由於來源暫存器調度為從暫存器組14恰好及時讀取以供功能單元20執行，所以來源暫存器在執行佇列中不需要臨時暫存器，這是本實施例的微處理器相較於其他實施例的其它微處理器的優點。由於目的地暫存器調度為在未來的確切時間從功能單元20或資料快取記憶體18寫回到暫存器組14，因此即使與其他實施例中的其它功能單元20或資料快取記憶體18存在衝突，那麼仍不需要臨時暫存器來存儲結果資料，這也是相較於其它微處理器的優點。對於超過一個指令的並行發出，可基於在第二指令之前發出的第一指令而進一步調整第二指令的寫入時間和讀取時間。In an embodiment, instructions are decoded in the decode/issue unit 13 to obtain execution latency, throughput time, and instruction type based on the job code. Multiple commands can be dispatched to an execution queue 19 where the throughput times of the multiple commands are accumulated. Given the previously dispatched instructions in execution queue 19, the accumulated throughput time indicates when the next instruction to issue may enter functional unit 20 for execution (ie, the amount of time an instruction must wait before entering functional unit 20). The time defined when the instruction to be issued can be sent to the functional unit 20 is called the read time (from the scratchpad bank), and the time defined when the instruction is completed by the functional unit 20 is called the write time (to the scratchpad bank). Group). Instructions are dispatched to an execution queue 19, with each dispatched instruction having a scheduled fetch time to issue to a corresponding functional unit 20 or load/store unit 17 for execution. At the dispatch of an instruction, the cumulative throughput time of the dispatched instructions in the execution queue 19 is the fetch time of the instruction to be dispatched. When an instruction is dispatched to the next available entry in the execution queue 19, the execution latency of the instruction to be dispatched is added to the accumulated throughput time to generate the write time. The modified execution latency will be referred to herein as the write time of the most recently dispatched instruction, and the modified start time will be referred to herein as the read time of the next instruction to issue. Write time and read time may also be collectively referred to as access time, which describes the writing of dispatched instructions to or reading from the scratchpads of scratchpad bank 14. A specific point in time at which an instruction is dispatched. Since the source register is scheduled to be read from the register bank 14 just in time for the execution of the functional unit 20, the source register does not require a temporary register in the execution queue, which is why the microprocessor of this embodiment Advantages of other microprocessors compared to other embodiments. Since the destination register is scheduled to be written back to the scratchpad bank 14 from the functional unit 20 or data cache 18 at an exact time in the future, even with other functional units 20 or data caches in other embodiments If there is a conflict in the body 18, then there is no need for a temporary register to store the result data, which is also an advantage compared to other microprocessors. For parallel issue of more than one instruction, the write time and read time of the second instruction may be further adjusted based on the first instruction issued before the second instruction.

對於向量處理，解碼/發射單元13從暫存器組14的掩碼向量暫存器v(0)讀取掩碼資料且將具有向量指令的掩碼資料附接到執行佇列19。每一執行佇列19包含掩碼佇列21以保存執行佇列19中的每一所發出向量指令的掩碼資料。當指令從執行佇列19發出到功能單元20時，掩碼資料從掩碼佇列21讀取（如果啟用掩碼操作）且與指令一起發送到功能單元20。For vector processing, decode/issue unit 13 reads mask data from mask vector register v(0) of register set 14 and attaches the mask data with the vector instructions to execute queue 19 . Each execution queue 19 includes a mask queue 21 for storing mask data for each vector instruction issued in the execution queue 19 . When an instruction is issued from the execution queue 19 to the functional unit 20, mask data is read from the mask queue 21 (if masking is enabled) and sent to the functional unit 20 along with the instruction.

在實施例中，解碼/發射單元13配置成在發出指令之前檢查和解決所有可能衝突。指令可具有以下4種基本類型的衝突：（1）資料相依性，其包含讀後寫（write-after-read；WAR）、寫後讀（read-after-write；RAW）以及寫後寫（write-after-write；WAW）相依性；（2）用以從暫存器組讀取資料到功能單元的讀取埠的可用性；（3）用以從功能單元寫回資料到暫存器組的寫入埠的可用性；以及（4）用以執行資料的功能單元20的可用性。在指令可分派到執行佇列19之前，解碼/發射單元13可存取記分板15以檢查資料相依性。此外，暫存器組14具有有限數目個讀取和寫入埠，且所發佈指令必須仲裁或保留讀取和寫入埠以在未來時間存取暫存器組14。解碼/發射單元13可存取讀取/寫入控制單元16以檢查暫存器組14的讀取埠和寫入埠的可用性，以便調度指令的存取時間（即，讀取和寫入時間）。在其它實施例中，其中一個寫入埠可專用於具有未知寫入時間而不使用寫入埠控制的指令，且其中一個讀取埠可保留以用於具有未知讀取時間而不使用讀取埠控制的指令。暫存器組14的讀取埠可動態地保留（不專用）以用於具有未知存取時間的讀取操作。在這種情況下，功能單元20必須確保在試圖從暫存器組14讀取資料時，讀取埠不忙碌。在實施例中，可通過與其中累積所排隊指令（即，先前分派到執行佇列）的吞吐時間的執行佇列19協調來解析功能單元20的可用性。基於執行佇列中的累積吞吐時間，指令可分派到執行佇列19，其中指令可調度為在功能單元20可用時的未來的特定時間發佈到功能單元20。In an embodiment, the decode/transmit unit 13 is configured to check and resolve all possible conflicts before issuing an instruction. Instructions can have the following four basic types of conflicts: (1) Data dependencies, which include write-after-read (WAR), read-after-write (RAW), and write-after-write ( write-after-write; WAW) dependency; (2) Availability of a read port to read data from the register bank to the functional unit; (3) Write data back from the functional unit to the register bank Availability of the write port; and (4) availability of the functional unit 20 for executing the data. Before an instruction can be dispatched to execution queue 19, decode/issue unit 13 may access scoreboard 15 to check for data dependencies. Furthermore, the register bank 14 has a limited number of read and write ports, and issued instructions must arbitrate or reserve the read and write ports for accessing the register bank 14 at a future time. The decode/issue unit 13 can access the read/write control unit 16 to check the availability of the read port and the write port of the register bank 14 in order to schedule the access times of the instructions (i.e., the read and write times ). In other embodiments, one of the write ports may be dedicated to commands with unknown write times that do not use write port control, and one of the read ports may be reserved for commands with unknown read times that do not use read port control commands. The read ports of the register bank 14 can be dynamically reserved (not dedicated) for read operations with unknown access times. In this case, the functional unit 20 must ensure that the read port is not busy when attempting to read data from the register set 14 . In an embodiment, the availability of functional units 20 may be resolved by coordinating with the execution queue 19 where the throughput time of queued instructions (ie, previously dispatched to the execution queue) is accumulated. Based on the cumulative throughput time in the execution queue, instructions may be dispatched to execution queue 19, where the instruction may be scheduled to be issued to functional unit 20 at a specific time in the future when functional unit 20 becomes available.

圖2為示出根據本公開的一些實施例的暫存器組14和記分板15的框圖。暫存器組14可包含多個向量暫存器v(0)到向量暫存器v(N)、讀取埠（未繪示）以及寫入埠（未繪示），其中N是大於1的整數。在實施例中，暫存器組14可包含標量暫存器和/或向量暫存器。本公開並不希望限制暫存器組14中的暫存器、讀取埠以及寫入埠的數目。在實施例中，暫存器組14中包含的向量暫存器中的一個將用於呈現向量處理的掩碼資料，且因此除非另外規定，否則術語“暫存器”的出現在下文中是指向量暫存器。記分板15包含多個條目150(0)到條目150(N)，且每一記分板條目對應於暫存器組14中的一個暫存器且記錄與對應暫存器相關的資訊。在實施例中，記分板15具有與暫存器組14相同數目個條目（即，N數目個條目），但本公開並不希望限制記分板15中的條目的數目。FIG. 2 is a block diagram illustrating a register bank 14 and a scoreboard 15 according to some embodiments of the present disclosure. The register set 14 may include a plurality of vector registers v(0) to vector registers v(N), read ports (not shown) and write ports (not shown), wherein N is greater than 1 an integer of . In an embodiment, register bank 14 may include scalar registers and/or vector registers. The present disclosure does not intend to limit the number of registers, read ports, and write ports in the register bank 14 . In an embodiment, one of the vector registers contained in the register set 14 will be used to present masking material for vector processing, and therefore unless otherwise specified, occurrences of the term "register" hereinafter are Points to the scratchpad. The scoreboard 15 includes a plurality of entries 150(0) to 150(N), and each scoreboard entry corresponds to a register in the register set 14 and records information related to the corresponding register. In an embodiment, scoreboard 15 has the same number of entries as scratchpad bank 14 (ie, N number of entries), but the present disclosure does not wish to limit the number of entries in scoreboard 15 .

圖3A到圖3B是示出根據本公開的一些實施例的記分板條目的各種結構的圖式。在實施例中，記分板15可包含用於處置到暫存器組14的寫回操作的第一記分板151和用於處置來自暫存器組14的讀取操作的第二記分板152。第一記分板151和第二記分板152可或可不共存於微處理器10中。本公開並不意欲限於此。在其它實施例中，可將第一記分板151和第二記分板152實施為或查看為處置讀取和寫入操作兩者的一個記分板15。圖3A示出用於所發出指令的目的地暫存器的第一記分板151。圖3B示出用於所發出指令的來源暫存器的記分板15。參考圖3A，第一記分板151的每一條目1510(0)到條目1510(N)包含未知欄位（“未知”）1511、寫入計數欄位（“cnt”）1513以及功能單元欄位（“funit”）1515。這些欄位中的每一個記錄與待由所發出指令寫入的對應目的地暫存器相關的資訊。可在發出指令時設置記分板條目的這些欄位。3A-3B are diagrams illustrating various structures of scoreboard entries according to some embodiments of the present disclosure. In an embodiment, the scoreboard 15 may include a first scoreboard 151 for handling write-back operations to the scratchpad bank 14 and a second scoreboard for handling read operations from the scratchpad bank 14 plate 152. The first scoreboard 151 and the second scoreboard 152 may or may not coexist in the microprocessor 10 . The present disclosure is not intended to be limited in this regard. In other embodiments, first scoreboard 151 and second scoreboard 152 may be implemented or viewed as one scoreboard 15 that handles both read and write operations. FIG. 3A shows a first scoreboard 151 for the destination register of issued instructions. Figure 3B shows the scoreboard 15 for the source register of issued instructions. Referring to FIG. 3A , each entry 1510(0) through entry 1510(N) of the first scoreboard 151 includes an unknown field ("unknown") 1511, a write count field ("cnt") 1513, and a functional unit field bit("funit") 1515. Each of these fields records information related to the corresponding destination register to be written by the issued command. These fields for scoreboard entries can be set when an order is issued.

未知欄位1511包含指示對應於記分板條目的暫存器的寫入時間是已知還是未知的位元值。舉例來說，未知欄位1511可包含一個位元或任何數目個位元，其中非零值表示暫存器具有未知寫入時間，且零值表示暫存器具有如由寫入計數欄位1513所指示的已知寫入時間。未知欄位1511可在指令的發出時間設置或修改且在解析未知暫存器寫入時間之後復位。舉例來說，重定操作可由解碼/發射單元13、載入/存儲單元17（例如，在資料命中之後）或功能單元20（例如，在INT DIV整數除法操作解析需要被除(divide)的數位的數目之後）以及微處理器中的涉及執行具有未知寫入時間的指令的其它單元執行。寫入計數欄位1513記錄指示在暫存器可由（待發出的）下一指令存取之前的時鐘週期數的寫入計數值(即，寫入時間)。換句話說，寫入計數欄位1513記錄在其中先前發出的指令將完成操作且將結果資料寫回到暫存器的時鐘週期數。寫入計數欄位1513的寫入計數值在指令的發出時間基於指令的寫入時間（也可稱為執行時延時間）而設置。接著，寫入計數值針對每一時鐘週期倒計數（即，減小一）直到計數值變為零為止（即，自復位計數器（self-reset counter））。記分板條目的功能單元欄位1515指定待寫回到暫存器的功能單元20（由所發出指令指定）。Unknown field 1511 contains a bit value indicating whether the write time of the scratchpad corresponding to the scoreboard entry is known or unknown. For example, unknown field 1511 may contain one bit or any number of bits, where a non-zero value indicates that the register has an unknown write time, and a value of zero indicates that the register has an unknown write time as indicated by write count field 1513. The indicated known write time. The unknown field 1511 can be set or modified at the issue time of the command and reset after resolving the unknown register write time. For example, the reset operation can be performed by the decode/transmit unit 13, the load/store unit 17 (e.g., after a data hit), or the functional unit 20 (e.g., after an INT DIV integer division operation parses the number of digits that need to be divided). number) and other units in the microprocessor that involve executing instructions with unknown write times. Write count field 1513 records a write count value (ie, write time) indicating the number of clock cycles before the scratchpad can be accessed by the next instruction (to be issued). In other words, the write count field 1513 records the number of clock cycles in which a previously issued instruction will complete the operation and write the resulting data back to the scratchpad. The write count value of the write count field 1513 is set based on the write time of the command (also referred to as execution latency) at the time when the command is issued. Then, the write count value counts down (ie, decreases by one) for each clock cycle until the count value becomes zero (ie, self-reset counter). The functional unit field 1515 of the scoreboard entry specifies the functional unit 20 (specified by the issued command) to be written back to the scratchpad.

參考圖3B，第二記分板152具有記分板條目1520(0)到記分板條目1520(N)的結構，且第二記分板152被用來在分派指令之前，先解析WAR資料相依性，即解析寫入目的地暫存器與讀取對應於記分板條目的暫存器時的衝突。這一記分板還可稱作用於解析WAR資料相依性的WAR記分板。記分板條目1520(0)到記分板條目1520(N)中的每一個條目包含未知欄位1521和讀取計數欄位1523。可在WAR記分板的實施方案中省略功能單元欄位。未知欄位1521包含表示對應於記分板條目的暫存器的讀取時間是已知還是未知的位元值。舉例來說，未知欄位1521可包含一個位元，其中非零值表示暫存器具有未知讀取時間，且零值表示暫存器具有如讀取計數欄位1523所指示的已知讀取時間。類似於圖3A中的未知欄位1511，未知欄位1521可包含任何位數，以表示已有一或多個已發出的指令具有未知讀取時間。未知欄位1521的操作和功能性類似於未知欄位1511，且因此，出於簡潔的目的省略其細節。讀取計數欄位1523記錄計數值，此計數值表示，已發出的指令將於多少個時鐘週期後，進行最後一次對應暫存器的讀取。讀取計數值針對每一時鐘週期倒計數（即，減小一），直到讀取計數值達到0為止。除非規定，否則讀取計數欄位1523的操作和功能性類似於寫入計數欄位1513，且因此省略其細節。Referring to FIG. 3B, the second scoreboard 152 has a structure of scoreboard entry 1520(0) to scoreboard entry 1520(N), and the second scoreboard 152 is used to parse WAR data before dispatching instructions. Dependencies, that is, resolving conflicts between writing to the destination register and reading the register corresponding to the scoreboard entry. This scoreboard can also be referred to as the WAR scoreboard for resolving dependencies of WAR data. Each of scoreboard entry 1520(0) through scoreboard entry 1520(N) includes an unknown field 1521 and a read count field 1523 . The functional unit field may be omitted in a WAR scoreboard implementation. Unknown field 1521 contains a bit value indicating whether the read time of the scratchpad corresponding to the scoreboard entry is known or unknown. For example, unknown field 1521 may contain a bit where a non-zero value indicates that the register has an unknown read time, and a value of zero indicates that the register has a known read time as indicated by read count field 1523 . Similar to unknown field 1511 in FIG. 3A , unknown field 1521 may contain any number of digits to indicate that there have been one or more issued commands with unknown read times. The operation and functionality of unknown field 1521 is similar to unknown field 1511, and thus, its details are omitted for the sake of brevity. The read count field 1523 records the count value, which indicates the number of clock cycles after which the issued instruction will be read from the corresponding register for the last time. The read count value counts down (ie, decreases by one) for each clock cycle until the read count value reaches zero. Unless specified, the operation and functionality of the read count field 1523 is similar to the write count field 1513, and details thereof are therefore omitted.

參考圖1，讀取/寫入控制單元16配置成在將來的多個時鐘週期處記錄暫存器組14的讀取埠和/或寫入埠的可用性，以用於調度待發出的指令的存取。在發出指令時，基於指令所指定的存取時間，解碼/發射單元13存取讀取/寫入控制單元16以檢查暫存器組14的讀取埠和/或寫入埠的可用性。詳細地說，讀取/寫入控制單元16在未來時間選擇可用讀取埠作為將來源運算元讀取到功能單元20的所調度讀取時間，且在未來時間選擇可用寫入埠作為寫回來自功能單元20的結果資料的所調度寫入時間。在實施例中，讀取/寫入控制單元16可包含用於調度讀取埠和寫入埠的讀取移位器（未繪示）和寫入移位器（未繪示）。讀取移位器和寫入移位器中的每一個包含多個移位器條目，其中每一條目對應於將來的時鐘週期且在對應時鐘週期處記錄待存取的暫存器和待存取暫存器的功能單元的位址。在實施例中，將在每一時鐘週期移出一個條目。在一些實施例中，暫存器組14的每一讀取埠和每一寫入埠可對應於讀取移位器和寫入移位器。Referring to FIG. 1 , the read/write control unit 16 is configured to record the availability of the read port and/or write port of the scratchpad bank 14 at multiple clock cycles in the future for use in scheduling the availability of instructions to be issued. access. When issuing a command, the decode/transmit unit 13 accesses the read/write control unit 16 to check the availability of the read port and/or write port of the register set 14 based on the access time specified by the command. In detail, the read/write control unit 16 selects an available read port at a future time as the scheduled read time for reading source operands to the functional unit 20, and selects an available write port at a future time as a write-back Scheduled write time for result data from functional unit 20. In an embodiment, the read/write control unit 16 may include a read shifter (not shown) and a write shifter (not shown) for scheduling the read port and the write port. Each of the read shifter and the write shifter contains a plurality of shifter entries, where each entry corresponds to a future clock cycle and records the scratchpad to be accessed and the scratchpad to be stored at the corresponding clock cycle. Fetch the address of the functional unit of the scratchpad. In an embodiment, one entry will be shifted out every clock cycle. In some embodiments, each read port and each write port of register set 14 may correspond to a read shifter and a write shifter.

向量執行佇列19配置成保持調度為發出到功能單元20的所發出向量指令。在實施例中，每一向量執行佇列19包含存儲對應於分派到執行佇列19的向量指令的掩碼資料的掩碼佇列21。參考圖1，向量執行佇列19A包含掩碼佇列21A，向量執行佇列19B包含掩碼佇列21B，向量執行佇列19C包含掩碼佇列21C，等等。功能單元20可包含但不限於：整數乘法、整數除法、算數邏輯單位（arithmetic logic unit；ALU）、浮點單元（floating-point unit；FPU）、分支執行單元（branch execution unit；BEU）、接收已解碼的指令且執行操作的單元或類似物。在實施例中，執行佇列19中的每一個耦合到或專用於功能單元20中的一個。舉例來說，執行佇列19A耦合在解碼/發射單元13與對應功能單元20A之間以將指令依序發出所述指令至對應的功能單元20A。類似地，執行佇列19B耦合在解碼/發射單元13與對應功能單元20B之間，且執行佇列19C耦合在解碼/發射單元13與對應功能單元20C之間。在實施例中，執行佇列19D、執行佇列19E耦合在解碼/發射單元13與載入/發出單元17之間以處置載入/存儲指令。Vector execution queue 19 is configured to hold issued vector instructions scheduled for issue to functional units 20 . In an embodiment, each vector execution queue 19 includes a mask queue 21 that stores mask data corresponding to the vector instructions dispatched to the execution queue 19 . Referring to FIG. 1 , vector execution queue 19A includes mask queue 21A, vector execution queue 19B includes mask queue 21B, vector execution queue 19C includes mask queue 21C, and so on. The functional unit 20 may include but not limited to: integer multiplication, integer division, arithmetic logic unit (arithmetic logic unit; ALU), floating-point unit (floating-point unit; FPU), branch execution unit (branch execution unit; BEU), receiving A unit or similar that decodes instructions and performs operations. In an embodiment, each of the execution queues 19 is coupled to or dedicated to one of the functional units 20 . For example, the execution queue 19A is coupled between the decode/transmit unit 13 and the corresponding functional unit 20A to sequentially issue the instructions to the corresponding functional unit 20A. Similarly, execute queue 19B is coupled between decode/transmit unit 13 and corresponding functional unit 20B, and execute queue 19C is coupled between decode/transmit unit 13 and corresponding functional unit 20C. In an embodiment, execute queue 19D, execute queue 19E are coupled between decode/issue unit 13 and load/issue unit 17 to handle load/store instructions.

圖4是示出根據本公開的一些實施例的向量執行佇列19的圖式。向量執行佇列19可包含多個執行佇列條目190(0)到執行佇列條目190(Q)，以用於記錄關於從解碼/發射單元13發出的向量指令的資訊，其中Q是大於1的整數。在實施例中，執行佇列19的每一條目包含但不限於有效欄位（“v”）191、執行控制資料欄位（“ex_ctrl”）193、位址欄位（“vd”）195、吞吐時間欄位（“xput_cnt”）197以及微操作計數欄位（“mop_cnt”）198。實施例並不意欲限制向量執行佇列19的每一條目中包含的欄位的資訊或數目。在替代性實施例中，更多或更少欄位可用於在每一執行佇列中記錄更少或更多資訊。還應注意，功能單元20A到功能單元20C中的每一個可耦合到與圖4中示出的向量執行佇列相同或類似的向量執行佇列，其中向量執行佇列19A到向量執行佇列19C中的每一個接收從解碼/發射單元13分派的向量指令且將接收到的向量指令發出到對應功能單元20A到功能單元20C。FIG. 4 is a diagram illustrating a vector execution queue 19 according to some embodiments of the present disclosure. Vector execution queue 19 may include a plurality of execution queue entries 190(0) through 190(Q) for recording information about vector instructions issued from decode/issue unit 13, where Q is greater than 1 an integer of . In an embodiment, each entry of the execution queue 19 includes, but is not limited to, a valid field ("v") 191, an execution control data field ("ex_ctrl") 193, an address field ("vd") 195, A throughput time field ("xput_cnt") 197 and a micro-op count field ("mop_cnt") 198 . The embodiments are not intended to limit the information or number of fields included in each entry of the vector execution queue 19 . In alternative embodiments, more or fewer fields may be used to record less or more information in each execution queue. It should also be noted that each of functional unit 20A through functional unit 20C may be coupled to a vector execution queue that is the same or similar to that shown in FIG. 4 , with vector execution queue 19A through vector execution queue 19C Each of these receives a dispatched vector instruction from decode/issue unit 13 and issues the received vector instruction to a corresponding functional unit 20A-20C.

有效欄位191指示條目是否有效（例如，有效條目由“1”指示且無效條目由“0”指示）。執行控制資料欄位193指示接收到的向量指令的對應功能單元20的執行控制資訊。位址欄位195記錄向量指令存取到的暫存器的位址。吞吐時間欄位197記錄表示功能單元20接受對應於執行佇列條目的向量指令的時鐘週期數的吞吐時間值。換句話說，在吞吐時間欄位197中指定的時鐘週期數到期之後，功能單元20將自由地接受向量執行佇列19中的向量指令。吞吐時間值針對每一時鐘週期倒計數（即，減小一），直到吞吐時間值達到零為止。當吞吐時間值達到0時，執行佇列19將對應執行佇列條目中的向量指令發出到功能單元20。微操作欄位198記錄表示由執行佇列條目的向量指令指定的微操作的數目。微操作計數值針對一個微操作的每一調度遞減一，直到微操作計數值達到0為止。當當前執行佇列條目的微操作計數值和吞吐時間值倒數至0時，當前執行佇列條目變成無效的條目，可開始處理後續有效的執行佇列條目。Valid field 191 indicates whether the entry is valid (eg, valid entries are indicated by "1" and invalid entries are indicated by "0"). The execution control data field 193 indicates the execution control information of the corresponding functional unit 20 of the received vector instruction. The address field 195 records the address of the register accessed by the vector instruction. The throughput time field 197 records a throughput time value representing the number of clock cycles for the functional unit 20 to accept the vector instruction corresponding to the execution queue entry. In other words, functional unit 20 will be free to accept vector instructions in vector execution queue 19 after the number of clock cycles specified in throughput time field 197 has expired. The throughput time value counts down (ie, decreases by one) for each clock cycle until the throughput time value reaches zero. When the throughput time value reaches 0, the execution queue 19 issues the vector instruction corresponding to the execution queue entry to the functional unit 20 . The micro-op field 198 records the number of micro-ops specified by the vector instruction executing the queue entry. The micro-operation count value is decremented by one for each dispatch of a micro-operation until the micro-operation count value reaches zero. When the micro-operation count value and the throughput time value of the current execution queue entry count down to 0, the current execution queue entry becomes an invalid entry, and subsequent valid execution queue entries can be processed.

執行佇列19可包含或耦合到用於存儲累積計數值acc_cnt的累積計數器199，所述累積計數值acc_cnt針對每一時鐘週期倒計數一，直到計數器值變為零為止。累積計數為零指示執行佇列19是空的。累積計數器199的累積計數值acc_cnt指示在其處解碼/發射單元13中的下一指令可經由執行佇列19發出到功能單元20或載入/存儲單元17的將來的時間（即，時鐘週期數）。在一些實施例中，指令的讀取時間為累積計數值，且根據下一指令的當前acc_cnt和指令吞吐時間(inst_xput_time）的總和（acc_cnt = acc_cnt + inst_xput_time）來設置累積計數值。在一些其它實施例中，可修改讀取時間，且根據下一指令的指令的讀取時間（rd_cnt）和指令的吞吐時間的總和（acc_cnt = rd_cnt + inst_xput_time）來設置累積計數值acc_cnt。在一些實施例中，讀取移位器161和寫入移位器163設計成與執行佇列19同步。舉例來說，執行佇列19可在根據讀取移位元器161從暫存器組14讀取來源暫存器的同時將指令發出到功能單元20或載入/存儲單元17，且來自功能單元20或載入/存儲單元17的結果資料根據寫入移位元器163寫回到暫存器組14。The execution queue 19 may include or be coupled to an accumulation counter 199 for storing an accumulation count value acc_cnt that counts down by one for each clock cycle until the counter value becomes zero. A cumulative count of zero indicates that the execution queue 19 is empty. The accumulative count value acc_cnt of the accumulative counter 199 indicates the time in the future (i.e., the number of clock cycles) at which the next instruction in the decode/issue unit 13 can issue to the functional unit 20 or the load/store unit 17 via the execute queue 19 ). In some embodiments, the fetch time of an instruction is the cumulative count value, and the cumulative count value is set according to the sum of the current acc_cnt and the instruction throughput time (inst_xput_time) of the next instruction (acc_cnt = acc_cnt + inst_xput_time). In some other embodiments, the read time may be modified and the accumulative count value acc_cnt is set according to the sum of the instruction's fetch time (rd_cnt) and the instruction's throughput time (acc_cnt = rd_cnt + inst_xput_time) of the next instruction. In some embodiments, the read shifter 161 and the write shifter 163 are designed to be synchronized with the execution queue 19 . For example, the execution queue 19 may issue instructions to the functional unit 20 or the load/store unit 17 while reading the source register from the register set 14 according to the read shifter 161, and from the function The result data of the unit 20 or the load/store unit 17 is written back to the register set 14 according to the write shifter 163 .

舉例來說，兩個執行佇列條目190(0)、執行佇列條目190(1)是有效的且分別記錄第一指令和在第一指令之後發出的第二指令。執行佇列條目190(0)中的第一指令具有如吞吐時間值197中所記錄的5個時鐘週期的吞吐時間和如mop_cnt欄位198中所記錄的微操作計數4。在實例中，對於每5個時鐘週期，第一指令的一個微操作將發送到功能單元20直到微操作計數達到0為止。第一執行佇列條目190(0)中的第一指令的總體執行吞吐時間將是20個時鐘週期（即，5個時鐘週期 × 4個微操作）。類似地，第二執行佇列條目190(1)中的第二指令的總體執行吞吐時間將是16個時鐘週期，這是由於存在8個微操作且各自具有2個時鐘週期的執行吞吐時間。累積吞吐時間器199將設置為36個時鐘週期，其將用於將第三指令分派到下一可用執行佇列條目（即，第三執行佇列條目190(2)）。For example, two execution queue entries 190(0), 190(1) are active and respectively record a first instruction and a second instruction issued after the first instruction. The first instruction in execution queue entry 190(0) has a throughput time of 5 clock cycles as recorded in throughput time value 197 and a micro-op count of 4 as recorded in mop_cnt field 198 . In an example, for every 5 clock cycles, one uop of the first instruction will be sent to the functional unit 20 until the uop count reaches zero. The overall execution throughput time for the first instruction in the first execution queue entry 190(0) will be 20 clock cycles (ie, 5 clock cycles x 4 uops). Similarly, the overall execution throughput time for the second instruction in the second execution queue entry 190(1) will be 16 clock cycles since there are 8 micro-ops each having an execution throughput time of 2 clock cycles. The cumulative throughput timer 199 will be set to 36 clock cycles, which will be used to dispatch the third instruction to the next available execution queue entry (ie, the third execution queue entry 190(2)).

參考圖1，載入/存儲單元17耦合到解碼/發射單元13以處置載入指令且存儲指令。在實施例中，解碼/發射單元13將載入/存儲指令發出為包含標籤微操作和資料微操作的兩個微操作（micro operations/micro-ops）。執行佇列19D、執行佇列19E分別稱為標籤執行佇列（tag execution queue；TEQ）19D和資料執行佇列（data execution queue；DEQ）19E，其中標籤微操作發送到TEQ 19D且資料微操作發送到DEQ 19E。在一些實施例中，載入/存儲指令的微操作的吞吐時間是1個週期。TEQ 19D和DEQ 19E是獨立操作，且在DEQ 19E發出用於資料操作的資料微操作之前，TEQ 19D發出用於標籤操作的標籤微操作。Referring to FIG. 1 , load/store unit 17 is coupled to decode/issue unit 13 to handle load instructions and store instructions. In an embodiment, the decode/issue unit 13 issues the load/store instruction as two micro-operations (micro operations/micro-ops) including a label micro-operation and a data micro-operation. The execution queue 19D and the execution queue 19E are respectively called the tag execution queue (TEQ) 19D and the data execution queue (data execution queue; DEQ) 19E, wherein the tag micro-operation is sent to the TEQ 19D and the data micro-operation Send to DEQ 19E. In some embodiments, the throughput time of the micro-operation of the load/store instruction is 1 cycle. TEQ 19D and DEQ 19E are independent operations, and TEQ 19D issues tag micro-ops for tag operations before DEQ 19E issues data micro-ops for data operations.

資料快取記憶體18耦合到暫存器組14、記憶體30以及載入/存儲單元17，且配置成臨時存儲從記憶體30提取的資料。載入/存儲單元17存取資料快取記憶體18以用於載入資料或存儲資料。資料快取記憶體18包含來自記憶體30的連續指令位元組的許多快取記憶體線。資料快取記憶體18的快取記憶體線被組織為類似於指令快取記憶體11的直接映射、完全關聯映射或組相聯映射，但不需要與指令快取記憶體11相同的映射。資料快取記憶體18可包含標籤陣列（tag array；TA）22和資料陣列（data array；DA）24以用於分別存儲由微處理器10頻繁使用的位址的一部分和資料。標籤陣列22中的每一標籤對應於資料陣列24中的快取記憶體線。當微處理器10需要執行載入/存儲指令時，微處理器10首先通過將載入/存儲位址與存儲於標籤陣列22中的標籤進行比較來檢查載入/存儲資料在資料快取記憶體18中是否存在。TEQ 19D將標籤操作發出到載入/存儲單元17的位址產生單元（address generation unit；AGU）171以計算載入/存儲位址。載入/存儲位址用於存取資料快取記憶體18的標籤陣列（TA）22。如果載入/存儲位址與標籤陣列中的標籤中的一個匹配（快取記憶體命中），那麼針對載入/存儲資料存取資料陣列24中的對應快取記憶體線。如果載入/存儲位址不與標籤陣列22中的任何條目匹配（快取記憶體未命中），那麼微處理器10可存取記憶體30以找到所述資料。在快取記憶體命中的情況下，載入/存儲指令的執行時延已知。在快取記憶體未命中的情況下，載入/存儲指令的執行時延是未知的。在一些實施例中，載入/存儲指令可基於假定快取記憶體命中的已知執行時延而發出，所述已知執行時延可以是預定計數值（例如2、3、6或任何數目個時鐘週期）。當遇到快取記憶體未命中時，載入/存儲指令的發出可將記分板15配置成指示對應暫存器具有與未知執行時延時間的資料相依性。The data cache 18 is coupled to the register set 14 , the memory 30 and the load/store unit 17 , and is configured to temporarily store data retrieved from the memory 30 . The load/store unit 17 accesses the data cache memory 18 for loading data or storing data. Data cache 18 contains a number of cache lines of sequential instruction bytes from memory 30 . The cache lines of data cache 18 are organized as direct mapped, fully associative mapped or set associative mapped similarly to instruction cache 11 , but need not be the same as instruction cache 11 . The data cache memory 18 may include a tag array (TA) 22 and a data array (DA) 24 for storing part of frequently used addresses and data by the microprocessor 10 respectively. Each tag in tag array 22 corresponds to a cache line in data array 24 . When the microprocessor 10 needs to execute a load/store instruction, the microprocessor 10 first checks the load/store data in the data cache by comparing the load/store address with the tags stored in the tag array 22. Whether it exists in body 18. The TEQ 19D issues tag operations to the address generation unit (AGU) 171 of the load/store unit 17 to calculate the load/store address. Load/store addresses are used to access tag array (TA) 22 of data cache 18 . If the load/store address matches one of the tags in the tag array (cache hit), then the corresponding cache line in the data array 24 is accessed for the load/store data. If the load/store address does not match any entry in tag array 22 (cache miss), microprocessor 10 may access memory 30 to find the data. In the case of a cache hit, the execution latency of the load/store instruction is known. In the case of a cache miss, the execution latency of the load/store instruction is unknown. In some embodiments, load/store instructions may be issued based on a known execution latency assuming a cache hit, which may be a predetermined count value (e.g., 2, 3, 6, or any number clock cycles). When a cache miss is encountered, the issuance of a load/store instruction may configure the scoreboard 15 to indicate that the corresponding register has a data dependency with an unknown execution latency.

在下文中，將解釋通過使用記分板15、執行佇列19中的指令的累積吞吐時間以及讀取/寫入控制單元16發出具有已知存取時間的指令的過程。In the following, the process of issuing instructions with known access times by using the scoreboard 15, the cumulative throughput time of the instructions in the execution queue 19, and the read/write control unit 16 will be explained.

當解碼/發射單元13從指令快取記憶體11接收指令時，解碼/發射單元13在發出所述指令之前存取記分板15以檢查任何資料相依性。具體地說，將檢查對應於暫存器的記分板條目的未知欄位和計數欄位以用於判斷先前發出的指令是否具有已知存取時間。在一些實施例中，還可存取累積計數器199的當前累積計數值以用於檢查功能單元20的可用性。如果先前已分派的指令（即，第一指令）和接收到的待分派的指令（即，第二指令）存取同一暫存器，那麼第二指令可能會有資料相依性。第二指令會在第一指令之後等待分派。通常，資料相依性可分類成寫後寫（WAW）相依性、寫後讀（RAW）相依性以及讀後寫（WAR）相依性。WAW相依性是指在第二指令可寫入到暫存器之前，第二指令必須等待第一指令將結果資料寫回到同一暫存器的情形。RAW相依性是指在第二指令可從暫存器讀取資料之前，第二指令必須等待第一指令寫回到同一暫存器的情形。WAR相依性是指在第二指令可寫入到暫存器之前，第二指令必須等待第一指令從同一暫存器讀取資料的情形。通過上文所描述的記分板15和執行佇列19，可發出具有已知存取時間的指令，並將其調度到將來時間以避免這些資料相依性。When decode/issue unit 13 receives an instruction from instruction cache 11, decode/issue unit 13 accesses scoreboard 15 to check for any data dependencies before issuing the instruction. Specifically, the unknown field and the count field of the scoreboard entry corresponding to the register are checked for determining whether a previously issued command has a known access time. In some embodiments, the current accumulated count value of the accumulated counter 199 may also be accessed for checking the availability of the functional unit 20 . If a previously dispatched instruction (ie, the first instruction) and a received instruction to be dispatched (ie, the second instruction) access the same register, then the second instruction may have a data dependency. The second instruction will wait to dispatch after the first instruction. In general, data dependencies can be classified into write-after-write (WAW) dependencies, read-after-write (RAW) dependencies, and write-after-read (WAR) dependencies. A WAW dependency refers to a situation where a second instruction must wait for a first instruction to write result data back to the same register before the second instruction can write to the same register. A RAW dependency refers to a situation where a second instruction must wait for a first instruction to write back to the same register before a second instruction can read data from the register. A WAR dependency refers to a situation where a second instruction must wait for a first instruction to read data from the same scratchpad before a second instruction can write to the same scratchpad. Through the scoreboard 15 and execution queue 19 described above, instructions with known access times can be issued and scheduled to a future time to avoid these data dependencies.

在處置RAW資料相依性的實施例中，如果計數欄位1513的計數值等於或小於待發出的指令的讀取時間（即，rd_cnt），那麼不存在RAW相依性，且解碼/發射單元可發出所述指令。如果計數欄位1513的計數值大於指令讀取時間與1的總和（即，rd_cnt+1），那麼存在RAW資料相依性，且解碼/發射單元13必須停止指令的發出。如果計數欄位1513的計數值等於指令讀取時間與1的總和（即，rd_cnt+1），那麼可從功能單元欄位1515中所記錄的功能單元轉發結果資料。在此類情況下，仍可發出具有RAW資料相依性的指令。功能單元欄位1515可用於將來自所記錄功能單元的結果資料轉發到待發出的指令的功能單元。在處置WAW資料相依性的實施例中，如果計數欄位1513的計數值大於或等於待發出的指令的寫入時間，那麼存在WAW資料相依性，且解碼/發射單元13必須停止指令的發出。在處置WAR資料相依性的實施例中，如果計數欄位1523的計數值（其記錄先前發出的指令的最後讀取時間）大於指令的寫入時間，那麼存在WAR資料相依性，且解碼/發射單元13必須停止指令的發出。如果計數欄位1523的計數值小於或等於指令的寫入時間，那麼不存在WAR資料相依性，且解碼/發射單元13可發出所述指令。In an embodiment that handles RAW data dependencies, if the count value of the count field 1513 is equal to or less than the read time of the command to issue (i.e., rd_cnt), then there is no RAW dependency and the decode/issue unit may issue the instruction. If the count value of the count field 1513 is greater than the sum of the command fetch time and 1 (ie, rd_cnt+1), then there is a RAW data dependency and the decode/issue unit 13 must stop issuing commands. If the count value of the count field 1513 is equal to the sum of the command fetch time and 1 (ie, rd_cnt+1), then the result data may be forwarded from the functional unit recorded in the functional unit field 1515 . In such cases, commands with RAW data dependencies can still be issued. Functional unit field 1515 may be used to forward result data from the recorded functional unit to the functional unit of the command to be issued. In an embodiment that handles WAW data dependencies, if the count value of count field 1513 is greater than or equal to the write time of the command to issue, then there is a WAW data dependency and decode/issue unit 13 must stop issuing commands. In an embodiment that handles WAR data dependencies, if the count value of the count field 1523 (which records the last read time of a previously issued command) is greater than the write time of the command, then there is a WAR data dependency and the decode/launch Unit 13 must stop issuing commands. If the count value of count field 1523 is less than or equal to the write time of the command, then there is no WAR data dependency and decode/issue unit 13 may issue the command.

基於記分板15的計數欄位中的計數值，解碼/發射單元13可預測暫存器的可用性且將指令的執行調度到執行佇列19，其中執行佇列19可從解碼/發射單元13接收到指令的次序將指令依序發出到功能單元20。執行佇列19可累積執行佇列19中的所有指令的吞吐時間以預測執行佇列19可用於執行下一指令的時鐘週期。解碼/發射單元13還可通過存取讀取/寫入控制單元16以在發出指令之前檢查暫存器組14的讀取埠和寫入埠的可用性來使暫存器組的讀取埠和寫入埠同步。舉例來說，執行佇列19中的第一指令的累積吞吐時間指示功能單元20將由第一指令佔據11個時鐘週期。如果第二指令的寫入時間是12個時鐘週期，那麼結果資料將在23個時鐘週期後（或從現在起的第23個時鐘週期）從功能單元20寫回到暫存器組14。換句話說，解碼/發射單元13將在第二指令的發出時間處，確保暫存器和讀取埠在第11個時鐘週期處的可用性，和用於寫回操作的寫入埠在第23個時鐘週期處的可用性。如果讀取埠或寫入埠在對應時鐘週期中是忙碌的，那麼解碼/發射單元13可停止一個時鐘週期且再次檢查暫存器和讀取/寫入埠的可用性。Based on the count value in the count field of scoreboard 15, decode/issue unit 13 may predict the availability of the register and schedule execution of the instruction to execution queue 19, which may be retrieved from decode/issue unit 13 The order in which instructions are received sends instructions to the functional units 20 in sequence. The execution queue 19 may accumulate the throughput times of all instructions in the execution queue 19 to predict the clock cycles that the execution queue 19 will be available to execute the next instruction. The decode/transmit unit 13 can also enable the read port and write port of the register bank 14 by accessing the read/write control unit 16 to check the availability of the read port and the write port of the register bank 14 before issuing the command. Write port sync. For example, the cumulative throughput time of the first instruction in execution queue 19 indicates that functional unit 20 will take 11 clock cycles by the first instruction. If the write time of the second instruction is 12 clock cycles, then the result data will be written back from the functional unit 20 to the register bank 14 after 23 clock cycles (or the 23rd clock cycle from now). In other words, the decode/issue unit 13 will ensure the availability of the scratchpad and the read port at the 11th clock cycle, and the write port for the writeback operation at the 23rd clock cycle at the issue time of the second instruction. Availability at clock cycles. If the read port or the write port is busy for the corresponding clock cycle, the decode/transmit unit 13 may stop for one clock cycle and check the register and read/write port availability again.

掩碼佇列21處置分派到執行佇列19的向量指令的掩碼資料。圖5是示出根據本公開的一些實施例的掩碼佇列21的圖式。掩碼佇列21邏輯地結構化成包含多個掩碼條目210(0)到掩碼條目210(M)，其中M是大於1的整數。每一掩碼條目包含多個掩碼資料（例如，16位元），其中掩碼資料的每一位元對應於向量資料的元素。掩碼位元指示元素的結果資料是否應寫回到暫存器組14，即如果掩碼位元是1，那麼元素的結果資料應寫回到暫存器組14。如果掩碼位元是0，那麼結果資料不應寫回到暫存器組14。在實施例中，向量暫存器中的元素的最小數目是16（當元素是32位元寬資料），因此掩碼條目在實施例中設置成具有16個掩碼位元。每一元素的結果資料由個別掩碼位元指示以寫回到暫存器組14。舉例來說，掩碼位元（資料）1111_1111_0000_1111指示阻擋元素5-8（即，掩碼資料的位元位4到位7）寫回到暫存器組14。在實施例中，向量暫存器中的資料元素的最大數目是64（當元素是8位元寬資料），因此需要4個掩碼條目(每個掩碼條目具有16個掩碼位元)以使得64個元素能夠寫回到暫存器組14。在實施例中，最大向量長度乘數（LMUL）是8，在所述情況下向量指令可將結果資料寫回到暫存器組14的8個向量暫存器。由於每一向量暫存器在每資料元素是8位元寬的情況下，可具有64個資料元素，所以向量指令最多可具有512個資料元素（即，8個向量暫存器，各自有64個元素），其可在下文中稱作最壞情況向量指令（或 “最壞情況情境”）。掩碼佇列21需要具有用於最壞情況向量指令的至少512個掩碼位元，且因此，掩碼佇列21結構化成具有32個掩碼條目，且每個掩碼條目具有16位元掩碼。也就是說，為了執行用於一個單一向量指令的掩碼操作，掩碼暫存器的每一位元都是必要的。應注意，512位元寬掩碼暫存器此處僅出於說明的目的而使用。本公開可在不脫離本公開的情況下應用於具有各種大小的其它掩碼暫存器。舉例來說，掩碼暫存器大小可以是32、64、128、1024或任何其它數目。此外，掩碼佇列可在不脫離本公開的範圍的情況下邏輯地結構化成具有不同數目個掩碼條目和/或不同數目個位元每掩碼條目。舉例來說，具有32位的16個掩碼條目或8位的64個掩碼條目的掩碼佇列可用於處置512位元掩碼資料。在又其它實施例中，具有16位的64個條目或32位的32個條目的掩碼佇列可用於處置1024位元掩碼資料。在以上描述中，掩碼佇列21的大小可依賴於暫存器組14的寬度。然而，本公開並不意圖限制掩碼佇列21的大小。在一替代性實施例中，掩碼佇列21中的掩碼位元的總數目可獨立於暫存器組的寬度。舉例來說，掩碼佇列21可具有用於總計640個掩碼位的16位元掩碼條目的40個條目。Mask queue 21 handles mask data for vector instructions dispatched to execution queue 19 . FIG. 5 is a diagram illustrating a mask queue 21 according to some embodiments of the present disclosure. The mask queue 21 is logically structured to include a plurality of mask entries 210(0) to 210(M), where M is an integer greater than one. Each mask entry includes a plurality of mask data (eg, 16 bits), wherein each bit of the mask data corresponds to an element of the vector data. The mask bit indicates whether the result data of the element should be written back to the register bank 14 , ie if the mask bit is 1, then the result data of the element should be written back to the register bank 14 . If the mask bit is 0, then result data should not be written back to scratchpad bank 14. In an embodiment, the minimum number of elements in a vector register is 16 (when the elements are 32-bit wide data), so the mask entry is set to have 16 mask bits in an embodiment. The resulting data for each element is indicated by individual mask bits for writing back to register set 14 . For example, mask bits (data) 1111_1111_0000_1111 indicate that elements 5-8 (ie, bits bits 4 to 7 of the mask data) are blocked from being written back to register set 14 . In an embodiment, the maximum number of data elements in a vector register is 64 (when the elements are 8-bit wide data), thus requiring 4 mask entries (each mask entry has 16 mask bits) so that 64 elements can be written back to the scratchpad bank 14 . In an embodiment, the maximum vector length multiplier (LMUL) is 8, in which case the vector instruction can write result data back to the 8 vector registers of register bank 14 . Since each vector register can have 64 data elements where each data element is 8 bits wide, a vector instruction can have a maximum of 512 data elements (i.e., 8 vector registers each with 64 elements), which may hereinafter be referred to as the worst-case vector instruction (or "worst-case scenario"). The mask queue 21 needs to have at least 512 mask bits for the worst case vector instruction, and therefore, the mask queue 21 is structured to have 32 mask entries with 16 bits each mask. That is, in order to perform a mask operation for a single vector instruction, every bit of the mask register is necessary. It should be noted that a 512-bit wide mask register is used here for illustration purposes only. The disclosure can be applied to other mask registers of various sizes without departing from the disclosure. For example, the mask register size can be 32, 64, 128, 1024 or any other number. Furthermore, the mask queue may be logically structured to have a different number of mask entries and/or a different number of bits per mask entry without departing from the scope of the present disclosure. For example, a mask queue with 16 mask entries for 32 bits or 64 mask entries for 8 bits may be used to handle 512-bit mask data. In yet other embodiments, a mask queue with 64 entries of 16 bits or 32 entries of 32 bits may be used to handle 1024-bit mask data. In the above description, the size of the mask queue 21 may depend on the width of the register set 14 . However, the disclosure does not intend to limit the size of the mask queue 21 . In an alternative embodiment, the total number of mask bits in the mask queue 21 may be independent of the width of the register set. For example, mask queue 21 may have 40 entries for 16-bit mask entries totaling 640 mask bits.

掩碼操作可表示於預測運算元或條件控制運算元或條件向量操作控制運算元中。在實施例中，可基於向量指令的位元25而啟用掩碼操作。換句話說，向量指令的位元25指示向量指令是掩碼的向量指令或未掩碼的向量指令。向量指令中的其它位元可用於啟用掩碼操作，本公開並不意圖限制於此。掩碼佇列21的掩碼資料可用於預測、條件性地控制或掩碼操作的個別結果是否待存儲為目的地運算元的資料元素和/或與向量指令相關聯的操作是否待在來源運算元的資料元素上執行。通常，一個掩碼位元將附接到向量指令的每一資料元素。掩碼資料的數目基於向量資料長度（vector data length；VLEN）、資料元素寬度（data element width；ELEN）以及向量長度乘數（LMUL）而改變。向量長度乘數表示組合成形成向量暫存器群組的向量暫存器的數目。向量長度乘數的值可以是1、2、4、8等等。資料元素的數目可通過向量資料長度除以資料元素寬度（VLEN/ELEN）來計算，且當啟用掩碼操作時，每一資料元素將需要掩碼位元。通過向量長度乘數，一個單一向量指令可包含各種數目個微操作，且微操作中的每一個還需要用以執行掩碼操作的掩碼資料。在向量長度乘數是8的情況下，即LMUL=8，用於一個單一指令的資料元素的數目與LMUL=1相比增加8倍（即，(VLEN × LMUL) / ELEN）。在512位元寬向量資料和8個微操作的情況下，用於一個單一向量指令的資料元素的數目可能多至512個(當資料元素寬度是8位元（ELEN=8）時)。在此類情況下，需要512位元掩碼資料，其可稱作512位元掩碼暫存器的最壞情況情境。另一方面，具有32位的資料元素寬度和1個微操作（LMUL=1）的向量指令將僅需要16位元掩碼資料，其可稱作掩碼暫存器的最好情況情境。在強行實施方案中，執行佇列的每一條目將配備有512個位元以用於處置512位元掩碼資料的可能性，而不管執行佇列中的向量指令是否需要所有512個位元。如果執行佇列具有8個條目，那麼將需要總計32,768位元（8×8×512）以處置執行佇列中的每一條目的最壞情況情境中的掩碼。這是掩碼的過度存儲。在實施例中，掩碼佇列用於處理執行佇列中的所有佇列條目的向量資料的掩碼，而非在每一佇列條目中保留512位元以用於處理掩碼。Mask operations can be expressed in predictive operands or conditional control operands or conditional vector operation control operands. In an embodiment, masking operations may be enabled based on bit 25 of the vector instruction. In other words, bit 25 of the vector instruction indicates whether the vector instruction is a masked vector instruction or an unmasked vector instruction. Other bits in the vector instructions may be used to enable masking operations, as this disclosure is not intended to be limited. The mask data in mask queue 21 can be used to predict, conditionally control, or mask whether individual results of operations are to be stored as data elements of destination operands and/or whether operations associated with vector instructions are to be stored in source operands Executed on the data element of the meta. Typically, a mask bit will be attached to each data element of a vector instruction. The number of mask data varies based on vector data length (VLEN), data element width (ELEN) and vector length multiplier (LMUL). The vector length multiplier represents the number of vector registers combined to form a vector register group. The value of the vector length multiplier can be 1, 2, 4, 8, etc. The number of data elements can be calculated by dividing the vector data length by the data element width (VLEN/ELEN), and when masking is enabled, each data element will require mask bits. With the vector length multiplier, a single vector instruction can contain various numbers of micro-ops, and each of the micro-ops also requires mask data to perform the mask operation. In the case of a vector length multiplier of 8, ie, LMUL=8, the number of data elements used for a single instruction is increased by 8 times compared to LMUL=1 (ie, (VLEN × LMUL) / ELEN). In the case of 512-bit wide vector data and 8 micro-ops, the number of data elements for a single vector instruction may be as high as 512 (when the data element width is 8 bits (ELEN=8)). In such cases, 512-bit mask data is required, which may be referred to as the worst-case scenario for a 512-bit mask register. On the other hand, a vector instruction with a data element width of 32 bits and 1 micro-op (LMUL=1) will only require 16 bits of mask data, which can be called the best case scenario for mask registers. In a mandatory implementation, each entry of the execution queue will be equipped with 512 bits for the possibility of handling 512 bits of mask data, regardless of whether all 512 bits are required by the vector instructions in the execution queue . If the execution queue had 8 entries, a total of 32,768 bits (8x8x512) would be required to handle the mask in the worst case scenario for each entry in the execution queue. This is excessive storage of masks. In an embodiment, a mask queue is used to process the mask of the vector data for all queue entries in the execution queue, rather than reserving 512 bits in each queue entry for processing the mask.

在512位元寬掩碼暫存器v(0)的實施例中，掩碼佇列21的32個掩碼條目將具有處置用於當LMUL是1時具有32位元寬資料元素的32個向量指令的掩碼資料的能力，其中僅掩碼暫存器的前16位是用於向量暫存器的掩碼資料（即，最好情況情境）。當LMUL是8時，掩碼佇列21具有處置具有32位元寬資料元素的4個向量指令的能力，其中掩碼暫存器的前128位(512*8/32)是用於向量暫存器的掩碼資料。對於16位元寬資料元素，掩碼佇列21具有處置用於當LMUL是1時的16個向量指令（即，32位元的掩碼資料）和當LMUL是8時的2個向量指令（即，256位元的掩碼資料）的掩碼資料的能力。對於8位元寬資料元素，掩碼佇列21具有處置用於當LMUL是1時的8個向量指令（即，64位元的掩碼資料）和當LMUL=8時的1個向量指令（即，512位元的掩碼資料）的掩碼資料的能力。In an embodiment of the 512-bit wide mask register v(0), the 32 mask entries of the mask queue 21 would have the 32 mask entries for the 32-bit wide data elements when LMUL is 1 The ability to mask data for vector instructions, where only the first 16 bits of the mask register are used for the mask data for the vector register (ie, the best case scenario). When the LMUL is 8, the mask queue 21 has the ability to handle 4 vector instructions with 32-bit wide data elements, where the first 128 bits (512*8/32) of the mask register are used for the vector register The mask data of the register. For 16-bit wide data elements, mask queue 21 has processing for 16 vector instructions when LMUL is 1 (i.e., 32-bit mask data) and 2 vector instructions when LMUL is 8 ( That is, 256-bit mask data) mask data capabilities. For 8-bit wide data elements, mask queue 21 has processing for 8 vector instructions when LMUL is 1 (i.e., 64-bit mask data) and 1 vector instruction when LMUL=8 ( That is, 512 bits of masking data) of the masking data capability.

應注意，512位元寬掩碼暫存器用於展示本發明的概念，具有如32、63、128、1024位的不同寬度位元的掩碼暫存器也可適於處置掩碼操作的掩碼資料。舉例來說，在1024位元寬掩碼暫存器的實施例中，微處理器可配備有1024位元寬掩碼佇列以處置最壞情況情境（例如，VLEN = 1024，ELEN = 8，MLUL = 8）。此外，資料元素的寬度和向量長度乘數（LMUL）的數目還可在不脫離本發明的範圍的情況下改變。如說明書中所描述的掩碼佇列的相同演算法也可適於處置具有64位、128位等寬的資料元素。It should be noted that a 512-bit wide mask register is used to demonstrate the concepts of the present invention, mask registers with different width bits such as 32, 63, 128, 1024 bits may also be suitable for handling mask operations. code data. For example, in an embodiment of a 1024-bit wide mask register, the microprocessor may be equipped with a 1024-bit wide mask queue to handle worst-case scenarios (e.g., VLEN = 1024, ELEN = 8, MLUL=8). Furthermore, the width of the data elements and the number of vector length multipliers (LMULs) may also vary without departing from the scope of the present invention. The same algorithm for mask queues as described in the specification can also be adapted to handle data elements with widths of 64 bits, 128 bits, etc.

在實施例中，掩碼佇列21可通過環形佇列的寫入指標（“wrptr”）211和讀取指標（“rdptr”）213的指標來存取。寫入指標211根據執行佇列中的一個向量指令的分配而遞增。讀取指標213根據一個向量指令的完成而遞增。掩碼資料可當作一個R位元（例如，R=512位元）的單位一次寫入到掩碼佇列21，且從掩碼佇列21作為M個條目（例如，M=32個條目）的佇列結構讀取。In an embodiment, the masked queue 21 is accessible by pointers to the write pointer ("wrptr") 211 and the read pointer ("rdptr") 213 of the circular queue. Write pointer 211 is incremented upon allocation of a vector instruction in the execution queue. The fetch indicator 213 is incremented upon completion of a vector instruction. The mask data can be written to the mask queue 21 as a unit of R bits (for example, R=512 bits) at a time, and from the mask queue 21 as M entries (for example, M=32 entries ) for the queue structure read.

在掩碼佇列21的寫入操作中，當向量指令分派到執行佇列19時，掩碼暫存器v(0)的整個寬度可寫入到掩碼佇列21作為一個條目。也就是說，512位（即，掩碼暫存器v(0)的總寬度）可寫入到從由寫入指標211的位置指定的掩碼條目開始的掩碼佇列21。具體來說，啟用掩碼佇列的32個條目以用於寫入從寫入指標211開始的512位元掩碼資料。寫入指標211的再定位可基於所發出向量指令的向量資料所需的掩碼資料的數目而計算。舉例來說，第一向量指令具有2個微操作（即，LMUL=2），且向量資料具有16位元的資料元素寬度（ELEN）。向量資料將具有64個資料元素，其需要用於掩碼操作的掩碼暫存器v(0)的前64位。寫入指標211的再定位將基於第一向量指令所需的64位元（16 × 4）掩碼資料而計算。具體來說，啟用掩碼佇列21的4個條目以用於寫入從寫入指標211開始的64位元掩碼資料。寫入掩碼佇列21作為512位元掩碼資料的單一條目但僅啟用4個掩碼條目以用於寫入64位元掩碼資料，同時阻擋448位元(512位元減64位元)掩碼資料寫入到掩碼佇列21中。每一掩碼條目可指派有寫入啟用位（未繪示）以用於指示是否啟用對應掩碼條目以用於寫入操作。在實例中，寫入指標211將遞增4個條目。如果第一向量指令分派到執行佇列19的第一佇列條目190(0)，那麼寫入操作將從第一掩碼條目210(0)開始且寫入指標211將從第一個掩碼條目210(0)遞增到第五個掩碼條目210(4)。當第二向量指令分派到第二佇列條目190(1)時，掩碼暫存器v(0)的整個寬度將用於寫入到掩碼佇列21作為一個條目。用於第二向量指令的掩碼資料的掩碼佇列21的寫入操作將從寫入指標的新位置（即，第5個掩碼條目210(4)）開始。如果第二向量指令是需要掩碼佇列21的整個寬度（例如，當VLEN = 512，ELEN = 8位且LMUL = 8時，512位元）的最壞情況情境，那麼第二向量指令的發出將停止於解碼/發射單元中，直到第一向量指令(的微操作)全部發出到功能單元20為止。在另一情境中，如果第一向量指令是需要用於掩碼操作的掩碼佇列21的整個寬度的最壞情況情境，那麼在第一向量指令之後的第二向量指令的發出將停止直到第一向量指令發出到功能單元20為止。然而，在不依賴於掩碼暫存器的寬度的掩碼佇列21的替代實施例中，掩碼佇列21的大小可處置除了512位元寬向量暫存器的最壞情況情境中的512位元掩碼資料之外的更多掩碼位元。在此類實施例中，第二指令可仍在具有最壞情況情境的第一向量指令之後發出，只要掩碼佇列21中的可用掩碼條目的數目足以處置對應於第二向量指令的掩碼資料即可。在任何情況中，向量指令可在解碼/發射單元13中停止直到掩碼佇列21中的可用掩碼條目的數目足以保存用於向量指令的新掩碼資料為止。In a write operation to the mask queue 21, when the vector instruction is dispatched to the execution queue 19, the entire width of the mask register v(0) can be written to the mask queue 21 as an entry. That is, 512 bits (ie, the total width of the mask register v(0)) can be written to the mask queue 21 starting from the mask entry specified by the location of the write pointer 211 . Specifically, 32 entries of the mask queue are enabled for writing the 512-bit mask data starting from the write pointer 211 . The relocation of the write pointer 211 may be calculated based on the amount of mask data required for the vector data of the issued vector instruction. For example, the first vector instruction has 2 micro-ops (ie, LMUL=2), and the vector data has a data element width (ELEN) of 16 bits. Vector data will have 64 data elements, which require the first 64 bits of mask register v(0) for mask operations. The relocation of the write pointer 211 will be calculated based on the 64-bit (16 x 4) mask data required for the first vector instruction. Specifically, 4 entries of the mask queue 21 are enabled for writing the 64-bit mask data starting from the write pointer 211 . Write mask queue 21 as a single entry for 512-bit mask data but only enable 4 mask entries for writing 64-bit mask data, while blocking 448 bits (512 bits minus 64 bits ) mask data is written into the mask queue 21. Each mask entry may be assigned a write enable bit (not shown) for indicating whether the corresponding mask entry is enabled for write operations. In an example, the write indicator 211 will be incremented by 4 entries. If the first vector instruction is dispatched to the first queue entry 190(0) of the execution queue 19, then the write operation will start from the first mask entry 210(0) and the write pointer 211 will start from the first mask entry Entry 210(0) increments to the fifth mask entry 210(4). When the second vector instruction is dispatched to the second queue entry 190(1), the entire width of the mask register v(0) will be used to write to the mask queue 21 as one entry. The write operation to the mask queue 21 for the mask data of the second vector instruction will start from the new location of the write pointer (ie, the 5th mask entry 210(4)). If the second vector instruction is the worst case scenario requiring the entire width of mask queue 21 (e.g., 512 bits when VLEN = 512, ELEN = 8 bits, and LMUL = 8), then the issue of the second vector instruction Will stall in the decode/issue unit until (micro-ops of) the first vector instruction has all issued to the functional unit 20. In another scenario, if the first vector instruction is the worst case scenario requiring the entire width of the mask queue 21 for the mask operation, then the issue of the second vector instruction following the first vector instruction will stall until The first vector instruction is issued to functional unit 20 . However, in an alternate embodiment of the mask queue 21 that does not depend on the width of the mask register, the size of the mask queue 21 can handle the worst-case scenario of anything but a 512-bit wide vector register. More mask bits than the 512 bit mask data. In such embodiments, the second instruction can still issue after the first vector instruction with the worst-case scenario, as long as the number of available mask entries in mask queue 21 is sufficient to handle the mask corresponding to the second vector instruction. code data. In any case, the vector instruction may stall in decode/issue unit 13 until the number of available mask entries in mask queue 21 is sufficient to hold new mask data for the vector instruction.

掩碼佇列21的讀取操作從由讀取指標213所指向的掩碼條目開始且在執行佇列19中的對應向量指令發出到功能單元20時遞增。向量指令可具有如由微操作計數欄位198所指示的許多微操作，其中每當微操作發出到功能單元20時，微操作計數欄位198遞減1。當向量指令的所有微操作皆發出到功能單元20時，微操作計數欄位198中的計數值遞減為0，在執行佇列19的條目的有效位元欄位191會被清除為0，且掩碼佇列21的讀取指標將會遞增。讀取指標213指向對應於向量指令的第一微操作的掩碼條目中的一個（稱作當前讀取掩碼條目）。讀取操作可讀取從當前讀取掩碼條目開始的X個連續掩碼條目，其中X是大於0的整數。當前讀取掩碼條目可通過微操作的次序位移以讀取存儲於掩碼佇列21中的對應掩碼資料。對於8位元元素，每個向量微操作所需的掩碼位元的數目是64位元掩碼資料(即，掩碼佇列21的4個掩碼條目)。對於16位元元素，每個向量微操作所需的掩碼位元的數目是32位元掩碼資料(即，掩碼佇列21的2個掩碼條目)。對於32位元元素，每個向量微操作所需的掩碼位元的數目是16位元掩碼資料(即，掩碼佇列21的1個掩碼條目)。用於每一微操作的掩碼條目的數目在本文中稱為微操作掩碼大小，即用於8位元元素的4個掩碼條目、用於16位元元素的2個掩碼條目以及用於32位元元素的1個掩碼條目。在實施例中，每次讀取四個掩碼條目（即，X=4），而非計算針對不同元素長度（8位元、16位元以及32位元元素）讀取的掩碼條目的確切數目。在8位元寬資料元素的情況下，所有4個條目用於每一微操作。在16位元寬資料元素的情況下，前2個條目用於每一微操作。在32位元寬資料元素的情況下，第一條目用於每一微操作。因此，掩碼佇列21的讀取操作配置成讀取至少64位，即，四個16位元寬標記條目，以處置可由於資料元素寬度而具有各種寬度的掩碼資料的每一微操作。The read operation of the mask queue 21 starts with the mask entry pointed to by the read pointer 213 and is incremented when the corresponding vector instruction in the execute queue 19 is issued to the functional unit 20 . A vector instruction may have as many uops as indicated by uop count field 198 , which is decremented by one each time a uop is issued to a functional unit 20 . When all micro-ops of the vector instruction are issued to the functional unit 20, the count value in the micro-op count field 198 is decremented to 0, the valid bit field 191 of the entry in the execution queue 19 is cleared to 0, and The read pointer for mask queue 21 will be incremented. The read pointer 213 points to one of the mask entries corresponding to the first micro-operation of the vector instruction (referred to as the current read mask entry). A read operation may read X consecutive mask entries starting from the current read mask entry, where X is an integer greater than zero. The current read mask entry can be shifted by the order of the micro-operations to read the corresponding mask data stored in the mask queue 21 . For 8-bit elements, the number of mask bits required for each vector uop is 64 bits of mask data (ie, 4 mask entries of mask queue 21). For 16-bit elements, the number of mask bits required per vector uop is 32 bits of mask data (ie, 2 mask entries of mask queue 21). For 32-bit elements, the number of mask bits required per vector uop is 16 bits of mask data (ie, 1 mask entry of mask queue 21). The number of mask entries for each micro-op is referred to herein as the micro-op mask size, namely 4 mask entries for 8-bit elements, 2 mask entries for 16-bit elements, and 1 mask entry for 32-bit elements. In an embodiment, four mask entries are read at a time (i.e., X=4), rather than calculating the number of mask entries read for different element lengths (8-bit, 16-bit, and 32-bit elements) exact number. In the case of 8-bit wide data elements, all 4 entries are used for each micro-operation. In the case of 16-bit wide data elements, the first 2 entries are used for each micro-op. In the case of 32-bit wide data elements, the first entry is used for each micro-op. Accordingly, the read operation of the mask queue 21 is configured to read at least 64 bits, i.e., four 16-bit wide tag entries, for each micro-operation that handles mask data that can be of various widths due to the data element width .

如上文所描述，當前讀取掩碼條目可通過微操作的次序位移。如果向量指令具有三個微操作，那麼第一微操作將讀取從由讀取指標213所指的掩碼條目開始的4個連續掩碼條目。第二微操作將讀取從所修改讀取指標開始的4個連續掩碼條目。在實施例中，讀取指標通過將微操作掩碼大小（其取決於資料元素的寬度）添加到讀取指標213來修改。第三微操作將通過將2個微操作掩碼大小添加到讀取指標213來讀取從所修改讀取指標開始的4個連續掩碼條目。在具有一個掩碼條目的微操作掩碼大小（即，16個掩碼位）的32位元寬資料元素的情況下，讀取指標213指向掩碼條目210(0)作為當前讀取掩碼條目。第一微操作讀取從掩碼條目210(0)開始的四個連續掩碼條目，即掩碼條目210(0)到掩碼條目210(3)。第二微操作讀取從掩碼條目210(1)開始的掩碼條目210(1)到掩碼條目210(4)，其中由讀取指標213所指的掩碼條目通過將1個微操作掩碼大小添加到讀取指標213的位置來修改。第三微操作讀取從掩碼條目210(2)開始的掩碼條目210(2)到掩碼條目210(5)，其中由讀取指標213所指的掩碼條目通過將2個微操作掩碼大小添加到讀取指標213的位置來修改，等等。待添加的微操作掩碼大小的數目取決於微操作的次序。在實施例中，可從掩碼佇列21讀取64位元掩碼資料。然而，向量指令的微操作所需的掩碼資料可基於資料元素的寬度而改變。在16位元寬資料元素的情況下，每個512位元寬向量資料長度的微操作僅需要32位元掩碼資料，且每個掩碼資料的讀取操作將遞增2個掩碼條目。微操作將使用64位元讀取掩碼資料的前32位元（即，讀取掩碼資料[31:0]）且忽略64位元讀取掩碼資料的後32位元（即，讀取掩碼資料[63:32]）。應注意，四個連續掩碼條目的讀取操作並不意圖限制本公開。掩碼佇列21的讀取操作可涉及可在不脫離本公開的範圍的情況下實施的各種數目個掩碼條目，如1、2、4、8、16等等。在1024位元寬向量資料的替代實施例中，執行佇列19可配置成讀取八個連續標記條目（即，128位元掩碼資料）。As described above, the current read mask entry may be shifted through the order of micro-operations. If the vector instruction has three micro-ops, then the first micro-op will read 4 consecutive mask entries starting from the mask entry pointed to by read pointer 213 . The second micro-op will read 4 consecutive mask entries starting from the modified read pointer. In an embodiment, the read pointer is modified by adding the micro-op mask size (which depends on the width of the data element) to the read pointer 213 . The third micro-op will read 4 consecutive mask entries starting from the modified read pointer 213 by adding 2 micro-op mask sizes to the read pointer 213 . In the case of a 32-bit wide data element with a micro-op mask size of one mask entry (i.e., 16 mask bits), read pointer 213 points to mask entry 210(0) as the current read mask entry. The first micro-operation reads four consecutive mask entries starting from mask entry 210(0), mask entry 210(0) to mask entry 210(3). The second micro-op reads mask entry 210(1) starting from mask entry 210(1) to mask entry 210(4), where the mask entry pointed to by read pointer 213 is passed by 1 micro-op The mask size is added to the location of the read pointer 213 to modify. The third micro-operation reads mask entry 210(2) starting from mask entry 210(2) to mask entry 210(5), wherein the mask entry pointed to by the read pointer 213 is obtained by combining 2 micro-ops The mask size is added to read the location of pointer 213 to modify, etc. The number of micro-op mask sizes to add depends on the order of the micro-ops. In an embodiment, 64-bit mask data can be read from the mask queue 21 . However, the mask data required for the micro-operations of the vector instructions may vary based on the width of the data elements. In the case of 16-bit wide data elements, each micro-operation of 512-bit wide vector data length requires only 32 bits of mask data, and each mask data read operation will increment 2 mask entries. The micro-op will use the first 32 bits of the 64-bit read mask data (i.e., read mask data[31:0]) and ignore the last 32 bits of the 64-bit read mask data (i.e., read Take Mask Profile [63:32]). It should be noted that a read operation of four consecutive mask entries is not intended to limit the present disclosure. A read operation of mask queue 21 may involve various numbers of mask entries, such as 1, 2, 4, 8, 16, etc., that may be implemented without departing from the scope of the present disclosure. In an alternate embodiment of 1024-bit wide vector data, execute queue 19 may be configured to read eight consecutive tag entries (ie, 128-bit mask data).

圖6A到圖6C是示出根據本公開的一些實施例的將向量指令分派到執行佇列19和掩碼佇列21的操作的圖式。參考圖6A，接收包含4個微操作的第一向量指令以在具有16位元寬資料元素的512位元寬向量資料上操作。解碼/發射單元13檢查是否啟用掩碼操作且存取記分板以檢查對應暫存器和掩碼暫存器v0的可用性。在實施例中，掩碼暫存器v(0)可直接接線到具有專用讀取埠的解碼/發射單元13。如果掩碼暫存器v(0) 存在未完成的寫入操作，那麼記分板條目150(0)則會顯示掩碼暫存器v(0)並未能讀取的忙碌資訊。記分板條目150(0)中的忙碌資訊可通過設置未知欄位1511（或1521）、計數欄位1513（或1523）或記分板條目150(0)中的指示掩碼暫存器v(0)忙碌的額外欄位來實施。6A-6C are diagrams illustrating operations of dispatching vector instructions to execute queue 19 and mask queue 21 according to some embodiments of the present disclosure. Referring to FIG. 6A, a first vector instruction containing 4 micro-ops is received to operate on 512-bit wide vector data with 16-bit wide data elements. The decode/transmit unit 13 checks whether the mask operation is enabled and accesses the scoreboard to check the availability of the corresponding register and the mask register v0. In an embodiment, the mask register v(0) can be directly wired to the decode/transmit unit 13 with a dedicated read port. If mask register v(0) has an outstanding write operation, then the scoreboard entry 150(0) will display the busy information that mask register v(0) could not be read. The busy information in the scoreboard entry 150(0) can be set by setting the unknown field 1511 (or 1521), the count field 1513 (or 1523), or the indication mask register v in the scoreboard entry 150(0) (0) busy extra field to implement.

作為實例，第一向量指令分派和分配給執行佇列19的第一佇列條目190(0)，而對應於第一向量指令的掩碼資料將基於寫入指標211而分配給掩碼佇列21的第一多個掩碼條目210(0)到掩碼條目210(7)。掩碼資料基於寫入指標211的位置而分配給掩碼佇列21，而非從作為佇列中的第一向量指令的部分的第一佇列條目190(0)中的掩碼暫存器v(0)分配512位元寬掩碼資料。掩碼資料可經由匯流排或經由解碼/發射單元13直接從掩碼暫存器v(0)發送到掩碼佇列21作為發出第一向量指令的部分，本公開並不意圖限制掩碼資料的傳輸路徑。在實施例中，第一向量指令將由於16位元寬資料元素和4個微操作而具有128位元寬掩碼資料（32 × 4），其需要8個掩碼條目。在分配第一向量指令之後，寫入指標211增8以指示用於下一向量指令的下一掩碼條目。在實施例中，用於前8個掩碼條目的寫入啟用位設置成從寫入指標211開始以允許128位元掩碼資料寫入到掩碼佇列21。As an example, the first vector instruction is dispatched and assigned to the first queue entry 190(0) of the execution queue 19, and the mask data corresponding to the first vector instruction will be assigned to the mask queue based on the write pointer 211 The first plurality of mask entries 210(0) through mask entries 210(7) of 21. Mask data is allocated to the mask queue 21 based on the location of the write pointer 211, rather than from the mask register in the first queue entry 190(0) that is part of the first vector instruction in the queue v(0) allocates 512-bit wide mask data. Mask data may be sent directly from mask register v(0) to mask queue 21 via the bus or via decode/transmit unit 13 as part of issuing the first vector instruction. This disclosure is not intended to be limited to mask data transmission path. In an embodiment, the first vector instruction will have 128-bit wide mask data (32×4) due to 16-bit wide data elements and 4 micro-ops, which requires 8 mask entries. After the first vector instruction is allocated, the write pointer 211 is incremented by 8 to indicate the next mask entry for the next vector instruction. In an embodiment, the write enable bits for the first 8 mask entries are set starting at write pointer 211 to allow 128 bits of mask data to be written to mask queue 21 .

參考圖6B，在第一向量指令分配給執行佇列19之後接收包含2個微操作的第二向量指令。第二向量指令配置成在具有32位元寬資料元素的512位元寬向量資料上操作。基於第二向量指令的結構（即，ELEN=32，LMUL=2），將需要32位元掩碼資料以在第二向量指令上執行掩碼操作，這需要2個掩碼條目來存儲。對應於第二向量指令的32位元寬掩碼資料寫入到從由寫入指標211的當前位置指示的當前寫入掩碼條目開始的掩碼佇列21，所述當前寫入掩碼條目將是掩碼條目210(8)。如圖6B中所示出，對應於第二向量指令的第一微操作的掩碼資料（“m-op 2-1”）寫入到掩碼佇列210(8)，且對應於第二向量指令的第二微操作的掩碼資料（“m-op 2-2”）寫入到掩碼佇列210(9)。在分配第二向量指令之後，寫入指標211遞增2以指示用於下一向量指令的下一掩碼條目。在實施例中，寫入指標211將重新定位到掩碼佇列210(10)以指示用於存儲在第二向量指令之後的向量指令的掩碼資料的下一可用掩碼條目。Referring to FIG. 6B , the second vector instruction comprising 2 micro-ops is received after the first vector instruction is assigned to the execution queue 19 . The second vector instruction is configured to operate on 512-bit wide vector data having 32-bit wide data elements. Based on the structure of the second vector instruction (ie, ELEN=32, LMUL=2), 32 bits of mask data would be required to perform a mask operation on the second vector instruction, which requires 2 mask entries to store. The 32-bit wide mask data corresponding to the second vector instruction is written to the mask queue 21 starting with the current write mask entry indicated by the current position of the write pointer 211, which Will be mask entry 210(8). As shown in FIG. 6B, mask data ("m-op 2-1") corresponding to the first micro-operation of the second vector instruction is written to mask queue 210(8), and corresponding to the second The mask data for the second micro-op of the vector instruction ("m-op 2-2") is written to mask queue 210(9). After the second vector instruction is allocated, write pointer 211 is incremented by 2 to indicate the next mask entry for the next vector instruction. In an embodiment, write pointer 211 will be relocated to mask queue 210(10) to indicate the next available mask entry for storing the mask data for the vector instruction following the second vector instruction.

參考圖6C，示出分派第一佇列條目190(0)中的第一向量指令的操作。如上文所描述，第一向量指令的每一微操作在具有三十二個16位元寬資料元素的向量資料上操作。因此，將向第一向量指令的每一微操作分派32位元的掩碼資料。微操作掩碼大小對於每一微操作是2個掩碼條目，其中讀取指標213修改為0、2、4、6以讀取用於每一微操作指令的掩碼資料，且每一微操作的讀取操作將讀取從所修改讀取指標213開始的4個連續掩碼條目。在實施例中，執行佇列19存取掩碼佇列21以獲得佇列條目190(0)中的第一向量指令的掩碼資料。第一向量指令的第一微操作將與存儲於掩碼佇列21的四個掩碼條目210(0)到掩碼條目210(3)中的掩碼資料一起發出到功能單元20。由於向量資料是16位元寬資料元素，所以功能單元20可僅使用第一個32位元掩碼資料（例如，位元[31:0]）且忽略第二個32位元掩碼資料（例如，位元[63:32]）。對於第二微操作，當前讀取掩碼條目將位移一個微操作掩碼大小。換句話說，第一向量指令的第二微操作將讀取4個連續掩碼條目210(2)到掩碼條目210(5)以發出到功能單元20。功能單元20可僅使用存儲於掩碼佇列21的兩個掩碼條目210(2)到掩碼條目210(3)中的32位元掩碼資料且忽略來自掩碼條目210(4)到掩碼條目210(5)的其它32位元。第一向量指令的第三微操作將與存儲於掩碼佇列21的四個掩碼條目210(4)到掩碼條目210(7)中的掩碼資料一起發出到功能單元20，其中將讀取指標213修改兩個微操作掩碼大小。第一向量指令的第四微操作將與存儲於掩碼佇列21的四個掩碼條目210(6)到掩碼條目210(9)中的掩碼資料一起發出到功能單元20，其中將讀取指標213修改3個微操作掩碼大小。當第一向量指令完全地發出到功能單元20時，使讀取指標213遞增四個微操作掩碼大小，也就是說,在16位元寬資料元素的情況下,前述四個微操作掩碼將是8個掩碼條目。Referring to FIG. 6C, the operation of dispatching the first vector instruction in the first queue entry 190(0) is shown. As described above, each micro-operation of the first vector instruction operates on vector data having thirty-two 16-bit wide data elements. Therefore, 32 bits of mask data will be allocated to each micro-op of the first vector instruction. The micro-op mask size is 2 mask entries for each micro-op, where the read pointer 213 is modified to 0, 2, 4, 6 to read the mask data for each micro-op instruction, and each micro-op The read operation of the operation will read 4 consecutive mask entries starting from the modified read pointer 213 . In an embodiment, execution queue 19 accesses mask queue 21 to obtain mask data for the first vector instruction in queue entry 190(0). The first uop of the first vector instruction will be issued to the functional unit 20 together with the mask data stored in the four mask entries 210 ( 0 ) to 210 ( 3 ) of the mask queue 21 . Since vector data is a 16-bit wide data element, functional unit 20 may only use the first 32-bit mask data (e.g., bits[31:0]) and ignore the second 32-bit mask data ( For example, bits[63:32]). For the second micro-operation, the current read mask entry will be shifted by the size of the micro-operation mask. In other words, the second uop of the first vector instruction will read 4 consecutive mask entries 210 ( 2 ) to mask entry 210 ( 5 ) to issue to functional unit 20 . Functional unit 20 may only use the 32-bit mask data stored in two mask entries 210(2) through 210(3) of mask queue 21 and ignore The other 32 bits of the mask entry 210(5). The third uop of the first vector instruction will be issued to the functional unit 20 along with the mask data stored in the four mask entries 210(4) to 210(7) of the mask queue 21 to the functional unit 20. Read pointer 213 modifies two micro-op mask sizes. The fourth uop of the first vector instruction will be issued to functional unit 20 along with the mask data stored in four mask entries 210(6) to 210(9) of mask queue 21 to functional unit 20. Read pointer 213 modifies 3 micro-op mask sizes. When the first vector instruction is fully issued to the functional unit 20, the read pointer 213 is incremented by four micro-op mask sizes, that is, in the case of 16-bit wide data elements, the aforementioned four micro-op mask will be 8 mask entries.

在將第二向量指令發出到功能單元20的情況下，由於32位元寬資料元素，所以微操作掩碼大小是1個掩碼條目。第二向量指令的第一微操作將與存儲於掩碼條目210(8)到掩碼條目210(11)中的掩碼資料一起發出到功能單元20。由於向量資料是32位元寬資料元素，所以功能單元20可僅使用第一個16位元掩碼資料且忽略其它個48位元掩碼資料。第二向量指令的第二微操作將與存儲於掩碼條目210(9)到掩碼條目210(12)中的掩碼資料一起發出到功能單元20。In the case of issuing the second vector instruction to functional unit 20, the micro-op mask size is 1 mask entry due to the 32-bit wide data elements. The first uop of the second vector instruction will issue to functional unit 20 along with the mask data stored in mask entry 210(8) through mask entry 210(11). Since the vector data is a 32-bit wide data element, the functional unit 20 can only use the first 16-bit mask data and ignore the other 48-bit mask data. The second uop of the second vector instruction will be issued to functional unit 20 together with the mask data stored in mask entry 210(9) through mask entry 210(12).

在一些實施例中，雙倍寬度向量指令可分派到執行佇列19。在雙倍寬度向量指令的操作中，向量操作的結果資料將是來源資料的寬度的兩倍。詳細地說，來源資料的前半部（即，一半暫存器寬度）用於產生具有全暫存器寬度的第一結果資料，且來源資料的後半部用於產生具有全暫存器寬度的第二結果資料。當執行雙倍寬度向量指令的每一微操作時，讀取來源暫存器兩次。在實施例中，掩碼資料用於結果資料寬度且不用於來源資料寬度。作為實例，對於雙倍寬度指令，元素資料寬度是16位元且結果資料寬度是32位元。舉例來說，在LMUL=4的情況下，16位元元素的“單寬度”向量指令將具有4個微操作指令且寫回到暫存器組14的4個向量暫存器，且每一微操作指令具有32位元掩碼資料。16位元元素的“雙倍寬度”向量指令將具有8個微操作指令且寫回到暫存器組14的8個向量暫存器，其中每一微操作指令具有16位元掩碼資料。返回參考圖6C，第一向量指令是“單寬度”向量指令，其具有各自具有由2個掩碼條目（即，m-op 1-0是210(0)到210(1)）組成的掩碼資料的4個微操作。在“雙倍寬度”向量指令的情況下，雙倍寬度向量指令在邏輯上被視為對每一雙倍寬度微操作使用一個單一掩碼條目的8個雙倍寬度微操作，而非各自具有2個掩碼條目的4個微操作。舉例來說，對應於第一佇列條目190(0)中的雙倍寬度第一向量指令的8個掩碼條目210(0)到掩碼條目210(7)中的掩碼資料在邏輯上被視為“m-op 1-0”、“m-op 1-1”、“m-op 1-2”、“m-op 1-3”、“m-op 1-4”、“m-op 1- 4”、“m-op 1-5”、“m-op 1-6”以及“m-op 1-7”。在一些其它實施例中，來自掩碼佇列21的讀取操作掩碼資料可延遲1個時鐘週期，這是由於來源運算元資料元素必須移位到用於操作的正確位置中。In some embodiments, double-width vector instructions may be dispatched to execution queue 19 . In operations on double-width vector instructions, the result data of the vector operation will be twice the width of the source data. In detail, the first half of the source data (i.e., half the register width) is used to generate the first result data with the full register width, and the second half of the source data is used to generate the first result data with the full register width. 2. Results data. The source scratchpad is read twice when executing each micro-operation of the double-width vector instruction. In an embodiment, mask data is used for result data width and not for source data width. As an example, for a double width instruction, the element data width is 16 bits and the result data width is 32 bits. For example, with LMUL=4, a "single-width" vector instruction with 16-bit elements would have 4 micro-ops and write back to the 4 vector registers of register bank 14, and each Micro-op instructions have 32-bit mask data. A "double wide" vector instruction of 16-bit elements would have 8 micro-ops and write back to the 8 vector registers of register bank 14, with each micro-op having 16-bit mask data. Referring back to FIG. 6C , the first vector instructions are "single-width" vector instructions that have mask 4 micro-operations for encoding data. In the case of "double-width" vector instructions, the double-width vector instruction is logically seen as 8 double-width uops using a single mask entry for each double-width uop, rather than each having 4 micro-ops for 2 mask entries. For example, the mask data in the eight mask entries 210(0) through mask entry 210(7) corresponding to the double-width first vector instruction in the first queue entry 190(0) logically Treated as "m-op 1-0", "m-op 1-1", "m-op 1-2", "m-op 1-3", "m-op 1-4", "m -op 1-4", "m-op 1-5", "m-op 1-6", and "m-op 1-7". In some other embodiments, read operation mask data from mask queue 21 may be delayed by 1 clock cycle because the source operand data elements must be shifted into the correct location for the operation.

在一些其它實施例中，如果第二向量指令使用相同掩碼向量暫存器v(0)、相同LMUL以及相同ELEN，那麼第二向量指令可使用掩碼佇列21中的與第一向量指令相同的掩碼條目集合。實施例並不意圖排除其它大小的LMUL和ELEN，只要掩碼位可來源於基於v(0)的相同掩碼條目即可。不需要將相同掩碼資料寫入到掩碼佇列21中。相同掩碼向量暫存器v(0)意謂向量暫存器v(0)不由第一向量指令與第二向量指令之間的任一指令寫入。記分板位可用於指示掩碼向量暫存器v(0)的狀態。LMUL和ELEN值與讀取指標213一起存儲以便驗證下一向量指令的相同LMUL和ELEN。讀取指標213用作用於向量指令的掩碼條目的集合的識別符。掩碼佇列21可包含向量指令計數器（未繪示）以追蹤使用掩碼條目的相同集合的向量指令的數目，使得讀取指標213將僅在向量指令計數器達到0時再定位。由於每一向量指令使用掩碼條目的相同集合，當向量指令分派到執行佇列時，向量指令計數器遞增1。當向量指令的所有微操作從執行佇列19發出到功能單元20時，接著掩碼佇列條目中的向量指令計數器遞減。當向量指令計數器是零時，接著讀取指標213增加至微操作的數目，每一微操作具有微操作掩碼大小。由於相同掩碼資料並不重複多次寫入到掩碼佇列21中，所以重複使用掩碼佇列21中的掩碼條目(即以上描述)是掩碼資料和功耗的較高效率使用。In some other embodiments, if the second vector instruction uses the same masked vector register v(0), the same LMUL, and the same ELEN, then the second vector instruction can use the mask queue 21 as the first vector instruction A collection of identically masked entries. Embodiments are not intended to exclude other sizes of LMUL and ELEN, as long as the mask bits can be derived from the same mask entry based on v(0). There is no need to write the same mask data into the mask queue 21 . Same-mask vector register v(0) means that vector register v(0) is not written by any instruction between the first vector instruction and the second vector instruction. A scoreboard bit can be used to indicate the state of the mask vector scratchpad v(0). The LMUL and ELEN values are stored with the read pointer 213 to verify the same LMUL and ELEN for the next vector instruction. The read pointer 213 is used as an identifier for the set of mask entries for the vector instruction. Mask queue 21 may include a vector instruction counter (not shown) to track the number of vector instructions using the same set of mask entries, so that fetch pointer 213 will only be relocated when the vector instruction counter reaches zero. Since each vector instruction uses the same set of mask entries, the vector instruction counter is incremented by one when a vector instruction is dispatched to the execution queue. When all uops of the vector instructions are issued from the execution queue 19 to the functional unit 20, then the vector instruction counter in the mask queue entry is decremented. When the vector instruction counter is zero, then the read pointer 213 is incremented to the number of micro-ops, each micro-op has the size of the micro-op mask. Since the same mask data is not repeatedly written into the mask queue 21, reusing the mask entries in the mask queue 21 (ie, the above description) is a more efficient use of mask data and power consumption .

在上文中，掩碼資料由相同執行佇列中的多個向量指令共用。然而，掩碼資料的共用不限於相同佇列中的向量指令。在一些其它實施例中，一個掩碼佇列的掩碼資料可由不同執行佇列中的多個向量指令共用。舉例來說，第一向量指令可分派到執行佇列19B，其中第一掩碼資料分派到掩碼佇列21B。如果第二向量指令分派到執行佇列19C且使用與執行佇列19B中的第一向量指令相同的掩碼資料（即，相同掩碼向量暫存器v(0)、相同LMUL以及相同ELEN），那麼第二向量指令還可共用掩碼佇列21B中的掩碼資料。如上文所描述的向量指令計數器還可用以倒計數第一向量指令和第二向量指令。在又一些其它實施例中，即使第一向量指令和第二向量指令並不使用相同掩碼資料，一個掩碼佇列也可共用於執行佇列之間。In the above, mask data is shared by multiple vector instructions in the same execution queue. However, sharing of mask data is not limited to vector instructions in the same queue. In some other embodiments, the mask data of a mask queue can be shared by multiple vector instructions in different execution queues. For example, a first vector instruction may be dispatched to execution queue 19B, with first mask data dispatched to mask queue 21B. If the second vector instruction is dispatched to execution queue 19C and uses the same mask data as the first vector instruction in execution queue 19B (i.e., same masked vector register v(0), same LMUL, and same ELEN) , then the second vector instruction can also share the mask data in the mask queue 21B. A vector instruction counter as described above may also be used to count down the first vector instruction and the second vector instruction. In yet other embodiments, even if the first vector instruction and the second vector instruction do not use the same mask data, a mask queue can be shared between execution queues.

根據以上實施例，掩碼資料可由上文呈現的掩碼佇列處置，而非在執行佇列的每一條目中保留512位（或掩碼暫存器的任何寬度）。在本公開中，多個向量指令的掩碼資料可存儲於掩碼佇列中。當將向量指令從執行佇列發出到功能單元以用於執行時，可從掩碼佇列存取對應掩碼資料。解碼/發射單元可停止分派向量指令到執行佇列直到掩碼佇列具有足夠掩碼條目寫入掩碼資料的為止。在實施例中，一個掩碼佇列可專用於一個執行佇列。在一些其它實施例中，掩碼佇列21可由超過一個執行佇列共用。具有相同LMUL和ELEN的相同向量掩碼可用於不同功能單元中的多個向量指令，在所述情況下掩碼佇列在多個執行佇列之間共用可節省較多區域。實施例並不通過多個執行佇列限制掩碼佇列的共用，因為掩碼佇列條目由不同執行佇列中的向量指令共用。實際上，即使不同執行佇列中的向量指令並不共用任何掩碼佇列條目，掩碼佇列也可共用。掩碼佇列條目標記有LMUL和ELEN，如果第二向量指令使用相同掩碼向量暫存器、LMUL以及ELEN，那麼掩碼佇列條目的集合（例如，圖6C的210(0)到210(7)）用於第二向量指令，否則在掩碼佇列具有足夠條目的情況下創建掩碼佇列條目的新集合。掩碼佇列條目的集合包含計數器以追蹤使用掩碼佇列的這一集合的執行佇列中的向量指令的數目。執行佇列必須追蹤掩碼佇列21的讀取指標213以存取每一發出到功能單元20的微操作的掩碼資料。當向量指令的所有微操作從執行佇列19發出到功能單元20時，接著掩碼佇列中的向量指令計數遞減1。當掩碼佇列21中的向量指令計數是零時，接著掩碼條目的集合可用以接受來自從解碼/發射單元13發出的向量指令的掩碼資料的新集合。在所有情況下，保留資源而不使專用於處置向量指令的掩碼資料的額外存儲空間。According to the above embodiments, mask data may be handled by the mask queue presented above, rather than reserving 512 bits (or whatever width the mask register is) in each entry of the execute queue. In the present disclosure, mask data for multiple vector instructions can be stored in a mask queue. When a vector instruction is issued from the execution queue to a functional unit for execution, corresponding mask data can be accessed from the mask queue. The decode/issue unit may stop dispatching vector instructions to the execution queue until the mask queue has enough mask entries to write the mask data. In an embodiment, one mask queue may be dedicated to one execution queue. In some other embodiments, mask queue 21 may be shared by more than one execution queue. The same vector mask with the same LMUL and ELEN can be used for multiple vector instructions in different functional units, in which case the mask queue is shared among multiple execution queues to save more area. Embodiments do not limit the sharing of mask queues by multiple execution queues because mask queue entries are shared by vector instructions in different execution queues. In fact, mask queues can be shared even if vector instructions in different execution queues do not share any mask queue entries. Mask queue entries are labeled LMUL and ELEN, and if the second vector instruction uses the same mask vector register, LMUL, and ELEN, then the set of mask queue entries (eg, 210(0) to 210( 7)) for the second vector instruction, otherwise create a new set of mask queue entries if the mask queue has enough entries. The set of mask queue entries includes counters to track the number of vector instructions in the execution queue that use this set of mask queues. The execution queue must track the read pointer 213 of the mask queue 21 to access the mask data for each micro-operation issued to the functional unit 20 . When all micro-ops of the vector instruction are issued from the execution queue 19 to the functional unit 20, then the count of the vector instruction in the mask queue is decremented by one. When the vector instruction count in mask queue 21 is zero, then the set of mask entries is available to accept a new set of mask data from the vector instruction issued from decode/issue unit 13 . In all cases, resources are reserved without additional storage dedicated to handling mask data for vector instructions.

根據實施例中的一個，微處理器包含解碼/發射單元和執行佇列。執行佇列包含多個佇列條目和掩碼佇列。在實施例中，執行佇列配置成將從解碼/發射單元發出且在具有多個第一資料元素的資料上操作的第一指令分配給第一佇列條目。掩碼佇列包含多個掩碼條目，且當第一指令分配給執行佇列中的第一佇列條目時，對應於第一指令的第一掩碼資料寫入到第一數目個掩碼條目，其中第一數目基於第一資料元素的寬度而判斷。According to one of the embodiments, the microprocessor includes a decode/transmit unit and an execution queue. Execution queues consist of multiple queue entries and mask queues. In an embodiment, the execution queue is configured to assign to the first queue entry a first instruction issued from the decode/issue unit and operating on data having a first plurality of data elements. The mask queue includes a plurality of mask entries, and when the first instruction is assigned to the first queue entry in the execution queue, first mask data corresponding to the first instruction is written to the first number of masks entries, wherein the first number is determined based on the width of the first data element.

根據實施例中的一個，處置向量指令的掩碼資料的方法包含至少以下步驟：將在具有多個第一資料元素的資料上操作的第一指令分派到包含掩碼佇列的執行佇列的步驟，將第一指令分配給執行佇列中的第一佇列條目的步驟，以及將對應於第一指令的第一掩碼資料寫入到掩碼佇列中的第一數目個掩碼條目的步驟，其中第一數目基於第一資料元素的寬度而判斷。According to one of the embodiments, a method of handling masked data of vector instructions comprises at least the steps of: dispatching a first instruction operating on data having a plurality of first data elements to an execution queue comprising a masked queue the step of assigning a first instruction to a first queue entry in the execution queue, and writing first mask data corresponding to the first instruction to a first number of mask entries in the mask queue The step of , wherein the first number is determined based on the width of the first data element.

前文已概述了若干實施例的特徵以使得所屬領域的技術人員可更好地理解以下詳細描述。本領域的技術人員應瞭解，他們可輕易地將本公開用作設計或修改用於實現本文中所引入的實施例的相同目的和/或達成相同優點的其它過程和結構的基礎。所屬領域的技術人員還應認識到，此類等效構造並不脫離本公開的精神和範圍，且其可在不脫離本公開的精神和範圍的情況下在本文中進行各種改變、替代和更改。The foregoing has outlined features of several embodiments so that those skilled in the art may better understand the following detailed description. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions and alterations herein without departing from the spirit and scope of the present disclosure. .

1:資料處理系統 10:微處理器 11:指令快取記憶體 12:分支預測單元 13:解碼/發射單元 14:暫存器組 15:記分板 16:讀取/寫入控制單元 17:載入/存儲單元 18:資料快取記憶體 19,19A,19B,19C,19D,19E:執行佇列 20,20A,20B,20C:功能單元 21,21A,21B,21C,21E:掩碼佇列 22:標籤陣列 24:資料陣列 30:記憶體 31:讀取匯流排 32:結果匯流排 150(0)到150(N),1510(0)到1510(N) ,1520(0)到1520(N):記分板條目 151:第一記分板 152:第二記分板 161:讀取移位器 163:寫入移位器 171:位址產生單元 190(0)到190(Q):執行佇列條目 191:有效欄位 193:執行控制資料欄位 195:位址欄位 197:吞吐時間欄位 198:微操作計數欄位 199:累積計數器 210(0)到210(M):掩碼佇列條目 211:寫入指標 213:讀取指標 1511,1521:未知欄位 1513:寫入計數欄位 1515:功能單元欄位 1523:讀取計數欄位 v(0)到v(N):向量暫存器條目 1: Data processing system 10: Microprocessor 11: Instruction cache memory 12: Branch prediction unit 13: decoding/transmitting unit 14: Register group 15: Scoreboard 16: Read/write control unit 17: Load/store unit 18:Data cache memory 19, 19A, 19B, 19C, 19D, 19E: execution queue 20, 20A, 20B, 20C: functional units 21,21A,21B,21C,21E: mask queue 22: label array 24: data array 30: Memory 31: Read the bus 32: Result bus 150(0) to 150(N), 1510(0) to 1510(N), 1520(0) to 1520(N): Scoreboard entries 151: The first scoreboard 152: Second scoreboard 161: Read shifter 163: write shifter 171: Address generation unit 190(0) to 190(Q): Execute queue entries 191: Valid fields 193:Execution Control Data Field 195:Address field 197: Throughput time field 198:Micro operation count field 199: cumulative counter 210(0) to 210(M): mask queue entries 211: Write indicator 213:Read indicators 1511,1521: Unknown field 1513: Write count field 1515: Functional unit field 1523: Read count field v(0) to v(N): vector scratchpad entries

當結合附圖閱讀時，從以下詳細描述最好地理解本公開的各方面。應注意，根據業界中的標準慣例，各種特徵未按比例繪製。實際上，為了論述清楚起見，可任意增大或減小各種特徵的尺寸。圖1是示出根據一些實施例的資料處理系統的框圖。圖2是示出根據本公開的一些實施例的記分板和暫存器組的圖式。圖3A到圖3B是示出根據本公開的一些實施例的記分板條目的各種結構的圖式。圖4是示出根據本公開的一些實施例的向量執行佇列的圖式。圖5是示出根據本公開的一些實施例的掩碼佇列的圖式。圖6A到圖6C是示出根據本公開的一些實施例的將向量指令分派到執行佇列和掩碼佇列的操作的圖式。 Aspects of the disclosure are best understood from the following detailed description when read with the accompanying figures. It should be noted that, in accordance with the standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or decreased for clarity of discussion. Figure 1 is a block diagram illustrating a data processing system according to some embodiments. Figure 2 is a diagram illustrating a scoreboard and register bank according to some embodiments of the present disclosure. 3A-3B are diagrams illustrating various structures of scoreboard entries according to some embodiments of the present disclosure. FIG. 4 is a diagram illustrating a vector execution queue according to some embodiments of the disclosure. FIG. 5 is a diagram illustrating a mask queue according to some embodiments of the present disclosure. 6A-6C are diagrams illustrating operations of dispatching vector instructions to execute queues and mask queues according to some embodiments of the present disclosure.

21:掩碼佇列 21: Mask queue

210(0)到210(M):掩碼佇列條目 210(0) to 210(M): mask queue entries

211:寫入指標 211: Write indicator

213:讀取指標 213:Read indicators

Claims

A microprocessor comprising: decoding/transmitting unit; and an execution queue comprising a plurality of queue entries and a mask queue, and assigning to the first queue entry a first instruction dispatched from the decode/issue unit, wherein the first instruction has a plurality of first data elements operate on the data, Wherein the mask queue includes a plurality of mask entries, and when the first instruction is dispatched to the first queue entry in the execution queue, the first mask corresponding to the first instruction will be Data is written to a first number of mask entries, wherein the first number is determined based on a width of the first data element.

The microprocessor of claim 1, wherein the execution queue further assigns a second instruction dispatched from the decode/issue unit to a second queue entry following the first queue entry, wherein A second instruction operates on data having a plurality of second data elements, and the mask queue writes second mask data corresponding to the second instruction to a second number of mask entries, wherein The second number is determined based on the width of the second data element.

The microprocessor of claim 2, wherein the decode/transmit unit is configured to store the mask data for the second instruction in the event that the mask queue does not have sufficient space Stop dispatching the second instruction.

The microprocessor of claim 1, wherein the first number of mask entries used to store the first mask data is further determined based on a vector length multiplier of the first instruction.

The microprocessor of claim 1, wherein said first number of said mask entries starts from a first write mask entry indicated by a write pointer, and said write pointer is at said first instructions assigned to said first queue entry of said execution queue are relocated by said first number of said mask entries, and second mask data corresponding to a second instruction is at said write pointer The relocation of is followed by writing to the second writemask entry indicated by the write pointer.

The microprocessor according to claim 1, wherein the execution queue is further configured to access the first number of the mask entries with The first instruction of the first mask data is issued to a functional unit, and the execution queue reads X number of mask entries per micro-op starting from the current read mask entry, wherein the X is an integer equal to or greater than 1, wherein the read pointer is repositioned upon completion of issuing the first instruction to the functional unit based on the width of the data element and a vector length multiplier corresponding to the first instruction.

The microprocessor of claim 6, wherein, for each micro-operation of the first instruction, the current read mask entry shifts the micro-operation mask based on the order of the micro-operations of the first instruction A factor of size, and the micro-op mask size is determined based on the width of the data element corresponding to the first instruction.

The microprocessor of claim 7, wherein the first instruction is a double-width instruction, and the micro-op mask size is modified based on the number of double-width micro-ops modified and the mask is read The timing of the data is delayed while shifting the current read mask.

The microprocessor of claim 1, wherein the execution queue further assigns a second instruction dispatched from the decode/issue unit and operating on data having a second plurality of data elements to the a second queue entry following a queue entry, and wherein the second instruction is determined to use the same mask entry as the first instruction in the first queue entry, and the first instruction and the second instruction is issued to the same functional unit.

In the microprocessor according to claim 9, the mask queue further includes a vector instruction counter configured to count the first instruction and the second instruction, and in the first instruction Or the second command is decremented by one when it is issued to the corresponding functional unit, and the read pointer of the mask queue is reset when the command counter reaches 0.

The microprocessor as claimed in claim 1, wherein the execution queues include a first execution queue corresponding to a first functional unit and a second execution queue corresponding to a second functional unit, the decode/transmit unit Further configured to dispatch a second instruction operating on data having a second plurality of data elements to a first queue entry of the second execution queue, a second masked data write corresponding to the second instruction into a second number of mask entries of the mask queue shared with the first instruction in the first entry of the execution queue, wherein the first instruction and the second instruction are respectively Issued to the first functional unit and the second functional unit through the first execution queue and the second execution queue.

The microprocessor of claim 11, wherein said first number of said mask entries corresponding to said first instruction, and said second number of said mask entries corresponding to said second instruction A code entry maps to the same mask entry in the mask queue.

A method of handling mask data for vector instructions, comprising: dispatching a first instruction operating on data having a plurality of first data elements to an execution queue comprising a mask queue; allocating the first instruction to a first queue entry in the execution queue; and writing first mask data corresponding to the first command to a first number of mask entries in the mask queue, wherein the first number is determined based on a width of the first data element .

The method as described in claim item 13, said method further comprising: assigning a second instruction operating on data having a second plurality of data elements to a second queue entry in the execution queue subsequent to the first queue entry; and Writing second mask data corresponding to the second instruction to a second number of mask entries, wherein the second number is determined based on a width of the second data element.

The method of claim 14, wherein the first number is further determined based on a first vector length multiplier of the first instruction, and the second number is further based on a second vector length of the second instruction judged by the multiplier.

The method as described in claim item 14, said method further comprising: writing the first mask data based on a write index; After the first instruction is allocated to the first queue entry, relocating the write based on the width of the first data element of the first instruction and a vector length multiple of the first instruction input indicators; and Writing the second mask material corresponding to the second instruction starts from a current write mask entry among the mask entries as indicated by the relocated write pointer, wherein the current A mask entry is written immediately after the portion of the mask entry storing said first number of said first masking material.

The method as described in claim item 13, said method further comprising: dispatching a second instruction to the execution queue; issuing the first instruction to a first functional unit with the first mask data read from the first number of mask entries; and The second instruction is issued to a first functional unit with the first mask material read from the first number of mask entries.

The method as described in claim item 17, said method further comprising: incrementing an instruction count by one when each of the first instruction and the second instruction is dispatched; and decrementing the instruction count by one when each of the first instruction and the second instruction is issued; and When micro-op count reaches 0, relocating by said first number of said first mask data determined based on a multiple of said width and vector length of said first data element of said first instruction Read metrics.

The method according to claim 13, wherein the execution queue includes a first execution queue corresponding to the first functional unit and a second execution queue corresponding to the second functional unit, the method further comprising: dispatching the first instruction and the second instruction to the first execution queue and the second execution queue respectively, and assigning the first mask data and the second mask data corresponding to the second instruction code data is written to the same masked queue shared between said first execution queue and said second execution queue; and Issue the first instruction and the second instruction from the first execution queue and the second execution queue to the first functional unit and the second functional unit.

The method as described in claim item 13, said method further comprising: reading X number of consecutive mask entries to obtain said first mask material starting from a current read mask entry indicating a read indicator until a micro-operation count reaches 0, wherein said X is equal to or greater than 1; The read pointer is shifted by a factor of a micro-op mask size based on the order of the micro-ops of the first instruction and the width of the data element, wherein the micro-op mask size is based on the determined by the width of the data element of a command; and When the uop count reaches zero, the read pointer is repositioned based on a multiple of the width of the data element of the first instruction and a vector length.