TW202409827A

TW202409827A - Hardware-based galois multiplication

Info

Publication number: TW202409827A
Application number: TW112124022A
Authority: TW
Inventors: 希薇亞梅莉塔穆勒; 笛巴普里雅洽特傑; 馬丁Ｊ布爾斯馬; 馬丁迪德柏克斯
Original assignee: 美商萬國商業機器公司
Priority date: 2022-07-05
Filing date: 2023-06-28
Publication date: 2024-03-01

Abstract

A processor includes an instruction fetch unit that fetches instructions to be executed, an architected register file including a plurality of registers for storing source and destination operands, and an execution unit for executing a Galois multiply instruction. The execution unit includes a carryless multiplier configured to multiply operands of the Galois multiply instruction to generate a product. The execution unit further includes a modular reduction circuit configured to receive the product and determine, based on a logical combination of the product and a fixed polynomial, a reduced product having a fewer number of bits than the product. The execution unit is configured to store the reduced product to the architected register file as a result of the Galois multiply instruction.

Description

Hardware-based Galois multiplication

The present invention relates generally to data processing and, in particular, to Galois multiplication.

An important aspect of data security is the use of encryption to protect data at rest (e.g., when stored in a data storage device) or data in transit (e.g., during transmission). Generally speaking, encryption converts unencrypted data (referred to as plaintext) into encrypted data (referred to as ciphertext) by using an encryption function to combine the plaintext with one or more encryption keys. To recover the plaintext from the ciphertext, the ciphertext is processed by a decryption function that uses one or more decryption keys. Encryption therefore provides data security by requiring that a party know an additional secret (i.e., the decryption key) before that party can access the protected plaintext.

In many implementations, data encryption is performed by software executing on a general-purpose processor. Implementing encryption in software makes it possible to select among different encryption functions and to easily adapt the selected encryption algorithm to plaintexts and encryption keys of various lengths, but it has the attendant disadvantage of relatively poor performance. As data sets continue to grow dramatically in the "Big Data" era, the performance achieved by software-implemented encryption can be unacceptable when encrypting large data sets. It is therefore often desirable to provide support for encryption in hardware to achieve improved performance.

Another prior art solution, particularly suited to disk encryption, is a physical encryption engine attached to the memory hierarchy. A physical encryption engine coupled to the memory hierarchy can be more expensive to implement than a hardware solution within the processor core. In addition, such solutions are generally not applicable to data in flight.

The present disclosure appreciates that multiple commonly used encryption functions, such as Advanced Encryption Standard (AES)-Galois Counter Mode (GCM) and AES-XTS (XEX-based tweaked-codebook mode with ciphertext stealing), use Galois multiplication (i.e., carryless multiplication followed by modular reduction) to logically combine encryption operands. For example, both AES-GCM and AES-XTS use Galois multiplication in the field GF(2^128) defined by the fixed polynomial g(x) = 1 + x + x^2 + x^7 + x^128. In AES-GCM, Galois multiplication is used to generate a signature of the encrypted message, which can be used during decryption to detect whether the ciphertext or the signature has been tampered with. In AES-XTS, Galois multiplication is used as part of the encryption and decryption of the message itself. The present disclosure describes various embodiments of circuits for implementing Galois multiplication in hardware, together with associated Galois multiply instructions.
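As a minimal illustration of the carryless multiplication half of a Galois multiplication, the following Python sketch multiplies two polynomials over GF(2). The function name `clmul` and the convention that bit i of an integer holds the coefficient of x^i are assumptions made for illustration; they are not taken from the disclosure.

```python
def clmul(a: int, b: int) -> int:
    """Carryless (polynomial) product of two non-negative integers over GF(2)."""
    product = 0
    shift = 0
    while b >> shift:
        if (b >> shift) & 1:
            # XOR takes the place of addition, so no carries propagate.
            product ^= a << shift
        shift += 1
    return product


# (x + 1) * (x + 1) = x^2 + 1 over GF(2), whereas the integer product 3 * 3 is 9.
print(bin(clmul(0b11, 0b11)))
print(bin(0b11 * 0b11))
```

The carryless product of two 128-bit operands can be up to 255 bits wide, which is why a modular reduction step is needed to bring the result back to 128 bits.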

In one embodiment, a processor includes an instruction fetch unit that fetches instructions to be executed, an architected register file including a plurality of registers for storing source and destination operands, and an execution unit for executing a Galois multiply instruction. The execution unit includes a carryless multiplier configured to multiply operands of the Galois multiply instruction to generate a product. The execution unit further includes a modular reduction circuit configured to receive the product and to determine, based on a logical combination of the product and a fixed polynomial, a reduced product having fewer bits than the product. The execution unit is configured to store the reduced product to the architected register file as a result of the Galois multiply instruction.

In some embodiments, the processor may form part of a larger data processing system or may be realized as a design structure embodied in a machine-readable storage device.

According to a method of data processing, an instruction fetch unit of a processor fetches instructions to be executed by the processor, including a Galois multiply instruction. Based on receiving the Galois multiply instruction, an execution unit of the processor executes it. Executing the Galois multiply instruction includes a carryless multiplier multiplying the operands of the Galois multiply instruction to generate a product. Executing the instruction further includes a modular reduction circuit receiving the product and determining, based on a logical combination of the product and a fixed polynomial, a reduced product having fewer bits than the product. The processor then stores the reduced product to an architected register file of the processor as a result of the Galois multiply instruction.

In at least one embodiment, the fixed polynomial is g(x) = 1 + x + x^2 + x^7 + x^128.

In at least one embodiment, the product of the carryless multiplication includes a high portion comprising the high-order bits of the product and a low portion comprising the low-order bits of the product, and the modular reduction circuit is configured to compute a first result equivalent to a carryless multiplication of the high portion and the fixed polynomial. The modular reduction circuit includes shift circuitry, which applies to the high portion of the product multiple different bit-position shifts corresponding to the asserted bits in the fixed polynomial, and bitwise exclusive-OR (XOR) circuitry, which logically combines the multiple instances of the high portion of the product having the different respective bit-position shifts applied by the shift circuitry.

In at least one embodiment, the shift circuitry is further configured to apply to a high portion of the first result multiple different bit-position shifts corresponding to the asserted bits in the fixed polynomial, and the bitwise XOR circuitry is further configured to logically combine the multiple instances of the high portion of the first result having the different respective bit-position shifts applied by the shift circuitry to obtain a second result. The bitwise XOR circuitry generates the reduced product based on the first result, the second result, and the low portion of the product.
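The two-pass shift-and-XOR reduction described above can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the disclosed circuit: bit i of an integer is taken to hold the coefficient of x^i, the shifts 0, 1, 2, and 7 correspond to the low-order terms 1, x, x^2, and x^7 of g(x), and the names `fold` and `reduce_product` are hypothetical.

```python
MASK128 = (1 << 128) - 1
TAPS = (0, 1, 2, 7)  # exponents of the low-order terms of g(x) = 1 + x + x^2 + x^7 + x^128


def fold(value: int) -> int:
    """XOR together copies of value shifted by each tap (value times x^7 + x^2 + x + 1)."""
    out = 0
    for shift in TAPS:
        out ^= value << shift
    return out


def reduce_product(product: int) -> int:
    """Reduce a carryless product of up to 256 bits modulo g(x) to 128 bits."""
    high = product >> 128          # high portion of the product
    low = product & MASK128        # low portion of the product
    first = fold(high)             # first result; may spill up to 7 bits past bit 127
    second = fold(first >> 128)    # second result; folds the spilled bits, fits in 14 bits
    return low ^ (first & MASK128) ^ second


print(hex(reduce_product(1 << 128)))  # x^128 reduces to x^7 + x^2 + x + 1
```

Two passes suffice because the largest shift is 7: the first fold can overflow bit 127 by at most 7 bits, and folding those few bits again cannot overflow.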

In at least one embodiment, the bitwise XOR circuitry includes at least two stages of bitwise XOR circuitry.

In at least one embodiment, the processor includes a conditional bit reversal circuit configured to conditionally reverse, based on a mode indicated by the Galois multiply instruction and prior to multiplication of the operands, the bit ordering of the bytes in one of the operands.
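A software model of such a conditional bit reversal might look like the Python sketch below. The function names and the boolean `reverse` flag standing in for the instruction's mode are hypothetical; the disclosure does not specify how the mode is encoded.

```python
def reverse_bits_in_bytes(value: int, width_bytes: int = 16) -> int:
    """Reverse the bit order inside every byte of value, leaving byte positions unchanged."""
    out = 0
    for i in range(width_bytes):
        byte = (value >> (8 * i)) & 0xFF
        reversed_byte = int(f"{byte:08b}"[::-1], 2)
        out |= reversed_byte << (8 * i)
    return out


def prepare_operand(value: int, reverse: bool) -> int:
    """Apply the reversal only when the (assumed) mode flag requests it."""
    return reverse_bits_in_bytes(value) if reverse else value


print(hex(prepare_operand(0x0180, reverse=True)))
print(hex(prepare_operand(0x0180, reverse=False)))
```

Applying the reversal twice restores the original operand, so the same step can be reused when converting a result back to the original bit ordering.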

In at least one embodiment, the carryless multiplier is a first multiply-multiply engine, the execution unit includes a second multiply-multiply engine, both the first and second multiply-multiply engines have a first data width, and the operands include first and second operands having a second data width that is an integer multiple of the first data width. In this case, the first and second multiply-multiply engines are configured to multiply subsets of the first and second operands in parallel.
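One way to picture narrower engines cooperating on wider operands is the Python sketch below, which builds a 128-bit carryless product from four 64-bit partial products. The 64-bit engine width, the grouping of the partial products into two passes, and the names `clmul64` and `clmul128` are assumptions for illustration; the disclosure states only that the engines multiply subsets of the operands in parallel.

```python
MASK64 = (1 << 64) - 1


def clmul64(a: int, b: int) -> int:
    """Model of one 64-bit carryless multiply engine."""
    product = 0
    for i in range(64):
        if (b >> i) & 1:
            product ^= a << i
    return product


def clmul128(a: int, b: int) -> int:
    """128-bit carryless product assembled from 64-bit partial products."""
    a_hi, a_lo = a >> 64, a & MASK64
    b_hi, b_lo = b >> 64, b & MASK64
    # Assumed pass 1: the two engines compute the aligned partial products side by side.
    lo_lo = clmul64(a_lo, b_lo)
    hi_hi = clmul64(a_hi, b_hi)
    # Assumed pass 2: the two engines compute the cross terms side by side.
    cross = clmul64(a_hi, b_lo) ^ clmul64(a_lo, b_hi)
    return (hi_hi << 128) ^ (cross << 64) ^ lo_lo


print(hex(clmul128(1 << 127, 1 << 127)))  # x^127 * x^127 = x^254
```

Because XOR is both associative and commutative, the partial products can be combined in any order, which is what makes this kind of decomposition straightforward to parallelize.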

Referring now to the figures, and in particular to FIG. 1, there is illustrated a high-level block diagram of a data processing system 100 in accordance with one embodiment. In some implementations, data processing system 100 can be, for example, a server computer system (such as one of the POWER series of servers available from International Business Machines Corporation), a mainframe computer system, a mobile computing device (such as a smartphone or tablet), a laptop or desktop personal computer system, or an embedded processor system.

As shown, data processing system 100 includes one or more processors 102 that process instructions and data. As is known in the art, each processor 102 may be realized as a respective integrated circuit having a semiconductor substrate in which integrated circuitry is formed. In at least some embodiments, processors 102 can generally implement any of a number of commercially available processor architectures, for example, POWER, ARM, Intel x86, NVidia, Apple silicon, etc. In the depicted example, each processor 102 includes one or more processor cores 104 and cache memory 106, which provides low-latency access to instructions and operands likely to be read and/or written by processor cores 104. Processors 102 are coupled for communication by a system interconnect 110, which in various implementations may include one or more buses, switches, bridges, and/or hybrid interconnects.

Data processing system 100 may additionally include a number of other components coupled to system interconnect 110. These components can include, for example, a memory controller 112 that controls access by processors 102 and other components of data processing system 100 to a system memory 114. In addition, data processing system 100 may include an input/output (I/O) adapter 116 for coupling one or more I/O devices to system interconnect 110, a non-volatile storage system 118, and a network adapter 120 for coupling data processing system 100 to a communication network (e.g., a wired or wireless local area network and/or the Internet).

Those skilled in the art will additionally appreciate that data processing system 100 shown in FIG. 1 can include many additional components that are not illustrated. Because such additional components are not necessary for an understanding of the described embodiments, they are not illustrated in FIG. 1 or discussed further herein. It should also be understood, however, that the enhancements described herein are applicable to data processing systems and processors of diverse architectures and are in no way limited to the generalized data processing system architecture illustrated in FIG. 1.

Referring now to FIG. 2, there is depicted a high-level block diagram of an exemplary processor core 200 in accordance with one embodiment. Processor core 200 may be utilized to implement any of processor cores 104 of FIG. 1.

In the depicted example, processor core 200 includes an instruction fetch unit 202 for fetching instructions within one or more instruction streams from storage 230 (which may include, for example, cache memory 106 and/or system memory 114 of FIG. 1). In a typical implementation, each instruction has a format defined by the instruction set architecture of processor core 200 and includes at least an operation code (opcode) field specifying the operation to be performed by processor core 200 (e.g., fixed-point or floating-point arithmetic operation, vector operation, matrix operation, logical operation, branch operation, memory access operation, encryption operation, etc.). Certain instructions may additionally include one or more operand fields that directly specify operands or that implicitly or explicitly reference one or more registers storing source operands to be used in execution of the instruction and one or more registers for storing destination operands generated by execution of the instruction. Instruction decode unit 204, which in some embodiments may be merged with instruction fetch unit 202, decodes the instructions retrieved from storage 230 by instruction fetch unit 202 and forwards branch instructions, which control the flow of execution, to branch processing unit 206. In some embodiments, the processing of branch instructions performed by branch processing unit 206 may include speculating the outcome of conditional branch instructions. The results of branch processing (both speculative and non-speculative) by branch processing unit 206 may, in turn, be utilized to redirect one or more streams of instruction fetching by instruction fetch unit 202.

Instruction decode unit 204 forwards instructions that are not branch instructions (often referred to as "sequential instructions") to mapper circuit 210. Mapper circuit 210 is responsible for assigning physical registers within the register files of processor core 200 to instructions as needed to support instruction execution. Mapper circuit 210 preferably implements register renaming. Thus, for at least some classes of instructions, mapper circuit 210 establishes transient mappings between a set of logical (or architected) registers referenced by the instructions and a larger set of physical registers within the register files of processor core 200. As a result, processor core 200 can avoid unnecessary serialization of instructions that are not data dependent, which might otherwise occur because instructions that are close together in program order reuse the limited set of architected registers.

Still referring to FIG. 2, processor core 200 additionally includes a dispatch circuit 216 configured to ensure that any data dependencies between instructions are observed and to dispatch sequential instructions as they become ready for execution. Instructions dispatched by dispatch circuit 216 are temporarily buffered in an issue queue 218 until the execution units of processor core 200 have resources available to execute them. As the appropriate execution resources become available, issue queue 218 issues instructions to the execution units of processor core 200 opportunistically and possibly out of order with respect to the original program order of the instructions.

In the depicted example, processor core 200 includes several different types of execution units for executing respective different classes of instructions. In this example, the execution units include one or more fixed-point units 220 for executing instructions that access fixed-point operands, one or more floating-point units 222 for executing instructions that access floating-point operands, one or more load-store units 224 for loading data from and storing data to storage 230, and one or more vector-scalar units 226 for executing instructions that access vector and/or scalar operands. In a typical embodiment, each execution unit is implemented as a multi-stage pipeline in which multiple instructions can be processed simultaneously at different stages of execution. Each execution unit preferably includes, or is coupled to access, at least one register file including a plurality of physical registers for temporarily buffering operands accessed in or generated by instruction execution.

Those skilled in the art will appreciate that processor core 200 may include additional components that are not illustrated, such as logic configured to manage the completion and retirement of instructions whose execution by execution units 220 to 226 has finished. Because these additional components are not necessary for an understanding of the described embodiments, they are not illustrated in FIG. 2 or discussed further herein.

Referring now to FIG. 3, there is illustrated a high-level block diagram of an exemplary execution unit of a processor 102 in accordance with one embodiment. In this example, a vector-scalar unit 226 of processor core 200 is shown in greater detail. In the embodiment of FIG. 3, vector-scalar unit 226 is configured to execute multiple different classes of instructions that operate on and generate different types of operands. For example, vector-scalar unit 226 is configured to execute a first class of instructions that operate on vector and scalar source operands and generate vector and scalar destination operands. Vector-scalar unit 226 executes instructions in this first class in functional units 302 to 312, which in the depicted embodiment include an arithmetic logic unit/rotate unit 302 for performing addition, subtraction, and rotate operations, a multiply unit 304 for performing binary multiplication, a divide unit 306 for performing binary division, an encryption unit 308 for performing encryption functions, a permute unit 310 for performing operand permutations, and a binary-coded decimal (BCD) unit 312 for performing decimal mathematical operations. The vector and scalar source operands on which these operations are performed and the vector and scalar destination operands generated by these operations are buffered in the physical registers of architected register file 300.

In this example, vector-scalar unit 226 is additionally configured to execute a second class of instructions that operate on and generate matrix operands. Vector-scalar unit 226 executes instructions in this second class in a matrix multiply-accumulate (MMA) unit 314. The matrix operands on which these operations are performed and the matrix operands generated by these operations are buffered and accumulated in the physical registers of non-architected register file 316.

In operation, vector-scalar unit 226 receives instructions from issue queue 218. If an instruction is in the first class of instructions (e.g., vector-scalar instructions), the relevant source operands for the instruction are accessed in architected register file 300 using the mapping between logical and physical registers established by mapper circuit 210 and are then forwarded with the instruction to the relevant one of functional units 302 to 312 for execution. The destination operands generated by that execution are then stored back to the physical registers of architected register file 300 determined by the mapping established by mapper circuit 210. If, on the other hand, the instruction is in the second class of instructions (e.g., MMA instructions), the instruction is forwarded to MMA unit 314 for execution with respect to operands buffered in specified physical registers of non-architected register file 316. In this case, execution by MMA unit 314 includes performing a matrix multiplication operation and then accumulating (e.g., summing) the resulting product with the contents of one or more specified physical registers in non-architected register file 316.

Referring now to FIG. 4, there is depicted a more detailed block diagram of an exemplary encryption unit 308 in accordance with one embodiment. In this example, encryption unit 308 includes circuitry for performing encryption and decryption in hardware in accordance with the Advanced Encryption Standard (AES). AES is defined, for example, in International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) standard 18033-3 (second edition, December 15, 2010), which is incorporated herein by reference. As shown, this circuitry includes AES encryption/decryption circuit 400, which combines an encryption key with plaintext to obtain ciphertext and combines a decryption key with ciphertext to obtain plaintext. Encryption unit 308 additionally includes AES key generation circuit 402, which generates the keys utilized by AES encryption/decryption circuit 400 to encrypt and decrypt data. Encryption unit 308 also includes carryless multiplication circuit 404, described in detail below. Carryless multiplication circuit 404 can be used, for example, in the process of generating the signatures that authenticate encrypted messages in AES-GCM (Galois Counter Mode). Carryless multiplication circuit 404 can also be used to perform Galois multiplication in the process of encrypting and decrypting messages, for example, in AES-XTS (XEX-based tweaked-codebook mode with ciphertext stealing).

Referring now to FIG. 5, there is illustrated a time-space diagram of an encryption and authentication process 500 using Advanced Encryption Standard-Galois Counter Mode (AES-GCM). AES-GCM is used to encrypt n plaintext blocks 502 (each a 128-bit input string) to obtain n 128-bit ciphertext blocks 522 and a signature 532 that can be used to verify that the n ciphertext blocks 522 have not been modified. AES-GCM is therefore suitable for securing data in flight.

In addition to plaintext 502, encryption process 500 begins with an initial value 504 (e.g., a random value extended to 128 bits), an encryption key K 506 (assumed here to be a 128-bit value), authentication data 508 (assumed here to be a single 128-bit block), a 128-bit authentication key H 510, and a positive integer n determined by the number of 128-bit blocks in the plaintext (and ciphertext). Initial value 504 is a 128-bit value used to initialize the value of counter 0 512. The value of counter 0 512 is iteratively incremented n times by increment function 514 to generate the sequence of 128-bit counter values identified in FIG. 5 as counter 1 512 to counter n 512. Each of these n+1 counter values is processed, together with encryption key K 506, in one of n+1 instances of AES encryption function 516 to produce a respective one of n+1 128-bit encryption outputs X0 to Xn 518.

As further shown in FIG. 5, plaintext 1 502 is logically combined with encryption output X1 518 by exclusive OR (XOR) function 520 to generate ciphertext 1 522. Plaintext 2 502 is similarly logically combined with encryption output X2 by XOR function 520 to generate ciphertext 2. This process continues for n iterations until the final ciphertext n 522 is obtained.
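The counter-mode data flow just described can be sketched in Python as follows. To keep the example self-contained, `block_encrypt` is a hash-based placeholder standing in for AES encryption function 516; it is not AES and offers no security. The function names and the plain 128-bit increment are likewise assumptions for illustration.

```python
import hashlib

MASK128 = (1 << 128) - 1


def block_encrypt(key: bytes, block: bytes) -> bytes:
    """Placeholder for the AES encryption function; NOT AES and not secure."""
    return hashlib.sha256(key + block).digest()[:16]


def ctr_xor(key: bytes, initial_value: int, blocks: list) -> list:
    """XOR each 16-byte block with the encrypted value of counter 1, 2, ..., n."""
    out = []
    counter = initial_value                       # counter 0; its output X0 is not used here
    for block in blocks:
        counter = (counter + 1) & MASK128         # assumed increment function
        x = block_encrypt(key, counter.to_bytes(16, "big"))
        out.append(bytes(p ^ k for p, k in zip(block, x)))
    return out


key = b"k" * 16
plaintext = [b"A" * 16, b"B" * 16]
ciphertext = ctr_xor(key, 7, plaintext)
print(ctr_xor(key, 7, ciphertext) == plaintext)   # the same operation decrypts
```

Because each ciphertext block is simply the plaintext block XORed with an encryption output, applying the same function to the ciphertext recovers the plaintext.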

Authentication data 508 and authentication key H 510 form the two inputs of Galois counter multiplication (GCM) 0 function 524, which produces a 128-bit authentication value Y1 526. Authentication value Y1 526 is logically combined with ciphertext 1 522 by XOR function 528 to produce the input of the next function, GCM multiply 1 function 524, which multiplies this input by authentication key H 510. This process continues iteratively for n rounds of multiplication until authentication value Yn+1 526 is obtained. Authentication value Yn+1 is logically combined by XOR function 528 with a 128-bit length indicator 530, which is formed by concatenating the length of authentication data 508 and the length of ciphertext 522. The output of this XOR function 528 provides the first input of GCM multiply n+1 function 524, which multiplies this input by authentication key H 510. The 128-bit product produced by GCM multiply n+1 function 524 is then logically combined by XOR function 528 with encryption output X0 518 to obtain signature 532, which can be used to authenticate the n blocks of ciphertext.
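The chain of multiplications and XORs described above can be modeled with the Python sketch below. It assumes that bit i of an integer is the coefficient of x^i, which differs from the bit ordering GCM itself specifies, so the output is not interoperable with a real AES-GCM implementation. The names `gf128_mul` and `signature` are hypothetical.

```python
MASK128 = (1 << 128) - 1


def gf128_mul(a: int, b: int) -> int:
    """Multiply in GF(2^128) modulo x^128 + x^7 + x^2 + x + 1 (bit i = coefficient of x^i)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> 128:                      # degree reached 128: replace x^128 by x^7 + x^2 + x + 1
            a = (a & MASK128) ^ 0x87
    return result


def signature(auth_data: int, h: int, ciphertext_blocks: list, length_indicator: int, x0: int) -> int:
    y = gf128_mul(auth_data, h)               # GCM 0 function: Y1
    for c in ciphertext_blocks:               # GCM multiply 1..n: Y2..Yn+1
        y = gf128_mul(y ^ c, h)
    y = gf128_mul(y ^ length_indicator, h)    # GCM multiply n+1
    return y ^ x0                             # final XOR with encryption output X0


tag = signature(0x1234, 0xABCDEF, [0x1111, 0x2222], 0x80, 0x5555)
tampered = signature(0x1234, 0xABCDEF, [0x1111, 0x2223], 0x80, 0x5555)
print(tag != tampered)   # a modified ciphertext block changes the signature
```

Since multiplication by a nonzero H is invertible in the field, any change to a single ciphertext block necessarily changes the final value, which is what lets the signature detect tampering.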

As noted above, performance concerns make it desirable to provide direct support in processor hardware for encryption algorithms such as the AES-GCM function depicted in FIG. 5. One complication in implementing hardware support for encryption algorithms is that different processor hardware and different encryption algorithms use different bit and byte endianness. For example, Table I below summarizes the bit and byte endianness of three common encryption functions (namely, AES, GCM, and XTS). As indicated, under AES the leading (leftmost) bit in each byte is the most significant bit, and the leading (leftmost) byte in each data word is the most significant byte. This combination of bit and byte ordering can be referred to as Big-Big Endian (BBE). GCM, on the other hand, specifies that the leading bit of each byte and the leading byte of each data word are the least significant. This endianness is referred to as Little-Little Endian (LLE). XTS employs a third endianness, referred to as Big-Little Endian (BLE), in which the leading bit of each byte is the most significant bit, but the leading byte of each data word is the least significant byte.

Table I
Function    Bit order    Byte order    Endianness
AES         Big          Big           BBE
GCM         Little       Little        LLE
XTS         Big          Little        BLE
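To make the three conventions concrete, the following Python sketch interprets the same two-byte word under each of them. The helper names `rev8` and `interpret` are hypothetical, and the two-byte word is chosen only to keep the output short.

```python
def rev8(byte: int) -> int:
    """Reverse the order of the 8 bits in one byte."""
    return int(f"{byte:08b}"[::-1], 2)


def interpret(data: bytes, bit_big: bool, byte_big: bool) -> int:
    """Numeric value of data under the given bit ordering and byte ordering."""
    bits_fixed = data if bit_big else bytes(rev8(b) for b in data)
    return int.from_bytes(bits_fixed, "big" if byte_big else "little")


word = bytes([0x80, 0x00])   # only the leading bit of the leading byte is set
print(hex(interpret(word, bit_big=True, byte_big=True)))     # BBE (AES)
print(hex(interpret(word, bit_big=False, byte_big=False)))   # LLE (GCM)
print(hex(interpret(word, bit_big=True, byte_big=False)))    # BLE (XTS)
```

Under BBE the set bit is the most significant bit of the word, under LLE it is the least significant bit, and under BLE it is the most significant bit of the least significant byte.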

Generally, each physical processor implements one of these three endianness options natively, with BBE and BLE hardware being the most common commercially. Regardless of which native endianness the hardware implements, some data must be reformatted to compensate for the endianness differences in order to provide hardware support for encryption algorithms having different endianness (e.g., AES, GCM, and XTS). In general, this data reformatting can be performed, for example, when data is passed from the AES counter encryption portion of process 500 (shown to the left of the dashed line) to the authentication portion of the process (shown to the right of the dashed line), either directly before and after each GCM function 524 or as part of GCM function 524 itself.

In cryptographic software libraries, encryption and authentication are often implemented as separate functions. In such software implementations, encrypted output X0 518 and ciphertext 1 through ciphertext n are passed through memory, and the data is reformatted automatically as part of the memory read and memory write accesses. For hardware implementations of cryptographic functions, however, the data reformatting must be applied to register file data to obtain acceptable performance. The following description details embodiments of a processor that has native BBE endianness and hardware support for the Galois multiplication employed in AES-GCM and AES-XTS. The processor accepts data in the machine's native endianness (e.g., BBE) and performs Galois multiplication on LLE or BLE formats. Based on the following description, those skilled in the art will appreciate that the disclosed techniques can also be applied to processors with native BLE endianness by implementing circuitry that reverses the ordering of the bits within each byte for GCM encryption.

Galois multiplication, as employed for example in the GCM multiply function 524 of FIG. 5, involves a carry-less multiplication followed by a modular reduction. For example, in one embodiment, the carry-less multiplication of two 128-bit input operands A and B produces a 255-bit product P. Modular reduction modulo the polynomial g(x) then reduces P to a 128-bit number through an iterative process.

For ease of understanding, the reduction of natural numbers modulo G is explained first and is then applied to modular reduction in the Galois field GF(2^128) modulo g(x). For the following reduction, the variables G and J are first defined as follows:
    G = 2^128 + (2^7 + 2^2 + 2^1 + 1) = 2^128 + J, where
    J = 2^7 + 2^2 + 2^1 + 1

Given these definitions of G and J, the first step of the modular reduction divides an arbitrary 255-bit number P into a 128-bit low part PL and a 127-bit high part PH, where PL < G. This division yields the following relationship:
    P = PH * 2^128 + PL
In the second step of the reduction, P is reduced to a number T having at most 135 bits according to the following set of relationships:
    P = PL + PH * (2^128 + (J - J)) = PL + PH * (G - J) == PL + PH * (-J) = T
where "==" means equivalence modulo G (rather than equality). In the third step of the reduction, T is again divided into a 7-bit high part (TH) and a 128-bit low part (TL), as given by:
    T = TL + TH * 2^128
Finally, in the fourth step of the reduction, T is reduced to T' in the same manner as in step 2 to obtain a result less than G through the following relationship:
    T' = TL + TH * (-J)
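The four steps above can be checked numerically with a short Python sketch. The function name `reduce_mod_G` is hypothetical, and the sketch assumes signed integers with floor division, so TH may be negative. It verifies only that T' is congruent to P modulo G. With ordinary integers T' can still slightly exceed G and need a final correction, an issue that does not arise in the Galois field, where addition and subtraction are both XOR.

```python
J = 2**7 + 2**2 + 2**1 + 1
G = 2**128 + J


def reduce_mod_G(p: int) -> int:
    """Two rounds of 'replace 2^128 by -J'; returns a value congruent to p mod G."""
    ph, pl = divmod(p, 2**128)   # step 1: P = PH * 2^128 + PL
    t = pl - ph * J              # step 2: T = PL + PH * (-J)
    th, tl = divmod(t, 2**128)   # step 3: T = TL + TH * 2^128 (floor division)
    return tl - th * J           # step 4: T' = TL + TH * (-J)


p = 2**255 - 1
assert reduce_mod_G(p) % G == p % G
```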

With this understanding of the modular reduction process, reference is now made to FIG. 6, which is a block diagram of a modular reduction circuit 600 for performing modular reduction in a Galois field with polynomial g(x), according to one embodiment.

In a Galois field, multiplication is a carry-less operation, and addition and subtraction are performed with a bitwise XOR. The carry-less multiplication of two w-bit numbers produces a (2w-1)-bit product. Thus, for example, the multiplication of two 128-bit numbers produces a product P having 255 bits. As described above, in the first step of the reduction, product P is divided into a 128-bit low-order part PL 602 and a 127-bit high-order part PH 604. PH 604 forms an input of carry-less multiplier 606a, which multiplies PH 604 by the equivalent of (-J). For GF(2^128) in LLE format, this multiplier is R = '1110.0001'. The carry-less multiplication of PH 604 and R 608 produces a 134-bit product T 610, which is divided into a 128-bit low-order part TL 610a and a 6-bit high-order part TH 610b. A second carry-less multiplier 606b multiplies TH 610b by R 608, producing a 13-bit product T' 614. Bitwise XOR circuit 612 performs a left-aligned bitwise exclusive OR of PL 602, TL 610a, and T' 614 to produce the fully reduced 128-bit product M 618. In various embodiments, carry-less multipliers 606a and 606b can be implemented with full-size 128×128-bit LLE multipliers or with a 127×8-bit multiplier and a 6×8-bit multiplier, respectively.
When a different 8-bit value of R is applied, the modular reduction circuit 600 of FIG. 6 can also be used for other Galois fields GF(2^k) with polynomial f(x), as long as k is less than or equal to 128 and f(x) has the form f(x) = x^k + a7 * x^7 + a6 * x^6 + ... + a1 * x^1 + a0, where all a(i) are in the set {0, 1, -1}.
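The two-stage reduction of FIG. 6 can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the circuit itself: the function names are hypothetical, and polynomials are held in ordinary Python integers where bit i is the coefficient of x^i, so the constant appears as 0x87 instead of the LLE pattern '1110.0001' (its bit-reversed form). A plain shift-and-XOR long division serves as the reference.

```python
MASK128 = (1 << 128) - 1
J = 0x87  # x^7 + x^2 + x + 1, with integer bit i holding the x^i coefficient


def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) multiplication."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_two_stage(p: int) -> int:
    """Reduce a product of up to 255 bits modulo x^128 + J in two stages."""
    pl, ph = p & MASK128, p >> 128      # PL (128 bits), PH (up to 127 bits)
    t = clmul(ph, J)                    # T, up to 134 bits
    tl, th = t & MASK128, t >> 128      # TL (128 bits), TH (up to 6 bits)
    t2 = clmul(th, J)                   # T', up to 13 bits
    return pl ^ tl ^ t2                 # M = PL xor TL xor T'


def reduce_reference(p: int) -> int:
    """Bit-by-bit polynomial long division by g(x) = x^128 + J."""
    g = (1 << 128) | J
    for i in range(p.bit_length() - 1, 127, -1):
        if (p >> i) & 1:
            p ^= g << (i - 128)         # cancel the x^i term
    return p
```

Because TH has at most 6 bits, the second product T' has at most 13 bits and cannot overflow 128 bits again, which is why two stages suffice.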

Referring now to FIG. 7, a further optimization of the carry-less multiplication according to one embodiment is illustrated. As shown, the carry-less multiplication of the 127-bit PH 604 by the 8-bit constant R 608 would produce eight partial products PP 700a to 700h. However, because R 608 is a constant containing exactly four ones (i.e., "1110.0001"), the carry-less multiplication of PH 604 by R 608 can discard the four partial products PP 700d to 700g, which are multiplied by zero. The product can therefore be obtained from the remaining partial products PP 700a to PP 700c and PP 700h using a four-input bitwise XOR circuit 702 that performs the following exclusive OR operation:
    PH * R = PH XOR (PH >> 1) XOR (PH >> 2) XOR (PH >> 7)
where ">>" indicates a right shift by the specified number of bit positions. This optimization can reduce the hardware cost of implementing the carry-less multiplier 606a of FIG. 6 by more than a factor of two.
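The equivalence between multiplying by the constant and XORing four shifted copies can be checked with the following sketch. It assumes the conventional integer representation in which bit i is the coefficient of x^i, so the constant is 0x87 and the shifts are left shifts. In the LLE layout of the description, where the leading bit is the lowest-degree coefficient, the same operation appears as right shifts by 1, 2, and 7. The function names are hypothetical.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) multiplication."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def mul_by_constant(ph: int) -> int:
    """PH times (x^7 + x^2 + x + 1) as the XOR of four shifted copies of PH."""
    return ph ^ (ph << 1) ^ (ph << 2) ^ (ph << 7)


ph = (1 << 127) - 1  # largest 127-bit value
assert mul_by_constant(ph) == clmul(ph, 0x87)
```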

FIG. 8 depicts the carry-less multiplication technique of FIG. 7 applied to the modular reduction circuit 600 of FIG. 6. In particular, FIG. 8 shows that the multiplication PH * R by carry-less multiplier 606a results in four reduction terms: PH 802, PH 804 shifted right by 1 bit position, PH 806 shifted right by 2 bit positions, and PH 808 shifted right by 7 bit positions. The most significant bit of PH 806 and the most significant 6 bits of PH 808 are combined to form TH 610b, which is the high-order part of the first reduction result. More precisely:
    TH(0) = PH(126) XOR PH(121)
    TH(1:5) = PH(122:126)

The second reduction result T' = TH * R 614 can similarly be expanded into four terms, namely, TH 610b, TH 812 shifted right by 1 bit position, TH 814 shifted right by 2 bit positions, and TH 816 shifted right by 7 bit positions. These four reduction terms, together with the four reduction terms of reduction result T 610 and PL 602, form the set of all reduction terms 818, which are logically combined by bitwise XOR circuit 612, according to their values at each of the 128 bit positions, to produce the fully reduced product M 618. Details of an embodiment of a modular reduction circuit that implements the carry-less multiplication technique of FIG. 8 are described below with reference to FIG. 12.

Referring now to FIG. 9, a schematic representation of an exemplary 128-bit by 128-bit multiplication array 900 is illustrated, showing the locations of the high-order and low-order product bits involved in the carry-less multiplication technique of FIG. 8. As shown, the high-order product bits 902 that form TH 610b and the low-order product bits 904 that are combined with TH 610b to form T' 614 are located at the extremes of multiplication array 900, where the gate depth is shallowest. Consequently, the computation of product bits 902 and the logical combination of product bits 902 with product bits 904 are not limiting factors in the execution of the carry-less multiplication performed by modular reduction circuit 600.

Referring now to FIG. 10, a high-level block diagram of a carry-less multiplication circuit 404 according to one embodiment is depicted. Carry-less multiplication circuit 404, which implements Galois multiplication, supports both GCM and XTS and is connected to the architected register file 300, which holds operand data in the native BBE format. As mentioned above, GCM and XTS interpret data in the LLE format and the BLE format, respectively.

In the depicted embodiment, carry-less multiplication circuit 404 includes two conditional bit-reversal circuits 1002a, 1002b at its input ports and an additional conditional bit-reversal circuit 1002c at its output port. Conditional bit-reversal circuits 1002a, 1002b are each coupled to receive a respective one of two 128-bit operands A and B from registers XA and XB in architected register file 300 and conditionally reverse the ordering of the bits within the bytes of operands A and B based on, for example, a mode input indicating whether carry-less multiplication circuit 404 is being used to perform a multiplication for GCM or for XTS. Conditional bit-reversal circuits 1002a, 1002b output a 128-bit multiplicand 1004a and a 128-bit multiplier 1004b, respectively.

Carry-less multiplication circuit 404 includes a carry-less multiplier 1006 that performs a 128-bit by 128-bit multiplication of multiplicand 1004a and multiplier 1004b to produce a 255-bit product P 1008. Product P 1008 is received by a modular reduction circuit 1010, one embodiment of which is shown in greater detail in FIG. 12, and is reduced to the 128-bit reduced product M 618. Reduced product M 618 is then stored back to register XT of architected register file 300 after a possible bit-order reversal by conditional bit-reversal circuit 1002c.

In the illustrated embodiment, architected register file 300 and the data operands in carry-less multiplier 1006 use the BBE data format. The LLE carry-less multiplication required for GCM can be performed in the BBE-formatted carry-less multiplier 1006 simply by performing a BBE-formatted multiplication followed by a left shift of product P 1008 by one bit position. Because product P 1008 flows directly into modular reduction circuit 1010, the 1-bit left shift can conveniently be implemented by modular reduction circuit 1010.
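The claim that an LLE product equals a BBE product shifted left by one bit can be checked with the sketch below. It rests on stated assumptions: registers are modeled as Python integers whose most significant bit is the leading (leftmost) register bit, an LLE value treats that leading bit as the x^0 coefficient, and the product is held in a 256-bit register. The names `rev`, `lle_product`, and `bbe_product_shifted` are hypothetical.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less multiplication on integers (bit i is the x^i coefficient)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def rev(v: int, n: int) -> int:
    """Reverse the order of the n bits of v."""
    return int(format(v, "0{}b".format(n))[::-1], 2)


def lle_product(a: int, b: int) -> int:
    """Multiply two 128-bit registers under the LLE interpretation."""
    poly = clmul(rev(a, 128), rev(b, 128))  # leading register bit becomes x^0
    return rev(poly, 256)                   # store the product LLE in 256 bits


def bbe_product_shifted(a: int, b: int) -> int:
    """Plain (BBE) carry-less product followed by a 1-bit left shift."""
    return clmul(a, b) << 1


a = 0xFEDCBA98765432100123456789ABCDEF
b = 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0
assert lle_product(a, b) == bbe_product_shifted(a, b)
```

The shift arises because reversing two 128-bit operands reverses their 255-bit product, and a 255-bit value reversed inside a 256-bit register lands one position off.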

For XTS, on the other hand, operands A and B are reformatted by conditional bit-reversal circuits 1002a, 1002b to match the LLE data format of GCM. (As noted above in Table I, the BLE format of XTS and the LLE format of GCM share the same byte ordering but have different bit orderings.) Conditional bit-reversal circuit 1002c similarly reverses the bit ordering of the reduced product M 1012 of an XTS multiplication before reduced product M 1012 is written to register XT in architected register file 300.

Referring now to FIG. 11, a more detailed block diagram of an exemplary embodiment of a conditional bit-reversal circuit 1100 is depicted, which can be used to implement any of the conditional bit-reversal circuits 1002a to 1002c of FIG. 10. In this example, conditional bit-reversal circuit 1100 has a 128-bit (i.e., 16-byte) input 1102 coupled to one input of a two-input 128-bit multiplexer 1106. Each of the sixteen bytes of 128-bit input 1102 is additionally coupled to a respective one of sixteen bit-reversal circuits 1104, which can be implemented, for example, by wiring that reverses the ordering of the bits in the associated byte of input 1102. The outputs of all bit-reversal circuits 1104 together form the second 128-bit input of multiplexer 1106.

Multiplexer 1106 selects between the data presented at its two inputs based on a mode input indicating whether big-endian or little-endian bit ordering is to be applied. The mode input can be determined, for example, by a corresponding field of the Galois multiply instruction, as described in greater detail below with reference to FIG. 13. The data selected by multiplexer 1106 is presented on 128-bit output 1108.
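A behavioral model of the conditional bit-reversal circuit can be sketched as follows. The function names and the boolean `reverse` parameter (standing in for the mode input) are hypothetical. The model computes both multiplexer inputs and then selects one, mirroring the structure of FIG. 11 rather than aiming for efficiency.

```python
def reverse_bits_in_byte(b: int) -> int:
    """Reverse the order of the 8 bits of a byte (bit-reversal circuit 1104)."""
    return int(format(b, "08b")[::-1], 2)


def conditional_bit_reverse(data: bytes, reverse: bool) -> bytes:
    """Model of circuit 1100: per-byte bit reversal selected by a mode input."""
    if len(data) != 16:
        raise ValueError("expected a 128-bit (16-byte) input")
    straight = data                                           # first mux input
    reversed_ = bytes(reverse_bits_in_byte(b) for b in data)  # second mux input
    return reversed_ if reverse else straight                 # multiplexer 1106


block = bytes([0x01] + [0x00] * 15)
assert conditional_bit_reverse(block, True) == bytes([0x80] + [0x00] * 15)
assert conditional_bit_reverse(block, False) == block
```

Note that the byte positions never change; only the bit order within each byte does, which is what distinguishes BLE from LLE in Table I.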

Referring now to FIG. 12, a more detailed block diagram of an exemplary embodiment of the modular reduction circuit 1010 of FIG. 10 is illustrated. This exemplary circuit compactly implements the two-stage reduction described with respect to FIG. 6 using the optimization discussed with reference to FIG. 8.

Modular reduction circuit 1010 receives as input the 255-bit product P 1008 generated by carry-less multiplier 1006, as shown in FIG. 10. As discussed above with reference to FIG. 6, product P 1008 includes a 128-bit low part PL 602 and a 127-bit high part PH 604. After preliminary bit shifting and padding, modular reduction circuit 1010 reduces product P 1008 to the reduced product M 618 using four bitwise XOR circuits 1200, 1202, 1204, and 612, which operate on left-aligned inputs.

Bitwise XOR circuit 1200 includes three 128-bit inputs coupled to receive the first through third reduction terms 818 of FIG. 8, namely, PL 602, PH 802 (which is PH 604 padded by padding circuit 1210 with a trailing zero in bit position 127), and PH 804 (which is PH 604 shifted right by one bit position by shift circuit 1212). The bitwise exclusive OR of these three inputs produces the first 128-bit input of bitwise XOR circuit 612.

Bitwise XOR circuit 1202 similarly logically combines the fourth and fifth reduction terms 818, namely, PH 806 (which is PH 604 shifted right by two bit positions by shift circuit 1214) and PH 808 (which is PH 604 shifted right by seven bit positions by shift circuit 1216). The exclusive OR of these two inputs generates a 134-bit result. Bits 0:127 of this XOR result form TL 610a, which is the second 128-bit input of bitwise XOR circuit 612. Bits 128:133 of the XOR result generated by bitwise XOR circuit 1202 form TH 610b, which is passed to bitwise XOR circuit 1204 and to shift circuits 1216, 1218, and 1220.

Bitwise XOR circuit 1204 logically combines the sixth through ninth reduction terms 818 to form the second reduction result T' 614. That is, bitwise XOR circuit 1204 logically combines TH 610b, TH 810 (which is TH 610b shifted right by one bit position by shift circuit 1216), TH 812 (which is TH 610b shifted right by two bit positions by shift circuit 1218), and TH 814 (which is TH 610b shifted right by seven bit positions by shift circuit 1220), the shift amounts of one, two, and seven following from the ones in constant R 608. The bitwise exclusive OR of these four inputs, which is the 13-bit second reduction result T' 614, forms the third input of bitwise XOR circuit 612. Bitwise XOR circuit 612 performs a left-aligned bitwise exclusive OR of its three inputs to produce the reduced product M 618.

Referring now to FIG. 13, an exemplary Galois multiply instruction 1300 according to one embodiment is illustrated. In a preferred embodiment, a single Galois multiply instruction 1300 can be executed in encryption unit 308 of vector-scalar unit 226 to cause the carry-less multiplication circuit 404 of FIG. 10 to perform a carry-less multiplication and a modular reduction modulo the polynomial g(x), as described with reference to FIG. 12.

In this example, Galois multiply instruction 1300 includes an opcode field 1302 that specifies an architecture-specific opcode indicating a Galois carry-less multiplication and modular reduction. Galois multiply instruction 1300 additionally includes an operand field 1304 that directly or indirectly indicates one or more registers XA, XB, XT of architected register file 300 used to store the source and destination operands of the Galois carry-less multiplication and modular reduction operation. Finally, Galois multiply instruction 1300 includes a mode field 1306 that specifies the data format (e.g., BLE/XTS or LLE/GCM) applicable to the Galois carry-less multiplication and modular reduction operation. As noted above, the setting of mode field 1306 can be used by conditional bit-reversal circuit 1100 to select whether to apply a bit-order reversal to the bytes of an input data word.

The foregoing description has described aspects of the disclosed inventions with reference to a carry-less multiplication circuit 404 having a wide (e.g., 128-bit by 128-bit in the present disclosure) carry-less multiplier 1006. Some commercial processors, however, may not include a wide carry-less multiplier, but may instead include multiple narrower carry-less multipliers that operate in parallel on smaller data elements. For example, FIG. 14 illustrates a prior art single-instruction multiple-data (SIMD) multiply-multiply engine 1400, which includes two 64-bit carry-less multipliers 1406, 1408 that operate in tandem on 128-bit SIMD operands A 1402 and B 1404. In this example, a SIMD multiply-multiply instruction causes carry-less multiplier 1406 to multiply the 64-bit high parts of SIMD operands 1402, 1404 to generate the 128-bit high part PH 1414 of the product, causes carry-less multiplier 1408 to multiply the 64-bit low parts of SIMD operands 1402, 1404 to generate the 128-bit low part PL 1416 of the product, and causes bitwise XOR circuit 1410 to perform a 128-bit bitwise exclusive OR of PH 1414 and PL 1416 to produce product Q 1412. Of course, this conventional SIMD architecture can be extended with additional lanes to support larger data widths (e.g., 256-bit operands).

According to one or more embodiments, the conventional SIMD architecture given in FIG. 14 can also be extended to support Galois multiplication as described above. For example, FIG. 15 depicts an exemplary SIMD carry-less multiplication circuit 1500 that supports Galois multiplication as described herein.

SIMD carry-less multiplication circuit 1500 includes conditional bit-reversal circuits 1506 and 1508, which conditionally reverse the ordering of the bits in each byte of SIMD operands A 1502 and B 1504, respectively. Each of conditional bit-reversal circuits 1506 and 1508 can be implemented by a conditional bit-reversal circuit 1100 as described above with reference to FIG. 11.

SIMD carry-less multiplication circuit 1500 additionally includes two 128-bit SIMD multiply-multiply engines 1512, 1514, each of which can be implemented, for example, with the prior art multiply-multiply engine 1400 of FIG. 14. Multiply-multiply engine 1512 multiplies operands 1502, 1504 (i.e., AH,AL and BH,BL) to generate a 255-bit product P1 including a 127-bit high part P1H and a 128-bit low part P1L. The exclusive OR of P1H and P1L is result Q1. To compensate for the data format difference between BBE and LLE, bit 0 is dropped from P1H to achieve a 1-bit left shift. Multiply-multiply engine 1514 has a first input coupled to receive SIMD operand A 1502 (i.e., AH,AL) and a second input coupled to receive the output of a swap doubleword circuit 1510, which swaps the high and low 64-bit doublewords of SIMD operand B 1504 (i.e., BL,BH). Multiply-multiply engine 1514 multiplies these two inputs in carry-less fashion and combines the products to generate a 128-bit result Q2 (where Q2 corresponds to AL*BH + AH*BL, that is, to the 128-bit result of a 64-bit multiply-multiply operation with a swapped B operand).

SIMD carry-less multiplication circuit 1500 additionally includes a reduction circuit 1516, which reduces the 255-bit product P derived from result Q2 and partial products P1L, P1H to 128 bits, and a multiplexer 1518, which selects between the 128-bit result Q1 and the 128-bit output of reduction circuit 1516 as the result 1520 of SIMD carry-less multiplication circuit 1500. As further shown in FIG. 15, reduction circuit 1516 logically combines products P1H, P1L, and Q2 using a left-aligned 255-bit bitwise XOR circuit 1534 that takes the following three inputs: the 127-bit P1H, a 255-bit value obtained by applying a 127-bit right shift to P1L using shift circuit 1530, and a 191-bit value obtained by applying a 63-bit right shift to Q2 using shift circuit 1532. These three left-aligned inputs are logically combined by bitwise XOR circuit 1534 to generate the 255-bit carry-less product P. This 255-bit carry-less product P is reduced to the 128-bit reduced product M by modular reduction circuit 1010, as previously described with reference to FIG. 10. The bit ordering within each byte of reduced product M is conditionally reversed by conditional bit-reversal circuit 1536 to produce a 128-bit output that forms the second input of multiplexer 1518.
If SIMD carry-less multiplication circuit 1500 is being used to execute a SIMD multiply instruction, multiplexer 1518 selects result Q1 as result 1520, and if SIMD carry-less multiplication circuit 1500 is being used to execute a Galois multiply instruction, multiplexer 1518 selects the output of reduction circuit 1516 as result 1520.
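The way a 128-bit by 128-bit carry-less product is assembled from 64-bit lane products can be sketched as follows. This is a simplified model with hypothetical names: it uses ordinary integers in which bit i is the coefficient of x^i, so the pieces are combined with left shifts of 128 and 64 bits, and it leaves out the left alignment, the BBE/LLE 1-bit shift, and the conditional bit reversal described above.

```python
MASK64 = (1 << 64) - 1


def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) multiplication."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def product_from_lanes(a: int, b: int) -> int:
    """Build the 255-bit product of two 128-bit operands from 64-bit products."""
    ah, al = a >> 64, a & MASK64
    bh, bl = b >> 64, b & MASK64
    p1h = clmul(ah, bh)                  # first engine, high lane
    p1l = clmul(al, bl)                  # first engine, low lane
    q2 = clmul(ah, bl) ^ clmul(al, bh)   # second engine, B doublewords swapped
    return (p1h << 128) ^ (q2 << 64) ^ p1l


a = 0xFEDCBA98765432100123456789ABCDEF
b = 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0
assert product_from_lanes(a, b) == clmul(a, b)
```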

Those skilled in the art will appreciate that SIMD carry-less multiplication circuit 1500 can employ optimizations (including additional optimizations) to reduce circuit size. For example, SIMD carry-less multiplication circuit 1500 can employ the Karatsuba algorithm to reduce the computation of AH*BL + AL*BH to (AH+AL) * (BH+BL) + PH + PL. This simplified form can be computed with one 64-bit carry-less multiplier, two 64-bit bitwise XOR circuits operating on operands A and B, and a three-way 128-bit XOR of the three product terms. Further optimization can be achieved by combining conditional bit-reversal circuits 1506, 1508 with swap doubleword circuit 1510 and by combining conditional bit-reversal circuit 1536 with multiplexer 1518. In general, if the Karatsuba algorithm is implemented for 128-bit SIMD, the carry-less multiplication circuit 404 of FIG. 10 can be implemented in no more than 1.5 times the area of the multiply-multiply engine 1400 of FIG. 14, and the SIMD carry-less multiplication circuit 1500 of FIG. 15 can be implemented in no more than twice the area of multiply-multiply engine 1400. If 64-bit multiply-multiply engines are employed in a 256-bit or wider SIMD engine, 128-bit Galois multiplication can be implemented with even less hardware overhead.
The two 128-bit multiply-multiply engines in such a 256-bit SIMD engine can be combined along the lines of FIG. 15 to perform a 128-bit Galois multiplication on either the high or the low 128-bit SIMD element. The only overhead is the multiplexing and conditional bit reversal of the operands of the Galois multiplication and reduction circuit 1516.
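The Karatsuba identity mentioned above can be checked with a brief sketch. The names are hypothetical. Here "+" on polynomials over GF(2) is XOR, and PH and PL denote the lane products AH*BH and AL*BL that the first engine has already computed.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) multiplication."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def cross_terms_karatsuba(ah: int, al: int, bh: int, bl: int) -> int:
    """AH*BL + AL*BH using a single extra 64-bit carry-less multiplication."""
    ph = clmul(ah, bh)                       # already available as P1H
    pl = clmul(al, bl)                       # already available as P1L
    return clmul(ah ^ al, bh ^ bl) ^ ph ^ pl


ah, al = 0xFEDCBA9876543210, 0x0123456789ABCDEF
bh, bl = 0x0F1E2D3C4B5A6978, 0x8796A5B4C3D2E1F0
assert cross_terms_karatsuba(ah, al, bh, bl) == clmul(ah, bl) ^ clmul(al, bh)
```

Because addition is XOR, the operand sums AH+AL and BH+BL stay 64 bits wide, so the extra multiplier needs no additional operand bit, unlike integer Karatsuba.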

Referring now to FIG. 16, a high-level logical flowchart of an exemplary method of Galois multiplication according to one embodiment is depicted. For ease of understanding, the process of FIG. 16 is described with reference to the embodiment of carry-less multiplication circuit 404 given in FIG. 10.

The illustrated process begins at block 1600 and then proceeds to block 1602, which illustrates vector-scalar unit 226 of processor core 200 receiving an instruction that requires a Galois multiplication, such as the Galois multiply instruction 1300 of FIG. 13. In response to receiving the instruction, vector-scalar unit 226 reads operands A and B from architected register file 300 and passes operands A and B to the input ports of carry-less multiplication circuit 404 (block 1604). At block 1606, vector-scalar unit 226 determines whether the instruction specifies (e.g., in mode field 1306) the XTS mode, which requires the BLE data format rather than the LLE data format. In response to an affirmative determination at block 1606, vector-scalar unit 226 uses conditional bit-reversal circuits 1002a, 1002b of carry-less multiplication circuit 404 to reverse the ordering of the bits in each byte of operands A and B (block 1608).

Following block 1608, or in response to a negative determination at block 1606, vector-scalar unit 226 performs a carry-less multiplication of operands A and B using carry-less multiplier 1006 to obtain 255-bit product P 1008 (block 1610). The multiplication uses the BBE data format. At block 1612, vector-scalar unit 226 reduces product P in accordance with polynomial g(x) through two or more bitwise XOR stages to obtain reduced product M 618, as shown in FIG. 12. As shown at blocks 1614 to 1616, if the instruction specifies the XTS mode requiring the BLE data format rather than the LLE data format, vector-scalar unit 226 again performs a bit reversal, using conditional bit reversal circuit 1002c to reverse the ordering of the bits within each byte of reduced product M 618. Following block 1616, or if the instruction does not specify the XTS mode, vector-scalar unit 226 writes reduced product M back to register file 300 (block 1618). Following block 1618, the process of FIG. 16 ends at block 1620.
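
The flow of blocks 1604 to 1618 can be sketched in software as follows. This is only an illustrative model, not the disclosed circuit: it assumes that bit i of a Python integer holds the coefficient of x^i (the BLE, LLE, and BBE formats of the description are not modeled), it uses g(x) = 1 + x + x^2 + x^7 + x^128 as the fixed polynomial, and the names `clmul`, `reverse_bits_in_bytes`, `reduce_mod_g`, `galois_multiply`, and `xts_mode` are hypothetical.

```python
# Illustrative software model of the FIG. 16 flow (assumed conventions, see lead-in).
G = (1 << 128) | 0x87  # g(x) = x^128 + x^7 + x^2 + x + 1, bit i = coefficient of x^i


def clmul(a: int, b: int) -> int:
    """Carry-less multiplication: partial products are combined with XOR."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    return product


def reverse_bits_in_bytes(value: int, nbytes: int = 16) -> int:
    """Reverse the bit order inside each byte; the byte order is unchanged."""
    out = 0
    for i in range(nbytes):
        byte = (value >> (8 * i)) & 0xFF
        out |= int(f"{byte:08b}"[::-1], 2) << (8 * i)
    return out


def reduce_mod_g(product: int) -> int:
    """Reduce a product modulo g(x) by cancelling set bits from the top down."""
    for i in range(product.bit_length() - 1, 127, -1):
        if (product >> i) & 1:
            product ^= G << (i - 128)
    return product


def galois_multiply(a: int, b: int, xts_mode: bool = False) -> int:
    """Blocks 1604-1618: optional bit reversal, multiply, reduce, optional reversal."""
    if xts_mode:  # blocks 1606-1608
        a, b = reverse_bits_in_bytes(a), reverse_bits_in_bytes(b)
    product = clmul(a, b)           # block 1610: up to 255 bits
    reduced = reduce_mod_g(product)  # block 1612: 128 bits
    if xts_mode:  # blocks 1614-1616
        reduced = reverse_bits_in_bytes(reduced)
    return reduced                   # block 1618: value written back
```

In this model the bit reversal is applied symmetrically to the inputs and to the result, so the multiplier and the reduction logic are identical in both modes, which mirrors the role of the conditional bit reversal circuits in the description.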

Referring now to FIG. 17, there is depicted a block diagram of an exemplary design flow 1700 used, for example, in semiconductor IC logic design, simulation, test, layout, and manufacture. Design flow 1700 includes processes, machines, and/or mechanisms for processing design structures or devices to generate logically or otherwise functionally equivalent representations of the design structures and/or devices described above and shown herein. The design structures processed and/or generated by design flow 1700 may be encoded on machine-readable transmission or storage media to include data and/or instructions that, when executed or otherwise processed on a data processing system, generate a logically, structurally, mechanically, or otherwise functionally equivalent representation of hardware components, circuits, devices, or systems. Machines include, but are not limited to, any machine used in an IC design process, such as designing, manufacturing, or simulating a circuit, component, device, or system. For example, machines may include: lithography machines, machines and/or equipment for generating masks (e.g., electron beam writers), computers or equipment for simulating design structures, any apparatus used in the manufacturing or test process, or any machines for programming functionally equivalent representations of the design structures into any medium (e.g., a machine for programming a programmable gate array).

Design flow 1700 may vary depending on the type of representation being designed. For example, a design flow 1700 for building an application-specific IC (ASIC) may differ from a design flow 1700 for designing a standard component or from a design flow 1700 for instantiating the design into a programmable array, for example, a programmable gate array (PGA) or a field programmable gate array (FPGA) offered by Altera® Inc. or Xilinx® Inc.

FIG. 17 illustrates multiple such design structures, including an input design structure 1720 that is preferably processed by a design process 1710. Design structure 1720 may be a logical simulation design structure generated and processed by design process 1710 to produce a logically equivalent functional representation of a hardware device. Design structure 1720 may also or alternatively comprise data and/or program instructions that, when processed by design process 1710, generate a functional representation of the physical structure of a hardware device. Whether representing functional and/or structural design features, design structure 1720 may be generated using electronic computer-aided design (ECAD), such as that implemented by a core developer/designer. When encoded on a machine-readable data transmission, gate array, or storage medium, design structure 1720 may be accessed and processed by one or more hardware and/or software modules within design process 1710 to simulate or otherwise functionally represent an electronic component, circuit, electronic or logic module, apparatus, device, or system such as those shown herein. As such, design structure 1720 may comprise files or other data structures, including human and/or machine-readable source code, compiled structures, and computer-executable code structures, that, when processed by a design or simulation data processing system, functionally simulate or otherwise represent circuits or other levels of hardware logic design. Such data structures may include hardware description language (HDL) design entities or other data structures conforming to and/or compatible with lower-level HDL design languages such as Verilog and VHDL, and/or higher-level design languages such as C or C++.

Design process 1710 preferably employs and incorporates hardware and/or software modules for synthesizing, translating, or otherwise processing a design/simulation functional equivalent of the components, circuits, devices, or logic structures shown herein to generate a netlist 1780, which may contain design structures such as design structure 1720. Netlist 1780 may comprise, for example, compiled or otherwise processed data structures representing a list of wires, discrete components, logic gates, control circuits, I/O devices, models, etc. that describes the connections to other elements and circuits in an integrated circuit design. Netlist 1780 may be synthesized using an iterative process in which netlist 1780 is resynthesized one or more times depending on the design specifications and parameters for the device. As with other design structure types described herein, netlist 1780 may be recorded on a machine-readable storage medium or programmed into a programmable gate array. The medium may be a non-volatile storage medium such as a magnetic or optical disk drive, a programmable gate array, a compact flash card, or other flash memory. Additionally, or in the alternative, the medium may be a system or cache memory, or buffer space.

Design process 1710 may include hardware and software modules for processing a variety of input data structure types, including netlist 1780. Such data structure types may reside, for example, within library elements 1730 and include a set of commonly used elements, circuits, and devices, including models, layouts, and symbolic representations, for a given manufacturing technology (e.g., different technology nodes: 32 nm, 45 nm, 170 nm, etc.). The data structure types may further include design specifications 1740, characterization data 1750, verification data 1760, design rules 1790, and test data files 1785, which may include input test patterns, output test results, and other testing information. Design process 1710 may further include, for example, standard mechanical design processes such as stress analysis, thermal analysis, mechanical event simulation, and process simulation for operations such as casting, molding, and die press forming. One of ordinary skill in the art of mechanical design can appreciate the extent of possible mechanical design tools and applications used in design process 1710 without deviating from the scope and spirit of the invention. Design process 1710 may also include modules for performing standard circuit design processes such as timing analysis, verification, design rule checking, and place and route operations.

Design process 1710 employs and incorporates logical and physical design tools, such as HDL compilers and simulation model build tools, to process design structure 1720 together with some or all of the depicted supporting data structures, along with any additional mechanical design or data (if applicable), to generate a second design structure 1790. Design structure 1790 resides on a storage medium or programmable gate array in a data format used for the exchange of data of mechanical devices and structures (e.g., information stored in IGES, DXF, Parasolid XT, JT, DRG, or any other suitable format for storing or rendering such mechanical design structures). Similar to design structure 1720, design structure 1790 preferably comprises one or more files, data structures, or other computer-encoded data or instructions that reside on transmission or data storage media and that, when processed by an ECAD system, generate a logically or otherwise functionally equivalent form of one or more of the embodiments of the invention shown herein. In one embodiment, design structure 1790 may comprise a compiled, executable HDL simulation model that functionally simulates the devices shown herein.

Design structure 1790 may also employ a data format used for the exchange of layout data of integrated circuits and/or a symbolic data format (e.g., information stored in GDSII (GDS2), GL1, OASIS, map files, or any other suitable format for storing such design data structures). Design structure 1790 may comprise information such as, for example, symbolic data, map files, test data files, design content files, manufacturing data, layout parameters, wires, levels of metal, vias, shapes, data for routing through the manufacturing line, and any other data required by a manufacturer or other designer/developer to produce a device or structure as described above and shown herein. Design structure 1790 may then proceed to a stage 1795 where, for example, design structure 1790 proceeds to tape-out, is released to manufacturing, is released to a mask house, is sent to another design house, is sent back to the customer, etc.

As has been described, in at least one embodiment, a processor includes an instruction fetch unit that fetches instructions to be executed, an architected register file including a plurality of registers for storing source and destination operands, and an execution unit for executing a Galois multiplication instruction. The execution unit includes a carry-less multiplier configured to multiply operands of the Galois multiplication instruction to generate a product. The execution unit further includes a modular reduction circuit configured to receive the product and determine, based on a logical combination of the product and a fixed polynomial, a reduced product having fewer bits than the product. The execution unit is configured to store the reduced product to the architected register file as a result of the Galois multiplication instruction.

In some embodiments, the processor may form part of a larger data processing system or may be implemented as a design structure embodied in a machine-readable storage device.

In at least one embodiment, the product of the carry-less multiplication includes a high portion including the high-order bits of the product and a low portion including the low-order bits of the product, and the modular reduction circuit is configured to compute a first result equivalent to a carry-less multiplication of the high portion and the fixed polynomial. The modular reduction circuit includes shift circuitry that applies multiple different bit position shifts to the high portion of the product, corresponding to the asserted bits in the fixed polynomial, and bitwise exclusive OR (XOR) circuitry that logically combines multiple instances of the high portion of the product having the different respective bit position shifts applied by the shift circuitry.

In at least one embodiment, the shift circuitry is further configured to apply multiple different bit position shifts to a high portion of the first result, corresponding to the asserted bits in the fixed polynomial, and the bitwise XOR circuitry is further configured to logically combine multiple instances of the high portion of the first result having the different respective bit position shifts applied by the shift circuitry to obtain a second result. The bitwise XOR circuitry generates the reduced product based on the first result, the second result, and the low portion of the product.
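
The two reduction stages described above can be sketched with shifts and XORs alone. The sketch below is an illustration under stated assumptions rather than the disclosed circuit: bit i of an integer is taken as the coefficient of x^i (so the shifts appear as left shifts here, whereas the reference numerals describe right shifts in the document's own bit numbering), the low-order terms of the fixed polynomial are taken as R = 1 + x + x^2 + x^7, and the function names are hypothetical. The widths in the comments follow the reference numerals (134-bit T, 6-bit TH, 13-bit T').

```python
# Two-stage shift-and-XOR reduction of a 255-bit carry-less product (assumed conventions).
MASK128 = (1 << 128) - 1


def mul_by_r(value: int) -> int:
    """Carry-less multiply by R = 1 + x + x^2 + x^7: one shifted copy per set bit of R."""
    return value ^ (value << 1) ^ (value << 2) ^ (value << 7)


def reduce_two_stage(product: int) -> int:
    """Reduce a 255-bit product modulo g(x) = x^128 + R using two XOR stages."""
    assert 0 <= product < (1 << 255)
    pl = product & MASK128   # low portion PL (128 bits)
    ph = product >> 128      # high portion PH (127 bits)
    t = mul_by_r(ph)         # first result T (up to 134 bits)
    tl = t & MASK128         # low portion TL (128 bits)
    th = t >> 128            # high portion TH (up to 6 bits)
    t2 = mul_by_r(th)        # second result T' (up to 13 bits)
    return pl ^ tl ^ t2      # reduced product M (128 bits)


def reduce_bitwise(product: int) -> int:
    """Reference: bit-serial polynomial long division by g(x), for comparison only."""
    g = (1 << 128) | 0x87
    for i in range(product.bit_length() - 1, 127, -1):
        if (product >> i) & 1:
            product ^= g << (i - 128)
    return product
```

Two stages suffice in this model because the first stage leaves at most 6 bits above position 127, and multiplying those 6 bits by R yields at most 13 bits, which already fit below x^128.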

In at least one embodiment, the processor includes a conditional bit reversal circuit configured to conditionally reverse, before the operands are multiplied, a bit ordering of the bytes in one of the operands based on a mode indicated by the Galois multiplication instruction.
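
As a rough software analogue of such a conditional bit reversal circuit, the sketch below selects between the unmodified operand and a copy whose bytes each have their bits reversed, much like a two-input multiplexer. The 256-entry lookup table, the little-endian byte serialization, and the names `REV8` and `conditional_bit_reverse` are assumptions made for illustration only.

```python
# Conditional per-byte bit reversal modeled as a select between two paths (illustrative).
REV8 = bytes(int(f"{b:08b}"[::-1], 2) for b in range(256))  # bit-reversed value of every byte


def conditional_bit_reverse(value: int, reverse: bool, width_bytes: int = 16) -> int:
    """Return value unchanged, or with the bits of each byte reversed when reverse is set."""
    if not reverse:
        return value  # pass-through path of the multiplexer
    raw = value.to_bytes(width_bytes, "little")
    return int.from_bytes(raw.translate(REV8), "little")  # bit-reversed path
```

Applying the function twice with `reverse=True` returns the original value, which is why the same kind of circuit can serve both on the operand side and on the result side.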

In at least one embodiment, the carry-less multiplier is a first multiply-multiply engine, the execution unit includes a second multiply-multiply engine, the first multiply-multiply engine and the second multiply-multiply engine both have a first data width, and the operands include first and second operands having a second data width that is an integer multiple of the first data width. In this case, the first multiply-multiply engine and the second multiply-multiply engine are configured to multiply subsets of the first operand and the second operand in parallel.
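
To illustrate why narrower multipliers can cooperate on wider operands, the sketch below builds a 128-bit carry-less product from four independent 64-bit carry-less products. It is a generic decomposition offered as an assumption-laden example: how the partial products are actually assigned to the two engines is not modeled, and the names `clmul` and `clmul128_from_64` are hypothetical.

```python
# 128-bit carry-less multiplication assembled from 64-bit partial products (illustrative).
MASK64 = (1 << 64) - 1


def clmul(a: int, b: int) -> int:
    """Carry-less multiplication: partial products are combined with XOR."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    return product


def clmul128_from_64(a: int, b: int) -> int:
    """Split both operands into 64-bit halves; the four products are independent."""
    a_lo, a_hi = a & MASK64, a >> 64
    b_lo, b_hi = b & MASK64, b >> 64
    low = clmul(a_lo, b_lo)                          # contributes at x^0
    high = clmul(a_hi, b_hi)                         # contributes at x^128
    cross = clmul(a_hi, b_lo) ^ clmul(a_lo, b_hi)    # contributes at x^64
    return (high << 128) ^ (cross << 64) ^ low
```

Because the partial products are combined with XOR and no carries propagate between them, they can be computed in any order or concurrently.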

While various embodiments have been particularly shown and described, those skilled in the art will understand that various changes in form and detail may be made therein without departing from the spirit and scope of the appended claims, and that these alternative implementations all fall within the scope of the appended claims. For example, although the invention has been described with respect to particular cryptographic algorithms (e.g., AES, GCM, XTS) and data widths, those skilled in the art will appreciate that the disclosed invention is also applicable to other encryption algorithms and data widths.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Further, although aspects have been described with respect to a computer system executing program code that directs the functions of the present invention, it should be understood that the present invention may alternatively be implemented as a program product including a computer-readable storage device storing program code that can be processed by a data processing system. The computer-readable storage device can include volatile or non-volatile memory, an optical or magnetic disk, or the like. However, as employed herein, a "storage device" is specifically defined to include only statutory articles of manufacture and to exclude signal media per se, transitory propagating signals per se, and energy per se.

The program product may include data and/or instructions that, when executed or otherwise processed on a data processing system, generate a logically, structurally, or otherwise functionally equivalent representation (including a simulation model) of the hardware components, circuits, devices, or systems disclosed herein. Such data and/or instructions may include hardware description language (HDL) design entities or other data structures conforming to and/or compatible with lower-level HDL design languages such as Verilog and VHDL, and/or higher-level design languages such as C or C++. Furthermore, the data and/or instructions may also employ a data format used for the exchange of layout data of integrated circuits and/or a symbolic data format (e.g., information stored in GDSII (GDS2), GL1, OASIS, map files, or any other suitable format for storing such design data structures).

100: data processing system
102: processor
104: processor core
106: cache memory
110: system interconnect
112: memory controller
114: system memory
116: input/output (I/O) adapter
118: non-volatile storage system
120: network adapter
200: processor core
202: instruction fetch unit
204: instruction decode unit
206: branch processing unit
210: mapper circuit
216: dispatch circuit
218: issue queue
220: fixed-point unit
222: floating-point unit
224: load-store unit
226: vector-scalar unit
230: storage
300: architected register file
302: functional unit/arithmetic logic unit/rotate unit
304: functional unit/multiplication unit
306: functional unit/division unit
308: functional unit/encryption unit
310: functional unit/permute unit
312: functional unit/binary coded decimal (BCD) unit
314: matrix multiply-accumulate (MMA) unit
316: non-architected register file
400: AES encryption/decryption circuit
402: AES key generation circuit
404: carry-less multiplication circuit
500: encryption and authentication process
502: plaintext block
504: initial value
506: encryption key K
508: authentication data
510: 128-bit authentication key H
512: counter 0/counter 1/counter n
514: increment function
516: AES encryption function
518: 128-bit encrypted outputs X0 to Xn
520: exclusive OR (XOR) function
522: 128-bit ciphertext block/ciphertext 1/final ciphertext n
524: Galois counter multiplication (GCM) 0 function/GCM multiplication 1 function
526: 128-bit authentication value Y1/authentication value Yn+1
528: XOR function
530: 128-bit length indicator
532: signature
600: modular reduction circuit
602: 128-bit low-order portion PL
604: 127-bit high-order portion PH
606a: carry-less multiplier
606b: carry-less multiplier
608: R
610: 134-bit product T
610a: 128-bit low-order portion TL
610b: 6-bit high-order portion TH
612: bitwise XOR circuit
614: 13-bit product T'/second reduction result T'
618: 128-bit reduced product M
700a: partial product PP
700b: partial product PP
700c: partial product PP
700d: partial product PP
700e: partial product PP
700f: partial product PP
700g: partial product PP
700h: partial product PP
702: four-input bitwise XOR circuit
802: PH
804: PH shifted right by 1 bit position
806: PH shifted right by 2 bit positions
808: PH shifted right by 7 bit positions
812: TH shifted right by 1 bit position
814: TH shifted right by 2 bit positions
816: TH shifted right by 7 bit positions
818: first to third reduction terms/fourth and fifth reduction terms/sixth to ninth reduction terms
900: 128-bit by 128-bit multiplication array
902: high-order product bits
904: low-order product bits
1002a: conditional bit reversal circuit
1002b: conditional bit reversal circuit
1002c: additional conditional bit reversal circuit
1004a: 128-bit multiplicand
1004b: 128-bit multiplier
1006: carry-less multiplier
1008: 255-bit product P
1010: modular reduction circuit
1100: conditional bit reversal circuit
1102: 128-bit input
1104: bit reversal circuit
1106: two-input 128-bit multiplexer
1108: 128-bit output
1200: bitwise XOR circuit
1202: bitwise XOR circuit
1204: bitwise XOR circuit
1210: padding circuit
1212: shift circuit
1214: shift circuit
1216: shift circuit
1218: shift circuit
1220: shift circuit
1300: Galois multiplication instruction
1302: opcode field
1304: operand field
1306: mode field
1400: prior art single instruction multiple data (SIMD) multiply-multiply engine
1402: 128-bit SIMD operand A
1404: 128-bit SIMD operand B
1406: 64-bit carry-less multiplier
1408: 64-bit carry-less multiplier
1410: bitwise XOR circuit
1412: product Q
1414: 128-bit high portion PH
1416: 128-bit low portion PL
1500: SIMD carry-less multiplication circuit
1502: SIMD operand A
1504: SIMD operand B
1506: conditional bit reversal circuit
1508: conditional bit reversal circuit
1510: swap doubleword circuit
1512: 128-bit SIMD multiply-multiply engine
1514: 128-bit SIMD multiply-multiply engine
1516: reduction circuit
1518: multiplexer
1520: result
1530: shift circuit
1532: shift circuit
1534: bitwise XOR circuit
1536: conditional bit reversal circuit
1600: block
1602: block
1604: block
1606: block
1608: block
1610: block
1612: block
1614: block
1616: block
1618: block
1620: block
1700: design flow
1710: design process
1720: design structure
1730: library elements
1740: design specifications
1750: characterization data
1760: verification data
1780: netlist
1785: test data files
1790: design rules/second design structure
1795: stage

FIG. 1 is a high-level block diagram of a data processing system including a processor according to one embodiment;

FIG. 2 is a high-level block diagram of a processor core according to one embodiment;

FIG. 3 is a high-level block diagram of an exemplary execution unit of a processor core according to one embodiment;

FIG. 4 is a more detailed block diagram of an encryption unit within a processor core according to one embodiment;

FIG. 5 is a time-space diagram of an encryption and authentication process using Advanced Encryption Standard - Galois Counter Mode (AES-GCM);

FIG. 6 is a block diagram of a modular reduction circuit for performing modular reduction in a Galois field according to one embodiment;

FIG. 7 illustrates an optimization of carry-less multiplication according to one embodiment;

FIG. 8 depicts the carry-less multiplication technique of FIG. 7 applied to the circuit of FIG. 6;

FIG. 9 is a schematic diagram of an exemplary multiplication array, showing the positions of a first set of carry-less product bits subjected to one modular reduction stage and a second set of carry-less product bits subjected to two modular reduction stages;

FIG. 10 is a high-level block diagram of a carry-less multiplication circuit according to one embodiment;

FIG. 11 is a more detailed block diagram of an exemplary embodiment of the conditional bit reversal circuit of FIG. 10;

FIG. 12 is a more detailed block diagram of an exemplary embodiment of the modular reduction circuit of FIG. 10;

FIG. 13 depicts an exemplary Galois multiplication instruction according to one embodiment;

FIG. 14 illustrates a prior art single instruction multiple data (SIMD) carry-less multiplication circuit;

FIG. 15 depicts an exemplary SIMD carry-less multiplication circuit supporting Galois multiplication according to one embodiment;

FIG. 16 is a high-level logical flowchart of an exemplary method of Galois multiplication according to one embodiment; and

FIG. 17 depicts an exemplary design process according to one embodiment.

300: architected register file

404: carry-less multiplication circuit

1002a: conditional bit reversal circuit

1002b: conditional bit reversal circuit

1002c: additional conditional bit reversal circuit

1004a: 128-bit multiplicand

1004b: 128-bit multiplier

1006: carry-less multiplier

1008: 255-bit product P

1010: modular reduction circuit

Claims

A processor, comprising: an instruction fetch unit that fetches instructions to be executed; an architected register file including a plurality of registers for storing source and destination operands; and an execution unit for executing a Galois multiplication instruction, wherein the execution unit includes: a carry-less multiplier configured to multiply operands of the Galois multiplication instruction to generate a product; and a modular reduction circuit configured to receive the product and determine, based on a logical combination of the product and a fixed polynomial, a reduced product having fewer bits than the product, wherein the execution unit is configured to store the reduced product to the architected register file as a result of the Galois multiplication instruction.

The processor of claim 1, wherein the fixed polynomial is g(x) = 1 + x + x^2 + x^7 + x^128.

The processor of claim 1, wherein: the product includes a high portion including high-order bits of the product and a low portion including low-order bits of the product; the modular reduction circuit is configured to compute a first result equivalent to a carry-less multiplication of the high portion and the fixed polynomial, wherein the modular reduction circuit includes: shift circuitry that applies multiple different bit position shifts to the high portion of the product consistent with set bits in the fixed polynomial; and bitwise exclusive OR (XOR) circuitry that logically combines multiple instances of the high portion of the product having different respective bit position shifts applied by the shift circuitry.

The processor of claim 3, wherein: the shift circuitry is further configured to apply a plurality of different bit position shifts to a high portion of the first result consistent with set bits in the fixed polynomial; and the bitwise exclusive OR (XOR) circuitry is further configured to logically combine a plurality of instances of the high portion of the first result having different respective bit position shifts applied by the shift circuitry to obtain a second result, wherein the bitwise XOR circuitry generates the reduced product based on the first result, the second result, and the low portion of the product.

The processor of claim 3, wherein the bitwise exclusive OR (XOR) circuitry includes at least two stages of bitwise XOR circuitry.

The processor of claim 1, further comprising: a conditional bit reversal circuit configured to conditionally reverse, before the operands are multiplied, a bit ordering of bytes in one of the operands based on an endian mode indicated by the Galois multiplication instruction.

The processor of claim 1, wherein: the carry-less multiplier is a first multiplication engine; the execution unit includes a second multiplication engine, wherein the first multiplication engine and the second multiplication engine have a first data width; the operands include a first operand and a second operand having a second data width that is an integer multiple of the first data width; and the first multiplication engine and the second multiplication engine are configured to multiply subsets of the first operand and the second operand in parallel.

A data processing system comprising: a plurality of processors, including the processor of claim 1; a shared memory; and a system interconnect communicatively coupling the shared memory and the plurality of processors.

A method of data processing in a processor, the method comprising: fetching, by an instruction fetch unit, instructions to be executed by the processor, wherein the instructions include a Galois multiplication instruction; and based on receiving the Galois multiplication instruction, executing the Galois multiplication instruction by an execution unit of the processor, wherein the executing includes: multiplying operands of the Galois multiplication instruction by a carry-less multiplier to generate a product; receiving the product by a modular reduction circuit and determining, based on a logical combination of the product and a fixed polynomial, a reduced product having a fewer number of bits than the product; and storing the reduced product to an architected register file of the processor as a result of the Galois multiplication instruction.

The method of claim 9, wherein the fixed polynomial is g(x) = 1 + x + x^2 + x^7 + x^128.

The method of claim 9, wherein: the product includes a high portion including high-order bits of the product and a low portion including low-order bits of the product; determining the reduced product includes computing a first result equivalent to a carry-less multiplication of the high portion and the fixed polynomial, wherein the computing includes: applying, by shift circuitry, a plurality of different bit position shifts to the high portion of the product consistent with set bits in the fixed polynomial; and logically combining, by bitwise exclusive OR (XOR) circuitry, multiple instances of the high portion of the product having different respective bit position shifts applied by the shift circuitry.

The method of claim 11, further comprising: applying, by the shift circuitry, a plurality of different bit position shifts to a high portion of the first result consistent with set bits in the fixed polynomial; and logically combining, by the bitwise exclusive OR (XOR) circuitry, a plurality of instances of the high portion of the first result having different respective bit position shifts applied by the shift circuitry to obtain a second result; wherein determining the reduced product comprises determining the reduced product based on the first result, the second result, and the low portion of the product.

The method of claim 9, further comprising: before multiplying the operands, conditionally reversing a bit ordering of bytes in one of the operands based on an endian mode indicated by the Galois multiplication instruction.

The method of claim 9, wherein: the carry-less multiplier is a first multiplication engine; the execution unit includes a second multiplication engine, wherein the first multiplication engine and the second multiplication engine have a first data width; the operands include a first operand and a second operand having a second data width that is an integer multiple of the first data width; and multiplying the operands includes the first multiplication engine and the second multiplication engine multiplying subsets of the first operand and the second operand in parallel.

A design structure tangibly embodied in a machine-readable storage device for designing, manufacturing, or testing an integrated circuit, the design structure comprising: a processor, including: an instruction fetch unit that fetches instructions to be executed; an architected register file that includes a plurality of registers for storing source and destination operands; and an execution unit for executing a Galois multiplication instruction, wherein the execution unit includes: a carry-less multiplier configured to multiply operands of the Galois multiplication instruction to generate a product; and a modular reduction circuit configured to receive the product and determine, based on a logical combination of the product and a fixed polynomial, a reduced product having a fewer number of bits than the product, wherein the execution unit is configured to store the reduced product to the architected register file as a result of the Galois multiplication instruction.

The design structure of claim 15, wherein the fixed polynomial is g(x) = 1 + x + x^2 + x^7 + x^128.

The design structure of claim 15, wherein: the product includes a high portion including high-order bits of the product and a low portion including low-order bits of the product; the modular reduction circuit is configured to compute a first result equivalent to a carry-less multiplication of the high portion and the fixed polynomial, wherein the modular reduction circuit includes: shift circuitry that applies multiple different bit position shifts to the high portion of the product consistent with set bits in the fixed polynomial; and bitwise exclusive OR (XOR) circuitry that logically combines multiple instances of the high portion of the product having different respective bit position shifts applied by the shift circuitry.

The design structure of claim 17, wherein: the shift circuitry is further configured to apply multiple different bit position shifts to a high portion of the first result consistent with set bits in the fixed polynomial; and the bitwise exclusive OR (XOR) circuitry is further configured to logically combine multiple instances of the high portion of the first result having different respective bit position shifts applied by the shift circuitry to obtain a second result, wherein the bitwise XOR circuitry generates the reduced product based on the first result, the second result, and the low portion of the product.

The design structure of claim 15, further comprising: a conditional bit reversal circuit configured to conditionally reverse, before the operands are multiplied, a bit ordering of bytes in one of the operands based on an endian mode indicated by the Galois multiplication instruction.

The design structure of claim 15, wherein: the carry-less multiplier is a first multiplication engine; the execution unit includes a second multiplication engine, wherein the first multiplication engine and the second multiplication engine have a first data width; the operands include a first operand and a second operand having a second data width that is an integer multiple of the first data width; and the first multiplication engine and the second multiplication engine are configured to multiply subsets of the first operand and the second operand in parallel.
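
By way of non-limiting illustration, the carry-less multiplication and the two-step shift-and-XOR reduction recited in the claims above can be modeled in software. The Python sketch below is an assumption-laden model and not the claimed circuit: the function names (`clmul`, `fold`, `gf128_mul`, `gf128_mul_reference`) are hypothetical, operands are plain integers in which bit i is the coefficient of x^i, and the conditional bit reversal and the parallel multiplication engines are omitted. The shifts 0, 1, 2, and 7 correspond to the low-order set bits of g(x) = 1 + x + x^2 + x^7 + x^128.

```python
MASK128 = (1 << 128) - 1
SHIFTS = (0, 1, 2, 7)  # exponents of the low-order terms of g(x)


def clmul(a: int, b: int) -> int:
    """Carry-less (XOR-accumulating) multiply of two polynomials over GF(2)."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    return product


def fold(high: int) -> int:
    """XOR together copies of `high` shifted by each low-order exponent of g(x)."""
    result = 0
    for shift in SHIFTS:
        result ^= high << shift
    return result


def gf128_mul(a: int, b: int) -> int:
    """Multiply two 128-bit operands and reduce modulo g(x) in two folding steps."""
    product = clmul(a, b)              # up to 255 bits
    low = product & MASK128            # low portion of the product
    high = product >> 128              # high portion of the product
    first = fold(high)                 # first result, may exceed 128 bits
    second = fold(first >> 128)        # second result, folds the small overflow
    return low ^ (first & MASK128) ^ second


def gf128_mul_reference(a: int, b: int) -> int:
    """Bit-serial reduction, used here only to cross-check the folded version."""
    g = (1 << 128) | 0x87              # g(x) as an integer
    product = clmul(a, b)
    for bit in range(254, 127, -1):
        if (product >> bit) & 1:
            product ^= g << (bit - 128)
    return product
```

For example, `gf128_mul(1 << 127, 2)` evaluates to `0x87`, because x^127 multiplied by x equals x^128, which is congruent to 1 + x + x^2 + x^7 modulo g(x). The second fold suffices in this model because the first result overflows 128 bits by at most six bits, so folding that overflow again cannot produce a further overflow.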