CN105204820A

CN105204820A - Instructions and logic to provide general purpose GF(256) SIMD cryptographic arithmetic functionality

Info

Publication number: CN105204820A
Application number: CN201510274232.0A
Authority: CN
Inventors: S. Gueron
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2014-06-26
Filing date: 2015-05-26
Publication date: 2015-12-30
Anticipated expiration: 2035-05-26
Also published as: DE102015006670A1; CN105204820B

Abstract

The invention discloses instructions and logic to provide general purpose GF(256), that is GF(2⁸), SIMD cryptographic arithmetic functionality. Embodiments include a processor to decode an instruction for a SIMD affine transformation specifying a source data operand, a transformation matrix operand, and a translation vector. The transformation matrix is applied to each element of the source data operand, and the translation vector is applied to each of the transformed elements. A result of the instruction is stored in a SIMD destination register. Some embodiments also decode an instruction for a SIMD binary finite field multiplicative inverse to compute an inverse in a binary finite field modulo an irreducible polynomial for each element of the source data operand. Some embodiments also decode an instruction for a SIMD binary finite field multiplication specifying first and second source data operands to multiply each corresponding pair of elements of the first and second source data operands modulo an irreducible polynomial.

Description

Instructions and logic to provide general purpose GF(256) SIMD cryptographic arithmetic functionality

Technical field

The present disclosure pertains to the field of processing logic, microprocessors, and the associated instruction set architecture that, when executed by a processor or other processing logic, performs logical, mathematical, or other functional operations. In particular, the disclosure relates to instructions and logic to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Background

Cryptography is a tool that relies on an algorithm and a key to protect information. The algorithm is a complex mathematical algorithm, and the key is a string of bits. There are two basic types of cryptographic systems: secret key systems and public key systems. A secret key system (also referred to as a symmetric system) has a single key (the "secret key") that is shared by two or more parties. This single key is used both to encrypt and to decrypt information.

For example, the Advanced Encryption Standard (AES), also known as Rijndael, is a block cipher developed by two Belgian cryptographers, Joan Daemen and Vincent Rijmen, and adopted as an encryption standard by the United States government. On November 26, 2001, the National Institute of Standards and Technology (NIST) announced AES as U.S. FIPS PUB 197 (FIPS 197).

AES has a fixed block size of 128 bits and a key size of 128, 192, or 256 bits. Key expansion using the Rijndael key schedule transforms a key of 128, 192, or 256 bits into 10, 12, or 14 round keys of 128 bits each. The round keys are used to process blocks of plaintext data in rounds, as 128-bit blocks (viewed as 4 × 4 arrays of bytes), and to convert them into ciphertext blocks. Typically, for a 128-bit (16-byte) input to a round, each byte is replaced by another byte according to a lookup table called the S-box. This portion of the block cipher is called SubBytes. Next, the rows of bytes (viewed as a 4 × 4 array) are cyclically shifted, or rotated, left by a particular offset (that is, row zero by 0 bytes, row one by 1 byte, row two by 2 bytes, and row three by 3 bytes). This portion of the block cipher is called ShiftRows. Then each column of bytes is viewed as four coefficients of a polynomial over the finite field GF(256) (also called the Galois field 2⁸) and is multiplied by an invertible linear transformation. This portion of the block cipher is called MixColumns. Finally, the 128-bit block is XORed with a round key to produce a ciphertext block of 16 bytes, which is called AddRoundKey.
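The two byte-shuffling steps of the round described above can be sketched in a few lines. This is a minimal illustration only, assuming a state stored as a list of four rows of four bytes; the function names `shift_rows` and `add_round_key` and the sample state are hypothetical and are not taken from the patent.

```python
def shift_rows(state):
    """ShiftRows: rotate row r of the 4x4 byte state left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def add_round_key(state, round_key):
    """AddRoundKey: XOR every state byte with the matching round-key byte."""
    return [[s ^ k for s, k in zip(s_row, k_row)]
            for s_row, k_row in zip(state, round_key)]


# Hypothetical 4x4 state, listed row by row, with bytes 0..15 for readability.
state = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(shift_rows(state))
# [[0, 1, 2, 3], [5, 6, 7, 4], [10, 11, 8, 9], [15, 12, 13, 14]]
```

SubBytes and MixColumns are omitted here because they depend on GF(256) arithmetic, which is the subject of the instructions described later in this document.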

In systems with 32-bit or larger words, the AES cipher can be implemented by converting the SubBytes, ShiftRows, and MixColumns transformations into four 256-entry tables of 32-bit values, which use 4096 bytes of memory (4 tables × 256 entries × 4 bytes). One drawback of a software implementation is performance. Software runs orders of magnitude slower than dedicated hardware, so the added performance of a hardware/firmware implementation is desirable.

Typical straightforward hardware implementations that use lookup memories, truth tables, binary decision diagrams, or 256-input multiplexers are costly in terms of circuit area. Alternative approaches that use finite fields isomorphic to GF(256) may be efficient in area, but may also be slower than the straightforward hardware implementations.

Many modern processors include instructions for operations that are computationally intensive but offer a high degree of data parallelism, which can be exploited through an efficient implementation that uses various data storage devices, such as single instruction multiple data (SIMD) vector registers. The central processing unit (CPU) can then provide parallel hardware to support vector processing. A vector is a data structure that holds a number of consecutive data elements. A vector register of size M (where M is 2ᵏ, for example 256, 128, 64, 32, ... 4, or 2) may contain N vector elements of size O, where N = M/O. For example, a 64-byte vector register may be partitioned into: (a) 64 vector elements, each holding a data item that occupies 1 byte; (b) 32 vector elements, each holding a data item that occupies 2 bytes (one "word"); (c) 16 vector elements, each holding a data item that occupies 4 bytes (one "doubleword"); or (d) 8 vector elements, each holding a data item that occupies 8 bytes (one "quadword"). The parallelism inherent in SIMD vector registers may be well suited to processing secure hashing algorithms.
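The relation N = M/O above can be checked with a short sketch. It assumes the register is modeled as a Python `bytes` object; the helper name `split_vector` is hypothetical.

```python
def split_vector(register: bytes, element_size: int) -> list:
    """Partition a register of M bytes into N = M / O elements of O bytes each."""
    if element_size <= 0 or len(register) % element_size:
        raise ValueError("element size must evenly divide the register size")
    return [register[i:i + element_size]
            for i in range(0, len(register), element_size)]


# A 64-byte register viewed as bytes, words, doublewords, and quadwords.
register = bytes(range(64))
for size in (1, 2, 4, 8):
    print(size, len(split_vector(register, size)))
# 1 64
# 2 32
# 4 16
# 8 8
```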

Other similar encryption algorithms may also be of interest. For example, the Rijndael specification itself is defined for a variety of block sizes and key sizes, which may be any multiple of 32 bits from a minimum of 128 bits to a maximum of 256 bits. Another example is SMS4, a block cipher used in the Chinese national standard for wireless LAN, WAPI (Wired Authentication and Privacy Infrastructure). It also processes plaintext data in rounds (that is, 32 rounds) as 128-bit blocks in GF(256), but performs reductions modulo a different polynomial.

To date, options that provide efficient space-time design tradeoffs, and potential solutions to such complexity, performance-limiting issues, and other bottlenecks, have not been fully explored.

Brief description of the drawings

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

Figure 1A is a block diagram of one embodiment of a system that executes instructions to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 1B is a block diagram of another embodiment of a system that executes instructions to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 1C is a block diagram of another embodiment of a system that executes instructions to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 2 is a block diagram of one embodiment of a processor that executes instructions to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 3A illustrates packed data types according to one embodiment.

Figure 3B illustrates packed data types according to one embodiment.

Figure 3C illustrates packed data types according to one embodiment.

Figure 3D illustrates an instruction encoding to provide general purpose GF(256) SIMD cryptographic arithmetic functionality according to one embodiment.

Figure 3E illustrates an instruction encoding to provide general purpose GF(256) SIMD cryptographic arithmetic functionality according to another embodiment.

Figure 3F illustrates an instruction encoding to provide general purpose GF(256) SIMD cryptographic arithmetic functionality according to another embodiment.

Figure 3G illustrates an instruction encoding to provide general purpose GF(256) SIMD cryptographic arithmetic functionality according to another embodiment.

Figure 3H illustrates an instruction encoding to provide general purpose GF(256) SIMD cryptographic arithmetic functionality according to another embodiment.

Figure 4A illustrates elements of one embodiment of a processor micro-architecture to execute instructions that provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 4B illustrates elements of another embodiment of a processor micro-architecture to execute instructions that provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 5 is a block diagram of one embodiment of a processor to execute instructions that provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 6 is a block diagram of one embodiment of a computer system to execute instructions that provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 7 is a block diagram of another embodiment of a computer system to execute instructions that provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 8 is a block diagram of another embodiment of a computer system to execute instructions that provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 9 is a block diagram of one embodiment of a system-on-a-chip to execute instructions that provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 10 is a block diagram of an embodiment of a processor to execute instructions that provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 11 is a block diagram of one embodiment of an IP core development system that provides general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 12 illustrates one embodiment of an architecture emulation system that provides general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 13 illustrates one embodiment of a system to translate instructions that provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 14 illustrates a flow diagram for one embodiment of a process for efficiently implementing the Advanced Encryption Standard (AES) encryption/decryption standard.

Figure 15 illustrates a flow diagram for one embodiment of a process for efficiently implementing the multiplicative inverse of the AES S-box.

Figure 16A illustrates a diagram for one embodiment of an apparatus for execution of an affine map instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 16B illustrates a diagram for one embodiment of an apparatus for execution of an affine inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 16C illustrates a diagram for an alternative embodiment of an apparatus for execution of an inverse affine instruction, which computes a multiplicative inverse and then applies an affine transformation to the result, to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 17A illustrates a diagram for one embodiment of an apparatus for execution of a finite field multiplicative inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 17B illustrates a diagram for an alternative embodiment of an apparatus for execution of a finite field multiplicative inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 17C illustrates a diagram for another alternative embodiment of an apparatus for execution of a finite field multiplicative inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 18A illustrates a diagram for one embodiment of an apparatus for execution of a specific modulus reduction instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 18B illustrates a diagram for an alternative embodiment of an apparatus for execution of a specific modulus reduction instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 18C illustrates a diagram for another alternative embodiment of an apparatus for execution of a specific AES Galois Counter Mode (GCM) modulus reduction instruction to provide general purpose GF(2¹²⁸) SIMD cryptographic arithmetic functionality.

Figure 18D illustrates a diagram for one embodiment of an apparatus for execution of a modulus reduction instruction to provide general purpose GF(2ᵗ) SIMD cryptographic arithmetic functionality.

Figure 19A illustrates a diagram for one embodiment of an apparatus for execution of a binary finite field multiplication instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 19B illustrates a diagram for an alternative embodiment of an apparatus for execution of a binary finite field multiplication instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 20A illustrates a flow diagram for one embodiment of a process for execution of an affine map instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 20B illustrates a flow diagram for one embodiment of a process for execution of a finite field multiplicative inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 20C illustrates a flow diagram for one embodiment of a process for execution of an affine inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Figure 20D illustrates a flow diagram for one embodiment of a process for execution of a binary finite field multiplication instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

Detailed description

The following description discloses instructions and processing logic to provide general purpose GF(2ⁿ) SIMD cryptographic arithmetic functionality, where in particular n may equal 2ᵐ (for example, GF(2⁸), GF(2¹⁶), GF(2³²), ... GF(2¹²⁸), etc.). Embodiments include a processor to decode an instruction for a SIMD affine transformation specifying a source data operand, a transformation matrix operand, and a translation vector. The transformation matrix is applied to each element of the source data operand, and the translation vector is applied to each of the transformed elements. A result of the instruction is stored in a SIMD destination register. Some embodiments also decode an instruction for a SIMD binary finite field multiplicative inverse to compute, for each element of the source data operand, an inverse in a binary finite field modulo an irreducible polynomial. Some embodiments also decode an instruction for a SIMD affine transformation and multiplicative inverse (or multiplicative inverse and affine transformation), in which, either before or after the multiplicative inverse operation, the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each of the transformed elements. Some embodiments also decode an instruction for a SIMD modulus reduction to compute a reduction modulo a particular modulus polynomial p_s, selected from the polynomials in a binary finite field for which modulus reduction is provided by the instruction (or micro-instruction). Some embodiments also decode an instruction for a SIMD binary finite field multiplication specifying first and second source data operands, to multiply each corresponding pair of elements of the first and second source data operands modulo an irreducible polynomial.
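The affine transformation described above can be sketched as y = A·x + b over GF(2), applied independently to every byte of the source operand. The sketch below is illustrative only: the function names, the representation of the matrix as eight row bytes, and the bit ordering (row i produces output bit i, bit 0 being the least significant) are assumptions, not the instruction's actual encoding. The AES S-box affine matrix and constant 0x63 are used as sample operands.

```python
def parity(x: int) -> int:
    """Return the XOR of all bits of x (a dot product over GF(2))."""
    return bin(x).count("1") & 1


def affine_byte(matrix_rows, x: int, translation: int) -> int:
    """Compute A*x + b over GF(2) for one byte; row i yields output bit i."""
    y = 0
    for i, row in enumerate(matrix_rows):
        y |= parity(row & x) << i
    return y ^ translation


def simd_affine(matrix_rows, source: bytes, translation: int) -> bytes:
    """Apply the same matrix and translation vector to every source element."""
    return bytes(affine_byte(matrix_rows, x, translation) for x in source)


# Sample operands (assumption): the AES S-box affine map, where output bit i
# is the XOR of input bits i, i+4, i+5, i+6, i+7 (mod 8), plus constant 0x63.
AES_ROWS = [sum(1 << ((i + k) % 8) for k in (0, 4, 5, 6, 7)) for i in range(8)]
print(simd_affine(AES_ROWS, bytes([0x00, 0x01]), 0x63).hex())  # 637c
```

With the identity matrix and a zero translation vector the source passes through unchanged, which is a convenient sanity check for any chosen bit ordering.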

Will be understood that, described in multiple embodiment as described herein, general GF (2 ⁿ) instruction of SIMD encrypted mathematical can be used for providing arithmetic function in several applications, such as, for guaranteeing in the cryptographic protocol of the privacy of financial transaction, ecommerce, Email, software dispatch, data storage etc., data integrity, authentication, message content certification and message source certification and internet communication.

It will also be appreciated that providing execution of at least the following instructions: (1) a SIMD affine transformation specifying a source data operand, a transformation matrix operand, and a translation vector, in which the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each of the transformed elements; (2) a SIMD binary finite field multiplicative inverse to compute, for each element of the source data operand, an inverse in a binary finite field modulo an irreducible polynomial; (3) a SIMD affine transformation and multiplicative inverse (or multiplicative inverse and affine transformation) specifying a source data operand, a transformation matrix operand, and a translation vector, in which, either before or after the multiplicative inverse operation, the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each of the transformed elements; (4) a modulus reduction to compute a reduction modulo a particular modulus polynomial p_s, selected from the polynomials in a binary finite field for which modulus reduction is provided by the instruction (or micro-instruction); and (5) a SIMD binary finite field multiplication specifying first and second source data operands, to multiply each corresponding pair of elements of the first and second source data operands modulo an irreducible polynomial; where the results of these instructions are stored in SIMD destination registers; may provide general purpose GF(256) and/or other alternative binary finite field SIMD cryptographic arithmetic functionality in hardware and/or microcode sequences, in order to support significant performance improvements for several important performance-critical applications without requiring excessive or undue functional units that need additional circuitry, area, or power.
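Items (2) and (5) in the list above can be illustrated with a bit-serial software model. This is a sketch under stated assumptions, not the hardware method of any embodiment: the function names are hypothetical, the default modulus is the AES polynomial x⁸ + x⁴ + x³ + x + 1 (0x11B), and the inverse is computed by the generic identity a⁻¹ = a²⁵⁴ in GF(2⁸), with the inverse of 0 defined as 0 as in the AES S-box.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, including the x^8 term


def gf256_mul(a: int, b: int, poly: int = AES_POLY) -> int:
    """Multiply two GF(2^8) elements modulo the irreducible polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        a <<= 1                  # multiply a by x
        if a & 0x100:
            a ^= poly            # reduce when the degree reaches 8
        b >>= 1
    return result


def gf256_inv(a: int, poly: int = AES_POLY) -> int:
    """Multiplicative inverse via a^254; the inverse of 0 is taken to be 0."""
    result = 1
    for _ in range(254):
        result = gf256_mul(result, a, poly)
    return result if a else 0


def simd_gf256_mul(xs: bytes, ys: bytes, poly: int = AES_POLY) -> bytes:
    """Multiply each corresponding pair of elements of the two operands."""
    return bytes(gf256_mul(x, y, poly) for x, y in zip(xs, ys))


def simd_gf256_inv(xs: bytes, poly: int = AES_POLY) -> bytes:
    """Invert each element of the source operand."""
    return bytes(gf256_inv(x, poly) for x in xs)


print(simd_gf256_mul(bytes([0x57, 0x53]), bytes([0x83, 0xCA])).hex())  # c101
```

Passing a different irreducible polynomial as `poly` models a cipher that reduces modulo another polynomial, as the background section notes for SMS4.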

In the following description, numerous specific details such as processing logic, processor types, micro-architectural conditions, events, enablement mechanisms, and the like are set forth in order to provide a more thorough understanding of embodiments of the present invention. It will be appreciated by those skilled in the art, however, that the invention may be practiced without these specific details. In addition, some well-known structures, circuits, and the like are not shown in detail, to avoid unnecessarily obscuring embodiments of the present invention.

Although the following embodiments are described with reference to a processor, other embodiments are applicable to other types of integrated circuits and logic devices. Similar techniques and teachings of embodiments of the present invention can be applied to other types of circuits or semiconductor devices that can benefit from higher pipeline throughput and improved performance. The teachings of embodiments of the present invention are applicable to any processor or machine that performs data manipulation. However, the present invention is not limited to processors or machines that perform 512-bit, 256-bit, 128-bit, 64-bit, 32-bit, 16-bit, or 8-bit data operations, and can be applied to any processor or machine in which manipulation or management of data is performed. In addition, the following description provides examples, and the accompanying drawings show various examples for purposes of illustration. These examples should not be construed in a limiting sense, however, as they are merely intended to provide examples of embodiments of the present invention rather than an exhaustive list of all possible implementations of embodiments of the present invention.

Although the examples below describe instruction handling and distribution in the context of execution units and logic circuits, other embodiments of the present invention can be accomplished by way of data and/or instructions stored on a machine-readable, tangible medium, which, when performed by a machine, cause the machine to perform functions consistent with at least one embodiment of the invention. In one embodiment, functions associated with embodiments of the present invention are embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor that is programmed with the instructions to perform the steps of the present invention. Embodiments of the present invention may also be provided as a computer program product or software, which may include a machine-readable or computer-readable medium having stored thereon instructions that may be used to program a computer (or other electronic device) to perform one or more operations according to embodiments of the present invention. Alternatively, steps of embodiments of the present invention may be performed by specific hardware components that contain fixed-function logic for performing the steps, or by any combination of programmed computer components and fixed-function hardware components.

Instructions used to program logic to perform embodiments of the invention can be stored within a memory in the system, such as DRAM, cache, flash memory, or other storage. Furthermore, the instructions can be distributed via a network or by way of other computer-readable media. Thus a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (for example, a computer), including but not limited to: floppy diskettes, optical disks, compact disc read-only memory (CD-ROM), magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or a tangible, machine-readable storage used in the transmission of information over the Internet via electrical, optical, acoustical, or other forms of propagated signals (for example, carrier waves, infrared signals, digital signals, etc.). Accordingly, the computer-readable medium includes any type of tangible machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (for example, a computer).

A design may go through various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of ways. First, as is useful in simulations, the hardware may be represented using a hardware description language or another functional description language. Additionally, a circuit-level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs, at some stage, reach a level of data representing the physical placement of various devices in the hardware model. Where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be the data specifying the presence or absence of various features on different mask layers for the masks used to produce the integrated circuit. In any representation of the design, the data may be stored in any form of machine-readable medium. A memory, or a magnetic or optical storage such as a disc, may be the machine-readable medium that stores information transmitted via an optical or electrical wave modulated or otherwise generated to transmit such information. When an electrical carrier wave indicating or carrying the code or design is transmitted, to the extent that copying, buffering, or retransmission of the electrical signal is performed, a new copy is made. Thus, a communication provider or a network provider may store on a tangible, machine-readable medium, at least temporarily, an article (such as information encoded into a carrier wave) embodying techniques of embodiments of the present invention.

In modern processors, a number of different execution units are used to process and execute a variety of code and instructions. Not all instructions are created equal: some complete quickly, while others may take a number of clock cycles to complete. The faster the throughput of instructions, the better the overall performance of the processor. It would therefore be advantageous to have as many instructions as possible execute as fast as possible. However, certain instructions have greater complexity and require more in terms of execution time and processor resources. Examples include floating point instructions, load/store operations, and data moves.

As more computer systems are used in Internet, text, and multimedia applications, additional processor support has been introduced over time. In one embodiment, an instruction set may be associated with one or more computer architectures, including data types, instructions, register architecture, addressing modes, memory architecture, interrupt and exception handling, and external input and output (I/O).

In one embodiment, the instruction set architecture (ISA) may be implemented by one or more micro-architectures, which include the processor logic and circuits used to implement one or more instruction sets. Accordingly, processors with different micro-architectures can share at least a portion of a common instruction set. For example, Pentium 4 processors, Core™ processors, and processors from Advanced Micro Devices, Inc. of Sunnyvale, California implement nearly identical versions of the x86 instruction set (with some extensions added in newer versions), but have different internal designs. Similarly, processors designed by other processor development companies (such as ARM Holdings, Ltd., MIPS, or their licensees or adopters) may share at least a portion of a common instruction set, but may include different processor designs. For example, the same register architecture of the ISA may be implemented in different ways in different micro-architectures using new or well-known techniques, including dedicated physical registers, or one or more dynamically allocated physical registers using a register renaming mechanism (for example, a register alias table (RAT), a reorder buffer (ROB), and a retirement register file). In one embodiment, registers may include one or more registers, register architectures, register files, or other register sets that may or may not be addressable by a software programmer.

In one embodiment, an instruction may include one or more instruction formats. In one embodiment, an instruction format may indicate various fields (number of bits, location of bits, etc.) to specify, among other things, the operation to be performed and the operand(s) on which that operation is to be performed. Some instruction formats may be further broken down by the definition of instruction templates (or sub-formats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields, and/or defined to have a given field interpreted differently. In one embodiment, an instruction is expressed using an instruction format (and, if defined, in a given one of the instruction templates of that instruction format) and specifies or indicates the operation and the operands on which the operation will operate.

Scientific, financial, auto-vectorized general purpose, RMS (recognition, mining, and synthesis), and visual and multimedia applications (for example, 2D/3D graphics, image processing, video compression/decompression, voice recognition algorithms, and audio manipulation) may require the same operation to be performed on a large number of data items. In one embodiment, single instruction multiple data (SIMD) refers to a type of instruction that causes a processor to perform an operation on multiple data elements. SIMD technology may be used in processors that can logically divide the bits in a register into a number of fixed-sized or variable-sized data elements, each of which represents a separate value. For example, in one embodiment, the bits in a 64-bit register may be organized as a source operand containing four separate 16-bit data elements, each of which represents a separate 16-bit value. This type of data may be referred to as a "packed" data type or a "vector" data type, and operands of this data type are referred to as packed data operands or vector operands. In one embodiment, a packed data item or vector may be a sequence of packed data elements stored within a single register, and a packed data operand or vector operand may be a source or destination operand of a SIMD instruction (or "packed data instruction" or "vector instruction"). In one embodiment, a SIMD instruction specifies a single vector operation to be performed on two source vector operands to generate a destination vector operand (also referred to as a result vector operand) of the same or a different size, with the same or a different number of data elements, and in the same or a different data element order.
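The 64-bit example above, four separate 16-bit elements in one register, can be modeled as follows. The sketch assumes element 0 occupies the least significant 16 bits and uses a lane-wise add as the sample operation; the helper names are hypothetical.

```python
import struct


def unpack_words(reg64: int) -> list:
    """View a 64-bit register value as four 16-bit elements (element 0 lowest)."""
    return list(struct.unpack("<4H", reg64.to_bytes(8, "little")))


def pack_words(words) -> int:
    """Pack four 16-bit elements back into one 64-bit register value."""
    return int.from_bytes(struct.pack("<4H", *words), "little")


def packed_add16(src1: int, src2: int) -> int:
    """One operation applied to all four lanes; each lane wraps independently."""
    lanes = zip(unpack_words(src1), unpack_words(src2))
    return pack_words([(a + b) & 0xFFFF for a, b in lanes])


print(unpack_words(0x0004000300020001))                           # [1, 2, 3, 4]
print(hex(packed_add16(0x0004000300020001, 0x0001000100010001)))  # 0x5000400030002
```

Note that an overflow in one lane does not carry into its neighbor, which is what distinguishes a packed add from an ordinary 64-bit add.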

SIMD technology, such as that employed by Core™ processors (which have an instruction set including x86, MMX™, Streaming SIMD Extensions (SSE), SSE2, SSE3, SSE4.1, and SSE4.2 instructions), ARM processors (for example, ARM processor families with an instruction set including Vector Floating Point (VFP) and/or NEON instructions), and MIPS processors (for example, the Loongson family of processors developed by the Institute of Computing Technology (ICT) of the Chinese Academy of Sciences), has enabled a significant improvement in application performance (Core™ and MMX™ are registered trademarks or trademarks of Intel Corporation of Santa Clara, California).

In one embodiment, destination and source registers/data are generic terms that represent the source and destination of the corresponding data or operation. In some embodiments, they may be implemented by registers, memory, or other storage areas having names or functions other than those depicted. For example, in one embodiment, "DEST1" may be a temporary storage register or other storage area, while "SRC1" and "SRC2" may be first and second source storage registers or other storage areas, and so forth. In other embodiments, two or more of the SRC and DEST storage areas may correspond to different data storage elements within the same storage area (for example, a SIMD register). In one embodiment, one of the source registers may also act as a destination register, for example by writing the result of an operation performed on the first and second source data back to whichever of the two source registers serves as the destination register.

Fig. 1A is a block diagram of an exemplary computer system formed with a processor that includes execution units to execute an instruction in accordance with one embodiment of the present invention. System 100 includes a component, such as processor 102, that employs execution units including logic to perform algorithms for processing data, in accordance with the present invention, such as in the embodiments described herein. System 100 is representative of processing systems based on the Pentium III, Pentium 4, Xeon™, XScale™ and/or StrongARM™ microprocessors available from Intel Corporation of Santa Clara, California, although other systems (including PCs having other microprocessors, engineering workstations, set-top boxes, and the like) may also be used. In one embodiment, sample system 100 may execute a version of the WINDOWS™ operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (e.g., UNIX and Linux), embedded software, and/or graphical user interfaces may also be used. Thus, embodiments of the present invention are not limited to any specific combination of hardware circuitry and software.

Embodiments are not limited to computer systems. Alternative embodiments of the present invention can be used in other devices, such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol devices, digital cameras, personal digital assistants (PDAs), and handheld PCs. Embedded applications can include a microcontroller, a digital signal processor (DSP), a system on a chip, network computers (NetPC), set-top boxes, network hubs, wide area network (WAN) switches, or any other system that can perform one or more instructions in accordance with at least one embodiment.

Fig. 1A is a block diagram of a computer system 100 formed with a processor 102 that includes one or more execution units 108 to perform an algorithm to execute at least one instruction in accordance with one embodiment of the present invention. One embodiment may be described in the context of a single-processor desktop or server system, but alternative embodiments can be included in a multiprocessor system. System 100 is an example of a "hub" system architecture. The computer system 100 includes a processor 102 to process data signals. The processor 102 can be a complex instruction set computer (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. The processor 102 is coupled to a processor bus 110 that can transmit data signals between the processor 102 and other components in the system 100. The elements of system 100 perform conventional functions that are well known in the art.

In one embodiment, the processor 102 includes a Level 1 (L1) internal cache memory 104. Depending on the architecture, the processor 102 can have a single internal cache or multiple levels of internal cache. Alternatively, in another embodiment, the cache memory can reside external to the processor 102. Other embodiments can also include a combination of internal and external caches, depending on the particular implementation and needs. Register file 106 can store different types of data in various registers, including integer registers, floating-point registers, status registers, and an instruction pointer register.

Execution unit 108, including logic to perform integer and floating-point operations, also resides in the processor 102. The processor 102 also includes a microcode (ucode) ROM that stores microcode for certain macroinstructions. For one embodiment, execution unit 108 includes logic to handle a packed instruction set 109. By including the packed instruction set 109 in the instruction set of a general-purpose processor 102, along with the associated circuitry to execute those instructions, the operations used by many multimedia applications may be performed using packed data in the general-purpose processor 102. Many multimedia applications can thus be accelerated and executed more efficiently by using the full width of the processor's data bus to operate on packed data. This can eliminate the need to transfer smaller units of data across the processor's data bus to perform one or more operations one data element at a time.

Alternative embodiments of an execution unit 108 can also be used in microcontrollers, embedded processors, graphics devices, DSPs, and other types of logic circuits. System 100 includes a memory 120. Memory 120 can be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, or another memory device. Memory 120 can store instructions and/or data, represented by data signals, that can be executed by the processor 102.

A system logic chip 116 is coupled to the processor bus 110 and memory 120. The system logic chip 116 in the illustrated embodiment is a memory controller hub (MCH). The processor 102 can communicate with the MCH 116 via the processor bus 110. The MCH 116 provides a high-bandwidth memory path 118 to memory 120 for instruction and data storage and for storage of graphics commands, data, and textures. The MCH 116 directs data signals between the processor 102, memory 120, and other components in the system 100, and bridges the data signals between processor bus 110, memory 120, and system I/O 122. In some embodiments, the system logic chip 116 can provide a graphics port for coupling to a graphics controller 112. The MCH 116 is coupled to memory 120 through a memory interface 118. The graphics card 112 is coupled to the MCH 116 through an Accelerated Graphics Port (AGP) interconnect 114.

System 100 uses a proprietary hub interface bus 122 to couple the MCH 116 to the I/O controller hub (ICH) 130. The ICH 130 provides direct connections to some I/O devices via a local I/O bus. The local I/O bus is a high-speed I/O bus for connecting peripherals to the memory 120, chipset, and processor 102. Some examples are the audio controller, firmware hub (flash BIOS) 128, transceiver 126, data storage 124, a legacy I/O controller containing user input and keyboard interfaces, a serial expansion port such as Universal Serial Bus (USB), and a network controller 134. The data storage device 124 can comprise a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or another mass storage device.

For another embodiment of a system, an instruction in accordance with one embodiment can be used with a system on a chip. One embodiment of a system on a chip comprises a processor and a memory. The memory for one such system is a flash memory. The flash memory can be located on the same die as the processor and other system components. Additionally, other logic blocks, such as a memory controller or graphics controller, can also be located on a system on a chip.

Fig. 1B illustrates a data processing system 140 that implements the principles of one embodiment of the present invention. It will be readily appreciated by one of ordinary skill in the art that the embodiments described herein can be used with alternative processing systems without departing from the scope of embodiments of the invention.

Computer system 140 comprises a processing core 159 capable of performing at least one instruction in accordance with one embodiment. For one embodiment, processing core 159 represents a processing unit of any type of architecture, including but not limited to a CISC, RISC, or VLIW type architecture. Processing core 159 may also be suitable for manufacture in one or more process technologies and, by being represented on a machine-readable medium in sufficient detail, may be suitable to facilitate said manufacture.

Processing core 159 comprises an execution unit 142, a set of register files 145, and a decoder 144. Processing core 159 also includes additional circuitry (not shown) that is not necessary to the understanding of embodiments of the present invention. Execution unit 142 executes instructions received by processing core 159. In addition to performing typical processor instructions, execution unit 142 can perform instructions in packed instruction set 143 to perform operations on packed data formats. Packed instruction set 143 includes instructions for performing embodiments of the invention and other packed instructions. Execution unit 142 is coupled to register file 145 by an internal bus. Register file 145 represents a storage area on processing core 159 for storing information, including data. As previously mentioned, it is understood that the storage area used for storing the packed data is not critical. Execution unit 142 is coupled to decoder 144. Decoder 144 decodes instructions received by processing core 159 into control signals and/or microcode entry points. In response to these control signals and/or microcode entry points, execution unit 142 performs the appropriate operations. In one embodiment, the decoder interprets the opcode of the instruction, which indicates what operation should be performed on the corresponding data indicated within the instruction.

Processing core 159 is coupled to bus 141 for communicating with various other system devices, which may include but are not limited to, for example, a synchronous dynamic random access memory (SDRAM) controller 146, a static random access memory (SRAM) controller 147, a burst flash memory interface 148, a Personal Computer Memory Card International Association (PCMCIA)/compact flash (CF) card controller 149, a liquid crystal display (LCD) controller 150, a direct memory access (DMA) controller 151, and an alternative bus master interface 152. In one embodiment, data processing system 140 may also comprise an I/O bridge 154 for communicating with various I/O devices via an I/O bus 153. Such I/O devices may include but are not limited to, for example, a universal asynchronous receiver/transmitter (UART) 155, a Universal Serial Bus (USB) 156, a Bluetooth wireless UART 157, and an I/O expansion interface 158.

One embodiment of data processing system 140 provides for mobile, network, and/or wireless communications and a processing core 159 capable of performing SIMD operations, including a text string comparison operation. Processing core 159 may be programmed with various audio, video, imaging, and communications algorithms, including discrete transformations such as a Walsh-Hadamard transform, a fast Fourier transform (FFT), a discrete cosine transform (DCT), and their respective inverse transforms; compression/decompression techniques such as color space transformation, video encode motion estimation, or video decode motion compensation; and modulation/demodulation (MODEM) functions such as pulse coded modulation (PCM).

Fig. 1C illustrates other alternative embodiments of a data processing system capable of executing instructions to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality. In accordance with one alternative embodiment, data processing system 160 may include a main processor 166, a SIMD coprocessor 161, a cache memory 167, and an input/output system 168. The input/output system 168 may optionally be coupled to a wireless interface 169. SIMD coprocessor 161 is capable of performing operations including instructions in accordance with one embodiment. Processing core 170 may be suitable for manufacture in one or more process technologies and, by being represented on a machine-readable medium in sufficient detail, may be suitable to facilitate the manufacture of all or part of data processing system 160, including processing core 170.

For one embodiment, SIMD coprocessor 161 comprises an execution unit 162 and a set of register files 164. One embodiment of main processor 166 comprises a decoder 165 to recognize instructions of instruction set 163, including instructions in accordance with one embodiment, for execution by execution unit 162. For alternative embodiments, SIMD coprocessor 161 also comprises at least part of decoder 165B to decode instructions of instruction set 163. Processing core 170 also includes additional circuitry (not shown) that is not necessary to the understanding of embodiments of the present invention.

In operation, the main processor 166 executes a stream of data processing instructions that control data processing operations of a general type, including interactions with the cache memory 167 and the input/output system 168. Embedded within the stream of data processing instructions are SIMD coprocessor instructions. The decoder 165 of main processor 166 recognizes these SIMD coprocessor instructions as being of a type that should be executed by an attached SIMD coprocessor 161. Accordingly, the main processor 166 issues these SIMD coprocessor instructions (or control signals representing SIMD coprocessor instructions) on the coprocessor bus 171, from which they are received by any attached SIMD coprocessors. In this case, the SIMD coprocessor 161 accepts and executes any received SIMD coprocessor instructions intended for it.

Data may be received via wireless interface 169 for processing by the SIMD coprocessor instructions. For one example, voice communication may be received in the form of a digital signal, which may be processed by the SIMD coprocessor instructions to regenerate digital audio samples representative of the voice communication. For another example, compressed audio and/or video may be received in the form of a digital bit stream, which may be processed by the SIMD coprocessor instructions to regenerate digital audio samples and/or motion video frames. For one embodiment of processing core 170, main processor 166 and a SIMD coprocessor 161 are integrated into a single processing core 170 comprising an execution unit 162, a set of register files 164, and a decoder 165 to recognize instructions of instruction set 163, including instructions in accordance with one embodiment.

Fig. 2 is a block diagram of the micro-architecture of a processor 200 that includes logic circuits to perform instructions in accordance with one embodiment of the present invention. In some embodiments, an instruction in accordance with one embodiment can be implemented to operate on data elements having sizes of byte, word, doubleword, quadword, etc., as well as data types such as single- and double-precision integer and floating-point data types. In one embodiment, the in-order front end 201 is the part of the processor 200 that fetches instructions to be executed and prepares them for use later in the processor pipeline. The front end 201 may include several units. In one embodiment, the instruction prefetcher 226 fetches instructions from memory and feeds them to an instruction decoder 228, which in turn decodes or interprets them. For example, in one embodiment, the decoder decodes a received instruction into one or more operations called "micro-instructions" or "micro-operations" (also called micro-ops or uops) that the machine can execute. In other embodiments, the decoder parses the instruction into an opcode and corresponding data and control fields that are used by the micro-architecture to perform operations in accordance with one embodiment. In one embodiment, the trace cache 230 takes decoded uops and assembles them into program-ordered sequences or traces in the uop queue 234 for execution. When the trace cache 230 encounters a complex instruction, the microcode ROM 232 provides the uops needed to complete the operation.

Some instructions are converted into a single micro-op, whereas others need several micro-ops to complete the full operation. In one embodiment, if more than four micro-ops are needed to complete an instruction, the decoder 228 accesses the microcode ROM 232 to perform the instruction. For one embodiment, an instruction can be decoded into a small number of micro-ops for processing at the instruction decoder 228. In another embodiment, an instruction can be stored within the microcode ROM 232 should a number of micro-ops be needed to accomplish the operation. The trace cache 230 refers to an entry point programmable logic array (PLA) to determine a correct micro-instruction pointer for reading the micro-code sequences to complete one or more instructions in accordance with one embodiment from the microcode ROM 232. After the microcode ROM 232 finishes sequencing micro-ops for an instruction, the front end 201 of the machine resumes fetching micro-ops from the trace cache 230.
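As a rough software analogy for the four-micro-op threshold of the embodiment above, the following sketch routes an instruction either to the decoder or to the microcode ROM according to how many micro-ops it needs. The mnemonics, the micro-op names, and the table `UOP_TABLE` are invented for illustration and do not reflect any actual decode table.

```python
# Threshold stated in the text for one embodiment: instructions needing
# more than four micro-ops are sequenced from the microcode ROM.
MAX_DECODER_UOPS = 4

# Hypothetical table (names invented): mnemonic -> micro-op sequence.
UOP_TABLE = {
    "ADD": ["alu_add"],
    "PUSH": ["store_data", "adjust_stack_pointer"],
    "STRING_COPY": ["load", "store", "inc_src", "inc_dst", "dec_count", "branch"],
}


def decode(mnemonic):
    """Return (source, uops), where source names the unit supplying the uops."""
    uops = UOP_TABLE[mnemonic]
    if len(uops) <= MAX_DECODER_UOPS:
        return "decoder", uops
    return "microcode_rom", uops
```

For example, `decode("ADD")` reports the decoder as the source of its single micro-op, whereas the six-micro-op `STRING_COPY` entry is routed to the microcode ROM.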

The out-of-order execution engine 203 is where the instructions are prepared for execution. The out-of-order execution logic has a number of buffers to smooth out and reorder the flow of instructions to optimize performance as they go down the pipeline and get scheduled for execution. The allocator logic allocates the machine buffers and resources that each uop needs in order to execute. The register renaming logic renames logical registers onto entries in a register file. The allocator also allocates an entry for each uop in one of the two uop queues, one for memory operations and one for non-memory operations, in front of the instruction schedulers: memory scheduler, fast scheduler 202, slow/general floating-point scheduler 204, and simple floating-point scheduler 206. The uop schedulers 202, 204, 206 determine when a uop is ready to execute based on the readiness of their dependent input register operand sources and the availability of the execution resources the uops need to complete their operation. The fast scheduler 202 of one embodiment can schedule on each half of the main clock cycle, while the other schedulers can schedule only once per main processor clock cycle. The schedulers arbitrate for the dispatch ports to schedule uops for execution.
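The readiness test applied by the uop schedulers can be sketched, under simplifying assumptions, as follows. The dictionary layout of a uop, the unit names, and the rule of one uop per execution unit per cycle are assumptions made for this example only.

```python
def ready_uops(queue, ready_registers, free_units):
    """Select uops whose source registers are all ready and whose unit is free.

    queue           -- list of dicts with keys "name", "srcs", "unit" (hypothetical layout)
    ready_registers -- set of register names whose values are available
    free_units      -- iterable of execution unit names free this cycle
    """
    available = set(free_units)
    selected = []
    for uop in queue:
        operands_ready = all(src in ready_registers for src in uop["srcs"])
        if operands_ready and uop["unit"] in available:
            selected.append(uop["name"])
            available.remove(uop["unit"])  # assumed: one uop per unit per cycle
    return selected
```

A uop that is later in the queue may be selected ahead of an earlier one whose operands are not yet ready, which is the essence of out-of-order dispatch in this simplified model.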

Register files 208 and 210 sit between the schedulers 202, 204, and 206 and the execution units 212, 214, 216, 218, 220, 222, and 224 in the execution block 211. There are separate register files 208 and 210 for integer and floating-point operations, respectively. Each register file 208, 210 of one embodiment also includes a bypass network that can bypass or forward just-completed results that have not yet been written into the register file to new dependent uops. The integer register file 208 and the floating-point register file 210 are also capable of communicating data with each other. For one embodiment, the integer register file 208 is split into two separate register files, one for the low-order 32 bits of data and a second for the high-order 32 bits of data. The floating-point register file 210 of one embodiment has 128-bit-wide entries because floating-point instructions typically have operands from 64 to 128 bits in width.

The execution block 211 contains the execution units 212, 214, 216, 218, 220, 222, and 224, where the instructions are actually executed. This section includes the register files 208 and 210, which store the integer and floating-point data operand values that the micro-instructions need to execute. The processor 200 of one embodiment comprises a number of execution units: address generation unit (AGU) 212, AGU 214, fast ALU 216, fast ALU 218, slow ALU 220, floating-point ALU 222, and floating-point move unit 224. For one embodiment, the floating-point execution blocks 222 and 224 execute floating-point, MMX, SIMD, SSE, and other operations. The floating-point ALU 222 of one embodiment includes a 64-bit by 64-bit floating-point divider to execute divide, square root, and remainder micro-ops. For embodiments of the present invention, instructions involving a floating-point value may be handled with the floating-point hardware. In one embodiment, the ALU operations go to the high-speed ALU execution units 216 and 218. The fast ALUs 216 and 218 of one embodiment can execute fast operations with an effective latency of half a clock cycle. For one embodiment, most complex integer operations go to the slow ALU 220, because the slow ALU 220 includes integer execution hardware for long-latency types of operations, such as a multiplier, shifts, flag logic, and branch processing. Memory load/store operations are executed by the AGUs 212 and 214. For one embodiment, the integer ALUs 216, 218, and 220 are described in the context of performing integer operations on 64-bit data operands. In alternative embodiments, the ALUs 216, 218, and 220 can be implemented to support a variety of data bit widths, including 16, 32, 128, 256, etc. Similarly, the floating-point units 222 and 224 can be implemented to support a range of operands having bits of various widths. For one embodiment, the floating-point units 222 and 224 can operate on 128-bit-wide packed data operands in conjunction with SIMD and multimedia instructions.

In one embodiment, the uop schedulers 202, 204, and 206 dispatch dependent operations before the parent load has finished executing. Because uops are speculatively scheduled and executed in processor 200, the processor 200 also includes logic to handle memory misses. If a data load misses in the data cache, there can be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. A replay mechanism tracks and re-executes instructions that use incorrect data. Only the dependent operations need to be replayed, and the independent ones are allowed to complete. The schedulers and replay mechanism of one embodiment of a processor are also designed to catch instructions that provide general-purpose GF(256) SIMD cryptographic arithmetic functionality.

The term "registers" refers to the on-board processor storage locations that are used as part of instructions to identify operands. In other words, registers are those that are usable from outside the processor (from a programmer's perspective). However, the registers of an embodiment should not be limited in meaning to a particular type of circuit. Rather, a register of an embodiment is capable of storing and providing data and of performing the functions described herein. The registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. In one embodiment, integer registers store 32-bit integer data. A register file of one embodiment also contains eight multimedia SIMD registers for packed data. For the discussions below, the registers are understood to be data registers designed to hold packed data, such as the 64-bit-wide MMX™ registers (also referred to as "mm" registers in some instances) in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. These MMX registers, available in both integer and floating-point forms, can operate with packed data elements that accompany SIMD and SSE instructions. Similarly, 128-bit-wide XMM registers relating to SSE2, SSE3, SSE4, or later technology (referred to generically as "SSEx") can also be used to hold such packed data operands. In one embodiment, in storing packed data and integer data, the registers do not need to differentiate between the two data types. In one embodiment, integer and floating-point data are contained either in the same register file or in different register files. Furthermore, in one embodiment, floating-point and integer data may be stored in different registers or in the same registers.

In the examples of the following figures, a number of data operands are described. Fig. 3A illustrates various packed data type representations in multimedia registers according to one embodiment of the present invention. Fig. 3A illustrates data types for a packed byte 310, a packed word 320, and a packed doubleword (dword) 330 for 128-bit-wide operands. The packed byte format 310 of this example is 128 bits long and contains sixteen packed byte data elements. A byte is defined here as 8 bits of data. Information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and finally bit 127 through bit 120 for byte 15. Thus, all available bits are used in the register. This storage arrangement increases the storage efficiency of the processor. In addition, with sixteen data elements accessed, one operation can now be performed on sixteen data elements in parallel.
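The bit positions recited for the packed byte format follow a simple rule: byte i occupies bit 8i+7 through bit 8i. A minimal sketch of that rule is given below, assuming a Python integer stands in for the 128-bit operand; the function names `bit_range` and `byte_lane` are hypothetical.

```python
def bit_range(index):
    """Return (high_bit, low_bit) occupied by byte `index` of a 128-bit operand."""
    if not 0 <= index < 16:
        raise IndexError("a 128-bit operand holds bytes 0 through 15")
    return 8 * index + 7, 8 * index


def byte_lane(reg128, index):
    """Extract byte `index` from a 128-bit value modeled as a Python int."""
    _, low_bit = bit_range(index)
    return (reg128 >> low_bit) & 0xFF
```

With this rule, `bit_range(0)` yields bits 7 through 0 and `bit_range(15)` yields bits 127 through 120, matching the layout described for packed byte format 310.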

Generally, a data element is an individual piece of data that is stored in a single register or memory location with other data elements of the same length. In packed data sequences relating to SSEx technology, the number of data elements stored in an XMM register is 128 bits divided by the length in bits of an individual data element. Similarly, in packed data sequences relating to MMX and SSE technology, the number of data elements stored in an MMX register is 64 bits divided by the length in bits of an individual data element. Although the data types illustrated in Fig. 3A are 128 bits long, embodiments of the present invention can also operate with operands that are 64 bits wide, 256 bits wide, 512 bits wide, or of other sizes. The packed word format 320 of this example is 128 bits long and contains eight packed word data elements. Each packed word contains sixteen bits of information. The packed doubleword format 330 of Fig. 3A is 128 bits long and contains four packed doubleword data elements. Each packed doubleword data element contains thirty-two bits of information. A packed quadword is 128 bits long and contains two packed quadword data elements.
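The element counts stated above all follow from dividing the register width by the element width. The short sketch below tabulates that division for the formats named in the text; the dictionary `FORMATS` and the function names are illustrative assumptions.

```python
# Element widths, in bits, of the packed formats named in the text.
FORMATS = {"byte": 8, "word": 16, "doubleword": 32, "quadword": 64}


def element_count(register_bits, element_bits):
    """Number of equal-length data elements that fit in one register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


def lanes_per_format(register_bits):
    """Map each packed format name to its element count for a given register width."""
    return {name: element_count(register_bits, bits) for name, bits in FORMATS.items()}
```

Calling `lanes_per_format(128)` reproduces the sixteen, eight, four, and two elements described for a 128-bit operand, and the same function covers the 64-bit, 256-bit, and 512-bit operand widths mentioned above.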

Fig. 3B illustrates alternative in-register data storage formats. Each packed data can include more than one independent data element. Three packed data formats are illustrated: packed half 341, packed single 342, and packed double 343. One embodiment of packed half 341, packed single 342, and packed double 343 contains fixed-point data elements. For an alternative embodiment, one or more of packed half 341, packed single 342, and packed double 343 may contain floating-point data elements. One alternative embodiment of packed half 341 is 128 bits long and contains eight 16-bit data elements. One alternative embodiment of packed single 342 is 128 bits long and contains four 32-bit data elements. One embodiment of packed double 343 is 128 bits long and contains two 64-bit data elements. It will be appreciated that such packed data formats may be further extended to other register lengths, for example, to 96 bits, 160 bits, 192 bits, 224 bits, 256 bits, 512 bits, or more.

Fig. 3C illustrates various signed and unsigned packed data type representations in multimedia registers according to one embodiment of the present invention. Unsigned packed byte representation 344 illustrates the storage of an unsigned packed byte in a SIMD register. Information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and so on, and finally bit 127 through bit 120 for byte 15. Thus, all available bits are used in the register. This storage arrangement can increase the storage efficiency of the processor. In addition, with sixteen data elements accessed, one operation can now be performed on sixteen data elements in a parallel fashion. Signed packed byte representation 345 illustrates the storage of a signed packed byte. Note that the eighth bit of every byte data element is the sign indicator. Unsigned packed word representation 346 illustrates how word 7 through word 0 are stored in a SIMD register. Signed packed word representation 347 is similar to the unsigned packed word in-register representation 346. Note that the sixteenth bit of each word data element is the sign indicator. Unsigned packed doubleword representation 348 shows how doubleword data elements are stored. Signed packed doubleword representation 349 is similar to the unsigned packed doubleword in-register representation 348. Note that the necessary sign bit is the thirty-second bit of each doubleword data element.
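The difference between the signed and unsigned representations lies only in how the top bit of each element is interpreted. The following sketch assumes two's complement encoding, which the text does not state explicitly, and uses hypothetical helper names.

```python
def as_signed(value, bits):
    """Reinterpret an unsigned `bits`-wide element as two's complement (assumed encoding)."""
    sign_bit = 1 << (bits - 1)  # eighth, sixteenth, or thirty-second bit
    return (value & (sign_bit - 1)) - (value & sign_bit)


def signed_lanes(reg128, bits):
    """Read every `bits`-wide element of a 128-bit value as a signed integer."""
    mask = (1 << bits) - 1
    return [as_signed((reg128 >> (i * bits)) & mask, bits) for i in range(128 // bits)]
```

The same stored bit pattern therefore yields 255 as an unsigned byte and -1 as a signed byte; the register contents are identical in representations 344 and 345, and only the interpretation differs.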

Fig. 3D is a depiction of one embodiment of an operation encoding (opcode) format 360, having thirty-two or more bits, and register/memory operand addressing modes corresponding with a type of opcode format described in the "Intel 64 and IA-32 Intel Architecture Software Developer's Manual Combined Volumes 2A and 2B: Instruction Set Reference A-Z," which is available from Intel Corporation of Santa Clara, California, on the world-wide-web (www) at intel.com/products/processor/manuals/. In one embodiment, an instruction may be encoded by one or more of fields 361 and 362. Up to two operand locations per instruction may be identified, including up to two source operand identifiers 364 and 365. For one embodiment, destination operand identifier 366 is the same as source operand identifier 364, whereas in other embodiments they are different. For an alternative embodiment, destination operand identifier 366 is the same as source operand identifier 365, whereas in other embodiments they are different. In one embodiment, one of the source operands identified by source operand identifiers 364 and 365 is overwritten by the results of the instruction, whereas in other embodiments identifier 364 corresponds to a source register element and identifier 365 corresponds to a destination register element. For one embodiment, operand identifiers 364 and 365 may be used to identify 32-bit or 64-bit source and destination operands.

Fig. 3E is a depiction of another alternative operation encoding (opcode) format 370, having forty or more bits. Opcode format 370 corresponds with opcode format 360 and comprises an optional prefix byte 378. An instruction according to one embodiment may be encoded by one or more of fields 378, 371, and 372. Up to two operand locations per instruction may be identified by source operand identifiers 374 and 375 and by prefix byte 378. For one embodiment, prefix byte 378 may be used to identify 32-bit or 64-bit source and destination operands. For one embodiment, destination operand identifier 376 is the same as source operand identifier 374, whereas in other embodiments they are different. For an alternative embodiment, destination operand identifier 376 is the same as source operand identifier 375, whereas in other embodiments they are different. In one embodiment, an instruction operates on one or more of the operands identified by operand identifiers 374 and 375, and one or more operands identified by operand identifiers 374 and 375 is overwritten by the results of the instruction, whereas in other embodiments, operands identified by identifiers 374 and 375 are written to another data element in another register. Opcode formats 360 and 370 allow register-to-register, memory-to-register, register-by-memory, register-by-register, register-by-immediate, and register-to-memory addressing specified in part by MOD fields 363 and 373 and by optional scale-index-base and displacement bytes.
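For readers unfamiliar with how a MOD field selects an addressing form, the sketch below splits a conventional x86 ModR/M byte into its three fields. It assumes the standard x86 layout with 32-bit or 64-bit addressing, which is not spelled out in the text, and it deliberately ignores the special case in which a scale-index-base byte itself implies a displacement; the function names are hypothetical.

```python
def split_modrm(byte):
    """Split a ModR/M byte into (mod, reg, rm): bits 7-6, bits 5-3, bits 2-0."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def describe_modrm(byte):
    """Summarize the operand form selected by a ModR/M byte (32/64-bit addressing assumed)."""
    mod, reg, rm = split_modrm(byte)
    if mod == 0b11:
        # Both operands are registers; no scale-index-base or displacement bytes follow.
        return {"kind": "register", "reg": reg, "rm": rm, "sib": False, "disp_bytes": 0}
    if mod == 0b01:
        disp_bytes = 1
    elif mod == 0b10:
        disp_bytes = 4
    else:
        # mod == 00: no displacement, except rm == 101 which selects a 32-bit displacement.
        disp_bytes = 4 if rm == 0b101 else 0
    # rm == 100 signals that a scale-index-base byte follows the ModR/M byte.
    return {"kind": "memory", "reg": reg, "rm": rm, "sib": rm == 0b100, "disp_bytes": disp_bytes}
```

For instance, the byte 0xC3 selects a register-to-register form, whereas 0x44 selects a memory operand addressed through a scale-index-base byte with a one-byte displacement.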

Turning next to Fig. 3F, in some alternative embodiments, 64-bit (or 128-bit, 256-bit, 512-bit or wider) single instruction multiple data (SIMD) arithmetic operations may be performed through a coprocessor data processing (CDP) instruction. Operation encoding (opcode) format 380 depicts one such CDP instruction having CDP opcode fields 382 and 389. For alternative embodiments, the type of CDP instruction operation may be encoded by one or more of fields 383, 384, 387 and 388. Up to three operand locations per instruction may be identified, including up to two source operand identifiers 385 and 390 and one destination operand identifier 386. One embodiment of the coprocessor can operate on 8-, 16-, 32- and 64-bit values. For one embodiment, an instruction is performed on integer data elements. In some embodiments, an instruction may be executed conditionally, using condition field 381. For some embodiments, source data sizes may be encoded by field 383. In some embodiments, zero (Z), negative (N), carry (C) and overflow (V) detection can be done on SIMD fields. For some instructions, the type of saturation may be encoded by field 384.

Turning next to Fig. 3G, depicted is another alternative operation encoding (opcode) format 397 to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality according to another embodiment, corresponding to a type of opcode format described in the "Intel Advanced Vector Extensions Programming Reference," which is available from Intel Corporation, Santa Clara, California, on the world-wide-web (www) at intel.com/products/processor/manuals/.

The original x86 instruction set provided for a 1-byte opcode with various formats of address syllable and an immediate operand contained in additional bytes, whose presence was known from the first "opcode" byte. Additionally, certain byte values were reserved as modifiers to the opcode (called prefixes, because they had to be placed before the instruction). When the original palette of 256 opcode bytes (including these special prefix values) was exhausted, a single byte was dedicated as an escape to a new set of 256 opcodes. As vector instructions (e.g., SIMD) were added, a need for more opcodes arose, and the "two-byte" opcode map was also insufficient, even when expanded through the use of prefixes. To this end, new instructions were added in additional maps, which use two bytes plus an optional prefix as an identifier.

Additionally, in order to facilitate additional registers in 64-bit mode, an additional prefix (called "REX") may be used between the prefixes and the opcode (and any escape bytes necessary to determine the opcode). In one embodiment, the REX may have 4 "payload" bits to indicate the use of additional registers in 64-bit mode. In other embodiments it may have fewer or more than 4 bits. The general format of at least one instruction set (which corresponds generally to format 360 and/or format 370) is illustrated generically by the following:

[prefixes] [rex] escape [escape2] opcode modrm (etc.)
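
As an illustration of the four REX "payload" bits mentioned above, the following minimal sketch decodes a REX prefix as it is laid out in the publicly documented x86-64 encoding (high nibble 0100, then bits W, R, X and B). The names `Rex` and `decode_rex` are hypothetical helpers introduced only for this example, and the sketch assumes the embodiment follows that public layout.

```python
from typing import NamedTuple, Optional


class Rex(NamedTuple):
    """The four payload bits of a REX prefix (assumed x86-64 layout 0100WRXB)."""
    w: int  # 1 selects a 64-bit operand size
    r: int  # extends the ModRM.reg field
    x: int  # extends the SIB.index field
    b: int  # extends ModRM.rm, SIB.base, or an opcode register field


def decode_rex(byte: int) -> Optional[Rex]:
    """Return the payload bits if `byte` is in the REX range 0x40-0x4F, else None."""
    if byte & 0xF0 != 0x40:
        return None
    return Rex((byte >> 3) & 1, (byte >> 2) & 1, (byte >> 1) & 1, byte & 1)
```

For instance, the byte 0x48 has only the W bit set, which is the common way to request a 64-bit operand size without selecting any extended register.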

Opcode format 397 corresponds to opcode format 370 and comprises optional VEX prefix bytes 391 (beginning with C4 hex in one embodiment) to replace most other commonly used legacy instruction prefix bytes and escape codes. For example, the following illustrates an embodiment using two fields to encode an instruction, which may be used when a second escape code is present in the original instruction, or when extra bits (e.g., the XB and W fields) in the REX field need to be used. In the embodiment illustrated below, the legacy escape is represented by a new escape value, legacy prefixes are fully compressed as part of the "payload" bytes, legacy prefixes are reclaimed and available for future expansion, the second escape code is compressed into a "map" field with future map or feature space available, and new features are added (e.g., increased vector length and an additional source register specifier).

An instruction according to one embodiment may be encoded by one or more of fields 391 and 392. Up to four operand locations per instruction may be identified by field 391 in combination with source operand identifiers 374 and 375, and in combination with an optional scale-index-base (SIB) identifier 393, an optional displacement identifier 394 and an optional immediate byte 395. For one embodiment, VEX prefix bytes 391 may be used to identify 32-bit or 64-bit source and destination operands and/or 128-bit or 256-bit SIMD register or memory operands. For one embodiment, the functionality provided by opcode format 397 may be redundant with opcode format 370, whereas in other embodiments they are different. Opcode formats 370 and 397 allow register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing, specified in part by MOD field 373 and by optional SIB identifier 393, optional displacement identifier 394 and optional immediate byte 395.
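
To make the "payload", "map", vector-length and extra source-register fields more concrete, the sketch below unpacks a three-byte VEX prefix using the layout of the publicly documented AVX encoding (C4, then inverted R/X/B plus a 5-bit map, then W, inverted vvvv, L and pp). The function name `decode_vex3` and the returned dictionary keys are hypothetical, and the sketch assumes that format 397 matches the public layout; the description itself does not spell out the bit positions.

```python
from typing import Dict


def decode_vex3(prefix: bytes) -> Dict[str, int]:
    """Unpack a 3-byte VEX prefix, assuming the public AVX bit layout."""
    if len(prefix) != 3 or prefix[0] != 0xC4:
        raise ValueError("expected a 3-byte prefix starting with 0xC4")
    p1, p2 = prefix[1], prefix[2]
    return {
        # R, X and B are stored inverted in the prefix.
        "R": ((p1 >> 7) & 1) ^ 1,
        "X": ((p1 >> 6) & 1) ^ 1,
        "B": ((p1 >> 5) & 1) ^ 1,
        "map": p1 & 0x1F,            # compressed escape code(s)
        "W": (p2 >> 7) & 1,
        "vvvv": ~(p2 >> 3) & 0xF,    # extra source register, stored inverted
        "L": (p2 >> 2) & 1,          # vector length: 0 = 128-bit, 1 = 256-bit
        "pp": p2 & 0x3,              # compressed legacy prefix
    }
```

The single prefix therefore carries what previously required a legacy prefix, a REX byte and up to two escape bytes.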

Turning next to Fig. 3H, depicted is another alternative operation encoding (opcode) format 398 to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality according to another embodiment. Opcode format 398 corresponds to opcode formats 370 and 397 and comprises optional EVEX prefix bytes 396 (beginning with 62 hex in one embodiment) to replace most other commonly used legacy instruction prefix bytes and escape codes and to provide additional functionality. An instruction according to one embodiment may be encoded by one or more of fields 396 and 392. Up to four operand locations per instruction and a mask may be identified by field 396 in combination with source operand identifiers 374 and 375, and in combination with an optional scale-index-base (SIB) identifier 393, an optional displacement identifier 394 and an optional immediate byte 395. For one embodiment, EVEX prefix bytes 396 may be used to identify 32-bit or 64-bit source and destination operands and/or 128-bit, 256-bit or 512-bit SIMD register or memory operands. For one embodiment, the functionality provided by opcode format 398 may be redundant with opcode formats 370 or 397, whereas in other embodiments they are different. Opcode format 398 allows register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing, with masks, specified in part by MOD field 373 and by optional SIB identifier 393, optional displacement identifier 394 and optional immediate byte 395. The general format of at least one instruction set (which corresponds generally to format 360 and/or format 370) is illustrated generically by the following:

evex1 RXBmmmmm WvvvLpp evex4 opcode modrm [sib] [disp] [imm]

For one embodiment, an instruction encoded according to EVEX format 398 may have additional "payload" bits that may be used to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality with additional new features such as, for example, a user-configurable mask register, an additional operand, a selection from among 128-bit, 256-bit or 512-bit vector registers, or more registers from which to select.

For example, where VEX format 397 may be used to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality with an implicit mask, EVEX format 398 may be used to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality with an explicit user-configurable mask. Additionally, where VEX format 397 may be used to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality on 128-bit or 256-bit vector registers, EVEX format 398 may be used to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality on 128-bit, 256-bit, 512-bit or larger (or smaller) vector registers.
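
The effect of an explicit, user-configurable mask can be pictured with a small software model. The sketch below is illustrative only: the function name `masked_bytewise`, the one-mask-bit-per-byte layout, and the choice between merging and zeroing of masked-off elements are assumptions borrowed from common masked-SIMD practice, not details stated in this description.

```python
from typing import Callable, List, Sequence


def masked_bytewise(
    op: Callable[[int], int],
    src: Sequence[int],
    dest: Sequence[int],
    mask: int,
    zeroing: bool = False,
) -> List[int]:
    """Apply `op` to each byte of `src` whose mask bit is set.

    Bit i of `mask` controls element i. Masked-off elements either keep the
    old destination value (merging) or are cleared (zeroing).
    """
    out = []
    for i, (s, d) in enumerate(zip(src, dest)):
        if (mask >> i) & 1:
            out.append(op(s) & 0xFF)
        elif zeroing:
            out.append(0)
        else:
            out.append(d)
    return out
```

An implicit mask corresponds to calling such a routine with every mask bit set, so that all elements are always processed.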

Example instructions to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality are illustrated by the following examples:

It will also be appreciated that providing execution of at least the following instructions: (1) a SIMD affine transformation instruction specifying a source data operand, a transformation matrix operand and a translation vector, wherein the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each of the transformed elements; (2) a SIMD binary finite field multiplicative inverse instruction to compute, for each element of the source data operand, an inverse in a binary finite field modulo an irreducible polynomial; (3) a SIMD affine transformation and multiplicative inverse (or multiplicative inverse and affine transformation) instruction specifying a source data operand, a transformation matrix operand and a translation vector, wherein, either before or after the multiplicative inverse operation, the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each of the transformed elements; (4) a modulus reduction instruction to compute a reduction modulo a particular modulus polynomial p selected from the polynomials in a binary finite field for which modulus reduction is provided by the instruction (or micro-instruction); and (5) a SIMD binary finite field multiplication instruction specifying first and second source data operands, to multiply each corresponding pair of elements of the first and second source data operands modulo an irreducible polynomial; wherein the results of these instructions are stored in SIMD destination registers; may provide general-purpose GF(256) and/or other alternative binary finite field SIMD cryptographic arithmetic functionality in hardware and/or microcode sequences, in order to support significant performance improvements for several important performance-critical applications without excessive or undue functional units requiring additional circuitry, area or power.
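
The arithmetic behind instructions (1), (2) and (5) above can be modelled in a few lines of software. The following is a minimal sketch, not the patented hardware: it assumes the AES irreducible polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) as the default modulus, represents the 8x8 bit matrix as eight row bytes in which bit j of row i is the coefficient of input bit j for output bit i, and treats the translation vector as a single byte applied to every element. All function names and the constant `AES_MATRIX` are hypothetical and chosen only for this example.

```python
from typing import List, Sequence

AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1 (assumed default modulus)


def gf_mul(a: int, b: int, poly: int = AES_POLY) -> int:
    """Multiply two GF(2^8) elements modulo `poly` (shift-and-xor)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # degree reached 8: reduce
            a ^= poly
        b >>= 1
    return result


def gf_inv(a: int, poly: int = AES_POLY) -> int:
    """Multiplicative inverse as a^254; maps 0 to 0 by convention."""
    result, base, exp = 1, a, 254
    while exp:
        if exp & 1:
            result = gf_mul(result, base, poly)
        base = gf_mul(base, base, poly)
        exp >>= 1
    return result


def affine_byte(x: int, matrix: Sequence[int], vector: int) -> int:
    """Compute (matrix * x) xor vector over GF(2) for one byte."""
    out = 0
    for i, row in enumerate(matrix):
        parity = bin(row & x).count("1") & 1   # GF(2) dot product
        out |= parity << i
    return out ^ vector


def simd_affine(src: Sequence[int], matrix: Sequence[int], vector: int) -> List[int]:
    """Apply the same affine transformation to every element of `src`."""
    return [affine_byte(x, matrix, vector) for x in src]


def simd_inv(src: Sequence[int], poly: int = AES_POLY) -> List[int]:
    """Element-wise multiplicative inverse."""
    return [gf_inv(x, poly) for x in src]


def simd_mul(a: Sequence[int], b: Sequence[int], poly: int = AES_POLY) -> List[int]:
    """Element-wise product of corresponding pairs."""
    return [gf_mul(x, y, poly) for x, y in zip(a, b)]


# AES S-box affine matrix in the row encoding described above.
AES_MATRIX = [0xF1, 0xE3, 0xC7, 0x8F, 0x1F, 0x3E, 0x7C, 0xF8]
```

Composing the inverse with the affine step, as in instruction (3), reproduces the AES S-box in this model: `simd_affine(simd_inv([0x00, 0x01, 0x53]), AES_MATRIX, 0x63)` yields `[0x63, 0x7C, 0xED]`. Passing a different `poly` or matrix models other ciphers that use a different GF(2^8) representation, which is the sense in which the functionality is "general purpose".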

Fig. 4A is a block diagram illustrating an in-order pipeline and a register renaming, out-of-order issue/execution pipeline according to at least one embodiment of the invention. Fig. 4B is a block diagram illustrating an in-order architecture core and register renaming logic, out-of-order issue/execution logic to be included in a processor according to at least one embodiment of the invention. The solid-lined boxes in Fig. 4A illustrate the in-order pipeline, while the dashed-lined boxes illustrate the register renaming, out-of-order issue/execution pipeline. Similarly, the solid-lined boxes in Fig. 4B illustrate the in-order architecture logic, while the dashed-lined boxes illustrate the register renaming logic and out-of-order issue/execution logic.

In Fig. 4A, a processor pipeline 400 includes a fetch stage 402, a length decode stage 404, a decode stage 406, an allocation stage 408, a renaming stage 410, a scheduling (also known as dispatch or issue) stage 412, a register read/memory read stage 414, an execute stage 416, a write back/memory write stage 418, an exception handling stage 422 and a commit stage 424.

In Fig. 4B, arrows denote a coupling between two or more units, and the direction of each arrow indicates the direction of data flow between those units. Fig. 4B shows a processor core 490 including a front end unit 430 coupled to an execution engine unit 450, both of which are coupled to a memory unit 470.

The core 490 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 490 may be a special-purpose core, such as, for example, a network or communication core, a compression engine, a graphics core, or the like.

The front end unit 430 includes a branch prediction unit 432 coupled to an instruction cache unit 434, which is coupled to an instruction translation lookaside buffer (TLB) 436, which is coupled to an instruction fetch unit 438, which is coupled to a decode unit 440. The decode unit or decoder may decode instructions and generate as an output one or more micro-operations, microcode entry points, microinstructions, other instructions or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decoder may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (ROMs), etc. The instruction cache unit 434 is further coupled to a level 2 (L2) cache unit 476 in the memory unit 470. The decode unit 440 is coupled to a rename/allocator unit 452 in the execution engine unit 450.

The execution engine unit 450 includes the rename/allocator unit 452 coupled to a retirement unit 454 and a set of one or more scheduler units 456. The scheduler units 456 represent any number of different schedulers, including reservation stations, a central instruction window, etc. The scheduler units 456 are coupled to the physical register file units 458. Each of the physical register file units 458 represents one or more physical register files, different ones of which store one or more different data types (e.g., scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, etc.), status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. The physical register file units 458 are overlapped by the retirement unit 454 to illustrate the various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer and a retirement register file; using a future file, a history buffer and a retirement register file; using register maps and a pool of registers; etc.). Generally, the architectural registers are visible from outside the processor or from a programmer's perspective. The registers are not limited to any known particular type of circuit. Various different types of registers are suitable as long as they are capable of storing and providing data as described herein. Examples of suitable registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. The retirement unit 454 and the physical register file units 458 are coupled to the execution clusters 460. The execution clusters 460 include a set of one or more execution units 462 and a set of one or more memory access units 464. The execution units 462 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler units 456, physical register file units 458 and execution clusters 460 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline, each having its own scheduler unit, physical register file unit and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access units 464). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

The set of memory access units 464 is coupled to the memory unit 470, which includes a data TLB unit 472 coupled to a data cache unit 474, which is in turn coupled to a level 2 (L2) cache unit 476. In one exemplary embodiment, the memory access units 464 may include a load unit, a store address unit and a store data unit, each of which is coupled to the data TLB unit 472 in the memory unit 470. The L2 cache unit 476 is coupled to one or more other levels of cache and eventually to a main memory.

By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 400 as follows: 1) the instruction fetch unit 438 performs the fetch and length decode stages 402 and 404; 2) the decode unit 440 performs the decode stage 406; 3) the rename/allocator unit 452 performs the allocation stage 408 and the renaming stage 410; 4) the scheduler units 456 perform the scheduling stage 412; 5) the physical register file units 458 and the memory unit 470 perform the register read/memory read stage 414, and the execution cluster 460 performs the execute stage 416; 6) the memory unit 470 and the physical register file units 458 perform the write back/memory write stage 418; 7) various units may be involved in the exception handling stage 422; and 8) the retirement unit 454 and the physical register file units 458 perform the commit stage 424.
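
The stage-to-unit assignment just listed can be restated as a small lookup table, which makes it easy to ask which stages a given unit takes part in. This is only a bookkeeping sketch: the reference numerals come from the text, while the table name `PIPELINE_400` and the helper `stages_for_unit` are hypothetical.

```python
from typing import List, Tuple

# (stage numeral, stage name, numerals of the units that perform it)
PIPELINE_400: List[Tuple[int, str, List[int]]] = [
    (402, "fetch", [438]),
    (404, "length decode", [438]),
    (406, "decode", [440]),
    (408, "allocation", [452]),
    (410, "renaming", [452]),
    (412, "scheduling", [456]),
    (414, "register read/memory read", [458, 470]),
    (416, "execute", [460]),
    (418, "write back/memory write", [470, 458]),
    (422, "exception handling", []),  # "various units"; none named in the text
    (424, "commit", [454, 458]),
]


def stages_for_unit(unit: int) -> List[int]:
    """Return the stage numerals in which the given unit participates."""
    return [stage for stage, _name, units in PIPELINE_400 if unit in units]
```

For example, `stages_for_unit(458)` returns the three stages that touch the physical register file units: register read, write back and commit.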

The core 490 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added in newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, California; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, California).

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time-sliced fetching and decoding and simultaneous multithreading thereafter, such as in Hyper-Threading technology).

While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 434/474 and a shared L2 cache unit 476, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the caches may be external to the core and/or the processor.

Fig. 5 is a block diagram of a single core processor and a multicore processor 500 with an integrated memory controller and graphics according to embodiments of the invention. The solid-lined boxes in Fig. 5 illustrate a processor 500 with a single core 502A, a system agent 510 and a set of one or more bus controller units 516, while the optional addition of the dashed-lined boxes illustrates an alternative processor 500 with multiple cores 502A-N, a set of one or more integrated memory controller units 514 in the system agent unit 510, and integrated graphics logic 508.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 506, and external memory (not shown) coupled to the set of integrated memory controller units 514. The set of shared cache units 506 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4) or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 512 interconnects the integrated graphics logic 508, the set of shared cache units 506 and the system agent unit 510, alternative embodiments may use any number of well-known techniques for interconnecting such units.

In some embodiments, one or more of the cores 502A-N are capable of multithreading. The system agent 510 includes those components that coordinate and operate the cores 502A-N. The system agent unit 510 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or include the logic and components needed for regulating the power state of the cores 502A-N and the integrated graphics logic 508. The display unit is for driving one or more externally connected displays.

The cores 502A-N may be homogeneous or heterogeneous in terms of architecture and/or instruction set. For example, some of the cores 502A-N may be in-order while others are out-of-order. As another example, two or more of the cores 502A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

The processor may be a general-purpose processor, such as a Core™ i3, i5, i7, 2 Duo and Quad, Xeon™, Itanium™, XScale™ or StrongARM™ processor, all of which are available from Intel Corporation of Santa Clara, California. Alternatively, the processor may be from another company, such as ARM Holdings, MIPS, etc. The processor may be a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a graphics processor, a co-processor, an embedded processor, or the like. The processor may be implemented on one or more chips. The processor 500 may be a part of, and/or may be implemented on, one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS or NMOS.

Figs. 6-8 are exemplary systems suitable for including the processor 500, while Fig. 9 is an exemplary system on a chip (SoC) that may include one or more of the cores 502. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.

Referring now to Fig. 6, shown is a block diagram of a system 600 in accordance with one embodiment of the present invention. The system 600 may include one or more processors 610, 615, which are coupled to a graphics memory controller hub (GMCH) 620. The optional nature of the additional processor 615 is denoted in Fig. 6 with broken lines.

Each processor 610, 615 may be some version of the processor 500. However, it should be noted that it is unlikely that integrated graphics logic and integrated memory control units would exist in the processors 610, 615. Fig. 6 illustrates that the GMCH 620 may be coupled to a memory 640 that may be, for example, a dynamic random access memory (DRAM). The DRAM may, for at least one embodiment, be associated with a non-volatile cache.

The GMCH 620 may be a chipset, or a portion of a chipset. The GMCH 620 may communicate with the processors 610, 615 and control interaction between the processors 610, 615 and the memory 640. The GMCH 620 may also act as an accelerated bus interface between the processors 610, 615 and other elements of the system 600. For at least one embodiment, the GMCH 620 communicates with the processors 610, 615 via a multi-drop bus, such as a frontside bus (FSB) 695.

Furthermore, the GMCH 620 is coupled to a display 645 (such as a flat panel display). The GMCH 620 may include an integrated graphics accelerator. The GMCH 620 is further coupled to an input/output (I/O) controller hub (ICH) 650, which may be used to couple various peripheral devices to the system 600. Shown by way of example in the embodiment of Fig. 6 are an external graphics device 660, which may be a discrete graphics device coupled to the ICH 650, and another peripheral device 670.

Alternatively, additional or different processors may also be present in the system 600. For example, the additional processor 615 may include additional processors that are the same as processor 610, additional processors that are heterogeneous or asymmetric to processor 610, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor. There can be a variety of differences between the physical resources 610, 615 in terms of a spectrum of metrics of merit, including architectural, micro-architectural, thermal and power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity among the processors 610, 615. For at least one embodiment, the various processors 610, 615 may reside in the same die package.

Referring now to Fig. 7, shown is a block diagram of a second system 700 in accordance with an embodiment of the present invention. As shown in Fig. 7, multiprocessor system 700 is a point-to-point interconnect system, and includes a first processor 770 and a second processor 780 coupled via a point-to-point interconnect 750. Each of processors 770 and 780 may be some version of the processor 500, as may one or more of the processors 610, 615.

While shown with only two processors 770, 780, it is to be understood that the scope of the present invention is not so limited. In other embodiments, one or more additional processors may be present in a given processor.

Processors 770 and 780 are shown including integrated memory controller (IMC) units 772 and 782, respectively. Processor 770 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 776 and 778; similarly, second processor 780 includes P-P interfaces 786 and 788. Processors 770, 780 may exchange information via a point-to-point (P-P) interface 750 using P-P interface circuits 778, 788. As shown in Fig. 7, IMCs 772 and 782 couple the processors to respective memories, namely a memory 732 and a memory 734, which may be portions of main memory locally attached to the respective processors.

Processors 770, 780 may each exchange information with a chipset 790 via individual P-P interfaces 752, 754 using point-to-point interface circuits 776, 794, 786, 798. Chipset 790 may also exchange information with a high-performance graphics circuit 738 via a high-performance graphics interface 739.

A shared cache (not shown) may be included in either processor, or outside of both processors yet connected with the processors via a P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.

Chipset 790 may be coupled to a first bus 716 via an interface 796. In one embodiment, first bus 716 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third-generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in Fig. 7, various I/O devices 714 may be coupled to first bus 716, along with a bus bridge 718 that couples first bus 716 to a second bus 720. In one embodiment, second bus 720 may be a low pin count (LPC) bus. Various devices may be coupled to second bus 720 including, in one embodiment, a keyboard and/or mouse 722, communication devices 727, and a storage unit 728 such as a disk drive or other mass storage device, which may include instructions/code and data 730. Further, an audio I/O 724 may be coupled to second bus 720. Note that other architectures are possible. For example, instead of the point-to-point architecture of Fig. 7, a system may implement a multi-drop bus or other such architecture.

Referring now to Fig. 8, shown is a block diagram of a third system 800 in accordance with an embodiment of the present invention. Like elements in Figs. 7 and 8 bear like reference numerals, and certain aspects of Fig. 7 have been omitted from Fig. 8 in order to avoid obscuring other aspects of Fig. 8.

Fig. 8 illustrates that the processors 870, 880 may include integrated memory and I/O control logic ("CL") 872 and 882, respectively. For at least one embodiment, the CL 872, 882 may include integrated memory controller units such as those described above in connection with Figs. 5 and 7. In addition, the CL 872, 882 may also include I/O control logic. Fig. 8 illustrates that not only are the memories 832, 834 coupled to the CL 872, 882, but also that I/O devices 814 are coupled to the control logic 872, 882. Legacy I/O devices 815 are coupled to the chipset 890.

Referring now to Fig. 9, shown is a block diagram of an SoC 900 in accordance with an embodiment of the present invention. Similar elements in Fig. 5 bear like reference numerals. Also, dashed-lined boxes are optional features on more advanced SoCs. In Fig. 9, an interconnect unit 902 is coupled to: an application processor 910, which includes a set of one or more cores 502A-N and shared cache units 506; a system agent unit 510; bus controller units 516; integrated memory controller units 514; a set of one or more media processors 920, which may include integrated graphics logic 508, an image processor 924 for providing still and/or video camera functionality, an audio processor 926 for providing hardware audio acceleration, and a video processor 928 for providing video encode/decode acceleration; a static random access memory (SRAM) unit 930; a direct memory access (DMA) unit 932; and a display unit 940 for coupling to one or more external displays.

Fig. 10 illustrates a processor containing a central processing unit (CPU) and a graphics processing unit (GPU), which may perform at least one instruction according to one embodiment. In one embodiment, an instruction to perform operations according to at least one embodiment could be performed by the CPU. In another embodiment, the instruction could be performed by the GPU. In still another embodiment, the instruction may be performed through a combination of operations performed by the GPU and the CPU. For example, in one embodiment, an instruction in accordance with one embodiment may be received and decoded for execution on the GPU. However, one or more operations within the decoded instruction may be performed by the CPU and the result returned to the GPU for final retirement of the instruction. Conversely, in some embodiments, the CPU may act as the primary processor and the GPU as the co-processor.

In some embodiments, instructions that benefit from highly parallel throughput processors may be performed by the GPU, while instructions that benefit from the performance of processors with deeply pipelined architectures may be performed by the CPU. For example, graphics, scientific applications, financial applications and other parallel workloads may benefit from the performance of the GPU and be executed accordingly, whereas more sequential applications, such as operating system kernel or application code, may be better suited to the CPU.

In Fig. 10, processor 1000 includes a CPU 1005, a GPU 1010, an image processor 1015, a video processor 1020, a USB controller 1025, a UART controller 1030, an SPI/SDIO controller 1035, a display device 1040, a High-Definition Multimedia Interface (HDMI) controller 1045, a MIPI controller 1050, a flash memory controller 1055, a double data rate (DDR) controller 1060, a security engine 1065, and an I²S/I²C (Integrated Interchip Sound/Inter-Integrated Circuit) interface 1070. Other logic and circuits, including more CPUs or GPUs and other peripheral interface controllers, may be included in the processor of Fig. 10.

One or more aspects of at least one embodiment may be implemented by representative data stored on a machine-readable medium that represents various logic within the processor and that, when read by a machine, causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores", may be stored on a tangible, machine-readable medium ("tape") and supplied to various customers or manufacturing facilities to be loaded into the fabrication machines that actually make the logic or processor. For example, IP cores, such as the Cortex™ family of processors developed by ARM Holdings, Ltd. and the Loongson IP cores developed by the Institute of Computing Technology (ICT) of the Chinese Academy of Sciences, may be licensed or sold to various customers or licensees, such as Texas Instruments, Qualcomm, Apple, or Samsung, and implemented in processors produced by these customers or licensees.

Figure 11 shows a block diagram illustrating the development of IP cores according to one embodiment. Storage 1130 includes simulation software 1120 and/or a hardware or software model 1110. In one embodiment, the data representing the IP core design can be provided to storage 1130 via memory 1140 (e.g., a hard disk), a wired connection (e.g., the internet) 1150 or a wireless connection 1160. The IP core information generated by the simulation tool and model can then be transmitted to a fabrication facility, where it can be fabricated by a third party to perform at least one instruction according to at least one embodiment.

In some embodiments, one or more instructions may correspond to a first type or architecture (e.g., x86) and be translated or emulated on a processor of a different type or architecture (e.g., ARM). An instruction according to one embodiment may therefore be performed on any processor or processor type, including ARM, x86, MIPS, a GPU, or another processor type or architecture.

Figure 12 illustrates how an instruction of a first type is emulated by a processor of a different type, according to one embodiment. In Fig. 12, program 1205 contains some instructions that may perform the same or substantially the same function as an instruction according to one embodiment. However, the instructions of program 1205 may be of a type and/or format that is different from or incompatible with processor 1215, meaning that instructions of the type in program 1205 may not be executable natively by processor 1215. With the help of emulation logic 1210, however, the instructions of program 1205 are translated into instructions that can be natively executed by processor 1215. In one embodiment, the emulation logic is embodied in hardware. In another embodiment, the emulation logic is embodied in a tangible, machine-readable medium containing software to translate instructions of the type in program 1205 into the type natively executable by processor 1215. In other embodiments, the emulation logic is a combination of fixed-function or programmable hardware and a program stored on a tangible, machine-readable medium. In one embodiment, the processor contains the emulation logic, whereas in other embodiments the emulation logic exists outside of the processor and is provided by a third party. In one embodiment, the processor is capable of loading the emulation logic embodied in a tangible, machine-readable medium containing software by executing microcode or firmware contained in or associated with the processor.

Figure 13 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, according to embodiments of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. Figure 13 shows that a program in a high-level language 1302 may be compiled using an x86 compiler 1304 to generate x86 binary code 1306 that may be natively executed by a processor with at least one x86 instruction set core 1316. The processor with at least one x86 instruction set core 1316 represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 1304 represents a compiler operable to generate x86 binary code 1306 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor with at least one x86 instruction set core 1316. Similarly, Figure 13 shows that the program in the high-level language 1302 may be compiled using an alternative instruction set compiler 1308 to generate alternative instruction set binary code 1310 that may be natively executed by a processor without at least one x86 instruction set core 1314 (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, CA and/or the ARM instruction set of ARM Holdings of Sunnyvale, CA). The instruction converter 1312 is used to convert the x86 binary code 1306 into code that may be natively executed by the processor without an x86 instruction set core 1314. This converted code is not likely to be the same as the alternative instruction set binary code 1310, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1312 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1306.

Figure 14 illustrates a flow diagram of one embodiment of a process 1401 for efficiently implementing the Advanced Encryption Standard (AES) encryption/decryption standard. Process 1401 and other processes disclosed herein are performed by processing blocks that may comprise dedicated hardware, or software or firmware operation codes executable by general-purpose machines, by special-purpose machines, or by a combination of both. In one embodiment, the composite field GF((2⁴)²) may be used with the irreducible polynomials x⁴ + x² + x + 1 and x² + 2x + 0xE for the AES inverse column-mixing transformation.

In processing block 1411, a 128-bit input block comprising 16 byte values is logically exclusive-ORed (XORed) with a round key. In processing block 1412, it is determined whether the process is encrypting, in which case processing continues from point 1418, or decrypting, in which case processing resumes in processing block 1413.

In processing block 1413, a field conversion circuit is used to convert each of the 16 byte values, respectively, from a corresponding polynomial representation in GF(256) to another corresponding polynomial representation in the composite field GF((2⁴)²). For one embodiment of processing block 1413, a polynomial representation [a₇, a₆, a₅, a₄, a₃, a₂, a₁, a₀] in GF(256) can be converted to a corresponding polynomial representation [b₇, b₆, b₅, b₄, b₃, b₂, b₁, b₀] in the composite field GF((2⁴)²) by multiplying each byte value by an 8 × 8 conversion matrix, which may be implemented by the following series of XORs:

b_{0} = a_{0} \oplus a_{2} \oplus a_{3} \oplus a_{4} \oplus a_{5} \oplus a_{6} \oplus a_{7},

b_{1} = a_{7},

b_{2} = a_{4} \oplus a_{5} \oplus a_{7},

b_{3} = a_{1} \oplus a_{3} \oplus a_{5} \oplus a_{6},

b_{4} = a_{4} \oplus a_{5} \oplus a_{6},

b_{5} = a_{1} \oplus a_{4} \oplus a_{5} \oplus a_{6},

b_{6} = a_{5} \oplus a_{7},

b_{7} = a_{2} \oplus a_{3} \oplus a_{4} \oplus a_{6} \oplus a_{7}.
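The series of XORs for processing block 1413 can be sketched in software as a multiplication by an 8 × 8 bit matrix over GF(2). The following is a minimal illustration rather than the circuit itself: the names `ROWS_1413`, `convert_bits` and `to_composite_1413` are hypothetical, and the sketch assumes that a₀ and b₀ are the least significant bits of their bytes.

```python
# XOR term lists transcribed from the equations for processing block 1413.
# Entry i lists the source bit indices a_j that are XORed to form b_i.
# Assumption: index 0 is the least significant bit of the byte.
ROWS_1413 = [
    (0, 2, 3, 4, 5, 6, 7),  # b0
    (7,),                   # b1
    (4, 5, 7),              # b2
    (1, 3, 5, 6),           # b3
    (4, 5, 6),              # b4
    (1, 4, 5, 6),           # b5
    (5, 7),                 # b6
    (2, 3, 4, 6, 7),        # b7
]


def convert_bits(byte, rows):
    """Multiply one byte by an 8x8 bit matrix over GF(2) given as XOR term lists."""
    out = 0
    for i, terms in enumerate(rows):
        bit = 0
        for j in terms:
            bit ^= (byte >> j) & 1  # XOR together the selected source bits
        out |= bit << i
    return out


def to_composite_1413(byte):
    """Apply the processing block 1413 conversion to a single byte value."""
    return convert_bits(byte, ROWS_1413)


def convert_block(block):
    """Apply the conversion independently to each byte of a 16-byte block."""
    return bytes(to_composite_1413(b) for b in block)
```

Because the mapping is linear over GF(2), converting the XOR of two bytes gives the XOR of their conversions, which is what allows it to be built from XOR gates alone.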

At this point, the 16 bytes may be viewed as a 4 × 4 block of bytes having four rows and four columns. In processing block 1414, it is determined whether the current round is the last/special round, in which case no inverse column mixing is performed; otherwise, in processing block 1415, an inverse column-mixing circuit is used to compute the inverse column-mixing transformation in GF((2⁴)²) of the 16 byte values to obtain corresponding transformed polynomial representations in GF((2⁴)²). For one embodiment, the inverse column-mixing transformation in GF((2⁴)²) of the 16-byte input values can be performed as follows:

It will be appreciated that such a matrix multiplication can be performed on [a₃, a₂, a₁, a₀, b₃, b₂, b₁, b₀] in GF((2⁴)²) by computing, in a first stage, the unique terms needed to perform the multiplications by the matrix constants in the expression for each result, and then summing those unique terms to generate each result. For example, the unique terms from the nibble [a₃, a₂, a₁, a₀] needed to compute the above matrix multiplication are:

(a_{3} \oplus a_{0}) \oplus a_{1}, (a_{2} \oplus a_{1}) \oplus a_{3}, (a_{2} \oplus a_{0}) \oplus a_{1}, (a_{3} \oplus a_{2}) \oplus a_{0}, a_{3} \oplus a_{1}, (a_{3} \oplus a_{2}) + (a_{1} \oplus a_{0}).

The unique terms from the nibble [b₃, b₂, b₁, b₀] needed to compute the above matrix multiplication are:

(b_{3} \oplus b_{2}) \oplus b_{1}, (b_{2} \oplus b_{1}) \oplus b_{3}, b_{3} \oplus b_{0}, b_{3} \oplus b_{1}, b_{1} \oplus b_{0}, (b_{3} \oplus b_{0}) + (b_{1} \oplus b_{2}).

In either case determined in processing block 1414, in processing block 1416 a hardwired row permutation, corresponding to the inverse row-mixing transformation, is performed on the 16 byte values. In processing block 1417, a second field conversion circuit is used to convert each corresponding transformed polynomial representation in GF((2⁴)²) and also to apply an affine transformation, generating, respectively, a third corresponding polynomial representation in a finite field other than GF((2⁴)²). In one embodiment of process 1401, that new finite field other than GF((2⁴)²) is the composite field GF((2²)⁴). This embodiment is described in greater detail below with reference to Fig. 2. In an alternative embodiment of process 1401, the new finite field is the original field GF(256). These embodiments are described in greater detail below with reference to Figs. 3a and 3b.

Continuing from point 1418, in processing block 1420 a multiplicative inverse circuit is used to compute, for each of the third corresponding polynomial representations of the 16 byte values, respectively, a corresponding multiplicative inverse polynomial representation in that new finite field other than GF((2⁴)²). In processing block 1421, it is determined whether the process is decrypting, in which case round processing is finished and the result is output in processing block 1426, or encrypting, in which case processing resumes in processing block 1422.

In processing block 1422, a circuit is used to apply an affine transformation to each corresponding multiplicative inverse polynomial representation of the 16 byte values, generating, respectively, a transformed corresponding polynomial representation in that new finite field other than GF((2⁴)²). If that new finite field is not the original field GF(256), then in block 1422 another field conversion can be combined with the circuit to convert each corresponding transformed polynomial representation back to the original field GF(256). Therefore, the polynomial representations for the remainder of process 1401 can be assumed to be in the original field GF(256).

In processing block 1423, a hardwired row permutation, corresponding to the forward row-mixing transformation, is performed on the 16 byte values. In processing block 1424, it is determined whether the current round is the last/special round, in which case no column mixing is performed; otherwise, in processing block 1425, a forward column-mixing circuit is used to compute the forward column-mixing transformation of the 16 byte values in GF(256) to obtain corresponding transformed polynomial representations in GF(256). It will be appreciated that because the coefficients of the forward column-mixing transformation in GF(256) are relatively small, no alternative field representation is used in processing block 1425. Finally, round processing of process 1401 is finished and the 16-byte result is output in processing block 1426.
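To illustrate why the forward column-mixing coefficients are considered small, the following sketch computes the forward column mix of a single 4-byte column directly in GF(256). The coefficients (2, 3, 1, 1) and the reduction polynomial x⁸ + x⁴ + x³ + x + 1 come from the published AES specification, not from this description, and the function names `xtime` and `mix_column` are hypothetical. Every product reduces to at most one doubling and one XOR.

```python
def xtime(b):
    """Multiply a byte by x (that is, by 2) in GF(256) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B  # reduce by the AES polynomial
    return b


def mix_column(col):
    """Forward column mixing of one 4-byte column with the AES coefficients 2, 3, 1, 1."""
    out = []
    for i in range(4):
        # Rotate the column so that s0 is multiplied by 2 and s1 by 3.
        s0, s1, s2, s3 = (col[(i + k) % 4] for k in range(4))
        out.append(xtime(s0) ^ (xtime(s1) ^ s1) ^ s2 ^ s3)
    return out
```

For example, `mix_column([0xDB, 0x13, 0x53, 0x45])` evaluates to `[0x8E, 0x4D, 0xA1, 0xBC]`. The inverse transformation uses the larger coefficients 0x0E, 0x0B, 0x0D and 0x09, which is the case where a composite-field representation becomes attractive.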

Figure 15 illustrates a flow diagram of one embodiment of a process 1501 for efficiently implementing the multiplicative inverse of the AES S-box. In the embodiment shown below, the composite field GF((2²)⁴) may be used with the irreducible polynomial x⁴ + x³ + x² + 2 for the S-box transformation.

Continuing from point 1418 of process 1401, at processing block 1518 it is determined whether the process is encrypting, in which case processing continues in processing block 1519. Otherwise, if the process is decrypting, a field conversion was already performed in processing block 1417, and the third corresponding polynomial representations of the 16 byte values are in the composite field GF((2²)⁴). For one embodiment of processing block 1417, the inverse affine transformation can be applied, and a polynomial representation [a₇, a₆, a₅, a₄, a₃, a₂, a₁, a₀] in GF((2⁴)²) can be converted to a corresponding polynomial representation [b₇, b₆, b₅, b₄, b₃, b₂, b₁, b₀] in the composite field GF((2²)⁴), by multiplying each byte value by an 8 × 8 conversion matrix and XORing some bits with constants (i.e., bit inversions), which may be implemented by the following series of XORs:

b_{1} = a_{1} \oplus a_{2} \oplus a_{3} \oplus a_{4},

b_{2} = a_{0} \oplus a_{2} \oplus a_{4} \oplus a_{5} \oplus a_{6},

b_{3} = a_{0} \oplus a_{1} \oplus a_{2} \oplus a_{4} \oplus a_{5} \oplus a_{6},

b_{5} = a_{0} \oplus a_{1} \oplus a_{2} \oplus a_{3} \oplus a_{5} \oplus a_{6} \oplus a_{7},

b_{7} = a_{0} \oplus a_{1} \oplus a_{2} \oplus a_{3} \oplus a_{4} \oplus a_{6}.

In processing block 1519, a field conversion is needed for the encryption process, so a field conversion circuit is used to convert each of the 16 byte values, respectively, from a corresponding polynomial representation in GF(256) to a corresponding polynomial representation in the composite field GF((2²)⁴). For one embodiment of processing block 1519, a polynomial representation [a₇, a₆, a₅, a₄, a₃, a₂, a₁, a₀] in GF(256) can be converted to a corresponding polynomial representation [b₇, b₆, b₅, b₄, b₃, b₂, b₁, b₀] in the composite field GF((2²)⁴) by multiplying each byte value by an 8 × 8 conversion matrix, which may be implemented by the following series of XORs:

b_{0} = a_{0} \oplus a_{1} \oplus a_{6},

b_{1} = a_{1} \oplus a_{4} \oplus a_{6},

b_{2} = a_{5} \oplus a_{6} \oplus a_{7},

b_{3} = a_{3} \oplus a_{4},

b_{4} = a_{1} \oplus a_{2} \oplus a_{3} \oplus a_{4} \oplus a_{5},

b_{5} = a_{3} \oplus a_{4} \oplus a_{5} \oplus a_{7},

b_{6} = a_{2} \oplus a_{5} \oplus a_{6},

b_{7} = a_{3} \oplus a_{7}.

In processing block 1520, an inversion circuit is used to compute, for each of the polynomial representations in GF((2²)⁴) of the 16 byte values, respectively, a multiplicative inverse polynomial representation in GF((2²)⁴). For one embodiment, an input corresponding to the polynomial representation [a, b, c, d] in the composite field GF((2²)⁴) and its multiplicative inverse [A, B, C, D] are related as follows:

(a \oplus c \oplus d) \cdot A \oplus (b \oplus c) \cdot B \oplus (a \oplus b) \cdot C \oplus a \cdot D = 0

(2 \cdot a \oplus b \oplus c) \cdot A \oplus (a \oplus b \oplus d) \cdot B \oplus (a \oplus c) \cdot C \oplus b \cdot D = 0

(2 \cdot a \oplus 2 \cdot b) \cdot A \oplus (2 \cdot a) \cdot B \oplus d \cdot C \oplus c \cdot D = 0

(2 \cdot b \oplus 2 \cdot c) \cdot A \oplus (2 \cdot a \oplus 2 \cdot b) \cdot B \oplus (2 \cdot a) \cdot C \oplus d \cdot D = 1

where \oplus and \cdot denote addition and multiplication in GF(2²), respectively.

The solution is A = \Delta^{-1} \cdot \Delta_{a}, B = \Delta^{-1} \cdot \Delta_{b}, C = \Delta^{-1} \cdot \Delta_{c}, D = \Delta^{-1} \cdot \Delta_{d}, where the determinant \Delta is given by:

\Delta = \begin{vmatrix}
a \oplus c \oplus d & b \oplus c & a \oplus b & a \\
2 \cdot a \oplus b \oplus c & a \oplus b \oplus d & a \oplus c & b \\
2 \cdot a \oplus 2 \cdot b & 2 \cdot a & d & c \\
2 \cdot b \oplus 2 \cdot c & 2 \cdot a \oplus 2 \cdot b & 2 \cdot a & d
\end{vmatrix}

and the determinants \Delta_{a}, \Delta_{b}, \Delta_{c} and \Delta_{d} are obtained from \Delta by replacing the first, second, third and fourth columns of \Delta, respectively, with {0, 0, 0, 1}. It will again be appreciated that such computations can be carried out in GF(2²) by expanding the determinant computation, computing in hardware the unique terms, such as a², b², a³, 3·b², etc., and the unique sums of terms required, and then summing particular combinations of terms to generate the necessary results.
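The determinant-based solution can be checked with a short software model. The sketch below is illustrative only and makes two assumptions that the text does not state explicitly: GF(2²) elements are encoded as the integers 0 to 3 with multiplication modulo t² + t + 1, and [a, b, c, d] denotes the polynomial a·x³ + b·x² + c·x + d reduced modulo x⁴ + x³ + x² + 2. All function names are hypothetical.

```python
def gf4_mul(x, y):
    """Multiply in GF(2^2) = GF(2)[t]/(t^2 + t + 1); elements are the integers 0..3."""
    r = 0
    if y & 1:
        r ^= x
    if y & 2:
        r ^= x << 1
    if r & 4:
        r ^= 0b111  # reduce by t^2 + t + 1
    return r


GF4_INV = {1: 1, 2: 3, 3: 2}  # multiplicative inverses of the nonzero elements


def det(m):
    """Determinant over GF(2^2) by cofactor expansion (signs vanish in characteristic 2)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for col, entry in enumerate(m[0]):
        if entry:
            minor = [row[:col] + row[col + 1:] for row in m[1:]]
            total ^= gf4_mul(entry, det(minor))
    return total


def system_matrix(a, b, c, d):
    """Coefficient matrix of the four linear equations relating [a,b,c,d] to [A,B,C,D]."""
    t = lambda v: gf4_mul(2, v)
    return [
        [a ^ c ^ d,    b ^ c,       a ^ b, a],
        [t(a) ^ b ^ c, a ^ b ^ d,   a ^ c, b],
        [t(a) ^ t(b),  t(a),        d,     c],
        [t(b) ^ t(c),  t(a) ^ t(b), t(a),  d],
    ]


def inverse(a, b, c, d):
    """Solve for [A, B, C, D] by replacing each column with {0, 0, 0, 1} in turn."""
    m = system_matrix(a, b, c, d)
    delta = det(m)
    if delta == 0:
        raise ZeroDivisionError("zero has no multiplicative inverse")
    rhs = [0, 0, 0, 1]
    result = []
    for col in range(4):
        replaced = [row[:col] + [rhs[i]] + row[col + 1:] for i, row in enumerate(m)]
        result.append(gf4_mul(GF4_INV[delta], det(replaced)))
    return tuple(result)


def poly_mul(p, q):
    """Independent check: multiply two elements modulo x^4 + x^3 + x^2 + 2."""
    prod = [0] * 7  # prod[k] is the coefficient of x^k
    for i, pc in enumerate(reversed(p)):
        for j, qc in enumerate(reversed(q)):
            prod[i + j] ^= gf4_mul(pc, qc)
    for k in (6, 5, 4):  # substitute x^4 = x^3 + x^2 + 2
        coeff, prod[k] = prod[k], 0
        prod[k - 1] ^= coeff
        prod[k - 2] ^= coeff
        prod[k - 4] ^= gf4_mul(2, coeff)
    return tuple(reversed(prod[:4]))
```

Under these assumptions, `poly_mul(e, inverse(*e))` returns `(0, 0, 0, 1)` for every nonzero element `e`, which is the property the four equations express.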

In processing block 1521, it is determined whether the process is decrypting, in which case processing continues in processing block 1522. In processing block 1522, another field conversion circuit is used to convert each of the 16 byte values, respectively, from a corresponding polynomial representation in the composite field GF((2²)⁴) to a corresponding polynomial representation in GF(256). For one embodiment of processing block 1522, a polynomial representation [a₇, a₆, a₅, a₄, a₃, a₂, a₁, a₀] in the composite field GF((2²)⁴) can be converted to a corresponding polynomial representation [b₇, b₆, b₅, b₄, b₃, b₂, b₁, b₀] in GF(256) by multiplying each byte value by an 8 × 8 conversion matrix, which may be implemented by the following series of XORs:

b_{0} = a_{0} \oplus a_{3} \oplus a_{4} \oplus a_{6},

b_{1} = a_{2} \oplus a_{4} \oplus a_{5} \oplus a_{6},

b_{2} = a_{1} \oplus a_{2} \oplus a_{4} \oplus a_{7},

b_{3} = a_{1} \oplus a_{4} \oplus a_{6},

b_{4} = a_{1} \oplus a_{3} \oplus a_{4} \oplus a_{6},

b_{5} = a_{1} \oplus a_{3} \oplus a_{4} \oplus a_{5} \oplus a_{6} \oplus a_{7},

b_{6} = a_{2} \oplus a_{3} \oplus a_{5},

b_{7} = a_{1} \oplus a_{4} \oplus a_{6} \oplus a_{7}.
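The conversion matrices of processing blocks 1519 and 1522 map in opposite directions between GF(256) and GF((2²)⁴), so applying one after the other should return the original byte. The sketch below transcribes both sets of XOR equations and checks that round trip; the names `ROWS_1519`, `ROWS_1522` and `convert_bits` are hypothetical, and a₀/b₀ are assumed to be the least significant bits.

```python
# Entry i lists the source bit indices a_j XORed to form b_i (bit 0 = least significant).
ROWS_1519 = [  # GF(256) -> GF((2^2)^4), processing block 1519
    (0, 1, 6), (1, 4, 6), (5, 6, 7), (3, 4),
    (1, 2, 3, 4, 5), (3, 4, 5, 7), (2, 5, 6), (3, 7),
]
ROWS_1522 = [  # GF((2^2)^4) -> GF(256), processing block 1522
    (0, 3, 4, 6), (2, 4, 5, 6), (1, 2, 4, 7), (1, 4, 6),
    (1, 3, 4, 6), (1, 3, 4, 5, 6, 7), (2, 3, 5), (1, 4, 6, 7),
]


def convert_bits(byte, rows):
    """Multiply one byte by an 8x8 bit matrix over GF(2) given as XOR term lists."""
    out = 0
    for i, terms in enumerate(rows):
        bit = 0
        for j in terms:
            bit ^= (byte >> j) & 1
        out |= bit << i
    return out


def round_trip(byte):
    """Convert to the composite field and back again."""
    return convert_bits(convert_bits(byte, ROWS_1519), ROWS_1522)
```

With the equations as transcribed here, `round_trip(x) == x` holds for all 256 byte values, which confirms that the two 8 × 8 matrices are inverses of each other over GF(2).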

Otherwise, if the process is encrypting, processing proceeds to processing block 1421 of process 1401. As explained with regard to processing block 1422 of process 1401, the circuit used in processing block 1422 to apply an affine transformation to the 16 bytes can be combined with the field conversion circuit of this embodiment to convert the 16 byte values from polynomial representations in GF((2²)⁴) to corresponding polynomial representations in GF(256). For one embodiment of processing block 1422, the affine transformation can be applied, and a polynomial representation [a₇, a₆, a₅, a₄, a₃, a₂, a₁, a₀] in the composite field GF((2²)⁴) can be converted to a corresponding polynomial representation [b₇, b₆, b₅, b₄, b₃, b₂, b₁, b₀] in GF(256), by multiplying each byte value by an 8 × 8 conversion matrix and XORing some bits with constants (i.e., bit inversions), which may be implemented by the following series of XORs:

b_{2} = a_{0} \oplus a_{2} \oplus a_{6},

b_{3} = a_{0} \oplus a_{1} \oplus a_{3} \oplus a_{4} \oplus a_{5},

b_{4} = a_{0} \oplus a_{1} \oplus a_{4} \oplus a_{5} \oplus a_{7},

b_{7} = a_{2} \oplus a_{3}.

Figure 16A illustrates a diagram of one embodiment of an apparatus 1601 for execution of an affine map instruction, which applies an affine transformation to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. In some embodiments, apparatus 1601 may be replicated 16 times, each apparatus 1601 comprising hardware processing blocks for efficiently implementing an affine transformation on a 128-bit block comprising 16 byte values, each byte having a polynomial representation in GF(256). In other embodiments of the affine map instruction (or micro-instruction), an element size may also be specified and/or the number of replications of apparatus 1601 may be selected to implement an affine transformation on a 128-bit block, a 256-bit block, a 512-bit block, etc. Embodiments of apparatus 1601 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit(s) 462) for execution of an affine map instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1601 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for an affine transformation in GF(256). In some embodiments, the affine map instruction may be implemented by micro-instructions (also called micro-operations, micro-ops or uops), for example, a finite field matrix-vector multiplication micro-instruction followed by a finite field vector addition (XOR) micro-instruction.

For example, embodiments of apparatus 1601 may be coupled with SIMD vector registers (e.g., physical register file unit(s) 458) comprising a variable plurality of m variable-sized data fields to store the values of a variable plurality of m variable-sized data elements. Some embodiments of the affine map instruction to provide general purpose GF(256) SIMD affine transformation functionality specify a source data operand set of elements 1612, a transformation matrix 1610 operand, and a translation vector 1614 operand. Responsive to the decoded affine map instruction, one or more execution units (e.g., execution unit(s) 462) perform a SIMD affine transformation by applying the transformation matrix 1610 operand to each element 1612 of the source data operand set (e.g., in a 128-bit block of 16 byte elements) through eight bitwise ANDs 1627-1620 of a GF(256) byte multiplier array in processing block 1602, and by applying the translation vector 1614 operand to each transformed element of the source data operand set through eight 9-input XORs 1637-1630 of a GF(256) bit adder array in processing block 1603. An affine-transformed result element 1618 for each element 1612 of the source data operand set of the affine map instruction is stored in a SIMD destination register (e.g., in physical register file unit(s) 458).
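The data path described for Fig. 16A can be modeled in a few lines: each result bit is the XOR of eight AND outputs (one matrix row ANDed with the source byte) plus one bit of the translation vector, which accounts for the nine XOR inputs. The sketch below is a behavioral illustration only. The function names are hypothetical, and the bit ordering (matrix row i produces result bit i, with bit 0 least significant) is an assumption; an actual instruction encoding may order rows and bits differently.

```python
def parity(v):
    """XOR of all bits of v."""
    return bin(v).count("1") & 1


def affine_byte(x, matrix_rows, vector):
    """Apply an 8x8 bit matrix and an 8-bit translation vector to one byte."""
    out = 0
    for i, row in enumerate(matrix_rows):
        # Bitwise AND of the row with the source byte, then a 9-input XOR:
        # eight AND outputs plus bit i of the translation vector.
        bit = parity(row & x) ^ ((vector >> i) & 1)
        out |= bit << i
    return out


def simd_affine(src, matrix_rows, vector):
    """Apply the same affine transformation to every byte element of src."""
    return bytes(affine_byte(x, matrix_rows, vector) for x in src)


# Example operands: the affine step of the AES S-box (from the AES specification),
# where result bit i = b_i ^ b_(i+4) ^ b_(i+5) ^ b_(i+6) ^ b_(i+7) ^ c_i, c = 0x63.
AES_ROWS = [sum(1 << ((i + k) % 8) for k in (0, 4, 5, 6, 7)) for i in range(8)]
AES_VECTOR = 0x63
IDENTITY_ROWS = [1 << i for i in range(8)]
```

For example, `simd_affine(bytes([0x00, 0x01]), AES_ROWS, AES_VECTOR)` yields the bytes 0x63 and 0x7C. Passing the matrix and vector as operands, rather than fixing them in hardware, is what makes the operation general purpose: the same data path can serve field conversions such as those of processing blocks 1413, 1519 and 1522.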

Figure 16B illustrates a diagram of one embodiment of an apparatus 1605 for execution of an affine inverse instruction, which performs an affine transformation and then computes the multiplicative inverse of the result, to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1605 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit(s) 462) for execution of an affine inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1605 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for an affine transformation and inverse in GF(256). In some embodiments, the affine inverse instruction may be implemented by micro-instructions (micro-operations, micro-ops or uops), for example, an affine map 1601 micro-instruction followed by a finite field multiplicative inverse micro-instruction 1604. In alternative embodiments, the affine inverse instruction may be implemented by different micro-instructions, for example, a finite field matrix-vector multiplication micro-instruction followed by a byte broadcast micro-instruction, a finite field vector addition (XOR) micro-instruction, and a finite field multiplicative inverse micro-instruction.

Embodiments of apparatus 1605 may be coupled with SIMD vector registers (e.g., physical register file unit(s) 458) comprising a variable plurality of m variable-sized data fields to store the values of a variable plurality of m variable-sized data elements. Some embodiments of the affine inverse instruction, which provides general purpose GF(256) SIMD affine transformation functionality followed by computation of the multiplicative inverse of the result, specify a source data operand set of elements 1612, a transformation matrix 1610 operand, a translation vector 1614 operand, and optionally a monic irreducible polynomial. Responsive to the decoded affine inverse instruction, one or more execution units (e.g., execution unit(s) 462) perform a SIMD affine transformation by applying the transformation matrix 1610 operand to each element 1612 of the source data operand set (e.g., in a 128-bit block of 16 byte elements) through eight bitwise ANDs 1627-1620 of a GF(256) byte multiplier array in processing block 1602, and by applying the translation vector 1614 operand to each transformed element of the source data operand set through eight 9-input XORs 1637-1630 of a GF(256) bit adder array in processing block 1603. It will be appreciated that this point in the computation may correspond to point 1418 in process 1403. Through a multiplicative inverse unit 1640, a finite field multiplicative inverse element 1648, modulo the irreducible polynomial, may be computed from the affine-transformed result element 1618 for each element 1612 of the source data operand set. The multiplicative inverse result element 1648 of each affine-transformed result element 1618 of the affine inverse instruction is stored in a SIMD destination register (e.g., in physical register file unit(s) 458).

It will be appreciated that some embodiments of the affine inverse instruction may be useful for performing a process such as process 1403. Other embodiments may be useful for performing a process such as process 1402.

Figure 16C illustrates a diagram of an alternative embodiment of an apparatus 1606 for execution of an inverse affine instruction, which computes a multiplicative inverse and then performs an affine transformation on the result, to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1606 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit(s) 462) for execution of an inverse affine instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1606 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for an inverse and affine transformation in GF(256). In some embodiments, the inverse affine instruction may be implemented by micro-instructions (micro-operations, micro-ops or uops), for example, a finite field multiplicative inverse micro-instruction 1604 followed by an affine map 1601 micro-instruction. In alternative embodiments, the inverse affine instruction may be implemented by different micro-instructions, for example, a finite field multiplicative inverse micro-instruction followed by a finite field matrix-vector multiplication micro-instruction and a finite field vector-scalar translation (e.g., broadcast and XOR) micro-instruction.

Embodiments of apparatus 1606 may be coupled with SIMD vector registers (e.g., physical register file unit(s) 458) comprising a variable plurality of m variable-sized data fields to store the values of a variable plurality of m variable-sized data elements. Some embodiments of the inverse affine instruction, which provides general purpose GF(256) SIMD computation of a multiplicative inverse followed by affine transformation functionality, specify a source data operand set of elements 1612, a transformation matrix 1610 operand, a translation vector 1614 operand, and optionally a monic irreducible polynomial. In processing block 1604, responsive to the decoded inverse affine instruction, one or more execution units (e.g., execution unit(s) 462) compute, through a multiplicative inverse unit 1640, a SIMD binary finite field multiplicative inverse element 1616, modulo the irreducible polynomial, for each element 1612 of the source data operand set. The one or more execution units then perform a SIMD affine transformation by applying the transformation matrix 1610 operand to each multiplicative inverse element 1616 of the elements 1612 of the source data operand set (e.g., in a 128-bit block of 16 byte elements) through eight bitwise ANDs 1627-1620 of a GF(256) byte multiplier array in processing block 1602, and by applying the translation vector 1614 operand to each transformed inverse element of the source data operand set through eight 9-input XORs 1637-1630 of a GF(256) bit adder array in processing block 1603. An affine-transformed result element 1638 for each multiplicative inverse element 1616 of the elements 1612 of the source data operand set of the inverse affine instruction is stored in a SIMD destination register (e.g., in physical register file unit(s) 458).
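The two orderings described for Figs. 16B and 16C differ only in whether the affine transformation comes before or after the multiplicative inverse. The sketch below models both as plain functions over byte sequences. It is a behavioral illustration under stated assumptions: the function names are hypothetical, the default irreducible polynomial is the AES polynomial x⁸ + x⁴ + x³ + x + 1 (0x11B), zero is mapped to zero by the inverse, and matrix row i is assumed to produce result bit i with bit 0 least significant.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1; assumed default, could be an operand


def gf256_mul(a, b, poly=AES_POLY):
    """Carry-less multiply of two bytes, reduced modulo the irreducible polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def gf256_inv(x, poly=AES_POLY):
    """Multiplicative inverse as x^254 (maps 0 to 0)."""
    r = 1
    for _ in range(254):
        r = gf256_mul(r, x, poly)
    return r


def affine(x, rows, vector):
    """8x8 bit-matrix multiply followed by XOR with the translation vector."""
    out = 0
    for i, row in enumerate(rows):
        out |= ((bin(row & x).count("1") & 1) ^ ((vector >> i) & 1)) << i
    return out


def affine_then_inverse(src, rows, vector, poly=AES_POLY):
    """Ordering of Fig. 16B: affine transformation, then multiplicative inverse."""
    return bytes(gf256_inv(affine(x, rows, vector), poly) for x in src)


def inverse_then_affine(src, rows, vector, poly=AES_POLY):
    """Ordering of Fig. 16C: multiplicative inverse, then affine transformation."""
    return bytes(affine(gf256_inv(x, poly), rows, vector) for x in src)


# AES S-box affine operands, taken from the AES specification.
AES_ROWS = [sum(1 << ((i + k) % 8) for k in (0, 4, 5, 6, 7)) for i in range(8)]
AES_VECTOR = 0x63
IDENTITY_ROWS = [1 << i for i in range(8)]
```

With the AES operands, `inverse_then_affine` reproduces the forward AES S-box; for instance, the input byte 0x53 maps to 0xED. The affine-first ordering matches the shape of the inverse S-box, where an inverse affine transformation precedes the field inversion.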

Figure 17A illustrates a diagram of one embodiment of an apparatus 1701 for executing a finite field multiplicative inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. In some embodiments, apparatus 1701 may be replicated 16 times, each apparatus 1701 comprising hardware processing blocks for efficiently implementing the multiplicative inverse of the AES S-box on a 128-bit block containing 16 byte values, each byte having a polynomial representation in GF(256). In other embodiments of the finite field multiplicative inverse instruction (or micro-instruction), an element size may also be specified, and/or the number of replications of apparatus 1701 may be selected to implement the finite field multiplicative inverse on a 128-bit block, a 256-bit block, a 512-bit block, etc. Embodiments of apparatus 1701 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing a finite field multiplicative inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1701 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for a multiplicative inverse in GF(256). In apparatus 1701, each byte x is considered to be output from point 1418 in process 1401, so apparatus 1701 begins by accessing a set of source data operand elements containing x. Processing blocks 1711-1717 comprise a polynomial power generating circuit that computes, for each of the 16 byte values, the byte values whose polynomial representations in GF(256) correspond to the powers x², x⁴, x⁸, x¹⁶, x³², x⁶⁴ and x¹²⁸ of the polynomial representation of the respective byte value x. Processing blocks 1718-1720 and 1728-1730 comprise a multiplier circuit that multiplies together, in GF(256), the byte values corresponding to these powers for each of the 16 byte values, producing 16 byte values each having a polynomial representation in GF(256) corresponding to the multiplicative inverse x⁻¹ = x²⁵⁴ of the respective byte value x. The 16 multiplicative inverse byte values are then stored (e.g., in physical register file unit 458) or output to processing block 1421 in process 1401, where an affine transform circuit (e.g., 1601) is optionally used in processing block 1422 to apply an affine transformation depending on whether process 1401 is performing encryption or decryption.
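The power-and-multiply structure described for apparatus 1701 can be sketched in software. The following is a minimal Python illustration, assuming the AES polynomial x⁸ + x⁴ + x³ + x + 1 as the modulus; the helper names `gf256_mul` and `gf256_inv_chain` are hypothetical and are not taken from the patent.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1 (assumed modulus for this sketch)

def gf256_mul(a, b, poly=AES_POLY):
    """Multiply two GF(256) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # degree 8 reached: subtract (XOR) the modulus
            a ^= poly
        b >>= 1
    return result

def gf256_inv_chain(x):
    """Compute x^254 as the product x^2 * x^4 * ... * x^128."""
    powers = []
    p = x
    for _ in range(7):
        p = gf256_mul(p, p)       # successive squarings: x^2, x^4, ..., x^128
        powers.append(p)
    inv = powers[0]
    for p in powers[1:]:
        inv = gf256_mul(inv, p)   # six multiplications in total
    return inv
```

Seven squarings and six multiplications mirror the seven power blocks and six multiplier blocks named in the text; an input of 0 maps to 0.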

Figure 17B illustrates a diagram of an alternative embodiment of an apparatus 1702 for executing a finite field multiplicative inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. In some embodiments, apparatus 1702 may be replicated 16 times, each apparatus 1702 comprising hardware processing blocks for efficiently implementing the multiplicative inverse of the AES S-box on a 128-bit block containing 16 byte values, each byte having a polynomial representation in GF(256). In other embodiments of the finite field multiplicative inverse instruction (or micro-instruction), an element size may also be specified, and/or the number of replications of apparatus 1702 may be selected to implement the finite field multiplicative inverse on a 128-bit block, a 256-bit block, a 512-bit block, etc. Embodiments of apparatus 1702 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing a finite field multiplicative inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1702 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for a multiplicative inverse in GF(256). In apparatus 1702, each byte x is again considered to be output from point 1418 in process 1401, so apparatus 1702 begins by accessing a set of source data operand elements containing x. It will be appreciated that point 1418 in process 1401 may represent the output of an affine transform circuit (e.g., 1601) or of an affine map instruction in processing block 1417. Processing blocks 1721-1727 comprise a polynomial power generating circuit that computes, for each of the 16 byte values, the byte values whose polynomial representations in GF(256) correspond to the powers x⁶, x²⁴, x⁹⁶ and x¹²⁸ of the polynomial representation of the respective byte value x. Processing blocks 1728-1730 comprise a multiplier circuit that multiplies together, in GF(256), the byte values corresponding to these powers for each of the 16 byte values, producing 16 byte values each having a polynomial representation in GF(256) corresponding to the multiplicative inverse x⁻¹ = x²⁵⁴ of the respective byte value x. The 16 multiplicative inverse byte values are then stored (e.g., in physical register file unit 458) or output to processing block 1421 in process 1401, where an affine transform circuit (e.g., 1601) is optionally used in processing block 1422 to apply an affine transformation depending on whether process 1401 is performing encryption or decryption.
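Because 6 + 24 + 96 + 128 = 254, the four powers named for apparatus 1702 multiply to x²⁵⁴. The sketch below checks this in Python, assuming the AES polynomial as modulus. How the hardware derives each power internally is not stated in the text, so the use of a generic `gf256_pow` helper is an assumption, and all function names are hypothetical.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1 (assumed modulus for this sketch)

def gf256_mul(a, b, poly=AES_POLY):
    """Multiply two GF(256) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def gf256_pow(x, e):
    """Square-and-multiply exponentiation in GF(256)."""
    result = 1
    while e:
        if e & 1:
            result = gf256_mul(result, x)
        x = gf256_mul(x, x)
        e >>= 1
    return result

def gf256_inv_four_powers(x):
    """Compute x^254 = x^6 * x^24 * x^96 * x^128 with three multiplications."""
    x6 = gf256_pow(x, 6)
    x24 = gf256_pow(x6, 4)     # (x^6)^4
    x96 = gf256_pow(x24, 4)    # (x^24)^4
    x128 = gf256_pow(x, 128)
    return gf256_mul(gf256_mul(x6, x24), gf256_mul(x96, x128))
```

The three final multiplications correspond in number to the three multiplier blocks 1728-1730 mentioned above.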

Figure 17C illustrates a diagram of another alternative embodiment of an apparatus 1703 for executing a finite field multiplicative inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. In some embodiments, apparatus 1703 may be replicated 16 times, each apparatus 1703 comprising hardware processing blocks for efficiently implementing a finite field multiplicative inverse on a 128-bit block containing 16 byte values, each byte having a polynomial representation in GF(256). In other embodiments of the finite field multiplicative inverse instruction (or micro-instruction), an element size may also be specified, and/or the number of replications of apparatus 1703 may be selected to implement the finite field multiplicative inverse on a 128-bit block, a 256-bit block, a 512-bit block, etc. Embodiments of apparatus 1703 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing a finite field multiplicative inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1703 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for a multiplicative inverse in GF(256).

Embodiments of apparatus 1703 may be coupled with a SIMD vector register (e.g., physical register file unit 458) comprising a variable number m of variable sized data fields to store the values of a variable number m of variable sized data elements. Some embodiments of the finite field multiplicative inverse instruction to provide general purpose GF(256) SIMD multiplicative inverse functionality specify a set of source data operand elements 1710 and a monic irreducible polynomial 1740. In response to the decoded finite field multiplicative inverse instruction, one or more execution units (e.g., execution unit 462) compute, for each element 1710 of the source data operand set, a SIMD binary finite field multiplicative inverse modulo the irreducible polynomial. Some embodiments of apparatus 1703 perform the finite field multiplicative inverse operation in the composite field GF((2⁴)²). In processing block 1734, each element 1710 of the source data operand set is mapped to the composite field GF((2⁴)²); processing block 1734 outputs the 4-bit field elements z_H 1735 and z_L 1736. For one embodiment, the inverse field element z_L⁻¹ 1746 is computed as follows: (1) field elements z_H 1735 and z_L 1736 are added in the composite field (bitwise XOR 1737); (2) in processing block 1739, the output of bitwise XOR 1737 is multiplied modulo an irreducible polynomial p. In one embodiment, the polynomial is p = z⁴ + z³ + 1, but in alternative embodiments other irreducible polynomials of degree 4 may be used. The computation of the inverse field element z_L⁻¹ 1746 continues: (3) in processing block 1738, field element z_H 1735 is squared and multiplied by the hexadecimal value 8, modulo p, and the result is added in the composite field to the output of processing block 1739 (bitwise XOR 1741); (4) in processing block 1742, the inverse of the output of bitwise XOR 1741 is computed; and (5) in processing block 1744, it is multiplied by field element z_L 1736 modulo p to produce the inverse field element z_L⁻¹ 1746. For one embodiment, the inverse field element z_H⁻¹ 1745 is computed as follows: steps (1) through (4) are as described above; and (5) in processing block 1743, the output of processing block 1742 is multiplied by field element z_H 1735 modulo p to produce the inverse field element z_H⁻¹ 1745. Then, in processing block 1747, each pair of 4-bit field elements z_H⁻¹ 1745 and z_L⁻¹ 1746 is mapped back from the composite field GF((2⁴)²) to produce a multiplicative inverse result element 1750 in GF(256). The multiplicative inverse result element 1750 for each element 1710 of the source data operand set of the finite field multiplicative inverse instruction is finally stored in a SIMD destination register (e.g., in physical register file unit 458).
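Composite-field inversion of this kind can be illustrated with a small Python sketch. It works directly on pairs (z_H, z_L) and omits the mapping to and from GF(256), which the text does not specify. It assumes that GF(2⁴) is built with p = z⁴ + z³ + 1 and that the quadratic extension uses y² = y + λ with λ = 8; the extension polynomial and all function names are assumptions made for illustration. The low half of the result is multiplied by z_H ⊕ z_L because that is what the standard identity for this assumed extension requires; it is not a transcription of the figure's wiring.

```python
P4 = 0x19      # z^4 + z^3 + 1, modulus of GF(2^4) as named in the text
LAMBDA = 0x8   # constant of the assumed extension polynomial y^2 + y + LAMBDA

def gf16_mul(a, b):
    """Multiply two GF(2^4) elements modulo P4."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= P4
        b >>= 1
    return r

def gf16_inv(a):
    """Inverse in GF(2^4) as a^14 (0 maps to 0)."""
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r

def composite_mul(h1, l1, h2, l2):
    """(h1*y + l1) * (h2*y + l2) using y^2 = y + LAMBDA."""
    hh = gf16_mul(h1, h2)
    high = hh ^ gf16_mul(h1, l2) ^ gf16_mul(l1, h2)
    low = gf16_mul(hh, LAMBDA) ^ gf16_mul(l1, l2)
    return high, low

def composite_inv(h, l):
    """Invert h*y + l in the assumed composite field."""
    s = h ^ l                                  # sum of the two halves
    t = gf16_mul(s, l)                         # (h + l) * l
    d = gf16_mul(gf16_mul(h, h), LAMBDA) ^ t   # LAMBDA*h^2 + (h + l)*l
    d_inv = gf16_inv(d)                        # one GF(2^4) inversion
    return gf16_mul(d_inv, h), gf16_mul(d_inv, s)
```

The appeal of this structure is that a single 4-bit inversion plus a few 4-bit multiplications replaces an 8-bit inversion.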

Figure 18A illustrates a diagram of one embodiment of an apparatus 1801 for executing a specific modular reduction instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. In the example presently illustrated, the specific modulus polynomial 1811B is p = x⁸ + x⁴ + x³ + x + 1 in GF(256). In some embodiments, apparatus 1801 may be replicated 16 times, each apparatus 1801 comprising hardware processing blocks for efficiently implementing a specific modular reduction of two 128-bit blocks (or one 256-bit block) containing 16 two-byte values to produce a 128-bit block containing 16 byte values, each of the resulting 16 byte values having a polynomial representation in GF(256). Embodiments of apparatus 1801 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing a specific modular reduction instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1801 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for a specific modular reduction in GF(256).

Embodiments of apparatus 1801 may be coupled with a SIMD vector register (e.g., physical register file unit 458) comprising a variable number m of variable sized data fields to store the values of a variable number m of variable sized data elements. Some embodiments of the specific modular reduction instruction to provide general purpose GF(256) SIMD modular reduction functionality specify a set of source data operand elements 1810 and a monic irreducible polynomial 1811B. In response to the decoded modular reduction instruction, one or more execution units (e.g., execution unit 462) compute, for each element 1810 of the source data operand set, a SIMD binary finite field reduction modulo the irreducible polynomial. An element 1810 of the source data operand set, having a two-byte value, is input to processing block 1821 as q_H 1828 and q_L 1820. In processing block 1821, some embodiments of apparatus 1801 perform a 12-bit operation that is equivalent to:

T \leftarrow q_{L} \oplus (q_{H} \ll 4) \oplus (q_{H} \ll 3) \oplus (q_{H} \ll 1) \oplus q_{H}.

The resulting element T of processing block 1825, having a partially reduced 12-bit value, is input to processing block 1831 as T_H 1838 and T_L 1830. In processing block 1831, some embodiments of apparatus 1801 perform, in processing block 1835, an 8-bit operation that is equivalent to:

q \bmod p \leftarrow T_{L} \oplus (T_{H} \ll 4) \oplus (T_{H} \ll 3) \oplus (T_{H} \ll 1) \oplus T_{H}.

It will be appreciated that zero (0) inputs to the XOR operations can be eliminated, further reducing the logic complexity of apparatus 1801. The specific modular reduction result element 1850 for each element 1810 of the source data operand set of the specific modular reduction instruction is stored in a SIMD destination register (e.g., in physical register file unit 458).
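The two folding steps work because x⁸ ≡ x⁴ + x³ + x + 1 modulo p, so each set bit of the high part can be replaced by shifted copies of itself in the low part. A minimal Python sketch follows, assuming a 15-bit input such as a carry-less product of two bytes; the function names are hypothetical, and `reduce_reference` is a plain bit-by-bit reduction included only to check the result.

```python
def reduce_two_step(q):
    """Reduce a 15-bit polynomial q modulo x^8 + x^4 + x^3 + x + 1."""
    q_h, q_l = q >> 8, q & 0xFF
    # First fold: partially reduced value T (fits in 12 bits).
    t = q_l ^ (q_h << 4) ^ (q_h << 3) ^ (q_h << 1) ^ q_h
    t_h, t_l = t >> 8, t & 0xFF
    # Second fold: the remaining high bits are few enough to finish in 8 bits.
    return t_l ^ (t_h << 4) ^ (t_h << 3) ^ (t_h << 1) ^ t_h

def reduce_reference(q, poly=0x11B):
    """Bit-by-bit polynomial reduction, used only as a cross-check."""
    for i in range(14, 7, -1):
        if (q >> i) & 1:
            q ^= poly << (i - 8)
    return q
```

After the first fold at most three bits remain above bit 7, which is why a second, narrower fold is sufficient.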

Figure 18B illustrates a diagram of an alternative embodiment of an apparatus 1802 for executing a specific modular reduction instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. In the example presently illustrated, the specific modulus polynomial 1811B is again p = x⁸ + x⁴ + x³ + x + 1 in GF(256). It will be appreciated that similar techniques may also be applied to implement specific modular reduction instructions (or micro-instructions) for other modulus polynomials, for example f₅ = x⁸ + x⁷ + x⁶ + x⁵ + x⁴ + x² + 1 in GF(256), which is used in the SMS4 block cipher of the Chinese National Standard for Wireless LAN, WAPI (Wired Authentication and Privacy Infrastructure). In some embodiments, apparatus 1802 may be replicated 16 times, each apparatus 1802 comprising hardware processing blocks for efficiently implementing a specific modular reduction of two 128-bit blocks (or one 256-bit block) containing 16 two-byte values to produce a 128-bit block containing 16 byte values, each of the resulting 16 byte values having a polynomial representation in GF(256). Embodiments of apparatus 1802 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing a specific modular reduction instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1802 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for a specific modular reduction in GF(256).

Embodiments of apparatus 1802 may be coupled with a SIMD vector register (e.g., physical register file unit 458) comprising a variable number m of variable sized data fields to store the values of a variable number m of variable sized data elements. Some embodiments of the specific modular reduction instruction to provide general purpose GF(256) SIMD modular reduction functionality specify a set of source data operand elements 1810 and a monic irreducible polynomial 1811B. In response to the decoded modular reduction instruction, one or more execution units (e.g., execution unit 462) compute, for each element 1810 of the source data operand set, a SIMD binary finite field reduction modulo the irreducible polynomial. An element 1810 of the source data operand set, having a two-byte value, is input to processing block 1861 as q[15:8] 1828 and q[7:0] 1820. In processing block 1861, some embodiments of apparatus 1802 perform logical operations in XOR logic gates 1867-1860 that are equivalent to:

q_{0} \bmod p = q_{0} \oplus q_{8} \oplus q_{12} \oplus q_{13},

q_{1} \bmod p = q_{1} \oplus q_{8} \oplus q_{9} \oplus q_{12} \oplus q_{14},

q_{2} \bmod p = q_{2} \oplus q_{9} \oplus q_{10} \oplus q_{13},

q_{3} \bmod p = q_{3} \oplus q_{8} \oplus q_{10} \oplus q_{11} \oplus q_{12} \oplus q_{13} \oplus q_{14},

q_{4} \bmod p = q_{4} \oplus q_{8} \oplus q_{9} \oplus q_{11} \oplus q_{14},

q_{5} \bmod p = q_{5} \oplus q_{9} \oplus q_{10} \oplus q_{12},

q_{6} \bmod p = q_{6} \oplus q_{10} \oplus q_{11} \oplus q_{13},

q_{7} \bmod p = q_{7} \oplus q_{11} \oplus q_{12} \oplus q_{14}.

The specific modular reduction result element (q mod p) 1850 for each element 1810 of the source data operand set of the specific modular reduction instruction is stored in a SIMD destination register (e.g., in physical register file unit 458).
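Per-bit XOR networks of this kind follow mechanically from the modulus: output bit i is fed by q_i and by every high-order bit q_k (k = 8..14) for which x^k mod p has a term x^i. The Python sketch below derives those taps from the polynomial rather than hard-coding them, and then applies them bit by bit. The function names are hypothetical, and `reduce_reference` is included only as a cross-check.

```python
POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1

def xor_taps(poly=POLY):
    """For each output bit i, list the high-order input bits q_k that feed it."""
    taps = {i: [] for i in range(8)}
    for k in range(8, 15):
        r = 1 << k
        for j in range(k, 7, -1):        # reduce x^k modulo poly
            if (r >> j) & 1:
                r ^= poly << (j - 8)
        for i in range(8):
            if (r >> i) & 1:
                taps[i].append(k)
    return taps

def reduce_bitwise(q, taps):
    """Evaluate each output bit as an XOR of its input bits."""
    out = 0
    for i in range(8):
        bit = (q >> i) & 1
        for k in taps[i]:
            bit ^= (q >> k) & 1
        out |= bit << i
    return out

def reduce_reference(q, poly=POLY):
    """Bit-by-bit polynomial reduction, used only as a cross-check."""
    for j in range(14, 7, -1):
        if (q >> j) & 1:
            q ^= poly << (j - 8)
    return q
```

Calling `xor_taps(0x1F5)` would produce the corresponding network for the SMS4 polynomial mentioned earlier, which is one way to see that the same technique carries over to other moduli.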

Figure 18C illustrates a diagram of another alternative embodiment of an apparatus 1803 for executing a specific AES Galois Counter Mode (GCM) modular reduction instruction to provide GF(2¹²⁸) SIMD cryptographic arithmetic functionality. In the example presently illustrated, the specific modulus polynomial 1887 is p = x¹²⁸ + x⁷ + x² + x + 1 in GF(2¹²⁸). Embodiments of apparatus 1803 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing a specific modular reduction instruction to provide GF(2¹²⁸) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1803 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for a specific modular reduction in GF(2¹²⁸).

Embodiments of apparatus 1803 may be coupled with a SIMD vector register (e.g., physical register file unit 458) comprising a variable number m of variable sized data fields to store the values of a variable number m of variable sized data elements. Some embodiments of the specific instruction to provide AES GCM modular reduction functionality in GF(2¹²⁸) specify a set of source data operand elements 1813 and a monic irreducible polynomial 1887. In response to the decoded finite field modular reduction instruction, one or more execution units (e.g., execution unit 462) compute, for each element 1813 of the source data operand set, a SIMD finite field reduction modulo the irreducible polynomial.

An element 1813 of the source data operand set, having a 32-byte value, is input to processing block 1871. In processing block 1871, some embodiments of apparatus 1803 perform a non-bit-reflected operation with respect to a non-bit-reflected reduction polynomial, which is equivalent to a bit-reflected modular reduction of a bit-reflected product, as follows:

(i) [X_{3}, X_{2}, X_{1}, X_{0}] = q[255:0] \ll 1;

(ii) A = X_{0} \ll 63; B = X_{0} \ll 62; C = X_{0} \ll 57;

(iii) D = X_{1} \oplus A \oplus B \oplus C;

(iv) [E_{1}, E_{0}] = [D, X_{0}] \gg 1; [F_{1}, F_{0}] = [D, X_{0}] \gg 2; [G_{1}, G_{0}] = [D, X_{0}] \gg 7;

(v) q[127:64] = X_{3} \oplus D \oplus E_{1} \oplus F_{1} \oplus G_{1} \pmod{p};

(vi) q[63:0] = X_{2} \oplus X_{0} \oplus E_{0} \oplus F_{0} \oplus G_{0} \pmod{p}.

Accordingly, equation (i) is implemented by shifter 1870, which produces [X₃, X₂, X₁, X₀] 1872 from element 1813. Equation (ii) is implemented by shifters 1873-1875. Equation (iii) is implemented by processing block 1876. Equation (iv) is implemented by shifters 1877-1879. Equation (v) is implemented by processing block 1885, and equation (vi) is implemented by processing block 1880. The specific modular reduction result element (q mod p) 1853 for each element 1813 of the source data operand set of the specific modular reduction instruction is stored in a SIMD destination register (e.g., in physical register file unit 458).
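Equations (i) through (vi) can be exercised in software. The Python sketch below treats X₀ through X₃ as 64-bit words (an assumption drawn from the shift amounts and the 32-byte operand), truncates the shifts in (ii) to 64 bits, and treats [D, X₀] as a 128-bit value. All names are hypothetical. The reference path reflects the operands, multiplies and reduces in the ordinary bit order, and reflects back; it is included only to check the sketch.

```python
MASK64 = (1 << 64) - 1

def gcm_reduce_reflected(q):
    """Apply equations (i)-(vi) to a 256-bit bit-reflected carry-less product q."""
    x = (q << 1) & ((1 << 256) - 1)                              # (i)
    x0, x1, x2, x3 = ((x >> (64 * i)) & MASK64 for i in range(4))
    a = (x0 << 63) & MASK64                                      # (ii)
    b = (x0 << 62) & MASK64
    c = (x0 << 57) & MASK64
    d = x1 ^ a ^ b ^ c                                           # (iii)
    dx = (d << 64) | x0                                          # 128-bit [D, X0]
    e, f, g = dx >> 1, dx >> 2, dx >> 7                          # (iv)
    high = x3 ^ d ^ (e >> 64) ^ (f >> 64) ^ (g >> 64)            # (v)
    low = x2 ^ x0 ^ (e & MASK64) ^ (f & MASK64) ^ (g & MASK64)   # (vi)
    return (high << 64) | low

def clmul(a, b):
    """Carry-less (polynomial) multiplication of two integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reflect128(v):
    """Reverse the bit order of a 128-bit value."""
    return int(format(v, "0128b")[::-1], 2)

def reduce_plain(c, poly=(1 << 128) | 0x87):
    """Ordinary reduction modulo x^128 + x^7 + x^2 + x + 1 (cross-check only)."""
    for i in range(c.bit_length() - 1, 127, -1):
        if (c >> i) & 1:
            c ^= poly << (i - 128)
    return c

def check(a, b):
    """Compare the reflected path with the plain path for field elements a, b."""
    got = gcm_reduce_reflected(clmul(reflect128(a), reflect128(b)))
    return got == reflect128(reduce_plain(clmul(a, b)))
```

The left shift by one in (i) compensates for the fact that the product of two reflected 128-bit operands has only 255 bits, so it sits one position too low in a 256-bit register.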

Figure 18D illustrates a diagram of one embodiment of an apparatus 1804 for executing a modular reduction instruction to provide general purpose binary finite field GF(2^t) SIMD cryptographic arithmetic functionality. In the example presently illustrated, a specific modulus polynomial p_s may be selected from among the specific modulus polynomials for which modular reduction is provided by the instruction (or micro-instruction), e.g., p₀, p₁, ... pₙ. In some embodiments with t = 8, apparatus 1804 may be replicated 16 times, each apparatus 1804 comprising hardware processing blocks for efficiently implementing a specific modular reduction of two 128-bit blocks (or one 256-bit block) containing 16 two-byte values to produce a 128-bit block containing 16 byte values, each of the resulting 16 byte values having a polynomial representation in GF(256) or, alternatively, in some composite field (e.g., GF((2⁴)²) or GF((2²)⁴), etc.). In other embodiments of the modular reduction instruction (or micro-instruction), the size t may also be specified, and/or the number of replications of apparatus 1804 may be selected to produce a 128-bit block, a 256-bit block, a 512-bit block, etc. Embodiments of apparatus 1804 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing a modular reduction instruction to provide general purpose binary finite field GF(2^t) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1804 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for a modular reduction in the binary finite field GF(2^t) or, alternatively, in some composite field (e.g., GF((2^u)^v), where t = u·v).
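A selectable-modulus reduction can be sketched as a table lookup followed by a generic polynomial reduction. In the Python illustration below, the table `POLYS`, the index `s`, and the function names are all hypothetical; the two entries are the AES and SMS4 polynomials mentioned in the text, written with their leading x⁸ term included. How an actual instruction encodes the selection is not specified here.

```python
# Hypothetical table of supported modulus polynomials p_0, p_1 (full 9-bit form).
POLYS = [
    0x11B,  # x^8 + x^4 + x^3 + x + 1
    0x1F5,  # x^8 + x^7 + x^6 + x^5 + x^4 + x^2 + 1
]

def mod_reduce(q, poly):
    """Reduce polynomial q modulo poly over GF(2); t is the degree of poly."""
    t = poly.bit_length() - 1
    for i in range(q.bit_length() - 1, t - 1, -1):
        if (q >> i) & 1:
            q ^= poly << (i - t)
    return q

def simd_mod_reduce(elements, s):
    """Reduce every element of a vector modulo the selected polynomial p_s."""
    poly = POLYS[s]
    return [mod_reduce(q, poly) for q in elements]
```

For example, `simd_mod_reduce([0x100, 0x3FFF], 1)` reduces both two-byte values modulo the second table entry. Because `mod_reduce` takes the degree from the polynomial itself, the same helper also covers other values of t.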

Figure 19A illustrates a diagram of one embodiment of an apparatus 1901 for executing a binary finite field multiplication instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. In some embodiments, apparatus 1901 may be replicated 16 times, each apparatus 1901 comprising hardware processing blocks for efficiently implementing a binary finite field multiplication of two 128-bit blocks, each containing 16 byte values, each byte having a polynomial representation in GF(256). In other embodiments of the binary finite field multiplication instruction (or micro-instruction), an element size may also be specified, and/or the number of replications of apparatus 1901 may be selected to implement the binary finite field multiplication on 128-bit blocks, 256-bit blocks, 512-bit blocks, etc. Embodiments of apparatus 1901 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing a binary finite field multiplication instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1901 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for a finite field multiplication in GF(256).

Embodiments of apparatus 1901 may be coupled with a SIMD vector register (e.g., physical register file unit 458) comprising a variable number m of variable sized data fields to store the values of a variable number m of variable sized data elements. Some embodiments of the binary finite field multiplication instruction to provide general purpose GF(256) SIMD binary finite field multiplication functionality specify two sets of source data operand elements, 1910 and 1912, and a monic irreducible polynomial. In processing block 1902, in response to the decoded binary finite field multiplication instruction, one or more execution units (e.g., execution unit 462) compute, for each pair of elements 1910 and 1912 of the source data operand sets, a SIMD carry-less 8 × 8 multiplication to produce a 15-bit product element 1915, and a reduced product 1918 modulo the selected (e.g., via selector 1916) irreducible polynomial, via modular reduction unit 1917. The resulting reduced product 1918 of each binary finite field multiplication of the element pairs 1910 and 1912 of the source data operand sets is stored in a SIMD destination register (e.g., in physical register file unit 458).
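The multiply-then-reduce split described here can be illustrated with a short Python sketch: a carry-less 8 × 8 multiplication yields a product of at most 15 bits, which is then reduced modulo the selected polynomial. The function names are hypothetical, and passing the modulus as a full 9-bit value is an assumption of this sketch, not the instruction's encoding.

```python
def clmul8(a, b):
    """Carry-less 8 x 8 multiplication; the result has at most 15 bits."""
    r = 0
    for i in range(8):
        if (b >> i) & 1:
            r ^= a << i
    return r

def reduce15(q, poly):
    """Reduce a 15-bit product modulo a degree-8 polynomial."""
    for i in range(14, 7, -1):
        if (q >> i) & 1:
            q ^= poly << (i - 8)
    return q

def simd_gf256_mul(src1, src2, poly=0x11B):
    """Multiply corresponding bytes of two vectors in GF(256)."""
    return bytes(reduce15(clmul8(a, b), poly) for a, b in zip(src1, src2))
```

For instance, with the AES polynomial, multiplying the byte pairs (0x57, 0x83) and (0x53, 0xCA) gives 0xC1 and 0x01 respectively.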

Figure 19B illustrates a diagram of an alternative embodiment of an apparatus 1903 for executing a binary finite field multiplication instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. In some embodiments, apparatus 1903 may be replicated 2 times, each apparatus 1903 comprising hardware processing blocks for efficiently implementing a binary finite field multiplication of two 128-bit blocks, each containing 16 byte values, each byte having a polynomial representation in GF(256). In other embodiments of the binary finite field multiplication instruction (or micro-instruction), an element size may also be specified, and/or the number of replications of apparatus 1903 may be selected to implement the binary finite field multiplication on 128-bit blocks, 256-bit blocks, 512-bit blocks, etc. Embodiments of apparatus 1903 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing a binary finite field multiplication instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1903 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for a finite field multiplication in GF(256).

Embodiments of apparatus 1903 may be coupled with a SIMD vector register (e.g., physical register file unit 458) comprising a variable number m of variable sized data fields to store the values of a variable number m of variable sized data elements. Some embodiments of the binary finite field multiplication instruction to provide general purpose GF(256) SIMD binary finite field multiplication functionality specify two source data operand sets (e.g., 1920 and 1922) and a monic irreducible polynomial p. In the processing blocks 1902 of array 1925, in response to the decoded binary finite field multiplication instruction, one or more execution units (e.g., execution unit 462) compute, for each pair of elements of the source data operand sets 1920 and 1922, a SIMD carry-less 8 × 8 multiplication to produce a product element 1915, and a reduced product 1918 modulo the selected (e.g., via selector 1916) irreducible polynomial, via modular reduction unit 1917. The resulting set of reduced products 1928 of the SIMD binary finite field multiplication of source data operand sets 1920 and 1922 is stored in a SIMD destination register (e.g., in physical register file unit 458).

Figure 20A illustrates a flow diagram of one embodiment of a process 2001 for executing an affine map instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. Process 2001 and the other processes disclosed herein are performed by processing blocks that may comprise dedicated hardware, or software or firmware operation codes executable by general purpose machines, by special purpose machines, or by a combination of both.

In processing block 2011, a processor affine map instruction for a SIMD affine transformation in a finite field is decoded. In processing block 2016, decoding of the affine map instruction may optionally generate multiple micro-instructions, e.g., a first micro-instruction for a finite field matrix-vector multiplication 1602 and a second micro-instruction for a finite field vector addition (or XOR) 1603. In processing block 2021, a set of source data operand elements is accessed. In processing block 2031, a transformation matrix operand is accessed. In processing block 2041, a translation vector operand is accessed. In processing block 2051, the transformation matrix operand is applied to each element of the source data operand set. In processing block 2061, the translation vector operand is applied to each transformed element of the source data operand set. In processing block 2081, a determination is made whether processing of every element of the source data operand set has finished. If not, the SIMD affine transformation repeats the iteration beginning at processing block 2051. Otherwise, in processing block 2091, the result of the SIMD affine transformation is stored in a SIMD destination register.
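The per-element affine step (matrix applied first, translation vector XORed afterwards) can be sketched in Python over GF(2). The representation chosen here, eight row bitmasks where row i selects the input bits that feed output bit i, is an assumption of this sketch and not the operand format of the instruction; the function names are hypothetical. The example constants are the well-known AES S-box affine matrix and vector 0x63.

```python
def affine_byte(x, matrix_rows, vector):
    """Compute M*x XOR v over GF(2) for one byte."""
    y = 0
    for i, row in enumerate(matrix_rows):
        parity = bin(row & x).count("1") & 1   # dot product of row i with x
        y |= parity << i
    return y ^ vector

def simd_affine(src, matrix_rows, vector):
    """Apply the same affine map to every byte of the source vector."""
    return bytes(affine_byte(x, matrix_rows, vector) for x in src)

# AES forward affine map: output bit i is the XOR of input bits
# i, i+4, i+5, i+6, i+7 (indices mod 8), then XOR with 0x63.
AES_ROWS = [sum(1 << ((i + k) % 8) for k in (0, 4, 5, 6, 7)) for i in range(8)]
AES_VEC = 0x63
```

A call such as `simd_affine(data, AES_ROWS, AES_VEC)` processes every byte independently, which is what makes the per-element loop in the flow diagram a natural fit for parallel execution.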

Figure 20B illustrates a flow diagram of one embodiment of a process 2002 for executing a finite field multiplicative inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. In processing block 2012, a processor multiplicative inverse instruction for a SIMD multiplicative inverse in a finite field is decoded. In processing block 2016, decoding of the multiplicative inverse instruction may optionally generate multiple micro-instructions, e.g., a first micro-instruction for the multiplicative inverse and a second micro-instruction for a modular reduction such as one of 1801-1804. In processing block 2022, a set of source data operand elements is accessed. In processing block 2032, an irreducible polynomial is optionally identified explicitly. In one embodiment, for example, the irreducible polynomial may be specified in an immediate operand of the instruction as the hexadecimal control value 1B to indicate the polynomial x⁸ + x⁴ + x³ + x + 1 in the Galois field GF(256). In another embodiment, for example, the irreducible polynomial may be specified in an immediate operand of the instruction as the hexadecimal control value FA to indicate the polynomial x⁸ + x⁷ + x⁶ + x⁵ + x⁴ + x² + 1 in the Galois field GF(256), or alternatively to indicate another polynomial. In another alternative embodiment, the irreducible polynomial may be specified and/or explicitly identified in the instruction mnemonic. In processing block 2042, a binary finite field multiplicative inverse is computed for each element of the source data operand set, and in processing block 2052, the inverse of each element of the source data operand set is optionally reduced modulo the irreducible polynomial. In processing block 2082, a determination is made whether processing of every element of the source data operand set has finished. If not, the SIMD finite field multiplicative inverse repeats the iteration beginning at processing block 2042. Otherwise, in processing block 2092, the result of the SIMD finite field multiplicative inverse is stored in a SIMD destination register.

Figure 20C illustrates a flow diagram of one embodiment of a process 2003 for executing an affine inverse instruction to provide general purpose GF(256) SIMD cryptographic arithmetic functionality. In processing block 2013, a processor affine inverse instruction for a SIMD affine transformation and inverse in a finite field is decoded. In processing block 2016, decoding of the affine inverse instruction may optionally generate multiple micro-instructions, e.g., a first micro-instruction for a finite field affine map 1601 and a second micro-instruction for a finite field multiplicative inverse 1604; or alternatively, a first micro-instruction for a finite field matrix-vector multiplication 1601, followed by a second micro-instruction for a byte broadcast, a third micro-instruction for a finite field vector addition (XOR) 1602, and a fourth micro-instruction for a finite field multiplicative inverse 1604. In processing block 2023, a set of source data operand elements is accessed. In processing block 2033, a transformation matrix operand is accessed. In processing block 2043, a translation vector operand is accessed. In processing block 2053, the transformation matrix operand is applied to each element of the source data operand set. In processing block 2063, the translation vector operand is applied to each transformed element of the source data operand set. In processing block 2073, a binary finite field multiplicative inverse is computed for each transformed element of the source data operand set. In processing block 2083, a determination is made whether processing of every element of the source data operand set has finished. If not, the SIMD affine transformation and inverse repeats the iteration beginning at processing block 2053. Otherwise, in processing block 2093, the result of the SIMD affine transformation and multiplicative inverse is stored in a SIMD destination register.
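An affine map followed by a multiplicative inverse is the shape of the AES inverse S-box, which makes a convenient check for this ordering. The Python sketch below is a minimal illustration under that assumption: it uses the AES polynomial, the well-known inverse affine map of AES (output bit i is the XOR of input bits i+2, i+5, i+7, then XOR with 0x05), and hypothetical function names; it is not the instruction's operand format.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1 (assumed modulus for this sketch)

def gf256_mul(a, b, poly=AES_POLY):
    """Multiply two GF(256) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def gf256_inv(x):
    """Multiplicative inverse as x^254 (0 maps to 0)."""
    result, base, e = 1, x, 254
    while e:
        if e & 1:
            result = gf256_mul(result, base)
        base = gf256_mul(base, base)
        e >>= 1
    return result

def affine_byte(x, matrix_rows, vector):
    """Compute M*x XOR v over GF(2); row i selects the inputs of output bit i."""
    y = 0
    for i, row in enumerate(matrix_rows):
        y |= (bin(row & x).count("1") & 1) << i
    return y ^ vector

def simd_affine_then_inverse(src, matrix_rows, vector):
    """Per byte: apply the affine map, then invert the transformed element."""
    return bytes(gf256_inv(affine_byte(x, matrix_rows, vector)) for x in src)

# Inverse affine map of the AES S-box.
INV_ROWS = [sum(1 << ((i + k) % 8) for k in (2, 5, 7)) for i in range(8)]
INV_VEC = 0x05
```

With these constants, `simd_affine_then_inverse` maps S-box outputs back to their inputs; swapping the order of the two steps and using the forward affine constants would give the forward S-box instead.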

Figure 20D illustrates a flow diagram of one embodiment of a process 2004 for executing a binary finite-field multiplication instruction to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality. In processing block 2014, a processor multiplication instruction for a SIMD multiplication in a finite field is decoded. In processing block 2016, decoding of the affine inverse instruction may optionally generate multiple microinstructions, for example, a first microinstruction for a finite-field carry-less multiplication 1913 and a second microinstruction for a finite-field modulus reduction 1917, such as one of 1801-1804. In processing block 2024, a first set of source data operand elements is accessed. In processing block 2034, a second set of source data operand elements is accessed. In processing block 2044, an irreducible polynomial is optionally explicitly identified. In one embodiment, for example, the irreducible polynomial may be specified as the hexadecimal control value 1B in an immediate operand of the instruction to indicate the polynomial x⁸ + x⁴ + x³ + x + 1 in the Galois field GF(256). In another embodiment, for example, the irreducible polynomial may be specified as the hexadecimal control value F5 in an immediate operand of the instruction to indicate the polynomial x⁸ + x⁷ + x⁶ + x⁵ + x⁴ + x² + 1 in the Galois field GF(256). In another alternative embodiment, the irreducible polynomial may be specified and/or explicitly identified in the instruction mnemonic. In processing block 2054, the product of each corresponding pair of elements of the first and second source data operand sets is computed, and in processing block 2064, the product of each corresponding pair of elements of the first and second source data operand sets is optionally reduced modulo the irreducible polynomial. In processing block 2084, a determination is made whether processing of every corresponding pair of elements of the first and second source data operand sets has finished. If not, the SIMD finite-field multiplication iterates again, starting at processing block 2054. Otherwise, in processing block 2094, the result of the SIMD finite-field multiplication is stored in a SIMD destination register.
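The two-step structure described above, a carry-less product followed by a modulus reduction selected by an 8-bit control value, can be sketched as follows. This is an illustrative model under stated assumptions rather than the disclosed circuit: the function names are hypothetical, and the control value is assumed to hold the low eight coefficients of the degree-8 polynomial, so that 1B stands for x⁸ + x⁴ + x³ + x + 1 and F5 stands for x⁸ + x⁷ + x⁶ + x⁵ + x⁴ + x² + 1.

```python
def clmul8(a: int, b: int) -> int:
    """Carry-less (XOR) product of two bytes; the result has at most 15 bits."""
    product = 0
    for i in range(8):
        if (b >> i) & 1:
            product ^= a << i
    return product


def reduce_mod(product: int, imm8: int) -> int:
    """Reduce a 15-bit carry-less product modulo x^8 + (polynomial encoded by imm8)."""
    modulus = 0x100 | imm8  # restore the implicit x^8 term
    for bit in range(14, 7, -1):  # clear bits 14 down to 8, highest first
        if (product >> bit) & 1:
            product ^= modulus << (bit - 8)
    return product


def simd_gf256_mul(src1: bytes, src2: bytes, imm8: int = 0x1B) -> bytes:
    """Multiply corresponding byte pairs, then reduce each product."""
    if len(src1) != len(src2):
        raise ValueError("source operands must hold the same number of elements")
    return bytes(reduce_mod(clmul8(a, b), imm8) for a, b in zip(src1, src2))
```

As a check against a published value, `simd_gf256_mul(bytes([0x57]), bytes([0x83]))` gives the byte 0xC1, the worked example in FIPS 197 for the polynomial encoded by 1B. Multiplying 0x80 by 0x02 produces x⁸, so the result is simply the control value itself: 0x1B or 0xF5, depending on the polynomial selected.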

It will be appreciated that although the processing blocks used to execute the instructions that provide general-purpose SIMD cryptographic arithmetic functionality are illustrated above as iterative, one or more instances of the various processing blocks may, and preferably do, execute simultaneously and/or in parallel whenever possible in order to increase execution performance and throughput.

It will be appreciated that general-purpose GF(256) SIMD cryptographic arithmetic instructions may be used to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality in applications such as cryptographic protocols and Internet communications, in order to ensure data integrity, identity verification, message content authentication, and message origin authentication for financial transactions, electronic commerce, electronic mail, software distribution, data storage, and so on.

Therefore, it will be appreciated that providing execution of at least the following instructions: (1) a SIMD affine transformation, which specifies a source data operand, a transformation matrix operand, and a translation vector, wherein the transformation matrix is applied to each data element of the source data operand and the translation vector is applied to each transformed element; (2) a SIMD binary finite-field multiplicative inverse, which computes, for each element of the source data operand, an inverse in a binary finite field modulo an irreducible polynomial; (3) a SIMD affine transformation and multiplicative inverse (or multiplicative inverse and affine transformation), which specifies a source data operand, a transformation matrix operand, and a translation vector, wherein, either before or after the multiplicative inverse operation, the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each transformed element; (4) a modulus reduction, which computes a reduction modulo a particular modulus polynomial ps selected from a plurality of polynomials in a binary finite field for which modulus reduction is provided by the instruction (or microinstruction); and (5) a SIMD binary finite-field multiplication, which specifies first and second source data operands and multiplies each corresponding pair of elements of the first and second source data operands modulo an irreducible polynomial; wherein the results of these instructions are stored in SIMD destination registers; may provide general-purpose GF(256) and/or other alternative binary finite-field SIMD cryptographic arithmetic functionality in hardware and/or microcode sequences, so as to support significant performance improvements for several important performance-critical applications without excessive or undue functional units requiring additional circuitry, area, or power.

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code may be applied to input instructions to perform the functions described herein and to generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system that has a processor, such as a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or an interpreted language.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium that represents various logic within the processor, which, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores", may be stored on a tangible machine-readable medium and supplied to various customers or manufacturing facilities to be loaded into the fabrication machines that actually make the logic or processor.

Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as: hard disks; any other type of disk, including floppy disks, optical disks, compact disc read-only memories (CD-ROMs), compact disc rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, and electrically erasable programmable read-only memories (EEPROMs); phase-change memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.

Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines the structures, circuits, apparatuses, processors, and/or system features described herein. Such embodiments may also be referred to as program products.

In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on the processor, off the processor, or partly on and partly off the processor.

Thus, techniques for performing one or more instructions according to at least one embodiment are disclosed. While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of, and not restrictive on, the broad invention, and that the invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art upon studying this disclosure. In an area of technology such as this, where growth is fast and further advancements are not easily foreseen, the disclosed embodiments may be readily modified in arrangement and detail, as facilitated by enabling technological advancements, without departing from the principles of the present disclosure or the scope of the accompanying claims.

Claims

1. A processor, comprising:

a decode stage to decode a first instruction for a single-instruction multiple-data (SIMD) binary finite-field multiplicative inverse, the first instruction specifying a source data operand set and a monic irreducible polynomial; and

one or more execution units, responsive to the decoded first instruction, to:

compute, for each element of the source data operand set, a SIMD binary finite-field multiplicative inverse modulo the irreducible polynomial; and

store a result of the first instruction in a SIMD destination register.

2. The processor of claim 1, wherein the first instruction specifies the SIMD destination register as a destination operand.

3. The processor of claim 1, wherein the first instruction specifies a SIMD register to hold 16 byte elements as the source data operand set.

4. The processor of claim 1, wherein the first instruction specifies a SIMD register to hold 32 byte elements as the source data operand set.

5. The processor of claim 1, wherein the first instruction specifies a SIMD register to hold 64 byte elements as the source data operand set.

6. The processor of any one of claims 1-5, wherein the computation of the SIMD binary finite-field multiplicative inverse is performed by raising each element of the source data operand set to the 254th power in the Galois field GF(2⁸), modulo the irreducible polynomial.

7. The processor of any one of claims 1-5, wherein the irreducible polynomial is specified in the first instruction mnemonic as 1B to indicate x⁸ + x⁴ + x³ + x + 1 in the Galois field GF(2⁸).

8. The processor of any one of claims 1-5, wherein the irreducible polynomial is specified in an immediate operand of the first instruction as the hexadecimal control value 1B to indicate x⁸ + x⁴ + x³ + x + 1 in the Galois field GF(2⁸).

9. The processor of any one of claims 1-5, wherein the irreducible polynomial is specified in an immediate operand of the first instruction as the hexadecimal control value F5 to indicate x⁸ + x⁷ + x⁶ + x⁵ + x⁴ + x² + 1 in the Galois field GF(2⁸).

10. An apparatus for performing a single-instruction multiple-data (SIMD) binary finite-field multiplicative inverse operation, the apparatus comprising:

means for accessing a set of source data operand elements and a monic irreducible polynomial;

means for computing, for each element of the source data operand set, a SIMD binary finite-field multiplicative inverse modulo the irreducible polynomial; and

means for storing a result of the SIMD binary finite-field multiplicative inverse modulo the irreducible polynomial in a SIMD destination register.

11. The apparatus of claim 10, wherein the monic irreducible polynomial is specified in an immediate operand of a first instruction as the hexadecimal control value 1B to indicate x⁸ + x⁴ + x³ + x + 1 in the Galois field GF(2⁸).

12. The apparatus of claim 10, wherein the monic irreducible polynomial is specified in the first instruction mnemonic as 1B to indicate x⁸ + x⁴ + x³ + x + 1 in the Galois field GF(2⁸).

13. The apparatus of claim 10, wherein the monic irreducible polynomial is specified in an immediate operand of a first instruction as the hexadecimal control value 87 to indicate x¹²⁸ + x⁷ + x² + x + 1 in the Galois field GF(2¹²⁸).

14. The apparatus of claim 10, wherein the monic irreducible polynomial is specified in an immediate operand of a first instruction as the hexadecimal control value F5 to indicate x⁸ + x⁷ + x⁶ + x⁵ + x⁴ + x² + 1 in the Galois field GF(2⁸).

15. A method, comprising:

decoding a first instruction for a single-instruction multiple-data (SIMD) binary finite-field multiplicative inverse, the first instruction specifying a source data operand set and a monic irreducible polynomial;

responsive to the decoded first instruction, computing, for each element of the source data operand set, a SIMD binary finite-field multiplicative inverse modulo the irreducible polynomial; and

16. The method of claim 15, wherein the monic irreducible polynomial is specified in an immediate operand of the first instruction as the hexadecimal control value 1B to indicate x⁸ + x⁴ + x³ + x + 1 in the Galois field GF(2⁸).

17. The method of claim 15, wherein the monic irreducible polynomial is specified in the first instruction mnemonic as 1B to indicate x⁸ + x⁴ + x³ + x + 1 in the Galois field GF(2⁸).

18. The method of claim 15, wherein the monic irreducible polynomial is specified in an immediate operand of the first instruction as the hexadecimal control value 87 to indicate x¹²⁸ + x⁷ + x² + x + 1 in the Galois field GF(2¹²⁸).

19. The method of claim 15, wherein the monic irreducible polynomial is specified in an immediate operand of the first instruction as the hexadecimal control value F5 to indicate x⁸ + x⁷ + x⁶ + x⁵ + x⁴ + x² + 1 in the Galois field GF(2⁸).

20. A processing system, comprising:

a memory to store a first instruction, the first instruction being for a SIMD secure hash algorithm round slice; and

a processor, comprising:

an instruction fetch stage to fetch the first instruction;

one or more execution units, responsive to the decoded first instruction, to:

compute, for each element of the source data operand set, a SIMD binary finite-field multiplicative inverse modulo the irreducible polynomial; and

21. The processing system of claim 20, wherein the irreducible polynomial is specified in the first instruction mnemonic as 1B to indicate x⁸ + x⁴ + x³ + x + 1 in the Galois field GF(2⁸).

22. The processing system of claim 20, wherein the irreducible polynomial is specified in an immediate operand of the first instruction as the hexadecimal control value 1B to indicate x⁸ + x⁴ + x³ + x + 1 in the Galois field GF(2⁸).

23. The processing system of any one of claims 20-22, wherein the first instruction is further for a SIMD affine transformation of each binary finite-field multiplicative inverse, and the one or more execution units are further to, responsive to the decoded first instruction:

perform the SIMD affine transformation by applying a transformation matrix operand to the multiplicative inverse of each element of the source data operand set, and applying a translation vector operand to each transformed multiplicative inverse of the elements of the source data operand set, to generate the result of the first instruction.

24. The processing system of claim 20, wherein the irreducible polynomial is specified in an immediate operand of the first instruction as the hexadecimal control value 87 to indicate x¹²⁸ + x⁷ + x² + x + 1 in the Galois field GF(2¹²⁸).

25. The processing system of claim 20, wherein the irreducible polynomial is specified in an immediate operand of the first instruction as the hexadecimal control value F5 to indicate x⁸ + x⁷ + x⁶ + x⁵ + x⁴ + x² + 1 in the Galois field GF(2⁸).