CN104012031B

CN104012031B - Instructions to perform JH cryptographic hashing

Info

Publication number: CN104012031B
Application number: CN201180075719.6A
Authority: CN
Inventors: K. S. Yap; G. M. Wolrich; V. Gopal; J. D. Guilford; E. Ozturk; S. M. Gulley; W. K. Feghali; M. G. Dixon
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2011-12-22
Filing date: 2011-12-22
Publication date: 2017-07-21
Anticipated expiration: 2031-12-22
Also published as: CN104012031A; TW201338492A; US20140053000A1; US9251374B2; TWI517654B; WO2013095484A1

Abstract

A method is described. The method includes executing one or more JH_SBOX_L instructions to perform S-Box mappings and a linear (L) transformation on a JH state and, once the S-Box mappings and the L transformation have been performed, executing one or more JH_Permute instructions to perform a permutation function on the JH state.

Description

Instructions to perform JH cryptographic hashing

Technical field

This disclosure relates to cryptographic algorithms, and in particular to the JH hashing algorithm.

Background technology

Cryptology is a tool that relies on an algorithm and a key to protect information. The algorithm is a complex mathematical algorithm and the key is a string of bits. There are two basic types of cryptology systems: secret key systems and public key systems. A secret key system, also referred to as a symmetric system, has a single key ("secret key") that is shared by two or more parties. The single key is used both to encrypt and to decrypt information.

The JH hash function (JH) is a cryptographic function submitted to the National Institute of Standards and Technology (NIST) hash function competition, which was held to develop a new SHA-3 function to replace the older SHA-1 and SHA-2. JH is based on an algorithm that includes four variants (JH-224, JH-256, JH-384 and JH-512), which produce digests of different sizes. However, every variant of JH implements the same compression function.

Currently, JH can be executed on general-purpose processors using instructions in the Streaming SIMD Extensions (SSE) or Advanced Vector Extensions (AVX). Nonetheless, such implementations must execute up to 30 instructions to perform the JH algorithm.

Brief description of the drawings

A better understanding of the invention can be obtained from the following detailed description in conjunction with the accompanying drawings, in which:

Fig. 1 is a block diagram illustrating one embodiment of a system;

Fig. 2 is a block diagram illustrating one embodiment of a processor;

Fig. 3 is a block diagram illustrating one embodiment of packed data registers;

Fig. 4 illustrates one embodiment of a resulting nibble permutation;

Fig. 5 is a flow chart illustrating one embodiment of a process performed by an instruction;

Fig. 6 is a flow chart illustrating one embodiment of a process performed by an instruction;

Fig. 7 illustrates one embodiment of two rounds of JH using the instructions;

Fig. 8 is a block diagram of a register architecture according to one embodiment of the invention;

Fig. 9A is a block diagram of a single CPU core, along with its connection to the on-die interconnect network and its local subset of the level 2 (L2) cache, according to embodiments of the invention;

Fig. 9B is an exploded view of part of the CPU core according to embodiments of the invention;

Figure 10 is a block diagram illustrating an exemplary out-of-order architecture according to embodiments of the invention;

Figure 11 is a block diagram of a system in accordance with one embodiment of the invention;

Figure 12 is a block diagram of a second system in accordance with an embodiment of the invention;

Figure 13 is a block diagram of a third system in accordance with an embodiment of the invention;

Figure 14 is a block diagram of a system on a chip (SoC) in accordance with an embodiment of the invention;

Figure 15 is a block diagram of a single-core processor and a multicore processor with an integrated memory controller and graphics according to embodiments of the invention; and

Figure 16 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set into binary instructions in a target instruction set according to embodiments of the invention.

Detailed description

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent to one skilled in the art, however, that the invention may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form to avoid obscuring the underlying principles of the invention.

Reference in this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in this specification are not necessarily all referring to the same embodiment.

A mechanism including instructions to process the JH hashing algorithm is described. According to one embodiment, the JH hashing algorithm is implemented via instructions in the AVX instruction set. The AVX instruction set is an extension to the x86 instruction set architecture (ISA) that extends the register file from 128 bits.

Fig. 1 is a block diagram of one embodiment of a system 100 that includes an AVX instruction set extension for performing JH encryption and decryption in a general-purpose processor.

System 100 includes a processor 101, a memory controller hub (MCH) 102 and an input/output (I/O) controller hub (ICH) 104. The MCH 102 includes a memory controller 106 that controls communication between the processor 101 and memory 108. The processor 101 and the MCH 102 communicate over a system bus 116.

The processor 101 may be any one of a plurality of processors, such as a single-core processor, a single-core Intel Celeron processor, an XScale processor, or a multi-core processor such as a Pentium D, an i3, i5, i7, 2 Duo or Quad processor, or any other type of processor.

The memory 108 may be dynamic random access memory (DRAM), static random access memory (SRAM), synchronous dynamic random access memory (SDRAM), double data rate 2 (DDR2) RAM, Rambus dynamic random access memory (RDRAM), or any other type of memory.

The ICH 104 may be coupled to the MCH 102 using a high-speed chip-to-chip interconnect 114 such as Direct Media Interface (DMI). DMI supports concurrent transfer rates of 2 gigabits/second via two unidirectional lanes.

The ICH 104 may include a storage I/O controller 110 for controlling communication with at least one storage device 112 coupled to the ICH 104. The storage device may be, for example, a disk drive, a digital versatile disc (DVD) drive, a compact disc (CD) drive, a redundant array of independent disks (RAID), a tape drive or another storage device. The ICH 104 may communicate with the storage device 112 over a storage protocol interconnect 118 using a serial storage protocol such as Serial Attached Small Computer System Interface (SAS) or Serial Advanced Technology Attachment (SATA).

In one embodiment, the processor 101 includes a JH function 103 for performing JH encryption and decryption operations. The JH function 103 may be used to encrypt or decrypt information stored in the memory 108 and/or stored in the storage device 112.

Fig. 2 is a block diagram illustrating one embodiment of the processor 101. The processor 101 includes a fetch and decode unit 202 for decoding processor instructions received from a level 1 (L1) instruction cache 202. Data to be used for executing the instructions may be stored in a register file 208. In one embodiment, the register file 208 includes a plurality of registers that are used by an AVX instruction to store data for use by the AVX instruction.

Fig. 3 is a block diagram of an example embodiment of a suitable set of packed data registers in the register file 208. The illustrated packed data registers include thirty-two 512-bit packed data or vector registers. These thirty-two 512-bit registers are labeled ZMM0 through ZMM31. In the illustrated embodiment, the lower-order 256 bits of the lower sixteen of these registers (namely ZMM0-ZMM15) are aliased or overlaid on respective 256-bit packed data or vector registers (labeled YMM0-YMM15), although this is not required.

Likewise, in the illustrated embodiment, the lower-order 128 bits of YMM0-YMM15 are aliased or overlaid on respective 128-bit packed data or vector registers (labeled XMM0-XMM15), although this also is not required. The 512-bit registers ZMM0 through ZMM31 are operable to hold 512-bit packed data, 256-bit packed data or 128-bit packed data.

The 256-bit registers YMM0-YMM15 are operable to hold 256-bit packed data or 128-bit packed data. The 128-bit registers XMM0-XMM15 are operable to hold 128-bit packed data. Each register may be used to store either packed floating-point data or packed integer data. Different data element sizes are supported, including at least 8-bit byte data, 16-bit word data, 32-bit doubleword or single-precision floating-point data, and 64-bit quadword or double-precision floating-point data. Alternative embodiments of packed data registers may include different numbers of registers or different sizes of registers, and may or may not alias larger registers on smaller registers.
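By way of illustration only, the aliasing of the narrower registers on the low-order bits of a 512-bit register may be modeled as follows. This Python sketch is not part of the described embodiments: the class name `VectorRegister` and the method `write_xmm` are hypothetical, and the choice to leave the upper bits unchanged on a 128-bit write is an assumption of the model rather than a statement about any particular processor.

```python
class VectorRegister:
    """Toy model of one 512-bit register with aliased narrower views."""

    def __init__(self):
        self.zmm = 0  # full 512-bit value, held as a Python int

    @property
    def ymm(self):
        # The 256-bit view is simply the low-order 256 bits.
        return self.zmm & ((1 << 256) - 1)

    @property
    def xmm(self):
        # The 128-bit view is the low-order 128 bits.
        return self.zmm & ((1 << 128) - 1)

    def write_xmm(self, value):
        """Replace the low 128 bits (assumption: upper bits are preserved)."""
        mask = (1 << 128) - 1
        self.zmm = (self.zmm & ~mask) | (value & mask)
```

Because all three views share one underlying value, a write through the narrow view is immediately visible through the wider ones, which is what "aliased or overlaid" means here.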

Referring back to Fig. 2, the fetch and decode unit 202 fetches macroinstructions from the L1 instruction cache 202, decodes the macroinstructions and breaks them into simple operations called micro-operations (μops). The execution unit 210 schedules and executes the micro-operations. In the embodiment shown, the JH function 103 in the execution unit 210 includes micro-operations for the AVX instruction. The retirement unit 212 writes the results of the executed instructions to registers or memory.

The JH function 103 performs a compression function that includes three functions run for 42 rounds. The first function is an S-Box function, which applies one of two transforms (S₀ and S₁) to convert adjacent 4-bit nibbles. Table 1 shows one embodiment of the S-Box transforms S₀(x) and S₁(x).

Table 1

x	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
S₀(x)	9	0	4	11	13	12	3	15	1	10	2	6	7	5	8	14
S₁(x)	3	12	6	13	5	7	1	9	15	2	0	4	11	10	14	8
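By way of illustration only, the S-Box step may be modeled in software as a table lookup. In this Python sketch the two tables are copied from Table 1; the function name `sbox_layer` and the use of one selector bit per nibble to choose between S₀ and S₁ are assumptions of the sketch.

```python
# S-Box tables copied from Table 1.
S0 = [9, 0, 4, 11, 13, 12, 3, 15, 1, 10, 2, 6, 7, 5, 8, 14]
S1 = [3, 12, 6, 13, 5, 7, 1, 9, 15, 2, 0, 4, 11, 10, 14, 8]
SBOXES = (S0, S1)


def sbox_layer(nibbles, select_bits):
    """Apply S0 or S1 to each 4-bit nibble.

    select_bits[i] == 0 picks S0 and 1 picks S1 for nibble i. Driving the
    choice from one selector bit per nibble is an assumption of this sketch.
    """
    if len(nibbles) != len(select_bits):
        raise ValueError("need one selector bit per nibble")
    return [SBOXES[bit][nib & 0xF] for nib, bit in zip(nibbles, select_bits)]
```

For example, `sbox_layer([0, 1, 15, 15], [0, 1, 0, 1])` returns `[9, 12, 14, 8]`, which can be checked entry by entry against Table 1.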

The second function is a linear transformation (L), which implements a (4, 2, 3) maximum distance separable (MDS) code over GF(2⁴), where multiplication in GF(2⁴) is defined as multiplication of binary polynomials modulo the irreducible polynomial X⁴ + X + 1. The linear transformation is performed on adjacent 8-bit bytes (that is, on two adjacent S-Box outputs). Let A, B, C and D denote 4-bit words; L then transforms (A, B) into (C, D), that is, (C, D) = L(A, B) = (5•A + 2•B, 2•A + B). The function (C, D) = L(A, B) is therefore computed as:

D0=B0 ⊕ A1；D1=B1 ⊕ A2；

D2=B2 ⊕ A3 ⊕ A0；D3=B3 ⊕ A0；

C0=A0 ⊕ D1；C1=A1 ⊕ D2；

C2=A2 ⊕ D3 ⊕ D0；C3=A3 ⊕ D0.
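By way of illustration only, the following Python sketch computes L in two ways, once from the field arithmetic and once from the per-bit XOR equations above, so that the two can be compared. The function names are hypothetical, and the bit numbering (A0 taken as the most significant bit of the nibble) is an assumption of the sketch.

```python
def gf16_mul2(a):
    """Multiply a 4-bit value by X in GF(2^4) modulo X^4 + X + 1."""
    a <<= 1
    if a & 0x10:          # a degree-4 term appeared, so reduce it
        a ^= 0x13         # 0x13 encodes X^4 + X + 1
    return a & 0xF


def l_transform(a, b):
    """Return (C, D) = (5*A + 2*B, 2*A + B) over GF(2^4)."""
    d = gf16_mul2(a) ^ b
    c = a ^ gf16_mul2(d)  # 5A + 2B = A + 2*(2A + B)
    return c, d


def l_transform_bits(a, b):
    """Same transform written as the per-bit XOR equations (index 0 = MSB)."""
    A = [(a >> (3 - i)) & 1 for i in range(4)]
    B = [(b >> (3 - i)) & 1 for i in range(4)]
    D = [B[0] ^ A[1], B[1] ^ A[2], B[2] ^ A[3] ^ A[0], B[3] ^ A[0]]
    C = [A[0] ^ D[1], A[1] ^ D[2], A[2] ^ D[3] ^ D[0], A[3] ^ D[0]]

    def pack(bits):
        return (bits[0] << 3) | (bits[1] << 2) | (bits[2] << 1) | bits[3]

    return pack(C), pack(D)
```

Under these assumptions the two functions agree on all 256 input pairs; for instance, `l_transform(1, 0)` gives `(5, 2)`, matching the coefficients 5 and 2 applied to A.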

The third function is a permutation function (P_d). P_d is a simple permutation on 2^d elements, built from π_d (which swaps alternate nibbles), P′_d (which swaps nibbles between the lower and upper halves of the state) and φ_d (which swaps nibbles in the upper half of the state). Fig. 4 illustrates one embodiment of the resulting nibble permutation P_d(π_d, P′_d, φ_d) for d=4 in a 64-bit data path, where d is the dimension of a block. In one embodiment, the JH function uses d=8 for a data width of 256 4-bit nibbles (or 1024 bits).
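By way of illustration only, the composition of the three sub-permutations may be sketched in Python as follows. The exact index mappings used here follow one reading of the published JH specification and are an assumption of the sketch, since the text above describes each sub-permutation only in words; the function names are hypothetical.

```python
def pi_d(a):
    """Swap the last two elements in every group of four."""
    b = list(a)
    for i in range(0, len(a), 4):
        b[i + 2], b[i + 3] = a[i + 3], a[i + 2]
    return b


def p_prime_d(a):
    """Gather even-indexed elements into the lower half, odd into the upper."""
    return list(a[0::2]) + list(a[1::2])


def phi_d(a):
    """Swap adjacent element pairs in the upper half only."""
    half = len(a) // 2
    b = list(a)
    for i in range(half, len(a), 2):
        b[i], b[i + 1] = a[i + 1], a[i]
    return b


def permute(a):
    """P_d = phi_d(P'_d(pi_d(a))) on a list of 2**d nibbles (d >= 2)."""
    n = len(a)
    if n < 4 or n & (n - 1):
        raise ValueError("length must be a power of two, at least 4")
    return phi_d(p_prime_d(pi_d(a)))
```

With d=4, `permute(list(range(16)))` returns `[0, 3, 4, 7, 8, 11, 12, 15, 2, 1, 6, 5, 10, 9, 14, 13]`, where each output entry names the input position it was taken from.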

In conventional systems, JH operates in a "bit-sliced" manner rather than on nibbles within bytes. Bit slicing separates the bits of the nibbles into separate words, which allows all S-Box nibbles to be evaluated in parallel via SSE/AVX instructions. Further, bit slicing combined with alternating odd and even SBOX registers enables evaluation of the SBOX and L transforms. In bit-sliced implementations, it is not necessary to perform the full permutation in each round. Instead, the bits of the appropriate odd S-Box are arranged so that they are operated on together with the appropriate even S-Box in the next round. This is accomplished with 7 swapping permutations that are repeated 6 times over the 42 JH rounds.

Although the bit-slicing method allows all SBOX calculations and L transforms to be performed in parallel, 20 instructions are needed to perform the 23 logic functions of the SBOX logic, and 10 instructions are needed for the 10 exclusive-OR (XOR) functions that make up the L transform (using two-operand XORs). Such performance can be improved.

According to one embodiment, two new instructions and data paths may be defined, which operate on 4-bit nibbles and nibble pairs to perform the SBOX and L transform functions using the 512-bit ZMM registers in the register file 208. In such an embodiment, the 1024-bit state is stored in two ZMM registers, with nibbles 0-127 held in the first ZMM register and nibbles 128-255 held in the second ZMM register.
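By way of illustration only, the split of the 1024-bit state across two 512-bit registers may be modeled in Python with two integers. The placement of nibble i at bits 4i through 4i+3 of its register is an assumption of this sketch, as are the function names; the text above specifies only which nibbles go to which register.

```python
def pack_state(nibbles):
    """Split 256 nibbles into two 512-bit integers (low half, high half).

    Assumption: nibble i of a half sits at bits 4*i .. 4*i+3 of its register.
    """
    if len(nibbles) != 256:
        raise ValueError("the JH state holds 256 nibbles")

    def pack_half(half):
        value = 0
        for i, nib in enumerate(half):
            value |= (nib & 0xF) << (4 * i)
        return value

    return pack_half(nibbles[:128]), pack_half(nibbles[128:])


def unpack_state(zmm_lo, zmm_hi):
    """Inverse of pack_state: recover the 256 nibbles."""

    def unpack_half(value):
        return [(value >> (4 * i)) & 0xF for i in range(128)]

    return unpack_half(zmm_lo) + unpack_half(zmm_hi)
```

Each half holds 128 nibbles of 4 bits, that is, 512 bits, so the two integers together carry the full 1024-bit state.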

The new instruction and data path JH_SBOX_L is defined as JH_SBOX_L ZMM, ZMMmask. Fig. 5 is a flow chart illustrating one embodiment of the process performed by the JH_SBOX_L instruction. As described above, the 1024 state bits are organized contiguously from 0 to 1023 (as denoted in the JH specification) in two ZMM registers.

At processing block 510, a 512-bit section representing one half of the state bits is retrieved from a ZMM register. At processing block 520, the S-Box and L transforms are performed on the retrieved state bits. In one embodiment, the S-Box function is performed using mask information from the ZMM mask. In one embodiment, the ZMM mask represents the constants from the JH specification (A.2, round constants in the bit-slice implementation of E8). Using the ZMM mask, 256 bits may be derived by performing an even-odd bit interleaving for each round.

Once the S-Box operation is complete, the L transform operation is performed on each 8-bit nibble pair. At processing block 530, the 512-bit result of the transforms is stored in the destination register. The JH_SBOX_L instruction is executed twice (once for the lower 512 bits and then for the upper 512 bits) to complete the S-Box and L transforms of a round for the full JH state.

The JH_Permute instruction and data path is implemented to perform the permutation step P_d on each of the ZMM registers holding the results of the S-Box and L transforms. In one embodiment, the JH_Permute instruction is defined as JH_Permute ZMM1, ZMM2, imm8, where ZMM1 stores the lower 128 pre-permutation nibbles (e.g., 512 bits), ZMM2 stores the upper 128 pre-permutation nibbles, and imm8 = 0/1 specifies the lower or upper nibbles.

Fig. 6 is a flow chart illustrating one embodiment of the process performed by the JH_PD instruction. At processing block 550, a pre-permutation half section of the JH state is retrieved from the ZMM register indicated by imm8. At processing block 560, the permutation is performed on the retrieved bits. At processing block 570, the result of the permutation is stored. The JH_Permute instruction is executed twice to complete the permutation of a round. Fig. 7 illustrates two of the 42 JH rounds using the above instructions.

The above JH instructions are implemented with a three-cycle pipelined data path. A JH round is therefore completed in 8 cycles (e.g., two executions each of the JH_SBOX_L and JH_Permute instructions). This results in a 2-3 times performance improvement over the bit-slicing method.
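By way of illustration only, one round built from two JH_SBOX_L executions followed by two JH_Permute executions may be modeled in Python as follows. This is a software sketch under stated assumptions, not the disclosed data path: the function names are hypothetical, one mask bit per nibble is assumed to select S₀ or S₁, the permutation index mappings follow one reading of the JH specification, and imm8 is assumed to select which half of the permuted state is produced.

```python
S = ([9, 0, 4, 11, 13, 12, 3, 15, 1, 10, 2, 6, 7, 5, 8, 14],
     [3, 12, 6, 13, 5, 7, 1, 9, 15, 2, 0, 4, 11, 10, 14, 8])


def mul2(a):
    """Multiply by X in GF(2^4) modulo X^4 + X + 1."""
    a <<= 1
    return (a ^ 0x13) & 0xF if a & 0x10 else a


def jh_sbox_l(half, mask_bits):
    """Model of one JH_SBOX_L execution on one half of the state."""
    v = [S[m][x] for x, m in zip(half, mask_bits)]
    out = []
    for a, b in zip(v[0::2], v[1::2]):      # L works on adjacent nibble pairs
        d = mul2(a) ^ b
        out += [a ^ mul2(d), d]
    return out


def jh_permute(lo, hi, imm8):
    """Model of one JH_Permute execution: return the selected output half."""
    a = lo + hi                             # new list, inputs are not modified
    for i in range(0, 256, 4):              # pi_d
        a[i + 2], a[i + 3] = a[i + 3], a[i + 2]
    a = a[0::2] + a[1::2]                   # P'_d
    for i in range(128, 256, 2):            # phi_d
        a[i], a[i + 1] = a[i + 1], a[i]
    return a[128:] if imm8 else a[:128]


def jh_round(lo, hi, mask_lo, mask_hi):
    """One round as four modeled instruction executions."""
    lo, hi = jh_sbox_l(lo, mask_lo), jh_sbox_l(hi, mask_hi)
    return jh_permute(lo, hi, 0), jh_permute(lo, hi, 1)
```

The model makes visible why JH_Permute takes both registers as inputs: every output half draws nibbles from both halves of the pre-permutation state, whereas JH_SBOX_L only ever needs one half at a time.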

Exemplary register architecture - Fig. 8

Fig. 8 is a block diagram illustrating a register architecture 800 according to one embodiment of the invention. The register files and registers of the register architecture are listed below:

Vector register file 810 - in the embodiment illustrated, there are 32 vector registers that are 512 bits wide; these registers are referenced as zmm0 through zmm31. The lower-order 256 bits of the lower 16 zmm registers are overlaid on registers ymm0-15. The lower-order 128 bits of the lower 16 zmm registers (the lower-order 128 bits of the ymm registers) are overlaid on registers xmm0-15.

Write mask registers 815 - in the embodiment illustrated, there are 8 write mask registers (k0 through k7), each 64 bits in size. In one embodiment of the invention, the vector mask register k0 cannot be used as a write mask; when the encoding that would normally indicate k0 is used for a write mask, it selects a hardwired write mask of 0xFFFF, effectively disabling write masking for that instruction.

Multimedia Extensions Control Status Register (MXCSR) 820 - in the embodiment illustrated, this 32-bit register provides status and control bits used in floating-point operations.

General-purpose registers 825 - in the embodiment illustrated, there are sixteen 64-bit general-purpose registers that are used along with the existing x86 addressing modes to address memory operands. These registers are referenced by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.

Extended flags (EFLAGS) register 830 - in the embodiment illustrated, this 32-bit register is used to record the results of many instructions.

Floating-point control word (FCW) register 835 and floating-point status word (FSW) register 840 - in the embodiment illustrated, these registers are used by x87 instruction set extensions to set rounding modes, exception masks and flags in the case of the FCW, and to keep track of exceptions in the case of the FSW.

Scalar floating-point stack register file (x87 stack) 845, on which is aliased the MMX packed integer flat register file 850 - in the embodiment illustrated, the x87 stack is an eight-element stack used to perform scalar floating-point operations on 32/64/80-bit floating-point data using the x87 instruction set extension, while the MMX registers are used to perform operations on 64-bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.

Segment registers 855 - in the embodiment illustrated, there are six 16-bit registers used to store data used for segmented address generation.

RIP register 865 - in the embodiment illustrated, this 64-bit register stores the instruction pointer.

Alternative embodiments of the invention may use wider or narrower registers. Additionally, alternative embodiments of the invention may use more, fewer, or different register files and registers.
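By way of illustration only, the write mask behavior described above for k0 may be sketched in Python as follows. The function names are hypothetical, and two points are assumptions of the sketch: the vector is taken to have 16 elements (one per bit of the hardwired 0xFFFF mask), and masked-off elements are assumed to keep their previous destination value.

```python
HARDWIRED_MASK = 0xFFFF  # value selected when the k0 encoding is used


def effective_write_mask(k_regs, k_index):
    """Return the mask an instruction would apply for a given k encoding."""
    return HARDWIRED_MASK if k_index == 0 else k_regs[k_index]


def masked_write(dest, result, mask):
    """Take result[i] where mask bit i is 1; keep dest[i] where it is 0."""
    return [r if (mask >> i) & 1 else d
            for i, (d, r) in enumerate(zip(dest, result))]
```

With the k0 encoding every one of the 16 mask bits is set, so `masked_write` copies the whole result, which is the sense in which write masking is effectively disabled.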

Exemplary in-order processor architecture - Fig. 9A-9B

Fig. 9A-B illustrate a block diagram of an exemplary in-order processor architecture. These exemplary embodiments are designed around multiple instantiations of an in-order CPU core that is augmented with a wide vector processor (VPU). Depending on the application, the cores communicate through a high-bandwidth interconnect network with some fixed-function logic, memory I/O interfaces, and other necessary I/O logic. For example, an implementation of this embodiment as a stand-alone GPU would typically include a PCIe bus.

Fig. 9A is a block diagram of a single CPU core, along with its connection to the on-die interconnect network 902 and its local subset of the level 2 (L2) cache 904, according to embodiments of the invention. An instruction decoder 900 supports the x86 instruction set with an extension. While in one embodiment of the invention (to simplify the design) a scalar unit 908 and a vector unit 910 use separate register sets (respectively, scalar registers 912 and vector registers 914), and data transferred between them is written to memory and then read back in from a level 1 (L1) cache 906, alternative embodiments may use a different approach (e.g., use a single register set, or include a communication path that allows data to be transferred between the two register files without being written and read back).

The L1 cache 906 allows low-latency accesses to cache memory from the scalar and vector units. Together with load-op instructions in the vector friendly instruction format, this means that the L1 cache 906 can be treated somewhat like an extended register file. This significantly improves the performance of many algorithms.

The local subset of the L2 cache 904 is part of a global L2 cache that is divided into separate local subsets, one per CPU core. Each CPU has a direct access path to its own local subset of the L2 cache 904. Data read by a CPU core is stored in its L2 cache subset 904 and can be accessed quickly, in parallel with other CPU cores accessing their own local L2 cache subsets. Data written by a CPU core is stored in its own L2 cache subset 904 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data.

Fig. 9B is an exploded view of part of the CPU core in Fig. 9A according to embodiments of the invention. Fig. 9B includes an L1 data cache 906A, which is part of the L1 cache 906, as well as more detail regarding the vector unit 910 and the vector registers 914. Specifically, the vector unit 910 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 928), which executes integer, single-precision floating-point and double-precision floating-point instructions. The VPU supports swizzling the register inputs with a swizzle unit 920, numeric conversion with numeric convert units 922A-B, and replication of the memory input with a replicate unit 924. Write mask registers 926 allow predicating the resulting vector writes.

Register data can be swizzled in a variety of ways, e.g., to support matrix multiplication. Data from memory can be replicated across the VPU lanes. This is a common operation in both graphics and non-graphics parallel data processing, and it significantly increases cache efficiency.

The ring network is bidirectional to allow agents such as CPU cores, L2 caches and other logic blocks to communicate with each other within the chip. Each ring data path is 512 bits wide per direction.

Exemplary out-of-order architecture - Figure 10

Figure 10 is a block diagram illustrating an exemplary out-of-order architecture according to embodiments of the invention. Specifically, Figure 10 illustrates a well-known exemplary out-of-order architecture that has been modified to incorporate the vector friendly instruction format and its execution. In Figure 10, arrows denote a coupling between two or more units, and the direction of an arrow indicates the direction of data flow between those units. Figure 10 includes a front end unit 1005 coupled to an execution engine unit 1010 and a memory unit 1015; the execution engine unit 1010 is further coupled to the memory unit 1015.

The front end unit 1005 includes a level 1 (L1) branch prediction unit 1020 coupled to a level 2 (L2) branch prediction unit 1022. The L1 and L2 branch prediction units 1020 and 1022 are coupled to an L1 instruction cache unit 1024. The L1 instruction cache unit 1024 is coupled to an instruction translation lookaside buffer (TLB) 1026, which is further coupled to an instruction fetch and predecode unit 1028. The instruction fetch and predecode unit 1028 is coupled to an instruction queue unit 1030, which is further coupled to a decode unit 1032. The decode unit 1032 comprises a complex decoder unit 1034 and three simple decoder units 1036, 1038 and 1040, and includes a microcode ROM unit 1042. The decode unit 1032 may operate as previously described in the decode stage section. The L1 instruction cache unit 1024 is further coupled to an L2 cache unit 1048 in the memory unit 1015. The instruction TLB unit 1026 is further coupled to a second-level TLB unit 1046 in the memory unit 1015. The decode unit 1032, the microcode ROM unit 1042 and a loop stream detector (LSD) unit 1044 are each coupled to a rename/allocator unit 1056 in the execution engine unit 1010.

The execution engine unit 1010 includes the rename/allocator unit 1056, which is coupled to a retirement unit 1074 and a unified scheduler unit 1058. The retirement unit 1074 is further coupled to execution units 1060 and includes a reorder buffer unit 1078. The unified scheduler unit 1058 is further coupled to a physical register files unit 1076, which is coupled to the execution units 1060. The physical register files unit 1076 comprises a vector registers unit 1077A, a write mask registers unit 1077B and a scalar registers unit 1077C; these register units may provide the vector registers 810, the write mask registers 815 and the general-purpose registers 825; and the physical register files unit 1076 may include additional register files not shown (e.g., the scalar floating-point stack register file 845 aliased on the MMX packed integer flat register file 850). The execution units 1060 include three mixed scalar and vector units 1062, 1064 and 1072; a load unit 1066; a store address unit 1068; and a store data unit 1070. The load unit 1066, the store address unit 1068 and the store data unit 1070 are each further coupled to a data TLB unit 1052 in the memory unit 1015.

The memory unit 1015 includes the second-level TLB unit 1046, which is coupled to the data TLB unit 1052. The data TLB unit 1052 is coupled to an L1 data cache unit 1054. The L1 data cache unit 1054 is further coupled to the L2 cache unit 1048. In some embodiments, the L2 cache unit 1048 is further coupled to L3 and higher-level cache units 1050 inside and/or outside of the memory unit 1015.

By way of example, the exemplary out-of-order architecture may implement the process pipeline 8200 as follows: 1) the instruction fetch and predecode unit 1028 performs the fetch and length decoding stages; 2) the decode unit 1032 performs the decode stage; 3) the rename/allocator unit 1056 performs the allocation stage and the renaming stage; 4) the unified scheduler 1058 performs the schedule stage; 5) the physical register files unit 1076, the reorder buffer unit 1078 and the memory unit 1015 perform the register read/memory read stage, and the execution units 1060 perform the execute/data transform stage; 6) the memory unit 1015 and the reorder buffer unit 1078 perform the write back/memory write stage 1960; 7) the retirement unit 1074 performs the ROB read stage; 8) various units may be involved in the exception handling stage; and 9) the retirement unit 1074 and the physical register files unit 1076 perform the commit stage.

Exemplary computer systems and processors - Figures 11-13

Figures 11-13 illustrate exemplary systems suitable for including the processor 101. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices and various other electronic devices are also suitable. In general, a wide variety of systems and electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.

Referring now to Figure 11, shown is a block diagram of a system 1100 in accordance with one embodiment of the invention. The system 1100 may include one or more processors 1110, 1115, which are coupled to a graphics memory controller hub (GMCH) 1120. The optional nature of the additional processor 1115 is denoted in Figure 11 with broken lines.

Each processor 1110, 1115 may be some version of the processor 101. It should be noted, however, that integrated graphics logic and integrated memory control units might not exist in the processors 1110 and 1115.

Figure 11 illustrates that the GMCH 1120 may be coupled to a memory 1140 that may be, for example, a dynamic random access memory (DRAM). For at least one embodiment, the DRAM may be associated with a non-volatile cache.

The GMCH 1120 may be a chipset, or a portion of a chipset. The GMCH 1120 may communicate with the processor(s) 1110, 1115 and control interaction between the processor(s) 1110, 1115 and the memory 1140. The GMCH 1120 may also act as an accelerated bus interface between the processor(s) 1110, 1115 and other elements of the system 1100. For at least one embodiment, the GMCH 1120 communicates with the processor(s) 1110, 1115 via a multi-drop bus, such as a front side bus (FSB) 1195.

Furthermore, the GMCH 1120 is coupled to a display 1145 (such as a flat panel display). The GMCH 1120 may include an integrated graphics accelerator. The GMCH 1120 is further coupled to an input/output (I/O) controller hub (ICH) 1150, which may be used to couple various peripheral devices to the system 1100. Shown for example in the embodiment of Figure 11 is an external graphics device 1160, which may be a discrete graphics device coupled to the ICH 1150, along with another peripheral device 1170.

Alternatively, additional or different processors may also be present in the system 1100. For example, the additional processor(s) 1115 may include additional processor(s) that are the same as the processor 1110, additional processor(s) that are heterogeneous or asymmetric to the processor 1110, accelerators (such as graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor. There can be a variety of differences between the physical resources 1110, 1115 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal and power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processing elements 1110, 1115. For at least one embodiment, the various processing elements 1110, 1115 may reside in the same die package.

Referring now to Fig. 9, shown is the block diagram of second system 1200 according to an embodiment of the invention.Such as Figure 12 institutes Show, multicomputer system 1200 is point-to-point interconnection system, and the first processor including being coupled via point-to-point interconnection 1250 1270 and second processor 1280.As shown in figure 12, in processor 1270 and 1280 can be each a certain of processor 101 Version.

Alternatively, processor 1270, one or more of 1280 can be element in addition to processors, such as accelerate Device or field programmable gate array.

Although only being shown with two processors 1270,1280, it should be understood that the scope of the present invention not limited to this.Other In embodiment, one or more additional processing elements may be present in given processor.

Processor 1270 may also include integrated memory controller maincenter (IMC) 1272 and point-to-point (P-P) interface 1276 With 1278.Similarly, second processor 1280 may include IMC1282 and P-P interfaces 1286 and 1288.Processor 1270,1280 Data can be exchanged via using point-to-point (PtP) interface 1250 of point-to-point (PtP) interface circuit 1278,1288.As schemed Shown in 12, the 1272 of IMC and 1282 couple the processor to corresponding memory, i.e. memory 1242 and memory 1244, this A little memories can be the portion of main memory for being locally attached to respective processor.

Processor 1270,1280 can be each via each of use point-to-point interface circuit 1276,1294,1286 and 1298 Individual P-P interfaces 1252,1254 exchange data with chipset 1290.Chipset 1290 can also via high performance graphics interface 1239 with High performance graphics circuit 938 exchanges data.

A shared cache (not shown) may be included in either processor, or outside of both processors yet connected with the processors via a P-P interconnect, such that the local cache information of either or both processors may be stored in the shared cache if a processor is placed into a low-power mode. Chipset 1290 may be coupled to a first bus 1216 via an interface 1296. In one embodiment, first bus 1216 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third-generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in Figure 12, various I/O devices 1214 may be coupled to first bus 1216, along with a bus bridge 1218 that couples first bus 1216 to a second bus 1220. In one embodiment, second bus 1220 may be a low pin count (LPC) bus. In one embodiment, various devices may be coupled to second bus 1220, including, for example, a keyboard and/or mouse 1222, communication devices 1226, and a data storage unit 1228, such as a disk drive or other mass storage device, which may include code 1230. Further, an audio I/O 1224 may be coupled to second bus 1220. Note that other architectures are possible. For example, instead of the point-to-point architecture of Figure 12, a system may implement a multi-drop bus or another such architecture.

Referring now to Figure 13, shown is a block diagram of a third system 1300 according to an embodiment of the present invention. Like elements in Figures 12 and 13 bear like reference numerals, and certain aspects of Figure 12 have been omitted from Figure 13 in order to avoid obscuring other aspects of Figure 13.

Figure 13 shows that the processing elements 1270, 1280 may include integrated memory and I/O control logic ("CL") 1272 and 1282, respectively. For at least one embodiment, the CL 1272, 1282 may include memory controller hub logic (IMC). In addition, the CL 1272, 1282 may also include I/O control logic. Figure 10 shows that not only are the memories 1242, 1244 coupled to the CL 1272, 1282, but I/O devices 1214 are also coupled to the control logic 1272, 1282. Legacy I/O devices 1215 are coupled to the chipset 1290.

Referring now to Figure 14, shown is a block diagram of an SoC 1400 according to an embodiment of the present invention. Similar elements in Figure 15 bear like reference numerals. Also, dashed-line boxes are optional features on more advanced SoCs. In Figure 14, interconnect unit(s) 1402 are coupled to: an application processor 1410, which includes a set of one or more cores 1402A-N and shared cache unit(s) 1406; a system agent unit 1410; bus controller unit(s) 1414; integrated memory controller unit(s) 1414; a set of one or more media processors 1420, which may include integrated graphics logic 1408, an image processor 1424 for providing still and/or video camera functionality, an audio processor 1426 for providing hardware audio acceleration, and a video processor 1428 for providing video encode/decode acceleration; a static random access memory (SRAM) unit 1430; a direct memory access (DMA) unit 1432; and a display unit 1440 for coupling to one or more external displays.

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code may be applied to input data to perform the functions described herein and to generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium that represent various logic within the processor and that, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores", may be stored on a tangible machine-readable medium and supplied to various customers or manufacturing facilities to be loaded into the fabrication machines that actually make the logic or processor.

Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as: hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs); random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs); erasable programmable read-only memories (EPROMs); flash memories; electrically erasable programmable read-only memories (EEPROMs); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.

Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions in the vector friendly instruction format or containing design data, such as Hardware Description Language (HDL), which defines the structures, circuits, apparatuses, processors, and/or system features described herein. Such embodiments may also be referred to as program products.

In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on the processor, off the processor, or partly on and partly off the processor.
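The one-to-many conversion described above can be pictured with a minimal table-driven sketch. Everything in it is an assumption made for illustration: the opcode names (`SRC_ADD`, `SRC_MULADD`, `TGT_ADD`, `TGT_MUL`), the `TRANSLATION_TABLE` mapping, and the `convert` function are hypothetical and do not correspond to any real instruction set or to the converter of the embodiments.

```python
from typing import Dict, List

# Hypothetical mapping: each source opcode expands to one or more target opcodes.
TRANSLATION_TABLE: Dict[str, List[str]] = {
    "SRC_ADD": ["TGT_ADD"],
    "SRC_MULADD": ["TGT_MUL", "TGT_ADD"],  # one source instruction, two target instructions
}


def convert(source_program: List[str]) -> List[str]:
    """Statically translate a list of source opcodes into target opcodes."""
    target_program: List[str] = []
    for opcode in source_program:
        if opcode not in TRANSLATION_TABLE:
            raise KeyError(f"no translation for source opcode {opcode!r}")
        target_program.extend(TRANSLATION_TABLE[opcode])
    return target_program


converted = convert(["SRC_MULADD", "SRC_ADD"])
```

A static table of this kind corresponds only to the simplest case of static binary translation; dynamic translation, morphing, and emulation would need run-time state that this sketch leaves out.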

Figure 16 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to an embodiment of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although the instruction converter may alternatively be implemented in software, firmware, hardware, or various combinations thereof.

Figure 16 shows that a program in a high-level language 1602 may be compiled using an x86 compiler 1604 to generate x86 binary code 1606 that may be natively executed by a processor with at least one x86 instruction set core 1616 (it is assumed that some of the compiled instructions are in the vector friendly instruction format). The processor with at least one x86 instruction set core 1616 represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 1604 represents a compiler that is operable to generate x86 binary code 1606 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor with at least one x86 instruction set core 1616. Similarly, Figure 16 shows that the program in the high-level language 1602 may be compiled using an alternative instruction set compiler 1608 to generate alternative instruction set binary code 1610 that may be natively executed by a processor without at least one x86 instruction set core 1614 (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, California and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, California). The instruction converter 1612 is used to convert the x86 binary code 1606 into code that may be natively executed by the processor without an x86 instruction set core 1614. This converted code is not likely to be the same as the alternative instruction set binary code 1610, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1612 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1606.

Certain operations of the instruction(s) may be performed by hardware components and may be embodied in machine-executable instructions that are used to cause, or at least result in, a circuit or other hardware component programmed with the instructions performing the operations. The circuit may include a general-purpose or special-purpose processor, or a logic circuit, to name just a few examples. The operations may also optionally be performed by a combination of hardware and software. Execution logic and/or a processor may include specific or particular circuitry or other logic that is responsive to a machine instruction, or to one or more control signals derived from the machine instruction, to store an instruction-specified result operand. For example, embodiments of the instruction(s) disclosed herein may be executed in one or more systems, and embodiments of the instruction(s) in the vector friendly instruction format may be stored in program code to be executed in the systems. Additionally, the processing elements of these figures may utilize one of the pipelines and/or architectures (e.g., the in-order and out-of-order architectures) detailed herein. For example, the decode unit of the in-order architecture may decode the instruction(s), pass the decoded instruction to a vector or scalar unit, and so on.

The above description is intended to illustrate preferred embodiments of the present invention. From the discussion above it should also be apparent that, especially in an area of technology where growth is fast and further advancements are not easily foreseen, the invention can be modified in arrangement and detail by those skilled in the art without departing from the principles of the present invention within the scope of the accompanying claims and their equivalents. For example, one or more operations of a method may be combined or further broken apart.

Alternative embodiment

While embodiments have been described that would natively execute the vector friendly instruction format, alternative embodiments of the invention may execute the vector friendly instruction format through an emulation layer running on a processor that executes a different instruction set (e.g., a processor that executes the MIPS instruction set of MIPS Technologies of Sunnyvale, California, or a processor that executes the ARM instruction set of ARM Holdings of Sunnyvale, California). Also, while the flow diagrams in the figures show a particular order of operations performed by certain embodiments of the invention, it should be understood that such order is exemplary (e.g., alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, etc.).
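An emulation layer of the kind mentioned above can be sketched, under stated assumptions, as a small interpreter that looks up a software routine for each opcode the host core lacks. The class name `EmulationLayer`, the opcode names, the register names, and the two placeholder handlers are all hypothetical; in particular, the handlers are simple stand-ins and are not the JH transforms.

```python
from typing import Callable, Dict, List, Tuple

Instruction = Tuple[str, str, str]  # (opcode, destination register, source register)
Handler = Callable[[int], int]      # software routine standing in for one opcode


class EmulationLayer:
    """Interprets instructions that the host core cannot execute natively."""

    def __init__(self, handlers: Dict[str, Handler]) -> None:
        self.handlers = handlers
        self.registers: Dict[str, int] = {}

    def run(self, program: List[Instruction]) -> None:
        for opcode, dst, src in program:
            handler = self.handlers.get(opcode)
            if handler is None:
                raise ValueError(f"no software handler for opcode {opcode!r}")
            # The handler computes in software what the missing instruction
            # would have computed in hardware.
            self.registers[dst] = handler(self.registers[src])


# Placeholder handlers on 8-bit values (hypothetical, for illustration only).
layer = EmulationLayer({
    "OP_INVERT": lambda x: x ^ 0xFF,
    "OP_SWAP_NIBBLES": lambda x: ((x << 4) | (x >> 4)) & 0xFF,
})
layer.registers["r0"] = 0x3C
layer.run([("OP_INVERT", "r1", "r0"), ("OP_SWAP_NIBBLES", "r2", "r1")])
```

The design choice here is interpretation at run time, in contrast to a converter that rewrites the program ahead of execution.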

In the description above, for the purposes of explanation, numerous specific details have been set forth in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that one or more other embodiments may be practiced without some of these specific details. The particular embodiments described are provided not to limit the invention but to illustrate embodiments of the invention. The scope of the invention is not to be determined by the specific examples provided above, but only by the appended claims.

Claims

1. A method for performing a process in a computer processor, comprising:

storing JH state bits in a plurality of registers before executing an instruction of a first type;

decoding the instruction of the first type and an instruction of a second type;

executing one or more instructions of the first type to perform an S-Box mapping and a linear (L) transformation on a JH state by:

executing the instruction of the first type a first time to perform the S-Box mapping and the L transformation on a first component of the JH state stored in a first register; and

executing the instruction of the first type a second time to perform the S-Box mapping and the L transformation on a second component of the JH state stored in a second register; and

once the S-Box mapping and the L transformation have been performed, executing one or more instructions of the second type to perform a permutation function on the JH state, wherein a format of the instruction of the first type includes a first register operand for storing one half of the JH state, and a format of the instruction of the second type includes second and third register operands for holding results of executing the instruction of the first type.

2. The method of claim 1, characterized in that the plurality of registers are 512-bit registers.

3. The method of claim 2, characterized in that one register stores the lower 512 bits of the JH state and a different register stores the upper 512 bits of the JH state.

4. The method of claim 1, characterized in that the instruction of the first type is executed the first time and the second time using a mask register.

5. The method of claim 1, characterized by further comprising:

storing, in a first destination register, a result of executing the instruction of the first type the first time as a first JH state result; and

storing, in a second destination register, a result of executing the instruction of the first type the second time as a second JH state result.

6. The method of claim 5, characterized in that executing the instruction of the second type further comprises:

retrieving the JH state results from the first and second destination registers;

performing a first permutation function on the first JH state result; and

performing a second permutation function on the second JH state result.

7. An instruction processing apparatus, comprising:

a plurality of data registers, wherein the plurality of data registers include a register for storing one half of the JH state bits and a register for storing the other half of the JH state bits; and

an execution unit coupled with the plurality of data registers, to execute one or more instructions of a first type to perform an S-Box mapping and a linear (L) transformation on a JH state and, once the S-Box mapping and the L transformation have been performed, to execute one or more instructions of a second type to perform a permutation function on the JH state, wherein a format of the instruction of the first type includes a first register operand for storing one half of the JH state, and a format of the instruction of the second type includes second and third register operands for holding results of executing the instruction of the first type, and wherein the execution unit is to execute the instruction of the first type a first time to perform the S-Box mapping and the L transformation on a first half of the JH state bits and to execute the instruction of the first type a second time to perform the S-Box mapping and the L transformation on a second half of the JH state bits.

8. The instruction processing apparatus of claim 7, characterized in that the first register is a 512-bit register.

9. The instruction processing apparatus of claim 7, characterized in that the execution unit is to execute the instruction of the first type the first time and the second time using a mask register.

10. The instruction processing apparatus of claim 9, characterized in that the execution unit is to: store, in a first destination register, a result of executing the instruction of the first type the first time as a first JH state result, and store, in a second destination register, a result of executing the instruction of the first type the second time as a second JH state result.

11. The instruction processing apparatus of claim 10, characterized in that the execution unit is to execute the instruction of the second type by: retrieving the JH state results from the first and second destination registers, performing a first permutation function on the first JH state result, and performing a second permutation function on the second JH state result.

12. An apparatus for performing JH cryptographic hashing, comprising:

first instruction execution means for storing JH state bits in a plurality of registers before executing an instruction of a first type, and for executing one or more instructions of the first type to perform an S-Box mapping and a linear (L) transformation on a JH state by the following operations:

second instruction execution means for, once the S-Box mapping and the L transformation have been performed, executing one or more instructions of a second type to perform a permutation function on the JH state, wherein a format of the instruction of the first type includes a first register operand for storing one half of the JH state, and a format of the instruction of the second type includes second and third register operands for holding results of executing the instruction of the first type.

13. The apparatus of claim 12, characterized in that the first instruction execution means is to execute the instruction of the first type a first time and a second time using a mask register.

14. The apparatus of claim 12, characterized in that the first instruction execution means is further to:

15. The apparatus of claim 14, characterized in that the second instruction execution means is further to:

retrieve the JH state results from the first and second destination registers;

perform a second permutation function on the second JH state result.

16. A computer system, comprising:

an interconnect;

a processor coupled with the interconnect, the processor including:

a plurality of data registers, the plurality of data registers including a register for storing a first half of the JH state bits and a register for storing a second half of the JH state bits; and

an execution unit coupled with the plurality of data registers, to execute one or more instructions of a first type to perform an S-Box mapping and a linear (L) transformation on a JH state and, once the S-Box mapping and the L transformation have been performed, to execute one or more instructions of a second type to perform a permutation function on the JH state, wherein a format of the instruction of the first type includes a first register operand for storing one half of the JH state, and a format of the instruction of the second type includes second and third register operands for holding results of executing the instruction of the first type, and wherein the execution unit is to execute the instruction of the first type a first time to perform the S-Box mapping and the L transformation on the first half of the JH state bits and to execute the instruction of the first type a second time to perform the S-Box mapping and the L transformation on the second half of the JH state bits; and

a dynamic random access memory (DRAM) coupled with the interconnect.

17. The computer system of claim 16, characterized in that the execution unit is to execute the instruction of the first type the first time and the second time using a mask register.

18. The computer system of claim 16, characterized in that the execution unit is to store, in a first destination register, a result of executing the instruction of the first type the first time as a first JH state result, and to store, in a second destination register, a result of executing the instruction of the first type the second time as a second JH state result.

19. The computer system of claim 18, characterized in that the execution unit is to execute the instruction of the second type by: retrieving the JH state results from the first and second destination registers, performing a first permutation function on the first JH state result, and performing a second permutation function on the second JH state result.
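As a reading aid for the flow recited in the claims (the first-type instruction executed once per 512-bit half, then the second-type instruction consuming both results), the following is a minimal software sketch, not the claimed hardware. The function names `sbox_l`, `permute`, and `round_step` and the `select_bits` parameter are hypothetical. The S-box tables, the GF(2^4) reduction polynomial x^4 + x + 1, and the permutation built from pi, P', and phi are assumptions taken from the public JH specification and should be checked against it. The state is modeled as 256 four-bit values, and the mask register of claims 4, 9, 13, and 17 is not modeled.

```python
from typing import List

# Assumed from the public JH specification; verify before relying on them.
S0 = [9, 0, 4, 11, 13, 12, 3, 15, 1, 10, 2, 6, 7, 5, 8, 14]
S1 = [3, 12, 6, 13, 5, 7, 1, 9, 15, 2, 0, 4, 11, 10, 14, 8]
SBOX = (S0, S1)


def gf16_double(x: int) -> int:
    """Multiply a 4-bit value by 2 in GF(2^4) modulo x^4 + x + 1."""
    x <<= 1
    return (x ^ 0x13) if x & 0x10 else x


def sbox_l(half: List[int], select_bits: List[int]) -> List[int]:
    """Model of one first-type instruction: S-box layer, then L on nibble pairs."""
    v = [SBOX[c][a] for a, c in zip(half, select_bits)]
    out: List[int] = []
    for a, b in zip(v[0::2], v[1::2]):
        d = b ^ gf16_double(a)  # D = B + 2*A
        c = a ^ gf16_double(d)  # C = A + 2*D, i.e. 5*A + 2*B
        out += [c, d]
    return out


def permute(lo: List[int], hi: List[int]) -> List[int]:
    """Model of the second-type instruction: pi, then P', then phi."""
    a = lo + hi
    n = len(a)  # assumed to be a multiple of 4
    b = list(a)
    for i in range(0, n, 4):  # pi: swap the last two nibbles of each group of four
        b[i + 2], b[i + 3] = a[i + 3], a[i + 2]
    c = b[0::2] + b[1::2]  # P': even positions first, then odd positions
    for i in range(n // 2, n, 2):  # phi: swap adjacent nibbles in the second half
        c[i], c[i + 1] = c[i + 1], c[i]
    return c


def round_step(state: List[int], select_bits: List[int]) -> List[int]:
    """Two first-type executions (one per half), then one permutation."""
    half = len(state) // 2
    first_result = sbox_l(state[:half], select_bits[:half])
    second_result = sbox_l(state[half:], select_bits[half:])
    return permute(first_result, second_result)


state = [i % 16 for i in range(256)]
next_state = round_step(state, [0] * 256)
```

Because L acts on adjacent pairs of four-bit values, splitting the state into two equal halves does not cut any pair, which is consistent with executing the first-type instruction independently on each register.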