CN202257543U

CN202257543U - Instruction-optimized processor for the Advanced Encryption Standard (AES) symmetric encryption algorithm

Info

Publication number: CN202257543U
Application number: CN2011201714458U
Authority: CN
Inventors: 夏辉; 贾智平; 陈仁海; 张志勇; 颜冲
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2011-05-26
Filing date: 2011-05-26
Publication date: 2012-05-30
Anticipated expiration: 2021-05-26

Abstract

The utility model discloses an instruction-optimized processor for the Advanced Encryption Standard (AES) symmetric encryption algorithm. It consists mainly of a data memory, a code memory, a register file and a pipeline; the pipeline comprises an instruction fetch unit, a decoding unit, an execution unit and a pipeline controller. In terms of execution efficiency, cycle-level simulation shows that the AES application-specific instruction processor (AES_ASIP) needs 57.3x% fewer clock cycles to run the AES encryption program than an ARM (Advanced RISC Machine) processor, which greatly improves the execution efficiency of the program. In terms of code space, the instruction code occupies 783 bytes of memory on the ARM processor but only 416 bytes on AES_ASIP, saving 46.6x% of the memory space.

Description

Instruction-optimized processor for the AES symmetric encryption algorithm

Technical field

The utility model relates to the field of encryption and decryption with the AES symmetric encryption algorithm, and in particular to an instruction-optimized processor for the AES symmetric encryption algorithm.

Background technology

The AES algorithm combines strong data security, high performance, high efficiency, ease of use and flexibility. However, because encryption and decryption consume considerable processor resources, processor performance becomes the key constraint on running AES efficiently. Although microprocessor performance keeps improving, in many fields the execution efficiency of AES still cannot meet every design requirement, especially in embedded environments with limited computing resources. Embedded microprocessors have lower performance and slower arithmetic, so AES runs less efficiently on them. How to improve the execution efficiency of AES in embedded environments and guarantee efficient, secure data transmission has become a hot research topic at home and abroad.

Academia currently has three ways of improving the execution efficiency of cryptographic algorithms.

The first is to optimize the program flow of the cryptographic algorithm purely in software, so that the algorithm flow is more reasonable and runs more efficiently. Bertoni et al. proposed an optimized version of the AES algorithm that speeds up its execution on memory-constrained 32-bit processors [1]. As embedded microprocessor performance keeps improving, this approach improves the execution efficiency of cryptographic algorithms accordingly. Although it is flexible, its optimization space is very narrow: on the same type of microprocessor the improvement reaches about 21% at most. A pure software implementation also needs look-up tables, and during data look-up the tables are vulnerable to cache-based side-channel attacks, which can leak the symmetric key to an attacker while AES is running.

The second is to implement the cryptographic algorithm purely in hardware, realizing one or several consecutive instructions of the low-level program with dedicated hardware circuits. Based on this approach, Kuo et al. [2] proposed implementing AES as an application-specific integrated circuit (ASIC); this method completes the AES-128 algorithm in only 10 cycles, and the article also discusses the chip architecture and the optimized design for executing the algorithm. This approach implements the cryptographic algorithm quickly, but it scales poorly and occupies more hardware resources, which raises the hardware cost of the microcontroller circuit significantly and makes it hard to integrate with other computing modules.

The third is to optimize the cryptographic algorithm by extending the instruction set architecture (ISA). The processor's instructions are extended for a specific application: the basic operations that limit the performance of the cryptographic algorithm are implemented in hardware, corresponding instructions are added to the instruction set, and an application-specific instruction processor (ASIP) is finally generated. Based on this approach, Wu et al. [3] introduced a fast and flexible cryptographic coprocessor; the authors first verified that the coprocessor works well on the 3DES algorithm and that it supports multiple encryption algorithms while remaining flexible. Sun et al. [4] defined three extended instructions for an efficient AES implementation based on fine-grained random-mask operation decomposition and, combined with random instruction scheduling, gave a complete AES implementation flow. This approach combines the advantages of pure software and pure hardware optimization: it keeps the flexibility of a software implementation while further improving system performance, trading a small increase in hardware resources for a large gain in execution efficiency and a large reduction in instruction code size. It is also more extensible and can be integrated with other computing modules.

The method of Wu et al. needs a dedicated embedded microprocessor (coprocessor) attached alongside the original embedded microprocessor, rather than optimizing the algorithm's instructions within the original microprocessor; it also occupies part of the processor's resources, so it is costly and of limited applicability. The method of Sun et al. does not optimize the extended-instruction operations of the algorithm to the greatest extent, and the effect of its optimization is not very pronounced. Patent application No. 201110024766X, filed by this research group on 24 January 2011, uses ISA extension to optimize the S-box generation algorithm of AES and proposes 2 extended instructions that greatly improve the efficiency of S-box generation. Another application filed by the group on the same day, No. 201110024639X, uses ISA extension to optimize the column-mixing (MixColumns) module of AES and greatly improves that module's execution efficiency. However, those two utility models optimize only individual computing modules of AES and do not consider extending the instruction set for the whole AES algorithm; applied separately to the complete algorithm, their improvement in AES execution efficiency is not very pronounced.

Summary of the utility model

To remedy the deficiencies of the prior art, the utility model provides an AES ASIP. It applies ISA-extension optimization to the AES algorithm and, following an electronic system level (ESL) design flow, designs and implements 5 extended instructions dedicated to accelerating AES. A processor generation tool based on the LISA language is used to build a complete and efficient AES application-specific instruction processor model (AES_ASIP), in order to meet the needs of the algorithm in embedded environments where computing speed and memory space are limited. The processor model is finally implemented on an FPGA to complete physical verification.

To achieve the above object, the utility model adopts the following technical scheme:

An instruction optimization method for the AES symmetric encryption algorithm. The ISA extension for the AES symmetric encryption algorithm is carried out under the following constraints: the opcode of each new instruction has the same length as the opcodes of the original processor model; the opcode and operands of a new instruction together do not exceed the instruction width of the original instruction set; the execution unit of a new instruction is not overly complex; executing a new instruction does not slow the system down; and the number of new instructions is kept small, so that the resulting hardware overhead is limited. The optimization method is as follows:

1) In S-box byte substitution, the affine transformation operates on individual bits, and every affine transformation has to extract each bit of an eight-bit binary number. Bit extraction uses the instruction getbit <dest> = <src>, <bitpos>, which takes bit bitpos from register src and stores it in the lowest bit of register dest. The whole process completes in one clock cycle, which accelerates bit extraction;

2) After all bits of the eight-bit binary number have been extracted, the affine transformation has to perform a five-input XOR for each bit. The five-input XOR uses the instruction xor_5 <dest> = <src1>, <src2>, <src3>, <src4>, <src5>, which XORs the contents of the registers named by src1 to src5 and stores the result in the register named by dest. The whole process completes in one clock cycle, which accelerates the XOR operation;

3) Column mixing (MixColumns) uses multiplication in the Galois field GF(2⁸). The multiplication uses the instruction ifand <src1>, <src2>, <xor_src1>, <xor_src2>, which ANDs src1 with src2; if the result is not 0, xor_src1 and xor_src2 are XORed and the result is stored in xor_src1; if the result is 0, no XOR is performed. The whole process completes in one clock cycle, which accelerates multiplication in the field;

4) The matrix multiplication in column mixing repeatedly has to locate data within the matrix. Matrix look-up uses the instruction matrixpos <dest> = <src1>, <src2>, <src3>, <src4>, which finds the data at a specified position in the matrix. The whole process completes in one clock cycle, which accelerates look-up of data at a specified position;

5) Column mixing also uses data exchange. Data exchange uses the instruction swap <src1>, <src2>, which exchanges the source operands src1 and src2, that is, assigns the value of src1 to src2 and at the same time assigns the value of src2 to src1. The whole process completes in one clock cycle, which accelerates data exchange.

Together, these five extended instructions improve the execution efficiency of the AES algorithm and at the same time reduce the storage space needed for its instruction code.
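The behaviour of the five instructions listed above can be sketched in software. The following is a minimal Python model, not the processor's implementation: registers are modelled as plain 32-bit integers, and the operand order assumed for matrixpos (base address, row index i, row length n, column index j) is an assumption inferred from the address r1 + i*n + j given later in the description. The text does not state whether dest receives the element or its address; the model returns the element.

```python
MASK32 = 0xFFFFFFFF  # registers are assumed to be 32 bits wide


def getbit(src, bitpos):
    """getbit <dest> = <src>, <bitpos>: bit `bitpos` of src, placed in bit 0."""
    return (src >> bitpos) & 1


def xor_5(src1, src2, src3, src4, src5):
    """xor_5 <dest> = <src1>, ..., <src5>: XOR of five register values."""
    return (src1 ^ src2 ^ src3 ^ src4 ^ src5) & MASK32


def ifand(src1, src2, xor_src1, xor_src2):
    """ifand: returns the new value of xor_src1.

    If (src1 AND src2) is non-zero, xor_src1 becomes xor_src1 XOR xor_src2;
    otherwise it is left unchanged.
    """
    if src1 & src2:
        return (xor_src1 ^ xor_src2) & MASK32
    return xor_src1


def matrixpos(memory, base, i, n, j):
    """matrixpos: element (i, j) of a row-major matrix with n columns.

    `memory` is a hypothetical flat list standing in for the data memory.
    """
    return memory[base + i * n + j]


def swap(src1, src2):
    """swap <src1>, <src2>: returns the exchanged pair (new src1, new src2)."""
    return src2, src1
```

For example, with a 4x4 matrix stored row by row in a flat list, matrixpos(memory, 0, 2, 4, 1) reads the element in row 2, column 1.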

An instruction-optimized processor for the AES symmetric encryption algorithm, consisting mainly of four parts: a data memory, a code memory, a register file and a pipeline.

The pipeline comprises an instruction fetch unit, a decoding unit, an execution unit and a pipeline controller. The output of the fetch unit is connected to the input of pipeline register I; the output of pipeline register I is connected to the input of the decoding unit; the output of the decoding unit is connected to the input of pipeline register II; and the output of pipeline register II is connected to the input of the execution unit. The data memory is connected bidirectionally to the execution unit. The output of the code memory is connected to the input of the fetch unit. The output of the pipeline controller is connected to the inputs of the register file, pipeline register I and pipeline register II. The output of the decoding unit is connected to the input of the pipeline controller.

The register file consists of 32 general-purpose registers, 1 fetch-address register, 1 stack pointer and 1 link register.

The execution unit of the extended instruction getbit comprises 1 shifter, 1 AND gate and 1 multiplexer, and operates on the general-purpose registers. The shifter receives general-purpose register r0 and a 4-bit value i as inputs; i, whose maximum value is 31, gives the number of bit positions to shift. The shifted result is ANDed with 0x00000001 in the AND gate, which outputs a 32-bit value whose lowest bit holds bit i of r0 and whose other bits are 0. The control signal getbit_exe drives the multiplexer, which receives both five 0 bits and the address of general-purpose register r1 and selects between them. When the control signal is 1, the multiplexer sends the address of r1 to the register file, so the AND-gate output is assigned to r1. When the control signal is 0, the multiplexer sends the five 0 bits to the register file, that is, a null address; on seeing the null address the processor performs no assignment. getbit_exe is issued by the decoding stage and decides whether the getbit operation is executed.

The execution unit of the extended instruction xor5 comprises 1 XOR circuit group and 1 multiplexer, and operates on the general-purpose registers. The XOR circuit group consists of a series of XOR gates; its inputs receive the data of general-purpose registers r2, r3, r4, r5 and r6, and its output is the five-input XOR of those values. The control signal xor5_exe drives the multiplexer, which receives both five 0 bits and the address of general-purpose register r1 and selects between them. When the control signal is 1, the multiplexer sends the address of r1 to the register file, so the output of the XOR circuit group is assigned to r1. When the control signal is 0, the multiplexer sends the five 0 bits to the register file, that is, a null address; on seeing the null address the processor performs no assignment. xor5_exe is issued by the decoding stage and decides whether the xor5 operation is executed.

The execution unit of the extended instruction ifand comprises 2 AND gates, 1 OR gate, 1 XOR gate and 1 multiplexer, and operates on the general-purpose registers shared by the whole processor. AND gate I receives r0 and r1 and outputs the 32-bit value r0 AND r1. The OR gate combines the 32 bits of that output into a single 1-bit value. This 1-bit value and the control signal ifand_exe are the inputs of AND gate II, and the output of AND gate II is the multiplexer input that controls address selection. If the output of AND gate II is 1, the multiplexer sends the address of general-purpose register r2 to the register file, so the result of XORing r2 and r3 in the XOR gate is assigned to r2. If the output of AND gate II is 0, the multiplexer sends five 0 bits to the register file, that is, a null address; on seeing the null address the processor performs no assignment. ifand_exe is issued by the decoding stage and decides whether the ifand operation is executed.

The execution unit of the extended instruction matrixpos comprises 1 multiplier, 1 adder and 1 multiplexer, and operates on the data memory shared by the whole processor. The multiplier receives i and n, computes i*n and passes it to the adder together with the input signals j and r1. The adder sums its three inputs r1, j and the multiplier output i*n, which gives the address of the matrix element. The adder output is the multiplexer input used for address selection. If the multiplexer control is 1, the multiplexer sends the address of the matrix element to the data memory, so the result for matrix position r1+i*n+j is assigned to r1. If the multiplexer control is 0, the multiplexer sends sixteen 0 bits to the data memory, that is, a null address; on seeing the null address the processor performs no assignment.

A_matrixpos_EX_in is the control signal; it is issued by the decoding stage and decides whether the matrixpos operation is executed.
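The address path described above (multiplier, adder, then a multiplexer gated by the control signal) can be illustrated as follows. This is a sketch under assumptions: the 16-bit address width is taken from the "sixteen 0 bits" null address, the parameter name enable stands in for the control signal A_matrixpos_EX_in, and the base address in the usage note is invented for illustration.

```python
ADDRESS_MASK = 0xFFFF  # 16-bit data-memory address (assumed from the null address width)
NULL_ADDRESS = 0x0000  # all-zero address: the processor performs no assignment


def matrixpos_address(r1, i, n, j, enable=1):
    """Model of the matrixpos address path.

    r1 is the matrix base address, i the row index, n the number of
    elements per row and j the column index.
    """
    product = i * n                               # multiplier: i * n
    address = (r1 + product + j) & ADDRESS_MASK   # adder: r1 + i*n + j
    # multiplexer: forward the element address only when the control signal is 1
    return address if enable else NULL_ADDRESS
```

With a hypothetical 4x4 state matrix stored at base address 0x0100, matrixpos_address(0x0100, 2, 4, 3) yields 0x010B, the linear address of the element in row 2, column 3.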

The execution unit of the extended instruction swap comprises 2 multiplexers and 1 register file, and operates on the general-purpose registers shared by the whole processor. Multiplexer I receives the select input r1_addr and multiplexer II receives the select input r2_addr; the multiplexer inputs control the address selection for the data being exchanged. If the output control signal of multiplexer I is 1, multiplexer I sends the address of general-purpose register r1 to the register file, so the value of register r1 is assigned to r2. If it is 0, multiplexer I sends five 0 bits to the register file, that is, a null address; on seeing the null address the processor performs no assignment. If the output control signal of multiplexer II is 1, multiplexer II sends the address of general-purpose register r2 to the register file, so the value of register r2 is assigned to r1. If it is 0, multiplexer II sends five 0 bits to the register file, that is, a null address; on seeing the null address the processor performs no assignment. A_swap_EX_in is the control signal; it is issued by the decoding stage and decides whether the swap operation is executed.

Beneficial effects: (1) Execution efficiency: cycle-level simulation of the number of clock cycles AES_ASIP needs to run the AES encryption algorithm shows a reduction of 57.3x% relative to an ARM processor, which greatly improves the efficiency of the algorithm. (2) Code space: the instruction code occupies 783 bytes of memory on the ARM processor but only 416 bytes on AES_ASIP, saving 46.6x% of the code memory space; optimizing the algorithm's instructions through ISA extension therefore effectively reduces the memory needed to store the algorithm code. According to the statistical experimental results, AES_ASIP improves execution efficiency by 58.4% and saves 47.4% of the memory space compared with the ARM processor. Before the instruction extension the AES algorithm uses 86816 cells of processor resources; after the extension AES_ASIP uses 93038 cells, so the hardware resources occupied increase by 7.2%.
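As a quick arithmetic check, the relative changes can be recomputed from the raw byte and cell counts quoted above. The helper name relative_change is introduced only for this illustration; the inputs are the figures stated in the text.

```python
def relative_change(before, after):
    """Relative change from `before` to `after` as a fraction of `before`."""
    return (after - before) / before


# Code size: 783 bytes on the ARM processor versus 416 bytes on AES_ASIP.
code_saving = -relative_change(783, 416)

# Hardware resources: 86816 cells before the extension, 93038 cells after.
area_overhead = relative_change(86816, 93038)

print(f"code space saved:  {code_saving:.1%}")
print(f"hardware overhead: {area_overhead:.1%}")
```

The hardware overhead computed this way rounds to the 7.2% stated in the text, and the code-space saving lands close to the percentages quoted there.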

Description of drawings

Fig. 1(a) is the AES encryption flow chart;

Fig. 1(b) is the AES decryption flow chart;

Fig. 2 is the design flow of the application-specific instruction processor;

Fig. 3 is the framework of the AES application-specific instruction processor model;

Fig. 4 is the hardware model of the getbit instruction execution unit;

Fig. 5 is the hardware model of the xor_5 instruction execution unit;

Fig. 6 is the hardware model of the ifand instruction execution unit;

Fig. 7 is the hardware model of the matrixpos instruction execution unit;

Fig. 8 is the hardware model of the swap instruction execution unit;

In the figures: 1. data memory; 2. register file; 3. code memory; 4. pipeline; 5. fetch pipeline stage; 6. decoding pipeline stage; 7. execution pipeline stage; 8. jump-instruction decoding unit; 9. AES extended-instruction decoding unit; 10. general-instruction decoding unit; 11. read/write-instruction execution unit; 12. AES extended-instruction execution unit; 13. logic/arithmetic-instruction execution unit; 14. pipeline controller; 15. pipeline register I; 16. pipeline register II; 17. general-purpose register; 18. shifter; 19. AND gate I; 20. multiplexer I; 21. XOR circuit group; 22. multiplexer II; 23. AND gate II; 24. OR gate I; 25. AND gate III; 26. multiplexer II; 27. XOR gate; 28. multiplier; 29. adder; 30. multiplexer III; 31. data memory; 32. multiplexer III; 33. multiplexer IV.

Embodiment

The utility model is further described below with reference to the drawings and an embodiment:

An instruction-set-extension optimization method for the AES symmetric encryption algorithm which, without changing the opcode length or instruction width of the original processor and without reducing processor speed, performs the following optimizations:

1) Bit extraction in S-box byte substitution: every affine transformation has to extract each bit of an eight-bit binary number. ARM processors and other common embedded processors have no direct bit-extraction operation; the conventional method needs three assembly instructions, and therefore three clock cycles, to extract a bit, so the process is very time-consuming. To accelerate it, the instruction getbit <dest> = <src>, <bitpos> was designed and adopted. It takes bit bitpos from general-purpose register src and stores it in the lowest bit of general-purpose register dest, which completes the bit extraction. The new instruction completes in one clock cycle, three times faster than on a conventional processor, which accelerates bit extraction in S-box byte substitution.

2) After all bits of the binary number have been extracted in the affine transformation, a five-input XOR has to be performed and the original number replaced with the result. A conventional ARM processor needs four assembly instructions, and therefore four clock cycles, for this function. To accelerate the five-input XOR, the new instruction xor5 <dest> = <src1>, <src2>, <src3>, <src4>, <src5> was designed and adopted. It XORs the contents of the general-purpose registers named by src1 to src5 and stores the result in the general-purpose register named by dest. The whole process completes in one clock cycle, four times faster than on a conventional processor, which accelerates the five-input XOR in the affine transformation.

3) Column mixing (MixColumns) uses multiplication in the Galois field GF(2⁸). An element of the data block is multiplied by one of the 256 elements of the S-box and the products are accumulated; this process also has to loop 8 times, so the computation is very time-consuming. We found that every iteration of the field multiplication contains one test followed by a conditional XOR. A conventional ARM processor needs four assembly statements, and therefore four clock cycles, for this function. To accelerate it, we designed and adopted the instruction ifand <src1>, <src2>, <xor_src1>, <xor_src2>. It ANDs the operands src1 and src2; if the result is not 0, xor_src1 and xor_src2 are XORed and the result is stored in the general-purpose register named by xor_src1; if the result is 0, no XOR is performed. The whole process completes in one clock cycle, four times faster than on a conventional processor, which accelerates multiplication in the field.
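The "test, then conditionally XOR" step inside the eight-iteration field multiplication can be illustrated with a standard shift-and-add multiplication in GF(2⁸). This is a sketch, not the patent's assembly code: the Python function ifand below only mimics the semantics described for the instruction, and using it for the modular-reduction step as well is a choice made for this example.

```python
def ifand(src1, src2, xor_src1, xor_src2):
    """Model of ifand: new value of xor_src1 after the conditional XOR."""
    return xor_src1 ^ xor_src2 if (src1 & src2) else xor_src1


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):                       # the loop of 8 iterations
        # test the lowest bit of b, then conditionally add (XOR) a
        product = ifand(b, 1, product, a)
        carry = a & 0x80                     # will the shift overflow 8 bits?
        a = (a << 1) & 0xFF                  # multiply a by x
        # if it overflowed, reduce by XORing with 0x1B (the low byte of 0x11B)
        a = ifand(carry, 0x80, a, 0x1B)
        b >>= 1
    return product
```

Each loop iteration contains exactly the pattern the instruction targets: one AND-based test followed by an XOR that is performed only when the test result is non-zero.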

4) The matrix multiplication in column mixing repeatedly has to locate data within the matrix, while in physical memory the matrix data are stored linearly. The conventional method needs 5 assembly instructions, and therefore 5 clock cycles. To accelerate this process, we designed and adopted the new instruction matrixpos <dest> = <src1>, <src2>, <src3>, <src4> to replace the original five assembly instructions. The new instruction finds the data at a specified position in the matrix. The whole process completes in one clock cycle, 5 times faster than on a conventional processor, which accelerates locating data within the matrix.

5) Column mixing makes heavy use of data exchange, in which 2 values held in registers have to be swapped. The conventional method needs 3 assembly instructions, and therefore 3 clock cycles. To accelerate this process, we designed and adopted the new instruction swap <src1>, <src2> to replace the original three assembly instructions. The new instruction exchanges the source operands src1 and src2, that is, assigns the value of src1 to src2 and at the same time assigns the value of src2 to src1. The whole process completes in one clock cycle, 3 times faster than on a conventional processor, which accelerates data exchange.

Together, these five extended instructions accelerate the AES symmetric encryption algorithm on an embedded microprocessor.

An AES ASIP model (AES_ASIP) designed according to the extended instruction set implements the above extended instructions in hardware logic and can therefore be used to accelerate the AES encryption algorithm. The hardware structure of the model consists mainly of four parts: a data memory (Data_RAM), a code memory (Prog_RAM), a register file (Registers) and a pipeline (Pipe). The program memory address space is defined in the range 0x0000-0x7FFF, with a size of 32K. The code memory address space is defined in the range 0x8000-0xFFFF, with a size of 32K. The register file consists of 32 general-purpose registers (GPR[0...31]), 1 fetch-address register (FPR), 1 stack pointer (SPR) and 1 link register (LR). The pipeline has three stages: fetch (FE), decode (DC) and execute (EX). The pipeline controller (Pipe_Ctl) is mainly responsible for handling jump instructions: a jump instruction only has to store the jump address in the fetch-address register (FPR) and does not pass through the execution unit; the buffered part of the pipeline is then flushed, which prevents the execution unit from executing the jump instruction. In the decode and execute units of the AES_ASIP processor, a dedicated instruction decoder (Decode_AES_EX) and executor (AES_EX) for the AES algorithm have been added to decode and execute the extended instructions.

The principle of the utility model and its concrete implementation are as follows:

The Advanced Encryption Standard (AES) algorithm is a block cipher. Its input block, output block and the intermediate blocks during encryption and decryption are all 128 bits long. It uses 10, 12 or 14 rounds (Nr denotes the number of rounds), corresponding to an input key length K of 128, 192 or 256 bits. Nk = 4, 6 or 8 denotes the number of words in the key (1 word = 32 bits). Each round needs an expanded key Key of the same length as the input block (128 bits). Because the externally supplied key k has limited length, AES uses a key expansion (Key Expansion) routine to extend k into a longer bit string and generate the round key for each round. The first round key is used for the pre-round transformation, and each remaining round key is used in the final transformation of its round. AES uses four transformations: S-box byte substitution, row shifting, column mixing and round-key addition. Every round except the last uses all four invertible transformations; the last round uses only three (no column mixing). Decryption uses the inverse transformations: inverse S-box substitution, inverse row shifting, inverse column mixing and round-key addition (which is its own inverse). Fig. 1(a) shows the AES encryption flow and Fig. 1(b) the AES decryption flow.
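The round structure described above can be summarized in a few lines. The sketch below only lists which transformations each round applies; the function names are invented for this illustration, the transformation names follow the usual AES naming (SubBytes, ShiftRows, MixColumns, AddRoundKey), and the relation Nr = Nk + 6 reproduces the 10/12/14 rounds stated in the text.

```python
def rounds_for_key(key_bits):
    """Number of rounds Nr for a key of 128, 192 or 256 bits."""
    nk = key_bits // 32                 # Nk: key length in 32-bit words
    if key_bits % 32 or nk not in (4, 6, 8):
        raise ValueError("AES keys are 128, 192 or 256 bits long")
    return nk + 6                       # gives 10, 12 or 14 rounds


def encryption_schedule(key_bits):
    """List of transformations per round, starting with the pre-round step."""
    nr = rounds_for_key(key_bits)
    schedule = [["AddRoundKey"]]        # pre-round transformation (first round key)
    for _ in range(1, nr):              # rounds 1 .. Nr-1 use all four transformations
        schedule.append(["SubBytes", "ShiftRows", "MixColumns", "AddRoundKey"])
    # the last round omits column mixing
    schedule.append(["SubBytes", "ShiftRows", "AddRoundKey"])
    return schedule
```

For a 128-bit key this gives one pre-round key addition followed by 10 rounds, so Nr + 1 round keys of 128 bits each are needed from the key expansion.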

(1) Encryption transformation

Let X be the 128-bit plaintext input and Y the 128-bit ciphertext output. The AES ciphertext Y is then given by the following composite transformation:

Y = (A_{Nr} \cdot R \cdot S) \cdot (A_{Nr-1} \cdot C \cdot R \cdot S) \cdot (A_{Nr-2} \cdot C \cdot R \cdot S) \cdots (A_{1} \cdot C \cdot R \cdot S) \cdot A_{0}(X)

where · denotes composition of transformations. Here A_Ni(X) denotes the transformation A_Ni(X) = X xor K_Ni, where K_Ni is the sub-key of round i and xor is the exclusive OR of bit strings. S: S-box byte substitution, which replaces each byte using the S-box; to improve execution efficiency, the S-box is a fixed look-up table. R: row shifting. C: column mixing, which involves polynomial multiplication.

(2) Decryption transformation

The decryption transformation is the inverse of the encryption transformation and is not described further here.

The computing modules of the AES encryption algorithm are analysed as follows:

(1) S-box byte substitution SubBytes()

Any element A of the input matrix is transformed to S[A] as follows: (a) as stored, every element A is an eight-bit binary number. Compute the hexadecimal digit x represented by its upper four bits and the hexadecimal digit y represented by its lower four bits. For example, when A = 11000100, x = C and y = 4. (b) Look up the value S[A] = S[x, y] in the S-box given by the AES algorithm (a matrix of 16 rows and 16 columns in which each element is one byte). For example, when A = 11000100, S[A] = S[x, y] = S[C, 4] = 1C = 00011100. Alternatively, S[A] = b′₇b′₆b′₅b′₄b′₃b′₂b′₁b′₀ can be computed directly by applying the following formula to the bits b₇b₆b₅b₄b₃b₂b₁b₀ of the multiplicative inverse of A in GF(2⁸):

b'_{i} = b_{i} \oplus b_{(i + 4) \bmod 8} \oplus b_{(i + 5) \bmod 8} \oplus b_{(i + 6) \bmod 8} \oplus b_{(i + 7) \bmod 8} \oplus c_{i}, \quad c = \mathtt{0x63}
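The affine formula above maps naturally onto bit extraction followed by XOR, which is where getbit and the five-input XOR come in. The sketch below is illustrative only: in the AES standard the affine transformation is applied to the multiplicative inverse of the byte in GF(2⁸), so the example first computes that inverse (by raising the byte to the power 254, a simple but slow method chosen here for brevity). All function names are assumptions of this example.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return product


def gf_inverse(a):
    """Multiplicative inverse in GF(2^8) via a^254; 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0


def getbit(src, bitpos):
    """Model of getbit: bit `bitpos` of src."""
    return (src >> bitpos) & 1


def affine(byte):
    """Apply b'_i = b_i ^ b_(i+4) ^ b_(i+5) ^ b_(i+6) ^ b_(i+7) ^ c_i, c = 0x63."""
    out = 0
    for i in range(8):
        bit = (getbit(byte, i)
               ^ getbit(byte, (i + 4) % 8)
               ^ getbit(byte, (i + 5) % 8)
               ^ getbit(byte, (i + 6) % 8)
               ^ getbit(byte, (i + 7) % 8)   # five-input XOR of the extracted bits
               ^ getbit(0x63, i))            # constant bit c_i
        out |= bit << i
    return out


def sbox(byte):
    """S-box value: affine transformation of the multiplicative inverse."""
    return affine(gf_inverse(byte))
```

Each output bit needs five bit extractions and one five-input XOR, which is why single-cycle versions of those two operations shorten this computation.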

(2) Row shifting ShiftRows()

In encryption, this transformation is called row shifting and shifts to the left. The number of positions shifted depends on the row number (0, 1, 2 or 3) of the state matrix: row 0 is not shifted at all, and the last row is shifted by three bytes. Note that the row-shift transformation operates on one row at a time.
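A minimal sketch of the row shift described above, assuming the state is held as a list of four rows of four bytes each (the representation and the function name are choices of this example):

```python
def shift_rows(state):
    """Rotate row r of the 4x4 state matrix left by r bytes (r = 0, 1, 2, 3)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


state = [
    [0x00, 0x01, 0x02, 0x03],   # row 0: not shifted
    [0x10, 0x11, 0x12, 0x13],   # row 1: shifted left by one byte
    [0x20, 0x21, 0x22, 0x23],   # row 2: shifted left by two bytes
    [0x30, 0x31, 0x32, 0x33],   # row 3: shifted left by three bytes
]
shifted = shift_rows(state)
```

After the call, row 1 begins with 0x11 and ends with 0x10, while row 0 is unchanged.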

(3) Column mixing MixColumns()

In the AES encryption algorithm, column mixing is realized by matrix multiplication (matrix A is the column-mixing transformation matrix). The process is as follows:

(\begin{matrix} s_{11}^{'} & s_{12}^{'} & s_{13}^{'} & s_{14}^{'} \\ s_{21}^{'} & s_{22}^{'} & s_{23}^{'} & s_{24}^{'} \\ s_{31}^{'} & s_{32}^{'} & s_{33}^{'} & s_{34}^{'} \\ s_{41}^{'} & s_{42}^{'} & s_{43}^{'} & s_{44}^{'} \end{matrix}) = (\begin{matrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{matrix}) \times (\begin{matrix} s_{11} & s_{12} & s_{13} & s_{14} \\ s_{21} & s_{22} & s_{23} & s_{24} \\ s_{31} & s_{32} & s_{33} & s_{34} \\ s_{41} & s_{42} & s_{43} & s_{44} \end{matrix})

The column-mixing transformation matrices used in the encryption process and in the decryption process are, respectively:

(\begin{matrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{matrix})

(\begin{matrix} 0x0e & 0x0b & 0x0d & 0x09 \\ 0x09 & 0x0e & 0x0b & 0x0d \\ 0x0d & 0x09 & 0x0e & 0x0b \\ 0x0b & 0x0d & 0x09 & 0x0e \end{matrix})

The column-mixing operation converts each column of the state matrix into a new column. The bytes in each column and in the constant transformation matrix are all treated as 8-bit words (or polynomials) representing elements of GF(2⁸). Multiplication of bytes in the matrix is realized by multiplication in the Galois field GF(2⁸), i.e. polynomial multiplication; to guarantee that the product is still in the field, it must be reduced in GF(2⁸) modulo m(x) = (100011011), i.e. x⁸ + x⁴ + x³ + x + 1. Addition in the matrix is realized by XOR. That is, any eight-bit binary number A = b₇b₆b₅b₄b₃b₂b₁b₀ in the field is converted into a polynomial A = f(x) = b₇x⁷ + b₆x⁶ + b₅x⁵ + b₄x⁴ + b₃x³ + b₂x² + b₁x + b₀. Multiplying it by x gives:

x \cdot f(x) = (b_{7}x^{8} + b_{6}x^{7} + b_{5}x^{6} + b_{4}x^{5} + b_{3}x^{4} + b_{2}x^{3} + b_{1}x^{2} + b_{0}x) \bmod m(x)

The corresponding multiplication in the field is:

A = 00000010, B = b₇b₆b₅b₄b₃b₂b₁b₀

A \cdot B = \begin{cases} b_{6}b_{5}b_{4}b_{3}b_{2}b_{1}b_{0}0, & b_{7} = 0 \\ b_{6}b_{5}b_{4}b_{3}b_{2}b_{1}b_{0}0 \oplus 00011011, & b_{7} = 1 \end{cases}

Multiplication by a polynomial of degree higher than one can be realized by applying the formula above repeatedly, so a multiplication in GF(2⁸) is the sum (XOR) of several intermediate results.
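The repeated multiplication by x can be sketched in Python as follows. This is an illustrative model only; the names xtime, gf_mul, MIX and mix_column are assumptions and not part of the utility model. The matrix MIX is the encryption column-mixing matrix given above.

```python
def xtime(b):
    """Multiply a byte by x (02) in GF(2^8), reducing modulo x^8+x^4+x^3+x+1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def gf_mul(a, b):
    """GF(2^8) product as the XOR of intermediate results a*x^k for set bits of b."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


MIX = [
    [0x02, 0x03, 0x01, 0x01],
    [0x01, 0x02, 0x03, 0x01],
    [0x01, 0x01, 0x02, 0x03],
    [0x03, 0x01, 0x01, 0x02],
]


def mix_column(col):
    """Multiply one 4-byte state column by the column-mixing matrix."""
    out = []
    for row in MIX:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)  # addition in the field is XOR
        out.append(acc)
    return out
```

The conditional XOR inside gf_mul is the statement that the ifand instruction described later replaces.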

(4) Round key addition AddRoundKey()

The key addition also processes one column at a time, which is similar to column mixing. Column mixing multiplies a constant square matrix by each state column, whereas key addition adds a round key word to each state column. In essence, the key addition transformation XORs each column of the state matrix with the corresponding key word. s_{c} and w_{4 \cdot round + c} are 4 × 1 column matrices, and the key addition is carried out according to the following formula:

s'_{c} = s_{c} \oplus w_{4 \cdot round + c}, \quad 0 \le c \le 3.
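A minimal Python sketch of the key addition, assuming the state is stored as four columns of four bytes and the expanded key as a list of four-byte words indexed as w[4*round + c]. The function name and data layout are illustrative assumptions.

```python
def add_round_key(state, w, rnd):
    """Return the state after XORing column c with key word w[4*rnd + c]."""
    return [
        [s ^ k for s, k in zip(state[c], w[4 * rnd + c])]
        for c in range(4)
    ]


# Illustrative data: a 128-bit block and the first four key words.
state = [
    [0x32, 0x43, 0xF6, 0xA8],
    [0x88, 0x5A, 0x30, 0x8D],
    [0x31, 0x31, 0x98, 0xA2],
    [0xE0, 0x37, 0x07, 0x34],
]
w = [
    [0x2B, 0x7E, 0x15, 0x16],
    [0x28, 0xAE, 0xD2, 0xA6],
    [0xAB, 0xF7, 0x15, 0x88],
    [0x09, 0xCF, 0x4F, 0x3C],
]
out = add_round_key(state, w, 0)
```

Because XOR is its own inverse, applying the same round key a second time restores the original state.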

(5) Key expansion KeyExpansion()

To create the round key for each round, the AES algorithm takes the externally supplied key K (whose length is Nk words) and uses a key expansion routine to generate the keys. If the number of rounds is Nr, the key expansion routine creates Nr+1 round keys of 128 bits each from one 128-bit cipher key. The first round key is used for the pre-round transformation, and the remaining round keys are used for the last transformation at the end of each round. The key expansion routine creates the round keys word by word, where a word is an array of four bytes; the routine creates 4 × (Nr+1) words. Three modules are involved: (a) word rotation RotWord(), which shifts the four-byte sequence [a0, a1, a2, a3] left by one byte into [a1, a2, a3, a0]; (b) word substitution SubWord(), which applies the S-box transformation to each byte of a four-byte input word [a0, a1, a2, a3] and outputs the result; (c) round constant Rcon[]. Rcon[i] denotes the 32-bit hexadecimal string [x^{i-1}, 00, 00, 00], where x = (02) and x^{i-1} is the hexadecimal representation of the (i-1)-th power of x = (02) in GF(2⁸). Rcon[1] = [01000000], Rcon[2] = [02000000], Rcon[3] = [04000000], Rcon[4] = [08000000], ..., Rcon[10] = [36000000].

The first Nk words of the expanded key are the external key K itself. Each later word w_{i} equals the XOR of its previous word w_{i-1} and the word Nk positions earlier, w_{i-Nk}, i.e. w_{i} = w_{i-1} xor w_{i-Nk}. If i is a multiple of Nk, however, then w_{i} = w_{i-Nk} xor SubWord(RotWord(w_{i-1})) xor Rcon[i/Nk].
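The expansion rule above can be sketched in Python for the 128-bit key case (Nk = 4, Nr = 10). This is an illustrative sketch, not the routine used in the utility model: the S-box is generated here from the multiplicative inverse and the affine transformation rather than stored as a table, and all function names are assumptions.

```python
def xtime(b):
    """Multiply a byte by x (02) in GF(2^8)."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def sbox_entry(a):
    """Affine transformation (constant 0x63) of the inverse of a; 0 maps to 0."""
    inv = 0
    if a:
        inv = 1
        for _ in range(254):  # a^254 is the inverse of a
            inv = gf_mul(inv, a)
    out = 0
    for i in range(8):
        bit = (0x63 >> i) & 1
        for k in (0, 4, 5, 6, 7):
            bit ^= (inv >> ((i + k) % 8)) & 1
        out |= bit << i
    return out


SBOX = [sbox_entry(a) for a in range(256)]


def rot_word(word):
    return word[1:] + word[:1]


def sub_word(word):
    return [SBOX[b] for b in word]


def rcon(i):
    """Rcon[i] = [x^(i-1), 00, 00, 00] with x = 02."""
    x = 1
    for _ in range(i - 1):
        x = xtime(x)
    return [x, 0, 0, 0]


def key_expansion(key, nk=4, nr=10):
    """Expand a 4*nk-byte key into 4*(nr+1) four-byte words."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    for i in range(nk, 4 * (nr + 1)):
        temp = w[i - 1]
        if i % nk == 0:
            temp = [a ^ b for a, b in zip(sub_word(rot_word(temp)), rcon(i // nk))]
        w.append([a ^ b for a, b in zip(w[i - nk], temp)])
    return w


words = key_expansion(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
```

With the 128-bit example key from the AES standard, the first derived word is a0fafe17.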

After the high-level program code of the AES encryption/decryption algorithm is compiled, the resulting assembly code is analyzed and 5 new instructions are designed:

(1) Bit extraction in S-box byte substitution. Each affine transformation needs to extract every bit of an eight-bit binary number, but ARM processors and other commonly used embedded processors have no direct bit-extraction instruction. The bit-extraction operation of the affine transformation is:

y = (x >> i) and 0x1

That is, x is first shifted right by i bits and then ANDed with 0x1 to obtain bit i of x, and the result is assigned to y. The corresponding ARM assembly is, in concrete terms:

move r1, r0

leftshift r1, r1, (31-i)

rightshift r1, r1, 31

These are replaced by the new instruction getbit: getbit r1 = r0, i

For the bit-extraction operation, a new instruction getbit <dest> = <src>, <bitpos> is designed to replace the three instructions above. The new instruction takes bit bitpos from the src register and stores it in the lowest bit of the dest register.
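A behavioural sketch in Python of the getbit instruction next to the three-instruction sequence it replaces, assuming 32-bit registers and a logical (zero-filling) right shift. The function names are illustrative.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as 32-bit values


def getbit(src, bitpos):
    """Model of getbit <dest> = <src>, <bitpos>: returns bit bitpos of src."""
    return (src >> bitpos) & 0x1


def getbit_three_instructions(r0, i):
    """Model of the move / leftshift / rightshift sequence."""
    r1 = r0                          # move r1, r0
    r1 = (r1 << (31 - i)) & MASK32   # leftshift r1, r1, (31-i)
    r1 = r1 >> 31                    # rightshift r1, r1, 31 (logical)
    return r1


bits = [getbit(0xC4, i) for i in range(8)]  # 0xC4 = 11000100
```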

(2) In S-box byte substitution, after all bits of the eight-bit binary number have been extracted, each bit is replaced by the XOR of five operands. Each inverse element has 8 bits, so 8 five-operand XOR operations are needed. The XOR operation of the affine transformation is:

y[i] = x[i] xor x[(i+4) mod 8] xor x[(i+5) mod 8] xor x[(i+6) mod 8] xor x[(i+7) mod 8]

That is, bit i of x is XORed with bits (i+4) mod 8 through (i+7) mod 8. Compiled into ARM assembly, this is, in concrete terms:

r1 = r2 xor r3

r1 = r1 xor r4

r1 = r1 xor r5

r1 = r1 xor r6

These are replaced by the new instruction xor_5: xor_5 r1 = r2, r3, r4, r5, r6

For the five-operand XOR operation, a new instruction xor_5 <dest> = <src1>, <src2>, <src3>, <src4>, <src5> is designed to replace the four instructions above. The new instruction XORs the contents of the registers denoted by src1 to src5 and stores the result in the register denoted by dest. Because of the 32-bit instruction length limit of the ARM processor, each of src1 to src5 occupies only four bits; if five bits were used for each, the 32-bit instruction length limit would be exceeded.
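A behavioural sketch of xor_5 in Python, together with the bit-budget arithmetic behind the four-bit operand fields. The assumption that dest uses the same field width as the five sources (six register fields in total) is made here for illustration and is not stated in the text; the function names are likewise illustrative.

```python
def xor_5(src1, src2, src3, src4, src5):
    """Model of xor_5 <dest> = <src1>, ..., <src5>: XOR of five operands."""
    return src1 ^ src2 ^ src3 ^ src4 ^ src5


def affine_bit(x, i):
    """y[i] = x[i] xor x[(i+4) mod 8] xor ... xor x[(i+7) mod 8]."""
    bits = [(x >> ((i + k) % 8)) & 1 for k in (0, 4, 5, 6, 7)]
    return xor_5(*bits)


INSTRUCTION_BITS = 32


def opcode_bits_left(field_bits, fields=6):
    """Bits left for the opcode when `fields` register fields are encoded.

    fields=6 assumes one dest field plus five source fields of equal width.
    """
    return INSTRUCTION_BITS - fields * field_bits
```

Under this assumption, four-bit fields leave 8 bits for the opcode, while five-bit fields would leave only 2.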

(3) Column mixing uses multiplication in the Galois field GF(2⁸). An element of the data block is multiplied by one of the 256 elements of the S-box and the products are accumulated; this process must also loop 8 times, so the computation is very time-consuming. Analysis of the field multiplication shows that the following statement can be optimized by instruction extension:

if (a and b) is not 0 then

c = c xor d

A statement of this kind is compiled by the ARM compiler into four assembly instructions, in concrete terms:

and r1, r1, r0

cmp r1, 0

equal jump

xor r2, r2, r3

These are replaced by the new instruction ifand: ifand r1, r0, r2, r3

For the multiplication operation, a new instruction ifand <src1>, <src2>, <xor_src1>, <xor_src2> is designed to replace the four instructions above. The new instruction ANDs src1 with src2; if the result is not 0, xor_src1 and xor_src2 are XORed and the result is stored in xor_src1; if the result is 0, no XOR is performed.
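The behaviour of ifand, and how it fits the inner loop of a GF(2⁸) multiplication, can be sketched in Python as follows. The loop structure of gf_mul is an assumption about how the instruction would be used, not code taken from the utility model.

```python
def ifand(src1, src2, xor_src1, xor_src2):
    """Model of ifand: returns the new value of xor_src1.

    If (src1 AND src2) is not 0, xor_src1 becomes xor_src1 XOR xor_src2;
    otherwise xor_src1 is left unchanged.
    """
    if src1 & src2:
        return xor_src1 ^ xor_src2
    return xor_src1


def gf_mul(a, b):
    """GF(2^8) multiplication with the conditional XOR expressed via ifand."""
    c = 0
    for _ in range(8):
        c = ifand(b, 0x01, c, a)  # add a to the result if the low bit of b is set
        a <<= 1
        if a & 0x100:
            a ^= 0x11B            # reduce modulo x^8 + x^4 + x^3 + x + 1
        b >>= 1
    return c
```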

(4) Column mixing performs matrix multiplication and must repeatedly locate data in a matrix. In physical memory the matrix data are stored linearly. Looking up the data at row i, column j of the matrix S-box is expressed in C as follows:

x = sbox[i][j]

Compiled into ARM assembly, this is, in concrete terms:

move r1, sbox

move r2, i

mul r2 = r2, n

add r2 = r2, j

add r2 = r2, r1

load r1 = r2

These are replaced using the new instruction matrixpos:

move r1, sbox

matrixpos r1 = r1, i, j, n

The ARM assembly above computes, from i and j, the offset of the data in physical memory relative to the base address sbox, looks up the data at that offset, and finally assigns the data to r1. For looking up data in a matrix, a new instruction matrixpos <dest> = <src1>, <src2>, <src3>, <src4> is designed to replace the last five instructions above. The new instruction looks up the data at a specified position in a matrix.
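A behavioural sketch of matrixpos in Python, assuming memory is modelled as a flat list and the matrix is stored row by row with n elements per row. The names memory and BASE are hypothetical.

```python
def matrixpos(memory, base, i, j, n):
    """Model of matrixpos <dest> = <src1>, <src2>, <src3>, <src4>.

    Loads the element at row i, column j of a matrix with n columns that is
    stored linearly starting at address `base`.
    """
    return memory[base + i * n + j]


# Hypothetical memory image: a 16x16 table placed at address 0x100,
# where the entry at (row, col) holds row*16 + col for easy checking.
BASE = 0x100
memory = [0] * BASE + [row * 16 + col for row in range(16) for col in range(16)]

value = matrixpos(memory, BASE, 0xC, 0x4, 16)
```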

(5) Column mixing makes heavy use of data exchange. The swap operation is expressed as follows:

c = a; a = b; b = c;

Compiled into ARM assembly, this is, in concrete terms:

mov r2, r0

mov r0, r1

mov r1, r2

These are replaced by the new instruction swap: swap r1, r2

A new assembly instruction swap <src1>, <src2> is designed to replace the three instructions above. The new instruction exchanges the source operands src1 and src2, i.e. the value of src1 is assigned to src2 and at the same time the value of src2 is assigned to src1.
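A behavioural sketch in Python comparing the three-move sequence with the single swap instruction, assuming the register file is modelled as a list indexed by register number. The function names are illustrative.

```python
def swap_three_moves(regs, a, b, tmp):
    """Model of the mov / mov / mov sequence using a temporary register."""
    regs[tmp] = regs[a]
    regs[a] = regs[b]
    regs[b] = regs[tmp]


def swap(regs, src1, src2):
    """Model of swap <src1>, <src2>: both assignments happen at once."""
    regs[src1], regs[src2] = regs[src2], regs[src1]


regs_a = [10, 20, 0, 0]
swap_three_moves(regs_a, 0, 1, 2)   # clobbers register 2

regs_b = [10, 20, 0, 0]
swap(regs_b, 0, 1)                  # needs no temporary register
```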

Based on the extended instruction set, an application-specific instruction processor model for the AES algorithm (AES_ASIP) is built with a processor generation tool based on the LISA language, and corresponding execution-unit hardware models are designed for the newly extended instructions. The completed processor model is implemented on an FPGA to complete the physical verification of this utility model.

In designing new instructions, it must be ensured that the extended instructions remain compatible with the instruction format of the original model (such as the ARM processor instruction set); only then can the decoding unit operate normally. For example, in a 32-bit processor model the opcode and operand bits of an instruction total at most 32, so when a new instruction is designed the sum of its opcode and operand bits must not exceed 32. In addition, the execution unit of a new instruction must not be too complex; otherwise it introduces a long delay during execution and lowers the operating speed of the system. Taking all of these factors into account, the design constraints for new instructions are as follows:

(a) the opcode length of a new instruction must be the same as the opcode length of the instructions in the original processor model;

(b) the total number of opcode and operand bits of a new instruction must not exceed the instruction width of the original instruction set;

(c) the execution unit of a new instruction must not be too complex, and the execution of a new instruction must not lower the operating speed of the system;

(d) the number of newly extended instructions must not be too large, so as to limit the resulting hardware resource overhead.

Provided the above conditions are satisfied, extending the instruction set architecture for a specific algorithm greatly reduces the time spent on instruction fetch and decoding: the fetch and decode work of N original instructions is now done by a single instruction, so the fetch and decode process is shortened by (N-1) × 2 clock cycles. As for the execution unit, before the instruction extension several execution units had to cooperate over several clock cycles to complete a function; after the extension a single new execution unit completes it in one clock cycle. At the same time, pipeline bubbles caused by dependences between instructions can be avoided to some extent.
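As a small worked example of the (N-1) × 2 estimate, the following Python sketch applies it to the five new instructions, using the number of original instructions each one is stated to replace. It assumes one fetch cycle and one decode cycle per instruction, which is how the factor 2 is read here.

```python
def fetch_decode_cycles_saved(n):
    """Fetch and decode cycles saved when n instructions become one."""
    return (n - 1) * 2


# Number of original instructions replaced by each new instruction,
# as stated in the descriptions of the five instructions.
REPLACED = {"getbit": 3, "xor_5": 4, "ifand": 4, "matrixpos": 5, "swap": 3}

saved = {name: fetch_decode_cycles_saved(n) for name, n in REPLACED.items()}
```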

Fig. 2 describes the application-specific instruction processor design flow based on the electronic system level (ESL) methodology. First, the C-SPY debugger in the IAR cross-compiler environment is used to inspect the assembly code produced by compiling the optimized high-level program code of the target algorithm, and a complexity analysis tool for algorithm modules is used to find sequences of consecutive assembly instructions that are used frequently and have a large influence on the running time of the algorithm. On the basis of satisfying the four constraints above, one new assembly instruction is designed to implement each such sequence. Based on the extended instruction set, an application-specific instruction processor model for the target algorithm is built with a processor generation tool based on the LISA language, and corresponding execution-unit hardware models are designed for the newly extended instructions. After a series of simulation and verification steps, the processor model is finally implemented on an FPGA, completing the physical verification of this optimization method.

Fig. 3 shows the block diagram of the AES_ASIP processor model designed for the AES algorithm. The hardware structure of the model mainly consists of four parts: data memory 1 (Data_RAM), code memory 3 (Prog_RAM), register file 2 (Registers) and pipeline 4 (Pipe). The address space of data memory 1 is defined as 0x0000-0x7FFF, 32K in size. The address space of code memory 3 is defined as 0x8000-0xFFFF, 32K in size. Register file 2 consists of 32 general-purpose registers (GPR[0...31]), 1 fetch-address register (FPR), 1 stack pointer register (SPR) and 1 link register (LR). The pipeline has three stages: fetch (FE), decode (DC) and execute (EX). Pipeline controller 14 (Pipe_Ctl) is mainly responsible for controlling jump instructions: a jump instruction only needs to store the jump address in the fetch-address register (FPR) and need not pass through the execution unit; the pipeline buffers are then flushed to prevent the execution unit from executing the jump instruction. In the decode and execute units of the AES_ASIP processor, a dedicated instruction decoder (Decode_AES_EX) and executor (AES_EX) for the AES algorithm are added to decode and execute the extended instructions.

The instruction set architecture of the model consists of 27 instructions, as shown in Table 1:

Table 1: Instruction set architecture of the AES application-specific instruction processor

Category	Instructions
General ALU instructions	mov, add, sub, or, and, xor, shl, shr, nop
Jump instructions	jp, jeq, jne, jl, jle, ja, jae
Memory access instructions	stb, sth, stw, ldb, ldh, ldw
AES-specific instructions	getbit, xor_5, ifand, matrixpos, swap

Table 2 below shows the formats of the new instructions.

Table 2: Formats of the new instructions extended for AES

Figs. 4-8 show the execution-unit hardware models corresponding to the 5 new instructions, respectively.

Fig. 4 shows the execution unit of the getbit instruction. In the figure, r0, i and r1_addr are input signals. The shifter rsh shifts the data in register r0 right by i bits, and an AND gate then ANDs the shifted result with 0x0001 so that only the least significant bit after the shift is kept. The output is controlled by the multiplexer mux. A_getbit_EX_in is a control signal output by the decode stage that decides whether the input address r1_addr is valid. If A_getbit_EX_in is 1, the multiplexer mux selects r1_addr and the AND result is saved to register r1; if it is 0, the multiplexer mux selects an invalid address and no register store is performed.

In detail: the execution unit of the extended instruction getbit comprises 1 shifter 18, 1 AND gate I 19 and 1 multiplexer I 20, and its execution ends at general-purpose register file 17. The inputs of shifter 18 receive general-purpose register r0 and the 4-bit value i; the maximum value of i is 31, and i indicates the number of bit positions by which the register is shifted. The result shifted by shifter 18 is ANDed with 0x00000001 by AND gate I 19, whose output is a 32-bit value in which the lowest bit holds bit i of r0 and all other bits are 0. The control signal getbit_exe controls multiplexer I 20, which receives both a 5-bit zero and the address of general-purpose register r1 and selects between these addresses. When the control signal is 1, multiplexer I 20 passes the address of r1 to register file 17, so that the output of AND gate I 19 is assigned to r1. When the control signal is 0, multiplexer I 20 passes the 5-bit zero to register file 17, i.e. a null address; on detecting the null address the processor performs no assignment. getbit_exe is a control signal issued by the decode stage that decides whether the getbit operation is executed.

Fig. 5 shows the execution unit of the xor_5 instruction. In the figure, r2, r3, r4, r5, r6 and r1_addr are input signals; the data in registers r2, r3, r4, r5 and r6 are XORed by a five-operand XOR unit. A_xor_5_EX_in is a control signal output by the decode stage that decides whether the input address r1_addr is valid. If A_xor_5_EX_in is 1, the multiplexer mux selects r1_addr and the result is saved to register r1; if it is 0, the multiplexer mux selects an invalid address and no register store is performed.

In detail: the execution unit of the extended instruction xor_5 comprises 1 XOR circuit group 21 and 1 multiplexer II 22, and its execution ends at general-purpose register file 17. XOR circuit group 21 consists of a series of XOR logic gates; its inputs receive the data of general-purpose registers r2, r3, r4, r5 and r6, and its output is the XOR of the five values. The control signal xor5_exe controls multiplexer II 22, which receives both a 5-bit zero and the address of general-purpose register r1 and selects between these addresses. When the control signal is 1, multiplexer II 22 passes the address of r1 to register file 17, so that the output of XOR circuit group 21 is assigned to r1. When the control signal is 0, multiplexer II 22 passes the 5-bit zero to register file 17, i.e. a null address; on detecting the null address the processor performs no assignment. xor5_exe is a control signal issued by the decode stage that decides whether the xor_5 operation is executed.

Fig. 6 shows the execution unit of the ifand instruction. In the figure, r0, r1, r2, r3 and r2_addr are the input signals of the execution unit, and the output is selected by the multiplexer mux. ifand_exe is a control signal issued by the decode stage that decides whether the ifand operation is executed. As the circuit in the figure shows, if the AND of r0 and r1 is non-zero and ifand_exe is 1, the multiplexer mux selects r2_addr so that the XOR of r2 and r3 is written to register r2; otherwise the multiplexer mux outputs an invalid address and no store is performed.

In detail: the execution unit of the extended instruction ifand comprises 2 AND gates (AND gate II 23 and AND gate III 25), 1 OR gate I 24, 1 multiplexer II 26 and 1 XOR logic gate 27, and its execution ends at general-purpose register file 17, which is shared by the whole processor. AND gate II 23 receives r0 and r1 as inputs, and its output is the 32-bit value obtained by ANDing r0 with r1. OR gate I 24 ORs together all bits of this 32-bit output of AND gate II 23 and produces a 1-bit value. This value and the control signal ifand_exe are the inputs of AND gate III 25, and the output of AND gate III 25 is the select input of multiplexer II 26, which controls the address selection. If the output of AND gate III 25 is 1, multiplexer II 26 passes the address of general-purpose register r2 to register file 17, so that the XOR of r2 and r3 produced by XOR logic gate 27 is assigned to r2. If the output of AND gate III 25 is 0, multiplexer II 26 passes a 5-bit zero to register file 17, i.e. a null address; on detecting the null address the processor performs no assignment. ifand_exe is a control signal issued by the decode stage that decides whether the ifand operation is executed.

Fig. 7 shows the execution unit of the matrixpos instruction. In the figure, r1, i, n and j are input signals; a multiplier and an adder compute the matrix position r1+i*n+j. A_matrixpos_EX_in is a control signal output by the decode stage. If it is 1, the multiplexer mux passes the computed effective address to the address lines of the data memory Data_RAM, and the memory data at that address are stored into register r1; if it is 0, an invalid address is passed and no read is performed.

In detail: the execution unit of the extended instruction matrixpos comprises 1 multiplier 28, 1 adder 29 and 1 multiplexer III 30, and its execution ends at data memory 31, which is shared by the whole processor. Multiplier 28 receives i and n as inputs, computes i*n, and passes the result to adder 29 together with the input signals j and r1. The adder adds the three values r1, j and the i*n output by the multiplier, thereby computing the address of the matrix element. The output of adder 29 is an input of multiplexer III 30, which controls the address selection. If the control input of multiplexer III 30 is 1, multiplexer III 30 passes the address of the matrix element to data memory 31, so that the data at matrix position r1+i*n+j are assigned to r1. If the control input of multiplexer III 30 is 0, multiplexer III 30 passes a 16-bit zero to data memory 31, i.e. a null address; on detecting the null address the processor performs no assignment. A_matrixpos_EX_in is a control signal issued by the decode stage that decides whether the matrixpos operation is executed.

Fig. 8 shows the execution unit of the swap instruction. In the figure, r1, r2, r1_addr and r2_addr are input signals, and A_swap_EX_in is a control signal output by the decode stage. If it is 1, the multiplexers select the addresses so that the value of register r1 is stored into register r2 and, at the same time, the value of register r2 is stored into register r1; if it is 0, an invalid address is passed and no operation is performed.

In detail: the execution unit of the extended instruction swap comprises 2 multiplexers, multiplexer III 32 and multiplexer IV 33, and its execution ends at general-purpose register file 17, which is shared by the whole processor. Multiplexer III 32 receives the select input r1_addr and multiplexer IV 33 receives the select input r2_addr; together they control the address selection for the data being exchanged. If the control signal of multiplexer III 32 is 1, multiplexer III 32 passes the address of general-purpose register r1 to the register file, so that the value of register r1 is assigned to r2. If the control signal of multiplexer III 32 is 0, multiplexer III 32 passes a 5-bit zero to general-purpose register file 17, i.e. a null address; on detecting the null address the processor performs no assignment. If the control signal of multiplexer IV 33 is 1, multiplexer IV 33 passes the address of general-purpose register r2 to the register file, so that the value of register r2 is assigned to r1. If the control signal of multiplexer IV 33 is 0, multiplexer IV 33 passes a 5-bit zero to general-purpose register file 17, i.e. a null address; on detecting the null address the processor performs no assignment. A_swap_EX_in is a control signal issued by the decode stage that decides whether the swap operation is executed.

After the functions and instruction formats of the new extended instructions were determined and the application-specific instruction processor (AES_ASIP) model for the AES instruction set was designed, the design model was verified and its performance evaluated on a Xilinx Virtex5 LX110T FPGA platform.

The AES-Sbox generation method was run on the processor of the patent with application number 201110024766X and on the ARM processor widely used in the embedded field, respectively. The comparison shows that, after optimization by instruction extension, the AES-Sbox generation algorithm occupies only 88 bytes of code storage on the application-specific instruction processor, 38.6% less than on the ARM processor; the number of execution cycles falls from the original 287 clock cycles to 112 clock cycles, an improvement in execution efficiency of 60.9% over the ARM processor. However, S-box generation is only one part of the whole AES algorithm, and optimizing this computing module alone does not improve the efficiency of the whole algorithm very noticeably: the code space is reduced by 9.3% and the execution efficiency is improved by 13.7%.

The multiplicative inverse optimization instruction and the bit-extraction optimization instruction involved there can, with slight modification, also be applied to the operations of other computing modules of the AES algorithm (such as the multiplication and bit-extraction operations of the column-mixing module). Combining the multiplicative inverse instruction and the bit-extraction instruction with the three newly added optimization instructions, the new optimized instruction set effectively improves the execution efficiency of the algorithm and greatly reduces the storage space needed for its instruction code.

Table 3: Comparison of experimental results for ARM9 and AES_ASIP

The following can be seen from Table 3: (1) Execution efficiency: taking the AES encryption algorithm as an example, cycle-level simulation shows that the number of clock cycles needed to run the AES encryption algorithm on AES_ASIP is 57.3x% lower than on the ARM processor, which greatly improves the execution efficiency of the algorithm; (2) Code space: taking the AES encryption algorithm as an example, the instruction code occupies 783 bytes of memory on the ARM processor but only 416 bytes on AES_ASIP, saving 46.6x% of code storage space. Optimizing the instructions of an algorithm by extending the instruction set architecture can therefore effectively reduce the memory space needed to store the algorithm code.

Statistics over the experimental results show that, compared with the ARM processor, AES_ASIP improves execution efficiency by 58.4% and saves 47.4% of memory space. Before the instruction extension the AES algorithm used 86816 cells of processor resources; after the instruction extension AES_ASIP uses 93038 cells, so the hardware resources occupied increase by 7.2%.
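The percentages for the hardware overhead and for the S-box cycle count follow from the raw figures by a relative-change calculation, sketched here in Python. The helper name relative_change is illustrative; only the raw numbers come from the text.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Hardware resources: 86816 cells before the extension, 93038 cells after.
hardware_increase = 100 * relative_change(86816, 93038)

# S-box generation: 287 clock cycles on ARM, 112 on the extended processor.
sbox_cycle_reduction = -100 * relative_change(287, 112)
```

The hardware increase comes out at roughly 7.2%, and the S-box cycle reduction at just under 61%, in line with the figure of 60.9% reported for the S-box generation algorithm.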


Claims

1. An instruction-optimized processor for the AES symmetric encryption algorithm, characterized in that it mainly consists of four parts: a data memory, a code memory, a register file and a pipeline;

wherein the pipeline comprises an instruction fetch unit, a decoding unit, an execution unit and a pipeline controller; the output of the instruction fetch unit is connected to the input of pipeline register I; the output of pipeline register I is connected to the input of the decoding unit; the output of the decoding unit is connected to the input of pipeline register II, and the output of pipeline register II is connected to the input of the execution unit; the data memory is bidirectionally connected to the execution unit; the output of the code memory is connected to the input of the instruction fetch unit; the outputs of the pipeline controller are connected to the inputs of the register file, pipeline register I and pipeline register II, respectively; and the output of the decoding unit is connected to the input of the pipeline controller.

2. The instruction-optimized processor for the AES symmetric encryption algorithm as claimed in claim 1, characterized in that the register file consists of 32 general-purpose registers, 1 fetch-address register, 1 stack pointer register and 1 link register.