CN114710285A - High-performance SM4 bit slice optimization method for heterogeneous parallel architecture

Info

Publication number: CN114710285A
Application number: CN202210542472.4A
Authority: CN
Inventors: 关志; 陈钟; 何逸飞; 王珂; 孙磊; 齐向东; 刘勇; 孔坚
Original assignee: Peking University; Qianxin Technology Group Co Ltd
Current assignee: Peking University; Qianxin Technology Group Co Ltd
Priority date: 2022-05-19
Filing date: 2022-05-19
Publication date: 2022-07-05
Anticipated expiration: 2042-05-19
Also published as: CN114710285B

Abstract

The invention discloses a high-performance SM4 bit-slicing optimization method for heterogeneous parallel architectures, belonging to the technical field of cryptographic security applications. By implementing the SM4 block cipher at a data width of 1 bit, the method realizes multi-threaded SM4 on both non-vector and vector instruction sets, with vector instruction sets supporting higher encryption speeds.

Description

High-performance SM4 bit slice optimization method for heterogeneous parallel architecture

Technical Field

The invention belongs to the technical field of cryptographic security applications and relates to a high-performance SM4 bit-slicing optimization method for heterogeneous parallel architectures.

Background

SM4 is the block cipher adopted in WAPI, China's wireless local area network standard, and was subsequently adopted as a Chinese commercial cipher standard. As such, SM4 is expected to gradually replace foreign block cipher standards such as 3DES and AES in sensitive but non-classified applications in China, where it is used for communication encryption, data encryption and similar purposes. SM4 is a symmetric cipher whose key length and block length are both 128 bits; it outputs a 128-bit ciphertext.

The operation symbols used and their meanings are given below:

mod: modulo operation;

$\&$: 32-bit AND operation;

$|$: 32-bit OR operation;

$\sim$: 32-bit NOT operation;

$\oplus$: 32-bit exclusive-OR (XOR) operation;

$+$: arithmetic addition mod $2^{32}$;

$\lll i$: 32-bit cyclic left shift by $i$ bits;

$\leftarrow$: left assignment operator;

$GF(2^n)$: the finite field containing $2^n$ elements.

The key expansion algorithm is as follows:

The word length of the standard SM4 algorithm is 32 bits. The 128-bit encryption key is represented as 4 words $MK = (MK_0, MK_1, MK_2, MK_3)$; the round keys are represented as 32 words $(rk_0, rk_1, \ldots, rk_{31})$; the plaintext input is represented as 4 words $(X_0, X_1, X_2, X_3)$, and the ciphertext output as $(Y_0, Y_1, Y_2, Y_3)$.

SM4 key expansion algorithm:

1) Set 4 words $(K_0, K_1, K_2, K_3) = (MK_0 \oplus FK_0, MK_1 \oplus FK_1, MK_2 \oplus FK_2, MK_3 \oplus FK_3)$;

2) The round keys are generated as $rk_i = K_{i+4} = K_i \oplus T'(K_{i+1} \oplus K_{i+2} \oplus K_{i+3} \oplus CK_i)$, $i = 0, 1, \ldots, 31$;

where $T'$ is a composite permutation function consisting of a linear transformation $L'$ and a nonlinear transformation $\tau$, and $FK_i$ and $CK_i$ are all constants.

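The encryption algorithm is as follows:

1) 32 iterations: $X_{i+4} = F(X_i, X_{i+1}, X_{i+2}, X_{i+3}, rk_i) = X_i \oplus T(X_{i+1} \oplus X_{i+2} \oplus X_{i+3} \oplus rk_i)$, $i = 0, 1, \ldots, 31$;

2) Output: $(Y_0, Y_1, Y_2, Y_3) = (X_{35}, X_{34}, X_{33}, X_{32})$;

where $T$ is a composite permutation function consisting of a linear transformation $L$ and a nonlinear transformation $\tau$.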
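For reference, the key expansion and encryption steps can be sketched at the ordinary 32-bit word level, before any bit slicing is applied. The following C sketch is a minimal illustration and not the implementation of the invention: all function names are hypothetical, the constants FK and CK and the S-box table are those of the published SM4 standard, and the check uses the standard's example vector (key and plaintext both 0123456789abcdeffedcba9876543210).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static const uint8_t SBOX[256] = {
    0xd6,0x90,0xe9,0xfe,0xcc,0xe1,0x3d,0xb7,0x16,0xb6,0x14,0xc2,0x28,0xfb,0x2c,0x05,
    0x2b,0x67,0x9a,0x76,0x2a,0xbe,0x04,0xc3,0xaa,0x44,0x13,0x26,0x49,0x86,0x06,0x99,
    0x9c,0x42,0x50,0xf4,0x91,0xef,0x98,0x7a,0x33,0x54,0x0b,0x43,0xed,0xcf,0xac,0x62,
    0xe4,0xb3,0x1c,0xa9,0xc9,0x08,0xe8,0x95,0x80,0xdf,0x94,0xfa,0x75,0x8f,0x3f,0xa6,
    0x47,0x07,0xa7,0xfc,0xf3,0x73,0x17,0xba,0x83,0x59,0x3c,0x19,0xe6,0x85,0x4f,0xa8,
    0x68,0x6b,0x81,0xb2,0x71,0x64,0xda,0x8b,0xf8,0xeb,0x0f,0x4b,0x70,0x56,0x9d,0x35,
    0x1e,0x24,0x0e,0x5e,0x63,0x58,0xd1,0xa2,0x25,0x22,0x7c,0x3b,0x01,0x21,0x78,0x87,
    0xd4,0x00,0x46,0x57,0x9f,0xd3,0x27,0x52,0x4c,0x36,0x02,0xe7,0xa0,0xc4,0xc8,0x9e,
    0xea,0xbf,0x8a,0xd2,0x40,0xc7,0x38,0xb5,0xa3,0xf7,0xf2,0xce,0xf9,0x61,0x15,0xa1,
    0xe0,0xae,0x5d,0xa4,0x9b,0x34,0x1a,0x55,0xad,0x93,0x32,0x30,0xf5,0x8c,0xb1,0xe3,
    0x1d,0xf6,0xe2,0x2e,0x82,0x66,0xca,0x60,0xc0,0x29,0x23,0xab,0x0d,0x53,0x4e,0x6f,
    0xd5,0xdb,0x37,0x45,0xde,0xfd,0x8e,0x2f,0x03,0xff,0x6a,0x72,0x6d,0x6c,0x5b,0x51,
    0x8d,0x1b,0xaf,0x92,0xbb,0xdd,0xbc,0x7f,0x11,0xd9,0x5c,0x41,0x1f,0x10,0x5a,0xd8,
    0x0a,0xc1,0x31,0x88,0xa5,0xcd,0x7b,0xbd,0x2d,0x74,0xd0,0x12,0xb8,0xe5,0xb4,0xb0,
    0x89,0x69,0x97,0x4a,0x0c,0x96,0x77,0x7e,0x65,0xb9,0xf1,0x09,0xc5,0x6e,0xc6,0x84,
    0x18,0xf0,0x7d,0xec,0x3a,0xdc,0x4d,0x20,0x79,0xee,0x5f,0x3e,0xd7,0xcb,0x39,0x48
};

static const uint32_t FK[4] = { 0xA3B1BAC6u, 0x56AA3350u, 0x677D9197u, 0xB27022DCu };

/* 32-bit cyclic left shift, valid for 0 < n < 32. */
static uint32_t rotl32(uint32_t x, unsigned n) { return (x << n) | (x >> (32u - n)); }

/* Nonlinear transformation tau: four S-box lookups, one per byte. */
static uint32_t tau(uint32_t a)
{
    return ((uint32_t)SBOX[a >> 24] << 24) | ((uint32_t)SBOX[(a >> 16) & 0xFF] << 16) |
           ((uint32_t)SBOX[(a >> 8) & 0xFF] << 8) | SBOX[a & 0xFF];
}

/* T = L(tau(.)) for encryption, T' = L'(tau(.)) for key expansion. */
static uint32_t T_enc(uint32_t a)
{
    uint32_t b = tau(a);
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24);
}

static uint32_t T_key(uint32_t a)
{
    uint32_t b = tau(a);
    return b ^ rotl32(b, 13) ^ rotl32(b, 23);
}

/* Constant CK_i: byte j of CK_i is (4i + j) * 7 mod 256. */
static uint32_t ck(unsigned i)
{
    uint32_t c = 0;
    for (unsigned j = 0; j < 4; j++)
        c = (c << 8) | (uint8_t)((4u * i + j) * 7u);
    return c;
}

static void sm4_key_schedule(const uint32_t mk[4], uint32_t rk[32])
{
    uint32_t k[36];
    for (int i = 0; i < 4; i++)
        k[i] = mk[i] ^ FK[i];
    for (unsigned i = 0; i < 32; i++)
        rk[i] = k[i + 4] = k[i] ^ T_key(k[i + 1] ^ k[i + 2] ^ k[i + 3] ^ ck(i));
}

static void sm4_encrypt_block(const uint32_t rk[32], const uint32_t in[4], uint32_t out[4])
{
    uint32_t x[36];
    memcpy(x, in, 4 * sizeof x[0]);
    for (int i = 0; i < 32; i++)
        x[i + 4] = x[i] ^ T_enc(x[i + 1] ^ x[i + 2] ^ x[i + 3] ^ rk[i]);
    for (int i = 0; i < 4; i++)
        out[i] = x[35 - i];              /* output is (X35, X34, X33, X32) */
}

/* Check against the example vector of the SM4 standard. */
static void check_sm4_vector(void)
{
    const uint32_t key[4]    = { 0x01234567u, 0x89ABCDEFu, 0xFEDCBA98u, 0x76543210u };
    const uint32_t expect[4] = { 0x681EDF34u, 0xD206965Eu, 0x86B3E94Fu, 0x536E4246u };
    uint32_t rk[32], ct[4];
    sm4_key_schedule(key, rk);
    sm4_encrypt_block(rk, key, ct);      /* the plaintext equals the key in this vector */
    assert(rk[0] == 0xF12186F9u);
    assert(memcmp(ct, expect, sizeof ct) == 0);
}
```

Each call handles one block with data-dependent table lookups in `tau`; this is the per-instance cost that the bit-slicing method described below is intended to spread over many parallel instances.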

Because SM4 was designed for low-power chips (namely WAPI chips), it was optimized to reduce the hardware circuit count. As a result, its software implementations are inefficient: they struggle to exploit the computing power of mainstream 32-bit/64-bit general-purpose processors and are usually much slower than software implementations of AES, a symmetric cipher of the same class.

Besides a basic general-purpose SISD (Single Instruction, Single Data) instruction set, current mainstream CPUs provide extended SIMD (Single Instruction, Multiple Data) instruction sets. For example, mainstream x86 processors from Intel and AMD support the AVX/AVX2 instruction sets, and mobile processors based on the ARM Cortex-A architecture support the NEON instruction set. AVX2 provides 16 256-bit vector registers, on which 8-way 32-bit word vector operations can be performed. Intel has also published the AVX-512 instruction set, which provides 32 512-bit vector registers and can perform 16-way 32-bit word vector operations. NEON, the SIMD instruction set of the ARM Cortex-A architecture, provides 16 128-bit SIMD registers on which 4-way 32-bit word vector operations can be performed.

Current Intel/AMD x86 processors and ARM processors thus support both non-vector and vector instruction sets. Conventional parallel implementations usually run several SM4 instances at once by using vector registers or multiple cores directly. However, the standard SM4 algorithm needs a relatively wide register (32 bits) per instance, so even a 256-bit vector register can run at most 8 SM4 instances simultaneously. GPUs have a similar problem: for a single thread a GPU is rarely faster than a CPU, and GPU parallelism is limited by the number of stream processors, so the number of threads processed simultaneously cannot be increased further. In scenarios where the speed of any single SM4 encryption thread matters little but many SM4 encryption threads run at once, existing algorithms offer insufficient parallelism and low overall encryption efficiency. An algorithm that uses computing resources more fully and runs multi-threaded SM4 faster is therefore urgently needed.

Disclosure of Invention

To realize a more efficient multi-threaded SM4 algorithm, the invention provides a software bit-slicing optimization of the SM4 block cipher that can be implemented on SISD and SIMD instruction sets and on GPUs. The invention can implement multi-threaded SM4 on either a non-vector or a vector instruction set; the vector instruction set supports higher encryption speeds.

The invention provides a high-performance SM4 bit slice optimization method for a heterogeneous parallel architecture, which comprises the following steps:

1) dividing each 32-bit variable of the original standard SM4 algorithm, in order, into 32 variables with a word length of 1 bit;

2) converting the linear operations of the original standard SM4 algorithm, namely the XOR of 32-bit words and the cyclic left shift of 32-bit words, into XOR and transposition among the 32 1-bit variables defined in step 1);

3) decomposing the nonlinear operation of the original standard SM4 algorithm, the S-box, into matrix affine transformations and inversion in the finite field $GF(2^8)$;

4) for the inversion in $GF(2^8)$, using a finite-field tower-structure transformation: $GF(2^8)$ is isomorphically mapped by a matrix affine transformation onto the composite field $GF((2^4)^2)$, so that the inversion is converted into inversion and multiplication in $GF(2^4)$, which are implemented at a word length of 1 bit;

5) for $GF(2^4)$, again using a finite-field tower-structure transformation: $GF(2^4)$ is isomorphically mapped by a matrix affine transformation onto the composite field $GF((2^2)^2)$, so that its operations are converted into inversion and multiplication in $GF(2^2)$;

6) for $GF(2^2)$, inversion is equivalent to leaving the high-order bit unchanged and setting the low-order bit to the XOR of the high-order and low-order bits, so it is computed at a word length of 1 bit;

7) following steps 1) to 6), the whole SM4 algorithm is computed using only a 1-bit word length together with XOR and AND operations, so that an X-bit register can be used as an X-lane vector register, realizing multi-threaded parallel computation of X instances of the SM4 algorithm.

The method is a software bit-slicing optimization of the SM4 block cipher. Through linear analysis of the algorithm, the linear operations that the original standard SM4 algorithm performs on 32-bit words are converted into equivalent linear operations on 1-bit words; the nonlinear operations are mapped onto nonlinear operations on 1-bit words by constructing finite-field isomorphisms. The whole SM4 algorithm is thereby converted from a 32-bit-word implementation into a 1-bit-word implementation. On a non-vector or vector instruction set platform with wider registers, each bit of a register is then assigned to a different SM4 thread (for example, a 32-bit register splits into 32 single bits and so carries 32 SM4 threads), giving a multi-threaded SM4 algorithm with higher parallelism. Because no computation in the algorithm needs a wide word any longer, on a platform with 32-bit, 64-bit or wider registers one register simultaneously holds the data of 32, 64 or more threads, and each operation processes all of those threads at once, yielding an SM4 implementation with a higher thread count.

The core of the invention is that the SM4 block cipher is implemented at a data width of 1 bit and applied to a vector or non-vector instruction set with registers divided bit by bit. Under the same conditions, this yields a multi-threaded SM4 algorithm whose parallelism is 32 times that of the standard SM4 algorithm (word length 32 bits). The proposed SM4 parallel optimization can be implemented with vector instructions or on a GPU, and also on a CPU platform that does not support SIMD instructions.

Compared with the prior art, the invention has the following beneficial effects:

1) Constant-time algorithm. S-box table lookups are replaced by matrix operations, shifts, XOR and similar operations, so the algorithm runs in constant time. Its execution time is independent of the internal state of the algorithm, which resists timing side-channel attacks and makes the SM4 implementation more secure.

2) More parallel threads. The whole algorithm needs only 1 bit of a register per thread. With 32-bit and 64-bit registers, 32 and 64 SM4 encryption/decryption instances can be executed directly in parallel, and with AVX-512 registers up to 512 SM4 encryption instances can be executed, which raises the overall encryption rate. In scenarios where the speed of any single SM4 thread matters little but there are many encryption threads, overall efficiency improves greatly. On a GPU with x parallel GPU threads, if each thread uses 64-bit registers it processes 64 SM4 instances simultaneously, so the number of parallel SM4 threads is multiplied by 64.

Detailed Description

In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.

The invention provides a high-performance SM4 bit-slicing optimization method for heterogeneous parallel architectures. It computes SM4 using only a 1-bit data width and only XOR and AND logic operations. Compared with a conventional software implementation of SM4, the algorithm can use an X-bit register as an X-lane vector register, where X is the register width, so that X instances of the SM4 algorithm are computed in parallel and the operation speed is greatly improved.

Specifically, the method of the invention adopts the following steps:

1) all variables in SM4 are divided into 1-bit data from the low-order bit to the high-order bit; for example, the 128-bit plaintext (processed as 4 32-bit words) is regarded as 128 pieces of 1-bit data;

2) the XOR operations and cyclic shifts in SM4 are regarded as XOR and transposition among the 128 pieces of 1-bit data;

3) the S-box, the only nonlinear operation in SM4, is decomposed according to its mathematical principle into matrix affine transformations and inversion in the finite field $GF(2^8)$;

4) the inversion in $GF(2^8)$ is isomorphically mapped, through tower decomposition, to inversion in the composite field $GF((2^4)^2)$, and further transformed into inversion and multiplication in $GF(2^4)$;

5) the isomorphic mapping, the matrix operations and the multiplication in the specific finite field arising in the decomposition of step 4) are all treated as matrix multiplications and computed at a data width of 1 bit;

6) the inversion in $GF(2^4)$ is further tower-decomposed: it is isomorphically mapped to $GF((2^2)^2)$ and further transformed into inversion and multiplication in $GF(2^2)$;

7) the isomorphic mapping, the matrix operations and the multiplication in the specific finite field arising in the decomposition of step 6) are all treated as matrix multiplications and computed at a data width of 1 bit;

8) inversion in $GF(2^2)$ is equivalent to leaving the high-order bit unchanged and setting the low-order bit to the XOR of the high-order and low-order bits, so it is computed at a data width of 1 bit;

9) combining all of the above, the whole of SM4 is computed using only a 1-bit data width and XOR and AND operations, so an X-bit register can be used as an X-lane vector register, realizing parallel computation of X instances of the SM4 algorithm.

Variable representation of SM4 multithread optimization algorithm:

In the standard SM4 algorithm, the key is 16 bytes, the input plaintext is 16 bytes, and each intermediate state variable is a 4-byte word. In this method, the key and the plaintext are each divided into 128 pieces of 1-bit data according to bit significance, and each intermediate state word is divided into 32 pieces of 1-bit data in the same way.

All variables of the invention are assumed to be divided into 1-bit widths as follows:

Key format: the 128-bit key is written as 128 one-bit values indexed by q, where q is the bit number;

Plaintext format: the 128-bit plaintext is written as 128 one-bit values indexed by q, where q is the bit number;

32-bit word: the basic operation word of SM4 is 4 bytes; it is divided by bit significance into 32 one-bit values $data_q$, where $data$ is the name of the 32-bit variable and q is the bit number.
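As an illustration of this variable representation, the following C sketch packs one 32-bit word from each of 64 threads into 32 slices and unpacks them again. It is a minimal sketch under stated assumptions: the names `bs_pack32` and `bs_unpack32` and the lane count of 64 (one 64-bit register) are hypothetical and not taken from the original text, and slice index q is assumed to count from the least significant bit.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 64   /* assumption: one 64-bit register carries 64 independent threads */

/* slice[q] collects bit q of every thread's word: bit t of slice[q] is
 * bit q of word[t] (q = 0 is the least significant bit). */
static void bs_pack32(const uint32_t word[LANES], uint64_t slice[32])
{
    for (int q = 0; q < 32; q++) {
        uint64_t s = 0;
        for (int t = 0; t < LANES; t++)
            s |= (uint64_t)((word[t] >> q) & 1u) << t;
        slice[q] = s;
    }
}

/* Inverse of bs_pack32: rebuild each thread's 32-bit word from the slices. */
static void bs_unpack32(const uint64_t slice[32], uint32_t word[LANES])
{
    for (int t = 0; t < LANES; t++) {
        uint32_t w = 0;
        for (int q = 0; q < 32; q++)
            w |= (uint32_t)((slice[q] >> t) & 1u) << q;
        word[t] = w;
    }
}

/* Round-trip check on arbitrary test words. */
static void check_bs_pack32(void)
{
    uint32_t in[LANES], out[LANES];
    uint64_t s[32];
    for (int t = 0; t < LANES; t++)
        in[t] = 0x9E3779B9u * (uint32_t)(t + 1);
    bs_pack32(in, s);
    bs_unpack32(s, out);
    for (int t = 0; t < LANES; t++)
        assert(out[t] == in[t]);
}
```

The packing is done once on input and once on output; everything in between operates on whole slices.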

Linear operation of SM4 multithread optimization algorithm:

XOR calculation:

The SM4 multi-thread optimization method of the invention performs every XOR between 32-bit words in the original standard SM4 algorithm at a word length of 1 bit.

The calculation process is as follows:

1) assume that the standard SM4 algorithm has two 32-bit state words $A$ and $B$ and needs to calculate $C = A \oplus B$;

2) split the variables into 32 single bits, $A = (a_{31}, \ldots, a_0)$ and $B = (b_{31}, \ldots, b_0)$;

3) XOR the aligned bits: $c_q = a_q \oplus b_q$ for $q = 0, 1, \ldots, 31$;

the result $(c_{31}, \ldots, c_0)$ is again a 32-bit state word divided into 1-bit pieces.
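The bitwise XOR above can be sketched in C as follows, assuming each 1-bit variable is stored in one 64-bit register so that a single XOR instruction updates the same bit of 64 independent SM4 instances. The helper names (`bs_xor32`, `bs_set_lane`, `bs_word_of_lane`) are hypothetical and serve only this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* c[q] = a[q] ^ b[q] for the 32 slices of a state word; each 64-bit XOR
 * processes bit q of 64 independent words at once. */
static void bs_xor32(const uint64_t a[32], const uint64_t b[32], uint64_t c[32])
{
    for (int q = 0; q < 32; q++)
        c[q] = a[q] ^ b[q];
}

/* Store word w into lane t of the slices (the lane must still be all zero). */
static void bs_set_lane(uint64_t s[32], int t, uint32_t w)
{
    for (int q = 0; q < 32; q++)
        s[q] |= (uint64_t)((w >> q) & 1u) << t;
}

/* Read the 32-bit word held in lane t back out of the slices. */
static uint32_t bs_word_of_lane(const uint64_t s[32], int t)
{
    uint32_t w = 0;
    for (int q = 0; q < 32; q++)
        w |= (uint32_t)((s[q] >> t) & 1u) << q;
    return w;
}

/* The sliced XOR must agree with the ordinary word XOR in every lane. */
static void check_bs_xor32(void)
{
    uint64_t a[32] = {0}, b[32] = {0}, c[32];
    bs_set_lane(a, 0, 0x01234567u);   bs_set_lane(b, 0, 0x89ABCDEFu);
    bs_set_lane(a, 63, 0xFFFFFFFFu);  bs_set_lane(b, 63, 0x0F0F0F0Fu);
    bs_xor32(a, b, c);
    assert(bs_word_of_lane(c, 0) == (0x01234567u ^ 0x89ABCDEFu));
    assert(bs_word_of_lane(c, 63) == (0xFFFFFFFFu ^ 0x0F0F0F0Fu));
    assert(bs_word_of_lane(c, 7) == 0u);   /* untouched lanes stay zero */
}
```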

Linear transformation function:

The linear transformation functions in standard SM4 consist of XOR and cyclic shift operations. Both $L$ and $L'$ have the following format:

$L(B) = B \oplus (B \lll n_1) \oplus (B \lll n_2) \oplus (B \lll n_3) \oplus (B \lll n_4)$;

In the above formula, $n_1, \ldots, n_4$ are the bit counts of the four cyclic left shifts and $B$ is a 32-bit word. $L$ and $L'$ are similar in format and structure; they differ in that the shift counts of $L$ are 2, 10, 18, 24 in sequence, while those of $L'$ are 13, 23, 0, 0 in sequence (the two zero shifts give $B \oplus B = 0$ and cancel). With the variable representation of the SM4 multi-thread optimization algorithm, $B$ is divided into 32 pieces of 1-bit data $(b_0, b_1, \ldots, b_{31})$, and 32 pieces of 1-bit data $(c_0, c_1, \ldots, c_{31})$ of the same format must be output. Each output bit can therefore be expressed as the XOR of the input bits at the corresponding positions, i.e.:

$c_i = b_i \oplus b_{(i+n_1)\%32} \oplus b_{(i+n_2)\%32} \oplus b_{(i+n_3)\%32} \oplus b_{(i+n_4)\%32}$;

where $b_i$ denotes the 1-bit pieces of $B$ numbered from the high-order bit to the low-order bit, and $b_{(i+n_k)\%32}$ is bit $(i+n_k)\%32$ of $B$ (the bit following the last bit is taken to be the first bit); its index depends on the number of bits of the cyclic shift and on the position of the output bit. % is the modulo operation, i.e. the remainder of division by 32.

The calculation process is as follows:

1) divide the 32-bit input $B$ in sequence, from the high-order bit to the low-order bit, into 1-bit units $(b_0, b_1, \ldots, b_{31})$;

2) divide the 32-bit output $C$ in sequence into 1-bit units $(c_0, c_1, \ldots, c_{31})$;

3) for a given $i$, use the formula for $L$ to find the bits of $B$ involved in the XOR for $c_i$, and calculate:

$c_i = b_i \oplus b_{(i+2)\%32} \oplus b_{(i+10)\%32} \oplus b_{(i+18)\%32} \oplus b_{(i+24)\%32}$;

4) repeat step 3) for $i = 0, 1, \ldots, 31$ to obtain the final result $C$.

The linear transformation function $L'$ follows the same principle as $L$.
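The calculation process for the linear transformations can be sketched in C as follows. This is an illustration under stated assumptions, not the code of the invention: the function names are hypothetical, 64 lanes per slice are assumed, and slice index q counts from the least significant bit, so a cyclic left shift by n reads slice (q - n) mod 32; with bits numbered from high to low, as the text describes, the same relation becomes (i + n) % 32.

```c
#include <assert.h>
#include <stdint.h>

static const unsigned SHIFT_L[4]  = { 2, 10, 18, 24 };  /* shift counts of L  */
static const unsigned SHIFT_LP[2] = { 13, 23 };         /* non-zero shifts of L' */

/* 32-bit cyclic left shift, valid for 0 < n < 32. */
static uint32_t rotl32(uint32_t x, unsigned n) { return (x << n) | (x >> (32u - n)); }

/* Word-level reference: B xor (B <<< n_1) xor ... xor (B <<< n_count). */
static uint32_t lin32(uint32_t b, const unsigned *shift, unsigned count)
{
    uint32_t r = b;
    for (unsigned k = 0; k < count; k++)
        r ^= rotl32(b, shift[k]);
    return r;
}

/* Bit-sliced version: in[q] holds bit q of 64 independent words. A rotation
 * only re-indexes slices, so no shift instruction is executed at all. */
static void bs_linear(const uint64_t in[32], uint64_t out[32],
                      const unsigned *shift, unsigned count)
{
    for (unsigned q = 0; q < 32; q++) {
        uint64_t acc = in[q];
        for (unsigned k = 0; k < count; k++)
            acc ^= in[(q + 32u - shift[k]) % 32u];
        out[q] = acc;
    }
}

/* Compare the sliced result with the word-level result in all 64 lanes. */
static void check_one(const unsigned *shift, unsigned count)
{
    uint32_t w[64];
    uint64_t in[32] = {0}, out[32];
    for (int t = 0; t < 64; t++) {
        w[t] = 0x9E3779B9u * (uint32_t)(t + 1);      /* arbitrary test words */
        for (int q = 0; q < 32; q++)
            in[q] |= (uint64_t)((w[t] >> q) & 1u) << t;
    }
    bs_linear(in, out, shift, count);
    for (int t = 0; t < 64; t++) {
        uint32_t r = 0;
        for (int q = 0; q < 32; q++)
            r |= (uint32_t)((out[q] >> t) & 1u) << q;
        assert(r == lin32(w[t], shift, count));
    }
}

static void check_bs_linear(void)
{
    check_one(SHIFT_L, 4);
    check_one(SHIFT_LP, 2);
}
```

Because the slice indices are fixed at compile time in a real implementation, the inner loops would normally be fully unrolled into plain XOR instructions.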

Nonlinear S-box operation of a multithread optimization algorithm:

S-box mathematical principle:

The S-box is the only nonlinear component in the standard SM4 algorithm and is used in the nonlinear transformation $\tau$. The S-box is a substitution table with 8-bit input and 8-bit output, as follows:

0xd6,0x90,0xe9,0xfe,0xcc,0xe1,0x3d,0xb7,0x16,0xb6,0x14,0xc2,0x28,0xfb,0x2c,0x05, 0x2b,0x67,0x9a,0x76,0x2a,0xbe,0x04,0xc3,0xaa,0x44,0x13,0x26,0x49,0x86,0x06,0x99, 0x9c,0x42,0x50,0xf4,0x91,0xef,0x98,0x7a,0x33,0x54,0x0b,0x43,0xed,0xcf,0xac,0x62, 0xe4,0xb3,0x1c,0xa9,0xc9,0x08,0xe8,0x95,0x80,0xdf,0x94,0xfa,0x75,0x8f,0x3f,0xa6, 0x47,0x07,0xa7,0xfc,0xf3,0x73,0x17,0xba,0x83,0x59,0x3c,0x19,0xe6,0x85,0x4f,0xa8, 0x68,0x6b,0x81,0xb2,0x71,0x64,0xda,0x8b,0xf8,0xeb,0x0f,0x4b,0x70,0x56,0x9d,0x35, 0x1e,0x24,0x0e,0x5e,0x63,0x58,0xd1,0xa2,0x25,0x22,0x7c,0x3b,0x01,0x21,0x78,0x87, 0xd4,0x00,0x46,0x57,0x9f,0xd3,0x27,0x52,0x4c,0x36,0x02,0xe7,0xa0,0xc4,0xc8,0x9e, 0xea,0xbf,0x8a,0xd2,0x40,0xc7,0x38,0xb5,0xa3,0xf7,0xf2,0xce,0xf9,0x61,0x15,0xa1, 0xe0,0xae,0x5d,0xa4,0x9b,0x34,0x1a,0x55,0xad,0x93,0x32,0x30,0xf5,0x8c,0xb1,0xe3, 0x1d,0xf6,0xe2,0x2e,0x82,0x66,0xca,0x60,0xc0,0x29,0x23,0xab,0x0d,0x53,0x4e,0x6f, 0xd5,0xdb,0x37,0x45,0xde,0xfd,0x8e,0x2f,0x03,0xff,0x6a,0x72,0x6d,0x6c,0x5b,0x51, 0x8d,0x1b,0xaf,0x92,0xbb,0xdd,0xbc,0x7f,0x11,0xd9,0x5c,0x41,0x1f,0x10,0x5a,0xd8, 0x0a,0xc1,0x31,0x88,0xa5,0xcd,0x7b,0xbd,0x2d,0x74,0xd0,0x12,0xb8,0xe5,0xb4,0xb0, 0x89,0x69,0x97,0x4a,0x0c,0x96,0x77,0x7e,0x65,0xb9,0xf1,0x09,0xc5,0x6e,0xc6,0x84, 0x18,0xf0,0x7d,0xec,0x3a,0xdc,0x4d,0x20,0x79,0xee,0x5f,0x3e,0xd7,0xcb,0x39,0x48。

The S-box substitution of the standard algorithm is in fact a transformation of elements of the finite field $GF(2^8)$, whose elements are polynomials of degree at most 7. It can be described mathematically as follows:

Let $A$ be the binary 8×8 matrix:

{1,1,1,0,0,1,0,1},
{1,1,1,1,0,0,1,0},
{0,1,1,1,1,0,0,1},
{1,0,1,1,1,1,0,0},
{0,1,0,1,1,1,1,0},
{0,0,1,0,1,1,1,1},
{1,0,0,1,0,1,1,1},
{1,1,0,0,1,0,1,1},

and let $C$ be the length-8 binary vector {1,1,0,0,1,0,1,1}.

Let $a(x)$ be the polynomial corresponding to an 8-bit binary number $a$. With the primitive polynomial $f(x) = x^8 + x^7 + x^6 + x^5 + x^4 + x^2 + 1$, $a^{-1}(x)$ denotes the inverse of $a(x)$ in the field of polynomials modulo $f(x)$. If $a(x) = 0$, its inverse is defined as $a^{-1}(x) = 0$.

The inverse operation relationship is as follows:

1.

；

2.

；

3.

。

Suppose that $x$ is the 8-bit input to the S-box and $y$ is the output of the S-box. The output can be expressed as:

$y = A \cdot (A \cdot x \oplus C)^{-1} \oplus C$.

The S-box operation thus consists of 8×8 matrix operations, 8-bit vector XOR operations and an inversion in an 8-bit finite field. The matrix multiplications and vector XORs are binary linear operations and are easily decomposed, following the relations between the variables, into calculations at a word length of 4 bits: the matrix is first divided into 8 column vectors and the 8-bit vector is then divided into its high and low 4 bits, so that the matrix-vector operation is realized on a 4-bit register.

The inversion in the finite field is a nonlinear operation; it cannot be implemented directly on a 1-bit register and requires isomorphic mapping of the finite field.

$GF(2^8)$ to $GF((2^4)^2)$ composite-field isomorphic mapping:

The field of polynomials modulo the degree-8 primitive polynomial $f(x)$ can be isomorphically mapped onto the composite field $GF((2^4)^2)$.

Composite field definition: every element of $GF((2^4)^2)$ can be expressed as a polynomial $a_1 x + a_0$ and regarded as an element of the field of polynomials modulo a degree-2 modulus polynomial, where the coefficients $a_1, a_0$ are defined in $GF(2^4)$, the field of polynomials modulo a degree-4 polynomial.

The $GF((2^4)^2)$ used by the invention is thus a field of polynomials modulo a degree-2 polynomial whose coefficients are defined in the modulus field $GF(2^4)$. In particular, the $GF(2^8)$ of the SM4 algorithm is defined modulo $f(x)$; if an isomorphic mapping from $GF(2^8)$ to $GF((2^4)^2)$ exists, then the inversion in $GF(2^8)$ is equivalent to the inversion in $GF((2^4)^2)$.

The specific isomorphic mapping and inversion operations are as follows:

1) by matrix multiplication, construct the isomorphic mapping from $GF(2^8)$ to $GF((2^4)^2)$. For one example choice of the modulus polynomials of $GF((2^4)^2)$ and $GF(2^4)$, an isomorphic mapping from $GF(2^8)$ to $GF((2^4)^2)$ exists whose matrix is:

{0,1,0,1,1,1,1,0},
{0,1,1,1,1,1,0,0},
{1,1,0,1,0,0,0,0},
{0,1,0,1,0,0,0,0},
{1,1,0,1,1,1,0,0},
{0,1,1,0,1,1,0,0},
{1,1,1,1,0,1,1,0},
{0,1,0,1,1,1,1,1},

Under this isomorphism, the inversion becomes an inversion in $GF((2^4)^2)$;

2) for the $GF((2^4)^2)$ given in step 1), the inversion can be derived as follows:

let $a_1 x + a_0$ be an element of $GF((2^4)^2)$; if $b_1 x + b_0$ is the inverse of this element, then the following equation must hold:

$(a_1 x + a_0)(b_1 x + b_0) \equiv 1$ modulo the modulus polynomial of $GF((2^4)^2)$,

where $a_1, a_0, b_1, b_0$ denote 4 elements of $GF(2^4)$;

after unfolding, obtaining:

；

。

Through the isomorphic mapping, the inversion in $GF(2^8)$ becomes an inversion in $GF((2^4)^2)$; and with the modulus polynomial selected by the invention, the inversion in $GF((2^4)^2)$ further becomes an inversion of an element in $GF(2^4)$ together with multiplications in $GF(2^4)$.

$GF(2^4)$ to $GF((2^2)^2)$ composite-field isomorphic mapping:

After the first isomorphic mapping, the inversion problem in $GF(2^8)$ is reduced to inversion and multiplication operations in $GF(2^4)$. These can be mapped isomorphically once more, transferring the computation in $GF(2^4)$ to $GF((2^2)^2)$.

1) By matrix multiplication, construct the isomorphic mapping from $GF(2^4)$ to $GF((2^2)^2)$. As an example, with coefficients defined in the finite field $GF(2^2)$ formed by a degree-2 modulus polynomial, a second degree-2 modulus polynomial forms the finite field $GF((2^2)^2)$, and an isomorphic mapping exists between it and the finite field $GF(2^4)$ formed by the degree-4 modulus polynomial;

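2) the inversion in $GF((2^2)^2)$ is derived from the following equation:

let $a_1 x + a_0$ be an element of $GF((2^2)^2)$; if $b_1 x + b_0$ is the inverse of this element, then the following equation must hold:

$(a_1 x + a_0)(b_1 x + b_0) \equiv 1$ modulo the modulus polynomial of $GF((2^2)^2)$;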

after unfolding, obtaining:

；

；

As can be seen from the above, through the isomorphic mapping from $GF(2^4)$ to $GF((2^2)^2)$, the inversion in $GF(2^4)$ becomes inversion and multiplication operations in $GF(2^2)$;

3) multiplication in the given $GF(2^4)$ can be converted into a matrix-vector multiplication and implemented at a word length of 1 bit. Specifically, the two 4-bit elements to be multiplied are regarded as two vectors of the form {x3, x2, x1, x0}; one of them is rewritten in matrix form and multiplied by the other vector. The matrix is an affine matrix whose entries are built from the components {x3, x2, x1, x0} of the rewritten vector;

4) multiplication in $GF(2^2)$ can likewise use a matrix operation. Specifically, the two 2-bit elements to be multiplied are regarded as two vectors of the form {x1, x0}; one of them is rewritten in matrix form and multiplied by the other vector. The matrix is an affine matrix whose entries are built from the components {x1, x0} of the rewritten vector.

Inversion in $GF(2^2)$, in turn, can be implemented at a word length of 1 bit in the following table-lookup manner:

for example, as one embodiment, in the finite field $GF(2^2)$ defined by the modulus polynomial $x^2 + x + 1$, an element is only 2 bits long, and the inverse relation is as follows:

Element	Inverse
00	00
01	01
10	11
11	10

For any element $a$ of $GF(2^2)$, $a^{-1} = a^2$ always holds, so the inversion in $GF(2^2)$ can be realized by this formula. The formula can be computed at a word length of 1 bit: the high-order bit is unchanged, and the low-order bit equals the XOR of the high-order and low-order bits.
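The $GF(2^2)$ operations described above can be sketched in C using only AND and XOR on whole registers. The sketch assumes the modulus polynomial $x^2 + x + 1$ (the only irreducible polynomial of degree 2 over $GF(2)$), a polynomial basis with the high bit as the coefficient of $x$, and 64 lanes per register; the type and function names are hypothetical. The multiplication is written as the matrix-vector product obtained by turning the first operand {a1, a0} into the matrix with rows (a1 xor a0, a1) and (a1, a0).

```c
#include <assert.h>
#include <stdint.h>

/* One GF(2^2) element per lane: hi is the coefficient of x, lo the constant
 * term, each stored for 64 independent lanes. */
typedef struct { uint64_t hi, lo; } bs_gf4;

/* Multiplication modulo x^2 + x + 1 as a matrix-vector product:
 *   | a.hi^a.lo  a.hi |   | b.hi |
 *   | a.hi       a.lo | * | b.lo |
 * with AND as the bit product and XOR as the bit sum. */
static bs_gf4 bs_gf4_mul(bs_gf4 a, bs_gf4 b)
{
    bs_gf4 r;
    r.hi = ((a.hi ^ a.lo) & b.hi) ^ (a.hi & b.lo);
    r.lo = (a.hi & b.hi) ^ (a.lo & b.lo);
    return r;
}

/* Inversion: high bit unchanged, low bit = high XOR low (zero maps to zero). */
static bs_gf4 bs_gf4_inv(bs_gf4 a)
{
    bs_gf4 r;
    r.hi = a.hi;
    r.lo = a.hi ^ a.lo;
    return r;
}

/* Lanes 0..3 hold the elements 00, 01, 10, 11; every non-zero element
 * multiplied by its inverse must give 01. */
static void check_bs_gf4(void)
{
    bs_gf4 a, p;
    a.hi = 0xCull;             /* high bit set in lanes 2 and 3 */
    a.lo = 0xAull;             /* low bit set in lanes 1 and 3  */
    p = bs_gf4_mul(a, bs_gf4_inv(a));
    assert(p.hi == 0);
    assert(p.lo == 0xEull);    /* product is 1 in lanes 1..3, 0 in lane 0 */
}
```

No table is accessed at run time, so the cost is the same for every input value.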

In summary, the inversion in $GF(2^8)$ can be converted into inversion and multiplication operations in $GF(2^4)$; the inversion in $GF(2^4)$ can be converted into inversion in $GF(2^2)$; and the multiplications in $GF(2^4)$ and $GF(2^2)$ can be rewritten as matrix-vector multiplications, which can be implemented at a 1-bit word length. With this adaptation, both the nonlinear and the linear parts of standard SM4 are carried out at a 1-bit word length.
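The second level of this reduction, inversion in $GF((2^2)^2)$ expressed through $GF(2^2)$ operations, can be sketched in C as follows. The modulus polynomials are assumptions chosen for illustration only, since the polynomials selected by the invention are not reproduced here: $GF(2^2)$ is taken modulo $x^2 + x + 1$ and $GF((2^2)^2)$ modulo $y^2 + y + \varphi$ with $\varphi = x$. All names are hypothetical. For an element $a_1 y + a_0$ the sketch computes $\Delta = a_1^2 \varphi + a_1 a_0 + a_0^2$ and returns $b_1 = a_1 \Delta^{-1}$, $b_0 = (a_1 + a_0)\Delta^{-1}$.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t hi, lo; } bs_gf4;   /* GF(2^2), modulus x^2 + x + 1 */
typedef struct { bs_gf4 hi, lo; } bs_gf16;    /* GF((2^2)^2), modulus y^2 + y + PHI (assumed) */

static bs_gf4 gf4_add(bs_gf4 a, bs_gf4 b)
{
    bs_gf4 r; r.hi = a.hi ^ b.hi; r.lo = a.lo ^ b.lo; return r;
}

static bs_gf4 gf4_mul(bs_gf4 a, bs_gf4 b)
{
    bs_gf4 r;
    r.hi = ((a.hi ^ a.lo) & b.hi) ^ (a.hi & b.lo);
    r.lo = (a.hi & b.hi) ^ (a.lo & b.lo);
    return r;
}

/* High bit unchanged, low bit = high XOR low. */
static bs_gf4 gf4_inv(bs_gf4 a)
{
    bs_gf4 r; r.hi = a.hi; r.lo = a.hi ^ a.lo; return r;
}

/* Multiply by PHI = x: x*(a1 x + a0) = (a1 + a0) x + a1. */
static bs_gf4 gf4_mul_phi(bs_gf4 a)
{
    bs_gf4 r; r.hi = a.hi ^ a.lo; r.lo = a.hi; return r;
}

/* (a1 y + a0)(b1 y + b0) reduced with y^2 = y + PHI. */
static bs_gf16 gf16_mul(bs_gf16 a, bs_gf16 b)
{
    bs_gf4 hh = gf4_mul(a.hi, b.hi);
    bs_gf16 r;
    r.hi = gf4_add(hh, gf4_add(gf4_mul(a.hi, b.lo), gf4_mul(a.lo, b.hi)));
    r.lo = gf4_add(gf4_mul_phi(hh), gf4_mul(a.lo, b.lo));
    return r;
}

/* Inversion in GF((2^2)^2) using one inversion and a few multiplications
 * in GF(2^2); zero maps to zero. */
static bs_gf16 gf16_inv(bs_gf16 a)
{
    bs_gf4 delta = gf4_add(gf4_mul_phi(gf4_mul(a.hi, a.hi)),
                           gf4_add(gf4_mul(a.hi, a.lo), gf4_mul(a.lo, a.lo)));
    bs_gf4 d = gf4_inv(delta);
    bs_gf16 r;
    r.hi = gf4_mul(a.hi, d);
    r.lo = gf4_mul(gf4_add(a.hi, a.lo), d);
    return r;
}

/* Lanes 0..15 hold all 16 elements; each non-zero one times its inverse is 1. */
static void check_bs_gf16_inv(void)
{
    bs_gf16 a, p;
    a.hi.hi = 0xFF00ull; a.hi.lo = 0xF0F0ull;
    a.lo.hi = 0xCCCCull; a.lo.lo = 0xAAAAull;
    p = gf16_mul(a, gf16_inv(a));
    assert(p.hi.hi == 0 && p.hi.lo == 0 && p.lo.hi == 0);
    assert(p.lo.lo == 0xFFFEull);
}
```

The same pattern, one level up, expresses inversion in $GF((2^4)^2)$ through inversion and multiplication in $GF(2^4)$.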

The S-box lookup is usually one of the most time-consuming operations in a block cipher. The S-box computation of the bit-slicing algorithm was measured using x64 instructions and compared with table lookup in Table 1. The bit-slicing algorithm has a significant advantage in multi-threaded computation, with a speed-up close to 4 times. For the complete SM4 algorithm, the advantage of the slicing algorithm is shown in Table 2.

Table 1: Efficiency comparison of the slicing algorithm and table lookup for the SM4 S-box

In Table 1, the query speed in the last column is the average number of S-box queries per second, and the execution count is the number of runs of the test program. One run of the slicing method performs 64 S-box queries, 64 times as many as the ordinary method, yet the time required grows by less than 20 times (21.288 s / 1.287 s = 16.54 times), so its query speed is clearly higher than that of the conventional S-box table lookup.

Table 2: Speed comparison of the SM4 slicing algorithm and the standard implementation

Algorithm	Number of blocks executed	Total time	Data volume	Average throughput
Standard SM4 implementation	12800000	2.451 s	1638400000 bit	668 Mb/s
Slicing SM4 implementation	12800000	0.991 s	1638400000 bit	1,655 Mb/s

As can be seen from Table 2, the slicing SM4 implementation takes less time and achieves higher throughput than the standard SM4 implementation.

With this method, the standard SM4 algorithm becomes an SM4 algorithm whose input, output, intermediate state and computation can all be realized in 1-bit registers: every variable occupies only 1 bit. On a 32-bit computing platform, each variable can hold 32 different sets of values in one 32-bit register, and each operation processes all 32 sets at once, so that 32 SM4 threads are computed in parallel. A wider computing platform (e.g. 64 bits) supports synchronous SM4 computation with more threads.

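Implementation on SIMD instruction sets and GPUs:

Given two vectors $A = (a_0, a_1, \ldots, a_{n-1})$ and $B = (b_0, b_1, \ldots, b_{n-1})$, where the $a_i$ and $b_i$ are words, the vector-instruction operations used by the invention are:

$A \oplus B = (a_0 \oplus b_0, a_1 \oplus b_1, \ldots, a_{n-1} \oplus b_{n-1})$;

$A \,\&\, B = (a_0 \,\&\, b_0, a_1 \,\&\, b_1, \ldots, a_{n-1} \,\&\, b_{n-1})$.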

Example 1: implementation on the ARM/NEON instruction set:

ARM processors are the mainstream processors of smart mobile devices such as mobile phones. The most widely deployed Cortex-A series ARM architecture includes the NEON SIMD instruction set in addition to the general ARM instruction set (ARMv7). NEON provides 16 128-bit SIMD registers and can compute on 4-way 32-bit words in parallel, so the SM4 of this embodiment can be further expanded to 128 parallel instances.

The NEON instructions used are given below in the form of intrinsics (pseudo-functions), where int32x4_t is a 128-bit NEON vector type:

$c \leftarrow a \oplus b$ uses the instruction int32x4_t c = veorq_s32(int32x4_t a, int32x4_t b);

$c \leftarrow a \,\&\, k$ uses the instruction int32x4_t c = vandq_s32(int32x4_t a, int32x4_t k).

Example 2: implementation on the x86/AVX2 instruction set:

The non-vector code can be written in a high-level language and compiled. The vector algorithm is implemented with AVX2 instructions, which compute on 8-way 32-bit words in parallel, so the SM4 of this embodiment can be further expanded to 256 parallel instances:

$c \leftarrow a \oplus b$ uses the instruction __m256i c = _mm256_xor_si256(__m256i a, __m256i b);

$c \leftarrow a \,\&\, k$ uses the instruction __m256i c = _mm256_and_si256(__m256i a, __m256i k).

Example 3: OpenCL-based GPU implementation:

OpenCL code is written much like C. Because every variable of the slicing algorithm is 1 bit wide, the input plaintext is represented by 128 variables and each state word by 32 variables, and the S-box takes 8 variables as input and produces 8 as output. Since 64-bit registers are used, a single GPU thread computes the equivalent of 64 SM4 threads in parallel.

The structure of the OpenCL S-box code is as follows:

unsigned long long tp[32];   /* one 32-bit state word as 32 slices, 64 lanes each */
...
for (i = 0; i < 32; i++)
{
    ...
    /* S-box on the upper 8 bits of tp, one slice per argument */
    SBOX_SMS4_BS(&tp[31], &tp[30], &tp[29], &tp[28], &tp[27], &tp[26], &tp[25], &tp[24]);
    ...
}

In the original standard algorithm, the corresponding operation is an S-box lookup on the upper 8 bits of the 32-bit word tp. Under the slicing algorithm, the 32 bits are represented as an array and the 8 input bits must be passed separately, 1 bit each; because the registers are 64 bits wide, SBOX_SMS4_BS actually computes 64 S-boxes at once, expanding the number of SM4 threads by a factor of 64.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims

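1. A high-performance SM4 bit-slicing optimization method for heterogeneous parallel architectures, characterized by comprising the following steps:

1) dividing each 32-bit variable of the original standard SM4 algorithm, in order, into 32 variables with a word length of 1 bit;

2) converting the linear operations of the original standard SM4 algorithm, namely the XOR of 32-bit words and the cyclic left shift of 32-bit words, into XOR and transposition among the 32 1-bit variables defined in step 1);

3) decomposing the nonlinear operation of the original standard SM4 algorithm, the S-box, into matrix affine transformations and inversion in the finite field $GF(2^8)$;

4) for the inversion in the finite field $GF(2^8)$, using a finite-field tower-structure transformation: $GF(2^8)$ is isomorphically mapped by a matrix affine transformation onto the composite field $GF((2^4)^2)$, so that the inversion is converted into inversion and multiplication in $GF(2^4)$, which are implemented at a word length of 1 bit;

5) for $GF(2^4)$, using a finite-field tower-structure transformation: $GF(2^4)$ is isomorphically mapped by a matrix affine transformation onto the composite field $GF((2^2)^2)$, so that its operations are converted into inversion and multiplication in $GF(2^2)$;

6) for $GF(2^2)$, inversion is equivalent to leaving the high-order bit unchanged and setting the low-order bit to the XOR of the high-order and low-order bits, so that the calculation is realized at a word length of 1 bit;

7) following steps 1) to 6), computing the whole SM4 algorithm using only a 1-bit word length together with XOR and AND operations, so that an X-bit register is used as an X-lane vector register, realizing multi-threaded parallel computation of X instances of the SM4 algorithm.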

2. The method of claim 1, wherein the XOR operation in step 2) comprises:

1) for two 32-bit state words $A$ and $B$ of the original standard SM4 algorithm, calculating $C = A \oplus B$;

2) splitting the variables into 32 single bits, $A = (a_{31}, \ldots, a_0)$ and $B = (b_{31}, \ldots, b_0)$;

3) XORing the aligned bits, $c_q = a_q \oplus b_q$ for $q = 0, 1, \ldots, 31$,

and outputting the result, which is in the form of a 32-bit state word divided into 1-bit pieces.

3. The method of claim 1, wherein the transposition in step 2) comprises:

1) dividing the 32-bit word $B$ in sequence into 1-bit units $(b_0, b_1, \ldots, b_{31})$;

2) dividing the 32-bit output word $C$ in sequence into 1-bit units $(c_0, c_1, \ldots, c_{31})$;

3) for a given $i$, finding from the linear transformation formula $L$ the bits of $B$ involved in the XOR for $c_i$, and calculating:

$c_i = b_i \oplus b_{(i+n_1)\%32} \oplus b_{(i+n_2)\%32} \oplus b_{(i+n_3)\%32} \oplus b_{(i+n_4)\%32}$;

where $b_{(i+n_k)\%32}$ denotes bit $(i+n_k)\%32$ of $B$ and $n_1, \ldots, n_4$ are the cyclic shift counts of $L$;

4) repeating step 3) for every $i$ to obtain all bits of the final result $C$.

4. The method of claim 1, wherein the processing in step 4) comprises:

1) constructing, by matrix multiplication, the isomorphic mapping from $GF(2^8)$ to $GF((2^4)^2)$;

2) performing the inversion in the $GF((2^4)^2)$ obtained in step 1), according to the following formula:

；

，

is a modulus polynomial modulus;

the formula is developed to obtain:

；

；

according to the formula in step 2), the inversion in $GF((2^4)^2)$ further becomes the inversion of an element in $GF(2^4)$ together with multiplications in $GF(2^4)$.

5. The method of claim 1, wherein the multiplication in $GF(2^4)$ in step 4) is realized by conversion into a matrix-vector multiplication: the two 4-bit elements to be multiplied are regarded as two vectors, one of which is rewritten in matrix form and multiplied by the other, so that the calculation is realized at a word length of 1 bit.

6. The method of claim 1, wherein the processing in step 5) comprises:

1) constructing, by matrix multiplication, the isomorphic mapping from $GF(2^4)$ to $GF((2^2)^2)$;

2) performing the inversion in the $GF((2^2)^2)$ obtained in step 1), according to the following formula:

；

wherein the content of the first and second substances,

is that

The elements (A) and (B) in (B),

is the inverse of the element(s),

，

is a polynomial modulus;

the formula is developed to obtain:

；

；

according to the formula in step 2), the inversion in $GF(2^4)$ becomes inversion and multiplication operations in $GF(2^2)$.

7. The method of claim 1, wherein the multiplication in $GF(2^2)$ in step 5) is realized by a matrix operation, comprising: regarding the two 2-bit elements to be multiplied as two vectors, rewriting one vector in matrix form and multiplying it by the other vector; and wherein the inversion in the finite field $GF(2^2)$ is performed by table lookup.