CN111736902B

CN111736902B - Parallel computing method and device of SM4 based on SIMD (Single instruction multiple data) instructions and readable storage medium

Info

Publication number: CN111736902B
Application number: CN202010687106.9A
Authority: CN
Inventors: 钱晶; 董明武; 温程; 王芷玲; 白小勇
Original assignee: Beijing Lianshi Networks Technology Co ltd
Current assignee: Beijing Lianshi Networks Technology Co ltd
Priority date: 2020-07-16
Filing date: 2020-07-16
Publication date: 2022-04-19
Anticipated expiration: 2040-07-16
Also published as: CN111736902A

Abstract

The invention provides a parallel computing method and device for SM4 based on SIMD instructions, and a readable storage medium. The method comprises: arranging a plurality of input SM4 grouped messages to obtain arranged grouped messages; performing an optimized SM4 encryption or decryption operation on the arranged grouped messages, in which the Sbox operation is computed with a composite field technique instead of a table lookup, and the composite field technique performs the inversion in GF(2⁸) using a fast SIMD-based multiplication in GF(2⁴); and performing a reverse arrangement on the result of the encryption or decryption operation to obtain the ciphertext or plaintext messages corresponding to the grouped messages. The method combines an equivalent transformation of the nonlinear operation of SM4 based on the composite field technique, an equivalent transformation of the linear operation of SM4 based on reordering operations and fusing several linear transformations, and a fast SIMD-based multiplication in GF(2⁴), and thereby improves the execution speed of SM4 encryption and decryption.

Description

Parallel computing method and device of SM4 based on SIMD (Single instruction multiple data) instructions and readable storage medium

Technical Field

The invention relates to the technical field of computer security, in particular to a parallel computing method, a parallel computing device and a computer readable storage medium of SM4 based on SIMD instructions.

Background

To ensure the security of data encryption, countries around the world have introduced their own standard algorithms, such as AES in the United States, CLEFIA and Camellia in Japan, and SM4 in China, which was formerly also called SMS4.

The SM4 cryptographic algorithm is built on a 4-branch generalized Feistel structure; the plaintext, the ciphertext and the key are each 128 bits. SM4 comprises an encryption algorithm, a decryption algorithm and a key arrangement algorithm, where the key arrangement algorithm derives the 32 round keys RKᵢ (i = 0, 1, …, 31), each 32 bits, from the 128-bit key. The encryption algorithm and the decryption algorithm both consist of the same 32 rounds of the nonlinear round function followed by one reverse-order transformation R; they differ only in the order in which the 32 round keys are used. Four 32-bit variables

(X₀, X₁, X₂, X₃)

represent the 128-bit plaintext input, and the encryption algorithm proceeds as follows:

1. Perform 32 iterations, where RKᵢ is the round key:

Xᵢ₊₄ = Xᵢ ⊕ T(Xᵢ₊₁ ⊕ Xᵢ₊₂ ⊕ Xᵢ₊₃ ⊕ RKᵢ), i = 0, 1, …, 31

T is composed of a nonlinear transformation τ and a linear transformation L, T(U) = L(τ(U)). With

U = (u₀, u₁, u₂, u₃)

representing 4 bytes, the nonlinear transformation is:

V = (v₀, v₁, v₂, v₃) = τ(U) = (Sbox(u₀), Sbox(u₁), Sbox(u₂), Sbox(u₃))

and the linear transformation L is:

L(V) = V ⊕ (V <<< 2) ⊕ (V <<< 10) ⊕ (V <<< 18) ⊕ (V <<< 24)

where <<< denotes a left rotation of a 32-bit word.

2. Apply the reverse-order transformation R to the output of the last round to obtain the ciphertext:

(Y₀, Y₁, Y₂, Y₃) = R(X₃₂, X₃₃, X₃₄, X₃₅) = (X₃₅, X₃₄, X₃₃, X₃₂)
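The round structure above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: `toy_sbox` and `toy_keys` are hypothetical stand-ins (they are NOT the SM4 Sbox or key schedule) used only to show that decryption is the same procedure with the round keys in reverse order.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def linear_l(v):
    """Linear transformation L of the SM4 round function."""
    return v ^ rotl32(v, 2) ^ rotl32(v, 10) ^ rotl32(v, 18) ^ rotl32(v, 24)

def tau(u, sbox):
    """Nonlinear transformation: apply the byte substitution to each of the 4 bytes."""
    out = 0
    for shift in (24, 16, 8, 0):
        out |= sbox((u >> shift) & 0xFF) << shift
    return out

def crypt_block(x, round_keys, sbox):
    """32 rounds X[i+4] = X[i] ^ T(X[i+1] ^ X[i+2] ^ X[i+3] ^ RK[i]), then R."""
    x = list(x)
    for rk in round_keys:
        t = linear_l(tau(x[1] ^ x[2] ^ x[3] ^ rk, sbox))
        x = [x[1], x[2], x[3], x[0] ^ t]
    return x[::-1]  # reverse-order transformation R

# Hypothetical stand-ins, only to exercise the structure (not SM4 constants).
toy_sbox = lambda b: (b * 7 + 3) & 0xFF
toy_keys = [(0x9E3779B9 * (i + 1)) & MASK32 for i in range(32)]

plain = [0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210]
cipher = crypt_block(plain, toy_keys, toy_sbox)
recovered = crypt_block(cipher, toy_keys[::-1], toy_sbox)  # same code, reversed keys
```

Because each round only XORs T(...) into one word, the structure is invertible whatever substitution is plugged in, which is why the sketch works even with a toy substitution.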

The enactment of China's Cybersecurity Law and Cryptography Law has promoted the adoption of the SM4 algorithm, but the drop in transaction processing speed that encryption and decryption introduce into computer information systems has become an obstacle to that adoption.

Hardware implementations improve the encryption and decryption efficiency of SM4 by continually reducing the number of gates needed to realize the algorithm. The key technique is the composite field technique: the nonlinear inversion in GF(2⁸) inside the SM4 Sbox is equivalently transformed into operations in GF(2⁴)², and the composite field technique is applied further until everything is expressed as operations in GF(2) that can be realized directly with gates.

There are currently 4 main technical approaches for software implementation: GPU, SM4 hardware instructions, AESNI and bitslicing (bitslice):

1. GPU: the SM4 encryption and decryption efficiency is improved by using the parallel capability of the GPU;

2. SM4 hardware instructions: hardware instructions supporting the SM4 encryption and decryption algorithm are built into the CPU;

3. AESNI: based on the algebraic isomorphism between the Sbox of AES and the Sbox of SM4, the Sbox operation of SM4 is completed with the AESNI instruction AESENCLAST; the SM4 computation is mapped from the algebraic structure of SM4 into the algebraic structure of AES;

4. Bitslice: 256 data blocks are processed simultaneously using the bitslice technique and the 256-bit registers of the AVX2 instruction set.

The technical defects of these schemes are as follows. GPU and SM4 hardware instructions are not general solutions. AESNI depends on the presence of AESNI hardware instructions, and only AESNI instructions operating on 128-bit registers are available, so when AESNI instructions are combined with SIMD instructions on 256-bit registers, the AESNI-related operations must be serialized. Bitslice must process 256 data blocks (4096 bytes) at a time, which limits its applicability, and both the data arrangement steps and the GF(2⁴) multiplications involved in the bitslice technique are complex.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a parallel computing method of SM4 based on SIMD instructions, which arranges a plurality of input grouped messages to obtain arranged grouped messages; for the arranged grouped messages, uses GF(2⁴) in place of GF(2⁸), based on SIMD instructions, to complete the Sbox substitution of SM4, so as to carry out the inversion operation of the round function and obtain its result; and performs a reverse arrangement on the result of the inversion operation to obtain the ciphertext messages corresponding to the grouped messages, thereby overcoming the defects of the traditional methods.

The specific scheme of the invention is as follows:

A parallel computing method of SM4 based on SIMD instructions, comprising the following steps:

Step S1: arranging a plurality of input grouped messages to obtain arranged grouped messages;

Step S2: for the arranged grouped messages, using GF(2⁴) in place of GF(2⁸), based on SIMD instructions, to complete the Sbox substitution of SM4, so as to carry out the inversion operation of the round function and obtain the result of the inversion operation;

The SIMD-based Sbox substitution of SM4 using GF(2⁴) in place of GF(2⁸) is computed as follows:

The Sbox substitution replaces an 8-bit input u with an 8-bit output Sbox(u) according to a 256-byte substitution table, which can be expressed as:

Sbox(u) = A·(A·u + C)⁻¹ + C

where A is an 8×8 matrix, C is an 8×1 matrix, u is an 8-bit number, and every element of A and C is an element of GF(2) (the entries of A and C are given as matrices in the original publication and are not reproduced in this text).

The mathematical structure of the SM4 Sbox is defined as: GF(2⁸), f(x) = 1 + x² + x⁴ + x⁵ + x⁶ + x⁷ + x⁸;

The finite field is defined as: GF(2⁴), g(x) = 1 + x + x⁴;

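The mathematical structure of the Sbox is isomorphic to a quadratic algebraic extension of the finite field GF(2⁴):

GF(2⁸) ≅ GF(2⁴)² = GF(2⁴)[y]/(h(y)), h(y) = 9 + y + y²;

An isomorphic mapping matrix M and an inverse isomorphic mapping matrix M⁻¹ are defined (their values are given as matrices in the original publication and are not reproduced in this text).

Thus, v ∈ GF(2⁸) is written as v = a₀ + a₁y, where a₀, a₁ ∈ GF(2⁴). Since v⁻¹ ∈ GF(2⁸), v⁻¹ is written as v⁻¹ = b₀ + b₁y, with b₀, b₁ ∈ GF(2⁴). Then from (a₀ + a₁y)(b₀ + b₁y) = 1 and y² = y + 9 it follows that:

b₀ = (a₀ + a₁)·(a₀² + a₀a₁ + 9a₁²)⁻¹, b₁ = a₁·(a₀² + a₀a₁ + 9a₁²)⁻¹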
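The step from the product condition to a closed form for b₀ and b₁ can be spelled out as a short derivation. It is a sketch that assumes h(y) = 9 + y + y² as stated in claim 1, reads the constant 9 as the GF(2⁴) element 0x9, and uses the fact that addition and subtraction coincide in characteristic 2; the symbol Δ is introduced here only for illustration.

```latex
\begin{aligned}
(a_0 + a_1 y)(b_0 + b_1 y)
  &= a_0 b_0 + (a_0 b_1 + a_1 b_0)\,y + a_1 b_1\,y^2 \\
  &= (a_0 b_0 + 9\,a_1 b_1) + (a_0 b_1 + a_1 b_0 + a_1 b_1)\,y
     \qquad (y^2 = y + 9) \\[4pt]
\text{so}\quad a_0 b_0 + 9\,a_1 b_1 &= 1, \qquad a_0 b_1 + a_1 b_0 + a_1 b_1 = 0 \\[4pt]
\Delta &= a_0^2 + a_0 a_1 + 9\,a_1^2 \\
b_0 &= (a_0 + a_1)\,\Delta^{-1}, \qquad b_1 = a_1\,\Delta^{-1}
\end{aligned}
```

Substituting back confirms both conditions: the y coefficient becomes Δ⁻¹(a₀a₁ + a₀a₁ + a₁² + a₁²) = 0 and the constant term becomes Δ⁻¹(a₀² + a₀a₁ + 9a₁²) = 1. Only additions, multiplications, squarings and one inversion in GF(2⁴) are needed.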

That is, the inversion in GF(2⁸) can be computed in GF(2⁴)². The calculation process is as follows:

For u ∈ GF(2⁸), an affine transformation gives v = A·u + C;

An isomorphic mapping of v gives w = M·v;

A composite field inversion of w gives w⁻¹ = Inv(w);

An inverse isomorphic mapping of w⁻¹ gives s = M⁻¹·w⁻¹;

An affine transformation of s gives t = A·s + C;

where u is an element of GF(2⁸); the affine transformation of u gives the element v of GF(2⁸); the isomorphic mapping of v gives the element w of GF(2⁴)²; the composite field inversion of w gives the element w⁻¹ of GF(2⁴)²; the inverse isomorphic mapping of w⁻¹ gives the element s of GF(2⁸); and the affine transformation of s gives the element t of GF(2⁸). w⁻¹ = Inv(w) denotes the inversion in GF(2⁴)²;

Step S3: performing a reverse arrangement on the result of the inversion operation to obtain the ciphertext messages corresponding to the grouped messages.

Further, the GF(2⁴) multiplication used in the inversion in GF(2⁴)² is computed as follows: let GF(2⁴)* denote the multiplicative group of GF(2⁴); g = 0x02 ∈ GF(2⁴)* is a generator of GF(2⁴)*, i.e. for every e ∈ GF(2⁴)* there exists i ≤ 15 such that e = gⁱ. To compute c = a·b with a, b, c ∈ GF(2⁴), first compute log_g c = log_g a + log_g b, then compute

c = g^(log_g c)

where log_g a and log_g b are obtained by looking up a LOG lookup table containing 16 elements.
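A scalar sketch of this logarithm-based multiplication is given below. It assumes g(x) = 1 + x + x⁴ and the generator g = 0x02 from the text. The table names `EXP` and `LOG`, the reduction of the exponent modulo 15 and the explicit handling of zero operands are illustrative choices, since the text does not spell out how those cases are treated.

```python
G_POLY = 0b10011  # g(x) = 1 + x + x^4

def gf16_mul_ref(a, b):
    """Reference shift-and-xor multiplication in GF(2^4) modulo g(x)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= G_POLY
    return r

# EXP[i] = g^i for the generator g = 0x02, i = 0..14.
EXP = [1]
for _ in range(14):
    EXP.append(gf16_mul_ref(EXP[-1], 2))

# LOG is the inverse table with 16 entries; LOG[0] is a placeholder
# because zero has no logarithm.
LOG = [0] * 16
for i, e in enumerate(EXP):
    LOG[e] = i

def gf16_mul(a, b):
    """c = g^(log a + log b), with zero operands handled separately."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 15]
```

For example, 0x3·0x7 is (x + 1)(x² + x + 1) = x³ + 1 = 0x9, and the tables give LOG[3] + LOG[7] = 4 + 10 = 14 with EXP[14] = 9.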

Furthermore, the calculation of the linear transformation L in the round function is restructured using the properties of L and of the SIMD instruction set, as follows:

The linear transformation L in the round function is defined as:

L(V) = V ⊕ (V <<< 2) ⊕ (V <<< 10) ⊕ (V <<< 18) ⊕ (V <<< 24)

The linear transformation L is equivalently transformed into:

[formula not reproduced in this text]

This calculation needs only 10 SIMD instructions in total: 4 XORs, 3 shuffles, 1 left shift, 1 right shift and 1 OR.
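One factorization of L that matches this instruction count is sketched below. It is an assumption, not necessarily the exact formula of the patent: rotations by 8, 16 and 24 bits are byte permutations (one shuffle each on SIMD hardware), and the remaining rotation by 2 is applied once to the combined term. The function names are hypothetical, and the 32-bit rotations stand in for the per-lane SIMD operations.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def linear_l(v):
    """L as defined in the SM4 round function."""
    return v ^ rotl32(v, 2) ^ rotl32(v, 10) ^ rotl32(v, 18) ^ rotl32(v, 24)

def linear_l_fused(v):
    """L(V) = V ^ (V <<< 24) ^ ((V ^ (V <<< 8) ^ (V <<< 16)) <<< 2)."""
    r8 = rotl32(v, 8)        # shuffle 1 (byte rotation)
    r16 = rotl32(v, 16)      # shuffle 2
    r24 = rotl32(v, 24)      # shuffle 3
    t = v ^ r8 ^ r16         # XOR 1 and XOR 2
    rot2 = ((t << 2) & MASK32) | (t >> 30)  # left shift, right shift, OR
    return v ^ r24 ^ rot2    # XOR 3 and XOR 4
```

The identity holds because rotating V ⊕ (V <<< 8) ⊕ (V <<< 16) by 2 yields exactly the terms (V <<< 2) ⊕ (V <<< 10) ⊕ (V <<< 18).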

Further, the computation of the reverse-order transformation R in SM4 is fused into the message reverse arrangement process.

The present invention provides a SIMD instruction based SM4 parallel computing apparatus comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the method of any one of the above.

The invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any of the above.

Compared with the prior art, the invention has the following beneficial effects:

The invention provides a parallel computing method of SM4 based on SIMD instructions, a device thereof and a computer-readable storage medium. The method comprises: arranging a plurality of input SM4 grouped messages to obtain arranged grouped messages; performing an optimized SM4 encryption or decryption operation on the arranged grouped messages, in which the Sbox operation is computed with the composite field technique instead of a table lookup, and the composite field technique completes the inversion in GF(2⁸) using the fast SIMD-based multiplication in GF(2⁴) proposed by the invention; and performing a reverse arrangement on the result of the encryption or decryption operation to obtain the ciphertext or plaintext messages corresponding to the grouped messages. By combining an equivalent transformation of the nonlinear operation of SM4 based on the composite field technique, an equivalent transformation of the linear operation of SM4 based on reordering operations and fusing several linear transformations, and the newly proposed fast SIMD-based multiplication in GF(2⁴), the invention provides a new execution process for the SM4 block cipher that fits SIMD instruction sets well. It can encrypt and decrypt 4, 8, 16, 32 or more SM4 grouped messages in parallel, improves the execution speed of SM4 encryption and decryption, does not depend on a specific hardware platform, and can be implemented on any hardware that supports SIMD instructions.

Compared with the existing method of accelerating SM4 with AESNI instructions, the method of the invention does not depend on the specific AESNI instructions, only on more general SIMD instructions. Compared with the existing SM4 method based on the bitslice technique and AVX2 instructions, the method of the invention is more general: it supports parallel processing of 4, 8, 16 and 32 grouped messages and therefore fits more usage scenarios, whereas the SM4 computation based on the bitslice technique and AVX2 instructions requires 512 SM4 grouped messages to be processed simultaneously, which limits its usage scenarios. In addition, in tests in a real environment, the processing speed of the SM4 computation provided by the invention scales up in multiples with the number of messages processed in parallel. When 16 or 32 grouped messages are processed in parallel, the measured speed is 20 to 30 percent faster than the measured speed of the AESNI and bitslice methods.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.

FIG. 1 is a flow chart of a parallel computing method of SM4 based on SIMD instructions according to the present invention;

FIG. 2 is a schematic diagram of an exemplary message arrangement process;

FIG. 3 is a schematic diagram of an inversion operation process; and

FIG. 4 is a diagram illustrating a reverse arrangement process according to an embodiment of the present invention.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

The invention aims to provide a parallel computing method of SM4 based on SIMD instructions, which comprises the following steps:

Step S1: arranging the plurality of input grouped messages to obtain arranged grouped messages.

The method of the invention can process multiple grouped messages in parallel. Taking an implementation on 128-bit SIMD instructions as an example, the parallel processing of 4 grouped messages illustrates the message arrangement process, see FIG. 2, where Aᵢ, Bᵢ, Cᵢ, Dᵢ (i = 0, 1, 2, 3) are each 32 bits, and the 4 grouped 128-bit messages are:

(A₀, A₁, A₂, A₃), (B₀, B₁, B₂, B₃), (C₀, C₁, C₂, C₃), (D₀, D₁, D₂, D₃);

After the message arrangement, 4 128-bit registers store the arranged messages, and the contents stored in the 4 registers are:

(A₀, B₀, C₀, D₀), (A₁, B₁, C₁, D₁), (A₂, B₂, C₂, D₂), (A₃, B₃, C₃, D₃).
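The arrangement is a 4×4 transpose of 32-bit words, so that register i holds word i of every message and one SIMD instruction acts on the same word of all 4 messages. A minimal sketch, with string labels standing in for the 32-bit words and a hypothetical function name `arrange`:

```python
def arrange(blocks):
    """Transpose 4 messages of 4 words: register i holds word i of each message."""
    return [tuple(block[i] for block in blocks) for i in range(4)]

blocks = [
    ("A0", "A1", "A2", "A3"),
    ("B0", "B1", "B2", "B3"),
    ("C0", "C1", "C2", "C3"),
    ("D0", "D1", "D2", "D3"),
]
registers = arrange(blocks)
# registers[0] holds the first word of every message, and so on.
```

Applying the same transpose twice returns the original layout, which is why the reverse arrangement after the rounds has the same cost as the arrangement before them.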

Step S2: for the arranged grouped messages, using GF(2⁴) in place of GF(2⁸), based on SIMD instructions, to complete the Sbox substitution of SM4, so as to carry out the inversion operation of the round function and obtain the result of the inversion operation;

The SIMD-based Sbox substitution of SM4 using GF(2⁴) in place of GF(2⁸) is computed as follows:

Sbox(u) = A·(A·u + C)⁻¹ + C

For example, for u = 0x7b, the column-vector form of u is (11011110)ᵀ (least significant bit first). First, v = A·u + C is computed:

A·(11011110)ᵀ + C = (01100111)ᵀ

The vector (01100111)ᵀ corresponds to the value v = 0xe6. The inverse of v = 0xe6 in GF(2⁸) is v⁻¹ = 0xfe, whose vector form is (01111111)ᵀ. Then Sbox(u) = A·v⁻¹ + C:

A·(01111111)ᵀ + C = (11100111)ᵀ

The vector (11100111)ᵀ corresponds to the value Sbox(u) = 0xe7, which agrees with the Sbox definition in the SM4 standard. This is the mathematical basis for using GF(2⁴) in place of GF(2⁸).
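The worked example can be reproduced with a direct, unoptimized computation in GF(2⁸). The sketch below assumes that A is the circulant matrix whose first row is 0xA7 and that C = 0xD3; these values are not reproduced in this text and are taken here as an assumption, checked only against the example u = 0x7b. The function names are hypothetical, and the inversion is done by plain exponentiation rather than by the composite field method.

```python
F_POLY = 0x1F5  # f(x) = 1 + x^2 + x^4 + x^5 + x^6 + x^7 + x^8

def gf256_mul(a, b):
    """Shift-and-xor multiplication in GF(2^8) modulo f(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= F_POLY
    return r

def gf256_inv(v):
    """v^254 equals v^-1 for v != 0; 0 is mapped to 0."""
    r = 1
    for _ in range(254):
        r = gf256_mul(r, v)
    return r

A_ROW0 = 0xA7  # assumed first row of the circulant matrix A
C = 0xD3       # assumed constant vector C

def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF

def affine(u):
    """A*u + C over GF(2): output bit k is the parity of (row k of A) AND u."""
    out = 0
    for k in range(8):
        parity = bin(rotl8(A_ROW0, k) & u).count("1") & 1
        out |= parity << k
    return out ^ C

def sbox(u):
    """Sbox(u) = A*(A*u + C)^-1 + C."""
    return affine(gf256_inv(affine(u)))
```

With these assumptions, the intermediate values for u = 0x7b are v = 0xe6 and v⁻¹ = 0xfe, and the result is 0xe7, matching the example.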

The finite field is defined as: GF(2⁴), g(x) = 1 + x + x⁴;

The quadratic extension is defined by h(y) = 9 + y + y²;

x and y are the variables of the polynomials g(x) and h(y);

Thus, v ∈ GF(2⁸) is written as v = a₀ + a₁y, where a₀, a₁ ∈ GF(2⁴). Since v⁻¹ ∈ GF(2⁸), v⁻¹ is written as v⁻¹ = b₀ + b₁y, with b₀, b₁ ∈ GF(2⁴). Then from (a₀ + a₁y)(b₀ + b₁y) = 1 and y² = y + 9 it follows that:

b₀ = (a₀ + a₁)·(a₀² + a₀a₁ + 9a₁²)⁻¹, b₁ = a₁·(a₀² + a₀a₁ + 9a₁²)⁻¹

For u ∈ GF(2⁸), an affine transformation gives v = A·u + C;

An isomorphic mapping of v gives w = M·v;

A composite field inversion of w gives w⁻¹ = Inv(w);

An inverse isomorphic mapping of w⁻¹ gives s = M⁻¹·w⁻¹;

An affine transformation of s gives t = A·s + C;

where w⁻¹ = Inv(w) denotes the inversion in GF(2⁴)². Based on the formulas for b₀ and b₁, this inversion can be carried out according to the process shown in FIG. 3, i.e. the inversion in GF(2⁸) is converted into addition, multiplication, squaring and inversion operations in GF(2⁴).
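The composite field inversion can be checked with a small scalar model. This is a sketch under stated assumptions: elements of GF(2⁴)² are pairs (a₀, a₁) standing for a₀ + a₁y, the extension polynomial is h(y) = 9 + y + y² from claim 1 with the constant read as the GF(2⁴) element 0x9, and the closed form for the inverse is the one that follows from (a₀ + a₁y)(b₀ + b₁y) = 1. All function names are hypothetical, and the isomorphic mapping M is not modeled.

```python
G_POLY = 0b10011  # g(x) = 1 + x + x^4
H_CONST = 0x9     # constant term of h(y) = 9 + y + y^2

def gf16_mul(a, b):
    """Shift-and-xor multiplication in GF(2^4) modulo g(x)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= G_POLY
    return r

def gf16_inv(a):
    """a^14 equals a^-1 for a != 0; 0 is mapped to 0."""
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r

def composite_mul(a, b):
    """(a0 + a1*y)(b0 + b1*y) reduced with y^2 = y + 9."""
    a0, a1 = a
    b0, b1 = b
    hi = gf16_mul(a1, b1)
    c0 = gf16_mul(a0, b0) ^ gf16_mul(H_CONST, hi)
    c1 = gf16_mul(a0, b1) ^ gf16_mul(a1, b0) ^ hi
    return (c0, c1)

def composite_inv(a):
    """b0 = (a0 + a1)/delta, b1 = a1/delta, delta = a0^2 + a0*a1 + 9*a1^2."""
    a0, a1 = a
    delta = gf16_mul(a0, a0) ^ gf16_mul(a0, a1) ^ gf16_mul(H_CONST, gf16_mul(a1, a1))
    d_inv = gf16_inv(delta)
    return (gf16_mul(a0 ^ a1, d_inv), gf16_mul(a1, d_inv))
```

For instance, the inverse of y, i.e. the pair (0, 1), comes out as (2, 2): y·(2 + 2y) = 2y + 2(y + 9) = 2·9 = 1 in GF(2⁴). The whole inversion uses only GF(2⁴) additions, multiplications, squarings and a single GF(2⁴) inversion.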

In one embodiment, the GF(2⁴) multiplication used in the inversion in GF(2⁴)² is computed as follows: let GF(2⁴)* denote the multiplicative group of GF(2⁴); g = 0x02 ∈ GF(2⁴)* is a generator of GF(2⁴)*, i.e. for every e ∈ GF(2⁴)* there exists i ≤ 15 such that e = gⁱ. To compute c = a·b with a, b, c ∈ GF(2⁴), first compute log_g c = log_g a + log_g b, then compute c = g^(log_g c).

The above is the key optimization of the method: using the algebraic structure of the Sbox of the SM4 block cipher, the inversion in the Sbox is moved to the isomorphic field GF(2⁴)², where it is completed with the proposed optimized GF(2⁴) multiplication, which accelerates the Sbox substitution of the SM4 block cipher. Because the key arrangement algorithm of SM4 shares the same Sbox substitution with the encryption and decryption process, the same computation can also be applied to the key arrangement algorithm to increase its speed. This is an important inventive point of the invention.

Step S3: performing a reverse arrangement on the result of the inversion operation to obtain the ciphertext messages corresponding to the grouped messages. In the invention, the reverse-order transformation R of SM4 is fused into the message reverse arrangement process; this fusion adds no extra computation and therefore helps improve the software execution speed.

Taking an implementation on 128-bit SIMD instructions as an example, the parallel processing of 4 grouped messages illustrates the message reverse arrangement process, see FIG. 4.

Following the message arrangement process, after the round function has been executed 32 times, the contents stored in the 4 registers are:

(A₃₂，B₃₂，C₃₂，D₃₂)，(A₃₃，B₃₃，C₃₃，D₃₃)，(A₃₄，B₃₄，C₃₄，D₃₄)，(A₃₅，B₃₅，C₃₅，D₃₅)；

after the message reverse arrangement operation of the reverse order transformation R is fused, the stored contents in the 4 registers are respectively:

(A₃₅，A₃₄，A₃₃，A₃₂)，(B₃₅，B₃₄，B₃₃，B₃₂)，(C₃₅，C₃₄，C₃₃，C₃₂)，(D₃₅，D₃₄，D₃₃，D₃₂)；

These are the ciphertexts corresponding to the 4 plaintext messages. Note that with common SIMD instruction sets, fusing the reverse-order transformation R adds no additional instructions.
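The fused reverse arrangement is again a 4×4 transpose, with the word order of each message reversed at the same time so that R needs no separate pass. A minimal sketch, with string labels standing in for the 32-bit words and a hypothetical function name:

```python
def dearrange_with_r(regs):
    """regs[j] holds word X(32+j) of every message.

    Output message k is (X35, X34, X33, X32) of message k, i.e. the transpose
    and the reverse-order transformation R done in one step.
    """
    return [tuple(regs[j][k] for j in (3, 2, 1, 0)) for k in range(4)]

regs = [
    ("A32", "B32", "C32", "D32"),
    ("A33", "B33", "C33", "D33"),
    ("A34", "B34", "C34", "D34"),
    ("A35", "B35", "C35", "D35"),
]
ciphertexts = dearrange_with_r(regs)
```

Reading the source registers in the order 3, 2, 1, 0 is the only difference from a plain transpose, which is consistent with the statement that the fusion costs no extra instructions.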

Addition in GF(2⁴) is a bitwise XOR; squaring and inversion are done by constructing operation tables and using the shuffle instruction common to SIMD instruction sets. For efficient computation, a fast GF(2⁴) multiplication is also needed. The invention uses the generator of GF(2⁴) together with a logarithm table and an exponent table to complete multiplication quickly, requiring only 7 SIMD instructions: with 7 instructions, a SIMD instruction set with 128-bit registers completes 16 multiplications in GF(2⁴), one with 256-bit registers completes 32 multiplications, and one with 512-bit registers completes 64 multiplications.
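The table-plus-shuffle idea for squaring and inversion can be modeled in Python by emulating a 128-bit byte shuffle. The sketch assumes the semantics of the x86 PSHUFB instruction (each output byte is a table entry selected by the low 4 bits of the index byte, or 0 when bit 7 of the index is set). The names `shuffle_bytes`, `SQUARE` and `INVERSE` are hypothetical, and the 7-instruction multiplication sequence itself is not reproduced here.

```python
G_POLY = 0b10011  # g(x) = 1 + x + x^4

def gf16_mul(a, b):
    """Shift-and-xor multiplication in GF(2^4) modulo g(x)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= G_POLY
    return r

# 16-entry operation tables, each small enough to fit one 128-bit register.
SQUARE = [gf16_mul(a, a) for a in range(16)]
INVERSE = [0] + [next(b for b in range(1, 16) if gf16_mul(a, b) == 1)
                 for a in range(1, 16)]

def shuffle_bytes(table, idx):
    """Emulate a 16-lane byte shuffle with PSHUFB semantics."""
    return [0 if i & 0x80 else table[i & 0x0F] for i in idx]

lanes = list(range(16))                 # 16 GF(2^4) elements, one per byte lane
squares = shuffle_bytes(SQUARE, lanes)  # 16 squarings with one shuffle
inverses = shuffle_bytes(INVERSE, lanes)  # 16 inversions with one shuffle
```

Because every GF(2⁴) element fits in 4 bits, a single shuffle with the table as the source register performs 16 lookups at once on 128-bit registers, and correspondingly more on wider registers.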

In addition, the calculation of the linear transformation L in the round function is restructured using the properties of L and of the SIMD instruction set, as follows:

The linear transformation L in the round function is defined as:

L(V) = V ⊕ (V <<< 2) ⊕ (V <<< 10) ⊕ (V <<< 18) ⊕ (V <<< 24)

The linear transformation L is equivalently transformed into:

[formula not reproduced in this text]

By restructuring the computation of L in this way, the method completes the linear transformation L with fewer SIMD instructions, which further increases the operation speed.

The invention proposes a parallel computing apparatus of SM4 based on SIMD instructions, comprising a processor and a memory storing a computer program which, when executed by the processor, implements the method of any one of the above.

For convenience of description, the above device is described as functionally divided into various units, each described separately. Of course, when implementing the present application, the functions of the units may be realized in one or more pieces of software and/or hardware.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present application, in essence or in the part contributing beyond the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disk, and which includes several instructions that cause a computer device (a personal computer, a server, a network device or the like) to execute the methods of the embodiments, or of some portions of the embodiments, of the present application.

Finally, it should be noted that although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made without departing from the spirit and scope of the invention, and all such modifications and substitutions are intended to be covered by the appended claims.

Claims

1. A parallel computing method of SM4 based on SIMD instructions, characterized by comprising the following steps:

step S1: arranging a plurality of input grouped messages to obtain arranged grouped messages, wherein Aᵢ, Bᵢ, Cᵢ, Dᵢ (i = 0, 1, 2, 3) are 32 bits each, and the 4 groups of 128-bit messages are: (A₀, A₁, A₂, A₃), (B₀, B₁, B₂, B₃), (C₀, C₁, C₂, C₃), (D₀, D₁, D₂, D₃);

after the message arrangement, the contents stored in the 4 128-bit registers are: (A₀, B₀, C₀, D₀), (A₁, B₁, C₁, D₁), (A₂, B₂, C₂, D₂), (A₃, B₃, C₃, D₃);

step S2: for the arranged grouped messages, using GF(2⁴) in place of GF(2⁸), based on SIMD instructions, to complete the Sbox substitution of SM4, so as to carry out the inversion operation of the round function and obtain the result of the inversion operation;

Sbox(u) = A·(A·u + C)⁻¹ + C

the mathematical structure of the SM4 Sbox is defined as: GF(2⁸), f(x) = 1 + x² + x⁴ + x⁵ + x⁶ + x⁷ + x⁸;

the finite field is defined as: GF(2⁴), g(x) = 1 + x + x⁴;

h(y) = 9 + y + y²;

thus, v ∈ GF(2⁸) is written as v = a₀ + a₁y, where a₀, a₁ ∈ GF(2⁴); since v⁻¹ ∈ GF(2⁸), v⁻¹ is written as v⁻¹ = b₀ + b₁y, with b₀, b₁ ∈ GF(2⁴); then from (a₀ + a₁y)(b₀ + b₁y) = 1 and y² = y + 9 it follows that:

b₀ = (a₀ + a₁)·(a₀² + a₀a₁ + 9a₁²)⁻¹, b₁ = a₁·(a₀² + a₀a₁ + 9a₁²)⁻¹

for u ∈ GF(2⁸), an affine transformation gives v = A·u + C;

an isomorphic mapping of v gives w = M·v;

a composite field inversion of w gives w⁻¹ = Inv(w);

an inverse isomorphic mapping of w⁻¹ gives s = M⁻¹·w⁻¹;

an affine transformation of s gives t = A·s + C;

the GF(2⁴) multiplication used in the inversion in GF(2⁴)² is computed as follows: let GF(2⁴)* denote the multiplicative group of GF(2⁴); g = 0x02 ∈ GF(2⁴)* is a generator of GF(2⁴)*, i.e. for every e ∈ GF(2⁴)* there exists i ≤ 15 such that e = gⁱ; to compute c = a·b with a, b, c ∈ GF(2⁴), first compute log_g c = log_g a + log_g b, then compute

c = g^(log_g c)

wherein log_g a and log_g b are obtained by looking up a LOG lookup table containing 16 elements;

addition in GF(2⁴) is completed by bitwise XOR, and squaring and inversion are completed by constructing operation tables and using the shuffle instruction common to SIMD instruction sets; the multiplication is completed quickly using the generator of GF(2⁴), the logarithm table and the exponent table, and requires only 7 instructions of the SIMD instruction set;

2. The parallel computing method of SM4 based on SIMD instructions according to claim 1, characterized in that the calculation of the linear transformation L in the round function is restructured using the properties of L and of the SIMD instruction set, as follows:

the linear transformation L in the round function is defined as:

L(V) = V ⊕ (V <<< 2) ⊕ (V <<< 10) ⊕ (V <<< 18) ⊕ (V <<< 24)

the linear transformation L is equivalently transformed into:

[formula not reproduced in this text]

the calculation needs only 10 SIMD instructions: 4 XORs, 3 shuffles, 1 left shift, 1 right shift and 1 OR.

3. The parallel computing method of SM4 based on SIMD instructions according to any of claims 1-2, characterized in that the computation of the reverse-order transformation R in SM4 is fused into the message reverse arrangement process.

4. A parallel computing apparatus based on the parallel computing method of the SM4 of SIMD instructions, characterized in that it comprises a processor and a memory, said memory storing a computer program which, when executed by the processor, implements the method of any of claims 1-3.

5. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the parallel computing method of SIMD instruction based SM4 of any of claims 1-3.