CN111736902B - Parallel computing method and device of SM4 based on SIMD (single instruction, multiple data) instructions, and readable storage medium

Info

Publication number: CN111736902B
Application number: CN202010687106.9A
Authority: CN (China)
Prior art keywords: simd, sbox, calculation, instructions, message
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN111736902A (en)
Inventors: 钱晶, 董明武, 温程, 王芷玲, 白小勇
Current assignee: Beijing Lianshi Networks Technology Co ltd
Original assignee: Beijing Lianshi Networks Technology Co ltd
Events: application filed by Beijing Lianshi Networks Technology Co ltd; priority to CN202010687106.9A; publication of CN111736902A; application granted; publication of CN111736902B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885: Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887: Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098: Register arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention provides a parallel computing method and device for SM4 based on SIMD instructions, and a readable storage medium. The method comprises: arranging a plurality of input SM4 message blocks to obtain arranged message blocks; performing an optimized SM4 encryption or decryption operation on the arranged message blocks, in which the Sbox operation is computed with a composite field technique instead of a table lookup, and the inversion in GF(2^8) required by the composite field technique is performed using a fast SIMD-based multiplication in GF(2^4); and performing a reverse arrangement on the result of the encryption or decryption operation to obtain the ciphertext or plaintext corresponding to the message blocks. The method combines an equivalent transformation of the nonlinear operation of SM4 using the composite field technique, an equivalent transformation of the linear operation of SM4 obtained by reordering operations and fusing several linear transformations, and a fast SIMD-based multiplication in GF(2^4), and thereby improves the execution speed of SM4 encryption and decryption.

Description

Parallel computing method and device of SM4 based on SIMD (Single instruction multiple data) instructions and readable storage medium
Technical Field
The invention relates to the technical field of computer security, in particular to a parallel computing method, a parallel computing device and a computer readable storage medium of SM4 based on SIMD instructions.
Background
To ensure the security of data encryption, countries around the world have introduced their own standard algorithms, such as AES in the United States, CLEFIA and Camellia in Japan, and SM4 in China (formerly also called SMS4).
The SM4 cryptographic algorithm is built on a 4-branch generalized Feistel structure. The plaintext, the ciphertext, and the key are each 128 bits. SM4 comprises an encryption algorithm, a decryption algorithm, and a key arrangement (key schedule) algorithm. The encryption and decryption algorithms both consist of the same 32 rounds of a nonlinear round function followed by one reverse-order transformation R; they differ only in the order in which the 32 round keys are used. Let four 32-bit variables $(X_0, X_1, X_2, X_3)$ represent the 128-bit plaintext input. The encryption algorithm proceeds as follows:
1. Perform 32 rounds of iteration, where $RK_i$ is the round key of round $i$:
$X_{i+4} = F(X_i, X_{i+1}, X_{i+2}, X_{i+3}, RK_i) = X_i \oplus T(X_{i+1} \oplus X_{i+2} \oplus X_{i+3} \oplus RK_i), \quad i = 0, 1, \ldots, 31$
T is composed of a nonlinear transformation τ and a linear transformation L, $T(U) = L(\tau(U))$. With $u_0, u_1, u_2, u_3$ denoting the 4 bytes of the 32-bit word $U$, the nonlinear transformation is:
$V = (v_0, v_1, v_2, v_3) = \tau(U) = (Sbox(u_0), Sbox(u_1), Sbox(u_2), Sbox(u_3))$
and the linear transformation L is:
$L(V) = V \oplus (V \lll 2) \oplus (V \lll 10) \oplus (V \lll 18) \oplus (V \lll 24)$
where $\lll$ denotes a 32-bit left rotation.
2. Apply the reverse-order transformation R to the output of the last round to obtain the ciphertext:
$(Y_0, Y_1, Y_2, Y_3) = R(X_{32}, X_{33}, X_{34}, X_{35}) = (X_{35}, X_{34}, X_{33}, X_{32})$
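The two steps above can be sketched in a few lines of Python. This is only an illustration of the generalized Feistel structure: `toy_t` is a hypothetical stand-in for the real T (it is not SM4's Sbox plus L), and the round keys are made-up values. The point it shows is the property stated in the text, that decryption is the same procedure with the 32 round keys used in reverse order, which holds for any choice of T.

```python
MASK32 = 0xFFFFFFFF


def toy_t(word):
    """Hypothetical stand-in for SM4's T = L(tau(.)); NOT the real transformation."""
    return ((word * 0x9E3779B1) ^ (word >> 7)) & MASK32


def crypt(block, round_keys, t=toy_t):
    """32 rounds of X[i+4] = X[i] ^ T(X[i+1] ^ X[i+2] ^ X[i+3] ^ RK[i]),
    followed by the reverse-order transformation R."""
    x = list(block)
    for rk in round_keys:
        x.append(x[-4] ^ t(x[-3] ^ x[-2] ^ x[-1] ^ rk))
    # R: (X32, X33, X34, X35) -> (X35, X34, X33, X32)
    return tuple(reversed(x[-4:]))


# Illustrative round keys only; a real implementation derives them from the key.
round_keys = [(0x01234567 * (i + 1)) & MASK32 for i in range(32)]
plaintext = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)

ciphertext = crypt(plaintext, round_keys)
recovered = crypt(ciphertext, round_keys[::-1])  # same routine, reversed keys
print(recovered == plaintext)
```

Because only `t` changes between this sketch and a real implementation, the same skeleton can host either a table-based or a composite-field Sbox.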
The adoption of the SM4 algorithm has been promoted by the implementation of China's cybersecurity and cryptography laws, but the drop in transaction processing speed that encryption and decryption introduce into computer information systems has become an obstacle to its adoption.
Hardware implementations improve SM4 encryption and decryption efficiency by continually reducing the number of gates needed to implement the algorithm. The key technique is the composite field technique: the nonlinear inversion of the SM4 Sbox in GF(2^8) is equivalently transformed into operations in GF((2^4)^2), and the composite field technique is applied further until the computation is reduced to operations in GF(2) that can be implemented with gates.
There are currently 4 main technical approaches for software implementation: GPU, SM4 hardware instructions, AESNI, and bit slicing (bitslice):
1. GPU: the parallel capability of the GPU is used to improve SM4 encryption and decryption efficiency;
2. SM4 hardware instructions: hardware instructions supporting the SM4 encryption and decryption algorithm are built into the CPU;
3. AESNI: based on the algebraic isomorphism between the Sbox of AES and the Sbox of SM4, the Sbox operation of SM4 is completed with the AESNI instruction AESENCLAST, by mapping the algebraic structure of SM4 to the algebraic structure of AES;
4. Bitslice: 256 data blocks are processed simultaneously using the bitslice technique and the 256-bit registers of the AVX2 instruction set.
The technical shortcomings of these approaches are as follows. GPU and SM4 hardware instructions are not general solutions. AESNI depends on the presence of AESNI hardware instructions, and only AESNI instructions operating on 128-bit registers are available; when AESNI instructions are combined with SIMD instructions on 256-bit registers, the AESNI-related operations must be serialized. Bitslice must process 256 data blocks (4096 bytes) at the same time, which limits its applicability, and both the data arrangement steps and the multiplication in GF(2^4) involved in the bitslice technique are complex.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a parallel computing method for SM4 based on SIMD instructions. The method arranges a plurality of input message blocks to obtain arranged message blocks; for the arranged message blocks, it uses SIMD instructions and GF(2^4) in place of GF(2^8) to complete the Sbox substitution of SM4, thereby realizing the inversion operation in the round function and obtaining its result; and it performs a reverse arrangement on that result to obtain the ciphertext corresponding to the message blocks, overcoming the shortcomings of the conventional methods.
The specific scheme of the invention is as follows:
a parallel computing method for SM4 based on SIMD instructions, comprising the following steps:
step S1: arranging a plurality of input message blocks to obtain arranged message blocks;
step S2: for the arranged message blocks, based on SIMD instructions, using GF(2^4) in place of GF(2^8) to complete the Sbox substitution of SM4, thereby realizing the inversion operation in the round function and obtaining the result of the inversion operation;
the SIMD-based Sbox substitution of SM4 using GF(2^4) in place of GF(2^8) proceeds as follows:
the Sbox substitution replaces an 8-bit input u with an 8-bit output Sbox(u) according to a 256-byte substitution table, and can be written as:
$Sbox(u) = A \cdot (A \cdot u + C)^{-1} + C$
where A is an 8x8 matrix, C is an 8x1 matrix, u is an 8-bit number, and every element of A and C is an element of GF(2):
[Equation image not reproduced: values of the matrices A and C]
the mathematical structure of the SM4 Sbox is defined as: GF(2^8), $f(x) = 1 + x^2 + x^4 + x^5 + x^6 + x^7 + x^8$.
The finite field is defined as: GF(2^4), $g(x) = 1 + x + x^4$.
The mathematical structure of the Sbox is isomorphic to a quadratic algebraic extension of this finite field:
[Equation image not reproduced: the isomorphism between GF(2^8) and GF((2^4)^2); claim 1 gives the extension polynomial as $h(y) = 9 + y + y^2$]
The isomorphic mapping matrix $M$ and the inverse isomorphic mapping matrix $M^{-1}$ are defined as:
[Equation image not reproduced: values of $M$ and $M^{-1}$]
Thus an element $v \in GF(2^8)$ is written as $v = a_0 + a_1 y$ with $a_0, a_1 \in GF(2^4)$. Since $v^{-1} \in GF(2^8)$, write $v^{-1} = b_0 + b_1 y$ with $b_0, b_1 \in GF(2^4)$; then from $(a_0 + a_1 y)(b_0 + b_1 y) = 1$:
[Equation image not reproduced: expressions for $b_0$ and $b_1$ in terms of $a_0$ and $a_1$]
That is, the inversion in GF(2^8) can be carried out in GF((2^4)^2). The calculation proceeds as follows:
for $u \in GF(2^8)$, apply the affine transformation to obtain $v = A \cdot u + C$;
apply the isomorphic mapping to $v$ to obtain $w = M \cdot v$;
perform the composite field inversion on $w$ to obtain $w^{-1} = Inv(w)$;
apply the inverse isomorphic mapping to $w^{-1}$ to obtain $s = M^{-1} \cdot w^{-1}$;
apply the affine transformation to $s$ to obtain $t = A \cdot s + C$.
Here u is an element of GF(2^8); the affine transformation of u gives the element v of GF(2^8); the isomorphic mapping takes v to the element w of GF((2^4)^2); the composite field inversion of w gives the element $w^{-1}$ of GF((2^4)^2); the inverse isomorphic mapping takes $w^{-1}$ to the element s of GF(2^8); and the final affine transformation of s gives the element t of GF(2^8). $w^{-1} = Inv(w)$ denotes the inversion in GF((2^4)^2);
step S3: performing a reverse arrangement on the result of the inversion operation to obtain the ciphertext corresponding to the message blocks.
Further, the multiplication in GF(2^4) used by the inversion in GF((2^4)^2) is computed as follows. Let $GF(2^4)^*$ denote the multiplicative group of GF(2^4). The element $g = 0x02 \in GF(2^4)^*$ is a generator of $GF(2^4)^*$, that is, for every $e \in GF(2^4)^*$ there exists $i \le 15$ such that $e = g^i$. To compute $c = a \cdot b$ with $a, b, c \in GF(2^4)$, first compute $\log_g c = \log_g a + \log_g b$, then compute
$c = g^{\log_g c}$
where $\log_g a$ and $\log_g b$ are obtained by looking them up in a LOG lookup table containing 16 elements.
Furthermore, the calculation of the linear transformation L in the round function is adjusted by exploiting the properties of L and of the SIMD instruction set, as follows:
the linear transformation L in the round function is defined as:
$L(V) = V \oplus (V \lll 2) \oplus (V \lll 10) \oplus (V \lll 18) \oplus (V \lll 24)$
the linear transformation L is equivalently transformed into:
[Equation image not reproduced: equivalent form of L]
This calculation needs only 10 SIMD instructions in total: 4 bitwise XORs, 3 shuffles, 1 left shift, 1 right shift, and 1 bitwise OR.
Further, the reverse-order transformation R of SM4 is fused into the reverse arrangement of the messages.
The present invention provides a SIMD instruction based SM4 parallel computing apparatus comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the method of any one of the above.
The invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any of the above.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a parallel computing method for SM4 based on SIMD instructions, a device thereof, and a computer-readable storage medium. The method comprises: arranging a plurality of input SM4 message blocks to obtain arranged message blocks; performing an optimized SM4 encryption or decryption operation on the arranged message blocks; computing the Sbox operation in that process with a composite field technique instead of a table lookup; completing the inversion in GF(2^8) required by the composite field technique with the fast SIMD-based multiplication in GF(2^4) proposed by the invention; and performing a reverse arrangement on the result to obtain the ciphertext or plaintext corresponding to the message blocks. By combining the equivalent transformation of the SM4 nonlinear operation using the composite field technique, the equivalent transformation of the SM4 linear operation obtained by reordering operations and fusing several linear transformations, and the newly proposed fast SIMD-based multiplication in GF(2^4), the invention provides a new execution process for the SM4 block cipher that fits the SIMD instruction set well. It can encrypt or decrypt 4, 8, 16, 32, or more SM4 message blocks in parallel and improves the execution speed of SM4 encryption and decryption. The method does not depend on a specific hardware platform and can be implemented on any hardware that supports SIMD instructions.
Compared with the existing method of accelerating SM4 with AESNI instructions, the proposed method does not depend on the specific AESNI instructions, only on more general SIMD instructions. Compared with the existing SM4 implementation based on the bitslice technique and AVX2 instructions, the proposed method is more general: it supports parallel processing of 4, 8, 16, and 32 blocks and therefore suits more usage scenarios, whereas the bitslice and AVX2 based implementation requires 512 SM4 blocks to be processed simultaneously, which limits its usage scenarios. In addition, in tests in a real environment, the processing speed of the proposed SM4 calculation process increases in multiples as the number of messages processed in parallel grows. When 16 or 32 blocks are processed in parallel, the measured speed is 20 to 30 percent faster than that of the AESNI and bitslice methods.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a flow chart of a parallel computing method of SM4 based on SIMD instructions according to the present invention;
FIG. 2 is a schematic diagram of an exemplary message arrangement process;
FIG. 3 is a schematic diagram of an inversion operation process; and
FIG. 4 is a diagram illustrating a reverse arrangement process according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The invention aims to provide a parallel computing method for SM4 based on SIMD instructions, which comprises the following steps:
step S1: arrange the plurality of input message blocks to obtain arranged message blocks.
The method of the invention can process multiple message blocks in parallel. The following takes an implementation based on 128-bit SIMD instructions and the parallel processing of 4 message blocks as an example to show the message arrangement process (see FIG. 2), where $A_i, B_i, C_i, D_i$ ($i = 0, 1, 2, 3$) are each 32 bits, and the 4 message blocks of 128 bits are:
$(A_0, A_1, A_2, A_3), (B_0, B_1, B_2, B_3), (C_0, C_1, C_2, C_3), (D_0, D_1, D_2, D_3)$;
after the arrangement, 4 128-bit registers hold the arranged messages, and the contents of the 4 registers are:
$(A_0, B_0, C_0, D_0), (A_1, B_1, C_1, D_1), (A_2, B_2, C_2, D_2), (A_3, B_3, C_3, D_3)$.
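The arrangement described above is a transpose of the 32-bit words: register j ends up holding word j of every block, so one SIMD operation acts on the same word position of all blocks at once. A minimal Python sketch follows, with symbolic strings standing in for 32-bit words and plain tuples standing in for 128-bit registers (both are illustrative assumptions, not the patent's implementation).

```python
def arrange(blocks):
    """Transpose the blocks so that register j holds word j of every block."""
    return [tuple(block[j] for block in blocks) for j in range(4)]


# Four 128-bit blocks, each written as four 32-bit words.
blocks = [
    ("A0", "A1", "A2", "A3"),
    ("B0", "B1", "B2", "B3"),
    ("C0", "C1", "C2", "C3"),
    ("D0", "D1", "D2", "D3"),
]

registers = arrange(blocks)
for reg in registers:
    print(reg)
```

The same function handles 8 or 16 blocks unchanged, which corresponds to wider registers holding more lanes.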
step S2: for the arranged message blocks, based on SIMD instructions, use GF(2^4) in place of GF(2^8) to complete the Sbox substitution of SM4, thereby realizing the inversion operation in the round function and obtaining the result of the inversion operation;
the SIMD-based Sbox substitution of SM4 using GF(2^4) in place of GF(2^8) proceeds as follows:
the Sbox substitution replaces an 8-bit input u with an 8-bit output Sbox(u) according to a 256-byte substitution table, and can be written as:
$Sbox(u) = A \cdot (A \cdot u + C)^{-1} + C$
where A is an 8x8 matrix, C is an 8x1 matrix, u is an 8-bit number, and every element of A and C is an element of GF(2):
[Equation image not reproduced: values of the matrices A and C]
For example, for u = 0x7b, the column vector corresponding to u is $(1,1,0,1,1,1,1,0)^T$ (least significant bit first). First, v is obtained by computing $v = A \cdot u + C$:
[Equation image not reproduced: the product $A \cdot u + C$ written out]
The vector $(0,1,1,0,0,1,1,1)^T$ corresponds to v = 0xe6. In GF(2^8), the inverse of v = 0xe6 is $v^{-1}$ = 0xfe, whose vector form is $(0,1,1,1,1,1,1,1)^T$. Then $Sbox(u) = A \cdot v^{-1} + C$:
[Equation image not reproduced: the product $A \cdot v^{-1} + C$ written out]
The vector $(1,1,1,0,0,1,1,1)^T$ corresponds to Sbox(u) = 0xe7, which agrees with the Sbox table in the SM4 standard. This algebraic form is the mathematical basis for replacing GF(2^8) with GF(2^4).
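The worked example for u = 0x7b can be replayed in Python. The matrix images did not survive in this text, so the values of `A_ROWS` and `C_BITS` below are an assumption: they are recalled from the published algebraic description of the SM4 Sbox and have only been checked against the three intermediate values quoted above (0xe6, 0xfe, 0xe7). The inverse is found by brute force purely for clarity; this is not the composite field method of the patent.

```python
# ASSUMED values (not read from the patent figure). Each string lists one row
# of A, or the column C, least significant bit first, matching the vectors
# quoted in the text.
A_ROWS = [
    "11100101", "11110010", "01111001", "10111100",
    "01011110", "00101111", "10010111", "11001011",
]
C_BITS = "11001011"
F_POLY = 0x1F5  # f(x) = 1 + x^2 + x^4 + x^5 + x^6 + x^7 + x^8


def affine(u):
    """Return A.u + C over GF(2); bit i of the result comes from row i of A."""
    out = 0
    for i, row in enumerate(A_ROWS):
        bit = int(C_BITS[i])
        for j, entry in enumerate(row):
            bit ^= int(entry) & (u >> j) & 1
        out |= bit << i
    return out


def gf256_mul(a, b):
    """Shift-and-add multiplication in GF(2^8) modulo f(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= F_POLY
        b >>= 1
    return r


def gf256_inv(v):
    """Brute-force inverse in GF(2^8); 0 is mapped to 0 by convention."""
    return next((x for x in range(1, 256) if gf256_mul(v, x) == 1), 0)


def sbox(u):
    return affine(gf256_inv(affine(u)))


print(hex(affine(0x7B)), hex(gf256_inv(0xE6)), hex(sbox(0x7B)))
# prints: 0xe6 0xfe 0xe7
```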
The mathematical structure of the SM4 Sbox is defined as: GF(2^8), $f(x) = 1 + x^2 + x^4 + x^5 + x^6 + x^7 + x^8$.
The finite field is defined as: GF(2^4), $g(x) = 1 + x + x^4$.
The mathematical structure of the Sbox is isomorphic to a quadratic algebraic extension of this finite field:
[Equation image not reproduced: the isomorphism between GF(2^8) and GF((2^4)^2)]
$h(y) = 9 + y + y^2$
x and y are calculation data objects;
the isomorphic mapping matrix $M$ and the inverse isomorphic mapping matrix $M^{-1}$ are defined as:
[Equation image not reproduced: values of $M$ and $M^{-1}$]
Thus an element $v \in GF(2^8)$ is written as $v = a_0 + a_1 y$ with $a_0, a_1 \in GF(2^4)$. Since $v^{-1} \in GF(2^8)$, write $v^{-1} = b_0 + b_1 y$ with $b_0, b_1 \in GF(2^4)$; then from $(a_0 + a_1 y)(b_0 + b_1 y) = 1$:
[Equation image not reproduced: expressions for $b_0$ and $b_1$ in terms of $a_0$ and $a_1$]
That is, the inversion in GF(2^8) can be carried out in GF((2^4)^2). The calculation proceeds as follows:
for $u \in GF(2^8)$, apply the affine transformation to obtain $v = A \cdot u + C$;
apply the isomorphic mapping to $v$ to obtain $w = M \cdot v$;
perform the composite field inversion on $w$ to obtain $w^{-1} = Inv(w)$;
apply the inverse isomorphic mapping to $w^{-1}$ to obtain $s = M^{-1} \cdot w^{-1}$;
apply the affine transformation to $s$ to obtain $t = A \cdot s + C$.
Here $w^{-1} = Inv(w)$ denotes the inversion in GF((2^4)^2). Following the formulas for $b_0$ and $b_1$, it can be carried out by the process shown in FIG. 3; that is, the inversion in GF(2^8) is converted into addition, multiplication, squaring, and inversion operations in GF(2^4).
In one embodiment, the multiplication in GF(2^4) used by the inversion in GF((2^4)^2) is computed as follows. Let $GF(2^4)^*$ denote the multiplicative group of GF(2^4). The element $g = 0x02 \in GF(2^4)^*$ is a generator of $GF(2^4)^*$, that is, for every $e \in GF(2^4)^*$ there exists $i \le 15$ such that $e = g^i$. To compute $c = a \cdot b$ with $a, b, c \in GF(2^4)$, first compute $\log_g c = \log_g a + \log_g b$, then compute
$c = g^{\log_g c}$
where $\log_g a$ and $\log_g b$ are obtained by looking them up in a LOG lookup table containing 16 elements.
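The logarithm and exponent tables for $g = 0x02$ can be built and used as in the sketch below. Two details are assumptions of this sketch rather than statements from the text: the sum of logarithms is reduced modulo 15 (the order of the multiplicative group), and a zero operand is handled with an explicit branch, since 0 has no logarithm. How the patent's SIMD sequence treats these cases is not stated in this text.

```python
G_POLY = 0x13      # g(x) = 1 + x + x^4
GENERATOR = 0x02   # g = 0x02 generates the multiplicative group of GF(2^4)


def xtime(a):
    """Multiply by the generator 0x02 in GF(2^4)."""
    a <<= 1
    return a ^ G_POLY if a & 0x10 else a


# EXP[i] = g^i for i = 0..14; LOG is the 16-entry inverse table.
EXP = [1] * 15
for i in range(1, 15):
    EXP[i] = xtime(EXP[i - 1])
LOG = [0] * 16  # LOG[0] is a placeholder: 0 has no logarithm
for i, e in enumerate(EXP):
    LOG[e] = i


def gf16_mul_log(a, b):
    """c = g^(log a + log b), with zero operands handled separately."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 15]


def gf16_mul_ref(a, b):
    """Shift-and-add reference multiplication, used only for cross-checking."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a = xtime(a)
    return r


print(EXP)
print(all(gf16_mul_log(a, b) == gf16_mul_ref(a, b)
          for a in range(16) for b in range(16)))
```

Both tables have at most 16 one-byte entries, so each fits in a single 128-bit register and can be indexed with a byte shuffle.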
The above is the key optimization in the method of the invention: using the algebraic structure of the Sbox of the SM4 block cipher, the inversion in the Sbox is transferred to the isomorphic field GF((2^4)^2), and the proposed optimized multiplication in GF(2^4) accelerates the Sbox substitution of SM4. Because the key arrangement algorithm of SM4 shares the same Sbox substitution with the encryption and decryption process, this calculation can also be applied to the key arrangement algorithm to improve its speed. This is an important inventive point of the invention.
Step S3: perform a reverse arrangement on the result of the inversion operation to obtain the ciphertext corresponding to the message blocks. In the invention, the reverse-order transformation R of SM4 is fused into the reverse arrangement of the messages; this fusion adds no extra computation, which helps improve software execution speed.
Taking an implementation based on 128-bit SIMD instructions and the parallel processing of 4 message blocks as an example, the reverse arrangement process is shown in FIG. 4, where $A_i, B_i, C_i, D_i$ ($i = 32, 33, 34, 35$) are each 32 bits.
Following the message arrangement process, after the round function has been executed 32 times, the contents of the 4 registers are:
$(A_{32}, B_{32}, C_{32}, D_{32}), (A_{33}, B_{33}, C_{33}, D_{33}), (A_{34}, B_{34}, C_{34}, D_{34}), (A_{35}, B_{35}, C_{35}, D_{35})$;
after the reverse arrangement fused with the reverse-order transformation R, the contents of the 4 registers are:
$(A_{35}, A_{34}, A_{33}, A_{32}), (B_{35}, B_{34}, B_{33}, B_{32}), (C_{35}, C_{34}, C_{33}, C_{32}), (D_{35}, D_{34}, D_{33}, D_{32})$;
that is, the ciphertexts corresponding to the 4 plaintext blocks. Note that with common SIMD instruction sets, fusing the reverse-order transformation R adds no extra instructions.
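The fused step can be pictured as a transpose that reads the registers in reverse order, so R costs nothing beyond the transpose itself. The sketch below again uses symbolic strings for the 32-bit words and tuples for registers; both are illustrative assumptions.

```python
def dearrange_with_r(registers):
    """Undo the arrangement and apply R in one pass.

    registers[j][k] holds word X_{32+j} of block k; output block k is
    (X35, X34, X33, X32) of that block.
    """
    lanes = len(registers[0])
    return [tuple(registers[3 - j][k] for j in range(4)) for k in range(lanes)]


registers = [
    ("A32", "B32", "C32", "D32"),
    ("A33", "B33", "C33", "D33"),
    ("A34", "B34", "C34", "D34"),
    ("A35", "B35", "C35", "D35"),
]

for block in dearrange_with_r(registers):
    print(block)
```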
Addition in GF(2^4) is a bitwise XOR, and squaring and inversion can be done by constructing operation tables and using the shuffle instruction that is common in SIMD instruction sets. To achieve efficient computation, a fast method for multiplication in GF(2^4) is also needed. The invention proposes to use the generator of GF(2^4) together with a logarithm table and an exponent table to complete the multiplication quickly. With a SIMD instruction set, the multiplication needs only 7 instructions: a SIMD instruction set based on 128-bit registers completes 16 multiplications in GF(2^4) with 7 instructions, one based on 256-bit registers completes 32 multiplications with 7 instructions, and one based on 512-bit registers completes 64 multiplications with 7 instructions.
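To illustrate the table-plus-shuffle idea for squaring and inversion, the sketch below emulates a byte shuffle as a 16-entry lookup applied to 16 lanes at once. The `shuffle` helper is a simplified software model (it ignores details of real shuffle instructions, such as lanes being zeroed when the index byte has its high bit set), and the table-building code is an assumption for illustration, not taken from the patent.

```python
G_POLY = 0x13  # g(x) = 1 + x + x^4


def gf16_mul(a, b):
    """Shift-and-add multiplication in GF(2^4), used here only to build tables."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= G_POLY
    return r


# 16-entry operation tables; each fits in one 128-bit register, one byte per entry.
SQUARE = [gf16_mul(a, a) for a in range(16)]
INVERSE = [0] + [next(b for b in range(1, 16) if gf16_mul(a, b) == 1)
                 for a in range(1, 16)]


def shuffle(table, indices):
    """Emulate a byte shuffle used as a table lookup, one lookup per lane."""
    return [table[i & 0x0F] for i in indices]


lanes = list(range(16))           # 16 field elements, one per byte lane
squares = shuffle(SQUARE, lanes)  # one emulated instruction squares all lanes
inverses = shuffle(INVERSE, lanes)
print(squares)
print(inverses)
```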
In addition, the calculation of the linear transformation L in the round function is adjusted by exploiting the properties of L and of the SIMD instruction set, as follows:
the linear transformation L in the round function is defined as:
$L(V) = V \oplus (V \lll 2) \oplus (V \lll 10) \oplus (V \lll 18) \oplus (V \lll 24)$
the linear transformation L is equivalently transformed into:
[Equation image not reproduced: equivalent form of L]
This calculation needs only 10 SIMD instructions in total: 4 bitwise XORs, 3 shuffles, 1 left shift, 1 right shift, and 1 bitwise OR.
By exploiting the properties of the linear transformation L and of the SIMD instruction set to adjust how L is computed, the method completes L with fewer SIMD instructions and further improves the operation speed.
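The patent's equivalent form of L is in an image that is not reproduced here. The sketch below shows one regrouping that is consistent with the stated instruction count, offered as an assumption rather than as the patent's formula: rotations by 8, 16, and 24 bits are whole-byte moves (shuffles), and the remaining rotation by 2 is applied once to a combined value, $L(V) = V \oplus (V \lll 24) \oplus ((V \oplus (V \lll 8) \oplus (V \lll 16)) \lll 2)$.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def l_direct(v):
    """L as defined in the SM4 round function."""
    return v ^ rotl32(v, 2) ^ rotl32(v, 10) ^ rotl32(v, 18) ^ rotl32(v, 24)


def l_regrouped(v):
    """Regrouped L: 3 byte rotations, 4 XORs, 1 left shift, 1 right shift, 1 OR."""
    r8, r16, r24 = rotl32(v, 8), rotl32(v, 16), rotl32(v, 24)  # 3 shuffles in SIMD
    t = v ^ r8 ^ r16                                            # 2 XORs
    rot2 = ((t << 2) & MASK32) | (t >> 30)                      # shift, shift, OR
    return v ^ r24 ^ rot2                                       # 2 XORs


samples = [0x00000001, 0x80000000, 0x01234567, 0xDEADBEEF, 0xFFFFFFFF]
print(all(l_direct(v) == l_regrouped(v) for v in samples))
```

The equality holds because rotation distributes over XOR, so rotating the combined value by 2 yields the rotations by 2, 10, and 18 in one step.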
The invention proposes a parallel computing apparatus of SM4 based on SIMD instructions, comprising a processor and a memory storing a computer program which, when executed by the processor, implements the method of any one of the above.
The invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any of the above.
For convenience of description, the above devices are described as being functionally separated into various units and described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions that cause a computer device, which may be a personal computer, a server, or a network device, to execute the methods described in the embodiments or in some portions of the embodiments of the present application.
Finally, it should be noted that although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications and equivalents may be made without departing from the spirit and scope of the invention, and all such modifications and equivalents are intended to be covered by the claims of the invention.

Claims (5)

1. A parallel computing method for SM4 based on SIMD instructions, characterized in that it comprises the following steps:
step S1: arranging a plurality of input message blocks to obtain arranged message blocks;
$A_i, B_i, C_i, D_i$ ($i = 0, 1, 2, 3$) are each 32 bits, and the 4 message blocks of 128 bits are: $(A_0, A_1, A_2, A_3), (B_0, B_1, B_2, B_3), (C_0, C_1, C_2, C_3), (D_0, D_1, D_2, D_3)$;
after the arrangement, 4 128-bit registers hold the arranged messages, and the contents of the 4 registers are:
$(A_0, B_0, C_0, D_0), (A_1, B_1, C_1, D_1), (A_2, B_2, C_2, D_2), (A_3, B_3, C_3, D_3)$;
step S2: for the arranged message blocks, based on SIMD instructions, using GF(2^4) in place of GF(2^8) to complete the Sbox substitution of SM4, thereby realizing the inversion operation in the round function and obtaining the result of the inversion operation;
the SIMD-based Sbox substitution of SM4 using GF(2^4) in place of GF(2^8) proceeds as follows:
the Sbox substitution replaces an 8-bit input u with an 8-bit output Sbox(u) according to a 256-byte substitution table, and can be written as:
$Sbox(u) = A \cdot (A \cdot u + C)^{-1} + C$
where A is an 8x8 matrix, C is an 8x1 matrix, u is an 8-bit number, and every element of A and C is an element of GF(2):
[Equation image not reproduced: values of the matrices A and C]
the mathematical structure of the SM4 Sbox is defined as: GF(2^8), $f(x) = 1 + x^2 + x^4 + x^5 + x^6 + x^7 + x^8$.
The finite field is defined as: GF(2^4), $g(x) = 1 + x + x^4$.
The mathematical structure of the Sbox is isomorphic to a quadratic algebraic extension of this finite field:
[Equation image not reproduced: the isomorphism between GF(2^8) and GF((2^4)^2)]
$h(y) = 9 + y + y^2$
the isomorphic mapping matrix $M$ and the inverse isomorphic mapping matrix $M^{-1}$ are defined as:
[Equation image not reproduced: values of $M$ and $M^{-1}$]
Thus an element $v \in GF(2^8)$ is written as $v = a_0 + a_1 y$ with $a_0, a_1 \in GF(2^4)$. Since $v^{-1} \in GF(2^8)$, write $v^{-1} = b_0 + b_1 y$ with $b_0, b_1 \in GF(2^4)$; then from $(a_0 + a_1 y)(b_0 + b_1 y) = 1$:
[Equation image not reproduced: expressions for $b_0$ and $b_1$ in terms of $a_0$ and $a_1$]
That is, the inversion in GF(2^8) can be carried out in GF((2^4)^2). The calculation proceeds as follows:
for $u \in GF(2^8)$, apply the affine transformation to obtain $v = A \cdot u + C$;
apply the isomorphic mapping to $v$ to obtain $w = M \cdot v$;
perform the composite field inversion on $w$ to obtain $w^{-1} = Inv(w)$;
apply the inverse isomorphic mapping to $w^{-1}$ to obtain $s = M^{-1} \cdot w^{-1}$;
apply the affine transformation to $s$ to obtain $t = A \cdot s + C$.
Here u is an element of GF(2^8); the affine transformation of u gives the element v of GF(2^8); the isomorphic mapping takes v to the element w of GF((2^4)^2); the composite field inversion of w gives the element $w^{-1}$ of GF((2^4)^2); the inverse isomorphic mapping takes $w^{-1}$ to the element s of GF(2^8); and the final affine transformation of s gives the element t of GF(2^8). $w^{-1} = Inv(w)$ denotes the inversion in GF((2^4)^2);
the multiplication in $\mathrm{GF}(2^4)$ used by the inversion operation in $\mathrm{GF}((2^4)^2)$ is computed as follows: let $\mathrm{GF}(2^4)^*$ denote the nonzero elements of $\mathrm{GF}(2^4)$; $g = \mathtt{0x02} \in \mathrm{GF}(2^4)^*$ is a generator of $\mathrm{GF}(2^4)^*$, i.e. for every $e \in \mathrm{GF}(2^4)^*$ there exists $i \le 15$ such that $e = g^i$; to calculate $c = a \cdot b$ with $a, b, c \in \mathrm{GF}(2^4)$, first calculate $\log_g c = \log_g a + \log_g b$, then calculate
$c = g^{\log_g c}$,
wherein $\log_g a$ and $\log_g b$ are obtained by looking up a LOG lookup table containing 16 elements;
addition in $\mathrm{GF}(2^4)$ is performed by bitwise exclusive OR, and the squaring and inversion operations are performed by constructing operation tables and using the common shuffle instruction of the SIMD instruction set; using the generator, the logarithm table and the exponent table, the $\mathrm{GF}(2^4)$ multiplication is completed quickly and requires only 7 instructions of the SIMD instruction set;
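The logarithm and exponent table multiplication can be illustrated with a short scalar sketch. It assumes $g(x) = 1 + x + x^4$ and the generator 0x02 as stated above. The reduction of the exponent modulo 15 and the explicit handling of zero operands are assumptions added here so that the example is complete; the claim does not spell them out. In a SIMD implementation each table lookup would be one byte shuffle over 16 lanes rather than a Python list index.

```python
POLY = 0x13  # g(x) = 1 + x + x^4

def xtime(a):
    """Multiply a by the generator 0x02 (that is, by x) modulo g(x)."""
    a <<= 1
    if a & 0x10:
        a ^= POLY
    return a

# EXP[i] = g^i for i = 0..14; LOG inverts it on the nonzero elements.
EXP = [1]
for _ in range(14):
    EXP.append(xtime(EXP[-1]))

LOG = [0] * 16  # 16-entry table; LOG[0] is a dummy because log of 0 is undefined
for i, e in enumerate(EXP):
    LOG[e] = i

def gf16_mul(a, b):
    """c = a*b via log_g c = log_g a + log_g b (mod 15), then c = g^(log_g c)."""
    if a == 0 or b == 0:
        return 0  # assumed special case: zero has no logarithm
    return EXP[(LOG[a] + LOG[b]) % 15]

def gf16_mul_ref(a, b):
    """Shift-and-add multiplication, used only to cross-check the table method."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r
```

The table method replaces a data-dependent loop with two lookups, one addition and one more lookup, which is what makes it a good fit for shuffle-based SIMD code.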
step S3: performing the reverse arrangement calculation on the result of the inversion operation to obtain the ciphertext message corresponding to the grouped message.
2. The parallel computing method of SM4 based on SIMD instructions according to claim 1, wherein the calculation of the linear transformation L in the round function is adjusted, using the characteristics of the linear transformation L and of the SIMD instruction set, as follows:
the linear transformation L in the round function is defined as:
$L(B) = B \oplus (B \lll 2) \oplus (B \lll 10) \oplus (B \lll 18) \oplus (B \lll 24)$
the linear transformation L is equivalently transformed into:
Figure FDA0003541086010000033
the calculation requires only 10 SIMD instructions: 4 XORs, 3 shuffles, 1 left shift, 1 right shift, and 1 OR.
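One regrouping that matches the stated instruction count can be sketched as follows. This is an assumption for illustration, not a transcription of the equivalent form in the claim: rotations by 8, 16 and 24 bits move whole bytes, so each can be a single byte shuffle (for example PSHUFB on x86, which is itself an assumption about the target platform), and only one rotation by 2 bits is left, costing a left shift, a right shift and an OR. The function names are hypothetical and 32-bit Python integers stand in for SIMD lanes.

```python
MASK = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def l_standard(b):
    """SM4 linear transformation L as defined in the SM4 standard."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)

def l_regrouped(b):
    """Candidate regrouping: B ^ (B <<< 24) ^ ((B ^ (B <<< 8) ^ (B <<< 16)) <<< 2)."""
    r8 = rotl32(b, 8)     # shuffle 1 (whole-byte rotation)
    r16 = rotl32(b, 16)   # shuffle 2
    r24 = rotl32(b, 24)   # shuffle 3
    t = b ^ r8 ^ r16      # XOR 1, XOR 2
    t = ((t << 2) & MASK) | (t >> 30)  # left shift, right shift, OR
    return b ^ t ^ r24    # XOR 3, XOR 4
```

The regrouping works because rotation distributes over XOR: rotating $B \oplus (B \lll 8) \oplus (B \lll 16)$ by 2 bits yields the three terms $B \lll 2$, $B \lll 10$ and $B \lll 18$ at once.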
3. The parallel computing method of SM4 based on SIMD instructions according to any one of claims 1-2, wherein the calculation of the reverse-order transformation R in SM4 is merged into the message de-marshalling calculation process.
4. A parallel computing device for the parallel computing method of SM4 based on SIMD instructions, characterized in that it comprises a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the method of any one of claims 1-3.
5. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the parallel computing method of SM4 based on SIMD instructions according to any one of claims 1-3.
CN202010687106.9A 2020-07-16 2020-07-16 Parallel computing method and device of SM4 based on SIMD (Single instruction multiple data) instructions and readable storage medium Active CN111736902B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010687106.9A CN111736902B (en) 2020-07-16 2020-07-16 Parallel computing method and device of SM4 based on SIMD (Single instruction multiple data) instructions and readable storage medium

Publications (2)

Publication Number Publication Date
CN111736902A (en) 2020-10-02
CN111736902B (en) 2022-04-19

Family

ID=72654782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010687106.9A Active CN111736902B (en) 2020-07-16 2020-07-16 Parallel computing method and device of SM4 based on SIMD (Single instruction multiple data) instructions and readable storage medium

Country Status (1)

Country Link
CN (1) CN111736902B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507644B (en) * 2020-12-03 2021-05-14 湖北大学 Optimized SM4 algorithm linear layer circuit
CN113282947A (en) * 2021-07-21 2021-08-20 杭州安恒信息技术股份有限公司 Data encryption method and device based on SM4 algorithm and computer platform
CN114244496B (en) * 2021-12-01 2023-07-18 华南师范大学 SM4 encryption and decryption algorithm parallelization realization method based on tower domain optimization S box
CN114091086A (en) * 2022-01-14 2022-02-25 麒麟软件有限公司 Rapid realization method of SM4 algorithm based on bit slice

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106712930A (en) * 2017-01-24 2017-05-24 北京炼石网络技术有限公司 SM4 encryption method and device
CN108092760A (en) * 2016-11-22 2018-05-29 北京同方微电子有限公司 Co-processor device for a block cipher and non-linear transformation method
CN109417468A (en) * 2017-04-12 2019-03-01 北京炼石网络技术有限公司 Method and apparatus for secure and efficient block cipher implementation
CN110166223A (en) * 2019-05-22 2019-08-23 北京航空航天大学 Fast software implementation method of the national cryptographic algorithm SM4

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109600217A (en) * 2019-01-18 2019-04-09 江苏实达迪美数据处理有限公司 Method and processor for optimizing SM4 encryption and decryption in parallel operation modes
CN109981250B (en) * 2019-03-01 2020-04-07 北京海泰方圆科技股份有限公司 SM4 encryption and key expansion method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
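Fast software implementation techniques for SM4 (SM4的快速软件实现技术); 郎欢 et al.; 《中国科学院大学学报》 (Journal of University of Chinese Academy of Sciences); 2018-03-31; Vol. 35 (No. 2); 180-187 *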

Also Published As

Publication number Publication date
CN111736902A (en) 2020-10-02

Similar Documents

Publication Publication Date Title
CN111736902B (en) Parallel computing method and device of SM4 based on SIMD (Single instruction multiple data) instructions and readable storage medium
CN110166223B (en) Rapid implementation method of cryptographic block cipher algorithm SM4
CN113940028B (en) Method and device for realizing white box password
JP5711681B2 (en) Cryptographic processing device
US20050232430A1 (en) Security countermeasures for power analysis attacks
CN107257279B (en) Plaintext data encryption method and device
WO2020253108A1 (en) Information hiding method, apparatus, device, and storage medium
CN110880967B (en) Method for parallel encryption and decryption of multiple messages by adopting packet symmetric key algorithm
JP5612007B2 (en) Encryption key generator
Abdellatif et al. AES-GCM and AEGIS: efficient and high speed hardware implementations
JP5689826B2 (en) Secret calculation system, encryption apparatus, secret calculation apparatus and method, program
CN111314054B (en) Lightweight ECEG block cipher realization method, system and storage medium
CN109936437B (en) power consumption attack resisting method based on d +1 order mask
CN110266481B (en) Post-quantum encryption and decryption method and device based on matrix
CN114244496B (en) SM4 encryption and decryption algorithm parallelization realization method based on tower domain optimization S box
Lim et al. Differential fault attack on lightweight block cipher PIPO
CN113691364B (en) Encryption and decryption method of dynamic S-box block cipher based on bit slice technology
Kabulov et al. Gost R 34.12-2015 (Kuznechik) analysis of a cryptographic algorithm
CN105577362B (en) A kind of byte replacement method and system applied to aes algorithm
CN110224829B (en) Matrix-based post-quantum encryption method and device
CN112287333A (en) Lightweight adjustable block cipher implementation method, system, electronic device and readable storage medium
CN113922948B (en) SM4 data encryption method and system based on composite domain round function
KR102253211B1 (en) Computing Apparatus and Method for Hardware Implementation of Public-Key Cryptosystem Supporting Elliptic Curves over Prime Field and Binary Field
CN116915405B (en) Data processing method, device, equipment and storage medium based on privacy protection
JP2013205437A (en) Method and apparatus for calculating nonlinear function s-box

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant