CN114710285A - High-performance SM4 bit slice optimization method for heterogeneous parallel architecture - Google Patents
- Publication number
- CN114710285A (application CN202210542472.4A)
- Authority
- CN
- China
- Prior art keywords
- bit
- algorithm
- inversion
- multiplication
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/08—Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
- H04L9/0861—Generation of secret information including derivation or calculation of cryptographic keys or passwords
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/14—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols using a plurality of keys or algorithms
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a high-performance SM4 bit-slicing optimization method for heterogeneous parallel architectures, belonging to the technical field of cryptographic security applications. By implementing the SM4 block cipher at a data width of 1 bit, the method realizes multi-threaded SM4 on both non-vector and vector instruction sets, with the vector instruction sets supporting higher encryption speeds.
Description
Technical Field
The invention belongs to the technical field of cryptographic security applications and relates to a high-performance SM4 bit-slicing optimization method for heterogeneous parallel architectures.
Background
SM4 is the block cipher standard adopted in China's WAPI wireless LAN standard and was subsequently adopted as a Chinese commercial cipher standard. As the block cipher standard for commercial cryptography in China, SM4 is expected to gradually replace foreign block cipher standards such as 3DES and AES in sensitive but non-classified applications in China, where it is used for communication encryption, data encryption and similar purposes. SM4 is a symmetric cipher with a key length and a block length of 128 bits, and it outputs a 128-bit ciphertext.
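The operational symbols used and their meanings are given below:
$\oplus$: bitwise XOR of 32-bit words;
$\lll i$: cyclic left shift of a 32-bit word by $i$ bits;
mod: modulo operation;
The key expansion algorithm is as follows:
The standard SM4 algorithm has a word length of 32 bits. The encryption key is 128 bits long and is represented as 4 words $MK = (MK_0, MK_1, MK_2, MK_3)$; the round keys are represented as 32 words $(rk_0, rk_1, \ldots, rk_{31})$; the plaintext input is represented as 4 words $(X_0, X_1, X_2, X_3)$, and the ciphertext output is represented as $(Y_0, Y_1, Y_2, Y_3)$;
SM4 key expansion algorithm:
$(K_0, K_1, K_2, K_3) = (MK_0 \oplus FK_0, MK_1 \oplus FK_1, MK_2 \oplus FK_2, MK_3 \oplus FK_3)$
$rk_i = K_{i+4} = K_i \oplus T'(K_{i+1} \oplus K_{i+2} \oplus K_{i+3} \oplus CK_i), \quad i = 0, 1, \ldots, 31$
where $T'$ is a composite permutation function comprising a linear transformation $L'$ and a nonlinear transformation $\tau$, and $FK_i$, $CK_i$ are constants.
The encryption algorithm is as follows:
$X_{i+4} = F(X_i, X_{i+1}, X_{i+2}, X_{i+3}, rk_i) = X_i \oplus T(X_{i+1} \oplus X_{i+2} \oplus X_{i+3} \oplus rk_i), \quad i = 0, 1, \ldots, 31$
$(Y_0, Y_1, Y_2, Y_3) = (X_{35}, X_{34}, X_{33}, X_{32})$
where $T$ is a composite permutation function comprising a linear transformation $L$ and a nonlinear transformation $\tau$.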
Because SM4 was designed for low-power chips (namely WAPI chips), it was optimized to reduce the amount of hardware circuitry. As a result, software implementations of SM4 are inefficient: they find it difficult to fully utilize the computing power of mainstream 32-bit/64-bit general-purpose processors and are usually much slower than AES, a symmetric cipher of the same kind.
Besides a basic general-purpose SISD (Single Instruction, Single Data) instruction set, current mainstream CPUs also provide extended SIMD (Single Instruction, Multiple Data) instruction sets. For example, current mainstream x86 processors from Intel and AMD support the AVX/AVX2 instruction sets, and mobile processors of the ARM Cortex-A series support the NEON instruction set. Intel AVX2 is a 256-bit instruction set with 16 256-bit vector registers; Intel has also released the AVX-512 instruction set, which has 32 512-bit vector registers and can perform 16-way vector operations on 32-bit words. NEON is the SIMD instruction set of the ARM Cortex-A architecture; it has 16 128-bit SIMD registers, on which 4-way vector computations on 32-bit words can be performed.
Currently, both Intel/AMD x86 processors and ARM processors support non-vector and vector instruction set computation. Conventional parallel algorithms often run several SM4 instances simultaneously by directly using vector registers or multi-core processors. However, because the standard SM4 algorithm requires a relatively wide register (32 bits), even a 256-bit vector register can run at most 8 SM4 instances at the same time. GPU computation has a similar problem: for a single thread the GPU can hardly exceed the speed of the CPU, and the GPU's parallelism is limited by its stream processors, so it cannot process more threads simultaneously. In scenarios where the speed required of a single SM4 thread is low but the number of SM4 encryption threads is large, such algorithms offer insufficient parallelism and low overall encryption efficiency, so an algorithm that uses computing resources more fully and runs multi-threaded SM4 faster is urgently needed.
Disclosure of Invention
To realize a more efficient multi-threaded SM4 algorithm, the invention provides a software-implemented bit-slicing optimization of the SM4 block cipher that can be implemented on SISD instruction sets, SIMD instruction sets and GPUs. The invention can implement multi-threaded SM4 on either a non-vector or a vector instruction set, and the vector instruction set supports higher encryption speeds.
The invention provides a high-performance SM4 bit slice optimization method for a heterogeneous parallel architecture, which comprises the following steps:
1) dividing each 32-bit variable of the original standard SM4 algorithm, in order, into 32 variables with a word length of 1 bit;
2) converting the linear operations of the original standard SM4 algorithm, namely the XOR of 32-bit words and the cyclic left shift of 32-bit words, into XOR and transposition among the 32 one-bit variables defined in step 1);
3) decomposing the nonlinear operation of the original standard SM4 algorithm, the S-box, into matrix affine transformations and an inversion over the finite field $GF(2^8)$;
4) for the inversion over $GF(2^8)$, using a finite-field tower transformation to map $GF(2^8)$ isomorphically, by a matrix affine transformation, to the composite field $GF((2^4)^2)$, so that the inversion is converted into inversion and multiplication over $GF(2^4)$ and realized at a word length of 1 bit;
5) for the inversion over $GF(2^4)$, using a finite-field tower transformation to map $GF(2^4)$ isomorphically, by a matrix affine transformation, to the finite field $GF((2^2)^2)$, so that the inversion is converted into inversion and multiplication over $GF(2^2)$ and realized at a word length of 1 bit;
6) for the inversion over $GF(2^2)$, which is equivalent to leaving the high bit unchanged and setting the low bit to the XOR of the high bit and the low bit, realizing the computation at a word length of 1 bit;
7) according to steps 1) to 6) above, completing the whole SM4 computation using only a 1-bit word length and the XOR and AND operations, so that an X-bit register is used as a vector register of X one-bit lanes and X groups of the SM4 algorithm are computed in parallel as multiple threads.
The method is a software-implemented bit-slicing optimization of the SM4 block cipher. Through linear analysis of the algorithm, the linear operations of the original standard SM4 encryption algorithm on 32-bit words are converted into equivalent linear operations on 1-bit words; at the same time, by constructing isomorphic mappings between finite fields, the nonlinear operation of the original standard SM4 algorithm is mapped onto nonlinear operations on 1-bit words. In this way the whole SM4 algorithm is converted from an implementation based on a 32-bit word length to one based on a 1-bit word length. On a non-vector or vector instruction set platform with wider registers, each register is then divided bit by bit among different SM4 threads (for example, a 32-bit register can be divided into 32 single bits, giving 32 SM4 threads), which yields a multi-threaded SM4 algorithm with higher parallelism. In a specific implementation the algorithm no longer needs a wide register for its own computation, so on a computing platform with 32-bit, 64-bit or wider registers one register can simultaneously store data of 8, 16 or more threads, and the data of 8, 16 or more threads can be operated on simultaneously, realizing an SM4 algorithm with a higher thread count.
The core of the invention is that the SM4 block cipher is implemented at a data width of 1 bit and applied on a vector or non-vector instruction set with the registers divided bit by bit, which gives a multi-threaded SM4 algorithm whose parallelism is 32 times that of the standard SM4 algorithm (word length 32 bits) under the same conditions. Analysis shows that the proposed SM4 parallel optimization method can be implemented with vector instructions or on a GPU, and also on a CPU platform that does not support SIMD instructions.
Compared with the prior art, the invention has the following beneficial effects: 1) Constant-time algorithm: S-box table lookups are replaced by matrix operations, shifts, XOR and similar operations that execute in constant time, so the execution time is independent of the internal state of the algorithm, timing side-channel attacks are resisted, and the SM4 algorithm is more secure. 2) Higher thread parallelism: the whole algorithm needs only 1 bit of a register per instance, so with 32-bit and 64-bit registers 32 and 64 SM4 encryption/decryption instances can be executed directly in parallel, and with AVX-512 registers up to 512 SM4 encryption instances can be executed, which raises the overall encryption rate. In scenarios where the speed required of a single SM4 thread is low but the number of encryption threads is large, the overall efficiency is greatly improved. For computation on a GPU, assuming there are x parallel GPU threads and each thread uses 64-bit registers, each GPU thread can process 64 SM4 threads simultaneously, so the number of parallel SM4 threads is multiplied by 64.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
The invention provides a high-performance SM4 bit-slicing optimization method for heterogeneous parallel architectures that realizes the SM4 computation using only a 1-bit data width and only the XOR and AND logic operations. Compared with conventional software implementations of SM4, the algorithm can use an X-bit register as a vector register of X one-bit lanes and thus compute X groups of the SM4 algorithm in parallel, where X is the register width, which greatly increases the computation speed.
Specifically, the method of the invention adopts the following steps:
1) all variables in SM4 are divided into 1-bit data from low bit to high bit; for example, a 128-bit plaintext (processed as four 32-bit words) is regarded as 128 pieces of 1-bit data;
2) the XOR operations and cyclic shifts in SM4 are regarded as XOR and transposition among the 128 pieces of 1-bit data;
3) the only nonlinear operation of SM4, the S-box, is decomposed according to its mathematical principle into matrix affine transformations and an inversion over the finite field $GF(2^8)$;
4) through tower decomposition, the inversion over $GF(2^8)$ is isomorphically mapped to an inversion over the composite field $GF((2^4)^2)$ and further converted into inversion and multiplication over $GF(2^4)$;
5) the isomorphic mapping, the matrix operations and the multiplications in the specific finite field that arise in the decomposition of step 4) are all treated as matrix multiplications and computed at a data width of 1 bit;
6) the inversion over $GF(2^4)$ is decomposed by a further tower step, isomorphically mapped to $GF((2^2)^2)$, and further converted into inversion and multiplication over $GF(2^2)$;
7) the isomorphic mapping, the matrix operations and the multiplications in the specific finite field that arise in the decomposition of step 6) are all treated as matrix multiplications and computed at a data width of 1 bit;
8) the inversion over $GF(2^2)$ is equivalent to leaving the high bit unchanged and setting the low bit to the XOR of the high bit and the low bit, so it is computed at a data width of 1 bit;
9) combining all of the above, the whole SM4 computation is completed using only a 1-bit data width and the XOR and AND operations, so an X-bit register can be used as a vector register of X one-bit lanes and X groups of the SM4 algorithm are computed in parallel.
Variable representation of SM4 multithread optimization algorithm:
in the standard SM4 algorithm, the key is 16 bytes, the input plaintext is 16 bytes, and each type of intermediate state variable is a 4-byte word. In the method, a key and a plaintext are divided into 128 pieces of 1-bit data according to the high and low bits of a byte, and a variable word of an intermediate state is divided into 32 pieces of 1-bit data according to the high and low bits.
All variables of the invention are assumed to be divided into 1-bit widths as follows:
A 32-bit word: the basic operation word of SM4 is 4 bytes and is divided, from high bit to low bit, into 32 pieces of 1-bit data $data_q$, $q = 0, \ldots, 31$, where $data$ denotes the name of the 32-bit variable and $q$ denotes the bit index.
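The variable representation above can be illustrated with a small packing routine. This is only a sketch under stated assumptions: the names `bitslice_pack` and `bitslice_unpack` are hypothetical, 64 independent SM4 instances share 64-bit registers, and bit index `q = 0` is taken to be the least significant bit (the text numbers bits from high to low, which only mirrors the index).

```c
#include <stdint.h>

/* Pack 64 independent 32-bit words into 32 slices.
 * Bit t of slice[q] is bit q of words[t] (q = 0 is the LSB, an assumption). */
void bitslice_pack(const uint32_t words[64], uint64_t slice[32])
{
    for (int q = 0; q < 32; q++) {
        uint64_t s = 0;
        for (int t = 0; t < 64; t++)
            s |= (uint64_t)((words[t] >> q) & 1u) << t;
        slice[q] = s;
    }
}

/* Inverse operation: rebuild the 64 words from the 32 slices. */
void bitslice_unpack(const uint64_t slice[32], uint32_t words[64])
{
    for (int t = 0; t < 64; t++) {
        uint32_t w = 0;
        for (int q = 0; q < 32; q++)
            w |= (uint32_t)((slice[q] >> t) & 1u) << q;
        words[t] = w;
    }
}
```

After packing, `slice[q]` plays the role of the 1-bit variable with index `q`, and each of its 64 bit positions belongs to a different SM4 thread.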
Linear operation of SM4 multithread optimization algorithm:
and (3) XOR calculation:
The SM4 multithread optimization method of the invention converts every XOR between 32-bit words in the original standard SM4 algorithm into XORs at a length of 1 bit;
the calculation process is as follows:
3) XOR the aligned bits pairwise, $c_q = a_q \oplus b_q$ for $q = 0, \ldots, 31$, where $a_q$ and $b_q$ are the 1-bit data of the two input words; the result is again exactly a 32-bit state word divided into 1-bit data.
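As a minimal sketch of the aligned XOR, assuming each 1-bit variable is stored in a 64-bit register that carries 64 independent SM4 instances (the function name `bs_xor32` is hypothetical):

```c
#include <stdint.h>

/* XOR of two bit-sliced 32-bit words: slice q of the result is the XOR of
 * slice q of both inputs, for all 64 lanes at once. */
void bs_xor32(const uint64_t a[32], const uint64_t b[32], uint64_t c[32])
{
    for (int q = 0; q < 32; q++)
        c[q] = a[q] ^ b[q];
}
```

One 64-bit XOR instruction therefore performs the same bit position of the word XOR for 64 instances.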
Linear transformation function:
The linear transformation functions in standard SM4 consist of XOR and cyclic-shift computations, where $L$ has the following format:
$L(B) = B \oplus (B \lll i_1) \oplus (B \lll i_2) \oplus (B \lll i_3) \oplus (B \lll i_4)$
In the above formula, $i_1, \ldots, i_4$ denote the four cyclic left-shift amounts (note: they are not the same as the index used in the XOR calculation); $B$ is a 32-bit word, and $L$ and $L'$ have the same format and structure, the difference being that for $L$ the shift amounts are 2, 10, 18, 24 in order, while for $L'$ they are 13, 23, 0, 0 in order. According to the variable representation of the SM4 multithread optimization algorithm, $B$ is divided into 32 pieces of 1-bit data, and the output $C = L(B)$ must also consist of 32 pieces of 1-bit data in the same format. Each output bit can therefore be expressed as the XOR of the input bits at the corresponding positions, i.e.:
$C_q = B_q \oplus B_{(q+i_1)\%32} \oplus B_{(q+i_2)\%32} \oplus B_{(q+i_3)\%32} \oplus B_{(q+i_4)\%32}$
where $B_q$ denotes the 1-bit data of $B$ numbered from the high bit to the low bit, $B_{(q+i)\%32}$ denotes the bit $i$ positions after $B_q$ (assuming that the bit following the last bit is the first bit), the index depends on the cyclic-shift amount and on the output position, and % is the modulo operation, i.e. the remainder after division by 32.
The calculation process is as follows:
3) According to the formula, for a given $q$, find the bit data of $B$ involved in the XOR operation and calculate $C_q$ as above.
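The index permutation described above can be sketched as follows. The names `linear_scalar` and `linear_bitsliced` are hypothetical, and the sketch assumes that slice index `q = 0` is the least significant bit, so a left rotation by `r` reads input slice `(q - r) mod 32`; with the high-to-low numbering used in the text the offset has the opposite sign. The shift amounts are passed in so that both 2, 10, 18, 24 and 13, 23, 0, 0 can be used.

```c
#include <stdint.h>

/* Scalar reference: B ^ (B <<< s0) ^ (B <<< s1) ^ (B <<< s2) ^ (B <<< s3). */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    r &= 31u;
    return r ? (x << r) | (x >> (32u - r)) : x;
}

uint32_t linear_scalar(uint32_t b, const unsigned s[4])
{
    return b ^ rotl32(b, s[0]) ^ rotl32(b, s[1]) ^ rotl32(b, s[2]) ^ rotl32(b, s[3]);
}

/* Bit-sliced form: in[q] holds bit q of up to 64 independent words.
 * Rotation costs nothing: it only changes which slice is read. */
void linear_bitsliced(const uint64_t in[32], uint64_t out[32], const unsigned s[4])
{
    for (unsigned q = 0; q < 32; q++) {
        uint64_t acc = in[q];
        for (int k = 0; k < 4; k++)
            acc ^= in[(q + 32u - s[k]) % 32u];   /* each s[k] must be below 32 */
        out[q] = acc;
    }
}
```

With shift amounts 13, 23, 0, 0 the two zero-shift terms cancel one copy of the input, which leaves the three-term form of the key-schedule transformation.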
Nonlinear S-box operation of a multithread optimization algorithm:
s box mathematical principle:
The S-box is the only nonlinear component of the standard SM4 algorithm and is used in the nonlinear transformation $\tau$. The S-box is a substitution table with 8-bit input and 8-bit output, as follows:
0xd6,0x90,0xe9,0xfe,0xcc,0xe1,0x3d,0xb7,0x16,0xb6,0x14,0xc2,0x28,0xfb,0x2c,0x05, 0x2b,0x67,0x9a,0x76,0x2a,0xbe,0x04,0xc3,0xaa,0x44,0x13,0x26,0x49,0x86,0x06,0x99, 0x9c,0x42,0x50,0xf4,0x91,0xef,0x98,0x7a,0x33,0x54,0x0b,0x43,0xed,0xcf,0xac,0x62, 0xe4,0xb3,0x1c,0xa9,0xc9,0x08,0xe8,0x95,0x80,0xdf,0x94,0xfa,0x75,0x8f,0x3f,0xa6, 0x47,0x07,0xa7,0xfc,0xf3,0x73,0x17,0xba,0x83,0x59,0x3c,0x19,0xe6,0x85,0x4f,0xa8, 0x68,0x6b,0x81,0xb2,0x71,0x64,0xda,0x8b,0xf8,0xeb,0x0f,0x4b,0x70,0x56,0x9d,0x35, 0x1e,0x24,0x0e,0x5e,0x63,0x58,0xd1,0xa2,0x25,0x22,0x7c,0x3b,0x01,0x21,0x78,0x87, 0xd4,0x00,0x46,0x57,0x9f,0xd3,0x27,0x52,0x4c,0x36,0x02,0xe7,0xa0,0xc4,0xc8,0x9e, 0xea,0xbf,0x8a,0xd2,0x40,0xc7,0x38,0xb5,0xa3,0xf7,0xf2,0xce,0xf9,0x61,0x15,0xa1, 0xe0,0xae,0x5d,0xa4,0x9b,0x34,0x1a,0x55,0xad,0x93,0x32,0x30,0xf5,0x8c,0xb1,0xe3, 0x1d,0xf6,0xe2,0x2e,0x82,0x66,0xca,0x60,0xc0,0x29,0x23,0xab,0x0d,0x53,0x4e,0x6f, 0xd5,0xdb,0x37,0x45,0xde,0xfd,0x8e,0x2f,0x03,0xff,0x6a,0x72,0x6d,0x6c,0x5b,0x51, 0x8d,0x1b,0xaf,0x92,0xbb,0xdd,0xbc,0x7f,0x11,0xd9,0x5c,0x41,0x1f,0x10,0x5a,0xd8, 0x0a,0xc1,0x31,0x88,0xa5,0xcd,0x7b,0xbd,0x2d,0x74,0xd0,0x12,0xb8,0xe5,0xb4,0xb0, 0x89,0x69,0x97,0x4a,0x0c,0x96,0x77,0x7e,0x65,0xb9,0xf1,0x09,0xc5,0x6e,0xc6,0x84, 0x18,0xf0,0x7d,0xec,0x3a,0xdc,0x4d,0x20,0x79,0xee,0x5f,0x3e,0xd7,0xcb,0x39,0x48。
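The S-box substitution in the standard algorithm is in fact a transformation of elements of the finite field $GF(2^8)$, whose elements are polynomials of degree at most 7. It can be described mathematically as $S(x) = A \cdot I(A \cdot x + C) + C$, where $C$ is a constant 8-bit vector and $A$ is the following 8×8 binary matrix: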
{1,1,1,0,0,1,0,1},
{1,1,1,1,0,0,1,0},
{0,1,1,1,1,0,0,1},
{1,0,1,1,1,1,0,0},
{0,1,0,1,1,1,1,0},
{0,0,1,0,1,1,1,1},
{1,0,0,1,0,1,1,1},
{1,1,0,0,1,0,1,1},
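Let $x$ be an 8-bit binary number, identified with its corresponding polynomial. With the primitive polynomial $f(t) = t^8 + t^7 + t^6 + t^5 + t^4 + t^2 + 1$, $I(x)$ denotes the inverse of $x$ in the polynomial field modulo $f$. If $x = 0$, its inverse is defined as $I(0) = 0$.
The inverse operation satisfies the relation $x \cdot I(x) \equiv 1 \pmod{f}$ for $x \neq 0$.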
The S-box operations comprise 8×8 matrix operations, 8-bit vector XOR operations, and an inversion over an 8-bit finite field. The matrix multiplication and the vector XOR are binary linear operations and are easily decomposed, according to the relations between the variables, into computations at a word length of 4 bits: the matrix is first divided into 8 column vectors and the 8-bit vector is then split into its high and low 4 bits, so that the matrix-vector operation can be realized on 4-bit registers. The inversion over the finite field $GF(2^8)$ is a nonlinear operation that cannot be implemented directly on a 1-bit register and requires an isomorphic mapping of the finite field.
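Taken down to single bits, a binary affine map needs only XOR. The sketch below is an illustration under assumptions, not the patent's code: the names `affine_scalar` and `affine_bitsliced` are hypothetical, the matrix rows are copied from the listing above, and the correspondence between vector index and bit significance is an assumption because the text does not state it. The constant vector is supplied by the caller.

```c
#include <stdint.h>

/* Rows of the 8x8 binary matrix as listed in the text; M[i][j] selects
 * input bit j for output bit i (index convention assumed). */
static const uint8_t M[8][8] = {
    {1,1,1,0,0,1,0,1}, {1,1,1,1,0,0,1,0}, {0,1,1,1,1,0,0,1}, {1,0,1,1,1,1,0,0},
    {0,1,0,1,1,1,1,0}, {0,0,1,0,1,1,1,1}, {1,0,0,1,0,1,1,1}, {1,1,0,0,1,0,1,1},
};

/* Scalar reference on one vector of 8 bits: y = M*x + c over GF(2). */
void affine_scalar(const uint8_t x[8], const uint8_t c[8], uint8_t y[8])
{
    for (int i = 0; i < 8; i++) {
        y[i] = c[i];
        for (int j = 0; j < 8; j++)
            y[i] ^= (uint8_t)(M[i][j] & x[j]);
    }
}

/* Bit-sliced form: x[j] carries bit j of 64 independent bytes.
 * A constant 1 becomes an all-ones register; a matrix entry 1 becomes an XOR. */
void affine_bitsliced(const uint64_t x[8], const uint8_t c[8], uint64_t y[8])
{
    for (int i = 0; i < 8; i++) {
        uint64_t acc = c[i] ? ~(uint64_t)0 : 0;
        for (int j = 0; j < 8; j++)
            if (M[i][j])
                acc ^= x[j];
        y[i] = acc;
    }
}
```

Because the matrix is constant, the inner `if` disappears after unrolling and each output slice is a fixed chain of XOR instructions.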
The polynomial field $GF(2^8)$ defined by the degree-8 primitive polynomial can be mapped isomorphically to the composite field $GF((2^4)^2)$.
Composite field definition: every element of $GF((2^4)^2)$ can be expressed as a polynomial of degree at most 1 and regarded as an element of the field modulo a degree-2 polynomial, where the coefficients of the polynomials lie in $GF(2^4)$, the polynomial field defined modulo a degree-4 polynomial.
The $GF((2^4)^2)$ used by the invention is such a degree-2 modulus field whose polynomial coefficients are defined in the modulus field $GF(2^4)$. In particular, the $GF(2^8)$ of the SM4 algorithm is defined modulo $f$; if there is an isomorphic mapping from $GF(2^8)$ to $GF((2^4)^2)$, the inversion over $GF(2^8)$ is equivalent to the inversion over $GF((2^4)^2)$.
The specific isomorphic mapping and inversion operations are as follows:
1) By matrix multiplication, construct the isomorphic mapping from $GF(2^8)$ to $GF((2^4)^2)$. For one example choice of the modulus polynomials of $GF(2^4)$ and $GF((2^4)^2)$, there is an isomorphic mapping from $GF(2^8)$ to $GF((2^4)^2)$ whose matrix is:
{0,1,0,1,1,1,1,0,},
{0,1,1,1,1,1,0,0,},
{1,1,0,1,0,0,0,0,},
{0,1,0,1,0,0,0,0,},
{1,1,0,1,1,1,0,0,},
{0,1,1,0,1,1,0,0,},
{1,1,1,1,0,1,1,0,},
{0,1,0,1,1,1,1,1,},
Let $a_h y + a_l$ be an element of $GF((2^4)^2)$ with $a_h, a_l \in GF(2^4)$; if $b_h y + b_l$ is the inverse of this element, the following equation must hold: $(a_h y + a_l)(b_h y + b_l) \equiv 1$ modulo the degree-2 modulus polynomial.
Expanding the product and comparing coefficients expresses $b_h$ and $b_l$ through one inversion and several multiplications in $GF(2^4)$.
Through the isomorphic mapping, the inversion over $GF(2^8)$ becomes an inversion over $GF((2^4)^2)$, and with the inversion formula chosen by the invention the inversion over $GF((2^4)^2)$ further becomes the inversion of one element over $GF(2^4)$ together with multiplications over $GF(2^4)$.
After the first isomorphic mapping, the inversion problem in $GF(2^8)$ is reduced to inversion and multiplication operations in $GF(2^4)$. A further isomorphic mapping can then be applied to move the computation in $GF(2^4)$ to $GF((2^2)^2)$.
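The reduction from an inversion in the composite field to one inversion plus multiplications in $GF(2^4)$ can be checked with a byte-level sketch. Everything specific here is an assumption, since the polynomials of the embodiment are not recoverable from the text: $GF(2^4)$ is taken modulo $x^4 + x + 1$, the extension modulo $y^2 + y + \lambda$ with $\lambda = x^3$ (hex 8), and all function names are hypothetical. For this modulus the inverse of $a_h y + a_l$ is $(a_h d^{-1})\,y + (a_h + a_l)\,d^{-1}$ with $d = a_h^2 \lambda + a_h a_l + a_l^2$.

```c
#include <stdint.h>

#define LAMBDA 0x8  /* assumed constant x^3; it makes y^2 + y + LAMBDA irreducible */

/* GF(2^4) multiplication modulo x^4 + x + 1 (assumed modulus). */
uint8_t gf16_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 4; i++) {
        if (b & 1) r ^= a;
        b >>= 1;
        a = (uint8_t)(a << 1);
        if (a & 0x10) a ^= 0x13;          /* reduce by x^4 + x + 1 */
    }
    return r & 0x0F;
}

/* GF(2^4) inversion as a^14; maps 0 to 0. */
uint8_t gf16_inv(uint8_t a)
{
    uint8_t a2 = gf16_mul(a, a);
    uint8_t a4 = gf16_mul(a2, a2);
    uint8_t a8 = gf16_mul(a4, a4);
    return gf16_mul(gf16_mul(a8, a4), a2);
}

/* Multiplication in GF((2^4)^2); a byte stores (a_h << 4) | a_l. */
uint8_t gf256c_mul(uint8_t a, uint8_t b)
{
    uint8_t ah = a >> 4, al = a & 0xF, bh = b >> 4, bl = b & 0xF;
    uint8_t hh = gf16_mul(ah, bh);
    uint8_t hi = hh ^ gf16_mul(ah, bl) ^ gf16_mul(al, bh);
    uint8_t lo = gf16_mul(hh, LAMBDA) ^ gf16_mul(al, bl);
    return (uint8_t)((hi << 4) | lo);
}

/* Inversion in GF((2^4)^2): one GF(2^4) inversion plus GF(2^4) multiplications. */
uint8_t gf256c_inv(uint8_t a)
{
    uint8_t ah = a >> 4, al = a & 0xF;
    uint8_t d = gf16_mul(gf16_mul(ah, ah), LAMBDA) ^ gf16_mul(ah, al) ^ gf16_mul(al, al);
    uint8_t di = gf16_inv(d);
    return (uint8_t)((gf16_mul(ah, di) << 4) | gf16_mul(ah ^ al, di));
}
```

This byte-level version only checks the algebra; a bit-sliced implementation would express the same multiplications with AND and XOR on 1-bit slices.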
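1) By matrix multiplication, construct the isomorphic mapping from $GF(2^4)$ to $GF((2^2)^2)$. As an example, the coefficients are defined in the finite field $GF(2^2)$, a degree-2 modulus polynomial over these coefficients forms the polynomial finite field $GF((2^2)^2)$, and there is an isomorphic mapping between it and the finite field $GF(2^4)$;
Let $a_h z + a_l$ be an element of $GF((2^2)^2)$ with $a_h, a_l \in GF(2^2)$; if $b_h z + b_l$ is the inverse of this element, the following equation must hold: $(a_h z + a_l)(b_h z + b_l) \equiv 1$ modulo the degree-2 modulus polynomial.
Expanding the product and comparing coefficients expresses $b_h$ and $b_l$ through one inversion and several multiplications in $GF(2^2)$.
As can be seen from the above, the isomorphic mapping from $GF(2^4)$ to $GF((2^2)^2)$ turns the inversion over $GF(2^4)$ into inversion and multiplication operations over $GF(2^2)$;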
3) The $GF(2^4)$ multiplication can be converted into a matrix-vector multiplication and implemented at a word length of 1 bit. Specifically, the two 4-bit elements to be multiplied are regarded as two vectors {x3, x2, x1, x0}; one of them is rewritten in matrix form and multiplied by the other. Assuming the vector to be rewritten is {x3, x2, x1, x0}, the entries of the resulting matrix are XOR combinations of x3, x2, x1 and x0.
4) The $GF(2^2)$ multiplication can likewise use a matrix operation. Specifically, the two 2-bit elements to be multiplied are regarded as two vectors {x1, x0}; one of them is rewritten in matrix form and multiplied by the other. Assuming the vector to be rewritten is {x1, x0}, the entries of the resulting matrix are XOR combinations of x1 and x0.
The $GF(2^2)$ inversion, in turn, can be implemented in the following table-lookup manner at a word length of 1 bit:
For example, as an embodiment, in the finite field $GF(2^2)$ defined modulo the polynomial $x^2 + x + 1$ an element is only 2 bits long, and the inverse relation is: 00 → 00, 01 → 01, 10 → 11, 11 → 10.
For any element $a$ of $GF(2^2)$ we always have $a^{-1} = a^2$, so the inversion can be realized by squaring. The formula can be realized at a word length of 1 bit: the high bit is unchanged, and the low bit equals the high bit XOR the low bit.
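The lowest tower level can be sketched directly on bit slices, using only AND and XOR. The following is an illustration under assumptions rather than the patent's implementation: all type and function names are hypothetical, $GF(2^2)$ is taken modulo $x^2 + x + 1$, and $GF((2^2)^2)$ is assumed to use the modulus $z^2 + z + \varphi$ with $\varphi = x$ (binary 10). Each `uint64_t` carries one bit of 64 independent elements.

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } bs_gf4;   /* one GF(2^2) element per bit lane */
typedef struct { bs_gf4 h, l; } bs_gf16;      /* h*z + l in GF((2^2)^2), per lane */

/* Inversion in GF(2^2): high bit unchanged, low bit = high XOR low. */
static bs_gf4 bs_gf4_inv(bs_gf4 a) { bs_gf4 r = { a.hi, a.hi ^ a.lo }; return r; }

static bs_gf4 bs_gf4_add(bs_gf4 a, bs_gf4 b) { bs_gf4 r = { a.hi ^ b.hi, a.lo ^ b.lo }; return r; }

/* Multiplication in GF(2^2) modulo x^2 + x + 1. */
static bs_gf4 bs_gf4_mul(bs_gf4 a, bs_gf4 b)
{
    uint64_t hh = a.hi & b.hi;
    bs_gf4 r = { hh ^ (a.hi & b.lo) ^ (a.lo & b.hi), hh ^ (a.lo & b.lo) };
    return r;
}

/* Multiplication by the assumed constant phi = x: (a1, a0) -> (a1 ^ a0, a1). */
static bs_gf4 bs_gf4_mul_phi(bs_gf4 a) { bs_gf4 r = { a.hi ^ a.lo, a.hi }; return r; }

/* Multiplication in GF((2^2)^2) with z^2 = z + phi. */
bs_gf16 bs_gf16_mul(bs_gf16 a, bs_gf16 b)
{
    bs_gf4 hh = bs_gf4_mul(a.h, b.h);
    bs_gf16 r;
    r.h = bs_gf4_add(hh, bs_gf4_add(bs_gf4_mul(a.h, b.l), bs_gf4_mul(a.l, b.h)));
    r.l = bs_gf4_add(bs_gf4_mul_phi(hh), bs_gf4_mul(a.l, b.l));
    return r;
}

/* Inversion in GF((2^2)^2): d = a_h^2*phi + a_h*a_l + a_l^2, then scale by 1/d. */
bs_gf16 bs_gf16_inv(bs_gf16 a)
{
    bs_gf4 d = bs_gf4_add(bs_gf4_mul_phi(bs_gf4_mul(a.h, a.h)),
                          bs_gf4_add(bs_gf4_mul(a.h, a.l), bs_gf4_mul(a.l, a.l)));
    bs_gf4 di = bs_gf4_inv(d);
    bs_gf16 r;
    r.h = bs_gf4_mul(a.h, di);
    r.l = bs_gf4_mul(bs_gf4_add(a.h, a.l), di);
    return r;
}
```

The whole 4-bit inversion is a fixed sequence of AND and XOR instructions with no data-dependent branch or table index, which is what makes the constant-time property described earlier possible.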
In summary, the inversion over $GF(2^8)$ can be converted into inversion and multiplication operations over $GF(2^4)$, the inversion over $GF(2^4)$ can be converted into inversion operations over $GF(2^2)$, and the multiplications involved can be rewritten as matrix-vector multiplications, which can be implemented at a 1-bit word length. With this adaptation, both the nonlinear and the linear parts of the standard SM4 are carried out at a 1-bit word length.
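The rewriting of a field multiplication as a matrix-vector product can be sketched for $GF(2^4)$ as follows. The modulus $x^4 + x + 1$ is an assumption, since the polynomial of the embodiment is not recoverable from the text, and the function names are hypothetical. Column $j$ of the matrix is the first operand multiplied by $x^j$, so every matrix entry is an XOR of bits of that operand.

```c
#include <stdint.h>

/* Reference: shift-and-reduce multiplication modulo x^4 + x + 1 (assumed). */
uint8_t gf16_mul_ref(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 4; i++) {
        if (b & 1) r ^= a;
        b >>= 1;
        a = (uint8_t)(a << 1);
        if (a & 0x10) a ^= 0x13;
    }
    return r & 0x0F;
}

/* Rewrite operand a as a 4x4 binary matrix; col[j] holds column j = a * x^j. */
void gf16_mul_matrix(uint8_t a, uint8_t col[4])
{
    for (int j = 0; j < 4; j++) {
        col[j] = a;
        a = (uint8_t)(a << 1);
        if (a & 0x10) a ^= 0x13;
    }
}

/* Matrix-vector product over GF(2): XOR the columns selected by the bits of b. */
uint8_t gf16_mul_matvec(uint8_t a, uint8_t b)
{
    uint8_t col[4], r = 0;
    gf16_mul_matrix(a, col);
    for (int j = 0; j < 4; j++)
        if ((b >> j) & 1)
            r ^= col[j];
    return r;
}
```

In a bit-sliced setting each matrix entry and each vector component is a 1-bit variable, so an output bit is an XOR of AND terms of the two operands' bits.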
S-box lookup is usually one of the most time-consuming operations in a block cipher. The bit-sliced S-box computation was measured using x64 instructions; as Table 1 shows, the bit-slicing algorithm has a significant advantage in multi-threaded computation, with a speed-up of close to 4 times. For the complete SM4 algorithm, the advantage of the slicing algorithm is shown in Table 2.
Table 1: slicing algorithm and table lookup efficiency comparison of S-box of SM4
The query speed in the last column of Table 1 is the average number of S-box queries per second, and the execution count is the number of runs of the test program. One run of the slicing method performs 64 S-box queries, 64 times as many as the ordinary method, while the time required grows by less than 20 times (21.288 s / 1.287 s = 16.54 times), so the slicing method is clearly faster than the conventional S-box table-lookup algorithm.
Table 2: Speed comparison of the SM4 slicing algorithm and the standard implementation
Algorithm | Number of blocks executed | Total time | Data volume | Average throughput |
---|---|---|---|---|
Standard SM4 implementation | 12800000 | 2.451 s | 1638400000 bit | 668 Mb/s |
Slicing SM4 implementation | 12800000 | 0.991 s | 1638400000 bit | 1,655 Mb/s |
As can be seen from Table 2, the sliced SM4 implementation takes less time and achieves a higher throughput than the standard SM4 implementation.
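The arithmetic behind these comparisons can be reproduced with a few helper functions. The function names are hypothetical; the inputs are the figures quoted above (21.288 s against 1.287 s for 64 times the S-box work, and 12800000 blocks of 128 bits in 2.451 s and 0.991 s).

```c
#include <stdint.h>

/* Ratio of two run times. */
double time_ratio(double slow_s, double fast_s)
{
    return slow_s / fast_s;
}

/* Per-operation speed-up when the sliced run does `lanes` times the work
 * in `sliced_s` seconds and the plain run takes `plain_s` seconds. */
double per_op_speedup(unsigned lanes, double sliced_s, double plain_s)
{
    return (double)lanes / (sliced_s / plain_s);
}

/* Throughput in Mb/s for `blocks` 128-bit blocks processed in `seconds`. */
double throughput_mbps(uint64_t blocks, double seconds)
{
    return (double)blocks * 128.0 / seconds / 1e6;
}
```

For example, `per_op_speedup(64, 21.288, 1.287)` evaluates to about 3.87, which corresponds to the "close to 4 times" figure for the S-box, and `time_ratio(2.451, 0.991)` gives a speed-up of roughly 2.47 for the complete algorithm.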
By this method, the standard SM4 algorithm is transformed into an SM4 algorithm whose input, output, intermediate states and computations can all be realized in 1-bit registers, with each variable occupying only 1 bit. On a 32-bit computing platform, a 32-bit register can hold the values of one variable for 32 different groups, and each operation processes the 32 groups in the register simultaneously, so that 32 SM4 threads are computed in parallel on the 32-bit platform. On a wider computing platform (e.g., 64 bits), synchronous SM4 computation with more threads is supported.
Implementing on the SIMD instruction set and the GPU:
Given two vectors whose elements are words, the vector-instruction computation of the invention is given below:
example 1: the ARM/NEON instruction set is realized as follows:
ARM processors are the mainstream processors used in smart mobile devices such as mobile phones. The most widely deployed Cortex-A series ARM architecture includes the NEON SIMD instruction set in addition to the ARM general instruction set (ARMv7 instruction set). The NEON instruction set provides 16 128-bit SIMD registers and can compute on 4 lanes of 32-bit words in parallel, so the SM4 of this embodiment can be further expanded to 128 parallel groups.
The NEON instructions used are given below in the form of pseudo-functions (intrinsics), where int32x4_t is the 128-bit NEON vector register type:
Example 2: the X86/AVX2 instruction set implements:
The non-vector code can be written in a high-level language and compiled; the vector algorithm is implemented with AVX2 instructions, which can compute on 8 lanes of 32-bit words in parallel, so the SM4 of this embodiment can be further expanded to 256 parallel groups;
Example 3: OpenCL-based GPU implementation:
OpenCL code is written similarly to C. Because every variable in the slicing algorithm is 1 bit wide, the input plaintext is represented by 128 variables, each state word by 32 variables, and the input and output of the S-box by 8 variables each. Since 64-bit wide registers are used, a single GPU thread is equivalent to computing 64 SM4 threads in parallel.
The S-box query code structure in OpenCL is as follows:
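unsigned long long tp[32];   /* 32 one-bit slices of a state word; each element carries one bit of 64 SM4 instances */
...
for (i = 0; i < 32; i++)
{
    ...
    /* bit-sliced S-box applied to the upper 8 bits (slices 31..24) of the word */
    SBOX_SMS4_BS(&tp[31], &tp[30], &tp[29], &tp[28], &tp[27], &tp[26], &tp[25], &tp[24]);
    ...
}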
In the code structure of the slicing algorithm above, the original standard operation is an S-box query on the upper 8 bits of the 32-bit word tp. Under the slicing algorithm the 32 bits are represented as an array and the 8 input bits must be passed separately as 1-bit values; since the register is 64 bits wide, SBOX_SMS4_BS actually computes 64 S-boxes simultaneously, thereby expanding the number of SM4 threads by a factor of 64.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.
Claims (7)
1. A high-performance SM4 bit slice optimization method oriented to heterogeneous parallel architecture is characterized by comprising the following steps:
1) dividing each 32-bit variable of the original standard SM4 algorithm, in order, into 32 variables with a word length of 1 bit;
2) converting the linear operations of the original standard SM4 algorithm, namely the XOR of 32-bit words and the cyclic left shift of 32-bit words, into XOR and transposition among the 32 one-bit variables defined in step 1);
3) decomposing the nonlinear operation of the original standard SM4 algorithm, the S-box, into matrix affine transformations and an inversion over the finite field $GF(2^8)$;
4) for the inversion over $GF(2^8)$, using a finite-field tower transformation to map $GF(2^8)$ isomorphically, by a matrix affine transformation, to the composite field $GF((2^4)^2)$, so that the inversion is converted into inversion and multiplication over $GF(2^4)$ and realized at a word length of 1 bit;
5) for the inversion over $GF(2^4)$, using a finite-field tower transformation to map $GF(2^4)$ isomorphically, by a matrix affine transformation, to the finite field $GF((2^2)^2)$, so that the inversion is converted into inversion and multiplication over $GF(2^2)$ and realized at a word length of 1 bit;
6) for the inversion over $GF(2^2)$, which is equivalent to leaving the high bit unchanged and setting the low bit to the XOR of the high bit and the low bit, realizing the computation at a word length of 1 bit;
7) according to steps 1) to 6) above, completing the whole SM4 computation using only a 1-bit word length and the XOR and AND operations, so that an X-bit register is used as a vector register of X one-bit lanes and X groups of the SM4 algorithm are computed in parallel as multiple threads.
2. The method of claim 1, the exclusive or operation in step 2) comprising:
3. The method of claim 1, the step of transposing in step 2) comprising:
3) according to the linear transformation formula […], given […], finding […], where the bit data involved in the exclusive-or operation of the calculation are computed as:
4. The method of claim 1, the processing step in step 4) comprising:
2) performing an inversion operation on the […] obtained in step 1), according to the following formula:
5. The method of claim 1, wherein in step 4) the $GF(2^4)$ multiplication is realized by converting it into a matrix-vector multiplication: the two 4-bit elements being multiplied are regarded as two vectors, one vector is converted into matrix form and multiplied by the other vector, and the calculation is realized at 1-bit word length.
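The matrix-vector formulation of the 4-bit multiplication can be sketched as follows. This is an illustration under stated assumptions, not the patent's circuit: the reduction polynomial x^4 + x + 1 and the polynomial basis are assumptions (the source formulas are not available), and the names `gf16_mul_ref`, `times_x`, `gf16_mul_planes`, `pack` and `unpack` are hypothetical. Column j of the matrix built from `a` is a·x^j, so the product is the XOR of the columns selected by the bits of `b`, and only AND and XOR on bit planes are needed.

```python
def gf16_mul_ref(a, b):
    """Reference: carry-less multiply, then reduce modulo x^4 + x + 1 (assumed polynomial)."""
    p = 0
    for j in range(4):
        if (b >> j) & 1:
            p ^= a << j
    for d in range(6, 3, -1):          # clear the degree-6, 5 and 4 terms
        if (p >> d) & 1:
            p ^= 0b10011 << (d - 4)
    return p

def times_x(c):
    """Multiply bit planes [c0, c1, c2, c3] by x, using x^4 = x + 1."""
    return [c[3], c[0] ^ c[3], c[1], c[2]]

def gf16_mul_planes(a, b):
    """Matrix-vector product over GF(2): a, b are lists of 4 bit planes (index 0 = LSB)."""
    col = list(a)                      # column 0 of the matrix is a itself
    out = [0, 0, 0, 0]
    for j in range(4):
        for i in range(4):
            out[i] ^= col[i] & b[j]    # add column j where bit j of b is set
        col = times_x(col)             # next column: a * x^(j+1)
    return out

def pack(values):
    """Lane j of plane i holds bit i of values[j]."""
    return [sum(((v >> i) & 1) << j for j, v in enumerate(values)) for i in range(4)]

def unpack(planes, n):
    """Inverse of pack for n lanes."""
    return [sum(((planes[i] >> j) & 1) << i for i in range(4)) for j in range(n)]

# Usage: all 256 operand pairs are multiplied in parallel, one pair per lane.
pairs = [(a, b) for a in range(16) for b in range(16)]
prod = gf16_mul_planes(pack([a for a, _ in pairs]), pack([b for _, b in pairs]))
assert unpack(prod, len(pairs)) == [gf16_mul_ref(a, b) for a, b in pairs]
```

A different irreducible polynomial or basis would change only `times_x` and the reference reduction; the AND/XOR structure stays the same.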
6. The method of claim 1, the processing step in step 5) comprising:
2) performing an inversion operation on the […] obtained in step 1), according to the following formula:
wherein […] is an element of […], […] is the inverse of that element, and […] is the polynomial modulus;
7. The method of claim 1, wherein in step 5) the $GF(2^2)$ multiplication is realized by a matrix operation, comprising: regarding the two 2-bit elements being multiplied as two vectors, converting one vector into matrix form, and multiplying it by the other vector; the inversion operation in the finite field $GF(2^2)$ is then performed by table lookup.
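The 2-bit arithmetic can be checked with a short Python sketch. It assumes a polynomial basis modulo x^2 + x + 1 (the only irreducible quadratic over GF(2)); the names `gf4_mul`, `gf4_inv`, `bits`, `value` and `GF4_INV_TABLE` are hypothetical. In this basis the inverse of a nonzero element equals its square, which leaves the high-order bit unchanged and replaces the low-order bit by the XOR of both bits, so the lookup table and the XOR form agree.

```python
# Inverses in GF(2^2) modulo x^2 + x + 1, with 0 mapped to 0 by convention.
GF4_INV_TABLE = [0, 1, 3, 2]

def gf4_mul(a, b):
    """Matrix-vector product: a = (hi, lo) supplies the matrix columns a and a*x."""
    ah, al = a
    bh, bl = b
    lo = (al & bl) ^ (ah & bh)
    hi = (ah & bl) ^ ((ah ^ al) & bh)
    return hi, lo

def gf4_inv(a):
    """Inversion with one XOR: high bit unchanged, low bit = high XOR low."""
    ah, al = a
    return ah, ah ^ al

def bits(v):
    """Split a 2-bit value into (hi, lo)."""
    return (v >> 1) & 1, v & 1

def value(p):
    """Join (hi, lo) back into a 2-bit value."""
    return (p[0] << 1) | p[1]

# Usage: the XOR form matches the table, and a * a^-1 = 1 for every nonzero a.
for v in range(4):
    assert value(gf4_inv(bits(v))) == GF4_INV_TABLE[v]
    if v:
        assert value(gf4_mul(bits(v), gf4_inv(bits(v)))) == 1
```

Because `gf4_mul` and `gf4_inv` use only AND and XOR, the same functions work unchanged when `hi` and `lo` are wide integers holding one lane per bit.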
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210542472.4A CN114710285B (en) | 2022-05-19 | 2022-05-19 | High-performance SM4 bit slice optimization method for heterogeneous parallel architecture |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114710285A (en) | 2022-07-05 |
CN114710285B (en) | 2022-08-23 |
Family
ID=82175690
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210542472.4A Active CN114710285B (en) | 2022-05-19 | 2022-05-19 | High-performance SM4 bit slice optimization method for heterogeneous parallel architecture |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114710285B (en) |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101848081A (en) * | 2010-06-11 | 2010-09-29 | 中国科学院软件研究所 | S box and construction method thereof |
CN101938349A (en) * | 2010-10-01 | 2011-01-05 | 北京航空航天大学 | S box applicable to hardware realization and circuit realization method thereof |
CN104065473A (en) * | 2014-06-25 | 2014-09-24 | 成都信息工程学院 | Compact realization method of SM4 block cipher algorithm S box |
US20160232129A1 (en) * | 2015-02-05 | 2016-08-11 | Weng Tianxiang | Apparatus of wave-pipelined circuits |
CN106209358A (en) * | 2016-07-12 | 2016-12-07 | 黑龙江大学 | A kind of SM4 key schedule based on long key realize system and method |
CN106712930A (en) * | 2017-01-24 | 2017-05-24 | 北京炼石网络技术有限公司 | SM4 encryption method and device |
US20190245679A1 (en) * | 2018-02-02 | 2019-08-08 | Intel Corporation | Unified hardware accelerator for symmetric-key ciphers |
CN110278070A (en) * | 2018-03-13 | 2019-09-24 | 中国科学技术大学 | The implementation method and device of S box in a kind of SM4 algorithm |
CN110166223A (en) * | 2019-05-22 | 2019-08-23 | 北京航空航天大学 | A kind of Fast Software implementation method of the close SM4 of state |
CN110197076A (en) * | 2019-05-22 | 2019-09-03 | 北京航空航天大学 | A kind of software optimization implementation method of SM4 Encryption Algorithm |
CN110474761A (en) * | 2019-07-11 | 2019-11-19 | 北京电子科技学院 | One kind 16 takes turns SM4-256 whitepack password implementation method |
EP3998738A1 (en) * | 2021-08-26 | 2022-05-18 | Irdeto B.V. | Secured performance of a cryptographic process |
CN114244496A (en) * | 2021-12-01 | 2022-03-25 | 华南师范大学 | SM4 encryption and decryption algorithm parallelization realization method based on tower domain optimization S box |
CN114091086A (en) * | 2022-01-14 | 2022-02-25 | 麒麟软件有限公司 | Rapid realization method of SM4 algorithm based on bit slice |
Non-Patent Citations (2)
Title |
---|
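Li Ling et al.: "Optimized design of cryptographic operators based on the GReP general-purpose reconfigurable processor", Application Research of Computers (《计算机应用研究》) * |
Liang Hao et al.: "Design and implementation of the SM4 algorithm based on composite fields", Microelectronics & Computer (《微电子学与计算机》) * |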
Also Published As
Publication number | Publication date |
---|---|
CN114710285B (en) | 2022-08-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||