CN114710285A - High-performance SM4 bit slice optimization method for heterogeneous parallel architecture - Google Patents


Publication number
CN114710285A
Authority
CN
China
Prior art keywords
bit
algorithm
inversion
multiplication
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210542472.4A
Other languages
Chinese (zh)
Other versions
CN114710285B (en)
Inventor
关志
陈钟
何逸飞
王珂
孙磊
齐向东
刘勇
孔坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Qianxin Technology Group Co Ltd
Original Assignee
Peking University
Qianxin Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Qianxin Technology Group Co Ltd filed Critical Peking University
Priority to CN202210542472.4A priority Critical patent/CN114710285B/en
Publication of CN114710285A publication Critical patent/CN114710285A/en
Application granted granted Critical
Publication of CN114710285B publication Critical patent/CN114710285B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/08: Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L9/0861: Generation of secret information including derivation or calculation of cryptographic keys or passwords
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/14: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols using a plurality of keys or algorithms
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a high-performance SM4 bit-slice optimization method for heterogeneous parallel architectures, belonging to the technical field of cryptographic security applications. By implementing the SM4 block cipher algorithm at a data width of 1 bit, the method realizes multi-threaded SM4 on both non-vector and vector instruction sets, and the vector instruction sets support higher encryption speeds.

Description

High-performance SM4 bit slice optimization method for heterogeneous parallel architecture
Technical Field
The invention belongs to the technical field of cryptographic security applications and relates to a high-performance SM4 bit-slice optimization method for heterogeneous parallel architectures.
Background
SM4 is the block cipher standard adopted in WAPI, China's wireless LAN standard, and was subsequently adopted as a Chinese commercial cryptography standard. As the block cipher standard for commercial cryptography in China, SM4 is expected to gradually replace foreign block cipher standards such as 3DES and AES in sensitive but non-classified applications in China, such as communication encryption and data encryption. SM4 is a symmetric cipher with a key length and block length of 128 bits; it outputs a 128-bit ciphertext.
The operators used and their meanings are given below:
mod: modulo operation;
$\wedge$: 32-bit AND operation;
$\vee$: 32-bit OR operation;
$\neg$: 32-bit NOT operation;
$\oplus$: 32-bit exclusive-OR operation;
$+$: arithmetic addition mod $2^{32}$;
$\lll i$: 32-bit cyclic left shift by $i$ bits;
$\leftarrow$: left assignment operator;
$GF(2^n)$: the finite field containing $2^n$ elements.
The key expansion algorithm is as follows:
In the standard SM4 algorithm the word length is 32 bits. The encryption key is 128 bits long and is represented as 4 words $MK = (MK_0, MK_1, MK_2, MK_3)$; the round keys are represented as 32 words $(rk_0, rk_1, \dots, rk_{31})$; the plaintext input is treated as 4 words $(X_0, X_1, X_2, X_3)$, and the ciphertext output is represented as $(Y_0, Y_1, Y_2, Y_3)$.
SM4 key expansion algorithm:
1) Set 4 words $(K_0, K_1, K_2, K_3) = (MK_0 \oplus FK_0, MK_1 \oplus FK_1, MK_2 \oplus FK_2, MK_3 \oplus FK_3)$.
2) The round keys are generated as $rk_i = K_{i+4} = K_i \oplus T'(K_{i+1} \oplus K_{i+2} \oplus K_{i+3} \oplus CK_i)$, $i = 0, 1, \dots, 31$.
Here $T'$ is a composite permutation $T'(\cdot) = L'(\tau(\cdot))$ consisting of the linear transformation $L'$ and the nonlinear transformation $\tau$; $FK_i$ and $CK_i$ are constants.
The encryption algorithm is as follows:
1) Perform 32 iterations: $X_{i+4} = X_i \oplus T(X_{i+1} \oplus X_{i+2} \oplus X_{i+3} \oplus rk_i)$, $i = 0, 1, \dots, 31$.
2) Output $(Y_0, Y_1, Y_2, Y_3) = (X_{35}, X_{34}, X_{33}, X_{32})$.
Here $T$ is a composite permutation $T(\cdot) = L(\tau(\cdot))$ consisting of the linear transformation $L$ and the nonlinear transformation $\tau$.
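As a non-limiting illustration, the word-oriented structure of the key expansion and the 32 encryption rounds may be sketched as follows. The function names (`expand_key`, `crypt`, `tau`) are hypothetical, and the S-box is passed in as a parameter so that the listing stays short; the constants FK and CK are those of the published SM4 standard, not values stated in this text. Real SM4 output is obtained only when the caller supplies the 256-entry SM4 S-box table.

```python
MASK32 = 0xFFFFFFFF
# System parameter FK and fixed parameter CK of the SM4 standard.
FK = (0xA3B1BAC6, 0x56AA3350, 0x677D9197, 0xB27022DC)
# Byte j of CK_i is (4*i + j) * 7 mod 256.
CK = [int.from_bytes(bytes(((4 * i + j) * 7) % 256 for j in range(4)), "big")
      for i in range(32)]

def rotl32(x, n):
    """32-bit cyclic left shift by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def tau(x, sbox):
    """Nonlinear transformation: S-box applied to each of the 4 bytes."""
    return int.from_bytes(bytes(sbox[b] for b in x.to_bytes(4, "big")), "big")

def lin(b):
    """Linear transformation L used in the encryption rounds."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)

def lin_prime(b):
    """Linear transformation L' used in the key expansion."""
    return b ^ rotl32(b, 13) ^ rotl32(b, 23)

def expand_key(mk, sbox):
    """Derive the 32 round keys rk_0..rk_31 from the 4 key words."""
    k = [mk[i] ^ FK[i] for i in range(4)]
    for i in range(32):
        k.append(k[i] ^ lin_prime(tau(k[i + 1] ^ k[i + 2] ^ k[i + 3] ^ CK[i], sbox)))
    return k[4:]

def crypt(x, rk, sbox):
    """Run the 32 rounds and apply the final reversal of the last 4 words."""
    x = list(x)
    for i in range(32):
        x.append(x[i] ^ lin(tau(x[i + 1] ^ x[i + 2] ^ x[i + 3] ^ rk[i], sbox)))
    return (x[35], x[34], x[33], x[32])
```

Decryption uses the same `crypt` function with the round keys in reverse order, which is why a single routine suffices in the sketch.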
Because SM4 was designed for low-power chips (namely WAPI chips), it was optimized to reduce the hardware circuit count. As a result, SM4 is inefficient in software: it is difficult to exploit the full computing power of mainstream 32-bit/64-bit general-purpose processors, and software implementations are usually much slower than those of AES, a symmetric cipher of the same kind.
In addition to a basic general-purpose SISD (Single Instruction, Single Data) instruction set, current mainstream CPUs include extended instruction sets such as SIMD (Single Instruction, Multiple Data) instruction sets. For example, mainstream x86 processors from Intel and AMD support the AVX/AVX2 instruction sets, and mobile processors based on the ARM Cortex-A architecture support the NEON instruction set. AVX2 operates on 256-bit vectors and provides 16 256-bit vector registers. Intel has also released the AVX-512 instruction set, which provides 32 512-bit vector registers and can perform 16-way 32-bit word vector operations. NEON is the SIMD instruction set of the ARM Cortex-A architecture; it provides 16 128-bit SIMD registers, on which 4-way 32-bit vector computations can be performed.
Current Intel/AMD x86 processors and ARM processors support both non-vector and vector instruction sets. Conventional parallel implementations often run several SM4 instances at once by directly using vector registers or multi-core processors. However, because the standard SM4 algorithm requires a relatively wide word (32 bits), even a 256-bit vector register can run at most 8 SM4 instances simultaneously. GPUs have a similar problem: a single GPU thread is rarely faster than a CPU thread, and the parallelism of the GPU is limited by its stream processors, so it cannot process more threads at once. In scenarios where the speed required of each individual SM4 thread is low but the number of SM4 encryption threads is large, such implementations offer insufficient parallelism and low overall encryption efficiency. An algorithm that uses computing resources more fully and runs multi-threaded SM4 faster is urgently needed.
Disclosure of Invention
To realize a more efficient multi-threaded SM4 algorithm, the invention provides a software bit-slice optimization of the SM4 block cipher that can be implemented on SISD and SIMD instruction sets and on GPUs. The invention can implement multi-threaded SM4 on either a non-vector or a vector instruction set; the vector instruction set supports higher encryption speeds.
The invention provides a high-performance SM4 bit-slice optimization method for heterogeneous parallel architectures, comprising the following steps:
1) split each 32-bit variable of the standard SM4 algorithm, in order, into 32 variables with a word length of 1 bit;
2) the linear part of the standard SM4 algorithm consists of XOR operations on 32-bit words and cyclic left shifts of 32-bit words; using the variables defined in step 1), convert these into XORs and transpositions among 32 words of 1-bit length;
3) decompose the S-box, the nonlinear part of the standard SM4 algorithm, into matrix affine transformations and inversion over the finite field $GF(2^8)$;
4) for the inversion over $GF(2^8)$, use a finite-field tower construction: map $GF(2^8)$ isomorphically, by a matrix affine transformation, onto the composite field $GF((2^4)^2)$, so that the inversion is converted into inversion and multiplication over $GF(2^4)$, which are implemented at a word length of 1 bit;
5) for the inversion over $GF(2^4)$, use the tower construction again: map $GF(2^4)$ isomorphically, by a matrix affine transformation, onto the finite field $GF((2^2)^2)$, so that the inversion is converted into inversion and multiplication over $GF(2^2)$, which are implemented at a word length of 1 bit;
6) inversion over $GF(2^2)$ is equivalent to leaving the high-order bit unchanged and setting the low-order bit to the XOR of the high-order and low-order bits, so it is computed at a word length of 1 bit;
7) following steps 1) to 6), the whole SM4 algorithm is computed using only 1-bit words and the XOR and AND operations, so an X-bit register can be treated as a vector register with X lanes, realizing multi-threaded parallel computation of X instances of the SM4 algorithm.
The method is a software-implemented bit-slice optimization of the SM4 block cipher. Through linear analysis of the algorithm, the linear operations of the standard SM4 encryption algorithm on 32-bit words are converted into equivalent linear operations on 1-bit words; by constructing finite-field isomorphisms, the nonlinear operations of the standard SM4 algorithm are mapped onto nonlinear operations on 1-bit words. In this way the whole SM4 algorithm is converted from a 32-bit-word implementation into a 1-bit-word implementation. On a non-vector or vector instruction set platform with wider registers, each register is then divided bit by bit among different SM4 threads (for example, a 32-bit register can be divided into 32 single bits, giving a 32-thread SM4 algorithm), which yields a multi-threaded SM4 algorithm with higher parallelism. In a concrete implementation the algorithm no longer needs wide words for its computation, so on a platform with 32-bit, 64-bit or wider registers a single register can hold the data of 8, 16 or more threads at once and operate on all of them simultaneously, realizing SM4 with a higher thread count.
The core of the invention is to implement the SM4 block cipher at a data width of 1 bit, apply it on a vector or non-vector instruction set, and divide the registers bit by bit. Under the same conditions this gives a multi-threaded SM4 algorithm with 32 times the parallelism of the standard SM4 algorithm (word length 32 bits). Analysis shows that the proposed SM4 parallel optimization can be realized with vector instructions or on a GPU, and also on CPU platforms that do not support SIMD instructions.
Compared with the prior art, the invention has the following beneficial effects: 1) Constant time. The algorithm replaces S-box table lookups with matrix operations, shifts, XORs and similar operations, so it runs in constant time: the execution time is independent of the internal state of the algorithm. It can therefore resist various timing-based side-channel attacks, which makes the SM4 implementation more secure. 2) More parallel threads. The whole algorithm needs only 1 bit of a register per thread. With 32-bit and 64-bit registers, 32-way and 64-way SM4 encryption and decryption can be executed directly in parallel, and with AVX-512 registers up to 512-way SM4 encryption can be executed, which raises the overall encryption rate. In scenarios where the speed required of a single SM4 thread is low but the number of encryption threads is large, overall efficiency improves greatly. On a GPU, assuming there are x parallel GPU threads and each thread uses 64-bit registers, each GPU thread can process 64 SM4 threads at once, so the number of parallel SM4 threads is multiplied by 64.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
The invention provides a high-performance SM4 bit-slice optimization method for heterogeneous parallel architectures that performs the SM4 computation using only a 1-bit data width and only XOR and AND as logical operations. Compared with conventional software implementations of SM4, the algorithm can treat an X-bit register as a vector register with X lanes, where X is the register width, and thus compute X instances of SM4 in parallel, greatly increasing the computation speed.
Specifically, the method of the invention adopts the following steps:
1) split all variables in SM4 into 1-bit data from low order to high order; for example, the 128-bit plaintext (processed as four 32-bit words) is regarded as 128 pieces of 1-bit data;
2) treat the XOR operations and cyclic shifts in SM4 as XORs and transpositions among the 128 pieces of 1-bit data;
3) decompose the S-box, the only nonlinear operation in SM4, according to its underlying principle into matrix affine transformations and inversion over the finite field $GF(2^8)$;
4) map the inversion over $GF(2^8)$ isomorphically, via tower decomposition, onto inversion over the composite field $GF((2^4)^2)$, which is further transformed into inversion and multiplication over $GF(2^4)$;
5) treat the isomorphic mapping, the matrix operations and the multiplication by specific finite-field elements arising in the decomposition of step 4) as matrix multiplications, so that they are computed at a data width of 1 bit;
6) apply a further tower decomposition to the inversion over $GF(2^4)$: map it isomorphically onto $GF((2^2)^2)$, where it is further transformed into inversion and multiplication over $GF(2^2)$;
7) treat the isomorphic mapping, the matrix operations and the multiplication by specific finite-field elements arising in the decomposition of step 6) as matrix multiplications, so that they are computed at a data width of 1 bit;
8) inversion over $GF(2^2)$ is equivalent to leaving the high-order bit unchanged and setting the low-order bit to the XOR of the high-order and low-order bits, so it is computed at a data width of 1 bit;
9) combining all of the above, the whole of SM4 is computed using only a 1-bit data width and the XOR and AND operations, so an X-bit register can be treated as a vector register with X lanes, realizing parallel computation of X instances of the SM4 algorithm.
Variable representation of the SM4 multithread optimization algorithm:
In the standard SM4 algorithm the key is 16 bytes, the input plaintext is 16 bytes, and every intermediate state variable is a 4-byte word. In this method the key and the plaintext are each split into 128 pieces of 1-bit data by bit position within each byte, and each intermediate state word is split into 32 pieces of 1-bit data by bit position.
All variables of the invention are assumed to be split to 1-bit width as follows:
Key format: the 128-bit key is split into single bits indexed by $q$, where $q$ is the bit number;
Plaintext format: the 128-bit plaintext is split into single bits indexed by $q$, where $q$ is the bit number;
32-bit word: the basic operand of SM4 is a 4-byte word, which is split by bit position into 32 bits $data[q]$, where $data$ is the name of the 32-bit variable and $q$ is the bit number.
Linear operations of the SM4 multithread optimization algorithm:
XOR calculation:
The SM4 multithread optimization method of the invention converts every XOR between 32-bit words in the standard SM4 algorithm into XORs at a length of 1 bit.
The calculation proceeds as follows:
1) suppose the standard SM4 algorithm has two 32-bit state words $a$ and $b$ and needs to compute $c = a \oplus b$;
2) split the variables into 32 single bits $a_i$ and $b_i$, $i = 0, 1, \dots, 31$;
3) XOR the aligned bits: $c_i = a_i \oplus b_i$, $i = 0, 1, \dots, 31$.
The result is again a 32-bit state word split into 1-bit pieces.
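As a non-limiting illustration of the 1-bit representation and the bitwise XOR described above, the sketch below transposes a batch of 32-bit words into 32 "bit planes" and XORs two batches plane by plane. The names `pack`, `unpack` and `xor_sliced` are hypothetical, and an arbitrary-precision Python integer stands in for an X-bit register whose bit k belongs to SM4 instance (lane) k.

```python
def pack(words, width=32):
    """Transpose words into `width` bit planes.

    planes[i] holds bit i (counted from the high-order bit) of every word;
    bit k of a plane belongs to lane k, i.e. to words[k].
    """
    planes = [0] * width
    for lane, w in enumerate(words):
        for i in range(width):
            bit = (w >> (width - 1 - i)) & 1
            planes[i] |= bit << lane
    return planes

def unpack(planes, n, width=32):
    """Inverse of pack: rebuild the n words from the bit planes."""
    words = [0] * n
    for i, p in enumerate(planes):
        for lane in range(n):
            words[lane] |= ((p >> lane) & 1) << (width - 1 - i)
    return words

def xor_sliced(a_planes, b_planes):
    """c_i = a_i XOR b_i for every bit position, in all lanes at once."""
    return [x ^ y for x, y in zip(a_planes, b_planes)]
```

One XOR on a plane processes the same bit position of every lane, which is the source of the X-fold parallelism on an X-bit register.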
Linear transformation function:
The linear transformation functions of standard SM4 consist of XORs and cyclic shifts. $L$ has the form:
$C = L(B) = B \oplus (B \lll a_0) \oplus (B \lll a_1) \oplus (B \lll a_2) \oplus (B \lll a_3)$
where $a_k$ ($k = 0, 1, 2, 3$) are the bit counts of the four cyclic left shifts (note: not the same as the symbol used in the XOR calculation) and $B$ is a 32-bit word. $L'$ and $L$ have the same format and structure; they differ only in that the $a_k$ of $L$ are 2, 10, 18, 24 in order, while the $a_k$ of $L'$ are 13, 23, 0, 0 in order.
Following the variable representation of the SM4 multithread optimization algorithm, $B$ is split into 32 pieces of 1-bit data, and $L$ must output 32 pieces of 1-bit data in the same format. Each output bit can therefore be expressed as the XOR of the bits of $B$ at the corresponding positions, i.e.:
$C_i = B_i \oplus B_{(i+a_0)\%32} \oplus B_{(i+a_1)\%32} \oplus B_{(i+a_2)\%32} \oplus B_{(i+a_3)\%32}$
where $C_i$ ($i = 0, 1, \dots, 31$) are the bits of $C$, numbered from the high-order bit to the low-order bit; $B_{(i+a_k)\%32}$ is bit $(i+a_k)\%32$ of $B$ (the bit following the last bit of $B$ is taken to be its first bit), so the index depends on the number of bits of the cyclic shift and on the output position; % is the modulo operation, i.e. the remainder after division by 32.
The calculation proceeds as follows:
1) split the 32-bit input $B$, in order, into 1-bit units $B_0, B_1, \dots, B_{31}$;
2) split the 32-bit output $C$, in order, into 1-bit units $C_0, C_1, \dots, C_{31}$;
3) for a given $i$, use the formula above to find the bits of $B$ that enter the XOR for $C_i$, and compute $C_i$;
4) repeat step 3) for $i = 0, 1, \dots, 31$ to obtain the final $C_0, C_1, \dots, C_{31}$.
The linear transformation function $L'$ works on the same principle as $L$.
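As a non-limiting illustration, the bit-sliced linear transformation can be sketched as an index permutation followed by XORs and checked against the word-level rotation form. The helper names are hypothetical; bit planes are numbered from the high-order bit, so a cyclic left shift by s bits reads plane (i + s) % 32 when producing output plane i.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """32-bit cyclic left shift by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def lin_word(b, shifts=(2, 10, 18, 24)):
    """Word-level form: B xor (B <<< a_0) xor ... xor (B <<< a_3)."""
    out = b
    for s in shifts:
        out ^= rotl32(b, s)
    return out

def pack(words):
    """planes[i] holds bit i (from the high-order bit) of every word; lane k = words[k]."""
    planes = [0] * 32
    for lane, w in enumerate(words):
        for i in range(32):
            planes[i] |= ((w >> (31 - i)) & 1) << lane
    return planes

def unpack(planes, n):
    """Inverse of pack."""
    words = [0] * n
    for i, p in enumerate(planes):
        for lane in range(n):
            words[lane] |= ((p >> lane) & 1) << (31 - i)
    return words

def lin_sliced(planes, shifts=(2, 10, 18, 24)):
    """Bit-sliced form: no shifts are executed, only plane selection and XOR."""
    out = []
    for i in range(32):
        c = planes[i]
        for s in shifts:
            c ^= planes[(i + s) % 32]
        out.append(c)
    return out
```

Passing `shifts=(13, 23, 0, 0)` gives $L'$: the two zero shifts contribute the input plane twice and cancel, leaving the input XORed with its 13-bit and 23-bit rotations.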
Nonlinear S-box operation of the multithread optimization algorithm:
Mathematical principle of the S-box:
The S-box is the only nonlinear component of the standard SM4 algorithm and is used by the nonlinear transformation $\tau$. The S-box is a substitution table with 8-bit input and 8-bit output, as follows:
0xd6,0x90,0xe9,0xfe,0xcc,0xe1,0x3d,0xb7,0x16,0xb6,0x14,0xc2,0x28,0xfb,0x2c,0x05,
0x2b,0x67,0x9a,0x76,0x2a,0xbe,0x04,0xc3,0xaa,0x44,0x13,0x26,0x49,0x86,0x06,0x99,
0x9c,0x42,0x50,0xf4,0x91,0xef,0x98,0x7a,0x33,0x54,0x0b,0x43,0xed,0xcf,0xac,0x62,
0xe4,0xb3,0x1c,0xa9,0xc9,0x08,0xe8,0x95,0x80,0xdf,0x94,0xfa,0x75,0x8f,0x3f,0xa6,
0x47,0x07,0xa7,0xfc,0xf3,0x73,0x17,0xba,0x83,0x59,0x3c,0x19,0xe6,0x85,0x4f,0xa8,
0x68,0x6b,0x81,0xb2,0x71,0x64,0xda,0x8b,0xf8,0xeb,0x0f,0x4b,0x70,0x56,0x9d,0x35,
0x1e,0x24,0x0e,0x5e,0x63,0x58,0xd1,0xa2,0x25,0x22,0x7c,0x3b,0x01,0x21,0x78,0x87,
0xd4,0x00,0x46,0x57,0x9f,0xd3,0x27,0x52,0x4c,0x36,0x02,0xe7,0xa0,0xc4,0xc8,0x9e,
0xea,0xbf,0x8a,0xd2,0x40,0xc7,0x38,0xb5,0xa3,0xf7,0xf2,0xce,0xf9,0x61,0x15,0xa1,
0xe0,0xae,0x5d,0xa4,0x9b,0x34,0x1a,0x55,0xad,0x93,0x32,0x30,0xf5,0x8c,0xb1,0xe3,
0x1d,0xf6,0xe2,0x2e,0x82,0x66,0xca,0x60,0xc0,0x29,0x23,0xab,0x0d,0x53,0x4e,0x6f,
0xd5,0xdb,0x37,0x45,0xde,0xfd,0x8e,0x2f,0x03,0xff,0x6a,0x72,0x6d,0x6c,0x5b,0x51,
0x8d,0x1b,0xaf,0x92,0xbb,0xdd,0xbc,0x7f,0x11,0xd9,0x5c,0x41,0x1f,0x10,0x5a,0xd8,
0x0a,0xc1,0x31,0x88,0xa5,0xcd,0x7b,0xbd,0x2d,0x74,0xd0,0x12,0xb8,0xe5,0xb4,0xb0,
0x89,0x69,0x97,0x4a,0x0c,0x96,0x77,0x7e,0x65,0xb9,0xf1,0x09,0xc5,0x6e,0xc6,0x84,
0x18,0xf0,0x7d,0xec,0x3a,0xdc,0x4d,0x20,0x79,0xee,0x5f,0x3e,0xd7,0xcb,0x39,0x48.
The S-box substitution of the standard algorithm is in fact a transformation of elements of the finite field $GF(2^8)$, whose elements are polynomials of degree at most 7. It can be described mathematically as follows:
Let $A$ be the binary 8 x 8 matrix:
{1,1,1,0,0,1,0,1},
{1,1,1,1,0,0,1,0},
{0,1,1,1,1,0,0,1},
{1,0,1,1,1,1,0,0},
{0,1,0,1,1,1,1,0},
{0,0,1,0,1,1,1,1},
{1,0,0,1,0,1,1,1},
{1,1,0,0,1,0,1,1},
and let $C$ be the length-8 binary vector {1, 1, 0, 0, 1, 0, 1, 1}.
Let $a$ be an 8-bit binary number $(a_7, a_6, \dots, a_0)$ with corresponding polynomial $a(x) = a_7 x^7 + a_6 x^6 + \dots + a_1 x + a_0$. The primitive polynomial is $f(x) = x^8 + x^7 + x^6 + x^5 + x^4 + x^2 + 1$, and $a^{-1}$ denotes the inverse of $a(x)$ in the residue field modulo $f(x)$. If $a = 0$, the inverse of $a$ is defined as $a^{-1} = 0$. The inverse satisfies the following relations:
1. [formula image 074]
2. [formula image 075]
3. [formula image 076] [formula image 077]
Suppose $x$ is the 8-bit input of the S-box and $S(x)$ is the output of the S-box; it can be expressed as:
$S(x) = A \cdot (A \cdot x + C)^{-1} + C$
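As a non-limiting illustration of the inversion that forms the nonlinear core of the S-box, the sketch below multiplies and inverts in $GF(2^8)$ modulo $x^8 + x^7 + x^6 + x^5 + x^4 + x^2 + 1$ (0x1F5), the polynomial published for the SM4 S-box. The function names are hypothetical, the inverse is computed by exponentiation ($a^{254}$) purely for clarity, and the affine matrix and vector are left out because their bit ordering is not fixed by this text.

```python
POLY = 0x1F5  # x^8 + x^7 + x^6 + x^5 + x^4 + x^2 + 1

def gf256_mul(a, b):
    """Multiply two field elements (shift-and-add with reduction by POLY)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= POLY
    return r

def gf256_inv(a):
    """Inverse by square-and-multiply: a^254 = a^-1 for a != 0, and 0 maps to 0."""
    result, base, e = 1, a, 254
    while e:
        if e & 1:
            result = gf256_mul(result, base)
        base = gf256_mul(base, base)
        e >>= 1
    return result
```

This reference form works on whole bytes and is not bit-sliced; it is useful mainly as an oracle against which a 1-bit implementation can be checked.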
The S-box computation comprises 8 x 8 matrix operations, 8-bit vector XORs and inversion over an 8-bit finite field. The matrix multiplication and the vector XOR are binary linear operations and are easily decomposed, according to the relations between the variables, into computations at a word length of 4 bits: the matrix is first divided into 8 column vectors, and the 8-bit vector is then divided into high and low 4-bit halves, so that the matrix-vector operation is carried out on 4-bit registers.
Inversion over the finite field $GF(2^8)$ is a nonlinear operation that cannot be implemented directly on a 1-bit register; it requires an isomorphic mapping of the finite field.
Isomorphic mapping of operations on $GF(2^8)$ to the composite field $GF((2^4)^2)$:
The polynomial residue field formed by the degree-8 primitive polynomial $f(x) = x^8 + x^7 + x^6 + x^5 + x^4 + x^2 + 1$ can be mapped isomorphically onto the composite field $GF((2^4)^2)$.
Definition of the composite field:
Every element of $GF((2^4)^2)$ can be expressed as a polynomial [formula image 083] and regarded as an element of the residue field of a degree-2 polynomial [formula image 084], where the modulus polynomial is [formula image 085]; $GF(2^4)$ is the polynomial residue field defined by a degree-4 polynomial [formula image 087].
In the invention, $GF((2^4)^2)$ is the residue field modulo [formula image 088], whose polynomial coefficients are defined in the residue field modulo [formula image 089]. Specifically, the $GF(2^8)$ of the SM4 algorithm is defined modulo $f(x)$; if an isomorphism from $GF(2^8)$ to $GF((2^4)^2)$ exists, then inversion over $GF(2^8)$ is equivalent to inversion over $GF((2^4)^2)$.
The specific isomorphism and inversion are as follows:
1) construct the isomorphism from $GF(2^8)$ to $GF((2^4)^2)$ by matrix multiplication. For example, when [formula image 091] and [formula image 092] are chosen (as one embodiment), an isomorphism from $GF(2^8)$ to $GF((2^4)^2)$ exists. The isomorphism matrix [formula image 093] is:
{0,1,0,1,1,1,1,0},
{0,1,1,1,1,1,0,0},
{1,1,0,1,0,0,0,0},
{0,1,0,1,0,0,0,0},
{1,1,0,1,1,1,0,0},
{0,1,1,0,1,1,0,0},
{1,1,1,1,0,1,1,0},
{0,1,0,1,1,1,1,1},
Under the isomorphism, the inversion becomes inversion over $GF((2^4)^2)$;
2) for the $GF((2^4)^2)$ given in step 1), the inversion can be derived as follows:
Let [formula image 094] be an element of $GF((2^4)^2)$. If [formula image 095] is the inverse of this element, the following equation must hold:
[formula image 096]
where [formula image 097] denote 4 elements of $GF(2^4)$.
Expanding gives:
[formula image 099]
[formula image 100]
Through the isomorphism, inversion over $GF(2^8)$ becomes inversion over $GF((2^4)^2)$; with the [formula image 101] chosen by the invention, inversion over $GF((2^4)^2)$ in turn becomes inversion of the element [formula image 102] over $GF(2^4)$ together with multiplications over $GF(2^4)$.
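As a non-limiting illustration of how inversion over a composite field reduces to one inversion and a few multiplications in the subfield, the sketch below inverts elements $a_h y + a_l$ of $GF((2^4)^2)$. All concrete choices are assumptions made for the example and are not taken from the formula images of the patent: $GF(2^4)$ is built modulo $x^4 + x + 1$, the composite field modulo $y^2 + y + \lambda$ with $\lambda = x^3$, and the function names are hypothetical. With $d = a_h^2 \lambda + a_h a_l + a_l^2$, the inverse is $b_h = a_h d^{-1}$, $b_l = (a_h + a_l) d^{-1}$.

```python
P4 = 0b10011      # assumed modulus x^4 + x + 1 for GF(2^4)
LAMBDA = 0b1000   # assumed constant: y^2 + y + LAMBDA is irreducible over GF(2^4)

def gf16_mul(a, b):
    """Multiply in GF(2^4) modulo P4."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= P4
    return r

def gf16_inv(a):
    """a^14 = a^-1 in GF(2^4) for a != 0; returns 0 for a == 0."""
    a2 = gf16_mul(a, a)
    a4 = gf16_mul(a2, a2)
    a8 = gf16_mul(a4, a4)
    return gf16_mul(gf16_mul(a8, a4), a2)

def tower_mul(a, b):
    """(ah*y + al)(bh*y + bl) mod y^2 + y + LAMBDA; high nibble = y coefficient."""
    ah, al, bh, bl = a >> 4, a & 0xF, b >> 4, b & 0xF
    hh = gf16_mul(ah, bh)
    hi = hh ^ gf16_mul(ah, bl) ^ gf16_mul(al, bh)
    lo = gf16_mul(hh, LAMBDA) ^ gf16_mul(al, bl)
    return (hi << 4) | lo

def tower_inv(a):
    """One GF(2^4) inversion plus GF(2^4) multiplications; 0 maps to 0."""
    ah, al = a >> 4, a & 0xF
    d = gf16_mul(gf16_mul(ah, ah), LAMBDA) ^ gf16_mul(ah, al) ^ gf16_mul(al, al)
    d_inv = gf16_inv(d)
    return (gf16_mul(ah, d_inv) << 4) | gf16_mul(ah ^ al, d_inv)
```

The same pattern applies one level down, where $GF(2^4)$ is treated as $GF((2^2)^2)$ and the only remaining inversion is the one over $GF(2^2)$.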
Isomorphic mapping of operations on $GF(2^4)$ to the composite field $GF((2^2)^2)$:
After the first isomorphic mapping, the inversion problem over $GF(2^8)$ is reduced to inversion and multiplication over $GF(2^4)$. A further isomorphic mapping can then be applied to move the computation over $GF(2^4)$ onto $GF((2^2)^2)$.
1) By matrix multiplication, construction
Figure 639994DEST_PATH_IMAGE030
To
Figure 75392DEST_PATH_IMAGE031
Isomorphic mapping, as an example, coefficients are defined in
Figure 749956DEST_PATH_IMAGE105
Formed finite fields
Figure 27397DEST_PATH_IMAGE032
In the above-mentioned manner,
Figure 180029DEST_PATH_IMAGE106
forming a polynomial modulus finite field
Figure 17404DEST_PATH_IMAGE031
And are and
Figure 738235DEST_PATH_IMAGE107
formed finite field
Figure 762692DEST_PATH_IMAGE030
There is an isomorphic mapping;
2)
Figure 984595DEST_PATH_IMAGE031
the upper inversion operation is derived from the following equation:
is provided with
Figure 83001DEST_PATH_IMAGE108
Is that
Figure 771471DEST_PATH_IMAGE031
Middle element, if
Figure 955328DEST_PATH_IMAGE109
Is the inverse of this element, then there must be the following equation:
Figure 387446DEST_PATH_IMAGE110
after unfolding, obtaining:
Figure 340359DEST_PATH_IMAGE111
Figure 199730DEST_PATH_IMAGE112
as can be seen from the above, it is shown that,
Figure 700961DEST_PATH_IMAGE030
to
Figure 671191DEST_PATH_IMAGE031
Isomorphic mapping of
Figure 744189DEST_PATH_IMAGE030
The inverse operation of (3) becomes
Figure 774462DEST_PATH_IMAGE032
The above inversion and multiplication operations;
3) given of
Figure 932911DEST_PATH_IMAGE030
The multiplication can be converted into matrix vector multiplication operation and implemented within a word length of 1 bit, specifically, two 4-bit elements to be multiplied are regarded as two vectors { x3, x2, x1, x0}, one of the vectors is modified into a matrix form and is multiplied by the other vector, and assuming that the vector to be modified is { x3, x2, x1, x0}, the modified affine matrix corresponds to:
Figure 706832DEST_PATH_IMAGE113
4) Multiplication on GF(2^2) can likewise be carried out as a matrix operation. Specifically, the two 2-bit elements to be multiplied are regarded as two vectors of the form {x1, x0}; one of them is rewritten in matrix form and multiplied by the other vector. Assuming the vector to be rewritten is {x1, x0}, the corresponding matrix is:
Figure 694566DEST_PATH_IMAGE115
Inversion on GF(2^2), in turn, can be implemented at a word length of 1 bit from the following lookup relation.
As an embodiment, GF(2^2) is defined by the modulus polynomial
Figure 511398DEST_PATH_IMAGE116
so each element is only 2 bits long, and the inverse relation is as follows:
Figure 327136DEST_PATH_IMAGE117
For any element
Figure 434955DEST_PATH_IMAGE118
of GF(2^2), we always have
Figure 337052DEST_PATH_IMAGE119
so this formula realizes the inversion operation on GF(2^2). It can be implemented at a word length of 1 bit: the high-order bit is unchanged, and the low-order bit equals the high-order bit XOR the low-order bit.
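The two GF(2^2) operations above can be sketched in C directly on bit-planes. This is a minimal sketch under stated assumptions rather than the patented code: the modulus polynomial is taken to be x^2 + x + 1 (the only irreducible quadratic over GF(2)), an element is written {x1, x0}, each plane is a 64-bit word carrying 64 independent lanes, and the names plane_t, gf4_mul_bs, gf4_inv_bs and gf4_mul_ref are hypothetical. The multiplication matrix is derived from that polynomial and may be arranged differently from the matrix in the figure.

```c
#include <stdint.h>

/* One bit-plane: bit k of a plane belongs to lane k (64 independent lanes). */
typedef uint64_t plane_t;

/* GF(2^2) multiplication, modulus x^2 + x + 1 (assumed), written as the
 * matrix-vector product
 *   | c1 |   | x1^x0  x1 |   | y1 |
 *   | c0 | = |  x1    x0 | * | y0 |
 * and evaluated with AND/XOR only, on all 64 lanes at once. */
void gf4_mul_bs(plane_t x1, plane_t x0, plane_t y1, plane_t y0,
                plane_t *c1, plane_t *c0)
{
    *c1 = ((x1 ^ x0) & y1) ^ (x1 & y0);
    *c0 = (x1 & y1) ^ (x0 & y0);
}

/* GF(2^2) inversion: the high bit is unchanged and the low bit becomes
 * high XOR low. Zero maps to zero. */
void gf4_inv_bs(plane_t x1, plane_t x0, plane_t *r1, plane_t *r0)
{
    *r1 = x1;
    *r0 = x1 ^ x0;
}

/* Scalar reference used only for checking: shift-and-add multiplication
 * of two 2-bit elements followed by reduction modulo x^2 + x + 1. */
unsigned gf4_mul_ref(unsigned a, unsigned b)
{
    unsigned p = 0;
    if (b & 1u) p ^= a;
    if (b & 2u) p ^= a << 1;
    if (p & 4u) p ^= 7u;   /* x^2 = x + 1, i.e. XOR with binary 111 */
    return p & 3u;
}
```

Packing the 16 operand pairs into lanes 0 to 15 and comparing each lane of gf4_mul_bs against gf4_mul_ref is a quick way to check the matrix; the same AND/XOR sequence works unchanged on wider registers.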
In summary, the inversion operation on GF(2^8) can be converted into inversion and multiplication operations on GF(2^4); the inversion operation on GF(2^4) can in turn be converted into an inversion operation on GF(2^2); and the multiplication on GF(2^4) can be rewritten as a matrix-vector multiplication, which can be implemented at a word length of 1 bit. With this adaptation, both the nonlinear and the linear parts of standard SM4 are computed at a word length of 1 bit.
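The reduction summarized above can be illustrated for the GF(2^4) level. The sketch below assumes a tower representation in which an element of GF((2^2)^2) is h*y + l with h, l in GF(2^2), the outer modulus is y^2 + y + lambda with lambda = {1, 0}, and the inner modulus is x^2 + x + 1; the patent's actual polynomials are in the figures and may differ. Under these assumptions the inverse is (h*d^-1)*y + (h + l)*d^-1 with d = lambda*h^2 + h*l + l^2, so one GF(2^4) inversion costs one GF(2^2) inversion plus a few GF(2^2) multiplications. All type and function names are hypothetical.

```c
#include <stdint.h>

/* Bitsliced GF(2^2) element: b1/b0 hold the high/low bit of 64 lanes. */
typedef struct { uint64_t b1, b0; } gf4_t;
/* Bitsliced GF((2^2)^2) element h*y + l, with h and l in GF(2^2). */
typedef struct { gf4_t h, l; } gf16_t;

gf4_t gf4_add(gf4_t x, gf4_t y)
{
    gf4_t r = { x.b1 ^ y.b1, x.b0 ^ y.b0 };
    return r;
}

/* Multiplication modulo x^2 + x + 1 (assumed inner modulus). */
gf4_t gf4_mul(gf4_t x, gf4_t y)
{
    gf4_t r = { ((x.b1 ^ x.b0) & y.b1) ^ (x.b1 & y.b0),
                (x.b1 & y.b1) ^ (x.b0 & y.b0) };
    return r;
}

/* Inversion: high bit unchanged, low bit = high XOR low. */
gf4_t gf4_inv(gf4_t x)
{
    gf4_t r = { x.b1, x.b1 ^ x.b0 };
    return r;
}

/* Squaring coincides with inversion in GF(2^2): x^2 = x^-1 for x != 0. */
gf4_t gf4_sq(gf4_t x) { return gf4_inv(x); }

/* Multiplication by the assumed constant lambda = {1, 0}. */
gf4_t gf4_mul_lambda(gf4_t x)
{
    gf4_t r = { x.b1 ^ x.b0, x.b1 };
    return r;
}

/* Product in GF((2^2)^2), reduced with y^2 = y + lambda. */
gf16_t gf16_mul(gf16_t a, gf16_t b)
{
    gf4_t hh = gf4_mul(a.h, b.h);
    gf16_t r;
    r.h = gf4_add(hh, gf4_add(gf4_mul(a.h, b.l), gf4_mul(a.l, b.h)));
    r.l = gf4_add(gf4_mul_lambda(hh), gf4_mul(a.l, b.l));
    return r;
}

/* Inversion in GF((2^2)^2) using only GF(2^2) operations. */
gf16_t gf16_inv(gf16_t a)
{
    /* d = lambda*h^2 + h*l + l^2 */
    gf4_t d = gf4_add(gf4_mul_lambda(gf4_sq(a.h)),
                      gf4_add(gf4_mul(a.h, a.l), gf4_sq(a.l)));
    gf4_t di = gf4_inv(d);
    gf16_t r;
    r.h = gf4_mul(a.h, di);
    r.l = gf4_mul(gf4_add(a.h, a.l), di);
    return r;
}
```

Placing the sixteen field elements in lanes 0 to 15 and multiplying each by its computed inverse should give the element 1 in every nonzero lane, which is a convenient self-check for any other choice of lambda as well.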
The S-box lookup is usually one of the most time-consuming operations in a block cipher. The bit-slicing S-box computation was measured using x64 instructions; the comparison in Table 1 shows that the bit-slicing algorithm has a significant advantage in multi-thread computation, with a speed close to 4 times that of table lookup. For the complete SM4 algorithm, the advantage of the slicing algorithm is shown in Table 2.
Table 1: Efficiency comparison of the slicing algorithm and table lookup for the SM4 S-box
Figure 393105DEST_PATH_IMAGE121
In Table 1, the query speed in the last column is the average number of S-box queries per second, and the execution count is the number of times the test program was run. One run of the slicing method performs 64 S-box queries, 64 times as many as the ordinary method, yet the time required increases by less than 20 times (21.288 s / 1.287 s = 16.54 times). The effective speed-up per query is therefore about 64 / 16.54 ≈ 3.87, so the slicing method is clearly faster than the conventional S-box table lookup.
Table 2: Speed comparison of the SM4 slicing algorithm and the standard implementation
Algorithm | Blocks processed | Total time | Data volume | Throughput
Standard SM4 implementation | 12800000 | 2.451 s | 1638400000 bit | 668 Mb/s
Slicing SM4 implementation | 12800000 | 0.991 s | 1638400000 bit | 1,655 Mb/s
As can be seen from Table 2, the sliced SM4 implementation takes less time and achieves higher throughput than the standard SM4 implementation, by a factor of roughly 2.5 on the same data volume.
With this method, the standard SM4 algorithm is transformed into an SM4 algorithm whose input, output, intermediate state and computation can all be realized in 1-bit registers; each variable in the algorithm occupies only 1 bit. On a 32-bit computing platform, each variable can therefore hold 32 independent values in one 32-bit register, and every operation processes 32 groups at once, so that 32 SM4 threads are computed in parallel. On a wider computing platform (e.g., 64 bits), correspondingly more SM4 threads can be computed synchronously.
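The lane layout described above can be sketched in C for a 32-bit platform. The sketch assumes that plane i collects bit i (bit 0 being the least significant) of 32 independent words, one per lane, and uses the standard SM4 linear transform L(B) = B ^ (B <<< 2) ^ (B <<< 10) ^ (B <<< 18) ^ (B <<< 24) as the example of a linear part. In sliced form a rotation costs no instructions: it only changes which plane is read. The function names bs_transpose32 and bs_sm4_L are hypothetical.

```c
#include <stdint.h>

/* Scalar reference: rotate left, r must be in 1..31. */
uint32_t rotl32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32u - r));
}

/* Scalar reference: the SM4 linear transform L on one 32-bit word. */
uint32_t sm4_L(uint32_t b)
{
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24);
}

/* 32x32 bit-matrix transpose: out[i] bit k = in[k] bit i.
 * Applied to 32 words it yields 32 bit-planes; applied to the planes
 * it yields the words again, so it serves as both pack and unpack. */
void bs_transpose32(const uint32_t in[32], uint32_t out[32])
{
    for (int i = 0; i < 32; i++) {
        uint32_t p = 0;
        for (int k = 0; k < 32; k++)
            p |= ((in[k] >> i) & 1u) << k;
        out[i] = p;
    }
}

/* Sliced L: bit j of (B <<< r) is bit (j - r) mod 32 of B, so each
 * rotation becomes an index offset and only XORs between planes remain.
 * Every XOR here processes 32 independent SM4 instances. */
void bs_sm4_L(const uint32_t in[32], uint32_t out[32])
{
    for (int j = 0; j < 32; j++)
        out[j] = in[j]
               ^ in[(j + 30) & 31]   /* <<< 2  */
               ^ in[(j + 22) & 31]   /* <<< 10 */
               ^ in[(j + 14) & 31]   /* <<< 18 */
               ^ in[(j + 8) & 31];   /* <<< 24 */
}
```

The naive transpose shown here costs 1024 bit operations and is only meant to make the layout explicit; it is needed once on input and once on output, while all round computations stay in the sliced domain.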
Implementation on SIMD instruction sets and GPUs:
Given two vectors
Figure 597690DEST_PATH_IMAGE122
where
Figure 482469DEST_PATH_IMAGE123
are words, the vector instructions used by the present invention compute the following:
Figure 540381DEST_PATH_IMAGE124
Figure 801598DEST_PATH_IMAGE125
Example 1: ARM/NEON instruction set implementation:
ARM processors are currently the mainstream processors in smart mobile devices such as mobile phones. The most widely deployed Cortex-A series ARM processor architecture includes, in addition to the general ARM instruction set (ARMv7), the NEON SIMD instruction set. NEON provides sixteen 128-bit SIMD registers and can operate on four 32-bit words in parallel, so the SM4 of this embodiment can be further expanded to 128 groups computed in parallel.
The NEON instructions used are given below in the form of intrinsics, where int32x4_t is the 128-bit NEON vector type:
c = a ⊕ b uses the intrinsic int32x4_t c = veorq_s32(int32x4_t a, int32x4_t b)
c = a & k uses the intrinsic int32x4_t c = vandq_s32(int32x4_t a, int32x4_t k)
Example 2: x86/AVX2 instruction set implementation:
The non-vector code can be written in a high-level language and compiled; the vector algorithm is implemented with AVX2 instructions, which operate on eight 32-bit words in parallel, so the SM4 of this embodiment can be further expanded to 256 groups computed in parallel:
c = a ⊕ b uses the intrinsic __m256i c = _mm256_xor_si256(__m256i a, __m256i b)
c = a & k uses the intrinsic __m256i c = _mm256_and_si256(__m256i a, __m256i k)
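As a hedged illustration of how the same AND/XOR sequence scales to a 256-bit register, the sketch below wraps the two intrinsics named above behind a small plane type and uses them for a 256-lane GF(2^2) multiplication (modulus x^2 + x + 1 assumed). When the compiler does not define __AVX2__ (for example, without -mavx2), a portable fallback of four 64-bit words with the same interface is used instead. The names plane256_t, p256_load, p256_store, p256_xor, p256_and and gf4_mul_256 are hypothetical and do not come from the patent.

```c
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
typedef __m256i plane256_t;              /* 256 one-bit lanes */
static inline plane256_t p256_load(const uint64_t w[4])
{ return _mm256_loadu_si256((const __m256i *)w); }
static inline void p256_store(uint64_t w[4], plane256_t v)
{ _mm256_storeu_si256((__m256i *)w, v); }
static inline plane256_t p256_xor(plane256_t a, plane256_t b)
{ return _mm256_xor_si256(a, b); }
static inline plane256_t p256_and(plane256_t a, plane256_t b)
{ return _mm256_and_si256(a, b); }
#else
/* Portable fallback with the same interface: four 64-bit words. */
typedef struct { uint64_t w[4]; } plane256_t;
static inline plane256_t p256_load(const uint64_t w[4])
{ plane256_t v; for (int i = 0; i < 4; i++) v.w[i] = w[i]; return v; }
static inline void p256_store(uint64_t w[4], plane256_t v)
{ for (int i = 0; i < 4; i++) w[i] = v.w[i]; }
static inline plane256_t p256_xor(plane256_t a, plane256_t b)
{ for (int i = 0; i < 4; i++) a.w[i] ^= b.w[i]; return a; }
static inline plane256_t p256_and(plane256_t a, plane256_t b)
{ for (int i = 0; i < 4; i++) a.w[i] &= b.w[i]; return a; }
#endif

/* GF(2^2) product {c1,c0} = {x1,x0} * {y1,y0} on 256 lanes at once:
 *   c1 = ((x1 ^ x0) & y1) ^ (x1 & y0)
 *   c0 = (x1 & y1) ^ (x0 & y0)                                       */
void gf4_mul_256(const uint64_t x1[4], const uint64_t x0[4],
                 const uint64_t y1[4], const uint64_t y0[4],
                 uint64_t c1[4], uint64_t c0[4])
{
    plane256_t a1 = p256_load(x1), a0 = p256_load(x0);
    plane256_t b1 = p256_load(y1), b0 = p256_load(y0);
    p256_store(c1, p256_xor(p256_and(p256_xor(a1, a0), b1),
                            p256_and(a1, b0)));
    p256_store(c0, p256_xor(p256_and(a1, b1), p256_and(a0, b0)));
}
```

Keeping the plane type behind a four-function interface means the sliced round logic is written once and only the wrapper changes between 32-bit, 64-bit, NEON, AVX2 or GPU targets.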
Example 3: OpenCL-based GPU implementation:
OpenCL code is written much like C. Because every variable in the slicing algorithm is 1 bit wide, the 128-bit input plaintext is held as 128 one-bit variables, each 32-bit state word as 32 one-bit variables, and the S-box takes 8 one-bit inputs and produces 8 one-bit outputs. Since 64-bit registers are used, a single GPU thread is equivalent to computing 64 SM4 threads in parallel.
The structure of the S-box query code in OpenCL is as follows:
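unsigned long long tp[32];   /* one 32-bit word as 32 bit-planes, 64 lanes per plane */
...
for (i = 0; i < 32; i++)
{
    ...
    /* sliced S-box on the upper 8 bits of tp, one plane per bit */
    SBOX_SMS4_BS(&tp[31], &tp[30], &tp[29], &tp[28], &tp[27], &tp[26], &tp[25], &tp[24]);
    ...
}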
In the code structure above, the original standard operation is an S-box lookup on the upper 8 bits of a 32-bit word tp. Under the slicing algorithm, the 32 bits are represented as an array, and the 8 input bits must be passed separately as 1-bit variables. Because the registers are 64 bits wide, SBOX_SMS4_BS actually computes 64 S-boxes simultaneously, thereby multiplying the number of SM4 threads by 64.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (7)

1. A high-performance SM4 bit slice optimization method oriented to heterogeneous parallel architecture is characterized by comprising the following steps:
1) dividing a variable with a word length of 32 bits of an original standard algorithm SM4 into 32 variables with a word length of 1 bit in sequence;
2) for the linear part of the original standard algorithm SM4, the XOR operation on 32-bit words and the cyclic left shift of 32-bit words are converted into XOR and transposition among the 32 one-bit variables defined in step 1);
3) the nonlinear part of the original standard algorithm SM4, the S-box operation, is decomposed into matrix affine transformations and inversion on the finite field GF(2^8);
4) for the inversion on the finite field GF(2^8), a finite-field tower-structure transformation is used: GF(2^8) is isomorphically mapped, by a matrix affine transformation, to the composite field GF((2^4)^2), so that the inversion is converted into inversion and multiplication on GF(2^4), implemented at a word length of 1 bit;
5) for the inversion on GF(2^4), the finite-field tower-structure transformation is used again: GF(2^4) is isomorphically mapped, by a matrix affine transformation, to the finite field GF((2^2)^2), so that the inversion is converted into inversion and multiplication on GF(2^2), implemented at a word length of 1 bit;
6) the inversion on GF(2^2) is equivalent to leaving the high-order bit unchanged and setting the low-order bit to the XOR of the high-order and low-order bits, and is implemented at a word length of 1 bit;
7) following steps 1) to 6), the entire SM4 algorithm is computed using only 1-bit words and XOR and AND operations, so that an X-bit register is treated as a vector register of X lanes, realizing multi-thread parallel computation of X groups of the SM4 algorithm.
2. The method of claim 1, wherein the exclusive-or operation in step 2) comprises:
1) for two 32-bit state words of the original standard algorithm SM4
Figure DEST_PATH_IMAGE012
calculating
Figure DEST_PATH_IMAGE014
2) splitting each of the two variables into 32 one-bit variables
Figure DEST_PATH_IMAGE016
and
Figure DEST_PATH_IMAGE018
3) performing the exclusive-or calculation on each pair of aligned bits:
Figure DEST_PATH_IMAGE020
and outputting the result in the form of a 32-bit state word divided into 1-bit units.
3. The method of claim 1, wherein the transposition in step 2) comprises:
1) the input 32-bit word
Figure DEST_PATH_IMAGE022
is divided in sequence into 1-bit units
Figure DEST_PATH_IMAGE024
2) the output 32-bit word
Figure DEST_PATH_IMAGE026
is divided in sequence into 1-bit units
Figure DEST_PATH_IMAGE028
3) according to the linear transformation formula
Figure DEST_PATH_IMAGE030
for a given
Figure DEST_PATH_IMAGE032
the value of
Figure DEST_PATH_IMAGE034
is found by XORing the relevant bit data of
Figure 825888DEST_PATH_IMAGE022
calculated as:
Figure DEST_PATH_IMAGE036
wherein
Figure DEST_PATH_IMAGE038
denotes, within
Figure 840724DEST_PATH_IMAGE022
the bit at position
Figure 977307DEST_PATH_IMAGE032
4) for
Figure DEST_PATH_IMAGE040
step 3) is repeated to obtain all
Figure 459235DEST_PATH_IMAGE034
as the final result.
4. The method of claim 1, wherein the processing in step 4) comprises:
1) constructing, by matrix multiplication, the isomorphic mapping from
Figure DEST_PATH_IMAGE042
to
Figure DEST_PATH_IMAGE044
2) performing the inversion operation on the
Figure 206261DEST_PATH_IMAGE044
obtained in step 1), according to the formula:
Figure DEST_PATH_IMAGE046
Figure DEST_PATH_IMAGE048
Figure DEST_PATH_IMAGE050
is the modulus polynomial;
expanding the formula gives:
Figure DEST_PATH_IMAGE052
Figure DEST_PATH_IMAGE054
according to the formula in step 2), the inversion operation on GF((2^4)^2) further becomes the inversion of the element
Figure DEST_PATH_IMAGE056
on GF(2^4) together with multiplication on GF(2^4).
5. The method of claim 1, wherein the multiplication on GF(2^4) in step 4) is realized by conversion into a matrix-vector multiplication: the two 4-bit elements to be multiplied are regarded as two vectors, one of which is rewritten in matrix form and multiplied by the other vector, the calculation being implemented at a word length of 1 bit.
6. The method of claim 1, wherein the processing in step 5) comprises:
1) constructing, by matrix multiplication, the isomorphic mapping from GF(2^4) to GF((2^2)^2);
2) performing the inversion operation on the GF((2^2)^2) obtained in step 1), according to the formula:
Figure DEST_PATH_IMAGE058
wherein
Figure DEST_PATH_IMAGE060
is an element of GF((2^2)^2),
Figure DEST_PATH_IMAGE062
is the inverse of that element,
Figure DEST_PATH_IMAGE064
Figure DEST_PATH_IMAGE066
is the modulus polynomial;
expanding the formula gives:
Figure DEST_PATH_IMAGE068
Figure DEST_PATH_IMAGE070
according to the formula in step 2), the inversion operation on GF(2^4) becomes inversion and multiplication operations on GF(2^2).
7. The method of claim 1, wherein the multiplication on GF(2^2) in step 5) is realized by a matrix operation, comprising: regarding the two 2-bit elements to be multiplied as two vectors, rewriting one vector in matrix form, and multiplying it by the other vector; the inversion operation on the finite field GF(2^2) is then performed by table lookup.
CN202210542472.4A 2022-05-19 2022-05-19 High-performance SM4 bit slice optimization method for heterogeneous parallel architecture Active CN114710285B (en)


Publications (2)

Publication Number Publication Date
CN114710285A true CN114710285A (en) 2022-07-05
CN114710285B CN114710285B (en) 2022-08-23

Family

ID=82175690





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant