CN117908835A - Method for accelerating SM2 cryptographic algorithm based on floating point number computing capability

Info

Publication number: CN117908835A
Application number: CN202410318131.8A
Authority: CN
Inventors: 吴雯; 董建阔; 董振江; 陈滏媛; 吉欣仪
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2024-03-20
Filing date: 2024-03-20
Publication date: 2024-04-19
Anticipated expiration: 2044-03-20
Also published as: CN117908835B

Abstract

The invention discloses a method for accelerating the SM2 cryptographic algorithm based on floating-point computing capability, and relates to the field of information security. The method comprises: dividing a 256-bit large integer in the SM2 prime field into 5 words of 52 bits each; computing the high-order and low-order sub-products of every pair of words with an improved fused multiply-add (FMA) instruction, and storing each sub-product in the mantissa of a double-precision floating-point number; accumulating the product of each word of the multiplicand A with each word of the multiplier B into the corresponding position in a fixed order; and using a masking operation to set the sign bit and exponent bits of each floating-point number to zero, so that the freed bits can hold the carries generated during accumulation, which saves floating-point storage space and reduces register usage. The beneficial effects of the invention are as follows: the proposed representation of SM2 large integers reduces the number of words and the number of multiply-add operations, which lowers the computational complexity and thereby greatly increases the computation speed.

Description

Method for accelerating SM2 cryptographic algorithm based on floating point number computing capability

Technical Field

The invention belongs to the technical field of information security, and particularly relates to a method for accelerating SM2 cryptographic algorithm based on floating point number computing capability.

Background

With the rapid development of the internet and related industries, ensuring the secure storage and transmission of information, together with its integrity and non-repudiation, has become a research hotspot in network security, and public-key cryptosystems play an irreplaceable role in this field. To guarantee security, the RSA key length has grown from 512 bits to 1024 bits, and 2048 bits are required for a higher security level. The security of elliptic curve cryptography (ECC) rests on the elliptic curve discrete logarithm problem (ECDLP). Compared with the integer factorization problem (IFP) underlying RSA, solving the ECDLP takes exponential time, so ECC offers higher security, and at the same security level an ECC key is much shorter than an RSA key. Because of the shorter key, ECC needs less bandwidth and transmits faster during signing and verification, and it has become the most promising candidate among public-key cryptosystems. The SM2 algorithm issued by the State Cryptography Administration of China is an asymmetric cryptographic algorithm over elliptic curves. The operands of SM2 are usually far longer than the native word length of a processor. The conventional approach splits each operand into several unsigned integers, which, compared with a floating-point representation, requires more words and has higher computational complexity.

Early GPUs were configurable graphics processors; over time they have become highly flexible programmable parallel processors. Most of a GPU's transistors are devoted to data processing, which favors parallel computing and lets the GPU provide higher instruction throughput and memory bandwidth. Many-core computing platforms that provide fused multiply-add instructions and double-precision floating-point capability, such as OpenCL and ROCm, offer powerful parallel computing capability for high-performance cryptographic workloads. These platforms support general-purpose and heterogeneous computing and make full use of different processing units such as the CPU (central processing unit) and the GPU (graphics processing unit), which enables efficient and accurate numerical computation, improves algorithm performance, and provides portability across hardware platforms through a general computing framework. NVIDIA's CUDA parallel computing architecture can greatly improve computing performance by exploiting the processing power of the GPU, and it is increasingly used in high-performance cryptographic computing because it is large-scale, highly parallel, and easy to develop for. Using the instructions of CUDA and similar platforms, large-integer arithmetic over the SM2 finite field can be parallelized across multiple threads, which improves computing efficiency.

Prior work has studied accelerating cryptographic algorithms with GPU computing power, but none of it addresses accelerating the SM2 cryptographic algorithm with the parallel floating-point capability of the GPU on the platforms described above, that is, improving the efficiency of the algorithm by accelerating large-integer modular multiplication over the SM2 finite field Fp.

For example, CN113221193A discloses a GPU-based method and system for fast SM2 digital signature generation and verification, in which modular-operation optimization and compression-function optimization are applied to signing or verification on the GPU side. The whole computation is accelerated on the GPU, but it does not make full use of the GPU's floating-point computing capability.

CN109145616A discloses a method and system for implementing SM2 encryption, signature, and key exchange based on efficient modular multiplication, which exploits the special form of the SM2 prime to implement efficient modular multiplication. However, that method speeds up the SM2 cryptographic algorithm by optimizing the prime-specific algorithm for modular multiplication over the SM2 finite field, and it does not discuss using the integer or floating-point computing capability of a GPU platform to increase the computation speed.

Disclosure of Invention

To solve the above technical problems, the invention provides a method for accelerating the SM2 cryptographic algorithm based on floating-point computing capability. By splitting a 256-bit large integer in the SM2 prime field and using an improved fused multiply-add (FMA) instruction, a masking operation, and related techniques, the method reduces the number of words in the representation and the number of multiply-add operations, which lowers the computational complexity and thereby greatly increases the computation speed.

The invention discloses a method for accelerating SM2 cryptographic algorithm based on floating point number computing capability, which comprises the following steps:

Step 1, data division: dividing a multiplicand A and a multiplier B, each n bits long and both in the SM2 prime field Fp, into M single-word segments each, wherein each segment is w bits long, and storing the data in GPU shared memory;

Step 2, multiplication: writing the multi-precision multiplication of the multiplicand A and the multiplier B as C = A × B, wherein the product C is at most 2n bits long; computing the product of each pair of segments in turn with an improved fused multiply-add (FMA) instruction, splitting each product into a high-order sub-product and a low-order sub-product, and storing them in the 52-bit mantissas of two double-precision floating-point numbers;

Step 3, masking: masking the high-order and low-order sub-products obtained in step 2 so that the sign bit and exponent bits of each double-precision floating-point number are all 0, converting them to their binary form as uint64_t values, and then summing in the integer domain;

Step 4, fast reduction: using a fast reduction formula to reduce the 10 accumulators, which hold the per-column accumulated results of the multiplication, to 5 accumulators, so that the product is reduced to a result 266 bits long, and then performing carry digestion to obtain the result of the modular multiplication over the SM2 prime field Fp.

Further, the multiplicand A and the multiplier B, each n bits long, are divided into M segments from the low-order end to the high-order end, each segment being w bits long; if the last segment is shorter than w bits, it is padded to w bits with zeros in the high-order positions.
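The segmentation just described can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the helper names `split_words` and `join_words` are hypothetical, and Python integers stand in for the GPU data types. The values w = 52 and M = 5 are the ones used for 256-bit SM2 operands in this document.

```python
W = 52                      # bits per word, matching a double's 52-bit mantissa field
M = 5                       # words needed for a 256-bit operand (5 * 52 = 260 >= 256)

def split_words(x, m=M, w=W):
    """Split a non-negative integer into m words of w bits, lowest word first.

    Hypothetical helper: the top word is implicitly zero-padded to w bits.
    """
    if x < 0 or x >> (m * w):
        raise ValueError("operand does not fit in m * w bits")
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(m)]

def join_words(words, w=W):
    """Inverse of split_words: recombine the words into one integer."""
    return sum(word << (w * i) for i, word in enumerate(words))

a = (1 << 256) - 1          # largest 256-bit value, used only as sample input
words = split_words(a)
```

For a 256-bit operand the top word carries 256 - 4 × 52 = 48 significant bits, so its upper 4 bits are the zero padding.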

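Further, in step 2, with each n-bit large integer represented as M single-word segments, the multiplication is expressed as the sum of the products of every segment of A with every segment of B, each product being weighted by the power of two that corresponds to the positions of its two segments.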
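The multiplication described above can be written out as a worked equation. The word symbols a_i and b_j are an assumed notation introduced here for illustration, since the formula itself is not reproduced in this text; the expansion is the standard schoolbook form for operands of M words of w bits.

```latex
A = \sum_{i=0}^{M-1} a_i \, 2^{w i}, \qquad
B = \sum_{j=0}^{M-1} b_j \, 2^{w j}, \qquad
C = A \times B = \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} a_i \, b_j \, 2^{w (i + j)}
```

With M = 5 this gives 25 word products, which matches the count stated in the detailed description.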

Further, a squaring operation, as a special case of multiplication, contains sub-products that would otherwise be computed twice. The square is therefore expressed in terms of two quantities, E and F, where E collects the sub-products of different words, which are the ones that repeat, and F collects the sub-products of each word with itself.
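The saving from treating a square separately can be checked with a short sketch. It assumes the decomposition A² = 2E + F, which is consistent with the description of E and F but is not a formula reproduced in this text; the function names are hypothetical.

```python
W = 52

def split_words(x, m=5, w=W):
    """Split x into m words of w bits, lowest word first (hypothetical helper)."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(m)]

def square_parts(words, w=W):
    """Return (E, F) for the assumed decomposition A*A == 2*E + F.

    E sums the products of different words (each pair i < j taken once),
    F sums the products of each word with itself.
    """
    m = len(words)
    E = sum((words[i] * words[j]) << (w * (i + j))
            for i in range(m) for j in range(i + 1, m))
    F = sum((words[i] * words[i]) << (2 * w * i) for i in range(m))
    return E, F

# Arbitrary 256-bit sample operand, chosen only for illustration.
a = 0x1234_5678_9ABC_DEF0_0FED_CBA9_8765_4321_1357_9BDF_0246_8ACE_DEAD_BEEF_CAFE_F00D
E, F = square_parts(split_words(a))
```

With 5 words this needs 10 cross products plus 5 self products, that is 15 word multiplications instead of 25.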

Further, the improved fused multiply-add (FMA) instruction takes two operands x and y, both double-precision floating-point numbers, and yields two results: the high-order sub-product and the low-order sub-product of x and y.
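One well-known way to obtain both halves of a 52-bit by 52-bit product from fused multiply-add instructions is the constant-offset technique sketched below. It is offered as an assumption about how such an instruction pair can work, not as the formula of this patent. The constants 2^104 and 2^104 + 2^52, the helper `fma_rz` (a pure-Python emulation of a round-toward-zero FMA, which CUDA exposes as the `__fma_rz` intrinsic), and the names `split_product` and `mantissa` are all introduced here for illustration.

```python
import struct

MANT_MASK = (1 << 52) - 1

def fma_rz(x, y, c):
    """Emulate fma(x, y, c) rounded toward zero for integer-valued doubles.

    The exact value x*y + c is formed with Python integers and then
    truncated to 53 significant bits, as a hardware FMA in RZ mode would.
    """
    exact = int(x) * int(y) + int(c)
    sign = -1 if exact < 0 else 1
    mag = abs(exact)
    excess = max(mag.bit_length() - 53, 0)
    return float(sign * ((mag >> excess) << excess))

C1 = float(1 << 104)                 # pushes the product's high half into the mantissa
C2 = float((1 << 104) + (1 << 52))   # used to cancel the high half again

def mantissa(d):
    """Return the 52-bit mantissa field of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", d))[0]
    return bits & MANT_MASK

def split_product(x, y):
    """Return doubles whose mantissa fields hold the high and low 52 bits of x*y."""
    hi = fma_rz(x, y, C1)            # equals 2^104 + floor(x*y / 2^52) * 2^52
    lo = fma_rz(x, y, C2 - hi)       # equals 2^52 + (x*y mod 2^52), exact
    return hi, lo

x = float((1 << 52) - 1)             # sample 52-bit words stored as doubles
y = float((1 << 52) - 3)
hi, lo = split_product(x, y)
```

Both results still carry a non-zero exponent field (that of 2^104 and of 2^52 respectively), which is why a masking step is needed before the values are summed as integers.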

Further, the SM2 fast reduction method reduces the 10 accumulators involved in the accumulation process to 5 accumulators; one round of carry digestion reduces the product to 266 bits, and a further carry digestion operation yields the result of the modular multiplication over the SM2 prime field Fp. In step 4, the fast reduction formula maps the values held in the 10 accumulators before fast reduction to the values held in the 5 accumulators after fast reduction, which together represent the product after fast reduction, 266 bits in length.
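The principle behind a fast reduction for SM2 can be illustrated at the level of whole integers. The sketch below is not the accumulator-level formula of this patent, which is not reproduced in this text. It only uses the SM2 prime p = 2^256 - 2^224 - 2^96 + 2^64 - 1 from the published standard and the resulting congruence 2^256 ≡ 2^224 + 2^96 - 2^64 + 1 (mod p); the names `P`, `FOLD` and `fold_reduce` are hypothetical.

```python
# SM2 prime from the published standard.
P = (1 << 256) - (1 << 224) - (1 << 96) + (1 << 64) - 1
# Because 2^256 - FOLD == P, we have 2^256 ≡ FOLD (mod P).
FOLD = (1 << 224) + (1 << 96) - (1 << 64) + 1
LOW_MASK = (1 << 256) - 1

def fold_reduce(c):
    """Reduce a non-negative integer modulo P by folding the bits above bit 255.

    Each fold replaces hi * 2^256 with hi * FOLD, which lowers the value by
    hi * P, so the loop terminates with a value below 2^256.
    """
    while c >> 256:
        c = (c & LOW_MASK) + (c >> 256) * FOLD
    while c >= P:                    # at most one subtraction, since 2^256 < 2 * P
        c -= P
    return c

a = P - 1                            # sample field elements, for illustration only
b = P - 2
r = fold_reduce(a * b)
```

A limb-based version applies the same congruence to the upper accumulators, with the powers of two rewritten relative to 52-bit word boundaries.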

The beneficial effects of the invention are as follows. The method splits a 256-bit SM2 large integer into several words stored in the mantissas of floating-point numbers and makes full use of the sign and exponent bits of those numbers to hold the carries generated during accumulation, which saves floating-point storage space and register usage. The fused multiply-add instruction accumulates the product of each word of the multiplicand A with each word of the multiplier B into the corresponding position in a fixed order; because the low-order sub-product is anchored at 52 bits, no additional data alignment is needed during accumulation, which simplifies the operation. The large-integer representation reduces the number of words and the number of multiply-add operations, which lowers the computational complexity and increases the computation speed.

Drawings

FIG. 1 is an overall flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of the accumulation process of GF(p)-based multi-precision multiplication;

FIG. 3 is a schematic diagram of the masking operation.

Detailed Description

In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments that are illustrated in the appended drawings.

The method disclosed by the invention accelerates the SM2 cryptographic algorithm by accelerating modular multiplication over the finite field. As shown in FIG. 1, it comprises the following parts:

(I) Data division part

1) A multiplicand A and a multiplier B, each 256 bits long, are each divided into 5 segments of 52 bits, and the data are stored in GPU shared memory;

2) The following notation is defined: the segments of A are numbered 0 to 4, and the segments of B are likewise numbered 0 to 4. The last segment of A and of B holds only 48 bits and is padded to 52 bits with zeros in the high-order positions. Three further word types are defined, each stored in a double: a word of length 51, a word of length 48, and a word of length 44.

(II) Multiplication part

The multi-precision multiplication of the multiplicand A and the multiplier B can be converted into 25 single-word multiplications, one for each pair consisting of a segment of A and a segment of B, where each sub-product is at most 104 bits long. One improved fused multiply-add (FMA) instruction yields the high-order part of each sub-product, and a second yields its low 52 bits. Because the length of the low-order part is anchored, no additional alignment is needed during accumulation. The product of A and B can thus be expressed through column accumulators, each holding the sum of the high-order and low-order sub-products that fall in its column. The corresponding accumulation model is shown in FIG. 2.
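The column accumulation can be sketched as follows. This is a minimal model in which Python integers stand in for the FMA results: the low 52 bits of the product of segments i and j are added to column i + j and the high part to column i + j + 1. That column assignment is the usual schoolbook layout and is an assumption here, as are the function names.

```python
W = 52
MASK = (1 << W) - 1

def split_words(x, m=5, w=W):
    """Split x into m words of w bits, lowest word first (hypothetical helper)."""
    return [(x >> (w * i)) & ((1 << w) - 1) for i in range(m)]

def column_accumulate(a_words, b_words):
    """Accumulate the 25 sub-products of two 5-word operands into 10 columns."""
    m = len(a_words)
    cols = [0] * (2 * m)
    for i in range(m):
        for j in range(m):
            p = a_words[i] * b_words[j]   # at most 104 bits
            cols[i + j] += p & MASK        # low 52 bits stay in column i + j
            cols[i + j + 1] += p >> W      # high part moves one column up
    return cols

a = (1 << 256) - 189                       # arbitrary 256-bit sample operands
b = (1 << 255) + 12345
cols = column_accumulate(split_words(a), split_words(b))
product = sum(c << (W * k) for k, c in enumerate(cols))
```

Each column receives at most 10 values below 2^52, so every column sum stays below 2^56 and fits in a 64-bit accumulator without overflow.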

(III) Masking part

In the accumulation model obtained in step 2, each column accumulator holds the sum of the high-order and low-order sub-products of that column. Carries may be generated during accumulation, and the sign bit and exponent bits of each sub-product would interfere with carry handling and cause calculation errors. A masking operation is therefore applied to the high-order and low-order sub-products: the double data are first reinterpreted directly as binary values in uint64_t format, a negative initial value is then generated for the accumulator to cancel the sign bit and exponent bits of the floating-point numbers, and the values are finally summed in the integer domain, as shown in FIG. 2.
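The effect of the negative initial value can be reproduced with a small model. It assumes, for illustration only, that every double added to one accumulator carries the same known offset (here 2^52, so that the payload sits in the mantissa field); `bits_of` and `accumulate_masked` are hypothetical names, and 64-bit wraparound is emulated with a mask.

```python
import struct

U64 = (1 << 64) - 1
MANT_MASK = (1 << 52) - 1

def bits_of(d):
    """Reinterpret a double as its 64-bit pattern (what a uint64_t cast of the bits sees)."""
    return struct.unpack("<Q", struct.pack("<d", d))[0]

def accumulate_masked(doubles, offset):
    """Sum the mantissa payloads of doubles that all share the exponent of `offset`.

    The accumulator starts at minus (count * bit pattern of offset), taken
    modulo 2^64, so the sign and exponent fields cancel during the integer sum.
    """
    acc = (-len(doubles) * bits_of(offset)) & U64
    for d in doubles:
        acc = (acc + bits_of(d)) & U64
    return acc

OFFSET = float(1 << 52)
payloads = [5, (1 << 52) - 1, 12345]
lows = [float((1 << 52) + v) for v in payloads]   # exact, since each value is below 2^53
total = accumulate_masked(lows, OFFSET)
```

Compared with applying a bitwise AND to every operand, folding the correction into the initial value costs one constant per accumulator rather than one extra instruction per sub-product.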

(IV) Fast reduction part

The product of A and B obtained through steps 2 and 3 is 512 bits long. The SM2 fast reduction formula reduces the 10 accumulators to 5 accumulators. To avoid overflow during the calculation, carries are additionally digested within the fast reduction formula using bitwise AND (&) and shift operations, which finally gives the SM2 fast reduction formula based on double-precision floating-point numbers.

Fast reduction yields 5 redundant uint64_t values. Carry digestion then adds the upper 12 bits of each of four uint64_t values to the next uint64_t value, which gives a product result 256 bits long.

Finally, the result of the previous step is reduced modulo p to obtain the modular product of A and B over the SM2 prime field.
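Carry digestion on 52-bit limbs held in 64-bit words can be sketched as below. The sketch assumes the carries run from the lowest limb upward, and the function name `digest_carries` is hypothetical; Python integers stand in for uint64_t values.

```python
W = 52
MASK = (1 << W) - 1

def digest_carries(limbs):
    """Move the bits above bit 51 of each limb into the next higher limb.

    For a 64-bit word these are its upper 12 bits. The top limb keeps
    whatever carry it receives.
    """
    out = list(limbs)
    for k in range(len(out) - 1):
        out[k + 1] += out[k] >> W
        out[k] &= MASK
    return out

def value_of(limbs):
    """Integer represented by a list of limbs, lowest limb first."""
    return sum(limb << (W * k) for k, limb in enumerate(limbs))

redundant = [MASK + 7, MASK, 3, 0, 1]     # sample limbs, two of them over 52 bits
clean = digest_carries(redundant)
```

The represented integer is unchanged by the operation; only its distribution over the limbs becomes canonical, which is what makes the final comparison against p straightforward.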

The method of the present invention was compared with the methods of prior-art Document 1 and Document 2. Document 1 is Pascal Giorgi, Thomas Izard, Arnaud Tisserand, et al. Comparison of modular arithmetic algorithms on GPUs. In ParCo'09: International Conference on Parallel Computing, 2009. Document 2 is Shi Pu and Jyh-Charn Liu. EAGL: An Elliptic Curve Arithmetic GPU-Based Library for Bilinear Pairing. In Pairing-Based Cryptography – Pairing 2013, pages 1–19. Springer, 2014. The results are shown in Table 1:

TABLE 1

。

As Table 1 shows, the modular multiplication algorithm provided by the invention is a substantial improvement over the existing implementations; even allowing for the differences between platforms, the method of the invention performs 71.4% better than the implementation of Document 2.

The foregoing is merely a preferred embodiment of the present invention, and is not intended to limit the present invention, and all equivalent variations using the description and drawings of the present invention are within the scope of the present invention.

Claims

1. A method for accelerating SM2 cryptographic algorithm based on floating point number computing capability, the method comprising the steps of:

2. The method for accelerating SM2 cryptographic algorithm based on floating point number computing capability according to claim 1, wherein the multiplicand A and the multiplier B, each n bits long, are divided into M segments from the low-order end to the high-order end, each segment being w bits long, and if the last segment is shorter than w bits, it is padded to w bits with zeros in the high-order positions.

3. The method for accelerating SM2 cryptographic algorithm based on floating point number computing capability according to claim 1, wherein in step 2, with each n-bit large integer represented as M single-word segments, the multiplication is expressed as the sum of the products of every segment of A with every segment of B, each product being weighted by the power of two that corresponds to the positions of its two segments.

4. The method for accelerating SM2 cryptographic algorithm based on floating point number computing capability according to claim 3, wherein a squaring operation, as a special case of multiplication, contains sub-products that would otherwise be computed twice, and the square is expressed in terms of two quantities, E and F, where E collects the sub-products of different words, which are the ones that repeat, and F collects the sub-products of each word with itself.

5. The method for accelerating SM2 cryptographic algorithm based on floating point number computing capability according to claim 1, wherein the improved fused multiply-add (FMA) instruction is expressed as:

，

6. The method for accelerating SM2 cryptographic algorithm based on floating point number computing capability according to claim 1, wherein in step 4, the fast reduction formula is as follows, wherein:

；