CN117908835A - Method for accelerating SM2 cryptographic algorithm based on floating point number computing capability


Info

Publication number
CN117908835A
Authority
CN
China
Prior art keywords
floating point
product
bits
point number
order
Prior art date
Legal status
Granted
Application number
CN202410318131.8A
Other languages
Chinese (zh)
Other versions
CN117908835B (en)
Inventor
吴雯
董建阔
董振江
陈滏媛
吉欣仪
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202410318131.8A
Publication of CN117908835A
Application granted
Publication of CN117908835B
Active
Anticipated expiration

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a method for accelerating the SM2 cryptographic algorithm based on floating-point computing capability, and relates to the field of information security. The method splits a 256-bit large integer of the SM2 prime field into 5 words of 52 bits each, uses a modified fused multiply-add (FMA) instruction to compute the high-order and low-order sub-products of each pair of words, and stores them in the mantissas of double-precision floating-point numbers. The product of each word of the multiplicand A and each word of the multiplier B is accumulated into the corresponding position in a fixed order. A masking operation clears the sign and exponent bits of each floating-point number, and the cleared bits are used to hold the carries generated during accumulation, which saves floating-point storage space and reduces register usage. The beneficial effects of the invention are as follows: the proposed SM2 large-integer representation reduces the number of words and the number of multiply-add operations, lowering computational complexity and thereby greatly improving computation speed.

Description

Method for accelerating SM2 cryptographic algorithm based on floating point number computing capability
Technical Field
The invention belongs to the technical field of information security, and particularly relates to a method for accelerating the SM2 cryptographic algorithm based on floating-point computing capability.
Background
With the rapid development of the Internet and related industries, ensuring the secure storage and transmission of information, together with its integrity and non-repudiation, has become a research hotspot in network security. Public-key cryptosystems play an irreplaceable role in this field. To maintain security, the RSA key length has grown from 512 bits to 1024 bits, and 2048 bits are required for a higher security level. The security of ECC rests on the elliptic curve discrete logarithm problem (ECDLP). Compared with the integer factorization problem (IFP), the best known attacks on ECDLP take exponential time, so ECC offers higher security, and at the same security level its keys are much shorter than RSA keys. Because of the shorter keys, ECC needs less bandwidth and transmits faster during signing and verification, and it has become the most promising candidate among public-key cryptosystems. The SM2 algorithm issued by the State Cryptography Administration is an asymmetric cryptographic algorithm on an elliptic curve. The parameters involved in SM2 are usually far longer than the standard word length of a processor. The conventional approach splits these parameters into several unsigned integers, which, compared with the floating-point representation, needs more words and has higher computational complexity.
Early GPUs were configurable graphics processors; over time they have become highly flexible programmable parallel processors. Most of a GPU's transistors are devoted to data processing, which favors parallel computing and lets the GPU deliver higher instruction throughput and memory bandwidth. Many-core computing architectures that offer fused multiply-add instructions and double-precision floating-point capability, such as OpenCL and ROC, provide powerful parallel computing capability for high-performance cryptographic workloads. These architectures support general-purpose and heterogeneous computing and make full use of different processing units such as CPUs and GPUs, which enables efficient and accurate numerical computation, improves algorithm performance, and gives portability across hardware platforms through a general computing framework. NVIDIA's CUDA parallel computing architecture can greatly improve computing performance by exploiting the processing power of the GPU, and it is increasingly used in high-performance cryptographic computing because of its large scale, high parallelism, and ease of development. Using the instructions of CUDA and similar platforms, large-integer computation over the SM2 finite field can be parallelized across multiple threads, improving computing efficiency.
Existing work has studied the acceleration of cryptographic algorithms with GPU computing capability, but none of it accelerates the SM2 cryptographic algorithm using the GPU's parallel floating-point capability on the above architecture platforms, that is, none of it improves the efficiency of the algorithm by accelerating large-integer modular multiplication over the SM2 finite field Fp.
For example, CN113221193A discloses a GPU-based method and system for fast SM2 digital signing and signature verification, in which modular-operation optimization and compression-function optimization for signing or verification are carried out on the GPU. The whole process uses the GPU for acceleration but does not make full use of the GPU's floating-point computing capability.
CN109145616A discloses a method and system for SM2 encryption, signing and key exchange based on efficient modular multiplication. It exploits the structure of the SM2 prime to implement efficient modular multiplication, but it speeds up modular multiplication over the SM2 finite field by optimizing the prime-specific algorithm, and it does not discuss using the integer or floating-point computing capability of a GPU platform to accelerate the computation.
Disclosure of Invention
To solve the above technical problems, the invention provides a method for accelerating the SM2 cryptographic algorithm based on floating-point computing capability. By splitting the 256-bit large integers of the SM2 prime field and using a modified fused multiply-add (FMA) instruction, a masking operation, and related techniques, the method reduces the number of words and the number of multiply-add operations, lowering computational complexity and thereby greatly improving computation speed.
The method for accelerating the SM2 cryptographic algorithm based on floating-point computing capability comprises the following steps:
Step 1, data division: divide a multiplicand A and a multiplier B, each n bits long, over the SM2 prime field Fp into M single-word segments, denoted $A = (a_{M-1}, \dots, a_1, a_0)$ and $B = (b_{M-1}, \dots, b_1, b_0)$, where each segment is w bits long, and store the data in GPU shared memory;
Step 2, multiplication: the multi-precision multiplication of the multiplicand A and the multiplier B is written $C = A \times B$, and the product C is at most 2n bits long; compute the product of each pair of segments in turn with a modified fused multiply-add (FMA) instruction, split it into a high-order sub-product and a low-order sub-product, and store these in the 52-bit mantissas of two double-precision floating-point numbers;
Step 3, masking: mask the high-order and low-order sub-products obtained in step 2 so that the sign bit and exponent bits of each double-precision floating-point number are all 0, reinterpret the numbers as binary values in uint64_t format, and then sum them in the integer domain;
Step 4, fast reduction: using a fast reduction formula, reduce the 10 accumulators that hold the column sums of the multiplication to 5 accumulators, which reduces the product to a result 266 bits long; then perform carry propagation to obtain the result of the modular multiplication over the SM2 prime field Fp.
Further, the multiplicand A and the multiplier B, each n bits long, are divided into M segments from the low-order end to the high-order end, each w bits long; if the last segment is shorter than w bits, it is padded to w bits with high-order zeros.
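Further, in step 2, with each n-bit large integer expressed as M single-word segments, the multiplication is expressed as $A \times B = \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} a_i b_j \cdot 2^{w(i+j)}$, where $A = \sum_{i=0}^{M-1} a_i \cdot 2^{wi}$ and $B = \sum_{j=0}^{M-1} b_j \cdot 2^{wj}$.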
Further, a squaring operation computes some sub-products twice, so it is expressed as $A^2 = 2E + F$, where E is the sum of the repeated sub-products of different words, $E = \sum_{0 \le i < j \le M-1} a_i a_j \cdot 2^{w(i+j)}$, and F is the sum of the sub-products of identical words, $F = \sum_{i=0}^{M-1} a_i^2 \cdot 2^{2wi}$.
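As a hedged illustration of the squaring shortcut, the Python sketch below assumes the decomposition $A^2 = 2E + F$, with E collecting the cross terms $a_i a_j$ ($i < j$) and F the squares $a_i^2$. The function names and the values w = 52, M = 5 are illustrative assumptions. The sketch checks the identity with exact integers and counts word multiplications.

```python
W, M = 52, 5  # word size and word count (illustrative, matching the embodiment)

def split_words(x, w=W, m=M):
    """Little-endian split of x into m words of w bits each."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(m)]

def square_by_symmetry(a, w=W, m=M):
    """Return (a*a, number of word multiplications) using A^2 = 2E + F."""
    words = split_words(a, w, m)
    mults = 0
    e = 0  # cross terms a_i * a_j with i < j, each computed only once
    for i in range(m):
        for j in range(i + 1, m):
            e += (words[i] * words[j]) << (w * (i + j))
            mults += 1
    f = 0  # diagonal terms a_i * a_i
    for i in range(m):
        f += (words[i] * words[i]) << (2 * w * i)
        mults += 1
    return 2 * e + f, mults

a = (1 << 256) - 12345
square, mults = square_by_symmetry(a)
```

With M = 5 this uses 15 word multiplications, compared with the 25 needed when the two operands differ.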
Further, the modified fused multiply-add instruction is defined in terms of the following quantities: the fused multiply-add (FMA) instruction itself; x and y, the two multiplicands of the FMA instruction, both of double-precision floating-point type; the high-order sub-product; and the low-order sub-product.
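The exact form of the modified instruction is not given in the text above, so the following Python sketch only models a widely used double-precision splitting trick and should not be read as the patent's own formula. It assumes a round-toward-zero FMA (on CUDA this role would be played by the `__fma_rz` intrinsic) and the bias constants $2^{104}$ and $2^{104} + 2^{52}$. These constants and all function names are assumptions made for illustration. The FMA is emulated with exact integers so that the example runs anywhere.

```python
import struct

W = 52
MASK = (1 << W) - 1
C1 = 1 << 104                # assumed bias: aligns the high half with the mantissa
C2 = (1 << 104) + (1 << 52)  # assumed bias: used to recover the low half

def fma_rz_model(x, y, c):
    """Model of a round-toward-zero FMA on integer-valued doubles:
    form x*y + c exactly, then keep only the top 53 significant bits."""
    exact = x * y + c
    sign = -1 if exact < 0 else 1
    mag = abs(exact)
    drop = max(mag.bit_length() - 53, 0)
    return sign * ((mag >> drop) << drop)

def split_product(x, y):
    """Return two doubles whose mantissas hold the high and low 52 bits of x*y."""
    p_hi = fma_rz_model(x, y, C1)         # equals 2^104 + hi * 2^52
    p_lo = fma_rz_model(x, y, C2 - p_hi)  # equals 2^52 + lo
    return float(p_hi), float(p_lo)

def mantissa(d):
    """Extract the 52-bit fraction field of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", d))[0]
    return bits & MASK

x, y = (1 << 52) - 1, (1 << 52) - 3
p_hi, p_lo = split_product(x, y)
```

Both results still carry a nonzero exponent field, which is why a masking or cancellation step is needed before the mantissas can be summed as integers.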
Further, the SM2 fast reduction reduces the 10 accumulators involved in the accumulation to 5 accumulators, reduces the product to 266 bits through one round of carry propagation, and obtains the modular multiplication result over the SM2 prime field Fp through carry propagation. The fast reduction formula used in step 4 takes as input the values in the 10 accumulators before fast reduction, produces the values in the 5 accumulators after fast reduction, and yields the product result after fast reduction, which is 266 bits long.
The beneficial effects of the invention are as follows. The method splits a 256-bit SM2 large integer into several words stored in the mantissas of floating-point numbers, and uses the sign and exponent bits of the floating-point numbers to hold the carries generated during accumulation, which saves floating-point storage space and registers. Using the FMA instruction, the product of each word of the multiplicand A and each word of the multiplier B is accumulated into the corresponding position in a fixed order; the low-order sub-product has a fixed length of 52 bits, so no additional alignment is needed during accumulation, which simplifies the operation. The large-integer representation reduces the number of words and the number of multiply-add operations, lowering computational complexity and improving computation speed.
Drawings
FIG. 1 is an overall flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the accumulation process of GF(p)-based multi-precision multiplication;
FIG. 3 is a schematic diagram of the masking operation.
Detailed Description
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments that are illustrated in the appended drawings.
The disclosed method accelerates the SM2 cryptographic algorithm by accelerating modular multiplication over a finite field. As shown in FIG. 1, it comprises the following parts:
(I) Data division part
1) Divide the multiplicand A and the multiplier B, each 256 bits long, into 5 segments of 52 bits each, and store the data in GPU shared memory;
2) Define the following notation: $a_0, \dots, a_4$ denote segments 0 to 4 of A, and $b_0, \dots, b_4$ denote segments 0 to 4 of B. The last segment of A and of B has only 48 bits and is padded with high-order zeros to 52 bits. Notation is also defined for words of length 51, 48 and 44 bits represented as doubles.
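The data division above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names `split_words`, `to_doubles` and `join_words` are hypothetical, and the sketch runs on the CPU, whereas the embodiment keeps the words in GPU shared memory. It shows that every 52-bit word is represented exactly by a double, because a double holds 53 significant bits.

```python
W, M = 52, 5  # 52-bit words, 5 words per 256-bit operand

def split_words(x, w=W, m=M):
    """Split x into m little-endian words of w bits; the top word is zero-padded."""
    if not 0 <= x < (1 << (w * m)):
        raise ValueError("operand does not fit in m words of w bits")
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(m)]

def to_doubles(words):
    """Store each word as a double; exact because every word is below 2^53."""
    return [float(v) for v in words]

def join_words(words, w=W):
    """Recombine little-endian words (ints or integer-valued doubles)."""
    return sum(int(v) << (w * i) for i, v in enumerate(words))

a = (1 << 256) - 1          # largest 256-bit value
words = split_words(a)
doubles = to_doubles(words)
```

For a 256-bit input the top word uses only 48 bits (4 × 52 + 48 = 256), matching the padding rule described above.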
(II) Multiplication operation part
The multi-precision multiplication $C = A \times B$ of the multiplicand A and the multiplier B can be converted into 25 single-word multiplications $a_i \times b_j$ ($0 \le i, j \le 4$), where each sub-product $a_i \times b_j$ is at most 104 bits long. The modified fused multiply-add (FMA) instruction yields the high-order part of each sub-product, and a second FMA instruction yields the low-order 52 bits; because the length of the low-order sub-product is fixed, no additional alignment is needed during accumulation. The product of A and B can therefore be written $C = \sum_{k=0}^{9} c_k \cdot 2^{52k}$, where $c_k$ is the sum of the high-order and low-order sub-products in the k-th accumulator column. The corresponding accumulation model is shown in FIG. 2.
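The column accumulation described above can be checked with a small Python model. It is a sketch under assumptions: the high and low halves are obtained with integer shifts and masks instead of FMA instructions, and the name `column_accumulate` is hypothetical. The low half of $a_i \times b_j$ goes to column $i + j$ and the high half to column $i + j + 1$, giving 10 columns for 5-word operands.

```python
W, M = 52, 5
MASK = (1 << W) - 1

def split_words(x, w=W, m=M):
    """Little-endian split of x into m words of w bits each."""
    return [(x >> (w * i)) & MASK for i in range(m)]

def column_accumulate(a_words, b_words):
    """Accumulate the 25 sub-products into 2*M column sums.
    Each 104-bit sub-product is split into a low and a high 52-bit half."""
    cols = [0] * (2 * M)
    for i, ai in enumerate(a_words):
        for j, bj in enumerate(b_words):
            p = ai * bj
            cols[i + j] += p & MASK       # low-order sub-product
            cols[i + j + 1] += p >> W     # high-order sub-product
    return cols

a = (1 << 256) - 1
b = (1 << 256) - 987654321
cols = column_accumulate(split_words(a), split_words(b))
product = sum(c << (W * k) for k, c in enumerate(cols))
```

Each column receives at most ten 52-bit terms, so a column sum stays below $2^{56}$ and fits in a 64-bit accumulator with room left for carries.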
(III) Masking operation part
In the accumulation model obtained in part (II), the accumulator of the i-th column holds the sum of the high-order and low-order sub-products of that column. Carries may be generated during accumulation, and the sign bit and exponent bits of each sub-product would corrupt the carry handling and cause calculation errors. A masking operation is therefore applied to the sub-products: the double-type data are first reinterpreted directly as binary values in uint64_t format, a negative initial value is then given to the accumulator to cancel the sign and exponent bits of the floating-point numbers, and the values are finally summed in the integer domain, as shown in FIG. 3.
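The masking step can be illustrated with the following Python sketch, which compares two ways of removing the sign and exponent bits: masking every value individually, and starting the accumulator from a negative value that cancels all exponent fields at once. The sketch assumes that every double lies in $[2^{52}, 2^{53})$, so its exponent field is 1075; this layout and the function names are assumptions made for the example.

```python
import struct

MANT_MASK = (1 << 52) - 1
U64 = (1 << 64) - 1  # emulate uint64_t wrap-around

def double_bits(v):
    """Reinterpret a double as its 64-bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", v))[0]

def sum_with_mask(values):
    """Clear sign and exponent of each value, then add the mantissas."""
    return sum(double_bits(v) & MANT_MASK for v in values) & U64

def sum_with_negative_init(values, exp_field):
    """Start from minus (count * exponent pattern), then add the raw bit patterns."""
    acc = (-len(values) * (exp_field << 52)) & U64
    for v in values:
        acc = (acc + double_bits(v)) & U64
    return acc

lows = [123456789, MANT_MASK, 0, 42]
values = [float((1 << 52) + lo) for lo in lows]  # mantissa field equals lo
total_masked = sum_with_mask(values)
total_cancelled = sum_with_negative_init(values, 1075)
```

The second variant needs no per-value AND operation, which is the saving the negative initial value is meant to provide.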
(IV) Fast reduction part
The product of A and B obtained in steps 2 and 3 is 512 bits long. The SM2 fast reduction formula reduces the 10 accumulators to 5 accumulators. To avoid overflow during the calculation, the carries are further propagated using the & and shift operations, which gives the final SM2 fast reduction formula based on double-precision floating-point numbers.
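The 52-bit-limb reduction formula itself is not reproduced here. As a hedged illustration of the principle behind fast reduction for the SM2 prime, the Python sketch below works on whole integers and relies on the identity $2^{256} \equiv 2^{224} + 2^{96} - 2^{64} + 1 \pmod p$, which follows from $p = 2^{256} - 2^{224} - 2^{96} + 2^{64} - 1$. The folding loop and the name `reduce_mod_p` are illustrative assumptions and are not the accumulator-level formula of the method.

```python
P = (1 << 256) - (1 << 224) - (1 << 96) + (1 << 64) - 1  # SM2 prime
FOLD = (1 << 224) + (1 << 96) - (1 << 64) + 1            # 2^256 mod P
MASK256 = (1 << 256) - 1

def reduce_mod_p(x):
    """Reduce a non-negative integer modulo P by folding the bits above 2^256."""
    while x >> 256:
        # Replace hi * 2^256 by hi * FOLD, which is congruent modulo P.
        x = (x & MASK256) + (x >> 256) * FOLD
    while x >= P:
        x -= P
    return x

a, b = P - 1, P - 2
r = reduce_mod_p(a * b)
```

Each fold strictly decreases the value, so the loop terminates, and only shifts, masks, additions and multiplications by sparse constants are involved.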
Fast reduction yields 5 redundant values of type uint64_t. Carry propagation then adds the upper 12 bits of each of the four lower-order uint64_t values to the next uint64_t value, which gives a product result 256 bits long;
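A minimal Python sketch of this carry propagation follows. It assumes little-endian limbs with 52 payload bits each and processes them sequentially. The function names are hypothetical, and Python integers do not overflow, so a uint64_t implementation would additionally have to ensure that each sum stays below $2^{64}$.

```python
W = 52
MASK = (1 << W) - 1

def propagate_carries(limbs):
    """Move the bits above position 52 of each limb into the next higher limb."""
    out = list(limbs)
    for i in range(len(out) - 1):
        out[i + 1] += out[i] >> W  # the upper 12 bits of a 64-bit limb
        out[i] &= MASK             # keep the 52 payload bits
    return out

def limbs_value(limbs):
    """Integer represented by little-endian 52-bit limbs (possibly redundant)."""
    return sum(v << (W * i) for i, v in enumerate(limbs))

redundant = [(1 << 64) - 1, (1 << 60) + 7, 5, 1 << 52, 3]
normalized = propagate_carries(redundant)
```

After the pass the four lower limbs each fit in 52 bits and the represented integer is unchanged.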
Finally, the result of the previous step is reduced modulo p to obtain the modular multiplication result of A and B over the SM2 prime field.
The effect of the method of the invention was compared with the methods of prior-art documents 1 and 2. Document 1 is Pascal Giorgi, Thomas Izard, Arnaud Tisserand, et al., Comparison of modular arithmetic algorithms on GPUs, in ParCo'09: International Conference on Parallel Computing, 2009. Document 2 is Shi Pu and Jyh-Charn Liu, EAGL: An Elliptic Curve Arithmetic GPU-Based Library for Bilinear Pairing, in Pairing-Based Cryptography–Pairing 2013, pages 1–19, Springer, 2014. The specific results are shown in Table 1 below:
TABLE 1
As Table 1 shows, the modular multiplication algorithm provided by the invention greatly outperforms the existing implementations; even allowing for differences between platforms, the performance of the proposed method is 71.4% higher than that of the implementation in document 2.
The foregoing is merely a preferred embodiment of the present invention, and is not intended to limit the present invention, and all equivalent variations using the description and drawings of the present invention are within the scope of the present invention.

Claims (6)

1. A method for accelerating the SM2 cryptographic algorithm based on floating-point computing capability, the method comprising the following steps:
Step 1, data division: divide a multiplicand A and a multiplier B, each n bits long, over the SM2 prime field Fp into M single-word segments, denoted $A = (a_{M-1}, \dots, a_1, a_0)$ and $B = (b_{M-1}, \dots, b_1, b_0)$, where each segment is w bits long, and store the data in GPU shared memory;
Step 2, multiplication: the multi-precision multiplication of the multiplicand A and the multiplier B is written $C = A \times B$, and the product C is at most 2n bits long; compute the product of each pair of segments in turn with a modified fused multiply-add (FMA) instruction, split it into a high-order sub-product and a low-order sub-product, and store these in the 52-bit mantissas of two double-precision floating-point numbers;
Step 3, masking: mask the high-order and low-order sub-products obtained in step 2 so that the sign bit and exponent bits of each double-precision floating-point number are all 0, reinterpret the numbers as binary values in uint64_t format, and then sum them in the integer domain;
Step 4, fast reduction: using a fast reduction formula, reduce the 10 accumulators that hold the column sums of the multiplication to 5 accumulators, which reduces the product to a result 266 bits long; then perform carry propagation to obtain the result of the modular multiplication over the SM2 prime field Fp.
2. The method for accelerating the SM2 cryptographic algorithm based on floating-point computing capability according to claim 1, wherein the multiplicand A and the multiplier B, each n bits long, are divided into M segments from the low-order end to the high-order end, each w bits long, and if the last segment is shorter than w bits, it is padded to w bits with high-order zeros.
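3. The method for accelerating the SM2 cryptographic algorithm based on floating-point computing capability according to claim 1, wherein in step 2, with each n-bit large integer expressed as M single-word segments, the multiplication is expressed as $A \times B = \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} a_i b_j \cdot 2^{w(i+j)}$, where $A = \sum_{i=0}^{M-1} a_i \cdot 2^{wi}$ and $B = \sum_{j=0}^{M-1} b_j \cdot 2^{wj}$.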
4. The method for accelerating the SM2 cryptographic algorithm based on floating-point computing capability according to claim 3, wherein a squaring operation computes some sub-products twice and is expressed as $A^2 = 2E + F$, where E is the sum of the repeated sub-products of different words, $E = \sum_{0 \le i < j \le M-1} a_i a_j \cdot 2^{w(i+j)}$, and F is the sum of the sub-products of identical words, $F = \sum_{i=0}^{M-1} a_i^2 \cdot 2^{2wi}$.
5. The method for accelerating the SM2 cryptographic algorithm based on floating-point computing capability according to claim 1, wherein the modified fused multiply-add instruction is defined in terms of the following quantities: the fused multiply-add (FMA) instruction itself; x and y, the two multiplicands of the FMA instruction, both of double-precision floating-point type; the high-order sub-product; and the low-order sub-product.
6. The method for accelerating the SM2 cryptographic algorithm based on floating-point computing capability according to claim 1, wherein in step 4 the fast reduction formula takes as input the values in the 10 accumulators before fast reduction, produces the values in the 5 accumulators after fast reduction, and yields the product result after fast reduction, which is 266 bits long.
CN202410318131.8A 2024-03-20 2024-03-20 Method for accelerating SM2 cryptographic algorithm based on floating point number computing capability Active CN117908835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410318131.8A CN117908835B (en) 2024-03-20 2024-03-20 Method for accelerating SM2 cryptographic algorithm based on floating point number computing capability

Publications (2)

Publication Number Publication Date
CN117908835A true CN117908835A (en) 2024-04-19
CN117908835B CN117908835B (en) 2024-05-17

Family

ID=90682322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410318131.8A Active CN117908835B (en) 2024-03-20 2024-03-20 Method for accelerating SM2 cryptographic algorithm based on floating point number computing capability

Country Status (1)

Country Link
CN (1) CN117908835B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942028A (en) * 2014-04-15 2014-07-23 中国科学院数据与通信保护研究教育中心 Large integer multiplication method and device applied to password technology
CN104461449A (en) * 2014-11-14 2015-03-25 中国科学院数据与通信保护研究教育中心 Large integer multiplication realizing method and device based on vector instructions
CN105930128A (en) * 2016-05-17 2016-09-07 中国科学院数据与通信保护研究教育中心 Method for realizing computation speedup of large integer multiplication by utilizing floating point computing instruction
WO2022170809A1 (en) * 2021-02-09 2022-08-18 南方科技大学 Reconfigurable floating point multiply-accumulate operation unit and method suitable for multi-precision calculation
CN115348002A (en) * 2021-05-12 2022-11-15 中国科学院声学研究所 Montgomery modular multiplication fast calculation method based on multi-word long multiplication instruction
CN113608718A (en) * 2021-07-12 2021-11-05 中国科学院信息工程研究所 Method for realizing acceleration of prime number domain large integer modular multiplication calculation
CN115664747A (en) * 2022-10-18 2023-01-31 京东科技信息技术有限公司 Encryption method and device
CN117155572A (en) * 2023-08-31 2023-12-01 南京邮电大学 Method for realizing large integer multiplication in cryptographic technology based on GPU parallelism

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JIANKUO DONG et al.: "Towards High-performance X25519/448 Key Agreement in General Purpose GPUs", 2018 IEEE CONFERENCE ON COMMUNICATIONS AND NETWORK SECURITY (CNS), 31 December 2018 (2018-12-31) *
YU HU et al.: "A security JPEG image system accelerated by NEON technology based on FT-2000/4", CCF TRANSACTIONS ON HIGH PERFORMANCE COMPUTING, 22 January 2024 (2024-01-22) *
李凡 et al.: "Fast parallel implementation of SM2 point operations based on FPGA", Electronic Measurement Technology, no. 15, 8 August 2020 (2020-08-08) *
董建阔 et al.: "Research progress on high-performance cryptographic computing based on heterogeneous many-core GPUs", Journal of Software, 15 March 2024 (2024-03-15) *
蒋丽娟 et al.: "Multi-core parallelization of large-integer Comba and Karatsuba multiplication", Computer Systems & Applications, no. 11, 15 November 2016 (2016-11-15) *

Also Published As

Publication number Publication date
CN117908835B (en) 2024-05-17

Similar Documents

Publication Publication Date Title
EP4258182A2 (en) Accelerated mathematical engine
EP2856303B1 (en) Vector and scalar based modular exponentiation
US8862651B2 (en) Method and apparatus for modulus reduction
KR100756137B1 (en) Division and square root arithmetic unit
US11816448B2 (en) Compressing like-magnitude partial products in multiply accumulation
WO2022170809A1 (en) Reconfigurable floating point multiply-accumulate operation unit and method suitable for multi-precision calculation
CN110519058A (en) A kind of accelerated method for the public key encryption algorithm based on lattice
Zheng et al. Exploiting the floating-point computing power of GPUs for RSA
CN113608718B (en) Method for realizing prime number domain large integer modular multiplication calculation acceleration
WO2022170811A1 (en) Fixed-point multiply-add operation unit and method suitable for mixed-precision neural network
Dong et al. Utilizing the Double‐Precision Floating‐Point Computing Power of GPUs for RSA Acceleration
WO2021136259A1 (en) Floating-point number multiplication computation method and apparatus, and arithmetical logic unit
GB2511314A (en) Fast fused-multiply-add pipeline
CN117908835B (en) Method for accelerating SM2 cryptographic algorithm based on floating point number computing capability
CN117155572A (en) Method for realizing large integer multiplication in cryptographic technology based on GPU (graphics processing Unit) parallel
US20210064976A1 (en) Neural network circuitry having floating point format with asymmetric range
CN113672196B (en) Double multiplication calculating device and method based on single digital signal processing unit
Dalmia et al. Novel high speed vedic multiplier proposal incorporating adder based on quaternary signed digit number system
Prema et al. Enhanced high speed modular multiplier using Karatsuba algorithm
US6256656B1 (en) Apparatus and method for extending computational precision of a computer system having a modular arithmetic processing unit
Zadiraka et al. Parallel Methods of Representing Multidigit Numbers in Numeral Systems for Testing Multidigit Arithmetic Operations
US11157594B2 (en) Matrix multiplication in hardware using modular math
US20230110383A1 (en) Floating-point logarithmic number system scaling system for machine learning
WO2022124010A1 (en) Arithmetic and control device, arithmetic and control method, and recording medium
WO2023003756A2 (en) Multi-lane cryptographic engines with systolic architecture and operations thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant