CN117155572A

CN117155572A - Method for implementing large integer multiplication in cryptography based on GPU (graphics processing unit) parallelism

Info

Publication number: CN117155572A
Application number: CN202311115333.4A
Authority: CN
Inventors: 董振江; 叶青波; 董建阔; 亓晋; 孙雁飞; 陈滏媛
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2023-08-31
Filing date: 2023-08-31
Publication date: 2023-12-01

Abstract

The invention belongs to the technical field of public key cryptography and discloses a method for implementing large integer multiplication in cryptography based on GPU (graphics processing unit) parallelism. One large integer multiplication is decomposed across a plurality of threads. The multiplicand and the multiplier are each split into a plurality of words; within each thread, the multiplications and the carries they generate are computed with multiply-add instructions that distinguish the high and low halves of a product; the carry generated by the lower-order thread is obtained with a warp shuffle instruction; and carries are propagated iteratively with a warp vote function and addition instructions until the high and low halves of the product of every multiplicand word with every multiplier word have been accumulated into the result. The beneficial effects of the invention are as follows: because one large integer multiplication is completed by several threads, the amount of computation per thread is reduced, GPU computing resources are used more effectively, large integer multiplication runs faster, and public key encryption algorithms become more practical.

Description

Method for implementing large integer multiplication in cryptography based on GPU (graphics processing unit) parallelism

Technical Field

The invention relates to the technical field of public key cryptography, and in particular to a method for implementing large integer multiplication in cryptography based on GPU (graphics processing unit) parallelism.

Background

With the advent of the cloud computing era and the growing demand for cloud services, the security and privacy of user data have become a focus of concern. Although a cloud platform stores users' data in encrypted form, the key is also known to the cloud service provider, so the security and privacy of the data cannot be guaranteed. Homomorphic encryption, which can process data while it remains encrypted, fits the computation model of privacy-preserving computing well; it has become an active topic of academic research and has received wide attention. The Paillier public key encryption algorithm, which is based on the composite residuosity class problem, is one of the most important such algorithms. In practical applications the Paillier key length must reach 2048 bits to meet the security requirements of most application scenarios. Multiplication with 2048-bit operands, however, is slow and computationally expensive, which limits the range of application of Paillier and similar algorithms. Research on large-number multiplication algorithms that improve encryption performance has therefore developed rapidly both domestically and internationally.

Traditional digit-by-digit multiplication multiplies each digit of one operand by each digit of the other and accumulates the partial products, so its complexity is O(N²), which is unacceptable when large-number multiplication is computed by a single CPU thread. Research on combining GPU parallel computing with large integer multiplication to speed up key operations therefore has significant research value.
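To make the quadratic cost concrete, the following is a minimal Python sketch of word-level schoolbook multiplication that also counts single-word multiplications. The function name `schoolbook_multiply`, the little-endian word lists and the 32-bit default word size are illustrative assumptions and do not come from the patent text.

```python
def schoolbook_multiply(a_words, b_words, w=32):
    """Multiply two little-endian word lists the schoolbook way.

    Returns (product_words, number_of_word_multiplications).
    Illustrative helper; names and layout are assumptions.
    """
    mask = (1 << w) - 1
    result = [0] * (len(a_words) + len(b_words))
    mults = 0
    for j, b in enumerate(b_words):
        carry = 0
        for i, a in enumerate(a_words):
            # One word multiplication per (i, j) pair: N * M in total.
            t = a * b + result[i + j] + carry
            mults += 1
            result[i + j] = t & mask
            carry = t >> w
        result[j + len(a_words)] = carry
    return result, mults
```

For two 64-word operands (2048 bits at w = 32) a single thread performs 64 × 64 = 4096 word multiplications, which is the single-thread figure quoted later in the description.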

GPUs were originally designed to assist the CPU with computer graphics tasks such as image rendering. As the hardware and the related software systems have matured, GPU applications are no longer limited to graphics processing, and GPU-based general-purpose parallel computing has attracted a great deal of attention.

The CUDA parallel computing architecture exploits the processing power of the GPU and, thanks to its large scale, high parallelism and ease of development, is increasingly used in high-performance cryptographic computing. On the CUDA platform, the multiplication of two w-bit integers can be completed with multiply-add instructions that distinguish the high and low halves, yielding the high w bits and the low w bits of the product separately. A w-bit carry produced by another thread in the same warp can be obtained with a shuffle instruction and added to the high or low w bits of the product. With these instructions, large integer multiplication can readily be computed in parallel by multiple threads. In existing GPU large integer multiplication research, however, a single thread completes one entire large integer multiplication, so the algorithms are complex to implement and offer low parallelism.

For example, CN201610325863 discloses a method for accelerating large integer multiplication with floating-point instructions, which implements large integer multiplication using the floating-point instructions of the CUDA platform. That method does not split the multiplication, however, so the amount of computation per thread remains large and the overall latency is high.

Disclosure of Invention

To solve the above technical problems, the invention provides a method for implementing large integer multiplication in cryptography based on GPU (graphics processing unit) parallelism. The large integer multiplication in a public key cryptographic computation is divided into several parts that are distributed among a plurality of threads, and the threads complete one large integer multiplication in parallel. This reduces the amount of computation per thread, improves the performance of multiplication in public key cryptography, reduces the complexity of large integer multiplication within each thread, and increases the computation speed.

The method for implementing large integer multiplication in cryptography based on GPU (graphics processing unit) parallelism according to the invention comprises the following steps:

Step 1, data division: the data requiring large integer multiplication in a public key encryption or decryption process executed on a GPU platform are divided. The input multiplicand is denoted A and the multiplier B. The multiplicand A is divided into words of w bits each, in order from low to high (or from high to low), giving N words in total; the multiplier B is divided into words of w bits each in the same order as A, giving M words in total;

Step 2, thread fetching: one large integer multiplication is computed in parallel by x threads, each thread fetching the multiplicand and multiplier words at its corresponding position;

Step 3, multiplication: on the words fetched in step 2, each thread uses the multiply-add instructions that distinguish high and low halves to compute the products of multiplicand and multiplier words in sequence; the result is saved in Result and the carries are saved in Carry1 and Carry2;

Step 4, carry operation: Carry1 and Carry2 generated by another thread are obtained with a shuffle instruction and placed in tmp; the sum of Result and tmp is computed with an addition instruction, the new result is written to Result, and the newly generated carry is stored in Carry. In a loop, the warp vote function of CUDA is used to determine whether any thread in the warp holds a carry; if so, all threads in the warp perform another accumulation simultaneously, and this repeats until no thread generates a carry;

Step 5, write-back: each thread places its computed result at the corresponding position of the final result, which is then output.

Further, when the multiplicand A and the multiplier B are divided, if the bit length of an operand is not divisible by w, the operand is padded with high-order zeros to an integer multiple of w.

Further, for the division result of step 1, the following symbols are defined: A[u] denotes the u-th word of A, B[u] denotes the u-th word of B, A[0:N-1] denotes the N words of the multiplicand A from the 0th to the (N-1)-th, and B[0:M-1] denotes the M words of the multiplier B from the 0th to the (M-1)-th.
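The division and padding rule can be illustrated with a small Python sketch. The helper names `split_words` and `join_words`, the low-to-high word order and the default w = 32 are assumptions chosen for illustration; the patent allows either word order.

```python
def split_words(value, bit_length, w=32):
    """Split a non-negative integer into little-endian w-bit words.

    bit_length is rounded up to a multiple of w; the extra high-order
    bits are zero, which corresponds to padding with high-order zeros.
    """
    if value < 0 or value >> bit_length:
        raise ValueError("value does not fit in bit_length bits")
    n_words = -(-bit_length // w)  # ceil(bit_length / w)
    mask = (1 << w) - 1
    # Word u holds bits [u*w, (u+1)*w); word 0 is the lowest-order word.
    return [(value >> (w * u)) & mask for u in range(n_words)]


def join_words(words, w=32):
    """Inverse of split_words: rebuild the integer from its words."""
    return sum(word << (w * u) for u, word in enumerate(words))
```

With these helpers, a 2048-bit operand yields 64 words at w = 32, and a 33-bit operand is padded to two words.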

Further, for the division result of step 1, A is N words and B is M words, where [formula missing in source]. Each thread takes k words from the multiplicand A, or 0, according to its position and multiplies them by the multiplier B, where [formula missing in source]. All threads of one large integer multiplication belong to one warp, and the size of a warp is 32, so the number of threads x is at most 32.

Furthermore, the low-level multiply-add instructions provided by CUDA PTX, which distinguish the high and low halves of a product, are used to improve the implementation of the multiplication. Define two multipliers a and b, an addend c and a result d, where a, b and c are each w bits long; the carry flag before the computation is CF_in and the carry flag after the computation is CF_out. The multiply-add instructions that distinguish high and low halves are expressed as:

(d, CF_out) = madc.lo.cc(a, b, c) = (a × b).lo + c + CF_in

(d, CF_out) = madc.hi.cc(a, b, c) = (a × b).hi + c + CF_in

where (a × b).lo denotes the low w bits of the product, (a × b).hi denotes the high w bits of the product, and CF_out holds the carry generated by the computation.
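The semantics of the two expressions can be emulated on the CPU with arbitrary-precision integers. The sketch below is only an illustration of the formulas above for w = 32; the Python function names `madc_lo_cc` and `madc_hi_cc` are hypothetical helpers, not GPU code.

```python
W = 32                 # word size in bits (assumed for illustration)
MASK = (1 << W) - 1


def madc_lo_cc(a, b, c, cf_in=0):
    """Return (d, CF_out) with d = (a*b).lo + c + CF_in, truncated to W bits."""
    total = ((a * b) & MASK) + c + cf_in
    return total & MASK, total >> W


def madc_hi_cc(a, b, c, cf_in=0):
    """Return (d, CF_out) with d = (a*b).hi + c + CF_in, truncated to W bits."""
    total = ((a * b) >> W) + c + cf_in
    return total & MASK, total >> W
```

Because each of the three summands is below 2^W and CF_in is 0 or 1, the sum is below 2^(W+1), so CF_out is always 0 or 1.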

Further, the x threads complete one large integer multiplication by parallel computation. The specific process is as follows:

The computation of the i-th thread is:

1) For j from 0 to M-1, one word B[j] of the multiplier B is taken as b each time; according to the thread number i and the loop index j, k words are taken from the multiplicand A, or 0, and placed in temp; the multiplication of b and temp is carried out with the multiply-add instructions that distinguish high and low halves. After M iterations the multiplication is complete; the result is placed in Result[0:8] and the carries are stored in Carry1 and Carry2;

2) After the parallel computation of the x threads is complete, Carry1 and Carry2 of the lower-order thread are obtained with a shuffle instruction and added to this thread's Result[0:8]; the sum is placed in Result[0:8] and the carry is stored in Carry;

3) If the Carry value of every thread is 0, the computation ends; otherwise step 2) is executed again.

The beneficial effects of the invention are as follows. The method uses the multiply-add instructions provided by the CUDA platform that distinguish high and low halves, which improves the implementation of multiplication in public key cryptography, avoids data type conversion of the multiplier, the multiplicand and the result, and reduces the complexity of the public key cryptographic algorithm. In addition, the large integer multiplication is split into blocks that are distributed to a plurality of threads, which compute their parts simultaneously; communication among the threads uses the warp shuffle instruction and the warp vote function of CUDA to obtain the results of other threads for merging and carry propagation. This increases the parallelism of the algorithm, reduces its running time, and makes full use of the computing resources of the platform.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a block diagram of a multi-threaded multiply operation.

Detailed Description

So that the invention may be more readily understood, it is described in more detail below with reference to specific embodiments illustrated in the accompanying drawings.

As shown in Fig. 1, the method for implementing large integer multiplication in cryptography based on GPU parallelism according to the invention includes the following steps:

Step 1, data division

1) The input multiplicand is denoted A and the multiplier B. The multiplicand A is divided into words of w bits each in order from low to high (or from high to low), giving N words; the multiplier B is divided into words of w bits each in the same order as A, giving M words; the data are stored in GPU shared memory;

2) The following symbols are defined: A[0:s] (s > 0) denotes the 0th to s-th words of A, and A[u] denotes the u-th word of A. The N-word multiplicand A is written A[0:N-1], where A[0] is the lowest-order word of A and A[N-1] is the highest-order word; the multiplier B is similarly written B[0:M-1];

Step 2, thread fetching

One large integer multiplication is computed in parallel by x threads. In any one thread, the multiplier word B[i] (0 ≤ i ≤ M-1) is fetched each time and placed in a register; then, according to the position of the thread and the value of i, the corresponding words of the multiplicand A, or 0, are placed in the register array temp. Each temp contains k words.

Specifically, as shown in Fig. 2, each thread is responsible for computing k consecutive columns: thread 0 computes columns 0 to (k-1), thread i computes columns i×k to (i×k+k-1), and so on. For each thread, a position that falls in the upper-left or lower-right corner of the table takes 0 into temp; otherwise the corresponding word of A is taken into temp.
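The fetch rule can be sketched in Python as follows. The index mapping used here, in which column c of row j needs word A[c - j], is an assumption inferred from the column layout described above (Fig. 2 itself is not reproduced in the text), and the helper name `fetch_temp` is hypothetical.

```python
def fetch_temp(A, i, j, k):
    """Words that thread i multiplies by B[j] (illustrative assumption).

    Thread i owns product columns i*k .. i*k+k-1. In row j the low half
    of A[c - j] * B[j] lands in column c, so position t of temp holds
    A[i*k + t - j], or 0 when that index falls outside A (the empty
    corners of the multiplication table).
    """
    temp = []
    for t in range(k):
        idx = i * k + t - j
        temp.append(A[idx] if 0 <= idx < len(A) else 0)
    return temp
```

For a 4-word multiplicand and k = 2, thread 0 in row 1 receives a leading zero because column 0 has no contribution from B[1], and the highest thread in row 0 receives only zeros.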

Step 3, multiplication

As shown in Fig. 2, each multiplication in Fig. 2 is computed in sequence with the fused multiply-add instructions that distinguish high and low halves. The instructions are executed until the final additional carry is generated, and that carry is stored in Carry1 and Carry2. Taking the multiplication in thread 0 as an example, the steps are as follows:

for i = 0 to M-1 do
    // low halves: carry chain across columns 0 .. k-1
    S_0 = mad.lo.cc(temp_0, B_i, S_0)
    for j = 1 to k-1 do
        S_j = madc.lo.cc(temp_j, B_i, S_j)
    end for
    Carry1 = addc.cc(Carry1, 0)
    Carry2 = addc.cc(Carry2, 0)
    // high halves: the high half of temp_(j-1) × B_i belongs to column j
    S_1 = mad.hi.cc(temp_0, B_i, S_1)
    for j = 2 to k-1 do
        S_j = madc.hi.cc(temp_(j-1), B_i, S_j)
    end for
    Carry1 = madc.hi.cc(temp_(k-1), B_i, Carry1)
    Carry2 = addc(Carry2, 0)
end for
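The per-thread loop can be checked with a CPU emulation. The Python sketch below mirrors the instruction sequence for one thread at w = 32 and is only an illustration: the helper names, the fetch rule temp[t] = A[i·k + t - j], and the requirements k ≥ 2 and x·k ≥ N + M are assumptions that are not stated explicitly in the source.

```python
W = 32
MASK = (1 << W) - 1


def mad(a, b, c, cf, part):
    """Emulate madc.lo.cc / madc.hi.cc: returns (d, CF_out)."""
    p = a * b
    p = (p & MASK) if part == "lo" else (p >> W)
    total = p + c + cf
    return total & MASK, total >> W


def thread_multiply(A, B, i, k):
    """Partial result of thread i: k words S plus carries Carry1, Carry2."""
    S = [0] * k
    carry1 = carry2 = 0
    for j, b in enumerate(B):
        # Assumed fetch rule: A[i*k + t - j], or 0 outside the operand.
        temp = []
        for t in range(k):
            idx = i * k + t - j
            temp.append(A[idx] if 0 <= idx < len(A) else 0)
        cf = 0                                   # first mad has no carry-in
        for t in range(k):                       # low halves
            S[t], cf = mad(temp[t], b, S[t], cf, "lo")
        carry1, cf = mad(0, 0, carry1, cf, "lo")  # addc.cc(Carry1, 0)
        carry2, cf = mad(0, 0, carry2, cf, "lo")  # addc.cc(Carry2, 0)
        cf = 0
        for t in range(1, k):                    # high halves, shifted by one column
            S[t], cf = mad(temp[t - 1], b, S[t], cf, "hi")
        carry1, cf = mad(temp[k - 1], b, carry1, cf, "hi")
        carry2, _ = mad(0, 0, carry2, cf, "lo")   # addc(Carry2, 0)
    return S, carry1, carry2
```

Summing each thread's value, S plus Carry1 and Carry2 placed at columns k and k+1 and the whole shifted by i·k words, reproduces the full product; this is the property that the carry operation of the next step relies on.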

Step 4, carry operation

Carry1 and Carry2 generated by the thread that computed the lower-order columns are obtained with a shuffle instruction, and the carry operation is completed with an addition instruction; the newly generated carry is stored in Carry. In a loop, the warp vote function of CUDA (Compute Unified Device Architecture) is used to determine whether any thread in the warp holds a carry; if so, all threads in the warp accumulate again simultaneously, and this repeats until no thread generates a carry.
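The carry loop can be emulated on the CPU with plain lists standing in for the lanes of a warp. In the Python sketch below, reading the neighbouring list entry plays the role of the shuffle instruction and `any(...)` plays the role of the warp vote; on a GPU these would correspond to warp intrinsics such as `__shfl_up_sync` and `__any_sync`, although the source does not name the exact intrinsics. The function names and the requirement k ≥ 2 are assumptions.

```python
W = 32
MASK = (1 << W) - 1


def add_into(result, addend_words):
    """Add little-endian addend_words into result in place; return carry-out."""
    cf = 0
    for t in range(len(result)):
        a = addend_words[t] if t < len(addend_words) else 0
        total = result[t] + a + cf
        result[t], cf = total & MASK, total >> W
    return cf


def resolve_carries(results, carry1, carry2):
    """Propagate carries between lanes until no lane holds a carry.

    results[i] is the k-word result of lane i (k >= 2 assumed);
    carry1[i], carry2[i] are the two carry words produced by lane i.
    """
    x = len(results)
    # "Shuffle up" by one lane: lane i reads from lane i-1, lane 0 reads 0.
    incoming = [[0, 0]] + [[carry1[i], carry2[i]] for i in range(x - 1)]
    carry = [add_into(results[i], incoming[i]) for i in range(x)]
    while any(carry):                      # "vote": does any lane hold a carry?
        incoming = [0] + carry[:-1]
        carry = [add_into(results[i], [incoming[i]]) for i in range(x)]
    return results
```

In the common case the loop body never runs; it only iterates when a carry ripples through a lane whose words are all ones, which is why a vote across the warp is enough to decide when to stop.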

Step 5: the operations of steps 102-104 are repeated until the high and low halves of the product of every multiplicand word with every multiplier word have been accumulated into the result.

In this embodiment, experiments were carried out on an NVIDIA RTX 4090 platform. The results show that when the multiplicand A and the multiplier B are each 2048 bits long and the word length used in data division is 32 bits, the throughput of large integer multiplication reaches 179 million multiplications per second.

Compared with the prior art, the method changes the number of multiplications executed by a single thread from N to M [formula missing in source]. Specifically, in public key cryptography, when the multiplicand has m = 64 words, the multiplier has n = 64 words, and the number of threads is k = 8, the number of multiplications performed by a single thread is reduced from 4096 to 1016; when the multiplicand has m = 128 words, the multiplier has n = 128 words, and the number of threads is k = 16, the number of multiplications performed by a single thread is reduced from 16384 to 2040. The invention therefore effectively reduces the number of multiplications a single thread performs during a large integer multiplication, which lowers the complexity of the computation within each thread and increases the overall computation speed of the public key cipher.

The foregoing is merely a preferred embodiment of the present invention, and is not intended to limit the present invention, and all equivalent variations using the description and drawings of the present invention are within the scope of the present invention.

Claims

1. A method for implementing large integer multiplication in cryptography based on GPU (graphics processing unit) parallelism, characterized by comprising the following steps:

step 1, data division: the data requiring large integer multiplication in a public key encryption or decryption process executed on a GPU platform are divided; the input multiplicand is denoted A and the multiplier B; the multiplicand A is divided into words of w bits each in order from low to high or from high to low, giving N words in total; the multiplier B is divided into words of w bits each in the same order as A, giving M words in total;

step 2, thread fetching: one large integer multiplication is computed in parallel by x threads, each thread taking k words from the multiplicand A, or 0, according to its position and multiplying them by the multiplier B;

2. The method for implementing large integer multiplication in cryptography based on GPU parallelism according to claim 1, wherein when the multiplicand A and the multiplier B are divided, if the bit length of an operand is not divisible by w, the operand is padded with high-order zeros to an integer multiple of w.

3. The method for implementing large integer multiplication in cryptography based on GPU parallelism according to claim 1 or 2, wherein for the division result of step 1, the following symbols are defined: A[u] denotes the u-th word of A, B[u] denotes the u-th word of B, A[0:N-1] denotes the N words of the multiplicand A from the 0th to the (N-1)-th, and B[0:M-1] denotes the M words of the multiplier B from the 0th to the (M-1)-th.

4. The method for implementing large integer multiplication in cryptography based on GPU parallelism according to claim 1 or 2, wherein for the division result of step 1, A is N words and B is M words, where [formula missing in source]; each thread takes k words from the multiplicand A, or 0, according to its position and multiplies them by the multiplier B, where [formula missing in source]; all threads of one large integer multiplication belong to one warp, and the size of a warp is 32, so the number of threads x is at most 32.

5. The method for implementing large integer multiplication in cryptography based on GPU parallelism according to claim 2, wherein the multiply-add instructions provided by CUDA PTX that distinguish high and low halves are used to improve the implementation of the multiplication; two multipliers a and b, an addend c and a result d are defined, where a, b and c are each w bits long; the carry flag before the computation is CF_in and the carry flag after the computation is CF_out; and the multiply-add instructions that distinguish high and low halves are expressed as:

(d, CF_out) = madc.lo.cc(a, b, c) = (a × b).lo + c + CF_in

(d, CF_out) = madc.hi.cc(a, b, c) = (a × b).hi + c + CF_in

6. The method for implementing large integer multiplication in cryptography based on GPU parallelism according to claim 5, wherein the x threads complete one large integer multiplication by parallel computation, comprising the following steps:

the computation of the i-th thread is as follows:

2) after the parallel computation of the x threads is complete, Carry1 and Carry2 of the lower-order thread are obtained with a warp shuffle instruction and added to this thread's Result[0:8]; the sum is placed in Result[0:8] and the carry is stored in Carry;

7. The method for implementing large integer multiplication in cryptography based on GPU parallelism according to claim 1, wherein the method is applied to a public key cryptographic algorithm whose computation requires large integer multiplication.