CN108510429B - Multivariable cryptographic algorithm parallelization acceleration method based on GPU - Google Patents

Multivariable cryptographic algorithm parallelization acceleration method based on GPU

Info

Publication number
CN108510429B
Authority
CN
China
Prior art keywords
gpu
multivariable
variable
value
data
Prior art date
Legal status
Active
Application number
CN201810228547.5A
Other languages
Chinese (zh)
Other versions
CN108510429A (en)
Inventor
龚征
廖国鸿
黎伟杰
马昌社
刘志杰
罗裴然
黄家敏
Current Assignee
South China Normal University
Original Assignee
South China Normal University
Priority date
Filing date
Publication date
Application filed by South China Normal University filed Critical South China Normal University
Priority to CN201810228547.5A priority Critical patent/CN108510429B/en
Publication of CN108510429A publication Critical patent/CN108510429A/en
Application granted granted Critical
Publication of CN108510429B publication Critical patent/CN108510429B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/20: Processor architectures; Processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818: Decoding for concurrent execution
    • G06F9/3822: Parallel decoding, e.g. parallel decode units

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)
  • Storage Device Security (AREA)

Abstract

The invention discloses a multivariable cryptographic algorithm parallelization acceleration method based on a GPU, which comprises the following steps: S1, carrying out a homogenization operation on all terms of the multivariable equation; S2, generating a GF2 domain multiplication table; S3, mapping the term table and the multiplication table to the texture memory of the GPU; S4, calling the multivariable main kernel function for calculation and executing a Reduce operation on each piece of data; S5, writing a main function to schedule the multivariable main kernel function; and S6, executing the program, outputting the encryption and decryption result, and releasing the resources. The invention optimizes the cryptographic algorithm of the multivariable cryptographic system by homogenizing all the terms of the multivariable equations and combining the idea of Map-Reduce, and gives an implementation and performance comparison on the CUDA platform, taking the SpongeMPH hash function algorithm as an example. Experiments show that the scheme improves the operation efficiency of the algorithm and can be used for accelerating cryptographic algorithms based on the multivariate cryptographic system.

Description

Multivariable cryptographic algorithm parallelization acceleration method based on GPU
Technical Field
The invention relates to the technical field of cryptographic algorithms, in particular to a multivariable cryptographic algorithm parallelization acceleration method based on a GPU.
Background
The GPU (graphics processing unit) was originally designed for image processing. In recent years, owing to the power-consumption limits of CPUs and the rapid growth of computational demand, the computing power of the GPU has developed at a speed far exceeding Moore's law, and the GPU has come to be widely used in the field of scientific computing.
Multivariate cryptographic algorithms are cryptographic schemes constructed from multivariate polynomials over a finite field. Solving a system of multivariate polynomial equations over a finite field is an NP-hard problem, which makes multivariate cryptography one of the current design approaches for resisting quantum attacks. However, the large computational load of multivariate cryptographic algorithms results in low efficiency, which is the main factor limiting their practicality. Therefore, how to improve their execution efficiency is one of the directions studied by those skilled in the art.
Disclosure of Invention
The invention mainly aims to overcome the defects and shortcomings of the prior art, and provides a GPU-based parallelization acceleration method for a multivariate cryptographic algorithm, which realizes the parallelization of the multivariate cryptographic algorithm by combining the GPU with the idea of Map-Reduce, thereby improving the execution efficiency of the multivariate cryptographic algorithm.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention relates to a multivariable cryptographic algorithm parallelization acceleration method based on a GPU, which comprises the following steps:
S1, carrying out a homogenization operation on all terms of the multivariable equation;
S2, generating a generator table and a logarithm table over the finite field, and performing table look-up operations on these two tables to realize finite-field multiplication so as to improve the consistency of the GPU thread calculation process, wherein the generator table, denoted table, is the table of the powers of a generator g of the finite field F of order q at the first q-1 natural numbers 0, 1, 2, …, q-2, i.e. table[i] = g^i, with additionally table[q-1] = 1 and table[q] = g; the logarithm table, denoted arc_table, means that for any element a in the finite field there is arc_table[a] = i where table[i] = a, and arc_table[0] is set to a large negative number so that for 0·a = table[arc_table[0] + arc_table[a]] the index arc_table[0] + arc_table[a] is always negative, and table[negative index] is 0;
S3, mapping the term table, the coefficient table, the generator table and the logarithm table to the texture memory of the GPU, wherein the term table refers to the subscripts of the variables that make up each term of the multivariable equation: for example, when a certain term is a_1·x_1·x_3·x_4, where x_1, x_3, x_4 are the variables, then 1, 3 and 4 are stored at the corresponding positions of the term table; the coefficient table refers to the coefficient of each term of the multivariable equation and corresponds to the term table one to one;
S4, calling the multivariable main kernel function for calculation and executing the Reduce summation operation on each piece of data, wherein the parameters of the multivariable main kernel function comprise the address of the data to be processed, the address of the values of the current polynomial variables and the address for storing intermediate temporary data; the content of the multivariable main kernel function comprises obtaining the value of each variable in each basic scheduling unit of the GPU, calculating the value of each term after the operation, then carrying out the Reduce summation operation, obtaining the result of each polynomial and storing the result into the variable array of the current polynomial;
S5, writing a main function to schedule the multivariable main kernel function, wherein the main function comprises setting the block size, applying for GPU memory space and binding texture memory, continuously transmitting block data to the main kernel function, finally copying the calculation results back to the host-side memory, and releasing resources;
and S6, executing the program, outputting the encryption and decryption result, and releasing the resources.
As a preferred technical solution, in step S1, the homogenization operation specifically includes:
the redundant variable with the value of 1 is multiplied by the low-order terms to enable the redundant variable to be equal to the order of the polynomial, so that the terms and summation of each polynomial are calculated by the same operation in one-time kernel function calling, and the performance reduction of the GPU caused by a branch structure is avoided;
meanwhile, redundant terms with the value 0 are introduced so that the number of terms of each equation is a multiple of the Block size; Block is the term defined by CUDA, and the corresponding concept in OpenCL is the work-group.
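As a small illustrative example (not taken from the patent itself), consider a degree-3 polynomial and a redundant variable x_r fixed to 1; every lower-degree term is multiplied by x_r until all terms have degree 3:

x_1 x_2 x_3 + x_1 x_2 + x_3 + 5  →  x_1 x_2 x_3 + x_1 x_2 x_r + x_3 x_r x_r + 5·x_r x_r x_r

After this homogenization every term is a product of exactly one coefficient and three variables, so every GPU thread can execute the same instruction sequence.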
Preferably, in step S2, the step of generating a multiplication table in the finite field is as follows:
in the mod n domain, if g is a generator and the greatest common divisor of n and g is 1, the value of the inverse of g can be obtained by the extended Euclidean algorithm, and the multiplication table and the inverse table are obtained by enumerating g^0, g^1, …, g^(p-2).
Preferably, step S3 further includes the following steps:
preprocessing the data on the CPU side, i.e. using redundant terms 0·x_t·x_t·x_t to fill the term table and the coefficient table so that the number of terms of each equation is exactly the block number of the GPU × the number of threads each block possesses, which facilitates the Reduce summation operation in step S4; here the multivariable system is assumed to contain the variables x_0, x_1, …, x_{t-1}, and an additional custom variable x_t with value 1 is added; the arrays are then copied to the GPU-side memory using asynchronous operations and bound to the texture memory.
As a preferred technical solution, in step S4, the address of the data to be processed and the address where the values of the variables of the current polynomial are stored are unsigned character pointers (uint8_t);
the address where the intermediate temporary data is stored is an unsigned 32-bit integer pointer (uint32_t).
As a preferred technical solution, in step S4, the calculating and reducing summation operation performed by the multivariate main kernel function includes the following steps:
s41, copying the value of a variable array for each block of the GPU;
s42, processing the input data, wherein in SpongeMPH the sponge operation of the sponge structure is carried out;
s43, searching corresponding items and coefficients from the texture memory according to the current global thread id, and calculating the product value by using a multiplication table;
s44, calculating the value of each equation by using a Reduce summation mode, firstly, quickly calculating the sum of 32 threads by each Warp in a halving summation mode, and then simultaneously solving the value of each polynomial by using an atomic summation operation;
and S45, copying the result of the equation summation back to the variable array by using the memory management function.
As a preferred technical solution, in the step S5, the writing of the master function to schedule the multi-variable master kernel function specifically includes the following steps:
s51, setting the block number of the main kernel function and the number of threads in each block;
s52, applying for a corresponding memory space from the GPU, copying the data in the step S51 into a GPU video memory by asynchronous stream operation, and then binding the data into a texture memory;
s53, transferring the corresponding text data and the value of the variable array obtained by the last calculation to a main kernel function each time, and continuously updating and obtaining the value of the polynomial variable by continuously calling the kernel function;
and S54, copying the final hash value from the GPU video memory back to the host memory, and releasing resources by using cudaFree and free commands.
As a preferred technical solution, step S6 specifically includes the following contents:
and after all data processing results are finished, outputting the results to a specified position and releasing resources.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention adopts a GPU-based parallelization technical scheme, fully utilizes the parallel structure characteristic of the GPU by the homologation and filling operation of multivariable equations and combining a multiplication table of a texture memory, and quickly calculates the value of each equation after each iteration by halving summation and atomic operation, thereby overcoming the problem that the application scene of the multivariable cryptographic algorithm is influenced by the low calculation speed of the multivariable cryptographic algorithm on a CPU platform, achieving the acceleration effect of 15 times the calculation speed of the CPU and improving the practicability of the multivariable cryptographic algorithm. The invention comprehensively considers the parallel granularity, allocates the memory and fully utilizes the GPU to carry out polynomial fast summation optimization, thereby ensuring the performance of the multivariable password when the GPU is used for parallel implementation.
Drawings
FIG. 1 is a flow diagram of a SpongeMPH hash function;
FIG. 2 is a flow chart of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
In this embodiment, a multivariate hash function SpongeMPH is taken as an example, and a flow of the SpongeMPH hash function is shown in fig. 1:
a) A padding operation is first performed so that the input data length is an integral multiple of the packet length;
b) Data of one packet length is read in a loop, XORed with the first r × k bits of the current state, and the value of the current state is calculated and updated by the multivariable function MPE, until all data have been read, giving a state S_L;
c) MPE is called once more to update S_L, obtaining the final value S0, and the final result is then obtained from S0.
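The following host-side sketch mirrors steps a) to c). It is illustrative only: the padding rule, the rate parameters r and k, and the way the digest is squeezed from S0 are not fully specified in the text above, and mpe_update() is a hypothetical placeholder for one evaluation of the multivariable map MPE (performed on the GPU in the later sketches).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical placeholder: one evaluation of the multivariable map MPE over
// the 40-byte state (implemented with the GPU kernels sketched further below).
void mpe_update(uint8_t state[40]);

// Schematic SpongeMPH flow: pad, absorb each packet, then one final MPE call.
void spongemph(const uint8_t* msg, size_t len, size_t packetBytes, uint8_t state[40]) {
    assert(packetBytes <= 40);  // the packet must fit into the 320-bit state

    // a) pad the input to a whole number of packets (zero padding assumed here)
    size_t padded = ((len + packetBytes - 1) / packetBytes) * packetBytes;
    std::vector<uint8_t> buf(padded, 0);
    std::memcpy(buf.data(), msg, len);

    // b) absorb: XOR each packet into the leading bytes of the state, then apply MPE
    for (size_t off = 0; off < padded; off += packetBytes) {
        for (size_t i = 0; i < packetBytes; ++i) state[i] ^= buf[off + i];
        mpe_update(state);
    }

    // c) one more MPE call turns S_L into the final value S0; the digest is taken from S0
    mpe_update(state);
}
```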
Based on this scheme, a concrete implementation and performance comparison of SpongeMPH on the CUDA platform is given. Following the steps of the embodiment, slight modifications also allow fast implementation of other multivariate cryptographic algorithms.
As shown in fig. 2, the parallelization acceleration method for the GPU-based multivariate cryptographic algorithm of the present embodiment includes the following steps:
S1, performing a homogenization operation on all the terms of the multivariate equation. In order to fully utilize the many-core and computational characteristics of the GPU to accelerate the multivariable cryptographic algorithm, the invention introduces a homogenization step. Low-order terms are multiplied by a redundant variable with the value 1 so that they reach the order of the polynomial, and therefore the terms and the summation of each polynomial can be calculated by the same operations within a single kernel function call, avoiding the performance degradation of the GPU caused by branch structures. Meanwhile, redundant terms with the value 0 are introduced so that the number of terms of each equation is a multiple of the Block size (Block is the term defined by CUDA; the corresponding concept in OpenCL is the work-group).
In this embodiment, each block of data has a size of 320 bits and is stored in memory as 40 variables of 8 bits each. The state size of SpongeMPH is 40 × 8 bits, i.e. 40 equations in 40 variables, where each equation has 40 degree-1 terms, 842 degree-2 terms and 400 degree-3 terms. The 40 variables of SpongeMPH are labeled x_0, x_1, …, x_39, where each x_i is 8 bits; letting x_40 = 1, the t-th equation is:
F_t(x_0, x_1, …, x_39) = ∑_{0≤i≤j≤k≤n} α_{tijk} x_i x_j x_k + ∑_{0≤i≤j≤n} β_{tij} x_i x_j + ∑_{0≤i≤n} γ_{ti} x_i + η_t
The homogenization can then be expressed as:
F_t(x_0, x_1, …, x_39) = ∑_{0≤i≤j≤k≤n} α_{tijk} x_i x_j x_k + ∑_{0≤i≤j≤n} β_{tij} x_i x_j x_40 + ∑_{0≤i≤n} γ_{ti} x_i x_40 x_40 + η_t x_40 x_40 x_40
The coefficients of every term of all equations are stored in an array var; for each term x_i x_j x_k of an equation, the subscripts i, j, k are stored in three index arrays respectively, so that finally each equation can be expressed as a sum, over its stored terms, of the coefficient from var multiplied by the product of the three indexed variables (the corresponding formula image in the original publication is not reproduced here).
Meanwhile, var and the index arrays are padded with 0 and 1 respectively, so that the number of terms of each equation is a multiple of the number of threads per GPU block, which is set here to 128; the padded form of each equation is given as a formula image in the original publication and is not reproduced here.
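As a minimal CPU-side sketch of this padding (all names are illustrative, and the index arrays are padded here with the subscript of the redundant variable x_40, matching the 0·x_40·x_40·x_40 dummy terms described in step S3 below):

```cpp
#include <cstdint>
#include <vector>

// Flattened per-equation term storage: one coefficient and three variable
// subscripts per term, as described above.
struct EquationTerms {
    std::vector<uint8_t> var;               // coefficient of each term
    std::vector<uint8_t> idx1, idx2, idx3;  // the three variable subscripts of each term
};

// Pad an equation's term list with dummy terms 0 * x40 * x40 * x40 until its
// length is a multiple of the number of threads per block (128 here).
void pad_equation(EquationTerms& eq, size_t threadsPerBlock = 128) {
    while (eq.var.size() % threadsPerBlock != 0) {
        eq.var.push_back(0);    // zero coefficient: the dummy term contributes nothing
        eq.idx1.push_back(40);  // x40 is the redundant variable fixed to 1
        eq.idx2.push_back(40);
        eq.idx3.push_back(40);
    }
}
```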
and S2, generating a multiplication table on the finite field. Multiplication of the multivariable cryptographic algorithm in the operation process is based on operation on a finite field, so that the generated multiplication table is beneficial to the consistency in the GPU thread calculation process, unnecessary branch structures and repeated calculation are reduced, and the calculation speed is improved.
The SpongeMPH version used in this embodiment operates over the field GF(2^8), with the generator 3 used as the basis for constructing the multiplication table; table and arc_table store the corresponding data, where table[i] = 3^i and x = table[arc_table[x]]. For elements a and b, multiplication over GF(2^8) can then be expressed as:
1) if a, b ≠ 0, then a × b = table[(arc_table[a] + arc_table[b]) % 0xFF];
2) otherwise, a × b = table[negative index] = 0.
Let arc_table[0] be a sufficiently small negative number and let tmp = arc_table[a] + arc_table[b]; the result can then be obtained from (tmp > 0) × (tmp % 0xFF), thereby unifying cases 1) and 2).
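As a minimal sketch of constructing table and arc_table and of the table-lookup multiplication: the reduction polynomial 0x11B (the AES polynomial, for which 3 is indeed a generator) is an assumption, since the modulus is not stated in the text; the sketch also uses an explicit zero test instead of the negative-index trick above, to keep it portable.

```cuda
#include <cstdint>

static uint8_t gf_table[256];      // gf_table[i] = 3^i over GF(2^8)
static int16_t gf_arc_table[256];  // gf_arc_table[a] = i such that gf_table[i] = a

// Host-side construction of the exponent and logarithm tables for generator 3.
void build_gf256_tables() {
    uint16_t x = 1;
    for (int i = 0; i < 255; ++i) {
        gf_table[i] = (uint8_t)x;
        gf_arc_table[(uint8_t)x] = (int16_t)i;
        // multiply x by the generator 3: 3*x = (2*x) xor x, reduced modulo 0x11B (assumed modulus)
        uint16_t x2 = (uint16_t)(x << 1);
        if (x2 & 0x100) x2 ^= 0x11B;
        x = (uint16_t)(x2 ^ x);
    }
    gf_table[255] = 1;         // table[q-1] = 1, as in the description above
    gf_arc_table[0] = -512;    // sentinel "large negative number" for the element 0
}

// Table-lookup multiplication over GF(2^8).
__host__ __device__ inline uint8_t gf_mul(uint8_t a, uint8_t b,
                                          const uint8_t* table,
                                          const int16_t* arc_table) {
    if (a == 0 || b == 0) return 0;                  // explicit zero test
    int idx = (arc_table[a] + arc_table[b]) % 0xFF;  // add the logarithms modulo 255
    return table[idx];
}
```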
S3, mapping the term table, the coefficient table and the multiplication table to the texture memory of the GPU. The multivariable cryptographic algorithm needs to read the term table for every term during each operation, and the term table may be very large, so texture memory is used for storage in order to increase the reading speed. In addition, multiplication-table look-ups are needed during the calculation and may be unevenly distributed, so texture memory is also used for the multiplication table, which helps improve the look-up efficiency.
Furthermore, the data is preprocessed on the CPU side, i.e. redundant terms 0·x_40·x_40·x_40 are used to fill the term table and the coefficient table so that the number of terms of each equation is exactly the number of blocks of the GPU × the number of threads each block possesses, which facilitates the Reduce summation operation in step S4. The arrays are then copied to the GPU-side memory using asynchronous operations and bound to the texture memory.
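A host-side sketch of this step using the legacy texture-reference API available in CUDA 9, the toolkit used in this embodiment. The array names and sizes are illustrative; for simplicity the GF(2^8) tables are kept as plain device pointers in these sketches rather than bound to texture memory.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Texture references for the per-term data (1D, element read mode).
texture<unsigned char, cudaTextureType1D, cudaReadModeElementType> texIndex;  // 3 subscripts per term
texture<unsigned char, cudaTextureType1D, cudaReadModeElementType> texCoeff;  // 1 coefficient per term

// Copy the padded term and coefficient tables to the device asynchronously and
// bind them to the texture references above.
void upload_tables(const uint8_t* h_index, size_t indexBytes,
                   const uint8_t* h_coeff, size_t coeffBytes,
                   cudaStream_t stream) {
    uint8_t *d_index = nullptr, *d_coeff = nullptr;
    cudaMalloc(&d_index, indexBytes);
    cudaMalloc(&d_coeff, coeffBytes);

    // asynchronous host-to-device copies on the given stream
    cudaMemcpyAsync(d_index, h_index, indexBytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_coeff, h_coeff, coeffBytes, cudaMemcpyHostToDevice, stream);

    // bind the linear device buffers to the texture references
    size_t offset = 0;
    cudaBindTexture(&offset, texIndex, d_index, indexBytes);
    cudaBindTexture(&offset, texCoeff, d_coeff, coeffBytes);
}
```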
S4, calling the multivariable main kernel function for calculation on each block of data and executing the Reduce summation operation. The parameters of the multivariable main kernel function comprise the address of the data to be processed, the address of the values of the current polynomial variables and the address for storing intermediate temporary data. The addresses of the data to be processed and of the values of the current polynomial variables are unsigned character pointers (uint8_t), and the address of the intermediate temporary data is an unsigned 32-bit integer pointer (uint32_t). The content of the kernel function comprises obtaining the value of each variable in each basic scheduling unit of the GPU, calculating the value of each term after the operation, then carrying out the Reduce summation operation, obtaining the result of each polynomial and storing the result into the variable array of the current polynomial.
In this embodiment, the kernel function has three parameters: the address of the array holding the data to be processed (the input data), the address of the array holding the values of the current polynomial variables (the output is eventually written back to this array), and the address of the intermediate temporary data array. Each thread of the GPU corresponds to one term of one equation, and the function mainly consists of the following operations:
(a) each block of the GPU copies the values of the variable array; in this embodiment, the first 40 threads of each block each copy one variable value into shared memory;
(b) the input data is processed, which in SpongeMPH is the sponge operation of the sponge construction;
(c) the corresponding indices and coefficient are looked up from the texture memory according to the current global thread id, and the product value is calculated using the multiplication table;
(d) the value of each equation is calculated by Reduce summation: each Warp first quickly computes the sum of its 32 threads by halving summation, and then the value of each polynomial is obtained simultaneously using an atomic summation operation (a device-side sketch is given after this list);
(e) the summed equation results are copied back to the variable array using the first 40 threads of another GPU kernel, Map.
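A minimal device-side sketch of the Reduce step in (d). Addition over GF(2^8) is XOR, so XOR is used both for the in-warp halving summation (done here with warp shuffles; shared-memory halving would serve equally) and for the atomic merge; all names are illustrative.

```cuda
// Fold each warp's 32 per-thread term values in halves, then let lane 0 of the
// warp merge the partial result into the equation's accumulator atomically.
// The accumulators correspond to the uint32_t intermediate temporary data of S4.
__device__ void reduce_equation(unsigned int term_value, unsigned int* equation_result) {
    const unsigned int full_mask = 0xFFFFFFFFu;
    // halving summation inside the warp: offsets 16, 8, 4, 2, 1
    for (int offset = 16; offset > 0; offset >>= 1) {
        term_value ^= __shfl_down_sync(full_mask, term_value, offset);
    }
    // one atomic XOR per warp updates the polynomial's value
    if ((threadIdx.x & 31) == 0) {
        atomicXor(equation_result, term_value);
    }
}
```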
in this embodiment, the pseudo code of the main kernel function is shown in table 1; the pseudo code of the Reduce function is shown in table 2.
(Tables 1 and 2 appear as pseudocode images in the original publication and are not reproduced here.)
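Since the pseudocode images are not reproduced, the following kernel sketch only illustrates the structure of the main kernel of Table 1. It reuses gf_mul and reduce_equation from the sketches above, reads the per-term data through the texture references bound earlier, and passes the GF(2^8) tables as plain pointers; the absorb step (b) and the write-back step (e) are omitted. The grid layout (11 blocks of 128 threads per equation) follows this embodiment, but every name here is an assumption rather than the patent's own code.

```cuda
#include <cstdint>

__global__ void mpe_main_kernel(const uint8_t* input,            // data block being absorbed (step (b), omitted here)
                                uint8_t* variables,               // current values of x0..x39, updated between calls
                                unsigned int* equation_results,   // one intermediate accumulator per equation
                                const uint8_t* gf_table,
                                const int16_t* gf_arc_table) {
    (void)input;  // the XOR of the input packet into the state is not shown in this sketch

    __shared__ uint8_t s_var[41];  // x0..x39 plus the redundant variable x40 = 1

    // (a) the first 40 threads of the block copy the variable array into shared memory
    if (threadIdx.x < 40) s_var[threadIdx.x] = variables[threadIdx.x];
    if (threadIdx.x == 0) s_var[40] = 1;
    __syncthreads();

    // (c) one thread per term: fetch the three subscripts and the coefficient from
    //     texture memory and multiply over GF(2^8) via the lookup tables
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global term id
    unsigned int eq  = blockIdx.x / 11;                        // 11 blocks per equation
    uint8_t i = tex1Dfetch(texIndex, 3 * tid);
    uint8_t j = tex1Dfetch(texIndex, 3 * tid + 1);
    uint8_t k = tex1Dfetch(texIndex, 3 * tid + 2);
    uint8_t c = tex1Dfetch(texCoeff, tid);

    uint8_t term = gf_mul(gf_mul(gf_mul(c, s_var[i], gf_table, gf_arc_table),
                                 s_var[j], gf_table, gf_arc_table),
                          s_var[k], gf_table, gf_arc_table);

    // (d) warp-level halving summation followed by an atomic merge per equation
    reduce_equation(term, &equation_results[eq]);
}
```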
S5, writing a main function to schedule the multi-variable main kernel function. The main function comprises the steps of setting the size of a block, applying for GPU memory space, binding texture memory, continuously transmitting the block data to a main kernel function, finally copying a calculation result back to a host-side memory, releasing resources and the like.
Firstly, the number of blocks of the kernel function and the number of threads in each block are set: in this embodiment, since the number of valid terms of each equation is 1284, and taking the structure of the GPU into account, the number of blocks is 440 and the number of threads in a single block is 128; each equation is padded with invalid terms whose value is 0 so that the number of blocks used by each equation is exactly 11.
Then the corresponding memory space is applied for from the GPU, the data from the previous steps are copied into the GPU video memory by asynchronous stream operations, and then bound to the texture memory. After that, the corresponding text data and the values of the 40 variables obtained by the previous calculation (initialized to 0) are passed to the main kernel function each time, and the values of the polynomial variables are continuously updated by repeatedly calling the kernel function.
And in the last part, copying the final hash value from the GPU video memory to the host memory, and releasing resources by using commands such as cudaFree and free.
S6, executing the program, outputting the encryption and decryption result, and releasing the resources. Once the main program is written, it can be run directly on the target device: according to the settings and the scheduling strategy of the main program, data is copied to the target device in a loop, each thread runs the kernel program, and the running results are copied back from the target device to the host. After all the data have been processed, the results are output to the specified location and the resources are released.
The CUDA main program can be opened in the IDE, or compiled and run directly from the command line; according to the CUDA main program, the hash value is output to the screen or to a designated file.
In this embodiment, the pseudo code of the scheduling kernel function is shown in table 3.
(Table 3 appears as a pseudocode image in the original publication and is not reproduced here.)
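Since the Table 3 image is not reproduced, the following host-side sketch only shows the scheduling structure described above: 440 blocks of 128 threads, variables initialised to 0, asynchronous staging of each data block, repeated kernel calls that keep updating the variable array, and the final copy-back and release with cudaFree. It assumes the kernel and table-upload sketches given earlier; all names are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// d_gf_table / d_gf_arc are device copies of the GF(2^8) tables prepared elsewhere.
void run_spongemph(const uint8_t* h_data, size_t numPackets, size_t packetBytes,
                   const uint8_t* d_gf_table, const int16_t* d_gf_arc,
                   uint8_t hash_out[40]) {
    const int threadsPerBlock = 128;
    const int numBlocks = 440;  // 40 equations x 11 blocks per equation

    uint8_t *d_data = nullptr, *d_variables = nullptr;
    unsigned int* d_temp = nullptr;
    cudaMalloc(&d_data, packetBytes);
    cudaMalloc(&d_variables, 40);
    cudaMalloc(&d_temp, 40 * sizeof(unsigned int));
    cudaMemset(d_variables, 0, 40);  // the 40 variables are initialised to 0

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (size_t p = 0; p < numPackets; ++p) {
        // stage the next packet and clear the per-equation accumulators
        cudaMemcpyAsync(d_data, h_data + p * packetBytes, packetBytes,
                        cudaMemcpyHostToDevice, stream);
        cudaMemsetAsync(d_temp, 0, 40 * sizeof(unsigned int), stream);
        // repeated calls keep updating the variable state; in the patent the
        // write-back into d_variables is done by the separate Map kernel
        mpe_main_kernel<<<numBlocks, threadsPerBlock, 0, stream>>>(
            d_data, d_variables, d_temp, d_gf_table, d_gf_arc);
    }

    // copy the final hash value back to the host and release resources
    cudaMemcpy(hash_out, d_variables, 40, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_variables);
    cudaFree(d_temp);
    cudaStreamDestroy(stream);
}
```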
Results of the experiment
The example operating environment is: the CPU model is a Core i7-6700K, the memory is 16 GB, the operating system is Arch Linux (64-bit), the GPU model is an Nvidia GTX 1070, the video memory is 11 GB, the SDK version used is CUDA Toolkit 9.0, and the integrated development environment used is Nsight.
The performance comparison of GPU-SpongeMPH and CPU-SpongeMPH for this example at an input data size of 40MB is shown in Table 4 as follows:
(Table 4 appears as an image in the original publication and is not reproduced here.)
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (8)

1. A multivariable cryptographic algorithm parallelization acceleration method based on a GPU is characterized by comprising the following steps:
S1, carrying out a homogenization operation on all terms of the multivariable equation;
S2, generating a generator table and a logarithm table over the finite field, and performing table look-up operations on these two tables to realize finite-field multiplication so as to improve the consistency of the GPU thread calculation process, wherein the generator table, denoted table, is the table of the powers of a generator g of the finite field F of order q at the first q-1 natural numbers 0, 1, 2, …, q-2, i.e. table[i] = g^i, with additionally table[q-1] = 1 and table[q] = g; the logarithm table, denoted arc_table, means that for any element a in the finite field there is arc_table[a] = i where table[i] = a, and arc_table[0] is set to a large negative number so that for 0·a = table[arc_table[0] + arc_table[a]] the index arc_table[0] + arc_table[a] is always negative, and table[negative index] is 0;
S3, mapping the term table, the coefficient table, the generator table and the logarithm table to the texture memory of the GPU, wherein the term table refers to the subscripts of the variables that make up each term of the multivariable equation: for example, when a certain term is a_1·x_1·x_3·x_4, where x_1, x_3, x_4 are the variables, then 1, 3 and 4 are stored at the corresponding positions of the term table; the coefficient table refers to the coefficient of each term of the multivariable equation and corresponds to the term table one to one;
S4, calling the multivariable main kernel function for calculation and executing the Reduce summation operation on each piece of data, wherein the parameters of the multivariable main kernel function comprise the address of the data to be processed, the address of the values of the current polynomial variables and the address for storing intermediate temporary data; the content of the multivariable main kernel function comprises obtaining the value of each variable in each basic scheduling unit of the GPU, calculating the value of each term after the operation, then carrying out the Reduce summation operation, obtaining the result of each polynomial and storing the result into the variable array of the current polynomial;
S5, writing a main function to schedule the multivariable main kernel function, wherein the main function comprises setting the block size, applying for GPU memory space and binding texture memory, continuously transmitting block data to the main kernel function, finally copying the calculation results back to the host-side memory, and releasing resources;
and S6, executing the program, outputting the encryption and decryption result, and releasing the resources.
2. The method for parallelizing and accelerating multivariate cryptographic algorithms based on GPU of claim 1, wherein in step S1, the homogenization operation is specifically:
the low-order terms are multiplied by a redundant variable with the value 1 so that they reach the order of the polynomial, and therefore the terms and the summation of each polynomial are calculated by the same operations within a single kernel function call, avoiding the performance degradation of the GPU caused by branch structures;
meanwhile, redundant terms with the value 0 are introduced so that the number of terms of each equation is a multiple of the Block size; Block is the term defined by CUDA, and the corresponding concept in OpenCL is the work-group.
3. The method for parallelizing acceleration of multivariate cryptographic algorithms based on GPU of claim 1, wherein in step S2, the step of generating multiplication table on finite field is as follows:
in the mod n domain, if g is a generator and the greatest common divisor of n and g is 1, the value of the inverse of g can be obtained by the extended Euclidean algorithm, and the multiplication table and the inverse table are obtained by enumerating g^0, g^1, …, g^(p-2).
4. The method for parallelizing acceleration of multivariate cryptographic algorithms based on GPU of claim 1, wherein the step S3 further comprises the following steps:
preprocessing the data on the CPU side, i.e. using redundant terms 0·x_t·x_t·x_t to fill the term table and the coefficient table so that the number of terms of each equation is exactly the block number of the GPU × the number of threads each block possesses, which facilitates the Reduce summation operation in step S4; here the multivariable system is assumed to contain the variables x_0, x_1, …, x_{t-1}, and an additional custom variable x_t with value 1 is added; the arrays are then copied to the GPU-side memory using asynchronous operations and bound to the texture memory.
5. The method for parallelizing acceleration of multivariate cryptographic algorithms based on GPU of claim 1, wherein in step S4, the address of the data to be processed and the address where the values of the variables of the current polynomial are stored are unsigned character pointers (uint8_t);
the address where the intermediate temporary data is stored is an unsigned 32-bit integer pointer (uint32_t).
6. A GPU-based multivariate cryptographic algorithm parallelization acceleration method according to claim 1 or 5, wherein in step S4, the multivariate main kernel function performing the calculation and the Reduce summation operation comprises the following steps:
s41, copying the value of a variable array for each block of the GPU;
s42, processing the input data, wherein in SpongeMPH the sponge operation of the sponge structure is carried out;
s43, searching corresponding items and coefficients from the texture memory according to the current global thread id, and calculating the product value by using a multiplication table;
s44, calculating the value of each equation by using a Reduce summation mode, firstly, quickly calculating the sum of 32 threads by each Warp in a halving summation mode, and then simultaneously solving the value of each polynomial by using an atomic summation operation;
and S45, copying the result of the equation summation back to the variable array by using the memory management function.
7. The GPU-based multivariate cryptographic algorithm parallelization acceleration method according to claim 1, wherein in the step S5, the writing of the master function to schedule the multivariate master kernel function specifically comprises the following steps:
s51, setting the block number of the main kernel function and the number of threads in each block;
s52, applying for a corresponding memory space from the GPU, copying the data in the step S51 into a GPU video memory by asynchronous stream operation, and then binding the data into a texture memory;
s53, transferring the corresponding text data and the value of the variable array obtained by the last calculation to a main kernel function each time, and continuously updating and obtaining the value of the polynomial variable by continuously calling the kernel function;
and S54, copying the final hash value from the GPU video memory back to the host memory, and releasing resources by using cudaFree and free commands.
8. The method for parallelizing acceleration of multivariate cryptographic algorithms based on GPU of claim 1, wherein the step S6 specifically comprises the following steps:
and after all data processing results are finished, outputting the results to a specified position and releasing resources.
CN201810228547.5A 2018-03-20 2018-03-20 Multivariable cryptographic algorithm parallelization acceleration method based on GPU Active CN108510429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810228547.5A CN108510429B (en) 2018-03-20 2018-03-20 Multivariable cryptographic algorithm parallelization acceleration method based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810228547.5A CN108510429B (en) 2018-03-20 2018-03-20 Multivariable cryptographic algorithm parallelization acceleration method based on GPU

Publications (2)

Publication Number Publication Date
CN108510429A CN108510429A (en) 2018-09-07
CN108510429B true CN108510429B (en) 2021-11-02

Family

ID=63375986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810228547.5A Active CN108510429B (en) 2018-03-20 2018-03-20 Multivariable cryptographic algorithm parallelization acceleration method based on GPU

Country Status (1)

Country Link
CN (1) CN108510429B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918125B (en) * 2019-03-20 2022-06-03 浪潮商用机器有限公司 GPU configuration method and device based on OpenPOWER architecture
CN112131583B (en) * 2020-09-02 2023-12-15 上海科技大学 Model counting and constraint solving method based on GPU

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1870499B (en) * 2005-01-11 2012-01-04 丁津泰 Method for generating multiple variable commom key password system
CN101008937B (en) * 2007-02-06 2010-05-19 中国科学院研究生院 Method for promoting computing speed of multiplication of finite field and large matrix elimination
CN101916185B (en) * 2010-08-27 2013-07-03 上海交通大学 Automatic parallelization acceleration method of serial programs running under multi-core platform
CN101977109A (en) * 2010-10-21 2011-02-16 李晨 Linear mixed high ordered equation public key algorithm
CN102006170B (en) * 2010-11-11 2013-04-17 西安理工大学 Ring signature method for anonymizing information based on MQ problem in finite field
CN102214086A (en) * 2011-06-20 2011-10-12 复旦大学 General-purpose parallel acceleration algorithm based on multi-core processor
CN102811125B (en) * 2012-08-16 2015-01-28 西北工业大学 Certificateless multi-receiver signcryption method with multivariate-based cryptosystem
KR101694306B1 (en) * 2012-12-14 2017-01-09 한국전자통신연구원 Apparatus and method for predicting performance according to parallelization of hardware acceleration device
CN103490877A (en) * 2013-09-05 2014-01-01 北京航空航天大学 Parallelization method for ARIA symmetric block cipher algorithm based on CUDA
CN103745447B (en) * 2014-02-17 2016-05-25 东南大学 A kind of fast parallel implementation method of non-local mean filtering
CN103973431B (en) * 2014-04-16 2017-04-05 华南师范大学 A kind of AES parallelization implementation methods based on OpenCL
US9558094B2 (en) * 2014-05-12 2017-01-31 Palo Alto Research Center Incorporated System and method for selecting useful smart kernels for general-purpose GPU computing
CN104020983A (en) * 2014-06-16 2014-09-03 上海大学 KNN-GPU acceleration method based on OpenCL
CN105743644B (en) * 2016-01-26 2019-02-05 广东技术师范学院 A kind of mask encryption device of multivariate quadratic equation
CN105933111B (en) * 2016-05-27 2019-03-22 华南师范大学 A kind of Fast implementation of the Bitslicing-KLEIN based on OpenCL
CN107392429A (en) * 2017-06-22 2017-11-24 东南大学 Under the direction of energy that a kind of GPU accelerates method is pushed away before trigonometric equation group

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multivariable algebraic theory and its application in cryptography; Wang Houzhen et al.; Journal of Beijing University of Technology; 2010-05-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN108510429A (en) 2018-09-07

Similar Documents

Publication Publication Date Title
Bermudo Mera et al. Time-memory trade-off in Toom-Cook multiplication: an application to module-lattice based cryptography
Gupta et al. Pqc acceleration using gpus: Frodokem, newhope, and kyber
CN115622684B (en) Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption
CN108510429B (en) Multivariable cryptographic algorithm parallelization acceleration method based on GPU
Choi et al. Fast implementation of SHA-3 in GPU environment
Wan et al. TESLAC: accelerating lattice-based cryptography with AI accelerator
Seo SIKE on GPU: Accelerating supersingular isogeny-based key encapsulation mechanism on graphic processing units
TWI531966B (en) Computing apparatus, computing method, and non-transitory machine readable storage
CN106371803B (en) Calculation method and computing device for Montgomery domain
US9804826B2 (en) Parallelization of random number generators
WO2023141936A1 (en) Techniques and devices for efficient montgomery multiplication with reduced dependencies
WO2019178735A1 (en) Gpu-based parallel acceleration method for multi-variable password algorithm
Zheng Encrypted cloud using GPUs
Li et al. An area-efficient large integer NTT-multiplier using discrete twiddle factor approach
US20220100873A1 (en) Computation of xmss signature with limited runtime storage
US20230081763A1 (en) Conditional modular subtraction instruction
US11323268B2 (en) Digital signature verification engine for reconfigurable circuit devices
Cruz-Cortés et al. A GPU parallel implementation of the RSA private operation
Myllykoski et al. On solving separable block tridiagonal linear systems using a GPU implementation of radix-4 PSCR method
Chien et al. Parallel path tracking for homotopy continuation using GPU
WO2023141933A1 (en) Techniques, devices, and instruction set architecture for efficient modular division and inversion
Wang et al. SAM: A Scalable Accelerator for Number Theoretic Transform Using Multi-Dimensional Decomposition
Seshasayee et al. Hash Table Scalability on Intel PIUMA
Zhang et al. Tensor-Product-Based Accelerator for Area-efficient and Scalable Number Theoretic Transform
Iemma On the use of a SIMD vector extension for the fast evaluation of Boundary Element Method coefficients

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant