CN108510429B - Multivariable cryptographic algorithm parallelization acceleration method based on GPU - Google Patents

Multivariable cryptographic algorithm parallelization acceleration method based on GPU

Info

Publication number
CN108510429B
Authority
CN
China
Prior art keywords
gpu
multivariable
variable
value
data
Prior art date
Legal status
Active
Application number
CN201810228547.5A
Other languages
Chinese (zh)
Other versions
CN108510429A (en)
Inventor
龚征
廖国鸿
黎伟杰
马昌社
刘志杰
罗裴然
黄家敏
Current Assignee
South China Normal University
Original Assignee
South China Normal University
Priority date
Filing date
Publication date
Application filed by South China Normal University filed Critical South China Normal University
Priority to CN201810228547.5A priority Critical patent/CN108510429B/en
Publication of CN108510429A publication Critical patent/CN108510429A/en
Application granted granted Critical
Publication of CN108510429B publication Critical patent/CN108510429B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/20: Processor architectures; Processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818: Decoding for concurrent execution
    • G06F9/3822: Parallel decoding, e.g. parallel decode units

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)
  • Storage Device Security (AREA)

Abstract

The invention discloses a multivariable cryptographic algorithm parallelization acceleration method based on a GPU, which comprises the following steps: S1, carrying out a homogenization operation on all terms of the multivariable equation; S2, generating a GF2 domain multiplication table; S3, mapping the term table and the multiplication table to the texture memory of the GPU; S4, calling the multivariable main kernel function for calculation and executing a Reduce operation on each piece of data; S5, writing a main function to schedule the multivariable main kernel function; and S6, executing the program, outputting the encryption and decryption result, and releasing the resources. The invention optimizes the cryptographic algorithm of the multivariable cryptographic system by homogenizing all the terms of the multivariable equations and combining the idea of Map-Reduce, and gives an implementation and performance comparison on the CUDA platform, taking the SpongeMPH hash function algorithm as an example. Experiments show that the scheme improves the operation efficiency of the algorithm and can be used for accelerating cryptographic algorithms based on the multivariate cryptographic system.

Description

Multivariable cryptographic algorithm parallelization acceleration method based on GPU
Technical Field
The invention relates to the technical field of cryptographic algorithms, in particular to a multivariable cryptographic algorithm parallelization acceleration method based on a GPU.
Background
The GPU (graphics processing unit) was originally designed for image processing. In recent years, owing to the power-consumption limits of CPUs and the rapid growth of computational demand, the computing power of the GPU has developed at a speed far exceeding Moore's law, and the GPU has come to be widely used in the field of scientific computing.
Multivariate cryptographic algorithms are cryptographic schemes constructed from multivariate polynomials over a finite field. Solving a system of multivariate polynomial equations over a finite field is an NP-hard problem, which makes multivariate cryptography one of the current design approaches for resisting quantum attacks. However, the large computational load of multivariate cryptographic algorithms results in low efficiency, which is the main factor limiting their practicality. Therefore, how to improve their execution efficiency is one of the directions studied by those skilled in the art.
Disclosure of Invention
The invention mainly aims to overcome the defects and shortcomings of the prior art, and provides a GPU-based parallelization acceleration method for a multivariate cryptographic algorithm, which realizes the parallelization of the multivariate cryptographic algorithm by combining the GPU with the idea of Map-Reduce, thereby improving the execution efficiency of the multivariate cryptographic algorithm.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention relates to a multivariable cryptographic algorithm parallelization acceleration method based on a GPU, which comprises the following steps:
S1, carrying out a homogenization operation on all terms of the multivariable equation;
S2, generating a generator table and a logarithm table over the finite field, and performing table look-up operations on these two tables to realize finite-field multiplication so as to improve the consistency of the GPU thread calculation process, wherein the generator table, denoted table, is the table of the powers of a generator g of the finite field F of order q at the first q-1 natural numbers 0, 1, 2, …, q-2, i.e. table[i] = g^i, with additionally table[q-1] = 1 and table[q] = g; the logarithm table, denoted arc_table, means that for any element a in the finite field there is arc_table[a] = i where table[i] = a, and arc_table[0] is set to a large negative number so that for 0·a = table[arc_table[0] + arc_table[a]] the index arc_table[0] + arc_table[a] is always negative, and table[negative index] is 0;
S3, mapping the term table, the coefficient table, the generator table and the logarithm table to the texture memory of the GPU, wherein the term table refers to the subscripts of the variables that make up each term of the multivariable equation: for example, when a certain term is a_1·x_1·x_3·x_4, where x_1, x_3, x_4 are the variables, then 1, 3 and 4 are stored at the corresponding positions of the term table; the coefficient table refers to the coefficient of each term of the multivariable equation and corresponds to the term table one to one;
S4, calling the multivariable main kernel function for calculation and executing the Reduce summation operation on each piece of data, wherein the parameters of the multivariable main kernel function comprise the address of the data to be processed, the address of the values of the current polynomial variables and the address for storing intermediate temporary data; the content of the multivariable main kernel function comprises obtaining the value of each variable in each basic scheduling unit of the GPU, calculating the value of each term after the operation, then carrying out the Reduce summation operation, obtaining the result of each polynomial and storing the result into the variable array of the current polynomial;
S5, writing a main function to schedule the multivariable main kernel function, wherein the main function comprises setting the block size, applying for GPU memory space and binding texture memory, continuously transmitting block data to the main kernel function, finally copying the calculation results back to the host-side memory, and releasing resources;
and S6, executing the program, outputting the encryption and decryption result, and releasing the resources.
As a preferred technical solution, in step S1, the homogenization operation specifically includes:
the redundant variable with the value of 1 is multiplied by the low-order terms to enable the redundant variable to be equal to the order of the polynomial, so that the terms and summation of each polynomial are calculated by the same operation in one-time kernel function calling, and the performance reduction of the GPU caused by a branch structure is avoided;
meanwhile, redundant terms with the value 0 are introduced so that the number of terms of each equation is a multiple of the Block size; Block is the term defined by CUDA, and the corresponding concept in OpenCL is the work-group.
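As a small illustrative example (not taken from the patent itself), consider a degree-3 polynomial and a redundant variable x_r fixed to 1; every lower-degree term is multiplied by x_r until all terms have degree 3:

x_1 x_2 x_3 + x_1 x_2 + x_3 + 5  →  x_1 x_2 x_3 + x_1 x_2 x_r + x_3 x_r x_r + 5·x_r x_r x_r

After this homogenization every term is a product of exactly one coefficient and three variables, so every GPU thread can execute the same instruction sequence.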
Preferably, in step S2, the step of generating a multiplication table in the finite field is as follows:
in the mod n domain, if g is a generator and the greatest common divisor of n and g is 1, the value of the inverse of g can be obtained by the extended Euclidean algorithm, and the multiplication table and the inverse table are obtained by enumerating g^0, g^1, …, g^(p-2).
Preferably, step S3 further includes the following steps:
preprocessing the data on the CPU side, i.e. using redundant terms 0·x_t·x_t·x_t to fill the term table and the coefficient table so that the number of terms of each equation is exactly the block number of the GPU × the number of threads each block possesses, which facilitates the Reduce summation operation in step S4; here the multivariable system is assumed to contain the variables x_0, x_1, …, x_{t-1}, and an additional custom variable x_t with value 1 is added; the arrays are then copied to the GPU-side memory using asynchronous operations and bound to the texture memory.
As a preferred technical solution, in step S4, the address of the data to be processed and the address where the values of the variables of the current polynomial are stored are unsigned character pointers (uint8_t);
the address where the intermediate temporary data is stored is an unsigned 32-bit integer pointer (uint32_t).
As a preferred technical solution, in step S4, the calculating and reducing summation operation performed by the multivariate main kernel function includes the following steps:
s41, copying the value of a variable array for each block of the GPU;
s42, processing the input data, wherein in SpongeMPH the sponge operation of the sponge structure is carried out;
s43, searching corresponding items and coefficients from the texture memory according to the current global thread id, and calculating the product value by using a multiplication table;
s44, calculating the value of each equation by using a Reduce summation mode, firstly, quickly calculating the sum of 32 threads by each Warp in a halving summation mode, and then simultaneously solving the value of each polynomial by using an atomic summation operation;
and S45, copying the result of the equation summation back to the variable array by using the memory management function.
As a preferred technical solution, in the step S5, the writing of the master function to schedule the multi-variable master kernel function specifically includes the following steps:
s51, setting the block number of the main kernel function and the number of threads in each block;
s52, applying for a corresponding memory space from the GPU, copying the data in the step S51 into a GPU video memory by asynchronous stream operation, and then binding the data into a texture memory;
s53, transferring the corresponding text data and the value of the variable array obtained by the last calculation to a main kernel function each time, and continuously updating and obtaining the value of the polynomial variable by continuously calling the kernel function;
and S54, copying the final hash value from the GPU video memory back to the host memory, and releasing resources by using cudaFree and free commands.
As a preferred technical solution, step S6 specifically includes the following contents:
and after all data processing results are finished, outputting the results to a specified position and releasing resources.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention adopts a GPU-based parallelization technical scheme, fully utilizes the parallel structure characteristic of the GPU by the homologation and filling operation of multivariable equations and combining a multiplication table of a texture memory, and quickly calculates the value of each equation after each iteration by halving summation and atomic operation, thereby overcoming the problem that the application scene of the multivariable cryptographic algorithm is influenced by the low calculation speed of the multivariable cryptographic algorithm on a CPU platform, achieving the acceleration effect of 15 times the calculation speed of the CPU and improving the practicability of the multivariable cryptographic algorithm. The invention comprehensively considers the parallel granularity, allocates the memory and fully utilizes the GPU to carry out polynomial fast summation optimization, thereby ensuring the performance of the multivariable password when the GPU is used for parallel implementation.
Drawings
FIG. 1 is a flow diagram of a SpongeMPH hash function;
FIG. 2 is a flow chart of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
In this embodiment, a multivariate hash function SpongeMPH is taken as an example, and a flow of the SpongeMPH hash function is shown in fig. 1:
a) A padding operation is first performed so that the input data length is an integral multiple of the packet length;
b) Data of one packet length is read in a loop, XORed with the first r × k bits of the current state, and the value of the current state is calculated and updated by the multivariable function MPE, until all data have been read, giving a state S_L;
c) MPE is called once more to update S_L, obtaining the final value S0, and the final result is then obtained from S0.
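The following host-side sketch mirrors steps a) to c). It is illustrative only: the padding rule, the rate parameters r and k, and the way the digest is squeezed from S0 are not fully specified in the text above, and mpe_update() is a hypothetical placeholder for one evaluation of the multivariable map MPE (performed on the GPU in the later sketches).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical placeholder: one evaluation of the multivariable map MPE over
// the 40-byte state (implemented with the GPU kernels sketched further below).
void mpe_update(uint8_t state[40]);

// Schematic SpongeMPH flow: pad, absorb each packet, then one final MPE call.
void spongemph(const uint8_t* msg, size_t len, size_t packetBytes, uint8_t state[40]) {
    assert(packetBytes <= 40);  // the packet must fit into the 320-bit state

    // a) pad the input to a whole number of packets (zero padding assumed here)
    size_t padded = ((len + packetBytes - 1) / packetBytes) * packetBytes;
    std::vector<uint8_t> buf(padded, 0);
    std::memcpy(buf.data(), msg, len);

    // b) absorb: XOR each packet into the leading bytes of the state, then apply MPE
    for (size_t off = 0; off < padded; off += packetBytes) {
        for (size_t i = 0; i < packetBytes; ++i) state[i] ^= buf[off + i];
        mpe_update(state);
    }

    // c) one more MPE call turns S_L into the final value S0; the digest is taken from S0
    mpe_update(state);
}
```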
Based on this scheme, a concrete implementation and performance comparison of SpongeMPH on the CUDA platform is given. Following the steps of the embodiment, slight modifications also allow fast implementation of other multivariate cryptographic algorithms.
As shown in fig. 2, the parallelization acceleration method for the GPU-based multivariate cryptographic algorithm of the present embodiment includes the following steps:
S1, performing a homogenization operation on all the terms of the multivariate equation. In order to fully utilize the many-core and computational characteristics of the GPU to accelerate the multivariable cryptographic algorithm, the invention introduces a homogenization step. Low-order terms are multiplied by a redundant variable with the value 1 so that they reach the order of the polynomial, and therefore the terms and the summation of each polynomial can be calculated by the same operations within a single kernel function call, avoiding the performance degradation of the GPU caused by branch structures. Meanwhile, redundant terms with the value 0 are introduced so that the number of terms of each equation is a multiple of the Block size (Block is the term defined by CUDA; the corresponding concept in OpenCL is the work-group).
In this embodiment, each block of data has a size of 320 bits and is stored in memory as 40 variables of 8 bits each. The state size of SpongeMPH is 40 × 8 bits, i.e. 40 equations in 40 variables, where each equation has 40 degree-1 terms, 842 degree-2 terms and 400 degree-3 terms. The 40 variables of SpongeMPH are labeled x_0, x_1, …, x_39, where each x_i is 8 bits; letting x_40 = 1, the t-th equation is:
F_t(x_0, x_1, …, x_39) = ∑_{0≤i≤j≤k≤n} α_{tijk} x_i x_j x_k + ∑_{0≤i≤j≤n} β_{tij} x_i x_j + ∑_{0≤i≤n} γ_{ti} x_i + η_t
The homogenization can then be expressed as:
F_t(x_0, x_1, …, x_39) = ∑_{0≤i≤j≤k≤n} α_{tijk} x_i x_j x_k + ∑_{0≤i≤j≤n} β_{tij} x_i x_j x_40 + ∑_{0≤i≤n} γ_{ti} x_i x_40 x_40 + η_t x_40 x_40 x_40
The coefficients of every term of all equations are stored in an array var; for each term x_i x_j x_k of an equation, the subscripts i, j, k are stored in three index arrays respectively, so that finally each equation can be expressed as a sum, over its stored terms, of the coefficient from var multiplied by the product of the three indexed variables (the corresponding formula image in the original publication is not reproduced here).
Meanwhile, var and the index arrays are padded with 0 and 1 respectively, so that the number of terms of each equation is a multiple of the number of threads per GPU block, which is set here to 128; the padded form of each equation is given as a formula image in the original publication and is not reproduced here.
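As a minimal CPU-side sketch of this padding (all names are illustrative, and the index arrays are padded here with the subscript of the redundant variable x_40, matching the 0·x_40·x_40·x_40 dummy terms described in step S3 below):

```cpp
#include <cstdint>
#include <vector>

// Flattened per-equation term storage: one coefficient and three variable
// subscripts per term, as described above.
struct EquationTerms {
    std::vector<uint8_t> var;               // coefficient of each term
    std::vector<uint8_t> idx1, idx2, idx3;  // the three variable subscripts of each term
};

// Pad an equation's term list with dummy terms 0 * x40 * x40 * x40 until its
// length is a multiple of the number of threads per block (128 here).
void pad_equation(EquationTerms& eq, size_t threadsPerBlock = 128) {
    while (eq.var.size() % threadsPerBlock != 0) {
        eq.var.push_back(0);    // zero coefficient: the dummy term contributes nothing
        eq.idx1.push_back(40);  // x40 is the redundant variable fixed to 1
        eq.idx2.push_back(40);
        eq.idx3.push_back(40);
    }
}
```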
and S2, generating a multiplication table on the finite field. Multiplication of the multivariable cryptographic algorithm in the operation process is based on operation on a finite field, so that the generated multiplication table is beneficial to the consistency in the GPU thread calculation process, unnecessary branch structures and repeated calculation are reduced, and the calculation speed is improved.
The SpongeMPH version used in this embodiment operates over the field GF(2^8), with the generator 3 used as the basis for constructing the multiplication table; table and arc_table store the corresponding data, where table[i] = 3^i and x = table[arc_table[x]]. For elements a and b, multiplication over GF(2^8) can then be expressed as:
1) if a, b ≠ 0, then a × b = table[(arc_table[a] + arc_table[b]) % 0xFF];
2) otherwise, a × b = table[negative index] = 0.
Let arc_table[0] be a sufficiently small negative number and let tmp = arc_table[a] + arc_table[b]; the result can then be obtained from (tmp > 0) × (tmp % 0xFF), thereby unifying cases 1) and 2).
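As a minimal sketch of constructing table and arc_table and of the table-lookup multiplication: the reduction polynomial 0x11B (the AES polynomial, for which 3 is indeed a generator) is an assumption, since the modulus is not stated in the text; the sketch also uses an explicit zero test instead of the negative-index trick above, to keep it portable.

```cuda
#include <cstdint>

static uint8_t gf_table[256];      // gf_table[i] = 3^i over GF(2^8)
static int16_t gf_arc_table[256];  // gf_arc_table[a] = i such that gf_table[i] = a

// Host-side construction of the exponent and logarithm tables for generator 3.
void build_gf256_tables() {
    uint16_t x = 1;
    for (int i = 0; i < 255; ++i) {
        gf_table[i] = (uint8_t)x;
        gf_arc_table[(uint8_t)x] = (int16_t)i;
        // multiply x by the generator 3: 3*x = (2*x) xor x, reduced modulo 0x11B (assumed modulus)
        uint16_t x2 = (uint16_t)(x << 1);
        if (x2 & 0x100) x2 ^= 0x11B;
        x = (uint16_t)(x2 ^ x);
    }
    gf_table[255] = 1;         // table[q-1] = 1, as in the description above
    gf_arc_table[0] = -512;    // sentinel "large negative number" for the element 0
}

// Table-lookup multiplication over GF(2^8).
__host__ __device__ inline uint8_t gf_mul(uint8_t a, uint8_t b,
                                          const uint8_t* table,
                                          const int16_t* arc_table) {
    if (a == 0 || b == 0) return 0;                  // explicit zero test
    int idx = (arc_table[a] + arc_table[b]) % 0xFF;  // add the logarithms modulo 255
    return table[idx];
}
```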
S3, mapping the term table, the coefficient table and the multiplication table to the texture memory of the GPU. The multivariable cryptographic algorithm needs to read the term table for every term during each operation, and the term table may be very large, so texture memory is used for storage in order to increase the reading speed. In addition, multiplication-table look-ups are needed during the calculation and may be unevenly distributed, so texture memory is also used for the multiplication table, which helps improve the look-up efficiency.
Furthermore, the data is preprocessed on the CPU side, i.e. redundant terms 0·x_40·x_40·x_40 are used to fill the term table and the coefficient table so that the number of terms of each equation is exactly the number of blocks of the GPU × the number of threads each block possesses, which facilitates the Reduce summation operation in step S4. The arrays are then copied to the GPU-side memory using asynchronous operations and bound to the texture memory.
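A host-side sketch of this step using the legacy texture-reference API available in CUDA 9, the toolkit used in this embodiment. The array names and sizes are illustrative; for simplicity the GF(2^8) tables are kept as plain device pointers in these sketches rather than bound to texture memory.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Texture references for the per-term data (1D, element read mode).
texture<unsigned char, cudaTextureType1D, cudaReadModeElementType> texIndex;  // 3 subscripts per term
texture<unsigned char, cudaTextureType1D, cudaReadModeElementType> texCoeff;  // 1 coefficient per term

// Copy the padded term and coefficient tables to the device asynchronously and
// bind them to the texture references above.
void upload_tables(const uint8_t* h_index, size_t indexBytes,
                   const uint8_t* h_coeff, size_t coeffBytes,
                   cudaStream_t stream) {
    uint8_t *d_index = nullptr, *d_coeff = nullptr;
    cudaMalloc(&d_index, indexBytes);
    cudaMalloc(&d_coeff, coeffBytes);

    // asynchronous host-to-device copies on the given stream
    cudaMemcpyAsync(d_index, h_index, indexBytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_coeff, h_coeff, coeffBytes, cudaMemcpyHostToDevice, stream);

    // bind the linear device buffers to the texture references
    size_t offset = 0;
    cudaBindTexture(&offset, texIndex, d_index, indexBytes);
    cudaBindTexture(&offset, texCoeff, d_coeff, coeffBytes);
}
```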
S4, calling the multivariable main kernel function for calculation on each block of data and executing the Reduce summation operation. The parameters of the multivariable main kernel function comprise the address of the data to be processed, the address of the values of the current polynomial variables and the address for storing intermediate temporary data. The addresses of the data to be processed and of the values of the current polynomial variables are unsigned character pointers (uint8_t), and the address of the intermediate temporary data is an unsigned 32-bit integer pointer (uint32_t). The content of the kernel function comprises obtaining the value of each variable in each basic scheduling unit of the GPU, calculating the value of each term after the operation, then carrying out the Reduce summation operation, obtaining the result of each polynomial and storing the result into the variable array of the current polynomial.
In this embodiment, the kernel function has three parameters: the address of the array holding the data to be processed (the input data), the address of the array holding the values of the current polynomial variables (the output is eventually written back to this array), and the address of the intermediate temporary data array. Each thread of the GPU corresponds to one term of one equation, and the function mainly consists of the following operations:
(a) each block of the GPU copies the values of the variable array; in this embodiment, the first 40 threads of each block each copy one variable value into shared memory;
(b) the input data is processed, which in SpongeMPH is the sponge operation of the sponge construction;
(c) the corresponding indices and coefficient are looked up from the texture memory according to the current global thread id, and the product value is calculated using the multiplication table;
(d) the value of each equation is calculated by Reduce summation: each Warp first quickly computes the sum of its 32 threads by halving summation, and then the value of each polynomial is obtained simultaneously using an atomic summation operation (a device-side sketch is given after this list);
(e) the summed equation results are copied back to the variable array using the first 40 threads of another GPU kernel, Map.
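A minimal device-side sketch of the Reduce step in (d). Addition over GF(2^8) is XOR, so XOR is used both for the in-warp halving summation (done here with warp shuffles; shared-memory halving would serve equally) and for the atomic merge; all names are illustrative.

```cuda
// Fold each warp's 32 per-thread term values in halves, then let lane 0 of the
// warp merge the partial result into the equation's accumulator atomically.
// The accumulators correspond to the uint32_t intermediate temporary data of S4.
__device__ void reduce_equation(unsigned int term_value, unsigned int* equation_result) {
    const unsigned int full_mask = 0xFFFFFFFFu;
    // halving summation inside the warp: offsets 16, 8, 4, 2, 1
    for (int offset = 16; offset > 0; offset >>= 1) {
        term_value ^= __shfl_down_sync(full_mask, term_value, offset);
    }
    // one atomic XOR per warp updates the polynomial's value
    if ((threadIdx.x & 31) == 0) {
        atomicXor(equation_result, term_value);
    }
}
```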
in this embodiment, the pseudo code of the main kernel function is shown in table 1; the pseudo code of the Reduce function is shown in table 2.
(Tables 1 and 2 appear as pseudocode images in the original publication and are not reproduced here.)
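Since the pseudocode images are not reproduced, the following kernel sketch only illustrates the structure of the main kernel of Table 1. It reuses gf_mul and reduce_equation from the sketches above, reads the per-term data through the texture references bound earlier, and passes the GF(2^8) tables as plain pointers; the absorb step (b) and the write-back step (e) are omitted. The grid layout (11 blocks of 128 threads per equation) follows this embodiment, but every name here is an assumption rather than the patent's own code.

```cuda
#include <cstdint>

__global__ void mpe_main_kernel(const uint8_t* input,            // data block being absorbed (step (b), omitted here)
                                uint8_t* variables,               // current values of x0..x39, updated between calls
                                unsigned int* equation_results,   // one intermediate accumulator per equation
                                const uint8_t* gf_table,
                                const int16_t* gf_arc_table) {
    (void)input;  // the XOR of the input packet into the state is not shown in this sketch

    __shared__ uint8_t s_var[41];  // x0..x39 plus the redundant variable x40 = 1

    // (a) the first 40 threads of the block copy the variable array into shared memory
    if (threadIdx.x < 40) s_var[threadIdx.x] = variables[threadIdx.x];
    if (threadIdx.x == 0) s_var[40] = 1;
    __syncthreads();

    // (c) one thread per term: fetch the three subscripts and the coefficient from
    //     texture memory and multiply over GF(2^8) via the lookup tables
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global term id
    unsigned int eq  = blockIdx.x / 11;                        // 11 blocks per equation
    uint8_t i = tex1Dfetch(texIndex, 3 * tid);
    uint8_t j = tex1Dfetch(texIndex, 3 * tid + 1);
    uint8_t k = tex1Dfetch(texIndex, 3 * tid + 2);
    uint8_t c = tex1Dfetch(texCoeff, tid);

    uint8_t term = gf_mul(gf_mul(gf_mul(c, s_var[i], gf_table, gf_arc_table),
                                 s_var[j], gf_table, gf_arc_table),
                          s_var[k], gf_table, gf_arc_table);

    // (d) warp-level halving summation followed by an atomic merge per equation
    reduce_equation(term, &equation_results[eq]);
}
```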
S5, writing a main function to schedule the multi-variable main kernel function. The main function comprises the steps of setting the size of a block, applying for GPU memory space, binding texture memory, continuously transmitting the block data to a main kernel function, finally copying a calculation result back to a host-side memory, releasing resources and the like.
Firstly, the number of blocks of the kernel function and the number of threads in each block are set: in this embodiment, since the number of valid terms of each equation is 1284, and taking the structure of the GPU into account, the number of blocks is 440 and the number of threads in a single block is 128; each equation is padded with invalid terms whose value is 0 so that the number of blocks used by each equation is exactly 11.
Then the corresponding memory space is applied for from the GPU, the data from the previous steps are copied into the GPU video memory by asynchronous stream operations, and then bound to the texture memory. After that, the corresponding text data and the values of the 40 variables obtained by the previous calculation (initialized to 0) are passed to the main kernel function each time, and the values of the polynomial variables are continuously updated by repeatedly calling the kernel function.
And in the last part, copying the final hash value from the GPU video memory to the host memory, and releasing resources by using commands such as cudaFree and free.
S6, executing the program, outputting the encryption and decryption result, and releasing the resources. Once the main program is written, it can be run directly on the target device: according to the settings and the scheduling strategy of the main program, data is copied to the target device in a loop, each thread runs the kernel program, and the running results are copied back from the target device to the host. After all the data have been processed, the results are output to the specified location and the resources are released.
The CUDA main program can be opened in the IDE, or compiled and run directly from the command line; according to the CUDA main program, the hash value is output to the screen or to a designated file.
In this embodiment, the pseudo code of the scheduling kernel function is shown in table 3.
(Table 3 appears as a pseudocode image in the original publication and is not reproduced here.)
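Since the Table 3 image is not reproduced, the following host-side sketch only shows the scheduling structure described above: 440 blocks of 128 threads, variables initialised to 0, asynchronous staging of each data block, repeated kernel calls that keep updating the variable array, and the final copy-back and release with cudaFree. It assumes the kernel and table-upload sketches given earlier; all names are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// d_gf_table / d_gf_arc are device copies of the GF(2^8) tables prepared elsewhere.
void run_spongemph(const uint8_t* h_data, size_t numPackets, size_t packetBytes,
                   const uint8_t* d_gf_table, const int16_t* d_gf_arc,
                   uint8_t hash_out[40]) {
    const int threadsPerBlock = 128;
    const int numBlocks = 440;  // 40 equations x 11 blocks per equation

    uint8_t *d_data = nullptr, *d_variables = nullptr;
    unsigned int* d_temp = nullptr;
    cudaMalloc(&d_data, packetBytes);
    cudaMalloc(&d_variables, 40);
    cudaMalloc(&d_temp, 40 * sizeof(unsigned int));
    cudaMemset(d_variables, 0, 40);  // the 40 variables are initialised to 0

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (size_t p = 0; p < numPackets; ++p) {
        // stage the next packet and clear the per-equation accumulators
        cudaMemcpyAsync(d_data, h_data + p * packetBytes, packetBytes,
                        cudaMemcpyHostToDevice, stream);
        cudaMemsetAsync(d_temp, 0, 40 * sizeof(unsigned int), stream);
        // repeated calls keep updating the variable state; in the patent the
        // write-back into d_variables is done by the separate Map kernel
        mpe_main_kernel<<<numBlocks, threadsPerBlock, 0, stream>>>(
            d_data, d_variables, d_temp, d_gf_table, d_gf_arc);
    }

    // copy the final hash value back to the host and release resources
    cudaMemcpy(hash_out, d_variables, 40, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_variables);
    cudaFree(d_temp);
    cudaStreamDestroy(stream);
}
```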
Results of the experiment
The example operating environment is: the CPU model is a Core i7-6700K, the memory is 16 GB, the operating system is Arch Linux (64-bit), the GPU model is an Nvidia GTX 1070, the video memory is 11 GB, the SDK version used is CUDA Toolkit 9.0, and the integrated development environment used is Nsight.
The performance comparison of GPU-SpongeMPH and CPU-SpongeMPH for this example at an input data size of 40MB is shown in Table 4 as follows:
(Table 4 appears as an image in the original publication and is not reproduced here.)
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (8)

1. A multivariable cryptographic algorithm parallelization acceleration method based on a GPU is characterized by comprising the following steps:
S1, carrying out a homogenization operation on all terms of the multivariable equation;
S2, generating a generator table and a logarithm table over the finite field, and performing table look-up operations on these two tables to realize finite-field multiplication so as to improve the consistency of the GPU thread calculation process, wherein the generator table, denoted table, is the table of the powers of a generator g of the finite field F of order q at the first q-1 natural numbers 0, 1, 2, …, q-2, i.e. table[i] = g^i, with additionally table[q-1] = 1 and table[q] = g; the logarithm table, denoted arc_table, means that for any element a in the finite field there is arc_table[a] = i where table[i] = a, and arc_table[0] is set to a large negative number so that for 0·a = table[arc_table[0] + arc_table[a]] the index arc_table[0] + arc_table[a] is always negative, and table[negative index] is 0;
S3, mapping the term table, the coefficient table, the generator table and the logarithm table to the texture memory of the GPU, wherein the term table refers to the subscripts of the variables that make up each term of the multivariable equation: for example, when a certain term is a_1·x_1·x_3·x_4, where x_1, x_3, x_4 are the variables, then 1, 3 and 4 are stored at the corresponding positions of the term table; the coefficient table refers to the coefficient of each term of the multivariable equation and corresponds to the term table one to one;
S4, calling the multivariable main kernel function for calculation and executing the Reduce summation operation on each piece of data, wherein the parameters of the multivariable main kernel function comprise the address of the data to be processed, the address of the values of the current polynomial variables and the address for storing intermediate temporary data; the content of the multivariable main kernel function comprises obtaining the value of each variable in each basic scheduling unit of the GPU, calculating the value of each term after the operation, then carrying out the Reduce summation operation, obtaining the result of each polynomial and storing the result into the variable array of the current polynomial;
S5, writing a main function to schedule the multivariable main kernel function, wherein the main function comprises setting the block size, applying for GPU memory space and binding texture memory, continuously transmitting block data to the main kernel function, finally copying the calculation results back to the host-side memory, and releasing resources;
and S6, executing the program, outputting the encryption and decryption result, and releasing the resources.
2. The method for parallelizing and accelerating multivariate cryptographic algorithms based on GPU of claim 1, wherein in step S1, the homogenization operation is specifically:
the low-order terms are multiplied by a redundant variable with the value 1 so that they reach the order of the polynomial, and therefore the terms and the summation of each polynomial are calculated by the same operations within a single kernel function call, avoiding the performance degradation of the GPU caused by branch structures;
meanwhile, redundant terms with the value 0 are introduced so that the number of terms of each equation is a multiple of the Block size; Block is the term defined by CUDA, and the corresponding concept in OpenCL is the work-group.
3. The method for parallelizing acceleration of multivariate cryptographic algorithms based on GPU of claim 1, wherein in step S2, the step of generating multiplication table on finite field is as follows:
in the mod n domain, if g is a generator and the greatest common divisor of n and g is 1, the value of the inverse of g can be obtained by the extended Euclidean algorithm, and the multiplication table and the inverse table are obtained by enumerating g^0, g^1, …, g^(p-2).
4. The method for parallelizing acceleration of multivariate cryptographic algorithms based on GPU of claim 1, wherein the step S3 further comprises the following steps:
preprocessing the data on the CPU side, i.e. using redundant terms 0·x_t·x_t·x_t to fill the term table and the coefficient table so that the number of terms of each equation is exactly the block number of the GPU × the number of threads each block possesses, which facilitates the Reduce summation operation in step S4; here the multivariable system is assumed to contain the variables x_0, x_1, …, x_{t-1}, and an additional custom variable x_t with value 1 is added; the arrays are then copied to the GPU-side memory using asynchronous operations and bound to the texture memory.
5. The method for parallelizing acceleration of multivariate cryptographic algorithms based on GPU of claim 1, wherein in step S4, the address of the data to be processed and the address where the values of the variables of the current polynomial are stored are unsigned character pointers (uint8_t);
the address where the intermediate temporary data is stored is an unsigned 32-bit integer pointer (uint32_t).
6. A GPU-based multivariate cryptographic algorithm parallelization acceleration method according to claim 1 or 5, wherein in step S4, the multivariate main kernel function performing the calculation and the Reduce summation operation comprises the following steps:
s41, copying the value of a variable array for each block of the GPU;
s42, processing the input data, wherein in SpongeMPH the sponge operation of the sponge structure is carried out;
s43, searching corresponding items and coefficients from the texture memory according to the current global thread id, and calculating the product value by using a multiplication table;
s44, calculating the value of each equation by using a Reduce summation mode, firstly, quickly calculating the sum of 32 threads by each Warp in a halving summation mode, and then simultaneously solving the value of each polynomial by using an atomic summation operation;
and S45, copying the result of the equation summation back to the variable array by using the memory management function.
7. The GPU-based multivariate cryptographic algorithm parallelization acceleration method according to claim 1, wherein in the step S5, the writing of the master function to schedule the multivariate master kernel function specifically comprises the following steps:
s51, setting the block number of the main kernel function and the number of threads in each block;
s52, applying for a corresponding memory space from the GPU, copying the data in the step S51 into a GPU video memory by asynchronous stream operation, and then binding the data into a texture memory;
s53, transferring the corresponding text data and the value of the variable array obtained by the last calculation to a main kernel function each time, and continuously updating and obtaining the value of the polynomial variable by continuously calling the kernel function;
and S54, copying the final hash value from the GPU video memory back to the host memory, and releasing resources by using cudaFree and free commands.
8. The method for parallelizing acceleration of multivariate cryptographic algorithms based on GPU of claim 1, wherein the step S6 specifically comprises the following steps:
and after all data processing results are finished, outputting the results to a specified position and releasing resources.
CN201810228547.5A 2018-03-20 2018-03-20 Multivariable cryptographic algorithm parallelization acceleration method based on GPU Active CN108510429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810228547.5A CN108510429B (en) 2018-03-20 2018-03-20 Multivariable cryptographic algorithm parallelization acceleration method based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810228547.5A CN108510429B (en) 2018-03-20 2018-03-20 Multivariable cryptographic algorithm parallelization acceleration method based on GPU

Publications (2)

Publication Number Publication Date
CN108510429A CN108510429A (en) 2018-09-07
CN108510429B true CN108510429B (en) 2021-11-02

Family

ID=63375986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810228547.5A Active CN108510429B (en) 2018-03-20 2018-03-20 Multivariable cryptographic algorithm parallelization acceleration method based on GPU

Country Status (1)

Country Link
CN (1) CN108510429B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918125B (en) * 2019-03-20 2022-06-03 浪潮商用机器有限公司 GPU configuration method and device based on OpenPOWER architecture
CN112131583B (en) * 2020-09-02 2023-12-15 上海科技大学 Model counting and constraint solving method based on GPU

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1870499B (en) * 2005-01-11 2012-01-04 丁津泰 Method for generating multiple variable commom key password system
CN101008937B (en) * 2007-02-06 2010-05-19 中国科学院研究生院 Method for promoting computing speed of multiplication of finite field and large matrix elimination
CN101916185B (en) * 2010-08-27 2013-07-03 上海交通大学 Automatic parallelization acceleration method of serial programs running under multi-core platform
CN101977109A (en) * 2010-10-21 2011-02-16 李晨 Linear mixed high ordered equation public key algorithm
CN102006170B (en) * 2010-11-11 2013-04-17 西安理工大学 Ring signature method for anonymizing information based on MQ problem in finite field
CN102214086A (en) * 2011-06-20 2011-10-12 复旦大学 General-purpose parallel acceleration algorithm based on multi-core processor
CN102811125B (en) * 2012-08-16 2015-01-28 西北工业大学 Certificateless multi-receiver signcryption method with multivariate-based cryptosystem
KR101694306B1 (en) * 2012-12-14 2017-01-09 한국전자통신연구원 Apparatus and method for predicting performance according to parallelization of hardware acceleration device
CN103490877A (en) * 2013-09-05 2014-01-01 北京航空航天大学 Parallelization method for ARIA symmetric block cipher algorithm based on CUDA
CN103745447B (en) * 2014-02-17 2016-05-25 东南大学 A kind of fast parallel implementation method of non-local mean filtering
CN103973431B (en) * 2014-04-16 2017-04-05 华南师范大学 A kind of AES parallelization implementation methods based on OpenCL
US9558094B2 (en) * 2014-05-12 2017-01-31 Palo Alto Research Center Incorporated System and method for selecting useful smart kernels for general-purpose GPU computing
CN104020983A (en) * 2014-06-16 2014-09-03 上海大学 KNN-GPU acceleration method based on OpenCL
CN105743644B (en) * 2016-01-26 2019-02-05 广东技术师范学院 A kind of mask encryption device of multivariate quadratic equation
CN105933111B (en) * 2016-05-27 2019-03-22 华南师范大学 A kind of Fast implementation of the Bitslicing-KLEIN based on OpenCL
CN107392429A (en) * 2017-06-22 2017-11-24 东南大学 Under the direction of energy that a kind of GPU accelerates method is pushed away before trigonometric equation group

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multivariable algebraic theory and its application in cryptography; Wang Houzhen et al.; Journal of Beijing University of Technology; 2010-05-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN108510429A (en) 2018-09-07

Similar Documents

Publication Publication Date Title
Bermudo Mera et al. Time-memory trade-off in Toom-Cook multiplication: an application to module-lattice based cryptography
Gupta et al. Pqc acceleration using gpus: Frodokem, newhope, and kyber
CN115622684B (en) Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption
CN108510429B (en) Multivariable cryptographic algorithm parallelization acceleration method based on GPU
Choi et al. Fast implementation of SHA-3 in GPU environment
Wan et al. TESLAC: accelerating lattice-based cryptography with AI accelerator
Seo SIKE on GPU: Accelerating supersingular isogeny-based key encapsulation mechanism on graphic processing units
TWI531966B (en) Computing apparatus, computing method, and non-transitory machine readable storage
CN106371803B (en) Calculation method and computing device for Montgomery domain
US9804826B2 (en) Parallelization of random number generators
WO2023141936A1 (en) Techniques and devices for efficient montgomery multiplication with reduced dependencies
WO2019178735A1 (en) Gpu-based parallel acceleration method for multi-variable password algorithm
Zheng Encrypted cloud using GPUs
Li et al. An area-efficient large integer NTT-multiplier using discrete twiddle factor approach
US20220100873A1 (en) Computation of xmss signature with limited runtime storage
US20230081763A1 (en) Conditional modular subtraction instruction
US11323268B2 (en) Digital signature verification engine for reconfigurable circuit devices
Cruz-Cortés et al. A GPU parallel implementation of the RSA private operation
Myllykoski et al. On solving separable block tridiagonal linear systems using a GPU implementation of radix-4 PSCR method
Chien et al. Parallel path tracking for homotopy continuation using GPU
WO2023141933A1 (en) Techniques, devices, and instruction set architecture for efficient modular division and inversion
Wang et al. SAM: A Scalable Accelerator for Number Theoretic Transform Using Multi-Dimensional Decomposition
Seshasayee et al. Hash Table Scalability on Intel PIUMA
Zhang et al. Tensor-Product-Based Accelerator for Area-efficient and Scalable Number Theoretic Transform
Iemma On the use of a SIMD vector extension for the fast evaluation of Boundary Element Method coefficients

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant