CN108510429A - Multivariable cryptographic algorithm parallelization acceleration method based on GPU - Google Patents

Multivariable cryptographic algorithm parallelization acceleration method based on GPU

Info

Publication number
CN108510429A
CN108510429A (Application CN201810228547.5A)
Authority
CN
China
Prior art keywords
multivariable
gpu
value
variable
kernel function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810228547.5A
Other languages
Chinese (zh)
Other versions
CN108510429B (en)
Inventor
龚征
廖国鸿
黎伟杰
马昌社
刘志杰
罗裴然
黄家敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Normal University
Original Assignee
South China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University filed Critical South China Normal University
Priority to CN201810228547.5A priority Critical patent/CN108510429B/en
Publication of CN108510429A publication Critical patent/CN108510429A/en
Application granted granted Critical
Publication of CN108510429B publication Critical patent/CN108510429B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3818Decoding for concurrent execution
    • G06F9/3822Parallel decoding, e.g. parallel decode units

Abstract

The invention discloses a GPU-based parallelization acceleration method for multivariable cryptographic algorithms, comprising the following steps: S1, homogenize all terms of the multivariable equations; S2, generate the multiplication table over the field GF(2^8); S3, map the term-index table and the multiplication table to the texture memory of the GPU; S4, for each block of data, call the main multivariable kernel function to perform the computation and execute the Reduce operation; S5, write a host (main) function to schedule the main multivariable kernel function; S6, run the program, output the encryption/decryption result, and release the resources. The invention optimizes cryptographic algorithms of the multivariable cryptosystem mainly by homogenizing all terms of the multivariable polynomials and combining this with the Map-Reduce idea; taking the SpongeMPH hash function as an example, an implementation on the CUDA platform and a performance comparison are given. Experiments show that the scheme improves the running efficiency of the algorithm and can be used to accelerate cryptographic algorithms based on the multivariable cryptosystem.

Description

Multivariable cryptographic algorithm parallelization acceleration method based on GPU
Technical field
The present invention relates to the technical field of cryptographic algorithms, and more specifically to a GPU-based parallelization acceleration method for multivariable cryptographic algorithms.
Background technology
The graphics processing unit (GPU) was originally designed for image processing. In recent years, owing to the power-consumption limits of CPUs and the rapid growth of computing demand, and because GPU computing capability has been developing far faster than Moore's law, GPUs have been widely applied in the field of scientific computing.
A multivariable cryptographic algorithm is a cryptographic scheme built from multivariate polynomials over a finite field. Solving a system of multivariate polynomial equations over a finite field is an NP-hard problem, which makes this approach one of the current design directions for resisting quantum attacks. However, the large amount of computation required by multivariable cryptographic algorithms leads to low efficiency, which is a main factor limiting their practicality. How to improve execution efficiency on the GPU is therefore one of the research directions of those skilled in the art.
Summary of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art and to provide a GPU-based parallelization acceleration method for multivariable cryptographic algorithms, which uses the GPU together with the Map-Reduce idea to parallelize multivariable cryptographic algorithms and thereby improve their execution efficiency.
In order to achieve the above object, the present invention adopts the following technical scheme:
The GPU-based multivariable cryptographic algorithm parallelization acceleration method of the present invention includes the following steps:
S1. Apply the homogenization (same-order) operation to all terms of the multivariable equations;
S2. Generate the generator (power) table and the logarithm table over the finite field; finite-field multiplication is then realized by table lookups in these two tables, which improves the uniformity of the computation across GPU threads. The generator table is the table formed by the powers g^i of a generator g of the finite field F with q elements for the exponents i = 0, 1, 2, …, q-2, together with 0, i.e. table[i] = g^i; additionally, table[p-1] = 1 and table[p] = g are set. The logarithm table satisfies arc_table[a] = i for any element a of the finite field, where table[i] = a; the value of arc_table[0] is set to a large negative number so that in 0*a = table[arc_table[0] + arc_table[a]] the index arc_table[0] + arc_table[a] is always negative, and table[negative index] = 0;
S3. Map the item table, the coefficient table, the generator table and the logarithm table to the texture memory of the GPU. The item table records the subscripts of the variables that make up each term of the multivariable equations; for example, if a term is a_1·x_1·x_3·x_4, where x_1, x_3, x_4 are variables, the item table stores 1, 3, 4 at the corresponding positions. The coefficient table records the coefficient of each term of the multivariable equations and corresponds one-to-one with the item table;
S4. For each block of data, call the main multivariable kernel function to perform the computation and execute the Reduce summation operation. The parameters of the main multivariable kernel function include the address of the data to be processed, the address of the current values of the polynomial variables, and the address of the intermediate temporary storage. In each basic thread of the GPU, the main multivariable kernel function obtains the values of the variables, computes the complete value of each term, then performs the Reduce summation, and obtains the result of each polynomial and saves it into the current polynomial variable array;
S5. Write a host (main) function to schedule the main multivariable kernel function. The host function sets the block size, allocates GPU memory and binds the texture memory, repeatedly passes blocks of data to the main kernel function, and finally copies the computation result back to host memory and releases the resources;
S6. Run the program, output the encryption/decryption result, and release the resources.
As a preferred technical solution, in step S1 the homogenization operation is specifically:
introducing a redundant variable with value 1 and multiplying it into the lower-order terms so that every term has the same order as the polynomial; each polynomial's terms and their sum can then be computed with identical operations in a single kernel call, avoiding the performance degradation of the GPU caused by branching;
meanwhile, introducing redundant terms with coefficient 0 so that the number of terms of each equation is a multiple of the Block size; Block is the term defined by CUDA and corresponds to a work-group in OpenCL.
As a preferred technical solution, in step S2 the multiplication table over the finite field is generated as follows:
in the residue field mod n, if g is a generator and the greatest common divisor of n and g is 1, the value of the generator g can be found by the extended Euclidean algorithm; enumerating g^0, g^1, …, g^(p-2) then yields the multiplication table and the inverse table.
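As an illustration only, the following host-side sketch builds such a power table and logarithm table over a prime field GF(p) for a known generator g (the names exp_table/log_table and the use of std::vector are assumptions, not taken from the patent; the embodiment below works over GF(2^8) instead):

```cpp
#include <cstdint>
#include <vector>

// Build the power ("generator") table and the logarithm table over GF(p):
// exp_table[i] = g^i mod p for i = 0..p-2, and log_table[exp_table[i]] = i.
// Multiplication can then be done as exp_table[(log_table[a] + log_table[b]) % (p - 1)].
void build_prime_field_tables(uint32_t p, uint32_t g,
                              std::vector<uint32_t>& exp_table,
                              std::vector<uint32_t>& log_table) {
    exp_table.assign(p, 0);
    log_table.assign(p, 0);
    uint64_t acc = 1;
    for (uint32_t i = 0; i + 1 < p; ++i) {   // exponents 0 .. p-2
        exp_table[i] = static_cast<uint32_t>(acc);
        log_table[acc] = i;
        acc = (acc * g) % p;
    }
}
```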
As a preferred technical solution, step S3 further includes the following:
the data are first preprocessed on the CPU side, i.e. the item table and the coefficient table are padded with the redundant term 0·x_t·x_t·x_t so that the number of terms of each equation is exactly the GPU block count times the number of threads each block possesses, which facilitates the Reduce summation in step S4; here it is assumed that the multivariable system contains the variables x_0, x_1, …, x_{t-1}, and an additional custom variable x_t with value 1 is added. These arrays are then copied to GPU memory with asynchronous operations and bound to texture memory.
As a preferred technical solution, in step S4 the address of the data to be processed and the address where the current values of the polynomial variables are stored are unsigned character pointers (uint8_t*);
the address of the intermediate temporary storage is an unsigned 32-bit integer pointer (uint32_t*).
As a preferred technical solution, in step S4 the computation of the main multivariable kernel function and the Reduce summation include the following:
S41. each block of the GPU copies the values of the variable array;
S42. the input data are processed; in SpongeMPH this is the absorb operation of the sponge construction;
S43. the corresponding term and coefficient are looked up in texture memory according to the current global thread id, and the product value is computed using the multiplication table;
S44. the value of each equation is computed by Reduce summation: the sum of the 32 threads of each warp is first computed quickly by halving summation, and the values of the polynomials are then obtained simultaneously with atomic add operations;
S45. after the equations are summed, the results are copied back into the variable array by a memory-management function.
As a preferred technical solution, in step S5 writing the host function to schedule the main multivariable kernel function specifically includes the following:
S51. setting the number of blocks of the main kernel function and the number of threads in each block;
S52. allocating the corresponding memory on the GPU side, copying the data of step S51 into GPU memory with asynchronous stream operations, and binding them to texture memory;
S53. in each call, passing the corresponding block of text data and the variable-array values obtained by the previous computation to the main kernel function, and repeatedly calling the kernel function to continually update the values of the polynomial variables;
S54. copying the final hash value from GPU memory back to host memory, and releasing the resources with the cudaFree and free commands.
As a preferred technical solution, step S6 specifically includes the following:
after the host function is written, it is invoked and run directly; according to the settings and scheduling strategy of the host function, the target device reasonably cycles through three operations: copying data to the target device, letting each thread run the kernel program, and copying the computation result back from the target device to the host; after all data-processing results are complete, the result is output to the specified location and the resources are released.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
The present invention adopts a GPU-based parallelization scheme. By homogenizing and padding the multivariable equations, and by combining multiplication tables held in texture memory, it makes full use of the parallel structure of the GPU; with halving summation and atomic operations, the value of each equation is computed quickly after each iteration. This overcomes the problem that the low computation speed of multivariable cryptographic algorithms on CPU platforms restricts their application scenarios, achieving an acceleration of 15 times the CPU computation speed and thereby improving the practicality of multivariable cryptographic algorithms. The invention also takes parallel granularity and memory allocation into account and makes full use of the GPU for fast polynomial summation, ensuring the performance of the multivariable cipher when parallelized on the GPU as in the present invention.
Description of the drawings
Fig. 1 is the flow chart of the SpongeMPH hash function;
Fig. 2 is the flow chart of the present invention.
Detailed description of the embodiments
The present invention will now be described in further detail with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment
This embodiment takes the multivariable hash function SpongeMPH as an example; the flow of the SpongeMPH hash function is shown in Fig. 1:
a) Padding is applied first so that the length of the input data is an integer multiple of the block length;
b) Data of one block length are read in a loop and XORed with the first r*k bits of the current state, and the multivariable function MPE is then used to compute the updated current state; this continues until all the data have been read, yielding the state SL;
c) MPE is called once more to update SL, giving the final value S0, and the final result is obtained from S0 (a minimal code sketch of this flow follows).
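A minimal host-side sketch of this sponge flow is given below for orientation; the padding rule, the rate constant RATE_BYTES and the stub mpe() standing in for the multivariable map are illustrative assumptions, not taken from the patent:

```cpp
#include <cstdint>
#include <vector>

constexpr size_t STATE_BYTES = 40;   // 40 x 8-bit state, as in the embodiment
constexpr size_t RATE_BYTES  = 32;   // illustrative rate (r*k bits / 8)

// Stand-in for the multivariable map MPE; in the patent this evaluation runs on the GPU.
static void mpe(uint8_t state[STATE_BYTES]) { /* placeholder */ }

std::vector<uint8_t> sponge_mph(std::vector<uint8_t> msg) {
    // a) pad the input so its length is a multiple of the block (rate) length
    msg.push_back(0x01);                                   // illustrative padding rule
    while (msg.size() % RATE_BYTES != 0) msg.push_back(0x00);

    uint8_t state[STATE_BYTES] = {0};

    // b) absorb: XOR each block into the first RATE_BYTES of the state, then apply MPE
    for (size_t off = 0; off < msg.size(); off += RATE_BYTES) {
        for (size_t i = 0; i < RATE_BYTES; ++i) state[i] ^= msg[off + i];
        mpe(state);                                        // state SL after the last block
    }

    // c) one more MPE call to obtain S0, then derive the final value from it
    mpe(state);
    return std::vector<uint8_t>(state, state + RATE_BYTES);
}
```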
Based on this scheme, a concrete implementation of SpongeMPH on the CUDA platform is provided together with a performance comparison. With minor modifications, the steps of this embodiment can also be used to implement other multivariable cryptographic algorithms quickly.
As shown in Fig. 2, the GPU-based multivariable cryptographic algorithm parallelization acceleration method of this embodiment includes the following steps:
S1. Apply the homogenization operation to all terms of the multivariable equations. In order to make full use of the multi-core nature and the computational characteristics of the GPU to accelerate the multivariable cryptographic algorithm, the invention introduces a homogenization step. A redundant variable with value 1 is introduced and multiplied into the lower-order terms so that every term has the same order as the polynomial; each polynomial's terms and their sum can then be computed with the same operations in a single kernel call, avoiding the GPU performance degradation caused by branching. Meanwhile, redundant terms with coefficient 0 are introduced so that the number of terms of each equation is a multiple of the Block size (Block is the term defined by CUDA, corresponding to a work-group in OpenCL).
In this embodiment, each block of data is 320 bits and is stored in memory as 40 8-bit variables. The state size of SpongeMPH is 40*8 bits, i.e. 40 equations in 40 variables, where each equation has 40 first-order terms, 842 second-order terms and 400 third-order terms. The 40 variables used by SpongeMPH are labelled x_0, x_1, …, x_39, each x_i being 8 bits; setting x_40 = 1, the t-th equation is written as:
F_t(x_0, x_1, …, x_39) = Σ_{i≤j≤k} α_tijk·x_i·x_j·x_k + Σ_{i≤j} β_tij·x_i·x_j + Σ_i γ_ti·x_i + η_t.
The homogenization operation can then be expressed as:
F_t(x_0, x_1, …, x_39) = Σ_{i≤j≤k} α_tijk·x_i·x_j·x_k + Σ_{i≤j} β_tij·x_i·x_j·x_40 + Σ_i γ_ti·x_i·x_40·x_40 + η_t·x_40·x_40·x_40.
The coefficients of every term of all equations are stored in an array var, and for each term x_i·x_j·x_k its three variable subscripts are stored in the arrays index, indey and indez respectively; each equation can then be expressed as the sum over its terms m of var[m]·x_index[m]·x_indey[m]·x_indez[m].
The arrays var and index, indey, indez are simultaneously padded (with zero coefficients and the redundant variable x_40 of value 1, i.e. with terms 0·x_40·x_40·x_40) so that the number of terms of each equation is a multiple of the per-block thread count of the GPU; the thread count per block is set to 128 here.
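A minimal host-side sketch of this padding is shown below (the helper name pad_equation_terms and the use of std::vector are assumptions; var, index, indey and indez are the arrays named above):

```cpp
#include <cstdint>
#include <vector>

// Append redundant terms 0 * x40 * x40 * x40 to one equation's term arrays
// until the term count is a multiple of the per-block thread count (128 here).
void pad_equation_terms(std::vector<uint8_t>& var,
                        std::vector<uint8_t>& index,
                        std::vector<uint8_t>& indey,
                        std::vector<uint8_t>& indez,
                        size_t threads_per_block = 128) {
    while (var.size() % threads_per_block != 0) {
        var.push_back(0);       // zero coefficient: the padded term contributes nothing
        index.push_back(40);    // the redundant variable x40, whose value is 1
        indey.push_back(40);
        indez.push_back(40);
    }
}
```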
S2. Generate the multiplication table over the finite field. The multiplications performed by the multivariable cryptographic algorithm are operations over a finite field, so generating a multiplication table helps keep the computation uniform across GPU threads and reduces unnecessary branching and repeated computation, thereby increasing the computing speed.
The SpongeMPH version used in this embodiment operates over the field GF(2^8); the generator 3 is used as the basis for constructing the multiplication table, and the corresponding data are stored in the arrays table and arc_table, where table[i] = 3^i and table[arc_table[x]] = x. For elements a and b, multiplication over GF(2^8) can then be expressed as:
1) if a, b != 0, then a*b = table[(arc_table[a] + arc_table[b]) % 0xFF];
2) otherwise a*b = table[negative index] = 0;
where arc_table[0] is set to a sufficiently small (negative) value. Letting tmp = arc_table[a] + arc_table[b], the result can be obtained through the index (tmp > 0) * (tmp % 0xFF), which unifies cases 1) and 2).
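For concreteness, a minimal host-side sketch of the GF(2^8) tables and the table-lookup multiplication follows. It assumes the AES reduction polynomial 0x11B (the text does not state which polynomial is used) and replaces the negative-index trick above with an explicit zero check; gf_exp and gf_log play the roles of table and arc_table:

```cpp
#include <cstdint>

static uint8_t gf_exp[256];   // gf_exp[i] = 3^i in GF(2^8)  ("table")
static int     gf_log[256];   // gf_log[a] = i with gf_exp[i] = a  ("arc_table")

// Build the power and logarithm tables for generator 3,
// assuming the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B).
void build_gf256_tables() {
    uint16_t x = 1;
    for (int i = 0; i < 255; ++i) {
        gf_exp[i] = static_cast<uint8_t>(x);
        gf_log[gf_exp[i]] = i;
        // multiply x by the generator 3 = x*2 + x, reducing modulo 0x11B
        uint16_t x2 = x << 1;
        if (x2 & 0x100) x2 ^= 0x11B;
        x = x2 ^ x;
    }
    gf_exp[255] = 1;          // convenience entry, like the patent's table[p-1] = 1
    gf_log[0]  = -10000;      // "sufficiently negative" marker for the element 0
}

// Table-lookup multiplication: cases 1) and 2) of the text, with an explicit check.
uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;                       // case 2)
    return gf_exp[(gf_log[a] + gf_log[b]) % 255];         // case 1)
}
```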
S3. Map the item table, the coefficient table and the multiplication table to the texture memory of the GPU. The multivariable cryptographic algorithm needs to read the variable indices of every term during each computation, and the item table may be very large, so texture memory is used for storage to improve the reading speed. In addition, the multiplications computed by table lookup may access the tables non-uniformly, so storing the tables in texture memory also helps improve lookup efficiency.
Further, the data are first preprocessed on the CPU side: the item table and the coefficient table are padded with the redundant term 0·x_40·x_40·x_40 so that the number of terms of each equation is exactly the GPU block count times the number of threads per block, which facilitates the Reduce summation in step S4. These arrays are then copied to GPU memory with asynchronous operations and bound to texture memory.
S4. For each block of data, call the main multivariable kernel function to perform the computation and execute the Reduce summation. The parameters of the main multivariable kernel function are the address of the data to be processed, the address of the current values of the polynomial variables, and the address of the intermediate temporary storage. The first two are unsigned character pointers (uint8_t*), and the address of the intermediate temporary storage is an unsigned 32-bit integer pointer (uint32_t*). In each basic thread of the GPU, the kernel function obtains the values of the variables, computes the complete value of each term, then performs the Reduce summation, and obtains the result of each polynomial and saves it into the current polynomial variable array.
In this embodiment, the kernel function has three parameters: the address of the array holding the data to be processed (the input data), the address of the array of current polynomial variable values (the output is also written back into this array), and the address of the array of intermediate temporary data. Each thread in the GPU corresponds to one term of an equation, and the kernel mainly consists of the following operations:
(a) each block of the GPU copies the values of the variable array; in this embodiment, the first 40 threads of each block each copy the value of one variable into shared memory;
(b) the input data are processed; in SpongeMPH this is the absorb operation of the sponge construction;
(c) the corresponding term and coefficient are looked up in texture memory according to the current global thread id, and the product value is computed using the multiplication table;
(d) the value of each equation is computed by Reduce summation: the sum of the 32 threads of each warp is first computed quickly by halving summation, and the values of the polynomials are then obtained simultaneously with atomic add operations;
(e) after the equations are summed, the first 40 threads of another GPU kernel program (Map) copy the results back into the variable array.
In this embodiment, the pseudocode of the main kernel function is shown in Table 1, and the pseudocode of the Reduce function is shown in Table 2.
Table 1
Table 2
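The following is a minimal CUDA sketch of steps (c) and (d) above, under stated assumptions: the summation uses XOR (the addition of GF(2^8)), each equation's term list has been padded to a multiple of the 128-thread block size, the lookup tables are held in constant memory rather than texture memory for brevity, and the reduction is performed per block followed by one atomic per block rather than per warp. All identifiers (mpe_kernel, eq_acc, terms_per_eq, …) are illustrative, not taken from the patent.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Device copies of the lookup tables (held in texture memory in the patent;
// plain __constant__ arrays here for brevity).
__constant__ uint8_t d_gf_exp[256];
__constant__ int     d_gf_log[256];

__device__ uint8_t gf_mul_dev(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return d_gf_exp[(d_gf_log[a] + d_gf_log[b]) % 255];
}

// One thread per term; blockDim.x must be 128.  Each equation occupies
// terms_per_eq / 128 consecutive blocks (11 blocks in the embodiment,
// terms_per_eq = 1408).  eq_acc must be zeroed before the launch.
__global__ void mpe_kernel(const uint8_t* __restrict__ coeff,   // var
                           const uint8_t* __restrict__ ix,      // index
                           const uint8_t* __restrict__ iy,      // indey
                           const uint8_t* __restrict__ iz,      // indez
                           const uint8_t* __restrict__ x,       // 41 variable values, x[40] = 1
                           uint32_t*      eq_acc,               // one accumulator per equation
                           int            terms_per_eq) {
    int term = blockIdx.x * blockDim.x + threadIdx.x;            // global term id
    int eq   = term / terms_per_eq;                              // equation this term belongs to

    // (c) look up the term's coefficient and variables, multiply via the tables
    uint8_t v = gf_mul_dev(coeff[term], x[ix[term]]);
    v = gf_mul_dev(v, x[iy[term]]);
    v = gf_mul_dev(v, x[iz[term]]);

    // (d) Reduce by halving inside the block (GF(2^8) addition is XOR) ...
    __shared__ uint32_t s[128];
    s[threadIdx.x] = v;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) s[threadIdx.x] ^= s[threadIdx.x + stride];
        __syncthreads();
    }
    // ... then fold each block's partial sum into its equation's value
    if (threadIdx.x == 0) atomicXor(&eq_acc[eq], s[0]);
}
```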
S5. Write the host (main) function to schedule the main multivariable kernel function. The host function sets the block size, allocates GPU memory, binds the texture memory, repeatedly passes blocks of data to the main kernel function, and finally copies the computation result back to host memory and releases the resources.
First, the number of blocks of the kernel function and the number of threads in each block are set: in this embodiment, since each equation has 1284 effective terms, and considering the structure of the GPU, the number of blocks is set to 440 and the number of threads in a single block to 128; by padding each equation with dummy terms of value 0, each equation uses exactly 11 blocks.
Next, the corresponding memory is allocated on the GPU side, and the data from the above steps are copied into GPU memory with asynchronous stream operations and bound to texture memory. Then, in each call, the corresponding block of text data and the values of the 40 variables obtained by the previous computation (initialized to 0) are passed to the main kernel function, and the kernel is called repeatedly to keep updating the values of the polynomial variables.
In the last part, the final hash value is copied from GPU memory back to host memory, and the resources are released with commands such as cudaFree and free.
S6. Run the program, output the encryption/decryption result, and release the resources. Once the main program is written, it can be run directly. According to the settings and scheduling strategy of the main program, the target device reasonably cycles through three operations: copying data to the target device, letting each thread run the kernel program, and copying the computation result back from the target device to the host. After all data-processing results are complete, the result is output to the specified location and the resources are released.
The CUDA main program is opened in the IDE, or compiled and run directly from the command line. Depending on the CUDA main program, the hash value can be output to the screen or to a specified file.
In this embodiment, the pseudocode for dispatching the kernel function is shown in Table 3.
Table 3
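A host-side sketch of this dispatch pattern is given below; it reuses the mpe_kernel sketched in step S4 (declared here, defined above), omits texture-object binding and the finalization call for brevity, performs the absorb step on the host, treats each data block as the full 40-byte (320-bit) state width, and uses illustrative names and sizes throughout:

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>
#include <vector>

// Kernel sketched in step S4 (one thread per term).
__global__ void mpe_kernel(const uint8_t*, const uint8_t*, const uint8_t*,
                           const uint8_t*, const uint8_t*, uint32_t*, int);

void run_sponge_mph(const std::vector<uint8_t>& h_coeff,             // var
                    const std::vector<uint8_t>& h_ix,                // index
                    const std::vector<uint8_t>& h_iy,                // indey
                    const std::vector<uint8_t>& h_iz,                // indez
                    const std::vector<std::vector<uint8_t>>& blocks) // 40-byte data blocks
{
    const int terms_per_eq = 1408, n_eq = 40, threads = 128;
    const int n_terms = terms_per_eq * n_eq;                         // 440 blocks of 128 threads

    uint8_t *d_coeff, *d_ix, *d_iy, *d_iz, *d_x;
    uint32_t *d_acc;
    cudaMalloc(&d_coeff, n_terms);  cudaMalloc(&d_ix, n_terms);
    cudaMalloc(&d_iy, n_terms);     cudaMalloc(&d_iz, n_terms);
    cudaMalloc(&d_x, n_eq + 1);     cudaMalloc(&d_acc, n_eq * sizeof(uint32_t));

    // Asynchronous copies of the constant term arrays (bound to texture memory in the patent).
    cudaMemcpyAsync(d_coeff, h_coeff.data(), n_terms, cudaMemcpyHostToDevice);
    cudaMemcpyAsync(d_ix, h_ix.data(), n_terms, cudaMemcpyHostToDevice);
    cudaMemcpyAsync(d_iy, h_iy.data(), n_terms, cudaMemcpyHostToDevice);
    cudaMemcpyAsync(d_iz, h_iz.data(), n_terms, cudaMemcpyHostToDevice);

    std::vector<uint8_t> state(n_eq + 1, 0);
    state[n_eq] = 1;                                                 // the redundant variable x40 = 1

    for (const auto& blk : blocks) {
        for (int i = 0; i < n_eq; ++i) state[i] ^= blk[i];           // absorb one data block
        cudaMemcpy(d_x, state.data(), n_eq + 1, cudaMemcpyHostToDevice);
        cudaMemset(d_acc, 0, n_eq * sizeof(uint32_t));
        mpe_kernel<<<n_terms / threads, threads>>>(d_coeff, d_ix, d_iy, d_iz,
                                                   d_x, d_acc, terms_per_eq);
        std::vector<uint32_t> acc(n_eq);
        cudaMemcpy(acc.data(), d_acc, n_eq * sizeof(uint32_t), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n_eq; ++i) state[i] = static_cast<uint8_t>(acc[i]);
    }

    for (int i = 0; i < n_eq; ++i) printf("%02x", state[i]);         // output the hash value
    printf("\n");

    cudaFree(d_coeff); cudaFree(d_ix); cudaFree(d_iy);
    cudaFree(d_iz);    cudaFree(d_x);  cudaFree(d_acc);
}
```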
Experimental results
The running environment of this example is: CPU Core i7 6700K, 16 GB of memory, ArchLinux (64-bit) operating system, Nvidia GTX 1070 GPU with 11 GB of video memory; the SDK version used is CUDA Toolkit 9.0, and the integrated development environment is Nsight.
The performance comparison of GPU-SpongeMPH and CPU-SpongeMPH in this example with an input data size of 40 MB is shown in Table 4:
Table 4
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by the above embodiment. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principles of the present invention shall be an equivalent replacement and is included within the scope of protection of the present invention.

Claims (8)

1. A GPU-based multivariable cryptographic algorithm parallelization acceleration method, characterized in that it comprises the following steps:
S1. applying the homogenization (same-order) operation to all terms of the multivariable equations;
S2. generating the generator (power) table and the logarithm table over the finite field, finite-field multiplication being realized by table lookups in these two tables, which improves the uniformity of the computation across GPU threads; the generator table is the table formed by the powers g^i of a generator g of the finite field F with q elements for the exponents i = 0, 1, 2, …, q-2, together with 0, i.e. table[i] = g^i, with table[p-1] = 1 and table[p] = g additionally set; the logarithm table satisfies arc_table[a] = i for any element a of the finite field, where table[i] = a, and the value of arc_table[0] is set to a large negative number so that in 0*a = table[arc_table[0] + arc_table[a]] the index arc_table[0] + arc_table[a] is always negative, and table[negative index] = 0;
S3. mapping the item table, the coefficient table, the generator table and the logarithm table to the texture memory of the GPU, the item table recording the subscripts of the variables that make up each term of the multivariable equations, e.g. if a term is a_1·x_1·x_3·x_4, where x_1, x_3, x_4 are variables, the item table stores 1, 3, 4 at the corresponding positions, and the coefficient table recording the coefficient of each term of the multivariable equations in one-to-one correspondence with the item table;
S4. for each block of data, calling the main multivariable kernel function to perform the computation and execute the Reduce summation operation, the parameters of the main multivariable kernel function including the address of the data to be processed, the address of the current values of the polynomial variables, and the address of the intermediate temporary storage; in each basic thread of the GPU, the main multivariable kernel function obtains the values of the variables, computes the complete value of each term, then performs the Reduce summation, and obtains the result of each polynomial and saves it into the current polynomial variable array;
S5. writing a host (main) function to schedule the main multivariable kernel function, the host function setting the block size, allocating GPU memory and binding the texture memory, repeatedly passing blocks of data to the main kernel function, and finally copying the computation result back to host memory and releasing the resources;
S6. running the program, outputting the encryption/decryption result, and releasing the resources.
2. The GPU-based multivariable cryptographic algorithm parallelization acceleration method according to claim 1, characterized in that in step S1 the homogenization operation is specifically:
introducing a redundant variable with value 1 and multiplying it into the lower-order terms so that every term has the same order as the polynomial, whereby each polynomial's terms and their sum can be computed with identical operations in a single kernel call, avoiding the performance degradation of the GPU caused by branching;
meanwhile, introducing redundant terms with coefficient 0 so that the number of terms of each equation is a multiple of the Block size, Block being the term defined by CUDA and corresponding to a work-group in OpenCL.
3. The GPU-based multivariable cryptographic algorithm parallelization acceleration method according to claim 1, characterized in that in step S2 the multiplication table over the finite field is generated as follows:
in the residue field mod n, if g is a generator and the greatest common divisor of n and g is 1, the value of the generator g can be found by the extended Euclidean algorithm, and enumerating g^0, g^1, …, g^(p-2) then yields the multiplication table and the inverse table.
4. The GPU-based multivariable cryptographic algorithm parallelization acceleration method according to claim 1, characterized in that step S3 further includes the following:
first preprocessing the data on the CPU side, i.e. padding the item table and the coefficient table with the redundant term 0·x_t·x_t·x_t so that the number of terms of each equation is exactly the GPU block count times the number of threads each block possesses, which facilitates the Reduce summation in step S4, wherein it is assumed that the multivariable system contains the variables x_0, x_1, …, x_{t-1} and an additional custom variable x_t with value 1 is added; these arrays are then copied to GPU memory with asynchronous operations and bound to texture memory.
5. The GPU-based multivariable cryptographic algorithm parallelization acceleration method according to claim 1, characterized in that in step S4 the address of the data to be processed and the address where the current values of the polynomial variables are stored are unsigned character pointers (uint8_t*);
the address of the intermediate temporary storage is an unsigned 32-bit integer pointer (uint32_t*).
6. The GPU-based multivariable cryptographic algorithm parallelization acceleration method according to claim 1 or 5, characterized in that in step S4 the computation of the main multivariable kernel function and the Reduce summation include the following:
S41. each block of the GPU copies the values of the variable array;
S42. the input data are processed; in SpongeMPH this is the absorb operation of the sponge construction;
S43. the corresponding term and coefficient are looked up in texture memory according to the current global thread id, and the product value is computed using the multiplication table;
S44. the value of each equation is computed by Reduce summation: the sum of the 32 threads of each warp is first computed quickly by halving summation, and the values of the polynomials are then obtained simultaneously with atomic add operations;
S45. after the equations are summed, the results are copied back into the variable array by a memory-management function.
7. The GPU-based multivariable cryptographic algorithm parallelization acceleration method according to claim 1, characterized in that in step S5 writing the host function to schedule the main multivariable kernel function specifically includes the following:
S51. setting the number of blocks of the main kernel function and the number of threads in each block;
S52. allocating the corresponding memory on the GPU side, copying the data of step S51 into GPU memory with asynchronous stream operations, and binding them to texture memory;
S53. in each call, passing the corresponding block of text data and the variable-array values obtained by the previous computation to the main kernel function, and repeatedly calling the kernel function to continually update the values of the polynomial variables;
S54. copying the final hash value from GPU memory back to host memory, and releasing the resources with the cudaFree and free commands.
8. The GPU-based multivariable cryptographic algorithm parallelization acceleration method according to claim 1, characterized in that step S6 specifically includes the following:
after the host function is written, it is invoked and run directly; according to the settings and scheduling strategy of the host function, the target device reasonably cycles through three operations: copying data to the target device, letting each thread run the kernel program, and copying the computation result back from the target device to the host; after all data-processing results are complete, the result is output to the specified location and the resources are released.
CN201810228547.5A 2018-03-20 2018-03-20 Multivariable cryptographic algorithm parallelization acceleration method based on GPU Active CN108510429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810228547.5A CN108510429B (en) 2018-03-20 2018-03-20 Multivariable cryptographic algorithm parallelization acceleration method based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810228547.5A CN108510429B (en) 2018-03-20 2018-03-20 Multivariable cryptographic algorithm parallelization acceleration method based on GPU

Publications (2)

Publication Number Publication Date
CN108510429A true CN108510429A (en) 2018-09-07
CN108510429B CN108510429B (en) 2021-11-02

Family

ID=63375986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810228547.5A Active CN108510429B (en) 2018-03-20 2018-03-20 Multivariable cryptographic algorithm parallelization acceleration method based on GPU

Country Status (1)

Country Link
CN (1) CN108510429B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918125A (en) * 2019-03-20 2019-06-21 浪潮商用机器有限公司 GPU configuration method and device based on OpenPOWER framework
CN112131583A (en) * 2020-09-02 2020-12-25 上海科技大学 GPU-based model counting and constraint solving method

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1870499A (en) * 2005-01-11 2006-11-29 丁津泰 Method for generating multiple variable common key password system
CN101008937A (en) * 2007-02-06 2007-08-01 中国科学院研究生院 Computer implementation method of multiplier over finite field and computer implementation method of large matrix element elimination
CN101916185A (en) * 2010-08-27 2010-12-15 上海交通大学 Automatic parallelization acceleration method of serial programs running under multi-core platform
CN101977109A (en) * 2010-10-21 2011-02-16 李晨 Linear mixed high ordered equation public key algorithm
CN102006170A (en) * 2010-11-11 2011-04-06 西安理工大学 Ring signature method for anonymizing information based on MQ problem in finite field
CN102214086A (en) * 2011-06-20 2011-10-12 复旦大学 General-purpose parallel acceleration algorithm based on multi-core processor
CN102811125A (en) * 2012-08-16 2012-12-05 西北工业大学 Certificateless multi-receiver signcryption method with multivariate-based cryptosystem
US20140173608A1 (en) * 2012-12-14 2014-06-19 Electronics And Telecommunications Research Institute Apparatus and method for predicting performance attributable to parallelization of hardware acceleration devices
CN103490877A (en) * 2013-09-05 2014-01-01 北京航空航天大学 Parallelization method for ARIA symmetric block cipher algorithm based on CUDA
CN103745447A (en) * 2014-02-17 2014-04-23 东南大学 Fast parallel implementation method for non-local average filtering
CN103973431A (en) * 2014-04-16 2014-08-06 华南师范大学 AES parallel implementation method based on OpenCL
US20150324707A1 (en) * 2014-05-12 2015-11-12 Palo Alto Research Center Incorporated System and method for selecting useful smart kernels for general-purpose gpu computing
CN104020983A (en) * 2014-06-16 2014-09-03 上海大学 KNN-GPU acceleration method based on OpenCL
CN105743644A (en) * 2016-01-26 2016-07-06 广东技术师范学院 Mask encryption device of multivariable quadratic equation
CN105933111A (en) * 2016-05-27 2016-09-07 华南师范大学 Bitslicing-KLEIN rapid implementation method based on OpenCL
CN107392429A (en) * 2017-06-22 2017-11-24 东南大学 GPU-accelerated forward-substitution method for triangular equation systems under energy management

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Armin Ahmadzadeh et al.: "A high-performance and energy-efficient exhaustive key search approach via GPU on DES-like cryptosystems", Journal of Supercomputing *
王后珍 et al.: "Multivariate algebra theory and its application in cryptography", Journal of Beijing University of Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918125A (en) * 2019-03-20 2019-06-21 浪潮商用机器有限公司 GPU configuration method and device based on OpenPOWER framework
CN112131583A (en) * 2020-09-02 2020-12-25 上海科技大学 GPU-based model counting and constraint solving method
CN112131583B (en) * 2020-09-02 2023-12-15 上海科技大学 Model counting and constraint solving method based on GPU

Also Published As

Publication number Publication date
CN108510429B (en) 2021-11-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant