CN116561819A

CN116561819A - Encryption and decryption method based on Toom-Cook ring polynomial multiplication, and ring polynomial multiplier

Info

Publication number: CN116561819A
Application number: CN202310536435.7A
Authority: CN
Inventors: 杨晨; 王剑飞; 张发鸿; 孟依烁
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2023-05-12
Filing date: 2023-05-12
Publication date: 2023-08-08

Abstract

The invention provides an encryption and decryption method based on Toom-Cook ring polynomial multiplication, and a ring polynomial multiplier. The method is based on an improved Toom-Cook ring polynomial multiplication that integrates two steps of the original algorithm, polynomial recombination and polynomial modular reduction, into the interpolation process, so that the Toom-Cook algorithm yields the final result of the ring polynomial multiplication directly after interpolation, with no further steps, which simplifies the algorithm flow. In addition, the interpolation matrix is modified so that polynomial recombination and polynomial modular reduction are mapped into it, which eliminates a large number of redundant arithmetic operations: the number of post-processing arithmetic operations is reduced by at least 33.33% compared with the original Toom-Cook algorithm. This effectively reduces the time complexity and space complexity of the post-processing part of the algorithm, speeds up encryption and decryption, and allows a smaller processing element array in a hardware implementation.

Description

Encryption and decryption method based on Toom-Cook ring polynomial multiplication, and ring polynomial multiplier

Technical Field

The invention relates to polynomial multiplication accelerators, and in particular to an encryption and decryption method based on Toom-Cook ring polynomial multiplication and to a ring polynomial multiplier.

Background

Lattice-based cryptography (LBC) is receiving increasing attention as an important branch of modern cryptography because of its high security and its resistance to attacks by quantum computers. Lattice cryptography is widely applied, especially in two frontier fields: post-quantum cryptography (PQC) and fully homomorphic encryption (FHE). For example, among the first PQC standard algorithms announced by NIST in the United States, Kyber, Dilithium and Falcon are all based on lattice theory; moreover, almost all mainstream FHE schemes, such as BGV, CKKS and TFHE, are based on lattice theory. At present, the security of most lattice cryptography rests on the Ring Learning with Errors (RLWE) problem. However, RLWE-based encryption schemes generally require polynomial rings of high degree, so polynomial multiplication and polynomial modular reduction consume a great deal of time and energy, which is a major bottleneck limiting the large-scale practical deployment of RLWE-based LBC encryption.

There are three main acceleration algorithms, each with hardware implementations, for ring polynomial multiplication: the number theoretic transform (NTT), the Toom-Cook algorithm, and the Schoolbook algorithm. NTT-based methods are the most widely used because they have the lowest time complexity. Toom-Cook-based ring polynomial multiplication is used less often because its time complexity is higher than that of NTT. The Schoolbook algorithm is rarely used alone because its time complexity is the highest of the three. In terms of time complexity, NTT appears to be the best ring polynomial multiplication algorithm, but its lower time complexity often comes at the cost of a more difficult hardware implementation, more complex algorithm logic, and a narrower range of application. An important limitation of NTT-based ring polynomial multiplication is that the coefficient modulus of the polynomial must be prime, so NTT is not directly applicable to some cryptographic schemes with NTT-unfriendly parameters (for example, non-prime coefficient moduli), such as NTRU-Prime and Saber. In addition, when the polynomial degree is high, NTT needs an additional residue number system (RNS) module and Chinese remainder theorem (CRT) module to split the high-degree polynomial into several lower-degree polynomials, whereas the Toom-Cook algorithm incurs no such additional hardware overhead because polynomial splitting is inherent to it. Therefore, a growing body of research considers Toom-Cook-based ring polynomial multiplication for accelerating LBC schemes.
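
As a point of reference for the three algorithms compared above, Schoolbook multiplication on the ring can be sketched as follows. This is only a minimal illustration: it assumes the ring is Z_q[X]/(X^N + 1), which is the usual RLWE choice but is not stated in the text above, and the function name `schoolbook_ring_mul` is hypothetical.

```python
def schoolbook_ring_mul(a, b, q):
    """Multiply two length-N coefficient lists in Z_q[X]/(X^N + 1).

    Assumption: the reduction polynomial is X^N + 1, so X^N wraps to -1.
    The cost is N*N coefficient multiplications.
    """
    n_len = len(a)
    assert len(b) == n_len
    c = [0] * n_len
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n_len:
                c[k] = (c[k] + ai * bj) % q
            else:
                # X^(N + r) equals -X^r in this ring
                c[k - n_len] = (c[k - n_len] - ai * bj) % q
    return c
```

The quadratic loop is what Toom-Cook replaces with several shorter Schoolbook products plus pre- and post-processing.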

The Toom-Cook algorithm uses recursive decomposition: in a preprocessing stage it decomposes a high-degree polynomial multiplication into several low-degree polynomial multiplications, then completes these with the Schoolbook algorithm, and finally combines the intermediate results in post-processing to recover the final high-degree result. However, the overhead of preprocessing and post-processing can be quite high: some work implementing Toom-Cook ring polynomial multiplication on the ARM Cortex-M4 indicates that preprocessing and post-processing account for 44% of the total cost of a single ring polynomial multiplication. Moreover, most current optimization work focuses on the Schoolbook multiplication used inside the Toom-Cook algorithm and neglects post-processing. Of the two stages, preprocessing merely splits and evaluates the polynomials (requiring only additions), whereas post-processing interpolates the Schoolbook results (requiring many multiplications), performs shift accumulation, and performs polynomial modular reduction, with many redundant calculations and steps. Post-processing therefore not only requires considerable intermediate memory but also limits the speed of Toom-Cook. In summary, existing Toom-Cook ring polynomial multiplication algorithms and hardware accelerators still have the following problems:

1. There is considerable computational redundancy at the algorithm level, and the algorithm steps are cumbersome, especially polynomial shift accumulation and polynomial modular reduction in the post-processing flow.

2. Because both polynomial shift accumulation and polynomial modular reduction in the post-processing flow need the interpolation results, storing intermediate results is unavoidable in the conventional Toom-Cook algorithm. In LBC, not only is the polynomial degree large but the coefficient bit length is also very large, so the intermediate results require a large amount of on-chip storage.

3. As the Toom-Cook parameter n increases, the size of the interpolation matrix grows geometrically, so in current Toom-Cook ring polynomial multiplication accelerators a large share of the computation is executed serially, and the parallelism inherent in Toom-Cook is not fully exploited.

4. At present, most Toom-Cook ring polynomial multiplication accelerators support only one parameter and have a single memory structure, which means that the data path is fixed, the parallelism and storage requirements of different Toom-Cook parameters cannot be met at the same time, and flexibility is low.

Disclosure of Invention

To solve the problems in the prior art, the invention provides an encryption and decryption method based on Toom-Cook ring polynomial multiplication, and a ring polynomial multiplier, which simplify the flow of the Toom-Cook algorithm, reduce the on-chip storage requirement, greatly reduce the number of redundant operations in Toom-Cook post-processing, speed up encryption and decryption, and support multiple Toom-Cook parameters.

The invention is realized by the following technical scheme:

An encryption and decryption method based on Toom-Cook ring polynomial multiplication comprises the following steps:

S1: acquiring information and processing it to obtain a first polynomial and a second polynomial;

S2: splitting the first polynomial and the second polynomial to obtain a first splitting result and a second splitting result;

S3: evaluating the first splitting result and the second splitting result to obtain evaluation values, a first univariate polynomial and a second univariate polynomial;

S4: multiplying the first univariate polynomial and the second univariate polynomial correspondingly to obtain an intermediate result and an interpolation matrix;

S5: expressing the undetermined coefficients W_i in terms of the interpolation matrix and the intermediate result; zero-padding W_i and dividing it into two parts, denoted W1_i and W2_i; applying polynomial modular reduction to W1_i and W2_i to obtain W1'_i and W2'_i; correspondingly, splitting the interpolation matrix into a first interpolation matrix IM1 and a second interpolation matrix IM2, and splitting the intermediate result into a first part and a second part; interpolating the first part with the first interpolation matrix IM1 to obtain a first interpolation result, and interpolating the second part with the second interpolation matrix IM2 to obtain a second interpolation result;

S6: adding the first interpolation result and the second interpolation result to obtain the polynomial multiplication result.

When the information is information to be encrypted, the first polynomial and the second polynomial are the plaintext information and the encryption key, respectively, and the polynomial multiplication result is the ciphertext; when the information is ciphertext, the first polynomial and the second polynomial are the ciphertext and the decryption key, respectively, and the polynomial multiplication result is the decrypted information.

Preferably, in S5, the interpolation of the first part of the intermediate result with the first interpolation matrix IM1 and the interpolation of the second part with the second interpolation matrix IM2 are performed in parallel.

Preferably, if an element of the first interpolation matrix IM1 is a sum of powers of 2, the first part of the intermediate result is interpolated with the first interpolation matrix by shift accumulation; if an element of the first interpolation matrix IM1 is a modular inverse, the first part is interpolated with the first interpolation matrix by multiplication and shifting;

if an element of the second interpolation matrix IM2 is a sum of powers of 2, the second part of the intermediate result is interpolated with the second interpolation matrix by shift accumulation; if an element of the second interpolation matrix IM2 is a modular inverse, the second part is interpolated with the second interpolation matrix by multiplication and shifting.

Preferably, the first polynomial and the second polynomial are A(X) and B(X), respectively, and their product is C(X).

S2 specifically comprises: splitting A(X), B(X) and C(X) into the following forms:

S3 specifically comprises: selecting 2n-1 values of Y for evaluation, as follows:

substituting these values into the split forms of the first and second polynomials to obtain the first univariate polynomial and the second univariate polynomial, as follows:

The specific calculation method of S4 is as follows:

S5 specifically comprises: the undetermined coefficients W_i are expressed as formula (8):

the interpolation matrix is expressed as formula (9):

W1_i and W2_i are expressed as formula (14) and formula (15), respectively:

W1'_i and W2'_i are expressed as formula (16) and formula (17), respectively:

the first interpolation matrix and the second interpolation matrix are expressed as formula (19) and formula (20), respectively:

where N is the length of the polynomial ring, i = 1, 2, …, n-1, n is the parameter of the Toom-Cook algorithm, and q is the modulus of the polynomial coefficients.

A ring polynomial multiplier comprises a polynomial data storage module, a polynomial evaluation module and a heterogeneous PE array.

The polynomial evaluation module is used to split the input first polynomial and second polynomial to obtain a first splitting result and a second splitting result; to evaluate the first splitting result and the second splitting result to obtain evaluation values, a first univariate polynomial and a second univariate polynomial; and to output the evaluation values, the first univariate polynomial and the second univariate polynomial to the heterogeneous PE array.

The heterogeneous PE array is used to multiply the input first univariate polynomial and second univariate polynomial correspondingly to obtain an intermediate result and an interpolation matrix. The interpolation matrix is split into a first interpolation matrix IM1 and a second interpolation matrix IM2, which are stored in the polynomial data storage module, and the intermediate result is split into a first part and a second part, which are stored in the polynomial data storage module. The first part is interpolated with the first interpolation matrix IM1 to obtain a first interpolation result, and the second part is interpolated with the second interpolation matrix IM2 to obtain a second interpolation result; the first interpolation result and the second interpolation result are added to obtain the polynomial multiplication result.

The interpolation matrix and the intermediate result are split according to the following principle: the undetermined coefficients W_i, expressed in terms of the interpolation matrix and the intermediate result, are zero-padded and divided into two parts, denoted W1_i and W2_i; polynomial modular reduction is applied to W1_i and W2_i to obtain W1'_i and W2'_i; correspondingly, the interpolation matrix is split into the first interpolation matrix IM1 and the second interpolation matrix IM2, and the intermediate result is split into the first part and the second part, which are stored in the polynomial data storage module.

Preferably, the heterogeneous PE array includes a Barrett modular multiplication and shift unit, a shift accumulation unit, and a 7-input adder tree.

If an element of the first interpolation matrix IM1 is a sum of powers of 2, the first part of the intermediate result is interpolated with the first interpolation matrix using the shift accumulation unit and the 7-input adder tree; if an element of the first interpolation matrix IM1 is a modular inverse, the first part is interpolated with the first interpolation matrix using the Barrett modular multiplication and shift unit and the 7-input adder tree.

If an element of the second interpolation matrix IM2 is a sum of powers of 2, the second part of the intermediate result is interpolated with the second interpolation matrix using the shift accumulation unit and the 7-input adder tree; if an element of the second interpolation matrix IM2 is a modular inverse, the second part is interpolated with the second interpolation matrix using the Barrett modular multiplication and shift unit and the 7-input adder tree.

Further, the shift accumulation unit comprises a shifter, an adder, a sign determination module and a Mod q module.

Preferably, the heterogeneous PE array includes a modular multiply-accumulate unit.

The modular multiply-accumulate unit is used to multiply the first univariate polynomial and the second univariate polynomial correspondingly to obtain the intermediate result and the interpolation matrix.

Further, the modular multiply-accumulate unit comprises a Barrett modular multiplier and a modular adder.
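
The behaviour of a modular multiply-accumulate unit built from a Barrett modular multiplier and a modular adder can be sketched in software as follows. This is a textbook Barrett reduction shown for illustration, not the circuit of the invention; the function names and the choice k = bit length of q are assumptions.

```python
def barrett_setup(q):
    """Precompute the Barrett constant for a modulus q >= 2 (q need not be prime)."""
    k = q.bit_length()            # guarantees q < 2**k
    mu = (1 << (2 * k)) // q      # floor(4**k / q)
    return k, mu

def barrett_reduce(x, q, k, mu):
    """Reduce 0 <= x < q*q modulo q using only multiplications, shifts and subtractions."""
    t = (x * mu) >> (2 * k)       # estimate of x // q that is never too large
    r = x - t * q
    while r >= q:                 # a small number of corrective subtractions
        r -= q
    return r

def barrett_modmul(a, b, q, k, mu):
    """Modular product of two residues a, b in [0, q)."""
    return barrett_reduce(a * b, q, k, mu)

def mod_add(a, b, q):
    """Modular sum of two residues in [0, q) with one conditional subtraction."""
    s = a + b
    return s - q if s >= q else s

def mma(acc, a, b, q, k, mu):
    """Modular multiply-accumulate: (acc + a * b) mod q."""
    return mod_add(acc, barrett_modmul(a, b, q, k, mu), q)
```

For example, with `k, mu = barrett_setup(8192)`, the call `mma(acc, a, b, 8192, k, mu)` accumulates one term of a Schoolbook product without a division.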

Preferably, the polynomial evaluation module extracts the data path portions that are common to the evaluation paths of the individual parameters and fuses the data paths of the other parameters into a single data path; parameter selection is performed at the output, so that the results of different data paths are output according to the parameter, where the parameter refers to the Toom-Cook algorithm parameter.

Compared with the prior art, the invention has the following beneficial effects:

The encryption and decryption method is based on an improved Toom-Cook ring polynomial multiplication, that is, the Toom-Cook post-processing is simplified and improved: two steps of the original algorithm, polynomial recombination and polynomial modular reduction, are integrated into the interpolation process, so that the Toom-Cook algorithm yields the final result of the ring polynomial multiplication directly after interpolation, with no further steps, which simplifies the algorithm flow. In addition, the interpolation matrix is modified so that polynomial recombination and polynomial modular reduction are mapped into it, which eliminates a large number of redundant arithmetic operations: the number of post-processing arithmetic operations is reduced by at least 33.33% compared with the original Toom-Cook algorithm. This effectively reduces the time complexity and space complexity of the post-processing part of the algorithm, speeds up encryption and decryption, and allows a smaller processing element (PE) array in a hardware implementation.

Furthermore, the computations for the two new interpolation matrices are performed in parallel, which gives further acceleration.

Further, when an element in the expression of W1_i or W2_i is a sum of powers of 2, shifts and additions replace multiplication, which reduces multiplications at the hardware level, further reduces the use of digital signal processors (DSPs), and lowers resource usage and hardware area.
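
Replacing a constant multiplication by shifts and additions can be sketched as follows. The representation of the constant as a list of (sign, shift) pairs and the function name `shift_accumulate` are assumptions made for illustration; the sketch only mirrors the roles of the shifter, adder, sign determination and Mod q stages.

```python
def shift_accumulate(x, terms, q):
    """Compute (c * x) mod q, where c = sum(sign * 2**shift for sign, shift in terms).

    terms: list of (sign, shift) pairs with sign in {+1, -1}; for example
    [(1, 3), (-1, 0)] encodes c = 8 - 1 = 7.
    """
    acc = 0
    for sign, shift in terms:
        shifted = x << shift                              # shifter
        acc = acc + shifted if sign > 0 else acc - shifted  # sign determination + adder
    return acc % q                                        # final Mod q stage
```

No multiplier is used for such matrix entries, which is why they do not consume DSP resources in hardware.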

The invention improves Toom-Cook ring polynomial multiplication, eliminates a large number of redundant arithmetic operations, and reduces the size of the PE array. The heterogeneous PE array that is constructed greatly reduces DSP usage and achieves efficient computation. Experiments show that the design of the invention has significant advantages in performance and area efficiency over existing implementations.

Furthermore, the invention constructs a reconfigurable polynomial evaluation module that fuses Toom-Cook-2, 3, 4 through a high degree of logic sharing and realizes a more flexible algorithm execution strategy, thereby meeting the on-chip storage and parallelism requirements of Toom-Cook-2, 3, 4.

Drawings

Fig. 1 shows the Toom-Cook post-processing method (taking n = 4 as an example): (a) recombination by shift accumulation and polynomial modular reduction in the original post-processing algorithm; (b) schematic of the interpolation calculation in the original post-processing algorithm; (c) the key step of the post-processing compression optimization of the invention; (d) post-processing after compression optimization (interpolation, shift accumulation and polynomial modular reduction merged into one step);

Fig. 2 shows the reconfigurable accelerator hardware architecture for Toom-Cook ring polynomial multiplication;

Fig. 3 shows the memory structure mapping of Toom-Cook-2, 3, 4 in Poly-Buffer-A, B, C;

Fig. 4 shows the data path of the reconfigurable polynomial evaluation module;

Fig. 5 shows the three PE structures in the heterogeneous PE array: (a) Barrett modular multiplication and shift unit (BSE); (b) Barrett modular multiplier; (c) modular adder; (d) modular multiply-accumulate unit (MMA); (e) shift accumulation unit (SAC); (f) modular addition algorithm; (g) implementation details of the Mod q block in the SAC;

Fig. 6 shows the mapping of Toom-Cook-2, 3, 4 onto the multiplier and the algorithm execution flow: (a) execution flow for the different Toom-Cook parameters; (b) configuration of the heterogeneous PE array when Toom-Cook-4 is mapped; (c) configuration of the heterogeneous PE array when Toom-Cook-3 is mapped; (d) configuration of the heterogeneous PE array when Toom-Cook-2 is mapped;

Fig. 7 shows the improved Toom-Cook-2, 3, 4 interpolation matrices for specific evaluation values.

Detailed Description

For a further understanding of the present invention, the present invention is described below in conjunction with the following examples, which are provided to further illustrate the features and advantages of the present invention and are not intended to limit the claims of the present invention.

The invention discloses an encryption and decryption method based on Toom-Cook ring polynomial multiplication, which comprises the following steps:

The invention improves Toom-Cook ring polynomial multiplication; the derivation is described below.

The original Toom-Cook algorithm used in prior-art lattice cryptography is as follows.

For polynomials A(X) and B(X) on the ring, to compute C(X) = A(X)·B(X), the polynomials are first split.

1) Polynomial splitting (preprocessing): A(X), B(X) and C(X) are rewritten as follows:

where B_i is defined in the same way as A_i; N is the length of the polynomial ring; i = 1, 2, …, n-1; n is the parameter of the Toom-Cook algorithm, typically 2, 3 or 4; and q is the modulus of the polynomial coefficients. Next, solving for W_i requires building 2n-1 equations.

2) Polynomial evaluation (preprocessing): 2n-1 values of Y are selected for evaluation and substituted into the split forms of A and B, which gives the following:

3) Low-order Schoolbook: from the evaluated values of A and B, their product at each evaluation point is then obtained by Schoolbook multiplication.

4) Interpolation (post-processing):

Equation (8) is the mathematical form of interpolation; the matrix in y_i on its right-hand side is called the interpolation matrix IM ∈ R^((2n-1)×(2n-1)), as shown in equation (9).
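
The equations referenced in steps 1) to 4) are not reproduced in this text. For orientation only, the standard Toom-Cook-n formulation can be written as below. This is a hedged reconstruction from the textbook algorithm rather than the original equations: the symbols P(y_k) and a_j, the substitution Y = X^{N/n}, the zero-based indices, and the restriction to finite evaluation points are all assumptions introduced here, and the original equation numbers are deliberately not assigned.

```latex
\begin{aligned}
A(X) &= \sum_{i=0}^{n-1} A_i(X)\,Y^{i}, \qquad
A_i(X) = \sum_{j=0}^{N/n-1} a_{\,iN/n+j}\,X^{j}, \qquad Y = X^{N/n},\\
C(X) &= A(X)\,B(X) = \sum_{i=0}^{2n-2} W_i(X)\,Y^{i},\\
P(y_k) &= A(y_k)\,B(y_k) = \sum_{i=0}^{2n-2} W_i\,y_k^{\,i}, \qquad k = 0,\dots,2n-2,\\
\begin{pmatrix} W_0\\ \vdots\\ W_{2n-2}\end{pmatrix}
&= \underbrace{\begin{pmatrix}
1 & y_0 & \cdots & y_0^{2n-2}\\
\vdots & \vdots & & \vdots\\
1 & y_{2n-2} & \cdots & y_{2n-2}^{2n-2}
\end{pmatrix}^{-1}}_{\mathrm{IM}}
\begin{pmatrix} P(y_0)\\ \vdots\\ P(y_{2n-2})\end{pmatrix}.
\end{aligned}
```

Each W_i has 2N/n - 1 coefficients, and interpolation acts on every coefficient position independently, which is the linearity the later derivation relies on.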

5) Polynomial recombination and polynomial modular reduction (post-processing): as shown in Fig. 1(a).

First, polynomial recombination and polynomial modular reduction in post-processing are re-expressed in terms of W_i, i.e., equations (10) and (11). For ease of description, a distinct gray level marks the data required to compute each of C_0, C_1, C_2 and C_3, and each vector is zero-padded so that all vectors have the same length, as shown in Fig. 1(a). That is, the sum of the W_i vectors of one gray level equals one quarter of C(X).
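
The two original post-processing steps that follow interpolation, shift accumulation and polynomial modular reduction, can be sketched as follows. The sketch assumes the ring Z_q[X]/(X^N + 1), which the text does not state explicitly; `exact_w` is a hypothetical stand-in that produces the W_i directly from block products so that the example does not depend on any particular interpolation matrix.

```python
def schoolbook(a, b):
    """Plain polynomial product over the integers (no reduction)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def exact_w(a, b, n):
    """Stand-in for interpolation: W_i as the sum of A_u * B_v over u + v = i."""
    m = len(a) // n
    blk_a = [a[i * m:(i + 1) * m] for i in range(n)]
    blk_b = [b[i * m:(i + 1) * m] for i in range(n)]
    w = [[0] * (2 * m - 1) for _ in range(2 * n - 1)]
    for u in range(n):
        for v in range(n):
            for t, coeff in enumerate(schoolbook(blk_a[u], blk_b[v])):
                w[u + v][t] += coeff
    return w

def recombine_and_reduce(w, m, q):
    """Original-style post-processing applied to the 2n-1 vectors W_i.

    Step 1 (shift accumulation): add W_i at offset i*m into a length-2N buffer.
    Step 2 (polynomial modulo): fold the upper half back with a sign flip,
            assuming the reduction polynomial X^N + 1.
    """
    n = (len(w) + 1) // 2
    big_n = n * m
    full = [0] * (2 * big_n)
    for i, wi in enumerate(w):
        for t, coeff in enumerate(wi):
            full[i * m + t] += coeff
    return [(full[k] - full[k + big_n]) % q for k in range(big_n)]
```

Both steps need every W_i to be available first, which is why the original flow has to store all (2N/n - 1) × (2n - 1) intermediate coefficients.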

In the post-processing of the original Toom-Cook algorithm, the steps are complex and interpolation requires many multiplications. The invention therefore improves the post-processing, simplifying the steps and reducing the number of multiplications.

The invention provides an improved Toom-Cook ring polynomial multiplication method, as shown in Fig. 1(b), (c) and (d). In these panels, the left side shows how the invention classifies, decomposes and merges the data: following the gray-level classification of Fig. 1(a), the data on the left are composed of blocks of the corresponding gray level. The right side is the calculation schematic corresponding to the data on the left. That is, the left and right sides are two representations of the same set of data: the left side uses W_i, and the right side uses the corresponding mathematical expression (the product of the interpolation matrix and the intermediate results). The operations that the invention performs on the gray-level blocks on the left are reflected in the calculation schematic on the right.

Reorienting the W_i of Fig. 1(a) returns them to matrix form, as shown in Fig. 1(b). The distribution of data of the same gray level shows that the left half and the right half of the W_i matrix must be added in a staggered manner, and the upper half and the lower half must be added correspondingly. Therefore, to simplify the matrix operation, the invention rearranges W_i and splits it into two parts, W1_i and W2_i, as shown in Fig. 1(c).

Next, the invention maps W1_i and W2_i back to the expression on the right, that is, the product of the interpolation matrix and the intermediate results. As equation (8) shows, W_i is linearly related to the intermediate results, so the operations on the W_i matrix in Fig. 1(b) can be mapped directly onto the expression of the intermediate results. The intermediate results are therefore likewise divided into two parts and zero-padded so that the two parts are aligned and have the same size. For general Toom-Cook-n, the expressions of the two parts are given by equations (12) and (13).

As with equation (8), from these two expressions the invention obtains the expression of W1_i in terms of the first part and the expression of W2_i in terms of the second part, given by equations (14) and (15), respectively.

Because the zero-padded entries are zero, the corresponding data can be ignored. Equations (14) and (15) have thus mapped the splitting of W_i in Fig. 1(b) back onto the interpolation matrix; next, the subtraction between the corresponding upper and lower halves must also be mapped back onto the interpolation matrix. The resulting W1'_i, W2'_i and C_i are given by equations (16), (17) and (18).

At this point, the original interpolation matrix (9) has been transformed into the two smaller interpolation matrices (19) and (20).

Polynomial recombination and polynomial modular reduction are no longer required in post-processing. The final ring polynomial multiplication result C(X) is obtained simply by adding the results computed with the two smaller interpolation matrices, as in equation (18) and as shown in Fig. 1(d).
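
The idea of folding recombination and modular reduction into two smaller interpolation matrices can be checked numerically with the sketch below. It is an illustration under stated assumptions, not the formulas (19) and (20) of the original: the ring is taken to be Z_q[X]/(X^N + 1), the evaluation points are the finite integers 0, ±1, ±2, …, the inverse Vandermonde matrix is computed exactly over the rationals, and all function names are hypothetical. Under these assumptions row j of the first matrix is IM[j] - IM[j+n] and row j of the second is IM[j-1] - IM[j+n-1], with out-of-range rows taken as zero.

```python
from fractions import Fraction

def schoolbook(a, b):
    """Plain polynomial product over the integers."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def invert(matrix):
    """Gauss-Jordan inverse over the rationals (exact arithmetic)."""
    size = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(r == c)) for c in range(size)]
           for r, row in enumerate(matrix)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

def merged_matrices(n, points):
    """Fold recombination and reduction mod X^N + 1 into two n x (2n-1) matrices."""
    im = invert([[Fraction(y) ** i for i in range(2 * n - 1)] for y in points])
    zero = [Fraction(0)] * (2 * n - 1)

    def row(i):
        return im[i] if 0 <= i <= 2 * n - 2 else zero

    im1 = [[x - y for x, y in zip(row(j), row(j + n))] for j in range(n)]
    im2 = [[x - y for x, y in zip(row(j - 1), row(j + n - 1))] for j in range(n)]
    return im1, im2

def toom_cook_ring_mul(a, b, n, q, points):
    """Ring product obtained directly from interpolation with the two merged matrices."""
    m = len(a) // n
    blk_a = [a[i * m:(i + 1) * m] for i in range(n)]
    blk_b = [b[i * m:(i + 1) * m] for i in range(n)]

    def evaluate(blocks, y):
        return [sum(blocks[i][t] * y ** i for i in range(n)) for t in range(m)]

    # Pointwise Schoolbook products, zero-padded from length 2m-1 to 2m.
    prods = [schoolbook(evaluate(blk_a, y), evaluate(blk_b, y)) + [0] for y in points]
    low = [p[:m] for p in prods]     # first part of the intermediate result
    high = [p[m:] for p in prods]    # second part of the intermediate result
    im1, im2 = merged_matrices(n, points)
    c = []
    for j in range(n):
        for t in range(m):
            v = sum(im1[j][k] * low[k][t] + im2[j][k] * high[k][t]
                    for k in range(2 * n - 1))
            assert v.denominator == 1    # the combination is always an integer
            c.append(int(v) % q)
    return c
```

For n = 4 each merged matrix has 4 rows and 7 columns, compared with 7 × 7 for the unmerged interpolation matrix, and no buffer of W_i values is ever formed.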

Compared with the post-processing of the original Toom-Cook algorithm, the Toom-Cook-n algorithm with improved post-processing folds polynomial recombination and polynomial modular reduction into the interpolation matrix, which simplifies the algorithm flow. The original interpolation matrix is split into two new interpolation matrices whose computations are performed in parallel without data conflicts, which greatly reduces the number of subsequent multiplications and the memory needed for intermediate results, and thus effectively reduces the time complexity and space complexity of the post-processing part of the algorithm. In post-processing, the main computation and memory overhead comes from computing the intermediate results W_i. Assuming that the bit length of the modulus is l, the original algorithm requires a total memory of (2N/n-1) × (2n-1) × l bits for W_i. Computing each coefficient of W_i requires 2n-1 multiplications and 2n-2 additions. Thus, computing W_i requires (2N/n-1) × (2n-1)² multiplications and (2N/n-1) × (2n-1) × (2n-2) additions in total. Under the same conditions, in the improved Toom-Cook-n post-processing of the invention the intermediate results are W1_i and W2_i instead of W_i, and the total memory required is (2N-n) × l bits. Computing each coefficient of W1_i and W2_i requires 2n-1 multiplications and 2n-2 additions. Thus, computing W1_i and W2_i requires (2N-n) × (2n-1) multiplications and (2N-n) × (2n-2) additions in total. The comparison is summarized in Table 1.

Table 1. Toom-Cook-n post-processing comparison
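
The counts compared in Table 1 can be tabulated with a short calculation. This sketch assumes that N is divisible by n and takes the per-coefficient costs stated above (2n-1 multiplications and 2n-2 additions) times the number of stored intermediate coefficients; the function name and the dictionary keys are hypothetical.

```python
def post_processing_cost(big_n, n, bit_len):
    """Return (original, improved) post-processing costs for Toom-Cook-n.

    big_n: ring length N (assumed divisible by n); bit_len: modulus bit length l.
    """
    coeffs_orig = (2 * big_n // n - 1) * (2 * n - 1)   # coefficients of all W_i
    coeffs_new = 2 * big_n - n                         # coefficients of W1_i and W2_i

    def cost(coeffs):
        return {
            "memory_bits": coeffs * bit_len,
            "multiplications": coeffs * (2 * n - 1),
            "additions": coeffs * (2 * n - 2),
        }

    return cost(coeffs_orig), cost(coeffs_new)
```

With these formulas the improved count is n/(2n-1) of the original in every column, so the saving is 1/3 for n = 2, 2/5 for n = 3 and 3/7 for n = 4, which is consistent with the stated reduction of at least 33.33%.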

The reconfigurable accelerator hardware architecture for Toom-Cook ring polynomial multiplication implemented by the invention is shown in Fig. 2.

The invention implements Toom-Cook with the improved post-processing in hardware, realizing a reconfigurable and efficient Toom-Cook-based ring polynomial multiplier. The invention has a flexible storage structure and general-purpose heterogeneous computing PEs, and can adapt to the common Toom-Cook-2, 3, 4 algorithms without additional hardware resources. For single-level Toom-Cook-2, 3, 4, the invention supports polynomial multiplication of up to 256 × 256. By improving post-processing, the invention greatly reduces the hardware resources needed for interpolation at the same number of computation cycles. The top-level architecture of the hardware accelerator is shown in Fig. 2 and consists mainly of four parts:

1. flexible and efficient polynomial data storage module: the buffers for the input data, intermediate results, and output data are also interfaces to off-chip memory.

2. Reconfigurable polynomial evaluation module: correspondingly completing calculation of related tasks in a from-Cook preprocessing stage, namely splitting an input first polynomial and a second polynomial to obtain a first splitting result and a second splitting result; evaluating the first splitting result and the second splitting result to obtain an evaluation value, a first univariate polynomial and a second univariate polynomial, and outputting the evaluation value, the first univariate polynomial and the second univariate polynomial to the heterogeneous PE array; three parameters of from-Cook-2, 3,4 can be supported.

3. Heterogeneous PE array: the method comprises the steps of calculating a low-order Schoolboost and a post-processing task, namely, correspondingly multiplying a first univariate polynomial and a second univariate polynomial which are input to obtain an intermediate result and an interpolation matrix; the interpolation matrix is split into a first interpolation matrix IM1 and a second interpolation matrix IM2 which are stored in a polynomial data storage module, and the intermediate result is split into a first interpolation matrix IM2 and a second interpolation matrix IM2And->A polynomial data storage module; will->Interpolation operation is carried out with the first interpolation matrix IM 1to obtain a first interpolation result, and +.>Performing interpolation operation with a second interpolation matrix IM2 to obtain a second interpolation result; and adding the first interpolation result and the second interpolation result to obtain a polynomial multiplication result. Consists of four PE units of a barrett modular multiplication and shift unit (BSE), a modular multiplication accumulation unit (MMA), a shift accumulation unit (SAC) and a 7-input addition tree (AT-7). 
If W1 _i The elements of the first interpolation matrix IM1 in the expression are added to the power of 2, then +.>Performing interpolation operation with the first interpolation matrix by using a shift accumulation unit and a 7-input addition tree, if W1 _i The element of the first interpolation matrix IM1 in the expression is the element of the modulo inversion, then +.>Carrying out interpolation operation with the first interpolation matrix by adopting a barrett modular multiplication and shift unit and a 7-input addition tree; if W2 _i The elements of the second interpolation matrix IM2 in the expression are added to the power of 2, then +.>Performing interpolation operation with the second interpolation matrix by using a shift accumulation unit and a 7-input addition tree, if W2 _i The element of the second interpolation matrix IM2 in the expression is the element of the modulo inversion, then +.>And carrying out interpolation operation on the second interpolation matrix by adopting a barrett modular multiplication and shift unit and a 7-input addition tree.

Thanks to the reduced size of the interpolation matrix AT the algorithm level, the heterogeneous PE array also reduces in size, the original interpolation matrix of size 7×7 requires 7×7 BSEs and 7 ATs-7, the interpolation matrix now has size only 4×7, only 4×7 BSEs and 4 ATs-7, and further, some of them are replaced by shifting and adding, that is, SAC is used to replace BSEs, so that only 15 BSEs are used. Since BSE uses a large number of DSPs, and SAC uses only a very simple shifter of adders, compared to very few resources used by SAC, it is advantageous that the number of BSEs used is greatly reduced.

4. Controller: its main functions are address generation for each bank, data path selection, task scheduling, and pre-computation of some fixed values.

The memory structure mapping of Toom-Cook-2, 3, 4 in the Poly-Buffer implemented by the invention is shown in FIG. 3.

Based on the Poly-Buffer storage structure, utilization is highest when Toom-Cook-4 is executed, while Toom-Cook-2 and Toom-Cook-3 can also meet their parallelism requirements without extra memory. Note that, to minimize the number of address-line interfaces when storing polynomials of the same size, if a free bank is available in Poly-Buffer A or B, a 64-depth bank is preferred over 4 16-depth banks or 8-depth banks. For Toom-Cook-2, coefficient moduli q ≤ 2^32 - 1 are supported; for Toom-Cook-3, q ≤ 2^31 - 1; and for Toom-Cook-4, q ≤ 2^29 - 1.

The data path of the reconfigurable polynomial evaluation module implemented by the invention is shown in FIG. 4. The module extracts the common part of the evaluation paths of the individual parameters and fuses the data paths of the different parameters into a single data path; parameter selection is performed at the output, so that the data result of the appropriate data path is output for each parameter. Here, the parameters are the Toom-Cook algorithm parameters.

When executing Toom-Cook-2, 3, and 4, the invention obtains 3, 5, and 7 pairs of evaluation results, respectively. For Toom-Cook-3 and 4, the evaluation results of only one pair of input polynomials A and B can be calculated at a time. For Toom-Cook-2, the evaluation results of up to two pairs of input polynomials, A1, B1 and A2, B2, can be calculated simultaneously, giving 2 sets of results with 3 pairs of evaluation results each. For Toom-Cook-2 with a single pair of input polynomials A and B, if the storage structure of the split sub-polynomials supports reading two values simultaneously (for example, each sub-polynomial is stored in two banks in the design of the invention), the polynomial evaluation completes in half the number of cycles needed when only one value can be read at a time.
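The counts of 3, 5, and 7 pairs follow from Toom-Cook-k evaluating each operand at 2k - 1 points. The sketch below illustrates this in Python under stated assumptions: the evaluation points in `POINTS` are a commonly used choice and are not taken from this document, and the modulus `Q` and all function names are hypothetical.

```python
Q = 7681  # hypothetical coefficient modulus, for illustration only

# Assumed evaluation points for Toom-Cook-k (None stands for the point at infinity).
POINTS = {
    2: [0, 1, None],
    3: [0, 1, -1, 2, None],
    4: [0, 1, -1, 2, -2, 3, None],
}


def split(coeffs, k):
    """Split a coefficient list (low order first) into k equal-length sub-polynomials."""
    if len(coeffs) % k:
        raise ValueError("length must be divisible by k (zero-pad first)")
    m = len(coeffs) // k
    return [coeffs[j * m:(j + 1) * m] for j in range(k)]


def evaluate(coeffs, k, q=Q):
    """Evaluate the k sub-polynomials at the 2k - 1 points, coefficient-wise mod q."""
    parts = split(coeffs, k)
    m = len(parts[0])
    results = []
    for p in POINTS[k]:
        if p is None:
            # The point at infinity selects the highest-order sub-polynomial.
            results.append([c % q for c in parts[-1]])
        else:
            results.append([sum(part[t] * p ** j for j, part in enumerate(parts)) % q
                            for t in range(m)])
    return results


def evaluate_pair(a, b, k, q=Q):
    """Return the 2k - 1 pairs of evaluation results for the operands a and b."""
    return list(zip(evaluate(a, k, q), evaluate(b, k, q)))
```

With 256 coefficients, `evaluate_pair(a, b, 4)` yields 7 pairs of length-64 vectors and `evaluate_pair(a, b, 2)` yields 3 pairs of length-128 vectors; for Toom-Cook-3 the operands would first be zero-padded to 264 coefficients, giving 5 pairs of length 88.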

Three PE structures in the heterogeneous PE array are implemented in accordance with the invention, as shown in FIG. 5.

The heterogeneous PE array consists of four kinds of PE: MMA, BSE, SAC, and AT-7.

1. AT-7 is an ordinary 7-input, 3-level adder tree containing 6 modular adders; it is not described in detail here. The heterogeneous PE array contains an AT-7 array of size 1×4.

2. The BSE consists of a Barrett modular multiplier and a shift accumulator, as shown in FIG. 5(a). The heterogeneous PE array contains a BSE array of size 5×3.

3. The structure of the Barrett modular multiplier is similar to that of the BSE, as shown in FIG. 5(b).

4. The MMA consists of a Barrett modular multiplier and a modular adder, as shown in FIG. 5(d). The MMA array performs the low-order Schoolbook computation. The heterogeneous PE array contains an MMA array of size 7×4.

5. The structure and algorithm of the modular adder are shown in FIG. 5(c) and (f).

6. The SAC consists of a shifter, an adder, a sign judgment module, and a Mod q module, as shown in FIG. 5(e). The SACs and BSEs together perform the interpolation computation. The heterogeneous PE array contains a SAC array of size 3×3.

7. The internal structure of the Mod q module is shown in FIG. 5(g).
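As a software illustration of the arithmetic performed by the Barrett modular multiplier and the modular adder inside an MMA, the following Python sketch uses the textbook Barrett reduction. It is an assumption-laden sketch, not the circuit of FIG. 5: the parameter choice `k = q.bit_length()` and all function names are hypothetical.

```python
def barrett_params(q):
    """Precompute the bit width k and the constant mu = floor(4**k / q) for modulus q."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu


def barrett_reduce(x, q, k, mu):
    """Reduce 0 <= x < q*q modulo q with one multiplication, one shift and a correction."""
    t = (x * mu) >> (2 * k)   # estimate of floor(x / q), never too large
    r = x - t * q
    while r >= q:             # the estimate is off by at most a small amount
        r -= q
    return r


def barrett_mulmod(a, b, q, k, mu):
    """Modular multiplication of a and b (both already reduced mod q)."""
    return barrett_reduce(a * b, q, k, mu)


def mod_add(a, b, q):
    """Modular addition of two reduced operands using a single conditional subtraction."""
    s = a + b
    return s - q if s >= q else s


def mma_step(acc, a, b, q, k, mu):
    """One multiply-accumulate step: acc + a*b (mod q), as an MMA-style unit would compute."""
    return mod_add(acc, barrett_mulmod(a, b, q, k, mu), q)
```

The constant `mu` depends only on the modulus, so it can be computed once ahead of time, which fits the kind of fixed-value pre-computation attributed to the controller.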

Taking Toom-Cook-4 as an example: under the same hardware design approach, the Schoolbook implementation of the original algorithm is identical to the MMA array designed here, because the invention does not modify that part of the algorithm; Schoolbook is therefore excluded from the comparison. In post-processing interpolation, the interpolation matrix of the original algorithm is 7×7 (see formulas (7)-(9)), so under the same hardware design approach the original algorithm needs 49 BSEs, whereas the improved algorithm of the invention needs only 28. For this comparison the worst case is assumed throughout: the elements of the interpolation matrix are not the special values given by the invention but are user-defined, so the work otherwise done by SACs is also taken over by BSEs, and all interpolation computation is performed by BSEs. The 21 BSEs saved amount to 42.86% of the hardware resources of the original algorithm. This saving comes from improving the interpolation in Toom-Cook post-processing. The proportion of resources saved in hardware is consistent with the proportion of computation saved in theory, so the algorithmic improvement carries over effectively to hardware.
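The arithmetic behind these figures can be checked with a few lines of Python. This is only a worked calculation under the worst-case assumption stated above (one BSE per matrix element); the function name is hypothetical.

```python
def bse_saving(original_shape, improved_shape):
    """Count BSEs as one per matrix element and return (original, improved, saved, percent)."""
    original = original_shape[0] * original_shape[1]
    improved = improved_shape[0] * improved_shape[1]
    saved = original - improved
    return original, improved, saved, 100.0 * saved / original


original, improved, saved, percent = bse_saving((7, 7), (4, 7))
print(original, improved, saved, round(percent, 2))
```

The 7×7 matrix gives 49 BSEs, the 4×7 matrix gives 28, and the difference of 21 is 21/49, about 42.86%, of the original count.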

The mapping and execution flow of Toom-Cook-2, 3, 4 on the accelerator implemented by the invention are shown in FIG. 6.

The invention flexibly supports Toom-Cook-2, 3, and 4; the following describes how each is mapped and executed. Because the design is built around Toom-Cook-4 and is compatible with Toom-Cook-2 and 3, the utilization and throughput of the modules differ among the three, as does the latency of a complete Toom-Cook-2, 3, or 4 run, as shown in FIG. 6(a). When executing Toom-Cook-4, all modules (memory, evaluation module, and PEs) reach their highest utilization, 100%. The Schoolbook of Toom-Cook-4 is 64×64 (only 256×256 polynomial multiplication is considered) and the parallelism is 7, so 28 MMAs work simultaneously when computing Schoolbook. The interpolation matrix of Toom-Cook-4 is 4×7, as shown in FIG. 7; it contains 4 zeros, 9 elements that do not require modular inversion, and 15 elements that do, so 9 SACs and 15 BSEs work simultaneously, as shown in FIG. 6(b). Because the 7 Schoolbook groups run in parallel, each group is allocated only 4 MMAs, and 16 cycles with accumulation are needed to complete one 64×64 Schoolbook. This is mode 1 in FIG. 6(a).

When executing Toom-Cook-3, memory utilization also reaches 100%, but the throughput of the evaluation module and the PE utilization are lower. This is because the parameters of Toom-Cook-3 and Toom-Cook-4 are close but not equal: there are enough resources to run one Toom-Cook-3, but the remainder cannot support a second one and stays idle. For example, the Schoolbook of Toom-Cook-3 is 88×88 (the polynomial length must be divisible by 3 and the quotient divisible by 4, so the 256×256 polynomial multiplication is zero-padded to 264×264), the parallelism is 5, and each parallel group requires 4 MMAs, so 20 MMAs work simultaneously when computing Schoolbook. The interpolation matrix of Toom-Cook-3 is 3×5, with 1 zero, 10 elements that do not require modular inversion, and 4 elements that do. This calls for 10 SACs and 4 BSEs, but only 9 SACs are available, so one BSE substitutes for the missing SAC, and 9 SACs and 5 BSEs actually work simultaneously, as shown in FIG. 6(c). Because only 4 MMAs are available per Schoolbook group, 22 cycles with accumulation are needed to complete one 88×88 Schoolbook. This is mode 2 in FIG. 6(a).

The parameter of Toom-Cook-2 is half that of Toom-Cook-4, but its Schoolbook size is 128×128, twice that of Toom-Cook-4. Memory utilization for data storage is therefore very high, although not 100%. The polynomial evaluation module has the same throughput and latency as for Toom-Cook-4. Regarding PE utilization, the parallelism of Toom-Cook-2 is 3, and at most 8 MMAs can be allocated to each parallel group (the number of MMAs per group must be a factor of 128), so 24 MMAs work simultaneously when computing the low-order Schoolbook. MMA utilization lies between that of Toom-Cook-3 and Toom-Cook-4. The interpolation matrix of Toom-Cook-2 is 2×3, with 1 zero and 5 elements that do not require modular inversion, so only 5 SACs work simultaneously, as shown in FIG. 6(d). Because 8 MMAs are available per Schoolbook group, 16 cycles with accumulation are needed to complete one 128×128 Schoolbook. This is mode 3 in FIG. 6(a).
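The numbers quoted for the three modes can be reproduced from a handful of parameters. The following Python sketch is a bookkeeping aid under stated assumptions, not a model of the accelerator: the dictionary layout and function name are hypothetical, while the padded lengths, MMAs per group, and the total of 28 MMAs come from the description.

```python
TOTAL_MMA = 28  # MMA array of size 7x4, as described in the text

# Per-mode inputs taken from the description: padded polynomial length,
# Toom-Cook parameter k, and MMAs allocated to each parallel Schoolbook group.
MODES = {
    "Toom-Cook-4": {"padded_len": 256, "k": 4, "mma_per_group": 4},
    "Toom-Cook-3": {"padded_len": 264, "k": 3, "mma_per_group": 4},
    "Toom-Cook-2": {"padded_len": 256, "k": 2, "mma_per_group": 8},
}


def mode_mapping(padded_len, k, mma_per_group, total_mma=TOTAL_MMA):
    """Derive the parallelism, Schoolbook size, active MMAs and cycle count of one mode."""
    groups = 2 * k - 1                     # parallel Schoolbook multiplications
    sub_len = padded_len // k              # Schoolbook operand length
    active_mma = groups * mma_per_group    # MMAs working simultaneously
    cycles = sub_len // mma_per_group      # cycles (with accumulation) per Schoolbook
    return {
        "groups": groups,
        "schoolbook": sub_len,
        "active_mma": active_mma,
        "cycles": cycles,
        "mma_utilization": active_mma / total_mma,
    }


for name, params in MODES.items():
    print(name, mode_mapping(**params))
```

Under these inputs the sketch yields 28, 20, and 24 active MMAs and 16, 22, and 16 cycles for Toom-Cook-4, 3, and 2, which matches the figures in the text and places the MMA utilization of Toom-Cook-2 between the other two.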

Furthermore, the invention is dynamically configurable. To switch the mode mapping while the invention is running, only the configuration information needs to be changed; the mode switches dynamically without reprogramming, regenerating the bitstream file, or reflashing the FPGA.

The invention is implemented on a Xilinx Virtex-7 VC709 FPGA, with the highest order of the polynomial set to 256 and the coefficient modulus set to 32 bits; the operating frequency is 360 MHz. The total resource usage and the resource usage of each module are shown in Table 2, and the running performance is shown in Table 3.

Table 2 Implementation results of the invention on the VC709 FPGA

Item	LUTs	Flip-flops	On-chip memory	DSPs
Total available	433200	866400	6615KB	3600
Total used by the invention	19145	17475	4.5KB	473
Polynomial evaluation module	2357	1480	-	-
Single MMA	252	225	-	11
Single BSE	224	167	-	11
Single SAC	260	228	-	-
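The figures in Table 2 can be cross-checked against the array sizes given in the description (28 MMAs, 15 BSEs, 9 SACs). The Python sketch below is such a consistency check under stated assumptions: blank table cells are treated as 0, and the dictionary and function names are hypothetical.

```python
AVAILABLE = {"lut": 433200, "ff": 866400, "dsp": 3600}
USED = {"lut": 19145, "ff": 17475, "dsp": 473}

# Per-unit figures from Table 2; blank cells are assumed to mean 0.
PER_UNIT = {
    "MMA": {"lut": 252, "ff": 225, "dsp": 11},
    "BSE": {"lut": 224, "ff": 167, "dsp": 11},
    "SAC": {"lut": 260, "ff": 228, "dsp": 0},
}
UNIT_COUNT = {"MMA": 28, "BSE": 15, "SAC": 9}  # array sizes 7x4, 5x3 and 3x3
EVAL_MODULE = {"lut": 2357, "ff": 1480, "dsp": 0}


def utilization():
    """Fraction of each FPGA resource used by the whole design."""
    return {name: USED[name] / AVAILABLE[name] for name in USED}


def listed_total(resource):
    """Sum of one resource over the modules that Table 2 lists individually."""
    return EVAL_MODULE[resource] + sum(
        PER_UNIT[unit][resource] * UNIT_COUNT[unit] for unit in UNIT_COUNT)


print(utilization())
print({name: listed_total(name) for name in USED})
```

Under these assumptions the DSPs of the MMA and BSE arrays alone account for the reported total of 473, while the individually listed LUTs and flip-flops stay below the reported totals, leaving the remainder for modules not itemized in the table.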

Table 3 Performance of the invention running Toom-Cook-2, 3, 4 at 360 MHz

The above merely illustrates the technical idea of the invention and does not limit its scope of protection; any modification made to the technical solution in accordance with the technical idea of the invention falls within the scope of protection of the claims of the invention.

Claims

1. An encryption and decryption method based on Toom-Cook ring polynomial multiplication, characterized by comprising the following steps:

2. The encryption and decryption method based on Toom-Cook ring polynomial multiplication according to claim 1, wherein in S5, the interpolation operation with the first interpolation matrix IM1 and the interpolation operation with the second interpolation matrix IM2 are performed in parallel.

3. The encryption and decryption method based on Toom-Cook ring polynomial multiplication according to claim 1, wherein if the elements of the first interpolation matrix IM1 are sums of powers of 2, the interpolation operation with the first interpolation matrix is performed by shift accumulation, and if the elements of the first interpolation matrix IM1 are modular inverses, the interpolation operation with the first interpolation matrix is performed by multiplication and shifting;

4. The encryption and decryption method based on the polynomial multiplication on the Toom-Cook ring as claimed in claim 1, wherein the first polynomial and the second polynomial are respectively set asAnd

the specific calculation method of S4 is as follows:

the interpolation matrix is expressed as formula (9):

W1_i and W2_i are expressed as formula (14) and formula (15), respectively:

W1'_i and W2'_i are expressed as formula (16) and formula (17), respectively:

where N is the length of the polynomial ring, i = 1, 2, ..., n-1, N represents the Toom-Cook algorithm parameter, and q represents the modulus of the polynomial coefficients.

5. A ring polynomial multiplier, characterized by comprising a polynomial data storage module, a polynomial evaluation module, and a heterogeneous PE array;

6. The ring polynomial multiplier of claim 5, wherein the heterogeneous PE array comprises a Barrett modular multiplication and shift unit, a shift accumulation unit, and a 7-input addition tree;

if the elements of the first interpolation matrix IM1 are sums of powers of 2, the interpolation operation with the first interpolation matrix is performed using the shift accumulation unit and the 7-input addition tree, and if the elements of the first interpolation matrix IM1 are modular inverses, the interpolation operation with the first interpolation matrix is performed using the Barrett modular multiplication and shift unit and the 7-input addition tree;

7. The ring polynomial multiplier of claim 6, wherein the shift accumulation unit comprises a shifter, an adder, a sign judgment module, and a Mod q module.

8. The ring polynomial multiplier of claim 5, wherein the heterogeneous PE array comprises a modular multiply-accumulate unit;

9. The ring polynomial multiplier of claim 8, wherein the modular multiply-accumulate unit comprises a Barrett modular multiplier and a modular adder.

10. The ring polynomial multiplier of claim 5, wherein the polynomial evaluation module extracts the common data path part of the evaluation paths of the individual parameters, fuses the data paths of the different parameters into one data path, and performs parameter selection at the output so that the data result of the appropriate data path is output for each parameter, the parameters being the Toom-Cook algorithm parameters.