CN114780057A

CN114780057A - Polynomial hardware multiplier based on Saber key encapsulation and use method

Info

Publication number: CN114780057A
Application number: CN202210321371.4A
Authority: CN
Inventors: 刘伟强; 章渊拓; 崔益军; 徐天宇; 倪子颖; 王成华
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2022-03-30
Filing date: 2022-03-30
Publication date: 2022-07-22

Abstract

The invention provides a polynomial hardware multiplier based on Saber key encapsulation and a method of using it. The polynomial hardware multiplier comprises an addressing circuit, a public polynomial data loading module, a coefficient multiplication accumulation circuit and a control module. The control module controls the overall state sequence and provides the address indices of the two multiplication operands to the addressing circuit. The second storage unit outputs 64-bit data, from which the public polynomial data loading module steadily produces two continuous coefficient streams; at the same time, the two coefficient streams of the secret polynomial are read directly from the first storage unit by address. A 19-bit Com_s signal is formed from the lower 3-bit absolute values of the two secret polynomial coefficients and 13 zero bits, and the three signals enter the coefficient multiplication accumulation circuit for computation. The invention avoids the cycles spent on frequently reading and writing accumulation results, does not need to stall the polynomial multiplier, and shortens the operation time while keeping hardware resource consumption essentially unchanged and power consumption equally low.

Description

Polynomial hardware multiplier based on Saber key encapsulation and use method

Technical Field

The invention belongs to the technical field of information security encryption, and particularly relates to a polynomial hardware multiplier based on Saber key encapsulation and a using method thereof.

Background

After Shor's quantum algorithm appeared, traditional public-key encryption schemes such as RSA and ECC (elliptic curve cryptography) came under serious threat, because a quantum computer could break them in polynomial time. There are three key encapsulation schemes based on lattices, and the Saber key encapsulation scheme is one of them. Saber's security is based on the Module Learning with Rounding (M-LWR) problem, and the scheme is derived from a public-key encryption primitive via the Fujisaki-Okamoto (F-O) transform. In Saber, since the moduli are powers of 2, switching between moduli can be done by rounding, which also introduces the random error, so the sizes of the ciphertext and the public key are reduced, which is an advantage for lightweight hardware implementation.

In a concrete implementation without optimization, polynomial multiplication takes the most resources and operation time in Saber, on software and hardware platforms alike. However, because the modulus is not a prime number, the fast Number Theoretic Transform (NTT) cannot be used in a hardware implementation of Saber, and researchers instead use the Toom-Cook k-way and Karatsuba algorithms to speed up polynomial multiplication; the core idea of both is to reduce the number of multiplications. For example, in 2020 Yihong Zhu et al. proposed an 8-stage iterative Karatsuba polynomial multiplication structure that reduced the 65536 coefficient multiplications by 90% and greatly reduced the overall operation time. Such "divide and conquer" algorithms, which replace multiplications with additions and subtractions in hardware, can reduce the number of cycles, but they need more pre-processing and post-processing steps and therefore consume more hardware resources; the requirements of a single multiplier may even exceed the resources available on a hardware platform, which makes them unsuitable for lightweight implementation on resource-constrained platforms.

A lightweight hardware implementation aims to use as few resources as possible while still running reasonably fast. How to implement a lightweight multiplier efficiently is therefore of great significance.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides, in a first aspect, a polynomial hardware multiplier based on Saber key encapsulation, which comprises an addressing circuit, a public polynomial data loading module, a coefficient multiplication accumulation circuit and a control module;

the addressing circuit comprises a first storage unit and a second storage unit; the first output end of the control module is electrically connected with the first input end of the first storage unit through a first address line, the second output end of the control module is electrically connected with the second input end of the first storage unit through a second address line, and the third output end of the control module is electrically connected with the input end of the second storage unit through a third address line;

the public polynomial data loading module comprises a Buffer control unit, a first register, a second register, a third register, a three-to-one selector, a first delay register and a second delay register; the input end of the Buffer control unit is electrically connected with the output end of the second storage unit, the first output end of the Buffer control unit is electrically connected with the input end of the first register, the second output end of the Buffer control unit is electrically connected with the input end of the second register, the third output end of the Buffer control unit is electrically connected with the input end of the third register, and the fourth output end of the Buffer control unit is electrically connected with the control end of the three-to-one selector; the output end of the first register is electrically connected with the first input end of the three-to-one selector; the output end of the second register is electrically connected with the second input end of the three-to-one selector; the output end of the third register is electrically connected with the third input end of the three-to-one selector; the output end of the three-to-one selector is electrically connected with the input end of the first delay register; the output end of the first delay register is electrically connected with the input end of the second delay register;

the coefficient multiplication accumulation circuit comprises a first DSP and a second DSP; the first input end of the first DSP is electrically connected with the output end of the second delay register, and the output end of the first DSP is electrically connected with the input end of a first inverter, the first input end of a first two-to-one selector, the input end of a second inverter and the first input end of a second two-to-one selector; the output end of the first inverter is electrically connected with the second input end of the first two-to-one selector; the output end of the second inverter is electrically connected with the second input end of the second two-to-one selector; the output end of the first two-to-one selector is electrically connected with the input end of a first accumulation register; the output end of the second two-to-one selector is electrically connected with the input end of a second accumulation register; the second input end of the first DSP is electrically connected with the first output end and the second output end of the first storage unit;

the first input end of the second DSP is electrically connected with the output end of the three-to-one selector, the second input end of the second DSP is electrically connected with the first output end and the second output end of the first storage unit, and the output end of the second DSP is electrically connected with the input end of a third inverter, the first input end of a third two-to-one selector, the input end of a fourth inverter and the first input end of a fourth two-to-one selector; the output end of the third inverter is electrically connected with the second input end of the third two-to-one selector; the output end of the fourth inverter is electrically connected with the second input end of the fourth two-to-one selector; the output end of the third two-to-one selector is electrically connected with the input end of a third accumulation register; the output end of the fourth two-to-one selector is electrically connected with the input end of a fourth accumulation register;

and the fourth output end, the fifth output end, the sixth output end and the seventh output end of the control module are electrically connected with the control end of the first two-to-one selector, the control end of the second two-to-one selector, the control end of the third two-to-one selector and the control end of the fourth two-to-one selector, respectively.

In a second aspect, the present invention provides a method for using a polynomial hardware multiplier based on Saber key encapsulation, where the method is applied to the polynomial hardware multiplier in the first aspect, and includes:

when matrix-vector multiplication is carried out in the ring R_q = Z_q[x]/(x^256 + 1), the public polynomial data loading module operates as follows:

in the first cycle after the polynomial hardware multiplier starts, the lower 52 bits of the 64-bit first register load bits [51:0] of the address 0 data in the second storage unit, namely polynomial coefficients a3, a2, a1 and a0, and polynomial coefficient a0 is read out by the three-to-one selector; bits [11:0] of the second register load bits [63:52] of the address 0 data in the second storage unit, namely the lower 12 bits of polynomial coefficient a4, and the complete coefficient is assembled when the data of the next address arrives;

in the next three cycles, the 64-bit first register outputs polynomial coefficients a1, a2 and a3 in sequence; in the following cycle, the highest bit of the second register loads the lowest bit of the address 1 data to form polynomial coefficient a4, which is output, and bits [10:0] of the third register load bits [63:53] of the address 1 data, namely the lower 11 bits of polynomial coefficient a9; one cycle later, the first register reads bits [52:1] of the address 1 data, namely polynomial coefficients a8, a7, a6 and a5, and outputs them; the three-to-one selector controls the first register, the second register and the third register so that they output coefficients sequentially and without gaps; the first delay register and the second delay register delay the coefficient stream by two cycles, so that the coefficient streams a[j] and a[(j+2) mod 256] enter the coefficient multiplication accumulation circuit at the same time; the alternating read state repeats every 13 addresses and repeats 4 times in a single polynomial multiplication, i.e., over 52 addresses;

when vector inner product multiplication is carried out in the ring R_p, only the 64-bit first register, the first delay register and the second delay register of the public polynomial data loading module are enabled; once the coefficient streams from the addressing circuit are ready in the coefficient multiplication accumulation circuit, the first DSP and the second DSP each perform one 19-bit by 13-bit multiplication, which is equivalent to two coefficient multiplications; the multiplication results of the first DSP and the second DSP are produced in the next cycle; after modular reduction of the low 16 bits and the high 16 bits, the results are added to or subtracted from the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register by modular addition or modular subtraction, as selected by the first two-to-one selector, the second two-to-one selector, the third two-to-one selector and the fourth two-to-one selector under the control signals of the control module; after 256 cycles, 4 consecutive coefficients of the product polynomial are obtained and written into the first storage unit and the second storage unit, and after 16384 cycles in total, the multiplication of polynomials with 256 coefficients is completed; every 256 cycles, the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register read the corresponding coefficients of the previous polynomial multiplication result from the first storage unit and the second storage unit, which is equivalent to initializing the four accumulation registers.

The invention provides a polynomial hardware multiplier based on Saber key encapsulation and a method of using it. The polynomial hardware multiplier comprises an addressing circuit, a public polynomial data loading module, a coefficient multiplication accumulation circuit and a control module. The control module controls the overall state sequence and, following the two-level loop control logic of the schoolbook polynomial multiplication algorithm for Saber, provides the address indices of the two multiplication operands to the addressing circuit. The second storage unit outputs 64-bit data, from which the public polynomial data loading module steadily produces two continuous 13-bit coefficient streams; at the same time, two 4-bit coefficient streams of the secret polynomial are read directly from the first storage unit by address. A 19-bit Com_s signal is formed from the lower 3-bit absolute values of the two secret polynomial coefficients and 13 zero bits, and the three signals enter the coefficient multiplication accumulation circuit for computation. With this device, the invention avoids the cycles spent on frequently reading and writing accumulation results, does not need to stall the polynomial multiplier, and shortens the operation time while keeping hardware resource consumption essentially unchanged and power consumption equally low.

Drawings

To illustrate the technical solution of the present invention more clearly, the drawings used in the embodiments are briefly described below; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a circuit structure diagram of a polynomial hardware multiplier based on Saber key encapsulation according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of how the public polynomial a(x) is stored in the storage unit when it is in the ring R_q, according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a parallel schoolbook algorithm optimized for Saber according to an embodiment of the present invention;

Fig. 4 is a circuit diagram of a public polynomial data loading module according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a conventional schoolbook algorithm;

Fig. 6 is a schematic diagram of the schoolbook operation rule for ring polynomials according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.

An embodiment of the present invention is directed to a lightweight schoolbook polynomial hardware multiplier for the Saber key encapsulation scheme, as shown in fig. 1. The figure gives the overall architecture of the lightweight parallel schoolbook polynomial multiplication circuit.

When a single polynomial multiplication starts, the control module first determines whether the polynomial multiplier is performing matrix-vector multiplication or vector inner product multiplication. The two modes use different data loading strategies. The control module generates the address indices according to lines 3 to 6 of the algorithm in fig. 3 and provides them to the addressing circuit, which is connected to two RAMs for data reading.

An embodiment of the invention provides a polynomial hardware multiplier based on Saber key encapsulation, which comprises an addressing circuit, a public polynomial data loading module for loading public polynomial data, a coefficient multiplication accumulation circuit and a control module, wherein the control module is implemented as a state machine.

The addressing circuit comprises a first storage unit RAM_S and a second storage unit RAM_A; the first output end of the control module is electrically connected with the first input end of the first storage unit RAM_S through the 8-bit first address line S0_addr, the second output end is electrically connected with the second input end of the first storage unit RAM_S through the 8-bit second address line S1_addr, and the third output end is electrically connected with the input end of the second storage unit RAM_A through the 7-bit third address line a(x) addr. The address lines are used to fetch the data at the corresponding addresses.

The public polynomial data loading module comprises a Buffer control unit, a 64-bit first register buffer0, a 13-bit second register buffer1, a 13-bit third register buffer2, a three-to-one selector, a 13-bit first delay register and a 13-bit second delay register, wherein the Buffer control unit is implemented as a state machine. The input end of the Buffer control unit is electrically connected with the output end of the second storage unit RAM_A, the first output end of the Buffer control unit is electrically connected with the input end of the first register buffer0, the second output end of the Buffer control unit is electrically connected with the input end of the second register buffer1, the third output end of the Buffer control unit is electrically connected with the input end of the third register buffer2, and the fourth output end of the Buffer control unit is electrically connected with the control end of the three-to-one selector; the output end of the first register buffer0 is electrically connected with the first input end of the three-to-one selector; the output end of the second register buffer1 is electrically connected with the second input end of the three-to-one selector; the output end of the third register buffer2 is electrically connected with the third input end of the three-to-one selector; the output end of the three-to-one selector is electrically connected with the input end of the first delay register; the output end of the first delay register is electrically connected with the input end of the second delay register.

The coefficient multiplication accumulation circuit comprises a first DSP and a second DSP, both of which are DSP48E1 units; the first input end of the first DSP is electrically connected with the output end of the second delay register, and the output end of the first DSP is electrically connected with the input end of a first inverter M1, the first input end of a first two-to-one selector G1, the input end of a second inverter M2 and the first input end of a second two-to-one selector G2; the output end of the first inverter M1 is electrically connected with the second input end of the first two-to-one selector G1; the output end of the second inverter M2 is electrically connected with the second input end of the second two-to-one selector G2; the output end of the first two-to-one selector G1 is electrically connected with the input end of a first accumulation register acc1; the output end of the second two-to-one selector G2 is electrically connected with the input end of a second accumulation register acc2; and the second input end of the first DSP is electrically connected with the first output end and the second output end of the first storage unit RAM_S.

The first input end of the second DSP is electrically connected with the output end of the three-to-one selector, the second input end of the second DSP is electrically connected with the first output end and the second output end of the first storage unit RAM_S, and the output end of the second DSP is electrically connected with the input end of a third inverter M3, the first input end of a third two-to-one selector G3, the input end of a fourth inverter M4 and the first input end of a fourth two-to-one selector G4; the output end of the third inverter M3 is electrically connected with the second input end of the third two-to-one selector G3; the output end of the fourth inverter M4 is electrically connected with the second input end of the fourth two-to-one selector G4; the output end of the third two-to-one selector G3 is electrically connected with the input end of a third accumulation register acc3; the output end of the fourth two-to-one selector G4 is electrically connected with the input end of a fourth accumulation register acc4.

The fourth output end, the fifth output end, the sixth output end and the seventh output end of the control module are electrically connected with the control end of the first two-to-one selector G1, the control end of the second two-to-one selector G2, the control end of the third two-to-one selector G3 and the control end of the fourth two-to-one selector G4, respectively, through the output lines Sign1 ^ b0[3], Sign2 ^ b1[3], Sign3 ^ b0[3] and Sign4 ^ b1[3].

An embodiment of the invention also provides a method of using the polynomial hardware multiplier based on Saber key encapsulation. When matrix-vector multiplication is carried out in the ring R_q = Z_q[x]/(x^256 + 1), the public polynomial data loading module operates as follows.

In the first cycle after the polynomial hardware multiplier starts, the lower 52 bits of the 64-bit first register load bits [51:0] of the address 0 data in the second storage unit, namely polynomial coefficients a3, a2, a1 and a0, and polynomial coefficient a0 is read out by the three-to-one selector; bits [11:0] of the second register load bits [63:52] of the address 0 data in the second storage unit, namely the lower 12 bits of polynomial coefficient a4, and the complete coefficient is assembled when the data of the next address arrives.

In the next three cycles, the 64-bit first register outputs polynomial coefficients a1, a2 and a3 in sequence; in the following cycle, the highest bit of the second register loads the lowest bit of the address 1 data to form polynomial coefficient a4, which is output, and bits [10:0] of the third register load bits [63:53] of the address 1 data, namely the lower 11 bits of polynomial coefficient a9; one cycle later, the first register reads bits [52:1] of the address 1 data, namely polynomial coefficients a8, a7, a6 and a5, and outputs them; the three-to-one selector controls the first register, the second register and the third register so that they output coefficients sequentially and without gaps; the first delay register and the second delay register delay the coefficient stream by two cycles, so that the coefficient streams a[j] and a[(j+2) mod 256] enter the coefficient multiplication accumulation circuit at the same time; the alternating read state repeats every 13 addresses and repeats 4 times in a single polynomial multiplication, i.e., over 52 addresses.

When vector inner product multiplication is carried out in the ring R_p, only the 64-bit first register, the first delay register and the second delay register are enabled; once the coefficient streams from the addressing circuit are ready in the coefficient multiplication accumulation circuit, the first DSP and the second DSP each perform one 19-bit by 13-bit multiplication, which is equivalent to two coefficient multiplications; the multiplication results of the first DSP and the second DSP are produced in the next cycle; after modular reduction of the low 16 bits and the high 16 bits, the results are added to or subtracted from the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register by modular addition or modular subtraction, as selected by the first two-to-one selector, the second two-to-one selector, the third two-to-one selector and the fourth two-to-one selector under the control signals of the control module; after 256 cycles, 4 consecutive coefficients of the product polynomial are obtained and written into the first storage unit and the second storage unit, and after 16384 cycles in total, the multiplication of polynomials with 256 coefficients is completed; every 256 cycles, the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register read the corresponding coefficients of the previous polynomial multiplication result from the first storage unit and the second storage unit, which is equivalent to initializing the four accumulation registers.

In Saber, when matrix-vector multiplication is performed, the control module generates the address indices supplied to the addressing circuit, and the coefficients of the public polynomial are 13 bits wide. The extendable-output function SHAKE-128 that generates the public polynomial writes a 1344-bit byte string into the RAM as a 64-bit stream for each "squeeze" output. The output is made 64 bits wide because this reduces the cost of data alignment. However, 13-bit coefficients end up stored across address boundaries in a 64-bit-wide RAM. As shown in fig. 2, specific coefficients are split between addresses in this storage scheme: for example, the upper 12 bits of address 0 are the lower 12 bits of coefficient a4, and the remaining highest 1 bit of a4 is the lowest 1 bit of address 1; the upper 11 bits of address 1 are the lower 11 bits of coefficient a9, and the remaining 2 bits of a9 are the lowest 2 bits of address 2; the upper 2 bits of address 10 are the lower 2 bits of a54, and the remaining 11 bits of a54 are the lower 11 bits of address 11, and so on. The pattern repeats every 13 addresses, and one polynomial occupies 52 addresses, i.e., 4 repetitions. Therefore, when the coefficients of a public polynomial must be read sequentially and continuously from a 64-bit RAM, the read strategy has to be designed carefully to handle the split coefficients. A conventional solution that comes to mind directly is to read the data of the second storage unit RAM_A through a 13 × 64 = 832-bit buffer, but this consumes many registers, and the cycle overhead of waiting for the 832-bit buffer to fill is large.
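The cross-address layout described above follows from simple bit arithmetic. The sketch below is only an illustration, assuming the coefficients are packed back to back with a0 in the lowest bits of address 0; the function names are hypothetical and are not taken from the patent.

```python
WORD_BITS = 64    # RAM word width given in the description
COEFF_BITS = 13   # coefficient width for q = 2**13
N = 256           # coefficients per polynomial


def coefficient_span(i):
    """Return (address, bits_in_that_word, bits_in_next_word) for coefficient a_i."""
    start = i * COEFF_BITS                      # absolute position of the lowest bit
    address, offset = divmod(start, WORD_BITS)  # word index and bit offset inside it
    bits_here = min(COEFF_BITS, WORD_BITS - offset)
    return address, bits_here, COEFF_BITS - bits_here


def split_coefficients(n=N):
    """Map each coefficient index that straddles two addresses to its span."""
    spans = {i: coefficient_span(i) for i in range(n)}
    return {i: span for i, span in spans.items() if span[2] > 0}


if __name__ == "__main__":
    for i in (4, 9, 54):
        print(f"a{i}:", coefficient_span(i))
    print("addresses per polynomial:", N * COEFF_BITS // WORD_BITS)
    print("split coefficients:", len(split_coefficients()))
```

Under this packing assumption, 64 coefficients fill exactly 13 words (64 × 13 = 13 × 64 = 832 bits), which is why the split pattern repeats every 13 addresses.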

Based on how the public polynomial is stored in the RAM, this design provides the data loading module shown in fig. 4, in which the lower 52 bits of the 64-bit first register buffer0 load the 4 complete consecutive coefficients in each address and output them in sequence. The second register buffer1 and the third register buffer2 alternately load the partial high-order bits and partial low-order bits of the address data; for example, the second register buffer1 first reads the upper 12 bits of address 0, waits for the remaining 1 bit to form a4, and outputs it one cycle after a3, while the third register buffer2 first reads the upper 11 bits of address 1, waits for the remaining 2 bits to form a9, and outputs it after a8. This alternating read control logic is implemented by a counter and a state machine in the Buffer control unit. Whenever a complete coefficient has been assembled, the three-to-one selector selects one of the first register buffer0, the second register buffer1 and the third register buffer2 for reading; the two remaining registers, namely the delay registers, delay the coefficient stream by two cycles so that a[j] and a[(j+2) mod n] are available together, and once both streams are ready they enter the coefficient multiplication accumulation circuit at the same time. a[j] is the coefficient stream required by the third line of the algorithm in fig. 3; it is updated once per cycle, and the inner loop index j determines which public polynomial coefficient takes part in the multiplication in each cycle.
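The behaviour of the loading module can be approximated in software. The following is a minimal behavioural sketch, not the register-level design: it does not model buffer0, buffer1 and buffer2 individually, and it assumes coefficients are packed back to back with a0 in the lowest bits. The helper names are hypothetical. It stitches split coefficients back together and then uses a two-entry delay line to pair a[j] with a[(j+2) mod n].

```python
from collections import deque
from itertools import cycle, islice

COEFF_BITS = 13
WORD_BITS = 64
COEFF_MASK = (1 << COEFF_BITS) - 1


def pack_words(coeffs):
    """Pack 13-bit coefficients back to back into 64-bit words (a0 in the lowest bits)."""
    bits = 0
    for i, c in enumerate(coeffs):
        bits |= (c & COEFF_MASK) << (i * COEFF_BITS)
    total_bits = len(coeffs) * COEFF_BITS
    n_words = (total_bits + WORD_BITS - 1) // WORD_BITS
    return [(bits >> (k * WORD_BITS)) & ((1 << WORD_BITS) - 1) for k in range(n_words)]


def coefficient_stream(words, n_coeffs):
    """Yield coefficients in order, completing those split across two words."""
    carry, carry_bits, emitted = 0, 0, 0
    for word in words:
        # Leftover low bits of a split coefficient sit below the new word's bits.
        available = (word << carry_bits) | carry
        available_bits = WORD_BITS + carry_bits
        while available_bits >= COEFF_BITS and emitted < n_coeffs:
            yield available & COEFF_MASK
            available >>= COEFF_BITS
            available_bits -= COEFF_BITS
            emitted += 1
        carry, carry_bits = available, available_bits


def delayed_pairs(coeffs):
    """Model two delay registers: return (a[j], a[(j + 2) mod n]) for every j."""
    n = len(coeffs)
    delay = deque(maxlen=2)          # the two delay registers
    pairs = []
    for value in islice(cycle(coeffs), n + 2):
        if len(delay) == 2:
            pairs.append((delay[0], value))
        delay.append(value)
    return pairs


if __name__ == "__main__":
    a = [(37 * i + 5) % (1 << COEFF_BITS) for i in range(256)]
    words = pack_words(a)
    restored = list(coefficient_stream(words, len(a)))
    print(len(words), restored == a, delayed_pairs(restored)[:2])
```

The hardware reaches the same result with far less state than a general bit buffer, because only one coefficient per address is ever split; the sketch trades that economy for brevity.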

In the other case in Saber, vector inner product polynomial multiplication is performed in the ring R_p. In this operation each coefficient of the public polynomial occupies 16 bits, with the highest 6 bits padded with 0 to ease subsequent data alignment, so the coefficients are not stored across address boundaries in the RAM; the first register buffer0 can therefore simply read the data in, and only the first delay register and the second delay register after the three-to-one selector are used for continuous output. The public polynomial and the secret polynomial in the ring R_q are further constrained to R_p, so the coefficients of the polynomial are all 10 bits wide; if they were stored directly in a 64-bit-wide RAM they would also be split across addresses, so their high bits are padded with 0 up to 16 bits and no coefficient is split by the RAM. This padding is not implemented inside the multiplier and consumes few computational resources.
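The effect of the zero padding can be checked with a short sketch. It assumes four 16-bit slots per 64-bit word with the lowest-index coefficient in the lowest slot; the slot order and the function names are illustrative assumptions.

```python
SLOT_BITS = 16                 # 10-bit coefficient plus 6 zero bits of padding
WORD_BITS = 64
PER_WORD = WORD_BITS // SLOT_BITS
COEFF_MASK = (1 << 10) - 1


def pack_padded(coeffs):
    """Store each 10-bit coefficient in its own 16-bit slot, 4 slots per word."""
    words = []
    for k in range(0, len(coeffs), PER_WORD):
        word = 0
        for lane, c in enumerate(coeffs[k:k + PER_WORD]):
            word |= (c & COEFF_MASK) << (lane * SLOT_BITS)
        words.append(word)
    return words


def unpack_padded(words):
    """Read the coefficients back; no coefficient ever spans two words."""
    return [(w >> (lane * SLOT_BITS)) & COEFF_MASK
            for w in words for lane in range(PER_WORD)]


if __name__ == "__main__":
    b = [(11 * i + 3) % 1024 for i in range(256)]
    words = pack_padded(b)
    print(len(words), unpack_padded(words) == b)
```

With this layout a 256-coefficient polynomial takes 64 addresses instead of the 40 that dense 10-bit packing would need, which is the storage cost paid for never having to reassemble a split coefficient.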

In both cases, the coefficients of the secret polynomial s(x) are 4 bits wide, and the data of the first storage unit RAM_S can be read directly according to the address indices in the algorithm. Once the coefficient streams Coeff_a0, Coeff_a1 and Com_s are ready, the first DSP and the second DSP are invoked and together perform 4 coefficient multiplications. The high 16 bits and the low 16 bits of the outputs of the first DSP and the second DSP are taken as temporary results and reduced modulo the modulus to obtain temp1 to temp4, which are then added to or subtracted from the first accumulation register acc1, the second accumulation register acc2, the third accumulation register acc3 and the fourth accumulation register acc4, respectively, by modular addition or modular subtraction under the control of Sign ^ b[3], as shown in fig. 1. Both the public polynomial and the secret polynomial are generated by the function SHAKE-128, which can generate an unlimited stream of random coefficients; the inputs to the function, referred to as seeds, differ for the two polynomials. The secret polynomial must be kept secret, hence its name.
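The reason one 19-bit by 13-bit multiplication yields two coefficient products can be shown with plain integer arithmetic. The sketch below rests on two assumptions that the text does not spell out: the secret coefficient is treated as sign plus 3-bit magnitude, and Com_s places one magnitude in bits [2:0] and the other in bits [18:16] with 13 zero bits between them. All function names are hypothetical.

```python
Q_BITS = 13
Q_MASK = (1 << Q_BITS) - 1


def split_sign_magnitude(s):
    """Return (sign_bit, magnitude) of a small signed secret coefficient."""
    return (1 if s < 0 else 0), abs(s)


def build_com_s(mag0, mag1):
    """Assumed layout: mag1 in bits [18:16], 13 zero bits, mag0 in bits [2:0]."""
    assert 0 <= mag0 < 8 and 0 <= mag1 < 8
    return (mag1 << 16) | mag0


def dsp_multiply(a, com_s):
    """One wide product that carries two independent coefficient products."""
    product = a * com_s
    low = product & 0xFFFF             # a * mag0 fits in 13 + 3 = 16 bits
    high = (product >> 16) & 0xFFFF    # a * mag1
    return low & Q_MASK, high & Q_MASK  # reduce both modulo 2**13


def accumulate(acc, temp, subtract):
    """Modular addition or subtraction, selected by the sign control bit."""
    return (acc - temp) & Q_MASK if subtract else (acc + temp) & Q_MASK


if __name__ == "__main__":
    a, s0, s1 = 8191, -4, 3
    (sign0, mag0), (sign1, mag1) = split_sign_magnitude(s0), split_sign_magnitude(s1)
    temp0, temp1 = dsp_multiply(a, build_com_s(mag0, mag1))
    wrap = 0  # sign contributed by x^n = -1; 0 means no wrap-around for this term
    print(accumulate(0, temp0, wrap ^ sign0), accumulate(0, temp1, wrap ^ sign1))
```

Because a is below 2^13 and each magnitude is below 2^3, the low product never exceeds 16 bits, so it cannot spill into the high product; this is what the 13 zero bits in Com_s guarantee.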

The invention compares the efficient parallel schoolbook algorithm for Saber with the conventional schoolbook algorithm, as shown in fig. 3 and fig. 5 respectively. The conventional schoolbook polynomial multiplication is a convolution of the two polynomial operands. An accumulation register acc is initialized; each time the inner-loop variable j increases by one, the multiplier coefficient a[j] is multiplied with every coefficient b[i] of the multiplicand polynomial in turn, and because the computation takes place in the ring R_q, the polynomial b(x) must be multiplied by x in the outer loop to shift its degree. The result in the bold-line box in fig. 6 is obtained each time the outer loop finishes, but in a lightweight implementation the size of the accumulation register is limited to save resources, so the accumulated values must be written to and read from the storage unit frequently during multiplication. The optimized schoolbook algorithm for Saber in fig. 3 is different: it changes the order in which a(x) and s(x) are read. Reading a[j] and a[(j+2) mod n] together with s[(i-j) mod n] and s[(i-j+1) mod n] keeps the same accumulation registers in use continuously, avoids the frequent reading and writing of accumulated values required by the conventional schoolbook polynomial multiplication, and saves unnecessary cycles. After each pass of the outer loop, 4 consecutive result coefficients are obtained, as indicated by the thin-line boxes in fig. 6. This is equivalent to performing a small convolution. After 16384 cycles, a complete polynomial multiplication is finished.
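The read order described above can be checked against a plain schoolbook product. The model below is one reading of that index pattern, not a transcription of fig. 3: it assumes the four accumulators collect result coefficients c[i] to c[i+3], that the outer index i advances by 4, and that the accumulators start from zero (the hardware instead preloads earlier results for matrix-vector accumulation). Function names are hypothetical.

```python
import random

N = 256
Q = 1 << 13


def reference_multiply(a, s, n=N, q=Q):
    """Plain schoolbook product in Z_q[x]/(x^n + 1)."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k, term = i + j, a[i] * s[j]
            if k >= n:
                c[k - n] = (c[k - n] - term) % q   # x^n = -1
            else:
                c[k] = (c[k] + term) % q
    return c


def parallel_multiply(a, s, n=N, q=Q):
    """Four fixed accumulators per outer pass; returns (product, cycle count)."""
    c, cycles = [0] * n, 0
    for i in range(0, n, 4):                 # outer loop: 4 result coefficients
        acc = [0, 0, 0, 0]
        for j in range(n):                   # inner loop: one cycle per step
            j2 = (j + 2) % n
            s0, s1 = s[(i - j) % n], s[(i - j + 1) % n]
            # (public coefficient, its index, secret coefficient, result index)
            lanes = ((a[j], j, s0, i), (a[j], j, s1, i + 1),
                     (a[j2], j2, s0, i + 2), (a[j2], j2, s1, i + 3))
            for lane, (av, aj, sv, k) in enumerate(lanes):
                negative = (aj > k) ^ (sv < 0)   # wrap-around sign XOR secret sign
                term = av * abs(sv)
                acc[lane] = (acc[lane] - term) % q if negative else (acc[lane] + term) % q
            cycles += 1
        c[i:i + 4] = acc
    return c, cycles


if __name__ == "__main__":
    random.seed(0)
    a = [random.randrange(Q) for _ in range(N)]
    s = [random.randint(-4, 4) for _ in range(N)]
    product, cycles = parallel_multiply(a, s)
    print(product == reference_multiply(a, s), cycles)
```

Each outer pass takes n = 256 inner steps and there are n / 4 = 64 passes, giving 256 × 64 = 16384 steps, which matches the cycle count stated in the description.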

It should be emphasized that in this embodiment the polynomial multiplier is optimized on its own, and its RAM resources can be reused by the overall design. In this embodiment, Vivado 2018.3 is used to build the above hardware structure on an Artix-7 FPGA, with q = 2¹³ and n = 256. The final design reaches an overall frequency of 130 MHz and requires 16384 clock cycles. It consumes 561 LUTs, 302 FFs and 2 DSP48E1 slices. Compared with existing lightweight polynomial multipliers for Saber, this embodiment provides faster operation with nearly equivalent hardware resources, saving 3087 cycles and improving the frequency.
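As a rough consistency check of these figures (an illustrative calculation: the cycle count follows from 4 result coefficients per outer pass and 256 cycles per pass, while the latency at 130 MHz is derived here and is not a value reported in the original):

\[
\frac{n}{4} \times n = 64 \times 256 = 16384 \ \text{cycles}, \qquad
t \approx \frac{16384}{130 \times 10^{6}\ \text{Hz}} \approx 126\ \mu\text{s}.
\]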

The same and similar parts in the various embodiments in this specification may be referred to each other. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is simple, and the relevant points can be referred to the description in the method embodiment.

The invention has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the invention. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the technical solution of the present invention and its embodiments without departing from the spirit and scope of the present invention, which fall within the scope of the present invention. The scope of the invention is defined by the appended claims.

Claims

1. A polynomial hardware multiplier based on Saber key encapsulation is characterized by comprising an addressing circuit, a public polynomial data loading module, a coefficient multiplication and accumulation circuit and a control module;

the public polynomial data loading module comprises a Buffer control unit, a first register, a second register, a third register, a one-of-three selector, a first delay register and a second delay register; the input end of the Buffer control unit is electrically connected with the output end of the second storage unit, the first output end of the Buffer control unit is electrically connected with the input end of the first register, the second output end of the Buffer control unit is electrically connected with the input end of the second register, the third output end of the Buffer control unit is electrically connected with the input end of the third register, and the fourth output end of the Buffer control unit is electrically connected with the control end of the one-of-three selector; the output end of the first register is electrically connected with the first input end of the one-of-three selector; the output end of the second register is electrically connected with the second input end of the one-of-three selector; the output end of the third register is electrically connected with the third input end of the one-of-three selector; the output end of the one-of-three selector is electrically connected with the input end of the first delay register; the output end of the first delay register is electrically connected with the input end of the second delay register;

the coefficient multiplication and accumulation circuit comprises a first DSP and a second DSP; the first input end of the first DSP is electrically connected with the output end of the second delay register, and the output end of the first DSP is electrically connected with the input end of a first inverter, the first input end of a first one-of-two selector, the input end of a second inverter and the first input end of a second one-of-two selector; the output end of the first inverter is electrically connected with the second input end of the first one-of-two selector; the output end of the second inverter is electrically connected with the second input end of the second one-of-two selector; the output end of the first one-of-two selector is electrically connected with the input end of a first accumulation register; the output end of the second one-of-two selector is electrically connected with the input end of a second accumulation register; the second input end of the first DSP is electrically connected with the first output end and the second output end of the first storage unit;

the first input end of the second DSP is electrically connected with the output end of the one-of-three selector, the second input end of the second DSP is electrically connected with the first output end and the second output end of the first storage unit, and the output end of the second DSP is electrically connected with the input end of a third inverter, the first input end of a third one-of-two selector, the input end of a fourth inverter and the first input end of a fourth one-of-two selector; the output end of the third inverter is electrically connected with the second input end of the third one-of-two selector; the output end of the fourth inverter is electrically connected with the second input end of the fourth one-of-two selector; the output end of the third one-of-two selector is electrically connected with the input end of a third accumulation register; the output end of the fourth one-of-two selector is electrically connected with the input end of a fourth accumulation register.

2. A method for using a Saber key encapsulation based polynomial hardware multiplier, wherein the method is applied to the polynomial hardware multiplier in claim 1, and comprises:

when matrix-vector multiplication is carried out, the public polynomial data loading module is in a ring domain

In the first cycle after the polynomial hardware multiplier starts, the low 52 bits of the 64-bit first register are loaded with bits [51:0] of the address 0 data in the second storage unit, namely polynomial coefficients a3, a2, a1 and a0; the polynomial coefficient a0 is read out through the one-of-three selector, and bits [11:0] of the second register are loaded with bits [63:52] of the address 0 data in the second storage unit, namely the lower 12 bits of polynomial coefficient a4; the complete polynomial coefficient is assembled when the data of the next address arrives;

in the next three cycles, the 64-bit first register outputs polynomial coefficients a1, a2 and a3 in sequence; in the following cycle, the highest bit of the third register is loaded with the last bit of the address 1 data to form and output polynomial coefficient a4, and bits [10:0] of the third register are loaded with bits [63:53] of the address 1 data, namely the lower 11 bits of polynomial coefficient a9; one cycle later, the first register reads bits [52:1] of the address 1 data, namely polynomial coefficients a8, a7, a6 and a5, and outputs these coefficients; the one-of-three selector controls the first register, the second register and the third register so that they output coefficients sequentially and continuously; the first delay register and the second delay register delay the coefficient stream by two cycles, so that the coefficient streams a[j] and a[(j+2) mod 256] enter the coefficient multiplication and accumulation circuit simultaneously; the alternating data-reading state is updated every 13 addresses, and is updated 4 times, i.e., over 52 addresses, in a single polynomial multiplication;

Only the 64-bit first register, the first delay register and the second delay register are enabled; after the coefficient streams from the addressing circuit are ready in the coefficient multiplication and accumulation circuit, the first DSP and the second DSP each perform a 19-bit by 13-bit multiplication, which is equivalent to two coefficient multiplications; the multiplication results of the first DSP and the second DSP are produced in the next cycle, and after the low 16 bits and the high 16 bits undergo modular reduction, they are combined with the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register by modular addition or modular subtraction, as selected by the first one-of-two selector, the second one-of-two selector, the third one-of-two selector and the fourth one-of-two selector under the control signal of the control module; after 256 cycles, 4 consecutive coefficients of the product polynomial are obtained and written into the first storage unit and the second storage unit, and after 16384 cycles in total the 256-order polynomial multiplication is completed; every 256 cycles, the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register read the coefficients of the corresponding previous polynomial multiplication result from the first storage unit and the second storage unit, which is equivalent to initializing the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register.
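The loading sequence recited above, in which 13-bit coefficients straddle 64-bit word boundaries, can be illustrated with the following Python sketch. It is a simplified software model under stated assumptions: coefficients are assumed to be packed least-significant-first into 52 words, a single integer buffer stands in for the three registers and the selector, and the names `pack`, `coefficient_stream` and `paired_stream` are hypothetical.

```python
COEFF_BITS = 13
WORD_BITS = 64
N = 256
MASK = (1 << COEFF_BITS) - 1


def pack(coeffs):
    """Pack 13-bit coefficients, lowest index first, into 64-bit words."""
    stream = 0
    for k, c in enumerate(coeffs):
        stream |= (c & MASK) << (COEFF_BITS * k)
    n_words = (len(coeffs) * COEFF_BITS + WORD_BITS - 1) // WORD_BITS
    word_mask = (1 << WORD_BITS) - 1
    return [(stream >> (WORD_BITS * w)) & word_mask for w in range(n_words)]


def coefficient_stream(words):
    """Yield one coefficient per step, carrying leftover bits of each
    word over to the next word (the role of the partial registers)."""
    buffer, bits = 0, 0
    for word in words:
        buffer |= word << bits
        bits += WORD_BITS
        while bits >= COEFF_BITS:
            yield buffer & MASK
            buffer >>= COEFF_BITS
            bits -= COEFF_BITS


def paired_stream(words):
    """Pair a[j] with a[(j+2) mod N], as a two-cycle delay line would."""
    a = list(coefficient_stream(words))
    return [(a[j], a[(j + 2) % N]) for j in range(N)]
```

Because 13 words hold exactly 64 coefficients (13 x 64 = 64 x 13 bits), the alignment pattern in this model repeats every 13 addresses and 4 times over the 52 addresses of one polynomial.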