CN114780057A - Polynomial hardware multiplier based on Saber key encapsulation and use method - Google Patents


Publication number
CN114780057A
Authority
CN
China
Prior art keywords
register
polynomial
electrically connected
output end
input end
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210321371.4A
Other languages
Chinese (zh)
Inventor
刘伟强
章渊拓
崔益军
徐天宇
倪子颖
王成华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202210321371.4A
Publication of CN114780057A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a polynomial hardware multiplier based on Saber key encapsulation and a method of using it. The polynomial hardware multiplier comprises an addressing circuit, a public polynomial data loading module, a coefficient multiplication accumulation circuit and a control module. The control module controls the overall state flow and provides the addressing circuit with the address indices of the two operands. The second storage unit outputs 64-bit data, from which the public polynomial data loading module steadily produces 2 continuous coefficient streams; meanwhile, 2 coefficient streams of the secret polynomial are read directly from the first storage unit by address, and a 19-bit Com_s signal is formed from the low 3-bit absolute values of the 2 secret polynomial coefficients and 13 zero bits. These three signals enter the coefficient multiplication accumulation circuit for operation. The invention avoids the cycles spent on frequently reading and writing the accumulation result, does not need to pause the polynomial multiplier, and shortens the operation time while keeping hardware resource consumption essentially unchanged and power consumption equally low.

Description

Polynomial hardware multiplier based on Saber key encapsulation and use method
Technical Field
The invention belongs to the technical field of information security encryption, and particularly relates to a polynomial hardware multiplier based on Saber key encapsulation and a using method thereof.
Background
After Shor's quantum algorithm appeared, traditional public key encryption schemes such as RSA (asymmetric encryption) and ECC (elliptic curve cryptography) came under serious threat, since a quantum computer could break them in polynomial time. There are three lattice-based key encapsulation schemes, and the Saber key encapsulation scheme is one of them. Saber's security rests on the Module Learning with Rounding (M-LWR) problem, and the scheme is derived from a public key encryption primitive via the Fujisaki-Okamoto (F-O) transform. Because the moduli in Saber are powers of 2, switching between moduli can be performed by rounding, which also introduces the random error; this reduces the size of the ciphertext and the public key and favours lightweight hardware implementation.
In a concrete implementation without optimization, polynomial multiplication takes up the most resources and operation time in Saber, on software and hardware platforms alike. However, because the modulus is not a prime number, the fast Number Theoretic Transform (NTT) cannot be used for the hardware implementation of Saber, and researchers instead use the Toom-Cook k-way and Karatsuba algorithms to speed up polynomial multiplication; the core idea of both is to reduce the number of multiplications. For example, in 2020 Yihong Zhu et al. proposed an 8-stage iterative Karatsuba polynomial multiplication structure that cut the 65536 coefficient multiplications by 90% and greatly reduced the overall operation time. Such "divide and conquer" algorithms replace multiplications with additions and subtractions in hardware and can reduce the number of cycles, but they require more pre-processing and post-processing steps and therefore consume more hardware resources; the demands of a single multiplier can even exceed the resources available on the hardware platform, which makes them unsuitable for lightweight implementation on resource-constrained platforms.
Lightweight hardware implementations aim to use as few resources as possible while remaining reasonably fast. How to implement a lightweight multiplier efficiently is therefore of great significance.
Disclosure of Invention
In view of the deficiencies of the prior art, in a first aspect the invention provides a polynomial hardware multiplier based on Saber key encapsulation, which comprises an addressing circuit, a public polynomial data loading module, a coefficient multiplication accumulation circuit and a control module;
the addressing circuit comprises a first storage unit and a second storage unit; the first output end of the control module is electrically connected with the first input end of the first storage unit through a first address line, the second output end of the control module is electrically connected with the second input end of the first storage unit through a second address line, and the third output end of the control module is electrically connected with the input end of the second storage unit through a third address line;
the public polynomial data loading module comprises a Buffer control unit, a first register, a second register, a third register, a one-of-three selector, a first delay register and a second delay register; the input end of the Buffer control unit is electrically connected with the output end of the second storage unit, the first output end of the Buffer control unit is electrically connected with the input end of the first register, the second output end of the Buffer control unit is electrically connected with the input end of the second register, the third output end of the Buffer control unit is electrically connected with the input end of the third register, and the fourth output end of the Buffer control unit is electrically connected with the control end of the one-of-three selector; the output end of the first register is electrically connected with the first input end of the one-of-three selector; the output end of the second register is electrically connected with the second input end of the one-of-three selector; the output end of the third register is electrically connected with the third input end of the one-of-three selector; the output end of the one-of-three selector is electrically connected with the input end of the first delay register; the output end of the first delay register is electrically connected with the input end of the second delay register;
the coefficient multiplication accumulation circuit comprises a first DSP and a second DSP; the first input end of the first DSP is electrically connected with the output end of the second delay register, and the output end of the first DSP is electrically connected with the input end of a first inverter, the first input end of a first one-of-two selector, the input end of a second inverter and the first input end of a second one-of-two selector; the output end of the first inverter is electrically connected with the second input end of the first one-of-two selector; the output end of the second inverter is electrically connected with the second input end of the second one-of-two selector; the output end of the first one-of-two selector is electrically connected with the input end of a first accumulation register; the output end of the second one-of-two selector is electrically connected with the input end of a second accumulation register; the second input end of the first DSP is electrically connected with the first output end and the second output end of the first storage unit;
the first input end of the second DSP is electrically connected with the output end of the one-of-three selector, the second input end of the second DSP is electrically connected with the first output end and the second output end of the first storage unit, and the output end of the second DSP is electrically connected with the input end of a third inverter, the first input end of a third one-of-two selector, the input end of a fourth inverter and the first input end of a fourth one-of-two selector; the output end of the third inverter is electrically connected with the second input end of the third one-of-two selector; the output end of the fourth inverter is electrically connected with the second input end of the fourth one-of-two selector; the output end of the third one-of-two selector is electrically connected with the input end of a third accumulation register; the output end of the fourth one-of-two selector is electrically connected with the input end of a fourth accumulation register;
and a fourth output end, a fifth output end, a sixth output end and a seventh output end of the control module are respectively electrically connected with the control end of the first one-of-two selector, the control end of the second one-of-two selector, the control end of the third one-of-two selector and the control end of the fourth one-of-two selector.
In a second aspect, the present invention provides a method for using a polynomial hardware multiplier based on Saber key encapsulation, where the method is applied to the polynomial hardware multiplier in the first aspect, and includes:
when matrix-vector multiplication is carried out, the public polynomial data loading module operates in the ring R_q = Z_q[x]/(x^256 + 1) with q = 2^13: in the first cycle after the polynomial hardware multiplier starts, the low 52 bits of the 64-bit first register load bits [51:0] of the address-0 data in the second storage unit, namely polynomial coefficients a3, a2, a1 and a0; polynomial coefficient a0 is read out through the one-of-three selector; bits [11:0] of the second register load bits [63:52] of the address-0 data in the second storage unit, namely the low 12 bits of polynomial coefficient a4, and the complete coefficient is assembled when the data of the next address arrives;
in the next three cycles, the 64-bit first register outputs polynomial coefficients a1, a2 and a3 in sequence; in the following cycle, the highest bit of the second register is loaded with the lowest bit of the address-1 data to form polynomial coefficient a4, which is output, and bits [10:0] of the third register are loaded with bits [63:53] of the address-1 data, namely the low 11 bits of polynomial coefficient a9; one cycle later, the first register reads bits [52:1] of the address-1 data, namely polynomial coefficients a8, a7, a6 and a5, and outputs them; the one-of-three selector controls the first register, the second register and the third register so that coefficients are output sequentially and continuously; the first delay register and the second delay register delay the coefficient stream by two cycles, so that the coefficient streams a[j] and a[(j+2) mod 256] enter the coefficient multiplication accumulation circuit simultaneously; the alternating read pattern repeats every 13 addresses and occurs 4 times in a single polynomial multiplication, i.e. over 52 addresses;
when vector inner product multiplication is carried out, the public polynomial data loading module operates in the ring R_p = Z_p[x]/(x^256 + 1) with p = 2^10, and only the 64-bit first register, the first delay register and the second delay register are enabled; once the coefficient streams from the addressing circuit are ready at the coefficient multiplication accumulation circuit, the first DSP and the second DSP each execute one 19-bit by 13-bit multiplication, which is equivalent to two coefficient multiplications; the multiplication results of the first DSP and the second DSP are produced in the next cycle, and after the low 16 bits and the high 16 bits have each undergone modular reduction, they are modularly added to or subtracted from the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register, as selected by the first one-of-two selector, the second one-of-two selector, the third one-of-two selector and the fourth one-of-two selector under the control signals of the control module; after 256 cycles, 4 consecutive coefficients of the product polynomial are obtained and written into the first storage unit and the second storage unit, and after 16384 cycles in total the multiplication of the 256-coefficient polynomials is completed; every 256 cycles, the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register read the coefficients of the corresponding previous polynomial multiplication result from the first storage unit and the second storage unit, which is equivalent to initializing these four accumulation registers.
The invention provides a polynomial hardware multiplier based on Saber key encapsulation and a method of using it. The polynomial hardware multiplier comprises an addressing circuit, a public polynomial data loading module, a coefficient multiplication accumulation circuit and a control module. The control module controls the overall state flow and provides the addressing circuit with the address indices of the two operands according to the two-level loop control logic of the schoolbook polynomial multiplication algorithm for Saber. The second storage unit outputs 64-bit data, from which the public polynomial data loading module steadily produces 2 continuous 13-bit coefficient streams; meanwhile, 2 4-bit coefficient streams of the secret polynomial are read directly from the first storage unit by address, and a 19-bit Com_s signal is formed from the low 3-bit absolute values of the 2 secret polynomial coefficients and 13 zero bits. These three signals enter the coefficient multiplication accumulation circuit for operation. With this device, the invention avoids the cycles spent on frequently reading and writing the accumulation result, does not need to pause the polynomial multiplier, and shortens the operation time while keeping hardware resource consumption essentially unchanged and power consumption equally low.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a circuit structure diagram of a polynomial hardware multiplier based on Saber key encapsulation according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of how the public polynomial a(x) is stored in the storage unit when it lies in the ring R_q = Z_q[x]/(x^256 + 1) with q = 2^13, according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a parallel schoolbook algorithm optimized for Saber according to an embodiment of the present invention;
Fig. 4 is a circuit diagram of a public polynomial data loading module according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a conventional schoolbook algorithm;
Fig. 6 is a schematic diagram of the schoolbook operation rule for polynomials over the ring according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
An embodiment of the present invention is directed to a lightweight schoolbook polynomial hardware multiplier for the Saber key encapsulation scheme, as shown in Fig. 1, which gives the overall architecture of the lightweight parallel schoolbook polynomial multiplication circuit design.
When a single polynomial multiplication starts, the control module first determines whether the polynomial multiplier is performing matrix-vector multiplication or vector inner product multiplication. The two modes are handled differently. The control module generates the address indices according to lines 3 to 6 of the algorithm in Fig. 3 and provides them to the addressing circuit, which is connected to two RAMs for data reading.
The embodiment of the invention provides a polynomial hardware multiplier based on Saber key encapsulation, which comprises an addressing circuit, a public polynomial data loading module, a coefficient multiplication accumulation circuit and a control module, wherein the public polynomial data loading module is used for loading public polynomial data; wherein the control module is composed of a state machine.
The addressing circuit comprises a first storage unit RAM_S and a second storage unit RAM_A; the first output end of the control module is electrically connected with the first input end of the first storage unit RAM_S through the 8-bit first address line S0_addr, the second output end is electrically connected with the second input end of the first storage unit RAM_S through the 8-bit second address line S1_addr, and the third output end is electrically connected with the input end of the second storage unit RAM_A through the 7-bit third address line a(x) addr. The address lines are used to fetch the data at the corresponding addresses.
The public polynomial data loading module comprises a Buffer control unit, a 64-bit first register buffer0, a 13-bit second register buffer1, a 13-bit third register buffer2, a one-of-three selector, a 13-bit first delay register and a 13-bit second delay register, wherein the Buffer control unit is composed of a state machine. The input end of the Buffer control unit is electrically connected with the output end of the second storage unit RAM_A, the first output end of the Buffer control unit is electrically connected with the input end of the first register buffer0, the second output end of the Buffer control unit is electrically connected with the input end of the second register buffer1, the third output end of the Buffer control unit is electrically connected with the input end of the third register buffer2, and the fourth output end of the Buffer control unit is electrically connected with the control end of the one-of-three selector; the output end of the first register buffer0 is electrically connected with the first input end of the one-of-three selector; the output end of the second register buffer1 is electrically connected with the second input end of the one-of-three selector; the output end of the third register buffer2 is electrically connected with the third input end of the one-of-three selector; the output end of the one-of-three selector is electrically connected with the input end of the first delay register; the output end of the first delay register is electrically connected with the input end of the second delay register.
The coefficient multiplication accumulation circuit comprises a first DSP and a second DSP, both of which are DSP48E1 blocks; the first input end of the first DSP is electrically connected with the output end of the second delay register, and the output end of the first DSP is electrically connected with the input end of a first inverter M1, the first input end of a first one-of-two selector G1, the input end of a second inverter M2 and the first input end of a second one-of-two selector G2; the output end of the first inverter M1 is electrically connected with the second input end of the first one-of-two selector G1; the output end of the second inverter M2 is electrically connected with the second input end of the second one-of-two selector G2; the output end of the first one-of-two selector G1 is electrically connected with the input end of a first accumulation register acc1; the output end of the second one-of-two selector G2 is electrically connected with the input end of a second accumulation register acc2; and the second input end of the first DSP is electrically connected with the first output end and the second output end of the first storage unit RAM_S.
The first input end of the second DSP is electrically connected with the output end of the one-of-three selector, the second input end of the second DSP is electrically connected with the first output end and the second output end of the first storage unit RAM_S, and the output end of the second DSP is electrically connected with the input end of a third inverter M3, the first input end of a third one-of-two selector G3, the input end of a fourth inverter M4 and the first input end of a fourth one-of-two selector G4; the output end of the third inverter M3 is electrically connected with the second input end of the third one-of-two selector G3; the output end of the fourth inverter M4 is electrically connected with the second input end of the fourth one-of-two selector G4; the output end of the third one-of-two selector G3 is electrically connected with the input end of a third accumulation register acc3; the output end of the fourth one-of-two selector G4 is electrically connected with the input end of a fourth accumulation register acc4.
A fourth output end, a fifth output end, a sixth output end and a seventh output end of the control module are respectively electrically connected with the control end of the first one-of-two selector, the control end of the second one-of-two selector, the control end of the third one-of-two selector and the control end of the fourth one-of-two selector through the output lines Sign1 ^ b0[3], Sign2 ^ b1[3], Sign3 ^ b0[3] and Sign4 ^ b1[3].
The embodiment of the invention also provides a method of using the polynomial hardware multiplier based on Saber key encapsulation. When matrix-vector multiplication is carried out, the public polynomial data loading module operates in the ring R_q = Z_q[x]/(x^256 + 1) with q = 2^13. In the first cycle after the polynomial hardware multiplier starts, the low 52 bits of the 64-bit first register load bits [51:0] of the address-0 data in the second storage unit, namely polynomial coefficients a3, a2, a1 and a0; polynomial coefficient a0 is read out through the one-of-three selector; bits [11:0] of the second register load bits [63:52] of the address-0 data in the second storage unit, namely the low 12 bits of polynomial coefficient a4, and the complete coefficient is assembled when the data of the next address arrives.
In the next three cycles, the 64-bit first register outputs polynomial coefficients a1, a2 and a3 in sequence; in the following cycle, the highest bit of the second register is loaded with the lowest bit of the address-1 data to form polynomial coefficient a4, which is output, and bits [10:0] of the third register are loaded with bits [63:53] of the address-1 data, namely the low 11 bits of polynomial coefficient a9; one cycle later, the first register reads bits [52:1] of the address-1 data, namely polynomial coefficients a8, a7, a6 and a5, and outputs them; the one-of-three selector controls the first register, the second register and the third register so that coefficients are output sequentially and continuously; the first delay register and the second delay register delay the coefficient stream by two cycles, so that the coefficient streams a[j] and a[(j+2) mod 256] enter the coefficient multiplication accumulation circuit simultaneously; the alternating read pattern repeats every 13 addresses and occurs 4 times in a single polynomial multiplication, i.e. over 52 addresses.
When vector inner product multiplication is carried out, the public polynomial data loading module operates in the ring R_p = Z_p[x]/(x^256 + 1) with p = 2^10, and only the 64-bit first register, the first delay register and the second delay register are enabled. Once the coefficient streams from the addressing circuit are ready at the coefficient multiplication accumulation circuit, the first DSP and the second DSP each execute one 19-bit by 13-bit multiplication, which is equivalent to two coefficient multiplications; the multiplication results of the first DSP and the second DSP are produced in the next cycle, and after the low 16 bits and the high 16 bits have each undergone modular reduction, they are modularly added to or subtracted from the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register, as selected by the first one-of-two selector, the second one-of-two selector, the third one-of-two selector and the fourth one-of-two selector under the control signals of the control module; after 256 cycles, 4 consecutive coefficients of the product polynomial are obtained and written into the first storage unit and the second storage unit, and after 16384 cycles in total the multiplication of the 256-coefficient polynomials is completed; every 256 cycles, the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register read the coefficients of the corresponding previous polynomial multiplication result from the first storage unit and the second storage unit, which is equivalent to initializing these four accumulation registers.
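The cycle count above (64 passes of 256 cycles each, 16384 in total) can be checked against a small software model. The following Python sketch is an illustration only and not the circuit of the invention: the assignment of the four products to the four accumulators is an assumed reading of the text (Fig. 3 is not reproduced here), the modulus 2^13 is the assumed matrix-vector case, and the function names are hypothetical.

```python
import random

N = 256          # coefficients per polynomial (from the text)
Q = 1 << 13      # assumed modulus: 13-bit coefficients, matrix-vector case


def negacyclic_reference(a, s):
    """Plain schoolbook product of a and s modulo x^N + 1 and Q."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * s[j]) % Q
            else:                      # x^N = -1, so wrapped terms are subtracted
                c[k - N] = (c[k - N] - a[i] * s[j]) % Q
    return c


def four_accumulator_schedule(a, s, previous=None):
    """Compute the same product 4 output coefficients at a time.

    Each inner iteration stands for one clock cycle: two public coefficients,
    a[j] and a[(j + 2) % N], meet two secret coefficients, giving 4 products.
    """
    previous = previous or [0] * N
    c = [0] * N
    cycles = 0
    for i in range(0, N, 4):
        acc = list(previous[i:i + 4])  # accumulators start from the earlier result
        for j in range(N):
            s0 = s[(i - j) % N]
            s1 = s[(i + 1 - j) % N]
            # (accumulator index, public coefficient index, secret coefficient)
            lanes = ((0, j, s0), (1, j, s1),
                     (2, (j + 2) % N, s0), (3, (j + 2) % N, s1))
            for lane, m, sk in lanes:
                sign = -1 if m > i + lane else 1   # wrapped terms change sign
                acc[lane] = (acc[lane] + sign * a[m] * sk) % Q
            cycles += 1
        c[i:i + 4] = acc
    return c, cycles


if __name__ == "__main__":
    rng = random.Random(0)
    a = [rng.randrange(Q) for _ in range(N)]
    s = [rng.randint(-4, 4) for _ in range(N)]   # assumed small secret range
    c, cycles = four_accumulator_schedule(a, s)
    print(c == negacyclic_reference(a, s), cycles)
```

Passing an earlier result as `previous` models the accumulation registers being initialized from the stored coefficients every 256 cycles, so that several polynomial products can be summed into one result.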
In Saber, when matrix-vector multiplication is performed, the control module generates the address indices supplied to the addressing circuit. The coefficients of the public polynomial are 13 bits wide, and the extendable-output function SHAKE-128 that generates the public polynomial writes a 1344-bit string into the RAM as a 64-bit stream in each "squeeze" output step. The output is made 64 bits wide because this reduces the cost of data alignment. However, 13-bit coefficients end up stored across address boundaries in the 64-bit storage unit RAM. As shown in Fig. 2, certain coefficients are split between addresses by this storage layout: for example, the upper 12 bits of address 0 are the lower 12 bits of coefficient a4, and the remaining highest bit of a4 is the lowest bit of address 1; the upper 11 bits of address 1 are the lower 11 bits of coefficient a9, and the remaining 2 bits of a9 are the lowest 2 bits of address 2; the upper 2 bits of address 10 are the lower 2 bits of a54, and the remaining 11 bits of a54 are the lowest 11 bits of address 11; and so on. The pattern repeats every 13 addresses, and the 52 addresses occupied by one polynomial contain 4 repetitions. Therefore, when the coefficients of a public polynomial must be read sequentially and continuously from a 64-bit RAM, carefully designing the read strategy to cope with the split coefficients becomes a challenge. A conventional solution that comes to mind directly is to read the data of the second storage unit RAM_A into a 13 × 64 = 832-bit buffer, but this consumes a large number of registers, and the cycle overhead of waiting for the 832-bit buffer to fill is large.
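The cross-address layout described above follows from simple bit arithmetic. The Python sketch below is an illustration under the assumption that coefficients are packed least significant first with no padding; the helper names are hypothetical and do not come from the patent.

```python
COEFF_BITS = 13   # width of one public polynomial coefficient (from the text)
WORD_BITS = 64    # RAM word width (from the text)
N = 256           # coefficients per polynomial


def pack_coefficients(coeffs):
    """Pack 13-bit coefficients back to back into a list of 64-bit words."""
    stream = 0
    for i, c in enumerate(coeffs):
        stream |= (c & ((1 << COEFF_BITS) - 1)) << (i * COEFF_BITS)
    n_words = (len(coeffs) * COEFF_BITS + WORD_BITS - 1) // WORD_BITS
    return [(stream >> (w * WORD_BITS)) & ((1 << WORD_BITS) - 1)
            for w in range(n_words)]


def straddling_coefficients(n=N):
    """Map each split coefficient index to (address, low bits held there)."""
    split = {}
    for i in range(n):
        first_bit = i * COEFF_BITS
        last_bit = first_bit + COEFF_BITS - 1
        if first_bit // WORD_BITS != last_bit // WORD_BITS:
            split[i] = (first_bit // WORD_BITS,
                        WORD_BITS - first_bit % WORD_BITS)
    return split


if __name__ == "__main__":
    split = straddling_coefficients()
    for index in (4, 9, 54):
        address, low_bits = split[index]
        print(f"a{index}: {low_bits} low bits at address {address}, "
              f"{COEFF_BITS - low_bits} high bits at address {address + 1}")
```

Under this packing assumption one polynomial fills 52 words, and the split positions repeat with a period of 13 words (832 bits, or 64 coefficients), which agrees with the pattern described in the text.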
Based on the way the public polynomial is stored in the RAM, this design uses the data loading module shown in Fig. 4, in which the low 52 bits of the 64-bit first register buffer0 load the 4 complete consecutive coefficients in each address and output them in sequence. The second register buffer1 and the third register buffer2 alternately load the partial high-order bits and partial low-order bits of the address data: for example, the second register buffer1 first reads the upper 12 bits of address 0, waits for the remaining 1 bit to complete a4, and outputs it one cycle after a3, while the third register buffer2 first reads the upper 11 bits of address 1, waits for the remaining 2 bits to complete a9, and outputs it after a8. This alternating read control logic is implemented by a counter and a state machine in the Buffer control unit. Whenever a complete coefficient has been assembled, the one-of-three selector selects one of the first register buffer0, the second register buffer1 and the third register buffer2 for reading; the two remaining registers, namely the first delay register and the second delay register, delay the coefficient stream by two cycles so that a[j] and a[(j+2) mod n] are available together, and once both streams are ready they enter the coefficient multiplication accumulation circuit at the same time. a[j] is the coefficient stream required by the third line of the algorithm in Fig. 3; it is updated once per cycle, and the inner loop index j determines which public polynomial coefficient takes part in the multiplication in each cycle.
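The loading strategy can be modelled in software as a stream that reads one 64-bit word at a time and carries the partial bits of a split coefficient over to the next word. The Python sketch below is a behavioural illustration only, not the register-level design of Fig. 4: a single carry variable stands in for buffer1 and buffer2, the function names are hypothetical, and the wrap-around of the index modulo n is not modelled.

```python
from collections import deque

COEFF_BITS = 13
WORD_BITS = 64
MASK = (1 << COEFF_BITS) - 1


def stream_coefficients(words):
    """Yield 13-bit coefficients in order, reading one 64-bit word at a time."""
    carry, carry_bits = 0, 0            # low part of a coefficient split across words
    for word in words:
        pos = 0
        if carry_bits:                  # complete the split coefficient first
            need = COEFF_BITS - carry_bits
            yield carry | ((word & ((1 << need) - 1)) << carry_bits)
            pos = need
        while pos + COEFF_BITS <= WORD_BITS:   # whole coefficients in this word
            yield (word >> pos) & MASK
            pos += COEFF_BITS
        carry_bits = WORD_BITS - pos    # bits left over at the top of the word
        carry = word >> pos if carry_bits else 0


def with_two_cycle_delay(stream):
    """Yield (delayed, current) pairs, modelling two delay registers in series."""
    delay_line = deque([None, None], maxlen=2)   # None: register not yet filled
    for current in stream:
        delayed = delay_line[0]
        delay_line.append(current)
        yield delayed, current
```

From the third step onward each pair holds a[j] together with a[j+2], which is the relationship between the two coefficient streams that the text feeds into the two DSPs.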
In the other case in Saber, vector inner-product polynomial multiplication, the operation is performed in the ring [Figure BDA0003571761340000071]. The coefficients of the public polynomial are stored as 16 bits, with the highest 6 bits padded with zeros to ease subsequent data alignment, so the coefficients do not straddle addresses in the RAM. The first register buffer0 can therefore simply be loaded, and only the first delay register and the second delay register after the one-of-three selector are needed for continuous output. The public polynomial and the secret polynomial in the ring [Figure BDA0003571761340000072] are further constrained to [Figure BDA0003571761340000073]. The coefficients of the polynomial are then all 10 bits wide; if they were stored directly in a 64-bit-wide RAM they would also be split across addresses, so they are zero-padded in the high-order bits to 16 bits and are therefore not split by the RAM. This padding is not performed inside the multiplier and consumes few computing resources.
In both cases, the coefficients of the secret polynomial S(x) are 4 bits wide, and the data of the first storage unit RAM_S can be read directly according to the address index in the algorithm. Once the coefficient streams Coeff_a0, Coeff_a1 and Com_s are ready, the first DSP and the second DSP begin to perform 4 coefficient multiplications. The high 16 bits and the low 16 bits of the outputs of the first DSP and the second DSP are taken as temporary results and subjected to modular reduction to obtain temp1-temp4, which are then modularly added to or subtracted from the first accumulation register acc1, the second accumulation register acc2, the third accumulation register acc3 and the fourth accumulation register acc4, respectively, under the control of Sign | [ b ] 3, as shown in FIG. 1. Both the public polynomial and the secret polynomial are generated by the function SHAKE-128, which can produce an unbounded stream of random coefficients, but the inputs to the function, called seeds, differ when the two polynomials are generated; the seed of the secret polynomial must be kept secret, which is why that polynomial is called the secret polynomial.
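One way a single wide multiplier can yield two coefficient products whose results appear in the high and low 16 bits is sketched below. The packing layout is an assumption made for illustration (secret coefficients in sign-magnitude form with a magnitude of at most 3 bits, the second magnitude shifted left by 16), not a statement of the patented circuit; `dsp_dual_multiply` and `accumulate` are hypothetical names.

```python
Q = 1 << 13  # modulus q = 2^13, as given for the embodiment


def dsp_dual_multiply(a, s_lo, s_hi):
    """Model one 13-bit x 19-bit multiply that produces two products.

    Assumptions: 0 <= a < 2^13, and |s_lo|, |s_hi| <= 7 so that each
    partial product fits in 16 bits and cannot spill into its neighbour.
    Signs are handled separately by the add/subtract step.
    """
    packed = abs(s_lo) | (abs(s_hi) << 16)  # at most 19 bits wide
    product = a * packed                    # the single wide multiplication
    low = product & 0xFFFF                  # equals a * |s_lo|
    high = (product >> 16) & 0xFFFF         # equals a * |s_hi|
    return low % Q, high % Q                # reduction mod 2^13 is a truncation


def accumulate(acc, temp, negative):
    """Modular subtraction when the sign flag is set, else modular addition."""
    return (acc - temp) % Q if negative else (acc + temp) % Q


if __name__ == "__main__":
    t_lo, t_hi = dsp_dual_multiply(8191, -4, 3)
    acc = accumulate(0, t_lo, negative=True)   # contribution of 8191 * (-4)
    acc = accumulate(acc, t_hi, negative=False)
    print(t_lo, t_hi, acc)
```

Because q is a power of two, the modular reduction of each 16-bit half amounts to keeping its low 13 bits, which is why the step costs almost nothing in hardware.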
The invention compares the efficient parallel schoolbook algorithm for Saber with the traditional schoolbook algorithm, as shown in fig. 3 and fig. 5. Conventional schoolbook polynomial multiplication is a convolution of the two polynomial operands. The accumulator register acc is initialized, and each time the inner-loop variable j is incremented, the multiplier coefficient a[j] is multiplied with each coefficient b[i] of the multiplicand polynomial in turn; because the computation takes place in the ring R_q, the polynomial b(x) must be multiplied by x in the outer loop to shift its order. The result in the bold-line box in fig. 6 is obtained each time the outer loop finishes, but in a lightweight implementation the size of the accumulator register is limited because resources are scarce, so accumulated values must frequently be written to and read from the memory unit during multiplication. The algorithm in fig. 5 is different: the optimized schoolbook algorithm for Saber changes the order in which a(x) and s(x) are read. Reading a[j], a[(j+2) mod n] and s[(i-j) mod n], s[(i-j+1) mod n] in this way allows a fixed set of accumulation registers to accumulate continuously, avoids the frequent reading and writing of accumulated values required by traditional schoolbook polynomial multiplication, and saves unnecessary cycles. After one iteration of the outer loop, 4 consecutive coefficients of the result are obtained, as indicated by the thin-line boxes in fig. 6. This is equivalent to performing a small convolution. After 16384 cycles, a complete polynomial multiplication is finished.
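The read order described above can be illustrated with a software model that produces 4 consecutive result coefficients per outer iteration from the operands a[j], a[(j+2) mod n], s[(i-j) mod n] and s[(i-j+1) mod n]. It is a sketch under the assumption that the ring is Z_q[x]/(x^n + 1), so a product that wraps past degree n-1 changes sign; the function names are hypothetical and the code does not reproduce the exact pseudocode of the figures.

```python
def schoolbook_negacyclic(a, s, q):
    """Reference schoolbook product in Z_q[x]/(x^n + 1)."""
    n = len(a)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k >= n:
                res[k - n] = (res[k - n] - a[i] * s[j]) % q  # x^n = -1
            else:
                res[k] = (res[k] + a[i] * s[j]) % q
    return res


def four_accumulator_schoolbook(a, s, q):
    """Compute 4 consecutive output coefficients per outer iteration.

    Each inner step models one clock cycle: two public coefficients and
    two secret coefficients feed four multiply-accumulate operations.
    Returns the product and the number of modelled cycles.
    """
    n = len(a)
    c = [0] * n
    cycles = 0
    for i in range(0, n, 4):
        acc = [0, 0, 0, 0]          # stand-ins for acc1..acc4
        for j in range(n):
            j2 = (j + 2) % n
            # (public index, output index) for the four products
            pairs = ((j, i), (j, i + 1), (j2, i + 2), (j2, i + 3))
            for slot, (ja, k) in enumerate(pairs):
                prod = a[ja] * s[(k - ja) % n]
                if ja > k:          # term wrapped around x^n, so subtract
                    acc[slot] = (acc[slot] - prod) % q
                else:
                    acc[slot] = (acc[slot] + prod) % q
            cycles += 1
        c[i:i + 4] = acc
    return c, cycles


if __name__ == "__main__":
    n, q = 256, 1 << 13
    a = [(7 * j + 3) % q for j in range(n)]
    s = [((5 * j) % 9) - 4 for j in range(n)]   # small signed test values
    c, cycles = four_accumulator_schoolbook(a, s, q)
    assert c == schoolbook_negacyclic(a, s, q)
    print("cycles:", cycles)
```

With n = 256 the model performs 64 outer iterations of 256 steps each, which matches the 16384-cycle figure stated in the text.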
It should be emphasized that in this embodiment the polynomial multiplier is optimized on its own, and its RAM resources can be shared with the overall design. In this embodiment, Vivado 2018.3 is used to build the above hardware structure on an Artix-7 FPGA, with q = 2^13 and n = 256. The design reaches an overall frequency of 130 MHz, and one polynomial multiplication requires 16384 clock cycles. It consumes 561 LUTs, 302 FFs and 2 DSP48E1 slices. Compared with existing lightweight polynomial multipliers for Saber, this embodiment provides a higher operating speed with almost the same hardware resources, saving 3087 cycles and raising the frequency.
The same and similar parts in the various embodiments in this specification may be referred to each other. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is simple, and the relevant points can be referred to the description in the method embodiment.
The invention has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the invention. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the technical solution of the present invention and its embodiments without departing from the spirit and scope of the present invention, which fall within the scope of the present invention. The scope of the invention is defined by the appended claims.

Claims (2)

1. A polynomial hardware multiplier based on Saber key encapsulation is characterized by comprising an addressing circuit, a common polynomial data loading module, a coefficient multiplication accumulation circuit and a control module;
the addressing circuit comprises a first storage unit and a second storage unit; the first output end of the control module is electrically connected with the first input end of the first storage unit through a first address line, the second output end of the control module is electrically connected with the second input end of the first storage unit through a second address line, and the third output end of the control module is electrically connected with the input end of the second storage unit through a third address line;
the public polynomial data loading module comprises a Buffer control unit, a first register, a second register, a third register, a one-of-three selector, a first delay register and a second delay register; the input end of the Buffer control unit is electrically connected with the output end of the second storage unit, the first output end of the Buffer control unit is electrically connected with the input end of the first register, the second output end of the Buffer control unit is electrically connected with the input end of the second register, the third output end of the Buffer control unit is electrically connected with the input end of the third register, and the fourth output end of the Buffer control unit is electrically connected with the control end of the one-of-three selector; the output end of the first register is electrically connected with the first input end of the one-of-three selector; the output end of the second register is electrically connected with the second input end of the one-of-three selector; the output end of the third register is electrically connected with the third input end of the one-of-three selector; the output end of the one-of-three selector is electrically connected with the input end of the first delay register; the output end of the first delay register is electrically connected with the input end of the second delay register;
the coefficient multiplication and accumulation circuit comprises a first DSP and a second DSP; the first input end of the first DSP is electrically connected with the output end of the second delay register, and the output end of the first DSP is electrically connected with the input end of a first inverter, the first input end of a first one-of-two selector, the input end of a second inverter and the first input end of a second one-of-two selector; the output end of the first inverter is electrically connected with the second input end of the first one-of-two selector; the output end of the second inverter is electrically connected with the second input end of the second one-of-two selector; the output end of the first one-of-two selector is electrically connected with the input end of a first accumulation register; the output end of the second one-of-two selector is electrically connected with the input end of a second accumulation register; the second input end of the first DSP is electrically connected with the first output end and the second output end of the first storage unit;
the first input end of the second DSP is electrically connected with the output end of the one-of-three selector, the second input end of the second DSP is electrically connected with the first output end and the second output end of the first storage unit, and the output end of the second DSP is electrically connected with the input end of a third inverter, the first input end of a third one-of-two selector, the input end of a fourth inverter and the first input end of a fourth one-of-two selector; the output end of the third inverter is electrically connected with the second input end of the third one-of-two selector; the output end of the fourth inverter is electrically connected with the second input end of the fourth one-of-two selector; the output end of the third one-of-two selector is electrically connected with the input end of a third accumulation register; the output end of the fourth one-of-two selector is electrically connected with the input end of a fourth accumulation register;
and a fourth output end, a fifth output end, a sixth output end and a seventh output end of the control module are respectively and electrically connected with the control end of the first one-of-two selector, the control end of the second one-of-two selector, the control end of the third one-of-two selector and the control end of the fourth one-of-two selector.
2. A method for using a Saber key encapsulation based polynomial hardware multiplier, wherein the method is applied to the polynomial hardware multiplier in claim 1, and comprises:
when matrix-vector multiplication is carried out, the public polynomial data loading module operates in the ring [Figure FDA0003571761330000021]; in the first cycle after the polynomial hardware multiplier starts, the low 52 bits of the 64-bit first register load bits [51:0] of the address 0 data in the second storage unit, namely polynomial coefficients a3, a2, a1 and a0, and polynomial coefficient a0 is read out through the one-of-three selector; bits [11:0] of the second register load bits [63:52] of the address 0 data in the second storage unit, namely the lower 12 bits of polynomial coefficient a4, and the complete polynomial coefficient is assembled when the data of the next address arrives;
in the next three cycles, the 64-bit first register sequentially outputs polynomial coefficients a1, a2 and a3; in the following cycle, the highest bit of the second register is loaded with the lowest bit of the address 1 data to form polynomial coefficient a4, which is output, and bits [10:0] of the third register are loaded with bits [63:53] of the address 1 data, namely the lower 11 bits of polynomial coefficient a9; one cycle later the first register reads bits [52:1] of the address 1 data, namely polynomial coefficients a8, a7, a6 and a5, and outputs them; the one-of-three selector controls the first register, the second register and the third register so that coefficients are output sequentially and continuously; the first delay register and the second delay register delay the coefficient stream by two cycles, and the coefficient streams a[j] and a[(j+2) mod 256] enter the coefficient multiply-accumulate circuit simultaneously; the alternating read state repeats every 13 addresses, and therefore repeats 4 times, i.e., over 52 addresses, in a single polynomial multiplication;
when vector inner-product multiplication is carried out, the public polynomial data loading module operates in the ring [Figure FDA0003571761330000022], and only the 64-bit first register, the first delay register and the second delay register are enabled; after the coefficient streams from the addressing circuit are ready in the coefficient multiplication and accumulation circuit, the first DSP and the second DSP each perform a 19-bit by 13-bit multiplication, which is equivalent to two coefficient multiplications; the multiplication results of the first DSP and the second DSP are produced in the next cycle, and after the low 16 bits and the high 16 bits undergo modular reduction, they are, under the control signals of the control module and through the selection of the first one-of-two selector, the second one-of-two selector, the third one-of-two selector and the fourth one-of-two selector, modularly added to or subtracted from the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register; after 256 cycles, 4 consecutive coefficients of the product polynomial are obtained and written into the first storage unit and the second storage unit, and after 16384 cycles in total the 256-order polynomial multiplication is complete; every 256 cycles the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register read the coefficients of the corresponding previous polynomial multiplication result from the first storage unit and the second storage unit, which is equivalent to initializing the first accumulation register, the second accumulation register, the third accumulation register and the fourth accumulation register.
CN202210321371.4A 2022-03-30 2022-03-30 Polynomial hardware multiplier based on Saber key encapsulation and use method Pending CN114780057A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210321371.4A CN114780057A (en) 2022-03-30 2022-03-30 Polynomial hardware multiplier based on Saber key encapsulation and use method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210321371.4A CN114780057A (en) 2022-03-30 2022-03-30 Polynomial hardware multiplier based on Saber key encapsulation and use method

Publications (1)

Publication Number Publication Date
CN114780057A true CN114780057A (en) 2022-07-22

Family

ID=82425527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210321371.4A Pending CN114780057A (en) 2022-03-30 2022-03-30 Polynomial hardware multiplier based on Saber key encapsulation and use method

Country Status (1)

Country Link
CN (1) CN114780057A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117112030A (en) * 2023-09-12 2023-11-24 南京微盟电子有限公司 Register set address automatic accumulation circuit and application method
CN117112030B (en) * 2023-09-12 2024-03-26 南京微盟电子有限公司 Register set address automatic accumulation circuit and application method

Similar Documents

Publication Publication Date Title
JPH10187438A (en) Method for reducing transition to input of multiplier
JP2002152014A (en) Hardware accelerator for coefficient adaptation on the basis of normal least mean square algorithm
US7920150B2 (en) Image scaling system capable of saving memory
CN114780057A (en) Polynomial hardware multiplier based on Saber key encapsulation and use method
US11556614B2 (en) Apparatus and method for convolution operation
CN112639839A (en) Arithmetic device of neural network and control method thereof
US4063082A (en) Device generating a digital filter and a discrete convolution function therefor
CN113222129B (en) Convolution operation processing unit and system based on multi-level cache cyclic utilization
CN114003198A (en) Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium
WO2021168644A1 (en) Data processing apparatus, electronic device, and data processing method
CN109669666A (en) Multiply accumulating processor
US6549925B1 (en) Circuit for computing a fast fourier transform
US20140089370A1 (en) Parallel bit reversal devices and methods
US7058787B2 (en) Method and circuit for generating memory addresses for a memory buffer
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN115033293A (en) Zero-knowledge proof hardware accelerator, generating method, electronic device and storage medium
JPH0767063B2 (en) Digital signal processing circuit
US5621337A (en) Iterative logic circuit
CN115481721B (en) Psum calculation circuit for convolutional neural network
KR100235537B1 (en) Variable tap of digital filter and multiplier circuit thereof
JP2957845B2 (en) Fast Fourier transform device
TWI777231B (en) Device for computing an inner product of vectors
JP5072558B2 (en) Data processing device
JP2778478B2 (en) Correlation processor
JP2001043084A (en) Processor system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination