CN116893797A - Iterative NTT system based on FIFO storage

Info

Publication number: CN116893797A
Application number: CN202310710946.6A
Authority: CN
Inventors: 陈涧升; 崔益军; 牛万泽; 刘伟强; 王成华
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2023-06-15
Filing date: 2023-06-15
Publication date: 2023-10-17

Abstract

The application discloses an iterative NTT system based on FIFO storage, which comprises a storage address control unit, a FIFO input control unit, a first FIFO storage unit, a second FIFO storage unit, a third FIFO storage unit, a FIFO output control unit, a butterfly operation unit and a distributed ROM. The operation result output by the butterfly operation unit is stored in two of the FIFO storage units according to the address corresponding to the current stage; after storage is finished, the upper-path and lower-path input data are output from the FIFO storage units to the butterfly operation unit for operation according to the number-theoretic transform (NTT) rule. The first, second and third FIFO storage units are accessed alternately, and NTT address conversion is completed in groups of two according to the current stage number and the corresponding number of butterfly operations. The application can improve the utilization efficiency of on-chip FPGA resources, the maximum operating frequency and the overall performance of the system.

Description

Iterative NTT system based on FIFO storage

Technical Field

The application relates to the field of the Dilithium scheme in lattice-based post-quantum cryptography, and in particular to an iterative NTT system based on FIFO storage.

Background

With the rapid development of information technology and the internet, ensuring security and confidentiality during information transmission has become one of the key problems in the information technology field. In current cryptography, classical public-key algorithms such as RSA and elliptic curve cryptography (ECC) are at risk with the advent of quantum computers, so new cryptographic schemes are needed to withstand quantum attacks, and post-quantum cryptographic schemes have emerged.

Lattice-based cryptography offers high security and has therefore attracted particular attention. Three of the four post-quantum cryptography schemes standardized by NIST are lattice-based. Of the three digital signature schemes, Dilithium, which has the same mathematical structure as Kyber, the only key encapsulation mechanism (KEM) scheme, has received more attention. Dilithium uses NTT-friendly parameters, so the NTT can be used to accelerate polynomial computation in the core arithmetic unit of the lattice cipher. Existing NTT accelerators all use dedicated memories and must access data by address, which limits lightweight optimization of the NTT accelerator.

Disclosure of Invention

The application aims to solve the problems of long operation period and limited speed of the existing iterative NTT architecture, and provides an iterative NTT system based on FIFO storage, which optimizes the iterative NTT architecture and greatly reduces the waiting period of a butterfly operation unit; and the on-chip resource utilization efficiency of the FPGA is improved by optimizing the storage mode in the NTT operation.

In order to achieve the technical purpose, the application adopts the following technical scheme:

an iterative NTT system based on FIFO storage comprises a storage address control unit, a FIFO input control unit, a first FIFO storage unit, a second FIFO storage unit, a third FIFO storage unit, a FIFO output control unit, a butterfly operation unit and a distributed ROM;

the first FIFO storage unit, the second FIFO storage unit and the third FIFO storage unit are connected in parallel between the FIFO input control unit and the FIFO output control unit and store the preprocessed data; the distributed ROM is connected to the butterfly operation unit and stores the twiddle factors; the input of the butterfly operation unit is connected to the output of the FIFO output control unit, and its output is connected to the input of the storage address control unit; the output of the storage address control unit is connected to the input of the FIFO input control unit;

after preprocessing, the initial data first pass through the storage address control unit for address interleaving, then enter two of the FIFO storage units as selected by the decision indication signal, and, once stored, are immediately output from the FIFO storage units to the butterfly operation unit for operation;

the operation result output by the butterfly operation unit is routed according to the decision indication signal output by the storage address control unit and stored in two of the FIFO storage units according to the address corresponding to the current stage; after storage is finished, the upper-path and lower-path input data are output from the FIFO storage units to the butterfly operation unit for operation according to the number-theoretic transform rule, and this process is repeated until the whole iterative operation is completed; the first FIFO storage unit, the second FIFO storage unit and the third FIFO storage unit are accessed alternately, and NTT address conversion is completed in groups of two according to the current stage number and the corresponding number of butterfly operations.

Further, the butterfly operation unit adopts a GS structure or a CT structure.

Further, when the GS structure is adopted, the butterfly operation unit comprises a modular addition module, a modular subtraction module and a Barrett modular multiplication module; the modular addition module and the modular subtraction module respectively add and subtract the two data to be processed, the addition result is output directly, and the subtraction result is multiplied by the corresponding twiddle factor in the Barrett modular multiplication module and then output.

Further, when the CT structure is adopted, the butterfly operation unit comprises a modular addition module, a modular subtraction module and a Barrett modular multiplication module;

the Barrett modular multiplication module multiplies one of the data to be processed by the corresponding twiddle factor; the modular addition module and the modular subtraction module respectively add and subtract the multiplication result output by the Barrett modular multiplication module and the other data to be processed, and then output the results.

Further, the writing sequence and the reading sequence of the first FIFO storage unit, the second FIFO storage unit and the third FIFO storage unit are opposite;

specifically, each write follows upper path {first FIFO storage unit, third FIFO storage unit} and lower path {second FIFO storage unit, first FIFO storage unit}, and each read follows upper path {first FIFO storage unit, second FIFO storage unit} and lower path {third FIFO storage unit, first FIFO storage unit}.

Further, for the NTT transform of an $n$-point input, a total of $\log_2 n$ stages of operations are performed, each stage having $n/2$ butterfly transformations.

Further, the number of input points of the butterfly operation unit is 128, and the number of stages of the iterative NTT system is 7.

Further, the first FIFO storage unit and the second FIFO storage unit use 64 × 12 FIFO memories, and the third FIFO storage unit uses a 32 × 12 FIFO memory.

Further, for 128-point input, the number-theoretic transform rule is:

the first stage uses one butterfly group with 64 butterfly operations per group, and the second stage uses two butterfly groups with 32 butterfly operations per group; the number of butterfly groups grows as a power of 2 from stage to stage, the number of butterfly operations per group shrinks correspondingly as a power of 2, and the total number of butterfly operations per stage remains 64;

when the number-theoretic transform algorithm is carried out, the data enter the first round of transformation after preprocessing; the distance between the two input points of the butterfly operation unit in the first round is $n/2$, and the four butterfly transformations of the lower half are multiplied by the 0th to $(n/2-1)$th powers of the twiddle factor, respectively; in the second round, the original sequence is split into two groups, each of which undergoes $n/4$ butterfly transformations, the distance between the two input points of the butterfly operation unit is $n/4$, and the two butterfly transformations of the upper part and of the lower part are each multiplied by the 0th to $(n/4-1)$th powers of $\omega^2$, the square of the twiddle factor; in the last round, the sequence is divided into four groups, each of which undergoes one butterfly transformation, and the distance between the two input points of the butterfly operation unit is $n/8$.

Compared with the prior art, the application has the following beneficial effects:

first, the iterative NTT system based on FIFO storage can reduce the BF waiting period of iterative NTT, and greatly improve the data processing efficiency.

Secondly, the iterative NTT system based on FIFO storage adopts the FIFO storage unit to replace the BRAM unit, and simultaneously provides a novel data access mode, so that the on-chip resource utilization efficiency of the FPGA can be improved, and the maximum operation frequency and the overall performance of the system are improved.

Drawings

Fig. 1 is a diagram of an existing BRAM-based iterative NTT architecture;

Fig. 2 is a structural diagram of the two butterfly operation units;

Fig. 3 is a block diagram of the iterative NTT system based on FIFO storage according to the present application;

Fig. 4 is a diagram illustrating the change of the operand addresses of the FIFO units;

Fig. 5 is an 8-point NTT butterfly transformation diagram.

Detailed Description

Embodiments of the present application are described in further detail below with reference to the accompanying drawings.

Referring to Fig. 3, the application discloses an iterative NTT system based on FIFO storage, which comprises a storage address control unit, a FIFO input control unit, a first FIFO storage unit, a second FIFO storage unit, a third FIFO storage unit, a FIFO output control unit, a butterfly operation unit and a distributed ROM.

Fig. 1 is a diagram of a conventional BRAM-based iterative NTT architecture. The NTT is essentially an algorithm that performs the DFT on a polynomial in a modular sense: it works modulo the number of elements of the polynomial coefficient field, and uses the congruence theorem from number theory to convert the DFT over the complex field into polynomial operations in the modular sense.

For polynomials, the most common representation is the coefficient representation, e.g., $A(x) = 1 + x + 2x^2 + 3x^3 + 4x^4$. A polynomial of degree $d$, $P(x) = p_0 + p_1 x + p_2 x^2 + \ldots + p_d x^d$, can be expressed in coefficient representation as $[p_0, p_1, p_2, \ldots, p_d]$. The same polynomial (curve) can also be characterized by $d+1$ points, denoted $\{(x_0, P(x_0)), (x_1, P(x_1)), \ldots, (x_d, P(x_d))\}$; such a representation is called the point-value representation.

In the FFT, selecting specific points for the point-value representation is critical, and the NTT likewise computes with specially chosen points derived from the twiddle factor $\omega$. For a polynomial of degree $n-1$ expressed in point-value form, the $n$ special points selected are the powers $\omega^0, \omega^1, \ldots, \omega^{n-1}$ of the twiddle factor $\omega$. The polynomial coefficients are denoted $a_0, a_1, \ldots, a_{n-1}$, and the values after the NTT/FFT are denoted $\hat{a}_0, \hat{a}_1, \ldots, \hat{a}_{n-1}$. The NTT transform is represented by the following formula, where $\omega$ is an integer primitive $n$-th root of unity:

Assume that $c(x) = a(x) \times b(x)$. The steps of polynomial multiplication based on the NTT algorithm are as follows:

1. Calculate the $n$ powers of the twiddle factor: $[\omega^0, \omega^1, \omega^2, \ldots, \omega^{n-1}]$.

2. Calculate the point-value forms $\hat{a}$ and $\hat{b}$ of $a(x)$ and $b(x)$ using the NTT (evaluation operation).

3. Calculate the point-wise product $\hat{a} \circ \hat{b}$ to obtain $\hat{c}$.

4. Use the INTT to convert $\hat{c}$ to $c(x)$ (interpolation operation).
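The four steps above can be sketched in Python as follows. This is a minimal illustration only, not the hardware flow of the application: it uses a naive $O(n^2)$ transform, computes a cyclic product (modulo $x^n - 1$), and the function names `ntt`, `intt` and `poly_mul_cyclic` as well as the choice $n = 8$ are assumptions made for the example. The twiddle factor is derived from 17, which has multiplicative order 256 modulo 3329.

```python
Q = 3329  # modulus used later in the document


def ntt(a, omega, q=Q):
    """Naive O(n^2) NTT: evaluate a(x) at omega^0 .. omega^(n-1) modulo q."""
    n = len(a)
    return [sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]


def intt(a_hat, omega, q=Q):
    """Inverse NTT: the same sum with omega^-1, scaled by n^-1 (interpolation)."""
    n_inv = pow(len(a_hat), -1, q)          # modular inverse, Python 3.8+
    return [x * n_inv % q for x in ntt(a_hat, pow(omega, -1, q), q)]


def poly_mul_cyclic(a, b, omega, q=Q):
    """Steps 2-4: transform both inputs, multiply point-wise, transform back."""
    a_hat, b_hat = ntt(a, omega, q), ntt(b, omega, q)
    c_hat = [x * y % q for x, y in zip(a_hat, b_hat)]
    return intt(c_hat, omega, q)


n = 8
omega = pow(17, 256 // n, Q)   # 17 has order 256 mod 3329, so omega has order 8
a = [1, 2, 3, 0, 0, 0, 0, 0]   # 1 + 2x + 3x^2
b = [4, 5, 0, 0, 0, 0, 0, 0]   # 4 + 5x
print(poly_mul_cyclic(a, b, omega))   # [4, 13, 22, 15, 0, 0, 0, 0]
```

The inputs are zero-padded so that the product has degree below $n$ and the cyclic wrap-around does not alter the result.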

The specific flow of the algorithm is as follows:

For the M-LWE-based lattice cryptography implementation, the modulus of the new parameter set is reduced from 7681 to 3329, and the highest degree of the polynomial is 256, i.e., the input has 256 points. However, this modulus does not satisfy the condition $q \equiv 1 \pmod{2n}$ required for NTT multiplication: modulo 3329 there are primitive 256th roots of unity but no primitive 512th roots of unity. Therefore, under the NIST calculation rules, the polynomial $x^{256}+1$ cannot be decomposed modulo 3329 into a product of 256 degree-1 terms, but it can be decomposed into a product of 128 degree-2 terms, as shown in equation (2):

where $\zeta = 17$, and the primitive 256th roots of unity are $\{\zeta, \zeta^3, \zeta^5, \ldots, \zeta^{255}\}$. The NTT of a polynomial over the ring $R_{3329}$ can then be redefined as equation (3), which consists of 128 polynomials of degree 1:
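The properties of $q = 3329$ and $\zeta = 17$ stated above can be checked numerically. The sketch below assumes that the degree-2 factors have the form $x^2 - \zeta^{2i+1}$ for $i = 0, \ldots, 127$, which is the usual form for this modulus; the helper names `poly_mul` and `product_of_quadratics` are illustrative.

```python
Q, ZETA = 3329, 17


def poly_mul(a, b, q=Q):
    """Schoolbook product of two coefficient lists (lowest degree first) mod q."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % q
    return out


def product_of_quadratics():
    """Multiply out (x^2 - zeta^(2i+1)) for i = 0..127 (assumed factor form)."""
    prod = [1]
    for i in range(128):
        root = pow(ZETA, 2 * i + 1, Q)
        prod = poly_mul(prod, [(-root) % Q, 0, 1])
    return prod


# q - 1 = 3328 is divisible by 256 but not by 512, so q = 1 (mod 2n) fails for n = 256.
print((Q - 1) % 256 == 0, (Q - 1) % 512 == 0)
# zeta^128 = -1 (mod q), so zeta has order exactly 256.
print(pow(ZETA, 128, Q) == Q - 1)
# The 128 quadratic factors multiply out to x^256 + 1.
print(product_of_quadratics() == [1] + [0] * 255 + [1])
```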

By observation, the constant-term coefficients for the odd and even terms are calculated as:

In the above formula, if $\zeta^2 = 289$ is taken as a new twiddle factor, the expression corresponds to a 128-point NTT; that is, a 256-point NTT can be split into two 128-point NTTs for processing. However, the subsequent point-wise multiplication (PWM) changes correspondingly: for polynomials $f, g \in R_{3329}$, the product of their NTT representations consists of 128 products of degree-1 polynomials, and in this particular PWM each resulting product term must be reduced to a polynomial of degree 1, as follows:

the simplified odd and even terms are written as:

Because the NTT operates on data of small bit width, the multiplications in the NTT algorithm of the application are implemented using the DSP resources of the FPGA. To reduce resource consumption, the five multiplications in the PWM can be reduced to four by reusing the even-term results through a simple mathematical transformation:
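A sketch of how five multiplications can become four is given below. It assumes the transformation meant here is the usual Karatsuba-style identity, in which the products $a_0 b_0$ and $a_1 b_1$ computed for the even term are reused for the odd term; the function names and the parameter `root` (standing for the relevant odd power of $\zeta$) are illustrative.

```python
Q = 3329


def basemul_5(a0, a1, b0, b1, root, q=Q):
    """(a0 + a1*x)(b0 + b1*x) mod (x^2 - root), written with five multiplications."""
    c0 = (a0 * b0 + a1 * b1 * root) % q   # a0*b0, a1*b1, (a1*b1)*root
    c1 = (a0 * b1 + a1 * b0) % q          # a0*b1, a1*b0
    return c0, c1


def basemul_4(a0, a1, b0, b1, root, q=Q):
    """Same result with four multiplications, reusing the even-term products."""
    p0 = a0 * b0 % q                      # multiplication 1
    p1 = a1 * b1 % q                      # multiplication 2
    c0 = (p0 + p1 * root) % q             # multiplication 3
    c1 = ((a0 + a1) * (b0 + b1) - p0 - p1) % q   # multiplication 4
    return c0, c1


root = pow(17, 3, Q)                      # one odd power of zeta = 17, as an example
print(basemul_5(123, 456, 789, 1011, root) == basemul_4(123, 456, 789, 1011, root))
```

The saving comes at the cost of two extra additions and two extra subtractions, which are cheap compared with a DSP multiplier.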

The polynomial multiplication parameters of the NTT algorithm for q = 3329 are therefore as shown in Table 1:

Table 1. Polynomial multiplication parameters of the NTT algorithm for q = 3329

From previous analysis of NTT, by parity division, we can get:

in practice, however, resolution of the first n/2 term and the last n/2 term may also be performed to yield the following two formulas:

$a_{k+n/2} = X_k + \omega^{k+n/2} Y_k = X_k - \omega^{k} Y_k$ (13)

Based on these two splitting modes, the application designs two butterfly operation unit structures, CT (Cooley-Tukey) and GS (Gentleman-Sande), as shown in Fig. 2, where (a) is the butterfly operation unit of the GS structure and (b) is the butterfly operation unit of the CT structure. The adder and subtractor perform modular addition and modular subtraction. Specifically, when the GS structure is adopted, the butterfly operation unit comprises a modular addition module, a modular subtraction module and a Barrett modular multiplication module; the modular addition module and the modular subtraction module respectively add and subtract the two data to be processed, the addition result is output directly, and the subtraction result is multiplied by the corresponding twiddle factor in the Barrett modular multiplication module and then output. When the CT structure is adopted, the butterfly operation unit comprises the same three modules; the Barrett modular multiplication module multiplies one of the data to be processed by the corresponding twiddle factor, and the modular addition module and the modular subtraction module respectively add and subtract the multiplication result and the other data to be processed, and then output the results.
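The two butterfly structures can be modelled in software as below. This is a behavioural sketch only: the Barrett parameters (`K = 24`, `M = floor(2^K / q)`) are assumptions chosen so that any product of two values below $q$ fits the reduction, and the text does not state the constants used in the actual circuit.

```python
Q = 3329
K = 24                      # assumed shift; (Q - 1)^2 < 2^24
M = (1 << K) // Q           # precomputed Barrett constant floor(2^K / Q)


def barrett_mul(x, y):
    """Multiply x, y in [0, Q) and reduce modulo Q without a division."""
    z = x * y
    t = (z * M) >> K        # estimate of z // Q, at most one too small
    r = z - t * Q           # r lies in [0, 2Q)
    return r - Q if r >= Q else r


def gs_butterfly(u, v, w):
    """GS: add and subtract first, then multiply the difference by the twiddle w."""
    return (u + v) % Q, barrett_mul((u - v) % Q, w)


def ct_butterfly(u, v, w):
    """CT: multiply v by the twiddle w first, then add and subtract."""
    t = barrett_mul(v, w)
    return (u + t) % Q, (u - t) % Q


print(gs_butterfly(5, 7, 3))   # (12, 3323)
print(ct_butterfly(5, 7, 3))   # (26, 3313)
```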

The two butterfly operation units differ mainly in the order of the multiplication and the modular addition and subtraction. The algorithms for modular addition and modular subtraction are shown in Algorithm 2 and Algorithm 3:

In the hardware circuit, the sum and the difference are each 13 bits wide, and the modular addition or modular subtraction result in $[0, q-1]$ is obtained by examining the 1-bit carry or borrow. Modular addition and modular subtraction complete in the same clock cycle. Because the upper half of the butterfly operation unit involves only a modular addition, its result must be output at the same time as the multiplication result; to align the timing, a 5-stage flip-flop chain is added as a delay unit.
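Algorithms 2 and 3 are not reproduced in this text. A common way to realise modular addition and subtraction with a single conditional correction, consistent with the 13-bit description above, is sketched here; it is an assumption about the intended algorithms, and the names `mod_add` and `mod_sub` are illustrative.

```python
Q = 3329


def mod_add(a, b, q=Q):
    """a + b mod q for a, b in [0, q): one addition and one conditional subtraction."""
    s = a + b               # at most 2q - 2 = 6656, which fits in 13 bits
    d = s - q               # a borrow (d < 0) means s was already below q
    return s if d < 0 else d


def mod_sub(a, b, q=Q):
    """a - b mod q for a, b in [0, q): one subtraction and one conditional addition."""
    d = a - b               # a borrow (d < 0) means q must be added back
    return d + q if d < 0 else d


print(mod_add(3000, 1000), mod_sub(1000, 3000))   # 671 1329
```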

The two butterfly operation units mentioned in the previous section, GS and CT, differ not only in the order of the multiplication within the butterfly but also in the order of the input and output data. The input of the CT butterfly operation unit must undergo bit reversal before entering the BF unit, whereas the output of the GS butterfly operation unit must undergo bit reversal before entering the next stage of operation. Different NTT architectures are designed around the GS and CT butterfly operation units. Taking 8 points as an example, Fig. 5 is an 8-point NTT butterfly diagram implemented with GS. As Fig. 5 shows, in the GS NTT architecture the input data must be preprocessed, while the BF unit of the last stage does not need to perform multiplication; correspondingly, in the CT NTT architecture preprocessing is generally not needed, but there are special requirements on how the twiddle factors vary.

Taking the GS butterfly diagram as an example, an $n$-point NTT requires $\log_2 n$ stages in total, and each stage has $n/2$ butterfly transformations. For 128-point input, the first stage has one butterfly group with 64 butterfly operations per group, the second stage has two butterfly groups with 32 butterfly operations per group, and so on. The number of butterfly groups thus grows as a power of 2 from stage to stage, while the number of butterfly operations per group shrinks correspondingly, so the total number of butterfly operations per stage remains 64. When the NTT algorithm is carried out, the data enter the first round after preprocessing (the specific counts below correspond to the 8-point example of Fig. 5); the distance between the two input points of the butterfly operation unit in the first round is $n/2$, and the four butterfly transformations of the lower half are multiplied by the 0th to $(n/2-1)$th powers of the twiddle factor, respectively. In the second round, the original sequence is split into two groups, each requiring $n/4$ butterfly transformations; the distance between the two input points is $n/4$, and the two butterfly transformations of the upper part and of the lower part are each multiplied by the 0th to $(n/4-1)$th powers of $\omega^2$, the square of the twiddle factor. In the last round, the sequence is divided into four groups, each requiring only one butterfly transformation, and the distance between the two input points is $n/8$, i.e., adjacent points. The address transformation in the NTT algorithm is shown in Algorithm 4.
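Algorithm 4 is not reproduced in this text. The stage, group and distance pattern described above can be sketched as an in-place GS loop; this is a software illustration under stated assumptions, not the address logic of the application. It computes a plain cyclic NTT without the preprocessing step, takes $\omega = 289$ as the 128-point twiddle factor, and leaves the output in bit-reversed order. The names `gs_ntt` and `bit_reverse` are illustrative.

```python
Q = 3329


def bit_reverse(k, bits):
    """Reverse the lowest `bits` bits of k."""
    return int(format(k, f"0{bits}b")[::-1], 2)


def gs_ntt(a, omega, q=Q):
    """In-place GS NTT; returns (bit-reversed output, [(groups, distance), ...])."""
    a = list(a)
    n = len(a)
    dist = n // 2                 # distance between the two butterfly inputs
    w_stage = omega               # omega^(2^stage): twiddle base of this stage
    schedule = []
    while dist >= 1:
        groups = n // (2 * dist)  # doubles every stage
        schedule.append((groups, dist))
        for g in range(groups):
            base = g * 2 * dist
            w = 1
            for j in range(dist):
                u, v = a[base + j], a[base + j + dist]
                a[base + j] = (u + v) % q                # upper path: sum
                a[base + j + dist] = (u - v) * w % q     # lower path: diff * twiddle
                w = w * w_stage % q
        w_stage = w_stage * w_stage % q
        dist //= 2
    return a, schedule


data = list(range(128))
out, schedule = gs_ntt(data, 289)
print(schedule)   # [(1, 64), (2, 32), (4, 16), (8, 8), (16, 4), (32, 2), (64, 1)]
```

Every stage performs `groups * dist = 64` butterflies, and in the last stage the only twiddle used is $\omega^0 = 1$, which matches the remark that the last-stage BF unit needs no multiplication.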

Compared with the pipelined NTT with clear structure and distinct hierarchy, the iterative NTT has a simpler structure and needs a more complex address control unit because only one butterfly operation unit structure is utilized.

The data access of the iterative NTT system is handled by three FIFO storage units. The address control follows the same change of starting-point groups as in the pipelined design: at each stage the starting-point group and the number of butterfly operations per group change according to the NTT rule, and the reading and writing of data is controlled by the start signal. The application uses two 64 × 12 FIFOs and one 32 × 12 FIFO to store the data entering and leaving the BF operation. The three FIFO storage units are accessed alternately, and NTT address conversion is completed in groups according to the current stage number and the required number of butterfly operations. For example, for 128-point input, the serial data are first split into the first 64 items and the last 64 items, which are input to the upper and lower paths of the BF unit to complete the first-stage processing. For the second stage, the number of data written and read (i.e., the number of butterfly operations per group) is 32, and two butterfly groups are required. Therefore, upper-path items 0 to 31 are sent to the first FIFO storage unit (FIFO1) and items 32 to 63 to the third FIFO storage unit (FIFO3), while lower-path items 64 to 95 are sent to the second FIFO storage unit (FIFO2) and items 96 to 127 to FIFO1. When the FIFO data are read, FIFO1 feeds the upper path and FIFO3 the lower path; after 32 data have been read, FIFO2 feeds the upper path and FIFO1 the lower path, and so on. Each write thus follows upper path {FIFO1, FIFO3} and lower path {FIFO2, FIFO1}, and each read follows upper path {FIFO1, FIFO2} and lower path {FIFO3, FIFO1} (units listed in order of access), which guarantees the address conversion of the data input to the BF unit, as shown in Fig. 4, where (a) shows the addresses written to the FIFO units in the first stage and (b) the addresses written in the second stage.
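The second-stage write and read pattern described above can be simulated with three queues. The sketch uses item indices as stand-in data and, as a simplification, finishes all writes before reading; the function name `route_second_stage` is illustrative.

```python
from collections import deque


def route_second_stage(n=128):
    """Write first-stage results into three FIFOs, then read second-stage pairs."""
    fifo1, fifo2, fifo3 = deque(), deque(), deque()
    half, quarter = n // 2, n // 4

    # Write: each cycle the butterfly emits one upper and one lower result.
    for cycle in range(half):
        upper, lower = cycle, cycle + half
        if cycle < quarter:
            fifo1.append(upper)   # upper 0..31   -> FIFO1
            fifo2.append(lower)   # lower 64..95  -> FIFO2
        else:
            fifo3.append(upper)   # upper 32..63  -> FIFO3
            fifo1.append(lower)   # lower 96..127 -> FIFO1

    # Read: FIFO1/FIFO3 feed the first 32 butterflies, FIFO2/FIFO1 the next 32.
    pairs = []
    for cycle in range(half):
        if cycle < quarter:
            pairs.append((fifo1.popleft(), fifo3.popleft()))
        else:
            pairs.append((fifo2.popleft(), fifo1.popleft()))
    return pairs


pairs = route_second_stage()
print(pairs[0], pairs[31], pairs[32], pairs[63])   # (0, 32) (31, 63) (64, 96) (95, 127)
```

Every pair delivered to the butterfly has index distance 32, i.e. $n/4$, which is the input distance required by the second stage; FIFO1 holds at most 64 items and FIFO3 at most 32, in line with the stated depths.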

In the FIFO-based iterative NTT implementation, the data are preprocessed and then address-interleaved, and enter the corresponding FIFO units as selected by the decision indication signal; after each BF operation, the results are routed by the decision indication signal and stored according to the address corresponding to the current stage, and so on until the whole iterative operation is completed.

After synthesis and implementation, the performance and resource consumption of the FIFO-based iterative NTT implementation for q = 3329 are shown in Table 2:

Table 2. Performance and resource consumption of the FIFO-based iterative NTT implementation for q = 3329

Compared with the BRAM-based iterative NTT, the application raises the maximum frequency by 12% while keeping resource occupation and the total number of operation cycles essentially unchanged; it optimizes the overall architecture through its storage scheme and improves access efficiency.

The application can be implemented in other lattice ciphers calculated by using NTT as well, and the increase of the number of points of the polynomial operation does not affect the method proposed by the application. In addition, some NTT calculations based on conventional NTT improvements, such as parity independent NTT operations in Kyber, may also employ the methods herein.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the application can be implemented in various computer languages, such as the object-oriented programming language Java, the interpreted scripting language JavaScript, and the like.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. The iterative NTT system based on FIFO storage is characterized by comprising a storage address control unit, a FIFO input control unit, a first FIFO storage unit, a second FIFO storage unit, a third FIFO storage unit, a FIFO output control unit, a butterfly operation unit and a distributed ROM;

2. The FIFO memory-based iterative NTT system according to claim 1, wherein the butterfly unit employs a GS structure or a CT structure.

3. The FIFO storage-based iterative NTT system according to claim 1, wherein when a GS structure is used, the butterfly operation unit comprises a modular addition module, a modular subtraction module, and a Barrett modular multiplication module; the modular addition module and the modular subtraction module respectively add and subtract the two data to be processed, the addition result is output directly, and the subtraction result is multiplied by the corresponding twiddle factor in the Barrett modular multiplication module and then output.

4. The FIFO storage-based iterative NTT system of claim 1, wherein when a CT structure is employed, the butterfly operation unit comprises a modular addition module, a modular subtraction module, and a Barrett modular multiplication module;

5. The FIFO storage-based iterative NTT system of claim 1, wherein the first FIFO storage unit, the second FIFO storage unit, and the third FIFO storage unit have opposite writing and reading orders;

6. The FIFO storage-based iterative NTT system according to claim 1, wherein for the NTT transform of an $n$-point input, a total of $\log_2 n$ stages of operations are performed, each stage having $n/2$ butterfly transformations.

7. The FIFO memory-based iterative NTT system of claim 6, wherein the number of input points to the butterfly unit is 128 and the number of stages in the iterative NTT system is 7.

8. The FIFO storage-based iterative NTT system according to claim 7, wherein the first FIFO storage unit and the second FIFO storage unit employ 64 × 12 FIFO memories, and the third FIFO storage unit employs a 32 × 12 FIFO memory.

9. The FIFO storage-based iterative NTT system of claim 6, wherein for a 128-point input, the number-theoretic transform rule is:

when the number-theoretic transform algorithm is carried out, the data enter the first round of transformation after preprocessing; the distance between the two input points of the butterfly operation unit in the first round is $n/2$, and the four butterfly transformations of the lower half are multiplied by the 0th to $(n/2-1)$th powers of the twiddle factor, respectively; in the second round, the original sequence is split into two groups, each of which undergoes $n/4$ butterfly transformations, the distance between the two input points of the butterfly operation unit is $n/4$, and the two butterfly transformations of the upper part and of the lower part are each multiplied by the 0th to $(n/4-1)$th powers of $\omega^2$, the square of the twiddle factor; and in the last round, the sequence is divided into four groups, each of which undergoes one butterfly transformation, and the distance between the two input points of the butterfly operation unit is $n/8$.