CN111694541B

CN111694541B - Radix-32 operation circuit for number-theoretic transform multiplication

Info

Publication number: CN111694541B
Application number: CN202010371312.9A
Authority: CN
Inventors: 华斯亮; 张惠国; 刘玉申; 徐健; 卞九辉; 张静亚
Original assignee: Changshu Institute of Technology
Current assignee: Changshu Institute of Technology
Priority date: 2020-05-06
Filing date: 2020-05-06
Publication date: 2023-04-21
Anticipated expiration: 2040-05-06
Also published as: CN111694541A

Abstract

The invention discloses a radix-32 operation circuit for number-theoretic transform (NTT) multiplication. The circuit comprises 32 operand generation modules: each of the 32 input data is zero-padded at the high-order end and divided into 11 words of 6 bits each, and the words are merged into operands such that 1 module outputs 32 96-bit operands, 16 modules each output 11 192-bit operands, 3 modules each output 16 192-bit operands, and 12 modules each output 12 192-bit operands. Each operand generation module is connected to an operand modular addition module, which modularly adds the operands output by that module. A modulo-p module reduces the output of each operand modular addition module modulo the prime p = 2⁶⁴ - 2³² + 1. The invention merges the 1024 operands of the prior art into 400, which greatly reduces the computational cost and improves the efficiency of the radix-32 operation.

Description

A radix-32 operation circuit for number-theoretic transform multiplication

Technical field

The invention relates to an operation circuit, and in particular to a radix-32 operation circuit for number-theoretic transform (NTT) multiplication.

Background

Besides traditional long multiplication, large-integer multiplication can be performed with the Schönhage-Strassen algorithm. Its core idea is as follows: perform an FFT over a ring on each of two large integers of length n to obtain their frequency-domain representations; multiply the two frequency-domain representations pointwise to obtain the frequency-domain representation of the product; then perform an IFFT over the ring on that result to obtain the product. Using a number-theoretic transform (NTT) instead of a discrete Fourier transform replaces floating-point arithmetic with modular arithmetic and thereby avoids rounding errors. NTT multiplication refers specifically to multiplication that uses the number-theoretic transform within the Schönhage-Strassen algorithm. The NTT and the inverse NTT are the computational core of NTT multiplication and account for more than 90% of its computation and running time, so optimizing the speed, area and power consumption of the NTT has a decisive influence on the overall performance of NTT multiplication.
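The transform, pointwise multiplication and inverse transform described above can be illustrated with a small sketch. It is not the circuit of this patent: the function names, the 16-bit digit size and the naive O(N²) transform are assumptions chosen for readability, and the modulus is the prime p = 2⁶⁴ - 2³² + 1 that the description introduces later.

```python
P = 2**64 - 2**32 + 1  # prime modulus used later in this description


def to_digits(value, digit_bits, count):
    """Split a non-negative integer into `count` digits, least significant first."""
    mask = (1 << digit_bits) - 1
    return [(value >> (digit_bits * i)) & mask for i in range(count)]


def ntt(values, root):
    """Naive O(N^2) number-theoretic transform modulo P (illustration only)."""
    n = len(values)
    return [sum(v * pow(root, i * k, P) for i, v in enumerate(values)) % P
            for k in range(n)]


def ntt_multiply(a, b, digit_bits=16, n=32):
    """Multiply two integers below 2**256 via a length-32 cyclic convolution."""
    assert 0 <= a < 1 << (digit_bits * n // 2)
    assert 0 <= b < 1 << (digit_bits * n // 2)
    # Zero-pad to length n so the cyclic convolution equals the linear one.
    da = to_digits(a, digit_bits, n // 2) + [0] * (n // 2)
    db = to_digits(b, digit_bits, n // 2) + [0] * (n // 2)
    root = pow(2, 192 // n, P)  # 2 has order 192 mod P, so root has order n
    fa, fb = ntt(da, root), ntt(db, root)
    fc = [(x * y) % P for x, y in zip(fa, fb)]  # pointwise product
    inv_root, inv_n = pow(root, -1, P), pow(n, -1, P)
    conv = [(c * inv_n) % P for c in ntt(fc, inv_root)]  # inverse transform
    # Carry propagation: recombine the convolution coefficients.
    return sum(c << (digit_bits * i) for i, c in enumerate(conv))
```

Every convolution coefficient stays below 16 × 2³², far below p, so no information is lost to the modular reduction and the result is exact.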

A 1048576-point NTT can be decomposed into 4 stages of radix-32 operation units and twiddle-factor multiplications (1048576 = 32⁴). The twiddle factors can be computed in advance and stored in ROM, then read directly when needed. The radix-32 operation accounts for more than 90% of the computation of the NTT, so optimizing it is crucial to the efficiency of the NTT.

"FPGA Design and Implementation of Large Integer Multiplier", Xie Xing et al., Journal of Electronics and Information Technology, 2019, describes a hardware architecture for a large-integer multiplier based on the Schönhage-Strassen algorithm. The paper decomposes a 65536-point NTT into 64-point and 1024-point transforms, and builds the 1024-point NTT from two radix-32 stages connected in series. Its radix-32 operation comprises 32 shift units and tree-structured large-number summation units. Because of the "0"-padding scheme used in the paper, each tree-structured summation unit has to process 32 192-bit data, so the whole radix-32 operation has to process 32 × 32 = 1024 operands. This radix-32 circuit is not efficient enough, so its implementation requires considerable power and resources.

Summary of the invention

In view of the above defects of the prior art, the invention provides a radix-32 operation circuit for NTT multiplication that addresses the high power consumption and resource overhead of radix-32 operation circuits.

The technical solution of the invention is as follows: a radix-32 operation circuit for NTT multiplication, comprising:

operand generation modules, of which there are 32, numbered Xk, k = 0, 1, 2, ..., 31. Each operand generation module comprises a segmentation circuit, a merging circuit and a zero-filling circuit. The segmentation circuit zero-pads the high-order bits of each of the 32 input data and divides it into 11 words of 6 bits each; the divided input data are denoted x_{n,m}, 0 ≤ n < 32, 0 ≤ m < 11. The merging circuit merges the 32 × 11 words of input data into operands for output: among the merging circuits of the 32 operand generation modules, 1 outputs 32 96-bit operands, 16 each output 11 192-bit operands, 3 each output 16 192-bit operands, and 12 each output 12 192-bit operands. The zero-filling circuit fills with "0" the positions left vacant when the merging circuit outputs the operands;

operand modular addition modules, which modularly add the operands output by each operand generation module;

and

a modulo-p module, which reduces the data output by each operand modular addition module modulo the prime p and outputs the result, where p = 2⁶⁴ - 2³² + 1.

Further, the operand generation module that outputs 32 96-bit operands is numbered X0; in each 96-bit operand the last 11 words are the input data and the first 5 words are set to zero.

Further, the operand generation modules that output 11 192-bit operands are numbered Xk with k odd. Each operand OP_m is formed by merging the 32 words x_{n,m}, 0 ≤ n < 32, that share the same word index m, 0 ≤ m < 11. The position of the least significant bit of x_{n,m} within OP_m is given by 6 × (m + nk) (mod 192).

Further, the operand generation modules that output 16 192-bit operands are numbered X8, X16 and X24. The 16 operands are divided into 8 groups of 2 operands each: OP_0 and OP_1 form one group, OP_2 and OP_3 form another, and so on. The operands OP_{2j} and OP_{2j+1} of each group are formed by merging the 44 words x_{n,m}, 4j ≤ n ≤ 4j+3, 0 ≤ m < 11. The position of the least significant bit of x_{n,m} within OP_{2j} and OP_{2j+1} is given by 6 × (m + nk) (mod 192). A word x_{n,m} is placed in OP_{2j} by preference; if that position in OP_{2j} is already occupied, it is placed at the corresponding position in OP_{2j+1}.

Further, the operand generation modules that output 12 192-bit operands are numbered Xk with k even, excluding X0, X8, X16 and X24. The 12 operands are divided into 2 groups: OP_0 to OP_5 form one group and OP_6 to OP_11 form the other. The operands OP_{6j} to OP_{6j+5} of each group are formed by merging the 176 words x_{n,m}, 16j ≤ n ≤ 16j+15, 0 ≤ m < 11. The position of the least significant bit of x_{n,m} within OP_{6j} to OP_{6j+5} is given by 6 × (m + nk) (mod 192). The words x_{n,m} are merged into the operands with a period of 2 words and are placed by preference in the operand with the smaller index among OP_{6j} to OP_{6j+5}.

The advantage of the technical solution provided by the invention is as follows:

By exploiting the "zero-padded" vacancies left after the operands are shifted, the operands of the radix-32 operation in NTT multiplication are merged, reducing their number from 1024 in the prior art to 400. This greatly reduces the computational cost and improves the efficiency of the radix-32 operation.

Description of the drawings

Fig. 1 is a schematic diagram of the overall structure of the radix-32 operation circuit for NTT multiplication according to the invention.

Fig. 2 is a schematic diagram of how the segmentation circuit in an operand generation module zero-pads and divides the input data.

Fig. 3 is a schematic diagram of the segmentation circuit in an operand generation module.

Fig. 4 is a schematic diagram of the output data of the merging circuit of the X0 operand generation module.

Fig. 5 is a schematic diagram of the merging circuit of the X0 operand generation module.

Fig. 6 shows the operands merged by the merging circuit of the X1 operand generation module.

Fig. 7 shows the merging circuit for operand 0 (OP0) in the X1 operand generation module.

Fig. 8 shows the operands merged by the merging circuit of the X3 operand generation module.

Fig. 9 shows the operands merged by the merging circuit of the X16 operand generation module.

Fig. 10 shows the operands merged by the merging circuit of the X2 operand generation module.

Fig. 11 is a circuit diagram of the 32-operand modular addition module.

Fig. 12 is a circuit diagram of the 11-operand modular addition module.

Fig. 13 is a circuit diagram of the 16-operand modular addition module.

Fig. 14 is a circuit diagram of the 12-operand modular addition module.

Detailed description

The invention is further described below with reference to embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope; after reading this disclosure, modifications of the invention in various equivalent forms by those skilled in the art all fall within the scope defined by the claims appended to this application.

The radix-32 operation is defined by

$$X_k = \sum_{n=0}^{31} x_n \, W_{32}^{nk} \bmod p$$

where 0 ≤ k < 32, p is a prime, and W₃₂ is a 32nd root of unity.

Here the prime p is the Solinas prime p = 2⁶⁴ - 2³² + 1. This prime supports efficient modular reduction: 2¹⁹² mod p = 1, 2⁹⁶ mod p = -1, 2⁶⁴ mod p = 2³² - 1. Because the root of unity obtained for this prime, W₃₂ = 2⁶, is a power of 2, the multiply-accumulate operations above convert conveniently into shifts and modular additions, which reduces the computational complexity of the NTT. The radix-32 operation can therefore be written as

$$X_k = \sum_{n=0}^{31} x_n \, 2^{6nk} \bmod p$$
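The identities quoted for this prime can be checked numerically. The following minimal sketch (function names are illustrative, not taken from the patent) verifies them and confirms that replacing the multiplication by W₃₂^{nk} with a left shift by 6nk mod 192 bits yields the same radix-32 result.

```python
P = 2**64 - 2**32 + 1
W32 = 2**6

# The reduction identities stated in the text.
assert pow(2, 192, P) == 1
assert pow(2, 96, P) == P - 1          # i.e. -1 modulo P
assert pow(2, 64, P) == 2**32 - 1
# W32 has multiplicative order exactly 32.
assert pow(W32, 32, P) == 1 and pow(W32, 16, P) != 1


def radix32_direct(x):
    """Radix-32 operation with explicit multiplications by powers of W32."""
    return [sum(xn * pow(W32, n * k, P) for n, xn in enumerate(x)) % P
            for k in range(32)]


def radix32_shift(x):
    """Same operation using shifts only; exponents wrap at 192 since 2^192 = 1 mod P."""
    return [sum(xn << (6 * n * k % 192) for n, xn in enumerate(x)) % P
            for k in range(32)]
```

The shift amounts never exceed 186 bits, which is why the merged operands discussed below are at most 192 bits wide before any wrap-around is applied.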

Each x_n is divided into 11 words with 6 bits as the basic unit; the words are denoted x_{n,m}, 0 ≤ m < 11. x_n can be expressed as

$$x_n = \sum_{m=0}^{10} x_{n,m} \, 2^{6m}$$

where m denotes the m-th word, x_n is 64 bits wide, each x_{n,m} is 6 bits wide, and x_{n,10} has 4 valid data bits. After the input data are divided, the radix-32 operation can be written as the following formula; the "0 padding" makes it possible to merge the shifted operands and so reduce the number of operands in the modular addition.

$$X_k = \sum_{n=0}^{31} \sum_{m=0}^{10} x_{n,m} \, 2^{6(m+nk)} \bmod p$$
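As a check on the word decomposition, the sketch below splits each input into eleven 6-bit words and evaluates the radix-32 operation word by word, with every bit position taken modulo 192. The helper names are illustrative assumptions.

```python
P = 2**64 - 2**32 + 1


def split_words(xn):
    """Split a 64-bit value into eleven 6-bit words, least significant first."""
    return [(xn >> (6 * m)) & 0x3F for m in range(11)]


def radix32_from_words(x):
    """Evaluate X_k as a sum of 6-bit words shifted to 6*(m + n*k) mod 192."""
    words = [split_words(xn) for xn in x]
    out = []
    for k in range(32):
        total = 0
        for n in range(32):
            for m in range(11):
                total += words[n][m] << (6 * (m + n * k) % 192)
        out.append(total % P)
    return out
```

Because every position is a multiple of 6 and 192 is a multiple of 6, a word never straddles the wrap-around boundary, which is what allows words from different inputs to share one 192-bit operand.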

With reference to Fig. 1, the radix-32 operation circuit for NTT multiplication of this embodiment comprises 32 operand generation modules X0 to X31, operand modular addition modules and modulo-p modules. Depending on the number of input operands, the operand modular addition modules are of four kinds: a 32-operand modular addition module, 11-operand modular addition modules, 16-operand modular addition modules and 12-operand modular addition modules. In the circuit structure, the 32 64-bit input data serve as the input of every operand generation module; each operand generation module is followed by an operand modular addition module, and each operand modular addition module is followed by a modulo-p module.

Each operand generation module comprises a segmentation circuit, a merging circuit and a zero-filling circuit, which in turn divide, merge and zero-fill the 32 64-bit input data to form the operands. With reference to Fig. 2 and Fig. 3, the segmentation circuit pads each 64-bit input datum x_n with 2 zero bits at the high-order end to form 66-bit data, and then divides it into 11 words of 6 bits each. Because the top 2 bits of the 11th word are padding zeros, that word has 4 valid data bits. The segmentation is easy to implement in hardware and incurs almost no hardware overhead.

The operand generation modules are numbered Xk, k = 0, 1, 2, ..., 31. The merging circuit differs from module to module, but the modules fall into 4 groups by type, and the circuits within each group are similar.

Group 1: X0, 1 module in total; Group 2: X1, X3, X5 and the other modules with odd k, 16 in total; Group 3: X8, X16 and X24, 3 in total; Group 4: the modules with even k other than those in Groups 1 and 3, such as X2, X4 and X6, 12 in total.
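The grouping fixes how many operands each module hands to its modular addition module. A short tally, written as a sketch with an illustrative function name, reproduces the total of 400 operands stated in the summary.

```python
def operands_per_module(k):
    """Number of operands output by operand generation module Xk."""
    if k == 0:
        return 32            # Group 1: 32 operands of 96 bits
    if k % 2 == 1:
        return 11            # Group 2: 11 operands of 192 bits
    if k in (8, 16, 24):
        return 16            # Group 3: 16 operands of 192 bits
    return 12                # Group 4: 12 operands of 192 bits


counts = [operands_per_module(k) for k in range(32)]
total = sum(counts)          # 32 + 16*11 + 3*16 + 12*12
```

The prior-art scheme described in the background uses 32 operands for every one of the 32 outputs, hence 1024.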

The data merging of each group is explained below.

Group 1: the merging circuit of the X0 operand generation module.

The operands are in effect the aligned input data; in other words, each operand is derived from the 11 consecutive words of one datum output by the segmentation circuit. The merging circuit outputs 32 96-bit operands. Each new 96-bit operand consists of 16 words: the last 11 words are the input data and the first 5 words are set to zero. As shown in Fig. 4, operand n, OP_n, has 96 bits and is obtained by placing x_n in the low 66 bits and filling the high 30 bits with zeros. The merging circuit is shown in Fig. 5.

Group 2: the merging circuits of X1, X3, X5 and the other odd-numbered operand generation modules.

For the merging circuit of an operand generation module Xk with k odd, the input is the 32 64-bit input data and the output is 11 192-bit operands. Each operand OP_m is formed by merging the 32 words x_{n,m}, 0 ≤ n < 32, that share the same word index m, 0 ≤ m < 11. The position of the least significant bit of x_{n,m} within OP_m is given by 6 × (m + nk) (mod 192). X1 and X3 are used below as examples of how the output operands are composed.

The operands merged by the merging circuit of the X1 operand generation module are shown in Fig. 6. There are 11 merged operands, each formed from 32 words x_{n,m}, 0 ≤ n < 32, with the same word index m, 0 ≤ m < 11. The least significant bit of x_{0,0} is at position 6 × (0 + 0 × 1) (mod 192) = 0 in OP_0, and that of x_{1,0} is at position 6 × (0 + 1 × 1) (mod 192) = 6 in OP_0; the least significant bit of x_{0,1} is at position 6 × (1 + 0 × 1) (mod 192) = 6 in OP_1, and that of x_{31,1} is at position 6 × (1 + 31 × 1) (mod 192) = 0 in OP_1. The merging circuit for operand 0 (OP_0) in the X1 operand generation module is shown in Fig. 7.

The operands merged by the merging circuit of the X3 operand generation module are shown in Fig. 8. The least significant bit of x_{0,0} is at position 6 × (0 + 0 × 3) (mod 192) = 0 in OP_0, and that of x_{1,0} is at position 6 × (0 + 1 × 3) (mod 192) = 18 in OP_0; the least significant bit of x_{0,1} is at position 6 × (1 + 0 × 3) (mod 192) = 6 in OP_1, and that of x_{31,1} is at position 6 × (1 + 31 × 3) (mod 192) = 180 in OP_1.

The operands output by the merging circuits of the remaining odd-numbered operand generation modules follow by analogy.
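The odd-k merging rule can be modelled in a few lines. The sketch below (names are illustrative) builds the 11 operands for a given odd k, asserts that no two words ever claim the same position, and lets the sum of the operands be compared with the radix-32 definition.

```python
P = 2**64 - 2**32 + 1


def merge_odd(x, k):
    """Build the 11 192-bit operands OP_0..OP_10 of module Xk for odd k."""
    assert k % 2 == 1
    operands = [0] * 11
    used = [set() for _ in range(11)]     # occupied bit positions per operand
    for n, xn in enumerate(x):
        for m in range(11):
            word = (xn >> (6 * m)) & 0x3F
            pos = 6 * (m + n * k) % 192
            assert pos not in used[m]     # odd k: n -> n*k mod 32 is a bijection
            used[m].add(pos)
            operands[m] |= word << pos
    return operands
```

For example, with k = 3 the word x_{1,0} lands at bit 18 of OP_0, matching the X3 example above. Since each OP_m receives exactly 32 words on 32 distinct slots, every operand is fully packed and no zero filling is needed in this group.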

Group 3: the merging circuits of the X8, X16 and X24 operand generation modules.

The input is the 32 64-bit input data and the output is 16 192-bit operands. The 16 operands are divided into 8 groups of 2 operands each: OP_0 and OP_1 form one group, OP_2 and OP_3 form another, and so on. The operands OP_{2j} and OP_{2j+1} of each group are formed by merging the 44 words x_{n,m}, 4j ≤ n ≤ 4j+3, 0 ≤ m < 11. The position of the least significant bit of x_{n,m} within OP_{2j} and OP_{2j+1} is given by 6 × (m + nk) (mod 192). A word x_{n,m} is placed in OP_{2j} by preference; if that position in OP_{2j} is already occupied, it is placed at the corresponding position in OP_{2j+1}. All remaining vacant positions are filled with "0". Taking the output of the merging circuit of the X16 operand generation module as an example, as shown in Fig. 9, there are 8 groups of operands, each containing 2 merged operands. Each new 192-bit operand consists of 32 words; the data words come from 2 different input data, each of which supplies 11 consecutive words. The high 30 bits of the 192 bits and the 30 bits between the two runs of 11 consecutive words are filled with 0.
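The "prefer OP_{2j}, otherwise OP_{2j+1}" rule can be sketched as follows. The function name and the processing order of the words (by n, then by m) are assumptions; the patent figures may arrange the words differently within a group while obeying the same position formula.

```python
P = 2**64 - 2**32 + 1


def merge_group3(x, k):
    """Build the 16 192-bit operands of module Xk for k in (8, 16, 24)."""
    assert k in (8, 16, 24)
    operands = [0] * 16
    used = [set() for _ in range(16)]     # occupied bit positions per operand
    for n, xn in enumerate(x):
        j = n // 4                        # inputs 4j..4j+3 share OP_2j and OP_2j+1
        for m in range(11):
            pos = 6 * (m + n * k) % 192
            target = 2 * j if pos not in used[2 * j] else 2 * j + 1
            assert pos not in used[target]    # two operands per group always suffice
            used[target].add(pos)
            operands[target] |= ((xn >> (6 * m)) & 0x3F) << pos
    return operands
```

Unoccupied positions simply stay zero in this model, which plays the role of the zero-filling circuit. The inner assertion holds because, for these values of k, at most two words of a group ever map to the same position.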

Group 4: the merging circuits of the even-numbered operand generation modules other than those in Groups 1 and 3.

For the merging circuit of an operand generation module Xk where k is an even number other than 0, 8, 16 or 24, the input is the 32 64-bit input data and the output is 12 192-bit operands. The 12 operands are divided into 2 groups of 6 operands each: OP_0 to OP_5 form one group and OP_6 to OP_11 form the other. The operands OP_{6j} to OP_{6j+5} of each group are formed by merging the 176 words x_{n,m}, 16j ≤ n ≤ 16j+15, 0 ≤ m < 11. The position of the least significant bit of x_{n,m} within OP_{6j} to OP_{6j+5} is given by 6 × (m + nk) (mod 192). The words x_{n,m} are merged into the operands with a period of 2 words and are placed by preference in the operand with the smaller index among OP_{6j} to OP_{6j+5}. All remaining vacant positions are filled with "0". Taking the output of the merging circuit of the X2 operand generation module as an example, as shown in Fig. 10, there are 2 groups of operands, each containing 6 merged operands. The first group contains OP_0 to OP_5 and the second group contains OP_6 to OP_11. Each new 192-bit operand consists of 32 words, which come from 16 different input data, each of which supplies 2 consecutive words.
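One way to see why 6 operands per group are enough is to count how many words of a group compete for the same 6-bit slot. The sketch below does only that counting; it does not reproduce the exact layout of Fig. 10, and the names are illustrative.

```python
from collections import Counter

GROUP4 = [k for k in range(2, 32, 2) if k not in (8, 16, 24)]


def max_slot_load(k, j):
    """Largest number of words x_{n,m}, 16j <= n <= 16j+15, sharing one slot.

    Slot s corresponds to bit position 6*s, with s = (m + n*k) mod 32.
    """
    load = Counter((m + n * k) % 32
                   for n in range(16 * j, 16 * j + 16)
                   for m in range(11))
    return max(load.values())


loads = {(k, j): max_slot_load(k, j) for k in GROUP4 for j in (0, 1)}
```

Each word occupies exactly one slot, so a placement that always picks the lowest-indexed operand whose slot is still free needs as many operands as the busiest slot has words. For every k in this group that number is 6, giving 2 × 6 = 12 operands per module.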

Because the operand generation modules of the different groups output different numbers of operands, the operand modular addition modules comprise a 32-operand modular addition module, 11-operand modular addition modules, 16-operand modular addition modules and 12-operand modular addition modules.

The 32-operand modular addition module is shown in Fig. 11, where CSA denotes a carry-save adder, CPA denotes a ripple-carry adder, and "<<1" denotes shifting the carry output of a carry-save adder left by 1 bit. Of the 32 operands, those at positions 4i, i = 1, 2, ..., 8, are held back, and the remaining operands are fed three at a time into the first-layer CSAs. The carry output of each first-layer CSA, shifted left by 1 bit, its sum output, and the operand at position 4i, i = 1, 2, ..., 8, are fed into a second-layer CSA. The sum outputs of every two second-layer CSAs and the carry output of one of them, shifted left by 1 bit, are fed into a third-layer CSA. The carry output of each third-layer CSA, shifted left by 1 bit, its sum output, and the carry output of the other second-layer CSA of each pair, shifted left by 1 bit, are fed into a fourth-layer CSA. The sum outputs of every two fourth-layer CSAs and the carry output of one of them, shifted left by 1 bit, are fed into a fifth-layer CSA. The carry output of each fifth-layer CSA, shifted left by 1 bit, its sum output, and the carry output of the other fourth-layer CSA of each pair, shifted left by 1 bit, are fed into a sixth-layer CSA. The sixth layer has two CSAs: the carry output of the second CSA, shifted left by 1 bit, the sum output of the second CSA and the sum output of the first CSA are fed into the seventh-layer CSA (one in total). The carry output of the seventh-layer CSA, shifted left by 1 bit, its sum output, and the carry output of the first sixth-layer CSA, shifted left by 1 bit, are fed into the eighth-layer CSA. The carry output of the eighth-layer CSA, shifted left by 1 bit, and its sum output are fed into the CPA, and the result is then fed into the modular addition module. The modular addition module takes the 193-bit input and adds the low 192 bits to the 193rd bit; its output is congruent to the input modulo the prime p.

The 11-operand modular addition module is shown in Fig. 12, where CSA denotes a carry-save adder, CPA denotes a ripple-carry adder, and "ROL 1-bit" denotes rotating the carry output of a carry-save adder left by 1 bit. Of the 11 operands, operands 1, 2, 3; 5, 6, 7; and 9, 10, 11 are fed into the three first-layer CSAs respectively. The sum output of the first first-layer CSA, operand 4, and the carry output of the second first-layer CSA, rotated left by 1 bit, are fed into the first second-layer CSA. Operand 8, the carry output of the third first-layer CSA, rotated left by 1 bit, and its sum output are fed into the second second-layer CSA. The carry output of the first first-layer CSA, rotated left by 1 bit, the carry output of the first second-layer CSA, rotated left by 1 bit, and its sum output are fed into the first third-layer CSA. The sum output of the second first-layer CSA, the carry output of the second second-layer CSA, rotated left by 1 bit, and its sum output are fed into the second third-layer CSA. The sum output of the first third-layer CSA, the carry output of the second third-layer CSA, rotated left by 1 bit, and its sum output are fed into the fourth-layer CSA. The carry output of the first third-layer CSA, rotated left by 1 bit, the carry output of the fourth-layer CSA, rotated left by 1 bit, and its sum output are fed into the fifth-layer CSA. The carry output of the fifth-layer CSA, rotated left by 1 bit, and its sum output are fed into the CPA, and the result is then fed into the modular addition module. The modular addition module takes the 193-bit input and adds the low 192 bits to the 193rd bit; its output is congruent to the input modulo the prime p.
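The reason a 1-bit rotation can replace the usual 1-bit shift of the carry word is that 2¹⁹² ≡ 1 (mod p): a carry leaving bit 191 re-enters at bit 0. The sketch below models this with a simple sequential chain of carry-save adders. The chain order is an assumption made for brevity and is not the tree wiring of Fig. 12, although it uses the same number of CSAs (9 for 11 operands).

```python
P = 2**64 - 2**32 + 1
MASK = (1 << 192) - 1


def rol1(v):
    """Rotate a 192-bit value left by one bit."""
    return ((v << 1) | (v >> 191)) & MASK


def csa(a, b, c):
    """3:2 carry-save adder: a + b + c == sum_word + 2 * carry_word."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def fold193(v):
    """Add bit 192 (the 193rd bit) to the low 192 bits; valid since 2^192 = 1 mod P."""
    return (v & MASK) + (v >> 192)


def modular_add(operands):
    """Reduce 192-bit operands to one value congruent to their sum modulo P."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = csa(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, rol1(c)]      # rotated carry stays within 192 bits
    return fold193(sum(ops))              # final carry-propagate add, then fold
```

The rotation preserves the value modulo 2¹⁹² - 1, and p divides 2¹⁹² - 1, so every intermediate word remains congruent to the true partial sum modulo p while never growing beyond 192 bits.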

The 16-operand modular addition module is shown in Fig. 13, where CSA denotes a carry-save adder, CPA denotes a ripple-carry adder, and "<<1" denotes shifting the carry output of a carry-save adder left by 1 bit. Of the 16 operands, those at positions 4i, i = 1, 2, 3, 4, are held back, and the remaining operands are fed three at a time into the first-layer CSAs. The carry output of each first-layer CSA, shifted left by 1 bit, its sum output, and the operand at position 4i, i = 1, 2, 3, 4, are fed into a second-layer CSA. The sum outputs of every two second-layer CSAs and the carry output of one of them, shifted left by 1 bit, are fed into a third-layer CSA. The carry output of each third-layer CSA, shifted left by 1 bit, its sum output, and the carry output of the other second-layer CSA of each pair, shifted left by 1 bit, are fed into a fourth-layer CSA. The fourth layer has two CSAs: the carry output of the second CSA, shifted left by 1 bit, the sum output of the second CSA and the sum output of the first CSA are fed into the fifth-layer CSA (one in total). The carry output of the fifth-layer CSA, shifted left by 1 bit, its sum output, and the carry output of the first fourth-layer CSA, shifted left by 1 bit, are fed into the sixth-layer CSA. The carry output of the sixth-layer CSA, shifted left by 1 bit, and its sum output are fed into the CPA, and the result is then fed into the modular addition module. The modular addition module takes the 193-bit input and adds the low 192 bits to the 193rd bit; its output is congruent to the input modulo the prime p.

The 12-operand modular addition module is shown in Figure 14, where CSA denotes a carry-save adder, CPA denotes a ripple-carry adder, and "ROL 1-bit" denotes rotating the carry output of a carry-save adder left by 1 bit. The 12 operands are fed, three at a time, into the first-layer CSAs. For every two first-layer CSAs, their two sum outputs and the carry output of one of them, rotated left by 1 bit, are fed into a second-layer CSA. The carry output of the second-layer CSA rotated left by 1 bit, the sum output of the second-layer CSA, and the carry output of the other first-layer CSA of the pair rotated left by 1 bit are fed into a third-layer CSA. The third layer has two CSAs; the carry output of the second CSA rotated left by 1 bit, the sum output of the second CSA, and the sum output of the first CSA are fed into the single fourth-layer CSA. The carry output of the fourth-layer CSA rotated left by 1 bit, its sum output, and the carry output of the first third-layer CSA rotated left by 1 bit are fed into the fifth-layer CSA. The carry output of the fifth-layer CSA rotated left by 1 bit and its sum output are fed into the CPA, and the result is passed to the modular addition module. The modular addition module takes the 193-bit input, adds its lower 192 bits to its 193rd bit, and outputs a result that is congruent to the input modulo the prime p.
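The adder tree described above can be modelled in software. The following is a minimal behavioural sketch in Python, not the circuit itself: the function names (`rol1`, `csa`, `fold193`, `add12`) are hypothetical, and the pairing of first-layer CSAs is one reading of the text. It relies on the fact that 2¹⁹² ≡ 1 (mod p) for p = 2⁶⁴ - 2³² + 1, so doubling a 192-bit carry word is equivalent, modulo p, to rotating it left by 1 bit, and the 193rd bit of the CPA result can be added back into the low 192 bits.

```python
P = 2**64 - 2**32 + 1          # prime modulus p stated in the claims
W = 192                        # operand width in bits
MASK = (1 << W) - 1

def rol1(x):
    """Rotate a 192-bit value left by one bit ("ROL 1-bit")."""
    return ((x << 1) | (x >> (W - 1))) & MASK

def csa(a, b, c):
    """Carry-save adder: returns (sum, carry) with a + b + c == sum + 2*carry."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def fold193(x):
    """Modular addition module: add the 193rd bit to the low 192 bits."""
    return (x & MASK) + (x >> W)

def add12(ops):
    """Reduce twelve 192-bit operands to one 192-bit value congruent mod P."""
    assert len(ops) == 12 and all(0 <= v <= MASK for v in ops)
    # Layer 1: four CSAs, three operands each.
    (s0, c0), (s1, c1), (s2, c2), (s3, c3) = (
        csa(*ops[i:i + 3]) for i in range(0, 12, 3))
    # Layer 2: both sums of a pair plus the rotated carry of one of them.
    s4, c4 = csa(s0, s1, rol1(c0))
    s5, c5 = csa(s2, s3, rol1(c2))
    # Layer 3: layer-2 sum and rotated carry plus the pair's other rotated carry.
    s6, c6 = csa(rol1(c4), s4, rol1(c1))
    s7, c7 = csa(rol1(c5), s5, rol1(c3))
    # Layer 4: second layer-3 CSA (sum, rotated carry) plus first layer-3 sum.
    s8, c8 = csa(rol1(c7), s7, s6)
    # Layer 5: layer-4 outputs plus the rotated carry of the first layer-3 CSA.
    s9, c9 = csa(rol1(c8), s8, rol1(c6))
    # CPA: ordinary addition; the result can be 193 bits wide.
    return fold193(rol1(c9) + s9)
```

For any twelve 192-bit operands, `add12(ops) % P` equals `sum(ops) % P`, and the returned value fits in 192 bits.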

The modulo-p module reduces its input data modulo the prime p.
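The text does not say how the modulo-p module is built. As an illustration only, the sketch below shows one possible way to reduce a 192-bit value modulo p = 2⁶⁴ - 2³² + 1 without a general division, using identities that follow directly from the form of p: 2⁶⁴ ≡ 2³² - 1, 2⁹⁶ ≡ -1, 2¹²⁸ ≡ -2³² and 2¹⁶⁰ ≡ 1 - 2³² (mod p). The function name `mod_p_192` and the 32-bit word split are assumptions made for this example, not part of the disclosed circuit.

```python
P = 2**64 - 2**32 + 1  # prime modulus p stated in the claims

def mod_p_192(x):
    """Reduce a value below 2**192 modulo P without a general division."""
    assert 0 <= x < 1 << 192
    # Split x into six 32-bit words: x = sum(w[i] * 2**(32*i)).
    w = [(x >> (32 * i)) & 0xFFFFFFFF for i in range(6)]
    # Substitute the powers of 2**32 modulo P:
    #   2**64 = 2**32 - 1, 2**96 = -1, 2**128 = -2**32, 2**160 = 1 - 2**32.
    low = w[0] - w[2] - w[3] + w[5]
    high = w[1] + w[2] - w[4] - w[5]
    t = low + (high << 32)
    # t is small in magnitude, so a few corrections bring it into [0, P).
    while t < 0:
        t += P
    while t >= P:
        t -= P
    return t
```

In hardware the two loops would correspond to a small fixed number of conditional additions or subtractions of p.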

Claims

1. A radix-32 arithmetic circuit for number-theoretic transform multiplication, characterized by comprising 32 operand generation modules, numbered Xk, k = 0, 1, 2, ..., 31, each operand generation module comprising a segmentation circuit, a merging circuit and a zero-filling circuit, wherein the segmentation circuit pads each of the 32 input data with high-order zeros and divides it into 11 words of 6 bits each, the divided input data being x_{n,m}, 0≤n<32, 0≤m<11; the merging circuit combines the 32×11 words of divided input data into operands for output; among the merging circuits of the 32 operand generation modules, 1 outputs 32 96-bit operands, 16 each output 11 192-bit operands, 3 each output 16 192-bit operands, and 12 each output 12 192-bit operands; and the zero-filling circuit fills with "0" the vacant positions of the operands output by the merging circuit;

operand modular addition modules, each of which performs modular addition on the operands output by the corresponding operand generation module;

and

a modulo-p module, which reduces the data output by each operand modular addition module modulo a prime p and outputs the result, where p = 2⁶⁴ - 2³² + 1.

2. The radix-32 arithmetic circuit for number-theoretic transform multiplication according to claim 1, characterized in that the operand generation module that outputs 32 96-bit operands is numbered X0, the last 11 words of each 96-bit operand being input data and the first 5 words being set to zero.

3. The radix-32 arithmetic circuit for number-theoretic transform multiplication according to claim 1, characterized in that the operand generation modules that output 11 192-bit operands are numbered Xk with k odd; each operand OP_m, 0≤m<11, is formed by merging the 32 different input data x_{n,m}, 0≤n<32, that share the same word index m; and the position of the lowest bit of x_{n,m} within OP_m is given by 6×(m+nk) (mod 192).

4. The radix-32 arithmetic circuit for number-theoretic transform multiplication according to claim 1, characterized in that the operand generation modules that output 16 192-bit operands are numbered X8, X16 and X24; the 16 operands are divided into 8 groups of 2 operands each, OP_0 and OP_1 forming one group, OP_2 and OP_3 another, and so on; the operands OP_{2j} and OP_{2j+1} of each group are formed by merging the 44 different input data x_{n,m}, 4j≤n≤4j+3, 0≤m<11; the position of the lowest bit of x_{n,m} within OP_{2j} or OP_{2j+1} is given by 6×(m+nk) (mod 192); and x_{n,m} is placed preferentially in OP_{2j} or, if that position in OP_{2j} is already occupied, at the corresponding position in OP_{2j+1}.

5. The radix-32 arithmetic circuit for number-theoretic transform multiplication according to claim 1, characterized in that the operand generation modules that output 12 192-bit operands are numbered Xk with k even, excluding X0, X8, X16 and X24; the 12 operands are divided into 2 groups, OP_0 to OP_5 forming one group and OP_6 to OP_11 the other; the operands OP_{6j} to OP_{6j+5} of each group are formed by merging the 176 different input data x_{n,m}, 16j≤n≤16j+15, 0≤m<11; the position of the lowest bit of x_{n,m} within OP_{6j} to OP_{6j+5} is given by 6×(m+nk) (mod 192); and the x_{n,m} are merged into operands with a period of 2 words, being placed preferentially in the operand with the smaller index among OP_{6j} to OP_{6j+5}.
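The segmentation of claim 1 and the word placement of claim 3 can be illustrated with a short software model. The sketch below is a hedged reading of the claims, not the circuit: the names `split_words`, `operands_odd_k` and `TOTAL_OPERANDS` are hypothetical, and only the odd-k case is modelled. It also assumes, as an interpretation not stated in the claims, that module Xk computes the sum over n of x_n·2^(6nk) modulo p, which is what the placement 6×(m+nk) (mod 192) yields because 2¹⁹² ≡ 1 (mod p).

```python
P = 2**64 - 2**32 + 1                     # prime modulus p from claim 1
WORD_BITS, WORDS, WIDTH = 6, 11, 192

# Operand count over all 32 modules, as listed in claim 1.
TOTAL_OPERANDS = 1 * 32 + 16 * 11 + 3 * 16 + 12 * 12

def split_words(x):
    """Zero-pad a 64-bit input to 66 bits and split it into 11 six-bit words."""
    assert 0 <= x < 1 << 64
    return [(x >> (WORD_BITS * m)) & 0x3F for m in range(WORDS)]

def operands_odd_k(inputs, k):
    """Build the 11 operands OP_m of module Xk for odd k (claim 3)."""
    assert len(inputs) == 32 and k % 2 == 1
    words = [split_words(x) for x in inputs]   # words[n][m] models x_{n,m}
    ops = []
    for m in range(WORDS):
        op, used = 0, set()
        for n in range(32):
            pos = (WORD_BITS * (m + n * k)) % WIDTH   # lowest-bit position
            assert pos not in used   # for odd k every 6-bit slot is hit once
            used.add(pos)
            op |= words[n][m] << pos
        ops.append(op)
    return ops
```

Under the stated assumption, `sum(operands_odd_k(xs, k)) % P` equals `sum(x * pow(2, 6 * n * k, P) for n, x in enumerate(xs)) % P` for every odd k, and each OP_m is completely filled, so no zero-filling is needed in this case.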