CN103186360A

CN103186360A - Fast-operation digit-serial systolic double-basis binary finite field multiplier

Info

Publication number: CN103186360A
Application number: CN2013101154017A
Authority: CN
Inventors: 潘正祥; 杨春生; 白忠海; 李秋莹
Original assignee: Harbin Institute of Technology Shenzhen
Current assignee: Harbin Institute of Technology Shenzhen
Priority date: 2013-04-03
Filing date: 2013-04-03
Publication date: 2013-07-03
Anticipated expiration: 2033-04-03
Also published as: CN103186360B

Abstract

The invention relates to a fast-operation digit-serial systolic double-basis binary finite field multiplier comprising an input terminal B, k PE modules, an FRRP module, and an R3 module. The k PE modules are connected in series and operate over k cycles. In the first cycle the input of A is $A_0, A_1, \ldots, A_{k-1}$, B is input directly, and the result is reconstructed by the FRRP module and stored in register C. In the second cycle the input of A is $A_k, A_{k+1}, \ldots, A_{2k-1}$, B is input through the R3 module, and the result is likewise reconstructed by the FRRP module, added to the result of the first cycle, and stored in register C. Continuing in this way, in the k-th cycle the input of A is $A_{k(k-1)}, A_{k(k-1)+1}, \ldots, A_{k^2-1}$, B is input after passing through the R3 module (k-1) times, and the result is reconstructed by the FRRP module, added to the accumulated result of the previous (k-1) cycles, and stored in register C, which then outputs the final result.

Description

Fast-operation digit-serial systolic double-basis binary finite field multiplier

Technical field

The invention relates to binary finite field multipliers, and in particular to a fast-operation digit-serial systolic double-basis binary finite field multiplier.

Background

In recent years, elliptic curve cryptography (ECC) [1], [2] has received considerable attention in cryptographic research. As ECC has been adopted in public-key cryptosystems, a number of hardware implementation issues have arisen in its application. NIST recommends five binary fields: GF(2¹⁶³), GF(2²³³), GF(2²⁸³), GF(2⁴⁰⁹), and GF(2⁵⁷¹). In ECC-based cryptographic protocols, finite field multiplication is an essential operation for computing point multiplication. The efficiency of cryptosystem hardware is usually judged by area, power consumption, and performance.

For high-speed very-large-scale integration (VLSI) implementations, systolic array structures are the preferred choice. Many efficient systolic array multipliers over binary extension fields have been designed, and they can be classified as bit-parallel or bit-serial structures. Efficient bit-parallel systolic multipliers typically use an LSB-first or MSB-first algorithm, and their main advantage is throughput across the whole computation. However, for fields defined by general polynomials these structures require O(m²) XOR gates, O(m²) AND gates, O(m²) one-bit latches, and O(m) latency. To reduce time and space complexity, Lee [8], [9], [13] showed that for certain special polynomials, such as all-one polynomials, pentanomials, and trinomials, finite field multiplication can be formulated as a Toeplitz matrix-vector product (TMVP) in order to build bit-parallel systolic multipliers. Bit-serial systolic array multipliers need only O(m) space complexity, but they incur a longer computation latency.

As a trade-off between time complexity and space complexity, digit-serial systolic multipliers, which sit between bit-parallel and bit-serial multipliers, have been disclosed. A digit-serial polynomial-basis multiplier with a digit-serial-in, parallel-out structure was proposed in [20]. In such a multiplier, the m bits of a field element are subdivided into […] sub-words of d bits each. In each clock cycle one d-bit digit is processed, so that the m-bit multiplication is completed digit by digit. A scalable systolic multiplier using an intrinsic d×d-bit parallel Hankel matrix-vector product was proposed in [15], [16]; its latency is […] clock cycles. Digit-serial systolic multipliers that use different structures internally and externally have also been presented in the literature; their latency is […] clock cycles. As noted above, the design of low-complexity systolic finite field multipliers depends on the choice of irreducible polynomial and of representation basis, and these digit-serial multipliers require a high latency to complete a multiplication.

Summary of the invention

The technical problem solved by the invention is to construct a fast-operation digit-serial systolic double-basis binary finite field multiplier that overcomes the high latency existing multipliers require to complete a multiplication.

The technical solution of the invention is a fast-operation digit-serial systolic double-basis binary finite field multiplier comprising an input terminal B, k PE modules, an FRRP module, and an R3 module. The k PE modules are connected in series and operate over k cycles. In the first cycle the input of A is $A_0, A_1, \ldots, A_{k-1}$, B is input directly, and the result is reconstructed by the FRRP module and stored in register C. In the second cycle the input of A is $A_k, A_{k+1}, \ldots, A_{2k-1}$, B is input through the R3 module, and the result is likewise reconstructed by the FRRP module, added to the result of the first cycle, and stored in register C. Continuing in this way, in the k-th cycle the input of A is $A_{k(k-1)}, A_{k(k-1)+1}, \ldots, A_{k^2-1}$, B is input after passing through the R3 module (k-1) times, and the result is reconstructed by the FRRP module, added to the accumulated result of the previous (k-1) cycles, and stored in register C, which then outputs the result. The R3 module computes $Bx^{kd} \bmod F(x)$.

Each PE module comprises an R1 module, a CMP module, a CVP module, a PWM module, […] XOR gates, and […] latches. The output of the R3 module is passed to the R1 module and then undergoes coefficient conversion in the CMP module. A segment of A is input to the CVP module, which performs coefficient conversion on that segment. The results of the CMP and CVP modules are both input to the PWM module, which computes the product of $B_{in}$ and the segment of A. The product is accumulated through […] XOR gates, stored in […] latches, and output from the latches as […].

Here A is an element of the field defined by the trinomial $F(x) = 1 + x^n + x^m$ and is written $A = a_0 + a_1 x + \cdots + a_{m-1}x^{m-1}$, with m coefficients $(a_0, a_1, \ldots, a_{m-1})$. Using segmentation, the m-bit A is cut into $A_0, A_1, \ldots, A_{k^2-1}$ with d bits per segment and $k^2$ segments in total, so that […]. B is expressed in the double basis as $B = b_0\beta_0 + b_1\beta_1 + \cdots + b_{m-1}\beta_{m-1}$ and is the other input of the multiplier; C is the output.

In a further technical solution of the invention, the FRRP module comprises an FR module and an R2 module. The R2 module computes $C \bmod (x^m + 1)$. The input of the FR module is the result computed by the k serially connected PE modules; the FR module reconstructs this result and outputs it to the R2 module.

In a further technical solution of the invention, the CMP module comprises XOR gates XOR_1 and XOR_2, which are connected in parallel.

In a further technical solution of the invention, the CVP module is the XOR gate XOR_3.

In a further technical solution of the invention, the PWM module comprises three parallel AND gates AND_1, AND_2, and AND_3, which multiply the outputs of the CMP module and the CVP module point by point.

In a further technical solution of the invention, the FR module comprises two parallel XOR gates XOR_4 and XOR_5.

The technical effect of the invention is as follows. The invention constructs a fast-operation digit-serial systolic double-basis binary finite field multiplier comprising an input terminal B, k PE modules, an FRRP module, and an R3 module, with the k PE modules connected in series and operating over k cycles. In the first cycle the input of A is $(A_0, A_1, \ldots, A_{k-1})$, B is input directly, and the result is reconstructed by the FRRP module and stored in register C. In the second cycle the input of A is $(A_k, A_{k+1}, \ldots, A_{2k-1})$, B is input through the R3 module, and the result is likewise reconstructed by the FRRP module, added to the result of the first cycle, and stored in register C. In the k-th cycle the input of A is $(A_{k(k-1)}, A_{k(k-1)+1}, \ldots, A_{k^2-1})$, B is input after passing through the R3 module (k-1) times, and the result is reconstructed by the FRRP module, added to the accumulated result of the previous (k-1) cycles, and stored in register C, which then outputs the result.

The invention combines the polynomial basis and the MPB to establish double-basis multiplication. Some finite field multiplications can be realized in bit-parallel architectures through subquadratic TMVP. In binary fields GF(2^m), irreducible trinomials and pentanomials are widely used in cryptography, where the field size is usually large. The invention proposes a new digit-serial systolic double-basis multiplier based on the subquadratic TMVP formula: once a d×d Toeplitz multiplication is selected, the proposed structure achieves a very low latency of […] clock cycles.

Description of drawings

Fig. 1 is a schematic diagram of the structure of the invention.

Fig. 2 is a structural diagram of the digit-serial systolic multiplier of the invention.

Fig. 3 is a structural diagram of the processing element PE of the invention.

Fig. 4 is a detailed circuit diagram of the PE module of the invention.

Detailed description

The technical solution of the invention is further described below with reference to specific embodiments.

As shown in Fig. 2, a specific embodiment of the invention is a fast-operation digit-serial systolic double-basis binary finite field multiplier comprising an input terminal B, k PE modules, an FRRP module, and an R3 module. The k PE modules are connected in series and operate over k cycles. In the first cycle the input of A is $A_0, A_1, \ldots, A_{k-1}$, B is input directly, and the result is reconstructed by the FRRP module and stored in register C. In the second cycle the input of A is $A_k, A_{k+1}, \ldots, A_{2k-1}$, B is input through the R3 module, and the result is likewise reconstructed by the FRRP module, added to the result of the first cycle, and stored in register C. In the k-th cycle the input of A is $A_{k(k-1)}, A_{k(k-1)+1}, \ldots, A_{k^2-1}$, B is input after passing through the R3 module (k-1) times, and the result is reconstructed by the FRRP module, added to the accumulated result of the previous (k-1) cycles, and stored in register C, which then outputs the result. The R3 module computes $Bx^{kd} \bmod F(x)$.

Each PE module comprises an R1 module, a CMP module, a CVP module, a PWM module, […]. The product is accumulated through […] XOR gates, stored in […] latches, and output from the latches as […]. A is cut into segments of d bits each, $k^2$ segments in total, so that […]. B is expressed in the double basis as $B = b_0\beta_0 + b_1\beta_1 + \cdots + b_{m-1}\beta_{m-1}$ and is the other input of the multiplier; C is the output.

In a preferred embodiment of the invention, the FRRP module comprises an FR module and an R2 module. The R2 module computes $C \bmod (x^m + 1)$. The input of the FR module is the result computed by the k serially connected PE modules; the FR module reconstructs this result and outputs it to the R2 module.
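The reduction attributed to the R2 module, $C \bmod (x^m + 1)$, can be illustrated with a small sketch. It assumes a polynomial over GF(2) is stored as a Python integer whose bit i is the coefficient of $x^i$; this encoding and the function name `reduce_mod_xm_plus_1` are illustrative choices and do not come from the patent. Because $x^m \equiv 1$ modulo $x^m + 1$, every coefficient at position m + i folds back onto position i.

```python
def reduce_mod_xm_plus_1(c: int, m: int) -> int:
    """Reduce a GF(2) polynomial c modulo x^m + 1 (hypothetical helper).

    Bit i of c is the coefficient of x^i. Since x^m = 1 modulo x^m + 1,
    the part of c above bit m-1 is XORed back onto the low m bits.
    """
    mask = (1 << m) - 1
    while c >> m:
        c = (c & mask) ^ (c >> m)
    return c
```

In hardware this folding is pure wiring plus XOR gates, which is one plausible reason to prefer a modulus of this shape for the final reconstruction step.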

The inputs of the CMP module and the CVP module are $B_{in}$ and […], respectively. Their outputs both serve as inputs to the PWM module, whose output passes through […] XOR gates and […] latches to produce the output […]. The input of the R1 module is $B_{in}$; its output passes through m latches to produce $B_{out}$. The input of the CMP module is $Bx^{dk(i+1)+jd}$ and its output is $[B^{(p+q)}, B^{(p+q+1)}, \ldots, B^{(p+q+d-1)}]$. The input of the CVP module is $A_{ik+j}$ and its output is $[a_q, a_{q+1}, \ldots, a_{q+d-1}]^T$. Here […] denotes the numbers of rows and columns of the matrix into which […] is arranged; $i, j = 0, 1, \ldots, k-1$, where i is the row index and j is the column index; $p = dk(i+1)+jd$; $q = (ik+j)d$; and T denotes the transpose of $[a_q, a_{q+1}, \ldots, a_{q+d-1}]$. The output is accumulated with the result of the previous FRRP module and passed to the next FRRP module.

Fig. 1 shows the overall structure of the systolic-array double-basis multiplier. A, B, and C are three elements of GF(2^m), the field defined by the irreducible trinomial $F(x) = 1 + x^n + x^m$ with $n \le m/2$. Element A is represented in the polynomial basis, while B and C are represented in the double basis. The multiplier computes $C = AB \bmod F(x)$, with A and B as inputs and C as the output. A is written $A = a_0 + a_1 x + \cdots + a_{m-1}x^{m-1}$, with m coefficients $(a_0, a_1, \ldots, a_{m-1})$. Using segmentation, the m-bit A is cut into $k^2$ segments of d bits each, so that […]. Each segment $A_i$ can be written $A_i = a_{id} + a_{id+1}x + \cdots + a_{id+d-1}x^{d-1}$, and the segments $A_0, A_1, \ldots, A_{k^2-1}$ replace A as the input of the multiplier. B is expressed in the double basis as $B = b_0\beta_0 + b_1\beta_1 + \cdots + b_{m-1}\beta_{m-1}$ and is the other input of the multiplier. C is the output, computed as $C = AB \bmod F(x)$, which is the function realized by the whole multiplier.
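The segmentation of A and the target operation $C = AB \bmod F(x)$ can be sketched in a few lines. This is only an illustration under stated assumptions: both operands are held in the polynomial basis as Python integers (bit i is the coefficient of $x^i$), whereas the patent represents B and C in the double basis; the function names and the toy parameters m = 12, n = 3, d = 3, k = 2 are hypothetical.

```python
def split_into_digits(a: int, d: int, k: int) -> list[int]:
    """Cut A into k*k digits of d bits each, A_0 being the least significant."""
    mask = (1 << d) - 1
    return [(a >> (i * d)) & mask for i in range(k * k)]


def join_digits(digits: list[int], d: int) -> int:
    """Rebuild A = A_0 + A_1 x^d + ... from its digits."""
    a = 0
    for i, digit in enumerate(digits):
        a ^= digit << (i * d)
    return a


def mul_mod_trinomial(a: int, b: int, m: int, n: int) -> int:
    """Bit-serial reference for A*B mod F(x), F(x) = 1 + x^n + x^m.

    Both a and b must have degree below m.
    """
    f = (1 << m) | (1 << n) | 1
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b              # add B * x^i (already reduced)
        b <<= 1                 # B * x^(i+1)
        if (b >> m) & 1:
            b ^= f              # reduce modulo F(x)
    return c


digits = split_into_digits(0b101101110010, d=3, k=2)   # m = 12 = k*k*d
```

The bit-serial routine is the slow baseline that digit-serial designs improve on: it consumes one coefficient of A per step, while a digit-serial multiplier consumes d coefficients at a time.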

Since A is split into $A_0, A_1, \ldots, A_{k^2-1}$, it can be written as $A = A_0 + A_1 x^d + \cdots + A_{k^2-1}x^{(k^2-1)d}$. Expanding A in $C = AB \bmod F(x)$ therefore gives:

$\begin{aligned} C &= AB \bmod F(x) \\ &= B\left(A_0 + A_1 x^d + \cdots + A_{k^2-1}x^{(k^2-1)d}\right) \bmod F(x) \\ &= \big(B(A_0 + A_1 x^d + \cdots + A_{k-1}x^{(k-1)d}) \\ &\qquad + Bx^{dk}(A_k + A_{k+1}x^d + \cdots + A_{2k-1}x^{(k-1)d}) \\ &\qquad + \cdots \\ &\qquad + Bx^{dk(k-1)}(A_{k(k-1)} + A_{k(k-1)+1}x^d + \cdots + A_{k^2-1}x^{(k-1)d})\big) \bmod F(x) \\ &= (C_0 + C_1 + \cdots + C_{k-1}) \bmod F(x) \end{aligned}$

where

$\begin{aligned} C_0 &= B(A_0 + A_1 x^d + \cdots + A_{k-1}x^{(k-1)d}) \\ C_1 &= Bx^{dk}(A_k + A_{k+1}x^d + \cdots + A_{2k-1}x^{(k-1)d}) \\ &\;\;\vdots \\ C_{k-1} &= Bx^{dk(k-1)}(A_{k(k-1)} + A_{k(k-1)+1}x^d + \cdots + A_{k^2-1}x^{(k-1)d}) \end{aligned}$
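The row-wise expansion of C can be checked numerically. The sketch below assumes polynomial-basis operands stored as Python integers (bit i is the coefficient of $x^i$), which is a simplification of the double-basis representation used for B and C in the patent; `clmul`, `reduce_f`, `row_terms`, and `multiply_by_rows` are hypothetical names introduced for illustration.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) product of a and b, without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_f(c: int, m: int, n: int) -> int:
    """Reduce c modulo the trinomial F(x) = 1 + x^n + x^m."""
    f = (1 << m) | (1 << n) | 1
    while c.bit_length() > m:
        c ^= f << (c.bit_length() - 1 - m)   # cancel the leading term
    return c


def row_terms(a: int, b: int, d: int, k: int) -> list[int]:
    """Return the unreduced terms C_0, ..., C_{k-1} of the expansion."""
    mask = (1 << d) - 1
    digits = [(a >> (t * d)) & mask for t in range(k * k)]
    rows = []
    for i in range(k):
        inner = 0
        for j in range(k):
            inner ^= digits[i * k + j] << (j * d)   # A_{ik+j} x^{jd}
        rows.append(clmul(b << (d * k * i), inner))  # B x^{dki} * (...)
    return rows


def multiply_by_rows(a: int, b: int, m: int, n: int, d: int, k: int) -> int:
    """Sum the row terms and reduce once, as in the last line of the expansion."""
    acc = 0
    for c_i in row_terms(a, b, d, k):
        acc ^= c_i
    return reduce_f(acc, m, n)
```

Each row term depends only on k consecutive digits of A and on B shifted by a multiple of dk, which is what allows one row of processing elements to be reused across rows.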

In the overall multiplier structure of Fig. 1, the first row computes $C_0 = B(A_0 + A_1x^d + \cdots + A_{k-1}x^{(k-1)d})$: its first processing element $PE_{0,0}$ computes the product $BA_0$, its second processing element $PE_{0,1}$ computes $BA_1x^d$, and so on, up to the k-th processing element $PE_{0,k-1}$, which computes $BA_{k-1}x^{(k-1)d}$. The results of the k processing elements are accumulated to give $C_0$, which is input to the first FRRP (Final Reconstruction-Reduction-Polynomial) module. Similarly, the second row computes $C_1 = Bx^{dk}(A_k + A_{k+1}x^d + \cdots + A_{2k-1}x^{(k-1)d})$; the added R3 module, whose input is B, computes $Bx^{dk} \bmod F(x)$. Its first processing element $PE_{1,0}$ computes the product $Bx^{dk}A_k$, and the remaining elements proceed as in the first row. The result $C_1$ is input to the second FRRP module and accumulated with the output of the first FRRP module to give $(C_0 + C_1) \bmod F(x)$. Every row performs a similar computation, down to the k-th row, whose R3 module outputs $Bx^{dk(k-1)} \bmod F(x)$. The k-th FRRP module takes $C_{k-1}$ as input and outputs $(C_0 + C_1 + \cdots + C_{k-1}) \bmod F(x)$, which is the result of the whole multiplier, $C = (C_0 + C_1 + \cdots + C_{k-1}) \bmod F(x)$.

The detailed circuit of each processing element $PE_{i,j}$ is shown in Fig. 3; it computes the product $Bx^{dk(i+1)+jd}A_{ik+j}$. Its inputs are $A_{in}$, $B_{in}$, and […], and its outputs are $B_{out}$ and […].

For the first processing element of each row, $PE_{i,0}$, $A_{in}$ is $A_{ik}$, $B_{in}$ is the output of the (i+1)-th R3 module, namely $Bx^{dk(i+1)} \bmod F(x)$, and […] is initialized to 0. $B_{out}$, the output of R1, is $Bx^{dk(i+1)+d} \bmod F(x)$ and serves as the $B_{in}$ input of the second processing element $PE_{i,1}$. The output […] is the result of […], that is, the product $Bx^{dk(i+1)}A_{ik}$.

For the second processing element of each row, $PE_{i,1}$, $A_{in}$ is $A_{ik+1}$, $B_{in}$ is $Bx^{dk(i+1)+d} \bmod F(x)$, and the input […] is the result of the first processing element $PE_{i,0}$, namely $Bx^{dk(i+1)}A_{ik}$. Its $B_{out}$ is $Bx^{dk(i+1)+2d} \bmod F(x)$, which serves as the $B_{in}$ input of the third processing element $PE_{i,2}$. Its output […] is the product $Bx^{dk(i+1)+d}A_{ik+1}$, which serves as the input […] of the third processing element $PE_{i,2}$.

In general, the (j+1)-th processing element of each row, $PE_{i,j}$, computes the product $Bx^{dk(i+1)+jd}A_{ik+j}$. Its $A_{in}$ is $A_{ik+j}$, its $B_{in}$ is $Bx^{dk(i+1)+jd} \bmod F(x)$, and its input […] is the output […] of the j-th element, namely $Bx^{dk(i+1)+(j-1)d}A_{ik+(j-1)}$. Its $B_{out}$ is $Bx^{dk(i+1)+(j+1)d} \bmod F(x)$, and its output […] is the product $Bx^{dk(i+1)+jd}A_{ik+j}$.

Expanding $Bx^{dk(i+1)+jd}$ and $A_{ik+j}$ separately, that is, $Bx^{dk(i+1)+jd} = (b_0\beta_0 + b_1\beta_1 + \cdots + b_{m-1}\beta_{m-1})x^{dk(i+1)+jd}$ and $A_{ik+j} = a_{(ik+j)d} + a_{(ik+j)d+1}x + \cdots + a_{(ik+j)d+d-1}x^{d-1}$, the double-basis multiplication rule gives:

$\begin{aligned} Bx^{dk(i+1)+jd}A_{ik+j} &= (b_0\beta_0 + b_1\beta_1 + \cdots + b_{m-1}\beta_{m-1})x^{dk(i+1)+jd}A_{ik+j} \\ &= (b_0^{(p)}\beta_0 + b_1^{(p)}\beta_1 + \cdots + b_{m-1}^{(p)}\beta_{m-1})A_{ik+j} \\ &= (a_{(ik+j)d} + a_{(ik+j)d+1}x + \cdots + a_{(ik+j)d+d-1}x^{d-1})B^{(p)} \\ &= a_qB^{(p)} + a_{q+1}xB^{(p)} + \cdots + a_{q+d-1}x^{d-1}B^{(p)} \\ &= a_qB^{(p+q)} + a_{q+1}B^{(p+q+1)} + \cdots + a_{q+d-1}B^{(p+q+d-1)} \\ &= [B^{(p+q)}, B^{(p+q+1)}, \ldots, B^{(p+q+d-1)}][a_q, a_{q+1}, \ldots, a_{q+d-1}]^T \end{aligned}$

where $p = dk(i+1)+jd$, $q = (ik+j)d$, and $B^{(p)} = b_0^{(p)}\beta_0 + b_1^{(p)}\beta_1 + \cdots + b_{m-1}^{(p)}\beta_{m-1}$.

In the detailed circuit of the processing element $PE_{i,j}$ in Fig. 3, the input of the CMP module is $Bx^{dk(i+1)+jd}$ and its output is $[B^{(p+q)}, B^{(p+q+1)}, \ldots, B^{(p+q+d-1)}]$; the input of the CVP module is $A_{ik+j}$ and its output is $[a_q, a_{q+1}, \ldots, a_{q+d-1}]^T$. The PWM module computes the product $[B^{(p+q)}, B^{(p+q+1)}, \ldots, B^{(p+q+d-1)}][a_q, a_{q+1}, \ldots, a_{q+d-1}]^T$, which is then added to […]; the sum is stored in register L and output from register L as […]. The input of the R1 module is $B_{in}$; it computes $x^dB_{in} \bmod F(x)$, stores the result in register L, and outputs it from register L as $B_{out}$.
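The R1 operation, $x^dB_{in} \bmod F(x)$, can be sketched as d repeated multiplications by x followed by a conditional reduction. The sketch assumes B is held in the polynomial basis as a Python integer (bit i is the coefficient of $x^i$); in the patent B is in the double basis, where the same operation acts on different coordinates, so this only illustrates the arithmetic, not the circuit. The names `times_x` and `r1_module` are hypothetical.

```python
def times_x(b: int, m: int, n: int) -> int:
    """Multiply b by x modulo F(x) = 1 + x^n + x^m (b has degree below m)."""
    b <<= 1
    if (b >> m) & 1:
        b ^= (1 << m) | (1 << n) | 1   # subtract F(x) to clear the x^m term
    return b


def r1_module(b_in: int, d: int, m: int, n: int) -> int:
    """Model of R1: return x^d * B_in mod F(x) as B_out."""
    for _ in range(d):
        b_in = times_x(b_in, m, n)
    return b_in
```

Chaining this function from one processing element to the next reproduces the sequence of $B_{in}$ values along a row; applying it with a shift of kd instead of d gives a model of the R3 module under the same assumptions.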

The product $[B^{(p+q)}, B^{(p+q+1)}, \ldots, B^{(p+q+d-1)}][a_q, a_{q+1}, \ldots, a_{q+d-1}]^T$ is a Toeplitz matrix-vector product and is therefore partitioned as $\begin{bmatrix} t_1 & t_2 \\ t_0 & t_1 \end{bmatrix}\begin{bmatrix} v_0 \\ v_1 \end{bmatrix}$. Here $\begin{bmatrix} t_1 & t_2 \\ t_0 & t_1 \end{bmatrix}$ denotes the Toeplitz matrix $[B^{(p+q)}, B^{(p+q+1)}, \ldots, B^{(p+q+d-1)}]$ divided into four blocks, two of which are identical and equal to $t_1$ while the other two are $t_0$ and $t_2$, and $\begin{bmatrix} v_0 \\ v_1 \end{bmatrix}$ denotes the vector $[a_q, a_{q+1}, \ldots, a_{q+d-1}]^T$ split into two halves, with T denoting matrix transposition. This gives:

$\begin{aligned} [B^{(p+q)}, B^{(p+q+1)}, \ldots, B^{(p+q+d-1)}][a_q, a_{q+1}, \ldots, a_{q+d-1}]^T &= \begin{bmatrix} t_1 & t_2 \\ t_0 & t_1 \end{bmatrix}\begin{bmatrix} v_0 \\ v_1 \end{bmatrix} \\ &= \begin{bmatrix} t_1(v_0 + v_1) + v_1(t_2 + t_1) \\ t_1(v_0 + v_1) + v_0(t_0 + t_1) \end{bmatrix} \\ &= \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} \end{aligned}$
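Applied recursively, the two-way split above yields a subquadratic Toeplitz matrix-vector product. The sketch below is an illustration, not the patented circuit: it assumes the matrix size is a power of two and that an n×n Toeplitz matrix over GF(2) is stored as a list `s` of 2n-1 bits with entry (r, c) equal to `s[c - r + n - 1]`, an indexing convention chosen here so that the blocks appear as $t_1, t_2$ in the top row and $t_0, t_1$ in the bottom row. The function names are hypothetical.

```python
def tmvp_direct(s: list[int], v: list[int]) -> list[int]:
    """Schoolbook Toeplitz matrix-vector product over GF(2)."""
    n = len(v)
    out = []
    for r in range(n):
        acc = 0
        for c in range(n):
            acc ^= s[c - r + n - 1] & v[c]
        out.append(acc)
    return out


def xor_vec(a: list[int], b: list[int]) -> list[int]:
    """Element-wise addition over GF(2)."""
    return [x ^ y for x, y in zip(a, b)]


def tmvp_split(s: list[int], v: list[int]) -> list[int]:
    """Two-way split TMVP: three half-size products instead of four."""
    n = len(v)
    if n == 1:
        return [s[0] & v[0]]
    h = n // 2
    # Diagonal sequences of the blocks t0 (bottom-left), t1 (both
    # diagonal blocks), and t2 (top-right) under the assumed indexing.
    t0, t1, t2 = s[0:2 * h - 1], s[h:3 * h - 1], s[2 * h:4 * h - 1]
    v0, v1 = v[:h], v[h:]
    p_mid = tmvp_split(t1, xor_vec(v0, v1))    # t1 (v0 + v1)
    p_top = tmvp_split(xor_vec(t2, t1), v1)    # (t2 + t1) v1
    p_bot = tmvp_split(xor_vec(t0, t1), v0)    # (t0 + t1) v0
    return xor_vec(p_mid, p_top) + xor_vec(p_mid, p_bot)   # [c0; c1]
```

The shared product `p_mid` is what saves one of the four block multiplications at every level of the recursion.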

Fig. 4 shows the detailed circuits of the CMP, CVP, and PWM modules of the processing element PE. The input of the CMP module is $(t_0, t_1, t_2)$, which passes through XOR gates XOR_1 and XOR_2 to produce the output $(t_0 + t_1, t_1, t_1 + t_2)$. The input of the CVP module is $(v_0, v_1)$, which passes through XOR gate XOR_3 to produce the output $(v_0, v_0 + v_1, v_1)$. The PWM module multiplies the outputs of the CMP and CVP modules point by point using the three AND gates AND_1, AND_2, and AND_3, and outputs $(v_0(t_0 + t_1), t_1(v_0 + v_1), v_1(t_2 + t_1))$. The FR reconstruction module uses the two XOR gates XOR_4 and XOR_5 to compute $c_0 = t_1(v_0 + v_1) + v_1(t_2 + t_1)$ and $c_1 = t_1(v_0 + v_1) + v_0(t_0 + t_1)$, and outputs $(c_0, c_1)$.
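The gate-level data path just described can be modelled bit by bit for the smallest case, where each of $t_0, t_1, t_2, v_0, v_1$ is a single bit. The function names and the assignment of individual gate labels to individual operations are assumptions made for illustration; the text only states which gates each module contains.

```python
def cmp_module(t0: int, t1: int, t2: int) -> tuple[int, int, int]:
    """Component matrix formation: two XOR gates (assumed XOR_1, XOR_2)."""
    return (t0 ^ t1, t1, t1 ^ t2)


def cvp_module(v0: int, v1: int) -> tuple[int, int, int]:
    """Component vector formation: one XOR gate (assumed XOR_3)."""
    return (v0, v0 ^ v1, v1)


def pwm_module(tc: tuple[int, int, int], vc: tuple[int, int, int]) -> tuple[int, int, int]:
    """Point-wise multiplication: three AND gates (AND_1, AND_2, AND_3)."""
    return (tc[0] & vc[0], tc[1] & vc[1], tc[2] & vc[2])


def fr_module(p: tuple[int, int, int]) -> tuple[int, int]:
    """Final reconstruction: two XOR gates (assumed XOR_4, XOR_5)."""
    p0, p1, p2 = p
    c0 = p1 ^ p2    # t1(v0 + v1) + v1(t2 + t1)
    c1 = p1 ^ p0    # t1(v0 + v1) + v0(t0 + t1)
    return (c0, c1)


def pe_core(t0: int, t1: int, t2: int, v0: int, v1: int) -> tuple[int, int]:
    """Chain CMP, CVP, PWM, and FR for a 2x2 Toeplitz block."""
    return fr_module(pwm_module(cmp_module(t0, t1, t2), cvp_module(v0, v1)))
```

Expanding the two outputs gives $c_0 = t_1v_0 + t_2v_1$ and $c_1 = t_0v_0 + t_1v_1$, so the chain uses three AND gates where the direct 2×2 product would use four, at the cost of extra XOR gates.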

Fig. 2 shows the digit-serial systolic multiplier architecture proposed by the invention, obtained by folding the structure of Fig. 1. Fig. 1 uses $k^2$ processing elements, but the k processing elements of every row have the same structure and function, so the k processing elements of the first row can stand in for those of the remaining rows; the computation then takes k cycles. In the first cycle the input of A is $(A_0, A_1, \ldots, A_{k-1})$, B is input directly, and the result passes through the FRRP reconstruction module into register C. In the second cycle the input of A is $(A_k, A_{k+1}, \ldots, A_{2k-1})$, B is input through the R3 module, and the result also passes through the FRRP reconstruction module, is added to the result of the first cycle, and is stored in register C. This continues until the k-th cycle, in which the input of A is $(A_{k(k-1)}, A_{k(k-1)+1}, \ldots, A_{k^2-1})$ and B is input after passing through the R3 module (k-1) times; the result passes through the FRRP reconstruction module, is added to the accumulated result of the previous (k-1) cycles, and is stored in register C, which then outputs the result $C = AB \bmod F(x)$.
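The folded schedule can be simulated in software as a loop over k cycles that reuses one row of k processing elements. This is a behavioural sketch under stated assumptions: operands are polynomial-basis integers (bit i is the coefficient of $x^i$) rather than double-basis vectors, the row multiplier follows the expansion $C_i = Bx^{dki}(\cdots)$, and all function names are hypothetical. It does not model latches, pipelining, or the TMVP-based internals of a processing element.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) product without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_f(c: int, m: int, n: int) -> int:
    """Reduce c modulo F(x) = 1 + x^n + x^m."""
    f = (1 << m) | (1 << n) | 1
    while c.bit_length() > m:
        c ^= f << (c.bit_length() - 1 - m)
    return c


def folded_multiply(a: int, b: int, m: int, n: int, d: int, k: int) -> int:
    """Compute A*B mod F(x) in k cycles with one row of k PEs."""
    mask = (1 << d) - 1
    c_reg = 0          # register C
    b_row = b          # B as seen by the first PE in the current cycle
    for cycle in range(k):
        b_in = b_row
        partial = 0    # running sum passed from PE to PE
        for j in range(k):
            digit = (a >> ((cycle * k + j) * d)) & mask   # A_{cycle*k + j}
            partial ^= clmul(b_in, digit)                 # PE product
            b_in = reduce_f(b_in << d, m, n)              # R1: x^d * B_in
        c_reg ^= reduce_f(partial, m, n)                  # reduce, accumulate
        b_row = reduce_f(b_row << (k * d), m, n)          # R3: x^{kd} * B
    return c_reg
```

Folding trades area for time: the row of k processing elements is reused k times, so the hardware cost is roughly that of one row of Fig. 1 while the number of passes grows to k.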

The above is a further detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention is not to be regarded as limited to these descriptions. For a person of ordinary skill in the technical field of the invention, simple deductions or substitutions made without departing from the concept of the invention shall all be regarded as falling within the scope of protection of the invention.

Claims

1. A fast-operation digit-serial systolic double-basis binary finite field multiplier, characterized in that it comprises an input terminal B, k PE modules, an FRRP module, and an R3 module, the k PE modules being connected in series and operating over k cycles, wherein: in the first cycle the input of A is $A_0, A_1, \ldots, A_{k-1}$, B is input directly, and the result is reconstructed by the FRRP module and stored in register C; in the second cycle the input of A is $A_k, A_{k+1}, \ldots, A_{2k-1}$, B is input through the R3 module, and the result is likewise reconstructed by the FRRP module, added to the result of the first cycle, and stored in register C; continuing in this way, in the k-th cycle the input of A is $A_{k(k-1)}, A_{k(k-1)+1}, \ldots, A_{k^2-1}$, B is input after passing through the R3 module (k-1) times, and the result is reconstructed by the FRRP module, added to the accumulated result of the previous (k-1) cycles, and stored in register C, which then outputs the result; the R3 module computes $Bx^{kd} \bmod F(x)$; each PE module comprises an R1 module, a CMP module, a CVP module, a PWM module, […] XOR gates, and […] latches; the output of the R3 module is passed to the R1 module and then undergoes coefficient conversion in the CMP module; a segment of A is input to the CVP module, which performs coefficient conversion on that segment; the results of the CMP module and the CVP module are both input to the PWM module, which computes the product of $B_{in}$ and the segment of A; the product is accumulated through […] XOR gates, stored in […] latches, and output from the latches as […]; wherein A is defined with respect to the trinomial $F(x) = 1 + x^n + x^m$ and is expressed as $A = a_0 + a_1x + \cdots + a_{m-1}x^{m-1}$, with m coefficients $(a_0, a_1, \ldots, a_{m-1})$; using segmentation, the m-bit A is cut into $A_0, A_1, \ldots, A_{k^2-1}$ with d bits per segment and $k^2$ segments in total, so that […]; B is expressed in the double basis as $B = b_0\beta_0 + b_1\beta_1 + \cdots + b_{m-1}\beta_{m-1}$ and is the other input of the multiplier; and C is the output.

2. The fast-operation digit-serial systolic double-basis binary finite field multiplier according to claim 1, characterized in that the FRRP module comprises an FR module and an R2 module, the R2 module computes $C \bmod (x^m + 1)$, and the input of the FR module is the result computed by the k serially connected PE modules, which the FR module reconstructs and outputs to the R2 module.

3. The fast-operation digit-serial systolic double-basis binary finite field multiplier according to claim 1, characterized in that the CMP module comprises XOR gates XOR_1 and XOR_2, which are connected in parallel.

4. The fast-operation digit-serial systolic double-basis binary finite field multiplier according to claim 1, characterized in that the CVP module is the XOR gate XOR_3.

5. The fast-operation digit-serial systolic double-basis binary finite field multiplier according to claim 1, characterized in that the PWM module comprises three parallel AND gates AND_1, AND_2, and AND_3, which multiply the outputs of the CMP module and the CVP module point by point.

6. The fast-operation digit-serial systolic double-basis binary finite field multiplier according to claim 1, characterized in that the FR module comprises two parallel XOR gates XOR_4 and XOR_5.