CN111694540A - Radix-64 arithmetic circuit for number-theoretic transform multiplication - Google Patents

Info

Publication number
CN111694540A
Authority
CN
China
Prior art keywords
operands
operand
bit
csa
layer
Prior art date
Legal status
Granted
Application number
CN202010371311.4A
Other languages
Chinese (zh)
Other versions
CN111694540B (en)
Inventor
华斯亮
卞九辉
张静亚
张慧国
刘玉申
徐健
Current Assignee
Changshu Institute of Technology
Original Assignee
Changshu Institute of Technology
Priority date
Filing date
Publication date
Application filed by Changshu Institute of Technology
Priority to CN202010371311.4A
Publication of CN111694540A
Application granted
Publication of CN111694540B
Status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 Adding; Subtracting
    • G06F7/501 Half or full adders, i.e. basic adder cells for one denomination
    • G06F7/503 Half or full adders, i.e. basic adder cells for one denomination using carry switching, i.e. the incoming carry being connected directly, or only via an inverter, to the carry output under control of a carry propagate signal
    • G06F7/57 Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G06F7/60 Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers
    • G06F7/72 Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers using residue arithmetic
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a radix-64 arithmetic circuit for number-theoretic transform (NTT) multiplication, comprising: 64 operand generation modules, in which each of the 64 input data is zero-padded at the high end and divided into 22 words of 3 bits each, and the words are merged into operands, with 1 module outputting 64 96-bit operands, 48 modules outputting 22 192-bit operands each, 3 modules outputting 32 192-bit operands each, and 12 modules outputting 24 192-bit operands each; a multi-operand modular addition module connected after each operand generation module, which performs modular addition on the operands output by that module; and a modulo-p module, which reduces the output of each multi-operand modular addition module modulo the prime $p = 2^{64} - 2^{32} + 1$. Compared with the prior art, the invention reduces the number of operands from 4096 to 1504, greatly lowering the computational overhead and improving the efficiency of the radix-64 operation.

Description

Radix-64 arithmetic circuit for number-theoretic transform multiplication
Technical Field
The present invention relates to an arithmetic circuit, and more particularly, to a radix-64 arithmetic circuit for number-theoretic transform multiplication.
Background
Besides conventional long multiplication, large-integer multiplication can also be performed with the [algorithm name shown as an image in the original] algorithm. Its core idea is as follows: an FFT over a ring is applied to each of the two large integers of length n to obtain their frequency-domain representations; the two representations are multiplied pointwise to obtain the frequency-domain representation of the product; and an inverse FFT over the ring recovers the product. Using a number-theoretic transform (NTT) instead of a discrete Fourier transform replaces floating-point arithmetic with modular arithmetic and thereby avoids rounding errors. In particular, the [algorithm name shown as an image in the original] algorithm performs its multiplication with a number-theoretic transform. The NTT and the inverse NTT are the computational core of NTT multiplication, accounting for more than 90% of its computation and run time, so optimizing the speed, area and power consumption of the NTT is critical to the overall performance of NTT multiplication.
A 16777216-point NTT can be decomposed into four levels of radix-64 operations plus twiddle-factor multiplications. The twiddle factors can be precomputed, stored in ROM and read directly when needed. The radix-64 operations account for more than 90% of the NTT computation, so optimizing them is critical to the efficiency of the NTT.
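As a quick sanity check of the transform size (an illustrative sketch, not part of the patent), the Python below confirms that 16777216 = 64^4, that $p - 1$ is divisible by $2^{24}$ for the prime $p = 2^{64} - 2^{32} + 1$ used later in this description, and derives a primitive 64th root of unity from a primitive $2^{24}$-th one. The helper `primitive_root_of_unity` and its small search range are our own assumptions.

```python
# Illustrative sketch: does p = 2^64 - 2^32 + 1 support a 16777216-point NTT,
# and does that length split into four radix-64 levels?
P = 2**64 - 2**32 + 1
N = 16_777_216

levels, rest = 0, N
while rest % 64 == 0:          # count how many radix-64 levels divide N
    rest //= 64
    levels += 1

supports_ntt = (P - 1) % N == 0   # an N-th root of unity exists mod P iff N divides P - 1


def primitive_root_of_unity(order: int, p: int = P) -> int:
    """Hypothetical helper: find w of exact multiplicative order `order` (a power of 2)."""
    for g in range(2, 1000):
        w = pow(g, (p - 1) // order, p)
        if pow(w, order // 2, p) == p - 1:   # w^(order/2) = -1, so the order is exactly `order`
            return w
    raise ValueError("no suitable g in the search range")


w_n = primitive_root_of_unity(N)
w_64 = pow(w_n, N // 64, P)        # a primitive 64th root of unity

print(levels, rest, supports_ntt)  # 4 1 True
```

The root found this way need not be the power of two chosen later in the description; any primitive 64th root satisfies the definition, but only a power of two turns the multiplications into shifts.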
Xie Xing et al., "Design and implementation of an FPGA-based large integer multiplier", Journal of Electronics & Information Technology, 2019, describes a large-integer multiplier hardware architecture based on the [algorithm name shown as an image in the original] algorithm. The paper decomposes a 65536-point NTT into 64-point and 1024-point transforms, where the 1024-point NTT is built from two radix-32 stages connected in series. Its radix-64 operation consists of 64 shift units and a tree-structured large-sum processing unit. Because the paper uses zero padding, each tree-structured large-sum unit must process 64 192-bit operands, so the whole radix-64 operation processes $64 \times 64 = 4096$ operands. This radix-64 circuit is therefore not efficient enough, and the resulting implementation requires considerable power and hardware resources.
Disclosure of Invention
To address these shortcomings of the prior art, the invention provides a radix-64 arithmetic circuit for number-theoretic transform multiplication that reduces the power consumption and resource overhead of the radix-64 circuit.
The technical scheme of the invention is as follows: a radix-64 operation circuit for number-theoretic transform multiplication, comprising:
an operand generation module: there are 64 operand generation modules, numbered Xk, $k = 0, 1, 2, \ldots, 63$, each comprising a dividing circuit, a merging circuit and a zero-padding circuit; the dividing circuit pads each of the 64 input data with high-order zeros and divides it into 22 words of 3 bits each, the divided input data being $x_{n,m}$, $0 \le n < 64$, $0 \le m < 22$; the merging circuit combines the $64 \times 22$ words into output operands, such that among the 64 operand generation modules, 1 outputs 64 96-bit operands, 48 output 22 192-bit operands each, 3 output 32 192-bit operands each, and 12 output 24 192-bit operands each; and the zero-padding circuit fills the empty bit positions with "0" when the merging circuit outputs operands;
a multi-operand modular addition module, which performs modular addition on the operands output by each operand generation module; and
a modulo-p module, which reduces the data output by each multi-operand modular addition module modulo the prime $p$ and outputs the result, where $p = 2^{64} - 2^{32} + 1$.
Further, the operand generation module that outputs 64 96-bit operands is numbered X0; each 96-bit operand consists of 32 words, of which the low 22 words are the input data and the high 10 words are set to zero.
Further, the operand generation modules that output 22 192-bit operands with $k$ odd are numbered Xk; each operand $OP_m$ is formed by combining the 64 words $x_{n,m}$, $0 \le n < 64$, that share the same word index $m$, $0 \le m < 22$, and the least significant bit of $x_{n,m}$ is placed at position $3(m + nk) \bmod 192$ of $OP_m$.
Further, the operand generation modules that output 32 192-bit operands are numbered X16, X32 and X48; the 32 operands are divided into 16 groups of 2 operands each ($OP_0$ and $OP_1$, $OP_2$ and $OP_3$, and so on), and the operands $OP_{2j}$ and $OP_{2j+1}$ of each group are formed by combining the 88 words $x_{n,m}$, $4j \le n \le 4j+3$, $0 \le m < 22$; the least significant bit of $x_{n,m}$ is placed at position $3(m + nk) \bmod 192$, preferentially in $OP_{2j}$, and if that position of $OP_{2j}$ is already occupied, at the corresponding position of $OP_{2j+1}$.
Further, the operand generation modules that output 24 192-bit operands are the modules Xk, other than X0, X16, X32 and X48, for which $k$ is an even number divisible by 4 or 8; the 24 operands are divided into 4 groups ($OP_0$ to $OP_5$, $OP_6$ to $OP_{11}$, and so on), and the operands $OP_{6j}$ to $OP_{6j+5}$ of each group are formed by combining the 352 words $x_{n,m}$, $16j \le n \le 16j+15$, $0 \le m < 22$; the least significant bit of $x_{n,m}$ is placed at position $3(m + nk) \bmod 192$, the words are merged with a period of 4 or 8 words, and each word is preferentially placed in the lowest-indexed operand among $OP_{6j}$ to $OP_{6j+5}$.
Further, the operand generation modules that output 22 192-bit operands with $k$ even are the modules Xk, other than X0, X16, X32 and X48, for which $k$ is an even number not divisible by 4 or 8; the 22 operands are divided into 2 groups ($OP_0$ to $OP_{10}$ and $OP_{11}$ to $OP_{21}$), and the operands $OP_{11j}$ to $OP_{11j+10}$ of each group are formed by combining the 704 words $x_{n,m}$, $32j \le n \le 32j+31$, $0 \le m < 22$; the least significant bit of $x_{n,m}$ is placed at position $3(m + nk) \bmod 192$, the words are merged with a period of 2 words, and each word is preferentially placed in the lowest-indexed operand among $OP_{11j}$ to $OP_{11j+10}$.
The technical solution of the invention has the following advantage:
by exploiting the zero-padded empty bit positions left after the operands are shifted, the invention merges the operands of the radix-64 operation in NTT multiplication, reducing the number of operands from 4096 in the prior art to 1504; this greatly reduces the computational overhead and improves the efficiency of the radix-64 operation.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the radix-64 arithmetic circuit for number-theoretic transform multiplication according to the invention.
Fig. 2 is a schematic diagram of how the dividing circuit in an operand generation module zero-pads and divides the input data.
FIG. 3 is a diagram of the dividing circuit in an operand generation module.
Fig. 4 is a schematic diagram of output data obtained by the merging circuit of the X0 operand generation module.
FIG. 5 is a schematic diagram of a merge circuit of the X0 operand generation module.
FIG. 6 shows merged operands of the merge circuit of the X1 operand generation module.
FIG. 7 shows the merging circuit for operand 0 (OP0) in the X1 operand generation module.
FIG. 8 shows merged operands of the merge circuit of the X16 operand generation module.
FIG. 9 shows merged operands of the merge circuit of the X4 operand generation module.
FIG. 10 shows merged operands of the merge circuit of the X2 operand generation module.
Fig. 11 is a circuit schematic of the 64-operand modular addition module.
Fig. 12 is a circuit schematic of the 22-operand modular addition module.
Fig. 13 is a circuit schematic of the 32-operand modular addition module.
Fig. 14 is a circuit schematic of the 24-operand modular addition module.
Detailed Description
The present invention is further described in the following examples, which are intended to be illustrative only and not to be limiting as to the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalent modifications thereof which would occur to persons skilled in the art upon reading the present specification and which are intended to be within the scope of the present invention as defined in the appended claims.
The radix-64 operation is defined as

$$X_k = \sum_{n=0}^{63} x_n W_{64}^{nk} \bmod p$$

where $0 \le k < 64$, $p$ is a prime, and $W_{64}$ is a primitive 64th root of unity modulo $p$.
When $p$ is the Solinas prime $p = 2^{64} - 2^{32} + 1$, modular reduction is efficient: $2^{192} \bmod p = 1$, $2^{96} \bmod p = -1$, and $2^{64} \bmod p = 2^{32} - 1$. With this prime the root of unity is $W_{64} = 2^3$, a power of 2, so the multiplications can be converted into shifts and modular additions, reducing the computational complexity of the NTT. The radix-64 operation can therefore be written as

$$X_k = \sum_{n=0}^{63} x_n \cdot 2^{3nk \bmod 192} \bmod p$$
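The identities and the shift form above can be checked numerically. The Python sketch below is our own illustration (the names `radix64_reference` and `radix64_shift` are hypothetical): it evaluates the radix-64 operation once with modular multiplications by powers of $W_{64} = 8$ and once with left shifts by $3nk \bmod 192$ bits, which is exactly what $2^{192} \equiv 1 \pmod p$ permits.

```python
import random

P = 2**64 - 2**32 + 1   # Solinas prime from the text
W64 = 2**3              # 64th root of unity from the text

# Reduction identities quoted in the text.
assert pow(2, 192, P) == 1
assert pow(2, 96, P) == P - 1          # 2^96 = -1 (mod P)
assert pow(2, 64, P) == 2**32 - 1
# W64 has exact order 64: 8^64 = 2^192 = 1 and 8^32 = 2^96 = -1.
assert pow(W64, 64, P) == 1 and pow(W64, 32, P) == P - 1


def radix64_reference(x: list[int]) -> list[int]:
    """Definition: X_k = sum_n x_n * W64^(nk) mod P, using modular multiplication."""
    return [sum(xn * pow(W64, n * k, P) for n, xn in enumerate(x)) % P
            for k in range(64)]


def radix64_shift(x: list[int]) -> list[int]:
    """Shift form: W64^(nk) = 2^(3nk mod 192) (mod P), so only left shifts are needed."""
    return [sum(xn << (3 * n * k % 192) for n, xn in enumerate(x)) % P
            for k in range(64)]


x = [random.getrandbits(64) for _ in range(64)]
assert radix64_reference(x) == radix64_shift(x)
```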
Each $x_n$ is divided into 22 words of 3 bits each, denoted $x_{n,m}$, $0 \le m < 22$, so that

$$x_n = \sum_{m=0}^{21} x_{n,m} \cdot 2^{3m}$$
where $m$ is the word index, $x_n$ is 64 bits wide, each $x_{n,m}$ is 3 bits wide, and $x_{n,21}$ holds only 1 bit. After the input data are split, the radix-64 operation can be written as follows, and the shifted operands can be merged by reusing their zero-padded positions, reducing the number of operands in the modular addition:
$$X_k = \sum_{n=0}^{63} \sum_{m=0}^{21} x_{n,m} \cdot 2^{3(m+nk) \bmod 192} \bmod p$$
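The split form can be checked against the definition. In this Python sketch (function names are hypothetical), each word $x_{n,m}$ is shifted to bit $3(m + nk) \bmod 192$ and all shifted words are summed; the result agrees with the direct transform with $W_{64} = 8$ for every $k$.

```python
P = 2**64 - 2**32 + 1


def split_words(x: int) -> list[int]:
    """x_{n,m}: 22 words of 3 bits, word 0 least significant (bits 64-65 read as 0)."""
    return [(x >> (3 * m)) & 0b111 for m in range(22)]


def radix64_from_words(x: list[int], k: int) -> int:
    """Split form: X_k = sum_n sum_m x_{n,m} * 2^(3(m + nk) mod 192) mod P."""
    total = 0
    for n, xn in enumerate(x):
        for m, word in enumerate(split_words(xn)):
            total += word << (3 * (m + n * k) % 192)
    return total % P


def radix64_direct(x: list[int], k: int) -> int:
    """Definition with W64 = 2^3: X_k = sum_n x_n * 8^(nk) mod P."""
    return sum(xn * pow(8, n * k, P) for n, xn in enumerate(x)) % P


x = [(n * 0x9E3779B97F4A7C15 + 12345) % 2**64 for n in range(64)]   # arbitrary inputs
assert all(radix64_from_words(x, k) == radix64_direct(x, k) for k in range(64))
```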
Referring to Fig. 1, the radix-64 arithmetic circuit for number-theoretic transform multiplication of this embodiment comprises 64 operand generation modules X0 to X63, multi-operand modular addition modules and modulo-p modules; according to the number of input operands, the multi-operand modular addition modules are classified into 64-operand, 22-operand, 32-operand and 24-operand modular addition modules. The 64-bit input data serve as the inputs of every operand generation module; each operand generation module is followed by a multi-operand modular addition module, and each multi-operand modular addition module is followed by a modulo-p module.
The operand generation module comprises a dividing circuit, a merging circuit and a zero-padding circuit, which in turn divide, merge and zero-pad the 64-bit input data to form operands. Referring to Figs. 2 and 3, the dividing circuit pads each 64-bit input $x_n$ with two high-order 0 bits to form 66-bit data and then divides it into 22 words of 3 bits each; because its upper 2 bits are padding zeros, the 22nd word carries only 1 data bit. This division is easy to implement in existing hardware with very little overhead.
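A minimal Python model of the dividing circuit, under the assumption that words are numbered from the least significant end as in the formula for $x_n$ (the names `split_words` and `join_words` are hypothetical). In hardware each word is just three adjacent input wires; the model shows that the padding costs nothing and that the division is exactly invertible.

```python
WORD_BITS = 3
NUM_WORDS = 22      # 22 * 3 = 66 bits: the 64-bit input plus two high-order zero bits


def split_words(x: int) -> list[int]:
    """Dividing-circuit sketch: cut a 64-bit input into 22 3-bit words, LSW first.

    Bits 64 and 65 do not exist in x, so the shift reads them as 0; this plays the
    role of the two high-order padding zeros, and word 21 can only hold bit 63.
    """
    if not 0 <= x < 1 << 64:
        raise ValueError("expected a 64-bit unsigned value")
    return [(x >> (WORD_BITS * m)) & 0b111 for m in range(NUM_WORDS)]


def join_words(words: list[int]) -> int:
    """Inverse mapping: x = sum_m x_m * 2^(3m)."""
    return sum(w << (WORD_BITS * m) for m, w in enumerate(words))


x = 0xFFFF_FFFF_FFFF_FFFF
words = split_words(x)
print(len(words), words[0], words[21])   # 22 7 1
assert join_words(words) == x
```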
The operand generation modules are numbered Xk, $k = 0, 1, 2, \ldots, 63$. The merging circuit differs from module to module, but the merging circuits can be divided by type into 5 groups, with similar circuits within each group.
Group one: X0, 1 module in total. Group two: Xk with $k$ odd, such as X1, X3 and X5, 32 modules in total. Group three: X16, X32 and X48, 3 modules in total. Group four: Xk, other than X0, X16, X32 and X48, with $k$ an even number divisible by 4 or 8, such as X4, X8 and X12, 12 modules in total. Group five: Xk, other than X0, X16, X32 and X48, with $k$ an even number not divisible by 4 or 8, such as X2, X6 and X10, 16 modules in total.
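The grouping and the operand counts quoted for each group add up to the 1504 operands claimed in the abstract. The Python below is a hypothetical tally (the name `merge_group` is ours) that classifies every $k$ by the rules above and sums the operands per module; the per-group counts (64, 22, 32, 24, 22) are taken from the text.

```python
def merge_group(k: int) -> int:
    """Return the merge group (1-5) of operand generation module Xk."""
    if k == 0:
        return 1
    if k % 2 == 1:
        return 2
    if k in (16, 32, 48):
        return 3
    if k % 4 == 0:
        return 4
    return 5


# Operands produced per module in each group, as stated in the text.
OPERANDS_PER_MODULE = {1: 64, 2: 22, 3: 32, 4: 24, 5: 22}

counts = {g: 0 for g in range(1, 6)}
for k in range(64):
    counts[merge_group(k)] += 1

total_operands = sum(OPERANDS_PER_MODULE[merge_group(k)] for k in range(64))
prior_art_operands = 64 * 64   # 64 shifted operands for each of the 64 outputs

print(counts)            # {1: 1, 2: 32, 3: 3, 4: 12, 5: 16}
print(total_operands)    # 1504
```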
The merging operation of each group is described below.
Group one: the merging circuit of the X0 operand generation module.
Because $k = 0$, $W_{64}^{0} = 1$ and no word is shifted, so the operands are simply the aligned input data; in other words, each operand consists of the 22 consecutive words output by the dividing circuit for one input. The merging circuit outputs 64 96-bit operands, each consisting of 32 words, of which the low 22 words are the input data and the high 10 words are set to zero. As shown in Fig. 4, operand $j$, $OP_j$, is 96 bits wide, with $x_j$ placed in the low 66 bits and the high 30 bits filled with zeros; the merging circuit is shown in Fig. 5.
Group two: the merging circuits of the odd-numbered operand generation modules X1, X3, X5 and so on.
For the merging circuit of an Xk operand generation module with $k$ odd, the input is the 64 64-bit input data and the output is 22 192-bit operands. Each operand $OP_m$ is formed by combining the 64 words $x_{n,m}$, $0 \le n < 64$, that share the same word index $m$, $0 \le m < 22$. The least significant bit of $x_{n,m}$ is placed at position $3(m + nk) \bmod 192$ of $OP_m$; since $k$ is odd, $nk \bmod 64$ takes each value from 0 to 63 exactly once, so the 64 words never overlap. The operands output by X1 are described below as an example.
The merging circuit of the X1 operand generation module merges the operands as shown in Fig. 6: there are 22 merged operands, each formed from the 64 words $x_{n,m}$, $0 \le n < 64$, with the same word index $m$. The least significant bit of $x_{0,0}$ lies at position $3 \times (0 + 0 \times 1) \bmod 192 = 0$ of OP0, and that of $x_{1,0}$ at position $3 \times (0 + 1 \times 1) \bmod 192 = 3$ of OP0; the least significant bit of $x_{0,1}$ lies at position $3 \times (1 + 0 \times 1) \bmod 192 = 3$ of OP1, and that of $x_{63,1}$ at position $3 \times (1 + 63 \times 1) \bmod 192 = 0$ of OP1. The merging circuit for operand 0 (OP0) of the X1 module is shown in Fig. 7.
The operands output by the merging circuits of the remaining group-two modules are formed in the same way.
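A Python sketch of the group-two merge, written from the placement rule above (the function names are hypothetical). It builds $OP_0$ to $OP_{21}$ for an odd $k$, asserts that no two words collide inside an operand, and reproduces the X1 worked example: a non-zero $x_{63,1}$ lands at bit 0 of OP1.

```python
P = 2**64 - 2**32 + 1


def split_words(x: int) -> list[int]:
    return [(x >> (3 * m)) & 0b111 for m in range(22)]


def odd_k_operands(x: list[int], k: int) -> list[int]:
    """Group-two sketch: OP_m collects word m of every input, placed at bit 3(m + nk) mod 192.

    For odd k, n*k mod 64 is a permutation of 0..63, so the 64 words of OP_m fill
    64 distinct 3-bit slots and no two words overlap.
    """
    assert k % 2 == 1
    words = [split_words(xn) for xn in x]
    ops = []
    for m in range(22):
        op, used = 0, set()
        for n in range(64):
            pos = 3 * (m + n * k) % 192
            assert pos not in used, "two words mapped to the same slot"
            used.add(pos)
            op |= words[n][m] << pos
        ops.append(op)
    return ops


def radix64_direct(x: list[int], k: int) -> int:
    return sum(xn * pow(8, n * k, P) for n, xn in enumerate(x)) % P


# Worked example from the text (X1): x_{63,1} lands at bit 0 of OP1.
x = [0] * 64
x[63] = 0b111 << 3          # only word x_{63,1} is non-zero
assert odd_k_operands(x, 1)[1] == 0b111
```

Because the words of one operand never overlap, OR-ing them equals adding them, so the 22 operands still sum to $X_k$ modulo $p$.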
Group three: the merging circuits of the X16, X32 and X48 operand generation modules.
The input is the 64 64-bit input data and the output is 32 192-bit operands, divided into 16 groups of 2 operands: $OP_0$ and $OP_1$, $OP_2$ and $OP_3$, and so on. The operands $OP_{2j}$ and $OP_{2j+1}$ of each group are formed by combining the 88 words $x_{n,m}$, $4j \le n \le 4j+3$, $0 \le m < 22$. The least significant bit of $x_{n,m}$ is placed at position $3(m + nk) \bmod 192$; $x_{n,m}$ is preferentially placed in $OP_{2j}$, and if that position of $OP_{2j}$ is already occupied, it is placed at the corresponding position of $OP_{2j+1}$. All remaining positions are filled with "0". Taking the output of the X16 merging circuit as an example (Fig. 8), there are 16 groups of operands, each containing 2 merged operands. Each new 192-bit operand consists of 64 words, which come from 4 different input data.
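For group three, the stated priority rule (try $OP_{2j}$ first, fall back to $OP_{2j+1}$) is a first-fit allocation over two operands. The Python sketch below implements that first-fit rule generically (the function name and the `inputs_per_group` parameter are our own assumptions) and checks, for X16, X32 and X48, that groups of 4 inputs never need more than 2 operands and that the merged operands still sum to $X_k$ modulo $p$.

```python
P = 2**64 - 2**32 + 1


def split_words(x: int) -> list[int]:
    return [(x >> (3 * m)) & 0b111 for m in range(22)]


def first_fit_merge(x: list[int], k: int, inputs_per_group: int) -> list[int]:
    """Merge the words of module Xk into 192-bit operands, one group of inputs at a time.

    Word x_{n,m} targets slot s = (m + nk) mod 64, i.e. bit 3s = 3(m + nk) mod 192.
    It goes into the lowest-indexed operand of its group whose slot s is still free;
    slots that stay free remain 0 (the zero-padding circuit).
    """
    operands = []
    for start in range(0, 64, inputs_per_group):
        ops, used = [], []                 # used[i]: occupied slots of ops[i]
        for n in range(start, start + inputs_per_group):
            for m, word in enumerate(split_words(x[n])):
                s = (m + n * k) % 64
                i = 0
                while i < len(ops) and s in used[i]:
                    i += 1
                if i == len(ops):          # open a new operand in this group
                    ops.append(0)
                    used.append(set())
                ops[i] |= word << (3 * s)
                used[i].add(s)
        operands.extend(ops)
    return operands


def radix64_direct(x: list[int], k: int) -> int:
    return sum(xn * pow(8, n * k, P) for n, xn in enumerate(x)) % P


x = [(n * 0x9E3779B97F4A7C15 + 1) % 2**64 for n in range(64)]
for k in (16, 32, 48):
    ops = first_fit_merge(x, k, inputs_per_group=4)
    print(k, len(ops), sum(ops) % P == radix64_direct(x, k))   # k, 32, True
```

For groups four and five the patent describes a specific periodic assignment; the same first-fit routine reaches the same operand counts there, but it is not claimed to reproduce that exact placement.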
Group four: the merging circuits of the Xk operand generation modules, other than X0, X16, X32 and X48, for which $k$ is an even number divisible by 4 or 8.
For the merging circuit of an Xk module whose $k$ is an even number divisible by 4 or 8, other than 0, 16, 32 and 48, the input is the 64 64-bit input data and the output is 24 192-bit operands, divided into 4 groups of 6 operands: $OP_0$ to $OP_5$, $OP_6$ to $OP_{11}$, and so on. The operands $OP_{6j}$ to $OP_{6j+5}$ of each group are formed by combining the 352 words $x_{n,m}$, $16j \le n \le 16j+15$, $0 \le m < 22$. The least significant bit of $x_{n,m}$ is placed at position $3(m + nk) \bmod 192$; the words are merged with a period of 4 or 8 words, and each word is preferentially placed in the lowest-indexed operand among $OP_{6j}$ to $OP_{6j+5}$. All remaining positions are filled with "0". Taking the output of the X4 merging circuit as an example (Fig. 9), there are 4 groups of operands, each containing 6 merged operands: the first group comprises OP0 to OP5, the second OP6 to OP11, and so on. Each new 192-bit operand consists of 64 words from 16 different input data, each providing 4 or 8 consecutive words.
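Why 6 operands per group of 16 inputs suffice in group four can be seen by counting how many words compete for each 3-bit slot. The hypothetical helper below computes, for a given $k$, the largest number of words in any input group that target the same slot $(m + nk) \bmod 64$. Since words aimed at different slots never conflict, this number is exactly how many 192-bit operands a collision-free placement needs; it bounds the operand count but does not reproduce the patent's period-4/8 assignment.

```python
from collections import Counter


def max_slot_multiplicity(k: int, inputs_per_group: int) -> int:
    """Worst-case number of words competing for one 3-bit slot inside any input group.

    Word x_{n,m} of module Xk targets slot (m + nk) mod 64. Words aimed at different
    slots never conflict, so a first-fit merge needs exactly this many 192-bit
    operands per group.
    """
    worst = 0
    for start in range(0, 64, inputs_per_group):
        slots = Counter((m + n * k) % 64
                        for n in range(start, start + inputs_per_group)
                        for m in range(22))
        worst = max(worst, max(slots.values()))
    return worst


group_four = [k for k in range(4, 64, 4) if k not in (16, 32, 48)]
print(len(group_four), {max_slot_multiplicity(k, 16) for k in group_four})   # 12 {6}
```

Each group supplies $16 \times 22 = 352$ words to $6 \times 64 = 384$ slots, so 32 slots per group remain zero-filled.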
Group five: the merging circuits of the Xk operand generation modules, other than X0, X16, X32 and X48, for which $k$ is an even number not divisible by 4 or 8.
For the merging circuit of an Xk module whose $k$ is an even number not divisible by 4 or 8, other than 0, 16, 32 and 48, the input is the 64 64-bit input data and the output is 22 192-bit operands, divided into 2 groups of 11 operands: $OP_0$ to $OP_{10}$ and $OP_{11}$ to $OP_{21}$. The operands $OP_{11j}$ to $OP_{11j+10}$ of each group are formed by combining the 704 words $x_{n,m}$, $32j \le n \le 32j+31$, $0 \le m < 22$. The least significant bit of $x_{n,m}$ is placed at position $3(m + nk) \bmod 192$; the words are merged with a period of 2 words, and each word is preferentially placed in the lowest-indexed operand among $OP_{11j}$ to $OP_{11j+10}$. All remaining positions are filled with "0". Taking the output of the X2 merging circuit as an example (Fig. 10), there are 2 groups of operands, each containing 11 merged operands: the first group comprises OP0 to OP10 and the second OP11 to OP21. Each new 192-bit operand consists of 64 words from 32 different input data, each providing 2 consecutive words.
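In group five the packing is exact: each group of 32 inputs supplies $32 \times 22 = 704$ words, and 11 operands provide $11 \times 64 = 704$ slots. The hypothetical check below confirms that every slot is targeted by exactly 11 words for all 16 such $k$, so 11 operands are both necessary and sufficient and none of their bits is left as zero padding.

```python
from collections import Counter


def slot_histogram(k: int, start: int, size: int = 32) -> Counter:
    """How many words x_{n,m} (start <= n < start + size) target each slot (m + nk) mod 64."""
    return Counter((m + n * k) % 64
                   for n in range(start, start + size)
                   for m in range(22))


def perfectly_packed(k: int) -> bool:
    """True if, in both 32-input groups, every one of the 64 slots is wanted by exactly 11 words."""
    return all(set(slot_histogram(k, start).values()) == {11} for start in (0, 32))


group_five = list(range(2, 64, 4))      # even k not divisible by 4: 2, 6, ..., 62
print(len(group_five), all(perfectly_packed(k) for k in group_five))   # 16 True
```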
Because the operand generation modules in different groups produce different numbers of operands, the multi-operand modular addition modules come in four types: 64-operand, 22-operand, 32-operand and 24-operand modular addition modules.
The 64-operand modular addition module is shown in Fig. 11, where CSA denotes a carry-save adder, CPA denotes a ripple-carry adder, and "<<1" denotes shifting the carry output of a CSA left by 1 bit. Of the 64 operands, the 16 at positions $4i$ ($i = 1, 2, \ldots, 16$) are held back, and the remaining operands are fed three at a time into the first-layer CSAs; the carry of each first-layer CSA shifted left by 1 bit, its sum, and an operand at position $4i$ are fed into a second-layer CSA; the sums of every two second-layer CSAs and the carry of one of them shifted left by 1 bit are fed into a third-layer CSA; the carry of the third-layer CSA shifted left by 1 bit, its sum, and the carry of the other second-layer CSA of the pair shifted left by 1 bit are fed into a fourth-layer CSA; the sums of every two fourth-layer CSAs and the carry of one of them shifted left by 1 bit are fed into a fifth-layer CSA; the carry of the fifth-layer CSA shifted left by 1 bit, its sum, and the carry of the other fourth-layer CSA of the pair shifted left by 1 bit are fed into a sixth-layer CSA; the sums of every two sixth-layer CSAs and the carry of one of them shifted left by 1 bit are fed into a seventh-layer CSA; the carry of the seventh-layer CSA shifted left by 1 bit, its sum, and the carry of the other sixth-layer CSA of the pair shifted left by 1 bit are fed into an eighth-layer CSA. The eighth layer has two CSAs: the carry of the second shifted left by 1 bit, its sum, and the sum of the first are fed into the single ninth-layer CSA; the carry of the ninth-layer CSA shifted left by 1 bit, its sum, and the carry of the first eighth-layer CSA shifted left by 1 bit are fed into the tenth-layer CSA; the carry of the tenth-layer CSA shifted left by 1 bit and its sum are fed into the CPA, and the result is passed to the modular addition module. The modular addition module adds the 193rd bit of its 193-bit input to the low 192 bits, so that the output is congruent to the input modulo the prime $p$.
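A short Python sketch of the final folding step (our own illustration; `fold193` is a hypothetical name). It relies only on $2^{192} \equiv 1 \pmod p$ and shows that when the CPA adds two 192-bit values, the folded result still fits in 192 bits, leaving the final reduction below $p$ to the modulo-p module.

```python
import random

P = 2**64 - 2**32 + 1
MASK192 = (1 << 192) - 1


def fold193(s: int) -> int:
    """Modular addition step: add bit 192 of a 193-bit CPA result to its low 192 bits.

    Because 2^192 = 1 (mod P), s = hi * 2^192 + lo is congruent to lo + hi modulo P.
    """
    if not 0 <= s < 1 << 193:
        raise ValueError("expected a 193-bit value")
    return (s & MASK192) + (s >> 192)


# The CPA adds two 192-bit values, so s <= 2^193 - 2 and the folded result fits in 192 bits.
for _ in range(1000):
    a, b = random.getrandbits(192), random.getrandbits(192)
    r = fold193(a + b)
    assert r % P == (a + b) % P and r <= MASK192

assert fold193(MASK192 + MASK192) == MASK192     # worst case: still 192 bits
```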
The 22-operand modular addition module is shown in Fig. 12, where CSA denotes a carry-save adder, CPA denotes a ripple-carry adder, and "ROL 1-bit" denotes rotating the carry output of a CSA left by 1 bit. The 22 operands are divided into 2 groups of 11, and each group undergoes the same operations: operands 1, 2, 3; 5, 6, 7; and 9, 10, 11 of the 11 are fed into three first-layer CSAs; the sum of the first first-layer CSA, operand 4, and the carry of the second first-layer CSA rotated left by 1 bit are fed into the first second-layer CSA; operand 8, the carry of the third first-layer CSA rotated left by 1 bit, and its sum are fed into the second second-layer CSA; the carry of the first first-layer CSA rotated left by 1 bit, the carry of the first second-layer CSA rotated left by 1 bit, and its sum are fed into the first third-layer CSA; the sum of the second first-layer CSA, the carry of the second second-layer CSA rotated left by 1 bit, and its sum are fed into the second third-layer CSA; the sum of the first third-layer CSA, the carry of the second third-layer CSA rotated left by 1 bit, and its sum are fed into the fourth-layer CSA; and the carry of the first third-layer CSA rotated left by 1 bit, the carry of the fourth-layer CSA rotated left by 1 bit, and its sum are fed into the fifth-layer CSA.
The sum of the first group's fifth-layer CSA, the carry of the second group's fifth-layer CSA rotated left by 1 bit, and its sum are fed into the sixth-layer CSA; the carry of the first group's fifth-layer CSA rotated left by 1 bit, the carry of the sixth-layer CSA rotated left by 1 bit, and its sum are fed into the seventh-layer CSA; the carry of the seventh-layer CSA rotated left by 1 bit and its sum are fed into the CPA, and the result is passed to the modular addition module. The modular addition module adds the 193rd bit of its 193-bit input to the low 192 bits, so that the output is congruent to the input modulo the prime $p$.
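The rotation in these adders keeps every intermediate vector at 192 bits. Since $2^{192} \equiv 1 \pmod p$, $p$ divides $2^{192} - 1$, and rotating a 192-bit carry left by one bit doubles it modulo $2^{192} - 1$, hence modulo $p$. The Python sketch below (our own; `rol1`, `csa` and `csa_tree_mod` are hypothetical names) models a 3:2 carry-save adder on bit vectors, rotates each carry instead of shifting it, and finishes with a plain addition as the CPA and the 193-bit fold. It reduces greedily layer by layer rather than reproducing the exact wiring of Fig. 12, so it illustrates the arithmetic, not the figure.

```python
import random

P = 2**64 - 2**32 + 1
WIDTH = 192
MASK = (1 << WIDTH) - 1


def rol1(v: int) -> int:
    """'ROL 1-bit': rotate a 192-bit CSA carry vector left by one position."""
    return ((v << 1) | (v >> (WIDTH - 1))) & MASK


def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save adder on bit vectors: a + b + c == s + 2 * carry."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def csa_tree_mod(operands: list[int]) -> int:
    """Sum 192-bit operands modulo 2^192 - 1 (hence modulo P) with a CSA tree.

    Each layer feeds as many triples as possible into CSAs and passes leftovers
    through; every carry is rotated, so the bit leaving position 191 re-enters at
    bit 0. The last two vectors go through a CPA (plain addition, up to 193 bits)
    and the 193rd bit is folded back in.
    """
    vals = list(operands)
    while len(vals) > 2:
        full = len(vals) // 3 * 3
        nxt = []
        for i in range(0, full, 3):
            s, c = csa(vals[i], vals[i + 1], vals[i + 2])
            nxt += [s, rol1(c)]
        vals = nxt + vals[full:]
    total = sum(vals)                          # CPA
    return (total & MASK) + (total >> WIDTH)   # modular addition module (fold bit 192)


ops = [random.getrandbits(WIDTH) for _ in range(22)]
result = csa_tree_mod(ops)
assert result % P == sum(ops) % P and result <= MASK
```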
The 32-operand modular addition module is shown in Fig. 13, where CSA denotes a carry-save adder, CPA denotes a ripple-carry adder, and "ROL 1-bit" denotes a circular left shift by 1 bit of the carry output of a carry-save adder. Of the 32 operands, the 8 operands at positions 4i (i = 1, 2, ..., 8) are held back, and the remaining 24 operands are fed three at a time into the first-layer CSAs. The carry output of each first-layer CSA (shifted left by 1 bit), its sum output, and one of the held-back operands at position 4i are input into a second-layer CSA. For every two second-layer CSAs, the two sum outputs and the carry output of one of them (shifted left by 1 bit) are input into a third-layer CSA; the carry output of that third-layer CSA (shifted left by 1 bit), its sum output, and the carry output of the other second-layer CSA of the pair (shifted left by 1 bit) are input into a fourth-layer CSA. For every two fourth-layer CSAs, the two sum outputs and the carry output of one of them (shifted left by 1 bit) are input into a fifth-layer CSA; the carry output of that fifth-layer CSA (shifted left by 1 bit), its sum output, and the carry output of the other fourth-layer CSA of the pair (shifted left by 1 bit) are input into a sixth-layer CSA. The sixth layer has two CSAs: the carry output of the second CSA (shifted left by 1 bit), its sum output, and the sum output of the first CSA are input into the seventh-layer CSA (one in total). The carry output of the seventh-layer CSA (shifted left by 1 bit), its sum output, and the carry output of the first sixth-layer CSA (shifted left by 1 bit) are input into the eighth-layer CSA. The carry output of the eighth-layer CSA, shifted left by 1 bit, and its sum output are input into the CPA, and the result is input into the modular addition module.
The modular addition module adds the low 192 bits of its 193-bit input to the 193rd bit, so that its output is congruent to its input modulo the prime p.
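The Python sketch below models the 32-operand tree of Fig. 13 as described above, under two assumptions the text does not state: positions 4i are counted from 1 (so the held-back operands are list indices 3, 7, ..., 31), and the t-th held-back operand is paired with the t-th first-layer CSA. It checks that the eight CSA layers, followed by the CPA, the 193-bit fold and a final reduction, return the sum of the operands modulo p. Function names are hypothetical.

```python
import random

P = 2**64 - 2**32 + 1
W = 192
MASK = (1 << W) - 1

def rol(x: int) -> int:
    """ROL 1-bit: rotate a 192-bit carry word left by one position."""
    return ((x << 1) | (x >> (W - 1))) & MASK

def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """Carry-save adder returning (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def add32_mod_p(ops: list[int]) -> int:
    """Model of the Fig. 13 tree: 32 operands of 192 bits -> sum mod p."""
    assert len(ops) == 32
    held = ops[3::4]                                   # positions 4, 8, ..., 32 (1-based)
    rest = [x for j, x in enumerate(ops) if j % 4 != 3]
    l1 = [csa(*rest[3 * t:3 * t + 3]) for t in range(8)]           # layer 1: 8 CSAs
    l2 = [csa(rol(c), s, h) for (s, c), h in zip(l1, held)]         # layer 2: 8 CSAs
    l3 = [csa(l2[2 * u][0], l2[2 * u + 1][0], rol(l2[2 * u][1])) for u in range(4)]
    l4 = [csa(rol(c), s, rol(l2[2 * u + 1][1])) for u, (s, c) in enumerate(l3)]
    l5 = [csa(l4[2 * v][0], l4[2 * v + 1][0], rol(l4[2 * v][1])) for v in range(2)]
    l6 = [csa(rol(c), s, rol(l4[2 * v + 1][1])) for v, (s, c) in enumerate(l5)]
    s7, c7 = csa(rol(l6[1][1]), l6[1][0], l6[0][0])                 # layer 7
    s8, c8 = csa(rol(c7), s7, rol(l6[0][1]))                        # layer 8
    total = s8 + rol(c8)                               # CPA, up to 193 bits
    folded = (total & MASK) + (total >> W)             # modular addition module
    return folded % P                                  # modulo-p module

rng = random.Random(7)
ops = [rng.getrandbits(W) for _ in range(32)]
assert add32_mod_p(ops) == sum(ops) % P
```

Every CSA output of the tree is consumed exactly once, so the final value stays congruent to the operand sum modulo $2^{192}-1$ and therefore modulo p.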
The 24-operand modular addition module is shown in Fig. 14, where CSA denotes a carry-save adder, CPA denotes a ripple-carry adder, and "ROL 1-bit" denotes a circular left shift by 1 bit of the carry output of a carry-save adder. The 24 operands are fed three at a time into the first-layer CSAs. For every two first-layer CSAs, the two sum outputs and the carry output of one of them (circularly shifted left by 1 bit) are input into a second-layer CSA; the carry output of that second-layer CSA (circularly shifted left by 1 bit), its sum output, and the carry output of the other first-layer CSA of the pair (circularly shifted left by 1 bit) are input into a third-layer CSA. For every two third-layer CSAs, the two sum outputs and the carry output of one of them (circularly shifted left by 1 bit) are input into a fourth-layer CSA, so the fourth layer has two CSAs. The carry output of each fourth-layer CSA (circularly shifted left by 1 bit), its sum output, and the carry output of the other third-layer CSA of the pair (circularly shifted left by 1 bit) are input into a fifth-layer CSA. The fifth layer has two CSAs: the carry output of the second CSA (circularly shifted left by 1 bit), its sum output, and the sum output of the first CSA are input into the sixth-layer CSA (one in total). The carry output of the sixth-layer CSA (circularly shifted left by 1 bit), its sum output, and the carry output of the first fifth-layer CSA (circularly shifted left by 1 bit) are input into the seventh-layer CSA. The carry output of the seventh-layer CSA, circularly shifted left by 1 bit, and its sum output are input into the CPA, and the result is input into the modular addition module. The modular addition module adds the low 192 bits of its 193-bit input to the 193rd bit, so that its output is congruent to its input modulo the prime p.
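The layer counts of Figs. 13 and 14 can be cross-checked with a greedy 3:2 reduction count, sketched below in Python. The wiring is not the figures' wiring (this schedule simply compresses every full triple in each layer), but the depth agrees with the text: 24 operands need seven CSA layers and 32 operands need eight. The function name `csa_layers` is hypothetical.

```python
def csa_layers(n: int) -> int:
    """Count 3:2 CSA layers a greedy schedule needs to cut n operands down to 2."""
    layers = 0
    while n > 2:
        n = 2 * (n // 3) + n % 3   # each full triple -> (sum, carry); leftovers pass through
        layers += 1
    return layers

for n in (22, 24, 32):
    print(n, "operands ->", csa_layers(n), "CSA layers")
```

For 22, 24 and 32 operands this gives 7, 7 and 8 layers respectively; the final two words are what the CPA adds.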
The modulo-p module reduces its input modulo the prime p.
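The text does not describe the internals of the modulo-p module. One possible realization, sketched in Python under the assumption that the module receives a value of at most 192 bits, is repeated folding with $2^{64} \equiv 2^{32} - 1 \pmod p$, which needs only shifts, additions and subtractions plus one final conditional subtraction. This is an illustrative choice, not the patented circuit.

```python
import random

P = 2**64 - 2**32 + 1
LOW64 = (1 << 64) - 1

def mod_p(x: int) -> int:
    """Reduce 0 <= x < 2**192 modulo p by folding with 2**64 = 2**32 - 1 (mod p)."""
    assert 0 <= x < 1 << 192
    while x >> 64:                          # while x >= 2**64
        hi = x >> 64
        x = (x & LOW64) + (hi << 32) - hi   # replace hi * 2**64 by hi * (2**32 - 1)
    return x - P if x >= P else x           # now x < 2**64 < 2 * p

rng = random.Random(3)
for _ in range(1000):
    v = rng.getrandbits(192)
    assert mod_p(v) == v % P
```

Each fold strictly decreases x while preserving its residue, so the loop terminates after a few iterations for 192-bit inputs.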

Claims (6)

1. A radix-64 arithmetic circuit for number-theoretic-transform multiplication, characterized in that it comprises 64 operand generation modules, numbered Xk (k = 0, 1, 2, ..., 63), each comprising a dividing circuit, a merging circuit and a zero-padding circuit. The dividing circuit pads each of the 64 input data with zeros on the high-order side and divides it into 22 words of 3 bits each; word m of input n is denoted $x_{n,m}$, with $0 \le n < 64$ and $0 \le m < 22$. The merging circuit combines the 64 × 22 words into output operands: of the 64 operand generation modules, 1 outputs 64 96-bit operands, 48 output 22 192-bit operands, 3 output 32 192-bit operands, and 12 output 24 192-bit operands. The zero-padding circuit fills the vacant positions of the operands output by the merging circuit with '0'. An operand modular addition module performs modular addition on the operands output by each operand generation module;
and
a modulo-p module reduces the data output by each operand modular addition module modulo the prime p and outputs the result, where $p = 2^{64} - 2^{32} + 1$.
2. The radix-64 arithmetic circuit for number-theoretic-transform multiplication of claim 1, wherein the operand generation module that outputs 64 96-bit operands is numbered X0; in each 96-bit operand, the last 22 words hold the input data and the first 10 words are set to zero.
3. The radix-64 arithmetic circuit for number-theoretic-transform multiplication of claim 1, wherein the operand generation modules that output 22 192-bit operands and are numbered Xk with k odd form each operand $OP_m$ from 64 different input words $x_{n,m}$ ($0 \le n < 64$) that share the same word index m ($0 \le m < 22$), and the least significant bit of $x_{n,m}$ is placed in $OP_m$ at bit position $3(m + nk) \bmod 192$.
4. The radix-64 arithmetic circuit for number-theoretic-transform multiplication of claim 1, wherein the operand generation modules that output 32 192-bit operands are numbered X16, X32 and X48; the 32 operands are divided into 16 groups of 2 operands each (OP0 and OP1 form one group, OP2 and OP3 another, and so on); the operands $OP_{2j}$ and $OP_{2j+1}$ of each group are formed from 88 different input words $x_{n,m}$ with $4j \le n \le 4j+3$ and $0 \le m < 22$; the least significant bit of $x_{n,m}$ is placed in $OP_{2j}$ or $OP_{2j+1}$ at bit position $3(m + nk) \bmod 192$; and $x_{n,m}$ is placed in $OP_{2j}$ by preference or, if that position is already occupied, in the corresponding position of $OP_{2j+1}$.
5. The radix-64 arithmetic circuit for number-theoretic-transform multiplication of claim 1, wherein the operand generation modules that output 24 192-bit operands are numbered Xk, excluding X0, X16, X32 and X48, with k an even number divisible by 4 or 8; the 24 operands are divided into 4 groups (OP0 to OP5 form one group, OP6 to OP11 another, and so on); the operands $OP_{6j}$ to $OP_{6j+5}$ of each group are formed from 352 different input words $x_{n,m}$ with $16j \le n \le 16j+15$ and $0 \le m < 22$; the least significant bit of $x_{n,m}$ is placed in $OP_{6j}$ to $OP_{6j+5}$ at bit position $3(m + nk) \bmod 192$; and the words $x_{n,m}$ are merged into operands with a period of 4 or 8 words, each being placed preferentially in the lowest-indexed operand among $OP_{6j}$ to $OP_{6j+5}$.
6. The radix-64 arithmetic circuit for number-theoretic-transform multiplication of claim 1, wherein the operand generation modules that output 22 192-bit operands are numbered Xk, excluding X0, X16, X32 and X48, with k an even number not divisible by 4 or 8; the 22 operands are divided into 2 groups (OP0 to OP10 and OP11 to OP21); the operands $OP_{11j}$ to $OP_{11j+10}$ of each group are formed from 704 different input words $x_{n,m}$ with $32j \le n \le 32j+31$ and $0 \le m < 22$; the least significant bit of $x_{n,m}$ is placed in $OP_{11j}$ to $OP_{11j+10}$ at bit position $3(m + nk) \bmod 192$; and the words $x_{n,m}$ are merged into operands with a period of 2 words, each being placed preferentially in the lowest-indexed operand among $OP_{11j}$ to $OP_{11j+10}$.
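The Python sketch below is a hypothetical software model of the merging and zero-padding circuits of claims 3 to 6 (module X0 of claim 2 is omitted, since each of its operands is simply one zero-extended input). It assumes 64-bit inputs and assumes that module Xk computes $X_k = \sum_{n} x_n \cdot 8^{nk} \bmod p$; the twiddle factor 8 is not stated in the claims, but it is a primitive 64th root of unity modulo p and is consistent with the position formula $3(m + nk) \bmod 192$. For claims 4 to 6 it applies only the lowest-index-first placement rule and does not model the 2-, 4- or 8-word periods. It checks that every module's operands sum to $X_k$ modulo p and tallies how many modules produce each operand count.

```python
import random
from collections import Counter

P = 2**64 - 2**32 + 1

def group_shape(k: int) -> tuple[int, int]:
    """(inputs per group, operands per group) for module Xk, 1 <= k < 64 (claims 3-6)."""
    assert 1 <= k < 64
    if k % 2 == 1:
        return 64, 22            # claim 3: one group, OP_m holds word m of every input
    if k % 4 == 2:
        return 32, 11            # claim 6: 2 groups of 11 operands
    if k % 16 == 0:
        return 4, 2              # claim 4: X16, X32, X48, 16 groups of 2 operands
    return 16, 6                 # claim 5: 4 groups of 6 operands

def generate_operands(xs: list[int], k: int) -> list[int]:
    """Hypothetical model of the merging + zero-padding circuits for module Xk."""
    size, per_group = group_shape(k)
    ops: list[int] = []
    for first in range(0, 64, size):
        group = [0] * per_group
        taken = [set() for _ in range(per_group)]    # occupied 3-bit slots per operand
        for n in range(first, first + size):
            for m in range(22):
                word = (xs[n] >> (3 * m)) & 7        # x_{n,m}
                slot = (m + n * k) % 64              # LSB at bit 3*(m + n*k) mod 192
                if k % 2 == 1:
                    t = m                            # claim 3: operand OP_m
                else:                                # claims 4-6: lowest free operand
                    t = next(i for i in range(per_group) if slot not in taken[i])
                assert slot not in taken[t]
                taken[t].add(slot)
                group[t] |= word << (3 * slot)       # untouched slots stay '0'
        ops.extend(group)
    return ops

rng = random.Random(2020)
xs = [rng.getrandbits(64) for _ in range(64)]
counts = Counter()
for k in range(1, 64):
    ops = generate_operands(xs, k)
    counts[len(ops)] += 1
    x_k = sum(x * pow(8, n * k, P) for n, x in enumerate(xs)) % P   # assumed twiddle 8
    assert all(op >> 192 == 0 for op in ops)
    assert sum(ops) % P == x_k
print(counts)   # operand count -> number of modules, excluding X0
```

The tally matches claim 1: 48 modules with 22 operands, 12 with 24 and 3 with 32. Because each 3-bit word occupies exactly one slot, lowest-index-first placement never needs more operands than the largest number of words that land on the same slot, which is 2, 6 and 11 for the groupings of claims 4, 5 and 6.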
CN202010371311.4A 2020-05-06 2020-05-06 Base 64 operation circuit for number theory transformation multiplication Active CN111694540B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010371311.4A CN111694540B (en) 2020-05-06 2020-05-06 Base 64 operation circuit for number theory transformation multiplication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010371311.4A CN111694540B (en) 2020-05-06 2020-05-06 Base 64 operation circuit for number theory transformation multiplication

Publications (2)

Publication Number Publication Date
CN111694540A true CN111694540A (en) 2020-09-22
CN111694540B CN111694540B (en) 2023-04-21

Family

ID=72476916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010371311.4A Active CN111694540B (en) 2020-05-06 2020-05-06 Base 64 operation circuit for number theory transformation multiplication

Country Status (1)

Country Link
CN (1) CN111694540B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100146030A1 (en) * 2008-12-08 2010-06-10 International Business Machines Corporation Combined Binary/Decimal Fixed-Point Multiplier and Method
CN102866875A (en) * 2012-10-05 2013-01-09 Liu Jie Universal multi-operand summator
CN103870438A (en) * 2014-02-25 2014-06-18 Fudan University Circuit structure using number theoretic transform for calculating cyclic convolution
US20180294950A1 (en) * 2017-04-11 2018-10-11 The Governing Council Of The University Of Toronto Homomorphic Processing Unit (HPU) for Accelerating Secure Computations under Homomorphic Encryption
CN108733413A (en) * 2017-04-24 2018-11-02 Arm Limited Shift instruction
CN110543291A (en) * 2019-06-11 2019-12-06 Nantong University Finite field large integer multiplier and implementation method of large integer multiplication based on SSA algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Wei: "Design of Parallel Multipliers", Modern Enterprise Education *

Also Published As

Publication number Publication date
CN111694540B (en) 2023-04-21

Similar Documents

Publication Publication Date Title
Mohan et al. RNS-to-Binary Converters for Two Four-Moduli Sets $\{2^{n}-1, 2^{n}, 2^{n}+ 1, 2^{{n}+ 1}-1\} $ and $\{2^{n}-1, 2^{n}, 2^{n}+ 1, 2^{{n}+ 1}+ 1\} $
Blahut Fast algorithms for signal processing
US20210349692A1 (en) Multiplier and multiplication method
Lee et al. A high-speed two-parallel radix-$2^4$ FFT/IFFT processor for MB-OFDM UWB systems
Al-Khaleel et al. Fast and compact binary-to-BCD conversion circuits for decimal multiplication
JPH11203272A (en) Device, system and method for east fourier transformation processing
CN104617959A (en) Universal processor-based LDPC (Low Density Parity Check) encoding and decoding method
CN110543291A (en) Finite field large integer multiplier and implementation method of large integer multiplication based on SSA algorithm
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN113608718B (en) Method for realizing prime number domain large integer modular multiplication calculation acceleration
JPS5858695B2 (en) binary multiplication device
CN114548387A (en) Method for executing multiplication operation by neural network processor and neural network processor
CN111694542B (en) Base 16 arithmetic circuit for number theory conversion multiplication
CN111694540A (en) Base 64 arithmetic circuit for number theory conversion multiplication
CN111694541B (en) Base 32 operation circuit for number theory transformation multiplication
US5289399A (en) Multiplier for processing multi-valued data
CN109379191B (en) Dot multiplication operation circuit and method based on elliptic curve base point
US5999962A (en) Divider which iteratively multiplies divisor and dividend by multipliers generated from the divisors to compute the intermediate divisors and quotients
Parhami On equivalences and fair comparisons among residue number systems with special moduli
CN1348141A (en) Discrete 3780-point Fourier transformation processor system and its structure
JP2001101160A (en) Data storage pattern for fast fourier transform
KR20080040978A (en) Parallel and pipelined radix - 2 to the fourth power fft processor
CN210006029U (en) Data processor
CN114422315B (en) Ultra-high throughput IFFT/FFT modulation and demodulation method
Jaberipur et al. A Parallel Prefix Modulo-$(2^{q}+2^{q-1}+1)$ Adder via Diminished-1 Representation of Residues

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Hua Siliang

Inventor after: Bian Jiuhui

Inventor after: Zhang Jingya

Inventor after: Zhang Huiguo

Inventor after: Liu Yushen

Inventor after: Xu Jian

Inventor before: Hua Siliang

Inventor before: Bian Jiuhui

Inventor before: Zhang Jingya

Inventor before: Zhang Huiguo

Inventor before: Liu Yushen

Inventor before: Xu Jian

GR01 Patent grant