CN115587274A - Polynomial multiplication accelerating method and device - Google Patents

Info

Publication number
CN115587274A
Authority
CN
China
Prior art keywords
data
module
input
post
multiplication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211245657.5A
Other languages
Chinese (zh)
Inventor
王中风
张灏辰
田静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202211245657.5A priority Critical patent/CN115587274A/en
Publication of CN115587274A publication Critical patent/CN115587274A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Algebra (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a method and a device for accelerating polynomial multiplication. The device comprises m preprocessing external blocks, an input sorting module, k preprocessing internal blocks, a group of central multiplier arrays, k post-processing internal blocks, an output integration module and m post-processing external blocks.

Description

Polynomial multiplication accelerating method and device
Technical Field
The invention relates to a method and a device for accelerating polynomial multiplication.
Background
In digital signal processing, cryptography, coding theory and other fields, one often needs to multiply two polynomials quickly. The cycle count, total delay and resource consumption of this multiplication are important factors in the area efficiency of the overall hardware architecture in these applications, and many practical optimization methods for polynomial multiplication have therefore been proposed.
Since it was proposed in 1962 (reference: Karatsuba, Anatolii & Ofman, Yu. Multiplication of Multidigit Numbers on Automata. Soviet Physics Doklady 7, 595), the Karatsuba algorithm has for decades been one of the best ways to reduce the complexity of polynomial multiplication. For an N-term polynomial multiplication with $N = 2^n$, it reduces the multiplication complexity from $N^2$ to $N^{\log_2 3}$, with an addition complexity of no more than $6N^{\log_2 3} - 8N + 2$.
In practice, however, polynomial multiplications whose coefficients have a large bit width are sometimes encountered, for example in modular multiplication over Galois fields in the study of elliptic curves. Usually a conventional multiplier, or a multiplier IP core provided by the FPGA, is used as the central multiplier. When the bit width of the polynomial coefficients reaches tens or hundreds of bits, however, it may exceed the range supported by the multiplier IP core, and a conventional multiplier design may suffer from excessive computational complexity and hardware area, so the polynomial multiplier can degrade the performance of the whole hardware implementation.
Many implementations of polynomial multiplication and integer multiplication are based on the Karatsuba algorithm. For two two-term polynomials $A(x) = a_0 + a_1 x$ and $B(x) = b_0 + b_1 x$, the classical multiplication algorithm is:
$C(x) = a_0 b_0 + (a_0 b_1 + a_1 b_0)x + a_1 b_1 x^2$
This algorithm requires four multiplications and one addition. The two-term polynomial multiplication algorithm $KA_2$ based on the Karatsuba algorithm is:
$C(x) = a_0 b_0 + ((a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1)x + a_1 b_1 x^2$
This algorithm requires three multiplications and four additions. Given that the delay and resource consumption of a multiplication are far higher than those of an addition, the algorithm reduces the complexity of two-term multiplication to a certain extent. Applying the two-term Karatsuba multiplication recursively yields a Karatsuba algorithm for $2^n$ terms, which can be used to multiply two $2^n$-term polynomials quickly. The specific algorithm is shown as Algorithm 1, in which the quantities listed in the original formula image (BDA0003886478970000021, not reproduced) are unsigned integers (including 0).
Algorithm 1: recursive Karatsuba $2^n$-term polynomial multiplication algorithm
[Algorithm 1 listing: formula images BDA0003886478970000022 and BDA0003886478970000023 not reproduced]
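Because the listing of Algorithm 1 is not available in this text, the recursion described above can be illustrated with a minimal Python sketch. It is not the patent's listing: the function names `schoolbook` and `karatsuba` and the list-of-coefficients representation (lowest degree first) are assumptions made for illustration only.

```python
def schoolbook(a, b):
    """Classical product of two coefficient lists (lowest degree first)."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


def karatsuba(a, b):
    """Recursive Karatsuba product of two polynomials with 2**n terms each."""
    n = len(a)
    assert n == len(b) and n & (n - 1) == 0, "length must be a power of two"
    if n == 1:
        return [a[0] * b[0]]
    h = n // 2
    a0, a1 = a[:h], a[h:]          # A(x) = A0(x) + A1(x) * x**h
    b0, b1 = b[:h], b[h:]
    d0 = karatsuba(a0, b0)         # A0 * B0
    d2 = karatsuba(a1, b1)         # A1 * B1
    s = karatsuba([x + y for x, y in zip(a0, a1)],
                  [x + y for x, y in zip(b0, b1)])   # (A0 + A1)(B0 + B1)
    mid = [s[i] - d0[i] - d2[i] for i in range(len(s))]
    c = [0] * (2 * n - 1)
    for i, v in enumerate(d0):
        c[i] += v                  # low part
    for i, v in enumerate(mid):
        c[i + h] += v              # middle part, shifted by x**h
    for i, v in enumerate(d2):
        c[i + n] += v              # high part, shifted by x**n
    return c


product = karatsuba([1, 2, 3, 4], [5, 6, 7, 8])
```

Each level replaces one product of $2^n$-term polynomials by three products of half the size, which is where the $3^n$ multiplication count comes from.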
By calculation, the algorithm reduces the multiplication complexity from $4^n$ for the traditional algorithm to $3^n$, and its addition complexity is no more than $2 \cdot 3^{n+1} - 2^{n+3} + 2$. Besides Karatsuba polynomial multiplication for $2^n$ terms, there are Karatsuba algorithms for 3, 5 and 7 terms, from which Karatsuba polynomial multiplication for an arbitrary integer number of terms can be built by a method similar to the recursive algorithm above. The reference "Weimerskirch, André and Christof Paar. "Generalizations of the Karatsuba Algorithm for Efficient Implementations." IACR Cryptol. ePrint Arch. 2006 (2006): 224" also shows that, for any positive integer N, the ratio of the hardware area of Karatsuba polynomial multiplication to that of conventional polynomial multiplication is not less than a bound that the original gives as a formula image (BDA0003886478970000024, not reproduced).
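The operation counts quoted for the $2^n$-term algorithm can be checked with a short recurrence. The sketch below is an illustration under one stated assumption: each recursion level on a polynomial with `terms` coefficients costs `terms` pre-additions, `2 * (terms - 1)` subtractions and `terms - 2` overlap additions. That per-level cost is a standard accounting for the two-term split, not a figure taken from the patent; the function name `karatsuba_counts` is hypothetical.

```python
def karatsuba_counts(n):
    """Return (multiplications, additions) for recursive Karatsuba on 2**n terms."""
    mults, adds = 1, 0                     # one coefficient product for a 1-term input
    for level in range(1, n + 1):
        terms = 2 ** level
        # three half-size products, plus 4*terms - 4 additions/subtractions
        # at this level (assumed accounting, see the lead-in)
        mults, adds = 3 * mults, 3 * adds + 4 * terms - 4
    return mults, adds


for n in range(1, 6):
    mults, adds = karatsuba_counts(n)
    print(n, mults, adds, 4 ** n)          # last column: schoolbook multiplications
```

Under this accounting the addition count meets the bound $2 \cdot 3^{n+1} - 2^{n+3} + 2$ with equality, and for $n = 1$ it gives the three multiplications and four additions of $KA_2$.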
Disclosure of Invention
Object of the invention: the technical problem to be solved by the invention is to provide a method and a device for accelerating polynomial multiplication, in particular a method and a device based on the Karatsuba architecture. The method comprises:
inputting two sets of polynomial coefficients, each set having $p_1 p_2 \cdots p_m$ terms, where $p_1, p_2, \ldots, p_m$ are the 1st, 2nd, ..., m-th prime factors of the number of terms (repeats allowed);
processing the two sets of polynomial coefficients according to all the operation rules that the data stream passes through before reaching the multiplications in the Karatsuba algorithms with $p_1, p_2, \ldots, p_m$ terms, to obtain two sets of externally preprocessed data;
performing bit-slicing and reordering on the two sets of externally preprocessed data respectively, to obtain sorted data;
processing the sorted data according to all the operation rules that the data stream passes through before reaching the multiplications in the Karatsuba algorithms with $p_{-1}, p_{-2}, \ldots, p_{-k}$ terms, to obtain two sets of internally preprocessed data, where $p_{-1}, p_{-2}, \ldots, p_{-k}$ are the 1st, 2nd, ..., k-th prime factors specified according to the usage requirement. The usage requirement is determined by the multiplier area (resources) that the user can accept. For example, to compute a 4-term, 64-bit polynomial multiplication, a conventional multiplier uses 256 DSPs and a conventional Karatsuba multiplier uses 144 DSPs, whereas 108 DSPs are used if $p_{-1} = 2$ and 81 DSPs if $p_{-1} = p_{-2} = 2$;
multiplying corresponding data in the two sets of internally preprocessed data, to obtain a set of preliminary product data;
processing the preliminary product data according to all the operation rules that the data stream passes through after the multiplications in the Karatsuba algorithms with $p_{-1}, p_{-2}, \ldots, p_{-k}$ terms, to obtain internally post-processed data;
reordering, shifting and adding the internally post-processed data, to obtain integrated data;
processing the integrated data according to all the operation rules that the data stream passes through after the multiplications in the Karatsuba algorithms with $p_1, p_2, \ldots, p_m$ terms, to obtain the final output data, namely the coefficients of the product of the two $p_1 p_2 \cdots p_m$-term polynomials.
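The DSP figures quoted in the example above (256, 144, 108 and 81) can be reproduced with a small counting sketch. It rests on assumptions that the text does not state: each DSP is taken to be a 16-bit by 16-bit multiplier, any slice wider than 16 bits is multiplied schoolbook-style from 16-bit pieces, and the bit growth caused by the Karatsuba pre-additions is ignored. The function `multiplier_count` and its parameters are hypothetical names used only for illustration.

```python
from math import prod


def multiplier_count(outer, inner, width, dsp_width=16):
    """Estimate the number of DSP-sized multipliers.

    outer: multiplications used by each external level (l_1, ..., l_m)
    inner: (terms, multiplications) pairs for each internal level
    width: coefficient bit width before slicing
    """
    slice_width = width // prod(p for p, _ in inner)
    pieces = (slice_width + dsp_width - 1) // dsp_width   # ceil division
    per_product = pieces ** 2          # schoolbook on the remaining slice
    return prod(outer) * prod(l for _, l in inner) * per_product


conventional = multiplier_count([4, 4], [], 64)                # 2-term schoolbook: 4 products
karatsuba_only = multiplier_count([3, 3], [], 64)              # KA_2 twice: 3 products each
one_inner = multiplier_count([3, 3], [(2, 3)], 64)             # p_{-1} = 2
two_inner = multiplier_count([3, 3], [(2, 3), (2, 3)], 64)     # p_{-1} = p_{-2} = 2
print(conventional, karatsuba_only, one_inner, two_inner)
```

Each internal level halves the slice width and multiplies the count by 3 instead of 4, which is why every added level saves a quarter of the remaining multipliers in this model.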
The invention also provides a device for accelerating polynomial multiplication, comprising m preprocessing external blocks, an input sorting module, k preprocessing internal blocks, a group of central multiplier arrays, k post-processing internal blocks, an output integration module and m post-processing external blocks, where m and k are positive integers.
The m preprocessing external blocks receive two sets of polynomial coefficients, each set having $p_1 p_2 \cdots p_m$ terms, where $p_1, p_2, \ldots, p_m$ are the 1st, 2nd, ..., m-th prime factors of the number of terms (repeats allowed); they then process the two sets of polynomial coefficients according to all the operation rules that the data stream passes through before reaching the multiplications in the Karatsuba algorithms with $p_1, p_2, \ldots, p_m$ terms, to obtain two sets of externally preprocessed data;
the input sorting module performs bit-slicing and reordering on the two sets of externally preprocessed data respectively, to obtain sorted data;
the k preprocessing internal blocks process the sorted data according to all the operation rules that the data stream passes through before reaching the multiplications in the Karatsuba algorithms with $p_{-1}, p_{-2}, \ldots, p_{-k}$ terms, to obtain two sets of internally preprocessed data, where $p_{-1}, p_{-2}, \ldots, p_{-k}$ are the 1st, 2nd, ..., k-th prime factors specified according to the usage requirement;
the group of central multiplier arrays multiplies corresponding data in the two sets of internally preprocessed data, to obtain a set of preliminary product data;
the k post-processing internal blocks process the preliminary product data according to all the operation rules that the data stream passes through after the multiplications in the Karatsuba algorithms with $p_{-1}, p_{-2}, \ldots, p_{-k}$ terms, to obtain internally post-processed data;
the output integration module performs reordering, shifting and addition on the internally post-processed data, to obtain integrated data;
the m post-processing external blocks process the integrated data according to all the operation rules that the data stream passes through after the multiplications in the Karatsuba algorithms with $p_1, p_2, \ldots, p_m$ terms, to obtain the final output data, namely the coefficients of the product of the two $p_1 p_2 \cdots p_m$-term polynomials.
The m preprocessing external blocks are respectively the KA_pre modules of $KA_{p_1}, KA_{p_2}, \ldots, KA_{p_m}$;
the k preprocessing internal blocks are respectively the KA_pre modules of $KA_{p_{-1}}, KA_{p_{-2}}, \ldots, KA_{p_{-k}}$;
the k post-processing internal blocks are respectively the KA_post modules of $KA_{p_{-1}}, KA_{p_{-2}}, \ldots, KA_{p_{-k}}$;
the m post-processing external blocks are respectively the KA_post modules of $KA_{p_1}, KA_{p_2}, \ldots, KA_{p_m}$;
where $KA_{p_1}, KA_{p_2}, \ldots, KA_{p_m}$ denote the Karatsuba algorithm modules with $p_1, p_2, \ldots, p_m$ terms respectively; see the references "Weimerskirch, André and Christof Paar. "Generalizations of the Karatsuba Algorithm for Efficient Implementations." IACR Cryptol. ePrint Arch. 2006 (2006): 224" and "Montgomery, Peter L. "Five, six, and seven-term Karatsuba-like formulae." IEEE Transactions on Computers 54 (2005): 362-369";
where $KA_{p_{-1}}, KA_{p_{-2}}, \ldots, KA_{p_{-k}}$ denote the Karatsuba algorithm modules whose numbers of terms are the 1st, 2nd, ..., k-th prime factors $p_{-1}, p_{-2}, \ldots, p_{-k}$ specified according to the usage requirement; see the same two references;
where a KA_pre block denotes the hardware that performs all the operations the data stream undergoes between the input and the multiplications in the Karatsuba algorithm;
and where a KA_post block denotes the hardware that performs all the operations the data stream undergoes between the multiplications and the output in the Karatsuba algorithm.
The central multiplier array comprises a plurality of integer multipliers, the number of which is determined by the structures of $KA_{p_1}, KA_{p_2}, \ldots, KA_{p_m}$ and $KA_{p_{-1}}, KA_{p_{-2}}, \ldots, KA_{p_{-k}}$: if the numbers of central multipliers corresponding to them are $l_1, l_2, \ldots, l_m$ and $l_{-1}, l_{-2}, \ldots, l_{-k}$ respectively, then the number of central multipliers is $\prod_{i=1}^{m} l_i \cdot \prod_{j=1}^{k} l_{-j}$.
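As a worked instance of this count, consider the 4-term configuration used as an example in this document, with $m = 2$, $k = 2$ and every level a two-term Karatsuba module $KA_2$. The only assumption added here is the well-known fact that $KA_2$ uses three multiplications, so that $l_1 = l_2 = l_{-1} = l_{-2} = 3$:

```latex
\prod_{i=1}^{2} l_i \cdot \prod_{j=1}^{2} l_{-j}
  = (3 \cdot 3) \cdot (3 \cdot 3)
  = 81
```

This agrees with the 81 DSPs quoted for $p_{-1} = p_{-2} = 2$, if each central multiplier is taken to occupy one DSP.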
The input sorting module is configured to execute the following input sorting algorithm:
[Input sorting algorithm listing: formula images BDA0003886478970000058 and BDA0003886478970000061 not reproduced]
where $a\_i_0, a\_i_1, \ldots$ denote the 1st, 2nd, ... binary integer input data in the first set of input data of the input sorting module, and $b\_i_0, b\_i_1, \ldots$ denote the 1st, 2nd, ... binary integer input data in the second set of input data;
$a\_o_{00}, a\_o_{01}, \ldots$ denote the 1st, 2nd, ... binary integer data in the first subgroup of the first set of output data of the input sorting module, $a\_o_{10}, a\_o_{11}, \ldots$ denote the 1st, 2nd, ... binary integer data in the second subgroup of the first set of output data, and so on up to the last subgroup;
$b\_o_{00}, b\_o_{01}, \ldots$ denote the 1st, 2nd, ... binary integer data in the first subgroup of the second set of output data of the input sorting module, $b\_o_{10}, b\_o_{11}, \ldots$ denote the 1st, 2nd, ... binary integer data in the second subgroup of the second set of output data, and so on up to the last subgroup.
[The upper index bounds and the data counts in these definitions are formula images in the original and are not reproduced.]
The output integration module is configured to execute the following output integration algorithm:
[Output integration algorithm listing: formula image BDA0003886478970000071 not reproduced]
where $c\_i_{00}, c\_i_{01}, \ldots$ denote the 1st, 2nd, ... binary integer data in the first group of input data of the output integration module, $c\_i_{10}, c\_i_{11}, \ldots$ denote the 1st, 2nd, ... binary integer data in the second group of input data, and so on up to the last group of input data;
and where $c\_o_0, c\_o_1, \ldots$ denote the 1st, 2nd, ... binary positive integer data in the output data of the output integration module.
[The upper index bounds and the data counts in these definitions are formula images in the original and are not reproduced.]
The input sorting module comprises a bit-slicing module and an input reordering module;
the bit-slicing module takes, from each number in the two groups of binary integer data, bits 0 to t-1, bits t to 2t-1, and so on from low to high, and forms each slice into a new integer, where t is an integer set according to the usage requirement; the new integers obtained from one initial datum are placed in one group, so that every initial datum yields one new array;
the input reordering module takes the 1st, 2nd, ... data out of all the new arrays derived from the first group of inputs and splices them into new 1st, 2nd, ... arrays, so that each new array collects the data at the same position of every source array; it then does the same for the new arrays derived from the second group of inputs.
[The number of slices per datum, the number of arrays and the number of data per array are formula images in the original and are not reproduced.]
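The two stages of the input sorting module can be sketched in Python as follows. This is a minimal illustration, not the patent's Algorithm 2: the names `bit_slice` and `input_sort`, the parameter `parts` (the number of slices per datum) and the assumption that every input fits in `t * parts` bits are all introduced here for the example.

```python
def bit_slice(value, t, parts):
    """Split a non-negative integer into `parts` slices of t bits, low to high."""
    mask = (1 << t) - 1
    return [(value >> (j * t)) & mask for j in range(parts)]


def input_sort(data, t, parts):
    """Slice every input, then regroup so that array j holds slice j of every input."""
    sliced = [bit_slice(v, t, parts) for v in data]            # one group per input datum
    return [[group[j] for group in sliced] for j in range(parts)]   # reordering step


# Two 16-bit inputs cut into two 8-bit slices each.
sorted_data = input_sort([0x1234, 0xABCD], t=8, parts=2)
print([[hex(v) for v in row] for row in sorted_data])
```

The reordering is a transposition: the slicing stage produces one array per input datum, and the reordering stage turns these into one array per slice position.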
The output integration module comprises an output reordering module, a shift module array and an addition array;
the output reordering module takes the 1st, 2nd, ... data out of all the input arrays and splices them into new 1st, 2nd, ... arrays, so that each new array collects the data at the same position of every input array;
the shift module array pads the 1st, 2nd, ... data of each reordered array with zeros in the high-order bits and then shifts them left in binary, by means of shift registers, by 0, t, ... bits respectively, to obtain new data;
the addition array adds all the shifted data in each array with adders to obtain one sum per array, and the sums of all the arrays are output as the output data of the addition array.
[The number of arrays, the number of data per array and the largest shift amount are formula images in the original and are not reproduced.]
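The reorder, shift and add stages can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patent's Algorithm 3: the function name `output_integrate` and the layout `arrays[j][i]` (coefficient j of the i-th product, with coefficient j weighted by $2^{jt}$) are hypothetical, and the slice products in the usage example are formed directly rather than by a Karatsuba post-processing block.

```python
def output_integrate(arrays, t):
    """arrays[j][i] is coefficient j of the i-th sliced product; return the integers."""
    count = len(arrays[0])
    # reordering: gather all coefficients that belong to the same product
    regrouped = [[arr[i] for arr in arrays] for i in range(count)]
    # shifting and adding: coefficient j is weighted by 2**(j*t)
    return [sum(c << (j * t) for j, c in enumerate(group)) for group in regrouped]


# Usage: rebuild 0x1234 * 0xABCD from 8-bit slices (t = 8).
a_lo, a_hi = 0x34, 0x12
b_lo, b_hi = 0xCD, 0xAB
coeffs = [[a_lo * b_lo], [a_lo * b_hi + a_hi * b_lo], [a_hi * b_hi]]
result = output_integrate(coeffs, t=8)
print(hex(result[0]))
```

Python integers are unbounded, so the zero padding that the hardware needs before shifting is implicit here.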
By adding a group of input sorting and output integration modules to the Karatsuba polynomial multiplication architecture, the invention allows the architecture to be extended both inward and outward, and provides a low-complexity, low-resource, high-bit-width polynomial multiplication method and device based on the Karatsuba architecture. The part outside the input sorting module and the output integration module comprises the Karatsuba preprocessing external blocks and the Karatsuba post-processing external blocks, which realize the function of the polynomial multiplication itself. The part between the input sorting module and the output integration module comprises the Karatsuba preprocessing internal blocks, the central multiplier array and the Karatsuba post-processing internal blocks; it extends the original Karatsuba structure inward in depth and optimizes the architecture further while preserving its function.
Furthermore, the invention also provides a key exchange acceleration method, and polynomial multiplication operations in the CSIDH key exchange process are all realized by the acceleration method of polynomial multiplication, wherein the number of multipliers is N, and N is the term number of the polynomial involved in the CSIDH key exchange process.
Correspondingly, the invention also provides a key exchange accelerating device, which comprises the accelerating device for polynomial multiplication.
Beneficial effects: the method and the device of the invention further simplify the high-bit-width polynomial multiplier, so that the multiplication complexity of N-term polynomial multiplication is further reduced, and the ratio of the hardware area to that of the traditional polynomial multiplication algorithm can be smaller than a bound that the original gives as a formula image (BDA0003886478970000091, not reproduced).
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic diagram of a hardware architecture for Karatsuba polynomial multiplication.
FIG. 2 is a schematic diagram of the low complexity, low resource, high bit width polynomial Karatsuba multiplication architecture of the present invention.
Fig. 3 is a schematic circuit diagram of an input sorting module.
Fig. 4 is a circuit diagram of an output integration module.
Detailed Description
The invention is designed on the basis of the Karatsuba structure with $p_1 p_2 \cdots p_m$ terms, where m is the order of the external Karatsuba structure, $p_1 p_2 \cdots p_m$ is the number of terms of the input polynomials of the overall structure, and k is the order of the internal Karatsuba structure. The minus sign in the subscripts of $p_{-1}, p_{-2}, \ldots, p_{-k}$ distinguishes them from the other subscripts and also indicates that the corresponding KA_pre and KA_post functions are used in the internal Karatsuba architecture. Adding a group of input sorting and output integration modules to this Karatsuba polynomial multiplication structure allows it to be extended both inward and outward, which gives the improved low-complexity, low-resource, high-bit-width Karatsuba polynomial multiplication structure shown in FIG. 2.
The overall architecture of FIG. 2 is similar to that of FIG. 1 but differs in some details. The two red dotted lines in FIG. 2 are the input sorting module and the output integration module designed by the invention. The blue modules outside the red lines are the external blocks of the architecture, namely, in sequence, the KA_pre and KA_post modules of $KA_{p_1}, KA_{p_2}, \ldots, KA_{p_m}$; the yellow modules between the two red lines are the internal blocks, namely, in sequence from the red lines inward, the KA_pre and KA_post modules of $KA_{p_{-1}}, KA_{p_{-2}}, \ldots, KA_{p_{-k}}$. The polynomial multiplication realized by the external architecture is the function of the whole architecture, while the internal architecture extends the original Karatsuba architecture in depth and optimizes it further on the basis of the external architecture. In the middle is a row of central multipliers, whose number is determined by the structures of $KA_{p_1}, \ldots, KA_{p_m}$ and $KA_{p_{-1}}, \ldots, KA_{p_{-k}}$: if the numbers of central multipliers corresponding to them are $l_1, l_2, \ldots, l_m$ and $l_{-1}, l_{-2}, \ldots, l_{-k}$ respectively, then the number of central multipliers in FIG. 2 is $\prod_{i=1}^{m} l_i \cdot \prod_{j=1}^{k} l_{-j}$. The subscripts of KA, KA_pre and KA_post denote the number of terms of the corresponding layer of the Karatsuba polynomial multiplication architecture.
The input sorting algorithm and the output integration algorithm are shown as algorithm two and algorithm three, and the input sorting module circuit schematic diagram and the output integration module circuit schematic diagram are shown as fig. 3 and 4. A new parameter t exists in the second algorithm and the third algorithm, and the requirement is met
Figure BDA0003886478970000116
And is minimized as much as possible.
And (3) algorithm II: inputting a sorting algorithm:
Figure BDA0003886478970000117
Figure BDA0003886478970000121
and (3) algorithm III: and (3) outputting an integration algorithm:
Figure BDA0003886478970000122
represents an integer
Figure BDA0003886478970000123
From jt-1 bit to (j-1) t bit of a slice in binary representation, the subscripts for numbers a _ i and b _ i have only one number, and the subscripts for a _ o and b _ o have two numbers, all for distinction only)
Figure BDA0003886478970000124
Figure BDA0003886478970000131
The subscript of the coefficient c_o carries a single index, whereas the subscript of c_i carries two indices; this is for distinction only.
Algorithm 2 and Fig. 3 show that the input sorting module comprises a set of functional blocks that cut the input data into bit fields and a set of circuits that reorder and combine the resulting data sequence. Algorithm 3 and Fig. 4 show that the output integration module comprises a set of circuits that rearrange and combine the input data sequence, several shift module arrays, and a set of addition arrays (the row of trapezoidal blocks in Fig. 4). The input sorting module and the output integration module play two roles in the circuit. The first is to convert the length of the coefficient vector, from the input/output vector length of the central multipliers of the
Figure BDA0003886478970000132
-term Karatsuba structure to the length, in the
Figure BDA0003886478970000133
Karatsuba architecture, of the data vector passed between the m-th and the (m+1)-th layer of pre-processing or post-processing, counted from outside to inside. The second is to reduce the bit width of each value in transit while increasing the number of terms, so that the Karatsuba architecture can be extended in both directions and optimized further.
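The trade of bit width for number of terms can be sketched in software. This is a minimal illustration, assuming 64-bit operands and t = 16; the exact reordering of Algorithm 2 and Algorithm 3 is given only in the figures and is not reproduced here. The helper names `split_limbs`, `limb_products` and `shift_and_add` are hypothetical.

```python
def split_limbs(value, t, count):
    """Cut a non-negative integer into `count` fields of t bits, lowest field first."""
    mask = (1 << t) - 1
    return [(value >> (j * t)) & mask for j in range(count)]


def limb_products(a_limbs, b_limbs):
    """Schoolbook product of two limb vectors.

    Stand-in for the inner Karatsuba layers and the central multipliers:
    each multiplication now has only t-bit operands.
    """
    out = [0] * (len(a_limbs) + len(b_limbs) - 1)
    for i, x in enumerate(a_limbs):
        for j, y in enumerate(b_limbs):
            out[i + j] += x * y
    return out


def shift_and_add(values, t):
    """Shift the j-th value left by j * t bits and add all shifted values."""
    return sum(v << (j * t) for j, v in enumerate(values))


# One 64-bit by 64-bit product computed from 16-bit by 16-bit products.
x = 0xFEDCBA9876543210
y = 0x0123456789ABCDEF
x_limbs = split_limbs(x, 16, 4)   # four 16-bit terms instead of one 64-bit value
y_limbs = split_limbs(y, 16, 4)
product = shift_and_add(limb_products(x_limbs, y_limbs), 16)
```

In this sketch the shifts by 0, t, 2t, ... followed by one addition play the part described for the shift module arrays and the addition arrays, and `product` equals `x * y`.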
Take a 4-term polynomial multiplication unit as an example (N = 4, so according to
Figure BDA0003886478970000134
we take m = 2 and p_1 = p_2 = ... = p_m = 2), with the polynomial coefficient width set to 64 bits. Table 1 compares the FPGA resource/area usage of three units: a multiplication unit using conventional polynomial multiplication, a multiplication unit using conventional Karatsuba polynomial multiplication, and the low-complexity, low-resource, high-bit-width polynomial multiplication unit based on the Karatsuba architecture designed in this scheme (k = 2, t = 16, p_{-1} = p_{-2} = ... = p_{-k} = 2).
TABLE 1
Figure BDA0003886478970000135
In this embodiment, the EDA (electronic design automation) platform used for simulation, synthesis and implementation is Vivado 2021.1, and the selected FPGA model is Xilinx Virtex-7 xc7vx690tffg1157-3. In the data above, #Slices and #DSPs are obtained directly after synthesis and implementation, whereas #SEC is a calculated value that represents hardware resource consumption or area. It is computed as:
#SEC=#BRAMs×100+#DSPs×100+#Slices
where #BRAMs is taken as 0, since none of the three multipliers uses BRAM. In theory, the lower limit of the ratio of the hardware area of Karatsuba polynomial multiplication to that of the conventional polynomial multiplication algorithm is
Figure BDA0003886478970000141
In the above example this limit value is
Figure BDA0003886478970000142
Table 1 shows that the conventional Karatsuba method is slightly above this limit, whereas the present scheme is below it.
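The #SEC metric and the area ratio it is used for can be written out as follows. This is a minimal sketch; the function names `sec` and `area_ratio` are hypothetical, and the numbers used with them below are illustrative placeholders, not the values of Table 1.

```python
def sec(slices, dsps, brams=0):
    """Area metric from the description: #SEC = #BRAMs*100 + #DSPs*100 + #Slices."""
    return brams * 100 + dsps * 100 + slices


def area_ratio(candidate_sec, baseline_sec):
    """Ratio of a design's #SEC to the #SEC of the conventional multiplier."""
    return candidate_sec / baseline_sec


# Placeholder resource counts, chosen only to exercise the formula.
baseline = sec(slices=2000, dsps=60)
candidate = sec(slices=1500, dsps=25)
ratio = area_ratio(candidate, baseline)
```

With these placeholder counts the baseline evaluates to 8000 and the candidate to 4000, so the ratio is 0.5. A ratio computed this way from the measured counts is what gets compared against the theoretical limit.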
The embodiment also provides a CSIDH key exchange acceleration method, which includes: the polynomial multiplication operation in the CSIDH key exchange process is realized by the polynomial multiplication acceleration method.
Further, the number of multipliers is N, where N is the number of terms of the polynomial involved in the CSIDH key exchange process, and in an operation environment of 64-bit integers, N is 8 in the CSIDH key exchange process using the CSIDH512 parameter set, N is 16 in the CSIDH key exchange process using the CSIDH1024 parameter set, and N is 32 in the CSIDH key exchange process using the CSIDH2048 parameter set.
The CSIDH key exchange process may involve many polynomial multiplications. Each of them is carried out in the same way, and the number of multipliers N is the number of polynomial terms that corresponds to the CSIDH parameter set in use.
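The three values of N stated above are consistent with dividing the nominal size of the parameter set by the 64-bit word size. The sketch below assumes that relationship; the description only lists the three values, and the function name `csidh_terms` is hypothetical.

```python
def csidh_terms(parameter_bits, word_bits=64):
    """Number of polynomial terms N, assuming N = ceil(parameter_bits / word_bits)."""
    if parameter_bits <= 0 or word_bits <= 0:
        raise ValueError("sizes must be positive")
    return -(-parameter_bits // word_bits)  # ceiling division on integers


# Nominal sizes of the CSIDH512, CSIDH1024 and CSIDH2048 parameter sets.
terms = {bits: csidh_terms(bits) for bits in (512, 1024, 2048)}
```

For the three parameter sets this gives N = 8, 16 and 32, which matches the values given in this embodiment.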
Correspondingly, the embodiment of the invention also provides a CSIDH encryption and decryption acceleration device, which comprises the acceleration device for polynomial multiplication.
The CSIDH key exchange acceleration method and apparatus provided in this embodiment can improve the efficiency of the CSIDH key exchange process on the basis of reducing the resource consumption of the FPGA hardware implementation of the CSIDH.
In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit. The computer storage medium can store a computer program which, when executed by the data processing unit, carries out some or all of the steps of the polynomial multiplication acceleration method and of each embodiment provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It is clear to those skilled in the art that the technical solutions in the embodiments of the present invention can be implemented by means of a computer program and its corresponding general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be essentially or partially implemented in the form of a computer program or a software product, where the computer program or the software product may be stored in a storage medium and include instructions for enabling a device (which may be a personal computer, a server, a single chip microcomputer, an MUU, or a network device) including a data processing unit to execute the method according to the embodiments or some parts of the embodiments of the present invention.
The present invention provides a method and an apparatus for accelerating polynomial multiplication, and there are many methods and approaches for implementing this technical solution. The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art may make a number of improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention. Any component not specified in this embodiment can be realized by the prior art.

Claims (10)

1. A method for accelerating polynomial multiplication, comprising:
inputting two sets of polynomial coefficients, the number of terms in each set of polynomial coefficients being
Figure FDA0003886478960000011
where p_1, p_2, ..., p_m are respectively the 1st, 2nd, ..., m-th prime factors (repetition allowed) of the number of terms;
processing the two sets of polynomial coefficients according to all the operations that the data stream undergoes before reaching the multiplication in the Karatsuba algorithm whose number of terms is
Figure FDA0003886478960000012
to obtain two sets of externally preprocessed data;
performing bit-field extraction and reordering on the two sets of externally preprocessed data respectively to obtain sorted data;
processing the sorted data according to all the operations that the data stream undergoes before reaching the multiplication in the Karatsuba algorithm whose number of terms is
Figure FDA0003886478960000013
to obtain two sets of internally preprocessed data, where p_{-1}, p_{-2}, ..., p_{-k} are respectively the 1st, 2nd, ..., k-th prime factors specified according to usage requirements;
multiplying corresponding data in the two sets of internally preprocessed data to obtain a set of preliminary product data;
processing the preliminary product data according to all the operations that the data stream undergoes after the multiplication in the Karatsuba algorithm whose number of terms is
Figure FDA0003886478960000014
to obtain internally post-processed data;
reordering, shifting and adding the internally post-processed data to obtain integrated data;
processing the integrated data according to all the operations that the data stream undergoes after the multiplication in the Karatsuba algorithm whose number of terms is
Figure FDA0003886478960000015
to obtain the final output data, which has
Figure FDA0003886478960000016
terms and constitutes the polynomial coefficients of the product.
2. An accelerating device for polynomial multiplication, characterized by comprising m preprocessing external chunks, an input sorting module, k preprocessing internal chunks, a group of central multiplier arrays, k post-processing internal chunks, an output integration module and m post-processing external chunks, wherein m and k are positive integers;
the m preprocessing external chunks are configured to receive two sets of polynomial coefficients, the number of terms in each set of polynomial coefficients being
Figure FDA0003886478960000017
where p_1, p_2, ..., p_m are respectively the 1st, 2nd, ..., m-th prime factors (repetition allowed) of the number of terms, and then to process the two sets of polynomial coefficients according to all the operations that the data stream undergoes before reaching the multiplication in the Karatsuba algorithm whose number of terms is
Figure FDA0003886478960000018
to obtain two sets of externally preprocessed data;
the input sorting module is configured to perform bit-field extraction and reordering on the two sets of externally preprocessed data respectively to obtain sorted data;
the k preprocessing internal chunks are configured to process the sorted data according to all the operations that the data stream undergoes before reaching the multiplication in the Karatsuba algorithm whose number of terms is
Figure FDA0003886478960000019
to obtain two sets of internally preprocessed data, where p_{-1}, p_{-2}, ..., p_{-k} are respectively the 1st, 2nd, ..., k-th prime factors specified according to usage requirements;
the group of central multiplier arrays is configured to multiply corresponding data in the two sets of internally preprocessed data to obtain a set of preliminary product data;
the k post-processing internal chunks are configured to process the preliminary product data according to all the operations that the data stream undergoes after the multiplication in the Karatsuba algorithm whose number of terms is
Figure FDA0003886478960000021
to obtain internally post-processed data;
the output integration module is configured to reorder, shift and add the internally post-processed data to obtain integrated data;
the m post-processing external chunks are configured to process the integrated data according to all the operations that the data stream undergoes after the multiplication in the Karatsuba algorithm whose number of terms is
Figure FDA0003886478960000022
to obtain the final output data, which has
Figure FDA0003886478960000023
terms and constitutes the polynomial coefficients of the product.
3. The apparatus of claim 2, wherein the m preprocessing external chunks are respectively the KA_pre modules of the
Figure FDA0003886478960000024
Figure FDA0003886478960000025
Karatsuba algorithm modules;
the k preprocessing internal chunks are respectively the KA_pre modules of the
Figure FDA0003886478960000026
Karatsuba algorithm modules;
the k post-processing internal chunks are respectively the KA_post modules of the
Figure FDA0003886478960000027
Karatsuba algorithm modules;
the m post-processing external chunks are respectively the KA_post modules of the
Figure FDA0003886478960000028
Karatsuba algorithm modules;
wherein
Figure FDA0003886478960000029
respectively denote the Karatsuba algorithm modules whose numbers of terms are p_1, p_2, ..., p_m;
wherein
Figure FDA00038864789600000210
respectively denote the Karatsuba algorithm modules whose numbers of terms are p_{-1}, p_{-2}, ..., p_{-k}, corresponding to the 1st, 2nd, ..., k-th prime factors p_{-1}, p_{-2}, ..., p_{-k};
wherein a KA_pre module denotes the hardware that performs all the operations that the data stream undergoes in the Karatsuba algorithm from the input up to the multiplications;
wherein a KA_post module denotes the hardware that performs all the operations that the data stream undergoes in the Karatsuba algorithm from the multiplications to the output.
4. The apparatus of claim 3, wherein the central multiplier array comprises a plurality of integer multipliers, the number of multipliers being determined by the structures of
Figure FDA00038864789600000211
and
Figure FDA00038864789600000212
specifically, if the numbers of central multipliers corresponding to
Figure FDA0003886478960000031
and
Figure FDA0003886478960000032
are respectively l_1, l_2, ..., l_m and l_{-1}, l_{-2}, ..., l_{-k}, then the central multiplier array contains
Figure FDA0003886478960000033
central multipliers.
5. The apparatus of claim 4, wherein the input sorting module is configured to execute the following input sorting algorithm:
Input:
Figure FDA0003886478960000034
Figure FDA0003886478960000035
Figure FDA0003886478960000036
……
Figure FDA0003886478960000037
Figure FDA0003886478960000038
Figure FDA0003886478960000039
Figure FDA00038864789600000310
……
Figure FDA00038864789600000311
Figure FDA00038864789600000312
Output:
Figure FDA00038864789600000313
Figure FDA0003886478960000041
Figure FDA0003886478960000042
wherein
Figure FDA0003886478960000043
denote, in the first group of input data of the input sorting module, a group whose size is
Figure FDA0003886478960000044
the input binary integers numbered 1, 2, ..., up to
Figure FDA0003886478960000045
inclusive, and
Figure FDA0003886478960000046
denote, in the second group of input data of the input sorting module, a group whose size is
Figure FDA0003886478960000047
the input binary integers numbered 1, 2, ..., up to
Figure FDA0003886478960000048
inclusive;
Figure FDA0003886478960000049
denote, in the first subgroup of the first group of output data of the input sorting module, the binary integers numbered 1, 2, ..., up to
Figure FDA00038864789600000410
inclusive,
Figure FDA00038864789600000411
denote likewise the binary integers numbered 1, 2, ..., up to
Figure FDA00038864789600000412
inclusive, and so on, until
Figure FDA00038864789600000413
denote, in the first group of output data of the input sorting module, within subgroup number
Figure FDA00038864789600000414
the binary integers numbered 1, 2, ..., up to
Figure FDA00038864789600000415
inclusive;
Figure FDA00038864789600000416
denote, in the first subgroup of the second group of output data of the input sorting module, the binary integers numbered 1, 2, ..., up to
Figure FDA00038864789600000417
inclusive,
Figure FDA00038864789600000418
denote likewise the binary integers numbered 1, 2, ..., up to
Figure FDA00038864789600000419
inclusive, and so on, until
Figure FDA00038864789600000420
denote, in the second group of output data of the input sorting module, within subgroup number
Figure FDA00038864789600000421
the binary integers numbered 1, 2, ..., up to
Figure FDA00038864789600000422
inclusive.
6. The apparatus of claim 5, wherein the output integration module is configured to execute the following output integration algorithm:
Input:
Figure FDA00038864789600000423
Figure FDA00038864789600000424
Figure FDA00038864789600000425
Figure FDA00038864789600000426
……
Figure FDA0003886478960000051
Output:
Figure FDA0003886478960000052
wherein
Figure FDA0003886478960000053
denote, in the first group of input data of the output integration module, the binary integers numbered 1, 2, ..., up to
Figure FDA0003886478960000054
inclusive,
Figure FDA0003886478960000055
denote, in the second group of input data of the output integration module, the binary integers numbered 1, 2, ..., up to
Figure FDA0003886478960000056
inclusive, and so on, until
Figure FDA0003886478960000057
denote, in the input data of the output integration module, within group number
Figure FDA0003886478960000058
the binary integers numbered 1, 2, ..., up to
Figure FDA0003886478960000059
inclusive;
wherein
Figure FDA00038864789600000510
denote, in the output data of the output integration module, the binary positive integers numbered 1, 2, ..., up to
Figure FDA00038864789600000511
inclusive.
7. The apparatus of claim 6, wherein the input sorting module comprises a sorting module and an input reordering module;
the sorting module takes, from each number in the two groups of
Figure FDA00038864789600000512
binary integers, counting from low to high, bits 0 to t-1, bits t to 2t-1, ..., and bits
Figure FDA00038864789600000513
to
Figure FDA00038864789600000514
and forms a new integer from each extracted bit field, where t is an integer set according to usage requirements; the
Figure FDA00038864789600000515
new integers obtained from each initial datum are placed in one group, forming
Figure FDA00038864789600000516
new arrays;
the input reordering module takes, among the
Figure FDA00038864789600000517
new arrays, from the first
Figure FDA00038864789600000518
arrays, all of the data numbered 1, 2, ..., up to
Figure FDA00038864789600000519
and splices them respectively into new arrays numbered 1, 2, ..., up to
Figure FDA00038864789600000520
whose data count is
Figure FDA00038864789600000521
each, and takes, among the
Figure FDA00038864789600000522
new arrays, from the last
Figure FDA00038864789600000523
arrays, all of the data numbered 1, 2, ..., up to
Figure FDA00038864789600000524
and splices them into
Figure FDA00038864789600000525
arrays whose data count is
Figure FDA00038864789600000526
each.
8. The apparatus of claim 7, wherein the output integration module comprises an output reordering module, a shift module array and an addition array;
the output reordering module takes, from the
Figure FDA00038864789600000527
arrays whose data count is
Figure FDA00038864789600000528
each, the data numbered 1, 2, ..., up to
Figure FDA0003886478960000061
and splices them into new arrays numbered 1, 2, ..., up to
Figure FDA0003886478960000062
whose data count is
Figure FDA0003886478960000063
each.
9. The apparatus of claim 8, wherein the shift module array takes, in each reordered array, the data numbered 1, 2, ..., up to
Figure FDA0003886478960000064
and, with zero-padding of the high-order bits, shifts them left by 0, t, ..., up to
Figure FDA0003886478960000065
bits respectively to obtain new data.
10. The apparatus of claim 9, wherein the addition array adds together all
Figure FDA0003886478960000066
shifted data in each array to obtain one sum, and provides the resulting
Figure FDA0003886478960000067
sums over all arrays as the output data of the addition array.
CN202211245657.5A 2022-10-12 2022-10-12 Polynomial multiplication accelerating method and device Pending CN115587274A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211245657.5A CN115587274A (en) 2022-10-12 2022-10-12 Polynomial multiplication accelerating method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211245657.5A CN115587274A (en) 2022-10-12 2022-10-12 Polynomial multiplication accelerating method and device

Publications (1)

Publication Number Publication Date
CN115587274A true CN115587274A (en) 2023-01-10

Family

ID=84780700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211245657.5A Pending CN115587274A (en) 2022-10-12 2022-10-12 Polynomial multiplication accelerating method and device

Country Status (1)

Country Link
CN (1) CN115587274A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination