CN107665109B - Montgomery modular multiplication calculation method suitable for embedded system

Info

Publication number
CN107665109B
Authority
CN
China
Prior art keywords
montgomery
equal
multiplication
calculation
calculated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610609265.0A
Other languages
Chinese (zh)
Other versions
CN107665109A (en)
Inventor
曾学文
李杨
叶晓舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Intellix Technologies Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Intellix Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS and Beijing Intellix Technologies Co Ltd
Priority to CN201610609265.0A
Publication of CN107665109A
Application granted
Publication of CN107665109B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/60 Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers
    • G06F7/72 Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers using residue arithmetic
    • G06F7/722 Modular multiplication

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a Montgomery modular multiplication calculation method suitable for an embedded system. The method comprises multi-precision multiplication and Montgomery reduction. Both parts are calculated in a hybrid scanning mode, in which operand scanning is used in the inner loop and product scanning is used in the outer loop. The multi-precision multiplication and the Montgomery reduction are combined in a coarse-grained integration mode, that is, the two parts are calculated alternately. The method reduces the number of memory accesses in the embedded system and improves the implementation efficiency of the Montgomery modular multiplication algorithm.

Description

Montgomery modular multiplication calculation method suitable for embedded system
Technical Field
The invention relates to the field of public key cryptosystems, in particular to a Montgomery modular multiplication computing method suitable for an embedded system.
Background
In 1985, Montgomery proposed the Montgomery modular multiplication algorithm, which is currently the most widely used modular multiplication algorithm. The basic idea is to replace time-consuming inversion and division operations with simple, fast addition and shift operations. To calculate the modular product P = A × B mod N (A, B and N are n-bit binary large integers and 0 < A, B < N), the algorithm first selects an integer R that is relatively prime to N (typically R = 2^n) and converts the reduction modulo N into a reduction modulo R. The Montgomery product is defined as MonPro(A, B) = A × B × R^(-1) mod N.
The calculation structure of the Montgomery modular multiplication algorithm is shown in FIG. 1, and the method specifically comprises the following three steps:
(1) calculating the N-residues of A and B: A' = A*R mod N = A*R^2*R^(-1) mod N = MonPro(A, R^2),
B' = B*R mod N = B*R^2*R^(-1) mod N = MonPro(B, R^2);
(2) calculating the Montgomery product of A' and B': P' = A'*B'*R^(-1) mod N = MonPro(A', B');
(3) converting P' into the modular product P: P = A*B mod N = A'*R^(-1)*B'*R^(-1) mod N
= P'*R^(-1) mod N = MonPro(P', 1).
Therefore, the core of the Montgomery modular multiplication algorithm is the calculation of the Montgomery product MonPro(A', B'), whose algorithm flow is shown in FIG. 2. As FIG. 2 shows, the Montgomery product calculation can be divided into two key steps: computing the multi-precision multiplication T ← A × B, and computing the Montgomery reduction P ← (T + M × N)/R.
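The three conversion steps and the two-step Montgomery product can be illustrated with a minimal Python sketch. It works on whole integers rather than word arrays, assumes N is odd and R = 2^r_bits > N, and relies on Python 3.8+ for the modular inverse via pow; the function names mon_pro and mod_mul are hypothetical and are not taken from the patent.

```python
def mon_pro(a, b, n, r_bits):
    """Montgomery product a*b*R^(-1) mod n with R = 2**r_bits (n odd, n < R)."""
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r      # N' = -N^(-1) mod R
    t = a * b                           # multi-precision multiplication T <- A x B
    m = (t * n_prime) % r               # M = T * N' mod R
    p = (t + m * n) >> r_bits           # Montgomery reduction: exact division by R
    return p - n if p >= n else p       # final conditional subtraction


def mod_mul(a, b, n, r_bits):
    """Compute a*b mod n through the three steps of the Montgomery method."""
    r2 = pow(2, 2 * r_bits, n)          # R^2 mod N, normally precomputed
    a_res = mon_pro(a, r2, n, r_bits)   # step (1): A' = A*R mod N
    b_res = mon_pro(b, r2, n, r_bits)   #           B' = B*R mod N
    p_res = mon_pro(a_res, b_res, n, r_bits)  # step (2): P' = MonPro(A', B')
    return mon_pro(p_res, 1, n, r_bits)       # step (3): P = MonPro(P', 1)
```

For instance, mod_mul(7, 9, 13, 4) uses R = 16 and returns the same value as (7 * 9) % 13. In practice the conversions in steps (1) and (3) are amortized over a long chain of Montgomery products, for example in modular exponentiation.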
In 1990, Dusse proposed a Montgomery modular multiplication algorithm using radix-r numbers, in which n0' = -n0^(-1) mod r replaces N', with corresponding improvements to the algorithm. In 1996, Koc analyzed and summarized various implementation methods of the Montgomery modular multiplication algorithm and identified five main improved Montgomery algorithms: SOS, CIOS, FIOS, FIPS, and CIHS. The last two letters, OS/PS/HS, denote the scanning mode used for the multiplication: OS stands for operand scanning, PS for product scanning, and HS for hybrid scanning. The leading letters, S/CI/FI, denote the integration mode used by the multi-precision multiplication and the Montgomery reduction: S stands for the separated mode, in which one part is calculated completely before the other; CI stands for the coarse-grained integration mode, in which the two parts alternate at coarse granularity; and FI stands for the fine-grained integration mode, in which the two parts alternate at fine granularity. These algorithms are implemented with a series of operations: multiplication (mul), addition (add), load, store, and so on. High-performance implementations focus primarily on optimizing these operations. In an embedded system the number of available registers is limited, so memory operations such as load/store are very important. In the methods in common use the number of registers is fixed, and the methods generally fall into two groups: those using 5 registers and those using n+4 registers (n being the number of words of the modulus N). The methods using fewer registers (5) require more memory access operations, while the methods using more registers (n+4) require fewer memory access operations but are often unusable because the number of registers needed exceeds the number of registers available on the processor.
Therefore, a problem that currently needs to be solved is how to use registers dynamically according to the number of available registers of the processor and the size of n, so that the registers are fully utilized and the number of memory access operations is reduced.
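The word-level constant n0' mentioned in the background depends only on the least significant word of N. The following minimal Python sketch assumes a radix of 2^w with w = 32 and Python 3.8+ (for pow with a negative exponent); the function name word_inverse and the example word are illustrative assumptions.

```python
def word_inverse(n0, w=32):
    """Return n0' = -n0^(-1) mod 2^w for an odd word n0."""
    radix = 1 << w
    return (-pow(n0, -1, radix)) % radix


def clear_low_word(t0, n0, w=32):
    """Return m such that (t0 + m * n0) is divisible by 2^w."""
    return (t0 * word_inverse(n0, w)) % (1 << w)


n0 = 0xFFFFFFED                   # example odd low word (an assumption)
m = clear_low_word(0x12345678, n0)
# Adding m * n0 cancels the low word, which is what makes the shift exact.
low_word = (0x12345678 + m * n0) % (1 << 32)
```

Because n0 * n0' is congruent to -1 modulo 2^w, low_word is zero for any starting word, so a single-word constant is enough to drive a word-by-word reduction.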
Disclosure of Invention
The invention aims to overcome the defect that existing Montgomery modular multiplication algorithms are not well suited to embedded systems with limited resources. To improve the calculation efficiency of the Montgomery modular multiplication algorithm, it provides a method for calculating Montgomery modular multiplication using coarse-grained integrated hybrid scanning.
In order to achieve the above object, the present invention provides a Montgomery modular multiplication calculation method suitable for an embedded system, the method comprising multi-precision multiplication and Montgomery reduction. Both parts are calculated in a hybrid scanning mode, in which operand scanning is used in the inner loop and product scanning is used in the outer loop. The multi-precision multiplication and the Montgomery reduction use a coarse-grained integration mode, that is, the two parts are calculated alternately.
In the above technical solution, the method specifically includes:
step 1) setting the large number N as an m-bit prime number and the word length of the processor as W bits, so that the number of words of N is n = ⌈m/W⌉; A and B are two residues modulo N, i.e., 0 < A, B < N; the Montgomery coefficient is R = 2^(nW); n0' = -n0^(-1) mod 2^W, where n0 is the least significant word of N; selecting d, where d is the word size of the inner loop; the size of the outer loop is r = ⌈n/d⌉, where ⌈x⌉ denotes the smallest integer greater than or equal to x;
step 2) the modular multiplication of A and B is calculated as: C = A*B, M = C*n0' mod R, C = (C + M*N)/R; every d words of the operands A, B, M, N and C are taken as a whole:
E=(E[r-1],…,E[0])=({A[n-1],A[n-2],…,A[n-d]},…,{A[d-1],…,A[1],A[0]})
F=(F[r-1],…,F[0])=({B[n-1],B[n-2],…,B[n-d]},…,{B[d-1],…,B[1],B[0]})
G=(G[r-1],…,G[0])=({M[n-1],M[n-2],…,M[n-d]},…,{M[d-1],…,M[1],M[0]})
H=(H[r-1],…,H[0])=({N[n-1],N[n-2],…,N[n-d]},…,{N[d-1],…,N[1],N[0]})
Q=(Q[2r-1],Q[2r-2],…,Q[1],Q[0])=({C[2n-1],C[2n-2],…,C[2n-d]},…,{C[d-1],…,C[1],C[0]})
sequentially calculating all partial products of column q, for 0 ≤ q ≤ 2r-1:
E[k]*F[l]+G[k]*H[l]=(Q[q+1],Q[q]),
where k + l = q; when all columns have been calculated, C is obtained;
step 3) determining whether C is greater than or equal to N; if so, setting C = C - N; in either case, going to step 4);
step 4) outputting the Montgomery product C of A and B.
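The parameter choices of step 1) and the grouping of every d words into a block can be sketched as follows. This is an illustrative Python fragment only; the names parameters and split_blocks are assumptions, and the word lists are little-endian (index 0 is the least significant word).

```python
def parameters(m_bits, w_bits, d):
    """Return (n, r): words in N and number of d-word blocks (outer loop size)."""
    n = -(-m_bits // w_bits)      # n = ceil(m / W), integer arithmetic only
    r = -(-n // d)                # r = ceil(n / d)
    return n, r


def split_blocks(words, d):
    """Group a little-endian word list into blocks of d words.

    The last block is shorter when d does not divide the number of words.
    """
    return [words[i:i + d] for i in range(0, len(words), d)]


# A 256-bit modulus on a 32-bit processor, with d = 4 and d = 3.
even_case = parameters(256, 32, 4)
uneven_case = parameters(256, 32, 3)
blocks = split_blocks(["A[%d]" % i for i in range(8)], 3)
```

With d = 4 the eight words form two complete blocks; with d = 3 they form three blocks, the last of which holds only A[6] and A[7].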
In the above technical solution, the step 2) specifically includes:
step 2-1) setting q = 0;
step 2-2) the set of all (k, l) satisfying k + l = q is denoted as A: A = {k, l | k + l = q};
step 2-3) calculating (Q[q+1],Q[q]) = Σ_{(k,l)∈A} E[k]*F[l],
where
E[k]*F[l]=(A[kd+3],A[kd+2],A[kd+1],A[kd])*(B[ld+3],B[ld+2],B[ld+1],B[ld]);
step 2-4) determining whether q < r holds; if so, calculating G[q] = Q[q]*n0'; otherwise, going to step 2-5);
step 2-5) calculating (Q[q+1],Q[q]) = (Q[q+1],Q[q]) + Σ_{(k,l)∈A} G[k]*H[l];
step 2-6) setting q = q + 1; if q ≤ 2r-2, setting k = k + 1 and returning to step 2-2); otherwise, going to step 2-7);
step 2-7) calculating C = C/R; since R = 2^(nW):
C=(C[2n-1],C[2n-2],…,C[n+1],C[n]).
Compared with the prior art, the invention has the following technical advantages:
The hybrid scanning idea is applied to Montgomery modular multiplication using a coarse-grained integration mode, and operands are used efficiently by dynamically selecting d, which reduces the number of memory accesses in the embedded system and improves the implementation efficiency of the Montgomery modular multiplication algorithm.
Drawings
FIG. 1 is a schematic diagram of a prior Montgomery modular multiplication calculation structure;
FIG. 2 is a flow chart of a prior art Montgomery modular multiplication to compute a Montgomery product;
FIG. 3 is a schematic diagram of the modular multiplication computation method of the present invention;
FIG. 4 is a block diagram of the Montgomery modular multiplication method CIPOHS-a (n = 8, d = 4) using coarse-grained integrated product and operand hybrid scanning according to the present invention;
FIG. 5 is a block diagram of the Montgomery modular multiplication method CIPOHS-b (n = 8, d = 3) using coarse-grained integrated product and operand hybrid scanning according to the present invention;
FIG. 6 is a schematic diagram of block product scanning in the method of the present invention.
Detailed Description
The method of the present invention is described in further detail below with reference to the figures and specific examples.
A Montgomery modular multiplication computation method suitable for an embedded system, the method comprising:
step 1) setting the large number N as an m-bit prime number and the word length of the processor as W bits, so that the number of words of N is n = ⌈m/W⌉; A and B are two residues modulo N, i.e., 0 < A, B < N; the Montgomery coefficient is R = 2^(nW); n0' = -n0^(-1) mod 2^W, where n0 is the least significant word of N; selecting d, where d is the word size of the inner loop (which uses operand scanning); the size of the outer loop (which uses product scanning) is r = ⌈n/d⌉, where ⌈x⌉ denotes the smallest integer greater than or equal to x;
step 2) calculating a modular multiplication result C of the A and the B, wherein the calculation process comprises the following steps:
1)C=A*B;
2)M=C*n0’mod R;
3)C=(C+M*N)/R;
As shown in FIG. 3, let A and B denote two m-bit multi-precision integers: A = (A[n-1], …, A[2], A[1], A[0]), B = (B[n-1], …, B[2], B[1], B[0]). The product C = A*B can be expressed as C = (C[2n-1], …, C[2], C[1], C[0]).
Every d words of the operands A, B, M, N and C are taken as a whole. In this embodiment n = 8 and d = 4, and the operands are represented as follows:
E=(E[1],E[0])=({A[7],A[6],A[5],A[4]}{A[3],A[2],A[1],A[0]})
F=(F[1],F[0])=({B[7],B[6],B[5],B[4]}{B[3],B[2],B[1],B[0]})
G=(G[1],G[0])=({M[7],M[6],M[5],M[4]}{M[3],M[2],M[1],M[0]})
H=(H[1],H[0])=({N[7],N[6],N[5],N[4]}{N[3],N[2],N[1],N[0]})
Q=(Q[3],Q[2],Q[1],Q[0])=({C[15],C[14],C[13],C[12]}{C[11],C[10],C[9],C[8]}
{C[7],C[6],C[5],C[4]}{C[3],C[2],C[1],C[0]})
Then the calculation of C = A*B can be converted into the calculation of (Q[3],Q[2],Q[1],Q[0]) = (E[1],E[0])*(F[1],F[0]).
The calculation of M = C*n0' mod R can be converted into the calculation of (G[1],G[0]) = (Q[1],Q[0])*n0'.
The calculation of C + M*N can be converted into the calculation of
(Q[3],Q[2],Q[1],Q[0])=(Q[3],Q[2],Q[1],Q[0])+(G[1],G[0])*(H[1],H[0])
Next, C = A*B and C = C + M*N (with M = C*n0' mod R) are calculated alternately using product scanning. That is, for each column q with 0 ≤ q ≤ 2r-1, all partial products of the column are calculated:
E[k]*F[l]+G[k]*H[l]=(Q[q+1],Q[q]) (where k + l = q), and then the next column is calculated, until all columns have been calculated.
The algorithm structure is shown in FIG. 4. Each shaded block (①, etc.) in the diamond structure, and each large box in the multiplication structure, represents one product E[k]*F[l] or G[k]*H[l]. The whole diamond structure is calculated using the shaded blocks as basic units, with scale r = n/d = 2, using coarse-grained integrated product scanning: first all E[k]*F[l] in column 0 (i.e., block ①) are calculated, then all G[k]*H[l] (i.e., block ②), then all E[k]*F[l] in column 1 (i.e., block ③), then all G[k]*H[l] (i.e., block ⑤), and then all E[k]*F[l] in column 2 (i.e., block ⑦) and all G[k]*H[l] (i.e., block ⑧).
Each shaded block (①, etc.) in the figure, that is, each product E[k]*F[l] (or G[k]*H[l]), is calculated using operand scanning with scale d = 4. The calculation proceeds row by row: in each row one word B[i] of one operand is kept fixed and multiplied by all words A[j] (0 ≤ j < d) of the other operand; after all products in the row have been calculated, the next row is calculated.
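The row-by-row operand scanning inside one block can be sketched in Python as below. The sketch models registers with ordinary variables, assumes 32-bit words and little-endian word lists, and uses hypothetical names (block_mul_operand_scan, to_int); it illustrates the scanning order only and makes no claim about the register allocation of the patented method.

```python
def to_int(words, w=32):
    """Interpret a little-endian list of w-bit words as one integer."""
    return sum(word << (w * i) for i, word in enumerate(words))


def block_mul_operand_scan(a_blk, b_blk, w=32):
    """Multiply two word blocks row by row (operand scanning).

    The words of a_blk play the role of the d operand registers; each row
    takes one word of b_blk and multiplies it by every word of a_blk.
    """
    mask = (1 << w) - 1
    out = [0] * (len(a_blk) + len(b_blk))
    for i, b_word in enumerate(b_blk):          # one row per word B[i]
        carry = 0
        for j, a_word in enumerate(a_blk):      # all words A[j], 0 <= j < d
            t = out[i + j] + a_word * b_word + carry
            out[i + j] = t & mask               # low word stays in place
            carry = t >> w                      # high word moves one column up
        out[i + len(a_blk)] = carry             # top word of this row
    return out


product_words = block_mul_operand_scan([0xFFFFFFFF, 1, 2, 3], [0xFFFFFFFF] * 4)
```

The same routine handles a shorter operand, for example a 3-word block times a 2-word block, which corresponds to scanning an incomplete block along its long side.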
In another embodiment, n = 8 and d = 3. As shown in FIG. 5, the whole multiplication is divided into many blocks (①, etc.), and coarse-grained integrated product scanning is still used between these blocks, that is, the blocks are executed in their numbered order starting from block ①, with scale r = ⌈n/d⌉ = 3.
According to the completeness of the blocks, all columns of the product scan can be divided into three parts. The first part is columns 1 to r-1, in which all blocks are complete blocks of size d*d. The second part is columns r to 2r-2, in which the uppermost and lowermost blocks are incomplete blocks of size [d-(rd-n)]*d and the remaining blocks are complete blocks of size d*d. The third part is column 2r-1, which contains two incomplete blocks of size [d-(rd-n)]*[d-(rd-n)]. The inside of each block is still calculated by operand scanning; for an incomplete block, operand scanning is performed along the long side.
The step 2) specifically comprises the following steps:
step 2-1) making q equal to 0;
step 2-2) the set of all (k, l) satisfying k + l = q is denoted as A: A = {k, l | k + l = q};
step 2-3) calculating (Q[q+1],Q[q]) = Σ_{(k,l)∈A} E[k]*F[l];
Operand scanning is used: the calculation proceeds row by row; in each row one word B[i] of one operand is kept fixed and multiplied by all words A[j] (0 ≤ j < d) of the other operand; after all products in the row have been calculated, the next row is calculated. Each E[k]*F[l] and G[k]*H[l] is calculated as shown in FIG. 6:
E[k]*F[l]=(A[kd+3],A[kd+2],A[kd+1],A[kd])*(B[ld+3],B[ld+2],B[ld+1],B[ld])
step 2-4) determining whether q < r holds; if so, calculating G[q] = Q[q]*n0'; otherwise, going to step 2-5);
step 2-5) calculating (Q[q+1],Q[q]) = (Q[q+1],Q[q]) + Σ_{(k,l)∈A} G[k]*H[l];
step 2-6) setting q = q + 1; if q ≤ 2r-1, setting k = k + 1 and returning to step 2-2); otherwise, going to step 2-7);
step 2-7) calculating C = C/R; since R = 2^(nW) with n = 8:
C=(C[15],C[14],C[13],C[12],C[11],C[10],C[9],C[8]);
step 3) determining whether C is greater than or equal to N; if so, setting C = C - N; in either case, going to step 4);
step 4) outputting the Montgomery product C of A and B.
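The column-by-column alternation of steps 2-1) to 2-7) can be modelled with the following Python sketch. It is a simplified model under several assumptions that are not taken from the patent: each block product is computed with Python big-integer multiplication instead of operand scanning; the block-level reduction constant is taken as -N^(-1) mod 2^(dW) rather than the single-word n0'; and, within a column, the products G[k]*H[l] of earlier blocks (k < q) are added before G[q] is derived, which is what makes the division by R exact. The names to_blocks and cipohs_sketch are hypothetical.

```python
def to_blocks(x, n, d, w):
    """Split x into n words of w bits and group them d per block.

    Returns a list of (block value, words in block), least significant first.
    """
    mask = (1 << w) - 1
    words = [(x >> (w * i)) & mask for i in range(n)]
    blocks = []
    for i in range(0, n, d):
        chunk = words[i:i + d]
        value = sum(word << (w * j) for j, word in enumerate(chunk))
        blocks.append((value, len(chunk)))
    return blocks


def cipohs_sketch(a, b, n_mod, n, d, w=32):
    """Return a*b*R^(-1) mod n_mod with R = 2^(n*w), scanning block columns."""
    r = -(-n // d)                                   # outer loop size ceil(n/d)
    E = [v for v, _ in to_blocks(a, n, d, w)]
    F = [v for v, _ in to_blocks(b, n, d, w)]
    h_blocks = to_blocks(n_mod, n, d, w)
    H = [v for v, _ in h_blocks]
    widths = [length * w for _, length in h_blocks]  # bits per block of M
    n_prime = (-pow(n_mod, -1, 1 << (d * w))) % (1 << (d * w))
    G = [0] * r                                      # blocks of M
    t = 0                                            # accumulator for C
    for q in range(2 * r - 1):                       # one block column per pass
        shift = q * d * w
        lo, hi = max(0, q - r + 1), min(q, r - 1)
        for k in range(lo, hi + 1):                  # multiplication part
            t += (E[k] * F[q - k]) << shift
        for k in range(lo, hi + 1):                  # reduction part, known G[k]
            if k < q:
                t += (G[k] * H[q - k]) << shift
        if q < r:                                    # derive the next block of M
            G[q] = ((t >> shift) * n_prime) % (1 << widths[q])
            t += (G[q] * H[0]) << shift
    c = t >> (n * w)                                 # C = C / R
    return c - n_mod if c >= n_mod else c            # step 3)
```

As a usage example, with n_mod = 2**255 - 19, n = 8 and w = 32, the calls with d = 4 (the CIPOHS-a case) and d = 3 (the CIPOHS-b case) return the same value as a * b * pow(2**256, -1, n_mod) % n_mod. When d does not divide n, the last block of M is narrower, so R stays equal to 2^(nW).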
The method of the present invention is divided into two cases according to whether n/d is an integer. In the first case n/d is an integer, i.e., ⌈n/d⌉ = n/d, and the method is called CIPOHS-a. In the second case n/d is not an integer, i.e., ⌈n/d⌉ > n/d, and the method is called CIPOHS-b. The total number of memory accesses of both methods is analyzed below.
1. CIPOHS-a method
As shown in FIG. 4, the number of memory accesses inside each block is analyzed first. Each block uses d+1 registers to hold operands: d registers hold the words of operand A, and the remaining register holds, in turn, each word of the multi-precision operand B. Every operand word in a block is therefore loaded only once, so the number of loads per block is 2d. The calculation result of each block is kept directly in 2d+1 registers, so the number of stores of intermediate results per block is 0. The outer loop is analyzed next. Since the outer loop size is r = n/d, there are 2r^2 = 2(n/d)^2 blocks in total, processed with coarse-grained integrated product scanning. In this example there are 2 × 2^2 = 8 blocks, executed in the order ①②③④⑤⑥⑦⑧ marked in the figure. Since the number of loads per block is 2d and there are 2(n/d)^2 blocks, the total number of loads is 2d × 2(n/d)^2 = 4n^2/d. What needs to be stored in the whole algorithm is M of n words (M ← C × n0' mod r) and the result of n+1 words, so the total number of stores is n + (n+1) = 2n+1. The total number of memory accesses (loads and stores) is therefore 4n^2/d + 2n + 1.
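The count for the case where d divides n can be reproduced with a few lines of Python. The function name cipohs_a_accesses is an assumption; the per-block figures (2d loads, no intermediate stores) and the 2n+1 stores are the ones stated in the text.

```python
def cipohs_a_accesses(n, d):
    """Return (loads, stores, total) when n/d is an integer."""
    if n % d != 0:
        raise ValueError("this count assumes that d divides n")
    r = n // d
    blocks = 2 * r * r          # r*r blocks for A*B plus r*r blocks for M*N
    loads = 2 * d * blocks      # 2d loads per block, equal to 4*n*n/d
    stores = n + (n + 1)        # n words of M plus n+1 further words
    return loads, stores, loads + stores


example = cipohs_a_accesses(8, 4)
```

For the embodiment with n = 8 and d = 4 this gives 64 loads and 17 stores, 81 memory accesses in total, which matches 4n^2/d + 2n + 1.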
2. CIPOHS-b method
As shown in FIG. 5, the first part contains the first r-1 columns, in which all blocks are complete blocks, such as blocks ①, ③ and ④. Operand scanning is used in each block, so the number of loads per block is 2d; there are r(r-1) blocks in total, so the number of loads is 2d*r(r-1). The second part contains columns r to 2r-2. The uppermost and lowermost blocks in each column are incomplete blocks of size [d-(rd-n)]*d, in which operand scanning is performed along the side of length d. For block ⑦ in FIG. 5, for example, A[0], A[1], A[2] are first loaded into registers; then the products of B[6] with A[0], A[1], A[2] are calculated, followed by the products of B[7] with A[0], A[1], A[2]. The number of loads per incomplete block is 2d-(rd-n), and there are 4(r-1) such blocks. The remaining blocks of the second part are all complete blocks using normal operand scanning, with 2d loads per block, and there are (r-2)*(r-1) of them. The number of loads in the second part is therefore 4(r-1)*[2d-(rd-n)] + 2d*(r-2)*(r-1). The third part contains only column 2r-1, which has only 2 incomplete blocks of size [d-(rd-n)]*[d-(rd-n)]; in these blocks operand scanning is performed along the length d-(rd-n), so the number of loads is 4[d-(rd-n)]. Table 1 summarizes the number of loads in these three parts. The total number of loads is 4rn, and the number of stores is 2n+1, the same as for CIPOHS-a, so the total number of memory accesses is 4rn + 2n + 1.
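That the three partial counts add up to 4rn can be checked numerically. The Python sketch below simply evaluates the three expressions from the paragraph above; the function name cipohs_b_loads is an assumption.

```python
def cipohs_b_loads(n, d):
    """Return the load counts of the three parts when d does not divide n."""
    r = -(-n // d)                               # r = ceil(n / d)
    missing = r * d - n                          # words missing from the last block
    part1 = 2 * d * r * (r - 1)                  # complete blocks, first r-1 columns
    part2 = (4 * (r - 1) * (2 * d - missing)     # incomplete edge blocks
             + 2 * d * (r - 2) * (r - 1))        # remaining complete blocks
    part3 = 4 * (d - missing)                    # two small blocks in the last column
    return part1, part2, part3


parts = cipohs_b_loads(8, 3)
total_loads = sum(parts)
```

For n = 8 and d = 3 (so r = 3) the three parts are 36, 52 and 8 loads, and their sum is 96 = 4rn. Expanding the three expressions symbolically gives the same identity for any r, d and n.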
TABLE 1
Figure BDA0001062760380000071
Combining CIPOHS-a and CIPOHS-b, the total number of memory accesses can be written uniformly as 4n⌈n/d⌉ + 2n + 1.
Next, the number of registers used and the number of memory accesses are analyzed for several algorithms proposed by Koc and for the proposed CIPOHS algorithm, as shown in Table 2.
TABLE 2
Figure BDA0001062760380000073
As can be seen from Table 2, among the existing algorithms compared in the table, the CIOS algorithm requires the fewest memory accesses, 2n^2+3n+1, but it requires a larger number of registers, n+4. When n is large, the number of available registers is less than n+4, so the algorithm can no longer be used. The CIOS-5reg and FIPS algorithms use few registers, only 5, but their number of memory accesses is large. The CIPOHS algorithm proposed by the present invention solves this problem by dynamically selecting d according to the number of available registers, making good use of the available registers to reduce the number of memory accesses. The number of memory accesses of CIPOHS is 4n⌈n/d⌉ + 2n + 1, and that of CIOS is 2n^2+3n+1. When d is an integer greater than 1, the number of memory accesses of CIPOHS is less than that of CIOS, and the larger the value of d, the fewer memory accesses CIPOHS requires. The numbers of multiplication instructions, addition instructions and so on used by these algorithms are basically the same, and the CIPOHS algorithm has the fewest memory accesses, so its operational efficiency is the highest.
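The comparison can be tabulated with a short Python sketch. It assumes the CIPOHS count is 4n⌈n/d⌉ + 2n + 1, which is how the two cases analyzed above (4n^2/d + 2n + 1 and 4rn + 2n + 1) combine, and takes the CIOS count 2n^2 + 3n + 1 from the text; the function names are hypothetical.

```python
def cios_accesses(n):
    """Memory accesses of CIOS as quoted in the text (uses n + 4 registers)."""
    return 2 * n * n + 3 * n + 1


def cipohs_accesses(n, d):
    """Memory accesses of CIPOHS, assuming 4 * n * ceil(n/d) + 2n + 1."""
    r = -(-n // d)
    return 4 * n * r + 2 * n + 1


n = 8
table = {d: cipohs_accesses(n, d) for d in range(2, n + 1)}
baseline = cios_accesses(n)
```

For n = 8 the CIOS count is 153, while CIPOHS needs 145 accesses at d = 2, 81 at d = 4 and 49 at d = 8. Under the same formula the advantage is not guaranteed for very small n when d does not divide n (n = 3, d = 2 gives 31 against 28), so it is worth evaluating both expressions for the concrete n and d in use.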
In summary, the Montgomery modular multiplication method suitable for an embedded system according to the present invention alternately calculates the two parts, multi-precision multiplication and Montgomery reduction, using coarse-grained integration, and uses hybrid product and operand scanning within both parts. The value of d is selected according to the number of available registers; by making full use of the registers, the number of memory accesses of the algorithm is reduced and its operational efficiency is further improved.
The above description only illustrates embodiments of the present invention and should not be taken as limiting its scope. Those skilled in the art should understand that modifications and equivalent substitutions made without departing from the spirit and scope of the present invention are also covered by the scope of the present invention.

Claims (1)

1. A Montgomery modular multiplication computation method suitable for an embedded system, the method comprising: multi-precision multiplication and Montgomery reduction; the multi-precision multiplication and Montgomery reduction parts are calculated in a hybrid scanning mode, in which operand scanning is used in the inner loop and product scanning is used in the outer loop; the multi-precision multiplication and Montgomery reduction use a coarse-grained integration mode, namely the two parts are calculated alternately;
the method specifically comprises the following steps:
step 1) setting the large number N as an m-bit prime number and the word length of the processor as W bits, so that the number of words of N is n = ⌈m/W⌉; A and B are two residues modulo N, i.e., 0 < A, B < N; the Montgomery coefficient is R = 2^(nW); n0' = -n0^(-1) mod 2^W, where n0 is the least significant word of N; selecting d, where d is the word size of the inner loop; the size of the outer loop is r = ⌈n/d⌉, where ⌈x⌉ denotes the smallest integer greater than or equal to x;
step 2) the modular multiplication of A and B is calculated as: C = A*B, M = C*n0' mod R, C = (C + M*N)/R; every d words of the operands A, B, M, N and C are taken as a whole:
E=(E[r-1],…,E[0])=({A[n-1],A[n-2],…,A[n-d]},…,{A[d-1],…,A[1],A[0]})
F=(F[r-1],…,F[0])=({B[n-1],B[n-2],…,B[n-d]},…,{B[d-1],…,B[1],B[0]})
G=(G[r-1],…,G[0])=({M[n-1],M[n-2],…,M[n-d]},…,{M[d-1],…,M[1],M[0]})
H=(H[r-1],…,H[0])=({N[n-1],N[n-2],…,N[n-d]},…,{N[d-1],…,N[1],N[0]})
Q=(Q[2r-1],Q[2r-2],…,Q[1],Q[0])=({C[2n-1],C[2n-2],…,C[2n-d]},…,{C[d-1],…,C[1],C[0]})
E, F, G, H and Q are all block matrices;
sequentially calculating all partial products of column q, for 0 ≤ q ≤ 2r-1:
E[k]*F[l]+G[k]*H[l]=(Q[q+1],Q[q]),
where k + l = q; when all columns have been calculated, C is obtained; k and l are integers;
step 3) determining whether C is greater than or equal to N; if so, setting C = C - N; in either case, going to step 4);
step 4), outputting a Montgomery product result C of A and B;
the step 2) specifically comprises the following steps:
step 2-1) setting q = 0;
step 2-2) the set of all (k, l) satisfying k + l = q is denoted as A: A = {k, l | k + l = q};
step 2-3) calculating (Q[q+1],Q[q]) = Σ_{(k,l)∈A} E[k]*F[l],
where
E[k]*F[l]=(A[kd+3],A[kd+2],A[kd+1],A[kd])*(B[ld+3],B[ld+2],B[ld+1],B[ld]);
step 2-4) determining whether q < r holds; if so, calculating G[q] = Q[q]*n0'; otherwise, going to step 2-5);
step 2-5) calculating (Q[q+1],Q[q]) = (Q[q+1],Q[q]) + Σ_{(k,l)∈A} G[k]*H[l];
step 2-6) setting q = q + 1; if q ≤ 2r-2, setting k = k + 1 and returning to step 2-2); otherwise, going to step 2-7);
step 2-7) calculating C = C/R; since R = 2^(nW):
C=(C[2n-1],C[2n-2],…,C[n+1],C[n]).
CN201610609265.0A 2016-07-28 2016-07-28 Montgomery modular multiplication calculation method suitable for embedded system Active CN107665109B (en)

Publications (2)

Publication Number Publication Date
CN107665109A (en) 2018-02-06
CN107665109B (en) 2020-04-14

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1152746A (en) * 1996-09-20 1997-06-25 张胤微 High speed modular multiplication method and device
CN1085862C (en) * 1996-09-20 2002-05-29 张胤微 High speed modular multiplication method and device
US8417756B2 (en) * 2007-11-29 2013-04-09 Samsung Electronics Co., Ltd. Method and apparatus for efficient modulo multiplication
CN101834723A (en) * 2009-03-10 2010-09-15 上海爱信诺航芯电子科技有限公司 RSA (Rivest-Shamirh-Adleman) algorithm and IP core
CN102207847A (en) * 2011-05-06 2011-10-05 广州杰赛科技股份有限公司 Data encryption and decryption processing method and device based on Montgomery modular multiplication operation
CN102707924A (en) * 2012-05-02 2012-10-03 广州中大微电子有限公司 RSA coprocessor for RFID (radio frequency identification device) intelligent card chip
CN103914277A (en) * 2014-04-14 2014-07-09 复旦大学 Extensible modular multiplier circuit based on improved Montgomery modular multiplication algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
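Implementation of an efficient and side-channel-attack-resistant RSA algorithm on 8-bit AVR microcontrollers; Liu Zhe (刘哲); China Master's Theses Full-text Database, Information Science and Technology; 2012-04-15 (No. 4); full text *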

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant