CN107665109B

CN107665109B - Montgomery modular multiplication calculation method suitable for embedded system

Info

Publication number: CN107665109B
Application number: CN201610609265.0A
Authority: CN
Inventors: 曾学文; 李杨; 叶晓舟
Original assignee: Institute of Acoustics CAS; Beijing Intellix Technologies Co Ltd
Current assignee: Institute of Acoustics CAS; Beijing Intellix Technologies Co Ltd
Priority date: 2016-07-28
Filing date: 2016-07-28
Publication date: 2020-04-14
Anticipated expiration: 2036-07-28
Also published as: CN107665109A

Abstract

The invention discloses a Montgomery modular multiplication calculation method suitable for an embedded system, comprising multi-precision multiplication and Montgomery reduction. Both parts are calculated in a hybrid scanning mode: operand scanning is used in the inner loop and product scanning is used in the outer loop. The multi-precision multiplication and the Montgomery reduction are combined in a coarse-grained integrated mode, i.e. the two parts are calculated alternately. The method reduces the number of memory accesses in an embedded system and improves the implementation efficiency of the Montgomery modular multiplication algorithm.

Description

Montgomery modular multiplication calculation method suitable for embedded system

Technical Field

The invention relates to the field of public key cryptosystems, in particular to a Montgomery modular multiplication computing method suitable for an embedded system.

Background

In 1985, Montgomery proposed the Montgomery modular multiplication algorithm, which is currently the most widely used modular multiplication algorithm. Its basic idea is to replace time-consuming inversion and division operations with simple, fast addition and shift operations. To calculate the modular product P = A × B mod N (A, B, N are n-bit binary large integers and 0 < A, B < N), the algorithm first selects an integer R that is relatively prime to N (typically R = 2^n) and converts the operations modulo N into operations modulo R. The Montgomery product is defined as: MonPro(A, B) = A × B × R^-1 mod N.

The calculation structure of the Montgomery modular multiplication algorithm is shown in FIG. 1, and the method specifically comprises the following three steps:

(1) calculating the N-residues of A and B: A' = A*R mod N = A*R²*R^-1 mod N = MonPro(A, R²),

B' = B*R mod N = B*R²*R^-1 mod N = MonPro(B, R²);

(2) calculating the Montgomery product of A' and B': P' = A'*B'*R^-1 mod N = MonPro(A', B');

(3) converting P' into the modular product P: P = A*B mod N = A'*R^-1*B'*R^-1 mod N = P'*R^-1 mod N = MonPro(P', 1).
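The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `mon_pro` and `mod_mul_via_montgomery` are hypothetical, and `mon_pro` evaluates A × B × R^-1 mod N directly with Python's built-in modular inverse (Python 3.8 or later) instead of the shift-and-add reduction that a real implementation would use.

```python
def mon_pro(x, y, n, r):
    """Montgomery product x*y*R^-1 mod n, evaluated directly for illustration."""
    return (x * y * pow(r, -1, n)) % n


def mod_mul_via_montgomery(a, b, n, r):
    """Compute a*b mod n through the three Montgomery steps."""
    r2 = (r * r) % n
    a_res = mon_pro(a, r2, n, r)          # step (1): A' = A*R mod N
    b_res = mon_pro(b, r2, n, r)          # step (1): B' = B*R mod N
    p_res = mon_pro(a_res, b_res, n, r)   # step (2): P' = A'*B'*R^-1 mod N
    return mon_pro(p_res, 1, n, r)        # step (3): P = P'*R^-1 mod N
```

The conversion cost of steps (1) and (3) is amortized when many multiplications are chained, as in modular exponentiation, because intermediate values can stay in residue form.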

Therefore, the core of the Montgomery modular multiplication algorithm is the calculation of the Montgomery product MonPro(A', B'); the specific algorithm flow is shown in FIG. 2. From FIG. 2 it can be seen that the Montgomery product calculation can be divided into two key steps: the multi-precision multiplication T ← A × B and the Montgomery reduction P ← (T + M × N)/R.
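A minimal Python sketch of these two key steps, assuming R = 2^k and the usual constant N' = -N^-1 mod R used to form M = T × N' mod R (this paragraph does not spell M out, so that definition and the function names are assumptions):

```python
def montgomery_reduce(t, n, k):
    """Return t * R^-1 mod n for R = 2**k, assuming n odd and 0 <= t < n*R."""
    r_mask = (1 << k) - 1
    n_neg_inv = (-pow(n, -1, 1 << k)) & r_mask   # N' = -N^-1 mod R
    m = ((t & r_mask) * n_neg_inv) & r_mask      # M = T*N' mod R
    p = (t + m * n) >> k                         # (T + M*N) is divisible by R
    return p - n if p >= n else p                # final conditional subtraction


def mon_pro(a, b, n, k):
    """Montgomery product: multi-precision multiply, then Montgomery reduce."""
    t = a * b                                    # T <- A x B
    return montgomery_reduce(t, n, k)            # P <- (T + M x N) / R
```

Only masks, multiplications, additions and a shift appear in the reduction; no division by N is needed.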

In 1990, Dusse proposed a Montgomery modular multiplication algorithm that uses radix-r numbers and replaces N' with n₀' = -n₀^-1 mod r, improving the algorithm accordingly. In 1996, Koc analyzed and summarized various implementation methods of the Montgomery modular multiplication algorithm and identified five main improved Montgomery algorithms: SOS, CIOS, FIOS, FIPS, and CIHS. The last two letters (OS/PS/HS) denote the scanning mode used for the multiplication: OS is operand scanning, PS is product scanning, and HS is hybrid scanning. The leading letters (S/CI/FI) denote how the multi-precision multiplication and the Montgomery reduction are integrated: S is the separated mode, in which one part is calculated completely before the other; CI is the coarse-grained integrated mode, in which the two parts alternate at coarse granularity; and FI is the fine-grained integrated mode, in which the two parts alternate at fine granularity.

These algorithms are implemented as sequences of operations such as multiplication (mul), addition (add), load and store, and high-performance implementations focus mainly on optimizing these operations. In an embedded system the number of available registers is limited, so memory operations such as load/store are very important. In the methods in common use the number of registers is fixed, generally either 5 registers or n+4 registers (n being the number of words in the modulus N). The methods using fewer registers (5) require more memory access operations, while the methods using more registers (n+4) require fewer memory access operations but are often unusable because the number of registers needed exceeds the number available on the processor.

Therefore, how to use registers dynamically according to the number of available registers of the processor and the size of n, so that the registers are fully utilized and the number of memory access operations is reduced, is a problem that currently needs to be solved.
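The single-word constant n₀' mentioned above can be illustrated with a short Python sketch. It assumes the radix is 2^w for a w-bit word and that n₀ is the least significant word of N; the function names are hypothetical, and the word-serial loop is the generic textbook form rather than any of the five named variants.

```python
def n0_prime(n, w):
    """Return -n0^-1 mod 2**w, where n0 is the least significant w-bit word of n."""
    radix = 1 << w
    n0 = n & (radix - 1)
    return (-pow(n0, -1, radix)) % radix


def reduce_word_serial(t, n, n_words, w):
    """Montgomery-reduce t one word at a time, using only the constant n0'."""
    mask = (1 << w) - 1
    np0 = n0_prime(n, w)
    for _ in range(n_words):
        m = ((t & mask) * np0) & mask   # one word of M per iteration
        t = (t + m * n) >> w            # lowest word becomes zero, then shift it out
    return t - n if t >= n else t
```

Because each iteration needs only the lowest word of the running value, the full-size constant N' never has to be computed or stored.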

Disclosure of Invention

The invention aims to overcome the defect that existing Montgomery modular multiplication algorithms are poorly suited to embedded systems with limited resources and, in order to improve the calculation efficiency of the Montgomery modular multiplication algorithm, provides a method for calculating Montgomery modular multiplication using coarse-grained integrated hybrid scanning.

In order to achieve the above object, the present invention provides a Montgomery modular multiplication calculation method suitable for an embedded system, the method comprising: multi-precision multiplication and Montgomery reduction; the multi-precision multiplication and Montgomery reduction parts are calculated in a hybrid scanning mode, with operand scanning used in the inner loop and product scanning used in the outer loop; and the multi-precision multiplication and Montgomery reduction use a coarse-grained integrated mode, i.e. the two parts are calculated alternately.

In the above technical solution, the method specifically includes:

step 1) let the large number N be an m-bit prime and let the word length of the processor be W bits; the number of words of N is n = ⌈m/W⌉;

A and B are two N-residues, i.e. 0 < A, B < N; the Montgomery coefficient is R = 2^(nW); n₀' = -n₀^-1 mod 2^W, where n₀ is the least significant word of N; select d, d being the size in words of the inner loop; the size of the outer loop is r = ⌈n/d⌉,

where ⌈x⌉ denotes the smallest integer greater than or equal to x;

step 2) the modular multiplication of A and B is calculated as follows: C = A*B, M = C*n₀' mod R, C = (C + M*N)/R; every d words of the operands A, B, M, N, C are taken as a unit:

E = (E[r-1],…,E[0]) = ({A[n-1],A[n-2],…,A[n-d]},…,{A[d-1],…,A[1],A[0]})

F = (F[r-1],…,F[0]) = ({B[n-1],B[n-2],…,B[n-d]},…,{B[d-1],…,B[1],B[0]})

G = (G[r-1],…,G[0]) = ({M[n-1],M[n-2],…,M[n-d]},…,{M[d-1],…,M[1],M[0]})

H = (H[r-1],…,H[0]) = ({N[n-1],N[n-2],…,N[n-d]},…,{N[d-1],…,N[1],N[0]})

Q = (Q[2r-1],Q[2r-2],…,Q[1],Q[0]) = ({C[2n-1],C[2n-2],…,C[2n-d]},…,{C[d-1],…,C[1],C[0]})

sequentially calculate all partial products of column q, for 0 ≤ q ≤ 2r-1:

E[k]*F[l] + G[k]*H[l] = (Q[q+1],Q[q]),

where k + l = q; continue until all columns have been calculated, obtaining C;

step 3) judge whether C is greater than or equal to N; if so, let C = C - N; in either case go to step 4);

step 4) output the Montgomery product result C of A and B.
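The word and block structure set up in steps 1) and 2) can be sketched in Python as follows. The helper names are hypothetical, and the sketch assumes little-endian word lists (index 0 is the least significant word) and an outer loop size of r = ⌈n/d⌉.

```python
def ceil_div(x, y):
    """Smallest integer greater than or equal to x / y, for positive integers."""
    return -(-x // y)


def to_words(value, n, w):
    """Split an integer into n words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(value >> (i * w)) & mask for i in range(n)]


def group_blocks(words, d):
    """Group a word list into blocks of d words; the last block may be shorter."""
    return [words[i:i + d] for i in range(0, len(words), d)]


m_bits, w, d = 256, 32, 4
n = ceil_div(m_bits, w)                  # number of words of N
r = ceil_div(n, d)                       # outer loop size
a_words = to_words(2**255 - 19, n, w)    # example operand words A[0..n-1]
E = group_blocks(a_words, d)             # E[0] = {A[3..0]}, E[1] = {A[7..4]}
```

When d does not divide n, `group_blocks` returns a shorter final block, which corresponds to the incomplete blocks that arise when n/d is not an integer.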

In the above technical solution, the step 2) specifically includes:

step 2-1) let q = 0;

step 2-2) denote by A the set of all k, l satisfying k + l = q (this index set is distinct from the operand A): A = {k, l | k + l = q};

step 2-3) calculate (Q[q+1],Q[q]) = ∑_A E[k]*F[l];

where

E[k]*F[l] = (A[kd+3],A[kd+2],A[kd+1],A[kd])*(B[ld+3],B[ld+2],B[ld+1],B[ld]);

step 2-4) judge whether q < r holds; if so, calculate G[q] = Q[q]*n₀'; otherwise, go to step 2-5);

step 2-5) calculate (Q[q+1],Q[q]) = (Q[q+1],Q[q]) + ∑_A G[k]*H[l];

step 2-6) let q = q + 1; if q ≤ 2r-2, let k = k + 1 and return to step 2-2); otherwise, go to step 2-7);

step 2-7) calculate C = C/R; since R = 2^(nW):

C = (C[2n-1],C[2n-2],…,C[n+1],C[n]).
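The column-by-column flow of steps 2-1) to 2-7), followed by the conditional subtraction, can be sketched in Python with each d-word block held as one integer. This is a sketch under several assumptions, not the patented implementation: d is assumed to divide n; a block-level constant -N^-1 mod 2^(dW) stands in for the word-level n₀'; and the quotient block G[q] is computed after the G[k]*H[l] terms with k < q have been added to the column, which exact reduction requires. The function name is hypothetical, and the register-level inner loop is not modelled.

```python
def monpro_block_scan(a, b, n_mod, n_words, d, w=32):
    """Return a*b*R^-1 mod n_mod with R = 2**(n_words*w), scanning block columns."""
    if n_words % d:
        raise ValueError("this sketch assumes d divides n_words")
    r = n_words // d                          # outer loop size
    bits = d * w
    mask = (1 << bits) - 1

    def blocks(x):
        return [(x >> (i * bits)) & mask for i in range(r)]

    E, F, H = blocks(a), blocks(b), blocks(n_mod)
    n_prime = (-pow(n_mod, -1, 1 << bits)) & mask   # block-level constant (assumption)
    G = [0] * r
    out = []                                  # result blocks, least significant first
    acc = 0                                   # running column accumulator
    for q in range(2 * r - 1):
        ks = [k for k in range(r) if 0 <= q - k < r]
        for k in ks:                          # all E[k]*F[l] with k + l = q
            acc += E[k] * F[q - k]
        for k in ks:                          # all G[k]*H[l] whose G[k] is known
            if k < q:
                acc += G[k] * H[q - k]
        if q < r:
            G[q] = ((acc & mask) * n_prime) & mask
            acc += G[q] * H[0]                # low block of acc is now zero
        else:
            out.append(acc & mask)
        acc >>= bits                          # carry into the next column
    c = acc                                   # top block plus any overflow
    for block in reversed(out):
        c = (c << bits) | block
    return c - n_mod if c >= n_mod else c     # step 3): conditional subtraction
```

Shifting the accumulator after every column is what realizes the division by R: the r lowest blocks are zeroed one at a time and discarded, so only the upper half of the double-length product is ever kept.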

Compared with the prior art, the invention has the following technical advantages:

the hybrid scanning idea is applied to Montgomery modular multiplication calculation using a coarse-grained integrated mode; by selecting d dynamically, the available registers are utilized effectively, the number of memory accesses in an embedded system is reduced, and the implementation efficiency of the Montgomery modular multiplication algorithm is improved.

Drawings

FIG. 1 is a schematic diagram of a prior Montgomery modular multiplication calculation structure;

FIG. 2 is a flow chart of a prior art Montgomery modular multiplication to compute a Montgomery product;

FIG. 3 is a schematic diagram of the modular multiplication computation method of the present invention;

FIG. 4 is a block diagram of the Montgomery modular multiplication method CIPOHS-a (n = 8, d = 4) for coarse-grained integrated product and operand hybrid scanning according to the present invention;

FIG. 5 is a block diagram of the Montgomery modular multiplication method CIPOHS-b (n = 8, d = 3) for coarse-grained integrated product and operand hybrid scanning according to the present invention;

FIG. 6 is a schematic diagram of block product scanning in the method of the present invention.

Detailed Description

The method of the present invention is described in further detail below with reference to the figures and specific examples.

A Montgomery modular multiplication computation method suitable for an embedded system, the method comprising:

A and B are two N-residues, i.e. 0 < A, B < N; the Montgomery coefficient is R = 2^(nW); n₀' = -n₀^-1 mod 2^W, where n₀ is the least significant word of N; select d, d being the size in words of the inner loop (which uses operand scanning); the size of the outer loop (which uses product scanning) is r = ⌈n/d⌉;

step 2) calculate the modular multiplication result C of A and B; the calculation process is:

1) C = A*B;

2) M = C*n₀' mod R;

3) C = (C + M*N)/R;

As shown in FIG. 3, let A and B denote two m-bit multi-precision integers: A = (A[n-1], …, A[2], A[1], A[0]), B = (B[n-1], …, B[2], B[1], B[0]). The product C = A*B can be expressed as C = (C[2n-1], …, C[2], C[1], C[0]).

Every d words of the operands A, B, M, N, C are taken as a unit; in this embodiment n = 8 and d = 4, represented as follows:

E＝(E[1],E[0])＝({A[7],A[6],A[5],A[4]}{A[3],A[2],A[1],A[0]})

F＝(F[1],F[0])＝({B[7],B[6],B[5],B[4]}{B[3],B[2],B[1],B[0]})

G＝(G[1],G[0])＝({M[7],M[6],M[5],M[4]}{M[3],M[2],M[1],M[0]})

H＝(H[1],H[0])＝({N[7],N[6],N[5],N[4]}{N[3],N[2],N[1],N[0]})

Q＝(Q[3],Q[2],Q[1],Q[0])＝({C[15],C[14],C[13],C[12]}{C[11],C[10],C[9],C[8]}{C[7],C[6],C[5],C[4]}{C[3],C[2],C[1],C[0]})

Then the calculation of C = A*B can be converted into the calculation of (Q[3],Q[2],Q[1],Q[0]) = (E[1],E[0])*(F[1],F[0]).

The calculation of M = C*n₀' mod R can be converted into the calculation of (G[1],G[0]) = (Q[1],Q[0])*n₀'.

The calculation of C + M*N can be converted into the calculation of

(Q[3],Q[2],Q[1],Q[0]) = (Q[3],Q[2],Q[1],Q[0]) + (G[1],G[0])*(H[1],H[0]).

Next, C = A*B and C = C + M*N (with M = C*n₀' mod R) are calculated alternately by the product scanning method. That is, once all partial products of column q (0 ≤ q ≤ 2r-1),

E[k]*F[l] + G[k]*H[l] = (Q[q+1],Q[q]) (where k + l = q),

have been calculated, the next column is calculated, until all columns are complete.

The algorithm structure is shown in FIG. 4. Each shaded block (①, etc.) in the diamond structure, and each large box in the multiplication structure, represents one product E[k]*F[l] or G[k]*H[l]; the outer loop size is r = n/d = 2. The whole diamond structure is calculated with the shaded blocks as basic units: first all E[k]*F[l] of column 0 (i.e., block ①) and then all G[k]*H[l] of that column (i.e., block ②); next all E[k]*F[l] of column 1 (i.e., block ③) and all G[k]*H[l] (i.e., block ⑤); and finally all E[k]*F[l] of column 2 (i.e., block ⑦) and all G[k]*H[l] (i.e., block ⑧).
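The coarse-grained order in which the blocks are visited can be enumerated with a short Python sketch. The function name and tuple layout are hypothetical, and the order of blocks within one column (ascending k here) is an assumption, since the figure itself is not reproduced in the text.

```python
def block_schedule(r):
    """List the block products in coarse-grained integrated product-scan order."""
    order = []
    for q in range(2 * r - 1):                       # one block column at a time
        pairs = [(k, q - k) for k in range(r) if 0 <= q - k < r]
        order += [("E*F", k, l) for k, l in pairs]   # multiplication part of column q
        order += [("G*H", k, l) for k, l in pairs]   # reduction part of column q
    return order


schedule = block_schedule(2)   # n = 8, d = 4 gives r = 2 and 2*r*r = 8 blocks
```

For r = 2 the schedule alternates between the multiplication and the reduction three times, once per column, rather than once per word as a fine-grained integration would.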

Each shaded block (①, etc.) in the figure, i.e. each E[k]*F[l] (or G[k]*H[l]), is calculated by operand scanning with size d = 4: the calculation proceeds row by row, holding one operand word B[i] fixed in each row and multiplying it by all words A[j] (0 ≤ j < d) of the other operand; the next row is calculated after all products of the current row have been calculated.
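A minimal Python sketch of operand scanning inside one block, with a load counter added for illustration. The function names are hypothetical, Python lists stand in for registers, and word lists are little-endian.

```python
def from_words(words, w):
    """Recombine little-endian w-bit words into one integer."""
    return sum(word << (i * w) for i, word in enumerate(words))


def block_mul_operand_scan(a_words, b_words, w):
    """Multiply two word blocks row by row; return (product words, load count)."""
    mask = (1 << w) - 1
    a_regs = list(a_words)                 # the words of A are loaded once
    loads = len(a_regs)
    acc = [0] * (len(a_words) + len(b_words))
    for i, b_i in enumerate(b_words):      # one row per word of B
        loads += 1                         # B[i] is loaded once and held fixed
        carry = 0
        for j, a_j in enumerate(a_regs):   # multiply B[i] by every A[j]
            t = acc[i + j] + a_j * b_i + carry
            acc[i + j] = t & mask
            carry = t >> w
        acc[i + len(a_regs)] = carry       # top word of this row
    return acc, loads
```

For a complete d × d block the counter ends at 2d, because every operand word is loaded exactly once.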

In another embodiment, n = 8 and d = 3. As shown in FIG. 5, the whole multiplication is divided into many blocks (①, etc.), and coarse-grained integrated product scanning is still used between these blocks, i.e. the blocks are executed in the order of their numbering starting from ①, with an outer loop size of r = ⌈n/d⌉ = 3.

According to the completeness of the blocks, all columns of the product scan can be divided into three parts. The first part is columns 1 to r-1, in which all blocks are complete blocks of size d*d. The second part is columns r to 2r-2, in which the uppermost and lowermost blocks are incomplete blocks of size [d-(rd-n)]*d and the remaining blocks are all complete blocks of size d*d. The third part, column 2r-1, contains two incomplete blocks of size [d-(rd-n)]*[d-(rd-n)]. The inside of each block is still calculated by operand scanning; for an incomplete block, the operand scan runs along the long edge.
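The block shapes in each column can be tabulated with a small Python sketch. The function names are hypothetical and columns are numbered from 0 here, so the three parts described above correspond to columns 0 to r-2, r-1 to 2r-3, and 2r-2. Only the E[k]*F[l] blocks are listed; the G[k]*H[l] blocks of a column have the same shapes.

```python
def block_sizes(n, d):
    """Word count of each operand block; only the last block can be shorter than d."""
    r = -(-n // d)
    return [min(d, n - i * d) for i in range(r)]


def column_shapes(n, d):
    """Map each block column q to the (rows, cols) shapes of its blocks."""
    sizes = block_sizes(n, d)
    r = len(sizes)
    return {
        q: [(sizes[k], sizes[q - k]) for k in range(r) if 0 <= q - k < r]
        for q in range(2 * r - 1)
    }


shapes = column_shapes(8, 3)   # n = 8, d = 3, so r = 3 and d - (r*d - n) = 2
```

In this example the incomplete dimension is 2 words, so the outermost blocks of column 2 are 3 × 2 and the single block of the last column is 2 × 2.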

The step 2) specifically comprises the following steps:

step 2-1) let q = 0;

step 2-3) calculate (Q[q+1],Q[q]) = ∑_A E[k]*F[l];

Operand scanning is adopted: the calculation proceeds row by row; in each row one operand word B[i] is held fixed and multiplied by all words A[j] (0 ≤ j < d) of the other operand; after all products of the row have been calculated, the next row is calculated. Each E[k]*F[l] and G[k]*H[l] is calculated as shown in FIG. 6:

E[k]*F[l] = (A[kd+3],A[kd+2],A[kd+1],A[kd])*(B[ld+3],B[ld+2],B[ld+1],B[ld])

step 2-4) judge whether q < r holds; if so, calculate G[q] = Q[q]*n₀'; otherwise, go to step 2-5);

step 2-5) calculate (Q[q+1],Q[q]) = (Q[q+1],Q[q]) + ∑_A G[k]*H[l];

step 2-6) let q = q + 1; if q ≤ 2r-1, let k = k + 1 and return to step 2-2); otherwise, go to step 2-7);

step 2-7) calculate C = C/R; since R = 2^(nW):

C = (C[15],C[14],C[13],C[12],C[11],C[10],C[9],C[8]);

step 4) output the Montgomery product result C of A and B.
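Because R = 2^(nW), the division in step 2-7) amounts to dropping the n low-order words. A tiny Python sketch of that indexing, with hypothetical helper names and little-endian word lists:

```python
def to_words(value, count, w):
    """Split an integer into count words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(value >> (i * w)) & mask for i in range(count)]


def divide_by_r(c_words, n):
    """Return words C[n..2n-1] of a 2n-word value, i.e. the value divided by 2**(n*w)."""
    return c_words[n:2 * n]


w, n = 32, 8
c = 0x1234_5678_9ABC_DEF0 << (n * w)      # example value whose low n words are zero
high = divide_by_r(to_words(c, 2 * n, w), n)
```

No arithmetic is performed: the result is simply the upper half of the word array, which is why the reduction is cheap once the low words have been zeroed.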

The method of the present invention is divided into two cases according to whether n/d is an integer: in the first case n/d is an integer, i.e. r = n/d, and the method is called CIPOHS-a; in the second case n/d is not an integer, i.e. r = ⌈n/d⌉ > n/d, and the method is called CIPOHS-b. The total number of memory accesses of both methods is analyzed below.

1. CIPOHS-a method

As shown in FIG. 4, the number of memory accesses inside each block is analyzed first. Each block uses d+1 registers to hold operands: d registers hold the words of operand A, and the remaining register holds, in turn, each word of the multi-precision operand B. Each operand word in a block is loaded only once, so the number of loads per block is 2d. The calculation result of each block is kept directly in 2d+1 registers, so the number of stores of intermediate results per block is 0.

The outer loop is analyzed next. Since the outer loop size is r = n/d, there are 2r² = 2(n/d)² blocks in total, processed by coarse-grained integrated product scanning. In this example there are 2 × 2² = 8 blocks, executed in the order ①②③④⑤⑥⑦⑧ marked in the figure. Since the number of loads per block is 2d and there are 2(n/d)² blocks, the total number of loads is 2d × 2(n/d)² = 4n²/d. What must be stored over the whole algorithm is the n words of M (M ← C × n₀' mod R) and the n+1 words of the result, so the total number of stores is n + (n+1) = 2n+1. The total number of memory accesses (loads and stores) is therefore 4n²/d + 2n + 1.

2. CIPOHS-b method

As shown in FIG. 5, the first part contains the first r-1 columns, in which all blocks are complete blocks, such as blocks ①, ③, ④. Operand scanning is used in each block, so the number of loads per block is 2d; there are r(r-1) blocks in total, so the total number of loads is 2d*r*(r-1).

The second part contains columns r to 2r-2. The uppermost and lowermost blocks in each column are incomplete blocks of size [d-(rd-n)]*d, which are scanned by operand scanning along the side of length d. For block ⑦ in FIG. 5, for example, A[0], A[1], A[2] are first loaded into registers; then the products of B[6] with A[0], A[1], A[2] are calculated, followed by the products of B[7] with A[0], A[1], A[2]. The number of loads per incomplete block is 2d-(rd-n), and there are 4(r-1) such blocks. The remaining blocks of the second part are all complete blocks using normal operand scanning, with 2d loads per block; there are (r-2)*(r-1) such blocks. The total number of loads of the second part is therefore 4(r-1)*[2d-(rd-n)] + 2d*(r-2)*(r-1).

The third part contains only column 2r-1, which has just 2 incomplete blocks of size [d-(rd-n)]*[d-(rd-n)]. In these blocks the operand scan runs along the length [d-(rd-n)], so the number of loads is 4[d-(rd-n)]. Table 1 summarizes the number of loads in these three parts; the total number of loads is 4rn, and the number of stores is 2n+1 as in CIPOHS-a, so the total number of memory accesses is 4rn + 2n + 1.
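The three per-part load counts can be checked against the stated total of 4rn with a short Python sketch. The function name is hypothetical; the formulas are taken directly from the analysis above.

```python
def cipohs_b_loads(n, d):
    """Return the load counts of the three column groups as (part1, part2, part3)."""
    r = -(-n // d)                     # outer loop size, ceil(n / d)
    e = r * d - n                      # words missing from the last block
    part1 = 2 * d * r * (r - 1)        # first r-1 columns: complete blocks only
    part2 = 4 * (r - 1) * (2 * d - e) + 2 * d * (r - 2) * (r - 1)
    part3 = 4 * (d - e)                # last column: two small incomplete blocks
    return part1, part2, part3


parts = cipohs_b_loads(8, 3)           # the n = 8, d = 3 embodiment
total_loads = sum(parts)
```

Expanding the sum with e = rd - n gives 4dr² - 4er = 4r(dr - e) = 4rn, so the identity holds for every n and d, including the case where d divides n.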

TABLE 1

Combining CIPOHS-a and CIPOHS-b, the total number of memory accesses can be written uniformly as 4n⌈n/d⌉ + 2n + 1.

Next, the number of registers used and the number of memory accesses are compared for several algorithms proposed by Koc and for the proposed CIPOHS algorithm, as shown in Table 2.

TABLE 2

As can be seen from Table 2, among the existing algorithms compared in the table, the CIOS algorithm requires the fewest memory accesses, 2n²+3n+1, but it requires a larger number of registers, n+4. When n is large, the number of available registers is less than n+4, and the algorithm can no longer be used. The CIOS-5reg and FIPS algorithms use few registers, only 5, but their number of memory accesses is large. The CIPOHS algorithm proposed by the present invention solves this problem by selecting d dynamically according to the number of available registers, making good use of those registers and reducing the number of memory accesses. The number of memory accesses of CIPOHS is 4n⌈n/d⌉ + 2n + 1.

The number of memory accesses of CIOS is 2n²+3n+1. When d is an integer greater than 1, the number of memory accesses of CIPOHS is generally less than that of CIOS (always so when d divides n, since then 4n⌈n/d⌉ = 4n²/d ≤ 2n²), and the larger the value of d, the smaller the number of memory accesses required by CIPOHS. The numbers of multiplication, addition and other instructions used by these algorithms are essentially the same, and the CIPOHS algorithm has the fewest memory accesses, so its computational efficiency is the highest.
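The two access counts can be compared numerically with a short Python sketch. It only evaluates the formulas quoted in the text (2n²+3n+1 for CIOS, and 4rn+2n+1 with r = ⌈n/d⌉ for CIPOHS); the function names are hypothetical and no register model is included.

```python
def cios_accesses(n):
    """Memory accesses of CIOS for an n-word modulus, per the formula in the text."""
    return 2 * n * n + 3 * n + 1


def cipohs_accesses(n, d):
    """Memory accesses of CIPOHS: 4*r*n loads plus 2n+1 stores, r = ceil(n/d)."""
    r = -(-n // d)
    return 4 * r * n + 2 * n + 1


n = 8
comparison = {d: cipohs_accesses(n, d) for d in (2, 3, 4, 8)}
baseline = cios_accesses(n)
```

For n = 8 the count falls as d grows, which illustrates why the largest d that the available registers permit is the preferred choice.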

In summary, the Montgomery modular multiplication method suitable for an embedded system of the present invention calculates the two parts, multi-precision multiplication and Montgomery reduction, alternately using coarse-grained integration, and uses hybrid scanning of product and operand within the two parts. The value d is selected according to the number of available registers, so that the registers are fully utilized, the number of memory accesses of the algorithm is reduced, and the operating efficiency of the algorithm is further improved.

The above description merely illustrates embodiments of the present invention and should not be taken as limiting its scope. Those skilled in the art should understand that modifications and equivalent substitutions made without departing from the spirit and scope of the present invention are also covered by the scope of the present invention.

Claims

1. A Montgomery modular multiplication computation method suitable for an embedded system, the method comprising: multi-precision multiplication and Montgomery reduction; the multi-precision multiplication and Montgomery reduction parts are calculated in a hybrid scanning mode, with operand scanning used in the inner loop and product scanning used in the outer loop; the multi-precision multiplication and Montgomery reduction use a coarse-grained integrated mode, i.e. the two parts are calculated alternately;

the method specifically comprises the following steps:

A and B are two N-residues, i.e. 0 < A, B < N; the Montgomery coefficient is R = 2^(nW); n₀' = -n₀^-1 mod 2^W, where n₀ is the least significant word of N; select d, d being the size in words of the inner loop; the size of the outer loop is r = ⌈n/d⌉;

E = (E[r-1],…,E[0]) = ({A[n-1],A[n-2],…,A[n-d]},…,{A[d-1],…,A[1],A[0]})

F = (F[r-1],…,F[0]) = ({B[n-1],B[n-2],…,B[n-d]},…,{B[d-1],…,B[1],B[0]})

G = (G[r-1],…,G[0]) = ({M[n-1],M[n-2],…,M[n-d]},…,{M[d-1],…,M[1],M[0]})

H = (H[r-1],…,H[0]) = ({N[n-1],N[n-2],…,N[n-d]},…,{N[d-1],…,N[1],N[0]})

Q = (Q[2r-1],Q[2r-2],…,Q[1],Q[0]) = ({C[2n-1],C[2n-2],…,C[2n-d]},…,{C[d-1],…,C[1],C[0]})

E, F, G, H and Q are all block matrices;

E[k]*F[l] + G[k]*H[l] = (Q[q+1],Q[q]),

where k + l = q; until all columns have been calculated, obtaining C; k and l are integers;

step 4) output the Montgomery product result C of A and B;

the step 2) specifically comprises the following steps:

step 2-1) let q = 0;

step 2-3) calculate (Q[q+1],Q[q]) = ∑_A E[k]*F[l];

where

E[k]*F[l] = (A[kd+3],A[kd+2],A[kd+1],A[kd])*(B[ld+3],B[ld+2],B[ld+1],B[ld]);

step 2-5) calculate (Q[q+1],Q[q]) = (Q[q+1],Q[q]) + ∑_A G[k]*H[l];

step 2-7) calculate C = C/R; since R = 2^(nW):

C = (C[2n-1],C[2n-2],…,C[n+1],C[n]).