CN107665109A

CN107665109A - A Montgomery modular multiplication calculation method suitable for embedded systems

Info

Publication number: CN107665109A
Application number: CN201610609265.0A
Authority: CN
Inventors: 曾学文; 李杨; 叶晓舟
Original assignee: Institute of Acoustics CAS; Beijing Intellix Technologies Co Ltd
Current assignee: Institute of Acoustics CAS; Beijing Intellix Technologies Co Ltd
Priority date: 2016-07-28
Filing date: 2016-07-28
Publication date: 2018-02-06
Anticipated expiration: 2036-07-28
Also published as: CN107665109B

Abstract

The invention discloses a Montgomery modular multiplication calculation method suitable for embedded systems. The method comprises two parts, multi-precision multiplication and Montgomery reduction. Both parts are computed using hybrid scanning: the inner loop uses operand scanning and the outer loop uses product scanning. The two parts are combined by coarse-grained integration, i.e. their computation is interleaved. The method reduces the number of memory accesses in an embedded system and improves the implementation efficiency of the Montgomery modular multiplication algorithm.

Description

A Montgomery modular multiplication calculation method suitable for embedded systems

Technical field

The present invention relates to the field of public-key cryptography, and in particular to a Montgomery modular multiplication calculation method suitable for embedded systems.

Background art

In 1985, P. L. Montgomery proposed the Montgomery modular multiplication algorithm, which is currently one of the most widely used modular multiplication algorithms. Its basic idea is to replace time-consuming inversion and division operations with simple, fast addition and shift operations. To compute the modular product P = A*B mod N (where A, B and N are n-bit binary big integers and 0 < A, B < N), the algorithm first chooses an integer R coprime to N (typically R = 2ⁿ) and converts multiplication modulo N into multiplication modulo R. The Montgomery product is defined as: MonPro(A, B) = A*B*R^-1 mod N.

The computation structure of the Montgomery modular multiplication algorithm is shown in Fig. 1 and consists of the following three steps:

(1) Compute the N-residues of A and B: A' = A*R mod N = A*R²*R^-1 mod N = MonPro(A, R²),

B' = B*R mod N = B*R²*R^-1 mod N = MonPro(B, R²);

(2) Compute the Montgomery product of A' and B': P' = A'*B'*R^-1 mod N = MonPro(A', B');

(3) Convert P' into the modular product P: P = A*B mod N = A'*R^-1*B'*R^-1 mod N

= P'*R^-1 mod N = MonPro(P', 1).
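
The three steps above can be illustrated with a minimal Python sketch. It evaluates MonPro directly from its definition using a modular inverse, so it shows only the data flow between the steps, not an efficient implementation. The function names `mon_pro` and `mod_mul_via_montgomery` are illustrative and do not appear in the patent; `pow(r, -1, n)` assumes Python 3.8 or later.

```python
def mon_pro(a, b, n, r):
    """Montgomery product by definition: a * b * R^-1 mod N (illustrative only)."""
    return (a * b * pow(r, -1, n)) % n


def mod_mul_via_montgomery(a, b, n, r):
    """Compute a * b mod n by following steps (1) to (3)."""
    r2 = (r * r) % n                      # R^2 mod N, computed once per modulus
    a_res = mon_pro(a, r2, n, r)          # step (1): A' = A*R mod N
    b_res = mon_pro(b, r2, n, r)          #           B' = B*R mod N
    p_res = mon_pro(a_res, b_res, n, r)   # step (2): P' = A'*B'*R^-1 mod N
    return mon_pro(p_res, 1, n, r)        # step (3): P = P'*R^-1 mod N


# Small example with N = 97 and R = 2**7, which is coprime to N.
result = mod_mul_via_montgomery(45, 76, 97, 128)
```

In practice the conversions of step (1) and step (3) are amortised over many multiplications, for example inside a modular exponentiation.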

The core of the Montgomery modular multiplication algorithm is therefore the computation of the Montgomery product MonPro(A', B'); its detailed algorithm flow is shown in Fig. 2. As Fig. 2 shows, the computation of the Montgomery product can be divided into two key steps: computing the multi-precision product T ← A*B, and computing the Montgomery reduction P ← (T + M*N)/R.
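
The two key steps can be sketched as follows. Fig. 2 is not reproduced in this text, so the sketch assumes the usual formulation in which R = 2^k, N is odd, N' = -N^-1 mod R and M = T*N' mod R; the function name `mon_pro_redc` and the parameter `k` are illustrative.

```python
def mon_pro_redc(a, b, n, k):
    """MonPro(a, b) = a*b*R^-1 mod n with R = 2**k, n odd, 0 <= a, b < n < R."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r        # assumed constant N' = -N^-1 mod R
    t = a * b                             # key step 1: multi-precision product T
    m = ((t % r) * n_prime) % r           # M = T * N' mod R
    p = (t + m * n) >> k                  # key step 2: (T + M*N) is divisible by R
    return p - n if p >= n else p         # final conditional subtraction


n = 0xF1234567
a, b = 0x12345678, 0x0ABCDEF1
p = mon_pro_redc(a, b, n, 32)
```

The only division is by R, which is a right shift; no division by N is needed, which is the point of the Montgomery approach.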

In 1990, Dusse proposed a Montgomery modular multiplication algorithm operating on radix-r numbers, which uses n₀' = -n₀^-1 mod r in place of N' and adapts the algorithm accordingly. In 1996, Koc analysed and summarised the implementation methods of various Montgomery modular multiplication algorithms and identified five main improved Montgomery algorithms: SOS, CIOS, FIOS, FIPS and CIHS. The last two letters, OS/PS/HS, denote the scanning mode used for the multiplication: OS stands for operand scanning, PS for product scanning, and HS for hybrid scanning. The leading letters, S/CI/FI, denote how the multi-precision multiplication and the Montgomery reduction are integrated: S means separated, i.e. one part is computed completely before the other; CI means coarsely integrated, i.e. the two parts are interleaved at coarse granularity; and FI means finely integrated, i.e. the two parts are interleaved at fine granularity. All of these algorithms can be implemented with a series of operations such as multiplication (mul), addition (add), load and store, so high-performance implementations concentrate on optimising these operations. In embedded systems, where the number of available registers is limited, memory operations such as load/store are particularly important. In the methods currently in use the number of registers is fixed, and they generally fall into two kinds: those using 5 registers and those using n+4 registers (n being the number of words in the modulus N). Methods using fewer registers (5) need more memory accesses, while methods using more registers (n+4) need fewer memory accesses but often cannot be used because the number of registers required exceeds the number available on the processor. How to use registers dynamically according to the number of registers available on the processor and the size of n, so that the registers are fully exploited to reduce the number of memory accesses, is therefore a problem that currently needs to be solved.
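
As a small illustration of the word-level constant n₀' mentioned above, the sketch below assumes that the radix r is the word base 2^W and that n₀ is the least significant word of an odd modulus N. The function name `word_constant` is illustrative.

```python
def word_constant(n, W):
    """Return n0' = -n0^-1 mod 2**W, where n0 is the lowest W-bit word of n."""
    radix = 1 << W
    n0 = n % radix                         # least significant word of N
    return (-pow(n0, -1, radix)) % radix   # n must be odd for the inverse to exist


n = 0xF1234567
n0_prime = word_constant(n, 8)
```

Only one word of N enters this computation, which is why the word-level algorithms need just a single-word constant rather than the full N'.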

Summary of the invention

The object of the present invention is to overcome the defect that current Montgomery modular multiplication algorithms are not suited to resource-constrained embedded systems. To improve the computational efficiency of the Montgomery modular multiplication algorithm, a method is proposed that computes the Montgomery modular multiplication using coarse-grained integration and hybrid scanning. The method dynamically selects d so as to make full use of the registers available on the processor. Within a block, operand scanning shares operands and so reduces the number of operand loads; between blocks, product scanning keeps the accumulated products of each column in 2d+1 registers once the column has been computed, which reduces accesses to intermediate results. Overall, the number of memory accesses is reduced and the implementation efficiency of the algorithm is improved.

To achieve the above object, the invention provides a Montgomery modular multiplication calculation method suitable for embedded systems. The method comprises multi-precision multiplication and Montgomery reduction. Both parts are computed using hybrid scanning, with the inner loop using operand scanning and the outer loop using product scanning; and the two parts are combined by coarse-grained integration, i.e. their computation is interleaved.

In the above technical solution, the method specifically comprises:

Step 1) Let the big number N be an m-bit prime and let the word length of the processor be W bits; the size of N in words is then n = ⌈m/W⌉. A and B are two N-residues, i.e. 0 < A, B < N. The Montgomery coefficient is R = 2^(nW), and n₀' = -n₀^-1 mod 2^W, where n₀ is the least significant word of N. Select d, the size in words of the inner loop; the size of the outer loop is then r = ⌈n/d⌉, where ⌈x⌉ denotes the smallest integer greater than or equal to x;

Step 2) The modular multiplication of A and B is computed as C = A*B, M = C*n₀' mod R, C = (C + M*N)/R. Every d words of the operands A, B, M, N and C are treated as one unit:

E = (E[r-1], ..., E[0]) = ({A[n-1], A[n-2], ..., A[n-d]}, ..., {A[d-1], ..., A[2], A[1], A[0]})

F = (F[r-1], ..., F[0]) = ({B[n-1], B[n-2], ..., B[n-d]}, ..., {B[d-1], ..., B[2], B[1], B[0]})

G = (G[r-1], ..., G[0]) = ({M[n-1], M[n-2], ..., M[n-d]}, ..., {M[d-1], ..., M[2], M[1], M[0]})

H = (H[r-1], ..., H[0]) = ({N[n-1], N[n-2], ..., N[n-d]}, ..., {N[d-1], ..., N[2], N[1], N[0]})

Q = (Q[2r-1], Q[2r-2], ..., Q[1], Q[0]) = ({C[2n-1], C[2n-2], C[2n-3], ..., C[2n-d]}, ..., {C[d-1], ..., C[2], C[1], C[0]})

For each column q, 0 ≤ q ≤ 2r-1, all partial products are computed in turn:

E[k]*F[l] + G[k]*H[l] = (Q[q+1], Q[q]),

where k + l = q; this continues until all columns have been computed, giving C;

Step 3) Determine whether C ≥ N holds; if so, set C = C - N. In either case, proceed to step 4);

Step 4) Output the Montgomery product C of A and B.
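
Steps 1) to 4) can be modelled with the following Python sketch, which treats each d-word block as a single Python integer and scans the block columns q in order. It is a simplified model under several assumptions that go beyond the text: d divides n (the CIPOHS-a case), a block-wide constant -N^-1 mod 2^(dW) stands in for the word constant n₀', and within a column the terms G[k]*H[l] with already known G[k] are added before G[q] is derived, as in the standard product-scanning formulation. Register allocation and the operand scanning inside a block are not modelled. All function and variable names are illustrative.

```python
def split_blocks(x, r, bits):
    """Split integer x into r blocks of `bits` bits, least significant first."""
    mask = (1 << bits) - 1
    return [(x >> (i * bits)) & mask for i in range(r)]


def mont_mul_column_scan(a, b, n_mod, W, n_words, d):
    """Return a*b*R^-1 mod n_mod with R = 2**(n_words*W), scanning block columns."""
    assert n_words % d == 0, "sketch covers only the case where d divides n"
    r = n_words // d                       # outer-loop size
    bits = d * W                           # one block holds d words
    mask = (1 << bits) - 1
    # Assumption: block-wide constant used in place of the word constant n0'.
    n_prime = (-pow(n_mod, -1, 1 << bits)) % (1 << bits)
    E, F, H = (split_blocks(v, r, bits) for v in (a, b, n_mod))
    G = [0] * r
    acc, out = 0, []
    for q in range(2 * r - 1):             # columns that contain blocks
        ks = range(max(0, q - r + 1), min(q, r - 1) + 1)
        acc += sum(E[k] * F[q - k] for k in ks)            # all E[k]*F[l], k+l=q
        acc += sum(G[k] * H[q - k] for k in ks if k != q)  # G[k] already known
        if q < r:
            G[q] = ((acc & mask) * n_prime) & mask         # block of M for column q
            acc += G[q] * H[0]                             # low block becomes zero
        else:
            out.append(acc & mask)                         # one block of the result
        acc >>= bits                                       # carry into next column
    out.append(acc)                                        # top block (column 2r-1)
    c = sum(blk << (i * bits) for i, blk in enumerate(out))
    return c - n_mod if c >= n_mod else c                  # step 3)


n_mod = 0xF1234567
a, b = 0x12345678, 0x0ABCDEF1
c = mont_mul_column_scan(a, b, n_mod, W=8, n_words=4, d=2)
```

Because the low r blocks of C + M*N are zero by construction, the division by R reduces to discarding them, so only the upper blocks are ever collected.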

In the above technical solution, step 2) specifically comprises:

Step 2-1) Set q = 0;

Step 2-2) Denote by A the set of all index pairs (k, l) satisfying k + l = q: A = {(k, l) | k + l = q};

Step 2-3) Compute (Q[q+1], Q[q]) = ∑_A E[k]*F[l];

where, written out for d = 4,

E[k]*F[l] = (A[kd+3], A[kd+2], A[kd+1], A[kd]) * (B[ld+3], B[ld+2], B[ld+1], B[ld]);

Step 2-4) Determine whether q < r holds; if so, compute G[q] = Q[q]*n₀' and then go to step 2-5); otherwise, go directly to step 2-5);

Step 2-5) Compute (Q[q+1], Q[q]) = (Q[q+1], Q[q]) + ∑_A G[k]*H[l];

Step 2-6) Set q = q + 1; if q is less than or equal to 2r-2, set k = k + 1 and return to step 2-2); otherwise, go to step 2-7);

Step 2-7) Compute C = C/R; since R = 2^(nW),

C = (C[2n-1], C[2n-2], ..., C[n+1], C[n]).
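
Step 2-7) needs no arithmetic: because R = 2^(nW), dividing the 2n-word value C by R simply keeps words n to 2n-1. A minimal sketch, assuming words are stored least significant first and that the low n words are already zero after the reduction (helper names are illustrative):

```python
def to_words(x, count, W):
    """Split integer x into `count` words of W bits, least significant first."""
    mask = (1 << W) - 1
    return [(x >> (i * W)) & mask for i in range(count)]


def from_words(words, W):
    """Inverse of to_words."""
    return sum(w << (i * W) for i, w in enumerate(words))


def divide_by_r(c_words, n):
    """C / R for R = 2**(n*W): keep the upper n of the 2n words."""
    return c_words[n:2 * n]


W, n = 8, 4
c = 0x1234567800000000                   # low n words are zero, as after reduction
upper = divide_by_r(to_words(c, 2 * n, W), n)
```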

Compared with the prior art, the technical advantage of the present invention is as follows:

The hybrid scanning idea is applied to the computation of Montgomery modular multiplication using coarse-grained integration. By selecting d dynamically and making sensible use of the operands, the number of memory accesses in an embedded system is reduced and the implementation efficiency of the Montgomery modular multiplication algorithm is improved.

Brief description of the drawings

Fig. 1 is a schematic diagram of the computation structure of existing Montgomery modular multiplication;

Fig. 2 is a flow chart of the computation of the Montgomery product in existing Montgomery modular multiplication;

Fig. 3 is a schematic diagram of the modular multiplication calculation method of the present invention;

Fig. 4 is a structure diagram of CIPOHS-a (n=8, d=4), the coarsely integrated product-and-operand hybrid scanning Montgomery modular multiplication method of the present invention;

Fig. 5 is a structure diagram of CIPOHS-b (n=8, d=3), the coarsely integrated product-and-operand hybrid scanning Montgomery modular multiplication method of the present invention;

Fig. 6 is a schematic diagram of block-wise product scanning in the method of the present invention.

Detailed description

The method of the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

A Montgomery modular multiplication calculation method suitable for embedded systems comprises:

Step 1) Let the big number N be an m-bit prime and let the word length of the processor be W bits; the size of N in words is then n = ⌈m/W⌉. A and B are two N-residues, i.e. 0 < A, B < N. The Montgomery coefficient is R = 2^(nW), and n₀' = -n₀^-1 mod 2^W, where n₀ is the least significant word of N. Select d, the size in words of the inner loop (which uses operand scanning); the size of the outer loop (which uses product scanning) is then r = ⌈n/d⌉, where ⌈x⌉ denotes the smallest integer greater than or equal to x;

Step 2) Compute the modular multiplication result C of A and B. The calculation is:

1) C = A*B;

2) M = C*n₀' mod R;

3) C = (C + M*N)/R;

As shown in Fig. 3, A and B are two m-bit multi-precision integers, represented as A = (A[n-1], ..., A[2], A[1], A[0]) and B = (B[n-1], ..., B[2], B[1], B[0]). The product C = A*B can then be expressed as C = (C[2n-1], ..., C[2], C[1], C[0]).

Every d words of the operands A, B, M, N and C are treated as one unit. In the present embodiment, n = 8 and d = 4, and the representation is as follows:

E = (E[1], E[0]) = ({A[7], A[6], A[5], A[4]}, {A[3], A[2], A[1], A[0]})

F = (F[1], F[0]) = ({B[7], B[6], B[5], B[4]}, {B[3], B[2], B[1], B[0]})

G = (G[1], G[0]) = ({M[7], M[6], M[5], M[4]}, {M[3], M[2], M[1], M[0]})

H = (H[1], H[0]) = ({N[7], N[6], N[5], N[4]}, {N[3], N[2], N[1], N[0]})

Q = (Q[3], Q[2], Q[1], Q[0]) = ({C[15], C[14], C[13], C[12]}, {C[11], C[10], C[9], C[8]}, {C[7], C[6], C[5], C[4]}, {C[3], C[2], C[1], C[0]})

Computing C = A*B then becomes computing (Q[3], Q[2], Q[1], Q[0]) = (E[1], E[0]) * (F[1], F[0]).

Computing M = C*n₀' mod R becomes computing (G[1], G[0]) = (Q[1], Q[0]) * n₀'.

Computing C = C + M*N becomes computing

(Q[3], Q[2], Q[1], Q[0]) = (Q[3], Q[2], Q[1], Q[0]) + (G[1], G[0]) * (H[1], H[0]).
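
The grouping of words into blocks can be sketched as follows. Words are represented only by their labels and are listed least significant first; the helper name `group_words` is illustrative.

```python
def group_words(words, d):
    """Group a word list (least significant first) into blocks of d words."""
    return [words[i:i + d] for i in range(0, len(words), d)]


a_words = [f"A[{i}]" for i in range(8)]   # n = 8 words of operand A
E = group_words(a_words, 4)               # d = 4: two complete blocks E[0], E[1]
E_short = group_words(a_words, 3)         # d = 3: the last block has only two words
```

With d = 3 the last block is shorter than d, which corresponds to the incomplete blocks of the second embodiment.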

Next, C = A*B and C = C + M*N (with M = C*n₀' mod R) are computed in an interleaved manner using product scanning. For column q, 0 ≤ q ≤ 2r-1, all partial products

E[k]*F[l] + G[k]*H[l] = (Q[q+1], Q[q]) (where k + l = q) are computed, after which the next column is computed, until all columns have been completed.

The algorithm structure is shown in Fig. 4. Each shaded block ①, ②, etc. in the diamond structure, and each large box in the multiplication structure, represents a product E[k]*F[l] or G[k]*H[l]. When the whole diamond structure is computed with the shaded block as the basic unit, its scale is r = n/d = 2. Between blocks, coarse-grained product scanning is used: first all E[k]*F[l] of column 0 are computed (block ①), then all G[k]*H[l] (block ②); next all E[k]*F[l] of column 1 (blocks ③ and ④), then all G[k]*H[l] (blocks ⑤ and ⑥); finally all E[k]*F[l] of column 2 (block ⑦), then all G[k]*H[l] (block ⑧).
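
The block execution order described above can be generated with a short sketch. The order of the blocks within one column (increasing k) is an assumption, since it depends on the labelling in Fig. 4; the function name `block_schedule` is illustrative.

```python
def block_schedule(r):
    """List blocks in execution order: per column, all E*F blocks, then all G*H blocks."""
    order = []
    for q in range(2 * r - 1):                       # columns that contain blocks
        pairs = [(k, q - k)
                 for k in range(max(0, q - r + 1), min(q, r - 1) + 1)]
        order += [("E*F", k, l) for k, l in pairs]   # multiplication part of column q
        order += [("G*H", k, l) for k, l in pairs]   # reduction part of column q
    return order


schedule = block_schedule(2)                         # n = 8, d = 4, so r = 2
```

For r = 2 this yields eight blocks, matching the count of blocks ① to ⑧ in the description.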

Each shaded block ①, ②, etc. in the figure, i.e. each E[k]*F[l] (or G[k]*H[l]), is computed using operand scanning, with scale d = 4. The computation proceeds row by row: within each row, one operand word B[i] is held fixed and multiplied by every word A[j] (0 ≤ j < d) of the other operand; once all products of the row have been computed, the next row is computed.
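
The row-by-row operand scanning inside one block can be sketched as below. The list `acc` models the registers that hold the block result and the counter `loads` models operand loads from memory; both are illustrative simplifications, and the function name is not from the patent.

```python
def block_operand_scan(a_words, b_words, W):
    """Row-wise product of two word blocks (least significant word first)."""
    mask = (1 << W) - 1
    d = len(a_words)
    acc = [0] * (d + len(b_words))        # block result, kept in "registers"
    loads = d                             # the words of A are loaded once and reused
    for i, b_i in enumerate(b_words):     # one row per word of B
        loads += 1                        # B[i] is loaded once for the whole row
        carry = 0
        for j, a_j in enumerate(a_words):
            t = acc[i + j] + a_j * b_i + carry
            acc[i + j] = t & mask
            carry = t >> W
        acc[i + d] = carry                # position not yet written by earlier rows
    return acc, loads


a_blk = [0x78, 0x56, 0x34, 0x12]          # 0x12345678 as four 8-bit words
b_blk = [0xF1, 0xDE, 0xBC, 0x0A]          # 0x0ABCDEF1 as four 8-bit words
acc, loads = block_operand_scan(a_blk, b_blk, 8)
```

For a complete block with d = 4 the counter ends at 2d = 8 loads, in line with the per-block load count used in the memory access analysis.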

In another embodiment, with n = 8 and d = 3, as shown in Fig. 5, the whole multiplication is divided into a number of blocks ①, ②, etc. Between these blocks, coarsely integrated product scanning is still used, i.e. the blocks are executed in the numerical order marked in the figure, and the scale is r = ⌈n/d⌉ = 3. Because n is not divisible by d, not all blocks are complete: in Fig. 5, block ① is a complete d*d block, whereas block ⑦ is an incomplete block. According to the completeness of the blocks, the columns of the product scan can be divided into three parts. The first part consists of columns 1 to r-1, in which all blocks are complete and of size d*d. The second part consists of columns r to 2r-2, in which the topmost and bottommost blocks are incomplete, of size [d-(rd-n)]*d, and the remaining blocks are complete, of size d*d. The third part is column 2r-1, which contains two incomplete blocks of size [d-(rd-n)]*[d-(rd-n)]. Inside each block, operand scanning is still used; for an incomplete block, operand scanning is carried out along the long side.

Step 2) specifically comprises:

Step 2-1) Set q = 0;

Step 2-3) Compute (Q[q+1], Q[q]) = ∑_A E[k]*F[l];

Operand scanning is used: the computation proceeds row by row; within each row, one operand word B[i] is held fixed and multiplied by every word A[j] (0 ≤ j < d) of the other operand, and once all products of the row have been computed the next row is computed. Each E[k]*F[l] and G[k]*H[l] is computed

as shown in Fig. 6:

E[k]*F[l] = (A[kd+3], A[kd+2], A[kd+1], A[kd]) * (B[ld+3], B[ld+2], B[ld+1], B[ld])

Step 2-5) Compute (Q[q+1], Q[q]) = (Q[q+1], Q[q]) + ∑_A G[k]*H[l];

Step 2-6) Set q = q + 1; if q is less than or equal to 2r-1, set k = k + 1 and return to step 2-2); otherwise, go to step 2-7);

Step 2-7) Compute C = C/R; since R = 2^(nW),

C = (C[15], C[14], C[13], C[12], C[11], C[10], C[9], C[8]);

Step 4) Output the Montgomery product C of A and B.

Depending on whether n/d is an integer, the method of the present invention is divided into two cases. In the first case n/d is an integer, i.e. r = n/d; we call this CIPOHS-a. In the second case n/d is not an integer, i.e. r = ⌈n/d⌉ > n/d; we call this CIPOHS-b. The total number of memory accesses of the two methods is analysed below.

1. CIPOHS-a method

As shown in Fig. 4, the number of memory accesses inside each block is analysed first. Each block uses d+1 registers to hold operands: d registers hold the words of operand A, and the remaining register holds, in turn, each word of the multi-precision representation of operand B. Each operand word is therefore loaded only once within a block, so the number of loads per block is 2d. The result of each block is kept directly in 2d+1 registers, so the number of stores of intermediate results per block is 0. The outer loop is analysed next. The outer loop size is r = n/d, so there are 2r² = 2(n/d)² blocks in total, processed using coarsely integrated product scanning. In this example there are 2*2² = 8 blocks, executed in the order ①, ②, ③, ④, ⑤, ⑥, ⑦, ⑧ marked in the figure. Since each block requires 2d loads and there are 2(n/d)² blocks, the total number of loads is 2d*2(n/d)² = 4n²/d. The only values that must be stored in the whole method are the n words of M (M ← C*n₀' mod 2^W, word by word) and the n+1 words of the final result C, so the total number of stores is n + n + 1 = 2n+1. The total number of memory accesses (loads and stores) is therefore 4n²/d + 2n + 1.

2. CIPOHS-b method

As shown in Fig. 5, the first part comprises the first r-1 columns, in which all blocks are complete, such as blocks ①, ③ and ④. Operand scanning is used within each block, so the number of loads per block is 2d; there are r*(r-1) blocks in total, so the total number of loads is 2d*r*(r-1). The second part comprises columns r to 2r-2. In each column the topmost and bottommost blocks are incomplete, of size [d-(rd-n)]*d, and within these blocks operand scanning is carried out along the direction of length d. For block ⑦ in Fig. 5, for example, A[0], A[1] and A[2] are first loaded into registers; the products of B[6] with A[0], A[1], A[2] are computed first, followed by the products of B[7] with A[0], A[1], A[2]. Each incomplete block therefore requires 2d-(rd-n) loads, and there are 4(r-1) such blocks. The remaining blocks of the second part are all complete and use normal operand scanning, with 2d loads per block; there are (r-2)*(r-1) of them. The total number of loads for this part is therefore 4(r-1)*[2d-(rd-n)] + 2d*(r-2)*(r-1). The third part comprises only column 2r-1, which contains just 2 incomplete blocks of size [d-(rd-n)]*[d-(rd-n)]. These blocks are scanned using operand scanning of length [d-(rd-n)], so the number of loads is 4[d-(rd-n)]. Table 1 summarises the loads of these three parts; the total number of loads is 4rn, and the number of stores is the same as for CIPOHS-a, namely 2n+1, so the total number of memory accesses is 4rn + 2n + 1.
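
The three load counts stated above can be checked numerically. The sketch below evaluates the formulas from the text and confirms that they sum to 4rn; the function name `cipohs_b_loads` is illustrative.

```python
def cipohs_b_loads(n, d):
    """Load counts of the three column groups, using the formulas in the text."""
    r = -(-n // d)                          # r = ceil(n / d)
    gap = r * d - n                         # number of words missing from the last block
    part1 = 2 * d * r * (r - 1)             # columns 1 .. r-1: complete blocks only
    part2 = (4 * (r - 1) * (2 * d - gap)    # columns r .. 2r-2: incomplete blocks
             + 2 * d * (r - 2) * (r - 1))   # plus the remaining complete blocks
    part3 = 4 * (d - gap)                   # column 2r-1: two small incomplete blocks
    return part1, part2, part3


parts = cipohs_b_loads(8, 3)                # the n = 8, d = 3 embodiment, r = 3
total = sum(parts)
```

For n = 8 and d = 3 the three parts are 36, 52 and 8 loads, giving 96 = 4 * 3 * 8. The same formulas also reproduce the CIPOHS-a total 4n²/d when d divides n.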

Table 1

Combining CIPOHS-a and CIPOHS-b, the total number of memory accesses can be written uniformly as 4n⌈n/d⌉ + 2n + 1. Below we compare the several algorithms proposed by Koc with the CIPOHS algorithm proposed here in terms of the number of registers used and the number of memory accesses, as shown in Table 2.

Table 2

As Table 2 shows, among the existing algorithms compared, CIOS needs the fewest memory accesses, 2n² + 3n + 1, but it needs the most registers, n+4. When n is large, the number of available registers is less than n+4, and this algorithm can no longer be used. The CIOS-5reg and FIPS algorithms use few registers, only 5, but their memory access counts are larger. The CIPOHS algorithm proposed by the present invention solves this problem: it selects d dynamically according to the number of available registers and makes sensible use of them to reduce the number of memory accesses. The number of memory accesses of CIPOHS is 4n⌈n/d⌉ + 2n + 1, while that of CIOS is 2n² + 3n + 1. When d is an integer greater than 1, CIPOHS needs fewer memory accesses than CIOS, and the larger d is, the fewer memory accesses CIPOHS needs. Since the numbers of multiplication and addition instructions used by these algorithms are essentially the same and CIPOHS needs the fewest memory accesses, it is the most efficient of the algorithms.
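
The two access counts can be compared for a concrete size with the sketch below. It uses the CIOS count 2n² + 3n + 1 quoted above and the CIPOHS count 4rn + 2n + 1 with r = ⌈n/d⌉ from the preceding analysis; the choice n = 32 (for example a 1024-bit modulus on a 32-bit processor) and the function names are illustrative.

```python
def cipohs_accesses(n, d):
    """CIPOHS memory accesses: 4*r*n + 2*n + 1 with r = ceil(n / d)."""
    r = -(-n // d)
    return 4 * r * n + 2 * n + 1


def cios_accesses(n):
    """CIOS memory accesses as quoted in the text: 2*n^2 + 3*n + 1."""
    return 2 * n * n + 3 * n + 1


n = 32
cios = cios_accesses(n)
by_d = {d: cipohs_accesses(n, d) for d in (2, 4, 8)}
```

For n = 32 this gives 2145 accesses for CIOS against 2113, 1089 and 577 for CIPOHS with d = 2, 4 and 8, so the saving grows quickly as more registers allow a larger d.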

In summary, the Montgomery modular multiplication calculation method suitable for embedded systems of the present invention interleaves the computation of the multi-precision multiplication and the Montgomery reduction using coarse-grained integration, and within both parts uses hybrid scanning that combines product scanning and operand scanning. By selecting d according to the number of available registers, the registers are fully exploited to reduce the number of memory accesses of the algorithm, further improving its efficiency.

The above are merely specific embodiments of the present invention and are not intended to limit its scope of protection. Those skilled in the art will understand that modifications or equivalent substitutions made to the technical solution of the present invention without departing from the principle of the invention do not depart from the spirit and scope of the technical solution of the present invention and shall all fall within the scope of protection of the present invention.

Claims

1. A Montgomery modular multiplication calculation method suitable for embedded systems, the method comprising multi-precision multiplication and Montgomery reduction, wherein both parts are computed using hybrid scanning, the inner loop using operand scanning and the outer loop using product scanning; and the two parts are combined by coarse-grained integration, i.e. their computation is interleaved.

2. The Montgomery modular multiplication calculation method suitable for embedded systems according to claim 1, characterized in that the method specifically comprises:

Step 1) Let the big number N be an m-bit prime and let the word length of the processor be W bits; the size of N in words is then n = ⌈m/W⌉. A and B are two N-residues, i.e. 0 < A, B < N. The Montgomery coefficient is R = 2^(nW), and n₀' = -n₀^-1 mod 2^W, where n₀ is the least significant word of N. Select d, the size in words of the inner loop; the size of the outer loop is then r = ⌈n/d⌉, where ⌈x⌉ denotes the smallest integer greater than or equal to x;

Q = (Q[2r-1], Q[2r-2], ..., Q[1], Q[0]) = ({C[2n-1], C[2n-2], C[2n-3], ..., C[2n-d]}, ..., {C[d-1], ..., C[2], C[1], C[0]})

For each column q, 0 ≤ q ≤ 2r-1, all partial products are computed in turn:

E[k]*F[l] + G[k]*H[l] = (Q[q+1], Q[q]),

where k + l = q; this continues until all columns have been computed, giving C;

Step 4) Output the Montgomery product C of A and B.

3. The Montgomery modular multiplication calculation method suitable for embedded systems according to claim 2, characterized in that step 2) specifically comprises:

Step 2-1) Set q = 0;

Step 2-3) Compute (Q[q+1], Q[q]) = ∑_A E[k]*F[l];

wherein,

Step 2-5) Compute (Q[q+1], Q[q]) = (Q[q+1], Q[q]) + ∑_A G[k]*H[l];

Step 2-7) Compute C = C/R; since R = 2^(nW),

C = (C[2n-1], C[2n-2], ..., C[n+1], C[n]).