CN1392472A

CN1392472A - Montgomery modular multiplication algorithm for VLSI and VLSI structure of a smart card modular multiplier

Info

Publication number: CN1392472A
Application number: CN 02125399
Authority: CN
Inventors: 李树国; 周润德; 孙义和
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2002-07-31
Filing date: 2002-07-31
Publication date: 2003-01-22
Anticipated expiration: 2022-07-31
Also published as: CN1230736C

Abstract

The present invention relates to encryption and decryption technology. It is a highly parallel algorithm suited to VLSI implementation. The three large-number multiplications of the original Montgomery modular multiplication are decomposed into 2s²+s small-integer multiplications. The VLSI structure of the smart card modular multiplier is a high-radix modular multiplier that uses 32-bit multipliers to perform 1024-bit modular multiplication and has a three-stage pipeline in its data path. Compared with existing structures, the invention reduces both the chip area and the number of clock cycles per modular multiplication, and it can perform digital signature and verification with the RSA algorithm in a smart card.

Description

Montgomery modular multiplication algorithm for VLSI and VLSI structure of a smart card modular multiplier

Technical field

The Montgomery modular multiplication algorithm for VLSI and the VLSI structure of the smart card modular multiplier belong to the technical field of smart card encryption and decryption.

Background technology

1 Public-key encryption technology

In 1976, M. E. Hellman, W. Diffie and R. Merkle of Stanford University proposed the "public-key cryptosystem", also called the asymmetric cryptosystem or two-key cryptosystem. In this cryptosystem the encryption and decryption capabilities are separated: encryption and decryption are performed with two different keys, and deriving one key from the other is infeasible. Each user of an asymmetric cryptosystem has a selected key pair. One key is made public and is called the public key; the other is kept secret by the user and is called the private key. A public-key cryptosystem has the following advantages: (1) Key distribution is simple. Because the encryption and decryption keys differ, and the decryption key cannot be derived from the encryption key, the encryption key can be distributed like a telephone directory. (2) Fewer keys have to be kept secret. Each user only needs to keep his own decryption key. For example, for N smart cards and M hosts to authenticate one another, only (N+M) key pairs need to be generated. (3) The public key lets an asymmetric cryptosystem be used in open environments. (4) Digital signatures can be realized. A digital signature is a security measure that lets the receiver prove to a third party the authenticity of a received message and of its sender. It resolves disputes that arise from dishonesty of the sender or the receiver, that is, it ensures that neither party can repudiate or forge a message to serve its own interests.

Modern cryptography solves the confidentiality problem with keys. A key is denoted K and can take many values; the range of possible values of K is called the key space. Both the encryption and the decryption operation use the key (that is, the operations depend on the key, which is indicated by the subscript K), so the encryption and decryption functions become:

$E_{K1}(M) = C$

$D_{K2}(C) = M$

where $E_{K1}$ is the encryption function that depends on key K1 and M (message) is the plaintext to be encrypted; $D_{K2}$ is the decryption function that depends on key K2 and C (ciphertext) is the ciphertext after encryption. The encryption/decryption process is shown in Fig. 1.

Many algorithms implement public-key cryptosystems; the most typical are the RSA algorithm and elliptic curve algorithms. The RSA algorithm was proposed in February 1978 by three researchers at the Massachusetts Institute of Technology (MIT), Rivest, Shamir and Adleman, and is named after the first letters of their surnames. It can be used for both encryption and digital signatures. The security of RSA rests on the difficulty of factoring large integers; its public and private keys are functions of a pair of large primes (100 to 200 digits or more). Many RSA encryption chips have been produced in hardware, and the correctness of the RSA algorithm has been borne out both in practice and in theory.

A public-key encryption/decryption system contains a large-number modular exponentiation $P^e \bmod N$, and this operation accounts for the heavy computational load of public-key encryption and decryption. The speed of large-number modular exponentiation determines the practical performance of public-key encryption and decryption. Judging from research at home and abroad, the high security of public-key cryptography has made large-number modular exponentiation very widely used.

2 Decomposition of the large-number modular exponentiation $P^e \bmod N$

Encryption and decryption in a public-key cryptosystem amount to large-number modular exponentiation, and the speed of this operation ($P^e \bmod N$) determines the usability of public-key encryption. Large-number modular exponentiation ($P^e \bmod N$) can be decomposed into large-number modular multiplications AB mod N, in the following form:

Begin

  C = 1;  // C is first assigned the constant 1

  for i = 0 to u-1 do {

    if (e_i = 1) C = XC (mod N)  // first form of AB mod N

    X = XX (mod N)  // second form of AB mod N

  }

  return C

end

Here $e = (e_n e_{n-1} \ldots e_i \ldots e_0)$. The decomposition of $X^e \bmod N$ shows that there is one basic operation form, AB mod N. The product AB is an ordinary multiplication of two numbers, and research on multiplication algorithms is mature and general, so once the product X = AB is available, the modular reduction X mod N becomes the basic operation. Usually, given X, the value X mod N is obtained by repeatedly subtracting N from X; this is commonly called modular subtraction. In practice X = AB, so the multiplication AB is carried out first and the subtraction afterwards, and the combined operation is called modular multiplication. Modular multiplication AB mod N is therefore a problem worth studying.
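The decomposition above can be sketched in Python as follows. This is a minimal illustration, not the patent's hardware method; the function name `mod_exp_right_to_left` is an assumption, and each `%` stands in for one modular multiplication AB mod N.

```python
def mod_exp_right_to_left(x: int, e: int, n: int) -> int:
    """Compute x**e mod n by scanning the bits of e from e_0 upward."""
    c = 1
    x %= n
    while e > 0:
        if e & 1:
            c = (x * c) % n  # first form of AB mod N: C = XC (mod N)
        x = (x * x) % n      # second form of AB mod N: X = XX (mod N)
        e >>= 1
    return c % n


# Compare against Python's built-in three-argument pow.
print(mod_exp_right_to_left(7, 560, 561) == pow(7, 560, 561))
```

Every loop iteration costs one or two modular multiplications, which is why the speed of AB mod N dominates the cost of the exponentiation.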

3 The Montgomery modular multiplication algorithm

The RSA algorithm is currently one of the more successful public-key cryptosystems in both theory and practice; its security rests on the number-theoretic difficulty of factoring a large integer into its prime factors. It has a key pair: the public key or encryption key (e, N) and the private key or decryption key (d, N).

For plaintext m, the encryption process is $c \equiv E(m) = m^e \bmod N$, where c is the ciphertext.

The decryption process is $m \equiv D(c) = c^d \bmod N$, where m is the plaintext. The consistency of encryption and decryption can be proved with Euler's theorem. RSA encryption and decryption are in essence the computation of the modular powers $m^e \bmod N$ or $c^d \bmod N$. Because the operands m, e, c, d and N are 1024 bits or longer, computing the modular power directly is impractical; it must first be decomposed into basic large-number modular multiplications AB mod N. The Montgomery algorithm was proposed precisely to solve the large-number modular multiplication AB mod N.
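As a small illustration of the encryption and decryption formulas, the sketch below runs one round trip with deliberately tiny textbook parameters. The primes 61 and 53, the exponent 17 and the helper names are assumptions chosen for readability; they are not from the patent, and real keys are 1024 bits or longer.

```python
# Toy parameters (assumed for illustration only; far too small to be secure).
p, q = 61, 53
N = p * q                      # modulus
phi = (p - 1) * (q - 1)        # Euler's totient of N
e = 17                         # public exponent, coprime to phi
d = pow(e, -1, phi)            # private exponent (needs Python 3.8+)


def encrypt(m: int) -> int:
    return pow(m, e, N)        # c = m^e mod N


def decrypt(c: int) -> int:
    return pow(c, d, N)        # m = c^d mod N


message = 65
print(decrypt(encrypt(message)) == message)
```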

Original Montgomery modular multiplication algorithm

Let N be the modulus with N > 1, and let R be a radix coprime to N; usually $R = 2^u$, where u is the number of bits of N. $R^{-1}$ and $N'$ satisfy $0 < R^{-1} < N$, $0 < N' < R$ and $R R^{-1} - N N' = 1$, that is, $R R^{-1} \equiv 1 \pmod N$ and $N N' \equiv -1 \pmod R$. For a given large integer T with $0 \le T < RN$, the Montgomery algorithm is as follows:

function REDC(T)

  m ← (T mod R) N′ mod R

  t ← (T + mN)/R

  if t ≥ N then return t - N else return t
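A minimal Python sketch of REDC follows. The helper `montgomery_constants` and the small test values are assumptions for illustration; N′ is obtained from the relation N N′ ≡ -1 (mod R), using the built-in `pow` with a negative exponent (Python 3.8+).

```python
def montgomery_constants(n: int, u: int):
    """Return (R, N') with R = 2**u and N * N' ≡ -1 (mod R); n must be odd."""
    r = 1 << u
    n_prime = (-pow(n, -1, r)) % r
    return r, n_prime


def redc(t: int, n: int, r: int, n_prime: int) -> int:
    """Montgomery reduction: returns t * R^-1 mod n for 0 <= t < r * n."""
    m = ((t % r) * n_prime) % r
    t2 = (t + m * n) // r      # exact division: t + m*n is a multiple of R
    return t2 - n if t2 >= n else t2


n = 97
r, n_prime = montgomery_constants(n, 7)   # R = 128 > N
a, b = 45, 76
print(redc(a * b, n, r, n_prime) == (a * b * pow(r, -1, n)) % n)
```

The result is the Montgomery product ABR⁻¹ mod N rather than AB mod N, so a caller still has to remove the factor R⁻¹.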

On the surface this algorithm contains only two large-number multiplications, TN′ and mN. In modular multiplication, however, T = AB with 0 ≤ A < N and 0 ≤ B < N, so the algorithm performs three large-number multiplications in total. When A, B and N are integers of 1024 bits or more, multiplying such large numbers is difficult to implement in hardware, so the large numbers must be decomposed. In addition, the algorithm returns the Montgomery product $ABR^{-1} \bmod N$ rather than the modular product AB mod N, so in use the constant factor $R^{-1}$ of the Montgomery product must be eliminated to obtain the modular product.

At present, many patents on large-number modular multiplication have been filed abroad and few domestically. There are two domestic patents on large-number modular multiplication: "High-speed modular multiplication method and device" (96109838.4) and "Circuit and device for modular multiplication" (99808871.4). Compared with these two patents, the present application is more advanced and is suited to implementation in very-large-scale integrated circuits (VLSI).

As smart cards become more widespread, data security in smart card transactions becomes more and more important. Because the public-key cryptosystem RSA (Rivest, Shamir, Adleman) provides digital signature, message verification and authentication, it is increasingly necessary for smart cards to encrypt data with RSA. At present, however, RSA encryption on smart cards faces two main problems: 1) the VLSI (Very Large Scale Integration) implementation area of the RSA cryptographic coprocessor is too large; 2) the modular exponentiation speed of the RSA cryptographic coprocessor is low. This application analyzes and improves the Montgomery algorithm for large-number modular multiplication and proposes a new high-radix modular multiplier structure. The structure reduces not only the chip area but also the number of clock cycles of modular exponentiation, and it is suitable for smart card applications.

Summary of the invention

The object of the invention is to propose, for the design of a dedicated modular multiplier for smart cards, a Montgomery modular multiplication algorithm for VLSI and a VLSI structure of a smart card modular multiplier. Starting from FIPS (Finely Integrated Product Scanning), the single-processor software implementation of the Montgomery algorithm proposed by Koc, the invention proposes a highly parallel algorithm for VLSI implementation, also called the improved FIPS algorithm.

The Montgomery modular multiplication algorithm proposed by the invention is characterized in that:

It is a highly parallel algorithm suited to VLSI implementation. In essence it decomposes the three large-number multiplications of the original algorithm into $2s^2+s$ small-integer multiplications. It comprises the following steps in order:

Let A and B each be s-digit radix-r integers:

$A = (a_{s-1} a_{s-2} \ldots a_1 a_0)$, $B = (b_{s-1} b_{s-2} \ldots b_1 b_0)$

The modulus N is also an s-digit radix-r integer,

$N = (n_{s-1} n_{s-2} \ldots n_1 n_0)$, and $R = r^s$.

Then N < R, $n_0 n_0' \bmod r = -1$, and let A < N, B < N.

S := 0, n'[0] := -n[0]^-1 mod r  // compute the modular inverse of n_0

(A) Use s^2+2s multiplications to compute the low s digits of the product; the intermediate results are denoted m[i]:

A.1 for i = 0, ..., s-1

A.2 for j = 0, ..., i-1

A.2.1 S := S + a[j]b[i-j] + m[j]n[i-j]

A.3 S := S + a[i]b[0]

A.4 m[i] := S n'[0] mod r

A.5 S := S + m[i]n[0]

A.6 S := S/r  // shift right by one radix-r digit

(B) Use s^2-s multiplications to compute the high s digits of the product, stored in the variable m:

B.1 for i = s, ..., 2s-1

B.2 for j = i-s+1, ..., s-1

B.2.1 S := S + a[j]b[i-j] + m[j]n[i-j]

B.3 m[i-s] := S mod r

B.4 S := S/r  // shift right by one radix-r digit

(C) Use s additions to adjust the Montgomery product from [0, 2N) to [0, N):

C.1 t0 := S mod r  // t0 is one radix-r digit

C.2 carry Cy := 1

C.3 for j = 0, ..., s-1

C.3.1 (Cy, b[j]) := m[j] + not(n[j]) + Cy  // Cy is the carry bit; addition with carry

t0 := t0 + not(0) + Cy

C.4 if t0 = 0

then return (b[s-1] b[s-2] ... b[1] b[0])

otherwise return (m[s-1] m[s-2] ... m[1] m[0])
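Steps A, B and C can be modelled in software roughly as follows. This is a sketch under stated assumptions, not the patent's hardware: the names `fips_monpro`, `to_digits` and `from_digits` are hypothetical, the accumulator S is an unbounded Python integer rather than a fixed-width register, the difference is written to a separate list `diff` instead of overwriting b, and t0 is kept to one radix-r digit by reducing it mod r.

```python
def to_digits(x: int, s: int, k: int) -> list:
    """Split x into s radix-2**k digits, least significant first."""
    mask = (1 << k) - 1
    return [(x >> (k * i)) & mask for i in range(s)]


def from_digits(digits: list, k: int) -> int:
    return sum(v << (k * i) for i, v in enumerate(digits))


def fips_monpro(A: int, B: int, N: int, s: int, k: int) -> int:
    """Return A*B*R^-1 mod N with R = 2**(k*s); N odd, A < N, B < N."""
    r = 1 << k
    a, b, n = to_digits(A, s, k), to_digits(B, s, k), to_digits(N, s, k)
    m = [0] * s
    n0_prime = (-pow(n[0], -1, r)) % r   # n'[0] = -n[0]^-1 mod r
    S = 0
    for i in range(s):                   # Part A: low s digits
        for j in range(i):
            S += a[j] * b[i - j] + m[j] * n[i - j]
        S += a[i] * b[0]
        m[i] = (S * n0_prime) % r
        S += m[i] * n[0]
        S //= r                          # low digit is now zero
    for i in range(s, 2 * s):            # Part B: high s digits, reusing m
        for j in range(i - s + 1, s):
            S += a[j] * b[i - j] + m[j] * n[i - j]
        m[i - s] = S % r
        S //= r
    t0, cy = S % r, 1                    # Part C: m - N as m + not(N) + 1
    diff = [0] * s
    for j in range(s):
        total = m[j] + (r - 1 - n[j]) + cy
        diff[j], cy = total % r, total // r
    t0 = (t0 + (r - 1) + cy) % r
    return from_digits(diff if t0 == 0 else m, k)


N, A, B = 0xF1234567, 0x12345678, 0x0FEDCBA9
R = 1 << 32
print(fips_monpro(A, B, N, 4, 8) == (A * B * pow(R, -1, N)) % N)
```

Only single-digit products appear in the loops, which is the point of the decomposition: a k-bit multiplier suffices regardless of the operand length.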

The VLSI structure of the smart card modular multiplier proposed by the invention is characterized in that:

It is a high-radix modular multiplier that uses 32-bit multipliers to perform 1024-bit modular multiplication and whose data path uses a three-stage pipeline. The first stage consists of two 32-bit multipliers, whose inputs are a, b and m, n respectively, and two 64-bit registers whose inputs are connected to the outputs of the two multipliers. The second stage consists of a 64-bit adder, which sums the two 64-bit products and produces a carry Cy, and a 65-bit register connected to the output of this 64-bit adder. The third stage consists of a 76-bit adder, whose input is connected to the output of the 65-bit register and which computes the total accumulated sum, and a 76-bit register that is interconnected with the 76-bit adder and whose output delivers the product result.

Use has shown that it achieves its intended purpose.

Description of drawings

Fig. 1: the encryption/decryption process using two keys.

Fig. 2: the improved FIPS modular multiplication method for s = 3.

Fig. 3 to Fig. 5: program flowcharts of the Montgomery modular multiplication algorithm for VLSI proposed by the invention.

Fig. 6: structural diagram of the RSA modular multiplier MonPro.

Fig. 7: program flowchart of the modular exponentiation $M^e \bmod N$ with $R = r^s = 2^{ks}$.

Fig. 8: structural diagram of the RSA encryption processor.

Embodiment

See Fig. 2, which shows an example of the improved FIPS method for s = 3. The method is divided into three parts, A, B and C. Part A corresponds to the computation to the right of the dash-dot line in Fig. 2 and computes the low s words of the product; Part B corresponds to the computation to the left of the dash-dot line and computes the high s words of the product. To save the storage needed for the high s words, the storage variable m is reused, and the final Montgomery product is stored in (m[s-1] m[s-2] ... m[1] m[0]). Because the Montgomery product is only guaranteed to lie in [0, 2N), it must still be adjusted into [0, N). Part C performs this adjustment.

The computational bottleneck of the algorithm is the number of multiplications. Part A requires $s^2+2s$ multiplications and Part B requires $s^2-s$, for $2s^2+s$ multiplications in total. Part C requires s additions to adjust the modular product from [0, 2N) to [0, N).

The essence of the improved FIPS algorithm is that it decomposes the 3 large-number multiplications of the original Montgomery algorithm into $2s^2+s$ small-integer multiplications, which favors VLSI implementation. Fig. 3 to Fig. 5 are flowcharts of its program implementation.

The modular multiplier is the core arithmetic unit of the RSA cryptographic coprocessor. The speed of the modular multiplication AB mod N depends on its number of clock cycles, so the design goal of the modular multiplier is to minimize the clock cycles per modular multiplication within a given area. In the VLSI algorithm, A, B and N are radix-r integers, so r is called the radix, and usually $r = 2^k$. If $r = 2^k$ and $k \ge 16$, r is called a high radix, and a high-radix modular multiplier is a modular multiplier based on a high radix. In this design the large numbers A, B and N each have u binary digits; for data security we choose u = 1024 bits. A, B and N can then be expressed as multi-precision numbers made up of s = u/k words, $A = (a_{s-1}, a_{s-2}, \ldots, a_i, \ldots, a_1, a_0)_r$ with $a_i = (a_{k-1}, a_{k-2}, \ldots, a_1, a_0)$, that is, each $a_i$ ($0 \le i < s$) represents k binary digits. The larger the value of k, the larger the hardware scale of the VLSI implementation.

In the VLSI algorithm, with s = u/k, the total number of multiplications $2s^2+s$ becomes $2(u/k)^2 + u/k$. For fixed u, the number of multiplications $2(u/k)^2 + u/k$ decreases as k increases, and the computation time decreases with it, which is desirable. However, because the hardware scale of the VLSI implementation is proportional to k, too large a value of k makes the area and the delay of the VLSI implementation large. The value of k should therefore reduce the number of clock cycles as far as possible under the area constraint.

Choose $k = \sqrt{u}$, so that $2(u/k)^2 + u/k$ becomes $2u + \sqrt{u}$. The reason for taking the square root of u is that, when $\sqrt{u}$ is neglected (for $u \ge 1024$, $\sqrt{u}$ is very small compared with u), the number of multiplications changes from the nonlinear $u^2$ to the linear u, which greatly helps to reduce the number of clock cycles. With $k = \sqrt{u}$, synthesis with the TSMC 0.35 μm standard cell library shows that the hardware area of the cryptographic coprocessor is about 38K gates. If k is increased further and synthesis is carried out under the same experimental conditions, the hardware area of the coprocessor's modular multiplier becomes larger. We therefore set $k = \sqrt{u}$ in this design.

Since u = 1024 bits and $k = \sqrt{u} = 32$, the radix is $r = 2^k = 2^{32}$, so a 32-bit multiplier is used to perform 1024-bit modular multiplication. In the VLSI algorithm, Part A and Part B both contain the product terms a[j]b[i-j] and m[j]n[i-j]. Because these two product terms have no data dependency, two 32-bit multipliers can compute them simultaneously in parallel, as shown in Fig. 6, so two multiplications are completed in one clock cycle.
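To make the trade-off concrete, the short sketch below evaluates the multiplication count 2(u/k)² + u/k for u = 1024 and a few word sizes k. The function name and the list of k values are assumptions for illustration.

```python
def multiplication_count(u: int, k: int) -> int:
    """Number of k-bit multiplications, 2*s**2 + s with s = u / k."""
    s = u // k
    return 2 * s * s + s


u = 1024
for k in (8, 16, 32, 64):
    print(f"k={k:2d}  s={u // k:3d}  multiplications={multiplication_count(u, k)}")
```

For k = 32 the count equals 2u + √u, matching the choice k = √u; a larger k lowers the count further but requires a wider multiplier.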

In Part A of the VLSI algorithm, a[j]b[i-j] and m[j]n[i-j] can be executed in parallel, so the $s^2-s$ multiplications of these two terms need only $(s^2-s)/2$ clock cycles. The other three product terms, a[i]b[0], S n'[0] and m[i]n[0], have two data dependencies: S n'[0] depends on a[i]b[0], and m[i]n[0] depends on S n'[0]. With the three-stage pipeline of Fig. 6, each dependency has to wait 3 clock cycles, so the two dependencies need 6 clock cycles in total. Because a[i]b[0], S n'[0] and m[i]n[0] are executed s times, accumulating these three product terms needs 6s clock cycles. In short, the multiply-accumulate operations of Part A need $6s + (s^2-s)/2$ clock cycles, that is, $(s^2+11s)/2$ clock cycles.

In Part B of the VLSI algorithm there are only the product terms a[j]b[i-j] and m[j]n[i-j], which can be executed in parallel, so the $s^2-s$ multiplications need only $(s^2-s)/2$ clock cycles. In Part C, adjusting the modular product to [0, N) takes s additions and therefore s clock cycles. The total number of clock cycles consumed by Parts A, B and C is thus $s^2+6s$, or $u+6\sqrt{u}$ (substituting s = u/k and $k = \sqrt{u}$ into $s^2+6s$ gives $u+6\sqrt{u}$).
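The cycle count can be checked by adding up the three parts. The sketch below simply encodes the per-part formulas stated in the text; the function name `monpro_cycles` is an assumption.

```python
def monpro_cycles(s: int) -> int:
    """Clock cycles of one modular multiplication with s words."""
    part_a = 6 * s + (s * s - s) // 2   # dependent terms plus paired products
    part_b = (s * s - s) // 2           # paired products only
    part_c = s                          # s additions for the final adjustment
    return part_a + part_b + part_c


s = 32                                  # u = 1024 bits, k = 32
print(monpro_cycles(s), s * s + 6 * s)  # both expressions give the same count
```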

In Part A of the VLSI algorithm, the s products S n'[0] are not added into the accumulated sum S, so the sum accumulates $2s^2+s-s = 2s^2$ products. The width of the accumulating adder must therefore be greater than $\log_2(2s^2 \cdot 2^{64})$. With $s = u/k = \sqrt{u} = 32$, $\log_2(2s^2 \cdot 2^{64}) = 75$, so the width of the accumulating adder is chosen as 76 bits; see Fig. 6.
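The width bound can be reproduced with a one-line calculation. This sketch assumes each product of two k-bit words is bounded by 2^(2k); the function name is hypothetical.

```python
import math


def accumulator_bound_bits(s: int, k: int) -> float:
    """log2 of the bound 2*s**2 * 2**(2k) on the sum of 2*s**2 products."""
    return math.log2(2 * s * s * (1 << (2 * k)))


bound = accumulator_bound_bits(32, 32)
print(bound)   # the text chooses an adder width strictly greater than this
```

For s = 32 and k = 32 the bound is a power of two (2 · 2¹⁰ · 2⁶⁴), so the logarithm is an integer and the next larger width is one bit more.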

The data path of the modular multiplier uses a three-stage pipeline to increase the parallelism of the multiplier, namely mul32 => adder64 => adder76. The first stage is two 32-bit multipliers working in parallel; the second stage is a 64-bit adder that sums the two 64-bit products and produces a carry Cy; the third stage is a 76-bit adder that computes the total accumulated sum. The control path of the modular multiplier uses a state machine to control the loop iterations and the data exchange between the modular multiplier and the memory. In sum, one modular multiplication needs $u+6\sqrt{u}$ clock cycles; for u = 1024 bits, one modular multiplication needs 1216 clock cycles.

Based on the RSA modular multiplier MonPro proposed by the invention, the hardware algorithm for the modular exponentiation $M^e \bmod N$ is as follows, with $R = r^s = 2^{ks}$:

Function MonExp(M, e, N, R)  /* N is odd */

Step 1: M := MR mod N

Step 2: x := 1R mod N

Step 3: for i = u-1 downto 0

Step 4: x := MonPro(x, x)

Step 5: if (e_i = 1) then x := MonPro(M, x)

Step 6: x := MonPro(x, 1)

Step 7: return x
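Steps 1 to 7 can be sketched in Python as follows. Here `mon_pro` is a reference Montgomery product written with ordinary big-integer arithmetic; it is an assumption standing in for the hardware MonPro unit, and the small modulus in the usage example is for illustration only.

```python
def mon_pro(a: int, b: int, n: int, r: int) -> int:
    """Reference Montgomery product a*b*R^-1 mod n (n odd, a, b < n < r)."""
    n_prime = (-pow(n, -1, r)) % r
    t = a * b
    m = ((t % r) * n_prime) % r
    res = (t + m * n) // r
    return res - n if res >= n else res


def mon_exp(M: int, e: int, n: int, u: int) -> int:
    """Left-to-right modular exponentiation M**e mod n with R = 2**u."""
    r = 1 << u
    M_bar = (M * r) % n                 # Step 1: move M into Montgomery form
    x = r % n                           # Step 2: Montgomery form of 1
    for i in range(u - 1, -1, -1):      # Step 3: scan e from bit u-1 down to 0
        x = mon_pro(x, x, n, r)         # Step 4: square
        if (e >> i) & 1:                # Step 5: multiply when e_i = 1
            x = mon_pro(M_bar, x, n, r)
    return mon_pro(x, 1, n, r)          # Steps 6-7: remove the factor R


print(mon_exp(65, 17, 3233, 12) == pow(65, 17, 3233))
```

Each loop iteration makes one or two MonPro calls, which is the basis of the average and worst-case cycle counts for the exponentiation.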

The corresponding program flowchart is shown in Fig. 7, and the structure of the RSA encryption processor is shown in Fig. 8. Mux in Fig. 8 denotes a 2-to-1 multiplexer, and MonPro denotes the modular multiplier structure of Fig. 6. (e, N) is the encryption key. The modular exponentiation algorithm scans $e = (e_{u-1} \ldots e_i \ldots e_0)$ from left to right and calls the RSA modular multiplier MonPro of Fig. 6. Because the Montgomery product is not the modular product, steps 1, 2 and 6 are used to cancel the factor $R^{-1}$ in the Montgomery product and turn it into the modular product. The VLSI implementation of the modular exponentiation algorithm is the RSA cryptographic coprocessor, as shown in Fig. 8. The relation between $e_i$ in the algorithm and $e_i'$ in Fig. 8 is: when $e_i = 0$, $e_i' = 0$, and only one modular multiplication is performed; when $e_i = 1$, $e_i' = 01$, and two modular multiplications are performed.

In the average case, for any i, $e_i = 1$ and $e_i = 0$ each occur with probability one half, so on average 1.5 modular multiplications are performed per bit, and the number of clock cycles needed for the modular exponentiation is $1.5u(s^2+6s) = 1.5u^2 + 9u\sqrt{u}$. In the worst case, $e_i = 1$ for every i, so 2 modular multiplications are performed per bit, and the number of clock cycles needed is $2u(s^2+6s) = 2u^2 + 12u\sqrt{u}$ (with s = u/k and $k = \sqrt{u}$).

With a working clock of 5 MHz and u = 1024 bits, the average execution time is $1.5 \times 1024 \times (s^2+6s)/(5 \times 10^6) = 1.5 \times 1024 \times (u+6\sqrt{u})/(5 \times 10^6) = 374$ ms, and the worst-case execution time is $2 \times 1024 \times (s^2+6s)/(5 \times 10^6) = 2 \times 1024 \times (u+6\sqrt{u})/(5 \times 10^6) = 498$ ms.

The 1024-bit RSA cryptographic coprocessor was simulated with the Cadence tool Verilog-XL, which verified the consistency and correctness of encryption and decryption, $M \equiv M^{ed} \bmod N$. It was synthesized with Synopsys tools on the TSMC 0.35 μm standard cell library. The results show that the RSA cryptographic coprocessor occupies 38K gates and needs 1216 clock cycles for one 1024-bit modular multiplication. Its maximum delay is the combinational logic delay of the 32-bit multiplier, which is 15 ns, so the coprocessor can run at up to 65 MHz, which satisfies the 20 MHz operating frequency of smart cards. With an external working clock of 5 MHz, the RSA cryptographic coprocessor needs 374 ms on average to encrypt a 1024-bit plaintext.
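The quoted execution times follow directly from the cycle counts. The sketch below recomputes them under the stated assumptions (u = 1024, k = 32, a 5 MHz clock, and 1.5 or 2 modular multiplications per exponent bit); the function name is hypothetical.

```python
def exp_time_ms(u: int, k: int, clock_hz: float, monpro_per_bit: float) -> float:
    """Execution time of one modular exponentiation in milliseconds."""
    s = u // k
    cycles = monpro_per_bit * u * (s * s + 6 * s)
    return 1000.0 * cycles / clock_hz


average_ms = exp_time_ms(1024, 32, 5e6, 1.5)
worst_ms = exp_time_ms(1024, 32, 5e6, 2.0)
print(f"average: {average_ms:.1f} ms, worst case: {worst_ms:.1f} ms")
```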

Claims

1. A Montgomery modular multiplication algorithm for VLSI, characterized in that: it is a highly parallel algorithm suited to VLSI implementation which in essence decomposes the three large-number multiplications of the original algorithm into $2s^2+s$ small-integer multiplications, and it comprises the following steps in order:

Let A and B each be s-digit radix-r integers:

$A = (a_{s-1} a_{s-2} \ldots a_1 a_0)$, $B = (b_{s-1} b_{s-2} \ldots b_1 b_0)$

The modulus N is also an s-digit radix-r integer,

$N = (n_{s-1} n_{s-2} \ldots n_1 n_0)$, and $R = r^s$.

Then N < R, $n_0 n_0' \bmod r = -1$, and let A < N, B < N.

A.1 for i = 0, ..., s-1

A.2 for j = 0, ..., i-1

A.2.1 S := S + a[j]b[i-j] + m[j]n[i-j]

A.3 S := S + a[i]b[0]

A.4 m[i] := S n'[0] mod r

A.5 S := S + m[i]n[0]

A.6 S := S/r  // shift right by one radix-r digit

B.1 for i = s, ..., 2s-1

B.2 for j = i-s+1, ..., s-1

B.2.1 S := S + a[j]b[i-j] + m[j]n[i-j]

B.3 m[i-s] := S mod r

C.1 t0 := S mod r  // t0 is one radix-r digit

C.2 carry Cy := 1

C.3 for j = 0, ..., s-1

C.3.1 (Cy, b[j]) := m[j] + not(n[j]) + Cy  // Cy is the carry bit; addition with carry

t0 := t0 + not(0) + Cy

C.4 if t0 = 0

then return (b[s-1] b[s-2] ... b[1] b[0])

otherwise return (m[s-1] m[s-2] ... m[1] m[0])

2. A smart card modular multiplier structure proposed on the basis of the Montgomery modular multiplication algorithm for VLSI according to claim 1, characterized in that: it is a high-radix modular multiplier that uses 32-bit multipliers to perform 1024-bit modular multiplication and whose data path uses a three-stage pipeline. The first stage consists of two 32-bit multipliers, whose inputs are a, b and m, n respectively, and two 64-bit registers whose inputs are connected to the outputs of the two multipliers. The second stage consists of a 64-bit adder, which sums the two 64-bit products and produces a carry Cy, and a 65-bit register connected to the output of this 64-bit adder. The third stage consists of a 76-bit adder, whose input is connected to the output of the 65-bit register and which computes the total accumulated sum, and a 76-bit register that is interconnected with the 76-bit adder and whose output delivers the product result.