CN1811698A

CN1811698A - High-radix hardware implementation method for a large-number modular exponentiation system

Info

Publication number: CN1811698A
Application number: CN 200610020386
Authority: CN
Inventors: 王金波
Original assignee: Chengdu Westone Information Industry Inc
Current assignee: Chengdu Westone Information Industry Inc
Priority date: 2006-03-01
Filing date: 2006-03-01
Publication date: 2006-08-02
Anticipated expiration: 2026-03-01
Also published as: CN100435091C

Abstract

The invention discloses a high-radix hardware implementation method for a large-number modular exponentiation system. It relates to hardware implementations of modular exponentiation in public-key cryptosystems and addresses the low efficiency and lack of generality of existing techniques when processing high-radix (radix $2^H$) data. The invention divides high-radix modular exponentiation into an initialization processing unit, a parallel addition processing unit, a modular multiplication unit, a modular exponentiation main operation unit, and a data output recovery unit, and uses simple logic to perform the modular multiplication and the main modular exponentiation operation on high-radix data in a public-key cryptosystem. Compared with the prior art, the main modular exponentiation operation uses only simple logic such as OR, XOR, and AND, which permits a high operating frequency, and the hardware's data-processing capability can be raised by a factor of H. The method can be used for hardware modular exponentiation in public-key cryptosystems.

Description

High-radix hardware implementation method for a large-number modular exponentiation system

Technical field

The present invention relates to hardware implementation of modular exponentiation in public-key cryptosystems, and in particular to a high-radix (radix $2^H$) implementation method that, in order to improve data-processing efficiency in large-number modular exponentiation, constructs a dynamic parallel addition and a matching table of initialization data in memory, and uses simple logic to realize the modular multiplication and the main modular exponentiation operation.

Background technology

The efficiency of modular multiplication and modular exponentiation is crucial to the operating efficiency of a public-key cryptosystem. The traditional approach of computing residues by division and summation gives unsatisfactory efficiency for large-number modular arithmetic. Among the various modular multiplication algorithms, Montgomery multiplication is one of the most effective; its basic idea is to replace ordinary division with a series of additions and shifts, and it has become the basic processing unit of public-key cryptosystems.

When two or more integers are added in hardware, the addition can be carried out bitwise in parallel, producing two outputs: C, which holds the carry information of every bit position, and S, which holds the XOR information of every bit position. Such a carry-save adder (CSA) performs addition without a carry chain. Let "⊕" denote bitwise XOR, "∧" bitwise AND, "∨" bitwise OR, and ":=" assignment of the value on the right to the variable on the left. A CSA addition of three integers X, Y, Z outputs C and S satisfying $2C + S = X + Y + Z$, and the CSA formulas are:

$$C := (X \wedge Y) \vee (X \wedge Z) \vee (Y \wedge Z), \qquad S := X \oplus Y \oplus Z.$$

Thus a CSA completes the addition of integers of any bit length in parallel within one clock cycle, but it does not complete a full addition, because C and S still have to be combined. A CSA is therefore unsuitable for a single ordinary addition, but it is very efficient when additions must be repeated many times in a loop.
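As a minimal illustration of the CSA formulas above, the following Python sketch applies them to arbitrary-precision integers. The function name `csa` is chosen here for illustration and does not come from the patent.

```python
def csa(x, y, z):
    """Carry-save addition of three non-negative integers.

    Returns (c, s) with 2*c + s == x + y + z. Every bit position is
    handled independently, so no carry propagates between positions.
    """
    c = (x & y) | (x & z) | (y & z)  # carry bits: majority of the three inputs
    s = x ^ y ^ z                    # sum bits: XOR of the three inputs
    return c, s


c, s = csa(13, 7, 9)
# One ordinary (carry-propagating) addition is still needed to get the sum.
assert 2 * c + s == 13 + 7 + 9
```

The final `2 * c + s` is the step a CSA leaves undone, which is why it pays off only when many additions are chained before a single ordinary addition.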

Implementing modular exponentiation in programmable logic devices such as FPGAs or CPLDs, or in ASIC chips, or implementing it as a dedicated hardware component that upper layers call through an IP (intellectual property) core interface, is a popular way to accelerate public-key cryptographic operations. At present there are essentially two hardware approaches to the large-number modular multiplication inside modular exponentiation: the first realizes Montgomery modular multiplication with parallel CSA addition and division by 2; the second processes high-radix data with systolic arrays. Let k be the bit length of the modulus and d the bit length of the private exponent in an RSA system. With the first approach, one Montgomery modular multiplication needs only k+2 clock cycles, and the run time of an RSA signature is (k/2+2)(d/2+3) clock cycles. With k = d = 1024, an RSA system implemented this way can reach a minimum clock period of 9.5 ns (device XC2V1500-8; resource usage of the modular multiplier: 80,000). The second approach exploits the fast carry-chain structure that some devices provide, building m cascaded processing elements to avoid an overly long carry chain, and in this way processes high-radix (radix $2^{k/m}$) data; one Montgomery modular multiplication needs 2m+3 clock cycles, and the run time of an RSA signature is (m+20)(d/2+2) clock cycles. With m = 128 and radix $2^4$, an RSA system implemented this way reaches a minimum clock period of 20.7 ns (device XC40150XV-8; the modular multiplier occupies 3413 CLBs).

The first approach is built from simple logic and parallel addition. It has a short clock period and is easy to port, but it does not go beyond the radix-2 binary processing mode, which limits data-processing efficiency. The second approach can process high-radix data and uses pipelining, but because of the inherent delays of the device the array cannot be made too large, its achievable frequency depends closely on the characteristics of the specific device, and the design is not portable. In summary, the first approach is simple to design and reaches a higher frequency, while the second is more complex and reaches a lower frequency. However, the former handles only radix-2 data whereas the latter handles high-radix data, so their overall speeds differ little.

The analysis above shows that neither approach is an optimal way to implement modular exponentiation in hardware. The design based on parallel addition and division by 2 cannot process high-radix (radix $2^H$) data, which limits data-processing efficiency; the high-radix systolic-array approach reaches a lower frequency, and its efficiency depends closely on the characteristics of the specific device, so it lacks portability and generality.

Summary of the invention

The object of the invention is to solve the problems of existing modular exponentiation implementations, namely unsatisfactory efficiency, low achievable frequency, and lack of generality, by providing a high-radix hardware implementation method for a large-number modular exponentiation system that uses simple logic, effectively improves data-processing efficiency, and lends itself to high-speed implementation on many kinds of hardware chips.

The object of the invention is achieved by the following technical solution:

A high-radix hardware implementation method for a large-number modular exponentiation system, in which data are input to a programmable logic device or an ASIC chip that performs the modular exponentiation, characterized in that the method is divided into five units, namely an initialization processing unit, a parallel addition processing unit (CSA parallel addition), a modular multiplication unit, a modular exponentiation main operation unit, and a data output recovery unit, wherein

A. Initialization processing unit: the modulus N of the modular exponentiation is expanded in radix $\beta = 2^H$ as $N = (n_{p-1} \ldots n_1 n_0)_\beta$, where the digits $n_i$ ($i = 0, \ldots, p-1$) are ordered from low to high. Take

$$n' = \beta - n_0^{-1} \bmod \beta, \qquad R = \beta^{p+2} \bmod N, \qquad R2 = R^2 \bmod N$$

$$M = n' \times N = (m_p \ldots m_1 m_0)_\beta, \qquad M_j = j \times M \quad (j = 0, \ldots, \beta - 1).$$

The following data are stored in ROM in binary form:

$$n',\ N,\ R,\ R2,\ \{M_j,\ j = 0, \ldots, \beta - 1\}$$

B. Modular multiplication unit: the table $MX = \{M_j,\ j = 0, \ldots, \beta - 1\}$ is stored in ROM. The inputs are (A1, A2), (B1, B2) and the cycle count l; the outputs are (CY, SY). The unit runs through l (≤ p+3) loop iterations, and the loop computation is divided into three modules,

(1) precalculation module

(2) dynamic parallel addition module

(3) loop feedback computation module

C. Modular exponentiation main operation unit: the modulus is N and the unit computes $Y = X^E \bmod N$. The input exponent is $E = (e_{h-1}, e_{h-2}, \ldots, e_1, e_0)_2$ with most significant bit $e_{h-1} = 1$ ($0 \le h < p \times H$), and the input plaintext is $X = (x_{p-1}, x_{p-2}, \ldots, x_1, x_0)_\beta < N$. Let W denote the output value of the main operation. The operation proceeds in three stages,

(1) Initial stage: the main operation contains two modular multiplication units side by side, unit-1 and unit-2, with two corresponding pairs of output variables, CZ and SZ, and CP and SP;

(2) Loop stage: after the initial stage, the loop is executed h times. In each iteration unit-1 and unit-2 run synchronously in parallel with l = p+3; unit-1 takes CZ, SZ, CZ, SZ as inputs and outputs CZ and SZ, and unit-2 takes CP, SP, CZ, SZ as inputs and outputs CP and SP;

(3) End stage: after the loop stage, unit-2 is run with l = p+2 and inputs CP, SP, 0, 1, and outputs CP and SP;

D. Data output recovery unit: compute W = CP + SP; then, using $W = (w_{p-1}, w_{p-2}, \ldots, w_1, w_0)_\beta$ and $q = n' \times w_0 \bmod \beta$, compute $Y := (W + q \times N)/\beta$, which gives $Y = X^E \bmod N$.
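The three stages of unit C and the recovery step of unit D can be sketched at the integer level as follows. This is a simplified model under stated assumptions, not the hardware design: each carry-save pair such as (CZ, SZ) is collapsed into a single integer, the function names `mont_mul` and `mod_exp` are hypothetical, R is taken as $\beta^{p+2} \bmod N$ (the reading under which l = p+3 iterations cancel the scaling), the final correction subtracts N only when Y ≥ N, N must be odd, and H ≥ 2.

```python
def mont_mul(A, B, l, N, n_prime, H):
    """Integer model of one run of the modular multiplication unit.

    Returns a value congruent to A * B * beta**-(l-1) modulo N.
    """
    beta = 1 << H
    M = n_prime * N                 # M is congruent to -1 modulo beta
    R = 0
    for i in range(l):
        a_i = (A >> (H * i)) % beta # digit i of A
        t = R % beta                # lowest digit of the accumulator
        R = (R + t * M) // beta + a_i * B   # exact division, then add a_i * B
    return R


def mod_exp(X, E, N, H):
    """X**E mod N for odd N, following the three stages plus recovery."""
    beta = 1 << H
    p = -(-N.bit_length() // H)                 # number of radix-beta digits of N
    n_prime = (beta - pow(N % beta, -1, beta)) % beta   # needs Python 3.8+
    R = pow(beta, p + 2, N)                     # assumption: R = beta^(p+2) mod N
    R2 = R * R % N
    Z = mont_mul(R2, X, p + 3, N, n_prime, H)   # initial stage (unit-1)
    P = R
    for i in range(E.bit_length()):             # loop stage, h iterations
        Z_next = mont_mul(Z, Z, p + 3, N, n_prime, H)   # unit-1: square
        P_next = mont_mul(P, Z, p + 3, N, n_prime, H)   # unit-2: multiply
        if (E >> i) & 1:                        # e_i = 0 leaves P unchanged
            P = P_next
        Z = Z_next
    W = mont_mul(P, 1, p + 2, N, n_prime, H)    # end stage (unit-2, l = p+2)
    q = n_prime * (W % beta) % beta             # recovery: q = n' * w0 mod beta
    Y = (W + q * N) // beta
    return Y - N if Y >= N else Y


assert mod_exp(3, 5, 7, 2) == pow(3, 5, 7)
```

In this model the exponent bits are consumed from the least significant end, which matches unit-1 squaring CZ, SZ on every iteration while unit-2 is gated by $e_i$.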

The parallel addition processing unit operates as follows:

The carry output of the CSA is doubled, so that C + S = X + Y + Z + W holds. The CSA addition with inputs (X, Y, Z, W) and outputs (C, S) is

$$(C, S) = \mathrm{CSA4TO2}(X, Y, Z, W) := \mathrm{CSA}(\mathrm{CSA}(X, Y, Z), W)$$
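The 4-to-2 form can be sketched in Python as below. The names `csa` and `csa4to2` are illustrative; the carry output is shifted left by one bit, which is how this sketch reads "the carry output is doubled".

```python
def csa(x, y, z):
    """CSA with doubled carry output: c + s == x + y + z."""
    c = ((x & y) | (x & z) | (y & z)) << 1   # majority bits, shifted left once
    s = x ^ y ^ z
    return c, s


def csa4to2(x, y, z, w):
    """CSA4TO2(X, Y, Z, W) := CSA(CSA(X, Y, Z), W); c + s == x + y + z + w."""
    c1, s1 = csa(x, y, z)
    return csa(c1, s1, w)


c, s = csa4to2(11, 6, 3, 14)
assert c + s == 11 + 6 + 3 + 14
```

Because the carry is already doubled, the outputs of one CSA4TO2 can be fed straight into another, which is what the multi-layer constructions rely on.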

Given a data vector (X, Y), let $(X_j, Y_j) = j \times (X + Y)$, that is, $X_j + Y_j = j(X + Y)$. When j ($j < 2^4$) is odd,

$$(X_3, Y_3) := \mathrm{CSA4TO2}(X, Y, 2X, 2Y),$$

$$(X_5, Y_5) := \mathrm{CSA4TO2}(X, Y, 4X, 4Y),$$

$$(X_7, Y_7) := \mathrm{CSA4TO2}(X_3, Y_3, 4X, 4Y),$$

$$(X_9, Y_9) := \mathrm{CSA4TO2}(X_5, Y_5, 4X, 4Y),$$

$$(X_{11}, Y_{11}) := \mathrm{CSA4TO2}(X_3, Y_3, 8X, 8Y),$$

$$(X_{13}, Y_{13}) := \mathrm{CSA4TO2}(X_5, Y_5, 8X, 8Y),$$

$$(X_{15}, Y_{15}) := \mathrm{CSA4TO2}(X_5, Y_5, 2X_5, 2Y_5),$$

Thus, for any j ($0 \le j < 2^4$) and data vector (X, Y), $(X_j, Y_j)$ is obtained with at most two layers of CSA4TO2 computation and simple shifts. Consequently, given any

$$x = (x_{H/4-1} \ldots x_1 x_0)_{2^4} = 2^{4(H/4-1)} x_{H/4-1} + 2^{4(H/4-2)} x_{H/4-2} + \ldots + x_0$$

($H > 4$, $0 \le x_i < 2^4$) and any X and Y, a further $(\log_2 H - 2)$ layers (a total of $2^0 + 2^1 + \ldots + H/8$) of CSA4TO2 operations and simple shifts yield $(X_x, Y_x) := x \times (X + Y)$.
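The construction of $(X_x, Y_x)$ can be sketched as follows. The function names are hypothetical, even multiples are obtained by shifting an odd multiple, and the route used for 15 (5 plus twice 5) is one two-layer decomposition chosen for this sketch. H is assumed to be a power of two with H ≥ 4.

```python
def csa4to2(x, y, z, w):
    """4-to-2 compressor from two CSAs with doubled carry: c + s == x+y+z+w."""
    c1 = ((x & y) | (x & z) | (y & z)) << 1
    s1 = x ^ y ^ z
    c = ((c1 & s1) | (c1 & w) | (s1 & w)) << 1
    s = c1 ^ s1 ^ w
    return c, s


def nibble_multiple(j, X, Y):
    """Pair (Xj, Yj) with Xj + Yj == j * (X + Y) for 0 <= j < 16."""
    if j == 0:
        return 0, 0
    shift = (j & -j).bit_length() - 1      # even j: shift an odd multiple left
    odd = j >> shift
    if odd == 1:
        pair = (X, Y)
    elif odd == 3:
        pair = csa4to2(X, Y, X << 1, Y << 1)
    elif odd == 5:
        pair = csa4to2(X, Y, X << 2, Y << 2)
    elif odd in (7, 11):                   # 3 + 4 or 3 + 8
        x3, y3 = csa4to2(X, Y, X << 1, Y << 1)
        k = 2 if odd == 7 else 3
        pair = csa4to2(x3, y3, X << k, Y << k)
    else:                                  # 9, 13, 15 start from the 5-multiple
        x5, y5 = csa4to2(X, Y, X << 2, Y << 2)
        if odd == 15:
            pair = csa4to2(x5, y5, x5 << 1, y5 << 1)   # 5 + 2*5 (sketch's choice)
        else:
            k = 2 if odd == 9 else 3       # 5 + 4 or 5 + 8
            pair = csa4to2(x5, y5, X << k, Y << k)
    return pair[0] << shift, pair[1] << shift


def multiple(x, X, Y, H):
    """Pair (Xx, Yx) with Xx + Yx == x * (X + Y) for 0 <= x < 2**H."""
    pairs = []
    for i in range(H // 4):                # one pair per radix-16 digit of x
        c, s = nibble_multiple((x >> (4 * i)) & 0xF, X, Y)
        pairs.append((c << (4 * i), s << (4 * i)))
    while len(pairs) > 1:                  # log2(H) - 2 layers of CSA4TO2
        pairs = [csa4to2(*pairs[k], *pairs[k + 1])
                 for k in range(0, len(pairs), 2)]
    return pairs[0]


Xx, Yx = multiple(0xB7, 0x1234, 0x0FED, 8)
assert Xx + Yx == 0xB7 * (0x1234 + 0x0FED)
```

For H = 8 the reduction loop runs once, and for H = 16 it uses 2 + 1 CSA4TO2 operations, in line with the count $2^0 + 2^1 + \ldots + H/8$ given in the text.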

The modular multiplication unit is given the intermediate register variables $RC = (rc_{p+1} \ldots rc_1 rc_0)_\beta$ and $RS = (rs_{p+1} \ldots rs_1 rs_0)_\beta$ and the signal variables C and S. The three modules of the loop computation are, specifically,

A. Precalculation module: clear RC and RS, and at the same time compute $a_0$ and $(B1_{a_0}, B2_{a_0})$;

B. Dynamic parallel addition module: in the i-th iteration (i = 0, 1, ..., l-1), ordinary addition of A1 + A2 yields the digit $a_i = a_{i1} 2^{H/2} + a_{i0}$ ($0 \le a_{i0}, a_{i1} < 2^{H/2}$), where $A1 + A2 = (a_{p+2} \ldots a_1 a_0)_\beta$. If $H \le 4$, $(B1_{a_i}, B2_{a_i})$ is computed directly by the $(X_x, Y_x)$ procedure above. If $H > 4$, the parallel addition processing module is used with $a_{i0}$ and $a_{i1}$ to obtain $(B1_{a_{i0}}, B2_{a_{i0}})$ and $(B1_{a_{i1}}, B2_{a_{i1}})$, and $(B1_{a_i}, B2_{a_i})$ is then computed as:

$$(B1_{a_i}, B2_{a_i}) := \mathrm{CSA4TO2}(B1_{a_{i0}}, B2_{a_{i0}}, 2^{H/2} B1_{a_{i1}}, 2^{H/2} B2_{a_{i1}})$$

C. Loop feedback computation module: in the (i+1)-th iteration (i = 0, 1, ..., l-1), compute $t = rc_0 + rs_0$ and use the values $B1_{a_i}$ and $B2_{a_i}$ computed in the i-th iteration, together with the value $M_t$ from ROM (addressed by t, or by $t_0$ and $t_1$, where $t = t_1 2^{H/2} + t_0$), to update RC and RS as follows,

$$(C, S) := \mathrm{CSA}(RC, RS, M_t) \quad \text{(or } \mathrm{CSA4TO2}(RC, RS, M_{t_1} \ll H/2, M_{t_0})\text{)}$$

$$(RC, RS) := \mathrm{CSA4TO2}(C \gg H, S \gg H, B1_{a_i}, B2_{a_i})$$

$$rc_0 := rc_0 + (c_{H-1} \wedge s_{H-1})$$

Here $c_{H-1}$ and $s_{H-1}$ denote bit H-1 of C and S, $\gg H$ denotes shifting the data right by H bits, and $\ll H/2$ denotes shifting the data left by H/2 bits.
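The loop described in modules A to C can be modelled in Python as below. This is a behavioural sketch with simplifications that are assumptions of the sketch: the multiples $(B1_{a_i}, B2_{a_i})$ are formed by integer multiplication rather than by the parallel addition module, and the carry dropped when C and S are shifted separately is computed arithmetically from their low H bits rather than by the single-bit expression given in the text. The function names are hypothetical, and the modular inverse via `pow` needs Python 3.8 or later.

```python
def csa(x, y, z):
    """CSA with doubled carry output: c + s == x + y + z."""
    return ((x & y) | (x & z) | (y & z)) << 1, x ^ y ^ z


def csa4to2(x, y, z, w):
    """CSA4TO2(X, Y, Z, W) := CSA(CSA(X, Y, Z), W)."""
    return csa(*csa(x, y, z), w)


def modmul_unit(A1, A2, B1, B2, l, H, MX):
    """Returns (RC, RS) with RC + RS congruent to
    (A1+A2) * (B1+B2) * beta**-(l-1) modulo N, where MX[j] == j * n' * N."""
    mask = (1 << H) - 1
    A = A1 + A2                    # ordinary addition; one digit a_i per iteration
    RC = RS = 0                    # precalculation: clear the registers
    for i in range(l):
        a_i = (A >> (H * i)) & mask
        Ba1, Ba2 = a_i * B1, a_i * B2      # stand-in for (B1_ai, B2_ai)
        t = (RC + RS) & mask               # t = rc0 + rs0, reduced modulo beta
        C, S = csa(RC, RS, MX[t])          # C + S is now a multiple of beta
        carry = ((C & mask) + (S & mask)) >> H   # lost by shifting C, S separately
        RC, RS = csa4to2(C >> H, S >> H, Ba1, Ba2)
        RC |= carry                        # bit 0 of RC is free (doubled carry)
    return RC, RS


H, N, p = 8, (1 << 61) - 1, 8              # example modulus with p = 8 digits
beta = 1 << H
n_prime = (beta - pow(N % beta, -1, beta)) % beta
MX = [j * n_prime * N for j in range(beta)]
l = p + 3
A1, A2, B1, B2 = 0x1234567, 0x89ABCDEF, 0x0F0F0F0F, 0x13579BDF
RC, RS = modmul_unit(A1, A2, B1, B2, l, H, MX)
expected = (A1 + A2) * (B1 + B2) * pow(beta, -(l - 1), N) % N
assert (RC + RS) % N == expected
```

Adding $a_i \times (B1 + B2)$ after the shift is what makes l iterations scale the product by $\beta^{-(l-1)}$ rather than $\beta^{-l}$.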

In the three stages of the modular exponentiation main operation unit,

A. In the initial stage, take l = p+3 and run unit-1 with R2, 0, X, 0 as inputs; its outputs CZ and SZ are fed back to the inputs of unit-1 and unit-2, and CP and SP are initialized to R and 0 respectively;

B. In the loop stage, entered after the initial stage, the loop is executed h times; in each iteration unit-1 and unit-2 run synchronously in parallel with l = p+3. In the i-th iteration (i = 0, ..., h-1), when $e_i = 0$ the data in CP and SP remain unchanged (they are not updated by the outputs of unit-2).

In the data output recovery unit, if Y < N, Y is output; otherwise, if Y ≥ N, Y - N is output.
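The recovery step is small enough to check exhaustively. The sketch below uses the hypothetical name `recover`, reads the final correction as subtracting N only when Y is at least N, and assumes CP + SP is smaller than $\beta \times N$ so that a single subtraction suffices.

```python
def recover(CP, SP, N, n_prime, H):
    """Data output recovery: one Montgomery reduction step plus correction."""
    beta = 1 << H
    W = CP + SP                        # ordinary radix-2^H addition
    q = n_prime * (W % beta) % beta    # q = n' * w0 mod beta
    Y = (W + q * N) >> H               # exact: W + q*N is a multiple of beta
    return Y - N if Y >= N else Y


# N = 197, H = 4: n0 = 5, 5^-1 mod 16 = 13, so n' = 16 - 13 = 3.
Y = recover(1000, 0, 197, 3, 4)
assert (Y * 16 - 1000) % 197 == 0      # Y is congruent to W / beta modulo N
```

Because this step needs a multiplication and a subtraction, it is the natural candidate for the software implementation mentioned in the text.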

The beneficial effect of the invention is that high-radix (radix $2^H$) large-number modular multiplication and the main modular exponentiation operation are realized with simple logic. Data are processed in high-radix (radix $2^H$) form, and the main modular exponentiation operation uses only simple logic such as OR, XOR, and AND, so the achievable frequency is high; the method is independent of the characteristics of any specific device and is highly portable. Because the data output recovery unit accounts for a very small share of the computation in the whole modular exponentiation, it can be implemented in software.

A modular exponentiation system built with the invention achieves higher data-processing capability and faster system response. Specifically, the main advantages of the invention are:

(1) Performing Montgomery multiplication in radix $2^H$ (H > 1) rather than in binary multiplies the hardware's data-processing capability (by nearly a factor of H).

(2) The modular multiplication unit designed in the invention uses only simple logic such as OR, XOR, and AND and avoids complex operations such as multiplication and subtraction, which makes it easy to implement on various kinds of hardware and helps raise the clock frequency.

(3) The invention provides an effective data output recovery unit, so that the modular exponentiation main operation unit becomes the main body of the computation; operations outside the main body can therefore be implemented in combination with software, further reducing hardware size and implementation difficulty.

(4) The characteristics of the specific device do not affect the method or its results, so the invention is highly portable and suited to implementation on hardware platforms such as ASICs, CPLDs, and FPGAs.

(5) With the modular multiplication unit of the invention, an ordinary FPGA chip implementing a 512-bit modular multiplication (70 clock cycles in total) readily reaches a clock frequency above 120 MHz. Compared with FPGA implementations of other methods reported in the open literature, its speed has a significant advantage.

The invention is applicable to public-key cryptosystems such as RSA and DSA that place strict demands on speed (for example, signature rates of several thousand per second or more), and to the hardware development of large-number modular exponentiation (or modular multiplication) units in other application systems.

Description of drawings

Fig. 1 is a block diagram of the modular multiplication unit of the invention;

Fig. 2 is an example block diagram of the parallel addition processing module of the invention;

Fig. 3 is a block diagram of the modular exponentiation main operation unit of the invention;

Fig. 4 is a block diagram of the data output recovery unit of the invention;

Fig. 5 is a block diagram of the overall modular exponentiation structure of the invention;

Fig. 6 is an example diagram of the pipelined operation of the modular multiplication of the invention.

Reference numerals in Fig. 1: 100-103 are the four input data of the modular multiplication; 121-122 are its two output data; 104 is the ROM data after initialization processing; 105 and 106 are parallel addition processing modules, which produce the indicated results; 108 is an ordinary radix-$2^H$ adder, and 107 is its i-th output (two half-words); 109 shifts data left by H/2 bits; 110 and 120 are parallel CSA additions of four input data; 111 is an H-bit word addition; 112 addresses the ROM with the output of 111 to obtain the data 115; 113-115 are register units; 116 is a parallel CSA addition of three input data; 117 is a single-bit AND operation; 118-119 are the H-bit shift operations applied to C and S (the shifts written $\gg H$ in the formulas).

In Fig. 2, $x < 2^4$. Reference numerals: 200-202 are the three input data; 203-204 shift data left by n bits, 205-206 shift data left by s bits, 211 shifts data left by 2k bits, and 212 shifts data left by 2g bits, where n, s, k, g are determined by 202; 209-210 are the possible output data; 207-208 are parallel CSA additions of four input data; 213 is the output data vector.

Reference numerals in Fig. 3: 301-306 are the six input data; 307 is the signal carrying the value of bit i of the data 306; 316-317 are the two output data; 308-309 are the two modular multiplication units arranged side by side; 310-313 are the register units for CZ, SZ, CP, and SP respectively; 314 is a data gate controlled by 307; 315 is the control signal generator, in which l is the inner loop counter of the two modular multiplication units, h is the outer loop counter of the main modular exponentiation operation, and clk is the clock control signal.

Reference numerals in Fig. 4: 400-403 are the four input data; 413 is the output data; 404 is an ordinary radix-$2^H$ adder; 405 is the lowest H bits of the output of 404; 406 is an H-bit multiplication; 407 and 408 are the outputs of 404 and 406 respectively; 409 is an ordinary radix-$2^H$ multiply-add and right-shift operation; 410 is the output of 409; 411 compares Y with N; 412 is an ordinary radix-$2^H$ subtraction.

Reference numerals in Fig. 5: 500 is the two input data of the modular exponentiation; 505 is its output data; 501 is the initialization processing unit, whose output data are stored in ROM; 502 is the modular exponentiation main operation unit; 503 is the RAM needed by the parallel addition processing module (when the pre-storage mode is used) or by other units; 504 is the data output recovery unit.

Embodiment

The invention is further described below with reference to the drawings and specific embodiments.

The large-number modular multiplication module is the core unit of all public-key cryptosystems currently in use and forms the loop body of modular exponentiation. The modular exponentiation of different public-key cryptosystems differs in modulus length and exponent length, so the size of the large-number modular multiplication and the number of loop iterations also differ. For example, suppose the modular exponentiation module of a system (which essentially determines the system's run time) needs h large-number modular multiplication iterations, each modular multiplication costs l clock cycles, and calling the modular multiplication module costs a further d clock cycles; the modular exponentiation module then costs h(l+d) clock cycles in total. Since d is generally between 0 and 2, the cost of the large-number modular multiplication essentially determines the efficiency of the modular exponentiation system.

Taking radix $2^8$ (H = 8) as an example, the following describes the relationship between the clock cycles consumed by the modular multiplication unit and the operating efficiency of the scheme.

The method of the invention for performing high-radix modular exponentiation in hardware, shown in Fig. 5, comprises five units: the initialization processing unit 501, the parallel addition processing unit, the modular multiplication unit, the modular exponentiation main operation unit 502, and the data output recovery unit 504. The modular exponentiation main operation unit contains two parallel modular multiplication units, and each modular multiplication unit contains the parallel addition processing unit. X and E are supplied at the modular exponentiation input 500, processed by these five units, and $Y = X^E \bmod N$ is finally output.

The initialization processing unit is executed first and stores the data $\{M_j,\ j = 0, \ldots, \beta - 1\}$ in ROM in binary form. If ROM space is limited, only $M_j = j \times M$, $j = 0, \ldots, 2^4 - 1$, is computed and stored, because for any $x = x_1 2^4 + x_0$ ($0 \le x_0, x_1 < 2^4$) and M we have

$$x \times M = (2^4 M_{x_1}, M_{x_0}),$$

that is, $x \times M = 2^4 M_{x_1} + M_{x_0}$. In the modular multiplication of the main operation unit, the two input data $(2^4 M_{t_1}, M_{t_0})$ can then be used in place of $M_t$.
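A short Python sketch of the reduced ROM table for H = 8 follows. The names `init_rom` and `m_lookup` are hypothetical, N must be odd, and the modular inverse via `pow` needs Python 3.8 or later.

```python
def init_rom(N, H):
    """Compute n', M = n' * N and the reduced table M_j = j * M for j < 16."""
    beta = 1 << H
    n_prime = (beta - pow(N % beta, -1, beta)) % beta   # n' * n0 = -1 (mod beta)
    M = n_prime * N
    M16 = [j * M for j in range(16)]    # 16 entries instead of 2**H
    return n_prime, M, M16


def m_lookup(t, M16):
    """Pair (hi, lo) with hi + lo == t * M for 0 <= t < 2**8."""
    t1, t0 = t >> 4, t & 0xF            # t = t1 * 2^4 + t0
    return M16[t1] << 4, M16[t0]


n_prime, M, M16 = init_rom(197, 8)
hi, lo = m_lookup(0xA7, M16)
assert hi + lo == 0xA7 * M
assert (0xA7 + hi + lo) % 256 == 0      # adding M_t clears the low digit t
```

The pair returned by `m_lookup` is exactly the kind of two-operand input that the CSA4TO2 variant of the feedback step accepts, so the smaller table costs one extra adder input rather than an extra cycle.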

After the initialization processing unit finishes, the modular exponentiation main operation unit is entered; as shown in Fig. 3, it works by calling the modular multiplication unit repeatedly. The modular multiplication unit is described below with reference to Fig. 1.

In the modular multiplication unit of the invention, shown in Fig. 1, the loop computation is divided into three modules: the precalculation module, the dynamic parallel addition module, and the loop feedback computation module.

In the parallel addition processing module of the modular multiplication unit, shown in Fig. 2, for any j ($0 \le j < 2^4$) and data vector (X, Y), $(X_j, Y_j)$ is obtained with at most two layers of CSA4TO2 computation and simple shifts. Therefore, given any

$$x = (x_1 x_0)_{2^4} = 2^{4(2-1)} x_1 + 2^0 x_0$$

($0 \le x_i < 2^4$) and any X and Y, a CSA4TO2 operation and simple shifts (modules 105-106 in Fig. 1) yield $(X_x, Y_x) := x \times (X + Y)$, corresponding to the output of 110 in Fig. 1.

The ordinary addition $A1 + A2 = (a_0, a_1, \ldots, a_{p+2})$ in the modular multiplication unit can be completed in p+2 clock cycles, with one digit output in each clock cycle. The precalculation module clears RC and RS and at the same time computes $a_0$ and $(B1_{a_0}, B2_{a_0})$. Once this precalculation is complete, the loop computation of the other two modules begins. As shown in Fig. 1, $(B1_{a_{i0}}, B2_{a_{i0}})$ and $(B1_{a_{i1}}, B2_{a_{i1}})$ can be computed in two consecutive clock cycles. As shown in Fig. 6, $a_0$ is computed in the first clock cycle; $a_1$ and $(B1_{a_{00}}, B2_{a_{00}})$ are computed in parallel in the second; $a_2$, $(B1_{a_{10}}, B2_{a_{10}})$, and $(B1_{a_{01}}, B2_{a_{01}})$ in the third; and $a_3$, $(B1_{a_{20}}, B2_{a_{20}})$, $(B1_{a_{11}}, B2_{a_{11}})$, and (RC, RS) in the fourth, at which point a four-stage pipeline has been established. In the i-th clock cycle, $a_i$, $(B1_{a_{i-1,0}}, B2_{a_{i-1,0}})$, $(B1_{a_{i-2,1}}, B2_{a_{i-2,1}})$, and (RC, RS) are computed in parallel, where $a_i = a_{i1} 2^{H/2} + a_{i0}$ ($0 \le a_{i0}, a_{i1} < 2^{H/2}$). Once t, $B1_{a_i}$, and $B2_{a_i}$ are available, the update of RC and RS completes in one clock cycle:

$$(C, S) := \mathrm{CSA}(RC, RS, M_t) \quad \text{(or } \mathrm{CSA4TO2}(RC, RS, M_{t_1} \ll H/2, M_{t_0})\text{)},$$

$$(RC, RS) := \mathrm{CSA4TO2}(C \gg H, S \gg H, B1_{a_i}, B2_{a_i}),$$

$$rc_0 := rc_0 + (c_{H-1} \wedge s_{H-1}).$$

Bit 0 of $rc_0$ above is always 0, so computing $rc_0 + (c_{H-1} \wedge s_{H-1})$ only requires replacing bit 0 of $rc_0$ with the value of $(c_{H-1} \wedge s_{H-1})$.

The modular multiplication unit must perform l = p+3 loop iterations (l = p+2 for the modular multiplication in the end stage of the main operation). The loop is organized in time into three levels: the first level computes $a_i$, the second computes $(B1_{a_i}, B2_{a_i})$, and the third computes (RC, RS). One modular multiplication therefore needs 2 to 3 clock cycles for the precalculation module and l clock cycles for the remaining computation. Suppose the modulus N is k = 512 bits long and H = 8; then in the radix $\beta = 2^8$ expansion $N = (n_{p-1} \ldots n_1 n_0)_\beta$ we have p = 64, so l = 67 (or 66). As shown in Fig. 6, the modular multiplication unit sets up a four-stage pipeline, so one modular multiplication costs l+3 = 70 clock cycles in total. With the ordinary radix-2 method, one modular multiplication costs k+2 = 514 clock cycles (clock period 9.5 ns, resource usage 80,000). With the ordinary systolic-array method, building m cascaded processing elements (fast carry-chain structure, avoiding an overly long carry chain) with radix $2^4$ and m = 128, one modular multiplication costs 2m+3 = 259 clock cycles (clock period 20.7 ns, occupying 3413 CLBs).

After the modular exponentiation main operation unit finishes, the data output recovery unit is entered; as shown in Fig. 4, its inputs are CP, SP, N, and n', where CP and SP are the data 316 and 317 of Fig. 3. When the data output recovery unit finishes, $Y = X^E \bmod N$ is obtained.

In actual hardware development, multiple clock settings can be used so that the modular multiplication unit runs at its highest possible clock frequency, increasing its processing speed. The two parallel modular multiplication units in the modular exponentiation loop body can also be replaced by a single modular multiplication unit with exponent window processing, which cuts the hardware cost by nearly half but increases the number of modular multiplications.

The modular multiplication scheme of the invention was implemented (H = 8) on an FPGA (Stratix-epls10f780c6 chip). Test results show a hardware cost below 100,000 and a maximum frequency of 126 MHz (clock period 8 ns); its speed is about 8 times that of ordinary implementations.
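The cycle counts and clock periods quoted above can be combined into a rough time per 512-bit modular multiplication. The short calculation below uses only figures stated in the text; times are kept in picoseconds to stay in integer arithmetic, and the comparison is indicative only because the three results come from different devices.

```python
k, H = 512, 8
p = k // H                  # 64 radix-2^8 digits
l = p + 3                   # 67 loop iterations
cycles_new = l + 3          # 70 cycles with the four-stage pipeline
cycles_bin = k + 2          # 514 cycles, radix-2 method
m = 128
cycles_sys = 2 * m + 3      # 259 cycles, systolic-array method

# Clock periods in picoseconds: 8 ns, 9.5 ns, 20.7 ns.
t_new = cycles_new * 8000
t_bin = cycles_bin * 9500
t_sys = cycles_sys * 20700

print("this method  :", t_new / 1000, "ns")
print("radix-2      :", t_bin / 1000, "ns", "ratio", round(t_bin / t_new, 2))
print("systolic     :", t_sys / 1000, "ns", "ratio", round(t_sys / t_new, 2))
```

The radix-2 ratio comes out between 8 and 9, which is consistent with the "about 8 times" figure reported for the test implementation.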

Claims

1. A high-radix hardware implementation method for a large-number modular exponentiation system, in which data are input to a programmable logic device or an ASIC chip that performs the modular exponentiation, characterized in that the method is divided into five units, namely an initialization processing unit, a parallel addition processing unit (CSA parallel addition), a modular multiplication unit, a modular exponentiation main operation unit, and a data output recovery unit, wherein

$$n' = \beta - n_0^{-1} \bmod \beta, \qquad R = \beta^{p+2} \bmod N, \qquad R2 = R^2 \bmod N$$

$$M = n' \times N = (m_p \ldots m_1 m_0)_\beta, \qquad M_j = j \times M \quad (j = 0, \ldots, \beta - 1)$$

The following data are stored in ROM in binary form:

$$n',\ N,\ R,\ R2,\ \{M_j,\ j = 0, \ldots, \beta - 1\}$$

B. Modular multiplication unit: the table $MX = \{M_j,\ j = 0, \ldots, \beta - 1\}$ is stored in ROM. The inputs are (A1, A2), (B1, B2) and the cycle count l; the outputs are (CY, SY). The unit runs through l (≤ p+3) loop iterations, and the loop computation is divided into three modules,

(1) precalculation module

(2) dynamic parallel addition module

(3) loop feedback computation module

2. The high-radix hardware implementation method for a large-number modular exponentiation system according to claim 1, characterized in that the parallel addition processing unit operates as follows,

$$(C, S) = \mathrm{CSA4TO2}(X, Y, Z, W) := \mathrm{CSA}(\mathrm{CSA}(X, Y, Z), W)$$

Given

$$x = (x_{H/4-1} \ldots x_1 x_0)_{2^4} = 2^{4(H/4-1)} x_{H/4-1} + 2^{4(H/4-2)} x_{H/4-2} + \ldots + x_0$$

3. The high-radix hardware implementation method for a large-number modular exponentiation system according to claim 2, characterized in that the modular multiplication unit is given the intermediate register variables $RC = (rc_{p+1} \ldots rc_1 rc_0)_\beta$ and $RS = (rs_{p+1} \ldots rs_1 rs_0)_\beta$ and the signal variables C and S, and the three modules of the loop computation are, specifically,

B. Dynamic parallel addition module: in the i-th iteration (i = 0, 1, ..., l-1), ordinary addition of A1 + A2 yields the digit $a_i = a_{i1} 2^{H/2} + a_{i0}$ ($0 \le a_{i0}, a_{i1} < 2^{H/2}$), where $A1 + A2 = (a_{p+2} \ldots a_1 a_0)_\beta$. According to $a_{i0}$ and $a_{i1}$, the parallel addition processing module is used to obtain $(B1_{a_{i0}}, B2_{a_{i0}})$ and $(B1_{a_{i1}}, B2_{a_{i1}})$, and $(B1_{a_i}, B2_{a_i})$ is then computed as:

$$(B1_{a_i}, B2_{a_i}) := \mathrm{CSA4TO2}(B1_{a_{i0}}, B2_{a_{i0}}, 2^{H/2} B1_{a_{i1}}, 2^{H/2} B2_{a_{i1}})$$

C. Loop feedback computation module: in the (i+1)-th iteration (i = 0, 1, ..., l-1), compute $t = rc_0 + rs_0$ and use the values $B1_{a_i}$ and $B2_{a_i}$ computed in the i-th iteration, together with the value $M_t$ from ROM (addressed by t, or by $t_0$ and $t_1$, where $t = t_1 2^{H/2} + t_0$), to update RC and RS as follows,

$$(C, S) := \mathrm{CSA}(RC, RS, M_t) \quad \text{(or } \mathrm{CSA4TO2}(RC, RS, M_{t_1} \ll H/2, M_{t_0})\text{)}$$

$$(RC, RS) := \mathrm{CSA4TO2}(C \gg H, S \gg H, B1_{a_i}, B2_{a_i})$$

$$rc_0 := rc_0 + (c_{H-1} \wedge s_{H-1}).$$

4. The high-radix hardware implementation method for a large-number modular exponentiation system according to claim 1, 2, or 3, characterized in that, in the three stages of the modular exponentiation main operation unit,

B. In the loop stage, entered after the initial stage, the loop is executed h times; in each iteration unit-1 and unit-2 run synchronously in parallel with l = p+3. In the i-th iteration (i = 0, ..., h-1), when $e_i = 0$ the data in CP and SP remain unchanged (they are not updated by the outputs of unit-2).

5. The high-radix hardware implementation method for a large-number modular exponentiation system according to claim 1, characterized in that, in the data output recovery unit, if Y < N, Y is output; otherwise, if Y ≥ N, Y - N is output.