CN1811698A - Hardware high-density realizing method for great number modules and power system - Google Patents

Info

Publication number
CN1811698A
Authority
CN
China
Prior art keywords
unit
data
output
main body
csa
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200610020386
Other languages
Chinese (zh)
Other versions
CN100435091C (en)
Inventor
王金波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Westone Information Industry Inc
Original Assignee
Chengdu Westone Information Industry Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2006-03-01
Publication date
2006-08-02
Application filed by Chengdu Westone Information Industry Inc
Priority to CNB2006100203868A
Publication of CN1811698A
Application granted
Publication of CN100435091C
Expired - Fee Related

Abstract

The present invention discloses a high-radix hardware implementation method for a large-number modular exponentiation system. It relates to hardware implementation of modular exponentiation in public-key cryptosystems and addresses the low efficiency and lack of generality of existing techniques when processing high-radix (radix $2^H$) data. The invention divides high-radix modular exponentiation into an initialization processing unit, a parallel addition processing unit, a modular multiplication unit, a modular exponentiation main-body operation unit, and a data output recovery unit, and uses simple logic to realize the modular multiplication and the main-body modular exponentiation on high-radix data. Compared with existing techniques, the main-body operation uses only simple logic such as OR, XOR, and AND, achieves a high operating frequency, and can raise the hardware's data-processing capability by a factor of H. It can be used for hardware modular exponentiation in public-key cryptosystems.

Description

High-radix hardware implementation method for a large-number modular exponentiation system
Technical field
The present invention relates to hardware implementation of modular exponentiation in public-key cryptosystems, and in particular to a high-radix (radix $2^H$) implementation method that, in order to improve data-processing efficiency in large-scale modular exponentiation, constructs a dynamic parallel addition and a matching table of stored initialization data, and uses simple logic to realize the modular multiplication and the main-body modular exponentiation.
Background technology
The efficiency of modular multiplication and modular exponentiation is key to the running efficiency of public-key cryptosystems. Large-number modular arithmetic implemented with the traditional division-for-remainder and summation approach is not efficient enough. Among the various modular multiplication algorithms, Montgomery multiplication is one of the most effective; its basic idea is to replace ordinary division with a series of additions and shifts. Montgomery multiplication has become a basic processing unit in public-key cryptosystems.
When two or more integers are added in hardware, the addition can be carried out bitwise in parallel, producing two outputs: C, which holds the carry information of every bit position, and S, which holds the XOR information of every bit position. Such a carry-save adder (Carry Save Adder, abbreviated CSA below) performs addition without a carry chain. Let "⊕" denote bitwise XOR, "∧" bitwise AND, "∨" bitwise OR, and ":=" assignment of the value of the right-hand expression to the left-hand side. A CSA addition of three integers X, Y, Z outputs C and S satisfying 2C+S=X+Y+Z, and is computed as:
C:=(X∧Y)∨(X∧Z)∨(Y∧Z), S:=X⊕Y⊕Z.
Thus a CSA can add integers of any bit length in parallel within one clock cycle, but it does not complete a full addition: the result remains a carry/sum pair. A CSA is therefore unsuitable for an ordinary single addition, but it is highly efficient where additions must be repeated many times in a loop.
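The CSA formula above can be sketched in Python on arbitrary-precision integers. This is only an illustration of the bitwise definition; the function name `csa` is introduced here and does not appear in the patent.

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """Carry-save add three non-negative integers.

    Returns (C, S) such that 2*C + S == x + y + z.
    """
    c = (x & y) | (x & z) | (y & z)  # per-bit majority: carry out of each position
    s = x ^ y ^ z                    # per-bit sum, no carry propagation
    return c, s


if __name__ == "__main__":
    c, s = csa(0b1101, 0b0111, 0b1001)
    # The pair (C, S) still has to be combined by one ordinary addition.
    print(2 * c + s == 0b1101 + 0b0111 + 0b1001)
```

Because every bit position is computed independently, the depth of the logic does not grow with the operand width, which is why the pair is only resolved by an ordinary addition once, at the very end of a long loop.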
A popular way to accelerate public-key cryptographic operations is to implement modular exponentiation in programmable logic devices such as FPGAs or CPLDs, or in ASIC chips, or to implement it as a dedicated hardware component that upper layers call through an IP core (Intellectual Property) interface. At present, hardware implementations of the large-number modular multiplication used in modular exponentiation fall into two basic types: the first realizes Montgomery modular multiplication with parallel CSA addition and division by 2; the second realizes Montgomery modular multiplication by processing high-radix data with systolic arrays. Let k be the bit length of the modulus and d the bit length of the private exponent in an RSA system. With the first method, one Montgomery modular multiplication needs only k+2 clock cycles, and an RSA signature takes (k/2+2)(d/2+3) clock cycles. Taking k = d = 1024, an RSA system realized this way reaches a minimum clock period of 9.5 ns (device XC2V1500-8; the modular multiplier occupies 80,000 gates). The second method makes full use of the fast carry-chain adder structure that some devices provide and constructs m cascaded arithmetic units to avoid an overlong carry chain; operating in this way it processes high-radix (radix $2^{k/m}$) data. One Montgomery modular multiplication needs (2m+3) clock cycles, and an RSA signature takes (m+20)(d/2+2) clock cycles. Taking m = 128 and radix $2^4$, an RSA system realized this way reaches a minimum clock period of 20.7 ns (device XC40150XV-8; the modular multiplier occupies 3413 CLBs).
The first method is designed with simple logic and parallel addition; it has a short clock period and is easy to port, but it does not break out of the radix-2 binary processing mode, which limits data-processing efficiency. The second method can process high-radix data and uses pipelining, but because of the inherent delays of the device the array cannot be made too large; its achievable frequency depends closely on the characteristics of the specific device, so the design lacks portability. In summary, the first method is simple in design and reaches a higher frequency, while the second is more complex and reaches a lower frequency. However, the former handles only radix-2 data while the latter handles high-radix data, so their overall speeds differ little.
The above analysis shows that neither of the two kinds of method is optimal for realizing modular exponentiation in hardware. The parallel-addition and divide-by-2 design cannot handle high-radix (radix $2^H$) data, which limits data-processing efficiency; the high-radix array method has a lower achievable frequency, and its efficiency depends closely on the characteristics of the specific device, so it lacks portability and generality.
Summary of the invention
The object of the present invention is to solve the problems of existing modular exponentiation implementations, namely insufficient efficiency, low achievable frequency, and lack of generality, by providing a high-radix hardware implementation method for a large-number modular exponentiation system that uses simple logic, effectively improves data-processing efficiency, and allows high-speed implementation on various types of hardware chips.
The object of the invention is achieved by the following technical solution:
A high-radix hardware implementation method for a large-number modular exponentiation system, in which data are input to a programmable logic device or an ASIC chip to perform modular exponentiation, characterized in that the method is divided into five units: an initialization processing unit, a parallel addition processing unit (CSA parallel addition), a modular multiplication unit, a modular exponentiation main-body operation unit, and a data output recovery unit, wherein
A. Initialization processing unit: the modulus N of the modular exponentiation is expanded in radix $\beta = 2^H$ as $N = (n_{p-1} \ldots n_1 n_0)_\beta$, where the digits $n_i$ ($i = 0, \ldots, p-1$) are ordered from low to high; take
$n' = \beta - n_0^{-1} \bmod \beta$, $R = 2^{p+2} \bmod N$, $R2 = R^2 \bmod N$,
$M = n' \times N = (m_p \ldots m_1 m_0)_\beta$, $M_j = j \times M$ ($j = 0, \ldots, \beta-1$).
The following data are stored in ROM in binary form:
$n'$, $N$, $R$, $R2$, $\{M_j,\ j = 0, \ldots, \beta-1\}$
B. Modular multiplication unit: $MX = \{M_j,\ j = 0, \ldots, \beta-1\}$ is stored in ROM; the inputs are (A1, A2), (B1, B2), and the loop count l; the outputs are (CY, SY). The modular multiplication unit runs through l (≤ p+3) loop iterations, and the loop computation is divided into three modules:
(1) precomputation module
(2) dynamic parallel addition module
(3) loop feedback computation module
C. Modular exponentiation main-body operation unit: the modulus of the modular exponentiation is N, and $Y = X^E \bmod N$ is computed. The input exponent is $E = (e_{h-1}, e_{h-2}, \ldots, e_1, e_0)_2$ with most significant bit $e_{h-1} = 1$ ($0 \le h < p \times H$); the input plaintext is $X = (x_{p-1}, x_{p-2}, \ldots, x_1, x_0)_\beta < N$; let the output of the main-body operation be W. The operation proceeds in three stages:
(1) Initial stage: the main-body operation contains two modular multiplication units side by side, unit-1 and unit-2, which correspond to the two groups of output variables CZ and SZ, and CP and SP, respectively;
(2) Loop stage: after the initial stage, the operation enters a loop of h iterations. In each iteration unit-1 and unit-2 run synchronously in parallel with l = p+3; the inputs of unit-1 are CZ, SZ, CZ, SZ and its outputs are CZ and SZ; the inputs of unit-2 are CP, SP, CZ, SZ and its outputs are CP and SP;
(3) End stage: after the loop stage, take l = p+2 and run unit-2 with inputs CP, SP, 0, 1; its outputs are CP and SP;
D. Data output recovery unit: compute W = CP + SP; then, using $W = (w_{p-1}, w_{p-2}, \ldots, w_1, w_0)_\beta$ and $q = n' \times w_0 \bmod \beta$, compute $Y := (W + q \times N)/\beta$ to obtain $Y = X^E \bmod N$.
The parallel addition processing unit operates as follows:
The carry output of each CSA is doubled (shifted left by one bit), so that C+S=X+Y+Z+W. With inputs (X, Y, Z, W) and outputs (C, S), the CSA addition formula is
$(C, S) = \mathrm{CSA4TO2}(X, Y, Z, W) := \mathrm{CSA}(\mathrm{CSA}(X, Y, Z), W)$
Given a data vector (X, Y), let $(X_j, Y_j) = j \times (X + Y)$, that is, $X_j + Y_j = j(X + Y)$. For odd j ($j < 2^4$):
$(X_3, Y_3) := \mathrm{CSA4TO2}(X, Y, 2X, 2Y)$,
$(X_5, Y_5) := \mathrm{CSA4TO2}(X, Y, 4X, 4Y)$,
$(X_7, Y_7) := \mathrm{CSA4TO2}(X_3, Y_3, 4X, 4Y)$,
$(X_9, Y_9) := \mathrm{CSA4TO2}(X_5, Y_5, 4X, 4Y)$,
$(X_{11}, Y_{11}) := \mathrm{CSA4TO2}(X_3, Y_3, 8X, 8Y)$,
$(X_{13}, Y_{13}) := \mathrm{CSA4TO2}(X_5, Y_5, 8X, 8Y)$,
$(X_{15}, Y_{15}) := \mathrm{CSA4TO2}(X_5, Y_5, 8X, 8Y)$,
Thus, for any j ($0 \le j < 2^4$) and data vector (X, Y), $(X_j, Y_j)$ is obtained with at most two layers of CSA4TO2 computation and simple shifts. Further, given any $x = (x_{H/4-1} \ldots x_1 x_0)_{2^4} = 2^{4(H/4-1)} x_{H/4-1} + 2^{4(H/4-2)} x_{H/4-2} + \ldots + x_0$ ($H > 4$, $0 \le x_i < 2^4$) together with X and Y, a further $(\log_2 H - 2)$ layers (in total $2^0 + 2^1 + \ldots + H/8$ CSA4TO2 operations) and simple shifts yield $(X_x, Y_x) := x \times (X + Y)$.
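The 4-to-2 reduction and the table of small odd multiples can be sketched in Python as follows. This is a minimal illustration on plain integers; the names `csa`, `csa4to2`, and `odd_multiples` are introduced here. Note that the expression listed for j = 15 is identical to the one for j = 13 and therefore sums to 13(X+Y); the sketch instead assumes 15 = 3 + 4×3, built from shifted copies of $(X_3, Y_3)$, which keeps the depth at two CSA4TO2 layers. That choice is an assumption, not taken from the text.

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """3-to-2 carry-save addition: 2*C + S == x + y + z."""
    return (x & y) | (x & z) | (y & z), x ^ y ^ z


def csa4to2(x: int, y: int, z: int, w: int) -> tuple[int, int]:
    """4-to-2 reduction with doubled carries: C + S == x + y + z + w."""
    c1, s1 = csa(x, y, z)
    c2, s2 = csa(c1 << 1, s1, w)  # first carry is doubled before it is reused
    return c2 << 1, s2            # second carry is doubled on output


def odd_multiples(x: int, y: int) -> dict[int, tuple[int, int]]:
    """Return {j: (Xj, Yj)} for odd j < 16 with Xj + Yj == j * (x + y)."""
    m = {1: (x, y)}
    m[3] = csa4to2(x, y, x << 1, y << 1)
    m[5] = csa4to2(x, y, x << 2, y << 2)
    m[7] = csa4to2(*m[3], x << 2, y << 2)
    m[9] = csa4to2(*m[5], x << 2, y << 2)
    m[11] = csa4to2(*m[3], x << 3, y << 3)
    m[13] = csa4to2(*m[5], x << 3, y << 3)
    x3, y3 = m[3]
    m[15] = csa4to2(x3, y3, x3 << 2, y3 << 2)  # assumed form: 15 = 3 + 4*3
    return m


if __name__ == "__main__":
    table = odd_multiples(0x1234, 0xBEEF)
    print(all(a + b == j * (0x1234 + 0xBEEF) for j, (a, b) in table.items()))
```

Even multiples need no further additions: shifting both members of the pair for an odd j left by one or more bits gives 2j, 4j, and so on.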
The modular multiplication unit is given intermediate register variables $RC = (rc_{p+1} \ldots rc_1 rc_0)_\beta$ and $RS = (rs_{p+1} \ldots rs_1 rs_0)_\beta$, and signal variables C and S. The three modules of the loop computation are as follows:
A. Precomputation module: clear RC and RS, and at the same time compute $a_0$ and $(B1_{a_0}, B2_{a_0})$;
B. Dynamic parallel addition module: in the i-th iteration ($i = 0, 1, \ldots, l-1$), obtain $a_i = a_{i1} 2^{H/2} + a_{i0}$ ($0 \le a_{i0}, a_{i1} < 2^{H/2}$) from A1+A2 by ordinary addition, where $A1 + A2 = (a_{p+2} \ldots a_1 a_0)_\beta$. If H ≤ 4, compute $(B1_{a_i}, B2_{a_i})$ directly by the $(X_x, Y_x)$ procedure above. If H > 4, use the parallel addition processing module with $a_{i0}$ and $a_{i1}$ to obtain $(B1_{a_{i0}}, B2_{a_{i0}})$ and $(B1_{a_{i1}}, B2_{a_{i1}})$, and then compute $(B1_{a_i}, B2_{a_i})$ as follows:
$(B1_{a_i}, B2_{a_i}) := \mathrm{CSA4TO2}(B1_{a_{i0}}, B2_{a_{i0}}, 2^{H/2} B1_{a_{i1}}, 2^{H/2} B2_{a_{i1}})$
C. Loop feedback computation module: in the (i+1)-th iteration ($i = 0, 1, \ldots, l-1$), compute $t = rc_0 + rs_0$, and use the values $B1_{a_i}$ and $B2_{a_i}$ computed in the i-th iteration together with the value $M_t$ in ROM (addressed by t, or by $t_0$ and $t_1$, where $t = t_1 2^{H/2} + t_0$) to update RC and RS as follows:
$(C, S) := \mathrm{CSA}(RC, RS, M_t)$ (or $\mathrm{CSA4TO2}(RC, RS, M_{t_1} \ll H/2, M_{t_0})$)
$(RC, RS) := \mathrm{CSA4TO2}(C \gg H, S \gg H, B1_{a_i}, B2_{a_i})$
$rc_0 := rc_0 + (c_{H-1} \wedge s_{H-1})$
where $c_{H-1}$ and $s_{H-1}$ denote bit H-1 of C and S, "$\gg H$" denotes a right shift of the data by H bits, and "$\ll H/2$" denotes a left shift of the data by H/2 bits.
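The arithmetic behind the loop feedback step can be illustrated with a simplified Python model. It is a sketch under stated assumptions, not the hardware datapath: ordinary integers replace the carry-save pairs (RC, RS) and (B1, B2), so the single-bit correction of $rc_0$ is not needed, and the names `build_table` and `modmul_model` are introduced here. The point it shows is that $M = n' \times N \equiv -1 \pmod{\beta}$, so adding the table entry $M_t$ clears the lowest digit and the shift by H bits is exact.

```python
def build_table(N: int, H: int) -> list[int]:
    """Precompute M_j = j * M for j < beta, with M = n' * N and N odd."""
    beta = 1 << H
    n0 = N % beta
    n_prime = (beta - pow(n0, -1, beta)) % beta  # n' = beta - n0^{-1} mod beta
    M = n_prime * N                              # M is congruent to -1 mod beta
    return [j * M for j in range(beta)]


def modmul_model(A: int, B: int, table: list[int], H: int, l: int) -> int:
    """Integer model of the digit-serial loop.

    Returns R with R * beta**(l-1) congruent to A * B modulo N,
    assuming A < beta**l. R is not fully reduced modulo N.
    """
    beta = 1 << H
    R = 0
    for i in range(l):
        a_i = (A >> (H * i)) & (beta - 1)      # i-th radix-beta digit of A
        t = R & (beta - 1)                     # lowest digit selects the ROM entry
        R = ((R + table[t]) >> H) + a_i * B    # exact shift, then add a_i * B
    return R


if __name__ == "__main__":
    N, H = 0xC96F1D3B5A7F, 8   # arbitrary odd 48-bit modulus, p = 6 digits
    l = 6 + 3
    R = modmul_model(0x1234567890AB, 0x0FEDCBA98765, build_table(N, H), H, l)
    print((R * (1 << H) ** (l - 1)) % N == (0x1234567890AB * 0x0FEDCBA98765) % N)
```

The only operations inside the loop are a table lookup, additions, and a shift, which is what allows the hardware version to replace them with CSA stages.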
In the three stages of the modular exponentiation main-body operation unit:
A. In the initial stage, take l = p+3 and run unit-1 with R2, 0, X, 0 as inputs; its outputs CZ and SZ are fed back to the input buffers of unit-1 and unit-2, and CP and SP are initialized with R and 0 respectively;
B. In the loop stage, which is entered after the initial stage and runs for h iterations, unit-1 and unit-2 run synchronously in parallel with l = p+3 in each iteration; in the i-th iteration ($i = 0, \ldots, h-1$), when $e_i = 0$ the data in CP and SP remain unchanged (they are not updated with the output of unit-2).
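The dataflow of the loop stage corresponds to right-to-left binary exponentiation: unit-1 squares the value held in (CZ, SZ) in every iteration, and the product held in (CP, SP) takes the output of unit-2 only when the exponent bit is 1. The sketch below shows that control structure only. As a labeled simplification, ordinary modular multiplication stands in for the two Montgomery units, so the conversion constants R2 and R and the end-stage multiplication by 1 do not appear; the function name `modexp_dataflow` is introduced here.

```python
def modexp_dataflow(X: int, E: int, N: int) -> int:
    """Right-to-left square-and-multiply with two multipliers per iteration."""
    Z = X % N   # plays the role of (CZ, SZ): the running square
    P = 1 % N   # plays the role of (CP, SP): the running product
    for i in range(E.bit_length()):
        e_i = (E >> i) & 1
        P_next = (P * Z) % N   # unit-2 runs in every iteration
        Z = (Z * Z) % N        # unit-1 runs in parallel on the old Z
        if e_i:
            P = P_next         # gate: the product is updated only when e_i = 1
    return P


if __name__ == "__main__":
    print(modexp_dataflow(5, 117, 1009) == pow(5, 117, 1009))
```

Because both units consume the same old value of Z, they have no data dependency within an iteration and can run synchronously, at the cost of discarding the output of unit-2 whenever the exponent bit is 0.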
In the data output recovery unit, if Y < N, Y is output; otherwise (Y ≥ N), Y - N is output.
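The recovery step is a single Montgomery reduction by one digit followed by a conditional subtraction. A minimal Python sketch follows, assuming W = CP + SP is smaller than $\beta \times N$ so that one subtraction is enough; the function name `output_recovery` is introduced here, and n′ is recomputed rather than read from ROM.

```python
def output_recovery(CP: int, SP: int, N: int, H: int) -> int:
    """Combine the carry-save pair and remove one factor of beta modulo N."""
    beta = 1 << H
    n_prime = (beta - pow(N % beta, -1, beta)) % beta  # n' as in initialization
    W = CP + SP                       # resolve the carry-save pair once
    q = (n_prime * (W % beta)) % beta
    Y = (W + q * N) >> H              # exact: W + q*N is a multiple of beta
    return Y - N if Y >= N else Y     # final conditional subtraction


if __name__ == "__main__":
    N, H = 0xC96F1D3B5A7F, 8
    Y = output_recovery(123456789, 987654321, N, H)
    print(0 <= Y < N and (Y << H) % N == (123456789 + 987654321) % N)
```

This unit runs once per exponentiation, which is why the text can leave it to software without affecting throughput.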
The beneficial effects of the invention are as follows. Simple logic is used to realize high-radix (radix $2^H$) large-number modular multiplication and the main-body modular exponentiation; data are processed in high-radix (radix $2^H$) form; the main-body operation uses only simple logic such as OR, XOR, and AND, so a high operating frequency is achieved; and the method is independent of the characteristics of any specific device and is highly portable. Within the whole modular exponentiation, the amount of computation in the data output recovery unit is very small, so that unit can be completed in software.
A modular exponentiation system realized with the present invention obtains higher data-processing capability and faster system response. Specifically, the main advantages of the invention are:
(1) Performing Montgomery multiplication in radix $2^H$ (H > 1), rather than in binary, multiplies the hardware's data-processing capability (by nearly H times).
(2) The modular multiplication unit designed in the invention uses only simple logic such as OR, XOR, and AND, and avoids complex operations such as multiplication and subtraction; this makes it easy to implement on various hardware and helps raise the clock frequency.
(3) The invention provides an effective data output recovery unit, so that the modular exponentiation main-body operation unit becomes the computational core; operations outside the main body can then be realized with the help of software, which further reduces hardware size and implementation difficulty.
(4) The characteristics of the specific device do not affect the method or its results; the invention is therefore highly portable and suitable for implementation on hardware platforms such as ASICs, CPLDs, and FPGAs.
(5) With the modular multiplication unit of the invention, an ordinary FPGA chip implementing a 512-bit modular multiplication (70 clock cycles in total) easily reaches a clock frequency above 120 MHz. Compared with FPGA implementations of other methods reported in the open literature, its speed has a significant advantage.
The invention is applicable to public-key cryptosystems such as RSA and DSA that have strict speed requirements (for example, signing rates above several thousand per second), and to the hardware development of large-number modular exponentiation (or modular multiplication) units in other application systems.
Description of drawings
Fig. 1 is a block diagram of the modular multiplication unit of the invention;
Fig. 2 is an example block diagram of the parallel addition processing module of the invention;
Fig. 3 is a block diagram of the modular exponentiation main-body operation unit of the invention;
Fig. 4 is a block diagram of the data output recovery unit of the invention;
Fig. 5 is an overall block diagram of the modular exponentiation of the invention;
Fig. 6 is an example diagram of the pipelined operation of the modular multiplication of the invention.
Reference numerals in Fig. 1: 100 to 103 are the 4 inputs of the modular multiplication; 121 to 122 are its 2 outputs; 104 is the ROM data produced by initialization; 105 and 106 are parallel addition processing modules, which produce the indicated results; 108 is an ordinary radix-$2^H$ adder, and 107 is its i-th output (two half-words); 109 is a left shift of the data by H/2 bits; 110 and 120 are 4-input parallel CSA additions; 111 is an H-bit word addition; 112 addresses the ROM with the output of 111 to obtain the data of 115; 113 to 115 are register units; 116 is a 3-input parallel CSA addition; 117 is a single-bit AND operation; 118 to 119 are left shifts by H bits.
In Fig. 2, $x < 2^4$. Reference numerals: 200 to 202 are the 3 inputs; 203 to 204 are left shifts of the data by n bits, 205 to 206 are left shifts by s bits, 211 is a left shift by 2k bits, and 212 is a left shift by 2g bits, where n, s, k, and g are determined by 202; 209 to 210 are the possible outputs; 207 to 208 are 4-input parallel CSA additions; 213 is the output data vector.
Reference numerals in Fig. 3: 301 to 306 are the 6 inputs; 307 is the signal carrying the i-th bit of the data of 306; 316 to 317 are the 2 outputs; 308 to 309 are the 2 modular multiplication units side by side; 310 to 313 are the register units CZ, SZ, CP, and SP respectively; 314 is a data gate controlled by 307; 315 is the control signal generator, in which l is the inner loop counter of the 2 modular multiplication units, h is the outer loop counter of the main-body operation, and clk is the clock control signal.
Reference numerals in Fig. 4: 400 to 403 are the 4 inputs; 413 is the output; 404 is an ordinary radix-$2^H$ adder; 405 is the lowest H bits of the output of 404; 406 is an H-bit multiplication; 407 and 408 are the outputs of 404 and 406 respectively; 409 is an ordinary radix-$2^H$ multiply-add and right-shift operation; 410 is the output of 409; 411 compares Y with N; 412 is an ordinary radix-$2^H$ subtractor.
Reference numerals in Fig. 5: 500 is the 2 inputs of the modular exponentiation; 505 is its output; 501 is the initialization processing unit, whose output is stored in ROM; 502 is the modular exponentiation main-body operation unit; 503 is the RAM needed by the parallel addition processing module (when the pre-storage mode is used) or by other units; 504 is the data output recovery unit.
Embodiment
The invention is further described below with reference to the drawings and a specific embodiment.
The large-number modular multiplication module is the core unit of the public-key cryptosystems in use today and is the loop body of modular exponentiation. Because modulus length and exponent length differ, the size of the modular multiplication and the number of iterations differ across the modular exponentiations of different public-key cryptosystems. For example, suppose the modular exponentiation module of a system (which essentially determines the system's running time) needs h modular multiplication iterations in total, each modular multiplication takes l clock cycles, and calling the modular multiplication module adds d clock cycles; the modular exponentiation module then takes h(l+d) clock cycles in total. Since d is generally between 0 and 2, the cost of the large-number modular multiplication essentially determines the efficiency of the modular exponentiation system.
Taking radix $2^8$ (H = 8) as an example, the following describes the clock cost of the modular multiplication unit and its relation to operating efficiency in the scheme of the invention.
The method of the invention for performing high-radix modular exponentiation in hardware, shown in Fig. 5, comprises five units: the initialization processing unit 501, the parallel addition processing unit, the modular multiplication unit, the modular exponentiation main-body operation unit 502, and the data output recovery unit 504. The main-body operation unit contains two parallel modular multiplication units, and each modular multiplication unit contains the parallel addition processing unit. X and E are supplied at the modular exponentiation input 500 and processed by these five units, and $Y = X^E \bmod N$ is finally output.
The initialization processing unit runs first and stores the data $\{M_j,\ j = 0, \ldots, \beta-1\}$ in ROM in binary form. If the ROM is small, only $M_j = j \times M$, $j = 0, \ldots, 2^4-1$, are computed and stored, because for any $x = x_1 2^4 + x_0$ ($0 \le x_0, x_1 < 2^4$) and M, $x \times M$ is obtained as the pair $(2^4 M_{x_1}, M_{x_0})$. In the modular multiplication of the main-body operation unit, the two inputs $(2^4 M_{t_1}, M_{t_0})$ can then replace $M_t$.
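The reduced-ROM variant can be checked with a short Python sketch for H = 8. The modulus value and the names `small_table` and `split_lookup` are assumptions made for this illustration; only the 16 entries $M_0, \ldots, M_{15}$ are stored, and every $M_t$ for $t < 2^8$ is recovered as the sum of a shifted entry and an unshifted entry.

```python
def small_table(N: int, H: int) -> tuple[int, list[int]]:
    """Return (M, [M_0 .. M_15]) with M = n' * N, for an odd modulus N."""
    beta = 1 << H
    n_prime = (beta - pow(N % beta, -1, beta)) % beta
    M = n_prime * N
    return M, [j * M for j in range(1 << 4)]


def split_lookup(t: int, table: list[int]) -> tuple[int, int]:
    """Return the pair (2^4 * M_{t1}, M_{t0}) for t = t1 * 2^4 + t0."""
    t1, t0 = t >> 4, t & 0xF
    return table[t1] << 4, table[t0]


if __name__ == "__main__":
    M, table = small_table(0xC96F1D3B5A7F, 8)   # arbitrary odd modulus
    print(all(sum(split_lookup(t, table)) == t * M for t in range(256)))
```

The two members of the pair are fed to the 4-to-2 reduction as separate addends, so the ROM shrinks from 256 entries to 16 at the cost of one extra CSA input.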
After initialization, the modular exponentiation main-body operation unit, shown in Fig. 3, is entered; it works by calling the modular multiplication unit repeatedly. The modular multiplication unit is described below with reference to Fig. 1.
The modular multiplication unit of the invention is shown in Fig. 1; its loop computation is divided into three modules: the precomputation module, the dynamic parallel addition module, and the loop feedback computation module.
The parallel addition processing module of the modular multiplication unit is shown in Fig. 2. For any j ($0 \le j < 2^4$) and data vector (X, Y), $(X_j, Y_j)$ is obtained with at most two layers of CSA4TO2 computation and simple shifts. Thus, for any $x = (x_1 x_0)_{2^4} = 2^{4(2-1)} x_1 + 2^0 x_0$ ($0 \le x_i < 2^4$) together with X and Y, CSA4TO2 computation and simple shifts (modules 105 to 106 in Fig. 1) yield $(X_x, Y_x) := x \times (X + Y)$, corresponding to the output of 110 in Fig. 1.
The ordinary addition $A1 + A2 = (a_0, a_1, \ldots, a_{p+2})$ in the modular multiplication unit can be completed in p+2 clocks, with one digit output in each clock. The precomputation module clears RC and RS and at the same time computes $a_0$ and $(B1_{a_0}, B2_{a_0})$. After this precomputation, the loop computation of the two subsequent modules is entered. As shown in Fig. 1, the computations of $(B1_{a_{i0}}, B2_{a_{i0}})$ and $(B1_{a_{i1}}, B2_{a_{i1}})$ can be completed in two successive clocks. As shown in Fig. 6, $a_0$ is computed in the first clock cycle; $a_1$ and $(B1_{a_{00}}, B2_{a_{00}})$ are computed in parallel in the second; $a_2$, $(B1_{a_{10}}, B2_{a_{10}})$, and $(B1_{a_{01}}, B2_{a_{01}})$ in the third; and $a_3$, $(B1_{a_{20}}, B2_{a_{20}})$, $(B1_{a_{11}}, B2_{a_{11}})$, and (RC, RS) in the fourth, after which a four-stage pipeline is established. In the i-th clock cycle, $a_i$, $(B1_{a_{i-1,0}}, B2_{a_{i-1,0}})$, $(B1_{a_{i-2,1}}, B2_{a_{i-2,1}})$, and (RC, RS) are computed in parallel, where $a_i = a_{i1} 2^{H/2} + a_{i0}$ ($0 \le a_{i0}, a_{i1} < 2^{H/2}$). Once t, $B1_{a_i}$, and $B2_{a_i}$ are available, the update of RC and RS can be completed in one clock:
$(C, S) := \mathrm{CSA}(RC, RS, M_t)$ (or $\mathrm{CSA4TO2}(RC, RS, M_{t_1} \ll H/2, M_{t_0})$),
$(RC, RS) := \mathrm{CSA4TO2}(C \gg H, S \gg H, B1_{a_i}, B2_{a_i})$,
$rc_0 := rc_0 + (c_{H-1} \wedge s_{H-1})$.
Bit 0 of $rc_0$ above is always 0, so computing $rc_0 + (c_{H-1} \wedge s_{H-1})$ only requires replacing bit 0 of $rc_0$ with the value of $(c_{H-1} \wedge s_{H-1})$.
The modular multiplication unit must perform l = p+3 loop iterations (l = p+2 for the modular multiplication in the end stage of the main-body operation). In time order, the loop is organized in three layers: the first layer computes $a_i$, the second computes $(B1_{a_i}, B2_{a_i})$, and the third computes (RC, RS). One modular multiplication therefore needs 2 to 3 clock cycles for the precomputation module and l clock cycles for the remaining computation. If the modulus N is k = 512 bits long and H = 8, then p = 64 in the radix $\beta = 2^8$ expansion $N = (n_{p-1} \ldots n_1 n_0)_\beta$, so l = 67 (or 66). As shown in Fig. 6, the modular multiplication unit sets up a 4-stage pipeline, so one modular multiplication costs (l+3) = 70 clock cycles in total. With the ordinary radix-2 method, one modular multiplication costs k+2 = 514 clock cycles (clock period 9.5 ns, 80,000 gates). With the ordinary array method, constructing m cascaded arithmetic units (a fast carry-chain adder structure that avoids an overlong carry chain) with radix $2^4$ and m = 128, one modular multiplication costs (2m+3) = 259 clock cycles (clock period 20.7 ns, 3413 CLBs).
After the main-body operation unit finishes, the data output recovery unit, shown in Fig. 4, is entered with inputs CP, SP, N, and n′, where CP and SP are the data of 316 and 317 in Fig. 3 respectively. When the data output recovery unit finishes, $Y = X^E \bmod N$ is obtained.
In actual hardware development, multiple clocks can be set up so as to make full use of the highest clock frequency the modular multiplication unit can reach and raise its processing speed. The two parallel modular multiplication units in the modular exponentiation loop body can also be replaced by a single modular multiplication unit with exponent window processing, which can save nearly half of the hardware cost but increases the number of modular multiplications.
The modular multiplication scheme of the invention was implemented (H = 8) on an FPGA (Stratix-epls10f780c6 chip). Test results show a hardware cost of fewer than 100,000 gates and a maximum frequency of 126 MHz (clock period 8 ns); its speed is about 8 times that of ordinary implementations.
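The cycle counts quoted in this embodiment follow from simple formulas, which the Python sketch below reproduces. The function names are introduced here, and the clock periods are the ones quoted in the text; since they come from different devices, the resulting times are only a rough comparison.

```python
import math


def cycles_high_radix(k: int, H: int) -> int:
    """Cycles for one modular multiplication in radix 2**H, per the text."""
    p = math.ceil(k / H)   # number of radix-2**H digits of the modulus
    l = p + 3              # loop iterations of one modular multiplication
    return l + 3           # plus 3 cycles to fill the 4-stage pipeline


def cycles_radix2(k: int) -> int:
    """Cycles for the radix-2 CSA method quoted in the text."""
    return k + 2


def cycles_array(m: int) -> int:
    """Cycles for the array method with m cascaded units."""
    return 2 * m + 3


if __name__ == "__main__":
    rows = [
        ("high radix, H = 8", cycles_high_radix(512, 8), 8.0),
        ("radix 2, CSA", cycles_radix2(512), 9.5),
        ("array, m = 128", cycles_array(128), 20.7),
    ]
    for name, cycles, period_ns in rows:
        print(f"{name}: {cycles} cycles, {cycles * period_ns:.1f} ns")
```

Multiplying each count by its quoted clock period gives the time for one 512-bit modular multiplication under each method.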

Claims (5)

1. A high-radix hardware implementation method for a large-number modular exponentiation system, in which data are input to a programmable logic device or an ASIC chip to perform modular exponentiation, characterized in that the method is divided into five units: an initialization processing unit, a parallel addition processing unit (CSA parallel addition), a modular multiplication unit, a modular exponentiation main-body operation unit, and a data output recovery unit, wherein
A. Initialization processing unit: the modulus N of the modular exponentiation is expanded in radix $\beta = 2^H$ as $N = (n_{p-1} \ldots n_1 n_0)_\beta$, where the digits $n_i$ ($i = 0, \ldots, p-1$) are ordered from low to high; take
$n' = \beta - n_0^{-1} \bmod \beta$, $R = 2^{p+2} \bmod N$, $R2 = R^2 \bmod N$,
$M = n' \times N = (m_p \ldots m_1 m_0)_\beta$, $M_j = j \times M$ ($j = 0, \ldots, \beta-1$).
The following data are stored in ROM in binary form:
$n'$, $N$, $R$, $R2$, $\{M_j,\ j = 0, \ldots, \beta-1\}$
B. Modular multiplication unit: $MX = \{M_j,\ j = 0, \ldots, \beta-1\}$ is stored in ROM; the inputs are (A1, A2), (B1, B2), and the loop count l; the outputs are (CY, SY). The modular multiplication unit runs through l (≤ p+3) loop iterations, and the loop computation is divided into three modules:
(1) precomputation module
(2) dynamic parallel addition module
(3) loop feedback computation module
C. Modular exponentiation main-body operation unit: the modulus of the modular exponentiation is N, and $Y = X^E \bmod N$ is computed. The input exponent is $E = (e_{h-1}, e_{h-2}, \ldots, e_1, e_0)_2$ with most significant bit $e_{h-1} = 1$ ($0 \le h < p \times H$); the input plaintext is $X = (x_{p-1}, x_{p-2}, \ldots, x_1, x_0)_\beta < N$; let the output of the main-body operation be W. The operation proceeds in three stages:
(1) Initial stage: the main-body operation contains two modular multiplication units side by side, unit-1 and unit-2, which correspond to the two groups of output variables CZ and SZ, and CP and SP, respectively;
(2) Loop stage: after the initial stage, the operation enters a loop of h iterations. In each iteration unit-1 and unit-2 run synchronously in parallel with l = p+3; the inputs of unit-1 are CZ, SZ, CZ, SZ and its outputs are CZ and SZ; the inputs of unit-2 are CP, SP, CZ, SZ and its outputs are CP and SP;
(3) End stage: after the loop stage, take l = p+2 and run unit-2 with inputs CP, SP, 0, 1; its outputs are CP and SP;
D. Data output recovery unit: compute W = CP + SP; then, using $W = (w_{p-1}, w_{p-2}, \ldots, w_1, w_0)_\beta$ and $q = n' \times w_0 \bmod \beta$, compute $Y := (W + q \times N)/\beta$ to obtain $Y = X^E \bmod N$.
2. The high-radix hardware implementation method for a large-number modular exponentiation system as claimed in claim 1, characterized in that the parallel addition processing unit operates as follows:
The carry output of each CSA is doubled (shifted left by one bit), so that C+S=X+Y+Z+W. With inputs (X, Y, Z, W) and outputs (C, S), the CSA addition formula is
$(C, S) = \mathrm{CSA4TO2}(X, Y, Z, W) := \mathrm{CSA}(\mathrm{CSA}(X, Y, Z), W)$
Given $x = (x_{H/4-1} \ldots x_1 x_0)_{2^4} = 2^{4(H/4-1)} x_{H/4-1} + 2^{4(H/4-2)} x_{H/4-2} + \ldots + x_0$ ($H > 4$, $0 \le x_i < 2^4$) together with X and Y, a further $(\log_2 H - 2)$ layers (in total $2^0 + 2^1 + \ldots + H/8$ CSA4TO2 operations) and simple shifts yield $(X_x, Y_x) := x \times (X + Y)$.
3. The high-radix hardware implementation method for a large-number modular exponentiation system as claimed in claim 2, characterized in that the modular multiplication unit is given intermediate register variables $RC = (rc_{p+1} \ldots rc_1 rc_0)_\beta$ and $RS = (rs_{p+1} \ldots rs_1 rs_0)_\beta$ and signal variables $C$ and $S$, and that the three modules of its loop computation are as follows:
A. Precomputation module: clear RC and RS, and at the same time compute $a_0$ and $(B1_{a_0}, B2_{a_0})$;
B. Dynamic parallel addition module: in the $i$-th iteration ($i = 0, 1, \ldots, l-1$), obtain $a_i = a_{i1} 2^{H/2} + a_{i0}$ ($0 \le a_{i0}, a_{i1} < 2^{H/2}$) from $A1 + A2$ by ordinary addition, where $A1 + A2 = (a_{p+2} \ldots a_1 a_0)_\beta$. From $a_{i0}$ and $a_{i1}$, use the parallel addition processing module to obtain $(B1_{a_{i0}}, B2_{a_{i0}})$ and $(B1_{a_{i1}}, B2_{a_{i1}})$, and then compute $(B1_{a_i}, B2_{a_i})$ as follows:
$(B1_{a_i}, B2_{a_i}) := \mathrm{CSA4TO2}(B1_{a_{i0}}, B2_{a_{i0}}, 2^{H/2} B1_{a_{i1}}, 2^{H/2} B2_{a_{i1}})$
C. Loop feedback computation module: in the $(i+1)$-th iteration ($i = 0, 1, \ldots, l-1$), compute $t = rc_0 + rs_0$. Using the values $B1_{a_i}$ and $B2_{a_i}$ computed in the $i$-th iteration and the value $M_t$ stored in ROM (addressed by $t$, or by $t_0$ and $t_1$, where $t = t_1 2^{H/2} + t_0$), perform the following operations to update RC and RS: $(C, S) := \mathrm{CSA}(RC, RS, M_t)$ (or $\mathrm{CSA4TO2}(RC, RS, M_{t_1} \ll H/2, M_{t_0})$)
$(RC, RS) := \mathrm{CSA4TO2}(C \gg H, S \gg H, B1_{a_i}, B2_{a_i})$
$rc_0 := rc_0 + (c_{H-1} \wedge s_{H-1})$.
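The loop feedback step of claim 3 resembles a radix-$2^H$ Montgomery multiplication carried out in carry-save form. The following Python sketch illustrates that reading under several assumptions, none of which is confirmed by the claim text: the ROM entry `rom[t]` is taken to be the multiple of an odd modulus `N` that clears the low radix digit, each pair (B1, B2) is simply `(a_i * B, 0)`, the carry lost when the low digit is discarded is computed arithmetically instead of with a single gate, and the pipelining between iterations is flattened into one sequential loop. All function names are hypothetical.

```python
def csa(x, y, z):
    """3:2 carry-save adder: c + s == x + y + z."""
    return ((x & y) | (x & z) | (y & z)) << 1, x ^ y ^ z


def csa4to2(x, y, z, w):
    """4:2 compressor: c + s == x + y + z + w."""
    c1, s1 = csa(x, y, z)
    return csa(c1, s1, w)


def build_rom(N, H):
    """Assumed ROM contents: rom[t] is a multiple of N with (t + rom[t]) % 2**H == 0."""
    beta = 1 << H
    n_inv = pow(N, -1, beta)          # N must be odd; needs Python 3.8+
    return [((-t * n_inv) % beta) * N for t in range(beta)]


def modmul_carry_save(A, B, N, H, l):
    """Return (RC, RS) with (RC + RS) * beta**(l-1) == A * B (mod N), beta = 2**H."""
    beta_mask = (1 << H) - 1
    rom = build_rom(N, H)
    RC = RS = 0                                            # module A: clear the registers
    for i in range(l):
        a_i = (A >> (H * i)) & beta_mask                   # i-th radix digit of A
        B1, B2 = a_i * B, 0                                # assumed pair, B1 + B2 == a_i * B
        t = ((RC & beta_mask) + (RS & beta_mask)) & beta_mask   # t = rc_0 + rs_0 (mod beta)
        C, S = csa(RC, RS, rom[t])                         # low digit of C + S is now zero
        carry = ((C & beta_mask) + (S & beta_mask)) >> H   # carry out of the dropped digit
        RC, RS = csa4to2(C >> H, S >> H, B1, B2)
        RC += carry                                        # correction applied to rc_0
    return RC, RS
```

The result stays in redundant form: only the sum RC + RS is meaningful, and it equals A·B scaled by $\beta^{-(l-1)}$ modulo N, so a caller would still need a final carry-propagating addition and a reduction.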
4. The high-radix hardware implementation method for a large-number modular exponentiation system as claimed in claim 1, 2 or 3, characterized in that, among the three stages of the modular exponentiation main body arithmetic unit:
A. In the initial stage, take $l = p + 3$ and run unit-1 with R2, 0, X, 0 as inputs; its outputs CZ and SZ are fed back to the inputs of unit-1 and unit-2, and CP and SP are initialized with R and 0 respectively;
B. In the loop stage, which is entered after the initial stage completes and consists of $h$ iterations, unit-1 and unit-2 run synchronously in parallel in each iteration with $l = p + 3$; in the $i$-th iteration ($i = 0, \ldots, h-1$), when $e_i = 0$ the data in CP and SP remain unchanged (they are not updated with the output of unit-2).
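The two stages listed in claim 4 match a right-to-left binary exponentiation in which one multiplier squares while the other conditionally multiplies. The sketch below illustrates that reading in Python. It is an assumption-laden simplification: `mont` is a plain-integer stand-in for unit-1 and unit-2, the carry-save pairs (CZ, SZ) and (CP, SP) are collapsed into single integers `Z` and `P`, `R2` is taken to mean $R^2 \bmod N$, and the final conversion out of Montgomery form is assumed because the claim text above does not describe the third stage.

```python
def mont(a, b, N, R_inv):
    """Stand-in for unit-1 / unit-2: Montgomery product a * b * R^-1 mod N."""
    return a * b * R_inv % N


def modexp(X, e, N, k):
    """Compute X**e mod N for odd N < 2**k using two 'parallel' Montgomery units."""
    R = 1 << k
    assert N % 2 == 1 and 1 < N < R
    R_inv = pow(R, -1, N)             # needs Python 3.8+
    R2 = R * R % N

    # Initial stage: unit-1 maps X into Montgomery form, P starts as R (i.e. 1).
    Z = mont(R2, X % N, N, R_inv)
    P = R % N

    # Loop stage: h iterations, one per exponent bit, least significant first.
    for i in range(e.bit_length()):
        new_Z = mont(Z, Z, N, R_inv)  # unit-1: squaring
        new_P = mont(P, Z, N, R_inv)  # unit-2: multiply by the same (old) Z
        if (e >> i) & 1:              # when e_i == 0, P is left unchanged
            P = new_P
        Z = new_Z

    # Assumed final stage: leave Montgomery form by multiplying with 1.
    return mont(P, 1, N, R_inv)
```

Because both units read the same old value of Z within an iteration, they have no data dependency on each other inside that iteration, which is what allows them to run synchronously in hardware.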
5. The high-radix hardware implementation method for a large-number modular exponentiation system as claimed in claim 1, characterized in that, in the data output recovery unit, if $Y > N$, Y is output; otherwise, if $Y \le N$, $Y = Y - N$ is output.
CNB2006100203868A 2006-03-01 2006-03-01 Hardware high-density realizing method for great number modules and power system Expired - Fee Related CN100435091C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100203868A CN100435091C (en) 2006-03-01 2006-03-01 Hardware high-density realizing method for great number modules and power system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100203868A CN100435091C (en) 2006-03-01 2006-03-01 Hardware high-density realizing method for great number modules and power system

Publications (2)

Publication Number Publication Date
CN1811698A true CN1811698A (en) 2006-08-02
CN100435091C CN100435091C (en) 2008-11-19

Family

ID=36844646

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100203868A Expired - Fee Related CN100435091C (en) 2006-03-01 2006-03-01 Hardware high-density realizing method for great number modules and power system

Country Status (1)

Country Link
CN (1) CN100435091C (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102207847A (en) * 2011-05-06 2011-10-05 广州杰赛科技股份有限公司 Data encryption and decryption processing method and device based on Montgomery modular multiplication operation
CN103645883A (en) * 2013-12-18 2014-03-19 四川卫士通信息安全平台技术有限公司 FPGA (field programmable gate array) based high-radix modular multiplier
CN101510148B (en) * 2009-04-02 2014-10-29 北京中星微电子有限公司 Index operation method and device
CN107193536A (en) * 2017-05-18 2017-09-22 浪潮金融信息技术有限公司 The packet processing method and system of a kind of multidimensional dynamic data
CN112100673A (en) * 2020-09-29 2020-12-18 深圳致星科技有限公司 Federal learning accelerator and RSA intersection calculation method for privacy calculation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2726668B1 (en) * 1994-11-08 1997-01-10 Sgs Thomson Microelectronics METHOD OF IMPLEMENTING MODULAR REDUCTION ACCORDING TO THE MONTGOMERY METHOD
JP3525209B2 (en) * 1996-04-05 2004-05-10 株式会社 沖マイクロデザイン Power-residue operation circuit, power-residue operation system, and operation method for power-residue operation
CN1259617C (en) * 2003-09-09 2006-06-14 大唐微电子技术有限公司 Montgomery analog multiplication algorithm and its analog multiplication and analog power operation circuit
CN1547111A (en) * 2003-12-01 2004-11-17 成都卫士通信息产业股份有限公司 Partition control method for exponent dynamic sliding window for modular power arithmetic

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510148B (en) * 2009-04-02 2014-10-29 北京中星微电子有限公司 Index operation method and device
CN102207847A (en) * 2011-05-06 2011-10-05 广州杰赛科技股份有限公司 Data encryption and decryption processing method and device based on Montgomery modular multiplication operation
CN102207847B (en) * 2011-05-06 2013-12-04 广州杰赛科技股份有限公司 Data encryption and decryption processing method and device based on Montgomery modular multiplication operation
CN103645883A (en) * 2013-12-18 2014-03-19 四川卫士通信息安全平台技术有限公司 FPGA (field programmable gate array) based high-radix modular multiplier
CN107193536A (en) * 2017-05-18 2017-09-22 浪潮金融信息技术有限公司 The packet processing method and system of a kind of multidimensional dynamic data
CN107193536B (en) * 2017-05-18 2020-09-01 浪潮金融信息技术有限公司 Packet processing method and system for multidimensional dynamic data
CN112100673A (en) * 2020-09-29 2020-12-18 深圳致星科技有限公司 Federal learning accelerator and RSA intersection calculation method for privacy calculation

Also Published As

Publication number Publication date
CN100435091C (en) 2008-11-19

Similar Documents

Publication Publication Date Title
CN1103951C (en) Device for executing self-timing algorithm and method thereof
CN1735881A (en) Method and system for performing calculation operations and a device
CN1570848A (en) Montgomery modular multiplier and method thereof using carry save addition
CN110351087B (en) Pipelined Montgomery modular multiplication operation method
CN1811698A (en) Hardware high-density realizing method for great number modules and power system
TW200949691A (en) Microprocessor techniques for real time signal processing and updating
CN101782845A (en) High speed arithmetic device and method of elliptic curve code
CN1992517A (en) Programmable interpolated filter device and realizing method therefor
CN1648853A (en) Multiple-word multiplication-accumulation circuit and montgomery modular multiplication-accumulation circuit
Ding et al. A FPGA-based accelerator of convolutional neural network for face feature extraction
CN1288545A (en) Method and apparatus for arithmetic operation
CN1221891C (en) Operation circuit and operation method
CN1781076A (en) Combined polynomial and natural multiplier architecture
CN108008932A (en) Division synthesizes
CN1258710C (en) Circuit method for high-efficiency module reduction and multiplication
CN101630244B (en) System and method of double-scalar multiplication of streamlined elliptic curve
Oksuzoglu et al. Parametric, secure and compact implementation of RSA on FPGA
CN1858999A (en) Pseudo-random sequence generating device
CN107368459B (en) Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication
CN1652075A (en) System and method for efficient VLSI architecture of finite fields
CN1717653A (en) Multiplier with look up tables
CN116561819A (en) Encryption and decryption method based on from-Cook on-loop polynomial multiplication and on-loop polynomial multiplier
CN104007953A (en) Modular multiplier circuit structure based on Montgomery modular multiplication algorithm of four operands
Pelzl et al. Area–time efficient hardware architecture for factoring integers with the elliptic curve method
Liu et al. A high speed VLSI implementation of 256-bit scalar point multiplier for ECC over GF (p)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20081119

Termination date: 20160301
