KR100632928B1

KR100632928B1 - Modular multiplier

Info

Publication number: KR100632928B1
Application number: KR1020010016579A
Authority: KR
Inventors: 문상재; 하재철; 신영건
Original assignee: 문상재; 주홍정보통신주식회사; 하재철
Priority date: 2001-03-29
Filing date: 2001-03-29
Publication date: 2006-10-16
Also published as: KR20020076600A

Abstract

The present invention proposes a new modular multiplier circuit for high-speed processing of the modular exponentiation that is essential when providing information security services.

To solve the carry propagation problem, the invention uses a carry-save-type adder built from CPAs, designed so that carry propagation is exploited as far as possible within one clock period. Compared with using pure CSAs, this reduces both the hardware needed to store carries and the time needed to output the final result.

In addition, in the reduced multiplier according to the invention, the adder circuit is reduced to 1/k and used repeatedly, so that multiplication of large integers can be performed even where design conditions are constrained.

Multiplier, Montgomery multiplier, modular

Description

Modular multiplier

FIG. 1 is a block diagram of a conventional Montgomery multiplier.

FIG. 2 is a block diagram of another conventional Montgomery multiplier.

FIG. 3 is a block diagram of a carry-save-type multiplier according to the present invention.

FIG. 4 is a detailed block diagram showing the CPA of FIG. 3 in more detail.

FIG. 5 is a block diagram of a Montgomery multiplier using a reduced adder according to the present invention.

The present invention relates to a multiplier that performs multiplication according to the Montgomery algorithm, and more specifically to a Montgomery multiplier using either a carry-save-type adder built from CPAs or a reduced adder.

Public-key cryptosystems require modular exponentiation on large integers of 512 bits, 1024 bits, or more, which is mostly carried out by repeating modular multiplication several hundred times or more. In particular, even where the physical design space is limited, as in a cryptographic IC card, a high-speed, low-capacity modular multiplication circuit may be needed for real-time information security services.

Many modular multipliers developed to date are based on algorithms such as the Montgomery algorithm. In the Montgomery algorithm, the internal operations of the multiplier are regular and the data flow is uniform, which makes hardware implementation easy.

Multipliers using the Montgomery algorithm include the DSD (Digital Signature Device) proposed by Arazi and the planar systolic array proposed by Walter. Arazi's DSD multiplier can perform Montgomery modular multiplication using only adders, registers, and multiplexers (MUX). However, the specific adder to be used is not given, so the carry propagation problem cannot be said to have been fully solved.

The modular exponentiation $A^E \bmod N$ used in public-key cryptosystems is carried out by repeated modular multiplication. Ordinary modular multiplication, $AB \bmod N$, takes the remainder of the product of two numbers A and B after division by N. Montgomery modular multiplication computes $ABR^{-1} \bmod N$ rather than $AB \bmod N$, where R is an integer greater than N and coprime to N. In this specification, A, B, and N are all assumed to be n bits long. When each integer is expressed in base r, it can be written with l digits as

$A = \sum_{i=0}^{l-1} A[i]\,r^i$, $B = \sum_{i=0}^{l-1} B[i]\,r^i$, and $N = \sum_{i=0}^{l-1} N[i]\,r^i$,

and when r = 2, the l digits are the same as n bits.

The Montgomery algorithm first selects an R that is greater than N and coprime to N. It is common to take $R = r^n$ so that reduction modulo R and division by R can be computed simply. In this specification r = 2 is assumed, so Montgomery multiplication is specified as $AB2^{-n} \bmod N$. Accordingly, for r = 2 and odd N, the Montgomery modular multiplication algorithm can be written simply as follows.

(1) T = 0

(2) for i = 0 to n-1 step 1 {

(2a) C[i] = (T[0] + A[i]B[0]) mod 2

(2b) T = T + A[i]B

(2c) T = T + C[i]N

(2d) T = T / 2 }

(3) return(T)

In the algorithm above, step 2b is the iterative step that multiplies A by B, and step 2c is the iterative step for modular reduction. Steps 2c and 2d together amount to simply shifting T right when C[i] = 0, and to adding N to T and then shifting right when C[i] = 1. The algorithm can be implemented in hardware by running the loop n times, using two n-bit adders in each iteration and shifting their output right by one bit. After the n-th iteration, the final value is T = AB2^-n mod N.
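The loop above can be sketched in software as follows. This is a minimal illustration, not the patented circuit: the function name `montgomery_multiply` and its parameter names are assumptions made for the example, and Python integers stand in for the n-bit registers.

```python
def montgomery_multiply(a, b, n_mod, n_bits):
    """Bit-serial Montgomery multiplication for r = 2 and odd n_mod.

    Returns T with T == a * b * 2**(-n_bits) (mod n_mod).
    No final subtraction is done, so T may be as large as b + n_mod - 1.
    """
    t = 0                                        # step 1
    for i in range(n_bits):                      # step 2
        a_i = (a >> i) & 1
        c_i = ((t & 1) + a_i * (b & 1)) & 1      # step 2a
        t += a_i * b                             # step 2b
        t += c_i * n_mod                         # step 2c: makes t even
        t >>= 1                                  # step 2d: exact division by 2
    return t                                     # step 3


# Small example with hypothetical values: n = 8 bits, N = 239 (odd).
t = montgomery_multiply(123, 200, 239, 8)
```

Because each iteration halves T exactly, multiplying the returned value by 2^n and reducing modulo N recovers AB mod N, which is a convenient way to check the sketch.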

FIG. 1 shows a modular multiplier 100 designed according to the Montgomery modular multiplication algorithm above. The n-bit adder 130 adds A[i]B to the output T of the register 110, which stores the intermediate value, and the n-bit adder 140 adds C[i]N to the output of the preceding adder 130 according to the value of C[i]. That is, the n-bit adder 130 performs step 2b of the Montgomery algorithm and the n-bit adder 140 performs step 2c. The resulting output, except for its last bit Tout, is fed back to the input of the n-bit adder 130. This has the same effect as shifting right by one bit and therefore performs step 2d. Repeating this process n times, once for each bit A[i] of the multiplicand A, finally computes $AB2^{-n} \bmod N$.

In this multiplier structure, two adders are used to add large integers, and if CPAs are used a carry propagation problem arises. Given that the numbers involved are large integers of 512 bits or more, using CPAs is impractical because of carry propagation. A CSA (carry-save adder) can be used to solve the problem: a CSA takes three input bits and outputs their sum as two bits, a carry and a sum. Using CSAs as the adders lets each bit position be processed independently, which removes carry propagation, but it has the drawback that the carry and sum must be added again to obtain the correct final value.
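As a brief illustration of the carry-save idea described above, the following sketch compresses three operands into a sum word and a carry word without any carry rippling between bit positions. The function name `carry_save_add` is an assumption made for the example; Python integers model the bit vectors.

```python
def carry_save_add(x, y, z):
    """3:2 compression: every bit position acts as an independent full adder.

    Returns (sum_word, carry_word) such that
    sum_word + carry_word == x + y + z.
    """
    sum_word = x ^ y ^ z                              # per-bit sum output
    carry_word = ((x & y) | (y & z) | (x & z)) << 1   # per-bit carry, weighted one position higher
    return sum_word, carry_word


s, c = carry_save_add(0b1011, 0b0110, 0b1101)
total = s + c   # the final carry-propagating addition that a CSA defers
```

The last line is the extra addition the text refers to: the result stays in two words until a carry-propagating step combines them.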

FIG. 2 shows a Montgomery multiplier 200 using CSAs.

The operation of FIG. 2 is as follows. First, the first CSA 230 adds the multiplier B to the intermediate value T according to A[i], each bit of the multiplicand A. The result is represented by the CSA's carry and sum outputs, and the second CSA 240 then adds N according to C[i]. At the output of CSA 240, the carry's last bit CO[1] is fed back as the last bit CI[0] of the next input; for the sum, the last bit, which is 0, is discarded and the second bit TO[1] is fed back as the last bit TI[0] of the next input. This has the same effect as shifting the result down by one bit. All of this takes place within one clock and is repeated for every bit of the multiplicand A, so the carry and sum values corresponding to $AB2^{-n} \bmod N$ are obtained after n clocks, the number of bits in A.

Using CSAs solves the carry propagation problem, but after every multiplication the carry and sum must be added to correct the result to its final value before it can be used in the next operation. In other words, a separate circuit or the intervention of a microprocessor is needed after each multiplication to obtain the correct result. The carry and sum can nevertheless be added to obtain the correct final result without any additional circuit or microprocessor, as follows.

If, after n clocks, A[i] and N[i] are both input as 0 and one more clock is applied, the circuit simply adds the carry and sum and divides by 2, and the least significant bit of the final result appears at Tout in FIG. 2. Repeating this for n clocks outputs the final result, with all carries and sums added, one bit at a time at the Tout terminal. In short, the first n clocks perform the modular multiplication of the two numbers and the next n clocks add the carry and sum, so the final result is obtained after 2n clocks by reusing the multiplier itself, with no additional circuit. Although this method needs no separate circuit or microprocessor, the clocks added to correct the result are as many as the clocks of the actual computation, so it is not effective in terms of execution time.
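The two phases described above (n clocks of multiplication in carry-save form, then n clocks of flushing with zero inputs) can be modelled roughly as below. This is a behavioural sketch under assumptions, not a description of the FIG. 2 netlist: the names `csa` and `montgomery_csa` are hypothetical, and the final line folds in a possible overflow bit because the unreduced result can exceed n bits.

```python
def csa(x, y, z):
    """Carry-save add: returns (sum, carry) with sum + carry == x + y + z."""
    return x ^ y ^ z, ((x & y) | (y & z) | (x & z)) << 1


def montgomery_csa(a, b, n_mod, n_bits):
    s = c = 0                                    # T is held as sum word s plus carry word c
    # Phase 1: n clocks of multiply-and-reduce, never resolving the carries.
    for i in range(n_bits):
        a_i = (a >> i) & 1
        c_i = ((s ^ c) & 1) ^ (a_i & b & 1)      # LSB of T + A[i]B
        s, c = csa(s, c, a_i * b)                # first CSA (step 2b)
        s, c = csa(s, c, c_i * n_mod)            # second CSA (step 2c)
        s, c = s >> 1, c >> 1                    # both words are even here (step 2d)
    # Phase 2: n more clocks with A[i] = 0 and N = 0; one result bit per clock.
    result = 0
    for j in range(n_bits):
        s, c = csa(s, c, 0)
        s, c = csa(s, c, 0)
        result |= (s & 1) << j                   # bit appearing at Tout
        s, c = s >> 1, c >> 1
    # Assumption of this sketch: keep any bit beyond n bits, since T < B + N.
    return result | ((s + c) << n_bits)


t = montgomery_csa(123, 200, 239, 8)
```

The flush loop makes the cost visible: it doubles the clock count even though no new multiplication work is done in the second phase.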

The present invention solves the above problems and provides a circuit structure for a modular multiplier that is fast and has a low hardware cost. To solve the carry propagation problem that arises in ordinary modular multiplication while raising the operation speed, the multiplier according to the invention uses a carry-save-type adder that includes carry propagation adders (CPAs) of a suitable bit width, so that carry propagation within one clock is allowed.

In addition, most multipliers presented so far use an n-bit adder that performs an n-bit addition in one clock. This can be difficult to apply when the operand width is large but the design conditions are restrictive.

With this in mind, the invention designs a w-bit adder (where w ≤ n) that can be used repeatedly even when the operand width is n bits. The multiplier according to the invention is therefore intended for systems whose hardware resources are insufficient or fixed, allowing a design that trades off computation time against circuit complexity.

As a way of reducing the delay caused by correcting the carry and sum, a carry-save-type adder built from CPAs is considered. The basic circuit of this adder for n-bit multiplication is shown in FIG. 3; overall it has a carry-save adder structure. Internally, however, the n FAs (full adders) of FIG. 2 are grouped b at a time into m = n/b blocks, and each block is built as a CPA. Here m is assumed to be an integer. In FIG. 3, TIi and TOi are sum data consisting of b bits, while CI[i], CO[i], DI[i], DO[i], and so on are carry data of one bit each. Note that the number of FAs in a block must be chosen so that the time taken by the computation within one block, including carry propagation, does not exceed one clock. For example, assume that passing through one FA takes 1 ns and take b = 32; passing through one CPA block then takes about 32 ns. If the multiplier's main clock is 10 MHz, one clock period is 100 ns, so every operation including carry propagation finishes within one clock, leaving ample time for carry propagation to have no effect. Since the carry propagation inside each block takes less than one clock period and is therefore not a problem, the CPA can be designed with more FAs as long as b remains below 100.
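The timing argument above is simple arithmetic and can be checked with a short sketch. The function names are assumptions for the example, and the model takes the text's simplification at face value: block delay equals b times the per-FA delay, and it must stay strictly below the clock period.

```python
def block_delay_ns(b, fa_delay_ns=1):
    """Worst-case ripple delay through a CPA block of b full adders."""
    return b * fa_delay_ns


def fits_in_one_clock(b, clock_hz, fa_delay_ns=1):
    """True if carry propagation inside one block finishes within one clock period."""
    clock_period_ns = 1_000_000_000 / clock_hz
    return block_delay_ns(b, fa_delay_ns) < clock_period_ns


# Values from the text: b = 32 at a 10 MHz main clock (100 ns period).
ok = fits_in_one_clock(32, 10_000_000)
```

A real design would also budget for register setup time and the AND/XOR gating, which this sketch ignores.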

FIG. 4 shows the detailed circuit between one b-bit CPA and the adjacent CPA. The example of FIG. 4 uses b = 3 and shows two stages, starting from the LSB, of CPAs made of three FAs each. As FIG. 4 shows, a CPA made of b FAs produces b sum bits and one carry. Because the result is shifted one bit to the right, the generated carry is not added into the next higher FA; instead it is stored in a register and added at the MSB of the CPA when the next clock arrives. The FA input bit at the MSB side of the CPA is determined by XORing the carry generated in the previous clock with the carry generated by its own lower FA.

When the carry-save-type adder 400 made up of m CPA blocks is used in this way, the result $AB2^{-n} \bmod N$ is produced after n clocks, but in the form of an n-bit sum and an m-bit carry. If m more clocks are then applied with A[i] = N[i] = 0, the final result is output at Tout one bit per clock, m bits in all, and the remaining (n-m) bits are stored from the LSB in the n-bit storage register. The final result is thus available after only (n+m) clocks, held in the lowest stage (m bits) and the register ((n-m) bits). In addition, the number of D flip-flops used as registers for the intermediate result, that is, for storing carries, falls from 2n in FIG. 2 to n+2m.
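The block structure can be illustrated with a deliberately simplified model: carries ripple inside each b-bit block, and only the single carry out of each block is saved for the next clock. The sketch omits the one-bit right shift, the second adder, and the XOR gating of FIG. 3 and FIG. 4; the function name `block_cpa_add` and the list-of-carries representation are assumptions made for the example.

```python
def block_cpa_add(sum_word, carries, x, n_bits, b):
    """One clock: add x to the state (sum_word, carries), modulo 2**n_bits.

    carries[j] is the saved carry out of block j, with weight 2**(b*(j+1)).
    It is consumed by block j+1 on the following clock. The carry out of
    the top block has weight 2**n_bits and is dropped.
    """
    m = n_bits // b
    mask = (1 << b) - 1
    new_sum, new_carries = 0, [0] * m
    for j in range(m):
        carry_in = carries[j - 1] if j > 0 else 0
        total = ((sum_word >> (b * j)) & mask) + ((x >> (b * j)) & mask) + carry_in
        new_sum |= (total & mask) << (b * j)      # b sum bits per block
        new_carries[j] = total >> b               # one saved carry bit per block
    return new_sum, new_carries


n_bits, b = 16, 4
state = (0, [0] * (n_bits // b))
for x in (0xFFFF, 0x0001, 0x1234):
    state = block_cpa_add(*state, x, n_bits, b)
for _ in range(n_bits // b):                      # m flush clocks with a zero operand
    state = block_cpa_add(*state, 0, n_bits, b)
resolved_sum, leftover_carries = state
```

Only m carry bits are stored instead of n, and at most m zero-operand clocks are needed to resolve them, which mirrors the (n+m)-clock figure in the text.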

The multiplier above solves the carry propagation problem, but it applies only to a fixed width of n bits. If a circuit is designed for 1024-bit multiplication, a new circuit must be built to operate on any larger n. A further drawback is that the size of the adder grows in direct proportion to the size of the numbers to be processed. Such a multiplier structure is hard to use in practice where multiplication of long integers is required but the design space is tightly limited. If the number of available gates is fixed, an arithmetic circuit that needs more gates than that cannot be built. A multiplier design method that allows large-integer multiplication even in a limited design space is therefore needed.

The invention therefore provides a way of using the above multiplier flexibly even where the design environment is constrained. Assuming modular multiplication of n bits, the large-integer adder that forms the core circuit of the multiplier is not designed to handle n bits; instead the adder is reduced to 1/k (where k is assumed to be of the form 2^i: k = 1, 2, 4, 8, ...). If the adder is designed with w bits (w = n/k ≤ n), the computation time increases k-fold, but the adder needed is only about w bits wide, a reduction to 1/k. To keep the circuit simple, it is effective to design so that w = n/k is an integer. For example, reducing a 1024-bit adder with k = 4 gives a w = 256-bit adder, and the 256-bit adder is then used four times to perform a 1024-bit addition. It is also effective for k = n/w to be a divisor of m = n/b.

To this end, B and N are first each grouped into words of w bits and processed sequentially. A, B, and N can then be written as

$A = \sum_{i=0}^{n-1} A[i]\,2^{i}$, $B = \sum_{j=0}^{k-1} BW[j]\,2^{wj}$, and $N = \sum_{j=0}^{k-1} NW[j]\,2^{wj}$,

where BW[j] and NW[j] denote the coefficient in the j-th word when the n bits are divided into w-bit words.

To process the integers B and N word by word, the Montgomery algorithm described above is rewritten in two-dimensional form below. Steps 2b to 2d of the Montgomery algorithm above correspond to steps 2b to 2d below, which perform the additions on smaller integers. An important feature of the algorithm below is that the carry generated in step 2b (CW1) and the carry generated in step 2c (CW2) are each at most 1 bit, which makes the carry handling easier to design than a scheme that adds A[i]B[j] and CN[j] to T[j] at the same time. If A[i]B[j] and CN[j] were added to T[j] simultaneously, the carry could be up to 2 bits, which would make carry handling difficult in hardware.

(1) T = 0

(2) for i = 0 to n-1 step 1 {

(2a) C[i] = (T[0] + A[i]B[0]) mod 2

CW1[0] = CW2[0] = 0

for j = 0 to k-1 step 1 {

(2b) CW1[j+1] = (T[j] + A[i]BW[j] + CW1[j]) / 2^w

T[j] = (T[j] + A[i]BW[j] + CW1[j]) mod 2^w

(2c) CW2[j+1] = (T[j] + C[i]NW[j] + CW2[j]) / 2^w

T[j] = (T[j] + C[i]NW[j] + CW2[j]) mod 2^w

(2d) T[j] = T[j] / 2

} }

(3) return(T)
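One way to make the word-serial loop concrete is the sketch below. It is an interpretation under stated assumptions rather than the circuit of FIG. 5: the name `montgomery_multiply_words` is hypothetical, step 2d is modelled by moving the LSB of each word into the MSB of the word below it, and an extra top word is assumed to absorb the final carries CW1[k] and CW2[k], which the pseudocode leaves implicit.

```python
def montgomery_multiply_words(a, b, n_mod, n_bits, w):
    """Word-serial Montgomery multiplication with k = n_bits // w words of w bits."""
    k = n_bits // w
    mask = (1 << w) - 1
    bw = [(b >> (w * j)) & mask for j in range(k)]       # BW[j]
    nw = [(n_mod >> (w * j)) & mask for j in range(k)]   # NW[j]
    t = [0] * (k + 1)                    # assumed extra top word for overflow
    for i in range(n_bits):
        a_i = (a >> i) & 1
        c_i = ((t[0] & 1) + a_i * (bw[0] & 1)) & 1       # step 2a
        cw1 = cw2 = 0
        for j in range(k):
            x = t[j] + a_i * bw[j] + cw1                 # step 2b, carry is at most 1 bit
            cw1, t[j] = x >> w, x & mask
            y = t[j] + c_i * nw[j] + cw2                 # step 2c, carry is at most 1 bit
            cw2, t[j] = y >> w, y & mask
            if j > 0:                                    # step 2d: shift across the word boundary
                t[j - 1] |= (t[j] & 1) << (w - 1)
            t[j] >>= 1
        top = t[k] + cw1 + cw2                           # fold final carries into the top word
        t[k - 1] |= (top & 1) << (w - 1)
        t[k] = top >> 1
    return sum(word << (w * j) for j, word in enumerate(t))


# Hypothetical example: n = 16 bits processed as k = 4 words of w = 4 bits.
t = montgomery_multiply_words(0x1234, 0xBEEF, 0xC35B, 16, 4)
```

Keeping the two carry chains separate is what bounds each of `cw1` and `cw2` to a single bit, matching the design point made in the text.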

FIG. 5 is a block diagram of a multiplier 500 that implements Algorithm 2 of <Equation 2> above. Whereas FIG. 1 adds n bits at a time with one large adder when adding large integers, FIG. 5 is structured to add w bits, fewer than n, k times. In FIG. 5, the k registers 510 for storing T and the adders 530 and 540 all process data in units of w bits, that is, word by word. A[i] is input n times, one bit at a time, and each time BW[j] and NW[j] are input word by word. The size of the adder can thus be reduced to 1/k while the number of clocks required increases correspondingly, so the two are in a tradeoff relationship.

Now consider the structure of the w-bit adder in the multiplier of FIG. 5. If CSAs as in FIG. 2 are used, 2nk clocks are needed to obtain the final result. If the carry-save-type adder built from CPAs as in FIG. 3 is used, one multiplication needs only (n+m)k clocks, that is, k times (n+m) clocks.

In the present invention, a modular multiplier circuit was designed using the technique of adding large integers by repeated use of a small adder. The circuit was simulated by schematic entry using the VLSI design tool ALTERA MAX+PLUS II, and its operation was confirmed to be correct. The key parts of the circuit implementation are the two w-bit adder circuits used in FIG. 5.

As a concrete example, suppose Montgomery modular multiplication of n = 1024 bits is required but the design space is limited so that only an adder of about 256 bits can be built; then k = 4. With a 10 MHz main clock, carry propagation can be allowed for a clock period of about 100 ns. Assuming a signal takes 1 ns to pass through an FA, b = 64, which is less than 100, can be chosen, so m = 1024/64 = 16. One multiplication therefore takes (n+m)k × 100 ns = 416 µs. With a 40 MHz main clock, carry propagation can be allowed for a clock period of about 25 ns. Under the same 1 ns FA delay, b = 16, which is less than 25, can be chosen, so m = 1024/16 = 64. One multiplication therefore takes (n+m)k × 25 ns, about 109 µs.
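The arithmetic in this example can be reproduced with a few lines. The function name `multiplication_time_ns` is an assumption made for the sketch; it simply evaluates (n+m)k clock periods with m = n/b and reports the result in nanoseconds.

```python
def multiplication_time_ns(n, b, k, clock_hz):
    """Time for one multiplication: (n + m) * k clocks, with m = n / b."""
    m = n // b                           # number of CPA blocks
    clocks = (n + m) * k
    period_ns = 1_000_000_000 // clock_hz
    return clocks * period_ns


t_10mhz = multiplication_time_ns(1024, 64, 4, 10_000_000)   # 4160 clocks of 100 ns
t_40mhz = multiplication_time_ns(1024, 16, 4, 40_000_000)   # 4352 clocks of 25 ns
```

With the text's parameters this gives 416,000 ns at 10 MHz and 108,800 ns at 40 MHz.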

Table 1 compares the conventional DSD multiplier with the multiplier using repeated addition according to the present invention. To make the circuit improvement clear, the registers for storing A, B, and N, which both schemes use, are not counted. The execution time of the DSD is listed as n+1 clocks, but this holds only when CPAs are used in the DSD, which is hard to achieve for large-integer operations. When CSAs are used, as noted above, circuitry is needed to correct the carry and sum into the correct value. The multiplier according to the invention substantially reduces the register capacity among the hardware components and, because it uses no MUX, also substantially reduces the number of logic gates.

For reference, the circuit improvement is estimated by assuming that a D flip-flop requires 5 basic logic gates, an FA requires 5 gates, and a 4:1 MUX requires 10 gates. The gate count for each scheme is calculated as follows.

① DSD:

D flip-flops (3n) + FA (n) + MUX (n) = (15 + 5 + 10)n = 30n gates

② With an n-bit adder (i.e., k = 1, m = n/b):

D flip-flops (n + 2m) + FA (2n) + AND/XOR (2n + 4m) = (5 + 10 + 2)n + (10 + 4)m = 17n + 14m gates

③ Multiplier with the adder reduced to 1/k (i.e., k = n/w, m = n/b, w >= b):

D flip-flops (n + 2m) + FA (2n/k) + AND/XOR (2n/k + 4m) = 5n + 10n/k + 2n/k + 14m = 5n + 12n/k + 14m gates

(m = n/b, k = n/w, and w >= b; in particular, ③ with k = 1 is the same as using an n-bit adder)

As the formulas above show, the multiplier according to the invention needs fewer gates than an implementation using the DSD method. If the adder in the multiplication scheme of the invention is kept at n bits, then k = 1, which is close to the case of using the carry-save-type adder built from CPAs. If the adder is divided to 1/k, the gate count of the parts other than the registers (FA and AND) falls to 1/k. In return, the multiplication time of the multiplier according to the invention increases by a factor of about k.
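The gate-count formulas can be evaluated directly. The sketch below uses hypothetical function names and the text's per-component assumptions (5 gates per D flip-flop, 5 per FA, 10 per 4:1 MUX, 1 per AND/XOR gate); the sample parameters n = 1024, b = 64, k = 4 are taken from the earlier worked example.

```python
def dsd_gates(n):
    """DSD: 3n D flip-flops + n FAs + n 4:1 MUXes = 30n basic gates."""
    return 5 * 3 * n + 5 * n + 10 * n


def proposed_gates(n, b, k):
    """Proposed multiplier: (n + 2m) flip-flops, 2n/k FAs, (2n/k + 4m) AND/XOR gates."""
    m = n // b
    flip_flops = 5 * (n + 2 * m)
    full_adders = 5 * (2 * n // k)
    and_xor = 2 * n // k + 4 * m
    return flip_flops + full_adders + and_xor    # equals 5n + 12n/k + 14m


dsd = dsd_gates(1024)
full_width = proposed_gates(1024, 64, 1)     # k = 1: n-bit adder, 17n + 14m
reduced = proposed_gates(1024, 64, 4)        # k = 4: adder reduced to 256 bits
```

Comparing `full_width` with `reduced` shows that only the FA and AND/XOR terms shrink with k, while the register term 5n + 10m stays fixed.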

Table 1. Comparison of the conventional multiplier and the multiplier according to the present invention

| Item | DSD | Multiplier according to the present invention |
|---|---|---|
| Execution time (clocks) | n+1 (carry propagation remains to be solved) | (n+m)k (k = n/w: number of divisions, m = w/b: number of CPAs) |
| Hardware: registers (D flip-flops)* | 3n | n+2m |
| Hardware: FA | n | 2n/k |
| Hardware: 4:1 MUX | n | 0 |
| Hardware: AND/XOR gates | 0 | 2n/k+4m |
| Total (basic gates) | 30n | 5n+12n/k+14m |

*: The registers that store A, B, and N are used in common by both schemes.

In addition, in the reduced multiplier according to the invention, the adder circuit is reduced to 1/k and used repeatedly, so that multiplication of large integers can be performed even where design conditions are constrained. The adder can use the carry-save-type adder built from CPAs described above, chosen in view of the main clock speed, the hardware design space, the required computation time, and carry propagation, and the circuit size can be selected flexibly. Because design space and execution time are in a tradeoff relationship, a design with the adder reduced to 1/k takes k times as long as one without the reduction. In summary, when the adder of an n-bit modular multiplication circuit is reduced to 1/k and built from m CPAs (where m = n/b and k = n/w), the complete multiplication result is obtained in (n+m)k clocks. The multiplier according to the invention can therefore be designed in various ways to suit the design conditions of the circuit, that is, the number of gates available, and is useful in applications such as cryptographic IC cards.

Claims

A carry-save adder using CPAs for n-bit multiplication, in which n full adders are grouped b at a time and divided into m = n/b blocks, and each block is constructed as a CPA.

A multiplier implementing the Montgomery algorithm, in which the adder used for n-bit modular multiplication is reduced to 1/k, the adder and the register are configured with n/k = w bits, and the addition is repeated k times to perform the multiplication.