KR20040045152A

KR20040045152A - Apparatus for modular multiplication

Info

Publication number: KR20040045152A
Application number: KR1020020073187A
Authority: KR
Inventors: 김영세; 전용성; 이상우; 이윤경; 전성익; 박영수
Original assignee: 한국전자통신연구원
Priority date: 2002-11-22
Filing date: 2002-11-22
Publication date: 2004-06-01
Also published as: KR100481586B1

Abstract

PURPOSE: A modular multiplication apparatus is provided that omits the Montgomery correction factor calculation otherwise required during the operation, removing the associated complexity and delay, and that removes or minimizes the delay caused by transferring data from a memory to the modular multiplier, so that modular operations can be performed in succession. CONSTITUTION: A CPU (120) is connected to a system bus (100). A memory (110) inputs and outputs the data needed for the modular multiplication. Two-stage registers (101-103) store a multiplier, a multiplicand, and a modulus input from the system bus. An operation core (105) receives the data stored in the two-stage registers, performs the modular operation, outputs the result divided into a carry-out and a sum-out, and stores the shift data and the Montgomery correction factor generated during the operation. A status register (107) stores the operating state and reports it externally. A control register (108) receives and stores control signals from the CPU. A register group (109) stores the partial products during the multiplication and stores and outputs the final result.

Description

Modular multiplication unit {APPARATUS FOR MODULAR MULTIPLICATION}

FIELD OF THE INVENTION The present invention relates to a modular multiplication apparatus. More particularly, it relates to a Montgomery modular multiplication apparatus for the modular multiplication used by the RSA algorithm, one of the public-key cryptographic algorithms used to protect users' personal information against leakage in IC cards and similar single-chip systems that embed a central processing unit (CPU) and memory. The apparatus allows the size of the arithmetic unit to be chosen at design time to fit any given system, each of which may have a different area constraint.

The RSA cryptographic algorithm is implemented by modular exponentiation, which in turn is carried out by repeated modular multiplication. The Montgomery algorithm is the usual choice for performing these repeated modular multiplications quickly. Conventional Montgomery modular multiplication devices, however, have the following problems. A device that acts as a coprocessor, receiving all the required inputs and outputting the result, has the advantage of processing every bit of the operand at once, but its area is too large for systems such as the IC card considered in the present invention. A scheme that pipelines several modular multipliers, each narrower than the full operand, does allow a suitable area to be chosen, but when it is applied to a system that operates together with a CPU, time is needed to supply new inputs on every operating clock, which delays completion of the whole modular multiplication. A scheme that repeatedly uses a single arithmetic unit narrower than the full operand requires an additional operation to compute the Montgomery correction factor for the modulus, which increases the complexity of the operation and can delay the overall operation. Finally, as with the pipelined scheme, if the arithmetic unit is implemented merely as a coprocessor without careful consideration of its interworking with the CPU, encrypting the full operand incurs the resulting delay.

In general, when an auxiliary arithmetic device such as a modular arithmetic unit is implemented in hardware, most implementations restrict their scope to the arithmetic unit itself when seeking better performance. That is, they do not consider in advance the interworking between the CPU and the unit, or the input and output that it entails; they concentrate only on how quickly a result is produced for given inputs and on how far the hardware cost can be reduced. When a module built this way is actually coupled to a CPU, this omission becomes a serious problem: the performance of the unit drops sharply, and the scheme adopted to improve its performance loses its value. The same problem can arise for a modular multiplier. For example, when the modular operation is implemented by repeating the word-wise multiplication of Equation 5, step 3 becomes the core arithmetic module implemented in hardware, yet several points that must be considered from the standpoint of interworking with the CPU are missing. First, $N_0^{*}$ must be supplied from outside in order to obtain the Montgomery factor m by which the modulus N is multiplied. Computing $N_0^{*}$ falls to the CPU and requires separate software and execution time. Second, there is the question of how the data for the next multiplication is input once one word-wise multiplication has finished. When Equation 3 is implemented in hardware, not only the performance of the internal operation but also how the next input will be received must be considered. Without such consideration, the internal operation and the input/output with the CPU run completely separately for every word-wise operation. In that case, the aspects neglected when the unit was designed may outweigh the expected performance gain.
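To give a sense of the extra software work described above, the following Python sketch computes $N_0^{*}$ on the host side, using the condition $N_0 \cdot N_0^{*} = -1 \bmod 2^w$ from Equation 5. It is only an illustration: the function name `neg_inverse_mod_2w` and the Newton-style iteration are assumptions chosen for this example, not something the patent specifies.

```python
def neg_inverse_mod_2w(n0, w):
    """Return N0* such that n0 * N0* == -1 (mod 2**w).

    Hypothetical helper: the patent only states that the CPU must
    compute N0*; the Newton iteration used here is one common choice.
    """
    if n0 & 1 == 0:
        raise ValueError("n0 must be odd (the modulus N is odd)")
    mask = (1 << w) - 1
    x = 1          # n0 * x == 1 (mod 2), i.e. correct to 1 bit
    bits = 1
    while bits < w:
        # Each step doubles the number of correct low-order bits.
        x = (x * (2 - n0 * x)) & mask
        bits *= 2
    return (-x) & mask


n0 = 0x9E3779B1                       # arbitrary odd 32-bit word
n0_star = neg_inverse_mod_2w(n0, 32)
print(hex((n0 * n0_star) & 0xFFFFFFFF))
```

Every loop iteration costs two multiplications on the CPU, which is the kind of overhead the proposed apparatus aims to avoid.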

SUMMARY OF THE INVENTION The present invention has been devised to solve the above drawbacks. Its object is to provide a Montgomery modular multiplication apparatus for a single-chip system, such as an IC card, that embeds a CPU and includes a cryptographic device operating with it, in which the size of the arithmetic module can be chosen according to the area constraint of the system. The apparatus omits the Montgomery correction factor calculation otherwise required during the operation, thereby removing the associated complexity and execution delay, and removes or minimizes the delay caused by transferring data from memory to the modular multiplier, so that modular operations can be performed continuously.

To achieve this object, the present invention comprises: a central processing unit connected to a predetermined system bus; a memory connected to the system bus, which inputs and outputs the data required for the modular multiplication; two-stage input registers A, B, and N, each composed of two stages, which respectively store the multiplier A, the multiplicand B, and the modulus N input from the system bus; an operation core that receives the data stored in the two-stage input registers A, B, and N, performs the modular operation, outputs the result divided into a carry-out (Carry_out) and a sum-out (Sum_out), and stores the shift data (shift_data) and the Montgomery correction factor m generated during the operation; a status register connected to the system bus, which stores the modular multiplication operating state and reports it externally; a control register that receives and stores control signals from the central processing unit via the system bus; a register group connected to the system bus and made up of a plurality of registers that each store intermediate results, which stores the partial products during the modular multiplication and stores and outputs the final result; an adder that receives the carry-out, sum-out, and shift data output from the operation core together with the output of the register group, and selectively performs two additions; and a control unit connected to the system bus, which receives the control signals of the central processing unit from the control register, controls the input and output of the operation core and of the adder until the final result of the modular multiplication is output, and generates the signals for storing new values in the register group.

FIG. 1 is a block diagram showing an embodiment of the modular multiplication apparatus according to the present invention.

FIG. 2 is a circuit diagram showing an embodiment of the two-stage input registers N, A, and B shown in FIG. 1.

FIG. 3 is a detailed view showing an embodiment of the operation core shown in FIG. 1.

FIG. 4 is a detailed view showing an embodiment of the adder shown in FIG. 1.

<Explanation of symbols for the main parts of the drawings>

100: system bus
101, 102, 103: two-stage input registers N, A, B
104: control unit
105: operation core
106: adder
107: status register
108: control register
109: word register
110: memory
119: multiplexer
120: central processing unit

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

First, the Montgomery modular multiplication algorithm addressed by the present invention is described. For a multiplier A, a multiplicand B, and a modulus N, the operation that obtains the result R as in Equation 1 is called modular multiplication.

[Equation 1]
$R = A \cdot B \bmod N$

Among the algorithms for efficient modular multiplication, the Montgomery algorithm uses a residue factor r to move the computation from the integer domain $Z_n$ to the domain $rZ_n$, which makes modular operations possible without the classical division that yields a quotient and a remainder. Moving between domains requires additional operations. However, the main purpose of a modular multiplier is not a single modular multiplication but modular exponentiation through repeated modular multiplication, as in RSA. In modular exponentiation the domain change is performed only at the beginning and end of the whole exponentiation, not at every step, and the domain change itself can be carried out by Montgomery modular multiplication. The Montgomery method is therefore well suited to implementing a modular multiplier.

For an n-bit operand, the residue factor r has the value $2^n$, and the result of the Montgomery modular multiplication algorithm is given by Equation 2.

[Equation 2]
$R = A \cdot B \cdot r^{-1} \bmod N$

The general Montgomery algorithm that implements Equation 2 proceeds step by step as shown in Equation 3.

[Equation 3]
Step 1: $R = 0$
Step 2: $R = R + A \cdot B$
Step 3: $m = R \cdot N^{*} \bmod r$ (where $r \cdot r^{-1} - N \cdot N^{*} = 1$)
Step 4: $R = (R + m \cdot N) / r$
Step 5: if $(R > N)$ then $R = R - N$
Step 6: return $R$
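The steps of Equation 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function name `montgomery_eq3` is hypothetical, $N^{*}$ is obtained with Python's built-in `pow` (Python 3.8 or later), and the final comparison uses `>=` instead of the `>` written in step 5 so that the sketch always returns a fully reduced value.

```python
def montgomery_eq3(a, b, n, n_bits):
    """Return a * b * r^-1 mod n with r = 2**n_bits (n odd, n < r)."""
    r = 1 << n_bits
    # N* satisfies r*r^-1 - N*N* = 1, i.e. N*N* == -1 (mod r).
    n_star = (-pow(n, -1, r)) % r
    t = a * b                     # step 2
    m = (t * n_star) % r          # step 3: Montgomery correction factor
    t = (t + m * n) >> n_bits     # step 4: exact division by r as a shift
    if t >= n:                    # step 5 (sketch uses >= for full reduction)
        t -= n
    return t


n = 0xC5A3                        # small odd modulus chosen for illustration
a, b = 0x1234, 0xBEEF
print(montgomery_eq3(a, b, n, 16) == (a * b * pow(1 << 16, -1, n)) % n)
```

The computation of `n_star` and `m` is the additional work that step 3 imposes.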

In Equation 3, the division in step 4 is implemented simply as a shift. The Montgomery algorithm thus has the advantage of reducing the division required for modular multiplication to multiplications and shifts, which improves its speed. However, the computation of the Montgomery correction factor m in step 3 must be implemented additionally, and extra operating time proportional to that computation is required.

It is therefore more efficient not to perform a separate computation to obtain the factor m. A basic method for performing modular multiplication without the Montgomery correction factor computation of step 3 of Equation 3 is shown in Equation 4.

[Equation 4]
Step 1: $R = 0$
Step 2: for $i = 0$ to $n-1$ {
    $R = R + A_i \cdot B$
    $R = R + R_0 \cdot N$
    $R = R / 2$
}
Step 3: if $(R > N)$ then $R = R - N$
Step 4: return $R$
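A bit-serial rendering of Equation 4 in Python might look as follows. The function name `montgomery_eq4` is an assumption for this sketch, $A_i$ is read as bit i of A and $R_0$ as the least significant bit of R, and the final comparison again uses `>=` so that the returned value is fully reduced.

```python
def montgomery_eq4(a, b, n, n_bits):
    """Bit-serial Montgomery product a * b * 2^-n_bits mod n (n odd)."""
    r = 0
    for i in range(n_bits):
        r += ((a >> i) & 1) * b   # add B when bit A_i is set
        r += (r & 1) * n          # add N when R is odd, making R even
        r >>= 1                   # exact division by 2
    if r >= n:                    # step 3 (sketch uses >= for full reduction)
        r -= n
    return r


n = 0xC5A3
a, b = 0x1234, 0xBEEF
print(montgomery_eq4(a, b, n, 16) == (a * b * pow(1 << 16, -1, n)) % n)
```

No correction factor appears: the least significant bit of the running sum plays the role of m, one bit per iteration.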

Equation 4 performs the entire modular multiplication using only the loop of step 2, with no additional computation to obtain the factor. It is the representative method for processing the whole operand at its full width, and various modular arithmetic schemes and hardware implementations derived from it have been studied. Because these schemes handle the whole operand at once, however, processing operands of 1024 bits or more, as is now required, needs a hardware area too large for systems such as IC cards.

Consequently, a system such as an IC card must use a method that divides the operand into words of some chosen size rather than processing it all at once. Equation 5 expresses Montgomery modular multiplication for such word-wise operation step by step.

[Equation 5]
Step 1: $R = 0$
Step 2: for $i = 0$ to $p-1$ {
Step 3:     $R = R + A \cdot B_i$ ($B_i$ is a w-bit word)
    $m = R_0 \cdot N_0^{*} \bmod 2^w$ (where $N_0 \cdot N_0^{*} = -1 \bmod 2^w$)
    $R = R + N \cdot m$
    $R = R / 2^w$
}
Step 4: if $(R > N)$ then $R = R - N$
Step 5: return $R$
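The word-wise loop of Equation 5 can be sketched in Python as below. The name `montgomery_eq5` is hypothetical, $R_0$ and $N_0$ are taken to be the least significant w-bit words of R and N, and $N_0^{*}$ is computed with the built-in `pow` purely for convenience; in the setting the patent describes, that value would have to be supplied by the CPU.

```python
def montgomery_eq5(a, b, n, w, p):
    """Word-wise Montgomery product a * b * 2^-(w*p) mod n (n odd)."""
    mask = (1 << w) - 1
    # N0 * N0* == -1 (mod 2**w); precomputed outside the core.
    n0_star = (-pow(n & mask, -1, 1 << w)) & mask
    r = 0
    for i in range(p):
        b_i = (b >> (w * i)) & mask          # i-th w-bit word of B
        r += a * b_i                         # step 3: R = R + A * B_i
        m = ((r & mask) * n0_star) & mask    # m = R0 * N0* mod 2^w
        r += n * m                           # low w bits become zero
        r >>= w                              # exact division by 2^w
    if r >= n:                               # step 4 (>= for full reduction)
        r -= n
    return r


n = 0xC5A3
a, b = 0x1234, 0xBEEF
print(montgomery_eq5(a, b, n, 8, 2) == (a * b * pow(1 << 16, -1, n)) % n)
```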

In Equation 5, one column $(A \cdot B_i)$ is computed by also dividing A into p w-bit words of the same size as $B_i$ and performing p word-wise multiplications; $(N \cdot m)$ is likewise divided into p w-bit words and computed by p word-wise multiplications. The word size is generally 32 bits or an integer multiple of 32 bits, reflecting the case in which the system bus of a system such as an IC card is 32 bits wide.

A device implemented in hardware according to Equation 5 therefore consists of a w-bit word-wise arithmetic module and a module that controls it, so its hardware area is smaller than when all operand bits are processed at once. However, if the method is applied as it stands, the computation of the Montgomery correction factor is still required, just as in Equation 3. In addition, because the internal multiplication is the same as the partial-product step of ordinary multiplication, each word-wise multiplication produces a 2w-bit result, and the w bits corresponding to the carry must be passed to the next word-wise multiplication before it can proceed. The operations therefore depend on one another.

In the present invention, by contrast, each word-wise operation is performed independently of the previous result: the value shifted out in each partial operation is stored and then added to the result of the previous word operation. The result of each word-wise multiplication therefore always stays w bits wide, and the addition that produces the result of each partial operation can be implemented independently as a separate module. This is expressed in Equation 6.

[Equation 6]
Step 1: $R = 0$, $T = 0$
Step 2: for $i = 0$ to $p-1$ {
    $\text{Pre\_c} = 0$, $\text{Next\_c} = 0$
    Step 3: for $j = 0$ to $p-1$ {
        if $(i = 0)$ $R_j = 0$ ($R_j$ is $w+1$ bits wide)
        else $R_j = T_j$ ($T_j$ is $w$ bits wide)
        Step 4: for $b = 0$ to $w-1$ { ($b$ is the bit index)
            $R_j = R_j + A_{j,b} \cdot B_i$
            if $(j = 0)$ $m_b = R_{j,0}$ ($m$ is $w$ bits wide)
            else $m_b = m_b$
            $R_j = R_j + m_b \cdot N_j$
            $\text{shift\_data}_b = R_{j,0}$ ($\text{shift\_data}$ is $w$ bits wide)
            $R_j = R_j / 2$
        } end for b
        Step 5: $(\text{Pre\_c}, T_j) = R_j + \text{Pre\_c}$
                $(\text{Next\_c}, T_{j-1}) = T_{j-1} + \text{shift\_data} + \text{Next\_c}$
    } end for j
} end for i
Step 6: if $(T > N)$ then $T = T - N$ ($T$ is $n+1$ bits wide, including the final $\text{Next\_c}$)
Step 7: return $T$
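To make the data flow of Equation 6 concrete, the following Python sketch simulates it in software. It is a reading of the equation under stated assumptions, not the patent's circuit: the names `to_words`, `from_words`, and `montgomery_eq6` are hypothetical; $A_{j,b}$ is taken as bit b of word j of A; the second addition of step 5 is skipped for j = 0 because there is no word $T_{-1}$; and the final `Next_c` of each pass is folded into the top word, a detail the equation leaves implicit. Step 6 is omitted, so the result is only guaranteed to be congruent to the Montgomery product, not fully reduced.

```python
def to_words(x, w, p):
    """Split x into p little-endian w-bit words."""
    mask = (1 << w) - 1
    return [(x >> (w * k)) & mask for k in range(p)]


def from_words(words, w):
    """Reassemble little-endian w-bit words into an integer."""
    return sum(word << (w * k) for k, word in enumerate(words))


def montgomery_eq6(a, b, n, w, p):
    """Value congruent to a * b * 2^-(w*p) mod n, following Equation 6.

    Assumes n is odd and w*p is at least 2 bits wider than n.
    """
    mask = (1 << w) - 1
    A, B, N = to_words(a, w, p), to_words(b, w, p), to_words(n, w, p)
    T = [0] * p
    for i in range(p):                        # step 2: words of B
        pre_c = next_c = 0
        m = [0] * w                           # bits of m, set while j == 0
        for j in range(p):                    # step 3: words of A and N
            r = 0 if i == 0 else T[j]
            shift_data = 0
            for bit in range(w):              # step 4: bits of A_j
                r += ((A[j] >> bit) & 1) * B[i]
                if j == 0:
                    m[bit] = r & 1            # correction bit from word 0
                r += m[bit] * N[j]
                shift_data |= (r & 1) << bit  # bit shifted out of word j
                r >>= 1
            s = r + pre_c                     # step 5, first addition
            pre_c, T[j] = s >> w, s & mask
            if j > 0:                         # step 5, second addition
                s = T[j - 1] + shift_data + next_c
                next_c, T[j - 1] = s >> w, s & mask
        T[p - 1] += next_c                    # assumption: fold last carry
    return from_words(T, w)


n = 0xC5A3                                    # 16-bit odd modulus
a, b = 0x1234, 0xBEEF
t = montgomery_eq6(a, b, n, 8, 3)             # w*p = 24 >= 16 + 2
print(t % n == (a * b * pow(1 << 24, -1, n)) % n, t < 2 * n)
```

Note that the inner loop never needs $N_0^{*}$: the correction bits `m` are read directly from the least significant bit of the running value while word 0 is processed, and reused for the remaining words.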

When the word size w is chosen for Equation 6, a bit length n' at least 2 bits larger than the operand bit length n (n' ≥ n+2) is taken as the total bit length, such that n' = w*p, where p is an integer and the bits above position n are set to 0. Since the result of the modular operation is always smaller than 2N, this ensures that the result can be represented within the given bit width. Moreover, considering that the modular multiplier serves only modular exponentiation, computing with a bit length at least 2 bits larger means that the final result of the exponentiation always comes out smaller than N, even without the comparison step that subtracts N whenever a modular multiplication result exceeds N. If the result of a single modular multiplication is needed, the comparison can be performed separately outside the proposed arithmetic device. Step 6 of Equation 6 can therefore be omitted in a hardware implementation. As noted above, the word size need not be limited to 32 bits and may be an integer multiple of 32 bits. If Equation 5 were applied as it stands and extended to such a word size, however, the circuit complexity and delay of the unit that performs the multiplication of step 3 would grow with the number of bits, so the structure would have to be reworked whenever the bit width increased.
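The effect of the two extra bits can be illustrated with a small modular exponentiation in Python. This is a sketch under assumptions: `mont_no_sub` and `mont_pow` are hypothetical names, the Montgomery product is computed with big integers rather than word by word, and the domain conversion and the last `% n` are done in software, standing in for the comparison that the text says can be performed outside the device.

```python
def mont_no_sub(a, b, n, k):
    """(a*b + m*n) / 2^k with no conditional subtraction (n odd)."""
    r = 1 << k
    n_star = (-pow(n, -1, r)) % r
    t = a * b
    m = (t * n_star) % r
    return (t + m * n) >> k


def mont_pow(base, exp, n):
    """base**exp mod n using only subtraction-free Montgomery products."""
    k = n.bit_length() + 2            # n' >= n + 2, so 2^k > 4n
    r = 1 << k
    x = (base * r) % n                # enter the Montgomery domain
    acc = r % n                       # Montgomery form of 1
    for bit in bin(exp)[2:]:          # left-to-right square and multiply
        acc = mont_no_sub(acc, acc, n, k)
        if bit == "1":
            acc = mont_no_sub(acc, x, n, k)
        assert acc < 2 * n            # inputs < 2n keep outputs < 2n
    out = mont_no_sub(acc, 1, n, k)   # leave the Montgomery domain
    return out % n                    # out is at most n; fold n to 0


print(mont_pow(5, 117, 0xC5A3) == pow(5, 117, 0xC5A3))
```

Because $2^{n'} > 4N$, two inputs below 2N give $(ab + mN)/2^{n'} < N + N = 2N$, so intermediate results can be fed straight back in without step 6.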

The structure proposed in the present invention, by contrast, can be extended as it is regardless of the bit width, so a hardware implementation only needs to instantiate the same structure at the selected bit width.

In Equation 6, additions are performed in step 4 and step 5. In step 4, $R_j$ is kept split into a carry-out and a sum-out and computed with a carry-save adder. This not only reduces the delay that a full adder would introduce; because step 5 requires adding a 1-bit carry to $R_j$, that is, (Pre_c, $T_j$) = $R_j$ + Pre_c, step 4 can output $R_j$ as a carry-out and sum-out pair from the carry-save adder, and step 5 can obtain the result efficiently with a w-bit adder that computes (Pre_c, $T_j$) = carry-out + sum-out + Pre_c. Furthermore, step 4 and step 5 run independently, so while step 4 processes the next word, step 5 performs its additions on the results of the previous operation.
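The carry-save idea can be shown with a short Python sketch. The function names `carry_save_add` and `resolve` are hypothetical, and the sketch only demonstrates the arithmetic identity behind the split into sum-out and carry-out; it does not model the shifting of step 4 or the exact circuit of FIG. 3 and FIG. 4.

```python
def carry_save_add(sum_out, carry_out, addend):
    """3:2 compression: add one operand without propagating carries."""
    new_sum = sum_out ^ carry_out ^ addend
    new_carry = ((sum_out & carry_out)
                 | (sum_out & addend)
                 | (carry_out & addend)) << 1
    return new_sum, new_carry


def resolve(sum_out, carry_out, pre_c):
    """One carry-propagating addition, as step 5 would perform."""
    return sum_out + carry_out + pre_c


s, c = 0, 0
for operand in (13, 200, 77):
    s, c = carry_save_add(s, c, operand)
print(resolve(s, c, 1) == 13 + 200 + 77 + 1)
```

Each call to `carry_save_add` has a delay of one full-adder cell regardless of word width, which is why the slow carry propagation can be deferred to the single addition in `resolve`.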

Performing step 4 on one set of word inputs therefore takes w clock cycles; that is, one set of word inputs is held for w clock cycles. If the next set of word inputs, which the CPU sends from memory, can be received in advance during the w clock cycles in which the current set is being processed, no separate time is needed for word input during the modular operation on the whole operand. Only when the transfer takes longer than w clock cycles is the excess added, so the execution time is shortened.
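The timing argument above can be expressed as a rough cycle-count model. The following Python sketch is a simplification made for illustration only: the function names, the parameter `load` (cycles the CPU needs to deliver one set of word inputs), and the assumption that a multiplication consists of p*p word operations of w cycles each are not figures taken from the patent.

```python
def cycles_without_prefetch(p, w, load):
    """Each word operation waits for its inputs, then computes."""
    return p * p * (load + w)


def cycles_with_prefetch(p, w, load):
    """Inputs for the next word operation arrive during the current one."""
    # Only the first load is exposed; afterwards the slower of the two
    # activities (w compute cycles or the transfer) sets the pace.
    return load + p * p * max(w, load)


p, w, load = 33, 32, 12          # hypothetical: 1024-bit operand, n' = 1056
print(cycles_without_prefetch(p, w, load), cycles_with_prefetch(p, w, load))
```

When `load` is at most `w`, the transfer is hidden completely except for the very first one; when it is longer, only the excess over `w` shows up per word operation.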

FIG. 1 is a block diagram showing an embodiment of the modular multiplication apparatus according to the present invention. The apparatus consists of a system bus (100), two-stage input registers N, A, and B (101, 102, 103), a control unit (104), an operation core (105), an adder (106), a status register (107), a control register (108), word registers (109), a memory (110), a multiplexer (119), and a central processing unit (120).

In the figure, the Montgomery modular multiplication apparatus according to the present invention performs the word-wise modular operation through the control unit (104), the operation core (105), and the adder (106) for the additions. The two-stage input registers N (101), A (102), and B (103) receive the data for the next operation in advance from the memory (110) via the system bus (100) and store it. The status register (107) reports the current operating state externally so that the central processing unit (120) can carry out its next action. The control register (108) stores the signals that the central processing unit (120) provides to control the modular multiplication apparatus. The word registers (109) comprise a plurality of w-bit registers that store the partial values produced at each step of the operation.

The overall operation is described below in terms of the control signals generated by the control unit (104).

First, the central processing unit (120) writes a value signalling the start of operation to the control register (108) via the system bus (100). The control unit (104) receives this value from the control register (108) and prepares for operation. The control unit (104) then generates the first input control signal (input_ctrl1) according to the address carried on the system bus (100), so that the data transferred over the system bus (100) is stored in register 1 (200), shown in FIG. 2, of the two-stage input registers N, A, and B (101, 102, 103). Once the two-stage input registers N, A, and B (101, 102, 103) have received all their data, the control unit (104) generates the second input control signal (input_ctrl2) and applies it to the lower input of the AND gate (203) in each two-stage input register (101, 102, 103), so that the value held in register 1 (200) is stored in register 2 (201). At the same time, the control unit (104) starts the operation core (105). In subsequent operation, once register 2 (201) has taken over the data from register 1 (200), the central processing unit (120) reads the value stored in the status register (107) and learns that it may send the next data; the control unit (104) again generates the first input control signal (input_ctrl1) according to the address on the system bus (100), so that register 1 (200) can receive new data from the system bus (100). Register 2 (201) takes new data from register 1 (200) on the second input control signal (input_ctrl2), which the control unit (104) generates each time the operation core (105) finishes an operation, that is, when the b count (MM_count) that the control unit (104) generates to sequence step 4 of Equation 6 wraps from w back to 0.
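The double-buffering behaviour of the two-stage input registers can be modelled in a few lines of Python. The class below is a behavioural sketch only: the name `TwoStageRegister`, the `clock` method, and the treatment of input_ctrl1 and input_ctrl2 as simple load enables are assumptions made for illustration, and the AND gate (203) of FIG. 2 is not modelled explicitly.

```python
class TwoStageRegister:
    """Behavioural model of one two-stage input register (N, A, or B)."""

    def __init__(self):
        self.reg1 = 0   # register 1 (200): written from the system bus
        self.reg2 = 0   # register 2 (201): read by the operation core

    def clock(self, bus_data, input_ctrl1, input_ctrl2):
        """Advance one clock edge; both stages update simultaneously."""
        next_reg2 = self.reg1 if input_ctrl2 else self.reg2
        next_reg1 = bus_data if input_ctrl1 else self.reg1
        self.reg1, self.reg2 = next_reg1, next_reg2


reg = TwoStageRegister()
reg.clock(0xAA, True, False)    # CPU writes the first word
reg.clock(0x00, False, True)    # word handed to the core, operation starts
reg.clock(0xBB, True, False)    # next word prefetched while the core runs
print(hex(reg.reg1), hex(reg.reg2))
```

While the core works on the value in `reg2`, the CPU is free to overwrite `reg1`, which is what allows the transfer of the next word to overlap with the current w-cycle operation.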

Besides the second input control signal (input_ctrl2) and the b count (MM_count) described above, the control unit (104) sends two further counts, the word count (word_count) and the total count (total_count), to the operation core (105). MM_count is the index b in step 4 of Equation 6, word_count is the index j in step 3, and total_count is the index i in step 2. total_count indicates which word of B is currently input, word_count indicates the word position within A, and MM_count indicates the current bit position within $A_j$; together they control the operation of the operation core (105).

After the operation core unit 105 has performed one step 4 operation, that is, the operation on A_j and B_i, it outputs the sum out, the carry out, and the shift data to the adder 106. At this time, the control unit 104 generates the add start (add_start) signal to operate the adder 106 and simultaneously generates the add count (add_count) signal, a counter indicating the number of the addition, so that two w-bit additions are performed in sequence.

The adder 106 performs two additions and transfers the two output values to the word registers 109. With j denoting the word count (word_count) at the time the operation core unit 105 computed the values supplied to the adder 106, the two outputs of the adder 106 are stored in the j-th word register 109 and the (j-1)-th word register 109, respectively. To select the word register 109 in which each value is stored, the control unit 104 uses the word count (word_count) to generate a signal called input select (input_sel) toward the word registers 109, and the two outputs of the adder 106 are stored in the appropriate word registers 109 according to this value.

The values of the word registers 109 are also used as inputs to the operation core unit 105 and the adder 106. For this purpose, the control unit 104 generates a signal called output select (output_sel) to control the multiplexer 119. As expressed in step 3 of Equation 6, for the current word count (word_count) j, the operation core unit 105 does not need a word register 109 value when the total count (total_count), which gives the current word position of B, is 0. When the total count (total_count) is not 0, it receives the value of the j-th word register 109 and uses it as its initial value. In addition, because the adder 106 must receive the value of the (j-1)-th word register 109 as one of its inputs, the control unit 104 provides the input select (input_sel) signal so that the outputs of the word registers 109 can be selected accordingly.

In the course of the operation of the control unit 104 described above, the series of control signals is stored in the status register 107, as shown in FIG. 1, to inform the central processing unit 120 of the current state of the modular multiplication. The same signals are fed back to the control unit 104 so that it can determine the next control signal values. When the entire modular multiplication has been completed, the central processing unit 120 writes a value for stopping the operation to the control register 108, and the control unit 104 receives it and stops the operation.

FIG. 2 is a circuit diagram of an embodiment of the two-stage input registers N, A, and B (101, 102, 103) shown in FIG. 1. Each consists of register 1 (200), register 2 (201), and an AND gate 203.

In the figure, register 1 (200) stores the data provided from the system bus 100 according to the first input control signal (Input_ctrl1). Register 2 (201) stores the data provided from register 1 (200) according to the output of the AND gate 203. The AND gate 203 ANDs the clock with the second input control signal (Input_ctrl2) and provides the resulting gated clock to register 2 (201). By using the AND gate 203 in this way, the two-stage input registers N, A, and B (101, 102, 103) store data only at the moment a new input value is required.
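The behavior of the two-stage register can be illustrated with a small software model. This is a behavioral sketch only, not a description of the circuit in FIG. 2: the class name `TwoStageRegister` and the method `clock` are hypothetical, the gated clock is modeled as an enable evaluated once per clock edge, and treating Input_ctrl1 as a per-edge load enable for register 1 is an assumption.

```python
class TwoStageRegister:
    """Behavioral model of one two-stage input register (a sketch)."""

    def __init__(self):
        self.reg1 = 0  # register 1 (200): loaded from the system bus
        self.reg2 = 0  # register 2 (201): feeds the operation core unit

    def clock(self, bus_data, input_ctrl1, input_ctrl2):
        """Model one clock edge.

        Register 2 samples the old value of register 1, and only when
        input_ctrl2 enables its gated clock (clock AND input_ctrl2).
        Register 1 loads the bus value when input_ctrl1 is asserted.
        """
        if input_ctrl2:
            self.reg2 = self.reg1
        if input_ctrl1:
            self.reg1 = bus_data


r = TwoStageRegister()
r.clock(0xAAAA, input_ctrl1=1, input_ctrl2=0)  # CPU writes the first word
r.clock(0x0000, input_ctrl1=0, input_ctrl2=1)  # core takes it into register 2
r.clock(0xBBBB, input_ctrl1=1, input_ctrl2=0)  # CPU preloads the next word
print(hex(r.reg2), hex(r.reg1))  # 0xaaaa 0xbbbb
```

The usage lines show the point of the two stages: register 2 keeps the word the core is working on while register 1 already holds the next one.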

FIG. 3 is a detailed diagram of an embodiment of the operation core unit 105 shown in FIG. 1.

The structure is broadly similar to one that processes all data bits of Equation 4 at once, with the important difference that it has an m register 307 that stores the Montgomery correction factor m and a shift data register 310 that stores the shift data. The operation of the operation core unit 105 is described below following the steps of the procedure expressed in Equation 6. During the first pass of step 2, that is, until steps 3 and 4 have all been performed and the modular operation result for A and B₀ has been output while the total count (total_count) received from the control unit 104 is 0, the carry in (Carry_in) is 0 and the sum in (Sum_in) is likewise 0, supplied through multiplexer 0 (300). For each later pass of step 2, the word registers 109 hold the results of the previous pass. The value of the word register 109 whose index matches the current word operation order, j of step 3, is input, selected through multiplexer 0 (300), and applied as the sum in.

For each pass of step 2, when the first step 3 is performed, that is, A₀*B_i with the word count (word_count) equal to 0, the following takes place over w clock cycles. Carry-save adder 1 (306) takes as inputs the output of the A_0b*B_i module (305), with j = 0, and the values of the carry register 303 and the sum register 304. The least significant bit Sum_0 of its sum output is stored in the m register (307), and the m_b*N_j module (308) is evaluated according to the bit order b. Its result is sent to carry-save adder 2 (309) together with the two outputs of carry-save adder 1 (306). This is done because, during the step 3 iterations that follow the first word operation A₀*B_i, the Montgomery correction factor is equal to the value stored in the m register 307. In this way, the Montgomery correction factor, which required a separate calculation in Equation 5, is obtained while the word operation itself is being performed.
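The idea that the correction factor falls out of the least significant bit of the running sum can be seen in the textbook bit-serial (radix-2) Montgomery multiplication. The sketch below is that full-width textbook form, not the word-serial schedule of Equation 6, and it collapses the carry-save representation into an ordinary integer. The function name `mont_mul_bitserial` and its parameters are hypothetical, and the sketch assumes an odd modulus n and b < n.

```python
def mont_mul_bitserial(a, b, n, nbits):
    """Return a * b * 2**(-nbits) mod n, consuming one bit of a per iteration.

    Assumptions: n is odd, b < n, and a < 2**nbits.
    """
    assert n % 2 == 1, "the modulus must be odd"
    s = 0
    for k in range(nbits):
        a_k = (a >> k) & 1      # current bit of A, least significant first
        s += a_k * b            # partial product (bit of A) * B
        m = s & 1               # correction bit = LSB of the running sum
        s = (s + m * n) >> 1    # adding m*N clears the LSB, then shift it out
    return s - n if s >= n else s  # s < 2n here, so one subtraction suffices


r = mont_mul_bitserial(123, 200, 239, 8)
# Multiplying back by 2**8 recovers the ordinary product modulo n.
assert (r * 2**8) % 239 == (123 * 200) % 239
```

No separate step computes m in advance: each bit of m is read directly from the sum, which is the property the m register 307 relies on.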

Among the outputs of carry-save adder 2 (309), the bit Sum_0 is always 0 in the case of Equation 4 and is therefore simply shifted out and discarded. In the present structure, however, the operation is performed word by word rather than on the whole operand at once, so these bits must be saved and added to the operation result of the previous block. They are therefore stored in the shift data register 310.

While performing step 4, the operation core unit 105 completes one word operation after w clock cycles and outputs three w-bit values, the carry out, the sum out, and the shift data, to the adder 106, which performs step 5.

FIG. 4 is a detailed diagram of an embodiment of the adder 106 shown in FIG. 1, which performs step 5 of Equation 6.

Once one word operation, that is, step 4, has been performed, the operation core unit 105 produces the three outputs described with reference to FIG. 3. To obtain the result of each word operation from them, they must be added according to step 5, as follows. First, the control unit 104 generates the add start (add_start) signal to tell the adder 106 that it is time to operate and to perform the additions. The adder 106 receives and stores the carry out (400), the sum out (402), and the shift data (401) output from the operation core unit 105 and, for the index j of the word operation just completed, receives the value of the (j-1)-th word register 109. As shown in FIG. 4, multiplexer 1 (403) and multiplexer 2 (404) select the two adder inputs from these values according to the add count (add_count) signal, and the carry input is selected from the two carry values Pre_c (406) and Next_c (407) through multiplexer 3 (408).

In the first addition, the carry out (400) and the sum out (402) are added by the w-bit adder 405. This yields the pre-sum (Pre_sum) and the carry bit Pre_c (406). Pre_c is stored for use as the carry input of the same addition on the output of the next, (j+1)-th, operation of the operation core unit 105, and the pre-sum (Pre_sum) is stored in the word register 109 that matches the word operation order j.

In the second addition, the shift data (401) is added by the w-bit adder 405 to the value of the (j-1)-th word register 109, which holds the pre-sum (Pre_sum) produced by the adder 106 after the previous, (j-1)-th, operation of the operation core unit 105. This yields the next sum (Next_sum) and the carry Next_c (407). Next_c is stored for the second addition of step 5 to be performed after the next, (j+1)-th, word operation. The next sum (Next_sum) is written back to the word register 109 that supplied the input, that is, the (j-1)-th word register 109, and this value is the result of the word modular operation on the inputs A_j-1 and B_i.
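The two additions of step 5 can be sketched as a single function that models one pass of the adder for a word index j. This is a behavioral illustration under stated assumptions: the function name `adder_step` and its argument names are hypothetical, the caller is assumed to handle reading and writing the word registers, and the mapping of add_count values 0 and 1 to the first and second addition is assumed rather than taken from the source.

```python
def adder_step(carry_out, sum_out, shift_data, prev_word, pre_c, next_c, w):
    """Model one pass of the adder for word index j (behavioral sketch).

    Returns (pre_sum, pre_c, next_sum, next_c), where
    pre_sum  is the value to store in word register j, and
    next_sum is the value to write back to word register j-1.
    """
    mask = (1 << w) - 1
    # First addition (assumed add_count = 0): carry out + sum out + Pre_c.
    t = carry_out + sum_out + pre_c
    pre_sum, pre_c = t & mask, t >> w
    # Second addition (assumed add_count = 1):
    # shift data + word register j-1 + Next_c.
    t = shift_data + prev_word + next_c
    next_sum, next_c = t & mask, t >> w
    return pre_sum, pre_c, next_sum, next_c


# w = 8: 0xF0 + 0x20 overflows into Pre_c; 0x01 + 0xFF overflows into Next_c.
print(adder_step(0xF0, 0x20, 0x01, 0xFF, 0, 0, 8))  # (16, 1, 0, 1)
```

Keeping Pre_c and Next_c as separate state reflects the text: each carry belongs to its own chain of additions across successive words, even though both chains share the one w-bit adder.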

By repeating the above operation, the final result of the Montgomery modular multiplication for the complete data inputs A, B, and N is stored in the word registers 109. This value is then transferred to the system bus 100, 32 bits at a time, according to the control signal of the central processing unit. Because the addition starts after the operation of the operation core unit 105 has finished, its timing can be set by a clock branch within the control unit 104. And because the addition is independent of the operation core unit, the core's operation on the next word data proceeds without waiting for the addition result. When the word unit is 32 bits or more, it is more efficient to obtain the result by running a 32-bit adder repeatedly than to add the whole word at once.
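Reusing a 32-bit adder for a wider word amounts to adding chunk by chunk and passing the carry along. The sketch below illustrates that idea only. The function name `add_wide` and its parameters are hypothetical, and it assumes the word size w is a multiple of the chunk size.

```python
def add_wide(x, y, w, carry_in=0, chunk=32):
    """Add two w-bit values with a chunk-bit adder, lowest chunk first.

    Assumption: w is a multiple of chunk.
    Returns (w-bit sum, 1-bit carry out).
    """
    mask = (1 << chunk) - 1
    result, carry = 0, carry_in
    for k in range(0, w, chunk):
        t = ((x >> k) & mask) + ((y >> k) & mask) + carry
        result |= (t & mask) << k   # keep the low chunk bits of this step
        carry = t >> chunk          # pass the carry to the next chunk
    return result, carry


# A carry out of the low 32 bits ripples into the high 32 bits.
assert add_wide(0x00000001_FFFFFFFF, 1, w=64) == (0x00000002_00000000, 0)
```

The trade-off is time for area: a w-bit word takes w/32 adder passes, but only one 32-bit adder is needed.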

As described above, the present invention implements an operation device with an arbitrary word size so that Montgomery modular multiplication can be applied to single-chip systems of limited area, such as IC cards. The design takes interworking with the central processing unit fully into account and thereby minimizes the time delay for data input. It also performs the operation without an additional step for calculating the Montgomery correction factor, which eliminates that separate calculation and its time delay. The overall operation is thus simplified, making the device easy to implement and extend in hardware.

Claims

A central processing unit connected to a predetermined system bus;

A memory connected to the system bus for inputting / outputting data required for a modular multiplication operation;

Two-stage input registers A, B, and N configured to store a multiplier A, a multiplicand B, and a modulus N, respectively, input from the system bus;

An operation core unit that receives the data stored in the two-stage input registers A, B, and N, performs a modular operation, outputs the result divided into a carry out and a sum out, and stores the shift data generated during the operation and the Montgomery correction factor m;

A status register connected to the system bus for storing and notifying a modular multiplication operation state;

A control register for receiving and storing a control signal of the central processing unit through the system bus;

A register group connected to the system bus, the register group comprising a plurality of registers for storing respective intermediate result values to store partial products during modular multiplication and to store and output a final result value;

An adder that receives the carry out, sum out, and shift data output from the operation core unit together with the output of the register group, and selectively performs two additions; and

A control unit that is connected to the system bus, receives the control signal of the central processing unit from the control register, controls the input/output of the operation core unit and the input/output of the adder until the final result value of the modular multiplication operation is output, and generates a signal for storing the added values in the register group.

The modular multiplication apparatus of claim 1, wherein the multiplier A, the multiplicand B, and the modulus N are input from the system bus in word units.

The modular multiplication apparatus of claim 1, wherein the operation core unit receives data stored in the two-stage input registers A, B, and N, respectively, and performs a modular operation in units of words.

The modular multiplication apparatus of claim 1, wherein each intermediate result value is an intermediate result value in a word unit.

The modular multiplication apparatus of claim 1, wherein the two additions are performed in units of words.

The modular multiplication apparatus of claim 1, wherein each of the two-stage input registers A, B, and N is composed of two registers having the same bit size as the word-unit input.

The modular multiplication apparatus of claim 1,

wherein each of the two-stage input registers A, B, and N includes: a register 1 for storing data provided from the system bus according to a first input control signal (Input_ctrl1) provided from the control unit;

an AND gate for performing an AND operation on a clock and a second input control signal (Input_ctrl2) provided from the control unit; and

a register 2 for storing data provided from register 1 according to the output of the AND gate.

The modular multiplication apparatus of claim 1,

wherein the operation core unit includes: an operation module that multiplies a bit of A by a word of B, for the data provided from the two-stage input registers A and B, according to the b count value, the word count, and the total count provided from the control unit;

a multiplexer 0 that selects between the word register value and 0, according to the total count provided from the control unit, to determine an input of multiplexer 2;

a multiplexer 1 that selects between an external input and the carry out output from carry-save adder 2, according to the b count value, to determine the input of the carry register;

a multiplexer 2 that selects between the output of multiplexer 0 and the sum out output from carry-save adder 2, according to the b count value, to determine the input of the sum register;

a carry register for storing the carry input value of carry-save adder 1;

a sum register for storing the sum input value of carry-save adder 1;

a carry-save adder 1 that receives the carry register, the sum register, and the output of the operation module that multiplies the bit of A by the word of B, and outputs the result in carry and sum form;

an m register for storing the least significant bit of the sum output of carry-save adder 1 according to the word count;

an operation module that multiplies the bit of the m register at the position matching the b count value by the word of N, for the two-stage input register N, according to the word count provided from the control unit;

a carry-save adder 2 that receives the two outputs of carry-save adder 1 and the output of the operation module that multiplies the bit of m by the word of N, and outputs a carry out and a sum out as result values; and

a shift data register for storing the least significant bit of the sum out output from carry-save adder 2.

The modular multiplication apparatus of claim 1,

wherein the adder includes: a carry out register, a shift data register, and a sum out register for storing the carry out, shift data, and sum out that are output from the operation core unit, according to the add start signal provided from the control unit;

a multiplexer 1 that selects between the output of the carry out register and the output of the shift data register, according to the add count signal provided from the control unit, to determine an input of the w-bit adder;

a multiplexer 2 that selects between the output of the sum out register and the output of the word register, according to the add count signal provided from the control unit, to determine an input of the w-bit adder;

a w-bit adder that receives the output of multiplexer 1 and the output of multiplexer 2, performs a w-bit addition, and outputs, for each of the two additions, a 1-bit carry and a word-size result value;

a register for storing Pre_c, the first 1-bit carry output of the w-bit adder, and a register for storing Next_c, the second 1-bit carry output; and

a multiplexer 3 that selects between the output of the Pre_c register and the output of the Next_c register, according to the add count, to determine the 1-bit carry input of the w-bit adder.