KR20090070061A

KR20090070061A - Scalable Montgomery multiplier on dual field using multi-precision carry save adder

Info

Publication number: KR20090070061A
Application number: KR1020070137931A
Authority: KR
Inventors: 김태호; 김창훈; 홍춘표
Original assignee: 대구대학교 산학협력단
Priority date: 2007-12-26
Filing date: 2007-12-26
Publication date: 2009-07-01
Also published as: KR100946256B1

Abstract

A scalable Montgomery multiplier on dual fields using a multi-precision carry save adder is provided, which reduces the number of clock cycles needed to correct the result without requiring an additional circuit. The multiplier comprises a plurality of pipelined processing elements (PEs) and an m-bit right shift register connected to each PE. A control signal (Add/Mult SEL) selects multiplication or addition. If Add/Mult SEL is 0, addition is performed and the result is obtained at the n-th PE. If Add/Mult SEL is 1, the multiplication result is obtained from a queue. Each PE takes its bit of operand A from the m-bit shift register, which shifts by k bits.

Description

Scalable Montgomery Multiplier on Dual Field Using Multi-Precision Carry Save Adder

The present invention relates to a scalable Montgomery multiplier on dual fields using a multi-precision carry save adder.

Modular multiplication and modular exponentiation are important operations in cryptographic applications such as RSA (J.-J. Quisquater and C. Couvreur, “Fast Decipherment Algorithm for RSA Public-key Cryptosystem,” IEE Electronics Letters, Vol. 18, No. 21, pp. 905-907, 1982), the Diffie-Hellman key exchange algorithm (W. Diffie and M.E. Hellman, “New Directions in Cryptography,” IEEE Transactions on Information Theory, Vol. 22, pp. 644-654, 1976), and elliptic curve cryptosystems (ECC) (N. Koblitz, “Elliptic Curve Cryptosystems,” Mathematics of Computation, Vol. 48, No. 177, pp. 203-209, 1987). In particular, modular multiplication is the basic operation underlying modular exponentiation and inversion, and many research results on it have been published. Methods used for modular multiplication include the classical method, the Montgomery algorithm (P.L. Montgomery, “Modular Multiplication without Trial Division,” Math. Computation, Vol. 44, pp. 519-521, 1985), and the Barrett algorithm (P. Barrett, “Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor,” Lecture Notes in Computer Science, Vol. 263, pp. 311-323, 1987). Among these, Montgomery multiplication is very well suited to hardware implementation because its internal operations are regular and its division operations are replaced by shift operations. Because of these advantages, various modified Montgomery algorithms and corresponding hardware implementations have been studied.

Various design and hardware implementation methods for scalable Montgomery multipliers have been introduced (E. Savas, A.F. Tenca, and Ç.K. Koç, “A Scalable and Unified Multiplier Architecture for Finite Fields GF(p) and GF(2^m),” Lecture Notes in Computer Science, Vol. 1965, pp. 277-292, 2000; A. Tenca and Ç.K. Koç, “A Scalable Architecture for Modular Multiplication Based on Montgomery's Algorithm,” IEEE Trans. on Computers, Vol. 52, No. 9, pp. 1215-1221, 2003; A.F. Tenca and Ç.K. Koç, “A Scalable Architecture for Montgomery Multiplication,” Lecture Notes in Computer Science, Vol. 1717, pp. 94-108, 1999). A scalable structure performs its computation with a circuit of fixed size, regardless of the size of the input. The data is divided into words of a fixed size and then processed and transferred word by word. If the data size is m bits and the word size is w bits, the number of words is e = ⌈m/w⌉. A larger word size shortens the computation time of a scalable structure but increases its hardware complexity. The trade-off between computation time and hardware complexity can be improved by finding the word size that best satisfies the area and speed requirements, and this trade-off has already been analyzed for scalable Montgomery multipliers.

The Montgomery multiplication algorithm was originally proposed as an efficient method of modular multiplication with an odd modulus. If the modulus is prime, the algorithm performs very efficient multiplication over the finite field GF(p); if a polynomial basis is used and an irreducible polynomial is chosen, it can also perform multiplication over the finite field GF(2^m). When a carry propagation adder (CPA) is used for Montgomery multiplication over GF(p), the critical-path delay increases, and carry propagation becomes a problem as the number of bits grows. To solve this problem, previously proposed Montgomery multipliers use a carry save adder (CSA). However, because a CSA delivers the product in carry save (CS) form, split into a sum and a carry, an additional m-bit adder circuit or m additional clock cycles are required to obtain the exact product (J.C. Ha and S.J. Moon, “A Design of Modular Multiplier Based on Multi-Precision Carry Save Adder,” Joint Workshop on Information Security and Cryptology (JWISC'2000), pp. 45-51, 2000). Recently, Kim et al. (D.-Y. Kim and J.-Y. Lee, “Design of Modular Multipliers Based on Improved Multi-Precision CSA,” Journal of KIISE: Systems and Theory, Vol. 33, No. 34, pp. 223-230, 2006) proposed an efficient modular multiplier based on a multi-precision (MP) CSA. The MP-CSA combines a CSA with CPAs, which solves the carry propagation problem and at the same time reduces the number of clock cycles needed to correct the result. However, the structure of Kim et al. does not provide scalability with respect to m and w.

The present invention is proposed to overcome the conventional problems described above. Its object is to provide a scalable Montgomery multiplier on dual fields using a multi-precision carry save adder that uses fewer flip-flops than previously proposed structures and reduces the number of clock cycles required for correction without requiring an additional circuit.

According to a preferred embodiment of the present invention for achieving the above object, the scalable Montgomery multiplier on dual fields using a multi-precision carry save adder comprises a plurality of processing elements (PEs) arranged as a pipeline, and an m-bit right shift register connected to each of the PEs. Multiplication or addition is selected by a control signal (Add/Mult SEL): if Add/Mult SEL is 0, addition is performed and the result is obtained at the n-th PE; if Add/Mult SEL is 1, the multiplication result is obtained from a queue. Each PE takes its bit of operand A (a_i, ..., a_{i+k-1}) from the m-bit shift register, which shifts by k bits, where k is the number of PEs. To make the system flexible, a queue is used to store S and C. The maximum length of the queue depends on the maximum number of words stored in memory and on the number of pipeline stages, and is calculated as in Equation 5 below.

As described above, the present invention proposes a scalable Montgomery multiplier on dual fields using a new MP-CSA. The structure performs multiplication over the finite fields GF(p) and GF(2^m) and uses fewer flip-flops (FFs) than similar structures proposed previously. Unlike the structure proposed by Savas et al., the circuit requires no additional circuit for correcting the result and reduces the number of clock cycles required for correction. Moreover, the multiplier circuit can be reused for addition and is highly scalable with respect to m and w. The proposed structure is therefore well suited as a multiplier over GF(p) and GF(2^m) for cryptographic applications.

Hereinafter, the scalable Montgomery multiplier on dual fields using a multi-precision carry save adder according to the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 shows the Montgomery multiplication algorithm over the finite field GF(p).

FIG. 2 shows the Montgomery multiplication algorithm over the finite field GF(2^m).

FIG. 3 shows the word-level Montgomery multiplication algorithm over GF(p) based on the multi-precision carry save adder according to the present invention.

FIG. 4 shows the word-level Montgomery multiplication algorithm over GF(2^m) based on the multi-precision carry save adder according to the present invention.

FIG. 5a shows the data dependency graph for the scalable Montgomery multiplication algorithm.

FIG. 5b schematically shows a 4-bit multiplication (w=1, n=2) using two pipelined processing elements (PEs).

FIG. 5c schematically shows the same computation as FIG. 5b using three PEs.

FIG. 6a shows the data dependency graph for word-level addition.

FIG. 6b schematically shows a 6-bit addition (w=1, n=2) using four pipelined PEs.

FIG. 7 shows the pipelined scalable Montgomery multiplier on dual fields using a multi-precision carry save adder, for multiplication and addition, according to the present invention.

FIG. 8 shows a block diagram of the PE of FIG. 7.

FIG. 9 shows the data path of the PE of FIG. 7 (w=4, b=2, n=2).

Referring to FIGS. 1 to 9, the scalable Montgomery multiplier on dual fields according to the present invention performs multiplication over the finite fields GF(p) and GF(2^m). The multi-precision carry save adder according to the present invention consists of two carry save adders, and each carry save adder, which processes a w-bit word, consists of n = ⌈w/b⌉ carry propagation adders, where b is the number of dual-field adders in one CPA. The Montgomery multiplier according to the present invention has almost the same time complexity as existing designs but lower hardware complexity. In addition, unlike existing designs, it outputs the exact modular multiplication result when the operation ends. Moreover, the circuit is highly scalable with respect to m and w. The structure is therefore well suited as a multiplier over GF(p) and GF(2^m) for cryptographic applications.

Montgomery multiplication algorithm

Given integers A and B and a modulus p, the Montgomery multiplication algorithm computes Equation 1 below.

[Equation 1]

$$A \cdot B \cdot R^{-1} \bmod p$$

Here R = 2^m, A, B < p < R, and p is an m = ⌈log₂ p⌉-bit number. The present invention assumes that p is prime. Based on Equation 1, the Montgomery multiplication algorithm over GF(p) shown as [Algorithm 1] in FIG. 1 is obtained.

In FIG. 1, if the final result is greater than p, a subtraction must be performed as in step 6.
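The bit-serial Montgomery multiplication over GF(p) described above can be sketched in Python as follows. [Algorithm 1] itself is in FIG. 1 and is not reproduced in this text, so this is the textbook radix-2 form, offered as an illustration under that assumption; the function name `mont_mul_gfp` and the use of Python integers are illustrative choices, not part of the patent.

```python
def mont_mul_gfp(a, b, p, m):
    """Return a*b*2^(-m) mod p for an odd modulus p with a, b < p < 2^m.

    Textbook radix-2 Montgomery multiplication (assumed to correspond to
    [Algorithm 1] of FIG. 1, which is not reproduced in the text).
    """
    c = 0
    for i in range(m):
        c += ((a >> i) & 1) * b   # add a_i * B
        if c & 1:                 # odd partial result: add the odd modulus
            c += p
        c >>= 1                   # now even, so the division by 2 is exact
    if c >= p:                    # final correction (step 6 in the text)
        c -= p
    return c


if __name__ == "__main__":
    p, m = 239, 8                 # small prime with p < 2^8
    r_inv = pow(pow(2, m, p), p - 2, p)   # R^(-1) mod p via Fermat
    print(mont_mul_gfp(123, 45, p, m) == (123 * 45 * r_inv) % p)
```

The division by R never appears explicitly: it is spread over the m one-bit right shifts, which is why the method maps well onto shift registers.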

In the binary extension field GF(2^m), field elements are represented as polynomials of degree (m-1) over GF(2). Given two polynomials A(x) and B(x), Montgomery multiplication is defined as in Equation 2 below.

[Equation 2]

$$A(x) \cdot B(x) \cdot R(x)^{-1} \bmod p(x)$$

Here R(x) = x^m, which takes the place of R = 2^m in [Algorithm 1] of FIG. 1, and p(x) is an irreducible polynomial. Based on Equation 2, the Montgomery multiplication algorithm over GF(2^m) shown as [Algorithm 2] in FIG. 2 is obtained.

In [Algorithm 2] of FIG. 2, 0^m denotes m bits that are all zero. The additional subtraction performed in step 6 of [Algorithm 1] of FIG. 1 is omitted in the Montgomery multiplication algorithm over GF(2^m). In addition, because operations over GF(2^m) generate no carries, addition can be replaced by a simple bitwise XOR, denoted by the symbol ⊕.
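The GF(2^m) variant can be sketched in the same way, with XOR replacing addition and no final subtraction. [Algorithm 2] is in FIG. 2 and is not reproduced here, so the code below is the standard bit-serial form under that assumption. Polynomials are packed into Python integers (bit i is the coefficient of x^i); the name `mont_mul_gf2m` is illustrative.

```python
def mont_mul_gf2m(a, b, p, m):
    """Return a(x)*b(x)*x^(-m) mod p(x) over GF(2).

    a and b have degree < m; p is an irreducible polynomial of degree m,
    so bits m and 0 of p are both set. Polynomials are bit-packed ints.
    """
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b        # carry-free addition of a_i * B(x)
        if c & 1:
            c ^= p        # p has constant term 1, so this clears bit 0
        c >>= 1           # exact division by x
    return c              # degree < m already, no correction step


if __name__ == "__main__":
    p, m = 0b10011, 4             # x^4 + x + 1
    r = p ^ (1 << m)              # x^m mod p(x)
    # Multiplying by R and then by R^(-1) returns the operand unchanged.
    print(all(mont_mul_gf2m(a, r, p, m) == a for a in range(1 << m)))
```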

Word-level Montgomery multiplication algorithm based on multi-precision CSA

A word-level structure divides the data into words of a fixed size and then processes and transfers it word by word. If the data size is m bits and the word size is w bits, the operands of the word-level Montgomery multiplication are divided into e = ⌈(m+1)/w⌉ words.

Montgomery multiplier based on multi-precision CSA

A CPA-based Montgomery multiplier has the advantage of producing its result directly in non-redundant form, but carry propagation becomes a problem as the number of bits grows. Because public-key cryptographic applications require keys of 2048 bits or more, the carry propagation delay can be a serious problem. A CSA, by contrast, has the same carry propagation delay as a 1-bit full adder (FA), so it avoids the carry propagation problem and allows high-speed operation. However, because intermediate products are stored in CS form, additional computation is required to obtain a result in non-redundant form. This is usually solved with additional circuitry or additional clock cycles (J.C. Ha and S.J. Moon, “A Design of Modular Multiplier Based on Multi-Precision Carry Save Adder,” Joint Workshop on Information Security and Cryptology (JWISC'2000), pp. 45-51, 2000). Recently, Kim et al. (D.-Y. Kim and J.-Y. Lee, “Design of Modular Multipliers Based on Improved Multi-Precision CSA,” Journal of KIISE: Systems and Theory, Vol. 33, No. 34, pp. 223-230, 2006) proposed a hybrid MP-CSA structure that combines CSA and CPA. Their Montgomery multiplier consists of two CSAs, and for m-bit data each CSA is divided into n = ⌈m/b⌉ CPAs, where b is the number of FAs in one CPA. The Montgomery multiplier proposed by Kim et al. corrects the result with n additional clock cycles.
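The idea of splitting an adder into n short CPAs and saving the carries between them can be illustrated with a small behavioural model. This is a simplified sketch, not the circuit of Kim et al. or of the present invention: it accumulates one operand per clock, works modulo 2^(n·b), and all names (`mp_csa_step`, `mp_csa_sum`) are hypothetical. It shows why n extra clocks with zero input are enough to turn the saved carries into a non-redundant result.

```python
def mp_csa_step(s, c, x, b):
    """One clock of an n-block adder: s and x are lists of b-bit blocks,
    c holds the carry saved by each block on the previous clock."""
    mask = (1 << b) - 1
    new_s, new_c = [], []
    for k in range(len(s)):
        carry_in = c[k - 1] if k > 0 else 0   # carry saved by the block below
        t = s[k] + x[k] + carry_in            # short b-bit carry-propagate add
        new_s.append(t & mask)
        new_c.append(t >> b)                  # saved for the next clock
    return new_s, new_c


def mp_csa_sum(values, n, b):
    """Accumulate values modulo 2^(n*b), then flush the saved carries."""
    mask = (1 << b) - 1
    s, c = [0] * n, [0] * n
    for v in values:
        x = [(v >> (b * k)) & mask for k in range(n)]
        s, c = mp_csa_step(s, c, x, b)
    for _ in range(n):                        # n zero-input clocks
        s, c = mp_csa_step(s, c, [0] * n, b)
    assert not any(c)                         # result is now non-redundant
    return sum(blk << (b * k) for k, blk in enumerate(s))


if __name__ == "__main__":
    vals = [0xFF, 0x01, 0xAB, 0x3C]
    print(mp_csa_sum(vals, n=4, b=2) == sum(vals) % 256)
```

During accumulation the longest carry chain is b bits, whatever the operand width; the price is the n flush clocks at the end.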

Word-level Montgomery multiplication algorithm over GF(p)

The present invention proposes a new word-level Montgomery multiplication algorithm over GF(p), shown as [Algorithm 3] in FIG. 3. Given two operands B (the multiplicand) and A (the multiplier) and a modulus p, the word-level Montgomery multiplication algorithm computes A·B·R^-1 mod p. B is read word by word and A is read bit by bit. In [Algorithm 3] of FIG. 3, the w-bit words of B and p are divided among b-bit CPAs, and the vectors A, B, and p used in the multiplication are represented as in Equation 3 below.

Here the word and CPA indices are written as superscripts and the bit index as a subscript. For example, the i-th bit of the k-th CPA of word j of the vector B is written B_i^{(j)(k)}. Bits i through j of the vector B are written B_{j...i} (j > i), and (x│y) denotes the concatenation of two bit sequences.

For each bit of A, the multiplication algorithm according to the present invention computes the partial sum of TS, B, and p. Once B has been read completely, the next bit of A is read and the computation is repeated. [Algorithm 3] of FIG. 3 stores intermediate results in CS form: the addition result is held in the m-bit TS^(j) and the n-bit TC^(j). After m iterations the product is available in CS form, and n additional additions are performed to obtain the result in non-redundant form (step 3); during these additions the inputs a_i and B are all set to 0. In some cases the value output after m iterations may be greater than the modulus p, and two more iterations can bring the final result below p (T. Blum and C. Paar, “Montgomery modular exponentiation on reconfigurable hardware,” in Proc. 14th IEEE Symp. on Computer Arithmetic, pp. 70-77, 1999). Because the multiplication algorithm according to the present invention already runs n additional clocks, the subtraction can be eliminated when n is greater than 2.
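The word-by-word scanning described above (B and p read in w-bit words, A read one bit at a time, a 1-bit right shift stitched across word boundaries) can be sketched as follows. This is not [Algorithm 3]: the figure is not reproduced in the text, and the sketch keeps an ordinary word carry instead of the TS/TC carry-save pair and ends with a conventional subtraction. It follows the general shape of the multiple-word radix-2 method in the cited work of Tenca and Koç; the name `mont_mul_word_serial` is hypothetical.

```python
def mont_mul_word_serial(a, b, p, m, w):
    """Return a*b*2^(-m) mod p, processing B and p in w-bit words.

    Simplified illustration: plain word carries, not carry-save form.
    Requires an odd p with a, b < p < 2^m.
    """
    e = -(-(m + 1) // w)                      # e = ceil((m+1)/w) words
    mask = (1 << w) - 1
    bw = [(b >> (w * j)) & mask for j in range(e)]
    pw = [(p >> (w * j)) & mask for j in range(e)]
    ts = [0] * e
    for i in range(m):
        a_i = (a >> i) & 1
        # The least significant bit of word 0 decides whether p is added.
        add_p = (ts[0] + a_i * bw[0]) & 1
        carry = 0
        for j in range(e):
            t = ts[j] + a_i * bw[j] + add_p * pw[j] + carry
            ts[j], carry = t & mask, t >> w
            if j > 0:
                # 1-bit right shift: LSB of word j becomes MSB of word j-1.
                ts[j - 1] = (ts[j - 1] >> 1) | ((ts[j] & 1) << (w - 1))
        ts[e - 1] = (ts[e - 1] >> 1) | (carry << (w - 1))
    result = sum(word << (w * j) for j, word in enumerate(ts))
    return result - p if result >= p else result


if __name__ == "__main__":
    p, m = 239, 8
    r_inv = pow(pow(2, m, p), p - 2, p)
    print(all(mont_mul_word_serial(200, 77, p, m, w) == (200 * 77 * r_inv) % p
              for w in (1, 3, 4, 8)))
```

Word j-1 can be finalised as soon as the least significant bit of word j is known, which is the dependency that lets the next bit of A start before the current one has finished.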

Word-level Montgomery multiplication algorithm over GF(2^m) based on multi-precision CSA

[Algorithm 4] of FIG. 4 shows the word-level multiplication algorithm over GF(2^m). Because operations over GF(2^m) generate no carries, the internal additions can be replaced by bitwise XOR operations. Because an irreducible polynomial over GF(2^m) has (m+1) bits, the index i runs from 0 to m.

Concurrency of the Montgomery multiplication algorithm

Word-level Montgomery multiplier based on multi-precision CSA

In the word-level Montgomery multiplication algorithm, iteration (i+1) can start as soon as steps j=0 and j=1 of iteration i have finished. The data dependency graph for this computation is shown in FIG. 5a. Tasks A and B both perform the following operations: 1) addition of the corresponding words of TS, a_i·B, and p (whether p is added is determined by the least significant bit of TS), and 2) a 1-bit right shift of the TS word. The shifted TS^(j-1) can be generated only after the least significant bit of TS^(j) has been computed (A. Tenca and Ç.K. Koç, “A Scalable Architecture for Modular Multiplication Based on Montgomery's Algorithm,” IEEE Trans. on Computers, Vol. 52, No. 9, pp. 1215-1221, 2003). In addition to these two operations, task A stores the least significant bit of TS taken from the sum a_i·B^(0) + TS^(0) (step 7 of [Algorithm 3] in FIG. 3). The stored bit is used to decide whether p is added for the following words of the same operand. The data dependency graph has (e+1) tasks per column. The tasks of each column are computed in a separate processing element (PE), and the data produced by one PE is passed to the next PE of the pipeline. Each task takes one clock cycle. The structure according to the present invention needs t = ⌈(m+n)/k⌉ kernel cycles for a Montgomery multiplication over GF(p), where k is the number of PEs in the pipeline. A Montgomery multiplication over GF(2^m) needs t = ⌈m/k⌉ kernel cycles.

FIG. 5b shows a 4-bit multiplication using two PEs, with w=1 and n=2. In this case the Montgomery multiplication requires 3 kernel cycles. The gray boxes indicate the additional iterations needed to output the result in non-redundant form. FIG. 5c shows the same computation as FIG. 5b using three PEs. In this case the multiplication completes in 2 kernel cycles, including the additional iterations. The total execution time of the proposed Montgomery multiplication is given by Equation 4 below.

The first expression applies when there are more PEs than words, in which case the result is output in 1 kernel cycle. The second expression applies when the pipeline has fewer PEs than there are words, in which case 2 or more kernel cycles are required to output the result.
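The kernel-cycle counts quoted for FIG. 5b and FIG. 5c follow directly from the formulas t = ⌈(m+n)/k⌉ for GF(p) and t = ⌈m/k⌉ for GF(2^m). A small helper makes the arithmetic explicit; the function name `kernel_cycles` is illustrative, and the helper covers only the kernel-cycle count, not the total execution time of Equation 4, which is not reproduced in the text.

```python
def kernel_cycles(m, k, n=0):
    """Kernel cycles for one Montgomery multiplication.

    m: operand size in bits, k: number of PEs in the pipeline,
    n: extra correction iterations (n CPAs per CSA over GF(p);
       leave n=0 for GF(2^m), where no correction is needed).
    """
    return -(-(m + n) // k)      # ceil((m + n) / k) in integer arithmetic


if __name__ == "__main__":
    print(kernel_cycles(4, 2, n=2))   # FIG. 5b: two PEs   -> 3
    print(kernel_cycles(4, 3, n=2))   # FIG. 5c: three PEs -> 2
```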

Word-level adder based on multi-precision CSA

The structure according to the present invention reuses the multiplier circuit to perform word-level addition. Because addition over GF(p) produces its result in CS form, additional operations are needed to convert it to non-redundant form. FIG. 6a shows the data dependency graph for word-level addition. Each column has e = ⌈m/w⌉ tasks and is computed by a separate PE. The first PE takes the operands A and B and outputs their sum in CS form. The following PEs perform additional additions to produce the result in non-redundant form; in these additions a_i, B, and p are all set to 0.

FIG. 6b shows a 6-bit addition using four PEs, with w=1 and n=2. The gray boxes indicate the tasks of the addition, and the result of the addition is obtained in non-redundant form from the second PE. The proposed word-level adder outputs the result of an addition over GF(p) or GF(2^m) after (e+n-1) clock cycles.

Scalable Montgomery multiplier

FIG. 7 shows the scalable structure organized as a pipeline. The pipeline is called the kernel and consists of k PEs. Each PE in the pipeline passes the words it receives on to the next PE. All paths are w bits wide except the 1-bit input a_i and the n-bit internal carry C^(j). The data B, p, and S are processed by the kernel word by word. The gray boxes represent registers.

Word-level multiplication and addition operations

The structure according to the present invention performs word-level Montgomery multiplication and also performs word-level addition without an additional adder. Multiplexers and a control signal are added for this purpose. Multiplication or addition is selected by the control signal (Add/Mult SEL). If Add/Mult SEL is 0, addition is performed and the result is obtained at the n-th PE. If Add/Mult SEL is 1, the multiplication result is obtained from the queue. Each PE takes its bit of operand A (a_i, ..., a_{i+k-1}) from the m-bit shift register, which shifts by k bits. To make the system flexible, a queue is used to store S and C. The maximum length of the queue depends on the maximum number of words stored in memory and on the number of pipeline stages, and is calculated as in Equation 5 below.

Processing element structure

FIG. 8 shows a block diagram of the PE. The data path receives the words S^(j), C^(j), B^(j), and p^(j) from the previous pipeline stage and computes the new words S^(j-1) and C^(j-1). The inputs B and p are delayed so that B^(j-1), p^(j-1), S^(j-1), and C^(j-1) are output together. That is, while one PE is working on word j, the next PE is working on word (j-2). The data path outputs the result of either addition or multiplication (addition: A, multiplication: B). To perform both multiplication and addition, the structure adds a (w+n)-bit multiplexer, a w-bit multiplexer, and a control signal to n PEs.

The data path consists of two CSAs. When the word size is w bits, each CSA consists of n = [w/b] CPAs, where b is the number of 1-bit DFAs that make up one CPA.
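The slicing described above can be illustrated with a small Python sketch. It is one plausible reading of the text, not the patented circuit: the function names `mp_csa_add` and `value`, the list of per-slice carry inputs, and the use of a ceiling for n = [w/b] are assumptions made for illustration. Each b-bit slice is added by its own CPA, and the carry out of each slice is saved rather than propagated into the next slice. Only the carry-propagating (integer) case is shown.

```python
def mp_csa_add(x, y, carries_in, w, b):
    """Add two w-bit words slice by slice (hypothetical model of one CSA).

    Each b-bit slice plays the role of one CPA. The carry out of a slice is
    saved in a list instead of rippling into the next slice.
    """
    n = -(-w // b)                 # number of CPAs; ceiling of w/b is assumed
    mask = (1 << b) - 1
    total = 0
    carries_out = []
    for i in range(n):
        xi = (x >> (i * b)) & mask
        yi = (y >> (i * b)) & mask
        t = xi + yi + carries_in[i]        # b-bit carry-propagate addition
        total |= (t & mask) << (i * b)     # low b bits go to the sum word
        carries_out.append(t >> b)         # saved carry of this slice (0 or 1)
    return total, carries_out


def value(total, carries_out, b):
    """Integer represented by a (sum word, saved carries) pair."""
    return total + sum(c << ((i + 1) * b) for i, c in enumerate(carries_out))


# Example with w = 4 and b = 2, i.e. two 2-bit CPAs
s, c = mp_csa_add(0b1011, 0b0110, [0, 0], w=4, b=2)
assert value(s, c, 2) == 0b1011 + 0b0110
```

In this model the intermediate result occupies w sum bits plus n saved carry bits, which matches the (w+n)-bit storage the text attributes to the proposed data path.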

FIG. 9 shows the data path for w = 4 and b = 2. Each 4-bit CSA is divided into two 2-bit CPAs, and each CPA consists of two 1-bit DFAs. A DFA performs both addition on GF(p), which propagates a carry, and addition on GF(2^m), which does not. FSEL selects between the finite fields GF(p) and GF(2^m): when FSEL is 1 the DFA performs addition on GF(p), and when FSEL is 0 it performs addition on GF(2^m). The data path uses the least significant bit of S⁽⁰⁾ + a_i B⁽⁰⁾ as local control; this bit is used to generate the control signal that determines whether p is added.
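The role of this local control bit can be sketched at the bit level in Python. This is a textbook radix-2 Montgomery loop, not the word-level pipelined algorithm of the invention; the names `field_add`, `montgomery_mul`, and `poly_mulmod` and the `steps` parameter are illustrative assumptions. With FSEL = 1 additions are integer additions for GF(p); with FSEL = 0 they are XORs for GF(2^m). In both cases the least significant bit of S + a_i B decides whether p is added before the right shift.

```python
def field_add(x, y, fsel):
    """Word-level view of the DFA: integer add (fsel=1) or XOR (fsel=0)."""
    return x + y if fsel else x ^ y


def montgomery_mul(a, b, p, steps, fsel):
    """Return a*b*2^-steps mod p (fsel=1) or a*b*x^-steps mod p(x) (fsel=0).

    Assumes a and b are already reduced and p is odd (constant term 1).
    """
    s = 0
    for i in range(steps):
        if (a >> i) & 1:
            s = field_add(s, b, fsel)      # S + a_i * B
        if s & 1:                          # local control: LSB decides
            s = field_add(s, p, fsel)      # add p so the sum becomes even
        s >>= 1                            # exact division by 2 (or by x)
    if fsel and s >= p:                    # final correction, GF(p) only
        s -= p
    return s


def poly_mulmod(a, b, p):
    """Carry-less multiply modulo p(x); used here only as a reference check."""
    deg = p.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> deg) & 1:
            a ^= p
    return r


# GF(p): p = 13, four iterations, so the result is a*b*2^-4 mod 13
r = montgomery_mul(7, 9, 13, steps=4, fsel=1)
assert (r * 2**4) % 13 == (7 * 9) % 13

# GF(2^4) with p(x) = x^4 + x + 1 and five iterations (i = 0..m)
p2 = 0b10011
r2 = montgomery_mul(0b1011, 0b0111, p2, steps=5, fsel=0)
x5 = 0b0110                                # x^5 mod p(x)
assert poly_mulmod(r2, x5, p2) == poly_mulmod(0b1011, 0b0111, p2)
```

The number of iterations is left to the caller because the text uses different loop ranges for the two fields.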

The data path has an (m+n)-bit register to store the m-bit sum and the n-bit carry. It also outputs either the output of CSA₁ (A), used for the addition operation, or the result of the multiplication operation (B).

The present invention proposes a scalable dual-field Montgomery multiplier using a new multi-precision carry save adder (MP-CSA). The structure outputs the complete result of a Montgomery multiplication after t kernel cycles. Table 1 compares the performance of the proposed structure with that of the previously proposed structure. The prior-art data path has 2w bits of flip-flops (FFs) to store intermediate results, whereas the proposed structure has (w+n) bits. Since n is generally smaller than w, the proposed structure needs fewer FFs. In addition, it requires no additional circuit for converting the carry-save (CS) result into non-redundant form, and it performs both multiplication and addition. Table 2 compares kernel cycles, clock cycles, and FF counts against the prior-art structure for various choices of w, k, and m. Table 2 assumes n = 4 and uses the five standard GF(p) field sizes, m ∈ {192, 224, 256, 384, 521}, recommended for ECC by NIST (NIST, Recommended elliptic curves for federal government use, May 1999. http://csrc.nist.gov/encryption). As Table 2 shows, when the two structures have the same number of kernel cycles, each multiplication takes the same number of clock cycles. As w or k increases, the difference in FF count between the prior art and the proposed multiplier grows. Unlike multiplication on GF(p), multiplication on GF(2^m) has the same time and hardware complexity in the two structures.
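As a quick arithmetic check of the flip-flop comparison, the register widths of a single data path can be tabulated. The sketch below is illustrative only: the word sizes are arbitrary examples (the values of Table 2 are not reproduced here), n = 4 follows the stated assumption, and the function names are hypothetical.

```python
def ff_bits_prior_art(w):
    """Register width of the prior-art data path: 2w bits."""
    return 2 * w


def ff_bits_proposed(w, n):
    """Register width of the proposed data path: w sum bits + n carry bits."""
    return w + n


n = 4
for w in (8, 16, 32):      # example word sizes, chosen for illustration
    prior = ff_bits_prior_art(w)
    proposed = ff_bits_proposed(w, n)
    print(f"w={w}: prior art {prior} bits, proposed {proposed} bits, "
          f"difference {prior - proposed}")
```

The difference is w - n per data path, so it grows with the word size, in line with the trend described for Table 2.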

The proposed structure has the following three advantages. 1) It has fewer FFs than the previously proposed structure. 2) It needs no additional circuit to output a result in non-redundant form, and it reduces the number of clock cycles required for the final correction. 3) It performs addition by reusing the multiplier circuit. The proposed scalable dual-field Montgomery multiplier based on the multi-precision CSA is therefore suitable for the multiplication and addition operations of cryptographic devices, and is particularly well suited to low-area applications such as smart cards and portable devices.

FIG. 1 shows the Montgomery multiplication algorithm on the finite field GF(p).

FIG. 2 shows the Montgomery multiplication algorithm on the finite field GF(2^m).

FIG. 3 shows the word-level Montgomery multiplication algorithm on GF(p) based on the multi-precision carry save adder according to the present invention.

FIG. 4 shows the word-level Montgomery multiplication algorithm on GF(2^m) based on the multi-precision carry save adder according to the present invention.

FIG. 5A shows the data dependency graph for the scalable Montgomery multiplication algorithm.

FIG. 5B schematically illustrates a 4-bit multiplication (w = 1, n = 2) using two pipelined processing elements (PEs).

FIG. 5C schematically illustrates the same computation as FIG. 5B using three processing elements.

FIG. 6A shows the data dependency graph for the word-level addition operation.

FIG. 6B schematically illustrates a 6-bit addition (w = 1, n = 2) using four pipelined processing elements.

FIG. 7 shows the scalable dual-field Montgomery multiplier according to the present invention, which uses a multi-precision carry save adder in a pipeline structure for multiplication and addition.

FIG. 8 shows a block diagram of the processing element of FIG. 7.

FIG. 9 shows the data path of the processing element of FIG. 7 (w = 4, b = 2, n = 2).

Claims

A scalable Montgomery multiplier on a dual field using a multi-precision carry save adder, comprising:

a plurality of processing elements (PEs) arranged in a pipeline structure; and

an m-bit right shift register coupled to each of the plurality of processing elements,

wherein multiplication or addition is selected by a control signal (Add/Mult SEL): when Add/Mult SEL is 0, addition is performed and the result is obtained from the n-th PE, and when Add/Mult SEL is 1, the multiplication result is obtained; each PE takes its bits (a_i, ..., a_{i+k-1}) of operand A from the m-bit shift register, which shifts by k bits, where k is the number of processing elements; a queue for storing S and C is used to keep the system flexible; and the maximum length of the queue depends on the maximum number of words stored in memory and on the number of pipeline stages, and is calculated as shown in Equation 5 below.

[Equation 5]

The multiplier of claim 1,

wherein the multi-precision carry save adder (MP-CSA) consists of n = [w/b] CPAs when the word size is w bits,

b being the number of 1-bit dual field adders (DFAs) constituting one CPA;

in the data path with w = 4 and b = 2,

each 4-bit CSA is divided into two 2-bit CPAs, and each CPA consists of two 1-bit DFAs;

a DFA performs both addition on GF(p), with carry, and addition on GF(2^m), without carry;

FSEL selects between the finite fields GF(p) and GF(2^m), such that when FSEL is 1 the DFA performs addition on GF(p) and when FSEL is 0 it performs addition on GF(2^m);

the data path uses the least significant bit of S⁽⁰⁾ + a_i B⁽⁰⁾ as local control, this bit being used to generate the control signal that determines whether p is added; and

the data path has an (m+n)-bit register to store the m-bit sum and the n-bit carry, and outputs either the output of CSA₁ for the addition operation (A) or the result of the multiplication operation (B).

The multiplier of claim 1 or 2,

wherein the Montgomery multiplier,

given two operands B (multiplicand) and A (multiplier) and a modulus p,

computes ABR^-1 mod p with the word-level Montgomery multiplication algorithm on GF(p), in which B is read word by word and A is read bit by bit, by performing [Algorithm 3] shown in the drawings; p is divided according to the b-bit CPAs, and the vectors A, B, and p used in the multiplication are represented by Equation 3 below; and

the word-level Montgomery multiplication algorithm on GF(2^m) performs [Algorithm 4] of FIG. 4, in which, because operations on GF(2^m) generate no carry, the internal addition is replaced by a bitwise XOR, and the index i runs from 0 to m because the irreducible polynomial over GF(2^m) has (m+1) bits.

[Equation 3]