KR20220049212A

KR20220049212A - Word-parallel calculation method for modular arithmetic

Info

Publication number: KR20220049212A
Application number: KR1020200132562A
Authority: KR
Inventors: 신경욱; 최준백
Original assignee: 금오공과대학교 산학협력단
Priority date: 2020-10-14
Filing date: 2020-10-14
Publication date: 2022-04-21
Also published as: KR102496446B1

Abstract

The present invention relates to a word-parallel computation method for modular arithmetic. The method comprises dividing the multiplier and the multiplicand of a modular multiplication into words of a plurality of bits, and performing the modular multiplication on a plurality of the divided words in parallel. In hardware implementations of public-key cryptography, this makes it easy to design a hardware structure optimized for the performance required by an application field.

Description

WORD-PARALLEL CALCULATION METHOD FOR MODULAR ARITHMETIC

The present invention relates to a word-parallel computation method for modular arithmetic that makes it easy to design a hardware structure optimized for the performance required by an application field when public-key cryptography is implemented in hardware.

As is well known, representative public-key cryptosystems include elliptic curve cryptography (ECC) and RSA (Rivest, Shamir, Adleman), and Edwards curves have recently been attracting attention as well.

RSA bases its security on the difficulty of factoring an integer that is the product of very large prime numbers. Because it supports digital signatures, it is used for security in a wide range of fields.

ECC achieves comparable security with shorter keys than RSA, and has therefore been adopted as a public-key scheme in many international standards (ISO, ANSI, NIST, SECG).

ECC and Edwards-curve public-key operations are computed by repeated point addition and point doubling. Point addition and point doubling in turn require modular addition, modular subtraction, and modular multiplication over a finite field, and modular inversion is also indispensable in public-key cryptographic computation.

Modular addition and subtraction are carried out by performing a binary addition or subtraction, comparing the result with the modulus, and applying a reduction if necessary to obtain the final result. Modular multiplication can be computed as a binary multiplication followed by a division for the modular reduction, although modular multiplication methods that avoid division can also be used.
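The add-compare-reduce pattern described above can be sketched as follows. This is a minimal illustration, assuming both operands are already reduced (0 <= a, b < n); the function names `mod_add` and `mod_sub` are hypothetical and do not come from the source.

```python
def mod_add(a: int, b: int, n: int) -> int:
    """Binary addition, then compare with the modulus and reduce if needed."""
    s = a + b
    return s - n if s >= n else s


def mod_sub(a: int, b: int, n: int) -> int:
    """Binary subtraction, then add the modulus back if the result is negative."""
    d = a - b
    return d + n if d < 0 else d
```

Because the operands are assumed to be reduced, a single conditional correction is enough in both cases.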

Various methods can be used for modular multiplication over a prime field, including radix multiplication, interleaved modular multiplication, shift-and-add multiplication, and Montgomery multiplication.

Representative ways of implementing modular inversion are the method based on the greatest common divisor (GCD) and the method based on Fermat's little theorem. By Fermat's little theorem, for an integer $a$ and a prime $p$ the relation

$$a^{p-1} \equiv 1 \pmod{p}$$

holds, so dividing both sides by $a$ gives

$$a^{-1} \equiv a^{p-2} \pmod{p}$$

from which the modular inverse can be computed.

This method obtains the modular inverse $a^{-1}$ of $a$ through exponentiation of the integer $a$, but it has the drawback of requiring a very large amount of computation.
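As a small illustration of the Fermat-based approach, the sketch below computes the inverse as a modular exponentiation with exponent p - 2. The function name `inverse_fermat` is hypothetical; the code assumes p is prime and a is not a multiple of p.

```python
def inverse_fermat(a: int, p: int) -> int:
    """Modular inverse of a modulo a prime p, via a^(p-2) mod p."""
    if a % p == 0:
        raise ValueError("a has no inverse modulo p")
    # Python's three-argument pow performs modular exponentiation.
    return pow(a, p - 2, p)
```

The exponent has about as many bits as p, so the exponentiation costs on the order of that many modular squarings and multiplications, which is the computational burden the text refers to.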

Another approach computes the modular inverse with the Euclidean algorithm, and various modified forms can be used to reduce the number of cycles and the amount of computation.
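For comparison, a textbook extended Euclidean inversion might look like the sketch below. It is not one of the modified variants the text alludes to, and the name `inverse_euclid` is hypothetical.

```python
def inverse_euclid(a: int, n: int) -> int:
    """Modular inverse of a modulo n using the extended Euclidean algorithm."""
    r0, r1 = n, a % n      # remainders
    t0, t1 = 0, 1          # coefficients of a, so that t * a == r (mod n)
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    if r0 != 1:
        raise ValueError("a is not invertible modulo n")
    return t0 % n
```

Unlike the exponentiation-based method, this version relies on division; hardware-oriented variants typically replace the division with shifts and subtractions.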

The modular operations described above are performed within a finite field consisting of a finite number of elements, so the result of an operation between elements of the field is again an element of the field. Modular operations include modular addition and subtraction, modular multiplication, modular inversion, and modular division, and can make use of the properties of modular congruence.

With the recent spread of wireless communication, public-key cryptography is becoming more important, and its applications are expanding to key-exchange protocols, wireless communication security standards, security for drones and autonomous vehicles, blockchain, and so on. Public-key algorithms and protocols can be implemented in software or in hardware. A software implementation has difficulty meeting the requirements of a security system, such as security, processing speed, and power consumption. A hardware implementation offers strong security and can be optimized for the processing speed, area, and power consumption that the security system requires.

In addition, the performance requirements for public-key cryptography (processing speed, hardware complexity, power consumption, and so on) differ from one application to another, which leads to the inconvenience of redesigning the hardware for each application's requirements. For example, small area and low power consumption matter most for Internet of Things (IoT) security, where processing speed is not critical, whereas high-speed processing is the key factor in applications such as autonomous vehicles and blockchain.

Among the finite-field arithmetic circuits that are essential for hardware implementations of public-key cryptography, modular multiplication and modular inversion over a prime field involve complex procedures and a large amount of computation, and therefore strongly affect processing speed, hardware complexity (gate count), and power consumption. Techniques for computing them, and hardware structures that implement those techniques, therefore need to be developed.

1. Korean Patent Laid-Open Publication No. 10-2003-0033580 (published May 1, 2003)

An object of the present invention is to provide a word-parallel computation method for modular arithmetic that makes it easy to design a hardware structure optimized for the performance required by an application field when public-key cryptography is implemented in hardware.

A further object of the present invention is to provide a word-parallel computation method for modular arithmetic which, in hardware implementations of public-key cryptography such as ECC and RSA, divides the operand data into words of a fixed size for modular multiplication and modular inversion over a prime field and processes several words simultaneously in parallel, so that the processing speed, hardware complexity, and power consumption can be matched to the performance requirements of the application.

The objects of the embodiments of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention pertains.

According to an embodiment of the present invention, there may be provided a word-parallel computation method for modular arithmetic comprising: dividing the multiplier and the multiplicand of a modular multiplication into words of a plurality of bits; and performing the modular multiplication on a plurality of the divided words in parallel.

Also, according to an embodiment of the present invention, there may be provided a word-parallel computation method for modular arithmetic in which the step of performing the modular multiplication applies Montgomery modular multiplication, using an array block of a plurality of processing elements (PEs), a multi-bit binary multiplier, a multi-bit adder, a register file block, a controller block, a plurality of selectors, and a plurality of distributors.

Also, according to an embodiment of the present invention, there may be provided a word-parallel computation method for modular arithmetic in which the step of performing the modular multiplication performs the multiplication over a finite field by Montgomery modular multiplication, receiving a multi-bit modulus, multi-bit multiplier data, and multiplicand data, and outputting a multi-bit multiplication result.

Also, according to an embodiment of the present invention, there may be provided a word-parallel computation method for modular arithmetic in which the step of performing the modular multiplication generates partial products by word-wise multiplication starting from the least significant word, and performs the multiplication by adding the generated partial products.

Also, according to an embodiment of the present invention, there may be provided a word-parallel computation method for modular arithmetic in which the step of performing the modular multiplication uses a value computed from the least significant word of the modulus in the reduction step of the Montgomery modular multiplication, where the congruence property is used to make the least significant word of the partial-product sum zero so that this word can be removed without loss of data.

Also, according to an embodiment of the present invention, there may be provided a word-parallel computation method for modular arithmetic comprising: dividing a multi-bit integer and a multi-bit modulus into words of a multi-bit size; and performing a modular inversion on a plurality of the divided words in parallel using a processing-element array.

Also, according to an embodiment of the present invention, there may be provided a word-parallel computation method for modular arithmetic in which the step of performing the modular inversion applies Montgomery modular inversion, using an array block of a plurality of processing elements (PEs), a register file block, a counter block, a controller block, and a plurality of selectors.

Also, according to an embodiment of the present invention, there may be provided a word-parallel computation method for modular arithmetic in which the step of performing the modular inversion receives the multi-bit integer and modulus and outputs the multi-bit modular-inversion result as a value in a pseudo-Montgomery domain, the value in the pseudo-Montgomery domain is corrected and converted into a value in the Montgomery domain, and the Montgomery inversion is performed through a plurality of iterative loops.

Also, according to an embodiment of the present invention, there may be provided a word-parallel computation method for modular arithmetic in which the step of performing the modular inversion shifts several bits at a time to reduce the number of loop iterations, and thereby the number of cycles required for the Montgomery inversion.

Also, according to an embodiment of the present invention, there may be provided a word-parallel computation method for modular arithmetic in which the step of performing the modular inversion is computed by a while loop, and the operating mode is determined by the current values of the integer and the modulus, so that the data for the addition, subtraction, and shift operations are selected and the operations are carried out.

Also, according to an embodiment of the present invention, the modular multiplication and the modular inversion using the word-parallel computation method may be performed independently of each other.

The present invention makes it easy to design a hardware structure optimized for the performance required by an application field when public-key cryptography is implemented in hardware.

In addition, in hardware implementations of public-key cryptography such as ECC and RSA, the present invention divides the operand data into words of a fixed size for modular multiplication and modular inversion over a prime field and processes several words simultaneously in parallel, so that the processing speed, hardware complexity, and power consumption can be matched to the performance requirements of the application.

FIG. 1 is pseudocode showing a process of performing a modular multiplication by the word-parallel computation method according to an embodiment of the present invention;
FIGS. 2 and 3 are diagrams illustrating a process of performing a modular multiplication by the word-parallel computation method according to an embodiment of the present invention;
FIGS. 4 and 5 are diagrams illustrating a modular multiplier among the modular arithmetic units according to an embodiment of the present invention;
FIG. 6 is pseudocode showing a process of computing a modular inverse by the word-parallel computation method according to an embodiment of the present invention; and
FIGS. 7 and 8 are diagrams illustrating a modular inverter among the modular arithmetic units according to an embodiment of the present invention.

The advantages and features of the embodiments of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. The terms used below are defined in view of their functions in the embodiments of the present invention and may vary according to the intention or practice of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is pseudocode showing a process of performing a modular multiplication by the word-parallel computation method according to an embodiment of the present invention, FIGS. 2 and 3 are diagrams illustrating that process, and FIGS. 4 and 5 are diagrams illustrating a modular multiplier among the modular arithmetic units according to an embodiment of the present invention.

Referring to FIG. 1, the process of performing a modular multiplication by the word-parallel computation method according to an embodiment of the present invention may include dividing the multiplier and the multiplicand of the modular multiplication into words of a plurality of bits, and performing the modular multiplication on a plurality of the divided words in parallel.

In the step of performing the modular multiplication, Montgomery modular multiplication may be applied, for example. The implementation includes an array block of a plurality of processing elements (PEs), a multi-bit binary multiplier, a multi-bit adder, a register file block, a controller block, a plurality of selectors, and a plurality of distributors, and the multiplier data, the multiplicand data, and the modulus may each be divided into a plurality of words of a multi-bit size.

In the step of performing the modular multiplication, the multiplication may be performed over a finite field by Montgomery modular multiplication, which receives a multi-bit modulus, multi-bit multiplier data, and multiplicand data and outputs a multi-bit multiplication result.

In the step of performing the modular multiplication, partial products may be generated by word-wise multiplication starting from the least significant word, and the multiplication may be performed by adding the generated partial products.

In the step of performing the modular multiplication, a value computed from the least significant word of the modulus may be used in the reduction step of the Montgomery modular multiplication, where the congruence property is used to make the least significant word of the partial-product sum zero so that this word can be removed without loss of data.

For example, the pseudocode shown in FIG. 1 represents a modular multiplication in which the multiplier A and the multiplicand B are divided into w-bit words and a plurality of words are processed in parallel. Montgomery modular multiplication performs the multiplication over a finite field: it receives an L-bit modulus N, L-bit multiplier data A, and L-bit multiplicand data B, and outputs the L-bit multiplication result $S = A \cdot B \cdot R^{-1} \bmod N$ (where $R = 2^{L}$).

The input data for the modular multiplication (A, B, N) are each divided into m words of w bits (for example, w = 32) for processing. Partial products are generated by word-wise multiplication starting from the least significant word, and the multiplication is carried out by adding the generated partial products.
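The word decomposition can be illustrated with a short sketch. The helper names `to_words` and `from_words` are hypothetical; words are listed least significant first, matching the order in which the text says they are processed.

```python
def to_words(x: int, w: int, m: int) -> list:
    """Split x into m words of w bits each, least significant word first."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(m)]


def from_words(words: list, w: int) -> int:
    """Reassemble an integer from w-bit words (least significant first)."""
    return sum(word << (w * i) for i, word in enumerate(words))
```

With L = 192 and w = 32, each of A, B, and N would be represented by m = 6 such words.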

In this pseudocode, a constant is computed from the least significant word of the modulus N. This constant is used in the reduction step of the Montgomery modular multiplication: by the congruence property, it makes the least significant word of the partial-product sum zero, so that this word can be removed without loss of data.
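The source's formula for this constant is not reproduced in the text, so the sketch below assumes the standard Montgomery choice, the negated inverse of the least significant modulus word modulo 2^w. The names `neg_inv_word` and `cancel_low_word` are hypothetical, and `pow` with a negative exponent requires Python 3.8 or later.

```python
def neg_inv_word(n0: int, w: int) -> int:
    """Assumed constant: -n0^(-1) mod 2^w, defined for odd n0."""
    return (-pow(n0, -1, 1 << w)) % (1 << w)


def cancel_low_word(s: int, n: int, w: int) -> int:
    """Add a multiple of n that zeroes the low word of s, then drop that word."""
    mask = (1 << w) - 1
    n_prime = neg_inv_word(n & mask, w)
    q = ((s & mask) * n_prime) & mask   # depends only on the low word of s
    t = s + q * n                       # congruent to s modulo n
    assert t & mask == 0                # low word is zero by construction
    return t >> w                       # exact division by 2^w, no bits lost
```

Since t equals s plus a multiple of n, the returned value times 2^w is congruent to s modulo n, which is why dropping the word loses no information.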

Because the Montgomery modular multiplication method includes a modular reduction that uses the congruence property to remove the least significant word during partial-product accumulation, the multiplication result includes a factor of $2^{-wm}$.

The i-loop represents dividing the multiplier data A into m words and iterating m times. The j-loop represents dividing the multiplicand data B into m words and processing them iteratively; it is repeated a number of times equal to m divided by the number of processing elements (PEs) that make up the arithmetic circuit. The concurrent k-loop inside the j-loop represents the PEs processing, in parallel, as many multiplicand words as there are PEs.

The computation according to the pseudocode of FIG. 1 is described in detail below. First, steps 1 to 5 generate in advance the low-order word data of the partial products, which are needed to produce the q data used in the reduction step of every i-loop iteration. The PEs process as many words in parallel as there are PEs, each PE generates a partial-product word by a word multiplication, and the m words are processed in a number of iterations equal to m divided by the number of PEs.

Second, steps 7 to 9 initialize the carry data at the start of every i-loop iteration and generate the value q required by the modular congruence for that iteration's reduction. The generated q is used later to produce the least significant word of the partial-product sum; by the congruence property this word is always 0, so no data is lost when the least significant word is removed.

Third, in steps 10 and 11 the j-loop is repeated a number of times equal to m divided by the number of PEs, and the carry values (C_add1, C_add2) saved in the previous j-loop iteration are loaded. When j = 0, the data initialized in step 7 are used.

Fourth, in steps 12 to 19 the PE array processes, in parallel, as many multiplicand words as there are PEs; the processing consists of multiplications and additions. In steps 13 to 15, a partial product is generated by multiplying a multiplier word by a multiplicand word. The high-order word of the generated partial product is used in the partial-product addition for the (k+1)-th word, while the low-order word is added to the high-order word of the partial product generated for the (k-1)-th word and to the s word of the partial-product sum produced in the previous i-loop iteration.
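A single multiply-and-add step of this kind can be sketched as below. This is only an illustration of how a 2w-bit product splits into words and how carries arise; the function name `pe_step` and its exact signature are hypothetical and are not taken from FIG. 1.

```python
def pe_step(a_word: int, b_word: int, hi_prev: int, s_word: int, w: int):
    """One word multiplication followed by two additions.

    Returns (new_s_word, carry_out, hi), where hi is the high-order word of
    the product, to be consumed by the step handling the next word.
    """
    mask = (1 << w) - 1
    prod = a_word * b_word               # 2w-bit partial product
    lo, hi = prod & mask, prod >> w      # low-order and high-order words
    total = lo + hi_prev + s_word        # add previous high word and s word
    return total & mask, total >> w, hi
```

The step preserves the value `a_word * b_word + hi_prev + s_word`, which equals `new_s_word + ((carry_out + hi) << w)`.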

When one iteration of the i-loop is complete, the partial-product generation by multiplying one multiplier word by the m multiplicand words, together with the partial-product additions, has produced a partial-product sum of m+1 words.

Steps 16 to 18 perform the reduction on the (m+1)-word partial-product sum computed in steps 13 to 15: among the m+1 generated words, the least significant word is made zero so that it can be removed. Like steps 13 to 15, this stage has a structure of one multiplication and two additions, and the reduction is carried out using the q data generated in step 9 and the words n of the modulus.

Fifth, steps 20 to 24 save the carry values generated in the concurrent k-loop, in which the multiplicand words are processed in parallel by the PEs, for use in step 11 of the next j-loop iteration. In the last j-loop iteration, the carry value of the PE that processed the last word is saved for use in the later addition that generates the most significant word.

Sixth, step 26 is the last operation of the i-loop: once the j-loop has finished, the carry values generated in steps 12 to 19 are added to the carry value from the previous i-loop iteration to generate the most significant word and a carry value.

Seventh, in steps 28 to 30, once the m iterations of the i-loop are complete, the multiplication result S is compared with the modulus N, and if S is greater than or equal to N the final reduction is performed by subtracting N from S.

Through the process described above, the modular multiplication result of the two integers A and B is output. The reduction in steps 16 to 18 removes the least significant word of the partial-product sum in every i-loop iteration, and since m low-order words are removed over the m iterations, the output value includes a factor of $2^{-wm}$.
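The overall loop structure can be sketched in software as follows. This is not the pseudocode of FIG. 1, which is not reproduced in the text, but a word-level Montgomery multiplication arranged into i-, j-, and k-loops under stated assumptions: N is odd, 0 <= A, B < N < 2^(w*m), the number of PEs divides m, and the per-word constant is the standard -n0^(-1) mod 2^w. The names `mont_mul_words`, `num_pe`, and `n_prime` are hypothetical. The k-loop runs sequentially here, whereas the hardware would process those words concurrently.

```python
def to_words(x, w, m):
    """Split x into m words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(m)]


def mont_mul_words(A, B, N, w, m, num_pe):
    """Return A * B * 2^(-w*m) mod N, processing one word at a time."""
    if m % num_pe != 0:
        raise ValueError("this sketch assumes num_pe divides m")
    mask = (1 << w) - 1
    a, b, n = to_words(A, w, m), to_words(B, w, m), to_words(N, w, m)
    n_prime = (-pow(n[0], -1, 1 << w)) % (1 << w)  # assumed constant
    s = [0] * (m + 1)                              # running sum, m+1 words

    for i in range(m):                             # i-loop: one multiplier word
        # q makes the lowest word of s + a[i]*B + q*N equal to zero.
        q = ((s[0] + a[i] * b[0]) * n_prime) & mask
        carry = 0
        for j in range(m // num_pe):               # j-loop: groups of words
            for k in range(num_pe):                # k-loop: parallel in hardware
                idx = j * num_pe + k
                t = s[idx] + a[i] * b[idx] + q * n[idx] + carry
                if idx > 0:
                    s[idx - 1] = t & mask          # store one word lower
                carry = t >> w                     # at idx 0 the low word is 0
        t = s[m] + carry                           # top-word addition
        s[m - 1], s[m] = t & mask, t >> w

    S = sum(word << (w * i) for i, word in enumerate(s))
    return S - N if S >= N else S                  # final conditional subtraction
```

Changing `num_pe` only regroups the same word operations, so the result is identical for any divisor of m; what changes is how many j-loop iterations each i-loop needs.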

FIGS. 2 and 3 illustrate schematically the computation of the pseudocode of FIG. 1 described above. FIG. 2 shows the case of L = 192, w = 32, and m = 6 with two PEs: the i-loop is repeated six times and the j-loop three times (6 / 2 = 3), and in the concurrent k-loop inside the j-loop two words are processed in parallel by the two PEs.

Here, when j = 0, two words are generated and one word has the value 0; when j = 2, which is the last j-loop, the PEs generate two partial-product sum words, and the most significant word is generated through the addition of step 26.

FIG. 3 shows the case of L=192, w=32, m=6 with three PEs: the i-loop is repeated 6 times and the j-loop 2 times (6 words / 3 PEs), and in the concurrent k-loop inside the j-loop three words are computed in parallel by the three PEs.

Comparing FIG. 2 and FIG. 3, a total of 18 iterations (6 × 3) are performed in the case of FIG. 2, where two PEs are used, and a total of 12 iterations (6 × 2) in the case of FIG. 3, where three PEs are used.
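The iteration counts of FIGS. 2 and 3 can be reproduced with a short enumeration. The sketch below is illustrative only: the function name `loop_schedule` and the parameter `npe` are hypothetical, and rounding the j-loop count up when m is not a multiple of the PE count is an assumption, since the text only covers cases where it divides evenly.

```python
def loop_schedule(m, npe):
    """List (i, j, word indices handled by the PEs) for every j-loop pass."""
    ite = -(-m // npe)                      # j-loop count, rounded up (assumption)
    schedule = []
    for i in range(m):                      # i-loop
        for j in range(ite):                # j-loop
            words = list(range(j * npe, min((j + 1) * npe, m)))
            schedule.append((i, j, words))  # these words form one concurrent k-loop
    return schedule


print(len(loop_schedule(6, 2)))  # 18 passes with two PEs (FIG. 2)
print(len(loop_schedule(6, 3)))  # 12 passes with three PEs (FIG. 3)
```

Each entry corresponds to one pass through the PE array, so the list length is the quantity that the text relates to the computation time.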

Here, the time required for the modular multiplication is proportional to the number of iterations. As shown in FIGS. 2 and 3, both the time required for the modular multiplication and the hardware complexity vary with the number of PEs used, so a scalable modular multiplication method and a scalable hardware structure can be implemented and provided.

Referring to FIG. 4, which shows the configuration of a scalable Montgomery modular multiplier that implements the pseudocode of FIG. 1 in hardware, the multiplier may consist of a one-dimensional array of PEs, a w-bit binary multiplier (Bin_Mul), a w-bit adder (Adder), a register file block (Reg_File) that stores the S data, the q data, the carry values of intermediate results, and the like, a controller block (CNTL) that generates the control signals needed during the computation, and a number of selectors (MUX) and demultiplexers (DEMUX).

Here, the PEs that make up the PE array process as many words simultaneously in parallel as there are PEs, which corresponds to the concurrent k-loop in the pseudocode of FIG. 1 described above.

The number of PEs used determines the number of words computed in parallel, which makes it possible to adjust the number of clock cycles required for the computation (that is, the computation speed) and the circuit area.

The size of the data input/output ports (A, B, N, Out_data) depends on the number of PEs used. The binary multiplier (Bin_Mult) is used for the multiplication that generates q in step 9 of the pseudocode of FIG. 1, and selector 1 (MUX1) selects the data used in each i-loop.

The words used in the PE computation are selected from the register file block (Reg_File) by selector 1 (MUX1) and input to the PE array. The number of words selected by selector 1 (MUX1) varies with the number of PEs used, so the number of words computed simultaneously and the number of cycles consumed by the multiplication can be adjusted.

Here, the selectors M(0) to M(number of PEs - 1) choose between externally input data and data output from the register file block (Reg_File) as the input to the PEs. After the computation inside the PE array is complete, the output data is stored in the register file block (Reg_File), and of the computed carry values the most significant carry value is selected by selector 2 (MUX2) and stored, as in steps 20 to 24 of the pseudocode of FIG. 1.

In the last j-loop, the adder (Adder) computes the most significant word of step 26, which may be stored in the register file block (Reg_File).

Referring to FIG. 5, which shows the internal block diagram of a PE provided in the modular multiplier of FIG. 4 described above, the PE may consist of a w-bit binary multiplier (Bin_Mult), two w-bit adders (adder1, adder2), one selector (MUX), and a register (PP_reg).

Here, the PE performs the operations of steps 1 to 5 of the pseudocode of FIG. 1 and outputs the lower word of the partial product, and performs the operations of steps 12 to 19 and outputs the addition result of adder 2 (adder2).

The PE selects and outputs an operation result through the selector (MUX) according to the stage of the computation. Within the concurrent k-loop of the pseudocode of FIG. 1, each PE performs one multiplication and two additions in parallel with the other PEs, and for this purpose the binary multiplier and the adders inside the PE may be connected in series.

The binary multiplier performs the multiplications of steps 13 and 16 of the pseudocode of FIG. 1, adder 1 (adder1) performs the additions of steps 14 and 17, and adder 2 (adder2) performs the additions of steps 15 and 18.
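One reason a PE can be built from a single w-bit multiplier followed by two adders is that a w-bit product with two further w-bit values added never exceeds 2w bits. The sketch below illustrates that bound. The function name `pe_multiply_add` and the roles of the operands `s` and `c` are assumptions for illustration and are not taken from FIG. 5.

```python
def pe_multiply_add(a, b, s, c, w=32):
    """Return (high word, low word) of a * b + s + c for w-bit inputs."""
    mask = (1 << w) - 1
    if not all(0 <= x <= mask for x in (a, b, s, c)):
        raise ValueError("all operands must fit in w bits")
    product = a * b            # binary multiplier
    partial = product + s      # first adder
    total = partial + c        # second adder
    return total >> w, total & mask


# Worst case: every input is all ones, and the result still fits in 2w bits.
top = (1 << 32) - 1
print(pe_multiply_add(top, top, top, top) == (top, top))  # True
```

The worst case works out exactly: $(2^w-1)^2 + 2(2^w-1) = 2^{2w}-1$, so the high word never overflows.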

Here, each adder has a 1-bit carry value, and carry values are passed between adjacent PEs for use in the computation. A carry-select adder may be used to reduce the adder delay caused by the PE array, and a register (PP_reg) may be inserted between the binary multiplier and the adders to reduce the critical-path delay.

Because of the register (PP_reg) inserted inside the PE to reduce the critical-path delay, the binary multiplication and the additions cannot be performed within the same cycle and are computed over two cycles.

To minimize the additional cycles, the fact that the intermediate result is not used when PL and PH are generated can be exploited: after a multiplication has been executed, the two additions and the multiplication of the next operation are performed simultaneously. In the pseudocode of FIG. 1, for the initial operation with i=0 and j=0, only the multiplication of step 13 is performed, and for the last operation with i=m and j=ite-1, only the additions of steps 17 and 18 are performed.

In all other operations, the additions of steps 14 and 15 are computed simultaneously with the multiplication of step 16, and the additions of steps 17 and 18 are computed simultaneously with the multiplication of step 13, and these two operations alternate repeatedly.
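The overlap described above behaves like a two-stage pipeline. The following sketch, under the assumption that each multiply-add operation is treated as one abstract unit with a hypothetical index, lists which multiplication and which pair of additions share each cycle.

```python
def pipeline_schedule(num_ops):
    """Return, per cycle, (index of the multiplication, index of the additions)."""
    schedule = []
    for cycle in range(num_ops + 1):
        mul = cycle if cycle < num_ops else None   # no multiplication in the last cycle
        add = cycle - 1 if cycle >= 1 else None    # no additions in the first cycle
        schedule.append((mul, add))
    return schedule


print(pipeline_schedule(3))
# [(0, None), (1, 0), (2, 1), (None, 2)]
```

With the overlap, num_ops operations take num_ops + 1 cycles rather than 2 × num_ops, which is the saving that the register insertion would otherwise cost.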

Next, FIG. 6 is pseudocode showing a process of performing a modular inverse operation by the word-parallel operation method according to an embodiment of the present invention, and FIGS. 7 and 8 illustrate a modular inverter, one of the modular operators according to an embodiment of the present invention.

Referring to FIG. 6, the process of performing a modular inverse operation by the word-parallel operation method according to an embodiment of the present invention may include dividing a multi-bit integer and a multi-bit modulus into words of a multi-bit size, and performing the modular inverse operation on the divided words, a plurality of words at a time in parallel, using an array of processing elements.

Here, the step of performing the modular inverse operation may apply the Montgomery modular inverse method and may use an array block of a plurality of processing elements (PEs), a register file block, a counter block, a controller block, and a plurality of selectors.

The modular inverse is the multiplicative inverse of an arbitrary integer over a finite field. The Montgomery modular inverse operation receives a multi-bit integer and a modulus and outputs the multi-bit modular inverse result as a value in the almost-Montgomery domain.

In the step of performing the modular inverse operation, the value in the almost-Montgomery domain may be corrected and converted into a value in the Montgomery domain, and the Montgomery inverse operation may be performed through a plurality of loop iterations.

To reduce the number of cycles required for the Montgomery inverse operation, several bits may be shifted at a time, which reduces the number of loop iterations.

The modular inverse operation is computed by a while loop: the operation mode is determined from the current values of U and V, the data for the addition, subtraction, and shift operations is selected accordingly, and the operations are performed.

For example, the pseudocode shown in FIG. 6 represents the process of performing a Montgomery modular inverse operation by the word-parallel operation method according to an embodiment of the present invention: the L-bit integer A and the modulus N are divided into m words of w bits each (for example, 32 bits), and a plurality of words are computed in parallel by the PE array.

Here, the modular inverse is the multiplicative inverse $A^{-1} \bmod N$ of an arbitrary integer A over a finite field. The Montgomery modular inverse operation receives the L-bit integer A and the modulus N and outputs the L-bit modular inverse result as a value in the almost-Montgomery domain, $A^{-1} \cdot 2^{k} \bmod N$, which is called the 'Almost Montgomery Inverse'.

Here k takes a value in the range $L \le k \le 2L$. The value in the almost-Montgomery domain, $A^{-1} \cdot 2^{k} \bmod N$, can be converted through a correction phase into the value in the Montgomery domain, $A^{-1} \cdot 2^{L} \bmod N$. The Montgomery inverse algorithm performs its computation through k loop iterations, so the result is multiplied by $2^{k}$ and is output as a value in the almost-Montgomery domain.
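The relationship between the almost-Montgomery value, the iteration count k, and the correction phase can be sketched at bit level as follows. This is a minimal sketch of the well-known bit-serial almost Montgomery inverse, one bit per iteration, and not the word-parallel, 3-bit-scan pseudocode of FIG. 6. The function names and the halving-based correction are assumptions for illustration, and the correction assumes that L is the bit length of N so that k is at least L.

```python
def almost_montgomery_inverse(a, n):
    """Return (r, k) with r = a^(-1) * 2^k mod n, for odd n and gcd(a, n) = 1."""
    u, v, r, s, k = n, a, 0, 1, 0
    while v > 0:
        if u % 2 == 0:                 # U even: halve U, double S
            u, s = u >> 1, s << 1
        elif v % 2 == 0:               # V even: halve V, double R
            v, r = v >> 1, r << 1
        elif u > v:                    # both odd, U larger
            u, r, s = (u - v) >> 1, r + s, s << 1
        else:                          # both odd, V at least as large
            v, s, r = (v - u) >> 1, s + r, r << 1
        k += 1
    return (n - r if r < n else 2 * n - r), k


def to_montgomery_domain(r, k, n, L):
    """Correction phase: turn a^(-1) * 2^k mod n into a^(-1) * 2^L mod n (k >= L)."""
    for _ in range(k - L):
        r = r >> 1 if r % 2 == 0 else (r + n) >> 1   # halve modulo n
    return r


r, k = almost_montgomery_inverse(3, 7)
print(r, k)                               # 3 4, and 3 * 3 = 9 = 2^4 mod 7
print(to_montgomery_domain(r, k, 7, 3))   # 5, and 3 * 5 = 15 = 2^3 mod 7
```

Every loop iteration doubles one of R and S, which is how the factor $2^{k}$ accumulates in the result.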

This k depends on the value of the integer A. To reduce the number of cycles required for the Montgomery inverse operation, several bits may be shifted at a time, which reduces the number of loop iterations; when 3 bits are scanned at a time, the number of required cycles falls by about 22%, and the average number of loop iterations falls accordingly. A 3-bit scan is applied in the pseudocode of FIG. 6 described above, and other scan sizes may be used.

In the pseudocode of FIG. 6, the Montgomery inverse operation is computed by the while loop of steps 2 to 57: the operation mode is determined from the current values of U and V, the data for the addition, subtraction, and shift operations is selected, and the corresponding operation is performed.

In the pseudocode of FIG. 6, the RS operation shifts the data word x to the right by y bits, and the LS operation shifts the data word x to the left by y bits. The concurrent j-loop denotes computing as many words in parallel as there are PEs, and the number of iterations of the i-loop is determined by the number of words m and the number of PEs used.

The operation of the pseudocode is described in detail below. First, step 1 initializes the data used in the computation and receives the L-bit modulus N and the integer A as input.

Second, steps 2 to 57 form the loop that carries out the inverse operation. The work done in each iteration depends on the current data of U and V, and the iteration continues until V becomes 0. Depending on the conditional, each iteration performs one of steps 4 to 18, steps 19 to 33, steps 34 to 44, or steps 45 to 56, and the carry values of the shift operations are initialized at every iteration.

When several low-order bits of the U and V data are scanned and a consecutive shift is executed, several bits are shifted in one cycle, which reduces the average number of cycles required. The pseudocode shown in FIG. 6 represents the case where a 3-bit scan is applied.

Third, steps 4 to 18 cover the case where the current data of U is even. Steps 5 to 11 scan the lower 3 bits of U, set the SN value for the consecutive shift, and add to the k data. The SN value is determined by the number of bits shifted consecutively in one cycle, and the k value, which indicates the number of loop iterations, is determined accordingly.
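The 3-bit scan can be sketched as a count of trailing zero bits capped at the scan size. The helper names `scan_shift_count` and `even_u_step` are hypothetical, and the sketch works on whole integers rather than on words in a PE array.

```python
def scan_shift_count(x, scan_bits=3):
    """Count the trailing zero bits of x, up to scan_bits (the SN value)."""
    sn = 0
    while sn < scan_bits and (x >> sn) & 1 == 0:
        sn += 1
    return sn


def even_u_step(u, s, k, scan_bits=3):
    """One loop iteration for even U: shift U right and S left by SN, add SN to k."""
    sn = scan_shift_count(u, scan_bits)
    return u >> sn, s << sn, k + sn


print(scan_shift_count(0b0110))  # 1
print(scan_shift_count(0b1000))  # 3
print(even_u_step(12, 1, 0))     # (3, 4, 2)
```

For an even input the count is between 1 and the scan size, so one iteration can stand in for up to three single-bit iterations while k still advances by the number of bits actually shifted.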

Steps 12 to 18 consist of word-wise shift operations: the concurrent j-loop computes as many words in parallel as there are PEs in the PE array, and the i-loop is repeated as many times as needed to cover all m words. The data U and S are divided into words, each word is shifted by SN bits, and the carry value of the LS shift is used in the next loop. The carry value of the RS shift is stored at the proper position of the preceding word by step 15.
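The word-wise shifts and their carries can be modeled as follows. This is an illustrative sketch that assumes little-endian word lists and a shift amount smaller than the word size; the function names are hypothetical, and the words are processed sequentially rather than by a PE array.

```python
def to_words(value, m, w=32):
    """Split an integer into m little-endian w-bit words."""
    mask = (1 << w) - 1
    return [(value >> (w * i)) & mask for i in range(m)]


def from_words(words, w=32):
    """Reassemble an integer from little-endian w-bit words."""
    return sum(word << (w * i) for i, word in enumerate(words))


def shift_right_words(words, sn, w=32):
    """Right shift by sn bits: bits leaving a word land at the top of the word below."""
    mask = (1 << w) - 1
    out = []
    for i, x in enumerate(words):
        upper = words[i + 1] if i + 1 < len(words) else 0
        out.append(((x >> sn) | (upper << (w - sn))) & mask)
    return out


def shift_left_words(words, sn, w=32):
    """Left shift by sn bits: the carry out of each word enters the next word."""
    mask = (1 << w) - 1
    out, carry = [], 0
    for x in words:
        t = (x << sn) | carry
        out.append(t & mask)
        carry = t >> w
    return out, carry            # the final carry is kept for the next loop
```

In the right shift the carry moves toward the less significant word, which is already available, whereas in the left shift the carry moves toward a word that may only be processed in a later pass, which matches the text's distinction between storing one carry immediately and saving the other for the next loop.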

Fourth, steps 19 to 33 cover the case where the current data of U is odd and the current data of V is even. These steps are similar in form to steps 4 to 18: steps 27 to 33 have the same operation structure as steps 12 to 18 and perform the word-wise shift of the V and R data.

Fifth, steps 34 to 44 cover the case where the current data of U and V are both odd and U > V. Because an addition is involved, multiple bits are not scanned: the SN value is fixed at 1 and the k value is likewise incremented by 1. The addition and the subtraction are performed simultaneously using the two adders inside the PE.

Here, the U-V subtraction of step 38 may be computed in adder 1 (Adder1) inside the PE and the S+R addition of step 39 in adder 2 (Adder2), and the shifters perform the word-wise shift of the subtraction result and of the S data.

Sixth, steps 45 to 56 cover the case where the current data of U and V are both odd and U ≤ V. The operation is similar to that of steps 34 to 44: the V-U subtraction of step 49 and the S+R addition of step 50 are computed in adder 1 (adder1) and adder 2 (adder2) inside the PE, respectively, with the same operation structure as steps 36 to 44.

Seventh, step 58 outputs the inverse result and includes the reduction. After the loop has produced its result R, the N-R and 2N-R operations are performed simultaneously through adder 1 (Adder1) and adder 2 (Adder2) inside the PE; N-R is output if R < N, and 2N-R is output if R ≥ N.
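The output selection can be sketched as computing both candidates and choosing one. The function name `final_output` is hypothetical, and the selection condition on R is a reconstruction from the surrounding description rather than a quotation of FIG. 6.

```python
def final_output(r, n):
    """Select the reduced output from the two candidates N - R and 2N - R."""
    candidate_a = n - r            # computed by the first adder
    candidate_b = (n << 1) - r     # 2N from a left shift, then the second adder
    return candidate_a if r < n else candidate_b


print(final_output(4, 7))  # 3
print(final_output(9, 7))  # 5
```

For R strictly between 0 and 2N and not equal to N, either branch yields the value of -R reduced modulo N, so computing both candidates in parallel avoids a second pass through the adder.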

Through the process described above, the inverse result in the almost-Montgomery domain and the k value, which indicates the number of iterations of the while loop, are output. The number of cycles required for one loop iteration depends on the number of words m and the number of PEs, since as many words as there are PEs are computed in parallel in each cycle, and the average number of cycles required for the inverse operation follows from this and the average number of loop iterations. The number of required cycles and the hardware complexity can therefore be adjusted through the number of PEs used.

Referring to FIG. 7, which shows the configuration of a scalable Montgomery modular inverter that implements the pseudocode of FIG. 6 in hardware, the inverter may consist of a PE array block that performs the operations of the pseudocode shown in FIG. 6, a register file block (Reg_File), a counter block (K-counter), a controller block (CNTL) that generates control signals, a plurality of selectors, and the like.

Here, the PE array block is the circuit that computes the modular inverse. It consists of a one-dimensional array of PEs and a number of selectors (MUX), and the computation speed and hardware complexity vary with the number of PEs used. The register file block (Reg_File) stores the U, V, R, and S data and the carry values, and may consist of the U_reg, V_reg, R_reg, S_reg, and N_reg registers and the carry registers.

The controller block (CNTL) determines the SN value and the operation mode from the current values of U and V. The SN value is the number of bits shifted consecutively; because a 3-bit scan is applied in the pseudocode shown in FIG. 6, it takes a value in the range 1 to 3.

In the pseudocode shown in FIG. 6, the operation mode is determined by the conditionals of steps 4, 19, 34, and 45 and by the reduction and output process of step 58, and this in turn determines the data input to the PEs and the operation the PEs perform. The counter block (K-counter) contains an adder: at every loop iteration it adds the SN value generated by the controller block (CNTL) to the k value, and it outputs the final k value when the inverse operation is complete.

The operation mode signal and the selectors (MUX_I1 to MUX_I4) select, from the data in the register file block (Reg_File), the data needed for the operation, and the signals of the controller block (CNTL) determine which words, as many as there are PEs, are processed in each cycle. These words are input to the adders and shifters of the PE array and computed simultaneously, which corresponds to the concurrent j-loop in the pseudocode.

The carry data of the adders and shifters of each PE is passed to the adjacent PE, the carry data of the most significant PE is stored for use in the next loop, and the data computed by the PE array is stored in the registers through the selectors (MUX_O1 to MUX_O4).

The number of PEs used thus determines the number of words computed in parallel and the number of cycles required for one loop iteration, so the inverter can be implemented with the number of PEs adjusted to the performance required by its application.

Referring to FIG. 8, which shows the internal block diagram of a PE according to an embodiment of the present invention, the PE may consist of two 32-bit adders (Adder1, Adder2), one right shifter (R_Sft), one left shifter (L_Sft), and two selectors (MUX1, MUX2).

Here, the right shifter (R_Sft) and the left shifter (L_Sft) receive data and shift it by the SN value, and the selectors (MUX1, MUX2) select data and feed it to the arithmetic units according to the operation mode generated by the initial controller block (Int_CNTL) shown in FIG. 8.

Adder 1 (Adder1) of the PE performs the U-V and V-U subtractions and the N-R subtraction, and adder 2 (Adder2) performs the R+S addition and the 2N-R subtraction. When adder 1 performs the subtraction of step 38 or step 49 under the conditional of step 34 or step 45, selector 1 (MUX1) routes the result of adder 1 to the right shifter (R_Sft), which performs the shift of step 40 or step 51. When only a shift is performed, under the conditional of step 4 or step 19, selector 1 (MUX1) selects the external input data for the shift.

Step 58 of the pseudocode shown in FIG. 6 is the data output process, including the reduction, in which the PE array computes N-R and 2N-R simultaneously: adder 1 (Adder1) computes N-R, the left shifter (L_Sft) turns N into 2N, and adder 2 (Adder2) computes 2N-R. Selector 2 (MUX2) selects the output of the left shifter (L_Sft) as the input of adder 2 (Adder2) for the reduction, and selects the external input as the input of adder 2 (Adder2) while the loop is running.

Each of the two shifters inside the PE has 3 bits of carry data and each of the two adders has 1 bit of carry data, and the carry data of each PE can be passed to the adjacent PE. A carry-select adder may be used to reduce the critical-path delay caused by the one-dimensional PE array. The carry data of the right shifter (R_Sft) is stored immediately at the proper position during the computation, while the carry data of the left shifter (L_Sft) generated in the last concurrent j-loop is stored and used in the computation of the next loop.
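The idea behind a carry-select adder can be sketched behaviorally: each word adder prepares both possible results in advance and the incoming 1-bit carry only selects between them. The function names below are hypothetical, and this is a software model of the selection principle, not a gate-level description of the adders in FIG. 8.

```python
def carry_select_add(a, b, carry_in, w=32):
    """Add two w-bit words; both candidate sums exist before carry_in is known."""
    mask = (1 << w) - 1
    sum_if_0 = a + b          # candidate for carry-in 0
    sum_if_1 = a + b + 1      # candidate for carry-in 1
    chosen = sum_if_1 if carry_in else sum_if_0
    return chosen & mask, chosen >> w      # (sum word, 1-bit carry out)


def add_words(xs, ys, w=32):
    """Add two little-endian word lists, passing the carry from word to word."""
    out, carry = [], 0
    for x, y in zip(xs, ys):
        word, carry = carry_select_add(x, y, carry, w)
        out.append(word)
    return out, carry


print(add_words([0xFFFFFFFF, 0x00000001], [0x00000001, 0x00000002]))
# ([0, 4], 0)
```

Because the candidates are computed in parallel in every PE, the delay along the array is reduced to a chain of selections rather than a chain of full additions.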

As described above, for the modular multiplication and modular inverse computations that are essential to hardware implementations of ECC- and RSA-based public-key cryptography, the present invention divides the operand data into words of a fixed size (for example, 32 bits) and computes a plurality of words simultaneously in parallel. By applying this technique to implement the modular operators (that is, a modular multiplier and a modular inverter) as scalable hardware, modular multipliers and modular inverters that support multiple key lengths and satisfy the requirements of the application (for example, computation speed and hardware complexity) can be designed efficiently.

The modular multiplier and the modular inverter according to the present invention include a one-dimensional array of processing elements (PEs) that operate on words, a control block that controls the operation of the PE array, registers that store the data and intermediate results used in finite-field operations, and registers that store the carry data passed between processing elements.

Although various embodiments of the present invention have been presented and described above, the present invention is not necessarily limited to them, and a person of ordinary skill in the art to which the present invention pertains will readily see that various substitutions, modifications, and changes are possible without departing from the technical idea of the present invention.

Claims

1. A word-parallel operation method for modular operation, comprising:
dividing the multiplier and the multiplicand of a modular multiplication into multi-bit word units; and
performing the modular multiplication operation on a plurality of words corresponding to the divided word units in parallel.

2. The method according to claim 1,
wherein the performing of the modular multiplication operation applies the Montgomery modular multiplication operation and uses an array block of a plurality of processing elements, a multi-bit binary multiplier, a multi-bit adder, a register file block, a controller block, a plurality of selectors, and a plurality of demultiplexers.

3. The method according to claim 2,
wherein the performing of the modular multiplication operation carries out multiplication over a finite field through the Montgomery modular multiplication operation, receiving a multi-bit modulus value, multi-bit multiplier data, and multi-bit multiplicand data, and outputting a multi-bit multiplication result.

4. The method according to claim 3,
wherein the performing of the modular multiplication operation generates partial products through word-unit multiplication starting from the least significant word, and performs the multiplication by adding the generated partial products.

5. The method according to claim 4,
wherein, in the performing of the modular multiplication operation, a value calculated from the least significant word of the modulus value is used for the reduction operation of the Montgomery modular multiplication, and a congruence property is used to make the least significant word of the partial product addition result 0, so that the least significant word is removed without data loss.

6. The method according to any one of claims 1 to 5,
wherein the word parallel operation method for modular operation further comprises:
dividing a multi-bit integer and a multi-bit modulus value into multi-bit words; and
performing a modular inverse operation by processing a plurality of the divided words in parallel using a processing element array.

7. The method of claim 6,
wherein the performing of the modular inverse operation applies the Montgomery modular inverse operation, with a configuration including a plurality of processing element (PE) array blocks, a register file block, a counter block, a controller block, and a plurality of selectors.

8. The method of claim 7,
wherein the performing of the modular inverse operation receives the multi-bit integer and the multi-bit modulus value and outputs a multi-bit modular inverse result as a value on the pseudo-Montgomery domain, the value on the pseudo-Montgomery domain is corrected and converted to a value on the Montgomery domain, and the Montgomery inverse operation is performed through a plurality of iterative loops.

9. The method of claim 8,
wherein the performing of the modular inverse operation reduces the number of iteration loops by shifting several bits at a time, in order to reduce the number of cycles required for the Montgomery inverse operation.

10. The method of claim 9,
wherein the performing of the modular inverse operation is computed by a while iteration loop in which an operation mode is determined according to the current values of the integer and the modulus value, the data for the addition, subtraction, and shift operations are selected accordingly, and the operation is performed.
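As an informal illustration only, and not part of the claims, the iterative inverse recited in claims 8 to 10 resembles the well-known Kaliski Montgomery inverse, which is swapped in here as an assumption. In this sketch a while loop picks one of four add, subtract, and shift modes from the current values, produces a result scaled by 2^k (a pseudo-Montgomery value), and then corrects it to the Montgomery domain. The names `almost_mont_inverse` and `mont_inverse` are hypothetical, and the loop shifts one bit per iteration rather than several.

```python
def almost_mont_inverse(a, p):
    """Return (r, k) with r = a^-1 * 2^k mod p; p odd, 0 < a < p, gcd(a, p) = 1."""
    u, v, r, s, k = p, a, 0, 1, 0
    while v > 0:
        # The operation mode is chosen from the current values of u and v.
        if u % 2 == 0:
            u, s = u >> 1, s << 1
        elif v % 2 == 0:
            v, r = v >> 1, r << 1
        elif u > v:
            u, r, s = (u - v) >> 1, r + s, s << 1
        else:
            v, s, r = (v - u) >> 1, s + r, r << 1
        k += 1
    if u != 1:
        raise ValueError("a and p are not coprime")
    if r >= p:
        r -= p
    return p - r, k


def mont_inverse(a, p, n):
    """Return a^-1 * 2^n mod p by correcting the pseudo-Montgomery value."""
    r, k = almost_mont_inverse(a, p)
    while k > n:                        # remove surplus factors of 2
        r = r >> 1 if r % 2 == 0 else (r + p) >> 1
        k -= 1
    while k < n:                        # add missing factors of 2
        r = (r << 1) % p
        k += 1
    return r


if __name__ == "__main__":
    p = 2**255 - 19
    a = 0x1234567890ABCDEF
    print(hex(mont_inverse(a, p, 256)))
```

Processing several bits per iteration, as in claim 9, would shorten the loop count; that refinement is left out of this sketch.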