KR102126933B1

KR102126933B1 - UNIFIED ARM/NEON MODULAR MULTIPLICATION METHOD OF ARMv7-A PROCESSOR

Info

Publication number: KR102126933B1
Application number: KR1020180148288A
Authority: KR
Inventors: 서화정
Original assignee: 한성대학교 산학협력단
Priority date: 2018-11-27
Filing date: 2018-11-27
Publication date: 2020-06-25
Also published as: KR20200062646A

Abstract

The present invention relates to a method for accelerating multiplication on a 32-bit ARMv7-A processor.
Disclosed is a multiplication acceleration method for a 32-bit ARMv7-A processor having an ARM general operator and a NEON image operator, in which a multiplication of two n-bit operands is converted by Karatsuba multiplication into three multiplications of n/2-bit operands. Two of the n/2-bit multiplications are then performed in the ARM general operator and the remaining n/2-bit multiplication is performed in the NEON image operator.
For 512-bit multiplication on a 32-bit mobile processor (ARM Cortex-A15), the proposed technique is about 1.26 times faster than the conventional implementation.

Description

How to accelerate multiplication on 32-bit ARMv7-A processors {UNIFIED ARM/NEON MODULAR MULTIPLICATION METHOD OF ARMv7-A PROCESSOR}

The present invention relates to a method for accelerating multiplication on a 32-bit ARMv7-A processor, and more particularly, to a method for accelerating multiplication on a 32-bit ARMv7-A processor by using both the ARM general operator and the NEON image operator.

Public key cryptography (PKC), whether pre-quantum (e.g. RSA) or post-quantum (e.g. SIDH), usually requires modular multiplication as a performance-critical building block. One of the most promising modular reduction techniques for PKC is Montgomery multiplication, which replaces the expensive division operation with multiplication operations. Several techniques have been proposed for faster Montgomery multiplication on different CPU architectures. However, the operation is still computationally intensive on modern processors and requires high-speed optimization.
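To make concrete how Montgomery multiplication avoids division, the following is a minimal Python sketch of the textbook Montgomery reduction with R = 2^k. It is not the implementation of the present invention. The function names (`montgomery_setup`, `montgomery_reduce`, `montgomery_multiply`) and the 32-bit word size default are assumptions chosen for illustration.

```python
def montgomery_setup(modulus: int, word_bits: int = 32):
    """Return (r_bits, n_prime) for an odd modulus, where R = 2**r_bits."""
    if modulus % 2 == 0:
        raise ValueError("Montgomery reduction needs an odd modulus")
    words = -(-modulus.bit_length() // word_bits)  # ceiling division
    r_bits = words * word_bits
    r = 1 << r_bits
    n_prime = (-pow(modulus, -1, r)) % r  # n' = -N^(-1) mod R (Python 3.8+)
    return r_bits, n_prime


def montgomery_reduce(t: int, modulus: int, r_bits: int, n_prime: int) -> int:
    """Return t * R^(-1) mod modulus for 0 <= t < modulus * R.

    Only multiplications, masks, and shifts are used; there is no division
    by the modulus.
    """
    mask = (1 << r_bits) - 1
    q = ((t & mask) * n_prime) & mask      # q = t * n' mod R
    u = (t + q * modulus) >> r_bits        # exact division by R
    return u - modulus if u >= modulus else u


def montgomery_multiply(a_bar, b_bar, modulus, r_bits, n_prime):
    """Multiply two values given in Montgomery form (x_bar = x * R mod N)."""
    return montgomery_reduce(a_bar * b_bar, modulus, r_bits, n_prime)
```

Operands are converted into Montgomery form once, multiplied many times, and converted back with one final `montgomery_reduce`, which is why the technique pays off in exponentiation-heavy PKC.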

The 32-bit ARM architecture (e.g. ARMv7-A) is widely used in mini computers and wearable devices. Advanced ARMv7-A processors support Single Instruction Multiple Data (SIMD) instructions (e.g. the NEON engine) for heavy multimedia workloads. To exploit the parallel computing power of SIMD instructions, existing serial implementations must be rewritten in a vectorized form. At SAC'13, Bos et al. introduced a new implementation of Montgomery multiplication for vector computation. This multiplication flipped the sign of the precomputed Montgomery constant and accumulated the result in two separate intermediate values. However, the implementation suffered pipeline stalls caused by read-after-write (RAW) dependencies in the instruction stream, because an instruction must wait until its source register operands are ready to be read. In another approach, product-scanning-based Montgomery multiplication was introduced to compute a pair of 32-bit multiplications at a time. However, this column-wise multiplication still causes many RAW dependencies.
To avoid RAW dependencies, Seo et al. at ICISC'14 introduced a new 2-way Cascade Operand Scanning (COS) multiplication that computes the partial products in a non-conventional order. For modular multiplication, two COS multiplications are then performed in an interleaved manner. The COS multiplication was further improved for long integers, such as 1024-bit and 2048-bit integers, by using the additive Karatsuba method, with the separated Montgomery multiplication selected. However, the prior art paid little attention to the new ARM multiplication instruction (UMAAL), which is a better choice than the NEON instruction (VMULL) for multi-precision multiplication.

[1] P. Barrett. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In Conference on the Theory and Application of Cryptographic Techniques, pages 311-323. Springer, 1986.
[2] H. Fujii and D. F. Aranha. Curve25519 for the Cortex-M4 and beyond. Progress in Cryptology-LATINCRYPT, 2017.
[3] GMP. The GNU Multiple Precision Arithmetic Library. Available for download at https://gmplib.org/, 2018.
[4] H. Seo, Z. Liu, J. Großschadl, and H. Kim. Efficient arithmetic on ARM-NEON and its application for high-speed RSA implementation. Security and Communication Networks, 9(18):5401-5411, 2016.
[5] B. Koziel, A. Jalali, R. Azarderakhsh, D. Jao, and M. Mozaffari-Kermani. NEON-SIDH: efficient implementation of supersingular isogeny Diffie-Hellman key exchange protocol on ARM. In International Conference on Cryptology and Network Security, pages 88-103. Springer, 2016.

The present invention aims to propose a method that exploits these instructions and provides faster modular multiplication for PKC implementations.

The main objects of the present invention are as follows.

1. A unified ARM/NEON modular multiplication is introduced, in which the ARM (UMAAL) and NEON (VMULL) instruction sets are carefully combined to perform modular multiplication in parallel. The method reduces the number of pipeline stalls and memory accesses through manual out-of-order execution and by sharing intermediate results and operands between ARM and NEON registers. For n-word multiplication, manual out-of-order execution improves on automatic out-of-order execution by about 20%, and 6n memory accesses are saved.

2. A new Montgomery reduction method is introduced that combines the UMAAL instruction with a hybrid-scanning technique to reduce the number of memory accesses for operands and to handle the multiplication and accumulation steps. For unified ARM/NEON n-word Montgomery multiplication, the present invention saves 10n memory accesses.

3. The Montgomery multiplication according to the present invention reduces the critical-path complexity, counted in word-level multiplications, from Θ(2n² + n) to Θ(n²). On the ARM Cortex-A15 processor it outperforms the best-known results by 34%. Building on this modular multiplication, Montgomery multiplication for the SIDH protocol is further improved: applied directly to the latest Microsoft SIDH v3.0 library, the proposed Montgomery multiplication improves performance by 11.7x on the ARM Cortex-A15 processor.

In the proposed multiplication technique, the Karatsuba algorithm is first applied to the n-bit multiplication to reduce its complexity. After this conversion, two of the resulting n/2-bit multiplications are performed in the general operator (ARM) and the remaining n/2-bit multiplication is performed in the image operator (NEON). All results are then summed in the general operator to derive the final result. The proposed technique has the advantage of accelerating the core operation of public key cryptography.

The above object of the present invention can be achieved by a method for accelerating multiplication on a 32-bit ARMv7-A processor, which accelerates the multiplication of n-bit factors A and B on a 32-bit ARMv7-A processor having an ARM general operator and a NEON image operator, wherein the n-bit factor A consists of upper n/2 bits A_H and lower n/2 bits A_L, and the n-bit factor B consists of upper n/2 bits B_H and lower n/2 bits B_L, the method comprising: a first step of computing the absolute differences |A_H-A_L| and |B_H-B_L| in the ARM general operator; a second step of performing the two multiplications A_L*B_L and A_H*B_H in the ARM general operator; a third step of performing, in the NEON image operator, the multiplication |A_H-A_L| * |B_H-B_L| of the absolute differences obtained in the first step; and a fourth step of computing, in the ARM general operator, the product of A and B from the results of the first, second, and third steps, wherein the second and third steps are performed in parallel in the ARM general operator and the NEON image operator, respectively.

Before the multiplication starting from the first step is performed, a step 1-1 may further be performed in which the n-bit factor A is split and stored as upper n/2 bits A_H and lower n/2 bits A_L, and the n-bit factor B is split and stored as upper n/2 bits B_H and lower n/2 bits B_L, so that the multiplication of n-bit factors is converted into multiplications of n/2-bit factors.

In addition, a step 2-1, performed between step 1-1 and step 3, must further be provided, in which the two results of the first step are transferred to the NEON image operator.
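The steps above can be sketched in Python as follows. This is a functional model only: it runs sequentially, whereas the method performs the second and third steps in parallel on the ARM and NEON units. The function names and the explicit sign bookkeeping for the product of the absolute differences are assumptions made for illustration.

```python
def split(x: int, n: int):
    """Step 1-1: split an n-bit value into (upper, lower) n/2-bit halves."""
    half = n // 2
    return x >> half, x & ((1 << half) - 1)


def karatsuba_one_level(a: int, b: int, n: int) -> int:
    """Multiply two n-bit integers with one level of subtractive Karatsuba."""
    half = n // 2
    a_h, a_l = split(a, n)
    b_h, b_l = split(b, n)

    # Step 1 (ARM): absolute differences; remember the sign of their product.
    a_m, b_m = abs(a_h - a_l), abs(b_h - b_l)
    negative = (a_h < a_l) != (b_h < b_l)

    # Step 2 (ARM): lower and upper products.
    c_l = a_l * b_l
    c_h = a_h * b_h

    # Step 3 (NEON on the real hardware, concurrent with step 2).
    c_m = a_m * b_m

    # Step 4 (ARM): conditional negation, then recombination.
    signed_c_m = -c_m if negative else c_m
    middle = c_l + c_h - signed_c_m  # equals a_h*b_l + a_l*b_h
    return (c_h << n) + (middle << half) + c_l
```

For example, `karatsuba_one_level(a, b, 512)` uses three 256-bit multiplications instead of four.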

For 512-bit multiplication on a 32-bit mobile processor (ARM Cortex-A15), the technique proposed in the present invention is about 1.26 times faster than the conventional implementation.

The present invention presents a unified ARM/NEON design for high-speed Montgomery multiplication on ARMv7-A processors. In particular, it proposes an optimization technique that performs the computation in parallel on both the ARM and NEON instruction sets and optimizes the number of memory accesses. It also presents a new hybrid-scanning method using UMAAL for Montgomery reduction. Combining these optimizations yields a very efficient Montgomery multiplication, which is 34% faster than the previous implementation on the Cortex-A15 processor.

In addition, results of a highly efficient implementation of the SIDH protocol on the above platform are reported. The optimized Montgomery multiplication accelerates the latest Microsoft SIDH v3.0 library by 11.7x on the ARM Cortex-A15 processor. This result sets a new speed record for SIDH implementations on ARMv7-A processors. Finally, a comparison of the proposed work with pre-quantum cryptography (e.g. RSA and ECC) shows that the SIDH protocol is the slowest of the three but is still fast enough for use in practical applications.

It is also worth noting that the method according to the present invention is well suited to other processors that support SIMD multiplication, such as the INTEL-SSE or INTEL-AVX processor families. Based on this observation, the next task is to apply the proposed modular multiplication routines to INTEL-SSE and INTEL-AVX processors. This is expected to contribute directly to pushing the boundaries further by replacing traditional approaches with the proposed modular multiplication.

FIG. 1 is an explanatory diagram illustrating the multiplication procedure of the present invention at the word level.

The terms used in the present invention are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should not be understood as precluding the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, in this specification, "on" or "above" means located above or below the target part and does not necessarily mean located on the upper side with respect to the direction of gravity. Also, when a part such as a region or a plate is said to be "on or above" another part, this includes not only the case where it is directly on the other part, in contact or spaced apart, but also the case where a further part lies between them.

In addition, in this specification, when one component is referred to as being "connected" or "coupled" to another component, the component may be directly connected or coupled to the other component, but unless there is a specific statement to the contrary, it should be understood that it may also be connected or coupled via another component in between.

Further, in this specification, terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

In the present invention, the proposed modular multiplication is implemented on the ARMv7-A processor, which is widely used in mini computers and wearable devices.

The target architecture supports both the ARM and NEON instruction sets. There are 16 ARM registers, each 32 bits long. Of these, 14 32-bit registers (R0 to R12 and R14) are available to the programmer in assembly language. The ARM registers can therefore hold only 448 bits (32 × 14) of variables, which is not enough space to keep both the intermediate results and the operands. One of the most important issues for a high-speed implementation on this architecture is how to use the registers efficiently and minimize memory accesses.

The ARM processor supports very powerful 32-bit-wise operations. In particular, the ARMv7-A processor supports four unsigned integer multiplication instructions:

MUL (Multiplication: MUL R0, R1, R2 → R0 = R1 × R2 mod 2³²),

UMULL (Unsigned-Multiplication: UMULL R0, R1, R2, R3 → R1 || R0 = R2 × R3),

UMLAL (Unsigned-Multiplication-Accumulation: UMLAL R0, R1, R2, R3 → R1 || R0 = R1 || R0 + R2 × R3), and

UMAAL (Unsigned-Multiplication-Addition-Addition: UMAAL R0, R1, R2, R3 → R1 || R0 = R1 + R0 + R2 × R3).

In particular, the UMAAL instruction performs a multiplication together with two additions. It never produces a carry out of its double-word result, because (2ⁿ-1)² + 2(2ⁿ-1) = 2²ⁿ-1, which removes the need for carry handling.
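The carry-free property can be checked with a small Python model of UMAAL. This is an illustrative sketch, not processor code: `umaal` mirrors the register semantics quoted above, and `mul_row` is a hypothetical helper showing how a chain of UMAAL operations multiplies a multi-word operand by one word with the carry passed along in a register.

```python
MASK32 = 0xFFFFFFFF


def umaal(rd_lo: int, rd_hi: int, rn: int, rm: int):
    """Model UMAAL RdLo, RdHi, Rn, Rm: RdHi || RdLo = RdLo + RdHi + Rn * Rm.

    All inputs are 32-bit words. Returns the new (RdLo, RdHi) pair.
    """
    total = rd_lo + rd_hi + rn * rm
    # Worst case: (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1, so no carry out.
    assert total <= (1 << 64) - 1
    return total & MASK32, total >> 32


def mul_row(a_words, b_word):
    """Multiply a little-endian list of 32-bit words by a single 32-bit word."""
    out = []
    carry = 0
    for a_word in a_words:
        lo, carry = umaal(0, carry, a_word, b_word)
        out.append(lo)
    out.append(carry)
    return out
```

Calling `umaal(MASK32, MASK32, MASK32, MASK32)` exercises the worst case and still fits in 64 bits, which is the identity stated in the text with n = 32.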

The NEON engine, on the other hand, provides 64-bit doubleword (D) and 128-bit quadword (Q) registers. In assembly language, 16 128-bit registers (Q0 to Q15) are available. The NEON registers can hold 2048 bits (128 × 16) of variables, which is enough space to keep all intermediate results and operands. The present invention uses this space, instead of memory, as temporary storage to reduce the number of memory accesses.

NEON instructions operate in a vectorized manner (i.e. 8-bit-wise, 16-bit-wise, 32-bit-wise, and 64-bit-wise), exploiting data parallelism at the instruction level. For 32-bit unsigned multiplication, the NEON engine provides two multiplication instructions:

VMULL.U32 (Vectorized Unsigned Multiplication: VMULL.U32 Q0, D2, D3[0] → D1 = D2[1] × D3[0], D0 = D2[0] × D3[0]), and

VMLAL.U32 (Vectorized Unsigned Multiplication Accumulation: VMLAL.U32 Q0, D2, D3[0] → D1 = D1 + D2[1] × D3[0], D0 = D0 + D2[0] × D3[0]).
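The lane-wise behaviour of these two instructions can be modelled in Python as below. This is a sketch under simplifying assumptions: a D register is represented as a tuple of two 32-bit lanes, a Q register as a tuple of two 64-bit lanes, and the function names are chosen for illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def vmull_u32(d_n, scalar):
    """Model VMULL.U32 Qd, Dn, Dm[x].

    d_n is (lane0, lane1) of 32-bit values and scalar is the selected 32-bit
    lane of Dm. Returns (D_low, D_high) = (lane0 * scalar, lane1 * scalar).
    """
    return tuple((lane & MASK32) * (scalar & MASK32) for lane in d_n)


def vmlal_u32(q_d, d_n, scalar):
    """Model VMLAL.U32 Qd, Dn, Dm[x]: add each product to a 64-bit lane.

    Each lane wraps modulo 2^64, so the caller has to manage carries.
    """
    products = vmull_u32(d_n, scalar)
    return tuple((acc + p) & MASK64 for acc, p in zip(q_d, products))
```

Unlike UMAAL, accumulating a second full-size product into a 64-bit lane can wrap around, which is one reason carry handling matters in NEON multi-precision code.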

These instructions issue two 32-bit unsigned multiplications and produce two 64-bit results at the same time. In this work, the ARM Cortex-A15 processor was used as the target platform. The target ARM Cortex-A15 processor runs at 2 GHz and is equipped with 2 GB of DDR3 RAM. The processor provides a 15-stage integer pipeline and a 17-to-25-stage floating-point pipeline, with an out-of-order, speculative-issue, 3-way superscalar execution pipeline.

Conventional Montgomery multiplication has mainly used NEON instructions for high-speed implementation. However, multiplication implementations based on the ARM instruction set are still competitive in execution time. At SAC'13, Bos et al. evaluated the performance of Montgomery multiplication on various platforms, including INTEL-AVX and ARM-NEON processors.

Interestingly, some of the results for ARMv7-A processors show that the ARM implementation achieved better performance than the NEON implementation.

Recently, at Latincrypt'17, Fujii and Aranha proposed a new multiplication method for ARM Cortex-M processors that uses the new multiplication instruction of ARMv7 processors (UMAAL) together with the continuous operand caching (COC) multiplication method to reduce the number of memory accesses and carry-handling operations. Their multiplication implementation is about 50% faster than previous work.

To compare the performance of the ARM processor and the NEON engine, the best implementations for the ARM and NEON instruction sets were both benchmarked on the target ARMv7-A platform. The detailed comparison is shown in Table 1. As expected, the ARM implementation outperforms the NEON implementation by 16% on the A15 processor. This result shows that a SIMD instruction set (e.g. the NEON engine) does not guarantee the best performance.

Architecture | Fujii et al. (ARM) [2] | Seo et al. (NEON) [4] | Ratio (ARM/NEON)
ARMv7 | 158 | 188 | 0.84
The numbers in square brackets in the first row of Table 1 refer to the non-patent document numbers of the prior art documents.

Previous work used only the ARM instruction set or only the NEON instruction set, which is not optimal because the two instruction sets can be issued in parallel. In the following, the two instruction sets are carefully combined and the multiplication is executed in parallel in order to make full use of both the ARM and NEON resources.

A description of the unified ARM/NEON multiplication is given in Algorithm 1. First, the m-bit operand multiplication is split into m/2-bit operand multiplications, on which one level of Karatsuba multiplication is performed. One level of Karatsuba multiplication reduces the work from four m/2-bit multiplications to three m/2-bit multiplications. The Karatsuba routine can be formulated in two ways, additive and subtractive. Given two n-word operands A and B, each split into an upper half and a lower half (A_H, A_L and B_H, B_L), the product (P = A · B) can be calculated according to Equation 1 using the additive Karatsuba algorithm.

Using the subtractive Karatsuba algorithm, the product can be calculated as in Equation 2.
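The two formulations are shown below in their standard textbook form, assuming m-bit operands written as A = A_H · 2^(m/2) + A_L and B = B_H · 2^(m/2) + B_L. The symbols C_L, C_H, C_M and the sign s are notation introduced here for illustration and are not necessarily identical to the numbered equations of the specification.

```latex
% Additive Karatsuba
P = A_H B_H \, 2^{m}
  + \bigl[(A_H + A_L)(B_H + B_L) - A_H B_H - A_L B_L\bigr] \, 2^{m/2}
  + A_L B_L

% Subtractive Karatsuba
P = C_H \, 2^{m} + \bigl[C_H + C_L - s \, C_M\bigr] \, 2^{m/2} + C_L,
\qquad
\begin{aligned}
C_L &= A_L B_L, \\
C_H &= A_H B_H, \\
C_M &= |A_H - A_L| \cdot |B_H - B_L|, \\
s   &= \operatorname{sign}\bigl((A_H - A_L)(B_H - B_L)\bigr)
\end{aligned}
```

In the additive form the sums A_H + A_L and B_H + B_L can be one bit longer than m/2 bits, whereas in the subtractive form the absolute differences always fit in m/2 bits at the cost of tracking the sign s.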

In particular, the present invention uses the subtractive Karatsuba method. The computation of C_M avoids carry bits but requires two absolute differences and one conditional negation. First, the two absolute differences are computed on the ARM processor, and the resulting operands are transferred directly to NEON registers. Second, two m/2-bit multiplications are assigned to the ARM processor and one m/2-bit multiplication is assigned to the NEON engine. The two routines are performed in parallel. Here, the Continuous Operand Caching (COC) method of Fujii et al., which uses UMAAL, is performed with the ARM instruction set, and the Cascade Operand Scanning (COS) method is performed on the NEON engine.

For the parallel computation, the ARM and NEON instructions are executed in an interleaved manner. The ARM Cortex-A15 processor supports out-of-order execution (OoOE). However, automatic OoOE with the GCC compiler does not produce an optimized ordering. For this reason, the two instruction streams are interleaved manually, line by line. For example, for 512-bit multiplication, the ARM implementation (two 256-bit multiplications) has 296 lines and the NEON implementation (one 256-bit multiplication) has 141 lines. Since the line ratio is approximately 2 (296/141 ≈ 2.099), two ARM instructions are placed per NEON instruction (e.g. ..., ARM instruction; ARM instruction; NEON instruction; ...). After the parallel section, the intermediate results are accumulated to produce the final result.
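The 2:1 interleaving pattern can be sketched as a simple merge of two instruction listings. This is only an illustration of the ordering: the function name `interleave` and the placeholder instruction strings are hypothetical, and a real schedule would also have to respect register dependencies between the streams.

```python
from itertools import islice


def interleave(arm_lines, neon_lines, arm_per_neon=2):
    """Merge two listings: arm_per_neon ARM lines, then one NEON line.

    The relative order inside each stream is preserved. Whatever is left
    over in the longer stream is appended at the end.
    """
    arm, neon = iter(arm_lines), iter(neon_lines)
    merged = []
    while True:
        arm_chunk = list(islice(arm, arm_per_neon))
        neon_chunk = list(islice(neon, 1))
        if not arm_chunk and not neon_chunk:
            break
        merged.extend(arm_chunk + neon_chunk)
    return merged


# Placeholder listings with the line counts quoted in the text.
arm_listing = [f"arm{i}" for i in range(296)]
neon_listing = [f"neon{i}" for i in range(141)]
schedule = interleave(arm_listing, neon_listing)
```

With 296 ARM lines and 141 NEON lines, the NEON stream runs out first and the last 14 ARM lines follow without NEON instructions between them.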

For ease of understanding, a word-level description is provided in FIG. 1.

The method is described using both the multiplication structure and the rhombus form. The following notation is used. Let A and B be operands of length m bits. Each operand is written as A = (A[n-1], ..., A[2], A[1], A[0]) and B = (B[n-1], ..., B[2], B[1], B[0]), where n = ⌈m/w⌉ and w is the word size. The multiplication result C = A · B is twice as long as the operands and is written as C = (C[2n-1], ..., C[2], C[1], C[0]). The multiplication structure shows the order of the partial products from top to bottom, and each point in the rhombus represents one partial product. The rightmost corner of the rhombus corresponds to the lowest indices (i, j = 0) and the leftmost corner to the highest indices (i, j = n-1). A black arrow over a point indicates the processing of that partial product. The bottom row shows the index of C[k], ranging from the rightmost corner (k = 0) to the leftmost corner (k = 2n-1). Since the ARM instruction set follows the Single Instruction Single Data (SISD) architecture, the partial products are computed one at a time. The NEON instruction set, on the other hand, follows the Single Instruction Multiple Data (SIMD) architecture and computes two partial products at a time.
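The word-level notation can be made concrete with a short Python sketch. For illustration only, it uses plain column-wise (product-scanning) ordering, in which column k collects every partial product A[i] × B[j] with i + j = k. It does not reproduce the COC or COS orderings of the method, and the helper names are assumptions.

```python
def to_words(x: int, n: int, w: int = 32):
    """Return the little-endian list (X[0], ..., X[n-1]) of w-bit words."""
    return [(x >> (w * i)) & ((1 << w) - 1) for i in range(n)]


def from_words(words, w: int = 32) -> int:
    """Rebuild the integer from its little-endian w-bit words."""
    return sum(word << (w * i) for i, word in enumerate(words))


def multiply_words(a, b, w: int = 32):
    """Multiply two n-word operands and return the 2n-word result C.

    Each column k of the rhombus sums the partial products with i + j = k.
    """
    n = len(a)
    mask = (1 << w) - 1
    c, acc = [], 0
    for k in range(2 * n - 1):
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            acc += a[i] * b[k - i]  # one point of the rhombus
        c.append(acc & mask)        # C[k]
        acc >>= w                   # carry into the next column
    c.append(acc)                   # C[2n-1]
    return c
```

For m = 256 and w = 32 this gives n = 8 operand words, 64 partial products, and a 16-word result.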

To carry out the ARM/NEON co-design of the multiplication, the computation is divided into three sub-sections (②③④). Two parts (②④) are performed with ARM instructions using the COC method, and the other part (③) is performed with NEON instructions using the COS method. Since the ARM processor and the NEON engine are independent units, the ARM and NEON instructions are executed in parallel without interfering with each other.

- In step ①, operand subtraction is performed on the ARM processor to generate the operands for the middle part (A_M[0~3] ← |A[0~3] − A[4~7]| and B_M[0~3] ← |B[0~3] − B[4~7]|). To reduce the number of memory accesses, the generated operands are transferred directly from ARM registers to NEON registers. For the two n/2-word operands, this saves 2n memory accesses (n store and n load operations).

- After step ①, the parallel section begins. In step ③, the middle part of the Karatsuba multiplication (C_M[0~7] ← A_M[0~3] × B_M[0~3]) is computed on NEON. In steps ② and ④, the lower and upper parts of the Karatsuba multiplication are computed on the ARM processor.

- The Karatsuba multiplication requires 3n additions/subtractions for the middle part. The intermediate results are stored in partially accumulated form. First, the upper and lower parts are accumulated and stored, which saves n/2-word additions and 2n memory accesses (n-word load and store operations). Second, the intermediate results from the ARM processor are accumulated with the result from the NEON engine. The intermediate result C_ML is transferred directly from NEON registers to ARM registers, which saves n memory accesses (n/2-word load and store operations).

- Finally, in step ⑤, the remaining part of the NEON result (C_MH ← C_M div 2ⁿ) is accumulated. The intermediate result C_MH is also transferred directly from NEON registers to ARM registers, which saves a further n memory accesses (n/2-word load and store operations). In total, the register-sharing approach reduces the number of memory accesses by 6n (2n + 2n + n + n).
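The five steps above can be mirrored in a short Python sketch. It runs sequentially on arbitrary-precision integers, so it shows only the data flow, not the ARM/NEON parallelism or the register transfers. The function name, the parameter `half_bits`, and the way the sign of the middle product is tracked are assumptions made for illustration; the source does not spell out the sign handling.

```python
def unified_karatsuba(a, b, half_bits):
    """Multiply two (2 * half_bits)-bit integers using three half-size products."""
    mask = (1 << half_bits) - 1
    a_lo, a_hi = a & mask, a >> half_bits
    b_lo, b_hi = b & mask, b >> half_bits

    # Step 1 (ARM in the source): absolute differences for the middle part.
    a_mid = abs(a_lo - a_hi)
    b_mid = abs(b_lo - b_hi)
    # Assumption: remember whether (a_lo - a_hi) * (b_lo - b_hi) is negative.
    negative = (a_lo < a_hi) != (b_lo < b_hi)

    # Steps 2 and 4 (ARM in the source): lower and upper products.
    c_lo = a_lo * b_lo
    c_hi = a_hi * b_hi
    # Step 3 (NEON in the source): middle product, independent of steps 2 and 4.
    c_mid = a_mid * b_mid

    # Split the middle product the way the text does: C_ML and C_MH.
    c_ml = c_mid & mask
    c_mh = c_mid >> half_bits
    sign = 1 if negative else -1

    # Partial accumulation: upper and lower parts first, then C_ML.
    partial = c_lo + c_hi
    result = c_lo + (c_hi << (2 * half_bits))
    result += (partial + sign * c_ml) << half_bits
    # Step 5: accumulate the remaining half of the middle product, C_MH.
    result += (sign * c_mh) << (2 * half_bits)
    return result
```

For a 512-bit multiplication, `half_bits` would be 256, so each of the three products is a 256-bit by 256-bit multiplication.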

Table 2 compares the results of the present invention with the best previously known results for the ARM and NEON instruction sets.

Method                   Instruction   Timings [cc]
Fujii et al. [2]         ARM           596
GMP-6.1.2 [3]            ARM           1,138
Seo et al. [4]           NEON          632
The present invention    ARM/NEON      470

In Table 2, the number in square brackets after each method is the non-patent document number in the prior art documents listed above.

The multiplication based on ARM instructions outperforms the one based on NEON instructions. Both instruction-set-based multipliers were measured on the target ARMv7-A processor. A state-of-the-art multi-precision multiplication library was also evaluated, but GMP-6.1.2 showed the slowest performance of all, which indicates that the open library is not yet fully optimized. Of the four implementations, the method according to the present invention achieved the best result. Compared with the ARM-instruction-based multiplication, the implementation according to the present invention showed a 21.2% performance improvement on the ARM Cortex-A15 processor. The proposed unified approach is generic to SISD/SIMD platforms and can therefore be applied to other SISD/SIMD architectures. One of the most promising targets is the Intel processor family, which combines the Intel instruction set with the SSE/AVX2 engines.
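The comparison can be checked with a few lines of Python. The cycle counts are those listed in Table 2; the dictionary layout and function name are only illustrative. Because the published cycle counts are rounded, ratios computed this way may differ slightly in the last digit from the percentages quoted in the text.

```python
# Cycle counts for 512-bit multiplication, as listed in Table 2.
baseline_cycles = {
    "Fujii et al. [2] (ARM)": 596,
    "GMP-6.1.2 [3] (ARM)": 1138,
    "Seo et al. [4] (NEON)": 632,
}
proposed_cycles = 470


def speedup(baseline, new):
    """Speedup factor of `new` over `baseline` (values above 1 mean faster)."""
    return baseline / new


for name, cycles in baseline_cycles.items():
    print(f"{name}: {speedup(cycles, proposed_cycles):.2f}x")
```

The ratio against the ARM-only implementation of Fujii et al. comes out between 1.26 and 1.27, in line with the speedup of about 1.26 times stated in the abstract.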

Unlike previous work, both the ARM and NEON instruction sets are used. The 512-bit unified ARM/NEON modular multiplication requires only 878 clock cycles. Compared with the work of Fujii et al., the implementation according to the present invention shows a 34% performance improvement. This is mainly because the unified ARM/NEON approach ensures parallel computation and reduces the number of memory accesses by passing intermediate results and operands between registers. For the Montgomery reduction on ARM, the hybrid scanning is redesigned to take full advantage of the UMAAL instruction. For the special modulus of SIDH, a dedicated Montgomery modular multiplication is presented. Compared with ordinary multiplication, roughly a factor of [...] of the word-wise multiplications can be avoided. The SIDH implementation improves performance by 11%.
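As background for the Montgomery reduction mentioned above, the following Python sketch shows a generic textbook word-wise reduction. It is not the redesigned hybrid-scanning routine of the invention, and it does not use UMAAL; the function name, the 32-bit word size, and the 16-word operand length are assumptions. The constant P503 = 2^250 · 3^159 − 1 is the SIDHp503 prime. Because this prime is congruent to −1 modulo 2^32, the per-word Montgomery constant equals 1, which gives a feel for why a special modulus lets an implementation skip some word multiplications.

```python
P503 = 2**250 * 3**159 - 1  # SIDHp503 prime


def montgomery_reduce(t, p, n_words, w=32):
    """Return t * R^-1 mod p for R = 2^(w * n_words), assuming 0 <= t < p * R.

    Requires Python 3.8+ for pow(p, -1, modulus).
    """
    mask = (1 << w) - 1
    # Montgomery constant p' = -p^-1 mod 2^w (equals 1 when p = -1 mod 2^w).
    p_prime = (-pow(p, -1, 1 << w)) & mask
    for i in range(n_words):
        # Choose m so that word i of t + m * p * 2^(w*i) becomes zero.
        m = (((t >> (w * i)) & mask) * p_prime) & mask
        t += (m * p) << (w * i)
    t >>= w * n_words  # exact division by R: the low n_words words are now zero
    if t >= p:
        t -= p  # single conditional subtraction brings t into [0, p)
    return t
```

A modular multiplication then takes two operands in Montgomery form (x · R mod p), multiplies them as integers, and applies `montgomery_reduce` to the double-length product.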

To evaluate the impact of the implementation according to the present invention, the proposed Montgomery multiplication was also applied to the SIDH protocol. Table 5 shows the results of the software implementation of the SIDHp503 protocol, which matches the post-quantum security of AES-128, on the ARM Cortex-A15 processor. For comparison, the same parameter set as in previous work was adopted. At CANS'16, Koziel et al. presented the first SIDH implementation for ARMv7-A processors. To achieve high performance, they used the Montgomery multiplication method (COS) of Seo et al., and their SIDH protocol implementation requires 302 × 10^6 cycles on the ARM Cortex-A15 processor. The SIDH protocol implementation according to the present invention requires 197 × 10^6 cycles on the ARM Cortex-A15, which is roughly 1.5 times faster than that of Koziel et al. This performance gap comes from the fast modular multiplication: the proposed modular multiplication is about 1.76 times faster than the prior art. The proposed implementation is also about 11.7 times faster than the Microsoft SIDH v3.0 library. Because that library does not provide a highly optimized modular multiplication, the optimization of the Montgomery multiplication in the present invention yields a significant improvement.

In Table 3, the number in parentheses after each method is the non-patent document number among the prior art documents.

To compare the performance of the SIDH protocol with conventional public-key cryptography, Table 6 compares the RSA and ECC implementations of the latest OpenSSL 1.1.0h library on the ARMv7-A platform with the SIDH protocol implementation according to the present invention. On the ARM Cortex-A15 processor, RSA signature generation and verification at the 128-bit security level run at 63 and 3,726 operations per second, respectively, and the ECDH operation on NIST P-256 runs at 2,849 operations per second. By comparison, the proposed SIDHp503 runs at 21 operations per second. ECC provides the highest performance of the three implementations but offers no quantum security. The SIDH protocol has the lowest performance of the three; although it is about three times slower than RSA (21 versus 63 operations per second), it is fast enough for practical applications. Moreover, further advances in the SIDH protocol will narrow the performance gap between pre-quantum and post-quantum cryptography.

Although preferred embodiments of the present invention have been described and illustrated above using specific terms, those terms are used only to describe the invention clearly. It is obvious that various changes and modifications can be made to the embodiments and the terms used without departing from the technical spirit and scope of the following claims. Such modified embodiments should not be understood separately from the spirit and scope of the present invention, but should be regarded as falling within the scope of the claims of the present invention.

Claims

(Deleted)

A method for accelerating multiplication of n-bit operands A and B on a 32-bit ARMv7-A processor having an ARM general operator and a NEON image operator,
wherein the n-bit operand A consists of upper n/2 bits A_H and lower n/2 bits A_L, and the n-bit operand B consists of upper n/2 bits B_H and lower n/2 bits B_L,
the method comprising:
a first step of calculating the absolute differences |A_H-A_L| and |B_H-B_L| in the ARM general operator;
a step 1-2 of transferring the absolute differences obtained in the first step to registers of the NEON image operator;
a second step of performing the two multiplications A_L*B_L and A_H*B_H in the ARM general operator;
a third step of performing, in the NEON image operator, the multiplication |A_H-A_L| * |B_H-B_L| on the absolute differences obtained in the first step; and
a fourth step of calculating the final result in the ARM general operator using the results of the first, second, and third steps,
wherein step 1-2 is performed after the first step and before the third step, and the second step and the third step are performed in parallel in the ARM general operator and the NEON image operator, respectively.

The method of claim 2, further comprising, as a step performed before the first step,
a step 1-1 of dividing the n-bit operand A into upper n/2 bits A_H and lower n/2 bits A_L and storing them, and dividing the n-bit operand B into upper n/2 bits B_H and lower n/2 bits B_L and storing them.

The method of claim 3, further comprising,
as a step performed between step 1-1 and the third step,
a step 2-1 of transferring the results of the two calculations of the first step to the NEON image operator.