JP5896756B2

JP5896756B2 - Arithmetic apparatus and program

Info

Publication number: JP5896756B2
Application number: JP2012009898A
Authority: JP
Inventors: 晃由山口; 佐藤　恒夫; 恒夫佐藤
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2012-01-20
Filing date: 2012-01-20
Publication date: 2016-03-30
Anticipated expiration: 2032-01-20
Also published as: JP2013148767A

Description

The present invention relates to an arithmetic device, an arithmetic method, and a program.
In particular, the present invention relates to an arithmetic device and a program that perform addition, subtraction, multiplication, Montgomery reduction, and Montgomery multiplication of multiple-precision integers used in RSA (registered trademark) cryptography, elliptic curve cryptography, and the like.

Many public-key cryptosystems, including RSA (registered trademark), perform operations over a finite field modulo some odd integer.
These operations must be carried out on numbers longer than the data length (for example, 32 bits) that a general CPU (Central Processing Unit) can process directly.
Among them, multiple-precision multiplication requires O(n^2) computation for n = l/b, where b (bits) is the internal operation width of the CPU and l (bits) is the input length, which makes it one of the heaviest multiple-precision operations.
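
To illustrate where the O(n^2) cost comes from, the following minimal Python sketch multiplies two n-word numbers by the schoolbook method and counts the single-word multiplications. The function names (to_words, schoolbook_mul) and the 32-bit default word width are assumptions made for this illustration; they do not appear in the specification.

```python
def to_words(x, n, b=32):
    """Split x into n little-endian words of b bits each (word 0 is least significant)."""
    mask = (1 << b) - 1
    return [(x >> (b * k)) & mask for k in range(n)]


def schoolbook_mul(xw, yw, b=32):
    """Multiply two n-word numbers.

    Returns (z, mults): z is the 2n-word product (little-endian) and
    mults is the number of word-by-word multiplications performed.
    """
    n = len(xw)
    mask = (1 << b) - 1
    z = [0] * (2 * n)
    mults = 0
    for i in range(n):
        carry = 0
        for j in range(n):
            # One b-bit x b-bit multiplication per inner iteration.
            t = z[i + j] + xw[i] * yw[j] + carry
            z[i + j] = t & mask
            carry = t >> b
            mults += 1
        z[i + n] = carry
    return z, mults


# 128-bit inputs with 32-bit words: n = 4, so 4 * 4 = 16 word multiplications.
x = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
y = 0x123456789ABCDEF00FEDCBA987654321
z, mults = schoolbook_mul(to_words(x, 4), to_words(y, 4))
```

Doubling the input length doubles n and therefore quadruples the number of word multiplications.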

Meanwhile, a method is known that avoids costly division and remainder operations by applying the Montgomery transform to the inputs and then performing multiple-precision multiplication and a process called Montgomery reduction on the transformed data.
Multiple-precision multiplication and Montgomery reduction are often performed as a pair, and the two together are called Montgomery multiplication.
Montgomery reduction normally also requires O(n^2) computation.
Speeding up public-key cryptography therefore requires speeding up these processes.
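
As a point of reference, a word-by-word Montgomery reduction can be sketched as below. This follows the textbook formulation with R = 2^(b*n) and an odd modulus N, not any specific embodiment of this document. Python integers stand in for the word arrays, the names montgomery_reduce and montgomery_mul are hypothetical, and pow(x, -1, m) requires Python 3.8 or later.

```python
def montgomery_reduce(T, N, n, b=32):
    """Return T * R^-1 mod N, where R = 2^(b*n), N is odd and 0 <= T < N * R."""
    base = 1 << b
    mask = base - 1
    n_prime = (-pow(N, -1, base)) % base  # N' = -N^-1 mod 2^b
    for i in range(n):
        # Choose m so that word i of T becomes zero after adding m * N * 2^(b*i).
        m = (((T >> (b * i)) & mask) * n_prime) & mask
        T += (m * N) << (b * i)
    T >>= b * n  # exact division by R: the low n words are now zero
    return T - N if T >= N else T


def montgomery_mul(a_m, b_m, N, n, b=32):
    """Montgomery multiplication: a multiplication followed by a reduction."""
    return montgomery_reduce(a_m * b_m, N, n, b)


N = (1 << 127) - 1           # an odd modulus that fits in n = 4 words of 32 bits
R = 1 << 128
a, c = 0x123456789ABCDEF, 0xFEDCBA987654321
a_m, c_m = a * R % N, c * R % N               # Montgomery transform of the inputs
prod_m = montgomery_mul(a_m, c_m, N, 4)       # equals a * c * R mod N
result = montgomery_reduce(prod_m, N, 4)      # back to the ordinary representation
```

The loop runs n times and each iteration multiplies a word by the n-word modulus, which is where the O(n^2) cost of the reduction comes from. No division or remainder by N is needed, only shifts, masks, and one conditional subtraction.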

As methods for speeding up multiplication and Montgomery reduction, devices have been proposed that use dedicated hardware or transform the inputs (for example, Patent Document 1, Patent Document 2, Patent Document 3, and Patent Document 4).

Patent Document 1: JP S63-28693 A
Patent Document 2: JP 2000-10479 A
Patent Document 3: JP 2000-353077 A
Patent Document 4: JP 2001-194993 A

In Patent Document 1 and Patent Document 3, multiple-precision arithmetic is accelerated using a computer with a dedicated processor architecture or memory architecture.
However, implementing the methods of Patent Document 1 and Patent Document 3 requires a dedicated device.

In Patent Document 2, positive integers C and p are given as input, L is the bit length of p in binary representation, and R is defined as R = 2^n using an integer n with n ≥ L. To compute D = C·R^-1 mod p, an integer pair (α, β) satisfying C = αR + β is calculated, ε satisfying R^-1 ≡ ε (mod p) is obtained, α + εβ is computed, and a residue not greater than p that is congruent to that result modulo p is obtained.
However, this method requires a multiple-precision remainder operation modulo p, and the remainder operation is computationally expensive.
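
The calculation attributed to Patent Document 2 can be sketched as follows. This is only an illustration of the arithmetic stated above, assuming p is odd so that R = 2^n is invertible modulo p; the function name reduce_via_split and the sample values are hypothetical, and pow(x, -1, m) requires Python 3.8 or later.

```python
def reduce_via_split(C, p, n):
    """Compute D = C * R^-1 mod p with R = 2^n, using C = alpha * R + beta."""
    R = 1 << n
    alpha, beta = divmod(C, R)   # C = alpha * R + beta, 0 <= beta < R
    eps = pow(R, -1, p)          # eps = R^-1 mod p (requires gcd(R, p) = 1)
    # The final "% p" is the multiple-precision remainder that the text
    # identifies as the costly step of this method.
    return (alpha + eps * beta) % p


p = 1000003                      # odd modulus, bit length L = 20
n = 20                           # n >= L
C = 123456789012
D = reduce_via_split(C, p, n)
```

The identity holds because C·R^-1 = (αR + β)·R^-1 ≡ α + βε (mod p).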

In Patent Document 4, the input is converted into a residue number system and Montgomery multiplication is carried out in the converted system, which parallelizes the processing and speeds it up.
However, in this method the base extension is computationally expensive, and remainder operations by integers are also required.
On a processor whose remainder operation is slow relative to its other operations, the computational cost actually increases.

The main object of the present invention is to solve the problems described above, namely to realize multiple-precision arithmetic at high speed and at low computational cost without using a dedicated device.

An arithmetic device according to the present invention is
an arithmetic device that includes a control unit, an arithmetic unit, and a storage unit and performs addition of input values, wherein
the control unit
divides an input value X and an input value Y, which have a common bit width larger than the operation bit width of the arithmetic unit, each into pieces of the operation bit width of the arithmetic unit,
allocates a predetermined storage area in the storage unit to provide n variables X[0] to X[n-1] for storing the n (n ≥ 2) divided values obtained from the input value X and n variables Y[0] to Y[n-1] for storing the n divided values obtained from the input value Y,
stores the n divided values in the variables X[0] to X[n-1] such that the divided value containing the least significant bit of the input value X is stored in the 0th variable X[0] and the divided value containing the most significant bit is stored in the (n-1)th variable X[n-1], and stores the n divided values in the variables Y[0] to Y[n-1] such that the divided value containing the least significant bit of the input value Y is stored in the 0th variable Y[0] and the divided value containing the most significant bit is stored in the (n-1)th variable Y[n-1], and
allocates a predetermined storage area in the storage unit to provide u (u > n) variables Z[0] to Z[u-1] for storing the addition result of the arithmetic unit;
the arithmetic unit
executes in parallel n threads whose thread numbers are set to 0 to n-1, and
in thread t, whose thread number is t (t being any of 0 to n-1), as first-phase processing,
computes (value of X[t]) + (value of Y[t]) using the value of the t-th variable X[t] of the input value X and the value of the t-th variable Y[t] of the input value Y, obtains an addition result and a carry value c, and stores the addition result in the t-th variable Z[t];
the control unit
sets a counter value i to 1 and causes the arithmetic unit to start second-phase processing, and
increments the counter value i each time one round of the second-phase processing finishes in all threads t that have not stopped, until the counter value i reaches n or all threads t have stopped the second-phase processing; and
the arithmetic unit,
each time the counter value i is incremented by the control unit, performs, in each thread t that has not stopped, as one round of the second-phase processing,
<a> computing (value of Z[t+i]) + c using the value of the (t+i)th variable Z[t+i] and the carry value c obtained in thread t, and obtaining a new addition result and a new carry value c, and
<b> comparing the value of the variable Z[t+i] with the new addition result, stopping the second-phase processing of thread t if the two match, and storing the new addition result in the (t+i)th variable Z[t+i] if they do not match.

In the present invention, the addition of the divided values finishes in a single pass as the first-phase processing, after which the carry values are processed as the second phase.
Since carry processing normally finishes within a few rounds, the second-phase processing ends early in all threads t, and the addition can be performed at high speed.
Thus, according to the present invention, multiple-precision arithmetic can be realized at high speed and at low computational cost without using a dedicated device.

FIG. 1 is a diagram showing a configuration example of the multiple-precision arithmetic device according to Embodiments 1 to 5.
FIG. 2 is a flowchart showing the procedure of multiple-precision addition according to Embodiment 1.
FIG. 3 is a diagram showing a specific example of multiple-precision addition according to Embodiment 1.
FIG. 4 is a diagram showing the calculation process of multiple-precision addition according to Embodiment 1.
FIG. 5 is a flowchart showing the procedure of multiple-precision subtraction according to Embodiment 2.
FIG. 6 is a diagram showing a specific example of multiple-precision subtraction according to Embodiment 2.
FIG. 7 is a diagram showing the calculation process of multiple-precision subtraction according to Embodiment 2.
FIG. 8 is a flowchart showing the procedure of multiple-precision multiplication according to Embodiment 3.
FIG. 9 is a diagram showing a specific example of multiple-precision multiplication according to Embodiment 3.
FIG. 10 is a diagram showing a specific example of multiple-precision multiplication according to Embodiment 3.
FIG. 11 is a diagram showing the calculation process of multiple-precision multiplication according to Embodiment 3.
FIG. 12 is a diagram showing the calculation process of multiple-precision multiplication according to Embodiment 3.
FIG. 13 is a flowchart showing the procedure of Montgomery reduction according to Embodiment 4.
FIG. 14 is a diagram showing a specific example of Montgomery reduction according to Embodiment 4.
FIG. 15 is a diagram showing a specific example of Montgomery reduction according to Embodiment 4.
FIG. 16 is a diagram showing the calculation process of Montgomery reduction according to Embodiment 4.
FIG. 17 is a diagram showing the calculation process of Montgomery reduction according to Embodiment 4.
FIG. 18 is a flowchart showing the procedure of Montgomery multiplication according to Embodiment 5.
FIG. 19 is a diagram showing a specific example of Montgomery multiplication according to Embodiment 5.
FIG. 20 is a diagram showing a specific example of Montgomery multiplication according to Embodiment 5.
FIG. 21 is a diagram showing the calculation process of Montgomery multiplication according to Embodiment 5.
FIG. 22 is a diagram showing the calculation process of Montgomery multiplication according to Embodiment 5.

Embodiments 1 to 5 describe a multiple-precision arithmetic device that performs multiple-precision arithmetic at high speed.
As an example, Embodiments 1 to 5 describe a multiple-precision arithmetic device that uses a SIMD (Single Instruction Multiple Data) computer.
The multiple-precision arithmetic device according to Embodiments 1 to 5 can also be realized using a commonly available GPU (Graphics Processing Unit).

Embodiment 1.
FIG. 1 is a block diagram showing a configuration example of the multiple-precision arithmetic device 100 according to Embodiments 1 to 5.
The multiple-precision arithmetic device 100 according to Embodiments 1 to 5 has a configuration in which a calculation unit 101, a memory 102, and a communication port 103 are connected by a bus 104.
The multiple-precision arithmetic device 100 according to Embodiments 1 to 5 corresponds to an example of an arithmetic device.

The calculation unit 101 consists of a plurality of processors 105 to 106, an instruction decoder 107, and registers 108 and 109.

The processors 105 to 106 execute the instruction decoded by the instruction decoder 107 on different data.
One of the processors 105 to 106 serves as a control unit that performs the control needed for multiple-precision arithmetic.
One of the processors 105 to 106, or all of them, serve as an arithmetic unit that performs the calculations of multiple-precision arithmetic.
A processor that functions as the control unit may also function as the arithmetic unit.
The details of the processing that the processors 105 to 106 perform as the control unit and as the arithmetic unit are described later.
A processor functioning as the arithmetic unit performs operations with a predetermined operation width of b bits (for example, 32 bits).
Hereinafter, the b bits of the operation width are referred to as one word.
Also, hereinafter, a processor functioning as the control unit is simply called the "control unit", and a processor functioning as the arithmetic unit is simply called the "arithmetic unit".

Data is stored in the general-purpose register 108.
Data is exchanged between processors via the memory 102.
The special register 109 stores special information other than the values computed by the processors 105 to 106.
The general-purpose register 108 and the memory 102 correspond to examples of a storage unit.

The unit of program executed by each processor is called a thread.
One feature of the multiple-precision arithmetic device 100 according to Embodiments 1 to 5 is that a single multiple-precision operation is computed using a plurality of threads.
The number of threads n (n ≥ 2) satisfies n ≥ ceil(l/b), where the input value has l (l > b) bits.
Here, ceil(a) is the smallest integer greater than or equal to a.
Each thread executed by a processor is assigned a thread number from 0 to n-1.
The processors 105 to 106 can obtain the number of threads and the number of the thread that each processor handles from the special register 109.

As shown in FIG. 1, the multiple-precision arithmetic device 100 according to Embodiments 1 to 5 can be a general computer including the processors 105 and 106, the registers 108 and 109, the memory 102 (for example, RAM (Random Access Memory)), the communication port 103, and the bus 104.
Although not shown in FIG. 1, the multiple-precision arithmetic device 100 also includes a nonvolatile storage device, typified by a magnetic disk device.
Computer programs for realizing the operations of the control unit and the arithmetic unit described later, and an operating system, are stored in the nonvolatile storage device.
At least part of the computer programs for realizing the operations of the control unit and the arithmetic unit may be included in the operating system.
While running the operating system, the processors 105 and 106 load the computer programs for realizing the operations of the control unit and the arithmetic unit into the memory 102, read them from the memory 102, and execute them, thereby functioning as the control unit and the arithmetic unit.
The operating procedure of the multiple-precision arithmetic device 100 can also be regarded as an arithmetic method.

Next, an outline of the operation of the calculation unit 101 according to the present embodiment is described.

In the present embodiment, the result of adding the input value X and the input value Y (X + Y) is output to the variable Z.
The input value X and the input value Y are both l (l > b) bits long.
That is, the bit width (l bits) of the input values X and Y is larger than b bits, which is one word of each processor.

In the present embodiment, the control unit divides each of the input value X and the input value Y into n digits.
The control unit provides, in some storage area (for example, the memory 102), n variables X[0] to X[n-1] for storing the n divided values obtained from the input value X and n variables Y[0] to Y[n-1] for storing the n divided values obtained from the input value Y.
The control unit stores the n divided values in the variables X[0] to X[n-1] such that the divided value containing the least significant bit of the input value X is stored in the 0th variable X[0] and the divided value containing the most significant bit is stored in the (n-1)th variable X[n-1].
Similarly, the control unit stores the n divided values in the variables Y[0] to Y[n-1] such that the divided value containing the least significant bit of the input value Y is stored in the 0th variable Y[0] and the divided value containing the most significant bit is stored in the (n-1)th variable Y[n-1].
In addition, u (u > n) variables Z[0] to Z[u-1] for storing the addition result of the arithmetic unit are provided in some storage area (for example, the memory 102).
The number of variables Z may be any number not less than n, but the following description uses an example in which 2n variables Z, that is, variables Z[0] to Z[2n-1], are provided (the variable Z has a size of 2n words).
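
The word layout described above can be sketched as follows. This is a minimal Python illustration rather than code from the specification; the helper names split_words and join_words and the choice of l = 96 and b = 32 are assumptions made for the example.

```python
def split_words(value, n, b=32):
    """Split value into n words of b bits; index 0 holds the least significant bits."""
    mask = (1 << b) - 1
    return [(value >> (b * k)) & mask for k in range(n)]


def join_words(words, b=32):
    """Inverse of split_words: rebuild the integer from little-endian words."""
    return sum(w << (b * k) for k, w in enumerate(words))


l, b = 96, 32
n = -(-l // b)                    # ceil(l / b) using integer arithmetic
X = split_words(0x0123456789ABCDEFF0E1D2C3, n, b)
Y = split_words(0x00000000FFFFFFFF00000001, n, b)
Z = [0] * (2 * n)                 # result area of 2n words, as in the running example
```

Here X[0] receives the word containing the least significant bit and X[n-1] the word containing the most significant bit, matching the storage order stated in the text.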

The arithmetic unit executes in parallel n threads whose thread numbers are set to 0 to n-1.
The processing of the arithmetic unit is broadly divided into first-phase processing and second-phase processing.

As the first-phase processing, in thread t, whose thread number is t (t being any of 0 to n-1), the arithmetic unit computes (value of X[t]) + (value of Y[t]) using the value of the t-th variable X[t] of the input value X and the value of the t-th variable Y[t] of the input value Y, obtains an addition result and a carry value c, and stores the addition result in the t-th variable Z[t].

Next, the control unit sets the counter value i to 1 and causes the arithmetic unit to start the second-phase processing.
Then, until the counter value i reaches n or all threads t have stopped the second-phase processing, the control unit increments the counter value i each time one round of the second-phase processing finishes in all threads t that have not stopped.

Each time the counter value i is incremented by the control unit, the arithmetic unit repeats one round of the second-phase processing in each thread t that has not stopped.
One round of the second-phase processing consists of the following steps <a> and <b>.
<a> Using the value of the (t+i)th variable Z[t+i] and the carry value c obtained in thread t, compute (value of Z[t+i]) + c and obtain a new addition result and a new carry value c.
<b> Compare the value of the variable Z[t+i] with the new addition result; if the two match, stop the second-phase processing of thread t, and if they do not match, store the new addition result in the (t+i)th variable Z[t+i].

Next, the procedure for performing multiple-precision addition with the multiple-precision arithmetic device 100 according to the present embodiment is described with reference to the flowchart of FIG. 2.
In FIG. 2, variables written in uppercase are data in the memory 102, and variables written in lowercase are data in the general-purpose register 108.
The data length of each variable is one word.
If the input value, which is ceil(l/b) words long, is shorter than n words, the missing words are cleared to 0 in advance.

First, for each thread, the arithmetic unit obtains its own thread number and the number of threads used for the operation from the general-purpose register 108 (S201).

Next, the arithmetic unit clears the upper n digits of the output Z to zero, computes the sum of X and Y for each thread, and stores the sum in the lower n digits of Z, and the control unit sets the variable (counter value) i to 1 (S202).
Here, Add_cc(a, b) means that, for one-word inputs a and b, the result of a + b is output to the general-purpose register 108 and the carry value is output to the special register 109.
Processes 2 and 3 of S202 shown in FIG. 2 correspond to the first-phase processing.

When i is less than n, the arithmetic unit reads the value of Z[t+i] and adds 0 to it (S203).
Here, Addc_cc(a, b) means that, for one-word inputs a and b, the values of a, b, and the carry in the special register 109 are added, the addition result is output to the general-purpose register 108, and the carry value after the addition is output to the special register 109.
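
The semantics of the two operations can be modeled in a few lines of Python. This is a sketch under stated assumptions: the names add_cc and addc_cc mirror Add_cc and Addc_cc, the word width is fixed at 32 bits, and the carry is passed and returned as an ordinary value instead of living in the special register 109.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def add_cc(a, b):
    """Model of Add_cc: return (one-word sum, carry-out) for one-word inputs a and b."""
    t = a + b
    return t & MASK, t >> WORD_BITS


def addc_cc(a, b, carry_in):
    """Model of Addc_cc: add a, b and the incoming carry; return (sum, carry-out)."""
    t = a + b + carry_in
    return t & MASK, t >> WORD_BITS


s1, c1 = add_cc(0xFFFFFFFF, 0x00000001)   # sum wraps to 0 and a carry is produced
s2, c2 = addc_cc(s1, 0, c1)               # the carry is consumed by the next addition
```

In S203 the second operand is 0, so Addc_cc(Z[t+i], 0) simply adds the pending carry of thread t to the word Z[t+i].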

If the value does not change across the addition (YES at "s == a?"), the arithmetic unit regards the carry value as 0 and ends the processing of thread t.
If the value changes (NO at "s == a?"), the arithmetic unit outputs the addition result to Z[t+i], and the control unit adds 1 to i (S204).
S203 and S204 are repeated until the carry value becomes 0 (that is, until "s == a?" yields YES) or i reaches n.
The processing ends when all n threads have left the loop.
Processes 1 and 2 of S203, the "s == a?" decision, and process 1 of S204 shown in FIG. 2 correspond to one round of the second-phase processing.

FIG. 3 shows how the values change in the multiple-precision addition according to the present embodiment.
To make the content easier to follow, FIG. 3 uses decimal digits as the internal operation width and shows a 4-digit operation executed by four threads (thread numbers 0 to 3).
FIG. 3 shows an example of computing 1234 (X) + 5678 (Y) = 6912 (Z).
FIG. 4 shows the breakdown of the calculation shown in FIG. 3 for thread number 0.
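
The two-phase procedure and the decimal example can be reproduced with a small sequential simulation. The sketch below is an illustration under stated assumptions, not the embodiment's program: the function name parallel_add, the radix parameter base, and the per-thread carry list are hypothetical, and threads are simulated one after another within each round instead of running on SIMD hardware. The loop bound i < n follows the text as read here; under that bound a carry that leaves digit n-1 in the last round (for example 9999 + 0001 with n = 4) would not be stored, so the demonstration uses inputs that do not hit that case.

```python
def parallel_add(X, Y, base=10):
    """Simulate the two-phase addition on little-endian digit lists X and Y.

    Returns (Z, rounds): Z has 2n digits, rounds counts second-phase rounds started.
    """
    n = len(X)
    Z = [0] * (2 * n)            # upper n digits start cleared to zero
    carry = [0] * n              # carry value c held by each thread

    # First phase: every thread t adds its own pair of digits exactly once.
    for t in range(n):
        s = X[t] + Y[t]
        Z[t], carry[t] = s % base, s // base

    # Second phase: each thread moves its own carry one digit up per round.
    active = set(range(n))
    i, rounds = 1, 0
    while i < n and active:
        rounds += 1
        writes = {}
        for t in sorted(active):
            a = Z[t + i]                     # value before adding the carry
            total = a + carry[t]
            s, carry[t] = total % base, total // base
            if s == a:                       # unchanged: the carry was 0, thread stops
                active.discard(t)
            else:
                writes[t + i] = s
        # In round i thread t touches only Z[t+i], so the writes never collide.
        for k, v in writes.items():
            Z[k] = v
        i += 1
    return Z, rounds


# 1234 + 5678 with four threads, digits stored least significant first.
Z, rounds = parallel_add([4, 3, 2, 1], [8, 7, 6, 5])
print(Z, rounds)   # prints [2, 1, 9, 6, 0, 0, 0, 0] 2
```

In this run threads 0 and 1 each produce a carry in the first phase, both carries are absorbed in round 1, and round 2 only confirms that no thread changes a digit. Reading Z from the highest digit down gives 6912.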

Next, the effects of the multiple-precision arithmetic device 100 according to the present embodiment are described.
In the present embodiment, the addition of each digit finishes in a single pass, after which the carries are processed.
The worst case of carry processing is O(n), but when the inputs are random, the probability that a carry survives to the end is very low, so the loop of FIG. 2 can be exited after a few carry calculations and the addition can be performed at high speed.
In addition, by comparing the values before and after the carry addition, the presence or absence of a carry can be determined even when the carry information cannot be referenced directly.
Furthermore, as described above, the multiple-precision arithmetic device 100 according to the present embodiment can be realized with an ordinary computer such as a SIMD computer, and can perform multiple-precision addition at high speed without using a dedicated device.
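
The comparison-based carry detection works because adding a carry of 0 leaves a word unchanged, while adding a carry of 1 always changes it, even when the word wraps around. The short Python check below verifies this exhaustively for an 8-bit word; the function name carry_detected and the 8-bit width are assumptions chosen to keep the check small.

```python
def carry_detected(before, after):
    """Return True if the word changed, which means the added carry was 1."""
    return before != after


BITS = 8
MASK = (1 << BITS) - 1

# Exhaustive check over every 8-bit word and both possible carry values.
for a in range(1 << BITS):
    for c in (0, 1):
        s = (a + c) & MASK
        assert carry_detected(a, s) == (c == 1)
```

The same argument holds for any word width of at least one bit, since a + 1 is never congruent to a modulo 2^b.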

As described above, the present embodiment has described
a multiple-precision integer arithmetic device that has a computer containing a plurality of processors and capable of executing a single instruction on a plurality of data simultaneously, and a memory that stores data and can be accessed simultaneously by the processors, the device executing multiple-precision addition that computes Z = X + Y for inputs X and Y in the following steps.
1. A step of dividing the input data into a plurality of digits (hereinafter X[n] and Y[n], respectively), one per internal operation width of the computer
2. A step of obtaining, for each thread, its own thread number t and the number of threads n used for the operation
3. A step of setting Z to 0
4. A step of computing X[t] + Y[t], obtaining an addition result and a carry c, and storing the addition result in Z[t]
5. A step of setting the digit i to 1
6. A step of computing Z[t+i] + c, obtaining a new addition result and a new carry c, and storing the addition result in Z[t+i]
7. A step of adding 1 to the digit i
8. A step of executing steps 6 to 7 while i < n and c ≠ 0
9. A step of waiting for all of the threads to complete steps 6 to 8

Embodiment 2.
The present embodiment performs multiple-precision subtraction.
The configuration of the multiple-precision arithmetic device 100 according to the present embodiment is the same as that shown in FIG. 1.

In the present embodiment, the result of subtracting the input value X from the input value Y (Y − X) is output to the variable Z.
As in Embodiment 1, the bit width of the input values X and Y is l (l > b) bits, the control unit divides each of the input value X and the input value Y into n digits, provides the variables X[0] to X[n-1] and the variables Y[0] to Y[n-1], and further provides the variables Z[0] to Z[2n-1].
In the present embodiment as well, the number of variables Z may be any number not less than n, but, as in Embodiment 1, the description uses an example in which 2n variables Z are provided (the variable Z has a size of 2n words).
The procedures by which the control unit stores the divided values in X[0] to X[n-1] and in Y[0] to Y[n-1] are also the same as in Embodiment 1.
Furthermore, as in Embodiment 1, the arithmetic unit executes n threads in parallel and performs first-phase processing and second-phase processing.

In this embodiment, as the first-phase processing, the arithmetic unit, in thread t, uses the value of the t-th variable X[t] of the input value X and the value of the t-th variable Y[t] of the input value Y to calculate (value of Y[t]) − (value of X[t]), obtains the subtraction result and a borrow value d, and stores the subtraction result in the t-th variable Z[t].

Each time the control unit increments the counter value i, the arithmetic unit repeats one round of the second-phase processing in every thread t that has not stopped.
One round of the second-phase processing consists of the following processes (a) and (b).
(a) Using the value of the (t+i)-th variable Z[t+i] and the borrow value d obtained in thread t, calculate (value of Z[t+i]) − d and obtain a new subtraction result and a new borrow value d.
(b) Compare the value of the variable Z[t+i] with the new subtraction result; if the two match, stop the second-phase processing of thread t; if they do not match, store the new subtraction result in the (t+i)-th variable Z[t+i].

Next, the procedure for performing multiple-precision subtraction with the multiple-precision arithmetic apparatus 100 according to this embodiment is described with reference to the flowchart of FIG. 5.
In FIG. 5, variables written in uppercase are data in the memory 102, and variables written in lowercase are data in the general-purpose register 108.
Each variable is one word long.
If an input value occupies ceil(l/b) words and this is fewer than n, the missing words are cleared to 0 in advance.

First, for each thread, the arithmetic unit obtains from the general-purpose register 108 the thread's own thread number and the number of threads taking part in the calculation (S401).

Next, the arithmetic unit clears the upper n digits of the output Z to zero, computes the difference between Y and X in each thread, and stores the difference in the lower n digits of Z; the control unit sets the variable (counter value) i to 1 (S402).
Here, Sub_cc(a, b) means that, for one-word inputs a and b, the result of b − a is output to the general-purpose register 108 and the borrow value is output to the special register.
Processes 2 and 3 of S402 in FIG. 5 correspond to the first-phase processing.

If i is less than n, the arithmetic unit reads the value of Z[t+i] and performs a subtraction with 0 (S403).
Here, Subc_cc(a, b) means that, for one-word inputs a and b, a and the borrow value held in the special register 109 are subtracted from b, the subtraction result is output to the general-purpose register 108, and the new borrow value is output to the special register 109.
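The semantics of Sub_cc and Subc_cc described above can be modelled as follows. This is a sketch assuming a 32-bit word; `WORD_BITS`, `sub_cc` and `subc_cc` are hypothetical names, and the borrow that the text keeps in the special register 109 is returned and passed explicitly instead.

```python
WORD_BITS = 32                      # assumed word size b
MASK = (1 << WORD_BITS) - 1


def sub_cc(a, b):
    """Model of Sub_cc(a, b): return (b - a wrapped to one word, borrow out)."""
    diff = b - a
    return diff & MASK, 1 if diff < 0 else 0


def subc_cc(a, b, borrow_in):
    """Model of Subc_cc(a, b): subtract a and the incoming borrow from b."""
    diff = b - a - borrow_in
    return diff & MASK, 1 if diff < 0 else 0
```

In S403 the call corresponds to `subc_cc(0, z_value, borrow)`: the stored word changes only when a borrow is pending.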

If the value is unchanged by the subtraction (YES at s == a?), the arithmetic unit regards the borrow value as 0 and ends the processing of thread t.
If the value has changed (NO at s == a?), the arithmetic unit outputs the subtraction result to Z[t+i], and the control unit adds 1 to i (S404).
S403 and S404 are repeated until the borrow value becomes 0 (that is, until s == a? yields YES) or i reaches n.
The processing ends when all n threads have exited the loop.
Processes 1 and 2 of S403, the s == a? decision, and process 1 of S404 in FIG. 5 correspond to one round of the second-phase processing.

FIG. 6 shows how the values change during the multiple-precision subtraction according to this embodiment.
For ease of understanding, FIG. 6 takes the internal calculation width to be one decimal digit and shows a four-digit operation executed by four threads (thread numbers 0 to 3). FIG. 6 shows an example of calculating 7634 (Y) − 5678 (X) = 1956 (Z).
FIG. 7 shows the breakdown of the calculation of FIG. 6 for thread number 0.

Next, the effects of the multiple-precision arithmetic apparatus 100 according to this embodiment are described.
In this embodiment, the subtraction of each digit finishes in a single step, after which the borrows are processed.
The worst case of the borrow processing is O(n), but when the inputs are random, the probability that a borrow persists to the end is very low, so the loop of FIG. 5 can be exited after a few borrow calculations.
The subtraction can therefore be performed at high speed.
In addition, by comparing the values before and after the borrow subtraction, the presence or absence of a borrow can be determined even when the borrow information cannot be referenced directly.
Furthermore, as described above, the multiple-precision arithmetic apparatus 100 according to this embodiment can be realized with an ordinary computer such as a SIMD computer, and can perform multiple-precision subtraction at high speed without a dedicated device.
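The comparison-based borrow detection mentioned above can be checked exhaustively for a small digit size. The sketch below assumes a borrow d of 0 or 1 and decimal digits; `changed_by_borrow` is a hypothetical helper name, not something defined in the specification.

```python
def changed_by_borrow(a, d, base):
    """Return True when subtracting borrow d from digit a changes the stored digit."""
    s = (a - d) % base      # digit after the subtraction, wrapped into 0..base-1
    return s != a


# For every decimal digit, the value changes exactly when a borrow was pending.
BASE = 10
no_borrow_unchanged = all(not changed_by_borrow(a, 0, BASE) for a in range(BASE))
borrow_always_changes = all(changed_by_borrow(a, 1, BASE) for a in range(BASE))
```

Both flags evaluate to true, which is why a thread can stop as soon as the word it reads back equals the word it computed.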

As described above, this embodiment has presented a multiple-precision integer arithmetic apparatus comprising a computer that incorporates a plurality of processors and can execute a single instruction on a plurality of data simultaneously, and a memory that stores data and can be accessed simultaneously by the processors, the apparatus executing, for inputs X and Y, multiple-precision subtraction that calculates Z = Y − X in the following steps.
1. Dividing the input data into a plurality of digits (hereinafter X[n] and Y[n], respectively), one per internal calculation width of the computer.
2. Obtaining, for each thread, its own thread number t and the number n of threads taking part in the calculation.
3. Setting Z to 0.
4. Calculating Y[t] − X[t], obtaining the subtraction result and a borrow d, and storing the subtraction result in Z[t].
5. Setting the digit i to 1.
6. Calculating Z[t+i] − d, obtaining a new subtraction result and a new borrow d, and storing the subtraction result in Z[t+i].
7. Adding 1 to the digit i.
8. Executing steps 6 and 7 while i < n and d ≠ 0.
9. Waiting for all of the threads to complete steps 6 to 8.
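The subtraction steps above can be sketched as a sequential simulation in Python. The names `parallel_subtract` and `digits_to_int` are hypothetical, digits are stored least-significant first, and the loop over t stands in for lockstep threads. The sketch assumes Y ≥ X (an assumption of this example, so that no borrow leaves the top digit) and stops a thread by comparing the word before and after the subtraction, as the flowchart description does.

```python
def digits_to_int(digits, base):
    """Combine little-endian digits into a Python integer."""
    return sum(d * base**k for k, d in enumerate(digits))


def parallel_subtract(y_digits, x_digits, base):
    """Simulate the two-phase subtraction Z = Y - X (assumes Y >= X)."""
    n = len(x_digits)
    z = [0] * (2 * n)
    borrow = [0] * n               # one borrow d per thread t

    # First phase: every thread subtracts its own digit pair.
    for t in range(n):
        diff = y_digits[t] - x_digits[t]
        z[t] = diff % base
        borrow[t] = 1 if diff < 0 else 0

    # Second phase: each thread carries its borrow one digit up per round.
    active = [True] * n
    i = 1
    while i < n and any(active):
        for t in range(n):
            if not active[t]:
                continue
            a = z[t + i]
            raw = a - borrow[t]
            s = raw % base
            if s == a:             # unchanged word means the borrow was 0: stop
                active[t] = False
            else:
                z[t + i] = s
                borrow[t] = 1 if raw < 0 else 0
        i += 1
    return z
```

Calling `parallel_subtract([4, 3, 6, 7], [8, 7, 6, 5], 10)` reproduces the 7634 − 5678 example: the returned digits represent 1956.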

Embodiment 3.
This embodiment performs multiple-precision multiplication.
The configuration of the multiple-precision arithmetic apparatus 100 according to this embodiment is the same as that shown in FIG. 1.

In this embodiment, the multiplication result (X × Y), obtained by multiplying the input value X by the input value Y, is output to the variable Z.
As in Embodiment 1, the input values X and Y are each l bits wide (l > b); the control unit divides X and Y into n digits each, provides variables X[0] to X[n−1] and Y[0] to Y[n−1], and further provides variables Z[0] to Z[2n−1].
The procedures by which the control unit stores the divided values in X[0] to X[n−1] and in Y[0] to Y[n−1] are also the same as in Embodiment 1.
Likewise, as in Embodiment 1, the arithmetic unit executes n threads in parallel and performs first-phase processing and second-phase processing.

In this embodiment, the control unit first sets the counter value i to 0 and causes the arithmetic unit to start the first-phase processing.
Until the counter value i reaches n, the control unit increments i each time the arithmetic unit finishes one round of the first-phase processing.

Until the counter value i reaches n, each time the control unit increments i, the arithmetic unit repeats one round of the first-phase processing in thread t.
One round of the first-phase processing consists of the following processes <1-a> and <1-b>.
<1-a> Using the value of the t-th variable X[t] of the input value X, the value of the i-th variable Y[i] of the input value Y, the value of the (t+i)-th variable Z[t+i], and the carry component value c obtained in thread t, calculate (value of X[t]) × (value of Y[i]) + (value of Z[t+i]) + c.
<1-b> Store the lower one word of the calculation result in the (t+i)-th variable Z[t+i], and take the upper one word of the calculation result as the new carry component value c.

Next, when the counter value i reaches n, the control unit causes the arithmetic unit to start the second-phase processing.
Until the counter value i reaches 2n, the control unit increments i each time the arithmetic unit finishes one round of the second-phase processing.

Until the counter value i reaches 2n, each time the control unit increments i, the arithmetic unit repeats one round of the second-phase processing in every thread t that has not stopped.
One round of the second-phase processing consists of the following processes <2-a> and <2-b>.
<2-a> Determine whether the carry component value c obtained in thread t is 0, and stop the second-phase processing of thread t if c is 0.
<2-b> If c is not 0, calculate (value of Z[t+i]) + c, take the lower one word of the calculation result as the new value of the variable Z[t+i], and take the upper one word of the calculation result as the new carry component value c.

Next, the procedure for performing multiple-precision multiplication with the multiple-precision arithmetic apparatus 100 according to this embodiment is described with reference to the flowchart of FIG. 8.
In FIG. 8, variables written in uppercase are data in the memory 102, and variables written in lowercase are data in the general-purpose register 108.
Each variable is one word long.
However, variables written in bold in FIG. 8 are two words long.
In this specification, two-word variables are written in double quotation marks (for example, “m”).
If an input value occupies ceil(l/b) words and this is fewer than n, the missing words are cleared to 0 in advance.

First, for each thread, the arithmetic unit obtains from the general-purpose register 108 the thread's own thread number and the number of threads taking part in the calculation, and sets Z and “c” to 0; the control unit sets i to 0 (S601).

If i is less than n, the arithmetic unit performs the multiplication processing.
That is, the arithmetic unit calculates “m” = X[t] × Y[i] + Z[t+i] + “c”.
The arithmetic unit then outputs the lower one word of “m” to Z[t+i] and the upper one word to “c”.
Next, the control unit adds 1 to i (S602).
Here, Mul_w(a, b) means that the product of the one-word values a and b is computed and the two-word result is output.
Processes 1 to 4 of S602 in FIG. 8 correspond to one round of the first-phase processing.
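One round of the first phase can be modelled at word level as follows. This is a sketch assuming a 32-bit word; `mul_w` mirrors the Mul_w operation named in the text, while `WORD_BITS` and `multiply_accumulate_step` are hypothetical names introduced for illustration. It also shows why “m” never needs more than two words: with all four inputs at their maximum, the result is exactly the largest two-word value.

```python
WORD_BITS = 32                      # assumed word size b
MASK = (1 << WORD_BITS) - 1


def mul_w(a, b):
    """Model of Mul_w(a, b): full two-word product of two one-word values."""
    return a * b


def multiply_accumulate_step(x_t, y_i, z_ti, c):
    """Compute m = X[t] * Y[i] + Z[t+i] + c and split it into (low word, high word)."""
    m = mul_w(x_t, y_i) + z_ti + c
    return m & MASK, m >> WORD_BITS   # (new Z[t+i], new carry component c)
```

The low word is written back to Z[t+i] and the high word becomes the thread's carry for the next round.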

When i becomes n or greater, the arithmetic unit leaves the multiplication processing and performs the carry processing.
That is, the arithmetic unit determines whether the carry component value “c” is 0 and, if “c” is not 0 (NO at “c” == 0?), calculates “c” = Z[t+i] + “c” and outputs the lower one word of “c” to Z[t+i] and the upper one word to “c”.
The control unit then adds 1 to the variable i (S603).
The loop is repeated until i ≥ 2n or “c” = 0.
In other words, until the variable i reaches 2n, each time the control unit increments i, the arithmetic unit determines, in every thread t that has not stopped, whether the carry component value “c” obtained in thread t is 0; if “c” is 0 (YES at “c” == 0?), it stops the processing of thread t, and if “c” is not 0 (NO at “c” == 0?), it performs the processing of S603.
The processing ends when all n threads have exited the loop.
Processes 1 to 3 of S603 and the “c” == 0? decision in FIG. 8 correspond to one round of the second-phase processing.

FIGS. 9 and 10 show how the values change during the multiple-precision multiplication according to this embodiment.
For ease of understanding, FIGS. 9 and 10 take the internal calculation width to be one decimal digit and show a four-digit operation executed by four threads (thread numbers 0 to 3). FIGS. 9 and 10 show an example of calculating 1234 (X) × 5678 (Y) = 7006652 (Z).
To make the continuity with FIG. 9 explicit, FIG. 10 repeats the calculation for i = 3 shown at the bottom of FIG. 9.
FIGS. 11 and 12 show the breakdown of the calculation of FIGS. 9 and 10 for thread number 0.

Next, the effects of the multiple-precision arithmetic apparatus 100 according to this embodiment are described.
In this embodiment, the multiplication processing finishes in n loop iterations.
Performing the carry addition during the multiplication processing is one of the features of the multiple-precision multiplication of this embodiment.

As for the carry processing, if a carry component remains after the multiplication processing, additions are performed until the carry component becomes 0.
The worst case of the carry processing is O(n), but when the inputs are random, the probability that a carry persists to the end is very low, so the loop can be exited after a few carry calculations.
The multiple-precision multiplication can therefore be performed in n + α loop iterations (α < n).
Furthermore, as described above, the multiple-precision arithmetic apparatus 100 according to this embodiment can be realized with an ordinary computer such as a SIMD computer, and can perform multiple-precision multiplication at high speed without a dedicated device.

As described above, this embodiment has presented a multiple-precision integer arithmetic apparatus comprising a computer that incorporates a plurality of processors and can execute a single instruction on a plurality of data simultaneously, and a memory that stores data and can be accessed simultaneously by the processors, the apparatus executing, for inputs X and Y, multiple-precision multiplication that calculates Z = XY in the following steps.
1. Dividing the input data into a plurality of digits (hereinafter X[n] and Y[n], respectively), one per internal calculation width of the computer.
2. Setting the output Z, the carry c, and the digit i to 0.
3. Calculating “m” = X[t] × Y[i] + Z[t+i] + c, and outputting the lower one word of “m” to Z[t+i] and the upper one word to c.
4. Adding 1 to the digit i.
5. Executing steps 3 and 4 while i < n.
6. Calculating c = Z[t+i] + c, and outputting the lower one word of c to Z[t+i] and the upper one word to c.
7. Adding 1 to i.
8. Executing steps 6 and 7 while i < 2n and c ≠ 0.
9. Waiting for all of the threads to complete steps 6 to 8.
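The multiplication steps above can be sketched as a sequential simulation in Python. The names `parallel_multiply` and `digits_to_int` are hypothetical, digits are stored least-significant first, and the loop over t stands in for n lockstep threads (within a round each thread works on a different Z[t+i]). It is an illustration of the two phases, not the apparatus's implementation.

```python
def digits_to_int(digits, base):
    """Combine little-endian digits into a Python integer."""
    return sum(d * base**k for k, d in enumerate(digits))


def parallel_multiply(x_digits, y_digits, base):
    """Simulate the two-phase multiplication Z = X * Y with n lockstep threads."""
    n = len(x_digits)
    z = [0] * (2 * n)              # step 2: Z, carry and i start at 0
    carry = [0] * n                # one carry component c per thread t

    # Steps 3-5 (first phase): n rounds of multiply-accumulate.
    for i in range(n):
        for t in range(n):
            m = x_digits[t] * y_digits[i] + z[t + i] + carry[t]
            z[t + i] = m % base    # lower word
            carry[t] = m // base   # upper word

    # Steps 6-8 (second phase): propagate whatever carries remain.
    i = n
    while i < 2 * n and any(carry):
        for t in range(n):
            if carry[t] == 0:      # stopped thread; checked before touching Z
                continue
            m = z[t + i] + carry[t]
            z[t + i] = m % base
            carry[t] = m // base
        i += 1
    return z
```

With decimal digits, `parallel_multiply([4, 3, 2, 1], [8, 7, 6, 5], 10)` reproduces the 1234 × 5678 example, and `digits_to_int` of the result gives 7006652.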

Embodiment 4.
This embodiment performs multiple-precision Montgomery reduction.
The configuration of the multiple-precision arithmetic apparatus 100 according to this embodiment is the same as that shown in FIG. 1.

In this embodiment, for an input value X and a modulus M whose bit widths are larger than one word (b bits), which is the calculation bit width of the arithmetic unit, Montgomery reduction is performed to calculate (X·R^(−1) mod M), using R defined by r = 2^b and R = r^n, and MInv defined as (−M^(−1) mod r). Here, 0 ≤ X < MR.
The above n is the number of parts into which the modulus M is divided when it is split into one-word units, and n ≥ 2.
In this embodiment, the number of parts into which the input value X is divided when it is split into one-word units is assumed to be 2n.
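The quantities defined above can be computed directly with Python's built-in `pow`, which accepts a negative exponent together with a modulus from Python 3.8 onward. The sketch below is only a reference model for checking results: `montgomery_constants` and `montgomery_reduce_reference` are hypothetical names, the modulus is assumed to be odd (so that it is invertible modulo r), and the reference function uses a plain modular inverse rather than the word-by-word procedure of this embodiment.

```python
def montgomery_constants(modulus, word_bits, n):
    """Return (r, R, MInv) with r = 2^b, R = r^n and MInv = -M^(-1) mod r."""
    r = 1 << word_bits
    big_r = r ** n
    m_inv = (-pow(modulus, -1, r)) % r      # needs an odd modulus, Python 3.8+
    return r, big_r, m_inv


def montgomery_reduce_reference(x, modulus, big_r):
    """Reference value of X * R^(-1) mod M, computed with a direct modular inverse."""
    return (x * pow(big_r, -1, modulus)) % modulus
```

A quick consistency check is that M × MInv + 1 is divisible by r, and that multiplying the reference result by R gives back X modulo M.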

First, the control unit divides the input value X and the modulus M into one-word units.
The control unit provides, in some storage area (for example, the memory 102), 2n variables X[0] to X[2n−1] for storing the 2n divided values obtained from the input value X.
The control unit stores the 2n divided values in the variables X[0] to X[2n−1] such that the divided value containing the least significant bit of the input value X is stored in the 0th variable X[0] and the divided value containing the most significant bit is stored in the (2n−1)-th variable X[2n−1].
The control unit also provides, in some storage area (for example, the memory 102), n variables M[0] to M[n−1] for storing the n divided values obtained from the modulus M.
The control unit stores the n divided values in the variables M[0] to M[n−1] such that the divided value containing the least significant bit of the modulus M is stored in the 0th variable M[0] and the divided value containing the most significant bit is stored in the (n−1)-th variable M[n−1].
Furthermore, the control unit provides, in some storage area (for example, the memory 102), n variables Z[0] to Z[n−1] for storing the calculation results of the arithmetic unit.

As in Embodiments 1 to 3, the arithmetic unit executes in parallel n threads whose thread numbers are set to 0 to n−1.
The processing of the arithmetic unit is broadly divided into first-phase to fourth-phase processing.

As the first-phase processing, the arithmetic unit performs the following processes <1-a> to <1-c>.
<1-a> In thread t, using the value of the 0th variable X[0] of the input value X and MInv, calculate (value of X[0]) × MInv, and let u be the lower one word of the two-word calculation result.
<1-b> In thread t, using the value of the t-th variable M[t] of the modulus M, the value of the t-th variable X[t] of the input value X, and u, calculate (value of M[t]) × u + (value of X[t]); let m be the two-word calculation result, take the lower one word of m as the new value of the variable X[t], and take the upper one word of m as the carry component value c.
<1-c> In thread 0, the 0th thread, take the lower one word of the carry component value c obtained in thread 0 as the new value of the variable X[0].

Next, the control unit sets the counter value i to 1 and causes the arithmetic unit to start the second-phase processing.
Until the counter value i reaches n, the control unit increments i each time one round of the second-phase processing has finished in all threads t.

Each time the control unit increments the counter value i, the arithmetic unit repeats one round of the second-phase processing.
One round of the second-phase processing consists of the following processes <2-a> to <2-c>.
<2-a> In thread t, using the value of the 0th variable X[0] of the input value X, the value of the i-th variable X[i], and MInv, calculate {(value of X[0]) + (value of X[i])} × MInv, and let u be the lower one word of the two-word calculation result.
<2-b> In thread t, using the value of the t-th variable M[t] of the modulus M, u, the value of the (t+i)-th variable X[t+i] of the input value X, and the carry component value c obtained in thread t, calculate (value of M[t]) × u + (value of X[t+i]) + c; let m be the two-word calculation result, take the lower one word of m as the new value of the variable X[t+i], and take the upper one word of m as the new carry component value c.
<2-c> In thread 0, the 0th thread, take the lower one word of the carry component value c obtained in thread 0 as the new value of the variable X[0].
When the counter value i reaches n, the arithmetic unit sets, in thread 0, the value 0 as the new value of the variable X[0].

Next, when the counter value i reaches n, and after the value 0 has been set as the new value of the variable X[0] in thread 0, the control unit causes the arithmetic unit to start the third-phase processing.
Until the counter value i reaches 2n or all threads t have stopped the third-phase processing, the control unit increments i each time one round of the third-phase processing has finished in all threads t that have not stopped.

Until the counter value i reaches 2n, each time the control unit increments i, the arithmetic unit repeats one round of the third-phase processing in every thread t that has not stopped.
One round of the third-phase processing consists of the following processes <3-a> and <3-b>.
<3-a> Determine whether the carry component value c obtained in thread t is 0, and stop the third-phase processing of thread t if c is 0.
<3-b> If c is not 0, calculate (value of X[(t+i) mod 2n]) + c; let m be the two-word calculation result, take the lower one word of m as the new value of the variable X[(t+i) mod 2n], and take the upper one word of m as the new carry component value c.

When the counter value i reaches 2n, or when all threads t have stopped the third-phase processing, the control unit causes the arithmetic unit to start the fourth-phase processing.

As the fourth-phase processing, the arithmetic unit performs the following.
It stores the value of the variable X[0] in a variable a, and stores the values of the variables X[n] to X[2n−1] in the variables X[0] to X[n−1], respectively.
If the value of the variable a is not 0, or if the value obtained by concatenating the values of the variables X[n−1] to X[0] is not less than the modulus M, it calculates X − M and stores the result in the variables X[0] to X[n−1]; it then stores the values of the variables X[0] to X[n−1] in the variables Z[0] to Z[n−1].
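The four phases described above can be sketched as a sequential simulation in Python. This is an illustration under stated assumptions, not the apparatus's implementation: `montgomery_reduce_parallel`, `digits_to_int` and `int_to_digits` are hypothetical names, digits are stored least-significant first, the digit base is assumed coprime to M, and the loop over t stands in for n lockstep threads (so u is computed once per round before any thread writes). Two simplifications are made: the third-phase loop is allowed to run through i = 2n so that a carry from thread 0 can still wrap into X[0], and the fourth-phase comparison and subtraction use Python integers for brevity.

```python
def digits_to_int(digits, base):
    """Combine little-endian digits into a Python integer."""
    return sum(d * base**k for k, d in enumerate(digits))


def int_to_digits(value, count, base):
    """Split an integer into `count` little-endian digits."""
    return [(value // base**k) % base for k in range(count)]


def montgomery_reduce_parallel(x_digits, m_digits, base):
    """Simulate X * R^(-1) mod M with R = base^n; x has 2n digits, m has n digits."""
    n = len(m_digits)
    x = list(x_digits)
    m_inv = (-pow(m_digits[0], -1, base)) % base   # MInv depends only on M[0]
    carry = [0] * n

    # Phases 1 and 2: round i clears digit i; X[0] doubles as thread 0's carry slot.
    for i in range(n):
        low = x[0] if i == 0 else x[0] + x[i]
        u = (low * m_inv) % base                   # read before any write this round
        for t in range(n):
            m = m_digits[t] * u + x[t + i] + carry[t]
            x[t + i] = m % base
            carry[t] = m // base
        x[0] = carry[0]                            # thread 0 publishes its carry

    # Phase 3: thread 0 clears X[0], then the remaining carries are propagated.
    x[0] = 0
    i = n
    while i <= 2 * n and any(carry):               # assumption: bound includes i = 2n
        for t in range(n):
            if carry[t] == 0:
                continue
            pos = (t + i) % (2 * n)                # position 2n wraps into X[0]
            m = x[pos] + carry[t]
            x[pos] = m % base
            carry[t] = m // base
        i += 1

    # Phase 4: X[0] holds the overflow a; the upper n digits hold the quotient.
    a = x[0]
    value = digits_to_int(x[n:2 * n], base)
    modulus = digits_to_int(m_digits, base)
    if a != 0 or value >= modulus:
        value = (value - modulus) % base**n
    return int_to_digits(value, n, base)
```

For example, with decimal digits, n = 2 and M = 97 (so R = 100), reducing X = 1234 via `montgomery_reduce_parallel([4, 3, 2, 1], [7, 9], 10)` yields digits representing 88, which equals 1234 × 100^(−1) mod 97.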

Next, the procedure for performing Montgomery reduction with the multiple-precision arithmetic apparatus 100 according to this embodiment is described with reference to the flowchart of FIG. 13.
In FIG. 13, the result X·R^(−1) mod M of the Montgomery reduction of the input value X is output to the variable Z.
The input value X is divided into 2n digits, and the calculation is performed using n threads.
The input value X is assumed to satisfy X < M·R.
Here, R = r^n and r = 2^b.
Z has a size of n words.
MInv is a one-word integer equal to −M^(−1) mod r.

In FIG. 13, variables written in uppercase are data in the memory 102, and variables written in lowercase are data in the general-purpose register 108.
Each variable is one word long.
However, variables written in bold in FIG. 13 are two words long.
In this specification, two-word variables are written in double quotation marks (for example, “m”).
If an input value occupies ceil(l/b) words and this is fewer than n, the missing words are cleared to 0 in advance.

First, for each thread, the arithmetic unit obtains from the general-purpose register 108 the thread's own thread number and the number of threads taking part in the calculation, and the control unit sets the variable i to 1 (S801).

次に、モンゴメリ・リダクション処理を行う。
具体的には、演算部が、Ｘ［０］×ＭＩｎｖの下位１ワードｕを求め、“ｍ”＝Ｍ［ｔ］×ｕ＋Ｘ［ｔ］を計算する。
ここで、Ｍｕｌ＿ｌｏ（ａ，ｂ）は１ワード入力ａ，ｂに対し、ａ×ｂを計算し、下位１ワードを出力することを意味する。
そして、演算部は、“ｍ”の下位１ワードをＸ［ｔ］に出力し、上位１ワードを“ｃ”に出力する（Ｓ８０２）。
また、演算部は、スレッド番号が０である場合（ｔ＝＝０？でＹＥＳ）、“ｃ”の下位１ワードをＸ［０］に出力する（Ｓ８０３）。
なお、図１３のＳ８０２とＳ８０３が第１フェーズの処理に相当する。 Next, Montgomery reduction processing is performed.
Specifically, the calculation unit obtains u, the lower 1 word of X[0] × MInv, and calculates “m” = M[t] × u + X[t].
Here, Mul_lo(a, b) means that a × b is calculated for 1-word inputs a and b, and the lower 1 word is output.
Then, the calculation unit outputs the lower 1 word of “m” to X[t] and the upper 1 word to “c” (S802).
When the thread number is 0 (YES at “t == 0?”), the calculation unit outputs the lower 1 word of “c” to X[0] (S803).
Note that S802 and S803 in FIG. 13 correspond to the processing of the first phase.

ｉがｎ未満の場合、演算部は、（Ｘ［０］＋Ｘ［ｉ］）×ＭＩｎｖの下位１ワードｕを求め、“ｍ”＝Ｍ［ｔ］×ｕ＋Ｘ［ｔ＋ｉ］＋“ｃ”を計算する。
また、演算部は、“ｍ”の下位１ワードをＸ［ｔ＋ｉ］に出力し、上位１ワードを“ｃ”に出力する。
そして、制御部が、変数ｉに１を加算する（Ｓ８０４）。
また、演算部は、スレッド番号が０である場合（ｔ＝＝０？でＹＥＳ）、“ｃ”の下位１ワードをＸ［０］に出力する（Ｓ８０５）。
なお、図１３のＳ８０４の処理１〜６、Ｓ８０５が、第２フェーズの１ラウンド分の処理に該当する。 When i is less than n, the calculation unit obtains u, the lower 1 word of (X[0] + X[i]) × MInv, and calculates “m” = M[t] × u + X[t+i] + “c”.
The calculation unit outputs the lower 1 word of “m” to X[t+i] and the upper 1 word to “c”.
Then, the control unit adds 1 to the variable i (S804).
When the thread number is 0 (YES at “t == 0?”), the calculation unit outputs the lower 1 word of “c” to X[0] (S805).
Note that processes 1 to 6 of S804, and S805, in FIG. 13 correspond to one round of the second phase.

ｉがｎ以上となったらモンゴメリ・リダクション処理を抜け、キャリー処理を行う。
始めに、演算部は、スレッド番号が０である場合（ｔ＝＝０？でＹＥＳ）、Ｘ［０］に０を出力する（Ｓ８０６）。
また、演算部は、“ｍ”＝Ｘ［（ｔ＋ｉ）％２ｎ］＋“ｃ”を計算し、“ｍ”の下位１ワードをＸ［（ｔ＋ｉ）％２ｎ］に、上位１ワードを“ｃ”に出力する。
制御部が、変数ｉに１を加算する（Ｓ８０７）。
ｉ≧２ｎまたは“ｃ”＝０となるまでループを繰り返す。
つまり、演算部は、変数ｉが２ｎに達するまでの間、制御部により変数ｉがインクリメントされる度に、停止していないスレッドｔにおいて、スレッドｔで得られたキャリー成分値“ｃ”が０であるか否かを判断し、キャリー成分値“ｃ”が０である場合（“ｃ”＝＝０？でＹＥＳ）にスレッドｔの処理を停止し、キャリー成分値“ｃ”が０でない場合（“ｃ”＝＝０？でＮＯ）は、Ｓ８０７の処理を行う。
そして、ｎ本のスレッド全てがループを抜けたらキャリー処理を終了する。
なお、図１３に示す“ｃ”＝＝０？の判断、Ｓ８０７の処理１〜４が、第３フェーズの１ラウンド分の処理に該当する。 When i becomes n or more, the Montgomery reduction process is exited and the carry process is performed.
First, when the thread number is 0 (YES when t == 0?), The arithmetic unit outputs 0 to X [0] (S806).
The calculation unit then calculates “m” = X[(t+i) % 2n] + “c”, and outputs the lower 1 word of “m” to X[(t+i) % 2n] and the upper 1 word to “c”.
The control unit adds 1 to the variable i (S807).
The loop is repeated until i ≧ 2n or “c” = 0.
That is, until the variable i reaches 2n, each time the control unit increments the variable i, the calculation unit determines, in each thread t that has not stopped, whether the carry component value “c” obtained in thread t is 0. If “c” is 0 (YES at “c” == 0?), the processing of thread t is stopped; if “c” is not 0 (NO at “c” == 0?), the processing of S807 is performed.
When all n threads have exited the loop, the carry process ends.
Note that the “c” == 0? decision and processes 1 to 4 of S807 shown in FIG. 13 correspond to one round of the third phase.

なお、メモリが十分にある場合は、２ｎの剰余演算を省略してもよい。
この場合、後述の減算処理で変数ａに格納するデータはＸ［２ｎ］になることに注意する。 If there is sufficient memory, the 2n remainder operation may be omitted.
Note that in this case, the data stored in the variable a in the subtraction process described later is X [2n].

キャリー処理を終了したら、減算処理を行う。
演算部は、まず、Ｘ［０］の値を変数ａに出力し、Ｘの値をｎワードシフトする（Ｓ８０８）。
つまり、演算部は、変数Ｘ［ｎ］〜Ｘ［２ｎ−１］の値を、それぞれ、変数Ｘ［０］〜Ｘ［ｎ−１］に格納する。
そして、ａの値が０でないか、Ｘ［ｎ−１，…，０］の値（Ｘ［ｎ−１］〜Ｘ［０］の値を連接して得られる値）が法Ｍ以上の場合は、演算部は、多倍長整数減算Ｘ−Ｍを実行し、結果をＸに格納する（Ｓ８０９）。
最後に、演算部は、Ｘ［０］〜Ｘ［ｎ−１］の値をＺ［０］〜Ｚ［ｎ−１］に格納する（Ｓ８１０）。
なお、図１３に示すＳ８０８〜Ｓ８１０が、第４フェーズの処理に相当する。 When the carry process is completed, a subtraction process is performed.
The arithmetic unit first outputs the value of X [0] to the variable a, and shifts the value of X by n words (S808).
That is, the calculation unit stores the values of the variables X [n] to X [2n-1] in the variables X [0] to X [n-1], respectively.
Then, if the value of a is not 0, or if the value of X[n−1, ..., 0] (the value obtained by concatenating the values of X[n−1] to X[0]) is greater than or equal to the modulus M, the calculation unit executes the multiple-precision integer subtraction X − M and stores the result in X (S809).
Finally, the calculation unit stores the values of X [0] to X [n−1] in Z [0] to Z [n−1] (S810).
Note that S808 to S810 illustrated in FIG. 13 correspond to the fourth phase process.
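The four phases of FIG. 13 described above can be sketched as a sequential simulation in which the n threads advance in lockstep. This is an illustrative sketch only, not the patented implementation: the function names are hypothetical, Python integers stand in for 1-word and 2-word variables, and the first phase is folded into the loop as round i = 0. One deliberate assumption differs from the text: the carry loop here runs until every thread's carry is 0 rather than stopping at i ≧ 2n, so that a carry leaving thread 0 at word 2n−1 still reaches X[0].

```python
def to_words(value, r, count):
    """Split value into `count` base-r words, least significant first."""
    words = []
    for _ in range(count):
        value, word = divmod(value, r)
        words.append(word)
    return words


def mont_reduce_lockstep(x, modulus, r, n):
    """Return x * R^-1 mod modulus with R = r**n, assuming 0 <= x < modulus * R."""
    M = to_words(modulus, r, n)
    X = to_words(x, r, 2 * n)
    minv = (-pow(modulus, -1, r)) % r      # MInv = -M^-1 mod r (Python 3.8+)
    c = [0] * n                             # per-thread carry register "c"

    # Phase 1 (round i = 0) and phase 2 (rounds 1 .. n-1).
    for i in range(n):
        carry0 = X[0] if i > 0 else 0       # thread 0's carry parked in X[0]
        u = ((carry0 + X[i]) * minv) % r
        # Every thread reads before any thread writes (lockstep execution).
        m = [M[t] * u + X[t + i] + c[t] for t in range(n)]
        for t in range(n):
            X[t + i], c[t] = m[t] % r, m[t] // r
        X[0] = c[0] % r                     # S803 / S805, thread 0 only

    # Phase 3: carry processing with wrap-around index (t + i) % 2n.
    X[0] = 0                                # S806
    i = n
    while any(c):                           # assumption: loop until no carry remains
        for t in range(n):
            if c[t]:
                pos = (t + i) % (2 * n)
                s = X[pos] + c[t]
                X[pos], c[t] = s % r, s // r
        i += 1

    # Phase 4: take the upper n words and subtract the modulus if needed.
    a = X[0]
    z = sum(word * r**k for k, word in enumerate(X[n:]))
    if a != 0 or z >= modulus:
        z += a * r**n - modulus
    return z


print(mont_reduce_lockstep(2345678, 3511, 10, 4))  # 1745, the value in FIGS. 14 and 15
```

On real SIMD hardware the reads and writes within a round would need the synchronization that the list comprehension followed by the write loop imitates here.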

図１４及び図１５は、本実施の形態に係るモンゴメリ・リダクションにおける値の変化を示す。
図１４及び図１５では、内容を理解しやすくするため、内部演算幅を十進数とし、４桁の演算を４つのスレッド（スレッド番号０〜３）で実行した場合について示す。図１４及び図１５では、２３４５６７８（Ｘ）＊Ｒ^−１ｍｏｄ３５１１（Ｍ）＝１７４５（Ｚ）を計算する例を示す。
この場合、ｎ＝４、Ｒ＝１０^ｎ＝１０^４、ｒ＝１０、ＭＩｎｖ＝−３５１１^−１ｍｏｄ１０＝９となる。
なお、図１５には、図１４との連続性を明示するために、図１４の最下段に示しているｉ＝３の際の計算過程を再度提示している。
また、図１６及び図１７は、図１４及び図１５に示した計算の内訳を、スレッド番号０について示している。 FIGS. 14 and 15 show changes in values in the Montgomery reduction according to the present embodiment.
To make the contents easier to follow, FIGS. 14 and 15 use decimal digits as the internal calculation width and show a 4-digit calculation executed by four threads (thread numbers 0 to 3). FIGS. 14 and 15 show an example in which 2345678 (X) * R⁻¹ mod 3511 (M) = 1745 (Z) is calculated.
In this case, n = 4, R = 10^n = 10^4, r = 10, and MInv = −3511⁻¹ mod 10 = 9.
In FIG. 15, in order to clarify the continuity with FIG. 14, the calculation process at the time of i = 3 shown at the bottom of FIG. 14 is presented again.
FIGS. 16 and 17 show the breakdown of the calculation shown in FIGS. 14 and 15 for the thread number 0.
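The decimal example can be cross-checked independently of the figures. The sketch below uses the textbook word-serial Montgomery reduction (not the parallel procedure of this embodiment) together with a direct modular-inverse computation; the function name `mont_reduce_serial` is hypothetical.

```python
def mont_reduce_serial(x, modulus, r, n):
    """Word-serial Montgomery reduction.

    Returns (x * r**-n mod modulus, list of per-round multipliers u),
    assuming 0 <= x < modulus * r**n and gcd(modulus, r) == 1.
    """
    minv = (-pow(modulus, -1, r)) % r       # MInv = -M^-1 mod r
    us = []
    for _ in range(n):
        u = (x % r) * minv % r              # makes the lowest word of x + u*M zero
        us.append(u)
        x = (x + u * modulus) // r          # exact division by r
    if x >= modulus:
        x -= modulus
    return x, us


z, us = mont_reduce_serial(2345678, 3511, 10, 4)
print(z, us)                                   # 1745 [2, 0, 3, 4]
print(2345678 * pow(10**4, -1, 3511) % 3511)   # 1745
```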

次に、本実施の形態に係る多倍長演算装置１００の効果を説明する。
本実施の形態では、モンゴメリ・リダクション処理はｎ回のループで終了する。
モンゴメリ・リダクション処理中にキャリーの加算処理も行う点が、本実施の形態の特徴の１つである。
また、各ステップで、最下位の桁を処理するスレッドのキャリーをメモリに出力することで、計算量をＯ（ｎ）にすることができる。 Next, the effect of the multiple length arithmetic apparatus 100 according to the present embodiment will be described.
In the present embodiment, the Montgomery reduction process is completed in n loops.
One of the features of this embodiment is that carry addition processing is also performed during Montgomery reduction processing.
In addition, at each step, the carry of the thread that processes the least significant digit is output to memory, which reduces the amount of calculation to O(n).

キャリー処理について、モンゴメリ・リダクション処理終了後にキャリー成分が残っていれば、前記キャリー成分が０になるまで、加算を行う。
キャリー処理のワーストケースはＯ（ｎ）となるが、入力がランダムである場合、キャリーが最後まで残る確率は非常に低いため、数回のキャリーの計算でループを抜けることができる。
また、２ｎで剰余をとることで、多倍長整数乗算と同じメモリサイズでモンゴメリ・リダクションを行うことができる。
さらに、ｎが２のべき乗の場合、演算コストの大きい剰余演算を、演算コストの小さいビット演算で実現することができる。
また、本実施の形態に係る多倍長演算装置１００は、前述したように、ＳＩＭＤ計算機等の通常の計算機で実現可能であり、専用装置を用いることなく、モンゴメリ・リダクションを高速に行うことができる。 Regarding carry processing, if carry components remain after Montgomery reduction processing, addition is performed until the carry components become zero.
The worst case of carry processing is O (n). However, if the input is random, the probability that the carry will remain until the end is very low, so it is possible to exit the loop with several carry calculations.
Further, by taking the word index modulo 2n, Montgomery reduction can be performed with the same memory size as multiple-precision integer multiplication.
Further, when n is a power of 2, the costly modulo operation can be replaced by an inexpensive bit operation.
In addition, as described above, the multiple length arithmetic apparatus 100 according to the present embodiment can be realized by an ordinary computer such as a SIMD computer, and can perform Montgomery reduction at high speed without using a dedicated device.
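The remark about n being a power of 2 can be illustrated as follows: when 2n is a power of two, the index (t + i) % 2n equals (t + i) & (2n − 1). This is a minimal sketch with hypothetical function names; the mask form is only valid under the stated power-of-two assumption.

```python
def wrap_index_mod(t, i, n):
    """Index into the 2n-word buffer using a general modulo operation."""
    return (t + i) % (2 * n)


def wrap_index_mask(t, i, n):
    """Same index using a bit mask; valid only when n is a power of 2."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of 2 for the mask form")
    return (t + i) & (2 * n - 1)


# Both forms agree for n = 4 (buffer of 8 words).
print([wrap_index_mask(3, i, 4) for i in range(4, 8)])  # [7, 0, 1, 2]
```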

なお、本実施の形態では、入力値Ｘのビット幅は、１ワード単位で分割した際に２ｎ個に分割され、変数Ｘが２ｎ個設けられるものとした。
このため、図１３のＳ８０７の処理１及び処理３で２ｎの剰余演算を行うことにした。
しかし、メモリが十分にある場合は、入力値Ｘを格納する変数Ｘを、ｖ個（但し、ｖはｎの倍数、つまり、ｖは３ｎ、４ｎ等）で構成してもよい。ただし、入力Ｘに格納される値の範囲は０≦Ｘ＜ＭＲを満たすものとする。
このような変数Ｘの場合は、図１３のＳ８０７の処理１及び処理３における２ｎの剰余演算は省略される。
また、同様に、減算処理（図１３のＳ８０８の処理１）で変数ａに格納するデータはＸ［２ｎ］になる。
更に、減算処理（図１３のＳ８０８の処理２）で、Ｘ［ｎ］〜Ｘ［２ｎ−１］の値を、それぞれ、Ｘ［０］〜Ｘ［ｎ−１］に格納する。 In the present embodiment, the bit width of the input value X is divided into 2n when divided in units of one word, and 2n variables X are provided .
For this reason, the modulo-2n operation is performed in process 1 and process 3 of S807 in FIG. 13.
However, if there is sufficient memory, the variable X that stores the input value X may consist of v variables (where v is a multiple of n, such as 3n or 4n). The range of values stored in the input X still satisfies 0 ≦ X < MR.
With such a variable X, the modulo-2n operation in process 1 and process 3 of S807 in FIG. 13 is omitted.
Likewise, the data stored in the variable a in the subtraction process (process 1 of S808 in FIG. 13) is X[2n].
Further, in the subtraction process (process 2 of S808 in FIG. 13), the values of X[n] to X[2n−1] are stored in X[0] to X[n−1], respectively.

以上、本実施の形態では、
複数のプロセッサを内蔵し、単一の命令を複数のデータに対して同時に実行できる計算機と、データを格納し、前記プロセッサが同時にアクセスできるメモリを有する多倍長整数演算装置であって、入力Ｘ，法Ｍ，内部演算幅ｂ（ｂｉｔ）に対し、“ｒ＝２ｂ”，“Ｒ＝ｒｎ”として定義されたＲと、“−Ｍ^−１ｍｏｄｒ”として定義されたＭＩｎｖを用いて、Ｚ＝ＸＲ^−１ｍｏｄＭを計算するモンゴメリ・リダクションを以下のステップで実行する多倍長整数演算装置を説明した。
１．入力データと法を計算機の内部演算幅毎に複数の桁（以降、それぞれＸ［ｎ］，Ｍ［ｎ］と記す）に分割するステップ
２．スレッドごとに自身のスレッド番号ｔと演算するスレッド本数ｎを取得するステップ
３．Ｘ［０］×ＭＩｎｖの下位１ワードｕを求め、“ｍ”＝Ｍ［ｔ］×ｕ＋Ｘ［ｔ］を計算するステップ
４．“ｍ”の下位１ワードをＸ［ｔ］に出力し、上位１ワードを“ｃ”に出力するステップ
５．スレッド番号が０である場合、“ｃ”の下位１ワードをＸ［０］に出力するステップ
６．桁ｉに１を設定するステップ
７．（Ｘ［０］＋Ｘ［ｉ］）×ＭＩｎｖの下位１ワードｕを求め、“ｍ”＝Ｍ［ｔ］×ｕ＋Ｘ［ｔ＋ｉ］＋“ｃ”を計算するステップ
８．“ｍ”の下位１ワードをＸ［ｔ＋ｉ］に出力し、上位１ワードを“ｃ”に出力するステップ
９．スレッド番号が０である場合、ｃの下位１ワードをＸ［０］に出力するステップ
１０．桁ｉに１を加算するステップ
１１．ｉ＜ｎの間、ステップ６〜９を実行するステップ
１２．スレッド番号が０である場合、Ｘ［０］に０を出力するステップ
１３．“ｍ”＝Ｘ［（ｔ＋ｉ）％２ｎ］＋“ｃ”を計算し、“ｍ”の下位１ワードをＸ［（ｔ＋ｉ）％２ｎ］に、上位１ワードを“ｃ”に出力するステップ
１４．桁ｉに１を加算するステップ
１５．ｉ＜２ｎかつ、ｃ≠０の間、ステップ１３〜１４を実行するステップ
１６．前記スレッドの全てがステップ１３〜１５を完了するのを待つステップ
１７．Ｘ［０］の値を変数ａに取得するステップ
１８．Ｘの値をｎワードシフトするステップ
１９．ａの値が０でないか、Ｘ［ｎ−１，…，０］の値がＭ以上の場合は、多倍長整数減算Ｘ−Ｍを実行し、結果をＸに格納するステップ
２０．Ｘ［０］〜Ｘ［ｎ−１］の値をＺ［０］〜Ｚ［ｎ−１］に格納するステップ As described above, in the present embodiment,
a multiple-precision integer arithmetic device has been described that includes a computer containing a plurality of processors and capable of executing a single instruction on a plurality of data simultaneously, and a memory that stores data and can be accessed by the processors simultaneously. For an input X, a modulus M, and an internal calculation width b (bits), the device uses R defined by “r = 2^b” and “R = r^n”, and MInv defined as “−M⁻¹ mod r”, and executes the Montgomery reduction that calculates Z = XR⁻¹ mod M in the following steps.
1. Dividing the input data and the modulus into a plurality of digits (hereinafter X[n] and M[n], respectively) according to the internal calculation width of the computer
2. Obtaining, for each thread, its own thread number t and the number n of threads taking part in the calculation
3. Obtaining u, the lower 1 word of X[0] × MInv, and calculating “m” = M[t] × u + X[t]
4. Outputting the lower 1 word of “m” to X[t] and the upper 1 word to “c”
5. If the thread number is 0, outputting the lower 1 word of “c” to X[0]
6. Setting the digit i to 1
7. Obtaining u, the lower 1 word of (X[0] + X[i]) × MInv, and calculating “m” = M[t] × u + X[t+i] + “c”
8. Outputting the lower 1 word of “m” to X[t+i] and the upper 1 word to “c”
9. If the thread number is 0, outputting the lower 1 word of c to X[0]
10. Adding 1 to the digit i
11. Executing steps 6 to 9 while i < n
12. If the thread number is 0, outputting 0 to X[0]
13. Calculating “m” = X[(t+i) % 2n] + “c”, and outputting the lower 1 word of “m” to X[(t+i) % 2n] and the upper 1 word to “c”
14. Adding 1 to the digit i
15. Executing steps 13 and 14 while i < 2n and c ≠ 0
16. Waiting for all of the threads to complete steps 13 to 15
17. Acquiring the value of X[0] in the variable a
18. Shifting the value of X by n words
19. If the value of a is not 0, or the value of X[n−1, ..., 0] is greater than or equal to M, executing the multiple-precision integer subtraction X − M and storing the result in X
20. Storing the values of X[0] to X[n−1] in Z[0] to Z[n−1]

また、本実施の形態では、
上記のステップ１２にて、“ｍ”＝Ｘ［ｔ＋ｉ］＋“ｃ”を計算し、“ｍ”の下位１ワードをＸ［ｔ＋ｉ］に、上位１ワードを“ｃ”に出力するステップを実行し、上記のステップ１７にて、Ｘ［２ｎ］の値を変数ａに取得するステップを実行する多倍長整数演算装置を説明した。 In the present embodiment,
In step 12 above, “m” = X [t + i] + “c” is calculated, and the step of outputting the lower 1 word of “m” to X [t + i] and the upper 1 word to “c” is executed. In the above step 17, the multiple-precision integer arithmetic unit that executes the step of acquiring the value of X [2n] in the variable a has been described.

実施の形態５．
本実施の形態では、多倍長モンゴメリ乗算処理を行う。
本実施の形態に係る多倍長演算装置１００の構成は図１に示したものと同じである。 Embodiment 5.
In the present embodiment, multiple-precision Montgomery multiplication is performed.
The configuration of the multiple length arithmetic apparatus 100 according to the present embodiment is the same as that shown in FIG. 1.

本実施の形態では、それぞれのビット幅が共通しており、それぞれのビット幅が演算部の演算ビット幅である１ワード（ｂビット）よりも大きい入力値Ｘと入力値Ｙと法Ｍとに対して、ｒ＝２^ｂ、Ｒ＝ｒ^ｎとして定義されたＲと、（−Ｍ^−１ｍｏｄｒ）として定義されたＭＩｎｖとを用いて、（ＸＹＲ^−１ｍｏｄＭ）を計算するモンゴメリ乗算を行う。ただし、０≦Ｘ，Ｙ＜Ｍとする。
なお、上記のｎは、法Ｍを１ワードごとに分割した際の法Ｍの分割数であり、ｎ≧２である。
また、本実施の形態では、入力値Ｘを１ワードごとに分割した際の入力値Ｘの分割数、入力値Ｙを１ワードごとに分割した際の入力値Ｙの分割数は、それぞれｎであるとする。 In the present embodiment, Montgomery multiplication that calculates (XYR⁻¹ mod M) is performed for an input value X, an input value Y, and a modulus M that share a common bit width larger than 1 word (b bits), which is the calculation bit width of the calculation unit, using R defined by r = 2^b and R = r^n, and MInv defined as (−M⁻¹ mod r). Here, 0 ≦ X, Y < M.
Note that n above is the number of pieces into which the modulus M is divided when it is split into 1-word units, and n ≧ 2.
In the present embodiment, the number of pieces of the input value X and the number of pieces of the input value Y, when each is divided into 1-word units, are both n.

まず、制御部が、入力値Ｘ、入力値Ｙ及び法Ｍを、それぞれ１ワードごとに分割する。
また、制御部は、入力値Ｘから分割されたｎ個の分割値を格納するためのｎ個の変数Ｘ［０］〜Ｘ［ｎ−１］と、入力値Ｙから分割されたｎ個の分割値を格納するためのｎ個の変数Ｙ［０］〜Ｙ［ｎ−１］とを、いずれかの記憶領域（例えば、メモリ１０２）に設ける。
また、制御部は、入力値Ｘ内の最下位ビットが含まれる分割値が０番目の変数Ｘ［０］に格納され、最上位ビットが含まれる分割値が（ｎ−１）番目の変数Ｘ［ｎ−１］に格納されるようにして、ｎ個の分割値を変数Ｘ［０］〜Ｘ［ｎ−１］に格納する。
更に、制御部は、入力値Ｙ内の最下位ビットが含まれる分割値が０番目の変数Ｙ［０］に格納され、最上位ビットが含まれる分割値が（ｎ−１）番目の変数Ｙ［ｎ−１］に格納されるようにして、ｎ個の分割値を変数Ｙ［０］〜Ｙ［ｎ−１］に格納する。
また、制御部は、法Ｍから分割されたｎ個の分割値を格納するためのｎ個の変数Ｍ［０］〜Ｍ［ｎ−１］を、いずれかの記憶領域（例えば、メモリ１０２）に設ける。
また、制御部は、法Ｍ内の最下位ビットが含まれる分割値が０番目の変数Ｍ［０］に格納され、最上位ビットが含まれる分割値が（ｎ−１）番目の変数Ｍ［ｎ−１］に格納されるようにして、ｎ個の分割値を変数Ｍ［０］〜Ｍ［ｎ−１］に格納する。
また、制御部は、演算部による計算結果を格納する２ｎ個の変数Ｚ［０］〜Ｚ［２ｎ−１］を、いずれかの記憶領域（例えば、メモリ１０２）に設ける。 First, the control unit divides the input value X, the input value Y, and the modulus M for each word.
The control unit also stores n variables X [0] to X [n−1] for storing n divided values divided from the input value X and n pieces of divided values from the input value Y. N variables Y [0] to Y [n−1] for storing the division value are provided in any storage area (for example, the memory 102).
Further, the control unit stores the divided value including the least significant bit in the input value X in the 0th variable X [0], and the divided value including the most significant bit as the (n−1) th variable X. N division values are stored in variables X [0] to X [n−1] so as to be stored in [n−1].
Further, the control unit stores the divided value including the least significant bit in the input value Y in the 0th variable Y [0], and the divided value including the most significant bit as the (n−1) th variable Y. The n divided values are stored in variables Y [0] to Y [n−1] so as to be stored in [n−1].
Further, the control unit provides n variables M[0] to M[n−1] for storing the n divided values obtained from the modulus M in any storage area (for example, the memory 102).
The control unit stores the n divided values in the variables M[0] to M[n−1] so that the divided value containing the least significant bit of the modulus M is stored in the 0th variable M[0] and the divided value containing the most significant bit is stored in the (n−1)th variable M[n−1].
In addition, the control unit provides 2n variables Z [0] to Z [2n−1] for storing the calculation result by the calculation unit in any storage area (for example, the memory 102).

演算部は、実施の形態１〜４と同様に、スレッド番号として「０〜ｎ−１」が設定されているｎ個のスレッドを並列に実行する。
演算部の処理は、第１フェーズ〜第４フェーズの処理に大別される。 As in the first to fourth embodiments, the calculation unit executes n threads in which “0 to n−1” is set as the thread number in parallel.
The processing of the computing unit is roughly divided into first phase to fourth phase processing.

演算部は、第１フェーズの処理として、以下の〈１−ａ〉〜〈１−ｆ〉の処理を行う。
〈１−ａ〉スレッドｔにおいて、入力値Ｘの０番目の変数Ｘ［０］の値と、入力値Ｙの０番目の変数Ｙ［０］の値と、ＭＩｎｖとを用いて、（Ｘ［０］の値）×（Ｙ［０］の値）×ＭＩｎｖを計算し、２ワードの計算結果の下位１ワードの値をｕとする。
〈１−ｂ〉スレッドｔにおいて、法Ｍのｔ番目の変数Ｍ［ｔ］の値と前記ｕとを用いて、（Ｍ［ｔ］の値）×ｕを計算し、２ワードの計算結果をｕｍとする。
〈１−ｃ〉スレッドｔにおいて、入力値Ｘの０番目の変数Ｘ［０］の値と入力値Ｙのｔ番目の変数Ｙ［ｔ］の値とを用いて、（Ｘ［０］の値）×（Ｙ［ｔ］の値）を計算し、２ワードの計算結果をｘｙとする。
〈１−ｄ〉スレッドｔにおいて、前記ｕｍの下位１ワードと前記ｘｙの下位１ワードとを加算し、２ワードの計算結果をｍとし、計算結果ｍの下位１ワードの値をｔ番目の変数Ｚ［ｔ］に格納する。
〈１−ｅ〉スレッドｔにおいて、前記ｍの上位１ワードと前記ｕｍの上位１ワードと前記ｘｙの上位１ワードとを加算し、２ワードの計算結果をキャリー成分値ｃとする。
〈１−ｆ〉０番目のスレッドであるスレッド０において、スレッド０で得られたキャリー成分値ｃの下位１ワードの値を変数Ｚ［０］の新たな値とする。 The computing unit performs the following processes <1-a> to <1-f> as the first phase process.
<1-a> In thread t, (value of X[0]) × (value of Y[0]) × MInv is calculated using the value of the 0th variable X[0] of the input value X, the value of the 0th variable Y[0] of the input value Y, and MInv, and the value of the lower 1 word of the 2-word calculation result is taken as u.
<1-b> In the thread t, using the value of the t-th variable M [t] of the modulus M and the u, (M [t] value) × u is calculated, and the calculation result of 2 words is um.
<1-c> In the thread t, using the value of the 0th variable X [0] of the input value X and the value of the tth variable Y [t] of the input value Y, the value of (X [0] ) × (value of Y [t]), and the calculation result of 2 words is xy.
<1-d> In thread t, the lower 1 word of um and the lower 1 word of xy are added, the calculation result of 2 words is m, and the value of the lower 1 word of calculation result m is the t-th variable. Store in Z [t].
<1-e> In thread t, the upper 1 word of m, the upper 1 word of um, and the upper 1 word of xy are added, and the calculation result of 2 words is set as a carry component value c.
<1-f> In thread 0, which is the 0th thread, the value of the lower 1 word of carry component value c obtained in thread 0 is set as the new value of variable Z [0].
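Steps <1-b> to <1-e> keep every intermediate value within two words by adding lower and upper words separately. The sketch below illustrates that split for a single thread; the helper names `split` and `phase1_word` are hypothetical, and Python integers stand in for machine words.

```python
def split(value, r):
    """Return (lower word, upper part) of a value in base r."""
    return value % r, value // r


def phase1_word(m_t, u, x0, y_t, r):
    """Compute M[t]*u + X[0]*Y[t] as (word for Z[t], 2-word carry c).

    Only sums of 1-word quantities are formed, as in <1-b> to <1-e>.
    """
    um_lo, um_hi = split(m_t * u, r)        # <1-b> um = M[t] * u
    xy_lo, xy_hi = split(x0 * y_t, r)       # <1-c> xy = X[0] * Y[t]
    m_lo, m_hi = split(um_lo + xy_lo, r)    # <1-d> add the lower words only
    c = m_hi + um_hi + xy_hi                # <1-e> carry component value
    return m_lo, c


# Decimal toy case: 5*7 + 8*9 = 107 -> word 7, carry 10.
print(phase1_word(5, 7, 8, 9, 10))  # (7, 10)
```

The pair returned always satisfies word + c × r = M[t] × u + X[0] × Y[t], and c stays below r², so it fits in a 2-word variable.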

次に、制御部は、カウンタ値ｉを１に設定し、演算部に第２フェーズの処理を開始させる。
そして、制御部は、カウンタ値ｉがｎに達するまで、全てのスレッドｔにおいて第２フェーズの１ラウンド分の処理が終了する度に、カウンタ値ｉをインクリメントする。 Next, the control unit sets the counter value i to 1, and causes the calculation unit to start the second phase process.
The control unit increments the counter value i every time processing for one round of the second phase is completed in all the threads t until the counter value i reaches n.

演算部は、制御部によりカウンタ値ｉがインクリメントされる度に、第２フェーズの１ラウンド分の処理を繰り返す。
第２フェーズの１ラウンド分の処理は、以下の〈２−ａ〉〜〈２−ｆ〉の処理である。
〈２−ａ〉スレッドｔにおいて、０番目の変数Ｚ［０］とｉ番目の変数Ｚ［ｉ］と、入力値Ｘのｉ番目の変数Ｘ［ｉ］と、入力値Ｙの０番目の変数Ｙ［０］と、ＭＩｎｖとを用いて、｛（Ｚ［０］の値）＋（Ｚ［ｉ］の値）＋（Ｘ［ｉ］の値）×（Ｙ［０］の値）｝×ＭＩｎｖを計算し、２ワードの計算結果の下位１ワードの値をｕとする。
〈２−ｂ〉スレッドｔにおいて、法Ｍのｔ番目の変数Ｍ［ｔ］の値と前記ｕとを用いて、（Ｍ［ｔ］の値）×ｕを計算し、２ワードの計算結果をｕｍとする。
〈２−ｃ〉スレッドｔにおいて、入力値Ｘのｉ番目の変数Ｘ［ｉ］の値と入力値Ｙのｔ番目の変数Ｙ［ｔ］の値とを用いて、（Ｘ［ｉ］の値）×（Ｙ［ｔ］の値）を計算し、２ワードの計算結果をｘｙとする。
〈２−ｄ〉スレッドｔにおいて、前記ｕｍの下位１ワードと、前記ｘｙの下位１ワードと、ステップｔで得られたキャリー成分値ｃの下位１ワードと、（ｔ＋ｉ）番目の変数Ｚ［ｔ＋ｉ］の値とを加算し、２ワードの計算結果をｍとし、計算結果ｍの下位１ワードの値を（ｔ＋ｉ）番目の変数Ｚ［ｔ＋ｉ］の新たな値とする。
〈２−ｅ〉スレッドｔにおいて、前記ｍの上位１ワードと、前記ｕｍの上位１ワードと、前記ｘｙの上位１ワードと、ステップｔで得られたキャリー成分値ｃの上位１ワードとを加算し、２ワードの計算結果を新たなキャリー成分値ｃとする。
〈２−ｆ〉０番目のスレッドであるスレッド０において、スレッド０で得られたキャリー成分値ｃの下位１ワードの値を変数Ｚ［０］の新たな値とする。
また、演算部は、カウンタ値ｉがｎに達すると、スレッド０において、値０を変数Ｚ［０］の新たな値とする。 The calculation unit repeats the process for one round of the second phase every time the counter value i is incremented by the control unit.
The processing for one round of the second phase is the following <2-a> to <2-f> processing.
<2-a> In thread t, {(value of Z[0]) + (value of Z[i]) + (value of X[i]) × (value of Y[0])} × MInv is calculated using the 0th variable Z[0], the i-th variable Z[i], the i-th variable X[i] of the input value X, the 0th variable Y[0] of the input value Y, and MInv, and the value of the lower 1 word of the 2-word calculation result is taken as u.
<2-b> In the thread t, using the value of the t-th variable M [t] of the modulus M and u, the value of (M [t]) × u is calculated. um.
<2-c> In the thread t, using the value of the i-th variable X [i] of the input value X and the value of the t-th variable Y [t] of the input value Y, the value of (X [i] ) × (value of Y [t]), and the calculation result of 2 words is xy.
<2-d> In thread t, the lower 1 word of um, the lower 1 word of xy, the lower 1 word of the carry component value c obtained in step t, and the value of the (t+i)-th variable Z[t+i] are added; the 2-word calculation result is taken as m, and the value of the lower 1 word of m becomes the new value of the (t+i)-th variable Z[t+i].
<2-e> In thread t, the upper 1 word of m, the upper 1 word of um, the upper 1 word of xy, and the upper 1 word of carry component value c obtained in step t are added Then, the calculation result of 2 words is set as a new carry component value c.
<2-f> In the thread 0 which is the 0th thread, the value of the lower 1 word of the carry component value c obtained in the thread 0 is set as a new value of the variable Z [0].
Further, when the counter value i reaches n, the arithmetic unit sets the value 0 as a new value of the variable Z [0] in the thread 0.

カウンタ値ｉがｎに達すると、スレッド０において値０が変数Ｘ［０］の新たな値とされた後に、制御部は、演算部に第３フェーズの処理を開始させる。
そして、制御部は、カウンタ値ｉが２ｎに達するか、全てのスレッドｔが第３フェーズの処理を停止するまで、停止していない全てのスレッドｔにおいて第３フェーズの１ラウンド分の処理が終了する度に、カウンタ値ｉをインクリメントする。 When the counter value i reaches n, after the value 0 is set as a new value of the variable X [0] in the thread 0, the control unit causes the calculation unit to start the third phase process.
Then, until the counter value i reaches 2n or all threads t have stopped the third-phase processing, the control unit increments the counter value i each time one round of the third phase has finished in every thread t that has not stopped.

演算部は、カウンタ値ｉが２ｎに達するまでの間、制御部によりカウンタ値ｉがインクリメントされる度に、停止していないスレッドｔにおいて、第３フェーズの１ラウンド分の処理を繰り返す。
第３フェーズの１ラウンド分の処理は、以下の〈３−ａ〉及び〈３−ｂ〉の処理である。
〈３−ａ〉スレッドｔで得られたキャリー成分値ｃが０であるか否かを判断し、キャリー成分値ｃが０である場合にスレッドｔの第３フェーズの処理を停止する。
〈３−ｂ〉０でない場合に、スレッドｔで得られたキャリー成分値ｃの下位１ワードと変数Ｚ［（ｔ＋ｉ）ｍｏｄ２ｎ］の値とを加算し、２ワードの計算結果をｍとし、計算結果ｍの下位１ワードの値を変数Ｚ［（ｔ＋ｉ）ｍｏｄ２ｎ］の新たな値とし、前記ｍの上位１ワードとスレッドｔで得られたキャリー成分値ｃの上位１ワードとを加算し、２ワードの計算結果を新たなキャリー成分値ｃとする。 The arithmetic unit repeats the process for one round of the third phase in the thread t that is not stopped each time the counter value i is incremented by the control unit until the counter value i reaches 2n.
The processing for one round of the third phase is the following <3-a> and <3-b> processing.
<3-a> It is determined whether or not the carry component value c obtained by the thread t is 0. When the carry component value c is 0, the third phase processing of the thread t is stopped.
<3-b> If it is not 0, the lower 1 word of the carry component value c obtained in the thread t and the value of the variable Z [(t + i) mod 2n] are added, and the calculation result of 2 words is set as m. The value of the lower 1 word of the calculation result m is set as a new value of the variable Z [(t + i) mod 2n], and the upper 1 word of the m and the upper 1 word of the carry component value c obtained by the thread t are added. The calculation result of 2 words is set as a new carry component value c.

演算部は、第４フェーズの処理として、以下の処理を行う。
変数Ｚ［０］の値を変数ａに格納し、変数Ｚ［ｎ］〜Ｚ［２ｎ−１］の値を、それぞれ、変数Ｚ［０］〜Ｚ［ｎ−１］に格納する。
変数ａの値が０でない場合、又は変数Ｚ［ｎ−１］〜Ｚ［０］の値を連接して得られる値が法Ｍ以上の場合に、Ｚ−Ｍを計算し、計算結果を変数Ｚ［０］〜Ｚ［ｎ−１］に格納する。 The calculation unit performs the following processing as the fourth phase processing.
The value of variable Z [0] is stored in variable a, and the values of variables Z [n] to Z [2n-1] are stored in variables Z [0] to Z [n-1], respectively.
When the value of the variable a is not 0, or when the value obtained by concatenating the values of the variables Z[n−1] to Z[0] is greater than or equal to the modulus M, Z − M is calculated and the result is stored in the variables Z[0] to Z[n−1].
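The fourth-phase comparison and subtraction operate on word arrays rather than on a single integer. The following sketch shows one way to do this with a borrow chain; the function names are hypothetical, and the word lists are least significant word first, taken after the n-word shift.

```python
def geq_words(A, B):
    """Return True if A >= B, comparing from the most significant word down."""
    for a, b in zip(reversed(A), reversed(B)):
        if a != b:
            return a > b
    return True


def sub_words(A, B, r):
    """Return (A - B as words, final borrow) using a word-by-word borrow chain."""
    out, borrow = [], 0
    for a, b in zip(A, B):
        d = a - b - borrow
        borrow = 1 if d < 0 else 0
        out.append(d + r if d < 0 else d)
    return out, borrow


def final_subtract(a, Z, M, r):
    """Fourth phase: subtract M once if the overflow word a is set or Z >= M."""
    if a != 0 or geq_words(Z, M):
        Z, _ = sub_words(Z, M, r)   # a final borrow is cancelled by a = 1
    return Z


M = [1, 1, 5, 3]                              # 3511, least significant digit first
print(final_subtract(0, [5, 4, 7, 1], M, 10))  # [5, 4, 7, 1]  (1745 < 3511)
print(final_subtract(0, [0, 0, 0, 4], M, 10))  # [9, 8, 4, 0]  (4000 - 3511 = 489)
```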

次に、図１８のフローチャートを参照して、本実施の形態に係る多倍長演算装置１００でモンゴメリ乗算を行う場合の手順を説明する。
図１８では、入力値Ｘ，Ｙのモンゴメリ乗算の結果ＸＹＲ^−１ｍｏｄＭをＺに出力する。
Ｚは２ｎワードのサイズをもち、中間変数値の格納も行う。
入力値Ｘ，Ｙはｎ桁に分割され、ｎ個のスレッドを用いて計算を行う。
また、入力値Ｘ，ＹはＸ，Ｙ＜Ｍの関係を満たすものとする。
ここで、Ｒ＝ｒ^ｎとし、ｒ＝２^ｂとする。
ＭＩｎｖは−Ｍ^−１ｍｏｄｒを満たす１ワード整数である。 Next, the procedure for performing Montgomery multiplication with the multiple length arithmetic apparatus 100 according to the present embodiment will be described with reference to the flowchart of FIG. 18.
In FIG. 18, the result XYR⁻¹ mod M of the Montgomery multiplication of the input values X and Y is output to Z.
Z has a size of 2n words and also stores intermediate values.
The input values X and Y are each divided into n digits, and the calculation is performed using n threads.
The input values X and Y satisfy X, Y < M.
Here, R = r^n and r = 2^b.
MInv is a 1-word integer equal to −M⁻¹ mod r.

図１８において、大文字で示す変数はメモリ１０２上のデータとし、小文字で示す変数は汎用レジスタ１０８上のデータとする。
また、各変数のデータ長は１ワードとする。
ただし、図１８において太字で表記した変数は２ワードとする。
なお、明細書では、２ワードの変数は、ダブルクオーテーションで表現する（例えば、“ｃ”）。
また、入力値ｃｅｉｌ（ｌ／ｂ）ワードがｎに満たない場合は、満たない部分のワードを予め０でクリアする。 In FIG. 18, variables indicated by capital letters are data on the memory 102, and variables indicated by small letters are data on the general-purpose register 108.
The data length of each variable is 1 word.
However, the variable written in bold in FIG. 18 is 2 words.
In the specification, a 2-word variable is expressed by double quotation (for example, “c”).
If the input value occupies ceil(l/b) words and this is fewer than n, the missing words are cleared to 0 in advance.

まず、演算部が、汎用レジスタ１０８から、スレッドごとに自身のスレッド番号と演算するスレッド本数を取得し、また、Ｚを０にセットし、制御部が変数ｉを１にセットする（Ｓ１００１）。 First, the calculation unit obtains its own thread number and the number of threads to be calculated for each thread from the general-purpose register 108, sets Z to 0, and sets the variable i to 1 (S1001).

次にモンゴメリ乗算処理を行う。
具体的には、演算部が、Ｘ［０］×Ｙ［０］×ＭＩｎｖの下位１ワードｕを求め、Ｍ［ｔ］×ｕ＋Ｘ［０］×Ｙ［ｔ］を計算する。
計算を２ワード以下の変数で行うため、Ｍ［ｔ］×ｕ，Ｘ［０］×Ｙ［ｔ］を上位ワードと下位ワードに分解し、下位ワードみの加算を行なった後、加算結果の下位１ワードをＺ［ｔ］に出力する。
また、演算部は、前記加算結果の上位ワードと前記Ｍ［ｔ］×ｕ，Ｘ［０］×Ｙ［ｔ］の上位ワードを加算し、２ワードデータ“ｃ”を生成する（Ｓ１００２）。
演算部は、スレッド番号が０である場合（ｔ＝＝０？でＹＥＳ）、“ｃ”の下位１ワードをＺ［０］に出力する（Ｓ１００３）。
なお、図１８のＳ１００２とＳ１００３が第１フェーズの処理に相当する。 Next, Montgomery multiplication processing is performed.
Specifically, the calculation unit obtains u, the lower 1 word of X[0] × Y[0] × MInv, and calculates M[t] × u + X[0] × Y[t].
So that the calculation uses only variables of 2 words or less, M[t] × u and X[0] × Y[t] are each split into an upper word and a lower word, only the lower words are added, and the lower 1 word of that sum is output to Z[t].
The calculation unit then adds the upper word of that sum to the upper words of M[t] × u and X[0] × Y[t] to generate the 2-word data “c” (S1002).
When the thread number is 0 (YES at “t == 0?”), the calculation unit outputs the lower 1 word of “c” to Z[0] (S1003).
Note that S1002 and S1003 in FIG. 18 correspond to the processing of the first phase.

ｉがｎ未満の場合、演算部は、（Ｚ［０］＋Ｚ［ｉ］＋Ｘ［ｉ］×Ｙ［０］）×ＭＩｎｖの下位１ワードｕを求め、Ｍ［ｔ］×ｕ＋Ｘ［０］×Ｙ［ｔ］＋Ｚ［ｔ＋ｉ］＋“ｃ”を計算する。
計算を２ワード以下の変数で行うため、Ｍ［ｔ］×ｕ，Ｘ［０］×Ｙ［ｔ］，“ｃ”を上位ワードと下位ワードに分解し、下位ワードみの加算を行なった後、加算結果の下位１ワードをＺ［ｔ＋ｉ］に出力する。
また、演算部は、前記加算結果の上位ワードと前記Ｍ［ｔ］×ｕ，Ｘ［０］×Ｙ［ｔ］，“ｃ”の上位ワードを加算し、２ワードデータ“ｃ”を生成する。
また、制御部が、変数ｉに１を加算する（Ｓ１００４）。
演算部は、スレッド番号が０である場合（ｔ＝＝０？でＹＥＳ）、“ｃ”の下位１ワードをＺ［０］に出力する（Ｓ１００５）。
なお、図１８のＳ１００４の処理１〜１５、Ｓ１００５が、第２フェーズの１ラウンド分の処理に該当する。 When i is less than n, the arithmetic unit obtains the lower 1 word u of (Z [0] + Z [i] + X [i] × Y [0]) × MINv, and M [t] × u + X [0] × Y [t] + Z [t + i] + “c” is calculated.
So that the calculation uses only variables of 2 words or less, M[t] × u, X[0] × Y[t], and “c” are each split into an upper word and a lower word, only the lower words are added, and the lower 1 word of that sum is output to Z[t+i].
The calculation unit then adds the upper word of that sum to the upper words of M[t] × u, X[0] × Y[t], and “c” to generate the 2-word data “c”.
The control unit adds 1 to the variable i (S1004).
When the thread number is 0 (YES at “t == 0?”), the calculation unit outputs the lower 1 word of “c” to Z[0] (S1005).
Note that processes 1 to 15 of S1004, and S1005, in FIG. 18 correspond to one round of the second phase.

ｉがｎ以上となったらモンゴメリ乗算処理を抜け、キャリー処理を行う。
始めに、演算部は、スレッド番号が０である場合（ｔ＝＝０？でＹＥＳ）、Ｚ［０］に０を出力する（Ｓ１００６）。
次に、演算部は、Ｚ［（ｔ＋ｉ）％２ｎ］＋“ｃ”を計算する。
計算を２ワード以下の変数で行うため、“ｃ”を上位ワードと下位ワードに分解し、下位ワードみの加算を行なった後、加算結果の下位１ワードをＺ［（ｔ＋ｉ）％２ｎ］に出力する。
また、演算部は、前記加算結果の上位ワードと前記“ｃ”の上位ワードを加算し、２ワードデータ“ｃ”を生成する。
制御部が、変数ｉに１を加算する（Ｓ１００７）。
ｉ≧２ｎまたは“ｃ”＝０となるまでループを繰り返す。
つまり、演算部は、変数ｉが２ｎに達するまでの間、制御部により変数ｉがインクリメントされる度に、停止していないスレッドｔにおいて、スレッドｔで得られたキャリー成分値“ｃ”が０であるか否かを判断し、キャリー成分値“ｃ”が０である場合（“ｃ”＝＝０？でＹＥＳ）にスレッドｔの処理を停止し、キャリー成分値“ｃ”が０でない場合（“ｃ”＝＝０？でＮＯ）は、Ｓ１００７の処理を行う。
そして、ｎ本のスレッド全てがループを抜けたらキャリー処理を終了する。
なお、図１８に示す“ｃ”＝＝０？の判断、Ｓ１００７の処理１〜４が、第３フェーズの１ラウンド分の処理に該当する。 When i is greater than or equal to n, the Montgomery multiplication process is exited and the carry process is performed.
First, when the thread number is 0 (YES when t == 0?), The arithmetic unit outputs 0 to Z [0] (S1006).
Next, the calculation unit calculates Z [(t + i)% 2n] + “c”.
So that the calculation uses only variables of 2 words or less, “c” is split into an upper word and a lower word, only the lower word is added, and the lower 1 word of that sum is output to Z[(t+i) % 2n].
The calculation unit then adds the upper word of that sum to the upper word of “c” to generate the 2-word data “c”.
The control unit adds 1 to the variable i (S1007).
The loop is repeated until i ≧ 2n or “c” = 0.
That is, until the variable i reaches 2n, each time the control unit increments the variable i, the calculation unit determines, in each thread t that has not stopped, whether the carry component value “c” obtained in thread t is 0. If “c” is 0 (YES at “c” == 0?), the processing of thread t is stopped; if “c” is not 0 (NO at “c” == 0?), the processing of S1007 is performed.
When all n threads have exited the loop, the carry process ends.
Note that the “c” == 0? decision and processes 1 to 4 of S1007 shown in FIG. 18 correspond to one round of the third phase.

なお、メモリが十分にある場合は、２ｎの剰余演算を省略してもよい。
この場合、後述の減算処理で変数ａに格納するデータはＺ［２ｎ］になることに注意する。 If there is sufficient memory, the 2n remainder operation may be omitted.
Note that in this case, the data stored in the variable a in the subtraction process described later is Z [2n].

キャリー処理を終了したら、減算処理を行う。
演算部は、まず、Ｚ［０］の値を変数ａに出力し、Ｚの値をｎワードシフトする（Ｓ１００８）。
つまり、演算部は、変数Ｚ［ｎ］〜Ｚ［２ｎ−１］の値を、それぞれ、変数Ｚ［０］〜Ｚ［ｎ−１］に格納する。
そして、ａの値が０でないか、Ｚ［ｎ−１，…，０］の値（Ｚ［ｎ−１］〜Ｚ［０］の値を連接して得られる値）が法Ｍ以上の場合は、演算部は、多倍長整数減算Ｚ−Ｍを実行し、結果をＺに格納する（Ｓ１００９）。
最後に、演算部は、Ｚ［０］〜Ｚ［ｎ−１］を出力する。
なお、図１３に示すＳ１００８、Ｓ１００９及びＺ［０］〜Ｚ［ｎ−１］の出力が、第４フェーズの処理に相当する。 When the carry process is completed, a subtraction process is performed.
The arithmetic unit first outputs the value of Z [0] to the variable a, and shifts the value of Z by n words (S1008).
That is, the calculation unit stores the values of variables Z [n] to Z [2n-1] in variables Z [0] to Z [n-1], respectively.
Then, if the value of a is not 0, or if the value of Z[n−1, ..., 0] (the value obtained by concatenating the values of Z[n−1] to Z[0]) is greater than or equal to the modulus M, the calculation unit executes the multiple-precision integer subtraction Z − M and stores the result in Z (S1009).
Finally, the calculation unit outputs Z [0] to Z [n-1].
Note that the outputs of S1008, S1009, and Z [0] to Z [n-1] shown in FIG. 13 correspond to the fourth phase process.
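The Montgomery multiplication phases described above can be sketched as a lockstep simulation of n threads. This is an illustrative sketch under stated assumptions, not the patented implementation: function names are hypothetical, Python integers stand in for machine words, the first phase is folded into the loop as round i = 0, the partial product in round i is taken as X[i] × Y[t] as in step <2-c>, and the carry loop runs until every carry is 0 rather than stopping at i ≧ 2n.

```python
def to_words(value, r, count):
    """Split value into `count` base-r words, least significant first."""
    words = []
    for _ in range(count):
        value, word = divmod(value, r)
        words.append(word)
    return words


def split(value, r):
    """Return (lower word, upper part) of a value in base r."""
    return value % r, value // r


def mont_mul_lockstep(x, y, modulus, r, n):
    """Return x * y * R^-1 mod modulus with R = r**n, assuming 0 <= x, y < modulus."""
    M, X, Y = (to_words(v, r, n) for v in (modulus, x, y))
    Z = [0] * (2 * n)
    minv = (-pow(modulus, -1, r)) % r       # MInv = -M^-1 mod r (Python 3.8+)
    c = [0] * n                              # per-thread 2-word carry "c"

    # Phase 1 (round i = 0) and phase 2 (rounds 1 .. n-1).
    for i in range(n):
        carry0 = Z[0] if i > 0 else 0        # thread 0's carry parked in Z[0]
        u = ((carry0 + Z[i] + X[i] * Y[0]) * minv) % r
        results = []
        for t in range(n):                   # all reads happen before any write
            um_lo, um_hi = split(M[t] * u, r)
            xy_lo, xy_hi = split(X[i] * Y[t], r)
            c_lo, c_hi = split(c[t], r)
            m_lo, m_hi = split(um_lo + xy_lo + c_lo + Z[t + i], r)
            results.append((m_lo, m_hi + um_hi + xy_hi + c_hi))
        for t in range(n):
            Z[t + i], c[t] = results[t]
        Z[0] = c[0] % r                      # thread 0 only

    # Phase 3: carry processing with wrap-around index (t + i) % 2n.
    Z[0] = 0
    i = n
    while any(c):                            # assumption: loop until no carry remains
        for t in range(n):
            if c[t]:
                pos = (t + i) % (2 * n)
                c_lo, c_hi = split(c[t], r)
                m_lo, m_hi = split(Z[pos] + c_lo, r)
                Z[pos], c[t] = m_lo, m_hi + c_hi
        i += 1

    # Phase 4: take the upper n words and subtract the modulus if needed.
    a = Z[0]
    z = sum(word * r**k for k, word in enumerate(Z[n:]))
    if a != 0 or z >= modulus:
        z += a * r**n - modulus
    return z


# Cross-check against a direct computation (8-bit words, n = 4; values are arbitrary).
M, x, y = 0xF1234567, 0x12345678, 0x0FEDCBA9
print(mont_mul_lockstep(x, y, M, 256, 4) == x * y * pow(256**4, -1, M) % M)
```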

FIGS. 19 and 20 show how values change in the Montgomery multiplication according to the present embodiment.
To make the content easier to follow, FIGS. 19 and 20 use decimal digits as the internal operation width and show a four-digit operation executed by four threads (thread numbers 0 to 3). FIGS. 14 and 15 show an example of calculating 5678 (X) × 4321 (Y) × R⁻¹ mod 6131 (M) = 3736 (Z).
In this case, n = 4, R = 10ⁿ = 10⁴, r = 10, and MInv = −6131⁻¹ mod 10 = 9.
To make the continuity with FIG. 19 explicit, FIG. 20 repeats part of the calculation for i = 3 shown at the bottom of FIG. 19.
FIGS. 21 and 22 show the breakdown of the calculations of FIGS. 19 and 20 for thread number 0.
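The parameters of this decimal example can be reproduced with a few lines of Python. The helper names `to_words` and `montgomery_params` are hypothetical, and `pow(M, -1, r)` (modular inverse) requires Python 3.8 or later.

```python
def to_words(x, n, r):
    """Split x into n base-r words, least significant word first."""
    return [(x // r**j) % r for j in range(n)]

def montgomery_params(M, n, r):
    """Return (R, MInv) with R = r**n and MInv = -M^-1 mod r."""
    R = r ** n
    m_inv = (-pow(M, -1, r)) % r      # needs gcd(M, r) == 1
    return R, m_inv

R, m_inv = montgomery_params(6131, 4, 10)
X = to_words(5678, 4, 10)             # one decimal digit per thread
assert (6131 * m_inv + 1) % 10 == 0   # defining property of MInv
```

Here `R` is 10⁴, `m_inv` is 9, and `X` is `[8, 7, 6, 5]`, so thread 0 holds the least significant digit.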

Next, the effects of the multiple-precision arithmetic device 100 according to the present embodiment will be described.
In the present embodiment, the Montgomery multiplication process completes in n loop iterations.
One feature of the present embodiment is that carry addition is also performed during the Montgomery multiplication process.
In addition, by splitting the data to be added into upper and lower words and adding them separately, a calculation that would otherwise require a three-word variable can be performed with variables of two words or fewer.
Furthermore, at each step, the thread that processes the least significant digit outputs its carry to memory, which reduces the amount of calculation to O(n).

In the carry process, if carry components remain after the Montgomery multiplication process, additions are performed until the carry components become 0.
The worst case of the carry process is O(n); however, when the input is random, the probability that a carry persists to the end is very low, so the loop can be exited after a few carry calculations.
In addition, by taking the remainder modulo 2n, the Montgomery multiplication process can be implemented with the same memory size as multiple-precision integer multiplication.
Furthermore, when n is a power of 2, the costly remainder operation can be implemented with an inexpensive bit operation.
Also, as described above, the multiple-precision arithmetic device 100 according to the present embodiment can be implemented on an ordinary computer such as a SIMD computer, and can perform Montgomery multiplication at high speed without using a dedicated device.
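As a small illustration of the bit-operation remark, when n is a power of 2 the index (t + i) mod 2n can be obtained with a single mask. The helper name `wrap_index` is an assumption made for this sketch.

```python
def wrap_index(k, n):
    """Return k mod 2n using a bit mask; only valid when n is a power of two."""
    if n <= 0 or (n & (n - 1)) != 0:
        raise ValueError("n must be a power of two")
    return k & (2 * n - 1)                 # 2n is a power of two, so 2n - 1 is all ones
```

For n = 8 the mask is 15, so `wrap_index(t + i, 8)` equals `(t + i) % 16` for every non-negative index.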

In the present embodiment, the variable Z consists of 2n words.
For this reason, the remainder operation modulo 2n is performed in processes 3 and 4 of S1007 in FIG. 18.
However, the variable Z may instead consist of v words (where v is a multiple of n, such as 3n or 4n).
With such a variable Z, the remainder operation modulo 2n in processes 3 and 4 of S1007 in FIG. 18 is omitted.
Likewise, the data stored in the variable a in the subtraction process (process 1 of S1008 in FIG. 18) is Z[2n].
Furthermore, in the subtraction process (process 2 of S1008 in FIG. 18), the values of Z[n] to Z[2n−1] are stored in Z[0] to Z[n−1], respectively.

As described above, the present embodiment has described a multiple-precision integer arithmetic device that has a computer incorporating a plurality of processors and capable of executing a single instruction on multiple data simultaneously, and a memory that stores data and can be accessed simultaneously by the processors. For inputs X and Y, a modulus M, and an internal operation width b (bits), the device uses R, defined by r = 2^b and R = rⁿ, and MInv, defined as −M⁻¹ mod r, to execute Montgomery multiplication, which calculates Z = XYR⁻¹ mod M, in the following steps.
1. A step of dividing the input data and the modulus into a plurality of digits according to the internal operation width of the computer (hereinafter denoted X[n], Y[n], and M[n], respectively)
2. A step in which each thread obtains its own thread number t and the number n of threads performing the operation
3. A step of setting Z to 0
4. A step of obtaining the lower one word u of X[0] × Y[0] × MInv and calculating M[t] × u and X[0] × Y[t]
5. A step of adding only the lower words of M[t] × u and X[0] × Y[t], and then outputting the lower one word of the addition result to Z[t]
6. A step of adding the upper word of the addition result and the upper words of M[t] × u and X[0] × Y[t] to generate two-word data “c”
7. A step of outputting the lower one word of “c” to Z[0] if the thread number is 0
8. A step of setting the digit i to 1
9. A step of obtaining the lower one word u of (Z[0] + Z[i] + X[i] × Y[0]) × MInv and calculating M[t] × u and X[i] × Y[t]
10. A step of adding the lower words of M[t] × u, X[i] × Y[t], and “c” to Z[t+i], and then outputting the lower one word of the addition result to Z[t+i]
11. A step of adding the upper word of the addition result and the upper words of M[t] × u, X[i] × Y[t], and “c” to generate two-word data “c”
12. A step of adding 1 to the variable i
13. A step of outputting the lower one word of “c” to Z[0] if the thread number is 0
14. A step of executing steps 9 to 13 while i < n
15. A step of outputting 0 to Z[0] if the thread number is 0
16. A step of adding the lower word of “c” to Z[(t+i) % 2n], and then outputting the lower one word of the addition result to Z[(t+i) % 2n]
17. A step of adding the upper word of the addition result and the upper word of “c” to generate two-word data “c”
18. A step of adding 1 to the digit i
19. A step of executing steps 16 to 18 while i < 2n and c ≠ 0
20. A step of waiting for all of the threads to complete steps 16 to 18
21. A step of obtaining the value of Z[0] in the variable a
22. A step of shifting the value of Z by n words
23. A step of executing the multiple-precision integer subtraction Z − M and storing the result in Z if the value of a is not 0 or the value of Z[n−1, …, 0] is greater than or equal to M
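The step list can be traced end to end with the following Python sketch. It is a simulation under stated assumptions rather than the implementation of the embodiment: the n lock-step threads are modeled by an inner loop, Python integers stand in for the final multiple-precision comparison and subtraction, steps 10 and 11 use the product X[i] × Y[t] computed in step 9, and the names `to_words` and `montgomery_multiply` are hypothetical. The sketch also adds one safeguard that the listed steps do not state: a carry still pending when i reaches 2n is counted in the overflow word a. It requires x, y < m, gcd(m, r) = 1, and Python 3.8 or later.

```python
def to_words(x, n, r):
    """Split x into n base-r words, least significant word first."""
    return [(x // r**j) % r for j in range(n)]

def montgomery_multiply(x, y, m, n, r):
    """Return x*y*R^-1 mod m (R = r**n), simulating n lock-step threads."""
    X, Y, M = to_words(x, n, r), to_words(y, n, r), to_words(m, n, r)  # step 1
    m_inv = (-pow(m, -1, r)) % r              # MInv = -M^-1 mod r
    Z = [0] * (2 * n)                         # step 3
    c = [0] * n                               # one carry "c" per thread t
    for i in range(n):                        # i = 0: steps 4-7; i >= 1: steps 9-13
        # Every thread reads Z[0] and Z[i] before any thread writes (lock step).
        u = ((Z[0] + Z[i] + X[i] * Y[0]) * m_inv) % r   # Z is all zero when i == 0
        for t in range(n):
            um, xy = M[t] * u, X[i] * Y[t]
            low = um % r + xy % r + c[t] % r + Z[t + i]      # lower words only
            Z[t + i] = low % r
            c[t] = low // r + um // r + xy // r + c[t] // r  # upper words -> new c
        Z[0] = c[0] % r                       # thread 0 publishes its carry
    Z[0] = 0                                  # step 15
    i, active = n, [True] * n
    while i < 2 * n and any(active):          # steps 16-19
        for t in range(n):
            if not active[t]:
                continue
            if c[t] == 0:
                active[t] = False
                continue
            k = (t + i) % (2 * n)
            low = Z[k] + c[t] % r
            Z[k] = low % r
            c[t] = low // r + c[t] // r
        i += 1
    a = Z[0] + sum(c)                         # step 21, plus the safeguard (assumption)
    z = sum(w * r**j for j, w in enumerate(Z[n:2 * n]))   # step 22
    if a != 0 or z >= m:                      # step 23
        z = (z - m) % r**n
    return z
```

The result can be checked against the definition with `x * y * pow(r**n, -1, m) % m`. Publishing thread 0's carry in Z[0] is what lets every thread derive the same u without waiting for a sequential carry chain.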

The present embodiment has also described a multiple-precision integer arithmetic device that, in step 16 above, adds the lower word of “c” to Z[(t+i)] and then outputs the lower one word of the addition result to Z[(t+i)], and that, in step 21 above, obtains the value of Z[2n] in the variable a.

100 multiple-precision arithmetic device, 101 calculation unit, 102 memory, 103 communication port, 104 bus, 105 processor, 106 processor, 107 instruction decoder, 108 general-purpose register, 109 special register.

Claims

An arithmetic device that includes a control unit, a calculation unit, and a storage unit, and performs addition of input values,
The controller is
An input value X and an input value Y, which have a common bit width larger than the calculation bit width of the calculation unit, are each divided by the calculation bit width of the calculation unit,
A predetermined storage area in the storage unit is allocated to provide n variables X[0] to X[n−1] for storing the n (n ≧ 2) divided values obtained from the input value X, and n variables Y[0] to Y[n−1] for storing the n divided values obtained from the input value Y,
The n divided values of the input value X are stored in the variables X[0] to X[n−1] so that the divided value including the least significant bit of the input value X is stored in the 0th variable X[0] and the divided value including the most significant bit is stored in the (n−1)th variable X[n−1], and the n divided values of the input value Y are stored in the variables Y[0] to Y[n−1] so that the divided value including the least significant bit of the input value Y is stored in the 0th variable Y[0] and the divided value including the most significant bit is stored in the (n−1)th variable Y[n−1],
A predetermined storage area in the storage unit is allocated to provide u (u > n) variables Z[0] to Z[u−1] for storing the addition result of the calculation unit,
The computing unit is
N threads having “0 to n−1” set as thread numbers are executed in parallel,
In the thread t where the thread number = t (t is one of “0 to n−1”), as the first phase process,
(value of X[t]) + (value of Y[t]) is calculated using the value of the t-th variable X[t] of the input value X and the value of the t-th variable Y[t] of the input value Y, an addition result and a carry value c are obtained, and the addition result is stored in the t-th variable Z[t],
The controller is
The counter value i is set to 1, and the arithmetic unit starts the second phase process;
Until the counter value i reaches n or all threads t have stopped the second-phase processing, the counter value i is incremented each time one round of the second-phase processing is completed in all threads t that have not been stopped,
The computing unit is
Each time the counter value i is incremented by the control unit, in the thread t that has not been stopped, as processing for one round of the second phase,
<a> (value of Z[t+i]) + c is calculated using the value of the (t+i)th variable Z[t+i] and the carry value c obtained in thread t, and a new addition result and a new carry value c are obtained,
<b> the value of the variable Z[t+i] is compared with the new addition result; if the two match, the second-phase processing of thread t is stopped, and if they do not match, the new addition result is stored in the (t+i)th variable Z[t+i]; the arithmetic device being characterized by performing this processing.
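The two phases of this addition can be sketched in Python as follows. The sketch is illustrative only and rests on assumptions that the claim text does not fix: the lock-step threads are modeled by an inner loop, Z is given 2n words so that Z[t+i] always exists, and the second phase is run for rounds i = 1 to n inclusive so that a carry started by thread 0 can reach word n. The names `to_words` and `parallel_add` are hypothetical.

```python
def to_words(x, n, r):
    """Split x into n base-r words, least significant word first."""
    return [(x // r**j) % r for j in range(n)]

def parallel_add(x, y, n, r):
    """Return x + y, simulating n lock-step threads with deferred carries."""
    X, Y = to_words(x, n, r), to_words(y, n, r)
    Z = [0] * (2 * n)                   # assumption: 2n words, so Z[t + i] always exists
    c = [0] * n
    for t in range(n):                  # first phase: independent word-wise sums
        s = X[t] + Y[t]
        Z[t], c[t] = s % r, s // r
    active = [True] * n
    for i in range(1, n + 1):           # second phase, rounds i = 1 .. n (assumed inclusive)
        if not any(active):
            break
        for t in range(n):
            if not active[t]:
                continue
            s = Z[t + i] + c[t]
            new, c[t] = s % r, s // r   # <a>
            if new == Z[t + i]:         # <b>: unchanged word means the carry was 0
                active[t] = False
            else:
                Z[t + i] = new
    return sum(w * r**j for j, w in enumerate(Z))
```

In a round, thread t only touches Z[t+i], so no two threads write the same word. For instance, `parallel_add(99, 1, 2, 10)` needs both rounds to carry from the lowest digit into Z[2] and returns 100.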

An arithmetic unit that includes a control unit, a calculation unit, and a storage unit, and performs subtraction of an input value,
An input value X and an input value Y, which have a common bit width larger than the calculation bit width of the calculation unit, are each divided by the calculation bit width of the calculation unit,
A predetermined storage area in the storage unit is allocated to provide n variables X[0] to X[n−1] for storing the n (n ≧ 2) divided values obtained from the input value X, and n variables Y[0] to Y[n−1] for storing the n divided values obtained from the input value Y,
The n divided values of the input value X are stored in the variables X[0] to X[n−1] so that the divided value including the least significant bit of the input value X is stored in the 0th variable X[0] and the divided value including the most significant bit is stored in the (n−1)th variable X[n−1], and the n divided values of the input value Y are stored in the variables Y[0] to Y[n−1] so that the divided value including the least significant bit of the input value Y is stored in the 0th variable Y[0] and the divided value including the most significant bit is stored in the (n−1)th variable Y[n−1],
A predetermined storage area in the storage unit is allocated to provide u (u > n) variables Z[0] to Z[u−1] for storing the subtraction result of the calculation unit,
The computing unit is
N threads having “0 to n−1” set as thread numbers are executed in parallel,
In the thread t where the thread number = t (t is one of “0 to n−1”), as the first phase process,
(value of Y[t]) − (value of X[t]) is calculated using the value of the t-th variable X[t] of the input value X and the value of the t-th variable Y[t] of the input value Y, a subtraction result and a borrow value d are obtained, and the subtraction result is stored in the t-th variable Z[t],
The controller is
The counter value i is set to 1, and the arithmetic unit starts the second phase process;
Until the counter value i reaches n or all threads t have stopped the second-phase processing, the counter value i is incremented each time one round of the second-phase processing is completed in all threads t that have not been stopped,
The computing unit is
Each time the counter value i is incremented by the control unit, in the thread t that has not been stopped, as processing for one round of the second phase,
<a> (value of Z[t+i]) − d is calculated using the value of the (t+i)th variable Z[t+i] and the borrow value d obtained in thread t, and a new subtraction result and a new borrow value d are obtained,
<b> the value of the variable Z[t+i] is compared with the new subtraction result; if the two match, the second-phase processing of thread t is stopped, and if they do not match, the new subtraction result is stored in the (t+i)th variable Z[t+i]; the arithmetic device being characterized by performing this processing.

An arithmetic device that includes a control unit, a calculation unit, and a storage unit and performs multiplication of input values,
The controller is
An input value X and an input value Y, which have a common bit width larger than one word, the calculation bit width of the calculation unit, are each divided word by word,
A predetermined storage area in the storage unit is allocated to provide n variables X[0] to X[n−1] for storing the n (n ≧ 2) divided values obtained from the input value X, and n variables Y[0] to Y[n−1] for storing the n divided values obtained from the input value Y,
The n divided values of the input value X are stored in the variables X[0] to X[n−1] so that the divided value including the least significant bit of the input value X is stored in the 0th variable X[0] and the divided value including the most significant bit is stored in the (n−1)th variable X[n−1], and the n divided values of the input value Y are stored in the variables Y[0] to Y[n−1] so that the divided value including the least significant bit of the input value Y is stored in the 0th variable Y[0] and the divided value including the most significant bit is stored in the (n−1)th variable Y[n−1],
A predetermined storage area in the storage unit is allocated to provide u (u > n) variables Z[0] to Z[u−1] for storing the calculation result of the calculation unit,
The counter value i is set to 0, and the arithmetic unit starts the first phase process,
Until the counter value i reaches n, each time the arithmetic unit finishes processing for one round of the first phase, the counter value i is incremented,
The computing unit is
N threads having “0 to n−1” set as thread numbers are executed in parallel,
Until the counter value i reaches n, each time the counter value i is incremented by the control unit, in a thread t whose thread number = t (t is one of “0 to n−1”), as the processing for one round of the first phase,
<1-a> (value of X[t]) × (value of Y[i]) + (value of Z[t+i]) + c is calculated using the value of the t-th variable X[t] of the input value X, the value of the i-th variable Y[i] of the input value Y, the value of the (t+i)th variable Z[t+i], and the carry component value c obtained in thread t,
<1-b> A process of storing the value of the lower 1 word of the calculation result in the (t + i) th variable Z [t + i] and setting the value of the upper 1 word of the calculation result as a new carry component value c,
The controller is
When the counter value i reaches n, the calculation unit starts the second phase process,
Until the counter value i reaches u, the arithmetic unit increments the counter value i every time the processing for one round of the second phase ends.
The computing unit is
Until the counter value i reaches u, each time the counter value i is incremented by the control unit, in the thread t that is not stopped, as processing for one round of the second phase,
<2-a> It is determined whether or not the carry component value c obtained by the thread t is 0, and when the carry component value c is 0, the processing of the second phase of the thread t is stopped.
<2-b> when the carry component value c is not 0, (value of Z[t+i]) + c is calculated, the value of the lower one word of the calculation result is set as the new value of the variable Z[t+i], and the value of the upper one word of the calculation result is set as the new carry component value c; the arithmetic device being characterized by performing this processing.
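The two phases of this multiplication can be sketched in Python as follows. The sketch is illustrative, not the claimed device: the lock-step threads are modeled by an inner loop, u is assumed to be 2n so that the full product fits, and the names `to_words` and `parallel_multiply` are hypothetical.

```python
def to_words(x, n, r):
    """Split x into n base-r words, least significant word first."""
    return [(x // r**j) % r for j in range(n)]

def parallel_multiply(x, y, n, r):
    """Return x * y, simulating n lock-step threads (thread t owns X[t])."""
    X, Y = to_words(x, n, r), to_words(y, n, r)
    u = 2 * n                           # assumption: u = 2n words hold the full product
    Z = [0] * u
    c = [0] * n
    for i in range(n):                  # first phase
        for t in range(n):
            m = X[t] * Y[i] + Z[t + i] + c[t]     # <1-a>: at most r*r - 1, two words
            Z[t + i], c[t] = m % r, m // r         # <1-b>
    active = [True] * n
    i = n
    while i < u and any(active):        # second phase: flush the remaining carries
        for t in range(n):
            if not active[t]:
                continue
            if c[t] == 0:               # <2-a>
                active[t] = False
                continue
            m = Z[t + i] + c[t]         # <2-b>
            Z[t + i], c[t] = m % r, m // r
        i += 1
    return sum(w * r**j for j, w in enumerate(Z))
```

For example, `parallel_multiply(99, 99, 2, 10)` returns 9801. Because each thread keeps its own carry and adds it one position higher in the next round, no thread waits for another thread's carry.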

An arithmetic device that includes a control unit, a calculation unit, and a storage unit,
and that, for an input value X and a modulus M each having a bit width larger than one word (1 word = b bits), which is the calculation bit width of the calculation unit, performs Montgomery reduction to calculate (XR⁻¹ mod M) using R, defined by r = 2^b and R = rⁿ (where n is the number of divided values obtained when the modulus M is divided word by word, and n ≧ 2), and MInv, defined as (−M⁻¹ mod r),
The controller is
The input value X and the modulus M are each divided into words,
A predetermined storage area in the storage unit is allocated to provide 2n variables X[0] to X[2n−1] for storing the 2n divided values obtained from the input value X,
The 2n divided values are stored in the variables X[0] to X[2n−1] so that the divided value including the least significant bit of the input value X is stored in the 0th variable X[0] and the divided value including the most significant bit is stored in the (2n−1)th variable X[2n−1],
A predetermined storage area in the storage unit is allocated to provide n variables M[0] to M[n−1] for storing the n divided values obtained from the modulus M,
The n divided values are stored in the variables M[0] to M[n−1] so that the divided value including the least significant bit of the modulus M is stored in the 0th variable M[0] and the divided value including the most significant bit is stored in the (n−1)th variable M[n−1],
A predetermined storage area in the storage unit is allocated to provide n variables Z[0] to Z[n−1] for storing the calculation result of the calculation unit,
The computing unit is
N threads having “0 to n−1” set as thread numbers are executed in parallel,
As the first phase process,
In a thread t where thread number = t (t is one of “0 to n−1”),
<1-a> (value of X[0]) × MInv is calculated using the value of the 0th variable X[0] of the input value X and MInv, and the value of the lower one word of the two-word calculation result is set as u,
<1-b> (value of M[t]) × u + (value of X[t]) is calculated using the value of the t-th variable M[t] of the modulus M, the value of the t-th variable X[t] of the input value X, and u; the two-word calculation result is set as m, the value of the lower one word of m is set as the new value of the variable X[t], and the value of the upper one word of m is set as the carry component value c,
In thread 0, which is the 0th thread,
<1-c> A process of setting the value of the lower one word of the carry component value c obtained in the thread 0 as a new value of the variable X [0],
The controller is
The counter value i is set to 1, and the arithmetic unit starts the second phase process;
Every time processing for one round of the second phase is completed in all the threads t until the counter value i reaches n, the counter value i is incremented.
The computing unit is
Each time the counter value i is incremented by the control unit, as a process for one round of the second phase,
In an individual thread t
<2-a> {(value of X[0]) + (value of X[i])} × MInv is calculated using the value of the 0th variable X[0], the value of the i-th variable X[i] of the input value X, and MInv, and the value of the lower one word of the two-word calculation result is set as u,
<2-b> (value of M[t]) × u + (value of X[t+i]) + c is calculated using the value of the t-th variable M[t] of the modulus M, u, the value of the (t+i)th variable X[t+i] of the input value X, and the carry component value c obtained in thread t; the two-word calculation result is set as m, the value of the lower one word of m is set as the new value of the variable X[t+i], and the value of the upper one word of m is set as the new carry component value c,
In thread 0, which is the 0th thread,
<2-c> A process of setting the value of the lower one word of the carry component value c obtained in the thread 0 as a new value of the variable X [0],
When the counter value i reaches n, in the thread 0, the value 0 is set as a new value of the variable X [0],
The controller is
When the counter value i reaches n, after the value 0 has been set as the new value of the variable X[0] in thread 0, the calculation unit is caused to start the third-phase processing,
Until the counter value i reaches 2n or all threads t have stopped the third-phase processing, the counter value i is incremented each time one round of the third-phase processing is completed in all threads t that have not been stopped,
The computing unit is
Until the counter value i reaches 2n, each time the counter value i is incremented by the control unit, in the thread t that is not stopped, as a process for one round of the third phase,
<3-a> It is determined whether or not the carry component value c obtained by the thread t is 0, and when the carry component value c is 0, the processing of the third phase of the thread t is stopped.
<3-b> when the carry component value c is not 0, (value of X[(t+i) mod 2n]) + c is calculated; the two-word calculation result is set as m, the value of the lower one word of m is set as the new value of the variable X[(t+i) mod 2n], and the value of the upper one word of m is set as the new carry component value c,
The controller is
When the counter value i reaches 2n, or when all the threads t stop the third phase processing, the calculation unit starts the fourth phase processing,
The computing unit is
As the fourth phase process,
The value of variable X [0] is stored in variable a,
The values of variables X [n] to X [2n−1] are stored in variables X [0] to X [n−1], respectively.
When the value of the variable a is not 0, or when the value obtained by concatenating the values of the variables X[n−1] to X[0] is greater than or equal to the modulus M, X − M is calculated and the calculation result is stored in the variables X[0] to X[n−1],
An arithmetic unit characterized in that values of variables X [0] to X [n-1] are stored in variables Z [0] to Z [n-1].
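The four phases of this Montgomery reduction can be sketched in Python as follows. The sketch is illustrative, not the claimed device: the lock-step threads are modeled by an inner loop, Python integers stand in for the final multiple-precision comparison and subtraction, and the names `to_words` and `montgomery_reduce` are hypothetical. It adds one safeguard that the claim text does not state: a carry still pending when i reaches 2n is counted in the variable a. It assumes 0 ≤ x < m·rⁿ, gcd(m, r) = 1, and Python 3.8 or later.

```python
def to_words(x, n, r):
    """Split x into n base-r words, least significant word first."""
    return [(x // r**j) % r for j in range(n)]

def montgomery_reduce(x, m, n, r):
    """Return x*R^-1 mod m (R = r**n), simulating n lock-step threads."""
    X, M = to_words(x, 2 * n, r), to_words(m, n, r)
    m_inv = (-pow(m, -1, r)) % r             # MInv = -M^-1 mod r
    c = [0] * n
    for i in range(n):                       # first phase (i = 0), second phase (i >= 1)
        base = X[0] if i == 0 else X[0] + X[i]
        u = (base * m_inv) % r               # <1-a> / <2-a>, read before any write
        for t in range(n):
            s = M[t] * u + X[t + i] + c[t]   # <1-b> / <2-b>: fits in two words
            X[t + i], c[t] = s % r, s // r
        X[0] = c[0] % r                      # <1-c> / <2-c>: thread 0 publishes its carry
    X[0] = 0                                 # set when i reaches n
    i, active = n, [True] * n
    while i < 2 * n and any(active):         # third phase
        for t in range(n):
            if not active[t]:
                continue
            if c[t] == 0:                    # <3-a>
                active[t] = False
                continue
            k = (t + i) % (2 * n)            # <3-b>
            s = X[k] + c[t]
            X[k], c[t] = s % r, s // r
        i += 1
    a = X[0] + sum(c)                        # fourth phase, plus the safeguard (assumption)
    z = sum(w * r**j for j, w in enumerate(X[n:2 * n]))
    if a != 0 or z >= m:
        z = (z - m) % r**n
    return z
```

With n = 2, r = 10, and m = 73, `montgomery_reduce(1234, 73, 2, 10)` returns 43, which satisfies 43 × 100 ≡ 1234 (mod 73).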

An arithmetic device that includes a control unit, a calculation unit, and a storage unit,
and that, for an input value X and a modulus M each having a bit width larger than one word (1 word = b bits), which is the calculation bit width of the calculation unit, performs Montgomery reduction to calculate (XR⁻¹ mod M) using R, defined by r = 2^b and R = rⁿ (where n is the number of divided values obtained when the modulus M is divided word by word, and n ≧ 2), and MInv, defined as (−M⁻¹ mod r),
The controller is
The input value X and the modulus M are each divided into words,
A predetermined storage area in the storage unit is allocated to provide v (v is a multiple of n, v ≧ 3n) variables X[0] to X[v−1] for storing the 2n divided values obtained from the input value X,
The 2n divided values are stored in the variables X[0] to X[2n−1] so that the divided value including the least significant bit of the input value X is stored in the 0th variable X[0] and the divided value including the most significant bit is stored in the (2n−1)th variable X[2n−1],
A predetermined storage area in the storage unit is allocated to provide n variables M[0] to M[n−1] for storing the n divided values obtained from the modulus M,
The n divided values are stored in the variables M[0] to M[n−1] so that the divided value including the least significant bit of the modulus M is stored in the 0th variable M[0] and the divided value including the most significant bit is stored in the (n−1)th variable M[n−1],
A predetermined storage area in the storage unit is allocated to provide n variables Z[0] to Z[n−1] for storing the calculation result of the calculation unit,
The computing unit is
N threads having “0 to n−1” set as thread numbers are executed in parallel,
As the first phase process,
In a thread t where thread number = t (t is one of “0 to n−1”),
<1-a> (value of X[0]) × MInv is calculated using the value of the 0th variable X[0] of the input value X and MInv, and the value of the lower one word of the two-word calculation result is set as u,
<1-b> (value of M[t]) × u + (value of X[t]) is calculated using the value of the t-th variable M[t] of the modulus M, the value of the t-th variable X[t] of the input value X, and u; the two-word calculation result is set as m, the value of the lower one word of m is set as the new value of the variable X[t], and the value of the upper one word of m is set as the carry component value c,
In thread 0, which is the 0th thread,
<1-c> A process of setting the value of the lower one word of the carry component value c obtained in the thread 0 as a new value of the variable X [0],
The controller is
The counter value i is set to 1, and the arithmetic unit starts the second phase process;
Every time processing for one round of the second phase is completed in all the threads t until the counter value i reaches n, the counter value i is incremented.
The computing unit is
Each time the counter value i is incremented by the control unit, as a process for one round of the second phase,
In an individual thread t
<2-a> {(value of X[0]) + (value of X[i])} × MInv is calculated using the value of the 0th variable X[0], the value of the i-th variable X[i] of the input value X, and MInv, and the value of the lower one word of the two-word calculation result is set as u,
<2-b> (value of M[t]) × u + (value of X[t+i]) + c is calculated using the value of the t-th variable M[t] of the modulus M, u, the value of the (t+i)th variable X[t+i] of the input value X, and the carry component value c obtained in thread t; the two-word calculation result is set as m, the value of the lower one word of m is set as the new value of the variable X[t+i], and the value of the upper one word of m is set as the new carry component value c,
In thread 0, which is the 0th thread,
<2-c> A process of setting the value of the lower one word of the carry component value c obtained in the thread 0 as a new value of the variable X [0],
When the counter value i reaches n, in the thread 0, the value 0 is set as a new value of the variable X [0],
The controller is
When the counter value i reaches n, after the value 0 has been set as the new value of the variable X[0] in thread 0, the calculation unit is caused to start the third-phase processing,
Until the counter value i reaches 2n or all threads t have stopped the third-phase processing, the counter value i is incremented each time one round of the third-phase processing is completed in all threads t that have not been stopped,
The computing unit is
Until the counter value i reaches 2n, each time the counter value i is incremented by the control unit, in the thread t that is not stopped, as a process for one round of the third phase,
<3-a> It is determined whether or not the carry component value c obtained by the thread t is 0, and when the carry component value c is 0, the processing of the third phase of the thread t is stopped.
<3-b> when the carry component value c is not 0, (value of X[t+i]) + c is calculated; the two-word calculation result is set as m, the value of the lower one word of m is set as the new value of the variable X[t+i], and the value of the upper one word of m is set as the new carry component value c,
The controller is
When the counter value i reaches 2n, or when all the threads t stop the third phase processing, the calculation unit starts the fourth phase processing,
The computing unit is
As the fourth phase process,
The value of the variable X[2n] is stored in the variable a,
The values of the variables X[n] to X[2n−1] are stored in the variables X[0] to X[n−1], respectively,
When the value of the variable a is not 0, or when the value obtained by concatenating the values of the variables X[n−1] to X[0] is greater than or equal to the modulus M, X − M is calculated and the calculation result is stored in the variables X[0] to X[n−1],
An arithmetic unit characterized in that values of variables X [0] to X [n-1] are stored in variables Z [0] to Z [n-1].

An arithmetic device that includes a control unit, a calculation unit, and a storage unit,
and that, for an input value X, an input value Y, and a modulus M, which have a common bit width larger than one word (1 word = b bits), the calculation bit width of the calculation unit, performs Montgomery multiplication to calculate (XYR⁻¹ mod M) using R, defined by r = 2^b and R = rⁿ (where n is the number of divided values obtained when the modulus M is divided word by word, and n ≧ 2), and MInv, defined as (−M⁻¹ mod r),
The controller is
The input value X, the input value Y, and the modulus M are divided for each word,
A predetermined storage area in the storage unit is allocated to provide n variables X[0] to X[n−1] for storing the n divided values obtained from the input value X, and n variables Y[0] to Y[n−1] for storing the n divided values obtained from the input value Y,
The n divided values of the input value X are stored in the variables X[0] to X[n−1] so that the divided value including the least significant bit of the input value X is stored in the 0th variable X[0] and the divided value including the most significant bit is stored in the (n−1)th variable X[n−1], and the n divided values of the input value Y are stored in the variables Y[0] to Y[n−1] so that the divided value including the least significant bit of the input value Y is stored in the 0th variable Y[0] and the divided value including the most significant bit is stored in the (n−1)th variable Y[n−1],
A predetermined storage area in the storage unit is allocated to provide n variables M[0] to M[n−1] for storing the n divided values obtained from the modulus M,
The n divided values are stored in the variables M[0] to M[n−1] so that the divided value including the least significant bit of the modulus M is stored in the 0th variable M[0] and the divided value including the most significant bit is stored in the (n−1)th variable M[n−1],
A predetermined storage area in the storage unit is allocated to provide 2n variables Z[0] to Z[2n−1] for storing the calculation result of the calculation unit,
The calculation unit:
executes in parallel n threads whose thread numbers are set to 0 to n−1,
and, as first phase processing,
in a thread t whose thread number is t (t is one of 0 to n−1),
<1-a> using the value of the 0th variable X[0] of the input value X, the value of the 0th variable Y[0] of the input value Y, and MINv, calculates (value of X[0]) × (value of Y[0]) × MINv, and sets the lower 1 word of the 2-word calculation result as u,
<1-b> using the value of the t-th variable M[t] of the modulus M and u, calculates (value of M[t]) × u, and sets the 2-word calculation result as um,
<1-c> using the value of the 0th variable X[0] of the input value X and the value of the t-th variable Y[t] of the input value Y, calculates (value of X[0]) × (value of Y[t]), and sets the 2-word calculation result as xy,
<1-d> adds the lower 1 word of um and the lower 1 word of xy, sets the 2-word calculation result as m, and stores the lower 1 word of the calculation result m in the t-th variable Z[t],
<1-e> adds the upper 1 word of m, the upper 1 word of um, and the upper 1 word of xy, and sets the 2-word calculation result as a carry component value c,
and, in thread 0, which is the 0th thread,
<1-f> sets the lower 1 word of the carry component value c obtained in thread 0 as a new value of the variable Z[0],
The control unit:
sets the counter value i to 1 and causes the calculation unit to start second phase processing,
and, until the counter value i reaches n, increments the counter value i each time one round of the second phase processing is completed in all the threads t,
The calculation unit:
each time the counter value i is incremented by the control unit, as one round of the second phase processing,
in each thread t,
<2-a> using the value of the 0th variable Z[0], the value of the i-th variable Z[i], the value of the i-th variable X[i] of the input value X, the value of the 0th variable Y[0] of the input value Y, and MINv, calculates {(value of Z[0]) + (value of Z[i]) + (value of X[i]) × (value of Y[0])} × MINv, and sets the lower 1 word of the 2-word calculation result as u,
<2-b> using the value of the t-th variable M[t] of the modulus M and u, calculates (value of M[t]) × u, and sets the 2-word calculation result as um,
<2-c> using the value of the i-th variable X[i] of the input value X and the value of the t-th variable Y[t] of the input value Y, calculates (value of X[i]) × (value of Y[t]), and sets the 2-word calculation result as xy,
<2-d> adds the lower 1 word of um, the lower 1 word of xy, the lower 1 word of the carry component value c obtained in the thread t, and the value of the (t+i)th variable Z[t+i], sets the 2-word calculation result as m, and sets the lower 1 word of the calculation result m as a new value of the (t+i)th variable Z[t+i],
<2-e> adds the upper 1 word of m, the upper 1 word of um, the upper 1 word of xy, and the upper 1 word of the carry component value c obtained in the thread t, and sets the 2-word calculation result as a new carry component value c,
and, in thread 0, which is the 0th thread,
<2-f> sets the lower 1 word of the carry component value c obtained in thread 0 as a new value of the variable Z[0],
and, when the counter value i reaches n, sets in thread 0 the value 0 as a new value of the variable Z[0],
The control unit:
when the counter value i reaches n, causes the calculation unit to start third phase processing after the value 0 has been set as the new value of the variable X[0] in thread 0,
and, until the counter value i reaches 2n or all the threads t stop the third phase processing, increments the counter value i each time one round of the third phase processing is completed in all the threads t that have not stopped,
The calculation unit:
until the counter value i reaches 2n, each time the counter value i is incremented by the control unit, in each thread t that has not stopped, as one round of the third phase processing,
<3-a> determines whether the carry component value c obtained in the thread t is 0, and stops the third phase processing of the thread t when the carry component value c is 0,
<3-b> when the carry component value c is not 0, adds the lower 1 word of the carry component value c obtained in the thread t and the value of the variable Z[(t+i) mod 2n], sets the 2-word calculation result as m, sets the lower 1 word of the calculation result m as a new value of the variable Z[(t+i) mod 2n], adds the upper 1 word of m and the upper 1 word of the carry component value c obtained in the thread t, and sets the 2-word calculation result as a new carry component value c,
The control unit:
when the counter value i reaches 2n, or when all the threads t have stopped the third phase processing, causes the calculation unit to start fourth phase processing,
The calculation unit:
as the fourth phase processing,
stores the value of the variable Z[0] in a variable a,
stores the values of the variables Z[n] to Z[2n−1] in the variables Z[0] to Z[n−1], respectively,
and, when the value of the variable a is not 0, or when the value obtained by concatenating the values of the variables Z[n−1] to Z[0] is equal to or greater than the modulus M, calculates Z−M and stores the calculation result in the variables Z[0] to Z[n−1].
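
For orientation, the value that the four phases above produce, (XYR⁻¹ mod M), can be checked against a plain sequential word-by-word Montgomery multiplication. The following is a minimal Python sketch of that textbook procedure, not the parallel thread scheme of the claim. The function names `split_words` and `montgomery_multiply` are hypothetical, the running sum is kept as a single big integer instead of the variables Z[0] to Z[2n−1], and the sketch assumes that M is odd and that X and Y are smaller than M.

```python
def split_words(value, n, b):
    """Split value into n words of b bits, least significant word first."""
    mask = (1 << b) - 1
    return [(value >> (b * k)) & mask for k in range(n)]


def montgomery_multiply(x, y, m, n, b):
    """Return x * y * R^-1 mod m, with r = 2**b and R = r**n.

    Assumes m is odd and x, y < m (illustrative sketch only).
    """
    r = 1 << b
    mask = r - 1
    minv = (-pow(m, -1, r)) % r          # MINv = -M^-1 mod r (needs Python 3.8+)
    X = split_words(x, n, b)
    z = 0                                # running sum kept as one big integer
    for i in range(n):
        z += X[i] * y                    # add the partial product X[i] * Y
        u = ((z & mask) * minv) & mask   # u makes the lowest word of z + u*M zero
        z += u * m
        z >>= b                          # exact division by r
    if z >= m:                           # final conditional subtraction
        z -= m
    return z
```

For example, with b = 8 and n = 2 (so R = 2¹⁶), `montgomery_multiply(0x1234, 0xABCD, 0xF123, 2, 8)` should agree with `(0x1234 * 0xABCD * pow(1 << 16, -1, 0xF123)) % 0xF123`.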

A control unit, a calculation unit, and a storage unit;
An arithmetic device that performs Montgomery multiplication to calculate (XYR⁻¹ mod M) for an input value X, an input value Y, and a modulus M, which share a common bit width larger than one word (1 word = b bits, the calculation bit width of the calculation unit), using r = 2^b, R = rⁿ (n is the number of words into which the modulus M is divided, and n ≧ 2), and MINv defined as (−M⁻¹ mod r),
The control unit:
divides the input value X, the input value Y, and the modulus M into words,
allocates a predetermined storage area in the storage unit and provides n variables X[0] to X[n−1] for storing the n divided values of the input value X, and n variables Y[0] to Y[n−1] for storing the n divided values of the input value Y,
stores the n divided values of the input value X in the variables X[0] to X[n−1] so that the divided value containing the least significant bit is stored in the 0th variable X[0] and the divided value containing the most significant bit is stored in the (n−1)th variable X[n−1], and stores the n divided values of the input value Y in the variables Y[0] to Y[n−1] so that the divided value containing the least significant bit is stored in the 0th variable Y[0] and the divided value containing the most significant bit is stored in the (n−1)th variable Y[n−1],
allocates a predetermined storage area in the storage unit and provides n variables M[0] to M[n−1] for storing the n divided values of the modulus M,
stores the n divided values of the modulus M in the variables M[0] to M[n−1] so that the divided value containing the least significant bit is stored in the 0th variable M[0] and the divided value containing the most significant bit is stored in the (n−1)th variable M[n−1],
and allocates a predetermined storage area in the storage unit and provides v variables Z[0] to Z[v−1] (v is a multiple of n and v ≧ 3n) for storing the calculation results of the calculation unit,
The calculation unit:
executes in parallel n threads whose thread numbers are set to 0 to n−1,
and, as first phase processing,
in a thread t whose thread number is t (t is one of 0 to n−1),
<1-a> using the value of the 0th variable X[0] of the input value X, the value of the 0th variable Y[0] of the input value Y, and MINv, calculates (value of X[0]) × (value of Y[0]) × MINv, and sets the lower 1 word of the 2-word calculation result as u,
<1-b> using the value of the t-th variable M[t] of the modulus M and u, calculates (value of M[t]) × u, and sets the 2-word calculation result as um,
<1-c> using the value of the 0th variable X[0] of the input value X and the value of the t-th variable Y[t] of the input value Y, calculates (value of X[0]) × (value of Y[t]), and sets the 2-word calculation result as xy,
<1-d> adds the lower 1 word of um and the lower 1 word of xy, sets the 2-word calculation result as m, and stores the lower 1 word of the calculation result m in the t-th variable Z[t],
<1-e> adds the upper 1 word of m, the upper 1 word of um, and the upper 1 word of xy, and sets the 2-word calculation result as a carry component value c,
and, in thread 0, which is the 0th thread,
<1-f> sets the lower 1 word of the carry component value c obtained in thread 0 as a new value of the variable Z[0],
The control unit:
sets the counter value i to 1 and causes the calculation unit to start second phase processing,
and, until the counter value i reaches n, increments the counter value i each time one round of the second phase processing is completed in all the threads t,
The calculation unit:
each time the counter value i is incremented by the control unit, as one round of the second phase processing,
in each thread t,
<2-a> using the value of the 0th variable Z[0], the value of the i-th variable Z[i], the value of the i-th variable X[i] of the input value X, the value of the 0th variable Y[0] of the input value Y, and MINv, calculates {(value of Z[0]) + (value of Z[i]) + (value of X[i]) × (value of Y[0])} × MINv, and sets the lower 1 word of the 2-word calculation result as u,
<2-b> using the value of the t-th variable M[t] of the modulus M and u, calculates (value of M[t]) × u, and sets the 2-word calculation result as um,
<2-c> using the value of the i-th variable X[i] of the input value X and the value of the t-th variable Y[t] of the input value Y, calculates (value of X[i]) × (value of Y[t]), and sets the 2-word calculation result as xy,
<2-d> adds the lower 1 word of um, the lower 1 word of xy, the lower 1 word of the carry component value c obtained in the thread t, and the value of the (t+i)th variable Z[t+i], sets the 2-word calculation result as m, and sets the lower 1 word of the calculation result m as a new value of the (t+i)th variable Z[t+i],
<2-e> adds the upper 1 word of m, the upper 1 word of um, the upper 1 word of xy, and the upper 1 word of the carry component value c obtained in the thread t, and sets the 2-word calculation result as a new carry component value c,
and, in thread 0, which is the 0th thread,
<2-f> sets the lower 1 word of the carry component value c obtained in thread 0 as a new value of the variable Z[0],
and, when the counter value i reaches n, sets in thread 0 the value 0 as a new value of the variable Z[0],
The control unit:
when the counter value i reaches n, causes the calculation unit to start third phase processing after the value 0 has been set as the new value of the variable X[0] in thread 0,
and, until the counter value i reaches 2n or all the threads t stop the third phase processing, increments the counter value i each time one round of the third phase processing is completed in all the threads t that have not stopped,
The calculation unit:
until the counter value i reaches 2n, each time the counter value i is incremented by the control unit, in each thread t that has not stopped, as one round of the third phase processing,
<3-a> determines whether the carry component value c obtained in the thread t is 0, and stops the third phase processing of the thread t when the carry component value c is 0,
<3-b> when the carry component value c is not 0, adds the lower 1 word of the carry component value c obtained in the thread t and the value of the variable Z[t+i], sets the 2-word calculation result as m, sets the lower 1 word of the calculation result m as a new value of the variable Z[t+i], adds the upper 1 word of m and the upper 1 word of the carry component value c obtained in the thread t, and sets the 2-word calculation result as a new carry component value c,
The control unit:
when the counter value i reaches 2n, or when all the threads t have stopped the third phase processing, causes the calculation unit to start fourth phase processing,
The calculation unit:
as the fourth phase processing,
stores the value of the variable Z[2n] in a variable a,
stores the values of the variables Z[n] to Z[2n−1] in the variables Z[0] to Z[n−1], respectively,
and, when the value of the variable a is not 0, or when the value obtained by concatenating the values of the variables Z[n−1] to Z[0] is equal to or greater than the modulus M, calculates Z−M and stores the calculation result in the variables Z[0] to Z[n−1].
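
The fourth phase amounts to a conditional subtraction of the modulus. It can be sketched in Python as follows, under stated assumptions: the function name `final_subtract` is hypothetical, Z and M are passed as lists of n words with the least significant word first, and a is the overflow word sitting above Z[n−1] (taken from Z[2n] in the arrangement above). The subtraction is done word by word with a borrow, which is one possible way to compute Z−M and is not prescribed by the claim.

```python
def final_subtract(a, Z, M, b):
    """Return the n words of Z - M if a != 0 or Z >= M, else Z unchanged.

    Z and M are lists of n words of b bits, least significant word first;
    a is the overflow word above Z[n-1] (illustrative sketch only).
    """
    n = len(M)
    mask = (1 << b) - 1
    # Compare Z with M starting from the most significant word.
    z_ge_m = True                        # stays True when Z == M
    for k in reversed(range(n)):
        if Z[k] != M[k]:
            z_ge_m = Z[k] > M[k]
            break
    if a == 0 and not z_ge_m:
        return list(Z)                   # already reduced, nothing to subtract
    out, borrow = [], 0
    for k in range(n):                   # word-by-word Z - M with borrow
        d = Z[k] - M[k] - borrow
        borrow = 1 if d < 0 else 0
        out.append(d & mask)             # keep the low b bits of the difference
    return out                           # a final borrow is absorbed by a
```

With b = 8, M = [0x23, 0xF1] (the value 0xF123), Z = [0x10, 0x00], and a = 1, the combined value is 0x10010, so the call returns [0xED, 0x0E], the words of 0x10010 − 0xF123 = 0x0EED.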

A program comprising a group of instructions that enables processing of the control unit and the calculation unit according to claim 1.