JP3904421B2

JP3904421B2 - Remainder multiplication arithmetic unit

Info

Publication number: JP3904421B2
Application number: JP2001308154A
Authority: JP
Inventors: 正博神永; 隆遠藤; 高志渡邊; 邦彦中田; 卓塚本; 誠治小林
Original assignee: Renesas Technology Corp
Current assignee: Renesas Technology Corp
Priority date: 2001-10-04
Filing date: 2001-10-04
Publication date: 2007-04-11
Anticipated expiration: 2021-10-04
Also published as: JP2003114618A

Description

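[0001]
[Technical Field of the Invention]
The present invention is effective when applied to encryption and decryption devices that use modular multiplication and modular exponentiation, and relates in particular to a technique that is effective when applied to high-security data processing devices such as IC cards (smart cards).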
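[0002]
[Prior Art]
Mobile terminals such as mobile phones conforming to the GSM (Global System for Mobile Communications) standard, which is widely used in Europe, and IC cards (smart cards) can perform user authentication and electronic commerce. In general, a device used as electronic money takes the form of an IC card, and in a GSM mobile phone it takes the form called a SIM (Subscriber Identification Module). Both a SIM and an IC card consist of a semiconductor chip with terminals attached to a plastic plate, and in both cases the device itself is a semiconductor chip, so the following description refers to the IC card.
An IC card is a device that holds personal information that must not be rewritten without authorization, encrypts data using a cryptographic key that is secret information, and decrypts ciphertext. The IC card itself has no power supply; when it is inserted into an IC card reader/writer, it receives power and becomes operable. Once operable, it receives commands sent from the reader/writer and performs processing such as data transfer according to those commands. A general explanation of IC cards can be found in, for example, Junichi Mizusawa, "IC Card", edited by the Institute of Electronics, Information and Communication Engineers, published by Ohmsha.
As shown in FIG. 1, an IC card consists of an IC card chip 102 mounted on a card 101. As the figure shows, an IC card generally has a supply voltage terminal Vcc, a ground terminal GND, a reset terminal RST, an input/output terminal I/O, and a clock terminal CLK at positions defined by the ISO 7816 standard, and through these terminals it receives power from the reader/writer and exchanges data with it (see W. Rankl and W. Effing, SMARTCARD HANDBOOK, John Wiley & Sons, 1997, p. 41).
The configuration of an IC card chip is basically the same as that of an ordinary microcomputer. As shown in FIG. 2, it consists of a central processing unit (CPU) 201, a storage device (memory) 204, an input/output (I/O) port 207, and a coprocessor 202 (the coprocessor may be absent). The CPU 201 performs logical operations and arithmetic operations, and the storage device 204 stores programs and data. The I/O port communicates with the reader/writer. The coprocessor performs cryptographic processing itself, or the operations required for cryptographic processing, at high speed; examples include a special arithmetic unit for the modular operations of RSA and a cipher unit that performs DES (Data Encryption Standard) processing. Many IC card processors have no coprocessor. The data bus 203 connects these devices.
The storage device 204 consists of ROM (Read Only Memory), RAM (Random Access Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), and so on. ROM cannot be modified and mainly stores programs. RAM can be rewritten freely, but its contents are lost when the power supply is interrupted. Because power is interrupted when the IC card is removed from the reader/writer, the contents of RAM are then no longer retained. EEPROM retains its contents even when power is interrupted. It is used to store data that must be rewritable and must also be retained after the IC card is removed from the reader/writer. For example, the usage count of a prepaid card is rewritten at every use and must be retained after removal from the reader/writer, so it is held in EEPROM.
Public key cryptography is required to realize user authentication and electronic money settlement. Public key cryptography embeds secret information in public information, and because the sender and the receiver use different keys it is also called asymmetric key cryptography. RSA is a public key cryptosystem in wide use today. RSA relies on the fact that generating a product of large primes is easy, whereas factoring a given composite number is hard. In RSA, given the public modulus N and a ciphertext Y, the plaintext M is obtained by computing Y^X mod N with the secret exponent X (Y^X denotes Y raised to the power X), or the reverse operation is performed. This operation is called modular exponentiation.
As of 2001, X, Y, and N are very large numbers of about 1024 to 2048 bits, so how to execute "Y^X mod N" at high speed has long been a problem in applied mathematics and engineering. Particularly in devices such as IC cards, which have limited memory capacity and a low-performance CPU, modular exponentiation is a very heavy task, and speeding it up is a very important problem.
Various algorithms for modular exponentiation are known; the following one, for example, is widely used. It is called the addition chain method.
[Algorithm 1]
input Y, X=(X[n-1]X[n-2]...X[1]X[0]), N
output Y^X mod N
A = 1    (Step 1)
B = Y    (Step 2)
For j = n-1 to 0 step -1 {    (Step 3)
A = A^2 mod N    (Step 4)
If X[j] = 1 then A = A*B mod N    (Step 5)
}
output A    (Step 6)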
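In this algorithm, n corresponds to the bit length of X, and (X[n-1]X[n-2]...X[1]X[0]) is the binary representation of X. Roughly speaking, the algorithm combines the modular squaring "A^2 mod N" (Step 4) and the modular multiplication "A = A*B mod N" (Step 5). If H(X) denotes the number of 1s among X[n-1], X[n-2], ..., X[1], X[0], the modular squaring is performed n times and the modular multiplication H(X) times.
We confirm with a numerical example that Algorithm 1 gives the correct result.
To check the algorithm it suffices to fix the exponent X, so Y and N are left as symbols.
S = Y^45 mod N
is computed according to Algorithm 1 (the running value A of Algorithm 1 is written S in this example). The exponent 45 can be written in binary as
45 = 101101 (binary).
Expressed in the notation of Algorithm 1, the exponent is therefore
X[5] = 1, X[4] = 0, X[3] = 1, X[2] = 1, X[1] = 0, X[0] = 1.
The number of bits n is 6. The initial value of S is 1.
(1) When j = 5
The initial value of S is 1, so squaring it and taking the remainder modulo N still gives 1 (Step 4).
Since X[5] = 1, Step 5 is executed, giving
S = 1*Y mod N
= Y mod N.
(2) When j = 4
At this point S = Y mod N, so squaring it and taking the remainder modulo N gives S = Y^2 mod N.
Since X[4] = 0, Step 5 is not executed and S remains Y^2 mod N.
(3) When j = 3
At this point S = Y^2 mod N, so squaring it and taking the remainder modulo N gives S = Y^4 mod N.
Since X[3] = 1, Step 5 is executed, giving
S = Y^4*Y mod N = Y^5 mod N.
(4) When j = 2
At this point S = Y^5 mod N, so squaring it and taking the remainder modulo N gives S = Y^10 mod N.
Since X[2] = 1, Step 5 is executed, giving
S = Y^10*Y mod N = Y^11 mod N.
(5) When j = 1
At this point S = Y^11 mod N, so squaring it and taking the remainder modulo N gives S = Y^22 mod N.
Since X[1] = 0, Step 5 is not executed and S remains Y^22 mod N.
(6) When j = 0
At this point S = Y^22 mod N, so squaring it and taking the remainder modulo N gives S = Y^44 mod N.
Since X[0] = 1, Step 5 is executed, giving
S = Y^44*Y mod N = Y^45 mod N,
which is the desired result.
Because Algorithm 1 decomposes modular exponentiation into modular multiplications (including squarings) in this way, an arithmetic unit with the function "A*B mod N" is sufficient.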
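As an illustration, the left-to-right square-and-multiply loop of Algorithm 1 can be sketched in Python as follows. The function name `modexp_left_to_right` and the use of Python's arbitrary-precision integers are assumptions made for this sketch; they are not part of the original description.

```python
def modexp_left_to_right(y: int, x: int, n: int) -> int:
    """Compute y**x mod n by scanning the bits of x from the most significant end."""
    a = 1 % n                                    # Step 1 (reduced so n == 1 also works)
    b = y % n                                    # Step 2
    for j in range(x.bit_length() - 1, -1, -1):  # Step 3: j = n-1 down to 0
        a = (a * a) % n                          # Step 4: modular squaring
        if (x >> j) & 1:                         # Step 5: multiply only when X[j] = 1
            a = (a * b) % n
    return a                                     # Step 6


# Exponent 45 = 101101 (binary), the exponent used in the worked example.
print(modexp_left_to_right(7, 45, 1009) == pow(7, 45, 1009))
```

The loop performs one squaring per exponent bit and one extra multiplication per 1 bit, which matches the operation counts n and H(X) stated for Algorithm 1.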
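However, A, B, and N are all very large values. If, for example, the data length is the currently mainstream 1024 bits, the intermediate result A*B is a large 2048-bit number. Furthermore, because the final result is the remainder of A*B divided by N, a division of a 2048-bit value by a 1024-bit value must be executed. Multiplication can be processed in parallel by a microprocessor or the like by splitting the multiplier and the multiplicand, but division is difficult to parallelize, and this has been an obstacle to speeding up the computation.
To solve the division problem in the modular multiplication "A*B mod N", Algorithm 2 is known, which computes "A*B*R^(-1) mod N" without dividing by N. Here R is 2^n (n is, for example, the bit length of N), a positive integer satisfying R > N. In the following, N is assumed to be odd. This assumption is reasonable for RSA and for elliptic curve cryptography over a prime field. Indeed, in RSA, N is a product of large primes, and in a prime field modular arithmetic is performed modulo a large prime, so N is likewise odd.
Algorithm 2 below was proposed by the mathematician Peter Montgomery. The details of the argument leading to Algorithm 2 are omitted here; they can be found, for example, in Montgomery's own paper, P. L. Montgomery, "Modular Multiplication without Trial Division", Math. Comp., vol. 44, 1985, pp. 519-521, or in the standard handbook of cryptology, A. Menezes, P. C. van Oorschot, S. A. Vanstone, "Handbook of Applied Cryptography", CRC Press, 1997, pp. 602-603.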
[Algorithm 2]
W = 0
For j = 0 to n-1 step +1{
T = W + A[j]*B
If T is odd then W = ( T + N )/2
Else W = T/2
}
If W >= N then W = W - N
Output W
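A minimal Python sketch of the bit-serial loop in Algorithm 2 is given below. It assumes that N is odd, that 0 <= A, B < N, and that R = 2^n with R > N; the function and parameter names are illustrative only.

```python
def mont_mul_bitwise(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Return a*b*R^(-1) mod n_mod with R = 2**n_bits, using no division by n_mod."""
    assert n_mod % 2 == 1 and 0 <= a < n_mod and 0 <= b < n_mod < (1 << n_bits)
    w = 0
    for j in range(n_bits):
        t = w + ((a >> j) & 1) * b   # T = W + A[j]*B
        if t & 1:                    # T odd: adding the odd modulus makes it even
            w = (t + n_mod) >> 1
        else:
            w = t >> 1
    if w >= n_mod:                   # W stays below 2*N, so one subtraction suffices
        w -= n_mod
    return w


n = 1009                                      # small odd modulus chosen for the demo
expected = (123 * 456 * pow(1 << 10, -1, n)) % n   # pow(..., -1, n) needs Python 3.8+
print(mont_mul_bitwise(123, 456, n, 10) == expected)
```

Each halving is exact because the value being halved is made even first, which is why the loop never needs a trial division by N.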
What Algorithm 2 yields is the value of A*B*R^(-1) mod N, so to execute the modular exponentiation of RSA with a device that performs this operation, a modified algorithm such as Algorithm 3 below must be used.
[Algorithm 3]
input Y, X=(X[n-1]X[n-2]...X[1]X[0]), N
output Y^X mod N
A = R mod N    (Step 1)
B = Y*R mod N    (Step 2)
For j = n-1 to 0 step -1 {    (Step 3)
A = (A^2)*R^(-1) mod N    (Step 4)
If X[j] = 1 then A = A*B*R^(-1) mod N    (Step 5)
}
A = A*R^(-1) mod N    (Step 6)
output A    (Step 7)
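During Steps 1 to 5 of Algorithm 3, the multiplier and the multiplicand take a form multiplied by R mod N. This data format is hereafter called the Montgomery form. The Montgomery multiplication "A*B*R^(-1) mod N" preserves the Montgomery form. Indeed, writing the Montgomery form of A as Mont(A) = A*R mod N, we have
Mont(A)*Mont(B)*R^(-1) mod N
= A*R*B*R*R^(-1) mod N
= A*B*R mod N
= Mont(A*B).
Therefore, given arithmetic units that compute, for example, "A*B*R^(-1) mod N", "(A^2)*R^(-1) mod N", and "A*R^(-1) mod N", the procedure of Algorithm 3 can be executed at high speed using the CPU and these units. Since "(A^2)*R^(-1) mod N" and "A*R^(-1) mod N" are obtained from "A*B*R^(-1) mod N" by setting B = A and B = 1 respectively, a single arithmetic unit for "A*B*R^(-1) mod N" is enough to realize modular exponentiation.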
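The following Python sketch illustrates how Algorithm 3 can be driven by a single Montgomery multiplication routine, with the squaring and the final conversion obtained by setting B = A and B = 1. The helper `mont_mul` is a plain bit-serial version of Algorithm 2; all names are assumptions made for illustration.

```python
def mont_mul(a: int, b: int, n: int, n_bits: int) -> int:
    """Bit-serial Montgomery product a*b*R^(-1) mod n, R = 2**n_bits, n odd."""
    w = 0
    for j in range(n_bits):
        t = w + ((a >> j) & 1) * b
        w = (t + n) >> 1 if t & 1 else t >> 1
    return w - n if w >= n else w


def mont_exp(y: int, x: int, n: int, n_bits: int) -> int:
    """Compute y**x mod n following Algorithm 3 (n odd, n > 1, n < 2**n_bits)."""
    r = 1 << n_bits
    a = r % n                                    # Step 1: Mont(1)
    b = (y * r) % n                              # Step 2: Mont(Y)
    for j in range(x.bit_length() - 1, -1, -1):  # Step 3
        a = mont_mul(a, a, n, n_bits)            # Step 4: B = A
        if (x >> j) & 1:
            a = mont_mul(a, b, n, n_bits)        # Step 5
    return mont_mul(a, 1, n, n_bits)             # Step 6: B = 1 leaves Montgomery form


print(mont_exp(7, 45, 1009, 10) == pow(7, 45, 1009))
```

The two conversions into Montgomery form in Steps 1 and 2 are written here with ordinary `%` for brevity; a hardware implementation would typically obtain them differently.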
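Next, we briefly summarize what has been done so far to speed up the algorithm for computing "A*B*R^(-1) mod N" shown as Algorithm 2.
For the explanation, Algorithm 2 is shown again.
[Algorithm 2 (repeated)]
W = 0    (Step 1)
For j = 0 to n-1 step +1 {    (Step 2)
T = W + A[j]*B    (Step 3)
If T is odd then W = ( T + N )/2    (Step 4)
Else W = T/2    (Step 5)
}
If W >= N then W = W - N    (Step 6)
Output W    (Step 7)
The well-known representative speed-up methods can be classified as follows.
First speed-up method: reduce the number of iterations of the loop containing Steps 3, 4, and 5.
Second speed-up method: parallelize the processing of Steps 3, 4, and 5.
Third speed-up method: replace the processing of Step 6 with something simpler.
A representative example of the first speed-up method performs the addition for several bits at once. For example, Algorithm 4 below processes k bits at a time.
[Algorithm 4]
For j = 0 to 2^k - 1 step +1 {    (Step 1)
TBL(j, B) = j*B    (Step 2)
TBL(j, N) = j*N    (Step 3)
}
W = 0    (Step 4)
For j = 0 to n/k-1 step +1 {    (Step 5)
T = W + TBL(A[j,k], B)    (Step 6)
U = T mod 2^k    (Step 7)
M[j,k] = -U*N[0,k]^(-1) mod 2^k    (Step 8)
W = T + TBL(M[j,k], N)    (Step 9)
W = W/2^k    (Step 10)
}
If W >= N then W = W - N    (Step 11)
Output W    (Step 12)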
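For simplicity, k is assumed to be a divisor of n. For example, when n = 1024, k may be chosen as 2, 4, 8, and so on. A[j,k] denotes the value of the j-th k-bit block of A counted from the least significant end. Likewise, M[j,k] denotes the j-th k-bit block of M from the bottom, and N[0,k] denotes the least significant k-bit block of N. The result of Step 8 is taken to be non-negative. In Algorithm 4, setting k = 1 makes M equal to U because N is odd, and Algorithm 2 is obtained. A realization of modular multiplication based on this method is shown in Japanese Unexamined Patent Application Publication No. H7-20778 (特開平7-20778), Masahiko Takenaka et al., "Remainder calculation device, table creation device, and multiplication remainder calculation device".
Algorithm 4 is somewhat complicated, so its mechanism is briefly explained.
Before explaining Algorithm 4, the principle of the Montgomery method must be understood. In principle, the Montgomery method is equivalent to solving the Diophantine equation
(Equation 1) A*B + M*N = W*R
with M and W as unknowns. First, note that since N is odd and therefore coprime to R = 2^n, (Equation 1) has an integer solution. When N is even, (Equation 1) does not necessarily have an integer solution, so the assumption that N is odd is essential.
FIG. 3 illustrates the meaning of (Equation 1).
Since R = 2^n, the lower n bits of the right-hand side of (Equation 1) are all 0, so finding M satisfying (Equation 1) amounts to choosing M so that the lower n bits of A*B + M*N become 0. The upper half of the resulting bit string is the desired value of A*B*R^(-1) mod N. However, the sum of two 2n-bit numbers can have 2n+1 bits, and the most significant bit may be 1 (hereafter this bit is called the OV bit). This is resolved in Step 11 of Algorithm 4.
Taking the residue of both sides of (Equation 1) modulo R = 2^n gives
(Equation 2) (A*B mod 2^n) + (M*N mod 2^n) = 0 (mod 2^n).
Rearranging (Equation 2) gives
(Equation 3) M = -A*B*N^(-1) mod 2^n.
Because the modulus is a power of 2, this M can be determined one bit block at a time, in order. This is the meaning of Algorithm 4. In Algorithm 4 the number of loop iterations is reduced to 1/k, so, roughly speaking, the loop runs about k times faster. However, increasing k increases the size of the multiplication table, which puts pressure on the RAM area. Devices such as IC cards generally have a small RAM area, so k often cannot be made large. This is the first problem.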
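The block-wise determination of M described for Algorithm 4 can be sketched in Python as follows. This is an illustrative model under the assumptions that N is odd, 0 <= A, B < N < 2^n, and k divides n; the function name and the use of `pow(x, -1, m)` (Python 3.8 or later) for the inverse of N[0,k] are choices made for this sketch.

```python
def mont_mul_kbit(a: int, b: int, n: int, n_bits: int, k: int) -> int:
    """Return a*b*R^(-1) mod n with R = 2**n_bits, consuming k bits of a per pass."""
    assert n % 2 == 1 and n_bits % k == 0 and 0 <= a < n and 0 <= b < n < (1 << n_bits)
    mask = (1 << k) - 1
    tbl_b = [j * b for j in range(1 << k)]      # Steps 1-3: TBL(j, B)
    tbl_n = [j * n for j in range(1 << k)]      #            TBL(j, N)
    n0_inv = pow(n & mask, -1, 1 << k)          # N[0,k]^(-1) mod 2^k
    w = 0                                       # Step 4
    for j in range(n_bits // k):                # Step 5
        t = w + tbl_b[(a >> (j * k)) & mask]    # Step 6
        u = t & mask                            # Step 7
        m = (-u * n0_inv) & mask                # Step 8, taken non-negative
        w = (t + tbl_n[m]) >> k                 # Steps 9-10: the low k bits are zero
    if w >= n:                                  # Step 11
        w -= n
    return w


n = 1009
expected = (123 * 456 * pow(1 << 10, -1, n)) % n
print(mont_mul_kbit(123, 456, n, 10, 2) == expected)
```

The two tables together hold 2 * 2^k multi-word entries, which makes the RAM cost of a larger k visible directly in the sketch.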
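The second speed-up method can be combined with the first. In Algorithm 4, the value of M is determined as soon as the lower k bits of T are fixed (Step 8), so the processing of Step 9 can start immediately afterwards. Accordingly, a circuit with two n-bit adders that executes Step 6 and Step 9 in parallel runs almost twice as fast. This approach, however, requires twice as many adders. This not only doubles the amount of arithmetic hardware but also doubles the power consumption during execution. This is the second problem.
The third speed-up method is independent of the first and second.
As is well known, a comparison such as Step 11 of Algorithm 4 is accomplished by actually performing a subtraction and examining the negative flag. Because this is a subtraction between large numbers of 1024 bits or so, its processing time may not be negligible. On a computer, by contrast, a comparison with R = 2^n is easy.
A speed-up can therefore be obtained by using Algorithm 5, in which the final compare-and-subtract step (Step 11) is changed as follows. This speed-up is already described in Japanese Unexamined Patent Application Publication No. H10-21057 (特開平10-21057), Kunihiko Nakada, "Data Processor and Microcomputer", and in Kunihiko Nakada, DATA PROCESSOR AND MICROPROCESSOR, United States Patent US005961578A.
[Algorithm 5]
For j = 0 to 2^k - 1 step +1 {    (Step 1)
TBL(j, B) = j*B    (Step 2)
TBL(j, N) = j*N    (Step 3)
}
W = 0    (Step 4)
For j = 0 to n/k-1 step +1 {    (Step 5)
T = W + TBL(A[j,k], B)    (Step 6)
U = T mod 2^k    (Step 7)
M[j,k] = -U*N[0,k]^(-1) mod 2^k    (Step 8)
W = T + TBL(M[j,k], N)    (Step 9)
W = W/2^k    (Step 10)
}
If W >= R then W = W - N    (Step 11)
Output W    (Step 12)
Here the comparison in Step 11 can be realized simply by checking whether the OV bit in FIG. 3 is 1. This is much lighter than a subtraction and therefore fast. However, the value obtained by Algorithm 5 may be A*B*R^(-1) mod N + N rather than A*B*R^(-1) mod N itself. Since the bit length remains n bits, it suffices in modular exponentiation to correct the value once at the end.
The present invention is unrelated to this third speed-up method and concerns the first and second speed-up methods.
Problems 1 and 2 above also arise in Galois fields of characteristic 2, which appear in elliptic curve cryptosystems.
The concept of a Galois field is briefly explained. A Galois field is itself a purely mathematical concept, but by constructing a suitable isomorphism it can be translated into concrete operations. The simplest way to realize a Galois field on a computer is to use modular arithmetic on polynomials.
First, consider the set of all polynomials whose coefficients are 0 or 1, and call this set POLY. Choose an arbitrary irreducible polynomial F(X) of degree n from POLY. For two elements A(X), B(X) of POLY, define their sum (which is the same as their difference) as the ordinary polynomial sum with coefficients added by exclusive OR, and define their product as A(X)*B(X) mod F(X). As is well known in algebra, every element of POLY that is not a multiple of F(X) has an inverse with respect to this product. Two elements A(X) and B(X) of POLY are defined to be equivalent when their difference A(X)-B(X) is a multiple of F(X), and the set of equivalence classes of POLY under this equivalence relation is written GF(2^n). This is the Galois field of characteristic 2. It is well known that the algebraic structure of GF(2^n) does not depend on the choice of the irreducible polynomial F(X) of degree n. In implementations, a polynomial with 3 terms or 5 terms is often chosen. This polynomial F(X) is sometimes called the reduction polynomial.
Multiplication in a Galois field is realized by performing the additions that appear in ordinary multiplication as exclusive ORs. If carry propagation is suppressed in the adder described above, the adder performs a bitwise exclusive OR, and Galois field multiplication is obtained.
The Montgomery method involves essentially the same operations. The condition "the modulus N is odd" required by the ordinary Montgomery method is replaced by the condition that the constant term of F(X) is 1; this follows automatically from the irreducibility of F(X), so the Montgomery method can be applied without difficulty. R = 2^n of the ordinary Montgomery method is replaced by R(X) = X^n. The determination of M, which is central to the Montgomery method, can likewise be realized without carry propagation. The Montgomery method over a Galois field can therefore be realized as follows.
[Algorithm 6]
W(X) = 0    (Step 1)
For j = 0 to n-1 step +1 {    (Step 2)
T(X) = W(X) + A[j]*B(X)    (Step 3)
If T(X) mod X is 1 then W(X) = (T(X) + F(X))/X    (Step 4)
Else W(X) = T(X)/X    (Step 5)
}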
Here A[j] is the coefficient of the degree-j term of A(X), and sums are coefficient-wise exclusive ORs. Algorithm 6 does not need the subtraction present in the ordinary Montgomery method because a Galois field has no ordering other than by degree: if the degrees of A(X) and B(X) are less than n (which always holds in GF(2^n)), the W(X) constructed by Algorithm 6 has degree less than n and is therefore already a residue modulo F(X). Hence the third problem, which arises when the Montgomery method is used for ordinary modular multiplication, does not exist in a Galois field.
We confirm with a numerical example that Algorithm 6 gives the correct answer. For simplicity, take GF(2^4) with F(X) = X^4 + X + 1. For A(X) = X^3 + X^2 + 1 and B(X) = X^3 + X + 1, we compute the product of A and B in GF(2^4), that is, A(X)*B(X) mod F(X).
Noting that A[0] = 1, A[1] = 0, A[2] = 1, A[3] = 1, the steps of Algorithm 6 are executed in order.
(1) When j = 0
Since W(X) = 0 and A[0] = 1, Step 3 gives
T(X) = 0 + 1*(X^3 + X + 1)
= X^3 + X + 1.
Its constant term is 1, so Step 4 is executed, giving
W(X) = (X^3 + X + 1 + X^4 + X + 1)/X
= (X^4 + X^3 + 2*X + 2)/X
= (X^4 + X^3)/X
= X^3 + X^2.
Note that a multiple of 2 is treated as 0.
(2) When j = 1
Since W(X) = X^3 + X^2 and A[1] = 0, Step 3 gives
T(X) = X^3 + X^2 + 0*(X^3 + X + 1)
= X^3 + X^2.
Its constant term is 0, so Step 5 is executed, giving
W(X) = (X^3 + X^2)/X
= X^2 + X.
(3) When j = 2
Since W(X) = X^2 + X and A[2] = 1, Step 3 gives
T(X) = X^2 + X + 1*(X^3 + X + 1)
= X^3 + X^2 + 2*X + 1
= X^3 + X^2 + 1.
Its constant term is 1, so Step 4 is executed, giving
W(X) = (X^3 + X^2 + 1 + X^4 + X + 1)/X
= (X^4 + X^3 + X^2 + X + 2)/X
= (X^4 + X^3 + X^2 + X)/X
= X^3 + X^2 + X + 1.
(4) When j = 3
Since W(X) = X^3 + X^2 + X + 1 and A[3] = 1, Step 3 gives
T(X) = X^3 + X^2 + X + 1 + 1*(X^3 + X + 1)
= 2*X^3 + X^2 + 2*X + 2
= X^2.
Its constant term is 0, so Step 5 is executed, giving
W(X) = X^2/X
= X.
The W(X) obtained above should equal A(X)*B(X)*R(X)^(-1) mod F(X), so multiplying by R(X) = X^4 to remove the factor R(X)^(-1) mod F(X) gives
X*X^4 mod (X^4 + X + 1)
= X*(X+1) = X^2 + X.
To check whether this result is correct, A(X)*B(X) mod F(X) is computed in the usual way.
A(X)*B(X)
= (X^3 + X^2 + 1)*(X^3 + X + 1)
= X^6 + X^5 + X^4 + 3*X^3 + X^2 + X + 1
= X^6 + X^5 + X^4 + X^3 + X^2 + X + 1,
so
A(X)*B(X) mod F(X)
= X^6 + X^5 + X^4 + X^3 + X^2 + X + 1 mod (X^4 + X + 1)
= (X^2 + X + 1)*(X^4 + X + 1) + X^2 + X mod (X^4 + X + 1)
= X^2 + X.
This is indeed the same polynomial as the value computed by Algorithm 6 multiplied by R(X).
In the Galois field computations above, 2 is replaced by 0 (and 3 by 1) because the sums are computed by exclusive OR.
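The worked example can be replayed with the short Python sketch below, which stores a polynomial over GF(2) as an integer whose bit j is the coefficient of X^j. This encoding and both function names are assumptions made for illustration; `gf2_mulmod` is an ordinary shift-and-reduce multiplication used only to cross-check the Montgomery result.

```python
def gf2_mont_mul(a: int, b: int, f: int, n: int) -> int:
    """Algorithm 6: return A(X)*B(X)*X^(-n) mod F(X); F must have constant term 1."""
    assert f & 1
    w = 0
    for j in range(n):
        t = w ^ (b if (a >> j) & 1 else 0)  # Step 3: addition is XOR
        if t & 1:                           # Step 4: constant term 1, add F(X)
            t ^= f
        w = t >> 1                          # division by X is a 1-bit right shift
    return w


def gf2_mulmod(a: int, b: int, f: int, n: int) -> int:
    """Reference product A(X)*B(X) mod F(X), with deg A < n and deg F = n."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:                    # reduce as soon as degree n appears
            a ^= f
    return r


F = 0b10011   # X^4 + X + 1
A = 0b1101    # X^3 + X^2 + 1
B = 0b1011    # X^3 + X + 1
w = gf2_mont_mul(A, B, F, 4)
print(bin(w))                                                    # 0b10, i.e. X
print(gf2_mulmod(w, 0b10000, F, 4) == gf2_mulmod(A, B, F, 4))    # True
```

The second line multiplies the Montgomery result by R(X) = X^4 and compares it with the direct product X^2 + X, mirroring the check made in the text.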
This algorithm can be modified to process k bits at a time in the same way as for ordinary modular multiplication. The Montgomery algorithm for GF(2^n) is shown below.
[Algorithm 7]
For j = 0 to 2^k - 1 step +1 {    (Step 1)
TBL(H[j](X), B(X)) = H[j](X)*B(X)    (Step 2)
TBL(H[j](X), F(X)) = H[j](X)*F(X)    (Step 3)
}
W(X) = 0    (Step 4)
For j = 0 to n/k-1 step +1 {    (Step 5)
T(X) = W(X) + TBL(A[j,k](X), B(X))    (Step 6)
U(X) = T(X) mod X^k    (Step 7)
M[j,k](X) = U(X)*F[0,k](X)^(-1) mod X^k    (Step 8)
W(X) = T(X) + TBL(M[j,k](X), F(X))    (Step 9)
W(X) = W(X)/X^k    (Step 10)
}
Here the polynomial H[j](X) in Algorithm 7 is obtained by associating with the binary expansion j = (j[k-1]j[k-2]...j[0]) the polynomial H[j](X) = j[k-1]*X^(k-1) + j[k-2]*X^(k-2) + ... + j[0]. A[j,k](X) is the polynomial corresponding to the j-th k-bit block when the polynomial A(X) is represented as a bit string, and F[0,k](X) and M[j,k](X) are defined in the same way. The sums in Steps 6 and 9 are term-wise exclusive ORs. A numerical example is omitted.
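Since the text omits a numerical example for Algorithm 7, the following Python sketch offers one under stated assumptions: polynomials are stored as integers (bit j is the coefficient of X^j), the carry-less product `clmul` is a helper introduced here, and the inverse of F[0,k](X) modulo X^k is found by exhaustive search over the 2^k candidates, which is practical only for small k.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) product of a and b, without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def gf2_mont_mul_kbit(a: int, b: int, f: int, n: int, k: int) -> int:
    """Algorithm 7: A(X)*B(X)*X^(-n) mod F(X), k coefficients of A per pass."""
    assert f & 1 and n % k == 0
    mask = (1 << k) - 1
    tbl_b = [clmul(j, b) for j in range(1 << k)]   # Steps 1-3: H[j](X)*B(X)
    tbl_f = [clmul(j, f) for j in range(1 << k)]   #            H[j](X)*F(X)
    # F[0,k](X)^(-1) mod X^k by brute force (illustrative shortcut)
    f0_inv = next(c for c in range(1 << k) if clmul(c, f & mask) & mask == 1)
    w = 0                                          # Step 4
    for j in range(n // k):                        # Step 5
        t = w ^ tbl_b[(a >> (j * k)) & mask]       # Step 6
        u = t & mask                               # Step 7
        m = clmul(u, f0_inv) & mask                # Step 8 (no sign: -1 = +1 in GF(2))
        w = (t ^ tbl_f[m]) >> k                    # Steps 9-10
    return w


# Same inputs as the GF(2^4) example for Algorithm 6, processed 2 bits at a time.
print(bin(gf2_mont_mul_kbit(0b1101, 0b1011, 0b10011, 4, 2)))    # 0b10, i.e. X
```

With k = 2 the loop runs twice instead of four times and reproduces the value X obtained bit by bit with Algorithm 6.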
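[0003]
[Problems to Be Solved by the Invention]
The present invention relates to a microcomputer for executing modular multiplication and modular exponentiation on large numbers, and to a method of executing them. It is based on Montgomery's method (P. L. Montgomery, "Modular Multiplication without Trial Division", Math. Comp., vol. 44, 1985, pp. 519-521) and aims to solve the two problems described in [Prior Art] that arise when this method is implemented on such a microcomputer, namely:
(First problem) the multiplication table required to process several bits at once becomes very large;
(Second problem) when A*B and M*N are processed simultaneously to double the speed, the amount of arithmetic hardware and the power consumption also double;
and it aims to solve the same problems for multiplication in the Galois field GF(2^n).
The above and other objects and novel features of the present invention will become apparent from the description of this specification and the accompanying drawings.
[0004]
[Means for Solving the Problems]
An outline of the invention disclosed in this application is as follows.
Consider Algorithm 4 of the Montgomery method, shown again below.
[Algorithm 4 (repeated)]
For j = 0 to 2^k - 1 step +1 {    (Step 1)
TBL(j, B) = j*B    (Step 2)
TBL(j, N) = j*N    (Step 3)
}
W = 0    (Step 4)
For j = 0 to n/k-1 step +1 {    (Step 5)
T = W + TBL(A[j,k], B)    (Step 6)
U = T mod 2^k    (Step 7)
M[j,k] = -U*N[0,k]^(-1) mod 2^k    (Step 8)
W = T + TBL(M[j,k], N)    (Step 9)
W = W/2^k    (Step 10)
}
If W >= N then W = W - N    (Step 11)
Output W    (Step 12)
As stated in [Prior Art], Steps 6 and 9 of Algorithm 4 are additions of large numbers, and most of the hardware can be regarded as consisting of the adders for them.
Steps 6 and 9 process a k-bit block of A*B and of M*N respectively, so they may be merged into a single addition for the k-bit block of A*B + M*N. The additions in each loop iteration have the forms "(k-bit number) × B" and "(k-bit number) × N", so the new table that is required has the following form.
TBL(j, t) = j*B + t*N    (j, t = 0, 1, 2, ..., 2^k - 1)
Selecting an entry of this table requires two parameters, j and t. They are chosen so that, when the entry is added to W, the lower k bits of the sum become 0.
That is, in the j-th step, the table entry
TBL(A[j,k], M[j,k]) = A[j,k]*B + M[j,k]*N
is selected. M[j,k] also depends on A[j,k]. Algorithm 8 below rearranges the table in advance so that M need not be computed.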
[Algorithm 8]
For j = 0 to 2^k-1 step +1 {    (Step 1)
For s = 0 to 2^k-1 step +1 {    (Step 2)
U = (s + j*B[0, k]) mod 2^k    (Step 3)
M = -U*N[0,k]^(-1) mod 2^k    (Step 4)
TBL[j, s] = j*B + M*N    (Step 5)
}
}
W = 0    (Step 6)
For i = 0 to n/k-1 step +1 {    (Step 7)
W = W + TBL[A[i, k], W[0, k]]    (Step 8)
W = W/2^k    (Step 9)
}
If W >= N then W = W - N    (Step 10)
Output W    (Step 11)
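Algorithm 8 expresses the essence of the present invention. The additions that appeared in two places in Algorithm 4, namely Steps 6 and 9, become a single addition in Algorithm 8. If Steps 6 and 9 of Algorithm 4 were computed one after the other, adopting Algorithm 8 roughly doubles the computation speed. If, as in many coprocessors, Steps 6 and 9 of Algorithm 4 are executed in parallel, two adders are required, whereas adopting Algorithm 8 halves the number of adders and, as a direct consequence, reduces power consumption.
As explained in [Prior Art], if carry propagation is suppressed in Algorithm 8, multiplication in GF(2^n) is obtained. The effect of the present invention described above for ordinary modular multiplication therefore also applies to multiplication circuits for GF(2^n).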
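To make the merged table concrete, the Python sketch below models Algorithm 8 in software. It assumes N odd, 0 <= A, B < N < 2^n, and k dividing n; the function names are hypothetical, and the nested list stands in for the register table of the hardware.

```python
def build_merged_table(b: int, n: int, k: int) -> list:
    """Steps 1-5: TBL[j][s] = j*B + M*N, where M clears the low k bits of s + j*B."""
    mask = (1 << k) - 1
    n0_inv = pow(n & mask, -1, 1 << k)          # N[0,k]^(-1) mod 2^k (Python 3.8+)
    b0 = b & mask                               # B[0,k]
    tbl = [[0] * (1 << k) for _ in range(1 << k)]
    for j in range(1 << k):
        for s in range(1 << k):
            u = (s + j * b0) & mask             # Step 3
            m = (-u * n0_inv) & mask            # Step 4
            tbl[j][s] = j * b + m * n           # Step 5
    return tbl


def mont_mul_merged(a: int, b: int, n: int, n_bits: int, k: int) -> int:
    """Algorithm 8: a*b*R^(-1) mod n with one table lookup and one addition per pass."""
    assert n % 2 == 1 and n_bits % k == 0 and 0 <= a < n and 0 <= b < n < (1 << n_bits)
    mask = (1 << k) - 1
    tbl = build_merged_table(b, n, k)
    w = 0                                                  # Step 6
    for i in range(n_bits // k):                           # Step 7
        w = (w + tbl[(a >> (i * k)) & mask][w & mask]) >> k   # Steps 8-9
    return w - n if w >= n else w                          # Step 10


n = 1009
expected = (123 * 456 * pow(1 << 10, -1, n)) % n
print(mont_mul_merged(123, 456, n, 10, 2) == expected)
```

The inner loop contains a single wide addition, at the cost of a table with 4^k entries that must be rebuilt whenever B changes.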
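[0005]
[Embodiments of the Invention]
Algorithm 8 is shown again below. A method of constructing a circuit according to this algorithm is then described.
[Algorithm 8 (repeated)]
For j = 0 to 2^k-1 step +1 {    (Step 1)
For s = 0 to 2^k-1 step +1 {    (Step 2)
U = (s + j*B[0, k]) mod 2^k    (Step 3)
M = -U*N[0,k]^(-1) mod 2^k    (Step 4)
TBL[j, s] = j*B + M*N    (Step 5)
}
}
W = 0    (Step 6)
For i = 0 to n/k-1 step +1 {    (Step 7)
W = W + TBL[A[i, k], W[0, k]]    (Step 8)
W = W/2^k    (Step 9)
}
If W >= N then W = W - N    (Step 10)
Output W    (Step 11)
The processing of Algorithm 8 consists broadly of three parts: the "table generation unit" of Steps 1 to 5, the "addition unit" of Steps 6 to 9, and the "value correction unit" of Steps 10 and 11.
The highly general "addition unit" is described first, followed by the "value correction unit" and finally the "table generation unit", after which a circuit configuration integrating them is described.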
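Various ways of constructing an adder are known; a representative one is described here.
An adder is built by lining up 1-bit adders. Consider a circuit that adds two m-bit numbers S and T. The following description uses their binary representations S = (S[m-1]S[m-2]...S[1]S[0]), T = (T[m-1]T[m-2]...T[1]T[0]) and their sum Z = (Z[m]Z[m-1]...Z[1]Z[0]). When the addition is decomposed bit by bit, the values needed to add the j-th bit are S[j], T[j], and the carry C[j-1] produced by the preceding stage, where C[-1] = 0 and Z[m] = C[m-1].
FIG. 4 shows an example of a full adder that adds 1 bit. The j-th 1-bit full adder can be regarded as a circuit that takes S[j], T[j], and the carry signal C[j-1] of the preceding stage as inputs and outputs the j-th bit Z[j] of Z and the carry signal C[j].
The circuit of FIG. 4 implements the logical expressions
(Equation 4) Z[j] = (S[j] AND T[j] AND C[j-1]) OR (S[j] AND NOT(T[j]) AND NOT(C[j-1])) OR (NOT(S[j]) AND T[j] AND NOT(C[j-1])) OR (NOT(S[j]) AND NOT(T[j]) AND C[j-1]),
(Equation 5) C[j] = NOT((S[j] AND NOT(T[j]) AND NOT(C[j-1])) OR (NOT(S[j]) AND T[j] AND NOT(C[j-1])) OR (NOT(S[j]) AND NOT(T[j]) AND C[j-1]) OR (NOT(S[j]) AND NOT(T[j]) AND NOT(C[j-1]))).
In FIG. 4, 401, 402, 403, 404, 405 are AND circuits, 406 and 407 are OR circuits, and 409, 410, 411, 412, 413, 414, 415, 416, 417 are inverters (NOT circuits). These correspond to the AND, OR, and NOT of (Equation 4) and (Equation 5). AND, OR, and NOT are logic gates, each built from several transistors. An AND gate outputs 1 only when all of its inputs are 1 and outputs 0 otherwise; an OR gate outputs 1 if at least one input is 1 and outputs 0 only when all inputs are 0. A NOT gate (inverter) outputs 1 for input 0 and 0 for input 1.
Lining up 1-bit full adders yields a multi-bit adder. FIG. 5 shows a 4-bit adder. 501, 502, 503, 504 are the 1-bit full adders shown in FIG. 4, and each computes according to (Equation 4) and (Equation 5). The carry into the least significant bit is set to 0, and the final carry becomes Z[4]. The 4-bit case is shown here, but it is clear that any number of bits can be added by increasing the number of 1-bit full adders.
The adder construction shown here is the simplest one, called a ripple carry adder; the carry propagation time accounts for a large share of the delay, and it is the slowest type of adder. Many hardware algorithms for faster addition exist, but the construction of the adder is unrelated to the essence of the present invention, so further explanation is omitted.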
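The ripple carry structure can be modelled in a few lines of Python. In this sketch the sum bit is written as a three-input XOR and the carry as a majority function, which are logically equivalent to (Equation 4) and (Equation 5); the function names are illustrative.

```python
def full_adder(s: int, t: int, c_in: int) -> tuple:
    """1-bit full adder: returns (sum bit Z[j], carry-out C[j])."""
    z = s ^ t ^ c_in                           # 1 when an odd number of inputs are 1
    c_out = (s & t) | (s & c_in) | (t & c_in)  # 1 when at least two inputs are 1
    return z, c_out


def ripple_carry_add(s: int, t: int, m: int, carry_in: int = 0) -> int:
    """Add two m-bit numbers by chaining m full adders; the result has m+1 bits."""
    c = carry_in                               # C[-1]
    z = 0
    for j in range(m):
        bit, c = full_adder((s >> j) & 1, (t >> j) & 1, c)
        z |= bit << j
    return z | (c << m)                        # Z[m] = C[m-1]


print(ripple_carry_add(0b1011, 0b0110, 4))     # 17
```

Because each stage waits for the carry of the stage below it, the delay grows linearly with m, which is the weakness noted in the text.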
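Next, the processing of the value correction unit is described. The value correction unit uses subtraction. A subtracter has a structure similar to an adder and can be built from a ripple carry adder. For this reason, ordinary computers have an addition circuit but no separate subtraction circuit.
FIG. 6 shows a configuration example of a 4-bit subtraction circuit (Z = S - T) built from full adders. 601, 602, 603, 604 are 1-bit full adders; the differences from the adder are that the initial carry (borrow) input is 1 and that T is inverted by inverters 605, 606, 607, 608. In subtraction, a final carry (borrow) of 1 means that the result is negative. This phenomenon is called underflow.
A configuration example of the value correction unit is described using the subtracter concept above. FIG. 7 is a block diagram of the value correction circuit. The (n+1)-bit register 702 holds the value W and the n-bit register 703 holds the value of the modulus N, and the (n+1)-bit subtracter 704 computes W-N from them. The value W-N, the borrow produced by this operation (used as the selection signal), and the value W are input to selector 705. When the selection signal is 0, selector 705 selects and outputs W-N; when it is 1, it selects and outputs W.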
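A behavioural Python sketch of the subtracter and the value correction selector follows. It assumes that the borrow signal is the complement of the adder's carry-out, so that borrow = 1 indicates a negative result; this reading, like the function names, is an assumption made for the sketch.

```python
def subtract_with_borrow(s: int, t: int, m: int) -> tuple:
    """m-bit S - T computed as S + NOT(T) + 1; returns (difference, borrow)."""
    mask = (1 << m) - 1
    total = (s & mask) + (~t & mask) + 1   # inverted T and an initial carry of 1
    diff = total & mask
    borrow = 1 - (total >> m)              # assumed: borrow is the inverted carry-out
    return diff, borrow


def value_correction(w: int, n: int, n_bits: int) -> int:
    """Selector of FIG. 7: output W - N when no borrow occurs, otherwise W."""
    diff, borrow = subtract_with_borrow(w, n, n_bits + 1)
    return w if borrow else diff


print(value_correction(1500, 1009, 10))    # 491
print(value_correction(500, 1009, 10))     # 500
```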
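W and N have been described here as dedicated registers, but they can of course be placed in the RAM used by the CPU.
Next, the circuit of the table generation unit is described. For simplicity, only the case k = 2 is shown.
Listing the values generated by the table generation unit for k = 2 gives the table in FIG. 8. In this configuration example, the multiples N, 2*N, 3*N of the modulus N, which do not depend on the inputs A and B, are computed in advance and held in registers 909, 910, 911. The input B is the input value itself and need not be computed; it is held in register 906.
The 11 values other than 0 and the four values B, N, 2*N, 3*N must be computed each time. The configuration shown here computes these 11 values using only an adder. The registers must be made up to (k+1 =) 3 bits wider. Each register is connected to data bus 921, which serves as the signal line supplying data to the addition unit. 904 is an adder for (n+2)-bit numbers and sends the (n+3)-bit sum to selector 903.
Control unit 905 receives a calculation start signal derived from the clock signal, initializes its internal counter to 0001 in binary, and, via selector 901, supplies the value B to both inputs of adder 904 and starts it. Immediately before the calculation starts in adder 904, the adder sends a control signal to control unit 905. On receiving this signal, control unit 905 increments its internal counter to 0010 in binary and takes the lower 2 bits (10) and the upper 2 bits (00) as j and t in j*B + t*N, respectively; selector 903 thereby determines register 907 as the destination for the output of adder 904 and writes the sum there. Next, a calculation start signal is again sent to control unit 905, which sends a control signal to selector 901, reads B and 2*B, inputs them to adder 904, and by the same operation writes the value 3*B to register 908. At this point all the values needed to compose the table, B, 2*B, 3*B, N, 2*N, 3*N, are available. Thereafter the registers to read and to write are determined from the counter value in the same way, and the same operation is repeated. When the lower 2 bits of the counter are 00, the addition is skipped. Once all values are available, registers 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920 are used as read-only registers.
Next, FIG. 10 shows a block diagram of a circuit that combines these units to compute A*B*R^(-1) mod N and writes the result to the A register. The value correction unit is omitted to avoid clutter. In general, the subtraction required by the value correction unit is realized by changing the input values of adder 1015.
FIG. 10 contains all the important components of the present invention. Variable names below are the same as those used in Algorithm 8.
In FIG. 10, the value of A stored in A register 1003 is first read out in order from the least significant end. The configuration of the A register is shown in FIG. 11. Control unit 1007 receives the index i of the 2-bit block to be read from 9-bit register 1006, reads the value of that 2-bit block, and sends it to register 1008. Register 1009 holds the lower 2 bits of W. The values of 1008 and 1009 are sent to the value selection and signal control unit 1011, which reads the corresponding data from value table 1002 via data bus 1001 and sends it to adder 1015, which adds a 1027-bit value and a 1026-bit value. Dedicated registers 1013 and 1014 are connected to adder 1015, and the value read from the table is held temporarily in the 1027-bit register 1014. 1013 is a 1026-bit register that stores the W appearing in Algorithm 8; it is cleared to 0 at the start of operation. Adder 1015 adds W to the data of the form j*B + t*N taken from value table 1002. The lower 2 bits of this sum are 00 in binary, so they are cut off by the 2-bit right shifter 1016 and the result is input to control unit 1017. Meanwhile, at the start of the calculation adder 1015 sends the value 1 to adder 1018, which adds the value of 9-bit register 1019 and the calculation start signal "1" and sends the result to 9-bit register 1019. Register 1019 initially holds 0. This sum is sent to control unit 1017. If the sum has not reached 512, control unit 1017 writes the output of 1016 directly to register 1013, and 1011 extracts only the lower 2 bits of W from the value of 1013 and writes them to register 1009. At the same time, a signal is sent from 1017 through 1011 to 1007, adder 1005 increments the value of i, the next 2-bit block of the A register is read, and the addition with the table value is performed through the same process as described above. If the value of 1019 has reached 512, a signal is sent to the selective pass circuit 1012 and the value of W in 1013 is sent to the A register. The value of the A register at this point is either A*B*R^(-1) mod N or A*B*R^(-1) mod N + N. In the latter case N must be subtracted from A. The subtraction unit is omitted. This is one embodiment of claims 1 and 7 of the present invention.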
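The above embodiment can be modified in various ways without departing from the gist of the present invention. For example, the following modifications are conceivable.
(1) Adder 1015 is configured here to add 1027 bits simultaneously, but it can also be realized by dividing the operands into several blocks and adding them block by block in order. This reduces the amount of hardware but lowers the speed.
(2) Instead of using the 2-bit right shifter 1016, only the bits above the lower 2 bits can be wired directly to register 1013.
(3) The number of bits of the values A, B, N can be changed, for example to 2048, 512, or 768 bits.
(4) The value of k can be changed to a value other than 2, for example 1, 3, or 4. If n is not a multiple of k, a logic circuit for correction is required.
(5) The entry 0 can be removed from the table, and instead a logic circuit can be added that, when the value to be added to W is 0, bypasses the adder and feeds W directly to the shifter.
These modifications can be made independently of "merging the addition tables", which is the essential point of the present invention.
Next, a modification that, unlike the above, requires some care is described.
Algorithm 8 is shown once more.
[Algorithm 8 (repeated)]
For j = 0 to 2^k-1 step +1 {    (Step 1)
For s = 0 to 2^k-1 step +1 {    (Step 2)
U = (s + j*B[0, k]) mod 2^k    (Step 3)
M = -U*N[0,k]^(-1) mod 2^k    (Step 4)
TBL[j, s] = j*B + M*N    (Step 5)
}
}
W = 0    (Step 6)
For i = 0 to n/k-1 step +1 {    (Step 7)
W = W + TBL[A[i, k], W[0, k]]    (Step 8)
W = W/2^k    (Step 9)
}
If W >= N then W = W - N    (Step 10)
Output W    (Step 11)
Here the value of the table entry TBL[j, s] is at most (2^k-1)*B + (2^k-1)*N. This value has up to n+k+1 bits, so the bit length of TBL[j, s] must be increased according to k. Since there are 4^k entries, the increase in data with k reaches k*4^k bits. Because k is generally not large, this is not a serious obstacle to realizing the present invention, but it is not desirable. A way to avoid this problem exists.
Consider Steps 8 and 9. In Step 8, the value TBL[A[i, k], W[0, k]] is chosen so that the lower k bits of the sum in Step 8 are always 0. In Step 9, those k bits are then cut off by a right shift. This shows that the lower k bits need not be computed at all. Consequently, it suffices to prepare table values with their lower k bits cut off (shifted right by k bits).
Specifically, instead of the value "j*B + t*N", its k-bit right-shifted value "(j*B + t*N)>>k" is defined as the new table, and the same computation as in Algorithm 8 is performed, except that W is shifted right by k bits before the addition. That is, the computation follows the algorithm below.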
[Algorithm 9]
For j = 0 to 2^k-1 step +1 {    (Step 1)
For s = 0 to 2^k-1 step +1 {    (Step 2)
U = (s + j*B[0, k]) mod 2^k    (Step 3)
M = -U*N[0,k]^(-1) mod 2^k    (Step 4)
NTBL[j, s] = (j*B + M*N)>>k    (Step 5)
}
}
W = 0    (Step 6)
For i = 0 to n/k-1 step +1 {    (Step 7)
W = W/2^k    (Step 8)
W = W + NTBL[A[i, k], W[0, k]]    (Step 9)
}
If W >= N then W = W - N    (Step 10)
Output W    (Step 11)
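Here Step 8 need not use a shifter in an actual implementation. Since the right shift is always by exactly k bits, it is simply wired as a k-bit shift. The sum of the lower k bits is not computed. However, when at least one of the lower k bits of W is 1, that is, when the logical OR of the lower k bits is 1, this is input to the adder as the carry C[-1]. Indeed, when the OR is 0, all k bits are 0 and the lower k bits of the corresponding addend must also be 0; when the OR is 1, choosing M so that the lower k bits of the sum all become 0 always produces a carry. Note that W[0, k] in Step 9 refers to the lower k bits of W before the shift of Step 8.
FIG. 15 shows an embodiment of the addition circuit that takes these points into account, for the case k = 2. In the circuit of FIG. 15, NTBL register 1504, which stores the NTBL value, and W register 1503, which stores the value of W, are connected to adder 1502. The lower 2 bits of the W register are not connected to adder 1502; instead they are connected to the value selection and signal control unit and are used to select the NTBL value. In addition, these 2 bits are each connected to OR gate 1501, which computes their logical OR. The output of OR gate 1501 is input to adder 1502 as the carry signal C[-1], which is used as the initial carry of the addition. This is one embodiment of claims 4 and 7 of the present invention.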
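The role of the OR-derived carry can be checked with the Python model below. It assumes N odd, 0 <= A, B < N < 2^n, and k dividing n; the table index uses the low k bits of W before the shift, and the carry-in is the OR of those bits. All names are hypothetical.

```python
def mont_mul_truncated(a: int, b: int, n: int, n_bits: int, k: int) -> int:
    """Algorithm 9 sketch: table entries pre-shifted by k bits, carry-in from OR gate."""
    assert n % 2 == 1 and n_bits % k == 0 and 0 <= a < n and 0 <= b < n < (1 << n_bits)
    mask = (1 << k) - 1
    n0_inv = pow(n & mask, -1, 1 << k)               # Python 3.8+
    ntbl = [[0] * (1 << k) for _ in range(1 << k)]
    for j in range(1 << k):                          # Steps 1-5
        for s in range(1 << k):
            u = (s + j * (b & mask)) & mask
            m = (-u * n0_inv) & mask
            ntbl[j][s] = (j * b + m * n) >> k        # low k bits are never stored
    w = 0                                            # Step 6
    for i in range(n_bits // k):                     # Step 7
        s = w & mask                                 # low k bits of W select the entry
        carry = 1 if s else 0                        # OR of the low k bits (gate 1501)
        w = (w >> k) + ntbl[(a >> (i * k)) & mask][s] + carry   # Steps 8-9
    return w - n if w >= n else w                    # Step 10


n = 1009
expected = (123 * 456 * pow(1 << 10, -1, n)) % n
print(mont_mul_truncated(123, 456, n, 10, 2) == expected)
```

The carry is needed because the discarded low parts of W and of the table entry sum to exactly 2^k whenever the low bits of W are not all zero.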
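Next, an embodiment of the present invention for the Galois field GF(2^n), operating according to Algorithm 7 (repeated below), is described.
[Algorithm 7 (repeated)]
For j = 0 to 2^k - 1 step +1 {    (Step 1)
TBL(H[j](X), B(X)) = H[j](X)*B(X)    (Step 2)
TBL(H[j](X), F(X)) = H[j](X)*F(X)    (Step 3)
}
W(X) = 0    (Step 4)
For j = 0 to n/k-1 step +1 {    (Step 5)
T(X) = W(X) + TBL(A[j,k](X), B(X))    (Step 6)
U(X) = T(X) mod X^k    (Step 7)
M[j,k](X) = U(X)*F[0,k](X)^(-1) mod X^k    (Step 8)
W(X) = T(X) + TBL(M[j,k](X), F(X))    (Step 9)
W(X) = W(X)/X^k    (Step 10)
}
Operations in the Galois field GF(2^n) are as explained in [Prior Art]. The embodiment is therefore obtained by changing every addition in the circuit of the first embodiment described above into a bitwise exclusive OR.
Since multiplication consists of shifts and additions, it suffices to replace the additions with bitwise exclusive ORs. Needless to say, sums are also replaced by exclusive ORs when constructing the table. One difference is that, whereas the modulus N of the ordinary Montgomery method has n bits, a polynomial of degree n is used here and requires an (n+1)-bit representation. In addition, the final value correction is unnecessary for the Galois field GF(2^n). The rest of the configuration can be exactly the same as the circuit for the ordinary Montgomery method. This is shown in FIG. 12, where k = 2 and the degree of the reduction polynomial F(X) is assumed to be 257. Variable names below are the same as those used in Algorithm 7.
In FIG. 12, the value of A stored in A register 1203 is first read out in order from the least significant end. The configuration of the A register is the same as that shown in FIG. 11. Control unit 1207 receives the index i of the 2-bit block to be read from 7-bit register 1206, reads the value of that 2-bit block, and sends it to register 1208. Register 1209 holds the lower 2 bits of W. The values of 1208 and 1209 are sent to the value selection and signal control unit 1211, which reads the corresponding data from value table 1202 (table generation is described later) via data bus 1201 and sends it to the exclusive-OR unit 1215, which computes the exclusive OR of a 258-bit value and a 256-bit value. Dedicated registers 1213 and 1214 are connected to exclusive-OR unit 1215, and the value read from the table is held temporarily in the 258-bit register 1214. 1213 is a 258-bit register that stores the W(X) appearing in Algorithm 7; it is cleared to 0 at the start of operation. Exclusive-OR unit 1215 computes the exclusive OR of W(X) and the data of the form "(polynomial of degree at most 1)*B(X) + (polynomial of degree at most 1)*F(X)" taken from value table 1202 (see FIG. 13). The lower 2 bits of this result are 00 in binary, so they are cut off by the 2-bit right shifter 1216 and the result is input to control unit 1217. Meanwhile, at the start of the calculation exclusive-OR unit 1215 sends the value 1 to adder 1218, which adds the value of 7-bit register 1219 and the calculation start signal "1" and sends the result to 7-bit register 1219. Register 1219 initially holds 0. This sum is sent to control unit 1217. If the sum has not reached 128, control unit 1217 writes the output of 1216 directly to register 1213, and 1211 extracts only the lower 2 bits of W(X) from the value of 1213 and writes them to register 1209. At the same time, a signal is sent from 1217 through 1211 to 1207, the value of i is incremented by unit 1205, the next 2-bit block of the A register is read, and the exclusive OR with the table value is computed through the same process as described above. If the value of 1219 has reached 128, a signal is sent to the selective pass circuit 1212 and the value of W(X) in 1213 is sent to the A register. The value of the A register at this point is A(X)*B(X)*R(X)^(-1) mod F(X).
The table generation circuit is likewise obtained by replacing addition with exclusive OR. This is shown in FIG. 14.
In this configuration example, the multiples F(X), XF(X), (X+1)F(X) of the reduction polynomial F(X), which do not depend on the inputs A(X) and B(X), are computed in advance and held in registers 1409, 1410, 1411. The input B(X) is the input value itself and need not be computed; it is held in register 1406.
The 11 values other than 0 and the four values B(X), F(X), XF(X), (X+1)F(X) must be computed each time. The configuration shown here computes these 11 values using only an exclusive-OR unit. The registers must be made up to (k =) 2 bits wider. Each register is connected to data bus 1421, which serves as the signal line supplying data to the exclusive-OR processing unit. 1404 is the unit that adds 258-bit values by exclusive OR and sends the 258-bit result to selector 1403 (note that no carry occurs in GF(2^n), so the number of bits does not increase).
Control unit 1405 receives a calculation start signal derived from the clock signal, initializes its internal counter to 0001 in binary, and, via selector 1401, supplies the value B(X) to both inputs of exclusive-OR unit 1404 and starts it. Immediately before the calculation starts in exclusive-OR unit 1404, the unit sends a control signal to control unit 1405. On receiving this signal, control unit 1405 increments its internal counter to 0010 in binary and takes the lower 2 bits (10) and the upper 2 bits (00) as j(X) and t(X) in j(X)*B(X) + t(X)*F(X), respectively; selector 1403 thereby determines register 1407 as the destination for the output of exclusive-OR unit 1404 and writes the result there. Next, a calculation start signal is again sent to control unit 1405, which sends a control signal to selector 1401, reads B(X) and XB(X), inputs them to exclusive-OR unit 1404, and by the same operation writes the value (X+1)B(X) to register 1408. At this point all the values needed to compose the table, B(X), XB(X), (X+1)B(X), F(X), XF(X), (X+1)F(X), are available. Thereafter the registers to read and to write are determined from the counter value in the same way, and the same operation is repeated. When the lower 2 bits of the counter are 00, the operation is skipped. Once all values are available, registers 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420 are used as read-only registers. This is one embodiment of claims 2 and 7 of the present invention.
The above embodiment can be modified in various ways without departing from the gist of the present invention. For example, the following modifications are conceivable.
(1) Exclusive-OR unit 1215 is configured here to process 258 bits simultaneously, but it can also be realized by dividing the operands into several blocks and processing them block by block in order. This reduces the amount of hardware but lowers the speed.
(2) Instead of using the 2-bit right shifter 1216, only the bits above the lower 2 bits can be wired directly to register 1213.
(3) The number of bits of the values A(X), B(X), F(X) can be changed.
(4) The value of k can be changed to a value other than 2, for example 1, 3, or 4. If n is not a multiple of k, a logic circuit for correction is required.
(5) The entry 0 can be removed from the table, and instead a logic circuit can be added that, when the value to be added to W(X) is 0, bypasses the exclusive-OR unit and feeds W(X) directly to the shifter.
These modifications can be made independently of "merging the value tables", which is the essential point of the present invention.
For a Galois field as well, the table width can be reduced by k bits in the same way as for the ordinary Montgomery method. Apart from the carry and the final subtraction for value adjustment, the idea is exactly the same as in Algorithm 9. The algorithm is shown below.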
[Algorithm 10]
For j = 0 to 2^k - 1 step +1 {    (Step 1)
NTBL(H[j](X), B(X)) = H[j](X)*B(X)>>k    (Step 2)
NTBL(H[j](X), F(X)) = H[j](X)*F(X)>>k    (Step 3)
}
W(X) = 0    (Step 4)
For j = 0 to n/k-1 step +1 {    (Step 5)
T(X) = W(X) + TBL(A[j,k](X), B(X))    (Step 6)
U(X) = T(X) mod X^k    (Step 7)
M[j,k](X) = U(X)*F[0,k](X)^(-1) mod X^k    (Step 8)
W(X) = W(X) + NTBL(M[j,k](X), F(X))    (Step 9)
W(X) = W(X)/X^k    (Step 10)
}
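Here M[j,k](X) is determined simply by taking the lower k bits of the W(X) being processed as they are, and no carry occurs; this is the difference from the ordinary Montgomery method (the mathematical expression, however, is almost the same).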
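One possible reading of this truncated-table idea, combined with the merged table of Algorithm 8, is sketched below in Python. It is not a line-by-line transcription of Algorithm 10: the single merged table `ntbl`, the integer encoding of polynomials, the helper `clmul`, and the brute-force inverse are all assumptions made for illustration.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) product of a and b, without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def gf2_mont_mul_merged(a: int, b: int, f: int, n: int, k: int) -> int:
    """A(X)*B(X)*X^(-n) mod F(X) with a merged table whose low k bits are dropped."""
    assert f & 1 and n % k == 0
    mask = (1 << k) - 1
    f0_inv = next(c for c in range(1 << k) if clmul(c, f & mask) & mask == 1)
    ntbl = []
    for j in range(1 << k):
        row = []
        for s in range(1 << k):
            u = (s ^ clmul(j, b)) & mask             # low k bits after adding H[j]*B
            m = clmul(u, f0_inv) & mask              # multiple of F that cancels them
            row.append((clmul(j, b) ^ clmul(m, f)) >> k)
        ntbl.append(row)
    w = 0
    for i in range(n // k):
        s = w & mask                                 # low k bits of W select the entry
        w = (w >> k) ^ ntbl[(a >> (i * k)) & mask][s]   # XOR only, no carry-in needed
    return w


print(bin(gf2_mont_mul_merged(0b1101, 0b1011, 0b10011, 4, 2)))   # 0b10, i.e. X
```

Because XOR acts on each bit position independently, the dropped low bits cannot influence the retained bits, so no counterpart of the OR-gate carry is required.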
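FIG. 16 shows the Galois field computation circuit corresponding to FIG. 15. It differs from the circuit of FIG. 15 in that the adder is replaced by exclusive-OR unit 1601 and there is no OR circuit generating a carry signal. This is one embodiment of claims 4 and 7 of the present invention.
So far, the embodiment of the present invention for modular multiplication and the embodiment for the Galois field GF(2^n) have been shown separately, but the two can be combined.
The data lengths required for RSA and for elliptic curve cryptography differ markedly. In general, the modulus N of RSA needs about 1024 to 2048 bits for sufficient security. In contrast, the data length of elliptic curve cryptography with strength equivalent to RSA is about 160 to 256 bits. A configuration in which part of the RSA circuit is used as the GF(2^n) arithmetic unit is therefore possible.
That the sum operation of GF(2^n) coincides with ordinary addition in which all carries are set to 0 can be seen as follows.
For example, take the logical expression for the j-th bit among those corresponding to the circuit of FIG. 4:
(Equation 4 (repeated)) Z[j] = (S[j] AND T[j] AND C[j-1]) OR (S[j] AND NOT(T[j]) AND NOT(C[j-1])) OR (NOT(S[j]) AND T[j] AND NOT(C[j-1])) OR (NOT(S[j]) AND NOT(T[j]) AND C[j-1]).
Setting C[j-1] = 0 gives
(Equation 6) Z[j] = (S[j] AND T[j] AND 0) OR (S[j] AND NOT(T[j]) AND NOT(0)) OR (NOT(S[j]) AND T[j] AND NOT(0)) OR (NOT(S[j]) AND NOT(T[j]) AND 0) = (S[j] AND NOT(T[j])) OR (NOT(S[j]) AND T[j])
= S[j] EXOR T[j].
Here S[j] EXOR T[j] is the exclusive OR of S[j] and T[j]. That is, if all carries are set to 0, the addition becomes a bitwise exclusive OR.
When the polynomial representation A(X) = A[n-1]*X^(n-1) + A[n-2]*X^(n-2) + ... + A[1]*X + A[0] of an element A of GF(2^n) is associated with the bit string (A[n-1]A[n-2]...A[1]A[0]), the bitwise exclusive OR coincides with the sum operation in GF(2^n).
Therefore, addition in GF(2^n) (exclusive OR) can be realized by adding to the adder a circuit that sets the carry to 0. FIG. 17 shows this embodiment: the adder of FIG. 4 with a selector attached. The adders are connected in a chain, and the carry C[j] propagates to the next adder. Each selector receives the carry from the preceding adder and the value 0, and selects one of them according to whether the control signal is 1 or 0. For example, if the carry is selected when the control signal is 1 and the value 0 is selected when it is 0, the circuit functions as an ordinary adder in the former case and as a GF(2^n) adder (exclusive-OR unit) in the latter.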
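The selector behaviour of FIG. 17 can be imitated in Python as follows. The parameter name `propagate` stands for the control signal (1 selects the carry, 0 selects the constant 0) and is an assumption of this sketch, as is the function name.

```python
def carry_controlled_add(s: int, t: int, m: int, propagate: bool) -> int:
    """m-bit adder chain whose inter-stage carry can be forced to 0.

    propagate=True  -> ordinary integer addition (result has up to m+1 bits)
    propagate=False -> bitwise XOR, i.e. addition in GF(2^n)
    """
    c = 0
    z = 0
    for j in range(m):
        sj, tj = (s >> j) & 1, (t >> j) & 1
        z |= (sj ^ tj ^ c) << j                    # sum bit of the full adder
        c_out = (sj & tj) | (sj & c) | (tj & c)    # carry produced by this stage
        c = c_out if propagate else 0              # selector: carry or constant 0
    if propagate:
        z |= c << m                                # final carry becomes Z[m]
    return z


print(carry_controlled_add(0b1011, 0b0110, 4, True))    # 17 (integer sum)
print(carry_controlled_add(0b1011, 0b0110, 4, False))   # 13 (0b1101, bitwise XOR)
```

A single control input therefore lets the same adder chain serve both the modular multiplication and the GF(2^n) multiplication.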
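Combining these, FIG. 18 shows the configuration of an adder shared by the circuits of FIGS. 15 and 16. The adder with carry control in FIG. 18 is the same as that in FIG. 17, and a switching signal determines whether it functions as an adder for modular multiplication or as an adder for multiplication in the Galois field GF(2^n). This is one embodiment of claims 6 and 7 of the present invention.
The gist of the present invention does not depend on the value of k, but in an implementation k cannot be made arbitrarily large. In microcomputers for IC cards in particular, the RAM size is from several kilobytes to a little over ten kilobytes, which strongly restricts the value of k. The inventors have also found that, in comparison with other methods, the advantage of the present invention is greatest when k = 2. This case is therefore included in the claims as claim 7.
Needless to say, the hardware configuration for realizing modular exponentiation or multiplication in the Galois field GF(2^n) is not limited to the embodiments described above and can be changed as appropriate.
[0006]
[Effects of the Invention]
The effects obtained by representative aspects of the invention disclosed in this application are briefly as follows.
Modular exponentiation "Y^X mod N" and elliptic curve cryptography over the Galois field GF(2^n) can be processed at high speed.
In realizing the dedicated hardware described above for modular exponentiation or for multiplication over the Galois field GF(2^n), the scale of the logic circuit can be minimized.
Furthermore, by mounting the dedicated hardware on the same semiconductor chip as an IC card microcomputer, a microcomputer for encryption/decryption using modular exponentiation "Y^X mod N", or for encryption/decryption by elliptic curve cryptography over the Galois field GF(2^n), can be realized at low cost and in an easy-to-use form. Needless to say, the same effect can be expected not only for IC cards but also for devices that must perform cryptographic processing with low power consumption, such as GSM and other mobile terminals.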
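[Brief Description of the Drawings]
[FIG. 1] Overview of an IC card and its terminals.
[FIG. 2] Configuration of a microcomputer.
[FIG. 3] Principle of the Montgomery method.
[FIG. 4] Example of a 1-bit full adder.
[FIG. 5] 4-bit full adder.
[FIG. 6] 4-bit subtraction circuit using full adders.
[FIG. 7] Value correction circuit.
[FIG. 8] Value table for k = 2.
[FIG. 9] Value table generation circuit for k = 2.
[FIG. 10] Embodiment of the present invention.
[FIG. 11] A register.
[FIG. 12] Embodiment of the present invention for the Galois field GF(2^n).
[FIG. 13] Value table for k = 2 for the Galois field GF(2^n).
[FIG. 14] Value table generation circuit for k = 2 for the Galois field GF(2^n).
[FIG. 15] Embodiment of the addition unit in Montgomery modular multiplication.
[FIG. 16] Embodiment of the addition unit in the Montgomery method for the Galois field GF(2^n).
[FIG. 17] Adder with carry control function.
[FIG. 18] Circuit shared by the adder for modular multiplication and the adder for multiplication in the Galois field GF(2^n).
[0001]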
BACKGROUND OF THE INVENTION
The present invention is effective when applied to encryption and decryption devices that use modular multiplication and modular exponentiation, and relates in particular to a technique that is effective when applied to high-security data processing devices such as IC cards (smart cards).
[0002]
[Prior art]
  Mobile terminals such as mobile phones based on the GSM (Global System for Mobile Communications) standard, which is widely used in Europe, and IC cards (smart cards) can perform user authentication and electronic commerce. In general, when used as electronic money the device takes the form of an IC card, and in a GSM mobile phone (GSM mobile radiotelephone system) it takes the form called a SIM (Subscriber Identification Module). Both the SIM and the IC card consist of a semiconductor chip with terminals attached to a plastic plate, and in both cases the device itself is a semiconductor chip, so the IC card is described below.
  The IC card is a device that holds personal information that must not be rewritten without permission, encrypts data using a cryptographic key, which is secret information, and decrypts ciphertext. The IC card itself has no power supply; when it is inserted into a reader/writer for IC cards (card reader/writer), it receives power and becomes operable. Once operable, it receives commands sent from the reader/writer and performs processing such as data transfer according to those commands. A general explanation of IC cards can be found in Junichi Mizusawa, "IC Card", edited by the Institute of Electronics, Information and Communication Engineers and published by Ohmsha.
  As shown in FIG. 1, an IC card consists of an IC card chip 102 mounted on a card 101. As shown in the figure, an IC card generally has a supply voltage terminal Vcc, a ground terminal GND, a reset terminal RST, an input/output terminal I/O, and a clock terminal CLK at positions defined by the ISO7816 standard. Through these terminals it receives power from the reader/writer and exchanges data with it (see W. Rankl and Effing: SMARTCARD HANDBOOK, John Wiley & Sons, 1997, pp. 41).
  The configuration of the IC card chip is basically the same as that of an ordinary microcomputer. As shown in FIG. 2, it consists of a central processing unit (CPU) 201, a storage device (memory) 204, an input/output (I/O) port 207, and a coprocessor 202 (the coprocessor may be absent). The CPU 201 performs logical operations and arithmetic operations, and the storage device 204 stores programs and data. The input/output port communicates with the reader/writer. The coprocessor performs cryptographic processing itself, or the computation needed for cryptographic processing, at high speed; examples include a special arithmetic unit for the modular arithmetic of RSA encryption and a cryptographic unit that performs DES (Data Encryption Standard) processing. Many IC card processors have no coprocessor. The data bus 203 connects the devices.
  The storage device 204 includes a ROM (Read Only Memory), a RAM (Random Access Memory), an EEPROM (Electrically Erasable Programmable Read Only Memory), and the like. The ROM is a memory that cannot be modified and mainly stores programs. The RAM is freely rewritable, but its contents are lost when the power supply is interrupted. When the IC card is removed from the reader/writer, the power supply is interrupted and the RAM contents are not retained. The EEPROM retains its contents even when the power supply is interrupted. It is used to store data that needs to be rewritten and must be retained even after the IC card is removed from the reader/writer. For example, the usage count of a prepaid card is rewritten every time the card is used and must be retained after the card is removed from the reader/writer, so it is held in the EEPROM.
  Public key cryptography is required to realize user authentication and electronic money settlement. Public key cryptography embeds secret information in public information, and is also called asymmetric key cryptography because different keys are used on the sending side and the receiving side. RSA is a public key cryptosystem in wide use today. RSA is based on the fact that it is easy to form the product of large prime numbers but difficult to factor a given composite number. In RSA, given a public modulus N and a ciphertext Y, the plaintext M is obtained by computing Y^X mod N (where Y^X denotes Y raised to the power X) with a secret exponent X, or the reverse operation is performed. This operation is called modular exponentiation.
  Here, X, Y, and N are very large numbers, about 1024 to 2048 bits as of 2001, so how to execute "Y^X mod N" at high speed has long been regarded as a problem in applied mathematics and engineering. In particular, in a device such as an IC card, which has limited storage capacity and low CPU performance, modular exponentiation is a very heavy task, and speeding it up is a very important issue.
  Various algorithms for modular exponentiation are known. For example, the following one, called the addition chain method, is widely used.
[Algorithm 1]
  input Y, X = (X[n-1] X[n-2] ... X[1] X[0]), N
  output Y^X mod N
  A = 1                                  Step 1
  B = Y                                  Step 2
  For j = n-1 to 0 step -1 {             Step 3
    A = A^2 mod N                        Step 4
    If X[j] = 1 then A = A*B mod N       Step 5
  }
  output A                               Step 6
In this algorithm, n is the bit length of X, and (X[n-1] X[n-2] ... X[1] X[0]) is the binary representation of X. The algorithm is executed by combining the modular squaring "A^2 mod N" (Step 4) and the modular multiplication "A = A*B mod N" (Step 5). If H(X) denotes the number of 1s among X[n-1], X[n-2], ..., X[1], X[0], the modular squaring "A^2 mod N" is performed n times and the modular multiplication "A = A*B mod N" is performed H(X) times.
  A numerical example confirms that Algorithm 1 gives the correct result.
To check the algorithm it is enough to fix the exponent X, so Y and N are left as symbols. We compute
S = Y^45 mod N
according to Algorithm 1, where S plays the role of the variable A in Algorithm 1. In binary, the exponent 45 is written
45 = 101101 (binary)
so, in the notation of Algorithm 1,
X[5] = 1, X[4] = 0, X[3] = 1, X[2] = 1, X[1] = 0, X[0] = 1
The number of bits n is 6, and the initial value of S is 1.
(1) When j = 5
   Since the initial value of S is 1, squaring it and taking the remainder modulo N still gives 1 (Step 4).
   Since X[5] = 1, Step 5 is executed, giving
  S = 1 * Y mod N
    = Y mod N
(2) When j = 4
   At this point S = Y mod N, so squaring it and taking the remainder modulo N gives S = Y^2 mod N.
  Since X[4] = 0, Step 5 is not executed and S = Y^2 mod N remains.
(3) When j = 3
  At this point S = Y^2 mod N, so squaring it and taking the remainder modulo N gives S = Y^4 mod N.
  Since X[3] = 1, Step 5 is executed, giving
S = Y^4 * Y mod N = Y^5 mod N
(4) When j = 2
  At this point S = Y^5 mod N, so squaring it and taking the remainder modulo N gives S = Y^10 mod N.
  Since X[2] = 1, Step 5 is executed, giving
S = Y^10 * Y mod N = Y^11 mod N
(5) When j = 1
  At this point S = Y^11 mod N, so squaring it and taking the remainder modulo N gives S = Y^22 mod N.
  Since X[1] = 0, Step 5 is not executed and S = Y^22 mod N remains.
(6) When j = 0
  At this point S = Y^22 mod N, so squaring it and taking the remainder modulo N gives S = Y^44 mod N.
  Since X[0] = 1, Step 5 is executed, giving
S = Y^44 * Y mod N = Y^45 mod N
Thus the desired result is obtained.
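The walk-through above can be reproduced with a short Python sketch of Algorithm 1. The function name `mod_exp_binary`, the `trace` list (which records the exponent of Y held in A after each loop iteration), and the sample values Y = 7 and N = 1000003 are illustrative assumptions, not part of the original description.

```python
def mod_exp_binary(y, x, n):
    """Left-to-right binary exponentiation following Algorithm 1."""
    a = 1                                  # Step 1
    b = y                                  # Step 2
    exponent = 0                           # bookkeeping only: A == Y^exponent mod N
    trace = []
    for j in range(x.bit_length() - 1, -1, -1):   # Step 3: j = n-1 down to 0
        a = (a * a) % n                    # Step 4: modular squaring
        exponent *= 2
        if (x >> j) & 1:                   # Step 5: multiply when X[j] = 1
            a = (a * b) % n
            exponent += 1
        trace.append(exponent)
    return a, trace                        # Step 6


result, trace = mod_exp_binary(7, 45, 1000003)
print(trace)   # [1, 2, 5, 11, 22, 45], matching cases (1) to (6) above
```

The loop performs one squaring per exponent bit and one extra multiplication per 1 bit, which is the n and H(X) operation count stated for Algorithm 1.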
  As described above, Algorithm 1 decomposes modular exponentiation into modular multiplications (including squarings), so an arithmetic unit with the function "A*B mod N" is sufficient.
  However, A, B, and N are all very large values. For example, with a data length of 1024 bits, the intermediate result A*B is a number as large as 2048 bits. Furthermore, since the final result is the remainder of A*B divided by N, a division of a 2048-bit value by a 1024-bit value must be executed. Multiplication can be parallelized, for example in a microprocessor, by splitting the multiplier and the multiplicand, but division is difficult to parallelize, and this hinders speeding up.
  To solve this division problem in the modular multiplication "A*B mod N", Algorithm 2, which computes "A*B*R^(-1) mod N" without dividing by N, is known. Here R is 2^n (where n is, for example, the bit length of N), a positive integer satisfying R > N. In the following, N is assumed to be odd. This is a reasonable assumption for RSA and for elliptic curve cryptography over a prime field: in RSA, N is a product of large primes, and over a prime field the arithmetic is performed modulo a large prime, so N is odd in both cases.
  Algorithm 2 below was proposed by the mathematician Peter Montgomery. The argument leading to Algorithm 2 is omitted here; see, for example, Montgomery's own paper, P. L. Montgomery, "Modular Multiplication without Trial Division", Math. Comp., Vol. 44, 1985, pp. 519-521, or a standard handbook of cryptography, A. Menezes, P. C. van Oorschot, S. A. Vanstone, "Handbook of Applied Cryptography", CRC Press, 1997, pp. 602-603.
[Algorithm 2]
  W = 0
  For j = 0 to n-1 step +1 {
    T = W + A [j] * B
    If T is odd then W = (T + N) / 2
    Else W = T / 2
  }
  If W >= N then W = W - N
  Output W
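As a rough illustration, Algorithm 2 can be sketched in Python as follows. The function name `mont_mul_bitwise` and the explicit `nbits` parameter (the n of R = 2^n) are assumptions made for this sketch; N is assumed odd and A, B are assumed smaller than N, as in the text.

```python
def mont_mul_bitwise(a, b, n, nbits):
    """Return A*B*R^(-1) mod N with R = 2**nbits, one bit of A per iteration."""
    w = 0
    for j in range(nbits):
        t = w + ((a >> j) & 1) * b   # T = W + A[j]*B
        if t & 1:                    # T odd: adding the odd N makes it even
            w = (t + n) >> 1
        else:
            w = t >> 1
    if w >= n:                       # single conditional subtraction
        w -= n
    return w


# Small check with N = 239 (odd), R = 2^8: W*R must be congruent to A*B mod N.
w = mont_mul_bitwise(123, 200, 239, 8)
print((w * 256 - 123 * 200) % 239 == 0)   # True
```

Only additions and right shifts are used inside the loop; no division by N appears.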
Since Algorithm 2 yields the value A*B*R^(-1) mod N, executing the modular exponentiation of RSA cryptography on a device that performs this operation requires a modified algorithm such as Algorithm 3 below.
[Algorithm 3]
  input Y, X = (X[n-1] X[n-2] ... X[1] X[0]), N
  output Y^X mod N
  A = R mod N                                      Step 1
  B = Y*R mod N                                    Step 2
  For j = n-1 to 0 step -1 {                       Step 3
    A = (A^2)*R^(-1) mod N                         Step 4
    If X[j] = 1 then A = A*B*R^(-1) mod N          Step 5
  }
  A = A*R^(-1) mod N                               Step 6
  output A                                         Step 7
In the loop of Step 3 of Algorithm 3, both the multiplier and the multiplicand are values that have been multiplied by R modulo N. Hereinafter, this data format is called the Montgomery form. The Montgomery multiplication "A*B*R^(-1) mod N" preserves the Montgomery form. Indeed, writing the Montgomery form of A as Mont(A) = A*R mod N, we have
Mont(A) * Mont(B) * R^(-1) mod N
= A*R * B*R * R^(-1) mod N
= A*B*R mod N
= Mont(A*B)
Therefore, if there are arithmetic units that compute, for example, "A*B*R^(-1) mod N", "(A^2)*R^(-1) mod N", and "A*R^(-1) mod N", the procedure of Algorithm 3 can be executed at high speed using the CPU and these units. Since "(A^2)*R^(-1) mod N" and "A*R^(-1) mod N" are the cases B = A and B = 1 of "A*B*R^(-1) mod N", respectively, modular exponentiation can be realized by building a single arithmetic unit for "A*B*R^(-1) mod N".
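The following Python sketch shows how Algorithm 3 can be built from a single Montgomery multiplication primitive. The helper `mont_mul` is a stand-in that uses Python's built-in modular inverse (`pow(r, -1, n)`, available from Python 3.8) rather than the shift-and-add procedure of Algorithm 2; both function names and the choice R = 2^(bit length of N) are assumptions for illustration.

```python
def mont_mul(a, b, n, r):
    """Reference Montgomery product A*B*R^(-1) mod N (stand-in for Algorithm 2)."""
    return (a * b * pow(r, -1, n)) % n


def mont_exp(y, x, n):
    """Y^X mod N computed with Montgomery products only, following Algorithm 3."""
    r = 1 << n.bit_length()                  # R = 2^n > N; N must be odd
    a = r % n                                # Step 1: Mont(1)
    b = (y * r) % n                          # Step 2: Mont(Y)
    for j in range(x.bit_length() - 1, -1, -1):   # Step 3
        a = mont_mul(a, a, n, r)             # Step 4: the case B = A
        if (x >> j) & 1:
            a = mont_mul(a, b, n, r)         # Step 5
    return mont_mul(a, 1, n, r)              # Step 6: the case B = 1 leaves Montgomery form


print(mont_exp(5, 45, 239) == pow(5, 45, 239))   # True
```

Every operation inside and after the loop is a call to the same primitive, which is the point made above about needing only one arithmetic unit.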
  Next, the methods known so far for speeding up the computation of "A*B*R^(-1) mod N" shown in Algorithm 2 are briefly summarized.
  For the sake of explanation, algorithm 2 is shown again.
[Algorithm 2 (repost)]
  W = 0 Step 1
  For j = 0 to n-1 step +1 {Step 2
    T = W + A [j] * B Step 3
    If T is odd then W = (T + N) / 2 Step 4
    Else W = T / 2 Step 5
  }
  If W >= N then W = W - N Step 6
  Output W Step 7
  Known typical speed-up methods can be classified as follows.
  First speed-up method: Reduce the number of loop iterations including steps 3, 4, and 5
  Second speed-up method: parallel processing of steps 3, 4 and 5
  Third speed-up method: Replace the process in step 6 with a simpler one
As a typical first speed-up method, there is a method in which the addition processing is executed collectively for a plurality of bits. For example, the following algorithm 4 shows a method for collectively executing k-bit processing.
[Algorithm 4]
For j = 0 to 2^k-1 step +1 {                     Step 1
   TBL(j, B) = j*B                               Step 2
   TBL(j, N) = j*N                               Step 3
}
W = 0                                            Step 4
For j = 0 to n/k-1 step +1 {                     Step 5
   T = W + TBL(A[j,k], B)                        Step 6
   U = T mod 2^k                                 Step 7
   M[j,k] = -U * N[0,k]^(-1) mod 2^k             Step 8
   W = T + TBL(M[j,k], N)                        Step 9
   W = W / 2^k                                   Step 10
}
If W >= N then W = W - N                         Step 11
Output W                                         Step 12
Here, for simplicity, k is assumed to be a divisor of n. For example, when n = 1024, k = 2, 4, 8, and so on may be chosen. A[j,k] denotes the value of the j-th k-bit block of A counted from the least significant end. Similarly, M[j,k] denotes the value of the j-th k-bit block of M counted from the least significant end, and N[0,k] denotes the value of the lowest k-bit block of N. The calculation in Step 8 is performed so that the result is non-negative. In Algorithm 4, if k = 1, then since N is odd, M coincides with U and Algorithm 2 is obtained. A method of implementing modular multiplication based on this approach is disclosed in Japanese Patent Application Laid-Open No. 7-20778, Masahiko Takenaka et al.
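A minimal Python sketch of Algorithm 4 is given below, assuming N odd, k dividing n, and A, B smaller than N. The function name `mont_mul_kbit` and the list-based tables are illustrative choices; the inverse N[0,k]^(-1) mod 2^k is taken with Python's `pow(x, -1, m)` (Python 3.8 or later).

```python
def mont_mul_kbit(a, b, n, nbits, k):
    """A*B*2^(-nbits) mod N, processing k bits of A per iteration (Algorithm 4)."""
    assert n & 1 and nbits % k == 0
    mask = (1 << k) - 1
    tbl_b = [j * b for j in range(1 << k)]        # Steps 1-2: TBL(j, B)
    tbl_n = [j * n for j in range(1 << k)]        # Step 3:    TBL(j, N)
    n0_inv = pow(n & mask, -1, 1 << k)            # N[0,k]^(-1) mod 2^k
    w = 0                                         # Step 4
    for j in range(nbits // k):                   # Step 5
        t = w + tbl_b[(a >> (j * k)) & mask]      # Step 6
        u = t & mask                              # Step 7
        m = (-u * n0_inv) & mask                  # Step 8 (non-negative result)
        w = (t + tbl_n[m]) >> k                   # Steps 9-10: low k bits are 0
    if w >= n:                                    # Step 11
        w -= n
    return w                                      # Step 12


for k in (1, 2, 4):
    w = mont_mul_kbit(123, 200, 239, 8, k)
    print(k, (w * 256 - 123 * 200) % 239 == 0)    # True for each k
```

With k = 1 the table lookups reduce to adding B or N conditionally, which is Algorithm 2.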
  Since Algorithm 4 is somewhat complicated, its mechanism is briefly explained.
  Before explaining Algorithm 4, the principle of the Montgomery method must be understood. In principle, the Montgomery method is equivalent to solving the indefinite equation (Diophantine equation) with M and W as unknowns:
(Formula 1) A*B + M*N = W*R
First, note that since N is odd and hence relatively prime to R = 2^n, (Formula 1) has an integer solution. The assumption that N is odd is essential, because when N is even (Formula 1) does not necessarily have an integer solution.
  To understand the meaning of (Formula 1), refer to the figure illustrating the principle of the Montgomery method.
  Since R = 2^n, the lower n bits of the right-hand side of (Formula 1) are all 0. Finding M that satisfies (Formula 1) is therefore nothing other than choosing M so that the lower n bits of the sum A*B + M*N are all 0. The upper half of the resulting bit string is the desired value A*B*R^(-1) mod N. However, the sum of two 2n-bit numbers can have 2n+1 bits, so the most significant bit may be 1 (this bit is hereinafter called the OV bit). This is resolved in Step 11 of Algorithm 4.
  Taking the remainder of both sides of (Formula 1) modulo R = 2^n, we obtain
(Formula 2) (A*B mod 2^n) + (M*N mod 2^n) = 0 (mod 2^n)
From (Formula 2) we obtain
(Formula 3) M = -A*B*N^(-1) mod 2^n
Since the modulus is a power of 2, M can be determined one bit block at a time, starting from the lowest block. This is the idea behind Algorithm 4. Because the number of loop iterations in Algorithm 4 is reduced to 1/k, the loop portion becomes roughly k times faster. However, the multiplication tables grow as k increases, and larger tables put pressure on the RAM area. In devices such as IC cards the RAM area is generally small, so k often cannot be made large. This is the first problem.
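The relation between (Formula 1) and (Formula 3) can be checked numerically with a few lines of Python. The function name `solve_formula_1` and the sample values are assumptions for illustration; the inverse N^(-1) mod 2^n is taken with `pow(n, -1, r)` (Python 3.8 or later), which exists because N is odd.

```python
def solve_formula_1(a, b, n, nbits):
    """Return (M, W) with A*B + M*N = W*R, where R = 2**nbits and N is odd."""
    r = 1 << nbits
    m = (-a * b * pow(n, -1, r)) % r     # (Formula 3)
    total = a * b + m * n
    assert total % r == 0                # the lower n bits of A*B + M*N are all 0
    w = total // r                       # upper part: W in (Formula 1)
    return m, w


m, w = solve_formula_1(123, 200, 239, 8)
print(123 * 200 + m * 239 == w * 256)    # True: (Formula 1) holds
print((w * 256 - 123 * 200) % 239 == 0)  # True: W is congruent to A*B*R^(-1) mod N
```

W obtained this way can still be N or larger, which is why Algorithm 4 ends with a conditional subtraction.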
The second speed-up method can be used in combination with the first. In Algorithm 4, the value of M is determined as soon as the lower k bits of T are determined (Step 8), and Step 9 can be executed immediately afterwards. Therefore, if two n-bit adders are mounted and a circuit that executes Step 6 and Step 9 in parallel is built, the speed is almost doubled. However, this method requires twice as many adders, which means not only that the amount of arithmetic hardware doubles but also that the power consumption during execution doubles. This is the second problem.
  The third speed-up method is independent of the first and second.
  As is well known, a comparison such as that in Step 11 of Algorithm 4 is carried out by actually performing a subtraction and examining the negative flag. Since this is a subtraction between large numbers, for example 1024 bits long, its processing time may not be negligible. On a computer, by contrast, comparison with R = 2^n is easy.
Therefore, the speed can be increased by using Algorithm 5, in which the final comparison and subtraction (Step 11) is changed as follows. This speed-up has already been described in Japanese Patent Laid-Open No. 10-21057, Kunihiko Nakada, "Data Processing Device and Microcomputer", and in Kunihiko Nakada: DATA PROCESSOR AND MICROPROCESSOR, United States Patent, US005961578A.
[Algorithm 5]
For j = 0 to 2^k-1 step +1 {                     Step 1
   TBL(j, B) = j*B                               Step 2
   TBL(j, N) = j*N                               Step 3
}
W = 0                                            Step 4
For j = 0 to n/k-1 step +1 {                     Step 5
   T = W + TBL(A[j,k], B)                        Step 6
   U = T mod 2^k                                 Step 7
   M[j,k] = -U * N[0,k]^(-1) mod 2^k             Step 8
   W = T + TBL(M[j,k], N)                        Step 9
   W = W / 2^k                                   Step 10
}
If W >= R then W = W - N                         Step 11
Output W                                         Step 12
Here, the comparison in Step 11 can be realized simply by checking whether the OV bit described above is 1. This is much lighter and faster than a subtraction. However, the value obtained by Algorithm 5 may be A*B*R^(-1) mod N + N instead of A*B*R^(-1) mod N itself. Since the bit length remains n bits, it suffices to correct the value once at the end of the modular exponentiation.
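The effect of the modified Step 11 can be sketched in Python as follows. The function name `mont_mul_kbit_ov` is hypothetical, and the sketch assumes N odd, k dividing n, and operands smaller than R (not necessarily smaller than N), so that intermediate results of a modular exponentiation can be fed back in without full reduction.

```python
def mont_mul_kbit_ov(a, b, n, nbits, k):
    """Algorithm 5 sketch: like Algorithm 4, but Step 11 only tests the OV bit."""
    assert n & 1 and nbits % k == 0
    mask = (1 << k) - 1
    tbl_b = [j * b for j in range(1 << k)]
    tbl_n = [j * n for j in range(1 << k)]
    n0_inv = pow(n & mask, -1, 1 << k)
    w = 0
    for j in range(nbits // k):
        t = w + tbl_b[(a >> (j * k)) & mask]
        m = (-(t & mask) * n0_inv) & mask
        w = (t + tbl_n[m]) >> k
    if w >> nbits:          # OV bit set, i.e. W >= R: one subtraction of N
        w -= n
    return w                # fits in nbits bits, but may still be >= N


w = mont_mul_kbit_ov(250, 255, 239, 8, 2)
print(w < 256, (w * 256 - 250 * 255) % 239 == 0)   # True True
```

The result is only guaranteed to be congruent to A*B*R^(-1) modulo N and to fit in n bits, which is why a final correction is still needed once at the end.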
  The present invention does not concern this third speed-up method; it concerns the first and second speed-up methods.
  The first and second problems above also arise in the Galois field of characteristic 2 that appears in elliptic curve cryptosystems.
  The concept of a Galois field is briefly explained. The Galois field itself is a purely mathematical concept, but it can be translated into concrete operations by constructing an appropriate homomorphism. The simplest way to realize a Galois field on a computer is to use polynomial arithmetic.
  First, consider the set of all polynomials whose coefficients are 0 or 1, and call this set POLY. Choose an irreducible polynomial F(X) of degree n as an element of POLY. The sum (which is the same as the difference) of two elements A(X) and B(X) of POLY is defined by adding corresponding coefficients in the sense of exclusive OR, and the product is defined as A(X)*B(X) mod F(X). As is well known in algebra, an element of POLY that is not a multiple of F(X) has an inverse with respect to this product. Two elements A(X) and B(X) of POLY are defined to be equivalent when their difference A(X) - B(X) is a multiple of F(X), and GF(2^n) is POLY partitioned into equivalence classes by this relation. This is a Galois field of characteristic 2. It is well known that the algebraic structure of GF(2^n) does not depend on the choice of the degree-n irreducible polynomial F(X). In implementations, polynomials with three or five terms are often chosen. The polynomial F(X) is sometimes called a reduction polynomial.
  The Galois field product is realized by performing the additions that appear in an ordinary multiplication in the sense of exclusive OR. If carries are not propagated in the adder, the addition is equivalent to a bitwise exclusive OR, and the Galois field product is obtained.
In the Montgomery method, essentially the same operation is performed. The condition "the modulus N is odd", which was necessary in the ordinary Montgomery method, is replaced by the condition that the constant term of F(X) is 1. Since F(X) is irreducible, this condition holds automatically, so there is no obstacle to applying the Montgomery method. R = 2^n in the ordinary Montgomery method is replaced by R(X) = X^n. The process of determining M, which was essential in the Montgomery method, can likewise be realized by suppressing carry propagation. The Montgomery process in the Galois field can therefore be realized as follows.
[Algorithm 6]
  W(X) = 0                                                        Step 1
  For j = 0 to n-1 step +1 {                                      Step 2
    T(X) = W(X) + A[j]*B(X)                                       Step 3
    If T(X) mod X is 1 then W(X) = (T(X) + F(X)) / X              Step 4
    Else W(X) = T(X) / X                                          Step 5
  }
Here, A[j] is the coefficient of the degree-j term of A(X), and each sum means the exclusive OR of corresponding coefficients. Algorithm 6 does not need the subtraction that appears in the ordinary Montgomery method. In a Galois field there is no magnitude relation other than the degree, and if the degrees of A(X) and B(X) are less than n (which always holds in GF(2^n)), the W(X) constructed by Algorithm 6 has degree less than n and is therefore already a remainder with respect to F(X). Consequently, the third problem that arises when the Montgomery method is applied to ordinary modular multiplication does not exist in the Galois field.
  A numerical example confirms that Algorithm 6 gives the correct answer. For simplicity, let F(X) = X^4 + X + 1 in GF(2^4). For A(X) = X^3 + X^2 + 1 and B(X) = X^3 + X + 1, we compute the product of A and B in GF(2^4), that is, A(X)*B(X) mod F(X).
  Noting that A[0] = 1, A[1] = 0, A[2] = 1, A[3] = 1, the steps of Algorithm 6 are executed in order.
(1) When j = 0
  Since W(X) = 0 and A[0] = 1, Step 3 gives
T(X) = 0 + 1 * (X^3 + X + 1)
= X^3 + X + 1
Since the constant term is 1, Step 4 is executed, giving
W(X) = (X^3 + X + 1 + X^4 + X + 1) / X
= (X^4 + X^3 + 2*X + 2) / X
= (X^4 + X^3) / X
= X^3 + X^2
Note that 2 times anything is 0.
(2) When j = 1
  Since W(X) = X^3 + X^2 and A[1] = 0, Step 3 gives
T(X) = X^3 + X^2 + 0 * (X^3 + X + 1)
= X^3 + X^2
Since the constant term is 0, Step 5 is executed, giving
W(X) = (X^3 + X^2) / X
= X^2 + X
(3) When j = 2
  Since W(X) = X^2 + X and A[2] = 1, Step 3 gives
T(X) = X^2 + X + 1 * (X^3 + X + 1)
= X^3 + X^2 + 2*X + 1
= X^3 + X^2 + 1
Since the constant term is 1, Step 4 is executed, giving
W(X) = (X^3 + X^2 + 1 + X^4 + X + 1) / X
= (X^4 + X^3 + X^2 + X + 2) / X
= (X^4 + X^3 + X^2 + X) / X
= X^3 + X^2 + X + 1
(4) When j = 3
  Since W(X) = X^3 + X^2 + X + 1 and A[3] = 1, Step 3 gives
T(X) = X^3 + X^2 + X + 1 + 1 * (X^3 + X + 1)
= 2*X^3 + X^2 + 2*X + 2
= X^2
Since the constant term is 0, Step 5 is executed, giving
W(X) = X^2 / X
= X
The W(X) obtained by the above calculation should be A(X)*B(X)*R(X)^(-1) mod F(X). To remove the factor R(X)^(-1) mod F(X), multiply by R(X) = X^4:
X * X^4 mod (X^4 + X + 1)
= X * (X + 1) = X^2 + X
  To confirm that this result is correct, A(X)*B(X) mod F(X) is calculated in the ordinary way.
A(X) * B(X)
= (X^3 + X^2 + 1) * (X^3 + X + 1)
= X^6 + X^5 + X^4 + 3*X^3 + X^2 + X + 1
= X^6 + X^5 + X^4 + X^3 + X^2 + X + 1
and therefore
A(X) * B(X) mod F(X)
= X^6 + X^5 + X^4 + X^3 + X^2 + X + 1 mod (X^4 + X + 1)
= (X^2 + X + 1) * (X^4 + X + 1) + X^2 + X mod (X^4 + X + 1)
= X^2 + X
Thus the same polynomial is obtained as when the value calculated by Algorithm 6 is multiplied by R(X).
  In the above Galois field calculation, 2 is replaced by 0 because sums are computed by exclusive OR.
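The numerical example can be replayed with a small Python sketch in which a polynomial over GF(2) is stored as an integer bit mask (bit i is the coefficient of X^i), so that addition becomes XOR and division by X becomes a right shift. The function names `gf2_mont_mul` and `gf2_mul_mod` and the bit-mask encoding are assumptions made for illustration.

```python
def gf2_mont_mul(a, b, f, n):
    """A(X)*B(X)*X^(-n) mod F(X) following Algorithm 6."""
    w = 0                                    # Step 1
    for j in range(n):                       # Step 2
        t = w ^ (b if (a >> j) & 1 else 0)   # Step 3: sum is exclusive OR
        if t & 1:                            # Step 4: constant term is 1
            w = (t ^ f) >> 1
        else:                                # Step 5
            w = t >> 1
    return w


def gf2_mul_mod(a, b, f):
    """Ordinary carry-less product reduced modulo F(X), used as a cross-check."""
    p = 0
    while a:
        if a & 1:
            p ^= b
        a >>= 1
        b <<= 1
    deg_f = f.bit_length() - 1
    while p.bit_length() - 1 >= deg_f:       # cancel the leading term with F(X)
        p ^= f << (p.bit_length() - 1 - deg_f)
    return p


A, B, F = 0b1101, 0b1011, 0b10011            # X^3+X^2+1, X^3+X+1, X^4+X+1
w = gf2_mont_mul(A, B, F, 4)
print(bin(w))                                # 0b10, i.e. W(X) = X
print(bin(gf2_mul_mod(w, 0b10000, F)))       # 0b110, i.e. X^2 + X after multiplying by R(X)
print(bin(gf2_mul_mod(A, B, F)))             # 0b110, the ordinary product
```

No comparison or subtraction follows the loop, in line with the remark that the value correction step is unnecessary in GF(2^n).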
  This algorithm can be modified to process k bits at a time in the same way as for ordinary modular multiplication. The resulting Montgomery algorithm in GF(2^n) is shown below.
[Algorithm 7]
For j = 0 to 2^k-1 step +1 {                               Step 1
   TBL(H[j](X), B(X)) = H[j](X) * B(X)                     Step 2
   TBL(H[j](X), F(X)) = H[j](X) * F(X)                     Step 3
}
W(X) = 0                                                   Step 4
For j = 0 to n/k-1 step +1 {                               Step 5
   T(X) = W(X) + TBL(A[j,k](X), B(X))                      Step 6
   U(X) = T(X) mod X^k                                     Step 7
   M[j,k](X) = U(X) * F[0,k](X)^(-1) mod X^k               Step 8
   W(X) = T(X) + TBL(M[j,k](X), F(X))                      Step 9
   W(X) = W(X) / X^k                                       Step 10
}
  Here, when the binary expansion of j is j = (j[k-1] j[k-2] ... j[0]), the polynomial H[j](X) in Algorithm 7 is defined as H[j](X) = j[k-1]*X^(k-1) + j[k-2]*X^(k-2) + ... + j[0]. A[j,k](X) is the polynomial corresponding to the j-th k-bit block when the polynomial A(X) is written as a bit string, and F[0,k](X) and M[j,k](X) are defined in the same way. The sums in Steps 6 and 9 mean the exclusive OR of corresponding terms. A numerical example is omitted.
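Although the text omits a numerical example, Algorithm 7 can be sketched in Python with polynomials stored as integer bit masks. The names `clmul` and `gf2_mont_mul_kbit` are hypothetical, and the inverse F[0,k](X)^(-1) mod X^k is found here by exhaustive search over the 2^k candidates, which is an illustrative shortcut rather than anything stated in the source.

```python
def clmul(a, b):
    """Carry-less (GF(2) polynomial) product of two bit-mask polynomials."""
    p = 0
    while a:
        if a & 1:
            p ^= b
        a >>= 1
        b <<= 1
    return p


def gf2_mont_mul_kbit(a, b, f, n, k):
    """A(X)*B(X)*X^(-n) mod F(X), k coefficients of A(X) per iteration (Algorithm 7)."""
    assert f & 1 and n % k == 0               # constant term of F(X) must be 1
    mask = (1 << k) - 1
    tbl_b = [clmul(h, b) for h in range(1 << k)]   # Steps 1-2
    tbl_f = [clmul(h, f) for h in range(1 << k)]   # Step 3
    # F[0,k](X)^(-1) mod X^k by brute force (assumption: fine for small k)
    f0_inv = next(v for v in range(1 << k) if clmul(v, f & mask) & mask == 1)
    w = 0                                          # Step 4
    for j in range(n // k):                        # Step 5
        t = w ^ tbl_b[(a >> (j * k)) & mask]       # Step 6
        u = t & mask                               # Step 7
        m = clmul(u, f0_inv) & mask                # Step 8 (no sign in characteristic 2)
        w = (t ^ tbl_f[m]) >> k                    # Steps 9-10
    return w


# Same operands as the Algorithm 6 example; every k gives W(X) = X.
for k in (1, 2, 4):
    print(k, bin(gf2_mont_mul_kbit(0b1101, 0b1011, 0b10011, 4, k)))
```

Because the result always has degree less than n, it is the unique representative of A(X)*B(X)*X^(-n) modulo F(X), so all block sizes agree.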
[0003]
[Problems to be solved by the invention]
The present invention relates to a microcomputer that performs modular multiplication and modular exponentiation on large numbers, and to a method of executing them. Its object is to solve the two problems described in [Prior art] that arise when the Montgomery method (P. L. Montgomery, "Modular Multiplication without Trial Division", Math. Comp., Vol. 44, 1985, pp. 519-521) is implemented on such a microcomputer, namely
(First problem) The multiplication table needed to process multiple bits at a time becomes large.
(Second problem) When the processing of A*B and M*N is executed simultaneously to double the speed, the amount of arithmetic hardware and the power consumption both double.
and to solve the same problems for the product operation in the Galois field GF(2^n).
The above and other objects and novel features of the present invention will be apparent from the description of this specification and the accompanying drawings.
[0004]
[Means for Solving the Problems]
  The outline of the invention disclosed in the present application will be described as follows.
  Consider the following Montgomery algorithm 4 (reprinted).
[Algorithm 4 (repost)]
For j = 0 to 2^k-1 step +1 {                     Step 1
   TBL(j, B) = j*B                               Step 2
   TBL(j, N) = j*N                               Step 3
}
W = 0                                            Step 4
For j = 0 to n/k-1 step +1 {                     Step 5
   T = W + TBL(A[j,k], B)                        Step 6
   U = T mod 2^k                                 Step 7
   M[j,k] = -U * N[0,k]^(-1) mod 2^k             Step 8
   W = T + TBL(M[j,k], N)                        Step 9
   W = W / 2^k                                   Step 10
}
If W >= N then W = W - N                         Step 11
Output W                                         Step 12
  As described in [Prior art], Steps 6 and 9 of Algorithm 4 are additions of large numbers, and most of the hardware can be regarded as being devoted to these adders.
  Steps 6 and 9 process one k-bit block of A*B and of M*N, respectively, so they can be combined into a single addition for one k-bit block of A*B + M*N. The quantity added in each loop iteration has the form "(k-bit number) * B" plus "(k-bit number) * N", so the new table that is needed has the following form:
TBL(j, t) = j*B + t*N (j, t = 0, 1, 2, ..., 2^k-1)
Selecting an entry of this table requires the two parameters j and t. They are chosen so that, when the entry is added to W, the lower k bits of the sum are 0.
That is, in the processing of the j-th step, the entry
TBL(A[j,k], M[j,k]) = A[j,k]*B + M[j,k]*N
is selected. M[j,k] is a value determined by A[j,k] and the lower k bits of W. Algorithm 8 below rearranges the table in advance so that M need not be computed during the loop.
[Algorithm 8]
For j = 0 to 2^k-1 step +1 {                     Step 1
   For s = 0 to 2^k-1 step +1 {                  Step 2
      U = (s + j*B[0,k]) mod 2^k                 Step 3
      M = -U * N[0,k]^(-1) mod 2^k               Step 4
      TBL[j, s] = j*B + M*N                      Step 5
   }
}
W = 0                                            Step 6
For i = 0 to n/k-1 step +1 {                     Step 7
   W = W + TBL[A[i,k], W[0,k]]                   Step 8
   W = W / 2^k                                   Step 9
}
If W >= N then W = W - N                         Step 10
Output W                                         Step 11
Algorithm 8 above represents the essence of the present invention. The additions that appear in two places in Algorithm 4, namely Steps 6 and 9, become a single addition in Algorithm 8. If Steps 6 and 9 of Algorithm 4 are computed one after the other, adopting Algorithm 8 approximately doubles the calculation speed. If instead Steps 6 and 9 of Algorithm 4 are executed in parallel, as many coprocessors do, two adders are required; with Algorithm 8 the number of adders is halved, and as a direct consequence power consumption is reduced.
  In Algorithm 8, if carry propagation is suppressed, the product operation of GF(2^n) is realized, as described in [Prior art]. The effect of the present invention obtained for ordinary modular multiplication therefore also applies to the product operation circuit for GF(2^n).
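A minimal Python sketch of Algorithm 8 follows, assuming N odd, k dividing n, and A, B smaller than N. The names `build_table` and `mont_mul_merged` and the nested-list table layout are assumptions for illustration; the inverse N[0,k]^(-1) mod 2^k uses `pow(x, -1, m)` (Python 3.8 or later).

```python
def build_table(b, n, k):
    """Table generation (Steps 1-5): TBL[j][s] = j*B + M*N with M chosen from (j, s)."""
    mask = (1 << k) - 1
    n0_inv = pow(n & mask, -1, 1 << k)             # N[0,k]^(-1) mod 2^k
    tbl = [[0] * (1 << k) for _ in range(1 << k)]
    for j in range(1 << k):                        # Step 1
        for s in range(1 << k):                    # Step 2
            u = (s + j * (b & mask)) & mask        # Step 3
            m = (-u * n0_inv) & mask               # Step 4
            tbl[j][s] = j * b + m * n              # Step 5
    return tbl


def mont_mul_merged(a, b, n, nbits, k):
    """A*B*2^(-nbits) mod N with one table lookup and one addition per k-bit block."""
    mask = (1 << k) - 1
    tbl = build_table(b, n, k)
    w = 0                                          # Step 6
    for i in range(nbits // k):                    # Step 7
        # Steps 8-9: index by the block of A and the low k bits of W, then shift
        w = (w + tbl[(a >> (i * k)) & mask][w & mask]) >> k
    if w >= n:                                     # Step 10
        w -= n
    return w                                       # Step 11


w = mont_mul_merged(123, 200, 239, 8, 2)
print((w * 256 - 123 * 200) % 239 == 0)            # True
```

Compared with a direct rendering of Algorithm 4, the loop body contains a single large addition, at the price of a table indexed by two k-bit values.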
[0005]
DETAILED DESCRIPTION OF THE INVENTION
  The algorithm 8 is listed again below. A method for configuring a circuit according to this algorithm will be described below.
[Algorithm 8 (repost)]
For j = 0 to 2^k-1 step +1 {                     Step 1
   For s = 0 to 2^k-1 step +1 {                  Step 2
      U = (s + j*B[0,k]) mod 2^k                 Step 3
      M = -U * N[0,k]^(-1) mod 2^k               Step 4
      TBL[j, s] = j*B + M*N                      Step 5
   }
}
W = 0                                            Step 6
For i = 0 to n/k-1 step +1 {                     Step 7
   W = W + TBL[A[i,k], W[0,k]]                   Step 8
   W = W / 2^k                                   Step 9
}
If W >= N then W = W - N                         Step 10
Output W                                         Step 11
The processing of Algorithm 8 is roughly divided into three parts: the "table generation unit" of Steps 1 to 5, the "addition processing unit" of Steps 6 to 9, and the "value correction unit" of Steps 10 and 11.
  First, the highly general “addition processing unit” will be described, then the “value correction unit”, and finally the “table generation unit” will be described, and a circuit configuration integrating them will be described.
  Various methods of configuring the adder are known, but a typical one will be described here.
  An adder is built by arranging one-bit adders side by side. Consider a circuit that adds two m-bit numbers S and T. In the following, their binary representations are written S = (S[m-1] S[m-2] ... S[1] S[0]) and T = (T[m-1] T[m-2] ... T[1] T[0]), and their sum is written Z = (Z[m] Z[m-1] ... Z[1] Z[0]). Viewed bit by bit, the values needed to add the j-th bit are S[j], T[j], and the carry C[j-1] generated by the preceding bit, where C[-1] = 0 and Z[m] = C[m-1].
  FIG. 4 is an example of a full adder that performs 1-bit addition. The j-th 1-bit full adder can be understood as a circuit that receives S[j], T[j], and the carry signal C[j-1] of the preceding stage and outputs the j-th bit Z[j] of Z and the carry signal C[j].
  This circuit implements the following logic:
(Formula 4) Z[j] = (S[j] AND T[j] AND C[j-1]) OR (S[j] AND NOT(T[j]) AND NOT(C[j-1])) OR (NOT(S[j]) AND T[j] AND NOT(C[j-1])) OR (NOT(S[j]) AND NOT(T[j]) AND C[j-1]),
(Formula 5) C[j] = NOT((S[j] AND NOT(T[j]) AND NOT(C[j-1])) OR (NOT(S[j]) AND T[j] AND NOT(C[j-1])) OR (NOT(S[j]) AND NOT(T[j]) AND C[j-1]) OR (NOT(S[j]) AND NOT(T[j]) AND NOT(C[j-1])))
  In FIG. 4, 401, 402, 403, 404, and 405 are AND circuits, 406 and 407 are OR circuits, and 409, 410, 411, 412, 413, 414, 415, 416, and 417 are inverters (NOT circuits). These correspond to AND, OR, and NOT in (Formula 4) and (Formula 5), respectively. AND, OR, and NOT are each logic gates built from several transistors. AND outputs 1 only when all of its inputs are 1 and outputs 0 otherwise. OR outputs 1 when at least one input is 1 and outputs 0 only when all inputs are 0. NOT (inverter) outputs 1 for input 0 and 0 for input 1.
  If 1-bit full adders are arranged, a multi-bit full adder can be made. FIG. 5 shows a 4-bit full adder. 501, 502, 503, and 504 are 1-bit full adders shown in FIG. 4, and are calculated according to (Equation 4) and (Equation 5). The carry is set to 0 for the least significant bit. The last carry is Z [4]. Although the case of 4 bits is shown here, it is clear that any number of bits can be added by increasing the number of 1-bit full adders.
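The gate-level behaviour described by (Formula 4) and (Formula 5), and the chaining of 1-bit full adders into a multi-bit adder, can be modelled in Python as follows. The function names `full_adder` and `ripple_add` and the use of `1 - x` for NOT on single bits are illustrative assumptions.

```python
def full_adder(s, t, c):
    """1-bit full adder written directly from (Formula 4) and (Formula 5)."""
    ns, nt, nc = 1 - s, 1 - t, 1 - c                  # inverter outputs
    z = (s & t & c) | (s & nt & nc) | (ns & t & nc) | (ns & nt & c)
    carry = 1 - ((s & nt & nc) | (ns & t & nc) | (ns & nt & c) | (ns & nt & nc))
    return z, carry


def ripple_add(s, t, m, carry_in=0):
    """Add two m-bit numbers with m chained full adders; bit m of the result is Z[m]."""
    z, c = 0, carry_in                                # C[-1] = carry_in
    for j in range(m):
        bit, c = full_adder((s >> j) & 1, (t >> j) & 1, c)
        z |= bit << j
    return z | (c << m)                               # Z[m] = C[m-1]


print(ripple_add(0b1011, 0b0110, 4))                  # 17
```

The carry output is 1 exactly when at least two of the three inputs are 1, which is what negating the "at most one input is 1" terms of (Formula 5) expresses.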
  The full adder configuration shown here is the simplest one, called a "carry propagation adder" (ripple-carry adder). Carry propagation accounts for the largest share of its delay, so it is the configuration that takes the longest time to add. There are many hardware algorithms for performing addition at higher speed, but the configuration method of the adder is irrelevant to the essence of the present invention, and thus further explanation is omitted.
  Next, processing of the value correction unit will be described. In the value correction unit, subtraction is used. The subtracter has a structure similar to that of an adder and can be configured using a carry propagation adder. For this reason, a normal computer has an addition circuit, but does not have a subtraction circuit.
  FIG. 6 is a configuration example of a 4-bit subtraction circuit (Z = S - T) using full adders. 601, 602, 603, and 604 are 1-bit full adders. It differs from the adder in that the first carry (borrow) input is 1 and T is inverted by the inverters 605, 606, 607, and 608. In subtraction, when the last carry (borrow) is 1, it means that the result is negative. This phenomenon is called underflow.
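The subtraction described above can be sketched in Python as follows. This is a behavioral model under one stated assumption: the borrow is taken to be the complement of the final carry out of the adder chain, which is how S + NOT(T) + 1 behaves arithmetically; how FIG. 6 actually wires the borrow output is not visible in the text. The name `ripple_sub` is hypothetical.

```python
def ripple_sub(S, T, m):
    """Compute S - T on m bits with an adder chain: S + NOT(T) + 1.

    Returns (difference mod 2^m, borrow), where borrow = 1 means S < T.
    """
    carry = 1  # the first carry input is 1
    Z = 0
    for j in range(m):
        s = (S >> j) & 1
        t_inv = 1 - ((T >> j) & 1)  # T passes through an inverter
        total = s + t_inv + carry
        Z |= (total & 1) << j
        carry = total >> 1
    borrow = 1 - carry  # assumption: borrow is the inverted final carry
    return Z, borrow
```

With this convention, `ripple_sub(3, 5, 4)` yields a borrow of 1 (underflow), while `ripple_sub(5, 3, 4)` yields a borrow of 0.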
  A configuration example of the value correction unit will be described using the concept of the subtractor. FIG. 7 is a block diagram of the value correction circuit. The (n+1)-bit register 702 contains the value W, and the n-bit register 703 contains the modulus N. Both are input to the (n+1)-bit subtractor 704, which calculates W - N. The value W - N and the value W are input to the selector 705, and the borrow generated by this calculation is input to it as the selection signal. If the selection signal is 0, the selector 705 selects W - N. If the selection signal is 1, W is selected and output.
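The selector behavior described above amounts to a conditional subtraction. A minimal Python sketch, with the hypothetical name `value_correct` and with the borrow modeled simply as the comparison W < N:

```python
def value_correct(W, N, n):
    """Model of the value correction: output W - N if W >= N, else W.

    W is an (n+1)-bit value and N an n-bit modulus.
    """
    diff = (W - N) % (1 << (n + 1))  # output of the (n+1)-bit subtractor
    borrow = 1 if W < N else 0       # selection signal
    return W if borrow else diff     # selector: 1 selects W, 0 selects W - N
```

For an input in the range 0 <= W < 2N, the output is W mod N.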
  Here, W and N have been described as dedicated registers, but it goes without saying that they can also be placed on a RAM used for CPU processing.
  Next, the circuit of the table generation unit will be described. For simplicity, only k = 2 is shown.
  The values generated by the table generation unit when k = 2 are listed in the table of FIG. 8. In this configuration example, the values N, 2 * N, and 3 * N, which depend only on the modulus N and not on the input values A and B, are calculated in advance and held in the registers 909, 910, and 911 for use. Further, since B is the input value itself, it does not need to be calculated. B is held in the register 906.
  The eleven values other than 0 and the above four values B, N, 2 * N, 3 * N must be calculated each time. Here, a configuration is shown in which only the adder is used to calculate the 11 values. Each register must be up to (k + 1 =) 3 bits larger. Each register is connected to a data bus 921, and the data bus 921 is used as a signal line for supplying data to the addition processing unit. Reference numeral 904 denotes an adder that adds (n+2)-bit numbers and sends the (n+3)-bit sum to the selector 903.
  Upon receiving the calculation start signal derived from the clock signal, the control device 905 initializes its internal counter to 0001 in binary notation, inputs the value of B to both inputs of the adder 904 via the selector 901, and operates the adder 904. Immediately before the calculation by the adder 904 starts, a control signal is sent from the adder 904 to the control device 905. On receiving the control signal, the control device 905 increments its internal counter and obtains the value 0010 in binary notation. Taking the lower 2 bits and the upper 2 bits of the counter as j and t, respectively, in j * B + t * N, the selector 903 determines the register 907 to which the output of the adder 904 is to be written, and the addition result is written there. Next, a calculation start signal is transmitted again to the control device 905, and the control device 905 sends a control signal to the selector 901, reads B and 2 * B, inputs them to the adder 904, and performs the same operation as before to write the value 3 * B to the register 908. With the operations so far, the values needed to synthesize the table, B, 2 * B, 3 * B, N, 2 * N, and 3 * N, are all available. Thereafter, the register to be read and the register to be written are determined in the same way from the counter value of the control device, and the same operation is repeated. However, when the lower 2 bits of the counter are 00, the addition operation is skipped. After all the values have been obtained, the registers 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, and 920 are used as read-only registers.
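The table synthesis for k = 2 can be modeled as a sequence of single additions. The Python sketch below is an illustration under assumptions: the 4-bit counter is decoded as j = lower 2 bits and t = upper 2 bits, entries are keyed by (j, t) rather than by register number, and the function name `build_table_k2` is hypothetical. It counts how many additions are needed once B, N, 2 * N, and 3 * N are available.

```python
def build_table_k2(B, N):
    """Build all 16 values j*B + t*N (0 <= j, t <= 3) using additions only."""
    # Values available without any addition: 0, the input B, and the
    # precomputed multiples N, 2*N, 3*N.
    tbl = {(0, 0): 0, (1, 0): B, (0, 1): N, (0, 2): 2 * N, (0, 3): 3 * N}
    additions = 0
    for counter in range(1, 16):
        j = counter & 0b11   # lower 2 bits of the counter
        t = counter >> 2     # upper 2 bits of the counter
        if j == 0 or (j, t) == (1, 0):
            continue         # already available, the addition is skipped
        if t == 0:
            tbl[(j, 0)] = tbl[(1, 0)] + tbl[(j - 1, 0)]  # 2*B = B + B, 3*B = B + 2*B
        else:
            tbl[(j, t)] = tbl[(j, 0)] + tbl[(0, t)]      # j*B + t*N
        additions += 1
    return tbl, additions
```

Under this decoding, exactly 11 additions are performed, matching the eleven values that the text says must be calculated each time.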
  Next, FIG. 10 shows a block diagram of a circuit that combines these to calculate A * B * R ^ (− 1) mod N and writes the result to the A register. However, the value adjustment unit is omitted in order to avoid complexity. In general, the subtraction processing required by the value adjustment unit is realized by changing the input value of the adder 1015.
  FIG. 10 includes all the important components of the present invention. Hereinafter, the same variable names as those used in the algorithm 8 are used.
  In FIG. 10, first, the value of A stored in the A register 1003 is read in order from the lower-order bits. The configuration of the A register is shown in FIG. 11. The control device 1007 receives the number i of the 2-bit block to be read from the 9-bit register 1006, reads the value of the corresponding 2-bit block, and transmits it to the register 1008. The register 1009 stores the lower 2 bits of W. The values of 1008 and 1009 are transmitted to the value selection and signal control device 1011. 1011 reads the corresponding data from the value table 1002 via the data bus 1001 and sends it to the adder 1015, which adds a 1027-bit value and a 1026-bit value. Dedicated registers 1013 and 1014 are connected to the adder 1015, and the value read from the table is temporarily held in the 1027-bit register 1014. Reference numeral 1013 denotes a 1026-bit register that stores the W appearing in Algorithm 8; it is cleared to 0 at the start of operation. The adder 1015 adds the data of the form j * B + t * N taken from the value table 1002 to W. Since the lower 2 bits of this sum are 00 in binary notation, they are cut off by the 2-bit right shifter 1016, and the result is input to the control device 1017. Meanwhile, the adder 1015 sends the value 1 to the adder 1018 when the calculation starts. The adder 1018 adds the value of the 9-bit register 1019 and the calculation start signal "1" and sends the result to the 9-bit register 1019, which initially stores 0. This sum is also sent to the control device 1017. If the sum has not reached 512, the control device 1017 writes the output of 1016 to the register 1013 as it is, and 1011 extracts only the lower 2 bits of W from 1013 and writes them to the register 1009. At this time, a signal is sent from 1017 to 1007 via 1011, the value of i is incremented by the adder 1005, the next 2-bit block of the A register is read out, and the table value is added through the same process as described above. When the value of 1019 reaches 512, a signal is sent to the selective transmission circuit 1012, and the value of W in 1013 is sent to the A register. The value of the A register at this point is A * B * R ^ (-1) mod N or A * B * R ^ (-1) mod N + N. In the latter case, N must be subtracted from A; the subtraction processing unit is omitted here. This is one of the embodiments of claims 1 and 7 of the present invention.
  The above-described embodiments can be variously modified without departing from the gist of the present invention. For example, the following changes can be considered.
(1) Here, the adder 1015 is configured to add 1027 bits at the same time, but the addition can also be realized by dividing the operands into a plurality of blocks and adding them block by block in order. In this case, the amount of hardware is reduced, but the speed is also reduced.
(2) Without using the 2-bit right shifter 1016, only the bits above the lower 2 bits can be connected directly to the register 1013.
(3) Change the number of bits of values A, B, and N. For example, it can be changed to 2048 bits, 512 bits, 768 bits, or the like.
(4) The value of k can be changed to a value other than 2, for example, 1, 3, 4, etc. However, when n is not a multiple of k, a logic circuit for correction is required.
(5) The value 0 is removed from the table, and instead a logic circuit is added so that, when the value to be added to W is 0, the value of W is not input to the adder but is input directly to the shifter.
  The above-described changes can be made independently of “addition table synthesis” which is essentially the gist of the present invention.
  Next, a description will be given of items that require some attention unlike the above-described changes.
  Again, Algorithm 8 is shown.
[Algorithm 8 (repost)]
For j = 0 to 2^k-1 step 1 {                  Step 1
   For s = 0 to 2^k-1 step 1 {               Step 2
      U = (s + j * B[0, k]) mod 2^k          Step 3
      M = -U * N[0, k]^(-1) mod 2^k          Step 4
      TBL[j, s] = j * B + M * N              Step 5
   }
}
W = 0                                        Step 6
For i = 0 to n/k-1 step 1 {                  Step 7
   W = W + TBL[A[i, k], W[0, k]]             Step 8
   W = W / 2^k                               Step 9
}
If W >= N then W = W - N                     Step 10
Output W                                     Step 11
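For illustration, Algorithm 8 can be written as the following Python sketch. It assumes that n is a multiple of k, that N is odd, and that B < N so the final single subtraction is sufficient; the function names `build_tbl` and `mont_mul` are hypothetical, and Python integers stand in for the hardware registers.

```python
def build_tbl(B, N, k):
    """Steps 1 to 5: TBL[j][s] = j*B + M*N, whose low k bits cancel s."""
    size = 1 << k
    mask = size - 1
    n0 = N & mask
    # Inverse of the low k bits of N modulo 2^k (exists because N is odd)
    n0_inv = next(c for c in range(size) if (c * n0) & mask == 1)
    tbl = [[0] * size for _ in range(size)]
    for j in range(size):
        for s in range(size):
            u = (s + j * (B & mask)) & mask
            m = (-u * n0_inv) & mask
            tbl[j][s] = j * B + m * N
    return tbl


def mont_mul(A, B, N, n, k):
    """Steps 6 to 11: return A * B * 2^(-n) mod N."""
    mask = (1 << k) - 1
    tbl = build_tbl(B, N, k)
    W = 0
    for i in range(n // k):
        a_i = (A >> (k * i)) & mask  # i-th k-bit block of A
        W += tbl[a_i][W & mask]      # low k bits of the sum are now 0
        W >>= k                      # exact division by 2^k
    if W >= N:
        W -= N
    return W
```

Each loop iteration performs one table lookup and one addition, which is the point of synthesizing j * B + M * N into a single table entry.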
  Here, the maximum value of the table TBL [j, s] is (2 ^ k-1) * B + (2 ^ k-1) * N. Since this value is n + k + 1 bits at the maximum, the bit length of TBL [j, s] must be increased according to the value of k. Since the number of data is 4 ^ k, the increase in data accompanying k reaches k * 4 ^ k bits. In general, k is not a large value, so it does not become a big problem in realizing the present invention, but it is not a desirable phenomenon. There are ways to avoid this problem.
  Focus on steps 8 and 9. The value of TBL (A [i, k], W [0, k]) in step 8 is selected so that the lower k bits of the sum of step 8 are always 0. In step 9, the k bits are cut off by a right shift. This indicates that the calculation of the lower k bits is completely unnecessary. Therefore, it can be understood that the table value itself may be prepared by cutting out the lower k bits (shifted right by k bits).
  Specifically, instead of using the value "j * B + t * N", the value "(j * B + t * N) >> k" obtained by shifting it right by k bits is defined as a new table, and the same calculation as in step 8 of Algorithm 8 is performed. However, the value of W is shifted right by k bits before the addition process. That is, the calculation is performed according to the following algorithm.
[Algorithm 9]
For j = 0 to 2^k-1 step 1 {                  Step 1
   For s = 0 to 2^k-1 step 1 {               Step 2
      U = (s + j * B[0, k]) mod 2^k          Step 3
      M = -U * N[0, k]^(-1) mod 2^k          Step 4
      NTBL[j, s] = (j * B + M * N) >> k      Step 5
   }
}
W = 0                                        Step 6
For i = 0 to n/k-1 step 1 {                  Step 7
   W = W / 2^k                               Step 8
   W = W + NTBL[A[i, k], W[0, k]]            Step 9
}
If W >= N then W = W - N                     Step 10
Output W                                     Step 11
Here, step 8 does not require a shifter in the implementation. Since the right shift is always by exactly k bits, the wiring is simply arranged to correspond directly to a k-bit shift. The sum of the lower k bits is not calculated. However, if at least one of the lower k bits of W is 1, that is, if the logical OR of the lower k bits is 1, this value is input to the adder as the carry C[-1]. This is because, when the logical OR is 0, all k bits are 0 and the corresponding lower k bits of M must also be 0, whereas when the logical OR is 1, M is selected so that the lower k bits of the sum with M are all 0, and a carry therefore occurs.
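The truncated-table variant with the OR-derived carry can be sketched in Python as follows. This is an illustration under assumptions: the table index s is taken from the lower k bits of W before those bits are discarded, n is a multiple of k, N is odd, and B < N. The function name `mont_mul_truncated` is hypothetical.

```python
def mont_mul_truncated(A, B, N, n, k):
    """Montgomery product A * B * 2^(-n) mod N using a right-shifted table."""
    size = 1 << k
    mask = size - 1
    n0 = N & mask
    n0_inv = next(c for c in range(size) if (c * n0) & mask == 1)
    # NTBL[j][s] = (j*B + M*N) >> k, the table entry without its low k bits
    ntbl = [[0] * size for _ in range(size)]
    for j in range(size):
        for s in range(size):
            u = (s + j * (B & mask)) & mask
            m = (-u * n0_inv) & mask
            ntbl[j][s] = (j * B + m * N) >> k
    W = 0
    for i in range(n // k):
        a_i = (A >> (k * i)) & mask
        s = W & mask
        carry = 1 if s != 0 else 0  # logical OR of the lower k bits of W
        # Discard the low k bits of W, add the truncated entry and the carry
        W = (W >> k) + ntbl[a_i][s] + carry
    if W >= N:
        W -= N
    return W
```

The carry is 1 exactly when the discarded low bits of W are nonzero, because the discarded low bits of the table entry are then 2^k minus those bits and the two always sum to 2^k.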
  FIG. 15 shows an embodiment of an addition processing circuit taking these points into consideration. In this embodiment, k = 2. In the circuit shown in FIG. 15, an NTBL register 1504 for storing an NTBL value and a W register 1503 for storing the W value are connected to an adder 1502. The lower 2 bits of the W register are not connected to the adder 1502; instead, they are connected to the value selection and signal control unit and are used to select the NTBL value. Further, each of the two bits is connected to an OR gate 1501 that calculates their logical OR. The output of the OR gate 1501 is input to the adder 1502 as the carry signal C[-1], which is used as the initial carry value when performing the addition. This is one of the embodiments of claims 4 and 7 of the present invention.
  Next, an embodiment of the present invention that operates on the Galois field GF (2 ^ n) according to the following algorithm 7 (reproduced) will be described.
[Algorithm 7 (repost)]
For j = 0 to 2^k-1 step +1 {                               Step 1
   TBL(H[j](X), B(X)) = H[j](X) * B(X)                     Step 2
   TBL(H[j](X), F(X)) = H[j](X) * F(X)                     Step 3
}
W(X) = 0                                                   Step 4
For j = 0 to n/k-1 step +1 {                               Step 5
   T(X) = W(X) + TBL(A[j, k](X), B(X))                     Step 6
   U(X) = T(X) mod X^k                                     Step 7
   M[j, k](X) = U(X) * F[0, k](X)^(-1) mod X^k             Step 8
   W(X) = W(X) + TBL(M[j, k](X), F(X))                     Step 9
   W(X) = W(X) / X^k                                       Step 10
}
The calculation in the Galois field GF (2 ^ n) is as described in [Prior Art]. Therefore, it is realized by changing all the addition processes in the circuit in the first embodiment of the present invention described above to exclusive OR for each bit.
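The Galois field version can be sketched in Python by representing a polynomial over GF(2) as an integer whose bit i is the coefficient of X^i, so that addition becomes XOR. The sketch below is an illustration under assumptions: the helper names `clmul`, `polymod`, and `gf2_mont_mul` are hypothetical, and step 9 is read as adding the F(X) table entry to T(X) (the value that already contains the B(X) term), since that is what makes the lower k coefficients cancel.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials stored as integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a, f):
    """Remainder of polynomial a modulo polynomial f over GF(2)."""
    df = f.bit_length() - 1
    while a.bit_length() - 1 >= df:
        a ^= f << (a.bit_length() - 1 - df)
    return a


def gf2_mont_mul(A, B, F, n, k):
    """Return W(X) with W(X) * X^n = A(X) * B(X) mod F(X), deg F = n."""
    size = 1 << k
    mask = size - 1
    f0 = F & mask
    # Inverse of F mod X^k (exists because the constant term of F is 1)
    f0_inv = next(c for c in range(size) if clmul(c, f0) & mask == 1)
    tbl_b = [clmul(h, B) for h in range(size)]  # H[j](X) * B(X)
    tbl_f = [clmul(h, F) for h in range(size)]  # H[j](X) * F(X)
    W = 0
    for i in range(n // k):
        a_i = (A >> (k * i)) & mask
        T = W ^ tbl_b[a_i]              # step 6, XOR instead of addition
        U = T & mask                    # step 7
        M = clmul(U, f0_inv) & mask     # step 8, no sign since -U = U
        W = (T ^ tbl_f[M]) >> k         # steps 9 and 10
    return W
```

No final correction is applied: for inputs of degree below n, the result already has degree below n.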
  Since the product operation consists of shift processing and addition processing, it is only necessary to replace the addition processing with bitwise exclusive OR processing. Needless to say, the sum is also replaced with an exclusive OR when constructing the table. However, unlike the normal Montgomery method, in which the modulus N is n bits, a polynomial of degree n is used, so a representation of n + 1 bits is required. In the case of the Galois field GF(2^n), the final value correction processing is not necessary. Other configurations may be the same as those of a normal Montgomery circuit. This is shown in FIG. 12. Here, it is assumed that k = 2 and the degree of the reduction polynomial F(X) is 257. Hereinafter, the same variable names as those used in Algorithm 7 are used.
  In FIG. 12, first, the value of A stored in the A register 1203 is read in order from the lower-order bits. The configuration of the A register is the same as that shown in FIG. 11. The control device 1207 receives the number i of the 2-bit block to be read from the 7-bit register 1206, reads the value of the corresponding 2-bit block, and transmits it to the register 1208. The register 1209 stores the lower 2 bits of W. The values of 1208 and 1209 are transmitted to the value selection and signal control device 1211, and 1211 reads the corresponding data from the value table 1202 (table generation is described later) via the data bus 1201 and sends it to the exclusive OR calculator 1215, which performs an exclusive OR of a 258-bit value and a 256-bit value. Dedicated registers 1213 and 1214 are connected to the exclusive OR calculator 1215, and the value read from the table is temporarily held in the 258-bit register 1214. Reference numeral 1213 denotes a 258-bit register that stores the W(X) appearing in Algorithm 7; it is cleared to 0 at the start of operation. The exclusive OR calculator 1215 calculates the exclusive OR of W(X) and the data of the form "(polynomial of degree 1 or lower) * B(X) + (polynomial of degree 1 or lower) * F(X)" taken from the value table 1202 (see FIG. 13). Since the lower 2 bits of the result are 00 in binary notation, they are cut off by the 2-bit right shifter 1216, and the result is input to the control device 1217. Meanwhile, the exclusive OR calculator 1215 sends the value 1 to the adder 1218 when the calculation starts. The adder 1218 adds the value of the 7-bit register 1219 and the calculation start signal "1" and sends the result to the 7-bit register 1219, which initially stores 0. This sum is also sent to the control device 1217. If the sum has not reached 128, the control device 1217 writes the output of 1216 to the register 1213 as it is, and 1211 extracts only the lower 2 bits of W(X) from 1213 and writes them to the register 1209. At this time, a signal is sent from 1217 to 1207 via 1211, the value of i is incremented by the exclusive OR calculator 1205, the next 2-bit block of the A register is read out, and the exclusive OR with the table value is calculated through the same process as described above. When the value of 1219 reaches 128, a signal is sent to the selective transmission circuit 1212, and the value of W(X) in 1213 is sent to the A register. At this time, the value of the A register is A(X) * B(X) * R(X)^(-1) mod F(X).
  The table generation circuit is also realized by replacing the addition process with an exclusive OR process. This is shown in FIG. 14.
   In this configuration example, the values F(X), XF(X), and (X + 1)F(X), which depend only on the reduction polynomial F(X) and not on the input values A(X) and B(X), are calculated in advance and held in the registers 1409, 1410, and 1411 for use. Further, since B(X) is the input value itself, it does not need to be calculated. B(X) is held in the register 1406.
  The eleven values other than 0 and the above four values B(X), F(X), XF(X), (X + 1)F(X) must be calculated each time. Here, a configuration is shown in which only the exclusive OR calculator is used to calculate the 11 values. Each register must be up to (k =) 2 bits larger. Each register is connected to a data bus 1421. The data bus 1421 is used as a signal line for supplying data to the exclusive OR processing unit. 1404 is an exclusive OR calculator (an adder in GF(2^n)) that adds 258-bit numbers and sends the 258-bit sum to the selector 1403 (note that no carry occurs in GF(2^n), so the number of bits does not increase).
  Upon receiving the calculation start signal derived from the clock signal, the control device 1405 initializes its internal counter to 0001 in binary notation, inputs the value of B(X) to both inputs of the exclusive OR calculator 1404 via the selector 1401, and operates the exclusive OR calculator 1404. Immediately before the calculation by the exclusive OR calculator 1404 starts, a control signal is sent from the exclusive OR calculator 1404 to the control device 1405. On receiving the control signal, the control device 1405 increments its internal counter and obtains the value 0010 in binary notation. Taking the lower 2 bits and the upper 2 bits of the counter as j(X) and t(X), respectively, in j(X) * B(X) + t(X) * F(X), the selector 1403 determines the register 1407 to which the output of the exclusive OR calculator 1404 is to be written, and the exclusive OR result is written there. Next, a calculation start signal is transmitted again to the control device 1405, and the control device 1405 sends a control signal to the selector 1401, reads B(X) and XB(X), inputs them to the exclusive OR calculator 1404, and performs the same operation as before to write the value (X + 1)B(X) to the register 1408. With the operations so far, the values needed to synthesize the table, B(X), XB(X), (X + 1)B(X), F(X), XF(X), and (X + 1)F(X), are all available. Thereafter, the register to be read and the register to be written are determined in the same way from the counter value of the control device, and the same operation is repeated. However, when the lower 2 bits of the counter are 00, the operation is skipped. After all the values have been obtained, the registers 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, and 1420 are used as read-only registers. This is one of the embodiments of claims 2 and 7 of the present invention.
  The above-described embodiments can be variously modified without departing from the gist of the present invention. For example, the following changes can be considered.
(1) Here, the exclusive OR calculator 1215 is configured to process 258 bits at the same time, but the operation can also be realized by dividing the operands into a plurality of blocks and processing them block by block in order. In this case, the amount of hardware is reduced, but the speed is also reduced.
(2) Without using the 2-bit right shifter 1216, only the bits above the lower 2 bits can be connected directly to the register 1213.
(3) Change the number of bits of the values A (X), B (X), F (X).
(4) The value of k can be changed to a value other than 2, for example, 1, 3, 4, etc. However, when n is not a multiple of k, a logic circuit for correction is required.
(5) The value 0 is removed from the table, and instead a simple logic circuit is added so that, when the value to be added to W(X) is 0, the value of W(X) is not input to the exclusive OR calculator but is input directly to the shifter.
  The above changes can be made independently of the “value table synthesis” which is essentially the gist of the present invention.
  Even in the Galois field, the table size can be reduced by k bits in the same manner as in the case of normal Montgomery processing. Except for the carry and the subtraction processing for the final value adjustment, the concept is exactly the same as that of the algorithm 9. The algorithm is shown below.
[Algorithm 10]
For j = 0 to 2^k-1 step +1 {                               Step 1
   NTBL(H[j](X), B(X)) = H[j](X) * B(X) >> k               Step 2
   NTBL(H[j](X), F(X)) = H[j](X) * F(X) >> k               Step 3
}
W(X) = 0                                                   Step 4
For j = 0 to n/k-1 step +1 {                               Step 5
   T(X) = W(X) + TBL(A[j, k](X), B(X))                     Step 6
   U(X) = T(X) mod X^k                                     Step 7
   M[j, k](X) = U(X) * F[0, k](X)^(-1) mod X^k             Step 8
   W(X) = W(X) + NTBL(M[j, k](X), F(X))                    Step 9
   W(X) = W(X) / X^k                                       Step 10
}
Here, in determining M[j, k](X), the lower k bits of the W(X) being processed may be used as they are, and no carry occurs; these points differ from the normal Montgomery method (although the mathematical expressions are almost the same).
  FIG. 16 shows a Galois field calculation circuit corresponding to FIG. 15. The difference from the circuit in FIG. 15 is that the adder is replaced by an exclusive OR calculator 1601 and that there is no OR circuit for generating a carry signal. This is one of the embodiments of claims 4 and 7 of the present invention.
  Up to this point, the embodiment of the present invention for the remainder multiplication and the embodiment of the present invention for the Galois field GF(2^n) have been shown separately, but the two can be merged.
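A software model of a FIG. 16 style datapath is sketched below in Python. It is an illustration under assumptions, not the circuit itself: the B(X) and F(X) contributions are merged into one truncated table indexed by the k-bit block of A and the lower k bits of W(X), polynomials are stored as integers with bit i as the coefficient of X^i, and all function names are hypothetical.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials stored as integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a, f):
    """Remainder of polynomial a modulo polynomial f over GF(2)."""
    df = f.bit_length() - 1
    while a.bit_length() - 1 >= df:
        a ^= f << (a.bit_length() - 1 - df)
    return a


def gf2_mont_mul_truncated(A, B, F, n, k):
    """Return W(X) with W(X) * X^n = A(X) * B(X) mod F(X), using a shifted table."""
    size = 1 << k
    mask = size - 1
    f0_inv = next(c for c in range(size) if clmul(c, F & mask) & mask == 1)
    ntbl = [[0] * size for _ in range(size)]
    for j in range(size):
        jb = clmul(j, B)
        for s in range(size):
            u = s ^ (jb & mask)             # low k bits that must be cancelled
            t = clmul(u, f0_inv) & mask
            ntbl[j][s] = (jb ^ clmul(t, F)) >> k
    W = 0
    for i in range(n // k):
        a_i = (A >> (k * i)) & mask
        # XOR never produces a carry, so no OR gate or carry input is needed
        W = (W >> k) ^ ntbl[a_i][W & mask]
    return W
```

Compared with the integer case, the only differences are that XOR replaces addition and that the carry term disappears.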
  The data length required for RSA encryption and elliptic curve encryption is significantly different. In general, the number of bits of N in RSA encryption requires about 1024 bits to 2048 bits in order to have sufficient security. On the other hand, the data length of elliptic curve cryptography having the same strength as RSA cryptography is about 160 to 256 bits. Therefore, a configuration in which a part of the RSA circuit is used as an arithmetic unit of GF (2 ^ n) is possible.
  It can be seen that the sum operation of GF (2 ^ n) is the same as the one with carry set to 0 in normal addition processing as follows.
  For example, take the logical expression for the j-th sum bit among those corresponding to the circuit of FIG. 4:
(Formula 4 (repost)) Z[j] = (S[j] AND T[j] AND C[j-1]) OR (S[j] AND NOT(T[j]) AND NOT(C[j-1])) OR (NOT(S[j]) AND T[j] AND NOT(C[j-1])) OR (NOT(S[j]) AND NOT(T[j]) AND C[j-1]).
Setting C[j-1] = 0 gives
(Formula 6) Z[j] = (S[j] AND T[j] AND 0) OR (S[j] AND NOT(T[j]) AND NOT(0)) OR (NOT(S[j]) AND T[j] AND NOT(0)) OR (NOT(S[j]) AND NOT(T[j]) AND 0) = (S[j] AND NOT(T[j])) OR (NOT(S[j]) AND T[j]) = S[j] EXOR T[j].
Here, S[j] EXOR T[j] is the exclusive OR of S[j] and T[j]. That is, if all carries are set to 0, the addition becomes a bitwise exclusive OR.
  If the polynomial representation of an element A of GF(2^n), A(X) = A[n-1] * X^(n-1) + A[n-2] * X^(n-2) + ... + A[1] * X + A[0], is associated with the bit string (A[n-1] A[n-2] ... A[1] A[0]), it can be seen that the bitwise exclusive OR coincides with the sum operation in GF(2^n).
Therefore, addition in GF(2^n) (exclusive OR) becomes possible by adding to the adder a circuit that sets the carry to 0. This embodiment is shown in FIG. 17. FIG. 17 is obtained by adding a selector to the adder of FIG. The adders are connected in a chain, and the carry C[j] propagates to the next adder. Each selector selects either the carry from the preceding adder or the value 0 according to whether the control signal is 1 or 0. For example, if the carry is selected when the control signal is 1 and the value 0 is selected when it is 0, the circuit functions as a normal adder in the former case and as a GF(2^n) adder (exclusive OR calculator) in the latter case.
  FIG. 18 shows the configuration of an adder that combines the circuits shown in FIGS. The adder with carry control function in FIG. 18 is the same as that in FIG. 17, and the switching signal determines whether it functions as an adder for remainder multiplication or as an adder for the product operation of the Galois field GF(2^n). This is one of the embodiments of claims 6 and 7 of the present invention.
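The carry-controlled adder can be modeled in Python as a ripple adder with a selector in front of each carry input. This is a behavioral sketch only; the function name `add_with_carry_control` and the decision to drop the final carry in GF(2^n) mode are assumptions made for the illustration.

```python
def add_with_carry_control(S, T, m, control):
    """m-bit adder whose carries can be forced to 0.

    control = 1: normal integer addition (result has m + 1 bits).
    control = 0: every carry input is 0, giving the bitwise exclusive OR.
    """
    carry = 0
    Z = 0
    for j in range(m):
        s = (S >> j) & 1
        t = (T >> j) & 1
        c_in = carry if control else 0  # selector: previous carry or the value 0
        total = s + t + c_in
        Z |= (total & 1) << j
        carry = total >> 1
    if control:
        Z |= carry << m                 # final carry is kept only in adder mode
    return Z
```

With the same operands, `control = 1` gives S + T and `control = 0` gives S XOR T, so one datapath serves both the remainder multiplication and the GF(2^n) product.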
  The gist of the present invention does not depend on the value of k, but in an implementation k cannot be made arbitrarily large. In particular, in a microcomputer for an IC card, the RAM size is about several kilobytes to several tens of kilobytes, which greatly restricts the value of k. In addition, in comparison with other methods, the inventors found that the advantage of the present invention is greatest when k = 2. This case is therefore included in the claims and is defined as claim 7.
  Needless to say, the hardware configuration for realizing the power-residue operation or the product operation of the Galois field GF (2 ^ n) is not limited to the various embodiments described above, and can be changed as appropriate.
[0006]
【Effects of the Invention】
The effects obtained by the representative ones of the inventions disclosed in the present application will be briefly described as follows.
That is, the power-residue calculation “Y ^ X mod N” and elliptic curve cryptography on the Galois field GF (2 ^ n) can be realized at high speed.
In addition, in the realization of the dedicated hardware described above for the power-residue operation or the product operation on the Galois field GF (2 ^ n), the logic circuit scale can be minimized.
Furthermore, by mounting the above-mentioned dedicated hardware on the same semiconductor chip as the IC card microcomputer, a microcomputer for encryption and decryption using the power-residue operation "Y ^ X mod N", or for encryption and decryption of elliptic curve cryptography on the Galois field GF(2^n), can be realized at low cost and in an easy-to-use form. Needless to say, the same effect can be expected not only for IC cards but also for devices that require low power consumption, such as GSM and other mobile terminals.
[Brief description of the drawings]
FIG. 1 is an overview of an IC card and its terminals.
FIG. 2 is a configuration of a microcomputer.
FIG. 3 shows the principle of Montgomery method.
FIG. 4 shows an example of a 1-bit full adder.
FIG. 5 is a 4-bit full adder.
FIG. 6 is a 4-bit subtraction circuit using a full adder.
FIG. 7 is a value correction circuit.
FIG. 8 is a value table when k = 2.
FIG. 9 is a value table generation circuit when k = 2.
FIG. 10 shows an embodiment of the present invention.
FIG. 11 shows the A register.
FIG. 12 shows an embodiment of the present invention for Galois field GF (2 ^ n).
FIG. 13 is a value table for k = 2 for a Galois field GF (2 ^ n).
FIG. 14 is a value table generation circuit when k = 2 for a Galois field GF (2 ^ n).
FIG. 15 shows an embodiment of an addition processing device in Montgomery remainder multiplication.
FIG. 16 shows an embodiment of an addition processing device in the Montgomery method for Galois field GF (2 ^ n).
FIG. 17 shows an adder with a carry control function.
FIG. 18 shows a shared circuit of an adder for remainder multiplication and an adder for product operation of Galois field GF (2 ^ n).

Claims

1. A remainder multiplication arithmetic unit that, for non-negative integers A and B and an n-bit odd number N, obtains the value W of the modular multiplication A * B * R^(-1) mod N + s * N (where s is a non-negative integer, R = 2^n, and X^Y denotes X raised to the power Y) by sequentially determining, k bits at a time from the lower bits, the value M of the indeterminate equation A * B + M * N = W * R in which the integers M and W are unknowns, the unit comprising:
A first storage device for storing A;
A second storage device for storing the table TBL(j, s) (where 0 ≦ j < 2^k and 0 ≦ s < 2^k);
A third storage device that initially stores 0 as W;
A control device that performs a first process of taking out a value k bits at a time from the lower side of A and storing it in a fourth storage device as a value j;
An adder that performs a second process of selecting a value TBL(j, s) from the table TBL stored in the second storage device using the value s of the lower k bits of W and the value j, and adding the TBL(j, s) to W; and
Shift means for performing a third process of shifting the addition result of the adder to the right by k bits and storing it as W in the third storage device,
Wherein the TBL(j, s) stores the sum j * B + t * N of the multiplication result j * B of B and the k-bit value j and the multiplication result t * N of N and a k-bit value t of M (t = 0, 1, ..., 2^k-1), in accordance with the value j and the value s defined by s = 2^k - (j * B + t * N) mod 2^k, and the first to third processes are repeatedly executed n/k times in the order of the first, second, and third processes to obtain A * B * R^(-1) mod N + s * N.

2. The modular multiplication unit according to claim 1, further comprising means for setting a value of a carry signal generated in the process of adding TBL (j, s) to W to 0.

3. A remainder multiplication arithmetic unit that, for non-negative integers A and B and an n-bit odd number N, obtains the value W of the modular multiplication A * B * R^(-1) mod N + s * N (where s is a non-negative integer, R = 2^n, and X^Y denotes X raised to the power Y) by sequentially determining, k bits at a time from the lower bits, the value M of the indeterminate equation A * B + M * N = W * R in which the integers M and W are unknowns, the unit comprising:
A first storage device for storing A;
A second storage device for storing the table NTBL(j, s) = (j * B + t * N) >> k (where 0 ≦ j < 2^k and 0 ≦ s < 2^k);
A third storage device that initially stores 0 as W;
A first control device that performs a first process of taking out a value k bits at a time from the lower side of A and storing it in a fourth storage device as a value j;
A second control device that performs a second process of selecting a value NTBL(j, s) from the table NTBL stored in the second storage device using the value s of the lower k bits of W and the value j;
An adder that performs a third process of shifting W to the right by k bits if the lower k bits of W are all 0, or shifting W to the right by k bits and then adding 1 if one or more of the lower k bits of W are 1, and adding the NTBL(j, s) to W; and
Means for performing a fourth process of storing the addition result of the adder as W in the third storage device,
Wherein the NTBL(j, s) stores the sum j * B + t * N of the multiplication result j * B of B and the k-bit value j and the multiplication result t * N of N and a k-bit value t of M (t = 0, 1, ..., 2^k-1), in accordance with the value j and the value s defined by s = 2^k - (j * B + t * N) mod 2^k, and the first to fourth processes are repeatedly executed n/k times in the order of the first, second, third, and fourth processes to obtain A * B * R^(-1) mod N + s * N.

The modular multiplication unit according to claim 3, further comprising means for setting to 0 the value of a carry signal generated in the process of adding NTBL(j, s) to W.

The modular multiplication unit according to claim 1, wherein k is 2.

The modular multiplication unit according to claim 3, wherein k is 2.

A modular multiplication unit which, for polynomials A(X) and B(X) having only 0 and 1 as coefficients and an irreducible polynomial F(X) of degree n having only 0 and 1 as coefficients and a constant term of 1, obtains the modular multiplication A(X) * B(X) * R(X)^(-1) mod F(X) in GF(2^n) (where R(X) = X^n) as the value W(X) of the indeterminate equation A(X) * B(X) + M(X) * F(X) = W(X) * R(X), in which M(X) and W(X) are unknowns, by determining the value M(X) sequentially, k terms at a time, from the lowest-order term, the modular multiplication unit comprising:
A first storage device for storing the coefficients of A(X);
A second storage device for storing the coefficients of B(X);
A third storage device for storing a table TBL(j(X), s(X)) (where j(X) and s(X) are polynomials of degree k-1 or lower having only 0 and 1 as coefficients);
A fourth storage device that initially stores 0 as W (X);
A control device that performs a first process of extracting A(X) k terms at a time from the lowest-order side and storing the extracted value in a fifth storage device as j(X);
An exclusive-OR operator that performs a second process of selecting the value TBL(j(X), s(X)) from the table TBL stored in the third storage device, using the value s(X) obtained by extracting the k lowest-order terms of W(X) and the value j(X), and adding TBL(j(X), s(X)) to W(X) in GF(2^n); and
Shift means for performing a third process of shifting the result of the addition in GF(2^n) by the exclusive-OR operator to the right by k bits and storing it in the fourth storage device as W(X),
wherein, in TBL(j(X), s(X)), the sum in GF(2^n), j(X) * B(X) + t(X) * F(X), of the product j(X) * B(X) of B(X) and the k-term value j(X) and the product t(X) * F(X) of F(X) and a k-term value t(X) of M(X) (t(X) = 0, 1, X, ..., X^(k-1) + X^(k-2) + ... + 1) is stored according to the value j(X) and the value s(X) defined by s(X) = j(X) * B(X) + t(X) * F(X) mod X^k,
and wherein A(X) * B(X) * R(X)^(-1) mod F(X) is obtained by repeating the first to third processes n/k times in the order of the first, second, and third processes, where addition (+) and multiplication (*) are addition and multiplication in the Galois field GF(2^n).
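The three processes recited above can be sketched in Python by representing each polynomial over GF(2) as an integer whose bit i is the coefficient of X^i. This is only an illustrative sketch under that assumption: the names clmul, polymod, build_tbl and gf2_modmul_tbl are hypothetical, n is assumed to be a multiple of k, and polymod is included only so that the result can be checked.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials held as integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a, f):
    """Remainder of a divided by f over GF(2) (used only for checking)."""
    df = f.bit_length() - 1
    while a.bit_length() - 1 >= df:
        a ^= f << (a.bit_length() - 1 - df)
    return a


def build_tbl(B, F, k):
    """TBL(j, s) = j*B + t*F, indexed by s = (j*B + t*F) mod X^k."""
    size = 1 << k
    mask = size - 1
    table = [[0] * size for _ in range(size)]
    for j in range(size):
        for t in range(size):
            v = clmul(j, B) ^ clmul(t, F)
            # F has constant term 1, so each s is produced by exactly one t.
            table[j][v & mask] = v
    return table


def gf2_modmul_tbl(A, B, F, n, k):
    """Return A*B*X^(-n) mod F over GF(2) (illustrative sketch)."""
    if F & 1 == 0 or F.bit_length() != n + 1 or n % k != 0:
        raise ValueError("F needs degree n and constant term 1; k must divide n")
    mask = (1 << k) - 1
    tbl = build_tbl(B, F, k)
    W = 0                                  # fourth storage device, initially 0
    for i in range(n // k):
        j = (A >> (k * i)) & mask          # first process: next k terms of A(X)
        s = W & mask                       # k lowest-order terms of W(X)
        W ^= tbl[j][s]                     # second process: addition in GF(2^n)
        W >>= k                            # third process: low k terms are now 0
    return W
```

One way to check a result W is to confirm that polymod(clmul(W, 1 << n), F) equals polymod(clmul(A, B), F).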

A modular multiplication unit which, for polynomials A(X) and B(X) having only 0 and 1 as coefficients and an irreducible polynomial F(X) of degree n having only 0 and 1 as coefficients and a constant term of 1, obtains the modular multiplication A(X) * B(X) * R(X)^(-1) mod F(X) in GF(2^n) (where R(X) = X^n) as the value W(X) of the indeterminate equation A(X) * B(X) + M(X) * F(X) = W(X) * R(X), in which M(X) and W(X) are unknowns, by determining the value M(X) sequentially, k terms at a time, from the lowest-order term, the modular multiplication unit comprising:
A first storage device for storing the coefficients of A(X);
A second storage device for storing the coefficients of B(X);
A third storage device for storing a table NTBL(j(X), s(X)) = (j(X) * B(X) + t(X) * F(X) - (j(X) * B(X) + t(X) * F(X) mod X^k)) / X^k (where j(X) and s(X) are polynomials of degree k-1 or lower having only 0 and 1 as coefficients, and s(X) = j(X) * B(X) + t(X) * F(X) mod X^k);
A fourth storage device that initially stores 0 as W (X);
A control device that performs a first process of extracting A(X) k terms at a time from the lowest-order side and storing the extracted value in a fifth storage device as j(X); and
Shift means and an exclusive-OR operator that perform a second process of selecting the value NTBL(j(X), s(X)) from the table NTBL stored in the third storage device, using the value s(X) obtained by extracting the k lowest-order terms of W(X) and the value j(X), shifting the coefficients of W(X) by k so as to obtain W(X) with the coefficients of its k lowest-order terms set to 0 and divided by X^k, adding NTBL(j(X), s(X)) to the shifted W(X) in GF(2^n), and storing the result in the fourth storage device as W(X),
wherein, in NTBL(j(X), s(X)), the value obtained by setting to 0 the k lowest-order terms of the sum in GF(2^n), j(X) * B(X) + t(X) * F(X), of the product j(X) * B(X) of B(X) and the k-term value j(X) and the product t(X) * F(X) of F(X) and a k-term value t(X) of M(X) (t(X) = 0, 1, X, ..., X^(k-1) + X^(k-2) + ... + 1), and dividing the result by X^k, is stored according to the value j(X) and the value s(X) defined by s(X) = j(X) * B(X) + t(X) * F(X) mod X^k,
and wherein A(X) * B(X) * R(X)^(-1) mod F(X) is obtained by repeating the first and second processes n/k times in the order of the first and second processes, where addition (+), subtraction (-), and multiplication (*) are addition, subtraction, and multiplication in the Galois field GF(2^n).
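The pre-shifted table recited above can be sketched in Python in the same integer-as-polynomial representation (bit i is the coefficient of X^i). This is only an illustrative sketch: the names clmul, polymod, build_ntbl_gf2 and gf2_modmul_ntbl are hypothetical, n is assumed to be a multiple of k, and polymod is included only for checking. Because each entry is already divided by X^k, the loop body reduces to one shift of W(X) and one exclusive OR.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials held as integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a, f):
    """Remainder of a divided by f over GF(2) (used only for checking)."""
    df = f.bit_length() - 1
    while a.bit_length() - 1 >= df:
        a ^= f << (a.bit_length() - 1 - df)
    return a


def build_ntbl_gf2(B, F, k):
    """NTBL(j, s) = (j*B + t*F with its k lowest terms cleared) / X^k."""
    size = 1 << k
    mask = size - 1
    table = [[0] * size for _ in range(size)]
    for j in range(size):
        for t in range(size):
            v = clmul(j, B) ^ clmul(t, F)
            # Index by s = v mod X^k; the right shift drops those k terms.
            table[j][v & mask] = v >> k
    return table


def gf2_modmul_ntbl(A, B, F, n, k):
    """Return A*B*X^(-n) mod F over GF(2) using the pre-shifted table (sketch)."""
    if F & 1 == 0 or F.bit_length() != n + 1 or n % k != 0:
        raise ValueError("F needs degree n and constant term 1; k must divide n")
    mask = (1 << k) - 1
    ntbl = build_ntbl_gf2(B, F, k)
    W = 0                                  # fourth storage device, initially 0
    for i in range(n // k):
        j = (A >> (k * i)) & mask          # first process: next k terms of A(X)
        s = W & mask                       # k lowest-order terms of W(X)
        W = (W >> k) ^ ntbl[j][s]          # second process: shift, then XOR
    return W
```

Unlike the integer case, no correction term is needed after the shift, since addition in GF(2^n) produces no carry out of the k lowest-order terms.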