JP4838756B2

JP4838756B2 - Multiple length arithmetic method, multiple length arithmetic device and program

Info

Publication number: JP4838756B2
Application number: JP2007131655A
Authority: JP
Inventors: 哲小田; 和麻呂青木; 剛山本
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-05-17
Filing date: 2007-05-17
Publication date: 2011-12-14
Anticipated expiration: 2027-05-17
Also published as: JP2008287489A

Description

The present invention relates to techniques for multiple length arithmetic, and in particular to techniques for performing multiple length arithmetic on large data at high speed.

The number of bits that a CPU (Central Processing Unit) register can handle is limited, for example to 32 or 64 bits. Therefore, when a general-purpose CPU operates on integers larger than its registers can hold, it uses a scheme called multiple length arithmetic. In general, a multiple length arithmetic scheme represents a large integer as an array of pieces, each as long as the bit length a CPU register can handle (the word length), and performs operations one word at a time.
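
The word-array representation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the word length W = 32 and the helper names `to_words`, `from_words` and `add_words` are assumptions introduced here.

```python
W = 32                      # assumed word length in bits
MASK = (1 << W) - 1         # selects the low W bits

def to_words(n):
    """Split a non-negative integer into W-bit words, least significant first."""
    words = []
    while True:
        words.append(n & MASK)
        n >>= W
        if n == 0:
            return words

def from_words(words):
    """Reassemble an integer from W-bit words (least significant first)."""
    n = 0
    for k, w in enumerate(words):
        n |= w << (W * k)
    return n

def add_words(a, b):
    """Add two word arrays one word at a time, propagating the carry."""
    out, carry = [], 0
    for k in range(max(len(a), len(b))):
        s = (a[k] if k < len(a) else 0) + (b[k] if k < len(b) else 0) + carry
        out.append(s & MASK)    # low W bits stay in this word
        carry = s >> W          # the rest moves to the next word
    if carry:
        out.append(carry)
    return out
```

Python integers are already arbitrary precision, so the sketch only makes the word-by-word carry handling visible; on a real CPU each list element would occupy one register.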

Normally, data used in a computation on the CPU is loaded from main memory into CPU registers, the computation is carried out on the CPU, and the result is stored back to main memory. However, loading data from main memory into a CPU register takes longer than a simple operation. A typical CPU therefore copies part of the main memory data into a fast-access memory called a cache memory and loads data into the CPU registers from the cache. Speed is gained because the second and subsequent loads of the same data are served from the cache memory.

As an approach to fast multiple length arithmetic in an environment where all data resides in the cache memory, operations with a high operation cost (the time needed to perform the operation) are converted into combinations of operations with a lower cost, thereby reducing the overall operation cost. For example, when large integers are multiplied on a typical CPU, multiplication costs more than addition, so it is common to use a method such as the Karatsuba method (Non-Patent Document 1), which gains speed by replacing multiplications with additions.
Takeshi Umetani, "Karatsuba method", [online], February 14, 2001, [retrieved May 8, 2007], Internet <URL http://homepage1.nifty.com/~umetani/ims/2001/20010214001/0.html>
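
As an illustration of trading multiplications for additions, a textbook recursive Karatsuba multiplication can be sketched as below. The function name `karatsuba` and the `cutoff_bits` parameter are assumptions for this sketch, which is not taken from the cited document.

```python
def karatsuba(x, y, cutoff_bits=64):
    """Multiply non-negative integers with three half-size products per level."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y                      # small enough for a direct multiply
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    low = karatsuba(x_lo, y_lo, cutoff_bits)
    high = karatsuba(x_hi, y_hi, cutoff_bits)
    # (x_hi + x_lo)(y_hi + y_lo) - high - low = x_hi*y_lo + x_lo*y_hi
    mid = karatsuba(x_hi + x_lo, y_hi + y_lo, cutoff_bits) - high - low
    return (high << (2 * half)) + (mid << half) + low
```

Each level performs three recursive products instead of four, at the price of a few extra additions and subtractions.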

However, with conventional methods, when R=x₁·r₁+…+x_I·r_I (I is a natural number of 2 or more, i∈{1,...,I}, and each r_i is an integer) is computed by multiple length arithmetic for integers x_i of large data size, the computation may not be fast enough.
If the data size of x_i is very large (for example, about 1 Mbit), each x_i read into the cache memory overflows the cache, and the CPU has to access main memory every time it refers to x_i. A cache memory can be accessed at high speed, but its storage capacity is generally far smaller than that of main memory. Consequently, in an environment where main memory access is even slower relative to computation, the bottleneck is no longer the operation cost but the main memory access time, and the computation slows down.

The present invention has been made in view of the above, and its object is to provide a technique for computing R=x₁·r₁+…+x_I·r_I at high speed by multiple length arithmetic for integers x_i of large data size.

To solve the above problem, the present invention computes R=x₁·r₁+…+x_I·r_I by a multiple length operation comprising: a bit division process in which a bit division unit divides each x_i into pieces x_i(j) of bit length ω(j) [j∈{1,...,J}, J is a natural number of 2 or more, and x_i is the bit concatenation x_i=x_i(J)|…|x_i(1) of x_i(J),…,x_i(1)]; an initialization process in which an arithmetic control unit assigns 1 to j and 0 to α(1); a product-sum operation process in which a product-sum operation unit computes T(j)=α(j)+x₁(j)·r₁+…+x_I(j)·r_I using α(j) and the x_i(j) and r_i read into a cache memory; a shift process in which, after the product-sum operation process, a shift unit takes the lower ω(j) bits of T(j) as R_j and the remaining bits of T(j) as the new α(j+1); a determination process in which, after the shift process, the arithmetic control unit determines whether j=J; a loop process in which, if j≠J was determined in the determination process, the arithmetic control unit takes j+1 as the new j and returns processing to the product-sum operation process; and a bit concatenation process in which, if j=J was determined in the determination process, a bit concatenation unit computes the bit concatenation R=α(J)|R_J|…|R₁ of α(J),R_J,...,R₁ and outputs R.
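
The sequence of processes above can be sketched in Python as follows. This is only an illustration under stated assumptions: ω(j) is taken as a constant `omega`, x_i and r_i are assumed non-negative, and the names `split_bits` and `chunked_mac` are hypothetical. The text writes the top part of the result as α(J); the sketch uses the carry that remains after the last shift.

```python
def split_bits(x, omega, J):
    """Bit division: x = x(J)|...|x(1); list index 0 holds x(1)."""
    mask = (1 << omega) - 1
    return [(x >> (omega * j)) & mask for j in range(J)]

def chunked_mac(xs, rs, omega):
    """Compute R = sum(x_i * r_i) one omega-bit slice at a time."""
    longest = max(x.bit_length() for x in xs)
    J = max(2, -(-longest // omega))          # number of slices, at least 2
    pieces = [split_bits(x, omega, J) for x in xs]
    mask = (1 << omega) - 1
    alpha = 0                                 # initialization: alpha(1) = 0
    R_parts = []
    for j in range(J):
        # product-sum process: T(j) = alpha(j) + sum_i x_i(j) * r_i
        T = alpha + sum(p[j] * r for p, r in zip(pieces, rs))
        R_parts.append(T & mask)              # shift process: R_j = low omega bits
        alpha = T >> omega                    # remaining bits carry to the next slice
    # bit concatenation: final carry | R_J | ... | R_1
    R = alpha
    for part in reversed(R_parts):
        R = (R << omega) | part
    return R
```

Only one slice of each x_i is touched per iteration, which is the property the text relies on to keep the working set small.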

Here, the x_i(j) handled by the product-sum operation unit are pieces divided from x_i, and J is a natural number of 2 or more, so each x_i(j) is shorter in bit length than x_i. Compared with having the product-sum operation unit handle x_i whole, data therefore overflows the cache memory less often and main memory is accessed fewer times. As a result, overall computation speed improves.

In the present invention, preferably, the storage capacity of the cache memory is smaller than the sum of the data amount of any one x_i and the data amount of all r_i, and larger than the sum of the data amount of one x_i(j) obtained by dividing that x_i and the data amount of all r_i. In this case, in an environment where the computation is performed with all r_i held in the cache memory, the frequency with which data overflows the cache memory can be reliably reduced.
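
The capacity condition can be checked numerically. The following sketch is an assumption-laden illustration: sizes are measured in bits, the function name `slice_condition_holds` is hypothetical, and real caches also need room for temporaries, which is ignored here.

```python
def slice_condition_holds(cache_bits, x_bits, r_bits_list, omega):
    """True when a whole x_i plus all r_i exceeds the cache but one slice plus all r_i fits."""
    r_total = sum(r_bits_list)
    whole_overflows = x_bits + r_total > cache_bits
    slice_fits = omega + r_total < cache_bits
    return whole_overflows and slice_fits
```

For example, with a 32 KiB cache (262,144 bits), one x_i of 1 Mbit (1,048,576 bits) and sixteen 64-bit r_i, a slice width of 65,536 bits satisfies the condition.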

In the present invention, preferably, the product-sum operation process is a process in which the product-sum operation unit operates in units of a predetermined bit length W, and ω(j) is longer than the predetermined bit length W and shorter than the bit length of x_i. In this case the product-sum operation unit needs to refer to the data of x_i(j) two or more times, so lowering the frequency with which x_i(j) overflows the cache memory reliably reduces the number of main memory accesses.

In the present invention, preferably, ω(j) is constant for all j. The product-sum operation unit can then use the same carry processing for every j, which simplifies the operation as a whole. One example of the benefit is a shorter program for implementing the product-sum operation unit.

In the present invention, preferably, the product-sum operation process is a process in which the product-sum operation unit uses x_i(j,p), obtained by dividing x_i(j) into pieces of bit length W−a (W>a≥0) [p∈{1,...,P}, P is a natural number of 1 or more, and x_i(j) is the bit concatenation x_i(j)=x_i(j,P)|…|x_i(j,1) of x_i(j,P),...,x_i(j,1)], and r_i(q), obtained by dividing r_i into pieces of bit length W−b (W>b≥0, I≤2^(a+b)) [q∈{1,...,Q}, Q is a natural number of 1 or more, and r_i is the bit concatenation r_i=r_i(Q)|…|r_i(1) of r_i(Q),...,r_i(1)], performs operations in units of bit length W to compute x₁(j,p)·r₁(q)+…+x_I(j,p)·r_I(q) for one or more pairs p, q, and uses the results to compute T(j)=α(j)+x₁(j)·r₁+…+x_I(j)·r_I.

Here, the product x_i(j,p)·r_i(q) of x_i(j,p) of bit length W−a and r_i(q) of bit length W−b always has a bit length of at most 2·W (more precisely, at most 2·W−a−b), and since I≤2^(a+b), the bit length of x₁(j,p)·r₁(q)+…+x_I(j,p)·r_I(q) is also always at most 2·W. In this case, the process of computing x₁(j,p)·r₁(q)+…+x_I(j,p)·r_I(q) needs no carry processing at or beyond bit length 2·W. As a result, the number of carry operations is reduced and computation speed improves. The computation of T(j)=α(j)+x₁(j)·r₁+…+x_I(j)·r_I from the results of x₁(j,p)·r₁(q)+…+x_I(j,p)·r_I(q) can be carried out using, for example, a known multiple length arithmetic method.
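
The bit-length bound can be checked with a short worst-case calculation. The sketch below assumes non-negative pieces and uses a hypothetical helper name, `max_partial_sum_bits`. It evaluates the largest possible value of the I-term sum and reports its bit length.

```python
def max_partial_sum_bits(W, a, b, I):
    """Bit length of the largest possible sum of I products of (W-a)-bit and (W-b)-bit pieces."""
    x_max = (1 << (W - a)) - 1      # largest (W-a)-bit value
    r_max = (1 << (W - b)) - 1      # largest (W-b)-bit value
    return (I * x_max * r_max).bit_length()
```

With W = 32, a = b = 2 and I = 16 = 2^(a+b), the worst-case sum still fits in 2·W = 64 bits, whereas with a = b = 0 a sum of just two products can need 65 bits.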

In the present invention, preferably, the x_i(j) produced by the bit division unit are stored in memory in the order x₁(1),...,x_I(1),x₁(2),...,x_I(2),...,x₁(J),...,x_I(J) so that they can be burst-transferred, and are read into the cache memory sequentially in that same order.

Here, when computing T(j)=α(j)+x₁(j)·r₁+…+x_I(j)·r_I for j=1 through j=J, the product-sum operation unit uses the x_i(j) in the order x₁(1),...,x_I(1),x₁(2),...,x_I(2),...,x₁(J),...,x_I(J). Storing the data in memory in this order so that it can be burst-transferred raises the transfer speed of x_i(j) in an environment where burst transfer is more efficient than random transfer, and thus improves overall computation speed.
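
The storage order can be illustrated with a small sketch. The function name `interleave` and the nested-list layout are assumptions: `pieces[i][j]` stands for x_{i+1}(j+1), and the result lists the pieces in the order in which the product-sum loop consumes them.

```python
def interleave(pieces):
    """Reorder pieces[i][j] into x_1(1),...,x_I(1),x_1(2),...,x_I(2),..."""
    I, J = len(pieces), len(pieces[0])
    return [pieces[i][j] for j in range(J) for i in range(I)]

def slice_of(flat, j, I):
    """Return the I pieces needed for slice j (0-based) as one contiguous run."""
    return flat[j * I:(j + 1) * I]
```

Because every slice occupies one contiguous run, a single sequential read fetches everything needed for one T(j).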

The present invention also provides a multiple length arithmetic method for computing R=x₁·r₁+…+x_I·r_I, where I is a natural number of 2 or more, i∈{1,...,I}, x_i is an integer of 0 or more, and r_i is an integer. Let H be an integer of 2 or more, h∈{1,...,H}, S(h) an integer of 2 or more, s(h)∈{1,...,S(h)}, i(h,s(h))∈{i(h,1),...,i(h,S(h))}⊂{1,...,I}, {i(1,1),...,i(1,S(1)),...,i(H,1),...,i(H,S(H))}={1,...,I}, R(h)=x_{i(h,1)}·r_{i(h,1)}+…+x_{i(h,S(h))}·r_{i(h,S(h))}, J(h) a natural number of 2 or more, and j∈{1,...,J(h)}. The method executes the following processes (a) to (h) sequentially for each h, and then has an addition process in which an addition unit computes R=R(1)+…+R(H) and outputs R.
(a) A bit division process in which a bit division unit divides each x_{i(h,s(h))} into pieces x_{i(h,s(h))}(j) of bit length ω(h,j) [x_{i(h,s(h))} is the bit concatenation x_{i(h,s(h))}=x_{i(h,s(h))}(J(h))|…|x_{i(h,s(h))}(1) of x_{i(h,s(h))}(J(h)),…,x_{i(h,s(h))}(1)].
(b) An initialization process in which an arithmetic control unit assigns 1 to j and 0 to α(h,1).
(c) A cache process in which the r_{i(h,1)},...,r_{i(h,S(h))} corresponding to h are read into a cache memory.
(d) A product-sum operation process in which a product-sum operation unit computes T(h,j)=α(h,j)+x_{i(h,1)}(j)·r_{i(h,1)}+…+x_{i(h,S(h))}(j)·r_{i(h,S(h))} using α(h,j) and the x_{i(h,s(h))}(j) and r_{i(h,s(h))} read into the cache memory.
(e) A shift process in which, after the product-sum operation process, a shift unit takes the lower ω(h,j) bits of T(h,j) as R(h,j) and the remaining bits of T(h,j) as the new α(h,j+1).
(f) A determination process in which, after the shift process, the arithmetic control unit determines whether j=J(h).
(g) A loop process in which, if j≠J(h) was determined in the determination process, the arithmetic control unit takes j+1 as the new j and returns processing to the product-sum operation process.
(h) A bit concatenation process in which, if j=J(h) was determined in the determination process, a bit concatenation unit computes the bit concatenation R(h)=α(h,J(h))|R(h,J(h))|…|R(h,1) of α(h,J(h)),R(h,J(h)),...,R(h,1).
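
The grouped variant can be sketched as below. This is an illustration under assumptions: ω(h,j) is a constant `omega`, x_i and r_i are non-negative, `groups` is a hypothetical list of index lists that partitions the indices (0-based here), and the function names are introduced only for this sketch. As in the text, each group is processed slice by slice and the partial results R(h) are then added.

```python
def slice_mac(xs, rs, omega):
    """R(h) for one group: slice-by-slice product-sum with carry alpha."""
    J = max(2, -(-max(x.bit_length() for x in xs) // omega))
    mask = (1 << omega) - 1
    alpha, result = 0, 0
    for j in range(J):
        # T(h, j) = alpha(h, j) + sum over the group of x(j) * r
        T = alpha + sum(((x >> (omega * j)) & mask) * r for x, r in zip(xs, rs))
        result |= (T & mask) << (omega * j)    # place R(h, j)
        alpha = T >> omega                     # alpha(h, j+1)
    return result | (alpha << (omega * J))     # remaining carry goes on top

def grouped_mac(xs, rs, groups, omega):
    """R = R(1) + ... + R(H), each R(h) computed from one group of indices."""
    total = 0
    for group in groups:
        total += slice_mac([xs[i] for i in group], [rs[i] for i in group], omega)
    return total
```

Only the r_i of the current group need to be resident while that group is processed, which is the point of splitting the index set.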

In this method, R=x₁·r₁+…+x_I·r_I=R(1)+…+R(H), and R is computed by applying the present invention to each R(h). The method is effective when I is large and r₁,…,r_I cannot all be held in the cache memory at once.

In this case, the storage capacity of the cache memory is desirably smaller than the sum of the data amount of any one x_{i(h,s(h))} and the data amount of all r_i, and larger than the sum of the data amount of one x_{i(h,s(h))}(j) obtained by dividing that x_{i(h,s(h))} and the data amount of the r_{i(h,1)},...,r_{i(h,S(h))} corresponding to that h. This is because the r_{i(h,1)},...,r_{i(h,S(h))} that the product-sum operation unit needs to compute R(h) can then be held in the cache memory at once.

As described above, the present invention makes it possible to compute R=x₁·r₁+…+x_I·r_I at high speed by multiple length arithmetic for integers x_i of large data size.

The best mode for carrying out the present invention is described below with reference to the drawings.
[First Embodiment]
First, a first embodiment of the present invention is described.
<Configuration>
FIG. 1 is a block diagram showing the configuration of the multiple length arithmetic device 1 of this embodiment, and FIG. 2 is a block diagram illustrating the detailed configuration of the register file 50 and the product-sum operation unit 60 of FIG. 1. The arrows in FIGS. 1 and 2 indicate data flow, but the data flow into and out of the arithmetic control unit 90 and part of the data flow into and out of the register file 50 are omitted.

As illustrated in FIG. 1, the multiple length arithmetic device 1 of this embodiment has a main memory 10, a bit division unit 20, a cache memory 30, a register storage unit 40, a register file 50, a product-sum operation unit 60, a shift unit 70, a bit concatenation unit 80 and an arithmetic control unit 90.
The multiple length arithmetic device 1 of this embodiment is constructed by loading a predetermined program into a known computer comprising a CPU, RAM (Random Access Memory), ROM (Read Only Memory), a hard disk device, a bus and the like, and having the CPU execute that program.

That is, the main memory 10 corresponds to, for example, RAM or a hard disk device. The cache memory 30 is a memory that can be accessed faster than the main memory 10 and corresponds to, for example, a cache memory built into the CPU. As illustrated in FIG. 2, the register file 50 is built into, for example, the CPU and has a plurality of registers 51, each able to store data of bit length (register length) W. W is the unit of operation of the product-sum operation unit 60 and corresponds to, for example, the word length. Examples of W are 32 bits and 64 bits, but the present invention is not limited to these.

The bit division unit 20, product-sum operation unit 60, shift unit 70, bit concatenation unit 80 and arithmetic control unit 90 correspond to, for example, a CPU into which a predetermined program has been loaded. The product-sum operation unit 60 illustrated in FIG. 2 is a block diagram for the case where multiple length multiplication is performed by the Karatsuba method (see Non-Patent Document 1), and comprises a multiplication unit 61 that multiplies in units of bit length W, a subtraction unit 62, an addition unit 63 and a carry unit 64.

<Multiple Length Arithmetic Method>
Next, the multiple length arithmetic method of this embodiment is described.
[Premise]
In this embodiment, bit data X∈{0,1}^L of bit length L (L is a natural number of 2 or more) is divided according to a predetermined procedure into I (I is a natural number of 2 or more) integers x_i of 0 or more (i∈{1,...,I}; each x_i is bit data), and the multiple length operation R=x₁·r₁+…+x_I·r_I is performed using these x_i and given integers r_i (i∈{1,...,I}; each r_i is bit data). As preprocessing, the integers x_i and r_i are assumed to be stored in the main memory 10.

[Multiple length operation R=x₁·r₁+…+x_I·r_I]
FIG. 4 is a flowchart for explaining the multiple length arithmetic method of this embodiment. The description below follows this figure.

First, the bit division unit 20 reads each x_i from the main memory 10, divides it into pieces x_i(j) of bit length ω(j), and stores each x_i(j) in the main memory 10 (step S1 / bit division process). FIG. 3(a) is a conceptual diagram of each x_i divided into the pieces x_i(j). Here j∈{1,...,J}, J is a constant natural number of 2 or more, and the relation x_i=x_i(J)|…|x_i(1) holds (x_i(J)|…|x_i(1) is the bit concatenation of x_i(J),…,x_i(1)). ω(j) is longer than the predetermined bit length W and shorter than the bit length of x_i. The storage capacity of the cache memory 30 of this embodiment is assumed to be smaller than the sum of the data amount of any one x_i and the data amount of all r_i, and larger than the sum of the data amount of one x_i(j) obtained by dividing that x_i and the data amount of all r_i. As noted above, for efficient computation it is desirable that ω(j) be constant for all j.
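
Step S1 and its inverse can be sketched for the general case where the widths ω(j) differ. The names `bit_divide` and `bit_concat` and the `widths` list (holding ω(1), ω(2), ... in that order) are assumptions for this illustration.

```python
def bit_divide(x, widths):
    """Split x into pieces x(1), x(2), ... with bit lengths widths[0], widths[1], ..."""
    pieces, shift = [], 0
    for w in widths:
        pieces.append((x >> shift) & ((1 << w) - 1))
        shift += w
    if x >> shift:
        raise ValueError("widths do not cover all bits of x")
    return pieces

def bit_concat(pieces, widths):
    """Inverse of bit_divide: rebuild x = x(J)|...|x(1)."""
    x, shift = 0, 0
    for p, w in zip(pieces, widths):
        x |= p << shift
        shift += w
    return x
```

For instance, dividing 0xB6E1 with widths 4, 8 and 4 gives the pieces 0x1, 0x6E and 0xB, and concatenating them restores 0xB6E1.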

After step S1, the arithmetic control unit 90 assigns 1 to j and 0 to α(1), and stores these values in the registers 51 (step S2 / initialization process). The arithmetic control unit 90 controls each process using the values of j and α(j) stored in the registers 51.

Next, under the control of the arithmetic control unit 90, all r_i stored in the main memory 10 are read into the cache memory 30 (step S3).

Then, while reading x_i(j) from the main memory 10 into the cache memory 30, the product-sum operation unit 60 computes T(j)=α(j)+x₁(j)·r₁+…+x_I(j)·r_I by multiple length arithmetic with word length W, using α(j) and the x_i(j) and r_i read into the cache memory 30, and outputs the result T(j) (step S4 / product-sum operation process). That is, the register storage unit 40 stores in the registers 51 of register length W the x_i(j,p) [p∈{1,...,P}, P≥2, x_i(j)=x_i(j,P)|…|x_i(j,1)] (FIG. 3(b)), obtained by dividing the x_i(j) held in the cache memory 30 (FIG. 3(a)) into pieces of bit length W, and the r_i(q) [q∈{1,...,Q}, r_i=r_i(Q)|…|r_i(1)] (FIG. 3(d)), obtained by dividing the r_i held in the cache memory 30 (FIG. 3(c)) into pieces of bit length W, while the product-sum operation unit 60, the shift unit 70 and the bit concatenation unit 80 compute and output T(j)=α(j)+x₁(j)·r₁+…+x_I(j)·r_I by multiple length arithmetic with word length W. A specific example of the product-sum operation process (step S4) is described below.

[Specific example of the product-sum operation process (step S4)]
FIG. 5 is a flowchart for explaining a specific example of the product-sum operation process (step S4) of FIG. 4. The description below follows this figure. The example uses the Karatsuba method (Non-Patent Document 1). However, the present invention is not limited to this, and step S4 may be executed using another known multiple length arithmetic method (such as the well-known schoolbook long multiplication algorithm or an algorithm based on the fast Fourier transform). To simplify the description, the example assumes that x_i(j) and r_i are each at most two words (2·W) long.

First, the arithmetic control unit 90 assigns 1 to i and stores i in a register 51 (step S4a). The arithmetic control unit 90 controls each process using the value of i stored in the register 51.

Next, the arithmetic control unit 90 determines whether x_i(j) is present in the cache memory 30 (step S4b). If x_i(j) is stored in the cache memory 30, the arithmetic control unit 90 moves processing to step S4d. If x_i(j) is not stored in the cache memory 30, x_i(j) is read from the main memory 10 into the cache memory 30 and processing then moves to step S4d.

In step S4d, using x_i(j,p) [p∈{1,2}, x_i(j)=x_i(j,2)|x_i(j,1)], obtained by dividing x_i(j) into pieces of bit length W, and r_i(q) [q∈{1,2}, r_i=r_i(2)|r_i(1)], obtained by dividing r_i into pieces of bit length W, the following are computed: s_i(j,1)=x_i(j,1)·r_i(1), s_i(j,3)=x_i(j,2)·r_i(2), and s_i(j,2)={x_i(j,2)-x_i(j,1)}·(r_i(1)-r_i(2))+s_i(j,1)+s_i(j,3) (step S4d). Specifically, for example, the register storage unit 40 reads x_i(j,1) and r_i(1) from the cache memory 30 and stores them in the registers 51, and the multiplication unit 61 (FIG. 2) uses the stored x_i(j,1) and r_i(1) to compute s_i(j,1)=x_i(j,1)·r_i(1) and stores s_i(j,1) in a register 51. Likewise, for example, the register storage unit 40 reads x_i(j,2) and r_i(2) from the cache memory 30 and stores them in the registers 51, and the multiplication unit 61 uses the stored x_i(j,2) and r_i(2) to compute s_i(j,3)=x_i(j,2)·r_i(2) and stores s_i(j,3) in a register 51. Further, the multiplication unit 61, the subtraction unit 62 and the addition unit 63 use the data stored in the registers 51 to compute s_i(j,2)={x_i(j,2)-x_i(j,1)}·(r_i(1)-r_i(2))+s_i(j,1)+s_i(j,3) and store s_i(j,2) in a register 51. The storage capacity of the cache memory 30 of this embodiment is larger than the sum of the data amount of one x_i(j) and the data amount of all r_i. Therefore, provided the cache memory 30 also has room for the other operations and temporary areas, the x_i(j) and r_i needed to compute s_i(j,1), s_i(j,3) and s_i(j,2) do not overflow the cache memory 30. In that case there is no need to access the main memory 10 in step S4d (a feature of this embodiment).
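
The three terms of step S4d can be sketched for a single i as follows. The word length W = 32 and the name `karatsuba_terms` are assumptions. Python integers are signed and unbounded, so the possibly negative differences x_i(j,2)-x_i(j,1) and r_i(1)-r_i(2) need no special handling here, whereas fixed-width hardware would have to track their signs.

```python
W = 32
MASK = (1 << W) - 1

def karatsuba_terms(x_piece, r):
    """Return (s1, s2, s3) for a two-word x_i(j) and a two-word r_i."""
    x1, x2 = x_piece & MASK, x_piece >> W      # x_i(j,1), x_i(j,2)
    r1, r2 = r & MASK, r >> W                  # r_i(1), r_i(2)
    s1 = x1 * r1                               # s_i(j,1)
    s3 = x2 * r2                               # s_i(j,3)
    s2 = (x2 - x1) * (r1 - r2) + s1 + s3       # s_i(j,2) = x2*r1 + x1*r2
    return s1, s2, s3
```

Only three word multiplications are used, and s1 + s2·2^W + s3·2^(2·W) equals the full product x_i(j)·r_i.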

After step S4d, the arithmetic control unit 90 determines whether the value of i stored in the register 51 equals I (step S4e). If i≠I, it stores i+1 in the register 51 as the new i (step S4f) and returns processing to step S4b. If i=I, the addition unit 63 uses the data in the registers 51 to compute s(j,1)=s₁(j,1)+...+s_I(j,1), s(j,2)=s₁(j,2)+...+s_I(j,2) and s(j,3)=s₁(j,3)+...+s_I(j,3), and stores s(j,1), s(j,2) and s(j,3) in the registers 51 (step S4g).

After step S4g, the addition unit 63 calculates s(j,1) + α(j) using the data in the register 51, and the carry unit 64 sets the lower W bits of the result as T(j,1) and the remaining bits as β(j,2), and stores them in the register 51 (step S4h). Next, the addition unit 63 calculates s(j,2) + β(j,2) using the data in the register 51, and the carry unit 64 sets the lower W bits of the result as T(j,2) and the remaining bits as β(j,3), and stores them in the register 51 (step S4i). Next, the addition unit 63 calculates s(j,3) + β(j,3) using the data in the register 51 and stores the result in the register 51 as T(j,3) (step S4j).
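The carry propagation of steps S4h to S4j can be sketched as follows. This is an illustrative sketch only: `W = 32` and the function names `carry_chain` and `concatenate` are assumptions, and `concatenate` corresponds to the bit concatenation T(j) = T(j,3)|T(j,2)|T(j,1).

```python
W = 32                      # word length in bits (illustrative value)
MASK = (1 << W) - 1

def carry_chain(s1, s2, s3, alpha):
    """Propagate carries word by word, as in steps S4h to S4j."""
    t = s1 + alpha
    T1, beta2 = t & MASK, t >> W      # step S4h: low W bits and carry-out
    t = s2 + beta2
    T2, beta3 = t & MASK, t >> W      # step S4i
    T3 = s3 + beta3                   # step S4j: top part keeps all its bits
    return T1, T2, T3

def concatenate(T1, T2, T3):
    """Bit concatenation T(j,3)|T(j,2)|T(j,1)."""
    return T1 | (T2 << W) | (T3 << (2 * W))
```

Because T1 and T2 are each reduced to W bits, the concatenation equals α(j) + s(j,1) + s(j,2)·2^W + s(j,3)·2^(2W).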

Thereafter, the bit combination unit 80 outputs the bit concatenation T(j) = T(j,3)|T(j,2)|T(j,1) of T(j,3), T(j,2), and T(j,1) stored in the register 51 (step S4k; end of the description of [Specific example of the product-sum operation process (step S4)]).
After the product-sum operation process (step S4), the shift unit 70 sets the lower ω(j) bits of T(j) as R_j and the remaining bits of T(j) as the new α(j+1), and outputs them (step S5 / shift process).

After the shift process (step S5), the arithmetic control unit 90 refers to j stored in the register 51 and determines whether j = J (step S6 / determination process). If j ≠ J, the arithmetic control unit 90 stores j+1 in the register 51 as the new j and returns the process to the product-sum operation process (step S4) (step S7 / loop process). If j = J is determined in the determination process (step S6), the bit combination unit 80 calculates the bit concatenation R = α(J)|R_J|...|R_1 of α(J), R_J, ..., R_1 and outputs R (step S8 / bit combination process).
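The overall loop of steps S4 to S8 can be sketched in Python as below. This is a simplified illustration, not the implementation of the embodiment: it assumes a single chunk width `omega` for every j, uses Python integers in place of registers, and places the carry left after the last chunk on top of R_J, ..., R_1. The function name `multiply_accumulate` is hypothetical.

```python
def multiply_accumulate(xs, rs, omega, J):
    """Compute R = sum(x_i * r_i) while scanning each x_i in omega-bit chunks."""
    mask = (1 << omega) - 1
    alpha = 0                # alpha(1) = 0
    R = 0
    for j in range(J):       # corresponds to j = 1, ..., J in the text
        shift = j * omega
        # Product-sum step: T(j) = alpha(j) + sum_i x_i(j) * r_i
        T = alpha + sum(((x >> shift) & mask) * r for x, r in zip(xs, rs))
        # Shift step: low omega bits become R_j, the rest carries forward
        R |= (T & mask) << shift
        alpha = T >> omega
    # Bit combination: the remaining carry sits above R_J | ... | R_1
    return R | (alpha << (J * omega))
```

Each pass touches only one chunk of every x_i together with all r_i, which is the access pattern that lets the r_i stay in cache while the large x_i stream through once.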

As described above, in this embodiment, provided that space for other operations and temporary areas can also be secured in the cache memory 30, there is no need to access the main memory 10 in step S4d, so the computation speed improves.

[Second Embodiment]
Next, a second embodiment of the present invention will be described.
This embodiment is a modification of the first embodiment. It differs from the first embodiment only in the following point. In the product-sum operation process (step S4), the product-sum operation unit 60 uses x_i(j,p), obtained by dividing x_i(j) into units of bit length W-a (W > a ≥ 0) [p ∈ {1,...,P}, P is a natural number of 1 or more, and x_i(j) is the bit concatenation x_i(j) = x_i(j,P)|...|x_i(j,1) of x_i(j,P),...,x_i(j,1)], and r_i(q), obtained by dividing r_i into units of bit length W-b (W > b ≥ 0, I ≤ 2^{a+b}) [q ∈ {1,...,Q}, Q is a natural number of 1 or more, and r_i is the bit concatenation r_i = r_i(Q)|...|r_i(1) of r_i(Q),...,r_i(1)]. It performs operations in units of bit length W to calculate x_1(j,p)·r_1(q) + ... + x_I(j,p)·r_I(q) for one or more pairs p, q, and uses the results to calculate T(j) = α(j) + x_1(j)·r_1 + ... + x_I(j)·r_I. Only this difference is described below.

<Example 1>
Example 1 of this embodiment uses P = Q = 2 and b > a (FIGS. 6(a) and 6(b)) and performs multiplication in units of bit length W using the schoolbook (long multiplication) algorithm.
FIG. 7 is a flowchart for explaining the processing of step S4 in Example 1 of the second embodiment. The processing of step S4 in Example 1 is described below with reference to this figure.

Steps S4a to S4c of Example 1 are the same as in the first embodiment, so their description is omitted. In Example 1, after step S4b or S4c, the following are calculated using x_i(j,p), obtained by dividing x_i(j) read into the cache memory 30 into units of bit length W-a [p ∈ {1,2}, x_i(j) = x_i(j,2)|x_i(j,1)], and r_i(q), obtained by dividing r_i into units of bit length W-b (b > a) [q ∈ {1,2}, r_i = r_i(2)|r_i(1)]: s_i(j,1) = x_i(j,1)·r_i(1), s_i(j,2) = x_i(j,1)·r_i(2), s_i(j,3) = x_i(j,2)·r_i(1), and s_i(j,4) = x_i(j,2)·r_i(2) (step S4m).

Specifically, for example, the register storage unit 40 reads x_i(j,1) and r_i(1) from the cache memory 30 and stores them in the register 51, and the multiplication unit 61 (FIG. 2) uses the stored x_i(j,1) and r_i(1) to calculate s_i(j,1) = x_i(j,1)·r_i(1) and stores s_i(j,1) in the register 51. Likewise, the register storage unit 40 reads x_i(j,2) and r_i(2) from the cache memory 30 and stores them in the register 51, and the multiplication unit 61 uses the stored x_i(j,2) and r_i(2) to calculate s_i(j,4) = x_i(j,2)·r_i(2) and stores s_i(j,4) in the register 51. Further, the multiplication unit 61 uses the data stored in the register 51 to calculate s_i(j,2) = x_i(j,1)·r_i(2) and s_i(j,3) = x_i(j,2)·r_i(1), and stores s_i(j,2) and s_i(j,3) in the register 51. Since the bit length of x_i(j,p) is W-a and the bit length of r_i(q) is W-b, each of s_i(j,1), s_i(j,2), s_i(j,3), and s_i(j,4) is a bit string of 2·W-(a+b) [= (W-a)+(W-b)] bits.

After step S4m, the arithmetic control unit 90 determines whether i stored in the register 51 equals I (step S4e). If i ≠ I, i+1 is stored in the register 51 as the new i (step S4f), and the process returns to step S4b. If i = I, the addition unit 63 uses the data in the register 51 to calculate s(j,1) = s_1(j,1) + ... + s_I(j,1), s(j,2) = s_1(j,2) + ... + s_I(j,2), s(j,3) = s_1(j,3) + ... + s_I(j,3), and s(j,4) = s_1(j,4) + ... + s_I(j,4), and stores s(j,1), s(j,2), s(j,3), and s(j,4) in the register 51 (step S4n). In this embodiment, each of s_i(j,1), s_i(j,2), s_i(j,3), and s_i(j,4) is a bit string of 2·W-(a+b) [= (W-a)+(W-b)] bits and I ≤ 2^{a+b} holds, so each of s(j,1), s(j,2), s(j,3), and s(j,4) is a bit string of at most 2·W bits. That is, the calculation in step S4n requires no carry processing beyond two words (a feature of this embodiment).
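The bound behind this statement can be checked numerically. The sketch below is only an illustration with hypothetical function names: it takes the largest possible (W-a)-bit and (W-b)-bit operands, multiplies them, and tests whether I such products still fit in two W-bit words.

```python
def worst_case_sum(W, a, b, I):
    """Largest possible value of a sum of I products of (W-a)- and (W-b)-bit parts."""
    x_max = (1 << (W - a)) - 1
    r_max = (1 << (W - b)) - 1
    return I * x_max * r_max

def fits_in_two_words(W, a, b, I):
    """True if the accumulated sum needs at most 2*W bits."""
    return worst_case_sum(W, a, b, I) < (1 << (2 * W))
```

For example, with W = 32, a = 2, and b = 3, accumulating I = 2^(a+b) = 32 products stays within 64 bits, whereas I = 33 does not in the worst case.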

After step S4n, the addition unit 63 calculates s(j,1) + α(j) using the data in the register 51, and the carry unit 64 sets the lower W-b bits of the result as T(j,1) and the remaining bits as β(j,2), and stores them in the register 51 (step S4p). Next, the addition unit 63 calculates s(j,2) + β(j,2) using the data in the register 51, and the carry unit 64 sets the lower b-a [= (W-a)-(W-b)] bits of the result as T(j,2) and the remaining bits as β(j,3) (step S4q). Next, the addition unit 63 calculates s(j,3) + β(j,3) using the data in the register 51, and the carry unit 64 sets the lower W-b [= (W-a)+(W-b)-(W-a)] bits of the result as T(j,3) and the remaining bits as β(j,4) (step S4r). Next, the addition unit 63 calculates s(j,4) + β(j,4) using the data in the register 51 and stores the result in the register 51 as T(j,4) (step S4s).
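The uneven field widths in steps S4p to S4s follow from the bit positions of the four partial sums: s(j,1) starts at bit 0, s(j,2) at bit W-b, s(j,3) at bit W-a, and s(j,4) at bit 2·W-a-b. The following Python sketch illustrates this; the function names and the parameter values used with it are assumptions for illustration, and the last line of `carry_chain` performs the bit concatenation T(j,4)|T(j,3)|T(j,2)|T(j,1).

```python
def split(v, low_width):
    """Return (low low_width bits of v, remaining high bits of v)."""
    return v & ((1 << low_width) - 1), v >> low_width

def schoolbook_partials(x, r, W, a, b):
    """Four partial products of step S4m for one pair (x_i(j), r_i)."""
    x1, x2 = split(x, W - a)
    r1, r2 = split(r, W - b)
    return x1 * r1, x1 * r2, x2 * r1, x2 * r2

def carry_chain(s, alpha, W, a, b):
    """Steps S4p to S4s, followed by concatenation of T(j,4)..T(j,1)."""
    s1, s2, s3, s4 = s
    T1, beta2 = split(s1 + alpha, W - b)   # S4p: field of W-b bits
    T2, beta3 = split(s2 + beta2, b - a)   # S4q: field of b-a bits
    T3, beta4 = split(s3 + beta3, W - b)   # S4r: field of W-b bits
    T4 = s4 + beta4                        # S4s: everything that remains
    return T1 | (T2 << (W - b)) | (T3 << (W - a)) | (T4 << (2 * W - a - b))
```

The three field widths add up to (W-b) + (b-a) + (W-b) = 2·W-a-b, which is exactly the bit position at which T(j,4) begins.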

Thereafter, the bit combination unit 80 outputs the bit concatenation T(j) = T(j,4)|T(j,3)|T(j,2)|T(j,1) of T(j,4), T(j,3), T(j,2), and T(j,1) stored in the register 51 (step S4t).

<Example 2>
Example 2 of this embodiment uses P = Q = 2 and b = a (FIGS. 6(c) and 6(d)) and performs multiplication in units of bit length W using the Karatsuba method (Non-Patent Document 1).
FIG. 8 is a flowchart for explaining the processing of step S4 in Example 2 of the second embodiment. The processing of step S4 in Example 2 is described below with reference to this figure. The only difference between the first embodiment and Example 2 is that step S4d (FIG. 5) is replaced with step S4t (FIG. 8).

In step S4t of this example, the following are calculated using x_i(j,p), obtained by dividing x_i(j) into units of bit length W-a [p ∈ {1,2}, x_i(j) = x_i(j,2)|x_i(j,1)], and r_i(q), obtained by dividing r_i into units of bit length W-b [q ∈ {1,2}, r_i = r_i(2)|r_i(1)]: s_i(j,1) = x_i(j,1)·r_i(1), s_i(j,3) = x_i(j,2)·r_i(2), and s_i(j,2) = {x_i(j,2) - x_i(j,1)}·(r_i(1) - r_i(2)) + s_i(j,1) + s_i(j,3) (step S4t). Here, since the bit length of x_i(j,p) is W-a and the bit length of r_i(q) is W-b, each of s_i(j,1), s_i(j,2), and s_i(j,3) is a bit string of 2·W-(a+b) [= (W-a)+(W-b)] bits. Moreover, since I ≤ 2^{a+b} holds, each of s(j,1), s(j,2), and s(j,3) calculated in the subsequent step S4g is a bit string of at most 2·W bits. That is, the calculation in step S4g requires no carry processing beyond two words (a feature of this embodiment).

As described above, in this embodiment, every x_1(j,p)·r_1(q) + ... + x_I(j,p)·r_I(q) is a bit string of at most 2·W bits. Consequently, the calculation of x_1(j,p)·r_1(q) + ... + x_I(j,p)·r_I(q) requires no carry processing beyond two words, which reduces the computation cost.

[Third Embodiment]
This embodiment is a modification of the first and second embodiments. It differs from them in that the x_i(j) produced by the bit division unit 20 are stored in the main memory 10 so as to be burst-transferable in the order x_1(1),...,x_I(1),x_1(2),...,x_I(2),...,x_1(J),...,x_I(J), and are read sequentially into the cache memory 30 in this order.

FIG. 9 is a conceptual diagram illustrating x_1(1),...,x_I(1),x_1(2),...,x_I(2),...,x_1(J),...,x_I(J) stored in the main memory 10. As illustrated in FIG. 9, in this embodiment each x_i(j) is stored in the main memory 10 in the order x_1(1),...,x_I(1),x_1(2),...,x_I(2),...,x_1(J),...,x_I(J). As a result, merely by specifying the start address of x_1(1) stored in the main memory 10, the x_i(j) can be burst-transferred sequentially to the cache memory 30 in this order. Step S4 of FIG. 4 needs the x_i(j) in exactly this order, so a configuration that stores each x_i(j) in the main memory 10 in this burst-transferable order is preferable for the present invention.
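The storage order of FIG. 9 can be illustrated by flattening the chunks with j as the outer index and i as the inner index. The sketch below is an assumption-laden illustration: it uses one chunk width `omega` for all j, a Python list in place of the main memory, and the hypothetical function name `interleave_chunks`.

```python
def interleave_chunks(xs, omega, J):
    """Return chunks in the order x_1(1),...,x_I(1),x_1(2),...,x_I(2),...,x_I(J)."""
    mask = (1 << omega) - 1
    # Outer loop over j, inner loop over i, so that all chunks needed for
    # one product-sum step are adjacent in memory.
    return [(x >> (j * omega)) & mask for j in range(J) for x in xs]
```

With this layout, chunk x_i(j) (both indices counted from 1) sits at list position (j-1)·I + (i-1), so a single sequential read delivers the data in the order step S4 consumes it.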

[Fourth Embodiment]
This embodiment is a modification of the first and second embodiments. It writes R = x_1·r_1 + ... + x_I·r_I = R(1) + ... + R(H) and applies the method of the present invention to each R(h) (h ∈ {1,...,H}, where H is an integer constant of 2 or more) (a feature of this embodiment). This embodiment reduces the number of accesses to the main memory 10 when I is large and all of r_1,...,r_I cannot be held in the cache memory 30 at the same time. That is, it is particularly effective when the storage capacity of the cache memory 30 is smaller than the sum of the data amount of any one x_{i(h,s(h))} and the data amount of all r_i, but larger than the sum of the data amount of one x_{i(h,s(h))}(j) obtained by dividing that x_{i(h,s(h))} and the data amount of the r_{i(h,1)},...,r_{i(h,S(h))} corresponding to that h. Here, S(h) is an integer of 2 or more, s(h) ∈ {1,...,S(h)}, i(h,s(h)) ∈ {i(h,1),...,i(h,S(h))} ⊂ {1,...,I}, {i(1,1),...,i(1,S(1)),...,i(H,1),...,i(H,S(H))} = {1,...,I}, and R(h) = x_{i(h,1)}·r_{i(h,1)} + ... + x_{i(h,S(h))}·r_{i(h,S(h))}.

The following description focuses on the differences from the first and second embodiments; matters common to them are not described again.
<Configuration>
FIG. 10 is a block diagram showing the configuration of the multiple length arithmetic device 100 of this embodiment. The arrows in FIG. 10 indicate the flow of data, but the flow of data input to and output from the arithmetic control unit 90 and part of the flow of data input to and output from the register file 50 are omitted.

As illustrated in FIG. 10, the multiple length arithmetic device 100 of this embodiment includes a main memory 10, a bit division unit 20, a cache memory 30, a register storage unit 40, a register file 50, a product-sum operation unit 60, a shift unit 70, a bit combination unit 80, an arithmetic control unit 90, and an addition unit 110. Parts of FIG. 10 that have the same processing functions as in FIG. 1 are given the same reference numerals as in FIG. 1.

<Multiple length arithmetic method>
Next, the multiple length arithmetic method of this embodiment will be described.
[Assumption]
The same as described in the first embodiment.
[Multiple length arithmetic of R = x_1·r_1 + ... + x_I·r_I]
FIG. 11 is a flowchart for explaining the multiple length arithmetic method of this embodiment. The description below follows this figure.
First, the arithmetic control unit 90 assigns 1 to h and stores h in the register 51 of the register file 50 (step S11). The arithmetic control unit 90 controls each process using h stored in the register 51.

Next, the bit division unit 20 reads x_{i(h,s(h))} from the main memory 10, divides x_{i(h,s(h))} into pieces x_{i(h,s(h))}(j) whose bit lengths are ω(h,j), and stores each x_{i(h,s(h))}(j) in the main memory 10 (step S12 / bit division process). Here, x_{i(h,s(h))} is the bit concatenation x_{i(h,s(h))} = x_{i(h,s(h))}(J(h))|...|x_{i(h,s(h))}(1) of x_{i(h,s(h))}(J(h)),...,x_{i(h,s(h))}(1), J(h) is a natural-number constant of 2 or more, and j ∈ {1,...,J(h)}.
Next, the arithmetic control unit 90 assigns 1 to j and 0 to α(h,1), and stores them in the register 51 of the register file 50 (step S13 / initialization process). The arithmetic control unit 90 controls each process using j and α(h,1) stored in the register 51. Also, under the control of the arithmetic control unit 90, the r_{i(h,1)},...,r_{i(h,S(h))} corresponding to h are read into the cache memory (step S14 / cache process).

Thereafter, while sequentially reading x_{i(h,s(h))}(j) into the cache memory 30, the product-sum operation unit 60 uses α(h,j), x_{i(h,s(h))}(j), and r_{i(h,s(h))} to calculate T(h,j) = α(h,j) + x_{i(h,1)}(j)·r_{i(h,1)} + ... + x_{i(h,S(h))}(j)·r_{i(h,S(h))}, and outputs T(h,j) (step S15 / product-sum operation process). After the product-sum operation process (step S15), the shift unit 70 sets the lower ω(h,j) bits of T(h,j) as R(h,j) and the remaining bits of T(h,j) as the new α(h,j+1), and outputs R(h,j) and α(h,j+1) (step S16 / shift process).

After the shift process (step S16), the arithmetic control unit 90 refers to j stored in the register 51 of the register file 50 and determines whether j = J(h) (step S17 / determination process). If j ≠ J(h), the arithmetic control unit 90 stores j+1 in the register 51 as the new j and returns the process to step S15 (step S18 / loop process). If j = J(h) is determined in the determination process (step S17), the bit combination unit 80 calculates the bit concatenation R(h) = α(h,J(h))|R(h,J(h))|...|R(h,1) of α(h,J(h)), R(h,J(h)), ..., R(h,1) and outputs R(h) (step S19 / bit combination process).

Next, the arithmetic control unit 90 refers to the register 51 and determines whether h = H. If h ≠ H, it stores h+1 in the register 51 as the new h (step S21) and returns the process to step S12. If h = H, the addition unit 110 calculates R = R(1) + ... + R(H) and outputs R (step S21 / addition process).
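The grouping of this embodiment can be sketched in Python as follows. This is a simplified illustration rather than the embodiment itself: it assumes the same chunk width `omega` and chunk count `J` for every group h, represents each group as a list of indices, and uses the hypothetical names `partial_result` and `grouped_result`.

```python
def partial_result(xs, rs, omega, J):
    """R(h): chunked product-sum over the x and r values of one group."""
    mask = (1 << omega) - 1
    alpha, R = 0, 0
    for j in range(J):
        shift = j * omega
        T = alpha + sum(((x >> shift) & mask) * r for x, r in zip(xs, rs))
        R |= (T & mask) << shift      # R(h,j)
        alpha = T >> omega            # alpha(h,j+1)
    return R | (alpha << (J * omega))

def grouped_result(xs, rs, groups, omega, J):
    """R = R(1) + ... + R(H), where `groups` partitions the indices of xs."""
    total = 0
    for idx in groups:                # one pass per h; only this group's r_i are live
        total += partial_result([xs[i] for i in idx],
                                [rs[i] for i in idx], omega, J)
    return total
```

Only the r values of the current group need to stay resident during a pass, which is the point of splitting the sum when all r_1,...,r_I together do not fit in the cache.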

[Fifth Embodiment]
This embodiment applies the third embodiment to the fourth embodiment. That is, in this embodiment, in the order h = 1 to h = H, for each h the x_{i(h,s(h))}(j) are stored in the main memory 10 so as to be burst-transferable in the order x_{i(h,1)}(1),...,x_{i(h,S(h))}(1),x_{i(h,1)}(2),...,x_{i(h,S(h))}(2),...,x_{i(h,1)}(J(h)),...,x_{i(h,S(h))}(J(h)), and are read sequentially into the cache memory 30 in this order (a feature of this embodiment).

FIGS. 12(a) and 12(b) are conceptual diagrams illustrating x_{i(h,1)}(1),...,x_{i(h,S(h))}(1),x_{i(h,1)}(2),...,x_{i(h,S(h))}(2),...,x_{i(h,1)}(J(h)),...,x_{i(h,S(h))}(J(h)) stored in the main memory 10.

As illustrated in FIG. 12(a), the x_{i(h,s(h))}(j) are stored in the main memory 10 so as to be burst-transferable in the order x_{i(h,1)}(1),...,x_{i(h,S(h))}(1),x_{i(h,1)}(2),...,x_{i(h,S(h))}(2),...,x_{i(h,1)}(J(h)),...,x_{i(h,S(h))}(J(h)). The area of the main memory 10 in which these values are stored is called block [h]. The start address of block [h] is written AS(h), and its end address is written AE(h).

As shown in FIG. 12(b), the blocks [h] are stored in the main memory 10 so as to be burst-transferable in the order block [1], block [2], ..., block [H]. As a result, the x_{i(h,s(h))}(j) are stored in the main memory 10 so as to be burst-transferable in the order x_{i(1,1)}(1),...,x_{i(1,S(1))}(1),x_{i(1,1)}(2),...,x_{i(1,S(1))}(2),...,x_{i(1,1)}(J(1)),...,x_{i(1,S(1))}(J(1)),...,x_{i(H,1)}(1),...,x_{i(H,S(H))}(1),x_{i(H,1)}(2),...,x_{i(H,S(H))}(2),...,x_{i(H,1)}(J(H)),...,x_{i(H,S(H))}(J(H)). Therefore, merely by specifying the start address AS(1) of x_{i(1,1)}(1), all of these values can be burst-transferred from the main memory 10 to the cache memory 30 in this order.
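The relationship between the blocks and their addresses can be sketched as below. This is a hypothetical illustration: it assumes that every chunk occupies the same number of bytes, that block [1] starts at offset 0, and that AE(h) denotes the offset of the last byte of block [h] (inclusive). None of these conventions is stated in the text, and the function name `block_ranges` is invented for the example.

```python
def block_ranges(S, J, chunk_bytes):
    """Return [(AS(h), AE(h))] for blocks [1..H] stored back to back."""
    ranges, start = [], 0
    for s_h, j_h in zip(S, J):
        size = s_h * j_h * chunk_bytes     # block [h] holds S(h) * J(h) chunks
        ranges.append((start, start + size - 1))
        start += size                      # next block begins right after
    return ranges
```

Because each block begins immediately after the previous one ends, a single transfer that starts at the first offset passes through every block in the order h = 1, ..., H.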

Step S15 of FIG. 11 needs the x_{i(h,s(h))}(j) in the order x_{i(1,1)}(1),...,x_{i(1,S(1))}(1),x_{i(1,1)}(2),...,x_{i(1,S(1))}(2),...,x_{i(1,1)}(J(1)),...,x_{i(1,S(1))}(J(1)),...,x_{i(H,1)}(1),...,x_{i(H,S(H))}(1),x_{i(H,1)}(2),...,x_{i(H,S(H))}(2),...,x_{i(H,1)}(J(H)),...,x_{i(H,S(H))}(J(H)), so a configuration that stores them in the main memory 10 in this burst-transferable order is preferable for the present invention.

[Experimental results]
As described above, each of the embodiments can improve the speed of the multiple length arithmetic R = x_1·r_1 + ... + x_I·r_I for x_i that, together with other operations and the temporary areas used during computation, exceed the storage capacity (cache size) of the cache memory 30, and for r_i that are smaller than the cache size.

FIG. 13 shows experimental results that demonstrate this effect. FIG. 13 compares the computation speed of two cases: the computation of 小田哲, 山本剛, 青木和麻呂, "High-speed data verification" (高速データ検証), SCIS, Jan. 23-26, 2007 ("Reference 1") performed on an Athlon 64 X2 at 2 GHz without applying the present invention (conventional method), and the same computation performed on an Athlon 64 X2 at 2 GHz with the first embodiment applied (first embodiment). As shown in FIG. 13, the computation speed of the conventional method is about 8 Gbps, whereas that of the first embodiment is about 12.5 Gbps.

[Modifications, etc.]
The present invention is not limited to the embodiments described above. For example, the configurations of the embodiments may be combined as appropriate. The various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually according to the processing capability of the device that executes them, or as needed. Needless to say, other changes can be made as appropriate without departing from the spirit of the present invention.

When the above configuration is implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer implements the above processing functions on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another execution form, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program sequentially each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized only through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the present device is configured by executing a predetermined program on a computer. However, at least part of these processing contents may be implemented in hardware.

One example of an industrial application field of the present invention is the data verification processing of Reference 1.

FIG. 1 is a block diagram showing the configuration of the multiple length arithmetic device of the first embodiment.
FIG. 2 is a block diagram illustrating the detailed configuration of the register file and the product-sum operation unit of FIG. 1.
FIG. 3 is a conceptual diagram for explaining the data configuration.
FIG. 4 is a flowchart for explaining the multiple length arithmetic method of the first embodiment.
FIG. 5 is a flowchart for explaining a specific example of the product-sum operation process (step S4) of FIG. 4.
FIG. 6 is a conceptual diagram for explaining the data configuration.
FIG. 7 is a flowchart for explaining the processing of step S4 in Example 1 of the second embodiment.
FIG. 8 is a flowchart for explaining the processing of step S4 in Example 2 of the second embodiment.
FIG. 9 is a conceptual diagram illustrating $x_1(1), \dots, x_I(1), x_1(2), \dots, x_I(2), \dots, x_1(J), \dots, x_I(J)$ stored in the main memory.
FIG. 10 is a block diagram showing the configuration of the multiple length arithmetic device of the fourth embodiment.
FIG. 11 is a flowchart for explaining the multiple length arithmetic method of the fourth embodiment.
FIG. 12 is a conceptual diagram illustrating $x_{i(h,1)}(1), \dots, x_{i(h,S(h))}(1), x_{i(h,1)}(2), \dots, x_{i(h,S(h))}(2), \dots, x_{i(h,1)}(J(h)), \dots, x_{i(h,S(h))}(J(h))$ stored in the main memory.
FIG. 13 is a diagram showing experimental results that demonstrate the effect of this embodiment.

Explanation of symbols

1, 100: multiple length arithmetic device

Claims

A multiple length arithmetic method for calculating $R = x_1 \cdot r_1 + \dots + x_I \cdot r_I$, where $I$ is a natural number of 2 or more, $i \in \{1, \dots, I\}$, $x_i$ is an integer of 0 or more, and $r_i$ is an integer, the method comprising:
a bit division process in which a bit dividing unit divides each $x_i$ into pieces $x_i(j)$ of bit length $\omega(j)$ [$j \in \{1, \dots, J\}$, $J$ is a natural number of 2 or more, and $x_i$ is the bit combination $x_i = x_i(J) \,|\, \dots \,|\, x_i(1)$ of $x_i(J), \dots, x_i(1)$];
an initialization process in which an arithmetic control unit substitutes 1 for $j$ and 0 for $\alpha(1)$;
a product-sum operation process in which a product-sum operation unit uses $\alpha(j)$ together with $x_i(j)$ and $r_i$ read into a cache memory to calculate $T(j) = \alpha(j) + x_1(j) \cdot r_1 + \dots + x_I(j) \cdot r_I$;
a shift process in which, after the product-sum operation process, a shift unit sets the lower $\omega(j)$ bits of $T(j)$ as $R_j$ and sets the bits of $T(j)$ other than $R_j$ as a new $\alpha(j+1)$;
a determination process in which, after the shift process, the arithmetic control unit determines whether or not $j = J$;
a loop process in which, when it is determined in the determination process that $j \neq J$, the arithmetic control unit sets $j + 1$ as a new $j$ and returns the processing to the product-sum operation process; and
a bit combination process in which, when it is determined in the determination process that $j = J$, a bit combination unit calculates the bit combination $R = \alpha(J) \,|\, R_J \,|\, \dots \,|\, R_1$ of $\alpha(J), R_J, \dots, R_1$ and outputs $R$.
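
The processes recited above can be sketched in a few lines of Python. This is only an illustrative sketch, not the claimed implementation: the function names `split_bits` and `multi_length_dot` and the list `widths` (holding $\omega(1), \dots, \omega(J)$) are assumptions introduced here, Python's built-in big integers stand in for word arrays, the $r_i$ are taken to be non-negative for simplicity, and cache behaviour is not modelled. The carry left over after the last shift is used as the top part of the result.

```python
def split_bits(x, widths):
    """Bit division: return [x(1), ..., x(J)], lowest piece first.

    Piece j has widths[j-1] bits. Hypothetical helper for illustration.
    """
    pieces = []
    for w in widths:
        pieces.append(x & ((1 << w) - 1))
        x >>= w
    if x:
        raise ValueError("widths do not cover all bits of x")
    return pieces


def multi_length_dot(xs, rs, widths):
    """Compute R = sum(x_i * r_i) one bit-slice x_i(j) at a time."""
    chunks = [split_bits(x, widths) for x in xs]  # chunks[i][j-1] = x_i(j)
    alpha = 0    # initialization: carry alpha(1) = 0
    result = 0   # accumulates R_J | ... | R_1
    offset = 0   # bit position where the next R_j is placed
    for j, w in enumerate(widths):
        # product-sum: T(j) = alpha(j) + sum_i x_i(j) * r_i
        t = alpha + sum(c[j] * r for c, r in zip(chunks, rs))
        # shift: low w bits become R_j, the rest becomes the next carry
        r_j = t & ((1 << w) - 1)
        alpha = t >> w
        result |= r_j << offset
        offset += w
    # bit combination: remaining carry on top of R_J | ... | R_1
    return (alpha << offset) + result
```

Because only one slice $x_i(j)$ of each $x_i$ is touched per iteration, the working set per iteration is the slices plus all $r_i$, which is the property the cache-related claims rely on.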

The multiple length arithmetic method according to claim 1,
wherein the storage capacity of the cache memory is
smaller than the sum of the data amount of any one $x_i$ and the data amount of all $r_i$, and larger than the sum of the data amount of one $x_i(j)$ obtained by dividing $x_i$ and the data amount of all $r_i$.
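
The capacity condition above is a simple pair of inequalities. The following check is a minimal sketch under stated assumptions: the function name `cache_condition_holds` and all byte counts are hypothetical, and every $x_i$ is assumed to have the same size.

```python
def cache_condition_holds(cache_bytes, x_bytes, chunk_bytes, r_bytes_list):
    """Check: size(x_i(j)) + sum size(r_i) < cache < size(x_i) + sum size(r_i).

    All arguments are byte counts; the names are illustrative only.
    """
    total_r = sum(r_bytes_list)
    return chunk_bytes + total_r < cache_bytes < x_bytes + total_r


# Hypothetical numbers: 32 KiB cache, 1 MiB per x_i, 4 KiB slices,
# and sixteen r_i of 1 KiB each.
KIB = 1024
ok = cache_condition_holds(32 * KIB, 1024 * KIB, 4 * KIB, [KIB] * 16)
```

With these made-up sizes a whole $x_i$ plus all $r_i$ would not fit in the cache, but one slice plus all $r_i$ does, so the condition holds.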

The multiple length arithmetic method according to claim 1 or 2,
wherein the product-sum operation process is a process in which the product-sum operation unit performs the operation in units of a predetermined bit length $W$, and
$\omega(j)$ is longer than the predetermined bit length $W$ and shorter than the bit length of $x_i$.

The multiple length arithmetic method according to claim 1,
wherein $\omega(j)$ is constant for all $j$.

The multiple length arithmetic method according to claim 1,
wherein the product-sum operation process is a process in which
the product-sum operation unit uses $x_i(j,p)$ obtained by dividing $x_i(j)$ into pieces of bit length $W - a$ ($W > a \geq 0$) [$p \in \{1, \dots, P\}$, $P$ is a natural number greater than 1, and $x_i(j)$ is the bit combination $x_i(j) = x_i(j,P) \,|\, \dots \,|\, x_i(j,1)$ of $x_i(j,P), \dots, x_i(j,1)$] and $r_i(q)$ obtained by dividing $r_i$ into pieces of bit length $W - b$ ($W > b \geq 0$, $I \leq 2^{a+b}$) [$q \in \{1, \dots, Q\}$, $Q$ is a natural number of 1 or more, and $r_i$ is the bit combination $r_i = r_i(Q) \,|\, \dots \,|\, r_i(1)$ of $r_i(Q), \dots, r_i(1)$] to calculate $x_1(j,p) \cdot r_1(q) + \dots + x_I(j,p) \cdot r_I(q)$ for one or more pairs of $p$ and $q$, and calculates $T(j) = \alpha(j) + x_1(j) \cdot r_1 + \dots + x_I(j) \cdot r_I$ using the calculation results.
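
The condition $I \leq 2^{a+b}$ can be read as follows: each product $x_i(j,p) \cdot r_i(q)$ is below $2^{2W-a-b}$, so the sum of $I$ such products stays below $2^{2W}$ and fits in a double-word accumulator. The sketch below illustrates this reading only; the names `limbs` and `chunk_dot`, the choice $W = 32$, and the restriction to non-negative $r_i$ are assumptions made here, and the carry term $\alpha(j)$ is left out.

```python
W = 32  # assumed word length for this illustration


def limbs(v, width, count):
    """Split non-negative v into `count` pieces of `width` bits, lowest first."""
    mask = (1 << width) - 1
    return [(v >> (k * width)) & mask for k in range(count)]


def chunk_dot(x_chunks, rs, a, b, P, Q):
    """Return sum_i x_i(j) * r_i using (W-a)-bit and (W-b)-bit pieces."""
    I = len(x_chunks)
    assert I <= 1 << (a + b), "requires I <= 2^(a+b)"
    xw, rw = W - a, W - b
    xl = [limbs(x, xw, P) for x in x_chunks]  # xl[i][p-1] = x_i(j, p)
    rl = [limbs(r, rw, Q) for r in rs]        # rl[i][q-1] = r_i(q)
    total = 0
    for p in range(P):
        for q in range(Q):
            # Inner product-sum over i for one (p, q) pair.
            acc = sum(xl[i][p] * rl[i][q] for i in range(I))
            assert acc < 1 << (2 * W)  # fits in a double word
            total += acc << (p * xw + q * rw)
    return total
```

In a word-oriented implementation the inner sum over $i$ could then be accumulated without per-term carry handling; this sketch only checks the bound with an assertion.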

The multiple length arithmetic method according to claim 1,
wherein the pieces $x_i(j)$ divided by the bit dividing unit are
stored in memory so that burst transfer is possible in the order $x_1(1), \dots, x_I(1), x_1(2), \dots, x_I(2), \dots, x_1(J), \dots, x_I(J)$, and are read sequentially into the cache memory in that same order.
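
The storage order described above can be illustrated with a small reordering helper. This is a sketch only: the name `interleave` and the nested-list input are assumptions, and Python lists merely show the ordering, not actual burst transfers.

```python
def interleave(chunks):
    """Reorder chunks[i][j] into x_1(1), ..., x_I(1), x_1(2), ..., x_I(J).

    chunks[i] is the list [x_i(1), ..., x_i(J)] for one x_i.
    """
    J = len(chunks[0])
    return [c[j] for j in range(J) for c in chunks]


# Two values (I = 2) split into three pieces each (J = 3).
flat = interleave([[1, 2, 3], [4, 5, 6]])
```

With this layout, the $I$ pieces needed for iteration $j$ sit next to each other, so a sequential scan of the buffer visits them in the order the product-sum loop consumes them.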

A multiple length arithmetic method for calculating $R = x_1 \cdot r_1 + \dots + x_I \cdot r_I$, where $I$ is a natural number of 2 or more, $i \in \{1, \dots, I\}$, $x_i$ is an integer of 0 or more, and $r_i$ is an integer,
wherein $H$ is an integer of 2 or more, $h \in \{1, \dots, H\}$, $S(h)$ is an integer of 2 or more, $s(h) \in \{1, \dots, S(h)\}$, $i(h, s(h)) \in \{i(h,1), \dots, i(h,S(h))\} \subset \{1, \dots, I\}$, $\{i(1,1), \dots, i(1,S(1)), \dots, i(H,1), \dots, i(H,S(H))\} = \{1, \dots, I\}$, $R(h) = x_{i(h,1)} \cdot r_{i(h,1)} + \dots + x_{i(h,S(h))} \cdot r_{i(h,S(h))}$, $J(h)$ is a natural number of 2 or more, and $j \in \{1, \dots, J(h)\}$, the method comprising:
(a) a bit division process in which a bit dividing unit divides $x_{i(h,s(h))}$ into pieces $x_{i(h,s(h))}(j)$ of bit length $\omega(h,j)$ [$x_{i(h,s(h))}$ is the bit combination $x_{i(h,s(h))} = x_{i(h,s(h))}(J(h)) \,|\, \dots \,|\, x_{i(h,s(h))}(1)$ of $x_{i(h,s(h))}(J(h)), \dots, x_{i(h,s(h))}(1)$];
(b) an initialization process in which an arithmetic control unit substitutes 1 for $j$ and 0 for $\alpha(h,1)$;
(c) a cache process of reading $r_{i(h,1)}, \dots, r_{i(h,S(h))}$ corresponding to $h$ into a cache memory;
(d) a product-sum operation process in which a product-sum operation unit uses $\alpha(h,j)$ together with $x_{i(h,s(h))}(j)$ and $r_{i(h,s(h))}$ read into the cache memory to calculate $T(h,j) = \alpha(h,j) + x_{i(h,1)}(j) \cdot r_{i(h,1)} + \dots + x_{i(h,S(h))}(j) \cdot r_{i(h,S(h))}$;
(e) a shift process in which, after the product-sum operation process, a shift unit sets the lower $\omega(h,j)$ bits of $T(h,j)$ as $R(h,j)$ and sets the bits of $T(h,j)$ other than $R(h,j)$ as a new $\alpha(h,j+1)$;
(f) a determination process in which, after the shift process, the arithmetic control unit determines whether or not $j = J(h)$;
(g) a loop process in which, when it is determined in the determination process that $j \neq J(h)$, the arithmetic control unit sets $j + 1$ as a new $j$ and returns the processing to the product-sum operation process; and
(h) a bit combination process in which, when it is determined in the determination process that $j = J(h)$, a bit combination unit calculates the bit combination $R(h) = \alpha(h,J(h)) \,|\, R(h,J(h)) \,|\, \dots \,|\, R(h,1)$ of $\alpha(h,J(h)), R(h,J(h)), \dots, R(h,1)$;
wherein processes (a) to (h) are executed sequentially for each $h$, and
the method further comprises an addition process in which an adding unit calculates $R = R(1) + \dots + R(H)$ and outputs $R$.
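
The grouped variant above can be sketched as an outer loop over index groups, each handled by the slice-wise procedure, followed by a final addition. This is an illustration under simplifying assumptions, not the claimed implementation: the names `group_dot` and `grouped_multi_length_dot` and the `groups` list of 0-based index lists are hypothetical, a single slice width and slice count are used for every group (the claim allows $\omega(h,j)$ and $J(h)$ to vary), the $r_i$ are non-negative, and the cache load of step (c) is represented only by selecting the group's $r_i$.

```python
def group_dot(xs, rs, width, J):
    """Steps (b) to (h) for one group: returns R(h) = sum(x * r)."""
    mask = (1 << width) - 1
    alpha, out = 0, 0
    for j in range(J):
        # (d) product-sum over the j-th slice of every x in the group
        t = alpha + sum(((x >> (j * width)) & mask) * r
                        for x, r in zip(xs, rs))
        # (e) shift: low bits are R(h, j), the rest is the next carry
        out |= (t & mask) << (j * width)
        alpha = t >> width
    # (h) bit combination with the carry left after the last shift
    return (alpha << (J * width)) + out


def grouped_multi_length_dot(xs, rs, groups, width, J):
    """Compute R = R(1) + ... + R(H), one index group at a time."""
    total = 0
    for idx in groups:
        group_x = [xs[i] for i in idx]
        group_r = [rs[i] for i in idx]  # stands in for cache step (c)
        total += group_dot(group_x, group_r, width, J)
    return total
```

Splitting the index set this way means only the $r_i$ of the current group need to stay resident while that group is processed.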

The multiple length arithmetic method according to claim 7,
wherein the storage capacity of the cache memory is
smaller than the sum of the data amount of any one $x_{i(h,s(h))}$ and the data amount of all $r_i$, and larger than the sum of the data amount of one $x_{i(h,s(h))}(j)$ obtained by dividing $x_{i(h,s(h))}$ and the data amount of $r_{i(h,1)}, \dots, r_{i(h,S(h))}$ corresponding to $h$.

A multiple length arithmetic device that calculates $R = x_1 \cdot r_1 + \dots + x_I \cdot r_I$, where $I$ is a natural number of 2 or more, $i \in \{1, \dots, I\}$, $x_i$ is an integer of 0 or more, and $r_i$ is an integer, the device comprising:
a bit dividing unit that divides each $x_i$ into pieces $x_i(j)$ of bit length $\omega(j)$ [$j \in \{1, \dots, J\}$, $J$ is a natural number of 2 or more, and $x_i$ is the bit combination $x_i = x_i(J) \,|\, \dots \,|\, x_i(1)$ of $x_i(J), \dots, x_i(1)$];
a product-sum operation unit that uses $\alpha(j)$ together with $x_i(j)$ and $r_i$ read into a cache memory to calculate $T(j) = \alpha(j) + x_1(j) \cdot r_1 + \dots + x_I(j) \cdot r_I$;
a shift unit that sets the lower $\omega(j)$ bits of $T(j)$ as $R_j$ and sets the bits of $T(j)$ other than $R_j$ as a new $\alpha(j+1)$;
an arithmetic control unit that, with $j = 1$ and $\alpha(1) = 0$ as initial values, causes the processing of the product-sum operation unit and the shift unit to be executed while increasing $j$ by 1 until $j = J$; and
a bit combination unit that calculates the bit combination $R = \alpha(J) \,|\, R_J \,|\, \dots \,|\, R_1$ of $\alpha(J), R_J, \dots, R_1$ and outputs $R$.

A program for causing a computer to execute each process of the multiple length arithmetic method according to any one of claims 1 to 8.