JP2008090769A

JP2008090769A - Parallel computing method, computing unit, and program for computing unit

Info

Publication number: JP2008090769A
Application number: JP2006273557A
Authority: JP
Inventors: Kazumaro Aoki (青木 和麻呂); Takeshi Yamamoto (山本 剛); Takeshi Shimoyama (下山 武司)
Original assignee: Fujitsu Ltd; Nippon Telegraph and Telephone Corp
Current assignee: Fujitsu Ltd; Nippon Telegraph and Telephone Corp
Priority date: 2006-10-05
Filing date: 2006-10-05
Publication date: 2008-04-17
Anticipated expiration: 2026-10-05
Also published as: JP4976800B2

Abstract

PROBLEM TO BE SOLVED: To reduce the number of network communication paths between computing units without increasing the number of communications.

SOLUTION: A parallel computing method uses a ring network to calculate the sum of the K-dimensional vectors that N computing units respectively record, and lets the N computing units share the result. More specifically, the K components are divided into N groups, and each computing unit delivers the components of one group to an adjacent computing unit. The receiving computing unit adds the received components to its own components of the same group and delivers the result to the next computing unit. This operation is repeated (N-1) times. From the N-th time on, the receiving computing unit records the received result as the sum of the components of that group and delivers it to the next computing unit. This operation is repeated up to the (2N-2)-th time.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a parallel computing method, an arithmetic device, and a program for an arithmetic device that obtain a vector sum or a matrix product using a plurality of arithmetic devices.

Operations that obtain a sum of vectors or a product of matrices appear in many kinds of information processing. In security, for example, the difficulty of prime factorization is used in several public-key cryptosystems and digital signature schemes and underlies the difficulty of breaking them, so determining how many resources a factorization requires is important for evaluating the security of a cryptosystem. Such evaluations are carried out using a plurality of arithmetic devices, and the processing includes a step of obtaining a matrix product. The step of obtaining a matrix product in turn includes a step of obtaining a sum of vectors.

A representative factorization method is the number field sieve. The matrix handled in the linear algebra part of the number field sieve is not only huge but also sparse, with an extremely small proportion of nonzero components, so ordinary methods for solving linear equations, such as Gaussian elimination, are clearly inefficient. The Block Lanczos method is known as an efficient algorithm for such matrices. The Block Lanczos method as used in the number field sieve is described concretely in Non-Patent Document 1, and its processing includes an operation that repeatedly multiplies by a sparse matrix and by the transpose of that matrix.

When N arithmetic devices (N is an integer of 2 or more) each record a K-dimensional vector (K is an integer of 2 or more), the simplest way to obtain the sum of those vectors is for all the other arithmetic devices to send their vector information to one arithmetic device (the parent arithmetic device), which then computes the sum. When all the arithmetic devices need to share the result, the parent arithmetic device sends the result to all the others. In this method, (N-1) arithmetic devices each send and receive K vector elements, so 2K(N-1) communications are required. A communication path is needed between the parent arithmetic device and every other arithmetic device, and at any given time information flows in only one direction on each path, so a network with (N-1) half-duplex communication paths (star topology) must be built.
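The count of 2K(N-1) communications for the simple method can be checked with a small simulation. The following is a minimal sketch, not taken from the patent: the function name `star_allreduce` and the choice of device 0 as the parent are assumptions made for illustration, and one "communication" is taken to mean the transfer of one vector element.

```python
def star_allreduce(vectors):
    """Sum N K-dimensional vectors at a parent device, then send the sum back.

    Returns (per-device results, number of element transfers).
    Device 0 plays the parent; this choice is an assumption of the sketch.
    """
    n, k = len(vectors), len(vectors[0])
    transfers = 0
    total = list(vectors[0])
    for child in vectors[1:]:      # collection: each child sends K elements
        for j in range(k):
            total[j] += child[j]
        transfers += k
    results = [list(total)]        # the parent already holds the sum
    for _ in range(n - 1):         # distribution: parent sends K elements to each child
        results.append(list(total))
        transfers += k
    return results, transfers


if __name__ == "__main__":
    demo = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
    print(star_allreduce(demo))
```

With N = 4 and K = 3 the sketch counts 2 x 3 x 3 = 18 element transfers, all of which pass through the parent.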

Non-Patent Document 1 describes an improvement on the above method of obtaining the sum of vectors. FIG. 1 shows a system configuration (for N = 4) that implements the method of Non-Patent Document 1. FIG. 2 shows how information is collected, and FIG. 3 shows how information is distributed. In this method, the arithmetic devices 9000, 9100, 9200, and 9300 include calculation units 9010, 9110, 9210, and 9310, recording units 9020, 9120, 9220, and 9320, and communication units 9030, 9130, 9230, and 9330, respectively, and each of the arithmetic devices 9000, 9100, 9200, and 9300 has a communication path to every other arithmetic device.

When N arithmetic devices (N is an integer of 2 or more) each record a K-dimensional vector (K is an integer of 2 or more), the K elements are divided into N groups in advance. It is efficient to make the number of elements in each group either [K/N] or {K/N}. In this specification, [x] denotes the largest integer not exceeding the real number x, and {x} denotes the smallest integer not less than x. When K is a multiple of N, every group has the same number of elements, K/N. When K < N, some groups have 0 elements. Each arithmetic device other than the n-th arithmetic device (n is an integer from 1 to N) sends the elements of its vector that belong to the n-th group to the n-th arithmetic device. FIG. 2 shows this collection of information for N = 4. The n-th arithmetic device computes the sum for each element belonging to the n-th group and sends the sum of each element to the other arithmetic devices. FIG. 3 shows this distribution of information for N = 4. Collecting and distributing information in this way requires at most 2{K/N}(N-1) communications, far fewer than the simple method. However, a communication path is needed between every pair of arithmetic devices, and information is sent in both directions on each path. That is, a network with N(N-1)/2 full-duplex communication paths (a complete graph over all the arithmetic devices) is required.
Non-Patent Document 1: Takeshi Shimoyama, Kazumaro Aoki, Hiroki Ueda, Yuji Kida, "General Number Field Sieve Implementation Experiments (4): Linear Algebra", IEICE Technical Report, ISEC2003-154, 2004.
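The group sizes [K/N] and {K/N} mentioned above follow directly from integer division. The sketch below is illustrative only: the function name `group_sizes` is hypothetical, and giving the larger size to the lowest-numbered groups is an assumption, since the text only requires that every size be [K/N] or {K/N}.

```python
def group_sizes(k, n):
    """Split K elements into N groups whose sizes are [K/N] or {K/N}.

    [K/N] is the floor and {K/N} the ceiling, following the notation of
    the specification. Which groups receive the extra element is an
    assumption of this sketch.
    """
    floor_size, remainder = divmod(k, n)
    # The first `remainder` groups take one extra element (ceiling size).
    return [floor_size + 1 if g < remainder else floor_size for g in range(n)]


if __name__ == "__main__":
    print(group_sizes(8, 4))   # K is a multiple of N
    print(group_sizes(5, 4))   # K > N, not a multiple
    print(group_sizes(2, 4))   # K < N: some groups are empty
```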

The conventional techniques have the problem that, when the sum of vectors recorded in a plurality of arithmetic devices is obtained and the result is shared among those devices, either the number of communications or the number of communication paths becomes large. An object of the present invention is to reduce both the number of communications and the number of communication paths in the network between arithmetic devices, in parallel operations that obtain a sum of vectors or a product of matrices using a plurality of arithmetic devices.

The parallel computing method of the present invention obtains the sum of N K-dimensional vectors (N is an integer of 2 or more, K is an integer of 2 or more) by the following steps, using a plurality of arithmetic devices each having a calculation unit, a recording unit, and a communication unit.

In the vector recording step, each arithmetic device records one or more of the K-dimensional vectors c_n to be summed (n is an integer from 1 to N) in its recording unit.
In the initial setting step, each arithmetic device records τ = 0 in its recording unit.
In the component transmission step, each arithmetic device that has the vector c_p in its recording unit, where p = ((n - 1 - τ) mod N) + 1, (1) takes the n-th component of c_p out of its recording unit, and (2) if it is not the same device as the one that has the vector c_q in its recording unit (q = p - 1, or q = N when p = 1), uses its communication unit to send the n-th component of c_p to the device that has c_q.

In the component receiving step, each arithmetic device that has the vector c_q in its recording unit, if it is not the same device as the one that has the vector c_p in its recording unit, uses its communication unit to receive the n-th component of c_p from the device that has c_p.

In the calculation result recording step, each arithmetic device that has the vector c_q in its recording unit (1) when τ ≤ N - 2, takes the n-th component of c_q out of its recording unit, computes in its calculation unit the sum of that component and the n-th component of c_p, and records the result in its recording unit as the n-th component of c_q, and (2) when τ > N - 2, records the n-th component of c_p in its recording unit as the n-th component of c_q.
In the τ increasing step, each arithmetic device assigns τ + 1 to τ and records it in its recording unit.
In the repetition step, each arithmetic device returns to the component transmission step if τ is 2N - 3 or less.
The vector c_x (x is any integer from 1 to N) recorded in the recording unit of each arithmetic device is then the sum of the vectors.
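The steps above can be followed literally in a short simulation. The sketch below makes several assumptions that are not in the specification: each of the N devices holds exactly one vector, K equals N so that each group is a single component, and the function name `ring_vector_sum` is hypothetical. Messages for one value of τ are gathered before any are applied, to mimic the devices acting in parallel.

```python
def ring_vector_sum(vectors):
    """Simulate the described steps for N devices, each holding one N-dimensional vector.

    vectors[p - 1] is c_p. Returns (final vectors, number of rounds),
    where one round is one pass through the transmission step.
    """
    n_dev = len(vectors)
    # c[p][n - 1] is the n-th component of c_p (vector recording step).
    c = {p: list(v) for p, v in enumerate(vectors, start=1)}
    tau = 0                                      # initial setting step
    while True:
        messages = []
        for n in range(1, n_dev + 1):            # component transmission step
            p = (n - 1 - tau) % n_dev + 1
            q = p - 1 if p > 1 else n_dev
            messages.append((q, n, c[p][n - 1]))
        for q, n, value in messages:             # receiving and result recording steps
            if tau <= n_dev - 2:
                c[q][n - 1] += value             # first N-1 rounds: accumulate
            else:
                c[q][n - 1] = value              # later rounds: overwrite with the total
        tau += 1                                 # tau increasing step
        if tau > 2 * n_dev - 3:                  # repetition step
            break
    return [c[p] for p in range(1, n_dev + 1)], tau


if __name__ == "__main__":
    demo = [[1, 2, 3, 4], [10, 20, 30, 40],
            [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
    final, rounds = ring_vector_sum(demo)
    print(final, rounds)
```

Every device ends with the same vector after 2N - 2 rounds, and each device only ever sends to its ring predecessor, which is why N one-way links suffice.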

When the K-dimensional vectors are recorded one per arithmetic device, the device that has c_q in its recording unit is always different from the device that has c_p in its recording unit, so the processing becomes slightly simpler.

The parallel computing method of the present invention can also be applied to obtaining the product AB of an M-row, N-column matrix A (M is an integer of 2 or more, N is an integer of 2 or more, and MN ≥ I) and an N-row, K-column matrix B (K is an integer of 2 or more), using I arithmetic devices (I is an integer of 2 or more) each having a calculation unit, a recording unit, and a communication unit. In this case, the MN components a_{mn} of the matrix A (m is an integer from 1 to M, n is an integer from 1 to N) are divided in advance into I groups (the first group to the I-th group). Each arithmetic device first performs a component recording step in which it records in its recording unit all the components a_{mn} of the i-th group (i is the number of the arithmetic device, an integer from 1 to I) and the components b_{n1} to b_{nK} of the matrix B by which those components are multiplied. In the vector recording step, each arithmetic device (1) calculates, in its calculation unit, a_{mn} b_{n1} to a_{mn} b_{nK} for each component a_{mn} recorded in its recording unit, and (2) records the result in its recording unit as the K-dimensional vector c_{mn} = (c_{mn1}, c_{mn2}, …, c_{mnK}) = (a_{mn} b_{n1}, a_{mn} b_{n2}, …, a_{mn} b_{nK}). These MN K-dimensional vectors c_{mn} correspond to the K-dimensional vectors in the method of obtaining the vector sum. The subsequent steps perform, in parallel, M sets of sums of N vectors each.
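The reduction of a matrix product to vector sums can be illustrated on a single machine. The sketch below is an assumption-laden illustration rather than the patented procedure: it ignores how the components a_{mn} are distributed over the I devices and only shows that row m of AB is the component-wise sum of the N vectors c_{m1}, …, c_{mN}. The function name `matrix_product_via_vectors` is hypothetical.

```python
def matrix_product_via_vectors(a, b):
    """Compute AB by forming c_mn = a_mn * (row n of B) and summing over n for each m."""
    m_rows, n_cols, k_cols = len(a), len(b), len(b[0])
    product = []
    for m in range(m_rows):
        # The N vectors c_m1 ... c_mN; in the described method each one
        # may be held by a different arithmetic device.
        c_m = [[a[m][n] * b[n][k] for k in range(k_cols)] for n in range(n_cols)]
        # Row m of AB is the component-wise sum of these N vectors.
        product.append([sum(c[k] for c in c_m) for k in range(k_cols)])
    return product


if __name__ == "__main__":
    print(matrix_product_via_vectors([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

There are M such sums, one per row of A, which is the "M sets of N vectors" referred to in the text.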

Furthermore, the parallel computing method of the present invention can also be applied to obtaining the product A^T A B of the transpose A^T of the matrix A, an M-row, N-column matrix A (M is an integer of 2 or more, N is an integer of 2 or more, and MN ≥ I), and an N-row, K-column matrix B (K is an integer of 2 or more), using I arithmetic devices (I is an integer of 2 or more) each having a calculation unit, a recording unit, and a communication unit. The procedure is the same as the method described above until the product AB is obtained. When multiplying by the transpose A^T, the arithmetic device that handles the component in row n, column m of A^T is made the same as the arithmetic device that handled the component in row m, column n of A in the computation of the product AB.

According to the present invention, when N arithmetic devices obtain the sum of one set of N K-dimensional vectors, the collection of information and the distribution of the result can be achieved with at most 2{K/N}(N-1) communications. In addition, because data is sent around a ring in one direction only, a network with N one-way communication paths (ring topology) is sufficient. Compared with the simple method, which needs 2K(N-1) communications over (N-1) half-duplex communication paths, the number of communications is greatly reduced. Compared with the method of Non-Patent Document 1, which needs at most 2{K/N}(N-1) communications over N(N-1)/2 full-duplex communication paths, the number of communication paths is greatly reduced.
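The three sets of figures quoted above can be tabulated for concrete N and K. The sketch below simply evaluates the formulas given in the text; the function name `communication_costs` and the dictionary layout are assumptions made for illustration.

```python
def communication_costs(k, n):
    """Evaluate the communication counts and path counts stated in the text."""
    ceil_size = -(-k // n)   # {K/N}: ceiling of K/N using integer arithmetic
    return {
        "simple (star)": {
            "communications": 2 * k * (n - 1),
            "paths": n - 1,
            "path type": "half-duplex",
        },
        "Non-Patent Document 1 (complete graph)": {
            "communications": 2 * ceil_size * (n - 1),
            "paths": n * (n - 1) // 2,
            "path type": "full-duplex",
        },
        "present invention (ring)": {
            "communications": 2 * ceil_size * (n - 1),
            "paths": n,
            "path type": "one-way",
        },
    }


if __name__ == "__main__":
    for name, cost in communication_costs(1000, 10).items():
        print(name, cost)
```

For K = 1000 and N = 10, the ring keeps the 1800 communications of the complete-graph method while needing 10 one-way paths instead of 45 full-duplex ones.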

Similarly, when I arithmetic devices (I is an integer of 2 or more) are used to obtain the product AB of an M-row, N-column matrix A (M is an integer of 2 or more, N is an integer of 2 or more, and MN ≥ I) and an N-row, K-column matrix B (K is an integer of 2 or more), the number of communications and the number of communication paths can be greatly reduced in obtaining the M sets of sums of N vectors.

Further, when the matrix product A^T A B is obtained, the multiplication by the transpose A^T needs the components of row m of the matrix AB (the row-m vector), and this is information already recorded by the arithmetic devices that handled the row-m components of the matrix A in the computation of the product AB. According to the present invention, when multiplying by A^T, the arithmetic device that handles the component in row n, column m of A^T is made the same as the arithmetic device that handled the component in row m, column n of A in the computation of AB, so no communication to exchange information is needed before the multiplication by A^T. The number of communications and the number of communication paths can therefore be greatly reduced.
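The locality argument can be made concrete with a single-machine sketch. It is an illustration under stated assumptions, not the patented procedure: `transpose_times_product` is a hypothetical name, the distribution over devices is not modelled, and the final accumulation into `result` stands in for what would again be a set of vector sums across devices.

```python
def transpose_times_product(a, b):
    """Compute A^T (AB), reusing each a_mn together with row m of AB."""
    m_rows, n_cols, k_cols = len(a), len(a[0]), len(b[0])
    # Row m of AB is what the devices handling row m of A hold after the vector sum.
    ab = [[sum(a[m][n] * b[n][k] for n in range(n_cols)) for k in range(k_cols)]
          for m in range(m_rows)]
    result = [[0] * k_cols for _ in range(n_cols)]
    for m in range(m_rows):
        for n in range(n_cols):
            # The device that held a_mn treats it as entry (n, m) of A^T and
            # multiplies it by row m of AB, which it already has locally.
            for k in range(k_cols):
                result[n][k] += a[m][n] * ab[m][k]
    return result


if __name__ == "__main__":
    print(transpose_times_product([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```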

In the following, to avoid duplicated description, components having the same function and processing steps performing the same processing are given the same reference numbers, and their description is not repeated.
[First Embodiment]
FIG. 4 shows a configuration example of a parallel computing system that obtains the sum of K-dimensional vectors (K is an integer of 2 or more) using N arithmetic devices (N = 4). The system consists of the arithmetic devices 1000, 1100, 1200, and 1300 and communication paths connecting adjacent arithmetic devices. The arithmetic devices 1000, 1100, 1200, and 1300 include calculation units 1010, 1110, 1210, and 1310, recording units 1020, 1120, 1220, and 1320, and communication units 1030, 1130, 1230, and 1330, respectively. FIG. 5 shows a functional configuration example of the arithmetic device 1000. The calculation unit 1010 includes τ increasing means 1001, repeating means 1002, and calculation result recording means 1012. The τ increasing means 1001 and the repeating means 1002 may instead be provided in a component other than the calculation unit 1010 (for example, a component such as a control unit, not shown). The recording unit 1020 has vector recording means 1022 and τ recording means 1023. The communication unit 1030 has transmitting means 1031 and receiving means 1032. The other arithmetic devices 1100, 1200, and 1300 have the same functional configuration.

Principle 1
FIG. 6 shows the principle of a method of obtaining the sum of four 4-dimensional vectors using the parallel computing system of FIG. 4. The circled number x in the figure indicates the order of processing and is referred to below as the x-th process. Processes with the same number are performed simultaneously (in parallel). In advance, the arithmetic device 1000 records the vector c_1, the arithmetic device 1100 records the vector c_2, the arithmetic device 1200 records the vector c_3, and the arithmetic device 1300 records the vector c_4.

In the first process, the following are performed in parallel. The arithmetic device 1000 sends the first component c_{11} of the vector c_1 to the arithmetic device 1300 and receives the second component c_{22} of the vector c_2 from the arithmetic device 1100. The arithmetic device 1100 sends the second component c_{22} of the vector c_2 to the arithmetic device 1000 and receives the third component c_{33} of the vector c_3 from the arithmetic device 1200. The arithmetic device 1200 sends the third component c_{33} of the vector c_3 to the arithmetic device 1100 and receives the fourth component c_{44} of the vector c_4 from the arithmetic device 1300. The arithmetic device 1300 sends the fourth component c_{44} of the vector c_4 to the arithmetic device 1200 and receives the first component c_{11} of the vector c_1 from the arithmetic device 1000.

In the second process, the following are performed in parallel. The arithmetic device 1000 records the sum of the second component c_{12} of the vector c_1 and the received second component c_{22} of the vector c_2 as the second component c_{12} of the vector c_1. The arithmetic device 1100 records the sum of the third component c_{23} of the vector c_2 and the received third component c_{33} of the vector c_3 as the third component c_{23} of the vector c_2. The arithmetic device 1200 records the sum of the fourth component c_{34} of the vector c_3 and the received fourth component c_{44} of the vector c_4 as the fourth component c_{34} of the vector c_3. The arithmetic device 1300 records the sum of the first component c_{41} of the vector c_4 and the received first component c_{11} of the vector c_1 as the first component c_{41} of the vector c_4.

Through the first and second processes, the sum c_{41} + c_{11} of the first components of the pre-operation vectors c_4 and c_1 is recorded in the arithmetic device 1300 as the first component of the vector c_4. The third to sixth processes repeat the first and second processes. When the third and fourth processes are finished, the sum c_{31} + c_{41} + c_{11} of the first components of the pre-operation vectors c_3, c_4, and c_1 is recorded in the arithmetic device 1200 as the first component of the vector c_3. When the fifth and sixth processes are finished, the sum c_{21} + c_{31} + c_{41} + c_{11} of the first components of the pre-operation vectors c_2, c_3, c_4, and c_1 is recorded in the arithmetic device 1100 as the first component of the vector c_2. In other words, by the end of the sixth process the total of every component has been obtained. However, the total of the first components is recorded in the arithmetic device 1100, the total of the second components in the arithmetic device 1200, the total of the third components in the arithmetic device 1300, and the total of the fourth components in the arithmetic device 1000.

In the seventh process, the following are performed in parallel. The arithmetic device 1000 sends the fourth component c_{14} of the vector c_1 (the total of the fourth components) to the arithmetic device 1300 and receives the first component c_{21} of the vector c_2 (the total of the first components) from the arithmetic device 1100. The arithmetic device 1100 sends the first component c_{21} of the vector c_2 (the total of the first components) to the arithmetic device 1000 and receives the second component c_{32} of the vector c_3 (the total of the second components) from the arithmetic device 1200. The arithmetic device 1200 sends the second component c_{32} of the vector c_3 (the total of the second components) to the arithmetic device 1100 and receives the third component c_{43} of the vector c_4 (the total of the third components) from the arithmetic device 1300. The arithmetic device 1300 sends the third component c_{43} of the vector c_4 (the total of the third components) to the arithmetic device 1200 and receives the fourth component c_{14} of the vector c_1 (the total of the fourth components) from the arithmetic device 1000.

In the eighth process, the following are performed in parallel. The arithmetic device 1000 records the received first component c_{21} of the vector c_2 (the total of the first components) as the first component c_{11} of the vector c_1. The arithmetic device 1100 records the received second component c_{32} of the vector c_3 (the total of the second components) as the second component c_{22} of the vector c_2. The arithmetic device 1200 records the received third component c_{43} of the vector c_4 (the total of the third components) as the third component c_{33} of the vector c_3. The arithmetic device 1300 records the received fourth component c_{14} of the vector c_1 (the total of the fourth components) as the fourth component c_{44} of the vector c_4.

Through the seventh and eighth processes, the total of the first components is distributed from the arithmetic device 1100 to the arithmetic device 1000, the total of the second components from the arithmetic device 1200 to the arithmetic device 1100, the total of the third components from the arithmetic device 1300 to the arithmetic device 1200, and the total of the fourth components from the arithmetic device 1000 to the arithmetic device 1300. The ninth to twelfth processes repeat the seventh and eighth processes. Through this repetition, the total of each component is distributed to all the arithmetic devices.
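The hand-offs described for Principle 1 can be listed mechanically for any one component. The sketch below is an illustration for N = 4 only: the function name `trace_component` and the "add"/"copy" labels are assumptions, and the sender and receiver indices are computed with the relations p = ((n - 1 - τ) mod N) + 1 and q = p - 1 (q = N when p = 1) given for the transmission step.

```python
DEVICE_IDS = [1000, 1100, 1200, 1300]   # devices holding c_1, c_2, c_3, c_4


def trace_component(n):
    """List (sender, receiver, action) for the n-th component over all 2N-2 rounds."""
    n_dev = len(DEVICE_IDS)
    hops = []
    for tau in range(2 * n_dev - 2):
        p = (n - 1 - tau) % n_dev + 1          # holder of c_p sends
        q = p - 1 if p > 1 else n_dev          # holder of c_q receives
        action = "add" if tau <= n_dev - 2 else "copy"
        hops.append((DEVICE_IDS[p - 1], DEVICE_IDS[q - 1], action))
    return hops


if __name__ == "__main__":
    for hop in trace_component(1):
        print(hop)
```

For the first component, the three "add" hops end at device 1100, which matches the statement that device 1100 holds the total of the first components after the sixth process, and the three "copy" hops then carry that total to devices 1000, 1300, and 1200.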

Principle 2
FIG. 7 shows the principle of a method of obtaining the sum of four 5-dimensional vectors using the parallel computing system of FIG. 4. In this example the dimension of the vectors is larger than the number of vectors. In this case, the vector components are divided into four groups in advance. Here, the first and fifth components form the first group, the second component the second group, the third component the third group, and the fourth component the fourth group. The arithmetic device 1000 is the first arithmetic device, the arithmetic device 1100 the second, the arithmetic device 1200 the third, and the arithmetic device 1300 the fourth. In advance, the n-th arithmetic device records the vector c_n = (c_{n1}, c_{n2}, c_{n3}, c_{n4}, c_{n5}) in its recording unit, where n is an integer from 1 to 4.
The processing of Principle 2 also consists of the first to twelfth processes and is the same as in Principle 1. The only difference from Principle 1 is the dimension of the vectors. In Principle 2, the fifth component, which belongs to the first group, is processed together with the first component.

Principle 3
FIG. 8 shows the principle of a method of obtaining the sum of four 2-dimensional vectors using the parallel computing system of FIG. 4. In this example the dimension of the vectors is smaller than the number of vectors. In this case too, the vector components are divided into four groups in advance. Here, the first component forms the first group and the second component the second group. No components belong to the third and fourth groups. As before, the arithmetic device 1000 is the first arithmetic device, the arithmetic device 1100 the second, the arithmetic device 1200 the third, and the arithmetic device 1300 the fourth. In advance, the n-th arithmetic device records the vector c_n = (c_{n1}, c_{n2}) in its recording unit, where n is an integer from 1 to 4.
The processing of Principle 3 also consists of the first to twelfth processes and is the same as in Principle 1. The only difference from Principle 1 is the dimension of the vectors. In Principle 3, the arithmetic devices do nothing in the processing for the third and fourth groups, which have no components.

Calculation method 1
FIG. 9 is a diagram showing the processing flow of a parallel computing system that obtains the sum of N N-dimensional vectors using N arithmetic devices. This processing flow corresponds to Principle 1.

First, each of the N arithmetic devices records one vector in the vector recording means of its recording unit (S110). In addition, 0 is assigned to τ and recorded in the τ recording means (S120).
The n-th arithmetic device (1000, etc.) takes component ((n − 1 + τ) mod N) + 1 out of the vector recording means (1022, etc.) of its recording unit (1020, etc.) and, using the transmitting means (1031, etc.) of its communication unit (1030, etc.), transmits it to arithmetic device ((n − 2) mod N) + 1 (1300, etc.) (S130). The n-th arithmetic device (1000, etc.) uses the receiving means (1032, etc.) of its communication unit (1030, etc.) to receive component ((n + τ) mod N) + 1 from arithmetic device (n mod N) + 1 (1100, etc.) (S140). Steps S130 and S140 correspond to the first, third, fifth, seventh, ninth, and eleventh processes of FIG. 6 when τ = 0, 1, 2, 3, 4, and 5, respectively.

The n-th arithmetic device (1000, etc.) then does one of the following (S150). (1) If τ ≤ N − 2, it takes component ((n + τ) mod N) + 1 out of the vector recording means (1022, etc.) of its recording unit (1020, etc.), adds it to the component ((n + τ) mod N) + 1 received in step S140, and records the result as component ((n + τ) mod N) + 1 in the vector recording means. (2) If τ > N − 2, it records the component ((n + τ) mod N) + 1 received in step S140 as component ((n + τ) mod N) + 1 in the vector recording means. Step S150 corresponds to the second, fourth, sixth, eighth, tenth, and twelfth processes of FIG. 6 when τ = 0, 1, 2, 3, 4, and 5, respectively.
Each arithmetic device assigns τ + 1 to τ and records it in the recording unit (S160). If τ is 2N − 3 or less, each arithmetic device returns to step S130; if τ is larger than 2N − 3, it ends the processing (S170). In the case of FIG. 6, 2N − 3 = 5, so the iteration ends after τ = 5 (the twelfth process).
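The flow of steps S110 to S170 can be sketched as a small single-process simulation. This is only an illustrative sketch: the function name `ring_allreduce`, the 0-based device index `d` (equal to n − 1), and the `inbox` dictionary that stands in for the ring network are assumptions introduced here and are not part of the described devices.

```python
def ring_allreduce(vectors):
    """Simulate calculation method 1: N devices, one N-dimensional vector each."""
    n_dev = len(vectors)
    data = [list(v) for v in vectors]   # S110: data[d] is the vector stored by device d+1
    tau = 0                             # S120
    while tau <= 2 * n_dev - 3:         # S170: repeat while tau is 2N-3 or less
        # S130/S140: device d+1 sends component ((d + tau) mod N) + 1 to the
        # next lower-numbered device on the ring (device 1 wraps to device N).
        inbox = {}
        for d in range(n_dev):
            comp = (d + tau) % n_dev
            inbox[(d - 1) % n_dev] = (comp, data[d][comp])
        # S150: add while collecting (tau <= N-2), overwrite while distributing.
        for d, (comp, value) in inbox.items():
            if tau <= n_dev - 2:
                data[d][comp] += value
            else:
                data[d][comp] = value
        tau += 1                        # S160
    return data


example = [[1, 2, 3, 4], [10, 20, 30, 40],
           [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
result = ring_allreduce(example)
```

With four devices the loop runs for τ = 0 to 5, and every row of `result` ends up holding the component-wise sum of the four input vectors.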

Calculation method 2
FIG. 10 is a diagram showing the processing flow of a parallel computing system that obtains the sum of N K-dimensional vectors using N arithmetic devices. This processing flow can handle Principles 1 through 3.

In advance, the K components of the vectors are divided into N groups (the first group through the N-th group). First, each of the N arithmetic devices records one vector in the vector recording means of its recording unit (S110). In addition, 0 is assigned to τ and recorded in the τ recording means (S120).
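The description does not fix a rule for dividing the K components into N groups. As a minimal sketch, a round-robin assignment reproduces the groupings used in the examples of Principle 2 (K = 5, N = 4) and Principle 3 (K = 2, N = 4). The function name `split_into_groups` and the round-robin rule itself are assumptions made for illustration.

```python
def split_into_groups(k_dim, n_groups):
    """Return n_groups lists of 1-based component numbers.

    Assumed rule: component k goes to group ((k - 1) mod N) + 1.
    """
    groups = [[] for _ in range(n_groups)]
    for comp in range(1, k_dim + 1):
        groups[(comp - 1) % n_groups].append(comp)
    return groups


principle_2_groups = split_into_groups(5, 4)   # five components, four groups
principle_3_groups = split_into_groups(2, 4)   # two components, two groups stay empty
```

Any other partition into N groups would work equally well with the flow described here; round-robin simply keeps the group sizes within one component of each other.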

Arithmetic device ((n − 1 − τ) mod N) + 1 (where n is an integer from 1 to N) (1) takes the vector components of the n-th group out of its recording unit and (2) transmits them to arithmetic device ((n − 2 − τ) mod N) + 1 using its communication unit (S131). Arithmetic device ((n − 2 − τ) mod N) + 1 uses its communication unit to receive the vector components of the n-th group from arithmetic device ((n − 1 − τ) mod N) + 1 (S141). Steps S131 and S141 can correspond to the first, third, fifth, seventh, ninth, and eleventh processes of FIGS. 6 to 8 when τ = 0, 1, 2, 3, 4, and 5, respectively.

Arithmetic device ((n − 2 − τ) mod N) + 1 then does one of the following (S151). (1) If τ ≤ N − 2, it takes the vector components of the n-th group out of its recording unit, adds them to the received vector components of the n-th group, and records the result in its recording unit as the vector components of the n-th group. (2) If τ > N − 2, it records the received vector components of the n-th group in its recording unit as the vector components of the n-th group. Step S151 can correspond to the second, fourth, sixth, eighth, tenth, and twelfth processes of FIGS. 6 to 8 when τ = 0, 1, 2, 3, 4, and 5, respectively.
Each arithmetic device assigns τ + 1 to τ and records it in the recording unit (S160). If τ is 2N − 3 or less, each arithmetic device returns to step S130; if τ is larger than 2N − 3, it ends the processing (S170). In the case of FIGS. 6 to 8, 2N − 3 = 5, so the iteration ends after τ = 5 (the twelfth process).
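The group-based flow of steps S131 to S151 can be sketched in the same way. The sketch below is an assumption-laden illustration: `ring_allreduce_grouped` is a hypothetical name, groups are passed as lists of 0-based component indices, and the ring network is replaced by a plain list of pending transfers.

```python
def ring_allreduce_grouped(vectors, groups):
    """Simulate calculation method 2: N devices, K-dimensional vectors, N groups."""
    n_dev = len(vectors)
    data = [list(v) for v in vectors]
    for tau in range(2 * n_dev - 2):            # tau = 0 .. 2N-3
        transfers = []
        for g in range(n_dev):                  # g = n - 1 for the n-th group
            src = (g - tau) % n_dev             # device ((n-1-tau) mod N) + 1
            dst = (g - 1 - tau) % n_dev         # device ((n-2-tau) mod N) + 1
            payload = [(k, data[src][k]) for k in groups[g]]   # S131
            transfers.append((dst, payload))
        for dst, payload in transfers:          # S141 and S151
            for k, value in payload:
                if tau <= n_dev - 2:
                    data[dst][k] += value       # collection: add to own component
                else:
                    data[dst][k] = value        # distribution: store the total
    return data


# Principle 2: five-dimensional vectors, components 1 and 5 share the first group.
five_dim = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50],
            [100, 200, 300, 400, 500], [1000, 2000, 3000, 4000, 5000]]
sums_5d = ring_allreduce_grouped(five_dim, [[0, 4], [1], [2], [3]])

# Principle 3: two-dimensional vectors, the third and fourth groups are empty.
two_dim = [[1, 2], [10, 20], [100, 200], [1000, 2000]]
sums_2d = ring_allreduce_grouped(two_dim, [[0], [1], [], []])
```

An empty group produces an empty payload, which mirrors the statement in Principle 3 that no operation is performed for groups without components.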

With this processing, when N arithmetic devices compute the sum of one set of N K-dimensional vectors, both the collection of information and the distribution of the result are achieved with at most 2{K/N}(N − 1) communications. In addition, because data is sent around a ring in one direction only, a network with N unidirectional communication paths (arranged in a ring) is sufficient.

[Modification 1]
The first embodiment described the case where the number of vectors equals the number of arithmetic devices. This modification describes the case where the number of vectors is larger than the number of arithmetic devices. If the number of vectors were smaller than the number of arithmetic devices, some arithmetic devices would record no vector. Such devices can simply be left out of the parallel computing system. The number of vectors is therefore assumed to be equal to or greater than the number of arithmetic devices.
It is also possible to first compute the sum of the vectors held in the same arithmetic device, reducing them to a single vector, and then obtain the sum of the vectors distributed among the devices. This can be realized by combining a process that sums the vectors within each arithmetic device with the processing of the first embodiment.
This modification shows that the parallel computing method of the present invention can be applied when one arithmetic device holds a plurality of vectors, even without first summing the vectors within that device. The arithmetic devices of this modification have the same functional configuration as the arithmetic devices shown in FIGS. 4 and 5.

Principle
FIG. 11 shows the principle of a method for obtaining the sum of five five-dimensional vectors using the parallel computing system of FIG. 4. In advance, the arithmetic device 1000 records the vectors c_1 and c_5, the arithmetic device 1100 records the vector c_2, the arithmetic device 1200 records the vector c_3, and the arithmetic device 1300 records the vector c_4.

In the first process, the following operations are performed in parallel. The arithmetic device 1000 transmits the fifth component c_55 of the vector c_5 to the arithmetic device 1300 and receives the second component c_22 of the vector c_2 from the arithmetic device 1100. The arithmetic device 1100 transmits the second component c_22 of the vector c_2 to the arithmetic device 1000 and receives the third component c_33 of the vector c_3 from the arithmetic device 1200. The arithmetic device 1200 transmits the third component c_33 of the vector c_3 to the arithmetic device 1100 and receives the fourth component c_44 of the vector c_4 from the arithmetic device 1300. The arithmetic device 1300 transmits the fourth component c_44 of the vector c_4 to the arithmetic device 1200 and receives the fifth component c_55 of the vector c_5 from the arithmetic device 1000. In this process, the vectors c_1 and c_5 are recorded in the same arithmetic device, so the first component c_11 of the vector c_1 is not transmitted from the device that records c_1 to the device that records c_5.

In the second process, the following operations are performed in parallel. The arithmetic device 1000 records the sum of the second component c_12 of the vector c_1 and the received second component c_22 of the vector c_2 as the second component c_12 of the vector c_1. The arithmetic device 1000 also records the sum of the first component c_51 of the vector c_5 and the first component c_11 of the vector c_1 as the first component c_51 of the vector c_5. The arithmetic device 1100 records the sum of the third component c_23 of the vector c_2 and the received third component c_33 of the vector c_3 as the third component c_23 of the vector c_2. The arithmetic device 1200 records the sum of the fourth component c_34 of the vector c_3 and the received fourth component c_44 of the vector c_4 as the fourth component c_34 of the vector c_3. The arithmetic device 1300 records the sum of the fifth component c_45 of the vector c_4 and the received fifth component c_55 of the vector c_5 as the fifth component c_45 of the vector c_4.

The third through eighth processes repeat the first and second processes. When the eighth process ends, the total of the first components is recorded in the arithmetic device 1100, the total of the second components in the arithmetic device 1200, the total of the third components in the arithmetic device 1300, and the totals of the fourth and fifth components in the arithmetic device 1000.

In the ninth process, the following operations are performed in parallel. The arithmetic device 1000 transmits the fourth component c_54 of the vector c_5 (the total of the fourth components) to the arithmetic device 1300 and receives the first component c_21 of the vector c_2 (the total of the first components) from the arithmetic device 1100. The arithmetic device 1100 transmits the first component c_21 of the vector c_2 (the total of the first components) to the arithmetic device 1000 and receives the second component c_32 of the vector c_3 (the total of the second components) from the arithmetic device 1200. The arithmetic device 1200 transmits the second component c_32 of the vector c_3 (the total of the second components) to the arithmetic device 1100 and receives the third component c_43 of the vector c_4 (the total of the third components) from the arithmetic device 1300. The arithmetic device 1300 transmits the third component c_43 of the vector c_4 (the total of the third components) to the arithmetic device 1200 and receives the fourth component c_54 of the vector c_5 (the total of the fourth components) from the arithmetic device 1000. In this process, the vectors c_1 and c_5 are recorded in the same arithmetic device, so the fifth component c_15 of the vector c_1 is not transmitted from the device that records c_1 to the device that records c_5.

In the tenth process, the following operations are performed in parallel. The arithmetic device 1000 records the received first component c_21 of the vector c_2 (the total of the first components) as the first component c_11 of the vector c_1. The arithmetic device 1000 also records the fifth component c_15 of the vector c_1 (the total of the fifth components) as the fifth component c_55 of the vector c_5. The arithmetic device 1100 records the received second component c_32 of the vector c_3 (the total of the second components) as the second component c_22 of the vector c_2. The arithmetic device 1200 records the received third component c_43 of the vector c_4 (the total of the third components) as the third component c_33 of the vector c_3. The arithmetic device 1300 records the received fourth component c_54 of the vector c_5 (the total of the fourth components) as the fourth component c_44 of the vector c_4.
The eleventh through sixteenth processes repeat the ninth and tenth processes. Through this repetition, the total of each component is distributed to all the arithmetic devices.

Calculation method
FIG. 12 is a diagram showing the processing flow of a parallel computing system that obtains the sum of N K-dimensional vectors using I arithmetic devices, where I is an integer of 2 or more and N or less.
Each arithmetic device records one or more of the K-dimensional vectors c_n to be summed (n is an integer from 1 to N) in the vector recording means of its recording unit (S115). Each arithmetic device also assigns 0 to τ and records it in the τ recording means (S120).

Each arithmetic device that has the vector c_p (where p = ((n − 1 − τ) mod N) + 1) in its recording unit (1) takes the n-th component of the vector c_p out of its recording unit and (2), if it is a different device from the one that has the vector c_q (where q = p − 1, except that q = N when p = 1) in its recording unit, uses its communication unit to transmit the n-th component of the vector c_p to the arithmetic device that has the vector c_q (S135). If the arithmetic device that has the vector c_q is a different device from the one that has the vector c_p, it uses its communication unit to receive the n-th component of the vector c_p from the arithmetic device that has the vector c_p (S145). Steps S135 and S145 correspond to the first, third, fifth, seventh, ninth, eleventh, thirteenth, and fifteenth processes of FIG. 11 when τ = 0, 1, 2, 3, 4, 5, 6, and 7, respectively.

Each arithmetic device that has the vector c_q in its recording unit then does one of the following (S155). (1) If τ ≤ N − 2, it takes the n-th component of the vector c_q out of its recording unit, adds it to the n-th component of the vector c_p in its calculation unit, and records the result in its recording unit as the n-th component of the vector c_q. (2) If τ > N − 2, it records the n-th component of the vector c_p in its recording unit as the n-th component of the vector c_q. Step S155 corresponds to the second, fourth, sixth, eighth, tenth, twelfth, fourteenth, and sixteenth processes of FIG. 11 when τ = 0, 1, 2, 3, 4, 5, 6, and 7, respectively.

Each arithmetic device assigns τ + 1 to τ and records it in the recording unit (S160). If τ is 2N − 3 or less, each arithmetic device returns to step S130; if τ is larger than 2N − 3, it ends the processing (S170). In the case of FIG. 11, 2N − 3 = 7, so the iteration ends after τ = 7 (the sixteenth process).
With this processing, when N arithmetic devices compute the sum of one set of N K-dimensional vectors, both the collection of information and the distribution of the result are achieved with at most 2{K/N}(N − 1) communications. In addition, because data is sent around a ring in one direction only, a network with N unidirectional communication paths (arranged in a ring) is sufficient.
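The vector-indexed flow of steps S115 to S170 can be sketched as follows for the FIG. 11 arrangement, in which one device holds both c_1 and c_5. This is an illustrative sketch under stated assumptions: the name `ring_allreduce_shared`, the `owner` list that maps each vector to a device number, and the message counter are all introduced here for illustration.

```python
def ring_allreduce_shared(vectors, owner):
    """Simulate Modification 1: vectors[p] is c_(p+1), stored on device owner[p]."""
    n_vec = len(vectors)
    k_dim = len(vectors[0])
    data = [list(v) for v in vectors]
    messages = 0                                # transfers that actually cross the network
    for tau in range(2 * n_vec - 2):            # tau = 0 .. 2N-3
        moves = []
        for comp in range(k_dim):               # comp = n - 1 for the n-th component
            p = (comp - tau) % n_vec            # p = ((n-1-tau) mod N) + 1, 0-based
            q = (p - 1) % n_vec                 # q = p - 1, wrapping from 1 to N
            if owner[p] != owner[q]:
                messages += 1                   # S135/S145 only when the devices differ
            moves.append((q, comp, data[p][comp]))
        for q, comp, value in moves:            # S155
            if tau <= n_vec - 2:
                data[q][comp] += value
            else:
                data[q][comp] = value
    return data, messages


# Five five-dimensional vectors; device 0 (1000 in FIG. 11) holds c_1 and c_5.
vectors = [[10 ** p * (c + 1) for c in range(5)] for p in range(5)]
owner = [0, 1, 2, 3, 0]
data, messages = ring_allreduce_shared(vectors, owner)
```

In this arrangement one of the five transfers in every step stays inside a single device, so only four messages per step cross the network, which matches the four transmissions listed for the first process.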

In this modification, each arithmetic device that has the vector c_p in its recording unit sent information to the arithmetic device that has the vector c_q (where q = p − 1, except that q = N when p = 1) in its recording unit. In the first embodiment as well, information was sent in descending order (the direction of decreasing device number). However, each arithmetic device that has the vector c_p may instead send information in ascending order, to the arithmetic device that has the vector c_q (where q = p + 1, except that q = 1 when p = N). Reversing the numbering (changing the n-th to the (N − n)-th) also reverses the physical direction of transmission and reception. In other words, because the numbering is merely decided in advance by a person, q = p − 1 and q = p + 1 are substantially equivalent. In this specification, unless otherwise stated, the order of devices, vectors, and the like covers both descending and ascending order.

When one arithmetic device records a plurality of vectors, first summing the vectors within that device into a single vector reduces the number of processing steps. However, when M sets of sums of N K-dimensional vectors are computed in parallel (simultaneously), using a different calculation method for each set becomes cumbersome, so the method of this modification is also useful. Moreover, because this modification also applies when each arithmetic device records exactly one vector, it covers a wider range of cases than the method of the first embodiment.

[Modification 2]
In the first embodiment and Modification 1, information flowed in the same direction during both collection and distribution. However, the flow of information during collection and the flow during distribution may be in opposite directions.
Principle
FIG. 13 shows the principle of a method for obtaining the sum of four four-dimensional vectors using the parallel computing system of FIG. 4. In advance, the arithmetic device 1000 records the vector c_1, the arithmetic device 1100 records the vector c_2, the arithmetic device 1200 records the vector c_3, and the arithmetic device 1300 records the vector c_4. Up to the sixth process, the principle is the same as that shown in FIG. 6. From the seventh process onward, the flow of information is reversed.
Calculation method
The processing flow is described with reference to FIG. 12. In this modification, the processing flow is the same as in Modification 1 up to τ ≤ N − 2, and is as follows.

If τ ≤ N − 2, each arithmetic device that has the vector c_p (where p = ((n − 1 − τ) mod N) + 1) in its recording unit (1) takes the n-th component of the vector c_p out of its recording unit and (2), if it is a different device from the one that has the vector c_q (where q = p − 1, except that q = N when p = 1) in its recording unit, uses its communication unit to transmit the n-th component of the vector c_p to the arithmetic device that has the vector c_q (S135'). If the arithmetic device that has the vector c_q is a different device from the one that has the vector c_p, it uses its communication unit to receive the n-th component of the vector c_p from the arithmetic device that has the vector c_p (S145'). Each arithmetic device that has the vector c_q takes the n-th component of the vector c_q out of its recording unit, adds it to the n-th component of the vector c_p in its calculation unit, and records the result in its recording unit as the n-th component of the vector c_q (S155').
If τ > N − 2, each arithmetic device that has the vector c_p (where p = ((n − 1 − τ) mod N) + 1) in its recording unit (1) takes the n-th component of the vector c_p out of its recording unit and (2), if it is a different device from the one that has the vector c_r (where r = p + 1, except that r = 1 when p = N) in its recording unit, uses its communication unit to transmit the n-th component of the vector c_p to the arithmetic device that has the vector c_r (S135'). If the arithmetic device that has the vector c_r is a different device from the one that has the vector c_p, it uses its communication unit to receive the n-th component of the vector c_p from the arithmetic device that has the vector c_p (S145'). Each arithmetic device that has the vector c_r records the n-th component of the vector c_p in its recording unit as the n-th component of the vector c_r (S155').
The subsequent processing is the same as in Modification 1.
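The idea of collecting in one direction and distributing in the other can be sketched as a simulation of the four-device case of FIG. 13. The sketch follows the stated principle rather than transcribing the step formulas: in the distribution phase it assumes that the device currently holding a total passes it to the next higher-numbered device. The function name `ring_allreduce_reversed` and that distribution-phase indexing are assumptions made for illustration.

```python
def ring_allreduce_reversed(vectors):
    """Collect in descending device order, then distribute in ascending order."""
    n_dev = len(vectors)
    data = [list(v) for v in vectors]
    for tau in range(2 * n_dev - 2):            # tau = 0 .. 2N-3
        collecting = tau <= n_dev - 2
        moves = []
        for comp in range(n_dev):
            if collecting:
                src = (comp - tau) % n_dev      # same sender as in the first embodiment
                dst = (src - 1) % n_dev         # descending direction
            else:
                step = tau - (n_dev - 1)        # 0 for the first distribution step
                src = (comp + 1 + step) % n_dev # assumed: current holder of the total
                dst = (src + 1) % n_dev         # ascending direction
            moves.append((dst, comp, data[src][comp]))
        for dst, comp, value in moves:
            if collecting:
                data[dst][comp] += value
            else:
                data[dst][comp] = value
    return data


four_dim = [[1, 2, 3, 4], [10, 20, 30, 40],
            [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
reversed_result = ring_allreduce_reversed(four_dim)
```

For four devices the direction changes at τ = 3, which is the seventh process, consistent with the statement that the flow is reversed from the seventh process onward.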

With this processing, when N arithmetic devices compute the sum of one set of N K-dimensional vectors, both the collection of information and the distribution of the result are achieved with at most 2{K/N}(N − 1) communications. In addition, because data is sent around a ring in both directions, a network with N half-duplex communication paths (arranged in a ring) is sufficient.

[Modification 3]
In the first embodiment, Modification 1, and Modification 2, the information for each vector component was both collected and distributed. However, collection alone already leaves the total of each component recorded in one of the arithmetic devices. The processing flow may therefore be ended once the collection processing up to τ ≤ N − 2 has finished.
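Stopping after the collection phase can be sketched by running only the steps τ = 0 to N − 2 of the flow of the first embodiment and then reading each total from the device that holds it. The function name `ring_collect_only` and the tuple layout of the return value are assumptions for illustration; the holder index used below follows from the send direction of that flow.

```python
def ring_collect_only(vectors):
    """Run only the collection steps and report where each total ends up."""
    n_dev = len(vectors)
    data = [list(v) for v in vectors]
    for tau in range(n_dev - 1):                # tau = 0 .. N-2 only
        sent = [((d - 1) % n_dev, (d + tau) % n_dev, data[d][(d + tau) % n_dev])
                for d in range(n_dev)]
        for dst, comp, value in sent:
            data[dst][comp] += value
    # After N-1 steps the total of 0-based component j sits on 0-based device (j+1) mod N.
    return [(j, (j + 1) % n_dev, data[(j + 1) % n_dev][j]) for j in range(n_dev)]


four_dim = [[1, 2, 3, 4], [10, 20, 30, 40],
            [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
totals = ring_collect_only(four_dim)            # (component, holder device, total), 0-based
```

This halves the number of communication steps, at the cost that each total is available on only one device.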

[Second Embodiment]
This embodiment first shows a method of using I arithmetic devices (I is an integer of 2 or more) to obtain the product AB of a matrix A with M rows and N columns (M is an integer of 2 or more, N is an integer of 2 or more, and MN ≥ I) and a matrix B with N rows and K columns (K is an integer of 2 or more). It then shows a method of further multiplying the product AB by the transposed matrix A^T of the matrix A.
Principle 1
FIG. 14 shows the principle of obtaining the product of a matrix A with 4 rows and 4 columns and a matrix B with 4 rows and 4 columns, and of multiplying by the transposed matrix A^T of the matrix A. The content of FIG. 14(A) is described first. Let a_mn be the component in row m, column n of the matrix A, and let b_mn be the component in row m, column n of the matrix B. The component in row m, column n of the product AB is then a_m1 b_1n + a_m2 b_2n + a_m3 b_3n + a_m4 b_4n. Suppose that the parallel computing system that obtains the matrix product assigns an arithmetic device to each component a_mn of the matrix A. In such a system, for example, the product a_11 b_11 is computed by the arithmetic device responsible for the component a_11 (in other words, the arithmetic device that has the component a_11 in its recording unit). This device also computes the products a_11 b_12, a_11 b_13, and a_11 b_14, so it records in its recording unit the four-dimensional vector (a_11 b_11, a_11 b_12, a_11 b_13, a_11 b_14) whose components are these results. The four-dimensional vector that forms the first row of the product AB is the sum of the result (a_11 b_11, a_11 b_12, a_11 b_13, a_11 b_14) of the device responsible for a_11, the result (a_12 b_21, a_12 b_22, a_12 b_23, a_12 b_24) of the device responsible for a_12, the result (a_13 b_31, a_13 b_32, a_13 b_33, a_13 b_34) of the device responsible for a_13, and the result (a_14 b_41, a_14 b_42, a_14 b_43, a_14 b_44) of the device responsible for a_14.

That is, as shown in FIG. 14(B), if c_mn = (a_mn b_n1, a_mn b_n2, a_mn b_n3, a_mn b_n4), the vector that forms the m-th row of the product AB is c_m1 + c_m2 + c_m3 + c_m4. Furthermore, if the parallel computing method for vector sums of the first embodiment is used, then when the computation of the product AB finishes, each arithmetic device responsible for a component a_mn (the device holding a_mn in its recording unit) has recorded the result of c_m1 + c_m2 + c_m3 + c_m4. In this kind of matrix calculation, c_m1 + c_m2 + c_m3 + c_m4 is computed for every row, so M sets of sums of N K-dimensional vectors are computed in parallel.
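The decomposition in Principle 1 can be checked with a short Python sketch. It builds the vectors c_mn = a_mn × (row n of B), sums them per row, and compares the result with the ordinary matrix product. The function names `partial_vectors` and `sum_rows` and the sample 4×4 matrices are assumptions chosen for illustration; indices in the code are 0-based, whereas the text counts from 1.

```python
def partial_vectors(a, b):
    """c[m][n] is the vector a_mn * (row n of B), as computed by the device holding a_mn."""
    return [[[a_mn * b_nk for b_nk in b[n]] for n, a_mn in enumerate(row)] for row in a]


def sum_rows(c):
    """Row m of AB is the component-wise sum c_m1 + c_m2 + ... of the partial vectors."""
    return [[sum(parts) for parts in zip(*c_m)] for c_m in c]


A = [[1, 2, 0, 1], [0, 1, 3, 0], [2, 0, 1, 1], [1, 1, 0, 2]]
B = [[1, 0, 2, 1], [0, 1, 1, 0], [3, 1, 0, 2], [1, 2, 1, 0]]

C = partial_vectors(A, B)
AB = sum_rows(C)

# Reference result from the textbook definition of the matrix product.
reference = [[sum(A[m][n] * B[n][k] for n in range(4)) for k in range(4)] for m in range(4)]
print(AB == reference)
```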

FIG. 14(C) shows the principle of multiplying by the transposed matrix A^T. The component a_mn in row m, column n of matrix A is the component in row n, column m of the transposed matrix A^T. Accordingly, in the computation with the transposed matrix A^T, the arithmetic device that handled component a_mn of matrix A handles the component in row n, column m of A^T. To do so, this device needs the components of row m of the product AB. However, the components of row m of the product AB are exactly the result of c_m1 + c_m2 + c_m3 + c_m4, which the device has already recorded. In other words, no data needs to be communicated for the computation with the transposed matrix A^T.
Principle 2
FIG. 15 shows the principle of obtaining the product of a 3-row, 4-column matrix A and a 4-row, 2-column matrix B, and of multiplying by the transposed matrix A^T of the matrix A. The idea is the same as in Principle 1. In this case, the m-th row of the product AB is the sum of four two-dimensional vectors c_mn = (a_mn b_n1, a_mn b_n2).
Principle 3
FIG. 16 shows the principle of obtaining the product of a 4-row, 2-column matrix A and a 2-row, 4-column matrix B, and of multiplying by the transposed matrix A^T of the matrix A. The idea is the same as in Principle 1. In this case, the m-th row of the product AB is the sum of two four-dimensional vectors c_mn = (a_mn b_n1, a_mn b_n2, a_mn b_n3, a_mn b_n4).
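The transposed-matrix principle of FIG. 14(C) can be illustrated with the matrix shapes of Principle 2. In the Python sketch below, every product a_mn × (row m of AB) uses only values that the device holding a_mn already has, and accumulating those products over m gives row n of A^T(AB). The accumulation itself still has to be carried out across devices, which the sketch does not model. The helper names `matmul`, `transpose`, and `at_times_ab` and the sample data are assumptions for illustration only.

```python
def matmul(x, y):
    """Textbook matrix product, used here only as a reference."""
    return [[sum(x_ij * y[j][k] for j, x_ij in enumerate(row)) for k in range(len(y[0]))]
            for row in x]


def transpose(x):
    return [list(column) for column in zip(*x)]


def at_times_ab(a, ab):
    """Row n of A^T(AB) is the sum over m of a_mn * (row m of AB)."""
    n_cols, k = len(a[0]), len(ab[0])
    result = [[0] * k for _ in range(n_cols)]
    for m, row in enumerate(a):
        for n, a_mn in enumerate(row):
            # The device holding a_mn multiplies it by row m of AB, which it already holds.
            for j in range(k):
                result[n][j] += a_mn * ab[m][j]
    return result


A = [[1, 2, 0, 1], [0, 1, 3, 0], [2, 0, 1, 1]]   # 3 rows, 4 columns
B = [[1, 0], [0, 1], [3, 1], [1, 2]]             # 4 rows, 2 columns

AB = matmul(A, B)
ATAB = at_times_ab(A, AB)
print(ATAB == matmul(transpose(A), AB))
```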

Calculation method 1
FIG. 17 shows an example of the functional configuration of one arithmetic device for the case in which the product AB of an M-row, N-column matrix A and an N-row, K-column matrix B is obtained using I arithmetic devices. The configuration of the parallel computing system is the same as in FIG. 4. FIG. 18 shows the processing flow for obtaining the product AB of the M-row, N-column matrix A and the N-row, K-column matrix B using I arithmetic devices.

The arithmetic device 2000 includes a calculation unit 2010, a recording unit 2020, and a communication unit 2030. The calculation unit 2010 includes τ increasing means 2001, repeating means 2002, component calculating means 2011, and operation result recording means 2012. The τ increasing means 2001 and the repeating means 2002 may instead be provided in a component other than the calculation unit 2010 (for example, a control unit or similar component, not shown). The recording unit 2020 includes component recording means 2021, vector recording means 2022, and τ recording means 2023. The communication unit 2030 includes transmission means 2031 and reception means 2032. The other arithmetic devices have the same functional configuration.

In advance, the MN components a_mn of matrix A are divided into I groups, and each arithmetic device (2000, etc.) records all the components a_mn of the i-th group, together with the components b_n1 to b_nK of matrix B by which those components are multiplied, in the component recording means (2021, etc.) of its recording unit (2020, etc.) (S210). Next, each arithmetic device (2000, etc.) (1) calculates, in its calculation unit, a_mn b_n1 to a_mn b_nK for each component a_mn recorded in the component recording means (2021, etc.) of the recording unit (2020, etc.), and (2) records the result as the K-dimensional vector c_mn = (c_mn1, c_mn2, …, c_mnK) = (a_mn b_n1, a_mn b_n2, …, a_mn b_nK) in the vector recording means (2022, etc.) of the recording unit (2020, etc.) (S215). Each arithmetic device (2000, etc.) records τ = 0 in the τ recording means (2023, etc.) of its recording unit (2020) (S220).
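Steps S210 to S220 can be sketched in Python as follows. The text does not say how the MN components are divided into I groups, so the round-robin split in `assign_components` is purely an assumption, as are the function names, the dictionary layout of a device's state, and the sample data. Indices are 0-based in the code.

```python
def assign_components(m_rows, n_cols, num_devices):
    """Hypothetical grouping: deal the (m, n) index pairs round-robin into I groups."""
    groups = [[] for _ in range(num_devices)]
    pairs = [(m, n) for m in range(m_rows) for n in range(n_cols)]
    for idx, pair in enumerate(pairs):
        groups[idx % num_devices].append(pair)
    return groups


def init_device(a, b, group):
    """S210: store a_mn and the needed rows of B. S215: compute c_mn. S220: set tau = 0."""
    components = {(m, n): a[m][n] for m, n in group}
    b_rows = {n: b[n] for _, n in group}
    vectors = {(m, n): [a_mn * b_nk for b_nk in b_rows[n]]
               for (m, n), a_mn in components.items()}
    return {"components": components, "b_rows": b_rows, "vectors": vectors, "tau": 0}


A = [[1, 2, 0, 1], [0, 1, 3, 0], [2, 0, 1, 1]]   # M = 3, N = 4
B = [[1, 0], [0, 1], [3, 1], [1, 2]]             # N = 4, K = 2
I = 5                                            # satisfies MN >= I

devices = [init_device(A, B, group) for group in assign_components(3, 4, I)]
print(devices[0]["vectors"])
```

A different grouping would change only which device holds which c_mn; the later transmission steps already skip communication when one device holds both vectors involved.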

Each arithmetic device (2000, etc.) that holds a vector c_mp (m is from 1 to M; p is (n−1−τ (mod N)) + 1) in the vector recording means (2022, etc.) of its recording unit (2020, etc.) (1) takes out the n-th component of the vector c_mp from the vector recording means (2022, etc.) of the recording unit (2020, etc.), and (2) if this device differs from the arithmetic device that holds the vector c_mq (q is p−1; however, q = N when p = 1) in its recording unit (that is, if the same device does not hold both c_mp and c_mq), transmits the n-th component of c_mp to the arithmetic device holding c_mq, using the transmission means (2031, etc.) of the communication unit (2030, etc.) (S230). Each arithmetic device (2000, etc.) that holds the vector c_mq in its recording unit, if it differs from the arithmetic device holding c_mp, receives the n-th component of c_mp from the arithmetic device holding c_mp, using the reception means (2032, etc.) of the communication unit (2030, etc.) (S240).

Each arithmetic device (2000, etc.) that holds the vector c_mq in its recording unit proceeds as follows. (1) When τ ≤ N−2, it takes out the n-th component of c_mq from the vector recording means (2022, etc.) of its recording unit (2020, etc.), obtains the sum of that component and the n-th component of c_mp using the operation result recording means (2012, etc.) of its calculation unit (2010, etc.), and records the result as the n-th component of c_mq in the vector recording means (2022, etc.) of the recording unit (2020, etc.). (2) When τ > N−2, it records the n-th component of c_mp as the n-th component of c_mq in the vector recording means (2022, etc.) of the recording unit (2020, etc.) (S250).

Each arithmetic device (2000, etc.) substitutes τ + 1 for τ and records it in the τ recording means (2023, etc.) of the recording unit (2020, etc.) (S260). The repeating means (2002, etc.) of each arithmetic device (2000, etc.) returns the processing flow to step S230 when τ is 2N−3 or less, and otherwise ends the processing (S270).
With this processing, the parallel computation of a matrix-matrix product can be realized with few communications and few communication paths, in the same way as the parallel computation of a vector sum.
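The loop S230 to S270 can be simulated in a few lines of Python. This is a sketch under simplifying assumptions, not the patented implementation: it takes K = N so that "the n-th component" ranges over every component, it treats each vector c_mp as if it lived on its own device, and it replaces network transfers with list assignments. The function name `ring_matmul` and the sample matrices are hypothetical.

```python
def ring_matmul(a, b):
    """Simulate S215 and the loop S230-S270 for an M x N matrix A and an N x N matrix B."""
    m_rows, n = len(a), len(a[0])
    assert len(b) == n and all(len(row) == n for row in b)  # the sketch assumes K == N
    # S215: c[m][p - 1] is the vector c_mp = a_mp * (row p of B).
    c = [[[a[m][p] * b[p][k] for k in range(n)] for p in range(n)] for m in range(m_rows)]
    for tau in range(2 * n - 2):              # S260/S270: repeat while tau <= 2N-3
        for m in range(m_rows):               # the M rows proceed in parallel
            for comp in range(1, n + 1):
                p = (comp - 1 - tau) % n + 1
                q = p - 1 if p > 1 else n
                value = c[m][p - 1][comp - 1]         # S230/S240: sent from c_mp to c_mq
                if tau <= n - 2:
                    c[m][q - 1][comp - 1] += value    # S250 (1): accumulate
                else:
                    c[m][q - 1][comp - 1] = value     # S250 (2): overwrite with the total
    return c


A = [[1, 2, 0, 1], [0, 1, 3, 0], [2, 0, 1, 1], [1, 1, 0, 2]]
B = [[1, 0, 2, 1], [0, 1, 1, 0], [3, 1, 0, 2], [1, 2, 1, 0]]

C = ring_matmul(A, B)
AB = [[sum(A[m][j] * B[j][k] for j in range(4)) for k in range(4)] for m in range(4)]
print(all(C[m][p] == AB[m] for m in range(4) for p in range(4)))
```

When the loop ends, every vector c_mp in row m equals row m of AB, which is the state that the later multiplication by A^T relies on.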

Calculation method 2
FIG. 19 shows an example of the functional configuration of one arithmetic device for the case in which the product A^T AB of an M-row, N-column matrix A and an N-row, K-column matrix B is obtained using I arithmetic devices. The configuration of the parallel computing system is the same as in FIG. 4. FIG. 20 shows the processing flow for obtaining the product A^T AB of the M-row, N-column matrix A and the N-row, K-column matrix B using I arithmetic devices.

The arithmetic device 3000 includes a calculation unit 3010, a recording unit 3020, and a communication unit 3030. The calculation unit 3010 includes τ increasing means 2001, repeating means 2002, component calculating means 2011, operation result recording means 2012, A^T τ increasing means 3006, A^T repeating means 3007, A^T component calculating means 3016, and A^T operation result recording means 3017. The τ increasing means 2001, the repeating means 2002, the A^T τ increasing means 3006, and the A^T repeating means 3007 may instead be provided in a component other than the calculation unit 3010 (for example, a control unit or similar component, not shown). The recording unit 3020 includes component recording means 2021, vector recording means 2022, τ recording means 2023, A^T vector recording means 3022, and A^T τ recording means 3023. The communication unit 3030 includes transmission means 2031, reception means 2032, A^T transmission means 3036, and A^T reception means 3037. The other arithmetic devices have the same functional configuration.
The processing from step S210 to step S270 is the same as in calculation method 1 of FIG. 18.

Each arithmetic device (3000, etc.) (1) calculates a_mn c_m1 to a_mn c_mK, using the A^T component calculating means (3016, etc.) of its calculation unit (3010, etc.), for each component a_mn recorded in the vector recording means (2022, etc.) of its recording unit (3020, etc.), and (2) records the result as the K-dimensional vector b_nm = (b_nm1, b_nm2, …, b_nmK) = (a_mn c_m1, a_mn c_m2, …, a_mn c_mK) in the A^T vector recording means (3027, etc.) of the recording unit (3020, etc.) (S315). Each arithmetic device (3000, etc.) records τ = 0 in the A^T τ recording means (3028, etc.) of its recording unit (3020) (S320). The A^T τ recording means (3028, etc.) may be the same as the τ recording means (2023, etc.).

Each arithmetic device (3000, etc.) that holds a vector b_np (n is from 1 to N; p is (n−1−τ (mod M)) + 1) in its recording unit (1) takes out the n-th component of the vector b_np from the A^T vector recording means (3027, etc.) of the recording unit (3020, etc.), and (2) if this device differs from the arithmetic device that holds the vector b_nq (q is p−1; however, q = M when p = 1) in its recording unit, transmits the n-th component of b_np to the arithmetic device holding b_nq, using the A^T transmission means (3036, etc.) of the communication unit (3030, etc.) (S330). Each arithmetic device (3000, etc.) that holds the vector b_nq in its recording unit, if it differs from the arithmetic device holding b_np, receives the n-th component of b_np from the arithmetic device holding b_np, using the A^T reception means (3037, etc.) of the communication unit (3030, etc.) (S340). The A^T transmission means (3036, etc.) may be the same as the transmission means (2031, etc.), and the A^T reception means (3037, etc.) may be the same as the reception means (2032, etc.).

Each arithmetic device (3000, etc.) that holds the vector b_nq in its recording unit proceeds as follows. (1) When τ ≤ N−2, it takes out the n-th component of b_nq from the A^T vector recording means (3027, etc.) of its recording unit (3020, etc.), obtains the sum of that component and the n-th component of b_np using the A^T operation result recording means 3017 of its calculation unit (3010, etc.), and records the result as the n-th component of b_nq in the A^T vector recording means (3027, etc.) of the recording unit (3020, etc.). (2) When τ > N−2, it records the n-th component of b_np as the n-th component of b_nq in the A^T vector recording means (3027, etc.) of the recording unit (3020, etc.) (S350).

Each arithmetic device (3000, etc.) substitutes τ + 1 for τ and records it in the A^T τ recording means (3028, etc.) of the recording unit (3020, etc.) (S360). The repeating means (3007, etc.) of each arithmetic device (3000, etc.) returns the processing flow to step S330 when τ is 2N−3 or less, and otherwise ends the processing (S370).
With this processing, the parallel computation of a matrix-matrix product can be realized with few communications and few communication paths, in the same way as the parallel computation of a vector sum. Furthermore, no data needs to be communicated for the computation with the transposed matrix A^T. Therefore, when the computation includes multiplying by a sparse matrix and by its transposed matrix, as in the Block Lanczos method, the number of communications can be greatly reduced.

Calculation method 3
Some computations, such as the Block Lanczos method, repeatedly multiply by a sparse matrix and by the transposed matrix of that matrix. In such cases, calculation method 2 of the second embodiment can be repeated. For this purpose, the arithmetic device 3000 also includes A^T AB calculation repeating means 3008, indicated by a dotted line in FIG. 19. FIG. 21 shows the processing flow when the multiplication by a matrix and by its transposed matrix is repeated. Steps S210 and S300 are the same as in FIG. 20. In the method of this modification, step S270, which determines whether to repeat, is added to the processing flow of FIG. 20.
With this processing, the number of communications can be greatly reduced when the multiplication by a sparse matrix and by its transposed matrix is repeated, as in the Block Lanczos method.

In the above embodiments, a program read into the recording unit 5020 of the computer shown in FIG. 22 can cause the control unit 5010, the recording unit 5020, the communication unit 5030, and so on to execute each step of the above methods. Ways of loading the program into the computer include recording the program on a computer-readable recording medium and loading it from that medium, and loading a program stored on a server or the like into the computer through a telecommunication line or the like.

The present invention can be used in large-scale information processing systems that use operations for obtaining the sum of vectors or the product of matrices. For example, it can be used in a system that evaluates the security of a security system that uses cryptography.

FIG. 1 shows an example of a system configuration that implements the conventional method.
FIG. 2 shows how information is collected in the conventional method.
FIG. 3 shows how information is distributed in the conventional method.
FIG. 4 shows a configuration example of a parallel computing system that obtains the sum of K-dimensional vectors using N arithmetic devices (N = 4).
FIG. 5 shows an example of the functional configuration of the arithmetic device 1000.
FIG. 6 shows the principle of a method of obtaining the sum of four 4-dimensional vectors using the parallel computing system.
FIG. 7 shows the principle of a method of obtaining the sum of four 5-dimensional vectors using the parallel computing system.
FIG. 8 shows the principle of a method of obtaining the sum of four 2-dimensional vectors using the parallel computing system.
FIG. 9 shows the processing flow of a parallel computing system that obtains the sum of N N-dimensional vectors with N arithmetic devices.
FIG. 10 shows the processing flow of a parallel computing system that obtains the sum of N K-dimensional vectors with N arithmetic devices.
FIG. 11 shows the principle of a method of obtaining the sum of five 4-dimensional vectors using the parallel computing system.
FIG. 12 shows the processing flow of a parallel computing system that obtains the sum of N K-dimensional vectors with I arithmetic devices.
FIG. 13 shows another principle of a method of obtaining the sum of four 4-dimensional vectors using the parallel computing system.
FIG. 14 shows the principle of obtaining the product of a 4-row, 4-column matrix A and a 4-row, 4-column matrix B, and of multiplying by the transposed matrix A^T of the matrix A.
FIG. 15 shows the principle of obtaining the product of a 3-row, 4-column matrix A and a 4-row, 2-column matrix B, and of multiplying by the transposed matrix A^T of the matrix A.
FIG. 16 shows the principle of obtaining the product of a 4-row, 2-column matrix A and a 2-row, 4-column matrix B, and of multiplying by the transposed matrix A^T of the matrix A.
FIG. 17 shows an example of the functional configuration of one arithmetic device when the product AB of an M-row, N-column matrix A and an N-row, K-column matrix B is obtained using I arithmetic devices.
FIG. 18 shows the processing flow when the product AB of an M-row, N-column matrix A and an N-row, K-column matrix B is obtained using I arithmetic devices.
FIG. 19 shows an example of the functional configuration of one arithmetic device when the product A^T AB of an M-row, N-column matrix A and an N-row, K-column matrix B is obtained using I arithmetic devices.
FIG. 20 shows the processing flow when the product A^T AB of an M-row, N-column matrix A and an N-row, K-column matrix B is obtained using I arithmetic devices.
FIG. 21 shows the processing flow when the multiplication by a matrix and by its transposed matrix is repeated.
FIG. 22 shows an example of the functional configuration of a computer.

Claims

A parallel computing method for obtaining the sum of N N-dimensional vectors using N arithmetic devices (N is an integer of 2 or more), each having a calculation unit, a recording unit, and a communication unit,
A vector recording step in which each arithmetic device records one of the N-dimensional vectors to be summed in the recording unit of that arithmetic device;
An initial setting step in which each arithmetic device records τ = 0 in the recording unit;
A component transmission step in which the n-th arithmetic device (n is an integer from 1 to N) takes out the ((n−1+τ (mod N)) + 1)-th component and transmits it to the ((n−2 (mod N)) + 1)-th arithmetic device using the transmission means of the communication unit;
A component reception step in which the n-th arithmetic device receives the ((n+τ (mod N)) + 1)-th component from the ((n (mod N)) + 1)-th arithmetic device using the reception means of the communication unit;
A calculation result recording step in which the n-th arithmetic device (1) when τ ≤ N−2, takes out the ((n+τ (mod N)) + 1)-th component from the vector recording means of the recording unit, obtains its sum with the received ((n+τ (mod N)) + 1)-th component, and records the result in the recording unit as the ((n+τ (mod N)) + 1)-th component, and (2) when τ > N−2, records the received ((n+τ (mod N)) + 1)-th component in the recording unit as the ((n+τ (mod N)) + 1)-th component;
A τ increasing step in which each arithmetic device substitutes τ + 1 for τ and records it in the recording unit;
A repetition step in which each arithmetic device returns to the component transmission step when τ is 2N−3 or less;
The parallel computing method having the above steps.

A parallel computing method for obtaining the sum of N K-dimensional vectors (K is an integer of 2 or more) using N arithmetic devices (N is an integer of 2 or more), each having a calculation unit, a recording unit, and a communication unit,
In which the K components of the vectors are divided in advance into N groups (a first group to an N-th group),
A vector recording step in which each arithmetic device records one of the K-dimensional vectors to be summed in the recording unit of that arithmetic device;
An initial setting step in which each arithmetic device records τ = 0 in the recording unit;
A component transmission step in which each ((n−1−τ (mod N)) + 1)-th arithmetic device (n is an integer from 1 to N) (1) takes out the vector components of the n-th group from the recording unit of that arithmetic device and (2) transmits them to the ((n−2−τ (mod N)) + 1)-th arithmetic device using the communication unit of that arithmetic device;
A component reception step in which each ((n−2−τ (mod N)) + 1)-th arithmetic device receives the vector components of the n-th group from the ((n−1−τ (mod N)) + 1)-th arithmetic device using the communication unit of that arithmetic device;
A calculation result recording step in which each ((n−2−τ (mod N)) + 1)-th arithmetic device (1) when τ ≤ N−2, takes out the vector components of the n-th group from the recording unit of that arithmetic device, obtains in its calculation unit their sum with the received vector components of the n-th group, and records the result in the recording unit as the vector components of the n-th group, and (2) when τ > N−2, records the received vector components of the n-th group in the recording unit as the vector components of the n-th group;
A τ increasing step in which each arithmetic device substitutes τ + 1 for τ and records it in the recording unit;
A repetition step in which each arithmetic device returns to the component transmission step when τ is 2N−3 or less;
The parallel computing method having the above steps.
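The grouped variant recited above, in which K need not equal N, can be illustrated with a small Python simulation. The claim does not specify how the K components are divided into N groups, so the contiguous split in `split_groups` is an assumption, as are the function names and the sample vectors. Devices are modelled as lists, and a transfer is modelled as reading from the sender's list and writing to the receiver's list.

```python
def split_groups(k, n):
    """Hypothetical split of component indices 0..k-1 into n contiguous groups."""
    base, extra = divmod(k, n)
    groups, start = [], 0
    for g in range(n):
        size = base + (1 if g < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


def ring_sum_grouped(vectors):
    """Device d starts with vectors[d - 1]; after 2N-2 steps every device holds the sum."""
    n, k = len(vectors), len(vectors[0])
    groups = split_groups(k, n)
    dev = [list(v) for v in vectors]
    for tau in range(2 * n - 2):                 # repeat while tau <= 2N-3
        for g in range(1, n + 1):                # g is the group number n of the claim
            sender = (g - 1 - tau) % n + 1
            receiver = (g - 2 - tau) % n + 1
            for i in groups[g - 1]:
                if tau <= n - 2:
                    dev[receiver - 1][i] += dev[sender - 1][i]   # collection phase
                else:
                    dev[receiver - 1][i] = dev[sender - 1][i]    # distribution phase
    return dev


vectors = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50],
           [100, 200, 300, 400, 500], [1000, 2000, 3000, 4000, 5000]]
result = ring_sum_grouped(vectors)
totals = [sum(column) for column in zip(*vectors)]
print(all(row == totals for row in result))
```

With K smaller than N, some groups in this split are empty and the corresponding transfers carry nothing.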

A parallel computing method for obtaining the sum of N K-dimensional vectors (N is an integer of 2 or more, K is an integer of 2 or more) using a plurality of arithmetic devices, each having a calculation unit, a recording unit, and a communication unit,
A vector recording step in which each arithmetic device records one or more of the K-dimensional vectors c_n (n is an integer from 1 to N) to be summed in the recording unit of that arithmetic device;
An initial setting step in which each arithmetic device records τ = 0 in the recording unit;
A component transmission step in which each arithmetic device holding the vector c_p (p is (n−1−τ (mod N)) + 1) in its recording unit (1) takes out the n-th component of the vector c_p from the recording unit of that arithmetic device and (2) if that arithmetic device differs from the arithmetic device holding the vector c_q (q is p−1; however, q = N when p = 1) in its recording unit, transmits the n-th component of the vector c_p to the arithmetic device holding the vector c_q in its recording unit, using the communication unit;
A component reception step in which each arithmetic device holding the vector c_q in its recording unit, if it differs from the arithmetic device holding the vector c_p in its recording unit, receives the n-th component of the vector c_p from the arithmetic device holding the vector c_p in its recording unit, using the communication unit;
A calculation result recording step in which each arithmetic device holding the vector c_q in its recording unit (1) when τ ≤ N−2, takes out the n-th component of the vector c_q from the recording unit of that arithmetic device, obtains in its calculation unit the sum with the n-th component of the vector c_p, and records the result in the recording unit as the n-th component of the vector c_q, and (2) when τ > N−2, records the n-th component of the vector c_p in the recording unit as the n-th component of the vector c_q;
A τ increasing step in which each arithmetic device substitutes τ + 1 for τ and records it in the recording unit;
A repetition step in which each arithmetic device returns to the component transmission step when τ is 2N−3 or less;
The parallel computing method having the above steps, in which a vector c_x (x is an arbitrary integer from 1 to N) is taken as the sum of the vectors.

A parallel operation method for calculating the sum of N (N is an integer of 2 or more) K-dimensional vectors (K is an integer of 2 or more) using a plurality of arithmetic devices each having a calculation unit, a recording unit, and a communication unit, the method comprising:
A vector recording step in which each arithmetic device records one or more of the K-dimensional vectors c_n (n is an integer from 1 to N) to be calculated in its recording unit;
An initial setting step in which each arithmetic device records τ = 0 in its recording unit;
A component transmission step in which each arithmetic device having the vector c_p (p is ((n−1−τ) mod N) + 1) in its recording unit (1) extracts the n-th component of the vector c_p from its recording unit, (2) when τ ≦ N−2 and that arithmetic device is different from the arithmetic device having the vector c_q (q is p−1, where q = N when p = 1) in its recording unit, transmits the n-th component of the vector c_p, using the communication unit, to the arithmetic device having the vector c_q in its recording unit, and (3) when τ > N−2 and that arithmetic device is different from the arithmetic device having the vector c_r (r is p+1, where r = 1 when p = N) in its recording unit, transmits the n-th component of the vector c_p, using the communication unit, to the arithmetic device having the vector c_r in its recording unit;
A component receiving step in which (1) when τ ≦ N−2, each arithmetic device having the vector c_q in its recording unit, when it is different from the arithmetic device having the vector c_p in its recording unit, receives the n-th component of the vector c_p, using the communication unit, from the arithmetic device having the vector c_p in its recording unit, and (2) when τ > N−2, each arithmetic device having the vector c_r in its recording unit, when it is different from the arithmetic device having the vector c_p in its recording unit, receives the n-th component of the vector c_p, using the communication unit, from the arithmetic device having the vector c_p in its recording unit;
A calculation result recording step in which (1) when τ ≦ N−2, each arithmetic device having the vector c_q in its recording unit extracts the n-th component of the vector c_q from its recording unit, obtains in its calculation unit the sum of that component and the n-th component of the vector c_p, and records the result in the recording unit as the n-th component of the vector c_q, and (2) when τ > N−2, each arithmetic device having the vector c_r in its recording unit records the n-th component of the vector c_p in the recording unit as the n-th component of the vector c_r;
A τ increasing step in which each arithmetic device substitutes τ + 1 for τ and records it in the recording unit;
A repetition step in which each arithmetic device returns to the component transmission step when τ is 2N−3 or less;
wherein a vector c_x (x is an arbitrary integer from 1 to N) is taken as the sum of the vectors.

A parallel operation method for calculating the sum of N (N is an integer of 2 or more) K-dimensional vectors (K is an integer of 2 or more) using a plurality of arithmetic devices each having a calculation unit, a recording unit, and a communication unit, the method comprising:
A vector recording step in which each arithmetic device records one or more of the K-dimensional vectors c_n (n is an integer from 1 to N) to be calculated in its recording unit;
An initial setting step in which each arithmetic device records τ = 0 in its recording unit;
A component transmission step in which each arithmetic device having the vector c_p (p is ((n−1−τ) mod N) + 1) in its recording unit (1) extracts the n-th component of the vector c_p from its recording unit, and (2) at least when τ ≦ N−2, when that arithmetic device is different from the arithmetic device having the vector c_q (q is p−1, where q = N when p = 1) in its recording unit, transmits the n-th component of the vector c_p, using the communication unit, to the arithmetic device having the vector c_q in its recording unit;
A component receiving step in which, at least when τ ≦ N−2, each arithmetic device having the vector c_q in its recording unit, when it is different from the arithmetic device having the vector c_p in its recording unit, receives the n-th component of the vector c_p, using the communication unit, from the arithmetic device having the vector c_p in its recording unit;
A calculation result recording step in which, at least when τ ≦ N−2, each arithmetic device having the vector c_q in its recording unit extracts the n-th component of the vector c_q from its recording unit, obtains in its calculation unit the sum of that component and the n-th component of the vector c_p, and records the result in the recording unit as the n-th component of the vector c_q;
A τ increasing step in which each arithmetic device substitutes τ + 1 for τ and records it in the recording unit;
A repetition step in which each arithmetic device returns to the component transmission step at least when τ is N−2 or less;
wherein a vector c_x (x is an arbitrary integer from 1 to N) is taken as the sum of the vectors.
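
For illustration only, the following sketch runs just the accumulation rounds τ = 0 to N−2 that this claim requires at minimum, assuming N vectors of N components and one vector per device. The function names are hypothetical. Under the index rule p = ((n−1−τ) mod N) + 1, q = p−1, the completed n-th component ends up in the vector numbered (n mod N) + 1; what happens after these rounds is left open by the claim and is not modelled here.

```python
def ring_accumulate(vectors):
    """Run only the rounds tau = 0 .. N-2 (accumulation, no copying)."""
    N = len(vectors)
    c = [list(v) for v in vectors]
    for tau in range(N - 1):
        for n in range(1, N + 1):
            p = (n - 1 - tau) % N + 1      # sender's vector index
            q = p - 1 if p > 1 else N      # receiver's vector index
            c[q - 1][n - 1] += c[p - 1][n - 1]
    return c


def holder_of_total(n, N):
    """Index of the vector holding the completed n-th component."""
    return n % N + 1
```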

A parallel operation method for obtaining a product AB of an M-row N-column matrix A (M is an integer of 2 or more, N is an integer of 2 or more, and MN ≧ I) and an N-row K-column matrix B (K is an integer of 2 or more), using I arithmetic devices (I is an integer of 2 or more) each having a calculation unit, a recording unit, and a communication unit, wherein
The MN components a_mn (m is an integer from 1 to M, n is an integer from 1 to N) of the matrix A are divided in advance into I groups (from the first group to the I-th group), the method comprising:
A component recording step in which each arithmetic device records, in its recording unit, all the components a_mn of the i-th group (where i is the number of the arithmetic device and is an integer from 1 to I) and the components b_n1 to b_nK of the matrix B to be multiplied by those components;
A vector recording step in which each arithmetic device (1) calculates a_mn b_n1 to a_mn b_nK in its calculation unit for each component a_mn recorded in its recording unit, and (2) records the results in the recording unit as a K-dimensional vector c_mn = (c_mn1, c_mn2, ..., c_mnK) = (a_mn b_n1, a_mn b_n2, ..., a_mn b_nK);
An initial setting step in which each arithmetic device records τ = 0 in its recording unit;
A component transmission step in which each arithmetic device having the vector c_mp (m is 1 to M, p is ((n−1−τ) mod N) + 1) in its recording unit (1) extracts the n-th component of the vector c_mp from its recording unit, and (2) when that arithmetic device is different from the arithmetic device having the vector c_mq (q is p−1, where q = N when p = 1) in its recording unit, transmits the n-th component of the vector c_mp, using the communication unit, to the arithmetic device having the vector c_mq in its recording unit;
A component receiving step in which each arithmetic device having the vector c_mq in its recording unit, when it is different from the arithmetic device having the vector c_mp in its recording unit, receives the n-th component of the vector c_mp, using the communication unit, from the arithmetic device having the vector c_mp in its recording unit;
A calculation result recording step in which each arithmetic device having the vector c_mq in its recording unit (1) when τ ≦ N−2, extracts the n-th component of the vector c_mq from its recording unit, obtains in its calculation unit the sum of that component and the n-th component of the vector c_mp, and records the result in the recording unit as the n-th component of the vector c_mq, and (2) when τ > N−2, records the n-th component of the vector c_mp in the recording unit as the n-th component of the vector c_mq;
A τ increasing step in which each arithmetic device substitutes τ + 1 for τ and records it in the recording unit;
A repetition step in which each arithmetic device returns to the component transmission step when τ is 2N−3 or less;
wherein a vector c_mx (x is an arbitrary integer from 1 to N) is taken as the vector of the m-th row of the product AB.
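
The reduction of the matrix product to vector sums can be checked with a short sketch. It is illustrative only: the function name is hypothetical, the matrices are plain Python lists, and the summation over n is done with an ordinary sum rather than the ring procedure of the claim. It shows that the m-th row of AB equals the sum over n of the K-dimensional vectors c_mn = a_mn (b_n1, ..., b_nK).

```python
def product_by_vector_sums(A, B):
    """Compute AB row by row as sums of the vectors c_mn."""
    M, N, K = len(A), len(B), len(B[0])
    rows = []
    for m in range(M):
        # c[n] is the vector c_mn = a_mn * (b_n1, ..., b_nK)
        c = [[A[m][n] * B[n][k] for k in range(K)] for n in range(N)]
        # Row m of AB is the component-wise sum of c_m1 .. c_mN
        rows.append([sum(c[n][k] for n in range(N)) for k in range(K)])
    return rows
```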

A parallel operation method for obtaining a product A^T A B of the transposed matrix A^T of an M-row N-column matrix A (M is an integer of 2 or more, N is an integer of 2 or more, and MN ≧ I), the matrix A, and an N-row K-column matrix B (K is an integer of 2 or more), using I arithmetic devices (I is an integer of 2 or more) each having a calculation unit, a recording unit, and a communication unit, wherein
The MN components a_mn (m is an integer from 1 to M, n is an integer from 1 to N) of the matrix A are divided in advance into I groups (from the first group to the I-th group), the method comprising:
A component recording step in which each arithmetic device records, in its recording unit, all the components a_mn of the i-th group (where i is the number of the arithmetic device and is an integer from 1 to I) and the components b_n1 to b_nK of the matrix B to be multiplied by those components;
An A vector recording step in which each arithmetic device (1) calculates a_mn b_n1 to a_mn b_nK in its calculation unit for each component a_mn recorded in its recording unit, and (2) records the results in the recording unit as a K-dimensional vector c_mn = (c_mn1, c_mn2, ..., c_mnK) = (a_mn b_n1, a_mn b_n2, ..., a_mn b_nK);
An A initial setting step in which each arithmetic device records τ = 0 in its recording unit;
An A component transmission step in which each arithmetic device having the vector c_mp (m is 1 to M, p is ((n−1−τ) mod N) + 1) in its recording unit (1) extracts the n-th component of the vector c_mp from its recording unit, and (2) when that arithmetic device is different from the arithmetic device having the vector c_mq (q is p−1, where q = N when p = 1) in its recording unit, transmits the n-th component of the vector c_mp, using the communication unit, to the arithmetic device having the vector c_mq in its recording unit;
An A component receiving step in which each arithmetic device having the vector c_mq in its recording unit, when it is different from the arithmetic device having the vector c_mp in its recording unit, receives the n-th component of the vector c_mp, using the communication unit, from the arithmetic device having the vector c_mp in its recording unit;
An A calculation result recording step in which each arithmetic device having the vector c_mq in its recording unit (1) when τ ≦ N−2, extracts the n-th component of the vector c_mq from its recording unit, obtains in its calculation unit the sum of that component and the n-th component of the vector c_mp, and records the result in the recording unit as the n-th component of the vector c_mq, and (2) when τ > N−2, records the n-th component of the vector c_mp in the recording unit as the n-th component of the vector c_mq;
An A τ increasing step in which each arithmetic device substitutes τ + 1 for τ and records it in the recording unit;
An A repetition step in which each arithmetic device returns to the A component transmission step when τ is 2N−3 or less;
An A^T vector recording step in which each arithmetic device (1) calculates a_mn c_m1 to a_mn c_mK in its calculation unit for each component a_mn recorded in its recording unit, and (2) records the results in the recording unit as a K-dimensional vector b_nm = (b_nm1, b_nm2, ..., b_nmK) = (a_mn c_m1, a_mn c_m2, ..., a_mn c_mK);
An A^T initial setting step in which each arithmetic device records τ = 0 in its recording unit;
An A^T component transmission step in which each arithmetic device having the vector b_np (n is 1 to N, p is ((n−1−τ) mod M) + 1) in its recording unit (1) extracts the n-th component of the vector b_np from its recording unit, and (2) when that arithmetic device is different from the arithmetic device having the vector b_nq (q is p−1, where q = M when p = 1) in its recording unit, transmits the n-th component of the vector b_np, using the communication unit, to the arithmetic device having the vector b_nq in its recording unit;
An A^T component receiving step in which each arithmetic device having the vector b_nq in its recording unit, when it is different from the arithmetic device having the vector b_np in its recording unit, receives the n-th component of the vector b_np, using the communication unit, from the arithmetic device having the vector b_np in its recording unit;
An A^T calculation result recording step in which each arithmetic device having the vector b_nq in its recording unit (1) when τ ≦ N−2, extracts the n-th component of the vector b_nq from its recording unit, obtains in its calculation unit the sum of that component and the n-th component of the vector b_np, and records the result in the recording unit as the n-th component of the vector b_nq, and (2) when τ > N−2, records the n-th component of the vector b_np in the recording unit as the n-th component of the vector b_nq;
An A^T τ increasing step in which each arithmetic device substitutes τ + 1 for τ and records it in the recording unit;
An A^T repetition step in which each arithmetic device returns to the A^T component transmission step when τ is 2N−3 or less;
wherein a vector b_nx (x is an arbitrary integer from 1 to M) is taken as the vector of the n-th row of the product A^T A B.
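
The two phases can be checked algebraically with a small sketch. It is illustrative only: the function name is hypothetical and both summations use an ordinary sum in place of the ring procedure. The A phase forms the rows c_m of AB as sums over n of a_mn (b_n1, ..., b_nK); the A^T phase then forms the n-th row of A^T A B as the sum over m of b_nm = a_mn (c_m1, ..., c_mK).

```python
def ata_b(A, B):
    """Compute A^T A B in two phases, each a sum of scaled row vectors."""
    M, N, K = len(A), len(A[0]), len(B[0])
    # A phase: row m of AB is the sum over n of a_mn * (row n of B)
    C = [[sum(A[m][n] * B[n][k] for n in range(N)) for k in range(K)]
         for m in range(M)]
    # A^T phase: row n of A^T(AB) is the sum over m of a_mn * (row m of AB)
    return [[sum(A[m][n] * C[m][k] for m in range(M)) for k in range(K)]
            for n in range(N)]
```

The same components a_mn are used in both phases, so a device never needs a separately stored copy of A^T.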

The parallel operation method according to claim 7, wherein
The N-row K-column matrix whose n-th row is the vector b_nx obtained by the calculation of the product A^T A B is taken as a new matrix B, and the steps from the A vector recording step to the A^T repetition step are repeated.

An n-th arithmetic device (n is an integer from 1 to N) constituting a parallel operation system that calculates the sum of N N-dimensional vectors using N arithmetic devices (N is an integer of 2 or more) each having a calculation unit, a recording unit, and a communication unit, the arithmetic device comprising:
Vector recording means, in the recording unit, for recording one N-dimensional vector to be calculated;
τ recording means, in the recording unit, for recording the value of τ;
Transmitting means, in the communication unit, for extracting the (((n−1+τ) mod N) + 1)-th component and transmitting it to the (((n−2) mod N) + 1)-th arithmetic device;
Receiving means for receiving the (((n+τ) mod N) + 1)-th component from the ((n mod N) + 1)-th arithmetic device;
Calculation result recording means for (1) when τ ≦ N−2, extracting the (((n+τ) mod N) + 1)-th component from the vector recording means of the recording unit, obtaining the sum of that component and the received (((n+τ) mod N) + 1)-th component, and recording the result in the recording unit as the (((n+τ) mod N) + 1)-th component, and (2) when τ > N−2, recording the received (((n+τ) mod N) + 1)-th component in the recording unit as the (((n+τ) mod N) + 1)-th component;
τ increasing means for substituting τ + 1 for τ and recording it in the τ recording means of the recording unit; and
Repeating means for repeating the calculation when τ is 2N−3 or less.
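
Viewed from each device, the index arithmetic of this claim can be sketched as follows. The sketch is illustrative only: the function name and the `inbox` dictionary standing in for the communication units are hypothetical, and all N devices are simulated in one process. In every round each device first sends, then each device receives and records.

```python
def run_devices(vectors):
    """Simulate N devices, device n holding one N-dimensional vector."""
    N = len(vectors)
    local = {n: list(vectors[n - 1]) for n in range(1, N + 1)}
    for tau in range(2 * N - 2):               # tau = 0 .. 2N-3
        inbox = {}
        for n in range(1, N + 1):              # transmitting means
            j = (n - 1 + tau) % N + 1          # component sent by device n
            dest = (n - 2) % N + 1             # receiving device
            inbox[dest] = (j, local[n][j - 1])
        for n in range(1, N + 1):              # receiving and recording means
            j, value = inbox[n]
            assert j == (n + tau) % N + 1      # index stated for the receiver
            if tau <= N - 2:
                local[n][j - 1] += value       # accumulate
            else:
                local[n][j - 1] = value        # overwrite with the total
    return [local[n] for n in range(1, N + 1)]
```

Each device talks only to its two ring neighbours, sending one component and receiving one component per round.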

An arithmetic device constituting a parallel operation system for obtaining the sum of N K-dimensional vectors (K is an integer of 2 or more) using N arithmetic devices (N is an integer of 2 or more) each having a calculation unit, a recording unit, and a communication unit, wherein
The K components of each vector are divided in advance into N groups (from the first group to the N-th group), the arithmetic device comprising:
Vector recording means, in the recording unit, for recording one K-dimensional vector to be calculated;
τ recording means, in the recording unit, for recording the value of τ;
Transmitting means, in the communication unit, for, when the number of the arithmetic device is ((n−1−τ) mod N) + 1 for a certain n (where n is an integer from 1 to N), (1) extracting the vector components of the n-th group from the recording unit and (2) transmitting them to the (((n−2−τ) mod N) + 1)-th arithmetic device;
Receiving means, in the communication unit, for, when the number of the arithmetic device is ((n−2−τ) mod N) + 1 for a certain n, receiving the vector components of the n-th group from the (((n−1−τ) mod N) + 1)-th arithmetic device;
Calculation result recording means for, when the number of the arithmetic device is ((n−2−τ) mod N) + 1 for a certain n, (1) when τ ≦ N−2, extracting the vector components of the n-th group from the vector recording means of the recording unit, obtaining the sum of those components and the received vector components of the n-th group, and recording the result in the recording unit as the vector components of the n-th group, and (2) when τ > N−2, recording the received vector components of the n-th group in the recording unit as the vector components of the n-th group;
τ increasing means for substituting τ + 1 for τ and recording it in the τ recording means of the recording unit; and
Repeating means for repeating the calculation when τ is 2N−3 or less.
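
The grouped variant can be sketched in the same way. The claim does not say how the K components are divided into N groups; the contiguous, near-equal split below is an assumption made for illustration, and the function names are hypothetical.

```python
def split_groups(K, N):
    """Assumed grouping: N contiguous index ranges covering 0 .. K-1."""
    base, extra = divmod(K, N)
    ranges, start = [], 0
    for g in range(N):
        size = base + (1 if g < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def grouped_ring_sum(vectors):
    """Simulate N devices, each holding one K-dimensional vector."""
    N, K = len(vectors), len(vectors[0])
    groups = split_groups(K, N)
    local = [list(v) for v in vectors]          # local[d-1]: vector on device d
    for tau in range(2 * N - 2):                # tau = 0 .. 2N-3
        for n in range(1, N + 1):               # n is the group number
            src = (n - 1 - tau) % N + 1         # device sending group n
            dst = (n - 2 - tau) % N + 1         # device receiving group n
            lo, hi = groups[n - 1]
            block = local[src - 1][lo:hi]
            if tau <= N - 2:
                for k, x in zip(range(lo, hi), block):
                    local[dst - 1][k] += x      # accumulate the group
            else:
                local[dst - 1][lo:hi] = block   # store the finished group
    return local
```

In each round a device handles one group for sending and a different group for receiving, so a round moves roughly K/N components per device.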

An arithmetic device constituting a parallel operation system for calculating the sum of N (N is an integer of 2 or more) K-dimensional vectors (K is an integer of 2 or more) using a plurality of arithmetic devices each having a calculation unit, a recording unit, and a communication unit, the arithmetic device comprising:
Vector recording means, in the recording unit, for recording one or more of the K-dimensional vectors c_n (n is an integer from 1 to N) to be calculated;
τ recording means, in the recording unit, for recording the value of τ;
Transmitting means, in the communication unit, for, when the recording unit has the vector c_p (p is ((n−1−τ) mod N) + 1) for a certain n, (1) extracting the n-th component of the vector c_p from the vector recording means of the recording unit, and (2) when the arithmetic device is different from the arithmetic device having the vector c_q (q is p−1, where q = N when p = 1) in its recording unit, transmitting that component to the arithmetic device having the vector c_q in its recording unit;
Receiving means, in the communication unit, for, when the recording unit has the vector c_q for a certain n and the arithmetic device is different from the arithmetic device having the vector c_p in its recording unit, receiving the n-th component of the vector c_p from the arithmetic device having the vector c_p in its recording unit;
Calculation result recording means, in the calculation unit, for, when the recording unit has the vector c_q for a certain n, (1) when τ ≦ N−2, extracting the n-th component of the vector c_q from the vector recording means of the recording unit, obtaining the sum of that component and the n-th component of the vector c_p, and recording the result in the vector recording means of the recording unit as the n-th component of the vector c_q, and (2) when τ > N−2, recording the n-th component of the vector c_p in the vector recording means of the recording unit as the n-th component of the vector c_q;
τ increasing means for substituting τ + 1 for τ and recording it in the τ recording means of the recording unit; and
Repeating means for repeating the calculation when τ is 2N−3 or less.

An arithmetic device constituting a parallel operation system for calculating the sum of N (N is an integer of 2 or more) K-dimensional vectors (K is an integer of 2 or more) using a plurality of arithmetic devices each having a calculation unit, a recording unit, and a communication unit, the arithmetic device comprising:
Vector recording means, in the recording unit, for recording one or more of the K-dimensional vectors c_n (n is an integer from 1 to N) to be calculated;
τ recording means, in the recording unit, for recording the value of τ;
Transmitting means, in the communication unit, for, when the recording unit has the vector c_p (p is ((n−1−τ) mod N) + 1) for a certain n, (1) extracting the n-th component of the vector c_p from the vector recording means of the recording unit, (2) when τ ≦ N−2 and the arithmetic device is different from the arithmetic device having the vector c_q (q is p−1, where q = N when p = 1) in its recording unit, transmitting that component to the arithmetic device having the vector c_q in its recording unit, and (3) when τ > N−2 and the arithmetic device is different from the arithmetic device having the vector c_r (r is p+1, where r = 1 when p = N) in its recording unit, transmitting that component to the arithmetic device having the vector c_r in its recording unit;
Receiving means, in the communication unit, for (1) when τ ≦ N−2, when the recording unit has the vector c_q for a certain n and the arithmetic device is different from the arithmetic device having the vector c_p in its recording unit, receiving the n-th component of the vector c_p from the arithmetic device having the vector c_p in its recording unit, and (2) when τ > N−2, when the recording unit has the vector c_r for a certain n and the arithmetic device is different from the arithmetic device having the vector c_p in its recording unit, receiving the n-th component of the vector c_p from the arithmetic device having the vector c_p in its recording unit;
Calculation result recording means, in the calculation unit, for (1) when τ ≦ N−2, when the recording unit has the vector c_q for a certain n, extracting the n-th component of the vector c_q from the vector recording means of the recording unit, obtaining the sum of that component and the n-th component of the vector c_p, and recording the result in the vector recording means of the recording unit as the n-th component of the vector c_q, and (2) when τ > N−2, when the recording unit has the vector c_r for a certain n, recording the n-th component of the vector c_p in the vector recording means of the recording unit as the n-th component of the vector c_r;
τ increasing means for substituting τ + 1 for τ and recording it in the τ recording means of the recording unit; and
Repeating means for repeating the calculation when τ is 2N−3 or less.

An arithmetic device constituting a parallel operation system for calculating the sum of N (N is an integer of 2 or more) K-dimensional vectors (K is an integer of 2 or more) using a plurality of arithmetic devices each having a calculation unit, a recording unit, and a communication unit, the arithmetic device comprising:
Vector recording means, in the recording unit, for recording one or more of the K-dimensional vectors c_n (n is an integer from 1 to N) to be calculated;
τ recording means, in the recording unit, for recording the value of τ;
Transmitting means, in the communication unit, for, when the recording unit has the vector c_p (p is ((n−1−τ) mod N) + 1) for a certain n, (1) extracting the n-th component of the vector c_p from the vector recording means of the recording unit, and (2) at least when τ ≦ N−2, when the arithmetic device is different from the arithmetic device having the vector c_q (q is p−1, where q = N when p = 1) in its recording unit, transmitting that component to the arithmetic device having the vector c_q in its recording unit;
Receiving means, in the communication unit, for, at least when τ ≦ N−2, when the recording unit has the vector c_q for a certain n and the arithmetic device is different from the arithmetic device having the vector c_p in its recording unit, receiving the n-th component of the vector c_p from the arithmetic device having the vector c_p in its recording unit;
Calculation result recording means, in the calculation unit, for, at least when τ ≦ N−2, when the recording unit has the vector c_q for a certain n, extracting the n-th component of the vector c_q from the vector recording means of the recording unit, obtaining the sum of that component and the n-th component of the vector c_p, and recording the result in the vector recording means of the recording unit as the n-th component of the vector c_q;
τ increasing means for substituting τ + 1 for τ and recording it in the τ recording means of the recording unit; and
Repeating means for repeating the calculation at least when τ is N−2 or less.

An arithmetic device constituting a parallel operation system for obtaining a product AB of an M-row N-column matrix A (M is an integer of 2 or more, N is an integer of 2 or more, and MN ≧ I) and an N-row K-column matrix B (K is an integer of 2 or more), using I arithmetic devices (I is an integer of 2 or more) each having a calculation unit, a recording unit, and a communication unit, the arithmetic device comprising:
Component recording means, in the recording unit, for recording all the components a_mn of the matrix A to be calculated and the components b_n1 to b_nK of the matrix B to be multiplied by those components;
Component calculation means, in the calculation unit, for calculating a_mn b_n1 to a_mn b_nK for each of the components a_mn recorded in the component recording means of the recording unit;
Vector recording means, in the recording unit, for recording the results as a K-dimensional vector c_mn = (c_mn1, c_mn2, ..., c_mnK) = (a_mn b_n1, a_mn b_n2, ..., a_mn b_nK);
τ recording means, in the recording unit, for recording τ = 0;
Transmitting means, in the communication unit, for, when the recording unit has the vector c_mp (m is 1 to M, p is ((n−1−τ) mod N) + 1) for a certain n, (1) extracting the n-th component of the vector c_mp from the vector recording means of the recording unit, and (2) when the arithmetic device having the vector c_mq (q is p−1, where q = N when p = 1) in its recording unit is different from the arithmetic device, transmitting the n-th component of the vector c_mp to the arithmetic device having the vector c_mq in its recording unit;
Receiving means, in the communication unit, for, when the recording unit has the vector c_mq for a certain n and the arithmetic device is different from the arithmetic device having the vector c_mp in its recording unit, receiving the n-th component of the vector c_mp from the arithmetic device having the vector c_mp in its recording unit;
Calculation result recording means, in the calculation unit, for, when the recording unit has the vector c_mq for a certain n, (1) when τ ≦ N−2, extracting the n-th component of the vector c_mq from the vector recording means of the recording unit, obtaining the sum of that component and the n-th component of the vector c_mp, and recording the result in the vector recording means of the recording unit as the n-th component of the vector c_mq, and (2) when τ > N−2, recording the n-th component of the vector c_mp in the vector recording means of the recording unit as the n-th component of the vector c_mq;
τ increasing means for substituting τ + 1 for τ and recording it in the τ recording means of the recording unit; and
Repeating means for repeating the calculation when τ is 2N−3 or less.

An arithmetic device constituting a parallel operation system for obtaining a product A^T A B of the transposed matrix A^T of an M-row N-column matrix A (M is an integer of 2 or more, N is an integer of 2 or more, and MN ≧ I), the matrix A, and an N-row K-column matrix B (K is an integer of 2 or more), using I arithmetic devices (I is an integer of 2 or more) each having a calculation unit, a recording unit, and a communication unit, the arithmetic device comprising:
Component recording means, in the recording unit, for recording all the components a_mn of the matrix A to be calculated and the components b_n1 to b_nK of the matrix B to be multiplied by those components;
An A component calculation means, in the calculation unit, for calculating a_mn b_n1 to a_mn b_nK for each of the components a_mn recorded in the recording unit;
An A vector recording means, in the recording unit, for recording the results as a K-dimensional vector c_mn = (c_mn1, c_mn2, ..., c_mnK) = (a_mn b_n1, a_mn b_n2, ..., a_mn b_nK);
An A τ recording means, in the recording unit, for recording τ = 0;
An A transmitting means, in the communication unit, for, when the recording unit has the vector c_mp (m is 1 to M, p is ((n−1−τ) mod N) + 1) for a certain n, (1) extracting the n-th component of the vector c_mp from the A vector recording means of the recording unit, and (2) when the arithmetic device having the vector c_mq (q is p−1, where q = N when p = 1) in its recording unit is different from the arithmetic device, transmitting the n-th component of the vector c_mp to the arithmetic device having the vector c_mq in its recording unit;
An A receiving means, in the communication unit, for, when the recording unit has the vector c_mq for a certain n and the arithmetic device is different from the arithmetic device having the vector c_mp in its recording unit, receiving the n-th component of the vector c_mp from the arithmetic device having the vector c_mp in its recording unit;
An A calculation result recording means, in the calculation unit, for, when the recording unit has the vector c_mq for a certain n, (1) when τ ≦ N−2, extracting the n-th component of the vector c_mq from the A vector recording means of the recording unit, obtaining the sum of that component and the n-th component of the vector c_mp, and recording the result in the A vector recording means of the recording unit as the n-th component of the vector c_mq, and (2) when τ > N−2, recording the n-th component of the vector c_mp in the A vector recording means of the recording unit as the n-th component of the vector c_mq;
An A τ increasing means for substituting τ + 1 for τ and recording it in the A τ recording means of the recording unit;
An A repeating means for repeating the operation when τ is 2N−3 or less;
An A^T component calculation means, in the calculation unit, for calculating a_mn c_m1 to a_mn c_mK for each of the components a_mn recorded in the recording unit;
An A^T vector recording means, in the recording unit, for recording the results as a K-dimensional vector b_nm = (b_nm1, b_nm2, ..., b_nmK) = (a_mn c_m1, a_mn c_m2, ..., a_mn c_mK);
An A^T τ recording means, in the recording unit, for recording τ = 0;
An A^T transmitting means, in the communication unit, for, when the recording unit has the vector b_np (n is 1 to N, p is ((n−1−τ) mod M) + 1) for a certain n, (1) extracting the n-th component of the vector b_np from the A^T vector recording means of the recording unit, and (2) when the arithmetic device having the vector b_nq (q is p−1, where q = M when p = 1) in its recording unit is different from the arithmetic device, transmitting the n-th component of the vector b_np to the arithmetic device having the vector b_nq in its recording unit;
An A^T receiving means, in the communication unit, for, when the recording unit has the vector b_nq for a certain n and the arithmetic device is different from the arithmetic device having the vector b_np in its recording unit, receiving the n-th component of the vector b_np from the arithmetic device having the vector b_np in its recording unit;
An A^T calculation result recording means for, when the recording unit has the vector b_nq for a certain n, (1) when τ ≦ N−2, extracting the n-th component of the vector b_nq from the A^T vector recording means of the recording unit, obtaining the sum of that component and the n-th component of the vector b_np, and recording the result in the recording unit as the n-th component of the vector b_nq, and (2) when τ > N−2, recording the n-th component of the vector b_np in the A^T vector recording means of the recording unit as the n-th component of the vector b_nq;
An A^T τ increasing means for substituting τ + 1 for τ and recording it in the A^T τ recording means of the recording unit; and
An A^T repeating means for repeating the operation when τ is 2N−3 or less.

The arithmetic device according to claim 15, further comprising
A^T A B calculation repeating means for taking the N-row K-column matrix whose n-th row is the vector b_nx obtained by the calculation of the product A^T A B as a new matrix B, and repeating the processing from the A vector recording step to the A^T repetition step.

A computer program for causing a computer to execute each step of the parallel operation method according to any one of claims 1 to 8.