JP2023047899A - Data analysis program, data analysis method, and information processor

Info

Publication number: JP2023047899A
Application number: JP2021157081A
Authority: JP
Inventors: 勇太久留米; Yuta Kurume
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2021-09-27
Filing date: 2021-09-27
Publication date: 2023-04-06

Abstract

To provide an information processor capable of increasing the speed of matrix calculations. SOLUTION: An information processor 10 identifies, from among the unprocessed blocks of the grid-like blocks included in matrix data 13, blocks that have no unprocessed block adjacent in a first direction and no unprocessed block adjacent in a second direction orthogonal to the first direction as processable blocks 13a, 13b, 13c whose processing can start. The information processor 10 allocates idle threads among threads 15a and 15b, which are executed in parallel, to the processable blocks 13a, 13b, 13c, giving priority to those with the larger calculation costs 14a, 14b, 14c. SELECTED DRAWING: Figure 1

Description

The present invention relates to a data analysis program, a data analysis method, and an information processing apparatus.

A computer may process matrix data in parallel using multiple threads. When a matrix computation is parallelized, the matrix data may be divided into a grid of blocks, and different threads may process different blocks in parallel. Depending on the computation, however, the computer cannot necessarily process the blocks in an arbitrary order and may be subject to a constraint on the processing order. One such constraint is that processing of a block does not start while an unprocessed adjacent block exists in at least one of two specific orthogonal directions (for example, downward and leftward).

For example, a framework has been proposed that executes the matrix reduction of persistent homology in parallel using multiple threads. The proposed framework executes a sweep-out method that eliminates, as far as possible, the lower nonzero elements with large row numbers from an upper triangular matrix by adding to a column of the matrix another column located to its left. The proposed framework first processes the blocks on the diagonal and, once all diagonal blocks have been processed, processes the blocks adjacent to the diagonal blocks on the upper right. In this way, the proposed framework processes blocks in order from the diagonal toward the upper right.

An information processing apparatus has also been proposed that divides a matrix into a plurality of submatrices, each having M rows and M columns, and stores the elements that have the same relative position in those submatrices in a contiguous storage area. Another proposed information processing apparatus left-aligns the nonzero elements of a sparse matrix, stores the left-half submatrix in JDS (Jagged Diagonal Storage) format, and stores the right-half submatrix in CRS (Compressed Row Storage) format.

JP 2016-66329 A
WO 2017/154946

Simon Zhang, Mengbai Xiao, Chengxin Guo, Liang Geng, Hao Wang and Xiaodong Zhang, "HYPHA: A Framework based on Separation of Parallelisms to Accelerate Persistent Homology Matrix Reduction", Proc. of the 33rd ACM International Conference on Supercomputing (ICS 2019), pp. 69-81, June 2019

However, depending on the matrix computation, the computer cannot necessarily process all the blocks of the matrix data in the same computation time. When there is a bottleneck block with a long computation time, the number of blocks whose processing can start may gradually decrease under the processing order constraint. As a result, the degree of parallelism drops and the matrix computation may slow down. In one aspect, therefore, an object of the present invention is to speed up matrix computation.

In one aspect, a data analysis program is provided that causes a computer to execute the following process. From among the unprocessed blocks of a plurality of grid-like blocks included in matrix data, the computer identifies, as a processable block whose processing can start, a block to which no other unprocessed block is adjacent in a first direction and to which no other unprocessed block is adjacent in a second direction orthogonal to the first direction. When there are a plurality of processable blocks, the computer allocates idle threads among a plurality of threads executed in parallel to the processable blocks, giving priority to the processable block with the larger computational cost, where the computational cost is based on the computational load of each processable block.

In one aspect, a data analysis method executed by a computer is also provided. In one aspect, an information processing apparatus having a storage unit and a processing unit is also provided.

In one aspect, matrix computation is sped up.

FIG. 1 is a diagram for explaining the information processing apparatus of the first embodiment.
FIG. 2 is a block diagram showing a hardware example of the information processing apparatus.
FIG. 3 is a diagram showing an example of a matrix used for persistent homology.
FIG. 4 is a diagram showing an execution example of the sweep-out method on a matrix.
FIG. 5 is a diagram showing an example of a column management table.
FIG. 6 is a diagram showing an example of parallelizing the sweep-out method with a plurality of threads.
FIG. 7 is a diagram showing a procedure example of the sweep-out method of the second embodiment.
FIG. 8 is a diagram (continued 1) showing a procedure example of the sweep-out method of the second embodiment.
FIG. 9 is a diagram (continued 2) showing a procedure example of the sweep-out method of the second embodiment.
FIG. 10 is a diagram (continued 3) showing a procedure example of the sweep-out method of the second embodiment.
FIG. 11 is a diagram showing an example of calculating the computational cost of each block.
FIG. 12 is a diagram showing an example of the influence of block priority on parallel processing.
FIG. 13 is a block diagram showing a software example of the information processing apparatus.
FIG. 14 is a flowchart showing a procedure example of thread initialization.
FIG. 15 is a flowchart showing a procedure example of thread processing of the second embodiment.
FIG. 16 is a diagram showing an example of the influence of the computational costs of surrounding blocks on parallel processing.
FIG. 17 is a diagram showing an example of correcting a computational cost with reference to surrounding blocks.
FIG. 18 is a flowchart showing a procedure example of thread processing of the third embodiment.

Hereinafter, the embodiments will be described with reference to the drawings.
[First embodiment]
A first embodiment will be described.

FIG. 1 is a diagram for explaining the information processing apparatus of the first embodiment.
The information processing apparatus 10 of the first embodiment processes matrix data in parallel using a plurality of threads. The information processing apparatus 10 may be a client apparatus or a server apparatus. The information processing apparatus 10 may be called a computer or a data analysis apparatus.

The information processing apparatus 10 has a storage unit 11 and a processing unit 12. The storage unit 11 may be a volatile semiconductor memory such as a RAM (Random Access Memory), or non-volatile storage such as an HDD (Hard Disk Drive) or flash memory. The processing unit 12 is, for example, a processor such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or a DSP (Digital Signal Processor). The processing unit 12 may, however, include an application-specific electronic circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The processor executes, for example, a program stored in a memory such as a RAM (which may be the storage unit 11). A set of processors may be called a multiprocessor or simply a "processor".

The storage unit 11 stores matrix data 13. The matrix data 13 may represent a square matrix, and may represent an upper triangular matrix that contains no nonzero elements below the diagonal. The matrix data 13 may also represent a boundary matrix used for persistent homology. A boundary matrix is a square matrix whose rows and columns correspond to topological simplices such as points, line segments, triangles, and tetrahedra. The boundary matrix expresses relationships between simplices: a line segment has two points as its boundary, a triangle has three line segments as its boundary, and a tetrahedron has four triangles as its boundary.

For parallel processing, the matrix data 13 is divided into a grid of blocks. The blocks may all have the same size, and each may contain a specific number of rows and a specific number of columns. When the matrix data 13 represents an upper triangular matrix, the blocks on the diagonal and the blocks above the diagonal may together form a triangle.

The processing unit 12 executes a plurality of threads, including threads 15a and 15b, in parallel. A thread may also be called a process. The threads may be executed by different cores of the processing unit 12. However, a thread may be executed by a processor of the information processing apparatus 10 other than the processing unit 12, or by an information processing apparatus other than the information processing apparatus 10. When the threads include an idle thread that is not processing any block, the processing unit 12 allocates the idle thread to one of the unprocessed blocks included in the matrix data 13 and causes that thread to process the block.

Here, from among the unprocessed blocks included in the matrix data 13, the processing unit 12 identifies blocks that satisfy an order condition as processable blocks whose processing can start. The order condition is that no unprocessed block is adjacent in a first direction and no unprocessed block is adjacent in a second direction orthogonal to the first direction. For example, the first direction is the downward direction, in which the row number increases, and the second direction is the leftward direction, in which the column number decreases. In the initial state, in which all blocks are unprocessed, the processing unit 12 may identify the blocks on the diagonal as processable blocks.
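The order condition can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `processable_blocks`, the `(row, column)` block indexing, and the `done` and `started` sets are all introduced here for illustration. The grid is assumed to be upper triangular, with the first direction pointing down and the second pointing left, and a neighbor that falls outside the triangle is treated as absent.

```python
def processable_blocks(n_blocks, done, started):
    """Return the blocks (r, c), r <= c, that satisfy the order condition.

    done    -- set of blocks whose processing has finished (assumed bookkeeping)
    started -- set of blocks already handed to a thread, including finished ones
    """
    def exists(block):
        r, c = block
        return 0 <= r <= c < n_blocks  # inside the upper triangle

    result = []
    for r in range(n_blocks):
        for c in range(r, n_blocks):
            if (r, c) in started:
                continue
            neighbors = [(r + 1, c), (r, c - 1)]  # below, left
            if all(nb in done for nb in neighbors if exists(nb)):
                result.append((r, c))
    return result


# Initial state: only the diagonal blocks can start.
initial = processable_blocks(3, done=set(), started=set())

# After two diagonal blocks finish, an off-diagonal block becomes
# processable alongside the remaining diagonal block.
finished = {(0, 0), (1, 1)}
later = processable_blocks(3, done=finished, started=finished)
```

With a 3-by-3 block grid, `initial` holds the three diagonal blocks, and `later` holds block (0, 1) together with the still-unprocessed diagonal block (2, 2).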

The set of processable blocks changes as processing by the threads progresses. For a given block, when processing of the block adjacent in the first direction has finished and processing of the block adjacent in the second direction has finished, the processing unit 12 may identify the given block as a processable block. As an example, the processing unit 12 identifies processable blocks 13a, 13b, and 13c from among the blocks included in the matrix data 13.

When there are a plurality of processable blocks, the processing unit 12 allocates idle threads to them in descending order of computational cost. A computational cost is calculated for each processable block and is based on the computational load of that block. The computational cost can also be said to indicate the load imposed on a thread. For example, the processing unit 12 calculates the computational cost 14a of the processable block 13a, the computational cost 14b of the processable block 13b, and the computational cost 14c of the processable block 13c. If the computational cost 14b is the largest, the processing unit 12 preferentially allocates an idle thread to the processable block 13b.
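The cost-first allocation can be illustrated with a small priority queue. The sketch below is only one way to realize it: the function name `assign_idle_threads` and the numeric cost values are hypothetical, and the labels merely reuse the reference numerals from the text.

```python
import heapq


def assign_idle_threads(costs, idle_threads):
    """Pair idle threads with processable blocks, largest cost first.

    costs        -- dict mapping a processable block to its computational cost
    idle_threads -- list of threads that are not processing any block
    """
    # heapq is a min-heap, so costs are negated to pop the largest first.
    heap = [(-cost, block) for block, cost in costs.items()]
    heapq.heapify(heap)
    assignment = {}
    for thread in idle_threads:
        if not heap:
            break  # more idle threads than processable blocks
        _, block = heapq.heappop(heap)
        assignment[thread] = block
    return assignment


# Hypothetical costs for blocks 13a, 13b, 13c; 13b is the most expensive.
example_costs = {"13a": 4, "13b": 9, "13c": 6}
example_assignment = assign_idle_threads(example_costs, ["15a", "15b"])
```

Here thread 15a receives block 13b and thread 15b receives block 13c, while block 13a waits for the next idle thread.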

The computational cost can change as processing by the threads progresses. The computational cost of a block may be calculated based on the number of elements in the block whose processing can be omitted, or based on the number of columns in the block whose processing can be omitted. Whether processing of a column can be omitted may be determined based on the processing results of other blocks located in the first direction as seen from the block.
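As a rough illustration, a cost could be taken as the number of columns in a block that still need processing. The text only states that the cost may be based on the number of omittable columns, so this exact formula, the function name `block_cost`, and the `skippable` set are assumptions made for the sketch.

```python
def block_cost(block_columns, skippable):
    """Count the columns of a block that cannot be skipped.

    block_columns -- iterable of global column indices covered by the block
    skippable     -- set of column indices whose processing can be omitted
    """
    return sum(1 for j in block_columns if j not in skippable)


# A block covering columns 4..7 in which columns 5 and 7 are already settled.
example_cost = block_cost(range(4, 8), skippable={5, 7})
```

Because the set of skippable columns grows as other blocks finish, recomputing this value at allocation time reflects the statement that the cost can change as processing progresses.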

When the thread 15a is allocated to a block, the thread 15a processes that block. The thread 15a may update the block using other blocks located in the second direction. The process executed by the thread 15a may be a sweep-out method that eliminates the lowermost nonzero element of a column by adding to that column another column located to its left. Repeating such additions minimizes the row number of the lowermost nonzero element. When the lowermost nonzero element that will not be eliminated has been determined for a column, the thread 15a may determine that processing of that column can be omitted from then on.
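The column-addition step can be sketched as a sequential, unblocked reduction. The sketch assumes 0/1 entries with addition modulo 2, which is the usual convention for persistent homology boundary matrices but is not stated in this passage; the function name `reduce_columns` and the set-based column representation are likewise illustrative choices.

```python
def reduce_columns(columns):
    """Reduce a 0/1 matrix given as a list of columns.

    Each column is a set of row indices that hold a 1. Adding two columns
    modulo 2 is the symmetric difference of their sets (assumed arithmetic).
    """
    reduced = [set(col) for col in columns]
    pivot_owner = {}  # lowermost nonzero row -> index of the column owning it
    for j, col in enumerate(reduced):
        # Keep adding an earlier column that shares the same lowermost row.
        while col and max(col) in pivot_owner:
            col ^= reduced[pivot_owner[max(col)]]
        if col:
            pivot_owner[max(col)] = j  # lowermost element is now settled
    return reduced


# Small 7-column example: columns 3, 4, 5 and 6 are nonzero.
example_columns = [set(), set(), set(), {0, 2}, {1, 2}, {0, 1}, {3, 4, 5}]
example_reduced = reduce_columns(example_columns)
```

In this example column 4 loses its entry in row 2 after column 3 is added to it, and column 5 then becomes empty after the updated column 4 is added to it.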

Note that, in the initial state in which all blocks are unprocessed, some columns, such as columns containing no nonzero elements, may be determined to be omittable. In addition, owing to the symmetry of the matrix data 13, whether column i can be omitted may be determined based on the processing result for row i. When all blocks included in the matrix data 13 have been processed, the processing unit 12 stops the threads, including the threads 15a and 15b. The processing unit 12 outputs the processing result of the matrix data 13. The processing unit 12 may display the processing result on a display device connected to the information processing apparatus 10, store it in non-volatile storage, or transmit it to another information processing apparatus.

As described above, the information processing apparatus 10 of the first embodiment identifies, from among the unprocessed blocks of the grid-like blocks included in the matrix data 13, processable blocks whose processing can start. A processable block is a block that satisfies the order condition that no other unprocessed block is adjacent in the first direction and no other unprocessed block is adjacent in the second direction orthogonal to the first direction. The information processing apparatus 10 allocates idle threads among the threads executed in parallel to the processable blocks, giving priority to the processable block with the larger computational cost, where the cost is based on the computational load of each processable block.

As a result, processing of the matrix data 13 is parallelized despite the processing order constraint, and the matrix computation is sped up. Moreover, even when there is a bottleneck block with a long computation time, processing of that block starts preferentially, which suppresses the situation in which the number of processable blocks shrinks as processing progresses. This in turn suppresses the situation in which the number of processable blocks falls below the number of threads and the degree of parallelism drops, so the threads are put to good use and the matrix computation is sped up.

Note that the information processing apparatus 10 may identify the blocks on the diagonal as processable blocks in the initial state and, after processing of some of the diagonal blocks has finished, may process other diagonal blocks and off-diagonal blocks in parallel. Compared with starting on other blocks only after all diagonal blocks have been processed, this lets the information processing apparatus 10 keep the threads busy and maintain the degree of parallelism.

The information processing apparatus 10 may also calculate the computational cost based on the processing result of the block adjacent in the first direction. In that case, the information processing apparatus 10 may determine which columns can be omitted based on the processing result of the block adjacent in the first direction, and may calculate the computational cost based on the number of omittable columns. This allows the information processing apparatus 10 to accurately estimate a computational cost that is proportional to the computation time of a thread.

The information processing apparatus 10 may also correct the computational cost of a processable block based on the computational costs of unprocessed blocks within a specific range of that processable block. When the computational cost of an unprocessed block adjacent in a third direction, opposite to the second direction, is larger than the computational cost of the processable block, the information processing apparatus 10 may increase the computational cost of the processable block. As a result, even when an unprocessed block with a large computational cost is hidden behind a processable block with a small computational cost, processing of the expensive block starts sooner and the drop in the degree of parallelism is suppressed.
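The correction can be illustrated as follows. The passage does not say by how much the cost is increased, so raising it to the neighbor's cost is an assumption of this sketch, as are the function name `corrected_cost`, the `(row, column)` indexing, and the choice of the rightward direction as the third direction (the opposite of a leftward second direction).

```python
def corrected_cost(block, costs, unprocessed):
    """Return the cost of a processable block after looking at its right neighbor.

    block       -- (row, column) index of the processable block
    costs       -- dict mapping block index to its estimated cost
    unprocessed -- set of blocks that have not been processed yet
    """
    r, c = block
    right = (r, c + 1)  # neighbor in the assumed third direction
    own = costs[block]
    if right in unprocessed and costs.get(right, 0) > own:
        return costs[right]  # assumed rule: inherit the larger cost
    return own


# Block (0, 1) is cheap, but the expensive block (0, 2) waits behind it.
example_costs = {(0, 1): 2, (0, 2): 10, (1, 2): 5}
raised = corrected_cost((0, 1), example_costs, unprocessed={(0, 1), (0, 2)})
unchanged = corrected_cost((1, 2), example_costs, unprocessed={(1, 2)})
```

With these hypothetical values, block (0, 1) is scheduled as if it cost 10, so it is no longer passed over in favor of block (1, 2), and block (0, 2) can be unblocked earlier.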

[Second embodiment]
Next, a second embodiment will be described.
The information processing apparatus 100 of the second embodiment executes the matrix computation of persistent homology in parallel using a plurality of threads. The information processing apparatus 100 may be a client apparatus or a server apparatus. The information processing apparatus 100 may be called a computer, a parallel processing apparatus, a data analysis apparatus, or a matrix operation apparatus.

FIG. 2 is a block diagram showing a hardware example of the information processing apparatus.
The information processing apparatus 100 has a CPU 101, a RAM 102, an HDD 103, a GPU 104, an input interface 105, a medium reader 106, and a communication interface 107, all connected to a bus. The CPU 101 corresponds to the processing unit 12 of the first embodiment. The RAM 102 or the HDD 103 corresponds to the storage unit 11 of the first embodiment.

The CPU 101 is a processor that executes program instructions. The CPU 101 loads at least part of the programs and data stored in the HDD 103 into the RAM 102 and executes the programs. The CPU 101 has a plurality of cores, such as cores 101a, 101b, 101c, and 101d. The cores can execute a plurality of threads in parallel. The information processing apparatus 100 may have a plurality of CPUs. A set of CPUs or a set of CPU cores may be called a multiprocessor or simply a processor.

The RAM 102 is a volatile semiconductor memory that temporarily stores programs executed by the CPU 101 and data used by the CPU 101 for computation. The information processing apparatus 100 may have a type of volatile memory other than RAM.

The HDD 103 is non-volatile storage that stores data and software programs such as an OS (Operating System), middleware, and application software. The information processing apparatus 100 may have other types of non-volatile storage, such as flash memory or an SSD (Solid State Drive).

The GPU 104 is a processor that executes program instructions in response to instructions from the CPU 101. The GPU 104 generates images in cooperation with the CPU 101 and outputs them to a display device 111 connected to the information processing apparatus 100. The display device 111 is, for example, a CRT (Cathode Ray Tube) display, a liquid crystal display, an organic EL (Electro Luminescence) display, or a projector. Other types of output devices, such as a printer, may be connected to the information processing apparatus 100.

The GPU 104 also has a plurality of cores, such as cores 104a, 104b, 104c, and 104d. The cores can execute a plurality of threads in parallel. The GPU 104 may have GPU memory separate from the RAM 102. The GPU memory is a volatile semiconductor memory that temporarily stores programs and data. The information processing apparatus 100 may have a plurality of GPUs. A set of GPUs or a set of GPU cores may be called a multiprocessor or simply a processor. The threads of the second embodiment, described later, are executed by the cores of the CPU 101 or the cores of the GPU 104.

The input interface 105 receives input signals from an input device 112 connected to the information processing apparatus 100. The input device 112 is, for example, a mouse, a touch panel, or a keyboard. A plurality of input devices may be connected to the information processing apparatus 100.

The medium reader 106 is a reading device that reads programs and data recorded on a recording medium 113. The recording medium 113 is, for example, a magnetic disk, an optical disc, or a semiconductor memory. Magnetic disks include flexible disks (FDs) and HDDs. Optical discs include CDs (Compact Discs) and DVDs (Digital Versatile Discs). The medium reader 106 copies the programs and data read from the recording medium 113 to another recording medium such as the RAM 102 or the HDD 103. The read programs may be executed by the CPU 101.

The recording medium 113 may be a portable recording medium. The recording medium 113 may be used to distribute programs and data. The recording medium 113 and the HDD 103 may also be called computer-readable recording media.

The communication interface 107 is connected to a network 114 and communicates with other information processing apparatuses via the network 114. The communication interface 107 may be a wired communication interface connected to a wired communication device such as a switch or a router, or a wireless communication interface connected to a wireless communication device such as a base station or an access point.

Next, persistent homology will be described.
FIG. 3 is a diagram showing an example of a matrix used for persistent homology.
Persistent homology is a technique from topology that expresses the features of an object's shape quantitatively. Persistent homology is used in technical fields such as materials science, life science, and social science, and may be used as a method for analyzing so-called big data.

Persistent homology represents the shape of an object as a set of points and quantitatively analyzes the unevenness of the point distribution. It regards each point as a particle of radius r and detects the appearance and disappearance of closed regions as the radius r is varied. While the radius r is small, the points do not touch one another, so no closed region surrounded by points (sometimes called a hole) exists. As the radius r grows, nearby points come into contact and a hole may appear. As the radius r grows further, distant points also come into contact, so the hole is filled in and disappears.

Holes of different sizes may appear at different radii r. The radius r at which a hole appears and the radius r at which it disappears depend on the point distribution. The timing of the appearance and disappearance of holes is analyzed algebraically through the reduction of the boundary matrix described below.

The shape of an object is represented as a simplicial complex, in which various simplices are joined together. Simplices include the point, which is a 0-dimensional simplex; the line segment, which is a 1-dimensional simplex; the triangle, which is a 2-dimensional simplex; and the tetrahedron, which is a 3-dimensional simplex. A simplex of a given dimension may have lower-dimensional simplices as its boundary. A line segment may have its two endpoints as its boundary. A triangle may have three line segments as its boundary. A tetrahedron may have four triangles as its boundary.

一例として、三角形状の物体が単体３０，３１，３２，３３，３４，３５，３６を含む。単体３０，３１，３２は点である。単体３３，３４，３５は線分である。単体３６は三角形である。単体３０，３１，３２は、境界をもたない。単体３３は、単体３０，３２を境界としてもつ。単体３４は、単体３１，３２を境界としてもつ。単体３５は、単体３０，３１を境界としてもつ。単体３６は、単体３３，３４，３５を境界としてもつ。 As an example, a triangular object includes simplexes 30, 31, 32, 33, 34, 35, and 36. Simplexes 30, 31, and 32 are points. Simplexes 33, 34, and 35 are line segments. Simplex 36 is a triangle. Simplexes 30, 31, and 32 have no boundary. Simplex 33 has simplexes 30 and 32 as its boundary. Simplex 34 has simplexes 31 and 32 as its boundary. Simplex 35 has simplexes 30 and 31 as its boundary. Simplex 36 has simplexes 33, 34, and 35 as its boundary.

上記の幾何学データから、情報処理装置１００は、境界行列である行列４１を生成する。行列４１は、正方行列かつ上三角行列である。行列４１の０行～６行は、単体３０，３１，３２，３３，３４，３５，３６に対応する。同様に、行列４１の０列～６列は、単体３０，３１，３２，３３，３４，３５，３６に対応する。列は上位次元の単体を意味し、行は上位次元の単体の境界になり得る下位次元の単体を意味する。行列４１は、境界であるか否かを示すフラグの集合である。単体ｊが単体ｉを境界としてもつ場合、行列４１の要素（ｉ，ｊ）が１になる。そうでない場合、要素（ｉ，ｊ）が０になる。 From the above geometric data, the information processing apparatus 100 generates matrix 41, which is a boundary matrix. Matrix 41 is a square, upper triangular matrix. Rows 0 to 6 of matrix 41 correspond to simplexes 30, 31, 32, 33, 34, 35, and 36. Similarly, columns 0 to 6 of matrix 41 correspond to simplexes 30, 31, 32, 33, 34, 35, and 36. A column represents a higher-dimensional simplex, and a row represents a lower-dimensional simplex that can be a boundary of a higher-dimensional simplex. Matrix 41 is a set of flags, each indicating whether one simplex is a boundary of another. Element (i, j) of matrix 41 is 1 when simplex j has simplex i as a boundary; otherwise, element (i, j) is 0.

よって、行列４１の０列～２列はゼロベクトルである。行列４１の３列は（１０１００００）である。行列４１の４列は（０１１００００）である。行列４１の５列は（１１０００００）である。行列４１の６列は（０００１１１０）である。境界行列としての性質上、行列４１は正方行列かつ上三角行列である。 Therefore, columns 0 to 2 of matrix 41 are zero vectors. Column 3 of matrix 41 is (1010000), listed from row 0 to row 6. Column 4 is (0110000). Column 5 is (1100000). Column 6 is (0001110). By its nature as a boundary matrix, matrix 41 is square and upper triangular.
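The construction of matrix 41 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment; the dictionary `boundaries`, the helper `column`, and the mapping of simplexes 30 to 36 onto indices 0 to 6 are assumptions made for the example.

```python
# Hypothetical encoding of the example: simplex id -> ids of its boundary simplexes.
boundaries = {
    30: [], 31: [], 32: [],      # points have no boundary
    33: [30, 32],                # line segments
    34: [31, 32],
    35: [30, 31],
    36: [33, 34, 35],            # triangle
}
ids = sorted(boundaries)                      # simplexes 30..36 -> indices 0..6
index = {s: k for k, s in enumerate(ids)}
n = len(ids)

# matrix[i][j] == 1 when the simplex of column j has the simplex of row i as a boundary.
matrix = [[0] * n for _ in range(n)]
for simplex, faces in boundaries.items():
    for face in faces:
        matrix[index[face]][index[simplex]] = 1

def column(j):
    """Column j as a string of flags, listed from row 0 to row 6."""
    return "".join(str(matrix[i][j]) for i in range(n))

for j in range(3, n):
    print(j, column(j))
```

Because every boundary simplex has a smaller index than the simplex it bounds, all non-zero elements land above the diagonal, which is why the result is upper triangular.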

図４は、行列に対する掃き出し法の実行例を示す図である。 FIG. 4 is a diagram showing an execution example of the sweep method for matrices.
情報処理装置１００は、行列４１に対して掃き出し法を実行する。第２の実施の形態の掃き出し法の単位操作は、ある列に、その列よりも左側にある別の列を加算することである。行列４１の要素は０または１の二値であるため、加算は排他的論理和として定義される。すなわち、０＋０＝０、０＋１＝１、１＋０＝１、１＋１＝０である。 The information processing apparatus 100 executes the sweep method on matrix 41. The unit operation of the sweep method in the second embodiment is to add to a column another column located to its left. Since the elements of matrix 41 are binary, 0 or 1, addition is defined as exclusive OR. That is, 0+0=0, 0+1=1, 1+0=1, and 1+1=0.
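The unit operation can be illustrated with a short sketch. The function name `add_column` and the list representation of a column are assumptions chosen for the example; the column values are those given above for matrix 41.

```python
def add_column(target, source):
    """Return target + source where addition is exclusive OR on 0/1 elements."""
    return [a ^ b for a, b in zip(target, source)]

col3 = [1, 0, 1, 0, 0, 0, 0]   # column 3 of matrix 41, rows 0..6
col4 = [0, 1, 1, 0, 0, 0, 0]   # column 4 of matrix 41, rows 0..6

# Adding the left-hand column 3 to column 4 clears the shared non-zero element in row 2.
col4_after = add_column(col4, col3)
print(col4_after)   # [1, 1, 0, 0, 0, 0, 0]
```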

この掃き出し法の目的は、各列について最も下側にある非ゼロ要素の行番号を最小化することである。以下では、ｊ列の最も下側にある非ゼロ要素の行番号をｌｏｗ（ｊ）と表記することがある。ｊ列が非ゼロ要素を含まない場合、ｌｏｗ（ｊ）＝－１である。２つの列が同じ行に非ゼロ要素を含む場合、一方の列を他方の列に加算することで他方の列の非ゼロ要素が消去される。そのため、掃き出し法を実行すると、少なくとも１つの非ゼロ要素をそれぞれ含む複数の列は、異なるｌｏｗ（ｊ）をもつことになる。 The purpose of this sweeping method is to minimize the row number of the lowest non-zero elements for each column. Below, the row number of the lowest non-zero element in column j is sometimes written as low(j). If the j column contains no non-zero elements, then low(j)=-1. If two columns contain nonzero elements in the same row, adding one column to the other eliminates the nonzero elements in the other column. Therefore, after performing the sweep method, multiple columns each containing at least one non-zero element will have different low(j).

掃き出し法を実行した後にｉ＝ｌｏｗ（ｊ）＞－１が成立する場合、すなわち、ｊ列の最も下側の非ゼロ要素がｉ行にある場合、情報処理装置１００は、要素（ｉ，ｊ）をピボットと判定する。ピボット（ｉ，ｊ）は、ｉが穴の出現タイミングを表し、ｊがその穴の消滅タイミングを表すものと解釈される。よって、境界行列を掃き出し法によって変換することで、パーシステントホモロジーの計算が実行される。 If i=low(j)>-1 holds after the sweep method has been executed, that is, if the lowest non-zero element of column j is in row i, the information processing apparatus 100 determines element (i, j) to be a pivot. A pivot (i, j) is interpreted to mean that i represents the timing at which a hole appears and j represents the timing at which that hole disappears. Thus, persistent homology is computed by transforming the boundary matrix with the sweep method.

上記の掃き出し法の性質から、情報処理装置１００は、行列４１に含まれる複数の列を、左側から右側に向かう順で処理する。また、情報処理装置１００は、同じ列に含まれる複数の要素を、下側から上側に向かう順で処理する。行列４１は上三角行列であるため、情報処理装置１００は、複数の要素を左下から右上に向かう順で処理することになる。 Due to the nature of the above sweep method, the information processing apparatus 100 processes the columns included in the matrix 41 in order from left to right. Further, the information processing apparatus 100 processes a plurality of elements included in the same column in order from bottom to top. Since the matrix 41 is an upper triangular matrix, the information processing apparatus 100 processes the plurality of elements in order from the lower left to the upper right.

ｊ１列を処理する際、情報処理装置１００は、ｊ１列の下側から上側に向かって非ゼロ要素を探索する。非ゼロ要素が検出されると、情報処理装置１００は、その列の左側にある他の列の中から、検出した非ゼロ要素の行番号と同じｌｏｗをもつｊ２列を検索する。該当するｊ２列が存在する場合、情報処理装置１００は、ｊ２列の全体をｊ１列に加算し、更に上側に向かって非ゼロ要素を探索する。該当するｊ２列が存在しない場合、その非ゼロ要素の行番号がｌｏｗ（ｊ１）になり、ｊ１列のピボットが確定する。 When processing the j1 column, the information processing apparatus 100 searches for non-zero elements from the bottom to the top of the j1 column. When a non-zero element is detected, the information processing apparatus 100 searches other columns on the left side of that column for the j2 column having the same low as the row number of the detected non-zero element. If there is a corresponding j2 column, the information processing apparatus 100 adds the entire j2 column to the j1 column and searches upward for non-zero elements. If the corresponding j2 column does not exist, the row number of the non-zero element becomes low(j1) and the j1 column pivot is established.

ｊ１列のピボットが確定すると、情報処理装置１００は、ｊ１列を更に上側に向かって走査しなくてよく、ｊ１列の処理を終了する。最も下側の非ゼロ要素以外の他の非ゼロ要素はピボットと無関係であり、消去しなくてよいためである。 When the pivot of the j1 column is determined, the information processing apparatus 100 does not need to scan the j1 column further upward, and ends the processing of the j1 column. This is because non-zero elements other than the lowest non-zero element are irrelevant to the pivot and need not be deleted.

なお、掃き出し法では、ある列の処理が省略可能と判定されることがある。処理を省略可能な列の中には、掃き出し法の開始時点で判定される列もあるし、途中で判定される列もある。初期状態の境界行列において要素が全てゼロである列は、処理を省略可能であり、開始時点で判定される。また、上記のようにピボットが確定した列は、それ以降は処理を省略可能であり、掃き出し法の途中で判定される。また、ピボット（ｉ，ｊ）が確定した場合、パーシステントホモロジーの性質からｉ列の要素が全てゼロになる。このため、ｉ列が処理を省略可能と判定されてもよく、掃き出し法の途中で判定され得る。 In the sweep method, the processing of a column may be determined to be skippable. Some skippable columns are identified when the sweep method starts, and others are identified while it is running. A column whose elements are all zero in the initial boundary matrix is skippable and is identified at the start. A column whose pivot has been determined as described above is skippable from that point on and is identified while the sweep method is running. In addition, when a pivot (i, j) is determined, all elements of column i become zero owing to a property of persistent homology. Column i may therefore also be determined to be skippable, and this determination can likewise be made while the sweep method is running.

行列４１に対して単純な掃き出し法を実行する場合、情報処理装置１００は、０列を選択し、０列がゼロベクトルであるためｌｏｗ（０）＝－１と決定する。同様に、情報処理装置１００は、ｌｏｗ（１）＝ｌｏｗ（２）＝－１と決定する。次に、情報処理装置１００は、３列を選択し、（２，３）の非ゼロ要素を検出する。情報処理装置１００は、ｌｏｗ＝２の列が左側に存在しないためｌｏｗ（３）＝２と決定し、３列の処理を終了する。 When performing the simple sweep method on matrix 41, the information processing apparatus 100 selects column 0 and determines low(0)=-1 because column 0 is a zero vector. Similarly, the information processing apparatus 100 determines low(1)=low(2)=-1. Next, the information processing apparatus 100 selects column 3 and detects the non-zero element at (2, 3). Because no column with low=2 exists to its left, the information processing apparatus 100 determines low(3)=2 and ends the processing of column 3.

次に、情報処理装置１００は、４列を選択し、（２，４）の非ゼロ要素を検出する。情報処理装置１００は、ｌｏｗ（３）＝２であるため３列を４列に加算する。これにより、４列は（１１０００００）に変換され、行列４１は行列４２に変換される。次に、情報処理装置１００は、（１，４）の非ゼロ要素を検出する。情報処理装置１００は、ｌｏｗ＝１の列が左側に存在しないためｌｏｗ（４）＝１と決定し、４列の処理を終了する。 Next, the information processing apparatus 100 selects column 4 and detects the non-zero element at (2, 4). Since low(3)=2, the information processing apparatus 100 adds column 3 to column 4. As a result, column 4 is transformed into (1100000), and matrix 41 is transformed into matrix 42. Next, the information processing apparatus 100 detects the non-zero element at (1, 4). Because no column with low=1 exists to its left, the information processing apparatus 100 determines low(4)=1 and ends the processing of column 4.

次に、情報処理装置１００は、５列を選択し、（１，５）の非ゼロ要素を検出する。情報処理装置１００は、ｌｏｗ（４）＝１であるため４列を５列に加算する。これにより、５列は（０００００００）に変換され、行列４２は行列４３に変換される。また、情報処理装置１００は、５列がゼロベクトルであるためｌｏｗ（５）＝－１と決定する。 Next, the information processing apparatus 100 selects column 5 and detects the non-zero element at (1, 5). Since low(4)=1, the information processing apparatus 100 adds column 4 to column 5. As a result, column 5 is transformed into (0000000), and matrix 42 is transformed into matrix 43. The information processing apparatus 100 then determines low(5)=-1 because column 5 is a zero vector.

次に、情報処理装置１００は、６列を選択し、（５，６）の非ゼロ要素を検出する。情報処理装置１００は、ｌｏｗ＝５の列が左側に存在しないためｌｏｗ（６）＝５と決定し、６列の処理を終了する。これにより、情報処理装置１００は、要素（２，３）、要素（１，４）および要素（５，６）をピボットと決定する。 Next, the information processing apparatus 100 selects column 6 and detects the non-zero element at (5, 6). Because no column with low=5 exists to its left, the information processing apparatus 100 determines low(6)=5 and ends the processing of column 6. The information processing apparatus 100 thus determines elements (2, 3), (1, 4), and (5, 6) to be pivots.
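The single-threaded walk-through above can be reproduced with a short sketch. It assumes each column is stored as a set of the row indices of its non-zero elements, so that exclusive-OR addition becomes a symmetric difference; the names `low`, `reduce_boundary_matrix`, and `owner` are illustrative and do not come from the embodiment.

```python
def low(col):
    """Row number of the lowest non-zero element, or -1 for a zero column."""
    return max(col) if col else -1

def reduce_boundary_matrix(columns):
    """Sweep the columns from left to right and return (low values, pivots)."""
    cols = [set(c) for c in columns]
    owner = {}                       # low value -> index of the column that holds it
    for j, col in enumerate(cols):
        # While a column to the left has the same low, add it (XOR) and look further up.
        while col and low(col) in owner:
            col ^= cols[owner[low(col)]]
        if col:
            owner[low(col)] = j      # the pivot of column j is now fixed
    lows = [low(c) for c in cols]
    pivots = [(lows[j], j) for j in range(len(cols)) if lows[j] >= 0]
    return lows, pivots

# Matrix 41: columns 0..2 are zero, column 3 = rows {0, 2}, and so on.
matrix41 = [set(), set(), set(), {0, 2}, {1, 2}, {0, 1}, {3, 4, 5}]
lows, pivots = reduce_boundary_matrix(matrix41)
print(lows)     # [-1, -1, -1, 2, 1, -1, 5]
print(pivots)   # [(2, 3), (1, 4), (5, 6)]
```

The dictionary lookup replaces the search over the columns to the left; this is a convenience of the sketch, and the text does not say how that search is carried out.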

図５は、列管理テーブルの例を示す図である。 FIG. 5 is a diagram showing an example of a column management table.
情報処理装置１００は、掃き出し法の制御のために列管理テーブル１３１を保持してもよい。列管理テーブル１３１は、列番号、ｌｏｗおよびゼロフラグを対応付ける。列番号は、境界行列の列を識別する。ｌｏｗは、最も下側の非ゼロ要素が存在する行の行番号である。ただし、非ゼロ要素が存在しない列のｌｏｗは－１である。ｌｏｗの初期値は－１である。ゼロフラグは、列がゼロベクトルであるか否かを示す。 The information processing apparatus 100 may hold a column management table 131 to control the sweep method. The column management table 131 associates a column number, low, and a zero flag with one another. The column number identifies a column of the boundary matrix. low is the row number of the row containing the lowest non-zero element; for a column with no non-zero element, low is -1. The initial value of low is -1. The zero flag indicates whether the column is a zero vector.

ある列の全ての要素がゼロである場合、その列のゼロフラグがＹＥＳに更新される。列ｊのピボットが確定した場合、ｌｏｗ（ｊ）が０以上の整数に更新される。ピボット（ｉ，ｊ）が確定した場合、列ｉのゼロフラグがＹＥＳに更新されてもよい。ｌｏｗが０以上であるかまたはゼロフラグがＹＥＳである列は、処理を省略可能な列である。 If all elements in a column are zero, the zero flag for that column is updated to YES. When the pivot of column j is confirmed, low(j) is updated to an integer greater than or equal to 0. If pivot (i,j) is settled, the zero flag for column i may be updated to YES. A column whose low is 0 or more or whose zero flag is YES is a column whose processing can be omitted.
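One possible in-memory shape for such a table is sketched below. The class `ColumnEntry`, its field names, and the helpers `new_table` and `record_pivot` are hypothetical; only the update rules themselves are taken from the description above.

```python
from dataclasses import dataclass

@dataclass
class ColumnEntry:
    low: int = -1        # row of the pivot, -1 until a pivot is determined
    zero: bool = False   # zero flag: True corresponds to YES

    @property
    def skippable(self):
        """A column is skippable when low is 0 or more, or the zero flag is YES."""
        return self.low >= 0 or self.zero

def new_table(columns):
    """Build the table from initial columns given as sets of non-zero row indices."""
    return [ColumnEntry(zero=(len(c) == 0)) for c in columns]

def record_pivot(table, i, j):
    """Pivot (i, j) determined: fix low(j) and, optionally, flag column i as zero."""
    table[j].low = i
    table[i].zero = True

table = new_table([set(), set(), set(), {0, 2}, {1, 2}, {0, 1}, {3, 4, 5}])
record_pivot(table, 2, 3)
print([entry.skippable for entry in table])
```

With matrix 41 as input, columns 0 to 2 are skippable from the start and column 3 becomes skippable once pivot (2, 3) is recorded.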

次に、境界行列に対する掃き出し法の並列化について説明する。情報処理装置１００は、ＣＰＵ１０１に含まれる複数のコアまたはＧＰＵ１０４に含まれる複数のコアを用いて複数のスレッドを並列に実行し、複数のスレッドによって掃き出し法を並列化する。 Next, parallelization of the sweep method for boundary matrices will be described. The information processing apparatus 100 executes a plurality of threads in parallel using a plurality of cores included in the CPU 101 or a plurality of cores included in the GPU 104, and parallelizes the sweep method by the plurality of threads.

図６は、複数のスレッドによる掃き出し法の並列化例を示す図である。 FIG. 6 is a diagram showing an example of parallelization of the sweep method using a plurality of threads.
情報処理装置１００は、行列４１を複数のブロックに分割する。複数のブロックは、同じ個数の行および同じ個数の列を含む正方部分行列である。複数のブロックは、格子状に配置される。ブロックの一辺の長さは、事前に固定的に決められていてもよいし、ユーザから指定されてもよいし、行列４１のサイズに基づいて算出されてもよい。例えば、情報処理装置１００は、ブロックの一辺の長さを、行列の一辺の長さ÷（スレッド数×α）と算出する。αは、１以上の整数を示す定数である。 The information processing apparatus 100 divides matrix 41 into a plurality of blocks. Each block is a square submatrix, and all blocks contain the same number of rows and the same number of columns. The blocks are arranged in a grid. The side length of a block may be fixed in advance, specified by the user, or calculated from the size of matrix 41. For example, the information processing apparatus 100 calculates the side length of a block as (side length of the matrix) ÷ (number of threads × α), where α is a constant that is an integer of 1 or more.

例えば、情報処理装置１００は、７行７列の行列４１を２行２列のブロックに分割して、ブロック５０，５１，５２，５３，５４，５５，５６，５７，５８，５９を生成する。ブロック５０，５１，５２，５３は、対角要素を含む対角ブロックである。ブロック５４はブロック５０の右隣、ブロック５５はブロック５１の右隣、ブロック５６はブロック５２の右隣である。ブロック５７はブロック５４の右隣、ブロック５８はブロック５５の右隣である。ブロック５９はブロック５７の右隣である。なお、サイズ調整のため、行列４１の右端には空列が１つ挿入され、行列４１の下端には空行が１つ挿入されている。また、情報処理装置１００は、上三角形の外の領域にブロックを生成しなくてもよい。 For example, the information processing apparatus 100 divides the 7-row, 7-column matrix 41 into 2-row, 2-column blocks to generate blocks 50, 51, 52, 53, 54, 55, 56, 57, 58, and 59. Blocks 50, 51, 52, and 53 are diagonal blocks, which contain diagonal elements. Block 54 is the right neighbor of block 50, block 55 is the right neighbor of block 51, and block 56 is the right neighbor of block 52. Block 57 is the right neighbor of block 54, and block 58 is the right neighbor of block 55. Block 59 is the right neighbor of block 57. To adjust the size, one empty column is inserted at the right end of matrix 41 and one empty row is inserted at its bottom end. The information processing apparatus 100 does not have to generate blocks in the region outside the upper triangle.
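The block layout of this example can be checked with a small sketch. The function names are hypothetical, and rounding the side length up is an assumption, since the text gives only the ratio and not a rounding rule.

```python
import math

def block_side(matrix_side, num_threads, alpha=1):
    """Side length = matrix side / (threads x alpha); rounding up is an assumption."""
    return max(1, math.ceil(matrix_side / (num_threads * alpha)))

def block_layout(matrix_side, side):
    """Pad the matrix to a multiple of `side` and list the upper-triangle blocks B(m, n)."""
    per_side = math.ceil(matrix_side / side)
    padded = per_side * side
    blocks = [(m, n) for m in range(per_side) for n in range(m, per_side)]
    return padded, blocks

padded, blocks = block_layout(7, 2)
print(padded, len(blocks))   # 8 10
```

For the 7-row, 7-column matrix with 2-row, 2-column blocks, the padded side is 8 (one empty row and one empty column), and the upper triangle holds 10 blocks, matching blocks 50 to 59.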

これら複数のブロックは、一定の順序条件のもとで順に処理される。あるブロックの左に未処理のブロックが隣接しているか、または、そのブロックの下に未処理のブロックが隣接している場合、そのブロックの処理は開始されない。これは、左のブロックが未処理であると加算すべき列が確定しないためであり、また、下のブロックが未処理であると非ゼロ要素の消去の要否が確定しないためである。よって、全てのブロックが未処理である初期状態においては、対角ブロックのみが処理可能ブロックとなる。 These multiple blocks are processed in order under certain order conditions. If a block has an adjacent unprocessed block to its left or an unprocessed adjacent block below it, processing of that block is not started. This is because the columns to be added are not determined if the left block is unprocessed, and the necessity of erasing non-zero elements is not determined if the lower block is unprocessed. Therefore, in the initial state where all blocks are unprocessed, only diagonal blocks are processable blocks.

情報処理装置１００は、左に未処理のブロックが隣接しておらずかつ下に未処理のブロックが隣接していない処理可能ブロックに、処理中のブロックがない空きスレッドを割り当てる。あるスレッドがブロックの処理を終了すると、情報処理装置１００は、処理可能ブロックを検索する。あるブロックが処理済みになることで、新たな処理可能ブロックが発生することがある。情報処理装置１００は、何れかの処理可能ブロックに、ブロックの処理を終えて空き状態になったスレッドを割り当てる。 The information processing apparatus 100 allocates an empty thread having no block being processed to a processable block having no unprocessed block adjacent to the left and no unprocessed block adjacent to the lower side. When a certain thread finishes processing a block, the information processing apparatus 100 searches for a processable block. As a block becomes processed, a new processable block may occur. The information processing apparatus 100 allocates a thread that has become free after processing a block to any processable block.
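The ordering condition can be written as a predicate over block coordinates. The sketch below assumes blocks are identified by (block row m, block column n) with m ≤ n, and that the sets `done` (finished blocks) and `started` (blocks already handed to a thread) are maintained by the caller; these names are illustrative only.

```python
def is_processable(block, done, started=frozenset()):
    """True if the block can start: its left and lower neighbors, where they exist
    inside the upper triangle, have both finished."""
    m, n = block
    if block in done or block in started:
        return False
    left_ok = (n - 1 < m) or ((m, n - 1) in done)     # no left neighbor on the diagonal
    below_ok = (m + 1 > n) or ((m + 1, n) in done)    # no lower neighbor on the diagonal
    return left_ok and below_ok

def processable_blocks(per_side, done, started=frozenset()):
    """All processable blocks in an upper-triangular grid with per_side block rows."""
    return [(m, n) for m in range(per_side) for n in range(m, per_side)
            if is_processable((m, n), done, started)]

print(processable_blocks(4, set()))                 # [(0, 0), (1, 1), (2, 2), (3, 3)]
print(processable_blocks(4, {(2, 2), (3, 3)}))      # [(0, 0), (1, 1), (2, 3)]
```

In the initial state only the diagonal blocks qualify, and a block off the diagonal becomes processable as soon as the later of its two neighbors finishes.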

各スレッドにおいて行われるブロックの処理は、前述した掃き出し法と同様である。スレッドは、ブロックに含まれる複数の列を左側から右側に向かって順に処理する。ただし、スレッドは、既に省略可能と判定されている列をスキップする。スレッドは、同じ列に含まれる複数の要素を下側から上側に向かって走査する。非ゼロ要素を検出すると、スレッドは、左側の列の中から非ゼロ要素の行番号と同じｌｏｗをもつ列を検索する。 The block processing performed by each thread is the same as the sweep method described above. A thread processes the columns in a block in order from left to right. However, the thread skips columns that have already been determined to be skippable. The thread scans the elements in the same column from bottom to top. Upon detecting a non-zero element, the thread searches the columns to the left for a column whose low equals the row number of that non-zero element.

相手の列は同一ブロック内の列であることもあるし、そのブロックの外側の列であることもある。該当する相手の列が存在する場合、スレッドは、相手の列の全体を処理中の列に加算する。このとき、処理中のブロックの上側にある他のブロックの要素も更新され得る。該当する相手の列が存在しない場合、検出された非ゼロ要素がピボットとして残る。この場合、処理中の列はそれ以降、処理を省略可能な列と判定される。 The other column may be a column within the same block or a column outside the block. If there is a matching partner column, the thread adds the entire partner column to the column being processed. At this time, elements of other blocks above the block being processed may also be updated. If there is no matching partner column, the detected non-zero element remains as the pivot. In this case, the column being processed is determined as a column whose processing can be omitted thereafter.

ここで、処理可能ブロックを特定して空きスレッドを割り当てる方法の例として、以下のような方法も考えられる。まず、情報処理装置１００は、対角ブロックであるブロック５０，５１，５２，５３にスレッドを割り当てる。スレッド数が４である場合、４つのスレッドがブロック５０，５１，５２，５３を並列に処理することができる。 Here, the following method is also conceivable as an example of a method of specifying processable blocks and allocating free threads. First, the information processing apparatus 100 allocates threads to blocks 50, 51, 52, and 53, which are diagonal blocks. If the number of threads is four, four threads can process blocks 50, 51, 52 and 53 in parallel.

ブロック５０，５１，５２，５３の処理が全て終了すると、情報処理装置１００は、対角線と平行な右上の斜線の上にあるブロック５４，５５，５６を処理可能ブロックとして特定し、ブロック５４，５５，５６にスレッドを割り当てる。この場合、３つのスレッドがブロック５４，５５，５６を並列に処理することができる。 When the processing of blocks 50, 51, 52, and 53 has all finished, the information processing apparatus 100 identifies blocks 54, 55, and 56, which lie on the next oblique line up and to the right, parallel to the diagonal, as processable blocks, and assigns threads to blocks 54, 55, and 56. In this case, three threads can process blocks 54, 55, and 56 in parallel.

ブロック５４，５５，５６の処理が全て終了すると、情報処理装置１００は、更に右上の斜線の上にあるブロック５７，５８を処理可能ブロックとして特定し、ブロック５７，５８にスレッドを割り当てる。この場合、２つのスレッドがブロック５７，５８を並列に処理することができる。ブロック５７，５８の処理が全て終了すると、情報処理装置１００は、更に右上の斜線の上にあるブロック５９を処理可能ブロックとして特定し、ブロック５９にスレッドを割り当てる。この場合、１つのスレッドがブロック５９を処理する。 When the processing of blocks 54 , 55 , 56 is all completed, the information processing apparatus 100 further identifies blocks 57 , 58 above the upper right diagonal line as processable blocks, and allocates threads to the blocks 57 , 58 . In this case, two threads can process blocks 57 and 58 in parallel. When the processing of blocks 57 and 58 is completed, the information processing apparatus 100 further identifies block 59 above the upper right diagonal line as a processable block, and assigns a thread to block 59 . In this case, one thread processes block 59 .

しかし、上記のスレッド割り当て方法では、掃き出し法の進行に伴って処理可能ブロック数が４，３，２，１と段階的に減少する。このため、使用されるスレッドの個数が段階的に減少して並列度が減少する。また、ブロックには処理を省略可能な列が含まれることがあるため、ブロックによって計算時間が異なる。このため、計算時間の長いボトルネックのブロックが、掃き出し法の進行を遅延させることがある。 However, in the thread allocation method described above, the number of processable blocks decreases step by step to 4, 3, 2, and 1 as the sweep-out method progresses. As a result, the number of threads used is gradually reduced, and the degree of parallelism is reduced. In addition, since blocks may include columns whose processing can be omitted, the calculation time differs depending on the block. Thus, bottleneck blocks with long computation times can slow the progress of the sweep-out method.

そこで、情報処理装置１００は、対角線に拘束されず、処理可能ブロックとなったブロックに空きスレッドを順次割り当てる。また、情報処理装置１００は、各処理可能ブロックの計算コストを推定し、計算コストの大きいブロックから先にスレッドを割り当てる。 Therefore, the information processing apparatus 100 sequentially allocates free threads to blocks that have become processable blocks without being constrained by diagonal lines. In addition, the information processing apparatus 100 estimates the calculation cost of each processable block, and allocates threads to the block with the highest calculation cost first.

図７は、第２の実施の形態の掃き出し法の手順例を示す図である。 FIG. 7 is a diagram illustrating a procedure example of the sweep method according to the second embodiment.
行列６１は、上三角形に並んだブロックを示す。行列６１は、例えば、１００行１００列の行列を１０行１０列のブロックに分割したものである。なお、行列６１における座標（ｍ，ｎ）はブロックの位置を示すものであり、要素の位置を示す座標（ｉ，ｊ）とは異なる。ｍはブロック行番号であり、ｎはブロック列番号である。 Matrix 61 shows blocks arranged in an upper triangle. Matrix 61 is obtained, for example, by dividing a matrix of 100 rows and 100 columns into blocks of 10 rows and 10 columns. Note that the coordinates (m, n) in matrix 61 indicate the position of a block and differ from the coordinates (i, j), which indicate the position of an element. m is the block row number, and n is the block column number.

情報処理装置１００は、行列６１に含まれる処理可能ブロックとして、対角線上にある１０個のブロックを特定する。情報処理装置１００は、それら１０個のブロックの計算コストを、２，１，２，２，５，１，７，２，４，３と算出する。計算コストは、例えば、処理を省略可能と判定された列以外の列、すなわち、処理対象の列の個数である。 The information processing apparatus 100 identifies 10 diagonal blocks as processable blocks included in the matrix 61 . The information processing apparatus 100 calculates the calculation costs of these ten blocks as 2, 1, 2, 2, 5, 1, 7, 2, 4, and 3. The calculation cost is, for example, the number of columns other than columns determined to be process-skippable, that is, the number of columns to be processed.

情報処理装置１００は、待ちブロックキュー１３２に処理可能ブロックの情報を挿入する。処理可能ブロックの情報は、ブロック名と計算コストとを含む。ブロック名は、処理可能ブロックを識別する識別子であり、例えば、処理可能ブロックの座標である。待ちブロックキュー１３２では、処理可能ブロックの情報が計算コストの降順にソートされる。計算コストが同じ処理可能ブロックの情報は、ブロック行番号ｍの昇順にソートされる。 The information processing apparatus 100 inserts information about processable blocks into the waiting block queue 132 . The information of processable blocks includes block names and computational costs. A block name is an identifier for identifying a processable block, and is, for example, coordinates of the processable block. In the waiting block queue 132, information on processable blocks is sorted in descending order of calculation cost. Information of processable blocks having the same calculation cost is sorted in ascending order of block row number m.

よって、初期状態では、待ちブロックキュー１３２の先頭は、計算コストが７のブロックＢ（６，６）である。その次が、計算コストが５のブロックＢ（４，４）である。その次が、計算コストが４のブロックＢ（８，８）である。その次が、計算コストが３のブロックＢ（９，９）である。その次が、計算コストが２のブロックＢ（０，０）である。 Therefore, in the initial state, the head of the waiting block queue 132 is block B (6, 6) with a calculation cost of 7. Next is block B(4,4) with a computational cost of 5. Next is block B(8,8) with a computational cost of four. Next is block B(9,9) with a computational cost of 3. Next is block B(0,0) with a computational cost of two.

その次が、計算コストが２のブロックＢ（２，２）である。その次が、計算コストが２のブロックＢ（３，３）である。その次が、計算コストが２のブロックＢ（７，７）である。その次が、計算コストが１のブロックＢ（１，１）である。その次が、計算コストが１のブロックＢ（５，５）である。 Next is block B(2,2) with a computational cost of two. Next is block B(3,3) with a computational cost of two. Next is block B(7,7) with a computational cost of two. Next is block B(1,1) with a computational cost of one. Next is block B(5,5) with a computational cost of one.
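The ordering of the waiting block queue can be reproduced with a binary heap. This is only a sketch: the use of `heapq` and the tuple key `(-cost, m, n)` are assumptions, while the ten costs are the ones listed above for the diagonal blocks B(0,0) to B(9,9).

```python
import heapq

diagonal_costs = [2, 1, 2, 2, 5, 1, 7, 2, 4, 3]   # costs of B(0,0) .. B(9,9)

queue = []
for m, cost in enumerate(diagonal_costs):
    # Negating the cost yields descending cost; m breaks ties in ascending row order.
    heapq.heappush(queue, (-cost, m, m))

order = []
while queue:
    neg_cost, m, n = heapq.heappop(queue)
    order.append(((m, n), -neg_cost))

for block, cost in order:
    print(block, cost)
```

Popping the heap yields B(6,6), B(4,4), B(8,8), B(9,9), B(0,0), B(2,2), B(3,3), B(7,7), B(1,1), B(5,5), which is the order described for the initial state.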

また、情報処理装置１００は、スレッドの状態を管理するスレッドテーブル１３３を保持する。スレッドテーブル１３３は、スレッド番号と状態とを対応付ける。スレッド番号は、スレッドを識別する識別子である。処理中のブロックがない場合、状態はＲＥＡＤＹである。処理中のブロックがある場合、状態は処理中のブロックの識別子である。初期状態においては、全てのスレッドの状態がＲＥＡＤＹである。 The information processing apparatus 100 also holds a thread table 133 that manages the state of threads. The thread table 133 associates thread numbers with states. A thread number is an identifier that identifies a thread. If no blocks are being processed, the state is READY. If there is a block in process, the state is the identifier of the block in process. In the initial state, the state of all threads is READY.

図８は、第２の実施の形態の掃き出し法の手順例を示す図（続き１）である。 FIG. 8 is a diagram (continuation 1) showing a procedure example of the sweep method according to the second embodiment.
情報処理装置１００は、待ちブロックキュー１３２から優先度最高のブロックＢ（６，６）を抽出し、スレッド＃１を割り当てる。スレッド＃１はブロックＢ（６，６）の処理を開始する。次に、情報処理装置１００は、待ちブロックキュー１３２からこの時点において優先度最高のブロックＢ（４，４）を抽出し、スレッド＃２を割り当てる。スレッド＃２はブロックＢ（４，４）の処理を開始する。 The information processing apparatus 100 extracts block B(6,6), which has the highest priority, from the waiting block queue 132 and assigns thread #1 to it. Thread #1 begins processing block B(6,6). Next, the information processing apparatus 100 extracts block B(4,4), which has the highest priority at this point, from the waiting block queue 132 and assigns thread #2 to it. Thread #2 begins processing block B(4,4).

次に、情報処理装置１００は、待ちブロックキュー１３２から優先度最高のブロックＢ（８，８）を抽出し、スレッド＃３を割り当てる。スレッド＃３はブロックＢ（８，８）の処理を開始する。次に、情報処理装置１００は、待ちブロックキュー１３２から優先度最高のブロックＢ（９，９）を抽出し、スレッド＃４を割り当てる。スレッド＃４はブロックＢ（９，９）の処理を開始する。これにより、行列６１が行列６２になる。 Next, the information processing apparatus 100 extracts block B (8, 8) with the highest priority from the waiting block queue 132 and assigns thread #3 to it. Thread #3 begins processing block B(8,8). Next, the information processing apparatus 100 extracts block B (9, 9) with the highest priority from the waiting block queue 132 and assigns thread #4 to it. Thread #4 begins processing block B(9,9). Thus, matrix 61 becomes matrix 62 .

図９は、第２の実施の形態の掃き出し法の手順例を示す図（続き２）である。 FIG. 9 is a diagram (continuation 2) showing a procedure example of the sweep method according to the second embodiment.
ブロックＢ（９，９）の計算コストが小さいため、スレッド＃４の処理が先に終了する。すると、情報処理装置１００は、ブロックＢ（９，９）が処理済みになったことに伴って新たに処理可能になったブロックを検索する。ここでは、新たな処理可能ブロックは存在しない。情報処理装置１００は、待ちブロックキュー１３２から優先度最高のブロックＢ（０，０）を抽出し、スレッド＃４を割り当てる。スレッド＃４はブロックＢ（０，０）の処理を開始する。これにより、行列６２が行列６３になる。 Because the calculation cost of block B(9,9) is small, thread #4 finishes first. The information processing apparatus 100 then searches for blocks that have newly become processable now that block B(9,9) has been processed. Here, there is no new processable block. The information processing apparatus 100 extracts block B(0,0), which has the highest priority, from the waiting block queue 132 and assigns thread #4 to it. Thread #4 begins processing block B(0,0). Matrix 62 thus becomes matrix 63.

図１０は、第２の実施の形態の掃き出し法の手順例を示す図（続き３）である。 FIG. 10 is a diagram (continuation 3) showing a procedure example of the sweep method according to the second embodiment.
スレッド＃４に続けてスレッド＃３の処理が終了する。すると、情報処理装置１００は、ブロックＢ（８，８）が処理済みになったことに伴って新たに処理可能になったブロックを検索する。ここでは、左のブロックＢ（８，８）が処理済みかつ下のブロックＢ（９，９）が処理済みであるため、ブロックＢ（８，９）が新たな処理可能ブロックである。情報処理装置１００は、ブロックＢ（８，９）の計算コストを２と算出し、ブロックＢ（８，９）の情報を待ちブロックキュー１３２に挿入する。 After thread #4, thread #3 finishes. The information processing apparatus 100 then searches for blocks that have newly become processable now that block B(8,8) has been processed. Here, since the left block B(8,8) and the lower block B(9,9) have both been processed, block B(8,9) is a new processable block. The information processing apparatus 100 calculates the calculation cost of block B(8,9) as 2 and inserts the information of block B(8,9) into the waiting block queue 132.

待ちブロックキュー１３２において、ブロックＢ（８，９）の優先度はブロックＢ（７，７）とブロックＢ（１，１）の間である。情報処理装置１００は、待ちブロックキュー１３２から優先度最高のブロックＢ（２，２）を抽出し、スレッド＃３を割り当てる。スレッド＃３はブロックＢ（２，２）の処理を開始する。これにより、行列６３が行列６４になる。このようにして、掃き出し法の並列処理が進行する。 In the waiting block queue 132, the priority of block B(8,9) is between block B(7,7) and block B(1,1). The information processing apparatus 100 extracts block B(2,2) with the highest priority from the waiting block queue 132 and assigns thread #3 to it. Thread #3 begins processing block B(2,2). Thus, matrix 63 becomes matrix 64 . In this way, the parallel processing of the sweep method proceeds.
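The sequence of FIG. 8 to FIG. 10 can be replayed with a small event-driven sketch. It is single-threaded and only models the bookkeeping: `complete` stands for the handling performed when a thread finishes, the costs of newly processable blocks are passed in by the caller instead of being computed, and all names are hypothetical.

```python
import heapq

SIDE = 10                       # 10 x 10 grid of blocks, as in FIG. 7
done = set()                    # finished blocks
# Queue entries are (-cost, m, n): highest cost first, then smallest block row m.
queue = [(-c, m, m) for m, c in enumerate([2, 1, 2, 2, 5, 1, 7, 2, 4, 3])]
heapq.heapify(queue)
# Threads #1..#4 take the four highest-priority blocks (FIG. 8).
running = {t: heapq.heappop(queue)[1:] for t in (1, 2, 3, 4)}

def ready(block):
    """True when the left and lower neighbors inside the upper triangle are finished."""
    m, n = block
    left_ok = n - 1 < m or (m, n - 1) in done
    below_ok = m + 1 > n or (m + 1, n) in done
    return left_ok and below_ok

def complete(thread, new_costs):
    """Handle the end of a thread's block; return the block it is assigned next."""
    m, n = running[thread]
    done.add((m, n))
    # Only the right neighbor and the upper neighbor can have become processable.
    for cand in ((m, n + 1), (m - 1, n)):
        if cand[0] >= 0 and cand[1] < SIDE and ready(cand):
            heapq.heappush(queue, (-new_costs[cand], cand[0], cand[1]))
    running[thread] = heapq.heappop(queue)[1:] if queue else None  # None stands for READY
    return running[thread]

print(complete(4, {}))              # (0, 0): B(9,9) done, nothing new (FIG. 9)
print(complete(3, {(8, 9): 2}))     # (2, 2): B(8,8) done, B(8,9) queued (FIG. 10)
print(sorted(queue))
```

After the second call, the queue holds B(3,3), B(7,7), B(8,9), B(1,1), and B(5,5) in priority order, so B(8,9) sits between B(7,7) and B(1,1). Checking only the right and upper neighbors of the finished block is a design choice of the sketch; a full scan of the grid would give the same result.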

図１１は、各ブロックの計算コストの算出例を示す図である。 FIG. 11 is a diagram illustrating a calculation example of the calculation cost of each block.
ブロック７１，７２，７３，７４，７５，７６は、行列６１に含まれる。ブロック７１はＢ（５，５）、ブロック７２はＢ（６，６）、ブロック７３はＢ（７，７）、ブロック７４はＢ（５，６）、ブロック７５はＢ（６，７）、ブロック７６はＢ（５，７）である。ブロック７１，７２，７３が処理可能ブロックである。 Blocks 71, 72, 73, 74, 75, and 76 are included in matrix 61. Block 71 is B(5,5), block 72 is B(6,6), block 73 is B(7,7), block 74 is B(5,6), block 75 is B(6,7), and block 76 is B(5,7). Blocks 71, 72, and 73 are processable blocks.

ブロック７１に含まれる１０個の列のうち、９個が省略可能な列であり１個が処理対象の列であるため、ブロック７１の計算コストは１である。ブロック７２に含まれる１０個の列のうち、３個が省略可能な列であり７個が処理対象の列であるため、ブロック７２の計算コストは７である。ブロック７３に含まれる１０個の列のうち、８個が省略可能な列であり２個が処理対象の列であるため、ブロック７３の計算コストは２である。 The computational cost of block 71 is 1, because of the 10 columns included in block 71, 9 are optional columns and 1 is the column to be processed. The computational cost of block 72 is 7, because of the 10 columns included in block 72, 3 are optional columns and 7 are columns to be processed. The computational cost of block 73 is 2, because of the 10 columns included in block 73, 8 are optional columns and 2 are columns to be processed.
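The cost rule can be sketched as a count over per-column flags. The function `block_cost` and the list `skippable` are hypothetical, and the particular columns marked as still requiring processing are made up so that the counts match the figures quoted above; the text does not say which columns they are.

```python
def block_cost(skippable, block_col, side):
    """Cost of a block in block column `block_col`: its number of non-skippable columns."""
    start = block_col * side
    return sum(1 for j in range(start, start + side) if not skippable[j])

# 100 columns, all skippable except an invented selection:
# 1 column in block column 5, 7 in block column 6, 2 in block column 7.
skippable = [True] * 100
for j in [50] + list(range(60, 67)) + [70, 71]:
    skippable[j] = False

costs = [block_cost(skippable, n, 10) for n in (5, 6, 7)]
print(costs)   # [1, 7, 2]
```

Because the count depends only on the flags of the block's columns at the moment it is evaluated, evaluating it when a block becomes processable picks up every column that has been marked skippable in the meantime.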

ブロック７４の計算コストは、ブロック７４の下のブロックの処理結果の影響を大きく受けるため、ブロック７２の処理の終了後に算出される。同様に、ブロック７５の計算コストはブロック７３の処理の終了後に算出され、ブロック７６の計算コストはブロック７５の処理の終了後に算出される。第２の実施の形態では、ブロック７４の計算コストは、ブロック７４が処理可能ブロックになった時点で算出される。同様に、ブロック７５の計算コストは、ブロック７５が処理可能ブロックになった時点で算出される。ブロック７６の計算コストは、ブロック７６が処理可能ブロックになった時点で算出される。 The calculation cost of block 74 is calculated after the processing of block 72 is completed, since it is greatly influenced by the processing results of the blocks below block 74 . Similarly, the computational cost of block 75 is calculated after block 73 has been processed, and the computational cost of block 76 is computed after block 75 has been processed. In the second embodiment, the computational cost of block 74 is calculated when block 74 becomes a processable block. Similarly, the computational cost of block 75 is calculated when block 75 becomes a processable block. The computational cost of block 76 is calculated when block 76 becomes a processable block.

なお、処理を省略可能な列は、掃き出し法の進行に伴って徐々に増加する。ブロック７４の処理開始時点における省略可能な列は、ブロック７２の処理開始時点と同じかブロック７２の処理開始時点よりも増加している。よって、ブロック７４の計算コストはブロック７２以下である。同様に、ブロック７５の計算コストはブロック７３以下であり、ブロック７６の計算コストはブロック７５以下である。 Note that the number of skippable columns gradually increases as the sweep method progresses. The number of skippable columns at the start of processing of block 74 is the same as, or greater than, the number at the start of processing of block 72. Therefore, the computational cost of block 74 is less than or equal to that of block 72. Similarly, the computational cost of block 75 is less than or equal to that of block 73, and the computational cost of block 76 is less than or equal to that of block 75.

次に、計算コストの降順にブロックを処理する効果について説明する。 Next, the effect of processing blocks in descending order of computational cost will be described.
図１２は、ブロック優先度が並列処理に与える影響の例を示す図である。 FIG. 12 is a diagram illustrating an example of how block priority affects parallel processing.
行列８３は、掃き出し法の開始時点における上三角行列である。行列８３に含まれる複数の列のうち、列８１，８２の計算コストが相対的に大きいと仮定する。計算コストを考慮せずに複数のスレッドが処理可能ブロックを順に処理する場合、掃き出し法の途中で行列８３は行列８４のようになる可能性がある。 Matrix 83 is an upper triangular matrix at the start of the sweep method. Suppose that, among the columns included in matrix 83, columns 81 and 82 have relatively high computational costs. If multiple threads process the processable blocks in order without considering computational cost, matrix 83 may come to look like matrix 84 partway through the sweep method.

行列８４に示すように、計算コストの小さい列のブロックが先に処理し終わり、計算コストの大きい列８１，８２のブロックが残る。処理順序の制約から、列８１と列８２の間では、列８１の処理中のブロックと同じ行およびそれより上の行のブロックが処理可能ブロックにならない。また、列８２より右側では、列８２の処理中のブロックと同じ行およびそれより上の行のブロックが処理可能ブロックにならない。 As shown in matrix 84, the blocks in columns with lower computational costs are processed first, leaving the blocks in columns 81 and 82 with higher computational costs. Due to restrictions on the processing order, between column 81 and column 82, blocks in the same row as the block being processed in column 81 and in rows above it are not processable blocks. Also, on the right side of column 82, the blocks in the same row as the block being processed in column 82 and the rows above it do not become processable blocks.

そのため、処理可能ブロックが減少する。行列８４では、処理可能ブロックは２個に限定される。その場合、情報処理装置１００が多数のスレッドを起動しても、そのうちの少数のスレッドしかブロックを処理せず、並列度が低下する。 Therefore, the number of processable blocks is reduced. In matrix 84, the number of processable blocks is limited to two. In that case, even if the information processing apparatus 100 activates a large number of threads, only a small number of the threads process blocks, and the degree of parallelism decreases.

これに対して、計算コストの大きいブロックを優先的に処理する場合、掃き出し法の途中で行列８３は行列８５のようになると期待される。行列８５に示すように、計算コストの大きい列８１，８２のブロックの処理が優先的に開始されるため、列８１，８２の未処理のブロックが減少しやすい。列８１の未処理のブロックが減少することで、列８１と列８２の間に処理可能ブロックが発生しやすくなる。また、列８２の未処理のブロックが減少することで、列８２より右側に処理可能ブロックが発生しやすくなる。 On the other hand, when preferentially processing blocks with high calculation costs, the matrix 83 is expected to become like the matrix 85 during the sweeping method. As shown in the matrix 85, processing of blocks in columns 81 and 82 having high calculation costs is preferentially started, so the number of unprocessed blocks in columns 81 and 82 tends to decrease. As the number of unprocessed blocks in column 81 decreases, processable blocks are more likely to occur between column 81 and column 82 . Also, as the number of unprocessed blocks in the column 82 decreases, processable blocks are more likely to occur on the right side of the column 82 .

The decrease in processable blocks is therefore suppressed. As a result, the threads started by the information processing apparatus 100 can be put to use, and the degree of parallelism is maintained. Maintaining the degree of parallelism speeds up the matrix calculation of the sweep-out method.

Next, the software structure and processing procedure of the information processing apparatus 100 will be described.
FIG. 13 is a block diagram illustrating an example of the software of the information processing apparatus.
The information processing apparatus 100 has a geometric data storage unit 121, a matrix storage unit 122, a control information storage unit 123, a matrix generation unit 124, a thread initialization unit 125, and a thread processing unit 126. The thread processing unit 126 has a sweep-out method execution unit 127, a block search unit 128, and a calculation cost calculation unit 129.

The geometric data storage unit 121, the matrix storage unit 122, and the control information storage unit 123 are implemented using, for example, storage areas of the RAM 102 or the GPU memory. The matrix generation unit 124, the thread initialization unit 125, and the thread processing unit 126 are implemented using, for example, the CPU 101 or the GPU 104 together with a program.

The geometric data storage unit 121 stores geometric data including a plurality of points. The matrix storage unit 122 stores a boundary matrix for persistent homology. The control information storage unit 123 stores control information used for reduction of the boundary matrix. The control information includes the column management table 131, the waiting block queue 132, and the thread table 133 described above.

The matrix generation unit 124 detects simplices such as points, line segments, triangles, and tetrahedra from the geometric data stored in the geometric data storage unit 121, and generates a boundary matrix indicating the boundary relationships between the simplices. The matrix generation unit 124 stores the generated boundary matrix in the matrix storage unit 122. Note that the generation of the boundary matrix and the reduction of the boundary matrix may be executed by different information processing apparatuses.

The thread initialization unit 125 controls parallelization of the sweep-out method using a plurality of threads. The thread initialization unit 125 starts a plurality of threads and has them executed on mutually different cores. The number of threads may be fixed, may be specified by the user, or may vary according to the number of processor cores. The thread initialization unit 125 also divides the matrix stored in the matrix storage unit 122 into a plurality of blocks.

In addition, the thread initialization unit 125 determines, from the matrix in its initial state, the columns whose processing can be omitted. The thread initialization unit 125 selects the diagonal blocks, that is, the blocks on the diagonal, as processable blocks and calculates the calculation cost of each processable block. The thread initialization unit 125 inserts information about the processable blocks into the waiting block queue 132 in descending order of calculation cost.

When all the threads have stopped, the thread initialization unit 125 outputs the processing result for the boundary matrix. The thread initialization unit 125 may display the processing result on the display device 111, store it in non-volatile storage, or transmit it to another information processing apparatus. The processing result may include the reduced boundary matrix or may include information on the detected pivots.

The thread processing unit 126 defines the processing of each of the plurality of threads. When its thread is in the READY state, the thread processing unit 126 extracts the block with the highest priority from the waiting block queue 132. If the waiting block queue 132 is empty and all threads are in the READY state, the thread processing unit 126 stops its own thread.

The sweep-out method execution unit 127 processes the block extracted from the waiting block queue 132. The sweep-out method execution unit 127 scans the elements while skipping columns whose processing has been determined to be omissible, detects non-zero elements, and attempts to eliminate each non-zero element by adding another column.

Each time the sweep-out method execution unit 127 finishes processing one block, the block search unit 128 searches the matrix for new processable blocks. The candidates for new processable blocks are the block to the right of and the block above the block that has just been processed. If the lower-right block has already been processed, the right block is a new processable block; if the upper-left block has already been processed, the upper block is a new processable block. The block search unit 128 inserts information about the processable blocks into the waiting block queue 132.
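The search rule described above can be sketched as follows. This is a minimal sketch, assuming blocks are addressed as B(row, column) in an upper-triangular grid and that processed blocks are tracked in a Python set; the function name `new_processable_blocks` and the parameter `n` are illustrative assumptions and do not appear in the described apparatus.

```python
def new_processable_blocks(done, finished, n):
    """Return the blocks that become processable once `finished` is done.

    done     -- set of (row, col) blocks already processed, including `finished`
    finished -- the (row, col) block that has just been processed
    n        -- number of block rows and block columns (hypothetical parameter)
    """
    r, c = finished
    found = []
    # Right neighbour B(r, c+1): its left neighbour is `finished`, so it only
    # needs its lower neighbour B(r+1, c+1), the lower-right of `finished`.
    if c + 1 < n and (r + 1, c + 1) in done:
        found.append((r, c + 1))
    # Upper neighbour B(r-1, c): its lower neighbour is `finished`, so it only
    # needs its left neighbour B(r-1, c-1), the upper-left of `finished`.
    if r - 1 >= 0 and (r - 1, c - 1) in done:
        found.append((r - 1, c))
    return found
```

Checking only these two neighbours is enough because a block can only become processable at the moment its left or lower neighbour finishes.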

The calculation cost calculation unit 129 calculates the calculation cost of a block. The calculation cost is, for example, the number of columns to be processed, which is the total number of columns included in the block minus the number of columns whose processing has been determined to be omissible. In the second embodiment, the calculation cost of a block is calculated at the time the block newly becomes processable.
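As a small illustration of this definition, the cost can be computed as a set difference. The function name `block_cost` and the representation of columns as integer indices are assumptions made for this sketch only.

```python
def block_cost(block_columns, skippable_columns):
    """Calculation cost = columns in the block minus the omissible ones."""
    return len(set(block_columns) - set(skippable_columns))


# A block covering columns 4..7 in which column 5 can be skipped; column 9
# lies outside the block and therefore has no effect on the result.
cost = block_cost(range(4, 8), {5, 9})
```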

FIG. 14 is a flowchart illustrating an example of a thread initialization procedure.
(S10) The thread initialization unit 125 divides the matrix into a plurality of blocks. The side length of a block is, for example, the side length of the matrix ÷ (number of threads × α).

(S11) The thread initialization unit 125 determines, from the matrix in its initial state, the columns whose processing can be omitted. These columns include columns that will evidently become zero vectors.
(S12) The thread initialization unit 125 identifies the diagonal blocks as processable blocks. The thread initialization unit 125 calculates the calculation cost of each diagonal block based on the number of columns determined in step S11. The thread initialization unit 125 sorts the diagonal blocks in descending order of calculation cost and inserts each diagonal block's name, associated with its calculation cost, into the waiting block queue 132.

(S13) The thread initialization unit 125 starts a plurality of threads and has different cores of the CPU 101 or the GPU 104 execute different threads.
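Steps S10 to S12 can be sketched as below. This is a minimal sketch under stated assumptions: the function name `init_waiting_queue`, the integer treatment of α, rounding the block side down with a minimum of 1, and passing the omissible columns in as a ready-made set are all illustrative choices that the description does not specify. Starting the threads (S13) is left out.

```python
import math


def init_waiting_queue(matrix_side, num_threads, alpha, skippable_columns):
    # S10: side length of a block = matrix side / (number of threads * alpha).
    # Rounding down and clamping to 1 are assumptions of this sketch.
    block_side = max(1, matrix_side // (num_threads * alpha))
    num_blocks = math.ceil(matrix_side / block_side)
    skippable = set(skippable_columns)  # result of S11, supplied by the caller
    queue = []
    # S12: every diagonal block B(i, i) is processable in the initial state.
    for i in range(num_blocks):
        cols = range(i * block_side, min((i + 1) * block_side, matrix_side))
        cost = sum(1 for col in cols if col not in skippable)
        queue.append(((i, i), cost))
    # Descending cost; the stable sort keeps ties in ascending row order.
    queue.sort(key=lambda entry: -entry[1])
    return block_side, queue
```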
FIG. 15 is a flowchart illustrating an example of a procedure for thread processing according to the second embodiment.

(S20) The block search unit 128 determines whether there is a new processable block in the matrix. A new processable block is an unprocessed block whose left and lower adjacent blocks have both been processed and that is not yet registered in the waiting block queue 132. If there is a block that its own thread has just processed, the block search unit 128 may check only the blocks above and to the right of that block. If there is a new processable block, the process proceeds to step S21; otherwise, the process proceeds to step S22.

(S21) The calculation cost calculation unit 129 calculates the calculation cost of the new processable block detected in step S20. The calculation cost is calculated based on which of the columns included in the new processable block are currently determined to be omissible. The block search unit 128 inserts the block name of the new processable block, associated with its calculation cost, into the waiting block queue 132. In doing so, the block search unit 128 determines the insertion position based on the calculation cost so that the processable blocks remain in descending order of calculation cost.

(S22) The thread processing unit 126 determines whether any block is registered in the waiting block queue 132. If there is a block in the waiting block queue 132, the process proceeds to step S23; if the waiting block queue 132 is empty, the process proceeds to step S25.

(S23) The thread processing unit 126 extracts the block with the highest priority from the waiting block queue 132. The block with the highest priority is the block with the largest calculation cost; if several blocks share the largest calculation cost, it is the one among them with the smallest row number. The sweep-out method execution unit 127 executes the sweep-out method on the extracted block.

(S24) Based on the processing result of step S23, the sweep-out method execution unit 127 determines which of the columns included in the processed block have newly become omissible. Columns whose processing can be omitted include columns whose pivot has been determined, that is, columns in which a non-zero element remains without being eliminated. The process then returns to step S20.

(S25) The thread processing unit 126 determines whether any other thread is processing a block. If another thread is processing a block, the thread waits for a certain period of time and the process returns to step S22; if all threads are in the READY state, the thread processing unit 126 stops its own thread.
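The loop formed by steps S20 to S25 can be sketched with Python threads and a heap as follows. This is a simplified sketch, not the apparatus's implementation: `run_blocks`, `cost_of`, and `process` are hypothetical names, the READY-state check is modelled as "no block is currently being processed", and the determination of newly omissible columns (S24) is assumed to happen inside `process`.

```python
import heapq
import threading
import time


def run_blocks(n, cost_of, process, num_threads=4):
    """Process every block B(r, c) with r <= c < n under the ordering constraint.

    cost_of(r, c) and process(r, c) are hypothetical callbacks standing in for
    the calculation cost calculation and the sweep-out method on one block.
    """
    lock = threading.Lock()
    done, busy, order = set(), set(), []
    # heapq is a min-heap, so costs are negated; ties fall to the smaller row.
    queue = [(-cost_of(i, i), i, i) for i in range(n)]
    heapq.heapify(queue)

    def worker():
        while True:
            block = None
            with lock:
                if queue:                      # S22 -> S23: take the best block
                    _, r, c = heapq.heappop(queue)
                    block = (r, c)
                    busy.add(block)
                elif not busy:                 # S25: queue empty, nobody busy
                    return
            if block is None:                  # S25: wait, then re-check (S22)
                time.sleep(0.001)
                continue
            r, c = block
            process(r, c)                      # S23: sweep-out on this block
            with lock:
                busy.discard(block)
                done.add(block)
                order.append(block)
                # S20/S21: only the right and upper neighbours can be new.
                for cand, need in (((r, c + 1), (r + 1, c + 1)),
                                   ((r - 1, c), (r - 1, c - 1))):
                    if cand[0] >= 0 and cand[1] < n and need in done:
                        heapq.heappush(queue, (-cost_of(*cand), *cand))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

Because completion and neighbour checks happen under one lock, each block is pushed exactly once, by whichever of its left and lower neighbours finishes second.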

As described above, the information processing apparatus 100 of the second embodiment divides the boundary matrix for persistent homology into a plurality of blocks and processes the blocks in parallel using a plurality of threads. This speeds up the matrix calculation and, in turn, big data analysis.

In addition, the information processing apparatus 100 is not bound to processing one diagonal at a time: it promptly assigns an idle thread that has finished processing one block to any block that is processable at that point. This makes it possible to keep the threads busy and maintain the degree of parallelism, which speeds up the matrix calculation. The information processing apparatus 100 also estimates the calculation cost of each block and preferentially processes blocks with larger calculation costs. This lowers the risk that a block with a large calculation cost becomes a bottleneck and reduces the number of processable blocks, and so suppresses a drop in the degree of parallelism.

Further, the information processing apparatus 100 calculates the calculation cost of a processable block based on the number of omissible columns, which changes as the sweep-out method progresses. As a result, the calculation cost is estimated accurately, in such a way that it is proportional to the calculation time.

[Third Embodiment]
Next, a third embodiment will be described. The description focuses on the differences from the second embodiment described above, and descriptions of content that is the same as in the second embodiment may be omitted.

The third embodiment differs from the second embodiment in how the calculation cost is calculated. The information processing apparatus of the third embodiment can be implemented with the same structure as the hardware of the second embodiment shown in FIG. 2 and the software of the second embodiment shown in FIG. 13. The third embodiment is therefore described below using the same reference numerals as in FIGS. 2 and 13.

FIG. 16 is a diagram illustrating an example of how the calculation costs of peripheral blocks affect parallel processing.
Matrix 65 is the matrix obtained by continuing the sweep-out method on matrix 64 shown in FIG. 10. Blocks B(0,0), B(4,4), B(8,8), and B(9,9) have been processed. Blocks B(2,2), B(3,3), B(6,6), and B(8,9) are being processed. Blocks B(1,1), B(5,5), and B(7,7) are processable blocks.

The calculation cost of block B(1,1) is 1, that of block B(5,5) is 1, and that of block B(7,7) is 2. When the processing of block B(6,6) ends, the information processing apparatus 100 starts processing block B(7,7), which has the largest calculation cost among the processable blocks. Matrix 65 thus becomes matrix 66.

At this point, the calculation costs of blocks B(3,4), B(5,6), and B(7,8) have not yet been calculated under the method of the second embodiment. Suppose, however, that the calculation cost of block B(3,4) is 5, that of block B(5,6) is 7, and that of block B(7,8) is 2. Because the calculation cost of block B(5,6) is large, it is preferable to process block B(5,5) early so that the processing of block B(5,6) can start early.

However, because the calculation cost of block B(5,5) itself is small, the start of its processing is delayed, and as a result the start of processing of block B(5,6) is delayed as well. In the third embodiment, therefore, the information processing apparatus 100 corrects the calculation cost of a processable block based on the calculation costs of peripheral blocks.

FIG. 17 is a diagram illustrating an example of correcting a calculation cost with reference to peripheral blocks.
Matrix 67 is the matrix at the point when, after matrix 65, the processing of block B(6,6) has finished. In the third embodiment, when the processing of a block finishes, the information processing apparatus 100 calculates the calculation cost of the block adjacent above it. The block whose calculation cost is calculated need not be a processable block. Here, the calculation cost of block B(3,4) is 5, and that of block B(7,8) is 2. When the processing of block B(6,6) finishes, the information processing apparatus 100 calculates the calculation cost of block B(5,6) as 7.

When the information processing apparatus 100 has calculated the calculation cost c1 of a block and a processable block is adjacent to that block on its left, the information processing apparatus 100 corrects the calculation cost c2 of that adjacent processable block. If the calculation cost c1 is greater than the calculation cost c2, the information processing apparatus 100 takes the arithmetic mean of c1 and c2 as the calculation cost of the adjacent processable block. The calculation cost c2 is thus updated as c2 = (c1 + c2) ÷ 2.

Here, the information processing apparatus 100 corrects the calculation cost of block B(5,5) from 1 to 4. The information processing apparatus 100 then starts processing block B(5,5), which now has the largest calculation cost among the processable blocks. Matrix 67 thus becomes matrix 68. As a result, the processing of block B(5,6) starts earlier.
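The correction rule can be written as a one-line function. The name `corrected_cost` is an assumption for this sketch; the arithmetic follows the update c2 = (c1 + c2) ÷ 2 stated in the description and reproduces the example in which the cost of B(5,5) changes from 1 to 4 when B(5,6) has cost 7.

```python
def corrected_cost(c2, c1):
    """Return the cost of the left-hand processable block after correction.

    c2 -- current cost of the processable block
    c1 -- cost just calculated for the block adjacent on its right
    """
    # Only raise the cost; a cheaper right-hand neighbour changes nothing.
    return (c1 + c2) / 2 if c1 > c2 else c2


new_cost = corrected_cost(1, 7)  # B(5,5) with B(5,6) on its right
```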

FIG. 18 is a flowchart illustrating an example of a thread processing procedure according to the third embodiment.
(S30) The block search unit 128 determines whether there is a new processable block in the matrix. If there is a new processable block, the process proceeds to step S31; otherwise, the process proceeds to step S32.

(S31) The block search unit 128 inserts the block name of the new processable block, associated with its calculation cost, into the waiting block queue 132. As the calculation cost, the block search unit 128 uses the value calculated in advance in step S36, which is described later.

(S32) The thread processing unit 126 determines whether any block is registered in the waiting block queue 132. If there is a block in the waiting block queue 132, the process proceeds to step S33; if the waiting block queue 132 is empty, the process proceeds to step S40.

(S33) The thread processing unit 126 extracts the block with the highest priority from the waiting block queue 132. The block with the highest priority is the block with the largest calculation cost; if several blocks share the largest calculation cost, it is the one among them with the smallest row number. The sweep-out method execution unit 127 executes the sweep-out method on the extracted block.

(S34) Based on the processing result of step S33, the sweep-out method execution unit 127 determines which of the columns included in the processed block have newly become omissible.
(S35) The calculation cost calculation unit 129 determines whether the block row number of the block B(n,m) processed in step S33 is 0 (that is, whether n = 0). If so, the process returns to step S30; otherwise, the process proceeds to step S36.

(S36) The calculation cost calculation unit 129 calculates the calculation cost of block B(n-1,m), which is adjacent above block B(n,m). The calculation cost is calculated based on which of the columns included in block B(n-1,m) are currently determined to be omissible.

(S37) The calculation cost calculation unit 129 determines whether block B(n-1,m-1), which is adjacent to the left of block B(n-1,m), is a block awaiting processing that is registered in the waiting block queue 132. If it is, the process proceeds to step S38; otherwise, the process returns to step S30.

(S38) The calculation cost calculation unit 129 determines whether the calculation cost of block B(n-1,m-1) is smaller than the calculation cost of block B(n-1,m). If so, the process proceeds to step S39; otherwise, the process returns to step S30.

(S39) The calculation cost calculation unit 129 corrects the calculation cost of block B(n-1,m-1) to the arithmetic mean of the calculation cost of block B(n-1,m-1) and the calculation cost of block B(n-1,m). The process then returns to step S30.

(S40) The thread processing unit 126 determines whether any other thread is processing a block. If another thread is processing a block, the thread waits for a certain period of time and the process returns to step S32; if all threads are in the READY state, the thread processing unit 126 stops its own thread.
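Steps S35 to S39 can be sketched as a single update applied after a block has been processed. This is a minimal sketch under assumptions: the function name `update_costs_after`, the dictionary `costs`, the set `waiting`, and the callback `cost_of` are illustrative stand-ins for the waiting block queue 132 and the calculation cost calculation unit 129, and locking between threads is omitted.

```python
def update_costs_after(block, costs, waiting, cost_of):
    """Apply steps S35 to S39 for a block B(n, m) that was just processed.

    costs   -- dict mapping (row, col) to a calculation cost
    waiting -- set of blocks currently registered in the waiting queue
    cost_of -- hypothetical callable returning the current cost of a block
    """
    n, m = block
    if n == 0:                        # S35: no block above block row 0
        return
    upper = (n - 1, m)
    costs[upper] = cost_of(*upper)    # S36: cost of B(n-1, m)
    left = (n - 1, m - 1)
    if left not in waiting:           # S37: left neighbour must be waiting
        return
    if costs[left] < costs[upper]:    # S38: is the left neighbour cheaper?
        costs[left] = (costs[left] + costs[upper]) / 2   # S39: arithmetic mean
```

If the waiting queue is held in a priority structure such as a heap, the entry for the corrected block would also need to be repositioned; how the apparatus does this is not stated, so the sketch leaves it out.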

As described above, the information processing apparatus 100 of the third embodiment provides the same effects as the second embodiment. Furthermore, when a block with a large calculation cost is adjacent to the right of a processable block, the information processing apparatus 100 increases the calculation cost of that processable block. As a result, among the blocks that become processable next, those with large calculation costs are processed earlier, which suppresses a drop in the degree of parallelism and speeds up the matrix calculation.

10 Information processing apparatus
11 Storage unit
12 Processing unit
13 Matrix data
13a, 13b, 13c Processable blocks
14a, 14b, 14c Calculation costs
15a, 15b Threads

Claims

A data analysis program that causes a computer to execute a process comprising:
identifying, from among unprocessed blocks among a plurality of grid-like blocks included in matrix data, a block that has no other unprocessed block adjacent to it in a first direction and no other unprocessed block adjacent to it in a second direction orthogonal to the first direction, as a processable block whose processing can be started; and
when there are a plurality of processable blocks, allocating an idle thread among a plurality of threads executed in parallel to the processable blocks, giving priority to those with a larger calculation cost, the calculation cost being based on the calculation load of each of the plurality of processable blocks.

The data analysis program according to claim 1, wherein
in identifying the processable block, blocks on the diagonal among the plurality of blocks are identified as processable blocks in an initial state, and
in allocating the idle thread, after processing of some of the blocks on the diagonal has finished, a processable block on the diagonal and a processable block not on the diagonal are allowed to be processed in parallel by different threads.

The data analysis program according to claim 1, further causing the computer to execute a process of
calculating the calculation cost of the processable block based on a processing result of the block adjacent in the first direction.

The data analysis program according to claim 3, wherein
the block adjacent in the first direction and the processable block include a plurality of columns in common, and columns whose processing can be omitted are determined from among the plurality of columns based on the processing result of the block adjacent in the first direction, and
the calculation cost is calculated based on the number of columns whose processing can be omitted.

The data analysis program according to claim 1, further causing the computer to execute a process of
correcting the calculation cost of the processable block based on another calculation cost of an unprocessed block within a certain range from the processable block.

The data analysis program according to claim 5, wherein
correcting the calculation cost increases the calculation cost when the other calculation cost of an unprocessed block adjacent in a third direction opposite to the second direction is greater than the calculation cost.

A data analysis method in which a computer executes a process comprising:
identifying, from among unprocessed blocks among a plurality of grid-like blocks included in matrix data, a block that has no other unprocessed block adjacent to it in a first direction and no other unprocessed block adjacent to it in a second direction orthogonal to the first direction, as a processable block whose processing can be started; and
when there are a plurality of processable blocks, allocating an idle thread among a plurality of threads executed in parallel to the processable blocks, giving priority to those with a larger calculation cost, the calculation cost being based on the calculation load of each of the plurality of processable blocks.

An information processing apparatus comprising:
a storage unit that stores matrix data including a plurality of grid-like blocks; and
a processing unit that identifies, from among unprocessed blocks among the plurality of blocks, a block that has no other unprocessed block adjacent to it in a first direction and no other unprocessed block adjacent to it in a second direction orthogonal to the first direction, as a processable block whose processing can be started, and that, when there are a plurality of processable blocks, allocates an idle thread among a plurality of threads executed in parallel to the processable blocks, giving priority to those with a larger calculation cost, the calculation cost being based on the calculation load of each of the plurality of processable blocks.