JP2023001055A - Program, data processing apparatus, and data processing method

Info

Publication number: JP2023001055A
Application number: JP2022092512A
Authority: JP
Inventors: Mohammad Bagherbeik; Ali Sheikholeslami
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2021-06-18
Filing date: 2022-06-07
Publication date: 2023-01-04

Abstract

To solve assignment problems at high speed.

SOLUTION: A storage unit 11 stores a flow matrix representing flows between a plurality of entities to be assigned to a plurality of destinations, and a distance matrix representing distances between the plurality of destinations. A processing unit 12 calculates, with vector arithmetic operations based on the flow and distance matrices, a first change in an evaluation function that would be caused by a first assignment change exchanging the destinations of first and second entities among the plurality of entities. The processing unit 12 determines, based on the first change, whether to accept the first assignment change. When it determines to accept the first assignment change, it updates an assignment state and updates the distance matrix by swapping the two columns or two rows (two columns in the example of FIG. 2) of the distance matrix that correspond to the first and second entities.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a program, a data processing apparatus, and a data processing method.

Assignment problems are a class of NP-hard combinatorial optimization problems with a variety of real-world applications, such as vehicle routing and block placement in field-programmable gate arrays (FPGAs). One example of an assignment problem is the quadratic assignment problem (QAP) (see, for example, Non-Patent Document 1).

The quadratic assignment problem asks for an assignment of n elements (facilities, etc.) to n destinations that minimizes the sum of the products of the flow between each pair of elements (a cost such as the amount of goods transported between facilities) and the distance between the destinations to which those elements are assigned. That is, the quadratic assignment problem is the problem of searching for an assignment that minimizes the value of the evaluation function given by Equation (1) below. The evaluation function represents the cost of an assignment state and is also called a cost function.
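Consistent with the description above, Equation (1) can be written as follows. This is a reconstruction from the surrounding definitions; the name C(φ) for the evaluation function and the double sum over ordered pairs (rather than over unordered pairs) are assumptions.

```latex
C(\phi) = \sum_{i=1}^{n} \sum_{j=1}^{n} f_{i,j}\, d_{\phi(i),\phi(j)} \tag{1}
```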

In Equation (1), f_{i,j} is the flow between the elements with identification numbers i and j, and d_{φ(i),φ(j)} is the distance between the destinations to which the elements with identification numbers i and j are assigned.
Boltzmann machines (also called Ising machines), which use an Ising-type evaluation function, are devices for computing large-scale discrete optimization problems that von Neumann computers handle poorly. A Boltzmann machine is a type of recurrent neural network.

A Boltzmann machine converts a combinatorial optimization problem into an Ising model, which describes the behavior of spins in a magnetic material. The Boltzmann machine then searches for the state of the Ising model that minimizes the value of an Ising-type evaluation function, using a Markov chain Monte Carlo method such as simulated annealing or parallel tempering (see, for example, Non-Patent Document 2) (see, for example, Non-Patent Document 3). The value of the evaluation function corresponds to energy. By inverting the sign of the evaluation function, the Boltzmann machine can also search for a state that maximizes its value. The state of the Ising model is represented by a combination of the values of a plurality of state variables (also called neuron values). Each state variable can take the value 0 or 1.

An Ising-type evaluation function is defined, for example, by Equation (2) below.

The first term on the right-hand side sums, over all combinations of two state variables selected from the N state variables of the Ising model, without omission or duplication, the product of the values (0 or 1) of the two state variables and a weight value (representing the strength of the interaction between the two state variables). s_i is the state variable with identification number i, s_j is the state variable with identification number j, and w_{i,j} is the weight value indicating the magnitude of the interaction between the state variables with identification numbers i and j. The second term on the right-hand side is the sum, over all identification numbers, of the product of a bias coefficient and the state variable. b_i is the bias coefficient for identification number i.

The change in energy (ΔE_i) caused by a change in the value of s_i is given by Equation (3) below.

In Equation (3), Δs_i is −1 when s_i changes from 1 to 0 and +1 when s_i changes from 0 to 1. h_i is called the local field, and ΔE_i is h_i multiplied by a sign (+1 or −1) that depends on Δs_i.
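The relationship between the energy, the local field, and ΔE_i can be checked numerically with a minimal sketch. The sign convention used here (E = −Σ_{i<j} w_{i,j} s_i s_j − Σ_i b_i s_i, so that ΔE_i = −Δs_i h_i) is the conventional Ising form and is an assumption, as are the function names and the random test instance.

```python
import numpy as np

def ising_energy(w, b, s):
    """E(s) = -sum_{i<j} w[i,j]*s[i]*s[j] - sum_i b[i]*s[i] (assumed sign convention)."""
    # w is symmetric with a zero diagonal, so 0.5 * s.w.s counts each pair once.
    return -0.5 * (s @ w @ s) - b @ s

def local_field(w, b, s, i):
    """h_i = sum_j w[i,j]*s[j] + b[i]."""
    return w[i] @ s + b[i]

def delta_energy(w, b, s, i):
    """Energy change when s[i] flips: -delta_s_i * h_i."""
    delta_s = 1 - 2 * s[i]  # +1 for a 0 -> 1 flip, -1 for a 1 -> 0 flip
    return -delta_s * local_field(w, b, s, i)

# Hypothetical 6-variable instance with integer-valued weights and biases.
rng = np.random.default_rng(0)
n = 6
upper = np.triu(rng.integers(-3, 4, size=(n, n)), 1)
w = (upper + upper.T).astype(float)
b = rng.integers(-2, 3, size=n).astype(float)
s = rng.integers(0, 2, size=n).astype(float)

# The local-field shortcut agrees with recomputing the energy from scratch.
for i in range(n):
    flipped = s.copy()
    flipped[i] = 1 - flipped[i]
    brute = ising_energy(w, b, flipped) - ising_energy(w, b, s)
    assert np.isclose(brute, delta_energy(w, b, s, i))
```

The point of the local field is that a single flip is evaluated with one row of w instead of the full quadratic form.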

Then, for example, when ΔE_i is smaller than a noise value (also called thermal noise) obtained from a random number and the value of a temperature parameter, the value of s_i is updated to cause a state transition, and the local fields are updated accordingly. This process is repeated.

Techniques for solving the quadratic assignment problem using such a Boltzmann machine have been proposed (see, for example, Patent Document 1 and Non-Patent Document 4).
The Ising-type evaluation function of the quadratic assignment problem is given by Equation (4) below.

In Equation (4), x is the vector of state variables and represents the assignment state of the n elements to the n destinations. x^T can be written as (x_{1,1}, …, x_{1,n}, x_{2,1}, …, x_{2,n}, ……, x_{n,1}, …, x_{n,n}). x_{i,j} = 1 indicates that the element with identification number i is assigned to the destination with identification number j, and x_{i,j} = 0 indicates that it is not.

W is a matrix of weight values and is given by Equation (5) below, using the flows (f_{i,j}) described above and the matrix D of distances between the n destinations.

Patent Document 1: U.S. Patent Application Publication No. 2021/0326679

Non-Patent Document 1: Eugene L. Lawler, "The quadratic assignment problem", Management Science, Vol. 9, No. 4, pp. 586-599, July 1963
Non-Patent Document 2: Robert H. Swendsen and Jian-Sheng Wang, "Replica Monte Carlo simulation of spin-glasses", Physical Review Letters, Vol. 57, No. 21, pp. 2607-2609, November 1986
Non-Patent Document 3: K. Dabiri, et al., "Replica Exchange MCMC Hardware With Automatic Temperature Selection and Parallel Trial", IEEE Transactions on Parallel and Distributed Systems, Vol. 31, No. 7, pp. 1681-1692, July 2020
Non-Patent Document 4: M. Bagherbeik et al., "A permutational Boltzmann machine with parallel tempering for solving combinatorial optimization problems", In International Conference on Parallel Problem Solving from Nature, pp. 317-331, Springer, 2020
Non-Patent Document 5: Rainer E. Burkard, Stefan E. Karisch, and Franz Rendl, "QAPLIB - a quadratic assignment problem library", Journal of Global Optimization, 10(4), pp. 391-403, 1997
Non-Patent Document 6: Gintaras Palubeckis, "An algorithm for construction of test cases for the quadratic assignment problem", Informatica, Lith. Acad. Sci., Vol. 11, No. 3, pp. 281-296, 2000
Non-Patent Document 7: Zvi Drezner, Peter M. Hahn, and Eric D. Taillard, "Recent advances for the quadratic assignment problem with special emphasis on instances that are difficult for meta-heuristic methods", Annals of Operations Research, 139(1), pp. 65-94, 2005
Non-Patent Document 8: Allyson Silva, Leandro C. Coelho, and Maryam Darvish, "Quadratic assignment problem variants: A survey and an effective parallel memetic iterated tabu search", European Journal of Operational Research, 2020
Non-Patent Document 9: Danny Munera, Daniel Diaz, and Salvador Abreu, "Hybridization as cooperative parallelism for the quadratic assignment problem", In Hybrid Metaheuristics, pp. 47-61, Springer International Publishing Switzerland, 2016
Non-Patent Document 10: Kresimir Mihic, Kevin Ryan, and Alan Wood, "Randomized decomposition solver with the quadratic assignment problem as a case study", INFORMS Journal on Computing, Vol. 30, No. 2, pp. 295-308, 2018

In methods that solve the quadratic assignment problem using an evaluation function such as the above, processing such as calculating the energy change and updating the local fields is repeated many times, and memory is accessed many times as well, so the computation is time-consuming.

In one aspect, an object of the present invention is to provide a program, a data processing apparatus, and a data processing method capable of solving assignment problems at high speed.

In one embodiment, there is provided a program that causes a computer to execute a process of searching for a solution to an assignment problem by local search using an evaluation function that represents a cost corresponding to an assignment state. The process includes: calculating, with vector arithmetic operations and based on a flow matrix and a distance matrix stored in a memory, a first change in the evaluation function that would be caused by a first assignment change in which the destinations of a first element and a second element among a plurality of elements are exchanged, the flow matrix representing flows between the plurality of elements to be assigned to a plurality of destinations and the distance matrix representing distances between the plurality of destinations; determining, based on the first change, whether to accept the first assignment change; and, upon determining to accept the first assignment change, updating the assignment state and updating the distance matrix by swapping the two columns or two rows of the distance matrix that correspond to the first element and the second element.

In one embodiment, a data processing apparatus is also provided.
In one embodiment, a data processing method is also provided.

In one aspect, the present invention makes it possible to solve assignment problems at high speed.

FIG. 1 is a diagram showing an example of a QAP computation.
FIG. 2 is a diagram showing an example of reordering the distance matrix during a QAP computation and an example of a data processing apparatus.
FIG. 3 is a diagram showing an example of the update calculation for a cache matrix.
FIG. 4 is a diagram showing an example of a solver system that performs parallel tempering.
FIG. 5 is a diagram showing an example of an algorithm that searches for a QAP solution by local search with parallel tempering.
FIG. 6 is a flowchart showing the overall flow of local search with parallel tempering.
FIG. 7 is a flowchart showing an example of the flow of replica initialization for the QAP.
FIG. 8 is a flowchart showing an example of the flow of replica search processing for the QAP.
FIG. 9 is a diagram showing an example of an algorithm that searches for a QSAP solution by local search with parallel tempering.
FIG. 10 is a flowchart showing an example of the flow of replica initialization for the QSAP.
FIG. 11 is a flowchart showing an example of the flow of replica search processing for the QSAP.
FIG. 12 is a diagram showing evaluation results for the speedup of the SAM and BM$ methods relative to the VΔC method.
FIG. 13 is a diagram showing evaluation results for the speedup of vector arithmetic processing relative to scalar arithmetic processing.
FIG. 14 is a diagram showing an example of load balancing.
FIG. 15 is a diagram showing evaluation results for the speedup of arithmetic processing obtained by load balancing.
FIG. 16 is a diagram showing an example of a measurement algorithm.
FIG. 17 is a diagram showing the relative speedups measured for the VΔC, SAM, and BM$ methods and the level of the memory hierarchy occupied as a function of problem size.
FIG. 18 is a diagram showing an example of a ΔC generation circuit.
FIG. 19 is a diagram showing a first example of a hardware configuration for swapping columns.
FIG. 20 is a diagram showing an example of swapping columns.
FIG. 21 is a diagram showing a second example of a hardware configuration for swapping columns.
FIG. 22 is a diagram showing a first modification of the second example of the hardware configuration for swapping columns.
FIG. 23 is a diagram showing a second modification of the second example of the hardware configuration for swapping columns.
FIG. 24 is a diagram showing a third example of a hardware configuration for swapping columns.
FIG. 25 is a diagram showing a modification of the third example of the hardware configuration for swapping columns.
FIG. 26 is a diagram showing another example of a ΔC generation circuit.
FIG. 27 is a diagram showing an example of a hardware configuration that processes two replicas.
FIG. 28 is a diagram showing another example of a ΔC generation circuit.
FIG. 29 is a diagram showing an example of a replica processing circuit.
FIG. 30 is a diagram showing an example of a replica processing circuit used in a QAP computation with asymmetric matrices.
FIG. 31 is a diagram showing a hardware example of a computer that is an example of the data processing apparatus.

Hereinafter, embodiments for carrying out the invention are described with reference to the drawings.
The data processing apparatus of the present embodiment searches, by local search, for a solution to a quadratic assignment problem (QAP) or a quadratic semi-assignment problem (QSAP) as an example of an assignment problem. The QAP, the QSAP, and local search are described below.

(QAP)
FIG. 1 is a diagram showing an example of a QAP computation. The QAP asks for an assignment of n elements (facilities, etc.) to n destinations that minimizes the sum of the products of the flow between elements and the distance between the destinations to which the elements are assigned.

FIG. 1 shows an example in which four facilities with identification numbers 1 to 4 are assigned to four destinations (L_1 to L_4).
The flow matrix representing the flows among the n elements is given by Equation (6) below.

The flow matrix (F) is a matrix with n rows and n columns. f_{i,j} is the flow in row i and column j and represents the flow between the elements with identification numbers i and j. For example, the flow between the facility with identification number 1 and the facility with identification number 2 in FIG. 1 is written f_{1,2}.

The distance matrix representing the distances among the n destinations is given by Equation (7) below.

The distance matrix (D) is a matrix with n rows and n columns. d_{k,l} is the distance in row k and column l and represents the distance between the destinations with identification numbers k and l. For example, the distance between the destination with identification number 1 (L_1) and the destination with identification number 2 (L_2) in FIG. 1 is written d_{1,2}.

The QAP is solved by searching for an assignment that minimizes Equation (1) above. The assignment state of the n elements to the n destinations is represented by an integer assignment vector φ or a binary state matrix X. φ is a member of the set Φ_n, where Φ_n is the set of all permutations of the set N = {1, 2, 3, …, n}. The binary variables x_{i,j} contained in the binary state matrix X are given by Equation (8) below.

FIG. 1 shows an example in which the facility with identification number 1 is assigned to the destination with identification number 2, the facility with identification number 2 to the destination with identification number 3, the facility with identification number 3 to the destination with identification number 4, and the facility with identification number 4 to the destination with identification number 1. The integer assignment vector φ is φ(1) = 2, φ(2) = 3, φ(3) = 4, and φ(4) = 1. In the binary state matrix X, x_{1,2}, x_{2,3}, x_{3,4}, and x_{4,1} are 1 and all other x_{i,j} are 0.
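The two representations of the FIG. 1 assignment can be illustrated with a short sketch. The values of φ come from the text; the flow and distance values, the helper names, and the use of 0-based indices are hypothetical choices made only for illustration, and the cost is summed over ordered pairs as an assumed normalization.

```python
import numpy as np

# Assignment of FIG. 1, 1-based as in the text: phi(1)=2, phi(2)=3, phi(3)=4, phi(4)=1.
phi = np.array([2, 3, 4, 1]) - 1  # converted to 0-based destination indices

def state_matrix(phi):
    """Binary state matrix X with x[i, j] = 1 iff element i is assigned to destination j."""
    n = len(phi)
    x = np.zeros((n, n), dtype=int)
    x[np.arange(n), phi] = 1
    return x

def qap_cost(f, d, phi):
    """sum_i sum_j f[i, j] * d[phi[i], phi[j]] (sum over ordered pairs; assumption)."""
    return int((f * d[np.ix_(phi, phi)]).sum())

# Hypothetical symmetric, zero-diagonal flow and distance matrices.
f = np.array([[0, 3, 0, 2],
              [3, 0, 1, 0],
              [0, 1, 0, 4],
              [2, 0, 4, 0]])
d = np.array([[0, 1, 2, 3],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [3, 2, 1, 0]])

x = state_matrix(phi)
cost = qap_cost(f, d, phi)

# The same cost follows from the binary state matrix: (X D X^T)[i, j] = d[phi[i], phi[j]].
assert cost == int((f * (x @ d @ x.T)).sum())
```

The integer vector needs n entries where the binary matrix needs n², which is one reason the later sections work with φ directly.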

QAP instances include those in which one or both of the flow matrix and the distance matrix are symmetric and those in which both are asymmetric. The present embodiment focuses mainly on QAPs with symmetric matrices whose diagonal elements are 0 (bias-less), because such QAPs make up the majority of instances and because this simplifies the computation. Note, however, that the QAP with symmetric matrices can be converted directly to the QAP with asymmetric matrices.

(QSAP)
The QSAP is a variant of the QAP. In the QSAP, the number of elements and the number of destinations are not equal. For example, multiple elements may be assigned to each destination. In the QSAP, the distance matrix is given by Equation (9) below.

The diagonal elements of the distance matrix (D) are non-zero in order to account for routing within a destination. The QSAP additionally uses a matrix B given by Equation (10) below.

b_{i,k} represents the fixed cost of assigning the element with identification number i to the destination with identification number k.
The QSAP is solved by searching for an assignment that minimizes Equation (11) below, instead of Equation (1), the evaluation function of the QAP.

The assignment state of n elements to m destinations is represented by an integer assignment vector ψ (ψ ∈ [1, m]^n) or a binary state matrix S. The binary variables s_{i,j} contained in the binary state matrix S are given by Equation (12) below.

(Local search)
In local search, candidate solutions are sought among the neighboring states that can be reached by a change to the current state. One way to perform local search on the QAP is pairwise exchange of elements: two elements are selected and their destinations are exchanged. The change in the value of the evaluation function (Equation (1)) caused by exchanging the destinations (identification numbers φ(a) and φ(b)) of two elements (identification numbers a and b) is given by Equation (13) below.
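A pairwise-exchange change of this kind can be sketched as a multiply-accumulate loop that skips i = a and i = b. The sketch assumes symmetric, zero-diagonal matrices and a cost summed over ordered pairs, which is where the factor of 2 comes from; the exact normalization of Equation (13), the function names, and the random instance are assumptions.

```python
import numpy as np

def qap_cost(f, d, phi):
    """sum_i sum_j f[i, j] * d[phi[i], phi[j]] (ordered pairs; assumed normalization)."""
    return (f * d[np.ix_(phi, phi)]).sum()

def delta_exchange_loop(f, d, phi, a, b):
    """Cost change when elements a and b exchange destinations (scalar loop)."""
    acc = 0
    for i in range(len(phi)):
        if i == a or i == b:
            continue  # the terms for i = a and i = b are skipped
        acc += (f[b, i] - f[a, i]) * (d[phi[a], phi[i]] - d[phi[b], phi[i]])
    return 2 * acc

def random_symmetric(rng, n):
    """Hypothetical helper: random symmetric integer matrix with a zero diagonal."""
    upper = np.triu(rng.integers(0, 10, size=(n, n)), 1)
    return upper + upper.T

rng = np.random.default_rng(1)
n = 8
f, d = random_symmetric(rng, n), random_symmetric(rng, n)
phi = rng.permutation(n)

# Check against recomputing the full cost after the exchange.
a, b = 2, 5
swapped = phi.copy()
swapped[[a, b]] = swapped[[b, a]]
assert delta_exchange_loop(f, d, phi, a, b) == qap_cost(f, d, swapped) - qap_cost(f, d, phi)
```

The loop costs O(n) per proposal, compared with O(n²) for recomputing the full cost.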

As Equation (13) shows, the change (ΔC_ex) is produced by a multiply-accumulate loop.
When local search by pairwise exchange is applied to the QSAP, the change in the value of the evaluation function (Equation (11)) is given by Equation (14) below.

In Equation (14), ΔB_ex is the change in the fixed cost of assigning elements to destinations and is given by Equation (15) below.

The state space of the QSAP is not restricted to the permutations described above. An element can be relocated to a destination regardless of whether other elements are already assigned to that destination. The change in the value of the evaluation function when the element with identification number a is reassigned from its current destination (identification number ψ(a)) to the destination with identification number l is given by Equations (16) and (17) below.
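A relocation move of this kind can be sketched as follows. The cost form (flow-distance products over ordered pairs plus the fixed costs b_{i,ψ(i)}), the symmetric matrices with f_{a,a} = 0, the function names, and the random instance are assumptions made for illustration; the sketch is not a transcription of Equations (16) and (17).

```python
import numpy as np

def qsap_cost(f, d, b, psi):
    """Assumed QSAP cost: sum_ij f[i,j]*d[psi[i],psi[j]] + sum_i b[i,psi[i]]."""
    n = len(psi)
    return (f * d[np.ix_(psi, psi)]).sum() + b[np.arange(n), psi].sum()

def delta_relocate(f, d, b, psi, a, l):
    """Cost change when element a moves from destination psi[a] to destination l."""
    acc = 0
    for i in range(len(psi)):
        # The i == a term contributes nothing because f[a, a] == 0.
        acc += f[a, i] * (d[l, psi[i]] - d[psi[a], psi[i]])
    return 2 * acc + b[a, l] - b[a, psi[a]]

# Hypothetical instance: n = 7 elements, m = 3 destinations.
rng = np.random.default_rng(2)
n, m = 7, 3
upper = np.triu(rng.integers(0, 10, size=(n, n)), 1)
f = upper + upper.T                      # symmetric flows, zero diagonal
u = rng.integers(1, 10, size=(m, m))
d = u + u.T                              # symmetric distances, non-zero diagonal
b = rng.integers(0, 10, size=(n, m))     # fixed assignment costs
psi = rng.integers(0, m, size=n)         # several elements may share a destination

a, l = 3, (psi[3] + 1) % m
moved = psi.copy()
moved[a] = l
assert delta_relocate(f, d, b, psi, a, l) == qsap_cost(f, d, b, moved) - qsap_cost(f, d, b, psi)
```

Because only one element moves, only row a of the flow matrix is read.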

In local search, whether to accept a proposed assignment change is decided based on the change in the evaluation function that the proposal would cause, calculated as described above. The decision follows a predefined criterion, such as the greedy method. With the greedy method, assignment changes that reduce the cost (the value of the evaluation function, that is, the energy) are accepted. The proposal acceptance rate (PAR) is then high at the start of the search but later tends toward 0, because the search gets stuck in a local minimum of the evaluation function and no further improving moves can be found.

Instead of the greedy method, the present embodiment can use a stochastic local search method such as simulated annealing. To add randomness to the assignment changes, a stochastic local search method can use a temperature parameter (T) together with the Metropolis acceptance probability (P_acc) given by Equation (18) below.

As T increases toward infinity, P_acc and the PAR increase until every proposal is accepted regardless of the value of ΔC. As T is lowered toward 0, proposal acceptance becomes greedy, and the PAR tends toward 0 because the search eventually gets stuck in a local minimum of the evaluation function.
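The limiting behavior described above can be seen in a minimal sketch of the Metropolis criterion. It assumes the standard form P_acc = min(1, exp(−ΔC/T)) for Equation (18); the function names are hypothetical.

```python
import math
import random

def acceptance_probability(delta_c, temperature):
    """Standard Metropolis rule (assumed form): min(1, exp(-delta_c / T))."""
    if delta_c <= 0:
        return 1.0  # improving or neutral moves are always accepted
    return math.exp(-delta_c / temperature)

def accept(delta_c, temperature, rng=random):
    """Draw a uniform random number and compare it with the acceptance probability."""
    return rng.random() < acceptance_probability(delta_c, temperature)

# High temperature: even a cost increase is accepted with probability close to 1.
assert acceptance_probability(5.0, 1e9) > 0.999
# Low temperature: the rule becomes greedy and cost increases are rejected.
assert acceptance_probability(5.0, 1e-3) == 0.0
```

In parallel tempering, each replica would apply the same rule with its own temperature.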

Calculating ΔC for the QAP and the QSAP accounts for most of the total processing time when the data processing apparatus searches for a solution to an assignment problem.
The computation can be accelerated by calculating ΔC with vector arithmetic operations, as described below.

(Vectorization of the ΔC calculation)
Hereinafter, this technique may be referred to as the VΔC method. The calculation of ΔC_ex for the QAP by Equation (13) could be vectorized easily as an inner product if it were not for the condition that the terms for i = a and i = b are skipped.

The data processing apparatus of the present embodiment removes this condition, turning the calculation of ΔC_ex into a multiply-accumulate loop over all n elements, and compensates for the removal by adding an extra product of a flow and a distance as a compensation term, as shown in Equation (19) below.

Equation (19) can be reformulated as an inner product, as in Equation (22), using the flow difference vector ΔF (Equation (20)) and the distance difference vector ΔD (Equation (21)) for the exchange of the destinations of the elements with identification numbers a and b.

In Equation (20), Δ_a^b F denotes the difference vector between row b and row a of the flow matrix, F_{b,*} denotes row b of the flow matrix, and F_{a,*} denotes row a of the flow matrix.
The elements of the vector ΔF are in the order of the original flow matrix, whereas the elements of the vector ΔD are ordered to correspond to the current assignment state. Mathematically, this is expressed in Equation (21) by multiplying the transposed binary state matrix X (written X^T) by (D_{φ(a),*} − D_{φ(b),*}). D_{φ(a),*} denotes row φ(a) of the distance matrix, and D_{φ(b),*} denotes row φ(b) of the distance matrix.
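The inner-product form with a compensation term can be sketched as follows. The sketch assumes symmetric, zero-diagonal matrices and a cost summed over ordered pairs; the compensation term 4·f_{a,b}·d_{φ(a),φ(b)} is derived under those assumptions and may differ from Equations (19) and (22) by a constant factor. Function names are hypothetical.

```python
import numpy as np

def qap_cost(f, d, phi):
    """sum_i sum_j f[i, j] * d[phi[i], phi[j]] (ordered pairs; assumed normalization)."""
    return (f * d[np.ix_(phi, phi)]).sum()

def delta_exchange_vectorized(f, d, phi, a, b):
    """Exchange cost change as an inner product over all n elements plus compensation."""
    delta_f = f[b] - f[a]                     # row difference, original element order
    delta_d = (d[phi[a]] - d[phi[b]])[phi]    # row difference, reordered to the current state
    # The i = a and i = b terms each add -f[a,b]*d[phi[a],phi[b]] to the inner product;
    # the compensation term cancels them.
    return 2 * (delta_f @ delta_d) + 4 * f[a, b] * d[phi[a], phi[b]]

def random_symmetric(rng, n):
    """Hypothetical helper: random symmetric integer matrix with a zero diagonal."""
    upper = np.triu(rng.integers(0, 10, size=(n, n)), 1)
    return upper + upper.T

rng = np.random.default_rng(3)
n = 8
f, d = random_symmetric(rng, n), random_symmetric(rng, n)
phi = rng.permutation(n)

a, b = 1, 6
swapped = phi.copy()
swapped[[a, b]] = swapped[[b, a]]
assert delta_exchange_vectorized(f, d, phi, a, b) == qap_cost(f, d, swapped) - qap_cost(f, d, phi)
```

The gather `[phi]` is the reordering step; it is what makes ΔD line up element by element with ΔF.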

In software, a calculation such as Equation (21) is implemented by reordering the elements of the vector ΔD in time proportional to the number of elements n.
In the QSAP, multiple elements can be placed at one destination, so a vector (ΔD) of size n is generated from the m-by-m distance matrix, as in Equation (23) below.

Using this vector ΔD and the vector ΔF of Equation (20), ΔC^s_ex can be calculated by Equation (24) below.

In the QSAP, nothing prevents multiple elements from being assigned to the same destination, so an additional compensation term, the second line of Equation (24), is added to ΔC.
The vectorized form of the calculation of ΔC^s_rel for a relocation, corresponding to Equation (17), is given by Equation (25) below.

再配置による要素の割当先の移動は、１つの要素について行われるため、ベクトルΔＦの計算と、フロー量と距離との積を加えることによる補償は必要ない。
（距離行列の並べ替え）
上記のＶΔＣ法は、ＳＩＭＤ（Single Instruction/Multiple Data）を用いることで、複数のＣＰＵ（Central Processing Unit）や複数のＧＰＵ（Graphics Processing Unit）などのプロセッサ上に実装可能である。 Since the movement of the assignment destination of the element due to rearrangement is performed for one element, it is not necessary to calculate the vector ΔF and compensate by adding the product of the flow amount and the distance.
(Rearrangement of distance matrix)
The above VΔC method can be implemented on processors such as multiple CPUs (Central Processing Units) and multiple GPUs (Graphics Processing Units) by using SIMD (Single Instruction/Multiple Data).

しかし、ＶΔＣ法で行われるΔＤの要素の並べ替えは、要素数＝ｎに比例する時間で実行することができるが、２つの要素の割当先の入れ替えによるΔＣの計算のたびに実行されるため処理効率がよくない。また、大量の計算コストがかかる可能性がある。 However, although the rearrangement of the elements of ΔD performed in the VΔC method can be executed in time proportional to the number of elements n, it is executed every time ΔC is calculated for an exchange of the allocation destinations of two elements, so the processing efficiency is poor. It can also incur a large computational cost.

適切な要素順をもつΔＤを生成する計算コストを最小限に抑えるために、本実施形態のデータ処理装置は、現在の割当状態に応じて距離行列の列を並べ替える。以下この手法をＳＡＭ（State-Aligned D Matrix）法と呼ぶ場合がある。また、以下では、このように並べ替えを行った距離行列を状態整列Ｄ行列と呼び、割当問題がＱＡＰの場合はＤ^Ｘ、割当問題がＱＳＡＰの場合はＤ^Ｓと表記する。 In order to minimize the computational cost of generating ΔD with the proper element order, the data processing apparatus of the present embodiment rearranges the columns of the distance matrix according to the current allocation state. Hereinafter, this method may be called the SAM (State-Aligned D Matrix) method. Also, hereinafter, the distance matrix rearranged in this way is called a state-aligned D matrix, and is denoted as D ^X when the allocation problem is QAP and D ^S when the allocation problem is QSAP.

図２は、ＱＡＰの計算時における距離行列の並べ替え例とデータ処理装置の例を示す図である。
データ処理装置１０は、たとえば、コンピュータであり、記憶部１１、処理部１２を有する。 FIG. 2 is a diagram showing an example of distance matrix rearrangement and an example of a data processing device when calculating QAP.
The data processing device 10 is, for example, a computer and has a storage section 11 and a processing section 12 .

記憶部１１は、たとえば、ＤＲＡＭ（Dynamic Random Access Memory）などの電子回路である揮発性の記憶装置、または、ＨＤＤ（Hard Disk Drive）やフラッシュメモリなどの電子回路である不揮発性の記憶装置である。記憶部１１は、ＳＲＡＭ（Static Random Access Memory）レジスタなどの電子回路を含んでいてもよい。 The storage unit 11 is, for example, a volatile storage device that is an electronic circuit such as a DRAM (Dynamic Random Access Memory), or a non-volatile storage device that is an electronic circuit such as a HDD (Hard Disk Drive) or flash memory. . The storage unit 11 may include an electronic circuit such as an SRAM (Static Random Access Memory) register.

記憶部１１は、たとえば、局所探索などをコンピュータに実行させるプログラムを記憶するとともに、前述のフロー行列、距離行列、状態整列Ｄ行列（Ｄ^ＸまたはＤ^Ｓ）、割当状態（整数割当ベクトルφまたはバイナリ状態行列Ｘで表される）を記憶する。 The storage unit 11 stores, for example, a program that causes a computer to execute a local search and the like, and also stores the aforementioned flow matrix, distance matrix, state-aligned D matrix (D ^X or D ^S ), allocation state (integer allocation vector φ or binary state matrix X).

処理部１２は、たとえば、ＣＰＵ、ＧＰＵ、ＤＳＰ（Digital Signal Processor）などのハードウェアであるプロセッサにより実現できる。また、処理部１２は、ＡＳＩＣ（Application Specific Integrated Circuit）やＦＰＧＡなどの電子回路により実現されるようにしてもよい。処理部１２は、記憶部１１に記憶されたプログラムを実行して、局所探索処理をデータ処理装置１０に行わせる。なお、処理部１２は、複数のプロセッサの集合であってもよい。 The processing unit 12 can be realized by a processor, which is hardware such as a CPU, GPU, DSP (Digital Signal Processor), for example. Also, the processing unit 12 may be implemented by an electronic circuit such as an ASIC (Application Specific Integrated Circuit) or FPGA. The processing unit 12 executes a program stored in the storage unit 11 to cause the data processing device 10 to perform local search processing. Note that the processing unit 12 may be a set of multiple processors.

処理部１２は、２つの要素の割当先を入れ替える処理を繰り返し、たとえば、式（１）または式（１１）に示した評価関数の値が極小になる割当状態を探索する。評価関数の極小値のうちの最小値になる割当状態が最適解となる。なお、式（１）または式（１１）に示した評価関数の符号を変えれば、処理部１２は、評価関数の値が極大になる割当状態を探索することもできる（この場合、最大値が最適解となる）。 The processing unit 12 repeats the process of interchanging the allocation destinations of the two elements, and searches for an allocation state in which the value of the evaluation function shown in Equation (1) or Equation (11) is minimized, for example. The optimum solution is the allocation state that has the minimum value among the minimum values of the evaluation function. By changing the sign of the evaluation function shown in Equation (1) or Equation (11), the processing unit 12 can also search for an allocation state in which the value of the evaluation function is maximized (in this case, the maximum value is optimal solution).

図２には、ＱＡＰの計算時における距離行列の並べ替え例が示されている。あるタイムステップ＝ｔの場合の割当状態と、その割当状態に対応した配列となっている状態整列Ｄ行列が、φ^（ｔ）、Ｄ^Ｘ（ｔ）と表記されている。識別番号＝１，３の要素の割当先を入れ替える割当変更が提案され、受け入れられると、タイムステップ＝ｔ＋１では、図２のφ^{（ｔ＋１）}で表されているような割当状態となる。このとき、状態整列Ｄ行列として、受け入れられた割当状態に対応するように、Ｄ^Ｘ（ｔ）に対して１列目と３列目が入れ替えられたＤ^{Ｘ（ｔ＋１）}が生成される。 FIG. 2 shows an example of rearranging the distance matrix when calculating the QAP. The allocation state at a certain time step t and the state-aligned D matrix arranged to match that allocation state are denoted by φ ^(t) and D ^X(t) . If an allocation change that exchanges the allocation destinations of the elements with identification numbers 1 and 3 is proposed and accepted, the allocation state at time step t+1 becomes the one represented by φ ^(t+1) in FIG. 2. At this time, D ^X(t+1) , in which the first and third columns of D ^X(t) are swapped, is generated as the state-aligned D matrix so as to correspond to the accepted allocation state.
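The column swap on acceptance can be sketched in Python as follows. The sketch assumes the convention DX[r][k] = D[r][phi[k]], that is, column k of the state-aligned matrix belongs to element k; this convention, the function names, and the sample data are assumptions made for illustration and are not taken from equation (26).

```python
def build_state_aligned(D, phi):
    """Return DX with DX[r][k] = D[r][phi[k]] (columns follow the elements)."""
    return [[row[phi[k]] for k in range(len(phi))] for row in D]


def accept_swap(DX, phi, a, b):
    """Apply an accepted exchange in place: swap two columns and two entries."""
    for row in DX:
        row[a], row[b] = row[b], row[a]
    phi[a], phi[b] = phi[b], phi[a]


D = [[0, 2, 7, 3], [2, 0, 4, 6], [7, 4, 0, 1], [3, 6, 1, 0]]
phi = [2, 0, 3, 1]
DX = build_state_aligned(D, phi)
# 0-based indices 0 and 2 correspond to identification numbers 1 and 3.
accept_swap(DX, phi, 0, 2)
rebuilt = build_state_aligned(D, phi)
```

After the swap, `DX` matches the matrix rebuilt from scratch for the new allocation state, which is the invariant the column swap is meant to preserve.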

なお、提案された割当変更が受け入れられなかった場合、φ^（ｔ）とＤ^Ｘ（ｔ）の更新はされない。
ＱＡＰの場合、状態整列Ｄ行列（Ｄ^Ｘ）は、以下の式（２６）に示されるように、現在の割当状態を表すバイナリ状態行列Ｘの転置行列Ｘ^Ｔに、元の距離行列（Ｄ）を乗ずることで表すことができる。なお、ハードウェア的に状態整列Ｄ行列の列の入れ替えを行う場合のハードウェア構成の例については後述する。 Note that if the proposed allocation change is not accepted, φ ^(t) and D ^X(t) are not updated.
For QAP, the state-aligned D matrix (D ^X ) can be expressed by multiplying the transposed matrix X ^T of the binary state matrix X, which represents the current allocation state, by the original distance matrix (D), as shown in equation (26) below. An example of a hardware configuration for swapping the columns of the state-aligned D matrix in hardware will be described later.

ベクトルΔＤ^Ｘは、以下の式（２７）により、前述のベクトルΔＤの要素の並べ替えを行わずに計算でき、ΔＤ^Ｘを前述の式（２２）に代入することで、ΔＣ_ｅｘが計算できる。 The vector ΔD ^X can be calculated by the following equation (27) without rearranging the elements of the vector ΔD described above, and ΔC _ex can be calculated by substituting ΔD ^X into the equation (22) described above.
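As a hedged illustration of why no reordering is needed, the Python sketch below compares a gather-based ΔD (VΔC style) with a plain row difference of the state-aligned matrix (SAM style). It assumes the convention DX[r][k] = D[r][phi[k]]; the function names and data are hypothetical, and the row-difference form is an assumption about what equation (27) expresses.

```python
def delta_d_gather(D, phi, a, b):
    """VdeltaC style: gather through phi on every evaluation."""
    return [D[phi[a]][phi[k]] - D[phi[b]][phi[k]] for k in range(len(phi))]


def delta_d_aligned(DX, phi, a, b):
    """SAM style: element-wise difference of two rows, no gather needed."""
    return [x - y for x, y in zip(DX[phi[a]], DX[phi[b]])]


D = [[0, 2, 7, 3], [2, 0, 4, 6], [7, 4, 0, 1], [3, 6, 1, 0]]
phi = [2, 0, 3, 1]
# State-aligned matrix under the assumed convention DX[r][k] = D[r][phi[k]].
DX = [[row[p] for p in phi] for row in D]
aligned = delta_d_aligned(DX, phi, 0, 2)
gathered = delta_d_gather(D, phi, 0, 2)
```

Both calls yield the same vector; the aligned version reads two contiguous rows, which is the access pattern that suits SIMD loads.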

ＱＳＡＰの場合、状態整列Ｄ行列（Ｄ^Ｓ）は、以下の式（２８）に示されるように、現在の割当状態を表すバイナリ状態行列Ｓの転置行列Ｓ^Ｔに、元の距離行列（Ｄ）を乗ずることで表すことができる。Ｄ^Ｓは、ｍ行ｎ列の行列となる。 For QSAP, the state-aligned D matrix (D ^S ) can be expressed by multiplying the transposed matrix S ^T of the binary state matrix S, which represents the current allocation state, by the original distance matrix (D), as shown in equation (28) below. D ^S is a matrix with m rows and n columns.

割当先交換と再配置によるベクトルΔＤ^Ｓはそれぞれ、以下の式（２９）、式（３０）により、前述のΔＤの要素の並べ替えを行わずに計算でき、前述の式（２４）、式（２５）に代入することで、ΔＣ^ｓ _ｅｘとΔＣ^ｓ _ｒｅｌを計算できる。 The vectors ΔD ^S for a destination exchange and for a relocation can be calculated by equations (29) and (30) below, respectively, without rearranging the elements of ΔD. Substituting them into equations (24) and (25) above yields ΔC ^s _ex and ΔC ^s _rel .

前述のＶΔＣ法では、ベクトルΔＤの要素を並べ替えるには、ΔＣが計算されるたびに要素数＝ｎに比例する操作が行われる。これとは対照的に、上記の手法（ＳＡＭ法）では、割当状態に一致するようにＤ^ＸやＤ^Ｓの並べ替えが行われるのは、提案された割当変更が受け入れられた場合のみである。これにより、平均して反復ごとに必要な操作の数が削減され計算効率が上がるため、計算時間が短縮され、割当問題を高速に計算できるようになる。 In the VΔC method described above, to rearrange the elements of the vector ΔD, an operation proportional to the number of elements=n is performed each time ΔC is calculated. In contrast, in the above approach (the SAM method), D ^X and D ^S are reordered to match the allocation state only if the proposed allocation change is accepted. . This increases computational efficiency by reducing the number of operations required per iteration on average, thus reducing computation time and allowing the allocation problem to be computed quickly.

（ボルツマンマシンキャッシング法（比較例））
ΔＣ計算をベクトル化して行う手法には以下のような手法も考えられる。この手法は、バイナリ状態行列のビットの反転に対応する部分的なΔＣの値（前述の式（３）に示した局所場（ｈ_ｉ）に対応する）を計算して保存する方法である。以下この手法をボルツマンマシンキャッシング法（ＢＭ＄法と表記する）と呼ぶ場合がある。 (Boltzmann machine caching method (comparative example))
The following method is also conceivable as a method of vectorizing the ΔC calculation. This technique is a method of calculating and storing partial ΔC values (corresponding to the local fields (h _i ) shown in equation (3) above) corresponding to bit inversions of the binary state matrix. Hereinafter, this method may be called the Boltzmann machine caching method (abbreviated as BM$ method).

バイナリ状態行列（ＸまたはＳ）の各ビットに対応して、キャッシュされた局所場が用いられる。割当変更の提案が受け入れられたときに、ｎ^２に比例する時間でｎ行ｎ列の局所場による行列（以下キャッシュ行列という）の更新が行われることになるが、ΔＣの計算については、ｎに依存しない時間で行うことができる。 A cached local field is used for each bit of the binary state matrix (X or S). When an allocation change proposal is accepted, the n-row, n-column matrix of local fields (hereinafter referred to as the cache matrix) is updated in time proportional to n^2, whereas ΔC can be calculated in time independent of n.

ＱＡＰの場合、キャッシュ行列（Ｈ）は、探索処理の開始時に式（３１）により、ｎ^３に比例する時間で生成される。 For QAP, the cache matrix (H) is generated by equation (31) at the start of the search process in time proportional to n^3.

キャッシュされた局所場を使用して、式（３２）により内積ΔＦ・ΔＤを生成できる。そして、生成したΔＦ・ΔＤを前述の式（２２）に代入することで、ΔＣ_ｅｘが計算できる。 Using the cached local fields, we can generate the inner product ΔF·ΔD according to equation (32). Then, ΔC _ex can be calculated by substituting the generated ΔF·ΔD into the above equation (22).

割当変更の提案が受け入れられると、キャッシュ行列は、以下の図３に示されるように、式（３３）によりクロネッカー積を使用することによって更新される。 If the assignment change proposal is accepted, the cache matrix is updated by using the Kronecker product according to equation (33), as shown in FIG. 3 below.

図３は、キャッシュ行列の更新計算の例を示す図である。
更新は、一度にキャッシュ行列の一行分行われる。ベクトルΔＦの０の要素に対応する行についての処理はスキップされる。 FIG. 3 is a diagram illustrating an example of cache matrix update calculation.
Updates are made one row of the cache matrix at a time. Processing is skipped for rows corresponding to 0 elements of vector ΔF.
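The caching idea can be sketched in Python as follows. Equations (31) to (33) are not reproduced in this passage, so the definition H[i][r] = sum over k of F[i][k] * D[r][phi[k]], the four-entry lookup, and the outer-product update below are assumptions chosen to match the stated costs (n^3 to build, n^2 to update, constant time to read). All names are hypothetical.

```python
def build_cache(F, D, phi):
    """H[i][r] = sum_k F[i][k] * D[r][phi[k]]; O(n^3), done once at start."""
    n = len(phi)
    return [[sum(F[i][k] * D[r][phi[k]] for k in range(n)) for r in range(n)]
            for i in range(n)]


def dot_from_cache(H, phi, a, b):
    """Inner product deltaF . deltaD from four cached entries (O(1))."""
    return H[b][phi[a]] - H[b][phi[b]] - H[a][phi[a]] + H[a][phi[b]]


def accept_swap(H, F, D, phi, a, b):
    """Outer-product update of H in O(n^2), then swap the two destinations."""
    n = len(phi)
    row_diff = [D[r][phi[b]] - D[r][phi[a]] for r in range(n)]
    for i in range(n):
        coeff = F[i][a] - F[i][b]
        if coeff == 0:
            continue  # rows with a zero flow difference are skipped
        for r in range(n):
            H[i][r] += coeff * row_diff[r]
    phi[a], phi[b] = phi[b], phi[a]


F = [[0, 3, 1, 2], [3, 0, 4, 0], [1, 4, 0, 5], [2, 0, 5, 0]]
D = [[0, 2, 7, 3], [2, 0, 4, 6], [7, 4, 0, 1], [3, 6, 1, 0]]
phi = [2, 0, 3, 1]
H = build_cache(F, D, phi)
dot = dot_from_cache(H, phi, 0, 2)
accept_swap(H, F, D, phi, 0, 2)
```

The trade-off is visible in the sketch: reading `dot` touches four entries, but every accepted move pays for an update of the full n-by-n cache.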

ＱＳＡＰの場合、キャッシュ行列（Ｈ^ｓ）は、探索処理の開始時に式（３４）により生成される。 For QSAP, the cache matrix (H ^s ) is generated by equation (34) at the start of the search process.

キャッシュされた局所場を使用して、式（３５）により内積ΔＦ・ΔＤを生成できる。そして、生成したΔＦ・ΔＤを前述の式（２４）に代入することで、ΔＣ^ｓ _ｅｘが計算できる。 Using the cached local fields, we can generate the inner product ΔF·ΔD according to equation (35). ΔC ^s _ex can be calculated by substituting the generated ΔF·ΔD into the above equation (24).

割当変更の提案が受け入れられると、キャッシュ行列は、式（３６）によりクロネッカー積を使用することによって更新される。 If the reassignment proposal is accepted, the cache matrix is updated by using the Kronecker product according to equation (36).

同様に、キャッシュされた局所場を使用して、式（３７）により再配置が生じた場合の内積ΔＦ・ΔＤを生成できる。そして、生成したΔＦ・ΔＤを前述の式（２５）に代入することで、ΔＣ^ｓ _ｒｅｌが計算できる。 Similarly, the cached local field can be used to generate the inner product ΔF·ΔD when rearrangement occurs according to equation (37). Then, ΔC ^s _rel can be calculated by substituting the generated ΔF·ΔD into the above equation (25).

再配置の割当変更の提案が受け入れられると、キャッシュ行列は、式（３８）によって更新される。 If the relocation assignment change proposal is accepted, the cache matrix is updated according to equation (38).

上記のようなＢＭ＄法では、キャッシュ行列を保持することになるが、問題の規模が大きくなるとキャッシュ行列を保持するメモリの容量も大きくなる。このため、高速読み出し可能な比較的容量が小さいメモリの使用が難しくなる。ＳＡＭ法は、このようなキャッシュ行列を保持しなくてよい。 In the BM$ method as described above, a cache matrix is held, and as the scale of the problem increases, the capacity of memory for holding the cache matrix also increases. This makes it difficult to use a memory with a relatively small capacity that can be read out at high speed. The SAM method need not maintain such a cache matrix.

（ソルバーシステムの設計）
上記の３つの手法（ＶΔＣ法、ＳＡＭ法、ＢＭ＄法）の性能を比較検証するため、以下に示すようなソルバーシステムを設計した。なお、以下では、データ処理装置１０の処理部１２がＳＩＭＤ機能を備えたマルチコアＣＰＵを含み、マルチコアＣＰＵ上にパラレルテンパリングアルゴリズムが実装されているものとして説明する。 (Solver system design)
In order to compare and verify the performance of the above three methods (VΔC method, SAM method, BM$ method), a solver system as shown below was designed. In the following description, it is assumed that the processing unit 12 of the data processing device 10 includes a multi-core CPU having SIMD functions, and a parallel tempering algorithm is implemented on the multi-core CPU.

各専用コアは、探索インスタンスデータ（前述のＤ^Ｘ、Ｄ^Ｓ、Ｈ、Ｈ^ｓ）を保持するための専用キャッシュをもつ。
図４は、パラレルテンパリングを実行するソルバーシステムの例を示す図である。 Each dedicated core has a dedicated cache to hold the search instance data (D ^x , D ^s , H, H ^s above).
FIG. 4 is a diagram illustrating an example solver system that performs parallel tempering.

ソルバーシステムは、探索部２０とパラレルテンパリングコントローラ２１を有する。探索部２０は、複数のコア２０ａ１～２０ａｍを有し、コア２０ａ１～２０ａｍのそれぞれが複数のレプリカ（インスタンスに相当する）についての前述の局所探索（ＳＬＳ：Stochastic Local Search）を並列に実行する。レプリカ数はＭである。コア２０ａ１は、レプリカ２０ｂ１を含む複数のレプリカについての局所探索を行う。コア２０ａｍは、レプリカ２０ｂＭを含む複数のレプリカについての局所探索を行う。 The solver system has a search section 20 and a parallel tempering controller 21 . The search unit 20 has a plurality of cores 20a1 to 20am, and each of the cores 20a1 to 20am executes the aforementioned local search (SLS: Stochastic Local Search) on a plurality of replicas (corresponding to instances) in parallel. The number of replicas is M. The core 20a1 performs a local search for multiple replicas including the replica 20b1. The core 20am performs a local search for multiple replicas including the replica 20bM.

ＱＡＰの場合は、レプリカごとに、前述の整数割当ベクトルφ、キャッシュ行列（Ｈ）（ＢＭ＄法の場合）、状態整列Ｄ行列（Ｄ^Ｘ）（ＳＡＭ法の場合）、評価関数の値（Ｃ）が保持される。 In the case of QAP, for each replica, the aforementioned integer allocation vector φ, cache matrix (H) (for BM$ method), state-aligned D matrix (D ^X ) (for SAM method), evaluation function value (C ) is retained.

ＱＳＡＰの場合は、レプリカごとに、前述の整数割当ベクトルψ、キャッシュ行列（Ｈ^ｓ）（ＢＭ＄法の場合）、状態整列Ｄ行列（Ｄ^Ｓ）（ＳＡＭ法の場合）、評価関数の値（Ｃ）が保持される。 In the case of QSAP, for each replica, the above-mentioned integer assignment vector ψ, cache matrix (H ^s ) (for BM$ method), state-aligned D matrix (D ^s ) (for SAM method), evaluation function value ( C) is retained.

式（６）に示したフロー行列（Ｆ）と、ＱＳＡＰの場合に用いられる式（１０）に示した行列Ｂは全レプリカ２０ｂ１～２０ｂＭについて共通である。
Ｍ個のレプリカ２０ｂ１～２０ｂＭには、互いに異なる値の温度パラメータ（Ｔ）が設定される。レプリカ２０ｂ１～２０ｂＭには、Ｔ_ｍｉｎからＴ_ｍａｘまで段階的に上昇する温度の何れかが設定されている。以下、このような段階的に上昇する温度パラメータの値を、温度ラダーと呼ぶ場合がある。 The flow matrix (F) shown in Equation (6) and the matrix B shown in Equation (10) used for QSAP are common to all replicas 20b1 to 20bM.
Different temperature parameters (T) are set for the M replicas 20b1 to 20bM. The replicas 20b1 to 20bM are set to any temperature that increases stepwise from T _min to T _max . Hereinafter, such a stepwise increasing temperature parameter value may be referred to as a temperature ladder.
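A temperature ladder of this kind can be generated as in the Python sketch below. The text only states that the temperatures rise stepwise from T _min to T _max ; the geometric spacing, the function name, and the sample values used here are assumptions for illustration.

```python
def temperature_ladder(t_min, t_max, num_replicas):
    """Return num_replicas temperatures rising geometrically from t_min to t_max."""
    if num_replicas < 2:
        return [t_min]
    # Constant ratio between neighbouring rungs (assumed spacing rule).
    ratio = (t_max / t_min) ** (1.0 / (num_replicas - 1))
    return [t_min * ratio ** k for k in range(num_replicas)]


# Hypothetical values: 32 replicas between T_min = 1 and T_max = 100.
ladder = temperature_ladder(1.0, 100.0, 32)
```

Other spacings (for example linear) would fit the description equally well; the geometric choice is only a common default in parallel tempering.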

パラレルテンパリングコントローラ２１は、隣接する温度パラメータの値（Ｔ_ｋ、Ｔ_ｋ＋１）が設定されるレプリカ間で、評価関数の値（Ｃ_ｋ、Ｃ_ｋ＋１）とＴ_ｋ、Ｔ_ｋ＋１に基づいて、以下の式（３９）で表される交換許容確率ＳＡＰで、Ｔ_ｋ、Ｔ_ｋ＋１の値を交換する。 Between replicas that are set to adjacent temperature parameter values (T _k , T _k+1 ), the parallel tempering controller 21 exchanges the values of T _k and T _k+1 with the swap acceptance probability SAP given by equation (39) below, which is based on the evaluation function values (C _k , C _k+1 ) and on T _k and T _k+1 .
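As a hedged illustration, the Python sketch below uses the standard replica-exchange acceptance rule, min(1, exp((1/T_k - 1/T_k+1)(C_k - C_k+1))). Equation (39) is not reproduced in this passage, so treating it as this standard form is an assumption, and the function name is hypothetical.

```python
import math


def swap_acceptance_probability(c_k, c_k1, t_k, t_k1):
    """Standard replica-exchange probability for neighbouring temperatures."""
    exponent = (1.0 / t_k - 1.0 / t_k1) * (c_k - c_k1)
    # A non-negative exponent means the colder replica holds the worse cost,
    # so the exchange is always accepted.
    return 1.0 if exponent >= 0 else math.exp(exponent)


always = swap_acceptance_probability(10.0, 5.0, 1.0, 2.0)
sometimes = swap_acceptance_probability(5.0, 10.0, 1.0, 2.0)
```

With this rule, the lower temperature tends to migrate toward the replica with the lower cost, while unfavourable exchanges remain possible with a small probability.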

上記のような探索部２０とパラレルテンパリングコントローラ２１は、図２に示した処理部１２と記憶部１１により実現される。
フロー行列、状態整列Ｄ行列、キャッシュ行列は、たとえば、単精度浮動小数点数形式で記憶部１１に記憶される。ＶΔＣ法やＳＡＭ法を実行する場合の内積計算やキャッシュ行列の更新などを行う際に、融合積和演算命令が利用できるようにするためである。 The search unit 20 and the parallel tempering controller 21 as described above are realized by the processing unit 12 and the storage unit 11 shown in FIG.
The flow matrix, the state-aligned D matrix, and the cache matrix are stored in the storage unit 11 in single-precision floating-point number format, for example. This is so that the fused multiply-add operation instruction can be used when performing inner product calculation and cache matrix update when executing the VΔC method or the SAM method.

図５は、ＱＡＰの解をパラレルテンパリングによる局所探索で探索するアルゴリズムの例を示す図である。なお、図５のアルゴリズムの例では、並列に局所探索が行われるレプリカ数が３２である。なお、図５において、１６行目、１８行目などに示されている“（１）”、“（３１）”などは、前述の式（１）、式（３１）などを表す。 FIG. 5 is a diagram showing an example of an algorithm for searching a QAP solution by local search by parallel tempering. Note that in the example of the algorithm in FIG. 5, the number of replicas for which local searches are performed in parallel is 32. In FIG. 5, "(1)", "(31)", etc. shown in the 16th line, the 18th line, etc. represent the aforementioned formulas (1), (31), and the like.

アルゴリズムは、温度パラメータの値とレプリカ状態を初期化することから始まる。次に、ソルバーシステムは最適化ループに入り、全てのレプリカに対して並列に、局所探索がＩ回繰り返し実行される。 The algorithm begins by initializing the values of the temperature parameters and the replica state. The solver system then enters an optimization loop where I iterations of the local search are performed on all replicas in parallel.

Ｉ回の局所探索が終わると、レプリカ交換処理が行われ、事前設定されたＢＫＳ（Best-Known-Solution）コスト（Ｃ_ＢＫＳ）が見つかるか、タイムアウト制限に達するまで、最適化ループが再開される。 After I local searches, the replica exchange process is performed and the optimization loop is restarted until either a preset Best-Known-Solution cost (C _BKS ) is found or a timeout limit is reached. .

ＱＡＰの局所探索では、ループが事前設定された反復回数（Ｉ）で実行され、２つの要素（図５の例では施設）が選択され、その２つの要素の割当先を入れ替えたときのΔＣ_ｅｘの値が、選択された手法（ＶΔＣ法、ＢＭ＄法またはＳＡＭ法）により計算される。 In the QAP local search, a loop is run for a preset number of iterations (I). In each iteration, two elements (facilities in the example of FIG. 5) are selected, and the value of ΔC _ex that would result from exchanging their allocation destinations is calculated by the chosen method (VΔC method, BM$ method, or SAM method).

次に、計算されたΔＣ_ｅｘと各レプリカの現在の温度パラメータの値を用いて、式（１８）のＰ_ａｃｃが計算される。次に、０から１の範囲の乱数値が生成され、Ｐ_ａｃｃと比較されてベルヌーイ試行が実行され、選択された２つの要素間の割当先の入れ替えの提案が受け入れられるかどうかが決定される。提案が受け入れられた場合には、状態（割当状態）が更新される。 P _acc in equation (18) is then calculated using the calculated ΔC _ex and the value of the current temperature parameter for each replica. A random value in the range 0 to 1 is then generated and compared to the P _acc and a Bernoulli trial is performed to determine if the proposal for swapping assignments between the two selected elements is acceptable. . If the offer is accepted, the status (assignment status) is updated.
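The acceptance test can be sketched in Python as below. Equation (18) is not reproduced in this passage; the sketch assumes the usual Metropolis form P_acc = min(1, exp(-ΔC/T)), and the function names are hypothetical.

```python
import math
import random


def acceptance_probability(delta_c, temperature):
    """Metropolis criterion: improvements always pass, else exp(-dC/T)."""
    if delta_c <= 0:
        return 1.0
    return math.exp(-delta_c / temperature)


def bernoulli_accept(delta_c, temperature, rng=random):
    """Draw a uniform number in [0, 1) and accept when P_acc exceeds it."""
    return acceptance_probability(delta_c, temperature) > rng.random()


p_improve = acceptance_probability(-3.0, 1.0)
p_worse = acceptance_probability(2.0, 4.0)
accepted = bernoulli_accept(-1.0, 1.0, random.Random(0))
```

Passing a seeded `random.Random` instance keeps a run reproducible, which is convenient when comparing methods over repeated trials.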

図６は、パラレルテンパリングによる局所探索の全体の処理の流れを示すフローチャートである。図６では、図５に示したアルゴリズムのパラレルテンパリングによる局所探索の全体の処理の流れが示されている。 FIG. 6 is a flow chart showing the overall processing flow of local search by parallel tempering. FIG. 6 shows the overall processing flow of local search by parallel tempering of the algorithm shown in FIG.

まず初期化ループ（ステップＳ１０～Ｓ１３）が、ｉ＝０からｉ＜Ｍ－１の間、ｉを１つずつ増やしつつ行われる。Ｍはレプリカ数である（図５の例では３２個）。
初期化ループでは、各レプリカに対して初期温度（温度ラダー）の設定（Ｔ［ｉ］←Ｔ^０［ｉ］）が行われる（ステップＳ１１）。そして、レプリカの初期化処理（図７参照）が行われる（ステップＳ１２）。 First, an initialization loop (steps S10 to S13) is performed while increasing i by one between i=0 and i<M-1. M is the number of replicas (32 in the example of FIG. 5).
In the initialization loop, an initial temperature (temperature ladder) is set (T[i]←T ^0 [i]) for each replica (step S11). Then, a replica initialization process (see FIG. 7) is performed (step S12).

次に、最適化ループ（ステップＳ１４～Ｓ１６）が、ｉ＝０からｉ＜Ｍ－１の間、ｉを１つずつ増やしつつ行われる。
最適化ループでは、レプリカ探索処理が行われる（ステップＳ１５）。レプリカ探索処理では、各レプリカについて、設定された温度パラメータの値を用いて、Ｉ回の局所探索が行われる。 Next, an optimization loop (steps S14 to S16) is performed while increasing i by one between i=0 and i<M−1.
In the optimization loop, replica search processing is performed (step S15). In the replica search process, local searches are performed I times for each replica using the set temperature parameter value.

その後、レプリカ交換処理（ステップＳ１７）が行われ、探索を終了させるか否かが判定される（ステップＳ１８）。たとえば、前述のように、事前設定されたＢＫＳコスト（Ｃ_ＢＫＳ）が見つかるか、タイムアウト制限に達した場合、探索が終了したと判定され、探索が終了する。探索を終了しないと判定された場合、ステップＳ１４からの最適化ループが再開される。 After that, replica exchange processing (step S17) is performed, and it is determined whether or not to end the search (step S18). For example, as described above, if the preset BKS cost (C _BKS ) is found or the timeout limit is reached, then it is determined that the search is over and the search ends. If it is determined not to end the search, the optimization loop from step S14 is restarted.

なお、データ処理装置１０は、探索が終了した場合、探索結果（たとえば、後述の最小値、Ｃ_ｍｉｎ、φ_ｍｉｎ）を出力してもよい。探索結果は、たとえばデータ処理装置１０に接続される表示装置に表示されてもよいし、外部の装置に送信されてもよい。 Note that the data processing device 10 may output search results (for example, minimum values, C _min , φ _min described later) when the search ends. The search result may be displayed on a display device connected to the data processing device 10, for example, or may be transmitted to an external device.

なお、図６ではレプリカごとに順番に処理を行う例を示しているが、たとえば、３２個のコアをもつプロセッサにより、３２個のレプリカについてステップＳ１１，Ｓ１２，Ｓ１５の処理を並列に実行可能である。 Although FIG. 6 shows an example in which processing is performed sequentially for each replica, for example, a processor having 32 cores can execute the processing of steps S11, S12, and S15 for 32 replicas in parallel. be.

次に、ＳＡＭ法を実行する場合を例にして、ステップＳ１２のレプリカの初期化処理と、ステップＳ１５のレプリカ探索処理の流れを、フローチャートを用いて説明する。
図７は、ＱＡＰの場合のレプリカの初期化処理の一例の流れを示すフローチャートである。 Next, the flow of the replica initialization process in step S12 and the replica search process in step S15 will be described using a flow chart, taking the case of executing the SAM method as an example.
FIG. 7 is a flowchart showing an example of the flow of replica initialization processing in the case of QAP.

まず、整数割当ベクトルφの初期化が行われる（ステップＳ２０）。ステップＳ２０の処理では、たとえば、ランダムに要素の割当先（施設の配置先）が決定される。そして、式（１）の計算により、初期コスト（Ｃ）が計算される（ステップＳ２１）。 First, the integer assignment vector φ is initialized (step S20). In the processing of step S20, for example, the allocation destination of the element (placement destination of the facility) is determined at random. Then, the initial cost (C) is calculated by the calculation of formula (1) (step S21).

その後、初期化したφとＣにより、最小値（Ｃ_ｍｉｎとφ_ｍｉｎ）が初期化される（ステップＳ２２）。また、φの初期値に対応するように、状態整列Ｄ行列（Ｄ^Ｘ）の初期化が行われる（ステップＳ２３）。その後、図６に示したフローチャートの処理に戻る。 After that, the initialized φ and C initialize the minimum values (C _min and φ _min ) (step S22). Also, the state-aligned D-matrix (D ^X ) is initialized so as to correspond to the initial value of φ (step S23). After that, the process returns to the process of the flowchart shown in FIG.

図８は、ＱＡＰの場合のレプリカ探索処理の一例の流れを示すフローチャートである。
レプリカ探索処理では、イタレーションループ（ステップＳ３０～Ｓ４０）が、ｉ＝１からｉ＜Ｉの間、ｉを１つずつ増やしつつ行われる。 FIG. 8 is a flowchart showing an example of the flow of replica search processing in the case of QAP.
In the replica search process, an iteration loop (steps S30 to S40) is performed while increasing i by one between i=1 and i<I.

まず、割当先の入れ替え候補となる２つの要素（識別番号＝ａ，ｂで表されている）が選択される（ステップＳ３１）。
そして、その２つの要素の割当先を入れ替えたときのΔＣ_ｅｘ（ａ，ｂ）の値が、式（２２）により計算される（ステップＳ３２）。 First, two elements (represented by identification numbers=a and b) that are candidates for replacement of allocation destinations are selected (step S31).
Then, the value of ΔC _ex (a, b) when the allocation destinations of the two elements are exchanged is calculated by Equation (22) (step S32).

次に、計算されたΔＣ_ｅｘと各レプリカの現在の温度パラメータの値を用いて、式（１８）のＰ_ａｃｃが計算される（ステップＳ３３）。そして、Ｐ_ａｃｃが０から１の範囲の乱数値ｒａｎｄ（）よりも大きいか否かが判定される（ステップＳ３４）。 Next, using the calculated ΔC _ex and the current temperature parameter value of each replica, P _acc in equation (18) is calculated (step S33). Then, it is determined whether or not P _acc is greater than a random number value rand() ranging from 0 to 1 (step S34).

Ｐ_ａｃｃ＞ｒａｎｄ（）であると判定された場合、Ｄ^Ｘの列ａと列ｂの入れ替え（Ｄ^Ｘ _＊，ａ，Ｄ^Ｘ _＊，ｂ←Ｄ^Ｘ _＊，ｂ，Ｄ^Ｘ _＊，ａ）が行われる（ステップＳ３５）。さらに、割当先の入れ替えによる割当状態の更新（φ（ａ），φ（ｂ）←φ（ｂ），φ（ａ））が行われ（ステップＳ３６）、評価関数の値（Ｃ）の更新（Ｃ←Ｃ＋ΔＣ_ｅｘ（ａ，ｂ））が行われる（ステップＳ３７）。 If it is determined that P _acc >rand( ), columns a and b of D ^X are swapped (D ^X _{*, a} , D ^X _{*, b} ←D ^X _{*, b} , D ^X _{*, a} ) (step S35). Furthermore, the allocation state is updated by exchanging the allocation destinations (φ(a), φ(b)←φ(b), φ(a)) (step S36), and the evaluation function value (C) is updated (C←C+ΔC _ex (a, b)) (step S37).

その後、Ｃ＜Ｃ_ｍｉｎであるか否かが判定され（ステップＳ３８）、Ｃ＜Ｃ_ｍｉｎであると判定された場合、最小値の更新（Ｃ_ｍｉｎ←Ｃ、φ_ｍｉｎ←φ）が行われる（ステップＳ３９）。 After that, it is determined whether or not C<C _min (step S38), and if it is determined that C<C _min , the minimum value is updated (C _min ←C, φ _min ←φ) ( step S39).

ステップＳ３９の処理後、またはステップＳ３４の処理でＰ_ａｃｃ＞ｒａｎｄ（）ではないと判定された場合、またはステップＳ３８の処理でＣ＜Ｃ_ｍｉｎではないと判定された場合、ｉ＝ＩとなるまでステップＳ３１からの処理が繰り返される。ｉ＝Ｉとなった場合、図６に示したフローチャートの処理に戻る。 After step S39, or when step S34 determines that P _acc >rand( ) does not hold, or when step S38 determines that C<C _min does not hold, the processing from step S31 is repeated until i=I. When i=I, the process returns to the flowchart shown in FIG. 6.
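Steps S31 to S39 can be put together in a minimal Python sketch of one replica's search with the SAM method. It rests on several assumptions that are not taken from the source: symmetric matrices with zero diagonals, the Metropolis form for P _acc , the convention DX[r][k] = D[r][phi[k]], and exactly `iterations` proposals per call. All names are hypothetical.

```python
import math
import random


def qap_cost(F, D, phi):
    n = len(phi)
    return sum(F[i][j] * D[phi[i]][phi[j]] for i in range(n) for j in range(n))


def replica_search(F, D, phi, temperature, iterations, rng):
    """Run exchange proposals on one replica; return best and final state."""
    n = len(phi)
    phi = list(phi)
    DX = [[row[p] for p in phi] for row in D]       # state-aligned D matrix
    cost = qap_cost(F, D, phi)
    best_cost, best_phi = cost, list(phi)
    for _ in range(iterations):
        a, b = rng.sample(range(n), 2)              # S31: pick two elements
        dF = [fb - fa for fa, fb in zip(F[a], F[b])]
        dD = [x - y for x, y in zip(DX[phi[a]], DX[phi[b]])]
        dot = sum(f * d for f, d in zip(dF, dD))
        delta = 2 * dot + 4 * F[a][b] * DX[phi[a]][b]   # S32 (assumed form)
        p_acc = 1.0 if delta <= 0 else math.exp(-delta / temperature)  # S33
        if p_acc > rng.random():                    # S34
            for row in DX:                          # S35: swap columns a, b
                row[a], row[b] = row[b], row[a]
            phi[a], phi[b] = phi[b], phi[a]         # S36
            cost += delta                           # S37
            if cost < best_cost:                    # S38
                best_cost, best_phi = cost, list(phi)   # S39
    return best_cost, best_phi, cost, phi


F = [[0, 3, 1, 2], [3, 0, 4, 0], [1, 4, 0, 5], [2, 0, 5, 0]]
D = [[0, 2, 7, 3], [2, 0, 4, 6], [7, 4, 0, 1], [3, 6, 1, 0]]
start = [2, 0, 3, 1]
best_cost, best_phi, final_cost, final_phi = replica_search(
    F, D, start, temperature=2.0, iterations=200, rng=random.Random(0))
```

The incrementally tracked cost can be checked against a full re-evaluation with `qap_cost`, which is a cheap way to catch mistakes in the delta formula or the column swap.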

なお、図６～図８に示した処理の順序は一例であり、適宜処理の順序を入れ替えてもよい。
図９は、ＱＳＡＰの解をパラレルテンパリングによる局所探索で探索するアルゴリズムの例を示す図である。 The order of processing shown in FIGS. 6 to 8 is an example, and the order of processing may be changed as appropriate.
FIG. 9 is a diagram showing an example of an algorithm for searching a QSAP solution by local search by parallel tempering.

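図９に示されているＱＳＡＰを計算する局所探索の関数は、再配置の提案を受け入れるか否かの判定や更新を行う再配置ループを、２つの要素の割当先を入れ替えるか否かの判定や更新を行う交換ループの前に含んでいる。 The local search function for QSAP shown in FIG. 9 contains a relocation loop, which decides whether to accept a relocation proposal and performs the corresponding update, placed before the exchange loop, which decides whether to exchange the allocation destinations of two elements and performs the corresponding update.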

次に、ＳＡＭ法を実行する場合を例にして、ＱＳＡＰを計算する場合の、図６のステップＳ１２のレプリカの初期化処理と、ステップＳ１５のレプリカ探索処理の流れを、フローチャートを用いて説明する。 Next, the flow of the replica initialization process in step S12 and the replica search process in step S15 in FIG. 6 when calculating QSAP will be described with reference to a flowchart, taking the case of executing the SAM method as an example. .

図１０は、ＱＳＡＰの場合のレプリカの初期化処理の一例の流れを示すフローチャートである。
まず、整数割当ベクトルψの初期化が行われる（ステップＳ５０）。ステップＳ５０の処理では、たとえば、ランダムに要素の割当先（施設の配置先）が決定される。そして、式（１１）の計算により、初期コスト（Ｃ）が計算される（ステップＳ５１）。 FIG. 10 is a flowchart showing an example of the flow of replica initialization processing in the case of QSAP.
First, the integer assignment vector ψ is initialized (step S50). In the processing of step S50, for example, the allocation destination of the element (location of the facility) is determined at random. Then, the initial cost (C) is calculated according to the equation (11) (step S51).

その後、初期化したψとＣにより、最小値（Ｃ_ｍｉｎとψ_ｍｉｎ）が初期化される（ステップＳ５２）。また、ψの初期値に対応するように、式（２８）により、状態整列Ｄ行列（Ｄ^Ｓ）の初期化が行われる（ステップＳ５３）。その後、図６に示したフローチャートの処理に戻る。 After that, the minimum values (C _min and ψ _min ) are initialized by the initialized ψ and C (step S52). Also, the state-aligned D-matrix (D ^S ) is initialized according to equation (28) so as to correspond to the initial value of ψ (step S53). After that, the process returns to the process of the flowchart shown in FIG.

図１１は、ＱＳＡＰの場合のレプリカ探索処理の一例の流れを示すフローチャートである。
ＱＳＡＰの場合のレプリカ探索処理でも、イタレーションループ（ステップＳ６０～Ｓ７８）が、ｉ＝１からｉ＜Ｉの間、ｉを１つずつ増やしつつ行われる。 FIG. 11 is a flowchart showing an example of the flow of replica search processing in the case of QSAP.
Also in the replica search process in the case of QSAP, an iteration loop (steps S60 to S78) is performed while increasing i by one between i=1 and i<I.

まず、識別番号＝ａの要素と識別番号＝ｌの割当先が選択される（ステップＳ６１）。
そして、識別番号＝ａの要素を識別番号＝ｌの割当先に割り当てたときのΔＣ^ｓ _ｒｅｌの値が、式（２５）により計算される（ステップＳ６２）。 First, an element with identification number=a and an allocation destination with identification number=l are selected (step S61).
Then, the value of ΔC ^s _rel when the element of identification number=a is assigned to the allocation destination of identification number=l is calculated by equation (25) (step S62).

次に、計算されたΔＣ^ｓ _ｒｅｌと各レプリカの現在の温度パラメータの値を用いて、式（１８）のＰ_ａｃｃが計算される（ステップＳ６３）。そして、Ｐ_ａｃｃが０から１の範囲の乱数値ｒａｎｄ（）よりも大きいか否かが判定される（ステップＳ６４）。 Next, using the calculated ΔC ^s _rel and the current temperature parameter value of each replica, P _acc in equation (18) is calculated (step S63). Then, it is determined whether or not P _acc is greater than the random number rand() ranging from 0 to 1 (step S64).

Ｐ_ａｃｃ＞ｒａｎｄ（）であると判定された場合、Ｄ^ｓの列ａの値が、距離行列Ｄの列ｌの値で更新される（ステップＳ６５）。さらに、割当状態の更新（φ（ａ）←ｌ）が行われ（ステップＳ６６）、評価関数の値（Ｃ）の更新（Ｃ←Ｃ＋ΔＣ^ｓ _ｒｅｌ）が行われる（ステップＳ６７）。 If it is determined that P _acc >rand( ), the value in column a of D ^s is updated with the value in column l of distance matrix D (step S65). Furthermore, the allocation state is updated (φ(a)←l) (step S66), and the evaluation function value (C) is updated (C←C+ΔC ^s _rel ) (step S67).
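The column update for an accepted relocation can be sketched in Python as follows. The sketch assumes the convention DS[r][k] = D[r][psi[k]] for the m-row, n-column state-aligned matrix; the function names and the small instance are illustrative assumptions.

```python
def build_state_aligned_qsap(D, psi):
    """Return the m-by-n matrix DS with DS[r][k] = D[r][psi[k]]."""
    return [[row[dest] for dest in psi] for row in D]


def accept_relocation(DS, D, psi, a, l):
    """Move element a to destination l: overwrite column a with column l of D."""
    for r, row in enumerate(DS):
        row[a] = D[r][l]
    psi[a] = l


D = [[0, 5, 2], [5, 0, 3], [2, 3, 0]]   # m = 3 destinations
psi = [0, 2, 2, 1, 0]                    # n = 5 elements; destinations may repeat
DS = build_state_aligned_qsap(D, psi)
accept_relocation(DS, D, psi, a=1, l=0)
```

Only one column changes per accepted relocation, so the update costs m operations instead of rebuilding the whole m-by-n matrix.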

その後、Ｃ＜Ｃ_ｍｉｎであるか否かが判定され（ステップＳ６８）、Ｃ＜Ｃ_ｍｉｎであると判定された場合、最小値の更新（Ｃ_ｍｉｎ←Ｃ、ψ_ｍｉｎ←ψ）が行われる（ステップＳ６９）。 After that, it is determined whether or not C<C _min (step S68), and if it is determined that C<C _min , the minimum value is updated (C _min ←C, ψ _min ←ψ) ( step S69).

ステップＳ６９の処理後、またはステップＳ６４の処理でＰ_ａｃｃ＞ｒａｎｄ（）ではないと判定された場合、またはステップＳ６８の処理でＣ＜Ｃ_ｍｉｎではないと判定された場合、ステップＳ７０の処理が行われる。 After step S69, or when step S64 determines that P _acc >rand( ) does not hold, or when step S68 determines that C<C _min does not hold, the process of step S70 is performed.

ステップＳ７０の処理では、割当先の入れ替え候補となる２つの要素（識別番号＝ａ，ｂで表されている）が選択される。
そして、その２つの要素の割当先を入れ替えたときのΔＣ^ｓ _ｅｘの値が、式（２４）により計算される（ステップＳ７１）。 In the process of step S70, two elements (represented by identification numbers=a and b) that are candidates for replacement of allocation destinations are selected.
Then, the value of ΔC ^s _ex when the allocation destinations of the two elements are exchanged is calculated by equation (24) (step S71).

次に、ΔＣ^ｓ _ｅｘ＜０であるか否かが判定される（ステップＳ７２）。なお、０の代わりに所定の値（固定値）を用いてもよい。
ΔＣ_ｅｘ ^ｓ＜０であると判定された場合、Ｄ^Ｓの列ａと列ｂの入れ替え（Ｄ^Ｓ _＊，ａ，Ｄ^Ｓ _＊，ｂ←Ｄ^Ｓ _＊，ｂ，Ｄ^Ｓ _＊，ａ）が行われる（ステップＳ７３）。さらに、割当先の入れ替えによる割当状態の更新（ψ（ａ），ψ（ｂ）←ψ（ｂ），ψ（ａ））が行われ（ステップＳ７４）、評価関数の値（Ｃ）の更新（Ｃ←Ｃ＋ΔＣ^ｓ _ｅｘ）が行われる（ステップＳ７５）。 Next, it is determined whether or not ΔC ^s _ex <0 (step S72). Note that a predetermined value (fixed value) may be used instead of 0.
If it is determined that ΔC ^s _ex < 0, columns a and b of D ^S are swapped (D ^S _{*, a} , D ^S _{*, b} ← D ^S _{*, b} , D ^S _{*, a} ) (step S73). Furthermore, the allocation state is updated by exchanging the allocation destinations (ψ(a), ψ(b)←ψ(b), ψ(a)) (step S74), and the evaluation function value (C) is updated (C←C+ΔC ^s _ex ) (step S75).

その後、Ｃ＜Ｃ_ｍｉｎであるか否かが判定され（ステップＳ７６）、Ｃ＜Ｃ_ｍｉｎであると判定された場合、最小値の更新（Ｃ_ｍｉｎ←Ｃ、ψ_ｍｉｎ←ψ）が行われる（ステップＳ７７）。 After that, it is determined whether or not C<C _min (step S76), and if it is determined that C<C _min , the minimum value is updated (C _min ←C, ψ _min ←ψ) ( step S77).

ステップＳ７７の処理後、またはステップＳ７２の処理でΔＣ^ｓ _ｅｘ＜０ではないと判定された場合、またはステップＳ７６の処理でＣ＜Ｃ_ｍｉｎではないと判定された場合、ｉ＝ＩとなるまでステップＳ６１からの処理が繰り返される。ｉ＝Ｉとなった場合、図６に示したフローチャートの処理に戻る。 After step S77, or when step S72 determines that ΔC ^s _ex <0 does not hold, or when step S76 determines that C<C _min does not hold, the processing from step S61 is repeated until i=I. When i=I, the process returns to the flowchart shown in FIG. 6.

なお、図１０～図１１に示した処理の順序は一例であり、適宜処理の順序を入れ替えてもよい。
（計算速度の評価結果）
まず、スカラー型の演算処理を使用し、ＶΔＣ法、ＳＡＭ法、及びＢＭ＄法の３つの方法による計算速度を評価した結果を示す。計算対象は、１０個のＱＡＰインスタンスであり、図５に示したアルゴリズムにより計算が行われた。計算は、図４に示したようなソルバーシステムを２つ用いて、６４レプリカのパラレルテンパリングにより行われた。 The order of processing shown in FIGS. 10 and 11 is an example, and the order of processing may be changed as appropriate.
(Evaluation result of calculation speed)
First, the results of evaluating the calculation speed by three methods, the VΔC method, the SAM method, and the BM$ method, using scalar arithmetic processing, are shown. Calculation targets were 10 QAP instances, and the calculation was performed by the algorithm shown in FIG. The calculations were performed by parallel tempering of 64 replicas using two solver systems as shown in FIG.

各インスタンスは、シーケンシャルに１００回、独立な乱数シードで実行され、ＢＫＳに到達した時間（ＴｔＳ：Time-to-Solution）が記録された。
図１２は、ＶΔＣ法と比較したＳＡＭ法、ＢＭ＄法による計算の高速化の度合いの評価結果を示す図である。横軸はＱＡＰの１０個のインスタンスと幾何平均（図１２では“ＧＥＯＭＥＡＮ”と表記されている）を表し、縦軸はＶΔＣ法と比較したＳＡＭ法、ＢＭ＄法の高速化の度合いを表している。高速化の度合いは、ＶΔＣ法によるＴｔＳに対するＳＡＭ法とＢＭ＄法によるＴｔＳの比率により表されている。 Each instance was run sequentially 100 times with an independent random number seed and the time to reach BKS (TtS) was recorded.
FIG. 12 is a diagram showing evaluation results of the degree of speedup of calculation by the SAM method and the BM$ method compared with the VΔC method. The horizontal axis represents the 10 QAP instances and their geometric mean (denoted as “GEOMEAN” in FIG. 12), and the vertical axis represents the degree of speedup of the SAM method and the BM$ method compared with the VΔC method. The degree of speedup is represented by the ratio of the TtS of the SAM method and the BM$ method to the TtS of the VΔC method.
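The speedup metric and its geometric mean can be computed as in the Python sketch below. The direction of the ratio (baseline TtS divided by method TtS, so that values above 1 mean faster) is an assumption, and the timings are placeholder numbers invented for illustration; they are not measurements from the evaluation.

```python
from statistics import geometric_mean


def speedups(tts_baseline, tts_method):
    """Per-instance speedup: baseline TtS divided by the method's TtS."""
    return {name: tts_baseline[name] / tts_method[name] for name in tts_baseline}


# Placeholder timings in seconds, invented purely for illustration.
tts_vdc = {"inst_a": 8.0, "inst_b": 2.0}
tts_sam = {"inst_a": 4.0, "inst_b": 2.0}
per_instance = speedups(tts_vdc, tts_sam)
overall = geometric_mean(list(per_instance.values()))
```

The geometric mean is the usual choice for averaging ratios because it treats a 2x speedup and a 2x slowdown symmetrically.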

ＳＡＭ法を用いた場合、Ｄ^Ｘを更新する追加の計算コストがあるにもかかわらず、ＶΔＣ法を用いた場合よりも計算速度が向上している。
なお、ＢＭ＄法を用いた場合、ＶΔＣ法に対して幾何平均で２．５７倍、計算速度が向上している。 With the SAM method, the calculation is faster than with the VΔC method, despite the additional computational cost of updating D ^X.
When the BM$ method is used, the calculation speed is improved by a factor of 2.57 in terms of geometric mean as compared with the VΔC method.

（ＳＩＭＤの場合の計算速度の向上）
次に、スカラー型の演算処理に対するベクトル型の演算処理の計算速度の向上の度合いを評価した結果を示す。ベクトル型の演算処理を行うために、ＡＶＸ（Advanced Vector eXtensions）２のＳＩＭＤ組み込み関数が用いられた。 (Improved calculation speed in case of SIMD)
Next, the results of evaluating the speedup of vector-type arithmetic processing over scalar-type arithmetic processing are shown. SIMD intrinsics of AVX2 (Advanced Vector Extensions 2) were used to perform the vector-type arithmetic processing.

前述と同様の手順を用いて、同じ１０のＱＡＰインスタンスをベクトル型の演算処理で実行した。
図１３は、スカラー型の演算処理に対するベクトル型の演算処理の高速化の度合いの評価結果を示す図である。横軸はＱＡＰの１０個のインスタンスと幾何平均（図１３では“ＧＥＯＭＥＡＮ”と表記されている）を表し、縦軸はスカラー型の演算処理に対するベクトル型の演算処理の高速化の度合いを表している。高速化の度合いは、ＶΔＣ法、ＳＡＭ法、ＢＭ＄法のそれぞれについての、スカラー型の演算処理によるＴｔＳに対するベクトル型の演算処理によるＴｔＳの比率により表されている。 The same 10 QAP instances were run with vector-based computation using a procedure similar to that described above.
FIG. 13 is a diagram showing an evaluation result of the degree of speedup of vector-type arithmetic processing relative to scalar-type arithmetic processing. The horizontal axis represents the 10 QAP instances and the geometric mean (denoted as "GEOMEAN" in FIG. 13), and the vertical axis represents the degree of speedup of vector-type arithmetic processing relative to scalar-type arithmetic processing. The degree of speedup is represented, for each of the VΔC method, the SAM method, and the BM$ method, by the ratio of the TtS of vector-type arithmetic processing to the TtS of scalar-type arithmetic processing.

スカラー型の演算処理に対してベクトル型の演算処理では、平均して、ＶΔＣ法の場合ほぼ２倍高速であり、ＢＭ＄法とＳＡＭ法の場合、３倍以上高速であった。ＶΔＣ法では、前述のようにＶΔＣ法で行われるΔＤの要素の並べ替えが実行時間のかなりの部分を占めるため、ＳＩＭＤ組み込み関数のメリットが最も少ない。 On average, vector-type arithmetic processing was almost twice as fast as scalar-type arithmetic processing for the VΔC method, and more than three times as fast for the BM$ method and the SAM method. The VΔC method benefits least from SIMD intrinsics because, as described above, the reordering of the elements of ΔD performed in the VΔC method occupies a significant portion of the execution time.

（動的負荷分散）
パラレルテンパリングでは、最低温度が設定されているレプリカと最高温度が設定されているレプリカとでは、ＰＡＲが異なる。Ｔの値が大きいほど（温度が高いほど）、ＰＡＲも上がる。これによりＳＡＭ法やＢＭ＄法の更新処理が、レプリカの処理を行うスレッド間で大きな実行時間のギャップを引き起こす可能性がある。 (dynamic load balancing)
In parallel tempering, the PAR differs between the replica set to the lowest temperature and the replica set to the highest temperature. The larger the value of T (the higher the temperature), the higher the PAR. As a result, the update processing of the SAM method and the BM$ method can create large gaps in execution time between the threads that process the replicas.
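
As a small illustration of why the PAR rises with temperature, the sketch below assumes the Metropolis acceptance criterion, which is commonly used in parallel tempering. This section does not restate the acceptance rule, so the criterion, the function name, and the numeric values are assumptions made only for illustration.

```python
import math

def metropolis_accept_prob(delta_c, temperature):
    """Acceptance probability of a move with cost change delta_c
    at the given temperature (assumed Metropolis criterion)."""
    if delta_c <= 0:
        return 1.0                      # improving moves are always accepted
    return math.exp(-delta_c / temperature)

# The same uphill move evaluated at a low, a medium, and a high temperature.
delta_c = 5.0
probs = [metropolis_accept_prob(delta_c, t) for t in (1.0, 5.0, 25.0)]
```

Under this assumption, replicas at higher temperatures accept more proposals and therefore run the update processing more often per iteration.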

これを軽減するために、データ処理装置１０は、各温度のイタレーションごとの時間をＳＬＳ関数内で追跡し、その時間を用いて、各温度で実行されるイタレーションの回数をスケーリングする。 To mitigate this, data processor 10 tracks the time per iteration at each temperature within the SLS function and uses that time to scale the number of iterations performed at each temperature.

図１４は、負荷分散の一例を示す図である。図１４では、Ｔ１～Ｔ３２（Ｔ１＜Ｔ２＜…＜Ｔ３２）の温度ラダーが設定される３２のレプリカについての、温度パラメータの交換前のΔＣの計算時間、更新処理の時間が示されている。 FIG. 14 is a diagram illustrating an example of load balancing. FIG. 14 shows, for 32 replicas to which a temperature ladder of T1 to T32 (T1<T2<...<T32) is assigned, the ΔC calculation time and the update processing time before the exchange of temperature parameters.

負荷分散を行わない場合、大きい温度パラメータの値が設定されたレプリカほど更新処理の時間が長くなり、小さい温度パラメータの値が設定されたレプリカほど更新処理の時間が短い。このため、小さい温度パラメータの値が設定されたレプリカほど長いアイドル時間が生じている。 When load balancing is not performed, replicas with larger temperature parameter values take longer to update, and replicas with smaller temperature parameters take less time to update. For this reason, replicas with smaller temperature parameter values have longer idle times.

これに対して負荷分散を行う場合、たとえば、Ｔ１が設定されているレプリカの実行時間（スレッドの実行時間）を基準に、他のレプリカにおけるΔＣ計算のイタレーション回数がスケーリングされる。 In the case of load balancing, for example, the number of iterations of ΔC calculation in other replicas is scaled based on the execution time (thread execution time) of the replica for which T1 is set.
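
The scaling described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the use of the T1 replica as index 0, and the per-iteration times are hypothetical, and how the times are tracked inside the SLS function is not shown.

```python
def scale_iterations(base_iterations, time_per_iter, ref_index=0):
    """Scale each replica's iteration count so that every replica runs for
    about as long as the reference replica (ref_index, e.g. the one at T1)."""
    ref_time = base_iterations * time_per_iter[ref_index]
    return [max(1, round(ref_time / t)) for t in time_per_iter]

# Hypothetical measured seconds per iteration for four replicas, T1 < ... < T4.
# Hotter replicas accept more proposals and so spend longer per iteration.
time_per_iter = [1.0e-6, 1.25e-6, 2.0e-6, 4.0e-6]

iters = scale_iterations(1000, time_per_iter)
wall_times = [n * t for n, t in zip(iters, time_per_iter)]
```

With these placeholder timings every replica ends up with nearly the same thread execution time, which is the effect that shortens the idle time.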

これにより、図１４に示すように、各レプリカ間でスレッドの実行時間をほぼ同等にすることができ、アイドル時間が短縮される。
このような負荷分散の有効性を定量的に示すために、前述の１０個のインスタンスを計算対象として、負荷分散による演算処理の高速化の度合いを評価した結果を以下に示す。 As a result, as shown in FIG. 14, the execution times of the threads can be made substantially equal among the replicas, and the idle time can be shortened.
In order to quantitatively demonstrate the effectiveness of such load distribution, the above-mentioned 10 instances were used as calculation targets, and the results of evaluating the degree of speed-up of arithmetic processing due to load distribution are shown below.

図１５は、負荷分散による演算処理の高速化の度合いの評価結果を示す図である。横軸はＱＡＰの１０個のインスタンスと幾何平均“ＧＥＯＭＥＡＮ”を表し、縦軸は負荷分散を行わない場合の演算処理に対する負荷分散を行った場合の高速化の度合いを表している。高速化の度合いは、負荷分散をせずにＳＡＭ法、ＢＭ＄法のそれぞれを実行したときのＴｔＳに対する、負荷分散をしてＳＡＭ法、ＢＭ＄法のそれぞれを実行したときのＴｔＳの比率により表されている。なお、負荷分散をしたときもしなかったときも、同じ温度パラメータの値のセット（たとえば、図１４のＴ１～Ｔ３２）が用いられている。 FIG. 15 is a diagram showing evaluation results of the degree of speedup of arithmetic processing achieved by load balancing. The horizontal axis represents the 10 QAP instances and the geometric mean "GEOMEAN", and the vertical axis represents the degree of speedup with load balancing relative to the case without load balancing. The degree of speedup is represented by the ratio of the TtS when each of the SAM method and the BM$ method is executed with load balancing to the TtS when it is executed without load balancing. Note that the same set of temperature parameter values (for example, T1 to T32 in FIG. 14) is used both with and without load balancing.

負荷分散により、ＳＡＭ法では平均１．８６倍、ＢＭ＄では平均１．３８倍、演算処理の高速化が達成できていることが分かった。インスタンスファミリ間での負荷分散の有効性の違いは、インスタンスファミリの特性により、温度パラメータの極値間の異なるＰＡＲのギャップに起因する可能性がある。 It was found that by load distribution, an average speed-up of 1.86 times was achieved in the SAM method and an average speed-up of 1.38 times was achieved in the BM$ method. Differences in load balancing effectiveness between instance families may be due to different PAR gaps between extremes of temperature parameters due to instance family characteristics.

上記の各機能によって提供される計算の高速化の評価結果を表１にまとめる。 Table 1 summarizes the results of evaluating the computation speedup provided by each of the above functions.

２種類の高速化の度合（Incremental Speed-UpとCumulative Speed-Up）が示されている。なお、高速化の度合いは、前述の１０個のインスタンスについてのＴｔＳの幾何平均を用いて計算されたものであり、ＶΔＣ法を用いたスカラー型の演算処理によるＴｔＳを基準（１．００）としている。 Two kinds of speedup (Incremental Speed-Up and Cumulative Speed-Up) are shown. The degree of speedup was calculated using the geometric mean of TtS for the 10 instances described above, with the TtS of scalar-type arithmetic processing using the VΔC method as the baseline (1.00).

（ベンチマーク結果）
上記のＶΔＣ法、ＳＡＭ法、ＢＭ＄法に関して、６４コアのＣＰＵを含む所定のハードウェア構成のデータ処理装置１０を用いてベンチマークスコアが測定された。 (Benchmark result)
With respect to the VΔC method, the SAM method, and the BM$ method, benchmark scores were measured using a data processing apparatus 10 having a predetermined hardware configuration including a 64-core CPU.

上記３つの方法を行うＱＡＰのソルバーのベンチマークスコアを測定するために、ＱＡＰライブラリーリファレンス（非特許文献５参照）にあるインスタンスが用いられた。さらに、非特許文献６，７で提案されたインスタンスが用いられた。また、本実施の形態では、非特許文献７に示された問題サイズの大きいＱＡＰ（既知の最適解がない）の新しいＢＫＳ状態を見つけることが試みられた。またＱＳＡＰのソルバーのベンチマークスコアを測定するために、非特許文献８で紹介されたインスタンスのセットが用いられた。 An instance in the QAP library reference (see Non-Patent Document 5) was used to measure the benchmark score of the QAP solver performing the above three methods. Furthermore, the instances proposed in Non-Patent Documents 6 and 7 were used. Also, in the present embodiment, an attempt was made to find a new BKS state for a large problem size QAP (no known optimal solution) given in Non-Patent Document 7. The set of instances introduced in Non-Patent Document 8 was also used to measure the benchmark score of the QSAP solver.

（ＱＡＰベンチマーク） (QAP benchmark)

表２には、ベクトル型の演算処理を行う３つの方法（ＶΔＣ法、負荷分散を行うＳＡＭ法及びＢＭ＄法）を実施するソルバーによる、ＱＡＰの各インスタンスに対するＴｔＳが示されている。さらに表２には、ＶΔＣ法を用いたベクトル型の演算処理によるＴｔＳを基準（１．００）とした高速化の度合い（ＴｔＳの幾何平均を用いて計算されたもの）が示されている。さらに、表２には、公開されている２つのソルバーであるＰａｒＥＯＴＳ（非特許文献９参照）と、ＰＢＭ（Permutational Boltzmann Machine）（非特許文献１参照）によるＴｔＳと高速化の度合いが示されている。これら２つの公開されているソルバーは、５分のタイムアウトウィンドウ内で、いくつかの難しいインスタンスを１００％の成功率で解くことができるものである。 Table 2 shows the TtS for each QAP instance obtained by solvers implementing three methods with vector-type arithmetic processing (the VΔC method, and the SAM method and the BM$ method with load balancing). Table 2 also shows the degree of speedup (calculated using the geometric mean of TtS) with the TtS of vector-type arithmetic processing using the VΔC method as the baseline (1.00). In addition, Table 2 shows the TtS and the degree of speedup of two published solvers, ParEOTS (see Non-Patent Document 9) and PBM (Permutational Boltzmann Machine) (see Non-Patent Document 1). These two published solvers are capable of solving some difficult instances with a 100% success rate within a 5-minute timeout window.

表２に示されているＴｔＳの値は、ＶΔＣ法、ＳＡＭ法、及びＢＭ＄法による局所探索を１００回独立に連続して実行し、測定されたＴｔＳの平均値（９９％の信頼区間の値として表されている）である。各実行のタイムアウト制限は５分である。ＰａｒＥＯＴＳとＰＢＭについてのＴｔＳの値は、それぞれの開示によるものである。 The TtS values shown in Table 2 are averages of the TtS measured over 100 independent, consecutive runs of the local search by the VΔC method, the SAM method, and the BM$ method, reported with 99% confidence intervals. The timeout limit for each run is 5 minutes. The TtS values for ParEOTS and PBM are taken from their respective publications.
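
One way to obtain a mean TtS with a 99% confidence interval from repeated runs is sketched below. How the interval was actually computed is not stated here, so the normal approximation, the constant 2.576, the function name, and the sample values are all assumptions for illustration.

```python
import math
import statistics

def mean_with_ci99(samples):
    """Return (mean, half_width) of a normal-approximation 99% confidence
    interval for the mean of the samples."""
    z_99 = 2.576  # two-sided 99% quantile of the standard normal distribution
    mean = statistics.mean(samples)
    standard_error = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, z_99 * standard_error

# Placeholder TtS samples in seconds (illustrative only, not measured data).
tts_runs = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, half_width = mean_with_ci99(tts_runs)
```

The interval is then reported as the mean plus or minus `half_width`. With only a few runs a Student t quantile would be the more careful choice; with 100 runs the two are close.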

表２に示されているように、ＢＭ＄法は、５つのソルバーの中で全てのインスタンスにわたって最高のパフォーマンスを示し、ＰＢＭと比較して２倍、ＰａｒＥＯＴＳと比較して３００倍以上の高速化を示した。ＳＡＭ法とＢＭ＄法は、ＶΔＣ法に対してそれぞれ１．９２倍、３．２２倍の平均速度向上を示した。なお、表１と比較した表２の結果の違いは、使用されたインスタンスによるものである。 As shown in Table 2, the BM$ method showed the best performance across all instances among the five solvers, with a speedup of 2 times over PBM and more than 300 times over ParEOTS. The SAM method and the BM$ method showed average speedups of 1.92 times and 3.22 times over the VΔC method, respectively. Note that the differences between the results in Table 2 and those in Table 1 are due to the instances used.

ただし、ＳＡＭ法とＢＭ＄法のどちらが良いかは、問題規模やＰＡＲに依存し、ＢＭ＄法よりもＳＡＭ法が優位なケースもある。たとえば後述のように（図１７参照）、ＰＡＲが高くなるとＢＭ＄法よりもＳＡＭ法の方が優位になる傾向がある。また、本評価結果は、ＣＰＵでの実装の結果であり、比較的大きなメモリをもった系での結果である。専用回路での実装の場合、局所場を記憶しなくてもよいＳＡＭ法が、ＢＭ＄法よりも優位となる場合も考えられる。 However, which of the SAM method and the BM$ method is better depends on the problem size and the PAR, and there are cases where the SAM method is superior to the BM$ method. For example, as described later (see FIG. 17), the SAM method tends to become superior to the BM$ method as the PAR increases. These evaluation results were obtained with an implementation on a CPU, that is, on a system having a relatively large memory. In the case of implementation in a dedicated circuit, the SAM method, which does not need to store local fields, may be superior to the BM$ method.

（ＱＳＡＰベンチマーク）
ＱＡＰと比較して、ＱＳＡＰソルバーに関するこれまでの開示はほとんどない。主な参考資料は、２０コアのＣＰＵに実装されたＰＭＩＴＳ（Parallel Memetic Iterative Tabu Search）アルゴリズムである（非特許文献８参照）。ＰＭＩＴＳは、ポピュレーションの５０％を用いて、１時間のタイムアウト制限で、停止基準と同じ最良のソリューションに到達する。この方法は、協調レプリカを使用したＭｅｍｅｔｉｃアルゴリズムにおいて、収束を予測する合理的な方法である。しかし、レプリカが温度ラダーに沿って絶えず流動しているパラレルテンパリングでは、このような基準は妥当ではない。 (QSAP benchmark)
Compared to QAP, there are few previous disclosures on QSAP solvers. The main reference is the PMITS (Parallel Memetic Iterative Tabu Search) algorithm implemented on a 20-core CPU (see Non-Patent Document 8). PMITS uses, as its stopping criterion, 50% of the population reaching the same best solution, with a timeout limit of 1 hour. This is a reasonable way to predict convergence in a memetic algorithm that uses cooperating replicas. However, in parallel tempering, where replicas constantly move along the temperature ladder, such a criterion is not appropriate.

したがって、本実施の形態では、ＱＡＰの場合と同様の方法でＴｔＳが測定され、ＰＭＩＴＳについては基準としてＴｔＳの値が測定された。その結果が、以下の表３に示されている。パラレルテンパリングソルバーとＰＭＩＴＳとの間の相関関係については示されていない。 Therefore, in the present embodiment, TtS was measured in the same manner as for QAP, and the value of TtS was measured as a reference for PMITS. The results are shown in Table 3 below. A correlation between the parallel tempering solver and PMITS is not shown.

表３のように、ＱＡＰと比較すると、ＱＳＡＰインスタンス全体のＶΔＣ法と比較したＳＡＭ法及びＢＭ＄法のパフォーマンスは、ほぼ２倍低下している。これは、より高いＰＡＲに起因するものと考えられる。 As shown in Table 3, the performance of the SAM method and the BM$ method relative to the VΔC method across the QSAP instances is lower than for QAP by nearly a factor of two. This is believed to be due to the higher PAR.

（拡張Ｔａｉｌｌａｒｄインスタンス上のＱＡＰスケーリング）
非特許文献７で紹介された、いくつかのＱＡＰインスタンスは未知の最適解をもち、最大ｎ=７２９のサイズである。このサイズと難易度のため、ベンチマークに使用されることはめったにないが、データ処理装置１０は、より良い解を見つけるために、所定の時間制限で、ＶΔＣ法、ＳＡＭ法及びＢＭ＄法を用いて、これらのインスタンスを実行した。 (QAP scaling on extended Taillard instances)
Some QAP instances introduced in Non-Patent Document 7 have unknown optimal solutions and sizes of up to n=729. Because of their size and difficulty, these instances are rarely used for benchmarking. Nevertheless, the data processing apparatus 10 ran them with the VΔC method, the SAM method, and the BM$ method under a predetermined time limit in order to find better solutions.

これらのインスタンスのＢＫＳ値を改善しようとした以前の試みでは、数分から数時間実行したにもかかわらず、ほとんどまたはまったく改善されなかった（たとえば、非特許文献１０参照）。 Previous attempts to improve the BKS values for these instances yielded little or no improvement despite running for minutes to hours (see, eg, Non-Patent Document 10).

実験では、各インスタンスは、２０回実行され、ｎ＝１２５及びｎ＝１７５の場合は、１０秒、ｎ＝３４３及びｎ＝７２９のインスタンスの場合は３０秒で終了する。表４，５には、各インスタンスについてＶΔＣ法、ＳＡＭ法及びＢＭ＄法を適用した場合の最良のコスト（式（１）の評価関数の値）とともに、平均コストが示されている。 In the experiments, each instance was run 20 times and finished in 10 seconds for n=125 and n=175 and 30 seconds for n=343 and n=729 instances. Tables 4 and 5 show the average cost together with the best cost (value of the evaluation function of equation (1)) when applying the VΔC method, the SAM method and the BM$ method for each instance.

実行時間が短いにもかかわらず、ＳＡＭ法とＢＭ＄法による４つを除くすべてのインスタンスのＢＫＳを改善でき、平均コストも以前のＢＫＳよりも低くなっていることが分かった。 It was found that, despite the short run times, the SAM method and the BM$ method improved the BKS for all but four instances, and that the average cost was also lower than the previous BKS.

（スケーリング分析）
ベンチマーク結果と、ＶΔＣ法とＳＡＭ法の定性的比較に基づくと、ＳＡＭ法がより効率的であると考えられる。ＶΔＣ法ではイタレーションごとにΔＤの各要素の連続的な並べ替えが行われるのに対し、ＳＡＭ法では、割当状態に一致するようにＤ^ＸやＤ^Ｓの並べ替えが行われるのは、提案された割当変更が受け入れられた場合のみであるためである。 (scaling analysis)
Based on the benchmark results and a qualitative comparison between the VΔC method and the SAM method, the SAM method is considered to be more efficient. This is because the VΔC method reorders the elements of ΔD in every iteration, whereas the SAM method reorders D ^X and D ^S to match the assignment state only when a proposed assignment change is accepted.

一方、ＳＡＭ法とＢＭ＄法の相対的な性能は、主にＰＡＲに依存する。ＰＡＲは、使用されている探索アルゴリズムによって大きく異なる可能性がある。さらに、同じ探索アルゴリズム内であっても、ＰＡＲは、シミュレーテッドアニーリングなどのアルゴリズムを使用した実行時や、パラレルテンパリングなどで同時探索されるインスタンス間で変化する可能性がある。 On the other hand, the relative performance of SAM and BM$ methods mainly depends on PAR. PAR can vary greatly depending on the search algorithm used. Moreover, even within the same search algorithm, the PAR can change at run-time using algorithms such as simulated annealing and between concurrently searched instances such as parallel tempering.

以下、問題サイズとＰＡＲ値に応じた実行時間の比較結果を示す。
図１６は、測定アルゴリズムの一例を示す図である。
測定アルゴリズムでは、最初に１～１００のランダムな値をもつフロー行列（Ｆ）及び距離行列（Ｄ）が生成される。そして割当状態（φ）が初期化される。次に、所定のイタレーション回数（Ｉ_{ｌｉｍｉｔ}）の処理が実行され、１ループの実行時間が計測される。このとき、所望のＰＡＲ値で提案がランダムに受け入れられる。 The results of comparison of execution times according to problem sizes and PAR values are shown below.
FIG. 16 is a diagram showing an example of a measurement algorithm.
The measurement algorithm first generates a flow matrix (F) and a distance matrix (D) with random values from 1 to 100. Then the assignment state (φ) is initialized. Next, processing is executed for a predetermined number of iterations (I _limit ), and the execution time of the loop is measured. During this loop, proposals are accepted at random at the desired PAR value.

問題のサイズは、ＳＡＭ法及びＢＭ＄法のレプリカデータを格納するために使用されるメモリ階層に基づいて３つのグループに分割されている。ｎ＜２５６と２５６＜ｎ＜１０２４の問題サイズの場合には、Ｉ_{ｌｉｍｉｔ}はそれぞれ１００Ｍ及び１０Ｍが使用される。ｎ＞１０２４の問題サイズの場合には、Ｉ_{ｌｉｍｉｔ}は１Ｍが使用される。 The problem sizes are divided into three groups based on the memory hierarchy level used to store the replica data of the SAM method and the BM$ method. For problem sizes of n<256 and 256<n<1024, I _limit values of 100M and 10M are used, respectively. For problem sizes of n>1024, an I _limit of 1M is used.

各イタレーションでは、ΔＣ_ｅｘの値が計算され、乱数生成器が、所望のＰＡＲで提案を受け入れるか否かを判定するために用いられる。
ＣＰＵパイプライン、キャッシュ、メモリに全負荷がかかっている間の性能を測定するため、６４個のループインスタンスが並列に実行された（コアごとに１つ）。この処理は、データポイントごとに１０の異なる乱数シードを使用して繰り返され、平均実行時間が測定された。 In each iteration, the value of ΔC _ex is calculated, and a random number generator is used to determine whether or not to accept the proposal at the desired PAR.
64 loop instances were executed in parallel (one per core) to measure the performance while the CPU pipeline, cache, and memory were under full load. This process was repeated using 10 different random number seeds for each data point and the average execution time was measured.
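
The measurement loop can be sketched roughly as below for the SAM method. This is not the algorithm of FIG. 16 itself: the function names are hypothetical, symmetric matrices with zero diagonal are assumed so that the ΔC _ex expression described for the ΔC generation circuit applies, and the state-aligned matrix is assumed to be laid out so that column k of D ^X is column φ(k) of D.

```python
import random
import time

def random_symmetric_matrix(n, rng):
    """Symmetric matrix with zero diagonal and off-diagonal values in 1..100."""
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = rng.randint(1, 100)
    return m

def measure_loop(n, par, i_limit, seed=0):
    """Time i_limit iterations of propose, compute delta, accept at rate PAR."""
    rng = random.Random(seed)
    flow = random_symmetric_matrix(n, rng)
    dist = random_symmetric_matrix(n, rng)
    phi = list(range(n))                       # phi[entity] = destination
    rng.shuffle(phi)
    dx = [[row[p] for p in phi] for row in dist]   # state-aligned D (assumed layout)
    accepted = 0
    start = time.perf_counter()
    for _ in range(i_limit):
        a, b = rng.sample(range(n), 2)         # propose exchanging a and b
        ra, rb = dx[phi[a]], dx[phi[b]]
        delta = sum((flow[b][k] - flow[a][k]) * (ra[k] - rb[k]) for k in range(n))
        delta += 2 * flow[a][b] * ra[b]        # computed for timing only
        if rng.random() < par:                 # acceptance ignores delta on purpose
            for row in dx:                     # SAM-style update: swap two columns
                row[a], row[b] = row[b], row[a]
            phi[a], phi[b] = phi[b], phi[a]
            accepted += 1
    elapsed = time.perf_counter() - start
    return elapsed, accepted, phi, dx, dist

elapsed, accepted, phi, dx, dist = measure_loop(n=8, par=0.25, i_limit=200, seed=1)
```

Accepting at a fixed rate instead of by the sign of ΔC _ex is what lets the PAR be swept as an independent parameter. The iteration counts used in the actual measurement are far larger than the toy value shown here.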

フロー行列の密度がＢＭ＄法のキャッシュ行列（Ｈ）の更新関数に影響を与えるため、２つの別々のシミュレーションセットが実行された。１つは完全に密なフロー行列で、もう１つはスパースなフロー行列であり、ゼロ以外の値が１０％しかない。各測定は、ｎ＝［１００，５０００］の範囲の１０種類の問題サイズと、ＰＡＲ＝［０．０００１，０．１］の範囲の２９のＰＡＲを組み合わせて、合計２９０のパラメータの組み合わせについて計算が行われた。 Because the density of the flow matrix affects the update function of the cache matrix (H) of the BM$ method, two separate sets of simulations were performed. One used a fully dense flow matrix, and the other used a sparse flow matrix in which only 10% of the values are non-zero. Each set combined 10 problem sizes in the range n=[100,5000] with 29 PAR values in the range PAR=[0.0001,0.1], so that calculations were performed for a total of 290 parameter combinations.

図１７は、ＶΔＣ法、ＳＡＭ法、ＢＭ＄法について測定された相対的な高速化の度合いと、問題サイズに応じて占有されるメモリ階層を示す図である。高速化の度合いを示す図では、横軸はＰＡＲ［％］を表し、縦軸は高速化の度合い（Ｓｐｅｅｄ－Ｕｐ）を表す。図１７には、ＳＡＭ法及びＶΔＣ法に対するＢＭ＄法の高速化の度合いが、それぞれ異なる密度のフロー行列を用いた場合について示されている。さらに、図１７には、ＶΔＣ法に対するＳＡＭ法の高速化の度合いが示されている。図１７の例では、メモリ階層は、記憶容量が小さい順に、Ｌ２キャッシュ、Ｌ３キャッシュ、ＤＲＡＭがある。 FIG. 17 shows the relative speedups measured for the VΔC method, the SAM method, and the BM$ method, and the memory hierarchy level occupied according to the problem size. In the speedup plots, the horizontal axis represents PAR [%] and the vertical axis represents the degree of speedup (Speed-Up). FIG. 17 shows the speedup of the BM$ method over the SAM method and the VΔC method for flow matrices of different densities. FIG. 17 also shows the speedup of the SAM method over the VΔC method. In the example of FIG. 17, the memory hierarchy consists of, in ascending order of storage capacity, the L2 cache, the L3 cache, and DRAM.

なお、ＱＡＰの１０のインスタンスに対する非負荷分散のパラレルテンパリングシミュレーションによる、あるフロー行列の密度に対する最大のＰＡＲの値は、以下の表６のようなものであった。 It should be noted that the maximum PAR values for a given flow matrix density from non-load-sharing parallel tempering simulations for 10 instances of QAP were as shown in Table 6 below.

なお、ＱＳＡＰシミュレーションは、ＱＡＰシミュレーションと類似するため、省略した。両シミュレーションの結果には顕著な差異はないと考えられる。
問題サイズによる相対的な高速化の度合いは、メモリ階層のどの層に探索に用いるデータの大部分が格納されているかに基づいて、３つのグループに分けることができる。各コアには専用のＬ２キャッシュ（本計算例では記憶容量が２５６ｋＢ）があり、４つのコアのグループはＬ３キャッシュ（本計算例では記憶容量が１６ＭＢ）を共有している。 Note that the QSAP simulation is omitted because it is similar to the QAP simulation. It is considered that there is no significant difference between the results of both simulations.
The relative speedup with problem size can be divided into three groups based on which tier of the memory hierarchy contains most of the data used for the search. Each core has a dedicated L2 cache (256 kB storage capacity in this calculation example), and a group of four cores share an L3 cache (16 MB storage capacity in this calculation example).

図１７のように、ＳＡＭ法とＢＭ＄法の場合、各ループスレッドのＤ^Ｘとキャッシュ行列（Ｈ）は、コアのＬ２キャッシュに最大ｎ＝２５６のサイズで収まり、ｎ＝１０２４までの問題はＬ３キャッシュに収まる。 As shown in FIG. 17, in the case of the SAM method and the BM$ method, D ^X and the cache matrix (H) of each loop thread fit in the core's L2 cache for sizes up to n=256, and problems up to n=1024 fit in the L3 cache.

図１７のように、問題サイズが大きくなり、より記憶容量が大きいメモリ階層が用いられるようになると、ＢＭ＄法と他の２つの方法の間の相対的な性能（高速化の度合い）が低下する。探索に用いるデータが上層のメモリ階層に移動すると、ＢＭ＄法がＳＡＭ法及びＶΔＣ法よりも優れた性能を維持するために要するＰＡＲの値が大幅に減少する。 As shown in Figure 17, the relative performance (degree of speedup) between the BM$ method and the other two methods decreases as the problem size increases and memory hierarchies with larger storage capacities are used. do. When the data used for searching is moved to a higher memory hierarchy, the value of PAR required for the BM$ method to maintain superior performance over the SAM and VΔC methods is greatly reduced.

パラレルテンパリングを実施するソルバーの場合、表６に示すように、レプリカ全体の最大のＰＡＲの値は、インスタンスファミリ全体の問題サイズとともに必ずしも減少しない。これは、より小さなＱＡＰインスタンスについて、表２に示したのと同じ相対速度を維持するためのＰＡＲと問題サイズの関係は、必ずしも図１７に示すような関係に依存しないことを示している。 For solvers that implement parallel tempering, as shown in Table 6, the maximum PAR value across replicas does not necessarily decrease with problem size across instance families. This shows that for smaller QAP instances, the relationship between PAR and problem size to maintain the same relative rates as shown in Table 2 does not necessarily depend on the relationship as shown in FIG.

ＶΔＣ法に対するＳＡＭ法の高速化の度合いは、問題サイズに基づいて２つのグループに分けられる。問題サイズがｎ≦８００で、ループインスタンスごとのＤ^Ｘの大部分がキャッシュに収まる場合、ＳＡＭ法では、ＰＡＲ＜１０％の範囲ではＶΔＣよりも明らかに有利である。探索に用いるデータの保存先をＬ２キャッシュからＬ３キャッシュに移行することは、２つの方法間の相対的な性能に大きな影響を与えていない。 The speedup of the SAM method over the VΔC method falls into two groups based on problem size. When the problem size is n≤800 and most of D ^X for each loop instance fits in the cache, the SAM method has a clear advantage over the VΔC method in the range PAR<10%. Moving the data used for the search from the L2 cache to the L3 cache does not significantly affect the relative performance of the two methods.

（ハードウェア例）
以下、ＳＡＭ法を実現するためのハードウェア例を説明する。
なお、以下では、説明を簡略化するために、フロー行列と距離行列の両方が対称行列（対角成分が０（バイアスレス））であるものとして説明する。前述のように、このようなＱＡＰが、インスタンスの大部分であるし、計算を単純化するためである。対称行列を用いたＱＡＰは、非対称行列を用いたＱＡＰに直接的に変換可能である。 (Hardware example)
An example of hardware for implementing the SAM method will be described below.
To simplify the description, the following assumes that both the flow matrix and the distance matrix are symmetric matrices whose diagonal elements are 0 (biasless). As mentioned above, such QAPs make up the majority of instances, and the assumption simplifies the calculations. A QAP with symmetric matrices can be directly converted to a QAP with asymmetric matrices.

対称行列のみを用いたＱＡＰの評価関数は、以下の式（４０）のように表せる。 A QAP evaluation function using only a symmetric matrix can be expressed as in Equation (40) below.

式（１）と式（４０）との違いは、式（１）の“ｊ＝１”が“ｊ＝ｉ”となっている点だけである。
式（４０）で表される評価関数を用いた場合、ＱＡＰの計算に用いられるΔＣ_ｅｘは式（１９）、式（２２）の代わりに、以下の式（４１）、式（４２）で表せる。 The only difference between formula (1) and formula (40) is that "j=1" in formula (1) is "j=i".
When the evaluation function represented by Equation (40) is used, the ΔC _ex used for QAP calculation can be represented by the following Equations (41) and (42) instead of Equations (19) and (22).

図１８は、ΔＣ生成回路の一例を示す図である。
ΔＣ生成回路３０は、フロー行列メモリ３１ａ、状態整列Ｄ行列メモリ３１ｂ、差分計算回路３２ａ，３２ｂ、マルチプレクサ３３ａ，３３ｂ、内積計算回路３４、レジスタ３５、乗算回路３６、加算回路３７を有する。これらは、たとえば、図２に示した記憶部１１または処理部１２に含まれる回路やメモリである。 FIG. 18 is a diagram showing an example of a ΔC generation circuit.
The ΔC generation circuit 30 has a flow matrix memory 31 a , a state-aligned D matrix memory 31 b , difference calculation circuits 32 a and 32 b , multiplexers 33 a and 33 b , an inner product calculation circuit 34 , a register 35 , a multiplication circuit 36 and an addition circuit 37 . These are, for example, circuits and memories included in the storage unit 11 or the processing unit 12 shown in FIG.

フロー行列メモリ３１ａは、フロー行列（Ｆ）を記憶する。
状態整列Ｄ行列メモリ３１ｂは、距離行列を更新したものである状態整列Ｄ行列（Ｄ^Ｘ）を記憶する。 The flow matrix memory 31a stores a flow matrix (F).
The state-aligned D-matrix memory 31b stores a state-aligned D-matrix (D ^X ) which is an updated distance matrix.

図１８の例では、フロー行列メモリ３１ａと状態整列Ｄ行列メモリ３１ｂは、２つのポートを有するデュアルポートメモリである。一部のＦＰＧＡではこのようなデュアルポートメモリを使用可能である。 In the example of FIG. 18, the flow matrix memory 31a and the state alignment D matrix memory 31b are dual port memories having two ports. Such dual port memories are available in some FPGAs.

差分計算回路３２ａは、式（４２）のΔ^ｂ _ａＦであるＦ_ｂ，＊－Ｆ_ａ，＊を計算する。
差分計算回路３２ｂは、式（４２）のΔ^φ（ａ） _φ（ｂ）Ｄ^ＸであるＤ^Ｘ _{φ（ａ），＊}－Ｄ^Ｘ _{φ（ｂ），＊}を計算する。 The difference calculation circuit 32a calculates F _b,* -F _a,* which is Δ ^b _a F in equation (42).
The difference calculation circuit 32b calculates D ^X _φ(a),* −D ^X _φ(b),* , which is Δ ^φ(a) _φ(b) D ^X in equation (42).

マルチプレクサ３３ａは、Ｆ_ａ，＊からｆ_ａ，ｂを選択して出力する。
マルチプレクサ３３ｂは、Ｄ^Ｘ _{φ（ａ），＊}からｄ_{φ（ａ），ｂ}を選択して出力する。
内積計算回路３４は、Δ^ｂ _ａＦとΔ^φ（ａ） _φ（ｂ）Ｄ^Ｘの内積を計算する。内積計算回路３４は、たとえば、並列に接続された複数の乗算器により実現できる。 The multiplexer 33a selects and outputs f _a,b from F _a,* .
The multiplexer 33b selects and outputs d _φ(a),b from D ^X _φ(a),* .
The inner product calculation circuit 34 calculates the inner product of Δ ^b _a F and Δ ^φ(a) _φ(b) D ^X . The inner product calculation circuit 34 can be implemented, for example, by a plurality of multipliers connected in parallel.

レジスタ３５は、式（４２）に含まれる係数“２”を保持する。
乗算回路３６は２ｆ_ａ，ｂｄ_{φ（ａ），ｂ}を計算する。
加算回路３７は、内積の計算結果に２ｆ_ａ，ｂｄ_{φ（ａ），ｂ}を加算することでΔＣ_ｅｘを計算し、出力する。 Register 35 holds the coefficient "2" included in equation (42).
Multiplication circuit 36 computes 2f _a,b d _φ(a),b .
The addition circuit 37 calculates and outputs ΔC _ex by adding 2f _a,b d _φ(a),b to the inner product calculation result.
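
The data path of the ΔC generation circuit 30 can be mirrored in software as in the sketch below. Equations (40) to (42) are not restated here, so the code follows only the circuit description above. It assumes symmetric matrices with zero diagonal, an evaluation function summed over pairs i<j, and a state-aligned matrix in which column k of D ^X is column φ(k) of D. The function names and the small example matrices are hypothetical.

```python
def cost(flow, dist, phi):
    """Evaluation function summed over pairs i < j (symmetric, zero diagonal)."""
    n = len(phi)
    return sum(flow[i][j] * dist[phi[i]][phi[j]]
               for i in range(n) for j in range(i + 1, n))

def state_aligned(dist, phi):
    """D^X under the assumed layout: column k of D^X is column phi[k] of D."""
    return [[row[p] for p in phi] for row in dist]

def delta_c_ex(flow, dx, phi, a, b):
    """Cost change for exchanging the destinations of entities a and b."""
    n = len(phi)
    delta_f = [flow[b][k] - flow[a][k] for k in range(n)]         # circuit 32a
    delta_d = [dx[phi[a]][k] - dx[phi[b]][k] for k in range(n)]   # circuit 32b
    f_ab = flow[a][b]                                             # multiplexer 33a
    d_ab = dx[phi[a]][b]                                          # multiplexer 33b
    inner = sum(x * y for x, y in zip(delta_f, delta_d))          # circuit 34
    return inner + 2 * f_ab * d_ab                                # circuits 36 and 37

flow = [[0, 3, 1, 2], [3, 0, 4, 1], [1, 4, 0, 5], [2, 1, 5, 0]]
dist = [[0, 2, 7, 3], [2, 0, 6, 4], [7, 6, 0, 1], [3, 4, 1, 0]]
phi = [2, 0, 3, 1]
dx = state_aligned(dist, phi)

a, b = 0, 2
swapped = phi[:]
swapped[a], swapped[b] = swapped[b], swapped[a]
brute_force = cost(flow, dist, swapped) - cost(flow, dist, phi)
```

Under these assumptions `delta_c_ex(flow, dx, phi, a, b)` agrees with `brute_force`. The added term 2f _a,b d _φ(a),b compensates for the two positions k=a and k=b, which the inner product over all k includes but which do not contribute to the true cost change.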

このようなハードウェアを用いたΔＣ_ｅｘの計算は、たとえば、図２の処理部１２に含まれる図示しない制御回路がプログラムを実行することで制御される。
前述のようにＳＡＭ法では、ΔＣ_ｅｘを発生させる要素の割当先の入れ替え（割当変更）が提案され、受け入れられると、状態整列Ｄ行列は、受け入れられた割当状態に対応するように、列が入れ替えられる。 Calculation of ΔC _ex using such hardware is controlled by, for example, a control circuit (not shown) included in the processing unit 12 of FIG. 2 executing a program.
As described above, in the SAM method, an exchange of the assignment destinations of elements (an assignment change) that produces ΔC _ex is proposed, and when it is accepted, the columns of the state-aligned D matrix are swapped so that the matrix corresponds to the accepted assignment state.

図１９は、列の入れ替えを行うハードウェア構成の第１の例を示す図である。
図１９のように、状態整列Ｄ行列メモリ３１ｂに記憶されている状態整列Ｄ行列の列の入れ替えは、たとえば、マルチプレクサ４０ａ，４０ｂ、スイッチ４１を用いて行うことができる。これらの回路も図２の処理部１２に含みうる。 FIG. 19 is a diagram showing a first example of a hardware configuration for permuting columns.
As shown in FIG. 19, the permutation of the columns of the state-aligned D-matrix stored in the state-aligned D-matrix memory 31b can be performed using multiplexers 40a and 40b and a switch 41, for example. These circuits may also be included in the processing unit 12 of FIG.

マルチプレクサ４０ａは、状態整列Ｄ行列メモリ３１ｂから入れ替えのために読み出される状態整列Ｄ行列の各行のうちの、第１の列の値を順に選択して出力する。
マルチプレクサ４０ｂは、状態整列Ｄ行列メモリ３１ｂから読み出される状態整列Ｄ行列の各行のうちの、第２の列の値を順に選択して出力する。 The multiplexer 40a sequentially selects and outputs the values of the first columns of the rows of the state-aligned D-matrix read out for permutation from the state-aligned D-matrix memory 31b.
The multiplexer 40b sequentially selects and outputs the values of the second columns of the rows of the state-aligned D-matrix read from the state-aligned D-matrix memory 31b.

スイッチ４１は、状態整列Ｄ行列メモリ３１ｂにおいて、マルチプレクサ４０ａから出力される値を、マルチプレクサ４０ｂから出力される値が記憶されていた場所に記憶させるように記憶場所を入れ替えて書き込む。また、スイッチ４１は、状態整列Ｄ行列メモリ３１ｂにおいて、マルチプレクサ４０ｂから出力される値を、マルチプレクサ４０ａから出力される値が記憶されていた場所に記憶させるように記憶場所を入れ替えて書き込む。 The switch 41 writes the value output from the multiplexer 40a to the location where the value output from the multiplexer 40b was stored in the state-arranged D-matrix memory 31b. Also, the switch 41 writes the value output from the multiplexer 40b to the location where the value output from the multiplexer 40a was stored in the state-arranged D-matrix memory 31b.

このようなハードウェアを用いた列の入れ替えは、たとえば、図２の処理部１２に含まれる図示しない制御回路がプログラムを実行することで制御される。
図２０は、列の入れ替え例を示す図である。図２０では、１列目と３列目の入れ替えが行われる例が示されている。 Such column permutation using hardware is controlled by, for example, a control circuit (not shown) included in the processing unit 12 of FIG. 2 executing a program.
FIG. 20 is a diagram illustrating an example of column permutation. FIG. 20 shows an example in which the first and third columns are interchanged.

最初にｄ_１，３とｄ_１，１がマルチプレクサ４０ａ，４０ｂによって選択され、スイッチ４１によって記憶場所が入れ替えられる。次に、ｄ_２，３とｄ_２，１がマルチプレクサ４０ａ，４０ｂによって選択され、スイッチ４１によって記憶場所が入れ替えられる。同様の処理が合計ｎサイクル繰り返されることで、列の入れ替えが完了する。 First, d _1,3 and d _1,1 are selected by multiplexers 40 a and 40 b , and switch 41 switches the memory location. Next, d _2,3 and d _2,1 are selected by multiplexers 40 a and 40 b and switched by switch 41 . The same process is repeated for a total of n cycles to complete the column permutation.
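
The row-by-row exchange performed with the multiplexers 40a and 40b and the switch 41 can be modeled as in the following sketch. The function name and the example values are hypothetical, and each loop pass stands for one cycle.

```python
def swap_columns(dx, col_a, col_b):
    """Swap two columns of dx in place, one row per cycle.
    Returns the number of cycles used."""
    cycles = 0
    for row in dx:
        value_a = row[col_a]    # value selected by multiplexer 40a
        value_b = row[col_b]    # value selected by multiplexer 40b
        row[col_b] = value_a    # switch 41 writes each value to the location
        row[col_a] = value_b    # the other value was read from
        cycles += 1
    return cycles

dx = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
cycles = swap_columns(dx, 2, 0)   # exchange the 3rd and 1st columns
```

For an n-by-n matrix the exchange takes n passes, matching the n cycles stated above.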

図２１は、列の入れ替えを行うハードウェア構成の第２の例を示す図である。図２１において、図１９に示した要素と同じ要素については同一符号が付されている。
図２１のように、状態整列Ｄ行列メモリ３１ｂには、状態整列Ｄ行列のほか、状態整列Ｄ行列の転置行列（（Ｄ^Ｘ）^Ｔ）も記憶されている。状態整列Ｄ行列の列の入れ替えは、読み出される転置行列の要素を用いて、たとえば、前述のスイッチ４１のほか、シフトレジスタ４５ａ，４５ｂにより行うことができる。これらの回路も処理部１２に含みうる。 FIG. 21 is a diagram showing a second example of a hardware configuration for permuting columns. In FIG. 21, the same reference numerals are assigned to the same elements as those shown in FIG.
As shown in FIG. 21, the state-aligned D-matrix memory 31b stores not only the state-aligned D-matrix but also the transposed matrix ((D ^X ) ^T ) of the state-aligned D-matrix. The permutation of the columns of the state-arranged D-matrix can be performed by using the elements of the read-out transposed matrix, for example, by the shift registers 45a and 45b in addition to the switch 41 described above. These circuits may also be included in the processing unit 12 .

図２１の回路構成では、状態整列Ｄ行列メモリ３１ｂからは、状態整列Ｄ行列の入れ替えが行われる２列に対応する、転置行列の２行が読み出される。
シフトレジスタ４５ａは、状態整列Ｄ行列メモリ３１ｂから読み出される転置行列の２行のうちの、第１の行のｎ個の値を保持し、１サイクルずつシフトさせて１つずつ値を出力する。 In the circuit configuration of FIG. 21, two rows of the transposed matrix corresponding to the two columns in which the state-aligned D-matrix is permuted are read from the state-aligned D-matrix memory 31b.
The shift register 45a holds n values in the first row of the two rows of the transposed matrix read from the state-aligned D-matrix memory 31b, shifts them one cycle at a time, and outputs the values one at a time.

シフトレジスタ４５ｂは、状態整列Ｄ行列メモリ３１ｂから読み出される転置行列の２行のうちの、第２の行のｎ個の値を保持し、１サイクルずつシフトさせて１つずつ値を出力する。 The shift register 45b holds n values in the second row of the two rows of the transposed matrix read from the state-aligned D-matrix memory 31b, shifts them one cycle at a time, and outputs the values one at a time.

スイッチ４１は、状態整列Ｄ行列メモリ３１ｂにおいて、シフトレジスタ４５ａから出力される値を、シフトレジスタ４５ｂから出力される値が記憶されていた場所に記憶させるように記憶場所を切り替える。また、スイッチ４１は、状態整列Ｄ行列メモリ３１ｂにおいて、シフトレジスタ４５ｂから出力される値を、シフトレジスタ４５ａから出力される値が記憶されていた場所に記憶させるように記憶場所を切り替える。 The switch 41 switches the storage location in the state-aligned D-matrix memory 31b so that the value output from the shift register 45a is stored in the location where the value output from the shift register 45b was stored. Also, the switch 41 switches the storage location in the state-arranged D-matrix memory 31b so that the value output from the shift register 45b is stored in the location where the value output from the shift register 45a was stored.

このようなハードウェア構成によってもｎサイクルで列の入れ替えができる。また、図２１のハードウェア構成の場合、図１９のハードウェア構成よりも配線性が向上する可能性がある。また、比較的回路規模が大きくなる可能性がある図１９に示したようなマルチプレクサ４０ａ，４０ｂが不要になる。 Even with such a hardware configuration, columns can be exchanged in n cycles. Also, in the case of the hardware configuration of FIG. 21, there is a possibility that wiring performance is improved compared to the hardware configuration of FIG. Moreover, the multiplexers 40a and 40b as shown in FIG. 19, which may have a relatively large circuit scale, are not required.
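
The shift-register variant can be modeled in the same way. In this sketch the two rows of the transposed matrix are loaded into queues that stand in for the shift registers 45a and 45b, and one value per register is shifted out in each cycle. The function name is hypothetical, and the final step that swaps the two rows of the stored transpose to keep it consistent is an assumption that the description above does not spell out.

```python
from collections import deque

def swap_columns_via_transpose(dx, dx_t, col_a, col_b):
    """Swap columns col_a and col_b of dx using two rows of its transpose dx_t."""
    shift_a = deque(dx_t[col_a])    # shift register 45a holds column col_a of dx
    shift_b = deque(dx_t[col_b])    # shift register 45b holds column col_b of dx
    cycles = 0
    for row in dx:
        row[col_b] = shift_a.popleft()   # switch 41 crosses the two streams
        row[col_a] = shift_b.popleft()
        cycles += 1
    # Assumption: keep the stored transpose consistent by swapping its rows.
    dx_t[col_a], dx_t[col_b] = dx_t[col_b], dx_t[col_a]
    return cycles

dx = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
dx_t = [list(col) for col in zip(*dx)]
cycles = swap_columns_via_transpose(dx, dx_t, 0, 2)
```

Because each column of D ^X is a contiguous row of the transpose, both columns can be read as whole rows up front, which is why no per-row column multiplexer is needed in this model.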

図２２は、列の入れ替えを行うハードウェア構成の第２の例の１つ目の変形例を示す図である。図２２において、図２１に示した要素と同じ要素については同一符号が付されている。 FIG. 22 is a diagram showing a first modification of the second example of the hardware configuration for permuting columns. In FIG. 22, the same symbols are assigned to the same elements as those shown in FIG.

図２２のハードウェア構成では、状態整列Ｄ行列メモリ３１ｂと分離した転置行列メモリ４６に、状態整列Ｄ行列の転置行列（（Ｄ^Ｘ）^Ｔ）が記憶されている点が、図２１のハードウェア構成と異なっている。転置行列メモリ４６から、転置行列の上記２行が読み出される。その他の構成については、図２１と同様である。 The hardware configuration of FIG. 22 differs from that of FIG. 21 in that the transposed matrix ((D ^X ) ^T ) of the state-aligned D matrix is stored in a transposed matrix memory 46 that is separate from the state-aligned D matrix memory 31b. The above two rows of the transposed matrix are read from the transposed matrix memory 46. The rest of the configuration is the same as in FIG. 21.

FIG. 23 is a diagram showing a second modification of the second example of the hardware configuration for swapping columns. In FIG. 23, elements identical to those shown in FIG. 22 are given the same reference numerals.

In the hardware configuration of FIG. 23, the flow matrix memory 31a shown in FIG. 18 stores, in addition to the flow matrix, the transposed matrix ((D^X)^T) of the state-aligned D matrix. The columns of the state-aligned D matrix can be swapped by the switch 41 and the shift registers 45a and 45b as described above, using the elements of the transposed matrix read from the flow matrix memory 31a.

FIG. 24 is a diagram showing a third example of the hardware configuration for swapping columns. In FIG. 24, elements identical to those shown in FIG. 19 are given the same reference numerals.
In the hardware configuration of FIG. 24, the state-aligned D matrix is stored in two state-aligned D matrix memories 31b1 and 31b2.

Using the elements of each row of the state-aligned D matrix read from the state-aligned D matrix memory 31b1, the columns of the state-aligned D matrix stored in the state-aligned D matrix memory 31b2 are swapped in the same way as with the hardware configuration of FIG. 19.

The state-aligned D matrix after the column swap is copied to the state-aligned D matrix memory 31b1.
With this hardware configuration as well, the columns can be swapped in n cycles. In addition, the hardware configuration of FIG. 24 may offer better routability than that of FIG. 19.
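
The two-memory scheme can be sketched in software as follows. This is a minimal illustration under stated assumptions: the function name `swap_columns_double_buffer` and the list-of-lists representation of the memories 31b1 and 31b2 are hypothetical, and both memories are assumed to hold the same matrix before the swap. One row is read from the first memory per iteration, and only the two affected values are written, with their positions crossed, to the second memory; the result is then copied back.

```python
def swap_columns_double_buffer(mem1, mem2, ca, cb):
    """Swap columns ca and cb using a read memory (mem1) and a write memory (mem2).

    Both arguments are lists of rows and are assumed to hold identical
    matrices on entry. Both hold the swapped matrix on return.
    """
    for r, row in enumerate(mem1):
        # Read one row from the first memory and pick the two column values.
        va, vb = row[ca], row[cb]
        # Write them to the second memory with their locations exchanged.
        mem2[r][ca] = vb
        mem2[r][cb] = va

    # Copy the swapped matrix back so the first memory is up to date.
    for r in range(len(mem2)):
        mem1[r] = list(mem2[r])
```

Because the reads and the writes go to different memories in this model, a read never observes a half-updated row.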

FIG. 25 is a diagram showing a modification of the third example of the hardware configuration for swapping columns. In FIG. 25, elements identical to those shown in FIG. 19 are given the same reference numerals.
In the hardware configuration of FIG. 25, the flow matrix memory 31a shown in FIG. 18 stores, in addition to the flow matrix, the transposed matrix ((D^X)^T) of the state-aligned D matrix. The columns of the state-aligned D matrix can be swapped by the multiplexers 40a and 40b and the switch 41 as described above, using the elements of the transposed matrix read from the flow matrix memory 31a.

FIG. 26 is a diagram showing another example of the ΔC generation circuit. In FIG. 26, elements identical to those shown in FIG. 18 are given the same reference numerals.
In the ΔC generation circuit 50 of FIG. 26, single-port memories are used as the flow matrix memory 51a and the state-aligned D matrix memory 51b. In this case, a register 52a is used that holds F_{a,*}, outputs F_{a,*} at the timing when F_{b,*} is read from the flow matrix memory 51a, and supplies it to the difference calculation circuit 32a and the multiplexer 33a. Likewise, a register 52b is used that holds D^X_{φ(a),*}, outputs D^X_{φ(a),*} at the timing when D^X_{φ(b),*} is read from the state-aligned D matrix memory 51b, and supplies it to the difference calculation circuit 32b and the multiplexer 33b.

The rest of the configuration is the same as in FIG. 18.
When any of the hardware configurations for column swapping shown in FIGS. 21 to 24 is applied to the ΔC generation circuit 30 or 50 shown in FIG. 18 or FIG. 26, processing for two replicas can be executed at the same time.

FIG. 27 is a diagram showing an example of a hardware configuration that performs processing for two replicas. FIG. 27 shows a configuration in which the ΔC generation circuit 50 shown in FIG. 26 is combined with the hardware configuration for column swapping shown in FIG. 22.

In the example of FIG. 27, the state-aligned D matrix memory 51b stores the state-aligned D matrices (D^X_R1, D^X_R2) of two replicas (R1, R2).
When an assignment change is accepted for one replica, the column swap for that replica is performed using the write port of the state-aligned D matrix memory 51b. Meanwhile, for the other replica, ΔC_ex is calculated based on the elements of the state-aligned D matrix read from the read port of the state-aligned D matrix memory 51b.

Note that when neither replica is executing an update, processing for one replica is stalled while ΔC_ex is being calculated for the other replica.

FIG. 28 is a diagram showing another example of the ΔC generation circuit. In FIG. 28, elements identical to those shown in FIG. 18 are given the same reference numerals.
When only symmetric permutation problems are to be computed and ΔC_ex is expressed by the following equation (43), a ΔC generation circuit 60 as shown in FIG. 28 can also be used.

The ΔC generation circuit 60 does not need the multiplexers 33a and 33b shown in FIG. 18; a selector 61 is provided instead.
For example, the difference calculation circuit 32b has n subtractors 32bi, each of which calculates d_{φ(a),i} − d_{φ(b),i}, and the selector 61 has n two-input, one-output multiplexers 61ai, each of which selects and outputs either the output of the subtractor 32bi or 0.

The multiplexer 61ai outputs 0 when i = a or i = b in equation (43). A method other than a multiplexer may be used to output 0 in these cases.
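
The effect of the selector can be illustrated with the following software sketch. Equation (43) is not reproduced in this text, so the sketch assumes the common symmetric form, namely twice the sum over i other than a and b of (f_{b,i} − f_{a,i})(d_{φ(a),φ(i)} − d_{φ(b),φ(i)}), for symmetric flow and distance matrices with zero diagonals and a cost summed over all ordered pairs. The function names `qap_cost` and `delta_sym` are hypothetical.

```python
def qap_cost(F, D, phi):
    """QAP cost summed over all ordered pairs (assumed cost definition)."""
    n = len(phi)
    return sum(F[i][j] * D[phi[i]][phi[j]] for i in range(n) for j in range(n))


def delta_sym(F, D, phi, a, b):
    """Cost change from exchanging the destinations of elements a and b.

    Assumes F and D are symmetric with zero diagonals.
    """
    n = len(phi)
    pa, pb = phi[a], phi[b]
    # Difference of the two flow rows (role of difference circuit 32a).
    f_diff = [F[b][i] - F[a][i] for i in range(n)]
    # Difference of the two state-aligned distance rows (role of 32b).
    d_diff = [D[pa][phi[i]] - D[pb][phi[i]] for i in range(n)]
    # Selector: force the difference to 0 at i = a and i = b, so those
    # positions drop out of the inner product.
    d_sel = [0 if i in (a, b) else d for i, d in enumerate(d_diff)]
    # Inner product; the factor 2 follows from symmetry under the assumed
    # ordered-pair cost definition.
    return 2 * sum(f * d for f, d in zip(f_diff, d_sel))
```

For any swap (a, b), `delta_sym` should equal the difference between `qap_cost` after and before the swap, provided the stated symmetry and zero-diagonal assumptions hold.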

(Replica processing circuit)
A circuit that performs processing for each replica can be realized by combining any of the ΔC generation circuits 30, 50, and 60 described above with any of the hardware configurations described above for swapping the columns of the state-aligned D matrix.

FIG. 29 is a diagram showing an example of a replica processing circuit. FIG. 29 shows an example of a replica processing circuit 70 in which the ΔC generation circuit 50 shown in FIG. 26 is combined with the hardware configuration for column swapping shown in FIG. 19. In FIG. 29, elements identical to those shown in FIGS. 19 and 26 are given the same reference numerals.

In the example of FIG. 29, the multiplexer 33b also serves as the multiplexer 40a shown in FIG. 19.
These circuits may be included in the storage unit 11 or the processing unit 12 of FIG. 2.

Note that the configuration that stores the integer assignment vector φ representing the assignment state, the configuration that determines whether to accept a proposed assignment change producing the calculated ΔC_ex, and the like are omitted from the drawing.

(Extension to QAP with asymmetric matrices)
In a QAP calculation using asymmetric matrices (with non-zero diagonal elements), ΔC_asym, which corresponds to ΔC_ex in the symmetric case, is expressed by the following equation (44).

A replica processing circuit that calculates ΔC_asym can be realized using replica processing circuits that calculate ΔC_ex.
FIG. 30 is a diagram showing an example of a replica processing circuit used in a QAP calculation using asymmetric matrices. In FIG. 30, elements identical to those shown in FIG. 29 are given the same reference numerals.

The replica processing circuit 80 that calculates ΔC_asym has two replica processing circuits 70a1 and 70a2, each of which calculates ΔC_ex as shown in FIG. 29. However, the flow matrix memory 51a of the replica processing circuit 70a2 stores the transposed matrix (F^T) of the flow matrix, and the state-aligned D matrix memory 51b of the replica processing circuit 70a2 stores the transposed matrix ((D^X)^T) of the state-aligned D matrix.

The replica processing circuit 80 further has memories 81a and 81b, registers 82a and 82b, difference calculation circuits 83a and 83b, a multiplication circuit 84, a compensation term calculation circuit 85, and an addition circuit 86.

The memory 81a stores the diagonal elements (F_d) of the flow matrix, and the memory 81b stores the diagonal elements (D_d) of the distance matrix. With asymmetric matrices, unlike with symmetric matrices, these diagonal elements may include non-zero values.

The register 82a holds one of f_{a,a} and f_{b,b} read from the memory 81a, outputs the held value at the timing when the other of f_{a,a} and f_{b,b} is read from the memory 81a, and supplies it to the difference calculation circuit 83a.

The register 82b holds one of d_{φ(a),φ(a)} and d_{φ(b),φ(b)} read from the memory 81b. At the timing when the other of d_{φ(a),φ(a)} and d_{φ(b),φ(b)} is read from the memory 81b, the register 82b outputs the held value and supplies it to the difference calculation circuit 83b.

The difference calculation circuit 83a calculates f_{b,b} − f_{a,a} in equation (44).
The difference calculation circuit 83b calculates d_{φ(a),φ(a)} − d_{φ(b),φ(b)} in equation (44).

The multiplication circuit 84 calculates (f_{b,b} − f_{a,a})(d_{φ(a),φ(a)} − d_{φ(b),φ(b)}) in equation (44) based on the calculation results of the difference calculation circuits 83a and 83b.
The compensation term calculation circuit 85 calculates the compensation term, which is the fourth term on the right side of equation (44) (a term that compensates for removing the condition that the calculation for i = a, b be skipped). To calculate the compensation term, the compensation term calculation circuit 85 receives f_{a,b} and d_{φ(a),φ(b)} from the multiplexers 33a and 33b of the replica processing circuit 70a1, and receives f_{b,a} and d_{φ(b),φ(a)} from the multiplexers 33a and 33b of the replica processing circuit 70a2.

The addition circuit 86 receives the value of the first term on the right side of equation (44) from the inner product calculation circuit 34 of the replica processing circuit 70a1, and the value of the second term from the inner product calculation circuit 34 of the replica processing circuit 70a2. The addition circuit 86 further receives the value of the third term from the multiplication circuit 84 and the value of the fourth term from the compensation term calculation circuit 85. The addition circuit 86 then generates and outputs ΔC_asym by calculating the sum of these values.

These circuits, memories, and the like may be included in the storage unit 11 or the processing unit 12 of FIG. 2.
Note that the configuration that stores the integer assignment vector φ representing the assignment state, the configuration that determines whether to accept a proposed assignment change producing the calculated ΔC_ex, and the like are omitted from the drawing.

With the hardware configurations described above, a local search by the SAM method can be performed for QAP. A configuration that performs a local search by the SAM method for QSAP can also be realized by modifying the above hardware configurations as appropriate. For example, arithmetic circuits (such as a multiplication circuit and an addition circuit) for performing the calculation in the second line of equation (24) are added.

The processing described above (for example, FIGS. 6 to 8, 10, and 11) can be realized in software by causing the data processing device 10 to execute a program.
The program can be recorded on a computer-readable recording medium. Examples of the recording medium include a magnetic disk, an optical disc, a magneto-optical disk, and a semiconductor memory. Magnetic disks include flexible disks (FDs) and HDDs. Optical discs include CD (Compact Disc), CD-R (Recordable)/RW (Rewritable), DVD (Digital Versatile Disc), and DVD-R/RW. The program may be distributed on a portable recording medium. In that case, the program may be copied from the portable recording medium to another recording medium and then executed.

FIG. 31 is a diagram showing an example of the hardware of a computer that is an example of the data processing device.
The computer 90 has a CPU 91, a RAM 92, an HDD 93, a GPU 94, an input interface 95, a medium reader 96, and a communication interface 97. These units are connected to a bus.

The CPU 91 is a processor including arithmetic circuits that execute program instructions. The CPU 91 loads at least part of the programs and data stored in the HDD 93 into the RAM 92 and executes the programs. The CPU 91 may include a plurality of processor cores in order to process a plurality of replicas in parallel, for example as shown in FIG. 4. The computer 90 may also include a plurality of processors. A set of processors (a multiprocessor) may be referred to as a "processor".

The RAM 92 is a volatile semiconductor memory that temporarily stores programs executed by the CPU 91 and data used by the CPU 91 for computation. The computer 90 may include a type of memory other than the RAM 92, and may include a plurality of memories.

The HDD 93 is a non-volatile storage device that stores data and software programs such as an OS (Operating System), middleware, and application software. The programs include, for example, a program that causes the computer 90 to execute the process of searching for a solution to an assignment problem as described above. The computer 90 may include other types of storage devices, such as a flash memory or an SSD (Solid State Drive), and may include a plurality of non-volatile storage devices.

The GPU 94 outputs images (for example, an image representing the calculation result of an assignment problem) to a display 94a connected to the computer 90 in accordance with instructions from the CPU 91. A CRT (Cathode Ray Tube) display, a liquid crystal display (LCD), a plasma display panel (PDP), an organic electro-luminescence (OEL) display, or the like can be used as the display 94a.

The input interface 95 acquires an input signal from an input device 95a connected to the computer 90 and outputs it to the CPU 91. A pointing device such as a mouse, a touch panel, a touch pad, or a trackball, as well as a keyboard, a remote controller, a button switch, or the like can be used as the input device 95a. A plurality of types of input devices may be connected to the computer 90.

The medium reader 96 is a reading device that reads programs and data recorded on a recording medium 96a. For example, a magnetic disk, an optical disc, a magneto-optical (MO) disk, or a semiconductor memory can be used as the recording medium 96a. Magnetic disks include FDs and HDDs. Optical discs include CDs and DVDs.

The medium reader 96 copies, for example, the programs and data read from the recording medium 96a to another recording medium such as the RAM 92 or the HDD 93. The read program is executed by, for example, the CPU 91. The recording medium 96a may be a portable recording medium and may be used to distribute programs and data. The recording medium 96a and the HDD 93 may each be referred to as a computer-readable recording medium.

The communication interface 97 is connected to a network 97a and communicates with other information processing devices via the network 97a. The communication interface 97 may be a wired communication interface connected by cable to a communication device such as a switch, or a wireless communication interface connected to a base station by a wireless link.

One aspect of the program, the data processing device, and the data processing method of the present invention has been described above based on the embodiments; however, these are merely examples, and the invention is not limited to the above description.

For example, in the above description the columns of the distance matrix are rearranged according to the assignment state; however, the same effects can be obtained by rearranging the rows of the distance matrix according to the assignment state, with the equations modified as appropriate.

10 Data processing device
11 Storage unit
12 Processing unit

Claims

A program that causes a computer to execute a process of searching for a solution to an assignment problem by local search using an evaluation function representing a cost that depends on an assignment state, the process comprising:
calculating, by vector arithmetic operations based on a flow matrix and a distance matrix stored in a memory, a first amount of change in the evaluation function that arises from a first assignment change in which the assignment destinations of a first element and a second element among a plurality of elements are exchanged, the flow matrix representing flow amounts between the plurality of elements assigned to a plurality of assignment destinations, and the distance matrix representing distances between the plurality of assignment destinations;
determining, based on the first amount of change, whether to accept the first assignment change; and
upon determining to accept the first assignment change, updating the assignment state and updating the distance matrix by exchanging two columns or two rows of the distance matrix that correspond to the first element and the second element.

The program according to claim 1, wherein, when the assignment problem is a QAP, the process includes determining whether to accept the first assignment change based on a result of comparing a random value with an acceptance probability calculated from the first amount of change and a value of a temperature parameter.

The program according to claim 1, wherein, when the assignment problem is a QSAP, the process includes:
before calculating the first amount of change,
calculating a second amount of change in the evaluation function that arises from a second assignment change in which the first element is assigned to a first assignment destination;
determining whether to accept the second assignment change based on a result of comparing a random value with an acceptance probability calculated from the second amount of change and a value of a temperature parameter; and
upon determining to accept the second assignment change, updating the assignment state and the distance matrix; and
after calculating the first amount of change,
determining whether to accept the first assignment change based on a result of comparing the first amount of change with a predetermined value.

The program according to claim 1, wherein the process includes exchanging the two columns by repeating:
reading the distance matrix from the memory one row at a time;
selecting, from the row that was read, the two values belonging to the two columns; and
swapping the storage locations of the two values in the memory.

The program according to claim 1, wherein
the memory further stores a transposed matrix of the distance matrix, and
the process includes storing, of the two rows of the transposed matrix that correspond to the two columns of the distance matrix, a first row in a first shift register and a second row in a second shift register, and
exchanging the two columns by repeatedly swapping the storage locations of two values output one at a time from the first shift register and the second shift register, respectively, and writing the two values to the memory.

The program according to claim 1, wherein
the memory includes a first memory and a second memory, each of which stores the distance matrix, and
the process includes exchanging the two columns by repeating:
reading the distance matrix from the first memory one row at a time;
selecting, from the row that was read, the two values belonging to the two columns; and
swapping the storage locations of the two values and writing the two values to the second memory.

The program according to claim 1, wherein
the local search is performed by parallel tempering using a plurality of replicas, each of which is assigned a different value of a temperature parameter,
the memory stores the distance matrix for each of a first replica and a second replica among the plurality of replicas, and
the process includes calculating the first amount of change based on the distance matrix of the second replica while the distance matrix of the first replica is being updated.

A data processing device that searches for a solution to an assignment problem by local search using an evaluation function representing a cost that depends on an assignment state, the data processing device comprising:
a storage unit that stores a flow matrix representing flow amounts between a plurality of elements assigned to a plurality of assignment destinations, and a distance matrix representing distances between the plurality of assignment destinations; and
a processing unit that calculates, by vector arithmetic operations based on the flow matrix and the distance matrix, a first amount of change in the evaluation function that arises from a first assignment change in which the assignment destinations of a first element and a second element among the plurality of elements are exchanged, determines, based on the first amount of change, whether to accept the first assignment change, and, upon determining to accept the first assignment change, updates the assignment state and updates the distance matrix by exchanging two columns or two rows of the distance matrix that correspond to the first element and the second element.

A data processing method for searching for a solution to an assignment problem by local search using an evaluation function representing a cost that depends on an assignment state, the method comprising, by a computer:
calculating, by vector arithmetic operations based on a flow matrix and a distance matrix stored in a memory, a first amount of change in the evaluation function that arises from a first assignment change in which the assignment destinations of a first element and a second element among a plurality of elements are exchanged, the flow matrix representing flow amounts between the plurality of elements assigned to a plurality of assignment destinations, and the distance matrix representing distances between the plurality of assignment destinations;
determining, based on the first amount of change, whether to accept the first assignment change; and
upon determining to accept the first assignment change, updating the assignment state and updating the distance matrix by exchanging two columns or two rows of the distance matrix that correspond to the first element and the second element.