JP2012173793A

JP2012173793A - Predictor selection device, predictor selection method, and predictor selection program

Info

Publication number: JP2012173793A
Application number: JP2011032316A
Authority: JP
Inventors: Yoshihiko Kazuhara; 良彦数原; Jun Suzuki; 潤鈴木; Yoshihito Yasuda; 宜仁安田; Yoshimasa Koike; 義昌小池; Ryoji Kataoka; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-02-17
Filing date: 2011-02-17
Publication date: 2012-09-10
Anticipated expiration: 2031-02-17
Also published as: JP5432935B2

Abstract

PROBLEM TO BE SOLVED: To select an optimal predictor for an asymmetric similarity relationship. SOLUTION: A cluster representation DB 10 stores a feature representation for each training cluster. A transformation matrix DB 40 stores a transformation matrix that, when applied to a feature representation, maximizes the asymmetric similarity for each pair of training clusters. A similarity calculation unit 50 uses the transformation matrix in DB 40 to transform the feature representation of an input test cluster. It computes the similarity between the transformed test-cluster representation and each feature representation stored in DB 10, and sends the ID of the training cluster with the maximum similarity to a predictor selection unit 70. The predictor selection unit 70 selects the predictor generated from the cluster with the received ID from a predictor DB 60 and outputs it.

Description

The present invention relates to a technique for performing prediction (for example, classification or ranking) using information that can be expressed as high-dimensional vectors, such as text or images.

As is well known, when data such as text or images are represented as high-dimensional vectors and their classes are predicted, predictors (models) generated per cluster are used. An effective scheme is the following: training data with the same characteristics are grouped into one cluster (a training cluster), so that the training data are divided into multiple clusters; a predictor is generated from the training data in each cluster; and an appropriate predictor is selectively applied to an input group of test data (a test cluster).

In this scheme, the performance of each model cannot be evaluated on the test cluster itself. A known way to choose the cluster whose predictor is used is therefore, as shown in FIGS. 4 and 5, to select the training cluster with the maximum similarity to the test cluster and to apply the model generated from that training cluster to the test cluster.

Conventionally, as shown in Non-Patent Document 1, training data are used to generate a transformation matrix that maps features into a space maximizing the similarity to the cluster whose model performs best on the test cluster, thereby improving the accuracy of training-cluster selection.

Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, Stuart Russell, "Distance Metric Learning with Application to Clustering with Side-Information," Proceedings of the 16th Annual Conference on Neural Information Processing Systems (NIPS '02), pp. 505-512, 2002.

However, the optimal predictor is judged by how well a predictor performs on the data in a given cluster. Consequently, as shown in FIG. 6, even when a training cluster generates the predictor MB that is optimal for the test cluster CD, the predictor MD generated from the test cluster CD is not necessarily optimal for that training cluster. In predictor selection of this kind, the similarity between clusters is not symmetric.

Conventionally, however, a symmetric distance measure is used as the similarity between clusters, as in Non-Patent Document 1. The inter-cluster similarity therefore cannot be optimized for an asymmetric similarity relationship, and the optimal predictor for a test cluster may not be selected.

The present invention has been made to solve the above problems of the prior art, and its object is to provide a technique that makes it possible to select an optimal predictor for an asymmetric similarity relationship.

To this end, the present invention generates in advance, from training data, a transformation matrix that maximizes an asymmetric similarity, and uses the generated matrix to select a predictor for an input test cluster.

One aspect of the predictor selection device according to the present invention comprises: a cluster representation database that stores feature representations of training clusters; a transformation matrix database that stores a transformation matrix maximizing the asymmetric similarity between training clusters; similarity calculation means that transforms the feature representation of a test cluster with the transformation matrix stored in the transformation matrix database, calculates the similarity between the transformed feature representation of the test cluster and each feature representation stored in the cluster representation database, and identifies the training cluster with the maximum calculated similarity as the cluster that generates the optimal predictor for the test cluster; and predictor selection means that selects the predictor generated from the cluster identified by the similarity calculation means and outputs the selected predictor.

Another aspect of the predictor selection device according to the present invention comprises: a cluster representation database that stores feature representations of training clusters; an optimal information database that stores, as cluster pairs, each training cluster together with the training cluster that generates the optimal predictor for it; transformation matrix generation means that obtains the feature representations of the cluster pairs in the optimal information database from the cluster representation database and generates a transformation matrix that maximizes the asymmetric similarity of each cluster pair after the feature representation of one of its training clusters is transformed; similarity calculation means that transforms the feature representation of a test cluster with the transformation matrix generated by the transformation matrix generation means, calculates the similarity between the transformed feature representation of the test cluster and each feature representation stored in the cluster representation database, and identifies the training cluster with the maximum calculated similarity as the cluster that generates the optimal predictor for the test cluster; and predictor selection means that selects the predictor generated from the cluster identified by the similarity calculation means and outputs the selected predictor.

One aspect of the predictor selection method according to the present invention comprises: a similarity calculation step of transforming the feature representation of a test cluster with a transformation matrix that is stored in a transformation matrix database and maximizes the asymmetric similarity between training clusters, calculating the similarity between the transformed feature representation of the test cluster and each feature representation stored in a cluster representation database, and identifying the training cluster with the maximum similarity as the cluster that generates the optimal predictor for the test cluster; and a predictor selection step of selecting the predictor generated from the identified cluster and outputting the selected predictor.

Another aspect of the predictor selection method according to the present invention comprises: a transformation matrix generation step of obtaining, from a cluster representation database that stores feature representations of training clusters, the feature representations of the cluster pairs held in an optimal information database, which stores each training cluster together with the training cluster that generates the optimal predictor for it, and generating a transformation matrix that maximizes the asymmetric similarity of each cluster pair after the feature representation of one of its training clusters is transformed; a similarity calculation step of transforming the feature representation of a test cluster with the generated transformation matrix, calculating the similarity between the transformed feature representation of the test cluster and each feature representation stored in the cluster representation database, and identifying the training cluster with the maximum calculated similarity as the cluster that generates the optimal predictor for the test cluster; and a predictor selection step of selecting the predictor generated from the identified cluster and outputting the selected predictor.

The predictor may be selected from a predictor database that stores the predictors generated from the respective training clusters. The present invention may also take the form of a program that causes a computer to function as each of the above devices. This program can be provided via a network, a recording medium, or the like.

According to the present invention, an optimal predictor can be selected for an asymmetric similarity relationship.

FIG. 1 is a block diagram of a predictor selection device according to an embodiment of the present invention.
FIG. 2 is a block diagram of the transformation matrix generation device of the embodiment.
FIG. 3 is a flowchart showing the processing of the similarity calculation unit of the embodiment.
FIG. 4 is a first explanatory diagram showing the selection of predictors generated per cluster.
FIG. 5 is a second explanatory diagram of the same.
FIG. 6 is an explanatory diagram showing a state in which the similarity between clusters is asymmetric.

A predictor selection device according to an embodiment of the present invention will be described with reference to FIGS. 1 and 2. The selection device 1 includes a transformation matrix generation device 2 that generates a transformation matrix maximizing an asymmetric similarity. When predicting classes, it uses the transformation matrix generated by the transformation matrix generation device 2 to select the optimal predictor for an input test cluster.

Specifically, the selection device 1 is a computer with ordinary hardware resources such as a CPU, memory (RAM), and a hard disk drive. Through cooperation between these hardware resources and software resources (OS, applications, etc.), the selection device 1 implements, as shown in FIGS. 1 and 2, a cluster representation DB 10, an optimal predictor information DB 20, a transformation matrix generation unit 30 (corresponding to the transformation matrix generation device 2), a transformation matrix DB 40, a similarity calculation unit 50, a predictor DB 60, and a predictor selection unit 70.

Here, the DBs 10, 20, 40, and 60 are assumed to be built in a storage device such as memory (RAM) or a hard disk drive. The units 10 to 70 execute two stages: a transformation matrix generation stage, which stores a transformation matrix created in advance, and a predictor selection stage, which uses the stored transformation matrix to select the predictor that maximizes the similarity for an input test cluster.

Specifically, the cluster representation DB 10 stores the feature representations (feature vectors) of the training clusters, and the optimal predictor information DB 20 stores training-cluster pairs, that is, each training cluster paired with the cluster that generates the optimal predictor for it. In the transformation matrix generation stage, as shown in FIG. 2, the transformation matrix generation unit 30 obtains from the cluster representation DB 10 the feature representations of the cluster pairs listed in the DB 20, generates a transformation matrix that maximizes the asymmetric similarity of each pair after one of the two feature representations is transformed, and stores the generated matrix in the transformation matrix DB 40.

In the predictor selection stage, as shown in FIG. 1, the similarity calculation unit 50 transforms the feature representation of the input test cluster with the transformation matrix in the DB 40. It then calculates the similarity between the transformed feature representation and each feature representation stored in the DB 10, and identifies the training cluster with the maximum similarity as the cluster that generates the optimal predictor for the test cluster.

The predictor selection unit 70 selects a predictor from the DB 60, which stores the parameters of the predictor generated from each training cluster. That is, it refers to the DB 60, selects the predictor generated from the cluster identified by the similarity calculation unit 50, and outputs it. Each stage is described in detail below.

≪Transformation matrix generation stage≫
The transformation matrix generation unit 30 refers to the DBs 10 and 20 and receives their stored data as input. Table 1 shows an example data structure of the DB 10. Here c_1, c_2, ..., c_N are the cluster IDs of the training clusters, and each row gives the feature representation of one training cluster. In this example, the features of each training cluster are expressed as an M-dimensional feature vector (x_1, x_2, ..., x_M), and the value in row i, column j is the j-th feature value of training cluster c_i.

Table 2 shows an example data structure of the DB 20. The test cluster ID is the ID of a training cluster treated as a pseudo test cluster. That is, one training cluster is regarded as a test cluster, and among the remaining training clusters, the one whose predictor showed the best prediction performance on it is chosen as the optimal-predictor-generating cluster. The ID of the chosen training cluster is recorded for each cluster ID treated as a test cluster.
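The construction of the DB 20 described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_optimal_predictor_info`, the `evaluate` callback, and the toy performance scores are all assumptions introduced here.

```python
def build_optimal_predictor_info(cluster_ids, evaluate):
    """Map each pseudo test cluster to its optimal-predictor-generating cluster.

    evaluate(model_cluster, test_cluster) is a hypothetical callback that
    returns the performance of model_cluster's predictor on test_cluster.
    """
    table = {}
    for test_id in cluster_ids:
        # Only the *other* training clusters are candidates.
        candidates = [c for c in cluster_ids if c != test_id]
        table[test_id] = max(candidates, key=lambda c: evaluate(c, test_id))
    return table

# Toy scores (made up): perf[(model_cluster, test_cluster)]
perf = {
    ("A", "B"): 0.9, ("C", "B"): 0.6,
    ("B", "A"): 0.5, ("C", "A"): 0.7,
    ("A", "C"): 0.4, ("B", "C"): 0.8,
}
table = build_optimal_predictor_info(["A", "B", "C"], lambda m, t: perf[(m, t)])
```

In this toy data the best model for B comes from A, while the best model for A comes from C, which is the kind of asymmetric relationship the description is concerned with.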

On receiving the stored data of the DBs 10 and 20 (as in Tables 1 and 2), the transformation matrix generation unit 30 obtains from the DB 10 the feature representations of each cluster pair "test cluster (ID) : optimal-predictor-generating cluster (ID)" stored in the DB 20. It then generates a transformation matrix such that, after the test cluster's feature representation is transformed, the similarity of each cluster pair is maximized, and stores the generated matrix in the DB 40 (transformation matrix generation step).

This similarity is an asymmetric similarity as illustrated in FIG. 6. For example, the generalized Kullback-Leibler divergence (gkld) can be used, which is given by equation (1).
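Equation (1) is not reproduced in this text. For reference, the generalized Kullback-Leibler divergence between non-negative M-dimensional vectors p and q is commonly written as below; whether the patent's equation (1) uses exactly this form and argument order is an assumption.

```latex
D_{\mathrm{GKL}}(p \,\|\, q) = \sum_{j=1}^{M} \left( p_j \log \frac{p_j}{q_j} - p_j + q_j \right)
```

In general D(p || q) differs from D(q || p), which is what makes the measure asymmetric.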

Because p, q ≥ 0 is required, the feature representation x of the training cluster and the feature representation y of the test cluster (a training cluster treated as a test cluster) are mapped as p = exp(x) and q = exp(y), which allows arbitrary real values to be handled. The similarity of the test cluster's feature representation y to the training cluster's feature representation x can then be calculated according to equation (2).
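A minimal sketch of this mapping is shown below, assuming the standard definition of the generalized Kullback-Leibler divergence (the patent's own equations are not reproduced in this text). The function names and the sample vectors are illustrative only.

```python
import math

def gkld(p, q):
    """Generalized KL divergence D(p || q) for strictly positive vectors."""
    return sum(pj * math.log(pj / qj) - pj + qj for pj, qj in zip(p, q))

def gkld_from_features(x, y):
    """Map real-valued features through exp so that p, q are positive."""
    p = [math.exp(v) for v in x]
    q = [math.exp(v) for v in y]
    return gkld(p, q)

x = [0.5, -1.2, 2.0]   # made-up training-cluster features
y = [0.1, 0.3, -0.7]   # made-up test-cluster features
d_xy = gkld_from_features(x, y)
d_yx = gkld_from_features(y, x)
```

Swapping the arguments changes the value, so the measure can express that one cluster suits another without the reverse holding.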

When the feature representation of the test cluster in equation (2) is transformed using the transformation matrix W, that is, y → y^T w_i, the similarity after transformation can be calculated by equation (3). Taking the square of G(x, y) as the loss function, the estimate of w_i is obtained according to equation (4).

This estimate can be computed from the gradient of the loss function with a nonlinear optimization method such as steepest descent, Newton's method, or the BFGS (Broyden-Fletcher-Goldfarb-Shanno) method. Equations (1) to (4) are assumed to be defined in the program or the like.
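The estimation can be sketched with plain steepest descent as follows. Since equations (2) to (4) are not reproduced in this text, the sketch assumes that G(x, y) is the generalized Kullback-Leibler divergence between exp(x) and exp(W^T y) and that the loss is the sum of G squared over the cluster pairs; the function names, learning rate, step count, and data are all hypothetical.

```python
import math

def transform(W, y):
    # W[i] is the weight vector w_i; the i-th transformed feature is y^T w_i.
    return [sum(yj * wij for yj, wij in zip(y, w_i)) for w_i in W]

def divergence(x, t):
    # Generalized KL divergence between p = exp(x) and q = exp(t).
    return sum(math.exp(a) * (a - b) - math.exp(a) + math.exp(b)
               for a, b in zip(x, t))

def loss(W, pairs):
    # Squared divergence summed over all (x, y) cluster pairs.
    return sum(divergence(x, transform(W, y)) ** 2 for x, y in pairs)

def fit(pairs, dim, lr=0.5, steps=300):
    # Start from the identity so the initial transform leaves y unchanged.
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(dim)]
        for x, y in pairs:
            t = transform(W, y)
            g = divergence(x, t)
            for i in range(dim):
                # d(g^2)/d(w_i) = 2 g (exp(t_i) - exp(x_i)) y
                coeff = 2.0 * g * (math.exp(t[i]) - math.exp(x[i]))
                for j in range(dim):
                    grad[i][j] += coeff * y[j]
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * grad[i][j]
    return W

# Each pair: (x of the optimal-predictor-generating cluster, y of the pseudo test cluster)
pairs = [([0.2, 0.1], [0.1, 0.3]), ([-0.1, 0.4], [0.3, -0.2])]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_fit = fit(pairs, dim=2)
loss_before = loss(identity, pairs)
loss_after = loss(W_fit, pairs)
```

A fixed step size keeps the sketch short; Newton's method or BFGS, which the description also names, would typically converge in fewer iterations.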

≪Predictor selection stage≫
(1) Similarity calculation unit 50
The similarity calculation unit 50 receives as input the input test cluster and the stored data of the DBs 10 and 40. It transforms the feature representation of the input test cluster with the transformation matrix in the DB 40 and calculates the similarity to the training-cluster features in the DB 10. It searches the DB 10 for the training cluster with the maximum similarity and outputs that cluster's ID as the cluster that generates the optimal predictor.

The processing is described in detail with reference to FIG. 3. When a test cluster is supplied to an input unit (not shown), the similarity calculation unit 50 starts processing. It first initializes the maximum similarity s_max and the optimal cluster c_best held in memory (RAM) (S01) by setting s_max ← 0 and c_best ← NONE.

Next, it obtains the transformation matrix w_i from the DB 40 and applies the feature transformation y → y^T w_i to the feature representation of the input test cluster c_test (S02). After the transformation, it checks whether the stored data of the DB 10 contain a cluster c_k for which steps S05 onward have not yet been performed (S03).

If an unprocessed cluster c_k exists, the similarity s_k of the input test cluster c_test to the cluster c_k is calculated using equation (3) (S05). It is then checked whether s_k exceeds the current maximum s_max, that is, whether s_k > s_max holds (S06). If not, processing returns to S03; if so, it proceeds to S07. In S07, the maximum similarity s_max and the optimal cluster c_best in memory (RAM) are rewritten: s_max is updated to the similarity calculated in S05 (s_max ← s_k), and c_k is recorded as the cluster that generates the optimal predictor (c_best ← c_k).

After S07, processing returns to S03, and S05 to S07 are repeated until no unprocessed cluster c_k remains in the stored data of the DB 10. When none remains, the cluster ID of the optimal cluster c_best held in memory (RAM) is output to the predictor selection unit 70, and processing ends.
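The search loop of FIG. 3 can be sketched as below. Equation (3) is not reproduced in this text, so the sketch assumes a hypothetical similarity exp(-D), where D is the generalized Kullback-Leibler divergence; this choice keeps the similarity positive, which is compatible with initializing s_max to 0. The names `similarity`, `select_cluster`, and `cluster_db` are illustrative.

```python
import math

def similarity(x, y_transformed):
    """Hypothetical similarity exp(-GKLD(exp(x) || exp(y'))), in (0, 1]."""
    d = sum(math.exp(a) * (a - b) - math.exp(a) + math.exp(b)
            for a, b in zip(x, y_transformed))
    return math.exp(-d)

def select_cluster(test_features, W, cluster_db):
    s_max, c_best = 0.0, None                    # initialize (S01)
    # Feature transformation y -> y^T w_i for every w_i (S02)
    y_t = [sum(yj * wij for yj, wij in zip(test_features, w_i)) for w_i in W]
    for c_k, x_k in cluster_db.items():          # visit each unprocessed cluster (S03)
        s_k = similarity(x_k, y_t)               # similarity to cluster c_k
        if s_k > s_max:                          # new maximum found?
            s_max, c_best = s_k, c_k             # update s_max and c_best (S07)
    return c_best

cluster_db = {"c1": [0.0, 0.0], "c2": [1.0, -1.0]}   # toy stand-in for the DB 10
W = [[1.0, 0.0], [0.0, 1.0]]                         # identity as a placeholder matrix
best = select_cluster([0.9, -0.8], W, cluster_db)
```

With these toy values the transformed test features lie much closer to c2 than to c1, so c2 is returned.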

(2) Predictor selection unit 70
On receiving the cluster ID of the optimal cluster c_best from the similarity calculation unit 50, the predictor selection unit 70 refers to the DB 60 and outputs the optimal predictor to a monitor or the like through an output unit (not shown).

Table 3 shows an example data structure of the DB 60, which stores the parameters of the predictor generated from each training cluster. The example shows the parameter layout for a linear-model predictor, but the DB 60 is not limited to this and can store information on predictors generated with any learning algorithm. In the example of Table 3, c_1, c_2, ..., c_N are the cluster IDs of the training clusters, each row gives the weights for the feature representation of one training cluster, and the value in row i, column j is the weight for the j-th feature of training cluster c_i.

The predictor selection unit 70 then selects, from the stored data of the DB 60, the predictor for the cluster ID output by the similarity calculation unit 50, and outputs the selected predictor. The output predictor is used for prediction (classification or ranking) of information that can be expressed as high-dimensional vectors, such as text or images.
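For the linear-model case, the lookup and use of a selected predictor might look like the following sketch. The dictionary layout standing in for the DB 60, the weight values, and the function names are assumptions made for illustration.

```python
def select_predictor(cluster_id, predictor_db):
    """Return the stored parameters of the predictor generated from cluster_id."""
    return predictor_db[cluster_id]

def linear_score(weights, features):
    """Score one feature vector with a linear model (dot product)."""
    return sum(w * f for w, f in zip(weights, features))

# Toy stand-in for the DB 60: one weight row per training cluster.
predictor_db = {
    "c1": [0.5, -1.0, 0.0],
    "c2": [1.0, 2.0, -0.5],
}
weights = select_predictor("c2", predictor_db)
score = linear_score(weights, [1.0, 0.5, 2.0])
```

The resulting score could then feed a classification threshold or a ranking, depending on the prediction task.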

As described above, the selection device 1 generates, based on an asymmetric similarity measure (for example, the generalized Kullback-Leibler divergence), a transformation matrix that maximizes the similarity between a cluster and the cluster that generates the optimal predictor for it, and stores the matrix in the DB 40. Using this stored matrix, the similarity between the input test cluster and each training cluster in the DB 10 is calculated, and the cluster ID of the training cluster with the maximum similarity is output to the predictor selection unit 70. This improves the accuracy of predictor selection for the input test cluster and, in turn, the prediction accuracy for the data it contains.

The present invention is not limited to the above embodiment, and the device configuration and the like can be modified within the scope of the claims. For example, the transformation matrix generation device 2 (transformation matrix generation unit 30) need not be built into the predictor selection device 1 and may be a separate device. In that case, the DBs 20 and 40 may be shared by the devices 1 and 2.

≪Programs and the like≫
The present invention can also be configured as a predictor selection program that causes a computer to function as some or all of the units 10, 20, 30, 40, 50, 60, and 70 of the predictor selection device 1 (including the transformation matrix generation device 2). This program allows a computer to execute some or all of the processing of each stage.

The program can be provided over a network, for example via a website or e-mail. It can also be recorded on a recording medium such as a CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, MO, HDD, BD-ROM, BD-R, or BD-RE for storage and distribution. Such a recording medium is read with a recording medium drive, and since the program code itself implements the processing of the above embodiment, the recording medium also constitutes the present invention.

Description of symbols
1 ... Predictor selection device
2 ... Transformation matrix generation device
10 ... Cluster representation DB (cluster representation database)
20 ... Optimal predictor information DB (optimal information database)
30 ... Transformation matrix generation unit (transformation matrix generation means)
40 ... Transformation matrix DB (transformation matrix database)
50 ... Similarity calculation unit (similarity calculation means)
60 ... Predictor DB (predictor database)
70 ... Predictor selection unit (predictor selection means)

Claims

A predictor selection device that, in predicting classes of information representable as high-dimensional vectors, divides training data into training clusters, generates a predictor from each, and selects an appropriate predictor for an input test cluster, the device comprising:
a cluster representation database that stores feature representations of the training clusters;
a transformation matrix database that stores a transformation matrix maximizing the asymmetric similarity between training clusters;
similarity calculation means that transforms the feature representation of the test cluster with the transformation matrix stored in the transformation matrix database, calculates the similarity between the transformed feature representation of the test cluster and each feature representation stored in the cluster representation database, and identifies the training cluster with the maximum calculated similarity as the cluster that generates the optimal predictor for the test cluster; and
predictor selection means that selects the predictor generated from the cluster identified by the similarity calculation means and outputs the selected predictor.

A predictor selection device that, in predicting classes of information representable as high-dimensional vectors, divides training data into training clusters, generates a predictor from each, and selects an appropriate predictor for an input test cluster, the device comprising:
a cluster representation database that stores feature representations of the training clusters;
an optimal information database that stores, as cluster pairs, each training cluster together with the training cluster that generates the optimal predictor for it;
transformation matrix generation means that obtains the feature representations of the cluster pairs in the optimal information database from the cluster representation database and generates a transformation matrix that maximizes the asymmetric similarity of each cluster pair after the feature representation of one of its training clusters is transformed;
similarity calculation means that transforms the feature representation of the test cluster with the transformation matrix generated by the transformation matrix generation means, calculates the similarity between the transformed feature representation of the test cluster and each feature representation stored in the cluster representation database, and identifies the training cluster with the maximum calculated similarity as the cluster that generates the optimal predictor for the test cluster; and
predictor selection means that selects the predictor generated from the cluster identified by the similarity calculation means and outputs the selected predictor.

The predictor selection device according to claim 1, further comprising a predictor database that stores the predictors generated from the respective training clusters,
wherein the predictor selection means selects, from the predictor database, the predictor generated from the cluster identified by the similarity calculation means.

A predictor selection method executed by a device that, in predicting classes of information representable as high-dimensional vectors, divides training data into training clusters, generates a predictor from each, and selects an appropriate predictor for an input test cluster, the method comprising:
a similarity calculation step of transforming the feature representation of the test cluster with a transformation matrix that is stored in a transformation matrix database and maximizes the asymmetric similarity between training clusters, calculating the similarity between the transformed feature representation of the test cluster and each feature representation stored in a cluster representation database, and identifying the training cluster with the maximum similarity as the cluster that generates the optimal predictor for the test cluster; and
a predictor selection step of selecting the predictor generated from the identified cluster and outputting the selected predictor.

A predictor selection method executed by a device that, in predicting classes of information representable as high-dimensional vectors, divides training data into training clusters, generates a predictor from each, and selects an appropriate predictor for an input test cluster, the method comprising:
a transformation matrix generation step of obtaining, from a cluster representation database that stores feature representations of the training clusters, the feature representations of the cluster pairs held in an optimal information database, which stores each training cluster together with the training cluster that generates the optimal predictor for it, and generating a transformation matrix that maximizes the asymmetric similarity of each cluster pair after the feature representation of one of its training clusters is transformed;
a similarity calculation step of transforming the feature representation of the test cluster with the generated transformation matrix, calculating the similarity between the transformed feature representation of the test cluster and each feature representation stored in the cluster representation database, and identifying the training cluster with the maximum calculated similarity as the cluster that generates the optimal predictor for the test cluster; and
a predictor selection step of selecting the predictor generated from the identified cluster and outputting the selected predictor.

The predictor selection method according to claim 4 or 5, wherein the predictor selection step selects the predictor generated from the cluster identified in the similarity calculation step from a predictor database that stores the predictors generated from the respective training clusters.

A predictor selection program that causes a computer to function as each means of the predictor selection device according to any one of claims 1 to 3.