JP4188181B2

JP4188181B2 - Multivariate data selection device and multivariate data selection program

Info

Publication number: JP4188181B2
Application number: JP2003304944A
Authority: JP
Inventors: 力米森; 高志末永; 正巳原; 務松永
Original assignee: NTT Data Corp
Current assignee: NTT Data Corp
Priority date: 2003-08-28
Filing date: 2003-08-28
Publication date: 2008-11-26
Anticipated expiration: 2023-08-28
Also published as: JP2005078186A

Description

The present invention relates to a multivariate data selection device and a multivariate data selection program for selecting, from large-scale data, variables that reflect the structure of the data.

Conventionally, two methods have been proposed for variable selection aimed at extracting knowledge from large-scale data. The method of McCabe (1984) (FIG. 13) identifies important variables by taking the correlation between variables into account (see, for example, Non-Patent Document 1). The method of Krzanowski (1987) (FIG. 14) applies a dimension reduction method such as principal component analysis and then selects variables by quantifying how the sample configuration changes before and after variable selection (see, for example, Non-Patent Document 2).

FIG. 15 is a flowchart for explaining the operation of the Krzanowski method. First, the data X to be processed, stored in a file or the like, and the parameters (selected variable number q and principal component number r) are input from an input device (S10). Next, principal component analysis is applied to X to calculate principal component scores, which are set as Y (S12). Next, the similarity M^2 between Y and an internally generated Z is obtained, together with the corresponding variable combination (S14). Finally, the variables corresponding to the variable combination obtained in step S14 are output as the best selected variables (S16).

FIG. 16 is a flowchart for explaining the detailed operation of step S14 described above. First, all pCq variable combinations are obtained from the selected variable number q and the total variable number p (S20). For example, when q = 2 and the variables are x1, x2, and x3, the combinations (x1, x2), (x2, x3), and (x3, x1) are obtained. Next, one combination at a time is taken from the result of step S20, and principal component analysis is applied to the submatrix X~ formed by that combination to calculate r principal component scores in descending order of eigenvalue; these are set as Z (S22).

Next, the Procrustes criterion M^2 (see Note 1) is calculated from Y and Z, and its value is stored as the similarity, for example in an array, so that it is not overwritten (S24). Next, it is checked whether M^2 has been calculated for every variable combination obtained in step S20 (S26). If the similarity has not yet been calculated for all combinations, the process returns to step S22 and the above processing is repeated. Once the similarity has been calculated for all combinations, the value with the highest similarity among the M^2 values stored in step S24 is taken as the final similarity, and that similarity and the corresponding combination are output (S28).

Note 1: M^2 = tr(YY^t + ZZ^t − 2D)
D is a diagonal matrix whose diagonal elements are singular values. To match the dimensions of Y and Z, the principal component scores corresponding to the lower-ranked principal components are discarded from the q principal component scores that could otherwise be extracted.

Non-Patent Document 1: G. P. McCabe, "Principal Variables," Technometrics, vol. 26, pp. 137-144, 1984.
Non-Patent Document 2: W. J. Krzanowski, "Selection of variables to preserve multivariate data structure, using principal components," Applied Statistics, vol. 36, pp. 22-33, 1987.
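The Procrustes criterion of Note 1 can be illustrated with a short sketch. The text does not say which matrix the singular values in D come from; the code below assumes they are the singular values of Z^t Y, as in the usual Procrustes formulation. The function name `procrustes_m2` is hypothetical.

```python
import numpy as np


def procrustes_m2(Y: np.ndarray, Z: np.ndarray) -> float:
    """Compute M^2 = tr(YY^t + ZZ^t - 2D) for two n x r score matrices.

    Assumption (not stated in the text): D is the diagonal matrix of
    singular values of Z^t Y.
    """
    # Singular values only; the rotation matrices are not needed for M^2.
    singular_values = np.linalg.svd(Z.T @ Y, compute_uv=False)
    return float(
        np.trace(Y @ Y.T) + np.trace(Z @ Z.T) - 2.0 * singular_values.sum()
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(10, 2))
    # A rotated copy of Y should give M^2 close to zero.
    theta = 0.3
    Q = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta), np.cos(theta)]])
    print(procrustes_m2(Y, Y @ Q))
```

Under this assumption M^2 is zero when the two configurations differ only by a rotation, and it grows as they diverge.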

However, the conventional McCabe method focuses only on individual variables, so it does not select variables that are effective when the variables are viewed as a whole. The Krzanowski method does take the correlation of the data into account and focuses on the subspace spanned by the data. However, when this subspace is compared with a candidate group of variables, the number of dimensions must be reduced so that the dimensions of the two subspaces match, and information contained in the candidate group of variables is lost as a result.

Accordingly, an object of the present invention is to provide a multivariate data selection device and a multivariate data selection program capable of selecting variables that reflect the data structure of the variables as a whole.

To achieve the above object, the present invention comprises: input means for inputting processing target data together with a selected variable number and a principal component number as parameters of the processing target data; basis calculation means for obtaining eigenvalues and eigenvectors of the processing target data input by the input means using principal component analysis, taking as many eigenvectors as the principal component number in descending order of eigenvalue, and constructing basis vectors of the processing target data; orthonormal basis constructing means for extracting from the processing target data all variable combinations of the selected variable number, taking one combination at a time from those variable combinations, and constructing an orthonormal basis vector for each; similarity calculation means for sequentially calculating and storing, for each of the orthonormal basis vectors, the canonical angle between the basis vectors calculated by the basis calculation means and the orthonormal basis vector generated by the orthonormal basis constructing means; and output means for outputting, as the combination of variables to be selected, the variable combination constituting the orthonormal basis vector with the greatest similarity (smallest angle) among the canonical angles calculated by the similarity calculation means.

The present invention is also a multivariate data selection program for causing a computer to function as: input means for inputting processing target data together with a selected variable number and a principal component number as parameters of the processing target data; basis calculation means for obtaining eigenvalues and eigenvectors of the processing target data input by the input means using principal component analysis, taking as many eigenvectors as the principal component number in descending order of eigenvalue, and constructing basis vectors of the processing target data; orthonormal basis constructing means for extracting from the processing target data all variable combinations of the selected variable number, taking one combination at a time from those variable combinations, and constructing an orthonormal basis vector for each; similarity calculation means for sequentially calculating and storing, for each of the orthonormal basis vectors, the canonical angle between the basis vectors calculated by the basis calculation means and the orthonormal basis vector generated by the orthonormal basis constructing means; and output means for outputting, as the combination of variables to be selected, the variable combination constituting the orthonormal basis vector with the greatest similarity (smallest angle) among the canonical angles calculated by the similarity calculation means.

According to the invention of claim 1, the basis calculation means uses a predetermined basis derivation method to take, from the processing target data input by the input means, as many eigenvectors as the principal component number in descending order of eigenvalue and calculates the basis vectors of the data; the similarity calculation means obtains the similarity between the basis vectors and an internally generated orthonormal basis vector, together with the variable combination corresponding to that similarity; and the output means outputs the variables corresponding to that variable combination as the best selected variables. As a result, variables that reflect the data structure of the variables as a whole can be selected. In addition, by changing the method used to compute the basis vectors for the original data (for example, to a discriminant criterion), variables suited to the purpose can be selected, so extracting an effective variable set makes the results easier to interpret (for example, through scatter plots).

According to the invention of claim 2, in the similarity calculation means, the variable combination calculation means obtains all variable combinations from the selected variable number input as a parameter of the data and the total variable number; the orthonormal basis constructing means takes one combination at a time from those variable combinations and constructs an orthonormal basis; and the angle calculation means calculates the angle between the orthonormal basis and the basis for every variable combination and uses that value as the similarity. As a result, variables that reflect the data structure of the variables as a whole can be selected. In addition, by changing the method used to compute the basis vectors for the original data (for example, to a discriminant criterion), variables suited to the purpose can be selected, so extracting an effective variable set makes the results easier to interpret (for example, through scatter plots).

According to the invention of claim 3, the output means takes, among the angles calculated by the angle calculation means, the value with the highest similarity as the final similarity and outputs that similarity together with the corresponding combination. As a result, variables that reflect the data structure of the variables as a whole can be selected. In addition, by changing the method used to compute the basis vectors for the original data (for example, to a discriminant criterion), variables suited to the purpose can be selected, so extracting an effective variable set makes the results easier to interpret (for example, through scatter plots).

According to the invention of claim 4, a computer is caused to execute: a step of inputting the data to be processed; a step of using a predetermined basis derivation method to take as many eigenvectors as the principal component number in descending order of eigenvalue and calculating the basis vectors of the data; a step of obtaining the similarity between the basis vectors and an internally generated orthonormal basis vector; a step of obtaining the variable combination corresponding to that similarity; and a step of outputting the variables corresponding to that variable combination as the best selected variables. As a result, variables that reflect the data structure of the variables as a whole can be selected. In addition, by changing the method used to compute the basis vectors for the original data (for example, to a discriminant criterion), variables suited to the purpose can be selected, so extracting an effective variable set makes the results easier to interpret (for example, through scatter plots).

Embodiments of the present invention are described below with reference to the drawings.

A. Configuration of the Embodiment
FIG. 1 is a block diagram showing the configuration of a multivariate data selection device according to an embodiment of the present invention. In the figure, the input unit 110 inputs the data X to be processed (an n × p matrix), the selected variable number q, and the principal component number r. The calculation unit 120 performs the various processes and selects the optimal variables for the configured basis. The output unit 130 outputs the processed data matrix.

B. Principle of the Multivariate Data Selection Method According to the Present Invention
FIG. 2 is a conceptual diagram for explaining the principle of the multivariate data selection method of the present invention.
The similarity between the subspace (eigenvectors) Y^λ of the data X before variable selection and the subspace of the orthonormal basis X_ of a variable selection candidate X~ is quantified numerically by canonical angles, and the combination of variables with the greatest similarity is selected as the variables that are effective when the variables are viewed as a whole. This prevents loss of the information held in the data and allows variables that better reflect the data structure to be selected. Details are given below in the description of the operation of the embodiment.

C. Operation of the Embodiment
Next, the operation of the embodiment described above is explained. FIG. 3 is a flowchart for explaining the operation of the multivariate data selection device according to the present embodiment, and FIG. 4 is a conceptual diagram for explaining the same operation. First, the data X to be processed, stored in a file or the like, and the parameters (selected variable number q and principal component number r) are input from the input unit 110 (S40). An example of the input data is shown in Equation 1.

Next, to derive a basis for the data X, a basis derivation method such as principal component analysis or discriminant analysis is used to take r eigenvectors in descending order of eigenvalue, and these form the basis Y^λ (S42). An example of the basis is shown in Equation 2.
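As a minimal sketch of step S42 for the principal component analysis case, the basis can be taken as the top r eigenvectors of the sample covariance matrix of X. The function name `pca_basis` and the choice of the covariance matrix (rather than, say, the correlation matrix) are assumptions made for illustration; the text does not specify them.

```python
import numpy as np


def pca_basis(X: np.ndarray, r: int) -> np.ndarray:
    """Return a p x r matrix whose columns are the r eigenvectors of the
    sample covariance of X (n x p) with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = (Xc.T @ Xc) / (X.shape[0] - 1)     # p x p sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:r]    # indices of the r largest
    return eigvecs[:, order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))
    Y = pca_basis(X, r=2)
    print(Y.shape)  # (5, 2)
```

The columns returned by `np.linalg.eigh` are orthonormal, so the resulting basis can be compared directly with another orthonormal basis.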

Next, the similarity λmax between Y^λ and the orthonormal basis X_ constructed from a selected variable combination X~ is obtained, together with the corresponding variable combination X^λ (S44). Then the variables corresponding to the variable combination X^λ_ with the largest λmax are output as the best selected variables (S46).

FIG. 5 is a flowchart for explaining the detailed operation of steps S44 and S46 described above, and FIG. 6 is a conceptual diagram for explaining the same operation. First, all pCq variable combinations are obtained from the selected variable number q and the total variable number p (S50, see FIG. 6(a)). For example, when q = 2 and the variables are a, b, and c, the combinations (a, b), (b, c), and (c, a) are obtained. Next, one combination at a time is taken from the result of step S50, and an orthonormal basis X_ is constructed from it (S52, see FIG. 6(b)).
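Steps S50 and S52 can be sketched as follows. The sketch assumes, following the wording of the claims, that the orthonormal basis for a combination is the p × q matrix with a 1 where the row variable equals the column variable and 0 elsewhere. The function name `selection_basis` and the zero-based variable indices are illustrative choices, not taken from the text.

```python
from itertools import combinations

import numpy as np


def selection_basis(p: int, combo: tuple) -> np.ndarray:
    """Build the p x q matrix whose columns are the unit vectors of the
    variables in `combo` (assumed form of the orthonormal basis X_)."""
    E = np.zeros((p, len(combo)))
    for col, var in enumerate(combo):
        E[var, col] = 1.0
    return E


if __name__ == "__main__":
    p, q = 3, 2
    combos = list(combinations(range(p), q))  # all pCq combinations
    print(combos)  # [(0, 1), (0, 2), (1, 2)]
    print(selection_basis(p, combos[1]))
```

With variables a, b, c indexed 0, 1, 2, the three index pairs are the same three combinations as in the example above, only listed in a different order.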

Next, the largest eigenvalue λmax of the angle λ between the subspaces is calculated from the orthonormal basis X_ and the basis Y^λ obtained in step S42, and its value is stored as the similarity, for example in an array, so that it is not overwritten (S54, see FIG. 6(c)). Next, it is checked whether λmax has been calculated for every variable combination obtained in step S50 (S56). If the similarity has not yet been calculated for all combinations, the process returns to step S52 and the above processing is repeated. Once the similarity has been calculated for all combinations, the value with the highest similarity among the λmax values stored in step S54 is taken as the final similarity, and that similarity and the corresponding combination are output (S58, see FIG. 6(d) and (e)).
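The loop of steps S50 to S58 can be sketched as below. The text does not define λmax precisely; this sketch assumes it is the largest eigenvalue of X_^t Y Y^t X_, which equals the squared cosine of the smallest canonical angle between the two subspaces when both bases are orthonormal. It also assumes the selection-matrix form of X_ described in the claims. The names `lambda_max` and `select_variables` are hypothetical.

```python
from itertools import combinations

import numpy as np


def lambda_max(Y: np.ndarray, E: np.ndarray) -> float:
    """Squared cosine of the smallest canonical angle between the column
    spaces of Y (p x r) and E (p x q), both with orthonormal columns."""
    # Singular values of Y^t E are the cosines of the canonical angles,
    # returned in descending order.
    cosines = np.linalg.svd(Y.T @ E, compute_uv=False)
    return float(cosines[0] ** 2)


def select_variables(Y: np.ndarray, q: int):
    """Score every q-variable combination and return the best one."""
    p = Y.shape[0]
    scores = {}
    for combo in combinations(range(p), q):       # S50
        E = np.zeros((p, q))
        E[list(combo), np.arange(q)] = 1.0        # S52
        scores[combo] = lambda_max(Y, E)          # S54, kept per combination
    best = max(scores, key=scores.get)            # S58
    return best, scores[best]


if __name__ == "__main__":
    y = np.array([[0.1], [0.7], [0.1], [0.7]])
    Y = y / np.linalg.norm(y)
    print(select_variables(Y, q=2))
```

Keeping every score in a dictionary mirrors the requirement that stored similarities are not overwritten before the final comparison.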

D. Specific Example
Next, a specific example of multivariate data selection performed with the multivariate data selection device according to the present embodiment is described. This example uses data obtained by measuring physical characteristics of winged ants. The data consist of 40 samples and 19 variables. FIG. 7 is a table showing the variables selected by the multivariate data selection method according to the present embodiment. In the figure, the variables marked with ○ are the selected ones. Variable No. 18 is categorical, so specimens with a fold are coded as "1" and those without as "0". The number of variables to select, q, was set to 4, and the number of principal components to use, r, was set to 2.
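For a sense of scale, the number of candidate combinations in this example follows directly from p = 19 and q = 4, assuming every combination is enumerated exhaustively as in step S50:

```python
import math

p, q = 19, 4
n_combinations = math.comb(p, q)  # pCq = 19! / (4! * 15!)
print(n_combinations)  # 3876
```

Each of these combinations requires one similarity calculation, so the search is small here but grows combinatorially with p and q.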

Here, the multivariate data selection method of the present invention and the conventional methods are applied to the data shown in FIG. 7. The effect of variable selection is compared, in terms of how well all the original features are approximated, through scatter plots of the first principal component (PC1 in FIG. 7) and the second principal component (PC2 in FIG. 7) computed from the variables selected by each method. In FIGS. 8, 9, 11, and 12 described below, the first principal component likewise corresponds to "PC1" and the second to "PC2".

FIG. 8 is a scatter plot using all variables. FIG. 9 is a scatter plot using the most important of the variables selected by the multivariate data selection method according to the present embodiment. As FIG. 9 shows, the cluster structure is preserved and the relative positions do not differ greatly, even though not all variables are used.

FIG. 10 is a table showing the variables selected by the conventional methods. In the figure, the variables marked with ○ are the selected ones. FIG. 11 is a scatter plot using the variables selected by the conventional method of McCabe (1984), and FIG. 12 is a scatter plot using the variables selected by the conventional method of Krzanowski (1987). As FIG. 11 shows, the cluster structure breaks down with the McCabe (1984) method. As FIG. 12 shows, with the Krzanowski (1987) method the distances in the distribution do not sufficiently reflect the original information.

In the embodiment described above, the functions of the calculation unit 120 are realized by executing a program stored in a storage unit (not shown). The storage unit consists of a hard disk device, a magneto-optical disk device, a nonvolatile memory such as a flash memory, a volatile memory such as RAM (Random Access Memory), or a combination of these. The storage unit also includes anything that holds the program for a fixed period of time, such as the volatile memory (RAM) inside a computer system acting as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program means a medium capable of transmitting information, such as a network like the Internet or a communication line like a telephone line. The program may implement only part of the processing described above. It may also be a so-called difference file (difference program), which realizes the processing described above in combination with a program already recorded in the calculation unit 120.

The embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment and includes designs and the like that do not depart from the gist of the invention.

FIG. 1 is a block diagram showing the configuration of the multivariate data selection device according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram for explaining the principle of the multivariate data selection method of the present invention.
FIG. 3 is a flowchart for explaining the operation of the multivariate data selection device according to the present embodiment.
FIG. 4 is a conceptual diagram for explaining the operation of the multivariate data selection device according to the present embodiment.
FIG. 5 is a flowchart for explaining part of the operation of the present embodiment in detail.
FIG. 6 is a conceptual diagram for explaining part of the operation of the present embodiment in detail.
FIG. 7 is a table showing the variables selected by the multivariate data selection method according to the present embodiment.
FIG. 8 is a scatter plot using all variables.
FIG. 9 is a scatter plot using the most important of the variables selected by the multivariate data selection method according to the present embodiment.
FIG. 10 is a table showing the variables selected by the conventional methods.
FIG. 11 is a scatter plot using the variables selected by the conventional method (McCabe (1984)).
FIG. 12 is a scatter plot using the variables selected by the conventional method (Krzanowski (1987)).
FIG. 13 is a conceptual diagram showing the principle of the conventional method (McCabe (1984)).
FIG. 14 is a conceptual diagram showing the principle of the conventional method (Krzanowski (1987)).
FIG. 15 is a flowchart for explaining the operation of the Krzanowski method.
FIG. 16 is a flowchart for explaining part of the operation of the Krzanowski method in detail.

Explanation of symbols

110 Input unit (input means)
120 Calculation unit (basis calculation means, similarity calculation means, variable combination calculation means, orthonormal basis constructing means, angle calculation means)
130 Output unit (output means)

Claims

A multivariate data selection device comprising an input unit, a calculation unit, and an output unit, wherein
the input unit comprises
input means for inputting processing target data consisting of an n × p matrix (n samples described by p variables), together with a selected variable number q and a selected principal component number r as parameters of the processing target data;
the calculation unit comprises
basis calculation means for applying principal component analysis to the processing target data input by the input means to obtain p eigenvalues and the p × p matrix of eigenvectors corresponding to those eigenvalues, and for calculating, as a basis vector (p × r matrix) of the processing target data, the eigenvectors corresponding to the r eigenvalues taken in descending order of eigenvalue, r being the selected principal component number;
orthonormal basis constructing means for extracting all (pCq) variable combinations from the total variable number p and the selected variable number q of the processing target data input by the input means, and for constructing, as pCq orthonormal basis vectors, one p × q matrix per variable combination, in which the row direction indicates the p variable types from which the variable combination is extracted, the column direction indicates the q variable types extracted by the variable combination, elements whose row and column variable types are the same are 1, and the other elements are 0; and
canonical angle calculation means for sequentially calculating and storing the canonical angle between the basis vector calculated by the basis calculation means and one orthonormal basis vector selected in turn from the pCq orthonormal basis vectors constructed by the orthonormal basis constructing means; and
the output unit comprises
output means for outputting the variable combination constituting the orthonormal basis vector with the smallest angle among the canonical angles calculated and stored by the canonical angle calculation means.

A program for causing a computer comprising an input unit, a calculation unit, and an output unit to function such that
the input unit of the computer functions as
input means for inputting processing target data consisting of an n × p matrix (n samples described by p variables), together with a selected variable number q and a selected principal component number r as parameters of the processing target data;
the calculation unit of the computer functions as
basis calculation means for applying principal component analysis to the processing target data input by the input means to obtain p eigenvalues and the p × p matrix of eigenvectors corresponding to those eigenvalues, and for calculating, as a basis vector (p × r matrix) of the processing target data, the eigenvectors corresponding to the r eigenvalues taken in descending order of eigenvalue, r being the selected principal component number;
orthonormal basis constructing means for extracting all (pCq) variable combinations from the total variable number p and the selected variable number q of the processing target data input by the input means, and for constructing, as pCq orthonormal basis vectors, one p × q matrix per variable combination, in which the row direction indicates the p variable types from which the variable combination is extracted, the column direction indicates the q variable types extracted by the variable combination, elements whose row and column variable types are the same are 1, and the other elements are 0; and
canonical angle calculation means for sequentially calculating and storing the canonical angle between the basis vector calculated by the basis calculation means and one orthonormal basis vector selected in turn from the pCq orthonormal basis vectors constructed by the orthonormal basis constructing means; and
the output unit of the computer functions as
output means for outputting the variable combination constituting the orthonormal basis vector with the smallest angle among the canonical angles calculated and stored by the canonical angle calculation means.