JP2021039580A

JP2021039580A - Information processing apparatus, information processing method, and program

Info

Publication number: JP2021039580A
Application number: JP2019161031A
Authority: JP
Inventors: Kazunori Matsumoto (松本 一則); Keiichiro Hoashi (帆足 啓一郎)
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2019-09-04
Filing date: 2019-09-04
Publication date: 2021-03-11
Anticipated expiration: 2039-09-04
Also published as: JP7096218B2

Abstract

PROBLEM TO BE SOLVED: To put label propagation into practical use in semi-supervised learning using multidimensional Delaunay division.
SOLUTION: A data acquisition unit 30 acquires a labeled data group, in which each data item carries a label indicating the class to which it belongs, and an unlabeled data group, for which the class of each data item is unknown. A matrix calculation unit 31 calculates a metric matrix formed by arranging metric vectors whose elements are metrics indicating the similarity between data items. A matrix compression unit 32 generates a compressed metric matrix formed by arranging compressed metric vectors obtained by reducing the dimension of each metric vector in the metric matrix. A division unit 33 performs multidimensional Delaunay division on the points obtained by mapping each compressed metric vector into a multidimensional space. An adjacency matrix acquisition unit 34 acquires the connection relationship of the points after the Delaunay division as an adjacency matrix. A label propagation unit 35 propagates labels to each data item of the unlabeled data group based on the adjacency matrix and the labels of the labeled data group.
SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing apparatus, an information processing method, and a program, and more particularly to a technique in the field of machine learning known as semi-supervised learning.

For example, Non-Patent Document 1 discloses a method that obtains a metric of the data manifold by expressing the similarity between data as a Gaussian kernel, and then propagates labels from labeled data to unlabeled data.

Non-Patent Document 2 concerns semi-supervised learning methods that use unlabeled data for learning based on the adjacency relationships among data. With the aim of representing those adjacency relationships appropriately, it discloses a technique that extends Delaunay triangulation to multiple dimensions and extracts the adjacency of the data, thereby incorporating global properties.

Non-Patent Document 1: Zhou, D., Bousquet, O., Lal, N. T., Weston, J., Scholkopf, B., "Learning with local and global consistency," Proceedings of Advances in Neural Information Processing Systems, 2004.
Non-Patent Document 2: Kazunori Matsumoto, Keiichiro Hoashi, Kazushi Ikeda, "A study of label-propagation semi-supervised learning using Delaunay division" (ドロネー分割を用いたラベル伝搬型半教師あり学習の検討), IEICE Technical Report (信学技報) 116(209), 211-214, 2016-09-05.

The technique disclosed in Non-Patent Document 2 uses a multidimensional version of Delaunay division to obtain a graph structure that also takes global data into account. However, when a metric such as the one in Non-Patent Document 1 is used as the similarity between data, the number of dimensions of the metric equals the total number of labeled and unlabeled data items and therefore becomes very large. As a result, the amount of computation required for the multidimensional Delaunay division becomes enormous, and the computation is difficult to complete in a realistic time.

The present invention has been made in view of these points, and its object is to provide a technique for putting label propagation into practical use in semi-supervised learning using multidimensional Delaunay division.

A first aspect of the present invention is an information processing apparatus. The apparatus includes: a data acquisition unit that acquires a labeled data group, to which labels indicating the classes to which the data belong are attached, and an unlabeled data group, for which the classes are unknown; a matrix calculation unit that calculates a metric matrix formed by arranging, for each data item, a metric vector whose elements are metrics indicating the similarity between one data item among the data constituting the labeled data group and the unlabeled data group and the other data items, including that one data item itself; a matrix compression unit that generates a compressed metric matrix formed by arranging compressed metric vectors obtained by compressing the dimension of each metric vector constituting the metric matrix; a division unit that maps a plurality of points, whose coordinates are the elements of the respective compressed metric vectors, into a multidimensional space of the same dimension as the compressed metric vectors and performs multidimensional Delaunay division on the plurality of points; an adjacency matrix acquisition unit that acquires the connection relationship of the points after the Delaunay division as an adjacency matrix; and a label propagation unit that propagates labels to each data item constituting the unlabeled data group based on the adjacency matrix and the labels of the labeled data group.

The matrix calculation unit may calculate the metric using a function having semidefiniteness.

The matrix calculation unit may calculate the metric using a Gaussian kernel.

The matrix compression unit may generate the compressed metric matrix by replacing with 0 those elements of the metric matrix that are less than a predetermined threshold and then applying a matrix factorization based on the resulting sparse matrix.
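The thresholding step described above can be sketched as follows. This is a minimal illustration only: the function name `threshold_to_sparse`, the dict-per-row storage, and the sample values are assumptions made here, and the subsequent sparse matrix factorization is not shown.

```python
def threshold_to_sparse(G, threshold):
    """Drop entries below `threshold` (treating them as 0) and keep the rest.

    Returns one dict per row mapping column index -> value, a simple
    stand-in for a sparse matrix format.
    """
    return [
        {j: v for j, v in enumerate(row) if v >= threshold}
        for row in G
    ]


# Hypothetical 3x3 metric matrix used only for illustration.
G = [
    [1.00, 0.80, 0.02],
    [0.80, 1.00, 0.30],
    [0.02, 0.30, 1.00],
]
sparse_G = threshold_to_sparse(G, threshold=0.1)
```

Entries that are not stored are implicitly 0, so only the surviving similarities need to be passed to whatever factorization routine is used afterwards.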

The apparatus may further include a vector selection unit that selects, in order, the metric vectors constituting the metric matrix. The matrix compression unit may extract, from the elements of the metric vector selected by the vector selection unit, the elements that are equal to or greater than a predetermined threshold, extract from the other metric vectors constituting the metric matrix the elements corresponding to those extracted elements, and generate a matrix composed of the extracted elements as the compressed metric matrix. The adjacency matrix acquisition unit may determine the elements of the adjacency matrix corresponding to the metric vector selected by the vector selection unit by identifying, based on the compressed metric matrix, the connection relationship between the point corresponding to that metric vector and the other points.
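One possible reading of this per-vector compression is sketched below: the columns in which the selected metric vector meets the threshold are kept for every row. The function name `compress_by_selected_vector` and the sample matrix are hypothetical, and the text does not specify the implementation at this level of detail.

```python
def compress_by_selected_vector(G, i, threshold):
    """Keep only the columns where the selected metric vector G[i] is >= threshold.

    Returns the compressed matrix and the list of retained column indices.
    """
    keep = [j for j, v in enumerate(G[i]) if v >= threshold]
    C = [[row[j] for j in keep] for row in G]
    return C, keep


# Hypothetical 3x3 metric matrix used only for illustration.
G = [
    [1.00, 0.80, 0.02],
    [0.80, 1.00, 0.30],
    [0.02, 0.30, 1.00],
]
C, keep = compress_by_selected_vector(G, i=0, threshold=0.1)
```

Here the selected vector is row 0, so the third column (similarity 0.02) is dropped and every row shrinks from three elements to two.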

A second aspect of the present invention is an information processing method. In this method, a processor executes the steps of: acquiring a labeled data group to which labels indicating the classes to which the data belong are attached; acquiring an unlabeled data group for which the classes are unknown; calculating a metric matrix formed by arranging, for each data item, a metric vector whose elements are metrics indicating the similarity between one data item among the data constituting the labeled data group and the unlabeled data group and the other data items, including that one data item itself; generating a compressed metric matrix formed by arranging compressed metric vectors obtained by compressing the dimension of each metric vector constituting the metric matrix; mapping a plurality of points, whose coordinates are the elements of the respective compressed metric vectors, into a multidimensional space of the same dimension as the compressed metric vectors and performing multidimensional Delaunay division on the plurality of points; acquiring the connection relationship of the points after the Delaunay division as an adjacency matrix; and propagating labels to each data item constituting the unlabeled data group based on the adjacency matrix and the labels of the labeled data group.

A third aspect of the present invention is a program. The program causes a computer to realize the functions of: acquiring a labeled data group to which labels indicating the classes to which the data belong are attached; acquiring an unlabeled data group for which the classes are unknown; calculating a metric matrix formed by arranging, for each data item, a metric vector whose elements are metrics indicating the similarity between one data item among the data constituting the labeled data group and the unlabeled data group and the other data items, including that one data item itself; generating a compressed metric matrix formed by arranging compressed metric vectors obtained by compressing the dimension of each metric vector constituting the metric matrix; mapping a plurality of points, whose coordinates are the elements of the respective compressed metric vectors, into a multidimensional space of the same dimension as the compressed metric vectors and performing multidimensional Delaunay division on the plurality of points; acquiring the connection relationship of the points after the Delaunay division as an adjacency matrix; and propagating labels to each data item constituting the unlabeled data group based on the adjacency matrix and the labels of the labeled data group.

To provide this program, or to update part of it, a computer-readable recording medium on which the program is recorded may be provided, or the program may be transmitted over a communication line.

Any combination of the above components, and any conversion of the expression of the present invention among a method, an apparatus, a system, a computer program, a data structure, a recording medium, and the like, is also effective as an aspect of the present invention.

According to the present invention, label propagation in semi-supervised learning using multidimensional Delaunay division can be put into practical use.

A diagram schematically showing the functional configuration of the information processing apparatus according to the embodiment.
A diagram for explaining the Delaunay division executed by the division unit according to the embodiment.
A diagram schematically showing an example of the adjacency matrix acquired by the adjacency matrix acquisition unit according to the embodiment.
A diagram schematically showing an example of the metric matrix calculated by the matrix calculation unit according to the embodiment.
A diagram for explaining the matrix compression executed by the matrix compression unit according to the embodiment.
A flowchart for explaining the flow of information processing executed by the information processing apparatus according to the embodiment.

<Outline of the embodiment>
The information processing method according to the embodiment relates to a technique for propagating the labels attached to labeled training data to each data item of unlabeled training data, based on the adjacency relationship between the labeled training data, which carries labels indicating the classes to which the data belong, and the unlabeled training data, for which the classes are unknown.

The information processing apparatus according to the embodiment first generates a metric matrix indicating the similarity relationships among the data constituting the labeled training data and the unlabeled training data. It then compresses the generated metric matrix to produce a smaller compressed metric matrix. Next, based on the elements of the compressed metric matrix, the apparatus maps a point corresponding to each training data item onto an N-dimensional space, whose dimension equals the number N of data items in the labeled and unlabeled training data, and executes N-dimensional Delaunay triangulation on the mapped point cloud. Finally, based on the connection relationship of the points after the Delaunay triangulation, the apparatus propagates the labels attached to the labeled training data to each data item of the unlabeled training data.
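The final propagation step can be illustrated with a small sketch. It assumes the iterative scheme of Non-Patent Document 1 (repeatedly mixing each point's neighbours' scores with its initial label), which the text cites as background; whether the embodiment uses exactly this update is not stated here. The function name `propagate_labels` and the parameters `alpha` and `n_iter` are hypothetical.

```python
import math


def propagate_labels(A, labels, n_classes, alpha=0.9, n_iter=200):
    """Propagate labels over a graph.

    A      : symmetric 0/1 adjacency matrix (list of lists)
    labels : class index for labeled points, None for unlabeled points
    Returns the predicted class index for every point.
    """
    n = len(A)
    deg = [sum(row) for row in A]
    # Symmetrically normalised adjacency S = D^{-1/2} A D^{-1/2}.
    S = [[A[i][j] / math.sqrt(deg[i] * deg[j]) if A[i][j] else 0.0
          for j in range(n)] for i in range(n)]
    # One-hot initial label matrix Y (all-zero rows for unlabeled points).
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(n_iter):
        # F <- alpha * S F + (1 - alpha) * Y
        F = [[alpha * sum(S[i][k] * F[k][c] for k in range(n))
              + (1 - alpha) * Y[i][c]
              for c in range(n_classes)] for i in range(n)]
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]


# Six points connected in a chain; only the two end points are labeled.
A = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
predicted = propagate_labels(A, [0, None, None, None, None, 1], n_classes=2)
```

In this chain example each unlabeled point ends up with the label of the nearer labeled end point, which is the behaviour the adjacency-based propagation is meant to produce.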

[Delaunay triangulation]
Here, "Delaunay triangulation" is a method of dividing a two-dimensional plane, without gaps and without overlaps, into triangles whose vertices are points distributed discretely on the plane. The triangles obtained by Delaunay triangulation have the following property: the interior of the circumscribed circle of any triangle contains no point that is a vertex of another triangle.
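The empty-circumcircle property can be checked directly for a small two-dimensional example. The sketch below is illustrative only: the helper names `circumcircle` and `satisfies_delaunay`, the tolerance `eps`, and the four sample points are assumptions made here.

```python
import math


def circumcircle(a, b, c):
    """Return (center, radius) of the circle through the 2-D points a, b, c."""
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    sa, sb, sc = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    return (ux, uy), math.dist((ux, uy), a)


def satisfies_delaunay(points, triangles, eps=1e-9):
    """True if no point lies strictly inside the circumcircle of any triangle."""
    for tri in triangles:
        center, radius = circumcircle(*(points[i] for i in tri))
        for k, p in enumerate(points):
            if k not in tri and math.dist(center, p) < radius - eps:
                return False
    return True


# Four points forming a quadrilateral; it can be split along either diagonal.
points = [(0, 0), (4, 0), (2, 3), (2, -1)]
split_a = [(0, 3, 2), (3, 1, 2)]   # diagonal between points 2 and 3
split_b = [(0, 1, 2), (0, 1, 3)]   # diagonal between points 0 and 1
is_delaunay_a = satisfies_delaunay(points, split_a)
is_delaunay_b = satisfies_delaunay(points, split_b)
```

Only one of the two ways of splitting the quadrilateral satisfies the property, which is what makes the Delaunay triangulation of points in general position unique.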

It is known that Delaunay triangulation can be extended to a space division method for point clouds in multidimensional spaces of three or more dimensions. In the extended Delaunay triangulation, the multidimensional space is divided into simplices whose vertices are points distributed discretely in that space.

For example, since a simplex in three-dimensional space is a tetrahedron, Delaunay triangulation in three-dimensional space divides the space into tetrahedra whose vertices are points distributed discretely in the space. After Delaunay triangulation in three-dimensional space, the interior of the circumscribed sphere of any tetrahedron contains no point that is a vertex of another tetrahedron.

Similarly, since a simplex in four-dimensional space is a 5-cell, Delaunay triangulation in four-dimensional space divides the space into 5-cells whose vertices are points distributed discretely in the space. After Delaunay triangulation in four-dimensional space, the interior of the circumscribed hypersphere of any 5-cell contains no point that is a vertex of another 5-cell.

Note that a "hyperplane" (bounding face) of a tetrahedron is a triangle, and a hyperplane of a 5-cell is a tetrahedron. In general, each hyperplane bounding an N-dimensional simplex is an (N-1)-dimensional simplex.
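The relationship between a simplex and its bounding faces can be made concrete by listing vertex subsets. In this sketch a simplex is represented simply as a tuple of vertex indices, and the function name `facets` is an assumption introduced here.

```python
from itertools import combinations


def facets(simplex):
    """Return the bounding faces of a simplex given as a tuple of vertices.

    An N-dimensional simplex has N + 1 vertices; each bounding face is the
    (N-1)-dimensional simplex obtained by leaving out one vertex.
    """
    return list(combinations(simplex, len(simplex) - 1))


tetrahedron = (0, 1, 2, 3)        # 3-dimensional simplex
five_cell = (0, 1, 2, 3, 4)       # 4-dimensional simplex
triangle_faces = facets(tetrahedron)
tetrahedral_faces = facets(five_cell)
```

The tetrahedron yields four triangles and the 5-cell yields five tetrahedra, matching the description above.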

Thus, strictly speaking, Delaunay triangulation of a point cloud in a multidimensional space of three or more dimensions is a "simplex division". In this specification, for convenience, division of a multidimensional space of two or more dimensions is simply called "Delaunay division", and a simplex of two or more dimensions obtained by Delaunay division is simply called a "simplex". For any simplex obtained by Delaunay division, the interior of its circumscribed hypersphere contains no point that is a vertex of another simplex. This is a global property that holds over the entire space in which the known data are distributed.

In general, the computational cost of Delaunay division in N-dimensional space is on the order of N cubed. The number of data items used for machine learning (that is, the number of dimensions N of the N-dimensional space) can be on the order of tens of thousands to millions, in which case processing in a realistic time becomes difficult. The information processing apparatus according to the embodiment therefore reduces the amount of computation by compressing the metric matrix that indicates the similarity relationships among the data constituting the labeled training data and the unlabeled training data. As an example, the apparatus compresses the size of the metric matrix to one hundredth. This allows Delaunay division in the multidimensional space to be executed in a realistic time.
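As a worked illustration of the effect of this compression, take the cubic cost estimate stated above at face value and write the cost as T(N) = kN^3 for some constant k (the symbols T and k are introduced here only for illustration). Compressing the dimension to m = N/100 then gives:

```latex
\frac{T(m)}{T(N)} = \frac{k\,(N/100)^{3}}{k\,N^{3}} = \frac{1}{100^{3}} = 10^{-6}
```

Under that assumption, a hundredfold reduction in dimension corresponds to roughly a millionfold reduction in the estimated cost of the Delaunay division.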

<Functional configuration of the information processing apparatus 1 according to the embodiment>
FIG. 1 is a diagram schematically showing the functional configuration of the information processing apparatus 1 according to the embodiment. The information processing apparatus 1 includes a storage unit 2 and a control unit 3. In FIG. 1, the arrows indicate the main data flows, and there may be data flows that are not shown. Each functional block in FIG. 1 represents a functional unit, not a hardware (device) unit. The functional blocks shown in FIG. 1 may therefore be implemented in a single device or distributed across a plurality of devices. Data may be exchanged between functional blocks via any means, such as a data bus, a network, or a portable storage medium.

The storage unit 2 comprises a ROM (Read Only Memory) that stores the BIOS (Basic Input Output System) and the like of the computer implementing the information processing apparatus 1, a RAM (Random Access Memory) that serves as the work area of the information processing apparatus 1, and a mass storage device such as an HDD (Hard Disk Drive) or SSD (Solid State Drive) that stores the OS (Operating System), application programs, and various information referred to when the application programs are executed.

The control unit 3 is a processor of the information processing apparatus 1, such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit). By executing a program stored in the storage unit 2, it functions as a data acquisition unit 30, a matrix calculation unit 31, a matrix compression unit 32, a division unit 33, an adjacency matrix acquisition unit 34, a label propagation unit 35, and a vector selection unit 36.

FIG. 1 shows an example in which the information processing apparatus 1 is a single device. However, the information processing apparatus 1 may also be realized by computing resources such as a plurality of processors and memories, as in a cloud computing system. In that case, each unit constituting the control unit 3 is realized by at least one of the plurality of processors executing the program.

The data acquisition unit 30 acquires a labeled data group, which is a group of data to which labels indicating the classes to which the data belong are attached. The labeled data group is stored in, for example, the storage unit 2, in which case the data acquisition unit 30 reads it from the storage unit 2. Each data item in the labeled data group is represented by a vector. This vector may have as its elements a plurality of feature quantities that characterize the data.

The data acquisition unit 30 also acquires an unlabeled data group, which is a group of data for which the classes are unknown. Like the labeled data group, the unlabeled data group is stored in, for example, the storage unit 2. Each data item in the unlabeled data group is represented by a vector of the same dimension as the data items in the labeled data group.

The matrix calculation unit 31 calculates a metric matrix formed by arranging, for each data item, a metric vector whose elements are metrics indicating the similarity between one data item among the data constituting the labeled data group and the unlabeled data group and the other data items, including that one data item itself.

Specifically, suppose that the data constituting the labeled data group and the unlabeled data group total N items (N is a natural number of 2 or more), denoted x_1, x_2, ..., x_N. The matrix calculation unit 31 selects the data in order, starting from x_1 and continuing until x_N is reached. Let x_i be the data item currently selected by the matrix calculation unit 31. The matrix calculation unit 31 calculates metrics indicating the similarity between x_i and the other data items, including x_i itself. Let x_j denote these data items (including x_i), and let g_ij denote the metric indicating the similarity between x_i and x_j.

g_ij = F(x_i, x_j)   (1)
In equation (1), F is a function for calculating the metric between x_i and x_j. The details of the function F will be described later.

For each x_i, the matrix calculation unit 31 generates a metric vector g_i whose elements are the metrics g_ij. That is, the metric vector g_i is expressed by equation (2).

The matrix calculation unit 31 generates a matrix in which the metric vectors g_i are arranged, and takes it as the metric matrix G. That is, the metric matrix G is expressed by equation (3).
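Equations (2) and (3) are not reproduced in this text. A reconstruction consistent with equation (1) and the surrounding definitions is sketched below; it assumes that the metric vectors are arranged as the rows of G, which the text does not state explicitly.

```latex
\mathbf{g}_i = \left( g_{i1},\, g_{i2},\, \ldots,\, g_{iN} \right) \tag{2}

G = \begin{pmatrix} \mathbf{g}_1 \\ \mathbf{g}_2 \\ \vdots \\ \mathbf{g}_N \end{pmatrix}
  = \begin{pmatrix}
      g_{11} & g_{12} & \cdots & g_{1N} \\
      g_{21} & g_{22} & \cdots & g_{2N} \\
      \vdots & \vdots & \ddots & \vdots \\
      g_{N1} & g_{N2} & \cdots & g_{NN}
    \end{pmatrix} \tag{3}
```

With either row or column arrangement, G has N vectors of N elements each and is therefore an N-by-N matrix.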

As is clear from equation (3), the metric matrix G is a square matrix of order N. Both the labeled data group and the unlabeled data group acquired by the data acquisition unit 30 are data of the kind used for so-called supervised learning with, for example, a DNN (Deep Neural Network) or an SVM (Support Vector Machine). The total number N of data items in the labeled and unlabeled data groups is therefore often on the order of hundreds even when small, and on the order of millions when large.
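As a minimal sketch of how such a metric matrix might be computed, the following uses a Gaussian kernel for the function F, which is one of the options the text names. The function name `gaussian_metric_matrix`, the bandwidth parameter `sigma`, and the sample data are assumptions introduced for illustration.

```python
import math


def gaussian_metric_matrix(X, sigma=1.0):
    """Return the N x N matrix with G[i][j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    Every pair is evaluated, including i == j, so each metric vector G[i]
    also contains the similarity of x_i with itself.
    """
    def g(xi, xj):
        sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
        return math.exp(-sq_dist / (2.0 * sigma ** 2))

    return [[g(xi, xj) for xj in X] for xi in X]


# Three hypothetical 2-D feature vectors (labeled and unlabeled data together).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
G = gaussian_metric_matrix(X)
```

Because every pair of data items is evaluated, both the storage and the computation grow with the square of N, which is part of the motivation for the compression that follows.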

The matrix compression unit 32 therefore generates a compressed metric matrix formed by arranging compressed metric vectors obtained by compressing the dimension N of each metric vector g_i constituting the metric matrix G. The details of the compression processing will be described later; by compressing each metric vector g_i, the matrix compression unit 32 generates an m-dimensional compressed metric vector c_i (m << N). As an example, m is about one hundredth of N. The matrix compression unit 32 generates a compressed metric matrix C formed by arranging the compressed metric vectors c_i. The compressed metric matrix C is expressed by equation (4).

Since m < N, the compressed metric matrix C is not a square matrix.

The division unit 33 maps a plurality of points, whose coordinates are the elements of the respective compressed metric vectors c_i, into a multidimensional space of the same dimension as the compressed metric vectors c_i, and performs multidimensional Delaunay division on the plurality of points.

FIGS. 2(a) and 2(b) are diagrams for explaining the Delaunay division executed by the division unit 33 according to the embodiment. Specifically, FIG. 2(a) shows the result of mapping the compressed metric vectors c_i by the division unit 33, and FIG. 2(b) shows the result of the Delaunay division by the division unit 33. By taking the first element f1 of a compressed metric vector c_i as the first axis and the second element f2 as the second axis, the division unit 33 can map the compressed metric vector c_i to a single point in two-dimensional space.

In general, the dimension of the compressed metric vector c_i is greater than 2, but for convenience of illustration FIGS. 2(a) and 2(b) show an example in which it is 2. When the compressed metric vector c_i is two-dimensional, it consists of two elements, f1 and f2. As shown in FIG. 2(a), by taking the first element f1 as the first axis and the second element f2 as the second axis, each compressed metric vector c_i is mapped to a single point in a space of the same dimension as the compressed metric vector c_i (two dimensions in FIG. 2).

The division unit 33 executes Delaunay division on the compressed metric vectors c_i mapped into the multidimensional space. As a result, as shown in FIG. 2(b), a plurality of simplices (triangles in FIG. 2(b)) are generated whose vertices are the compressed metric vectors c_i mapped into the multidimensional space. That is, each compressed metric vector c_i mapped into the multidimensional space is connected by edges to a plurality of other compressed metric vectors.
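For the two-dimensional case illustrated in FIG. 2, the division can be sketched with a brute-force search that keeps every triangle whose circumscribed circle contains no other point. This is an illustration only, not the procedure used by the division unit 33: the function name `delaunay_2d` is hypothetical, the points are assumed to be in general position (no four on a common circle), and the search is far too slow for real data.

```python
from itertools import combinations


def delaunay_2d(points):
    """Brute-force Delaunay division of 2-D points in general position.

    Returns the triangles as tuples of point indices.
    """
    def orient(a, b, c):
        # > 0 for counter-clockwise, < 0 for clockwise, 0 for collinear.
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

    def in_circle(a, b, c, d):
        # Standard in-circle determinant, corrected for triangle orientation.
        rows = [(p[0] - d[0], p[1] - d[1],
                 (p[0] - d[0]) ** 2 + (p[1] - d[1]) ** 2) for p in (a, b, c)]
        (ax, ay, aw), (bx, by, bw), (cx, cy, cw) = rows
        det = (ax * (by * cw - bw * cy)
               - ay * (bx * cw - bw * cx)
               + aw * (bx * cy - by * cx))
        return det * orient(a, b, c) > 0

    triangles = []
    n = len(points)
    for i, j, k in combinations(range(n), 3):
        a, b, c = points[i], points[j], points[k]
        if orient(a, b, c) == 0:
            continue  # collinear triple, not a triangle
        others = (points[m] for m in range(n) if m not in (i, j, k))
        if not any(in_circle(a, b, c, p) for p in others):
            triangles.append((i, j, k))
    return triangles


# Four hypothetical compressed metric vectors (f1, f2).
points = [(0, 0), (4, 0), (2, 3), (2, -1)]
triangles = delaunay_2d(points)
```

For higher dimensions and realistic data sizes, a library routine such as SciPy's `scipy.spatial.Delaunay` would be a more practical choice than this exhaustive search.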

The adjacency matrix acquisition unit 34 acquires the connection relationship of the points after the Delaunay division as an adjacency matrix. Specifically, the adjacency matrix acquisition unit 34 generates and acquires an adjacency matrix in which compressed metric vectors c_i whose corresponding points are directly connected by an edge as a result of the Delaunay division are marked as "connected", and points that are not directly connected by an edge are marked as "not connected".

FIG. 3 is a diagram schematically showing an example of an adjacency matrix acquired by the adjacency matrix acquisition unit 34 according to the embodiment. In FIG. 3, the region enclosed by the broken-line rectangle is the adjacency matrix.

Since the adjacency matrix is a well-known concept, a detailed explanation is omitted; it is a matrix that indicates whether or not each pair of points (vertices) mapped in the multidimensional space is connected by an edge. Specifically, the elements of the vector forming each row of the adjacency matrix indicate the connection relationship between the vertex corresponding to that vector and the other vertices. The adjacency matrix is therefore a symmetric square matrix whose order equals the number of vertices.

In the example shown in FIG. 3, for the N data x_1, x_2, ..., x_N constituting the labeled data group and the unlabeled data group, an element is 1 when the corresponding data are connected and 0 when they are not. Specifically, the rows of the adjacency matrix indicate the connection relationships of the data x_1, x_2, ..., x_N with the other vectors, respectively. For example, since the data x_1 has no connection with itself, the first column of the vector forming the first row, which corresponds to the data x_1, is 0. Further, since the data x_1 is connected to the data x_N, the element in row 1, column N of the adjacency matrix shown in FIG. 3 is 1.

The label propagation unit 35 propagates labels to each data constituting the unlabeled data group based on the adjacency matrix and the labels of the labeled data group. Since techniques for label propagation given a known adjacency matrix are themselves known, an example of a propagation algorithm is briefly described below for the case of two classes with labels +1 and -1.

Let W denote the adjacency matrix. Let y denote the labels of the N data x_1, x_2, ..., x_N constituting the labeled data group and the unlabeled data group. Specifically, the label y of each data in the labeled data group is either +1 or -1. The label y of each vector in the unlabeled data group is set to 0.

Let f be the predicted value of the label y of each data constituting the unlabeled data group. f can take a real value between -1 and +1. The objective function J(f) used to maximize the prediction performance is given by the following equation (5).

Here, L = D - W, where D is a diagonal matrix whose diagonal elements are the row sums of W, and λ is a constant that balances the first and second terms on the right-hand side.
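Equation (5) is not reproduced in this text. One form that is consistent with the definitions above and with equation (6) below is the following regularized least-squares objective; the exact scaling of the two terms is an assumption.

```latex
J(f) = (f - y)^{\mathsf{T}} (f - y) + \lambda\, f^{\mathsf{T}} L f,
\qquad
f^{\mathsf{T}} L f = \frac{1}{2} \sum_{i,j} W_{ij} \, (f_i - f_j)^{2}
```

Under this assumed form, the first term keeps f close to the given labels y, and the second term penalizes differences between the predictions of connected points. Setting the gradient 2(f - y) + 2λLf to zero yields (1 + λL) f = y.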

When the value of the objective function J(f) in equation (5) is minimized, the following equation (6) holds.
(1 + λL) f = y (6)
Here, 1 denotes the identity matrix.

By using equations (5) and (6), the label propagation unit 35 can propagate labels to each data constituting the unlabeled data group based on the adjacency matrix and the labels of the labeled data group.
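A minimal sketch of solving equation (6) is shown below, assuming NumPy and reading the 1 in (1 + λL) as the identity matrix. The function name `propagate_labels`, the value λ = 1, and the four-point chain graph are hypothetical choices for illustration.

```python
import numpy as np


def propagate_labels(W, y, lam=1.0):
    """Solve (I + lam * L) f = y with L = D - W, as in equation (6)."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    D = np.diag(W.sum(axis=1))  # row sums of W on the diagonal
    L = D - W
    return np.linalg.solve(np.eye(len(y)) + lam * L, y)


# Hypothetical chain of four points 0-1-2-3: point 0 is labeled +1,
# point 3 is labeled -1, and points 1 and 2 are unlabeled (label 0).
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
y = [1, 0, 0, -1]
f = propagate_labels(W, y, lam=1.0)
predicted = np.sign(f)  # class +1 or -1 for each point
```

In this example f = (4/7, 1/7, -1/7, -4/7), so each unlabeled point takes the sign of the labeled point nearest to it on the chain.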

Because the information processing apparatus 1 according to the embodiment compresses each metric vector g_i before mapping it into the multidimensional space, the Delaunay division can be completed within a practical computation time. In this way, the information processing apparatus 1 according to the embodiment makes label propagation in semi-supervised learning using multidimensional Delaunay division practical.

[Calculation of the metric]
The function F used by the matrix calculation unit 31 to calculate the metric matrix G is described below.
The matrix calculation unit 31 according to the embodiment calculates the metric between data using a function F having semidefiniteness. Examples of a function F having semidefiniteness include the kernel functions used in SVMs, specifically the polynomial kernel, the Gaussian kernel, and the hyperbolic tangent kernel. The following description assumes that the matrix calculation unit 31 calculates the metric between data using the Gaussian kernel.

As described above, each data constituting the labeled data group and the unlabeled data group is represented by a vector. Let these be the vectors x_1, x_2, ..., x_N. The function F when the Gaussian kernel is used is then expressed by the following equation (7).

In equation (7), x_ik denotes the k-th element of the vector x_i. The same applies to x_jk.
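Equation (7) is not reproduced in this text. The standard Gaussian kernel, written with the element notation above and the parameter γ that the description refers to later, has the following form; treating it as the form of equation (7) is an assumption.

```latex
g_{ij} = F(x_i, x_j) = \exp\!\left( -\gamma \sum_{k} \left( x_{ik} - x_{jk} \right)^{2} \right), \qquad \gamma > 0
```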

FIG. 4 is a diagram schematically showing an example of a metric matrix calculated by the matrix calculation unit 31 according to the embodiment. In FIG. 4, the region enclosed by the broken-line rectangle is the metric matrix. As is clear from the definition of the function F in equation (7), the metric g_ij takes a value of 0 or more and 1 or less; it is 1 when the vectors x_i and x_j are identical (that is, most similar), and approaches 0 as x_i and x_j move apart. The example in FIG. 4 shows that the vector x_1 is more similar to the vector x_2 than to the vector x_N.
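The properties stated above can be checked with a small sketch, assuming the standard Gaussian kernel exp(-γ Σ_k (x_ik - x_jk)^2) as the function F. The function name `gaussian_metric_matrix`, the value of γ, and the sample vectors are hypothetical.

```python
import math


def gaussian_metric_matrix(vectors, gamma=1.0):
    """Return G with g_ij = exp(-gamma * squared Euclidean distance between x_i and x_j)."""
    n = len(vectors)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            G[i][j] = math.exp(-gamma * sq_dist)
    return G


# Hypothetical data: x_1 and x_2 are close together, x_3 is far from both.
x = [[0.0, 0.0], [0.1, 0.0], [3.0, 4.0]]
G = gaussian_metric_matrix(x, gamma=1.0)
```

Row i of `G` is the metric vector g_i; its diagonal element is 1, and the element for the distant vector is close to 0.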

[Compression of the metric matrix G]
Next, compression of the metric matrix G by the matrix compression unit 32 is described. The matrix compression unit 32 compresses the metric matrix G using two mutually independent methods.

(First method)
As is clear from the definition of the function F in equation (7), the metric g_ij rapidly approaches 0 as the vectors x_i and x_j move apart. Therefore, assuming that the labeled data group and the unlabeled data group are not biased, many elements of the metric matrix G are expected to be close to 0.

Therefore, the matrix compression unit 32 generates a matrix D by replacing with 0 those elements of the metric matrix G that are less than a predetermined threshold. The "predetermined threshold" is a "connection determination reference threshold" that the matrix compression unit 32 refers to in order to regard data as having no connection relationship. A specific value of this threshold may be determined experimentally in consideration of the balance between compression efficiency (that is, computational efficiency) and accuracy, the value of γ in equation (7), and the like; it is 0.5, for example. The matrix compression unit 32 thereby converts the metric matrix G calculated by the matrix calculation unit 31 into the matrix D, which is a sparse matrix. The matrix compression unit 32 then generates the compressed metric matrix C from the matrix D using a matrix decomposition based on sparse matrices.
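The thresholding step can be sketched as follows, using the example threshold of 0.5 given above. The function name `threshold_metric_matrix` and the sample matrix are hypothetical; a practical implementation would presumably store the result in a sparse format.

```python
def threshold_metric_matrix(G, threshold=0.5):
    """Replace the elements of G that are less than the threshold with 0."""
    return [[g if g >= threshold else 0.0 for g in row] for row in G]


# Hypothetical 3 x 3 metric matrix.
G = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.4],
     [0.1, 0.4, 1.0]]
D = threshold_metric_matrix(G, threshold=0.5)
num_zeros = sum(1 for row in D for g in row if g == 0.0)
```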

FIGS. 5(a)-(b) are diagrams for explaining the matrix compression executed by the matrix compression unit 32 according to the embodiment. Specifically, FIG. 5(a) is a diagram for explaining general singular value decomposition, and FIG. 5(b) is a diagram for explaining vector compression using singular value decomposition.

As shown in FIG. 5(a), when the matrix compression unit 32 performs singular value decomposition on the matrix D, the matrix D is decomposed into a matrix S formed by arranging the left singular vectors, a matrix Σ having the singular values as its diagonal elements, and a matrix V^T formed by arranging the right singular vectors (T denotes the matrix transpose). Because FIG. 5(a) illustrates general singular value decomposition, the matrix D is drawn with different row and column lengths; the metric matrix G, however, is a square matrix, so its row and column lengths are equal.

The diagonal elements of the matrix Σ are the singular values arranged in descending order. The matrix compression unit 32 generates a new matrix Σ' by discarding the singular values that are equal to or less than a predetermined value. The matrix Σ' has a shorter row length than the matrix Σ.

The matrix compression unit 32 calculates a new matrix D' by multiplying the matrix S, formed by arranging the left singular vectors, by the matrix Σ'. The matrix D' has a shorter row length than the matrix D. The column length of the matrix D' is the same as that of the matrix D and equals the number of data constituting the labeled data group and the unlabeled data group. The matrix compression unit 32 takes the i-th row vector of the matrix D' as the compressed metric vector c_i. The matrix compression unit 32 can thereby compress the size of the metric matrix G in the row direction.
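The truncation described above can be sketched with a dense singular value decomposition, assuming NumPy. The function name `compress_by_svd`, the cutoff of 0.5, and the sample matrix are hypothetical; the embodiment refers to a decomposition based on sparse matrices, for which a dense SVD is used here only as a stand-in.

```python
import numpy as np


def compress_by_svd(D, min_singular_value):
    """Return D' = S[:, :r] scaled by the r singular values above the cutoff."""
    S, sigma, Vt = np.linalg.svd(np.asarray(D, dtype=float))
    # sigma is sorted in descending order, so the kept values come first.
    r = int(np.sum(sigma > min_singular_value))
    return S[:, :r] * sigma[:r]  # multiplies column k of S by sigma[k]


# Hypothetical thresholded matrix D with two groups of mutually similar data.
D = np.array([[1.0, 0.8, 0.0, 0.0],
              [0.8, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.9],
              [0.0, 0.0, 0.9, 1.0]])
C = compress_by_svd(D, min_singular_value=0.5)
```

The singular values of this `D` are 1.9, 1.8, 0.2, and 0.1, so two are kept and each row of `C` is a two-dimensional compressed metric vector c_i.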

(Second method)
Next, a second method by which the matrix compression unit 32 compresses the metric matrix G is described.

In outline, the second method for compressing the metric matrix G determines the connection relationships of a metric vector g_i by selecting only those metric vectors g_j that lie in the neighborhood of g_i in the multidimensional space.

To achieve this, the vector selection unit 36 first selects, in order, the metric vectors g_i constituting the metric matrix G. Each element of the metric vector g_i represents the similarity to another metric vector g_j. The matrix compression unit 32 therefore extracts, from the elements of the metric vector g_i selected by the vector selection unit 36, the elements that are equal to or greater than a predetermined threshold. Here, the predetermined threshold is a "neighborhood determination threshold" that the matrix compression unit 32 refers to in order to determine whether a pair of metric vectors g are neighbors, in other words, whether the pair of metric vectors g are similar. A specific value of the neighborhood determination threshold may be determined experimentally in consideration of the balance between compression efficiency (that is, computational efficiency) and accuracy, the range of the function F used to calculate the metric, and the like.

For the other metric vectors g_j constituting the metric matrix G, the matrix compression unit 32 likewise extracts the elements at the positions corresponding to the elements extracted from the metric vector g_i, and generates the matrix composed of the extracted elements as the compressed metric matrix C_i for the metric vector g_i.

For example, suppose that the vector selection unit 36 selects the metric vector g_1 corresponding to the vector x_1. Suppose further that the matrix compression unit 32 extracts the elements equal to or greater than the neighborhood determination threshold from the elements (g_11, g_12, ..., g_1N) of the metric vector g_1, and that the first, fifth, and N-th elements g_11, g_15, and g_1N are extracted. In this case, the compressed metric vector c_1 is c_1 = (g_11, g_15, g_1N). The matrix compression unit 32 also extracts the first, fifth, and N-th elements of the remaining metric vectors g_j (metric vectors g_2 to g_N) to generate the compressed metric vectors c_j. The matrix compression unit 32 thereby generates the compressed metric matrix C_1 for the metric vector g_1.
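The column selection in this example can be sketched as follows. The function name `compress_for_selected`, the threshold of 0.5, and the sample matrix are hypothetical, and indices are zero-based, so index 0 corresponds to g_1.

```python
def compress_for_selected(G, i, threshold):
    """Build C_i by keeping only the columns where row i of G is >= threshold."""
    kept = [j for j, g in enumerate(G[i]) if g >= threshold]
    # Apply the same column selection to every metric vector (row) of G.
    C_i = [[row[j] for j in kept] for row in G]
    return kept, C_i


# Hypothetical 3 x 3 metric matrix; row 0 is the selected metric vector.
G = [[1.0, 0.2, 0.9],
     [0.2, 1.0, 0.3],
     [0.9, 0.3, 1.0]]
kept, C_0 = compress_for_selected(G, 0, threshold=0.5)
```

Here columns 0 and 2 are kept, so every compressed metric vector in `C_0` has two elements instead of three.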

Each time the vector selection unit 36 selects a metric vector g_i, the matrix compression unit 32 generates the compressed metric matrix C_i corresponding to the selected metric vector g_i. The adjacency matrix acquisition unit 34 determines the elements of the adjacency matrix corresponding to the metric vector g_i selected by the vector selection unit 36 by identifying, based on the compressed metric matrix C_i, the connection relationship between the point corresponding to that metric vector and the other points.

The information processing apparatus 1 can thereby determine the adjacency matrix for each metric vector g using only the metric vectors g in its neighborhood. Consequently, compared with identifying the connection relationships among all the metric vectors g, the information processing apparatus 1 can shorten the computation time required to calculate the adjacency matrix.

(Relationship between the first method and the second method)
The first and second methods for compressing the metric matrix G described above are independent of each other. The matrix compression unit 32 can therefore use the first method and the second method together. Specifically, the matrix compression unit 32 can further compress, using the second method, the matrix calculated using the first method.

<Processing flow of the information processing method executed by the information processing apparatus 1>
FIG. 6 is a flowchart for explaining the flow of information processing executed by the information processing apparatus 1 according to the embodiment. The processing in this flowchart starts, for example, when the information processing apparatus 1 is activated.

The data acquisition unit 30 acquires a labeled data group, in which each data is given a label indicating the class to which it belongs (S2). The data acquisition unit 30 also acquires an unlabeled data group, for which the class to which each data belongs is unknown (S4).

The matrix calculation unit 31 calculates a metric matrix G formed by arranging, for each data, a metric vector g whose elements are metrics indicating the similarity between one data among the data constituting the labeled data group and the unlabeled data group and the other data, including that one data itself (S6).

The matrix compression unit 32 generates a compressed metric matrix C formed by arranging compressed metric vectors c obtained by compressing the dimension of each metric vector g constituting the metric matrix G (S8). The division unit 33 maps a plurality of points, whose coordinates are the elements of the respective compressed metric vectors c, into a multidimensional space having the same number of dimensions as the compressed metric vectors c, and performs multidimensional Delaunay division on the plurality of points (S10).

The adjacency matrix acquisition unit 34 acquires the connection relationship of each point after the Delaunay division by the division unit 33 as an adjacency matrix (S12). The label propagation unit 35 propagates labels to each data constituting the unlabeled data group based on the adjacency matrix and the labels of the labeled data group (S14).

When the label propagation unit 35 has propagated labels to each data constituting the unlabeled data group, the processing in this flowchart ends.

<Effects of the information processing apparatus 1 according to the embodiment>
As described above, the information processing apparatus 1 according to the embodiment makes label propagation in semi-supervised learning using multidimensional Delaunay division practical.

Although the present invention has been described above using the embodiment, the technical scope of the present invention is not limited to the scope described in the above embodiment, and various modifications and changes are possible within the scope of its gist. For example, all or part of the apparatus can be configured by being functionally or physically distributed or integrated in arbitrary units. New embodiments arising from any combination of a plurality of embodiments are also included in the embodiments of the present invention. The effects of a new embodiment arising from such a combination include the effects of the original embodiments.

1 ... Information processing apparatus
2 ... Storage unit
3 ... Control unit
30 ... Data acquisition unit
31 ... Matrix calculation unit
32 ... Matrix compression unit
33 ... Division unit
34 ... Adjacency matrix acquisition unit
35 ... Label propagation unit
36 ... Vector selection unit

Claims

An information processing apparatus comprising:
a data acquisition unit that acquires a labeled data group, in which each data is given a label indicating the class to which it belongs, and an unlabeled data group, for which the class to which each data belongs is unknown;
a matrix calculation unit that calculates a metric matrix formed by arranging, for each data, a metric vector whose elements are metrics indicating the similarity between one data among the data constituting the labeled data group and the unlabeled data group and the other data, including the one data;
a matrix compression unit that generates a compressed metric matrix formed by arranging compressed metric vectors obtained by compressing the dimension of each metric vector constituting the metric matrix;
a division unit that maps a plurality of points, whose coordinates are the elements of the respective compressed metric vectors, into a multidimensional space having the same number of dimensions as the compressed metric vectors, and performs multidimensional Delaunay division on the plurality of points;
an adjacency matrix acquisition unit that acquires the connection relationship of each point after the Delaunay division as an adjacency matrix; and
a label propagation unit that propagates labels to each data constituting the unlabeled data group based on the adjacency matrix and the labels of the labeled data group.

The information processing apparatus according to claim 1, wherein the matrix calculation unit calculates the metric using a function having semidefiniteness.

The information processing apparatus according to claim 1 or 2, wherein the matrix calculation unit calculates the metric using a Gaussian kernel.

The information processing apparatus according to claim 3, wherein the matrix compression unit generates the compressed metric matrix using a matrix decomposition based on sparse matrices after replacing with 0 the elements of the metric matrix that are less than a predetermined threshold.

The information processing apparatus according to claim 3 or 4, further comprising a vector selection unit that sequentially selects the metric vectors constituting the metric matrix, wherein
the matrix compression unit extracts, from the elements of the metric vector selected by the vector selection unit, the elements that are equal to or greater than a predetermined threshold, extracts from the other metric vectors constituting the metric matrix the elements corresponding to the extracted elements, and generates a matrix composed of the extracted elements as the compressed metric matrix, and
the adjacency matrix acquisition unit determines the elements of the adjacency matrix corresponding to the metric vector selected by the vector selection unit by identifying, based on the compressed metric matrix, the connection relationship between the point corresponding to that metric vector and the other points.

An information processing method in which a processor executes:
a step of acquiring a labeled data group, in which each data is given a label indicating the class to which it belongs;
a step of acquiring an unlabeled data group, for which the class to which each data belongs is unknown;
a step of calculating a metric matrix formed by arranging, for each data, a metric vector whose elements are metrics indicating the similarity between one data among the data constituting the labeled data group and the unlabeled data group and the other data, including the one data;
a step of generating a compressed metric matrix formed by arranging compressed metric vectors obtained by compressing the dimension of each metric vector constituting the metric matrix;
a step of mapping a plurality of points, whose coordinates are the elements of the respective compressed metric vectors, into a multidimensional space having the same number of dimensions as the compressed metric vectors, and performing multidimensional Delaunay division on the plurality of points;
a step of acquiring the connection relationship of each point after the Delaunay division as an adjacency matrix; and
a step of propagating labels to each data constituting the unlabeled data group based on the adjacency matrix and the labels of the labeled data group.

A program that causes a computer to realize:
a function of acquiring a labeled data group, in which each data is given a label indicating the class to which it belongs;
a function of acquiring an unlabeled data group, for which the class to which each data belongs is unknown;
a function of calculating a metric matrix formed by arranging, for each data, a metric vector whose elements are metrics indicating the similarity between one data among the data constituting the labeled data group and the unlabeled data group and the other data, including the one data;
a function of generating a compressed metric matrix formed by arranging compressed metric vectors obtained by compressing the dimension of each metric vector constituting the metric matrix;
a function of mapping a plurality of points, whose coordinates are the elements of the respective compressed metric vectors, into a multidimensional space having the same number of dimensions as the compressed metric vectors, and performing multidimensional Delaunay division on the plurality of points;
a function of acquiring the connection relationship of each point after the Delaunay division as an adjacency matrix; and
a function of propagating labels to each data constituting the unlabeled data group based on the adjacency matrix and the labels of the labeled data group.