JP6907772B2

JP6907772B2 - Information processing equipment and programs

Info

Publication number: JP6907772B2
Application number: JP2017136075A
Authority: JP
Inventors: 岡本　洋; 洋岡本
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2017-07-12
Filing date: 2017-07-12
Publication date: 2021-07-21
Anticipated expiration: 2037-07-12
Also published as: JP2019020806A

Description

The present invention relates to an information processing device and a program.

Regarding clustering, the inventor proposed in Patent Document 1 a method for detecting, from a network, a cluster structure having overlaps and hierarchy, based on "modular decomposition of Markov chains". In clustering (community extraction) based on modular decomposition of Markov chains, the change in the probability of each node is computed repeatedly under a model in which the probability held by each node of the network transitions (random walk) via links to other nodes. The cluster to which each node belongs is then determined from the information obtained when the steady state is reached.

Japanese Unexamined Patent Publication No. 2013-168127

Vector data handled in a network composed of two types of nodes (hereinafter, a bipartite network) may represent not only feature representations of documents but also, for example, physical quantities obtained by various measurements or test values in various diagnoses. In such cases a value such as a temperature can be zero or negative and, unlike a document feature representation, the magnitude of the value does not represent the weight of the link connecting a node and a data point. Consequently, when such vector data is to be represented as a bipartite network, the link weights cannot be obtained from the magnitudes of the values.

Methods for clustering vector data that contains negative values include, for example, nonparametric methods based on the relative positions of the data. These methods require the distances between data pairs to be computed in advance, so the amount of computation grows as the number of data points increases and clustering takes longer.

An object of the present invention is to provide a technique for clustering vector data containing negative values with less computation than a nonparametric method.

The information processing apparatus according to claim 1 of the present invention includes: acquisition means for acquiring vector data in which a plurality of components are expressed as a vector; first clustering means for clustering the vector data by a parametric method; generation means for generating a bipartite network whose nodes are the data points representing the vector data and the feature points of the clusters obtained by the first clustering means; calculation means for calculating the link weights connecting the data point nodes and the feature point nodes; and second clustering means for clustering the nodes by determining the transition probabilities between nodes via the links of the bipartite network according to the link weights and executing an iterative calculation of the stochastic process of transitions between the nodes.

In the information processing apparatus according to claim 2 of the present invention, the first clustering means performs clustering by the k-means method, and the centers of the clusters obtained by the first clustering means are used as the feature points.

In the information processing apparatus according to claim 3 of the present invention, the link weight is calculated by an activation function that makes the weight a larger positive value as the Euclidean distance between the feature point and the data point becomes shorter.

In the information processing apparatus according to claim 4 of the present invention, when the data point is x_n and the feature point is m_k, the Euclidean distance is calculated by equation (1).

In the information processing apparatus according to claim 5 of the present invention, the first clustering means performs clustering by the K-medoids method, and the representative points of the clusters obtained by the first clustering means are used as the feature points.

In the information processing apparatus according to claim 6 of the present invention, the first clustering means performs clustering with a Gaussian mixture model, each of the plurality of Gaussian distributions is used as a feature point, and the contribution of a data point to a cluster is used as the link weight.
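As an illustration only, the contribution of a data point to each Gaussian component could be computed as a posterior responsibility. The sketch below assumes that reading of "contribution", assumes axis-aligned (diagonal-covariance) components that have already been fitted, and uses hypothetical names (`gaussian_pdf_diag`, `responsibilities`, `mix`); none of these are taken from the patent text.

```python
import math

def gaussian_pdf_diag(x, mean, var):
    """Density of an axis-aligned Gaussian (ASSUMPTION: diagonal covariance)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * math.log(2.0 * math.pi * vi) - (xi - mi) ** 2 / (2.0 * vi)
    return math.exp(log_p)

def responsibilities(points, mix, means, variances):
    """Return r[n][k], the share of data point n attributed to component k.

    ASSUMPTION: the 'contribution' used as a link weight is read here as the
    usual mixture-model responsibility; mix[k] is the mixing weight.
    """
    result = []
    for x in points:
        joint = [pk * gaussian_pdf_diag(x, m, v)
                 for pk, m, v in zip(mix, means, variances)]
        total = sum(joint)
        if total == 0.0:
            # Numerical underflow far from every component: fall back to uniform.
            result.append([1.0 / len(mix)] * len(mix))
        else:
            result.append([j / total for j in joint])
    return result
```

Each row is non-negative and sums to 1 even when the data contain negative components, which is the property a link weight needs.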

In the information processing apparatus according to claim 7 of the present invention, the Gaussian mixture model approximates the distribution of the vector data with ellipsoids.

In the information processing apparatus according to claim 8 of the present invention, the second clustering means calculates the importance of each cluster and extracts the clusters whose importance satisfies a predetermined condition.

In the information processing apparatus according to claim 9 of the present invention, the calculation means calculates the Euclidean distance between each feature point and a data point, selects a predetermined number of feature point nodes in ascending order of Euclidean distance from the data point, sets the link weights between the selected feature point nodes and the data point to positive values, and sets the link weights between the data point and all other feature point nodes to 0.
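The nearest-feature-point selection of claim 9 can be sketched as follows. The function name `sparse_link_weights`, the parameter `keep` (the predetermined number of feature points), and the Gaussian form of the positive weight are assumptions made for illustration; the claim only requires that the selected weights be positive and the others 0.

```python
import math

def sparse_link_weights(points, centers, keep, sigma=1.0):
    """For each data point, keep links only to its `keep` nearest feature points."""
    weights = []
    for x in points:
        dists = [math.dist(x, m) for m in centers]
        # Indices of the feature points, nearest first.
        nearest = sorted(range(len(centers)), key=lambda k: dists[k])[:keep]
        row = [0.0] * len(centers)  # unselected feature points get weight 0
        for k in nearest:
            # ASSUMPTION: Gaussian activation as the positive weight.
            row[k] = math.exp(-dists[k] ** 2 / (2.0 * sigma ** 2))
        weights.append(row)
    return weights
```

Zeroing the remaining links makes the bipartite network sparse, so each data point connects to at most `keep` feature nodes.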

The program according to claim 10 of the present invention causes a computer to function as: acquisition means for acquiring vector data in which a plurality of components are expressed as a vector; first clustering means for clustering the vector data by a parametric method; generation means for generating a bipartite network whose nodes are the data points representing the vector data and the feature points of the clusters obtained by the first clustering means; calculation means for calculating the link weights connecting the data point nodes and the feature point nodes; and second clustering means for clustering the nodes by determining the transition probabilities between nodes via the links of the bipartite network according to the link weights and executing an iterative calculation of the stochastic process of transitions between the nodes.

The information processing apparatus according to claim 1 of the present invention can cluster vector data containing negative values with a reduced amount of computation.
The information processing apparatus according to claim 2 of the present invention can perform clustering faster than a configuration that clusters by a nonparametric method.
The information processing apparatus according to claim 3 of the present invention can perform clustering without the link weights taking negative values.
The information processing apparatus according to claim 4 of the present invention can prevent the link weights from becoming negative even if the vector data contains negative values.
The information processing apparatus according to claim 5 of the present invention can use the data closest to the center of a cluster as a feature point.
The information processing apparatus according to claim 6 of the present invention can reduce the number of feature points compared with a configuration that does not use a Gaussian mixture model.
The information processing apparatus according to claim 7 of the present invention can reduce the amount of computation for clustering.
The information processing apparatus according to claim 8 of the present invention can extract important clusters.
The information processing apparatus according to claim 9 of the present invention can improve the accuracy of clustering.
The program according to claim 10 of the present invention can cluster vector data containing negative values with a reduced amount of computation.

FIG. 1 is a diagram showing the configuration of an information processing device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the flow of processing performed by the control unit 10.
FIG. 3 is a diagram showing an example of a bipartite network.
FIG. 4 is a flowchart showing the flow of processing performed by the control unit 10.

[Embodiment]
FIG. 1 is a diagram showing an example of the configuration of the information processing device 1 according to the present invention. The information processing device 1 is a computer device and includes a control unit 10, a storage unit 11, an operation unit 12, a display unit 13, and a communication unit 14.

The communication unit 14 is connected to a communication line and functions as a communication interface for communicating with other computer devices. The display unit 13 is a display device and displays the results of processing performed by the control unit 10. The operation unit 12 is, for example, a keyboard, a mouse, or the like for operating the information processing device 1.

The storage unit 11 includes a storage device that stores data persistently, and stores vector data representing data points. The vector data stored here is data expressed by a plurality of real-valued components, as in Equation 2. The components represent, for example, the measured values (real values) of various sensors in an image forming apparatus and can include negative values. For example, the components may include the measured value of a temperature sensor, which can be zero or negative as well as positive.

When the whole data set consists of N data points x_1, ..., x_N, it is represented by the N × D design matrix shown in Equation 3.
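Equations 2 and 3 are not reproduced in this text. A plausible reconstruction from the surrounding description is given below, where D is taken to be the number of components per data point and x_{nd} is an assumed name for component d of data point n; the exact notation of the original may differ.

```latex
% Assumed form of Equation 2: one data point with D real-valued components
\mathbf{x}_n = (x_{n1}, x_{n2}, \ldots, x_{nD})^{\mathsf{T}}, \qquad x_{nd} \in \mathbb{R}

% Assumed form of Equation 3: the N x D design matrix, one data point per row
X = \begin{pmatrix} \mathbf{x}_1^{\mathsf{T}} \\ \vdots \\ \mathbf{x}_N^{\mathsf{T}} \end{pmatrix}
  \in \mathbb{R}^{N \times D}
```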

The storage unit 11 also stores the program executed by the control unit 10. The program stored in the storage unit 11 performs clustering of the vector data. The program may be one acquired by the communication unit 14 via a telecommunication line or one acquired from a computer-readable recording medium.

The control unit 10 includes a CPU (Central Processing Unit) and a RAM (Random Access Memory) and executes the program stored in the storage unit 11. When the control unit 10 executes the program, an acquisition unit 101, a first clustering unit 102, a generation unit 103, a calculation unit 104, and a second clustering unit 105 are realized, providing the function of clustering the vector data.

The acquisition unit 101, an example of the acquisition means according to the present invention, acquires the vector data from the storage unit 11. The first clustering unit 102, an example of the first clustering means, clusters the vector data by a parametric method. The generation unit 103, an example of the generation means, generates a bipartite network whose nodes are the individual data points of the vector data and the means of the individual clusters obtained by the first clustering unit 102. The calculation unit 104, an example of the calculation means, calculates the weights of the links connecting the data point nodes and the cluster mean nodes in the bipartite network. The second clustering unit 105, an example of the second clustering means, determines the transition probabilities between nodes via the links of the bipartite network generated by the generation unit 103 according to the link weights, and clusters the nodes of the bipartite network by executing an iterative calculation of the stochastic process of transitions between the nodes.

FIG. 2 is a flowchart showing the flow of processing performed by the control unit 10 when it executes the program. First, the control unit 10 (acquisition unit 101) acquires the vector data stored in the storage unit 11 (step SA1). Next, the control unit 10 (first clustering unit 102) clusters the acquired vector data by a predetermined method (step SA2). The method used here is a parametric clustering method, for example the k-means method. The k-means method divides the vector data into K clusters, where K is specified by the user of the information processing device 1. After the vector data has been divided by the k-means method, the centers m_k of the K clusters are obtained by Equation 4 as the mean of the data points belonging to each cluster. Each center m_k is a provisional center of its cluster and serves as the feature point of the cluster. In Equation 4, C_k denotes cluster k (k = 1, ..., K) and N_k denotes the number of elements (data points) belonging to cluster k. By clustering with the k-means method and taking the centers m_k as the feature points, the feature points are chosen as the centers of regions where the data distribution is locally dense, and more accurately so than in a configuration that does not perform parametric clustering at this stage.
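Step SA2 can be sketched with a plain Lloyd-style k-means loop. This is a minimal illustration, not the patented implementation: the function names, the random initialization, and the fixed iteration count `iters` are assumptions. The center update follows the description of Equation 4, m_k = (1/N_k) Σ_{n ∈ C_k} x_n.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance; valid for negative components as well."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(points, K, iters=20, seed=0):
    """Return (centers, labels) after `iters` assignment/update rounds."""
    rng = random.Random(seed)
    # ASSUMPTION: initialize the centers with K distinct data points.
    centers = [list(p) for p in rng.sample(points, K)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each data point joins its nearest center.
        labels = [min(range(K), key=lambda k: sq_dist(x, centers[k]))
                  for x in points]
        # Update step: each center becomes the mean of its members.
        for k in range(K):
            members = [x for x, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

The returned `centers` play the role of the feature points m_k. Like any k-means run, the result depends on the initialization and may be a local optimum.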

In step SA2 it is assumed that the data points in each cluster are distributed according to a specific model, namely that they are distributed spherically or elliptically within each cluster. Unlike a nonparametric method, a parametric clustering method does not need the distances between all data pairs, so clustering requires less computation than with a nonparametric method.

By performing step SA2, the control unit 10 identifies the centers m_k, which are the cluster means and serve as feature points, as feature nodes. Having identified the feature nodes, the control unit 10 (generation unit 103) generates a bipartite network whose nodes are the individual data points and the individual cluster means obtained in step SA2 (step SA3). A bipartite network, also called a bipartite graph, is a network (graph) in which the set of nodes is divided into two subsets and there are no links between nodes in the same subset. An example of a bipartite network is shown in FIG. 3. In FIG. 3, a triangle represents a data point node n corresponding to a data point, and a circle represents a feature node m corresponding to a cluster mean. A straight line connecting a data point node n and a feature node m is a link.

Next, the control unit 10 (calculation unit 104) calculates the weight w_nk of the link connecting the data point node n and the feature node m (step SA4). Here the control unit 10 determines the weight w_nk through, for example, the activation function centered on the cluster mean m_k shown in Equation 5. Expression (1) of Equation 5 is the Euclidean distance between the center m_k of each cluster and each data point x_n. Expression (2) of Equation 5 is an activation function that makes the weight of the link connecting the data point node n and the feature node m a larger positive value as the Euclidean distance becomes shorter. With Equation 5, the link weight w_nk is positive or zero and never negative, even if components of the vector data or of the feature points are negative.
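A minimal sketch of step SA4 follows. The Euclidean distance matches the description of expression (1). The exact activation function of expression (2) is not reproduced in this text, so a Gaussian kernel with a hypothetical width parameter `sigma` is assumed here; any function that is non-negative and decreases with distance would fit the description equally well.

```python
import math

def euclidean(x, m):
    """Euclidean distance between data point x and feature point m."""
    return math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, m)))

def link_weights(points, centers, sigma=1.0):
    """Return W with W[n][k] = weight of the link between point n and center k.

    ASSUMPTION: Gaussian activation exp(-d^2 / (2 sigma^2)); the patent's own
    activation function may differ.
    """
    return [[math.exp(-euclidean(x, m) ** 2 / (2.0 * sigma ** 2))
             for m in centers]
            for x in points]
```

Because the weight depends only on the distance, negative components in `points` or `centers` never produce a negative weight.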

Next, the control unit 10 (second clustering unit 105) performs community decomposition on the bipartite network generated in step SA3 by the technique of modular decomposition of the network (step SA5). This modular decomposition is expressed by Equation 6.

In Equation 6, p(n) is the probability held by node n (the probability that a random walker is at that node). π_k is the prior probability of cluster k and indicates the importance of cluster k. The sum of π_k over k is 1. p(n|k) is the probability of node n in cluster k. K is the total number of clusters. Equation 6 expresses that the probability p(n) of node n can be decomposed into a combination of the probabilities p(n|k) of that node in each cluster k.
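Equation 6 itself is not reproduced in this text. From the definitions just given, it presumably has the form of a mixture decomposition; the following is a reconstruction under that assumption, not a quotation of the original.

```latex
% Presumed form of Equation 6: the node probability as a mixture over clusters
p(n) = \sum_{k=1}^{K} \pi_k \, p(n \mid k), \qquad \sum_{k=1}^{K} \pi_k = 1
```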

The specific calculation method used by the control unit 10 (second clustering unit 105) may be the same as the method described in Japanese Patent Application No. 2017-034888. An example of processing based on that method is described below with reference to the flowchart of FIG. 4.

In the procedure of FIG. 4, the control unit 10 first generates the transition probability matrix T_nm for the generated bipartite network (step SB1). The transition probability matrix T_nm expresses, as a matrix, the probability (the transition probability) that an agent (in other words, the probability value held by node m) transitions (random walks) from node m to node n by following a link in the network. In the present embodiment, the control unit 10 obtains the transition probability matrix T_nm by assuming, for example, that the agent selects among the one or more links leaving a node with probabilities corresponding to the weights w_nk set in step SA4. That is, the larger the weight w_nk, the higher the control unit 10 makes the transition probability for that link. For the transition probability matrix, see also Japanese Unexamined Patent Publication Nos. 2013-168127, 2016-029526, and 2016-218531.
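One way to realize step SB1 is to normalize the link weights leaving each node so that they sum to 1. The sketch below assumes that simple proportional rule (the text only requires that a larger weight give a higher transition probability) and an assumed node ordering: data point nodes first (indices 0 to N-1), then feature nodes (indices N to N+K-1).

```python
def transition_matrix(W):
    """Build T with T[n][m] = probability of moving from node m to node n.

    W[n][k] is the weight of the link between data point n and feature node k.
    ASSUMPTION: transition probability proportional to link weight.
    """
    N, K = len(W), len(W[0])
    size = N + K
    # Symmetric weighted adjacency of the bipartite network:
    # links exist only between a data point node and a feature node.
    A = [[0.0] * size for _ in range(size)]
    for n in range(N):
        for k in range(K):
            A[n][N + k] = W[n][k]
            A[N + k][n] = W[n][k]
    # Total weight leaving each node m (column sum).
    out = [sum(A[n][m] for n in range(size)) for m in range(size)]
    return [[A[n][m] / out[m] if out[m] > 0.0 else 0.0 for m in range(size)]
            for n in range(size)]
```

Every column of `T` that belongs to a node with at least one link sums to 1, so `T` maps a probability distribution over nodes to another distribution.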

Next, the control unit 10 calculates the steady-state link probabilities (step SB2). This calculation first uses the transition probability matrix T_nm of the bipartite network obtained in step SB1 to compute the probability held by each node in the steady state of the probability transition (random walk) on the bipartite network (the steady-state node probability). For example, the calculation of Equation 7 is repeated until the steady state is reached.

In Equation 7, p_t(n) is the probability held by node n at discrete time t. The value of p_t(n) when Equation 7 has been iterated to the steady state is the steady-state node probability p^stead(n) of node n.
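Equation 7 is not reproduced in this text; from the definitions it is presumably the update p_t(n) = Σ_m T_nm p_{t-1}(m). The sketch below iterates that update. One caveat is an addition of this sketch rather than something the patent states: a random walk on a bipartite network alternates between the two node sets, so the plain update can oscillate. The code therefore averages each step with the previous distribution (a "lazy" walk), which has the same stationary distribution.

```python
def stationary_distribution(T, tol=1e-12, max_iter=10000):
    """Iterate the node probabilities until they stop changing.

    T[n][m] is the probability of moving from node m to node n.
    ASSUMPTION: lazy update 0.5 * p + 0.5 * T p, used to avoid the
    period-2 oscillation of a walk on a bipartite network.
    """
    size = len(T)
    p = [1.0 / size] * size  # start from the uniform distribution
    for _ in range(max_iter):
        Tp = [sum(T[n][m] * p[m] for m in range(size)) for n in range(size)]
        new_p = [0.5 * p[n] + 0.5 * Tp[n] for n in range(size)]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p
```

For a connected network with symmetric weights, the result is proportional to the total link weight attached to each node.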

Next, the control unit 10 calculates the steady-state link probabilities from the steady-state node probability p^stead(n) of each node n according to Equation 8.

A link probability is the node probability p_t(n) multiplied by the transition probability of a link l leaving that node. The steady-state link probability of link l (the left-hand side of Equation 8) is the steady-state node probability of the start node of link l multiplied by the transition probability, contained in the transition probability matrix T_nm, from the start node of link l to its end node.
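The description of Equation 8 translates directly into code. In this sketch each pair of linked nodes is treated as two directed links, one per direction; that convention, like the function name and the dictionary return type, is an assumption made for illustration.

```python
def link_probabilities(T, p_stead):
    """Return {(start, end): probability} for every directed link.

    Steady-state link probability = steady-state probability of the start
    node times the transition probability from the start node to the end node.
    T[n][m] is the probability of moving from node m to node n.
    """
    size = len(T)
    links = {}
    for start in range(size):
        for end in range(size):
            if T[end][start] > 0.0:  # a link exists from start to end
                links[(start, end)] = T[end][start] * p_stead[start]
    return links
```

Because the columns of `T` and the entries of `p_stead` each sum to 1, the link probabilities also sum to 1 over all directed links.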

Japanese Unexamined Patent Publication Nos. 2016-029526 and 2016-218531 used, as learning data, the passage information τ_n^(d) (d is an integer from 1 to D; n is the node identification number), which is observation data obtained from D virtual observations. In the example described below, by contrast, under the reasonable assumption that the number of observations D is sufficiently large (far larger than the number of nodes N), the expression of Equation 9 is used as passage information for the real link l in place of τ_n^(d).

Here n is the node identification number and δ is the Kronecker delta. That is, the passage information (learning data) of node n for real link l defined by Equation 9 takes the value 1 when node n coincides with the end node (terminal end of link l) or the start node (initial end of link l) of that real link, and the value 0 otherwise. The control unit 10 generates this passage information from the bipartite network as learning data. The generated passage information is used in the calculation of the EM (Expectation Maximization) algorithm described later.
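The passage information described for Equation 9 is an indicator over nodes for each link. A small sketch, assuming links are given as (start, end) pairs of node indices (a representation chosen here for illustration):

```python
def passage_information(links, num_nodes):
    """Return tau with tau[l][n] = 1 if node n is an end of link l, else 0.

    `links` is a list of (start, end) node-index pairs; this list form is an
    assumed representation, not one specified in the patent text.
    """
    tau = []
    for start, end in links:
        row = [0] * num_nodes
        row[start] = 1  # node coincides with the initial end of the link
        row[end] = 1    # node coincides with the terminal end of the link
        tau.append(row)
    return tau
```

In a bipartite network a link never joins a node to itself, so every row contains exactly two ones.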

In addition, in place of the ratio γ^(d)(k) that cluster k occupies among all of the clusters (components) in each virtual observation d in Japanese Unexamined Patent Publication No. 2016-029526, the present embodiment uses the ratio γ_lk (with tilde) for real link l, defined by expression (3) of Equation 11 described later.

With this replacement of the observation index d by the real link number l, the expression for the sum of a function is replaced as follows.

The second term on the right-hand side of expression (1) of Equation 11, described later, is obtained by applying this replacement to the corresponding expression described in Japanese Unexamined Patent Publication No. 2016-029526 and elsewhere.

Returning to the procedure of FIG. 4, the control unit 10 next provisionally sets initial values of the probability p_t(n|k), the importance π_k^new, and the ratio γ_lk, and initializes the iteration counter g to 0 (step SB3). The probability p_t(n|k) is the probability of node n in cluster k. The importance π_k^new is the importance of cluster k. γ_lk is the ratio that cluster k occupies among all of the clusters at link l.

Next, the control unit 10 performs the iterative calculation of the EM algorithm using expressions (1), (2), and (3) of Equation 11.

That is, the control unit 10 first calculates the ratio γ_lk using expression (3) (step SB4; the E step of the EM algorithm). The first iteration of this calculation uses the initial values provisionally set in step SB3.

Next, the control unit 10 takes the current probability p_t(n|k) and importance π_k^new as the values one time step earlier, p_{t-1}(n|k) and π_k^old (step SB5). It then calculates the probability p_t(n|k) and the importance π_k^new according to expressions (1) and (2) (step SB6; the M step of the EM algorithm). More specifically, in step SB6 the new importance π_k^new is first calculated according to expression (2), and the probability p_t(n|k) is then obtained by evaluating expression (1) with this new importance. Here α is a positive real number, a parameter that determines the size of the clusters, and a predetermined value may be used.

The control unit 10 then increments the iteration counter g (step SB7) and determines whether the counter g has reached a predetermined value G (step SB8). If it has not, steps SB4 to SB7 are repeated. The value G is the number of iterations needed for the calculation of steps SB4 to SB6 to converge in the calculation method of the present embodiment, and is determined in advance by experiment, empirical knowledge, or the like.

When the control unit 10 determines in step SB8 that the counter g has reached the value G, it regards the iterative calculation as having converged and ends the processing of FIG. 4. After the determination result in step SB8 becomes Yes, the control unit 10 calculates the degree of belonging γ(k|n) of the node n to the cluster k according to Equation 12 (step SA6).

In this equation, π_k and p(n|k) are the π_k^new and p_t(n|k) finally obtained by repeating the EM algorithm calculation (steps SB4 to SB6). Equation 12 calculates, from π_k and p(n|k) by Bayes' theorem, the value γ(k|n) that indicates the degree to which the node n belongs to the cluster k (the degree of belonging). The control unit 10 outputs the degree of belonging γ(k|n) obtained in this way as the clustering result (step SA7). The degree of belonging γ(k|n) is information representing the result of soft clustering of the node n.
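Equation 12 itself is not reproduced in this text. The following is a minimal sketch that assumes the usual Bayes form, γ(k|n) = π_k p(n|k) / Σ_j π_j p(n|j). The function name `degree_of_belonging` and the list layout (`p[k][n]` for p(n|k)) are illustrative assumptions, not taken from the patent.

```python
def degree_of_belonging(pi, p):
    """Return gamma[n][k], assuming gamma(k|n) = pi_k p(n|k) / sum_j pi_j p(n|j).

    pi : list of K cluster importances pi_k
    p  : K x N nested list with p[k][n] = p(n|k)
    (Assumed form of Equation 12; the equation image is not in the text.)
    """
    num_clusters = len(pi)
    num_nodes = len(p[0])
    gamma = []
    for n in range(num_nodes):
        joint = [pi[k] * p[k][n] for k in range(num_clusters)]
        total = sum(joint)
        # Each row is normalized so that the memberships of node n sum to 1.
        gamma.append([j / total for j in joint])
    return gamma


pi = [0.5, 0.5]
p = [[0.8, 0.2],   # p(n|k=0) for n = 0, 1
     [0.2, 0.8]]   # p(n|k=1) for n = 0, 1
gamma = degree_of_belonging(pi, p)
```

Because every row of `gamma` is normalized, each node carries a distribution over clusters, which is what makes the result a soft clustering.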

As another example, the control unit 10 may binarize the obtained degree of belonging γ(k|n) with a predetermined threshold and output the binarized values as the clustering result. In this clustering result, the node n belongs to every cluster k for which the value of γ(k|n) is equal to or greater than the threshold (the binarized value is 1). Depending on the threshold, the binarized value may be 1 for more than one cluster k for a given node n, and this can be regarded as a kind of soft clustering result.
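The thresholding described above can be sketched as follows. The function name `binarize_belonging` and the example values are illustrative assumptions.

```python
def binarize_belonging(gamma, threshold):
    """Return b[n][k] = 1 if gamma[n][k] >= threshold, else 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in gamma]


# gamma[n][k]: degree of belonging of node n to cluster k (illustrative values).
gamma = [[0.6, 0.4],
         [0.9, 0.1]]
binary = binarize_belonging(gamma, threshold=0.3)
# Node 0 is assigned to both clusters, node 1 only to cluster 0.
```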

The control unit 10 may also extract, from the clustering results for all K clusters (k = 1 to K, where K is the total number of clusters) used in the iterative calculation, only the results for several important clusters, and output these as the final clustering result. Whether a cluster is important may be determined from its importance π_k. For example, a cluster k whose final importance π_k, obtained when the iterative calculation has converged, is equal to or greater than a predetermined threshold may be extracted as an important cluster, or a cluster k whose importance π_k ranks within a predetermined number of places from the top may be extracted as an important cluster.
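Both selection rules can be sketched in a few lines. The function names and the importance values below are illustrative assumptions.

```python
def clusters_above_threshold(pi, threshold):
    """Indices k whose importance pi[k] is at or above the threshold."""
    return [k for k, value in enumerate(pi) if value >= threshold]


def top_ranked_clusters(pi, rank):
    """Indices of the `rank` clusters with the highest importance."""
    order = sorted(range(len(pi)), key=lambda k: pi[k], reverse=True)
    return sorted(order[:rank])


pi = [0.5, 0.05, 0.3, 0.15]          # final importances (illustrative)
by_threshold = clusters_above_threshold(pi, 0.1)
by_rank = top_ranked_clusters(pi, 2)
```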

In the convergence determination of step SB8, instead of the method illustrated in FIG. 4, the iterative calculation may be judged to have converged when the change in the evaluation value Q_t from one iteration to the next becomes very small (less than a threshold), in the same manner as described in JP 2013-168127 A, JP 2016-029526 A, and JP 2016-218531 A.
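A minimal sketch of this stopping rule, combined with an upper bound on the number of iterations that plays the role of G. The `step` callable stands in for one pass of steps SB4 to SB6 and is a hypothetical placeholder. The evaluation value Q_t is not defined in this section, so the toy step below simply returns a number that shrinks.

```python
def run_until_converged(step, state, tol, max_iter):
    """Repeat `step` until |Q_t - Q_(t-1)| < tol or max_iter is reached.

    step     : hypothetical callable, state -> (new_state, q)
    max_iter : upper bound on iterations (corresponds to G in the text)
    Returns the final state and the number of iterations performed.
    """
    prev_q = None
    for g in range(1, max_iter + 1):
        state, q = step(state)
        if prev_q is not None and abs(q - prev_q) < tol:
            return state, g
        prev_q = q
    return state, max_iter


def toy_step(x):
    # Toy update: halve the state and use it as the evaluation value.
    return x / 2, x / 2


final_state, iterations = run_until_converged(toy_step, 1.0, tol=0.1, max_iter=100)
```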

[Modifications]
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented in various other forms. For example, the present invention may be carried out by modifying the above-described embodiment as follows. The above-described embodiment and the following modifications may be combined with one another.

In the embodiment described above, the parameter of the weight w_nk may be obtained from the data. For example, the control unit 10 may obtain the parameter d_k² of Equation 5 according to Equation 13.

In Equation 13, C_k denotes the cluster obtained by the k-means method in step SA2, and N_k denotes the number of data points belonging to C_k. With this configuration, the parameter of the activation function of Equation 5 can be obtained from the vector data.

In the embodiment described above, the link weight is obtained by Equation 5, but the method of obtaining the link weight is not limited to that of the embodiment. For example, the link weight w_nk may be obtained by Equation 14, which is an inverse power function. In Equation 14, C is a constant with C > 0, and γ > 0.

Alternatively, a predetermined number M of feature nodes may be selected in order of increasing Euclidean distance from the data point x_n. The weight of the link between each selected node and the data point is set to a positive value, for example w_nk = 1, and the weight of the link between the data point and every other feature node is set to w_nk = 0. In the present invention, a plurality of feature points (or the feature nodes corresponding to them) must be selected, and the accuracy of clustering depends on how they are selected. Placing feature points where the data points are densely distributed is one way to improve clustering accuracy. According to this modification, the feature points are selected as the centers of regions where the data distribution is locally dense, so the accuracy of clustering improves.
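The M-nearest rule can be sketched as follows. The function name `nearest_feature_weights` and the sample coordinates are illustrative assumptions. Ties in distance are broken by feature index, a choice the text does not specify.

```python
import math


def nearest_feature_weights(data_points, feature_points, m):
    """Return w[n][k] = 1 for the m feature points nearest to x_n, else 0."""
    weights = []
    for x in data_points:
        distances = [math.dist(x, f) for f in feature_points]
        # Stable sort: equal distances keep their original index order.
        nearest = sorted(range(len(feature_points)), key=lambda k: distances[k])[:m]
        row = [0] * len(feature_points)
        for k in nearest:
            row[k] = 1
        weights.append(row)
    return weights


data_points = [(0.0, 0.0), (10.0, 10.0)]
feature_points = [(1.0, 0.0), (9.0, 10.0), (0.0, 2.0)]
w = nearest_feature_weights(data_points, feature_points, m=2)
```

Each data point ends up with exactly M links, so the bipartite network stays sparse regardless of how many feature points exist.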

In the embodiment described above, the vector data is clustered by the k-means method in step SA2, but the method of clustering the vector data in step SA2 is not limited to the k-means method. For example, the K-medoids method may be used instead. The K-medoids method is the same as the k-means method in that it divides the vector data into K clusters specified by the user, but instead of defining the center of a cluster as the mean of its data points, it chooses the representative point of each cluster from among the data points belonging to that cluster. When clustering is performed by the k-means method, the feature points do not necessarily coincide with data points. When clustering is performed by the K-medoids method, the data point closest to the center of a region where the data distribution is locally dense is selected as the feature point.

Let r_k be the representative point of the cluster k obtained by clustering with the K-medoids method, and let a feature node correspond to each of the representative points of the K clusters. The representative point is the point in the cluster for which the sum of dissimilarities to the other points in the cluster is smallest. When clustering is performed by the K-medoids method in step SA2, the link weight w_nk obtained in step SA4 is calculated by Equation 15. Equation (1) of Equation 15 is the Euclidean distance between the representative point r_k of each cluster and each data point x_n. Equation (2) of Equation 15 is an activation function that makes the weight of the link connecting the node n of the data point and the feature node r a larger positive value as the Euclidean distance becomes shorter. When the vector data is clustered by the K-medoids method, the link weight may instead be obtained by equation (3) of Equation 15, which is an inverse power function. In equation (3) of Equation 15, C is a constant with C > 0, and γ > 0. Also when the vector data is clustered by the K-medoids method, a predetermined number M of feature nodes may be selected in order of increasing distance from the data point x_n, with the weight of the link between each selected node and the data point set to a positive value, for example w_nk = 1, and the weight of the link between the data point and every other feature node set to w_nk = 0.
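The definition of the representative point can be sketched as follows, assuming Euclidean distance as the dissimilarity. The function name `representative_point` is illustrative, and the sketch covers only the choice of representative within one given cluster, not the full K-medoids iteration.

```python
import math


def representative_point(cluster_points):
    """Return the point whose summed distance to the other points is smallest."""
    def total_dissimilarity(candidate):
        # The distance from a point to itself is 0, so including it is harmless.
        return sum(math.dist(candidate, other) for other in cluster_points)

    return min(cluster_points, key=total_dissimilarity)


cluster = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
r_k = representative_point(cluster)
mean_x = sum(p[0] for p in cluster) / len(cluster)
```

Here the representative point is the data point (1.0, 0.0), whereas the mean of the same cluster lies at x = 2.0 and coincides with no data point, which illustrates the difference from the k-means center noted above.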

In step SA2, the vector data may also be clustered using a Gaussian mixture model. For Gaussian mixture models, see, for example, Chapter 9 of Bishop, C. M., Pattern Recognition and Machine Learning (Springer). Clustering vector data with a Gaussian mixture model yields K Gaussian distributions. The control unit 10 makes a feature node correspond to each of the K Gaussian distributions obtained. When the vector data is clustered using a Gaussian mixture model in step SA2, the link weight w_nk obtained in step SA4 is calculated by Equation 16. Equation 16 defines the weight of the link between the feature node k corresponding to a Gaussian distribution and the node n of a data point as the degree of contribution of the data point x_n to the cluster k. For example, the degree of contribution γ_nk is used as the link weight w_nk.
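Equation 16 is not reproduced in this text. The sketch below assumes the standard mixture-model posterior from the cited textbook chapter, γ_nk = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j). To stay dependency-free it restricts each Σ_k to a diagonal covariance, a simplification that the patent does not state. All function and variable names are illustrative.

```python
import math


def diag_gaussian_pdf(x, mean, variances):
    """Density of a Gaussian with diagonal covariance (assumed simplification)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, variances):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def contribution_weights(data_points, mix_weights, means, variances):
    """Return w[n][k] = gamma_nk, the posterior of component k given x_n."""
    result = []
    for x in data_points:
        joint = [
            pk * diag_gaussian_pdf(x, mk, vk)
            for pk, mk, vk in zip(mix_weights, means, variances)
        ]
        total = sum(joint)  # may underflow to 0 for points far from all components
        result.append([j / total for j in joint])
    return result


data_points = [(0.0, 0.0), (4.0, 4.0), (2.0, 2.0)]
w = contribution_weights(
    data_points,
    mix_weights=[0.5, 0.5],
    means=[(0.0, 0.0), (4.0, 4.0)],
    variances=[(1.0, 1.0), (1.0, 1.0)],
)
```

A point midway between two equally weighted components, such as (2.0, 2.0) here, receives equal link weights to both feature nodes.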

A Gaussian mixture model locally approximates the density of the data distribution with ellipsoids. The length of each axis of an ellipsoid is set, for example, by the user of the information processing device 1 specifying it through the operation unit 12 to suit the data. The k-means method and the K-medoids method, by contrast, assume that the local distribution of the data is spherical. Using a Gaussian mixture model requires fewer feature points, and having fewer feature points reduces the amount of computation needed for clustering.

1 ... information processing device, 10 ... control unit, 11 ... storage unit, 12 ... operation unit, 13 ... display unit, 14 ... communication unit, 101 ... acquisition unit, 102 ... first clustering unit, 103 ... generation unit, 104 ... calculation unit, 105 ... second clustering unit.

Claims

An information processing apparatus comprising:
acquisition means for acquiring vector data in which a plurality of components are represented by a vector;
first clustering means for clustering the vector data by a parametric method;
generation means for generating a bipartite network whose nodes are data points representing the vector data and feature points of the respective clusters obtained by the first clustering means;
calculation means for calculating the weights of links connecting the nodes of the data points and the nodes of the feature points; and
second clustering means for clustering the nodes by determining, according to the link weights, transition probabilities between the nodes via the links in the bipartite network and executing an iterative calculation of a stochastic process of transitions between the nodes.

The information processing apparatus according to claim 1, wherein the first clustering means performs clustering by the k-means method, and
the center of each cluster obtained by the first clustering means is used as the feature point.

The information processing apparatus according to claim 2, wherein the weight of the link is calculated by an activation function that makes the weight a larger positive value as the Euclidean distance between the feature point and the data point becomes shorter.

The information processing apparatus according to claim 3, wherein, when the data point is x_n and the feature point is m_k, the Euclidean distance is calculated by equation (1).

The information processing apparatus according to claim 1, wherein the first clustering means performs clustering by the K-medoids method, and
the representative point of each cluster obtained by the first clustering means is used as the feature point.

The information processing apparatus according to claim 1, wherein the first clustering means performs clustering by a Gaussian mixture model,
each of the plurality of Gaussian distributions is used as the feature point, and
the degree of contribution of the data point to the cluster is used as the link weight.

The information processing apparatus according to claim 6, wherein the Gaussian mixture model approximates the distribution of the vector data with ellipsoids.

The information processing apparatus according to any one of claims 1 to 7, wherein the second clustering means calculates the importance of each cluster and extracts the clusters whose importance satisfies a predetermined condition.

The information processing apparatus according to claim 1, wherein the calculation means
calculates the Euclidean distance between each feature point and the data point,
selects a predetermined number of feature-point nodes in order of increasing Euclidean distance from the data point, and
sets the link weight between each selected feature-point node and the data point to a positive value and the link weight between every other feature-point node and the data point to 0.

A program for causing a computer to function as:
acquisition means for acquiring vector data in which a plurality of components are represented by a vector;
first clustering means for clustering the vector data by a parametric method;
generation means for generating a bipartite network whose nodes are data points representing the vector data and feature points of the respective clusters obtained by the first clustering means;
calculation means for calculating the weights of links connecting the nodes of the data points and the nodes of the feature points; and
second clustering means for clustering the nodes by determining, according to the link weights, transition probabilities between the nodes via the links in the bipartite network and executing an iterative calculation of a stochastic process of transitions between the nodes.