JP6846369B2 - Programs, devices and methods for classifying unknown multidimensional vector data groups

Info

Publication number: JP6846369B2
Application number: JP2018024404A
Authority: JP
Inventors: 修平山口; 直紀関口; 栄二宇都宮
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2018-02-14
Filing date: 2018-02-14
Publication date: 2021-03-24
Anticipated expiration: 2038-02-14
Also published as: JP2019139651A

Description

The present invention relates to a technique for classifying a group of multidimensional vector data.

In recent years, as sensors for the IoT (Internet of Things) have become smaller and more accurate, servers have become able to collect large amounts of sensor data over a network. With the spread of neural networks, such big data can also be analyzed easily. Furthermore, various monitoring tasks that used to be performed manually are now automated and centrally managed by systems. For example, even in places that are difficult for people to reach (mountainous areas, infrastructure equipment such as bridges, steel towers, and fences, or wildlife monitoring sites), placing environmental sensors makes remote monitoring possible.

From the viewpoint of big data analysis, however, sensor data is not organized in a unified format. As a result, even when a neural network is used, it is difficult to ensure the reliability of the analysis results, and data sharing is slow to progress. In particular, the more diverse and voluminous the sensor data becomes, the harder it is to classify high-dimensional vector data into a normal state and an abnormal state.

Conventionally, there is a technique that acquires sensor data from sensors installed in mechanical equipment and diagnoses whether there is a sign of abnormality (see, for example, Patent Document 1). This technique learns a normal model from the time-series waveform of sensor data in a normal period, then compares sensor data from the operating period against the normal model to diagnose whether the equipment shows a sign of abnormality. The normal model consists of clusters generated from the normal-period sensor data by clustering, a statistical classification method. A cluster is a region specified by a center and a radius in a multidimensional vector space.

Patent Document 1: Japanese Unexamined Patent Publication No. 2017-033470

Non-Patent Document 1: "Introduction of a dimensional compression method using t-SNE", [online], [retrieved December 9, 2017], Internet <URL: https://blog.albert2005.co.jp/2015/12/02/tsne/>

FIG. 1 is an explanatory diagram showing a problem in the analysis of unknown big data.

As FIG. 1 shows, the technique described in Patent Document 1 likewise requires prior training on sensor data from a normal period. It has therefore been difficult to classify diverse, voluminous, and unknown high-dimensional vector data without prior training.
In addition, although that technique learns by grouping the data into clusters, the number of clusters must be specified in advance. Because the number of clusters also affects the accuracy of the learned model, it is extremely difficult to decide it in advance for unknown data.

Accordingly, an object of the present invention is to provide a program, a device, and a method capable of classifying an unknown group of multidimensional vector data without prior training.

According to the present invention, there is provided a program that causes a computer mounted on a device for classifying a group of multidimensional vector data to function as:
cluster number estimation means for estimating a number of clusters k by clustering the group of multidimensional vector data;
class classification means for inputting the multidimensional vector data into a neural network whose output layer size is the number of clusters k, and outputting a classification probability for each of the k classes; and
teacher data generation means for generating teacher data by assigning, to the multidimensional vector data, the label of the class for which the class classification means outputs the highest classification probability.

According to another embodiment of the program of the present invention, it is also preferable that the cluster number estimation means be based on DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

According to another embodiment of the program of the present invention, it is also preferable that the class classification means use a softmax function as the activation function of the output layer of the neural network and output per-class classification probabilities that sum to 1.

According to another embodiment of the program of the present invention, it is also preferable that the program further have dimension compression means, placed before the cluster number estimation means, for reducing the dimensionality of the group of multidimensional vector data, and that the group of low-dimensional vector data output from the dimension compression means be input to the cluster number estimation means.

According to another embodiment of the program of the present invention, it is also preferable that the dimension compression means be based on t-SNE (t-Distributed Stochastic Neighbor Embedding) and that the group of low-dimensional vector data be a group of two-dimensional or three-dimensional vector data.

According to another embodiment of the program of the present invention, it is also preferable that the group of multidimensional vector data be a mixture of vector data output from a plurality of sensors.

According to the present invention, there is provided a device for classifying a group of multidimensional vector data, the device having:
cluster number estimation means for estimating a number of clusters k by clustering the group of multidimensional vector data;
class classification means for inputting the multidimensional vector data into a neural network whose output layer size is the number of clusters k, and outputting a classification probability for each of the k classes; and
teacher data generation means for generating teacher data by assigning, to the multidimensional vector data, the label of the class for which the class classification means outputs the highest classification probability.

According to the present invention, there is provided a classification method for a device that receives a group of multidimensional vector data as input, wherein the device executes:
a first step of estimating a number of clusters k by clustering the group of multidimensional vector data;
a second step of inputting the multidimensional vector data into a neural network whose output layer size is the number of clusters k, and outputting a classification probability for each of the k classes; and
a third step of generating teacher data by assigning, to the multidimensional vector data, the label of the class given the highest classification probability in the second step.

According to the program, device, and method of the present invention, an unknown group of multidimensional vector data can be classified without prior training.

FIG. 1 is an explanatory diagram showing a problem in the analysis of unknown big data.
FIG. 2 is a functional configuration diagram of the analyzer according to the present invention.
FIG. 3 is an explanatory diagram showing the dimension compression unit.
FIG. 4 is an explanatory diagram showing the cluster number estimation unit and the classification unit.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 2 is a functional configuration diagram of the analyzer according to the present invention.

The analyzer 1 classifies an unknown group of multidimensional vector data. As shown in FIG. 2, the analyzer 1 has a vector data group storage unit 100, a dimension compression unit 101, a cluster number estimation unit 11, a classification unit 12, and a teacher data generation unit 13. These functional units are realized by executing a program that causes a computer mounted on the device to function. The processing flow of these functional units can also be understood as a method for classifying a group of vector data.
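The way these functional units hand data to one another can be sketched as follows. This is only an illustrative wiring, not the implementation of the analyzer: the class name `Analyzer` and the callables `estimate_k`, `classify`, and `compress` are hypothetical names introduced here, and the sketch assumes that the optional dimension compression feeds only the cluster number estimation while the original vectors are classified.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


@dataclass
class Analyzer:
    """Hypothetical wiring of the units in FIG. 2; every callable is supplied by the caller."""
    estimate_k: Callable[[List[Vector]], int]        # stands in for cluster number estimation unit 11
    classify: Callable[[Vector, int], List[float]]   # stands in for classification unit 12 (k probabilities)
    compress: Optional[Callable[[List[Vector]], List[Vector]]] = None  # optional unit 101

    def run(self, vectors: List[Vector]) -> List[Tuple[Vector, int]]:
        # Assumption: compressed vectors are used only to estimate k.
        reduced = self.compress(vectors) if self.compress else vectors
        k = self.estimate_k(reduced)
        teacher_data = []
        for v in vectors:
            probs = self.classify(v, k)
            # Teacher data generation (unit 13): label with the most probable class.
            label = max(range(k), key=lambda c: probs[c])
            teacher_data.append((v, label))
        return teacher_data
```

With stub callables, `Analyzer(estimate_k=..., classify=...).run(vectors)` returns one `(vector, label)` pair per input vector, which makes each unit easy to replace or test in isolation.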

[Vector data group storage unit 100]
The vector data group storage unit 100 stores an unknown group of multidimensional vector data. Here, "multidimensional" means that vector data output from a plurality of sensors are mixed together.
The group of multidimensional vector data can mix various kinds of vector data, for example from motion sensors (acceleration sensors, geomagnetic sensors, etc.), biological sensors (heart rate sensors, etc.), environmental sensors (temperature and humidity sensors, barometric pressure sensors, pressure sensors, etc.), optical sensors (ultrasonic sensors, infrared sensors, etc.), and acoustic sensors. According to the present invention, if such a group of vector data can be acquired as a time series, normal and abnormal times can be roughly distinguished.
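One possible way to mix vector data from several sensors is to concatenate the readings that share a timestamp. The sketch below assumes a hypothetical log layout (sensor name mapped to timestamp-indexed readings) and a hypothetical function name `mix_sensor_vectors`; the text does not specify how the sensor outputs are combined.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical input shape: sensor name -> {timestamp: reading vector}
SensorLog = Dict[str, Dict[int, Sequence[float]]]


def mix_sensor_vectors(logs: SensorLog) -> List[Tuple[int, List[float]]]:
    """Concatenate readings that share a timestamp into one multidimensional vector."""
    names = sorted(logs)  # fixed sensor order gives a fixed dimension order
    common = set.intersection(*(set(logs[n]) for n in names)) if names else set()
    mixed = []
    for t in sorted(common):  # keep the time-series order
        vector: List[float] = []
        for n in names:
            vector.extend(logs[n][t])
        mixed.append((t, vector))
    return mixed
```

Timestamps that are missing from any sensor are dropped in this sketch; interpolating them instead would be an equally plausible choice.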

[Dimension compression unit 101]
The dimension compression unit 101 is optional. It is placed before the cluster number estimation unit 11 and reduces the dimensionality of the group of multidimensional vector data. It outputs the resulting group of low-dimensional vector data to the cluster number estimation unit 11. Here, low-dimensional vector data means vector data of about two or three dimensions.

FIG. 3 is an explanatory diagram showing the dimension compression unit.

The dimension compression unit 101 is based on t-SNE (t-Distributed Stochastic Neighbor Embedding) (see, for example, Non-Patent Document 1). t-SNE expresses the "closeness" between two points as a probability distribution.

$p_{ij}$: joint distribution representing the closeness of $x_i$ and $x_j$
$p_{j|i}$: probability density with which $x_j$ is drawn under a Gaussian distribution centered at $x_i$
$\sigma_i^2$: variance of the Gaussian distribution centered at $x_i$
$p_{i|i} = 0$ (only the similarity between two different points is represented)

If the original data structure is reproduced perfectly after dimension reduction, then $p_{ij} = q_{ij}$, where $q_{ij}$ is the corresponding distribution over the low-dimensional points. t-SNE therefore reduces the dimensionality so that the error between $p_{ij}$ and $q_{ij}$ becomes as small as possible.
t-SNE uses the Kullback-Leibler divergence as the measure of distance between the distributions.

t-SNE minimizes an objective function $C$ defined with $p_{ji}$ and $q_{ji}$. Because the minimum cannot be obtained analytically, a gradient method is used.

After convergence, $Y = \{y_1, y_2, \ldots, y_n\}$ is output. Dimension compression allows a person to grasp high-dimensional data visually.
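The formulas that accompanied this passage are not reproduced in the text. As a hedged reconstruction, the standard t-SNE formulation is written out below with the symbols defined above. That the omitted formulas match this standard form is an assumption; $n$ (the number of data points) and the summation indices $k$, $l$ are introduced here for illustration. Because the joint distributions are symmetric, $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$.

```latex
\begin{align}
p_{j|i} &= \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
        \qquad p_{i|i} = 0 \\
p_{ij} &= \frac{p_{j|i} + p_{i|j}}{2n} \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},
        \qquad q_{ii} = 0 \\
C &= \mathrm{KL}(P \,\Vert\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \\
\frac{\partial C}{\partial y_i} &= 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)
        \left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}
\end{align}
```

The last line is the gradient that a gradient method would follow when updating each low-dimensional point $y_i$.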

FIG. 4 is an explanatory diagram showing the cluster number estimation unit and the classification unit.

[Cluster number estimation unit 11]
The cluster number estimation unit 11 estimates the number of clusters k by clustering the group of multidimensional vector data.
The cluster number estimation unit 11 is based on DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Because DBSCAN does not require the number of clusters to be set in advance, the group of vector data can be classified with the number of clusters treated as unknown.

DBSCAN is a density-based clustering method that uses the following two parameters.
Distance threshold: ε (Eps-neighborhood of a point)
Threshold on the number of points: minPts (a minimum number of points)
Data points are classified into the following three types.
Core point: a point with at least minPts neighboring points within radius ε
Reachable point: a point with fewer than minPts neighboring points within radius ε, but with a core point within radius ε
Outlier: a point with no neighboring points within radius ε
Clusters are formed from groups of core points, and reachable points are assigned to the clusters.
In other words, given a set of points in a space, DBSCAN groups together points in high-density regions and treats points in low-density regions as outliers.
The cluster number estimation unit 11 then uses DBSCAN to estimate the optimum number of clusters k and outputs it to the classification unit 12.
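The procedure above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used by the cluster number estimation unit 11: the function names `dbscan` and `estimate_cluster_count` are hypothetical, Euclidean distance is assumed, and the ε-neighborhood is assumed to count the point itself toward minPts. Every point that is neither a core point nor a reachable point is treated as an outlier.

```python
from math import dist
from typing import List, Optional, Sequence

NOISE = -1


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[Optional[int]]:
    """Return one label per point: 0..k-1 for clusters, -1 for outliers."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n

    def neighbors(i: int) -> List[int]:
        # Assumption: the eps-neighborhood includes the point itself.
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE        # may later be re-labelled as a reachable point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # reachable point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels


def estimate_cluster_count(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> int:
    """Number of clusters k found by DBSCAN, excluding outliers."""
    return len({c for c in dbscan(points, eps, min_pts) if c != NOISE})
```

For example, two well-separated groups of four points plus one isolated point, with `eps=1.5` and `min_pts=3`, yield two clusters and one outlier. The result depends on the choice of ε and minPts, which the text does not fix.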

[Classification unit 12]
The classification unit 12 inputs the multidimensional vector data into a neural network whose output layer size is the number of clusters k, and outputs a classification probability for each of the k classes. The number of clusters k is the value estimated by the cluster number estimation unit 11.

The classification unit 12 uses a softmax function as the activation function of the output layer of a fully connected neural network and outputs per-class classification probabilities that sum to 1.

A neural network is a mathematical model that aims to reproduce, through computer simulation, characteristics of the biological brain. The term covers models in general in which artificial neurons (units), networked through synaptic connections, change the strength of those connections through learning and thereby acquire problem-solving ability.

As shown in FIG. 4, the feed-forward convolutional neural network (CNN) consists of three layers, an input layer, a hidden layer, and an output layer, and signals propagate in one direction from the input layer to the output layer. The hidden layer may itself consist of a plurality of layers arranged as a graph. Each layer has a plurality of units (neurons), and the parameters of the function that connects units in a preceding layer to units in the following layer are called "weights". Learning consists of calculating appropriate weights as the parameters of this function.
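The one-way propagation through weighted layers can be illustrated with a small forward pass. The sketch below uses fully connected layers only, randomly initialized weights, and a ReLU activation in the hidden layer; all three are assumptions made for illustration, as are the function names `dense`, `relu`, `init_layer`, and `forward`. It returns the k values that feed the output-layer activation.

```python
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def dense(x: Sequence[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One fully connected layer: each output unit sums all weighted inputs plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v: List[float]) -> List[float]:
    # Assumed hidden-layer activation; the text does not name one.
    return [max(0.0, a) for a in v]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    # Small random weights; learning would replace these with fitted values.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x: Sequence[float], layers: List[Layer]) -> List[float]:
    """Propagate in one direction from input to output; returns k pre-activation values."""
    h = list(x)
    for idx, (weights, bias) in enumerate(layers):
        h = dense(h, weights, bias)
        if idx < len(layers) - 1:  # hidden layers only; the output activation is applied separately
            h = relu(h)
    return h
```

For a 4-dimensional input and k = 3, `forward(x, [init_layer(4, 8, rng), init_layer(8, 3, rng)])` returns three values, one per class.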

The neural network of the present invention applies a softmax function to handle the classification problem (determining which class the data belongs to). Assuming the output layer has k units in total, the m-th output $y_m$ is expressed as follows.
$y_m = \exp(x_m) / \sum_{i=1}^{k} \exp(x_i)$
$\exp(x)$: exponential function $e^x$ (e is Napier's constant, 2.7182...)
$x_m$: input signal

The output of the softmax function is connected by arrows from all of the input signals: each output neuron is influenced by every input signal $x_m$.
The outputs of the softmax function sum to 1, and this property allows them to be interpreted as probabilities. From these probabilities, the class to which the data belongs can be determined.
The class determined by the softmax function becomes the class of the unknown multi-sensor data.
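The softmax computation can be written directly from the formula. The sketch below subtracts the maximum input before exponentiating, a common numerical-stability step that is not mentioned in the text and leaves the result unchanged; the function name `softmax` is chosen here for illustration.

```python
from math import exp
from typing import List, Sequence


def softmax(x: Sequence[float]) -> List[float]:
    """y_m = exp(x_m) / sum_i exp(x_i), computed with a max-shift for stability."""
    shift = max(x)  # subtracting a constant from every input does not change the ratios
    exps = [exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are non-negative and sum to 1, so the index of the largest output can be read as the most probable class.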

However, the present invention does not use a supervised pattern recognition model such as a support vector machine (SVM) as the classification method, because the present invention performs classification as unsupervised learning.

[Teacher data generation unit 13]
The teacher data generation unit 13 generates teacher data by assigning, to each multidimensional vector, the label of the class for which the classification unit 12 outputs the highest classification probability. By accumulating the teacher data in a teacher data storage unit, it can also be applied to a supervised learning model.
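Labeling each vector with its most probable class amounts to an argmax over the classification probabilities. A minimal sketch follows, assuming the probabilities are given as one row per vector; the function name `generate_teacher_data` and the (vector, label) pair layout are illustrative choices, not taken from the text.

```python
from typing import List, Sequence, Tuple


def generate_teacher_data(
    vectors: Sequence[Sequence[float]],
    probabilities: Sequence[Sequence[float]],
) -> List[Tuple[List[float], int]]:
    """Pair each vector with the index of its most probable class."""
    if len(vectors) != len(probabilities):
        raise ValueError("one probability row is required per vector")
    teacher_data = []
    for vector, probs in zip(vectors, probabilities):
        # argmax over the class probabilities; ties go to the lowest class index
        label = max(range(len(probs)), key=probs.__getitem__)
        teacher_data.append((list(vector), label))
    return teacher_data
```

The resulting pairs have the shape that a supervised learner typically expects as training input.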

As described in detail above, according to the program, device, and method of the present invention, an unknown group of multidimensional vector data can be classified without prior training.

According to the present invention, diverse, voluminous, and unknown high-dimensional vector data can be classified in a general-purpose manner without prior training. In addition, there is no need to specify in advance the number of clusters k, which also affects the accuracy of the learned model, when clustering. This, too, makes the invention well suited to classifying unknown high-dimensional vector data.

Furthermore, because the present invention can classify an unknown group of multidimensional vector data, a wide variety of sensor data can be mixed together.
For example, motion sensors can be installed on infrastructure equipment (such as bridges, steel towers, and fences), and the resulting sensor data can be roughly classified into, for example, a normal state and an abnormal state.
Likewise, by attaching motion sensors or biological sensors to people or animals, the resulting sensor data can be classified by behavior (running, walking, standing still, and so on).
Further, by attaching environmental sensors, optical sensors, or acoustic sensors to existing mechanical equipment, the resulting sensor data can be roughly classified into, for example, a normal state and an abnormal state.

Those skilled in the art can readily make various changes, modifications, and omissions to the embodiments of the present invention described above within the scope of its technical idea and viewpoint. The foregoing description is merely an example and is not intended to be limiting. The present invention is limited only by the claims and their equivalents.

1 Analyzer
100 Vector data group storage unit
101 Dimension compression unit
11 Cluster number estimation unit
12 Classification unit
13 Teacher data generation unit

Claims

1. A program that causes a computer mounted on a device for classifying a group of multidimensional vector data to function as:
cluster number estimation means for estimating a number of clusters k by clustering the group of multidimensional vector data;
class classification means for inputting the multidimensional vector data into a neural network whose output layer size is the number of clusters k, and outputting a classification probability for each of the k classes; and
teacher data generation means for generating teacher data by assigning, to the multidimensional vector data, the label of the class for which the class classification means outputs the highest classification probability.

2. The program according to claim 1, wherein the cluster number estimation means is based on DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

3. The program according to claim 1 or 2, wherein the class classification means uses a softmax function as the activation function of the output layer of the neural network and outputs per-class classification probabilities that sum to 1.

4. The program according to any one of claims 1 to 3, further having dimension compression means, placed before the cluster number estimation means, for reducing the dimensionality of the group of multidimensional vector data,
wherein the group of low-dimensional vector data output from the dimension compression means is input to the cluster number estimation means.

5. The program according to claim 4, wherein the dimension compression means is based on t-SNE (t-Distributed Stochastic Neighbor Embedding), and
the group of low-dimensional vector data is a group of two-dimensional or three-dimensional vector data.

6. The program according to any one of claims 1 to 5, wherein the group of multidimensional vector data is a mixture of vector data output from a plurality of sensors.

7. A device for classifying a group of multidimensional vector data, the device having:
cluster number estimation means for estimating a number of clusters k by clustering the group of multidimensional vector data;
class classification means for inputting the multidimensional vector data into a neural network whose output layer size is the number of clusters k, and outputting a classification probability for each of the k classes; and
teacher data generation means for generating teacher data by assigning, to the multidimensional vector data, the label of the class for which the class classification means outputs the highest classification probability.

8. A classification method for a device that receives a group of multidimensional vector data as input, wherein the device executes:
a first step of estimating a number of clusters k by clustering the group of multidimensional vector data;
a second step of inputting the multidimensional vector data into a neural network whose output layer size is the number of clusters k, and outputting a classification probability for each of the k classes; and
a third step of generating teacher data by assigning, to the multidimensional vector data, the label of the class given the highest classification probability in the second step.