JP2020160618A

JP2020160618A - Determination device and determination method and determination program

Info

Publication number: JP2020160618A
Application number: JP2019057293A
Authority: JP
Inventors: Yumimasa Igarashi (五十嵐 弓将)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2019-03-25
Filing date: 2019-03-25
Publication date: 2020-10-01
Also published as: WO2020195771A1

Abstract

PROBLEM TO BE SOLVED: To enable anomaly determination that combines the properties of unsupervised and semi-supervised learning without having to consider predetermined preconditions on the input data used for training an SOM. SOLUTION: A determination device 10 has a learning part 122 and a determination part 123. The learning part 122 inputs the input data into an SOM that has undergone unsupervised training, obtains outlier thresholds from the distribution of the quantization errors between the representative vectors of the SOM output layer and the input data and from the distribution of the neighborhood distances of the representative vectors, and generates a plurality of thresholds by combining the obtained thresholds. The determination part 123 inputs the vector data to be determined into the SOM, calculates its quantization error, and compares the quantization error with at least one of the plurality of thresholds to determine whether the vector data is anomalous. SELECTED DRAWING: Figure 5

Description

The present invention relates to a determination device, a determination method, and a determination program.

A self-organizing map (SOM) is one of the machine-learning clustering methods used in the technical field of data analysis (see, for example, Non-Patent Document 1). Clustering is also called cluster analysis.

Here, clustering is one of the methods for analyzing multivariate data, that is, data composed of a plurality of variables. A multivariate observation is generally represented as a multidimensional vector and is hereinafter referred to as a vector. Clustering classifies vectors into several groups, that is, clusters, using only the features of the vector values, without assigning labels in advance and treating the type of each vector as unknown. A label is information used as a reference for classifying the multivariate data to be analyzed.

A machine learning method that classifies data in this way, without prior knowledge, is called unsupervised learning. In contrast, a method in which the data are labeled in advance according to an external criterion and a classification rule is then learned so as to conform to the classification indicated by the labels is called supervised learning.

An SOM is a type of neural network that performs unsupervised learning. In general, a neural network is a computational model that simulates on a computer some of the functional characteristics found in the human brain.

An SOM is a two-layer neural network consisting of an input layer and an output layer. It nonlinearly maps input data, which are multivariate n-dimensional vectors, onto the output layer, a model with a low-dimensional (usually two-dimensional) lattice structure, while performing vector quantization.

Here, vector quantization means grouping a plurality of different input data and replacing each group with a single vector that serves as its representative value. The number of output vectors produced by quantization is smaller than the number of input vectors. That is, if the input data consist of N n-dimensional vectors, the output consists of K n-dimensional vectors (K < N).

The procedure of vector quantization is as follows. First, the N input vectors are classified into K clusters. Next, each of the K clusters is assigned one vector that represents it, called a "representative vector". Various methods exist for determining the representative vectors, and the SOM is one of them. Vector quantization is thus performed by replacing the N input vectors with K representative vectors.
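The replacement step of this procedure can be sketched as follows. This is only an illustration under stated assumptions: the representative vectors (the `codebook`) are taken as already determined, and the function name `quantize` is hypothetical rather than taken from the specification.

```python
import math


def quantize(inputs, codebook):
    """Map each input vector to the index of its nearest representative vector.

    inputs   -- list of N vectors (each a list of n floats)
    codebook -- list of K representative vectors (each a list of n floats)
    Returns a list of N indices into the codebook.
    """
    assignments = []
    for x in inputs:
        # Nearest representative vector by Euclidean distance.
        nearest = min(range(len(codebook)),
                      key=lambda k: math.dist(x, codebook[k]))
        assignments.append(nearest)
    return assignments


inputs = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [4.8, 5.1]]  # N = 4
codebook = [[0.0, 0.0], [5.0, 5.0]]                        # K = 2
print(quantize(inputs, codebook))
```

Here four input vectors are replaced by references to two representative vectors, which is the K < N reduction described above.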

In an SOM, the representative vectors output by vector quantization are placed at the lattice points of the output layer. In a typical two-dimensional SOM, the output layer is a two-dimensional lattice of K lattice points. For example, in two-dimensional Cartesian coordinates with an x-axis and a y-axis, if the number of lattice points along the x-axis is X and the number along the y-axis is Y, the number of lattice points in the output layer is K = X × Y.

Furthermore, in an SOM, representative vectors with similar features are placed at lattice points (hereinafter referred to as nodes) that are close to one another on the output layer. The Euclidean distance is normally used as the distance between vectors. Because of this property, grouping representative vectors with similar features, that is, representative vectors separated by short relative Euclidean distances, divides the output layer of the SOM into several clusters.

As in machine learning generally, an SOM has two processing phases: a learning phase and a mapping phase. In the learning phase, the output layer is constructed by repeating competitive learning while performing vector quantization on the input data. In this phase, the input data are used to build the model of the two-layer neural network consisting of the input layer and the output layer. In the mapping phase, the model built in the learning phase is used to map input data from the input layer to the output layer. Each input datum is mapped to one representative vector on the output layer and is classified according to the cluster to which that representative vector belongs.
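The two phases can be illustrated with a deliberately small sketch. It is not the implementation of the specification: it assumes a rectangular, non-wrapping lattice, a Gaussian neighborhood function, and linearly decaying learning rate and radius, and the names `train_som` and `bmu` are hypothetical.

```python
import math
import random


def bmu(weights, x):
    """Mapping phase: index of the representative vector closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))


def train_som(data, X, Y, epochs=20, lr0=0.5, seed=0):
    """Learning phase: competitive learning on an X-by-Y rectangular lattice."""
    rng = random.Random(seed)
    K = X * Y
    # Initialise each representative vector with a randomly chosen input.
    weights = [list(rng.choice(data)) for _ in range(K)]
    sigma0 = max(X, Y) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)   # decaying neighborhood radius
        for x in rng.sample(data, len(data)):     # one shuffled pass over the data
            c = bmu(weights, x)                   # winning node
            cx, cy = c % X, c // X
            for j in range(K):
                # Squared lattice distance between node j and the winner.
                g2 = (j % X - cx) ** 2 + (j // X - cy) ** 2
                influence = lr * math.exp(-g2 / (2.0 * sigma ** 2))
                # Pull the representative vector of node j toward x.
                weights[j] = [w + influence * (v - w)
                              for w, v in zip(weights[j], x)]
    return weights


data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
weights = train_som(data, X=3, Y=2, epochs=5)
print(bmu(weights, [1.0, 1.0]))  # node index that this input is mapped to
```

The winner and its lattice neighbors are updated together, which is what places similar representative vectors at nearby nodes.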

Next, anomaly determination methods using an SOM are described. They fall broadly into two types: anomaly determination by unsupervised learning and anomaly determination by semi-supervised learning (see, for example, Non-Patent Document 2).

The first method is anomaly determination by unsupervised learning. This method requires that all of the data to be clustered, that is, the input data, be available when learning starts. It also requires the assumption that the majority of the input data are normal. Under these conditions and assumptions, the first method uses clustering to extract from the input data those items that appear to match hardly any of the majority.

In anomaly determination by unsupervised learning using an SOM, an SOM model is first built from the input data in the learning phase.

Next, clustering is performed on the representative vectors of the output layer of the learned model. From the resulting clusters of representative vectors, the set of clusters to which the majority of the input data are mapped is identified. This step identifies the clusters that correspond to "normal".

Then, among the remaining clusters, the method identifies the cluster to which only a small number of input data are mapped and whose relative distance from the set of majority clusters is the greatest. That cluster of representative vectors is regarded as "abnormal". This step identifies the cluster that corresponds to "abnormal".

Finally, by identifying the input data mapped to the cluster classified as "abnormal", the data judged to be abnormal can be extracted from the input data (see, for example, Non-Patent Document 3).

The second method is anomaly determination by semi-supervised learning. This method requires input data corresponding to normal behavior that have been labeled in advance according to an external criterion. It is called semi-supervised learning because it learns from only one half of the normal/abnormal pair, that is, from only partially labeled data. Under this condition, the method machine-learns only the input data corresponding to normal, which in turn makes it possible to identify abnormal data that do not match the normal data.

Specifically, in the learning phase, the learner learns to determine, from the input data, a multidimensional boundary surface enclosing the region presumed to be normal. In the determination phase, new data to be judged are input to the trained learner. If the learner determines that the new input lies inside the boundary of the normal region, the input is judged normal; if it lies outside the boundary, the input is judged abnormal.

In anomaly determination by semi-supervised learning using an SOM, as in the first method, an SOM model is built from the input data in the learning phase. All of the input data correspond to normal behavior, and through learning each input datum comes to be mapped to one node of the output layer. In an SOM, every input datum is always mapped to exactly one representative vector, but the converse does not hold: the output layer contains nodes whose representative vectors have no input data mapped to them.

Consequently, the output layer of the trained SOM contains both nodes to which normal input data are mapped and nodes to which nothing is mapped. This distribution makes it possible to determine the boundary of the region presumed to be normal on the output layer.

To identify abnormal input data in the mapping phase of the SOM with the existing method, the quantization error between the input data and the representative vector of the output layer to which the data are mapped, that is, the Euclidean distance between the two vectors, is calculated. If this value is larger than a separately preset threshold, the input data are judged to lie outside the normal region and are determined to be abnormal. The input data can also be determined to be abnormal when they are mapped to a node to which no normal input data were mapped during learning (see, for example, Non-Patent Document 4).
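The two checks of this existing method can be sketched as follows. The sketch assumes a trained set of representative vectors (`codebook`), a manually chosen `threshold`, and a set `hit_nodes` of node indices that received at least one normal training input; all of these names are hypothetical.

```python
import math


def quantization_error(x, codebook):
    """Return (node index, Euclidean distance) for the node that x is mapped to."""
    j = min(range(len(codebook)), key=lambda k: math.dist(x, codebook[k]))
    return j, math.dist(x, codebook[j])


def is_anomalous(x, codebook, threshold, hit_nodes):
    """Judge x abnormal if its quantization error exceeds the threshold,
    or if it is mapped to a node that no normal training input was mapped to."""
    j, qe = quantization_error(x, codebook)
    return qe > threshold or j not in hit_nodes


codebook = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
hit_nodes = {0, 1}   # node 2 received no normal input during learning
print(is_anomalous([0.1, 0.1], codebook, 0.5, hit_nodes))    # small error, hit node
print(is_anomalous([3.0, 3.0], codebook, 0.5, hit_nodes))    # large quantization error
print(is_anomalous([10.0, 10.1], codebook, 0.5, hit_nodes))  # mapped to an empty node
```

The value 0.5 is arbitrary, which reflects the difficulty discussed below: in this existing method a person has to supply the threshold.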

Non-Patent Document 1: T. Kohonen, Self-Organizing Maps (Japanese edition: 自己組織化マップ), supervised by Masaaki Ohkita, Maruzen Publishing, revised edition (2016/2/1), ISBN-10: 4621065513, ISBN-13: 978-4621065518, pp. 60-62, 111-117.
Non-Patent Document 2: Victoria J. Hodge and Jim Austin, "A Survey of Outlier Detection Methodologies", Artificial Intelligence Review, vol. 22, 2004.
Non-Patent Document 3: M. Almi'ani, A. A. Ghazleh, A. Al-Rahayfeh and A. Razaque, "Intelligent Intrusion Detection System Using Clustered Self Organized Map", 2018 Fifth International Conference on Software Defined Systems (SDS), Barcelona, 2018, pp. 138-144.
Non-Patent Document 4: Albert J. Hoglund, Kimmo Hatonen, and Antti S. Sorvari, "A Computer Host-Based User Anomaly Detection System Using the Self-Organizing Map", Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, Vol. 5, IEEE, 2000.

As described above, anomaly determination methods using an SOM fall broadly into two types, anomaly determination by unsupervised learning and anomaly determination by semi-supervised learning, and the two differ in how they classify data into normal and abnormal. Each of the existing methods has the following problems.

The first problem is as follows. Anomaly determination by unsupervised learning requires that all input data be available when learning starts. If new input data are generated after learning, learning must therefore be redone using all of the input data. In anomaly determination for communication networks in particular, the communication data to be judged are generated dynamically over time and the input data keep growing, so with this method learning can never be completed.

In addition, anomaly determination by unsupervised learning requires the assumption that the majority of the input data are normal. In practice, however, it is generally difficult to confirm in advance how the types of input data are distributed without labeling them beforehand according to an external criterion, as is done in supervised learning. Unless a tendency such as "the majority of the input data are normal" can be established separately, for example by empirical analysis, the accuracy of anomaly determination by this method will not be very high.

Furthermore, the determination mechanism itself has a weakness: if the input data contain roughly equal numbers of normal and abnormal items, the cluster that accounts for the majority of the input data cannot be identified, and anomaly determination cannot be performed.

The second problem is as follows. Anomaly determination by semi-supervised learning does not require all input data to be available when learning starts, but all of the data used for learning must correspond to normal behavior. The input data must therefore be classified in advance according to an external criterion, separately from the learner. When there is no clear external criterion for separating normal from abnormal, data corresponding to anomalies may be mixed into the input data for learning. If the precondition cannot be fully satisfied in this way, abnormal data are learned as if they were normal, and the determination accuracy falls.

Furthermore, anomaly determination by semi-supervised learning decides between normal and abnormal by comparing the quantization error, between the input data and the representative vector of the output layer to which the data are mapped, with a preset threshold. A person must therefore decide on a threshold by some means and supply it before any determination can be made.

Because this threshold applies to the quantization error of multivariate input vectors, an appropriate value usually cannot be determined without information about the SOM output layer obtained as a result of learning. A person must therefore adjust the threshold by trial and error to find a suitable value, and such adjustment is difficult without prior knowledge of the SOM output layer and of anomaly determination. Moreover, the threshold is a continuous real number that can be set arbitrarily, so it is not possible to judge which threshold is optimal or whether the results of anomaly determination using that threshold are valid.

To summarize, the first cause of the first problem is that anomaly determination by unsupervised learning cannot handle input data that grow over time, such as communication data. The second cause of the first problem is that, for dynamically generated input data, there is no means of confirming in advance the distribution of normal and abnormal items in the input data.

The first cause of the second problem is that, in anomaly determination by semi-supervised learning, there is no means of distinguishing data corresponding to anomalies once they have been mixed into the input data for learning. The second cause of the second problem is that there is no means of automatically adjusting and setting the threshold used as the anomaly criterion without human intervention.

The present invention has been made in view of the above, and its object is to provide a determination device, a determination method, and a determination program that make it possible to perform anomaly determination combining the properties of both unsupervised learning and semi-supervised learning, without having to consider predetermined preconditions on the input data used for learning with an SOM.

To solve the above problems and achieve the object, the determination device of the present invention comprises: an acquisition unit that performs statistical processing on copied communication data or on numerical data about communication and acquires vector data of multivariate variables; a learning unit that performs unsupervised learning with an SOM using the vector data to be learned as input data, then inputs the input data again into the trained SOM, calculates the quantization errors between the representative vectors of the SOM output layer and the input data as well as the neighborhood distances of the representative vectors, obtains outlier thresholds from the distributions of the quantization errors and of the neighborhood distances, and generates a plurality of thresholds by combining the obtained thresholds; and a determination unit that inputs the vector data to be determined into the SOM, calculates the quantization error between the vector data to be determined and the representative vector of the output layer to which the data are mapped, and compares the calculated quantization error with at least one of the plurality of thresholds to determine whether the vector data to be determined are abnormal.

According to the present invention, anomaly determination combining the properties of both unsupervised learning and semi-supervised learning can be performed without having to consider predetermined preconditions on the input data used for learning with an SOM.

FIG. 1 is a diagram showing an example of the output layer of an SOM.
FIG. 2 is a diagram showing the positional relationship between the j-th representative vector and the neighboring representative vectors around it.
FIG. 3 is a diagram showing an example of a U-matrix.
FIG. 4 is a diagram illustrating a boxplot.
FIG. 5 is a diagram showing an example of the configuration of a communication system according to the embodiment.
FIG. 6 is a diagram illustrating the flow of processing in the communication system shown in FIG. 5.
FIG. 7 is a flowchart showing the procedure of the learning process according to the embodiment.
FIG. 8 is a flowchart showing the procedure of the determination process according to the embodiment.
FIG. 9 is a diagram showing an example of a computer that realizes the determination device by executing a program.

Hereinafter, an embodiment of the present invention is described in detail with reference to the drawings. The present invention is not limited to this embodiment. In the drawings, identical parts are denoted by identical reference numerals.

[Embodiment]
In the learning phase, the determination device according to the present embodiment performs unsupervised learning with an SOM, then calculates the quantization errors between the representative vectors of the SOM output layer and the input data, and the neighborhood distances of the representative vectors in the SOM output layer, and computes a conventional outlier threshold from the distribution of each of these values. Still in the learning phase, the determination device combines the two thresholds thus obtained to generate a plurality of thresholds for anomaly determination that has the properties of both unsupervised learning and semi-supervised learning.

Then, in the mapping phase, the determination device according to the present embodiment inputs the data to be determined into the SOM, calculates the quantization error of those data, and determines whether the input data are abnormal by comparing the calculated quantization error with at least one of the plurality of thresholds.

In this way, the determination device according to the present embodiment generates a plurality of thresholds for anomaly determination that has the properties of both unsupervised learning and semi-supervised learning. By comparing the quantization error of the input data with one of these thresholds, the determination device can perform anomaly determination that combines unsupervised learning and semi-supervised learning.
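As a rough illustration of deriving a threshold from a distribution, the sketch below uses the boxplot upper fence, Q3 + 1.5 × (Q3 − Q1). This is an assumption: FIG. 4 refers to a boxplot, but the outlier rule and the way the two thresholds are combined are not given in this part of the text, so the function names, the factor 1.5, and the side-by-side return value are all hypothetical.

```python
import statistics


def upper_fence(values, k=1.5):
    """Boxplot-style outlier threshold: Q3 + k * (Q3 - Q1)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 + k * (q3 - q1)


def learn_thresholds(quantization_errors, neighborhood_distances):
    """Return one outlier threshold per distribution.

    The rule for combining the two values into the final set of thresholds
    is not specified in this section, so they are returned side by side.
    """
    return {
        "quantization_error": upper_fence(quantization_errors),
        "neighborhood_distance": upper_fence(neighborhood_distances),
    }


errors = [1, 2, 3, 4, 5, 6, 7, 100]      # quantization errors of the training inputs
distances = [0.1, 0.2, 0.3, 0.4]         # neighborhood distances of the nodes
print(learn_thresholds(errors, distances))
```

Because the thresholds are computed from the trained map itself, no manually supplied value is needed in this sketch.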

[Mathematical background]
First, the mathematical background needed for the following description is explained. FIG. 1 is a diagram showing an example of the output layer of an SOM. Each hexagon in FIG. 1 represents one representative vector. The representative vectors are arranged in a hexagonal lattice in a two-dimensional space, and the lattice as a whole forms a rectangle. A square lattice can also be chosen as the lattice shape.

When the representative vectors are arranged in a hexagonal lattice, each lattice point is surrounded by six equidistant lattice points. In the example of FIG. 1, there are 13 lattice points along the horizontal axis (X = 13) and 13 along the vertical axis (Y = 13), giving K = 169 lattice points in total. Counting from the lower left of the lattice as the first point toward the upper right, the j-th representative vector is written m_j (j = 1, 2, ..., K), where K = X × Y.

In the present embodiment, in both the learning phase and the mapping phase of the SOM, the lattice is restricted to a toroidal structure in which the top and bottom boundaries, and the left and right boundaries, wrap around. That is, learning and mapping are performed with the top side of the rectangular lattice adjacent to the bottom side and the left side adjacent to the right side. Under this condition, every lattice point always has exactly six neighboring lattice points.
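One way to enumerate the six neighbors on such a wrapped hexagonal lattice is sketched below. The layout is an assumption: nodes are indexed from 0 in row-major order starting at the lower left, odd rows are shifted half a cell to the right, and the function name `hex_neighbors` is hypothetical.

```python
def hex_neighbors(j, X, Y):
    """Return the six neighbor indices of node j (0-based, row-major) on an
    X-by-Y hexagonal lattice whose edges wrap around (toroidal structure).

    Assumes an offset layout in which odd rows are shifted half a cell to
    the right. With this layout the vertical wrap-around is geometrically
    exact only for an even number of rows; for an odd Y (such as 13) the
    seam between the top and bottom rows is an approximation.
    """
    col, row = j % X, j // X
    if row % 2 == 0:
        offsets = [(1, 0), (-1, 0), (0, -1), (-1, -1), (0, 1), (-1, 1)]
    else:
        offsets = [(1, 0), (-1, 0), (1, -1), (0, -1), (1, 1), (0, 1)]
    # The modulo operations implement the wrap-around in both directions.
    return [((row + dr) % Y) * X + (col + dc) % X for dc, dr in offsets]


# Corner node of a 13 x 13 lattice: it still has six neighbors.
print(hex_neighbors(0, 13, 13))
```

Without the wrap-around, nodes on the edges and corners would have fewer than six neighbors, so averages over the neighbors would not be comparable across the map.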

FIG. 1 shows an example of the result of mapping input data onto the output layer using the SOM built in the learning phase. The number and color in each hexagon indicate how many input data items were mapped there; the more items mapped, the redder the hexagon. Representative vectors to which no input data was mapped are shown in gray.

The quantization error d_i (i = 1, 2, ..., N) between an input data item and the representative vector of the output layer to which that item is mapped is given by equation (1).
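Equation (1) is not reproduced in this text. A minimal sketch, assuming the usual SOM definition (the Euclidean distance from each input vector to its nearest, or best-matching, representative vector), is given here. The function name and array layout are illustrative assumptions.

```python
import numpy as np

def quantization_errors(data, codebook):
    """Map each input vector to its nearest representative vector.

    data:     (N, dim) array of input vectors x_i
    codebook: (K, dim) array of representative vectors m_j
    Returns (bmu, d): index of the mapped representative vector and the
    Euclidean quantization error d_i for every input vector.
    """
    # Pairwise distances between every input and every representative vector.
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    bmu = dists.argmin(axis=1)               # mapped representative vector
    d = dists[np.arange(len(data)), bmu]     # distance to that vector
    return bmu, d
```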

FIG. 2 shows the positional relationship between the j-th representative vector and the neighboring representative vectors around it. In FIG. 2, the neighborhood distance h_j (j = 1, 2, ..., K) between the central representative vector and the surrounding representative vectors is given by equation (2).

In a hexagonal lattice there are six neighboring representative vectors, so the neighborhood distance is obtained by computing the Euclidean distance between the central representative vector and each of the six neighbors and taking the average of those distances.
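The averaging described above can be sketched as follows. The sketch assumes the neighbor indices of every lattice point are already available as an integer array `neighbor_idx` (a hypothetical input; for a hexagonal lattice it would have six columns), so it does not depend on any particular lattice layout.

```python
import numpy as np

def neighborhood_distances(codebook, neighbor_idx):
    """Average Euclidean distance from each representative vector to its neighbours.

    codebook:     (K, dim) array of representative vectors m_j
    neighbor_idx: (K, n_neighbors) integer array; row j lists the indices
                  of the lattice neighbours of m_j (six for a hex lattice)
    Returns h:    (K,) array of neighbourhood distances h_j
    """
    # codebook[neighbor_idx] has shape (K, n_neighbors, dim).
    diffs = codebook[:, None, :] - codebook[neighbor_idx]
    return np.linalg.norm(diffs, axis=2).mean(axis=1)
```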

FIG. 3 shows an example of a U-matrix. The U-matrix is a diagram in which the neighborhood distance h_j (j = 1, 2, ..., K), calculated as described above for every lattice point, is visualized in grayscale according to its magnitude. The number in each hexagon is the value of h_j.

A representative vector with a large h_j, shown dark in the U-matrix, is far from its neighboring representative vectors. Conversely, a representative vector with a small h_j, shown light, is close to its neighboring representative vectors.

In a SOM, representative vectors with similar features are placed at nearby lattice points on the output layer. A group of adjacent representative vectors with small h_j (a contiguous light region) can therefore be regarded as forming a cluster of representative vectors. In contrast, adjacent representative vectors with large h_j are relatively far from their neighbors and can be regarded as isolated from the set of all representative vectors in the output layer.

Therefore, in an anomaly determination method based on unsupervised learning, an input vector mapped to a representative vector with a large h_j can be regarded as belonging to an anomalous cluster that lies relatively far from the set of clusters to which the majority of the input data is mapped.

FIG. 4 illustrates a box plot. The box plot is one of the simplest existing outlier detection methods. Given a distribution of data, it computes the boundary thresholds for outlier detection from summary statistics known as the five-number summary. The five-number summary of a data distribution consists of five values: Q_0/4, Q_1/4, Q_2/4, Q_3/4, and Q_4/4.

Q_0/4 is the minimum. Q_1/4 is the first quartile (lower quartile): the value located one quarter of the way through the data when sorted in ascending order. Q_2/4 is the median (second quartile): the value located halfway through the sorted data. Q_3/4 is the third quartile (upper quartile): the value located three quarters of the way through the sorted data. Q_4/4 is the maximum.

When outliers are handled with a box plot, the upper and lower limits of the whiskers are calculated as in equations (3) to (5). IQR denotes the interquartile range.

The upper whisker end is the largest value in the given data that is smaller than the upper limit of the whiskers. The lower whisker end is the smallest value in the given data that is larger than the lower limit of the whiskers.
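Equations (3) to (5) are not reproduced in this text. The sketch below assumes the conventional box-plot rule, IQR = Q_3/4 − Q_1/4 and upper limit = Q_3/4 + 1.5 × IQR; the multiplier 1.5 and the function name `upper_whisker` are assumptions, not values confirmed by the source. Quartiles are taken with NumPy's default linear interpolation, which may differ slightly from the "sorted position" definition given above.

```python
import numpy as np

def upper_whisker(values, k=1.5):
    """Upper whisker end of a box plot (conventional rule, k = 1.5 assumed)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    upper_limit = q3 + k * (q3 - q1)          # upper limit of the whisker
    # Largest observation not exceeding the limit. "<=" is used so that
    # constant data (IQR = 0) still yields a value; the text says "smaller than".
    return values[values <= upper_limit].max()
```

With the data 1, 2, ..., 8, 100, the quartiles are 3 and 7, the upper limit is 13, and the upper whisker end is 8; the value 100 is treated as an outlier.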

The determination device according to the present embodiment applies the box plot to the distribution of the quantization errors d_i between the representative vectors of the SOM output layer and the input data, and to the distribution of the neighborhood distances h_j of the representative vectors, and computes the whisker end of each. Because d_i and h_j are distances and never take negative values, their minimum is 0. In the present embodiment, only the upper whisker end is computed.

From the above, the maximum value and the upper whisker end are obtained for the distributions of the quantization error d_i and the neighborhood distance h_j. In the following description, these are denoted Dmax (maximum of the quantization error d_i), Dwhi (upper whisker end of the quantization error d_i), and Hwhi (upper whisker end of the neighborhood distance h_j).

The determination device according to the present embodiment combines and sums these values to compute four thresholds, T1 to T4, from which the anomaly determination method is selected. The thresholds T1 to T4 are given by equations (6) to (9).

Threshold T1 is Dmax, the maximum quantization error d_i of the mapping obtained by feeding the input data used for training back into the trained SOM. Consequently, when the training input data is mapped, its quantization error d_i is at most Dmax. If all training input data is assumed to be normal, T1 corresponds to the maximum distance from a mapped representative vector to the boundary of the region presumed to be normal. T1 can therefore be regarded as one threshold on the quantization error d_i for anomaly determination by semi-supervised learning.

Threshold T2 is Hwhi, the upper whisker end of the neighborhood distances h_j of the trained SOM. If a neighborhood distance h_j is larger than Hwhi, it is an outlier in the distribution of h_j, and the corresponding representative vector can be regarded as isolated from the set of all representative vectors in the output layer. T2 can therefore be regarded as one threshold on the neighborhood distance h_j of the cluster to which input data is mapped, in an anomaly determination method based on unsupervised learning.

Threshold T3 is the sum of T1 and T2. In other words, T3 combines the threshold for anomaly determination on the quantization error by semi-supervised learning with the threshold for anomaly determination on the representative vector by unsupervised learning.

That is, using T3, the sum of T1 and T2, makes it possible to perform threshold determination that includes an element of unsupervised anomaly determination even when it cannot be fully assumed that all training input data is normal. T3 combines the thresholds used in semi-supervised and unsupervised anomaly determination. Because it uses Dmax, the maximum of the quantization error d_i, T3 is the anomaly threshold closer to the semi-supervised threshold (T1), which assumes that all training input data is normal.

Threshold T4 is obtained by summing Hwhi with Dwhi, the upper whisker end of the quantization error d_i, in place of the maximum Dmax used in T3. With T4, anomalies in the quantization error can be determined simply from the outliers of the distribution of d_i, without assuming, as T3 does, that all training input data is normal.

That is, T4 combines the unsupervised thresholds for anomaly determination on the quantization error and on the representative vector. Like T3, T4 allows threshold determination that includes an element of unsupervised anomaly determination even when it cannot be fully assumed that all training input data is normal. Because T4 uses the upper whisker end of d_i as a threshold, it tends more strongly to classify quantization errors as outliers depending on the skew of the training input distribution, and is therefore the anomaly threshold closer to the unsupervised threshold (T2).
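Equations (6) to (9) are not reproduced in this text, but the description implies T1 = Dmax, T2 = Hwhi, T3 = Dmax + Hwhi, and T4 = Dwhi + Hwhi. A minimal sketch under that reading follows. The whisker rule (Q_3/4 + 1.5 × IQR) and all function names are assumptions made for illustration.

```python
import numpy as np

def _upper_whisker(values, k=1.5):
    # Conventional box-plot rule; the factor 1.5 is an assumption.
    q1, q3 = np.percentile(values, [25, 75])
    return values[values <= q3 + k * (q3 - q1)].max()

def compute_thresholds(d, h):
    """Build T1..T4 from quantization errors d_i and neighbourhood distances h_j."""
    d = np.asarray(d, dtype=float)
    h = np.asarray(h, dtype=float)
    d_max = d.max()              # Dmax
    d_whi = _upper_whisker(d)    # Dwhi
    h_whi = _upper_whisker(h)    # Hwhi
    return {
        "T1": d_max,             # semi-supervised: largest training error
        "T2": h_whi,             # unsupervised: outlier bound on h_j
        "T3": d_max + h_whi,     # T1 + T2
        "T4": d_whi + h_whi,     # Dwhi replaces Dmax
    }
```

Since Dwhi can never exceed Dmax, T4 is never larger than T3, which matches the description of T4 as the stricter, more unsupervised-leaning threshold.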

In the mapping phase, the determination device according to the present embodiment inputs new input data x', which is to be judged normal or anomalous, into the trained SOM and computes its quantization error d' using equation (10).

In anomaly determination by semi-supervised learning, if all training input data is assumed to be normal, a node whose representative vector has no training data mapped to it can be judged anomalous. Therefore, when the input data x' is mapped to a representative vector to which no input data was mapped in the learning phase (that is, whose mapping count is 0), the determination device adds the neighborhood distance of that representative vector to the quantization error of x' itself. Anomaly determination combining unsupervised and semi-supervised learning is then performed by comparing d' with one of the thresholds T1 to T4.

Specifically, the determination device according to the present embodiment determines whether the data x' is normal as follows. First, it selects one threshold for determination from T1 to T4; the selected threshold is denoted T'. It then computes the quantization error d' from the data x', compares it with T', and makes the determination according to equation (11).
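Equations (10) and (11) are not reproduced in this text. The sketch below follows the surrounding description: d' is the distance from x' to its mapped representative vector, with that vector's neighborhood distance added when no training data was mapped to it. Treating "d' > T'" as anomalous (rather than "≥") is an assumption, as are the function and parameter names.

```python
import numpy as np

def judge(x_new, codebook, hit_counts, h, threshold):
    """Return (d_new, is_anomalous) for one vector x'.

    codebook:   (K, dim) representative vectors of the trained SOM
    hit_counts: (K,) number of training vectors mapped to each unit
    h:          (K,) neighbourhood distances h_j
    threshold:  the selected threshold T' (one of T1..T4)
    """
    dists = np.linalg.norm(codebook - np.asarray(x_new, dtype=float), axis=1)
    c = int(dists.argmin())          # unit that x' is mapped to
    d_new = float(dists[c])
    if hit_counts[c] == 0:
        # No training data reached this unit: add its neighbourhood distance.
        d_new += float(h[c])
    # Assumed reading of equation (11): anomalous when d' exceeds T'.
    return d_new, d_new > threshold
```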

[Communication system configuration]
Next, the configuration of the determination device according to the embodiment is described. FIG. 5 shows an example of the configuration of the communication system in the embodiment. As shown in FIG. 5, the communication system 1 includes a network device 3 connected to a plurality of computer terminals 2, and a determination device 10 connected to the network device 3.

The network device 3 is one of the network devices that make up a computer network. It copies, unchanged, the data of communications performed by the computer terminals 2 and other devices connected to it, and outputs the copied communication data from a separate port. Alternatively, it converts the communications of the connected computer terminals 2 and devices into numerical values by counting them, and outputs that numerical data.

The determination device 10 performs SOM-based anomaly determination using, as input data, the copied communication data or numerical communication data output by the network device 3. In the learning phase, after unsupervised training of the SOM, the determination device 10 computes the quantization errors of the input data and the neighborhood distances of the representative vectors in the SOM output layer, and computes outlier thresholds from the distributions of these values using an existing outlier detection method. It then combines the two resulting thresholds to generate four thresholds for anomaly determination that have the properties of unsupervised and semi-supervised learning. In the mapping phase, the determination device 10 computes the quantization error of the data to be judged and determines whether the input data is anomalous by comparing that error with the four thresholds.

[Configuration of the determination device]
Next, the configuration of the determination device 10 is described with reference to FIG. 5. As shown in FIG. 5, the determination device 10 includes a communication unit 11, a control unit 12, a storage unit 13, an input unit 14, and an output unit 15.

The communication unit 11 is a communication interface that sends and receives various information to and from other devices connected via a network or the like. It is implemented by a NIC (Network Interface Card) or the like, and handles communication between the control unit 12 (described later) and other devices over a telecommunication line such as a LAN (Local Area Network) or the Internet. For example, the communication unit 11 receives from the network device 3, via the network, the communication data copied by the network device 3 or the numerical communication data. The communication unit 11 also transmits the result of the SOM-based determination by the determination unit 123 (described later) to, for example, another device that takes countermeasures.

The control unit 12 controls the determination device 10 as a whole. The control unit 12 is, for example, an electronic circuit such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The control unit 12 has internal memory for storing programs that define various processing procedures and control data, and executes each process using that memory. The control unit 12 functions as various processing units when the corresponding programs run.

The control unit 12 includes a statistical processing unit 121 (acquisition unit), a learning unit 122, and a determination unit 123. The processing executed by the functional units of the control unit 12 is broadly divided into a learning phase, in which the SOM model 132 (described later) is trained and the thresholds are calculated, and a mapping phase, in which the data to be judged is input into the SOM model for anomaly determination. The learning unit 122 is the functional unit that performs the learning phase. The determination unit 123 is the functional unit that performs the mapping phase.

The statistical processing unit 121 acquires the copied communication data or numerical communication data received from the network device 3. It applies statistical processing to this data to form vectors of multivariate variables, and stores the statistical data containing those vectors in a secondary storage area (for example, the statistical data storage unit 131 of the storage unit 13, described later). Examples of vector variables include the packet size, the mean and coefficient of variation of the payload size, and the coefficient of variation of the packet inter-arrival time.
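As an illustration of this kind of statistical processing, the sketch below turns one observation window of packets into a three-element vector: mean packet size, coefficient of variation of packet size, and coefficient of variation of inter-arrival time. The choice of these three variables, the windowing, and the function name `flow_features` are assumptions for illustration; the source only lists example variables.

```python
import numpy as np

def _cv(a):
    """Coefficient of variation (population std / mean); 0.0 if undefined."""
    if a.size == 0:
        return 0.0
    m = a.mean()
    return float(a.std() / m) if m > 0 else 0.0

def flow_features(packet_sizes, arrival_times):
    """Build one multivariate vector from the packets of one window.

    packet_sizes:  packet sizes in bytes
    arrival_times: packet arrival timestamps in seconds, ascending
    """
    sizes = np.asarray(packet_sizes, dtype=float)
    gaps = np.diff(np.asarray(arrival_times, dtype=float))  # inter-arrival times
    return np.array([sizes.mean(), _cv(sizes), _cv(gaps)])
```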

The learning unit 122 executes the learning-phase processing described in the mathematical background of the present embodiment. It reads the vector data to be learned from the statistical data storage unit 131 and, using this vector data as input data, generates a SOM model trained by unsupervised learning. A SOM may be generated for each individual computer terminal 2, or for a group of computer terminals 2 together.

The learning unit 122 then inputs the input data into the trained SOM again and calculates the quantization errors between the representative vectors of the SOM output layer and the input data, as well as the neighborhood distances of the representative vectors. From the distributions of the quantization errors and the neighborhood distances, it generates outlier thresholds for each. It combines at least the two thresholds obtained from this calculation to generate the thresholds T1 to T4. The learning unit 122 stores the SOM model 132, the neighborhood distance data 133 containing the neighborhood distances h_j of all representative vectors, and the threshold data 134 containing the thresholds T1 to T4 together as one set in a secondary storage area (the learning data storage unit 130 of the storage unit 13).

The determination unit 123 executes the mapping-phase processing described in the mathematical background of the present embodiment. It inputs the vector data to be judged into the SOM model and performs anomaly determination on that data. As the vector data to be judged, the determination unit 123 reads, from the vector data stored in the statistical data storage unit 131, data that was not included in the learning-phase input data or data newly designated for determination.

The determination unit 123 inputs the vector data to be judged into the SOM model 132 and calculates the quantization error d' between that vector data and the representative vector of the output layer to which it is mapped. The determination unit 123 then compares the quantization error d' with each of the thresholds T1, T2, T3, and T4 to determine whether the data is anomalous.

The determination unit 123 attaches the determination result to the judged data and stores it in a secondary storage area (the determination data storage unit 135 of the storage unit 13). An operator (not shown) may retrieve the stored determination result and decide on countermeasures, or the result may be reported separately as an alarm. Based on the result, instruction information that causes the network device 3 to filter anomalous communication may also be sent to the network device 3.

The storage unit 13 is a storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or an optical disc. The storage unit 13 may instead be a rewritable semiconductor memory such as RAM (Random Access Memory), flash memory, or NVSRAM (Non Volatile Static Random Access Memory). The storage unit 13 stores the OS (Operating System) and the various programs executed by the determination device 10, as well as various information used in executing those programs.

The storage unit 13 includes a statistical data storage unit 131, a learning data storage unit 130, and a determination data storage unit 135. The statistical data storage unit 131 stores the statistical data containing the multivariate vectors obtained by the statistical processing unit 121. The learning data storage unit 130 holds the SOM model 132 generated by the learning unit 122, the neighborhood distance data 133 containing the neighborhood distances of the representative vectors calculated by the learning unit 122, and the threshold data 134 containing the thresholds T1 to T4 calculated by the learning unit 122. The determination data storage unit 135 stores the judged data in association with the determination results of the determination unit 123.

The input unit 14 is an input interface that accepts various operations from the operator of the determination device 10. For example, the input unit 14 consists of input devices such as a touch panel, a voice input device, a keyboard, and a mouse. In response to operator actions, the input unit 14 accepts, for example, an instruction to start the learning phase, an instruction to start the mapping phase, and an instruction to transmit the determination result.

The output unit 15 is implemented by, for example, a display device such as a liquid crystal display, a printing device such as a printer, or an information communication device. The output unit 15 displays, for example, the determination result on a screen.

[Processing flow]
Next, the flow of processing in the communication system 1 is described. FIG. 6 illustrates the flow of processing in the communication system 1 shown in FIG. 5.

As shown in FIG. 6, the network device 3 outputs communication data copied from the communications of the computer terminals 2, or numerical communication data, to the statistical processing unit 121 of the determination device 10 (see (1) in FIG. 6).

In the determination device 10, the statistical processing unit 121 applies statistical processing to the copied communication data or numerical communication data, obtains vector data of multivariate variables, and stores it in the secondary storage area (see (2) in FIG. 6).

The learning unit 122 reads the vector data in the statistical data storage unit 131 as input data (see (3) in FIG. 6). As the learning-phase processing, the learning unit 122 generates the SOM model, calculates the quantization errors of the input data and the neighborhood distances h_j of the representative vectors, and calculates the thresholds T1 to T4 by combining the outlier thresholds of these values. The learning unit 122 then stores the SOM model 132, the neighborhood distance data 133 containing the neighborhood distances h_j of all representative vectors, and the threshold data 134 containing the thresholds T1 to T4 (see (4) in FIG. 6).

The determination unit 123 then reads the SOM model 132, the neighborhood distances h_j of all representative vectors, and the thresholds T1 to T4 from the secondary storage area (see (5) in FIG. 6). The determination unit 123 also reads the vector data to be judged from the secondary storage area (see (6) in FIG. 6).

The determination unit 123 inputs the vector data to be judged into the SOM model 132, maps it, and calculates the quantization error d'. It then compares the quantization error d' with each of the thresholds T1, T2, T3, and T4 to determine whether the vector data is anomalous. The determination unit 123 attaches the determination result to the judged data and stores it in the secondary storage area (see (7) in FIG. 6).

[Processing procedure of the learning process]
Next, the procedure of the learning process in the learning phase is described. FIG. 7 is a flowchart showing the procedure of the learning process according to the embodiment.

In the learning phase, the learning unit 122 performs the processing shown in FIG. 7 using only the given input data. The learning unit 122 performs unsupervised SOM training with the input data (step S1) and saves the SOM output layer data (step S2). It then calculates the quantization errors of the input data (step S6). It also calculates the neighborhood distances of the representative vectors (step S3) and saves them in the storage unit 13 (step S4).

From the distributions of the quantization error and of the neighborhood distance of the representative vectors, the learning unit 122 calculates the upper whisker value of each (steps S5 and S7). The learning unit 122 calculates the thresholds T1, T2, T3, and T4 from the maximum value of the quantization error, the upper whisker value of the neighborhood distance, and the upper whisker value of the quantization error (step S8). The learning unit 122 stores the calculated thresholds T1, T2, T3, and T4 in the storage unit 13 (step S9) and ends the learning process.
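The threshold calculation in step S8 can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment. The combination of T1 to T4 follows the wording of claim 3. The upper whisker value is assumed here to be the common box plot fence Q3 + 1.5 x IQR, and the function names `upper_whisker` and `make_thresholds` are hypothetical.

```python
import statistics


def upper_whisker(values):
    """Upper whisker of a box plot, ASSUMED here to be Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def make_thresholds(quant_errors, neighbor_dists):
    """Build T1..T4 from training-time quantization errors and
    neighborhood distances of the representative vectors."""
    t1 = max(quant_errors)                    # maximum quantization error
    t2 = upper_whisker(neighbor_dists)        # upper whisker of neighborhood distance
    t3 = t1 + t2                              # sum of T1 and T2
    t4 = upper_whisker(quant_errors) + t2     # upper whisker of quantization error + T2
    return {"T1": t1, "T2": t2, "T3": t3, "T4": t4}


if __name__ == "__main__":
    # Toy values for illustration only
    print(make_thresholds([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))
```

The four values are returned together so that the determination step can compare against any one of them or against all of them.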

[Processing procedure of determination process]
Next, the processing procedure of the determination process in the mapping phase will be described. FIG. 8 is a flowchart showing the processing procedure of the determination process according to the embodiment.

In the mapping phase, the process shown in FIG. 8 is performed using the SOM model, the neighborhood distances of the representative vectors, and the thresholds T1, T2, T3, and T4 obtained in the learning phase. The determination unit 123 inputs the data x' to be determined into the input layer of the trained SOM (step S11). The determination unit 123 calculates the quantization error d' of the data x' to be determined (step S12).

Subsequently, the determination unit 123 selects a threshold T', in order from the threshold T1 to the threshold T4 (step S13). The determination unit 123 compares the threshold T' with the quantization error d' and makes a determination using equation (11) (step S14). The determination unit 123 stores the determination result in association with the data x' to be determined (step S15).

When the determination unit 123 has compared the quantization error d' with all of the thresholds T1 to T4 (step S16: Yes), it ends the process. When it has not yet compared the quantization error d' with all of the thresholds T1 to T4 (step S16: No), it selects the next threshold (step S13), compares it with the quantization error d', and makes a determination (step S14).
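The loop over steps S11 to S16 can be sketched as follows. This is a hedged illustration under stated assumptions: the quantization error is taken to be the Euclidean distance from the input vector to its closest representative vector, and the data is judged abnormal when the quantization error exceeds the selected threshold, as worded in claim 4. The names `quantization_error`, `judge`, and `codebook` are hypothetical.

```python
import math


def quantization_error(x, codebook):
    """Distance from x to the closest representative vector.
    Euclidean distance is an ASSUMPTION of this sketch."""
    return min(math.dist(x, w) for w in codebook)


def judge(x, codebook, thresholds):
    """Compare d' with each threshold in turn (steps S13 to S16) and
    record, per threshold, whether x is judged abnormal (d' > T')."""
    d = quantization_error(x, codebook)          # step S12
    results = {}
    for name, t in thresholds.items():           # step S13: pick next T'
        results[name] = d > t                    # step S14: abnormal if d' exceeds T'
    return results                               # step S15: kept alongside x


if __name__ == "__main__":
    codebook = [(0.0, 0.0), (1.0, 1.0)]          # toy representative vectors
    print(judge((3.0, 4.0), codebook, {"T1": 3.0, "T2": 4.0}))
```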

The final anomaly determination for the data x' to be determined may be based on the determination result for any one of the thresholds T1 to T4. Alternatively, it may be made comprehensively by combining two or more of the determination results obtained with the thresholds T1 to T4.
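Both ways of reaching a final determination can be sketched as follows. The text does not specify how multiple results are combined, so the vote count used here is purely an assumed example, and the name `final_decision` and its parameters are hypothetical.

```python
def final_decision(flags, use=None, min_votes=1):
    """flags maps a threshold name to True (abnormal) or False (normal).

    use:       name of a single threshold whose result decides alone.
    min_votes: otherwise, judge abnormal when at least this many
               thresholds flagged the data (an ASSUMED combination rule).
    """
    if use is not None:
        return flags[use]
    return sum(flags.values()) >= min_votes


if __name__ == "__main__":
    flags = {"T1": False, "T2": True, "T3": False, "T4": False}
    print(final_decision(flags, use="T2"))       # single-threshold decision
    print(final_decision(flags, min_votes=2))    # combined decision
```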

For example, consider a case where the input data used for learning and the data to be determined are drawn from the same population and anomaly detection by supervised learning is applied. In this case, since all the input data are assumed to be normal data, the possibility that the data input for determination is abnormal is zero. The determination unit 123 may then use the threshold T2, which is the threshold of the anomaly detection method based on unsupervised learning, or the threshold T4, which is close to the threshold T2.

It is generally difficult to guarantee (prove) that all the data used for learning is normal data. However, when the data can be judged to be almost, though not completely, normal, the determination unit 123 may use the threshold T1 of the anomaly detection method based on semi-supervised learning, or the threshold T3, which is close to the threshold T1.

Furthermore, when the overall distribution of the data cannot be fixed at learning time, as with input data that grows over time, it becomes difficult to use the anomaly detection method based on unsupervised learning. In this case, the determination unit 123 may use the threshold T3, which is close to the semi-supervised threshold T1, or the threshold T4, which is close to the unsupervised threshold T2. Because the determination unit 123 uses thresholds that have the properties of both unsupervised and semi-supervised learning, it can perform anomaly determination that combines the two.
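The three cases above amount to a lookup from the condition of the training data to the candidate thresholds. A minimal sketch follows; the condition labels and the names `CANDIDATES` and `candidate_thresholds` are hypothetical and serve only to restate the text.

```python
# Candidate thresholds per training-data condition, restating the text above.
# The condition labels are hypothetical names introduced for this sketch.
CANDIDATES = {
    "all_normal_assumed": ("T2", "T4"),     # T2, or T4 which is close to T2
    "nearly_normal": ("T1", "T3"),          # T1, or T3 which is close to T1
    "distribution_unknown": ("T3", "T4"),   # e.g. input data that grows over time
}


def candidate_thresholds(condition):
    """Return the threshold names suggested for the given condition."""
    return CANDIDATES[condition]


if __name__ == "__main__":
    print(candidate_thresholds("distribution_unknown"))
```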

[Effect of the embodiment]
In the learning phase, the determination device 10 according to the present embodiment performs unsupervised learning by SOM and then calculates the quantization error between the representative vectors of the SOM output layer and the input data, as well as the neighborhood distance of the representative vectors in the SOM output layer. From the distributions of these values, it calculates thresholds using an existing outlier detection method. The determination device 10 then combines the two thresholds obtained in this way to generate a plurality of thresholds for anomaly determination that have the properties of both unsupervised and semi-supervised learning.

Subsequently, in the mapping phase, the determination device 10 inputs the data to be determined into the SOM, calculates the quantization error of that data, and determines whether the input data is abnormal by comparing the calculated quantization error with at least one of the plurality of thresholds.

In this way, the determination device 10 generates a plurality of thresholds for anomaly determination that have the properties of both unsupervised and semi-supervised learning. By comparing any of these thresholds with the quantization error of the input data, the determination device according to the present embodiment can perform anomaly determination that combines unsupervised and semi-supervised learning.

Therefore, the determination device 10 does not need to be given all the input data when learning starts, so anomaly determination by SOM is possible even for input data that grows over time, such as data from a communication network.

In addition, because the determination device 10 can detect anomalies using the neighborhood distance of the representative vectors in the SOM output layer, it does not need to know in advance the distribution of normal and abnormal data in the input data, unlike the anomaly detection method based on unsupervised learning.

In addition, unlike the anomaly detection method based on semi-supervised learning, the determination device 10 does not need to assume that all the input data at learning time is normal, so it can detect anomalies even when abnormal data is mixed into the input data used for learning.

In addition, because the determination device 10 automatically generates the plurality of thresholds for anomaly determination that have the properties of both unsupervised and semi-supervised learning, the cumbersome task of deciding thresholds in advance is unnecessary.

In addition, the determination device 10 automatically generates four thresholds T1 to T4 from the quantization error of the input data at learning time and the neighborhood distance of the representative vectors, and performs anomaly determination by comparison with each of them. Because the thresholds T1 to T4 enable anomaly determination that has the properties of both unsupervised and semi-supervised learning, the determination device 10 does not need to restrict the anomaly detection method to either unsupervised or semi-supervised learning according to preconditions on the input data.

The determination device 10 may make the final anomaly determination based on the determination result obtained with any one of the four thresholds T1 to T4. Alternatively, the determination device 10 may make the final anomaly determination by combining a plurality of determination results obtained with the thresholds T1 to T4.

As described above, the present embodiment makes it possible to perform, in a single pass, anomaly determination that has the properties of both unsupervised and semi-supervised learning, without having to take predetermined preconditions on the input data for SOM learning into account.

[System configuration, etc.]
Each component of each illustrated device is functional and conceptual, and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to the illustrated one, and all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device may be realized by a CPU and a program analyzed and executed by that CPU, or may be realized as hardware using wired logic.

Of the processes described in the present embodiment, all or part of the processes described as being performed automatically may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings may be changed arbitrarily unless otherwise specified.

[Program]
FIG. 9 is a diagram showing an example of a computer that realizes the determination device 10 by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These components are connected by a bus 1080.

The memory 1010 includes a ROM 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the determination device 10 is implemented as the program module 1093, in which computer-executable code is written. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing processing equivalent to the functional configuration of the determination device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

The setting data used in the processing of the above-described embodiment is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as needed and executes them.

The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090. They may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN, WAN (Wide Area Network), etc.) and read from that computer by the CPU 1020 via the network interface 1070.

Although an embodiment applying the invention made by the present inventor has been described above, the present invention is not limited by the description and drawings that form part of the disclosure of the present invention according to this embodiment. That is, all other embodiments, examples, operational techniques, and the like made by those skilled in the art based on this embodiment are included in the scope of the present invention.

1 Communication system
2 Computer terminal
3 Network equipment
10 Determination device
11 Communication unit
12 Control unit
13 Storage unit
14 Input unit
15 Output unit
121 Statistical processing unit
122 Learning unit
123 Determination unit
130 Learning data storage unit
131 Statistical data storage unit
132 SOM model
133 Neighborhood distance data
134 Threshold data
135 Determination data storage unit

Claims

A determination device comprising:
an acquisition unit that performs statistical processing on copied communication data or numerical data of communication and acquires multivariate vector data;
a learning unit that performs unsupervised learning by SOM (Self Organizing Map) using the vector data to be learned as input data, then inputs the input data into the trained SOM again, calculates the quantization error between the representative vectors of the output layer of the SOM and the input data and the neighborhood distance of the representative vectors, obtains an outlier threshold for each from the distribution of the quantization error and the distribution of the neighborhood distance of the representative vectors, and combines the obtained thresholds to generate a plurality of thresholds; and
a determination unit that inputs vector data to be determined into the SOM, calculates the quantization error between the vector data to be determined and the representative vector of the output layer onto which the vector data to be determined is mapped, and compares the calculated quantization error with at least one of the plurality of thresholds to determine whether the vector data to be determined is abnormal.

The determination device according to claim 1, wherein the learning unit applies the distributions of the quantization error and the neighborhood distance to a box-and-whisker plot, which is a method of calculating a boundary threshold for outlier detection, calculates the upper whisker values of the quantization error and the neighborhood distance, and generates the plurality of thresholds using the maximum value of the quantization error, the upper whisker value of the quantization error, and the upper whisker value of the neighborhood distance.

The determination device according to claim 2, wherein the learning unit generates a first threshold that is the maximum value of the quantization error, a second threshold that is the upper whisker value of the neighborhood distance, a third threshold that is the sum of the first threshold and the second threshold, and a fourth threshold that is the sum of the upper whisker value of the quantization error and the second threshold.

The determination device according to claim 3, wherein the determination unit compares any one of the first threshold, the second threshold, the third threshold, and the fourth threshold with the quantization error of the vector data to be determined, and determines that the vector data to be determined is abnormal when its quantization error exceeds the threshold being compared.

A determination method executed by a determination device, the method comprising:
a step of performing statistical processing on copied communication data or numerical data of communication and acquiring multivariate vector data;
a step of performing unsupervised learning by SOM (Self Organizing Map) using the vector data to be learned as input data, then inputting the input data into the trained SOM again, calculating the quantization error between the representative vectors of the output layer of the SOM and the input data and the neighborhood distance of the representative vectors, obtaining an outlier threshold for each from the distribution of the quantization error and the distribution of the neighborhood distance of the representative vectors, and combining the obtained thresholds to generate a plurality of thresholds; and
a step of inputting vector data to be determined into the SOM, calculating the quantization error between the vector data to be determined and the representative vector of the output layer onto which the vector data to be determined is mapped, and comparing the calculated quantization error with at least one of the plurality of thresholds to determine whether the vector data to be determined is abnormal.

A determination program that causes a computer to execute:
a step of performing statistical processing on copied communication data or numerical data of communication and acquiring multivariate vector data;
a step of performing unsupervised learning by SOM (Self Organizing Map) using the vector data to be learned as input data, then inputting the input data into the trained SOM again, calculating the quantization error between the representative vectors of the output layer of the SOM and the input data and the neighborhood distance of the representative vectors, obtaining an outlier threshold for each from the distribution of the quantization error and the distribution of the neighborhood distance of the representative vectors, and combining the obtained thresholds to generate a plurality of thresholds; and
a step of inputting vector data to be determined into the SOM, calculating the quantization error between the vector data to be determined and the representative vector of the output layer onto which the vector data to be determined is mapped, and comparing the calculated quantization error with at least one of the plurality of thresholds to determine whether the vector data to be determined is abnormal.