JP2004349846A

JP2004349846A - Outlier detecting method

Info

Publication number: JP2004349846A
Application number: JP2003142251A
Authority: JP
Inventors: Hiromichi Kawano; 弘道川野; Yoko Hoshiai; 擁湖星合; Akiko Takahashi; 彰子高橋; Ken Nishimatsu; 研西松
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-05-20
Filing date: 2003-05-20
Publication date: 2004-12-09

Abstract

PROBLEM TO BE SOLVED: To provide an outlier detecting method capable of detecting an outlier even if the data to be inspected do not follow a normal distribution.

SOLUTION: The outlier detecting method for time-series data having periodicity has a factor analyzing procedure that performs a factor analysis by treating the data of one period as one sample and the n items of data within that period as data for variable items; and an outlier determining procedure that defines a distance between samples, using the factor scores calculated by the factor analyzing procedure as an index of similarity between data, and determines on the basis of the distance whether or not a sample is an outlier.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

[0001]
[Technical field of the invention]
The present invention relates to a method for detecting outliers in periodic time-series data, and is applied to detecting abnormal states of a communication network and erroneously measured data.
[0002]
[Prior art]
In performing network equipment management and planning work, it is necessary to automatically detect and remove abnormal values (outliers) from traffic data in order to properly calculate basic traffic.
As a conventional outlier detection method, a method is known that extracts outlier candidate data by applying the Grubbs test on the assumption that the measurement data follow a normal distribution (see Non-Patent Documents 1 and 2 below).
The conventional method is based on the premise that data to be tested follows a normal distribution, and cannot be applied when traffic data to be tested does not follow a normal distribution.
In fixed-line telephone traffic for voice communication, measurement data follows a normal distribution, but in broadband traffic represented by recent non-voice communication, there is no guarantee that measurement data follows a normal distribution.
[0003]
Prior art documents related to the present invention include the following.
[Non-patent document 1]
Inoue, Hoshiai, "Outlier detection method for equipment management data", 2001 IEICE Society Conference, no. B-7-71, p. 260, Sept. 2001.
[Non-patent document 2]
Hoshiai, Inoue, "A Consideration on Outlier Detection of Traffic Data for Network Equipment Planning", 2002 IEICE General Conference, no. B-7-37, p. 264, Mar. 2002.
[0004]
[Problems to be solved by the invention]
As described above, the conventional method presupposes that the data to be tested follow a normal distribution, and therefore cannot be applied when the traffic data to be tested (for example, broadband traffic) do not follow a normal distribution.
The present invention has been made to solve this problem of the prior art, and its object is to provide an outlier detection method capable of detecting outliers even when the data to be tested do not follow a normal distribution.
The above and other objects and novel features of the present invention will become apparent from the description of the present specification and the accompanying drawings.
[0005]
[Means for Solving the Problems]
The following is a brief description of an outline of typical inventions disclosed in the present application.
That is, the present invention is a method for detecting outliers in periodic time-series data, comprising: a factor analysis procedure that performs a factor analysis by taking the set of data for one cycle as one sample and regarding the n data within one cycle as data for variable items; and an outlier determination procedure that defines a distance between samples, using the factor scores calculated in the factor analysis procedure as an index of similarity between data, and determines on the basis of the distance whether or not a sample is an outlier.
Further, in the present invention, from time-series data for m cycles in which one cycle consists of n data (X_11, X_12, X_13, ..., X_1n, X_21, X_22, X_23, ..., X_2n, ..., X_m1, X_m2, X_m3, ..., X_mn), m new data {Y_1, Y_2, ..., Y_m, where Y_i = (X_i1, X_i2, X_i3, ..., X_in)} each consisting of n variables are generated, and a factor analysis is performed on the data (Y_1, Y_2, ..., Y_m).
[0006]
Further, in the present invention, a sample including an outlier is detected by calculating a factor score of each sample characterized by a p-dimensional factor, and calculating a distance between the samples using the factor score.
Further, in the present invention, a factor score of each sample characterized by p-dimensional factors is calculated, the samples are classified into clusters using the factor scores, and samples containing outliers are detected using the distance between clusters and the number of samples in each cluster.
Further, in the present invention, a factor score of each sample characterized by p-dimensional factors is calculated, the samples are classified into clusters using the factor scores, and samples containing outliers are detected using the distance between clusters and the ratio of the numbers of samples in the clusters.
[0007]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
In all the drawings for describing the embodiments, components having the same function are denoted by the same reference numerals, and repeated description thereof will be omitted.
FIG. 1 is a block diagram showing a schematic configuration of an outlier detection apparatus for implementing an outlier detection method according to an embodiment of the present invention.
In the figure, 1 is an outlier detection device, 2 is a data input device, 3 is a factor analyzer, and 4 is an outlier determination device.
The data input device 2 generates m pieces of data composed of n variables from input data (time-series data for m cycles in which one cycle is composed of n pieces of data).
The factor analyzer 3 reduces m observation data consisting of n variables into p-dimensional factors, and calculates p factor scores.
The outlier determination device 4 calculates the distances between samples from the p factor scores output by the factor analyzer 3, and determines whether or not any of the samples is an outlier.
[0008]
FIG. 2 is a flowchart illustrating a processing procedure of the outlier detection apparatus according to the present embodiment. Hereinafter, a processing procedure of the outlier detection apparatus according to the present embodiment will be described.
First, the data input device 2 generates m pieces of data composed of n variables from input data (time-series data for m cycles, where one cycle is composed of n pieces of data) (step S0).
The data input device 2 generates, from the input data expressed by the following equation (11) (time-series data for m cycles in which one cycle consists of n data), m new data (Y_1, Y_2, ..., Y_m) each consisting of n variables and expressed by the following equation (12).
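(Equation 5)
X_11, X_12, X_13, ..., X_1n, X_21, X_22, X_23, ..., X_2n, ..., X_m1, X_m2, X_m3, ..., X_mn (11)
Y_i = (X_i1, X_i2, X_i3, ..., X_in), i = 1, 2, ..., m (12)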
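Step S0 amounts to cutting the flat series into rows of length n. The following is a minimal sketch of that reshaping, not the device's actual implementation; the function name `to_samples` and the sample values are illustrative assumptions.

```python
def to_samples(series, n):
    """Split a flat periodic series into m samples of n values each (step S0)."""
    if n <= 0 or len(series) % n != 0:
        raise ValueError("series length must be a positive multiple of n")
    # Row i holds Y_i = (X_i1, ..., X_in): the n measurements of cycle i.
    return [list(series[i:i + n]) for i in range(0, len(series), n)]

# Made-up example: m = 3 cycles of n = 4 measurements each.
Y = to_samples([5, 9, 7, 3, 6, 8, 7, 2, 5, 9, 6, 3], n=4)
```

Each row of `Y` is then treated as one observation of n variables in the factor analysis.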

Next, the factor analyzer 3 performs factor analysis on the m observation data consisting of n variables (the data expressed by equation (12)), and calculates p factor scores (step S1).
The factor analyzer 3 performs the factor analysis treating the m cycles of time-series data as m observations of n variables, thereby reducing the n data (X_i1, X_i2, X_i3, ..., X_in) in equation (12) to p-dimensional factors and calculating p factor scores.
That is, the factor analyzer 3 standardizes (X_i1, X_i2, X_i3, ..., X_in) of equation (12) and calculates the following equation (13).
[0009]
(Equation 6)

The following equation (14) is obtained from the above equation (13).
(Equation 7)

Next, treating equation (14) as an n-dimensional vector, the variance-covariance matrix R expressed by the following equation (15) is obtained. For standardized data this matrix coincides with the correlation matrix.
[0010]
(Equation 8)
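The standardization and the matrix R can be sketched as follows. Equations (13) to (15) are not reproduced in this text, so the sketch assumes the usual z-score standardization with the population standard deviation (dividing by m); the function names are illustrative.

```python
from math import sqrt

def standardize(Y):
    """Column-wise z-scores: z_ij = (x_ij - mean_j) / std_j."""
    m, n = len(Y), len(Y[0])
    means = [sum(row[j] for row in Y) / m for j in range(n)]
    # Population standard deviation (divide by m); m - 1 is an equally common choice.
    stds = [sqrt(sum((row[j] - means[j]) ** 2 for row in Y) / m) for j in range(n)]
    if any(s == 0 for s in stds):
        raise ValueError("a variable with zero variance cannot be standardized")
    return [[(row[j] - means[j]) / stds[j] for j in range(n)] for row in Y]

def covariance(Z):
    """n x n variance-covariance matrix of data that are already standardized."""
    m, n = len(Z), len(Z[0])
    return [[sum(row[j] * row[k] for row in Z) / m for k in range(n)] for j in range(n)]

# Made-up data: m = 4 samples, n = 2 variables.
Y = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]]
R = covariance(standardize(Y))
```

Because every column has zero mean and unit variance after standardization, the diagonal of `R` is 1 and the off-diagonal entries are correlation coefficients.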

[0011]
The eigenvalues of the correlation matrix R are obtained and arranged in descending order as λ_1, λ_2, ..., λ_n, and the contribution ratio (C_i) expressed by the following equation (16) is obtained for the eigenvalues.
(Equation 9)

The eigenvalues (λ_1, λ_2, ..., λ_p; p ≤ n) for which the contribution ratio is 0.8 or more are extracted, and the eigenvectors (W_1, W_2, ..., W_p) expressed by the following equation (17) are obtained.
(Equation 10)
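The choice of p from the contribution ratios can be sketched as below. Equation (16) is not reproduced in this text, so the sketch assumes the common reading in which the cumulative contribution ratio of the leading eigenvalues must reach 0.8; the function names and eigenvalues are illustrative.

```python
def contribution_ratios(eigenvalues):
    """Share of the total variance carried by each eigenvalue, in descending order."""
    lam = sorted(eigenvalues, reverse=True)
    total = sum(lam)
    return [v / total for v in lam]

def choose_p(eigenvalues, threshold=0.8):
    """Smallest p whose cumulative contribution ratio reaches the threshold."""
    cumulative = 0.0
    for p, c in enumerate(contribution_ratios(eigenvalues), start=1):
        cumulative += c
        if cumulative >= threshold:
            return p
    return len(eigenvalues)  # fallback if rounding keeps the sum just below the threshold

# Made-up eigenvalues: ratios are 0.6, 0.25, 0.1, 0.05.
p = choose_p([2.4, 1.0, 0.4, 0.2])
```

With these values the first two eigenvalues together explain 85% of the variance, so two factors are retained.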

Using the eigenvectors, the factor loadings expressed by the following equation (18) are calculated.
(Equation 11)

By the least squares method, the f_ij that minimize the following expression (19) are obtained.
[0012]
(Equation 12)

The following equation (20) is obtained from f_ij.
(Equation 13)

Equation (20) is called the factor score. It is the variable obtained when the n data (X_i1, X_i2, X_i3, ..., X_in) in equation (12) are expressed by p (p < n) common variables; in other words, it expresses equation (12) by p latent variables.
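Step S1 as a whole can be sketched with NumPy as follows. Equations (13) to (20) are not reproduced in this text, so several details are assumptions: z-score standardization, loadings of the form sqrt(λ_j)·W_j (a standard principal-factor choice), a cumulative 0.8 cutoff, and factor scores obtained by ordinary least squares. The function name `factor_scores` and the synthetic data are illustrative.

```python
import numpy as np

def factor_scores(Y, threshold=0.8):
    """Return an m x p array of factor scores for m samples of n variables."""
    Y = np.asarray(Y, dtype=float)
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0)   # standardize each variable
    R = Z.T @ Z / Z.shape[0]                    # n x n correlation matrix
    lam, W = np.linalg.eigh(R)                  # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]               # reorder to descending
    lam, W = lam[order], W[:, order]
    cumulative = np.cumsum(lam) / lam.sum()
    p = min(int(np.searchsorted(cumulative, threshold)) + 1, len(lam))
    A = W[:, :p] * np.sqrt(lam[:p])             # assumed loadings: sqrt(lambda_j) * W_j
    # Least squares: for every sample z_i, the f_i minimizing ||z_i - A f_i||^2.
    F, *_ = np.linalg.lstsq(A, Z.T, rcond=None)
    return F.T

# Synthetic example: 20 cycles of 6 points sharing one made-up cycle shape.
rng = np.random.default_rng(0)
pattern = np.sin(np.linspace(0.0, 2.0 * np.pi, 6))
Y = pattern * rng.uniform(0.5, 1.5, size=(20, 1)) + rng.normal(0.0, 0.1, size=(20, 6))
F = factor_scores(Y)
```

Under these assumptions each column of `F` has zero mean and unit variance, which makes distances between rows of `F` comparable across factors.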
Next, the outlier determination device 4 calculates the distance between the m pieces of data using the p factor scores, and detects outliers (step S2).
The outlier determination device 4 has three determination methods. In the first determination method (the invention according to claim 3), for the data (F_1, F_2, ..., F_m) expressed by equation (20), the distance between data is given by the following equation (21), the following equation (22) is calculated from equation (21), and when the following equation (23) is satisfied, the data Yi is determined to be an outlier.
[0013]
(Equation 14)
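Equations (21) to (23) are not reproduced in this text, so the first determination method can only be illustrated with stand-ins. The sketch below assumes Euclidean distance for (21), the mean distance from a sample to all other samples for (22), and a simple threshold test for (23); all three choices, the function name, and the data are assumptions rather than the patented formulas.

```python
from math import dist

def flag_by_distance(F, threshold):
    """Indices of samples whose mean distance to the other samples exceeds the threshold."""
    m = len(F)
    if m < 2:
        return []
    flagged = []
    for i in range(m):
        # Stand-in for (22): average distance from sample i to every other sample.
        mean_d = sum(dist(F[i], F[j]) for j in range(m) if j != i) / (m - 1)
        if mean_d > threshold:  # stand-in for (23)
            flagged.append(i)
    return flagged

# Made-up factor scores: three similar cycles and one distant cycle.
F = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [5.0, 5.0]]
outliers = flag_by_distance(F, threshold=3.0)
```

Only the fourth sample lies far from all others in factor-score space, so only its index is returned.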

[0014]
In the second determination method of the outlier determination device 4 (the invention according to claim 4), cluster analysis is applied to the data (F_1, F_2, ..., F_m) expressed by equation (20), and the m data (F_1, F_2, ..., F_m) are classified into q clusters expressed by the following equation (24).
(Equation 15)
C1, C2, ..., Cq (24)
For example, as initial values, m clusters {F_1}, {F_2}, ..., {F_m}, each consisting of a single datum, are set.
For the m clusters (C1, C2, ..., Cm), the dissimilarity matrix d_ij (i, j = 1, 2, ..., m) is calculated using the shortest distance method, the longest distance method, the group average method, the centroid method, the Ward method, or the like.
For example, in the case of the shortest distance method, when the elements of the clusters Ci and Cj are (F_1, F_2, ..., F_i) and (F'_1, F'_2, ..., F'_j), respectively, the dissimilarity d_ij between the clusters Ci and Cj is expressed by the following equation (25).
(Equation 16)

Here, L_ij is the distance between the elements (F_i, F'_j) constituting the clusters; the squared Euclidean distance, the Mahalanobis distance, the Minkowski distance, and the like can be used.
In the case of the squared Euclidean distance, L_ij is given by the following equation (26).
[0015]
(Equation 17)
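The images for equations (25) and (26) are not reproduced in this text. For orientation, the textbook forms of the shortest distance (single linkage) dissimilarity and the squared Euclidean distance are given below; the symbols f_{kr} and f'_{lr}, denoting the r-th factor score of F_k and F'_l, are notation assumed here and may differ from the original.

```latex
d_{ij} = \min_{F_k \in C_i,\; F'_l \in C_j} L_{kl},
\qquad
L_{kl} = \sum_{r=1}^{p} \left( f_{kr} - f'_{lr} \right)^{2}
```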

If all values of the dissimilarity matrix d_ij are larger than the threshold D', the cluster analysis ends. Otherwise, the clusters Ci and Cj with the smallest dissimilarity d_ij are merged into one cluster.
For the (m−1) clusters newly generated by this processing, the dissimilarity matrix d_ij (i, j = 1, 2, ..., m−1) is calculated, and the above processing is repeated until all values of the dissimilarity matrix d_ij become larger than the threshold D'. In this way, the data are classified into the clusters expressed by equation (24).
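The merging loop described above can be sketched in plain Python. This is a minimal illustration using the shortest distance method with squared Euclidean distance, one of the combinations the text allows; the function names, the parameter `d_stop` (standing for the threshold D'), and the data are assumptions.

```python
def sq_euclid(a, b):
    """Squared Euclidean distance between two factor-score vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def single_linkage(ci, cj, F):
    """Shortest distance method: smallest element-to-element distance."""
    return min(sq_euclid(F[a], F[b]) for a in ci for b in cj)

def cluster(F, d_stop):
    """Merge the closest clusters until every pair is farther apart than d_stop."""
    clusters = [[i] for i in range(len(F))]  # start with one cluster per sample
    while len(clusters) > 1:
        d, i, j = min(
            (single_linkage(clusters[i], clusters[j], F), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > d_stop:  # all dissimilarities exceed D': stop
            break
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Made-up factor scores: three nearby samples and one distant sample.
F = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [4.0, 4.0]]
clusters = cluster(F, d_stop=1.0)
```

The result lists sample indices per cluster; here the three nearby samples end up together and the distant one stays alone.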
Then, the inter-cluster distance Lij expressed by the following equation (27) is obtained using any of the shortest distance method, the longest distance method, the group average method, the centroid method, the median method, and the Ward method, and if there is a cluster Ci that satisfies the following expression (28), the data Yi belonging to Ci is determined to be an outlier.
(Equation 18)
Lij = D(F_i, F_j) (27)
Lij > T and |Ci| > |Cj|, where T is a threshold and |Ci| is the number of elements in the cluster Ci (28)
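Condition (28) can be checked over all ordered cluster pairs as sketched below. The sketch evaluates the inequality exactly as printed and only reports the pairs (i, j) that satisfy it; it deliberately leaves the final outlier labelling to the text. The shortest distance method is assumed for Lij, and the function names and data are illustrative.

```python
def sq_euclid(a, b):
    """Squared Euclidean distance between two factor-score vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inter_cluster_distance(ci, cj, F):
    """L_ij by the shortest distance method (one of the methods the text allows)."""
    return min(sq_euclid(F[a], F[b]) for a in ci for b in cj)

def pairs_satisfying_28(clusters, F, T):
    """Ordered pairs (i, j) with L_ij > T and |C_i| > |C_j|, as printed in (28)."""
    hits = []
    for i, ci in enumerate(clusters):
        for j, cj in enumerate(clusters):
            if i != j and inter_cluster_distance(ci, cj, F) > T and len(ci) > len(cj):
                hits.append((i, j))
    return hits

# Made-up factor scores and a clustering of them into two clusters.
F = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [4.0, 4.0]]
clusters = [[0, 1, 2], [3]]
hits = pairs_satisfying_28(clusters, F, T=10.0)
```

Here the two clusters are far apart and differ in size, so exactly one ordered pair satisfies the condition.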
[0016]
In the third determination method of the outlier determination device 4 (the invention according to claim 5), cluster analysis is applied to the data (F_1, F_2, ..., F_m) expressed by equation (20), the m data (F_1, F_2, ..., F_m) are classified into q clusters expressed by equation (24), the inter-cluster distance Lij expressed by equation (27) is obtained using any of the shortest distance method, the longest distance method, the group average method, the centroid method, the median method, and the Ward method, and if there is a cluster Ci that satisfies the following expression (29), the data Yi belonging to Ci is determined to be an outlier.
(Equation 19)

The invention made by the present inventors has been described above specifically on the basis of the embodiment. Needless to say, however, the present invention is not limited to that embodiment and can be modified in various ways without departing from its gist.
[0017]
[Effect of the invention]
The following is a brief description of an effect obtained by a representative one of the inventions disclosed in the present application.
According to the outlier detection method of the present invention, it is possible to detect an outlier even when data to be tested does not follow a normal distribution.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of an outlier detection apparatus for implementing an outlier detection method according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a processing procedure of the outlier detection apparatus according to the embodiment of the present invention.
[Explanation of symbols]
1: outlier detection device, 2: data input device, 3: factor analyzer, 4: outlier determination device.

Claims

An outlier detection method for time-series data having periodicity, comprising:
a factor analysis procedure of performing a factor analysis by regarding a set of data for one cycle as one sample and treating the n data in one cycle as data for variable items; and
an outlier determination procedure of defining a distance between samples, using the factor scores calculated in the factor analysis procedure as an index of similarity between data, and determining on the basis of the distance whether or not a sample is an outlier.

The outlier detection method according to claim 1, wherein, in the factor analysis procedure, m new data (Y_1, Y_2, ..., Y_m), each consisting of n variables and expressed by the following equation (2), are generated from time-series data for m cycles in which one cycle consists of n data, expressed by the following equation (1), and a factor analysis is performed on the data (Y_1, Y_2, ..., Y_m).

The outlier detection method according to claim 2, wherein, in the factor analysis procedure, a factor analysis is performed on the data expressed by equation (2), and the factor scores are the data (F_1, F_2, ..., F_m) expressed by the following equation (3), obtained by expressing the data (Y_1, Y_2, ..., Y_m) of equation (2) by p (p < n) latent variables; and
in the outlier determination procedure, a distance between data is defined by the following equation (4) for the data given by equation (3), the following equation (5) is calculated from equation (4), and when equation (5) satisfies the following equation (6), the data Yi is determined to be an outlier.

The outlier detection method according to claim 3, wherein, in the outlier determination procedure, cluster analysis is applied to the data expressed by equation (3) to classify the m data (F_1, F_2, ..., F_m) into q clusters expressed by the following equation (7);
the inter-cluster distance Lij expressed by the following equation (8) is obtained using any one of the shortest distance method, the longest distance method, the group average method, the centroid method, the median method, and the Ward method; and
if there is a cluster Ci satisfying the following expression (9), the data Yi belonging to Ci is determined to be an outlier.
(Equation 3)
C1, C2, ..., Cq (7)
Lij = D(F_i, F_j) (8)
Lij > T and |Ci| > |Cj|, where T is a threshold and |Ci| is the number of elements in the cluster Ci (9)

The outlier detection method according to claim 3, wherein, in the outlier determination procedure, cluster analysis is applied to the data expressed by equation (3) to classify the m data (F_1, F_2, ..., F_m) into q clusters expressed by the following equation (7);
the inter-cluster distance Lij expressed by the following equation (8) is obtained using any one of the shortest distance method, the longest distance method, the group average method, the centroid method, the median method, and the Ward method; and
if there is a cluster Ci satisfying the following expression (10), the data Yi belonging to Ci is determined to be an outlier.