JP6182473B2

JP6182473B2 - Data analysis apparatus, method and program

Info

Publication number: JP6182473B2
Application number: JP2014023953A
Authority: JP
Inventors: 佐藤　大輔
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-09-25
Filing date: 2014-02-12
Publication date: 2017-08-16
Anticipated expiration: 2034-02-12
Also published as: JP2017102931A; JP2015088155A; JP6259058B2

Description

The present invention relates to a data analysis apparatus, method, and program. In particular, it relates to a technique for approximating two-dimensional data in computer graphics with polygonal lines and upward- or downward-convex curves, and to a data analysis apparatus, method, and program for the early detection of outliers, change points, curved regions, and straight-line regions in data analysis.

As a first conventional technique, the change point detection method based on a statistical test in Non-Patent Document 1 is described.

Suppose a data series is given and one change point candidate is placed in it. Calling this candidate point t, the series is divided at the candidate into two data series, a first half and a second half. Each x_j (j = 1, 2, …, n) is assumed to take multidimensional continuous values. At each time t, the conditional probability density function of x_t given the data observed so far is expressed in terms of a quantity w_t and a matrix Σ, where Σ is a variance-covariance matrix whose elements are real-valued parameters and a superscript T denotes transposition. w_t is modeled either as an autoregressive model or, in the one-dimensional case, as a polynomial regression model.

The parameters α_1, …, α_k and μ are obtained by maximum likelihood estimation, and the estimated parameters are substituted into Equations (2) and (3). The squared error between the data and the values predicted by the resulting model is then computed.

To find a change point, a threshold δ is fixed in advance. The point t* at which the threshold condition on the squared error holds is the change point sought; t* is determined by evaluating the squared error for the candidate points.

To detect multiple change points, the first-half and second-half series are each divided again into a first half and a second half, and change points are detected recursively.
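The recursive procedure of this first conventional technique can be sketched as follows. This is only an illustration under stated assumptions: the model is a one-dimensional straight-line fit (the degree-1 case of polynomial regression), and a split is accepted when it reduces the squared error by more than δ. The exact threshold condition is not reproduced in this text, so that criterion, the `min_len` parameter, and all function names are hypothetical.

```python
def fit_line_sse(ts, xs):
    """Least-squares straight-line fit; returns the sum of squared errors."""
    n = len(ts)
    if n < 2:
        return 0.0
    mt = sum(ts) / n
    mx = sum(xs) / n
    stt = sum((t - mt) ** 2 for t in ts)
    stx = sum((t - mt) * (x - mx) for t, x in zip(ts, xs))
    slope = stx / stt if stt > 0 else 0.0
    intercept = mx - slope * mt
    return sum((x - (slope * t + intercept)) ** 2 for t, x in zip(ts, xs))


def find_change_points(ts, xs, delta, lo=0, hi=None, min_len=3):
    """Recursively split [lo, hi) at the candidate with the smallest error.

    Assumed criterion (not taken from the source): accept the split only
    if it lowers the squared error of the unsplit fit by more than delta.
    """
    if hi is None:
        hi = len(ts)
    if hi - lo < 2 * min_len:
        return []
    whole = fit_line_sse(ts[lo:hi], xs[lo:hi])
    best_t, best_err = None, whole
    # Every candidate is evaluated, which is the cost the text criticizes.
    for t in range(lo + min_len, hi - min_len + 1):
        err = fit_line_sse(ts[lo:t], xs[lo:t]) + fit_line_sse(ts[t:hi], xs[t:hi])
        if err < best_err:
            best_t, best_err = t, err
    if best_t is None or whole - best_err <= delta:
        return []
    left = find_change_points(ts, xs, delta, lo, best_t, min_len)
    right = find_change_points(ts, xs, delta, best_t, hi, min_len)
    return left + [best_t] + right
```

Because each level of recursion refits the model for every candidate, the cost grows quickly with the length of the series, and any newly arrived data force the whole computation to be repeated.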

As a second conventional technique, a method for approximating a segment, which is a sequence of pixels in computer graphics, by line segments is described.

One method for approximating a segment by line segments is the binary decomposition method (see, for example, Non-Patent Document 2). The following steps are performed recursively.

1. For the straight line connecting the start point and the end point of the pixel sequence, calculate the distance from each pixel to the line.

2. If the maximum distance is at or below a given threshold, adopt the straight line as the line segment. Otherwise, divide the pixel sequence at the pixel with the maximum distance and repeat step 1 recursively on each of the divided pixel sequences.

3. If no pixel sequence is divided, end the process.

If the original pixel sequence is a closed curve, a start point is chosen arbitrarily, the point farthest from it is chosen as the end point, and the above procedure is carried out.
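The three steps above can be sketched as a short recursive routine. This is a minimal sketch for an open pixel sequence given as a list of (x, y) tuples; the function names and the choice to return vertex indices are assumptions made for illustration.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        # Degenerate line: fall back to the distance from a.
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def binary_decomposition(points, threshold):
    """Return the indices of the vertices of the approximating polyline."""

    def split(first, last):
        if last - first < 2:
            return []
        a, b = points[first], points[last]
        dists = [point_line_distance(points[i], a, b)
                 for i in range(first + 1, last)]
        dmax = max(dists)
        if dmax <= threshold:
            return []  # step 2: the chord is accepted as a line segment
        idx = first + 1 + dists.index(dmax)
        # step 2 (otherwise): divide at the farthest pixel and recurse
        return split(first, idx) + [idx] + split(idx, last)

    return [0] + split(0, len(points) - 1) + [len(points) - 1]
```

For example, `binary_decomposition([(0, 0), (1, 0), (2, 0), (3, 1), (4, 2)], 0.1)` keeps the two end points and the bend at index 2. The result depends entirely on `threshold`, which is the tuning difficulty the text goes on to criticize.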

As a third conventional technique, a method for detecting corner points in a pixel sequence is described.

It is preferable that the dividing points of a pixel sequence be corner points, because corner points capture the features of a figure well. The corner detection method described below (see, for example, Non-Patent Document 3) actively finds the corner points of a pixel sequence and uses them as dividing points.

1. Let the pixel sequence run from the start point P_1 to the end point P_n.

2. Set i = k.

3. For point P_i, find the angle formed by the line segments connecting it to the points P_{i-k} and P_{i+k}, which lie k pixels before and after it.

4. If i < n − k, set i = i + 1 and return to step 3.

5. Take as dividing points those P_i whose angle is smaller than the threshold and is a local minimum.

There is no particular rule for choosing k, but to capture the features of the figure it is preferable to start with a large k and decrease it gradually.
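A minimal sketch of this corner detection procedure is given below, assuming the pixel sequence is a list of (x, y) tuples with 0-based indices and that a "local minimum" means an angle no larger than those at the two neighboring indices. Both the neighborhood definition and the function names are assumptions; the source leaves the neighborhood unspecified, which is one of the difficulties it points out.

```python
import math


def angle_at(points, i, k):
    """Angle at P_i between the segments to P_{i-k} and P_{i+k} (radians)."""
    (x0, y0), (x1, y1), (x2, y2) = points[i - k], points[i], points[i + k]
    ax, ay = x0 - x1, y0 - y1
    bx, by = x2 - x1, y2 - y1
    # atan2(cross, dot) gives the signed angle; only its magnitude is used.
    return abs(math.atan2(ax * by - ay * bx, ax * bx + ay * by))


def detect_corners(points, k, threshold):
    """Indices whose angle is below `threshold` and locally minimal."""
    n = len(points)
    angles = {i: angle_at(points, i, k) for i in range(k, n - k)}
    corners = []
    for i, a in angles.items():
        left = angles.get(i - 1, math.inf)
        right = angles.get(i + 1, math.inf)
        if a < threshold and a <= left and a <= right:
            corners.append(i)
    return corners
```

On an L-shaped sequence such as `[(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]` with `k=2` and a threshold of 2.0 radians, only the bend at index 3 is reported. Changing `k` or the threshold changes the outcome.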

Non-Patent Document 1: V. Guralnik and J. Srivastava, Event detection from time series data, In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD99), ACM Press, pp. 32-42, 1999.
Non-Patent Document 2: Digital Image Processing Editorial Committee, Digital Image Processing, CG-ARTS Association, pp. 189-190.
Non-Patent Document 3: Digital Image Processing Editorial Committee, Digital Image Processing, CG-ARTS Association, p. 190.
Non-Patent Document 4: 仁科健, Statistical Process Control, Asakura Shoten, pp. 24-56, November 25, 2009.

However, the first conventional technique detects only change points and cannot determine whether the regions other than the change points are linear or curved. In addition, although analysis is generally performed after excluding outliers, that is, values that deviate greatly from the others, this technique cannot exclude them. Furthermore, to determine t* the technique must perform parameter estimation and compute the squared error for every candidate, so the computational cost is high. Because there is no guideline for choosing the threshold δ, it also cannot be made clear on what basis a detected change point was obtained. Finally, whenever new data become available, the calculation must be redone with the existing data included.

The second conventional technique implicitly assumes that the pixel sequence deviates only slightly from a linear approximation. If the deviation is large, for example when the data fluctuate finely, the procedure may fail to converge and not terminate. The maximum-distance threshold must also be set well, but this can only be done by trial and error.

The third conventional technique has the problems that it is difficult to choose an appropriate k for the points k pixels apart, that it is difficult to set the angle threshold, and that it is difficult to set the region used when judging whether an angle is a local minimum.

None of the conventional techniques above can extract outliers or curved regions. Nor can they detect straight-line regions or change points while taking into account the fluctuation inherent in the data.

The present invention has been made in view of the above points. Its object is to provide a data analysis apparatus, method, and program that, given two-dimensional data, detect outliers, change points at which the data change, curved regions, and straight-line regions with a small amount of computation, and that can detect these points and regions early not only for data already at hand but also when data are obtained sequentially.

There is provided a data analysis apparatus for analyzing a plurality of two-dimensional data having an order relationship, comprising:
vector and angle creation means for creating vectors from two-dimensional data that are adjacent in the order relationship of the plurality of two-dimensional data, and for calculating the angle formed by adjacent vectors using the vector cross product;
statistical value calculation means for obtaining the mean of the angles and the standard deviation of the angles; and
confidence interval determination means for calculating a confidence interval from the mean and the standard deviation of the angles and determining, according to whether each angle falls within the confidence interval, whether the two-dimensional data from which that angle was calculated is an outlier or a change point candidate.

According to one aspect, change points, straight-line regions, curved regions, and outliers are detected using the angle formed by two vectors and its mean, or the sum of such angles. Vectors are created from pairs of adjacent data, and the change in the sum of the angles formed by the vectors makes it possible to determine whether points other than change points belong to a straight-line region or a curved region, and also to exclude outliers.

FIG. 1 is a configuration example of the outlier and change point detection apparatus in the first embodiment of the present invention.
FIG. 2 is a flowchart of the processing of the outlier and change point detection apparatus in the first embodiment of the present invention.
FIG. 3 is an illustration of the vectors created from data and the angles formed by the vectors in the first embodiment of the present invention.
FIG. 4 is a configuration example of the data analysis apparatus in the second embodiment of the present invention.
FIG. 5 is an example of a closed curve in the second embodiment of the present invention.
FIG. 6 is an example of the division of a closed curve in the second embodiment of the present invention.
FIG. 7 is an example of the division of a closed curve having continuous end points in the second embodiment of the present invention.
FIG. 8 is a flowchart of the preprocessing in the second embodiment of the present invention.
FIG. 9 is a diagram explaining the correspondence of numbers in the second embodiment of the present invention.
FIG. 10 is a flowchart of curved region detection in the second embodiment of the present invention.
FIG. 11 is a flowchart of change point detection in the second embodiment of the present invention.
FIG. 12 is a flowchart of outlier detection in the second embodiment of the present invention.
FIG. 13 is a flowchart of straight-line region detection in the second embodiment of the present invention.
FIG. 14 is a diagram explaining geometrically the correspondence of numbers in the second embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

[First Embodiment]
Before and after a change point the data change abruptly and by a large amount; elsewhere the data may fluctuate, but only slightly. If this small fluctuation is ignored, the data can be regarded as evolving linearly in time. Data that do not evolve linearly are handled when some transformation can make them do so. In other words, because the data can be approximated piecewise by lines, they can be regarded as always growing in the same direction over time, apart from the fluctuation.

The present invention solves the problems on the basis of this idea. A vector r_i = (t_{i+1} − t_i, x_{i+1} − x_i) is created from two adjacent data (t_i, x_i) and (t_{i+1}, x_{i+1}), and the vector cross product is calculated for adjacent vectors r_i and r_{i+1}. The angle formed by the two vectors is calculated from this cross product, a confidence interval for the angle is obtained, and an angle that falls outside the interval marks an outlier or a change point candidate. To decide whether the candidate is a change point, the mean of N angles is obtained as a moving average; if this mean falls outside its confidence interval, the point is detected as a change point rather than an outlier.

FIG. 1 is a configuration example of the outlier and change point detection apparatus in the first embodiment of the present invention.

The outlier and change point detection apparatus 100 shown in the figure includes a data input unit 110, a preprocessing unit 120, a vector creation unit 130, an angle creation unit 140, a data storage unit 150, a statistical value calculation unit 160, a parameter input unit 170, a confidence interval determination unit 180, and a result output unit 190.

The data input unit 110 receives time-series data (t_i, x_i), i = 0, 1, 2, …, n (where t_i is the time and x_i is the value observed at time t_i) and stores it in the data storage unit 150. The data may be entered from a terminal or read from stored data such as a database; the input method is not particularly limited.

The preprocessing unit 120 smooths the input data (t_i, x_i) stored in the data storage unit 150 with a moving average so that the variance of the angles θ_i becomes sufficiently small. The preprocessing unit 120 operates when the standard deviation of the angles formed by the vectors, as computed by the statistical value calculation unit 160, is large.

The vector creation unit 130 reads the data from the data storage unit 150 and creates a vector from each pair of adjacent data.

The angle creation unit 140 takes the vector data created by the vector creation unit 130, obtains the angle θ_i formed by two vectors from their cross product, and stores it in the data storage unit 150.

The statistical value calculation unit 160 acquires the angles θ_i from the data storage unit 150 and obtains their mean μ and standard deviation σ.

The parameter input unit 170 acquires the parameter used to determine the confidence interval. The parameter is entered from the operator's terminal.

The confidence interval determination unit 180 acquires the angle θ_i from the data storage unit 150 and uses the input parameter to determine whether θ_i falls within the confidence interval. If it does not, the unit judges θ_i to be an outlier, requests the statistical value calculation unit 160 to calculate a mean, obtains a confidence interval for that mean, and determines whether the mean falls within it. If the mean does not, the point is judged to be a change point.

FIG. 2 is a flowchart of the processing of the outlier and change point detection apparatus in the first embodiment of the present invention.

First, the steps from data input through vector creation to obtaining the angle formed by two vectors are described.

The parameter for determining outliers is input (step 101).

Assume that the data input unit 110 has obtained time-series data (t_i, x_i) (i = 0, 1, 2, …, n) and that new data are being added moment by moment. The obtained data are stored in the data storage unit 150 (step 102). Here t_i is the time and x_i is the value observed at time t_i.

In step 103, it is determined whether preprocessing is required. If it is, the process moves to step 104; if not, it moves to step 105. When the standard deviation is found in step 108 to be equal to or greater than a predetermined value, the state is set to "preprocessing required" in step 109, and preprocessing is therefore performed (step 104). In the preprocessing, the preprocessing unit 120 applies operations such as smoothing to the original data (t_i, x_i), i = 1, 2, …, n.

Possible preprocessing methods in the preprocessing unit 120 include smoothing the original data (t_i, x_i) (i = 0, 1, 2, …, n) with a moving average, and making the angles θ_i sufficiently small by, for example, multiplying the time t by 100 or multiplying the value x by 1/100. When the early data have not reached a steady state (a state in which the mean and standard deviation are close to 0), another method is to use only θ_i (i = m, m+1, …, n−1) rather than all the angles θ_i (i = 1, 2, …, n−1), leaving out θ_i for i = 1, 2, …, m−1.
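The two preprocessing options mentioned above can be sketched as follows. This is only an illustration: the trailing (rather than centered) window, the default factors, and the function names are assumptions, and the source does not say how the window length is chosen.

```python
def moving_average(points, window):
    """Smooth x with a trailing moving average; each output keeps the
    time stamp of the last sample in its window."""
    out = []
    for i in range(window - 1, len(points)):
        xs = [x for _, x in points[i - window + 1 : i + 1]]
        out.append((points[i][0], sum(xs) / window))
    return out


def rescale(points, t_factor=100.0, x_divisor=100.0):
    """Stretch the time axis and shrink the value axis so that the
    angles between consecutive vectors become small."""
    return [(t * t_factor, x / x_divisor) for t, x in points]
```

Stretching t or shrinking x flattens every vector toward the time axis, so the angles between consecutive vectors shrink with it. That is why either rescaling reduces the spread of θ_i.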

The vector creation unit 130 reads the data from the data storage unit 150 and creates the vectors r_i (i = 1, 2, …, n) from each pair of adjacent data (step 105).

The resulting vector data are passed to the angle creation unit 140, which calculates the vector cross products r_i × r_{i+1} (i = 1, 2, …, n−1) and obtains the angle θ_i (i = 1, 2, …, n−1) formed by each pair of vectors (step 106). The angle follows from the definition of the vector cross product, which relates the cross product to the magnitudes of the two vectors and the sine of the angle between them. The resulting angle data are stored in the data storage unit 150.
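Steps 105 and 106 can be sketched as below. The sketch assumes the usual two-dimensional form of the cross product, in which the z-component of r_i × r_{i+1} equals |r_i||r_{i+1}| sin θ_i, and recovers θ_i with the arcsine. It also assumes that consecutive data points are distinct so that no vector has zero length. The function names and the 0-based indexing are illustrative choices, not the source's.

```python
import math


def vectors(points):
    """Difference vectors between consecutive (t, x) data points."""
    return [(t1 - t0, x1 - x0)
            for (t0, x0), (t1, x1) in zip(points, points[1:])]


def signed_angles(points):
    """Signed angle between each pair of adjacent vectors, in radians."""
    rs = vectors(points)
    angles = []
    for (a_t, a_x), (b_t, b_x) in zip(rs, rs[1:]):
        cross = a_t * b_x - a_x * b_t          # z-component of a x b
        norm = math.hypot(a_t, a_x) * math.hypot(b_t, b_x)
        s = max(-1.0, min(1.0, cross / norm))  # guard against rounding
        angles.append(math.asin(s))
    return angles
```

For collinear data every angle is 0, and a turn toward larger x gives a positive angle. The arcsine only resolves angles between −π/2 and π/2, which is adequate when the preprocessing has already made the angles small.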

FIG. 3 illustrates how the vector creation unit 130 creates vectors from adjacent data and how the angle creation unit 140 creates the angles formed by adjacent vectors. Because the vector cross product is itself a vector, the angle also has a direction, as the figure shows.

Next, the statistical value calculation unit 160 reads the angle data from the data storage unit 150 and obtains the mean and standard deviation of the angles θ_i (step 107). If the newest data is (t_n, x_n), the angles θ_1, θ_2, …, θ_{n-1} have been obtained by the method described above, and their mean μ(1, n−1) and standard deviation σ(1, n−1) are calculated.

The angle θ_i is assumed to follow a normal distribution N(μ(1, n−1), σ(1, n−1)) with mean μ(1, n−1) and standard deviation σ(1, n−1). This amounts to assuming that the mean of θ_i is close to 0 and that its standard deviation is small. If the original data are nearly linear, the assumption on the mean holds. If the standard deviation is equal to or greater than a predetermined value (step 108, Yes), the process moves to step 103 and the preprocessing described above is performed in step 104.

In the following description, each angle θ_i is assumed to follow the normal distribution N(μ(1, n−1), σ(1, n−1)) with mean μ(1, n−1) and standard deviation σ(1, n−1), and the early data are also assumed to be in a steady state.

A change point can be regarded as a point from which outliers begin to appear frequently, so an outlier is a change point candidate. A method for detecting outliers is therefore described first.

θ_i (i = 1, 2, …, n−1) is obtained from the data storage unit 150. Because θ_i is assumed to follow a normal distribution, the confidence interval determination unit 180 treats θ_m (where m < n−1) as an outlier when it is not contained in the confidence interval. Here Z is the parameter input from the parameter input unit 170; from the normal distribution table, Z = 1.96 for a 95% confidence interval and Z = 2.58 for a 99% confidence interval. The confidence interval is calculated without θ_m, using only the data up to the one immediately before it. If θ_m is not an outlier, the process returns to step 103 (step 110, No).
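The outlier test can be sketched as follows, assuming the confidence interval has the usual form μ ± Zσ, with μ and σ computed from the angles that precede θ_m. The use of the population standard deviation and the function names are assumptions, since the corresponding equation is not reproduced in this text.

```python
import statistics


def confidence_interval(angles, z):
    """Interval mu - z*sigma .. mu + z*sigma for a list of angles."""
    mu = statistics.fmean(angles)
    sigma = statistics.pstdev(angles)  # population form: an assumption
    return mu - z * sigma, mu + z * sigma


def is_outlier(angles, m, z=1.96):
    """True if angles[m] lies outside the interval built from angles[:m].

    angles[m] itself is excluded from the statistics, as the text
    specifies; m must be at least 1.
    """
    lo, hi = confidence_interval(angles[:m], z)
    return not (lo <= angles[m] <= hi)
```

Because only the mean and standard deviation of earlier angles are needed, the check can be repeated cheaply each time a new data point arrives.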

If θ_m is an outlier (step 110, Yes), the statistical value calculation unit 160 calculates a mean (step 111), and the confidence interval determination unit 180 obtains a confidence interval for the mean and determines whether the mean falls within it. This processing is described in detail below.

It is determined whether θ_m is an outlier that is not a change point or is a change point (steps 110 and 112). If it is an outlier and not a change point (step 110, Yes; step 112, No), the outlier is output. The determination uses N ≥ 3 angles, for the following reason. θ_m being an outlier is caused by the original data (t_{m+1}, x_{m+1}) being an outlier, and Equation (11) shows that an outlying (t_{m+1}, x_{m+1}) affects the values of θ_m, θ_{m+1}, and θ_{m+2}. When θ_m is merely an outlier, however, the mean of the N angles starting at m remains close to 0. If θ_m is a change point, on the other hand, the angles θ_i for i > m also often fail to satisfy Equation (15), so |μ(m, m+N−1)| > 0, and the absolute value of the mean when θ_m is a change point can be expected to be larger than when θ_m is an outlier. The following theorem is known.

Theorem: If X follows a normal distribution with mean μ and standard deviation σ, then the mean of a random sample of size N follows a normal distribution with mean μ and standard deviation σ/√N.

If θ_m is a change point, then for N ≥ 3 the mean μ(m, m+N−1) of the N points from m onward does not satisfy Equation (17); if θ_m is an outlier that is not a change point, it does. This is how a change point is distinguished from an outlier. The parameter Z in Equation (17) is the same as in Equation (15) and is input from the parameter input unit 170.
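The two-stage decision can be sketched as below, assuming the interval for a single angle is μ ± Zσ and the interval for the mean of N angles is μ ± Zσ/√N, as the theorem suggests. The labels, the `"undecided"` state for a window that is not yet full, and the function name are illustrative assumptions.

```python
import math
import statistics


def classify(angles, m, n_avg=3, z=1.96):
    """Classify angles[m] as 'normal', 'outlier', or 'change point'."""
    history = angles[:m]
    mu = statistics.fmean(history)
    sigma = statistics.pstdev(history)
    # Stage 1: single-angle confidence interval.
    if abs(angles[m] - mu) <= z * sigma:
        return "normal"
    window = angles[m : m + n_avg]
    if len(window) < n_avg:
        return "undecided"  # not enough later data yet
    # Stage 2: the mean of N angles has standard deviation sigma/sqrt(N).
    mean_after = statistics.fmean(window)
    if abs(mean_after - mu) <= z * sigma / math.sqrt(n_avg):
        return "outlier"
    return "change point"
```

A single stray data point produces angles of opposite sign that roughly cancel in the mean, so it is reported as an outlier. A lasting change of direction leaves a net angle and is reported as a change point.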

To detect a change point sooner, a point may also be regarded as a change point when consecutive data fall outside the confidence interval of Equation (15) on the same side, either the positive side or the negative side.

If the point is an outlier and not a change point (step 110, Yes; step 112, No), the result output unit 190 outputs the outlier data (step 113).

If the point is an outlier (step 110, Yes) and is a change point (step 112, Yes), the result output unit 190 removes from the analysis the data up to the change point, or up to the data following the change point. That is, the data used to calculate the mean and variance in Equation (15) start anew from the data following the change point, or from the data two after it, and the data in the data storage unit 150 are updated in a batch (step 114).

The result output unit 190 outputs the change point data (step 115). When new data are input, the process moves to step 102 and the above processing is repeated.

According to this embodiment, detecting outliers and change point candidates requires only treating adjacent data as vectors, taking their cross products, and computing the mean and standard deviation of the resulting angles, so the amount of computation is small. The only parameters required are the number of data used for the mean and the confidence level of the interval, so they are easy to set.

[Second Embodiment]
In this embodiment, change points, straight-line regions, curved regions, and outliers are detected in two-dimensional data in computer graphics.

FIG. 4 shows a configuration example of the data analysis apparatus in the second embodiment of the present invention.

The data analysis apparatus 200 shown in the figure includes a data input unit 210, a control chart data creation unit 220, a data storage unit 230, a calculation unit 240, a closed curve dividing unit 250, a result output unit 260, and a control unit 270. The control chart data creation unit 220 includes a vector creation unit 221, an angle sum creation unit 222, and a group creation unit 223. The calculation unit 240 includes an R control chart calculation unit 241, which calculates the R control chart that manages the group mean and the range R (the difference between the maximum and minimum values), and an X⁻ control chart calculation unit 242, which calculates the control chart that manages measured values in time series (hereinafter written as the X⁻ control chart).

The data input unit 210 acquires two-dimensional data and stores it in the data storage unit 230 via the control unit 270. The data may be entered from a terminal or read from stored data such as a database; the input method is not particularly limited.

The data storage unit 230 stores the data input from the data input unit 210 and the results obtained by the control chart data creation unit 220, the calculation unit 240, and the closed curve dividing unit 250.

The result output unit 260 outputs the results obtained by the control chart data creation unit 220, the calculation unit 240, and the closed curve dividing unit 250.

[1]コンピュータグラフィックス特有処理：
コンピュータグラフィックスへの応用の場合、描く図形が閉曲線のように同じt_kに対して複数のd_kを持つある場合がある。この場合、閉曲線分割部２５０にて以下の処理を行う。なお、データ分析においては同じt_kに対して１つのd_kを持つため、閉曲線分割部２５０及び以下に説明する内容は不要である。
[1] Computer graphics specific processing:
In applications to computer graphics, the figure to be drawn may, like a closed curve, have a plurality of d_k for the same t_k. In this case, the closed curve division unit 250 performs the following processing. In data analysis, each t_k has exactly one d_k, so the closed curve division unit 250 and the processing described below are not needed.

閉曲線、つまりデータ（t_k,d_k）において同じt_kに対して複数のd_kを持つ場合には、同じt_kに対して一つのd_kを持つようにデータを複数の組に分ける必要がある。その分ける方法について説明する。 For a closed curve, that is, when the data (t_k, d_k) have a plurality of d_k for the same t_k, the data must be divided into a plurality of sets so that each set has one d_k for each t_k. The method of division is described below.

図５は、本発明の第２の実施の形態における閉曲線の分割の例を示す。図５において、縦軸はd_kであり、横軸はt_kを示す。図５のような同じt_kに対してd_kが連続していない場合の閉曲線の場合、t_kごとにd_kの数（但し、連続している場合は１つと数える）を計数する。図５においてd_k軸と平行な線の上に書かれている数字はその線上におけるd_kの数であり、それらの平行な線の間に書かれている数字はその領域でのd_kの数である。 FIG. 5 shows an example of dividing a closed curve in the second embodiment of the present invention. In FIG. 5, the vertical axis represents d_k and the horizontal axis represents t_k. For a closed curve such as that in FIG. 5, in which the d_k for the same t_k are not continuous, the number of d_k is counted for each t_k (a continuous run of d_k is counted as one). In FIG. 5, a number written on a line parallel to the d_k axis is the number of d_k on that line, and a number written between those parallel lines is the number of d_k in that region.

今、この平行な線の間に書かれている数字による分類を行う。結果として中央の「６」が３つ連続している領域は１つの領域となる。最もd_kの数が多い領域（図５の場合は「６」）において組ごとに図６のように番号付けを行う。図６においては、L1,L2,…,L6である。これらの組をd_kの数が「４」や「２」の領域に拡張する。拡張の仕方は各組のd_kの値に対して領域（d_k−ε，d_k＋ε）をとり、その領域内に含まれる組は同じ組と認識する。図６では、d_kの数が「４」の領域の例を示しており、L1,L3,L5,L6が該当する。このようにしてt_k対してd_kが１つになるよう組を分ける。 The regions are now classified by the numbers written between these parallel lines. As a result, the three consecutive central regions marked "6" become a single region. In the region with the largest number of d_k ("6" in the case of FIG. 5), the sets are numbered as shown in FIG. 6; in FIG. 6 they are L1, L2, ..., L6. These sets are then extended into the regions where the number of d_k is "4" or "2". To extend a set, an interval (d_k − ε, d_k + ε) is taken around the d_k value of each set, and the data falling within that interval are recognized as belonging to the same set. FIG. 6 shows an example of a region where the number of d_k is "4", to which L1, L3, L5, and L6 correspond. In this way, the sets are divided so that each has one d_k for each t_k.
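The set-extension step described above can be sketched roughly as follows. This is only an illustration under assumptions that are not in the source: the function name `split_closed_curve`, the input layout (a list of `(t, [d, ...])` columns sorted by t), and the greedy rule of matching each d to the first set whose nearest end lies within ε are all hypothetical choices.

```python
def split_closed_curve(columns, eps):
    """Split multi-valued data into sets with at most one d per t.

    columns: list of (t, [d, ...]) pairs sorted by t (hypothetical layout).
    eps:     half-width of the matching interval (d - eps, d + eps).
    """
    # Seed from the column with the most d values (the "6" region in FIG. 5).
    seed = max(range(len(columns)), key=lambda i: len(columns[i][1]))
    t_seed, ds_seed = columns[seed]
    branches = [[(t_seed, d)] for d in sorted(ds_seed)]

    def extend(order, at_end):
        # Walk away from the seed column and attach each d to a set whose
        # current end value lies within eps of it.
        for i in order:
            t, ds = columns[i]
            taken = set()
            for d in sorted(ds):
                for b, branch in enumerate(branches):
                    ref_d = branch[-1][1] if at_end else branch[0][1]
                    if b not in taken and abs(d - ref_d) < eps:
                        if at_end:
                            branch.append((t, d))
                        else:
                            branch.insert(0, (t, d))
                        taken.add(b)
                        break

    extend(range(seed + 1, len(columns)), at_end=True)   # towards larger t
    extend(range(seed - 1, -1, -1), at_end=False)        # towards smaller t
    return branches
```

Each returned set is ordered by t and contains at most one d per t, so it can be passed on to the data analysis processing as an ordinary single-valued series.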

次に、図７のような同じt_kに対してd_kが連続している場合の閉曲線の場合について考える。 Next, consider the case of a closed curve when d _k continues for the same t _k as shown in FIG.

図７のように、d_kが連続している辺の端点以外を削除し、組に分ける。組の分け方は、同じt_kに対してd_kが連続していない場合の閉曲線の場合と同様である。削除した部分については、以下に説明する処理が終了した後に再度直線で結ぶ。 As shown in FIG. 7, on each side where d_k is continuous, everything except the end points of the side is deleted, and the data are divided into sets. The sets are divided in the same way as for a closed curve in which d_k is not continuous for the same t_k. After the processing described below is finished, the deleted portions are reconnected with straight lines.

ここまでがコンピュータグラフィックス特有の処理である。 This is the processing unique to computer graphics.

以後、データ分析における処理及び上記の処理を終えた後のコンピュータグラフィックスでの処理について説明する。 Hereinafter, processing in data analysis and processing in computer graphics after finishing the above processing will be described.

[2]データ分析処理
[2-1]事前処理
図８は、本発明の第２の実施の形態における事前処理のフローチャートである。
[2] Data analysis processing
[2-1] Pre-processing
FIG. 8 is a flowchart of the pre-processing in the second embodiment of the present invention.

まずは、データ入力からベクトルの作成、２つのベクトルのなす角の和の作成、シューハートの管理図（例えば、非特許文献４参照）における群の作成までの事前処理を説明する。 First, the pre-processing is described, from data input through creation of the vectors, creation of the sum of the angles formed by pairs of vectors, and creation of the groups used in the Shewhart control chart (see, for example, Non-Patent Document 4).

事前処理は主に図４に示す管理図データ作成部２２０で実施される。 The pre-processing is mainly performed by the control chart data creation unit 220 shown in FIG.

今、データ入力部２１０から制御部２７０を介して時系列データ（t_i,d_i）,i=1,2,…,nが得られており、データ蓄積部２３０に時々刻々新たなデータが加わっている状況を想定する（ステップ２０１）。ここで、t_iは時刻データ、d_iは時刻t_iで観測された値である。 Assume that time-series data (t_i, d_i), i = 1, 2, ..., n, have been obtained from the data input unit 210 via the control unit 270, and that new data are continually being added to the data storage unit 230 (step 201). Here, t_i is the time and d_i is the value observed at time t_i.

ベクトル作成部２２１では、データ蓄積部２３０からデータ（t_i,d_i）を読み出し、隣接する２つのデータから次のようなベクトルv_i (i=1,2,3,…,n)を作成する（ステップ２０２）。 The vector creation unit 221 reads data (t _i , d _i ) from the data storage unit 230 and creates the following vectors v _i (i = 1, 2, 3,..., N) from two adjacent data. (Step 202).

ベクトル作成部２２１は、得られたベクトルをなす角の和作成部２２２に送り、なす角の和作成部２２２は、ベクトルの外積

The vector creation unit 221 sends the obtained vectors to the angle-sum creation unit 222, and the angle-sum creation unit 222 calculates the vector cross product

を計算し（ステップ２０３）、その２つのベクトルのなす角θ_i (i=3,4,…,n)を求める（ステップ２０４）。なす角を求めるには、次のようにして行う。ベクトルの外積の定義から、

(step 203) and obtains the angle θ_i (i = 3, 4, ..., n) formed by the two vectors (step 204). The angle formed is obtained as follows. From the definition of the vector cross product,

が成り立つ。結果、

holds. As a result,

が得られる。隣接するデータからベクトルを作成し、隣接するベクトルのなす角を作成するイメージ図を図３に示す。ベクトル外積はベクトルであるためなす角についても方向があることも図３に示している。

is obtained. FIG. 3 illustrates how vectors are created from adjacent data and how the angle formed by adjacent vectors is obtained. FIG. 3 also shows that, because the vector cross product is itself a vector, the angle formed has a direction as well.

よって２つのベクトルのなす角は、 So the angle between the two vectors is

と求められる。但し、

is obtained, where

である。なす角の和

holds. The sum of the angles formed

は、

is obtained as

として求める（ステップ２０５）。

(step 205).
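The pre-processing of steps 202 to 205 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's exact formulation: the function names are hypothetical, each vector is taken as the difference between a point and its predecessor, and the signed angle is computed with `atan2` of the cross and dot products instead of the arcsine form derived in the text (the two agree for turns smaller than 90 degrees).

```python
import math

def difference_vectors(points):
    """Vectors between consecutive (t, d) points (step 202)."""
    return [(t1 - t0, d1 - d0) for (t0, d0), (t1, d1) in zip(points, points[1:])]

def signed_angle(u, v):
    """Signed angle from u to v; positive means a counter-clockwise turn."""
    cross = u[0] * v[1] - u[1] * v[0]   # z-component of the 2-D cross product (step 203)
    dot = u[0] * v[0] + u[1] * v[1]
    return math.atan2(cross, dot)       # assumption: atan2 instead of arcsine (step 204)

def cumulative_angles(points):
    """Return (angles, running sums of the angles) for a point sequence (step 205)."""
    vecs = difference_vectors(points)
    angles = [signed_angle(u, v) for u, v in zip(vecs, vecs[1:])]
    sums, total = [], 0.0
    for a in angles:
        total += a
        sums.append(total)
    return angles, sums
```

For collinear input every angle is zero, so the running sum stays flat; a steady drift in the running sum indicates that the direction of the data is gradually turning.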

群作成部２２３は、X⁻−R管理図で用いる群を作成する（ステップ２０６）。作成の方法を説明する。群の大きさ（１つの群の中のデータの数）を「２」として説明する。 The group creation unit 223 creates the groups used in the X-bar-R control chart (step 206). The creation method is described below, taking the group size (the number of data in one group) to be "2".

群をg₄,g₅,…,g_nとすると、 Letting the groups be g_4, g_5, ..., g_n,

となる。以後、一つ一つの群を「点」と表現する。得られた群の点は、データ蓄積部２３０に記憶される。このg₄,g₅,…,g_nは、分析前においては曲線領域、変化点、外れ点、直線領域いずれにも該当していないので、未分類点とする（ステップ２０７）。

holds. Hereinafter, each individual group is referred to as a "point". The resulting group points are stored in the data storage unit 230. Before the analysis, g_4, g_5, ..., g_n belong to none of the curved regions, change points, outliers, or straight-line regions, so they are treated as unclassified points (step 207).
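Group creation with group size 2 might look like the sketch below. The overlapping (sliding-window) grouping is an assumption inferred from the index ranges in the text (angle sums for k = 3, ..., n but groups for k = 4, ..., n); the function names are hypothetical.

```python
def make_groups(cum_angles, size=2):
    """Overlapping groups of consecutive angle sums (assumed sliding window)."""
    return [tuple(cum_angles[i:i + size]) for i in range(len(cum_angles) - size + 1)]

def group_means(groups):
    """Within-group average, the quantity plotted on the X-bar chart."""
    return [sum(g) / len(g) for g in groups]

def group_ranges(groups):
    """Within-group range (maximum minus minimum), plotted on the R chart."""
    return [max(g) - min(g) for g in groups]
```

With this grouping, n − 2 angle sums yield n − 3 groups, which matches the numbering g_4, ..., g_n used in the text.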

ここで、元データd_k、ベクトルr_k、なす角の和 Here, FIG. 9 shows an example of how the numbering of the original data d_k, the vectors r_k, the sums of the angles formed

群g_kの番号付けの対応の例を図９に示す。元データd_k,d_k+1からベクトルr_kが作成される。ベクトルr_k、r_k+1からなす角の和

and the groups g_k correspond. The vector r_k is created from the original data d_k and d_{k+1}. From the vectors r_k and r_{k+1}, the sum of the angles formed

が作成される。

is created. From

から群g_kが作成され、データ蓄積部２３０に格納する。

the group g_k is created and stored in the data storage unit 230.

[3]曲線領域、変化点、外れ値、直線領域の検出
次に、計算部２４０で曲線領域、変化点、外れ値、直線領域を検出する方法を説明する。
[3] Detection of curved regions, change points, outliers, and straight-line regions
Next, the method by which the calculation unit 240 detects curved regions, change points, outliers, and straight-line regions is described.

検出する順序として、初めに曲線領域を検出し、次に変化点を検出し、外れ値を検出し、最後に直線領域を検出する。 As a detection order, first, a curved region is detected, then a change point is detected, an outlier is detected, and finally a straight region is detected.

[3-1]曲線領域の検出
曲線領域の検出について説明する。
[3-1] Detection of curved regions
The detection of curved regions is described.

計算部２４０のX⁻管理図計算部２４２は、X⁻管理図を使用し、群の各点から群内部での平均 The X-bar control chart calculation unit 242 of the calculation unit 240 uses the X-bar control chart and obtains, from each point of the groups, the within-group average

を求める。群の大きさが「２」である場合には、

When the group size is "2",

となる。X⁻管理図計算部２４２は、これに対して、X⁻管理図を作成し、データ蓄積部２３０に格納する。X⁻管理図の作成方法は、例えば、前述の非特許文献４に記載の方法を利用することが可能である。

holds. The X-bar control chart calculation unit 242 creates an X-bar control chart for these averages and stores it in the data storage unit 230. The X-bar control chart can be created, for example, by the method described in Non-Patent Document 4 mentioned above.

図１０は、本発明の第２の実施の形態における曲線領域検出のフローチャートである。 FIG. 10 is a flowchart of the curve area detection in the second embodiment of the present invention.

X⁻管理図計算部２４２は、 The X-bar control chart calculation unit 242 determines whether there is a section in which the value of

の値が６点（シューハートの管理図による）以上連続して増加（減少）している区間があるかを判定する（ステップ３０１）。ここでなければ（ステップ３０１、No）、「曲線領域は無し」として変化点の検出の処理（ステップ４０１）に移行する。ある場合（ステップ３０１、Yes）は、それらの点を曲線構成点とする。また、この領域を曲線領域とし、データ蓄積部２３０に格納する（ステップ３０２）。

increases (or decreases) continuously for six or more points (following the Shewhart control chart) (step 301). If there is none (step 301, No), it concludes that there is no curved region and proceeds to the change point detection processing (step 401). If there is one (step 301, Yes), those points are taken as curve constituent points, and the region is taken as a curved region and stored in the data storage unit 230 (step 302).

なす角の和が増加しているということは徐々に方向が変わっていることを示しており、元データは曲線になっている。曲線領域と隣接する未分類点領域で連続しない単独の点を除くことで曲線領域と合わせて増加（減少）となっている点（拡張曲線構成点）があるかどうか判定する（ステップ３０３）。無ければ（ステップ３０３、No）、曲線領域は既に検出された曲線領域のみとなり、ステップ４０１に移行する。もしあれば（ステップ３０３、Yes）、拡張曲線構成点を含めて曲線領域としてデータ蓄積部２３０に格納し、ステップ４０１に移行する（ステップ３０４）。なお、データの性質により６点以上の連続の数値を他の数値にしてもよい。当該数値の変更は、値が小さいと、わずかな連続増加または減少があるとそれらを曲線領域として検出してしまい、逆に大きすぎると、増加または減少領域がある程度長期に亘らないと検出できなくなってしまうため、６点を基本としてデータと変化点抽出の目的に合わせて変更する。 An increasing sum of the angles formed indicates that the direction is changing gradually, that is, the original data form a curve. It is then determined whether, after removing isolated non-consecutive points from the unclassified point region adjacent to the curved region, there are points that continue the increase (or decrease) of the curved region (extended curve constituent points) (step 303). If there are none (step 303, No), the curved region consists only of the region already detected, and the process proceeds to step 401. If there are some (step 303, Yes), the curved region including the extended curve constituent points is stored in the data storage unit 230, and the process proceeds to step 401 (step 304). Depending on the nature of the data, the threshold of six consecutive points may be changed to another value. If the value is too small, even a brief run of increases or decreases is detected as a curved region; if it is too large, an increasing or decreasing region cannot be detected unless it persists for a fairly long time. Six points is therefore used as the baseline and adjusted to suit the data and the purpose of the change point extraction.
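The run test of step 301 can be sketched as follows. The function name and the return format (inclusive index ranges) are hypothetical, and the extension of step 303 (absorbing neighbouring points after dropping isolated ones) is deliberately left out of this sketch.

```python
def monotone_runs(values, min_points=6):
    """Find maximal runs of strictly increasing or strictly decreasing values.

    Returns inclusive (start, end) index pairs for runs that contain at
    least min_points points (six by default, as in the text).
    """
    runs = []
    start, direction = 0, 0          # direction: +1 rising, -1 falling, 0 none
    for i in range(1, len(values) + 1):
        if i < len(values):
            diff = values[i] - values[i - 1]
            step = (diff > 0) - (diff < 0)
        else:
            step = None              # sentinel: flush the final run
        if step is not None and step != 0 and step == direction:
            continue                 # the current run keeps going
        if direction != 0 and i - start >= min_points:
            runs.append((start, i - 1))
        if step is None:
            break
        start = i - 1 if step != 0 else i
        direction = step
    return runs
```

Applied to the within-group averages, each returned range is a candidate curved region; lowering `min_points` makes the test more sensitive, as the paragraph above notes.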

[3-2]変化点の検出
次に変化点の検出について説明する。
[3-2] Detection of change points
Next, the detection of change points is described.

図１１は、本発明の第２の実施の形態における変化点検出のフローチャートである。 FIG. 11 is a flowchart of change point detection according to the second embodiment of the present invention.

変化点の検出は、R管理図計算部２４１によって行われる。 The change point is detected by the R control chart calculation unit 241.

R管理図計算部２４１は、データ蓄積部２３０から読み込んだ群の各点から群内部での最大値から最小値を引いたレンジR_kを求める。群の大きさが「２」の場合には、 The R control chart calculation unit 241 obtains, from each point of the groups read from the data storage unit 230, the range R_k, that is, the maximum value within the group minus the minimum value. When the group size is "2",

となる。連続する未分類点を各未分類点領域としてR管理図を作成し、データ蓄積部２３０に格納する(ステップ４０１）。R管理図の作成方法は、例えば、X⁻管理図と同様に、非特許文献４の技術を利用することが可能である。各R管理図において、R管理限界外の点があるかを判定する（ステップ４０２）。無い場合（ステップ４０２、No）は変化点が存在しないため、ステップ５０１に移行する。ある場合（ステップ４０２、Yes）は、このR管理限界外の点をR変化点とする（ステップ４０３）。

holds. Treating each run of consecutive unclassified points as an unclassified point region, an R control chart is created for each region and stored in the data storage unit 230 (step 401). As with the X-bar control chart, the R control chart can be created, for example, using the technique of Non-Patent Document 4. For each R control chart, it is determined whether any point lies outside the R control limits (step 402). If there is none (step 402, No), no change point exists and the process proceeds to step 501. If there is one (step 402, Yes), each point outside the R control limits is taken as an R change point (step 403).
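Steps 401 to 403 can be illustrated with the sketch below. The text defers the chart construction to Non-Patent Document 4, so the limits used here are an assumption: the conventional Shewhart R chart limits D3·R̄ and D4·R̄ with the tabulated constants for subgroup size 2. Function names are hypothetical.

```python
# Standard Shewhart constants for subgroup size 2 (assumed; the source only
# refers to Non-Patent Document 4 for how the chart is built).
D3_N2 = 0.0
D4_N2 = 3.267

def r_chart_limits(ranges):
    """Lower and upper R control limits for one unclassified point region."""
    r_bar = sum(ranges) / len(ranges)
    return D3_N2 * r_bar, D4_N2 * r_bar

def r_change_points(ranges):
    """Indices of points outside the R control limits (steps 402 and 403)."""
    lower, upper = r_chart_limits(ranges)
    return [i for i, r in enumerate(ranges) if r < lower or r > upper]
```

For example, nine ranges of 0.1 followed by one range of 2.0 give an upper limit of about 0.95, so only the last point is flagged as an R change point.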

次に、R変化点に挟まれている単独のR管理限界内の点はあるかを判定する（ステップ４０４）。無い場合（ステップ４０４、No）には、データ蓄積部２３０から再度連続する未分類の各領域を読み出して、当該各領域に対して群の大きさ「２」のR管理図を作成し、上記の処理を繰り返す。ある場合（ステップ４０４、Yes）にはその挟まれたR管理限界内の点をR変化点として再度連続する未分類点の各領域に対して群の大きさ「２」のR管理図を作成し、上記の処理を繰り返す（ステップ４０５）。 Next, it is determined whether there is a single point within the R control limits that is sandwiched between R change points (step 404). If there is none (step 404, No), each region of consecutive unclassified points is read again from the data storage unit 230, an R control chart with group size "2" is created for each region, and the above processing is repeated. If there is one (step 404, Yes), the sandwiched point within the R control limits is also taken as an R change point, an R control chart with group size "2" is again created for each region of consecutive unclassified points, and the above processing is repeated (step 405).

[3-3]外れ点の検出
次に、外れ点の検出について説明する。
[3-3] Detection of outliers
Next, the detection of outliers is described.

外れ点の検出は、R管理図計算部２４１によって行われる。 Detection of outliers is performed by the R control chart calculation unit 241.

図１２は、本発明の第２の実施の形態における外れ点検出のフローチャートである。 FIG. 12 is a flowchart of outlier detection in the second embodiment of the present invention.

R管理図計算部２４１は、データ蓄積部２３０から読み出したR管理図において２つの連続したR変化点を外れ候補点とする（ステップ５０１）。外れ候補点があるかを判定する（ステップ５０２）。無ければ（ステップ５０２、No)、ステップ６０１に移行する。あれば（ステップ５０２、Yes）、時間的に最も早い外れ候補点を選ぶ（ステップ５０３）。選択されたその外れ候補点を挟む２つの未分類点領域を仮に１つの未分類点領域と見做し、R管理図（未分類点領域の仮連結）を作成する（ステップ５０４）。但し、その外れ候補点はこのR管理図作成時には使用しない。連結された未分類点領域は全てR管理限界内かを判定（未分類点領域の連結確認）する（ステップ５０５）。少なくとも１つ以上のR管理限界外の点があり、未分類点領域の連結が不可能な場合（ステップ５０５、No）、２つの未分類点領域は連結せずに外れ候補点はR変化点とする（ステップ５０６）。すべてR管理限界内であり、未分類点領域が連結可能の場合（ステップ５０５、Yes）、２つの未分類点領域を連結し、外れ候補点を外れ点とする（ステップ５０７）。外れ候補点があるかどうかを判定しなければ（ステップ５０２、No)、ステップ６０１に移行し、あれば次に時間的に最も早い外れ候補点を選び（ステップ５０３）、同様のことを繰り返す。 The R management chart calculation unit 241 sets two consecutive R change points as outlier candidate points in the R management chart read from the data storage unit 230 (step 501). It is determined whether there is an outlier candidate point (step 502). If not (No in Step 502), the process proceeds to Step 601. If there is (Yes in Step 502), the earliest candidate point in time is selected (Step 503). The two unclassified point regions sandwiching the selected outlier candidate points are regarded as one unclassified point region, and an R control chart (temporary connection of unclassified point regions) is created (step 504). However, the outlier candidate points are not used when creating this R chart. It is determined whether all connected unclassified point regions are within the R control limit (confirmation of connection of unclassified point regions) (step 505). If there is at least one point outside the R control limit and the unclassified point area cannot be connected (No in step 505), the two unclassified point areas are not connected and the candidate point is the R change point (Step 506). When all are within the R control limit and the unclassified point regions can be connected (Yes in Step 505), the two unclassified point regions are connected, and the outlier candidate points are set as outliers (Step 507). 
It is then determined again whether any outlier candidate point remains (step 502). If none remains (step 502, No), the process proceeds to step 601; if one remains, the next earliest outlier candidate point is selected (step 503) and the same processing is repeated.
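The tentative merge of steps 504 to 507 can be sketched as below. As before, the R chart limit uses the conventional constant D4 = 3.267 for subgroup size 2, which is an assumption rather than something stated in the source; the function names and the string labels are hypothetical.

```python
D4_N2 = 3.267  # standard Shewhart constant for subgroup size 2 (assumed)

def all_within_r_limits(ranges):
    """True if every range lies within the R control limits of its own chart."""
    r_bar = sum(ranges) / len(ranges)
    # The lower limit is zero for subgroup size 2, so only the upper limit matters.
    return all(r <= D4_N2 * r_bar for r in ranges)

def classify_candidate(left_ranges, right_ranges):
    """Decide whether an outlier candidate is an outlier or an R change point.

    left_ranges / right_ranges: group ranges of the unclassified point regions
    on either side of the candidate; the candidate itself is excluded (step 504).
    """
    merged = list(left_ranges) + list(right_ranges)
    if all_within_r_limits(merged):
        return "outlier"        # regions can be joined (step 507)
    return "change point"       # regions stay separate (step 506)
```

The idea is that a true outlier leaves the surrounding data statistically unchanged, so the two neighbouring regions still form a single in-control chart once the candidate is ignored.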

[3-4]直線領域の検出
最後に直線領域の検出について説明する。
[3-4] Detection of straight-line regions
Finally, the detection of straight-line regions is described.

直線領域の検出は、X⁻管理図計算部２４２によって行われる。 Straight-line regions are detected by the X-bar control chart calculation unit 242.

図１３は、本発明の第２の実施の形態における直線領域検出のフローチャートである。 FIG. 13 is a flowchart of straight line area detection according to the second embodiment of the present invention.

X⁻管理図計算部２４２は、データ蓄積部２３０から読み出した連続する未分類点を未分類点領域として各未分類点領域に対してX⁻管理図を作成し、データ蓄積部２３０に格納する（ステップ６０１）。データ蓄積部２３０から読み出した各X⁻管理図で管理限界外の点があるかを判定する（ステップ６０２）。管理限界外の点があれば（ステップ６０２、Yes）、X⁻管理限界外の点を The X-bar control chart calculation unit 242 treats each run of consecutive unclassified points read from the data storage unit 230 as an unclassified point region, creates an X-bar control chart for each unclassified point region, and stores the charts in the data storage unit 230 (step 601). For each X-bar control chart read from the data storage unit 230, it determines whether any point lies outside the control limits (step 602). If there is such a point (step 602, Yes), the point outside the X-bar control limits is taken as

変化点（以下、X⁻変化点と記す）とする。X⁻変化点によって未分類領域が分割され、各未分類領域に対してX⁻管理図を作成し、データ蓄積部２３０に格納する（ステップ６０３）。各X⁻管理図で管理限界外の点がなくなるまで上記の処理を繰り返す。

a change point (hereinafter, an X-bar change point). The unclassified point regions are divided at the X-bar change points, and an X-bar control chart is created for each resulting unclassified point region and stored in the data storage unit 230 (step 603). This processing is repeated until no X-bar control chart has a point outside its control limits.

管理限界外の点がない場合（ステップ６０２、No）、未分類点を直線構成点として未分類点領域を直線領域とし（ステップ６０４）、各直線領域に対して直線を引く（ステップ６０５）。 If no point lies outside the control limits (step 602, No), the unclassified points are taken as straight-line constituent points and each unclassified point region is taken as a straight-line region (step 604), and a straight line is drawn for each straight-line region (step 605).
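The repeated splitting of steps 601 to 604 can be sketched as follows. The control limits are an assumption: the conventional Shewhart X-bar limits, grand mean ± A2·R̄ with A2 = 1.880 for subgroup size 2, since the source only refers to Non-Patent Document 4. Function names and the half-open index bookkeeping are hypothetical.

```python
A2_N2 = 1.880  # standard Shewhart constant for subgroup size 2 (assumed)

def xbar_out_of_control(means, ranges):
    """Indices of group means outside the X-bar control limits (step 602)."""
    centre = sum(means) / len(means)
    r_bar = sum(ranges) / len(ranges)
    lower, upper = centre - A2_N2 * r_bar, centre + A2_N2 * r_bar
    return [i for i, m in enumerate(means) if m < lower or m > upper]

def split_straight_regions(means, ranges):
    """Split at X-bar change points until every region is in control.

    Returns (regions, change_points): regions are inclusive index ranges
    treated as straight-line regions, change_points are the removed indices.
    """
    pending = [(0, len(means))]          # half-open index ranges still to check
    regions, change_points = [], []
    while pending:
        lo, hi = pending.pop()
        if hi <= lo:
            continue
        bad = [lo + i for i in xbar_out_of_control(means[lo:hi], ranges[lo:hi])]
        if not bad:
            regions.append((lo, hi - 1))  # step 604: straight-line region
            continue
        change_points.extend(bad)         # step 603: X-bar change points
        prev = lo
        for b in bad:
            pending.append((prev, b))
            prev = b + 1
        pending.append((prev, hi))
    return sorted(regions), sorted(change_points)
```

Each pass removes at least one point from a region, so the loop always terminates; the surviving regions are the ones for which a straight line is then fitted (step 605).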

[4]元データ、ベクトル、なす角、なす角の和、群の関係
次に、元データ（t_k,d_k） (k=1,2,…,n)，ベクトルv_k (k=2,3,…,n)、なす角θ_k(k=3,4,…,n)、なす角の和
[4] Relationship among the original data, vectors, angles formed, sums of the angles formed, and groups
Next, the relationships among the original data (t_k, d_k) (k = 1, 2, ..., n), the vectors v_k (k = 2, 3, ..., n), the angles formed θ_k (k = 3, 4, ..., n), the sums of the angles formed

群g_k(k=4,5,…n)との関係を整理する。

and the groups g_k (k = 4, 5, ..., n) are summarized.

これらの関係を図９を用いて説明する。元データ(t_k−3,d_k−3)，（t_k−2,d_k−2）からベクトルv_k-2が作られ、v_k-2とv_k-1からなす角θ_k-1が作られ、結果としてなす角の和 These relationships are described with reference to FIG. 9. The vector v_{k-2} is created from the original data (t_{k-3}, d_{k-3}) and (t_{k-2}, d_{k-2}), the angle θ_{k-1} is formed from v_{k-2} and v_{k-1}, and as a result the sum of the angles formed

が作られ、なす角の和

is created; FIG. 9 expresses that, from the sums of the angles formed

と

and

から群g_kが作られることを表現している。図１４は、幾何学的にこれらの関係を示したものである。

the group g_k is created. FIG. 14 shows these relationships geometrically.

これらの図を参考にある群g_kまで、あるいはg_kから曲線領域だった場合には元データではそれぞれd_kまでが曲線領域、d_k+1からが曲線領域であることがわかる。 Referring to these figures, if a curved region extends up to a group g_k, the curved region in the original data extends up to d_k; if a curved region starts from g_k, the curved region in the original data starts from d_{k+1}.

同様に、ある群g_kまで、あるいはg_kからが直線領域だった場合には元データではそれぞれd_kまでが直線領域、d_k+1からが直線領域であることがわかる。 Similarly, if a straight-line region extends up to a group g_k, the straight-line region in the original data extends up to d_k; if it starts from g_k, the straight-line region in the original data starts from d_{k+1}.

変化点についても同様にある群g_kまで、あるいはg_kから変化点が複数連続した場合には元データではそれぞれd_kまでが変化点領域、d_k+1からが変化点領域である。 The same applies to change points: if a run of consecutive change points extends up to a group g_k, the change point region in the original data extends up to d_k; if the run starts from g_k, the change point region in the original data starts from d_{k+1}.

変化点を1つのデータで構成したい場合（折れ線状になる）を以下に説明する。 The case in which a change point is to be represented by a single data point (giving a polygonal line) is described below.

変化点を1つのデータで構成させるための手順を説明するには、直線領域及び曲線領域でどのように線を引くかが必要なため、まずそれらを説明する。 Explaining the procedure for representing a change point by a single data point requires knowing how lines are drawn in straight-line regions and curved regions, so these are described first.

直線領域での直線を引く方法は、
d_k=bt_k+c (25)
として、例えば最小二乗法でパラメータb，cを求める。
To draw a straight line in a straight-line region, the line is written as
d_k = b t_k + c (25)
and the parameters b and c are obtained, for example, by the method of least squares.
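A least-squares fit of equation (25) can be written with the closed-form normal-equation solution, as in the sketch below. The function name is hypothetical, and the sketch assumes the region contains at least two distinct t values.

```python
def fit_line(ts, ds):
    """Least-squares estimates (b, c) for d = b*t + c, as in equation (25)."""
    n = len(ts)
    t_mean = sum(ts) / n
    d_mean = sum(ds) / n
    s_tt = sum((t - t_mean) ** 2 for t in ts)
    s_td = sum((t - t_mean) * (d - d_mean) for t, d in zip(ts, ds))
    b = s_td / s_tt            # slope; requires at least two distinct t values
    c = d_mean - b * t_mean    # intercept
    return b, c
```

For the points (0, 1), (1, 3), (2, 5), (3, 7) this returns the slope 2 and intercept 1.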

曲線領域での線の引き方は、 To draw a line in a curved region, the data are approximated, for example, by a quadratic expression such as

というように、例えば2次式で近似し、最小二乗法等によりパラメータa，b，cを求める。

and the parameters a, b, and c are obtained by the method of least squares or the like.
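A quadratic least-squares fit can be sketched without external libraries by solving the 3×3 normal equations. The model form d = a·t² + b·t + c is an assumption consistent with the parameters a, b, c named in the text; the function name is hypothetical, and the sketch assumes at least three distinct t values.

```python
def fit_quadratic(ts, ds):
    """Least-squares estimates (a, b, c) for d = a*t**2 + b*t + c (assumed form)."""
    s = [sum(t ** k for t in ts) for k in range(5)]                 # sums of t^k
    r = [sum(d * t ** k for t, d in zip(ts, ds)) for k in range(3)]  # sums of d*t^k
    # Augmented normal equations for the unknowns (a, b, c).
    m = [[s[4], s[3], s[2], r[2]],
         [s[3], s[2], s[1], r[1]],
         [s[2], s[1], s[0], r[0]]]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, 3):
            factor = m[row][col] / m[col][col]
            for j in range(col, 4):
                m[row][j] -= factor * m[col][j]
    # Back substitution.
    x = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):
        acc = m[row][3] - sum(m[row][j] * x[j] for j in range(row + 1, 3))
        x[row] = acc / m[row][row]
    return tuple(x)
```

Data generated exactly from t² + 2t + 3 recover the coefficients (1, 2, 3) up to rounding.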

まず、２つの直線領域に挟まれている変化点を1つのデータで構成させるための手順を説明する。２つの直線領域の直線の交点が複数の変化点t_kの最小値と最大値の間にある場合は、その交点を変化点とする。2つの直線領域の直線の交点が複数の変化点のt_kの最小値と最大値の間にない場合は、複数の変化点の左側にある直線領域で引いた直線上で直線領域でも最も大きなt_kのときの直線の値と複数の変化点の右側にある直線領域で引いた直線上で直線領域でも最も小さいt_kのときの直線の値を直線で結ぶ。 First, the procedure for representing the change points sandwiched between two straight-line regions by a single data point is described. If the intersection of the fitted lines of the two straight-line regions lies between the minimum and maximum t_k of the plurality of change points, that intersection is taken as the change point. If the intersection does not lie between the minimum and maximum t_k of the change points, the value of the left-hand line at the largest t_k of the straight-line region to the left of the change points and the value of the right-hand line at the smallest t_k of the straight-line region to the right of the change points are connected by a straight line.
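The two cases above can be sketched as follows. The function name and argument layout are hypothetical, and treating "between" as inclusive of the end values is an assumption.

```python
def join_lines(left, right, t_left_end, t_right_start, t_cp_min, t_cp_max):
    """Join two fitted lines across a run of change points.

    left, right:            (b, c) pairs of the lines d = b*t + c
    t_left_end:             largest t of the left straight-line region
    t_right_start:          smallest t of the right straight-line region
    t_cp_min, t_cp_max:     smallest and largest t among the change points
    Returns one vertex (the intersection) or two vertices (a connecting segment).
    """
    (b1, c1), (b2, c2) = left, right
    if b1 != b2:                                   # parallel lines never intersect
        t_x = (c2 - c1) / (b1 - b2)
        if t_cp_min <= t_x <= t_cp_max:            # assumption: inclusive bounds
            return [(t_x, b1 * t_x + c1)]
    return [(t_left_end, b1 * t_left_end + c1),
            (t_right_start, b2 * t_right_start + c2)]
```

A single returned vertex gives the polygonal-line form with one change point; two vertices give the fallback segment that bridges the two regions.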

複数の変化点を挟んでいる領域が直線領域ではなく曲線領域である場合には、複数の変化点と隣接している曲線領域の点を（t_k,d_k）とすると、次の直線 If the regions on either side of the plurality of change points are curved regions rather than straight-line regions, then, letting (t_k, d_k) be the point of the curved region adjacent to the change points, the following straight line

を引く。直線領域で複数の変化点が挟まれる場合と同様に交点が複数の変化点のt_kの最小値と最大値の間にある場合は、その交点を変化点とする。最小値と最大値の間にない場合には、複数の変化点の左側になる曲線領域で引いた曲線上で曲線領域でも最も大きいt_kの時の曲線の値と複数の変化点の右側にある直線領域で引いた曲線上で曲線領域でも最も小さいt_kの時の曲線の値とを直線で結ぶ。

is drawn. As in the case where the change points are sandwiched between straight-line regions, if the intersection lies between the minimum and maximum t_k of the change points, that intersection is taken as the change point. Otherwise, the value of the left-hand curve at the largest t_k of the curved region to the left of the change points and the value of the right-hand curve at the smallest t_k of the curved region to the right of the change points are connected by a straight line.

複数の変化点が直線領域と曲線領域で挟まれる場合には、各領域において今まで説明してきた手順で線を引き、交点の扱いについても今まで説明してきたとおりとする。 When a plurality of change points are sandwiched between a straight line region and a curved region, a line is drawn in the procedure described so far in each region, and the handling of the intersection point is as described above.

データ分析においては、特に時系列データの場合にはデータが逐次入力されていく状況がある。新たにデータが入手される度に今まで説明してきた手順をすべて実施してもよいが、いくつかの場合では計算量を軽減することが可能である。 In data analysis, there is a situation in which data is sequentially input especially in the case of time-series data. Every time new data is acquired, all the procedures described so far may be performed. However, in some cases, the amount of calculation can be reduced.

直近の点がX⁻管理領域内である場合やR管理領域内である場合にはその領域に新たな点を加えて判定すればよく、全ての点について再度判定する必要はない。 If the most recent point lies within the X-bar control region or the R control region, it suffices to add the new point to that region and make the determination; there is no need to re-evaluate all points.

しかし、直近の点がそれ以外の場合には、全ての点について再度判定を実施する。 However, if the most recent point is other than that, the determination is performed again for all points.

上記のように本実施の形態によれば、判定に品質管理で使われているシューハートX⁻−R管理図自体にパラメータ設定が既になされているため、従来技術のようにパラメータを設定する必要がない。計算量は隣接するデータをベクトルとして扱いそのベクトル外積を求め、そのなす角の和を求め、X⁻−R管理図における群（なす角の和のデータをいくつかまとめたもの）の平均、範囲（レンジ）、及びそれらの平均を求めるだけであるので、計算量が少ない。また、新たなデータを入手したときには新たなデータがX⁻−R管理図の管理限界内か外かを判断するだけであるため、新たな計算無しでも直線領域内の点かどうかの判定は可能である。 As described above, according to the present embodiment, the Shewhart X-bar-R control chart used for the determination, which comes from quality control, already has its parameters set, so there is no need to set parameters as in the prior art. The computation consists only of treating adjacent data as vectors, obtaining their vector cross products, obtaining the sums of the angles formed, and obtaining, for the groups of the X-bar-R control chart (each a collection of several angle-sum values), the averages, the ranges, and the averages of those; the amount of calculation is therefore small. Furthermore, when new data are obtained, it is only necessary to determine whether the new data fall inside or outside the control limits of the X-bar-R control chart, so whether the point belongs to a straight-line region can be determined without any new calculation.

上記の処理で求められた、変化点、曲線領域、直線領域、外れ点を結果出力部２６０から出力する。 The change points, curved regions, straight-line regions, and outliers obtained by the above processing are output from the result output unit 260.

なお、上記の図１に示すデータ分析装置１００及び図４に示すデータ分析装置２００の動作をプログラムとして構築し、外れ値、変化点、曲線領域、直線領域を検出する装置として利用されるコンピュータにインストールして実行させる、または、ネットワークを介して流通させることが可能である。 In addition, the operation of the data analysis device 100 shown in FIG. 1 and the data analysis device 200 shown in FIG. 4 is constructed as a program, and the computer is used as a device that detects outliers, change points, curved regions, and straight regions. It can be installed and executed, or distributed via a network.

本発明は、上記の実施の形態に限定されることなく特許請求の範囲内において、種々変更・応用が可能である。 The present invention is not limited to the above-described embodiments, and various modifications and applications are possible within the scope of the claims.

DESCRIPTION OF SYMBOLS
１００データ分析装置 100 Data analysis apparatus
１１０データ入力部 110 Data input unit
１２０事前処理部 120 Pre-processing unit
１３０ベクトル作成部 130 Vector creation unit
１４０なす角作成部 140 Angle creation unit
１５０データ蓄積部 150 Data storage unit
１６０統計値計算部 160 Statistical value calculation unit
１７０パラメータ入力部 170 Parameter input unit
１８０信頼区間判定部 180 Confidence interval determination unit
１９０結果出力部 190 Result output unit
２００データ分析装置 200 Data analysis apparatus
２１０データ入力部 210 Data input unit
２２０管理図データ作成部 220 Control chart data creation unit
２２１ベクトル作成部 221 Vector creation unit
２２２なす角の和作成部 222 Angle-sum creation unit
２２３群作成部 223 Group creation unit
２３０データ蓄積部 230 Data storage unit
２４０計算部 240 Calculation unit
２４１ R管理図計算部 241 R control chart calculation unit
２４２ X⁻管理図計算部 242 X-bar control chart calculation unit
２５０閉曲線分割部 250 Closed curve division unit
２６０結果出力部 260 Result output unit

Claims

A data analysis apparatus for analyzing a plurality of two-dimensional data having an order relationship, comprising:
vector angle creating means for creating vectors from two-dimensional data adjacent in the order relationship of the plurality of two-dimensional data, and calculating the angle formed by adjacent vectors using a vector cross product;
statistical value calculating means for obtaining an average of the angles formed and a standard deviation of the angles formed; and
confidence interval determination means for calculating a confidence interval from the average of the angles formed and the standard deviation of the angles formed, and determining, depending on whether each angle formed is included in the confidence interval, whether the two-dimensional data from which that angle was calculated are an outlier or a change point candidate.

A data analysis apparatus for analyzing a plurality of two-dimensional data having an order relationship, comprising:
vector angle creating means for creating vectors from two-dimensional data adjacent in the order relationship of the plurality of two-dimensional data, and calculating the angle formed by adjacent vectors using a vector cross product;
cumulative angle value calculating means for calculating, for each data, a cumulative value of the angles formed by the vectors;
control chart calculation means for creating groups k, each collecting a predetermined number of the cumulative values of the angles formed, and generating a Shewhart X-bar-R control chart from the groups k; and
detecting means for determining, using the cumulative values of the angles formed by the vectors, whether the two-dimensional data from which the angles formed by the vectors were calculated are an outlier, a change point candidate, a straight-line region, or a curved region,
wherein the detecting means includes
outlier/change point determination means for calculating the control limits of the X-bar-R control chart and determining, depending on whether the average and the range within each group fall within the control limits of the X-bar-R control chart, that the data from which the angles formed were calculated are an outlier or a change point.

The data analysis apparatus according to claim 1, wherein the confidence interval determination means includes
means for calculating the average of the angles formed at N (≧ 3) points including the outlier, calculating a confidence interval of the average, and determining a change point if the average is not included in the confidence interval of the average.

The data analysis apparatus according to claim 1, wherein the confidence interval determination means includes
means for, when the standard deviation of the angles formed obtained by the statistical value calculating means is large, smoothing the given data by a moving average and then determining the outlier and the change point candidate for the given data.

The data analysis apparatus according to claim 2, wherein the detecting means includes
curve region detecting means for calculating the within-group average from each point of the groups k and, using the X-bar control chart of the X-bar-R control chart, which manages individual measured values in time series, if there is a section in which the X-bar value increases or decreases for a predetermined number of consecutive points or more, taking those X-bar points as curve constituent points and that region as a curved region, and, if removing isolated non-consecutive points in an unclassified point region adjacent to the curved region yields extended curve constituent points that increase or decrease together with the curved region, including the extended curve constituent points in the curved region.

The data analysis apparatus according to claim 2, wherein the detecting means includes
change point detecting means for, using the R control chart of the X-bar-R control chart, which manages the range of the data, obtaining from each point of the groups the range R, that is, the maximum value within the group minus the minimum value, taking points outside the R control limits as R change points, and, if a single point within the R control limits is sandwiched between R change points, including that point among the R change points.

The data analysis apparatus according to claim 5, wherein the detecting means includes, downstream of the curve region detecting means,
straight-line region detecting means for determining, using the X-bar control chart of the X-bar-R control chart, which manages individual measured values in time series, whether there is a point outside the control limits in the X-bar control chart, taking any point outside the X-bar control limits as an X-bar change point, and, if there is no point outside the control limits, taking the unclassified points as straight-line constituent points and the unclassified point region as a straight-line region.

A data analysis method for analyzing a plurality of two-dimensional data having an order relationship,
wherein a data analysis apparatus performs:
A step of creating vectors from two-dimensional data that are adjacent in the order relationship of the plurality of two-dimensional data, and calculating the angle formed by adjacent vectors using the vector cross product;
A statistical value calculation step of obtaining the average of the formed angles and the standard deviation of the formed angles; and
A confidence interval determination step of calculating a confidence interval from the average and the standard deviation of the formed angles and determining, according to whether each formed angle falls within the confidence interval, whether the two-dimensional data from which that angle was calculated is an outlier or a change point candidate.
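The three steps of this method can be sketched roughly as below. This is not the patented implementation: the function names, the multiplier `z`, the use of the sample standard deviation, and the use of `atan2` (which combines the cross product with the dot product to obtain a signed angle) are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev

def formed_angles(points):
    """Signed angle in radians between each pair of adjacent vectors."""
    vectors = [(x1 - x0, y1 - y0)
               for (x0, y0), (x1, y1) in zip(points, points[1:])]
    angles = []
    for (ax, ay), (bx, by) in zip(vectors, vectors[1:]):
        cross = ax * by - ay * bx   # z-component of the 2-D cross product
        dot = ax * bx + ay * by     # assumption: dot product fixes the quadrant
        angles.append(math.atan2(cross, dot))
    return angles

def flag_candidates(points, z=2.0):
    """Indices of points whose formed angle lies outside mean +/- z * std.
    The multiplier z is a hypothetical parameter."""
    angles = formed_angles(points)
    mu, sigma = mean(angles), stdev(angles)
    low, high = mu - z * sigma, mu + z * sigma
    # Angle i is formed at points[i + 1], the vertex shared by both vectors.
    return [i + 1 for i, a in enumerate(angles) if not (low <= a <= high)]
```

For a horizontal run of ten points with a single spike at index 5, only the spike is flagged, because the sharp reversal at that vertex is the one angle that falls outside the interval.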

A data analysis method for analyzing a plurality of two-dimensional data having an order relationship,
wherein a data analysis apparatus performs:
A step of creating vectors from two-dimensional data that are adjacent in the order relationship of the plurality of two-dimensional data, and calculating the angle formed by adjacent vectors using the vector cross product;
A cumulative angle calculation step of calculating, for each data item, the cumulative value of the angles formed by the vectors;
A control chart calculation step of creating groups k, each collecting a predetermined number of the cumulative values of the formed angles, and generating a Shewhart X̄-R control chart from the groups k; and
A detection step of determining, using the cumulative value of the angles formed by the vectors, whether the two-dimensional data from which the angles were calculated is an outlier, a change point candidate, a straight line region, or a curve region,
wherein the detection step includes
calculating the control limits of the X̄-R control chart and determining, according to whether the average and the range within each group fall within the control limits of the X̄-R control chart, that the data from which the angles were calculated is an outlier or a change point.
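A rough sketch of the cumulative angle and control chart steps follows. The constants are the standard Shewhart factors A2, D3, and D4 for subgroup sizes 2 to 5; the function names, the default group size, and the use of consecutive non-overlapping groups are assumptions for this example and are not specified in the claims.

```python
from itertools import accumulate
from statistics import mean

# Standard Shewhart factors (A2, D3, D4) by subgroup size.
SHEWHART = {
    2: (1.880, 0.0, 3.267),
    3: (1.023, 0.0, 2.574),
    4: (0.729, 0.0, 2.282),
    5: (0.577, 0.0, 2.114),
}

def xbar_r_chart(angles, group_size=4):
    """Group cumulative angles and compute X-bar/R values and control limits."""
    a2, d3, d4 = SHEWHART[group_size]
    cumulative = list(accumulate(angles))
    # Assumption: groups are consecutive and non-overlapping; a short tail is dropped.
    groups = [cumulative[i:i + group_size]
              for i in range(0, len(cumulative) - group_size + 1, group_size)]
    xbars = [mean(g) for g in groups]
    ranges = [max(g) - min(g) for g in groups]
    center, rbar = mean(xbars), mean(ranges)
    limits = {
        "xbar": (center - a2 * rbar, center + a2 * rbar),
        "r": (d3 * rbar, d4 * rbar),
    }
    return xbars, ranges, limits

def out_of_control(values, limits):
    """Indices of groups whose value lies outside the (lower, upper) limits."""
    low, high = limits
    return [k for k, v in enumerate(values) if not (low <= v <= high)]
```

Groups returned by `out_of_control` for either chart would be the outlier or change point candidates in the sense of the detection step.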

A data analysis program for causing a computer to function as each means of the data analysis apparatus according to any one of claims 1 to 7.