JP2005258599A

JP2005258599A - Method for visualization of data, apparatus for visualization of data, program for visualization of data, and storage medium

Info

Publication number: JP2005258599A
Application number: JP2004066451A
Authority: JP
Inventors: Tomoharu Iwata; 具治岩田; Kazumi Saito; 和己斎藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-03-09
Filing date: 2004-03-09
Publication date: 2005-09-22

Abstract

PROBLEM TO BE SOLVED: To appropriately visually detect predetermined data such as an outlier in given arbitrary multidimensional data.

SOLUTION: An outlier visualization apparatus A is used to visualize predetermined data in a plurality of data y forming object data Y by means of a mixed autoregressive model mixing K autoregressive models. In the outlier visualization apparatus A, a model parameter estimation part 11 reads the object data Y into memory, pairs each data y with an input vector x created from the data y, and stores the resulting plurality of data D in memory. An extended probability vector estimation part 12 then estimates the joint probabilities of the stored data D and the classes k, stores them in memory as probability vectors, and appends a mean 2-sigma value to the probability vectors to create extended probability vectors, which are also stored in memory. An outlier visualization part 13 uses the extended probability vectors to visually display the predetermined data.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a method, an apparatus, and the like for visualizing predetermined data contained in given arbitrary data such as economic time series data, text data, or biological data.

In recent years, large amounts of data have been accumulated electronically. Detecting outliers in such data, in order to learn about major past events and to prepare for future crises, has become an important research topic.
A principal approach to outlier detection is to build a model from data in a normal state and to treat data with a large error from that model as abnormal (Non-Patent Document 1). A common way of presenting outliers is to plot the data on the horizontal axis and the degree of abnormality on the vertical axis.
Non-Patent Document 1: A. Goldenberg et al., "Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales", Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237-5240, 2002

However, when a model of the normal state is built and data with a large error from that model is treated as abnormal, data that is not actually an outlier may be regarded as one because the model lacks expressive power. To detect outliers accurately, the model must be made more expressive.

Furthermore, conventional methods of presenting outliers show the degree of abnormality but make it difficult to understand the nature of each abnormality.

Accordingly, the main object of the present invention is to provide a data visualization method, a data visualization apparatus, a data visualization program, and a storage medium that make it possible to appropriately visualize and detect predetermined data, such as outliers, in given arbitrary data.

To solve the above problems, the present invention constructs, from given arbitrary data, a mixture model described by a linear sum of a plurality of probability models that differ in their parameters, and visualizes predetermined data such as outliers by displaying the joint probabilities of the data and each probability model combined with a mean 2-sigma value.

That is, the present invention relates to a data visualization method that uses a computer, having calculation means that performs calculations using a memory for storing information as a work area, to visualize predetermined data among a plurality of data y constituting target data Y by means of a mixture model described by a linear sum of K probability models. In the data visualization method of the present invention, the calculation means reads the target data Y and stores it in the memory; sets, as a probability vector, a vector of the joint probabilities of the data y and each probability model and stores it in the memory; creates an extended probability vector by combining the probability vector with a mean probability density, which is the probability density at a predetermined value of the random variable averaged over all the probability models, and stores it in the memory; and uses the extended probability vector to create coordinate data for visualizing the predetermined data and stores the coordinate data in the memory.

Here, the probability model is an autoregressive model in the embodiment described later, but a normal distribution model, a multinomial distribution model, or the like may also be applied. Likewise, the mixture model is an autoregressive mixture model in the embodiment described later, but may be another mixture model corresponding to the probability model. The mean probability density is a mean 2-sigma value in the embodiment described later, but may be another mean probability density. The means for solving the problems are described in detail in the embodiment below.

According to the present invention, predetermined data such as outliers can be appropriately visualized and detected in given arbitrary data.

The best mode for carrying out the "data visualization method, data visualization apparatus, data visualization program, and storage medium" of the present invention (hereinafter, the "embodiment") is described in detail below. The description first explains the principle of the visualization method (data visualization method), under "Basic concept of the visualization method" and "Principle of the visualization method and explanation of the equations", and then gives a concrete description of an "outlier visualization apparatus" that embodies the method.

≪Basic concept of the visualization method≫
First, the basic concept of the visualization method is described with reference to equations (1) to (5).

(Estimation of the extended probability vector)
In the present embodiment, the data to be visualized is D^*, which has the structure of the following equation (1).
In equation (1), D(i) is the i-th datum and N is the total number of given data. D(i) is a pair consisting of the data y(t) and the input vector x(t), both described later. In other words, D(i) is "data consisting of the input vector x(i) and the data y(i)".
The i and t in parentheses in D(i) and y(t) are indexes that specify an element of data defined as an array variable. Even if the index names differ, the same datum is specified as long as the index values are the same.

The data D^* (that is, the data D(i)) is generated by the mixture model of the following equation (2), which is described by a weighted linear sum of K probability models. That is, as equation (2) shows, the mixture model is defined as a linear sum of a plurality of probability models. As described later, K is the total number of probability models.

Here, ΣP(k) = 1 and P(k) ≥ 0, and Θ denotes the unknown parameters. The unknown parameters Θ correspond to at least one of A and σ described later.
The unknown parameters are estimated by maximizing the probability that the data D^* is generated given Θ. That is, Θ is estimated by maximizing the following equation (3). The left side of equation (3) indicates that Θ is an estimate, and the right side means that the Θ that maximizes it is taken as the estimate. Equation (3) corresponds to equations (11) to (17) described later.

The vector of the joint probabilities of the data D(i) and each probability model (k = 1, ..., K) is defined as the probability vector z^*(i) of the following equation (4). Here too, the i in parentheses is an index that specifies one of the probability vectors, which form an array variable.
Equation (4) is the result of "setting, as a probability vector, a vector of the joint probabilities of the data y and the probability models".

The probability vector z^*(i) with the mean 2-sigma value (α~) combined with (appended to) it is defined as the extended probability vector z(i) of the following equation (5).

The mean 2-sigma value is the mean, over all the probability models, of the probability density at the 2-sigma value (twice the standard deviation). Equation (4) corresponds to equation (19) described later, and equation (5) corresponds to equation (18) described later.
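The equation images for (1) to (5) are not reproduced in this text. From the definitions given in the surrounding paragraphs, they presumably take roughly the following form. This is a reconstruction under stated assumptions, not the patent's own typesetting: in particular, writing equation (3) as a product over the data and writing the vectors as column vectors with a transpose mark are assumptions.

```latex
\begin{align}
D^{*} &= \{ D(i) \}_{i=1}^{N} \tag{1} \\
p\bigl(D(i) \mid \Theta\bigr) &= \sum_{k=1}^{K} P(k)\, p\bigl(D(i) \mid k, \Theta\bigr),
  \qquad \sum_{k=1}^{K} P(k) = 1,\; P(k) \ge 0 \tag{2} \\
\hat{\Theta} &= \arg\max_{\Theta} \prod_{i=1}^{N} p\bigl(D(i) \mid \Theta\bigr) \tag{3} \\
z^{*}(i) &= \bigl( P(1)\, p(D(i) \mid 1),\; \ldots,\; P(K)\, p(D(i) \mid K) \bigr)' \tag{4} \\
z(i) &= \bigl( z^{*}(i)',\; \tilde{\alpha} \bigr)' \tag{5}
\end{align}
```

Under this reading, z(i) has K + 1 components: K joint probabilities followed by the single mean 2-sigma value, which is the same for every datum.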

(Visualization of outliers by visualizing the extended probability vector)
Next, a technique for visualizing outliers by visualizing the extended probability vectors is described. The extended probability vectors are visualized after dimensionality reduction to, for example, two or three dimensions. Many dimensionality reduction methods have been proposed, such as classical multidimensional scaling and the CoPE method, and any of them may be used. The CoPE method is the one described later.

Among the extended probability vectors, data for which the mean 2-sigma value is relatively high can be said to be strongly abnormal (to differ greatly from the other data). Visualizing the extended probability vectors therefore produces an axis that represents the degree of abnormality, which makes it possible to detect outliers visually. Moreover, because the probability vector represents the probability of belonging to each probability model, visualizing the extended probability vectors reveals not only the degree of abnormality but also the nature of each outlier.
In addition, because it is possible to verify which of the plurality of probability models generated each displayed datum, one can, for example, find out which probability model generated a datum judged to be an outlier and then examine the parameters of that probability model to learn the nature of the outlier.
Furthermore, by displaying the data so that data processed by one probability model can be distinguished from data processed by another probability model with different parameters, for example by color-coding the data by probability model, one can see at a glance which probability model generated a datum judged to be an outlier.

≪Principle of the visualization method and explanation of the equations≫
Next, the principle of the visualization method, which uses a mixture model to visualize outliers contained in given arbitrary data, is described together with the equations used. Note that the order in which the equations are explained does not necessarily match the order in which the processing is explained in the flowcharts described later (FIG. 3 and others).

(Data)
First, the given arbitrary data (that is, the target data Y) is described. The target data Y to which the visualization method of the present embodiment is applied has the structure of the following equation (6).
For example, y(1) is the economic time series data for January 1983, y(2) is the data for the following month, and y(T) is the data for the T-th month counting from January 1983. If T is 240, y(T) is the economic time series data for December 2002. That is, the data y(t) is the economic time series data at time t, and T is the total number of time points given. This economic time series is assumed to be six-dimensional monthly time series data covering January 1983 to December 2002 and consisting of "Japan's monetary base", "international interest rates", "wholesale price index", "machinery orders", "industrial production index", and "yen-dollar exchange rate" (time t = January 1983 to December 2002, number of variable dimensions d = 6, total number of time points T = 240). The target data Y of equation (6) corresponds to "arbitrary target data Y consisting of a plurality of data y". Six-dimensional data is shown here as the target data Y, but the data may have any number of dimensions from one upward.

(Input vector to the autoregressive model)
To visualize the outliers contained in the target data Y, an input vector x(t) to the autoregressive model, having the structure shown in equation (7), is created from the target data Y. Hereinafter, x(t) is referred to simply as the "input vector" where appropriate. The "outliers" contained in the target data Y to be visualized correspond to the "predetermined data".

In equation (7), τ is the order of the autoregressive model. For example, if the order τ is 2, the input vector x(t) at time t is created from the data y(t−1), one month before time t, and the data y(t−2), two months before time t. The "′" in equation (7) denotes the matrix transpose. The input vector x(t) is a dτ-dimensional vector, where d is the number of variable dimensions as described above (here, 6). For example, when the order τ is 2, dτ is 12.
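The construction of the input vector can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `build_input_vectors`, the list-of-lists data layout, and the ordering of the lags (most recent first) are assumptions, since equation (7) itself is not reproduced in this text.

```python
def build_input_vectors(Y, tau):
    """Return {t: x(t)} for t = tau+1 .. T, using 1-based time indices.

    Y is a list of T observations, each a list of d floats; Y[0] is y(1).
    x(t) stacks y(t-1), y(t-2), ..., y(t-tau) into one list of length d*tau
    (lag ordering is an assumption of this sketch).
    """
    T = len(Y)
    if tau < 1 or tau >= T:
        raise ValueError("need 1 <= tau < T")
    X = {}
    for t in range(tau + 1, T + 1):
        x = []
        for lag in range(1, tau + 1):
            x.extend(Y[t - lag - 1])  # y(t - lag); the extra -1 converts to 0-based
        X[t] = x
    return X


# Toy data with d = 2 variables and T = 4 time points (values are made up).
Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
X = build_input_vectors(Y, tau=2)
```

With d = 2 and τ = 2, each x(t) has dτ = 4 components, and input vectors exist only for t = τ + 1, ..., T because earlier time points lack a full history.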

(Autoregressive model)
The autoregressive model describes the data y(t) at time t as the sum of a weighted linear sum of the data from the past τ periods and an error. That is, the data y(t) at time t is expressed by the following equation (8).
In equation (8), A is the dτ × d parameter matrix of the autoregressive model (referred to simply as the "parameter" where appropriate), and ε is a white-noise error term.

(Autoregressive mixture model)
In the autoregressive mixture model, the data y(t) at time t is considered to be generated by a weighted linear sum of K autoregressive models, as in the following equation (9) (K is 1 or more). Here, k is an index that specifies an autoregressive model and takes the values 1, 2, ..., K.

The left side of equation (9) is the probability density of the data y(t) given the input vector x(t). On the right side, P(k) is the weight of the k-th autoregressive model, and p(y(t)|x(t), k) is the probability density of the data y(t) given the input vector x(t) and given that the autoregressive model is the k-th one. Equation (9) can thus also be said to express a weighted linear sum of K probability models.

When the parameters are estimated, p(y(t)|x(t), k) on the right side of equation (9) is obtained from the following equation (10). That is, because the autoregressive model assumes normally distributed errors, the probability distribution of the data y(t) at time t, on the assumption that it belongs to the k-th of the K autoregressive models (in other words, the probability density of the data y(t) given the input vector x(t) and the autoregressive model), is obtained from equation (10).
Here, A_k is the parameter of the k-th autoregressive model and σ_k² is the variance of the k-th autoregressive model. It is assumed that the variances of all the variables are equal and that the variables are independent.
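The densities described for equations (9) and (10) can be sketched as below. The sketch assumes that the prediction of the k-th model is A_k′x(t) (A being dτ × d) and that the noise is an isotropic Gaussian with variance σ_k², which is how the text describes the model; the function names and the list-based matrix layout are hypothetical, and the exact typeset form of equation (10) is not available here.

```python
import math


def ar_mean(A, x):
    """Compute A' x, where A is a list of d*tau rows with d columns each."""
    d = len(A[0])
    return [sum(A[r][c] * x[r] for r in range(len(x))) for c in range(d)]


def component_density(y, x, A, sigma2):
    """Isotropic Gaussian density of y around the AR prediction A' x."""
    d = len(y)
    mean = ar_mean(A, x)
    sq_err = sum((yi - mi) ** 2 for yi, mi in zip(y, mean))
    norm = (2.0 * math.pi * sigma2) ** (-d / 2.0)
    return norm * math.exp(-sq_err / (2.0 * sigma2))


def mixture_density(y, x, weights, As, sigma2s):
    """Weighted sum of component densities, as described for equation (9)."""
    return sum(
        P * component_density(y, x, A, s2)
        for P, A, s2 in zip(weights, As, sigma2s)
    )


# Scalar toy case (d = 1, tau = 1): the prediction 0.5 * 2.0 equals y exactly.
p_single = component_density([1.0], [2.0], [[0.5]], 1.0)
p_mix = mixture_density([1.0], [2.0], [0.5, 0.5], [[[0.5]], [[0.5]]], [1.0, 1.0])
```

In the toy case the residual is zero, so the density reduces to the Gaussian normalizing constant 1/√(2π), and mixing two identical components returns the same value.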

(Estimation of the model parameters)
The unknown parameters of the autoregressive mixture model are estimated as follows.

Step 1: First, with K^* = 1, the parameter A₁ is estimated by minimizing the squared error E₁ of the autoregressive model expressed by the following equation (11).
This calculation is performed by the least squares method. The time t runs from an initial value of τ + 1 to a final value of T. As described above, τ is the order of the autoregressive model. K^* is a loop counter.

Step 2: An s satisfying 1 ≤ s ≤ K^* is chosen, and the parameters are set as in the following equations (12) and (13). Here, s is an index that selects a parameter A, and ΔA is, for example, a random number. The first time Step 2 is executed, s is 1 because K^* = 1 was set in Step 1. When Step 2 is executed for the second and subsequent times, s may again be 1.

Step 3: The loop counter K^* is incremented by one.

Step 4: The parameters (A, σ, P) are estimated by maximizing L_{K^*} in the following equation (13), which is the logarithm of the likelihood of the data. The probability density that appears as the second term inside the Σ on the right side corresponds to equation (10).

Step 5: If K^* < K, that is, if the loop counter K^* is less than the total number K of autoregressive models, the process returns to Step 2 and the processing from Step 2 onward is performed.
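The control flow of Steps 1 to 5 can be sketched as follows. Only the loop structure is taken from the text; the three callbacks (`fit_first`, `split_component`, `run_em`) are hypothetical placeholders for the least-squares fit of Step 1, the perturbation by ΔA in Step 2, and the likelihood maximization of Step 4, whose equations are not reproduced here. One difference to note: this sketch tests the count before each split, so K = 1 returns the single model of Step 1, whereas the steps as written test only after Step 4.

```python
import random


def grow_mixture(K, fit_first, split_component, run_em, seed=0):
    """Control-flow sketch of Steps 1-5; the three callbacks are hypothetical.

    fit_first()            -> parameters of a single AR model (Step 1)
    split_component(comp)  -> a perturbed copy of comp (Step 2, role of delta A)
    run_em(components)     -> components refined by EM (Step 4)
    """
    rng = random.Random(seed)
    components = [fit_first()]                  # Step 1: K* = 1
    while len(components) < K:                  # Step 5: repeat while K* < K
        s = rng.randrange(len(components))      # Step 2: choose s (0-based here)
        components.append(split_component(components[s]))  # Steps 2-3: K* += 1
        components = run_em(components)         # Step 4: refine all parameters
    return components


# Toy stand-ins so the sketch runs; real callbacks would fit AR parameters.
result = grow_mixture(
    K=3,
    fit_first=lambda: 0.0,
    split_component=lambda a: a + 1.0,
    run_em=lambda comps: list(comps),
)
```

Growing the mixture one component at a time, each new component starting near an existing one, gives EM a reasonable starting point at every size instead of requiring a good random initialization of all K models at once.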

In Step 4, the parameters that maximize L_{K^*} cannot be calculated analytically. The parameters that approximately maximize L_{K^*} are therefore calculated by iterating the EM algorithm, that is, by repeating the following E-Step and M-Step.

E-Step: Using the following equation (14), the weight P(k|x(t), y(t)) of the k-th autoregressive model given x(t) and y(t) is calculated.
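The E-Step weight is the usual posterior probability of a mixture component, obtained by normalizing the joint probabilities P(k)p(y(t)|x(t), k) over k. A minimal sketch for a single time point follows; the function name `e_step` and the convention of passing precomputed component densities are assumptions of this illustration.

```python
def e_step(weights, densities):
    """Posterior P(k | x(t), y(t)) for one time point.

    weights:   P(k) for k = 1..K
    densities: p(y(t) | x(t), k) for k = 1..K, e.g. from equation (10)
    """
    joint = [P * p for P, p in zip(weights, densities)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("all joint probabilities are zero")
    return [j / total for j in joint]


# Two equally weighted models; the second explains the datum three times better.
posterior = e_step([0.5, 0.5], [0.2, 0.6])
```

The returned values always sum to 1, so they can be read as the degree to which each autoregressive model is responsible for the datum at time t.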

M-Step: The parameters P(k), A_k, and σ_k² are estimated by the following equations (15) to (17). P(k), A_k, and σ_k² are, respectively, the probability (weight), the parameter, and the variance of the k-th autoregressive model.
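Equations (15) to (17) are not reproduced in this text. For a mixture of linear-Gaussian regressions with isotropic noise, the standard EM updates take the following form, which is given here as a plausible reconstruction rather than as the patent's own equations. The shorthand γ_{t,k} = P(k | x(t), y(t)) for the E-Step weight is notation introduced for this sketch.

```latex
\begin{align}
P(k) &= \frac{1}{T - \tau} \sum_{t=\tau+1}^{T} \gamma_{t,k} \tag{15} \\
A_k &= \Bigl( \sum_{t=\tau+1}^{T} \gamma_{t,k}\, x(t)\, x(t)' \Bigr)^{-1}
       \sum_{t=\tau+1}^{T} \gamma_{t,k}\, x(t)\, y(t)' \tag{16} \\
\sigma_k^{2} &= \frac{\sum_{t=\tau+1}^{T} \gamma_{t,k}\, \bigl\| y(t) - A_k'\, x(t) \bigr\|^{2}}
                     {d \sum_{t=\tau+1}^{T} \gamma_{t,k}} \tag{17}
\end{align}
```

The dimensions are consistent with the text: x(t)x(t)′ is dτ × dτ and x(t)y(t)′ is dτ × d, so A_k is a dτ × d matrix, and the factor d in (17) reflects the assumption of one shared variance across the d variables.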

(Estimation of the extended probability vector)
Once the parameters have been estimated, the autoregressive mixture model is constructed. The extended probability vector z(t) is then estimated. z(t) is estimated (defined) by the following equation (18).
The extended probability vector z(t) of equation (18) is the probability vector z^*(t) of the following equation (19) with the mean 2-sigma value (α~) of the autoregressive mixture model, given by the following equation (20), appended. Equation (18) corresponds to equation (5), and equation (19) corresponds to equation (4).

In equations (18) and (19), k takes the values 1, 2, ..., K for both the weight P and the probability density p. The probability vector z^*(t) and the extended probability vector z(t) are each estimated for every time t (t = 1 to T). In equation (20), the variance σ_k² of equation (17) can be used.
Equation (19) is the result of "setting, as a probability vector, a vector of the joint probabilities of each autoregressive model and the data D consisting of the input vector x and the data y corresponding to that input vector x".
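The assembly of the extended probability vector can be sketched as follows. Two points are assumptions, because equation (20) is not reproduced in this text: the "density at the 2-sigma value" is taken to be the isotropic Gaussian density at a distance of 2σ_k from its mean, and the mean over the K models is taken to be unweighted. The function names are hypothetical.

```python
import math


def mean_two_sigma_density(sigma2s, d):
    """Average, over the K models, of the Gaussian density at distance 2*sigma_k.

    Assumes an isotropic d-dimensional Gaussian and an unweighted mean;
    both are assumptions of this sketch, not statements of equation (20).
    """
    vals = [(2.0 * math.pi * s2) ** (-d / 2.0) * math.exp(-2.0) for s2 in sigma2s]
    return sum(vals) / len(vals)


def extended_probability_vector(weights, densities, sigma2s, d):
    """Append the mean 2-sigma value to the joint probabilities P(k) p(y|x,k)."""
    z_star = [P * p for P, p in zip(weights, densities)]
    return z_star + [mean_two_sigma_density(sigma2s, d)]


# Toy case: K = 2 models, d = 1, unit variances, made-up component densities.
z = extended_probability_vector([0.6, 0.4], [0.30, 0.05], [1.0, 1.0], d=1)
```

Because the last component is identical for every time point, a datum whose joint probabilities are all small ends up with a vector that points mostly along that last axis, which is what lets the axis act as a measure of abnormality.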

(Outlier detection by visualization)
Once the parameters have been estimated from the given target data Y, the autoregressive mixture model has been constructed, and the extended probability vectors have been estimated, outliers are detected by visualization. In the present embodiment, outliers are visualized in two or three dimensions by the CoPE method (Connectivity Preserving Embedding method) on the basis of the extended probability vector z(t) at each time (t = 1 to T).
The CoPE method is described in detail in "Embedding Network Data Based on Cross-Entropy Minimization", Transactions of Information Processing Society of Japan, 44(9):1234-1231, 2003.

In the CoPE method, the similarity s_i,j between the data D(i) at time t = i and the data D(j) at time t = j is first calculated as the cosine similarity of the extended probability vectors z(i) and z(j) at times i and j, using the following equation (21). The data D(i) is data that has the input vector x(i) as its input and the data y(i) as its output (data in which y and x are paired). Similarly, the data D(j) has the input vector x(j) as its input and the data y(j) as its output.

In equation (21), z(i)′ in the numerator is the transpose of the extended probability vector z(i) of equation (18) at time i. The denominator is the product of the magnitude of the extended probability vector z(i) and the magnitude of z(j).
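The cosine similarity described for equation (21) can be sketched as below: the inner product of the two vectors divided by the product of their magnitudes. The function name is hypothetical, and the example vectors are made-up values in the layout (joint probabilities followed by the mean 2-sigma value).

```python
import math


def cosine_similarity(zi, zj):
    """Inner product of zi and zj divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(zi, zj))
    norm_i = math.sqrt(sum(a * a for a in zi))
    norm_j = math.sqrt(sum(b * b for b in zj))
    return dot / (norm_i * norm_j)


# Made-up extended probability vectors with K = 2 and a shared last component.
typical_a = [0.30, 0.01, 0.05]   # well explained by model 1
typical_b = [0.28, 0.02, 0.05]   # also well explained by model 1
outlier = [0.001, 0.002, 0.05]   # poorly explained by both models

s_typical = cosine_similarity(typical_a, typical_b)
s_outlier = cosine_similarity(typical_a, outlier)
```

Since extended probability vectors have no negative components, the similarity lies between 0 and 1; here the two typical data are far more similar to each other than either is to the outlier, whose direction is dominated by the last component.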

Here, let r_i be the coordinates of the data D(i). The coordinates (all coordinates for i from τ + 1 to T) are calculated by minimizing the following equation (22), which is the sum of the cross-entropies between the probability density p(r_i, r_j) of r_i and r_j and the similarity s_i,j, plus a regularization term.

The time i runs from an initial value of τ + 1 to a final value of T − 1, and the time j runs from an initial value of τ + 2 to a final value of T. μ is the weight of the regularization term and is set as appropriate, for example on the basis of the visualization result. ‖r_i‖² is the squared distance of the coordinates r_i from the origin.

Here, the cross-entropy for p(r_i, r_j) is calculated by the following equation (23), and the p(r_i, r_j) required in equation (23) is calculated by the following equation (24).

The coordinates r_i are plotted as a two- or three-dimensional graph for visualization. This makes the outliers visually distinguishable.

≪Configuration of the outlier visualization apparatus≫
Next, on the basis of the principle described above, an outlier visualization apparatus that uses a computer to visualize the outliers contained in given data (the target data Y) is described concretely.
FIG. 1 is a configuration diagram of an outlier visualization apparatus according to an embodiment of the present invention.

As shown in FIG. 1, the outlier visualization apparatus A has a configuration in which an arithmetic device 1, an input device 2, a storage device 3, and a display device 4 are connected to a bus. In hardware terms, the arithmetic device 1 comprises a CPU (Central Processing Unit) serving as calculation means, a RAM (Random Access Memory) serving as a memory for storing information, and the like. As its software configuration, the arithmetic device 1 includes a model parameter estimation unit 11, an extended probability vector estimation unit 12, and an outlier visualization unit 13. The functions of the units 11, 12, and 13 are described in detail later. The programs underlying the model parameter estimation unit 11, the extended probability vector estimation unit 12, and the outlier visualization unit 13 are stored in the storage device 3; when the outlier visualization apparatus A is started, the CPU reads them from the storage device 3 into the RAM, where they function as the units 11, 12, and 13. The CPU performs various calculations using the RAM, a memory that stores various information, as a work area, storing calculation results and intermediate results in the RAM as needed. The CPU naturally also uses its built-in cache memory as a work area while performing these calculations.

The input device 2 is input means comprising a keyboard, a network interface card, a disk drive such as an FDD, and the like, and is used to input various data to the outlier visualization apparatus A, such as the target data Y (economic time series data) described above, the order τ of the autoregressive model, and the total number K of autoregressive models. The storage device 3 is storage means comprising a hard disk drive or the like. In the present embodiment, the various data input from the input device 2 are stored in the storage device 3. As described above, the storage device 3 also stores the programs underlying the units 11, 12, and 13. The display device 4 is, for example, a graphics board and a liquid crystal monitor connected to it, and visually displays the data visualized by the outlier visualization unit 13.

次に、演算装置１にソフトウェア的に（プログラムモジュールとして）構成されるモデルパラメータ推定部１１、拡大確率ベクトル推定部１２、異常値可視化部１３の機能を説明する。 Next, functions of the model parameter estimation unit 11, the expansion probability vector estimation unit 12, and the abnormal value visualization unit 13 configured in software on the arithmetic device 1 (as program modules) will be described.

（モデルパラメータ推定部）
演算装置１のモデルパラメータ推定部１１は、式（９）及び式（１０）で示す自己回帰混合モデルのパラメータを推定して、自己回帰混合モデルを構築する機能を有する。このため、モデルパラメータ推定部１１は、（ａ）記憶装置３に記憶された式（６）で示す対象データＹと次数τ、その他を演算装置１に読み込む処理、（ｂ）演算装置１に読み込んだ対象データＹと次数τとから、式（７）の自己回帰モデルへの入力ベクトルを算出する処理、（ｃ）対象データＹと入力ベクトルｘ（ｔ）から、式（１１）〜式（１７）を用い、自己回帰混合モデルのパラメータなど（Ａ、σ、Ｐ（ｋ））を推定する処理を行う機能を有する。なお、読み込んだ対象データＹなどや演算の結果・途中経過は、適宜図示しないＲＡＭに記憶して、後段の拡大確率ベクトル推定部１２での処理に活用するものとする。本実施形態では、重みのＰ（ｋ）もパラメータの１つとして説明する。 (Model parameter estimation unit)
The model parameter estimation unit 11 of the arithmetic device 1 has a function of estimating the parameters of the autoregressive mixed model represented by equations (9) and (10) and thereby constructing the autoregressive mixed model. To this end, the model parameter estimation unit 11 performs (a) a process of reading the target data Y represented by equation (6), the order τ, and other data stored in the storage device 3 into the arithmetic device 1; (b) a process of calculating the input vectors to the autoregressive model of equation (7) from the target data Y and the order τ thus read; and (c) a process of estimating the parameters of the autoregressive mixed model (A, σ, P(k)) from the target data Y and the input vectors x(t) using equations (11) to (17). The target data Y and other data read in, as well as the results and intermediate values of the calculations, are stored as appropriate in a RAM (not shown) and used for processing in the subsequent expansion probability vector estimation unit 12. In the present embodiment, the weight P(k) is also treated as one of the parameters.

（拡大確率ベクトル推定部）
拡大確率ベクトル推定部１２は、前記した式（１８）〜式（２０）を用いて、また、適宜ＲＡＭに記憶されているモデルパラメータ推定部１１の演算の結果や途中経過を用いて、拡大確率ベクトルｚ（ｔ）を各時刻ｔ（ｔ＝１〜Ｔ）ごとにそれぞれ推定する処理を行う機能を有する。 (Expansion probability vector estimation unit)
The expansion probability vector estimation unit 12 has a function of estimating the expansion probability vector z(t) for each time t (t = 1 to T), using equations (18) to (20) described above together with the calculation results and intermediate values of the model parameter estimation unit 11 stored in the RAM as appropriate.

（異常値可視化部）
異常値可視化部１３は、前記した式（２１）〜式（２４）を用いて対象データＹに含まれる異常値をＣｏＰＥ法により可視化する処理を行う機能を有する。 (Abnormal value visualization part)
The abnormal value visualization unit 13 has a function of performing a process of visualizing the abnormal value included in the target data Y by the CoPE method using the equations (21) to (24).

≪異常値可視化装置の動作≫
次に、フローチャートを用いて、前記説明した構成を有する異常値可視化装置Ａの動作を説明する。
図２は、読み込んだデータから異常値を可視化するまでの処理の概要を示したフローチャートである。この図２に示すように、異常値可視化装置Ａは、ステップＳ１００で、記憶装置３から対象データＹを読み込み、「自己回帰混合モデルの構築処理」を行う。なお、この処理を行うのは、図１に示すモデルパラメータ推定部１１である。次に、ステップＳ２００で、「拡大確率ベクトルの推定処理」を行う。なお、この処理を行うのは、図１に示す拡大確率ベクトル推定部１２である。拡大確率ベクトルの推定処理後は、Ｓ３００で、「異常値の可視化処理」を行う。なお、この処理を行うのは、図１に示す異常値可視化部１３である。 ≪Operation of abnormal value visualization system≫
Next, the operation of the abnormal value visualization apparatus A having the above-described configuration will be described using a flowchart.
FIG. 2 is a flowchart outlining the processing from reading the data to visualizing the abnormal values. As shown in FIG. 2, in step S100 the abnormal value visualization apparatus A reads the target data Y from the storage device 3 and performs the “autoregressive mixed model construction process”. This process is performed by the model parameter estimation unit 11 shown in FIG. 1. Next, in step S200, the “expansion probability vector estimation process” is performed. This process is performed by the expansion probability vector estimation unit 12 shown in FIG. 1. After the expansion probability vector estimation process, the “abnormal value visualization process” is performed in step S300. This process is performed by the abnormal value visualization unit 13 shown in FIG. 1.

（自己回帰混合モデルの構築）
次に、図３のフローチャートを参照して、図２のステップＳ１００の「自己回帰混合モデルの構築処理」を詳細に説明する（適宜図１など参照）。ここで行うのは、式（９）及び式（１０）で示す自己回帰混合モデルのパラメータを推定して、自己回帰混合モデルを構築する処理である。以下の説明における動作主体は、モデルパラメータ推定部１１である。なお、モデルパラメータ推定部１１は、ＣＰＵやＲＡＭなどから構成される演算装置１で機能するプログラムモジュールである。
ちなみに、この図３のフローチャートは、コンピュータでの演算処理の順序を考慮して表現しているので、「可視化方法の原理・数式の説明」のところで説明した数式の登場順序とは必ずしも一致しない。この点は、以降の各フローチャート（図４、図５）においても同じである。 (Building an autoregressive mixed model)
Next, the “autoregressive mixture model construction process” in step S100 of FIG. 2 will be described in detail with reference to the flowchart of FIG. 3 (see FIG. 1 as appropriate). What is performed here is a process of constructing an autoregressive mixture model by estimating the parameters of the autoregressive mixture model represented by equations (9) and (10). The operating subject in the following description is the model parameter estimation unit 11. Note that the model parameter estimation unit 11 is a program module that functions in the arithmetic device 1 including a CPU, a RAM, and the like.
Incidentally, since the flowchart of FIG. 3 is expressed in consideration of the order of the arithmetic processing in the computer, it does not necessarily match the appearance order of the mathematical expressions described in “Principle of visualization method / Explanation of mathematical expressions”. This also applies to the subsequent flowcharts (FIGS. 4 and 5).

まず、図３に示すように、モデルパラメータ推定部１１は、記憶装置３から演算装置１に、式（６）で示す構造をした対象データＹを読み込む（Ｓ１０１）。対象データＹは、前記したとおり６次元の経済時系列データである。読み込んだ対象データＹについて、式（７）で示す構造をした自己回帰モデルへの入力データｘ（ｔ）の作成を行う（Ｓ１０２）。なお、式（７）で使われるτは、前記したとおり自己回帰モデルの次数であり、このτは、別途演算装置１に読み込まれているものとする。 First, as illustrated in FIG. 3, the model parameter estimation unit 11 reads the target data Y having the structure represented by Expression (6) from the storage device 3 to the arithmetic device 1 (S101). The target data Y is 6-dimensional economic time series data as described above. With respect to the read target data Y, input data x (t) to an autoregressive model having a structure represented by Expression (7) is created (S102). It is assumed that τ used in Equation (7) is the order of the autoregressive model as described above, and this τ is read into the arithmetic device 1 separately.
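As an illustration of steps S101 and S102, the following is a minimal sketch of pairing each observation y(t) with an input vector built from the τ preceding observations. Equation (7) is not reproduced in this passage, so the layout of x(t) (lagged values concatenated, no constant term) and the function name `make_input_vectors` are assumptions made for this example only.

```python
def make_input_vectors(Y, tau):
    """Pair each y(t) with an input vector x(t) built from the tau previous rows.

    Y is a list of d-dimensional observations; index 0 corresponds to time 1.
    Returns (x, y) pairs for t = tau+1 .. T, the range the text says is used.
    The layout of x(t) is an assumption, not the patent's equation (7).
    """
    pairs = []
    for t in range(tau, len(Y)):
        x = []
        for lag in range(1, tau + 1):
            x.extend(Y[t - lag])  # y(t-1), ..., y(t-tau) concatenated
        pairs.append((x, list(Y[t])))
    return pairs


# Toy 2-dimensional series with T = 4 and tau = 2.
Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
pairs = make_input_vectors(Y, tau=2)
```

With T observations and order τ, this yields T − τ pairs, which is why the later sums run from t = τ + 1 to T.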

次に、ループカウンタＫ^*を初期化して１にする（Ｓ１０３）。なお、ループカウンタＫ^*は、１〜Ｋまでの整数値をとる。このステップＳ１０３は、すべての自己回帰モデルのパラメータを推定するための準備ステップである。 Next, the loop counter K ^* is initialized to 1 (S103). The loop counter K ^* takes an integer value from 1 to K. This step S103 is a preparation step for estimating the parameters of all autoregressive models.

続いて、前記ステップＳ１０１で読み込んだ対象データＹに含まれるデータｙ（ｔ）と前記ステップＳ１０２で作成した入力ベクトルｘ（ｔ）を用いて、式（１１）の２乗最小誤差Ｅ₁を最小にすることで、パラメータＡ₁を推定する（Ｓ１０４）。ここまでは、自己回帰モデルのパラメータの推定で、次のステップＳ１０５以降は、自己回帰混合モデルのパラメータの推定である。 Subsequently, by using the data y (t) included in the target data Y read in step S101 and the input vector x (t) created in step S102, the squared minimum error E ₁ of equation (11) is minimized. Thus, the parameter A ₁ is estimated (S104). Up to this point, the parameters of the autoregressive model are estimated, and the subsequent steps S105 and after are parameters of the autoregressive mixed model.
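The least-squares estimate of step S104 can be sketched for the simplest case. The example below assumes a scalar series and order τ = 1, so that the model is y(t) ≈ a·x(t) with x(t) = y(t−1); the patent's equation (11) covers the multidimensional case and is not reproduced here, and the function name is hypothetical.

```python
def fit_ar1_least_squares(xs, ys):
    """Return the coefficient a that minimises sum((y - a * x) ** 2).

    Scalar illustration only: setting the derivative of the squared
    error to zero gives a = sum(x * y) / sum(x * x).
    """
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx


# A noiseless series in which each value is half the previous one.
series = [1.0, 0.5, 0.25, 0.125, 0.0625]
xs, ys = series[:-1], series[1:]
a1 = fit_ar1_least_squares(xs, ys)
```

For a d-dimensional series the same idea applies with a coefficient matrix in place of the scalar `a`.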

パラメータＡ₁を推定した後は、１≦ｓ≦Ｋ^*を満たす任意のインデックスｓを選び、パラメータを式（１２）のように設定する（Ｓ１０５）。なお、ループカウンタのＫ^*が１のときは、必ずｓは１になる。式（１２）のＡ_K*+1＝Ａｓ−ΔＡは、添字の部分でＫ^*＋１としていることから、次のＡ_K ^*の値を設定していることになる。 After the parameter A₁ is estimated, an arbitrary index s satisfying 1 ≦ s ≦ K^* is selected, and the parameters are set as in equation (12) (S105). When the loop counter K^* is 1, s is necessarily 1. Because the subscript in A_{K^*+1} = A_s − ΔA of equation (12) is K^*+1, this step sets the value of A for the next K^*.

パラメータのＡを式（１２）のように設定すると、ループカウンタＫ^*を１つインクリメントする（Ｓ１０６）。そして、前記したＥＭアルゴリズムのＥ−ｓｔｅｐを実行する（Ｓ１０７）。つまり、式（１４）により、データｙ（ｔ）と入力ベクトルｘ（ｔ）が解っているときのｋ番目の自己回帰モデルの重みを算出する（Ｓ１０７）。
ちなみに、式（１１）と式（１３）では、各時刻ｔ＝τ＋１〜Ｔまでの値を使って、１つのＥ₁、１つのＬ_K ^*が算出される。一方、式（１４）については、複数個のＰ（ｋ｜ｘ（ｔ）、ｙ（ｔ））が算出される。 Once the parameter A has been set as in equation (12), the loop counter K^* is incremented by one (S106). The E-step of the EM algorithm described above is then executed (S107). That is, equation (14) is used to calculate the weight of the k-th autoregressive model given the data y(t) and the input vector x(t) (S107).
Incidentally, in Eqs. (11) and (13), one E ₁ and one L _K ^* are calculated using values from time t = τ + 1 to T. On the other hand, for Equation (14), a plurality of P (k | x (t), y (t)) is calculated.
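The E-step of S107 can be sketched as follows. Equation (14) is not reproduced in this passage, so the example assumes the usual mixture posterior with Gaussian noise, P(k | x, y) ∝ P(k)·N(y; a_k·x, σ_k²), for a scalar series with τ = 1; the names `gaussian_pdf` and `e_step` are hypothetical.

```python
import math


def gaussian_pdf(y, mean, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def e_step(xs, ys, weights, coefs, variances):
    """Return resp[t][k], the posterior weight of model k for the pair (x(t), y(t)).

    Assumes scalar data and Gaussian noise; this is a sketch, not the
    patent's equation (14).
    """
    resp = []
    for x, y in zip(xs, ys):
        joint = [w * gaussian_pdf(y, a * x, v)
                 for w, a, v in zip(weights, coefs, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp


# Two models with opposite coefficients; each pair is explained by one of them.
resp = e_step(xs=[1.0, 1.0], ys=[0.9, -0.9],
              weights=[0.5, 0.5], coefs=[1.0, -1.0], variances=[0.1, 0.1])
```

One row of weights is produced per time step, which matches the remark that equation (14) yields several values of P(k | x(t), y(t)) whereas equations (11) and (13) each yield a single number.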

次に、前記したＥＭアルゴリズムのＭ−ｓｔｅｐを実行する（Ｓ１０８）。つまり、式（１５）によりＰ（ｋ）を、式（１６）によりσ_k ²を、式（１７）によりＡ_kをそれぞれ推定する。ちなみに、ステップＳ１０７で算出した重みＰ（ｋ｜ｘ（ｔ），ｙ（ｔ））を用いて式（１５）〜式（１７）を実行することから、実行するごとに異なるＰ（ｋ）、σ_k ²、Ａ_kが推定される。 Next, the M-step of the EM algorithm described above is executed (S108). That is, P(k) is estimated by equation (15), σ_k² by equation (16), and A_k by equation (17). Because equations (15) to (17) are evaluated using the weights P(k | x(t), y(t)) calculated in step S107, different values of P(k), σ_k², and A_k are estimated each time they are executed.
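A sketch of the M-step of S108 for the same scalar, τ = 1 setting is shown below. Equations (15) to (17) are not reproduced in this passage, so the weighted updates here are the standard ones for a mixture of linear regressions and are an assumption. This sketch also computes the coefficient before the variance, which is a common ordering and may differ from the order of equations (15) to (17).

```python
def m_step(xs, ys, resp):
    """Update mixture weights, coefficients and variances from resp[t][k].

    Scalar sketch under assumed standard updates, not the patent's equations.
    """
    n = len(xs)
    num_models = len(resp[0])
    weights, coefs, variances = [], [], []
    for k in range(num_models):
        r = [resp[t][k] for t in range(n)]
        r_sum = sum(r)
        # Mixture weight: average posterior weight of model k.
        weights.append(r_sum / n)
        # Coefficient: weighted least squares.
        a = (sum(ri * x * y for ri, x, y in zip(r, xs, ys))
             / sum(ri * x * x for ri, x in zip(r, xs)))
        coefs.append(a)
        # Variance: weighted mean squared residual around the new coefficient.
        variances.append(
            sum(ri * (y - a * x) ** 2 for ri, x, y in zip(r, xs, ys)) / r_sum)
    return weights, coefs, variances


# Hard assignments: the first two pairs belong to model 0, the rest to model 1.
weights, coefs, variances = m_step(
    xs=[1.0, 2.0, 1.0, 2.0],
    ys=[1.1, 1.9, -1.0, -2.0],
    resp=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
```

Because the inputs are the posterior weights from the preceding E-step, each pass produces new estimates, as the paragraph above notes.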

ステップＳ１０９とステップＳ１１０では、式（１３）を用い、推定したパラメータを検証する。パラメータの検証は、式（１３）によりデータの尤度の対数をとったＬ_K ^*を算出し（Ｓ１０９）、これが最大化したときのパラメータを検証済みのパラメータとして採用するものである。なお、本実施形態では、Ｌ_K ^*の最大化をＬ_K ^*が収束したか否かにより判断する（Ｓ１１０）。従って、Ｌ_K ^*が収束したならば（Ｓ１１０においてＹｅｓ）、インクリメントされたＫ^*におけるパラメータの推定が完了する。一方、Ｌ_K ^*が収束しないならば（Ｓ１１０においてＮｏ）、ステップＳ１０７に戻ってステップＳ１０７以降を再度実行し、新たなパラメータを推定して検証を行う。 In steps S109 and S110, the estimated parameters are verified using equation (13). The verification calculates L_{K^*}, the logarithm of the likelihood of the data, by equation (13) (S109), and adopts the parameters at which it is maximized as the verified parameters. In the present embodiment, whether L_{K^*} has been maximized is judged by whether L_{K^*} has converged (S110). Accordingly, if L_{K^*} has converged (Yes in S110), the parameter estimation for the incremented K^* is complete. If L_{K^*} has not converged (No in S110), the process returns to step S107, executes step S107 and the subsequent steps again, and estimates and verifies new parameters.

ちなみに、本実施形態では、Ｌ_K ^*が収束したか否かの判断は、ステップＳ１０９において算出したＬ_K ^*の前回値と今回値を比較し、その差の絶対値が所定値未満（以下）ならば収束したと判断するものである。もちろん、Ｌ_K ^*が収束したか否かを判断するのはＬ_K ^*が最大化したか否かを判断するための一例であり、Ｌ_K ^*が最大化したか否かの判断の仕方については、本実施形態以外の判断の仕方でもよい。 Incidentally, in the present embodiment, whether or not L _K ^* has converged is determined by comparing the previous value of L _K ^* calculated in step S109 with the current value, and the absolute value of the difference is less than a predetermined value (below). Then, it is judged that it has converged. Of course, determining whether or not L _K ^* has converged is an example for determining whether or not L _K ^* has been maximized, and how to determine whether or not L _K ^* has been maximized. May be a determination method other than this embodiment.
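The convergence test of S109 and S110 can be sketched as a loop that stops when the absolute difference between the previous and current log-likelihood falls below a threshold. In the sketch below, `step` stands in for one E-step plus M-step followed by evaluation of the log-likelihood; the names, the threshold, and the `max_iter` safeguard are assumptions for illustration and are not taken from the patent.

```python
def run_until_converged(step, tol=1e-6, max_iter=1000):
    """Call step() until successive log-likelihoods differ by less than tol.

    step() performs one update and returns the current log-likelihood.
    Returns the last log-likelihood and the number of calls made.
    """
    previous = None
    for iteration in range(1, max_iter + 1):
        current = step()
        if previous is not None and abs(current - previous) < tol:
            return current, iteration
        previous = current
    return previous, max_iter


# Stand-in for EM: a log-likelihood that rises towards -1, halving the gap each call.
state = {"n": 0}


def fake_em_step():
    state["n"] += 1
    return -1.0 - 2.0 ** (-state["n"])


final_ll, iterations = run_until_converged(fake_em_step, tol=1e-3)
```

As the text points out, convergence is only one way of judging that the log-likelihood has been maximised, so the stopping rule could be swapped without changing the rest of the loop.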

ステップＳ１１１では、ループカウンタのＫ^*が自己回帰モデルの総数であるＫよりも小さいか（Ｋ^*＜Ｋ）否かを判断する。Ｋ^*＜Ｋならば（ステップＳ１１１においてＹｅｓ）、ステップＳ１０５に戻り、ステップＳ１０５以降を実行する。つまり、次のＫ^*（次の自己回帰モデル）についてのパラメータの推定を行う。一方、Ｋ^*＜Ｋでないならば、つまりＫ^*＝Ｋならば（ステップＳ１１１においてＮｏ）、自己回帰混合モデルのすべてのパラメータ（Ｐ（ｋ），σ_k ²、Ａ_k）が推定されたので、この図３のフローチャートの処理を終了する。これにより、自己回帰混合モデルが構築される。 In step S111, it is determined whether the loop counter K^* is smaller than K, the total number of autoregressive models (K^* < K). If K^* < K (Yes in step S111), the process returns to step S105 and executes step S105 and the subsequent steps; that is, the parameters for the next K^* (the next autoregressive model) are estimated. If K^* < K does not hold, that is, if K^* = K (No in step S111), all parameters of the autoregressive mixed model (P(k), σ_k², A_k) have been estimated, so the processing of the flowchart of FIG. 3 ends. The autoregressive mixed model is thereby constructed.
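The outer loop of FIG. 3 (S105 to S111) grows the mixture one model at a time. The following control-flow sketch uses scalar coefficients; `refine` is a placeholder for the EM iterations of S107 to S110, and the random choice of s, the scalar ΔA (`delta`), and the use of a `while` loop (so that K = 1 performs no split) are assumptions of this example rather than details confirmed by the text.

```python
import random


def grow_mixture(a1, K, delta, refine, seed=0):
    """Grow from one coefficient to K coefficients by repeated splitting.

    a1     : coefficient of the single model fitted in S104
    delta  : offset used to create the new model, A_{K*+1} = A_s - delta
    refine : callable standing in for EM (S107 to S110); takes and returns a list
    """
    rng = random.Random(seed)
    coefs = [a1]                        # K* = 1
    while len(coefs) < K:               # S111: continue while K* < K
        s = rng.randrange(len(coefs))   # S105: pick any existing index s
        coefs.append(coefs[s] - delta)  # S105/S106: new model, K* becomes K* + 1
        coefs = refine(coefs)           # S107 to S110: re-estimate all parameters
    return coefs


# With an identity "refine", only the splitting logic is exercised.
coefs = grow_mixture(0.5, K=3, delta=0.1, refine=lambda c: c)
```

In a full implementation `refine` would also carry the mixture weights and variances; they are omitted here to keep the loop structure visible.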

（拡大確率ベクトルの推定処理）
次に、図４のフローチャートを参照して、図２のステップＳ２００の「拡大確率ベクトルの推定処理」を詳細に説明する（適宜図など１参照）。以下の説明における動作主体は、拡大確率ベクトル推定部１２である。なお、拡大確率ベクトル推定部１２は、演算装置１で機能するプログラムモジュールである。 (Expansion probability vector estimation process)
Next, the “expansion probability vector estimation process” in step S200 of FIG. 2 will be described in detail with reference to the flowchart of FIG. The operation subject in the following description is the expansion probability vector estimation unit 12. The expansion probability vector estimation unit 12 is a program module that functions in the arithmetic device 1.

まず、図４に示すように、拡大確率ベクトル推定部１２は、ステップＳ２０１でｔ＝０として、時刻ｔを初期化する。これは、ｔ＝１〜Ｔまでのすべての時刻ｔについて、拡大確率ベクトルを推定するための準備である。ステップＳ２０２で、時刻ｔをインクリメントする（データｙ（０）が存在しないため）。ステップＳ２０３で、「データｙ（ｔ）と入力ベクトルｘ（ｔ）」と「各自己回帰モデル（ｋ＝１，... ，Ｋ）」との同時確率（つまりデータＤ（ｔ）と各確率モデルとの同時確率）をベクトルにしたものを、式（１８）の確率ベクトルｚ^*（ｔ）として設定する。なお、式（１８）は、前記したステップＳ１００での演算結果に基づいて設定される。 First, as illustrated in FIG. 4, the expansion probability vector estimation unit 12 initializes time t by setting t = 0 in step S201. This is a preparation for estimating the expansion probability vector for all times t from t = 1 to T. In step S202, the time t is incremented (because there is no data y (0)). In step S203, the simultaneous probability of “data y (t) and input vector x (t)” and “each autoregressive model (k = 1,..., K)” (that is, data D (t) and each probability A vector in which the joint probability with the model) is set as a vector is set as a probability vector z ^* (t) in Expression (18). Expression (18) is set based on the calculation result in step S100 described above.

次に、自己回帰混合モデルの平均２シグマ値を、ステップＳ１００での演算結果を利用して式（２０）により算出する（Ｓ２０４）。そして、この平均２シグマ値を確率ベクトルｚ^*（ｔ）に付加して、式（１９）の拡大確率ベクトルｚ（ｔ）を推定する（Ｓ２０５）。次に、時刻ｔが時系列の総数であるＴより小さいか（ｔ＜Ｔか）を判断する（Ｓ２０６）。ｔがＴに満たない場合（Ｓ２０６でＹｅｓ）は、ステップＳ２０２に戻ってステップＳ２０２以降を実行する。つまり、次の時刻ｔの拡大確率ベクトルｚ（ｔ）を推定する。一方、ステップＳ２０６で、ｔ＝Ｔとなった場合は（Ｎｏ）、すべての時刻ｔについての拡大確率ベクトルが推定されたので、この図４のフローチャートの処理を終了する。 Next, the average 2-sigma value of the autoregressive mixed model is calculated by equation (20) using the calculation results of step S100 (S204). This average 2-sigma value is then appended to the probability vector z^*(t) to estimate the expansion probability vector z(t) of equation (19) (S205). Next, it is determined whether the time t is smaller than T, the total length of the time series (t < T) (S206). If t is less than T (Yes in S206), the process returns to step S202 and executes step S202 and the subsequent steps; that is, the expansion probability vector z(t) for the next time t is estimated. If t = T in step S206 (No), the expansion probability vectors for all times t have been estimated, so the processing of the flowchart of FIG. 4 ends.
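Steps S203 to S205 can be sketched for a scalar series with τ = 1. Equations (18) to (20) are not reproduced in this passage, so two points are assumptions: the joint probability is taken as P(k)·N(y; a_k·x, σ_k²), and the "2-sigma value" of a model is read as its Gaussian density at a distance of 2σ_k from the mean, averaged with equal weight over the K models. The function names are hypothetical.

```python
import math


def gaussian_pdf(y, mean, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def expanded_probability_vector(x, y, weights, coefs, variances):
    """Return the K joint probabilities followed by the mean 2-sigma density.

    Sketch under the assumptions stated above, not the patent's equations.
    """
    # Probability vector z*(t): joint probability of the pair (x, y) and each model k.
    joint = [w * gaussian_pdf(y, a * x, v)
             for w, a, v in zip(weights, coefs, variances)]
    # Density of each model at two standard deviations from its mean.
    two_sigma = [gaussian_pdf(2.0 * math.sqrt(v), 0.0, v) for v in variances]
    mean_two_sigma = sum(two_sigma) / len(two_sigma)
    # Expanded vector z(t): z*(t) with the average 2-sigma value appended.
    return joint + [mean_two_sigma]


z = expanded_probability_vector(x=1.0, y=0.9,
                                weights=[0.5, 0.5],
                                coefs=[1.0, -1.0],
                                variances=[1.0, 1.0])
```

The resulting vector has K + 1 elements; the appended element is the same for every time t, since it depends only on the estimated model parameters.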

（異常値の可視化処理）
次に、図５のフローチャートを参照して、図２のステップ３００の「異常値の可視化処理」を詳細に説明する（適宜図１など参照）。以下の説明における動作主体は、異常値可視化部１３である。なお、異常可視化部１３は、ＣＰＵやＲＡＭなどから構成される演算装置１で機能するプログラムモジュールである。 (Abnormal value visualization processing)
Next, the “abnormal value visualization process” in step 300 of FIG. 2 will be described in detail with reference to the flowchart of FIG. 5 (see FIG. 1 as appropriate). The operation subject in the following description is the abnormal value visualization unit 13. The abnormality visualization unit 13 is a program module that functions in the arithmetic device 1 configured by a CPU, a RAM, and the like.

まず、図５に示すように、異常値可視化部１３は、ステップＳ３０１により、時刻ｉのデータＤ（ｉ）と時刻ｉとは異なる時刻ｊのデータＤ（ｊ）の類似度ｓ_i,jを、前記したステップＳ２００で算出した時刻ｉと時刻ｊの拡大確率ベクトルｚ（ｉ），ｚ（ｊ）を用いて算出する（式（２１）参照）。 First, as shown in FIG. 5, in step S301 the abnormal value visualization unit 13 calculates the similarity s_{i,j} between the data D(i) at time i and the data D(j) at a different time j, using the expansion probability vectors z(i) and z(j) for times i and j calculated in step S200 described above (see equation (21)).

次に、ステップＳ３０２により、式（２２）〜式（２４）を用いてｒ_iをデータＤ（ｉ）の座標とする。座標ｒｉを、式（２３）・式（２４）を用いつつ、確率密度ｐ（ｒｉ，ｒｊ）とのクロスエントロピの和に正則化項を加えた式（２２）を最小化することにより求める。 Next, in step S302, r_i is taken as the coordinates of the data D(i) using equations (22) to (24). The coordinates r_i are obtained by minimizing equation (22), which adds a regularization term to the sum of the cross entropies with the probability density p(r_i, r_j), while using equations (23) and (24).

以上の結果、他のデータから離れたところに位置するものが異常値であり、異常値が可視化される。 As a result of the above, what is located away from other data is an abnormal value, and the abnormal value is visualized.
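The text identifies abnormal values visually, as points that lie far from the others. As a rough numeric proxy for that visual judgment, and not a step described in the patent, the sketch below scores each embedded point by its mean Euclidean distance to all other points; the function name and the toy coordinates are hypothetical.

```python
import math


def mean_distance_to_others(coords):
    """Return, for each point, its mean Euclidean distance to the other points."""
    scores = []
    for i, p in enumerate(coords):
        dists = [math.dist(p, q) for j, q in enumerate(coords) if j != i]
        scores.append(sum(dists) / len(dists))
    return scores


# Three nearby points and one isolated point in a 3-dimensional embedding.
coords = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (5.0, 5.0, 5.0)]
scores = mean_distance_to_others(coords)
most_isolated = max(range(len(coords)), key=lambda i: scores[i])
```

Such a score could be used to label or rank the points in the display, but the choice of distance and of any cut-off would be a design decision outside what the text specifies.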

≪可視化結果≫
本発明の可視化方法などの有効性を、図６を参照しつつ、経済時系列データを用いた実施例で示す（適宜図１など参照）。なお、図６の（ａ）から（ｄ）は、本実施形態の異常値可視化装置により異常値を３次元で可視化した結果を示す図である（図面では平面的に表示されている）。 ≪Visualization result≫
The effectiveness of the visualization method of the present invention will be shown in an embodiment using economic time series data with reference to FIG. 6 (see FIG. 1 as appropriate). FIGS. 6A to 6D are diagrams showing the results of visualizing abnormal values in three dimensions by the abnormal value visualization apparatus of the present embodiment (displayed in a plane in the drawing).

用いるデータ（対象データＹ）は、前記したとおり、１９８３年１月から２００２年１２月までの「日本のマネタリーベース」、「国際金利」、「卸売物価指数」、「機械受注」、「鉱工業生産指数」、「円ドル為替レート」の月ごとの時系列データで、変数の数ｄ＝６、データ数Ｔ＝２４０である。この対象データＹに対して、前記した自己回帰混合モデルを用いた時系列データの異常値の可視化方法を採用し、自己回帰モデルの次数をτ＝１、自己回帰モデルの総数Ｋ＝１〜４として可視化を行った。ちなみに、Ｋが１の場合が図６（ａ）であり、Ｋが２の場合が図６（ｂ）であり、Ｋが３の場合が図６（ｃ）であり、Ｋが４の場合が図６（ｄ）である。各図中の１点が１つの月である。なお、異常値可視化装置Ａは、自己回帰モデルごとに点の種類が異なるように表示する機能を有しており、自己回帰モデルが複数ある場合でも、表示されている点が何番目の自己回帰モデルにより生成されたものであるのかが、視覚的に解るようになっている。例えば、図６（ｂ）は、１番目の自己回帰モデルが○印の点で示してあり、２番目の自己回帰モデルが×印の点で示してある。図６（ｃ）は、３番目の自己回帰モデルが△印の点で示してある。また、図６（ｄ）は、４番目の自己回帰モデルが□の点で示してある。 As described above, the data used (target data Y) are monthly time series from January 1983 to December 2002 of the “Japanese monetary base”, “international interest rate”, “wholesale price index”, “machinery orders”, “industrial production index”, and “yen-dollar exchange rate”, with d = 6 variables and T = 240 data points. The method for visualizing abnormal values of time series data using the autoregressive mixed model described above was applied to this target data Y, with the order of the autoregressive model set to τ = 1 and the total number of autoregressive models set to K = 1 to 4. FIG. 6(a) shows the case K = 1, FIG. 6(b) the case K = 2, FIG. 6(c) the case K = 3, and FIG. 6(d) the case K = 4. Each point in each figure corresponds to one month. The abnormal value visualization apparatus A has a function of displaying points with a different marker for each autoregressive model, so that even when there are several autoregressive models, it is visually apparent which autoregressive model generated each displayed point. For example, in FIG. 6(b), the first autoregressive model is indicated by ○ marks and the second by × marks. In FIG. 6(c), the third autoregressive model is indicated by △ marks. In FIG. 6(d), the fourth autoregressive model is indicated by □ marks.

図６に示すように、この期間に起こった大きな経済事件として、１９９８年８月のロシア危機がある。異常値の可視化結果を見ると、自己回帰モデルの総数Ｋ＝１〜４のすべての場合において、ロシア危機の翌月である１９９８年９月が他の多くの月から大きく離れており、異常値であることが解る。しかしながら、自己回帰モデルの総数Ｋが１の場合は、他の月も１９９８年９月近辺に位置されており、異常値が複数存在する。自己回帰モデルの総数Ｋを２、３としていっても異常値とみなされる月は複数あるが、自己回帰モデルの総数Ｋが４の場合は、１９９８年９月のみ異常値となっている。 As shown in FIG. 6, a major economic event during this period was the Russian crisis of August 1998. In the visualization results, for every total number of autoregressive models K = 1 to 4, September 1998, the month following the Russian crisis, lies far from most other months and is clearly an abnormal value. When K is 1, however, other months are also located near September 1998, so several abnormal values appear. With K = 2 or 3 there are still several months regarded as abnormal values, but with K = 4 only September 1998 is an abnormal value.

これは、自己回帰モデルの総数Ｋが１の場合、自己回帰混合モデルは線形モデルであるため、データが非線形性を有していた場合、適切にモデル化することができない。そのため、本来は異常値ではないにもかかわらず、モデルの表現能力が低いために異常値とされることがある。しかし、自己回帰モデルの総数Ｋを増やしていくことによって、モデルの表現能力が高まり、本来の異常値を検出することができる。
このように、本発明によれば、表現力のあるモデルで可視化するので、従来手法よりも良好な異常値の検出が実現できる。 This is because when the total number K of autoregressive models is 1, the autoregressive mixed model is a linear model, and therefore cannot be appropriately modeled if the data has nonlinearity. For this reason, although it is not an abnormal value originally, it may be an abnormal value because the expression ability of the model is low. However, by increasing the total number K of autoregressive models, the model's ability to express increases, and the original abnormal value can be detected.
As described above, according to the present invention, since visualization is performed with a model having expressive power, it is possible to realize detection of abnormal values better than the conventional method.

また、従来の異常値検出手法は、データ空間のなかでの平均からのデータの乖離をもとにしている。しかし、本実施形態では、各データをモデルのパラメータ空間に写像することにより、換言すると、式（２）や式（９）のように、データをある確率モデルから生成されるようにすることにより、対象データＹといった生データが有する特徴をより明確にすることができる。 Further, the conventional abnormal value detection method is based on the deviation of the data from the average in the data space. However, in the present embodiment, by mapping each data to the parameter space of the model, in other words, by generating the data from a certain probability model as shown in the equations (2) and (9). The characteristics of the raw data such as the target data Y can be made clearer.

なお、異常値可視化装置は、プログラムにより可視化方法を実行するものであり、この可視化方法をコンピュータに実行させるプログラムは、ＣＤＲ−ＲＯＭやデジタル多目的ディスクなどの記憶媒体に記憶されて流通され、また、通信回線を介して流通され、コンピュータにインストールされて機能する。この可視化方法の実施形態としては、可視化方法を実行するプログラムがインストールされたコンピュータ（つまり異常値可視化装置）が通信回線を介して対象データを取得し、これを可視化し、その結果を、通信回線を介して回答するようにしてもよい。また、拡大確率ベクトルを用いての異常値の可視化処理（図２のステップＳ３００）は、前記ＣｏＰＥ法以外に種々の手法を適用することができる。
また、確率モデルについて、前記実施形態では、自己回帰モデルを用いた例を説明したが、この自己回帰モデルに限定されるものではなく、正規分布モデルや多項分布モデルといった、他の確率モデルを用いてもよい。
また、前記した実施形態では、所定のデータの一例として異常値を可視化したが、所定のデータとして、例えば好ましい値を可視化するようにしてもよい。
つまり、本発明は、前記した実施形態に限定されることなく、その技術思想の及ぶ範囲で幅広く変形実施することができる。 The abnormal value visualization apparatus executes the visualization method by means of a program. A program that causes a computer to execute this visualization method is distributed stored on a storage medium such as a CD-ROM or a digital versatile disc, or distributed via a communication line, and functions once installed on a computer. As one embodiment of this visualization method, a computer on which the program for executing the visualization method is installed (that is, an abnormal value visualization apparatus) may acquire target data via a communication line, visualize it, and return the result via the communication line. In addition, various methods other than the CoPE method can be applied to the abnormal value visualization process using the expansion probability vectors (step S300 in FIG. 2).
In the above embodiment, an autoregressive model was used as the probability model. However, the probability model is not limited to the autoregressive model, and other probability models such as a normal distribution model or a multinomial distribution model may be used.
In the above-described embodiment, the abnormal value is visualized as an example of the predetermined data. However, for example, a preferable value may be visualized as the predetermined data.
That is, the present invention is not limited to the above-described embodiment, and can be widely modified within the scope of its technical idea.

本発明の一実施形態の異常値可視化装置の構成図である。 FIG. 1 is a block diagram of the abnormal value visualization apparatus according to one embodiment of the present invention.
読み込んだデータから異常値を可視化するまでの処理の概要を示したフローチャートである。 FIG. 2 is a flowchart outlining the processing from reading the data to visualizing the abnormal values.
図２のステップＳ１００の「自己回帰混合モデルの構築処理」を示すフローチャートである。 FIG. 3 is a flowchart showing the “autoregressive mixed model construction process” of step S100 in FIG. 2.
図２のステップＳ２００の「拡大確率ベクトルの推定処理」を示すフローチャートである。 FIG. 4 is a flowchart showing the “expansion probability vector estimation process” of step S200 in FIG. 2.
図２のステップＳ３００の「異常値の可視化処理」を示すフローチャートである。 FIG. 5 is a flowchart showing the “abnormal value visualization process” of step S300 in FIG. 2.
図１の異常値可視化装置により異常値を可視化した結果を示す図であり、（ａ）はＫが１の場合を示し、（ｂ）はＫが２の場合を示し、（ｃ）はＫが３の場合を示し、（ｄ）はＫが４の場合を示す。 FIG. 6 shows the results of visualizing abnormal values with the abnormal value visualization apparatus of FIG. 1, where (a) shows the case K = 1, (b) the case K = 2, (c) the case K = 3, and (d) the case K = 4.

Explanation of symbols

Ａ異常値可視化装置
１演算装置（演算手段）
１１モデルパラメータ推定部
１２拡大確率ベクトル推定部
１３異常値可視化部
２入力装置
３記憶装置
４表示装置

A Abnormal value visualization apparatus
1 Arithmetic device (arithmetic means)
11 Model parameter estimation unit
12 Expansion probability vector estimation unit
13 Abnormal value visualization unit
2 Input device
3 Storage device
4 Display device

Claims

A data visualization method that uses a computer having calculation means for performing calculations using a memory for storing information as a work area, and that visualizes predetermined data among a plurality of data y constituting target data Y by using a mixture model described by a linear sum of K probability models, wherein
the calculation means:
reads the target data Y and stores it in the memory;
sets, as a probability vector, a vector of the joint probabilities of the data y and each probability model, and stores it in the memory;
creates an expanded probability vector by combining the probability vector with an average probability density, which is the average of the probability densities of all the probability models at a predetermined random variable, and stores it in the memory; and
creates coordinate data for visualizing the predetermined data using the expanded probability vector and stores the coordinate data in the memory.

Using a computer having calculation means for performing calculation using a memory for storing information as a work area, predetermined data among a plurality of data y constituting target data Y is linearized with K autoregressive models having different parameters. In the data visualization method that visualizes using the autoregressive mixed model described by the sum,
The computing means is
A procedure for reading the target data Y and storing it in the memory;
A step of creating a plurality of input vectors x to the autoregressive model based on a plurality of data y included in the target data Y and the order τ of the autoregressive model and storing the input vector x in the memory;
Using the plurality of data y and the plurality of input vectors x to calculate parameters of the autoregressive model and store them in the memory;
A step of sequentially repeating, as many times as the number of the autoregressive models, the procedure of calculating the parameters of the autoregressive mixed model using the plurality of data y and the plurality of input vectors x, and storing the results in the memory;
A step of setting a vector of the simultaneous probability of the data D consisting of the input vector x and the data y corresponding to the input vector x and each autoregressive model as a probability vector and storing it in the memory;
Calculating an average probability density that is an average of probability densities in a predetermined random variable of the total autoregressive model and storing the average probability density in the memory;
Estimating an expanded probability vector by adding the average probability density to the probability vector and storing it in the memory;
A step of calculating coordinates for visual display using the expansion probability vector and storing them in the memory;
With
A method for visualizing data, wherein the predetermined data is visualized using the coordinates thus calculated.

The data visualization method according to claim 2, wherein the calculation of the parameters includes a procedure of repeatedly executing the E-step and the M-step of the EM algorithm, calculating a judgment value that is the logarithm of the likelihood of the data based on the parameters calculated each time, and adopting the parameters that maximize this judgment value as verified parameters.

The data visualization method according to claim 2, wherein a CoPE method is used as the procedure for visualization using the expanded probability vector, the CoPE method comprising:
a step of calculating a similarity between one data D of the plurality of data D and another data D different from that data D based on the expanded probability vector, and storing the similarity in the memory; and
a step of calculating coordinates of the plurality of data based on the similarity and storing them in the memory.

5. The data visualization method according to claim 1, wherein the target data Y is multidimensional time-series data.

The data visualization method according to claim 1, wherein the probability density at the predetermined random variable is a 2-sigma value and the average probability density is an average 2-sigma value.

Computation means for performing computation using a memory for storing information as a work area and a display device for displaying the computation results of the computation means, and predetermined data among a plurality of data y constituting the target data Y, In a data visualization apparatus for visualizing using an autoregressive mixed model described by a linear sum of K autoregressive models having different parameters,
The computing means is
A function of creating a plurality of input vectors x to the autoregressive model based on a plurality of data y included in the target data Y and the order τ of the autoregressive model;
A function of calculating parameters of the autoregressive model using the plurality of data y and a plurality of input vectors x;
A function of sequentially repeating the procedure of calculating the parameters of the autoregressive mixed model using the plurality of data y and the plurality of input vectors x by the number of the autoregressive models;
A model parameter estimator comprising:
A function of setting as a probability vector a set of simultaneous probabilities of a plurality of data D having the input vector x as an input and the corresponding data y as an output and each autoregressive model;
A function for calculating an average probability density that is an average of probability densities in a predetermined random variable of the total autoregressive model,
A function of estimating an expanded probability vector by adding the average probability density to the probability vector;
An expanded probability vector estimator comprising:
A function of calculating the similarity between one data D of the plurality of data D and another data D different from the data D based on the expansion probability vector;
A function for calculating coordinates of the plurality of data based on the similarity,
A function for instructing to display the calculated coordinates on the display device;
An abnormal value visualization unit comprising:
A data visualization device comprising:

A data visualization program that causes a computer, having calculation means for performing calculations using a memory for storing information as a work area, to execute the data visualization method according to claim 1.

A storage medium storing a visualization program that causes a computer, having calculation means for performing calculations using a memory for storing information as a work area, to execute the data visualization method according to any one of claims 1 to 6.