JP2010039756A

JP2010039756A - Independence test device, data analysis device, and independence test program

Info

Publication number: JP2010039756A
Application number: JP2008201945A
Authority: JP
Inventors: Takashi Isozaki; 隆司磯崎
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2008-08-05
Filing date: 2008-08-05
Publication date: 2010-02-18

Abstract

PROBLEM TO BE SOLVED: To accurately test the independence of items in sample data even when the number of samples is small.

SOLUTION: The sample data is resampled multiple times at several different resample sizes, each no larger than the number of samples. For each resample size, a test statistic for testing the independence of two items in the sample data is calculated from the resampled data. A test statistic for a number of samples larger than that of the sample data is then estimated from an approximate curve determined by the test statistics calculated for the respective resample sizes, and the estimated test statistic is compared with a threshold to determine whether the two items are independent.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an independence determination device, a data analysis device, and an independence test program.

In statistics, determining whether variables depend on one another is a basic and important analysis. Examples abound, such as the relationship between disease symptoms and drug efficacy, age, or sex in medicine, and the relationship between the occurrence of defects and various environmental factors in a production plant. Statistical hypothesis testing has been developed to examine such independence between variables, and the chi-square test and the likelihood ratio test (G test) are well known (see, for example, Non-Patent Document 1). These conventional tests assume that a large amount of data is available, so that the probability distribution generating the data can be approximated by a normal distribution via the central limit theorem and the test statistic can be described by the chi-square distribution derived from that approximation. When only a small amount of data is available, the normal approximation becomes less accurate, the threshold becomes less reliable, and the reliability of the hypothesis test result therefore decreases.

A bootstrap-based hypothesis testing method is also known as a technique for accurately calculating statistics from a small amount of data (see, for example, Non-Patent Document 2). Bootstrap hypothesis testing improves the accuracy of the statistic computed from the available data, but the chi-square approximation used to calculate the comparison threshold remains unreliable. The threshold therefore remains unreliable, and hypothesis tests on small data sets remain unreliable as well.

Non-Patent Document 1: Department of Statistics, College of Arts and Sciences, University of Tokyo (ed.), "自然科学の統計学" (Statistics for the Natural Sciences), University of Tokyo Press, 1992
Non-Patent Document 2: Jinfang Wang et al., "統計科学のフロンティア11 計算統計I" (Frontiers of Statistical Science 11: Computational Statistics I), Iwanami Shoten, 2003, p. 56

An object of the present invention is to provide an independence test device, a data analysis device, and an independence test program that can accurately test the independence of items in sample data even when the number of samples is small.

To achieve the above object, an independence test device according to the present invention includes: extraction means that sets a plurality of resample sizes of different magnitudes, each no larger than the number of samples in a set of sample data in which each sample has a plurality of items consisting of events corresponding to random variables, and that extracts, for each resample size, that number of samples from the sample data; calculation means that calculates, for each resample size and from the extracted samples, an independence test statistic for testing the independence of two of the items, or the conditional independence of the two items given one or more of the other items, the statistic being based on a G statistic or a chi-square statistic that measures a value corresponding to the degree of correlation between the two items; estimation means that estimates the independence test statistic at a number of samples larger than the number of samples, based on an approximate curve determined from the statistics calculated for the respective resample sizes; and determination means that determines that the two items are independent when the estimated statistic is equal to or less than a predetermined value.

The calculation means may calculate the independence test statistic by converting the G statistic or the chi-square statistic into mutual information or conditional mutual information.

The extraction means may perform the extraction a plurality of times for each resample size, and the calculation means may calculate, for each resample size, as many independence test statistics as there are extractions and take their average as the independence test statistic for that resample size.

Alternatively, the extraction means may perform the extraction a plurality of times for each resample size, the calculation means may calculate, for each resample size, as many independence test statistics as there are extractions, and the estimation means may estimate the independence test statistic at a number of samples larger than the number of samples based on an approximate curve determined by a weighted least-squares method that uses the variance of those statistics.

A data analysis device according to the present invention includes the above independence test device and construction means that constructs a Bayesian network or a Markov network over the items of the sample data by linking the items that the determination means has determined not to be independent.

An independence test program according to the present invention causes a computer to function as: extraction means that sets a plurality of resample sizes of different magnitudes, each no larger than the number of samples in a set of sample data in which each sample has a plurality of items consisting of events corresponding to random variables, and that extracts, for each resample size, that number of samples from the sample data; calculation means that calculates, for each resample size and from the extracted samples, an independence test statistic for testing the independence of two of the items, or the conditional independence of the two items given one or more of the other items, the statistic being based on a G statistic or a chi-square statistic that measures a value corresponding to the degree of correlation between the two items; estimation means that estimates the independence test statistic at a number of samples larger than the number of samples, based on an approximate curve determined from the statistics calculated for the respective resample sizes; and determination means that determines that the two items are independent when the estimated statistic is equal to or less than a predetermined value.

As described above, the independence test device according to claim 1 and the independence test program according to claim 6 make it possible to accurately test the independence of items in sample data even when the number of samples is small.

The independence test device according to claims 2 to 4 makes it possible to accurately estimate the test statistic used for testing independence.

The data analysis device according to claim 5 makes it possible to perform accurate data analysis based on accurately determined independence between items.

Embodiments of the present invention are described in detail below with reference to the drawings. The description covers a case in which the independence test device of the present invention is applied to a data analysis device that analyzes the relationships between items by analyzing sample data and constructing a Bayesian network.

As shown in FIG. 1, the data analysis device 10 according to the present embodiment includes an input device 12, such as operation keys, a keyboard, a mouse, and a touch panel, for entering various settings and conditions; a display device 14, such as a display, for visualizing and displaying the constructed Bayesian network; and a computer 16 that executes the data analysis processing.

The computer 16 includes a CPU 24 that controls the entire data analysis device 10, a ROM 26 serving as a storage medium that stores various programs such as the data analysis program and the independence test program described later, a RAM 28 that temporarily stores data as a work area, an HDD (hard disk) 30 serving as storage means for various kinds of information, a network I/F (interface) unit 32 for connecting to a network, an I/O (input/output) port 34, and a bus connecting these components. The input device 12 and the display device 14 are connected to the I/O port 34.

Described in terms of functional blocks, divided according to the function-realizing means determined by hardware and software, the computer 16 can be represented as shown in FIG. 2. It includes a data extraction unit 36 that resamples from the sample data; a test statistic calculation unit 38 that calculates a test statistic for testing independence from the data extracted by the data extraction unit 36; an estimation unit 40 that estimates, from the test statistics calculated by the test statistic calculation unit 38, the test statistic at a number of data larger than the number of sample data; a determination unit 42 that determines, from the test statistic estimated by the estimation unit 40, whether two items in the sample data are independent; and a network construction unit 44 that constructs a Bayesian network over the items of the sample data based on the determination results of the determination unit 42.

The principle of the independence test of the present embodiment is now described for a case in which n sample records are given, each holding values (events) corresponding to random variables for a plurality of items (X, Y, Z, ...), and the independence of item X and item Y is tested.

First, natural numbers m_1, m_2, ..., m_i, ..., m_imax, each no greater than n, are set as resample sizes. Next, the hypothesis "item X and item Y are independent given item Z" is set up, m_1 records are resampled from the n given sample records, and an independence test statistic is calculated from the theoretical values implied by the hypothesis and the observed values obtained from the resampled data, as a measure of the degree of conditional correlation between item X and item Y. Likewise, m_2, ..., m_i, ..., m_imax records are resampled from the n sample records, and a test statistic is calculated for each.

As shown in FIG. 3, the calculated test statistics are plotted (dot marks) in a coordinate system with the number of sampled data on the horizontal axis and the test statistic on the vertical axis. An approximate curve is fitted, by least squares or a similar method, to the test statistics calculated for resample sizes m_1, m_2, ..., m_i, ..., m_imax, and the test statistic at a sample size N larger than n is estimated (cross mark). Independence is determined by comparing this estimated test statistic with a threshold. The test statistic is preferably a G statistic or a chi-square statistic, both of which are easily converted into mutual information or conditional mutual information. When the test statistic is equal to or less than the chi-square threshold calculated at a commonly used significance level (such as 5%), the hypothesis is judged correct (item X and item Y are independent); when it exceeds the threshold, the hypothesis is judged incorrect (item X and item Y are not independent).

Next, the processing routine of the data analysis program in the present embodiment is described with reference to FIG. 4. The description covers the construction of a Bayesian network for a system that learns an individual's profile, tastes, and preferences, judges whether a newly arrived e-mail is important to that individual, and notifies the individual of the result.

In step 100, sample data is acquired. The sample data may be entered from the input device 12, may be stored in a storage device connected externally via a network, or may be stored in advance in the HDD 30 of the computer 16. Here, 500 sample records having item X, item Y, and item Z are assumed to be acquired.

In the sample data, as shown in FIG. 5, item X (meeting) indicates whether the word "会議" ("meeting") appears in the text of the e-mail, with event x1 = "present" and event x2 = "absent". Item Y (frequent) indicates whether e-mails have been exchanged frequently with the sender in the past, with event y1 = "present" and event y2 = "absent". Item Z (important) indicates whether the newly arrived e-mail is important, with event z1 = "important" and event z2 = "not important". The names in parentheses are short item names that summarize the content of each item.

Let P(x) denote the probability of event x. Based on theoretical values, events x1 and x2 of item X have P(x1) = 1/2 and P(x2) = 1/2. Based on the observed values in the resampled data, P(x1) = (number of x1 records) / (number of resampled records) and P(x2) = (number of x2 records) / (number of resampled records). The same applies to item Y and item Z. In this way, the sample data has a plurality of items consisting of events corresponding to random variables.
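The observed-value probabilities described above are plain relative frequencies. The following minimal sketch illustrates that calculation; the function name `empirical_probabilities` and the ten-record resample are hypothetical and are not taken from the publication.

```python
from collections import Counter

def empirical_probabilities(values):
    """Return {event: count / number of resampled records} for one item."""
    counts = Counter(values)
    total = len(values)
    return {event: count / total for event, count in counts.items()}

# Hypothetical resample of 10 records for item X (6 x "x1", 4 x "x2").
resampled_x = ["x1", "x2", "x1", "x1", "x2", "x2", "x1", "x2", "x1", "x1"]
probs = empirical_probabilities(resampled_x)
print(probs)
```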

Next, in step 102, the items whose correlation is to be analyzed by testing independence are selected from among the items in the sample data. Because conditional independence is tested here, two items and one conditioning item are selected. The selection may follow a predetermined rule, so that correlations are analyzed for all combinations of items in the sample data, or it may be based on a selection signal entered from the input device 12 by the user.

Here, the aim is to analyze whether, in judging whether a newly arrived e-mail is important, the presence of the word "meeting" in the text is correlated with a history of frequent e-mail exchanges with the sender. Item X and item Y are therefore selected as the items to be tested for independence, and item Z as the conditioning item.

Next, in step 104, the independence test process described later is executed to test the independence of the items selected in step 102.

Next, in step 106, a network among the items is constructed based on the independence test results. If the conditional independence hypothesis is rejected, the items are correlated, so an edge of the Bayesian network is judged to exist between them. If the hypothesis is accepted, the items are not correlated, so no edge is judged to exist. The Bayesian network is constructed from these judgments. For example, if the hypothesis "item X and item Y are independent given item Z" is not rejected, X and Y are independent given Z and no edge exists between X and Y; if, in addition, every conditional independence hypothesis between X and Z and between Y and Z is rejected, edges exist between X and Z and between Y and Z. At this stage the edges are still undirected. Applying orientation rules that determine edge directions in a Bayesian network in light of conditional independence (see, for example, C. Meek, "Causal Inference and Causal Explanation with Background Knowledge", Conference on Uncertainty in Artificial Intelligence, 1995) yields a Bayesian network such as the one shown in FIG. 6.
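The edge decision in step 106 can be sketched as follows: an undirected edge is kept for every pair of items except those for which some (conditional) independence hypothesis was accepted. The function `build_skeleton` and the `accepted` set are hypothetical names chosen for illustration, and edge orientation is deliberately left out.

```python
from itertools import combinations

def build_skeleton(items, accepted):
    """Return the undirected edges between items judged not independent.

    `accepted` holds the pairs for which an independence hypothesis was
    accepted by the test; such pairs get no edge.
    """
    edges = set()
    for a, b in combinations(items, 2):
        pair = frozenset((a, b))
        if pair not in accepted:
            edges.add(pair)
    return edges

# "X and Y are independent given Z" was accepted; all hypotheses
# involving Z were rejected (assumed outcome, mirroring the example above).
accepted = {frozenset(("X", "Y"))}
edges = build_skeleton(["X", "Y", "Z"], accepted)
print(sorted(sorted(e) for e in edges))
```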

Next, in step 108, it is determined whether to end the analysis. When items are selected according to a predetermined rule, this is decided by whether all combinations of items have been analyzed. When items are selected by the user, it is decided by a selection signal entered by the user, for example in response to a screen asking whether to go on to analyze other items. If the analysis is to end, the process proceeds to step 110, where the constructed Bayesian network is visualized and displayed on the display device 14, and the process ends. If the analysis is not to end, the process returns to step 102 and the analysis is repeated for other items. The data of the constructed Bayesian network may also be output to an external device connected via a network.

Next, the processing routine of the independence test process executed in step 104 of the data analysis process (FIG. 4) is described with reference to FIG. 7.

In step 200, the resample sizes m_i, which give the number of records drawn from the sample data in each resampling, and the number of repetitions jmax, which gives how many times resampling is performed for each resample size m_i, are determined. They may be determined by a predetermined rule based on the number n of sample records, or by user input. With a predetermined rule they can be set, for example, as m_1 = n × 0.2, m_2 = n × 0.4, m_3 = n × 0.5, m_4 = n × 0.6, m_5 = n × 0.7, m_6 = n × 0.8, m_7 = n × 0.9, and jmax = 10. Since the number n of sample records here is 500, the resample sizes m_i are 100, 200, 250, 300, 350, 400, and 450.
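The example rule in step 200 can be written out as a short sketch. The function name `resample_sizes` is hypothetical, and rounding to the nearest integer is an assumption for values of n where the products are not whole numbers.

```python
def resample_sizes(n, fractions=(0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return the resample sizes m_i = n * fraction, rounded to integers."""
    return [round(n * f) for f in fractions]

JMAX = 10  # number of resampling repetitions per resample size
sizes = resample_sizes(500)
print(sizes)
```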

Next, in step 202, i and j are set to 1, and in step 204 the j-th resampling with resample size m_i is performed. Since i and j are both 1 here, the first resampling with resample size m_1 (100 records) is performed.
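A single resampling in step 204 might look like the sketch below. The text does not state whether records are drawn with or without replacement; drawing with replacement, as in the bootstrap, is an assumption here, and `draw_resample` and the toy records are hypothetical.

```python
import random

def draw_resample(records, m, rng):
    """Draw m records with replacement (assumed bootstrap-style scheme)."""
    return rng.choices(records, k=m)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Toy sample of n = 500 (x, y, z) records, for illustration only.
records = [("x1", "y1", "z1")] * 300 + [("x2", "y2", "z2")] * 200
resample = draw_resample(records, 100, rng)
print(len(resample))
```

Using `rng.sample(records, m)` instead would give subsampling without replacement, which is the other natural reading of the text.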

Next, in step 206, a test statistic for testing independence is calculated from the resampled data. Here, the G statistic expressed by equation (1) below is converted, through its relationship (equation (3)) to the mutual information expressed by equation (2), into the mutual information of equation (4), which is used as the test statistic; the chi-square distribution values are likewise divided by 2m_i. When the chi-square statistic is used instead, it can be converted into mutual information by exactly the same formula, because the chi-square statistic is itself an approximation of the G statistic. Mutual information is a measure of correlation calculated from probability values, so by the law of large numbers (see, for example, Non-Patent Document 1) it converges to a certain value when enough data is available. Therefore, provided enough data is available, the value of equation (4) converges to a certain value for a given resample size m_i.

Here O is the observed frequency and E is the expected frequency under the hypothesis, that is, the frequency obtained when independence or conditional independence is assumed.

The symbol z with an arrow (also written z^*) denotes the joint event of one or more items (events). Further, p(x, y, z^*) is the joint probability of events x, y, and z^*; p(x, y | z^*) is the joint probability of events x and y given that event z^* has occurred; p(x | z^*) is the marginal probability of event x given that event z^* has occurred; and p(y | z^*) is the marginal probability of event y given that event z^* has occurred. If item X and item Y are independent, equation (5) holds and the mutual information is 0; the mutual information takes larger values as the correlation between item X and item Y becomes stronger.
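Equations (1) to (5) are not reproduced in this text. Under the standard definitions of the G statistic and conditional mutual information, and using the symbols defined above, they would take roughly the following form. This is a reconstruction under stated assumptions (natural logarithms, sums over all cells of the contingency table), not the publication's exact notation.

```latex
\begin{align}
G &= 2 \sum_{x,y,z^*} O_{xyz^*} \ln \frac{O_{xyz^*}}{E_{xyz^*}} \tag{1} \\
I(X;Y \mid Z) &= \sum_{x,y,z^*} p(x,y,z^*) \ln \frac{p(x,y \mid z^*)}{p(x \mid z^*)\, p(y \mid z^*)} \tag{2} \\
G &= 2\, m_i\, I(X;Y \mid Z) \tag{3} \\
MI_i &= \frac{G}{2\, m_i} \tag{4} \\
p(x,y \mid z^*) &= p(x \mid z^*)\, p(y \mid z^*) \tag{5}
\end{align}
```

Relation (3) follows from (1) and (2) by writing O = m_i p(x, y, z^*) and E = m_i p(z^*) p(x | z^*) p(y | z^*).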

The chi-square test statistic χ² is expressed as χ² = Σ (O − E)² / E, where the sum runs over all cells of the contingency table and O and E are defined as for the G statistic above.

For example, suppose that 100 resampled records are tallied and the tally shown in FIG. 8 is obtained. From this tally, under the condition z = z1 (important), the values for x = x1 (present) and y = y1 (present) are p(x, y, z) = 0.35, p(x, y | z) = 0.35, p(x | z) = 0.45, and p(y | z) = 0.45. The corresponding values are obtained in the same way for the remaining cases under the condition z = z1 (important), namely x = x1 and y = y2, x = x2 and y = y1, and x = x2 and y = y2, and for the four cases under the condition z = z2 (not important), namely x = x1 and y = y1, x = x1 and y = y2, x = x2 and y = y1, and x = x2 and y = y2. The sum over all these cases is substituted into equation (1) to calculate the test statistic.
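The tally-and-sum calculation above can be sketched directly from counts. The code below computes the plug-in conditional mutual information in nats from (x, y, z) records and, assuming the standard relation G = 2 m MI, the corresponding G statistic. The function name and the toy records are hypothetical and are not the values of FIG. 8.

```python
import math
from collections import Counter

def conditional_mutual_information(records):
    """Plug-in estimate of I(X;Y|Z) in nats from (x, y, z) records."""
    m = len(records)
    n_xyz = Counter(records)
    n_xz = Counter((x, z) for x, _, z in records)
    n_yz = Counter((y, z) for _, y, z in records)
    n_z = Counter(z for _, _, z in records)
    mi = 0.0
    for (x, y, z), n in n_xyz.items():
        # p(x,y,z) * ln[ p(x,y|z) / (p(x|z) p(y|z)) ], written with counts.
        ratio = n * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)])
        mi += (n / m) * math.log(ratio)
    return mi

# Toy data: X and Y perfectly dependent within a single value of Z.
dependent = [("x1", "y1", "z1")] * 5 + [("x2", "y2", "z1")] * 5
# Toy data: every (x, y, z) combination equally often, so X and Y are independent.
independent = [(x, y, z) for x in ("x1", "x2")
               for y in ("y1", "y2") for z in ("z1", "z2")]

mi_dep = conditional_mutual_information(dependent)
mi_ind = conditional_mutual_information(independent)
g_dep = 2 * len(dependent) * mi_dep  # G statistic under G = 2 * m * MI
print(mi_dep, mi_ind, g_dep)
```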

Next, in step 208, it is determined whether j = jmax, that is, whether the determined number of resampling repetitions has been completed. Here, ten resamplings are to be performed for resample size m_1 and only the first has been done, so the determination is negative and the process proceeds to step 210.

In step 210, j is incremented by 1 and the process returns to step 204; the processing is repeated until the determined number of resampling repetitions (ten) has been completed. When j = jmax, the process proceeds to step 212.

In step 212, the average test statistic MI_i for resample size m_i and its variance are calculated from the jmax test statistics MI_ij obtained from the jmax resamplings at that resample size. The calculated average test statistic MI_i and its variance are temporarily stored in a predetermined storage area.
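Step 212 reduces the repeated statistics for one resample size to a mean and a variance. A minimal sketch follows; the unbiased sample variance (division by jmax − 1) is an assumption, since the text does not say which variance estimator is used, and the five values are made up for illustration.

```python
from statistics import mean, variance

def summarize(mi_values):
    """Return (mean, sample variance) of the statistics MI_ij for one m_i."""
    return mean(mi_values), variance(mi_values)

# Hypothetical MI_ij values from five resamplings at one resample size.
mi_ij = [0.010, 0.012, 0.008, 0.011, 0.009]
mi_mean, mi_var = summarize(mi_ij)
print(mi_mean, mi_var)
```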

Next, in step 214, it is determined whether i = imax, that is, whether resampling has been completed for all of the determined resample sizes m_i. Here, seven resample sizes m_i (100, 200, 250, 300, 350, 400, and 450) have been determined and only the resampling with 100 records has been done, so the determination is negative and the process proceeds to step 216.

In step 216, i is incremented by 1 and the process returns to step 204; the processing is repeated until resampling has been completed for all of the determined resample sizes m_i. When i = imax, the process proceeds to step 218.

In step 218, as shown in FIG. 9, the average test statistic MI_i for each resample size m_i is plotted (square marks) in a coordinate system with the resample size on the horizontal axis and the test statistic, converted to mutual information, on the vertical axis. The error bar at each plotted point shows the spread of the test statistics MI_ij calculated in the individual repetitions j. From these points, the approximate curve 50 is calculated by the least-squares method or by a least-squares method weighted by the variances.
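The variance-weighted fit in step 218 can be sketched as below. The text does not specify the functional form of the approximate curve; the model MI(m) ≈ a + b / m used here is purely an assumption for illustration, chosen because it is linear in 1/m and has a closed-form weighted least-squares solution. All names and the synthetic data are hypothetical.

```python
def fit_weighted(sizes, means, variances):
    """Weighted least-squares fit of MI(m) ~ a + b / m; returns (a, b).

    Each point is weighted by 1 / variance, so variances must be non-zero.
    """
    u = [1.0 / m for m in sizes]
    w = [1.0 / v for v in variances]
    sw = sum(w)
    su = sum(wi * ui for wi, ui in zip(w, u))
    sy = sum(wi * yi for wi, yi in zip(w, means))
    suu = sum(wi * ui * ui for wi, ui in zip(w, u))
    suy = sum(wi * ui * yi for wi, ui, yi in zip(w, u, means))
    b = (sw * suy - su * sy) / (sw * suu - su * su)
    a = (sy - b * su) / sw
    return a, b

def extrapolate(a, b, big_n):
    """Evaluate the fitted curve at the extrapolation size N."""
    return a + b / big_n

sizes = [100, 200, 250, 300, 350, 400, 450]
means = [0.002 + 1.5 / m for m in sizes]  # synthetic, noise-free values
variances = [1e-6] * len(sizes)
a, b = fit_weighted(sizes, means, variances)
mi_n = extrapolate(a, b, 2000)
print(a, b, mi_n)
```

With equal variances the fit reduces to ordinary least squares, which corresponds to the unweighted alternative mentioned in the text.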

Next, in step 220, a data count sufficiently larger than the number n of sample data is determined as the extrapolated data count N. An appropriate value of N may be determined based on the number n of sample data, the resample sizes m_i, the number of resamplings j, and the like, or N may be determined by user input. Based on the calculated approximate curve, the test statistic at the determined extrapolated data count N is calculated as the estimated test statistic MI_N.
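Steps 218 and 220 can be illustrated with a small weighted least squares fit followed by extrapolation. The curve form MI(m) = a + b / m is an assumption made only for this sketch, since this passage does not specify the functional form of approximate curve 50; the function names are likewise hypothetical.

```python
def fit_inverse_size_curve(sizes, means, variances):
    """Weighted least squares fit of MI(m) = a + b / m.

    Each point is weighted by 1 / variance, so resample sizes with
    tightly clustered test statistics influence the fit more.
    The curve form is an assumption for illustration.
    """
    xs = [1.0 / m for m in sizes]
    ws = [1.0 / v for v in variances]
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, means)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, means))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def extrapolate(a, b, n_extrapolated):
    """Evaluate the fitted curve at the extrapolated data count N."""
    return a + b / n_extrapolated
```

With the seven resample sizes of the example and N = 2000, `extrapolate(*fit_inverse_size_curve(sizes, means, variances), 2000)` would play the role of the estimated test statistic MI_N. Passing equal variances reduces the fit to ordinary least squares.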

Next, in step 222, a threshold distribution 52 is calculated by converting the chi-square distribution at the 95% significance level into mutual information based on the degrees of freedom of the items to be tested for independence, and the threshold th at the determined extrapolated data count N is calculated from this threshold distribution 52. When the extrapolated data count N is 2000, the estimated test statistic MI_N and the threshold th are calculated as shown in FIG. 9.
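One way to picture the conversion in step 222 is sketched below. It assumes that mutual information is measured in nats, so that the G statistic equals 2 * N * MI, and it reads the 95% level as the 95th percentile of the chi-square distribution; both readings are assumptions of this sketch. The critical values are the standard tabulated ones for 1 to 4 degrees of freedom, and the function names are hypothetical.

```python
# Standard 95th-percentile critical values of the chi-square distribution,
# keyed by degrees of freedom (only a small table, for illustration).
CHI2_CRITICAL_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def degrees_of_freedom(levels_x, levels_y, levels_z=1):
    """Degrees of freedom for testing X against Y given a condition Z.

    levels_z is the number of joint states of the conditioning items;
    leave it at 1 for an unconditional test.
    """
    return (levels_x - 1) * (levels_y - 1) * levels_z


def mi_threshold(df, n):
    """Chi-square critical value rescaled to mutual information (nats),
    using the relation G = 2 * n * MI."""
    return CHI2_CRITICAL_95[df] / (2.0 * n)
```

For two binary items conditioned on a third binary item, `degrees_of_freedom(2, 2, 2)` gives 2, and `mi_threshold(2, 2000)` gives the threshold on the mutual information scale at N = 2000. Because the threshold falls as 1 / N, it traces a curve over the data count in the manner of threshold distribution 52.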

Next, in step 224, it is determined whether the estimated test statistic MI_N is greater than the threshold th. If MI_N is greater than th, the process proceeds to step 226 and the test result "rejected" (not conditionally independent) is output; if MI_N is equal to or less than th, the test result "accepted" (conditionally independent) is output. The routine then returns.
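The comparison in step 224 amounts to the following rule, shown here as a hypothetical helper; the returned strings simply mirror the two test results named in the text.

```python
def judge_independence(estimated_mi, threshold):
    """Return "rejected" when the estimated test statistic exceeds the
    threshold (the items are not independent), otherwise "accepted"."""
    if estimated_mi > threshold:
        return "rejected"
    return "accepted"
```

Note that a value exactly equal to the threshold is treated as "accepted", matching the "equal to or less than" wording of the step.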

As described above, the acquired sample data is resampled multiple times at different resample sizes, and the test statistic at a data count sufficiently larger than the number of sample data is estimated from the test statistics calculated for each resampling. Independence can therefore be tested accurately even when the number of sample data is small, and accurate data analysis can be performed by constructing a network among the items from these test results.

In the present embodiment, the case of testing conditional independence has been described. However, unconditional independence of two items may be tested instead, without setting any conditioning item.

In the present embodiment, the case where a Bayesian network is constructed from the results of the independence test has been described as the data analysis. However, any analysis that constructs a network based on the correlation or conditional correlation between two variables may be used; for example, a Markov network may be constructed.

In the present embodiment, the approximate curve is calculated from the mean test statistic for each resample size. Alternatively, the test statistics of the individual resamplings may be plotted as they are (for example, over the range of the error bar at each point in FIG. 9), and the approximate curve may be calculated by a weighted least squares method based on the variance of the test statistics for each resample size.

In the present embodiment, the G statistic is used as the test statistic. However, the chi-square statistic or any other statistical independence test statistic that can be converted into mutual information may be used.
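For reference, the two statistics named here can be computed from a contingency table of observed counts as in the sketch below. The function names are hypothetical. The rescaling by 2n is exact for the G statistic when mutual information is measured in nats; applying the same rescaling to the chi-square statistic is only an approximation, because the two statistics agree only to leading order near independence.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence, plus the total count n."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    return expected, n


def g_statistic(table):
    """Likelihood ratio (G) statistic; empty cells contribute nothing."""
    expected, _ = expected_counts(table)
    return 2.0 * sum(
        o * math.log(o / e)
        for row_o, row_e in zip(table, expected)
        for o, e in zip(row_o, row_e)
        if o > 0
    )


def chi_square_statistic(table):
    """Pearson chi-square statistic; cells with zero expectation are skipped."""
    expected, _ = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for row_o, row_e in zip(table, expected)
        for o, e in zip(row_o, row_e)
        if e > 0
    )


def to_mutual_information(statistic, n):
    """Rescale a statistic to the mutual information scale via G = 2 * n * MI."""
    return statistic / (2.0 * n)
```

A conditional statistic could be obtained by computing one table per state of the conditioning items and summing the resulting statistics, which is the usual construction rather than something stated in this passage.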

In the present embodiment, the case where the threshold to be compared with the estimated test statistic is calculated based on the chi-square distribution has been described. However, the invention is not limited to this, and an appropriate threshold may be set for each extrapolated data count.

In the present embodiment, the case where the sample data has three items has been described, but sample data containing more items may be used. Because the number of independence tests increases exponentially as the number of items increases, a Bayesian network construction algorithm based on efficient conditional independence tests, such as the PC algorithm (see, for example, C. Glymour & F. Cooper (eds.), "Computation, Causation, & Discovery", AAAI Press / MIT Press, 1999), can also be used.

A schematic diagram showing the configuration of the data analysis device 10 according to the present embodiment.
A block diagram showing the functional configuration of the data analysis device 10 according to the present embodiment.
A diagram for explaining the principle of the independence test of the present embodiment.
A flowchart showing the processing routine of the data analysis program in the present embodiment.
A diagram showing an example of sample data.
A diagram showing an example of a Bayesian network constructed by the data analysis.
A flowchart showing the processing routine of the independence test program in the present embodiment.
A table showing an example of tabulation results.
A diagram for explaining the calculation of the estimated test statistic and the threshold.

Explanation of symbols

10 Data analysis device
12 Input device
14 Display device
16 Computer
36 Data extraction unit
38 Test statistic calculation unit
40 Estimation unit
42 Determination unit
44 Network configuration unit

Claims

An independence test apparatus comprising:
extraction means for setting a plurality of resample sizes that differ from one another and are each no larger than the number of samples in a plurality of sample data having a plurality of items each corresponding to a random variable, and for extracting, for each set resample size, that number of sample data from the plurality of sample data;
calculation means for calculating, for each resample size and based on the sample data extracted by the extraction means, an independence test statistic that is either a statistic for testing the independence of two items among the plurality of items of the extracted sample data or a conditional statistic for testing the independence of the two items conditioned on one or more items other than the two items, the independence test statistic being based on a G statistic or a chi-square statistic that measures a value corresponding to the degree of correlation between the two items;
estimation means for estimating, based on an approximate curve determined from the independence test statistics calculated by the calculation means for the respective resample sizes, the independence test statistic at a number of samples larger than the number of samples of the sample data; and
determination means for determining that the two items are independent when the independence test statistic estimated by the estimation means is equal to or less than a predetermined value.

The independence test apparatus according to claim 1, wherein the calculation means calculates the independence test statistic by converting the G statistic or the chi-square statistic into mutual information or conditional mutual information.

The independence test apparatus according to claim 1 or claim 2, wherein
the extraction means performs extraction a plurality of times for each resample size, and
the calculation means calculates, for each resample size, as many independence test statistics as the number of extractions, and calculates the independence test statistic for that resample size as the average of those independence test statistics.

The independence test apparatus according to claim 1 or claim 2, wherein
the extraction means performs extraction a plurality of times for each resample size,
the calculation means calculates, for each resample size, as many independence test statistics as the number of extractions, and
the estimation means estimates the independence test statistic at a number of samples larger than the number of samples of the sample data based on an approximate curve determined by a weighted least squares method based on the variance of the plurality of independence test statistics.

A data analysis device comprising:
the independence test apparatus according to any one of claims 1 to 4; and
configuration means for associating items determined to be independent by the determination means and configuring a Bayesian network or a Markov network for the plurality of items of the sample data.

An independence test program for causing a computer to function as:
extraction means for setting a plurality of resample sizes that differ from one another and are each no larger than the number of samples in a plurality of sample data having a plurality of items each corresponding to a random variable, and for extracting, for each set resample size, that number of sample data from the plurality of sample data;
calculation means for calculating, for each resample size and based on the sample data extracted by the extraction means, an independence test statistic that is either a statistic for testing the independence of two items among the plurality of items of the extracted sample data or a conditional statistic for testing the independence of the two items conditioned on one or more items other than the two items, the independence test statistic being based on a G statistic or a chi-square statistic that measures a value corresponding to the degree of correlation between the two items;
estimation means for estimating, based on an approximate curve determined from the independence test statistics calculated by the calculation means for the respective resample sizes, the independence test statistic at a number of samples larger than the number of samples of the sample data; and
determination means for determining that the two items are independent when the independence test statistic estimated by the estimation means is equal to or less than a predetermined value.