JP2004213463A - Data processing method, data processing program, and its recording medium

Info

Publication number: JP2004213463A
Application number: JP2003001132A
Authority: JP
Inventors: Yasushi Tsun; 寧鍾; Kikuchin To; 菊珍董
Original assignee: WEB INTELLIGENCE LAB KK
Current assignee: WEB INTELLIGENCE LAB KK
Priority date: 2003-01-07
Filing date: 2003-01-07
Publication date: 2004-07-29
Also published as: AU2003292672A1; WO2004061764A1

Abstract

PROBLEM TO BE SOLVED: To provide a data mining method that is effective for real-world problems by enabling rules to be discovered from large-scale, diverse, noisy, or incomplete data.
SOLUTION: The method combines a generalization distribution table (GDT) with rough set (RS) theory and spans the theories of logic, sets, and probability. The GDT is used to build a probabilistic criterion for evaluating the incompleteness and ambiguity of the data and the strength of discovered rules. Rough set theory is used to extract rules and to select, from among the strong rules, a minimal rule set covering all instances.
COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
この発明は、実世界の大規模、複雑、不完全で曖昧なデータからルールや因果関係などの知識発見を可能にするデータマイニングにかかる技術に属する。
【０００２】
【従来の技術】
コンピュータの計算速度の飛躍的向上と記憶容量の大規模化並びにネットワーク環境の整備により、大量のデータの蓄積、大規模な情報（知識とデータ）の共有や再利用が可能となった。データベースに対するニーズが増加するにつれ、それらに蓄積されているデータを有効に活用し、データ群の背後にある法則や因果関係を分析／抽出する必要性が高まっている。
【０００３】
実世界のデータからｉｆ−ｔｈｅｎルールを発見する帰納学習に関する研究がこの二十年間行われ、代表的なボトムアップ法：バージョン空間（ｖｅｒｓｉｏｎ−ｓｐａｃｅ）［Ｍｉｔｃｈｅｌｌ−７７］（非特許文献１）やＢａｃｋ−ｐｒｏｐａｇａｔｉｏｎ［Ｒｕｍｅｌｈａｒｔ−８６］（非特許文献２）から、トップダウン法：Ｃ４．５［Ｑｕｉｎｌａｎ−９３］（非特許文献３）及びラフ集合の方法［Ｐａｗｌａｋ−９１］（非特許文献４）などまで色々提案された。
【０００４】
【非特許文献１】
［Ｍｉｔｃｈｅｌｌ−７７］Ｍｉｔｃｈｅｌｌ，Ｔ．Ｍ．
”ＶｅｒｓｉｏｎＳｐａｃｅｓ：ＡＣａｎｄｉｄａｔｅＥｌｉｍｉｎａｔｉｏｎＡｐｐｒｏａｃｈｔｏＲｕｌｅＬｅａｒｎｉｎｇ”，
Ｐｒｏｃ．５ｔｈＩｎｔ．ＪｏｉｎｔＣｏｎｆ．ＡｒｔｉｆｉｃｉａｌＩｎｔｅｌｌｉｇｅｎｃｅ，（１９７７）３０５−３１０．
【０００５】
【非特許文献２】
［Ｒｕｍｅｌｈａｒｔ−８６］Ｒｕｍｅｌｈａｒｔ，Ｄ．Ｅ．，Ｈｉｎｔｏｎ，Ｇ．Ｅ．，ａｎｄＷｉｌｌｉａｍｓ，Ｒ．Ｊ．
”ＬｅａｒｎｉｎｇＩｎｔｅｒｎａｌＲｅｐｒｅｓｅｎｔａｔｉｏｎｓｂｙＢａｃｋ−ＰｒｏｐａｇａｔｉｏｎＥｒｒｏｒｓ”．
ＮａｔｕｒｅＶｏｌ．３２３（１９８６）５３３−５３６．
【０００６】
【非特許文献３】
［Ｑｕｉｎｌａｎ−９３］Ｑｕｉｎｌａｎ，Ｊ．Ｒ．
”Ｃ４．５：ＰｒｏｇｒａｍｓｆｏｒＭａｃｈｉｎｅＬｅａｒｎｉｎｇ｝”，
ＭｏｒｇａｎＫａｕｆｍａｎｎ（１９９３）．
【０００７】
【非特許文献４】
［Ｐａｗｌａｋ−９１］Ｐａｗｌａｋ，Ｚ．
”ＲｏｕｇｈＳｅｔｓ，ＴｈｅｏｒｅｔｉｃａｌＡｓｐｅｃｔｓｏｆＲｅａｓｏｎｉｎｇａｂｏｕｔＤａｔａ”，
ＫｌｕｗｅｒＡｃａｄｅｍｉｃＰｕｂｌｉｓｈｅｒｓ（１９９１）．
【０００８】
表１を例に、基本的概念を示す。ｕ１，ｕ２，…ｕ６が示す各行を実例と呼ぶ。頭痛、体温、筋肉痛とインフルエンザは属性と呼ばれる。表１においては、頭痛、体温、筋肉痛の有無などに応じて、インフルエンザか否かを判断するものである。判断結果にかかる属性（インフルエンザ）のことを決定属性といい、決定属性に条件を与える属性（頭痛／体温／筋肉痛）のことを条件属性という。各欄に現れるもの（例えば、頭痛については、ｙｅｓ／ｎｏ、体温についてはｎｏｒｍａｌ／ｈｉｇｈ／ｖｅｒｙ−ｈｉｇｈ）をそれぞれの属性の値という。
【０００９】
【表１】

【００１０】
「ボトムアップ法」
ボトムアップ法は遂次（ｉｎｃｒｅｍｅｎｔａｌ）学習法である。すなわち、実例が同時に入力される時だけでなく一つずつ与えられる時も概念の学習は可能である。ボトムアップ法は、効果的にデータ変更を取り扱うことができる、しかし、計算時間が長くなる。
ボトムアップ法では、各学習サイクルで１つの新しい実例をとって、その実例を使って学習された概念を修正する。全ての実例の処理が終るまで繰り返す。
代表的なボトムアップ法の一つは、離散的方法としてＭｉｔｃｈｅｌｌによって提案されたバージョン空間（ｖｅｒｓｉｏｎ−ｓｐａｃｅ）法である。バージョン空間法は状態空間の構造に依存した探索手法である。新たに正実例（インフルエンザの値がｙｅｓの実例）が与えられると、それから得られる概念、あるいはそれよりも一般的な概念が、求める概念になりうる。負実例（インフルエンザの値がｎｏの実例）が与えられると、それに対応する概念よりも特殊な要素だけが、求める概念になりうる。与えられた実例集合に対する概念はたくさんありうる。これらの概念のうち、目標とする構造のものを見出すには、問題解決における発見的探索を用いることができる。正実例から求められた一般的な概念と、負実例から求められた特殊な概念からそれぞれ出発して、遂次呈示された正、負実例により、可能な探索領域を狭めていく。求められた一般的な概念をルールと呼ぶ。
【００１１】
表１のインフルエンザの例を使って、バージョン空間法を説明する。
いま、ある対象ｕの属性を表す四つの述語
頭痛（Ｕ，Ｘ），Ｘ ∈｛ｙｅｓ，ｎｏ｝
体温（Ｕ，Ｙ），Ｙ ∈｛ｎｏｒｍａｌ，ｈｉｇｈ，ｖｅｒｙ−ｈｉｇｈ｝
筋肉痛（Ｕ，Ｚ），Ｚ ∈｛ｙｅｓ，ｎｏ｝
インフルエンザ（Ｕ，Ｄ），Ｄ｛ｙｅｓ，ｎｏ｝
がある。
【００１２】
これらの述語の集合で表される概念
ｕ１＝｛頭痛（Ｕ，ｙｅｓ），体温（Ｕ，ｎｏｒｍａｌ），筋肉痛（Ｕ，ｙｅｓ），インフルエンザ（Ｕ，ｎｏ）｝
ｕ２＝｛頭痛（Ｕ，ｙｅｓ），体温（Ｕ，ｈｉｇｈ），筋肉痛（Ｕ，ｙｅｓ），インフルエンザ（Ｕ，ｙｅｓ）｝
ｕ３＝｛頭痛（Ｕ，ｙｅｓ），体温（Ｕ，ｖｅｒｙ−ｈｉｇｈ），筋肉痛（Ｕ，ｙｅｓ），インフルエンザ（Ｕ，ｙｅｓ）｝
ｕ４＝｛頭痛（Ｕ，ｎｏ），体温（Ｕ，ｎｏｒｍａｌ），筋肉痛（Ｕ，ｙｅｓ），インフルエンザ（Ｕ，ｎｏ）｝
ｕ５＝｛頭痛（Ｕ，ｎｏ），体温（Ｕ，ｈｉｇｈ），筋肉痛（Ｕ，ｎｏ），インフルエンザ（Ｕ，ｎｏ）｝
ｕ６＝｛頭痛（Ｕ，ｎｏ），体温（Ｕ，ｖｅｒｙ−ｈｉｇｈ），筋肉痛（Ｕ，ｙｅｓ），インフルエンザ（Ｕ，ｙｅｓ）｝
を考える。
【００１３】
ここで、インフルエンザという属性の値がｙｅｓ，ｎｏとなるルールを求めるので、このインフルエンザを決定属性と呼ぶ。このとき、（頭痛、体温、筋肉痛）により構成されたバージョン空間は図１のようなグラフで表すことができる。グラフの枝は代入操作を表し、一番上の節点は最も「特殊な」概念、一番下の節点は最も「一般的な」概念を表している。
実例ｕ１よりもっと一般的な概念は、図１の下線つきの節点である。ｕ１とｕ２は、「頭痛」、「筋肉痛」という二つの属性で値が一致しているので、これらの共通的な一般化は（ｙｅｓ，Ｙ，ｙｅｓ）である。ここのＹは変数で、｛ｎｏｒｍａｌ，ｈｉｇｈ，ｖｅｒｙ−ｈｉｇｈ｝のなかのどの値でもよい。ｕ１とｕ２の決定属性（インフルエンザ）にかかる値は、ｕ１がｎｏであり、ｕ２がｙｅｓであるという点で異なっているので、（ｙｅｓ，Ｙ，ｙｅｓ）はルールとして使えないということである。
【００１４】
同様に、実例ｕ３とｕ６とについて、共通の一般化を行うと、（Ｘ，ｖｅｒｙ−ｈｉｇｈ，ｙｅｓ）となる。ｕ３とｕ６の決定属性の値（決定クラス）はどちらもｙｅｓであり、同じである。従って、（Ｘ，ｖｅｒｙ−ｈｉｇｈ，ｙｅｓ）という一般化概念は、インフルエンザ＝ｙｅｓを導くルールとして、一つの候補になる。すべての実例を学習し終わった時点で、矛盾が発生しなければ、この一般化概念はルールとして抽出される。
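[0015]
"Top-down method (or batch learning)"
A top-down method analyzes all of the data at once and derives concepts from the data as a whole. C4.5 and the rough set method are typical examples.
C4.5 builds a decision tree from the top down and discovers rules by pruning its branches. Construction starts with the choice of the root node: each attribute is tried in turn as the root, and the attribute judged most suitable is adopted. Once the root node is fixed, the instances are partitioned by that node. Each resulting group is then treated as a subtree, and the same procedure is repeated.
[0016]
Using the influenza example above, FIG. 2 illustrates the tree built by C4.5. First, evaluation of the three attributes rates "body temperature" as the most suitable, so it becomes the root node. The appropriate split on body temperature is into the two groups body temperature = {normal} and body temperature = {high, very-high}, and the instances are divided accordingly. Next, within the group body temperature = {high, very-high}, "muscle pain" is rated the most suitable attribute. As a result, the following rules are discovered:
body temperature (normal) → influenza (no)
body temperature (high, very-high) ∧ muscle pain (yes) → influenza (yes)
body temperature (high, very-high) ∧ muscle pain (no) → influenza (no)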
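[0017]
Another top-down method uses rough sets. Rough sets were first proposed by the Polish computer scientist Zdzislaw Pawlak [Pawlak-82].
[Non-Patent Document 5]
[Pawlak-82] Pawlak, Z.
"Rough Sets",
International Journal of Computer and Information Sciences, Vol. 11 (1982) 341-356.
Rough set theory extends set theory so that it can handle uncertain and incomplete data [Pawlak-91, Skowron-92]. Given a concept, the knowledge that can be obtained by measurement lets us grasp the extent of that concept only through lower and upper approximations. Rough set theory is the theoretical study of such approximations.
[Non-Patent Document 6]
[Skowron-92] Skowron, A. and Rauszer, C.
"The discernibility matrices and functions in information systems",
R. Slowinski (ed.) Intelligent Decision Support, Kluwer Academic Publishers (1992) 331-362.
The concepts put forward in rough set theory fall broadly into two groups: (1) approximate representations of data sets in a database, and (2) methods for finding reducts, that is, minimal sets of attributes needed for such an approximate representation.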
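[0018]
In the rough set methodology for discovering rules from data, a database is defined as a decision table
T = (U, At, {Va | a ∈ At}, {Ia | a ∈ At}, C, D)
where U is a finite set of instances, At is the finite set of all attributes, Va is the set of values of each a ∈ At, and Ia is a mapping from U × At to {Va | a ∈ At}. C and D are two subsets of At: C is called the set of condition attributes and D the set of decision attributes, with C ∪ D = At and C ∩ D = ∅. Grouping together the instances whose values agree on every attribute in C partitions U into a number of sets. These are called the equivalence classes induced by C and are written U/C; U/D is defined likewise. U/C and U/D are called the condition classes and the decision classes, respectively.
[0019]
Each row of a decision table represents one instance of the finite set, each column represents an attribute, and each cell holds the value of the corresponding attribute for that instance. For the influenza data above, for example:
Set of instances U = {u_1, u_2, u_3, u_4, u_5, u_6}
Set of attributes A_t = {headache, body temperature, muscle pain, influenza}
V_headache = {yes, no}, V_body temperature = {normal, high, very-high}, V_muscle pain = {yes, no}, V_influenza = {yes, no}
I_{u1, headache} = yes, I_{u1, muscle pain} = yes, ..., I_{u6, influenza} = yes
Condition attributes C = {headache, body temperature, muscle pain}
Decision attributes D = {influenza}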
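[0020]
Upper and lower approximations are a central part of the approximation concept in rough sets. The lower approximation $\underline{R}X$ is the set of instances that can be classified with certainty as elements of X using the relation R. The upper approximation $\overline{R}X$ is the set of instances that can possibly be classified as elements of X using R. Rule discovery amounts to finding lower approximations.
To illustrate upper and lower approximations, consider the data formed by headache and muscle pain alone, as shown in Table 2:
[0021]
[Table 2]

[0022]
Relation R = {headache, muscle pain}
Partition by R: U/R = {{u_5}, {u_1, u_2, u_3}, {u_4, u_6}}
Set of instances whose decision attribute influenza is 'no': X_1 = {u_1, u_4, u_5}
Lower approximation, the set of instances classified with certainty as elements of X_1 using R: $\underline{R}X_1$ = {u_5}
Upper approximation, the set of instances possibly classified as elements of X_1 using R: $\overline{R}X_1$ = {u_1, u_2, u_3, u_4, u_5, u_6}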
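The approximations above can be checked mechanically. The following is a minimal Python sketch, not part of the patent; the function names `partition` and `approximations` and the dictionary layout are illustrative assumptions, while the attribute values are those of instances u1 to u6 given earlier.

```python
from collections import defaultdict

# Influenza data restricted to R = {headache, muscle pain}, plus the decision value.
rows = {
    "u1": {"headache": "yes", "muscle_pain": "yes", "influenza": "no"},
    "u2": {"headache": "yes", "muscle_pain": "yes", "influenza": "yes"},
    "u3": {"headache": "yes", "muscle_pain": "yes", "influenza": "yes"},
    "u4": {"headache": "no", "muscle_pain": "yes", "influenza": "no"},
    "u5": {"headache": "no", "muscle_pain": "no", "influenza": "no"},
    "u6": {"headache": "no", "muscle_pain": "yes", "influenza": "yes"},
}

def partition(rows, attrs):
    """Group instances whose values agree on every attribute in attrs (U/R)."""
    blocks = defaultdict(set)
    for name, row in rows.items():
        blocks[tuple(row[a] for a in attrs)].add(name)
    return list(blocks.values())

def approximations(rows, attrs, target):
    """Return (lower, upper) approximations of the set `target` under attrs."""
    lower, upper = set(), set()
    for block in partition(rows, attrs):
        if block <= target:      # block lies entirely inside the target set
            lower |= block
        if block & target:       # block overlaps the target set
            upper |= block
    return lower, upper

R = ["headache", "muscle_pain"]
X1 = {u for u, r in rows.items() if r["influenza"] == "no"}
lower, upper = approximations(rows, R, X1)
```

Only the block {u5} lies entirely inside X_1, which is why the lower approximation is so small for this choice of R.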
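[0023]
This relationship is depicted in FIG. 3. With relation R, only u_5 can be classified correctly.
A reduct is a minimal set of attributes that still suffices to classify the instances into the decision classes. In the influenza example, even if the attribute headache is deleted, each instance is classified as before, as Table 3 shows.
[0024]
[Table 3]

[0025]
Here, body temperature and muscle pain cannot be deleted. Looking at u2 and u5, for example, the value of body temperature alone does not uniquely determine the value of the decision attribute influenza. The same holds for muscle pain. Body temperature and muscle pain therefore form a reduct.
Reduct 1 = {muscle pain, body temperature}.
Similarly, deleting the attribute muscle pain does not affect the classification, as Table 4 shows.
[0026]
[Table 4]

[0027]
However, if headache or body temperature were further deleted from Table 4, the decision attribute influenza could no longer be determined uniquely, so
Reduct 2 = {headache, body temperature}.
The attributes contained in every reduct are called the CORE. In this example,
CORE = {headache, body temperature} ∩ {body temperature, muscle pain} = {body temperature}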
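The discernibility matrix proposed by Skowron [Skowron-92, Pawlak-91] makes it easy to derive the reducts of a decision table. Let U = {u_1, u_2, u_3, ..., u_n}. The discernibility matrix M(T) is an n × n matrix whose entry m_ij is the set of attributes needed to separate instances u_i and u_j into different decision classes. Because M(T) is symmetric and m_ii = ∅, M(T) can be represented by a triangular matrix, that is, by the entries m_ij with 1 ≤ i < j ≤ n. Let ∨m_ij denote the disjunction of all attributes c ∈ m_ij. Taken over all u_i ∈ U, the reducts are generated by the discernibility function f_T (Math 1):
[0028]
[Math 1]
$$ f_T = \bigwedge_{1 \le i < j \le n} \bigvee m_{ij} $$
where ∨m_ij = False if m_ij = ∅.
[0029]
Each conjunction in the simplified disjunctive normal form of f_T is called a reduct.
Using the influenza example above, the process of finding reducts with the discernibility matrix is explained with reference to Table 5.
[0030]
[Table 5]

[0031]
First, we analyze which attributes matter for distinguishing u1 from the other instances. u1 and u2 differ only in body temperature, so body temperature is what distinguishes u1 from u2. Body temperature likewise distinguishes u1 from u3. u1 and u4, and u1 and u5, belong to the same decision class and need not be distinguished. u1 and u6 are distinguished by headache and body temperature. This process consists of the following two steps.
(1) Take two instances whose decision classes (influenza) differ.
(2) Extract the attributes whose values differ between the two instances.
Doing the same for u2 through u6 yields the discernibility matrix of Table 6:
[0032]
[Table 6]

[0033]
Taking the conjunction of the entries gives
f_T = body temperature ∧ (headache ∨ body temperature) ∧ (headache ∨ muscle pain) ∧ (headache ∨ body temperature ∨ muscle pain) ∧ (body temperature ∨ muscle pain) = (body temperature ∧ headache) ∨ (body temperature ∧ muscle pain).
Thus the two reducts {body temperature, headache} and {body temperature, muscle pain} are obtained.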
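The two-step procedure can be sketched in Python as follows. This is an illustrative brute-force version, not the patent's implementation: it enumerates attribute subsets and keeps the minimal ones that intersect every discernibility entry, which corresponds to the terms of the simplified disjunctive normal form of f_T. The names `discernibility_entries` and `reducts` are assumptions, and the approach is only practical for a handful of attributes.

```python
from itertools import combinations

ATTRS = ["headache", "temperature", "muscle_pain"]
# (headache, body temperature, muscle pain, influenza) for u1..u6
TABLE = {
    "u1": ("yes", "normal", "yes", "no"),
    "u2": ("yes", "high", "yes", "yes"),
    "u3": ("yes", "very-high", "yes", "yes"),
    "u4": ("no", "normal", "yes", "no"),
    "u5": ("no", "high", "no", "no"),
    "u6": ("no", "very-high", "yes", "yes"),
}

def discernibility_entries(table):
    """m_ij for every pair of instances in different decision classes."""
    entries = {}
    for (i, x), (j, y) in combinations(table.items(), 2):
        if x[-1] != y[-1]:  # step 1: decision values differ
            # step 2: condition attributes whose values differ
            entries[(i, j)] = frozenset(a for a, v, w in zip(ATTRS, x, y) if v != w)
    return entries

def reducts(entries):
    """Minimal attribute sets that intersect every entry m_ij."""
    found = []
    for size in range(1, len(ATTRS) + 1):
        for cand in combinations(ATTRS, size):
            s = set(cand)
            if all(s & m for m in entries.values()) and not any(r <= s for r in found):
                found.append(s)
    return found

M = discernibility_entries(TABLE)
REDUCTS = reducts(M)
CORE = set.intersection(*REDUCTS)
```

For this table the sketch yields the same two reducts and the same CORE as the derivation above.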
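Research on knowledge discovery and data mining (KDD), that is, on discovering knowledge such as rules and causal relationships from databases, is still young. The methods described above give perfect results on idealized experimental data but cannot cope with real-world data.
Real-world data has the following characteristics.
1. Large scale
2. Incompleteness (missing data, incomplete data)
3. Ambiguity (noisy data)
4. Dynamics (data being added and removed)
5. Diversity (a mixture of continuous and discrete data)
6. Use of background knowledge
As Table 7 shows, conventional methods address the problems of real-world data only partially.
[0034]
[Table 7]

[0035]
The main problems of the version space method are:
1. For problems beyond a certain size, the amount of computation becomes excessive.
2. It does not work well on noisy data, that is, data containing errors or missing values.
As the version space diagram in the preceding section shows, the size of the version space is determined by the number of attributes and the number of distinct values of each attribute. As either increases, the version space grows according to the expression below (Math 2), and so does the amount of computation.
[Math 2]

(m is the number of attributes and n_i is the number of distinct values of attribute i)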
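[0036]
Conventional back-propagation also has the following problems:
1. Because the meaning of the network with respect to the problem being solved is not explicit, it is difficult to explain in an understandable form how a given decision or output was reached.
2. Background knowledge is not incorporated into the network.
3. Training takes a long time to reach satisfactory results, because this kind of network learning fundamentally uses the training data set repeatedly.
Consequently, how to encode or explicitly represent knowledge in a network, and how to extract rules from a trained network, remain problems for this method.
[0037]
The rough set method is fast, but it cannot learn concepts when instances are presented sequentially. Data sets are rarely free of contradictions, and the rough set method removes such contradictions before computing reducts. Given that real databases contain many probabilistic elements, this treatment is unrealistic.
A major problem with this method is that the discernibility matrix grows as the number of attributes increases. The number of attribute combinations for discerning each instance grows, and so does the number of reducts, yet there is no criterion for judging which reduct is the most appropriate.
[0038]
The birth of rough set theory attracted great interest from many researchers, and classification learning algorithms based on rough sets continue to be developed. However, real-world classification problems go beyond the basic concepts of rough sets, so techniques such as probability and statistics must be added.
In general, bottom-up methods can learn concepts not only when instances are presented all at once but also when they are presented sequentially; however, they are slow. Top-down methods learn quickly and need little computation time; however, they cannot learn concepts when instances are presented sequentially or when the data changes, cannot use background knowledge during learning, and cannot run in a parallel-distributed mode.
None of the above methods offers a satisfactory solution for real-world applications, for example noise handling, bias adjustment for controlling the search process, and the use of background knowledge.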
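[0039]
[Problem to be solved by the invention]
An object of the present invention is to combine logic, sets, and probability so as to solve the problems of the prior art and to develop a probabilistic rough set inductive learning system applicable to real-world data.
A further object is to provide a data mining method that can discover rules even from large-scale, diverse, noisy, or incomplete data and that is effective for real-world problems.
[0040]
[Means for solving the problem]
Previous knowledge discovery methods were each developed on the basis of only one of logic, set theory, or probability theory, and their range of application was limited.
The method according to the present invention (the GDT-RS method) and the system implementing it (GDT-RS) combine, as shown in FIG. 4, a generalization distribution table (GDT) with rough set (RS) theory, and span all of logic, set theory, and probability theory. The GDT is used to build a probabilistic criterion for evaluating the incompleteness and ambiguity of the data and the strength of discovered rules. Rough set theory is used to extract rules and to select, from among the strong rules, a minimal rule set covering all instances.
The strength of a rule discovered by the GDT-RS method explicitly reflects the influence of noise and of unknown (unobserved) data. The GDT-RS method is also flexible with respect to adjusting the learning bias and can make use of background knowledge. Because the GDT-RS method can also be applied to the problem of ranking instances, the present invention additionally proposes an ordered information table.
[0041]
1. Overview of the GDT-RS method
The GDT-RS method aims to discover rules from noisy and incomplete data. It consists of the generalization distribution table (GDT) and rough set (RS) theory. The following sections introduce the basic concepts needed to understand the GDT-RS method, such as the generalization distribution table (GDT) and the ordered information table, and then describe the features of the method. Rough set theory, the other basic concept, has already been described.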
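[0042]
2. Generalization distribution table (GDT)
At the core of the GDT-RS method is a variant of a transition matrix, the generalization distribution table (GDT), which provides a probabilistic criterion for evaluating the incompleteness and ambiguity of the data and the strength of discovered rules.
A GDT represents the probabilistic relationship between instances and concepts. It consists of three components: the "possible instances", the "possible generalizations", and the "probabilistic relationships" between possible instances and possible generalizations. Table 9 shows an example of a GDT obtained from Table 8.
[0043]
[Table 8]

[0044]
[Table 9]

[0045]
Here the top row, a0b0c0, a0b0c1, ..., lists all possible combinations of attribute values. Attributes a and c each take 2 values and attribute b takes 3, so there are 2 × 3 × 2 = 12 combinations. The entries in the leftmost column, {*b0c0}, {*b0c1}, ..., are called possible generalizations. {*b0c0} generalizes over a (indicated by *) and denotes the set of instances having b0c0, namely {a0b0c0, a1b0c0}. The probabilistic relationship gives the probability of each possible instance within a possible generalization.
In Table 9, each GDT entry G_ij is the "probabilistic relationship" expressing the strength of the relation between the corresponding possible instance and possible generalization. When no background knowledge is used, the prior distribution is uniform and G_ij is defined by Equation 1.
[0046]
[Math 3]
$$ G_{ij} = p(PI_j \mid PG_i) = \begin{cases} 1 / N_{PG_i} & \text{if } PI_j \in PG_i \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Equation 1)} $$
[0047]
Here PI_j is the possible instance in column j, PG_i is the possible generalization in row i, and N_{PG_i} is the number of possible instances contained in the generalization of row i.
The sum of the probabilities in row i is given by:
[Math 4]
$$ \sum_{j} G_{ij} = 1 $$
Only the condition attributes are used when building a GDT. A generalization formed from condition attributes serves as the condition part of a rule, and the decision attribute serves as its conclusion part.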
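As a rough illustration of how such a table could be built for the a/b/c example (2, 3, and 2 values), the Python sketch below enumerates the possible instances and generalizations and fills each row with a uniform prior. The names `DOMAINS`, `possible_generalizations`, and `gdt` are hypothetical, and treating a "possible generalization" as any pattern with at least one wildcard and at least one fixed value is an assumption of this sketch rather than something the text states.

```python
from itertools import product
from fractions import Fraction

# Attribute domains taken from the description: a and c have 2 values, b has 3.
DOMAINS = {"a": ["a0", "a1"], "b": ["b0", "b1", "b2"], "c": ["c0", "c1"]}
WILDCARD = "*"

def possible_instances(domains):
    """All combinations of attribute values (the columns of the GDT)."""
    return list(product(*domains.values()))

def possible_generalizations(domains):
    """ASSUMPTION: patterns with at least one wildcard and one fixed value."""
    options = [vals + [WILDCARD] for vals in domains.values()]
    return [g for g in product(*options)
            if WILDCARD in g and any(v != WILDCARD for v in g)]

def matches(generalization, instance):
    """True if the instance agrees with every fixed value of the generalization."""
    return all(g in (WILDCARD, v) for g, v in zip(generalization, instance))

def gdt(domains):
    """G[pg][pi] = 1/N_PG if pi is covered by pg, else 0 (uniform prior)."""
    instances = possible_instances(domains)
    table = {}
    for pg in possible_generalizations(domains):
        covered = [pi for pi in instances if matches(pg, pi)]
        table[pg] = {pi: Fraction(1, len(covered)) if pi in covered else Fraction(0)
                     for pi in instances}
    return table

G = gdt(DOMAINS)
```

For instance, `G[("*", "b0", "c0")]` assigns probability 1/2 to each of a0b0c0 and a1b0c0 and 0 elsewhere, so every row sums to 1.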
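[0048]
3. Adjusting the prior distribution with background knowledge
An important feature of the present invention is that a bias can be selected and used to control the rule discovery process. This section explains how using background knowledge as a bias controls the construction of the GDT and the rule discovery process. When no background knowledge is used as a bias, by default all possible instances related to a generalization are equally probable, and the prior distribution is uniform, as given by Equation 1. With background knowledge, however, the prior distribution can be adjusted, and it is then no longer uniform.
In general, background knowledge is defined in the following form:
a_{i1 j1} ⇒ a_{i2 j2}, Q
where a_{i1 j1} is the j1-th value of attribute i1 and a_{i2 j2} is the j2-th value of attribute i2. a_{i1 j1}, a_{i2 j2}, and Q are called the "premise", "conclusion", and "strength" of the background knowledge, respectively. The background knowledge means that, given a_{i1 j1}, the probability that a_{i2 j2} occurs is Q. Q = 0 means "a_{i1 j1} and a_{i2 j2} never occur together", and Q = 1 means "a_{i1 j1} and a_{i2 j2} always occur together". Q = 1/n_{i2} equals the probability of a_{i2 j2} when no background knowledge is used, where n_{i2} is the number of values of attribute i2. With the background knowledge "a_{i1 j1} ⇒ a_{i2 j2}, Q", the probability of a_{i2 j2} given a_{i1 j1} changes from the uniform 1/n_{i2} to Q, so each of the other values of attribute i2 occurs with probability (1 − Q)/(n_{i2} − 1).
[0049]
For example, introducing the background knowledge
"a0 => c1, 100%"
that is, "whenever a0 occurs, c1 always occurs", changes Table 9 into Table 10 below. Changed probabilities are marked with an overbar.
[0050]
[Table 10]

Within the generalization a0b0*, the instance a0b0c0 had probability 1/2 under the uniform distribution without background knowledge, but it cannot occur under this background knowledge, so its probability becomes 0. Conversely, a0b0c1 always occurs under this background knowledge, so its probability becomes 1.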
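The redistribution rule, Q for the conclusion value and (1 − Q)/(n − 1) for each other value, can be written as a small Python helper. This is a sketch under the assumption that a single piece of background knowledge applies to the generalized attribute; the function name `adjusted_distribution` is hypothetical.

```python
from fractions import Fraction

def adjusted_distribution(values, conclusion, q):
    """Probabilities of one attribute's values once the premise of
    'premise => conclusion, Q' holds: Q for the conclusion value and
    (1 - Q) / (n - 1) for each of the other values."""
    q = Fraction(q)
    n = len(values)
    other = (1 - q) / (n - 1) if n > 1 else Fraction(0)
    return {v: (q if v == conclusion else other) for v in values}

C_VALUES = ["c0", "c1"]
# Q = 1/n reproduces the uniform prior used when no background knowledge is given.
uniform = adjusted_distribution(C_VALUES, "c1", Fraction(1, 2))
# "a0 => c1, 100%": within the generalization a0b0*, c0 can no longer occur.
with_knowledge = adjusted_distribution(C_VALUES, "c1", 1)
```

With Q = 1 the generalization a0b0* gives a0b0c0 probability 0 and a0b0c1 probability 1, matching the change described for Table 10.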
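[0051]
4. Rule strength and unknown data
A rule discovered by the present invention is written as
X → Y with S,
that is, "if X then Y, with strength S". Here X is the condition that the concept must satisfy (a combination of generalized condition attributes), Y is the concept determined by X (a value of the decision attribute), and S is the strength of the rule. In other words, X is a combination of condition attribute values such as {a0b0*}, {a0*c0}, ..., and Y is a decision attribute value or a combination of such values. X, Y, and S are called the condition, conclusion, and strength of the rule, respectively.
The strength S of a rule X → Y is defined as follows:
S(X → Y) = s(X) × (1 − r(X → Y))   (Equation 2)
Equation 2 is the product of the factor s(X) and the factor (1 − r(X → Y)).
The former, s(X), indicates the stability of the rule. Suppose, for example, that the rule {a0*c0} → d1 has been discovered, stating that decision class d1 results under the condition {a0*c0}. A rule obtained from a data population in which only 20% of all combinations satisfying the generalization {a0*c0} have been observed differs in stability from one obtained from a population in which 90% have been observed: in the former case, the remaining 80% of the data could overturn the rule. The proportion of not-yet-collected data (instances) among the possible instances of a generalization can therefore serve as an indicator of rule stability.
[0052]
The latter, (1 − r(X → Y)), indicates the reliability of the rule. Suppose again that the rule {a0*c0} → d1 has been discovered under the condition {a0*c0}. The reliability of the rule differs between the case where 100% of the instances satisfying {a0*c0} belong to d1 and the case where 80% belong to d1 but 20% belong to decision class d2. As described below, the notion of a noise rate r(X → Y) is therefore introduced; in the case where 80% belong to d1 and 20% to d2, the noise rate is 20%.
Having described s(X) and (1 − r(X → Y)) qualitatively, we now explain them in more detail using formulas.
s(X) is the strength of the generalization X (the condition part of the rule). With X = PG_i, the strength s(PG_i) is defined as the sum of the probabilities of the instances contained in PG_i that exist in the database:
[Math 5]
$$ s(PG_i) = \sum_{j:\ PI_j \text{ in the database}} G_{ij} = \frac{N_{ins\text{-}rel}(PG_i)}{N_{PG_i}} \quad \text{(the last equality holds for the uniform prior)} $$
Here G_ij is the probability, under the generalization PG_i, of an instance PI_j obtained from the database, and N_{ins-rel}(PG_i) is the number of instances collected in the database that are contained in PG_i. The value of s(PG_i) changes as actual data are entered. If every possible instance contained in PG_i is present in the database, the strength of PG_i reaches its maximum value of 1.
[0053]
The strength of a generalization reflects the influence of unknown data, that is, data not collected in the database. In general a database does not contain every possible instance: some instances can never occur, and data collection is limited in scope and time. Discovering rules while ignoring unknown data risks producing rules of low reliability. Based on the GDT, the GDT-RS method provides a way to evaluate the influence of unknown data and states the strength of each rule explicitly. The strength of a rule is large when there is little unknown data and small otherwise.
[0054]
r(X → Y) is the noise rate of the rule X → Y. It expresses the incompleteness and uncertainty of the rule caused by unknown and noisy data. The noise rate r is defined as follows:
[Math 6]
$$ r(X \to Y) = \frac{N_{ins\text{-}rel}(X) - N_{ins\text{-}class}(X, Y)}{N_{ins\text{-}rel}(X)} $$
Here N_{ins-rel}(X) is the number of instances belonging to the generalization X, and N_{ins-class}(X, Y) is the number of those instances that belong to decision class Y. 1 − r(X → Y) is the classification accuracy, that is, the probability that an instance belonging to generalization X belongs to class Y.
In software implementing the GDT-RS method, the user can specify the acceptable noise rate as a threshold. Rules whose noise rate exceeds the threshold are deleted automatically.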
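Equation 2 can be sketched in Python as below. The sketch assumes the uniform prior (so that s(X) is simply observed instances over possible instances), and the counts in the example are hypothetical values chosen only to exercise the formula; the function names are illustrative.

```python
from fractions import Fraction

def generalization_strength(n_observed, n_possible):
    """s(X) under the uniform prior: observed instances of X over possible instances of X."""
    return Fraction(n_observed, n_possible)

def noise_rate(n_observed, n_in_class):
    """r(X -> Y): share of the observed instances of X that are not in class Y."""
    return Fraction(n_observed - n_in_class, n_observed)

def rule_strength(n_possible, n_observed, n_in_class):
    """S(X -> Y) = s(X) * (1 - r(X -> Y))."""
    s = generalization_strength(n_observed, n_possible)
    return s * (1 - noise_rate(n_observed, n_in_class))

# Hypothetical counts: 10 possible instances, 5 observed, 4 of them in class Y.
example = rule_strength(n_possible=10, n_observed=5, n_in_class=4)  # 1/2 * 4/5
```

A rule whose generalization is fully observed and noise-free reaches the maximum strength of 1, while unobserved instances and conflicting decisions each pull the value down independently.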
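[0055]
5. Ordered information table
Many problems encountered with real-world data are not simple classification problems. Ranking instances is one such type of problem. For example, finding the relationship between the sales rank of products and their warranty period and price is an ordering problem. For such problems, the aim is to discover ordering rules of the form "if the value of an attribute a of instance x is ranked ahead of the value of the same attribute of instance y, then x is ranked ahead of y".
The ranking of instances by the values of an attribute is not necessarily the same as the overall ranking of the instances. It is important to know which attributes play a more important role in determining the overall ranking and which do not contribute to it.
A decision table does not take into account the semantic relationships among the different values of each attribute. To discover ordering rules, however, "ordering information" expressing the meaning of the data is needed. An ordered decision table is therefore created by incorporating this semantic information; it can be regarded as a decision table with semantic relationships added.
[0056]
The ordering of attribute values is not itself a step of the data mining performed by the GDT-RS method. Rather, it is needed when the information is preprocessed, using the ordered attribute values, before the GDT-RS method is applied.
How this preprocessing is applied to GDT-RS is described later. Here only the concept of the ordered decision table is explained.
Definition 1. Let > be a binary relation on a set U. The relation > is called a "weak order" if it satisfies the following two properties.
[Math 7]
$$ x > y \Rightarrow \neg (y > x), \qquad \neg (x > y) \wedge \neg (y > z) \Rightarrow \neg (x > z) $$
where ¬ denotes the negation of a proposition. A "weak order" defines the following equivalence relation:
[Math 8]
$$ x \sim y \iff \neg (x > y) \wedge \neg (y > x) $$
For two elements x and y, if x ~ y we say that x and y are indiscernible by >. The equivalence relation ~ induces a partition U/~ of U, and an order relation >* on U/~ is defined as follows:
[Math 9]
$$ [x]_{\sim} >^{*} [y]_{\sim} \iff x > y $$
where [x]~ is the equivalence class containing x. Moreover, >* is a linear order: any two distinct equivalence classes of U/~ can be compared.
[0057]
Definition 2. An ordered decision table is a decision table augmented with order relations.
The ordering of the values of an attribute a naturally induces an ordering of the instances; that is, for elements x, y of U we define:
[Math 10]
$$ x >_{\{a\}} y \iff I_a(x) >_a I_a(y) $$
where >_{a} denotes the order relation on U induced by attribute a. An instance x is ranked ahead of y if and only if the value of x for attribute a is ranked ahead of the value of y for the same attribute. The relation >_{a} has exactly the same properties as >_a. For a subset A of the attributes A_t, the following definition applies:
[Math 11]
$$ x >_{A} y \iff \forall a \in A:\ I_a(x) >_a I_a(y) $$
That is, x is ranked ahead of y if and only if x is ranked ahead of y by every attribute in A.
Many real-world applications require a special attribute, the decision attribute. The ranking of instances by the decision attribute is denoted by '>' and is called the overall ranking of the instances.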
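[0058]
Example: Table 11 is an ordered decision table for products made by five manufacturers. a, b, c, d, and o denote the product's size, warranty period, price, weight, and overall ranking, respectively. >_a, >_b, >_c, >_d, and >_o give the ordering information of the respective attributes.
[0059]
[Table 11]

[0060]
Based on the ordering information of the attribute values, the following orderings of the products are obtained:
>*_{a}: [p3, p4, p5] >*_{a} [p1] >*_{a} [p2],
>*_{b}: [p1, p2, p3, p4] >*_{b} [p5],
>*_{c}: [p1, p5] >*_{c} [p4] >*_{c} [p2, p3],
>*_{d}: [p4, p5] >*_{d} [p3] >*_{d} [p1] >*_{d} [p2],
>*_{o}: [p1] >*_{o} [p4] >*_{o} [p2, p3, p5]
For {a, b} and {c, d}, the following orderings are obtained:
>_{a,b}: ∅
>_{c,d}: p1 >_{c,d} p2, p4 >_{c,d} p2, p5 >_{c,d} p2, p4 >_{c,d} p3, p5 >_{c,d} p3.
Combining attributes a and b places all instances in the same class. In this example, >_{c,d} is not a weak order; that is, the combination of two weak orders does not necessarily produce a weak order. This suggests that useful rules cannot be obtained using the ordering x >_A y.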
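The combined orderings can be reproduced from the per-attribute rank classes listed above. The following Python sketch is illustrative only; `rank`, `precedes`, `ordering`, and `is_negatively_transitive` are hypothetical helper names, and the weak-order check covers only the negative-transitivity property, which is the one that fails for {c, d}.

```python
# Rank classes copied from the orderings listed above (index 0 = ranked first).
ORDER = {
    "a": [["p3", "p4", "p5"], ["p1"], ["p2"]],
    "b": [["p1", "p2", "p3", "p4"], ["p5"]],
    "c": [["p1", "p5"], ["p4"], ["p2", "p3"]],
    "d": [["p4", "p5"], ["p3"], ["p1"], ["p2"]],
}
PRODUCTS = ["p1", "p2", "p3", "p4", "p5"]

def rank(attr, p):
    """Index of the rank class of product p under attribute attr."""
    return next(i for i, cls in enumerate(ORDER[attr]) if p in cls)

def precedes(attrs, x, y):
    """x >_A y: x is ranked strictly ahead of y by every attribute in attrs."""
    return all(rank(a, x) < rank(a, y) for a in attrs)

def ordering(attrs):
    """All ordered pairs (x, y) with x >_A y."""
    return {(x, y) for x in PRODUCTS for y in PRODUCTS
            if x != y and precedes(attrs, x, y)}

def is_negatively_transitive(pairs):
    """not(x > y) and not(y > z) must imply not(x > z)."""
    return all((x, y) in pairs or (y, z) in pairs
               for (x, z) in pairs for y in PRODUCTS)

AB = ordering(["a", "b"])
CD = ordering(["c", "d"])
```

Here `AB` is empty and `CD` contains exactly the five pairs listed above. The pair (p1, p2) together with p3 violates negative transitivity, since neither p1 >_{c,d} p3 nor p3 >_{c,d} p2 holds.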
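[0061]
The present invention combines the above findings with rough set theory. Specifically, it is a data processing method for discovering, from a plurality of instances, a rule in the form of a combination of m condition attribute values having a substantial causal relationship with one decision attribute value, the method comprising: a step of extracting, for two instances whose decision attribute values differ, the condition attributes whose values differ; and a step of performing logical operations on the extracted condition attributes to derive a rule. Here an "instance" is a unit of information having a combination of n condition attribute values, one for each of n condition attributes, and a decision attribute value for each of one or more decision attributes associated with that combination, where n > m.
[0062]
In addition to the above, the data processing method of the present invention preferably includes a step of evaluating the strength of the rule by a first index that compares the number of possible instances that can satisfy the combination of m condition attribute values with the number of instances actually used in the data processing (actually processed instances).
In addition to the above, the data processing method of the present invention desirably includes a step of evaluating the strength of the rule by a second index that compares the number of actually processed instances having the combination of m (n > m) condition attribute values associated with the one decision attribute value with the number of actually processed instances having that combination associated with a decision attribute value other than the one decision attribute value.
[0063]
The present invention can compare condition attribute values in relation to an order set by the user and discover which condition attributes correlate with a particular decision attribute. In this case, the invention desirably uses as instances data that has undergone processing comprising: a step of setting a predetermined order for the values of each condition attribute; and a step of comparing the condition attribute values and indexing them according to their consistency with this order.
As the GDT technique, the invention desirably specifies p (n > p) condition attribute values, sets a combination of condition attribute values in which the remaining n − p values are generalized, derives the number q of combinations of n condition attribute values that can match this generalized combination, and treats the probability that a particular combination of n condition attribute values matching the generalized combination appears as 1/q. According to the invention, this probability can be modified in light of background information existing in the real world.
A program implementing the GDT-RS method of the present invention, and a recording medium storing it (the GDT-RS system), are also within the scope of the invention.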
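[0064]
[Operation]
By exploiting the fact that the GDT uses the probabilistic relationship between general concepts and specific instances, the present invention obtains probabilities for evaluating the incompleteness and ambiguity of the data and the strength of discovered rules, as well as criteria for rule selection. The GDT-RS method of the invention addresses the following real-world problems.
[0065]
1. Large-scale data
In the rough set method, reducts are found by building a discernibility matrix that can discern all instances. In the GDT-RS method, a discernibility vector (one row of the discernibility matrix) is built for each instance, and the rules for that instance are derived from it. After all instances have been processed, the minimal rule set is selected. This avoids building a huge discernibility matrix and discovers rules (not reducts) directly.
By combining the GDT with rough set theory, the GDT-RS method can extract a minimal set of stronger rules covering all instances. To handle even larger databases, two algorithms implementing the method are proposed. One finds the optimal solution and suits small databases with few attributes. The other finds an approximate solution, using a greedy method to discover useful rules from large databases in a short time.
[0066]
2. Continuous data
The GDT-RS method is suited to processing discrete data. Continuous data are discretized before use; a function for discretizing continuous values is therefore provided as a preprocessing function.
[0067]
3. Noise and unobserved data
The influence of noise is explicitly reflected in rule strength, so the user can freely adjust the noise rate within an acceptable range to obtain satisfactory rules.
[0068]
4. Incomplete data
The influence of unknown (unobserved) data is explicitly reflected in rule strength, so rules are extracted with the existence of unknown data taken into account.
[0069]
5. Data changes
Because the GDT-RS method can discover the rules relating to a single instance from that instance, the addition of instances is easily handled.
[0070]
6. Use of background knowledge
Using known facts as background knowledge yields a more effective distribution table and more meaningful rules.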
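[0071]
[Table 12]

The operation of the GDT-RS method of the invention is explained using the example of Table 12.
Step 1. Merge the instances that have the same condition attribute values.
[0072]
[Table 13]

"-" denotes a contradiction.
[0073]
Step 2. For each instance, create a discernibility vector with respect to the instances whose decision attribute value differs.
Example: creating the discernibility vector of u2. Entries are computed for the instances whose decision attribute value differs from that of u2.
Let m_{2,j} be the set of attributes whose values differ between instances u2 and uj. Then
m_{2,1} = {b}
m_{2,2} = ∅
m_{2,4} = {a, c}
m_{2,6} = {b}
m_{2,7}: u2 and u7 are in the same decision class, so they need not be discerned.
The discernibility vector is as shown in Table 14.
[0074]
[Table 14]
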
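[0075]
Step 3. Create the discernibility function f_T from the discernibility vector.
[Math 12]
$$ f_T = (b) \wedge (a \vee c) \wedge (b) $$
Step 4. Generate and select rules from the discernibility function.
[Math 13]
$$ f_T = (a \wedge b) \vee (b \wedge c) $$
Substituting the corresponding attribute values of u2 into (Math 13) above generates two rules:
[Math 14]
$$ r_1:\ \{a_0 b_1\} \to d(y), \qquad r_2:\ \{b_1 c_1\} \to d(y) $$
The strengths of these two rules are computed by the GDT-RS method as follows:
[Math 15]

The possible instances contained in the generalization {a0, b1} (= {a0b1*}) are a0b1c0 and a0b1c1. Of these, only a0b1c1 exists in the data set; a0b1c0 is an unobserved instance. The strength of the generalization {a0, b1} is therefore 0.5. All actual instances contained in the generalization {a0, b1} belong to decision class d(y), so the noise rate is 0.
[Math 16]

The possible instances contained in the generalization {*, b1, c1} are a0b1c1 and a1b1c1. Both exist in the database, so the strength of the generalization {*, b1, c1} is 2 × 1/2 = 1. All actual instances contained in the generalization {*, b1, c1} belong to decision class d(y), so the noise rate is 0.
[Math 17]

Comparing the strengths of the two rules, r2: {b1c1} → d(y) is the stronger, so r2 is selected.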
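Steps 2 to 4 can be traced with a short Python sketch. Tables 12 to 14 are not reproduced in this text, so the merged instances below are an ASSUMPTION: they were chosen to agree with the discernibility entries, the two candidate rules, and the strengths stated in the worked example. The function names are hypothetical, and the subset enumeration is a brute-force stand-in for simplifying f_T.

```python
from itertools import combinations
from fractions import Fraction

ATTRS = ("a", "b", "c")
SIZES = {"a": 2, "b": 3, "c": 2}  # number of values per attribute
# ASSUMPTION: merged instances reconstructed to match the worked example.
# None marks the merged instance whose decision is contradictory.
MERGED = {
    "u1": (("a0", "b0", "c1"), None),
    "u2": (("a0", "b1", "c1"), "y"),
    "u4": (("a1", "b1", "c0"), "n"),
    "u6": (("a0", "b2", "c1"), "n"),
    "u7": (("a1", "b1", "c1"), "y"),
}

def discernibility_vector(target):
    """Step 2: differing attributes against every instance with another decision."""
    cond, dec = MERGED[target]
    return {u: {a for a, v, w in zip(ATTRS, cond, c) if v != w}
            for u, (c, d) in MERGED.items() if u != target and d != dec}

def minimal_covers(entries):
    """Steps 3-4: minimal attribute sets hitting every entry (terms of simplified f_T)."""
    found = []
    for k in range(1, len(ATTRS) + 1):
        for cand in map(set, combinations(ATTRS, k)):
            if all(cand & m for m in entries) and not any(f <= cand for f in found):
                found.append(cand)
    return found

def strength(target, attrs):
    """S = s(X) * (1 - r) for the rule built from target's values on attrs."""
    cond, dec = MERGED[target]
    fixed = {a: v for a, v in zip(ATTRS, cond) if a in attrs}
    covered = [d for c, d in MERGED.values()
               if all(dict(zip(ATTRS, c))[a] == v for a, v in fixed.items())]
    n_possible = 1
    for a in ATTRS:
        if a not in attrs:
            n_possible *= SIZES[a]  # each generalized attribute multiplies the count
    s = Fraction(len(covered), n_possible)
    r = Fraction(sum(d != dec for d in covered), len(covered))
    return s * (1 - r)

vector = discernibility_vector("u2")
terms = minimal_covers(list(vector.values()))
strengths = {frozenset(t): strength("u2", t) for t in terms}
```

Under these assumed data, u2 yields the terms {a, b} and {b, c} with strengths 1/2 and 1, so the rule on {b1c1} would be preferred, as in the text.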
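Likewise, executing Step 2 through Step 4 for instances u4, u6, and u7 generates the following rules:
[Math 18]

[0076]
Step 5. Select a minimal set of rules covering all instances.
The rule selection criteria of the GDT-RS method are as follows:
1. Rules that cover the most instances are selected first.
2. Rules with fewer attributes are selected first.
3. Among rules covering the same instances, the rule with the greatest strength is selected first.
Table 15 shows the rules relating to class d(y) and the instances they cover:
[0077]
[Table 15]

The rule {b1c1} → d(y) covers 2 instances, more than any other rule, so it is selected first. The instances covered by {a1c1} → d(y) and {a0b1} → d(y) are also covered by {b1c1} → d(y), so the minimal rule set for class d(y) consists of {b1c1} → d(y) alone.
Table 16 shows the rules relating to class d(n) and the instances they cover:
[0078]
[Table 16]

[0079]
The two rules cover different instances, so both are adopted.
As a result, the following rule set was discovered:
{b1c1} → d(y), S = 1
{b2} → d(n), S = 1/4
{c0} → d(n), S = 1/6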
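One way to read the three selection criteria is as a greedy cover, sketched below in Python. This is an interpretation, not necessarily the selection procedure of the invention. The coverage sets follow the description of Tables 15 and 16; the strength given for {a1c1} is an assumed illustrative value, since the text does not state it, and `select_rules` is a hypothetical name.

```python
from fractions import Fraction

# (condition values, covered instances, strength)
RULES_Y = [
    (("b1", "c1"), {"u2", "u7"}, Fraction(1)),
    (("a1", "c1"), {"u7"}, Fraction(1, 3)),  # ASSUMED strength, for illustration only
    (("a0", "b1"), {"u2"}, Fraction(1, 2)),
]
RULES_N = [
    (("b2",), {"u6"}, Fraction(1, 4)),
    (("c0",), {"u4"}, Fraction(1, 6)),
]

def select_rules(rules, instances):
    """Greedy cover: most newly covered instances first, then fewer attributes,
    then greater strength."""
    remaining, chosen = set(instances), []
    while remaining:
        best = max(rules, key=lambda r: (len(r[1] & remaining), -len(r[0]), r[2]))
        if not best[1] & remaining:
            break  # no rule covers the remaining instances
        chosen.append(best[0])
        remaining -= best[1]
    return chosen

selected_y = select_rules(RULES_Y, {"u2", "u7"})
selected_n = select_rules(RULES_N, {"u4", "u6"})
```

For class d(y) the single rule on (b1, c1) already covers both instances, while class d(n) needs both of its rules because they cover different instances.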
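Next, the way the GDT-RS method handles ordering problems is explained using Table 17 and a product sales ranking problem.
[0080]
[Table 17]

[0081]
In Table 17, parts such as >_a: small >_a middle >_a large are set by the user. This example concerns portable electrical products. Because they are portable, a small size is most desirable and a large size undesirable, which gives >_a: small >_a middle >_a large. Likewise, a longer warranty period is preferable, so >_b: 3 years >_b 2 years. Attribute c is price and attribute d is weight.
Attribute o denotes the ordering (the initial letter of "order"). In this example instance p1 has o = 1, indicating that it sells best. Next in sales is p4, while p2, p3, and p5 sell poorly.
[0082]
To analyze which of the attributes a through d correlate with sales on the basis of this information, the data are preprocessed using this ordered information table and the GDT-RS method is then applied.
Step 1. Create the binary information table.
The binary information table (Table 18) is created from the ordered decision table (Table 17) using the following formula.
[Math 19]
$$ I_a(x, y) = \begin{cases} 1 & \text{if } x >_{\{a\}} y \\ 0 & \text{otherwise} \end{cases} $$
[0083]
[Table 18]

[0084]
For example, the entries 1, 0, 1, 1, 1 in cell (1, 2) are the results of comparing p1 and p2 by I_a, I_b, I_c, I_d, and I_o. For attribute a, instance p1 is middle and instance p2 is large, so comparing p1 with p2 is consistent with the order set by the user, and the index "1" is chosen according to the formula above. For attribute b, by contrast, p1 and p2 are both 3 years, which is not consistent with the order set by the user (equal values are regarded as "not consistent"), so the index "0" is chosen.
Table 18 is obtained by comparing, attribute by attribute, the attribute values of each pair of instances against the order set by the user and entering "0" or "1".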
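The construction of the binary information table can be sketched as follows. Table 17 itself is not reproduced in this text, so the sketch ASSUMES that its per-attribute rank classes equal the product orderings listed earlier for the five-product example (0 = ranked first); `RANK`, `indicator`, and `binary_table` are hypothetical names.

```python
from itertools import permutations

# ASSUMPTION: ranks taken from the product orderings listed earlier (0 = ranked first).
RANK = {
    "a": {"p1": 1, "p2": 2, "p3": 0, "p4": 0, "p5": 0},
    "b": {"p1": 0, "p2": 0, "p3": 0, "p4": 0, "p5": 1},
    "c": {"p1": 0, "p2": 2, "p3": 2, "p4": 1, "p5": 0},
    "d": {"p1": 2, "p2": 3, "p3": 1, "p4": 0, "p5": 0},
    "o": {"p1": 0, "p2": 2, "p3": 2, "p4": 1, "p5": 2},
}

def indicator(attr, x, y):
    """I_attr(x, y): 1 if x is ranked strictly ahead of y on attr, else 0 (ties give 0)."""
    return 1 if RANK[attr][x] < RANK[attr][y] else 0

def binary_table(products):
    """One row per ordered pair (x, y), x != y, with one indicator per attribute."""
    return {(x, y): tuple(indicator(a, x, y) for a in RANK)
            for x, y in permutations(products, 2)}

TABLE = binary_table(["p1", "p2", "p3", "p4", "p5"])
```

Under this assumption the pair (p1, p2) produces the indicators 1, 0, 1, 1, 1 for a, b, c, d, and o, in line with the cell discussed above. Pairs of the form (x, x) are never generated.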
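[0085]
Step 2. Discover rules using the GDT-RS method.
When no ordered information table is used, the GDT-RS method is applied to the raw data of each instance. When an ordered information table is used, the raw data of each instance are instead indexed as in Table 18, on the basis of the ordered information table and the order that has been set, and the GDT-RS method is applied to the result.
This makes it possible to find the condition attributes that correlate with the decision attribute. In this case, the analysis can show which of the factors "size", "warranty period", "price", and "weight" most influences sales.
[0086]
[Embodiments of the invention]
FIG. 5 shows the preprocessing flow for carrying out the method of the invention. As shown in FIG. 5, the following preprocessing is applied to the data before rules are discovered with the invention.
1) Decide whether to use order relation information (see Table 11) (step 51). If order relation information is used, create a binary information table from the ordered decision table (step 53; see Table 18).
2) Decide whether to perform attribute selection (step 55).
3) Decide whether to discretize continuous-valued attributes (step 57).
The preprocessed data are then analyzed by the method of the invention (step 59). FIG. 6 shows the processing routine of the GDT-RS method of the invention.
In the method of the invention, instances having the same combination of condition attribute values (for example, a0b0c1) are first merged into one set (step 11). If instances with the same condition attribute values have different decision attribute values, that combination is of no use for discovering rules, so instances with the same combination of condition attributes but different decision attributes are deleted (step 15). This step is optional, however, and is skipped if the user tolerates such noise (step 13).
[0087]
Next, by the technique already described, a discernibility vector is created for each instance with respect to the instances whose decision attribute differs (step 17). From the discernibility vector a discernibility function is created (step 19), and rules are derived from it (step 21).
The user is given the option of using background knowledge (step 23): either a distribution adjusted by background knowledge (step 25) or the unadjusted uniform distribution (step 27) is adopted.
If a derived rule exceeds the noise rate tolerated by the user (step 29), it is deleted (step 31). After confirming that all instances have been processed (step 33), the final set of rules covering all instances is selected (step 35). The rules are thus determined (step 37).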
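[0088]
[Examples]
The following description uses three real-world data sets to show the effectiveness of the GDT-RS method.
The first is a meningitis example: results with and without background knowledge are compared to show that using background knowledge gives better results. The second uses a large database of bacterial culture tests to explain the process from data preprocessing (discretization of continuous data, and attribute selection using background knowledge or automatic attribute selection) through rule generation, showing effectiveness on large and complex problems. The third is a gastric cancer data example, which explains how ordering rules are generated using ordering information based on rough set theory and shows its usefulness.
Example 1. Knowledge discovery from a meningoencephalitis database
First, using the meningitis example, results with and without background knowledge are compared to show the effect of using background knowledge.
The meningitis database provided as a common data set consists of data on 140 patients. It has 38 attributes and is relatively small [Tsumoto-98] Tsumoto, S. Comparison and Evaluation of KDD Methods with Common Datasets, (Special Panel Discussion Session on Knowledge Discovery from a Meningitis Database) Proc. the 12th Annual Conference of JSAI (1998) 72-73.
The aim is to generate rules for each of the following factors.
- Factors important for diagnosis (factors important for DIAG, DIAG2)
- Factors important for identifying the causative organism (factors important for CULT_FIND, CULTURE)
- Factors important for the presence or absence of sequelae (factors important for C_COURSE, COURSE (Grouped))
In this experiment, each of the six attributes (DIAG, DIAG2, CULT_FIND,
CULTURE, C_COURSE, COURSE (Grouped))
was taken in turn as the decision class, the GDT-RS method was run with and without background knowledge, and rules for the corresponding factors were generated.
[0089]
Background knowledge
Physicians' experience is used as background knowledge. Examples of such background knowledge are as follows.
- If the EEG finding (EEG_WAVE) is normal, the focal EEG abnormality (EEG_FOCUS) is never abnormal;
- If the white blood cell count (WBC) is high, the inflammatory protein (CRP) is also high.
[0090]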
The following is the list of background knowledge provided by the physicians, divided into three groups:
- Never occur together
EEG_WAVE (normal) and EEG_FOCUS (+)
CSF_CELL (low) and Cell_Poly (high)
CSF_CELL (low) and Cell_Mono (high)
- Unlikely to occur
When WBC (low), CRP (high) is unlikely to occur
When WBC (low), ESR (high) is unlikely to occur
When WBC (low), CSF_CELL (high) is unlikely to occur
When WBC (low), Cell_Poly (high) is unlikely to occur
When WBC (low), Cell_Mono (high) is unlikely to occur
When BT (low), STIFF (high) is unlikely to occur
When BT (low), LASEGUE (high) is unlikely to occur
When BT (low), KERNIG (high) is unlikely to occur
- Likely to occur together
WBC (high) and CRP (high)
WBC (high) and ESR (high)
WBC (high) and CSF_CELL (high)
WBC (high) and Cell_Poly (high)
WBC (high) and Cell_Mono (high)
BT (high) and STIFF (high)
BT (high) and LASEGUE (high)
BT (high) and KERNIG (high)
BT (high) and CRP (high)
BT (high) and ESR (high)
…
EEG_FOCUS (+) and FOCAL (+)
EEG_WAVE (+) and EEG_FOCUS (+)
CRP (high) and CSF_GLU (low)
CRP (high) and CSF_PRO (low)
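[0091]
Because "high" and "low" in parentheses are vague ranges, they were handled as follows:
1. When the data are discretized using medical knowledge:
"high" is treated as a value above the upper limit of the normal range, and
"low" as a value below the lower limit of the normal range.
2. When continuous values are discretized automatically:
The discretized attribute values fall into a finite number of intervals; values in the highest interval are treated as "high" and those in the lowest interval as "low". All other values are treated as neither "high" nor "low".
The results presented here were obtained by discretizing continuous-valued attributes with the automatic discretization method.
In the background knowledge used by the GDT-RS method, the probability of occurrence Q must be given as a number between 0 and 1, not as "high" or "low". The physicians' knowledge is not in that form, so suitable values of Q must be set. In this experiment, Q was set to 1 for background knowledge of the form "a_{i1 j1} ⇒ a_{i2 j2} is likely to occur" and to 0 for background knowledge of the form "a_{i1 j1} ⇒ a_{i2 j2} is unlikely to occur".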
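[0092]
Comparison of results
Some of the rules obtained with background knowledge differed from those obtained without it. The use of background knowledge had the following effects:
Rules that were not obtained without background knowledge because their strength was too low were selected. For example, with background knowledge the strength of rule_1 below was 4 times its strength without background knowledge.
rule_1: ONSET (acute) ∧ ESR (≦5) ∧ CSF_CELL (<10) ∧ CULTURE (−) → VIRUS (E)
The reason is as follows. Without background knowledge, the strength of rule_1 is 30 × (384/E), where 30 is the number of instances that satisfy the condition part of the rule and 384/E is the strength of the condition part (the generalization). In this case, the background knowledge that affects rule_1, that is, background knowledge whose premise is contained in the condition part of rule_1 and whose conclusion is not, consists of the following two items.
CSF_CELL (low) and Cell_Poly (high) never occur together
CSF_CELL (low) and Cell_Mono (high) never occur together
[0093]
Cell_Poly and Cell_Mono are continuous-valued attributes, each split by discretization into just low and high, with no other values. By the background knowledge that Cell_Poly and Cell_Mono are never high when CSF_CELL is low, only the value low can occur for Cell_Poly and Cell_Mono once a low CSF_CELL occurs. The number of values of each of Cell_Poly and Cell_Mono thus drops by 1, from 2 to 1. Therefore E, the product of the numbers of values of all attributes, becomes E × 1/2 × 1/2 = E/4, and the rule strength S becomes 30 × (384/E) × 4.
[0094]
The use of background knowledge also caused some rules to be replaced by others. For example, the rule
rule_2: DIAG (VIRUS(E)) ∧ LOC[4, 7] → EEG_abnormal, S = 30/E
was discovered without background knowledge, but with background knowledge rule_2' below was obtained instead.
rule_2': EEG_FOCUS (+) ∧ LOC[4, 7] → EEG_abnormal, S = (10/E) × 4
This is because the use of background knowledge made the strength of rule_2' exceed that of rule_2.
The physician's comment on this result was: "Both rule_2 and rule_2' are reasonable, but rule_2' is more reasonable."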
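[0095]
Example 2. Knowledge discovery from a bacterial culture test database
The database used here is a bacterial culture test database provided as a common data set.
It has 20,000 instances and 49 attributes. The database is a collection of real-world data to which no preprocessing has been applied. It therefore has several characteristics:
1. It contains missing data, that is, missing attribute values.
2. "Disease name" and "detected bacteria" have many attribute values.
3. It contains time-series data, that is, test data collected from the same patient at different times.
In addition, some attributes consist of several sub-attributes. For example, the disease name attribute contains three disease names, and the target bacteria attribute contains two kinds of target bacteria. Counting sub-attributes, the number of attributes rises to 60. Attribute selection is therefore needed.
[0096]
There are two main objectives for this database:
1. Find the relationship between detected bacteria and the other attributes.
Here the data are divided into two decision classes, bacteria detected (+) and not detected (−), and if-then rules are sought for each class.
2. When bacteria are detected, find the relationship between antibiotic susceptibility and the other attributes.
Antibiotic susceptibility takes four values (R, S, I, blank), but here if-then rules are sought only for resistant (R) and sensitive (S).
In this medical data, the rules obtained cover few instances, so even a rule with 100% accuracy, that is, one without noise, cannot be said to be sufficiently reliable. The GDT-RS method therefore discovers rules within the accuracy range specified by the user.
[0097]
Procedure:
1. Discover rules at high accuracy.
2. For the instances covered by rules with few instances (for example, 3 or fewer), discover rules again at lower accuracy.
The experimental results for each objective are given below.
[0098]
Objective 1. Relationship between detected bacteria and attributes
First, missing data are replaced by "?" and treated as one of the attribute values. Then, to reduce the number of attributes, attributes clearly unrelated to detected bacteria, such as treatment, specimen material, and ward, were deleted, and unimportant attributes such as "urinary nitrate" or "urinary protein" were deleted by an attribute selection method. The experiment used detected bacteria as the decision class and the following as condition attributes.
sex, date of birth (converted to teens, twenties, ...), fever, tracheotomy, white blood cells, disease name 1, urinary WBC, urinary protein, urinary occult blood, urine quantitative culture, β-lactamase
Tables 19 and 20 show the rules for bacteria detected (+) and not detected (−). Because many rules were discovered, only some of those that cover many instances and have high accuracy are shown. In the tables, gen is the number of generalizations contained in the condition, strg. is the prior probability of the generalization, and (+) and (−) under "number of instances" are the numbers of instances in the database covered by each rule that belong to detected bacteria (+) or (−), respectively.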
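[0099]
[Table 19]

[0100]
[Table 20]

[0101]
Objective 2. Relationship between antibiotics and attributes
The kinds of antibiotics are as follows:
PCG, PCs, Aug, PCs-緑, CEPs-1, CEPs-2, CEPs-3, CEPs-4, CEPs-緑, AGs, MLs, TCs, LCMs, CPs, CBPs, VCM, RFP/FOM
Rule discovery is performed with these attributes as decision attributes.
As in the previous case, unimportant attributes were removed, leaving the following attributes.
sex, age (by decade), fever, catheter 1, tracheotomy, intubation, drainage 1, white blood cells, medication 1, steroids, anticancer drugs, radiation, disease name 1, urinary WBC, urinary nitrate, urinary protein, urinary occult blood, urine quantitative culture, total bacterial count, detected bacteria, β-lactamase
The results of running the GDT-RS method for the antibiotics {Aug, CEPs-1, PCG, PCs} are shown here. Because the GDT-RS method looks for a set of rules covering all instances, the number of rules discovered is large. Table 21 lists only those judged valuable by the physicians.
[0102]
[Table 21]

[0103]
Comparison with C4.5
Rule discovery with C4.5 was performed on the same data. Some of the results are the same as those of the GDT-RS method. For example:
β-lactamase (3+) → detected bacteria (+)
white blood cells (unknown) ∧ disease name 1 (tumor) ∧ urine quantitative culture (10^4) → detected bacteria (+)
urine quantitative culture (<10^3) → detected bacteria (−)
However, some rules, for example the following rule, which the physicians consider meaningful:
disease name 1 (pneumonia) → detected bacteria (−)
were discovered by the GDT-RS method but not by C4.5.
There are also rules that were discovered by C4.5 but not by the GDT-RS method. The following two are examples:
β-lactamase (1+) → detected bacteria (+)
fever (39°C) → detected bacteria (−)
The reason is that the GDT-RS method performed rule discovery with a lower threshold (noise rate 40%). As a result, general rules more meaningful than those obtained with C4.5 were discovered. C4.5 discovered more specific rules containing β-lactamase (1+) or fever (39°C) (see Table 22).
[0104]
[Table 22]
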
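[0105]
Example 3. Knowledge discovery from a gastric cancer database
The last example is a gastric cancer database collected at the National Cancer Center. It has about 27 attributes describing the patients' condition, for example the type of cancer, presence or absence of serosal invasion, presence or absence of peritoneal metastasis, presence or absence of liver metastasis, maximum diameter of the cancerous tissue, and preoperative complications. The aim of the analysis is to find the main factors at work in patients who died within 90 days after surgery. This example uses ordering information to explain the generation of ordering rules based on the GDT-RS method.
The gastric cancer database is typically unbalanced: 207 patients died within 90 days after surgery, whereas 5963 patients survived. In addition, many of the surviving patients show symptoms similar to those of the patients who died.
Before the detailed explanation, the main attributes of the National Cancer Center gastric cancer data and the ordering information on their values are shown (see Table 23).
[0106]
[Table 23]

"Note": unless otherwise explained, "." and ".." denote missing values.
[0107]
To find ordering rules from the ordered decision table, the ordered decision table is first converted into binary information. The GDT-RS system can then be applied directly.
The quality of the rules discovered by the GDT-RS method of the invention is evaluated from several viewpoints, including accuracy and coverage. Accuracy is the ratio of the number of instances matching all attributes contained in the rule to the number of instances matching all attributes contained in the condition part of the rule. The user can specify the accuracy level as a threshold within the permitted range. For the gastric cancer data, the threshold is set to 0.6. Coverage, on the other hand, is expressed simply as the numbers of positive and negative instances covered by the rule.
In the binary information table, consider a pair of instances (x, y) ∈ U × U. The information function is defined as follows:
[Math 20]
$$ I_a(x, y) = \begin{cases} 1 & \text{if } x >_{\{a\}} y \\ 0 & \text{otherwise} \end{cases} $$
For example, if x >_{a} y then I_a(x, y) = 1. Instance pairs of the form (x, x) are not considered.
The results in Table 24 show the following. Of the 22 rules, 17 contain the attribute liver_meta (liver metastasis) and 11 contain the attribute peritoneal_meta (peritoneal metastasis), of which 5 also contain liver_meta. This result indicates that liver metastasis and peritoneal metastasis are the main causes of death within 90 days after surgery.
[0108]
[Table 24]

[0109]
The assessment of these results by staff of the National Cancer Center was that "the discovered ordering rules are reasonable and interesting". They evaluated each rule on the basis of acceptability and novelty. Table 25 shows the rules discovered from the gastric cancer data and their evaluation of each rule. The scores for acceptability and novelty range from 1 (lowest) to 5 (highest). As the table shows, many rules were accepted, and some of them were judged to be novel.
[0110]
[Table 25]
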
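[Effects of the invention]
The GDT-RS method of the present invention is superior in its flexibility with respect to bias adjustment, its ability to use background knowledge, its effectiveness in handling noisy and missing data, and its ability to take unobserved information into account.
According to the present invention, the following effects are obtained.
1. Using background knowledge as a bias leads to the discovery of more reasonable rules.
2. Because unobserved data are taken into account when determining rule strength, more meaningful rules are discovered.
[Brief description of the drawings]
[FIG. 1] An example of a version space according to the prior art.
[FIG. 2] An example of a C4.5 tree structure according to the prior art.
[FIG. 3] An example of an analysis result obtained with rough sets according to the prior art.
[FIG. 4] A conceptual diagram of the solution according to the present invention.
[FIG. 5] A flow diagram of the preprocessing for data processing according to the present invention.
[FIG. 6] A flow diagram of data processing according to the present invention.
[0001]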
[Technical field of the invention]
The present invention relates to data mining technology that enables the discovery of knowledge, such as rules and causal relationships, from large-scale, complex, incomplete, and ambiguous real-world data.
[0002]
[Prior art]
The dramatic improvement in computer computing speed, the increase in storage capacity, and the improvement of the network environment have made it possible to store large amounts of data and to share and reuse large-scale information (knowledge and data). As the need for databases increases, the need to effectively utilize the data stored in them and to analyze / extract the rules and causal relationships behind the data groups has increased.
[0003]
Research on inductive learning for discovering if-then rules from real-world data has been conducted for the past two decades, and various methods have been proposed, ranging from representative bottom-up methods such as version space [Mitchell-77] (Non-Patent Document 1) and back-propagation [Rumelhart-86] (Non-Patent Document 2) to top-down methods such as C4.5 [Quinlan-93] (Non-Patent Document 3) and the rough set method [Pawlak-91] (Non-Patent Document 4).
[0004]
[Non-Patent Document 1]
[Mitchell-77] Mitchell, T. M.
"Version Spaces: A Candidate Elimination Approach to Rule Learning",
Proc. 5th Int. Joint Conf. Artificial Intelligence (1977) 305-310.
[0005]
[Non-Patent Document 2]
[Rumelhart-86] Rumelhart, D. E., Hinton, G. E., and Williams, R. J.
"Learning Internal Representations by Back-Propagation Errors",
Nature Vol. 323 (1986) 533-536.
[0006]
[Non-Patent Document 3]
[Quinlan-93] Quinlan, J. R.
"C4.5: Programs for Machine Learning",
Morgan Kaufmann (1993).
[0007]
[Non-Patent Document 4]
[Pawlak-91] Pawlak, Z.
"Rough Sets, Theoretical Aspects of Reasoning about Data",
Kluwer Academic Publishers (1991).
[0008]
The basic concepts are illustrated using Table 1. Each row, u1, u2, ..., u6, is called an instance. Headache, body temperature, muscle pain, and influenza are called attributes. Table 1 is used to judge whether a patient has influenza from the presence or absence of headache and muscle pain, the body temperature, and so on. The attribute that holds the judgment (influenza) is called the decision attribute, and the attributes that condition the decision attribute (headache, body temperature, muscle pain) are called condition attributes. The entries that appear in each column (for example, yes/no for headache and normal/high/very-high for body temperature) are called the values of the respective attribute.
[0009]
[Table 1]

[0010]
"Bottom-up method"
A bottom-up method is an incremental learning method: concepts can be learned not only when the instances are supplied all at once but also when they are given one at a time. Bottom-up methods handle data changes effectively, but their computation time is long.
In a bottom-up method, each learning cycle takes one new instance and uses it to revise the concept learned so far. This is repeated until all instances have been processed.
A representative bottom-up method is the version space method proposed by Mitchell as a discrete method. It is a search technique that relies on the structure of the state space. When a new positive instance (one whose influenza value is yes) is given, the concept obtained from it, or any more general concept, may be the concept sought. When a negative instance (one whose influenza value is no) is given, only elements more specific than the corresponding concept may be the concept sought. Many concepts can fit a given set of instances; heuristic search, as used in problem solving, can be applied to find one with the desired structure. Starting from the general concepts obtained from the positive instances and the specific concepts obtained from the negative instances, the feasible search region is narrowed by the positive and negative instances presented one after another. The general concept finally obtained is called a rule.
[0011]
The version space method will be described using the example of influenza in Table 1.
Consider four predicates representing the attributes of a subject U:
Headache (U, X), X ∈ {yes, no}
Body temperature (U, Y), Y ∈ {normal, high, very-high}
Muscle pain (U, Z), Z ∈ {yes, no}
Influenza (U, D), D ∈ {yes, no}
[0012]
Consider the following concepts, each represented by a set of these predicates:
u1 = {headache (U, yes), body temperature (U, normal), muscle pain (U, yes), influenza (U, no)}
u2 = {headache (U, yes), body temperature (U, high), muscle pain (U, yes), influenza (U, yes)}
u3 = {headache (U, yes), body temperature (U, very-high), muscle pain (U, yes), influenza (U, yes)}
u4 = {headache (U, no), body temperature (U, normal), muscle pain (U, yes), influenza (U, no)}
u5 = {headache (U, no), body temperature (U, high), muscle pain (U, no), influenza (U, no)}
u6 = {headache (U, no), body temperature (U, very-high), muscle pain (U, yes), influenza (U, yes)}
[0013]
Here, since the rules to be obtained determine whether the value of the influenza attribute is yes or no, influenza is called the decision attribute. The version space constituted by (headache, body temperature, muscle pain) can then be represented by a graph as shown in FIG. The branches of the graph represent substitution operations, with the top node representing the most "specific" concept and the bottom node representing the most "general" concept.
The concepts more general than example u1 are the underlined nodes of FIG. Since u1 and u2 have the same values for the two attributes "headache" and "muscle pain", their common generalization is (yes, Y, yes). Here, Y is a variable that may take any value in {normal, high, very-high}. The values of the decision attribute (influenza) differ between u1 and u2 (u1 is no and u2 is yes), so (yes, Y, yes) cannot be used as a rule.
[0014]
Similarly, when common generalization is performed on the actual examples u3 and u6, (X, very-high, yes) is obtained. The values (decision classes) of the decision attributes of u3 and u6 are both yes, which is the same. Therefore, the generalized concept of (X, very-high, yes) is one candidate as a rule for influenza = yes. When all examples have been learned and no contradictions occur, this generalized concept is extracted as a rule.
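The common generalization step described in paragraphs [0013] and [0014] can be sketched as follows. This is a minimal illustration, not the full version space algorithm: the function names and the use of `*` for a variable are assumptions made for this sketch, and the data is the influenza example of paragraph [0012].

```python
# Influenza examples of paragraph [0012]:
# (headache, body temperature, muscle pain) -> influenza
EXAMPLES = {
    "u1": (("yes", "normal", "yes"), "no"),
    "u2": (("yes", "high", "yes"), "yes"),
    "u3": (("yes", "very-high", "yes"), "yes"),
    "u4": (("no", "normal", "yes"), "no"),
    "u5": (("no", "high", "no"), "no"),
    "u6": (("no", "very-high", "yes"), "yes"),
}
VAR = "*"  # stands for a variable such as X, Y or Z (assumed notation)


def common_generalization(c1, c2):
    """Keep the values shared by both examples; replace the others by a variable."""
    return tuple(v1 if v1 == v2 else VAR for v1, v2 in zip(c1, c2))


def covers(concept, conditions):
    """True if the concept matches the condition values of an example."""
    return all(c == VAR or c == v for c, v in zip(concept, conditions))


def is_consistent(concept, decision):
    """True if every example covered by the concept has the given decision value."""
    return all(d == decision for cond, d in EXAMPLES.values() if covers(concept, cond))


g12 = common_generalization(EXAMPLES["u1"][0], EXAMPLES["u2"][0])
g36 = common_generalization(EXAMPLES["u3"][0], EXAMPLES["u6"][0])
print(g12, is_consistent(g12, "yes"))
print(g36, is_consistent(g36, "yes"))
```

In this sketch, the generalization of u1 and u2 covers examples of both decision classes and is rejected, while the generalization of u3 and u6 covers only examples with influenza = yes and remains a rule candidate.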
[0015]
"Top-down method (or batch learning)"
In the top-down method, all data is analyzed at once and an attempt is made to find a concept based on the data. C4.5 and rough sets are typical.
C4.5 constructs a decision tree from the top down and finds rules by pruning branches. The tree construction starts with the selection of the root node. Each attribute is evaluated as a candidate for the root node, and the most appropriate attribute is selected as the root node. Once the root node is determined, the examples are classified by this node. The same operation is then repeated for each classified group as a subtree.
[0016]
Using the influenza example described above, the structure of the C4.5 tree will be described with reference to FIG. First, among the three attributes, "body temperature" is evaluated as the most appropriate attribute and becomes the root node. Based on body temperature, the examples are divided into two groups, body temperature = {normal} and body temperature = {high, very-high}. Next, focusing on the group body temperature = {high, very-high}, "muscle pain" is evaluated as the most appropriate attribute, and as a result the following rules are found:
Body temperature (normal) → Influenza (no)
Body temperature (high, very-high) ∧ muscle pain (yes) → influenza (yes)
Body temperature (high, very-high) ∧ muscle pain (no) → influenza (no)
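The attribute evaluation at the root node can be sketched as follows. This is a simplified illustration under stated assumptions: it scores attributes by plain information gain, whereas C4.5 itself uses the gain ratio, and the names `entropy`, `information_gain`, and `classify` are chosen for this sketch only. The `classify` function encodes the three rules listed above.

```python
from collections import Counter
from math import log2

ATTRS = ("headache", "temperature", "muscle_pain")
ROWS = [
    (("yes", "normal", "yes"), "no"),      # u1
    (("yes", "high", "yes"), "yes"),       # u2
    (("yes", "very-high", "yes"), "yes"),  # u3
    (("no", "normal", "yes"), "no"),       # u4
    (("no", "high", "no"), "no"),          # u5
    (("no", "very-high", "yes"), "yes"),   # u6
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of decision values."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, index):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    groups = {}
    for cond, decision in rows:
        groups.setdefault(cond[index], []).append(decision)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([d for _, d in rows]) - remainder


def classify(cond):
    """The three rules found in the text, applied to one example."""
    _headache, temperature, muscle_pain = cond
    if temperature == "normal":
        return "no"
    return "yes" if muscle_pain == "yes" else "no"


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
print(best, gains)
```

With this criterion, body temperature obtains the largest score for the six examples, and the three rules reproduce the influenza value of every example.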
[0017]
Another top-down method uses rough sets. Rough sets were first proposed by the Polish computer scientist Zdzislaw Pawlak [Pawlak-82].
[Non-Patent Document 5]
[Pawlak-82] Pawlak, Z.
"Rough Sets",
International Journal of Computer and Information Sciences, Vol. 11 (1982) 341-356.
Rough set theory is an extension of set theory that can cope with uncertain and incomplete data [Pawlak-91, Skowron-92]. For a given concept, the knowledge that humans can measure can grasp the extent of the concept only through upper and lower approximations. Rough set theory is the theoretical study of such approximation.
[Non-Patent Document 6]
[Skowron-92] Skowron, A. and Rauszer, C.
"The Discernibility Matrices and Functions in Information Systems",
R. Slowinski (ed.) Intelligent Decision Support, Kluwer Academic Publishers (1992) 331-362.
The concepts proposed in rough set theory can be broadly divided into two: (1) an approximate expression of a data set in a database, and (2) a method of obtaining a reduct (also rendered as "contraction" or "reduction"), which is a minimal set of attributes necessary for the approximate expression.
[0018]
In the rough set methodology for finding rules from data, a database is treated as a decision table, which is defined as
T = (U, At, {Va | a ∈ At}, {Ia | a ∈ At}, C, D).
Here, U is a finite set of examples, At is a finite set of all attributes, Va is the set of values of each attribute a ∈ At, and Ia is the mapping that gives, for each example in U, its value of attribute a in Va. C and D are two subsets of At: C is called the set of condition attributes and D the set of decision attributes, with C ∪ D = At and C ∩ D = empty set. Grouping together the examples whose values are equal for all attributes contained in C partitions U into a number of such groups. This partition is written U / C, and its elements are the equivalence classes by C. The same applies to U / D. The elements of U / C and U / D are called condition classes and decision classes, respectively.
[0019]
Each row of the decision table represents one example of the finite set U. Each column represents an attribute. Each cell represents the value of the corresponding attribute for that example. For example, the above-mentioned influenza data is expressed as follows.
Example set U = {u1, u2, u3, u4, u5, u6}
Attribute set At = {headache, body temperature, muscle pain, influenza}
V_headache = {yes, no}, V_{body temperature} = {normal, high, very-high}, V_{muscle pain} = {yes, no}, V_influenza = {yes, no}
I_headache(u1) = yes, I_{muscle pain}(u1) = yes, ..., I_influenza(u6) = yes
Condition attributes C = {headache, body temperature, muscle pain}
Decision attribute D = {influenza}
[0020]
The upper and lower approximations are an important part of the rough set approximation concept. The lower approximation $\underline{R}X$ is the set of examples that are reliably classified as elements of X using the relation R. The upper approximation $\overline{R}X$ is the set of examples that may be classified as elements of X using the relation R. Rule finding amounts to looking for lower approximations.
To illustrate the upper and lower approximations, look at the data composed of headache and muscle pain, as shown in Table 2:
[0021]
[Table 2]

[0022]
Relation R = {headache, muscle pain}
Classification by R: U / R = {{u5}, {u1, u2, u3}, {u4, u6}}
Set of examples with influenza = no: X1 = {u1, u4, u5}
Lower approximation of X1 (the set reliably classified as elements of X1 using the relation R): $\underline{R}X1$ = {u5}
Upper approximation of X1 (the set that may be classified as elements of X1 using the relation R): $\overline{R}X1$ = {u1, u2, u3, u4, u5, u6}
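The computation of the classification by R and of the two approximations can be sketched as follows. The dictionary layout and the function names are assumptions made for illustration; the data is the influenza example of paragraph [0012].

```python
TABLE = {
    "u1": {"headache": "yes", "temperature": "normal", "muscle_pain": "yes", "influenza": "no"},
    "u2": {"headache": "yes", "temperature": "high", "muscle_pain": "yes", "influenza": "yes"},
    "u3": {"headache": "yes", "temperature": "very-high", "muscle_pain": "yes", "influenza": "yes"},
    "u4": {"headache": "no", "temperature": "normal", "muscle_pain": "yes", "influenza": "no"},
    "u5": {"headache": "no", "temperature": "high", "muscle_pain": "no", "influenza": "no"},
    "u6": {"headache": "no", "temperature": "very-high", "muscle_pain": "yes", "influenza": "yes"},
}


def partition(table, attrs):
    """Equivalence classes U / R: examples with equal values on all attrs."""
    classes = {}
    for name, row in table.items():
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(name)
    return list(classes.values())


def lower_approximation(classes, target):
    """Union of the classes that lie entirely inside the target set."""
    return set().union(*(c for c in classes if c <= target))


def upper_approximation(classes, target):
    """Union of the classes that share at least one example with the target set."""
    return set().union(*(c for c in classes if c & target))


classes = partition(TABLE, ("headache", "muscle_pain"))
x1 = {name for name, row in TABLE.items() if row["influenza"] == "no"}
lower = lower_approximation(classes, x1)
upper = upper_approximation(classes, x1)
print(sorted(lower), sorted(upper))
```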
[0023]
This relationship can be represented as in FIG. With the relation R, only u5 can be classified correctly.
A reduct (contraction) is a minimal set of attributes that still allows the examples to be classified into their decision classes. In the influenza example, even if the attribute headache is deleted, each example is classified as before, as shown in Table 3.
[0024]
[Table 3]

[0025]
Here, body temperature and muscle pain are attributes that cannot be deleted. For example, looking at u2 and u4, the value of influenza, the decision attribute, cannot be determined uniquely from the value of body temperature alone. The same is true for muscle pain. Thus, body temperature and muscle pain form a reduct:
Reduct 1 = {muscle pain, body temperature}.
Similarly, as shown in Table 4, even if the attribute of muscle pain is deleted, the classification is not affected.
[0026]
[Table 4]

[0027]
However, if the attribute headache or body temperature is further deleted from Table 4, influenza, the decision attribute, can no longer be determined uniquely. Therefore,
Reduct 2 = {headache, body temperature}.
The set of attributes included in all reducts is called the CORE. In this example,
CORE = {headache, body temperature} ∩ {body temperature, muscle pain} = {body temperature}.
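The attribute-deletion argument above can be checked mechanically. The following sketch searches all attribute subsets by brute force and keeps the minimal ones that still determine influenza uniquely. It is an illustration for this small, consistent table only, and the function names are assumptions of this sketch.

```python
from itertools import combinations

ATTRS = ("headache", "temperature", "muscle_pain")
ROWS = [
    (("yes", "normal", "yes"), "no"),      # u1
    (("yes", "high", "yes"), "yes"),       # u2
    (("yes", "very-high", "yes"), "yes"),  # u3
    (("no", "normal", "yes"), "no"),       # u4
    (("no", "high", "no"), "no"),          # u5
    (("no", "very-high", "yes"), "yes"),   # u6
]


def determines_decision(indices):
    """True if equal values on the chosen attributes always imply the same decision."""
    seen = {}
    for cond, decision in ROWS:
        key = tuple(cond[i] for i in indices)
        if seen.setdefault(key, decision) != decision:
            return False
    return True


def find_reducts():
    """Minimal attribute subsets that keep the classification unchanged."""
    found = []
    for size in range(1, len(ATTRS) + 1):
        for indices in combinations(range(len(ATTRS)), size):
            minimal = not any(set(r) <= set(indices) for r in found)
            if minimal and determines_decision(indices):
                found.append(indices)
    return [frozenset(ATTRS[i] for i in indices) for indices in found]


reducts = find_reducts()
core = frozenset.intersection(*reducts)
print(reducts, core)
```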
The reducts of a decision table can be derived easily by using the identification (discernibility) matrix [Skowron-92, Pawlak-91] proposed by Skowron. For U = {u1, u2, u3, ..., un}, the identification matrix M(T) is an n × n matrix whose component m_ij is the set of attributes needed to classify the examples u_i and u_j into different decision classes. M(T) is a symmetric matrix and m_ii = empty set, so M(T) can be represented by its lower triangle alone, that is, by m_ij with 1 ≤ j < i ≤ n. The logical sum of all attributes c with c ∈ m_ij is written ∨m_ij. For all u_i ∈ U, the reducts are generated by the discernibility function f_T (Equation 1):
[0028]
(Equation 1)

Here, if m_ij is the empty set, ∨m_ij = false.
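The image of Equation 1 is not reproduced in this text. A form consistent with the description above and with the usual definition of the discernibility function is given below as a reconstruction, not as the original figure. The conjunction is taken once per pair of examples belonging to different decision classes, written here over the lower triangle of M(T).

```latex
f_T \;=\; \bigwedge_{1 \le j < i \le n} \Bigl( \bigvee m_{ij} \Bigr),
\qquad
\bigvee m_{ij} \;=\; \bigvee_{c \,\in\, m_{ij}} c
```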
[0029]
Each conjunction in the minimal disjunctive normal form of f_T is called a reduct.
For the above-mentioned influenza example, shown in Table 5, the process of obtaining the reducts using an identification matrix is described below.
[0030]
[Table 5]

[0031]
First, to distinguish u1 from the other examples, we analyze which attributes are important. Since u1 and u2 differ only in the body temperature attribute, it is body temperature that distinguishes u1 from u2. Body temperature also distinguishes u1 from u3. Since u1 and u4, and u1 and u5, belong to the same decision class, there is no need to distinguish them. What distinguishes u1 from u6 are the headache and body temperature attributes. This process consists of the following two steps.
(1) Focus on two examples with different decision classes (influenza).
(2) Extract the attributes whose values differ between the two examples.
Proceeding in the same way for u2 to u6 gives the identification matrix shown in Table 6:
[0032]
[Table 6]

[0033]
Taking the conjunction of all elements gives
f_T = body temperature ∧ (headache ∨ body temperature) ∧ (headache ∨ muscle pain) ∧ (headache ∨ body temperature ∨ muscle pain) ∧ (body temperature ∨ muscle pain) = (body temperature ∧ headache) ∨ (body temperature ∧ muscle pain).
Two reducts are obtained: {body temperature, headache} and {body temperature, muscle pain}.
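The two steps and the simplification of f_T can be sketched as follows. Because f_T contains no negations, the conjunctions of its minimal disjunctive normal form are exactly the minimal attribute sets that intersect every non-empty m_ij, so this sketch finds them by enumerating such sets. That enumeration is a choice made for this small example, not necessarily the procedure used by the invention, and the function names are assumptions.

```python
from itertools import combinations

ATTRS = ("headache", "temperature", "muscle_pain")
EXAMPLES = {
    "u1": (("yes", "normal", "yes"), "no"),
    "u2": (("yes", "high", "yes"), "yes"),
    "u3": (("yes", "very-high", "yes"), "yes"),
    "u4": (("no", "normal", "yes"), "no"),
    "u5": (("no", "high", "no"), "no"),
    "u6": (("no", "very-high", "yes"), "yes"),
}


def discernibility_entries(examples):
    """m_ij for every pair of examples with different decision classes."""
    entries = {}
    for (ni, (ci, di)), (nj, (cj, dj)) in combinations(examples.items(), 2):
        if di != dj:  # step (1): different decision classes
            # step (2): attributes whose values differ
            entries[(ni, nj)] = frozenset(
                a for a, vi, vj in zip(ATTRS, ci, cj) if vi != vj
            )
    return entries


def minimal_hitting_sets(clauses):
    """Minimal attribute sets that share an attribute with every clause."""
    clauses = list(clauses)
    hits = []
    for size in range(1, len(ATTRS) + 1):
        for candidate in combinations(ATTRS, size):
            s = frozenset(candidate)
            if all(s & c for c in clauses) and not any(h <= s for h in hits):
                hits.append(s)
    return hits


entries = discernibility_entries(EXAMPLES)
reducts = minimal_hitting_sets(set(entries.values()))
print(entries[("u1", "u6")], reducts)
```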
Research on knowledge discovery and data mining (abbreviated as KDD), which extracts rules and causal relationships from databases, has only recently begun. The methods described above produce perfect results for ideal experimental data, but they cannot handle real-world data.
Real world data has the following characteristics.
1. Large scale
2. Incompleteness (missing data, incomplete data)
3. Ambiguity (noise data)
4. Dynamic (increase / decrease in data)
5. Diversity (mixed continuous and discrete data)
6. Use of background knowledge
The conventional method can only partially solve the problem of real world data as shown in Table 7.
[0034]
[Table 7]

[0035]
The main problems of the version-space method are:
1. For a problem of a certain size or more, the amount of calculation increases
2. Does not work well with noisy data, ie error or missing data
As you can see from the version space diagram in the section above, the number of attributes and the number of different values for each attribute are key to the size of the version space. When the number of attributes or the number of attribute values of each attribute increases, the size of the version space increases according to the following equation (Equation 2), and the amount of calculation increases.
(Equation 2)

(Here, m is the number of attributes, and n_i is the number of different values of attribute i.)
[0036]
Conventional back-propagation also has the following problems:
1. The process by which a given decision or output is reached is hard to understand and difficult to explain, because the meaning of the problem being solved is not clear
2. Background knowledge is not built into the network
3. Long training time to get satisfactory results. This is because such network learning methods basically use the training dataset repeatedly.
The problem with this method is therefore how to encode or express knowledge explicitly in the network and how to extract rules from the trained network.
[0037]
Although the rough set method is fast, concept learning cannot be performed when examples are presented successively. Data sets are rarely consistent. The rough set method removes such contradictions and seeks reduction. This is an unrealistic treatment given that real databases contain many probabilistic elements.
The major problem with this method is that the identification matrix grows as the number of attributes increases. The number of attribute combinations needed to distinguish the examples increases, and so does the number of reducts. There is no criterion for evaluating which reduct is most appropriate.
[0038]
The birth of rough set theory has attracted a great deal of interest from many researchers, and classification learning algorithms based on rough sets continue to be developed. However, since classification problems in the real world go beyond the basic concepts of rough sets, stochastic techniques must be added.
In general, the bottom-up method enables concept learning not only when the examples are presented simultaneously but also when they are presented successively. However, it is slow. The top-down method learns quickly and has a short calculation time. However, it also has drawbacks: concept learning cannot be performed when examples are presented successively or when the data changes, background knowledge cannot be used in the learning process, and the method cannot be executed in a parallel-distributed mode.
None of the above methods has found a satisfactory solution in real world applications, such as noise processing, deviation adjustment to control the search process and the use of background knowledge.
[0039]
[Problems to be solved by the invention]
An object of the present invention is to solve the problems of the prior art by combining logic, sets and probabilities, and to develop a stochastic rough set induction learning system applicable to real world data.
A further object of the present invention is to provide a data mining method that can find rules from large-scale, diverse, noisy, and incomplete data, and that is effective for real-world problems.
[0040]
[Means for Solving the Problems]
Until now, knowledge discovery methods have been developed based on one of logic, set, or probability theory, and have a limited scope.
The method according to the present invention (GDT-RS method) and the system realizing it (GDT-RS) are based on a combination of a generalized distribution table (GDT) and rough set (RS) theory, as shown in FIG., and span all three theories of logic, sets, and probability. The GDT is used to create a probability criterion that evaluates the incompleteness and ambiguity of data and the strength of discovered rules. Rough set theory is used to extract rules and to narrow down, from the strong rules, a minimum rule set that covers all examples.
The influence of noise and unknown (unobserved) data is explicitly represented in the strength of a rule discovered by the GDT-RS method. In addition, the GDT-RS method has flexibility with respect to bias adjustment for learning, and can use background knowledge. Since the GDT-RS method can be used to solve the ranking problem in the actual example, the present invention also proposed an order information table.
[0041]
1. Overview of GDT-RS method
The GDT-RS method aims to find rules from noisy or incomplete data. The GDT-RS method combines the generalized distribution table (GDT) and rough set (RS) theory. The following sections first introduce the basic concepts necessary for understanding the GDT-RS method, such as the generalized distribution table (GDT) and the order information table, and then describe the features of the GDT-RS method. Rough set theory, another basic concept, has already been described.
[0042]
2. Generalized distribution table (GDT)
The center of the GDT-RS method is a transformed transition matrix, the so-called Generalization Distribution Table (GDT), which provides a probability criterion for evaluating the incompleteness / ambiguity of data and the strength of discovered rules.
The GDT represents the probability relationship between the example and the concept. The GDT is composed of three components: a "possible example", a "possible generalization", and a "probability relation" that indicates the relationship between possible examples and possible generalizations. Table 9 shows examples of GDTs obtained from Table 8.
[0043]
[Table 8]

[0044]
[Table 9]

[0045]
Here, a0b0c0, a0b0c1, ... in the top row represent all possible combinations of attribute values. Since attributes a and c each take two values and attribute b takes three, there are 2 × 3 × 2 = 12 combinations. {*b0c0}, {*b0c1}, ... in the leftmost column are called possible generalizations. For example, {*b0c0} denotes the set {a0b0c0, a1b0c0} of examples having b0 and c0, obtained by generalizing attribute a (indicated by *). The probability relationship gives the probability of each possible instance within a possible generalization.
In Table 9, each element G_ij of the GDT represents the "probability relationship", which indicates the strength of the relationship between the corresponding possible instance and possible generalization. If no background knowledge is used, the prior distribution is uniform and G_ij is defined by Equation 1.
[0046]
(Equation 3)

[0047]
Here, PI_j is the possible instance of the j-th column, PG_i is the possible generalization of the i-th row, and N_PGi is the number of possible instances contained in the i-th generalization.
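The equation image is not reproduced in this text. Under the uniform prior described above, a form of G_ij consistent with these symbol definitions is the following reconstruction, in which n_k denotes the number of values of attribute k (a symbol introduced here for illustration):

```latex
G_{ij} \;=\; p(PI_j \mid PG_i) \;=\;
\begin{cases}
\dfrac{1}{N_{PG_i}} & \text{if } PI_j \in PG_i \\[2ex]
0 & \text{otherwise}
\end{cases}
\qquad\text{with}\qquad
N_{PG_i} \;=\; \prod_{k \,:\, PG_i[k] \,=\, *} n_k
```

For example, for {*b0c0} only attribute a is generalized, so N_PGi = 2 and each of a0b0c0 and a1b0c0 receives probability 1/2.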
Here, the sum of the i-th row probability is represented by the following equation.
(Equation 4)

When creating a GDT, only condition attributes are used. The generalization constituted by the condition attribute is treated as the condition part of the rule. The decision attribute is treated as a conceptual part of the rule.
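The construction of a GDT under the uniform prior can be sketched as follows, using three condition attributes with 2, 3, and 2 values as in the example above. The sketch stores only the non-zero entries of each row. The exact set of generalizations listed in Table 9 is not visible in this text, so the choice to include every pattern with at least one `*` and at least one fixed value is an assumption, as are the function names.

```python
from fractions import Fraction
from itertools import product

VALUES = {"a": ("a0", "a1"), "b": ("b0", "b1", "b2"), "c": ("c0", "c1")}
WILDCARD = "*"


def possible_instances():
    """All combinations of attribute values (the columns of the GDT)."""
    return list(product(*VALUES.values()))


def possible_generalizations():
    """Patterns with at least one wildcard and at least one fixed value (assumed)."""
    choices = [vals + (WILDCARD,) for vals in VALUES.values()]
    return [g for g in product(*choices) if 0 < g.count(WILDCARD) < len(VALUES)]


def matches(generalization, instance):
    """True if the instance is contained in the generalization."""
    return all(g == WILDCARD or g == v for g, v in zip(generalization, instance))


def build_gdt():
    """Map each generalization to the uniform distribution over its instances."""
    gdt = {}
    for gen in possible_generalizations():
        members = [inst for inst in possible_instances() if matches(gen, inst)]
        gdt[gen] = {inst: Fraction(1, len(members)) for inst in members}
    return gdt


gdt = build_gdt()
print(len(possible_instances()), len(gdt))
print(gdt[("*", "b0", "c0")])
```

In this sketch every row sums to 1, which is the property expressed by the row-sum statement above.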
[0048]
3. Adjusting prior distributions with background knowledge
One of the important features of the present invention is the ability to select and use biases to control the rule discovery process. Here, how to control the creation of the GDT and the rule discovery process by using the background knowledge as a bias will be described. If background knowledge is not used as a bias, by default, for one generalization, the probability of occurrence of all relevant possible instances is equal, and the prior distribution is uniform as shown in Equation 1. However, the prior distribution can be adjusted by using the background knowledge. In this case, the prior distribution is no longer a uniform distribution.
Generally, background knowledge is defined in the following format:
a_i1j1 ⇒ a_i2j2, Q
Here, a_i1j1 is the j1-th value of attribute i1, and a_i2j2 is the j2-th value of attribute i2. a_i1j1, a_i2j2, and Q are referred to as the "premise", "conclusion", and "strength" of the background knowledge, respectively. The background knowledge means that, under the condition a_i1j1, the probability that a_i2j2 occurs is Q. Q = 0 means that a_i1j1 and a_i2j2 never occur at the same time, and Q = 1 means that a_i1j1 and a_i2j2 always occur at the same time. Q = 1 / n_i2 is the same as the probability of occurrence of a_i2j2 when the background knowledge is not used, where n_i2 is the number of values of attribute i2. With the background knowledge "a_i1j1 ⇒ a_i2j2, Q", when a_i1j1 occurs, the probability of occurrence of a_i2j2 changes from the uniform value 1 / n_i2 to Q, and the probability of occurrence of each other value of attribute i2 becomes (1 - Q) / (n_i2 - 1).
[0049]
For example, by introducing the background knowledge
"a0 ⇒ c1, 100%"
that is, "when a0 occurs, c1 always occurs", Table 9 is changed as shown in Table 10 below. The changed probabilities are marked with an overline.
[0050]
[Table 10]

In the generalization a0b0*, the probability of a0b0c0 was 1/2 under the uniform distribution without background knowledge, but it becomes 0 because, according to the background knowledge, a0b0c0 cannot occur. On the other hand, a0b0c1 always occurs according to the background knowledge, so its probability becomes 1.
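The adjustment of one GDT row by a single piece of background knowledge can be sketched as follows. The sketch assumes that the probability of an instance within a generalization is the product of the probabilities of its generalized attribute values, which is one natural reading of the description and is not stated explicitly here. The function name and the (attribute, value) pair representation are also assumptions.

```python
from fractions import Fraction
from itertools import product

VALUES = {"a": ("a0", "a1"), "b": ("b0", "b1", "b2"), "c": ("c0", "c1")}
ATTRS = tuple(VALUES)
WILDCARD = "*"


def row_with_background(generalization, premise, conclusion, strength):
    """Distribution over the instances of one generalization.

    premise and conclusion are (attribute, value) pairs; strength is Q.
    """
    p_attr, p_val = premise
    c_attr, c_val = conclusion
    options = [VALUES[a] if g == WILDCARD else (g,)
               for a, g in zip(ATTRS, generalization)]
    row = {}
    for instance in product(*options):
        value = dict(zip(ATTRS, instance))
        prob = Fraction(1)
        for a, g in zip(ATTRS, generalization):
            if g != WILDCARD:
                continue  # fixed values contribute no uncertainty
            n = len(VALUES[a])
            if a == c_attr and value[p_attr] == p_val:
                # premise holds: Q for the conclusion, (1 - Q) / (n - 1) otherwise
                prob *= strength if value[a] == c_val else (1 - strength) / (n - 1)
            else:
                prob *= Fraction(1, n)  # uniform prior
        row[instance] = prob
    return row


changed = row_with_background(("a0", "b0", "*"), ("a", "a0"), ("c", "c1"), Fraction(1))
unchanged = row_with_background(("a1", "b0", "*"), ("a", "a0"), ("c", "c1"), Fraction(1))
print(changed, unchanged)
```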
[0051]
4. Rule strength and unknown data
The rules discovered by the present invention are expressed as follows.
X → Y with S,
that is, "if X then Y, with strength S". Here, X is the condition that the concept must satisfy (a combination of generalized condition attribute values), Y is the concept determined by X (a value of the decision attribute), and S is the strength of the rule. In other words, X is a combination of condition attribute values such as {a0b0*}, {a0*c0}, ..., and Y is a decision attribute value or a combination of such values. X, Y, and S are referred to as the condition, conclusion, and strength of the rule, respectively.
The strength S of rule X → Y is defined as follows:
S(X → Y) = s(X) × (1 - r(X → Y))   (Equation 2)
Equation 2 is the product of the factor s(X) and the factor (1 - r(X → Y)).
The former, s(X), is an index of the stability of the rule. For example, suppose that the rule {a0*c0} → d1, which assigns the decision class d1 under the condition {a0*c0}, is found. A rule obtained from data covering only 20% of all the combinations that can satisfy the generalization {a0*c0} naturally differs in stability from a rule obtained from data covering 90% of them. In the former case, the rule itself may be overturned by the remaining 80% of the data. Therefore, among the possible examples corresponding to a generalization, the proportion that has been collected, as opposed to uncollected, serves as an index of the stability of the rule.
[0052]
The latter, (1 - r(X → Y)), is an index of the reliability of the rule. For example, suppose again that the rule {a0*c0} → d1, which assigns the decision class d1 under the condition {a0*c0}, is found. A rule obtained when 100% of the examples satisfying the generalization {a0*c0} belong to d1 naturally differs in reliability from a rule obtained when 80% belong to d1 and 20% belong to d2. Therefore, as described later, the concept of a noise rate r(X → Y) is introduced. When 80% of the examples belong to decision class d1 and 20% to d2, the noise rate is 20%.
The qualitative meaning of the factors s(X) and (1 - r(X → Y)) is as described above. These concepts are now described in more detail using mathematical expressions.
s(X) represents the strength of the generalization X (the condition part of the rule). When X = PG_i, the strength s(PG_i) of the generalization PG_i is defined as the sum of the probabilities of the instances contained in PG_i that are present in the database:
(Equation 5)

Here, G_ij is the probability, within the generalization PG_i, of an instance PI_j obtained from the database, and N_ins-rel(PG_i) is the number of instances collected in the database that are contained in the generalization PG_i. The value of s(PG_i) changes as actual data is input. If all possible instances of PG_i are present in the database, s(PG_i) takes its maximum value of 1.
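The equation image is not reproduced in this text. A form consistent with the definitions above is the following reconstruction, in which DB denotes the set of instances present in the database (a symbol introduced here for illustration). The second equality holds only under the uniform prior, where every instance of PG_i has probability 1 / N_PGi.

```latex
s(PG_i) \;=\; \sum_{j \,:\, PI_j \in PG_i,\; PI_j \in DB} G_{ij}
\;=\; \frac{N_{ins\text{-}rel}(PG_i)}{N_{PG_i}}
```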
[0053]
The strength of generalization reflects the influence of unknown data. Unknown data is data that has not been collected in the database. In general, databases do not collect all possible instances. Some examples are unlikely to occur, and the range and time of data collection are limited. If rule discovery is performed ignoring unknown data, rules with low reliability may be discovered. In the GDT-RS method, a method for evaluating the influence of unknown data is provided based on the generalized distribution table GDT, and the strength of the rule is explicitly shown. When the amount of unknown data is small, the value of the strength of the rule is large, and otherwise, the value is small.
[0054]
r (X → Y) represents the noise rate of rule X → Y. This indicates incompleteness and uncertainty of the rule due to the influence of unknown data and noise data. The noise ratio r is defined as:
(Equation 6)

Here, N_ins-rel(X) is the number of examples belonging to the generalization X, and N_ins-class(X, Y) is the number of those examples that belong to the decision class Y. 1 - r(X → Y) is the classification accuracy, that is, the probability that an example belonging to the generalization X belongs to class Y.
In software that implements the GDT-RS method, a noise rate allowed by the user can be specified as a threshold. Rules whose noise rate is higher than the threshold are automatically deleted.
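The strength of a rule and the noise threshold can be sketched as follows. The sketch assumes a uniform prior, so that s(X) is the fraction of the possible instances of X that were observed, and takes the noise rate as r(X → Y) = (N_ins-rel(X) - N_ins-class(X, Y)) / N_ins-rel(X), which agrees with the 80% / 20% example above. The observed instances, the threshold value, and the function names are hypothetical.

```python
from fractions import Fraction

VALUES = {"a": ("a0", "a1"), "b": ("b0", "b1", "b2"), "c": ("c0", "c1")}
WILDCARD = "*"

# Hypothetical observed instances: condition values -> decision class
OBSERVED = {
    ("a0", "b0", "c0"): "d1",
    ("a0", "b1", "c0"): "d1",
    ("a1", "b0", "c0"): "d2",
}


def matches(generalization, instance):
    return all(g == WILDCARD or g == v for g, v in zip(generalization, instance))


def n_possible(generalization):
    """Number of possible instances contained in the generalization."""
    n = 1
    for vals, g in zip(VALUES.values(), generalization):
        if g == WILDCARD:
            n *= len(vals)
    return n


def rule_strength(generalization, decision, observed):
    """Return (s, r, S) for the rule generalization -> decision."""
    related = [d for inst, d in observed.items() if matches(generalization, inst)]
    if not related:
        return Fraction(0), Fraction(0), Fraction(0)
    s = Fraction(len(related), n_possible(generalization))  # stability
    r = Fraction(len(related) - related.count(decision), len(related))  # noise rate
    return s, r, s * (1 - r)


def accept(generalization, decision, observed, max_noise):
    """Keep a rule only if its noise rate does not exceed the user threshold."""
    return rule_strength(generalization, decision, observed)[1] <= max_noise


print(rule_strength(("a0", "*", "c0"), "d1", OBSERVED))
print(rule_strength(("*", "b0", "c0"), "d1", OBSERVED))
print(accept(("*", "b0", "c0"), "d1", OBSERVED, Fraction(1, 5)))
```

In this hypothetical data, {a0*c0} → d1 is noise-free but covers only two of its three possible instances, while {*b0c0} → d1 covers both of its possible instances but half of them belong to another class, so it is rejected at a 20% threshold.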
[0055]
5. Order information table
Many problems encountered with real-world data are more than just classification problems. Ranking of examples is one such type of problem. For example, the problem of obtaining the relationship between the sales rank of a product and its warranty period and price is such an ordering problem. For such a problem, the purpose is to discover ranking rules of the form "if the value of an attribute a of example x is ranked before the value of the same attribute of example y, then example x is ranked before example y".
The ranking of the examples by attribute values is not necessarily the same as the overall ranking of the examples. It is important to know which attributes play a more important role in determining the overall ranking, and which attributes do not contribute to the overall ranking.
The decision table does not take into account semantic relationships between different attribute values of each attribute. However, in order to find a ranking rule, "rank information" representing the meaning of data is required. Therefore, an ordered decision table is created by incorporating semantic information. The ordered decision table can be viewed as a semantic relationship added to the decision table.
[0056]
Handling ranked attribute values does not itself constitute a step of the data mining process of the GDT-RS method. Rather, it is needed when the information is preprocessed using the ranked attribute values before the GDT-RS method is applied.
How this preprocessing is applied to the GDT-RS will be described later. Here, only the concept of the ordering decision table will be described.
Definition 1. Let > be a binary relation on the set U. If the relation > satisfies the following two properties, it is called a “weak order”.
(Equation 7)

Here, ¬ indicates negation of a proposition. A "weak order" defines the following equivalence relation:
(Equation 8)

For two elements x and y, if x ~ y, then x and y cannot be distinguished by >. The equivalence relation ~ induces a classification U / ~ of U, and an order relation >* on U / ~ is defined as:
(Equation 9)

Here, [x]~ is the equivalence class containing x. In addition, >* is a linear order: any two different equivalence classes of U / ~ can be compared.
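The equation images of Definition 1 are not reproduced in this text. The standard definition of a weak order, which matches the description above, is given below as a reconstruction. The two properties are asymmetry and negative transitivity, followed by the induced equivalence relation and the order on the equivalence classes.

```latex
\begin{aligned}
&\text{Asymmetry:} && x > y \;\Longrightarrow\; \neg\,(y > x) \\
&\text{Negative transitivity:} && \neg\,(x > y) \,\wedge\, \neg\,(y > z) \;\Longrightarrow\; \neg\,(x > z) \\
&\text{Equivalence:} && x \sim y \;\Longleftrightarrow\; \neg\,(x > y) \,\wedge\, \neg\,(y > x) \\
&\text{Order on } U/{\sim}\text{:} && [x]_{\sim} >^{*} [y]_{\sim} \;\Longleftrightarrow\; x > y
\end{aligned}
```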
[0057]
Definition 2. An ordered decision table is a decision table to which order relations have been added.
The order of the values of the attribute a naturally derives the order of the instances, ie, for the elements x, y of U, we define:
(Equation 10)

Here, >_{a} denotes the order relation on U induced by the attribute a. x is ranked before y if and only if the value of attribute a of example x is ranked before the value of the same attribute of example y. The relation >_{a} has exactly the same properties as >_a. For a subset A of the attribute set At, the following definition is made:
(Equation 11)

That is, x is ranked before y if and only if x is ranked before y by every attribute in A.
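The equation images of Definition 2 are not reproduced in this text. Forms consistent with the verbal statements above are given below as a reconstruction, writing I_a(x) for the value of attribute a of example x.

```latex
\begin{aligned}
x >_{\{a\}} y &\;\Longleftrightarrow\; I_a(x) >_a I_a(y) \\
x >_{A} y &\;\Longleftrightarrow\; \forall a \in A \;\bigl[\, I_a(x) >_a I_a(y) \,\bigr]
\end{aligned}
```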
Many real world applications require special attributes, so-called decision attributes. The ranking of the examples by decision attributes is indicated by '>' and is referred to as the overall ranking of the examples.
[0058]
Example: Table 11 shows an ordered information table for products produced by five manufacturers. a, b, c, d, and o represent the product size, warranty period, price, weight, and overall ranking, respectively. >_a, >_b, >_c, >_d, and >_o indicate the rank information of each attribute.
[0059]
[Table 11]

[0060]
Based on the attribute value ranking information, the following rankings of the products are obtained:
>*_{a}: [p3, p4, p5] >*_{a} [p1] >*_{a} [p2],
>*_{b}: [p1, p2, p3, p4] >*_{b} [p5],
>*_{c}: [p1, p5] >*_{c} [p4] >*_{c} [p2, p3],
>*_{d}: [p4, p5] >*_{d} [p3] >*_{d} [p1] >*_{d} [p2],
>*_{o}: [p1] >*_{o} [p4] >*_{o} [p2, p3, p5].
For {a, b} and {c, d}, the following rankings are obtained:
>_{{a, b}}: empty set,
>_{{c, d}}: p1 >_{{c, d}} p2, p4 >_{{c, d}} p2, p5 >_{{c, d}} p2, p4 >_{{c, d}} p3, p5 >_{{c, d}} p3.
By combining attributes a and b, all examples are put into the same class. In this example, >_{{c, d}} is not a weak order. That is, combining two weak orders does not necessarily produce a weak order. This indicates that an order x >_A y is not guaranteed to yield valid rules.
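The rankings for {a, b} and {c, d} can be reproduced with the following sketch. Table 11 itself is not reproduced in this text, so the rank positions below (1 = first) are read off the per-attribute rankings listed above, and the function names are assumptions of this sketch.

```python
from itertools import permutations, product

PRODUCTS = ("p1", "p2", "p3", "p4", "p5")
# Rank positions derived from the rankings >*_{a} ... >*_{o} listed above
RANK = {
    "a": {"p3": 1, "p4": 1, "p5": 1, "p1": 2, "p2": 3},
    "b": {"p1": 1, "p2": 1, "p3": 1, "p4": 1, "p5": 2},
    "c": {"p1": 1, "p5": 1, "p4": 2, "p2": 3, "p3": 3},
    "d": {"p4": 1, "p5": 1, "p3": 2, "p1": 3, "p2": 4},
    "o": {"p1": 1, "p4": 2, "p2": 3, "p3": 3, "p5": 3},
}


def order_on(attrs):
    """Pairs (x, y) such that x is ranked before y by every attribute in attrs."""
    return {
        (x, y)
        for x, y in permutations(PRODUCTS, 2)
        if all(RANK[a][x] < RANK[a][y] for a in attrs)
    }


def is_weak_order(relation):
    """Check asymmetry and negative transitivity on PRODUCTS."""
    asymmetric = all((y, x) not in relation for x, y in relation)
    # Negative transitivity, in the equivalent form: x > z implies x > y or y > z
    negatively_transitive = all(
        (x, z) not in relation or (x, y) in relation or (y, z) in relation
        for x, y, z in product(PRODUCTS, repeat=3)
    )
    return asymmetric and negatively_transitive


order_ab = order_on(("a", "b"))
order_cd = order_on(("c", "d"))
print(order_ab, sorted(order_cd))
print(is_weak_order(order_on(("c",))), is_weak_order(order_cd))
```

The violation for {c, d} can be seen with p1, p3, and p2: p1 is not ranked before p3 and p3 is not ranked before p2, yet p1 is ranked before p2.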
[0061]
The present invention combines the above knowledge with rough set theory. Specifically, it is a data processing method for discovering, from a plurality of examples, a rule expressed as a combination of m condition attribute values that has a causal relationship with one decision attribute value. The method comprises a step of extracting, for two examples having different decision attribute values, the condition attributes whose values differ, and a step of deriving the rule by performing a logical operation on the extracted condition attributes. Here, an "example" is a unit of information having a combination of n condition attribute values, one for each of n condition attributes, and one or more decision attribute values associated with that combination. The relationship n > m holds.
[0062]
In addition to the above, the data processing method according to the present invention preferably includes a step of evaluating the strength of the rule using a first index that compares the number of possible examples that can satisfy the combination of m condition attribute values with the number of actual examples used for data processing.
In addition to the above, the data processing method according to the present invention preferably further includes a step of evaluating the strength of the rule using a second index that compares the number of actual examples having the combination of m (n > m) condition attribute values associated with one decision attribute value with the number of actual examples having the same combination of m condition attribute values associated with a decision attribute value other than that value.
[0063]
The present invention can compare the condition attribute values in relation to an order set by the user and discover how the condition attributes are correlated with a particular decision attribute. In this case, the present invention desirably includes a step of setting a predetermined order for each condition attribute value and a step of comparing the condition attribute values and indexing them according to their consistency with that order, and desirably uses the data processed in this way as the examples.
According to the present invention, as the GDT method, it is desirable to specify p (n > p) condition attribute values, set a generalized combination of condition attribute values in which the remaining n - p condition attribute values are generalized, derive the number q of combinations of n condition attribute values that can correspond to the generalized combination, and treat the probability that a specific combination of n condition attribute values corresponding to the generalized combination appears as 1 / q. According to the present invention, this probability can be corrected in consideration of background information existing in the real world.
Note that a program for realizing the GDT-RS method according to the present invention or a recording medium (GDT-RS system) storing the program is also within the scope of the present invention.
[0064]
[Action]
In the present invention, the GDT exploits the probabilistic relationship between general concepts and specific instances. This yields a probabilistic basis for evaluating the incompleteness/ambiguity of data and the strength of discovered rules, as well as criteria for rule selection. The GDT-RS method according to the present invention addresses the following real-world problems.
[0065]
1. For large data
In the rough set method, a reduct is obtained by creating an identification matrix for distinguishing all the examples. In the GDT-RS method, for each example, a rule for that example is obtained by creating an identification vector (one row of the identification matrix) that can distinguish the example. When all the examples have been processed, a minimum rule set is obtained. As a result, rules (rather than reducts) can be found directly without creating a huge identification matrix.
The GDT-RS method can extract a minimum set of stronger rules covering all the examples by combining the GDT with rough set theory. To handle larger databases, two algorithms implementing the GDT-RS method are proposed. One finds an optimal solution and is suitable for a small database with a small number of attributes. The other is a greedy method that finds an approximate solution, discovering useful rules from a large-scale database in a short time.
[0066]
2. For continuous data
The GDT-RS method is suitable for processing discrete data. Continuous data is used after being discretized. Therefore, a function of discretizing continuous values is provided as a preprocessing function.
[0067]
3. For noise and unobserved data:
Since the effect of noise is explicitly expressed in the strength of the rule, the user can freely adjust the noise rate within the allowable range and seek a satisfactory rule.
[0068]
4. For incomplete data:
Since the influence of unknown (unobserved) data is explicitly expressed in the strength of the rule, a rule is extracted in consideration of the existence of the unknown data.
[0069]
5. For data changes:
In the GDT-RS method, rules are found example by example, so the addition of new examples can be handled easily.
[0070]
6. Use of background knowledge
By using known facts as background knowledge, more effective distribution tables can be created and more meaningful rules can be found.
[0071]
[Table 12]

Using the example of Table 12, how the GDT-RS method according to the present invention operates will be described.
Step1. Put together examples of the same condition attribute
[0072]
[Table 13]

"-" denotes a contradiction.
[0073]
Step2. For each example, create an identification vector with respect to the examples whose decision attribute value differs
Example: create the identification vector of u2. No entry is needed for examples whose decision attribute value is equal to that of u2.
m₂,ⱼ is the set of attributes whose values differ between examples u2 and uj.
m₂,₁ = {b}
m₂,₂ = ∅ (empty set)
m₂,₄ = {a, c}
m₂,₆ = {b}
m₂,₇: u2 and u7 belong to the same decision class and need not be distinguished.
The identification vector is as shown in Table 14.
[0074]
[Table 14]
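Steps 1 and 2 can be sketched as follows. The table `EXAMPLES` is an assumed toy data set chosen so that it reproduces the m₂,ⱼ values quoted in the text; the contents of Tables 12 to 14 are not reproduced here, and the function names are hypothetical.

```python
from collections import defaultdict

ATTRS = ("a", "b", "c")

# Assumed toy data (NOT taken from Table 12): condition values and decision.
EXAMPLES = {
    "u1": (("a0", "b0", "c1"), "y"),
    "u2": (("a0", "b1", "c1"), "y"),
    "u3": (("a0", "b0", "c1"), "y"),
    "u4": (("a1", "b1", "c0"), "n"),
    "u5": (("a0", "b0", "c1"), "n"),
    "u6": (("a0", "b2", "c1"), "n"),
    "u7": (("a1", "b1", "c1"), "y"),
}

def merge_examples(examples):
    """Step 1: merge examples with identical condition values.
    A merged example whose members disagree on the decision is marked
    contradictory (decision None)."""
    decisions = defaultdict(set)
    first_name = {}
    for name, (cond, dec) in examples.items():
        decisions[cond].add(dec)
        first_name.setdefault(cond, name)
    return {
        first_name[cond]: (cond, next(iter(decs)) if len(decs) == 1 else None)
        for cond, decs in decisions.items()
    }

def identification_vector(merged, target):
    """Step 2: for every merged example in a different decision class,
    collect the attributes whose values differ from those of `target`."""
    cond_t, dec_t = merged[target]
    vector = {}
    for name, (cond, dec) in merged.items():
        if name == target or dec == dec_t:
            continue  # same example or same decision class: no entry needed
        vector[name] = {a for a, x, y in zip(ATTRS, cond_t, cond) if x != y}
    return vector

merged = merge_examples(EXAMPLES)
print(identification_vector(merged, "u2"))
```

Under this assumed table, u1, u3 and u5 collapse into one contradictory example, and the vector for u2 contains entries only for the merged u1, u4 and u6.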

[0075]
Step3. Create identification function fT from identification vector
(Equation 12)

Step4. Generate and select rules from the identification function
(Equation 13)

Substituting the corresponding attribute value of u2 into (Equation 13) yields two rules:
[Equation 14]

The strength of these two rules is calculated by the GDT-RS method as follows:
[Equation 15]

Possible examples included in the generalization {a0, b1} (= {a0b1 *}) are a0b1c0 and a0b1c1. Among them, only a0b1c1 exists in the data set. a0b1c0 is a non-observed example. Therefore, the strength of the generalization {a0, b1} is 0.5. Since the actual examples included in the generalization {a0, b1} all belong to the decision class d (y), the noise rate is zero.
(Equation 16)

Possible examples included in the generalization {b1, c1} (= {*, b1, c1}) are a0b1c1 and a1b1c1. Since both exist in the database, the strength of the generalization {*, b1, c1} is 2 × 1/2 = 1. Since the actual examples included in the generalization {*, b1, c1} all belong to the decision class d (y), the noise rate is zero.
[Equation 17]

Comparing the strengths of the two rules, the strength of r2: {b1c1} → d (y) is larger, so r2 is selected.
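The worked example for u2 can be retraced with a short sketch. It assumes that the identification function is the conjunction of the disjunctions of the non-empty m₂,ⱼ entries, and that rule strength is the generalization strength multiplied by (1 − noise rate); both assumptions reproduce the values quoted in the text, but the equations themselves are not shown here. `DOMAINS` and `OBSERVED` are assumed toy data, and the function names are hypothetical.

```python
from itertools import product
from math import prod

# Assumed attribute domains and merged observed examples (None = contradictory).
DOMAINS = {"a": ("a0", "a1"), "b": ("b0", "b1", "b2"), "c": ("c0", "c1")}
ATTRS = tuple(DOMAINS)
OBSERVED = {
    ("a0", "b0", "c1"): None,
    ("a0", "b1", "c1"): "y",
    ("a1", "b1", "c0"): "n",
    ("a0", "b2", "c1"): "n",
    ("a1", "b1", "c1"): "y",
}

def minimal_conjunctions(clauses):
    """Expand a conjunction of attribute disjunctions into its minimal
    attribute sets (the terms of the equivalent disjunctive form)."""
    clauses = [c for c in clauses if c]
    candidates = {frozenset(choice) for choice in product(*clauses)}
    minimal = [s for s in candidates if not any(o < s for o in candidates)]
    return sorted(minimal, key=sorted)

def rule_strength(fixed, target):
    """Strength of the rule `fixed -> target` under a uniform prior."""
    q = prod(len(DOMAINS[a]) for a in ATTRS if a not in fixed)
    covered = [dec for cond, dec in OBSERVED.items()
               if all(cond[ATTRS.index(a)] == v for a, v in fixed.items())]
    if not covered:
        return 0.0
    generalization_strength = len(covered) / q       # observed / possible
    noise_rate = 1 - sum(d == target for d in covered) / len(covered)
    return generalization_strength * (1 - noise_rate)

# Identification vector of u2 as given in the text: {b}, {a, c}, {b}
terms = minimal_conjunctions([{"b"}, {"a", "c"}, {"b"}])
print(terms)                                          # {a, b} and {b, c}
print(rule_strength({"a": "a0", "b": "b1"}, "y"))     # rule r1
print(rule_strength({"b": "b1", "c": "c1"}, "y"))     # rule r2
```

The two minimal attribute sets correspond to the two candidate rules for u2, and the computed strengths agree with the values 0.5 and 1 stated in the text.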
Similarly, for the examples of u4, u6 and u7, the following rules are generated by executing Step2 to Step4:
(Equation 18)

[0076]
Step5. Select the smallest set of rules that covers all instances
The rule selection criteria of the GDT-RS method are defined as follows:
1. The rule with the largest number of instances included is selected first
2. The one with the smaller number of included attributes is selected preferentially
3. For those containing the same example, the rule with the highest strength is preferentially selected
Table 15 shows rules relating to class d (y) and examples included in the rules:
[0077]
[Table 15]

The rule {b1c1} → d (y) includes two examples, more than any other rule, and is therefore selected first. Since the actual examples included in {a1c1} → d (y) and {a0b1} → d (y) are also included in {b1c1} → d (y), the minimum rule set of the class d (y) consists of {b1c1} → d (y) only.
Table 16 shows the rules involved in class d (n) and the examples included in the rules:
[0078]
[Table 16]

[0079]
Since the two rules contain different instances, both rules are adopted.
As a result, we found the following rule set:
{b1c1} → d (y) S = 1
{b2} → d (n) S = 1/4
{c0} → d (n) S = 1/6
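The three selection criteria of Step5 can be sketched as a greedy cover. The candidate lists below follow the rules and covered examples named in the text; the strength given for {a1c1} is an assumption (it is not stated in the text), and the function name `select_rules` is hypothetical.

```python
def select_rules(candidates):
    """Greedily pick rules until every covered example is accounted for.

    candidates: list of (condition, covered_examples, strength)
    Priority: most still-uncovered examples, then fewer attributes,
    then higher strength.
    """
    remaining = set().union(*(covered for _, covered, _ in candidates))
    chosen = []
    while remaining:
        best = max(
            candidates,
            key=lambda r: (len(r[1] & remaining), -len(r[0]), r[2]),
        )
        chosen.append(best)
        remaining -= best[1]
    return chosen

# Candidate rules for class d(y); the 1/3 strength is an assumed value.
Y_RULES = [
    (("b1", "c1"), {"u2", "u7"}, 1.0),
    (("a1", "c1"), {"u7"}, 1 / 3),
    (("a0", "b1"), {"u2"}, 0.5),
]
# Candidate rules for class d(n), with the strengths given in the text.
N_RULES = [
    (("b2",), {"u6"}, 1 / 4),
    (("c0",), {"u4"}, 1 / 6),
]

print([r[0] for r in select_rules(Y_RULES)])
print([r[0] for r in select_rules(N_RULES)])
```

For class d (y) a single rule covers both examples, while for class d (n) both rules are kept because they cover different examples.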
Next, the processing of ordering problems by the GDT-RS method is described, using the merchandise sales ordering problem of Table 17.
[0080]
[Table 17]

[0081]
In Table 17, entries such as ">_a: small >_a middle >_a large" are set by the user. This example concerns a portable electric product. Since the product is portable, a small size is considered most preferable and a large size least preferable, so the order is set to >_a: small >_a middle >_a large. Similarly, a longer warranty period is preferred, so >_b: 3 years >_b 2 years. Attribute c is price and attribute d is weight.
The attribute o indicates the ordering (o is the first letter of "order"). In this example, the actual example p1 has o = 1, indicating that its sales are the best. p4 sells next best, and p2, p3, and p5 do not sell well.
[0082]
Based on this information, to analyze which of the attributes a to d correlates with sales, the data are preprocessed on the basis of this ordered information table and the GDT-RS method is then applied.
Step1. Create binary information table
A binary information table (Table 18) is created from the ordered decision table based on the following equation.
[Equation 19]

[0083]
[Table 18]

[0084]
For example, the entries 1, 0, 1, 1, 1 in the column for (1,2) are the results of comparing p1 and p2 by I_a, I_b, I_c, I_d, I_o. For attribute a, example p1 is middle and example p2 is large, so the comparison of p1 with p2 matches the order set by the user, and "1" is assigned as the index according to the above equation. For attribute b, on the other hand, p1 and p2 both have the value 3 years, which does not match the order set by the user (equal values are regarded as "not matching"), so "0" is assigned as the index.
In this way, Table 18 shows that the attribute values of p1 and p2 are compared in order for each attribute in relation to the order set by the user, and "0" or "1" is entered.
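The indexing rule described above can be sketched as follows. The dictionary `ORDER` lists the values of attributes a and b from most to least preferred, as stated in the text; the function name `order_index` is an assumption made for illustration.

```python
# User-defined preference order per attribute, most preferred value first.
ORDER = {
    "a": ["small", "middle", "large"],   # size
    "b": ["3 years", "2 years"],         # warranty period
}

def order_index(attr, x_value, y_value):
    """Return 1 if x is strictly preferred to y on `attr`, otherwise 0.
    Equal values count as "not matching" and therefore give 0."""
    ranks = ORDER[attr]
    return 1 if ranks.index(x_value) < ranks.index(y_value) else 0

# p1 vs p2 on attribute a (middle vs large) and attribute b (3 years vs 3 years)
print(order_index("a", "middle", "large"))
print(order_index("b", "3 years", "3 years"))
```

Applying the same function to every attribute and every ordered pair of examples would fill one column of the binary information table per pair.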
[0085]
Step2. Rule discovery is performed using the GDT-RS method.
When no ordered information table is used, the GDT-RS method is applied to the raw data of each example. When an ordered information table is used, the raw data of each example are first indexed as shown in Table 17, based on the ordered information table and the order set by the user as described above, and the GDT-RS method is then applied.
As a result, a condition attribute correlated with the decision attribute can be found. In this case, it is possible to obtain an analysis result as to which of the attributes “size”, “warranty period”, “price”, and “weight” most affects sales.
[0086]
BEST MODE FOR CARRYING OUT THE INVENTION
FIG. 5 shows a flow of the pre-processing for implementing the method according to the present invention. As shown in FIG. 5, before finding a rule by applying the present invention, the following preprocessing is performed on data.
1) It is determined whether or not to use the order relation information (see Table 11) (step 51). When using the order relation information, a binary information table is created from the order determination table (step 53, see Table 12).
2) It is determined whether or not to select an attribute (step 55).
3) It is determined whether or not the continuous value attribute is discretized (step 57).
The data subjected to these preprocessings is analyzed by the method according to the present invention (step 59). FIG. 6 shows a processing routine of the GDT-RS method according to the present invention.
According to the method of the present invention, examples having the same combination of condition attribute values (for example, a0b0c1) are first collected into one set (step 11). If examples with the same combination of condition attribute values have different decision attribute values, that combination is not useful for finding a rule. Examples whose condition attribute values are the same but whose decision attribute values differ are therefore deleted (step 15). This step is optional, however, and is skipped if the user tolerates such noise (step 13).
[0087]
Next, an identification vector is created for each example having different determination attributes by the method described above (step 17). When the identification vector is obtained, an identification function is created from the identification vector (step 19), and rules are further derived from the identification function (step 21).
Here, the user is given an option whether to use the background knowledge (step 23). As a result, a distribution adjusted by background knowledge is adopted (step 25), or a uniform distribution without adjustment is adopted (step 27).
If the derived rule exceeds the noise rate allowed by the user (step 29), the rule is deleted (step 31). After confirming that all instances have been processed (step 33), a final set of rules covering all instances is selected (step 35). In this way, rules are determined (step 37).
[0088]
【Example】
In the following description, the effects of the GDT-RS method will be described using three examples of real world data.
The first is the meningitis case, in which the results obtained with and without background knowledge are compared to show that good results can be obtained by using background knowledge. The second uses a large-scale database of bacterial culture tests to explain the process from data preprocessing (discretization of continuous data, attribute selection using background knowledge or automatic attribute selection) to rule generation, and demonstrates the effectiveness of the method for large-scale, complex problems. The third is an example of stomach cancer data, which describes how ranking rules are generated from ranking information based on rough set theory and shows its usefulness.
Embodiment 1. Knowledge discovery from the meningoencephalitis database
First, for the meningitis case, the results obtained with and without background knowledge are compared to show the effect of using background knowledge.
The meningitis database, provided as common data, consists of data on 140 patients. It has 38 attributes and is relatively small-scale data [Tsumoto-98].
[Tsumoto-98] "Comparison and Evaluation of KDD Methods with Common Datasets" (Special Panel Discussion Session on Knowledge Discovery Pharmaceuticals), the 12th Annual Conference of JSAI (1998) 72-73.
The purpose is to generate rules for each of the following factors:
・ Factors important for diagnosis (DIAG and DIAG2)
・ Factors important for the search for causative bacteria (CULT_FIND, CULTURE)
・ Factors important for the presence of sequelae (C_COURSE, COURSE (Grouped))
In this experiment, the six attributes DIAG, DIAG2, CULT_FIND, CULTURE, C_COURSE, and COURSE (Grouped) are used as the decision classes. The GDT-RS method is executed both with and without background knowledge, and rules are generated for the corresponding factors.
[0089]
Background knowledge
Use the doctor's experience as background knowledge. An example of the background knowledge is as follows.
-If the EEG finding (EEG_WAVE) is normal, a focal EEG abnormality (EEG_FOCUS) does not occur;
-The higher the white blood cell count (WBC), the higher the inflammatory protein (CRP).
[0090]
The following is a list of background knowledge provided by physicians. It is divided into three sets:
・Those that do not occur simultaneously
EEG_WAVE (normal) and EEG_FOCUS (+)
CSF_CELL (low) and Cell_Poly (high)
CSF_CELL (low) and Cell_Mono (high)
・Those that are unlikely to occur
CRP (high) is unlikely to occur when WBC (low)
ESR (high) is unlikely to occur when WBC (low)
CSF_CELL (high) is unlikely to occur when WBC (low)
Cell_Poly (high) is unlikely to occur when WBC (low)
Cell_Mono (high) is unlikely to occur when WBC (low)
STIFF (high) is unlikely to occur when BT (low)
LASEGUUE (high) is unlikely to occur when BT (low)
KERNIG (high) is less likely to occur when BT (low)
・Those that are likely to occur at the same time
WBC (high) and CRP (high)
WBC (high) and ESR (high)
WBC (high) and CSF_CELL (high)
WBC (high) and Cell_Poly (high)
WBC (high) and Cell_Mono (high)
BT (high) and STIFF (high)
BT (high) and LASEGUUE (high)
BT (high) and KERNIG (high)
BT (high) and CRP (high)
BT (high) and ESR (high)
…
EEG_FOCUS (+) and FOCAL (+)
EEG_WAVE (+) and EEG_FOCUS (+)
CRP (high) and CSF_GLU (low)
CRP (high) and CSF_PRO (low)
[0091]
Since "high" and "low" in parentheses are ambiguous ranges, they were processed as follows:
1. When discretizing data from medical knowledge:
"High" is a value higher than the upper limit of the normal value,
“Low” is treated as a value smaller than the lower limit of the normal value.
2. To discretize continuous values:
The discretized attribute values are classified into a finite number. Here, the one classified as the maximum value is defined as “high”, and the one classified as the minimum value is defined as “low”. Others are neither "high" nor "low".
The results shown here were obtained by discretizing the continuous-valued attributes with the automatic discretization method.
In the background knowledge used by the GDT-RS method, the probability of occurrence Q must be given not as "high" or "low" but as a number between 0 and 1 inclusive. Since the knowledge provided by the doctors is not in this form, an appropriate Q value must be set. In this experiment, Q is set to 1 for background knowledge of the form "the probability that a_i1j1 ⇒ a_i2j2 occurs is high", and Q is set to 0 for background knowledge of the form "the probability that a_i1j1 ⇒ a_i2j2 occurs is low".
[0092]
Compare results
There was a difference in some of the obtained rules when background knowledge was used and when it was not used. The use of background knowledge has demonstrated the following effects:
When background knowledge was used, rules were selected that could not be obtained without it because their strength was too low. For example, when background knowledge is used, the following rule₁ is four times as strong as when no background knowledge is used.
rule₁: ONSET (acquire) ∧ ESR (≦ 5) ∧ CSF_CELL (<10) ∧ CULTURE (-) → VIRUS (E)
The reason is as follows. When background knowledge is not used, the strength of rule₁ is 30 × (384 / E). Here, 30 is the number of examples that satisfy the condition part of the rule, and 384 / E is the strength of the condition part (generalization). Background knowledge affects rule₁ when its premise is included in the condition part of rule₁ and its conclusion is not included in the condition part of rule₁. The following two items of background knowledge meet this requirement.
CSF_CELL (low) and Cell_Poly (high) do not occur at the same time
CSF_CELL (low) and Cell_Mono (high) do not occur at the same time
[0093]
Cell_Poly and Cell_Mono are continuous-valued attributes, each of which is divided by discretization into exactly two values, low and high. Given the background knowledge that Cell_Poly and Cell_Mono cannot be high when CSF_CELL is low, once CSF_CELL (low) holds, the only possible value of Cell_Poly, and likewise of Cell_Mono, is low. The number of attribute values of each of Cell_Poly and Cell_Mono is therefore reduced by one, from two to one. Consequently, the product E of the numbers of attribute values of all attributes becomes E × 1/2 × 1/2 = E / 4, and the strength S of the rule becomes 30 × (384 / E) × 4.
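The arithmetic above can be checked with a small sketch. The value of E used below is purely hypothetical (the text does not give it), and the function name `adjusted_strength` is an assumption; only the ratio between the two strengths is meaningful.

```python
from fractions import Fraction

def adjusted_strength(n_examples, numerator, E, value_counts=()):
    """Strength n_examples * numerator / E', where E' is E rescaled for every
    attribute whose number of possible values is reduced by background knowledge.

    value_counts: iterable of (original_count, remaining_count) pairs
    """
    adjusted_E = Fraction(E)
    for original, remaining in value_counts:
        adjusted_E = adjusted_E * remaining / original
    return n_examples * Fraction(numerator) / adjusted_E

E = 384_000  # hypothetical product of attribute-value counts, for illustration
without_bk = adjusted_strength(30, 384, E)
# Cell_Poly and Cell_Mono each drop from 2 possible values to 1.
with_bk = adjusted_strength(30, 384, E, [(2, 1), (2, 1)])
print(with_bk / without_bk)
```

Whatever value E takes, halving the value count of two attributes divides E by four and so multiplies the strength by four, which is the factor stated in the text.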
[0094]
In addition, some rules were replaced by other rules as a result of using background knowledge. For example, the rule
rule₂: DIAG (VIRUS (E)) ∧ LOC [4,7] → EEG_abnormal, S = 30 / E
was discovered when background knowledge was not used, but when background knowledge was used, the following rule₂′ was obtained instead.
rule₂′: EEG_FOCUS (+) ∧ LOC [4,7] → EEG_abnormal, S = (10 / E) × 4
This is because, with background knowledge, the strength of rule₂′ became greater than that of rule₂.
The doctor's comment on this result was: "rule₂ and rule₂′ are both reasonable, but rule₂′ is more reasonable."
[0095]
Embodiment 2. Knowledge discovery from bacterial culture detection database
Here, a bacterial culture detection database provided as common data was used.
The number of examples is 20,000 and the number of attributes is 49. This database is a collection of real-world data, with no preprocessing. Therefore, it has several features:
1. There is missing data with missing attribute values
2. "Disease name" and "Detected bacteria" have many attribute values
3. Contains time series data. That is, test data of the same patient at different times is collected
Some attributes are composed of a plurality of child attributes. For example, the disease name attribute includes three disease names, and the target bacteria attribute includes two types of target bacteria. When the child attributes are also calculated, the number of attributes reaches 60. Therefore, attribute selection is required.
[0096]
For this database, there are two main purposes:
1. Find relationships between detected bacteria and other attributes
Here, the examples are divided into two decision classes, bacteria detected (+) and no bacteria detected (-), and if-then rules are searched for in each class.
2. Find the relationship between antibiotic susceptibility and other attributes, if present
There are four antibiotic susceptibility values (R, S, I, and empty), but here if-then rules are searched for only for resistant (R) and sensitive (S).
In this medical data, the number of examples included in the obtained rules is small, so that even if there is a rule with an accuracy rate of 100%, that is, a rule without noise, it cannot be said that there is sufficient reliability. Therefore, in the GDT-RS method, rule discovery is performed within the range of the accuracy rate specified by the user.
[0097]
Execution procedure:
1. Find rules with high accuracy
2. For examples included in a rule with a small number of examples (for example, 3 or less), rule discovery is performed again at a low accuracy rate.
The following shows the experimental results for each purpose.
[0098]
Purpose 1. Relationship between detected bacteria and attributes
First, missing data are replaced with "?", which is treated as one of the attribute values. Then, to reduce the number of attributes, attributes that are clearly unrelated to the detected bacteria, such as treatment means, materials, and wards, were deleted. The experiment was performed using the detected bacteria as the decision class and the following attributes as condition attributes.
Gender, date of birth (changed to teens, twenties, ...), fever, gut, leukocyte, disease name 1, urinary WBC, urine protein, urinary occult blood, urine quantification, β-lactamase
Tables 19 and 20 show rules regarding the presence (+) and absence (-) of the detected bacteria. Because a large number of rules were discovered, only some of the rules that cover many examples and have high accuracy are shown. In the tables, "gen" is the number of generalizations included in the condition part, "strg." is the prior distribution of the generalization, and (+) and (-) under "number of cases" are the numbers of examples actually present in the database and covered by each rule that belong to detected bacteria (+) or (-), respectively.
[0099]
[Table 19]

[0100]
[Table 20]

[0101]
Purpose 2. Relationship between antibiotics and attributes
The types of antibiotics are as follows:
PCG, PCs, Aug, PCs-green, CEPs-1, CEPs-2, CEPs-3, CEPs-4, CEPs-green, AGs, MLs, TCs, LCMs, CPs, CBPs, VCM, RFP / FOM
Rule discovery is performed using these attributes as decision attributes.
As before, attributes that were not important were removed, leaving the following attributes.
Gender, age (teens), fever, catheter 1, incision, intubation, drainage 1, leukocyte, medication 1, steroid, anticancer drug, radiation, disease name 1, urinary WBC, urine nitrate, urine protein, urinary occult blood, urine determination, total number of bacteria, detected bacteria, β-lactamase
Here, the execution result of the GDT-RS method for the antibiotic {Aug, CEPs-1, PCG, PCs} is shown. Since the GDT-RS method searches for a set of rules including all instances, the number of rules found is large. The results in Table 21 list only those that have been determined to be valuable by the evaluation of the physician.
[0102]
[Table 21]

[0103]
Comparison with C4.5
Rule discovery was performed on the same data using C4.5. Some of the results are the same as those of the GDT-RS method. For example:
β-lactamase (3+) → detected bacteria (+)
Leukocyte (unknown) ∧ disease name 1 (tumor) ∧ urine quantitative (10⁴) → detected bacteria (+)
Urine determination (<10³) → detected bacteria (-)
However, some rules that doctors consider meaningful, for example:
Disease name 1 (pneumonia) → detected bacteria (-)
were found by the GDT-RS method but not by C4.5.
Conversely, certain rules were discovered by C4.5 but not by the GDT-RS method. For example:
β-lactamase (1+) → detected bacteria (+)
Fever (39 °C) → detected bacteria (-)
The reason is that, in the GDT-RS method, rule discovery was performed at a lower threshold (noise rate: 40%). As a result, the GDT-RS method discovered more general rules, which are more meaningful than those obtained with C4.5, whereas C4.5 discovered more specific rules, including those involving β-lactamase (1+) or fever (39 °C) (see Table 22).
[0104]
[Table 22]

[0105]
Embodiment 3. Knowledge discovery from gastric cancer database
The final example is the gastric cancer database collected at the National Cancer Center. There are approximately 27 attributes describing the patients' symptoms, such as the type of cancer, the presence or absence of serosal invasion, the presence or absence of peritoneal metastasis, the presence or absence of liver metastasis, the maximum diameter of the cancer tissue, and preoperative complications. The purpose of the analysis was to find the key factors associated with patients who died within 90 days after surgery. In this example, the generation of ranking rules by the GDT-RS method using ranking information is described.
The gastric cancer database is typically imbalanced. That is, the number of surviving patients is 5963, while only 207 patients died within 90 days after the operation. In addition, there are many cases in which surviving patients have symptoms similar to those of patients who died.
Before a detailed description, the main attributes of the National Cancer Center's stomach cancer data and the order information on the attribute values are shown (see Table 23).
[0106]
[Table 23]

Note: "." and ".." denote missing values unless otherwise specified.
[0107]
To find the ranking rules from the ordering table, first, the ordering table is converted into binary information. Thus, the GDT-RS system can be immediately applied.
In the GDT-RS method according to the present invention, the quality of the discovered rules is evaluated from several viewpoints, including accuracy and coverage. Accuracy is the ratio of the number of examples satisfying all attribute values in the rule (condition part and conclusion part) to the number of examples satisfying all attribute values in the condition part of the rule. The user can specify, as a threshold, an accuracy level within the permitted range. For the gastric cancer data, the threshold is set to 0.6. Coverage, on the other hand, is indicated simply by the numbers of positive and negative examples covered by the rule.
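The accuracy measure and the threshold test can be sketched as follows. The records, attribute names, and the function name `accuracy` are hypothetical and serve only to illustrate the ratio described above; the 0.6 threshold is the one mentioned for the gastric cancer data.

```python
def accuracy(examples, condition, conclusion):
    """Fraction of the examples matching `condition` that also match
    `conclusion`; returns 0.0 if no example matches the condition."""
    covered = [e for e in examples
               if all(e.get(k) == v for k, v in condition.items())]
    if not covered:
        return 0.0
    correct = sum(all(e.get(k) == v for k, v in conclusion.items())
                  for e in covered)
    return correct / len(covered)

# Hypothetical records: one condition attribute x and one decision attribute d.
examples = [
    {"x": 1, "d": "+"},
    {"x": 1, "d": "+"},
    {"x": 1, "d": "-"},
    {"x": 0, "d": "-"},
]

THRESHOLD = 0.6
acc = accuracy(examples, {"x": 1}, {"d": "+"})
keep_rule = acc >= THRESHOLD
print(acc, keep_rule)
```

In this toy data the rule x = 1 → d = + covers three examples, two of which satisfy the conclusion, so it passes the 0.6 threshold; the coverage would be reported as two positive and one negative example.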
Consider a pair of examples (x, y) ∈ U × U in the binary information table. An information function is defined as follows:
(Equation 20)

For example, if x >_{a} y, then I_a(x, y) = 1. Reflexive pairs such as (x, x) are not handled.
From the results shown in Table 24, the following was found. Of the 22 rules, 17 include the attribute liver_meta (liver metastasis), 11 include the attribute peripheral-mate (metastasis of the peritoneum), and 5 of them include liver_meta. The results indicate that liver and peritoneal metastases are the leading causes of death within 90 days after surgery.
[0108]
[Table 24]

[0109]
National Cancer Center staff evaluated this result as "the ranking rules found are reasonable and interesting." They evaluated each rule for convincingness and novelty. Table 25 shows the rules found from the gastric cancer data and the evaluation of each rule. Convincingness and novelty are each scored from 1 (lowest) to 5 (highest). As shown in the table, many rules were judged convincing, and some of them were also evaluated as novel.
[0110]
[Table 25]

[0111]
【The invention's effect】
The GDT-RS method according to the present invention is excellent in flexibility for bias adjustment, availability of background knowledge, effectiveness of processing for noise data and missing data, predictability of unobserved information, and the like.
According to the present invention, the following effects are recognized.
1. By using background knowledge as a bias, more rational rules are discovered.
2. It predicts unobserved data and determines the strength of the rules, so more meaningful rules are discovered.
[Brief description of the drawings]
FIG. 1 shows an example of a version space according to the prior art.
FIG. 2 shows an example of a tree structure of C4.5 according to the prior art.
FIG. 3 shows an example of an analysis result by a rough set according to the related art.
FIG. 4 is a conceptual diagram of a solution according to the present invention.
FIG. 5 shows a flowchart of pre-processing of data processing according to the present invention.
FIG. 6 shows a flowchart of data processing according to the present invention.

Claims

1. A data processing method for finding, from a plurality of examples each having a combination of n condition attribute values corresponding respectively to n condition attributes and one or more decision attribute values associated with the combination, a rule as a combination of m (n > m) condition attribute values having a substantial causal relationship with one decision attribute value, the method comprising:
a step of extracting the condition attributes whose condition attribute values differ between two examples having different decision attribute values; and
a step of performing a logical operation on the extracted condition attributes to derive a rule.

2. The data processing method according to claim 1, further comprising a step of evaluating the strength of the rule by a first index comparing the number of possible examples that can satisfy the combination of m condition attribute values with the number of actual examples actually used for data processing.

3. The data processing method according to claim 1, further comprising a step of evaluating the strength of the rule by a second index comparing the number of actual examples having the combination of m (n > m) condition attribute values associated with the one decision attribute value with the number of actual examples having the combination of m (n > m) condition attribute values associated with a decision attribute value other than the one decision attribute value.

4. The data processing method according to claim 1, wherein data subjected to processing including:
a step of setting a predetermined order for each condition attribute value; and
a step of comparing the condition attribute values and indexing them in association with their consistency with the order,
are used as the examples.

5. The data processing method according to claim 1, wherein p (n > p) condition attribute values are specified, a combination of condition attribute values in which the remaining n − p condition attribute values are generalized is set, the number q of combinations of n condition attribute values that can correspond to the set generalized combination is derived, and the probability of occurrence of a specific combination of n condition attribute values corresponding to the set generalized combination is treated as 1/q.

6. The data processing method according to claim 5, further comprising a step of correcting the probability in consideration of background information existing in the real world.

7. A data processing program comprising steps according to the data processing method according to claim 1.

8. A data processing program recording medium storing a program having steps according to the data processing method according to claim 1.